How to parse "SecciÃ³n" to "Sección"? (string accents encoding issue)


I have a string with the value "SecciÃ³n".
I need to decode it as UTF-8, so that the string becomes "Sección".
I tried line = new String(line.getBytes("UTF-8"), "UTF-8"); but this does not work.
Edit
I'm reading the string with this method:
public static String loadLine(InputStream is) {
if (is == null)
return null;
final short TAM_LINE = 256;
String line;
char[] buffer = new char[TAM_LINE];
short i;
int ch;
try {
line = "";
i = 0;
do {
ch = is.read();
if ((ch != '\n') && (ch != -1)) {
buffer[i++] = (char)(ch & 0xFF);
if (i >= TAM_LINE) {
line += new String(buffer, 0, i);
i = 0;
}
}
} while ((ch != '\n') && (ch != -1));
// If no character was read at all, return null
if (ch == -1 && i == 0)
return null;
// Append the last chunk of the line that was read
line += new String(buffer, 0, i);
} catch (IOException e) {
e.printStackTrace();
return null;
}
return line;
}

The character ó is encoded as 0xc3 0xb3 in UTF-8. It appears that whichever program read that UTF-8-encoded string in the first place read it assuming the wrong encoding, for example windows-1252, where 0xc3 encodes Ã and 0xb3 encodes ³.
In your case, your edit shows (as far as I can tell; I don't know Java) that you're reading the input byte by byte and building the string one character at a time, one from each byte. This breaks for UTF-8, which uses multiple bytes to encode characters such as ó.
You should read the input into a byte array first, then build a String using the correct encoding:
line = new String(byteArray, "UTF-8");
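The advice above can be sketched as follows. This is a minimal illustration, not the asker's final code: the class name Utf8LineReader and the repair helper are assumptions made for the example. loadLine collects the raw bytes of one line and decodes them in a single step. repair undoes the damage for a string that was already built one char per byte, which works only when each char still holds the original byte value, as it does with the (char)(ch & 0xFF) cast in the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class Utf8LineReader {

    /** Reads bytes up to the next '\n' (or end of stream) and decodes them as UTF-8. */
    public static String loadLine(InputStream is) throws IOException {
        if (is == null) {
            return null;
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int b;
        while ((b = is.read()) != -1 && b != '\n') {
            bytes.write(b); // keep the raw byte; do not cast it to char
        }
        if (b == -1 && bytes.size() == 0) {
            return null; // end of stream and nothing was read
        }
        // Decode the whole line at once so multi-byte sequences stay intact.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Repairs a string that was built one char per byte from UTF-8 input. */
    public static String repair(String mangled) {
        // ISO-8859-1 maps chars 0..255 back to the same byte values.
        byte[] original = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }
}
```

The attempt in the question, new String(line.getBytes("UTF-8"), "UTF-8"), encodes and decodes with the same charset, so it returns the string unchanged.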

Related

how to read a byte from a text file as an actual byte in hex instead of characters?

I'm really unsure how to phrase my question, but here is the situation.
I have data in a text file, for example: 0x7B 0x01 0x2C 0x00 0x00 0xEA. These values are hex representations of ASCII symbols. I need to read this data and be able to parse and translate it accordingly.
My problem so far is that I've tried using a Scanner via something like scan.getNextByte() and was directed towards the post: [using java.util.Scanner to read a file byte by byte]
After changing the file input to a FileInputStream, I found that fis.read() returns 48, the ASCII value of the character 0 in 0x7B.
I am looking for a way to interpret the data being read in as hex, so that 0x7B will be equivalent to "{" in ASCII.
Hope this is clear enough to all,
Thanks,

Since your bytes are delimited by spaces, you can just use a Scanner to read them:
try (Scanner scanner = new Scanner(Paths.get(filename))) {
while (scanner.hasNext()) {
int byteValue = Integer.decode(scanner.next());
// Process byteValue ...
}
}
I encourage you to read about the Integer.decode method and the Scanner class.
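As a rough sketch of how Integer.decode can be used here without a Scanner, the following helper turns a whole line of space-separated tokens into a byte array. The class name HexTokens and the range check are assumptions added for illustration; they are not part of the answer above.

```java
class HexTokens {

    /** Parses tokens such as "0x7B 0x01" into the bytes they describe. */
    public static byte[] parse(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new byte[0];
        }
        String[] tokens = trimmed.split("\\s+");
        byte[] out = new byte[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // Integer.decode understands the "0x" prefix.
            int value = Integer.decode(tokens[i]);
            if (value < 0 || value > 0xFF) {
                throw new NumberFormatException("Not a byte value: " + tokens[i]);
            }
            out[i] = (byte) value;
        }
        return out;
    }
}
```

The resulting array can then be decoded as text, for example with new String(bytes, StandardCharsets.US_ASCII), which turns the byte 0x7B into "{".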

If you need a scalable solution, try writing your own InputStream.
Basic example:
class ByteStringInputStream extends InputStream {
private final InputStream inputStream;
public ByteStringInputStream(InputStream inputStream) {
this.inputStream = inputStream;
}
private boolean isHexSymbol(char c) {
return (c >= '0' && c <= '9')
|| (c >= 'A' && c <= 'F')
|| (c == 'x');
}
@Override
public int read() throws IOException {
try {
int readed;
char[] buffer = new char[4];
int bufferIndex = 0;
// Collect one 4-character token such as "0x7B"; the fifth read consumes the delimiter.
while ((readed = inputStream.read()) != -1 && bufferIndex < 4) {
if (isHexSymbol((char) readed)) {
buffer[bufferIndex] = (char) readed;
}
bufferIndex++;
}
// Nothing left to read: signal end of stream, as the InputStream contract requires.
if (readed == -1 && bufferIndex == 0) {
return -1;
}
String stringBuffer = new String(buffer);
if (!stringBuffer.matches("^0x[0-9A-F]{2}$")) {
throw new NumberFormatException(stringBuffer);
}
return Integer.decode(stringBuffer);
} catch (Exception ex) {
inputStream.close();
throw new IOException("<YOUR_EXCEPTION_TEXT_HERE>", ex);
}
}
}
Usage:
ByteStringInputStream bsis = new ByteStringInputStream(new BufferedInputStream(System.in));
//you can use any InputStream instead
int value;
while ((value = bsis.read()) != -1) {
System.out.println(value);
}
Demo:
>0x7B 0x01 0x2C 0x00 0x00 0xEA
123
1
44
0
0
234

If you're in a position to use external libraries, the Apache Commons Codec library has a Hex utility class that can turn a character-array representation of hex bytes into a byte array:
final String hexChars = "0x48 0x45 0x4C 0x4C 0x4F";
// to get "48454C4C4F"
final String plainHexChars = hexChars.replaceAll("(0x| )", "");
final byte[] hexBytes = Hex.decodeHex(plainHexChars.toCharArray());
final String decodedBytes = new String(hexBytes, Charset.forName("UTF-8"));

Normalize Text - Read Each Character and remove spaces - Bad Encoding

I am trying to write a program that normalizes my text: it removes repeated spaces, prints the other characters from the original file, and adds spaces and start and end symbols.
After the conversion, when I write the txt file and open it, I see this content:
numa situaã § ã £ o de emergãªncia mã © dica
As you can see, there are some weird characters that I don't want. Maybe it's because of the encoding?
This is a text in my language, Portuguese.
Here is my code, how can I fix it?
public static void main(String[] args) throws IOException {
Charset encoding = Charset.defaultCharset();
InputStream in = new FileInputStream(new File("data.txt"));
Reader reader = new InputStreamReader(in, encoding);
Reader buffer = new BufferedReader(reader);
StringBuilder normalizedLanguage = new StringBuilder("<");
int r;
while ((r = buffer.read()) != -1) {
char ch = (char) r;
boolean newline = false;
boolean hasLetterBefore = false;
boolean hasLetterAfter = false;
char symbol = '-';
int lines = 0;
if (newline)
{
normalizedLanguage.append("\n<");
}
if (ch == '\r' || ch == '\n' )
{
lines++;
normalizedLanguage.append(">");
newline = true;
hasLetterBefore = false;
}
else if (Character.isLetterOrDigit(ch))
{
if (hasLetterBefore == true)
{
normalizedLanguage.append(Character.toString(symbol) + Character.toString(Character.toLowerCase(ch)));
}else{
normalizedLanguage.append(Character.toString(Character.toLowerCase(ch)));
}
newline = false;
hasLetterBefore = true;
}
else if (ch == ' ')
{
normalizedLanguage.append(Character.toString(ch));
newline = false;
hasLetterBefore = false;
}
else if (ch == '\t')
{
System.out.println("Tab detected: " + ch);
newline = false;
hasLetterBefore = false;
}
else
{
//Símbolos, entre outros..
if (!hasLetterBefore)
{
normalizedLanguage.append(" " + Character.toString(ch) + " ");
}
else
{
symbol = ch;
}
newline = false;
}
}
String normalizedLanguageString = normalizedLanguage.toString().trim().replaceAll(" +", " ");
PrintWriter out = new PrintWriter("data_after.txt");
out.println(normalizedLanguageString);
out.close();
buffer.close();
reader.close();
in.close();
}
Thank you very much in advance ;)

The problem got solved using another Charset Encoding :)
Change this line:
Charset encoding = Charset.defaultCharset();
To:
Charset encoding = Charset.forName("UTF8");
Thank you very much anyways
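Note that the fix above only covers the reading side. The PrintWriter in the question is created without a charset, so it still encodes with the platform default. A minimal sketch that names UTF-8 on both sides could look like this; the class name Utf8Copy and the lower-casing step are placeholders for the real normalization logic, chosen only for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8Copy {

    /** Reads source as UTF-8, lower-cases each char, and writes target as UTF-8. */
    public static void lowerCaseCopy(Path source, Path target) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            int r;
            while ((r = reader.read()) != -1) {
                // Stand-in for the per-character normalization in the question.
                writer.write(Character.toLowerCase((char) r));
            }
        } // try-with-resources closes both streams, even on failure
    }
}
```

With both charsets stated explicitly, the result no longer depends on the machine's default encoding.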

Reading binary file byte by byte

I've been doing research on a Java problem I have, with no success. I've read a whole bunch of similar questions here on StackOverflow, but the solutions just don't seem to work as expected.
I'm trying to read a binary file byte by byte.
I've used:
while ((data = inputStream.read()) != -1)
loops...
for (int i = 0; i < bFile.length; i++) {
loops...
But I only get empty or blank output. The actual content of the file I'm trying to read is as follows:
¬í sr assignment6.PetI¿Z8kyQŸ I ageD weightL namet Ljava/lang/String;xp > #4 t andysq ~ #bÀ t simbasq ~ #I t wolletjiesq ~
#$ t rakker
I'm merely trying to read it byte for byte and feed it to a character array with the following line:
char[] charArray = Character.toChars(byteValue);
Bytevalue here represents an int of the byte it's reading.
What is going wrong where?

Since Java 7 there is no need to read byte by byte; there are two utility functions in Files:
Path path = Paths.get("C:/temp/test.txt");
// Load as binary:
byte[] bytes = Files.readAllBytes(path);
String asText = new String(bytes, StandardCharsets.ISO_8859_1);
// Load as text, with some Charset:
List<String> lines = Files.readAllLines(path, StandardCharsets.ISO_8859_1);
As you want to read binary data, one would use readAllBytes.
String and char is for text. As opposed to many other programming languages, this means Unicode, so all scripts of the world may be combined. char is 16 bit as opposed to the 8 bit byte.
For pure ASCII, the 7 bit subset of Unicode / UTF-8, byte and char values are identical.
Then you might have done the following (low-quality code):
int fileLength = (int) Files.size(path);
char[] chars = new char[fileLength];
int i = 0;
int data;
while ((data = inputStream.read()) != -1) {
chars[i] = (char) data; // data actually being a byte
++i;
}
inputStream.close();
String text = new String(chars);
System.out.println(Arrays.toString(chars));
The problem you had probably concerned the unwieldy fixed-size arrays in Java, and the fact that a char[] is still not a String.
For binary usage, as you seem to be reading serialized data, you might like to dump the file:
int i = 0;
int data;
while ((data = inputStream.read()) != -1) {
char ch = 32 <= data && data < 127 ? (char) data : ' ';
System.out.printf("[%06d] %02x %c%n", i, data, ch);
++i;
}
Dumping file position, hex value and char value.

Here is a simple example:
public class CopyBytes {
public static void main(String[] args) throws IOException {
FileInputStream in = null;
FileOutputStream out = null;
try {
in = new FileInputStream("xanadu.txt");
out = new FileOutputStream("outagain.txt");
int c;
while ((c = in.read()) != -1) {
out.write(c);
}
} finally {
if (in != null) {
in.close();
}
if (out != null) {
out.close();
}
}
}
}
If you want to read text(characters) - use Readers, if you want to read bytes - use Streams

Why not use Apache Commons IO:
byte[] bytes = IOUtils.toByteArray(inputStream);
Then you can convert it to char:
String str = new String(bytes);
char[] chars = str.toCharArray();
Or, like you did, one value at a time (Character.toChars takes a single code point, not an array):
char[] charArray = Character.toChars(bytes[0] & 0xFF);
To deserialize objects:
List<Object> results = new ArrayList<Object>();
FileInputStream fis = new FileInputStream("your_file.dat");
ObjectInputStream ois = new ObjectInputStream(fis);
try {
while (true) {
results.add(ois.readObject());
}
} catch (EOFException e) {
// End of the stream: every object has been read.
} finally {
ois.close();
}

Edit:
Use file.length() for the array size, and make a byte array. Then call inputStream.read(b).
Edit again: if you want characters, use new InputStreamReader(new FileInputStream(file), charset); it even lets you specify the charset.
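A sketch of the first suggestion is below, with one adjustment: a single InputStream.read(b) call may return before the array is full, so the example uses DataInputStream.readFully instead. The names WholeFileReader, readAll and toHex are made up for the illustration.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

class WholeFileReader {

    /** Reads the complete file into a byte array sized from file.length(). */
    public static byte[] readAll(File file) throws IOException {
        long length = file.length();
        if (length > Integer.MAX_VALUE) {
            throw new IOException("File too large for one array: " + file);
        }
        byte[] bytes = new byte[(int) length];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            in.readFully(bytes); // keeps reading until the array is full
        }
        return bytes;
    }

    /** Formats bytes as two-digit hex values separated by spaces. */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

Printing the bytes as hex rather than as characters avoids the blank output described in the question. The file shown there starts with ¬í, which in ISO-8859-1 corresponds to the bytes ac ed, the header of a Java serialization stream.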

Java: reading utf-8 file page by page using FileInputStream

I need some code that will allow me to read one page at a time from a UTF-8 file.
I've used the code;
File fileDir = new File("DIRECTORY OF FILE");
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(fileDir), "UTF8"));
String str;
while ((str = in.readLine()) != null) {
System.out.println(str);
}
in.close();
}
After surrounding it with a try catch block it runs but outputs the entire file!
Is there a way to amend this code to just display ONE PAGE of text at a time?
The file is in UTF-8 format, and after viewing it in Notepad++ I can see that the file contains FF (form feed) characters to denote the next page.

You will need to look for the form feed character by comparing to 0x0C.
For example:
int c = in.read(); // must be an int: a char can never equal -1
while ( c != -1 ) {
if ( c == 0x0C ) {
// form feed
} else {
// handle displayable character
}
c = in.read();
}
EDIT added an example of using a Scanner, as suggested by Boris
Scanner s = new Scanner(new File("a.txt")).useDelimiter("\u000C");
while ( s.hasNext() ) {
String str = s.next();
System.out.println( str );
}

If the file is valid UTF-8, that is, the pages are split by U+00FF, aka (char) 0xFF, aka "\u00FF", 'ÿ', then a buffered reader will do. If it is a raw byte 0xFF there would be a problem, as the byte 0xFF never occurs in valid UTF-8, so the decoder treats it as malformed input.
int soughtPageno = ...; // Counted from 0
int currentPageno = 0;
try (BufferedReader in = new BufferedReader(new InputStreamReader(
new FileInputStream(fileDir), StandardCharsets.UTF_8))) {
String str;
while ((str = in.readLine()) != null && currentPageno <= soughtPageno) {
int pos;
while ((pos = str.indexOf('\u00FF')) >= 0) {
if (currentPageno == soughtPageno) {
System.out.println(str.substring(0, pos));
++currentPageno;
break;
}
++currentPageno;
str = str.substring(pos + 1);
}
if (currentPageno == soughtPageno) {
System.out.println(str);
}
}
}
For a byte 0xFF (wrong, hacked UTF-8) use a wrapping InputStream between FileInputStream and the reader:
class PageInputStream extends InputStream {
private final InputStream in;
private int pageno;
private boolean eof = false;
PageInputStream(InputStream in, int pageno) {
this.in = in;
this.pageno = pageno;
}
@Override
public int read() throws IOException {
if (eof) {
return -1;
}
// Skip whole pages until the requested one starts.
while (pageno > 0) {
int c = in.read();
if (c == 0xFF) {
--pageno;
} else if (c == -1) {
eof = true;
in.close();
return -1;
}
}
int c = in.read();
if (c == 0xFF) {
// End of the requested page: report end of stream.
c = -1;
eof = true;
in.close();
}
return c;
}
}
Take this as an example, a bit more work is to be done.

You can use a Regex to detect form-feed (page break) characters. Try something like this:
File fileDir = new File("DIRECTORY OF FILE");
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(fileDir), "UTF8"));
String str;
Regex pageBreak = new Regex("(^.*)(\f)(.*$)")
while ((str = in.readLine()) != null) {
Match match = pageBreak.Match(str);
bool pageBreakFound = match.Success;
if(pageBreakFound){
String textBeforeLineBreak = match.Groups[1].Value;
//Group[2] will contain the form feed character
//Group[3] will contain the text after the form feed character
//Do whatever logic you want now that you know you hit a page boundary
}
System.out.println(str);
}
in.close();
The parentheses around portions of the Regex denote capture groups, which get recorded in the Match object. The \f matches the form feed character.
Edited Apologies, for some reason I read C# instead of Java, but the core concept is the same. Here's the Regex documentation for Java: http://docs.oracle.com/javase/tutorial/essential/regex/
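For reference, a Java rendering of the same idea might look like the sketch below, using java.util.regex.Pattern and Matcher with the same three capture groups. The class and method names are invented for the example. Because the first group is greedy, the match splits at the last form feed when a line contains several.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FormFeedRegexDemo {

    // Group 1: text before the form feed, group 2: the form feed, group 3: text after.
    private static final Pattern PAGE_BREAK = Pattern.compile("(^.*)(\\f)(.*$)");

    /** Returns the text before the page break, or null if the line has none. */
    public static String textBeforePageBreak(String line) {
        Matcher m = PAGE_BREAK.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    /** Returns the text after the page break, or null if the line has none. */
    public static String textAfterPageBreak(String line) {
        Matcher m = PAGE_BREAK.matcher(line);
        return m.matches() ? m.group(3) : null;
    }
}
```

Inside the readLine() loop, a non-null result from textBeforePageBreak(str) marks the line on which the current page ends.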

Reading an inputStream all at once [duplicate]

I have developed a j2me application that connects to my webhosting server through sockets. I read responses from the server using my own extended lineReader class that extends the basic InputStreamReader. If the server sends 5 lines of replies, the syntax to read the server replies line by line is:
line=input.readLine();
line = line + "\n" + input.readLine();
line = line + "\n" + input.readLine();
line = line + "\n" + input.readLine();
line = line + "\n" + input.readLine();
In this case, I can write this because I know that there is a fixed number of replies. But if I don't know the number of lines and want to read the whole InputStream at once, how should I modify the current readLine() function? Here's the code for the function:
public String readLine() throws IOException {
StringBuffer sb = new StringBuffer();
int c;
while ((c = read()) > 0 && c != '\n' && c != '\r' && c != -1) {
sb.append((char)c);
}
//By now, buf is empty.
if (c == '\r') {
//Dos, or Mac line ending?
c = super.read();
if (c != '\n' && c != -1) {
//Push it back into the 'buffer'
buf = (char) c;
readAhead = true;
}
}
return sb.toString();
}

What about Apache Commons IOUtils.readLines()?
Get the contents of an InputStream as a list of Strings, one entry per line, using the default character encoding of the platform.
Or if you just want a single string, use IOUtils.toString().
Get the contents of an InputStream as a String using the default character encoding of the platform.
[update] Per the comment about whether this is available on J2ME: I admit I missed that condition. However, the IOUtils source is pretty light on dependencies, so perhaps the code could be used directly.

If I understand you correctly, you can use a simple loop:
StringBuffer sb = new StringBuffer();
String s;
while ((s = input.readLine()) != null)
sb.append(s);
Add a counter in your loop, and if your counter = 0, return null:
int counter = 0;
while ((c = read()) > 0 && c != '\n' && c != '\r' && c != -1) {
sb.append((char)c);
counter++;
}
if (counter == 0)
return null;
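If the goal is simply to get everything the server sent as one string, an alternative to looping over readLine() is to drain the reader in chunks until it reports end of stream. The sketch below is written against the standard java.io.Reader API; the class name ReadAll and the chunk size are arbitrary choices for the example, and it has not been checked on a J2ME device.

```java
import java.io.IOException;
import java.io.Reader;

class ReadAll {

    /** Reads every remaining character from the reader, line breaks included. */
    public static String readAll(Reader reader) throws IOException {
        StringBuffer sb = new StringBuffer();
        char[] chunk = new char[512];
        int n;
        // read(...) returns -1 only when the other side has closed the stream.
        while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
            sb.append(chunk, 0, n);
        }
        return sb.toString();
    }
}
```

Unlike the readLine() loop, this keeps the original line terminators. It also blocks until the server closes the connection, so it only fits protocols where the end of the reply is marked that way.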

Specifically for a web server, reading until the first empty line:
String temp;
StringBuffer sb = new StringBuffer();
while ((temp = input.readLine()) != null && !temp.equals("")) {
sb.append(temp);
}
