Java: unrecognized character at the start of the first line - java

I have a few lines of code that read the content of a file in Java, basically using FileReader and BufferedReader. The lines are read correctly; however, the first character of the first line comes out as an undefined symbol. I have no idea where this symbol comes from, since the content of the input file is correct.
Here is the code:
FileReader readFile = new FileReader(chosenFile);
BufferedReader input = new BufferedReader(readFile);
String line;
while ((line = input.readLine()) != null) {
    System.out.println(line);
}

If it appears only on the first line, it is probably a BOM (Byte Order Mark). All modern text editors recognize it and do not present it as part of the text file. When you save the text file, there should be an option to save with or without it.
If you wish to handle the BOM marker in Java, see here: Reading UTF-8 - BOM marker
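If you just want to skip it by hand, here is a minimal sketch (assuming the file is UTF-8; input.txt is a placeholder name) that peeks at the first character and drops it only if it is the BOM, U+FEFF:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class SkipBom {
    public static void main(String[] args) throws IOException {
        try (BufferedReader input = new BufferedReader(new InputStreamReader(
                new FileInputStream("input.txt"), StandardCharsets.UTF_8))) {
            input.mark(1);
            // Consume the first character only if it is the BOM (U+FEFF).
            if (input.read() != '\uFEFF') {
                input.reset(); // no BOM: rewind so the first real character is kept
            }
            String line;
            while ((line = input.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}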

Related

Converting from windows-1256 to UTF-8 causes punctuation issue

I have an Arabic subtitle I've been trying to convert from SRT to VTT. The subtitles seem to use windows-1256, according to the character-encoding detector in ICU (Java). The final VTT file is in UTF-8.
The subtitle converts fine and it all looks right, except that the punctuation moves from the left side to the right side. I am using this subtitle on the Chromecast, so at first I thought it was an issue with the Chromecast, but even gedit on Linux has the issue. However, LibreOffice does not have the issue, nor does the console output in IntelliJ.
I wrote a simple piece of code to recreate the issue without actually converting from SRT to VTT, just by converting from windows-1256 to UTF-8.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("arabic sub.srt"), "windows-1256"));
String line = null;
BufferedWriter writer = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream("bad punctuation.srt"), "UTF-8"));
while ((line = reader.readLine()) != null) {
    System.out.println(line);
    writer.write(line);
    writer.write("\r\n");
}
writer.close();
reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("bad punctuation.srt"), "UTF-8"));
line = null;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
Here is the output from the IntelliJ console:
As you can see, the dot is on the left side, which I guess is correct.
Here is what gedit shows:
Most of the text is to the right, which I guess is correct, but the period is on the right, which I guess is wrong.
Here is LibreOffice:
This is mostly correct: the punctuation is to the left. However, the text is also on the left, and I guess it should be on the right.
This is the subtitle I'm testing https://www.opensubtitles.org/en/subtitles/5168225/game-of-thrones-fire-and-blood-ar
I also tried a different SRT that was originally encoded as UTF-8 and that one worked fine without issues. So my guess is that the conversion from windows-1256 is the issue.
So what is the issue with the way I'm re-encoding the file?
Thanks.
Edit: forgot a Chromecast picture.
As you can see the punctuation is on the wrong side.
EDIT: I just noticed that Linux chardet says it is MacCyrillic, not windows-1256, while the Java ICU library says windows-1256. Anyway, if I use MacCyrillic then the punctuation looks fine in gedit, but the text itself doesn't look right, as if it now consists of garbage characters.
Looking at the original subtitle file, I can tell for sure that it is badly formatted: the full stops appear before the text even when it is displayed with a left-to-right character set. I believe the correct character set is windows-1256, though.
The only way this would display correctly is if the punctuation at the beginning of the line were rendered LTR while the rest of the line is rendered RTL. You could try to force this by adding a Unicode left-to-right mark right after the punctuation (see the sketch below).
If you prefer to fix the original file instead, you would need to move any punctuation from the beginning of the line to the end. Brackets at the beginning of the line would also need to be reversed.
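A minimal sketch of the mark-insertion idea (the output file name and the set of punctuation characters to patch are assumptions):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixSubtitlePunctuation {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                     new FileInputStream("arabic sub.srt"), Charset.forName("windows-1256")));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("fixed punctuation.srt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // If the line starts with punctuation, insert a left-to-right
                // mark (U+200E) right after it so it renders LTR.
                if (!line.isEmpty() && ".!?()[]".indexOf(line.charAt(0)) >= 0) {
                    line = line.charAt(0) + "\u200E" + line.substring(1);
                }
                writer.write(line);
                writer.write("\r\n");
            }
        }
    }
}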
As the encoding has nothing to do with text orientation (LTR vs. RTL), I think you should leverage the Unicode marks specifically created for this purpose:
left-to-right mark: &lrm; or &#x200E; (U+200E)
right-to-left mark: &rlm; or &#x200F; (U+200F)
In a nutshell: a text file carries no information about text orientation; it's just a text file.
Cf. https://www.w3.org/TR/WCAG-TECHS/H34.html

UTF-16BE and UTF-16 issue in Java

I have a file which, when opened in Geany*, shows as UTF-16BE. If I try to convert this file in Java to a different encoding (let's say ISO-8859-1), assuming it is UTF-16BE, a question mark (?) always appears at the beginning of the newly created file. If I instead assume it is UTF-16 (which is not strictly true), the file gets converted fine, without any question mark at the beginning.
Can anybody clarify why this happens?
Below is a snippet of the code I used:
StringBuilder sb = new StringBuilder();
BufferedReader buff = new BufferedReader(new InputStreamReader(inputStream, utf16beCharset));
String line = null;
while ((line = buff.readLine()) != null) {
    sb.append(line);
    sb.append('\n');
}
String output = new String(sb.toString().getBytes(neededCharset), neededCharset);
System.out.println(output);
* Geany is a text editor
Your problem is the BOM (Byte Order Mark).
If you specify the character set as UTF-16, Java recognises the BOM, uses it to determine the byte order, and removes it after reading.
If you specify UTF-16BE, you tell Java the byte order explicitly, so it does not treat the BOM specially: the BOM is decoded as an ordinary U+FEFF character, which gets written to your target file and shows up as a question mark in ISO-8859-1.
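To see the difference concretely, a small sketch (utf16be-file.txt is a placeholder) that decodes the same bytes both ways and prints the first character:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomDemo {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("utf16be-file.txt"));
        // UTF-16 consumes a leading BOM to pick the byte order...
        String asUtf16 = new String(bytes, "UTF-16");
        // ...whereas UTF-16BE keeps it as an ordinary U+FEFF character.
        String asUtf16be = new String(bytes, "UTF-16BE");
        System.out.printf("UTF-16   first char: U+%04X%n", (int) asUtf16.charAt(0));
        System.out.printf("UTF-16BE first char: U+%04X%n", (int) asUtf16be.charAt(0));
    }
}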

Cannot find ZERO WIDTH NO-BREAK SPACE when reading file

I've run into a problem when trying to parse a JSON string that I grab from a file. My problem is that the zero-width no-break space character (U+FEFF) is at the beginning of my string when I read the file in, and I cannot get rid of it. I don't want to use a regex, because there may be other hidden characters with different code points.
Here's what I have:
StringBuilder content = new StringBuilder();
try {
    BufferedReader br = new BufferedReader(new FileReader("src/test/resources/getStuff.json"));
    String currentLine;
    while ((currentLine = br.readLine()) != null) {
        content.append(currentLine);
    }
    br.close();
} catch (Exception e) {
    Assert.fail();
}
And this is the start of the JSON file (it's too long to copy-paste the whole thing, but I have confirmed it is valid):
{"result":{"data":{"request":{"year":null,"timestamp":1413398641246,...
Here's what I've tried so far:
Copying the JSON file into Notepad++ and showing all characters
Copying the file into Notepad++ and converting to UTF-8 without BOM, and to ISO 8859-1
Opening the JSON file in other text editors such as Sublime and saving as UTF-8
Copying the JSON file to a .txt file and reading that in
Using Scanner instead of BufferedReader
In IntelliJ: View -> Active Editor -> Show Whitespaces
How can I read this file in without having the zero-width no-break space character at the beginning of the string?
0xEF 0xBB 0xBF is the UTF-8 BOM, 0xFE 0xFF is the UTF-16BE BOM, and 0xFF 0xFE is the UTF-16LE BOM. If U+FEFF exists at the front of your string, it means you created a UTF-encoded text file with a BOM. A UTF-16 BOM could appear as U+FEFF directly, whereas a UTF-8 BOM would only appear as U+FEFF if the BOM bytes themselves were decoded from UTF-8 to UTF-16 (meaning the reader detected the BOM but did not skip it). In fact, it is known that Java does not handle UTF-8 BOMs (see bugs JDK-4508058 and JDK-6378911).
If you read the FileReader documentation, it says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
You need to read the file content using a reader that recognizes charsets, preferably one that will read the BOM for you and adjust itself internally as needed. In the worst case, you could open the file yourself, read the first few bytes to detect whether a BOM is present, and then construct a reader using the appropriate charset to read the rest of the file. Here is an example using org.apache.commons.io.input.BOMInputStream that does exactly that:
(from https://stackoverflow.com/a/13988345/65863)
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // use reader
} finally {
    inputStream.close();
}
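Alternatively, since the decoded BOM always shows up as U+FEFF, a blunt fallback for the original code above is to strip it from the accumulated content after reading (this assumes the rest of the file decoded correctly):

// Drop a decoded BOM (U+FEFF) if it survived into the string.
if (content.length() > 0 && content.charAt(0) == '\uFEFF') {
    content.deleteCharAt(0);
}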

BufferedReader, reading chars into an EditText gives strange chars

OK, I am reading a .docx file via a BufferedReader and want to store the text in an EditText. The .docx is not in English but in a different language (Greek). I use:
File file = new File(file_Path);
try {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    StringBuilder text = new StringBuilder();
    while ((line = br.readLine()) != null) {
        text.append(line);
    }
    et1.setText(text);
} catch (IOException e) {
    e.printStackTrace();
}
And the result I get is this:
If the characters are in English, it works fine, but in my case they aren't. How can I fix this? Thanks a lot.
Ok, I am reading a .docx file via a BufferedReader
Well, that's the first problem. BufferedReader is for plain text files. .docx files are binary files in a specific format (assuming you mean the kind of file that Microsoft Word saves). You can't just read them like text files. Open the file in Notepad (not WordPad) and you'll see what I mean.
You might want to look at Apache POI.
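For instance, a minimal sketch using POI's XWPF classes (document.docx is a placeholder; assumes the poi-ooxml dependency is on the classpath):

import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadDocx {
    public static void main(String[] args) throws IOException {
        // XWPFDocument parses the .docx container; the extractor pulls out plain text.
        try (FileInputStream in = new FileInputStream("document.docx");
             XWPFDocument doc = new XWPFDocument(in);
             XWPFWordExtractor extractor = new XWPFWordExtractor(doc)) {
            System.out.println(extractor.getText());
        }
    }
}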
From comments:
Testing to read a .txt file with the same text gave the same results too
That's probably due to using the wrong encoding. FileReader always uses the platform default encoding, which is annoying. Assuming you're using Java 7 or higher, you'd be better off with Files.newBufferedReader:
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
...
}
Adjust the charset to match the one you used when saving your text file, of course - if you have the option of using UTF-8, that's a pretty good choice. (Aside from anything else, pretty much everything can handle UTF-8.)

Why am I getting ?? when I try to read the ä character from a text file in Java?

I am trying to read text from a text file. It contains some special characters like å, ä and ö. When I build a string from it and print that string, I get ?? in place of these special characters. I am using the following code:
File fileDir = new File("files/myfile.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(fileDir), "UTF8"));
String strLine;
while ((strLine = br.readLine()) != null) {
    System.out.println("strLine: " + strLine);
}
Can anybody tell me what the problem is? I want strLine to show and save å, ä and ö as they appear in the text file. Thanks in advance.
The problem might not be with the file but with the console you are printing to. I suggest you follow these steps:
Make sure the file you are reading is encoded in UTF-8.
Make sure the console you are printing to uses an encoding/charset that can display these characters.
Finally, the article Unicode - How to get the characters right? is a must-read.
Check here for the list of encodings supported by Java.
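On the console side, one hedged workaround is to wrap System.out in a PrintStream with an explicit charset; this only helps if the terminal itself renders UTF-8:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode console output as UTF-8 instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("strLine: åäö");
    }
}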
The most common single-byte encoding that includes non-ASCII characters is ISO8859_1; maybe your file is in that encoding, and you should specify it for your InputStreamReader instead.
