Converting from windows-1256 to UTF-8 causes punctuation issue - java

I have an Arabic subtitle I'm trying to convert from SRT to VTT. The subtitle seems to be encoded in windows-1256 according to the ICU character encoding detector (Java). The final VTT file is in UTF-8.
The subtitle converts fine and everything looks right except that the punctuation moves from the left side to the right side. I am using this subtitle on a Chromecast, so at first I thought it was a Chromecast issue, but even gedit on Linux shows the problem. However, LibreOffice does not, and neither does the console output in IntelliJ.
I wrote a simple piece of code to recreate the issue without actually converting from SRT to VTT, just by converting from windows-1256 to UTF-8.
import java.io.*;

// Convert: read as windows-1256, write back out as UTF-8.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("arabic sub.srt"), "windows-1256"));
BufferedWriter writer = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream("bad punctuation.srt"), "UTF-8"));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
    writer.write(line);
    writer.write("\r\n");
}
reader.close();
writer.close();

// Read the re-encoded file back as UTF-8 to verify the result.
reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("bad punctuation.srt"), "UTF-8"));
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
reader.close();
Here is the output from the IntelliJ console:
As you can see, the dot is on the left side, which I believe is correct.
Here is what gedit shows:
Most of the text is on the right, which I believe is correct, but the period is also on the right, which I believe is wrong.
Here is LibreOffice:
This is mostly correct: the punctuation is on the left, but the text is also on the left, and I believe it should be on the right.
This is the subtitle I'm testing https://www.opensubtitles.org/en/subtitles/5168225/game-of-thrones-fire-and-blood-ar
I also tried a different SRT that was originally encoded in UTF-8, and that one worked without issues. So my guess is that the conversion from windows-1256 is the problem.
So what is the issue with the way I'm re-encoding the file?
Thanks.
Edit: I forgot a Chromecast picture.
As you can see, the punctuation is on the wrong side.
EDIT: I just noticed that chardet on Linux says the file is MacCyrillic, not windows-1256, while the Java ICU library says windows-1256. In any case, if I use MacCyrillic the punctuation looks fine in gedit, but the text itself turns into garbage characters.

Looking at the original subtitle file, I can tell for sure that it is badly formatted. The full stops appear before the text even when the file is displayed with a left-to-right character set. I believe the correct character set is windows-1256, though.
The only way this would display correctly is if the punctuation at the beginning of the line were rendered LTR while the rest of the line is rendered RTL. You could try to force this by adding a Unicode left-to-right mark (U+200E) right after the punctuation.
If you prefer to fix the original file instead, you would need to move any punctuation from the beginning of the line to the end. The brackets at the beginning of the line would also need to be reversed.
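If you want to try the left-to-right-mark approach in the conversion loop, something like this minimal sketch could work; the helper name and the set of punctuation characters it checks are my own assumptions, not from the original post:

// Hypothetical helper: insert U+200E (left-to-right mark) after any
// run of punctuation at the start of a line, so the punctuation renders
// LTR while the rest of the Arabic line stays RTL.
static String fixLeadingPunctuation(String line) {
    final String LRM = "\u200E"; // U+200E left-to-right mark
    int i = 0;
    while (i < line.length() && ".,!?:;-".indexOf(line.charAt(i)) >= 0) {
        i++;
    }
    // Insert the mark right after the leading punctuation, if any.
    return i == 0 ? line : line.substring(0, i) + LRM + line.substring(i);
}

You would apply it to each line between reader.readLine() and writer.write(line) in the question's conversion loop.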

As the encoding has nothing to do with text orientation (LTR vs. RTL), I think you should use the Unicode marks created specifically for this purpose.
left-to-right mark: &lrm; or &#x200E; (U+200E)
right-to-left mark: &rlm; or &#x200F; (U+200F)
In a nutshell: the text file does not carry any information about text orientation; it's just a text file.
Cf. https://www.w3.org/TR/WCAG-TECHS/H34.html
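For Java rather than HTML, the same marks can be embedded as Unicode escapes; a tiny sketch (the sample Arabic word is my own placeholder):

String LRM = "\u200E"; // U+200E left-to-right mark (invisible)
String RLM = "\u200F"; // U+200F right-to-left mark (invisible)
// Both marks are invisible; they only influence how adjacent "neutral"
// characters such as punctuation are ordered when rendered.
System.out.println("." + LRM + "\u0633\u0644\u0627\u0645"); // dot kept on the LTR side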

Related

character serialization, BufferedReader in Java

If I have a delimited text file with apostrophes in it, like the ’ in:
BB;Art’s Tavern;6487 Western Ave., Glen Arbor, MI 49636;
What do I need to do to get those parsed correctly through a BufferedReader in Java?
The code I'm currently using to open the file for reading in an Android application is:
StringBuffer buf = new StringBuffer();
InputStream is = context.getResources().openRawResource(R.raw.lvpa);
BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
Currently the apostrophes are returned as a question mark in a black box (the Unicode replacement character).
The contents of the file are then parsed into a model.
Any help would be appreciated :)
Thanks
The file you are reading is not encoded in UTF-8. You need to know which encoding your file is in before you attempt to read it. If possible, open it in whatever text editor you use, examine it, and save it off in UTF-8, then try reading it again. (Some text editors will give you the option of setting the encoding when you save the file.)
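A question mark in a black box usually means the bytes were not valid UTF-8, and since ’ (U+2019) is the single byte 0x92 in windows-1252, that charset is a plausible guess to try first; a minimal sketch (the charset is an assumption, not a certainty):

// Guess: try windows-1252, where ’ (U+2019) is the single byte 0x92.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(is, "windows-1252"));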

Error when reading non-English language character from file

I am building an app where users have to guess a secret word. I have *.txt files in the assets folder. The problem is that the words are in Albanian. Our language uses letters like "ë" and "ç", so whenever I read a word containing any of those characters from the file I get some wicked symbol, and I cannot use string.compare() for these characters. I have tried many options with UTF-8 and changed Eclipse settings, but I still get the same error.
I would really appreciate it if someone has any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
    aBuffer += aDataRow + "\n";
    stringList.add(aDataRow);
}
Otherwise the code works fine, except for the characters mentioned.
It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is ISO/IEC 8859-1 or ISO/IEC 8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)
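As a sketch of the "configuration property" option, the charset name could come from a system property with a sensible default; the property name here is invented purely for illustration:

// Hypothetical property name "app.input.encoding"; falls back to UTF-8.
String encoding = System.getProperty("app.input.encoding", "UTF-8");
reader = new BufferedReader(new InputStreamReader(fins, encoding));

You could then override it without recompiling, e.g. java -Dapp.input.encoding=ISO-8859-1 ...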
You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're on Java 7 or later.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).
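If you would rather guess programmatically, the ICU4J library (the same detector mentioned in the first question above) can do it; a minimal sketch, assuming the file name from the question:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.file.Files;
import java.nio.file.Paths;

// Feed the raw bytes to ICU's heuristic detector and print its best guess.
byte[] data = Files.readAllBytes(Paths.get("fjalet.txt"));
CharsetDetector detector = new CharsetDetector();
detector.setText(data);
CharsetMatch match = detector.detect();
System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");

Like any detector, it returns a guess with a confidence score, not a guarantee.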
You need to know the encoding of the file.
The InputStream class reads the file as binary. Although you can interpret the input as characters, that is implicit guessing, which may be wrong.
The InputStreamReader class converts binary to chars, but it needs to know the character set.
You should use the constructor variant that takes a character set, e.g. new InputStreamReader(fins, "UTF-8").
UPDATE
Don't assume you have a UTF-8 encoded file; that may be wrong. Here in Russia we have encodings such as CP866, WIN1251 and KOI8, which all differ from UTF-8. You probably have some popular Albanian text-file encoding. Check your OS settings for a hint.

Why am I getting ?? when I try to read the ä character from a text file in Java?

I am trying to read text from a text file. There are some special characters like å, ä and ö. When I build a string and print it out, I get ?? for these special characters. I am using the following code:
File fileDir = new File("files/myfile.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(fileDir), "UTF8"));
String strLine;
while ((strLine = br.readLine()) != null) {
    System.out.println("strLine: " + strLine);
}
Can anybody tell me what the problem is? I want strLine to show and save å, ä and ö as they are in the text file. Thanks in advance.
The problem might not be with the file but with the console where you are trying to print. I suggest you follow these steps:
Make sure the file you are reading is encoded in UTF-8.
Make sure the console you are printing to has the proper encoding/charset to display these characters
Finally, this article Unicode - How to get characters right? is a must read.
Check here for the list of encodings supported by Java.
The most common single-byte encoding that includes non-ASCII characters is ISO8859_1; maybe your file is in that encoding, and you should specify it for your InputStreamReader.
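If that guess is right, the fix is just to name that charset when wrapping the stream (ISO-8859-1 being the guess above, not a known fact):

// Try ISO-8859-1 in place of UTF-8 for a single-byte-encoded file.
BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(fileDir), "ISO-8859-1"));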

Reading Arabic chars from text file

I had finished a project in which I read from a text file written with Notepad.
The characters in my text file are in Arabic, and the file encoding is UTF-8.
When launching my project inside NetBeans (7.0.1) everything seemed to be OK, but when I built the project as a .jar file the characters were displayed this way: ÇáãæÇÞÚááÊØæíÑ.
How can I solve this problem, please?
Most likely you are using the JVM default character encoding somewhere. If you are 100% sure your file is encoded in UTF-8, make sure you explicitly specify UTF-8 when reading as well. For example, this piece of code is broken:
new FileReader("file.txt")
because it uses the JVM default character encoding, which you might not have control over; apparently NetBeans uses UTF-8 while your operating system defines something different. Note that this makes the FileReader class completely useless if you want your code to be portable.
Instead use the following code snippet:
new InputStreamReader(new FileInputStream("file.txt"), "UTF-8");
You are not providing your code, but this should give you a general idea of how it should be implemented.
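As a side note, on Java 11 and later FileReader gained a constructor that accepts a Charset, so the portable version can also be written directly:

import java.io.FileReader;
import java.nio.charset.StandardCharsets;

// Java 11+: FileReader no longer forces the platform default charset.
FileReader reader = new FileReader("file.txt", StandardCharsets.UTF_8);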
Maybe this example will help a little. I will try to print the content of a UTF-8 file to the IDE console and to a system console that is encoded in "Cp852".
My d:\data.txt contains ąźżćąś adsfasdf
Let's check this code:
// I will read chars using UTF-8 encoding
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("d:\\data.txt"), "utf-8"));
// and write to the console using Cp852 encoding
// ("Cp852" is the encoding used by my console in Windows 7)
PrintWriter out = new PrintWriter(
        new OutputStreamWriter(System.out, "Cp852"), true);
// OK, let's read data from the file
String line;
while ((line = in.readLine()) != null) {
    // here I use the IDE encoding
    System.out.println(line);
    // here I print data using Cp852 encoding
    out.println(line);
}
When I run it in Eclipse, the output will be
ąźżćąś adsfasdf
Ą«ľ†Ą? adsfasdf
but in the system console the Cp852-encoded line is displayed correctly.

Java linux character encoding issue

I'm facing a character encoding issue on Linux. I'm retrieving content from Amazon S3, which was saved using UTF-8 encoding. The content is in Chinese and I'm able to see it correctly in the browser.
I'm using the Amazon SDK to retrieve the content and make some updates to it. Here's the code I'm using:
StringBuilder builder = new StringBuilder();
S3Object object = client.getObject(new GetObjectRequest(bucketName, key));
BufferedReader reader = new BufferedReader(
        new InputStreamReader(object.getObjectContent(), "utf-8"));
while (true) {
    String line = reader.readLine();
    if (line == null)
        break;
    builder.append(line);
}
This piece of code works fine in a Windows environment: I was able to update the content and save it back without messing up any Chinese characters.
But it acts differently in a Linux environment. The code is unable to decode the characters properly; the Chinese characters are rendered as ???.
I'm not sure what's going wrong here. Any pointers will be appreciated.
-Thanks
The default charset is different for the two operating systems you're using.
To start off, you can confirm the difference by printing out the default charset:
Charset.defaultCharset().name()
Somewhere in your code, I think this default charset is being used for some String conversion. The correct procedure would be to track that down and specify UTF-8 explicitly.
Without seeing that code, I can only suggest the 'cheating' way: set the default charset explicitly, near the beginning of your code or at Java startup. See here for changing the default charset: Setting the default Java character encoding?
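For a quick diagnosis, here is a minimal sketch that prints the default charset on each machine, together with the startup-flag route mentioned above:

import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // Typically prints UTF-8 on modern Linux and windows-1252 on Windows.
        System.out.println(Charset.defaultCharset().name());
    }
}
// Launch with an explicit default, e.g.:
//   java -Dfile.encoding=UTF-8 CharsetCheck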
HTH
