Character encoding - Java

I get an HTML file which I need to read and parse. The file can be in plain English, Japanese, or any other language, with the character encoding required for that language. The problem occurs when the file is in Japanese with any of these encodings:
Shift JIS
EUC-JP
ISO-2022-JP
I tried reading the file with FileReader, but the result is all garbage characters. I also tried using FileInputStream, hard-coding the Japanese encoding just to check whether a Japanese file is read correctly, but the result is not as expected.
FileInputStream fis = new FileInputStream(htmlFile);
InputStreamReader isr = new InputStreamReader(fis, " ISO-2022-JP");
I don't have much experience with character encoding and internationalization. Any suggestions on how I can read/write files with different encodings?
One more thing: I don't know how to get the character encoding of the HTML file I am reading. I understand that I need to write the file in the same encoding, but I am not sure how to determine the original file's encoding.
Thanks,

Forget that FileReader exists; it implicitly uses the platform default encoding, which makes it pretty much useless.
Your code with the hardcoded encoding is correct except for the encoding name itself, which has a leading space. If you remove it, the code should correctly read ISO-2022-JP encoded files.
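For reference, a corrected sketch of the snippet (htmlFile is the File from the question, assumed to contain ISO-2022-JP text):
import java.io.*;

// The question's snippet with the stray space removed from the charset name.
FileInputStream fis = new FileInputStream(htmlFile);
InputStreamReader isr = new InputStreamReader(fis, "ISO-2022-JP");
BufferedReader reader = new BufferedReader(isr);
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line); // Japanese text should now decode correctly
}
reader.close();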
As for getting the character encoding of the HTML file, there are a number of ways it can be transmitted (a sniffing sketch follows the list):
on the HTTP level, in a Content-Type HTTP header - but this is only available when you read the file from the webserver, not when it is saved as a file
as a corresponding META HTML tag: <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
or, if the document type is XHTML, in the XML declaration: <?xml version="1.0" encoding="UTF-8"?>
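A minimal sketch of checking for the META tag, assuming the declaration appears within the first few KB of the file; reading those bytes as ISO-8859-1 is safe here because the declaration itself is plain ASCII. The helper name is hypothetical:
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.regex.*;

// Hypothetical helper: scan the head of the file for charset=... in a META tag.
static String sniffHtmlCharset(File file) throws IOException {
    byte[] head = new byte[4096];
    int n;
    try (FileInputStream in = new FileInputStream(file)) {
        n = in.read(head);
    }
    String text = new String(head, 0, Math.max(n, 0), StandardCharsets.ISO_8859_1);
    Matcher m = Pattern.compile("charset=[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE).matcher(text);
    return m.find() ? m.group(1) : null; // e.g. "EUC-JP", or null if not declared
}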

Related

Unzip files that contain Chinese characters

I have a zip file. It contains some files. The files contain Chinese characters, so I used:
ZipInputStream zipStream = new ZipInputStream(
new BufferedInputStream(new FileInputStream(zipFilePath), BUFFER_SIZE),
Charset.forName("ISO-8859-1")
);
......
FileOutputStream fileOutput = new FileOutputStream(uncompressedFileName);
while (zipStream.available() > 0) {
fileOutput.write(zipStream.read());
}
Extraction runs successfully. After that I want to use the encodingDetect method to find the encoding, but now the service does not work: it returns nomatch. If I send the files directly to the service, it works and finds the charset properly, like UTF-8.
I guess that Charset.forName("ISO-8859-1") extracts the files but the format is corrupted. Do you have any idea?
The problem is the charset of the file names in the zip. UTF-8 raises an error (the file names are evidently not in UTF-8), as UTF-8 requires a special format for multi-byte sequences, and evidently there are invalid "multibyte" sequences.
ISO-8859-1 is a single-byte encoding, accepting any garbage.
What you should do is try the small number of Chinese charsets, so that the file name strings are decoded correctly. A Java String contains Unicode, so it can hold text from any charset. Help from someone who speaks Chinese would probably make sense.
Then try writing files with those names. If that is not successful on your PC, you must use artificial file names, maybe a transliteration of the Chinese.
A translation table from the original Chinese file names to the actual file names could be created as a UTF-8 text file, maybe with a BOM, '\uFEFF', at the beginning of the file.
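A sketch of that idea, assuming (as an example, not a certainty) that the entry names were encoded as GBK; zipFilePath is the variable from the question:
import java.io.*;
import java.nio.charset.Charset;
import java.util.zip.*;

// Retry the archive with a Chinese charset so entry names decode into real Unicode.
// GBK is an assumption; Big5 or GB18030 are other candidates to try.
try (ZipInputStream zip = new ZipInputStream(
        new BufferedInputStream(new FileInputStream(zipFilePath)),
        Charset.forName("GBK"))) {
    ZipEntry entry;
    while ((entry = zip.getNextEntry()) != null) {
        System.out.println(entry.getName()); // readable names mean the guess was right
    }
}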
The ISO-8859-1 charset most definitely does not support the Chinese language. Use UTF-8 instead of ISO-8859-1.

How to read a file while preserving its encoding?

So, I have a file in ISO-8859-1 encoding. I do the following:
InputStreamReader isr = new InputStreamReader(new FileInputStream(fileLocation));
System.out.println(isr.getEncoding());
And I get UTF8... It looks like FileInputStream or InputStreamReader converts it to UTF-8.
Yes, I know about the following way:
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(fileLocation), "ISO-8859-1"));
But I don't know beforehand what encoding my file will have.
How can I read a file while preserving its encoding?
Binary files are just bytes; when those bytes are actually text in some encoding, the file unfortunately does not store that encoding (charset) anywhere.
Sometimes there is an encoding somewhere: Unicode text may have an optional BOM character at the beginning of the file, and HTML and XML can specify the charset.
If you downloaded the file from the internet, the charset may be mentioned in the header lines. Say it were an HTML file with Content-Type: text/html; charset=Windows-1251. Then you could read the file with Windows-1251 and always store it as UTF-8, modifying/adding a <meta charset="UTF-8">.
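A short sketch of that read-then-normalize step (the file names are placeholders):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Decode with the charset from the Content-Type header, re-encode as UTF-8.
String text = new String(Files.readAllBytes(Paths.get("page.html")),
        Charset.forName("windows-1251"));
Files.write(Paths.get("page-utf8.html"), text.getBytes(StandardCharsets.UTF_8));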
But in general there is no reliable way to determine a file's encoding. You could do the following (a sketch follows below):
read the bytes
if they are convertible to UTF-8 without errors in the multi-byte sequences, the file is UTF-8
otherwise it is a single-byte encoding; default to Windows-1252 (rather than ISO-8859-1)
maybe use word frequency tables for some languages together with their encodings, and try those
write the bytes, decoded with the determined encoding, back to a file as UTF-8
There might be a library that does such a thing, combining language recognition and charset recognition.
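As a sketch of the UTF-8 check from the list above (the Windows-1252 fallback is the assumption from the list, not a guarantee):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.*;
import java.nio.file.*;

// Treat the bytes as UTF-8 if they decode without error; otherwise fall back.
static Charset guessCharset(Path file) throws IOException {
    byte[] bytes = Files.readAllBytes(file);
    CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        utf8.decode(ByteBuffer.wrap(bytes)); // throws on invalid multi-byte sequences
        return StandardCharsets.UTF_8;
    } catch (CharacterCodingException e) {
        return Charset.forName("windows-1252"); // single-byte fallback
    }
}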

How to detect the encoding of a PPTX file?

My question is, how can I get the encoding of a pptx file in Java?
(I'm using apache poi)
File f = new File(filename);
XMLSlideShow ppt = new XMLSlideShow(new FileInputStream(f));
The reason why I need to know the encoding is that later on I post some data from the file, which I have saved in a JSON string, and it is at this stage that my problem occurs.
When doing an HTTP POST the encoding is changed, and I figured this problem could be solved if I knew the encoding of the data in my JSON string. Then I could set this encoding in my HTTP POST.
EDIT/CLARIFICATION:
The problem is the Swedish letters å, ä and ö.
å becomes Ã¥
ä becomes Ã¤
ö becomes Ã¶
Java and POI aside, to get to the encoding of a PowerPoint PPTX file, you have to examine the underlying XML for the slides:
Unzip the pptx file (for manually looking, any zip utility like 7-zip will do).
Under the zip root, find the ppt/slides directory.
Typically each slide is slide#.xml; open the one you want to examine.
Read the first line: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
In most cases, I would expect the encoding to be the same across all slides (meaning that you could probably use the root-level "[Content_Types].xml" file as a proxy for encoding of the entire archive).
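The same inspection can be done programmatically with the standard zip API; a sketch, where the slide path follows the usual OOXML layout and the file name is a placeholder:
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

// Read the first line of a slide's XML part; the declaration is plain ASCII.
try (ZipFile zip = new ZipFile("presentation.pptx")) {
    ZipEntry slide = zip.getEntry("ppt/slides/slide1.xml");
    try (BufferedReader r = new BufferedReader(
            new InputStreamReader(zip.getInputStream(slide), StandardCharsets.US_ASCII))) {
        System.out.println(r.readLine()); // e.g. <?xml version="1.0" encoding="UTF-8" ...?>
    }
}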

Parse XML file containing umlauts using SAX parser

I have looked through a lot of posts regarding the same problem, but I can't figure it out. I am trying to parse an XML file with umlauts in it. This is what I have now:
File file = new File(this.xmlConfig);
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handlerConfig);
But it won't read the umlauts properly; Ä, Ü and Ö come out as weird characters. The file is definitely in UTF-8, and it is declared as such in the first line: <?xml version="1.0" encoding="utf-8"?>
What am I doing wrong?
First rule: Don't second guess the encoding used in the XML document. Always use byte streams to parse XML documents:
InputStream inputStream= new FileInputStream(this.xmlConfig);
InputSource is = new InputSource(inputStream);
saxParser.parse(is, handlerConfig);
If that doesn't work, the <?xml version=".." encoding="UTF-8" ?> (or whatever) in the XML is wrong, and you have to take it from there.
Second rule: Make sure you inspect the result with a tool that supports the encoding used in the target, or result, document. Have you?
Third rule: Check the byte values in the source document. Bring up your favourite HEX editor/viewer and inspect the content. For example, the letter Ä should be the byte sequence 0xC3 0x84, if the encoding is UTF-8.
Fourth rule: If it doesn't look correct, suspect that the UTF-8 source is being viewed, or interpreted, as an ISO-8859-1 source. Verify this by comparing the first and second bytes from the UTF-8 source with the ISO-8859-1 code charts.
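A quick way to check rules three and four from Java itself (Ä is just an example character):
import java.nio.charset.StandardCharsets;

// Dump the UTF-8 bytes of Ä and show how they look when misread as ISO-8859-1.
byte[] utf8 = "Ä".getBytes(StandardCharsets.UTF_8);
System.out.printf("UTF-8 bytes: %02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF); // C3 84
System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // Ã plus a control character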
UPDATE:
The byte sequence for the Unicode letter ä (LATIN SMALL LETTER A WITH DIAERESIS, U+00E4) is 0xC3 0xA4 in the UTF-8 encoding. If you use a viewing tool that only understands (or is configured to interpret the source as) the ISO-8859-1 encoding, the first byte, 0xC3, is the letter Ã, and the second byte is the letter ¤, the currency sign (Unicode U+00A4), which may look like a circle.
Hence, the "TextView" thingy in Android is interpreting your input as an ISO-8859-1 stream. I have no idea whether it is possible to change that. But if you have your parsing result as a String or a byte array, you could convert it to an ISO-8859-1 stream (or byte array), and then feed it to the "TextView".

How to append a UTF-8 string to a properties file

How do I append a UTF-8 string to a properties file? I have given the code below.
public static void addNewAppIdToRootFiles() {
Properties properties = new Properties();
try {
FileInputStream fin = new FileInputStream("C:\\Users\\sarika.sukumaran\\Desktop\\root\\root.properties");
properties.load(new InputStreamReader(fin, Charset.forName("UTF-8")));
String propertyStr = new String(("قسيمات").getBytes("iso-8859-1"), "UTF-8");
BufferedWriter bw = new BufferedWriter(new FileWriter(directoryPath + rootFiles, true));
bw.write(propertyStr);
bw.newLine();
bw.flush();
bw.close();
fin.close();
} catch (Exception e) {
System.out.println("Exception : " + e);
}
}
But when I open the file, the string "قسيمات" that I wrote shows up as "??????". Please help me.
OK, your first mistake is getBytes("iso-8859-1"). You should not do these manipulations at all. If you want to write Unicode text to a file, you should just open the file and write the text. The internal representation of strings in Java is Unicode, so everything will be written correctly.
You have to care about the charset when you are reading a file, and by the way, you do that correctly.
But you do not have to use file manipulation tools to append something to a properties file. You can just call prop.setProperty("yourkey", "yourvalue") and then call prop.store(new FileOutputStream(yourfilename)).
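A minimal sketch of that approach (the file name, key and value are placeholders):
import java.io.*;
import java.util.Properties;

// Let Properties do the escaping: store() writes non-Latin-1 characters as \uXXXX.
Properties props = new Properties();
try (FileInputStream in = new FileInputStream("root.properties")) {
    props.load(in); // load() reads ISO-8859-1 with \uXXXX escapes
}
props.setProperty("yourkey", "قسيمات");
try (FileOutputStream out = new FileOutputStream("root.properties")) {
    props.store(out, null);
}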
OK, I have checked the specification for the Properties class. If you use the load() method with an input stream or the store() method with an output stream, the stream is assumed to be in ISO-8859-1 encoding by default. Therefore, you have to be cautious about a few things:
Some characters in French, German and Portuguese are ISO-8859-1 (Latin-1) compatible and normally work fine in ISO-8859-1, so you don't have to worry that much about them. But others, like Arabic and Hebrew characters, are not Latin-1 compatible, so you need to be careful with the choice of encoding for these characters. If you have a mix of French and Arabic characters, you have no choice but to use Unicode.
What is your current input file's encoding, if it already exists and is to be used with Properties's load() method? If it is not the default ISO-8859-1, then you need to figure out what it is before opening the file. If the input file's encoding is UTF-8, then use properties.load(new InputStreamReader(new FileInputStream("infile"), "UTF8")); Then stick to this encoding till the end, and match the file encoding with the character encoding as well.
If it is a new input file to be used with Properties's load() method, choose the file encoding that works with your characters' encoding, then stick to this encoding till the end.
Your expected output file's encoding should be the same as the one used with Properties's load() method before you use the store() method. If it is not the default ISO-8859-1, then you need to figure out what it is before saving the file. Stick to this encoding till the end, and match the file encoding with the character encoding as well. If the output file's encoding is UTF-8, then specifically use UTF-8 encoding when saving the file. But if the store() method still ends up producing an output file in ISO-8859-1 encoding, then you need to do what is suggested next...
If you stick to the default ISO-8859-1, it works fine for characters such as French. But if the characters are not ISO-8859-1 (Latin-1) compatible, you need to use Unicode escape sequences instead as an alternative: for example, \uFE94 for the Arabic ﺔ character. For me, this escaping is too tedious, and normally we use the native2ascii utility provided in the JRE or JDK to convert a properties file from one encoding to another. Of course, there are other ways... just check the references below. For me, it is better to use a properties file in XML format, since by default it is UTF-8 (a sketch follows)...
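A sketch of that XML alternative (the file name is a placeholder):
import java.io.*;
import java.util.Properties;

// The XML form of Properties defaults to UTF-8, so Arabic text survives unescaped.
Properties props = new Properties();
props.setProperty("yourkey", "قسيمات");
try (FileOutputStream out = new FileOutputStream("root.xml")) {
    props.storeToXML(out, null); // writes <?xml version="1.0" encoding="UTF-8"...?>
}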
References:
Java properties UTF-8 encoding in Eclipse
Setting the default Java character encoding?
