Java - detect and change encoding

I have a small Java app that I develop in Eclipse. It takes text from an XML file and uploads it to a website. In Eclipse everything works fine, but when I compile the app into an executable JAR and start it from cmd, a big problem appears: all non-Latin characters turn into unreadable symbols.
I've tried putting <?xml version="1.0" encoding="windows-1251" ?> or <?xml version="1.0" encoding="utf-8" ?> in the file, but it doesn't help.
How can I fix this problem?
Any help appreciated!

You could try specifying the UTF-8 Charset (or any other supported charset, for that matter) explicitly in your output writer's constructor.
For example, when using the PrintWriter class for outputting data:
Writer writer = new PrintWriter("myfile.txt", "UTF-8");
writer.write("Hällo Wörld!");
writer.close();
An equivalent example when using the OutputStreamWriter class:
Writer writer = new OutputStreamWriter(System.out, "UTF-8");
writer.write("Hällo Wörld!");
writer.close();
(Note that in these examples the charset is given by its textual name, i.e. "UTF-8"; OutputStreamWriter also accepts a java.nio.charset.Charset object such as StandardCharsets.UTF_8.)
Thus, a likely explanation for your problem is that because the charset is not given explicitly, the system falls back to the default encoding of your OS (which is probably not UTF-8).
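The same rule applies on the reading side: if you read the XML file through a Reader, give it the charset explicitly as well. A minimal sketch, assuming the file is actually saved as UTF-8 (the file name is just a placeholder):
// Read the XML file with an explicit charset instead of the platform default
Reader xmlReader = new InputStreamReader(new FileInputStream("data.xml"), "UTF-8");
BufferedReader buffered = new BufferedReader(xmlReader);
StringBuilder content = new StringBuilder();
String line;
while ((line = buffered.readLine()) != null) {
    content.append(line).append("\n");
}
buffered.close();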

Related

Problems with JAXB and UTF-16 encoding

Hi, I have a small app that reads content from an XML file and puts it into a corresponding Java object.
Here is the XML:
<?xml version="1.0" encoding="UTF-16"?>
<Marker>
<TimePosition>2700</TimePosition>
<SamplePosition>119070</SamplePosition>
</Marker>
Here is the corresponding Java code:
JAXBContext jaxbContext = JAXBContext.newInstance(MarkerDto.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
InputStream inputStream = new FileInputStream("D:/marker.xml");
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_16.toString());
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(reader);
If I run this code I get a "Content is not allowed in prolog." exception. If I run the same with UTF-8 everything works fine. Does anyone have a clue what might be the problem?
There are several things wrong here (ranging from slightly suboptimal to potentially very wrong). In increasing order of likelihood of causing the problem:
When constructing an InputStreamReader, there's no need to call toString() on the Charset, because that class has a constructor that takes a Charset, so simply remove the .toString():
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_16);
This is a tiny nitpick and has no effect on functionality.
Don't construct a Reader at all! XML is a format that's self-describing when it comes to encoding: Valid XML files can be parsed without knowing the encoding up-front. So instead of creating a Reader, simply pass the InputStream directly into your XML-handling code. Delete the line that creates the Reader and change the next one to this:
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(inputStream);
This may or may not fix your problem, depending on whether the input is well-formed.
Your XML file might have encoding="UTF-16" in the header and not actually be UTF-16 encoded. If that's the case, then it is malformed and a conforming parser will decline to parse it. Verify this by opening the file with the advanced text editor of your choice (I suggest Notepad++ on Windows, Linux users probably know what their preference is) and check if it shows "UTF-16" as encoding (and the content is readable).
If I run the same with UTF-8 everything works fine.
This line suggests that that's what's actually happening here: the XML file is mis-labeling itself. This needs to be fixed at the point where the XML file is created.
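For instance, if the file is produced with JAXB itself, one way to keep the declaration and the actual bytes in sync is to let the marshaller handle the encoding. A sketch, assuming the same MarkerDto object from the question; Marshaller.JAXB_ENCODING is the standard JAXB property for this:
// Let JAXB write both the XML declaration and the bytes in UTF-16
Marshaller marshaller = JAXBContext.newInstance(MarkerDto.class).createMarshaller();
marshaller.setProperty(Marshaller.JAXB_ENCODING, "UTF-16");
marshaller.marshal(markerDto, new FileOutputStream("D:/marker.xml"));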
Notably, this demo code provides exactly the same Content is not allowed in prolog. exception message that is reported in the question:
String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<foo />";
JAXBContext jaxbContext = JAXBContext.newInstance();
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
InputStream inputStream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
jaxbUnmarshaller.unmarshal(inputStream);
Note that the XML encoding attribute claims UTF-16, but the actual data handed to the XML parser is UTF-8 encoded.
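Conversely, when the bytes handed to the parser really are UTF-16 (String.getBytes(StandardCharsets.UTF_16) also writes a byte order mark), the prolog parses without complaint. A sketch, reusing the MarkerDto class from the question:
String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n"
        + "<Marker><TimePosition>2700</TimePosition><SamplePosition>119070</SamplePosition></Marker>";
JAXBContext jaxbContext = JAXBContext.newInstance(MarkerDto.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
// The declared encoding and the actual bytes now agree, so no prolog error
InputStream inputStream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_16));
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(inputStream);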

How can I change the Standard Out to "UTF-8" in Java

I download a file from a website using a Java program, and the header looks like below:
Content-Disposition: attachment;filename="Textkürzung.asc";
There is no encoding specified.
What I do is, after downloading, pass the name of the file to another application for further processing. I use
System.out.println(filename);
In the standard out the string is printed as Textk³rzung.asc
How can I change the Standard Out to "UTF-8" in Java?
I tried encoding it to "UTF-8" but the content is still the same.
Update:
I was able to fix this without any code change. In the place where the other application calls my jar file, I did the following:
java -Dfile.encoding=UTF-8 -jar ....
This seems to have fixed the issue.
Thank you all for your support!
The default encoding of System.out is the operating system default. On international versions of Windows this is usually the windows-1252 codepage. If you're running your code on the command line, that is also the encoding the terminal expects, so special characters are displayed correctly. But if you are running the code some other way, or sending the output to a file or another program, it might be expecting a different encoding. In your case, apparently, UTF-8.
You can actually change the encoding of System.out by replacing it:
try {
    System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    throw new InternalError("VM does not support mandatory encoding UTF-8");
}
This works for cases where using a new PrintStream is not an option, for instance because the output is coming from library code which you cannot change, and where you have no control over system properties, or where changing the default encoding of all files is not appropriate.
The result you're seeing suggests your console expects text to be in Windows "code page 850" encoding - the character ü has Unicode code point U+00FC. The byte value 0xFC renders in Windows code page 850 as ³. So if you want the name to appear correctly on the console then you need to print it using the encoding "Cp850":
PrintWriter consoleOut = new PrintWriter(new OutputStreamWriter(System.out, "Cp850"));
consoleOut.println(filename);
Whether this is what your "other application" expects is a different question - the other app will only see the correct name if it is reading its standard input as Cp850 too.
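If the other application happens to be a Java program as well, it would have to read its standard input with the matching charset. A minimal sketch (the charset name must match whatever encoding the sender actually writes):
// Read the incoming file name using the same encoding the sender wrote it in
BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, "Cp850"));
String receivedFilename = stdin.readLine();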
Try to use:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(filename);

Error when reading non-English language character from file

I am building an app where users have to guess a secret word. I have *.txt files in the assets folder. The problem is that the words are in the Albanian language. Our language uses letters like "ë" and "ç", so whenever I try to read a word containing any of those characters from the file I get some wicked symbol, and I cannot implement string.compare() for these characters. I have tried many options with UTF-8 and changed Eclipse settings, but still the same error.
I would really appreciate it if someone has any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
    aBuffer += aDataRow + "\n";
    stringList.add(aDataRow);
}
Otherwise the code works fine, except for the mentioned characters.
It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is ISO-8859-1 or ISO-8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)
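As one possible illustration of the "configuration property" option (the property name file.charset below is just an arbitrary choice for this sketch, not a standard JVM property):
// Run with e.g. -Dfile.charset=windows-1252 to override; defaults to UTF-8
String charsetName = System.getProperty("file.charset", "UTF-8");
reader = new BufferedReader(new InputStreamReader(fins, charsetName));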
You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're on Java 7 or later.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).
You need to know the encoding of the file.
The InputStream class reads the file as binary. Although you can interpret the input as characters, that would be implicit guessing, which may be wrong.
The InputStreamReader class converts the binary data to chars, but it has to be told the character set.
You should use the constructor that takes the character set explicitly, for example (the encoding name below is only a placeholder):
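// Replace "windows-1252" with whatever encoding the file was actually saved in
reader = new BufferedReader(new InputStreamReader(fins, "windows-1252"));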
UPDATE
Don't assume you have a UTF-8 encoded file; that may be wrong. Here in Russia we have encodings such as CP866, WIN1251 and KOI8, which all differ from UTF-8. You probably have some popular Albanian text-file encoding instead. Check your OS settings for a hint.
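To see which default your JVM is actually picking up, both of these are standard and safe to print:
// Both reflect the platform/JVM default encoding
System.out.println(java.nio.charset.Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));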

Reading Arabic chars from text file

I have finished a project in which I read from a text file written with Notepad.
The characters in my text file are in the Arabic language, and the file encoding type is UTF-8.
When launching my project inside NetBeans (7.0.1) everything seemed to be OK, but when I built the project as a (.jar) file the characters were displayed this way: ÇáãæÇÞÚááÊØæíÑ.
How could I solve this problem, please?
Most likely you are using JVM default character encoding somewhere. If you are 100% sure your file is encoded using UTF-8, make sure you explicitly specify UTF-8 when reading as well. For example this piece of code is broken:
new FileReader("file.txt")
because it uses the JVM default character encoding - which you might not have control over; apparently NetBeans uses UTF-8 while your operating system defines something different. Note that this makes the FileReader class completely useless if you want your code to be portable.
Instead use the following code snippet:
new InputStreamReader(new FileInputStream("file.txt"), "UTF-8");
You are not providing your code, but this should give you a general impression how this should be implemented.
Maybe this example will help a little. I will try to print the content of a UTF-8 file both to the IDE console and to a system console that is encoded in "Cp852".
My d:\data.txt contains ąźżćąś adsfasdf
Let's check this code:
// I will read chars using UTF-8 encoding
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("d:\\data.txt"), "utf-8"));
// and write to the console using Cp852 encoding ("Cp852" is the encoding used in my Windows 7 console)
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, "Cp852"), true);
// ok, let's read data from the file
String line;
while ((line = in.readLine()) != null) {
    // here I use the IDE encoding
    System.out.println(line);
    // here I print data using Cp852 encoding
    out.println(line);
}
When I run it in Eclipse the output will be
ąźżćąś adsfasdf
Ą«ľ†Ą? adsfasdf
but in the system console the result is reversed: the line written through the Cp852 writer displays correctly, while the line printed with System.out is garbled.

Character encoding

I get an HTML file which I need to read and parse. This file can be in plain English, Japanese, or any other language, with the character encoding required for that language. The problem occurs when the file is in Japanese with any of these encodings:
Shift JIS
EUC-JP
ISO-2022-JP
I tried reading the file with FileReader but the result is all garbage characters. I also tried using FileInputStream and just hard-coding the Japanese encoding to check whether a Japanese file is read correctly, but the result is not as expected:
FileInputStream fis = new FileInputStream(htmlFile);
InputStreamReader isr = new InputStreamReader(fis, " ISO-2022-JP");
I don't have much experience with character encoding and internationalization; any suggestions on how I can read/write files with different encodings?
One more thing: I don't know how to get the character encoding of the HTML file I am reading. I understand that I need to write the file in the same encoding, but I'm not sure how to get the original file's encoding.
Thanks
Forget that FileReader exists; it implicitly uses the platform default encoding, which makes it pretty much useless.
Your code with the hardcoded encoding is correct except for the encoding name itself, which has a leading space. If you remove it, the code should correctly read ISO-2022-JP encoded files.
As for getting the character encoding of the HTML file, there are a number of ways it can be transmitted:
on the HTTP level in a Content-Type HTTP header - but this is only available when you read the file from the webserver, not when it's saved as a file
as a corresponding META HTML tag: <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
or, if the document type is XHTML, in the XML declaration: <?xml version="1.0" encoding="UTF-8"?>
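If all you have is the saved file and its META tag, one possible approach (a rough sketch, not production-grade detection; real pages may declare the charset in several places or not at all, and "page.html" is just a placeholder) is to decode the bytes once with an ASCII-compatible charset, look for the charset attribute, then re-decode with the declared encoding:
// Read the raw bytes first (java.nio.file.Files, Java 7+)
byte[] bytes = Files.readAllBytes(Paths.get("page.html"));

// First pass: ISO-8859-1 maps every byte to some char, so the ASCII markup is readable
String rough = new String(bytes, StandardCharsets.ISO_8859_1);

// Very simplified: look for charset=... anywhere in the document
Matcher m = Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE).matcher(rough);
Charset charset = m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;

// Second pass: decode the same bytes again, now with the declared charset
String html = new String(bytes, charset);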
