I've written a little application that does some text manipulation and writes the output to a file (html, csv, docx, xml), and this all appears to work fine on Mac OS X. On Windows, however, I seem to get character encoding problems: a lot of '"' characters seem to disappear and be replaced with some weird stuff, usually the closing '"' of a pair.
I use FreeMarker to create my output files, and there is a byte[] array and in one case also a ByteArrayStream between reading the templates and writing the output. I assume this is a character encoding problem, so I'd appreciate it if someone could give me advice or point me to a 'Best Practice' resource for dealing with character encoding in Java.
Thanks
There's really only one best practice: be aware that Strings and bytes are two fundamentally different things, and that whenever you convert between them, you are using a character encoding (either implicitly or explicitly), which you need to pay attention to.
Typical problematic spots in the Java API are:
new String(byte[])
String.getBytes()
FileReader, FileWriter
All of these implicitly use the platform default encoding, which depends on the OS and the user's locale settings. Usually, it's a good idea to avoid this and explicitly declare an encoding in the above cases (which FileReader/Writer unfortunately don't allow, so you have to use an InputStreamReader/Writer).
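For illustration, here is a minimal sketch of the explicit versions of those calls (the file names are just examples; the String-name overloads throw the checked UnsupportedEncodingException):

import java.io.*;

byte[] bytes = ...;                        // some bytes from the outside world
String text = new String(bytes, "UTF-8");  // instead of new String(bytes)
byte[] encoded = text.getBytes("UTF-8");   // instead of text.getBytes()

// instead of FileReader / FileWriter:
Reader in = new InputStreamReader(new FileInputStream("in.txt"), "UTF-8");
Writer out = new OutputStreamWriter(new FileOutputStream("out.txt"), "UTF-8");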
However, your problems with the quotation marks and your use of a template engine may have a much simpler explanation. What program are you using to write your templates? It sounds like it's one that inserts "smart quotes", which are part of the Windows-specific Windows-1252 encoding but don't exist in the more global ISO-8859-1 encoding.
What you probably need to do is to be aware which encoding your templates are saved in, and configure your template engine to use that encoding when reading in the templates. Also be aware that some text files, specifically XML, explicitly declare the encoding in a header, and if that header disagrees with the actual encoding used by the file, you'll invariably run into problems.
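Since the question mentions FreeMarker, here is a minimal sketch of setting the template-reading encoding there, assuming FreeMarker 2.3.x and a made-up template directory and file name:

import java.io.File;
import freemarker.template.Configuration;
import freemarker.template.Template;

Configuration cfg = new Configuration();
cfg.setDirectoryForTemplateLoading(new File("templates"));
cfg.setDefaultEncoding("UTF-8");                 // encoding used to read the .ftl files
Template template = cfg.getTemplate("output.ftl");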
You can control which encoding your JVM will run with by supplying, for example,
-Dfile.encoding=utf-8
(for UTF-8, of course) as an argument to the JVM. Then you should get predictable results on all platforms. Example:
java -Dfile.encoding=utf-8 my.MainClass
Running the JVM with a 'standard' encoding via the confusingly named -Dfile.encoding property will resolve a lot of problems.
Ensuring your app doesn't make use of byte[] <-> String conversions without an encoding specified is also important, since sometimes you can't enforce the VM encoding (e.g. if you have an app server used by multiple applications).
If you're confused by the whole encoding issue, or want to revise your knowledge, Joel Spolsky wrote a great article on this.
I had to make sure that the OutputStreamWriter uses the correct encoding
OutputStream out = ...
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
template.process(model, writer);
Plus if you use a ByteArrayOutputStream also make sure to call toString with the correct encoding:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
...
baos.toString("UTF-8");
I have a problem writing an XML file with UTF-8 in Java.
Problem: I have a file whose filename contains an interpunct (middot, ·). When I try to write the filename inside an XML tag using Java code, I get some junk number in the filename instead of ·.
OutputStreamWriter osw = new OutputStreamWriter(file_output_stream, "UTF8");
Above is the Java code I used to write the XML file. Can anybody help me understand and sort out the problem? Thanks in advance.
Java Strings are UTF-16 internally, but the compiler reads your source files in the platform default encoding unless told otherwise.
If your character can't be represented in that source encoding, use a Unicode escape:
String a = "\u00b7";
Or tell your compiler to use UTF-8 (javac -encoding UTF-8) and simply write the character in the source as-is.
That character is code point 183 in decimal (U+00B7, outside ASCII), so you need to escape it as &#183;. Here is a demonstration: if I type "&#183;" into this answer, I get "·".
The browser shows it as the character itself because this web page interprets the escape the same way XML does.
There are utility methods that can do this for you, such as apache commons-lang library's StringEscapeUtils.escapeXml() method, which will correctly and safely escape the entire input.
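A quick sketch of what that looks like (the input string is made up; how non-ASCII characters are handled depends on the commons-lang version):

import org.apache.commons.lang.StringEscapeUtils;

String name = "Tuskul\u0117nai \u00b7 2009";        // contains the middot
String escaped = StringEscapeUtils.escapeXml(name);
// escapes &, <, >, " (and, in commons-lang 2.x, non-ASCII chars such as the middot become &#183;)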
In general it is a good idea to use UTF-8 everywhere.
The editor has to know that the source is in UTF-8. You could use the free programmer's editor jEdit, which can deal with many encodings.
The javac compiler has to know that the Java source is in UTF-8 (for example via javac -encoding UTF-8). Alternatively, you can use the escape-based solution from Ondra Žižka's answer above.
This makes for two settings in your IDE.
Don't try to create XML by hand. Use a library for the purpose. You are just scratching the surface of the heap of special cases that will break a hand-made solution.
One way, using core Java classes, is to create a DOM and then serialize it using a no-op XSL transform that writes to a StreamResult. (If your document is large, you can do something similar by driving a SAX event handler.)
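A minimal sketch of that approach using only the standard javax.xml classes (the element names and output file are made up, and checked exceptions are left undeclared for brevity):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element root = doc.createElement("files");
Element file = doc.createElement("file");
file.setTextContent("Tuskul\u0117nai \u00b7 2009.txt"); // the middot is escaped for you if needed
root.appendChild(file);
doc.appendChild(root);

Transformer t = TransformerFactory.newInstance().newTransformer(); // no-op (identity) transform
t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
t.transform(new DOMSource(doc), new StreamResult(new File("files.xml")));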
There are many third party libraries that will help you do the same thing very easily.
I am trying to pull byte data from a source, encrypt it, and then store it in the file system.
For encryption, I am using jasypt and the BasicTextEncryptor class. And for storing on to the file system, I am using Apache's Commons IOUtils class.
When required, these files will be decrypted and then sent to the user's browser. This system works on my local machine where the default charset is MacRoman, but it fails on the server where the default charset is UTF-8.
When I explicitly set the encoding at each stage of the process to MacRoman it works on the server as well, but I am skeptical about doing this as the rest of my code uses UTF-8.
Is there a way that I can work the code without conversion to MacRoman?
You should just use UTF-8 everywhere.
As long as you use the same encoding at each end of an operation (and as long as the encoding can handle all of the characters you need), you'll be fine.
In your comments on another answer, you claim you're not using an encoding, but that's impossible. You're using the BasicTextEncryptor class, which according to this documentation only works on Strings and char arrays. That means that, at some point, you're converting from an encoding-agnostic byte array to an encoding-specific String or char array. That means that you're relying upon an encoding somewhere, whether you realize it or not. You need to track down where that conversion is happening and ensure it has the correct encoding.
Your question states, "When I explicitly set the encoding at each stage of the process", so you will need to know how it's encoded in the database. If that doesn't make sense, read on.
It's also possible that you are simply trying to encrypt a file that you're getting out of the database, and you don't care about the string representation; you want to treat it as plain bytes, not as text. In that case, BasicTextEncryptor ("Utility class for easily performing normal-strength encryption of texts.") is not a good fit for this task. It encrypts strings. The BasicBinaryEncryptor ("Utility class for easily performing normal-strength encryption of binaries (byte arrays).") is what you need.
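A hedged sketch of the binary variant (the password and byte source are placeholders; check the jasypt API for the exact class and method names in your version):

import org.jasypt.util.binary.BasicBinaryEncryptor;

BasicBinaryEncryptor encryptor = new BasicBinaryEncryptor();
encryptor.setPassword("change-me");            // placeholder password
byte[] plain = ...;                            // the raw bytes pulled from your source
byte[] encrypted = encryptor.encrypt(plain);   // bytes in, bytes out: no String, no charset involved
byte[] decrypted = encryptor.decrypt(encrypted);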
I'm trying to find out what has happened in an integration project. We just can't get the encoding right at the end.
A Lithuanian file was imported to the AS/400, where text is stored in EBCDIC. The data was then exported to an 'ANSI' file and read as windows-1257. ASCII characters work fine, and so do some Lithuanian ones, but the rest comes out as garbage, with chars like ~, ¶ and ].
Example string going through the pipeline:
Start file:
Tuskulënö
as400 (characters, then the hex of each EBCDIC byte split over two rows):
Tuskulënö
EAA9A9596
34224335A
Exported file (after conversion to windows-1257):
Tuskulėnö
Expected result for exported file:
Tuskulėnų
Any ideas?
Regards,
Karl
EBCDIC isn't a single encoding, it's a family of encodings (in this case called codepages), similar to how ISO-8859-* is a family of encodings: the encodings within the families share about half the codes for "basic" letters (roughly what is present in ASCII) and differ on the other half.
So if you say that it's stored in EBCDIC, you need to tell us which codepage is used.
A similar problem exists with ANSI: when used for an encoding it refers to a Windows default encoding. Unfortunately the default encoding of a Windows installation can vary based on the locale configured.
So again: you need to find out which actual encoding is used here (these are usually from the Windows-* family; the "normal" English one is Windows-1252).
Once you actually know what encoding you have and want at each point, you can go towards the second step: fixing it.
My personal preference for this kind of problems is this: Have only one step where encodings are converted: take whatever the initial tool produces and convert it to UTF-8 in the first step. From then on, always use UTF-8 to handle that data. If necessary convert UTF-8 to some other encoding in the last step (but avoid this if possible).
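As a rough sketch of that "convert once, then stay in UTF-8" approach (the file names are made up, and windows-1257 is only an assumption; substitute whatever your export step actually produces):

import java.io.*;

BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("export.txt"), "windows-1257"));
Writer out = new OutputStreamWriter(new FileOutputStream("export-utf8.txt"), "UTF-8");
String line;
while ((line = in.readLine()) != null) {
    out.write(line);
    out.write('\n');
}
in.close();
out.close();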
I am writing to a Java file through
FileWriter fstream = new FileWriter("someFile.java");
BufferedWriter out = new BufferedWriter(fstream);
out.write(strContents);
// Close the output stream
out.close();
but after writing I found it had appended some special characters that show up as boxes ([]), and those special characters are only visible in specific text editors like EditPlus.
How do I avoid those special characters while writing, or is this specific to certain editors only?
My advice would be to avoid using FileWriter completely. It always uses the platform default encoding, which is rarely a good idea.
I would suggest using FileOutputStream wrapped in an OutputStreamWriter - then you just need to specify an appropriate encoding, such as UTF-8. Obviously you'll still need to use an editor which supports UTF-8, and you may need to tell it the encoding... but at least you'll have code which always writes in the same way, regardless of OS and system properties.
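Something along these lines, keeping the file name and strContents from the question (the choice of UTF-8 is an assumption):

import java.io.*;

Writer out = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream("someFile.java"), "UTF-8"));
out.write(strContents);
out.close();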
A plain Notepad application can't display some special characters written to the file. There is no problem with your code; it is a limitation of Notepad.
I know a UTF file has a BOM for determining the encoding, but what about encodings that give no clue for guessing them?
I am a new Java programmer. I have written code for detecting UTF encodings using the BOM, but I have a problem with other encodings. How do I guess them?
Can anybody help me? Thanks in advance.
This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).
GuessEncoding
jchardet (a Java port of the algorithm used by Mozilla Firefox)
Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.
Short answer is: you cannot.
Even in UTF-8, the BOM is entirely optional, and it's often recommended not to use it since many apps do not handle it properly and just display it as if it were a printable character. The original purpose of byte order marks was to indicate the endianness of UTF-16 files.
This said, most apps that handle Unicode implement some sort of guessing algorithm. Read the beginning of the file and look for certain signatures.
If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. Some pointers can give you hints, though.
For example, an ISO-8859-1 file will (usually) not contain any 0x00 bytes, whereas a UTF-16 file will have loads of them.
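To illustrate the kind of heuristic involved, here is a toy sketch (not a robust detector; it only checks BOMs and the 0x00 hint from above, and the fallback choice is arbitrary):

import java.io.FileInputStream;
import java.io.IOException;

static String guessEncoding(String fileName) throws IOException {
    FileInputStream in = new FileInputStream(fileName);
    byte[] buf = new byte[4096];
    int len = in.read(buf);
    in.close();
    if (len >= 3 && (buf[0] & 0xFF) == 0xEF && (buf[1] & 0xFF) == 0xBB && (buf[2] & 0xFF) == 0xBF)
        return "UTF-8";      // UTF-8 BOM
    if (len >= 2 && (buf[0] & 0xFF) == 0xFE && (buf[1] & 0xFF) == 0xFF)
        return "UTF-16BE";   // UTF-16 big-endian BOM
    if (len >= 2 && (buf[0] & 0xFF) == 0xFF && (buf[1] & 0xFF) == 0xFE)
        return "UTF-16LE";   // UTF-16 little-endian BOM
    for (int i = 0; i < len; i++)
        if (buf[i] == 0x00)
            return "UTF-16"; // null bytes are rare in 8-bit encodings
    return "ISO-8859-1";     // single-byte fallback; could just as well be windows-1252
}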
The most common solution is to let the user select the encoding if you cannot detect it.