Writing XML in UTF-8 successfully - Java

Today I ran into a very interesting problem while trying to rewrite an XML file. I have two ways to do this, and I want to know which one is best and what the reason for the problem is.
I.
File file = new File(REAL_XML_PATH);
try {
    FileWriter fileWriter = new FileWriter(file);
    XMLOutputter xmlOutput = new XMLOutputter();
    xmlOutput.output(document, System.out);
    xmlOutput.output(document, fileWriter);
    fileWriter.close();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
In this case I have a big problem with my app: after writing to the file in my own language, I can't read anything back. The file's encoding was changed to ANSI, and I get:
javax.servlet.ServletException: javax.servlet.jsp.JspException: Invalid argument looking up property: "document.rootElement.children[0].children"
II.
File file = new File(REAL_XML_PATH);
XMLOutputter output = new XMLOutputter();
try {
    output.output(document, new FileOutputStream(file));
} catch (FileNotFoundException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
In this case I have no problems: the encoding isn't changed, and reading and writing both work.
I also found this article: http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html
Again: which is the best way, and what is the reason for the problem?

Well, this looks like the problem:
FileWriter fileWriter = new FileWriter(file);
That will always use the platform default encoding, which is rarely what you want. Suppose your default encoding is ISO-8859-1. If your document declares itself to be encoded in UTF-8, but you actually write everything in ISO-8859-1, then your file will be invalid if you have any non-ASCII characters - you'll end up writing them out with the ISO-8859-1 single byte representation, which isn't valid UTF-8.
I would actually provide a stream to XMLOutputter rather than a Writer. That way there's no room for conflict between the encoding declared by the file and the encoding used by the writer. So just change your code to:
FileOutputStream fileOutput = new FileOutputStream(file);
...
xmlOutput.output(document, fileOutput);
... as I now see you've done in your second bit of code. So yes, this is the preferred approach. Here, the stream makes no assumptions about the encoding to use, because it's just going to handle binary data. The XML writing code gets to decide what that binary data will be, and it can make sure that the character encoding it really uses matches the declaration at the start of the file.
You should also clean up your exception handling - don't just print a stack trace and continue on failure, and call close in a finally block instead of at the end of the try block. If you can't genuinely handle an exception, either let it propagate up the stack directly (potentially adding throws clauses to your method) or catch it, log it and then rethrow either the exception or a more appropriate one wrapping the cause.
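Putting those suggestions together, a minimal sketch (assuming JDOM 2's package names and Java 7's try-with-resources; the helper method is just for illustration):

import java.io.FileOutputStream;
import java.io.IOException;
import org.jdom2.Document;
import org.jdom2.output.XMLOutputter;

public static void writeDocument(Document document, String path) throws IOException {
    // try-with-resources closes the stream even if output() throws
    try (FileOutputStream fileOutput = new FileOutputStream(path)) {
        // The outputter decides the bytes, so the actual encoding always
        // matches the encoding named in the XML declaration it writes
        new XMLOutputter().output(document, fileOutput);
    }
    // IOException propagates to the caller instead of being swallowed here
}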

If I remember correctly, you can force your XMLOutputter to use a "pretty" format with:
new XMLOutputter(Format.getPrettyFormat())
so it should work with option I too.
The pretty format is documented as:
Returns a new Format object that performs whitespace beautification
with 2-space indents, uses the UTF-8 encoding, doesn't expand empty
elements, includes the declaration and encoding, and uses the default
entity escape strategy. Tweaks can be made to the returned Format
instance without affecting other instances.
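For instance, sticking with the stream-based approach from the answer above (a sketch; REAL_XML_PATH and document are from the question, and setFormat is JDOM's setter for an existing outputter):

XMLOutputter xmlOutput = new XMLOutputter(Format.getPrettyFormat());
// or, on an already-constructed outputter:
xmlOutput.setFormat(Format.getPrettyFormat());
xmlOutput.output(document, new FileOutputStream(REAL_XML_PATH));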

Related

How to avoid parsing strange characters

While processing an XML file, the StAX parser encountered the following line:
<node id="281224530" lat="48.8975614" lon="8.7055191" version="8" timestamp="2015-06-07T22:47:39Z" changeset="31801740" uid="272351" user="Krte�?ek">
and as you see there is a strange character at the end of the line, and when the parser reaches that line the program stops and gives me the following error:
Exception in thread "main" javax.xml.stream.XMLStreamException: ParseError
at [row,col]:[338019,145]
Message: Ungültiges Byte 2 von 2-Byte-UTF-8-Sequenz. (i.e. "Invalid byte 2 of 2-byte UTF-8 sequence.")
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown
Source)
at com.example.Main.main(Main.java:46)
Is there anything I should change in the Eclipse settings to avoid that error?
Update
code:
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = null;
try {
    parser = factory.createXMLStreamReader(in);
} catch (XMLStreamException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
    Log.d(TAG, "newParser",
            "e/createXMLStreamReader: " + e.getMessage());
}
It is not about Eclipse; it is about the encoding of your file. There are two cases:
1) the file is corrupted, i.e. it contains byte sequences that are invalid in the declared encoding;
2) the file is not in the UTF-8 encoding declared in its XML header.
So you should check that you are reading the file contents appropriately.
If you edited and saved your XML file in Eclipse, this can be a problem if your Eclipse is not configured to use UTF-8. Check this question: How to support UTF-8 encoding in Eclipse
Otherwise you probably don't need to change anything in your code. You just need correctly UTF-8-encoded content.
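If you want to rule out the reading side, a minimal sketch that states the encoding explicitly instead of letting the parser guess (the file name is hypothetical, the file must actually be UTF-8, and exception handling is omitted for brevity):

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

InputStream in = new FileInputStream("map.osm"); // hypothetical file name
XMLInputFactory factory = XMLInputFactory.newInstance();
// The two-argument overload takes the encoding name explicitly
XMLStreamReader parser = factory.createXMLStreamReader(in, "UTF-8");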

Freemarker converting HTML ISO tags when reading ftl file

I am trying to output curly quotes in an HTML file that I am generating in Freemarker. The template file contains:
Kevin&#8217;s
When the HTML file is generated, it comes out as:
Kevin?s
At first I thought that the issue was happening during the generation of the HTML file. But I was able to track down the conversion to when the template was read in. Does anyone know how to prevent Freemarker from doing this conversion when reading the template? My code for the conversion:
// Freemarker configuration object
Configuration cfg = new Configuration(new Version(2, 3, 21));
try
{
    cfg.setDirectoryForTemplateLoading(new File("templates"));
    cfg.setDefaultEncoding("UTF-8");
    cfg.setTemplateExceptionHandler(TemplateExceptionHandler.HTML_DEBUG_HANDLER);

    // Load template from source folder
    Template template = cfg.getTemplate("curly.html");
    template.setEncoding("UTF-8");

    // Build the data-model
    Map<String, Object> data = new HashMap<String, Object>();

    // Console output
    Writer out = new OutputStreamWriter(System.out);
    template.process(data, out);
    out.flush();
}
catch (IOException e)
{
    e.printStackTrace();
}
catch (TemplateException e)
{
    e.printStackTrace();
}
If the template file indeed contains Kevin&#8217;s, then the output would be Kevin&#8217;s too (as FreeMarker doesn't resolve HTML entities), so I suppose you mean that the character with that code is there as one character. In that case, the most probable culprit has nothing to do with FreeMarker: new OutputStreamWriter(System.out). You have omitted the encoding parameter of the constructor there, so it will use the system default encoding. Even if you do specify it, your console has a fixed encoding (which is not necessarily the system default, BTW). So try writing the output to a file, explicitly specifying UTF-8 for the OutputStreamWriter. If the output is still wrong, then check that you have indeed used UTF-8 to create the template file, and when reading the output file.
BTW, that template.setEncoding is not necessary. Remove it.
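For example, a minimal sketch that writes to a file instead of the console (the output file name is hypothetical; template and data are from the code above):

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// State UTF-8 explicitly instead of relying on the platform default
Writer out = new OutputStreamWriter(
        new FileOutputStream("curly-out.html"), StandardCharsets.UTF_8);
template.process(data, out);
out.flush();
out.close();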

Special character encoding (PC8) in Java file writing

I need to write a file with Java in PC8 character encoding. How can a 'custom' character set be applied to a (text) file?
This is what I'm trying to do:
try {
    writer = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream("test.txt"), "utf-8")); // obviously need to change this
    String info = "#TEST \"test åäö\"";
    writer.write(info);
    writer.close();
} catch (Exception e) {
    System.out.println(e);
}
So I need to know whether it is even possible to write in a special character encoding, and what I need to do. Specifying "PC-8" or "PC8" as the encoding did not work.
I found the answer while writing the question itself. Here is the list of supported character encodings for Java, and how to specify them in the code block I provided in the question: http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
So this is what works for me:
writer = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("test.txt"), "ibm-437"));
What caused problems in my case was that Google was littered with questions about UTF-8 character encoding. Furthermore, PC8 is not the official name of the character encoding, so I couldn't find the information I needed under that name. I hope this helps with encoding problems in general.
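A complete sketch of the above (this assumes your JRE ships the IBM437 charset - the standard name closest to "PC8" - since Charset.forName throws if it doesn't):

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

Charset pc8 = Charset.forName("IBM437"); // a.k.a. Cp437, "PC8"
try (Writer writer = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("test.txt"), pc8))) {
    writer.write("#TEST \"test åäö\"");
} catch (IOException e) {
    System.out.println(e);
}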

How to write formatted integers to file in i18n-friendly way in Java?

I'm currently using
try {
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("open_sites_20,txt"), "UTF-8"));
writer.write(String.format("%4d%4d%n", i, j));
writer.close();
} catch (IOException e) {
System.err.print("error: " + e.toString() + "\n");
};
where i, j are integers.
FindBugs reports that the above has the following bad practice
Reliance on default encoding
Found a call to a method which will perform a byte to String (or String to byte) conversion, and will assume that the default platform encoding is suitable. This will cause the application behaviour to vary between platforms. Use an alternative API and specify a charset name or Charset object explicitly.
Any suggestion how this can be improved?
Platform: IntelliJ IDEA 13.1.1 + FindBugs-IDEA 0.9.992.
In this case, FindBugs seems to be wrong. Please keep in mind that its rules are not carved in stone, so it is necessary to apply your own judgment.
As for improving things, there are a few ways you can improve this code. First, let's deal with the character encoding. Since Java 1.4, OutputStreamWriter has had a constructor with the following signature:
OutputStreamWriter(OutputStream stream, Charset charEncoding)
It's better to use this instead of passing the encoding name as a string. Why? Starting with Java 7, you can use the StandardCharsets class to get a Charset instance. Therefore you can write:
new OutputStreamWriter(new FileOutputStream("open_sites_20,txt"), StandardCharsets.UTF_8)
I don't think FindBugs would argue about that.
Another issue I see here is the way you close the writer: in some circumstances, close will never be called. The best way to deal with this (if you are using Java 7 or above) is try-with-resources, letting Java close the streams/writers:
try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("open_sites_20,txt"), StandardCharsets.UTF_8))) {
    writer.write(String.format("%4d%4d%n", i, j));
    // It always makes sense to flush the stream before closing
    writer.flush();
} catch (IOException e) {
    System.err.print("error: " + e.toString() + "\n");
}
Now, all you do is write a single value to a file. If I were you, I would try to avoid all the overhead of creating streams, wrapping them with writers and so on. It's complicated. Fortunately, Java 7 has one fantastic class to help you write things to text files: Files.
The class has two methods that come in handy when you need to write something to a text file:
write(Path path, byte[] bytes, OpenOption... options)
write(Path path, Iterable<? extends CharSequence> lines, Charset cs, OpenOption... options)
The first one can also be used to write binary files. The second can be used to write a collection of strings (array, list, set, ...). Your program could be rewritten as:
try {
    Path outputPath = Paths.get("open_sites_20,txt");
    String nmbrStr = String.format("%4d%4d%n", i, j);
    byte[] outputBytes = nmbrStr.getBytes(StandardCharsets.UTF_8);
    Files.write(outputPath, outputBytes, StandardOpenOption.CREATE);
} catch (IOException ioe) {
    LOGGER.log(Level.SEVERE, "unable to write file", ioe);
}
That's it!
Simple and elegant. What I used here:
Paths
String.getBytes(Charset charEncoding) - I can't guarantee that FindBugs won't complain about this one
StandardOpenOption
Java Logging API - instead of printing exceptions to System.err
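For completeness, a short sketch of the second overload under the same assumptions (each element of the Iterable becomes one line in the file, so the %n is dropped; write() appends the line separator itself):

try {
    Files.write(Paths.get("open_sites_20,txt"),
            Arrays.asList(String.format("%4d%4d", i, j)), // java.util.Arrays
            StandardCharsets.UTF_8,
            StandardOpenOption.CREATE);
} catch (IOException ioe) {
    LOGGER.log(Level.SEVERE, "unable to write file", ioe);
}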

Generating fatal error in Java

Suppose we are writing a Java library which provides some I/O utility functions, for example, a convenient method to read text files as Strings:
public class StringReader {
    private static final Log log = LogFactory.getLog(StringReader.class);

    /**
     * Returns the contents of file <b>fileName</b> as a String.
     * @param fileName file name to read
     * @return null on IO error
     */
    public static String readString(String fileName) {
        FileInputStream fis = null;
        try {
            fis = new FileInputStream(fileName);
            byte[] data = new byte[fis.available()];
            fis.read(data);
            return new String(data, "ISO-8859-1"); // may throw UnsupportedEncodingException!
        } catch (UnsupportedEncodingException e) {
            // must be caught before IOException, since it is a subclass of it
            log.fatal("JRE does not support ISO-8859-1!", e);
            // ???
        } catch (IOException e) {
            log.error("unable to read file", e);
        } finally {
            closeQuiet(fis);
        }
        return null;
    }
}
This code reads a text file into a String using the ISO-8859-1 encoding and returns the String to the user.
The String(byte[], String) constructor throws an UnsupportedEncodingException when the specified encoding is not supported. But, as we know, ISO-8859-1 must be supported by every JRE, as said here (see the Standard charsets section).
Hence, we expect the block
catch (UnsupportedEncodingException e) {
    log.fatal("encoding is unsupported", e);
    // ???
}
is never reached if the JRE distribution conforms to the standard.
But what if it doesn't? How should this exception be handled in the most correct way?
The question is: how do I raise the alarm properly about such an error?
The suggestions so far are:
- Throw some kind of RuntimeException.
- Do not disable the logger in production code; write the exception details to the log and ignore the error.
- Put assert false here, so it produces an AssertionError if the user launched the VM with -ea.
- Throw an AssertionError manually.
- Add UnsupportedEncodingException to the method declaration and let the caller choose. Not very convenient, I think.
- Call System.exit(1).
Thanks.
But what if it doesn't?
Then you're in a really bad situation, and you should probably get out of it as quickly as possible. When a JRE is violating its own promises, what would you want to depend on?
I'd feel happy using AssertionError in this case.
It's important to note that not all unchecked exceptions are treated equally - it's not unusual for code to catch Exception at the top level of the stack, log an error and then keep going... if you just throw RuntimeException, that will be caught by such a scheme. AssertionError would only be caught if the catch block specified Throwable (or specifically Error or AssertionError, but that's much rarer to see). Given how impossible this should be, I think it's reasonable to abort really hard.
Also note that in Java 7, you can use StandardCharsets.ISO_8859_1 instead of the string name, which is cleaner and removes the problem.
There are other things I'd change about your code, by the way:
- I would avoid using available() as far as possible. That tells you how many bytes are available right now - it doesn't necessarily tell you how long the file is.
- I would definitely not assume that read() will read the whole file in one go. Call read() in a loop, ideally until it says there's no more data.
- I would personally accept a Charset as a parameter, rather than hard-coding ISO-8859-1.
- I would let IOException bubble up from the method rather than just returning null. After all, unless you're really going to check the return value of every call for nullity, you're just going to end up with a NullPointerException instead, which is harder to diagnose than the original IOException.
Alternatively, just use Guava's Files.toString(File, Charset) to start with :) (If you're not already using Guava, now is a good time to start...)
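Going back to those bullet points, here's a minimal sketch of readString revised along those lines (the method shape is mine, not a fixed API): the caller supplies the Charset, IOException bubbles up, and read() is called in a loop instead of trusting available():

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

public static String readString(String fileName, Charset charset) throws IOException {
    try (InputStream in = new FileInputStream(fileName)) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // Keep reading until read() reports end of stream
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }
}
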
This is a rather common occurrence in code.
Unchecked exceptions are made for this. They shouldn't happen (which is why they are unchecked), but if they do, there is still an exception.
So, throw a RuntimeException that has the original Exception as the cause.
catch (UnsupportedEncodingException e) {
    throw new RuntimeException(e); // should not happen
}
assert false; also throws an unchecked throwable (an AssertionError), but assertions can be turned off, so I would recommend the RuntimeException.
