I have a problem writing an XML file with UTF-8 encoding in Java.
Problem: I have a file whose filename contains an interpunct (middot, ·). When I try to write the filename inside an XML tag using Java code, I get some junk characters in the filename instead of ·.
OutputStreamWriter osw = new OutputStreamWriter(file_output_stream, "UTF8");
Above is the Java code I used to write the XML file. Can anybody help me understand and sort out the problem? Thanks in advance.
Java strings are UTF-16 internally, but the compiler reads source files using the platform default encoding.
If your character cannot be represented in your source file's encoding, use a Unicode escape:
String a = "\u00b7";
Or tell your compiler to use UTF-8 (javac -encoding UTF-8) and simply write the character into the code as-is.
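A minimal, self-contained sketch of both points (the filename here is made up): the middot is written via its Unicode escape, and the writer is given an explicit UTF-8 charset so the output bytes do not depend on the platform default.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class MiddotDemo {
    public static void main(String[] args) throws Exception {
        // "\u00b7" is the Unicode escape for the interpunct (middot).
        String filename = "my\u00b7file.txt";

        // Write through a writer with an explicit UTF-8 charset,
        // so the bytes produced are the same on every platform.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (OutputStreamWriter w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write("<file>" + filename + "</file>");
        }

        // In UTF-8 the middot is encoded as the two bytes 0xC2 0xB7.
        String xml = new String(out.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(xml); // <file>my·file.txt</file>
    }
}
```

If you decode those bytes with the wrong charset (e.g. the platform default on Windows), that is exactly when the "junk characters" from the question appear.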
That character is Unicode code point 183 (decimal), which is outside ASCII, so you need to escape it as &#183;. Here is a demonstration: if I type &#183; into this answer, I get "·".
The browser is printing your character because this web page is HTML, and the browser resolves the character reference for you.
There are utility methods that can do this for you, such as the Apache commons-lang library's StringEscapeUtils.escapeXml() method, which will correctly and safely escape the entire input.
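For illustration only, here is a minimal hand-rolled sketch of the idea (in real code the library method above is preferable, since it handles many more cases): replace XML-special characters and anything outside printable ASCII with a numeric character reference.

```java
public class XmlEscapeDemo {
    // Simplified escaper: emits a numeric character reference for
    // XML-special characters and anything above printable ASCII.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '<' || c == '>' || c == '&' || c == '"' || c > 126) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("my\u00b7file & more")); // my&#183;file &#38; more
    }
}
```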
In general it is a good idea to use UTF-8 everywhere.
The editor has to know that the source is in UTF-8. You could use the free programmer's editor jEdit, which can deal with many encodings.
The javac compiler has to know that the Java source is in UTF-8. In the code itself you can use the escape solution of @OndraŽižka.
This makes for two settings in your IDE.
Don't try to create XML by hand. Use a library for the purpose. You are just scratching the surface of the heap of special cases that will break a hand-made solution.
One way, using core Java classes, is to create a DOM, then serialize it using a no-op (identity) XSL transform that writes to a StreamResult. (If your document is large, you can do something similar by driving a SAX event handler.)
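A minimal sketch of that core-Java route (the element and file names here are made up): build a small DOM and let an identity Transformer serialize it, so special characters and the encoding declaration are handled for you.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomWriteDemo {
    public static void main(String[] args) throws Exception {
        // Build a small DOM; the serializer will escape what needs escaping.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("file");
        root.setTextContent("my\u00b7file.txt");
        doc.appendChild(root);

        // An identity transform (no stylesheet) serializes the DOM.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out);
    }
}
```

In real code you would hand the StreamResult your FileOutputStream directly rather than a StringWriter, so the declared encoding and the actual bytes cannot disagree.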
There are many third party libraries that will help you do the same thing very easily.
Related
I have a text file, and I need to convert all of its data into XML format to make it more readable.
Text file
How can I convert it into XML format?
Is there any Java library, or any other way I can do it?
Your question is rather vague (and you could probably find the answer yourself with just a little research), but I'll give you a hint.
Your sample appears to be an INI file (as traditionally used for configuration files on Windows & DOS). So, look for an "INI file parser." If you can't find one, you should be able to write a simple parser yourself using regular expressions. It's a simple file format, consisting of section headings like [SectionTitle] and data fields like Key=Value. That's all.
As for generating XML ... it shouldn't be hard, but "xml format" is not a useful description. Can you be more specific? E.g., what will the XML be used for?
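To make the hint concrete, here is a rough sketch under the INI assumption (the section and key names are invented, and it assumes the keys happen to be valid XML element names; real code would also escape values and read the actual file):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IniToXmlDemo {
    public static void main(String[] args) {
        // Hypothetical sample input; real code would read the text file instead.
        String ini = "[Database]\nhost=localhost\nport=5432\n";

        Pattern section = Pattern.compile("\\[(.+)\\]");
        Pattern entry = Pattern.compile("(.+?)=(.*)");

        StringBuilder xml = new StringBuilder("<config>\n");
        boolean open = false;
        for (String line : ini.split("\\R")) {
            Matcher s = section.matcher(line);
            Matcher e = entry.matcher(line);
            if (s.matches()) {
                if (open) xml.append("  </section>\n");
                xml.append("  <section name=\"").append(s.group(1)).append("\">\n");
                open = true;
            } else if (e.matches()) {
                // Assumes the key is a valid XML element name.
                xml.append("    <").append(e.group(1)).append(">")
                   .append(e.group(2))
                   .append("</").append(e.group(1)).append(">\n");
            }
        }
        if (open) xml.append("  </section>\n");
        xml.append("</config>");
        System.out.println(xml);
    }
}
```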
Try this: http://www.smooks.org/mediawiki/index.php?title=Main_Page. I've used it and it's great.
A more sophisticated solution would be to use Mule Data Mapper. On the server side, obviously.
This is the beginning: I have a file on disk which is an HTML page. When I open it with a regular web browser it displays as it should, i.e. no matter what encoding is used, I see the correct national characters.
Then I come along; my task is to load the same file, parse it, and print out some pieces on the screen (console), let's say all <hX> texts. Of course I would like to see only correct characters, not some mumbo-jumbo. The last step is changing some of the text and saving the file.
So the parser has to parse and handle the encoding correctly in both directions as well. So far I am unaware of any parser that is even capable of loading the data correctly.
Question
What parser would you recommend?
Details
An HTML page in general has its encoding given in the header (in a meta tag), so the parser should use it. A scenario where I have to look at the file in advance, check the encoding, and then manually set it in code is a no-go. For example, this is taken from the JSoup tutorials:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
I cannot do such a thing; the parser has to handle encoding detection by itself.
In C# I faced a similar problem when loading HTML. I used HTMLAgilityPack and first executed encoding detection, then used the detected encoding to decode the data stream, and after that I parsed the data. So I did both steps explicitly, but since the library delivers both methods, that is fine with me.
Such an explicit separation might even be better, because it would make it possible to fall back on a probabilistic encoding-detection method when the header is missing.
The Jsoup API reference says for that parse method that if you provide null as the second argument (the encoding one), it'll use the http-equiv meta-tag to determine the encoding. So it looks like it already does the "parse a bit, determine encoding, re-parse with proper encoding" routine. Normally such parsers should be capable of resolving the encoding themselves using any means available to them. I know that SAX parsers in Java are supposed to use byte-order marks and the XML declaration to try and establish an encoding.
Apparently Jsoup will default to UTF-8 if no proper meta-tag is found. As they say in the documentation, this is "usually safe" since UTF-8 is compatible with a host of common encodings for the lower code points. But I take it that "usually safe" might not really be good enough in this case.
If you don't sufficiently trust Jsoup to detect the encoding, I see two alternatives:
Should you somehow be certain that the HTML is always in fact XHTML, then an XML parser might prove a better fit. But that would only work if the input is definitely XML-compliant.
Do a heuristic encoding detection yourself by trying to use byte-order marks, parsing a portion using common encodings and finding a meta-tag, detecting the encoding by byte patterns you'd expect in header tags and finally, all else failing, use a default.
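A sketch of the byte-order-mark step of such a heuristic (the detection of meta tags and byte patterns would layer on top of this; the fallback charset here is an arbitrary choice):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffDemo {
    // Inspect the first bytes for a byte-order mark,
    // falling back to a default when none is present.
    static Charset sniff(byte[] b, Charset fallback) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return StandardCharsets.UTF_8;
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return StandardCharsets.UTF_16BE;
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return StandardCharsets.UTF_16LE;
        return fallback;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] noBom = "hi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(withBom, StandardCharsets.ISO_8859_1)); // UTF-8
        System.out.println(sniff(noBom, StandardCharsets.ISO_8859_1));   // ISO-8859-1
    }
}
```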
I have to convert a .log file into a nice and pretty HTML file with tables. Right now I just want to get the HTML header down. My current method is to println to file every single line of the HTML file. for example
p.println("<html>");
p.println("<script>");
etc. There has to be a simpler way, right?
How about using a JSP scriptlet and JSTL? You could create some custom object which holds all the important information and display it, formatted, using the Expression Language.
Printing raw HTML text as strings is probably the "easiest" (most straightforward) way to do what you're asking but it has its drawbacks (e.g. properly escaping the content text).
You could use the DOM (e.g. Document et al) interface provided by Java but that would hardly be "easy". Perhaps there are "DOM builder" type tools/libraries for Java that would simplify this task for you; I suggest looking at dom4j.
Look at this Java HTML Generator library (easy to use). It should make generating the actual HTML much clearer. There are complications when creating HTML with Java Strings (what happens if you want to change something like a rowspan?) that can be avoided with this library, especially when dealing with tables.
There are many templating engines available. Have a look at https://stackoverflow.com/questions/174204/suggestions-for-a-java-based-templating-engine
This way you can define a template in a txt file and have the java code fill in the variables.
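As a toy illustration of the template idea (the placeholder syntax and values are invented; a real engine also handles escaping, loops, and conditionals):

```java
public class TemplateDemo {
    public static void main(String[] args) {
        // Hypothetical template that would normally live in a .txt file.
        String template = "<html><body><h1>${title}</h1><p>${body}</p></body></html>";

        // Fill in the variables from the Java code.
        String page = template
                .replace("${title}", "Log report")
                .replace("${body}", "42 entries processed");
        System.out.println(page);
    }
}
```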
We have a process that outputs the contents of a large XML file to System.out.
When this output is pretty-printed (i.e. split over multiple lines) everything works. But when it's all on one line, Eclipse crashes with an OutOfMemoryError. Any ideas how to prevent this?
Sounds like it is the Console panel blowing up. Consider limiting its buffer size.
EDIT: It's in Preferences. Search for Console.
How do you print it on one line?
using several System.out.print(String s)
using System.out.println(String verybigstring)
in the second case, you need a lot more memory...
If you want more memory for Eclipse, you could try increasing it by changing the -Xmx value in eclipse.ini.
I'm going to assume that you're building an org.w3c.dom.Document and writing it using a serializer. If you're hand-building an XML string, you're all but guaranteed to be producing something that's almost-but-not-quite XML, and I strongly suggest fixing that first.
That said, if you're writing to a stream from the serializer (and System.out is a stream), then you should be writing directly to the stream rather than writing to a string and printing that (which you'd do with a StringWriter). The reason is that the XML serializer will properly handle character encodings, while the serializer-to-String-to-stream route may not.
If you're not currently building a DOM, and are concerned about the memory requirements of doing so, then I suggest looking at the Practical XML library (which I maintain), in particular the builder package. It uses lightweight nodes, that are then output via a serializer using a SAX transform.
Edit in response to comment:
OK, you've got the serializer covered with XStream. I'm next going to assume that you are calling XStream.toXML(Object) to produce the string, and recommend that you call the variant toXML(Object, OutputStream), and pass it the actual output. The reason for this is that XML is very sensitive to character encoding, which is something that often breaks when converting strings to streams.
This may, of course, cause issues with building your POST request, particularly if you're using a library that doesn't provide you an OutputStream.
I've written a little application that does some text manipulation and writes the output to a file (HTML, CSV, DOCX, XML), and this all appears to work fine on Mac OS X. On Windows, however, I seem to get character-encoding problems: a lot of '"' characters disappear and are replaced with some weird stuff, usually the closing '"' of a pair.
I use FreeMarker to create my output files, and there is a byte[] array, and in one case also a ByteArrayOutputStream, between reading the templates and writing the output. I assume this is a character-encoding problem, so could someone give me advice or point me to a 'best practice' resource for dealing with character encodings in Java?
Thanks
There's really only one best practice: be aware that Strings and bytes are two fundamentally different things, and that whenever you convert between them, you are using a character encoding (either implicitly or explicitly), which you need to pay attention to.
Typical problematic spots in the Java API are:
new String(byte[])
String.getBytes()
FileReader, FileWriter
All of these implicitly use the platform default encoding, which depends on the OS and the user's locale settings. Usually, it's a good idea to avoid this and explicitly declare an encoding in the above cases (which FileReader/Writer unfortunately don't allow, so you have to use an InputStreamReader/Writer).
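The explicit variants of those problem spots look like this (the sample text is made up; note the InputStreamReader wrapping, which is the workaround for FileReader not taking a charset):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingDemo {
    public static void main(String[] args) throws Exception {
        String text = "na\u00efve caf\u00e9";

        // Explicit charset in both directions; never rely on the platform default.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        String roundTripped = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(roundTripped.equals(text)); // true

        // Instead of FileReader, wrap a stream in an InputStreamReader
        // with an explicit charset (here fed from a byte array for the demo).
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
        System.out.println(reader.readLine());
    }
}
```

Decoding the same bytes with, say, ISO-8859-1 instead would silently turn each accented character into two garbage characters, which is exactly the class of bug described in the question.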
However, your problems with the quotation marks and your use of a template engine may have a much simpler explanation. What program are you using to write your templates? It sounds like it's one that inserts "smart quotes", which are part of the Windows-specific windows-1252 encoding but don't exist in the more global ISO-8859-1 encoding.
What you probably need to do is be aware which encoding your templates are saved in, and configure your template engine to use that encoding when reading in the templates. Also be aware that some text files, specifically XML, explicitly declare the encoding in a header, and if that header disagrees with the actual encoding used by the file, you'll invariably run into problems.
You can control which encoding your JVM will run with by supplying, for example,
-Dfile.encoding=utf-8
(for UTF-8, of course) as an argument to the JVM. Then you should get predictable results on all platforms. Example:
java -Dfile.encoding=utf-8 my.MainClass
Running the JVM with a 'standard' encoding via the confusingly named -Dfile.encoding will resolve a lot of problems.
Ensuring your app doesn't make use of byte[] <-> String conversions without encoding specified is important, since sometimes you can't enforce the VM encoding (e.g. if you have an app server used by multiple applications)
If you're confused by the whole encoding issue, or want to revise your knowledge, Joel Spolsky wrote a great article on this.
I had to make sure that the OutputStreamWriter uses the correct encoding:
OutputStream out = ...
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
template.process(model, writer);
Plus if you use a ByteArrayOutputStream also make sure to call toString with the correct encoding:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
...
baos.toString("UTF-8");