in XML making it unparseable - java

So I have a value in my database which has a non breaking space in the form   in it. I have a legacy service which reads this string from the database and creates an XML using this string. The issue I am facing is that the XML returned for this message is un-parseable. When I open it in notepad++ I see the character xA0 in the place of the non breaking space, and on removing this character the XML becomes parseable. Furthermore I have older revisions of this XML file from the same service which have the character "Â " in place of the non breaking space. I recently changed the tomcat server on which the service was running, and something has gone wrong because of it. I found this post according to which my XML is encoded to ISO-8859-1; but the code which I use to convert the XML to string does not use ISO-8859-1;. Below is my code
private String nodeToString(Node node) {
StringWriter sw = new StringWriter();
try {
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
t.transform(new DOMSource(node), new StreamResult(sw));
} catch (TransformerException te) {
LOG.error("Exception during String to XML transformation ", te);
}
return sw.toString();
}
I want to know why is my XML un-parseable and why is there a "Â " in the older revisions of the XML file.
Here is the image of the problematic character in notepad++
image in notepad++
Also when I open my XML in notepad and try to save it I see the encoding type is ANSI, when I change it to UTF-8 and then save it the XML becomes parseable.
New Info - Enforcing UTF-8 with transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); did not work I am still getting the xA0 in my XML.

The issue was that my version of java was somehow saving my file in ANSI file format. I saw this when I opened my file in notepad, and tried to save it. The older files were in UTF-8 format. So all I did was specify UTF-8 encoding while writing my file.
Writer out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(fileName.trim()), StandardCharsets.UTF_8));
try {
out.write(data);
} finally {
out.close();
}

Related

How to change text encoding to utf-8 while using apache tika text parsing (most specifically for .txt files)

I'm using apache tika for text extraction. It was working fine over almost all filetypes unless I tried testing it over a Chinese machine with a .txt document written in Chinese. I did not save the file in utf-8 encoding format. Tika started parsing wrong string characters. This seems to be an encoding issue, I tried setting encoding type like this
metadata.add(Metadata.CONTENT_ENCODING, "UTF_8")
still no luck. I've seen some methods in java that convert text from one encoding type to another but only if the source encoding type is known. In my case, I'm not sure about the client's encoding type and can't force him to use utf-8. kindly help me with this!!
Thanks in advance:)
I had the same issue but when converting Powerpoint to text and I found out that by using the correct OutputStream which you can specify the encoding, the encoding is working well.
The metadata you try to add changes nothing for the conversion but just add the line in the headers of the html file.
Here is my code:
public String tranformPowerpointToText(File file) throws IOException, TikaException {
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
ToTextContentHandler toTextContentHandler= new ToTextContentHandler(byteArrayOutputStream, "UTF-8");
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
try (InputStream stream = new FileInputStream(file)) {
parser.parse(stream, toTextContentHandler, metadata);
return byteArrayOutputStream.toString();
} catch (SAXException e) {
e.printStackTrace();
}
}

DOM4J utf-8 encoding Umlaute(Ä,ü,ß) incorrectly

I'm using DOM4j for parsing and writing an XML-Tree which is always in UTF-8.
My XML file includes German Special-Characters. Parsing them is not a problem, but when I'm writing the tree to a file, the special characters are getting converted to � characters.
I can't change the encoding of the XML file as it is restricted to UTF-8.
Code
SAXReader xmlReader = new SAXReader();
xmlReader.setEncoding("UTF-8");
Document doc = xmlReader.read(file);
doc.setXMLEncoding("UTF-8");
Element root = doc.getRootElement();
// manipulate doc
OutputFormat format = new OutputFormat();
format.setEncoding("UTF-8");
XMLWriter writer = new XMLWriter(new FileWriter(file), format);
writer.write(doc);
writer.close();
Expected output
...
<statementText>This is a test!Ä Ü ß</statementText>
...
Actual output
...
<statementText>This is a test!� � �</statementText>
...
You are passing a FileWriter to the XMLWriter. A Writer already handles String or char[] data, so it already handles the encoding, which means the XMLWriter has no chance of influencing it.
Additionally FileWriter is an especially problematic Writer type, since you can never specify which encoding it should use, instead it always uses the platform default encoding (which is often something like ISO-8859-1 on Windows and UTF-8 on Linux). It should basically never be used for this reason.
To let the XMLWriter apply what it is given as configuration pass it an OutputStream instead (which handles byte[]). The most obvious one to use here would be FileOutputStream:
XMLWriter writer = new XMLWriter(new FileOutputStream(file), format);
This is even documented in the JavaDoc for XMLWriter:
Warning: using your own Writer may cause the writer's preferred character encoding to be ignored. If you use encodings other than UTF8, we recommend using the method that takes an OutputStream instead.
Arguably the warning is a bit misleading, as the Writer can be problematic even if you intend to write UTF-8 data.

How to avoid parsing strange characters

While I am processing XML file, the Stax parser encountered the following line:
<node id="281224530" lat="48.8975614" lon="8.7055191" version="8" timestamp="2015-06-07T22:47:39Z" changeset="31801740" uid="272351" user="Krte�?ek">
and as you see there is a strange character at the end of the line, and when the parser reaches that line the program stops and gives me the following error:
Exception in thread "main" javax.xml.stream.XMLStreamException: ParseError
at [row,col]:[338019,145]
Message: Ungültiges Byte 2 von 2-Byte-UTF-8-Sequenz.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown
Source)
at com.example.Main.main(Main.java:46)
Is there any thing I should change in the settings of Eclipse to avoid that error?
Update
code:
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = null;
try {
parser = factory.createXMLStreamReader(in);
} catch (XMLStreamException e) {
// TODO Auto-generated catch block
e.printStackTrace();
Log.d(TAG, "newParser",
"e/createXMLStreamReader: " + e.getMessage());
}
It is not about eclipse, but it is about encoding of your file. There are two cases:
1) file is corrupted, i.e. it contains incorrect symbols, not from defined encoding
2) file is not in utf-8 encoding and it is defined in xml header. So you should check, that you are reading file contents appropriately.
If you edited and saved your XML file in eclipse, this can be a problem in case your eclipse is not configured to use UTF-8. Check this question: How to support UTF-8 encoding in Eclipse
Otherwise you probably don't need to do anything about your code. You just need a correctly UTF-8-encoded content.

Troubles with XML encoding in Java

I have a problem with XML encoding.
When i created XML on localhost with cp1251 encoding all cool
But when i deploy my module on server, xml file have incorrect symbols like "ФайлПФР"
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
DOMSource source = new DOMSource(doc);
transformer.setOutputProperty(OutputKeys.ENCODING, "cp1251");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.transform(source, result);
String attach = writer.toString();
How i can fix it?
I tried to read an XML Document which was UTF-8 encoded, and attempted to transform it with a different encoding, which had no effect at all (the existing encoding of the document was used instead of the one I specified with the output property). When creating a new Document in memory (encoding is null), the output property was used correctly.
Looks like when transforming an XML Document, the output property OutputKeys.ENCODING is only used when the org.w3c.dom.Document does not have an encoding yet.
Solution
To change the encoding of a XML Document, don't use the Document as the source, but its root node (the document element) instead.
// use doc.getDocumentElement() instead of doc
DOMSource source = new DOMSource(doc.getDocumentElement());
Works like a charm.
Source document:
<?xml version="1.0" encoding="UTF-8"?>
<foo bla="Grüezi">
Encoding test äöüÄÖÜ «Test»
</foo>
Output with "cp1251":
<?xml version="1.0" encoding="WINDOWS-1251"?><foo bla="Grüezi">
Encoding test äöüÄÖÜ «Test»
</foo>
A (String)Writer will not be influenced from an output encoding (only from the used input encoding), as Java maintains all text in Unicode. Either write to binary, or output the string as Cp1251.
Note that the encoding should be in the <?xml encoding="Windows-1251"> line. And I guess "Cp1251" is a bit more java specific.
So the error probably lies in the writing of the string; for instance
response.setCharacterEncoding("Windows-1251");
response.write(attach);
Or
attach.getBytes("Windows-1251")

Use FileOutputStream to Create a UTF-8 PDF File

I am using JasperReports and DynamicReports with this piece of java code to create a report in pdf format which contains utf-8 characters, the problem is generated pdf file does not contain utf-8 characters at all, like if they have been replaced with "". is there any thing that i should be aware of when using OutputStream to create a utf-8 file?
public void toPdf(String path){
OutputStream outHtml;
try {
outHtml = new FileOutputStream(path);
jasperBuilder.toPdf(outHtml);
} catch (Exception e1) {
logger.error("failed to create PDF", e1);
}
}
this may be notable that creating XLS and HTML file faces no such problem.
note that there are lots of lines of code under jasperBuilder.toPdf(outHtml); that i have traced and no where in those lines my utf-8 characters are being eliminated. so i guess the devil is in outHtml = new FileOutputStream(path);
I managed to solve it. It was a font and encoding problem. Just followed tutorial here, but change <pdfEncoding>UTF-8</pdfEncoding> to <pdfEncoding>Identity-H</pdfEncoding> in fonts.xml
<fontFamilies>
<fontFamily name="FreeUniversal">
<normal>/home/moien/tahoma.ttf</normal>
<bold>/home/moien/tahoma.ttf</bold>
<italic>/home/moien/tahoma.ttf</italic>
<boldItalic>/home/moien/tahoma.ttf</boldItalic>
<pdfEncoding>Identity-H</pdfEncoding>
<pdfEmbedded>true</pdfEmbedded>
</fontFamily>
</fontFamilies>
Now I have another challenge to solve, making font URL relative!
A FileOutputStream is completely agnostic of the "stuff" that gets written to it. It just writes bytes. If characters are being eliminated or mangled, then this is being caused by whatever is generating the bytes to be written to the stream.
In this case, my money would be on the way that you have configured / used the jasperBuilder object prior to running this code.

Categories

Resources