How to avoid parsing strange characters - Java

While processing an XML file, the StAX parser encountered the following line:
<node id="281224530" lat="48.8975614" lon="8.7055191" version="8" timestamp="2015-06-07T22:47:39Z" changeset="31801740" uid="272351" user="Krte�?ek">
and as you can see there is a strange character near the end of the line. When the parser reaches that line, the program stops and gives me the following error:
Exception in thread "main" javax.xml.stream.XMLStreamException: ParseError
at [row,col]:[338019,145]
Message: Ungültiges Byte 2 von 2-Byte-UTF-8-Sequenz.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown
Source)
at com.example.Main.main(Main.java:46)
Is there anything I should change in Eclipse's settings to avoid this error?
Update
code:
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = null;
try {
    parser = factory.createXMLStreamReader(in);
} catch (XMLStreamException e) {
    e.printStackTrace();
    Log.d(TAG, "newParser",
            "e/createXMLStreamReader: " + e.getMessage());
}

This is not about Eclipse; it is about the encoding of your file. (The German message means "Invalid byte 2 of 2-byte UTF-8 sequence".) There are two possibilities:
1) the file is corrupted, i.e. it contains bytes that are not valid in the declared encoding;
2) the file is not actually in the UTF-8 encoding declared in its XML header. So you should check that you are reading the file contents appropriately.

If you edited and saved your XML file in Eclipse, the problem may be that Eclipse is not configured to use UTF-8. Check this question: How to support UTF-8 encoding in Eclipse
Otherwise you probably don't need to change anything in your code. You just need correctly UTF-8-encoded content.
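To illustrate the point, here is a minimal, self-contained sketch (file name and XML content are made up; "Krteček" stands in for the garbled user attribute). The file is written through an explicit UTF-8 encoder so the bytes on disk match the declaration, and the raw InputStream is handed to the StAX parser, which reads the declared encoding itself:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxUtf8Demo {
    public static void main(String[] args) throws Exception {
        // \u010D is č; the escape keeps this source file ASCII-safe.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<node user=\"Krte\u010Dek\"/>";

        // Write the file with an explicit UTF-8 encoder so the bytes match the declaration.
        File f = File.createTempFile("demo", ".xml");
        try (Writer w = new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8)) {
            w.write(xml);
        }

        // Hand the parser the raw stream; it picks up the declared encoding itself.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader parser = factory.createXMLStreamReader(new FileInputStream(f));
        while (parser.hasNext()) {
            if (parser.next() == XMLStreamConstants.START_ELEMENT) {
                // 7 characters: the two-byte UTF-8 sequence decodes back to one char.
                System.out.println(parser.getAttributeValue(null, "user").length());
            }
        }
        parser.close();
        f.delete();
    }
}
```

If the bytes had been written with a mismatching single-byte charset instead, the same `createXMLStreamReader` call would fail with the "Invalid byte ... of ...-byte UTF-8 sequence" error from the question.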


Reading UTF-16 XML files with JCabi Java

I found this jcabi snippet, which works well with UTF-8-encoded XML files; it reads the XML file and then prints it as a string.
XML xml;
try {
    xml = new XMLDocument(new File("test8.xml"));
    String xmlString = xml.toString();
    System.out.println(xmlString);
} catch (FileNotFoundException e1) {
    e1.printStackTrace();
}
However, when I run this same code on a UTF-16-encoded XML file, it gives me the following error:
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "AWT-EventQueue-0" java.lang.IllegalArgumentException: Can't parse, most probably the XML is invalid
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
I have read about this error: it means the parser is not recognizing the prolog because, due to the encoding, it sees characters that are not supposed to be there.
I have tried other libraries that offer a way to tell the parser which encoding the source file uses, but the only library I was able to get working to some degree was jcabi, and I could not find a way to tell it that my source file is encoded in UTF-16.
Thanks, any help is appreciated.
The jcabi XMLDocument has various constructors, including one that takes a String. So one approach is:
Path path = Paths.get("test16_LE_with_bom.xml");
XML xml = new XMLDocument(Files.readString(path, StandardCharsets.UTF_16LE));
String xmlString = xml.toString();
System.out.println(xmlString);
This makes use of java.nio.charset.StandardCharsets and java.nio.file.Files.
In my first test, my XML file was encoded as UTF-16-LE (and with a BOM at the start: FF FE for little-endian). The above approach handled the BOM OK.
My test file's prolog is as follows (with no explicit encoding - maybe that's a bad thing, here?):
<?xml version="1.0"?>
In my second test I removed the BOM and re-ran with the updated file - which also worked.
I used Notepad++ and a hex editor to verify/select encodings & to edit the test files.
Your file may be different from my test files (BE vs. LE).
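If you're not sure which variant you have, you can sniff the byte-order mark yourself before choosing the charset. A small sketch (the class and method names are made up, and falling back to LE when there is no BOM is just an assumption mirroring the test above):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniff {
    // Pick a UTF-16 variant from the byte-order mark, if present.
    static Charset detectUtf16(byte[] b) {
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE; // FF FE: little-endian
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE; // FE FF: big-endian
        }
        return StandardCharsets.UTF_16LE; // no BOM: assume LE, as in the test above
    }

    public static void main(String[] args) {
        byte[] le = {(byte) 0xFF, (byte) 0xFE, '<', 0};
        byte[] be = {(byte) 0xFE, (byte) 0xFF, 0, '<'};
        System.out.println(detectUtf16(le));
        System.out.println(detectUtf16(be));
    }
}
```

In a real reader you would pass the detected charset to Files.readString (and possibly strip a leading U+FEFF from the decoded string, although the test above suggests the parser tolerates it).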

Non-breaking space in XML making it unparseable

So I have a value in my database which has a non-breaking space in it. I have a legacy service which reads this string from the database and creates an XML document from it. The issue I am facing is that the XML returned for this message is unparseable.

When I open it in Notepad++ I see the character xA0 in place of the non-breaking space, and on removing this character the XML becomes parseable. Furthermore, I have older revisions of this XML file from the same service which have the characters "Â " in place of the non-breaking space.

I recently changed the Tomcat server on which the service was running, and something has gone wrong because of it. I found a post according to which my XML is encoded in ISO-8859-1, but the code I use to convert the XML to a string does not use ISO-8859-1. Below is my code:
private String nodeToString(Node node) {
    StringWriter sw = new StringWriter();
    try {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        t.transform(new DOMSource(node), new StreamResult(sw));
    } catch (TransformerException te) {
        LOG.error("Exception during String to XML transformation ", te);
    }
    return sw.toString();
}
I want to know why my XML is unparseable and why there is an "Â " in the older revisions of the XML file.
Here is the image of the problematic character in Notepad++:
image in notepad++
Also, when I open my XML in Notepad and try to save it, I see the encoding type is ANSI; when I change it to UTF-8 and then save, the XML becomes parseable.
New info: enforcing UTF-8 with transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); did not work. I am still getting the xA0 in my XML.
The issue was that my version of Java was somehow saving my file in ANSI format. I saw this when I opened the file in Notepad and tried to save it; the older files were in UTF-8 format. So all I did was specify UTF-8 encoding while writing my file:
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(fileName.trim()), StandardCharsets.UTF_8));
try {
    out.write(data);
} finally {
    out.close();
}
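As an aside, the "Â " in the older revisions is classic mojibake, and the lone xA0 follows from the same mechanism. A small sketch of both effects (character values are printed as numbers so the output is unambiguous):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String nbsp = "\u00A0"; // the non-breaking space

        // UTF-8 encodes U+00A0 as two bytes, 0xC2 0xA0. Reading them back as a
        // single-byte charset turns 0xC2 into 'Â' (U+00C2) followed by the NBSP -
        // the "Â " seen in the older revisions.
        byte[] utf8 = nbsp.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println((int) misread.charAt(0)); // 194 == 0xC2, i.e. 'Â'

        // Writing U+00A0 with a single-byte ANSI charset produces the lone 0xA0
        // byte, which by itself is an invalid sequence in UTF-8 - hence the
        // unparseable XML.
        byte[] ansi = nbsp.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Integer.toHexString(ansi[0] & 0xFF)); // a0
    }
}
```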

Preflight validate() reports invalid, but the PDF is valid in the console

Currently I'm decoding a Base64 file in the console:
base64 -di "myfile.txt" > mypdf.pdf
which returns a valid PDF file.
But when I try this
DataSource dataSource = new ByteArrayDataSource(
        new ByteArrayInputStream(Base64.getDecoder().decode(pdf.getEncodedContent())));
PreflightParser parser = new PreflightParser(dataSource);
parser.parse();
try (PreflightDocument document = parser.getPreflightDocument()) {
    document.validate();
    return !document.isEncrypted();
} catch (ValidationException ex) {
    return false;
}
I always get a ValidationException (the PDF is not valid).
I think I need to change the configuration. I've already tried the following, but it doesn't seem to help:
PreflightConfiguration config = document.getContext().getConfig();
config.setLazyValidation(true);
Stacktrace:
test.pdf is not valid: Unable to parse font metadata due to : Excepted xpacket 'end' attribute (must be present and placed in first)
I've solved this. For those who are interested:
The validation worked perfectly; the PDF files were not correct, even though readers and browsers could open them without showing any warnings or error messages.
Open your PDFs in a text or hex editor and check at least whether the first two lines and the last line look like the PDF defaults:
%PDF-1.7
%µµµµ
...
%%EOF
If not, the PDF was generated incorrectly and validation will fail.
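A quick way to automate that check is to look for the "%PDF-" header and "%%EOF" trailer in the raw bytes. This is only a sketch (class and method names are made up, and real PDFs may have junk before the header or after the trailer, so treat a failure as a hint rather than proof):

```java
import java.nio.charset.StandardCharsets;

public class PdfMarkerCheck {
    // Rough sanity check: does the data start with "%PDF-" and end near "%%EOF"?
    static boolean looksLikePdf(byte[] data) {
        String head = new String(data, 0, Math.min(8, data.length),
                StandardCharsets.ISO_8859_1);
        String tail = new String(data, Math.max(0, data.length - 32),
                Math.min(32, data.length), StandardCharsets.ISO_8859_1);
        return head.startsWith("%PDF-") && tail.contains("%%EOF");
    }

    public static void main(String[] args) {
        byte[] good = "%PDF-1.7\nobjects...\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] bad = "<html>not a pdf</html>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikePdf(good)); // true
        System.out.println(looksLikePdf(bad));  // false
    }
}
```

ISO-8859-1 is used for the byte-to-string conversion because it maps every byte value one-to-one, so binary content cannot corrupt the marker search.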

Jena - Writing to OWL file - Unexpected result

I created a file system that stores metadata of files and folders in an OWL file.
For the file system I am using the Java binding of FUSE, i.e. FUSE-JNA.
For OWL I am using Jena.
Initially my file system runs fine with no errors, but after some time my program stops reading the .owl file and throws errors. One of the errors I get while reading the .owl file is below:
SEVERE: Exception thrown: org.apache.jena.riot.RiotException: [line: 476, col: 52] The value of attribute "rdf:about" associated with an element type "File" must not contain the '<' character.
org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:136)
org.apache.jena.riot.lang.LangRDFXML$ErrorHandlerBridge.fatalError(LangRDFXML.java:252)
com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:48)
com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:209)
com.hp.hpl.jena.rdf.arp.impl.XMLHandler.fatalError(XMLHandler.java:239)
org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
...
When I open my .owl file, I find that Jena is not writing it correctly. In the picture below, the error highlighted in blue as number 3 is incomplete; some code is missing there.
Secondly, the error highlighted in blue as number 2 is also written wrongly. In my ontology it is a property of File; it should be written like the code highlighted in blue as number 1.
Both the number 1 and number 2 code were written by Jena. Most of the OWL code is written correctly by Jena, similar to number 1, but sometimes Jena writes it wrongly, similar to number 2 in the picture. I do not know why.
This is how I am writing to the .owl file using the Jena API:
public void setDataTypeProperty(String resourceURI, String propertyName, String propertyValue)
// Create a new datatype property. Accepts three arguments: the URI of the resource,
// the property name (e.g. #hasPath) and the property value, all as strings.
{
    Model model = ModelFactory.createDefaultModel();
    // Read model from file
    InputStream in = FileManager.get().open(inputFileName);
    if (in == null) {
        throw new IllegalArgumentException("File: " + inputFileName + " not found");
    }
    model.read(in, "");
    try {
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Add property to model
    Resource resource = model.createResource(resourceURI);
    resource.addProperty(model.createProperty(baseURI + propertyName),
            model.createLiteral(propertyValue));
    // Write model to file
    try {
        FileWriter out = new FileWriter(inputFileName);
        model.write(out, "RDF/XML-ABBREV");
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Please guide me on how to fix the number 2 and number 3 blue-highlighted errors from Jena.
Your method has an input-sanitization issue. I cannot be certain that your input data is invalid, but it is certainly something that should be tested for in any method that programmatically constructs URIs or literals.
URIs
For example, the following two lines are dangerous because they can allow characters that are not allowed in a URI, or they can allow characters for literal values that cannot be serialized as XML.
Resource resource = model.createResource(resourceURI);
resource.addProperty(model.createProperty(baseURI+propertyName), model.createLiteral(propertyValue));
To fix the problem with URIs, use URLEncoder to sanitize the URIs themselves:
final String uri = URLEncoder.encode(resourceURI, "UTF-8");
final String puri = URLEncoder.encode(baseURI + propertyName, "UTF-8");
final Resource resource = model.createResource(uri);
resource.addProperty(model.createProperty(puri), model.createLiteral(propertyValue));
To test for problems with URIs, you can use Jena's IRIFactory types to validate that the URI you are constructing adheres to a particular specification.
Literals
Solving the problem for literals is a little trickier. You are not getting an exception that indicates a bad literal value, but I am including this for completeness (so you can sanitize all inputs, not only the ones that may be causing a problem now).
Jena's writers do not test the values of literals until they are serialized as XML. The pattern they use to detect invalid XML characters covers only the characters that must be replaced as part of the RDF/XML specification; Jena delegates the final validation (and exception throwing) to the underlying XML library. This makes sense, because a future RDF serialization could allow the expression of all characters. I was recently bitten by this (for example, by a string containing a backspace character), so I created a stricter pattern to detect the situation eagerly at runtime.
final Pattern elementContentEntities = Pattern.compile(
        "[\\u0000-\\u001F&&[^\n\t\r]]|\\u007F|[\\u0080-\\u009F]|[\\uD800-\\uDFFF]|\\uFFFF|\\uFFFE");
final Matcher m = elementContentEntities.matcher(propertyValue);
if (m.find()) {
    // TODO sanitize your string literal; it contains invalid characters
} else {
    // your string is good
}
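A runnable sketch of that check, with the escapes written as explicit regex \uXXXX escapes (the sample strings are made up): the pattern flags C0 control characters other than tab/newline/CR, DEL, the C1 controls, lone surrogates, and the two non-characters U+FFFE/U+FFFF.

```java
import java.util.regex.Pattern;

public class LiteralCheckDemo {
    public static void main(String[] args) {
        Pattern bad = Pattern.compile(
                "[\\u0000-\\u001F&&[^\n\t\r]]|\\u007F|[\\u0080-\\u009F]"
                + "|[\\uD800-\\uDFFF]|\\uFFFF|\\uFFFE");

        // \b is a backspace (U+0008): a control char that RDF/XML cannot express.
        System.out.println(bad.matcher("hello\bworld").find());
        // Plain text with an ordinary newline passes the check.
        System.out.println(bad.matcher("plain text\nwith newline").find());
    }
}
```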
The nature of the truncation at #3 ("admi") leads me to think this may be a problem with your underlying data transport or storage, and has nothing to do with XML, RDF, Jena, or anything else at this level. Maybe an ignored exception?
My main program was sometimes passing the resourceURI argument as blank/null to the setDataTypeProperty method. That's what was creating the problem.
So I modified my code and added two lines at the start of the method:
public void setDataTypeProperty(String resourceURI, String propertyName, String propertyValue)
{
    if (resourceURI == null)
        return;
    ...
    ...
It has now been running for a few days, and I have not faced the above-mentioned errors yet.

Writing XML in UTF-8 successfully

Today I faced a very interesting problem when trying to rewrite an XML file.
I have three ways to do this, and I want to know the best way and the reason for the problem.
I.
File file = new File(REAL_XML_PATH);
try {
    FileWriter fileWriter = new FileWriter(file);
    XMLOutputter xmlOutput = new XMLOutputter();
    xmlOutput.output(document, System.out);
    xmlOutput.output(document, fileWriter);
    fileWriter.close();
} catch (IOException e) {
    e.printStackTrace();
}
In this case I have a big problem with my app: after writing text in my own language to the file, I can't read anything back. The file's encoding was changed to ANSI, and I get: javax.servlet.ServletException: javax.servlet.jsp.JspException: Invalid argument looking up property: "document.rootElement.children[0].children"
II.
File file = new File(REAL_XML_PATH);
XMLOutputter output = new XMLOutputter();
try {
    output.output(document, new FileOutputStream(file));
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
In this case I have no problems: the encoding wasn't changed, and reading and writing both work.
I also found this article: http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html
Well, this looks like the problem:
FileWriter fileWriter = new FileWriter(file);
That will always use the platform default encoding, which is rarely what you want. Suppose your default encoding is ISO-8859-1. If your document declares itself to be encoded in UTF-8, but you actually write everything in ISO-8859-1, then your file will be invalid if you have any non-ASCII characters - you'll end up writing them out with the ISO-8859-1 single byte representation, which isn't valid UTF-8.
I would actually provide a stream to XMLOutputter rather than a Writer. That way there's no room for conflict between the encoding declared by the file and the encoding used by the writer. So just change your code to:
FileOutputStream fileOutput = new FileOutputStream(file);
...
xmlOutput.output(document, fileOutput);
... as I now see you've done in your second bit of code. So yes, this is the preferred approach. Here, the stream makes no assumptions about the encoding to use, because it's just going to handle binary data. The XML writing code gets to decide what that binary data will be, and it can make sure that the character encoding it really uses matches the declaration at the start of the file.
You should also clean up your exception handling - don't just print a stack trace and continue on failure, and call close in a finally block instead of at the end of the try block. If you can't genuinely handle an exception, either let it propagate up the stack directly (potentially adding throws clauses to your method) or catch it, log it and then rethrow either the exception or a more appropriate one wrapping the cause.
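To see concretely why the Writer's charset matters, here is a minimal JDK-only sketch (the sample text is made up): the same string produces different bytes under UTF-8 and a single-byte charset, so a file written with a single-byte platform default cannot match a UTF-8 declaration once non-ASCII characters appear.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingDemo {
    public static void main(String[] args) throws IOException {
        // "žluťoučký" written with \uXXXX escapes to keep the source ASCII-safe.
        String text = "<v>\u017Elu\u0165ou\u010Dk\u00FD</v>";

        // Explicit UTF-8 writer: the bytes on disk match a UTF-8 XML declaration.
        Path p = Files.createTempFile("enc", ".xml");
        try (Writer w = new OutputStreamWriter(Files.newOutputStream(p), StandardCharsets.UTF_8)) {
            w.write(text);
        }
        byte[] utf8 = Files.readAllBytes(p);

        // ISO-8859-1 (one possible platform default) emits one byte per char and
        // cannot represent most of these characters at all - they degrade to '?'.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length + " vs " + latin1.length); // 20 vs 16
        Files.delete(p);
    }
}
```

The byte counts differ because each accented character takes two bytes in UTF-8; a parser told (by the declaration) to expect the 20-byte form will choke on the 16-byte form.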
If I remember correctly, you can force your XMLOutputter to use a "pretty" format with new XMLOutputter(Format.getPrettyFormat()), so it should work with approach I too.
The javadoc for getPrettyFormat() says:
Returns a new Format object that performs whitespace beautification
with 2-space indents, uses the UTF-8 encoding, doesn't expand empty
elements, includes the declaration and encoding, and uses the default
entity escape strategy. Tweaks can be made to the returned Format
instance without affecting other instances.
