JAXB fails to load file having name containing URL encoded characters - java

I have a file whose path looks like this: /home/jwayne/test/55-0388%25car.xml. I try to unmarshal the XML back to an object using JAXB as follows.
File file = new File("/home/jwayne/test/55-0388%25car.xml");
JAXBContext context = JAXBContext.newInstance(Rectangle.class);
Unmarshaller unmarshaller = context.createUnmarshaller();
Rectangle rectangle = (Rectangle) unmarshaller.unmarshal(file);
However, I get a FileNotFoundException (FNFE) with the stacktrace as follows.
[java.io.FileNotFoundException: /home/jwayne/test/55-0388%car.xml (No such file or directory)]
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:246)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:214)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:157)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:162)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:171)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:189)
...
Note that somehow (as suggested by the stacktrace), the unmarshaller has modified the file name from 55-0388%25car.xml to 55-0388%car.xml.
Stepping through the code, however, the problem is actually pretty deep: sun.net.www.protocol.file.Handler has a method openConnection that does the following.
File var4 = new File(ParseUtil.decode(var1.getPath()));
That sun.net.www.ParseUtil.decode method actually transforms the file path.
Any idea on how to quickly fix this problem (besides renaming the file)? (Note I am using JDK v1.8.0_191).

The root cause of your problem is that % is the escape character used to URL-encode special characters, and a literal % is itself encoded as %25.
Internally, JAXB treats the path as a URL and decodes %25 to a plain %, so the file cannot be found.
A quick (and dirty) solution is to do some string replacement, like:
String fileName = "/home/jwayne/test/55-0388%25car.xml";
fileName = fileName.replace("%25", "%2525");
File file = new File(fileName);
This applies whenever there is %25 in a file name. But I guess the same thing happens with any URL-encoded character, so other escape sequences in a file name would each need similar handling, or some clever regexp solution.
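Why the %2525 replacement works can be seen with the JDK alone. The sketch below (the class name PercentPathDemo is made up for illustration) relies on File.toURI(), which percent-encodes a literal % as %25, so the %25 in the file name becomes %2525 in the URI and decodes back to the original name. It only illustrates the encoding round trip; it does not establish how any particular JAXB version treats such a path.

```java
import java.io.File;

class PercentPathDemo {
    // Percent-encoded path of the file's URI form: a literal '%' is escaped as "%25".
    static String rawUriPath(String fileName) {
        return new File(fileName).toURI().getRawPath();
    }

    // Decoded path of the same URI: one round of decoding restores the original name.
    static String decodedUriPath(String fileName) {
        return new File(fileName).toURI().getPath();
    }

    public static void main(String[] args) {
        String name = "/home/jwayne/test/55-0388%25car.xml";
        System.out.println(rawUriPath(name));
        System.out.println(decodedUriPath(name));
    }
}
```

Decoding the raw form once gives back the real file name, which is what the manual replace("%25", "%2525") achieves for this one escape sequence.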
Update:
To get around this JAXB behaviour, provide it with an InputStream instead of a File, like so:
FileInputStream fis = new FileInputStream(fileName);
Rectangle r2 = (Rectangle) unmarshaller.unmarshal(fis);
Then JAXB has no means of altering any URI or file name.
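The same idea can be tried without JAXB on the classpath. The following sketch is an assumption-laden stand-in: it uses the JDK's DOM parser rather than an Unmarshaller, the class and method names are invented, and the sample file is written to a temporary directory. What it shows is that once the application opens the stream itself, the parser never sees a file name it could decode.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class StreamParseDemo {
    // Writes a tiny XML file whose name contains the literal text "%25".
    static Path writeSample() {
        try {
            Path dir = Files.createTempDirectory("percent-demo");
            Path file = dir.resolve("55-0388%25car.xml");
            Files.write(file, "<rectangle/>".getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Opens the file ourselves and hands only the stream to the parser.
    static String rootElementOf(Path xmlFile) {
        try (InputStream in = Files.newInputStream(xmlFile)) {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in)
                    .getDocumentElement()
                    .getTagName();
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootElementOf(writeSample()));
    }
}
```

The try-with-resources block also closes the stream, which the two-line FileInputStream snippet above leaves to the caller.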

Related

Problems with JAXB and UTF-16 encoding

Hi, I have a small app that reads content from an XML file and puts it into a corresponding Java object.
Here is the XML:
<?xml version="1.0" encoding="UTF-16"?>
<Marker>
<TimePosition>2700</TimePosition>
<SamplePosition>119070</SamplePosition>
</Marker>
Here is the corresponding Java code:
JAXBContext jaxbContext = JAXBContext.newInstance(MarkerDto.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
InputStream inputStream = new FileInputStream("D:/marker.xml");
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_16.toString());
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(reader);
If I run this code I get a "Content is not allowed in prolog." exception. If I run the same with UTF-8 everything works fine. Does anyone have a clue what might be the problem?
There are several things wrong here (ranging from slightly suboptimal to potentially very wrong). In increasing order of likelihood of causing the problem:
When constructing an InputStreamReader, there's no need to call toString() on the Charset, because that class has a constructor that takes a Charset, so simply remove the .toString():
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_16);
This is a tiny nitpick and has no effect on functionality.
Don't construct a Reader at all! XML is a format that's self-describing when it comes to encoding: Valid XML files can be parsed without knowing the encoding up-front. So instead of creating a Reader, simply pass the InputStream directly into your XML-handling code. Delete the line that creates the Reader and change the next one to this:
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(inputStream);
This may or may not fix your problem, depending on whether the input is well-formed.
Your XML file might have encoding="UTF-16" in the header and not actually be UTF-16 encoded. If that's the case, then it is malformed and a conforming parser will decline to parse it. Verify this by opening the file with the advanced text editor of your choice (I suggest Notepad++ on Windows, Linux users probably know what their preference is) and check if it shows "UTF-16" as encoding (and the content is readable).
If I run the same with UTF-8 everything works fine.
This line suggests that that's what's actually happening here: the XML file is mis-labeling itself. This needs to be fixed at the point where the XML file is created.
Notably, this demo code provides exactly the same Content is not allowed in prolog. exception message that is reported in the question:
String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<foo />";
JAXBContext jaxbContext = JAXBContext.newInstance();
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
InputStream inputStream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
jaxbUnmarshaller.unmarshal(inputStream);
Note that the XML encoding attribute claims UTF-16, but the actual data handed to the XML parser is UTF-8 encoded.
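The editor check suggested above can also be approximated in code. This is a rough sketch, not a full encoding detector: the class BomSniffer is hypothetical, it only looks for a byte order mark, and a UTF-16 file written without a BOM would be reported as "no BOM" even though it is valid.

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Guesses the encoding from a leading byte order mark, if there is one.
    static String sniff(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8 (with BOM)";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        // Caveat: FF FE followed by 00 00 would be UTF-32LE; not handled here.
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "no BOM";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String decl = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>";
        System.out.println(sniff(decl.getBytes(StandardCharsets.UTF_16)));
        System.out.println(sniff(decl.getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the first bytes of marker.xml report "no BOM" and the file opens cleanly as UTF-8, the encoding="UTF-16" declaration is most likely the mislabel described above.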

FileNotFound while File is there

I am using getClassLoader().getResources to find the path for Jsoup to parse.
String path = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
Document document = Jsoup.parse(new File(path), "utf-8");
Elements names = document.getElementsByTag("name");
System.out.println(names.size());
My student.xml has been placed under the src folder in my module "day11_xml" and this code snippet comes from the class JsoupDemo1 in the package cn.itcast.xml.jsoup under the same module of "day11_xml". The error messages reads as follows:
java.io.FileNotFoundException:/Users/dingshun/Downloads/New%20Java%20Projects/demo/out/production/day11_xml/student.xml (No such file or directory)
I don't get it, as I can find the exact file in the given path. I'm confused, but could you guys help me out? Also, I'm new to both Java programming and this forum and if this question sounds silly or my question format is not right, please let me know.
What you're doing looks good. Maybe use the stream version of Jsoup.parse.
URL url = JsoupDemo1.class.getClassLoader().getResource("student.xml");
InputStream stream = JsoupDemo1.class.getClassLoader().getResourceAsStream("student.xml");
document = Jsoup.parse(stream, "utf-8", url.toURI().toString());
The documentation linked seems to imply it will work with html not xml, so maybe you need to use the other argument which provides a parser?
Actually, it turned out that Jsoup could not find my file because the directory name "New Java Projects" contains spaces, which show up as %20 in the path ("New%20Java%20Projects"). When I moved the file into a folder with no spaces in its name, it worked just fine. So Jsoup can parse XML using the parse(File in, String charsetName) method; it just could not handle a path name with spaces in it.
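The %20 in the error message hints at what is going on: URL.getPath() returns the path still percent-encoded, so new File(path) looks for a directory literally named "New%20Java%20Projects". A minimal sketch of the difference, using a hard-coded URL in place of the getResource(...) result (the class and method names are invented for illustration):

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class ResourcePathDemo {
    // Stand-in for the URL that getResource("student.xml") returned in the question.
    static URL sampleUrl() {
        try {
            return URI.create(
                "file:/Users/dingshun/Downloads/New%20Java%20Projects/demo/student.xml").toURL();
        } catch (MalformedURLException e) {
            throw new IllegalStateException(e);
        }
    }

    // What the question's code used: the path with %20 left in place.
    static String rawPath(URL url) {
        return url.getPath();
    }

    // Going through URI decodes the escapes, giving a path with real spaces.
    static String decodedPath(URL url) {
        try {
            return url.toURI().getPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL url = sampleUrl();
        System.out.println(rawPath(url));
        System.out.println(decodedPath(url));
    }
}
```

In the original snippet, building the file with new File(url.toURI()) instead of new File(url.getPath()) would be one way to keep the project folder where it is.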

java: Protocol message tag had invalid wire type error when reading .pb file

I am trying to read a file with the .pb extension.
Specifically, I would like to read this dataset (in .tgz).
I wrote the following code:
Path path = Paths.get(filename);
byte[] data = Files.readAllBytes(path);
Document document = Document.parseFrom(data);
But then I received the following error.
com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
The last line of the code caused this error, but I do not know how to solve it.
Your files are actually in "delimited" format: each one contains multiple messages, each with a length prefix.
InputStream stream = new FileInputStream(filename);
Document document = Document.parseDelimitedFrom(stream);
Keep calling parseDelimitedFrom(stream) to read more messages until it returns null (end of file).
Also note that the file I looked at -- testNegative.pb in heldout_relations.tgz -- appeared to contain instances of Relation, not Document. Make sure you are parsing the correct type, because the protobuf implementation can't tell the difference -- you'll get garbage if you parse the wrong type.
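To make "delimited" concrete: each message is preceded by its byte length, written as a base-128 varint. The sketch below reads that framing with plain JDK classes and returns the raw payloads; it is an illustration of the layout only, not the protobuf library's implementation, and DelimitedFramingDemo and its methods are made-up names. In real code, parseDelimitedFrom does this and the decoding in one step.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class DelimitedFramingDemo {
    // Splits a stream of [varint length][payload] records into raw payloads.
    static List<byte[]> readRecords(InputStream in) {
        List<byte[]> records = new ArrayList<>();
        try {
            DataInputStream data = new DataInputStream(in);
            int first;
            while ((first = data.read()) != -1) {      // -1 here means clean end of file
                int length = readVarint32(first, data);
                if (length < 0) throw new IOException("negative record length");
                byte[] payload = new byte[length];
                data.readFully(payload);               // throws EOFException if truncated
                records.add(payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    // Decodes a varint: 7 data bits per byte, high bit set on all but the last byte.
    static int readVarint32(int firstByte, InputStream in) throws IOException {
        int result = firstByte & 0x7F;
        int b = firstByte;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            if (shift >= 35) throw new IOException("varint too long");
            b = in.read();
            if (b == -1) throw new EOFException("truncated varint");
            result |= (b & 0x7F) << shift;
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] file = {3, 'a', 'b', 'c', 2, 'x', 'y'};
        System.out.println(readRecords(new ByteArrayInputStream(file)).size());
    }
}
```

Parsing such a file with parseFrom treats the length prefix as if it were a field tag, which is consistent with the "invalid wire type" error in the question.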

XMLStreamException : Parse error

I have a process that parses an XML file; it ran on Java 5 with Apache Tomcat 6.
Since I compiled it with Java 7 and started running it on Apache Tomcat 7, I receive the following error:
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,60]
Message: Invalid encoding name "ISO8859-1".
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.setInputSource(XMLStreamReaderImpl.java:219)
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.<init>(XMLStreamReaderImpl.java:189)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.getXMLStreamReaderImpl(XMLInputFactoryImpl.java:262)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.createXMLStreamReader(XMLInputFactoryImpl.java:129)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.createXMLEventReader(XMLInputFactoryImpl.java:78)
at org.simpleframework.xml.stream.StreamProvider.provide(StreamProvider.java:66)
at org.simpleframework.xml.stream.NodeBuilder.read(NodeBuilder.java:58)
at org.simpleframework.xml.core.Persister.read(Persister.java:543)
at org.simpleframework.xml.core.Persister.read(Persister.java:444)
Here is the xml fragment used:
<?xml version="1.0" encoding="ISO8859-1" standalone="no" ?>
If I replace ISO8859-1 by UTF-8 the parsing process works but it's not an option for me.
The lib that I use is simple-xml-2.1.8.jar
As someone pointed out to me, ISO8859-1 is not a valid encoding name; ISO-8859-1 is the correct one. As I mentioned, it is difficult to ask the "producers" to correct their files, so I would like to handle the problem in my application.
Get access to the Xerces XMLReader instance from Simple XML and set
reader.setFeature("http://apache.org/xml/features/allow-java-encodings", true)
before parsing the XML.
Since ISO8859-1 "works" in Java, this may just work.
The list of supported "features" of Xerces is available in the Xerces documentation.
Alternatively, a good old regex on encoding="ISO8859-1" to fix the XML should do the trick, prior to processing it.
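A minimal sketch of that regex approach, assuming the whole document fits in memory and really is ISO-8859-1 encoded (EncodingNameFix and its methods are hypothetical names, not part of Simple XML):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class EncodingNameFix {
    // Matches encoding="ISO8859-1" (single or double quotes, optional spaces around '=').
    private static final Pattern BAD_DECL =
            Pattern.compile("(encoding\\s*=\\s*[\"'])ISO8859-1([\"'])");

    // Rewrites only the first occurrence, which is the one in the XML declaration.
    static String fixDeclaration(String xml) {
        return BAD_DECL.matcher(xml).replaceFirst("$1ISO-8859-1$2");
    }

    // Decodes the raw bytes as ISO-8859-1 first, then repairs the declaration.
    static String decodeAndFix(byte[] raw) {
        return fixDeclaration(new String(raw, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO8859-1\" standalone=\"no\" ?><a/>";
        System.out.println(fixDeclaration(xml));
    }
}
```

The repaired string can then be wrapped in a java.io.StringReader and passed to whichever read method accepts a Reader.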
If you know the file encoding up front (UTF-8, ISO-8859-1 or whatever) then you should create a suitable Reader configured for that encoding, then use the Persister.read method that takes a Reader instead of the one that takes a File or InputStream. That way you are in control of the byte-to-character decoding rather than relying on the XML reader to detect the encoding (and fail, as the file declared it wrongly). So instead of
File f = new File(....);
MyType obj = persister.read(MyType.class, f);
you would do something more like
File f = new File(....);
MyType obj = null;
try( FileInputStream fis = new FileInputStream(f);
InputStreamReader reader = new InputStreamReader(fis, "ISO-8859-1")) { // or UTF-8, ...
obj = persister.read(MyType.class, reader);
}

Filtering Wikipedia's XML dump: error on some accents

I'm trying to index Wikipedia dumps. My SAX parser makes Article objects from the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.
I want to filter out special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except those matching the prefixes.
In French, the prefixes with no accent also work (i.e. filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:) but I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the affected pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file, but not in the original…
Any ideas?
Alternatives I tried:
Getting the file (commented lines were tried without success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"CatÃ©gorie:", "ModÃ¨le:", "WikipÃ©dia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad: one of the variants I tried does work; I had tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(is, handler); // pass the InputSource, not the raw stream
Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings since they are byte-based, not character-based. So, here we create a Reader from a FileInputStream: the latter (stream) doesn't know about encodings, but the former (reader) does, because we give the encoding in the constructor.
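To see why a wrongly decoded title never matches its prefix, here is a small sketch (MojibakeDemo is an invented name). It encodes a prefix as UTF-8 and then decodes the bytes once correctly and once as ISO-8859-1. The accented letter is written as a \u escape so the example does not depend on the encoding of the .java file itself.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes as UTF-8, then decodes with the wrong charset (ISO-8859-1).
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Encodes and decodes with the same charset, so the text survives intact.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String prefix = "Cat\u00e9gorie:";            // "Catégorie:"
        String title = prefix + "Rock par ville";
        System.out.println(roundTrip(title).startsWith(prefix));  // decoded correctly
        System.out.println(misdecode(title).startsWith(prefix));  // decoded as ISO-8859-1
    }
}
```

The single é becomes two characters when its two UTF-8 bytes are read as ISO-8859-1, which is why startsWith fails even though the raw bytes in the file are fine.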
