XMLStreamException : Parse error - java

I have a process that parses an XML file. It worked with Java 5 on Apache Tomcat 6.
Since I recompiled it with Java 7 and run it on Apache Tomcat 7, I receive the following error:
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,60]
Message: Invalid encoding name "ISO8859-1".
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.setInputSource(XMLStreamReaderImpl.java:219)
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.<init>(XMLStreamReaderImpl.java:189)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.getXMLStreamReaderImpl(XMLInputFactoryImpl.java:262)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.createXMLStreamReader(XMLInputFactoryImpl.java:129)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.createXMLEventReader(XMLInputFactoryImpl.java:78)
at org.simpleframework.xml.stream.StreamProvider.provide(StreamProvider.java:66)
at org.simpleframework.xml.stream.NodeBuilder.read(NodeBuilder.java:58)
at org.simpleframework.xml.core.Persister.read(Persister.java:543)
at org.simpleframework.xml.core.Persister.read(Persister.java:444)
Here is the xml fragment used:
<?xml version="1.0" encoding="ISO8859-1" standalone="no" ?>
If I replace ISO8859-1 with UTF-8, parsing works, but that is not an option for me.
The library I use is simple-xml-2.1.8.jar.
As someone pointed out to me, ISO8859-1 is not a valid encoding name; ISO-8859-1 is the correct one. As I mentioned, it is difficult to ask the "producers" to correct their files, so I would like to handle the problem in my application.

Get access to the Xerces XMLReader instance from Simple XML and set
reader.setFeature("http://apache.org/xml/features/allow-java-encodings", true)
before parsing the XML.
Since ISO8859-1 "works" in Java, this may just work.
The list of supported "features" of Xerces is available here
Alternatively, a good old regex on encoding="ISO8859-1" to fix the XML should do the trick, prior to processing it.
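The regex approach could look roughly like the sketch below. It assumes the files really are ISO-8859-1 and small enough to hold in memory; the class and method names are made up for illustration and are not part of Simple XML.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class EncodingDeclarationFix {
    // Matches only the encoding pseudo-attribute that carries the invalid name.
    private static final Pattern BAD_NAME =
            Pattern.compile("encoding=([\"'])ISO8859-1\\1");

    /** Returns the document bytes with the declared encoding name corrected. */
    static byte[] fix(byte[] xml) {
        // ISO-8859-1 maps every byte to exactly one char, so this round trip is lossless.
        String text = new String(xml, StandardCharsets.ISO_8859_1);
        String fixed = BAD_NAME.matcher(text).replaceFirst("encoding=$1ISO-8859-1$1");
        return fixed.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Convenience wrapper for APIs that expect an InputStream. */
    static InputStream fixedStream(byte[] xml) {
        return new ByteArrayInputStream(fix(xml));
    }
}
```

The stream returned by fixedStream can then be passed to the Persister.read overload that takes an InputStream, e.g. persister.read(MyType.class, EncodingDeclarationFix.fixedStream(bytes)).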

If you know the file encoding up front (UTF-8, ISO-8859-1 or whatever) then you should create a suitable Reader configured for that encoding, then use the Persister.read method that takes a Reader instead of the one that takes a File or InputStream. That way you are in control of the byte-to-character decoding rather than relying on the XML reader to detect the encoding (and fail, as the file declared it wrongly). So instead of
File f = new File(....);
MyType obj = persister.read(MyType.class, f);
you would do something more like
File f = new File(....);
MyType obj = null;
try( FileInputStream fis = new FileInputStream(f);
InputStreamReader reader = new InputStreamReader(fis, "ISO-8859-1")) { // or UTF-8, ...
obj = persister.read(MyType.class, reader);
}

Related

Problems with JAXB and UTF-16 encoding

Hi I have a small APP that reads content from an xml file and put it into a corresponding Java object.
Here is the XML:
<?xml version="1.0" encoding="UTF-16"?>
<Marker>
<TimePosition>2700</TimePosition>
<SamplePosition>119070</SamplePosition>
</Marker>
here is the corresponding Java code:
JAXBContext jaxbContext = JAXBContext.newInstance(MarkerDto.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
InputStream inputStream = new FileInputStream("D:/marker.xml");
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_16.toString());
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(reader);
If I run this code I get a "Content is not allowed in prolog." exception. If I run the same with UTF-8, everything works fine. Does anyone have a clue what the problem might be?
There are several things wrong here (ranging from slightly suboptimal to potentially very wrong). In increasing order of likelihood of causing the problem:
When constructing an InputStreamReader, there's no need to call toString() on the Charset, because that class has a constructor that takes a Charset, so simply remove the .toString():
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_16);
This is a tiny nitpick and has no effect on functionality.
Don't construct a Reader at all! XML is a format that's self-describing when it comes to encoding: Valid XML files can be parsed without knowing the encoding up-front. So instead of creating a Reader, simply pass the InputStream directly into your XML-handling code. Delete the line that creates the Reader and change the next one to this:
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(inputStream);
This may or may not fix your problem, depending on whether the input is well-formed.
Your XML file might have encoding="UTF-16" in the header and not actually be UTF-16 encoded. If that's the case, then it is malformed and a conforming parser will decline to parse it. Verify this by opening the file with the advanced text editor of your choice (I suggest Notepad++ on Windows, Linux users probably know what their preference is) and check if it shows "UTF-16" as encoding (and the content is readable).
If I run the same with UTF-8 everything works fine.
This line suggests that that's what's actually happening here: the XML file is mis-labeling itself. This needs to be fixed at the point where the XML file is created.
Notably, this demo code provides exactly the same Content is not allowed in prolog. exception message that is reported in the question:
String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<foo />";
JAXBContext jaxbContext = JAXBContext.newInstance();
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
InputStream inputStream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
jaxbUnmarshaller.unmarshal(inputStream);
Note that the XML encoding attribute claims UTF-16, but the actual data handed to the XML parser is UTF-8 encoded.

DOM4J utf-8 encoding Umlaute(Ä,ü,ß) incorrectly

I'm using DOM4j for parsing and writing an XML-Tree which is always in UTF-8.
My XML file includes German Special-Characters. Parsing them is not a problem, but when I'm writing the tree to a file, the special characters are getting converted to � characters.
I can't change the encoding of the XML file as it is restricted to UTF-8.
Code
SAXReader xmlReader = new SAXReader();
xmlReader.setEncoding("UTF-8");
Document doc = xmlReader.read(file);
doc.setXMLEncoding("UTF-8");
Element root = doc.getRootElement();
// manipulate doc
OutputFormat format = new OutputFormat();
format.setEncoding("UTF-8");
XMLWriter writer = new XMLWriter(new FileWriter(file), format);
writer.write(doc);
writer.close();
Expected output
...
<statementText>This is a test!Ä Ü ß</statementText>
...
Actual output
...
<statementText>This is a test!� � �</statementText>
...
You are passing a FileWriter to the XMLWriter. A Writer already handles String or char[] data, so it already handles the encoding, which means the XMLWriter has no chance of influencing it.
Additionally, FileWriter is an especially problematic Writer type: before Java 11 it offered no way to specify the encoding and always used the platform default encoding (which is often something like ISO-8859-1 on Windows and UTF-8 on Linux). It should basically never be used for this reason.
To let the XMLWriter apply what it is given as configuration pass it an OutputStream instead (which handles byte[]). The most obvious one to use here would be FileOutputStream:
XMLWriter writer = new XMLWriter(new FileOutputStream(file), format);
This is even documented in the JavaDoc for XMLWriter:
Warning: using your own Writer may cause the writer's preferred character encoding to be ignored. If you use encodings other than UTF8, we recommend using the method that takes an OutputStream instead.
Arguably the warning is a bit misleading, as the Writer can be problematic even if you intend to write UTF-8 data.
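If a Writer is needed anyway, the encoding can be fixed by wrapping an OutputStream in an OutputStreamWriter with an explicit charset. The following is a minimal JDK-only sketch of that idea; the class and method names are hypothetical, and it does not involve DOM4J at all.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharsetWriter {
    /** Writes text as UTF-8 regardless of the platform default encoding. */
    static void writeUtf8(Path target, String text) throws IOException {
        // The charset is chosen here, so the platform default never comes into play.
        try (Writer out = new OutputStreamWriter(
                Files.newOutputStream(target), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }
}
```

With this, "Ä" ends up on disk as the two bytes 0xC3 0x84 on any platform, which is what a UTF-8 XML file is expected to contain.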

java - detect and change encoding

I have a small Java app that I develop in Eclipse. It takes text from an XML file and uploads it to a website. In Eclipse everything works fine, but when I build my app as an executable jar and start it from cmd, a big problem appears: all non-Latin characters turn into unreadable symbols.
I've tried putting <?xml version="1.0" encoding="windows-1251" ?> or <?xml version="1.0" encoding="utf-8" ?> in the file, but it doesn't help.
How can I fix this problem?
Any help appreciated!
You could try specifying the UTF-8 Charset (or any other supported charset, for that matter) explicitly in your output writer's constructor.
For example, when using the PrintWriter class for outputting data:
Writer writer = new PrintWriter("myfile.txt", "UTF-8");
writer.write("Hällo Wörld!");
writer.close();
An equivalent example when using the OutputStreamWriter class:
Writer writer = new OutputStreamWriter(System.out, "UTF-8");
writer.write("Hällo Wörld!");
writer.close();
(Note that, before Java 10, the PrintWriter constructor that takes a file name expects the charset as its textual name, i.e. "UTF-8"; OutputStreamWriter also accepts a Charset object such as StandardCharsets.UTF_8.)
Thus, a likely explanation for your problem is that, because the charset is not given explicitly, the system falls back to the default encoding of your OS (which is probably not UTF-8).

How to fix Invalid byte 1 of 1-byte UTF-8 sequence

I am trying to fetch the below xml from db using a java method but I am getting an error
Code used to parse the xml
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads it's because of some special characters in the xml.
How to fix this issue ?
How to fix this issue ?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.
Open the xml in notepad
Make sure you don't have extra whitespace at the beginning or end of the document.
Select File -> Save As
select save as type -> All files
Enter file name as abcd.xml
select Encoding - UTF-8 -> Click Save
Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If the data is in an encoding other than UTF-8, substitute the correct encoding name.
I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.
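A minimal sketch of that fix with the JDK's DOM parser follows. It assumes the XML string either has no encoding declaration or declares UTF-8; the class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ParseXmlString {
    /** Parses XML held in a String by encoding it explicitly as UTF-8. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Without an encoding declaration the parser assumes UTF-8,
        // so the bytes handed to it must really be UTF-8.
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        return builder.parse(new ByteArrayInputStream(utf8));
    }
}
```

Wrapping the string in a StringReader and passing that to an InputSource avoids the detour through bytes altogether, which may be the simpler choice when the XML is already a String.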
I had the same problem in my JSF application, which had a comment line containing some special characters in an XHTML page. When I compared it with the previous version in Eclipse, it had this comment:
//Some �  special characters found
Removed those characters and the page loaded fine. Mostly it is related to XML files, so please compare it with the working version.
I had this problem even though the file was in UTF-8; somehow one character had crept in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
And if there is something that is not encoded in UTF-8 it will give you the line and row numbers so that you can find it.
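The same check can be done from Java when iconv is not at hand. This is only a sketch with hypothetical names; it reports whether the bytes are well-formed UTF-8 but, unlike iconv, not where the first bad byte is.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For a file, the input could come from Files.readAllBytes(path), assuming the file fits in memory.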
I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".
This error also occurs when you try to load a Jasper report file with the extension .jasper.
For example:
c://reports//EmployeeReport.jasper
Instead, you should load the Jasper report file with the extension .jrxml.
For example:
c://reports//EmployeeReport.jrxml
See problem screenshot: https://i.stack.imgur.com/D5SzR.png
See solution screenshot: https://i.stack.imgur.com/VeQb9.png
I had a similar problem.
I had saved some xml in a file and when reading it into a DOM document, it failed due to special character. Then I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.
I have met the same problem, and after a long investigation of my XML file I found the cause: there were a few unescaped characters like « ».
For those who, like me, understand character encoding principles, have read Joel's article (which is funny, as it contains wrong characters anyway) and still can't figure out what is going on (spoiler alert: I'm a Mac user), the solution can be as simple as removing your local repo and cloning it again.
My code base had not changed since the last time it ran OK, so UTF errors made no sense, given that our build system never complained about it... until I remembered that a few days earlier I had accidentally unplugged my computer with IntelliJ IDEA and the whole stack running (Java/Tomcat/Hibernate).
My Mac did a brilliant job of pretending nothing had happened and I carried on with business as usual, but the underlying file system was left corrupted somehow. I wasted a whole day trying to figure this out. I hope it helps somebody.
I had the same issue. My problem was a missing "-Dfile.encoding=UTF8" argument under JAVA_OPTIONS in the startWebLogic.cmd file of the WebLogic server.
You have a library that needs to be removed, such as the following:
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'
This error surprised me in production...
The error occurs because the character encoding is wrong, so the best solution is to implement a way to auto-detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...
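Note that InputSource.getEncoding() only returns an encoding that was set explicitly and is null otherwise. When the parser is given a byte stream, it detects the encoding itself from the byte order mark and the XML declaration. The sketch below illustrates this with the JDK's DOM parser; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ParserDetectsEncoding {
    /** Parses raw bytes and returns the text content of the root element. */
    static String rootText(byte[] xmlBytes) throws Exception {
        // No encoding is set on the InputSource: the parser inspects
        // the BOM and the XML declaration to choose the decoder.
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(source);
        return doc.getDocumentElement().getTextContent();
    }
}
```

This only helps when the declaration matches the actual bytes; a mislabeled document still fails, and the label has to be corrected or overridden.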

Filtering Wikipedia's XML dump: error on some accents

I'm trying to index Wikipedia dumps. My SAX parser makes Article objects from the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.
I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except those matching the prefixes.
In French, the prefixes with no accent also work (i.e. they filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:), and I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the offending pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by cutting the original file with tail/head, the same page (here Catégorie:Rock par ville) gets filtered in the small file but not in the original…
Any idea?
Alternatives I tried:
Getting the file (commented lines were tried without success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"CatÃ©gorie:", "ModÃ¨le:", "WikipÃ©dia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad: the one I had tried does work; I had tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(is, handler);
Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings since they are byte-based, not character-based. So, here we create a Reader from a FileInputStream: the latter (stream) doesn't know about encodings, but the former (reader) does, because we give the encoding in the constructor.
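To show how such a Reader reaches the SAX parser, here is a small self-contained sketch that collects <title> texts. The class name and the handler are illustrative assumptions, not the asker's actual Article/ArticleSink code.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TitleReader {
    /** Collects the text of every <title> element, decoding the stream as UTF-8. */
    static List<String> titles(InputStream in) throws Exception {
        List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equals(qName)) text = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equals(qName)) {
                    titles.add(text.toString());
                    text = null;
                }
            }
        };
        // The Reader fixes the byte-to-char decoding before the parser sees the data.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(reader), handler);
        return titles;
    }
}
```

SAX may deliver the text of one element in several characters() calls, so the sketch accumulates the pieces in a StringBuilder and only reads the title in endElement. A handler that treats each call as the complete text can see truncated titles on some pages, which may be worth ruling out when prefixes fail only occasionally.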
