Reading UTF-16 XML files with JCabi Java

Reading UTF-16 XML files with JCabi Java - java

I have found this JCabi snippet code that works well with UTF-8 xml encoded files, it basically reads the xml file and then prints it as a string.
XML xml;
try {
xml = new XMLDocument(new File("test8.xml"));
String xmlString = xml.toString();
System.out.println(xmlString);
} catch (FileNotFoundException e1) {
e1.printStackTrace();
}
However I need this to run this same code on a UTF-16 encoded xml it gives me the following error:
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "AWT-EventQueue-0" java.lang.IllegalArgumentException: Can't parse, most probably the XML is invalid
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
I have read about this error and this means that the parser it is not recognizing the prolog because it's seeing characters that are not supposed to be there because of the encoding.
I have tried other libraries that offer a way to "tell" the class which encoding the source file is encoded in, but the only library I was able to get it to work to some degree was JCabi, but I was not able to find a way to tell it that my source file is encoded in UTF-16.
Thanks, any help is appreciated.

The jcabi XMLDocument has various constructors including one which takes a string. So one approach is to use:
Path path = Paths.get("test16_LE_with_bom.xml");
XML xml = new XMLDocument(Files.readString(path, StandardCharsets.UTF_16LE));
String xmlString = xml.toString();
System.out.println(xmlString);
This makes use of java.nio.charset.StandardCharsets and java.nio.file.Files.
In my first test, my XML file was encoded as UTF-16-LE (and with a BOM at the start: FF FE for little-endian). The above approach handled the BOM OK.
My test file's prolog is as follows (with no explicit encoding - maybe that's a bad thing, here?):
<?xml version="1.0"?>
In my second test I removed the BOM and re-ran with the updated file - which also worked.
I used Notepad++ and a hex editor to verify/select encodings & to edit the test files.
Your file may be different from my test files (BE vs. LE).

Related

Is there any handy way to inspect whether a xml file contins invalid characters

I am writing a Java program that parses/unmarshals XML files to Java objects.
This program takes XML files, which are generated by some third party and I do not have any control over of.
Upon getting the files, the program checks whether they are invalid format using their respective XSDs↓
URL schemaFile = this.getClass().getClassLoader().getResource(xsd/some.xsd);
Source xmlFile = new StreamSource(new File(/path/to/xml));
SchemaFactory schemaFactory = SchemaFactory.newInstance(W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(schemaFile);
Validator validator = schema.newValidator();
validator.validate(xmlFile);
then starts parsing/unmarshalling them individually using JAXP.
The problem I am facing is that even after the validation above, sometimes I get the following error. (the validator above does not seem to check whether the XML contains invalid characters, but only compare the input with its XSD)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[xxx,xxx]
Is there any handy way to inspect whether XML file contains invalid characters using programmatically or some tool?
I have extracted the portion(line 245) where the exception occurs using "sed -n '240,250p'".
sample.xml

Do you have a whitelist of allowed characters? Here's one pattern:
For each streamed character, if it is not whitelisted replace it with nothing.
Ask whether your file content after filtering is the same as before (diff pattern)
If the content in both files is not equal then the source file had invalid characters.

Decoding a base64 XML cuts off the last part

I have a base64 encoded string, which represents an XML Schema (xsd). I decode this using Apache's Base64 utilities, put the resulting byte array into an intputsource and let an XMLSchemaCollection read this inputSource:
String base64String = ......
byte[] decoded = Base64.decodeBase64(base64String);
InputSource inputSource = new InputSource(new ByteArrayInputStream(decoded));
xmlSchemaCollection.read(inputSource, new ValidationEventHandler());
This gives an error:
XML document structure must start and end within the same entity
Which usually means the XML structure isn't valid. I performed two tests to see what the base64 actually holds. First is printing it out to the console:
System.out.println(new String(decoded,"UTF-8"));
In eclipse, I see my xml is suddenly cut off, like part of it is missing. However, if I use any online website, such as https://www.base64decode.org/, and I copy/paste my base64, I see the complete full xml. If I validate this xml, the validation succeeds. So I'm a bit confused as to why eclipse seemingly cuts off my xml after decoding?

Errors like this are usually indicative of a badly formatted document:
XML document structures must start and end within the same entity...
A few things you can do to debug this:
1. Print out the XML document to a log and run it through some sort of XML validator.
2. Check to make sure that there are no invalid characters (ex UTF-16 characters in a UTF-8 document)

XML parsing tries to parse more lines then exist

I have a strange problem when parsing an xml request with JAXB: somehow it tries to parse more lines then exists in the string:
String xml; //content with 139 lines in xml format
MyReq req = JAXB.unmarshal(new StringReader(xml), MyReq.class);
Result:
Caused by: org.xml.sax.SAXParseException; lineNumber: 140; columnNumber: 1; Content is not allowed in trailing section.
What might be wrong with this?? The lines does not exist that is supposed to be have an error...
I can copy the xml just as it is to soapUI and execute the request successfully, thus concluding the xml is valid in general.

You should check the xml content. Most of the time Content is not allowed in trailing section error is because the content is not valid, probably some bad characters at the end of the stream.
You should print the content of the xml, with some known delimiters, to ensure that what you received is what you actually tested/expected, something like:
System.out.println("*"+xml+"*");

How to fix Invalid byte 1 of 1-byte UTF-8 sequence

I am trying to fetch the below xml from db using a java method but I am getting an error
Code used to parse the xml
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads it's because of some special characters in the xml.
How to fix this issue ?

How to fix this issue ?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.

Open the xml in notepad
Make sure you dont have extra space at the beginning and end of the document.
Select File -> Save As
select save as type -> All files
Enter file name as abcd.xml
select Encoding - UTF-8 -> Click Save

Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If it's anything else than UTF-8, just change the encoding part for the good one.

I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.

I had the same problem in my JSF application which was having a comment line containing some special characters in the XMHTL page. When I compared the previous version in my eclipse it had a comment,
//Some �  special characters found
Removed those characters and the page loaded fine. Mostly it is related to XML files, so please compare it with the working version.

I had this problem, but the file was in UTF-8, it was just that somehow on character had come in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
And if there is something that is not encoded in UTF-8 it will give you the line and row numbers so that you can find it.

I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".

This error comes when you are trying to load jasper report file with the extension .jasper
For Example
c://reports//EmployeeReport.jasper"
While you should load jasper report file with the extension .jrxml
For Example
c://reports//EmployeeReport.jrxml"
[See Problem Screenshot ][1] [1]: https://i.stack.imgur.com/D5SzR.png
[See Solution Screenshot][2] [2]: https://i.stack.imgur.com/VeQb9.png

I had a similar problem.
I had saved some xml in a file and when reading it into a DOM document, it failed due to special character. Then I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.

I have met the same problem and after long investigation of my XML file I found the problem: there was few unescaped characters like « ».

Those like me who understand character encoding principles, also read Joel's article which is funny as it contains wrong characters anyway and still can't figure out what the heck (spoiler alert, I'm Mac user) then your solution can be as simple as removing your local repo and clone it again.
My code base did not change since the last time it was running OK so it made no sense to have UTF errors given the fact that our build system never complained about it....till I remembered that I accidentally unplugged my computer few days ago with IntelliJ Idea and the whole thing running (Java/Tomcat/Hibernate)
My Mac did a brilliant job as pretending nothing happened and I carried on business as usual but the underlying file system was left corrupted somehow. Wasted the whole day trying to figure this one out. I hope it helps somebody.

I had the same issue. My problem was it was missing “-Dfile.encoding=UTF8” argument under the JAVA_OPTION in statWeblogic.cmd file in WebLogic server.

You have a library that needs to be erased
Like the following library
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'

This error surprised me in production...
The error is because the char encoding is wrong, so the best solution is implement a way to auto detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...

DOM4J Document: read an ISO-8859-1 xml

I need to read an xml file that is encoded in ISO-8859-1.
I'm using:
Document document = reader.read(new File(sourceFile));
document.setXMLEncoding("ISO-8859-1");
I'm getting a "cannot find symbol" error for setXMLEncoding. This seems like it should be a simple thing, but I can't figure out what I'm doing wrong.

The setXMLEncoding is available since dom4j 1.6. I guess you're using an older version.
Anyway, as the javadoc says:
Sets the encoding of this document as it will appear in the XML
declaration part of the document.
you should use that method if you're writing an xml.
I guess you're reading an existing file, so if it's ISO-8859-1 encoded and its prolog contains the same encoding declaration, you shouldn't have any problem, dom4j should do everything for you.
<?xml version="1.0" encoding="ISO-8859-1"?>

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Reading UTF-16 XML files with JCabi Java - java

Related

Is there any handy way to inspect whether a xml file contins invalid characters

Decoding a base64 XML cuts off the last part

XML parsing tries to parse more lines then exist

How to fix Invalid byte 1 of 1-byte UTF-8 sequence

DOM4J Document: read an ISO-8859-1 xml

Categories

Resources