Reading XML file returns wrong characters

I have an XML file with thousands of tags whose text content I need to read.
I am trying to read the text content of all the "word" tags using this code:
String filePath = "...";
File xmlFile = new File( filePath );
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document domObject = db.parse( xmlFile );
domObject.getDocumentElement().normalize();
NodeList categoryNodes = domObject.getElementsByTagName( "category" ); // Get all the <category> nodes.
for (int s = 0; s < categoryNodes.getLength(); s++) { // Loop on the <category> nodes.
    String categoryName = categoryNodes.item(s).getAttributes().getNamedItem( "name" ).getNodeValue();
    if( selectedCategoryName.equals( categoryName ) ) { // Get its words.
        NodeList wordsNodes = categoryNodes.item(s).getChildNodes();
        for( int i = 0; i < wordsNodes.getLength(); i++ ) {
            if( wordsNodes.item( i ).getNodeType() != Node.ELEMENT_NODE ) continue;
            String word = wordsNodes.item( i ).getTextContent();
            categoryWordsList.add( word ); // Some words are read wrong !!
        }
        break;
    }
}
But for some reason many words are read incorrectly. Examples:
"AMK6780KBU" is read as "9826</word"
"ASSI.ABR30326" is read as "rd>ASSI.AEP26"
"ASSI.25066" is read as "SI.4268</6"
It might be because the file is big. If I add or remove a few empty lines in the XML file, different words are read wrong than the ones mentioned above, which is strange!
You can download the XML file from here.

Solution
See below :-)
What I tried in the process
Changing the XML version from 1.1 -> 1.0 fixed the problem for me. I'm using Java 1.6.0_33 (as @orique pointed out in the comments).
In my tests there are definitely issues with corruption after a certain number of nodes. I narrowed it down to somewhere around ASSI.MTK69609. Removing everything, including that line fixed the corruption of the previous words.
The corruption is also resolved by simply changing the declaration to:
<?xml version="1.0"?>
and I saw zero corruption using the entire original source XML.
Similarly if you leave the version at 1.1 but remove whitespace nodes from the source, the result is as expected, for example:
<word>ASSI.MTK68490</word>
<word>ASSI.MTK6862617</word>
<word>ASSI.MTK693115</word>
<word>ASSI.MTK69609</word>
results in the desired output, whereas the indented form
    <word>ASSI.MTK68490</word>
    <word>ASSI.MTK6862617</word>
    <word>ASSI.MTK693115</word>
    <word>ASSI.MTK69609</word>
is corrupted.
Removing some end-of-line "nodes" also corrected the problem, for example
<word>ASSI.MTK693115</word><word>ASSI.MTK69609</word>
So it was all pointing towards a bug, but where...? Eventually it clicked! Xerces
The version of Xerces shipped with Java 1.6 (and probably 1.7) is old, old, old and buggy (for example #6760982). In fact, I can break my test class by simply adding:
Document domObject = db.parse( xmlFile );
domObject.normalizeDocument(); // <-- causes following Exception
Exception in thread "main" java.lang.NullPointerException
at com.sun.org.apache.xerces.internal.util.XML11Char.isXML11ValidNCName(XML11Char.java:340)
There have been many defects fixed for XML 1.1, so on a hunch I downloaded the latest version Xerces2 Java 2.11.0.
Simply running with the most recent version resulted in the expected uncorrupted output.
java -classpath .;xercesImpl.jar;xml-apis.jar Foo > foo.txt

We have noticed that getTextContent() is buggy on some Windows implementations.
Our workaround is to do something like this
// getTextContent is buggy on some Java Windows Implementations
if ( n.getNodeType( ) == Node.ELEMENT_NODE ) {
results [ i ] = (String) xPathFunction.evaluate( "./text()", n, XPathConstants.STRING );
} else { //Node.TEXT_NODE
results [ i ] = n.getNodeValue( );
}
xPathFunction is a javax.xml.xpath.XPath. Expensive, but works reliably.
Actually in your case I would directly use an XPath and call something like,
NodeList l = (NodeList) xPathFunction.evaluate( "/categories/category/word/text()", domObject, XPathConstants.NODESET );
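A minimal sketch of that direct XPath approach, using only the JDK's javax.xml.xpath API. The class name XPathWords and the variable name $cat are made up for illustration; the element and attribute names (categories, category, name, word) are taken from the question. The category name is bound as an XPath variable so that it does not have to be spliced into the expression string.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects the text of every <word> in one <category>.
final class XPathWords {
    static List<String> wordsOf(Document doc, String categoryName) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Resolve the (illustrative) variable $cat to the requested category name.
        xpath.setXPathVariableResolver(
                qname -> "cat".equals(qname.getLocalPart()) ? categoryName : null);
        NodeList nodes = (NodeList) xpath.evaluate(
                "/categories/category[@name=$cat]/word/text()",
                doc, XPathConstants.NODESET);
        List<String> words = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            words.add(nodes.item(i).getNodeValue());
        }
        return words;
    }
}
```

Note that this still runs on whatever DOM the parser produced, so it would not by itself work around a parser defect.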
EDIT
Beats me! On OSX, Java 1.6.0_43, I get the same behaviour. In case there was any doubt the DOM model is buggy in Java... The wrong values seem to reliably appear at certain intervals, which looks like a byte-buffer overrun. I never got an OOM error.
Here is what I have unsuccessfully tried:
word.getFirstChild().getNodeValue(); instead of word.getTextContent(); -> no change in behaviour
use an InputSource as an input into the DocumentBuilder instead of using a File
run an XPath ("/categories/category[@name='Category1']/word/text()") instead of looping over the nodes and manually traversing their children
run the same Test using Saxon as the XPath engine
check for "strange" characters in the XML file
I believe the DocumentBuilder is the culprit. It is a memory hog.
Your next best chance is to go for a SAX Parser or any other streaming parser. Since your data model is small and very simple, the implementation should be easy. To further ease implementation, you may try XMLDog. We use a slightly modified version to parse gigabyte size XML files successfully.
If you ever find the issue, please update this post.
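For the streaming route suggested above, a SAX handler could look roughly like the following sketch. The class name SaxWordReader is made up; the element and attribute names come from the question. It relies on the JDK's default SAXParserFactory, so it still runs on the bundled parser and may or may not sidestep the XML 1.1 problem described in the other answer.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams the file and keeps only the words of one category.
final class SaxWordReader {
    static List<String> readWords(InputStream xml, String selectedCategoryName) throws Exception {
        List<String> words = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inSelectedCategory;
            private StringBuilder text; // non-null only inside a <word> we want

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("category".equals(qName)) {
                    inSelectedCategory = selectedCategoryName.equals(atts.getValue("name"));
                } else if (inSelectedCategory && "word".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several chunks, so append.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("word".equals(qName) && text != null) {
                    words.add(text.toString());
                    text = null;
                } else if ("category".equals(qName)) {
                    inSelectedCategory = false;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return words;
    }
}
```

Because nothing but the current word is held in memory, the size of the file matters much less than with a DOM tree.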

Related

getDocument() constantly returns a null value

I am trying to parse an XML file using Java that lives on a network drive... I have reviewed lots of XML parsing info here but cannot find the answer I need... the problem is that the getDocument() routine constantly returns a null value even though the parser gets an accurate location and file name.
Here is the code...
String ThisXMLFile = XMLFileData.getPath();
DOMParser myXMLParser = new DOMParser();
myXMLParser.parse(ThisXMLFile);
Document doc = myXMLParser.getDocument();
Some notes:
I had to use getPath() as the getName() function did not return the fully qualified file name and path - the XML file lives on a network directory and that directory is mapped on my PC to the 'V' drive
I have imported all the required class header files for DOM objects
The variable names given above are real and accurate so if I have inadvertently used a reserved keyword in a variable declaration then please offer correction.
I have extensive programming experience in a few languages but this is my first real Java app.
all the lines of code and the variables above work, until I reach the last line and then getDocument() just sets the doc variable to null... which makes the rest of the program break.

I believe that you are calling the wrong method... according to your code, you're executing: DOMParser.parse(systemId) when you need to call: DOMParser.parse(InputSource) ...
To create an InputSource you can do this:
InputSource source = new InputSource(new FileInputStream(ThisXMLFile));
myXMLParser.parse(source);
Document doc = myXMLParser.getDocument();
NOTE: remember to close the opened FileInputStream!!!
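One way to make sure the stream is always closed is try-with-resources. The sketch below swaps in the standard JAXP DocumentBuilder rather than the Xerces DOMParser used above, so that it runs on a plain JDK; the class and method names (XmlFileLoader, load) are made up for illustration.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parses a file and closes the stream even if parsing fails.
final class XmlFileLoader {
    static Document load(String path) throws Exception {
        try (InputStream in = new FileInputStream(path)) {
            InputSource source = new InputSource(in);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(source);
        }
    }
}
```

With this shape, a call such as XmlFileLoader.load(ThisXMLFile) either returns a Document or throws, and never leaves the file open.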

XMLInputFactory XMLFactory = XMLInputFactory.newInstance();
XMLStreamReader XMLReader = XMLFactory.createXMLStreamReader(myXMLStream);
while(XMLReader.hasNext())
{
    if (XMLReader.getEventType() == XMLStreamReader.START_ELEMENT)
    {
        String XMLTag = XMLReader.getLocalName();
        if(XMLTag.equals("value"))
        {
            String idValue = XMLReader.getAttributeValue(null, "id");
            if (idValue.equals(ElementName))
            {
                System.out.println(idValue);
                XMLReader.nextTag();
                System.out.println(XMLReader.getElementText());
            }
        }
    }
    XMLReader.next();
}
So this is the code I finally got to... it works and solves the issue of retrieving specific XML data from an XML file. I wanted at first to use nodelists, elements, Documents, etc. but those functions never did work for me... this one did. Thanks to all for the answers given, as they helped me think this one through...

Illegal characters in XML - java

I'm creating a program which checks the legitimacy of a given URL. I've already created my own algorithm for this, but now I want to add PhishTank's services into my program.
They provide services where you can directly query a URL from their website, but they have set a certain quota on the number of queries you can make per day. The other option, which I'm going with, is to simply download their database and work with it locally, without restrictions.
The file you get is in XML, and I found some code to test with, but it seems like their XML contains illegal characters (such as unicode 0x07 -- the [BEL] character) inside CDATA, and so the parsing throws an exception.
<url><![CDATA[http://shaghaf-edu.com/sign-in/??msg=InvalidOnlineIdException&id[BEL]da9ca9b23227a572d1fb5ff4ff91e3&lpOlbResetErrorCounter=0l=&request_locale=en-us]]></url>
I've done a bit of searching and all I've found are solutions that seem suited to rather small XML files. The one I'm working with is close to 2.7 million lines -- I'm not sure how efficiently a regex or a char-by-char comparison would work in this case.
I should note that their database is updated hourly, and has to be redownloaded. So cleaning the file once manually isn't an option.
So I'm wondering if there is any fast and efficient way of solving this problem?
I don't have the exact code with me, but what I use is a very slight variation of this, which I found here on StackOverflow:
private void start() throws Exception
{
    URL url = new URL("http://localhost:8080/AutoLogin/resource/web.xml");
    URLConnection connection = url.openConnection();
    Document doc = parseXML(connection.getInputStream());
    NodeList descNodes = doc.getElementsByTagName("description");
    for(int i=0; i<descNodes.getLength();i++)
    {
        System.out.println(descNodes.item(i).getTextContent());
    }
}
private Document parseXML(InputStream stream)
throws Exception
{
    DocumentBuilderFactory objDocumentBuilderFactory = null;
    DocumentBuilder objDocumentBuilder = null;
    Document doc = null;
    try
    {
        objDocumentBuilderFactory = DocumentBuilderFactory.newInstance();
        objDocumentBuilder = objDocumentBuilderFactory.newDocumentBuilder();
        doc = objDocumentBuilder.parse(stream);
    }
    catch(Exception ex)
    {
        throw ex;
    }
    return doc;
}

Answering by asking a question ...
Why not write a simple pre-processing utility?
It could read the XML file as is (line by line); and do whatever is required to turn that content into "correct" XML.
In other words: you should explicitly distinguish between the task of "preparing your input" and "actually working with that XML input". This will also make it much easier to do fine tuning. If you find that regular expressions are too expensive, then just change the "pre-processor" to not use them. And afterwards, easily measure the effects on runtime ...
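Such a pre-processor does not have to be a separate pass over the file. As a rough sketch, a java.io.FilterReader placed between the raw input and the parser can drop characters that XML 1.0 does not allow (such as the BEL character from the question) while the parser reads. The class name XmlCharFilterReader is made up, and the filter makes one simplifying assumption: surrogate characters are passed through untouched on the assumption that they arrive as valid pairs.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical pre-processing filter: silently drops characters illegal in XML 1.0.
final class XmlCharFilterReader extends FilterReader {
    XmlCharFilterReader(Reader in) {
        super(in);
    }

    // XML 1.0 allows tab, LF, CR and everything from 0x20 up to 0xFFFD.
    // Surrogates fall inside that range here and are assumed to be well paired.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = in.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i]; // compact the allowed characters in place
                }
            }
            if (kept > off) {
                return kept - off;
            }
            // The whole chunk was illegal characters; read the next chunk.
        }
    }
}
```

It could be used as new InputSource(new XmlCharFilterReader(new InputStreamReader(stream, StandardCharsets.UTF_8))), which keeps the cleaning step separate from the parsing code and avoids regular expressions entirely.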

How to fix Invalid byte 1 of 1-byte UTF-8 sequence

I am trying to fetch the below xml from db using a java method but I am getting an error
Code used to parse the xml
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads it's because of some special characters in the xml.
How to fix this issue ?

How to fix this issue ?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.

Open the xml in notepad
Make sure you don't have extra space at the beginning and end of the document.
Select File -> Save As
select save as type -> All files
Enter file name as abcd.xml
select Encoding - UTF-8 -> Click Save

Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If the data is in an encoding other than UTF-8, just replace "UTF-8" with the correct one.

I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.
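When the XML is already a String, another option is to skip the byte conversion altogether and hand the parser a character stream, so that neither the platform default charset nor the declared encoding gets a chance to disagree with the data. A minimal sketch, with a made-up class name:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parses XML held in a String without going through bytes.
final class XmlStringParser {
    static Document parse(String xml) throws Exception {
        // A character stream is already decoded, so no charset is involved here.
        InputSource source = new InputSource(new StringReader(xml));
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
    }
}
```

This only helps if the String itself was decoded correctly when it was read from the database.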

I had the same problem in my JSF application, which had a comment line containing some special characters in the XHTML page. When I compared it with the previous version in my Eclipse, it had a comment,
//Some �  special characters found
Removed those characters and the page loaded fine. Mostly it is related to XML files, so please compare it with the working version.

I had this problem, but the file was in UTF-8; it was just that somehow one character had come in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
And if there is something that is not encoded in UTF-8 it will give you the line and row numbers so that you can find it.
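If iconv is not available, roughly the same check can be done in Java with a strict CharsetDecoder. The sketch below uses a made-up class and method name, and reports a byte offset rather than a line number.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: finds the first byte that is not valid UTF-8.
final class Utf8Check {
    /** Returns the offset of the first malformed byte sequence, or -1 if the data is valid UTF-8. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never yields more chars than bytes, so this buffer cannot overflow.
        CharBuffer out = CharBuffer.allocate(data.length);
        CoderResult result = decoder.decode(in, out, true);
        // On an error the input position is left at the start of the bad sequence.
        return result.isError() ? in.position() : -1;
    }
}
```

For a file, the bytes could come from Files.readAllBytes(path); for very large files a chunked loop would be needed instead of reading everything at once.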

I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".

This error comes when you are trying to load jasper report file with the extension .jasper
For Example
"c://reports//EmployeeReport.jasper"
While you should load jasper report file with the extension .jrxml
For Example
"c://reports//EmployeeReport.jrxml"
[See Problem Screenshot ][1] [1]: https://i.stack.imgur.com/D5SzR.png
[See Solution Screenshot][2] [2]: https://i.stack.imgur.com/VeQb9.png

I had a similar problem.
I had saved some xml in a file and when reading it into a DOM document, it failed due to special character. Then I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.

I ran into the same problem, and after a long investigation of my XML file I found the cause: there were a few unescaped characters like « ».

For those who, like me, understand character-encoding principles, have also read Joel's article (which is funny, as it contains wrong characters anyway), and still can't figure out what the heck is going on (spoiler alert: I'm a Mac user): your solution can be as simple as removing your local repo and cloning it again.
My code base had not changed since the last time it was running OK, so it made no sense to have UTF errors, given that our build system never complained about it... till I remembered that I had accidentally unplugged my computer a few days earlier with IntelliJ IDEA and the whole thing running (Java/Tomcat/Hibernate).
My Mac did a brilliant job of pretending nothing had happened and I carried on business as usual, but the underlying file system was left corrupted somehow. I wasted the whole day trying to figure this one out. I hope it helps somebody.

I had the same issue. My problem was that the “-Dfile.encoding=UTF8” argument was missing from JAVA_OPTIONS in the startWebLogic.cmd file of the WebLogic server.

You have a library that needs to be erased
Like the following library
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'

This error surprised me in production...
The error occurs because the character encoding is wrong, so the best solution is to implement a way to auto-detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...

Parsing an XML file without root in Java

I have this XML file which doesn't have a root node. Other than manually adding a "fake" root element, is there any way I would be able to parse an XML file in Java? Thanks.

I suppose you could create a new implementation of InputStream that wraps the one you'll be parsing from. This implementation would return the bytes of the opening root tag before the bytes from the wrapped stream and the bytes of the closing root tag afterwards. That would be fairly simple to do.
I may be faced with this problem too. Legacy code, eh?
Ian.
Edit: You could also look at java.io.SequenceInputStream which allows you to append streams to one another. You would need to put your prefix and suffix in byte arrays and wrap them in ByteArrayInputStreams but it's all fairly straightforward.

Your XML document needs a root xml element to be considered well formed. Without this you will not be able to parse it with an xml parser.

One way is to provide your own dummy wrapper, without touching the original (not well-formed) 'xml', by pulling it in as an external entity:
Syntax
<!DOCTYPE some_root_elem SYSTEM "/home/ego/some.dtd"
[
<!ENTITY entity-name "Some value to be inserted at the entity">
]>
Example:
<!DOCTYPE dummy [
<!ENTITY data SYSTEM "http://wherever-my-data-is">
]>
<dummy>
&data;
</dummy>

You could use another parser like Jsoup. It can parse XML without a root.

I think even if any API would have an option for this, it will only return you the first node of the "XML" which will look like a root and discard the rest.
So the answer is probably to do it yourself. Scanner or StringTokenizer might do the trick.
Maybe some html parsers might help, they are usually less strict.

Here's what I did:
There's an old java.io.SequenceInputStream class, which is so old that it takes Enumeration rather than List or such.
With it, you can prepend and append the root element tags (<div> and </div> in my case) around your no-root XML stream. (You shouldn't do it by concatenating Strings due to performance and memory reasons.)
public void tryExtractHighestHeader(ParserContext context)
{
String xhtmlString = context.getBody();
if (xhtmlString == null || "".equals(xhtmlString))
return;
// The XHTML needs to be wrapped, because it has no root element.
ByteArrayInputStream divStart = new ByteArrayInputStream("<div>".getBytes(StandardCharsets.UTF_8));
ByteArrayInputStream divEnd = new ByteArrayInputStream("</div>".getBytes(StandardCharsets.UTF_8));
ByteArrayInputStream is = new ByteArrayInputStream(xhtmlString.getBytes(StandardCharsets.UTF_8));
Enumeration<InputStream> streams = new IteratorEnumeration(Arrays.asList(new InputStream[]{divStart, is, divEnd}).iterator());
try (SequenceInputStream wrapped = new SequenceInputStream(streams);) {
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(wrapped);
// From here you can do whatever you like, but keep in mind the extra element.
XPath xPath = XPathFactory.newInstance().newXPath();
}
catch (Exception e) {
throw new RuntimeException("Failed parsing XML: " + e.getMessage());
}
}
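IteratorEnumeration in the snippet above appears to come from Apache Commons Collections. A JDK-only variant of the same wrapping idea can use Collections.enumeration instead. The following is a sketch with made-up names, and it assumes the fragment does not start with its own XML declaration, since that would no longer be at the beginning of the wrapped stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: wraps a root-less fragment in <div>...</div> and parses it.
final class RootWrapper {
    static Document parseFragment(String fragment) throws Exception {
        List<InputStream> parts = Arrays.asList(
                new ByteArrayInputStream("<div>".getBytes(StandardCharsets.UTF_8)),
                new ByteArrayInputStream(fragment.getBytes(StandardCharsets.UTF_8)),
                new ByteArrayInputStream("</div>".getBytes(StandardCharsets.UTF_8)));
        // SequenceInputStream reads the three streams one after another.
        try (InputStream wrapped = new SequenceInputStream(Collections.enumeration(parts))) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(wrapped);
        }
    }
}
```

The returned document has the artificial div as its root, so the original top-level elements are its children.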

Creating xml from with java

I need your expertise once again. I have a java class that searches a directory for xml files (displays the files it finds in the eclipse console window), applies the specified xslt to these and sends the output to a directory.
What I want to do now is create an xml containing the file names and file format types. The format should be something like;
<file>
<fileName> </fileName>
<fileType> </fileType>
</file>
<file>
<fileName> </fileName>
<fileType> </fileType>
</file>
Where for every file it finds in the directory it creates a new <file>.
Any help is truly appreciated.

Use an XML library. There are plenty around, and the third party ones are almost all easier to use than the built-in DOM API in Java. Last time I used it, JDom was pretty good. (I haven't had to do much XML recently.)
Something like:
Element rootElement = new Element("root"); // You didn't show what this should be
Document document = new Document(rootElement);
for (Whatever file : files)
{
    Element fileElement = new Element("file");
    fileElement.addContent(new Element("fileName").addContent(file.getName()));
    fileElement.addContent(new Element("fileType").addContent(file.getType()));
    rootElement.addContent(fileElement); // attach each <file> to the root
}
String xml = new XMLOutputter().outputString(document);

Have a look at DOM and ECS. The following example was adapted to your requirements from here:
XMLDocument document = new XMLDocument();
for (File f : files) {
    document.addElement( new XML("file")
        .addXMLAttribute("fileName", f.getName())
        .addXMLAttribute("fileType", f.getType()) // getType() stands in for however you determine the type
    );
}

You can use the StringBuilder approach suggested by Vinze, but one caveat is that you will need to make sure your filenames contain no special XML characters, and escape them if they do (for example replace < with &lt;, and deal with quotes appropriately).
In this case it probably doesn't arise and you will get away without it, however if you ever port this code to reuse in another case, you may be bitten by this. So you might want to look at an XMLWriter class which will do all the escaping work for you.
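As one example of a writer that handles the escaping, the JDK's own StAX XMLStreamWriter could be used roughly as below. The class name, the root element name files, and the way the file type is derived from the extension are all assumptions made for illustration; the question does not specify them.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: writes one <file> element per name, escaping as needed.
final class FileListXml {
    static String toXml(List<String> fileNames) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement("files"); // assumed root element name
        for (String name : fileNames) {
            int dot = name.lastIndexOf('.');
            writer.writeStartElement("file");
            writer.writeStartElement("fileName");
            writer.writeCharacters(name); // escapes &, < and > for us
            writer.writeEndElement();
            writer.writeStartElement("fileType");
            // Assumption: the "type" is simply the file extension, if any.
            writer.writeCharacters(dot < 0 ? "" : name.substring(dot + 1));
            writer.writeEndElement();
            writer.writeEndElement(); // </file>
        }
        writer.writeEndElement(); // </files>
        writer.close();
        return out.toString();
    }
}
```

A name such as a&b.xml then ends up as a&amp;b.xml in the output without any manual replacement.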

Well, just use a StringBuilder:
StringBuilder builder = new StringBuilder();
for(File f : files) {
    builder.append("<file>\n\t<fileName>").append(f.getName()).append("</fileName>\n");
    [...]
}
System.out.println(builder.toString());
