Illegal characters in XML - java

I'm creating a program which checks the legitimacy of a given URL. I've already created my own algorithm for this, but now I want to add PhishTank's services into my program.
They provide services where you can directly query a URL from their website, but they have set a certain quota on the number of queries you can make per day. The other option, which I'm going with, is to simply download their database and work with it locally, without restrictions.
The file you get is in XML, and I found some code to test with, but it seems their XML contains illegal characters (such as unicode 0x07, the [BEL] character) inside CDATA, so the parser throws an exception.
<url><![CDATA[http://shaghaf-edu.com/sign-in/??msg=InvalidOnlineIdException&id[BEL]da9ca9b23227a572d1fb5ff4ff91e3&lpOlbResetErrorCounter=0l=&request_locale=en-us]]></url>
I've done a bit of searching, and all I've found are solutions that seem fine for rather small XML files. The one I'm working with is close to 2.7 million lines, and I'm not sure how efficiently a regex or a char-by-char comparison would work in this case.
I should note that their database is updated hourly, and has to be redownloaded. So cleaning the file once manually isn't an option.
So I'm wondering if there is any fast and efficient way of solving this problem?
I don't have the exact code with me, but what I use is a very slight variation of this, which I found here on StackOverflow:
private void start() throws Exception
{
    URL url = new URL("http://localhost:8080/AutoLogin/resource/web.xml");
    URLConnection connection = url.openConnection();
    Document doc = parseXML(connection.getInputStream());
    NodeList descNodes = doc.getElementsByTagName("description");
    for (int i = 0; i < descNodes.getLength(); i++)
    {
        System.out.println(descNodes.item(i).getTextContent());
    }
}

private Document parseXML(InputStream stream) throws Exception
{
    DocumentBuilderFactory objDocumentBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder objDocumentBuilder = objDocumentBuilderFactory.newDocumentBuilder();
    // Any parse failure (for example an illegal character) propagates to the caller.
    return objDocumentBuilder.parse(stream);
}

Answering by asking a question ...
Why not write a simple pre-processing utility?
It could read the XML file as is (line by line); and do whatever is required to turn that content into "correct" XML.
In other words: you should explicitly distinguish between the task of "preparing your input" and "actually working with that XML input". This will also make fine-tuning much easier. If you find that regular expressions are too expensive, just change the pre-processor to not use them, and afterwards measure the effect on runtime.
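One way such a pre-processor could look is a character filter that sits between the file and the parser, so the 2.7-million-line file is cleaned while it streams and no regex is involved. This is only a sketch: the class name XmlCharFilterReader and its clean helper are made up for illustration, and it simply drops every UTF-16 unit outside the XML 1.0 Char ranges (surrogate halves are passed through unchecked).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical pre-processor: silently drops characters that XML 1.0 forbids. */
class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    /** True for UTF-16 units permitted by the XML 1.0 Char production. */
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xD800 && c <= 0xDFFF)   // surrogate halves, not validated as pairs
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i];   // compact the legal characters in place
                }
            }
            if (kept > off) {
                return kept - off;
            }
            // The whole chunk was illegal characters: read the next chunk.
        }
    }

    /** Convenience wrapper for small strings, mainly for trying the filter out. */
    static String clean(String s) {
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the question's code, the filter could wrap the download stream, for example new InputSource(new XmlCharFilterReader(new InputStreamReader(stream, StandardCharsets.UTF_8))) passed to DocumentBuilder.parse, assuming the file really is UTF-8. Because the filter is a separate step, swapping it for a different cleaning strategy and timing the difference stays easy.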

Related

getDocument() constantly returns a null value

I am trying to parse an XML file using Java that lives on a network drive. I have reviewed lots of XML parsing info here but cannot find the answer I need. The problem is that the getDocument() routine constantly returns a null value even though the parser gets an accurate location and file name.
Here is the code...
String ThisXMLFile = XMLFileData.getPath();
DOMParser myXMLParser = new DOMParser();
myXMLParser.parse(ThisXMLFile);
Document doc = myXMLParser.getDocument();
Some notes:
I had to use getPath() as the getName() function did not return the fully qualified file name and path - the XML file lives on a network directory and that directory is mapped on my PC to the 'V' drive
I have imported all the required class header files for DOM objects
The variable names given above are real and accurate so if I have inadvertently used a reserved keyword in a variable declaration then please offer correction.
I have extensive programming experience in a few languages but this is my first real Java app.
all the lines of code and the variables above work, until I reach the last line and then getDocument() just sets the doc variable to null... which makes the rest of the program break.
I believe that you are calling the wrong method. According to your code, you're executing DOMParser.parse(systemId) when you need to call DOMParser.parse(InputSource).
To create an InputSource, you can do this:
InputSource source = new InputSource(new FileInputStream(ThisXMLFile));
myXMLParser.parse(source);
Document doc = myXMLParser.getDocument();
NOTE: remember to close the opened FileInputStream!!!
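A try-with-resources block is one way to make sure the stream is closed even when parsing fails. The sketch below swaps in the JDK's own DocumentBuilder rather than the Xerces DOMParser from the question, so that it runs without extra jars; the class and method names are made up for illustration.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper: parses a stream via InputSource and always closes it. */
class InputSourceParsing {

    static Document parse(InputStream stream) {
        // try-with-resources closes the stream whether or not parse() throws.
        try (InputStream in = stream) {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(in));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    static Document parseFile(String path) {
        try {
            return parse(new FileInputStream(path));
        } catch (FileNotFoundException e) {
            throw new IllegalStateException("No such file: " + path, e);
        }
    }
}
```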
XMLInputFactory XMLFactory = XMLInputFactory.newInstance();
XMLStreamReader XMLReader = XMLFactory.createXMLStreamReader(myXMLStream);
while (XMLReader.hasNext())
{
    if (XMLReader.getEventType() == XMLStreamReader.START_ELEMENT)
    {
        String XMLTag = XMLReader.getLocalName();
        if (XMLTag.equals("value"))
        {
            String idValue = XMLReader.getAttributeValue(null, "id");
            // idValue is null when <value> has no id attribute, so compare from ElementName
            if (ElementName.equals(idValue))
            {
                System.out.println(idValue);
                XMLReader.nextTag();
                System.out.println(XMLReader.getElementText());
            }
        }
    }
    XMLReader.next();
}
So this is the code I finally arrived at. It works and solves the problem of retrieving specific XML data from an XML file. At first I wanted to use NodeLists, Elements, Documents, etc., but those never worked for me; this one did. Thanks to all for the answers given, as they helped me think this one through.

Replace HWPFDocument paragraph text using java results in strange output

I need to replace the text of a paragraph in a .doc file (HWPFDocument) using Java, if the paragraph contains a particular text. The text is replaced, but the output is written in a strange way. Please help me fix this issue.
Code snippet used:
public static HWPFDocument processChange(HWPFDocument doc)
{
try
{
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++)
{
Paragraph paragraph = range.getParagraph(i);
if (paragraph.text().contains("Place Holder"))
{
String text = paragraph.text();
paragraph.replaceText(text, "*******");
}
}
}
catch (Exception ex)
{
ex.printStackTrace();
}
return doc;
}
Input:
Place Holder
Textvalue1
Textvalue2
Textvalue3
Output:
*******Textvalue1
Textvalue1
Textvalue2
Textvalue3
The HWPF library is not in a perfect state for changing / writing .doc files. (At least at the last time that I looked. Some time ago I developed a custom variant of HWPF for my client which - among many other things - provides correct replace and save operations, but that library is not publicly available.)
If you absolutely must use .doc files and Java, you may get away with replacing strings by strings of exactly the same length. For instance "12345" -> "abc__" (_ being spaces or whatever works for you). It might make sense to find the absolute location of the string to be replaced in the doc file (using HWPF) and then change it in the doc file directly (without using HWPF).
The Word file format is very complicated and "doing it right" is not a trivial task. Unless you are willing to spend many man-months, it will also not be possible to fix part of the library so that just saving works. Many data structures must be handled very precisely, and a single "slip-up" makes Word crash on the generated output file.
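The same-length idea can be sketched as a small padding helper. The class and method names are made up for illustration, and whether spaces are an acceptable filler depends on the document.

```java
/** Hypothetical helper: pads a replacement so it is exactly as long as the text it replaces. */
class SameLengthReplacement {

    static String padToLength(String replacement, int length) {
        if (replacement.length() > length) {
            throw new IllegalArgumentException(
                    "Replacement is longer than the original text: " + replacement);
        }
        StringBuilder sb = new StringBuilder(replacement);
        while (sb.length() < length) {
            sb.append(' ');   // filler character; pick whatever suits the document
        }
        return sb.toString();
    }
}
```

In the question's loop this would mean passing padToLength("*******", text.length()) as the new value instead of the bare "*******". Note that text there is the whole paragraph, probably including its end-of-paragraph mark, so it may be safer to replace only the placeholder itself.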

xPath multiple xml files from different url very slow

I need to check only one node in each of 109 files, which are stored at 109 different URLs.
I use this code
public class XPathParserXML {
    public String version(String link, String serial) throws SAXException, IOException,
            ParserConfigurationException, XPathExpressionException {
        String version;
        String url = link + serial;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(url);
        XPath xPath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xPath.compile("//swVersion/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        // evaluate() returns an empty NodeList, not null, when nothing matches
        if (nodes.getLength() == 0) {
            version = "!!WORKING!!";
        } else {
            version = nodes.item(0).getNodeValue();
        }
        return version;
    }
}
and I call the method version(link, serial) in a loop 109 times.
My code takes about 20 seconds to process them all. Each file is 0.64KB and I have a 20MB connection.
What can I do to speed up my code?
1. Object caching:
While that's probably not the only issue, you should definitely cache and reuse all of these objects between calls to version():
DocumentBuilderFactory
DocumentBuilder
XPathFactory
XPathExpression
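A minimal sketch of that caching, assuming single-threaded use: the factories are only needed once, so the class keeps the DocumentBuilder and the compiled XPathExpression as fields. The class name CachedVersionReader and its method names are hypothetical, and the "!!WORKING!!" fallback is taken from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical variant of the question's class that reuses its JAXP objects. Not thread-safe. */
class CachedVersionReader {

    private final DocumentBuilder builder;
    private final XPathExpression swVersion;

    CachedVersionReader() {
        try {
            // Created once instead of on every call to version().
            builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            swVersion = XPathFactory.newInstance().newXPath().compile("//swVersion/text()");
        } catch (Exception e) {
            throw new IllegalStateException("Could not set up JAXP objects", e);
        }
    }

    String version(InputSource source) {
        try {
            Document doc = builder.parse(source);
            // With XPathConstants.STRING the result is "" when nothing matches.
            String value = (String) swVersion.evaluate(doc, XPathConstants.STRING);
            return value.isEmpty() ? "!!WORKING!!" : value;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read swVersion", e);
        }
    }

    String versionFromUrl(String url) {
        return version(new InputSource(url));
    }

    String versionFromXml(String xml) {
        return version(new InputSource(new StringReader(xml)));
    }
}
```

The calling loop would then create one CachedVersionReader and call versionFromUrl(link + serial) 109 times on it.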
2. Circumvention of a known JAXP performance issue:
Besides, you should probably activate one of these flags:
-Dorg.apache.xml.dtm.DTMManager=org.apache.xml.dtm.ref.DTMManagerDefault
or
-Dcom.sun.org.apache.xml.internal.dtm.DTMManager=com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault
See also this question for details:
Java XPath (Apache JAXP implementation) performance
3. Reduce latency impact
Last but not least, you're serially accessing all those XML files over the wire. It may be useful to reduce the impact of your connection latency by parallelising access to those files, e.g. by using multiple threads at the client side. (Note if you choose multi-threading, then beware of thread-safety issues when caching the objects I've mentioned in the first section. Also, avoid creating too many parallel requests at the same time to prevent your server from failing)
Another way to reduce that impact would be to expose those XML files in a ZIP file from the server to avoid multiple connections and transfer all XML files at once.
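The multi-threaded variant could look roughly like the sketch below, which uses a fixed-size thread pool so that only a limited number of requests run at once. The class name, the fetchOne parameter and the pool size are assumptions for illustration; fetchOne stands for whatever downloads and parses a single URL, and it should create its own DocumentBuilder per call (or per thread) because those objects are not thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: runs one lookup per URL on a bounded thread pool. */
class ParallelVersionFetcher {

    static List<String> fetchAll(List<String> urls,
                                 Function<String, String> fetchOne,
                                 int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetchOne.apply(url);
                pending.add(pool.submit(task));
            }
            // Collect in submission order, so results line up with the input list.
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A pool of something like 8 to 16 threads is a reasonable starting point for 109 small files; the best value depends on the server and is worth measuring.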
4. Avoid XML validation if you can trust the source
From your additional comments, I see that you're using XML validation. This is, of course, expensive and should only be done if really needed. Since you run a very arbitrary XPath expression, I take that you don't care too much about XML validation. Best turn it off!
5. If all else fails... Avoid DOM
Since (from your comments) you've measured the parsing to take up most of the CPU, you have two more options to circumvent the whole issue:
Use a SAX parser and abort parsing once you reach the //swVersion element (from your code, I'm assuming that there is only one). SAX is much faster than DOM for these use cases.
Avoid XML entirely and search the document for a regex: <swVersion>(.*?)</swVersion>. That should only be your last resort, because it doesn't handle
namespaces
attributes
whitespace
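The SAX option can be sketched as follows. SAX has no built-in "stop" call, so a common trick is to throw a SAXException from the handler once the value is known and treat that as a normal exit. The class and method names are made up, and the sketch assumes a non-namespace-aware parser, so the element is matched by its plain tag name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: reads the first <swVersion> with SAX and stops parsing right after it. */
class SwVersionSax {

    private static final class Handler extends DefaultHandler {
        private StringBuilder text;   // non-null only while inside <swVersion>
        String version;               // stays null if the element never appears

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("swVersion".equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (text != null && "swVersion".equals(qName)) {
                version = text.toString();
                // Not an error: this only aborts the rest of the parse.
                throw new SAXException("swVersion found, stopping early");
            }
        }
    }

    static String firstVersion(InputSource source) {
        Handler handler = new Handler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (SAXException e) {
            if (handler.version == null) {
                throw new IllegalStateException("XML could not be parsed", e);
            }
            // Otherwise this is the deliberate early exit from endElement().
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be read", e);
        }
        return handler.version;
    }

    static String firstVersionFromXml(String xml) {
        return firstVersion(new InputSource(new StringReader(xml)));
    }
}
```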

Reading XML file returns wrong characters

I have an XML file with thousands of tags to read their text content, as in the screenshot below :
I am trying to read the text content of all the "word" tags using this code :
String filePath = "...";
File xmlFile = new File( filePath );
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document domObject = db.parse( xmlFile );
domObject.getDocumentElement().normalize();
NodeList categoryNodes = domObject.getElementsByTagName( "category" ); // Get all the <category> nodes.
for (int s = 0; s < categoryNodes.getLength(); s++) { //Loop on the <category> nodes.
String categoryName = categoryNodes.item(s).getAttributes().getNamedItem( "name" ).getNodeValue();
if( selectedCategoryName.equals( categoryName ) ) { //get its words.
NodeList wordsNodes = categoryNodes.item(s).getChildNodes();
for( int i = 0; i < wordsNodes.getLength(); i++ ) {
if( wordsNodes.item( i ).getNodeType() != Node.ELEMENT_NODE ) continue;
String word = wordsNodes.item( i ).getTextContent();
categoryWordsList.add( word ); // Some words are read wrong !!
}
break;
}
}
But for some reason many words are being read in wrong manner, examples :
"AMK6780KBU" is read as "9826</word"
"ASSI.ABR30326" is read as "rd>ASSI.AEP26"
"ASSI.25066" is read as "SI.4268</6"
It might be because the file is big. If I just add or remove some empty lines in the XML file, different words are read wrong than the ones mentioned above, which is strange!
You can download the XML file from here.
Solution
See below :-)
What I tried in the process
Changing the XML version from 1.1 -> 1.0 fixed the problem for me. I'm using Java 1.6.0_33 (as @orique pointed out in the comments).
In my tests there are definitely issues with corruption after a certain number of nodes. I narrowed it down to somewhere around ASSI.MTK69609. Removing everything, including that line fixed the corruption of the previous words.
The corruption is also resolved by simply changing the declaration to:
<?xml version="1.0"?>
and I saw zero corruption using the entire original source XML.
Similarly if you leave the version at 1.1 but remove whitespace nodes from the source, the result is as expected, for example:
<word>ASSI.MTK68490</word>
<word>ASSI.MTK6862617</word>
<word>ASSI.MTK693115</word>
<word>ASSI.MTK69609</word>
results in the desired output, whereas the same lines with their leading whitespace left in place, for example
    <word>ASSI.MTK68490</word>
    <word>ASSI.MTK6862617</word>
    <word>ASSI.MTK693115</word>
    <word>ASSI.MTK69609</word>
are corrupted.
Removing some end-of-line "nodes" also corrected the problem, for example
<word>ASSI.MTK693115</word><word>ASSI.MTK69609</word>
So it was all pointing towards a bug, but where...? Eventually it clicked! Xerces
The version of Xerces shipped with Java 1.6 (and probably 1.7) is old, old, old and buggy (for example #6760982). In fact, I can break my test class by simply adding:
Document domObject = db.parse( xmlFile );
domObject.normalizeDocument(); // <-- causes following Exception
Exception in thread "main" java.lang.NullPointerException
at com.sun.org.apache.xerces.internal.util.XML11Char.isXML11ValidNCName(XML11Char.java:340)
There have been many defects fixed for XML 1.1, so on a hunch I downloaded the latest version Xerces2 Java 2.11.0.
Simply running with the most recent version resulted in the expected uncorrupted output.
java -classpath .;xercesImpl.jar;xml-apis.jar Foo > foo.txt
We have noticed that getTextContent() is buggy on some Windows implementations.
Our workaround is to do something like this
// getTextContent is buggy on some Java Windows Implementations
if ( n.getNodeType( ) == Node.ELEMENT_NODE ) {
results [ i ] = (String) xPathFunction.evaluate( "./text()", n, XPathConstants.STRING );
} else { //Node.TEXT_NODE
results [ i ] = n.getNodeValue( );
}
xPathFunction is a javax.xml.xpath.XPath. Expensive, but works reliably.
Actually in your case I would directly use an XPath and call something like,
NodeList l = (NodeList) xPathFunction.evaluate( "/categories/category/word/text()", domObject, XPathConstants.NODESET )
EDIT
Beats me! On OSX, Java 1.6.0_43, I get the same behaviour. In case there was any doubt the DOM model is buggy in Java... The wrong values seem to reliably appear at certain intervals, which looks like some bytes buffer overrun. I never got an OOM error.
Here is what I have unsuccessfully tried:
word.getFirstChild().getNodeValue(); instead of word.getTextContent(); -> no change in behaviour
use an InputSource as an input into the DocumentBuilder instead of using a File
run an XPath ("/categories/category[@name='Category1']/word/text()") instead of looping over the nodes and manually traversing their children
run the same Test using Saxon as the XPath engine
check for "strange" characters in the XML file
I believe the DocumentBuilder is the culprit. It is a memory hog.
Your next best chance is to go for a SAX Parser or any other streaming parser. Since your data model is small and very simple, the implementation should be easy. To further ease implementation, you may try XMLDog. We use a slightly modified version to parse gigabyte size XML files successfully.
If you ever find the issue, please update this post.
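A streaming version of the question's loop could look like the sketch below. It uses StAX (XMLStreamReader) from the JDK as the streaming parser, and it assumes the structure implied by the XPath above: category elements with a name attribute that contain word elements. The class and method names are hypothetical, and whether this path avoids the corruption on an affected Java version has not been verified here.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: collects the <word> texts of one <category> without building a DOM. */
class CategoryWordsStax {

    static List<String> words(Reader xml, String categoryName) {
        List<String> words = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            boolean inCategory = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if ("category".equals(reader.getLocalName())) {
                        inCategory = categoryName.equals(reader.getAttributeValue(null, "name"));
                    } else if (inCategory && "word".equals(reader.getLocalName())) {
                        // getElementText() consumes up to the matching </word>.
                        words.add(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && inCategory && "category".equals(reader.getLocalName())) {
                    break;   // the selected category is complete, skip the rest of the file
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
        return words;
    }

    static List<String> wordsFromXml(String xml, String categoryName) {
        return words(new StringReader(xml), categoryName);
    }
}
```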

Parsing an XML file without root in Java

I have this XML file which doesn't have a root node. Other than manually adding a "fake" root element, is there any way I would be able to parse an XML file in Java? Thanks.
I suppose you could create a new implementation of InputStream that wraps the one you'll be parsing from. This implementation would return the bytes of the opening root tag before the bytes from the wrapped stream and the bytes of the closing root tag afterwards. That would be fairly simple to do.
I may be faced with this problem too. Legacy code, eh?
Ian.
Edit: You could also look at java.io.SequenceInputStream which allows you to append streams to one another. You would need to put your prefix and suffix in byte arrays and wrap them in ByteArrayInputStreams but it's all fairly straightforward.
Your XML document needs a root xml element to be considered well formed. Without this you will not be able to parse it with an xml parser.
One way is to provide your own dummy wrapper, without touching the original (not well-formed) 'xml', by pulling it in as an external entity:
Syntax
<!DOCTYPE some_root_elem SYSTEM "/home/ego/some.dtd"
[
<!ENTITY entity-name "Some value to be inserted at the entity">
]>
Example:
<!DOCTYPE dummy [
<!ENTITY data SYSTEM "http://wherever-my-data-is">
]>
<dummy>
&data;
</dummy>
You could use another parser like Jsoup. It can parse XML without a root.
I think even if any API would have an option for this, it will only return you the first node of the "XML" which will look like a root and discard the rest.
So the answer is probably to do it yourself. Scanner or StringTokenizer might do the trick.
Maybe some html parsers might help, they are usually less strict.
Here's what I did:
There's an old java.io.SequenceInputStream class, which is so old that it takes Enumeration rather than List or such.
With it, you can prepend and append the root element tags (<div> and </div> in my case) around your no-root XML stream. (You shouldn't do it by concatenating Strings due to performance and memory reasons.)
public void tryExtractHighestHeader(ParserContext context)
{
    String xhtmlString = context.getBody();
    if (xhtmlString == null || "".equals(xhtmlString))
        return;
    // The XHTML needs to be wrapped, because it has no root element.
    ByteArrayInputStream divStart = new ByteArrayInputStream("<div>".getBytes(StandardCharsets.UTF_8));
    ByteArrayInputStream divEnd = new ByteArrayInputStream("</div>".getBytes(StandardCharsets.UTF_8));
    ByteArrayInputStream is = new ByteArrayInputStream(xhtmlString.getBytes(StandardCharsets.UTF_8));
    // Collections.enumeration (JDK) supplies the Enumeration that SequenceInputStream expects.
    Enumeration<InputStream> streams = Collections.enumeration(Arrays.<InputStream>asList(divStart, is, divEnd));
    try (SequenceInputStream wrapped = new SequenceInputStream(streams)) {
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        Document xmlDocument = builder.parse(wrapped);
        // From here you can do whatever you like, but keep in mind the extra element.
        XPath xPath = XPathFactory.newInstance().newXPath();
    }
    catch (Exception e) {
        // Keep the original exception as the cause so the stack trace is not lost.
        throw new RuntimeException("Failed parsing XML: " + e.getMessage(), e);
    }
}
