How to efficiently parse concatenated XML documents from a file - java

I have a file that consists of concatenated valid XML documents. I'd like to separate individual XML documents efficiently.
The contents of the concatenated file will look like this; note that the concatenated file is therefore not itself a valid XML document:
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
Each individual XML document is around 1-4 KB, but there are potentially a few hundred of them. All XML documents conform to the same XML Schema.
Any suggestions or tools? I am working in the Java environment.
Edit: I am not sure whether the XML declaration will be present in the documents or not.
Edit: Let's assume that the encoding for all the XML docs is UTF-8.

Don't split! Add one big tag around it! Then it becomes one XML file again:
<BIGTAG>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
</BIGTAG>
Now, using the XPath /BIGTAG/someData would give you all the original root elements.
If the XML declarations (which look like processing instructions) are in the way, you can always use a regex to remove them. It's easier to just remove them all than to use a regex to find all the root nodes.
If the encoding differs between documents, remember this: the combined file itself must have been written in a single encoding, so all the XML documents it contains use that same encoding, no matter what each declaration claims. If the big file is encoded as UTF-16, it doesn't matter that the XML declarations say the XML is UTF-8. It won't be UTF-8, since the whole file is UTF-16. The encoding stated in those XML declarations is therefore invalid.
By merging them into one file, you've altered the encoding...
By RegEx, I mean regular expressions. You just have to remove all the text between a <? and a ?>, which should not be too difficult with a regular expression, and is slightly more complicated with other string manipulation techniques.
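For example, a minimal Java sketch of that approach (the file name is a placeholder; it assumes UTF-8 per the question's edit and that the whole file fits in memory):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Read the whole concatenated file, strip the XML declarations, and wrap
// everything in one synthetic root element.
String content = new String(Files.readAllBytes(Paths.get("concatenated.xml")),
        StandardCharsets.UTF_8);
String withoutDeclarations = content.replaceAll("<\\?xml[^>]*\\?>", "");
String wrapped = "<BIGTAG>" + withoutDeclarations + "</BIGTAG>";
// 'wrapped' is now a single well-formed document that any XML parser can read.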

As Eamon says, if you know the <?xml> thing will always be there, just break on that.
Failing that, look for the ending document-level tag. That is, scan the text counting how many levels deep you are. Every time you see a tag that begins with "<" but not "</" and that does not end with "/>", add 1 to the depth count. Every time you see a tag that begins "</", subtract 1. Every time you subtract 1, check if you are now at zero. If so, you've reached the end of an XML document.
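A rough Java sketch of that scan (it assumes well-formed input and deliberately ignores comments, CDATA sections, and '>' characters inside attribute values, all of which a robust version would need to handle):
import java.util.ArrayList;
import java.util.List;

// Depth-counting scan as described above: each time the depth returns to
// zero, one complete document has been seen.
static List<String> splitByDepth(String input) {
    List<String> docs = new ArrayList<String>();
    int depth = 0;
    int docStart = -1;
    int pos = 0;
    while ((pos = input.indexOf('<', pos)) != -1) {
        int end = input.indexOf('>', pos);
        if (end == -1) {
            break; // truncated input
        }
        String tag = input.substring(pos, end + 1);
        if (tag.startsWith("<?")) {
            // XML declaration or processing instruction: skip it
        } else if (tag.startsWith("</")) {
            depth--;
            if (depth == 0) {
                docs.add(input.substring(docStart, end + 1)); // end of one document
            }
        } else if (tag.endsWith("/>")) {
            if (depth == 0) {
                docs.add(tag); // a self-closing root element is a whole document
            }
        } else {
            if (depth == 0) {
                docStart = pos; // start of a new document
            }
            depth++;
        }
        pos = end + 1;
    }
    return docs;
}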

Since you're not sure the declaration will always be present, you can strip all declarations (a regex such as <\?xml version.*?\?> can find these), prepend <doc-collection> and append </doc-collection>, so that the resultant string is a valid XML document. In it, you can retrieve the separate documents using (for instance) the XPath query /doc-collection/*. If the combined file can be large enough that memory consumption becomes an issue, you may need to use a streaming parser such as SAX, but the principle remains the same.
In a similar scenario I encountered, I simply read the concatenated document directly using an XML parser: although the concatenated file may not be a valid XML document, it is a valid XML fragment (barring the repeated declarations). So, once you strip the declarations, if your parser supports parsing fragments, you can read the result directly; all top-level elements will then be the root elements of the concatenated documents.
In short, if you strip all declarations, you'll have a valid XML fragment which is trivially parseable, either directly or by surrounding it with some tag.
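A minimal sketch of the wrap-and-query variant in Java (strippedContent stands for the declaration-stripped input; exception handling is collapsed into throws):
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Wrap the declaration-stripped content in a synthetic root, then select
// each original root element with the XPath query described above.
static void listConcatenatedRoots(String strippedContent) throws Exception {
    String wrapped = "<doc-collection>" + strippedContent + "</doc-collection>";
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(wrapped)));
    NodeList roots = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("/doc-collection/*", doc, XPathConstants.NODESET);
    for (int i = 0; i < roots.getLength(); i++) {
        // each item is the root element of one of the concatenated documents
        System.out.println(roots.item(i).getNodeName());
    }
}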

This is my answer, for a C# version. Very ugly code that works :-\
public List<T> ParseMultipleDocumentsByType<T>(string documents)
{
    var cleanParsedDocuments = new List<T>();
    var flag = true;
    while (flag)
    {
        if (documents.Contains(typeof(T).Name))
        {
            // assumes each document starts with an XML declaration
            var startingPoint = documents.IndexOf("<?xml");
            var endingString = "</" + typeof(T).Name + ">";
            var endingPoint = documents.IndexOf(endingString) + endingString.Length;
            var document = documents.Substring(startingPoint, endingPoint - startingPoint);
            var singleDoc = (T)XmlDeserializeFromString(document, typeof(T));
            cleanParsedDocuments.Add(singleDoc);
            documents = documents.Remove(startingPoint, endingPoint - startingPoint);
        }
        else
        {
            flag = false;
        }
    }
    return cleanParsedDocuments;
}
public static object XmlDeserializeFromString(string objectData, Type type)
{
    var serializer = new XmlSerializer(type);
    object result;
    using (TextReader reader = new StringReader(objectData))
    {
        result = serializer.Deserialize(reader);
    }
    return result;
}

I don't have a Java answer, but here's how I solved this problem with C#.
I created a class named XmlFileStreams to scan the source document for the XML document declaration and break it up logically into multiple documents:
class XmlFileStreams {
    List<int> positions = new List<int>();
    byte[] bytes;

    public XmlFileStreams(string filename) {
        bytes = File.ReadAllBytes(filename);
        for (int pos = 0; pos < bytes.Length - 5; ++pos)
            if (bytes[pos] == '<' && bytes[pos + 1] == '?' && bytes[pos + 2] == 'x' && bytes[pos + 3] == 'm' && bytes[pos + 4] == 'l')
                positions.Add(pos);
        positions.Add(bytes.Length);
    }

    public IEnumerable<Stream> Streams {
        get {
            if (positions.Count > 1)
                for (int i = 0; i < positions.Count - 1; ++i)
                    yield return new MemoryStream(bytes, positions[i], positions[i + 1] - positions[i]);
        }
    }
}
To use XmlFileStreams:
foreach (Stream stream in new XmlFileStreams(@"c:\tmp\test.xml").Streams) {
    using (var xr = XmlReader.Create(stream, new XmlReaderSettings() { XmlResolver = null, ProhibitDtd = false })) {
        // parse file using xr
    }
}
There are a couple of caveats.
It reads the entire file into memory for processing. This could be a problem if the file is really big.
It uses a simple brute force search to look for the XML document boundaries.
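Since the question asks for Java, here is a rough Java port of the same idea, with the same caveats (whole file in memory, brute-force search, and it assumes every document starts with an XML declaration):
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Scan the raw bytes for "<?xml" boundaries and expose one stream per document.
class XmlFileStreams {
    static List<InputStream> split(String filename) throws java.io.IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        List<Integer> positions = new ArrayList<Integer>();
        for (int pos = 0; pos < bytes.length - 5; ++pos) {
            if (bytes[pos] == '<' && bytes[pos + 1] == '?' && bytes[pos + 2] == 'x'
                    && bytes[pos + 3] == 'm' && bytes[pos + 4] == 'l') {
                positions.add(pos);
            }
        }
        positions.add(bytes.length);
        List<InputStream> streams = new ArrayList<InputStream>();
        for (int i = 0; i < positions.size() - 1; ++i) {
            streams.add(new ByteArrayInputStream(bytes, positions.get(i),
                    positions.get(i + 1) - positions.get(i)));
        }
        return streams;
    }
}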

Related

Extract Reading Order Sequence in a Tagged PDF

I'm currently validating the correct order of the content in a Tagged PDF File.
Is there any way to extract the reading order numbers of Tagged PDF Files programmatically?
I've tried converting the tagged PDF to XML but I can't figure out which tags belong to a certain text.
I've tried the following Libraries:
Syncfusion
iText7
but I can't find any methods that get its reading order numbers.
Is it really possible? Thanks in advance!
You can extract the marked content tree of a tagged PDF using the PdfPig (.Net) library. My understanding is that the reading order is indicated by the marked-content identifier (MCID).
If a marked content element does not contain an MCID (like pagination elements), the MCID is set to -1.
Each MarkedContentElement will contain the letters, images and paths that belong to it:
using UglyToad.PdfPig;
[...]
using (PdfDocument document = PdfDocument.Open(pathToFile))
{
    for (int p = 0; p < document.NumberOfPages; p++)
    {
        var page = document.GetPage(p + 1);
        // extract the page's marked content
        var markedContents = page.GetMarkedContents();
        var orderedMarkedContents = markedContents
            .OrderBy(mc => mc.MarkedContentIdentifier);
        foreach (var mc in orderedMarkedContents)
        {
            // do something
        }
    }
}
If you want to export the result to XML, you can have a look at the PageXmlTextExporter class. See the wiki for more information on ITextExporter and IReadingOrderDetector.
Note: I am an active contributor to this library.

Comparing the text of two PDF files using PDFBox fails even though both files have the same text

I am using PDFBox as a utility in my Selenium automation for export testing. We compare the actual exported PDF file with the expected one using PDFBox, and then pass/fail the test accordingly. This works pretty smoothly. However, I recently came across an actual exported file which looks the same as the expected one (as far as the data is concerned), yet comparing the two with PDFBox fails.
Expected pdf file
Actual pdf file
Below is the general utility I am using to compare PDF files:
private static void arePDFFilesEqual(File pdfFile1, File pdfFile2) throws IOException
{
    LOG.info("Comparing PDF files (" + pdfFile1 + "," + pdfFile2 + ")");
    PDDocument pdf1 = PDDocument.load(pdfFile1);
    PDDocument pdf2 = PDDocument.load(pdfFile2);
    PDPageTree pdf1pages = pdf1.getDocumentCatalog().getPages();
    PDPageTree pdf2pages = pdf2.getDocumentCatalog().getPages();
    try
    {
        if (pdf1pages.getCount() != pdf2pages.getCount())
        {
            String message = "Number of pages in the files (" + pdfFile1 + "," + pdfFile2
                    + ") do not match. pdfFile1 has " + pdf1pages.getCount()
                    + " pages, while pdfFile2 has " + pdf2pages.getCount() + " pages";
            LOG.debug(message);
            throw new TestException(message);
        }
        PDFTextStripper pdfStripper = new PDFTextStripper();
        LOG.debug("pdfStripper is :- " + pdfStripper);
        LOG.debug("pdf1pages.size() is :- " + pdf1pages.getCount());
        for (int i = 0; i < pdf1pages.getCount(); i++)
        {
            pdfStripper.setStartPage(i + 1);
            pdfStripper.setEndPage(i + 1);
            String pdf1PageText = pdfStripper.getText(pdf1);
            String pdf2PageText = pdfStripper.getText(pdf2);
            if (!pdf1PageText.equals(pdf2PageText))
            {
                String message = "Contents of the files (" + pdfFile1 + "," + pdfFile2
                        + ") do not match on page no: " + (i + 1)
                        + " pdf1PageText is : " + pdf1PageText
                        + " , while pdf2PageText is : " + pdf2PageText;
                LOG.debug(message);
                LOG.debug("pdf1PageText is " + pdf1PageText);
                LOG.debug("pdf2PageText is " + pdf2PageText);
                String difference = StringUtils.difference(pdf1PageText, pdf2PageText);
                LOG.debug("difference is " + difference);
                throw new TestException(message + " [[ Difference is ]] " + difference);
            }
        }
        LOG.info("PDF files (" + pdfFile1 + "," + pdfFile2 + ") match");
    } finally {
        pdf1.close();
        pdf2.close();
    }
}
Eclipse shows these differences in the console:
https://s3.amazonaws.com/uploads.hipchat.com/95223/845692/9Ex0QW2fFeRqu8s/upload.png
I can see it is failing because of symbols like curly braces {}, hash #, and exclamation mark !, but I don't know how to fix it.
Can anyone please tell me how to fix this?
However, I recently came across an actual exported file which looks the same as the expected one (as far as the data is concerned), yet comparing it with PDFBox fails.
That this might happen should not surprise you. After all, your test does not compare the looks of the pages in question but the results of text extraction.
While the look of textual data on the pages depends on the drawing instructions for the glyphs in question in the respective (in case of your files) embedded font file, the result of text extraction of the same textual data on the pages depends on the ToUnicode table or Encoding value of the PDF font information structures for that font file.
And indeed, while the textual data of the expected and the actual document use the same glyphs of the respective fonts, the ToUnicode tables in the expected and the actual document for one font claim that certain glyphs represent different Unicode code points.
The font in question has three glyphs of interest.
The ToUnicode map for that font in your expected document contains the mappings
<0000> <0000> <0000>
<0001> <0002> [<F125> <F128> ]
which claim that these three characters correspond to U+0000, U+F125, and U+F128.
The ToUnicode map for that font in your actual document contains the mappings
<0000> <0000> <0000>
<0001> <0002> [<F126> <F129> ]
which claim that these three characters correspond to U+0000, U+F126, and U+F129.
Thus, your test has correctly found a difference between the expected and actual documents, so its failure result is correct. You don't have to fix anything; the software producing the actual document has an issue!
(One could argue that the differences are inside Unicode private use areas and don't matter. In that case you'd have to update your test to ignore differences in characters from the Unicode private use areas. But that is something you should have been told before you started creating tests.)
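If you do decide to ignore them, a minimal sketch of such a filter might look like this (it only covers the Basic Multilingual Plane private use area, U+E000 to U+F8FF, which is where the code points above fall):
// Strip BMP private-use-area characters (U+E000..U+F8FF) before comparing.
// Note: this ignores the supplementary private use planes.
static String stripPrivateUseArea(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
        if (c < '\uE000' || c > '\uF8FF') {
            sb.append(c);
        }
    }
    return sb.toString();
}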
This is a tough one, since similar or even the same Unicode characters might have different byte representation, depending on font, encoding and other factors during PDF generation.
A possible solution I can think of, if you can safely assume that the relevant text pieces are represented by 8-bit characters:
String stripUnicode(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
        if (c <= 0xFF) {
            sb.append(c);
        }
    }
    return sb.toString();
}
...
String pdf1PageText = pdfStripper.getText(pdf1);
String pdf2PageText = pdfStripper.getText(pdf2);
if (!stripUnicode(pdf1PageText).equals(stripUnicode(pdf2PageText)))
...
If you need Unicode support, you need to implement your own custom comparison algorithm that is able to identify similar characters and treat them as equal.
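One possible building block for such an algorithm is Unicode compatibility normalization, which folds many visually similar variants to a common form; whether it helps depends on which characters actually differ in your documents. A sketch:
import java.text.Normalizer;

// Fold compatibility variants (e.g. ligatures, full-width forms) to a common
// form and drop combining marks, so "similar" characters compare as equal.
static String fold(String s) {
    String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
    return decomposed.replaceAll("\\p{M}+", "");
}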

XML file loses its format after reading and writing in Java

I'm writing a program in Java that reads an XML file, makes some modifications, and then writes the file back in the same format.
The following is the code block that reads and writes the XML file:
final Document fileDocument = parseFileAsDocument(file);
final OutputFormat format = new OutputFormat(fileDocument);
try {
    final FileWriter out = new FileWriter(file);
    final XMLSerializer serializer = new XMLSerializer(out, format);
    serializer.serialize(fileDocument);
    out.close(); // flush and release the file
}
catch (final IOException e) {
    System.out.println(e.getMessage());
}
This is the method used to parse the file:
private Document parseFileAsDocument(final File file) {
    Document inputDocument = null;
    try {
        inputDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
    } // catching some exceptions {}
    return inputDocument;
}
I'm noticing two changes after the file is written:
Before I had a node similar to this:
<instance ref='filter'>
<value></value>
</instance>
After reading and writing, the node looks like this:
<instance ref="filter">
<value/>
</instance>
As you can see from above, the single quotes around 'filter' have been changed to double quotes.
The second change is that <value></value> has been changed to <value/>. This happens throughout the XML file wherever a node like <tag></tag> has no value in between. If there is something like <tag>somevalue</tag>, there is no issue.
Any thoughts on how to keep the format of the XML nodes the same after writing?
I'd appreciate it!
You can't, and you shouldn't try. It's a bit like complaining that when you add 0123 and 0234, you get 357 without the leading zeroes. Leading zeroes in integers aren't considered significant, so arithmetic operations don't preserve them. The same happens to insignificant details of your XML, like the distinction between double quotes and single quotes, and the distinction between a self-closing tag and a start/end tag pair for an empty element. If any consumer of the XML is depending on these details, they need to be sent for retraining.
The most usual reason for asking for lexical details to be preserved is that you want to detect changes. But this means you are doing your comparisons the wrong way: you should be comparing at the logical level, not the physical level. One way to do comparisons is to canonicalize the XML, so whenever there is an arbitrary choice to be made between equivalent representations, it is made the same way.
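For example, a logical-level comparison with plain DOM could look like this (a sketch; note that whitespace-only text nodes between elements still count as content, so you may additionally need to strip those):
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Compare two XML files at the logical level: quote style and
// <value></value> vs <value/> do not affect the result.
static boolean logicallyEqual(File a, File b) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document docA = db.parse(a);
    Document docB = db.parse(b);
    docA.normalize();
    docB.normalize();
    return docA.isEqualNode(docB);
}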

Reading XML file returns wrong characters

I have an XML file with thousands of tags whose text content I need to read, as in the screenshot below:
I am trying to read the text content of all the "word" tags using this code :
String filePath = "...";
File xmlFile = new File(filePath);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document domObject = db.parse(xmlFile);
domObject.getDocumentElement().normalize();
NodeList categoryNodes = domObject.getElementsByTagName("category"); // Get all the <category> nodes.
for (int s = 0; s < categoryNodes.getLength(); s++) { // Loop on the <category> nodes.
    String categoryName = categoryNodes.item(s).getAttributes().getNamedItem("name").getNodeValue();
    if (selectedCategoryName.equals(categoryName)) { // get its words.
        NodeList wordsNodes = categoryNodes.item(s).getChildNodes();
        for (int i = 0; i < wordsNodes.getLength(); i++) {
            if (wordsNodes.item(i).getNodeType() != Node.ELEMENT_NODE) continue;
            String word = wordsNodes.item(i).getTextContent();
            categoryWordsList.add(word); // Some words are read wrong !!
        }
        break;
    }
}
But for some reason many words are read incorrectly, for example:
"AMK6780KBU" is read as "9826</word"
"ASSI.ABR30326" is read as "rd>ASSI.AEP26"
"ASSI.25066" is read as "SI.4268</6"
It might be because the file is big. If I just add or remove some empty lines in the XML file, different words are read wrong than the ones mentioned above, which is strange!
You can download the XML file from here.
Solution
See below :-)
What I tried in the process
Changing the XML version from 1.1 -> 1.0 fixed the problem for me. I'm using Java 1.6.0_33 (as @orique pointed out in the comments).
In my tests there are definitely issues with corruption after a certain number of nodes. I narrowed it down to somewhere around ASSI.MTK69609. Removing everything, including that line, fixed the corruption of the previous words.
The corruption is also resolved by simply changing the declaration to:
<?xml version="1.0"?>
and I saw zero corruption using the entire original source XML.
Similarly, if you leave the version at 1.1 but remove the whitespace nodes from the source, the result is as expected. For example:
<word>ASSI.MTK68490</word><word>ASSI.MTK6862617</word><word>ASSI.MTK693115</word><word>ASSI.MTK69609</word>
results in the desired output, while
<word>ASSI.MTK68490</word>
<word>ASSI.MTK6862617</word>
<word>ASSI.MTK693115</word>
<word>ASSI.MTK69609</word>
is corrupted.
Removing some end-of-line "nodes" also corrected the problem, for example
<word>ASSI.MTK693115</word><word>ASSI.MTK69609</word>
So it was all pointing towards a bug, but where...? Eventually it clicked: Xerces.
The version of Xerces shipped with Java 1.6 (and probably 1.7) is old, old, old and buggy (for example #6760982). In fact, I can break my test class by simply adding:
Document domObject = db.parse( xmlFile );
domObject.normalizeDocument(); // <-- causes following Exception
Exception in thread "main" java.lang.NullPointerException
at com.sun.org.apache.xerces.internal.util.XML11Char.isXML11ValidNCName(XML11Char.java:340)
There have been many defects fixed for XML 1.1, so on a hunch I downloaded the latest version Xerces2 Java 2.11.0.
Simply running with the most recent version resulted in the expected uncorrupted output.
java -classpath .;xercesImpl.jar;xml-apis.jar Foo > foo.txt
We have noticed that getTextContent() is buggy on some Windows implementations.
Our workaround is to do something like this
// getTextContent is buggy on some Java Windows implementations
if (n.getNodeType() == Node.ELEMENT_NODE) {
    results[i] = (String) xPathFunction.evaluate("./text()", n, XPathConstants.STRING);
} else { // Node.TEXT_NODE
    results[i] = n.getNodeValue();
}
xPathFunction is a javax.xml.xpath.XPath. Expensive, but it works reliably.
Actually, in your case I would directly use an XPath and call something like:
NodeList l = (NodeList) xPathFunction.evaluate( "/categories/category/word/text()", domObject, XPathConstants.NODESET );
EDIT
Beats me! On OSX, Java 1.6.0_43, I get the same behaviour. In case there was any doubt, the DOM model is buggy in Java... The wrong values seem to appear reliably at certain intervals, which looks like some byte-buffer overrun. I never got an OOM error.
Here is what I have unsuccessfully tried:
word.getFirstChild().getNodeValue(); instead of word.getTextContent(); -> no change in behaviour
use an InputSource as an input into the DocumentBuilder instead of using a File
run an XPath ("/categories/category[@name='Category1']/word/text()") instead of looping over the nodes and manually traversing their children
run the same Test using Saxon as the XPath engine
check for "strange" characters in the XML file
I believe the DocumentBuilder is the culprit. It is a memory hog.
Your next best chance is to go for a SAX parser or any other streaming parser. Since your data model is small and very simple, the implementation should be easy. To further ease implementation, you may try XMLDog. We use a slightly modified version to parse gigabyte-size XML files successfully.
If you ever find the issue, please update this post.
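For reference, a rough sketch of the plain SAX route, with the element and attribute names assumed from the question's snippet:
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collect the text of every <word> inside the selected <category>,
// streaming the file instead of building a DOM.
class WordHandler extends DefaultHandler {
    private final String selectedCategoryName;
    final List<String> words = new ArrayList<String>();
    private boolean inSelectedCategory;
    private StringBuilder current;

    WordHandler(String selectedCategoryName) {
        this.selectedCategoryName = selectedCategoryName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("category".equals(qName)) {
            inSelectedCategory = selectedCategoryName.equals(attrs.getValue("name"));
        } else if (inSelectedCategory && "word".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("word".equals(qName) && current != null) {
            words.add(current.toString());
            current = null;
        }
    }
}

// Usage:
// WordHandler handler = new WordHandler("Category1");
// SAXParserFactory.newInstance().newSAXParser().parse(new File("words.xml"), handler);
// List<String> words = handler.words;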

Parsing an XML file without root in Java

I have this XML file which doesn't have a root node. Other than manually adding a "fake" root element, is there any way I would be able to parse such an XML file in Java? Thanks.
I suppose you could create a new implementation of InputStream that wraps the one you'll be parsing from. This implementation would return the bytes of the opening root tag before the bytes from the wrapped stream and the bytes of the closing root tag afterwards. That would be fairly simple to do.
I may be faced with this problem too. Legacy code, eh?
Ian.
Edit: You could also look at java.io.SequenceInputStream which allows you to append streams to one another. You would need to put your prefix and suffix in byte arrays and wrap them in ByteArrayInputStreams but it's all fairly straightforward.
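A compact sketch of that idea (assuming UTF-8 and a hypothetical file no-root.xml; the full worked example further down this page does the same thing):
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;

// Sandwich the root-less stream between a synthetic opening and closing tag.
InputStream wrapped = new SequenceInputStream(Collections.enumeration(Arrays.<InputStream>asList(
        new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
        Files.newInputStream(Paths.get("no-root.xml")),
        new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)))));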
Your XML document needs a root element to be considered well-formed. Without one you will not be able to parse it with an XML parser.
One way is to provide your own dummy wrapper without touching the original (not-well-formed) 'XML', using a DOCTYPE with an external entity:
Syntax
<!DOCTYPE some_root_elem SYSTEM "/home/ego/some.dtd"
[
<!ENTITY entity-name "Some value to be inserted at the entity">
]>
Example:
<!DOCTYPE dummy [
<!ENTITY data SYSTEM "http://wherever-my-data-is">
]>
<dummy>
&data;
</dummy>
You could use another parser like Jsoup. It can parse XML without a root.
I think even if an API had an option for this, it would only return the first node of the "XML", which looks like a root, and discard the rest.
So the answer is probably to do it yourself. Scanner or StringTokenizer might do the trick.
Maybe some HTML parsers might help; they are usually less strict.
Here's what I did:
There's an old java.io.SequenceInputStream class, which is so old that it takes Enumeration rather than List or such.
With it, you can prepend and append the root element tags (<div> and </div> in my case) around your no-root XML stream. (You shouldn't do it by concatenating Strings, for performance and memory reasons.)
public void tryExtractHighestHeader(ParserContext context)
{
    String xhtmlString = context.getBody();
    if (xhtmlString == null || "".equals(xhtmlString))
        return;

    // The XHTML needs to be wrapped, because it has no root element.
    ByteArrayInputStream divStart = new ByteArrayInputStream("<div>".getBytes(StandardCharsets.UTF_8));
    ByteArrayInputStream divEnd = new ByteArrayInputStream("</div>".getBytes(StandardCharsets.UTF_8));
    ByteArrayInputStream is = new ByteArrayInputStream(xhtmlString.getBytes(StandardCharsets.UTF_8));
    // IteratorEnumeration is from Apache Commons Collections.
    Enumeration<InputStream> streams = new IteratorEnumeration(Arrays.asList(new InputStream[]{divStart, is, divEnd}).iterator());

    try (SequenceInputStream wrapped = new SequenceInputStream(streams)) {
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        Document xmlDocument = builder.parse(wrapped);
        // From here you can do whatever you like, but keep in mind the extra element.
        XPath xPath = XPathFactory.newInstance().newXPath();
    }
    catch (Exception e) {
        throw new RuntimeException("Failed parsing XML: " + e.getMessage(), e);
    }
}
