I googled it and didn't find an answer - maybe I didn't search for it right.
My question: I'm parsing XML from the beginning to the end of the document - one way.
What if somewhere in the middle I need to send the parser back to the start of the document?
I only know of myXmlPullParser.next(); (or any of the other next methods) for moving forward, but at some condition I need to start parsing from the beginning of the document again.
Is it possible?
Sure. Create a new pull parser instance, using the same code as you used to create the first one. Or, try calling setInput() on your existing instance, providing it a fresh copy of the data.
If it is reading from a file (or the network), I am thinking that caching the data in an input stream and reading it back into the parser may be the fastest route. I will work up an example.
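Along those lines, here is a minimal sketch of the buffering approach, assuming the whole document fits in memory; the file name and the "restart" trigger element are placeholders:

import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;
import java.io.FileReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;

public class RestartableParse {
    public static void main(String[] args) throws Exception {
        // Cache the document once so we can re-read it cheaply.
        StringWriter buffer = new StringWriter();
        try (Reader in = new FileReader("doc.xml")) {
            in.transferTo(buffer);
        }
        String xml = buffer.toString();

        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new StringReader(xml));

        boolean restarted = false;
        int event = parser.getEventType();
        while (event != XmlPullParser.END_DOCUMENT) {
            if (!restarted && event == XmlPullParser.START_TAG
                    && "restart".equals(parser.getName())) {
                // Condition met: send the parser back to the start of the document.
                parser.setInput(new StringReader(xml));
                restarted = true; // guard against restarting forever
            }
            event = parser.next();
        }
    }
}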
Related
I am building a tool to parse huge JSON, around 1 GB. In that logic, I create a JsonParser object and keep reading until it reaches the expected JsonToken. Then I create another JsonParser (call it the child), which should start from the previous JsonParser's token position without much overhead. Is there a way to do that in the JsonParser API? I am using skipChildren(), which is also taking time in my scenario.
You can try to call releaseBuffered(...) to get the data that have been read but not yet consumed by the parser, and then prepend those data to the input stream (getInputSource()) to somehow parse the resulting stream (one way to do this might be to use an input stream that supports marks when constructing the parser).
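For example (a sketch only - the file name is a placeholder, and this assumes the parser was created from an InputStream):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;

public class ChildParserSketch {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        InputStream in = new FileInputStream("huge.json");
        JsonParser parser = factory.createParser(in);

        // ... advance 'parser' until it reaches the expected JsonToken ...

        // Recover the bytes Jackson has read from the stream but not consumed.
        ByteArrayOutputStream buffered = new ByteArrayOutputStream();
        parser.releaseBuffered(buffered);

        // Prepend those bytes to the rest of the stream and build the child
        // parser on the combined stream, so it starts at the current position.
        // (Continue with 'child' only; the original parser's buffer is gone.)
        InputStream combined = new SequenceInputStream(
                new ByteArrayInputStream(buffered.toByteArray()), in);
        JsonParser child = factory.createParser(combined);
    }
}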
However, since you're already using a stream based API, you probably won't get better performance than with skipChildren().
I would like to know the difference between StAX and SAX parsing in Java.
Can someone explain it as simply as possible? I don't understand what it means that one is pulling data and the other is pushing it.
"Push" and "Pull" refer to the style of coding that is used.
For "Push," you register a "handler" that the parser calls as it works its way through the document. So, you register your handlers with the parser and then tell it to parse the document. Your handlers will be called by the parser to tell your code when an element is starting, ending, etc.
For "Pull," your code is driving the step-by-step process of parsing the document. It is like getting an Iterator for the document and your code is going to loop and ask for the next element from the parser. In other words, your "handler" code is calling the parser for the next element to handle.
The different coding styles make different types of interactions with the document easier or harder. The choice of which style to use for a particular project is dependent on the requirements of that project.
I am using Java SAX parser to parse XML data sent from a third party source that is around 3 GB. I am getting an error resulting from the XML document not being well formed: The processing instruction target matching "[xX][mM][lL]" is not allowed.
As far as I understand, this is normally due to a character being somewhere it should not be.
Main problem: I cannot manually edit these files due to their very large size.
I was wondering if there is a workaround for files that are too large to open and edit manually - some way to write code that removes any problematic characters automatically.
I would think the most likely explanation is that the file contains a concatenation of several XML documents, or perhaps an embedded XML document: either way, an XML declaration that isn't at the start of the file.
A lot now depends on your relationship with the supplier of the bad data. If they sent you faulty equipment or buggy software, you would presumably complain and ask them to fix it. But if you don't have a service relationship with the third party, you either have to change supplier or do the best you can with the faulty input, which means repairing the fault yourself. In general, you can't repair faulty XML unless you know what kind of fault you are looking for, and that can be very difficult to determine if the files are large (or if the failures are very rare).
The data isn't XML, so don't try to use XML tools to process it. Use text processing tools such as sed or awk. The first step is to search the file for occurrences of <?xml and see if that gives any hints.
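If you would rather stay in Java, a minimal sketch of that first step - streaming through the file and printing the byte offset of every <?xml occurrence - could look like this (the file path is a placeholder):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FindXmlDeclarations {
    public static void main(String[] args) throws IOException {
        byte[] needle = "<?xml".getBytes("US-ASCII");
        try (InputStream in = new BufferedInputStream(
                new FileInputStream("huge-feed.xml"))) {
            long offset = 0;  // offset of the byte we are about to read
            int matched = 0;  // how many bytes of the needle matched so far
            int b;
            while ((b = in.read()) != -1) {
                if (b == needle[matched]) {
                    if (++matched == needle.length) {
                        System.out.println("<?xml at byte offset "
                                + (offset - needle.length + 1));
                        matched = 0;
                    }
                } else {
                    matched = (b == needle[0]) ? 1 : 0;
                }
                offset++;
            }
        }
    }
}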
This error occurs if the XML declaration appears anywhere but at the very beginning of the document. The reason might be:
Whitespace before the XML declaration
Any hidden character before the XML declaration
The XML declaration appears anywhere else in the document
You should start by checking case #2; see here: http://www.w3.org/International/questions/qa-byte-order-mark#remove
If that doesn't help, you should remove leading whitespace from the document. You could do that by wrapping the original InputStream with another InputStream and using that wrapper to remove the whitespace.
The same can be done if you are facing case #3, but the implementation would be a bit more complex.
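As a starting point for cases #1 and #2, something along these lines might work - a wrapper that discards a UTF-8 byte order mark and any leading whitespace before the parser ever sees the stream (a sketch only, assuming UTF-8 input):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Skips a UTF-8 BOM and leading whitespace, then behaves like the
// original stream.
public class LeadingJunkSkipper extends FilterInputStream {
    private boolean atStart = true;

    public LeadingJunkSkipper(InputStream original) {
        super(new PushbackInputStream(original, 3));
    }

    @Override
    public int read() throws IOException {
        skipJunkOnce();
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        skipJunkOnce();
        return super.read(b, off, len);
    }

    private void skipJunkOnce() throws IOException {
        if (!atStart) {
            return;
        }
        atStart = false;
        PushbackInputStream pin = (PushbackInputStream) in;
        // Drop a UTF-8 BOM (EF BB BF) if one is present.
        byte[] bom = new byte[3];
        int n = pin.read(bom, 0, 3);
        boolean hasBom = n == 3 && (bom[0] & 0xFF) == 0xEF
                && (bom[1] & 0xFF) == 0xBB && (bom[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pin.unread(bom, 0, n);
        }
        // Drop whitespace characters before the XML declaration.
        int c;
        while ((c = pin.read()) != -1) {
            if (!Character.isWhitespace(c)) {
                pin.unread(c);
                break;
            }
        }
    }
}

You would then hand new LeadingJunkSkipper(rawStream) to the SAX parser instead of the raw stream.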
Is there a way to accurately gather the byte offsets of xml tags using the XMLStreamReader?
I have a large xml file that I require random access to. Rather than writing the whole thing to a database, I would like to run through it once with an XMLStreamReader to gather the byte offsets of significant tags, and then be able to use a RandomAccessFile to retrieve the tag content later.
XMLStreamReader doesn't seem to have a way to track character offsets. Instead, people recommend attaching the XMLStreamReader to an input stream that tracks how many bytes have been read (the CountingInputStream provided by Apache commons-io, for example), e.g.:
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.io.input.CountingInputStream;

// xmlFile is a java.io.File defined elsewhere
XMLInputFactory xmlStreamFactory = XMLInputFactory.newInstance();
CountingInputStream countingReader =
        new CountingInputStream(new FileInputStream(xmlFile));
XMLStreamReader xmlStreamReader =
        xmlStreamFactory.createXMLStreamReader(countingReader, "UTF-8");
while (xmlStreamReader.hasNext()) {
    int eventCode = xmlStreamReader.next();
    switch (eventCode) {
        case XMLStreamReader.END_ELEMENT:
            System.out.println(xmlStreamReader.getLocalName()
                    + " #" + countingReader.getByteCount());
    }
}
xmlStreamReader.close();
Unfortunately there must be some buffering going on, because the above code prints out the same byte offsets for several tags. Is there a more accurate way of tracking byte offsets in xml files (ideally without resorting to abandoning proper xml parsing)?
You could use getLocation() on the XMLStreamReader (or XMLEvent.getLocation() if you use XMLEventReader), but I remember reading somewhere that it is not reliable or precise. And it looks like it gives the end point of the tag, not the starting location.
I have a similar need to precisely know the location of tags within a file, and I'm looking at other parsers to see if there is one that guarantees to give the necessary level of location precision.
You could use a wrapper input stream around the actual input stream, deferring to the wrapped stream for the actual I/O operations but keeping an internal byte count, with a method to retrieve the current offset.
Unfortunately, Aalto doesn't implement the LocationInfo interface.
The latest Java VTD-XML implementation from XimpleWare, currently 2.11 (available on SourceForge or on GitHub), provides some code maintaining a byte offset after each call to the getChar() method of its IReader implementations. IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java, covering the following encodings:
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
Updating IReader with a getCharOffset() method and implementing it (by adding a charCount member alongside the offset member of the VTDGen and VTDGenHuge classes, and incrementing it upon each getChar() and skipChar() call in each IReader implementation) should give you the start of a solution.
I think I've found another option. If you replace your switch block with the following, it will dump the position immediately after the end element tag.
switch (eventCode) {
    case XMLStreamReader.END_ELEMENT:
        System.out.println(xmlStreamReader.getLocalName() + " end#"
                + xmlStreamReader.getLocation().getCharacterOffset());
}
This solution also requires that the actual start positions of the end tags be calculated manually, but it has the advantage of not needing an external JAR file.
I was not able to track down some minor inconsistencies in the reported locations (I think it has to do with how I initialized my XMLStreamReader), but I always saw a consistent increase in the location as the reader moved through the content.
Hope this helps!
I recently worked out a solution for a similar question, How to find character offsets in big XML files using java?. I think it provides a good solution based on an ANTLR-generated XML parser.
I just burned a long weekend on this, and arrived at the solution partially thanks to some clues here. Remarkably, I don't think this has gotten much easier in the 10 years since the OP posted this question.
TL;DR Use Woodstox and char offsets
The first problem to contend with is that most XMLStreamReader implementations seem to provide inaccurate results when you ask them for their current offsets. Woodstox however seems to be rock-solid in this regard.
The second problem is the actual type of offset you use. Unfortunately, it seems that you have to use char offsets if you need to work with a multi-byte charset, which means random-access retrieval from the file is not going to be very efficient - you can't just set a pointer into the file at your offset and start reading; you have to read through until you get to the offset, then start extracting. There may be a more efficient way to do this that I haven't thought of, but the performance is acceptable for my case. 500MB files are pretty snappy.
[edit] So this turned into one of those splinter-in-my-mind things, and I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need to get the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for the char offset. We can get the byte offset from the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile.
I created a library for this, it's on GitHub and Maven Central. If you just want to get the important bits, the party trick is in the ByteTrackingReader.
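To illustrate the idea, here is a rough sketch of the technique (not the actual ByteTrackingReader from the library, and it assumes well-formed UTF-8 input):

import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.util.ArrayList;

// Decodes UTF-8 itself so it can record the byte offset at which each
// char offset begins; a parser-reported char offset can then be mapped
// back to a byte offset for use with RandomAccessFile. A real version
// would use a compact, windowed structure instead of one Long per char.
public class ByteOffsetTrackingReader extends Reader {
    private final InputStream in;
    private final ArrayList<Long> charStartBytes = new ArrayList<>();
    private long byteOffset = 0;
    private int pendingLowSurrogate = -1;

    public ByteOffsetTrackingReader(InputStream in) {
        this.in = in;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int i = 0;
        while (i < len) {
            int c = readOneChar();
            if (c < 0) {
                break;
            }
            cbuf[off + i++] = (char) c;
        }
        return i == 0 ? -1 : i;
    }

    private int readOneChar() throws IOException {
        if (pendingLowSurrogate >= 0) {
            // Second half of a 4-byte sequence: same starting byte offset.
            int low = pendingLowSurrogate;
            pendingLowSurrogate = -1;
            charStartBytes.add(charStartBytes.get(charStartBytes.size() - 1));
            return low;
        }
        int b0 = in.read();
        if (b0 < 0) {
            return -1;
        }
        charStartBytes.add(byteOffset);
        // In UTF-8 the lead byte determines the sequence length.
        int n = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
        int cp = (n == 1) ? b0 : b0 & (0x3F >> (n - 1));
        for (int k = 1; k < n; k++) {
            cp = (cp << 6) | (in.read() & 0x3F);
        }
        byteOffset += n;
        if (cp >= 0x10000) {
            // Code point above the BMP: emit a surrogate pair.
            pendingLowSurrogate = Character.lowSurrogate(cp);
            return Character.highSurrogate(cp);
        }
        return cp;
    }

    // Map a char offset reported by the parser to a byte offset in the file.
    public long byteOffsetOfChar(int charOffset) {
        return charStartBytes.get(charOffset);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}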
[/edit]
There is another similar question on SO about this (but the accepted answer frightened and confused me), and some people commented that this whole thing is a bad idea - why would you want to do it? XML is a transport mechanism; you should just import it into a DB and work with the data using more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML (still going strong in 2020), you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents; having the ability to quickly extract a specific set of items from a massive file and verify not only the contents but the format itself is essential.
Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution. God help you if you're finding this in 2030, trying to solve the same problem.
I'd like to parse large XML files and read in a complete node at a time from Java. The files are too large to put in a tree. I'd like to use a pull parser if possible, since it appears to be easier to program for.
Given XML data made up of a series of record nodes, instead of having to check every event while using the StAX parser, I'd like each call to hasNext or some similar function to return an object containing the complete info on a record node. When using Perl, XML::LibXML::Reader allows me to do this using its read method, so I'm looking for an equivalent in Java.
Commons Digester is really good for this type of problem. It allows you to configure parsing rules whereby when the parser encounters certain tags it performs some action (e.g. calls a factory method to create an object). You don't have to write any parsing code, making development fast and lightweight.
Below is a simple example pattern you could use:
<pattern value="myConfigFile/foos/foo">
<factory-create-rule classname="FooFactory"/>
<set-next-rule methodname="processFoo" paramtype="com.foo.Foo"/>
</pattern>
When the parser encounters the "foo" tag it will call createObject(Attributes) on FooFactory, which will create a Foo object. The parser will then call processFoo on the object at the top of the Digester stack (you would typically push this onto the stack before commencing parsing). You could therefore implement processFoo to either add these objects to a collection, or if your file is too big simply process each object as it arrives and then throw it away.
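A sketch of the same rules expressed through the Digester API as I understand it (Foo, FooFactory, and the "name" attribute are hypothetical, carried over from the pattern above; this uses the Digester 3 package):

package com.foo;

import java.io.File;
import org.apache.commons.digester3.AbstractObjectCreationFactory;
import org.apache.commons.digester3.Digester;
import org.xml.sax.Attributes;

public class FooLoader {
    public void processFoo(Foo foo) {
        // Process each Foo as it arrives, then let it be garbage collected.
        System.out.println("parsed: " + foo.name);
    }

    public static void main(String[] args) throws Exception {
        Digester digester = new Digester();
        digester.push(new FooLoader()); // target for the set-next rule
        digester.addFactoryCreate("myConfigFile/foos/foo", FooFactory.class);
        digester.addSetNext("myConfigFile/foos/foo", "processFoo", "com.foo.Foo");
        digester.parse(new File("myConfigFile.xml"));
    }
}

class Foo {
    String name;
}

class FooFactory extends AbstractObjectCreationFactory<Foo> {
    @Override
    public Foo createObject(Attributes attributes) {
        Foo foo = new Foo();
        foo.name = attributes.getValue("name");
        return foo;
    }
}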
Try XML Pull Parser