What is a better way to parse large XML data, which is essentially a collection of XML documents, in Java and Java-based frameworks? We get the data from a webservice call and it runs to a few MB (typically 25MB+). This data essentially corresponds to an unmarshalled list of objects. My objective is to create the list of objects from the XML.
I tried using the SAX parser and it takes a good 45 seconds to parse these 3000 objects.
What are the other recommended approaches?
Try pull parsing instead, use StAX?
First search hit on comparing:
http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html
Have you profiled and seen where the bottlenecks are?
StAX is built into Java (since Java 6), but some recommend the Woodstox StAX implementation for even better performance. I have not tried it myself, though. http://woodstox.codehaus.org/
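A minimal sketch of what the pull style looks like (the file name and element handling are placeholders, not anything from the question):

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

public class StaxSkeleton {
    public static void main(String[] args) throws Exception {
        // Unlike SAX, you ask for the next event instead of receiving callbacks.
        try (FileInputStream in = new FileInputStream("data.xml")) {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println(reader.getLocalName());
                }
            }
            reader.close();
        }
    }
}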
I tried using the SAX parser and it takes a good 45 seconds to parse
these 3000 objects. What are the other recommended approaches?
There are only the following options:
DOM
SAX
StAX
SAX is the fastest of the three (see SAX vs DOM vs StAX), so if you switch to a different style, I don't think you'll get any benefit.
Unless you are doing something wrong now.
Of course there are also the marshalling/unmarshalling frameworks such as JAXB, but IMO (I have not done any measurements) they could be slower, since they add an extra layer of abstraction on top of the XML processing.
SAX doesn't provide random access to the structure of the XML file, which is what makes it a relatively fast and efficient method of parsing. Because the SAX parser deals with only one element at a time, implementations can be extremely memory-efficient, often making it the obvious choice for dealing with large files.
Parsing 25MB of XML should not take 45 seconds. Something else is going on. Perhaps most of the time is spent waiting for an external DTD to be fetched from the web, I don't know. Before changing your approach, you need to understand where the costs are coming from, and therefore which part of the system will benefit from changes.
However, if you really do want to convert the XML into Java objects (not the application architecture I would choose, but never mind), then JAXB sounds a good bet. I haven't used JAXB much since I prefer to stick with XML-oriented languages like XSLT and XQuery, but when I did try JAXB I found it pretty fast. Of course it uses a SAX or StAX parser underneath.
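For illustration, a minimal JAXB sketch; the Orders/Order classes and the element names are invented for the example, not taken from the question:

import java.io.File;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "orders")
@XmlAccessorType(XmlAccessType.FIELD)
class Orders {
    @XmlElement(name = "order")
    List<Order> orders;
}

@XmlAccessorType(XmlAccessType.FIELD)
class Order {
    String id;
    String customer;
}

public class JaxbDemo {
    public static void main(String[] args) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Orders.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        // One call turns the whole document into the list of objects;
        // JAXB runs a SAX/StAX parser underneath.
        Orders result = (Orders) unmarshaller.unmarshal(new File("orders.xml"));
        System.out.println(result.orders.size());
    }
}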
Related
I need to modify a single piece of information in an XML file. The XML file is about 100 lines. What would be the most memory-efficient way in Java to modify a single element in the whole file?
Is JAXB better?
A simple SAX parser?
Or any other way? Kindly suggest.
A SAX parser gives more control over parsing and is faster than a DOM parser. JAXB will be easy in the sense of writing less code. XStream is another option, but it is similar to JAXB: a high-level API with some overhead, so it will be a bit slower than SAX.
I would not suggest direct string manipulation (applying String.indexOf() and String.replace()). Although it would be the fastest way to update a unique tag in the XML, it is risky: your XML might end up invalid, and if the XML structure is not simple there is a risk of updating a tag at the wrong level :-)
Therefore, the SAX parser looks like the best bet to me.
Your files are not big. The memory used to hold a 100-line XML file costs about as much as 5 milliseconds of a programmer's time. I would question your requirement: why do you need to do it in "the most memory efficient way"? I would use XSLT or JDOM2, unless there is clear quantified information that this will not meet externally-imposed performance requirement, which cannot be solved by buying a bit more memory.
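For completeness, a JDOM2 sketch of such an in-place edit; the file name and the settings/timeout element names are hypothetical:

import java.io.File;
import java.io.FileWriter;
import org.jdom2.Document;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

public class Jdom2Edit {
    public static void main(String[] args) throws Exception {
        // Load the whole document; at ~100 lines this is trivially cheap.
        Document doc = new SAXBuilder().build(new File("config.xml"));
        doc.getRootElement()
           .getChild("settings")
           .getChild("timeout")
           .setText("30");
        try (FileWriter out = new FileWriter("config.xml")) {
            new XMLOutputter(Format.getPrettyFormat()).output(doc, out);
        }
    }
}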
I have a requirement where I can have a 100 MB or bigger XML file containing a list of companies, and I need to add each company from that file into a table.
I was thinking of using the SAX parser, but I was also considering the StAX parser. Can someone please help me decide which one to use?
Thanks
StAX has a much easier to use API, so I think it is the better choice. SAX has a low-level push API, which is not very nice to use (e.g. working with char[]). StAX has a much nicer pull API.
Another potential advantage: with StAX you don't have to read the whole document; you may stop once you have what you need.
There is a nice - though quite old - comparison of the Java XML parsing APIs found here.
Using StAX will allow you to minimize the amount of data kept in memory to only the most recently parsed record. Once you insert that record into your table, you no longer need to keep it in memory.
If you use SAX you would (likely) accumulate state across callbacks before inserting records into your table. While it is possible to insert as you go (when encountering the closing element for a record), that is more complicated with SAX than with StAX.
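A rough StAX sketch of that record-at-a-time pattern; the company/name element names and the insert stand-in are assumptions, not from the question:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

public class CompanyLoader {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("companies.xml")) {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            String currentName = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "name".equals(reader.getLocalName())) {
                    currentName = reader.getElementText();
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "company".equals(reader.getLocalName())) {
                    // One complete record parsed; insert it and let it
                    // become garbage. Nothing else stays in memory.
                    insertCompany(currentName);
                    currentName = null;
                }
            }
            reader.close();
        }
    }

    static void insertCompany(String name) {
        // Stand-in for your real JDBC insert.
        System.out.println("insert " + name);
    }
}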
I want to store a small chunk of data and don't want to use a database. We have two choices, XML and JSON. Can anyone please suggest which one I should select from a performance and architecture point of view?
1. Which is better for storing data, XML or JSON?
2. What are the pros and cons of both JSON and XML?
Any help would be greatly appreciated.
EDIT
We are not using any web service; our application is a standalone app. We want to use XML or JSON for storing some local data which will be used in the application. The data would be things like details of questions and answers, static user details, etc.
Please keep in mind that JSON is only smaller if the tags are longer than the data.
Probably the fact that XML is a lot easier to read, and that JSON has a smaller footprint.
XML Pros
Easier to read
Used a lot more than JSON
One of the main industry standards
Versioning possible
Namespace support
Multiple elements with the same name
Validation
XML Cons
Takes up more space
Increased bandwidth because of the size
JSON Pros
Doesn't take up a lot of space
Uses less bandwidth because of its size (footprint)
Rising in the ranks as one of the main industry standards
JSON Cons
Harder to read
Versioning breaks client/data
If you are sending more data than tags, the two formats are about the same size, and you would have been better off using XML for its fast parsing speed. I would also argue that people expect slow mobile load times and fast app running times, so try not to slow down the app by using a format that is slower to parse.
Finally, I say JSON. The small footprint will speed up transactions between your app and the web services you're sending data to and receiving data from.
JSON is the best choice for mobile application development, because parsing JSON is a very lightweight operation compared to XML, whereas XML parsing often leads to complex memory problems. JSON can also be easily built and parsed with the GSON library, which is itself very lightweight.
XML parsing will be a headache if you have different versions of parsers to deal with, so go for JSON.
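For example, with Gson the build/parse round trip is a couple of calls; the User class here is made up for the sketch:

import com.google.gson.Gson;

class User {
    String name;
    int score;
}

public class GsonExample {
    public static void main(String[] args) {
        Gson gson = new Gson();
        // Parse: JSON text straight into an object.
        User user = gson.fromJson("{\"name\":\"alice\",\"score\":42}", User.class);
        // Build: the object straight back to JSON text.
        String json = gson.toJson(user);
        System.out.println(json);
    }
}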
Extensible Markup Language (XML) is a text format derived from Standard Generalized Markup Language (SGML).
Most of the excitement around XML is around a new role as an interchangeable data serialization format. XML provides two enormous advantages as a data representation language:
It is text-based.
It is position-independent.
These together encouraged a higher level of application-independence than other data-interchange formats. The fact that XML was already a W3C standard meant that there wasn't much left to fight about (or so it seemed).
Unfortunately, XML is not well suited to data-interchange, much as a wrench is not well-suited to driving nails. It carries a lot of baggage, and it doesn't match the data model of most programming languages. When most programmers saw XML for the first time, they were shocked at how ugly and inefficient it was. It turns out that that first reaction was the correct one. There is another text notation that has all of the advantages of XML, but is much better suited to data-interchange. That notation is JavaScript Object Notation (JSON).
The most informed opinions on XML (see for example xmlsuck.org) suggest that XML has big problems as a data-interchange format, but the disadvantages are compensated for by the benefits of interoperability and openness.
JSON promises the same benefits of interoperability and openness, but without the disadvantages.
Rest of the comparison is here.
I want to do some manipulation of XML content in Java. See the XML below.
Source XML:
<ns1:Order xmlns:ns1="com.test.ns" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <OrderHeader>
        <Image>Image as BinaryData of size 250KB</Image>
    </OrderHeader>
</ns1:Order>
Target XML:
<OrderData>
    <OrderHeader>
        <Image>Image as BinaryData of size 250KB</Image>
    </OrderHeader>
</OrderData>
As shown, I have the source XML and I want the target XML for it. The only difference is that the root element ns1:Order is replaced with OrderData in the target XML.
FYI, OrderHeader has one sub-element, Image, which holds a binary image of 250KB (so this XML is going to be a large one). Also, the root element of the target XML, OrderData, is well known in advance.
Now, I want to achieve the above result in Java with the best performance. I already have the source XML content as a byte[] and I want the target XML content as a byte[] too. I am open to using a SAX parser.
Please provide the solution with the best performance for doing the above.
Thanks in advance,
Nurali
Do you mean machine performance or human performance? Spending an infinite amount of programmer time to achieve a microscopic gain in machine performance is a strange trade-off to make these days, when a powerful computer costs about the same as half a day of a contract programmer's time.
I would recommend using XSLT. It might not be fastest, but it will be fast enough. For a simple transformation like this, XSLT performance will be dominated by parsing and serialization costs, and those won't be any worse than for any other solution.
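A sketch of what that looks like with the JDK's built-in TrAX API, taking byte[] in and out as the question asks; the stylesheet is an identity transform that only rewrites the document element:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;

public class RenameRoot {
    // Identity transform, except the document element is rewritten
    // as <OrderData>; everything below it is copied through untouched.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "  <xsl:template match='/*'>"
      + "    <OrderData><xsl:apply-templates select='node()'/></OrderData>"
      + "  </xsl:template>"
      + "  <xsl:template match='@*|node()'>"
      + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "  </xsl:template>"
      + "</xsl:stylesheet>";

    public static byte[] rename(byte[] source) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new StreamSource(new ByteArrayInputStream(source)),
                    new StreamResult(out));
        return out.toByteArray();
    }
}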
Not much will beat direct bytes/String manipulation, for instance, a regular expression.
But be warned: manipulating XML with regex is always a hot debate.
I used XSLT to transform XML documents. That's another way to do it. There are several Java implementations of XSLT processors.
The fastest way to manipulate strings in Java is direct manipulation plus a StringBuilder for the results. I wrote code to modify 20 MB strings that built a table of change locations and then copied and modified the string into a new StringBuilder. For strings, XSLT and regex are much slower than direct manipulation, and SAX/DOM parsers are slower still.
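As a hedged sketch of that direct approach applied to the root-rename problem above (it assumes the document really begins and ends with these exact tags, which is exactly the fragility the regex warning is about):

public class DirectSwap {
    // Replace the root tags by index arithmetic; no parsing at all.
    // Fragile by design: it assumes <ns1:Order ...> opens the document
    // and </ns1:Order> closes it, with no '>' inside attribute values.
    static String renameRoot(String xml) {
        int open = xml.indexOf("<ns1:Order");
        int openEnd = xml.indexOf('>', open);
        int close = xml.lastIndexOf("</ns1:Order>");
        return new StringBuilder(xml.length())
                .append(xml, 0, open)
                .append("<OrderData>")
                .append(xml, openEnd + 1, close)
                .append("</OrderData>")
                .append(xml, close + "</ns1:Order>".length(), xml.length())
                .toString();
    }
}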
I currently have a Java SAX parser that is extracting some info from a 30GB XML file.
Presently it is:
reading each XML node
storing it into a string object,
running some regexes on the string
storing the results to the database
This happens for several million elements. I'm running it on a computer with 16GB of memory, but the memory is not being fully utilized.
Is there a simple way to dynamically 'buffer' about 10gb worth of data from the input file?
I suspect I could manually take a 'producer' 'consumer' multithreaded version of this (loading the objects on one side, using them and discarding on the other), but damnit, XML is ancient now, are there no efficient libraries to crunch em?
Just to cover the bases, is Java able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -Xmx10g (or however much memory you want to allocate to it).
It is highly unlikely that memory is the limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO-bound or CPU-bound. Most likely it'll be IO. If it is IO, make sure you're buffering your streams, and then you're pretty much done; the only thing left is to buy a faster hard drive.
If you really are CPU-bound, it's possible that you're bottlenecking at regex rather than XML parsing.
See this (which references this)
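If the regexes are the suspect, the first thing to check is that each pattern is compiled once, not once per element; the pattern below is made up for the sketch:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexReuse {
    // Compile once; Pattern is immutable and thread-safe.
    private static final Pattern ID = Pattern.compile("id=(\\d+)");

    static String extractId(String element) {
        // String.matches()/replaceAll() recompile the pattern on every
        // call; reusing a compiled Pattern avoids that cost over
        // millions of elements.
        Matcher m = ID.matcher(element);
        return m.find() ? m.group(1) : null;
    }
}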
If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:
StAX (there are multiple implementations; Woodstox is one of the fastest)
Javolution
Roll your own using JFlex
Roll your own ad hoc, e.g. using regex
For the last two, the more constrained your XML subset is, the more efficient you can make it.
It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.
First, try to find out what's slowing you down.
How much faster is the parser when you parse from memory?
Does using a BufferedInputStream with a large size help?
Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by the speed of this. Can you distribute the load to several machines, maybe by using something like Hadoop?
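On the buffering question, the change is one wrapper around the file stream; the 1 MiB size here is an arbitrary starting point to measure and tune:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedOpen {
    // Hand the parser a stream that reads from disk in large blocks
    // instead of many tiny reads.
    static InputStream open(String path) throws IOException {
        return new BufferedInputStream(new FileInputStream(path), 1 << 20);
    }
}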
No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...
SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that is relevant to that element, rather than to the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding it.
I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that
using XML was wrong for the data stored
you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)
Apart from that: XML is not ancient and is in massive and active use. What do you think all those interactive web sites are using for their interactive elements?
Are you being slowed down by many small commits to your DB? It sounds like you are writing to the DB almost all the time from your program; making sure you don't commit too often could improve performance. Possibly preparing your statements and other standard bulk-processing tricks could also help.
Other than this early comment, we need more info. Do you have a profiler handy that can scrape out what makes things run slowly?
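A sketch of those two tricks together, prepared statements plus batching; the table and column names are invented for the example:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsert {
    static void insertAll(Connection conn, Iterable<String[]> rows)
            throws SQLException {
        conn.setAutoCommit(false); // commit per batch, not per row
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO company (id, name) VALUES (?, ?)")) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending % 1000 == 0) { // 1000 is an arbitrary batch size
                    ps.executeBatch();
                    conn.commit();
                }
            }
            ps.executeBatch(); // flush the final partial batch
            conn.commit();
        }
    }
}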
You can use the JiBX library and bind your XML "nodes" to objects that represent them. You can even subclass ArrayList; then, when x number of objects have been added, perform the regexes all at once (presumably using a method on your object that performs this logic) and save them to the database, before allowing the add method to finish.
JiBX is hosted on SourceForge: JiBX
To elaborate: you can bind your XML as a "collection" of these specialized String holders. Because you define this as a collection, you must choose what collection type to use. You can then specify your own ArrayList implementation.
Override the add method as follows:

@Override
public boolean add(Object o) {
    boolean added = super.add(o);
    // Once the buffer reaches the threshold, flush it to the database.
    if (size() > YOUR_DEFINED_THRESHOLD) {
        flushObjects();
    }
    return added;
}
YOUR_DEFINED_THRESHOLD
is how many objects you want to store in the ArrayList before it has to be flushed out to the database. flushObjects() is simply the method that performs this logic. It will block the addition of objects from the XML file until it completes. However, this is OK; the overhead of the database will probably be much greater than the file reading and parsing anyway.
I would suggest first importing your massive XML file into a native XML database (such as eXist if you are looking for open source stuff; I have never tested it myself), and then performing iterative paged queries to process your data in small chunks.
You may want to try StAX instead of SAX; I hear it's better for that sort of thing (I haven't used it myself).
If the data in the XML is order-independent, can you multi-thread the process by splitting the file up or running multiple processes starting at different locations in the file? If you're not I/O-bound, that should help speed it along.