Our application is required to take client data presented in XML format (several files) and parse this into our common XML format (a single file with a schema). For this purpose we are using Apache's XMLBeans data-binding framework. The steps of this process are briefly described below.
First, we take raw java.io.File objects pointing to the client XML files on disk and load these into a collection. We then iterate over this collection, creating a single org.apache.xmlbeans.XmlObject per file. After all files have been parsed into XmlObjects, we create 4 collections holding the individual objects from the XML documents that we are interested in (to be clear, these are not hand-crafted objects but what I can only describe as 'proxy' objects created by Apache's XMLBeans framework). As a final step, we iterate over these collections to produce our XML document (in memory) and then save it to disk.
For the majority of use cases, this process works fine and can easily run in the JVM when given the '-Xmx1500m' command-line argument. However, issues arise when we are given 'large datasets' by the client. Large in this instance is 123MB of client XML spread over 7 files. Such datasets result in our in-code collections being populated with almost 40,000 of the aforementioned 'proxy' objects. In these cases the memory usage just goes through the roof. I do not get any OutOfMemory exceptions; the program just hangs until garbage collection occurs, freeing up a small amount of memory, then the program continues, uses up this new space, and the cycle repeats. These parsing sessions currently take 4-5 hours. We are aiming to bring this down to within an hour.
It's important to note that the calculations required to transform the client XML into our XML require all of the XML data to be available for cross-referencing. Therefore we cannot implement a sequential parsing model or batch this process into smaller blocks.
What I've tried so far
Instead of holding all 123MB of client XML in memory, I tried loading the files on each request for data, finding the data, and then releasing the references to those objects. This does seem to reduce the amount of memory consumed during the process but, as you can imagine, the time the constant I/O takes removes the benefit of the reduced memory footprint.
I suspected an issue was that we are holding an XmlObject[] for 123MB worth of XML files as well as the collections of objects taken from these documents (using XPath queries). To remedy this, I altered the logic so that instead of querying these collections, the documents were queried directly. The idea was that at no point would there exist four massive Lists with tens of thousands of objects in them, just the large collection of XmlObjects. This did not seem to make a difference at all and, in some cases, increased the memory footprint even more.
Clutching at straws now, I considered that the XmlObject we use to build our XML in memory before writing to disk was growing too large to maintain alongside all the client data. However, doing some sizeOf queries on this object revealed that, at its largest, it is less than 10KB. After reading up on how XMLBeans manages large DOM objects, it seems to use some form of buffered writer and, as a result, is managing this object quite well.
So now I am out of ideas. I can't use SAX approaches instead of memory-intensive DOM approaches because we need 100% of the client data in our app at any one time; I cannot hold off requesting this data until we absolutely need it, because the conversion process requires a lot of looping and the disk I/O time is not worth the saved memory space; and I cannot seem to structure our logic in such a way as to reduce the amount of space the internal Java collections occupy. Am I out of luck here? Must I just accept that if I want to parse 123MB worth of XML data into our XML format, I cannot do it with the 1500m memory allocation? While 123MB is a large dataset in our domain, I cannot imagine others have never had to do something similar with GBs of data at a time.
Other information that may be important
I have used JProbe to try and see if that can tell me anything useful. While I am a profiling noob, I ran through their tutorials for memory leaks and thread locks, understood them, and there doesn't appear to be any leaks or bottlenecks in our code. After running the application with a large dataset, we quickly see a 'sawblade'-type shape on the memory analysis screen (see attached image), with PS Eden space being taken over by a massive green block of PS Old Gen. This leads me to believe that the issue here is simply the sheer amount of space taken up by the object collections rather than a leak holding onto unused memory.
I am running on a 64-bit Windows 7 platform, but this will need to run in a 32-bit environment.
The approach I'd take would be to make two passes over the files, using SAX in both cases.
The first pass would parse the 'cross-reference' data needed in the calculations into custom objects and store them in Maps. If the 'cross-reference' data is large, then look at using a distributed cache (Coherence is the natural fit if you've started with Maps).
The second pass would parse the files, retrieve the 'cross-reference' data to perform calculations as needed, and then write the output XML using the javax.xml.stream APIs.
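Not the answerer's actual code, just a minimal sketch of the shape of those two passes. The 'item' element, the 'id'/'value'/'ref' attributes, the 'records'/'record' output structure and the String-to-String Map are all placeholders; the real calculation logic would go where the comments indicate.

import java.io.File;
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TwoPassConverter {

    public static void main(String[] args) throws Exception {
        File[] clientFiles = { new File("client1.xml"), new File("client2.xml") };

        // Pass 1: collect only the cross-reference data, keyed by id.
        final Map<String, String> crossRef = new HashMap<String, String>();
        for (File f : clientFiles) {
            SAXParserFactory.newInstance().newSAXParser().parse(f, new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    if ("item".equals(qName)) {                       // placeholder element name
                        crossRef.put(atts.getValue("id"), atts.getValue("value"));
                    }
                }
            });
        }

        // Pass 2: parse again, look up cross-reference data and stream the output document.
        final XMLStreamWriter out = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new FileOutputStream("output.xml"), "UTF-8");
        out.writeStartDocument("UTF-8", "1.0");
        out.writeStartElement("records");
        for (File f : clientFiles) {
            SAXParserFactory.newInstance().newSAXParser().parse(f, new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    try {
                        if ("item".equals(qName)) {
                            out.writeStartElement("record");
                            out.writeAttribute("id", atts.getValue("id"));
                            // the real calculations would combine this element with crossRef lookups
                            String resolved = crossRef.get(atts.getValue("ref"));
                            out.writeCharacters(resolved == null ? "" : resolved);
                            out.writeEndElement();
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        out.writeEndElement();
        out.writeEndDocument();
        out.close();
    }
}

The key point is that the first pass keeps only the small cross-reference Map, and the second pass streams straight to the output writer, so no full document is ever held in memory.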
Related
I have a Swing application which works on a CSV file. It reads the full file line by line, computes some required statistics and shows the output.
The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data. The problem is that the JVM takes 4 times the memory of the file size (while processing an 86MB file, the heap uses 377MB of space; memory utilization checked using jVisualVM).
Note:
I have used LineNumberReader for reading the file (because of a specific requirement; I can change it if that helps memory usage).
Every line is read with readLine(), and then split(",") is called on the resulting String to get the individual fields of that record.
Each record is stored in a Vector for display in the JTable, whereas other statistics are stored in a HashMap and a TreeMap, and summary data in a JavaBean class. One graph is also plotted using JFreeChart.
Please suggest how to reduce memory utilization, as I need to process a 2GB file.
Try giving OpenCSV a shot. It only stores the last-read line when you use the readNext() method, which is perfect for large files.
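A rough sketch of that readNext() loop (the class is au.com.bytecode.opencsv.CSVReader in older releases, com.opencsv.CSVReader in newer ones; the file name and the row counter stand in for your real statistics):

import java.io.FileReader;
import au.com.bytecode.opencsv.CSVReader;   // com.opencsv.CSVReader in newer versions

public class CsvStats {
    public static void main(String[] args) throws Exception {
        CSVReader reader = new CSVReader(new FileReader("data.csv"));
        String[] fields;
        long rows = 0;
        while ((fields = reader.readNext()) != null) {   // only one record in memory at a time
            rows++;                                      // update your statistics here instead of storing the row
        }
        reader.close();
        System.out.println("rows: " + rows);
    }
}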
From their website, the following are the features they support:
Arbitrary numbers of values per line
Ignoring commas in quoted elements
Handling quoted entries with embedded carriage returns (i.e. entries that span multiple lines)
Configurable separator and quote characters (or use sensible defaults)
Read all the entries at once, or use an Iterator style model
Creating csv files from String[] (i.e. automatic escaping of embedded quote chars)
Use best practices to improve your program:
Use multithreading to get better CPU utilization.
Set the minimum and maximum heap sizes to make better use of RAM (see the example after this list).
Use proper data structures and design.
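For example, the heap bounds are passed on the command line when launching the application; the values and jar name here are illustrative only:

java -Xms512m -Xmx1024m -jar yourApplication.jar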
Every Java object has a memory overhead, so if your Strings are really short, that could explain why you get 4 times the size of your file. You also have to count the size of the Vector and its internals. I don't think that a Map would improve memory usage, since identical Java String literals are already shared (although Strings built at runtime are not automatically shared).
I think you should revise your design. Given your requirement that
"The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data"
you don't need to store the whole file in memory. You need to read it entirely to compute your statistics, and this can certainly be done using a very small amount of memory. Regarding the JTable part, this can be accomplished in a number of ways without requiring 2GB of heap space for your program! I think there must be something wrong when someone wants to keep a whole CSV in memory! Take a look at Apache Commons IO's LineIterator.
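A minimal sketch with Commons IO's LineIterator, which never holds more than the current line in memory; the file name and the comma split are illustrative, and the statistics update is left as a comment:

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

public class LineByLine {
    public static void main(String[] args) throws Exception {
        LineIterator it = FileUtils.lineIterator(new File("data.csv"), "UTF-8");
        try {
            while (it.hasNext()) {
                String line = it.nextLine();
                String[] fields = line.split(",");   // update running statistics here; do not keep the line
            }
        } finally {
            it.close();
        }
    }
}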
Increase the JVM heap size (-Xms and -Xmx). If you have the memory, this is the best solution. If you cannot do that, you will need to find a compromise that will be a combination of data model and presentation (GUI) changes, usually resulting in increased code complexity and potential for bugs.
Try modifying your statistics algorithms to do their work as the data is being read, rather than requiring it all to exist in memory. You may find algorithms that approximate the statistics to be sufficient (a small sketch follows this list of suggestions).
If your data contains many duplicate String values, use a cache so that only one instance of each is retained (a HashMap keyed on the string itself works more naturally than a HashSet here, since it can hand back the canonical instance). Beware: caches are notorious for becoming memory leaks (e.g. if you don't clear them before loading different files).
Reduce the amount of data being displayed on the graph. It is common for a graph with a lot of data to have many points displayed at or near the same pixel. Consider truncating the data by merging multiple values at or near the same position on the x-axis. If your data set contains 2,000,000 points, for example, most of them will coincide with other nearby points, so your underlying data model does not need to store everything.
Beware of information overload. Will your JTable be meaningful to the user if it contains 2GB worth of data? Perhaps you should paginate the table, and read only 1000 entries from the file at a time for display.
I'm hesitant to suggest this, but during the loading process, you could convert the CSV data into a file database (such as cdb). You could accumulate statistics and store some data for the graph during the conversion, and use the database to quickly read a page of data at a time for the JTable as suggested above.
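As a sketch of the "do the work as the data is being read" suggestion above, a small accumulator like this can be fed one value per record and never stores the records themselves (the particular statistics chosen here are just examples):

// Running statistics computed while streaming the file, so no record is retained.
public class RunningStats {
    private long count;
    private double sum, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;

    public void accept(double value) {
        count++;
        sum += value;
        if (value < min) min = value;
        if (value > max) max = value;
    }

    public double mean() { return count == 0 ? 0 : sum / count; }
    public double min()  { return min; }
    public double max()  { return max; }
    public long count()  { return count; }
}

Each parsed line calls accept(value) and is then discarded.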
I am planning on reading profile data from an XML file in a JSP.
Now I can either read it and store the important information in individual session variables, or just put the whole section from the XML file into a JDOM Document and put that into a single session variable.
In your experience, will the impact of the data size be big or negligible?
JDOM 2.0.0, released yesterday (I am the maintainer), has an improved memory footprint: it uses less than 10x as much memory as the input document. Additionally, if you use the SlimJDOMFactory when parsing the XML, you use even less; a typical 275KB document is parsed in 1.5MB using the SlimJDOMFactory. See the performance metrics for JDOM 2.0.0 at http://hunterhacker.github.com/jdom/jdom2/performance.html and search for SlimJDOMFactory to get the results using the slower but more memory-efficient factory.
This in no way answers your question, because in reality, it all depends on your input data size. My experience, and I am biased, is that for small documents it's easier to just load it all in memory, and only 'sweat it' for the big ones.
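For reference, a minimal sketch of switching on the SlimJDOMFactory mentioned above; this assumes JDOM 2.x, where SlimJDOMFactory lives in the core org.jdom2 package, and the file name is a placeholder:

import java.io.File;
import org.jdom2.Document;
import org.jdom2.SlimJDOMFactory;
import org.jdom2.input.SAXBuilder;

public class SlimParse {
    public static void main(String[] args) throws Exception {
        SAXBuilder builder = new SAXBuilder();
        builder.setJDOMFactory(new SlimJDOMFactory());   // dedupes names/strings for a smaller footprint
        Document doc = builder.build(new File("profile.xml"));
        System.out.println(doc.getRootElement().getName());
    }
}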
There is a rule of thumb when it comes to storing XML in memory with DOM: the size of the XML file * 10. So if your XML file is 1MB, you will need about 10MB of memory to store it.
But in my experience I would never store a DOM document in memory. A student once tried to do that, but the XML file was 50MB, so guess what happened? We ran out of memory.
For your case I would create a class which can hold all the relevant information and fill it by reading the XML.
Do you really need to store the profile data in the session? Do you really need it all the time? Sometimes it is enough to store just an ID or a small class.
Well, the more objects you store in memory, the less free heap memory you'll have; there isn't a way around that.
It depends on your application domain, really.
But in general people use caching solutions (Ehcache is one, off the top of my head) between the data source (your XML file or files) and their application domain model.
The cache expires or is cleared on demand, therefore you have reasonable control of your objects and the heap memory they occupy.
I have an application which reads a large set of XML files (around 20-30 of them) periodically, say once every 10 minutes. Each XML file is at least 40-100MB in size. Once each XML file has been read, a map is created out of the file, and the map is then passed across a processor chain (10-15 processors), each processor using the data and performing some filtering, writing to a database, etc.
Now the application is running in a 32-bit JVM; there is no intention of moving to a 64-bit JVM right now. The memory footprint, as expected, is very high, nearing the threshold of a 32-bit JVM. For now, when we receive large files, we serialize the generated maps to disk and run a maximum of 3-4 maps through the processor chain concurrently, because if we try to process all the maps at the same time it easily goes OutOfMemory. Garbage collection activity is also pretty high.
I have some ideas, but wanted to see if there are options which people have already tried or evaluated. So what are the options for scaling this kind of application?
Yeah, to parrot @aaray and @MeBigFatGuy, you want to use some event-based parser for this: the dom4j approach mentioned, or SAX, or StAX.
As a simple example, that 100MB of XML is consuming a minimum of 200MB of RAM if you load it wholesale, as each character is immediately expanded to a 16-bit character.
Next, any tags or elements that you're not using are going to consume extra memory (plus all of the other baggage and bookkeeping of the nodes), and it's all wasted. If you're dealing with numbers, converting the raw String to a long will be a net win if the number is longer than 2 digits.
IF (and this is a BIG IF) you are using a lot of a reasonably small set of Strings, you can save some memory by String.intern()'ing them. This is a canonicalization process that makes sure that if the string already exists in the JVM, it's shared. The downside is that interning pollutes your PermGen (once interned, always interned). PermGen is pretty finite, but on the other hand it's pretty much immune to GC.
Have you considered running the XML through an external XSLT to remove all of the cruft that you don't want to process before it even enters your JVM? There are several standalone, command-line XSL processors that you can use to pre-process the files into something perhaps more sane. It really depends on how much of the incoming data you're actually using.
By using an event based XML processing model, the XSLT step is pretty much redundant. But the event based models are all basically awful to use, so perhaps using the XSLT step would let you re-use some of your existing DOM logic (assuming that's what you're doing).
The flatter your internal structures, the cheaper they are in terms of memory. You actually have a little bit of an advantage running a 32-bit VM, since instance pointers are half the size. But still, when you're talking thousands or millions of nodes, it all adds up, and quickly.
We had a similar problem processing large XML files (around 400MB). We greatly reduced the memory footprint of the application using this:
http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc
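The technique behind that FAQ entry is to register an ElementHandler on the repeating element and detach each element once it has been processed, so it can be garbage collected. A rough sketch (the '/records/record' path and file name are placeholders):

import java.io.File;
import org.dom4j.Element;
import org.dom4j.ElementHandler;
import org.dom4j.ElementPath;
import org.dom4j.io.SAXReader;

public class PrunedRead {
    public static void main(String[] args) throws Exception {
        SAXReader reader = new SAXReader();
        reader.addHandler("/records/record", new ElementHandler() {
            public void onStart(ElementPath path) {
                // nothing to do when the element opens
            }
            public void onEnd(ElementPath path) {
                Element record = path.getCurrent();
                // process the fully-built element here, then prune it so it can be collected
                record.detach();
            }
        });
        reader.read(new File("big.xml"));
    }
}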
You can insert the contents of each XML file into a temporary DB table and each chain link would fetch the data it needs. You will probably lose performance, but gain scalability.
I am writing a Java application that, among other things, needs to read a dictionary text file (each line is one word) and store it in a HashSet. Each time I start the application, this same file is read all over again (a 6MB Unicode file).
That seemed expensive, so I decided to serialize the resulting HashSet and store it in a binary file. I expected my application to run faster after this. Instead it got slower: from ~2.5 seconds before to ~5 seconds after serialization.
Is this an expected result? I thought that in similar cases serialization should increase speed.
It's not a question of one serialization mechanism or another, it's a question of the data structure you are serializing.
You have one very efficient, natural representation of these words: a simple list, in the text file. That's fast to read.
You have created a data structure to store them which is different: a hash table. It takes more memory to represent a hash table. However the benefit is that it's very fast to look for a word, compared to a simple list.
But that tradeoff means serialization gets slower as well, since the naive serialization of a hash table will serialize more data and be larger, and therefore slower.
I think you should stick with the simple reading of the text file.
@Sean's answer is correct. Java serialization/deserialization has significant performance overheads. If you need to make the dictionary loading faster (or ...), consider the following approaches:
Using the java.nio.* classes to read the file may speed things up.
If the application doesn't necessarily need the dictionary to be loaded instantly on startup, consider using a separate thread to do the dictionary loading asynchronously. The dictionary loading is no faster, but (for example) the application's GUI starts faster anyway.
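One way to do that asynchronous loading, assuming the dictionary is a UTF-8 text file (the path handling and single-thread executor are just one reasonable choice):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DictionaryLoader {

    // Kick off the load in the background and return a handle to the eventual result.
    public static Future<Set<String>> loadAsync(final String path) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Set<String>> future = pool.submit(new Callable<Set<String>>() {
            public Set<String> call() throws Exception {
                Set<String> words = new HashSet<String>();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(new FileInputStream(path), "UTF-8"));
                try {
                    String line;
                    while ((line = in.readLine()) != null) {
                        words.add(line);
                    }
                } finally {
                    in.close();
                }
                return words;
            }
        });
        pool.shutdown();   // the already-submitted task still runs to completion
        return future;
    }
}

The GUI can start immediately and call future.get() the first time a lookup is actually needed.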
I currently have a Java SAX parser that is extracting some info from a 30GB XML file.
Presently it is:
reading each XML node
storing it into a string object,
running some regexes on the string
storing the results to the database
For several million elements. I'm running this on a computer with 16GB of memory, but the memory is not being fully utilized.
Is there a simple way to dynamically 'buffer' about 10GB worth of data from the input file?
I suspect I could manually write a 'producer'/'consumer' multithreaded version of this (loading the objects on one side, using them and discarding them on the other), but damn it, XML is ancient now; are there no efficient libraries to crunch them?
Just to cover the bases, is Java able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -Xmx10g (or however much memory you want to allocate to it).
It is highly unlikely memory is a limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO or CPU bound. Most likely it'll be IO. If it is IO, make sure you're buffering your streams, and then you're pretty much done; the only thing you can do is buy a faster hard drive.
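Wrapping the input in a large BufferedInputStream is cheap to try. A sketch, with the content handler left as a commented placeholder and an arbitrary 1MB buffer size:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class BufferedParse {
    public static void main(String[] args) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // reader.setContentHandler(yourExistingHandler);   // plug in the existing SAX handler here
        BufferedInputStream in = new BufferedInputStream(new FileInputStream("huge.xml"), 1 << 20);
        try {
            reader.parse(new InputSource(in));
        } finally {
            in.close();
        }
    }
}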
If you really are CPU-bound, it's possible that you're bottlenecking at regex rather than XML parsing.
See this (which references this)
If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:
StAX (there are multiple implementations; Woodstox is one of the fastest) - see the sketch below
Javolution
Roll your own using JFlex
Roll your own ad hoc, e.g. using regex
For the last two, the more constrained your XML subset is, the more efficient you can make it.
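For reference, the cursor-style StAX loop looks roughly like this; the 'record' element name and file name are illustrative, and the regex/database work is left as a comment:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxScan {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader r = factory.createXMLStreamReader(
                new BufferedInputStream(new FileInputStream("huge.xml")));
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT && "record".equals(r.getLocalName())) {
                String text = r.getElementText();   // pull only the pieces you need, then move on
                // run the regexes / DB insert on 'text' here
            }
        }
        r.close();
    }
}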
It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.
First, try to find out what's slowing you down.
How much faster is the parser when you parse from memory?
Does using a BufferedInputStream with a large size help?
Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by the speed of this. Can you distribute the load to several machines, maybe by using something like Hadoop?
No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...
SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that is relevant to that element, rather than the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding them.
I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that
using XML was wrong for the data stored
you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)
Apart from that: XML is not ancient and in massive and active use. What do you think all those interactive web sites are using for their interactive elements?
Are you being slowed down by multiple small commits to your DB? It sounds like you would be writing to the DB almost all the time from your program, and making sure you don't commit too often could improve performance. Preparing your statements and other standard bulk-processing tricks could possibly also help.
Other than this early comment, we need more info - do you have a profiler handy that can scrape out what makes things run slowly?
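A rough sketch of the batched, prepared-statement idea from the first paragraph; the JDBC URL, table, column and 1000-row batch size are all placeholders, and in practice the values would come from the SAX handler rather than a hard-coded array:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedInsert {
    public static void main(String[] args) throws Exception {
        // connection details are placeholders
        Connection con = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass");
        con.setAutoCommit(false);                                   // avoid a commit per row
        PreparedStatement ps = con.prepareStatement("INSERT INTO extracted(value) VALUES (?)");
        int pending = 0;
        for (String value : new String[] { "a", "b", "c" }) {       // stand-in for extracted values
            ps.setString(1, value);
            ps.addBatch();
            if (++pending % 1000 == 0) {                            // flush in chunks, not per record
                ps.executeBatch();
                con.commit();
            }
        }
        ps.executeBatch();
        con.commit();
        ps.close();
        con.close();
    }
}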
You can use the JiBX library and bind your XML 'nodes' to objects that represent them. You can even extend an ArrayList so that, when x number of objects have been added, the regexes are performed all at once (presumably using the method on your object that performs this logic) and the objects are saved to the database, before the add method is allowed to finish once again.
JiBX is hosted on SourceForge.
To elaborate: you can bind your XML as a 'collection' of these specialized String holders. Because you define this as a collection, you must choose what collection type to use, so you can specify your own ArrayList implementation.
Override the add method along these lines (ArrayList.add returns a boolean, so the override should return it too):

@Override
public boolean add(Object o) {
    boolean added = super.add(o);
    if (size() > YOUR_DEFINED_THRESHOLD) {
        flushObjects();
    }
    return added;
}
YOUR_DEFINED_THRESHOLD is how many objects you want to store in the ArrayList before it has to be flushed out to the database. flushObjects() is simply the method that will perform this logic; it will block the addition of objects from the XML file until the flush is complete. However, this is OK: the overhead of the database will probably be much greater than that of file reading and parsing anyway.
I would suggest first importing your massive XML file into a native XML database (such as eXist if you are looking for open-source options; I've never tested it myself), and then performing iterative paged queries to process your data in small chunks at a time.
You may want to try StAX instead of SAX; I hear it's better for that sort of thing (I haven't used it myself).
If the data in the XML is order-independent, can you multi-thread the process to split the file up, or run multiple processes starting at different locations in the file? If you're not I/O-bound, that should help speed it along.