Write huge XML file from DOM to file - java

I have a Java program that queries a table with millions of records and generates an XML file with one node per record.
The challenge is that the program is running out of heap memory. I have allocated 2 GB of heap for the program.
I am looking for alternate approaches to creating such a huge XML file.
Can we write out a partial DOM object to the file and release the memory?
For example: create 100 nodes in the DOM object, write them to the file, release the memory, then create the next 100 nodes, and so on.
Code to write a node to a file:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
DOMSource source = new DOMSource(node);
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);
But how do I release the DOM memory after writing the nodes to file?

Why do you need to generate a DOM? Try to write the XML directly. The most convenient API for outputting XML from Java is the StAX XMLStreamWriter interface. There are a number of implementations of XMLStreamWriter that generate lexical (serialized) XML, including the Saxon serializer which gives you considerable control over the way in which it is serialized (e.g. indentation and encoding) if you need it.
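A minimal sketch of this approach with the JDK's built-in StAX implementation (writing to a StringWriter here for brevity; a real program would write to a buffered FileWriter and loop over the JDBC ResultSet instead of the counter below):

```java
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import java.io.StringWriter;

public class StreamWriterDemo {
    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter(); // a FileWriter for a real file
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartDocument("UTF-8", "1.0");
        w.writeStartElement("records");
        // In the real program this loop iterates over the database result set;
        // only the current record is ever held in memory.
        for (int i = 1; i <= 3; i++) {
            w.writeStartElement("record");
            w.writeAttribute("id", String.valueOf(i));
            w.writeCharacters("value " + i);
            w.writeEndElement();
        }
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
        System.out.println(out.toString());
    }
}
```

Because nothing is buffered beyond the current element, memory use stays flat no matter how many records the table holds.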

I would use a simple OutputStreamWriter and format the XML myself; you don't need to create a huge DOM structure. I think this is the fastest way.
Of course, it depends on how complex an XML structure you want to produce. If one table row corresponds to one XML line, this should be the fastest way to do it.
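A sketch of that approach, assuming one row per element; note that even when formatting by hand you still have to escape markup characters in the data (the escape helper below is a minimal illustration, not a complete escaper, and the element names are made up):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

public class ManualXmlWriter {
    // Minimal escaping for text content; attribute values would
    // additionally need quote escaping.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws IOException {
        // StringWriter stands in for an OutputStreamWriter over a FileOutputStream.
        StringWriter sink = new StringWriter();
        try (BufferedWriter out = new BufferedWriter(sink)) {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<rows>\n");
            String[] rows = {"a < b", "c & d"}; // would come from the result set
            for (String row : rows) {
                out.write("  <row>" + escape(row) + "</row>\n");
            }
            out.write("</rows>\n");
        }
        System.out.println(sink.toString());
    }
}
```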

For processing a huge document, SAX is often preferred precisely because it keeps in memory only what you have explicitly decided to keep in memory -- which means you can use a specialized, and hence smaller, data model. For tasks such as this one, where you have no need to crossreference different parts of the document, you may not need any data model at all and can just generate SAX events directly from the input data and feed those into the serializer.
(StAX is pretty much equivalent in this regard. I usually prefer to stay with SAX since it's part of the JAXP API package and should be present in just about every Java environment at this point, but StAX may be a bit easier to work with.)
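Generating SAX events directly and feeding them into a serializer can be sketched with the JDK's identity TransformerHandler, which turns the events it receives into serialized XML (the element names here are made up; a real program would fire the events while iterating the input data):

```java
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;
import java.io.StringWriter;

public class SaxSerializerDemo {
    public static void main(String[] args) throws Exception {
        SAXTransformerFactory f = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler h = f.newTransformerHandler(); // identity transform: events in, XML text out
        h.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        h.setResult(new StreamResult(out));

        AttributesImpl noAttrs = new AttributesImpl();
        h.startDocument();
        h.startElement("", "items", "items", noAttrs);
        for (int i = 0; i < 2; i++) {           // stands in for iterating the input data
            h.startElement("", "item", "item", noAttrs);
            char[] text = ("item " + i).toCharArray();
            h.characters(text, 0, text.length);
            h.endElement("", "item", "item");
        }
        h.endElement("", "items", "items");
        h.endDocument();
        System.out.println(out.toString());
    }
}
```

No data model is built at any point: each event is serialized as soon as it is fired.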

Create a copy of xml file in memory in java

I need to create a copy of an XML file in memory using Java, and I need to edit this copy in memory without affecting the original. After making changes to this XML in memory, I need to send it as input to a function. What is the appropriate option? Please help me.
You can use Java's native API for XML parsing:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
File file = new File("xml_file_name");
Document doc = builder.parse(file);
and then edit the Document in memory before sending it to your designated function.
Do what you wrote:
Read the file.
Write it to another file.
Edit that other file.
Pass it to the function. Here you have to decide whether it's better to pass a File or a path.
What you are looking for is ByteArrayOutputStream: http://docs.oracle.com/javase/7/docs/api/java/io/ByteArrayOutputStream.html
This lets you write a byte array into memory, and most XML libraries will accept any implementation of OutputStream.
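A small sketch of that combination: parse a copy into memory, edit it, and serialize it to a ByteArrayOutputStream, leaving the original untouched (the element names here are made up for illustration):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;

public class InMemoryCopyDemo {
    public static void main(String[] args) throws Exception {
        String original = "<config><name>old</name></config>"; // stands in for the file on disk
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(original.getBytes(StandardCharsets.UTF_8)));
        // Edit the in-memory copy; the original file is untouched.
        doc.getElementsByTagName("name").item(0).setTextContent("new");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out.toString("UTF-8"));
    }
}
```

The resulting byte array (or the stream itself) can then be handed to the downstream function.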
Given the file is XML you should consider using loading it into the Document Object Model (DOM): https://docs.oracle.com/javase/tutorial/jaxp/dom/readingXML.html
That will make it easier for you to modify it and write it back as valid XML document.
I would only suggest loading it as bytes/characters if you're operating on it at a byte level. An example of when that might be appropriate is if you're making some character encoding translation (say UTF-16 -> UTF-8) or removing 'illegal' characters.
Code that tries to parse and modify XML in place usually becomes dreadfully bloated if it has to cover all valid XML files.
Unless you're a domain expert in XML, pick a parser off the shelf; the ecosystem is full of good libraries.
If the files may be large and your logic is amenable, I would prefer to use an XML streaming model such as SAX: https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
However, I get the impression you're not very experienced, and non-experts tend to struggle with the event-driven parsing model of SAX.
Try DOM the first time out.

Android: DOM vs SAX vs XMLPullParser parsing?

I am parsing XML Document using SAX Parser.
I want to know which is better and faster to work with DOM, SAX Parser or XMLPullParser.
It depends on what you are doing. If you have very large files then you should use a SAX parser, since it fires events and then releases them; nothing is stored in memory. With a SAX parser you can't access elements randomly, and there is no going back! The DOM, on the other hand, lets you access any part of the XML file, since it keeps the whole file/document in memory. Hope this answers your question.
If you want to know which parser is fastest, Xerces is the fastest you'll find, and a SAX parser should give you better performance than the DOM.
The SAX XML parser is already available in the Android SDK:
http://developer.android.com/reference/org/xml/sax/XMLReader.html
so it is easy to access.
One aspect by which different kinds of parsers can be classified is whether they need to load the entire XML document into memory up front. Parsers based on the Document Object Model (DOM) do that: they parse XML documents into a tree structure, which can then be traversed in-memory to read its contents. This allows you to traverse a document in arbitrary order, and gives rise to some useful APIs that can be slapped on top of DOM, such as XPath, a path query language that has been specifically designed for extracting information from trees. Using DOM alone isn’t much of a benefit because its API is clunky and it’s expensive to always read everything into memory even if you don’t need to. Hence, DOM parsers are, in most cases, not the optimal choice to parse XML on Android.
There is a class of parsers that don't need to load a document up front. These parsers are stream-based, which means they process an XML document while still reading it from the data source (the Web or a disk). This implies
that you do not have random access to the XML tree as with DOM, because no internal representation of the document is being maintained. Stream parsers can be further distinguished from each other. There are push parsers that, while streaming the document, call back to your application when encountering a new element; SAX parsers fall into this class. Then there are pull parsers, which are more like iterators or cursors: here the client must explicitly ask for the next element to be retrieved.
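The pull style can be sketched with the JDK's StAX XMLStreamReader, where the client explicitly asks for each event (the sample markup is made up for illustration):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class PullParserDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<feed><entry>first</entry><entry>second</entry></feed>";
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (r.hasNext()) {                 // the client pulls each event explicitly
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && r.getLocalName().equals("entry")) {
                System.out.println("entry: " + r.getElementText());
            }
        }
        r.close();
    }
}
```

Compare this with SAX, where the parser drives the loop and calls back into your handler instead.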
Source: Android in Practice.

Importing big XML file (100k of records) into database

I am facing a problem while parsing the XML: it's consuming 47% of CPU and is very slow. It seems like the DOM loads the XML into memory and from there starts reading the XML tree node by node.
I am reading a node and dumping it to the database.
I want a solution where I can read the XML without loading it all into memory.
I am using JDK 1.4.2_05.
Look at SAX parsers; that's the only way to do something with XML without building the full DOM in memory. There are some limitations, but maybe it will suit your needs.
Try StAX or SAX.
The Nux project includes the StreamingPathFilter class. With this class you can combine the streaming facilities and low memory footprint of SAX with the ease of use of DOM.
But this works only if your XML document has a record like structure. E.g. lots of <person/> elements.
(Following examples are taken from the Nux website and modified by me)
First you define how to handle one record:
StreamingTransform myTransform = new StreamingTransform() {
    public Nodes transform(Element person) {
        // Process person element, i.e. store it in a database
        return new Nodes(); // mark element as subject to garbage collection
    }
};
Then you create a StreamingPathFilter passing an XPath expression which matches to your record nodes.
// parse document with a filtering Builder
NodeFactory factory = new StreamingPathFilter("/persons/person", null).
createNodeFactory(null, myTransform);
new Builder(factory).build(new File("/tmp/persons.xml"));
The Nux library seems to be unmaintained, but it is still useful.

Reading Huge XML File using StAX and XPath

The input file contains thousands of transactions in XML format and is around 10 GB in size. The requirement is to pick each transaction XML based on user input and send it to a processing system.
The sample content of the file
<transactions>
<txn id="1">
<name> product 1</name>
<price>29.99</price>
</txn>
<txn id="2">
<name> product 2</name>
<price>59.59</price>
</txn>
</transactions>
The (technical) user is expected to give the input tag name, like <txn>.
We would like to provide this solution to be more generic. The file content might be different and users can give a XPath expression like "//transactions/txn" to pick individual transactions.
There are a few technical things we have to consider here:
The file can be in a shared location or FTP
Since the file size is huge, we can't load the entire file in JVM
Can we use StAX parser for this scenario? It has to take XPath expression as a input and pick/select transaction XML.
Looking for suggestions. Thanks in advance.
If performance is an important factor, and/or the document size is large (both of which seem to be the case here), the difference between an event parser (like SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document prior to evaluating the XPath expression. [It's interesting to note that all Java Document Object Model implementations like the DOM or Axiom use an event processor (like SAX or StAX) to build the in-memory representation, so if you can ever get by with only the event processor you're saving both memory and the time it takes to build a DOM.]
As I mentioned, the XPath implementation in the JDK operates upon a W3C DOM Document. You can see this in the Java JDK source code implementation by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where prior to the evaluate() method being called the parser must first parse the source:
Document document = getParser().parse( source );
After this your 10GB of XML will be represented in memory (plus whatever overhead) — probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a really strong justification for an XPath (except perhaps programming elegance). The same would be true for the XProc suggestion: this would also build a DOM. If you truly need a DOM you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API and builds its DOM over StAX, so it's fast, and uses Jaxen for its XPath implementation. Jaxen requires some kind of DOM (W3C DOM, DOM4J, or JDOM). This will be true of all XPath implementations, so if you don't truly need XPath sticking with just the events parser would be recommended.
SAX is the old streaming API, with StAX newer and a great deal faster. Using either the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating an XML event filter that first matches on element type name (to capture your <txn> elements). This will create small bursts of events (element, attribute, text) that can be checked against your matching user values. Upon a suitable match you could either pull the necessary information from the events or pipe the bounded events to build a mini-DOM from them if you find the result easier to navigate. But it sounds like that might be overkill if the markup is simple.
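A sketch of that filtering idea with the JDK's StAX reader, matching <txn> elements by their id attribute against a user-supplied value (using the sample markup from the question; the method name is made up):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class TxnFilterDemo {
    // Returns the <name> of the first txn whose id attribute matches wantedId;
    // only one element's worth of data is in memory at any time.
    static String findTxnName(String xml, String wantedId) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        boolean inWantedTxn = false;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                if (r.getLocalName().equals("txn")) {
                    inWantedTxn = wantedId.equals(r.getAttributeValue(null, "id"));
                } else if (inWantedTxn && r.getLocalName().equals("name")) {
                    return r.getElementText().trim();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<transactions>"
                + "<txn id=\"1\"><name> product 1</name><price>29.99</price></txn>"
                + "<txn id=\"2\"><name> product 2</name><price>59.59</price></txn>"
                + "</transactions>";
        System.out.println(findTxnName(xml, "2"));
    }
}
```

In a real 10 GB scenario the StringReader would be replaced with an InputStream over the file or FTP connection; the matching logic is unchanged.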
This would likely be the simplest, fastest possible approach and avoid the memory overhead of building a DOM. If you passed the names of the element and attribute to the filter (so that your matching algorithm is configurable) you could make it relatively generic.
StAX and XPath are very different things. StAX allows you to parse a streaming XML document in a forward direction only. XPath allows navigation in both directions. StAX is a very fast streaming XML parser, but if you want XPath, Java has a separate library for that.
Take a look at this question for a very similar discussion: Is there any XPath processor for SAX model?
We regularly parse 1GB+ complex XML files by using a SAX parser which does exactly what you described: It extracts partial DOM trees that can be conveniently queried using XPATH.
I blogged about it here. It uses a SAX rather than a StAX parser, but may be worth a look.
It's definitely a use case for XProc with a streaming and parallel processing implementation like QuiXProc (http://code.google.com/p/quixproc)
In this situation, you will have to use
<p:for-each>
<p:iteration-source select="//transactions/txn"/>
<!-- you processing on a small file -->
</p:for-each>
You can even wrap each of the resulting transformations with a single line of XProc:
<p:wrap-sequence wrapper="transactions"/>
Hope this helps.
A fun solution for processing huge XML files >10GB.
Use ANTLR to create byte offsets for the parts of interest. This will save some memory compared with a DOM-based approach.
Then use JAXB to read the parts from their byte positions.
Details can be found, using Wikipedia dumps (17 GB) as the example, in this SO answer: https://stackoverflow.com/a/43367629/1485527
Streaming Transformations for XML (STX) might be what you need.
Do you need to process it fast, or do you need fast lookups in the data? These requirements call for different approaches.
For fast reading of the whole data set, StAX will be OK.
If you need fast lookups, you may need to load it into some database, e.g. Berkeley DB XML.

Java XML Parser for huge files

I need an XML parser to parse a file that is approximately 1.8 GB.
So the parser should not load the whole file into memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
Use a SAX based parser that presents you with the contents of the document in a stream of events.
StAX API is easier to deal with compared to SAX. Here is a short tutorial
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly, store it someplace else (a database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on; you have to pay attention to some things, like when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current XPath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
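That content-handler pattern might be sketched like this (the element names and the /library/book/title path are made up for illustration):

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

public class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0); // characters() may arrive in pieces, so buffer per element
    }

    @Override public void characters(char[] ch, int start, int len) {
        text.append(ch, start, len);
    }

    @Override public void endElement(String uri, String local, String qName) {
        String current = "/" + String.join("/", path);
        if (current.equals("/library/book/title")) {   // the path we care about
            System.out.println("title: " + text.toString().trim());
        }
        path.removeLast();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library><book><title>Dune</title></book></library>";
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new PathTrackingHandler());
    }
}
```

The deque holds only the ancestors of the current element, so memory use is bounded by document depth, not document size.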
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but which wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4GB and I had an 8GB machine, so I figured maybe 3GB of the file was just text, and java.lang.String would probably need 6GB for that text, given its UTF-16 representation.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection will result in the pages getting accessed in a random-order manner and also objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the FS can obviously handle a sequential write of the 3GB just fine, and when reading it back the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done), and it has (AFAIK) no limit on the size of the files it can process.
