Java XML Parser for huge files

I need an XML parser to parse a file that is approximately 1.8 GB.
So the parser should not load the whole file into memory.
Any suggestions?

Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), which is included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
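For illustration, here is a minimal StAX cursor-API sketch (the file name "huge.xml" and element name "record" are just placeholders, not anything from the question); it pulls one event at a time, so the whole document is never held in memory:
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;
import java.io.InputStream;

XMLInputFactory factory = XMLInputFactory.newInstance();
try (InputStream in = new FileInputStream("huge.xml")) {
    XMLStreamReader reader = factory.createXMLStreamReader(in);
    while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT
                && "record".equals(reader.getLocalName())) {
            // process one record at a time here, then forget it
        }
    }
    reader.close();
}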

Use a SAX based parser that presents you with the contents of the document in a stream of events.

The StAX API is easier to deal with than SAX. Here is a short tutorial.

Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.

As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly, store it someplace else (a database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable. A lot can go wrong somewhere in the middle of 1.8 GB.

Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on; you have to pay attention to details such as when the characters() method is called. But the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. You can then keep track of the current xpath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
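A rough sketch of such a content handler (the file name, element names, and the path check are placeholders invented for the example):
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (String.join("/", path).endsWith("records/record")) {
            // a complete chunk has been read: store it, hand it off, or process it here
        }
        path.removeLast();
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("huge.xml"), new PathTrackingHandler());
    }
}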

Use almost any SAX Parser to stream the file a bit at a time.

I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but which wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text, since it stores characters in UTF-16.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection will result in the pages getting accessed in a random-order manner and also objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the FS can obviously handle sequential-write of the 3GB just fine, and when reading it in the OS will use available memory for a file-system cache; there might still be random-access reads but fewer than a GC in java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file

+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done), and it has (AFAIK) no limit on the size of the files it can process.

Related

Create a copy of an XML file in memory in Java

I need to create a copy of an XML file in memory using Java, and I need to edit this copy in memory without affecting the original one. After making changes to this XML in memory, I need to send it as input to a function. What is the appropriate option? Please help me.
You can use the Java native API for XML parsing:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
File file = new File("xml_file_name");
Document doc = builder.parse(file);
and then edit the Document in memory before sending it to your designated function.
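For example, continuing the snippet above (the element name and sendToFunction are just placeholders), any change stays in memory until you explicitly write the Document somewhere:
Element root = doc.getDocumentElement();
Element note = doc.createElement("note");      // placeholder element name
note.setTextContent("edited in memory only");
root.appendChild(note);
sendToFunction(doc);                           // the original file on disk is untouched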
Do what you wrote:
Read the file.
Write it to another file.
Edit that other file.
Pass it to the function. Here you have to decide whether it's better to pass a file or a path.
What you are looking for is ByteArrayOutputStream: http://docs.oracle.com/javase/7/docs/api/java/io/ByteArrayOutputStream.html
This allows you to write the document into a byte array in memory; most XML libraries will accept any implementation of OutputStream.
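A hedged sketch of that idea, using the standard Transformer API to serialize a DOM Document (assumed here to be in a variable named doc) into memory:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

ByteArrayOutputStream buffer = new ByteArrayOutputStream();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(doc), new StreamResult(buffer));
// hand the in-memory copy to anything that wants an InputStream
ByteArrayInputStream copy = new ByteArrayInputStream(buffer.toByteArray());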
Given that the file is XML, you should consider loading it into the Document Object Model (DOM): https://docs.oracle.com/javase/tutorial/jaxp/dom/readingXML.html
That will make it easier for you to modify it and write it back as valid XML document.
I would only suggest loading it as bytes/characters if you're operating on it at a byte level. An example of when that might be appropriate is if you're making some character encoding translation (say UTF-16 -> UTF-8) or removing 'illegal' characters.
Code that tries to parse and modify XML in place usually becomes dreadfully bloated if it covers all valid XML files.
Unless you're an XML domain expert, pick a parser off the shelf; the ecosystem is full of good libraries.
If the files may be large and your logic is amenable, I would prefer an XML streaming model such as SAX: https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
However, I get the impression you're not very experienced, and non-experts tend to struggle with SAX's event-driven parsing model.
Try DOM first time out.

What is the fastest file / way to parse a large data file?

So I am working on a GAE project. I need to look up cities, country names and country codes for sign-ups, LBS, etc.
Now I figured that putting all the information in the Datastore is rather stupid, as it will be used quite frequently and it's going to eat up my Datastore quota for no reason, especially since these lists aren't going to change, so it's pointless to put them in the Datastore.
Now that leaves me with a few options:
API - No budget for paid services, free ones are not exactly reliable.
Upload Parse-able file - Favorable option as I like the certainty that the data will always be there.
So I got the files needed from GeoNames (link has source files for all countries in case someone needs it). The file for each country is a regular UTF-8 tab delimited file which is great.
However, now that I have the option to choose how to format and access the data, the question is:
What is the best way to format and retrieve data systematically from a static file in a Java servlet container?
The best way being the fastest, and least resource hungry method.
Valid options:
TXT file, tab delimited
XML file Static
Java Class with Tons of enums
I know that importing the country files as Java enums and going through their values will be very fast, but do you think this is going to affect memory beyond reasonable limits? On the other hand, with a flat file, every time I need to access a record the loop will go through a few thousand lines until it finds the required record... reading line by line means no memory issues, but it is incredibly slow. I have had some experience parsing an Excel file in a Java servlet, and it took something like 20 seconds just to parse 250 records; at larger scale the response WILL time out (no doubt about it), so is XML anything like Excel?
Thank you very much guys !! Please provide opinions, all and anything is appreciated !
The easiest and fastest way would be to keep the file as a static web resource under the WEB-INF folder and, on application startup, have a context listener load it into memory.
In memory it should be a Map, keyed by whatever you want to search by. This gives you roughly constant access time.
Memory consumption only matters if the data is really big. A hundred thousand records, for example, is not worth optimizing if you access them many times.
The static file should be plain text or CSV; those are read and parsed most efficiently. There is no need for XML formatting, as parsing it would be slower.
If the list is really big, you can break it up into multiple smaller files and only parse those when they are required. A reasonable, easy partitioning would be by country, but any other partitioning would work (for example, by the first few characters of the name).
You could also consider building this Map in memory once, then serializing it to a binary file and including that binary file as a static resource; that way you would only have to deserialize the Map, with no need to parse it as a text file and build the objects yourself.
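For illustration, a minimal context-listener sketch of the approach above (the resource path, tab-separated layout, and map key are invented for the example):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

public class CountryDataListener implements ServletContextListener {
    public static final Map<String, String> COUNTRIES = new HashMap<>();

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                sce.getServletContext().getResourceAsStream("/WEB-INF/countries.txt"),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                COUNTRIES.put(fields[0], fields[1]); // e.g. country code -> country name
            }
        } catch (Exception e) {
            throw new RuntimeException("Could not load country data", e);
        }
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) { }
}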
Improvements on the data file
An alternative to having the static resource file as a text/CSV file or a serialized Map data file would be to have it as a binary data file where you could create your own custom file format.
Using DataOutputStream you can write data to a binary file in a very compact and efficient way. Then you could use DataInputStream to load data from this custom file.
This solution has the advantage that the file can be much smaller (compared to plain text / CSV / a serialized Map) and loading it will be much faster (for example, DataInputStream doesn't parse numbers from text; it reads their bytes directly).
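A rough sketch of that custom binary format idea (the record layout here is invented): DataOutputStream writes the values compactly, and DataInputStream reads them back without any text parsing.
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BinaryGeoFile {
    static void write(String path, String[][] records) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.writeInt(records.length);        // record count up front
            for (String[] r : records) {
                out.writeUTF(r[0]);              // e.g. country code
                out.writeUTF(r[1]);              // e.g. city name
            }
        }
    }

    static void read(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String code = in.readUTF();
                String city = in.readUTF();
                // put the pair into the in-memory Map here
            }
        }
    }
}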
Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:
(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.
(b) your in-memory form is something like a Java HashMap.
If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
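A minimal sketch of option (a), with made-up file and element names: parse once at startup, keep the Document, and run XPath queries against it on demand.
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

Document countries = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new File("countries.xml"));   // parse once at start of day
XPath xpath = XPathFactory.newInstance().newXPath();
String name = xpath.evaluate("/countries/country[@code='DE']/name", countries);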

Using POI or Tika to extract text, stream-to-stream without loading the entire file in memory

I'm trying to use either Apache POI or PDFBox by themselves, or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e. hundreds of megabytes in some cases). Also, my application is multi-threaded, so I will be parsing many of these large files concurrently.
At that scale, I MUST work with the files in a streaming manner. It's not an option to hold an entire file in main memory at any step along the way.
I have seen many source code examples for loading files into Tika / POI / PDFBox via input streams. I have seen many examples for extracting plain text via output streams. However, I've performed some basic memory profiling experiments... and I haven't yet found a way with any of these libraries (Tika, POI, or PDFBox) to avoid loading an entire document into main memory.
In between reading from a stream and writing to a stream, there is obviously a conversion step in the middle... which I have not yet found a way to perform on a streaming basis. Am I missing something, or is this a known issue with extracting text from MS Office or PDF files using Tika / POI / PDFBox? Can I have true end-to-end streaming, without a file being fully loaded into main memory at any point along the way?
The first thing to make sure of, if you care about the memory footprint, is that you're using a TikaInputStream backed by a File, e.g. change from something like
InputStream input = new FileInputStream("foo.xls");
To something like
InputStream input = TikaInputStream.get(new File("foo.xls"));
If you really only have an InputStream, not a file, and you want the lower memory option if possible, force Tika to buffer it to a temp file with something like
InputStream origInput = getAnInputStream();
TikaInputStream input = TikaInputStream.get(origInput);
input.getFile();
Many, but not all, parsers will be able to take advantage of the backing File and read only the bits they need into memory, rather than buffering the whole thing, which will help.
Next up, make sure your ContentHandler doesn't buffer the whole contents into memory before outputting. Anything which does XPath lookups on the resulting document is probably out, as is anything with an internal StringBuffer or similar. Pick a simpler one, and make sure you're set up to write the resulting HTML / text SAX events somewhere as they come in.
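As an example of such a simpler handler (file names are placeholders): a BodyContentHandler constructed around an OutputStream writes the extracted text out as the SAX events arrive, instead of collecting everything in an internal buffer.
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

try (TikaInputStream input = TikaInputStream.get(new File("big.xlsx"));
     OutputStream textOut = new FileOutputStream("big.txt")) {
    BodyContentHandler handler = new BodyContentHandler(textOut); // streams out, no StringBuffer
    new AutoDetectParser().parse(input, handler, new Metadata(), new ParseContext());
}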
Finally, not all of the Tika parsers support streaming processing. Some only work by parsing the whole file's structure, then wandering through that finding the interesting bits to output. With those, using a File backed TikaInputStream will probably help, but won't stop a fair bit of memory being used.
IIRC, the low memory parsers include:
.xls
.xlsx
All ODF-based formats
XML
Some of the common document parsers which load + parse most/all of the file before being able to output anything include:
.doc / .docx / .ppt / .pptx
.pdf
Images
Videos

Write huge XML file from DOM to file

I have a Java program which queries a table that has millions of records and generates an XML document with each record as a node.
The challenge is that the program is running out of heap memory. I have allocated 2GB heap for the program.
I am looking for alternate approaches of creating such huge xml.
Can we write out a partial DOM object to the file and release the memory?
For example, create 100 nodes in the DOM, write them to the file, release the memory, then create the next 100 nodes, and so on.
Code to write a node to a file:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
DOMSource source = new DOMSource(node);
StreamResult result = new StreamResult(System.out); // or a StreamResult over a FileOutputStream
transformer.transform(source, result);
But how do I release the DOM memory after writing the nodes to file?
Why do you need to generate a DOM? Try to write the XML directly. The most convenient API for outputting XML from Java is the StAX XMLStreamWriter interface. There are a number of implementations of XMLStreamWriter that generate lexical (serialized) XML, including the Saxon serializer which gives you considerable control over the way in which it is serialized (e.g. indentation and encoding) if you need it.
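A minimal XMLStreamWriter sketch (the element names are invented, and resultSet stands in for however you iterate over the millions of database rows): each record is written and flushed as you go, so nothing like a full DOM is ever built.
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

try (OutputStream out = new BufferedOutputStream(new FileOutputStream("records.xml"))) {
    XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
    writer.writeStartDocument("UTF-8", "1.0");
    writer.writeStartElement("records");
    while (resultSet.next()) {                    // placeholder for your JDBC result set
        writer.writeStartElement("record");
        writer.writeCharacters(resultSet.getString("value"));
        writer.writeEndElement();
        writer.flush();                           // nothing accumulates in memory
    }
    writer.writeEndElement();
    writer.writeEndDocument();
    writer.close();
}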
I would use a simple OutputStreamWriter and format the XML myself; you don't need to create a huge DOM structure, and I think this is the fastest way.
Of course, it depends on how much XML structure you need. If one table row corresponds to one XML element, this should be the fastest way to do it.
For processing a huge document, SAX is often preferred precisely because it keeps in memory only what you have explicitly decided to keep in memory -- which means you can use a specialized, and hence smaller, data model. For tasks such as this one, where you have no need to cross-reference different parts of the document, you may not need any data model at all and can just generate SAX events directly from the input data and feed those into the serializer.
(StAX is pretty much equivalent in this regard. I usually prefer to stay with SAX since it's part of the JAXP API package and should be present in just about every Java environment at this point, but StAX may be a bit easier to work with.)

How to best output large single line XML file (with Java/Eclipse)?

We have a process that outputs the contents of a large XML file to System.out.
When this output is pretty printed (ie: multiple lines) everything works. But when it's on one line Eclipse crashes with an OutOfMemory error. Any ideas how to prevent this?
Sounds like it is the Console panel blowing up. Consider limiting its buffer size.
EDIT: It's in Preferences. Search for Console.
How do you print it on one line?
using several calls to System.out.print(String s), or
using System.out.println(String verybigstring)?
In the second case, you need a lot more memory...
If you want more memory for Eclipse, you could try increasing it by changing the -Xmx value in eclipse.ini.
I'm going to assume that you're building an org.w3c.dom.Document and writing it using a serializer. If you're hand-building an XML string, you're all but guaranteed to be producing something that's almost-but-not-quite XML, and I strongly suggest fixing that first.
That said, if you're writing to a stream from the serializer (and System.out is a stream), then you should be writing directly to the stream rather than writing to a string and printing that (which you'd do with a StringWriter). The reason for this is that the XML serializer will properly handle character encodings, while going serializer to String to stream may not.
If you're not currently building a DOM, and are concerned about the memory requirements of doing so, then I suggest looking at the Practical XML library (which I maintain), in particular the builder package. It uses lightweight nodes, that are then output via a serializer using a SAX transform.
Edit in response to comment:
OK, you've got the serializer covered with XStream. I'm next going to assume that you are calling XStream.toXML(Object) to produce the string, and recommend that you call the variant toXML(Object, OutputStream), and pass it the actual output. The reason for this is that XML is very sensitive to character encoding, which is something that often breaks when converting strings to streams.
This may, of course, cause issues with building your POST request, particularly if you're using a library that doesn't provide you an OutputStream.
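A hedged sketch of that suggestion (connection and myObject are placeholders for your POST setup and payload): serialize straight into the request body rather than building an intermediate String.
import java.io.OutputStream;
import com.thoughtworks.xstream.XStream;

XStream xstream = new XStream();
try (OutputStream body = connection.getOutputStream()) {  // e.g. an HttpURLConnection
    xstream.toXML(myObject, body);                        // XStream handles the character encoding
}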
