restart SAX parser from the middle of the document - java

I'm working on a project that needs to parse a very big XML file (about 10 GB). Because processing takes a really long time (days), it's possible that my code will exit in the middle of the process, so I want to save its status once in a while and then be able to restart from the last save point.
Is there a way to start (restart) a SAX parser somewhere other than the beginning of an XML file?
P.S.: I'm programming in Python, but solutions in Java and C++ are also acceptable.

Not really sure if this answers your question, but I would take a different approach. 10 GB is not THAT much data, so you could implement two-phase parsing.
Phase 1 would be to split the file into smaller chunks based on some tag, so you end up with several smaller files. For example, if your first file is A.xml, you split it into A_0.xml, A_1.xml, etc.
Phase 2 would do the real heavy lifting on each chunk, so you invoke it on A_0.xml, then on A_1.xml, and so on. You could then restart at a chunk boundary after your code has exited.
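A rough StAX sketch of phase 1, assuming the records sit directly under the root element; the wrapper element name "records" and the chunk-size parameter are illustrative placeholders, not something from the original answer:

import java.io.*;
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

public class XmlSplitter {

    // Splits A.xml into A_0.xml, A_1.xml, ... with `recordsPerChunk` records per chunk.
    // Assumes the records are the direct children of the root element; the wrapper
    // element name "records" is a placeholder for your real root element.
    public static void split(File input, int recordsPerChunk) throws Exception {
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();

        XMLEventReader reader = inFactory.createXMLEventReader(new FileInputStream(input));
        String base = input.getName().replaceFirst("\\.xml$", "");

        int chunk = 0, depth = 0, records = 0;
        OutputStream chunkOut = null;
        XMLEventWriter chunkWriter = null;

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();

            if (e.isStartElement()) {
                depth++;
                if (depth == 2 && chunkWriter == null) {          // first record of a new chunk
                    File f = new File(input.getParent(), base + "_" + chunk++ + ".xml");
                    chunkOut = new FileOutputStream(f);
                    chunkWriter = outFactory.createXMLEventWriter(chunkOut, "UTF-8");
                    chunkWriter.add(events.createStartDocument());
                    chunkWriter.add(events.createStartElement("", "", "records")); // wrapper root
                }
            }
            if (depth >= 2 && chunkWriter != null) {
                chunkWriter.add(e);                               // copy the record's events verbatim
            }
            if (e.isEndElement()) {
                if (depth == 2 && ++records == recordsPerChunk) { // chunk is full, finish this file
                    chunkWriter.add(events.createEndElement("", "", "records"));
                    chunkWriter.add(events.createEndDocument());
                    chunkWriter.close();
                    chunkOut.close();
                    chunkWriter = null;
                    records = 0;
                }
                depth--;
            }
        }
        if (chunkWriter != null) {                                // close the last, partially filled chunk
            chunkWriter.add(events.createEndElement("", "", "records"));
            chunkWriter.add(events.createEndDocument());
            chunkWriter.close();
            chunkOut.close();
        }
        reader.close();
    }
}

Each A_n.xml is then a self-contained, well-formed document, so phase 2 only needs to record which chunk index it last finished.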

Related

Concurrently parsing a very large XML file using Stax in Java

I need to search and extract a subset of a very large XML file based on a certain key. On very large files (~20 GB), because the StAX algorithm I am using searches the document linearly, it takes 7 minutes when the subset is at the end of the file. I am new to Java and concurrency, so my question might be naive and not very clear. Is it possible to run the same search method on multiple threads, such that the first starts at the beginning of the XML document, the second at the 100,000th byte, the third at the 200,000th byte, etc.? I am using FileInputStream together with the StAX parsing method. I've read that there is a FileInputStream.skip() method which allows one to start reading a file at a certain byte location.
Basically, I want to run the same process on chunks of the document in parallel. Is this possible and a good optimization, or do I just not know what I'm doing/saying?
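As a side note on the FileInputStream.skip() idea: an arbitrary byte offset will almost never land on an element boundary, so each worker thread first has to scan forward to the next record start tag before it can parse anything. A rough sketch of that alignment step, using RandomAccessFile.seek() (which has the same effect as skip() here); the "<entry" tag is a made-up placeholder:

import java.io.*;

// Finds the byte offset of the next record start tag at or after startOffset.
// "<entry" is a hypothetical record start tag; substitute the real element name.
public class ChunkAligner {
    public static long alignTo(File xml, long startOffset, String startTag) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(xml, "r")) {
            raf.seek(startOffset);                       // same effect as FileInputStream.skip()
            long lineStart = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {    // readLine() is byte-per-char, fine for ASCII tags
                int idx = line.indexOf(startTag);
                if (idx >= 0) {
                    return lineStart + idx;              // byte offset of the next record start
                }
                lineStart = raf.getFilePointer();
            }
            return -1;                                   // no further record after this offset
        }
    }
}

Each worker would then parse only from its aligned offset up to the next worker's aligned offset, wrapping that region in a synthetic root element (or treating it as a fragment) before handing it to StAX; whether the extra complexity beats a single linear pass is worth measuring.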

What is the fastest file / way to parse a large data file?

So I am working on a GAE project. I need to look up cities, country names and country codes for sign-ups, LBS, etc.
Now I figured that putting all the information in the Datastore is rather wasteful, as it will be used quite frequently and it's going to eat into my Datastore quota for no reason, especially since these lists aren't going to change, so it's pointless to keep them in the Datastore.
Now that leaves me with a few options:
API - no budget for paid services, and the free ones are not exactly reliable.
Upload a parseable file - the favorable option, as I like the certainty that the data will always be there.
So I got the files I needed from GeoNames (the link has source files for all countries in case someone needs them). The file for each country is a regular UTF-8 tab-delimited file, which is great.
However, now that I have the option to choose how to format and access the data, the question is:
What is the best way to format and retrieve data systematically from a static file in a Java servlet container?
The best way being the fastest and least resource-hungry method.
Valid options:
Tab-delimited TXT file
Static XML file
Java class with tons of enums
I know that importing the country files as Java enums and going through their values will be very fast, but do you think this will affect memory beyond reasonable limits? On the other hand, with a flat file, every time I need to access a record the loop will go through a few thousand lines until it finds the required record... reading line by line means no memory issues, but it is incredibly slow. I have had some experience parsing an Excel file in a Java servlet and it took something like 20 seconds just to parse 250 records; at a larger scale the response time WILL time out (no doubt about it), so is XML anything like Excel in that respect?
Thank you very much, guys! Please share your opinions; anything and everything is appreciated!
The easiest and fastest way would be to have the file as a static web resource under the WEB-INF folder and, on application startup, have a context listener load the file into memory.
In memory it should be a Map, keyed by whatever you want to search by. This gives you roughly constant access time.
Memory consumption only matters if the data is really big. A hundred thousand records, for example, are not worth optimizing away if you need to access them this often.
The static file should be in plain text or CSV format; these are read and parsed most efficiently. There is no need for XML formatting, as parsing it would be slower.
If the list is really big, you can break it up into multiple smaller files and parse each one only when it is required. A reasonable, easy partitioning would be by country, but any other partitioning would work (for example, by the first few characters of the name).
You could also consider building this Map in memory once, serializing it to a binary file, and including that binary file as a static resource; then you would only have to deserialize the Map, with no need to parse/process a text file and build the objects yourself.
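A minimal sketch of the context-listener approach described above, assuming a hypothetical WEB-INF/countries.txt with one "code<TAB>name" pair per line (the file name and layout are made up for illustration):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import javax.servlet.*;

// Loads WEB-INF/countries.txt (a hypothetical tab-delimited "code<TAB>name" file)
// into a Map once at startup and exposes it as a ServletContext attribute.
public class CountryDataListener implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        Map<String, String> countriesByCode = new HashMap<>();
        ServletContext ctx = sce.getServletContext();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                ctx.getResourceAsStream("/WEB-INF/countries.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;
                String[] fields = line.split("\t", 2);    // code, name
                if (fields.length < 2) continue;          // skip malformed lines
                countriesByCode.put(fields[0], fields[1]);
            }
        } catch (IOException e) {
            throw new RuntimeException("Could not load country data", e);
        }
        ctx.setAttribute("countriesByCode", countriesByCode);
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) { }
}

Any servlet can then cast getServletContext().getAttribute("countriesByCode") back to a Map and do constant-time lookups, with the parsing cost paid exactly once at startup.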
Improvements on the data file
An alternative to having the static resource file as a text/CSV file or a serialized Map data file would be to have it as a binary data file where you could create your own custom file format.
Using DataOutputStream you can write data to a binary file in a very compact and efficient way. Then you could use DataInputStream to load data from this custom file.
This solution has the advantage that the file can be much smaller (compared to plain text / CSV / a serialized Map) and loading it is much faster (because DataInputStream doesn't parse numbers from text, for example; it reads the bytes of a number directly).
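A rough sketch of such a custom format; writeInt/writeUTF and their read counterparts are the real DataOutputStream/DataInputStream API, while the id-plus-name record layout is just an assumed example:

import java.io.*;
import java.util.*;

// Hypothetical custom binary format: [record count][id, name] per record.
public class BinaryCountryFile {

    public static void write(File out, Map<Integer, String> records) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(out)))) {
            dos.writeInt(records.size());                 // record count header
            for (Map.Entry<Integer, String> e : records.entrySet()) {
                dos.writeInt(e.getKey());                 // raw 4-byte int, no text parsing needed later
                dos.writeUTF(e.getValue());               // length-prefixed modified-UTF-8 string
            }
        }
    }

    public static Map<Integer, String> read(File in) throws IOException {
        try (DataInputStream dis = new DataInputStream(
                new BufferedInputStream(new FileInputStream(in)))) {
            int count = dis.readInt();
            Map<Integer, String> records = new HashMap<>(count * 2);
            for (int i = 0; i < count; i++) {
                records.put(dis.readInt(), dis.readUTF());
            }
            return records;
        }
    }
}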
Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:
(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.
(b) your in-memory form is something like a Java HashMap.
If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
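A small sketch of option (a), assuming a hypothetical countries.xml shaped like <countries><country code="HU"><name>Hungary</name></country>...</countries>:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.Document;

// Option (a): parse the XML once at startup, keep the DOM, query it with XPath.
public class CountryXmlLookup {
    private final Document doc;
    private final XPathFactory xpathFactory = XPathFactory.newInstance();

    public CountryXmlLookup(File countriesXml) throws Exception {
        // Parsing cost is paid once here, at start of day.
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(countriesXml);
    }

    public String nameForCode(String code) throws XPathExpressionException {
        // Hypothetical document structure, see the lead-in above.
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate("/countries/country[@code='" + code + "']/name", doc);
    }
}

Option (b) would simply be a HashMap built once from the same data; it is faster for the single hard-coded lookup, while the XPath version can answer ad-hoc queries without changing the in-memory structure.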

Validate Format Tricky File Using Java

I need to parse and validate a file whose format is a little bit tricky.
Basically the file comes in this format:
\n -- just to make clear it may have empty lines
CLIENT_ID
A_NUMERIC_VALUE
ONE_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT
ANOTHER_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT
\n
\n
CLIENT_ID_2
A_NUMERIC_VALUE_2
ONE_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT_2
ANOTHER_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT_2
OHH_THIS_ONE_HAS_THREE_LINES_OF_COMMENTS
The file will very seldom be big (10 MB is probably the biggest file I've ever seen; usually they are around 900 KB-1 MB).
So I have two problems:
1) How can I effectively validate the format of the file? Using regex + Scanner? (I see this as a very feasible option if I can transform each client entry into a single string, so I can apply the regex to it.)
2) I need to transform each of the entries in the file into Client objects. Should I validate the whole file before transforming it into Java objects, or should I validate the file as I go while transforming its entries into Java objects? (Bear in mind that if any client entry is invalid, processing halts immediately and an exception is thrown, so any objects that were created will be discarded.)
I'm really keen to see your suggestions for question #1. Question #2 is more a curiosity of mine about how you would handle this situation. Ignore #2 if you will, but please answer #1 =)
By the way, does anyone know of a framework that could help me handle this file?
Thanks.
Update:
I saw this question and the problem is very similar to mine, but I'm not sure whether regex is the best way out of this problem. There might be quite a lot of "\n" throughout the file, a varying number of comments for each client entry, and an optional ID, so the regex would have to be quite complex. That's why I mentioned transforming each entry into one row in question #1, because that would make it much easier to create a regex for validation... nevertheless, this solution does not sound very elegant to my ears :(
Cheers.
If you intend to fail the batch if any part is found invalid, then validate the file first.
There are several advantages. One is that validation and processing need not be synchronous. If, for example, you process batches daily but receive files throughout the day, you can validate them as they arrive and ask for corrections before your scheduled processing. Another is that validating whether a file is well-formed is very fast.
A short, simple Perl script would certainly do the job. There is no need to transform the data, if I understand the pattern correctly, and it's all read-forward:
read past any newlines
read and validate a client id
read and validate a numeric value
read and validate one or more comments until a blank line is found
repeat the above four steps until EOF or invalid data detected
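Since the question asks for Java, the same read-forward loop might look roughly like this sketch; the Client holder class, the \w+ id pattern, and the numeric format are assumptions, not part of the original spec:

import java.io.*;
import java.util.*;

// Read-forward validation/parsing of the block format described above.
public class ClientFileParser {

    public static List<Client> parse(File file) throws IOException {
        List<Client> clients = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (line.trim().isEmpty()) continue;          // step 1: skip blank lines

                String clientId = line.trim();                // step 2: client id
                if (!clientId.matches("\\w+")) {
                    throw new IOException("Invalid client id at line " + lineNo);
                }

                String valueLine = in.readLine();             // step 3: numeric value
                lineNo++;
                if (valueLine == null || !valueLine.trim().matches("-?\\d+(\\.\\d+)?")) {
                    throw new IOException("Invalid numeric value at line " + lineNo);
                }

                List<String> comments = new ArrayList<>();    // step 4: one or more comments
                while ((line = in.readLine()) != null && !line.trim().isEmpty()) {
                    lineNo++;
                    comments.add(line);
                }
                lineNo++;                                     // the blank line (or EOF) that ended the block
                if (comments.isEmpty()) {
                    throw new IOException("Expected at least one comment before line " + lineNo);
                }
                clients.add(new Client(clientId, Double.parseDouble(valueLine.trim()), comments));
            }
        }
        return clients;
    }

    // Hypothetical holder class for one parsed entry.
    public static class Client {
        final String id; final double value; final List<String> comments;
        Client(String id, double value, List<String> comments) {
            this.id = id; this.value = value; this.comments = comments;
        }
    }
}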

Splitting a big XML file into smaller ones

I'm currently working on a project that requires me to split an XML file. For example, here is a sample:
<Lakes>
<Lake>
<id>1</id>
<Name>Caspian</Name>
<Type>Natyral</Type>
</Lake>
<Lake>
<id>2</id>
<Name>Moreo</Name>
<Type>Glacial</Type>
</Lake>
<Lake>
<id>3</id>
<Name>Sina</Name>
<Type>Artificial</Type>
</Lake>
</Lakes>
Now, in my Java code, ideally what would happen is that it splits the XML into 3 smaller ones for this example and sends each of them out using a messenger service. The code for the messenger service is not important; I have that done already.
So for example the code would run, split the first part into this:
<Lakes>
<Lake>
<id>1</id>
<Name>Caspian</Name>
<Type>Natyral</Type>
</Lake>
</Lakes>
and then the Java code would send this out in a message. It would then move on to the next part, send that out, and so on, until it reaches the end of the big XML. This can be done through XSLT or through Java; it doesn't matter. Any ideas?
To make it clear, I pretty much know how to break up a file using XSLT, but I don't know how to break it up and send each part individually, one at a time. I also don't want to store anything locally, so ideally the parts would all be turned into strings and sent out.
If the way you have to chunk your files is fixed and known, the easiest solution is to use SAX or StAX to do it programmatically. I personally prefer StAX for this kind of task as the code is generally cleaner and easier to understand but SAX will do the job equally well.
XSLT is a great tool, but its main drawback (in 1.0) is that it can only produce one output document. And apart from a few exceptions, XSLT engines don't support streaming processing, so if the initial file is too big to fit in memory, you can't use them.
Update: In XSLT 2.0 <xsl:result-document> can be used to produce multiple output files, but if you want to get your chunks one by one and not store them in files, it's not ideal.
I would stream the XML (instead of building a DOM tree in memory) and cut the chunks out as I go: whenever you meet a <Lake> start tag, start copying the content into a buffer, and when the closing </Lake> tag is met, send the buffer and reset it.
EDIT: Have a look at this link to learn more about XML streaming in Java.
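A rough StAX sketch of that buffering idea; the Messenger interface stands in for the already-written messaging code and is purely a placeholder:

import java.io.*;
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

// Streams the big document and sends each <Lake> element as its own small
// <Lakes> document, without writing anything to disk.
public class LakeSplitter {

    public static void splitAndSend(InputStream bigXml, Messenger messenger) throws Exception {
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventReader reader = inFactory.createXMLEventReader(bigXml);

        StringWriter buffer = null;
        XMLEventWriter chunkWriter = null;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();

            if (event.isStartElement()
                    && "Lake".equals(event.asStartElement().getName().getLocalPart())) {
                buffer = new StringWriter();
                buffer.write("<Lakes>");                  // re-wrap each chunk in the root element
                chunkWriter = outFactory.createXMLEventWriter(buffer);
            }
            if (chunkWriter != null) {
                chunkWriter.add(event);                   // copy everything inside <Lake>...</Lake>
            }
            if (event.isEndElement()
                    && "Lake".equals(event.asEndElement().getName().getLocalPart())) {
                chunkWriter.close();
                buffer.write("</Lakes>");
                messenger.send(buffer.toString());        // ship this chunk, then forget it
                chunkWriter = null;
                buffer = null;
            }
        }
        reader.close();
    }

    public interface Messenger {                          // placeholder for the real messenger service
        void send(String xmlChunk);
    }
}

Only one <Lake> chunk is ever held in memory at a time, and nothing is written to disk, which matches the "strings only" requirement.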

Java XML Parser for huge files

I need an XML parser to parse a file that is approximately 1.8 GB.
So the parser should not load the whole file into memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
Use a SAX based parser that presents you with the contents of the document in a stream of events.
The StAX API is easier to deal with than SAX. Here is a short tutorial.
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly store it someplace else (database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on, and you have to pay attention to some things, such as when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. That way you can keep track of the current xpath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
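A bare-bones sketch of that kind of handler; the /catalog/item/name path and element names are placeholders for whatever data you actually need:

import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Tracks the current "xpath" while the document streams past and collects the
// text of /catalog/item/name elements (placeholder names for illustration).
public class PathTrackingHandler extends DefaultHandler {

    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0);              // start collecting text for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // characters() may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("/catalog/item/name".equals(currentPath())) {
            handleName(text.toString().trim());   // hand off / store the chunk here
        }
        path.removeLast();
    }

    private String currentPath() {
        return "/" + String.join("/", path);
    }

    private void handleName(String name) {
        // store to a DB, write to another file, etc.
        System.out.println("found name: " + name);
    }
}

You would feed it with something like SAXParserFactory.newInstance().newSAXParser().parse(file, new PathTrackingHandler()), and only the current element's text plus the path stack ever sit in memory.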
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text with its UTF-16 representation.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark-and-sweep garbage collection results in pages being accessed in random order and in objects being moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to a file on disk (the filesystem can obviously handle a sequential write of the 3 GB just fine, and when reading it back the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done) and it has (AFAIK) no limit on the size of the files it can process.
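For reference, that loop in StAX's cursor style looks roughly like this; the "record" element name is a placeholder and the sketch assumes the element contains only text:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Pull-style StAX loop: no callbacks, just advance the cursor until done.
public class StaxLoop {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream(args[0]));
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "record".equals(reader.getLocalName())) {   // "record" is a placeholder
                // getElementText() reads this element's text and advances past its end tag
                System.out.println(reader.getElementText());
            }
        }
        reader.close();
    }
}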
