Splitting a big XML file into smaller ones - java

I'm currently working on a project that requires me to split an XML file. For example, here is a sample:
<Lakes>
<Lake>
<id>1</id>
<Name>Caspian</Name>
<Type>Natural</Type>
</Lake>
<Lake>
<id>2</id>
<Name>Moreo</Name>
<Type>Glacial</Type>
</Lake>
<Lake>
<id>3</id>
<Name>Sina</Name>
<Type>Artificial</Type>
</Lake>
</Lakes>
Now, in my Java code, ideally what would happen is that it splits the XML into three small ones (for this example) and sends each of them out using a messenger service. The code for the messenger service is not important; I have that done already.
So for example the code would run, split the first part into this:
<Lakes>
<Lake>
<id>1</id>
<Name>Caspian</Name>
<Type>Natural</Type>
</Lake>
</Lakes>
and then the Java code would send this out in a message. It would then move on to the next part, send that out, and so on until it reaches the end of the big XML. This can be done through XSLT or through Java; it doesn't matter. Any ideas?
To make it clear: I pretty much know how to break up a file using XSLT, but I don't know how to break it up and send each part individually, one at a time. I also don't want to store anything locally, so ideally each part would be turned into a string and sent out.

If the way you have to chunk your files is fixed and known, the easiest solution is to use SAX or StAX to do it programmatically. I personally prefer StAX for this kind of task, as the code is generally cleaner and easier to understand, but SAX will do the job equally well.
XSLT is a great tool, but its main drawback is that a stylesheet can only produce one output document. And apart from a few exceptions, XSLT engines don't support streaming, so if the initial file is too big to fit in memory, you can't use them.
Update: In XSLT 2.0 <xsl:result-document> can be used to produce multiple output files, but if you want to get your chunks one by one and not store them in files, it's not ideal.
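For illustration, here is a minimal StAX sketch of that approach, assuming the <Lakes> document from the question and a hypothetical send(String) method standing in for the messenger call; it copies each <Lake> element into its own string and wraps it back in a <Lakes> root:
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class LakeSplitter {

    public static void main(String[] args) throws Exception {
        String bigXml = "<Lakes>...</Lakes>"; // however you obtain the full document
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(bigXml));
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            // Each <Lake> start tag opens a new chunk
            if (event.isStartElement()
                    && "Lake".equals(event.asStartElement().getName().getLocalPart())) {
                StringWriter chunk = new StringWriter();
                XMLEventWriter writer = outFactory.createXMLEventWriter(chunk);
                writer.add(event);
                int depth = 1;
                // Copy events until the matching </Lake> closes the chunk
                while (depth > 0) {
                    XMLEvent e = reader.nextEvent();
                    if (e.isStartElement()) depth++;
                    if (e.isEndElement()) depth--;
                    writer.add(e);
                }
                writer.close();
                send("<Lakes>" + chunk + "</Lakes>");
            }
        }
        reader.close();
    }

    // Stand-in for the messenger service mentioned in the question
    private static void send(String xml) {
        System.out.println(xml);
    }
}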

I would stream the XML (instead of building a DOM tree in memory) and cut the chunks out on the fly. Whenever you meet a <Lake> start tag, start copying the content into a buffer, which you send and then reset when the closing </Lake> tag is met.
EDIT: Have a look at this link to know more about XML streaming in Java.
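A rough SAX sketch of that buffering idea, using the <Lakes>/<Lake> document from the question and a hypothetical send(String) method in place of the messenger call (it rebuilds tags naively and does not re-escape text or copy attributes, so treat it as a starting point only):
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class LakeChunkHandler extends DefaultHandler {

    private StringBuilder buffer;
    private boolean insideLake;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("Lake".equals(qName)) {
            insideLake = true;
            buffer = new StringBuilder("<Lakes><Lake>");
        } else if (insideLake) {
            buffer.append('<').append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insideLake) {
            buffer.append(ch, start, length); // real code should re-escape &, < and >
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("Lake".equals(qName)) {
            buffer.append("</Lake></Lakes>");
            insideLake = false;
            send(buffer.toString());
        } else if (insideLake) {
            buffer.append("</").append(qName).append('>');
        }
    }

    // Stand-in for the messenger service
    private void send(String xml) {
        System.out.println(xml);
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("lakes.xml"), new LakeChunkHandler());
    }
}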

Related

restart SAX parser from the middle of the document

I'm working on a project that needs to parse a very big XML file (about 10GB). Because the processing time is really long (days), it's possible that my code will exit in the middle of the process, so I want to save my code's status once in a while and then be able to restart it from the last save point.
Is there a way to start (restart) a SAX parser somewhere other than the beginning of an XML file?
P.S: I'm programming using Python, but solutions for Java and C++ are also acceptable.
Not really sure if this answers your question, but I would take a different approach. 10GB is not THAT much data, so you could implement two-phase parsing.
Phase 1 would be to split the file into smaller chunks based on some tag, so you end up with several smaller files. For example, if your first file is A.xml, you split it into A_0.xml, A_1.xml, etc.
Phase 2 would do the real heavy lifting on each chunk, so you invoke it on A_0.xml, then on A_1.xml, and so on. You could then restart on a chunk after your code has exited.
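A phase-1 sketch in Java (the thread is cross-language, so treat this as just one possible shape): it splits on a hypothetical <record> element and writes each one to its own numbered file using StAX:
import java.io.FileInputStream;
import java.io.FileWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class Phase1Splitter {
    public static void main(String[] args) throws Exception {
        XMLEventReader in = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream("A.xml"));
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        int n = 0;
        while (in.hasNext()) {
            XMLEvent event = in.nextEvent();
            // Start a new chunk file each time a <record> element begins
            if (event.isStartElement()
                    && "record".equals(event.asStartElement().getName().getLocalPart())) {
                try (FileWriter fw = new FileWriter("A_" + n++ + ".xml")) {
                    XMLEventWriter out = outFactory.createXMLEventWriter(fw);
                    out.add(event);
                    int depth = 1;
                    // Copy events until the matching end tag of this <record>
                    while (depth > 0) {
                        XMLEvent e = in.nextEvent();
                        if (e.isStartElement()) depth++;
                        if (e.isEndElement()) depth--;
                        out.add(e);
                    }
                    out.flush();
                    out.close();
                }
            }
        }
        in.close();
    }
}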

Which format to use for logs in profilers?

In the profiler I am writing, which is in fact a JVMTI agent for Java programs, I need a format in which to log the events collected. Furthermore, these logs have to be sent to a socket and read by a GUI somewhere else, so I need serialization that works between two languages.
I already implemented my own protocol in XML and it worked very well. However, I was told to consider another format, since building XML might be very slow and every additional piece of code executed in the profiler heavily influences the profiled program. That is true, but does building an XML DOM really take that long?
I have used TinyXML so far. I hope no one points to RapidXML, as I expect they are not that different on a non-embedded machine.
What do you think? Currently I am trying to reimplement it with protobuf, which claims to be n times faster than XML.
I have a design I am working on for all log files in my remit. I record data in JSON, but the JSON data is nested in a very simple XML format.
e.g.
<entry ts="2011-02-23T17:18:19.202" level="trc_1" typ="trace">New Message Received</entry>
<entry ts="2011-02-23T17:18:19.202" level="trace" typ="msg"><data>{"Name":"AgtConf","AgtId":1111,...}</data></entry>
That way I can easily separate out data and logging, but keep the logging directory from getting complicated. It also saves having to write my own parser for a custom format. However, given your situation, I recommend using JSON only, since you are basically using it to serialise. JSON is very much human-readable when it is formatted correctly, it can be very concise, and there are stable parsers for it.
My first choice is always the traditional text file: you can append new entries at the end of the file.

Reading only root element in XML

In many REST-based API calls, there is a parameter called nextURL, using which we can query for the next URL. This is usually in the root element (or maybe the next one).
In general, how do you read this? If you are using a standard XML parser, it reads and loads the entire XML, and only then do you get to read the nextURL via getElementsByTag. Is there a better workaround? Reading the entire XML is, of course, a waste of time and memory.
Edit: An example XML would be something like
<result publisher="xyz" nextURL="http://actualurl?since_date=<newdate>">
<element>adfsaf</element>
..
</result>
I need to capture the new since_date without reading the entire XML.
Python: You could use the ElementTree iterparse method ... provided the data you want is in an attribute, which will have been parsed by the time that you get the start event. If it's in the text or tail of the element, you will have to wait until the end event. It would be a good idea if you edited your question to show what your XML looks like, and explain "or maybe in the next one" with an example.
The term "Standard XML parser" covers a lot of territory, so much so that I don't think that you can generalize their behaviors. For instance, a standard DOM parser is tree-based and will read the entire XML into memory, but a SAX parser (and I think StAX as well) won't but rather will advance as the app desires it to advance. It sounds like the latter, a SAX or StAX parser, is what you need.
Edit: Please be sure to read KitsuneYMG's comment below on the difference between SAX and StAX behaviors.
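In Java, a StAX version of that is only a few lines; a sketch assuming the <result> document shown above (only the root start tag is ever read):
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class NextUrlReader {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("result.xml"));
        String nextUrl = null;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                // The first start element is the root (<result> in the example above)
                nextUrl = reader.getAttributeValue(null, "nextURL");
                break; // stop here; the rest of the document is never parsed
            }
        }
        reader.close();
        System.out.println("nextURL = " + nextUrl);
    }
}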

Java XML Parser for huge files

I need an XML parser to parse a file that is approximately 1.8 GB,
so the parser should not load the whole file into memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
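A bare-bones cursor-style loop with javax.xml.stream, just to show the shape of it (the file name and element name are placeholders, and getElementText() assumes the element contains only text):
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class HugeFileScan {
    public static void main(String[] args) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("huge.xml"));
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "item".equals(r.getLocalName())) {   // placeholder element name
                // getElementText() reads the text content and advances past the end tag;
                // only the parser's current window of the file is ever held in memory
                System.out.println(r.getElementText());
            }
        }
        r.close();
    }
}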
Use a SAX based parser that presents you with the contents of the document in a stream of events.
The StAX API is easier to deal with compared to SAX. Here is a short tutorial.
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly, store it somewhere else (a database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can go wrong in the middle of 1.8GB.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense here. The API is a little hard to get a grip on; you have to pay attention to some things, like when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. That way you can keep track of the current XPath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
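For example, a content handler along those lines might look like this; the path "Lakes/Lake/Name" is just a hypothetical path of interest, borrowed from the first question on this page:
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Tracks the current path while parsing and grabs text only for the paths we care about.
public class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.addLast(qName);
        text.setLength(0); // characters() may be called several times, so collect per element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String current = String.join("/", path);
        if (current.equals("Lakes/Lake/Name")) {   // hypothetical path of interest
            System.out.println("Name = " + text);  // hand off / store the value here
        }
        path.removeLast();
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("big.xml"), new PathTrackingHandler());
    }
}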
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file but which wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4GB and I had an 8GB machine, so I figured maybe 3GB of the file was just text, and java.lang.String would probably need 6GB for that text, given its UTF-16 representation.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark-and-sweep garbage collection results in pages being accessed in a random order and in objects being moved from one pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the filesystem can obviously handle a sequential write of 3GB just fine, and when reading it back in, the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done), and it has (AFAIK) no limit on the size of the files it can process.

How to best output large single line XML file (with Java/Eclipse)?

We have a process that outputs the contents of a large XML file to System.out.
When this output is pretty-printed (i.e. multiple lines) everything works, but when it's all on one line, Eclipse crashes with an OutOfMemoryError. Any ideas how to prevent this?
Sounds like it is the Console panel blowing up. Consider limiting its buffer size.
EDIT: It's in Preferences. Search for Console.
How do you print it on one line?
- using several System.out.print(String s) calls, or
- using System.out.println(String verybigstring)?
In the second case, you need a lot more memory...
If you want more memory for Eclipse, you could try increasing it by changing the -Xmx value in eclipse.ini.
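For reference, the heap settings sit after the -vmargs line in eclipse.ini; the numbers below are just example values:
-vmargs
-Xms256m
-Xmx2048m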
I'm going to assume that you're building an org.w3c.dom.Document and writing it using a serializer. If you're hand-building an XML string, you're all but guaranteed to be producing something that's almost-but-not-quite XML, and I strongly suggest fixing that first.
That said, if you're writing to a stream from the serializer (and System.out is a stream), then you should be writing directly to the stream rather than writing to a string and printing that (which is what you'd be doing with a StringWriter). The reason is that the XML serializer will properly handle character encodings, while going serializer to String to stream may not.
If you're not currently building a DOM and are concerned about the memory requirements of doing so, then I suggest looking at the Practical XML library (which I maintain), in particular the builder package. It uses lightweight nodes that are then output via a serializer using a SAX transform.
Edit in response to comment:
OK, you've got the serializer covered with XStream. I'm next going to assume that you are calling XStream.toXML(Object) to produce the string, and recommend that you instead call the variant toXML(Object, OutputStream) and pass it the actual output stream. The reason for this is that XML is very sensitive to character encoding, which is something that often breaks when converting strings to streams.
This may, of course, cause issues with building your POST request, particularly if you're using a library that doesn't provide you an OutputStream.
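A minimal sketch of the difference, with XStream on the classpath and System.out standing in for the real output stream of your POST:
import com.thoughtworks.xstream.XStream;

public class StreamingSerialization {
    public static void main(String[] args) {
        XStream xstream = new XStream();
        String[] payload = { "a", "b", "c" }; // stand-in for the real object graph

        // Instead of: String xml = xstream.toXML(payload); System.out.println(xml);
        // write straight to the stream so XStream controls the character encoding end to end.
        xstream.toXML(payload, System.out);
    }
}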
