I've got an XML performance problem that I'm looking to solve.
Specifically I've got the same small/medium sized XML file that's being parsed many hundreds of times.
The functionality is bound to a StAX XML event reader. Its output cannot be cloned or otherwise copied; the only way to reproduce the needed functionality is to run this XML event reader over the XML document again.
For performance I would like to read the XML into a StAX event sequence eagerly, and then replay that event sequence rather than re-parse the XML each time.
I believe the problem is implementation: while this idea is reasonable in principle, "events" are expressed as state changes against the XMLStreamReader, which has a large API surface, a large portion (but not all) of which relates to its "current" event.
Does a system like this already exist?
If I have to build it myself, what might be the best way to ensure correctness?
The usual way to represent an XML document in memory, to avoid parsing it repeatedly, is to use one of the many tree models (JDOM2 and XOM are the best in my view, though many people still use the horrible old DOM model simply because it's packaged in the JDK). So I guess I'm asking why doesn't this "obvious" approach work for you?
There are cases where (internally within Saxon) I use a replayable stream of events instead, simply because storing the events and then replaying them is a bit more efficient than building a tree and then walking the tree. I don't use StAX events for this; I use my own class net.sf.saxon.event.EventBuffer, which holds a list of net.sf.saxon.event.Event objects. Perhaps this event model is a bit better designed for the purpose, being rather simpler than the StAX model. Saxon doesn't have any logic to read an EventBuffer as a StAX event stream, but it would be easy enough to add. It's open source code, so see if you can adapt it.
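To make the record-and-replay idea concrete, here is a minimal sketch using the StAX event API (XMLEventReader) rather than the cursor API (XMLStreamReader) that the question describes. The class and method names (EventBufferSketch, record, countElements) are made up for illustration, and this is not how Saxon's EventBuffer works. It only shows that XMLEvent objects can be collected once and iterated many times.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: parse once into an event list, replay as often as needed.
class EventBufferSketch {

    // Parses the XML text a single time and keeps every StAX event in document order.
    static List<XMLEvent> record(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<XMLEvent> events = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                events.add(reader.nextEvent());
            }
        } finally {
            reader.close();
        }
        return Collections.unmodifiableList(events);
    }

    // Example consumer: replays the buffer and counts start tags with the given local name.
    static int countElements(List<XMLEvent> events, String localName) {
        int count = 0;
        for (XMLEvent event : events) {
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(localName)) {
                count++;
            }
        }
        return count;
    }
}
```

If the consuming code insists on an XMLEventReader or XMLStreamReader rather than a plain list, an adapter over the list would still be needed, and that adapter is where the large API surface mentioned in the question comes back in.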
I have a requirement where I may receive an XML file of 100 MB or more containing a list of companies, and I need to add each company from that file to a table.
I was thinking of using a SAX parser, but I was also considering a StAX parser. Can someone please help me decide which one I should use?
Thanks
StAX has a much easier to use API, so I think it is the better choice. SAX has a low-level push API, which is not very nice to use (e.g. working with char[]). StAX has a much nicer pull API.
Another potential advantage: using StAX you don't have to read the whole document; you may stop as soon as you have what you need.
There is a nice - though quite old - comparison of the Java XML parsing APIs found here.
Using StAX will allow you to minimize the amount of data kept in memory to only the most recently parsed record. Once you insert that record into your table, you no longer need to keep it in memory.
If you use SAX you would (likely) have to parse the entire xml content into memory before inserting records into your table. While it would be possible to insert as you go (when encountering the closing element for a record), that is more complicated with SAX than StAX.
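As a rough illustration of the one-record-at-a-time approach described above, here is a small StAX sketch. The element name company, the name attribute, and the class CompanyStreamSketch are assumptions made for the example; the real file layout isn't shown in the question. The sink stands in for whatever inserts a row into the table.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: stream company records without holding the whole file in memory.
class CompanyStreamSketch {

    // Hands each company name to the sink as soon as its start tag is read.
    // Returns the number of records seen.
    static int streamCompanies(Reader source, Consumer<String> sink) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(source);
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "company".equals(reader.getLocalName())) {
                    // Assumed layout: <company name="..."/>
                    sink.accept(reader.getAttributeValue(null, "name"));
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }
}
```

Only the current record is referenced at any point, so memory use should stay roughly flat regardless of file size.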
My question is both a language implementation question and an ANTLR4 API question. Is there a way I can modify a ParseTree and its accompanying TokenStream?
Here is the scenario. I have a simple language that defines a dataflow program. You can see it on github, if you're curious. I lex and parse the language with ANTLR4. I use listeners to walk the parse tree and evaluate the code.
The problem I have most recently run into is I need to be able to modify the code at runtime. I need to be able to define new objects and create instances from them. Note, I'm not referring to having reflection in the language. I'm referring to having a program like an IDE modify the internal representation of the source code.
I have started off down the path of defining a bunch of definition objects to create an AST, but I just realized this approach will require me to come up with my own solutions for walking the AST. Rather than reinvent the wheel, I'd rather use ANTLR's listeners/visitors.
Another problem I face is the need to be able to output the current state of the AST as code at any point in time. (The tool I'm embedding the language in needs to be able to save.) I am using StringTemplate to generate the code from my definition objects. I think I should be able to make ST render the parse tree.
In general, I need to be able to lex, parse, evaluate, refactor, evaluate, and generate code all from within my runtime.
Rather than create my own definition objects, I'm wondering what the best approach is to modify the ParseTree/TokenStreams?
I checked out your language. It looks pretty simple, and I'm assuming it is.
From your description I'm working on the basis that the IDE will operate directly on the tree. Given that this is so you need:
A parser for your language, to convert source code into tree. ANTLR can do this, but you may need to build your own tree rather than rely on what is provided. Writing your own parser is not that hard.
Tree rewriting rules. A series of permitted transformations of the tree that ensure it remains valid. This can be in the form of an API that operates on the tree. This http://www.program-transformation.org/Transform/TreeRewriting may be helpful.
Source code generation. The ability to reconstruct source code from the tree. If you need to preserve comments and formatting then either (a) include them in the grammar (b) merge generated source code with authored code.
The end result is that you can round-trip any change in either the source code or the generated tree.
If you think this is a lot of work, it is. However, it is the basis for many other tools that aim to provide these facilities. The saving grace is that your language is simple.
We have no means to alter the parse tree safely at the moment. It's best to derive a new version from the old. You should also look at the token stream rewrite engine thingie (TokenStreamRewriter).
Ter
The short answer to your question is yes, but as you surmise, it will be painful and fragile.
A far more workable alternative, unless you have a hard requirement explicitly to modify the 'internal' representation of the source code/parse tree, is to simply re-parse the modified source code between (virtually) every keystroke. Just grab the current contents of the active editor as a String and feed it as a stream to the Lexer.
While this may sound like an expensive operation, ANTLR is actually quite fast. Eclipse in particular works well with this approach: I have used it with a number of DSL editors without any noticeable impact on editor performance. The parse occurs entirely on a background thread. Editor problemMarkers are only updated when there is a sufficient pause in the foreground editing thread. NetBeans should be similar.
What is a better way to parse large XML data, which is essentially a collection of XML records, in Java and Java-based frameworks? We get data from a web service call which runs to many MB (typically 25 MB+). This data essentially corresponds to a marshalled list of objects. My objective is to create the list of objects from the XML.
I tried using the SAX parser and it takes a good 45 seconds to parse these 3000 objects.
What are the other recommended approaches?
Try pull parsing instead, use StAX?
First search hit on comparing:
http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html
Have you profiled and seen where the bottlenecks are?
StAX is built into Java (since Java 6), but some recommend the Woodstox StAX implementation for even better performance. I have not tried it though. http://woodstox.codehaus.org/
I tried using the SAX parser and it takes a good 45 seconds to parse
these 3000 objects. What are the other recommended approaches?
There are only the following options:
DOM
SAX
StAX
SAX is the fastest (see SAX vs DOM vs StAX), so if you switch to a different style, I don't think you'll get any benefit, unless you are doing something wrong now.
Of course there are also the marshalling/unmarshalling frameworks such as JAXB, but IMO (I have not done any measurements) they could be slower since they add an extra layer of abstraction on top of the XML processing.
SAX doesn't provide random access to the structure of the XML file, which is what lets it be a relatively fast and efficient method of parsing. Because the SAX parser deals with only one element at a time, implementations can be extremely memory-efficient, which often makes it the best choice for dealing with large files.
Parsing 25 MB of XML should not take 45 seconds. There is something else going on. Perhaps most of the time is spent waiting for an external DTD to be fetched from the web; I don't know. Before changing your approach, you need to understand where the costs are coming from and therefore what part of the system will benefit from changes.
However, if you really do want to convert the XML into Java objects (not the application architecture I would choose, but never mind), then JAXB sounds a good bet. I haven't used JAXB much since I prefer to stick with XML-oriented languages like XSLT and XQuery, but when I did try JAXB I found it pretty fast. Of course it uses a SAX or StAX parser underneath.
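One way to test the external-DTD theory mentioned above is to stop the SAX parser from fetching anything, then compare timings. The sketch below does this with a standard EntityResolver that answers every external entity request with empty content. The class names are invented for the example, and it assumes the document doesn't depend on entities or defaults declared in that DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: SAX parse that never goes to the network for a DTD.
class NoDtdFetchSketch {

    static class CountingHandler extends DefaultHandler {
        int elements;

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Supply empty content for any external entity, including the external DTD subset.
            return new InputSource(new StringReader(""));
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
        }
    }

    // Parses the text and returns how many elements were seen.
    static int countElements(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        CountingHandler handler = new CountingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.elements;
    }
}
```

If the parse time drops sharply with this handler in place, the DTD fetch was the cost. If not, a profiler is the next step.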
I am storing data for ranking users in XML documents - one row per user - containing a 36 char key, score, rank, and username as attributes.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Ranks [<!ELEMENT Rank ANY ><!ATTLIST Rank id ID #IMPLIED>]>
<Ranks>
..<Rank id="<userKey>" score="36.0" name="John Doe" rank="15"></Rank>..
</Ranks>
There are several such documents which are parsed on request using a DOM parser and kept in memory until the file is updated. This happens from within a HttpServlet which is backing a widget. Every time the widget is loaded it calls the servlet with a get request which then requires one of the documents to be queried. The queries on the documents require the following operations:
Look up - finding a particular ID
Iterate through each Rank element and get the id attribute
In my test environment the number of users is <100 and everything works well. However, we are soon supposed to be delivering to a system with 200K+ users. I have serious concerns about the scalability of my approach, i.e. OutOfMemoryError!
I'm stuck for ideas for an implementation which balances performance and memory usage. While DOM is good for find operations it may choke because of the large size. I don't know much about StAX, but from what I have read it seems that it might solve the memory issue but could really slow down the queries as I will have to effectively iterate through the document to find the element of interest (Is that correct?).
Questions:
Is it possible to use StAX for multiple find (like getElementById) operations on large documents quick enough to serve an HttpRequest?
What is the maximum file size that a DOM Parser can handle?
Is it possible to estimate how much memory per user would be used for an XML document with the above structure?
Thanks
Edit: I am not allowed to use databases.
Edit: Would it be better/neater to use a custom formatted file instead and use Regular expressions to search the file for the required entry?
It sounds like you're using the xml document as a database. I think you'll be much happier using a proper database for this, and importing/exporting to xml as needed. Several databases work well, so you might as well use one that's well supported, like mysql or postgresql, although even sqlite will work better than xml.
In terms of SAX parsing, you basically build a large state machine that handles various events that occur while parsing (entering a tag, leaving a tag, seeing data, etc.). You're then on your own to manage memory (recording the data you see depending on the state you're in), so you're correct that it can have a better memory footprint, but running a query like that for every web request is ridiculous, especially when you can store all your data in a nice indexed database.
One of the big problems here is that DOM is not thread-safe, so even read operations need to be synchronized. From that point of view, using JDOM or XOM would definitely be better.
The other issue is the search strategy used to find the data. You really want the queries to be supported by indexing rather than using serial search. In fact, you need a decent query optimizer to generate efficient access paths. So given your constraint of not using a database, this sounds like a case for an in-memory XQuery engine with aggressive optimization, for which the obvious candidate is Saxon-EE. But then I would say that, wouldn't I?
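For the two operations listed in the question (look up by id, iterate over all ids), a much simpler stand-in for a query engine is a hash index built in one streaming pass. The sketch below is that swapped-in technique, not an XQuery solution. RankIndexSketch and its Rank class are invented names, and the attribute names are taken from the sample document in the question.

```java
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: one StAX pass builds an id -> Rank index instead of a DOM tree.
class RankIndexSketch {

    static final class Rank {
        final String id;
        final String name;
        final double score;
        final int rank;

        Rank(String id, String name, double score, int rank) {
            this.id = id;
            this.name = name;
            this.score = score;
            this.rank = rank;
        }
    }

    // Reads every <Rank> element and indexes it by its id attribute.
    // LinkedHashMap keeps document order, so iterating the key set visits ids in file order.
    static Map<String, Rank> load(Reader source) throws XMLStreamException {
        Map<String, Rank> byId = new LinkedHashMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(source);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "Rank".equals(reader.getLocalName())) {
                    String id = reader.getAttributeValue(null, "id");
                    byId.put(id, new Rank(
                            id,
                            reader.getAttributeValue(null, "name"),
                            Double.parseDouble(reader.getAttributeValue(null, "score")),
                            Integer.parseInt(reader.getAttributeValue(null, "rank"))));
                }
            }
        } finally {
            reader.close();
        }
        return byId;
    }
}
```

The map is only written during load, so if it is published safely (for example swapped in through a volatile field when the file changes), concurrent servlet threads can read it without the synchronization DOM would need. Memory per user is then roughly four small fields plus a map entry, which is easy to measure with a test file.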
For heavy-duty XML processing, VTD-XML is the most efficient option available; it is far more efficient than JDOM, DOM4J or DOM. The key is the non-object-oriented approach of its infoset modeling. It is also far less likely to cause an out-of-memory error. Read this 2013 paper for a comprehensive comparison/benchmark of various XML frameworks:
Processing XML with Java – A Performance Benchmark
I currently have a Java SAX parser that is extracting some info from a 30GB XML file.
Presently it is:
reading each XML node
storing it into a string object,
running some regexes on the string
storing the results to the database
For several million elements. I'm running this on a computer with 16GB of memory, but the memory is not being fully utilized.
Is there a simple way to dynamically 'buffer' about 10gb worth of data from the input file?
I suspect I could manually take a 'producer' 'consumer' multithreaded version of this (loading the objects on one side, using them and discarding on the other), but damnit, XML is ancient now, are there no efficient libraries to crunch em?
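For reference, the hand-rolled producer/consumer arrangement mentioned above doesn't need much code. Here is a minimal sketch with a bounded BlockingQueue. PipelineSketch, the sentinel value and the work function are all invented for the example. In the real program the producer side would be the SAX handler and the work function would be the regex and database step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical example: one producer thread feeds a bounded queue, one consumer drains it.
class PipelineSketch {

    // Unique sentinel object, compared by identity, that tells the consumer to stop.
    private static final String END = new String("<<end>>");

    static List<String> run(Iterable<String> nodes, Function<String, String> work, int capacity)
            throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        List<String> results = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                for (String item = queue.take(); item != END; item = queue.take()) {
                    results.add(work.apply(item));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Producer side: put() blocks when the queue is full, which caps memory use.
        for (String node : nodes) {
            queue.put(node);
        }
        queue.put(END);
        consumer.join(); // join() makes the consumer's writes to results visible here
        return results;
    }
}
```

The queue capacity is the "buffer": it bounds how far the parser can run ahead of the slower stage.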
Just to cover the bases, is Java able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -Xmx10g (or however much memory you want to allocate to it).
It is highly unlikely memory is a limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO or CPU bound. Most likely, it'll be IO. If it is IO, make sure you're buffering your streams, and then you're pretty much done; the only thing left to do is buy a faster hard drive.
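Buffering the input is a one-line change. The sketch below wraps the file stream in a BufferedInputStream before handing it to the SAX parser. The class name and the 1 MiB buffer size are arbitrary choices for illustration, and whether a larger buffer helps is something to measure, since the parser does some buffering of its own.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: feed the SAX parser from a buffered file stream.
class BufferedParseSketch {

    // Arbitrary starting point; tune by measurement.
    static final int BUFFER_BYTES = 1 << 20;

    static void parse(Path file, DefaultHandler handler)
            throws ParserConfigurationException, SAXException, IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), BUFFER_BYTES)) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        }
    }
}
```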
If you really are CPU-bound, it's possible that you're bottlenecking at regex rather than XML parsing.
See this (which references this)
If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:
StAX (there are multiple implementations; Woodstox is one of the fastest)
Javolution
Roll your own using JFlex
Roll your own ad hoc, e.g. using regex
For the last two, the more constrained is your XML subset, the more efficient you can make it.
It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.
First, try to find out what's slowing you down.
How much faster is the parser when you parse from memory?
Does using a BufferedInputStream with a large size help?
Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by the speed of this. Can you distribute the load to several machines, maybe by using something like Hadoop?
No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...
SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that is relevant to that element, rather than to the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding it.
I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that
using XML was wrong for the data stored
you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)
Apart from that: XML is not ancient and in massive and active use. What do you think all those interactive web sites are using for their interactive elements?
Are you being slowed down by multiple small commits to your db? It sounds like your program would be writing to the db almost all the time, so making sure you don't commit too often could improve performance. Preparing your statements and other standard bulk-processing tricks could also help.
Other than this early comment, we need more info. Do you have a profiler handy that can show what makes things run slowly?
You can use the JiBX library, and bind your XML "nodes" to objects that represent them. You can even subclass ArrayList so that, once x objects have been added, it performs the regexes all at once (presumably using the method on your object that performs this logic) and saves the results to the database before allowing the "add" method to return.
JiBX is hosted on SourceForge: JiBX
To elaborate: you can bind your XML as a "collection" of these specialized String holders. Because you define this as a collection, you must choose what collection type to use. You can then specify your own ArrayList implementation.
Override the add method as follows (ArrayList.add returns boolean, so the override must too):
public boolean add(Object o) {
    boolean added = super.add(o);
    // Once the buffer grows past the threshold, process and store the batch.
    if (size() > YOUR_DEFINED_THRESHOLD) {
        flushObjects();
    }
    return added;
}
YOUR_DEFINED_THRESHOLD is how many objects you want to store in the ArrayList before it is flushed out to the database. flushObjects() is simply the method that performs this logic (it should also empty the list afterwards so the buffer can fill again). The method will block the addition of objects from the XML file until this process is complete. However, this is OK; the overhead of the database will probably be much greater than that of file reading and parsing anyway.
I would suggest to first import your massive XML file into a native XML database (such as eXist if you are looking for open source stuff, never tested it myself), and then perform iterative paged queries to process your data small chunks at a time.
You may want to try StAX instead of SAX; I hear it's better for that sort of thing (I haven't used it myself).
If the data in the XML is order independent, can you multi-thread the process to split the file up or run multiple processes starting in different locations in the file? If you're not I/O bound that should help speed it along.