Which parser (Java) would you recommend for parsing GPX data?
I'm looking for one that is very intuitive to use and does not need too much RAM (it seems that DOM requires too much, doesn't it?). I have no experience with parsing XML, so it is time for me to learn this ;-)
My documents are not very large and are always read twice (a point for DOM), but I want to keep as little as possible in RAM.
What would you do in this situation? Which one would you choose, and why?
Unless you have a special reason to use a third-party library for XML parsing, I'd just use the standard Java API. See the package javax.xml.parsers. Assuming you have the XML in a file, you can parse it into an org.w3c.dom.Document (also part of Java's standard API) like this:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new File(filename));
Since your files are not so large, using DOM would be the easiest and most obvious choice.
You can use the methods of the Document object and related objects (see the classes in the org.w3c.dom package) to get at the data, or use XPath (see the package javax.xml.xpath).
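To make the DOM plus XPath route a bit more concrete, here is a minimal sketch. The class name GpxDomExample and the tiny inline GPX sample are made up for illustration, and the sample leaves out the namespace declaration that real GPX files carry, so treat this as a starting point rather than a finished reader:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class GpxDomExample {

    /** Returns "lat,lon" for every trkpt element found in the given GPX text. */
    static List<String> trackPoints(String gpx) {
        try {
            // Parse the whole document into an in-memory DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(gpx)));

            // Select all track points with XPath, wherever they sit in the tree.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//trkpt", doc, XPathConstants.NODESET);

            List<String> points = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element pt = (Element) nodes.item(i);
                points.add(pt.getAttribute("lat") + "," + pt.getAttribute("lon"));
            }
            return points;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read GPX", e);
        }
    }

    public static void main(String[] args) {
        // Simplified sample without the GPX namespace (an assumption of this sketch).
        String gpx = "<gpx version=\"1.1\"><trk><trkseg>"
                + "<trkpt lat=\"48.1\" lon=\"11.5\"/>"
                + "<trkpt lat=\"48.2\" lon=\"11.6\"/>"
                + "</trkseg></trk></gpx>";
        System.out.println(trackPoints(gpx));
    }
}
```

Since the document stays in memory after parsing, reading it a second time just means running another query against the same Document.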
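JAXB, which Quotidian mentions, has also been part of the standard API since Java 6 (it was removed from the JDK again in Java 11, so newer versions need it as a separate dependency), but it might be a bit more work to set up than using the standard DOM API.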
I would suggest XPP,
http://www.extreme.indiana.edu/xgws/xsoap/xpp/
It's more efficient than DOM and easier to use than SAX.
StAX is another popular pull parser,
http://stax.codehaus.org/Home
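For comparison, a pull parser could look roughly like this. The sketch uses the StAX API that ships with the JDK (javax.xml.stream) rather than XPP; the class name GpxStaxExample and the element and attribute names are assumptions taken from a typical GPX track:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class GpxStaxExample {

    /** Returns "lat,lon" for every trkpt element, reading the input front to back. */
    static List<String> trackPoints(String gpx) {
        List<String> points = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(gpx));
            // The client asks for the next event; nothing is kept once we move on.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "trkpt".equals(reader.getLocalName())) {
                    // A null namespace means "match the attribute by local name only".
                    points.add(reader.getAttributeValue(null, "lat")
                            + "," + reader.getAttributeValue(null, "lon"));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read GPX", e);
        }
        return points;
    }
}
```

Only the list of extracted points is held in memory, so reading the file twice means opening a second reader.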
Depending on what you want to do with the XML, JAXB might be a possibility. The idea is to convert XML into POJOs without messing (directly) with the parsers. (Although one still can mess with the parsers if needed.) Plus I've found that since JAXB is in the javax root package it tends to play nicer with standards.
I'm a big fan of Apache Digester since it lets you define translation rules and go straight to Java objects. I think that JAXB does something similar.
I would start with an IDE like Oxygen.
I have to generate huge and quite complex XML files with Java. I have to fetch the data from an Oracle database. What I really don't know is a proper and reliable way to do this. I could of course create a String and concatenate all the tags, attributes and data, but it doesn't feel right. I guess this is quite a common task and there are many established ways to do it in Java. My question is: what is the best way to do this? What is your suggestion?
Thank you for any clues...
You could use JAXB for building XML out of structured objects that are a result of querying your data store.
If your object hierarchy is not complex, you can use Oracle's capability to generate results in XML.
There are several options for object to xml transformation.
jaxb
saxparser
dom parser
I would personally suggest JAXB for ease of use and a SAX parser for performance-centric applications.
You can use JAXP (Java API for XML Processing) to create an XML structure. It has all the features you are asking for.
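Another option for very large output, not named in the answers above, is the JDK's streaming writer (javax.xml.stream.XMLStreamWriter), which emits the document piece by piece instead of building it in memory. Here is a rough sketch; the class name StreamingXmlExport, the element names and the String[] rows standing in for a database result set are all assumptions for the example:

```java
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of any library.
final class StreamingXmlExport {

    /** Writes one customer element per row; each row is {id, name}. */
    static void write(Iterable<String[]> rows, Writer out) {
        try {
            XMLStreamWriter xml = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("customers");
            for (String[] row : rows) {
                // Each row is written and forgotten, so memory use stays flat.
                xml.writeStartElement("customer");
                xml.writeAttribute("id", row[0]);
                xml.writeCharacters(row[1]); // special characters are escaped for us
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
    }

    /** Convenience wrapper for small inputs and tests. */
    static String toXml(List<String[]> rows) {
        StringWriter out = new StringWriter();
        write(rows, out);
        return out.toString();
    }
}
```

In a real export the Writer would wrap a file or network stream and the rows would come straight from the JDBC ResultSet as you iterate over it.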
I am parsing XML Document using SAX Parser.
I want to know which is better and faster to work with DOM, SAX Parser or XMLPullParser.
It depends on what you are doing. If you have very large files, you should use a SAX parser: it fires events and then releases them, so nothing is stored in memory. The downside is that with a SAX parser you can't access elements in random order, and there is no going back. DOM, on the other hand, lets you access any part of the XML file, because it keeps the whole document in memory. Hope this answers your question.
If you want to know which parser is fastest: Xerces is going to be the fastest you'll find, and a SAX parser should give you better performance than DOM.
The SAX XML parser is already available in the Android SDK:
http://developer.android.com/reference/org/xml/sax/XMLReader.html
so it is easy to access.
One aspect by which different kinds of parsers can be classified is whether they need to load the entire XML document into memory up front. Parsers based on the Document Object Model (DOM) do that: they parse XML documents into a tree structure, which can then be traversed in-memory to read its contents. This allows you to traverse a document in arbitrary order, and gives rise to some useful APIs that can be slapped on top of DOM, such as XPath, a path query language that has been specifically designed for extracting information from trees. Using DOM alone isn’t much of a benefit because its API is clunky and it’s expensive to always read everything into memory even if you don’t need to. Hence, DOM parsers are, in most cases, not the optimal choice to parse XML on Android.
There is a class of parsers that don’t need to load a document up front. These parsers are stream-based, which means they process an XML document while still reading it from the data source (the Web or a disk). This implies that you do not have random access to the XML tree as with DOM because no internal representation of the document is being maintained. Stream parsers can be further distinguished from each other. There are push parsers that, while streaming the document, will call back to your application when encountering a new element. SAX parsers fall into this class. Then there are pull parsers, which are more like iterators or cursors: here the client must explicitly ask for the next element to be retrieved.
Source: Android in Practice.
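To illustrate the push style described in that quote, here is a small SAX sketch in which the parser calls back into a handler for each event. The class name SaxPushExample is made up, and the handler assumes that elements with the requested name are not nested inside each other:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
final class SaxPushExample {

    /** Returns the trimmed text of every element with the given name, in document order. */
    static List<String> textOf(String xml, String elementName) {
        List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside a matching element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals(elementName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so append.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current != null && qName.equals(elementName)) {
                    result.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            // The parser drives the loop and pushes events into the handler.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return result;
    }
}
```

With a pull parser the loop would live in your own code instead of in the handler callbacks.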
The input file contains thousands of transactions in XML format and is around 10GB in size. The requirement is to pick each transaction XML based on the user input and send it to the processing system.
The sample content of the file
<transactions>
<txn id="1">
<name> product 1</name>
<price>29.99</price>
</txn>
<txn id="2">
<name> product 2</name>
<price>59.59</price>
</txn>
</transactions>
The (technical) user is expected to give the input tag name, like <txn>.
We would like to make this solution more generic. The file content might be different, and users can give an XPath expression like "//transactions/txn" to pick individual transactions.
There are a few technical things we have to consider here:
The file can be in a shared location or FTP
Since the file size is huge, we can't load the entire file in JVM
Can we use a StAX parser for this scenario? It has to take an XPath expression as input and pick/select the transaction XML.
Looking for suggestions. Thanks in advance.
If performance is an important factor, and/or the document size is large (both of which seem to be the case here), the difference between an event parser (like SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document prior to evaluating the XPath expression. [It's interesting to note that all Java Document Object Model implementations like the DOM or Axiom use an event processor (like SAX or StAX) to build the in-memory representation, so if you can ever get by with only the event processor you're saving both memory and the time it takes to build a DOM.]
As I mentioned, the XPath implementation in the JDK operates upon a W3C DOM Document. You can see this in the Java JDK source code implementation by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where prior to the evaluate() method being called the parser must first parse the source:
Document document = getParser().parse( source );
After this your 10GB of XML will be represented in memory (plus whatever overhead) — probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a really strong justification for an XPath (except perhaps programming elegance). The same would be true for the XProc suggestion: this would also build a DOM. If you truly need a DOM you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API and builds its DOM over StAX, so it's fast, and uses Jaxen for its XPath implementation. Jaxen requires some kind of DOM (W3C DOM, DOM4J, or JDOM). This will be true of all XPath implementations, so if you don't truly need XPath sticking with just the events parser would be recommended.
SAX is the older streaming API; StAX is newer and a great deal faster. Using either the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating an XML event filter that first matches on element type name (to capture your <txn> elements). This will create small bursts of events (element, attribute, text) that can be checked for your matching user values. Upon a suitable match you could either pull the necessary information from the events or pipe the bounded events to build a mini-DOM from them if you found the result was easier to navigate. But it sounds like that might be overkill if the markup is simple.
This would likely be the simplest, fastest possible approach and avoid the memory overhead of building a DOM. If you passed the names of the element and attribute to the filter (so that your matching algorithm is configurable) you could make it relatively generic.
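A rough sketch of that kind of configurable filter, using the JDK's cursor-style StAX reader, might look like the following. The class name TxnStreamFilter, the "@" prefix for attribute keys and the Consumer callback are illustrative choices of mine, and the sketch assumes that the children of a matched element contain only text, as in the sample file:

```java
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class TxnStreamFilter {

    /** Streams the input and hands every element named elementName to the sink. */
    static void forEachMatch(Reader in, String elementName, Consumer<Map<String, String>> sink) {
        try {
            XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(in);
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(r.getLocalName())) {
                    sink.accept(readRecord(r));
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
    }

    /** Reads attributes ("@name" keys) and text-only children of the current element. */
    private static Map<String, String> readRecord(XMLStreamReader r) throws XMLStreamException {
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < r.getAttributeCount(); i++) {
            record.put("@" + r.getAttributeLocalName(i), r.getAttributeValue(i));
        }
        // Child end tags are consumed by getElementText(), so the first
        // END_ELEMENT seen by this loop closes the matched element itself.
        while (r.next() != XMLStreamConstants.END_ELEMENT) {
            if (r.isStartElement()) {
                String child = r.getLocalName();
                record.put(child, r.getElementText().trim());
            }
        }
        return record;
    }
}
```

Only one transaction is held in memory at a time, so the sink can apply the user's matching values and forward the hits to the processing system without the 10GB file ever being loaded.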
StAX and XPath are very different things. StAX lets you parse a streaming XML document in the forward direction only. XPath lets you navigate a document in both directions. StAX is a very fast streaming XML parser, but if you want XPath, Java has a separate library for that.
Take a look at this question for a very similar discussion: Is there any XPath processor for SAX model?
We regularly parse 1GB+ complex XML files by using a SAX parser which does exactly what you described: It extracts partial DOM trees that can be conveniently queried using XPATH.
I blogged about it here. It uses a SAX parser rather than a StAX parser, but it may be worth a look.
It's definitely a use case for XProc with a streaming and parallel processing implementation like QuiXProc (http://code.google.com/p/quixproc)
In this situation, you will have to use
<p:for-each>
<p:iteration-source select="//transactions/txn"/>
<!-- your processing on a small file -->
</p:for-each>
You can even wrap each of the resulting transformations with a single line of XProc
<p:wrap-sequence wrapper="transactions"/>
Hope this helps
A fun solution for processing huge XML files >10GB.
Use ANTLR to create byte offsets for the parts of interest. This will save some memory compared with a DOM based approach.
Use JAXB to read the parts from their byte positions.
Find the details, using Wikipedia dumps (17GB) as an example, in this SO answer: https://stackoverflow.com/a/43367629/1485527
Streaming Transformations for XML (STX) might be what you need.
Do you need to process it fast, or do you need fast lookups in the data? These requirements need different approaches.
For fast reading of the whole data, StAX will be OK.
If you need fast lookups, then you may need to load it into some database, e.g. Berkeley DB XML.
I find myself writing the same verbose DOM manipulation code again and again:
Element e1 = document.createElement("some-name");
e1.setAttribute("attr1", "val1");
e1.setAttribute("attr2", "val2");
document.appendChild(e1);
Element e2 = document.createElement("some-other-name");
e1.appendChild(e2);
// Etc, the same for attributes and finding the nodes again:
Element e3 = (Element) document.getElementsByTagName("some-other-name").item(0);
Now, I don't want to switch architecture altogether, i.e. I don't want to use JDOM, JAXB, or anything else. Just Java's org.w3c.dom. The reasons for this are
It's about an old and big legacy system
The XML is used in many places and XSLT transformed several times to get XML, HTML, PDF output
I'm just looking for convenience, not a big change.
I'm just wondering if there is a nice wrapper library (e.g. with apache commons or google) that allows me to do things like this with a fluent style similar to jRTF:
// create a wrapper around my DOM document and manipulate it:
// like in jRTF, this code would make use of static imports
dom(document).add(
element("some-name")
.attr("attr1", "val1")
.attr("attr2", "val2")
.add(element("some-other-name")),
element("more-elements")
);
and then
Element e3 = dom(document).findOne("some-other-name");
The important requirement I have here is that I explicitly want to operate on a org.w3c.dom.Document that
already exists
is pretty big
needs quite a bit of manipulation
So transforming the org.w3c.dom.Document into JDOM, dom4j, etc seems like a bad idea. Wrapping it with adapters is what I'd prefer.
If it doesn't exist, I might roll my own, as this jRTF syntax looks really nice! And for XML, it seems quite easy to implement, as there are only few node types. This could become as powerful as jquery from the fluent API perspective!
To elaborate my comment, Dom4J gets you pretty close to what you wanted:
final Document dom = DocumentHelper.createDocument().addElement("some-name")
.addAttribute("attr1", "val1")
.addAttribute("attr2", "val2")
.addElement("some-other-name").getDocument();
System.out.println(dom.asXML());
Output:
<?xml version="1.0" encoding="UTF-8"?>
<some-name attr1="val1" attr2="val2"><some-other-name/></some-name>
I know it's not native DOM, but it's very similar and it has very nice features for Java developers (element iterators, live element lists etc.)
I found some tools that roughly do what I asked for in my question:
http://code.google.com/p/xmltool/
http://jsoup.org/
However, in the meantime, I am more inclined to roll my own. I'm really a big fan of jQuery, and I think jQuery can be mapped to a Java fluent API:
http://www.jooq.org/products/jOOX
Well, this is maybe silly but why don't you implement that little API on your own? I'm sure you know DOM API pretty well and it won't take much time to implement what you want.
By the way, consider using XPath to find the nodes you want to manipulate in the document (you can also build your mini-API on top of it).
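For what it's worth, a hand-rolled wrapper along the lines of the syntax in the question does not need much code. The following is only a sketch: FluentDom, Spec and newDocument are invented names, and appending to the root element when one exists (instead of to the Document itself) is an assumption I made so that several elements can be added at once:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical wrapper around an existing org.w3c.dom.Document.
final class FluentDom {
    private final Document doc;

    private FluentDom(Document doc) { this.doc = doc; }

    static FluentDom dom(Document doc) { return new FluentDom(doc); }

    static Spec element(String name) { return new Spec(name); }

    /** Builds the described elements and appends them to the root element (or the empty document). */
    FluentDom add(Spec... specs) {
        Node parent = doc.getDocumentElement() != null ? doc.getDocumentElement() : doc;
        for (Spec spec : specs) {
            parent.appendChild(spec.build(doc));
        }
        return this;
    }

    /** Returns the first element with the given tag name, or null if there is none. */
    Element findOne(String tagName) {
        return (Element) doc.getElementsByTagName(tagName).item(0);
    }

    /** Test helper: creates a new document with a single root element. */
    static Document newDocument(String rootName) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            d.appendChild(d.createElement(rootName));
            return d;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Description of an element that is turned into real DOM nodes by build(). */
    static final class Spec {
        private final String name;
        private final Map<String, String> attrs = new LinkedHashMap<>();
        private final List<Spec> children = new ArrayList<>();

        private Spec(String name) { this.name = name; }

        Spec attr(String key, String value) { attrs.put(key, value); return this; }

        Spec add(Spec... kids) { children.addAll(List.of(kids)); return this; }

        Element build(Document doc) {
            Element e = doc.createElement(name);
            attrs.forEach(e::setAttribute);
            for (Spec child : children) {
                e.appendChild(child.build(doc));
            }
            return e;
        }
    }
}
```

With static imports of dom and element, the calling code reads almost exactly like the example in the question, and everything it produces is still plain org.w3c.dom, so the existing XSLT pipeline is untouched.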
I am using Java back end for creating an XML string which is passed to the browser. Currently I am using simple string manipulation to produce this XML. Is it essential that I use some XML library in Java to produce the XML string?
I find the libraries very difficult to use compared to what I need.
It's not essential, but advisable. However, if string manipulation works for you, then go for it! There are plenty of cases where small or simple XML text can be safely built by hand.
Just be aware that creating XML text is harder than it looks. Here are some criteria I would consider:
First: how much control do you have on the information that goes into the xml?
The less control you have over the source data, the more likely you are to have trouble, and the more advantageous the library becomes. For example: (a) Can you guarantee that the element names will never have a character that is illegal in a name? (b) How about quotes in an attribute's content? Can they happen, and are you handling them? (c) Does the data ever contain anything that might need to be encoded as an entity (like the less-than sign, which often needs to be output as &lt;); are you doing it correctly?
Second, maintainability: is the code that builds the XML easy to understand by someone else?
You probably don't want to be stuck with the code for life. I've worked with second-hand C++ code that hand-builds XML and it can be surprisingly obscure. Of course, if this is a personal project of yours, then you don't need to worry about "others": substitute "in a year" for "others" above.
I wouldn't worry about performance. If your XML is simple enough that you can hand-write it, any overhead from the library is probably meaningless. Of course, your case might be different, but you should measure to prove it first.
Finally, yes: you can build XML text by hand if it's simple enough, but not knowing the available libraries is probably not the right reason to do so.
A modern XML library is a quite powerful tool, but it can also be daunting. However, learning the essentials of your XML library is not that hard, and it can be quite handy; among other things, it's almost a requisite in today's job marketplace. Just don't get bogged down by namespaces, schemas and other fancier features until you get the essentials.
Good luck.
XML is hard. Parsing it yourself is a bad idea, and generating content yourself is an even worse one. Have a look at the XML 1.1 spec.
You have to deal with such things as proper encoding, attribute encoding (e.g., an unescaped quote inside an attribute value produces invalid XML), proper CDATA escaping, UTF encoding, custom DTD entities, and that's without throwing XML namespaces into the mix, with the default / empty namespace, namespace attributes, etc.
Learn a toolkit, there's plenty available.
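To see how quickly hand-built markup goes wrong, here is a small comparison. It is only a sketch: the class EscapingDemo and its method names are made up, and the JDK's XMLStreamWriter stands in for whichever toolkit you end up choosing:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any library.
final class EscapingDemo {

    /** Naive concatenation: breaks as soon as the data contains ", < or &. */
    static String byHand(String label, String text) {
        return "<item label=\"" + label + "\">" + text + "</item>";
    }

    /** Same element written through the library, which escapes the data. */
    static String withWriter(String label, String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
            w.writeStartElement("item");
            w.writeAttribute("label", label);
            w.writeCharacters(text);
            w.writeEndElement();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if a standard parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String label = "say \"hi\"";
        String text = "a < b & c";
        System.out.println(isWellFormed(byHand(label, text)));     // false
        System.out.println(isWellFormed(withWriter(label, text))); // true
    }
}
```

The hand-built version works fine until the first quote, ampersand or less-than sign shows up in the data, which is usually in production.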
I think that custom string manipulation is fine, but you have to keep two things in mind:
Your code isn't as mature as the library. Allocate time in your plan to handle the bugs that pop up.
Your approach will probably not scale as well as a 3rd party library when the xml starts to grow (both in terms of performance and ease of use).
I know a code base that uses custom string manipulation for xml output (and a 3rd party library for input). It was fine to begin with but became a real hassle after a while.
Yes, use the library.
Somebody took the time and effort to create something that is usually better than what you could come up with. String manipulation is for sending back a single node, but once you start needing to manipulate the DOM, or use an XPath query, the library will save you.
By not using a library, you risk generating or parsing data that isn't well-formed, which sooner or later will happen. For the same reason document.write isn't allowed in XHTML, you shouldn't write your XML markup as a string.
Yes.
It makes no sense to skip an essential tool: even writing XML is non-trivial, with having to escape those ampersands and less-than signs, not to mention namespace bindings (if needed).
And in the end libs can generally read and write xml not only more reliably but more efficiently (esp. so for Java).
But you may have been looking at the wrong tools, if they seem overcomplicated. Data binding using JAXB or XStream is simple; but for simple, straightforward XML output, I go with StaxMate. It can actually simplify the task in many ways (it automatically closes start tags, writes namespace declarations if needed, etc).
No - If you can parse it yourself (as you are doing), and it will scale for your needs, you do not need any library.
Just ensure that your future needs are going to be met - complex xml creation is better done using libraries - some of which come in very simple flavors too.
The only time I've done something like this in production code was when a colleague and I built a pre-processor so that we could embed XML fragments from other files into a larger XML document. On load we would first parse these embeds (file references in XML comment strings) and replace them with the actual fragments they referenced. Then we would pass the combined result on to the XML parser.
You don't have to use a library to parse XML, but check out this question, "What considerations should be made before reinventing the wheel?", before you start writing your own code for parsing/generating XML.
No - especially for generating (for parsing I would be less inclined to, as input text can always surprise you). I think it's fine - but be prepared to shift to a library should you find yourself spending more than a few minutes maintaining your own code.
I don't think that using the DOM XML API which comes with the JDK is difficult. It's easy to create Element nodes, attributes, etc., and later it is easy to convert strings into a DOM document or DOM documents into a String.
This is from the first page Google finds from Spain (a Spanish XML example):
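import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// The wrapper class name is arbitrary; the original snippet only showed the two methods.
public class DomStrings {
    // Error code and message of the last failed call.
    private int coderror;
    private String msgerror;

    /** Serializes a DOM document to a String; returns null on failure. */
    public String DOM2String(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (TransformerException error) {
            // Also covers TransformerConfigurationException, which is a subclass.
            coderror = 123;
            msgerror = error.getMessage();
            return null;
        }
    }

    /** Parses a String into a DOM document; returns null on failure. */
    public Document string2DOM(String s) {
        DocumentBuilder builder;
        try {
            builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException error) {
            coderror = 10;
            msgerror = "Error creating factory String2DOM " + error.getMessage();
            return null;
        }
        try {
            // Reading from a StringReader avoids the platform-dependent s.getBytes().
            return builder.parse(new InputSource(new StringReader(s)));
        } catch (SAXException error) {
            coderror = 10;
            msgerror = "SAX parse error String2DOM " + error.getMessage();
            return null;
        } catch (IOException error) {
            coderror = 10;
            msgerror = "Error reading input String2DOM " + error.getMessage();
            return null;
        }
    }
}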