XML parsing difference - Java

What is the difference between StAX and SAX parsing in Java?
Can someone explain it as simply as possible? I don't understand what it means that one is pulling data and the other is pushing.

"Push" and "Pull" refer to the style of coding that is used.
For "Push," you register a "handler" that the parser calls as it works its way through the document. So, you register your handlers with the parser and then tell it to parse the document. Your handlers will be called by the parser to tell your code when an element is starting, ending, etc.
For "Pull," your code is driving the step-by-step process of parsing the document. It is like getting an Iterator for the document and your code is going to loop and ask for the next element from the parser. In other words, your "handler" code is calling the parser for the next element to handle.
The different coding styles make different types of interactions with the document easier or harder. The choice of which style to use for a particular project is dependent on the requirements of that project.
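To make the two styles concrete, here is a minimal sketch that collects element names from the same document twice, once with SAX (push) and once with StAX (pull). It uses only the JDK's built-in parsers; the class name PushVsPull, the method names, and the sample XML are illustrative assumptions, not part of the original answer.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class PushVsPull {
    // Push (SAX): we hand the parser a handler and the parser calls it back.
    static List<String> elementNamesWithSax(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                names.add(qName); // invoked by the parser for every start tag
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return names;
    }

    // Pull (StAX): our own loop asks the parser for the next event.
    static List<String> elementNamesWithStax(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<game><disp>Hello</disp><question/></game>"; // made-up sample
        System.out.println(elementNamesWithSax(xml));
        System.out.println(elementNamesWithStax(xml));
    }
}
```

In the SAX version the control flow belongs to the parser, so stopping early or keeping state between callbacks takes extra work. In the StAX version the loop is yours, so you can break out of it or pass the reader to another method at any point.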

Related

Java XML Reading an XML file from top to bottom

I want to read an XML starting from the top to the bottom using Java. However, I don't want to use recursive functions because I want to be able to jump to a different element and start reading from that position.
I've tried using the getParent() and indexOf() methods (all three libraries below have them) to do this, but it's gotten very messy, mainly because the methods don't distinguish between attributes and elements.
I'm sure there must be a simple way to do this, but after trying dom4j, jdom, and xom, I still have not found a solution.
[Edit] More Information:
My friend wants to make a console text-based game in a question/answer type style. Instead of hard-coding it into java, I decided to try and make it read an XML file instead, because XML has a tree-like style that would be convenient. Here is an example of what my XML file might look like:
<disp>Text to be displayed</disp>
<disp>Text to be displayed afterward</disp>
<disp>What is your favorite color?</disp>
<question>
<answer name="orange">
<disp>Good choice.</disp>
<!-- More questions and stuff -->
</answer>
<default>
<disp>Wrong. The correct answer was orange.</disp>
</default>
</question>
I don't know if it is taboo to use XML like a pseudo programming language. If anyone has other suggestions, feel free to give them.
Your design is basically good and an example of Declarative Programming. You should read your XML files using an XML parser either into a DOM or using SAX. Since I think you will want to revisit nodes I suspect you will need a DOM (FWIW I use XOM, xom.nu). One of the best examples of XML-based declarative programming is XSLT where the data and commands are all XML.
I use this model a great deal. It has the advantage that the data structure can be external and can be edited.
(Note that your XML needs a root element)
but it's gotten very messy, mainly because the methods don't
distinguish between attributes and elements.
All DOM or SAX tools differentiate very clearly between attributes and elements, so if there is confusion it is somewhere else.
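As a rough illustration of that separation, here is a sketch using the JDK's built-in DOM (org.w3c.dom) rather than XOM, dom4j, or JDOM. Child elements are reached through the sibling links, while attributes are read through a separate call. The class name GameScriptDom, the helper childElements, and the root element `<game>` are assumptions added for the example. The loop is iterative, so you can keep a reference to any Element and resume from it later.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class GameScriptDom {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns child elements only; text and comment nodes are skipped,
    // and attributes are never part of the child list in DOM.
    static List<Element> childElements(Element parent) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) n);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // <game> is a hypothetical root element wrapped around the question's snippet.
        String xml = "<game><disp>What is your favorite color?</disp>"
                + "<question><answer name=\"orange\"><disp>Good choice.</disp></answer>"
                + "<default><disp>Wrong.</disp></default></question></game>";
        Element root = parse(xml).getDocumentElement();
        Element question = childElements(root).get(1);
        for (Element branch : childElements(question)) {
            // Attributes come from getAttribute, not from the child list.
            System.out.println(branch.getTagName() + " name=" + branch.getAttribute("name"));
        }
    }
}
```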
It would be good if you would show what you want to do with the XML snippets, but normally if you want to read off of any kind of file, use java.util.Scanner.
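import java.io.File;
import java.util.Scanner;
// try-with-resources closes the scanner even if reading fails
try (Scanner scan = new Scanner(new File("file.xml"))) {
    while (scan.hasNextLine()) { // pair hasNextLine() with nextLine()
        String theData = scan.nextLine(); // one raw line of text, not parsed XML
    }
}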
This should return the values that you need until it runs out of lines to scan.
Hope it works and Happy Coding!
Since you want to read an XML file, I recommend going for a SAX parser. It is an event-based parser that is very fast and efficient for reading XML (top-down approach).
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/ explains how to use the SAX parser.
Thanks

XML query in Java?

I am new to this validation process in Java...
--> The XML file is named Validation Limits
--> Structure of the XML:
<parameter> </parameter>
<lowerLimit> </lowerLimit>
<upperLimit> </upperLimit>
<enable> </enable>
--> Depending on the enable status ('true' or 'false'), I must perform the validation for the respective parameter
--> What would be the best way to perform this operation?
I have parsed the XML with DOM (I forgot to mention this earlier) and stored the values in arrays, but this is complicated by the amount of cross-referencing from one array to another. Any better method that could replace the array approach would be helpful.
Thank you in advance.
Try using a DOM or SAX parser; they will do the parsing for you. You can find some good, free tutorials on the internet.
The difference between DOM and SAX is as follows: DOM loads the XML into a tree structure which you can browse through (i.e. the whole XML is loaded), whereas SAX parses the document and triggers events (calls methods) in the process. Both have advantages and disadvantages, but personally, for reasonably sized XML files, I would use DOM.
So, in your case: use DOM to get a tree of your XML document, locate the attribute, see other elements depending on it.
Also, you can achieve this in pure XML, using XML Schema, although this might be too much for simple needs.
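One way to avoid the parallel arrays is to load each group of values into a small object and key the objects by parameter name. The sketch below assumes a layout the question does not show: each group of four elements is wrapped in a hypothetical `<limit>` element under a single root. The class and method names are also made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ValidationLimits {
    // One object per parameter replaces several parallel arrays.
    static final class Limit {
        final double lower;
        final double upper;
        final boolean enabled;

        Limit(double lower, double upper, boolean enabled) {
            this.lower = lower;
            this.upper = upper;
            this.enabled = enabled;
        }

        // A disabled limit accepts every value.
        boolean accepts(double value) {
            return !enabled || (value >= lower && value <= upper);
        }
    }

    // Text of the first descendant element with the given tag name.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    static Map<String, Limit> load(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        Map<String, Limit> limits = new LinkedHashMap<>();
        NodeList entries = root.getElementsByTagName("limit"); // hypothetical wrapper element
        for (int i = 0; i < entries.getLength(); i++) {
            Element e = (Element) entries.item(i);
            limits.put(text(e, "parameter"), new Limit(
                    Double.parseDouble(text(e, "lowerLimit")),
                    Double.parseDouble(text(e, "upperLimit")),
                    Boolean.parseBoolean(text(e, "enable"))));
        }
        return limits;
    }
}
```

Validation then becomes a lookup followed by a single call, for example limits.get("temperature").accepts(42.0), with no index bookkeeping between arrays.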

Reading only root element in XML

In many REST-based API calls, we have a parameter called nextURL, which we can use to query for the next URL. This is usually in the root element (or maybe the next one).
In general, how do you read this? If you use a standard XML parser, it reads and loads the entire XML, and then you read nextURL with getElementsByTag. Is there a better workaround? Reading the entire XML is of course a waste of time and memory.
Edit: An example XML would be something like
<result pubisher="xyz" nextURL="http://actualurl?since_date=<newdate>">
<element>adfsaf</element>
..
</result>
I need to capture the new since_date without reading the entire XML.
Python: You could use the ElementTree iterparse method ... provided the data you want is in an attribute, which will have been parsed by the time that you get the start event. If it's in the text or tail of the element, you will have to wait until the end event. It would be a good idea if you edited your question to show what your XML looks like, and explain "or maybe in the next one" with an example.
The term "Standard XML parser" covers a lot of territory, so much so that I don't think that you can generalize their behaviors. For instance, a standard DOM parser is tree-based and will read the entire XML into memory, but a SAX parser (and I think StAX as well) won't but rather will advance as the app desires it to advance. It sounds like the latter, a SAX or StAX parser, is what you need.
Edit: Please be sure to read KitsuneYMG's comment below on the difference between SAX and StAX behaviors.
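For Java, a minimal StAX sketch of this idea follows. The application pulls events until the first start tag, reads the attribute, and then simply stops asking for more. With SAX the parser drives, so stopping early usually means throwing an exception from the handler. The class name, method name, and sample URL are assumptions for illustration. The underlying stream may still be buffered ahead, but the rest of the document is not parsed.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class RootAttributeReader {
    // Returns the named attribute of the root element, or null if it is absent.
    static String readRootAttribute(Reader source, String attributeName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(source);
        try {
            while (reader.hasNext()) {
                // The first START_ELEMENT event is always the root element.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getAttributeValue(null, attributeName);
                }
            }
            return null;
        } finally {
            reader.close(); // we never ask for the remaining events
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical response body; the URL is a placeholder.
        String xml = "<result nextURL=\"http://example.org/items?since_date=2010-01-01\">"
                + "<element>adfsaf</element></result>";
        System.out.println(readRootAttribute(new StringReader(xml), "nextURL"));
    }
}
```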

ignore some XML tags in SAX

I'm parsing an XML document using SAX in Java.
I'm working with the XML that describes research publications in different fields.
Among others, there are elements like "abstract" that briefly describe what the research paper is about. Basic HTML formatting is allowed in that field, but I don't want SAX to treat the HTML tags (like i, b, u, sub, sup and so on) as real XML tags and fire startElement() and endElement() events on those elements.
Is there a way to tell SAX to ignore some predefined set of XML tags and to pass their XML code as is to the characters() method?
I suspect not, without some work. I would perhaps slot in different SAX handlers as you encounter different elements, and push/pop them off a stack. So when you encounter an <abstract> element, you slot in a new handler that the SAX parser delegates to, and that is intelligent enough to process your HTML elements as you require. Not a trivial solution, I'm afraid.
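A simpler variant of that idea, sketched below, uses a single handler with a depth counter instead of a stack of delegate handlers. While the parser is inside `<abstract>`, the handler writes the nested tags back into a buffer as text. The class and method names are hypothetical, and for brevity the sketch drops attributes on the inline tags and does not re-escape characters such as & and <.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class AbstractCapturingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    // 0 = outside <abstract>, 1 = directly inside it, >1 = inside nested inline tags
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (depth > 0) {
            buffer.append('<').append(qName).append('>'); // re-serialize the inline tag
            depth++;
        } else if (qName.equals("abstract")) {
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 1) {
            buffer.append("</").append(qName).append('>');
        }
        if (depth > 0) {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            buffer.append(ch, start, length);
        }
    }

    String abstractMarkup() {
        return buffer.toString();
    }

    // Convenience wrapper: parse a document and return the abstract with its inline tags.
    static String extractAbstract(String xml) throws Exception {
        AbstractCapturingHandler handler = new AbstractCapturingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.abstractMarkup();
    }
}
```

The parser still fires the element events. The handler just turns them back into text, which is as close as plain SAX gets to passing the markup through to characters().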

Error-tolerant XML parsing in Scala

I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.
Update:
What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML (somehow) and deal with it the same way, otherwise I'd have to have two completely different sets of functions for dealing with documents.
Obviously because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristic involved (such as when you see <parent><child></parent> you would close the <child> first and when you then see a <child> you ignore it). But of course this isn't a proper grammar and so there's no correct way of doing it.
What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:
<parent>
<child>
</parent>
</child>
Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.
Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
Try the parser on the XHtml object. It is much more lenient than the one on XML.
Take a look at htmlcleaner. I have used it successfully to convert "HTML from the wild" to valid XML.
Try Tag Soup.
JTidy does something similar but only for HTML.
I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".
While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Being an event-based API, I am assuming it will be more fault tolerant than a DOM parser.)
There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.
I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.
I agree with the answers that turning invalid XML into "correct" XML is impossible.
Why don't you just do a regular text search for the hrefs if that's all you're interested in? One issue would be commented out links, but if the XML is invalid, it might not be possible to tell what is intended to be commented out!
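A plain text search along those lines might look like the following sketch. It is written in Java to match the other examples here, and java.util.regex can be called from Scala unchanged. The class name HrefScanner and the pattern are assumptions. The pattern handles only quoted attribute values and, as noted above, cannot tell whether a link is commented out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefScanner {
    // Matches href="..." or href='...' in any letter case; unquoted values are ignored.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> findHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 2 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        // Badly nested input that a strict XML parser would reject.
        String html = "<p><a href=\"a.html\">x<b></p><A HREF='b.html'>y";
        System.out.println(findHrefs(html));
    }
}
```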
Caucho has a JAXP compliant XML parser that is a little bit more tolerant than what you would usually expect. (Including support for dealing with escaped character entity references, AFAIK.)
Find JavaDoc for the parsers here
A related topic (with my solution) is listed below:
Scala and html parsing
