I'm looking into how I can get values from specific XML nodes in an XML file that I have. In my application, I have the entire XML file in a string, and I want to grab the specific information from there. I've heard a little bit about DOM and SAX, but I don't exactly know where to start. Any help?
One of the easiest ways is to use XPath. Here's a tutorial.
You can either use XPath (example), or you can use DOM or SAX (as you mentioned). You can view my answer here (How to retrieve element value of XML using Java?) on SO.
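Since the question has the whole XML in a String, here is a minimal sketch of the XPath route using only the JDK's built-in javax.xml.xpath API. The class name XPathFromString, the helper firstValue, and the sample XML are made up for illustration; they are not from any of the linked answers.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class XPathFromString {
    // Evaluates an XPath expression against XML held in a String and returns
    // the string value of the first matching node ("" if nothing matches).
    static String firstValue(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath expression or malformed XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample document invented for this sketch.
        String xml = "<order><number>12345</number><buyer><name>My name</name></buyer></order>";
        System.out.println(firstValue(xml, "/order/number"));     // prints 12345
        System.out.println(firstValue(xml, "/order/buyer/name")); // prints My name
    }
}
```

This parses the whole string on every call, which is fine for small documents; for repeated queries you would probably parse once into a Document and evaluate against that.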
Well, there is also XStream
http://x-stream.github.io/index.html
It lets you go in both directions (object to XML, and XML to object).
Here is the "two minutes tutorial":
http://x-stream.github.io/tutorial.html
I have an XML file with several <text> nodes. Each text node has attributes named "top" and "left" and has a child node named <textValue>. This XML file basically represents the coordinate positions of text in a PDF file that has been converted to XML using a PDF2HTML converter.
I want to parse the XML file using conditions such as:
1. Give me all the consecutive nodes in the XML file that have the same "top" attribute. Here, I am trying to get all nodes that have the same "top" attribute but may have different "left" attribute values.
Which XML parser supports these kinds of queries? I am familiar with the basic DOM parser, which just lets me iterate through the elements and access their attribute values. Is there any XML parser that allows conditional queries to be written on top of it?
Thanks
You'll want to investigate XPath, which can do exactly this. Java provides robust, built-in support for this, and can operate on top of a DOM tree. See How to read XML using XPath in Java for one example on how to get started with this.
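As a rough sketch of what such a query could look like with the JDK's XPath support on top of a DOM tree: the predicate [@top='...'] selects every <text> element with a given "top" value, whatever its "left" value is. The class name SameTopQuery, the <page> root element, and the sample values are assumptions for illustration. Note that this selects all matching nodes, not only consecutive runs; grouping consecutive nodes would need an extra loop over the result.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class SameTopQuery {
    // Returns the <textValue> contents of every <text> element whose "top"
    // attribute equals the given value, in document order.
    static List<String> textValuesAtTop(String xml, String top) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Naive concatenation: assumes "top" contains no quote characters.
            String expression = "//text[@top='" + top + "']/textValue";
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample document invented for this sketch; <page> is an assumed root element.
        String xml = "<page>"
                + "<text top=\"100\" left=\"10\"><textValue>Hello</textValue></text>"
                + "<text top=\"100\" left=\"60\"><textValue>world</textValue></text>"
                + "<text top=\"120\" left=\"10\"><textValue>Next</textValue></text>"
                + "</page>";
        System.out.println(textValuesAtTop(xml, "100")); // prints [Hello, world]
    }
}
```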
You are not looking for a parser; you need a query processor. Any XQuery-compatible processor can do that. Just use a pair of nested loops in your XQuery.
I don't know how to read data from such an XML file. Let's say I want to read every GUID and userID. How do I do it?
Here is part of XML: http://pastebin.com/7B25eyFz
If your XML file is tree-based (nested), use DOM; if it is not nested, use SAX, which is faster than DOM.
You may use XStream
Look into SAX Parser. Also, do a search for your terms - there are a ton of questions about this topic.
Have you read the trail about XML of the Java tutorial?
You should use an XML library like XOM. You can then use it to query the XML document using XPath. XOM offers a tutorial.
Adding to user651407's point: if you just want to read the XML, go for SAX. It parses the XML serially, so it is faster. If you want to do more complex operations such as adding, updating, or deleting a node, go for DOM, but DOM has limitations:
1. It requires more memory, as the entire XML is loaded at once.
2. It is slower to process, as it is a tree-based parser.
In many REST-based API calls there is a parameter called nextURL, which we can use to query for the next URL. It is usually in the root element (or maybe the next one).
In general, how do you read this? If you use a standard XML parser, it reads and loads the entire XML, and then you read nextURL with getElementsByTagName. Is there a better workaround? Reading the entire XML is of course a waste of time and memory.
Edit: An example XML would be something like
<result pubisher="xyz" nextURL="http://actualurl?since_date=<newdate>">
<element>adfsaf</element>
..
</result>
I need to capture the new since_date without reading the entire XML.
Python: You could use the ElementTree iterparse method ... provided the data you want is in an attribute, which will have been parsed by the time that you get the start event. If it's in the text or tail of the element, you will have to wait until the end event. It would be a good idea if you edited your question to show what your XML looks like, and explain "or maybe in the next one" with an example.
The term "standard XML parser" covers a lot of territory, so much so that I don't think you can generalize about their behavior. For instance, a standard DOM parser is tree-based and will read the entire XML into memory, but a SAX parser (and I think StAX as well) won't; rather, it will advance as the app desires it to advance. It sounds like the latter, a SAX or StAX parser, is what you need.
Edit: Please be sure to read KitsuneYMG's comment below on the difference between SAX and StAX behaviors.
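For the Java side, a minimal StAX sketch of this idea: pull events until the first start tag (the root element), read the attribute, and stop. The class name RootAttributeReader and the concrete since_date value in the sample are assumptions; the parser may buffer a chunk of input internally, but it does not parse the rest of the document.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
final class RootAttributeReader {
    // Returns the named attribute of the root element, or null if the root
    // element does not carry it. Parsing stops at the root start tag.
    static String rootAttribute(Reader input, String attributeName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(input);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // The first start tag in a document is the root element.
                        return reader.getAttributeValue(null, attributeName);
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the XML stream", e);
        }
    }

    public static void main(String[] args) {
        // Sample modelled on the question; the date value is invented.
        String xml = "<result publisher=\"xyz\" nextURL=\"http://actualurl?since_date=2011-01-01\">"
                + "<element>adfsaf</element></result>";
        System.out.println(rootAttribute(new StringReader(xml), "nextURL"));
    }
}
```

In a real client you would hand it the Reader or InputStream of the HTTP response rather than a String, so that the remainder of the body is never parsed.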
Possible Duplicate:
Best method to parse various custom XML documents in Java
Hi all,
I am a beginner in Java, and I hope this question is an easy one. If I have an XML file and want to parse it to get only the elements within a specific tag, how do I do that?
For example, if the XML file looks like this:
<date>2005-10-31</date>
<number>12345</number>
<purchased-by>
<name>My name</name>
<address>My address</address>
</purchased-by>
<order-items>
<item>
<code>687</code>
<type>CD</type>
<label>Some music</label>
</item>
<item>
<code>129851</code>
<type>DVD</type>
<label>Some video</label>
</item>
</order-items>
From this XML I want to parse only the elements within the order-items tag.
Is there a generic way to do this? Please let me know.
Thanks
As said in the comments, a short Google search should bring you to the Sun examples on how to do this. Basically, you have two main XML parsing methods in Java:
SAX, where you use a handler to grab only what you want from your XML and ditch the rest
DOM, which parses the whole file and lets you grab all elements in a more tree-like fashion.
Another very useful XML parsing method, albeit a little more recent than these and included in the JRE only since Java 6, is StAX. StAX was conceived as a middle ground between the tree-based approach of DOM and the event-based approach of SAX. It is quite similar to SAX in that parsing very large documents is easy, but in this case the application "pulls" information from the parser instead of the parser "pushing" events to the application. You can find more explanation on this subject here.
So, depending on what you want to achieve, you can use one of these approaches.
If you want to limit the parsing operation itself to the <order-items> element, then you'll have to use SAX. A SAX parser visits all elements of the input "file" (or stream), and your handler can ignore anything that is not <order-items> or one of its children. From the events you keep, you can then build a result containing these elements only.
If the XML documents are rather small and performance is not a limiting factor, then simply parse the whole document (that's a 2-liner) and use XPath expressions to select the correct nodes.
Use XPath. It lets you select nodes by name and by loads of other conditions. Very little code is needed to set it up.
IBM Example
It is a classic case for SAX. Register a handler that receives tags and ignores all tags other than order-items.
A better way is probably to use Apache Digester, but that is overkill for your specific task.
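To make the SAX suggestion concrete, here is a minimal sketch of such a handler using the JDK's SAX API. It keeps a flag that is true only between <order-items> and </order-items> and records "name=text" for each leaf element it sees while the flag is set. The class name OrderItemsHandler, the output format, and the <order> root wrapper in the sample are assumptions, since the fragment in the question has no root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
final class OrderItemsHandler extends DefaultHandler {
    private final List<String> entries = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean insideOrderItems = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("order-items".equals(qName)) {
            insideOrderItems = true;
        }
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insideOrderItems) {
            text.append(ch, start, length); // everything outside order-items is ignored
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if ("order-items".equals(qName)) {
            insideOrderItems = false;
        } else if (insideOrderItems && !value.isEmpty()) {
            entries.add(qName + "=" + value); // leaf element such as code, type, label
        }
        text.setLength(0);
    }

    // Parses the XML and returns the entries found inside <order-items>.
    static List<String> parse(String xml) {
        try {
            OrderItemsHandler handler = new OrderItemsHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.entries;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<order><date>2005-10-31</date><order-items><item>"
                + "<code>687</code><type>CD</type><label>Some music</label>"
                + "</item></order-items></order>";
        System.out.println(parse(xml)); // prints [code=687, type=CD, label=Some music]
    }
}
```

The parser still reads the whole input; the handler just discards whatever lies outside order-items. Mixed content (text and child elements in the same element) is not handled by this sketch.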
You can use a DOM Parser to build a Document and then extract whatever elements you want using the getElementsByTagName method.
Here is some sample code to help you get started:
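import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// parse the file and build a Document
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("file.xml"));
// get the list of elements called order-items
NodeList orderItemsNodes = doc.getElementsByTagName("order-items");
// iterate over the elements
for (int i = 0; i < orderItemsNodes.getLength(); i++) {
    Node orderItemNode = orderItemsNodes.item(i);
    // work with orderItemNode here, e.g. via orderItemNode.getChildNodes()
}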
It honestly depends on how you are planning to use the item data. If you want to parse it into objects and then work with them, I would use JAXB unmarshalling. But if you just want to pull the string values out of the code, type, and label child elements of each item element, you might consider simple regex matching on the XML string: match the content of each item tag, then match each child element and extract its value.
I am trying to write XML data using StAX where the content itself is HTML.
If I try
xtw.writeStartElement("contents");
xtw.writeCharacters("<b>here</b>");
xtw.writeEndElement();
I get this
<contents>&lt;b&gt;here&lt;/b&gt;</contents>
Then I notice the CDATA method and change my code to:
xtw.writeStartElement("contents");
xtw.writeCData("<b>here</b>");
xtw.writeEndElement();
and this time the result is
<contents><![CDATA[<b>here</b>]]></contents>
which is still not good. What I really want is
<contents><b>here</b></contents>
So is there an XML API/library that allows me to write raw text without being in a CDATA section? So far I have looked at StAX and JDOM, and they do not seem to offer this.
In the end I might resort to good old StringBuilder but this would not be elegant.
Update:
I agree mostly with the answers so far. However instead of <b>here</b> I could have a 1MB HTML document that I want to embed in a bigger XML document. What you suggest means that I have to parse this HTML document in order to understand its structure. I would like to avoid this if possible.
Answer:
It is not possible, otherwise you could create invalid XML documents.
The issue is that this is not raw text; it is an element, so you should be writing
xtw.writeStartElement("contents");
xtw.writeStartElement("b");
xtw.writeCData("here");
xtw.writeEndElement();
xtw.writeEndElement();
If you want the XML to be included AS XML and not as character data, then it has to be parsed at some point. If you don't want to manually do the parsing yourself, you have two alternatives:
(1) Use external parsed entities -- in this case the external file will be pulled in and parsed by the XML parser. When the output is again serialized, it will include the contents of the external file.
[ See http://www.javacommerce.com/displaypage.jsp?name=entities.sql&id=18238 ]
(2) Use XInclude -- in that case the file has to be run through an XInclude processor, which will merge the XInclude references into the output. Most XSLT processors, as well as xmllint, will also do XInclude with an appropriate option.
[ See: http://www.xml.com/pub/a/2002/07/31/xinclude.html ]
( XSLT can also be used to merge documents without using the XInclude syntax. XInclude just provides a standard syntax )
The problem is not "here", it's <b></b>.
Add the <b> element as a child of contents and you'll be able to do it. Any library like JDOM or DOM4J will allow you to do this. The general case is to parse the content into an XML DOM and add the root element as a child of <contents>.
You can't add escaped values outside of a CDATA section.
If you want to embed a large HTML document in an XML document then CDATA imho is the way to go. That way you don't have to understand or process the internal structure and you can later change the document type from HTML to something else without much hassle. Also I think you can't embed e.g. DOCTYPE instructions directly (i.e. as structured data that retains the semantics of the DOCTYPE instruction). They have to be represented as characters.
(This is primarily a response to your update but alas I don't have enough rep to comment...............)
I don't see what the problem is with parsing the large block of XML you want to insert into your output. Use a StAX parser to parse it, and just write code to forward all of the events to your existing serializer (variable "xtw").
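A rough sketch of that forwarding idea with the JDK's StAX API: read the fragment with an XMLStreamReader and replay each event on the XMLStreamWriter. The class name FragmentForwarder and its methods are made up for illustration, and the sketch assumes the HTML is well-formed XML (i.e. XHTML). It only copies elements, attributes, and text; namespaces, comments, and processing instructions are skipped.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper class, for illustration only.
final class FragmentForwarder {
    // Replays the elements, attributes and text of a well-formed fragment on xtw.
    static void forward(String wellFormedXml, XMLStreamWriter xtw) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(wellFormedXml));
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    xtw.writeStartElement(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        xtw.writeAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    xtw.writeCharacters(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    xtw.writeEndElement();
                    break;
                default:
                    break; // other event types are ignored in this sketch
            }
        }
        reader.close();
    }

    // Wraps the fragment in <contents> and returns the serialized result.
    static String embed(String wellFormedXml) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xtw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xtw.writeStartElement("contents");
            forward(wellFormedXml, xtw);
            xtw.writeEndElement();
            xtw.flush();
            xtw.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(embed("<b>here</b>")); // prints <contents><b>here</b></contents>
    }
}
```

Because the fragment is streamed event by event, no tree of the embedded document is built in memory, so this should also be workable for a large embedded document.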
If the blob of html is actually xhtml then I'd suggest doing something like (in pseudo-code):
xtw.writeStartElement("contents")
XMLReader xtr=new XMLReader();
xtr.read(blob);
Dom dom=xtr.getDom();
for(element e:dom){
xtw.writeElement(e);
}
xtw.writeEndElement();
or something like that. I had to do something similar once but used a different library.
If your XML and HTML are not too big, you could make a workaround:
xtw.writeStartElement("contents");
xtw.writeCharacters("anUniqueIdentifierForReplace"); // <--
xtw.writeEndElement();
When you have your XML as a String:
xmlAsString = xmlAsString.replace("anUniqueIdentifierForReplace", yourHtmlAsString); // Strings are immutable, so keep the result
I know, it's not so nice, but this could work.
Edit: Of course, you should check if yourHtmlAsString is valid.