PARSING A COMPLEX XML USING DOM

PARSING A COMPLEX XML USING DOM - java

I know that this sort of question has been asked here before, but still i couldn't find any solution to my problem. So please, can anyone help me with it.....
Situation:
I am parsing the following xml response using DOM parser.
<feed>
<post_id>16</post_id>
<user_id>68</user_id>
<restaurant_id>5</restaurant_id>
<dish_id>7</dish_id>
<post_img_id>14</post_img_id>
<rezing_post_id></rezing_post_id>
<price>8.30</price>
<review>very bad</review>
<rating>4.0</rating>
<latitude>22.299999000000</latitude>
<longitude>73.199997000000</longitude>
<near> Gujarat</near>
<posted>1340869702</posted>
<display_name>username</display_name>
<username>vivek</username>
<first_name>vivek</first_name>
<last_name>mitra</last_name>
<dish_name>Hash brows</dish_name>
<restaurant_name>Waffle House</restaurant_name>
<post_img>https://img1.yumzing.com/1000/9cab8fc91</post_img>
<post_comment_count>0</post_comment_count>
<post_like_count>0</post_like_count>
<post_rezing_count>0</post_rezing_count>
<comments>
<comment/>
</comments>
</feed>
<feed>
<post_id>8</post_id>
<user_id>13</user_id>
<restaurant_id>5</restaurant_id>
<dish_id>6</dish_id>
<post_img_id>7</post_img_id>
<rezing_post_id></rezing_post_id>
<price>3.50</price>
<review>This is cheesy!</review>
<rating>4.0</rating>
<latitude>42.187000000000</latitude>
<longitude>-88.346497000000</longitude>
<near>Lake in the Hills IL</near>
<posted>1340333509</posted>
<display_name>username</display_name>
<username>Nullivex</username>
<first_name>Bryan</first_name>
<last_name>Tong</last_name>
<dish_name>Hash Brows with Cheese</dish_name>
<restaurant_name>Waffle House</restaurant_name>
<post_img>https://img1.yumzing.com/1000/78e5c184fd3ae752f8665636381a8f0006762dc0</post_img>
<post_comment_count>6</post_comment_count>
<post_like_count>1</post_like_count>
<post_rezing_count>1</post_rezing_count>
<comments>
<comment>
<user_id>16</user_id>
<email>existentialism27#gmail.com</email>
<email_new></email_new>
<email_verification_code></email_verification_code>
<password>9d99ef4f72f9d2df968a75e058c78245fa45e9e7</password>
<password_reset_code></password_reset_code>
<salt>31a988badccd35a1be7dacc073f60f52e49ff881</salt>
<username>existentialism27</username>
<first_name>Daniel</first_name>
<last_name>Amaya</last_name>
<display_name>username</display_name>
<birth_month>10</birth_month>
<birth_day>5</birth_day>
<birth_year>1985</birth_year>
<city>Colorado Springs</city>
<state>CO</state>
<country>US</country>
<timezone>US/Mountain</timezone>
<last_seen>1338365509</last_seen>
<is_confirmed>1</is_confirmed>
<is_active>1</is_active>
<post_comment_id>9</post_comment_id>
<post_id>8</post_id>
<comment>this is a test comment!</comment>
<posted>1340334121</posted>
</comment>
<comment>
<user_id>16</user_id>
<email>existentialism27#gmail.com</email>
<email_new></email_new>
<email_verification_code></email_verification_code>
<password>9d99ef4f72f9d2df968a75e058c78245fa45e9e7</password>
<password_reset_code></password_reset_code>
<salt>31a988badccd35a1be7dacc073f60f52e49ff881</salt>
<username>existentialism27</username>
<first_name>Daniel</first_name>
<last_name>Amaya</last_name>
<display_name>username</display_name>
<birth_month>10</birth_month>
<birth_day>5</birth_day>
<birth_year>1985</birth_year>
<city>Colorado Springs</city>
<state>CO</state>
<country>US</country>
<timezone>US/Mountain</timezone>
<last_seen>1338365509</last_seen>
<is_confirmed>1</is_confirmed>
<is_active>1</is_active>
<post_comment_id>10</post_comment_id>
<post_id>8</post_id>
<comment>this is a test comment!</comment>
<posted>1340334128</posted>
</comment>
</comments>
</feed>
In the above xml response, i am getting multiple "feed", which i am able to parse without any problem, but here each "feed" can have None or N numbers of "comment". I am not able to parse the comment for an individual feed. Can anyone suggest me how do proceed in the right direction.
I am also putting a snippet of code here, NOT the entire code.. that i am using to parse the xml doc, so it will be easier to pin point the problem.
DocumentBuilderFactory odbf = DocumentBuilderFactory.newInstance();
DocumentBuilder odb = odbf.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xml));
Document odoc = odb.parse(is);
odoc.getDocumentElement().normalize ();
NodeList LOP = odoc.getElementsByTagName("feed");
System.out.println(LOP.getLength());
for (int s=0 ; s<LOP.getLength() ; s++){
Node FPN =LOP.item(s);
try{
if(FPN.getNodeType() == Node.ELEMENT_NODE)
{
Element token = (Element)FPN;
NodeList oNameList0 = token.getElementsByTagName("post_id");
Element ZeroNameElement = (Element)oNameList0.item(0);
NodeList textNList0 = ZeroNameElement.getChildNodes();
feed_post_id = Integer.parseInt(((Node)textNList0.item(0)).getNodeValue().trim());
System.out.println("#####The Parsed data#####");
System.out.println("post_id : " + ((Node)textNList0.item(0)).getNodeValue().trim());
System.out.println("#####The Parsed data#####");
}
}catch(Exception ex){}
}

Once you have the feed NodeList run on it and use:
NodeList nodes = feedNode.getChildNodes();
for (Node node: nodes)
{
if(node.getNodeName().equals("comments")){
//do something with comments node
}
}

Related

Java DOM: getElementsByTagName is returning no elements?

I am trying to parse the XML document at http://web.mta.info/status/ServiceStatusSubway.xml and extract all the PtSituationElement elements with the following code:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document subwayStatusDoc = builder.parse(new URL("http://web.mta.info/status/ServiceStatusSubway.xml").openStream());
NodeList situationList = subwayStatusDoc.getDocumentElement().getElementsByTagName("PtSituationElement");
System.out.println(situationList.item(0)); //prints null
What am I doing wrong here ?

The PtSituationElement tags contain child tags, so you need to go into those. Just printing .item(0) relies on the toString() method, and apparently it does not do a great job of explaining your nodes.
So add this to see some of the data in the child nodes:
Node item = situationList.item(0);
NodeList childNodes = item.getChildNodes();
for (int j = 0; j < childNodes.getLength(); j++) {
System.out.println(childNodes.item(j).getTextContent());
}
(I'm not sure what you want to do with the data in the xml structure, but this example shows how you can proceed with your work.)
Also, I noted that the LongDescription tags contain HTML that is not correct XML (<br clear=left> should be <br clear=left> etc). The parser could have a problem with that. It would be better if the HTML was escaped (see How to escape "&" in XML?).

Getting null values from XPath query

I have this xml file:
<?xml version="1.0" encoding="UTF-8"?>
<iet:aw-data xmlns:iet="http://care.aw.com/IET/2007/12" class="com.aw.care.bean.resource.MessageResource">
<iet:metadata filter=""/>
<iet:message-resource>
<iet:message>some message 1</iet:message>
<iet:customer id="1"/>
<iet:code>edi.claimfilingindicator.11</iet:code>
<iet:locale>iw_IL</iet:locale>
</iet:message-resource>
<iet:message-resource>
<iet:message>some message 2</iet:message>
<iet:customer id="1"/>
<iet:code>edi.claimfilingindicator.12</iet:code>
<iet:locale>iw_IL</iet:locale>
</iet:message-resource>
.
.
.
.
</iet:aw-data>
Using this code below i'm getting over the data and finding what I need.
try {
FileInputStream fileIS = new FileInputStream(new File("resources\\bootstrap\\content\\MessageResources_iw_IL\\MessageResource_iw_IL.ctdata.xml"));
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
builderFactory.setNamespaceAware(true); // never forget this!
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(fileIS);
XPath xPath = XPathFactory.newInstance().newXPath();
String query = "//*[local-name()='message-resource']//*[local-name()='code'][contains(text(), 'account')]";
NodeList nodeList = (NodeList) xPath.compile(query).evaluate(xmlDocument, XPathConstants.NODESET);
System.out.println("size= " + nodeList.getLength());
for (int i = 0; i < nodeList.getLength(); i++) {
System.out.println(nodeList.item(i).getNodeValue());
}
}
catch (Exception e){
e.printStackTrace();
}
The issue is that i'm getting only null values while printing in the for loop, any idea why it's happened?
The code needs to return a list of nodes which have a code and message fields that contains a given parameters (same as like SQL query with two parameters with operator of AND between them)

Check the documentation:
https://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html
getNodeValue() applied to an element node returns null.
Use getTextContent().
Alternatively, if you find DOM too frustrating, switch to one of the better tree models like JDOM2 or XOM.
Also, if you used an XPath 2.0 engine like Saxon, it would (a) simplify your expression to
//*:message-resource//*:code][contains(text(), 'account')]
and (b) allow you to return a sequence of strings from the XPath expression, rather than a sequence of nodes, so you wouldn't have to mess around with nodelists.
Another point: I suspect that the predicate [contains(text(), 'account')] should really be [.='account']. I'm not sure of that, but using text() instead of ".", and using contains() instead of "=", are both common mistakes.

get child nodes from parent (xml, java)

UPDATE
i was specifically targeting staff under some root node, not all "staff" elements in the whole document. i forgot to mention this important detail in the question. sorry guys.
i found this answer to my question:
getElementsByTagName
But with this data:
<one>
<two>
<three>
<company>
<staff id="1001">
<firstname>Golf</firstname>
<lastname>4</lastname>
<nickname>Schnecke</nickname>
<salary>1</salary>
</staff>
<staff id="2001">
<firstname>Audi</firstname>
<lastname>R8</lastname>
<nickname>Rennaudi</nickname>
<salary>1111111</salary>
</staff>
<staff id="2002">
<firstname>Skoda</firstname>
<lastname>xyz</lastname>
<nickname>xyz</nickname>
<salary>0.1</salary>
</staff>
</company>
</three>
</two>
</one>
and this code:
public static void parseXML2() {
File fXmlFile = new File("src\\main\\java\\staff.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = null;
try {
dBuilder = dbFactory.newDocumentBuilder();
} catch (ParserConfigurationException ex) {
Logger.getLogger(MyParser.class.getName()).log(Level.SEVERE, null, ex);
}
Document doc = null;
try {
doc = dBuilder.parse(fXmlFile);
} catch (SAXException ex) {
Logger.getLogger(MyParser.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException ex) {
Logger.getLogger(MyParser.class.getName()).log(Level.SEVERE, null, ex);
}
System.out.println("test");
System.out.println(doc.getElementsByTagName("company").item(0).getTextContent());
}
i dont get just one staff element, but all of them. how come?
i was expecting to get:
Golf
4
Schnecke
1
but instead i get this:
Golf
4
Schnecke
1
Audi
R8
Rennaudi
1111111
Skoda
xyz
xyz
0.1
looks like your post is mostly code, please add more details...yes the details are there.

You are almost there. If you want to get the text contents of the first staff node, then get the elements by that tag name:
System.out.println(doc.getElementsByTagName("staff").item(0).getTextContent());
// ^^^^^^
Update
In case you want to get the first staff node under company, then you can find them with node type and node name checks. Here is a rudimentary loop to do this:
Node companyNode = doc.getElementsByTagName("company").item(0);
NodeList companyChildNodes = companyNode.getChildNodes();
for (int i = 0; i < companyChildNodes.getLength(); i++) {
Node node = companyChildNodes.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE && Objects.equals("staff", node.getNodeName())) {
System.out.println(node.getTextContent());
break;
}
}
You might want to refactor the for loop into a separate method.
You can use XPATH too. I think it's more concise:
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("//company/staff[1]");
NodeList nl = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
System.out.println(nl.item(0).getTextContent());
Explanation:
//company selects all the company tags, regardless where they are in the xml. It's the // that ignores the rest of the xml structure.
//company/staff selects all the staff tags that are under company tag.
[0] selects the first such item.

You have a line of code which gets the text content from the first company.
System.out.println(doc.getElementsByTagName("company").item(0).getTextContent());
This gets all the text content in the first company node which in this case is the data inside 3 staff elements. If you want to select the first staff element, you can either select it by name or by getting the first child of the company.

If your xml has a format like you mentioned, and with this line of code,
System.out.println(doc.getElementsByTagName("company").item(0).getTextContent();
you are printing ALL contents under the COMPANY tag name, thus its gonna print EVERYTHING in company, if you want to select to print only 1st staff, change tagname in the sysout to "staff" and item variable to (0,1,2,3 -> index of wanted staff)
System.out.println(doc.getElementsByTagName("staff").item(0).getTextContent();
i havent tried this out, but i think its gonna work

java dom parser only gets the first entity

This code returs only one "question " tag's element but I have another 9 question element inside the xml file.What is the wrong thing in here?Do I need to loop.Because when I checked the loop, it loops only one time.What is the problem?I am figure out.
Here is my xml:
<Results>
<question>
<eno>3</eno>
<qno>1</qno>
<qtext>The Battle of Gettysburg was fought during which war?</qtext>
<correctAnswer>C</correctAnswer>
</question>
<question>
<eno>3</eno>
<qno>2</qno>
<qtext>Neil Armstrong and Buzz Aldrin walked how many
minutes on the moon in 1696?</qtext>
<correctAnswer>B</correctAnswer>
</question>
</Results>
my source code:
NodeList listOfQuestions = doc.getElementsByTagName("question");
for(int s=0; s<listOfQuestions.getLength(); s++)
{
System.out.println(listOfQuestions.getLength());
Node firstQuestionNode = listOfQuestions.item(0);
if(firstQuestionNode.getNodeType() == Node.ELEMENT_NODE){
Element firstQElement = (Element)firstQuestionNode;
NodeList enoList = firstQElement.getElementsByTagName("eno");
Element enoElement =(Element)enoList.item(s);
NodeList enosList = enoElement.getChildNodes();
String eno=((Node)enosList.item(s)).getNodeValue().trim();
System.out.println(eno);
NodeList qnoList = firstQElement.getElementsByTagName("qno");
Element qnoElement =(Element)qnoList.item(s);
NodeList qnosList = qnoElement.getChildNodes();
String qno= ((Node)qnosList.item(s)).getNodeValue().trim();
System.out.println(qno);
NodeList qtextList = firstQElement.getElementsByTagName("qtext");
Element qtextElement =(Element)qtextList.item(s);
NodeList qtextsList = qtextElement.getChildNodes();
String qtext= ((Node)qtextsList.item(s)).getNodeValue().trim();
System.out.println(qtext);
NodeList correctAnswerList = firstQElement.getElementsByTagName("correctAnswer");
Element correctAnswerElement =(Element)correctAnswerList.item(s);
NodeList correctAnswerElementList = correctAnswerElement.getChildNodes();
String correctAnswer= ((Node)correctAnswerElementList.item(s)).getNodeValue().trim();
System.out.println(correctAnswer);
int i=st.executeUpdate("insert into question(eno,qno,qtext,correctAnswer) values('"+eno+"','"+qno+"','"+qtext+"','"+correctAnswer+"')");
System.out.println("s is"+s);
}
}

You have hardcoded
Node firstQuestionNode = listOfQuestions.item(0);
^^^
I think you meant to use the variable s there... or maybe not, it's hard to tell what you're trying to do. Regardless, there are no other references to listOfQuestions and you never retrieve any node except the first one.

You should have a look at jsoup, it is an API specifically built for parsing HTML DOM code in java and has tons of extra features. Also, what you are currently trying to extract would not be more than just 3-4 LOC using the API components.
Look at their example on their website for connection to an URL and fetching DOM-Elements is just 2 LOC:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

How to parse this XML file

I am new to programming in java and i have just learned how to parse an xml file. But i am not getting any idea on how to parse this xml file. Please help me with a code on how to get the tags day1 and their inner tags order1,order2
<RoutePlan>
<day1>
<Order1>
<customer> XYZ</customer>
<address> INDIA </address>
<data> 10-10-2011 </data>
<time> 9.30 A.M </time>
</Order1>
<Order2>
<customer> ABC </customer>
<address> US </address>
<data> 10-10-2011 </data>
<time> 10.30 A.M </time>
</Order2>
</day1>
I wrote the following code to retrieve. But i am only getting the data in order1 but not in order2
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.parse(file);
document.getDocumentElement().normalize();
System.out.println("Root Element: "+document.getDocumentElement().getNodeName());
NodeList node = document.getElementsByTagName("day1");
for(int i=0;i<node.getLength();i++){
Node firstNode = node.item(i);
Element element = (Element) firstNode;
NodeList customer = element.getElementsByTagName("customer");
Element customerElement = (Element) customer.item(0);
NodeList firstName = customerElement.getChildNodes();
System.out.println("Name: "+((firstName.item(0).getNodeValue())));
NodeList address = element.getElementsByTagName("address");
Element customerAddress = (Element) address.item(0);
NodeList addName = customerAddress.getChildNodes();
System.out.println("Address: "+((addName.item(0).getNodeValue())));
NodeList date = element.getElementsByTagName("date");
Element customerdate = (Element) date.item(0);
NodeList dateN = customerdate.getChildNodes();
System.out.println("Address: "+((dateN.item(0).getNodeValue())));
NodeList time = element.getElementsByTagName("time");
Element customertime = (Element) time.item(0);
NodeList Ntime = customertime.getChildNodes();
System.out.println("Time: "+((Ntime.item(0).getNodeValue())));
}

I can give you not one, not two, but three directions to parse this XML (there are more but let's say they are the most commons ones):
DOM -> two good resources to start : here and here
SAX -> quickstart from official website: here
StAX -> a good introduction: here
Judging by the size of your XML document, I'd probably go for a DOM parsing, which gonna be the easiest to implement and to use (but if you have to deal with larger files, take a look at SAX for reading-only manipulations and StAX for reading and writing ones).

The reason you are getting only "Order1" elements is because:
You lock on the "day1" node.
You retrieve the "customer" elements by tag name which returns 2 elements.
You retrieve the first element and print its value and hence the second "customer" is ignored.
When working with DOM, be prepared to spin up multiple loops for retrieving data. Also, you are a bit misguided when it comes to representing your schema. You really don't need to name "elements" as "day1"/"order1" etc. In XML, that can be simply expressed by having multiple "day" or "order" elements which in turn automatically enforces ordering. An example XML would look like:
<route-plan>
<day>
<order>
<something>
</order>
</day>
<day>
<order>
<something>
</order>
</day>
</route-plan>
Now retrieving "day" elements is a simple matter of:
Look up "day" elements by tag name
For each "day" element
Look up "order" element by tag name
For each "order" element
Print out the value of "customer"/"address" etc.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

PARSING A COMPLEX XML USING DOM - java

Once you have the feed NodeList run on it and use: NodeList nodes = feedNode.getChildNodes(); for (Node node: nodes) { if(node.getNodeName().equals("comments")){ //do something with comments node } }

Related

Java DOM: getElementsByTagName is returning no elements?

Getting null values from XPath query

get child nodes from parent (xml, java)

java dom parser only gets the first entity

How to parse this XML file

Categories

Resources