There are a lot of questions that ask for the best XML parser; I am more interested in which XML parser for Java is the most like Groovy's.
I want:
SomeApiDefinedObject o = parseXml( xml );
for( SomeApiDefinedObject it : o.getChildren() ) {
System.out.println( it.getAttributes() );
}
The most important things are that I don't want to create a class for every type of XML node (I'd rather just deal with them all as strings), and that building the XML doesn't require any converters or anything, just a simple object that is already defined.
If you have used the Groovy XML parser, you will know what I'm talking about.
Alternatively, would it be better for me to just use Groovy from Java?
Here is something quick you can do with the streaming XML parser (StAX) that ships with Java:
FileInputStream xmlStream = new FileInputStream(new File("myxml.xml"));
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xmlStream);
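while (reader.hasNext()) {
    // Attributes are only available on START_ELEMENT events; calling
    // getAttributeCount() on any other event throws IllegalStateException.
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            System.out.println(reader.getAttributeName(i) + "=" + reader.getAttributeValue(i));
        }
    }
}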
I would like to shamelessly plug the small open-source library I have written to make parsing XML in Java a breeze.
Check out Jinq2XML.
http://code.google.com/p/jinq2xml/
Some sample code would look like:
Jocument joc = Jocument.load(urlOrStreamOrFileName);
joc.single("root").children().each(new Action() {
public void act(Jode j){
System.out.println(j.name() + ": " + j.value());
}
});
It looks like all you want is a simple DOM API, such as the one provided by dom4j. There is actually already a DOM API in the standard library (the org.w3c.dom packages, with a parser reachable through javax.xml.parsers), but it is fairly verbose, so you might as well use something a little more advanced like dom4j.
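For comparison, here is a rough sketch of the loop from the question written against the plain org.w3c.dom API. Current JDKs bundle a parser behind javax.xml.parsers, so no extra dependency is needed to try it. The class name DomChildren, the method describeChildren and the sample XML are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomChildren {
    // Returns one "name: attr=value ..." line per child element of the root.
    public static List<String> describeChildren(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> lines = new ArrayList<>();
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text, comments and so on
            }
            StringBuilder line = new StringBuilder(child.getNodeName()).append(':');
            NamedNodeMap attrs = child.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                line.append(' ').append(attr.getNodeName())
                    .append('=').append(attr.getNodeValue());
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample document, not taken from the question.
        String xml = "<root><a id=\"1\"/><b id=\"2\"/></root>";
        describeChildren(xml).forEach(System.out::println);
    }
}
```

Everything comes back as strings and no per-node classes are needed, which is close to what the question asks for, at the cost of more ceremony than Groovy.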
Use Groovy.
It seems that your primary goal is to be able to access the DOM in a "natural" way via object accessors, and Java won't let you do this without defining classes. Groovy, because it is "duck typed," will allow you to do this.
The only reason not to use Groovy is if (1) XML processing is a very small part of your application, and/or (2) you have to work with other people who may want to program strictly in Java.
Whatever you do, do not decide to "just deal with them all as strings." XML is not a simple format, and unless you know the spec inside and out, you're not likely to get it right. Which means that your XML will be rejected by spec-conformant parsers.
I highly recommend JAXB. It is a great framework for mapping XML to and from Java objects.
There used to be a very small and simple XML parser called NanoXML. It seems not to be developed anymore, but it's still available at http://devkix.com/nanoxml.php
I have had good experiences with XStream. It's fairly quick and will serialize and deserialize Java to/from XML with no schema and very little code. It Just Works™. The Java object hierarchies it builds will directly mirror your XML.
I work with Dozer and Castor for getting OTOM (Object to Object Mapping).
Try this code:
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import javax.xml.parsers.*;
import javax.xml.transform.Transformer;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.*;
import javax.xml.*;
import java.io.File;
import java.io.IOException;
public class XmlDemo {
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException
{
// Get Document Builder
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
//Build Document
Document document = builder.parse(new File("D:/test.xml"));
//Normalize the XML Structure; It's just too important !!
document.getDocumentElement().normalize();
//Here comes the root node
Element root = document.getDocumentElement();
System.out.println(root.getNodeName());
//Get all <company> elements
NodeList nList = document.getElementsByTagName("company");
System.out.println(nList.getLength());
System.out.println("============================");
visitChildNodes(nList);
}
//This function is called recursively
private static void visitChildNodes(NodeList nList)
{
Node tempNode = null;
// System.out.println("The Number of child nodes are " + (nList.getLength()) );
if(!(nList.getLength() == 1)) {
// System.out.println("The Number of child nodes are " + (nList.getLength()) );
// for (int temp = 0; temp < nList.getLength(); temp++)
// {
// Node node = nList.item(temp);
// if (node.getNodeType() == Node.ELEMENT_NODE)
// {
// System.out.println();
// System.out.println("Node Name = " + node.getNodeName() + ";");
// }
// }
}
for (int temp = 0; temp < nList.getLength(); temp++)
{
Node node = nList.item(temp);
if (node.getNodeType() == Node.ELEMENT_NODE)
{
System.out.println();
System.out.println("Node Name = " + node.getNodeName() + ";");
// if (node.hasAttributes()) {
// // get attributes names and values
// NamedNodeMap nodeMap = node.getAttributes();
// for (int i = 0; i < nodeMap.getLength(); i++)
// {
// tempNode = nodeMap.item(i);
// System.out.println(" Attr : " + tempNode.getNodeName()+ "; Value = " + tempNode.getNodeValue());
// }
//
// }else {
//// System.out.println("No Attributes");
// }
if (node.hasChildNodes()) {
NodeList nodeList = node.getChildNodes();
if((node.getChildNodes().getLength()/2)>0) {
System.out.println("This node has child nodes "+ (node.getChildNodes().getLength()/2));
System.out.println("Child nodes of : [ " + node.getNodeName() + " ] =>");
for(int k = 0;k < nodeList.getLength(); k++) {
Node n = nodeList.item(k);
if (n.getNodeType() == Node.ELEMENT_NODE)
{
if((k<(nodeList.getLength()))) {
System.out.println(" [ " + n.getNodeName() + " ] =>" );
if (n.hasAttributes()) {
// // get attributes names and values
NamedNodeMap nodeMap = n.getAttributes();
for (int i = 0; i < nodeMap.getLength(); i++)
{
tempNode = nodeMap.item(i);
System.out.println(" Attr : " + tempNode.getNodeName()+ "; Value = " + tempNode.getNodeValue() + " ]");
}
}else {
// System.out.println("No Attributes");
}
}else if((k==(nodeList.getLength()))) {
System.out.println(" [ " + n.getNodeName() + " ]");
}
}
}
System.out.println(" ]");
}
visitChildNodes(node.getChildNodes());
}
}
}
}
}
What I am actually doing is a recursive function which reads the tags in the xml. Below is the code:
private void readTag(org.w3c.dom.Node item, String histoTags, String fileName, Hashtable<String, String> tagsInfos) {
    if (item.getNodeType() == Node.ELEMENT_NODE) {
        NodeList itemChilds = item.getChildNodes();
        for (int i = 0; i < itemChilds.getLength(); i++) {
            org.w3c.dom.Node itemChild = itemChilds.item(i);
            readTag(itemChild, histoTags + "|" + item.getNodeName(), fileName, tagsInfos);
        }
    }
    else if (item.getNodeType() == Node.TEXT_NODE) {
        // Store the text under the path of tags that leads to it
        tagsInfos.put(histoTags, item.getNodeValue());
    }
}
This function takes some time to execute. The xml the function reads is in this format:
<?xml version="1.0" encoding="UTF-8"?>
<Document>
<Mouvement>
<Com>
<IdCom>32R01000000772669473</IdCom>
<RefCde>32R</RefCde>
<Edit>0</Edit>
</Com>
</Mouvement>
</Document>
Is there any way of optimizing this code in java?
Two optimizations, don't know how much they will help:
Don't use getChildNodes(). Use getFirstChild() and getNextSibling().
Reuse a single StringBuilder instead of creating a new one for every element (implicitly done by histoTags + "|" + item.getNodeName()).
But you should also be aware that the text content of an element node may be seen as a combination of multiple TEXT and CDATA nodes.
Your code will also work better if it works on elements, not nodes.
private static void readTag(Element elem, StringBuilder histoTags, String fileName, Hashtable<String, String> tagsInfos) {
int histoLen = histoTags.length();
CharSequence textContent = null;
boolean hasChildElement = false;
for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
histoTags.append('|').append(child.getNodeName());
readTag((Element)child, histoTags, fileName, tagsInfos);
histoTags.setLength(histoLen);
hasChildElement = true;
break;
case Node.TEXT_NODE:
case Node.CDATA_SECTION_NODE:
//uncomment to test: System.out.println(histoTags + ": \"" + child.getTextContent() + "\"");
if (textContent == null)
// Optimization: Don't copy to a StringBuilder if only one text node will be found
textContent = child.getTextContent();
else if (textContent instanceof StringBuilder)
// Ok, now we need a StringBuilder to collect text from multiple nodes
((StringBuilder)textContent).append(child.getTextContent());
else
// And we keep collecting text from multiple nodes
textContent = new StringBuilder(textContent).append(child.getTextContent());
break;
default:
// ignore all others
}
}
if (textContent != null) {
String text = textContent.toString();
// Suppress pure whitespace content on elements with child elements, i.e. structural whitespace
if (! hasChildElement || ! text.trim().isEmpty())
tagsInfos.put(histoTags.toString(), text);
}
}
Test
String xml = "<root>\n" +
" <tag>hello <![CDATA[world]]> Foo <!-- comment --> Bar</tag>\n" +
"</root>\n";
Element docElem = DocumentBuilderFactory.newInstance()
.newDocumentBuilder()
.parse(new InputSource(new StringReader(xml)))
.getDocumentElement();
Hashtable<String, String> tagsInfos = new Hashtable<>();
readTag(docElem, new StringBuilder(docElem.getNodeName()), "fileName", tagsInfos);
System.out.println(tagsInfos);
Output (with print uncommented)
root: "
"
root|tag: "hello "
root|tag: "world"
root|tag: " Foo "
root|tag: " Bar"
root: "
"
{root|tag=hello world Foo Bar}
See how splitting the text inside the <tag> node using CDATA and comments caused the DOM node to contain multiple TEXT/CDATA child nodes.
I'm trying to get XML parsing down (and yes I know there's easier ways to parse/validate like xstream) but I can't seem to get text content of just a single element. For example:
<container>
<element0>textThatIWant</element0> //only returned by .getTextContent
<element1>
<subelement0>textThatIDontWant</subelement0> //but also returned by
<subelement1>textThatIDontWant</subelement1> //.getTextContent
</element1>
</container>
I'm piping results out to the console and get mostly what I'm looking for, but the only ways I seem to get the text strings are .getTextContent(), which also returns all the text in the sub-elements, without whitespace (or else I'd have split on spaces), and .getNodeValue().toString(), which throws a NullPointerException. @Jihar mentioned something like .getTextValue(), but Eclipse doesn't recognize it (maybe there's something I can implement/inherit/whatever to add the capability). Any help?
Here's the code I'm using:
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.*;
public class Test {
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
StringBuilder xmlStringBuilder = new StringBuilder();
String appendage = "..."; //This string holds the xml formatted data I'll be
//using in a long annoying line, I'll include it
//separately for clarity
xmlStringBuilder.append(appendage);
ByteArrayInputStream input = new ByteArrayInputStream(xmlStringBuilder.toString().getBytes("UTF-8"));
System.out.println("Test Results:");
System.out.println();
Document doc = builder.parse(input);
Element root = doc.getDocumentElement();
NodeList children = root.getChildNodes();
System.out.println(root.getTagName());
System.out.println();
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
if (child instanceof Element) {
Element childElement = (Element) child;
System.out.println(childElement.getTagName() + " " + childElement);
NodeList grandChildren = child.getChildNodes();
for (int x = 0; x < grandChildren.getLength(); x++) {
Node grandChild = grandChildren.item(x);
if (grandChild instanceof Element) {
Element grandChildElement = (Element) grandChild;
System.out.print("\t" + grandChildElement.getTagName() + ":\t");
NodeList greatGrandChildren = grandChild.getChildNodes();
for (int y = 0; y < greatGrandChildren.getLength(); y++) {
Node greatGrandChild = greatGrandChildren.item(y);
if (greatGrandChild instanceof Element) {
Element greatGrandChildElement = (Element) greatGrandChild;
System.out.print(" " + greatGrandChildElement.getTextContent());
if ( y < greatGrandChildren.getLength() - 1) { System.out.print(","); } }
}
System.out.println();
}
}
}
}
}
}
And here's the appendage variable in full:
String appendage = "<?xml version=\"1.0\"?><branch0><name>business</name><taxINFO/><personnel><executives><name>Billy Bob</name><name>Colonel Jessup</name></executives><managerial/><operations><name>sabrina</name><name>lisa</name></operations><services><name>jamie</name><name>justin</name><name>forest</name></services></personnel><regions><ebay><area>OK</area><area>BE</area><area>EV</area><area>WC</area></ebay><sbay><area>SJ</area><area>MP</area><area>SV</area><area>MV</area></sbay><S.F.><area>SF</area></S.F.><N.Y.><area>NY</area></N.Y.><S.CA><area>SD</area><area>LA</area></S.CA></regions><products/><services/></branch0>";
or:
String appendage = "
<?xml version=\"1.0\"?>
<branch0>
<name>business</name>
<taxINFO/>
<personnel>
<executives>
<name>Billy Bob</name>
<name>Colonel Jessup</name>
</executives>
<managerial/>
<operations>
<name>sabrina</name>
<name>lisa</name>
</operations>
<services>
<name>jamie</name>
<name>justin</name>
<name>forest</name>
</services>
</personnel>
<regions>
<ebay>
<area>OK</area>
<area>BE</area>
<area>EV</area>
<area>WC</area>
</ebay>
<sbay>
<area>SJ</area>
<area>MP</area>
<area>SV</area>
<area>MV</area>
</sbay>
<S.F.>
<area>SF</area>
</S.F.>
<N.Y.>
<area>NY</area>
</N.Y.>
<S.CA>
<area>SD</area>
<area>LA</area>
</S.CA>
</regions>
<products/>
<services/>
</branch0>";
";
And, finally my console output (which you'll see is stating [name: null] where I'd like it to say something like [name: business] or even just business; but not include the sub element data w/out whitespace):
Test Results:
branch0
name [name: null]
taxINFO [taxINFO: null]
personnel [personnel: null]
executives: Billy Bob, Colonel Jessup
managerial:
operations: sabrina, lisa
services: jamie, justin, forest
regions [regions: null]
ebay: OK, BE, EV, WC
sbay: SJ, MP, SV, MV
S.F.: SF
N.Y.: NY
S.CA: SD, LA
products [products: null]
services [services: null]
and here's my console output using .getTextContent:
Test Results:
business
branch0
name business
taxINFO
personnel Billy BobColonel Jessupsabrinalisajamiejustinforest
executives: Billy Bob, Colonel Jessup
managerial:
operations: sabrina, lisa
services: jamie, justin, forest
regions OKBEEVWCSJMPSVMVSFNYSDLA
ebay: OK, BE, EV, WC
sbay: SJ, MP, SV, MV
S.F.: SF
N.Y.: NY
S.CA: SD, LA
products
services
System.out.println(childElement.getTagName() + " " + childElement);
should be (as you actually know!)
System.out.println(childElement.getTagName() + " "
+ childElement.getTextContent());
So, for my purposes, I was able to get the individual elements I was looking for using an XPath:
XPathFactory xpfactory = XPathFactory.newInstance();
XPath path = xpfactory.newXPath();
try {
String aString = path.evaluate("/branch0/name", doc);
System.out.println(aString);
} catch (XPathExpressionException e) { e.printStackTrace(); }
Of course this requires pre-existing knowledge of the structure, but since I can validate with an XML Schema and my docs are not too complicated or heavily nested, I don't think that will be an issue for me. When I finish working on my current project I'll try to look up and post links about iterating over the child nodes and checking for text nodes (as @Ian Roberts suggested), but I don't know enough about XML to do that now.
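For readers who want the approach of iterating over the child nodes and checking for text nodes, here is a minimal sketch. It collects only the text and CDATA nodes that are direct children of an element, so text nested in sub-elements is left out. The class name DirectText and both method names are hypothetical helpers, not part of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DirectText {
    // Concatenates only the text and CDATA nodes that are direct children of elem.
    public static String directText(Element elem) {
        StringBuilder sb = new StringBuilder();
        for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Convenience wrapper for experiments: parses xml and returns the direct text
    // of the first element named tagName, or "" if there is no such element.
    public static String directTextOf(String xml, String tagName) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        Element target = root.getTagName().equals(tagName)
                ? root
                : (Element) root.getElementsByTagName(tagName).item(0);
        return target == null ? "" : directText(target);
    }

    public static void main(String[] args) {
        String xml = "<container><element0>textThatIWant</element0>"
                + "<element1><subelement0>textThatIDontWant</subelement0></element1>"
                + "</container>";
        System.out.println(directTextOf(xml, "element0"));
        System.out.println("[" + directTextOf(xml, "element1") + "]");
    }
}
```

Unlike the XPath version, this needs no prior knowledge of the document structure, because directText can be called on whatever element the loop is currently visiting.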
I am stuck on an issue trying to parse some XML documents to obtain the output I require.
Take this sample XML:
<root>
<ZoneRule Name="After" RequiresApproval="false">
<Zone>
<WSAZone ConsecutiveDayNumber="1">
<DaysOfWeek>
<WSADaysOfWeek Saturday="false"/>
</DaysOfWeek>
<SelectedLimits>
</SelectedLimits>
<SelectedHolidays>
</SelectedHolidays>
</WSAZone>
</Zone>
</ZoneRule>
<ZoneRule Name="Before" RequiresApproval="false">
<Zone>
<WSAZone ConsecutiveDayNumber="3">
<DaysOfWeek>
<WSADaysOfWeek Saturday="true"/>
</DaysOfWeek>
<SelectedLimits>
</SelectedLimits>
<SelectedHolidays>
</SelectedHolidays>
</WSAZone>
</Zone>
</ZoneRule>
</root>
What I am attempting to do is to ignore the root tag (this is working, so no problems here) and treat each of the ZoneRule elements as its own individual block.
Once I have each ZoneRule isolated, I need to extract all of the nodes and attributes so that I can create a string to query a database to check if it exists (this part is also working).
The issue I am having is that my code cannot separate out each individual ZoneRule block; for some reason it is all being processed as one.
My sample code is as follows:
public String testXML = "";
int andCount = 0;
public void printNote(NodeList nodeList) {
for (int count = 0; count < nodeList.getLength(); count++) {
Node tempNode = nodeList.item(count);
// make sure it's element node.
if (tempNode.getNodeType() == Node.ELEMENT_NODE) {
if (tempNode.hasAttributes()) {
// get attributes names and values
NamedNodeMap nodeMap = tempNode.getAttributes();
for (int i = 0; i < nodeMap.getLength(); i++) {
Node node = nodeMap.item(i);
if (andCount == 0) {
testXML = testXML + "XMLDataAsXML.exist('//" + tempNode.getNodeName() + "[./@" + node.getNodeName() + "=\"" + node.getNodeValue() + "\"]')=1 \n";
} else {
testXML = testXML + " and XMLDataAsXML.exist('//" + tempNode.getNodeName() + "[./@" + node.getNodeName() + "=\"" + node.getNodeValue() + "\"]')=1 \n";
}
andCount = andCount + 1;
}
}
if (tempNode.hasChildNodes()) {
// loop again if has child nodes
printNote(tempNode.getChildNodes());
}
}
}
}
private void jButton2ActionPerformed(java.awt.event.ActionEvent evt) {
try {
File file = new File("C:\\Test.xml");
DocumentBuilder dBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = dBuilder.parse(file);
//System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
if (doc.hasChildNodes()) {
printNote(doc.getChildNodes());
}
} catch (Exception e) {
System.out.println(e.getMessage());
}
System.out.println(testXML);
}
Which produces this output (both nodes combined).
XMLDataAsXML.exist('//ZoneRule[./@Name="After"]')=1
and XMLDataAsXML.exist('//ZoneRule[./@RequiresApproval="false"]')=1
and XMLDataAsXML.exist('//WSAZone[./@ConsecutiveDayNumber="1"]')=1
and XMLDataAsXML.exist('//WSADaysOfWeek[./@Saturday="false"]')=1
and XMLDataAsXML.exist('//ZoneRule[./@Name="Before"]')=1
and XMLDataAsXML.exist('//ZoneRule[./@RequiresApproval="false"]')=1
and XMLDataAsXML.exist('//WSAZone[./@ConsecutiveDayNumber="3"]')=1
and XMLDataAsXML.exist('//WSADaysOfWeek[./@Saturday="true"]')=1
What I am actually after is this (excuse the incomplete SQL statements):
XMLDataAsXML.exist('//ZoneRule[./@Name="After"]')=1
and XMLDataAsXML.exist('//ZoneRule[./@RequiresApproval="false"]')=1
and XMLDataAsXML.exist('//WSAZone[./@ConsecutiveDayNumber="1"]')=1
and XMLDataAsXML.exist('//WSADaysOfWeek[./@Saturday="false"]')=1
XMLDataAsXML.exist('//ZoneRule[./@Name="Before"]')=1
and XMLDataAsXML.exist('//ZoneRule[./@RequiresApproval="false"]')=1
and XMLDataAsXML.exist('//WSAZone[./@ConsecutiveDayNumber="3"]')=1
and XMLDataAsXML.exist('//WSADaysOfWeek[./@Saturday="true"]')=1
The XML that will be parsed will not always be exactly like the above, so I cannot use hardcoded XPaths etc. I need to loop through the document dynamically, looking for the ZoneRule node as my base (I will generate this value dynamically based on the file received), and then extract all the required info.
I am completely open to better methods than what I have tried above.
Thanks very much.
In your code, testXML and andCount are declared outside the printNote method and are not reset between iterations.
You start with the first ZoneRule and generate the correct text during the first for iteration (let's forget about the recursion for now). When you move on to the next ZoneRule, testXML still contains all of the previously generated text and andCount is larger than 0, so the text generated for the next ZoneRule is simply appended to it.
You could reset andCount and testXML at the beginning of each iteration of the for loop, but then the children visited through recursion would not be rendered correctly.
So you either need two methods, one to deal with the top-level ZoneRule elements and another for their children, or, much better, instead of appending text to a shared variable, you should redesign your methods so that they return a String value, which can then be appended correctly (with or without "and", with or without a new line) at the place where the method is called recursively.
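One possible shape for that redesign is sketched below. It assumes that every element directly under the root is one block (a ZoneRule in the sample) and it uses the standard XPath attribute syntax "@". The class name ZoneRuleQueries and its methods are hypothetical; the conditions are collected in a fresh list per block and joined afterwards, so no state is shared between blocks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ZoneRuleQueries {
    // Recursively collects one condition per attribute of elem and its descendants.
    private static void collect(Element elem, List<String> conditions) {
        NamedNodeMap attrs = elem.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            conditions.add("XMLDataAsXML.exist('//" + elem.getNodeName()
                    + "[./@" + attr.getNodeName() + "=\"" + attr.getNodeValue() + "\"]')=1");
        }
        for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, conditions);
            }
        }
    }

    // Returns one query fragment per element that is a direct child of the root.
    public static List<String> buildQueries(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> queries = new ArrayList<>();
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                List<String> conditions = new ArrayList<>(); // fresh state for each block
                collect((Element) child, conditions);
                queries.add(String.join("\n and ", conditions));
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        // Shortened, made-up version of the sample document.
        String xml = "<root>"
                + "<ZoneRule Name=\"After\"><Zone><WSAZone ConsecutiveDayNumber=\"1\"/></Zone></ZoneRule>"
                + "<ZoneRule Name=\"Before\"><Zone><WSAZone ConsecutiveDayNumber=\"3\"/></Zone></ZoneRule>"
                + "</root>";
        for (String query : buildQueries(xml)) {
            System.out.println(query);
            System.out.println();
        }
    }
}
```

Joining the list at the end also removes the need for andCount, because the " and " separator only ever appears between two conditions of the same block.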
I have been asked to add logs and sysout before the class declaration. What logs should I use, and how do I add sysout? What is the significance of adding these to this program? I am also asked to create a constant field for Staff Id and first name. Does that mean I should create constant variables that store the staff id and first name?
public class Read {
public static void main(String argv[]) {
try {
File fXmlFile = new File("/Users/mkyong/staff.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
NodeList nList = doc.getElementsByTagName("staff");
System.out.println("----------------------------");
for (int temp = 0; temp < nList.getLength(); temp++) {
Node nNode = nList.item(temp);
System.out.println("\nCurrent Element :" + nNode.getNodeName());
if (nNode.getNodeType() == Node.ELEMENT_NODE) {
Element eElement = (Element) nNode;
System.out.println("Staff id : " + eElement.getAttribute("id"));
System.out.println("First Name : " + eElement.getElementsByTagName("firstname").item(0).getTextContent());
System.out.println("Last Name : " + eElement.getElementsByTagName("lastname").item(0).getTextContent());
System.out.println("Nick Name : " + eElement.getElementsByTagName("nickname").item(0).getTextContent());
System.out.println("Salary : " + eElement.getElementsByTagName("salary").item(0).getTextContent());
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
I assume that by Sysout they mean System.out.println and, as far as I can see, you have already added those.
By logs I assume they mean using a logging implementation. An example using java.util.logging.Logger is this:
import java.util.logging.Logger;
public class Main {
private static Logger LOGGER = Logger.getLogger(Main.class.getSimpleName());
public static void main(String[] args) {
LOGGER.info("Logging an INFO-level message");
}
}
As you can see, part of a Logger's purpose is to output text that is used for diagnostics (debugging, monitoring). It can send the text to a console, a file, or a database, and format it in whatever way you want.
Using a logger vs plain System.out.println offers a lot more power and control.
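To make the "power and control" point a little more concrete, here is a small sketch using only java.util.logging. It attaches a custom Handler that collects messages in a list and shows that changing the logger's level decides which messages get through. The class name LoggingDemo, the method logAtLevel and the message texts are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LoggingDemo {
    private static final Logger LOGGER = Logger.getLogger(LoggingDemo.class.getName());

    // Logs one INFO and one WARNING message and returns those that passed the threshold.
    public static List<String> logAtLevel(Level threshold) {
        List<String> captured = new ArrayList<>();
        // A handler decides where log records go; this one just keeps them in memory.
        Handler handler = new Handler() {
            @Override
            public void publish(LogRecord record) {
                captured.add(record.getLevel() + ": " + record.getMessage());
            }

            @Override
            public void flush() { }

            @Override
            public void close() { }
        };
        LOGGER.setUseParentHandlers(false); // keep the demo output off the console
        LOGGER.setLevel(threshold);
        LOGGER.addHandler(handler);
        try {
            LOGGER.info("Parsing staff.xml");
            LOGGER.warning("Element 'nickname' is missing");
        } finally {
            LOGGER.removeHandler(handler);
        }
        return captured;
    }

    public static void main(String[] args) {
        System.out.println(logAtLevel(Level.INFO));    // both messages pass
        System.out.println(logAtLevel(Level.WARNING)); // only the warning passes
    }
}
```

With System.out.println, silencing the informational message would mean editing or deleting the call; with a logger it is a one-line level change, and swapping the handler redirects the output to a file or elsewhere.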
Also I am asked to create constant field for Staff Id and first name
You got the idea here. It means creating constants for "firstname", "staff", "lastname" and any other string. Then, where you need to use that string you use the constant:
Ex:
private static final String STAFF = "staff";
private static final String FIRST_NAME = "firstname";
The advantage of doing so is that you can see all your constants in a single place and easily modify them when you need to. Imagine using "firstname" in 5 places. Then you realize you meant to use "first_name" instead. If you don't use constants you have to change it in 5 places; otherwise, in just 1 place.
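As a rough sketch of how such constants could be used in the parsing code, consider the following. The class name StaffReader, the method readStaff and the sample XML are made up for illustration, and the code assumes every staff element has a firstname child.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StaffReader {
    // Element and attribute names live in one place instead of being repeated inline.
    private static final String STAFF = "staff";
    private static final String ID = "id";
    private static final String FIRST_NAME = "firstname";

    // Returns one "id: firstname" line per staff element.
    public static List<String> readStaff(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> lines = new ArrayList<>();
        NodeList staffNodes = doc.getElementsByTagName(STAFF);
        for (int i = 0; i < staffNodes.getLength(); i++) {
            Element staff = (Element) staffNodes.item(i);
            // Assumption: each staff element contains at least one firstname element.
            String firstName = staff.getElementsByTagName(FIRST_NAME).item(0).getTextContent();
            lines.add(staff.getAttribute(ID) + ": " + firstName);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        String xml = "<company><staff id=\"1001\"><firstname>yong</firstname></staff></company>";
        readStaff(xml).forEach(System.out::println);
    }
}
```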
"you have been asked", When your homework is not clear you should ask your TA first.
Besides, "/Users/mkyong/staff.xml" looks very suspicious to me, since mkyong is a very well-known developer. You should not copy and paste examples from the internet without understanding them.
That said, yes you should declare static final fields to be constants, such as
private static final String STAFF_ID = "Staff id";
private static final String FIRST_NAME = "First Name";
and then replace those values in your code.
As for logging and sysout, sysout just means adding
System.out.println("something you would like to print");
Logging does the same thing, but through a framework that adds more information automatically, such as the time the line was printed and which class triggered it. It also gives you flexibility over when messages are printed.
See this first
http://en.wikipedia.org/wiki/Java_logging_framework
I am relatively new to Java and I have been trying to figure out how to reach the following tags for output for a couple of long, LONG days now. I would really appreciate some insight into the problem. It seems like everything I could find and/or try just does not pan out right. (Excuse the cheesy news articles.)
<item>
<pubDate>Sat, 21 Sep 2013 02:30:23 EDT</pubDate>
<title>
<![CDATA[
Carmen Bryan Lashes Out at Beyonce Fans for Throwing Shade (#carmenbryan)
]]>
</title>
<link>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</link>
<guid>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</guid>
<description>
<![CDATA[
<img ... /><br />.
<p>In response to someone who reminded Bryan that Jay Z has Beyonce now, she tweeted.</p>
<p>Check out what else Bryan had to say above.</p>
<p>Source: </p>
]]>
</description>
</item>
I have managed to parse the XML and print out the content in both the title and description element tags, however the output for the description element tag also includes all its child element tags. I would like to use this project in future to build on my Java portfolio, please help!
My code so far:
public class NewXmlReader
{
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
try {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document docXml = builder.parse(NewXMLReaderHandlers.inputHandler());
docXml.getDocumentElement().normalize();
NewXMLReaderHandlers.handleItemTags(docXml, "item");
} catch (ParserConfigurationException | SAXException parserConfigurationException) {
System.out.println("You Are Not XML formated !!");
parserConfigurationException.printStackTrace();
} catch (IOException iOException) {
System.out.println("URL NOT FOUND");
iOException.getCause();
}
}
}
public class NewXMLReaderHandlers {
private static int ARTICLELENGTH;
public static String inputHandler() throws IOException {
InputStreamReader inputStream = new InputStreamReader(System.in);
BufferedReader bufferRead = new BufferedReader(inputStream);
System.out.println("Please Enter A Proper URL: ");
String urlPageString = bufferRead.readLine();
return urlPageString;
}
public static void handleItemTags( Document document, String rssFeedParentTopicTag){
NodeList listOfArticles = document.getElementsByTagName(rssFeedParentTopicTag);
NewXMLReaderHandlers.ARTICLELENGTH = listOfArticles.getLength();
String rootElement = document.getDocumentElement().getNodeName();
if ("rss".equals(rootElement)) { // compare string contents, not references
System.out.println("We Have An RSS Feed To Parse");
for (int i = 0; i < NewXMLReaderHandlers.ARTICLELENGTH; i++) {
Node itemNode = (Node) listOfArticles.item(i);
if (itemNode.getNodeType() == Node.ELEMENT_NODE) {
Element itemElement= (Element) itemNode;
tagContent (itemElement, "title");
tagContent (itemElement, "description");
}
}
}
}
public static void tagContent (Element item, String tagName) {
NodeList tagNodeList = item.getElementsByTagName(tagName);
Element tagElement = (Element)tagNodeList.item(0);
NodeList tagTElist = tagElement.getChildNodes();
Node tagNode = tagTElist.item(0);
// System.out.println( " - " + tagName + " : " + tagNode.getNodeValue() + "\n");
if ("description".equals(tagName)) { // compare string contents, not references
System.out.println( " - " + tagName + " : " + tagNode.getNodeValue() + "\n\n");
System.out.println(" Do We Have Any Siblings? " + tagNode.getNextSibling().getNodeValue() + "\n");
}
}
}
For my money, the easiest solution would be to use the XPath API.
Essentially, it's a query language for XML. See XPath Tutorial for a primer.
This example uses the RSS feed from SO, which uses <entry...> instead of <item>, but I've used the same technique for other RSS (and XML) files and even very complex HTML documents...
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class TestRSSFeed {
public static void main(String[] args) {
try {
// Read the feed...
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document doc = factory.newDocumentBuilder().parse("http://stackoverflow.com/feeds/tag?tagnames=java&sort=newest");
Element root = doc.getDocumentElement();
// Create a xPath instance
XPath xPath = XPathFactory.newInstance().newXPath();
// Find all the nodes that are named <entry...> any where in
// the document that live under the parent node...
XPathExpression expression = xPath.compile("//entry");
NodeList nl = (NodeList) expression.evaluate(root, XPathConstants.NODESET);
System.out.println("Found " + nl.getLength() + " items...");
for (int index = 0; index < nl.getLength(); index++) {
Node node = nl.item(index);
// This is a sub node search.
// The search is based on the parent node and looks for a single
// node titled "title" that belongs to the parent node...
// I did this because I'm only expecting a single node...
expression = xPath.compile("title");
Node child = (Node) expression.evaluate(node, XPathConstants.NODE);
System.out.println(child.getTextContent());
}
} catch (IOException | ParserConfigurationException | SAXException exp) {
exp.printStackTrace();
} catch (XPathExpressionException ex) {
ex.printStackTrace();
}
}
}
Now, you can do some pretty complex queries, but I thought I'd start with a basic example ;)
Just in case anyone is still left wondering how I managed to solve the CDATA puzzle:
The logic is as follows:
Once you get the program to extract all the xml to display the correct node tree as the rss feed displays, if any xml data is wrapped in CDATA tags, the only way to access that information is by creating new xml based on the text content in the CDATA tag. Once you parse the new document, you should be able to access all the data you need.
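A minimal sketch of that idea follows. It reads the text of the first description element, wraps it in a made-up wrapper element so that the fragment has a single root, and parses it as a second document. This only works when the CDATA content happens to be well-formed XML; real-world HTML often is not, in which case an HTML parser would be needed instead. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CdataReparse {
    // Parses an XML string into a DOM document.
    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the text of every <p> found inside the CDATA of the first <description>.
    public static List<String> paragraphsInDescription(String feedXml) {
        try {
            Document feed = parse(feedXml);
            // getTextContent() returns the CDATA content as plain, unparsed text.
            String cdataText = feed.getElementsByTagName("description").item(0).getTextContent();
            // The fragment may have several top-level elements, so give it one root.
            Document inner = parse("<wrapper>" + cdataText + "</wrapper>");
            NodeList paragraphs = inner.getElementsByTagName("p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                result.add(paragraphs.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed or CDATA content", e);
        }
    }

    public static void main(String[] args) {
        // Made-up feed item, shaped like the one in the question.
        String feed = "<item><description><![CDATA[<img src=\"x.png\" /><br />"
                + "<p>First paragraph</p><p>Second paragraph</p>]]></description></item>";
        paragraphsInDescription(feed).forEach(System.out::println);
    }
}
```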