Parsing XML Textlist

I'm trying to parse an XML file. I can parse a normal text node, but how do I parse a textlist? All I get is the first child of the textlist. If I call
elem.nextSibling();
it always returns null, which can't be right: I know there are two other values left.
Could someone provide an example?
Thanks!
XML example
<viewentry position="1" unid="7125D090682C3C3EC1257671002F66F4" noteid="962" siblings="65">
<entrydata columnnumber="0" name="Categories">
<textlist>
<text>Lore1</text>
<text>Lore2</text>
</textlist>
</entrydata>
<entrydata columnnumber="1" name="CuttedSubjects">
<text>
LoreImpsum....
</text>
</entrydata>
<entrydata columnnumber="2" name="$35">
<datetime>20091117T094224,57+01</datetime>
</entrydata>
</viewentry>

I assume you're using a DOM parser.
The first child of the <textlist> node is not the first <text> node but rather the raw text that contains the whitespace and carriage return between the <textlist> start tag and the first <text> element. The output of the following snippet (using org.w3c.dom.* and javax.xml.parsers.*)
Node grandpa = document.getElementsByTagName("textlist").item(0);
Node daddy = grandpa.getFirstChild();
while (daddy != null) {
    System.out.println(">>> " + daddy.getNodeName());
    Node child = daddy.getFirstChild();
    if (child != null)
        System.out.println(">>>>>>>> " + child.getTextContent());
    daddy = daddy.getNextSibling();
}
shows that <textlist> has five children: the two <text> elements and the three raw text pieces before, between and after them.
>>> #text
>>> text
>>>>>>>> Lore1
>>> #text
>>> text
>>>>>>>> Lore2
>>> #text
When parsing XML this way, it's easy to overlook that the structure of the DOM-tree can be complicated. You can quickly end up iterating over a NodeList in the wrong generation, and then you get nulls where you would expect siblings. This is one of the reasons why people came up with all kinds of xml-to-java stuff, from homegrown XMLHelper classes to XPath expressions to Digester to JAXB, so you need to go down to the DOM level only when you absolutely have to.
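To make the point concrete, here is a minimal sketch of reading only the <text> elements under <textlist> while skipping the whitespace #text nodes. The class and method names (TextListReader, readTextList) are hypothetical, and the sketch assumes the default JDK DOM parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
final class TextListReader {

    /** Returns the text of every <text> element directly under the first <textlist>. */
    static List<String> readTextList(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> values = new ArrayList<>();
            Node textlist = doc.getElementsByTagName("textlist").item(0);
            if (textlist == null) {
                return values;
            }
            for (Node n = textlist.getFirstChild(); n != null; n = n.getNextSibling()) {
                // Skip the whitespace-only #text nodes between the elements.
                if (n.getNodeType() == Node.ELEMENT_NODE && "text".equals(n.getNodeName())) {
                    values.add(n.getTextContent().trim());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<entrydata><textlist>\n<text>Lore1</text>\n<text>Lore2</text>\n</textlist></entrydata>";
        System.out.println(readTextList(xml)); // prints [Lore1, Lore2]
    }
}
```

Checking getNodeType() before reading a sibling is what keeps the loop from stopping at, or misreading, the whitespace nodes.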

Related

Algorithm for identifying differences in XML documents

I'm trying to create a program in Java that takes two XML files (one is an updated version of the other) and takes them into main memory. It will then compare the files and count the number of differences between each corresponding node from the two (excluding white space). Later on the program will do more with the differences but I'm just confused on how to start comparing nodes from two separate files. Any suggestions would be much appreciated.

My first suggestion is that you could use XMLUnit:
Reader expected=new FileReader(...);
Reader tested=new FileReader(...);
Diff diff=XMLUnit.compareXML(expected, tested);

For an algorithm that computes signatures (hashes) at each node to facilitate comparison, see Detecting Changes in XML Documents.
For change detection on XML documents where element ordering is insignificant, see X-Diff: An Effective Change Detection Algorithm for XML Documents. Java and C++ implementations of the X-Diff algorithm are available.

It depends on whether the differences are between nodes or inside nodes.
This code extracts all nodes, their paths, and the values inside them.
Assuming you have two XML Documents (the snippet processes one of them, called document):
XPath xPath = XPathFactory.newInstance().newXPath();
// Select every element in the document
String expression = "//*";
NodeList nodes = (NodeList) xPath.compile(expression).evaluate(document, XPathConstants.NODESET);
// Iterate over all of them
for (int i = 0; i < nodes.getLength(); i++) {
    Node the_node = nodes.item(i);
    if (the_node instanceof Element) {
        Element the_element = (Element) the_node;
        // Build the path by walking up to the document node
        String path = "";
        Node noderec = the_node;
        while (noderec != null) {
            if (path.equals("")) path = noderec.getNodeName();
            else path = noderec.getNodeName() + '/' + path;
            noderec = noderec.getParentNode();
            if (noderec == document) { path = "//" + path; noderec = null; }
        }
        System.out.println("PATH:" + path);
        System.out.println("CONTENT=" + the_element.getTextContent());
    }
}
PATH: the path of the node
CONTENT: the text content of the node, including its descendants
With that, you get all the paths of your XML. You can compare them one by one, sort them, and use other algorithms to find out whether something was inserted, and so on.
And inside each node, you can make further comparisons.
Hope it helps
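One way the path idea could be turned into an actual comparison is sketched below. It is an assumption-laden illustration, not the answerer's code: the names PathDiff, leafPaths and diff are hypothetical, only leaf elements are compared, and a sibling index such as [1] is added to each path step so that repeated element names do not overwrite each other.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: compares two XML strings by leaf-element path.
final class PathDiff {

    /** Maps each leaf element's indexed path to its trimmed text. */
    static Map<String, String> leafPaths(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            Map<String, String> out = new TreeMap<>();
            walk(root, "/" + root.getNodeName(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void walk(Element e, String path, Map<String, String> out) {
        boolean hasElementChild = false;
        Map<String, Integer> seen = new HashMap<>(); // per-name sibling counters
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                int idx = seen.merge(c.getNodeName(), 1, Integer::sum);
                walk((Element) c, path + "/" + c.getNodeName() + "[" + idx + "]", out);
            }
        }
        if (!hasElementChild) {
            out.put(path, e.getTextContent().trim());
        }
    }

    /** Lists paths that were added, removed, or whose text changed. */
    static List<String> diff(String oldXml, String newXml) {
        Map<String, String> a = leafPaths(oldXml);
        Map<String, String> b = leafPaths(newXml);
        Set<String> all = new TreeSet<>(a.keySet());
        all.addAll(b.keySet());
        List<String> changes = new ArrayList<>();
        for (String p : all) {
            if (!b.containsKey(p)) changes.add("removed " + p);
            else if (!a.containsKey(p)) changes.add("added " + p);
            else if (!a.get(p).equals(b.get(p))) changes.add("changed " + p);
        }
        return changes;
    }
}
```

Because whitespace-only text nodes never become map entries, two documents that differ only in indentation produce an empty diff. Attributes and mixed content are ignored in this sketch.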

DOM Parser wrong childNodes Count

This is strange, but let me try my best to put it across.
I have an XML file which I am reading in the normal way from the desktop and parsing with a DOM parser.
<?xml version="1.0" encoding="UTF-8"?>
<Abase
xmlns="www.abc.com/Events/Abase.xsd">
<FVer>0</FVer>
<DV>abc App</DV>
<DP>abc Wallet</DP>
<Dversion>11</Dversion>
<sigID>Ss22</sigID>
<activity>Adding New cake</activity>
</Abase>
Reading the XML to get the children:
Document doc = docBuilder.parse("C://Users//Desktop//abc.xml");
Node root = doc.getElementsByTagName("Abase").item(0);
NodeList listOfNodes = root.getChildNodes(); //Sysout Prints 13
So here my logic works well. When I push the same XML to a queue, read it back, and get the child nodes, the number of child nodes is 6.
Document doc=docBuilder.parse(new InputSource(new ByteArrayInputStream(msg.getBytes("UTF-8"))));
Node root = doc.getElementsByTagName("Abase").item(0);
NodeList listOfNodes = root.getChildNodes(); //Sysout Prints 6
This breaks my logic for parsing the XML. Can anyone help me out?
UPDATE
Adding sending logic :
javax.jms.TextMessage tmsg = session.createTextMessage();
tmsg.setText(inp);
sender.send(tmsg);
PROBLEM
If I read this XML from the desktop, the root has 13 children: 6 element nodes and 7 text nodes. The common logic is:
Read all the children and iterate through the list of child items.
If a node is not a text node, enter the if block, create one parent element with two children, and append it to the existing root. Then get the node name and the text content of the element node and set them as the text content of the two children respectively.
So I now have a fresh element node with two children. As I no longer need the original element nodes, which are still children of the root, I finally remove them.
This logic breaks completely when I push the XML to a queue and read it back to apply the same logic.
The output XML comes out fine when I read from the desktop, but reading from the queue is a problem because it breaks the whole tree.
<Abase
xmlns="www.abc.com/Events/Abase.xsd">
<Prop>
<propName>FVer</propName>
<propName>0</propName> //similarly for other nodes
</Prop>
</Abase>
Thanks

Well, there are 13 children if whitespace text nodes are included, but only 6 if whitespace text nodes are dropped. So there's some difference in the way the tree has been built between the two cases, that affects whether whitespace text nodes are retained or not.
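A minimal sketch of how the processing logic could be made independent of whether whitespace text nodes are present: filter the children by node type and work with the element list only. The names ChildCounter, parseRoot and elementChildren are hypothetical, and the sketch assumes the default JDK DOM parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: counts element children regardless of whitespace nodes.
final class ChildCounter {

    /** Parses the XML string and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns only the element children, ignoring text and comment nodes. */
    static List<Element> elementChildren(Node parent) {
        List<Element> result = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Element pretty = parseRoot("<Abase>\n<FVer>0</FVer>\n<DV>abc App</DV>\n</Abase>");
        Element compact = parseRoot("<Abase><FVer>0</FVer><DV>abc App</DV></Abase>");
        // getChildNodes() differs between the two, elementChildren() does not.
        System.out.println(pretty.getChildNodes().getLength() + " vs " + compact.getChildNodes().getLength());
        System.out.println(elementChildren(pretty).size() + " vs " + elementChildren(compact).size());
    }
}
```

Copying the elements into a separate list first also makes it safe to add or remove children of the root while iterating, which a live NodeList does not tolerate well.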

The document under "Output XML" means that there is something wrong on the sender side. My guess would be that inp isn't a String but some kind of object, and setText(inp) doesn't call inp.toString() but instead triggers some kind of serialization code which produces the odd XML that you're seeing.

Extracting an XML node (not text but complete XML) along with other text nodes from an XML file using a SAX parser in Java

I have to read from large XML files, each around 500 MB. The batch typically processes 500 such files in each run. I have to extract text nodes from them and, at the same time, extract XML nodes. I used XPath with DOM in Java for ease of use, but that doesn't work due to memory issues, as I have limited resources.
I intend to use SAX or StAX in Java now. The text nodes can be easily extracted, but I don't know how to extract XML nodes using SAX.
A sample:
<?xml version="1.0"?>
<Library>
<Book name = "ABC">
<Author>John</Author>
<PrintingCompanyDT><Printer>Sam</Printer><Printmachine>Laser</Printmachine>
<AssocPrint>Oreilly</AssocPrint> </PrintingCompanyDT>
</Book>
<Book name = "123">
<Author>Mason</Author>
<PrintingCompanyDT><Printer>kelly</Printer><Printmachine>DOTPrint</Printmachine>
<AssocPrint>Oxford</AssocPrint> </PrintingCompanyDT>
</Book>
</Library>
The expected result:
1)Book: ABC:
Author:John
PrintCompany Detail XML:
<PrintingCompanyDT>
<Printer>Sam</Printer>
<Printmachine>Laser</Printmachine>
<AssocPrint>Oreilly</AssocPrint>
</PrintingCompanyDT>
2) Book: 123
Author : Mason
PrintCompany Detail XML:
<PrintingCompanyDT>
<Printer>kelly</Printer>
<Printmachine>DOTPrint</Printmachine>
<AssocPrint>Oxford</AssocPrint>
</PrintingCompanyDT>
If I try the regular way of appending characters in the public void characters(char ch[], int start, int length) method,
I get the following:
1)Book: ABC:
Author:John
PrintCompany Detail XML :
Sam
Laser
Oreilly
that is, exactly the text content and the spaces.
Can somebody suggest how to extract an XML node as it is from an XML file through a SAX or StAX parser in Java?

I'd be tempted to use XOM for this sort of task rather than SAX or StAX directly. XOM is a tree-based representation similar to DOM or JDOM but it has support for processing XML "twigs" in a kind of semi-streaming fashion, ideal for your kind of case where you have many similar elements that can be processed independently of one another. Also every Node has a toXML method that prints the node as XML.
import java.io.File;
import nu.xom.*;
public class LibraryProcessor extends NodeFactory {
    private Nodes empty = new Nodes();
    private int bookNum = 0;
    /** Called for each closing tag in the XML */
    public Nodes finishMakingElement(Element element) {
        if ("Book".equals(element.getLocalName())) {
            bookNum++;
            // process the complete Book element ...
            processBook(element);
            // ... and throw it away
            return empty;
        } else {
            // process other elements (except Book) in the normal way
            return super.finishMakingElement(element);
        }
    }
    private void processBook(Element book) {
        System.out.println(bookNum + ": " +
            book.getAttributeValue("name"));
        System.out.println("Author: " +
            book.getFirstChildElement("Author").getValue());
        System.out.println("PrintCompany Detail XML: " +
            book.getFirstChildElement("PrintingCompanyDT").toXML());
    }
    public static void main(String[] args) throws Exception {
        Builder builder = new Builder(new LibraryProcessor());
        builder.build(new File(args[0]));
    }
}
This will work its way through the XML document, calling processBook once for each Book element in turn. Within processBook you have access to the whole Book XML tree as XOM nodes, but without having to load the entire file into memory in one go - the best of both worlds. The "Factories, Filters, Subclassing, and Streaming" section of the XOM tutorial has more detail on this technique.
This example just shows the most basic bits of the XOM API, but it also provides powerful XPath support if you need to do more complex processing. For example, you can directly access the Printmachine element within processBook using
Element machine = (Element)book.query("PrintingCompanyDT/Printmachine").get(0);
or if the structure is not so regular, for example if PrintingCompanyDT is sometimes a direct child of Book and sometimes deeper (e.g. a grandchild) then you can use a query like
Element printingCompanyDT = (Element)book.query(".//PrintingCompanyDT").get(0);
(// being the XPath notation for finding descendants at any level, as opposed to / which looks only for direct children).
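If an extra library is not an option, the same idea can be sketched with the StAX event API that ships with the JDK: when the start tag of the wanted element appears, copy every event up to its matching end tag into an XMLEventWriter. This is only a sketch under assumptions; the names SubtreeExtractor and extract are hypothetical, namespaces are ignored, and the exact serialization (for example how empty elements are written) depends on the StAX implementation.

```java
import java.io.Reader;
import java.io.StringWriter;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical sketch: streams through the input and hands each matching
// element, serialized as an XML string, to the callback.
final class SubtreeExtractor {

    static void extract(Reader in, String elementName, Consumer<String> sink) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && event.asStartElement().getName().getLocalPart().equals(elementName)) {
                    StringWriter buffer = new StringWriter();
                    XMLEventWriter writer = outFactory.createXMLEventWriter(buffer);
                    writer.add(event);
                    int depth = 1; // nesting level inside the wanted element
                    while (depth > 0) {
                        XMLEvent inner = reader.nextEvent();
                        if (inner.isStartElement()) depth++;
                        else if (inner.isEndElement()) depth--;
                        writer.add(inner);
                    }
                    writer.close();
                    sink.accept(buffer.toString());
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not read XML", e);
        }
    }
}
```

A call such as SubtreeExtractor.extract(new FileReader(path), "PrintingCompanyDT", System.out::println) would print one fragment per book. Only the current fragment is buffered, so memory use is bounded by the largest single element rather than by the file size.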

Parsing XML with apostrophe

Taking the BBC News RSS feed for example, one of their news items is as follows:
<item><title>Pupils 'bullied on sports field'</title><description>bla bla..
I have some java code parsing this - however, when a title contains an apostrophe (as above), the parsing stops, so I end up with the following title: Pupils ' and then it continues on and parses the description (which is fine). How do I get it to parse the full title? The following is a segment of code from inside my for loop where I parse the info:
NodeList title = element.getElementsByTagName("title");
Element line = (Element) title.item(0);
tmp.setTitle(getCharacterDataFromElement(line).toString());
The exact same code is used to parse the other elements like description and pubDate etc, which are all fine.
This is the getCharacterDataFromElement method:
public static String getCharacterDataFromElement(Element e) {
    Node child = ((Node) e).getFirstChild();
    if (child instanceof CharacterData) {
        CharacterData cd = (CharacterData) child;
        return cd.getData();
    }
    return "";
}
What am I doing wrong? I use the DocumentBuilder, DocumentBuilderFactory and org.w3c.dom to work with the RSS Feed.

Your getCharacterDataFromElement only looks at the first child. Check whether there are further child nodes too, and concatenate all the text.
HTH - DF

As davidfrancis suggested, you should iterate over all children in getCharacterDataFromElement().
Alternatively, if you can use DOM level 3, you can use the Node.getTextContent() method instead which does what you want.
NodeList title = element.getElementsByTagName("title");
Element line = (Element)title.item(0);
tmp.setTitle(line.getTextContent());
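For parsers without DOM Level 3 support, the "iterate over all children" suggestion could look roughly like the sketch below. The class name TitleText and the helper titleOf are hypothetical; the test input uses a CDATA section only as an easy way to make the parser produce several child nodes, and is not taken from the BBC feed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: concatenates every text and CDATA child of an element.
final class TitleText {

    static String getCharacterDataFromElement(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            short type = c.getNodeType();
            // Comments are CharacterData too, so check the node type explicitly.
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(((CharacterData) c).getData());
            }
        }
        return sb.toString();
    }

    /** Parses the XML string and returns the text of its first <title> element. */
    static String titleOf(String xml) {
        try {
            Element title = (Element) DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("title").item(0);
            return getCharacterDataFromElement(title);
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleOf("<item><title>Pupils 'bullied on sports field'</title></item>"));
    }
}
```

Unlike getTextContent(), this version reads direct children only, so text nested inside child elements is not included.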

Well, AFAIK, the apostrophe only has to be encoded as &apos; inside an attribute value that is itself delimited by apostrophes.
In element content, as in this <title>, a literal apostrophe is legal, so the BBC News RSS feed is well-formed XML here.
The problem is therefore in the parsing code rather than in the feed, and there is nothing to report to the feed provider.

Java DOM XML Parsing How to walk through multiple node levels

I have the following xml structure
<clinic>
<category>
<employees>
<medic>
<medic_details>
<medic_name />
<medic_address />
</medic_details>
<pacients>
<pacient>
<pacient_details>
<pacient_name> ...
<pacient_address> ...
</pacient_details>
<diagnostic>
<disease>
<disease_name>Disease</disease_name>
<treatment>Treatment</treatment>
</disease>
<disease>
<disease_name>Disease</disease_name>
<treatment>Treatment</treatment>
</disease>
</diagnostic>
</pacient>
</pacients>
</medic>
</employees>
</category>
</clinic>
I have a JTextArea where I want to show information from the XML file. For example, to show each medic with its name, address, and the pacients it treats with their respective names, I have the following code:
NodeList medicNList = doc.getElementsByTagName("medic");
for (int temp = 0; temp < medicNList.getLength(); temp++) {
    Node medicNode = medicNList.item(temp);
    Element eElement = (Element) medicNode;
    area.append("\n");
    area.append("Medic Name : " + getTagValue("medic_name", eElement) + "\n");
    area.append("Medic Address : " + getTagValue("medic_address", eElement) + "\n");
    area.append("\n");
    area.append("Pacients : \n");
    area.append("Pacient Name : " + getTagValue("pacient_name", eElement) + "\n");
    area.append("Pacient Address : " + getTagValue("pacient_address", eElement) + "\n");
}
My question is: if I want to have more than one disease per pacient, how do I display all of the diseases for each pacient? I don't know how to "walk" to the diagnostic node for each pacient and show the relevant data inside it.

Your code looks incorrect as it is. You can have multiple pacient (patient) elements per medic, so you should be iterating over the list of patients for each medic.
Then iterate over the diseases for each patient. You need to call the getElementsByTagName method at each level of nesting in the XML. You also need to skip over the pluralised wrapper elements such as <pacients>.
I would suggest you use an XPath library instead as it can make the code a lot easier to read. There are plenty of good ones out there. I would recommend jaxen
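As a rough illustration of the nested iteration with plain DOM (no XPath library), the sketch below calls getElementsByTagName on each medic, then on each pacient, then on each disease. The class name ClinicReport and its methods are hypothetical; it returns a String that could be appended to the JTextArea.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: medic -> pacient -> disease, one loop per level.
final class ClinicReport {

    static String describe(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        StringBuilder out = new StringBuilder();
        NodeList medics = doc.getElementsByTagName("medic");
        for (int m = 0; m < medics.getLength(); m++) {
            Element medic = (Element) medics.item(m);
            out.append("Medic Name : ").append(text(medic, "medic_name")).append('\n');
            // Search below this medic only; "pacient" does not match <pacients>.
            NodeList pacients = medic.getElementsByTagName("pacient");
            for (int p = 0; p < pacients.getLength(); p++) {
                Element pacient = (Element) pacients.item(p);
                out.append("  Pacient Name : ").append(text(pacient, "pacient_name")).append('\n');
                NodeList diseases = pacient.getElementsByTagName("disease");
                for (int d = 0; d < diseases.getLength(); d++) {
                    Element disease = (Element) diseases.item(d);
                    out.append("    Disease : ").append(text(disease, "disease_name"))
                       .append(", Treatment : ").append(text(disease, "treatment")).append('\n');
                }
            }
        }
        return out.toString();
    }

    /** Text of the first descendant with the given tag name, or "" if absent. */
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }
}
```

Calling getElementsByTagName on the current element rather than on the document is what scopes each lookup to that medic or pacient.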

I would give htmlcleaner a try.
HtmlCleaner is a Java library used to safely parse and transform any HTML found on the web to well-formed XML. It is designed to be small, fast, flexible and independent. HtmlCleaner may be used in Java code, as a command-line tool or as an Ant task. The result of parsing is a lightweight document object model which can easily be transformed to standards like DOM or JDOM, or serialized to XML output in various ways (compact, pretty printed and so on).
You can use XPath with HtmlCleaner to get the contents within XML tags. Here is a nice example: XPath Example
