Normalization in DOM parsing with java - how does it work? - java

I saw the line below in the code for a DOM parser in this tutorial.
doc.getDocumentElement().normalize();
Why do we do this normalization?
I read the docs, but I could not understand a word:
Puts all Text nodes in the full depth of the sub-tree underneath this Node
Okay, then can someone show me (preferably with a picture) what this tree looks like?
Can anyone explain to me why normalization is needed?
What happens if we don't normalize?

The rest of the sentence is:
where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes.
This basically means that the following XML element
<foo>hello
wor
ld</foo>
could be represented like this in a denormalized node:
Element foo
    Text node: ""
    Text node: "hello "
    Text node: "wor"
    Text node: "ld"
When normalized, the node will look like this:
Element foo
    Text node: "hello world"
And the same goes for attributes (<foo bar="Hello world"/>), comments, etc.
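To see the effect on a live tree, here is a minimal sketch using only the JDK's built-in DOM implementation. The class and method names (NormalizeDemo, buildDenormalizedFoo, countTextChildren) are made up for illustration; the text nodes are created by hand, since a parser is free to deliver the text in one piece or in several.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NormalizeDemo {

    // Builds <foo> with four adjacent text children, one of them empty,
    // mirroring the denormalized tree from the answer's example.
    static Element buildDenormalizedFoo() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element foo = doc.createElement("foo");
        doc.appendChild(foo);
        for (String part : new String[] {"", "hello ", "wor", "ld"}) {
            foo.appendChild(doc.createTextNode(part));
        }
        return foo;
    }

    // Counts the direct children of type TEXT_NODE.
    static int countTextChildren(Node parent) {
        int count = 0;
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Element foo = buildDenormalizedFoo();
        System.out.println("before: " + countTextChildren(foo) + " text nodes");
        foo.normalize(); // merges adjacent text nodes and drops empty ones
        System.out.println("after: " + countTextChildren(foo) + " text node, value \""
                + foo.getFirstChild().getNodeValue() + "\"");
    }
}
```

Without the normalize() call, code that reads only getFirstChild().getNodeValue() would see the empty string here instead of the full text.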

In simple terms, normalisation is the reduction of redundancies.
Examples of redundancies:
a) white space outside of the root/document tags (...<document></document>...)
b) white space within a start tag (<...>) and an end tag (</...>)
c) white space between attributes and their values (i.e. spaces between the key name and =")
d) superfluous namespace declarations
e) line breaks/white space in the text of attributes and tags
f) comments, etc.

As an extension to @JBNizet's answer, for more technical users: here is what the implementation of the org.w3c.dom.Node interface in com.sun.org.apache.xerces.internal.dom.ParentNode looks like. It gives you an idea of how it actually works.
public void normalize() {
    // No need to normalize if already normalized.
    if (isNormalized()) {
        return;
    }
    if (needsSyncChildren()) {
        synchronizeChildren();
    }
    ChildNode kid;
    for (kid = firstChild; kid != null; kid = kid.nextSibling) {
        kid.normalize();
    }
    isNormalized(true);
}
It traverses all the child nodes recursively and calls kid.normalize() on each of them.
This mechanism is overridden in org.apache.xerces.dom.ElementImpl:
public void normalize() {
    // No need to normalize if already normalized.
    if (isNormalized()) {
        return;
    }
    if (needsSyncChildren()) {
        synchronizeChildren();
    }
    ChildNode kid, next;
    for (kid = firstChild; kid != null; kid = next) {
        next = kid.nextSibling;
        // If kid is a text node, we need to check for one of two
        // conditions:
        // 1) There is an adjacent text node
        // 2) There is no adjacent text node, but kid is
        //    an empty text node.
        if ( kid.getNodeType() == Node.TEXT_NODE )
        {
            // If an adjacent text node, merge it with kid
            if ( next!=null && next.getNodeType() == Node.TEXT_NODE )
            {
                ((Text)kid).appendData(next.getNodeValue());
                removeChild( next );
                next = kid; // Don't advance; there might be another.
            }
            else
            {
                // If kid is empty, remove it
                if ( kid.getNodeValue() == null || kid.getNodeValue().length() == 0 ) {
                    removeChild( kid );
                }
            }
        }
        // Otherwise it might be an Element, which is handled recursively
        else if (kid.getNodeType() == Node.ELEMENT_NODE) {
            kid.normalize();
        }
    }
    // We must also normalize all of the attributes
    if ( attributes!=null )
    {
        for( int i=0; i<attributes.getLength(); ++i )
        {
            Node attr = attributes.item(i);
            attr.normalize();
        }
    }
    // changed() will have occurred when the removeChild() was done,
    // so does not have to be reissued.
    isNormalized(true);
}
Hope this saves you some time.

Related

Parse HTML using jsoup - Need specific pattern

I am trying to get the text between tags and save it into variables, for example:
Here I want to save the value "return", which is between the em tags. I also need the rest of the text, which is in the p tag:
the em tag value should be "return", and
the p tag value should be only --> an item, cancel an order, print a receipt, track your purchases or reorder items.
If some text comes before the em tag, that text should go into a separate variable too. Basically, if one p has multiple tags within it, it should be split and saved into different variables. If I knew how to get the text that is not inside inner tags, I could retrieve the rest.
What I have written is below; it returns just "return", which is inside the em tags.
Here ep is basically doc.select("p"), selecting the p tag and then iterating. I am not sure if I am doing this the right way; any other approaches are highly appreciated.
String text = "<p><em>return </em>an item, cancel an order, print a receipt, track your purchases or reorder items.</p>";
Elements italic_tags = ep.select("em");
for (Element em : italic_tags) {
    if (em.tagName().equals("em")) {
        System.out.println(em.select("em").text());
    }
}
If you need to select each sub text and text enclosed by different tags you need to try selecting Node instead of Element. I modified your HTML to include more tags so the example is more complete:
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

String text = "<p><em>return </em>an item, <em>cancel</em> an order, <em>print</em> a receipt, <em>track</em> your purchases or reorder items.</p>";
Document doc = Jsoup.parse(text);
Element ep = doc.selectFirst("p");
List<Node> childNodes = ep.childNodes();
for (Node node : childNodes) {
    if (node instanceof TextNode) {
        // if it's a text, just display it
        System.out.println(node);
    } else {
        // if it's another element, then display its first
        // child which in this case is a text
        System.out.println(node.childNode(0));
    }
}
output:
return
an item,
cancel
an order,
print
a receipt,
track
your purchases or reorder items.

Find occurrences of a specific tag in an XML file in java using recursion

I need to return the number of occurrences of a given tag. For example, a user will provide a link to an XML file and the name of the tag to find, and the program will return the number of occurrences of that specific tag. My code so far only works for the direct children of the parent node, whereas I need to check all the children of the child nodes as well, and I don't quite understand how to iterate through all of the elements of the XML file.
Modify your code to make use of recursion properly. You need to ALWAYS recurse, not only if a tag has the name you are looking for, because the children still might have the name you are looking for. Also, you need to add the result of the recursive call to the sum. Something like this:
private static int tagCount(XMLTree xml, String tag) {
    assert xml != null : "Violation of: xml is not null";
    assert tag != null : "Violation of: tag is not null";
    int count = 0;
    if (xml.isTag()) {
        for (int i = 0; i < xml.numberOfChildren(); i++) {
            if (xml.child(i).label().equals(tag)) {
                count++;
            }
            count = count + tagCount(xml.child(i), tag);
        }
    }
    return count;
}
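For readers without the XMLTree class, the same "always recurse and add up the results" idea can be sketched with the standard org.w3c.dom API instead. This is a swapped-in illustration, not the asker's library; the class name TagCounter and the helper parse are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class TagCounter {

    // Counts elements named `tag` among the descendants of `node`.
    // The node itself is not counted, matching the answer's version.
    static int tagCount(Node node, String tag) {
        int count = 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                if (child.getNodeName().equals(tag)) {
                    count++;
                }
                // Always recurse: matches can sit at any depth,
                // whether or not this child matched.
                count += tagCount(child, tag);
            }
        }
        return count;
    }

    // Hypothetical helper: parses a string and returns the root element.
    static Element parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = parse("<a><b/><c><b><b/></b></c></a>");
        System.out.println(tagCount(root, "b")); // prints 3
    }
}
```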

Parsing XML dropping all characters after &

I am creating an app that parses some XML and displays it in a ListView. A few items in my XML contain &s, so I have escaped them like this: &amp;. It is working correctly on a few devices and on the emulator.
But on two devices (Samsung Sidekick 4G, Android 2.2, and Samsung Replenish, Android 2.3.6) it is failing. Everything after the & magically disappears.
Here is the item in the XML giving me trouble:
<site>
    <name>English Language &amp; Usage</name>
    <link>http://english.stackexchange.com/</link>
    <about>English Language &amp; Usage Stack Exchange is a question and answer site for linguists, etymologists, and serious English language enthusiasts.</about>
    <image>https://dl.dropboxusercontent.com/u/5724095/XmlParseExample/english.png</image>
</site>
Here is the "meat" of the parsing code:
private static String getValue(Element item, String str) {
    NodeList n = item.getElementsByTagName(str);
    Log.i("StackSites", ""+getElementValue(n.item(0)));
    return getElementValue(n.item(0));
}
private static String getElementValue( Node elem ) {
    Node child;
    if( elem != null){
        if (elem.hasChildNodes()){
            for( child = elem.getFirstChild(); child != null; child = child.getNextSibling() ){
                if( child.getNodeType() == Node.TEXT_NODE ){
                    return child.getNodeValue();
                }
            }
        }
    }
    return "";
}
On some devices (LG Optimus G, Moto Attrix 2, and a few emulators) this works correctly and comes out like this:
However on the two Samsung devices that I've tried getValue() method returns only the text that comes before the & so the result is this:
That's because you aren't looking at the rest of the nodes. The entity gets a different node, and the text following the entity gets the node after that. You are returning immediately -- you need to concatenate your results.
This is a known bug on some Android releases. It was fixed in Honeycomb (3.0).
There's no good work-around. You need to process the text as [text node] [entity node] [text node], interpret the entity reference yourself, and concatenate the results.
Alternatively, you can avoid the use of XML character reference and substitute your own escape sequences. As long as the parser doesn't see a &, the problem is avoided.
CommonsWare got me pointed in the right direction.
I changed the getElementValue() method to this:
private static String getElementValue( Node elem ) {
    StringBuilder value = new StringBuilder();
    Node child;
    if( elem != null){
        if (elem.hasChildNodes()){
            for( child = elem.getFirstChild(); child != null; child = child.getNextSibling() ){
                if( child.getNodeType() == Node.TEXT_NODE ){
                    value.append(child.getNodeValue());
                }
            }
            return value.toString();
        }
    }
    return "";
}
and it gets the second half of the text correctly now.

Implementing XML parser in Java for educational purposes

I want to validate whether an XML document (in a String object) is well formed. Like this:
"<root> Hello StackOverflow! <a> Something here </a> Goodbye StackOverflow </root>"
It should also validate attributes, but I'm still far from that right now. I just want to make sure I have the logic right. Here's what I've got so far, but I'm stuck and I need some help.
public boolean isWellFormed( String str )
{
    boolean retorno = true;
    if ( str == null )
    {
        throw new NullPointerException();
    }
    else
    {
        this.chopTheElements( str );
        this.chopTags();
    }
    return retorno;
}
private void chopTags()
{
    for ( String element : this.elements )
    {
        this.tags.add( element.substring( 1, element.length()-1 ) );
    }
}
public void chopTheElements( String str )
{
    for ( int i = 0; i < str.length(); i++ )
    {
        if ( str.charAt( i ) == '<' )
        {
            elements.add( getNextToken( str.substring( i ) ) );
        }
    }
}
private String getNextToken( String str )
{
    String retStr = "";
    if ( str.indexOf( ">" ) != -1 )
    {
        retStr = str.substring( 0, str.indexOf( ">" ) + 1 );
    }
    return retStr;
}
So far I have chopped the elements, like "<root>", into one list, and the tags into another, like this: root, /root.
But I don't know how to proceed or whether I'm going in the right direction. I have been assigned to solve this without regex.
Any advice? I'm lost here. Thanks.
Starting by breaking the string when you see a "<" is not the way to go about it, because the chunks you identify will be unrelated to the hierarchic structure of the XML. For example, if you have as input:
<a>xxx<b>...</b>yyy</a>
then one of your chunks will be "/b>yyy<" which isn't a useful thing to break up further.
You need to structure your code according to the structure of the grammar. If the grammar says that an element consists of a start tag then a sequence of (elements or characters) then an end tag, then you need a method that matches that sequence, and calls other methods to process its components. Because the grammar is recursive, your code will be recursive, so this is known as recursive descent parsing. It's something that is often taught in computer science courses so you'll find excellent coverage of the topic in textbooks.
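As a rough sketch of that idea, the checker below has one method per grammar rule and lets the rules call each other. It assumes a deliberately tiny grammar: element := '<' name '>' content '</' name '>', where names consist of letters and digits only, and there are no attributes, comments, entities or self-closing tags. The class name WellFormedChecker and its method names are made up for illustration.

```java
class WellFormedChecker {
    private final String src;
    private int pos = 0;

    private WellFormedChecker(String src) {
        this.src = src;
    }

    // document := whitespace* element whitespace*
    static boolean isWellFormed(String xml) {
        if (xml == null) {
            throw new NullPointerException("xml");
        }
        WellFormedChecker p = new WellFormedChecker(xml);
        p.skipWhitespace();
        if (!p.element()) {
            return false;
        }
        p.skipWhitespace();
        return p.pos == xml.length(); // nothing may follow the root element
    }

    // element := '<' name '>' content '</' name '>'
    private boolean element() {
        if (!accept('<')) return false;
        String open = name();
        if (open.isEmpty() || !accept('>')) return false;
        if (!content()) return false;
        if (!accept('<') || !accept('/')) return false;
        String close = name();
        return close.equals(open) && accept('>');
    }

    // content := (text | element)*
    // Stops in front of an end tag, which belongs to the calling element().
    private boolean content() {
        while (pos < src.length()) {
            if (src.charAt(pos) != '<') {
                pos++; // plain character data
            } else if (src.startsWith("</", pos)) {
                return true;
            } else if (!element()) { // nested element: recursion happens here
                return false;
            }
        }
        return true; // input exhausted; the caller fails on the missing end tag
    }

    private String name() {
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) {
            pos++;
        }
        return src.substring(start, pos);
    }

    private boolean accept(char c) {
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipWhitespace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a>xxx<b>...</b>yyy</a>")); // prints true
        System.out.println(isWellFormed("<a><b></a></b>"));          // prints false
    }
}
```

Attributes would slot in as one more rule called from element() between the name and the closing '>'.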
If you're not dealing with a huge XML file, consider a DOM parser. I would suggest looking at the DocumentBuilder class. You would call one of its parse() methods (your source can be a file or any other InputSource).
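A minimal sketch of that approach, assuming the input arrives as a String: let DocumentBuilder parse it and treat a SAXException as "not well formed". The class name DomWellFormedCheck is hypothetical; DefaultHandler is installed only so that fatal errors are thrown rather than also being printed to stderr by the JDK's default handler.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DomWellFormedCheck {

    // Returns true if the JDK's XML parser accepts the string.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores warnings.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string with default settings.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<root> Hello <a> Something here </a> Goodbye </root>")); // prints true
        System.out.println(isWellFormed("<root><a></root>")); // prints false
    }
}
```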

Recursion Filling a Tree

I have a Java class, BinaryTree<t>, that I am filling from a file as follows:
E .
T -
I ..
N -.
M --
A .-
W .--
R .-.
S ...
etc. (to the end of the alphabet)
BinaryTree has:
setRight(BinaryTree) -sets the right element
setLeft(BinaryTree) -sets the left element
setRootElement(t) -sets the root element
getRight() -gets the right element
getLeft() -gets the left element
getRootElement() -gets the root element of the node, i.e. a Character
size() -returns the size of the tree
These are the only methods available in the BinaryTree class I was given
So what I want to do is read each line of the file one by one, getting the letter and the string of "morse code". NOTE: I can only use the Scanner class for reading the file!
Then I want to recursively fill this tree from the contents of the file, following a few rules:
A "." means tack to the left, so the first line of the file would mean: tack a node with the 'E' character to the left of the root.
A "-" means tack to the right, so the second line in the file would mean: tack a node with the 'T' character to the right of the root.
So "W .--" would mean: from the root, go one node to the left, then one node to the right, then tack a node with 'W' onto the right of that node.
In the end the tree would look like:
tree http://i56.tinypic.com/339tuys.png
Since I'm new to recursion, I am having a lot of trouble visualizing how a tree could be filled recursively while reading from a file using a Scanner.
Would I have to read the file elsewhere and pass the information into the recursive method?
Or could I read the file right in the recursive method? That doesn't seem possible.
Also, what would you use as a base case? I'm tempted to use t.size() == 27, because that is the size of the final tree.
Any suggestions or comments would be greatly appreciated!!
Thank you!
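Scanner sc = new Scanner(new File(...));
while (sc.hasNext()) {
    String letter = sc.next();
    String morse = sc.next();
    // Walk down from the root, creating missing nodes along the way.
    BinaryTree forPosition = theBinaryTree;
    for (int i = 0; i < morse.length(); i++) {
        if (morse.charAt(i) == '.') {
            if (forPosition.getLeft() == null) {
                forPosition.setLeft(new BinaryTree());
            }
            forPosition = forPosition.getLeft();
        } else {
            // '-' goes right, mirroring the branch above
            if (forPosition.getRight() == null) {
                forPosition.setRight(new BinaryTree());
            }
            forPosition = forPosition.getRight();
        }
    }
    forPosition.setRootElement(letter);
}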
A weird recursive version:
Scanner sc = new Scanner(new File(...));
while (sc.hasNext()) {
    String letter = sc.next();
    String morse = sc.next();
    findTheNode(theBinaryTree, letter, morse);
}
static void findTheNode(BinaryTree node, String letter, String morse) {
    if (morse.length() == 0) {
        // found the position for this letter
        node.setRootElement(letter);
        return;
    }
    if (morse.charAt(0) == '.') {
        if (node.getLeft() == null) {
            node.setLeft(new BinaryTree());
        }
        findTheNode(node.getLeft(), letter, morse.substring(1));
    } else {
        // '-' goes right, mirroring the branch above
        if (node.getRight() == null) {
            node.setRight(new BinaryTree());
        }
        findTheNode(node.getRight(), letter, morse.substring(1));
    }
}
Hope both of the above work.
The result may look like this.
Recursion is usually used for traversal and for binary search trees, but this tree is more similar to a trie over an alphabet of only two characters (i.e. . and -). The construction rule (. for left and - for right) makes recursion unnecessary.
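For completeness, here is a self-contained sketch of the recursive variant that can be run as-is. The nested BinaryTree class is a hypothetical stand-in that models only the accessors listed in the question, and the Scanner reads from a String rather than a file so that the example needs no input file. The base case is the empty code string rather than the tree size.

```java
import java.util.Scanner;

class MorseTreeDemo {

    // Hypothetical stand-in for the BinaryTree<t> class from the question.
    static class BinaryTree<T> {
        private T root;
        private BinaryTree<T> left;
        private BinaryTree<T> right;
        void setRootElement(T value) { root = value; }
        T getRootElement() { return root; }
        void setLeft(BinaryTree<T> tree) { left = tree; }
        void setRight(BinaryTree<T> tree) { right = tree; }
        BinaryTree<T> getLeft() { return left; }
        BinaryTree<T> getRight() { return right; }
    }

    // Consumes one symbol of the code per call; when the code is
    // used up, the current node is the letter's position.
    static void insert(BinaryTree<Character> node, char letter, String code) {
        if (code.isEmpty()) {
            node.setRootElement(letter);
            return;
        }
        BinaryTree<Character> next;
        if (code.charAt(0) == '.') {
            if (node.getLeft() == null) {
                node.setLeft(new BinaryTree<>());
            }
            next = node.getLeft();
        } else {
            if (node.getRight() == null) {
                node.setRight(new BinaryTree<>());
            }
            next = node.getRight();
        }
        insert(next, letter, code.substring(1));
    }

    // Reading stays outside the recursion: one insert per "letter code" pair.
    static BinaryTree<Character> fill(Scanner in) {
        BinaryTree<Character> tree = new BinaryTree<>();
        while (in.hasNext()) {
            char letter = in.next().charAt(0);
            String code = in.next();
            insert(tree, letter, code);
        }
        return tree;
    }

    public static void main(String[] args) {
        BinaryTree<Character> tree = fill(new Scanner("E .\nT -\nI ..\nN -.\nM --\nA .-\nW .--"));
        // W is reached by going left, right, right from the root.
        System.out.println(tree.getLeft().getRight().getRight().getRootElement()); // prints W
    }
}
```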
