I'm trying to write an automated test of an application that basically translates a custom message format into an XML message and sends it out the other end. I've got a good set of input/output message pairs so all I need to do is send the input messages in and listen for the XML message to come out the other end.
When it comes time to compare the actual output to the expected output I'm running into some problems. My first thought was just to do string comparisons on the expected and actual messages. This doesn't work very well because the example data we have isn't always formatted consistently and different aliases are often used for the XML namespace (and sometimes namespaces aren't used at all).
I know I can parse both strings and then walk through each element and compare them myself and this wouldn't be too difficult to do, but I get the feeling there's a better way or a library I could leverage.
So, boiled down, the question is:
Given two Java Strings which both contain valid XML how would you go about determining if they are semantically equivalent? Bonus points if you have a way to determine what the differences are.
Sounds like a job for XMLUnit
http://www.xmlunit.org/
https://github.com/xmlunit
Example:
public class SomeTest extends XMLTestCase {
    @Test
    public void test() throws Exception {
        String xml1 = ...
        String xml2 = ...
        XMLUnit.setIgnoreWhitespace(true); // ignore whitespace differences
        // can also compare xml Documents, InputSources, Readers, Diffs
        assertXMLEqual(xml1, xml2); // assertXMLEqual comes from XMLTestCase
    }
}
The following will check if the documents are equal using standard JDK libraries.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setCoalescing(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setIgnoringComments(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc1 = db.parse(new File("file1.xml"));
doc1.normalizeDocument();
Document doc2 = db.parse(new File("file2.xml"));
doc2.normalizeDocument();
Assert.assertTrue(doc1.isEqualNode(doc2));
normalize() is there to make sure there are no cycles (there technically wouldn't be any)
The above code requires the whitespace within elements to be the same, though, because it is preserved and compared. The standard XML parser that comes with Java does not let you set a feature to produce a canonical version or to honor xml:space. If that is going to be a problem, you may need a replacement XML parser such as Xerces, or use JDOM.
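One more caveat for the question as asked: Node.isEqualNode() also compares namespace prefixes, so two documents that use different aliases for the same namespace URI are reported as different. Below is a minimal sketch of a hand-rolled comparison that uses only the JDK. The class XmlEquivalence and its methods are hypothetical names made up for this example, and the notion of "equivalent" it implements (same namespace URI and local name, same attributes, same trimmed text, whitespace-only text ignored) is an assumption that may not match your requirements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class XmlEquivalence {

    /** True if both documents match by namespace URI, local name, attributes and trimmed text. */
    static boolean equivalent(String xml1, String xml2) {
        return sameElement(parseRoot(xml1), parseRoot(xml2));
    }

    private static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);   // required for getNamespaceURI()/getLocalName()
            dbf.setCoalescing(true);       // turn CDATA sections into plain text
            dbf.setIgnoringComments(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            Element root = db.parse(new InputSource(new StringReader(xml))).getDocumentElement();
            root.normalize();              // merge adjacent text nodes
            return root;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static boolean sameElement(Element e1, Element e2) {
        // Compare by namespace URI and local name, so the prefix (alias) does not matter.
        if (!Objects.equals(e1.getNamespaceURI(), e2.getNamespaceURI())
                || !e1.getLocalName().equals(e2.getLocalName())) {
            return false;
        }
        if (!attributes(e1).equals(attributes(e2))) {
            return false;
        }
        List<Node> c1 = significantChildren(e1);
        List<Node> c2 = significantChildren(e2);
        if (c1.size() != c2.size()) {
            return false;
        }
        for (int i = 0; i < c1.size(); i++) {
            Node n1 = c1.get(i);
            Node n2 = c2.get(i);
            if (n1.getNodeType() != n2.getNodeType()) {
                return false;
            }
            if (n1 instanceof Element) {
                if (!sameElement((Element) n1, (Element) n2)) {
                    return false;
                }
            } else if (!n1.getNodeValue().trim().equals(n2.getNodeValue().trim())) {
                return false;
            }
        }
        return true;
    }

    /** Attributes keyed by {namespaceURI}localName; xmlns declarations are skipped. */
    private static Map<String, String> attributes(Element e) {
        Map<String, String> result = new HashMap<>();
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr a = (Attr) attrs.item(i);
            String name = a.getName();
            if (name.equals("xmlns") || name.startsWith("xmlns:")) {
                continue;
            }
            result.put("{" + a.getNamespaceURI() + "}" + a.getLocalName(), a.getValue());
        }
        return result;
    }

    /** Child nodes without whitespace-only text nodes. */
    private static List<Node> significantChildren(Element e) {
        List<Node> result = new ArrayList<>();
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            boolean blankText = n.getNodeType() == Node.TEXT_NODE
                    && n.getNodeValue().trim().isEmpty();
            if (!blankText) {
                result.add(n);
            }
        }
        return result;
    }
}
```

With this sketch, XmlEquivalence.equivalent(expected, actual) treats `<a:root xmlns:a="urn:x"/>` and `<root xmlns="urn:x"/>` as the same document. It does not report what the differences are and does not look at prefixes that appear inside text or attribute values.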
XOM has a Canonicalizer utility which turns your DOMs into a regular form, which you can then stringify and compare. So regardless of whitespace irregularities or attribute ordering, you can get regular, predictable comparisons of your documents.
This works especially well in IDEs that have dedicated visual String comparators, like Eclipse. You get a visual representation of the semantic differences between the documents.
The latest version of XMLUnit can help with asserting that two XML documents are equal. XMLUnit.setIgnoreWhitespace() and XMLUnit.setIgnoreAttributeOrder() may also be needed for the case in question.
See the working code of a simple XMLUnit example below.
import java.util.List;
import org.custommonkey.xmlunit.DetailedDiff;
import org.custommonkey.xmlunit.XMLUnit;
import org.junit.Assert;
public class TestXml {
public static void main(String[] args) throws Exception {
String result = "<abc attr=\"value1\" title=\"something\"> </abc>";
// will be ok
assertXMLEquals("<abc attr=\"value1\" title=\"something\"></abc>", result);
}
public static void assertXMLEquals(String expectedXML, String actualXML) throws Exception {
XMLUnit.setIgnoreWhitespace(true);
XMLUnit.setIgnoreAttributeOrder(true);
DetailedDiff diff = new DetailedDiff(XMLUnit.compareXML(expectedXML, actualXML));
List<?> allDifferences = diff.getAllDifferences();
Assert.assertEquals("Differences found: "+ diff.toString(), 0, allDifferences.size());
}
}
If using Maven, add this to your pom.xml:
<dependency>
<groupId>xmlunit</groupId>
<artifactId>xmlunit</artifactId>
<version>1.4</version>
</dependency>
Building on Tom's answer, here's an example using XMLUnit v2.
It uses these Maven dependencies:
<dependency>
<groupId>org.xmlunit</groupId>
<artifactId>xmlunit-core</artifactId>
<version>2.0.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.xmlunit</groupId>
<artifactId>xmlunit-matchers</artifactId>
<version>2.0.0</version>
<scope>test</scope>
</dependency>
..and here's the test code
import static org.junit.Assert.assertThat;
import static org.xmlunit.matchers.CompareMatcher.isIdenticalTo;
import org.junit.Test;
import org.xmlunit.builder.Input;
import org.xmlunit.input.WhitespaceStrippedSource;
// XMLUnit 2 needs no base class; XMLTestCase belongs to XMLUnit 1.x
public class SomeTest {
@Test
public void test() {
String result = "<root></root>";
String expected = "<root> </root>";
// ignore whitespace differences
// https://github.com/xmlunit/user-guide/wiki/Providing-Input-to-XMLUnit#whitespacestrippedsource
assertThat(result, isIdenticalTo(new WhitespaceStrippedSource(Input.from(expected).build())));
assertThat(result, isIdenticalTo(Input.from(expected).build())); // will fail due to whitespace differences
}
}
The documentation that outlines this is https://github.com/xmlunit/xmlunit#comparing-two-documents
Thanks, I extended this, try this ...
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
public class XmlDiff
{
private boolean nodeTypeDiff = true;
private boolean nodeValueDiff = true;
public boolean diff( String xml1, String xml2, List<String> diffs ) throws Exception
{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setCoalescing(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setIgnoringComments(true);
DocumentBuilder db = dbf.newDocumentBuilder();
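// Use an explicit charset instead of the platform default
// (assumes the documents do not declare a different encoding)
Document doc1 = db.parse(new ByteArrayInputStream(xml1.getBytes(java.nio.charset.StandardCharsets.UTF_8)));
Document doc2 = db.parse(new ByteArrayInputStream(xml2.getBytes(java.nio.charset.StandardCharsets.UTF_8)));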
doc1.normalizeDocument();
doc2.normalizeDocument();
return diff( doc1, doc2, diffs );
}
/**
* Diff 2 nodes and put the diffs in the list
*/
public boolean diff( Node node1, Node node2, List<String> diffs ) throws Exception
{
if( diffNodeExists( node1, node2, diffs ) )
{
return true;
}
if( nodeTypeDiff )
{
diffNodeType(node1, node2, diffs );
}
if( nodeValueDiff )
{
diffNodeValue(node1, node2, diffs );
}
System.out.println(node1.getNodeName() + "/" + node2.getNodeName());
diffAttributes( node1, node2, diffs );
diffNodes( node1, node2, diffs );
return diffs.size() > 0;
}
/**
* Diff the nodes
*/
public boolean diffNodes( Node node1, Node node2, List<String> diffs ) throws Exception
{
//Sort by Name
Map<String,Node> children1 = new LinkedHashMap<String,Node>();
for( Node child1 = node1.getFirstChild(); child1 != null; child1 = child1.getNextSibling() )
{
children1.put( child1.getNodeName(), child1 );
}
//Sort by Name
Map<String,Node> children2 = new LinkedHashMap<String,Node>();
for( Node child2 = node2.getFirstChild(); child2!= null; child2 = child2.getNextSibling() )
{
children2.put( child2.getNodeName(), child2 );
}
//Diff all the children1
for( Node child1 : children1.values() )
{
Node child2 = children2.remove( child1.getNodeName() );
diff( child1, child2, diffs );
}
//Diff all the children2 left over
for( Node child2 : children2.values() )
{
Node child1 = children1.get( child2.getNodeName() );
diff( child1, child2, diffs );
}
return diffs.size() > 0;
}
/**
* Diff the nodes
*/
public boolean diffAttributes( Node node1, Node node2, List<String> diffs ) throws Exception
{
//Sort by Name
NamedNodeMap nodeMap1 = node1.getAttributes();
Map<String,Node> attributes1 = new LinkedHashMap<String,Node>();
for( int index = 0; nodeMap1 != null && index < nodeMap1.getLength(); index++ )
{
attributes1.put( nodeMap1.item(index).getNodeName(), nodeMap1.item(index) );
}
//Sort by Name
NamedNodeMap nodeMap2 = node2.getAttributes();
Map<String,Node> attributes2 = new LinkedHashMap<String,Node>();
for( int index = 0; nodeMap2 != null && index < nodeMap2.getLength(); index++ )
{
attributes2.put( nodeMap2.item(index).getNodeName(), nodeMap2.item(index) );
}
//Diff all the attributes1
for( Node attribute1 : attributes1.values() )
{
Node attribute2 = attributes2.remove( attribute1.getNodeName() );
diff( attribute1, attribute2, diffs );
}
//Diff all the attributes2 left over
for( Node attribute2 : attributes2.values() )
{
Node attribute1 = attributes1.get( attribute2.getNodeName() );
diff( attribute1, attribute2, diffs );
}
return diffs.size() > 0;
}
/**
* Check that the nodes exist
*/
public boolean diffNodeExists( Node node1, Node node2, List<String> diffs ) throws Exception
{
if( node1 == null && node2 == null )
{
// Nothing to compare. Do not call getPath(null) here: it would throw a NullPointerException.
return true;
}
if( node1 == null && node2 != null )
{
diffs.add( getPath(node2) + ":node " + node1 + "!=" + node2.getNodeName() );
return true;
}
if( node1 != null && node2 == null )
{
diffs.add( getPath(node1) + ":node " + node1.getNodeName() + "!=" + node2 );
return true;
}
return false;
}
/**
* Diff the Node Type
*/
public boolean diffNodeType( Node node1, Node node2, List<String> diffs ) throws Exception
{
if( node1.getNodeType() != node2.getNodeType() )
{
diffs.add( getPath(node1) + ":type " + node1.getNodeType() + "!=" + node2.getNodeType() );
return true;
}
return false;
}
/**
* Diff the Node Value
*/
public boolean diffNodeValue( Node node1, Node node2, List<String> diffs ) throws Exception
{
if( node1.getNodeValue() == null && node2.getNodeValue() == null )
{
return false;
}
if( node1.getNodeValue() == null && node2.getNodeValue() != null )
{
diffs.add( getPath(node1) + ":value " + node1.getNodeValue() + "!=" + node2.getNodeValue() );
return true;
}
if( node1.getNodeValue() != null && node2.getNodeValue() == null )
{
diffs.add( getPath(node1) + ":value " + node1.getNodeValue() + "!=" + node2.getNodeValue() );
return true;
}
if( !node1.getNodeValue().equals( node2.getNodeValue() ) )
{
diffs.add( getPath(node1) + ":value " + node1.getNodeValue() + "!=" + node2.getNodeValue() );
return true;
}
return false;
}
/**
* Get the node path
*/
public String getPath( Node node )
{
StringBuilder path = new StringBuilder();
do
{
path.insert(0, node.getNodeName() );
path.insert( 0, "/" );
}
while( ( node = node.getParentNode() ) != null );
return path.toString();
}
}
AssertJ 1.4+ has specific assertions to compare XML content:
String expectedXml = "<foo />";
String actualXml = "<bar />";
assertThat(actualXml).isXmlEqualTo(expectedXml);
Here is the Documentation
The code below works for me:
String actualxml = ...
String xmlInDb = ...
XMLUnit.setIgnoreWhitespace(true);
XMLUnit.setIgnoreAttributeOrder(true);
XMLAssert.assertXMLEqual(actualxml, xmlInDb);
skaffman seems to be giving a good answer.
Another way is probably to format both strings with a command-line utility like xmlstarlet (http://xmlstar.sourceforge.net/) and then use any diff utility (or library) to diff the resulting output files. I don't know if this is a good solution when the issues are with namespaces.
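The same "format both, then diff the text" idea can be sketched without an external tool, using only the JDK. This is an illustration under assumptions, not a replacement for xmlstarlet: XmlFormatter is a hypothetical name, the exact indentation of the output depends on the JDK's Transformer implementation, and namespace aliases are left untouched, so the namespace caveat above still applies.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class XmlFormatter {

    /** Parses the XML, drops whitespace-only text nodes and re-serializes it with indentation. */
    static String format(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            stripBlankText(doc.getDocumentElement());

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not format XML", e);
        }
    }

    /** Recursively removes text nodes that contain only whitespace. */
    private static void stripBlankText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripBlankText(child);
            }
            child = next;
        }
    }
}
```

Running both the expected and the actual string through XmlFormatter.format() and passing the results to an ordinary assertEquals gives you a line-oriented text diff in the IDE when they differ.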
I'm using Altova DiffDog which has options to compare XML files structurally (ignoring string data).
This means that (if checking the 'ignore text' option):
<foo a="xxx" b="xxx">xxx</foo>
and
<foo b="yyy" a="yyy">yyy</foo>
are equal in the sense that they have structural equality. This is handy if you have example files that differ in data, but not structure!
I required the same functionality as requested in the main question. As I was not allowed to use any third-party libraries, I created my own solution based on @Archimedes Trajano's solution.
Following is my solution.
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.junit.Assert;
import org.w3c.dom.Document;
/**
* Asserts for asserting XML strings.
*/
public final class AssertXml {
private AssertXml() {
}
private static Pattern NAMESPACE_PATTERN = Pattern.compile("xmlns:(ns\\d+)=\"(.*?)\"");
/**
* Asserts that two XML are of identical content (namespace aliases are ignored).
*
* @param expectedXml expected XML
* @param actualXml actual XML
* @throws Exception thrown if XML parsing fails
*/
public static void assertEqualXmls(String expectedXml, String actualXml) throws Exception {
// Find all namespace mappings
Map<String, String> fullnamespace2newAlias = new HashMap<String, String>();
generateNewAliasesForNamespacesFromXml(expectedXml, fullnamespace2newAlias);
generateNewAliasesForNamespacesFromXml(actualXml, fullnamespace2newAlias);
for (Entry<String, String> entry : fullnamespace2newAlias.entrySet()) {
String newAlias = entry.getValue();
String namespace = entry.getKey();
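// Quote the namespace URI so that characters such as '.' or '?' are matched literally
Pattern nsReplacePattern = Pattern.compile("xmlns:(ns\\d+)=\"" + Pattern.quote(namespace) + "\"");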
expectedXml = transletaNamespaceAliasesToNewAlias(expectedXml, newAlias, nsReplacePattern);
actualXml = transletaNamespaceAliasesToNewAlias(actualXml, newAlias, nsReplacePattern);
}
// normalize namespaces according to the given mapping
DocumentBuilder db = initDocumentParserFactory();
Document expectedDocument = db.parse(new ByteArrayInputStream(expectedXml.getBytes(Charset.forName("UTF-8"))));
expectedDocument.normalizeDocument();
Document actualDocument = db.parse(new ByteArrayInputStream(actualXml.getBytes(Charset.forName("UTF-8"))));
actualDocument.normalizeDocument();
if (!expectedDocument.isEqualNode(actualDocument)) {
Assert.assertEquals(expectedXml, actualXml); // just to better visualize the differences, e.g. in Eclipse
}
}
private static DocumentBuilder initDocumentParserFactory() throws ParserConfigurationException {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(false);
dbf.setCoalescing(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setIgnoringComments(true);
DocumentBuilder db = dbf.newDocumentBuilder();
return db;
}
private static String transletaNamespaceAliasesToNewAlias(String xml, String newAlias, Pattern namespacePattern) {
Matcher nsMatcherExp = namespacePattern.matcher(xml);
if (nsMatcherExp.find()) {
xml = xml.replaceAll(nsMatcherExp.group(1) + "[:]", newAlias + ":");
xml = xml.replaceAll(nsMatcherExp.group(1) + "=", newAlias + "=");
}
return xml;
}
private static void generateNewAliasesForNamespacesFromXml(String xml, Map<String, String> fullnamespace2newAlias) {
Matcher nsMatcher = NAMESPACE_PATTERN.matcher(xml);
while (nsMatcher.find()) {
if (!fullnamespace2newAlias.containsKey(nsMatcher.group(2))) {
fullnamespace2newAlias.put(nsMatcher.group(2), "nsTr" + (fullnamespace2newAlias.size() + 1));
}
}
}
}
It compares two XML strings and takes care of any mismatching namespace mappings by translating them to unique values in both input strings.
It can be fine-tuned, e.g. in how the namespaces are translated, but for my requirements it just does the job.
This will compare full XML strings (reformatting them on the way). It makes it easy to work with your IDE (IntelliJ, Eclipse), because you can just click and visually see the difference in the XML files.
import org.apache.xml.security.c14n.CanonicalizationException;
import org.apache.xml.security.c14n.Canonicalizer;
import org.apache.xml.security.c14n.InvalidCanonicalizerException;
import org.w3c.dom.Element;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import java.io.IOException;
import java.io.StringReader;
import static org.apache.xml.security.Init.init;
import static org.junit.Assert.assertEquals;
public class XmlUtils {
static {
init();
}
public static String toCanonicalXml(String xml) throws InvalidCanonicalizerException, ParserConfigurationException, SAXException, CanonicalizationException, IOException {
Canonicalizer canon = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS);
byte canonXmlBytes[] = canon.canonicalize(xml.getBytes());
return new String(canonXmlBytes);
}
public static String prettyFormat(String input) throws TransformerException, ParserConfigurationException, IOException, SAXException, InstantiationException, IllegalAccessException, ClassNotFoundException {
InputSource src = new InputSource(new StringReader(input));
Element document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(src).getDocumentElement();
Boolean keepDeclaration = input.startsWith("<?xml");
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSSerializer writer = impl.createLSSerializer();
writer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
writer.getDomConfig().setParameter("xml-declaration", keepDeclaration);
return writer.writeToString(document);
}
public static void assertXMLEqual(String expected, String actual) throws ParserConfigurationException, IOException, SAXException, CanonicalizationException, InvalidCanonicalizerException, TransformerException, IllegalAccessException, ClassNotFoundException, InstantiationException {
String canonicalExpected = prettyFormat(toCanonicalXml(expected));
String canonicalActual = prettyFormat(toCanonicalXml(actual));
assertEquals(canonicalExpected, canonicalActual);
}
}
I prefer this to XMLUnit because the client code (test code) is cleaner.
Using XMLUnit 2.x
In the pom.xml
<dependency>
<groupId>org.xmlunit</groupId>
<artifactId>xmlunit-assertj3</artifactId>
<version>2.9.0</version>
</dependency>
Test implementation (using JUnit 5):
import org.junit.jupiter.api.Test;
import org.xmlunit.assertj3.XmlAssert;
public class FooTest {
@Test
public void compareXml() {
//
String xmlContentA = "<foo></foo>";
String xmlContentB = "<foo></foo>";
//
XmlAssert.assertThat(xmlContentA).and(xmlContentB).areSimilar();
}
}
Other methods : areIdentical(), areNotIdentical(), areNotSimilar()
More details (configuration of assertThat(~).and(~) and examples) in this documentation page.
XMLUnit also has (among other features) a DifferenceEvaluator to do more precise comparisons.
XMLUnit website
Using JExamXML with java application
import com.a7soft.examxml.ExamXML;
import com.a7soft.examxml.Options;
.................
// Reads two XML files into two strings
String s1 = readFile("orders1.xml");
String s2 = readFile("orders.xml");
// Loads options saved in a property file
Options.loadOptions("options");
// Compares two Strings representing XML entities
System.out.println( ExamXML.compareXMLString( s1, s2 ) );
Since you say "semantically equivalent" I assume you mean that you want to do more than just literally verify that the xml outputs are (string) equals, and that you'd want something like
<foo> some stuff here</foo>
and
<foo>some stuff here</foo>
do read as equivalent. Ultimately it's going to matter how you're defining "semantically equivalent" on whatever object you're reconstituting the message from. Simply build that object from the messages and use a custom equals() to define what you're looking for.
I'm trying to get XML parsing down (and yes I know there's easier ways to parse/validate like xstream) but I can't seem to get text content of just a single element. For example:
<container>
<element0>textThatIWant</element0> //only returned by .getTextContent
<element1>
<subelement0>textThatIDontWant</subelement0> //but also returned by
<subelement1>textThatIDontWant</subelement1> //.getTextContent
</element1>
</container>
I'm piping results out to the console and get mostly what I'm looking for, but the only ways I seem to get the text strings are .getTextContent(), which also returns all text in the sub-elements without whitespace (or else I'd have split on spaces), or .getNodeValue().toString(), which throws a NullPointerException. @Jihar mentioned something like .getTextValue() but Eclipse doesn't recognize it (maybe there's something I can implement/inherit/whatever to add the capability). Any help?
Here's the code I'm using:
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.*;
public class Test {
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
StringBuilder xmlStringBuilder = new StringBuilder();
String appendage = "..."; //This string holds the xml formatted data I'll be
//using in a long annoying line, I'll include it
//separately for clarity
xmlStringBuilder.append(appendage);
ByteArrayInputStream input = new ByteArrayInputStream(xmlStringBuilder.toString().getBytes("UTF-8"));
System.out.println("Test Results:");
System.out.println();
Document doc = builder.parse(input);
Element root = doc.getDocumentElement();
NodeList children = root.getChildNodes();
System.out.println(root.getTagName());
System.out.println();
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
if (child instanceof Element) {
Element childElement = (Element) child;
System.out.println(childElement.getTagName() + " " + childElement);
NodeList grandChildren = child.getChildNodes();
for (int x = 0; x < grandChildren.getLength(); x++) {
Node grandChild = grandChildren.item(x);
if (grandChild instanceof Element) {
Element grandChildElement = (Element) grandChild;
System.out.print("\t" + grandChildElement.getTagName() + ":\t");
NodeList greatGrandChildren = grandChild.getChildNodes();
for (int y = 0; y < greatGrandChildren.getLength(); y++) {
Node greatGrandChild = greatGrandChildren.item(y);
if (greatGrandChild instanceof Element) {
Element greatGrandChildElement = (Element) greatGrandChild;
System.out.print(" " + greatGrandChildElement.getTextContent());
if ( y < greatGrandChildren.getLength() - 1) { System.out.print(","); } }
}
System.out.println();
}
}
}
}
}
}
And here's the appendage variable in full:
String appendage = "<?xml version=\"1.0\"?><branch0><name>business</name><taxINFO/><personnel><executives><name>Billy Bob</name><name>Colonel Jessup</name></executives><managerial/><operations><name>sabrina</name><name>lisa</name></operations><services><name>jamie</name><name>justin</name><name>forest</name></services></personnel><regions><ebay><area>OK</area><area>BE</area><area>EV</area><area>WC</area></ebay><sbay><area>SJ</area><area>MP</area><area>SV</area><area>MV</area></sbay><S.F.><area>SF</area></S.F.><N.Y.><area>NY</area></N.Y.><S.CA><area>SD</area><area>LA</area></S.CA></regions><products/><services/></branch0>";
or:
String appendage = "
<?xml version=\"1.0\"?>
<branch0>
<name>business</name>
<taxINFO/>
<personnel>
<executives>
<name>Billy Bob</name>
<name>Colonel Jessup</name>
</executives>
<managerial/>
<operations>
<name>sabrina</name>
<name>lisa</name>
</operations>
<services>
<name>jamie</name>
<name>justin</name>
<name>forest</name>
</services>
</personnel>
<regions>
<ebay>
<area>OK</area>
<area>BE</area>
<area>EV</area>
<area>WC</area>
</ebay>
<sbay>
<area>SJ</area>
<area>MP</area>
<area>SV</area>
<area>MV</area>
</sbay>
<S.F.>
<area>SF</area>
</S.F.>
<N.Y.>
<area>NY</area>
</N.Y.>
<S.CA>
<area>SD</area>
<area>LA</area>
</S.CA>
</regions>
<products/>
<services/>
</branch0>";
And, finally my console output (which you'll see is stating [name: null] where I'd like it to say something like [name: business] or even just business; but not include the sub element data w/out whitespace):
Test Results:
branch0
name [name: null]
taxINFO [taxINFO: null]
personnel [personnel: null]
executives: Billy Bob, Colonel Jessup
managerial:
operations: sabrina, lisa
services: jamie, justin, forest
regions [regions: null]
ebay: OK, BE, EV, WC
sbay: SJ, MP, SV, MV
S.F.: SF
N.Y.: NY
S.CA: SD, LA
products [products: null]
services [services: null]
and here's my console output using .getTextContent:
Test Results:
business
branch0
name business
taxINFO
personnel Billy BobColonel Jessupsabrinalisajamiejustinforest
executives: Billy Bob, Colonel Jessup
managerial:
operations: sabrina, lisa
services: jamie, justin, forest
regions OKBEEVWCSJMPSVMVSFNYSDLA
ebay: OK, BE, EV, WC
sbay: SJ, MP, SV, MV
S.F.: SF
N.Y.: NY
S.CA: SD, LA
products
services
System.out.println(childElement.getTagName() + " " + childElement);
should be (as you actually know!)
System.out.println(childElement.getTagName() + " "
+ childElement.getTextContent());
So, for my purposes, I was able to get the individual elements I was looking for using an XPath:
XPathFactory xpfactory = XPathFactory.newInstance();
XPath path = xpfactory.newXPath();
try {
String aString = path.evaluate("/branch0/name", doc);
System.out.println(aString);
} catch (XPathExpressionException e) { e.printStackTrace(); }
Of course this requires pre-existing knowledge of the structure, but since I can validate with an XML Schema and my docs are not too complicated or heavily nested, I don't think that will be an issue for me. When I finish working on my current project I'll try to look up and post links about iterating over the child nodes and checking for text nodes (as @Ian Roberts suggested), but I don't know enough about XML to do that now.
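For reference, iterating over the child nodes and keeping only the text nodes could look roughly like the sketch below. It uses only org.w3c.dom from the JDK. OwnText and its methods are hypothetical names for this example, and trimming the result is an assumption about what "just the element's own text" should mean.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class OwnText {

    /** Text that sits directly inside the element, ignoring text of nested elements. */
    static String ownText(Element element) {
        StringBuilder text = new StringBuilder();
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            short type = n.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(n.getNodeValue());
            }
        }
        return text.toString().trim();
    }

    /** Convenience wrapper: parses the XML and returns the own text of the first element with that tag. */
    static String ownTextOf(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node first = doc.getElementsByTagName(tagName).item(0);
            if (first == null) {
                throw new IllegalArgumentException("No element named " + tagName);
            }
            return ownText((Element) first);
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

In the loop from the question, OwnText.ownText(childElement) would take the place of childElement.getTextContent(): for name it yields business, and for personnel it yields an empty string instead of the concatenated names.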
I have an XML file that contains tags such as:
<P>(b) <E T="03">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</p>
I need to parse the text content and get the results back as an array of strings ["(b)", "Filing of financial reports.", "(1)(i) Except as provided in paragraphs (b) (3) and (h) of this section,"].
In other words, I need to tokenize the text content of a <P> element around the <E T="03"> child and store the results in an array of strings.
There's nothing to "tokenize", as the parsing has already been done for you when the DOM was built. The <P> node contains both text and child nodes. This is what the DOM looks like:
P
|
+---text "(b) "
|
+---E
| |
| +---attribute T=03
| |
| +---text "Filing of financial reports."
|
+---text "Except as provided ..."
To get the results you want you need to navigate through the sub-nodes of <P> and extract all the text nodes.
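A minimal sketch of that navigation with the standard DOM API follows. TextSegments is a hypothetical name, trimming each piece and skipping empty ones are assumptions based on the desired output in the question, and the input has to be well-formed XML (closing tag </P>, not </p>).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class TextSegments {

    /** One entry per direct child of the root: text nodes as they are, elements via their text content. */
    static List<String> segments(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> result = new ArrayList<>();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            String piece = null;
            if (n.getNodeType() == Node.TEXT_NODE) {
                piece = n.getNodeValue().trim();
            } else if (n.getNodeType() == Node.ELEMENT_NODE) {
                piece = n.getTextContent().trim();
            }
            if (piece != null && !piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }
}
```

Calling segments(xml).toArray(new String[0]) then gives the array of strings the question asks for.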
Here's one way to do it using the jsoup library:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
class Test {
public static void main(String args[]) throws Exception {
String xml = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</p>";
Document doc = Jsoup.parse(xml);
for (Element e : doc.select("p"))
for (Node child : e.childNodes()) {
if (child instanceof TextNode) {
System.out.println(((TextNode) child).text());
} else {
System.out.println(((Element) child).text());
}
}
}
}
output:
(b)
Filing of financial reports.
(1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,
Use XPath. If you don't want to use specialized Java libraries, you can use the standard Java API, such as:
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class ExtractingAllTextNodes {
private static final String XML = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</P>";
public static void main(final String[] args) throws Exception {
final XPath xPath = XPathFactory.newInstance().newXPath();
final DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
final DocumentBuilder builder = builderFactory.newDocumentBuilder();
final String expression = "//text()";
final Document xmlDocument = builder.parse(new ByteArrayInputStream(XML.getBytes()));
final NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);
for (int i = 0; i < nodeList.getLength(); i++) {
System.out.println("=> " + nodeList.item(i).getTextContent());
}
}
}
Output:
=> (b)
=> Filing of financial reports.
=> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,
Depending on your needs, you may alter the XPath expression.
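As an illustration of altering the expression, the sketch below evaluates an arbitrary XPath expression against the same sample and returns the trimmed text of each matched node. XPathTextDemo and select are hypothetical names for this example; the two expressions mentioned afterwards are just examples of narrowing //text().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class XPathTextDemo {

    /** Evaluates the expression as a node set and returns the trimmed value of every matched node. */
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }
}
```

For example, "/P/text()" selects only the text that sits directly under <P> (two nodes for the sample), while "/P/E/text()" selects only the text inside the <E> child.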
OK. I finally managed to find a solution to the problem. The code is somewhat complex, but it uses DOM, which is the standard library for XML parsing:
public static void parseSection(Element sec){
NodeList pTags = ((Element) (((NodeList) sec
.getElementsByTagName("contents")).item(0)))
.getElementsByTagName("P");
int pTagIndex = 0;
while (pTagIndex < pTags.getLength()) {
System.out.println(pTagIndex);
Node pTag = pTags.item(pTagIndex);
NodeList pTagChildren = pTag.getChildNodes();
int pTagChildrenIndex = 0;
while(pTagChildrenIndex < pTagChildren.getLength()){
Node pTagChild = pTagChildren.item(pTagChildrenIndex);
if(pTagChild.getNodeName().equals("#text")){
System.out.println("Text: " + pTagChild.getNodeValue());
} else if(pTagChild.getNodeName().equals("E")){
System.out.println("E: " + pTagChild.getTextContent());
}
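pTagChildrenIndex++;
}
// The original snippet was cut off here; the increment and closing braces below are assumed
pTagIndex++; // advance to the next <P>, otherwise the outer loop would not terminate
}
}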
I was wondering if anyone knows how to successfully parse the company name "Alcoa Inc." shown in the URL below. It would be much easier to show a picture but I do not have enough reputation. Any help would be appreciated.
http://www.google.com/finance?q=NYSE%3AAA&ei=LdwVUYC7Fp_YlgPBiAE
This is what I have tried so far using jsoup to parse the div class:
<div class="appbar-snippet-primary">
<span>Alcoa Inc.</span>
</div>
public Elements htmlParser(String url, String element, String elementType, String returnElement) {
    try {
        Document doc = Jsoup.connect(url).get();
        Document parse = Jsoup.parse(doc.html());
        if (returnElement == null) {
            return parse.select(elementType + "." + element);
        } else {
            return parse.select(elementType + "." + element + " " + returnElement);
        }
    } catch (IOException e) {
        // The original snippet had no catch block; returning an empty result here is an assumption
        e.printStackTrace();
        return new Elements();
    }
}
public String htmlparseGoogleStocks(String url){
String pr = "pr";
String appbar_center = "appbar-snippet-primary";
String val = "val";
String span = "span";
String div = "div";
String td = "td";
Elements price_data;
Elements title_data;
Elements more_data;
price_data = htmlParser(url, pr, span, null);
title_data = htmlParser(url, appbar_center, div, span);
//more_data = htmlParser(url, val, td, null);
//String stockprice = price_data.text().toString();
String title = title_data.text().toString();
//System.out.println(more_data.text());
return title;
}
Myself, I'd analyze the page of interest's source HTML, and then just use JSoup to extract the information. For instance, using a very small JSoup program like so:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class GoogleFinance {
public static final String PAGE = "https://www.google.com/finance?q=NASDAQ:XONE";
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect(PAGE).get();
Elements title = doc.select("title");
System.out.println(title.text());
}
}
You get in return:
ExOne Co: NASDAQ:XONE quotes & news - Google Finance
It doesn't get much easier than that.
There are a lot of questions that ask about the best XML parser; I am more interested in which XML parser for Java is the most like Groovy's.
I want:
SomeApiDefinedObject o = parseXml( xml );
for( SomeApiDefinedObject it : o.getChildren() ) {
System.out.println( it.getAttributes() );
}
The most important things are that I don't want to create a class for every type of XML node (I'd rather just deal with them all as strings), and that building the XML doesn't require any converters or anything, just a simple object that is already defined.
If you have used the Groovy XML parser, you will know what I'm talking about.
Alternatively, would it be better for me to just use Groovy from Java?
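Here is something quick you can do with the Sun Java Streaming XML Parser (StAX):
import java.io.File;
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

FileInputStream xmlStream = new FileInputStream(new File("myxml.xml"));
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xmlStream);
while (reader.hasNext()) {
    // Attributes can only be read on START_ELEMENT events;
    // asking for them on other events throws IllegalStateException.
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            System.out.println(reader.getAttributeName(i) + "=" + reader.getAttributeValue(i));
        }
    }
}
reader.close();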
I would like to shamelessly plug the small open-source library I have written to make parsing XML in Java a breeze.
Check out Jinq2XML.
http://code.google.com/p/jinq2xml/
Some sample code would look like:
Jocument joc = Jocument.load(urlOrStreamOrFileName);
joc.single("root").children().each(new Action() {
public void act(Jode j){
System.out.println(j.name() + ": " + j.value());
}
});
Looks like all you want is a simple DOM API, such as the one provided by dom4j. There is actually already a DOM API in the standard library (the org.w3c.dom packages), but it's only the API, so you need a separate implementation; you might as well use something a little more advanced like dom4j.
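To illustrate what a generic tree API gives you without a class per node type, here is a minimal sketch that uses only the org.w3c.dom interfaces. It assumes a JAXP parser implementation is available at runtime (that is the case on the JDKs I'm aware of). The class name, the helpers parseXml, children and attributes, and the sample XML are all made up for this example. dom4j offers similar navigation with less ceremony.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class GenericDomWalk {

    // Hypothetical helper: parse an XML string and return its root element.
    static Element parseXml(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Hypothetical helper: direct child elements only (text and comment nodes are skipped).
    static List<Element> children(Element parent) {
        List<Element> result = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) n);
            }
        }
        return result;
    }

    // Hypothetical helper: attributes as a name -> value map of plain strings.
    static Map<String, String> attributes(Element e) {
        Map<String, String> result = new LinkedHashMap<>();
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            result.put(attrs.item(i).getNodeName(), attrs.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Element root = parseXml("<people><person name=\"Ann\"/><person name=\"Bob\"/></people>");
        // Every node is the same type, and every attribute value is a String.
        for (Element it : children(root)) {
            System.out.println(it.getTagName() + " " + attributes(it));
        }
    }
}
```

The loop in main is close to the shape asked for in the question: one node type for everything, with attributes exposed as strings.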
Use Groovy.
It seems that your primary goal is to be able to access the DOM in a "natural" way via object accessors, and Java won't let you do this without defining classes. Groovy, because it is "duck typed," will allow you to do this.
The only reason not to use Groovy is if (1) XML processing is a very small part of your application, and/or (2) you have to work with other people who may want to program strictly in Java.
Whatever you do, do not decide to "just deal with them all as strings." XML is not a simple format, and unless you know the spec inside and out, you're not likely to get it right. Which means that your XML will be rejected by spec-conformant parsers.
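To make that concrete, here is a small sketch (my own example, not taken from the question) that contrasts naive string concatenation with the JDK's XMLStreamWriter, which escapes reserved characters in text content. The class name, method names and sample value are invented for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class EscapingDemo {

    // Naive approach: the value is pasted in verbatim, so '&' or '<' corrupts the markup.
    static String byConcatenation(String company) {
        return "<company>" + company + "</company>";
    }

    // StAX writer: writeCharacters escapes '&', '<' and '>' in text content.
    static String byStreamWriter(String company) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement("company");
        writer.writeCharacters(company);
        writer.writeEndElement();
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String company = "Smith & Sons";
        System.out.println(byConcatenation(company)); // contains a bare '&', so not well-formed
        System.out.println(byStreamWriter(company));  // the '&' is written as an entity
    }
}
```

A spec-conformant parser will reject the first string, and that is exactly the kind of mistake that is easy to make when treating XML as plain text.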
I highly recommend JAXB. It's a great framework for mapping XML <--> Java objects.
There used to be a very small and simple XML parser called NanoXML. It seems not to be developed anymore, but it's still available at http://devkix.com/nanoxml.php
I have had good experiences with XStream. It's fairly quick and will serialize and deserialize Java to/from XML with no schema and very little code. It Just Works™. The Java object hierarchies it builds will directly mirror your XML.
I work with Dozer and Castor for getting OTOM (Object to Object Mapping).
Try this code:
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import javax.xml.parsers.*;
import java.io.File;
import java.io.IOException;
public class XmlDemo {
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException
{
// Get Document Builder
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
//Build Document
Document document = builder.parse(new File("D:/test.xml"));
//Normalize the XML structure: merges adjacent text nodes and removes empty ones
document.getDocumentElement().normalize();
//Here comes the root node
Element root = document.getDocumentElement();
System.out.println(root.getNodeName());
//Get all <company> elements
NodeList nList = document.getElementsByTagName("company");
System.out.println(nList.getLength());
System.out.println("============================");
visitChildNodes(nList);
}
//This function is called recursively
private static void visitChildNodes(NodeList nList)
{
Node tempNode = null;
for (int temp = 0; temp < nList.getLength(); temp++)
{
Node node = nList.item(temp);
if (node.getNodeType() == Node.ELEMENT_NODE)
{
System.out.println();
System.out.println("Node Name = " + node.getNodeName() + ";");
// if (node.hasAttributes()) {
// // get attributes names and values
// NamedNodeMap nodeMap = node.getAttributes();
// for (int i = 0; i < nodeMap.getLength(); i++)
// {
// tempNode = nodeMap.item(i);
// System.out.println(" Attr : " + tempNode.getNodeName()+ "; Value = " + tempNode.getNodeValue());
// }
//
// }else {
//// System.out.println("No Attributes");
// }
if (node.hasChildNodes()) {
NodeList nodeList = node.getChildNodes();
if((node.getChildNodes().getLength()/2)>0) {
System.out.println("This node has child nodes "+ (node.getChildNodes().getLength()/2));
System.out.println("Child nodes of : [ " + node.getNodeName() + " ] =>");
for(int k = 0;k < nodeList.getLength(); k++) {
Node n = nodeList.item(k);
if (n.getNodeType() == Node.ELEMENT_NODE)
{
System.out.println(" [ " + n.getNodeName() + " ] =>" );
// Print attribute names and values (the map is empty when the element has none)
NamedNodeMap nodeMap = n.getAttributes();
for (int i = 0; i < nodeMap.getLength(); i++)
{
tempNode = nodeMap.item(i);
System.out.println(" Attr : " + tempNode.getNodeName()+ "; Value = " + tempNode.getNodeValue() + " ]");
}
}
}
System.out.println(" ]");
}
visitChildNodes(node.getChildNodes());
}
}
}
}
}