Merge Two XML Files in Java

I have two XML files of similar structure which I wish to merge into one file.
Currently I am using EL4J XML Merge which I came across in this tutorial.
However, it does not merge as I expect it to. The main problem is that it is not merging the results from both files into one element, i.e. a single <results> element that contains results 1, 2, 3 and 4.
Instead it just discards either 1 and 2, or 3 and 4, depending on which file is merged first.
So I would be grateful to anyone who has experience with XML Merge if they could tell me what I might be doing wrong. Alternatively, does anyone know of a good XML API for Java that would be capable of merging the files as I require?
Many Thanks for Your Help in Advance
Edit:
Could really do with some good suggestions on doing this, so I've added a bounty. I've tried jdigital's suggestion but am still having issues with XML Merge.
Below is a sample of the type of structure of XML files that I am trying to merge.
<run xmloutputversion="1.02">
<info type="a" />
<debugging level="0" />
<host starttime="1237144741" endtime="1237144751">
<status state="up" reason="somereason"/>
<something avalue="test" test="alpha" />
<target>
<system name="computer" />
</target>
<results>
<result id="1">
<state value="test" />
<service value="gamma" />
</result>
<result id="2">
<state value="test4" />
<service value="gamma4" />
</result>
</results>
<times something="0" />
</host>
<runstats>
<finished time="1237144751" timestr="Sun Mar 15 19:19:11 2009"/>
<result total="0" />
</runstats>
</run>
<run xmloutputversion="1.02">
<info type="b" />
<debugging level="0" />
<host starttime="1237144741" endtime="1237144751">
<status state="down" reason="somereason"/>
<something avalue="test" test="alpha" />
<target>
<system name="computer" />
</target>
<results>
<result id="3">
<state value="testagain" />
<service value="gamma2" />
</result>
<result id="4">
<state value="testagain4" />
<service value="gamma4" />
</result>
</results>
<times something="0" />
</host>
<runstats>
<finished time="1237144751" timestr="Sun Mar 15 19:19:11 2009"/>
<result total="0" />
</runstats>
</run>
Expected output
<run xmloutputversion="1.02">
<info type="a" />
<debugging level="0" />
<host starttime="1237144741" endtime="1237144751">
<status state="down" reason="somereason"/>
<status state="up" reason="somereason"/>
<something avalue="test" test="alpha" />
<target>
<system name="computer" />
</target>
<results>
<result id="1">
<state value="test" />
<service value="gamma" />
</result>
<result id="2">
<state value="test4" />
<service value="gamma4" />
</result>
<result id="3">
<state value="testagain" />
<service value="gamma2" />
</result>
<result id="4">
<state value="testagain4" />
<service value="gamma4" />
</result>
</results>
<times something="0" />
</host>
<runstats>
<finished time="1237144751" timestr="Sun Mar 15 19:19:11 2009"/>
<result total="0" />
</runstats>
</run>

Not very elegant, but you could do this with the DOM parser and XPath:
import java.io.File;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Result;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class MergeXmlDemo {

    public static void main(String[] args) throws Exception {
        // proper error/exception handling omitted for brevity
        File file1 = new File("merge1.xml");
        File file2 = new File("merge2.xml");
        Document doc = merge("/run/host/results", file1, file2);
        print(doc);
    }

    private static Document merge(String expression, File... files)
            throws Exception {
        XPathFactory xPathFactory = XPathFactory.newInstance();
        XPath xpath = xPathFactory.newXPath();
        XPathExpression compiledExpression = xpath.compile(expression);
        return merge(compiledExpression, files);
    }

    private static Document merge(XPathExpression expression, File... files)
            throws Exception {
        DocumentBuilderFactory docBuilderFactory =
                DocumentBuilderFactory.newInstance();
        docBuilderFactory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();

        // the first file becomes the base document that everything is merged into
        Document base = docBuilder.parse(files[0]);
        Node results = (Node) expression.evaluate(base, XPathConstants.NODE);
        if (results == null) {
            throw new IOException(files[0]
                    + ": expression does not evaluate to node");
        }

        // import the children of the matching node from every other file
        for (int i = 1; i < files.length; i++) {
            Document merge = docBuilder.parse(files[i]);
            Node nextResults = (Node) expression.evaluate(merge,
                    XPathConstants.NODE);
            while (nextResults.hasChildNodes()) {
                Node kid = nextResults.getFirstChild();
                nextResults.removeChild(kid);
                kid = base.importNode(kid, true);
                results.appendChild(kid);
            }
        }
        return base;
    }

    private static void print(Document doc) throws Exception {
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(doc);
        Result result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}
This assumes that you can hold at least two of the documents in RAM simultaneously.

I use XSLT to merge XML files. It allows me to adjust the merge operation to just slam the content together or to merge at a specific level. It is a little more work (and XSLT syntax is kind of special) but super flexible. You need a few things here:
a) Include an additional file
b) Copy the original file 1:1
c) Design your merge point with or without duplication avoidance
a) In the beginning I have
<xsl:param name="mDocName">yoursecondfile.xml</xsl:param>
<xsl:variable name="mDoc" select="document($mDocName)" />
This allows you to refer to the second file using $mDoc.
b) The instructions to copy a source tree 1:1 are 2 templates:
<!-- Copy everything including attributes as default action -->
<xsl:template match="*">
    <xsl:element name="{name()}">
        <xsl:apply-templates select="@*" />
        <xsl:apply-templates />
    </xsl:element>
</xsl:template>
<xsl:template match="@*">
    <xsl:attribute name="{name()}"><xsl:value-of select="." /></xsl:attribute>
</xsl:template>
With nothing else you get a 1:1 copy of your first source file. Works with any type of XML. The merging part is file specific. Let's presume you have event elements with an event ID attribute. You do not want duplicate IDs. The template would look like this:
<xsl:template match="events">
<xsl:variable name="allEvents" select="descendant::*" />
<events>
<!-- copies all events from the first file -->
<xsl:apply-templates />
<!-- Merge the new events in. You need to adjust the select clause -->
<xsl:for-each select="$mDoc/logbook/server/events/event">
<xsl:variable name="curID" select="#id" />
<xsl:if test="not ($allEvents[#id=$curID]/#id = $curID)">
<xsl:element name="event">
<xsl:apply-templates select="#*" />
<xsl:apply-templates />
</xsl:element>
</xsl:if>
</xsl:for-each>
</properties>
</xsl:template>
Of course you can compare other things like tag names etc. Also it is up to you how deep the merge happens. If you don't have a key to compare, the construct becomes easier e.g. for log:
<xsl:template match="logs">
<xsl:element name="logs">
<xsl:apply-templates select="#*" />
<xsl:apply-templates />
<xsl:apply-templates select="$mDoc/logbook/server/logs/log" />
</xsl:element>
To run XSLT in Java use this:
Source xmlSource = new StreamSource(xmlFile);
Source xsltSource = new StreamSource(xsltFile);
Result xmlResult = new StreamResult(resultFile);
TransformerFactory transFact = TransformerFactory.newInstance();
Transformer trans = transFact.newTransformer(xsltSource);
// Load Parameters if we have any
if (ParameterMap != null) {
for (Entry<String, String> curParam : ParameterMap.entrySet()) {
trans.setParameter(curParam.getKey(), curParam.getValue());
}
}
trans.transform(xmlSource, xmlResult);
or you can download the Saxon XSLT processor and run it from the command line (Linux shell example):
#!/bin/bash
notify-send -t 500 -u low -i gtk-dialog-info "Transforming $1 with $2 into $3 ..."
# That's actually the only relevant line below
java -cp saxon9he.jar net.sf.saxon.Transform -t -s:$1 -xsl:$2 -o:$3
notify-send -t 1000 -u low -i gtk-dialog-info "Extraction into $3 done!"
YMMV

Thanks to everyone for their suggestions. Unfortunately, none of the methods suggested turned out to be suitable in the end, as I needed rules for the way in which different nodes of the structure were merged.
So what I did was take the DTD relating to the XML files I was merging and, from that, create a number of classes reflecting the structure.
From this I used XStream to deserialize the XML files back into those classes.
I then annotated my classes, making the merge a process of using a combination of the rules assigned via annotations and some reflection in order to merge the objects, as opposed to merging the actual XML structure.
If anyone is interested in the code, which in this case merges Nmap XML files, please see http://fluxnetworks.co.uk/NmapXMLMerge.tar.gz. The code's not perfect and, I'll admit, not massively flexible, but it definitely works. I'm planning to reimplement the system so that it parses the DTD automatically when I have some free time.
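To give a rough idea of the approach without downloading the archive, here is a minimal sketch: the Results/Result classes and file names are simplified stand-ins rather than the actual NmapXMLMerge code, and it assumes each input is a bare <results> document; only the XStream calls themselves are real API.
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import com.thoughtworks.xstream.XStream;
import com.thoughtworks.xstream.annotations.XStreamAlias;
import com.thoughtworks.xstream.annotations.XStreamAsAttribute;
import com.thoughtworks.xstream.annotations.XStreamImplicit;

// Simplified stand-ins for the classes generated from the DTD.
@XStreamAlias("results")
class Results {
    @XStreamImplicit(itemFieldName = "result")
    List<Result> results = new ArrayList<Result>();
}

@XStreamAlias("result")
class Result {
    @XStreamAsAttribute
    String id;
}

public class XStreamMergeSketch {
    public static void main(String[] args) throws Exception {
        XStream xstream = new XStream();
        xstream.processAnnotations(new Class[] { Results.class, Result.class });
        // harmless on older XStream versions; avoids security restrictions on newer ones
        xstream.allowTypes(new Class[] { Results.class, Result.class });

        // deserialize the <results> document of each file into objects
        Results a = (Results) xstream.fromXML(new FileReader("results1.xml"));
        Results b = (Results) xstream.fromXML(new FileReader("results2.xml"));

        // merge at the object level rather than at the XML level;
        // the real code applies annotation-driven rules here instead
        a.results.addAll(b.results);

        System.out.println(xstream.toXML(a));
    }
}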

This is how it should look using XML Merge:
action.default=MERGE
xpath.info=/run/info
action.info=PRESERVE
xpath.result=/run/host/results/result
action.result=MERGE
matcher.result=ID
You have to set the ID matcher for the //result node and set the PRESERVE action for the //info node. Also beware that the .properties files XML Merge uses are case sensitive: you have to use "xpath", not "XPath", in your .properties.
Don't forget to define the -config parameter, like this:
java -cp lib\xmlmerge-full.jar; ch.elca.el4j.services.xmlmerge.tool.XmlMergeTool -config xmlmerge.properties example1.xml example2.xml

It might help if you were explicit about the result that you're interested in achieving. Is this what you're asking for?
Doc A:
<root>
<a/>
<b>
<c/>
</b>
</root>
Doc B:
<root>
<d/>
</root>
Merged Result:
<root>
<a/>
<b>
<c/>
</b>
<d/>
</root>
Are you worried about scaling for large documents?
The easiest way to implement this in Java is to use a streaming XML parser (google for 'java StAX'). If you use the javax.xml.stream library you'll find that XMLEventWriter has a convenient method XMLEventWriter#add(XMLEvent). All you have to do is loop over the top-level elements in each document and add them to your writer using this method to generate your merged result. The only funky part is implementing the reader logic that only considers (only calls 'add' on) the top-level nodes.
I recently implemented this method if you need hints.
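For what it's worth, a rough sketch of that idea with javax.xml.stream; the file names and the "write the shared root element only once, copy everything inside it" rule are my own assumptions, not the exact implementation referred to above:
import java.io.FileInputStream;
import java.io.FileOutputStream;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class StaxConcatSketch {
    public static void main(String[] args) throws Exception {
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventWriter writer = outFactory.createXMLEventWriter(
                new FileOutputStream("merged.xml"), "UTF-8");

        String[] inputs = { "merge1.xml", "merge2.xml" };
        for (int i = 0; i < inputs.length; i++) {
            XMLEventReader reader = inFactory.createXMLEventReader(
                    new FileInputStream(inputs[i]));
            int depth = 0;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    depth++;
                    // write the root start tag only once, from the first file
                    if (depth > 1 || i == 0) {
                        writer.add(event);
                    }
                } else if (event.isEndElement()) {
                    // write the root end tag only once, after the last file
                    if (depth > 1 || i == inputs.length - 1) {
                        writer.add(event);
                    }
                    depth--;
                } else if (depth > 0) {
                    writer.add(event); // text, comments, etc. inside the root
                }
            }
            reader.close();
        }
        writer.close();
    }
}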

I took a look at the referenced link; it's odd that XMLMerge would not work as expected. Your example seems straightforward. Did you read the section entitled Using XPath declarations with XmlMerge? Using the example, try to set up an XPath for results and set it to merge. If I'm reading the doc correctly, it would look something like this:
XPath.resultsNode=results
action.resultsNode=MERGE

You might be able to write a Java app that deserializes the XML documents into objects, then "merge" the individual objects programmatically into a collection. You can then serialize the collection object back out to an XML file with everything "merged".
The JAXB API has some tools that can convert an XML document/schema into Java classes. The "xjc" tool might be able to do this, although I can't remember if you can create classes directly from the XML doc, or if you have to generate a schema first. There are tools out there that can generate a schema from an XML doc.
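As a rough illustration, the nested classes below are hand-written, heavily simplified stand-ins for what xjc would generate; only the <result> part of the structure is modelled, so anything else in the input would be dropped on re-serialization:
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class JaxbMergeSketch {

    @XmlRootElement(name = "run")
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Run {
        public Host host;
    }

    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Host {
        public Results results;
    }

    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Results {
        @XmlElement(name = "result")
        public List<Result> result = new ArrayList<Result>();
    }

    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Result {
        @XmlAttribute
        public String id;
    }

    public static void main(String[] args) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(Run.class);
        Unmarshaller u = ctx.createUnmarshaller();

        Run a = (Run) u.unmarshal(new File("merge1.xml"));
        Run b = (Run) u.unmarshal(new File("merge2.xml"));

        // "merge" at the object level by appending b's results to a's
        a.host.results.result.addAll(b.host.results.result);

        Marshaller m = ctx.createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        m.marshal(a, System.out);
    }
}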
Hope this helps... not sure if this is what you were looking for.

In addition to using StAX (which does make sense), it'd probably be easier with StaxMate (http://staxmate.codehaus.org/Tutorial). Just create two SMInputCursors, and child cursors if need be, and then do a typical merge sort with the two cursors. It is similar to traversing DOM documents in a recursive-descent manner.

So, you're only interested in merging the 'results' elements? Everything else is ignored? The fact that input0 has an <info type="a"/> and input1 has an <info type="b"/> and the expected result has an <info type="a"/> seems to suggest this.
If you're not worried about scaling and you want to solve this problem quickly then I would suggest writing a problem-specific bit of code that uses a simple library like JDOM to consider the inputs and write the output result.
Attempting to write a generic tool that was 'smart' enough to handle all of the possible merge cases would be pretty time consuming - you'd have to expose a configuration capability to define merge rules. If you know exactly what your data is going to look like and you know exactly how the merge needs to be executed then I would imagine your algorithm would walk each XML input and write to a single XML output.

You can try dom4j, which provides a very good means of extracting information using XPath queries and also allows you to write XML very easily. You just need to play around with the API for a while to do the job.
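For example, a small sketch of what that might look like with dom4j 2.x; the XPath expressions follow the structure in the question, and the file names are placeholders:
import java.io.File;
import java.util.List;

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.Node;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;

public class Dom4jMergeSketch {
    public static void main(String[] args) throws Exception {
        SAXReader reader = new SAXReader();
        Document base = reader.read(new File("merge1.xml"));
        Document extra = reader.read(new File("merge2.xml"));

        // the <results> element of the first document receives the extra results
        Element target = (Element) base.selectSingleNode("/run/host/results");

        List<Node> extraResults = extra.selectNodes("/run/host/results/result");
        for (Node result : extraResults) {
            result.detach();               // remove from the second document...
            target.add((Element) result);  // ...and append to the first
        }

        XMLWriter writer = new XMLWriter(System.out, OutputFormat.createPrettyPrint());
        writer.write(base);
    }
}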

Sometimes you just need to concatenate XML files into one, for example files with a similar structure, like this:
File xml1:
<root>
<level1>
...
</level1>
<!--many records-->
<level1>
...
</level1>
</root>
File xml2:
<root>
<level1>
...
</level1>
<!--many records-->
<level1>
...
</level1>
</root>
In this case, the following procedure, which uses the JDOM2 library, can help you:
void concatXML(Path fSource, Path fDest) {
    SAXBuilder jdomBuilder = new SAXBuilder();
    List<Element> elems = new LinkedList<Element>();
    try {
        Document jdomSource = jdomBuilder.build(fSource.toFile());
        Document jdomDest = jdomBuilder.build(fDest.toFile());

        // the destination root will receive the records from the source file
        Element root = jdomDest.getRootElement();
        root.detach();

        // name of the repeating element (e.g. "level1"); content index 1 skips
        // the whitespace text node that precedes the first child element
        String sourceNextElementName =
                ((Element) jdomSource.getRootElement().getContent().get(1)).getName();

        // collect all matching elements from the source, detach them
        // and append them to the destination root
        for (Element record : jdomSource.getRootElement()
                .getDescendants(new ElementFilter(sourceNextElementName)))
            elems.add(record);
        for (Element elem : elems)
            elem.detach();
        root.addContent(elems);

        Document newDoc = new Document(root);
        XMLOutputter xmlOutput = new XMLOutputter();
        xmlOutput.setFormat(Format.getPrettyFormat()); // set the format before writing
        xmlOutput.output(newDoc, System.out);
        xmlOutput.output(newDoc,
                Files.newBufferedWriter(fDest, Charset.forName("UTF-8")));
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Have you considered just not bothering with parsing the XML "properly" and just treating the files as big long strings and using boring old things such as hash maps and regular expressions...? This could be one of those cases where the fancy acronyms with X in them just make the job fiddlier than it needs to be.
Obviously this does depend a bit on how much data you actually need to parse out while doing the merge. But by the sound of things, the answer to that is not much.

Related

Accessing unparsed entities in XSLT with a SAXTransformerFactory and TransformerHandlers

I have some trouble retrieving unparsed entity URIs with the XPath function unparsed-entity-uri().
I'm using a SAXTransformerFactory as in the "Efficient XSLT pipeline in Java" question, because I need to perform a transformation chain (i.e. apply several XSLT transformations, and use the result of one transformation as input for the next).
I discovered I'm unable to retrieve unparsed entities with the code below. Actually it works well with Xalan, but not with Saxon-HE (version 9.7.0) - and I need Saxon because I'd rather use XSLT 2.0 (even if there's nothing specific to XSLT 2 in the code below; it's only for the sake of providing an example). It also works with Saxon if I don't use a TransformerHandler, e.g. stf.newTransformer(new StreamSource("transfo.xsl")).transform(new StreamSource("input.xml"), new StreamResult(System.out)) will produce the desired output.
Is there a configuration step that I forgot?
// use "org.apache.xalan.processor.TransformerFactoryImpl" for Xalan
String transformerFactoryClassName = "net.sf.saxon.TransformerFactoryImpl";
SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance(transformerFactoryClassName,
LaunchSimpleTransformationUnparsedEntities.class.getClassLoader());
try {
TransformerHandler thTransf = stf
.newTransformerHandler(new StreamSource("transfo.xsl"));
// output the result in console
thTransf.setResult(new StreamResult(System.out));
// Launch transformation of input.xml
Transformer t = stf.newTransformer();
t.transform(new StreamSource("input.xml"),
new SAXResult(thTransf));
} catch (TransformerConfigurationException e) {
e.printStackTrace();
} catch (TransformerException e) {
e.printStackTrace();
}
In input, I have (for input.xml):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE book
[<!ENTITY cover_hadrien SYSTEM "images/covers/cover_hadrien.jpg" NDATA jpeg>]>
<book>
<title>Les mémoires d'Hadrien</title>
<author>Marguerite Yourcenar</author>
<cover imgref="cover_hadrien" />
</book>
and a sample XSLT (for transfo.xsl):
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="cover">
<xsl:copy>
<xsl:value-of select="unparsed-entity-uri(#imgref)"/>
</xsl:copy>
</xsl:template>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
as a result, I would expect something like:
<?xml version="1.0" encoding="UTF-8"?><book>
<title>Les mémoires d'Hadrien</title>
<author>Marguerite Yourcenar</author>
<cover>images/covers/cover_hadrien.jpg</cover>
</book>
but <cover> is empty when performing the transformation with Saxon.
Interesting observation. The issue in fact is not with Saxon's TransformerHandler, but rather with the "identity transformer" obtained using SAXTransformerFactory.newTransformer(): the identity transformer is not passing unparsed entities down the line. This is essentially because Saxon's identity transformer is reusing parts of the XSLT engine, and XSLT does not provide any way for a transformation to output unparsed entities in the result. If you sent the SAX parser output directly to the TransformerHandler, rather than going via an identity transformer, then I think it would all work.
As with all things JAXP-related, the specification of SAXTransformerFactory.newTransformer() is infuriatingly vague. All it says is that the returned Transformer performs a copy of the Source to the Result. i.e. the "identity transform". What exactly counts as a copy? I think Saxon's interpretation has been that it is equivalent to the effect of doing an XSLT identity transform - which would lose unparsed entities (as well as other things like CDATA sections, the DTD, etc).
Incidentally XSLT 2.0 specifies that the result of unparsed-entity-uri() should be an absolute URI (XSLT 1.0 doesn't say anything on the subject) so even if this is fixed, the Saxon output will be different.
Entered as a Saxon issue here: https://saxonica.plan.io/issues/3201 I think we need to be a bit careful about passing unparsed entities to a SAXResult if we don't pass all the other events expected by a SAX DTDHandler - and we're certainly not going to change the Saxon identity transformer to retain things (like DTD declarations) that aren't modelled in XDM.
Indeed, following @MichaelKay's details, launching the transformation that way works properly:
// launch transformation of input.xml
XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setContentHandler(thTransf);
reader.setDTDHandler(thTransf);
reader.parse(new InputSource("input.xml"));
(this replaces the following lines:
// Launch transformation of input.xml
Transformer t = stf.newTransformer();
t.transform(new StreamSource("input.xml"),
new SAXResult(thTransf));
that were used initially).

Java XML Programming - Extracting the Child nodes

I have an XML file like the one below. I need to extract all the child nodes under logdata, and all the sub-child nodes under each of those child nodes, along with their values. How can I extract these?
<logdata>
<Request RequestID="123" RequestType = "Read">
<Data Mode = "Read">
<Type>ReadWrite</Type>
</Data>
<Textdetails Eligible = "true">
<Code>1</Code>
<Name>ABC</Name>
</Textdetails>
</Request>
<Request RequestID="456" RequestType = "Read">
<Data Mode = "Read">
<Type>ReadWrite</Type>
</Data>
<Textdetails Eligible = "true">
<Code>2</Code>
<Name>DEF</Name>
</Textdetails>
</Request>
</logdata>
Using the XOM Library this would be rather simple. All you would need is to build the Document from a Builder. Then get the root element (logdata) using getRootElement(). After that you can use getChildElements() to get all the child elements from logdata and any other Element.
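A rough sketch of that approach, assuming the sample is saved as logdata.xml; walking recursively and printing is just one way to get at the child nodes and their values:
import java.io.File;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Elements;

public class XomWalkSketch {
    public static void main(String[] args) throws Exception {
        Document doc = new Builder().build(new File("logdata.xml"));
        Element logdata = doc.getRootElement();

        Elements requests = logdata.getChildElements(); // all <Request> elements
        for (int i = 0; i < requests.size(); i++) {
            Element request = requests.get(i);
            System.out.println("Request id=" + request.getAttributeValue("RequestID"));
            printChildren(request, "  ");
        }
    }

    // recursively print each child element's name and text value
    private static void printChildren(Element parent, String indent) {
        Elements children = parent.getChildElements();
        for (int i = 0; i < children.size(); i++) {
            Element child = children.get(i);
            System.out.println(indent + child.getLocalName() + " = " + child.getValue());
            printChildren(child, indent + "  ");
        }
    }
}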

Doing DOM Node-to-String transformation, but with namespace issues

So we have an XML Document with custom namespaces. (The XML is generated by software we don't control. It's parsed by a namespace-unaware DOM parser; standard Java7SE/Xerces stuff, but also outside our effective control.) The input data looks like this:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<MainTag xmlns="http://BlahBlahBlah" xmlns:CustomAttr="http://BlitherBlither">
.... 18 blarzillion lines of XML ....
<Thing CustomAttr:gibberish="borkborkbork" ... />
.... another 27 blarzillion lines ....
</MainTag>
The Document we get is usable and xpath-queryable and traversable and so on.
Converting this Document into a text format for writing out to a data sink uses the standard Transformer approach described in a hundred SO "how do I change my XML Document into a Java string?" questions:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter stringwriter = new StringWriter();
transformer.transform (new DOMSource(theXMLDocument), new StreamResult(stringwriter));
return stringwriter.toString();
and it works perfectly.
But now I'd like to transform individual arbitrary Nodes from that Document into strings. A DOMSource constructor accepts Node pointers just the same as it accepts a Document (and in fact Document is just a subclass of Node, so it's the same API as far as I can tell). So passing in an individual Node in the place of "theXMLDocument" in the snippet above works great... until we get to the Thing.
At that point, transform() throws an exception:
java.lang.RuntimeException: Namespace for prefix 'CustomAttr' has not been declared.
at com.sun.org.apache.xml.internal.serializer.SerializerBase.getNamespaceURI(Unknown Source)
at com.sun.org.apache.xml.internal.serializer.SerializerBase.addAttribute(Unknown Source)
at com.sun.org.apache.xml.internal.serializer.ToUnknownStream.addAttribute(Unknown Source)
......
That makes sense. (The "com.sun.org.apache" is weird to read, but whatever.) It makes sense, because the namespace for the custom attribute was declared at the root node, but now the transformer is starting at a child node and can't see the declarations "above" it in the tree. So I think I understand the problem, or at least the symptom, but I'm not sure how to solve it though.
If this were a String-to-Document conversion, we'd be using a DocumentBuilderFactory instance and could call .setNamespaceAware(false), but this is going in the other direction.
None of the available properties for transformer.setOutputProperty() affect the namespaceURI lookup, which makes sense.
There is no such corresponding setInputProperty or similar function.
The input parser wasn't namespace aware, which is how the "upstream" code got as far as creating its Document to hand to us. I don't know how to hand that particular status flag on to the transforming code, which is what I really would like to do, I think.
I believe it's possible to (somehow) add a xmlns:CustomAttr="http://BlitherBlither" attribute to the Thing node, the same as the root MainTag had. But at that point the output is no longer identical XML to what was read in, even if it "means" the same thing, and the text strings are eventually going to be compared in the future. We wouldn't know if it were needed until the exception got thrown, then we could add it and try again... ick. For that matter, changing the Node would alter the original Document, and this really ought to be a read-only operation.
Advice? Is there some way of telling the Transformer, "look, don't stress your dimwitted little head over whether the output is legit XML in isolation, it's not going to be parsed back in on its own (but you don't know that), just produce the text and let us worry about its context"?
Given your quoted error message "Namespace for prefix 'CustomAttr' has not been declared.",
I'm assuming that your pseudo code is along the lines of:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<MainTag xmlns="http://BlahBlahBlah" xmlns:CustomAttr="http://BlitherBlither">
.... 18 blarzillion lines of XML ....
<Thing CustomAttr:attributeName="borkborkbork" ... />
.... another 27 blarzillion lines ....
</MainTag>
With that assumption, here's my suggestion:
So you want to extract the "Thing" node from the "big" XML. The standard approach is to use a little XSLT to do that. You prepare the XSL transformation with:
Transformer transformer = transformerFactory.newTransformer(new StreamSource(new File("isolate-the-thing-node.xslt")));
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setParameter("elementName", stringWithCurrentThing); // parameterize transformation for each Thing
...
EDIT: @Ti, please note the parameterization instruction above (and below in the XSLT).
The file 'isolate-the-thing-node.xslt' could be a flavour of the following:
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:custom0="http://BlahBlahBlah"
    xmlns:custom1="http://BlitherBlither"
    version="1.0">
    <xsl:param name="elementName">to-be-parameterized</xsl:param>
    <xsl:output encoding="utf-8" indent="yes" method="xml" omit-xml-declaration="no" />
    <xsl:template match="/*" priority="2" >
        <!--<xsl:apply-templates select="//custom0:Thing" />-->
        <!-- changed to parameterized selection: -->
        <xsl:apply-templates select="custom0:*[local-name()=$elementName]" />
    </xsl:template>
    <xsl:template match="node() | @*" priority="1">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*" />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
Hope that gets you over the "Thing" thing :)
I have managed to parse the provided document, get the Thing node and print it without issues.
Take a look at the Working Example:
Node rootElement = d.getDocumentElement();
System.out.println("Whole document: \n");
System.out.println(nodeToString(rootElement));
Node thing = rootElement.getChildNodes().item(1);
System.out.println("Just Thing: \n");
System.out.println(nodeToString(thing));
nodeToString:
private static String nodeToString(Node node) {
StringWriter sw = new StringWriter();
try {
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.transform(new DOMSource(node), new StreamResult(sw));
} catch (TransformerException te) {
System.out.println("nodeToString Transformer Exception");
}
return sw.toString();
}
Output:
Whole document:
<?xml version="1.0" encoding="UTF-8"?><MainTag xmlns="http://BlahBlahBlah" xmlns:CustomAttr="http://BlitherBlither">
<Thing CustomAttr="borkborkbork"/>
</MainTag>
Just Thing:
<?xml version="1.0" encoding="UTF-8"?><Thing CustomAttr="borkborkbork"/>
When I try the same code with CustomAttr:attributeName as suggested by @marty, it fails with the original exception, so it looks like somewhere in your original XML you are prefixing an attribute or node with that custom CustomAttr namespace.
In the latter case you can work around the problem with setNamespaceAware(true), which will include the namespace information on the Thing node itself:
<?xml version="1.0" encoding="UTF-8"?><Thing xmlns:CustomAttr="http://BlitherBlither" CustomAttr:attributeName="borkborkbork" xmlns="http://BlahBlahBlah"/>

Using regexp in java to modify an xml

I'm trying to change an xml by using regular expressions in java, but I can't find the right way. I have an xml like this (simplified):
<ROOT>
<NODE ord="1" />
<NODE ord="3,2" />
</ROOT>
The XML actually represents a sentence with its nodes, chunks, etc. in two languages and has more attributes. Each sentence is loaded into two RichTextAreas (one for the source sentence, and the other for the translated one).
What I need to do is add a style attribute to every node that has a specific value in its ord attribute (this style attribute will show correspondences between the two languages, like Google Translate does when you mouse over a word). I know this could be done using DOM (getting all the NODE nodes and then checking the ord attribute one by one), but I am looking for the fastest way to do the change, as it is going to execute on the client side of my GWT app.
When the ord attribute has a single value (like in the first node) it is easy to do, just taking the XML as a string and using the replaceAll() function. The problem is when the attribute has composed values (like in the second node).
For example, how could I add that attribute if the value I'm looking for is 2? I believe this could be done using regular expressions, but I can't find out how. Any hint or help would be appreciated (even if it doesn't use regexp and the replaceAll function).
Thanks in advance.
XPath can do this for you. You could select:
/ROOT/NODE[contains(concat(',', @ord, ','), ',2,')]
Since you intend to use GWT on the client, you could give gwtxslt a try. With it you could specify an XSLT stylesheet to do the transformation (i.e. adding the attribute) for you:
XsltProcessor processor = new XsltProcessor();
processor.importStyleSheet(styleSheetText);
processor.importSource(sourceText);
processor.setParameter("ord", "2");
processor.setParameter("style", "whatever");
String resultString = processor.transform();
// do something with resultString
where styleSheetText could be an XSLT document along the lines of
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:param name="ord" select="''" />
<xsl:param name="style" select="''" />
<xsl:template match="node()|#*">
<xsl:copy>
<xsl:apply-templates select="node()|#*" />
</xsl:copy>
</xsl:template>
<xsl:template match="NODE">
<xsl:copy>
<xsl:apply-templates select="#*" />
<xsl:if test="contains(concat(',', #ord, ','), concat(',', $ord, ','))">
<xsl:attribute name="style">
<xsl:value-of select="$style" />
</xsl:attribute>
</xsl:if>
<xsl:apply-templates select="node()" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Note that I use concat() to prevent partial matches in the comma-separated list that the attribute value of @ord actually is.
String resultString = subjectString.replaceAll("<NODE ord=\"([^\"]*\\b2\\b[^\"]*)\" />", "<NODE ord=\"$1\" style=\"whatever\"/>");
will find any <NODE> tag that has a single ord attribute with a value of "2" (or "1,2" or "2,3" or "1,2,3" but not "12") and adds a style attribute.
This is quick and dirty, and rightfully advised against by many here, but for a one-off quick job it should be OK.
Explanation:
<NODE ord=" # Match <NODE ord:" verbatim
( # Match and capture...
[^"]* # any number of characters except "
\b2\b # "2" as a whole word (surrounded by non-alphanumerics)
[^"]* # any number of characters except "
) # End of capturing group
" /> # Match " /> verbatim
I'm trying to change an xml by using regular expressions in java, but I can't find the right way.
That's because there isn't a right way. Regular expressions are not the right way to manipulate XML. That's because XML is not a regular grammar (which is a technical term in computer science, not a generalized insult.)
It might sound like overkill, but I'd consider using the standard DOM parsers to read the fragment, modify it using setAttribute() calls, and then write it out again. I know you said that efficiency is important, but how long does this really take? Testing shows 60ms on my ageing 2GHz pentium.
This approach will be more robust against comments, things split across lines etc. It is also much more likely to give you well-formed XML. Also things like your requirement of only doing it if certain values are present will become trivial.
public class AddStyleExample {
public static void main(final String[] args) {
String input = "<ROOT> <NODE ord=\"1\" /> <NODE ord=\"3,2\" /> </ROOT>";
try {
final DocumentBuilderFactory factory = DocumentBuilderFactory
.newInstance();
factory.setValidating(false);
factory.setNamespaceAware(false);
DocumentBuilder builder;
builder = factory.newDocumentBuilder();
final Document doc = builder.parse(new InputSource(
new StringReader(input)));
NodeList tags = doc.getElementsByTagName("NODE");
for (int i = 0; i < tags.getLength(); i++) {
Element node = (Element) tags.item(i);
node.setAttribute("style", "example value");
}
StringWriter writer = new StringWriter();
final StreamResult result = new StreamResult(writer);
final Transformer t = TransformerFactory.newInstance()
.newTransformer();
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
t.transform(new DOMSource(doc), result);
System.out.println(writer.toString());
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (TransformerException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}

Java replace in XML-file

I have created my own XML-file on my Android phone, which looks similar to this
<?xml version="1.0" encoding="utf-8" ?>
<backlogs>
<issue id="1">
<backlog id="0" name="Linux" swid="100" />
<backlog id="0" name="Project Management" swid="101" />
</issue>
<issue id="2">
<backlog id="0" name="Tests" swid="110" />
<backlog id="0" name="Online test" swid="111" />
<backlog id="0" name="Test build" swid="112" />
<backlog id="0" name="Update" swid="113" />
</issue>
</backlogs>
I have then converted it into a String so that I can do the replacement inside the string using a regular expression, but I have a problem with the regular expression. The regular expression I just created looks like this:
([\n\r]*)<(.*)issue(.*)1(.*)([\n\r]*)(.*)([\n\r]*)(.*)([\n\r]*)(.*)<(.*)/(.*)issue(.*)
I need to replace a specific issue tag (located via its specific ID) with another issue tag held in another String.
The regular expression works fine for the tag with ID 1, but not for the one with ID 2, as it contains a different number of tags. Is there any way to get around depending on the number of tags?
I hope you understand my question.
I finally found a solution for my question, which is
([\n\r]*)<(.*)issue(.*)1[\S\s]*?<(.*)/(.*)issue(.*)
Do not use regex. Please. Use an XML parser.
Do you know what the highest-voted SO answer is?
Use a SAX (or StAX) parser and writer at the same time.
As you read one event, detect whether to write the same event type to the writer without modification, or to make some modifications in the state you are currently in - like swapping an element name or attribute value. This will handle an unlimited number of elements at the expense of CPU usage; in general it will be pretty lightweight.
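For illustration, a rough javax.xml.stream sketch along those lines; the file names and the "rewrite the id attribute of each issue element" modification are placeholders for whatever change you actually need:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class StaxRewriteSketch {
    public static void main(String[] args) throws Exception {
        XMLEventFactory ef = XMLEventFactory.newInstance();
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream("backlogs.xml"));
        XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(new FileOutputStream("backlogs-out.xml"), "UTF-8");

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && "issue".equals(event.asStartElement().getName().getLocalPart())) {
                // rebuild the <issue> start tag with a modified id attribute
                StartElement se = event.asStartElement();
                List<Attribute> attrs = new ArrayList<Attribute>();
                for (Iterator<?> it = se.getAttributes(); it.hasNext();) {
                    Attribute a = (Attribute) it.next();
                    if ("id".equals(a.getName().getLocalPart())) {
                        attrs.add(ef.createAttribute(a.getName(), "replaced-" + a.getValue()));
                    } else {
                        attrs.add(a);
                    }
                }
                writer.add(ef.createStartElement(se.getName(), attrs.iterator(), se.getNamespaces()));
            } else {
                writer.add(event); // everything else passes through unchanged
            }
        }
        reader.close();
        writer.close();
    }
}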
