How to accept revisions / track changes (ins/del) in a docx? - java

In MS Word 2010 there is an option under File -> Information to check the document for problems before sharing it. This makes it possible to accept tracked changes (resolving to the newest version) and remove all comments and annotations from the document at once.
Is this possibility available in docx4j as well, or do I need to investigate the corresponding JAXB objects and write a traversal finder?
Doing that manually could be a lot of work, since I would have to add the RunIns (w:ins) to the R (w:r) and remove the RunDel (w:del). I also once saw a w:del inside a w:ins. I don't know whether this also appears the other way round or in deeper nestings.
Further research brought this XSLT up:
https://github.com/plutext/docx4all/blob/master/docx4all/src/main/java/org/docx4all/util/ApplyRemoteChanges.xslt
I was not able to run this within docx4j, but I could apply it by manually unzipping the docx and extracting document.xml. After applying the XSLT to the plain document.xml, I wrapped it in the docx container again to open it with MS Word. The result was not the same as accepting the revisions in MS Word itself. More concretely: the XSLT removed the text marked as deleted (in a table), but not the list bullet before the text. This appears quite often in my document.
If this request is not possible to solve in an easy manner, I will change the constraints. It is sufficient for me to have a method for getting all text of a ContentAccessor as a String. The ContentAccessor could be a P or Tc. The String shall be inside an R there, or inside a RunIns (with an R inside of that). For this I have a half solution below. The interesting part starts at the line else if (child instanceof RunIns) {. But as mentioned above, I'm not sure how nested del/ins elements might appear and whether this will handle them well. And the results are still not the same as if I had prepared the document with MS Word beforehand.
// Similar to:
// http://www.docx4java.org/forums/docx-java-f6/how-to-get-all-text-element-of-a-paragraph-with-docx4j-t2028.html
private String getAllTextfromParagraph(ContentAccessor ca) {
    StringBuilder result = new StringBuilder();
    for (Object child : ca.getContent()) {
        child = XmlUtils.unwrap(child);
        if (child instanceof Text) {
            result.append(((Text) child).getValue());
        } else if (child instanceof R) {
            result.append(getTextFromRun((R) child));
        } else if (child instanceof RunIns) {
            // Text of inserted runs (w:ins) is included; other children are ignored.
            RunIns ins = (RunIns) child;
            for (Object obj : ins.getCustomXmlOrSmartTagOrSdt()) {
                if (obj instanceof R) {
                    result.append(getTextFromRun((R) obj));
                }
            }
        }
    }
    return result.toString().trim();
}

private String getTextFromRun(R run) {
    StringBuilder result = new StringBuilder();
    for (Object o : run.getContent()) {
        o = XmlUtils.unwrap(o);
        if (o instanceof R.Tab) {
            result.append("\t");
        } else if (o instanceof R.SoftHyphen) {
            result.append("\u00AD");
        } else if (o instanceof Br) {
            result.append(" ");
        } else if (o instanceof Text) {
            result.append(((Text) o).getValue());
        }
    }
    return result.toString();
}

https://github.com/plutext/docx4j/commit/309a8e4008553452ebe675e81def30aab97542a2?w=1 adds a method for transforming just one Part, and sample code to use it to accept changes.
The XSLT is just what you found (relicensed as Apache 2):
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:WX="http://schemas.microsoft.com/office/word/2003/auxHint"
xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:ext="http://www.xmllab.net/wordml2html/ext"
xmlns:java="http://xml.apache.org/xalan/java"
xmlns:xml="http://www.w3.org/XML/1998/namespace"
version="1.0"
exclude-result-prefixes="java msxsl ext o v WX aml w10">
<xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no" indent="yes" />
<xsl:template match="/ | @*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="w:del" />
<xsl:template match="w:ins" >
<xsl:apply-templates select="*"/>
</xsl:template>
</xsl:stylesheet>
You'll need to add support for the other elements identified in the MSDN link. If you do that, I'd be happy to get a pull request
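For readers who just want to try the stylesheet outside docx4j, here is a minimal sketch using only the JDK's built-in XSLT 1.0 processor. It does not use the docx4j method added in the commit above; the class name AcceptChangesSketch, the method acceptChanges and the sample paragraph are made up for illustration, and the stylesheet is a trimmed copy of the one shown (identity template, drop w:del, unwrap w:ins).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class AcceptChangesSketch {

    private static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Identity transform plus two overrides: drop w:del, unwrap w:ins.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:w='" + W + "'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='w:del'/>"
      + "<xsl:template match='w:ins'><xsl:apply-templates select='*'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical helper: takes the text of document.xml, returns the transformed XML.
    static String acceptChanges(String documentXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(documentXml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample paragraph with one insertion and one deletion.
        String sample = "<w:p xmlns:w='" + W + "'>"
            + "<w:r><w:t>kept </w:t></w:r>"
            + "<w:ins><w:r><w:t>added</w:t></w:r></w:ins>"
            + "<w:del><w:r><w:delText>removed</w:delText></w:r></w:del>"
            + "</w:p>";
        System.out.println(acceptChanges(sample));
    }
}
```

Because templates are applied recursively, a w:del nested inside a w:ins is dropped as well, and a w:ins nested inside a w:del disappears together with its parent. Other revision markup is left untouched by this sketch.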

Related

How to update value in xml using Java

How do you edit values on xml that has been appended to a stringbuilder?
We have an XML file looking like the following, which we eventually read in Java:
<?xml version="1.0" encoding="UTF-8"?>
<urn:receive
xmlns:urn="urn:xxx"
xmlns:ns="xxx"
xmlns:ns1="xxx"
xmlns:urn1="urn:xxx">
<urn:give>
<urn:giveNumber>
<ns1:number>12345678</ns1:number>
</urn:giveNumber>
<urn:giveDates>
<urn1:dateFrom>2021-07-01</urn1:dateFrom>
<urn1:dateTo>2021-09-30</urn1:dateTo>
</urn:giveDates>
</urn:give>
</urn:receive>
The following is a snippet of code that we use to read an xml file by appending to a stringbuilder and eventually saving it to a string with .toString(). Do notice that there is an int for number and string for startDate and for endDate. These values must be inserted into the xml, and replace the number and dates. Keep in mind that we are not allowed to edit the xml file.
public class test {
// Logger to print output in commandprompt
private static final Logger LOGGER = Logger.getLogger(test.class.getName());
public void changeDate() {
int number = 44444444;
String startDate = "2021-01-01";
String endDate = "2021-03-31";
try {
// the XML file for this example
File xmlFile = new File("requests/dates.xml");
Reader fileReader = new FileReader(xmlFile);
BufferedReader bufReader = new BufferedReader(fileReader);
StringBuilder sb = new StringBuilder();
String line = bufReader.readLine();
while( line != null ) {
sb.append(line).append("\n");
line = bufReader.readLine();
}
String request = sb.toString();
LOGGER.info("Request" + request);
} catch (Exception e) {
e.printStackTrace();
}
}
}
How do we replace the number and dates in the xml with number, startDate and endDate, but without editing the xml file?
LOGGER.info("Request" + request); should print the following:
<?xml version="1.0" encoding="UTF-8"?>
<urn:receive
xmlns:urn="urn:xxx"
xmlns:ns="xxx"
xmlns:ns1="xxx"
xmlns:urn1="urn:xxx">
<urn:give>
<urn:giveNumber>
<ns1:number>44444444</ns1:number>
</urn:giveNumber>
<urn:giveDates>
<urn1:dateFrom>2021-01-01</urn1:dateFrom>
<urn1:dateTo>2021-03-31</urn1:dateTo>
</urn:giveDates>
</urn:give>
</urn:receive>
Simple answer: you don't.
You need to parse the XML, and parsing the XML can be done perfectly easily by supplying the parser with the file name; reading the XML into a StringBuilder first is pointless effort.
The easiest way to make a small change to an XML document is to use XSLT, which can be easily invoked from Java. Java comes with an XSLT 1.0 processor built in. XSLT 1.0 is getting rather ancient and you might prefer to use XSLT 3.0 which is much more powerful but requires a third-party library; but for a simple job like this, 1.0 is quite adequate. The stylesheet needed consists of a general rule that copies things unchanged:
<xsl:template match="*">
<xsl:copy><xsl:apply-templates/></xsl:copy>
</xsl:template>
and then a couple of rules for changing the things you want to change:
<xsl:param name="number"/>
<xsl:param name="startDate"/>
<xsl:param name="endDate"/>
<xsl:template match="ns1:number/text()" xmlns:ns1="xxx">
<xsl:value-of select="$number"/>
</xsl:template>
<xsl:template match="urn1:dateFrom/text()" xmlns:urn1="urn:xxx">
<xsl:value-of select="$startDate"/>
</xsl:template>
<xsl:template match="urn1:dateTo/text()" xmlns:urn1="urn:xxx">
<xsl:value-of select="$endDate"/>
</xsl:template>
and then you just run the transformation from Java as described at https://docs.oracle.com/javase/tutorial/jaxp/xslt/transformingXML.html, supplying values for the parameters.
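As a rough illustration of that last step, the sketch below runs such a stylesheet with the JDK's built-in processor and passes the three values through Transformer.setParameter. The class name ReplaceValuesSketch, the method replace and the inlined stylesheet and sample document are assumptions made to keep the example self-contained; in real use the source would more likely be new StreamSource(new File("requests/dates.xml")).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ReplaceValuesSketch {

    // Copy rule for elements plus one rule per value to replace.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:ns1='xxx' xmlns:urn1='urn:xxx'>"
      + "<xsl:param name='number'/>"
      + "<xsl:param name='startDate'/>"
      + "<xsl:param name='endDate'/>"
      + "<xsl:template match='*'>"
      + "<xsl:copy><xsl:apply-templates/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='ns1:number/text()'>"
      + "<xsl:value-of select='$number'/></xsl:template>"
      + "<xsl:template match='urn1:dateFrom/text()'>"
      + "<xsl:value-of select='$startDate'/></xsl:template>"
      + "<xsl:template match='urn1:dateTo/text()'>"
      + "<xsl:value-of select='$endDate'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical helper: returns the request with the three values replaced.
    static String replace(String xml, String number, String startDate, String endDate) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            // Parameter names must match the xsl:param names in the stylesheet.
            t.setParameter("number", number);
            t.setParameter("startDate", startDate);
            t.setParameter("endDate", endDate);
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shortened copy of the request from the question.
        String xml = "<urn:receive xmlns:urn='urn:xxx' xmlns:ns1='xxx' xmlns:urn1='urn:xxx'>"
            + "<urn:give><urn:giveNumber><ns1:number>12345678</ns1:number></urn:giveNumber>"
            + "<urn:giveDates><urn1:dateFrom>2021-07-01</urn1:dateFrom>"
            + "<urn1:dateTo>2021-09-30</urn1:dateTo></urn:giveDates>"
            + "</urn:give></urn:receive>";
        System.out.println(replace(xml, "44444444", "2021-01-01", "2021-03-31"));
    }
}
```

The file on disk is never modified; only the transformation result carries the new values.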

Converting raw data into customised xml

I want to convert the raw file into the below format using Java -
Raw Input:
state | abc
country | FR-FRA
Output:
<data attr ="StateFr">abc</data>
<data attr ="country">FR-FRA</data>
State attribute should be appended with the country code shown above. Can someone help me regarding this.
Java Stream API can help
String raw ="name1|value1\n" +
"name2|value2";
String template = "<data attribute=\"%s\">%s</data>";
String output = Arrays.stream(raw.split("\n"))
.map(rawPair -> rawPair.split("\\|"))
.map(pair -> String.format(template, pair[0], pair[1]))
.collect(Collectors.joining("\n"));
will output
<data attribute="name1">value1</data>
<data attribute="name2">value2</data>
But specific business logic requires a bit more work. Get the country code first, then decorate your attribute name during stream processing:
BiFunction<String, String, String> decorate = (String name, String code) -> {
if ("state".equals(name)) {
return name + code;
} else {
return name;
}
};
Function<String, String> countryCode = (String source) -> {
String head = "country|";
int start = source.indexOf(head) + head.length();
return source.substring(start, start + 2);
};
String code = countryCode.apply(raw);
...
.map(pair -> String.format(template, decorate.apply(pair[0], code), pair[1]))
...
With the new requirements:
the raw file is large;
the country code comes on the line after the state;
the file is read line by line;
transformed entries must be output in the same order as they appear in the raw source.
You should:
recognize the state entry and hold it back without emitting it yet;
recognize the subsequent country entry, update the held state entry, and release both the state and country entries.
So here I use a sort of shallow buffer for this role:
String raw = "name|value1\n" +
"state|some-state1\n" +
"country|fr-fra\n" +
"name|value2\n" +
"state|some-state2\n" +
"country|en-us\n";
class ShallowBuffer {
private String stateKey = "state";
private String countryKey = "country";
private String[] statePairWaitingForCountryCode = null;
private List<String[]> pump(String[] pair) {
if (stateKey.equals(pair[0])) {
statePairWaitingForCountryCode = pair;
return Collections.emptyList();
}
if (countryKey.equals(pair[0])) {
statePairWaitingForCountryCode[0] = statePairWaitingForCountryCode[0] + pair[1].substring(0, 2);
String[] stateRelease = statePairWaitingForCountryCode;
statePairWaitingForCountryCode = null;
return Arrays.asList(stateRelease, pair);
}
return Collections.singletonList(pair);
}
}
ShallowBuffer patience = new ShallowBuffer();
String template = "<data attribute=\"%s\">%s</data>";
String output = Arrays.stream(raw.split("\n"))
.map(rawPair -> rawPair.split("\\|"))
.map(patience::pump)
.flatMap(Collection::stream)
.map(pair -> String.format(template, pair[0], pair[1]))
.collect(Collectors.joining("\n"));
this will output
<data attribute="name">value1</data>
<data attribute="statefr">some-state1</data>
<data attribute="country">fr-fra</data>
<data attribute="name">value2</data>
<data attribute="stateen">some-state2</data>
<data attribute="country">en-us</data>
The shallow buffer is mutable, so you cannot use parallel methods in your stream chain.
It also means that making it accessible outside this scope will require synchronization work.
And you still need to capitalize the first letter of the country code.
Run the following XSLT 3.0 stylesheet:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0" expand-text="yes" xmlns:f="f">
<xsl:template name="xsl:initial-template">
<root>
<xsl:iterate select="unparsed-text-lines('input.txt')">
<xsl:param name="prev-parts" select="()"/>
<xsl:on-completion>
<attribute name="{$prev-parts[1]}">{$prev-parts[2]}</attribute>
</xsl:on-completion>
<xsl:variable name="parts" select="tokenize(., '\|')"/>
<xsl:choose>
<xsl:when test="$parts[1] = 'country'">
<attribute name="{f:titleCase($prev-parts[1])}{f:titleCase(substring-before($parts[2], '-'))}">{$prev-parts[2]}</attribute>
</xsl:when>
<xsl:otherwise>
<attribute name="{$prev-parts[1]}">{$prev-parts[2]}</attribute>
</xsl:otherwise>
</xsl:choose>
<xsl:next-iteration>
<xsl:with-param name="prev-parts" select="$parts"/>
</xsl:next-iteration>
</xsl:iterate>
</root>
</xsl:template>
<xsl:function name="f:titleCase">
<xsl:param name="in"/>
<xsl:sequence select="upper-case(substring($in, 1, 1))||substring($in, 2)"/>
</xsl:function>
</xsl:transform>
Note that unlike other solutions presented here, this one will always produce well-formed XML output. (We see an awful lot of problems on StackOverflow from people receiving so-called XML that has been incorrectly generated because it ignores the problem of escaping special characters.)
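The escaping point can also be handled in plain Java by letting an XML API write the output instead of String.format. The following is a minimal sketch using the JDK's XMLStreamWriter, not a translation of the stylesheet above; the class name EscapedDataSketch, the wrapping root element, and the rule "title-case the state key and the first part of the country code" are assumptions based on the question's expected output (StateFr).

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class EscapedDataSketch {

    // "state" -> "State", "FR" -> "Fr"
    static String titleCase(String s) {
        return s.isEmpty() ? s
            : Character.toUpperCase(s.charAt(0)) + s.substring(1).toLowerCase();
    }

    static void writeData(XMLStreamWriter w, String attr, String value)
            throws XMLStreamException {
        w.writeStartElement("data");
        w.writeAttribute("attr", attr);   // attribute value is escaped by the writer
        w.writeCharacters(value);         // text content is escaped by the writer
        w.writeEndElement();
    }

    // Hypothetical helper: converts "key | value" lines into <data> elements.
    static String convert(String raw) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("root");
            String[] pending = null; // state line waiting for its country code
            for (String line : raw.split("\n")) {
                String[] pair = line.split("\\|", 2);
                if (pair.length < 2) {
                    continue; // skip lines without a separator
                }
                String key = pair[0].trim();
                String value = pair[1].trim();
                if (key.equals("state")) {
                    pending = new String[] {key, value};
                    continue;
                }
                if (key.equals("country") && pending != null) {
                    String code = value.split("-")[0];
                    writeData(w, titleCase(pending[0]) + titleCase(code), pending[1]);
                    pending = null;
                }
                writeData(w, key, value);
            }
            if (pending != null) {
                writeData(w, pending[0], pending[1]); // state without a country line
            }
            w.writeEndElement();
            w.flush();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("state | a<b & c\ncountry | FR-FRA"));
    }
}
```

Characters such as < and & in the raw values come out as entity references, so the result stays well-formed.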

Converting tab-delimited text file with multiple columns to XML

I'm trying to programmatically convert a text file with multiple columns of info into an XML file with this format:
<ExampleDataSet>
<Example ExID="AA" exampleCode="AA" exampleDescription="THIS IS AN EXAMPLE DESCRIPTION"/>
<Example ExID="BB" exampleCode="BB" exampleDescription="THIS IS AN EXAMPLE DESCRIPTION"/>
<Example ExID="CC" exampleCode="CCC" exampleDescription="THIS IS AN EXAMPLE DESCRIPTION"/>
<Example ExID="DDD" exampleCode="DD" exampleDescription="THIS IS AN EXAMPLE DESCRIPTION"/>
<Example ExID="EEEE" exampleCode="EE" exampleDescription="THIS IS AN EXAMPLE DESCRIPTION"/>
</ExampleDataSet>
I've found other examples that do similar conversions, but on a simpler level. Could anyone point me in the right direction?
You can manually create an XML document using the below. This example creates an XML document with 1 element and the attributes required.
First, create the xml document itself and append the top level element collection header.
XmlDocument doc = new XmlDocument();
XmlNode node = doc.CreateElement("ExampleDataSet");
doc.AppendChild(node);
Now create a new element row. ( you would need a loop here, 1 per csv row!)
XmlNode eg1 = doc.CreateElement("Example");
Then create each of the attributes of the element and append.
XmlAttribute att1 = doc.CreateAttribute("ExID");
att1.Value = "AA";
XmlAttribute att2 = doc.CreateAttribute("exampleCode");
att2.Value = "AA";
XmlAttribute att3 = doc.CreateAttribute("exampleDescription");
att3.Value = "THIS IS AN EXAMPLE DESCRIPTION";
eg1.Attributes.Append(att3);
eg1.Attributes.Append(att2);
eg1.Attributes.Append(att1);
Finally, append to the parent node.
node.AppendChild(eg1);
You can get the XML string like this if you need it.
string xml = doc.OuterXml;
Or you can save it directly to a file.
doc.Save("C:\\test.xml");
Hope that helps you on your way.
Thanks
In XSLT 3.0 you can write this as, for example:
<xsl:variable name="columns" select="'exId', 'exCode', 'exDesc'"/>
<xsl:template name="xsl:initial-template">
<DataSet>
<xsl:for-each select="unparsed-text-lines('input.csv')">
<xsl:variable name="tokens" select="tokenize(., '\t')"/>
<Example>
<xsl:for-each select="1 to count($tokens)">
<xsl:variable name="i" select="."/>
<xsl:attribute name="{$columns[$i]}" select="$tokens[$i]"/>
</xsl:for-each>
</Example>
</xsl:for-each>
</DataSet>
</xsl:template>
I'm not sure why you tagged the question "Java" and "C#" but you can run this using Saxon-HE called from Java or C# or from the command line.
Using xml linq and assuming first row of the file are the column headers
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.IO;
namespace ConsoleApplication1
{
class Program
{
const string FILENAME = @"c:\temp\test.txt";
static void Main(string[] args)
{
XDocument doc = new XDocument();
doc.Add(new XElement("ExampleDataSet"));
XElement root = doc.Root;
StreamReader reader = new StreamReader(FILENAME);
int rowCount = 1;
string line = "";
string[] headers = null;
while((line = reader.ReadLine()) != null)
{
if (rowCount++ == 1)
{
headers = line.Split(new char[] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
}
else
{
string[] arrayStr = line.Split(new char[] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
XElement newRow = new XElement("Example");
root.Add(newRow);
for (int i = 0; i < arrayStr.Count(); i++)
{
newRow.Add(new XAttribute(headers[i], arrayStr[i]));
}
}
}
}
}
}
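Since the question is also tagged Java, here is a rough Java counterpart of the same idea using the JDK's DOM and serializer. It is a sketch under the same assumption as above (the first row holds the column headers); the class name TabToXmlSketch and the method convert are made up, and the input is taken as a String to keep the example self-contained.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TabToXmlSketch {

    // Hypothetical helper: the first line holds the headers, each later line becomes one <Example>.
    static String convert(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement("ExampleDataSet");
            doc.appendChild(root);

            String[] lines = text.split("\\R");
            String[] headers = lines[0].split("\t");
            for (int r = 1; r < lines.length; r++) {
                if (lines[r].isEmpty()) {
                    continue; // ignore blank lines
                }
                String[] cells = lines[r].split("\t", -1); // keep empty cells
                Element row = doc.createElement("Example");
                for (int c = 0; c < cells.length && c < headers.length; c++) {
                    row.setAttribute(headers[c], cells[c]);
                }
                root.appendChild(row);
            }

            // Serialize the DOM tree; attribute values are escaped by the serializer.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String input = "ExID\texampleCode\texampleDescription\n"
            + "AA\tAA\tTHIS IS AN EXAMPLE DESCRIPTION\n"
            + "BB\tBB\tTHIS IS AN EXAMPLE DESCRIPTION\n";
        System.out.println(convert(input));
    }
}
```

The order in which attributes appear in the serialized output is up to the DOM implementation, so it may differ from the column order in the file.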

Generate/get xpath from XML node java

I'm interested in advice/pseudocode code/explanation rather than actual implementation.
I'd like to go through XML document, all of its nodes
Check the node for attribute existence
Case if node doesn't have attribute, get/generate String with value of its xpath
Case if node does have attributes, iterate through attribute list and create xpath for each attribute including the node as well.
Edit
My reason for doing this is: I'm writing automated tests in JMeter, so for every request I need to verify that the request actually did its job, and I'm asserting results by getting node values with XPath.
When the request is small it's not a problem to create asserts by hand, but for larger ones it's really a pain.
I'm looking for Java approach.
Goal
My goal is to achieve following from this example XML file :
<root>
<elemA>one</elemA>
<elemA attribute1='first' attribute2='second'>two</elemA>
<elemB>three</elemB>
<elemA>four</elemA>
<elemC>
<elemB>five</elemB>
</elemC>
</root>
to produce the following :
//root[1]/elemA[1]='one'
//root[1]/elemA[2]='two'
//root[1]/elemA[2][@attribute1='first']
//root[1]/elemA[2][@attribute2='second']
//root[1]/elemB[1]='three'
//root[1]/elemA[3]='four'
//root[1]/elemC[1]/elemB[1]='five'
Explained :
If node value/text is not null/zero, get xpath , add = 'nodevalue' for assertion purpose
If node has attributes create assert for them too
Update
I found this example, it doesn't produce the correct results, but I'm looking something like this:
http://www.coderanch.com/how-to/java/SAXCreateXPath
Update:
@c0mrade has updated his question. Here is a solution to it:
This XSLT transformation:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="vApos">'</xsl:variable>
<xsl:template match="*[@* or not(*)] ">
<xsl:if test="not(*)">
<xsl:apply-templates select="ancestor-or-self::*" mode="path"/>
<xsl:value-of select="concat('=',$vApos,.,$vApos)"/>
<xsl:text>
</xsl:text>
</xsl:if>
<xsl:apply-templates select="@*|*"/>
</xsl:template>
<xsl:template match="*" mode="path">
<xsl:value-of select="concat('/',name())"/>
<xsl:variable name="vnumPrecSiblings" select=
"count(preceding-sibling::*[name()=name(current())])"/>
<xsl:if test="$vnumPrecSiblings">
<xsl:value-of select="concat('[', $vnumPrecSiblings +1, ']')"/>
</xsl:if>
</xsl:template>
<xsl:template match="@*">
<xsl:apply-templates select="../ancestor-or-self::*" mode="path"/>
<xsl:value-of select="concat('[@',name(), '=',$vApos,.,$vApos,']')"/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
when applied on the provided XML document:
<root>
<elemA>one</elemA>
<elemA attribute1='first' attribute2='second'>two</elemA>
<elemB>three</elemB>
<elemA>four</elemA>
<elemC>
<elemB>five</elemB>
</elemC>
</root>
produces exactly the wanted, correct result:
/root/elemA='one'
/root/elemA[2]='two'
/root/elemA[2][@attribute1='first']
/root/elemA[2][@attribute2='second']
/root/elemB='three'
/root/elemA[3]='four'
/root/elemC/elemB='five'
When applied to the newly-provided document by @c0mrade:
<root>
<elemX serial="kefw90234kf2esda9231">
<id>89734</id>
</elemX>
</root>
again the correct result is produced:
/root/elemX[@serial='kefw90234kf2esda9231']
/root/elemX/id='89734'
Explanation:
Only elements that have no children elements, or have attributes are matched and processed.
For any such element, if it doesn't have children-elements all of its ancestor-or self elements are processed in a specific mode, named 'path'. Then the "='theValue'" part is output and then a NL character.
All attributes of the matched element are then processed.
Then finally, templates are applied to all children-elements.
Processing an element in the 'path' mode is simple: a / character and the name of the element are output. Then, if there are preceding siblings with the same name, a "[numPrecSiblings+1]" part is output.
Processing of attributes is simple: first all ancestor-or-self:: elements of its parent are processed in 'path' mode, then the [@attrName='attrValue'] part is output, followed by a NL character.
Do note:
Names that are in a namespace are displayed without any problem and in their initial readable form.
To aid readability, an index of [1] is never displayed.
Below is my initial answer (may be ignored)
Here is a pure XSLT 1.0 solution:
Below is a sample xml document and a stylesheet that takes a node-set parameter and produces one valid XPath expression for every member-node.
stylesheet (buildPath.xsl):
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
>
<xsl:output method="text"/>
<xsl:variable name="theParmNodes" select="//namespace::*[local-name() =
'myNamespace']"/>
<xsl:template match="/">
<xsl:variable name="theResult">
<xsl:for-each select="$theParmNodes">
<xsl:variable name="theNode" select="."/>
<xsl:for-each select="$theNode |
$theNode/ancestor-or-self::node()[..]">
<xsl:element name="slash">/</xsl:element>
<xsl:choose>
<xsl:when test="self::*">
<xsl:element name="nodeName">
<xsl:value-of select="name()"/>
<xsl:variable name="thisPosition"
select="count(preceding-sibling::*[name(current()) =
name()])"/>
<xsl:variable name="numFollowing"
select="count(following-sibling::*[name(current()) =
name()])"/>
<xsl:if test="$thisPosition + $numFollowing > 0">
<xsl:value-of select="concat('[', $thisPosition +
1, ']')"/>
</xsl:if>
</xsl:element>
</xsl:when>
<xsl:otherwise> <!-- This node is not an element -->
<xsl:choose>
<xsl:when test="count(. | ../@*) = count(../@*)">
<!-- Attribute -->
<xsl:element name="nodeName">
<xsl:value-of select="concat('@',name())"/>
</xsl:element>
</xsl:when>
<xsl:when test="self::text()"> <!-- Text -->
<xsl:element name="nodeName">
<xsl:value-of select="'text()'"/>
<xsl:variable name="thisPosition"
select="count(preceding-sibling::text())"/>
<xsl:variable name="numFollowing"
select="count(following-sibling::text())"/>
<xsl:if test="$thisPosition + $numFollowing > 0">
<xsl:value-of select="concat('[', $thisPosition +
1, ']')"/>
</xsl:if>
</xsl:element>
</xsl:when>
<xsl:when test="self::processing-instruction()">
<!-- Processing Instruction -->
<xsl:element name="nodeName">
<xsl:value-of select="'processing-instruction()'"/>
<xsl:variable name="thisPosition"
select="count(preceding-sibling::processing-instruction())"/>
<xsl:variable name="numFollowing"
select="count(following-sibling::processing-instruction())"/>
<xsl:if test="$thisPosition + $numFollowing > 0">
<xsl:value-of select="concat('[', $thisPosition +
1, ']')"/>
</xsl:if>
</xsl:element>
</xsl:when>
<xsl:when test="self::comment()"> <!-- Comment -->
<xsl:element name="nodeName">
<xsl:value-of select="'comment()'"/>
<xsl:variable name="thisPosition"
select="count(preceding-sibling::comment())"/>
<xsl:variable name="numFollowing"
select="count(following-sibling::comment())"/>
<xsl:if test="$thisPosition + $numFollowing > 0">
<xsl:value-of select="concat('[', $thisPosition +
1, ']')"/>
</xsl:if>
</xsl:element>
</xsl:when>
<!-- Namespace: -->
<xsl:when test="count(. | ../namespace::*) =
count(../namespace::*)">
<xsl:variable name="apos">'</xsl:variable>
<xsl:element name="nodeName">
<xsl:value-of select="concat('namespace::*',
'[local-name() = ', $apos, local-name(), $apos, ']')"/>
</xsl:element>
</xsl:when>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:variable>
<xsl:value-of select="msxsl:node-set($theResult)"/>
</xsl:template>
</xsl:stylesheet>
xml source (buildPath.xml):
<!-- top level Comment -->
<root>
<nodeA>textA</nodeA>
<nodeA id="nodeA-2">
<?myProc ?>
xxxxxxxx
<nodeB/>
<nodeB xmlns:myNamespace="myTestNamespace">
<!-- Comment within /root/nodeA[2]/nodeB[2] -->
<nodeC/>
<!-- 2nd Comment within /root/nodeA[2]/nodeB[2] -->
</nodeB>
yyyyyyy
<nodeB/>
<?myProc2 ?>
</nodeA>
</root>
<!-- top level Comment -->
Result:
/root/nodeA[2]/nodeB[2]/namespace::*[local-name() = 'myNamespace']
/root/nodeA[2]/nodeB[2]/nodeC/namespace::*[local-name() =
'myNamespace']
Here is how this can be done with SAX:
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
public class FragmentContentHandler extends DefaultHandler {
private String xPath = "/";
private XMLReader xmlReader;
private FragmentContentHandler parent;
private StringBuilder characters = new StringBuilder();
private Map<String, Integer> elementNameCount = new HashMap<String, Integer>();
public FragmentContentHandler(XMLReader xmlReader) {
this.xmlReader = xmlReader;
}
private FragmentContentHandler(String xPath, XMLReader xmlReader, FragmentContentHandler parent) {
this(xmlReader);
this.xPath = xPath;
this.parent = parent;
}
@Override
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
Integer count = elementNameCount.get(qName);
if(null == count) {
count = 1;
} else {
count++;
}
elementNameCount.put(qName, count);
String childXPath = xPath + "/" + qName + "[" + count + "]";
int attsLength = atts.getLength();
for(int x=0; x<attsLength; x++) {
System.out.println(childXPath + "[@" + atts.getQName(x) + "='" + atts.getValue(x) + "']");
}
FragmentContentHandler child = new FragmentContentHandler(childXPath, xmlReader, this);
xmlReader.setContentHandler(child);
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
String value = characters.toString().trim();
if(value.length() > 0) {
System.out.println(xPath + "='" + value + "'");
}
xmlReader.setContentHandler(parent);
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
characters.append(ch, start, length);
}
}
It can be tested with:
import java.io.FileInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
public class Demo {
public static void main(String[] args) throws Exception {
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader xr = sp.getXMLReader();
xr.setContentHandler(new FragmentContentHandler(xr));
xr.parse(new InputSource(new FileInputStream("input.xml")));
}
}
This will produce the desired output:
//root[1]/elemA[1]='one'
//root[1]/elemA[2][@attribute1='first']
//root[1]/elemA[2][@attribute2='second']
//root[1]/elemA[2]='two'
//root[1]/elemB[1]='three'
//root[1]/elemA[3]='four'
//root[1]/elemC[1]/elemB[1]='five'
With jOOX (a jquery API port to Java, disclaimer - I work for the company behind the library), you can almost achieve what you want in a single statement:
// I'm assuming this:
import static org.joox.JOOX.$;
// And then...
List<String> coolList = $(document).xpath("//*[not(*)]").map(
context -> $(context).xpath() + "='" + $(context).text() + "'"
);
If document is your sample document:
<root>
<elemA>one</elemA>
<elemA attribute1='first' attribute2='second'>two</elemA>
<elemB>three</elemB>
<elemA>four</elemA>
<elemC>
<elemB>five</elemB>
</elemC>
</root>
This will produce
/root[1]/elemA[1]='one'
/root[1]/elemA[2]='two'
/root[1]/elemB[1]='three'
/root[1]/elemA[3]='four'
/root[1]/elemC[1]/elemB[1]='five'
By "almost", I mean that jOOX does not (yet) support matching/mapping attributes. Hence, your attributes will not produce any output. This will be implemented in the near future, though.
private static void buildEntryList( List<String> entries, String parentXPath, Element parent ) {
NamedNodeMap attrs = parent.getAttributes();
for( int i = 0; i < attrs.getLength(); i++ ) {
Attr attr = (Attr)attrs.item( i );
//TODO: escape attr value
entries.add( parentXPath+"[@"+attr.getName()+"='"+attr.getValue()+"']");
}
HashMap<String, Integer> nameMap = new HashMap<String, Integer>();
NodeList children = parent.getChildNodes();
for( int i = 0; i < children.getLength(); i++ ) {
Node child = children.item( i );
if( child instanceof Text ) {
//TODO: escape child value
entries.add( parentXPath+"='"+((Text)child).getData()+"'" );
} else if( child instanceof Element ) {
String childName = child.getNodeName();
Integer nameCount = nameMap.get( childName );
nameCount = nameCount == null ? 1 : nameCount + 1;
nameMap.put( child.getNodeName(), nameCount );
buildEntryList( entries, parentXPath+"/"+childName+"["+nameCount+"]", (Element)child);
}
}
}
public static List<String> getEntryList( Document doc ) {
ArrayList<String> entries = new ArrayList<String>();
Element root = doc.getDocumentElement();
buildEntryList(entries, "/"+root.getNodeName()+"[1]", root );
return entries;
}
This code works with two assumptions: you aren't using namespaces and there are no mixed content elements. The namespace limitation isn't a serious one, but it'd make your XPath expression much harder to read, as every element would be something like *:<name>[namespace-uri()='<nsuri>'][<index>], but otherwise it's easy to implement. Mixed content on the other hand would make the use of xpath very tedious, as you'd have to be able to individually address the second, third and so on text node within an element.
1. use w3c.dom
2. go recursively down
3. for each node there is an easy way to get its xpath: either by storing it as an array/list during step 2, or via a function which goes recursively up until the parent is null, then reverses the array/list of encountered nodes.
something like that.
UPD:
and concatenate the final list in order to get the final xpath.
don't think attributes will be a problem.
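The "walk up until the parent is null" variant can be sketched as follows. This is only one possible reading of the steps above; the class name PathFromNodeSketch and its helpers are made up for illustration, and namespaces and text nodes are ignored.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class PathFromNodeSketch {

    // 1-based position of the element among siblings with the same name.
    static int indexAmongSiblings(Element e) {
        int index = 1;
        for (Node s = e.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s.getNodeType() == Node.ELEMENT_NODE
                    && s.getNodeName().equals(e.getNodeName())) {
                index++;
            }
        }
        return index;
    }

    // Walks from the element up to the root; push() puts each step in front,
    // so the deque ends up in root-first order without an explicit reverse.
    static String xpathOf(Element element) {
        Deque<String> steps = new ArrayDeque<>();
        for (Node n = element;
             n != null && n.getNodeType() == Node.ELEMENT_NODE;
             n = n.getParentNode()) {
            Element e = (Element) n;
            steps.push(e.getNodeName() + "[" + indexAmongSiblings(e) + "]");
        }
        return "/" + String.join("/", steps);
    }

    // Convenience helper for the demo: path of the n-th (0-based) element named tag.
    static String pathOfNth(String xml, String tag, int n) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return xpathOf((Element) doc.getElementsByTagName(tag).item(n));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><elemA>one</elemA><elemA>two</elemA>"
            + "<elemC><elemB>five</elemB></elemC></root>";
        System.out.println(pathOfNth(xml, "elemA", 1));
        System.out.println(pathOfNth(xml, "elemB", 0));
    }
}
```

The loop stops at the Document node because it is not an element, which plays the role of "parent is null" in the description.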
I've done a similar task once. The main idea used was that you can use indexes of the element in the xpath. For example in the following xml
<root>
<el />
<something />
<el />
</root>
xpath to the second <el/> will be /root[1]/el[2] (xpath indexes are 1-based). This reads as "take the first root, then take the second one from all elements with the name el". So element something does not affect indexing of elements el. So you can in theory create an xpath for each specific element in your xml. In practice I've accomplished this by walking the tree recursively and remembering information about elements and their indexes along the way.
Creating xpath referencing specific attribute of the element then was just adding '/@attrName' to element's xpath.
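To check that such index-based paths select what you expect, they can be evaluated with the JDK's javax.xml.xpath API. A small sketch follows; the class name XPathCheckSketch, the helper evaluate and the id attributes in the sample are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathCheckSketch {

    // Hypothetical helper: evaluates the expression against the XML text
    // and returns the string value of the first selected node.
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><el id='a'>first</el><something/><el id='b'>second</el></root>";
        // <something/> does not affect the index of the el elements.
        System.out.println(evaluate(xml, "/root[1]/el[2]"));
        // Appending /@attrName addresses an attribute of that element.
        System.out.println(evaluate(xml, "/root[1]/el[2]/@id"));
    }
}
```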
I have written a method to return the absolute path of an element in the Practical XML library. To give you an idea of how it works, here's an extract from one of the unit tests:
assertEquals("/root/wargle[2]/zargle",
DomUtil.getAbsolutePath(child3a));
So, you could recurse through the document, apply your tests, and use this to return the XPath. Or, what is probably better, is that you could use the XPath-based assertions from that same library.
I did the exact same thing last week for processing my xml to solr compliant format.
Since you wanted a pseudo code: This is how I accomplished that.
// You can skip the reference to parent and child.
1_ Initialize a custom node object: NodeObjectVO {String nodeName, String path, List attr, NodeObjectVO parent, List child}
2_ Create an empty list
3_ Create a DOM representation of the XML and iterate through the nodes. For each node, get the corresponding information. All the information, such as node name, attribute names and values, should be readily available from the DOM object. (You need to check the DOM NodeType; the code should ignore processing instructions and plain text nodes.)
// Code Bloat warning.
4_ The only tricky part is getting the path. I created an iterative utility method to get the xpath string from a NodeElement. (While(node.Parent != null ) { path+=node.parent.nodeName}.)
(You can also achieve this by maintaining a global path variable that keeps track of the parent path for each iteration.)
5_ In the setter method setAttributes(List), I append all the available attributes to the object's path. (One path with all available attributes, not a list of paths with each possible combination of attributes. You might want to do it some other way.)
6_ Add the NodeObjectVO to the list.
7_ Now we have a flat (not hierarchical) list of custom node objects that hold all the information I need.
(Note: Like I mentioned, I maintain parent child relationship, you should probably skip that part. There is a possibility of code bloating, especially while getparentpath. For small xml this was not a problem, but this is a concern for large xml).
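The steps above could be sketched roughly as follows. This is not the original code: NodeInfo is a hypothetical stand-in for NodeObjectVO without the parent/child links, attributes are kept in a map instead of being appended to the path, and the parent path is passed down the recursion (the "global path variable" alternative) rather than recomputed per node:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NodeFlattener {

    /** Hypothetical value object: element name, path from the root, and attributes. */
    static final class NodeInfo {
        final String name;
        final String path;
        final Map<String, String> attributes;

        NodeInfo(String name, String path, Map<String, String> attributes) {
            this.name = name;
            this.path = path;
            this.attributes = attributes;
        }
    }

    /** Parses the XML and returns one NodeInfo per element, in document order. */
    static List<NodeInfo> flatten(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<NodeInfo> out = new ArrayList<>();
            collect(root, "", out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Recursive walk; each call extends the path it received from its parent. */
    private static void collect(Element element, String parentPath, List<NodeInfo> out) {
        String path = parentPath + "/" + element.getNodeName();
        Map<String, String> attributes = new LinkedHashMap<>();
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            attributes.put(attrs.item(i).getNodeName(), attrs.item(i).getNodeValue());
        }
        out.add(new NodeInfo(element.getNodeName(), path, attributes));

        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            // Text, comments and processing instructions are skipped; only elements are recorded.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, path, out);
            }
        }
    }
}
```

Passing the path down keeps the work per node constant, which avoids the repeated parent walk that was flagged as a concern for large XML.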

How do I remove namespaces from xml, using java dom?

I have the following code
DocumentBuilderFactory dbFactory_ = DocumentBuilderFactory.newInstance();
Document doc_;
DocumentBuilder dBuilder = dbFactory_.newDocumentBuilder();
StringReader reader = new StringReader(s);
InputSource inputSource = new InputSource(reader);
doc_ = dBuilder.parse(inputSource);
doc_.getDocumentElement().normalize();
Then I can do
doc_.getDocumentElement();
and get my first element but the problem is instead of being job the element is tns:job.
I know about and have tried to use:
dbFactory_.setNamespaceAware(true);
but that is just not what I'm looking for, I need something to completely get rid of namespaces.
Any help would be appreciated,
Thanks,
Josh
Use the Regex function. This will solve this issue:
public static String removeXmlStringNamespaceAndPreamble(String xmlString) {
    return xmlString
            .replaceAll("(<\\?[^<]*\\?>)?", "")           /* remove preamble */
            .replaceAll("xmlns.*?(\"|\').*?(\"|\')", "")  /* remove xmlns declaration */
            .replaceAll("(<)(\\w+:)(.*?>)", "$1$3")       /* remove opening tag prefix */
            .replaceAll("(</)(\\w+:)(.*?>)", "$1$3");     /* remove closing tag prefix */
}
For Element and Attribute nodes:
Node node = ...;
String name = node.getLocalName();
will give you the local part of the node's name.
See Node.getLocalName()
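One caveat worth showing in code: as far as I know, getLocalName() only returns the local part when the document was parsed namespace-aware, and typically returns null otherwise. A small sketch, where the class name LocalNameDemo and the namespace URI urn:example are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class LocalNameDemo {

    /** Parses the XML namespace-aware and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Needed so that parsed nodes carry a local name and a namespace URI.
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

With a root element written as tns:job and bound to some namespace, getNodeName() on the result still gives tns:job, while getLocalName() gives job.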
You can pre-process XML to remove all namespaces, if you absolutely must do so. I'd recommend against it, as removing namespaces from an XML document is in essence comparable to removing namespaces from a programming framework or library - you risk name clashes and lose the ability to differentiate between once-distinct elements. However, it's your funeral. ;-)
This XSLT transformation removes all namespaces from any XML document.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*" />
    </xsl:copy>
  </xsl:template>
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="node()|@*" />
    </xsl:element>
  </xsl:template>
  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:apply-templates select="node()|@*" />
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>
Apply it to your XML document. Java examples for doing such a thing should be plenty, even on this site. The resulting document will be exactly of the same structure and layout, just without namespaces.
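For completeness, a minimal sketch of applying a stylesheet from Java with the standard javax.xml.transform API. The class name XsltRunner and the choice to pass both the stylesheet and the document as strings are assumptions made for illustration:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltRunner {

    /** Applies the stylesheet text to the XML text and returns the result as a string. */
    static String transform(String stylesheet, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            // Leave out the <?xml ...?> declaration so the output is just the document body.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

The returned string can then be fed to a DocumentBuilder as in the question if a DOM is needed afterwards.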
Rather than
dbFactory_.setNamespaceAware(true);
Use
dbFactory_.setNamespaceAware(false);
Although I agree with Tomalak: in general, namespaces are more helpful than harmful. Why don't you want to use them?
Edit: this answer doesn't answer the OP's question, which was how to get rid of namespace prefixes. RD01 provided the correct answer to that.
Tomalak, one fix of your XSLT (in 3rd template):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
  </xsl:template>
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="node() | @*" />
    </xsl:element>
  </xsl:template>
  <xsl:template match="@*">
    <!-- Here! -->
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
public static void wipeRootNamespaces(Document xml) {
Node root = xml.getDocumentElement();
NodeList rootchildren = root.getChildNodes();
Element newroot = xml.createElement(root.getNodeName());
for (int i=0;i<rootchildren.getLength();i++) {
newroot.appendChild(rootchildren.item(i).cloneNode(true));
}
xml.replaceChild(newroot, root);
}
The size of the input XML also needs to be considered when choosing a solution. For large XML documents, on the order of ~100k (quite possible if your input comes from a web service), you also need to consider the garbage collection implications of manipulating a large string. We used String.replaceAll before, and it caused frequent OOM errors in production with a 1.5G heap because of the way replaceAll is implemented.
You can reference http://app-inf.blogspot.com/2013/04/pitfalls-of-handling-large-string.html for our findings.
I am not sure how XSLT deals with large String objects, but we ended up parsing the string manually and removing the prefixes in a single pass, to avoid creating additional large Java objects.
public static String removePrefixes(String input1) {
    String ret = null;
    int strStart = 0;
    boolean finished = false;
    if (input1 != null) {
        // BE CAREFUL: allocate enough capacity up front to avoid buffer expansion
        StringBuffer sb = new StringBuffer(input1.length());
        while (!finished) {
            int start = input1.indexOf('<', strStart);
            // Look for '>' from the '<' onwards, so a stray '>' in text content is not taken as a tag end
            int end = (start == -1) ? -1 : input1.indexOf('>', start);
            if (start != -1 && end != -1) {
                // Append everything before '<', including the '<' itself
                sb.append(input1, strStart, start + 1);
                String tag = input1.substring(start + 1, end);
                if (tag.startsWith("/")) {
                    // Append '/' if this is a closing tag ("</")
                    sb.append('/');
                    tag = tag.substring(1);
                }
                int colon = tag.indexOf(':');
                int space = tag.indexOf(' ');
                // Only strip a prefix that belongs to the tag name, not to an attribute
                if (colon != -1 && (space == -1 || colon < space)) {
                    tag = tag.substring(colon + 1);
                }
                // Append the tag with its prefix removed, and ">"
                sb.append(tag).append('>');
                strStart = end + 1;
            } else {
                // No further tags: keep any trailing text and stop
                sb.append(input1, strStart, input1.length());
                finished = true;
            }
        }
        // BE CAREFUL: use new String(sb) instead of sb.toString() for large Strings
        ret = new String(sb);
    }
    return ret;
}
Instead of using TransformerFactory and then calling transform on it (which was injecting the empty namespace), I serialized the document as follows:
OutputStream outputStream = new FileOutputStream(new File(xMLFilePath));
OutputFormat outputFormat = new OutputFormat(doc, "UTF-8", true);
outputFormat.setOmitComments(true);
outputFormat.setLineWidth(0);
XMLSerializer serializer = new XMLSerializer(outputStream, outputFormat);
serializer.serialize(doc);
outputStream.close();
I also faced the namespace issue and was unable to read the XML file in Java. Below is the solution:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(false); // the important line: this turns off namespace processing in the parser
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("XML/"+ fileName);
