Whitespace-aware reading/writing of XML - Java

I need to change some elements of an XML file which is under source control, and write the file with no other differences, so that the developers can easily review the changes.
In detail, I have a set of elements which need to have an id attribute in the XML code. I find these elements with an XPath expression and add an ID to them. But when the DOM is written again, the formatting differs a bit:
the order of the attributes is changed to alphabetical
the definition of the namespace is moved to the elements (<ns1:root xmlns:ns1="abc" xmlns:ns2="xzy"><ns2:element/></ns1:root> changes to <root xmlns="abc"><element xmlns="xzy"/></root>)
linebreaks and indentation change
The XML is read with javax.xml.parsers.SAXParser (namespaceAware: true) and written with javax.xml.transform.TransformerFactory (indent: yes).
The best way to preserve the formatting would be to alter the source string; is there a good way to do this without diving too deep into the XML parsing machinery?
Or is there a way to parse the XML to DOM whitespace-aware?

the order of the attributes is changed to alphabetical
The order of the attributes is irrelevant as per the spec. If you have built a piece of software that relies on the order of the attributes in an XML file, then that software is broken, plain and simple.
the definition of the namespace is moved to the elements
That is irrelevant as well.
linebreaks and indentation change
So is this.
The best way to preserve the formatting would be to alter the source string,
Absolutely not. Don't do that, that's wrong on every level. XML parsers are complex things because XML parsing is a complex thing. If it was as simple as doing a bunch of string search-and-replace operations, then XML parsers would do that instead of being complex.
Two XML documents are identical when the DOM they create is identical. There are countless ways to serialize a DOM. You are at fault if any part of your program relies on the serialized representation of the DOM instead of on the DOM itself.
In any case, most serializers do offer some settings that influence their behavior. If you use the same serializer with the same configuration then you can expect a predictable outcome. That might help a little (i.e. when checking the file into a source control system), but it should not be a reason to start relying on it at the code level.
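For example, a minimal sketch of such a consistently configured serializer using the JAXP identity transform (the DOM is built with a DocumentBuilder here just to keep the sketch short; the indent-amount property is Xalan-specific but accepted by the JDK's built-in transformer):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class ConsistentSerializer {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(new File("in.xml"));

        // ... modify the DOM here, e.g. add the id attributes ...

        // Always serialize with the same transformer and the same settings,
        // so at least the output is reproducible from run to run.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        // Xalan-specific hint; the JDK's default transformer understands it.
        t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        t.transform(new DOMSource(doc), new StreamResult(new File("out.xml")));
    }
}

Even with identical settings, expect cosmetic differences from the original file; the point is consistency between runs, not byte-for-byte preservation.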

Related

How to preserve the Attributes order in a XML after parsing and modifications in java?

First of all, some premise.
I am aware of the existence of several identical questions on the site, but in none of them have I found a definitive solution to the problem.
I know that the order of the attributes of XML files is absolutely irrelevant for the purposes of data consistency or the ability to integrate with software that actually treats XML as such and not as strings. However, I have to keep it because I am going to modify files that will be visually checked by operators with WinMerge or with Tortoise's check-for-modifications command.
I have used libraries like DOM, StAX and JDOM with poor results.
In the files where I only have to modify the text of an element, I have no problem, and if there is some different formatting I can easily fix it by treating the file as a string.
With attributes it is more complicated. These are written out in a different order (please do not question whether this is correct or not; that is not inherent to the question), and in WinMerge it looks as if the whole document was modified.
Here is a (cut-down, with semi-random text content) example of my XML before and after the modification:
<?xml version="1.0" encoding="UTF-8"?>
<sca:composite xmi:version="2.0"
xmlns:xmi="http://www.omg.org/XMI"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:BW="http://xsd.tns.tibco.com/amf/models/sca/implementationtype/BW" xmlns:XMLSchema="http://www.w3.org/2001/XMLSchema"
xmlns:compositeext="http://schemas.tibco.com/amx/3.0/compositeext"
xmlns:productAvailabilityResp="http://www.example.org/ERTETERET"
xmlns:property="http://ns.tibco.com/bw/property"
xmlns:rest="http://xsd.tns.tibco.com/bERTERTETE"
xmlns:sca="http://www.3453434FDSSDFSD.org/xmlns/sca/1.0"
xmlns:scact="http://xsd.tns.tibco.com/23E23E2E23Ee"
xmlns:scaext="http://2D2333DD32s"
xmi:id="_uKDz4IaiEeipW88nT3HxEA"
targetNamespace="http://tns.tibco.com/D23D32DD2232D2D2"
name="Q1231W1y" compositeext:version="1.0.0"
compositeext:description="TO EDIT VALUE"
compositeext:formatVersion="2">
</sca:composite>
and
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<sca:composite xmlns:sca="http://www.SDFSDF.org/xmlns/sca/1.0"
xmlns:BW="http://xsd.tns.tibco.com/amf/models/sca/SDFS/BW"
xmlns:XMLSchema="http://www.w3.org/2001/XMLSchema"
xmlns:compositeext="http://schemas.tibco.com/amx/3.0/compositeext"
xmlns:productAvailabilityResp="http://www.example.org/SDFSDFSD"
xmlns:property="http://ns.tibco.com/bw/property"
xmlns:rest="http://xsd.tns.tibco.com/SDFSF"
xmlns:scact="http://xsd.tns.tibco.com/amf/models/sca/SDFSD"
xmlns:scaext="http://xsd.tns.tibco.com/amf/models/sca/extensions"
xmlns:xmi="http://www.omg.org/XMI"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
compositeext:description="test EDITED VALUE"
compositeext:formatVersion="2"
compositeext:version="1.0.0"
name="ERFERFRFE"
targetNamespace="http://tns.tibco.com/bw/composite/ERFERFREy"
xmi:id="_uKDz4IaiEeipW88nT3HxEA"
xmi:version="2.0">
</sca:composite>
Could we try together to find a solution?
Edit, as suggested by Federico:
What I need to do is change the value of a single attribute and the text content of an element; I can do both of those things. But when I write the file back, I find a different order of the attributes and a different formatting:
<?xml version="1.0" encoding="UTF-8"?>
<sca:composite //same attributes
compositeext:description="TO EDIT VALUE"
//same other attributes>
other stuff
</sca:composite>
PS: my intent is to build a versioner for TIBCO BW6 projects outside the designer.
From my understanding, your program reads the XML input stream from a file with STaX, DOM or SAX, then you do some modifications to elements or attributes, and finally your program will write the data to another XML file.
A requirement is that the detailed structure of the output file resembles that of the input file as closely as possible after the changes made. That means – among other conditions – that elements and attributes have to be in the same order in the output document as they were in the input document.
XML demands that the sequence of elements remains as is, but (as you said already), the attributes can be in any order without any influence on the semantics of the XML document.
Your problem is that neither DOM nor SAX nor StAX allows you to influence the sequence of the attributes of the elements.
Does this description match with your problem?
I am using a large XML file as a "poor man's database"; that means that I manipulate that XML file with a text editor and that I have a bunch of little programs that create reports from that XML file. One of these sorts the "records" in the XML file, and this requires reading it, manipulating the data and writing it back afterwards.
I had the same (or at least a similar) issue as you: some attributes ended up at arbitrary locations afterwards. When searching through the text file in the editor, this causes a lot of friction.
So instead of using SAX, DOM or StAX for the output, I wrote my own library that defines a comparator for each element type, which is used to sort the attributes of that element type.
Some implementations of the comparator used a list with attribute names that defined the order, and that allowed me to have the attributes ordered like this:
<element sortkey="…" id="…" subject="…" date="…" parent="…" …
If you treat the xmi:… things and the namespace definitions all as attributes, the code for such an "XMLWriter" is quite straightforward.
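A minimal sketch of that idea for a DOM Element, with a hand-maintained attribute order (the ORDER list and the sample attribute names are made up for illustration; real code must also escape quotes, angle brackets and ampersands in attribute values):

import java.util.Arrays;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Writes the start tag of an element with its attributes in a fixed,
// per-element order instead of the order a generic serializer would choose.
public class OrderedAttributeWriter {

    // Hypothetical order list for one element type.
    private static final List<String> ORDER =
            Arrays.asList("sortkey", "id", "subject", "date", "parent");

    public static String startTag(Element e) {
        StringBuilder sb = new StringBuilder("<").append(e.getTagName());
        NamedNodeMap attrs = e.getAttributes();
        // Copy the attributes and sort them by their position in ORDER;
        // unknown attributes go last, in alphabetical order.
        Node[] sorted = new Node[attrs.getLength()];
        for (int i = 0; i < attrs.getLength(); i++) {
            sorted[i] = attrs.item(i);
        }
        Arrays.sort(sorted, (a, b) -> {
            int ia = ORDER.indexOf(a.getNodeName());
            int ib = ORDER.indexOf(b.getNodeName());
            if (ia < 0 && ib < 0) return a.getNodeName().compareTo(b.getNodeName());
            if (ia < 0) return 1;
            if (ib < 0) return -1;
            return Integer.compare(ia, ib);
        });
        for (Node a : sorted) {
            // NOTE: a real implementation must escape the attribute value here.
            sb.append(' ').append(a.getNodeName())
              .append("=\"").append(a.getNodeValue()).append('"');
        }
        return sb.append('>').toString();
    }
}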
If the order of the attributes may differ for each individual element (even those with the same name), you have to modify that approach so that you store the attribute sequence with each element instance on reading.
But perhaps XML processing is not the right approach for you at all …
Maybe an approach like that of sed or awk fits your needs better.
This means basically that you search for a certain sequence in the text file (using a regular expression or by line and column number or a combination of both), replace what you find there and start over for the next change on another location.
Edit: I did not mean to integrate either sed or awk into the solution; what I meant was to adopt only the basic approach of how these tools work, and to implement that in the program. Both tools are really powerful, but from what I understand, only a fraction of their features is needed, so that a full integration of one or the other into the program might be overkill – nevertheless, it is possible: A starting point for an integration of awk is awk.sourceforge.net. It can be integrated even through JSR-223 (Scripting).
For an integration of sed, a look to the tools4j/unix4j project on github could be helpful.
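A sketch of that sed/awk-style approach done directly in Java, using the compositeext:description attribute from the example above (no XML parsing involved, so it only works as long as the attribute value contains no quotes and the pattern appears exactly once):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawTextEdit {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("composite.xml");
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        // Replace only the value of one attribute; everything else, including
        // attribute order, namespace declarations and indentation, stays
        // byte-for-byte the same.
        String edited = content.replaceFirst(
                "compositeext:description=\"[^\"]*\"",
                "compositeext:description=\"test EDITED VALUE\"");
        Files.write(file, edited.getBytes(StandardCharsets.UTF_8));
    }
}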

Comparing two XML files using Java

I have two XML files, say abc.xml and 123.xml, which are almost similar; I mean they have the same content, but the second one, 123.xml, has more content than the first.
I want to read both files using Java and compare whether the content present in abc.xml for each tag is the same as that in 123.xml, something like object comparison.
Please suggest how to read the XML files using Java and start comparing them.
Thanks.
If you just want to compare, then use this:
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.junit.Assert;
import org.w3c.dom.Document;

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setCoalescing(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setIgnoringComments(true);
DocumentBuilder db = dbf.newDocumentBuilder();

Document doc1 = db.parse(new File("file1.xml"));
doc1.normalizeDocument();
Document doc2 = db.parse(new File("file2.xml"));
doc2.normalizeDocument();

// isEqualNode() compares structure, names and values, not object identity
Assert.assertTrue(doc1.isEqualNode(doc2));
Otherwise, see this:
http://xmlunit.sourceforge.net/
I would go for the XMLUnit.
The features it provides :
The differences between two pieces of XML
The outcome of transforming a piece of XML using XSLT
The evaluation of an XPath expression on a piece of XML
The validity of a piece of XML
Individual nodes in a piece of XML that are exposed by DOM Traversal
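For example, a comparison with the legacy XMLUnit 1.x API (package org.custommonkey.xmlunit) might look like this sketch:

import org.custommonkey.xmlunit.Diff;
import org.custommonkey.xmlunit.XMLUnit;

public class XmlUnitCompare {
    public static void main(String[] args) throws Exception {
        // Global settings that control how strict the comparison is
        XMLUnit.setIgnoreWhitespace(true);
        XMLUnit.setIgnoreComments(true);
        XMLUnit.setIgnoreAttributeOrder(true);

        String control = "<root><item id=\"1\">text</item></root>";
        String test = "<root><item id=\"1\">text changed</item></root>";

        Diff diff = new Diff(control, test);
        System.out.println("identical: " + diff.identical());
        System.out.println("similar:   " + diff.similar());
        System.out.println(diff); // describes the differences found, if any
    }
}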
Good Luck!
I would use JAXB to generate Java objects from the XML files and then compare the Java objects. That would make the handling much easier.
In general, if you know that you have two files with identical structure but slightly different and unordered content you are going to have to "read" the files to compare the contents.
If you have the XML Schema for your XML files, then you could use JAXB to create a set of classes that represent the document structure defined by your schema. The benefit of this approach is that you will not have to work through generic element and attribute APIs, but rather through the actual fields that make sense for your problem.
Of course, to be able to detect the presence of the same entry across both files you are going to have to "match" them through some common field (for example, some ID).
To help you with the duplicate-discovery process, you could use some relevant data structure from Java's collections, like a Set (or one of its derivatives).
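A rough, self-contained sketch of that idea; the Entries/Entry classes below are hypothetical stand-ins for whatever classes JAXB would generate from your schema, and the ID-keyed Map is the "common field" matching:

import java.io.File;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class JaxbCompare {

    // Hypothetical classes; in practice generate them from the XML Schema with xjc.
    @XmlRootElement(name = "entries")
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Entries {
        @XmlElement(name = "entry")
        public List<Entry> entries;
    }

    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Entry {
        @XmlAttribute
        public String id;
        @XmlElement
        public String value;

        @Override
        public boolean equals(Object o) {
            return o instanceof Entry
                    && Objects.equals(id, ((Entry) o).id)
                    && Objects.equals(value, ((Entry) o).value);
        }

        @Override
        public int hashCode() {
            return Objects.hash(id, value);
        }
    }

    public static void main(String[] args) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(Entries.class);
        Entries a = (Entries) ctx.createUnmarshaller().unmarshal(new File("abc.xml"));
        Entries b = (Entries) ctx.createUnmarshaller().unmarshal(new File("123.xml"));

        // Index the larger file by the common ID, then look up each entry
        // of the smaller file and compare the unmarshalled objects.
        Map<String, Entry> byId = new HashMap<>();
        for (Entry e : b.entries) {
            byId.put(e.id, e);
        }
        for (Entry e : a.entries) {
            Entry other = byId.get(e.id);
            if (other == null || !e.equals(other)) {
                System.out.println("difference at id " + e.id);
            }
        }
    }
}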
I hope this helps.
Well, if you just want to compare and display, then you can use Guiffy.
It is a good tool. If you want to do the processing in the backend, then you should use a DOM parser: load both files into two DOM objects and compare them attribute by attribute.
The right approach depends on two factors:
(a) how much control do you want over how the comparison is done? For example, do you need to control whether whitespace is significant, whether comments should be ignored, whether namespace prefixes should be ignored, whether redundant namespace declarations should be ignored, whether the XML declaration should be ignored?
(b) what answer do you want? (i) a boolean: same/different, (ii) a list of differences suitable for a human to process, (iii) a list of differences suitable for an application to process.
The two techniques I use are: (a) convert both files to Canonical XML and then compare strings. This gives very little control and only gives a boolean result. (b) compare the two trees using the XPath 2.0 deep-equal() function or the extended Saxon version saxon:deep-equal(). The Saxon version gives more control over how the comparison is done, and a more detailed report of the differences found (for human reading, not for application use).
If you want to write Java code, you could of course implement your own comparison logic - for example you could find an open source implementation of XPath deep-equal, and modify it to meet your requirements. It's only a hundred or so lines of code.
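A sketch of technique (b), assuming Saxon-HE (the s9api package) is on the classpath and the two files sit in the working directory:

import java.io.File;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.XPathCompiler;
import net.sf.saxon.s9api.XdmAtomicValue;
import net.sf.saxon.s9api.XdmValue;

public class DeepEqualCheck {
    public static void main(String[] args) throws Exception {
        Processor proc = new Processor(false); // false = open-source Saxon-HE
        XPathCompiler xp = proc.newXPathCompiler();

        // Use absolute file: URIs so doc() does not depend on a base URI.
        String a = new File("abc.xml").toURI().toString();
        String b = new File("123.xml").toURI().toString();

        // deep-equal() is the XPath 2.0 comparison; it returns a single boolean.
        XdmValue result = xp.evaluate(
                "deep-equal(doc('" + a + "'), doc('" + b + "'))", null);
        boolean same = ((XdmAtomicValue) result.itemAt(0)).getBooleanValue();
        System.out.println(same ? "deep-equal" : "different");
    }
}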
It's a bit overkill, but if your XML has a schema, you can convert it into an EMF metamodel and then use EMF Compare to compare the two.

Reading Huge XML File using StAX and XPath

The input file contains thousands of transactions in XML format which is around 10GB of size. The requirement is to pick each transaction XML based on the user input and send it to processing system.
The sample content of the file
<transactions>
<txn id="1">
<name> product 1</name>
<price>29.99</price>
</txn>
<txn id="2">
<name> product 2</name>
<price>59.59</price>
</txn>
</transactions>
The (technical) user is expected to give the input tag name, like <txn>.
We would like this solution to be more generic. The file content might be different, and users can give an XPath expression like "//transactions/txn" to pick individual transactions.
There are a few technical things we have to consider here:
The file can be in a shared location or FTP
Since the file size is huge, we can't load the entire file in JVM
Can we use a StAX parser for this scenario? It has to take an XPath expression as input and pick/select the transaction XML.
Looking for suggestions. Thanks in advance.
If performance is an important factor, and/or the document size is large (both of which seem to be the case here), the difference between an event parser (like SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document prior to evaluating the XPath expression. [It's interesting to note that all Java Document Object Model implementations like the DOM or Axiom use an event processor (like SAX or StAX) to build the in-memory representation, so if you can ever get by with only the event processor you're saving both memory and the time it takes to build a DOM.]
As I mentioned, the XPath implementation in the JDK operates upon a W3C DOM Document. You can see this in the Java JDK source code implementation by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where prior to the evaluate() method being called the parser must first parse the source:
Document document = getParser().parse( source );
After this your 10GB of XML will be represented in memory (plus whatever overhead) — probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a really strong justification for an XPath (except perhaps programming elegance). The same would be true for the XProc suggestion: this would also build a DOM. If you truly need a DOM you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API and builds its DOM over StAX, so it's fast, and uses Jaxen for its XPath implementation. Jaxen requires some kind of DOM (W3C DOM, DOM4J, or JDOM). This will be true of all XPath implementations, so if you don't truly need XPath sticking with just the events parser would be recommended.
SAX is the old streaming API, with StAX newer, and a great deal faster. Either using the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating an XML event filter that first matches on element type name (to capture your <txn> elements). This will create small bursts of events (element, attribute, text) that can be checked against your matching user values. Upon a suitable match you could either pull the necessary information from the events or pipe the bounded events to build a mini-DOM from them if you found the result was easier to navigate. But it sounds like that might be overkill if the markup is simple.
This would likely be the simplest, fastest possible approach and avoid the memory overhead of building a DOM. If you passed the names of the element and attribute to the filter (so that your matching algorithm is configurable) you could make it relatively generic.
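A minimal sketch of that approach with the JDK's StAX API, matching <txn> elements by their id attribute (the file name and the id value are placeholders; error handling is omitted):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;

public class TxnExtractor {
    public static void main(String[] args) throws Exception {
        String wantedId = "2"; // value supplied by the user
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        TransformerFactory tFactory = TransformerFactory.newInstance();

        try (FileInputStream in = new FileInputStream("transactions.xml")) {
            XMLStreamReader reader = inFactory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "txn".equals(reader.getLocalName())
                        && wantedId.equals(reader.getAttributeValue(null, "id"))) {
                    // The reader is positioned on the matching <txn> element;
                    // transforming a StAXSource copies only that subtree, so
                    // memory use stays proportional to one transaction.
                    Transformer t = tFactory.newTransformer();
                    DOMResult result = new DOMResult();
                    t.transform(new StAXSource(reader), result);
                    Document txn = (Document) result.getNode();
                    // ... send 'txn' to the processing system ...
                }
            }
            reader.close();
        }
    }
}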
StAX and XPath are very different things. StAX allows you to parse a streaming XML document in a forward direction only. XPath allows navigation in both directions. StAX is a very fast streaming XML parser, but if you want XPath, Java has a separate library for that.
Take a look at this question for a very similar discussion: Is there any XPath processor for SAX model?
We regularly parse 1GB+ complex XML files by using a SAX parser which does exactly what you described: it extracts partial DOM trees that can be conveniently queried using XPath.
I blogged about it here - it uses a SAX rather than a StAX parser, but it may be worth a look.
It's definitely a use case for XProc with a streaming and parallel processing implementation like QuiXProc (http://code.google.com/p/quixproc)
In this situation, you will have to use
<p:for-each>
<p:iteration-source select="//transactions/txn"/>
<!-- you processing on a small file -->
</p:for-each>
You can even wrap each of the resulting transformations with a single line of XProc:
<p:wrap-sequence wrapper="transactions"/>
Hope this helps
A fun solution for processing huge XML files (>10GB):
Use ANTLR to create byte offsets for the parts of interest. This will save some memory compared with a DOM-based approach.
Use JAXB to read the parts from their byte positions.
Find details, using the example of Wikipedia dumps (17GB), in this SO answer: https://stackoverflow.com/a/43367629/1485527
Streaming Transformations for XML (STX) might be what you need.
Do you need to process it fast, or do you need fast lookups in the data? These requirements need different approaches.
For fast reading of the whole data, StAX will be OK.
If you need fast lookups, then you might need to load it into some database, e.g. Berkeley DB XML.

XML query in Java?

I am new to this validation process in Java...
-->XML file named Validation Limits
-->Structure of the XML:
<parameter> </parameter>
<lowerLimit> </lowerLimit>
<upperLimit> </upperLimit>
<enable> </enable>
-->Depending on the enable status, 'true' or 'false', I must perform the validation process for the respective parameter
-->What could be the best possible method to perform this operation?
I have parsed the XML (DOM) [forgot to mention this earlier] and stored the values in arrays, but it is complicated by a lot of referencing that takes place from one array to another. Any better method that could replace the array procedure would be helpful.
Thank you in advance.
Try using a DOM or SAX parser; they will do the parsing for you. You can find some good, free tutorials on the internet.
The difference between DOM and SAX is as follows: DOM loads the XML into a tree structure which you can browse through (i.e. the whole XML is loaded), whereas SAX parses the document and triggers events (calls methods) in the process. Both have advantages and disadvantages, but personally, for reasonably sized XML files, I would use DOM.
So, in your case: use DOM to get a tree of your XML document, locate the enable flag, and process the other elements depending on it.
Also, you can achieve this in pure XML, using XML Schema, although this might be too much for simple needs.
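A small DOM-based sketch of that approach; the file name and the assumption that each entry groups its parameter, lowerLimit, upperLimit and enable elements under a <limit> element are placeholders for your actual structure:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ValidationLimits {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("ValidationLimits.xml"));

        // Assumed structure: one <limit> entry per parameter.
        NodeList limits = doc.getElementsByTagName("limit");
        for (int i = 0; i < limits.getLength(); i++) {
            Element limit = (Element) limits.item(i);
            boolean enabled = Boolean.parseBoolean(text(limit, "enable"));
            if (!enabled) {
                continue; // validation switched off for this parameter
            }
            String parameter = text(limit, "parameter");
            double lower = Double.parseDouble(text(limit, "lowerLimit"));
            double upper = Double.parseDouble(text(limit, "upperLimit"));
            // ... validate the measured value of 'parameter' against lower/upper ...
            System.out.println(parameter + ": [" + lower + ", " + upper + "]");
        }
    }

    // Returns the text content of the first descendant element with the given name.
    private static String text(Element parent, String name) {
        return parent.getElementsByTagName(name).item(0).getTextContent().trim();
    }
}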

ignore some XML tags in SAX

I'm parsing an XML document using SAX in Java.
I'm working with the XML that describes research publications in different fields.
Among others, there are elements like "abstract" that briefly describe what the research paper is about. Basic HTML formatting is allowed in that field, but I don't want SAX to treat the HTML tags (like i, b, u, sub, sup and so on) as real XML tags and fire startElement() and endElement() events for those elements.
Is there a way to tell SAX to ignore some predefined set of XML tags and to pass their XML code as-is to the characters() method?
I suspect not, without some work. I would perhaps slot in different SAX handlers as you encounter different elements, and push/pop them off a stack. So when you encounter an <abstract> element, you slot in a new handler that the SAX parser delegates to, and that is intelligent enough to process your HTML elements as you require. Not a trivial solution, I'm afraid.
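A rough sketch of that idea with a single handler that switches mode inside <abstract>; the file name is a placeholder, and attribute handling and escaping of the re-serialized tags are omitted:

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Inside <abstract>, nested elements are re-serialized into a text buffer
// instead of being handled as document structure.
public class AbstractCollector extends DefaultHandler {

    private StringBuilder buffer;   // non-null while inside <abstract>
    private int depth;              // nesting depth below <abstract>

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (buffer != null) {
            depth++;
            buffer.append('<').append(qName).append('>'); // attributes/escaping omitted
        } else if ("abstract".equals(local)) {
            buffer = new StringBuilder();
        } else {
            // ... normal structural handling of other elements ...
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (buffer == null) {
            return;
        }
        if (depth > 0) {
            depth--;
            buffer.append("</").append(qName).append('>');
        } else if ("abstract".equals(local)) {
            System.out.println("abstract as text: " + buffer);
            buffer = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (buffer != null) {
            buffer.append(ch, start, length);
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setNamespaceAware(true);
        f.newSAXParser().parse(new File("publications.xml"), new AbstractCollector());
    }
}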
