I have .xml files inside a package in my Java project that contain data in the following format:
<?xml version="1.0"?>
<postcodes>
<entry postcode='AB1 0AA' latitude='7.101478' longitude='2.242852' />
</postcodes>
I have currently overridden startElement() in my custom DefaultHandler as follows:
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if (attributes.getValue("postcode") == "AB43 8TZ"){
System.out.println("The postcode 'AB43 8TZ', has a latitude of "+attributes.getValue("latitude")+" and a longitude of "+attributes.getValue("longitude"));
}
}
I know the code is working outside of this method because I previously tested it by having it print out all of the attributes for each element and that worked fine. Now however, it does nothing, as if it never found that postcode value. (I know it's there because it's a copy paste job from the XML source)
Extra details: Apologies for originally leaving out important details. Some of these files have up to 50k lines, so storing them in memory is a no-no if at all possible. As such, I am using SAX. As an aside, I use the words "from these files from within my project" because I also can't find how to reference a file from within the same project rather than from an absolute directory.
(From comments as requested by OP.)
First, you cannot compare strings with the == operator. Use equals() instead. See the question How do I compare strings in Java? for more information.
Second, not every element has the postcode attribute, so getValue("postcode") can return null, and invoking equals() on that result would throw a NullPointerException. Do it the other way around, e.g.
"AB43 8TZ".equals(attributes.getValue("postcode"))
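Putting both points together, a minimal sketch of a corrected handler might look like the following. The class names PostcodeHandlerDemo and PostcodeHandler, the find() helper, and the fields are hypothetical; only the element and attribute names come from the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PostcodeHandlerDemo {

    /** Remembers the coordinates of one target postcode (hypothetical helper class). */
    static class PostcodeHandler extends DefaultHandler {
        private final String target;
        String latitude;   // stays null if the postcode is never seen
        String longitude;

        PostcodeHandler(String target) {
            this.target = target;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            // equals() compares content; calling it on the known non-null string
            // is safe for elements without a postcode attribute (getValue returns null).
            if (target.equals(attributes.getValue("postcode"))) {
                latitude = attributes.getValue("latitude");
                longitude = attributes.getValue("longitude");
            }
        }
    }

    /** Streams the XML through SAX and returns the handler holding the result. */
    static PostcodeHandler find(String xml, String postcode) throws Exception {
        PostcodeHandler handler = new PostcodeHandler(postcode);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?><postcodes>"
                + "<entry postcode='AB1 0AA' latitude='7.101478' longitude='2.242852' />"
                + "</postcodes>";
        PostcodeHandler h = find(xml, "AB1 0AA");
        System.out.println(h.latitude + ", " + h.longitude);
    }
}
```

The root element has no postcode attribute, so the comparison simply evaluates to false there instead of throwing.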
You would use an XML parser. Luckily, the JDK offers these out of the box in the form of JAXP. There are several ways to do it, as there are a few major "flavours" of XML parsing. For this task, I believe a DOM parser would be the easiest to use. You could do it like this:
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document document = builder.parse(new File("name/of/the/file.xml"));
Element root = document.getDocumentElement();
and then use the DOM traversal API.
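As a rough illustration of that traversal step, the sketch below walks the entry elements and returns the coordinates for one postcode. The class name DomPostcodeLookup and the lookup() helper are made up for this example, and the XML is parsed from a string rather than a file to keep it self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomPostcodeLookup {

    /** Returns "latitude,longitude" for the postcode, or null if it is not present. */
    static String lookup(String xml, String postcode) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = document.getDocumentElement();

        // All <entry> descendants of the root, in document order.
        NodeList entries = root.getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            // getAttribute returns "" (not null) when the attribute is missing.
            if (postcode.equals(entry.getAttribute("postcode"))) {
                return entry.getAttribute("latitude") + "," + entry.getAttribute("longitude");
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<postcodes>"
                + "<entry postcode='AB1 0AA' latitude='7.101478' longitude='2.242852' />"
                + "</postcodes>";
        System.out.println(lookup(xml, "AB1 0AA"));
    }
}
```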
Edit: it was not clear from the original question that the data you want to process is large. In that case, a DOM parser is indeed not a good solution, precisely because of its memory consumption. SAX and StAX parsers were invented for the purpose of parsing large XML documents. You might find them a little more cumbersome to use, due to their streaming nature, but that is also the source of their efficiency. The linked Oracle JAXP tutorial has sections on SAX and StAX as well.
Assuming you can read the XML relatively quickly using SAX or DOM, I would parse it in advance and use the attributes to construct a map from postcode to latitude/longitude, e.g.
Map<String, Pair<BigDecimal,BigDecimal>>
and simply look up entries using Map.get(String).
I note that you say:
Some of these files have up to 50k lines, so storing them in memory is a no no if at all possible
I wouldn't worry about that at all. A map of 50k entries isn't going to be a major deal.
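A sketch of that idea, under a few assumptions: the JDK has no Pair type, so a two-element BigDecimal array (latitude, longitude) stands in for it, the class name PostcodeIndex and the load() helper are invented here, and every element that carries a postcode attribute is assumed to carry both coordinates as well.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PostcodeIndex {

    /** Parses the stream once and returns postcode -> {latitude, longitude}. */
    static Map<String, BigDecimal[]> load(InputStream in) throws Exception {
        Map<String, BigDecimal[]> index = new HashMap<>();
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                String postcode = attrs.getValue("postcode");
                if (postcode != null) {
                    // Assumption: latitude and longitude are always present alongside postcode.
                    index.put(postcode, new BigDecimal[] {
                            new BigDecimal(attrs.getValue("latitude")),
                            new BigDecimal(attrs.getValue("longitude")) });
                }
            }
        });
        return index;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<postcodes>"
                + "<entry postcode='AB1 0AA' latitude='7.101478' longitude='2.242852' />"
                + "</postcodes>";
        Map<String, BigDecimal[]> index =
                load(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        BigDecimal[] coords = index.get("AB1 0AA");
        System.out.println(coords[0] + ", " + coords[1]);
    }
}
```

Only the map is kept in memory; the XML itself is streamed, so the cost is one small entry per postcode rather than a full document tree.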
You can use the javax.xml.xpath APIs included in the JDK/JRE and use XPath to specify the data you wish to retrieve from the XML document.
Example
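A minimal sketch of the XPath approach, assuming the postcodes/entry layout from the question. The class name XPathPostcodeLookup and the latitudeOf() helper are hypothetical. Note that the JDK implementation typically builds a document tree behind the scenes, so this is convenient rather than memory-frugal.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathPostcodeLookup {

    /** Returns the latitude attribute for the postcode, or "" if nothing matches. */
    static String latitudeOf(String xml, String postcode) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The postcode is spliced into the expression as a literal;
        // this sketch assumes it contains no single quote.
        String expression = "/postcodes/entry[@postcode='" + postcode + "']/@latitude";
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<postcodes>"
                + "<entry postcode='AB1 0AA' latitude='7.101478' longitude='2.242852' />"
                + "</postcodes>";
        System.out.println(latitudeOf(xml, "AB1 0AA"));
    }
}
```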
I am experimenting with VTD XML because I frequently need to modify huge XML files (2-10GB or more).
I am trying to write the result of an XPath query back to a file.
Writing huge files in VTD XML is not obvious to me though:
The method getBytes() is "not implemented" for XMLMemMappedBuffer (see https://jar-download.com/javaDoc/com.ximpleware/vtd-xml/2.13/com/ximpleware/extended/XMLMemMappedBuffer.html)
One of the authors (?) gives a code example in this thread (last post, 2010-04-21): https://sourceforge.net/p/vtd-xml/discussion/379067/thread/a2e03ede/
However, the example is outdated as
long la = vnh.getElementFragment();
returns an Array long[] (see https://jar-download.com/java-documentation-javadoc.php?a=vtd-xml&g=com.ximpleware&v=2.13)
Adapting the relevant lines like this
long[] la = vnh.getElementFragment();
vnh.getXML().writeToFileOutputStream(new FileOutputStream("c:/text2.xml"), (int)la[0], (int)la[1]);
results in the following error:
Exception in thread "main" java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(Unknown Source)
at sun.nio.ch.FileChannelImpl.transferTo(Unknown Source)
at com.ximpleware.extended.XMLMemMappedBuffer.writeToFileOutputStream(XMLMemMappedBuffer.java:104)
at WriteXML.main(WriteXML.java:16)
Questions:
Is this error due to any obvious mistake in the code?
What tools would you use to handle huge XML files (~10GB) efficiently? (Does not have to be Java.)
My goal is to do simple transformations or split the XML and write back to file with great performance. Thanks!
Can't answer your first question, but as to the second, if you're looking for different technology then streaming XSLT 3.0 is one to explore: can't tell whether it's actually suitable without seeing more detail on your requirement.
First of all, to process XML of the size you mentioned, I suggest that you load the XML into memory using mem-map mode. Since vtd-xml doesn't alter the underlying byte format of the XML, you save a lot of back-and-forth encoding/decoding and byte-moving operations, with the performance advantage that follows.
As you have pointed out, XMLMemMappedBuffer's getBytes() is not implemented. This is to avoid excessive memory usage when the fragment is very large.
Your workaround is to use XMLMemMappedBuffer's writeToFileOutputStream() method to dump the fragment directly to the output. In other words, if you know the offset and length of the fragment, getBytes() can often be bypassed.
Below is the documented signature of that method.
public void writeToFileOutputStream(java.io.FileOutputStream ost,
long os,
long len)
throws java.io.IOException
write the segment (denoted by its offset and length) into an output file stream
I need to parse, modify and write back Java source files. I investigated some options, but it seems that I am missing the point.
When the parsed AST is written back to a file, the output always loses the original formatting and uses a standard format instead.
Basically I want something that can do: content(write(parse(sourceFile))).equals(content(sourceFile)).
I tried JavaParser but failed. I might use the Eclipse JDT's parser as a standalone parser, but this feels heavy. I would also like to avoid rolling my own. JavaParser, for instance, already has information about columns and lines, but writing the AST back seems to ignore this information.
I would like to know how I can parse and write back so that the output looks the same as the input (indents, lines, everything). Basically, I need a solution that preserves the original formatting.
[Update]
The modifications I want to make are basically everything that is possible with the AST, such as adding or removing implemented interfaces and removing or adding final on local variables, but also generating methods and constructors.
The idea is to add or remove anything, while the rest must remain untouched, especially the formatting of methods and expressions where the resulting line is longer than the page margin.
You may try using ANTLR 4 with its Java 8 grammar file.
The grammar skips all whitespace by default, but based on token positions you may be able to reconstruct the source so that it is close to the original.
The output of a parser generated by REx is a sequence of events written to this interface:
public interface EventHandler
{
public void reset(CharSequence input);
public void startNonterminal(String name, int begin);
public void endNonterminal(String name, int end);
public void terminal(String name, int begin, int end);
public void whitespace(int begin, int end);
}
where the integers are offsets into the input. The event stream can be used to construct a parse tree. As the event stream completely covers all of the input, the resulting data structure can represent it without loss.
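To illustrate why such an event stream is lossless, here is a small sketch of a handler that rebuilds the input purely from terminal and whitespace events. The interface is repeated from above so the example compiles on its own; the Reconstructor class, the replay() helper, and the hand-written events with their terminal and nonterminal names are assumptions for illustration and are not produced by a real REx parser.

```java
public class LosslessEventDemo {

    // Interface as quoted above, nested here only to keep the example self-contained.
    public interface EventHandler {
        void reset(CharSequence input);
        void startNonterminal(String name, int begin);
        void endNonterminal(String name, int end);
        void terminal(String name, int begin, int end);
        void whitespace(int begin, int end);
    }

    /** Rebuilds the source text from the offsets carried by the events (hypothetical handler). */
    static class Reconstructor implements EventHandler {
        private CharSequence input;
        private final StringBuilder out = new StringBuilder();

        public void reset(CharSequence input) { this.input = input; out.setLength(0); }
        public void startNonterminal(String name, int begin) { /* structure only, no text */ }
        public void endNonterminal(String name, int end) { /* structure only, no text */ }
        public void terminal(String name, int begin, int end) { out.append(input, begin, end); }
        public void whitespace(int begin, int end) { out.append(input, begin, end); }

        String text() { return out.toString(); }
    }

    /** Feeds hand-written events for the input "int  x;" (made-up names and offsets). */
    static String replay() {
        String source = "int  x;";
        Reconstructor handler = new Reconstructor();
        handler.reset(source);
        handler.startNonterminal("Declaration", 0);
        handler.terminal("'int'", 0, 3);
        handler.whitespace(3, 5);
        handler.terminal("Identifier", 5, 6);
        handler.terminal("';'", 6, 7);
        handler.endNonterminal("Declaration", 7);
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println("int  x;".equals(replay())); // prints "true"
    }
}
```

A handler that edits the tree would keep these offsets on its nodes and copy untouched spans verbatim, which is what preserves the original layout.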
There is a sample driver, implementing XmlSerializer on top of this interface. It streams out an XML parse tree, which is just markup added to the input. Thus the string value of the XML document is identical to the original input.
To see it in action, use the Java 7 sample grammar and generate a parser using the command line options
-ll 2 -backtrack -tree -main -java
Then run the main method of the resulting Java.java, passing in some Java source file name.
Our DMS Software Reengineering Toolkit with its Java Front End can do this.
DMS is a program transformation system (PTS), designed to parse source code to an internal representation (usually ASTs), let you make changes to those trees, and regenerate valid output text for the modified tree.
Good PTSes will preserve your formatting/layout at places where you didn't change the code or generate nicely formatted results, including comments in the original source. They will also let you write source-to-source transformations in the form of:
if you see *this* pattern, replace it by *that* pattern
where pattern is written in the surface syntax of your targeted language (in this case, Java). Writing such transformations is usually a lot easier than writing procedural code to climb up and down the tree, inspecting and hacking individual nodes.
DMS has all these properties, including OP's request for idempotency of the null transform.
[Reacting to another answer: yes, it has a Java 8 grammar]
I am confused about which approach is better and hope you can help. I am reading values directly from an XML message using XPath, and I am also trying to convert the XML to JSON and read the values from that, because JSON is a more lightweight object. Which is the best approach for reading values? The relevant code snippets follow.
Below is the code to read values from xml
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.parse(source);
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();
String eventNumber = xpath.evaluate("/event/eventnumber", document);
Below is the code to convert the XML to JSON
JSONObject xmlJSONObj = XML.toJSONObject(xml1);
This really depends on what you will be using the data for. If you need to do a lot of parsing/traversing through the data, JSON is much faster due to the nature of JSON APIs. So in the case that you will be needing to do a lot of data extraction/examination on a single file, I would convert the XML to JSON.
If you only need to find a single field in the XML file or only do a small amount of parsing/data extraction, I would stick to XML. It is not worth the extra processing of converting an XML file to JSON if you are only going to do a small amount of traversing through the XML file. The processing time it takes to convert the XML to JSON and then traverse the JSON is more costly than the processing time it takes to traverse through the XML file once.
Another discussion that was published here:
JSON and XML comparison
Bottom line, in my humble opinion: since you receive the message in XML, it won't be more efficient to convert it to JSON and then parse it, so stick to XML.
With your input being XML, converting to JSON for the sake of speed does not look like a good idea. I guess that, with your other approach, you're losing time on the DOM creation and then even more on XPath. The fastest solution is IMHO a SAX parser. It builds no document tree and only calls you back on the events you're interested in.
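For comparison with the XPath snippet in the question, here is a rough SAX sketch that extracts the text of the first /event/eventnumber element. The class name EventNumberSax and the eventNumber() helper are hypothetical, and the handler assumes eventnumber contains only text (no child elements).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EventNumberSax {

    /** Returns the text of the first /event/eventnumber element, or "" if there is none. */
    static String eventNumber(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    private final Deque<String> path = new ArrayDeque<>(); // open elements, innermost first
                    private boolean capturing;
                    private boolean done;

                    @Override
                    public void startElement(String uri, String localName, String qName, Attributes a) {
                        path.push(qName);
                        // Match only an eventnumber that is a direct child of the root element event.
                        capturing = !done && path.size() == 2
                                && "eventnumber".equals(qName) && "event".equals(path.peekLast());
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        // May be called several times for one text node, so append.
                        if (capturing) {
                            text.append(ch, start, length);
                        }
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        if (capturing) {
                            capturing = false;
                            done = true; // keep the first match only
                        }
                        path.pop();
                    }
                });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(eventNumber("<event><eventnumber>42</eventnumber></event>"));
    }
}
```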
I'm building an XSD to generate JAXB objects in Java. Then I ran into this:
<TotalBugs>
<Bug1>...</Bug1>
<Bug2>...</Bug2>
...
<BugN>...</BugN>
</TotalBugs>
How do I build a sequence of elements where the index of the sequence is in the element name? Specifically, how do I get the 1 in Bug1?
You don't want to do it this way. XML is ordered top-down by nature, so you don't have to number the elements yourself:
<totalBugs>
<bug><!-- Here comes 1st bug --></bug>
<bug><!-- Here comes 2nd bug --></bug>
...
<bug><!-- Here comes last bug --></bug>
</totalBugs>
You can access the 1st bug node in the list by the XPath expression:
/totalBugs/bug[1]
Note that, per the W3C standard, XPath indexes start at 1. Please refer to w3schools for further reading.
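A short sketch of evaluating that expression from Java with the JDK's javax.xml.xpath API. The class name FirstBugXPath, the bugAt() helper, and the sample bug contents are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class FirstBugXPath {

    /** Returns the text of the bug at the given 1-based position, or "" if out of range. */
    static String bugAt(String xml, int position) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/totalBugs/bug[" + position + "]",
                new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<totalBugs><bug>first</bug><bug>second</bug></totalBugs>";
        System.out.println(bugAt(xml, 1)); // prints "first"
    }
}
```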
I'm pretty sure XSD won't support what you need. However you can use <xsd:any> for that bit of the schema, then use something lower-level than JAXB to generate the XML for that particular part. (I think your generated classes will have fields like protected List<Element> any; which you can fill in using DOM).
There are many pretty good JSON libs, like Gson. But for XML I know only Xerces/JDOM, and both have tedious APIs.
I don't like to use unnecessary objects like DocumentFactory, XpathExpressionFactory, NodeList and so on.
So in the light of native xml support in languages such as groovy/scala I have a question.
Is there a minimalistic Java XML I/O framework?
P.S. XStream/JAXB are good for serialization/deserialization, but in this case I'm looking for a way to stream some data out of XML, with XPath for example.
The W3C DOM model is unpleasant and cumbersome, I agree. JDOM is already pretty simple. The only other DOM API that I'm aware of that is simpler is XOM.
What about StAX? With Java 6 you don't even need additional libs.
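For a feel of how little ceremony StAX needs, here is a minimal sketch that pulls an attribute from every matching element. The class name StaxNames, the bazNames() helper, and the sample document are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxNames {

    /** Collects the name attribute of every baz element, in document order. */
    static List<String> bazNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            // Pull events one at a time; nothing is kept in memory except the result list.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "baz".equals(reader.getLocalName())) {
                    // A null namespace URI means the attribute namespace is not checked.
                    names.add(reader.getAttributeValue(null, "name"));
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><foo><bar><baz name=\"phleem\" />"
                + "<baz name=\"gumbo\" /></bar></foo></root>";
        System.out.println(bazNames(xml)); // prints [phleem, gumbo]
    }
}
```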
Dom4J rocks. It's very easy and understandable
Sample Code:
import java.util.List;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

public class Dom4jDemo {
    @SuppressWarnings("unchecked") // the casts below are only needed for dom4j 1.x
    public static void main(String[] args) throws Exception {
        final String xml = "<root><foo><bar><baz name=\"phleem\" />"
                + "<baz name=\"gumbo\" /></bar></foo></root>";
        Document document = DocumentHelper.parseText(xml);
        // simple collection views
        for (Element element : (List<Element>) document
                .getRootElement()
                .element("foo")
                .element("bar")
                .elements("baz")) {
            System.out.println(element.attributeValue("name"));
        }
        // and easy xpath support (XPath evaluation needs the jaxen library on the classpath)
        List<Element> elements2 = (List<Element>)
                document.createXPath("//baz").evaluate(document);
        for (final Element element : elements2) {
            System.out.println(element.attributeValue("name"));
        }
    }
}
Output:
phleem
gumbo
phleem
gumbo
Try VTD-XML. It's almost 3 to 4 times faster than DOM parsers, with an outstanding memory footprint.
It depends on how complex your Java objects are: are they self-containing, etc. (like graph nodes)? If your objects are simple, you can use Google Gson; it has the simplest API (IMO).
In XStream, things start to get messy when you need to debug. You also need to be careful when you choose an appropriate driver for XStream.
JDOM and XOM are probably the simplest. DOM4J is more powerful but more complex. DOM is just horrible. Processing XML in Java will always be more complex than processing JSON, because JSON was designed for structured data while XML was designed for documents, and documents are more complex than structured data. Why not use a language that was designed for XML instead, specifically XSLT or XQuery?
NanoXML is very small, under 50 KB. I found it today and I'm really impressed.