Streaming XML in Java

I am trying to read a large XML file. I only want to read the car owners, and I can't load the whole XML into memory. How can I do that?
The XML file:
<root>
<message>
<car>
<owner>adam</owner>
</car>
<desk>
<owner>sam</owner>
<game>
<owner>dorothy</owner>
</game>
<pen>
<owner>dorothy</owner>
</pen>
</desk>
</message>
</root>
For example, this code does not know exactly what it is reading. How can we be sure that we are reading car owners?
XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
XMLEventReader reader = xmlInputFactory.createXMLEventReader(new FileInputStream(entry.toFile()));
while (reader.hasNext()) {
    XMLEvent nextEvent = reader.nextEvent();
    if (nextEvent.isStartElement()) {
        StartElement startElement = nextEvent.asStartElement();
        log.info(startElement.getName().toString());
        switch (startElement.getName().getLocalPart()) {
            case "owner":
                // whose owner is this?
                break;
        }
    }
}

A basic but viable solution is to create a small state machine: capture events as they arrive and update the state accordingly.
If entering a car node, store a reference to the car.
If entering an owner node AND you have previously entered a car node, store the owner of that car.
When exiting the car node, return the car-owner pair.
Repeat, and track nesting and/or node depth so that only car > owner is accepted.
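The steps above can be sketched with the cursor-based StAX API. This is a minimal sketch, not a definitive implementation: the class name CarOwnerReader and the method readCarOwners are hypothetical, and it assumes an owner counts only when it is a direct child of a car element. Only the current depth and a small buffer are kept in memory, so the file size should not matter.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: collects the text of every <owner> directly inside a <car>.
class CarOwnerReader {

    static List<String> readCarOwners(Reader source) throws XMLStreamException {
        List<String> owners = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(source);
        int depth = 0;             // current element nesting depth
        int carDepth = -1;         // depth of the enclosing <car>, -1 when outside a car
        StringBuilder text = null; // non-null only while inside a car's <owner>
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    depth++;
                    String name = reader.getLocalName();
                    if (carDepth < 0 && name.equals("car")) {
                        carDepth = depth;          // state: entered a car
                    } else if (carDepth >= 0 && depth == carDepth + 1 && name.equals("owner")) {
                        text = new StringBuilder(); // state: entered car > owner
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    if (text != null) {
                        text.append(reader.getText()); // text may arrive in several chunks
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (text != null && depth == carDepth + 1) {
                        owners.add(text.toString().trim()); // leaving car > owner
                        text = null;
                    }
                    if (depth == carDepth) {
                        carDepth = -1;                      // leaving the car
                    }
                    depth--;
                    break;
                default:
                    break;
            }
        }
        reader.close();
        return owners;
    }

    public static void main(String[] args) throws XMLStreamException {
        // For a large file, pass a FileReader or use the InputStream overload instead.
        String xml = "<root><message><car><owner>adam</owner></car>"
                + "<desk><owner>sam</owner><game><owner>dorothy</owner></game></desk>"
                + "</message></root>";
        System.out.println(readCarOwners(new StringReader(xml))); // prints [adam]
    }
}
```

Tracking the depth rather than a single boolean is what keeps owners of desk, game or pen from being picked up, even if those elements were nested inside a car.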

Related

Java Stax how to get only value of specific child nodes

I use StAX to get the nodeName and nodeValue from my XML file (90 MB in size):
<?xml version="1.0" encoding="UTF-8"?>
<name1>
<type>
<coord>67</coord>
<umc>57657</umc>
</type>
<lang>
<eng>989</eng>
<spa>123</spa>
</lang>
</name1>
<name2>
<type>
<coord>534</coord>
<umc>654654</umc>
</type>
<lang>
<eng>354</eng>
<spa>2424</spa>
</lang>
</name2>
<name3>
<type>
<coord>23432</coord>
<umc>14324</umc>
</type>
<lang>
<eng>141</eng>
<spa>142</spa>
</lang>
</name3>
I can get the localName but not the child nodes. If I want the values of all child nodes other than 'spa', how should I proceed?
Java:
XMLStreamReader dataXML = factory.createXMLStreamReader(new FileReader(path));
while (dataXML.hasNext())
{
    int type = dataXML.next();
    switch(type)
    {
        case XMLStreamReader.START_ELEMENT:
            System.out.println(dataXML.getLocalName());
            break;
        case XMLStreamReader.CHARACTERS:
            System.out.println(dataXML.getText());
            break;
    }
}
To keep track of the element being parsed, introduce one variable holding the current tag name and another holding the tag name(s) of interest:
String localname = null;
String tagName = "spa";
while (dataXML.hasNext()) {
    int type = dataXML.next();
    switch (type) {
        case XMLStreamReader.SPACE:
            continue;
        case XMLStreamReader.START_ELEMENT:
            localname = dataXML.getLocalName();
            System.out.println(dataXML.getLocalName());
            break;
        case XMLStreamReader.CHARACTERS:
            if (!tagName.equals(localname)) {
                System.out.println(dataXML.getText());
            }
            break;
    }
}
If there are several tags you want to handle, the variable tagName can be replaced with a list:
List<String> tagNames = new ArrayList<>();
tagNames.add("spa");
The check then becomes:
if (!tagNames.contains(localname)) {
    System.out.println(dataXML.getText());
}
You are using StAX parsing, which means you pull events from the parser. A StAX parser has no knowledge of the detailed structure of your document.
Please check "Differences between DOM, SAX or StAX" and "Java StAX parser".
If you want the children of an XML element, you need to track them yourself.
If you really want to access children conveniently, use the DOM parsing strategy. But as you mentioned, your document is ~90 MB, which may be too heavy to load fully into memory.
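Tracking the structure yourself can be as simple as keeping a stack of the currently open element names. The sketch below is one way to do it, under some assumptions: the class ChildValueReader and its method readValues are hypothetical names, the input has a single root element (the sample is wrapped in an assumed names element, since the snippet in the question has several top-level elements), and values live only in leaf elements.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: lists "path=value" for every leaf element not in the excluded set.
class ChildValueReader {

    static List<String> readValues(Reader source, Set<String> excluded) throws XMLStreamException {
        List<String> values = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>();   // names of the currently open elements
        StringBuilder text = new StringBuilder();  // text seen since the last tag
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(source);
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    path.addLast(reader.getLocalName());
                    text.setLength(0);
                    break;
                case XMLStreamConstants.CHARACTERS:
                    text.append(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    String value = text.toString().trim();
                    // Whitespace between tags trims to "", so parent elements are skipped.
                    if (!value.isEmpty() && !excluded.contains(path.peekLast())) {
                        values.add(String.join("/", path) + "=" + value);
                    }
                    text.setLength(0);
                    path.removeLast();
                    break;
                default:
                    break;
            }
        }
        reader.close();
        return values;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<names><name1><type><coord>67</coord><umc>57657</umc></type>"
                + "<lang><eng>989</eng><spa>123</spa></lang></name1></names>";
        for (String line : readValues(new StringReader(xml), Collections.singleton("spa"))) {
            System.out.println(line);
        }
    }
}
```

Because the stack always holds the full path, the same loop can answer questions such as "which parent does this value belong to" without loading the document into memory.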

How to make XStream parse partial input from StAX

I am new to StAX and XStream. I am trying to unmarshal some common elements from a huge XML stream (there might be between 1.5 million and 2.5 million elements to unmarshal).
I have tried using StAX to parse the stream up to an element of interest and then calling XStream to unmarshal the XML up to the EndElement.
XMLStreamReader reader = xmlInputFactory.createXMLStreamReader(fis);
while (reader.hasNext()) {
    if (reader.isStartElement() && reader.getLocalName().toLowerCase().equals("person")) {
        break;
    }
    reader.next();
}
StaxDriver sd = new StaxDriver();
AbstractPullReader rd = sd.createStaxReader(reader);
XStream xstream = new XStream(sd);
xstream.registerConverter(new PersonConverter());
Person p = (Person) xstream.unmarshal(rd);
I created a test input:
<Persons>
<Person>
<name>A</name>
</Person>
<Person>
<name>B</name>
</Person>
<Person>
<name>C</name>
</Person>
</Persons>
The problem with this is that, first, my converter is not called. Second, I get a CannotResolveClassException for the element "name" in Person, and XStream doesn't create my Person object.
What did I miss in my code?
When you instantiate an AbstractPullReader it will read the first open-element event from the stream, establishing the "root" element. Because you've already read the first Person event it will advance to the next one (name), which it doesn't know how to unmarshal.
You'll have to do two things to make your example work:
First, alias the element name Person to your Java class:
xstream.alias("Person", Person.class);
Second, only advance the StAX cursor up to the element before the one you want to read:
while (reader.hasNext()) {
    if (reader.isStartElement() && reader.getLocalName().equals("Persons")) {
        break;
    }
    reader.next();
}

XmlStreamReader behaves randomly at the start

I would expect the XMLStreamReader to start at the beginning of the document (obviously) and then move to the root element when I call next() on it. Alarmingly, however, I see it jump to the first tag with text inside, always skipping the root and often(???) the second tag.
The document looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<objektliste xmlns="http://www.pixelboxx.de/ns/erco/translations/1.0">
<uebersetzungen key="122671" attribute="7505">
<thumbnail>abrakadabra.jpg</thumbnail>
<text sprache="1031">We like the abla abla abla</text>
<text sprache="2057">We like the spoonBlaBlaBla[en]</text>
<text sprache="1036">Wicher</text>
</uebersetzungen>
<uebersetzungen key="122679" attribute="7505">
<thumbnail>122679.jpg</thumbnail>
<text sprache="1031">Kiefer</text>
<text sprache="1036">franek</text>
</uebersetzungen>
</objektliste>
Am I going insane, is my Eclipse going insane, or am I missing something obvious?
The program seems to always omit "objektliste" and usually jumps to "thumbnail" first, even though in previous debug sessions it seemed to behave even more randomly.
help!!!
btw, the code is extremely simple:
XMLStreamReader streamReader = factory.createXMLStreamReader(is);
while (streamReader.hasNext())
{
    // event type 7 here, everything seems to be ok.
    streamReader.next();
    // bang! armageddon - skips the root, jumps to thumbnail.
}
The call to streamReader.next() is event-based.
The next() method causes the reader to read the next parse event. It returns an integer that identifies the type of event just read.
The event type can also be determined using getEventType().
I think you may be seeing events you were not expecting, such as end-element events.
Using the following code, I see the proper order being processed as expected:
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader streamReader = factory.createXMLStreamReader(is);
while (streamReader.hasNext()) {
    int eventType = streamReader.next();
    switch (eventType) {
        case XMLStreamReader.START_ELEMENT:
            String elementName = streamReader.getLocalName();
            System.out.println("Element: " + elementName);
            break;
        case XMLStreamReader.END_ELEMENT:
            break;
    }
}
Which generates the expected output:
Element: objektliste
Element: uebersetzungen
Element: thumbnail
Element: text
Element: text
Element: text
Element: uebersetzungen
Element: thumbnail
Element: text
Element: text
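To see every event that next() delivers, not just start elements, it can help to print each event type by name. The following is a small diagnostic sketch; the class EventDump and its methods are hypothetical names, and the sample input is a shortened version of the document from the question. The integer values come from XMLStreamConstants, where 7 is START_DOCUMENT, which matches the "event type 7" seen before the first call to next().

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical diagnostic helper: lists every StAX event in document order.
class EventDump {

    // Maps the integer event type to a readable name.
    static String describe(int eventType) {
        switch (eventType) {
            case XMLStreamConstants.START_DOCUMENT: return "START_DOCUMENT";
            case XMLStreamConstants.END_DOCUMENT:   return "END_DOCUMENT";
            case XMLStreamConstants.START_ELEMENT:  return "START_ELEMENT";
            case XMLStreamConstants.END_ELEMENT:    return "END_ELEMENT";
            case XMLStreamConstants.CHARACTERS:     return "CHARACTERS";
            case XMLStreamConstants.SPACE:          return "SPACE";
            case XMLStreamConstants.COMMENT:        return "COMMENT";
            default:                                return "OTHER(" + eventType + ")";
        }
    }

    static List<String> events(Reader source) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(source);
        List<String> out = new ArrayList<>();
        out.add(describe(reader.getEventType())); // state before the first next()
        while (reader.hasNext()) {
            int type = reader.next();
            String label = describe(type);
            if (type == XMLStreamConstants.START_ELEMENT || type == XMLStreamConstants.END_ELEMENT) {
                label += " " + reader.getLocalName(); // only valid on element events
            }
            out.add(label);
        }
        reader.close();
        return out;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<objektliste><uebersetzungen key=\"122671\">\n"
                + "  <thumbnail>abrakadabra.jpg</thumbnail>\n"
                + "</uebersetzungen></objektliste>";
        for (String event : events(new StringReader(xml))) {
            System.out.println(event);
        }
    }
}
```

Whitespace between tags typically shows up as its own CHARACTERS (or SPACE) event, so a debugger stepping through next() calls will stop on more events than there are tags.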

Extracting Values From an XML File Either using XPath, SAX or DOM for this Specific Scenario

I am currently working on an academic project, developing in Java and XML. The actual task is to parse XML and pass the required values, preferably in a HashMap, on for further processing. Here is a short snippet of the actual XML.
<root>
<BugReport ID = "1">
<Title>"(495584) Firefox - search suggestions passes wrong previous result to form history"</Title>
<Turn>
<Date>'2009-06-14 18:55:25'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "3.1"> Created an attachment (id=383211) [details] Patch v.2</Sentence>
<Sentence ID = "3.2"> Ah. So, there's a ._formHistoryResult in the....</Sentence>
<Sentence ID = "3.3"> The simple fix it to just discard the service's form history result.</Sentence>
<Sentence ID = "3.4"> Otherwise it's trying to use a old form history result that no longer applies for the search string.</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 12:07:34'</Date>
<From>'Gavin Sharp'</From>
<Text>
<Sentence ID = "4.1"> (From update of attachment 383211 [details])</Sentence>
<Sentence ID = "4.2"> Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 13:17:56'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "5.1"> (In reply to comment #3)</Sentence>
<Sentence ID = "5.2"> &gt; (From update of attachment 383211 [details] [details])</Sentence>
<Sentence ID = "5.3"> &gt; Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
<Sentence ID = "5.4"> Good point.</Sentence>
<Sentence ID = "5.5"> I renamed the one in the wrapper to _formHistResult. </Sentence>
<Sentence ID = "5.6"> fhResult seemed maybe a bit too short.</Sentence>
</Text>
</Turn>
.....
and so on
</BugReport>
There are many commenters like 'Justin Dolske' who have commented on this report, and what I am actually looking for is the list of commenters and all the sentences each has written in the whole XML file. Something like if(from == justin dolske) getHisAllSentences(), and similarly for all other commenters. I have tried many different ways to get the sentences only for 'Justin Dolske' or other commenters, and even generically for all of them, using XPath, SAX and DOM, but failed. I am quite new to these technologies, including Java, and don't know how to achieve it.
Can anyone guide me specifically on how I could do this with any of the above technologies, or is there a better strategy?
(Note: Later I want to put it in a HashMap(key, value), where key = name of the commenter (justin dolske) and value = all of their sentences.)
Urgent help will be highly appreciated.
There are several ways to achieve your requirement.
One way would be to use JAXB. There are several tutorials available on the web, so feel free to refer to them.
You can also create a DOM, extract the data from it, and put it into your HashMap.
One reference implementation would be something like this:
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XMLReader {
    private HashMap<String, ArrayList<String>> namesSentencesMap;

    public XMLReader() {
        namesSentencesMap = new HashMap<String, ArrayList<String>>();
    }

    private Document getDocument(String fileName) {
        Document document = null;
        try {
            document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File(fileName));
        } catch (Exception exe) {
            //handle exception
        }
        return document;
    }

    private void buildNamesSentencesMap(Document document) {
        if (document == null) {
            return;
        }
        //Get each Turn block
        NodeList turnList = document.getElementsByTagName("Turn");
        String fromName = null;
        NodeList sentenceNodeList = null;
        for (int turnIndex = 0; turnIndex < turnList.getLength(); turnIndex++) {
            Element turnElement = (Element) turnList.item(turnIndex);
            //Assumption: <From> element
            Element fromElement = (Element) turnElement.getElementsByTagName("From").item(0);
            fromName = fromElement.getTextContent();
            //Extracting sentences - First check whether the map contains
            //an ArrayList corresponding to the name. If yes, then use that,
            //else create a new one
            ArrayList<String> sentenceList = namesSentencesMap.get(fromName);
            if (sentenceList == null) {
                sentenceList = new ArrayList<String>();
            }
            //Extract sentences from the Turn node
            try {
                sentenceNodeList = turnElement.getElementsByTagName("Sentence");
                for (int sentenceIndex = 0; sentenceIndex < sentenceNodeList.getLength(); sentenceIndex++) {
                    sentenceList.add(((Element) sentenceNodeList.item(sentenceIndex)).getTextContent());
                }
            } finally {
                sentenceNodeList = null;
            }
            //Put the list back in the map
            namesSentencesMap.put(fromName, sentenceList);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = new XMLReader();
        reader.buildNamesSentencesMap(reader.getDocument("<your_xml_file>"));
        for (String names : reader.namesSentencesMap.keySet()) {
            System.out.println("Name: " + names + "\tTotal Sentences: " + reader.namesSentencesMap.get(names).size());
        }
    }
}
Note: This is just a demonstration and you would need to modify it to suit your need. I've created it based on your XML to show one way of doing it.
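Since the question also mentions XPath, the same grouping can be sketched with the javax.xml.xpath API that ships with the JDK. This is only an illustration under assumptions: the class SentencesByCommenter and the method group are hypothetical names, the element layout (Turn > From and Turn > Text > Sentence) is taken from the snippet in the question, and the whole document is still loaded into memory as a DOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: maps each commenter (text of <From>) to all of their sentences.
class SentencesByCommenter {

    static Map<String, List<String>> group(InputSource source) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, List<String>> result = new LinkedHashMap<>();

        // Select every <Turn>, then evaluate relative paths against each one.
        NodeList turns = (NodeList) xpath.evaluate("//Turn", doc, XPathConstants.NODESET);
        for (int i = 0; i < turns.getLength(); i++) {
            Node turn = turns.item(i);
            String from = xpath.evaluate("From", turn).trim();
            NodeList sentences =
                    (NodeList) xpath.evaluate("Text/Sentence", turn, XPathConstants.NODESET);
            List<String> list = result.computeIfAbsent(from, key -> new ArrayList<>());
            for (int j = 0; j < sentences.getLength(); j++) {
                list.add(sentences.item(j).getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><BugReport ID=\"1\">"
                + "<Turn><From>'Justin Dolske'</From><Text>"
                + "<Sentence ID=\"3.1\"> Created an attachment</Sentence></Text></Turn>"
                + "<Turn><From>'Gavin Sharp'</From><Text>"
                + "<Sentence ID=\"4.1\"> (From update of attachment)</Sentence></Text></Turn>"
                + "</BugReport></root>";
        Map<String, List<String>> map = group(new InputSource(new StringReader(xml)));
        for (Map.Entry<String, List<String>> entry : map.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Note that the keys keep the single quotes that surround the names in the sample data, so a lookup would use "'Justin Dolske'" rather than "Justin Dolske".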
I suggest using JAXB to create a data model reflecting your XML structure.
Once done, you can load the XML into Java instances.
Put each 'Turn' into a Map< String, List< Turn >>, using Turn.From as the key.
Once done, you can write:
List< Turn > justinsTurn = allTurns.get( "'Justin Dolske'" );

Java - parse xml string with variable tagnames?

I'm trying to parse an XML string, and the tagnames are variable; I haven't seen any examples on how to pull the information out without knowing them. For example, I will always know the <response> and <data> tags below, but what falls inside/outside of them could be anything from <employee> to you name it.
<?xml version="1.0" encoding="UTF-8"?>
<response>
<generic>
....
</generic>
<data>
<employee>
<name>Seagull</name>
<id>3674</id>
<age>34</age>
</employee>
<employee>
<name>Robin</name>
<id>3675</id>
<age>25</age>
</employee>
</data>
</response>
You could parse it into a generic DOM object and traverse it. For example, you could use dom4j to do this.
From the dom4j quick start guide:
public void treeWalk(Document document) {
    treeWalk( document.getRootElement() );
}
public void treeWalk(Element element) {
    for ( int i = 0, size = element.nodeCount(); i < size; i++ ) {
        Node node = element.node(i);
        if ( node instanceof Element ) {
            treeWalk( (Element) node );
        }
        else {
            // do something....
        }
    }
}
public Document parse(URL url) throws DocumentException {
    SAXReader reader = new SAXReader();
    Document document = reader.read(url);
    return document;
}
I have seen similar situations in projects.
If you are going to deal with large XML documents, you can use a StAX or SAX parser to read the XML. At each step (for example, on reaching an end element), put the data into a Map or another data structure of your choice, with tag names as keys and their text as values. Once parsing is done, use this Map to figure out which object to build, since by then you have a proper representation of the information in the XML.
If the XML is small, use DOM and build the entity object directly by reading the specific tag (like <employee>), or use XPath to look where you expect the tag to be present, which gives you a hint about the entity. Build that object directly by reading the specific information from the XML.
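The streaming approach described above could look roughly like the sketch below. It rests on a few assumptions: the class GenericRecordReader and the method readRecords are hypothetical names, every direct child of data is treated as one record regardless of its tag name, and each record is flat (its children are leaf elements holding text).

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: turns each child of <data> into a map of tag name -> text.
class GenericRecordReader {

    static List<Map<String, String>> readRecords(Reader source) throws XMLStreamException {
        List<Map<String, String>> records = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(source);
        int depth = 0;                      // current nesting depth
        int dataDepth = -1;                 // depth of <data>, -1 when outside it
        Map<String, String> current = null; // record being filled, if any
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    depth++;
                    if (dataDepth < 0 && reader.getLocalName().equals("data")) {
                        dataDepth = depth;
                    } else if (dataDepth >= 0 && depth == dataDepth + 1) {
                        current = new LinkedHashMap<>(); // a new record of unknown type
                    }
                    text.setLength(0);
                    break;
                case XMLStreamConstants.CHARACTERS:
                    text.append(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (current != null && depth == dataDepth + 2) {
                        // A field of the record: tag name is the key, text is the value.
                        current.put(reader.getLocalName(), text.toString().trim());
                    } else if (current != null && depth == dataDepth + 1) {
                        records.add(current);            // record finished
                        current = null;
                    } else if (depth == dataDepth) {
                        dataDepth = -1;                  // left <data>
                    }
                    text.setLength(0);
                    depth--;
                    break;
                default:
                    break;
            }
        }
        reader.close();
        return records;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<response><generic><x>1</x></generic><data>"
                + "<employee><name>Seagull</name><id>3674</id><age>34</age></employee>"
                + "<employee><name>Robin</name><id>3675</id><age>25</age></employee>"
                + "</data></response>";
        for (Map<String, String> record : readRecords(new StringReader(xml))) {
            System.out.println(record);
        }
    }
}
```

The key set of each map (here name, id and age) is what a caller could inspect to decide which entity object to build. If the record's own tag name is needed as well, it could be captured at the start-element event that creates the map.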
