I use StAX to get the nodeName and nodeValue from my XML file (size 90 MB):
<?xml version="1.0" encoding="UTF-8"?>
<name1>
<type>
<coord>67</coord>
<umc>57657</umc>
</type>
<lang>
<eng>989</eng>
<spa>123</spa>
</lang>
</name1>
<name2>
<type>
<coord>534</coord>
<umc>654654</umc>
</type>
<lang>
<eng>354</eng>
<spa>2424</spa>
</lang>
</name2>
<name3>
<type>
<coord>23432</coord>
<umc>14324</umc>
</type>
<lang>
<eng>141</eng>
<spa>142</spa>
</lang>
</name3>
I can get the localName but not the child nodes. If I want to get the values of all child nodes other than 'spa', how should I proceed?
Java:
XMLStreamReader dataXML = factory.createXMLStreamReader(new FileReader(path));
while (dataXML.hasNext())
{
int type = dataXML.next();
switch(type)
{
case XMLStreamReader.START_ELEMENT:
System.out.println(dataXML.getLocalName());
break;
case XMLStreamReader.CHARACTERS:
System.out.println(dataXML.getText());
break;
}
}
To keep track of the element being parsed, introduce a variable holding the current tag name, as well as a variable with the tag name(s) of interest:
String localname = null;
String tagName = "spa";
while (dataXML.hasNext()) {
int type = dataXML.next();
switch (type) {
case XMLStreamReader.SPACE:
continue;
case XMLStreamReader.START_ELEMENT:
localname = dataXML.getLocalName();
System.out.println(dataXML.getLocalName());
break;
case XMLStreamReader.CHARACTERS:
if (!tagName.equals(localname)) {
System.out.println(dataXML.getText());
}
break;
}
}
If there are several tags you want to handle, the variable tagName can be replaced with a list:
List<String> tagNames = new ArrayList<>();
tagNames.add("spa");
And the check would be as follows:
if (!tagNames.contains(localname)) {
System.out.println(dataXML.getText());
}
You are using StAX parsing, which means you pull events from a parser. A StAX parser has no information about the detailed structure of your document.
Please check "Differences between DOM, SAX or StAX" and "Java StAX parser".
If you want to get the children of an XML element, you need to track them yourself.
If you really want to access children in a convenient way, use the DOM parsing strategy. But as you've mentioned, your document is ~90 MB, which may be too heavy to load fully.
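One way to do that tracking is to keep a stack of the currently open element names, so every text value can be reported together with its full path. The following is only a sketch under assumptions: the class name StaxPathTracker and the method collect are made up for illustration, the input is passed as a String to keep the example short (for a 90 MB file you would pass a stream instead), and the document is assumed to have a single root element.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class StaxPathTracker {

    /** Returns "path=value" for every non-blank text node whose element is not named skipTag. */
    static List<String> collect(String xml, String skipTag) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to deliver adjacent text as a single CHARACTERS event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));

        Deque<String> path = new ArrayDeque<>();   // open elements, innermost on top
        List<String> result = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        path.push(reader.getLocalName());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        path.pop();
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        String text = reader.getText().trim();
                        if (!text.isEmpty() && !skipTag.equals(path.peek())) {
                            result.add(toPath(path) + "=" + text);
                        }
                        break;
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    private static String toPath(Deque<String> path) {
        StringBuilder sb = new StringBuilder();
        // descendingIterator walks from the root (pushed first) to the innermost element.
        path.descendingIterator().forEachRemaining(name -> sb.append('/').append(name));
        return sb.toString();
    }
}
```

Because only the stack of open element names is kept, memory use depends on the nesting depth of the document rather than on its size.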
I have a large XML file and below is an extract from it:
...
<LexicalEntry id="Ait~ifAq_1">
<Lemma partOfSpeech="n" writtenForm="اِتِّفاق"/>
<Sense id="Ait~ifAq_1_tawaAfuq_n1AR" synset="tawaAfuq_n1AR"/>
<WordForm formType="root" writtenForm="وفق"/>
</LexicalEntry>
<LexicalEntry id="tawaA&um__1">
<Lemma partOfSpeech="n" writtenForm="تَوَاؤُم"/>
<Sense id="tawaA&um__1_AinosijaAm_n1AR" synset="AinosijaAm_n1AR"/>
<WordForm formType="root" writtenForm="وأم"/>
</LexicalEntry>
<LexicalEntry id="tanaAgum_2">
<Lemma partOfSpeech="n" writtenForm="تناغُم"/>
<Sense id="tanaAgum_2_AinosijaAm_n1AR" synset="AinosijaAm_n1AR"/>
<WordForm formType="root" writtenForm="نغم"/>
</LexicalEntry>
<Synset baseConcept="3" id="tawaAfuq_n1AR">
<SynsetRelations>
<SynsetRelation relType="hyponym" targets="AinosijaAm_n1AR"/>
<SynsetRelation relType="hyponym" targets="AinosijaAm_n1AR"/>
<SynsetRelation relType="hypernym" targets="ext_noun_NP_420"/>
</SynsetRelations>
<MonolingualExternalRefs>
<MonolingualExternalRef externalReference="13971065-n" externalSystem="PWN30"/>
</MonolingualExternalRefs>
</Synset>
...
I want to extract specific information from it. For a given writtenForm, whether from <Lemma> or <WordForm>, the programme takes the value of synset from the <Sense> of that writtenForm (same <LexicalEntry>) and searches for every <Synset> whose id has the same value as that synset. After that, the programme gives all the relations of that Synset, i.e. it displays the value of relType, then returns to the <LexicalEntry> elements, looks for the <Sense> whose synset has the same value as targets, and displays its writtenForm.
I think it's a little bit complicated but the result should be like this:
اِتِّفاق hyponym تَوَاؤُم, اِنْسِجام
One solution is to use a stream reader because of the memory consumption, but I don't know how I should proceed to get what I want. Please help me.
The SAX parser is different from the DOM parser. It looks only at the current item; it can't see future items until they become the current item. It is one of several parsers you can use when the XML file is extremely big, and there are many others out there. To name a few:
SAX PARSER
DOM PARSER
JDOM PARSER
DOM4J PARSER
STAX PARSER
You can find tutorials for all of them here.
In my opinion, after learning it, go straight to DOM4J or JDOM for a commercial product.
The logic of the SAX parser is that you have a handler class (UserHandler in the example below) which extends DefaultHandler and overrides (@Override) some of its methods:
XML FILE:
<?xml version="1.0"?>
<class>
<student rollno="393">
<firstname>dinkar</firstname>
<lastname>kad</lastname>
<nickname>dinkar</nickname>
<marks>85</marks>
</student>
<student rollno="493">
<firstname>Vaneet</firstname>
<lastname>Gupta</lastname>
<nickname>vinni</nickname>
<marks>95</marks>
</student>
<student rollno="593">
<firstname>jasvir</firstname>
<lastname>singn</lastname>
<nickname>jazz</nickname>
<marks>90</marks>
</student>
</class>
Handler class:
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class UserHandler extends DefaultHandler {
boolean bFirstName = false;
boolean bLastName = false;
boolean bNickName = false;
boolean bMarks = false;
@Override
public void startElement(String uri,
String localName, String qName, Attributes attributes)
throws SAXException {
if (qName.equalsIgnoreCase("student")) {
String rollNo = attributes.getValue("rollno");
System.out.println("Roll No : " + rollNo);
} else if (qName.equalsIgnoreCase("firstname")) {
bFirstName = true;
} else if (qName.equalsIgnoreCase("lastname")) {
bLastName = true;
} else if (qName.equalsIgnoreCase("nickname")) {
bNickName = true;
}
else if (qName.equalsIgnoreCase("marks")) {
bMarks = true;
}
}
@Override
public void endElement(String uri,
String localName, String qName) throws SAXException {
if (qName.equalsIgnoreCase("student")) {
System.out.println("End Element :" + qName);
}
}
@Override
public void characters(char ch[],
int start, int length) throws SAXException {
if (bFirstName) {
System.out.println("First Name: "
+ new String(ch, start, length));
bFirstName = false;
} else if (bLastName) {
System.out.println("Last Name: "
+ new String(ch, start, length));
bLastName = false;
} else if (bNickName) {
System.out.println("Nick Name: "
+ new String(ch, start, length));
bNickName = false;
} else if (bMarks) {
System.out.println("Marks: "
+ new String(ch, start, length));
bMarks = false;
}
}
}
Main Class :
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class SAXParserDemo {
public static void main(String[] args){
try {
File inputFile = new File("input.txt");
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
UserHandler userhandler = new UserHandler();
saxParser.parse(inputFile, userhandler);
} catch (Exception e) {
e.printStackTrace();
}
}
}
XPath was designed for exactly this. Java provides support for it in the javax.xml.xpath package.
To do what you want, the code will look something like this:
List<String> findRelations(String word,
Path xmlFile)
throws XPathException {
String xmlLocation = xmlFile.toUri().toASCIIString();
XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setXPathVariableResolver(
name -> (name.getLocalPart().equals("word") ? word : null));
String id = xpath.evaluate(
"//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset",
new InputSource(xmlLocation));
xpath.setXPathVariableResolver(
name -> (name.getLocalPart().equals("id") ? id : null));
NodeList matches = (NodeList) xpath.evaluate(
"//Synset[@id=$id]/SynsetRelations/SynsetRelation",
new InputSource(xmlLocation),
XPathConstants.NODESET);
List<String> relations = new ArrayList<>();
int matchCount = matches.getLength();
for (int i = 0; i < matchCount; i++) {
Element match = (Element) matches.item(i);
String relType = match.getAttribute("relType");
String synset = match.getAttribute("targets");
xpath.setXPathVariableResolver(
name -> (name.getLocalPart().equals("synset") ? synset : null));
NodeList formNodes = (NodeList) xpath.evaluate(
"//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm",
new InputSource(xmlLocation),
XPathConstants.NODESET);
int formCount = formNodes.getLength();
StringJoiner forms = new StringJoiner(",");
for (int j = 0; j < formCount; j++) {
forms.add(
formNodes.item(j).getNodeValue());
}
relations.add(
String.format("%s %s %s", word, relType, forms));
}
return relations;
}
Some basic XPath information:
XPath uses a single file-path-like string to match parts of an XML document. The parts can be any structural part of the document: text, elements, attributes, even things like comments.
A Java XPath expression can attempt to match exactly one part, or several parts, or can even concatenate all matched parts as a String.
In an XPath expression, a name by itself represents an element. For example, WordForm in XPath means any <WordForm …> element in the XML document.
A name starting with @ represents an attribute. For example, @writtenForm refers to any writtenForm=… attribute in the XML document.
A slash indicates a parent and child in an XML document. LexicalEntry/Lemma means any <Lemma> element which is a direct child of a <LexicalEntry> element. Synset/@id means the id=… attribute of any <Synset> element.
Just as a path starting with / indicates an absolute (root-relative) path in Unix, an XPath starting with a slash indicates an expression relative to the root of an XML document.
Two slashes means a descendant which may be a direct child, a grandchild, a great-grandchild, etc. Thus, //LexicalEntry means any LexicalEntry in the document; /LexicalEntry only matches a LexicalEntry element which is the root element.
Square brackets indicate match qualifiers. Synset[@baseConcept='3'] matches any <Synset> element with a baseConcept attribute whose value is the string "3".
XPath can refer to variables, which are defined externally, using Unix-shell-like $ substitutions, like $word. How those variables are passed to an XPath expression depends on the engine. Java uses the setXPathVariableResolver method. Variable names are in a completely separate namespace from node names, so it is of no consequence if a variable name is the same as an element name or attribute name in the XML document.
So, the XPath expressions in the code mean:
//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset
Match any <LexicalEntry> element anywhere in the XML document which has either
a WordForm child with a writtenForm attribute whose value is equal to the word variable
a Lemma child with a writtenForm attribute whose value is equal to the word variable
and for every such <LexicalEntry> element, return the value of the synset attribute of any <Sense> element which is a direct child of the <LexicalEntry> element.
The word variable is defined externally, by an xpath.setXPathVariableResolver, right before the XPath expression is evaluated.
//Synset[@id=$id]/SynsetRelations/SynsetRelation
Match any <Synset> element anywhere in the XML document whose id attribute is equal to the id variable. For each such <Synset> element, look for any direct SynsetRelations child element, and return each of its direct SynsetRelation children.
The id variable is defined externally, by an xpath.setXPathVariableResolver, right before the XPath expression is evaluated.
//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm
Match any <LexicalEntry> element anywhere in the XML document which has a <Sense> child element which has a synset attribute whose value is identical to the synset variable. For each matched element, find any <WordForm> child element and return that element’s writtenForm attribute.
The synset variable is defined externally, by an xpath.setXPathVariableResolver, right before the XPath expression is evaluated.
Logically, what the above should amount to is:
Locate the synset value for the requested word.
Use the synset value to locate SynsetRelation elements.
Locate writtenForm values corresponding to the targets value of each matched SynsetRelation.
If this XML file is too large to represent in memory, use SAX.
You will want to write your SAX parser to maintain a location. To do this, I typically use a StringBuffer, but a Stack of Strings would work just as nicely. This portion will be important because it will permit you to keep track of the path back to the root of the document, which will allow you to understand where in the document you are at a given point in time (useful when trying to only extract a little information).
The main logic flow looks like:
1. When entering a node, add the node's name to the stack.
2. When exiting a node, pop the node's name (top element) off the stack.
3. To know your location, read your current branch of the XML from the bottom of the stack to the top of the stack.
4. When entering a region you care about, clear the buffer you will capture the characters into
5. When exiting a region you care about, flush the buffer into the data structure you will return back as your output.
This way you can efficiently skip over all the branches of the XML tree that you don't care about.
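The steps above could be sketched roughly as follows. This is an illustrative sketch, not a definitive implementation: the class name PathCapturingHandler, the extract method and the slash-separated targetPath format are all assumptions made for this example, and it uses a Deque of names plus a StringBuilder for the captured text.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every element found at targetPath.
final class PathCapturingHandler extends DefaultHandler {
    private final String targetPath;                       // e.g. "/class/student/firstname"
    private final Deque<String> stack = new ArrayDeque<>(); // open elements, innermost on top
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> values = new ArrayList<>();

    PathCapturingHandler(String targetPath) {
        this.targetPath = targetPath;
    }

    List<String> values() {
        return values;
    }

    // Step 3: read the stack from the bottom (root) to the top (current element).
    private String currentPath() {
        StringBuilder sb = new StringBuilder();
        stack.descendingIterator().forEachRemaining(n -> sb.append('/').append(n));
        return sb.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        stack.push(qName);                                  // step 1
        if (currentPath().equals(targetPath)) {
            buffer.setLength(0);                            // step 4: clear the capture buffer
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentPath().equals(targetPath)) {
            buffer.append(ch, start, length);               // may be called several times per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentPath().equals(targetPath)) {
            values.add(buffer.toString().trim());           // step 5: flush into the output
        }
        stack.pop();                                        // step 2
    }

    static List<String> extract(String xml, String targetPath) throws Exception {
        PathCapturingHandler handler = new PathCapturingHandler(targetPath);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values();
    }
}
```

Rebuilding the path string on every callback keeps the sketch short; for a very large file it would probably be cheaper to maintain a boolean "inside target" flag that is updated in startElement and endElement.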
In my current project, we are in the process of re-factoring a java class that constructs an XML document. In previous versions of the product delivered to the customer, the XML document is built with lower case elements and attributes:
<rootElement attr = "abc">
<childElement childAttr = "xyz"/>
</rootElement>
But now we have a requirement to build the XML document with TitleCase element and attributes. The user will set a flag in a properties file to indicate whether the document should be built in lower case or title case. If the flag is configured to build the document in TitleCase, the resultant document will look like:
<RootElement Attr = "abc">
<ChildElement ChildAttr = "xyz"/>
</RootElement>
Various approaches to solve the problem:
1. Plugging in a transformer to convert lowercase XML document to TitleCase XML document. But this will impact the overall performance, as we deal with huge XML files spanning more than 10,000 lines.
2. Create two separate maps with the corresponding XML elements and attributes.
For example:
lowercase map: rootelement -> rootElement, attr -> attr ....
TitleCase map: rootelement -> RootElement, attr -> Attr ....
Based on the property set by the user, the corresponding map will be chosen, and the XML elements/attributes from this map will be used to build the XML document.
3. Using an enum to define constants and their corresponding values.
public enum XMLConstants {
ROOTELEMENT("rootElement", "RootElement"),
ATTRIBUTE("attr", "Attr");
private String lowerCase;
private String titleCase;
private XMLConstants(String aLowerCase, String aTitleCase){
titleCase = aTitleCase;
lowerCase = aLowerCase;
}
public String getValue(boolean isLowerCase){
if(isLowerCase){
return lowerCase;
} else {
return titleCase;
}
}
}
--------------------------------------------------------------
// XML document builder
if(propertyFlag){
isLowerCase = false;
} else {
isLowerCase = true;
}
....
....
createRootElement(ROOTELEMENT.getValue(isLowerCase));
createAttribute(ATTRIBUTE.getValue(isLowerCase));
Please help me choose the right option keeping in mind the performance aspect of the entire solution. If you have any other suggestions, please let me know.
// set before generate XML
boolean isUpperCase;
// use function for each tag/attribute name instead of string constant
// smth. like getInCase("rootElement")
String getInCase(String initialName) {
String initialFirstCharacter = initialName.substring(0, 1);
String actualFirstCharacter;
if (isUpperCase) {
actualFirstCharacter = initialFirstCharacter.toUpperCase();
} else {
actualFirstCharacter = initialFirstCharacter.toLowerCase();
}
return actualFirstCharacter + initialName.substring(1);
}
I am writing a simulator which communicates with a client's piece of software over a local socket. The communication language is XML. I have written some code which works - parsing the incoming XML string into Document via the DocumentBuilder interface.
I have been encountering a problem with CDATA (having never seen it before). Basically, I need to access fields within the CDATA section and change them. I load a 'template' XML document (to reply to the messages with) and use values received in the first message inside the response. Some of the fields that need to be changed are in this CDATA section (what I mean should be clear from the example below).
public static String getOutputMessage(String input) throws Exception{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
Document inputDoc, outputDoc;
Element messageElement = (Element)inputDoc.getElementsByTagName("TRANS").item(0);
messageType = messageElement.getAttribute("name");
if (messageType.equals("processTransaction")){
outputDoc = db.parse(path+"processTransaction\\posPrintReceipt.xml");
outputDoc = changeContent(outputDoc, "PAN_NUMBER", transaction.getPan_number());
outputDoc = changeContent(outputDoc, "TOKEN", transaction.getToken());
outputDoc = changeContent(outputDoc, "TOTAL_AMOUNT", transaction.getTotal_amount());
outputDoc = changeContent(outputDoc, "TRANSACTION_TIME", transaction.getTransaction_time());
outputDoc = changeContent(outputDoc, "TRANSACTION_DATE", transaction.getTransaction_date());
}
}
private static Document changeContent(Document doc,String tag,String value) {
System.out.println("Changing: ["+tag+" : "+value+"]");
NodeList nodes=doc.getElementsByTagName(tag);
Node node = nodes.item(0);
Node parent=node.getParentNode();
node.setTextContent(value);
System.out.println(doc.getElementsByTagName(tag).item(0) + " " + node.getTextContent());
parent.replaceChild(node, doc.getElementsByTagName(tag).item(0));
return doc;
}
The functions above work on normal elements, but below is an example XML message in which I have to read and change some values:
<RLSOLVE_MSG version="5.0">
<MESSAGE>
<SOURCE_ID>DP01</SOURCE_ID>
<TRANS_NUM>000001</TRANS_NUM>
</MESSAGE>
<POI_MSG type="interaction">
<INTERACTION name="posPrintReceipt">
<RECEIPT type="merchant" format="xml">
<![CDATA[<RECEIPT>
<AUTH_CODE>06130</AUTH_CODE>
<CARD_SCHEME>VISA</CARD_SCHEME>
<CURRENCY_CODE>GBP</CURRENCY_CODE>
<CUSTOMER_PRESENCE>internet</CUSTOMER_PRESENCE>
<FINAL_AMOUNT>1.00</FINAL_AMOUNT>
<MERCHANT_NUMBER>8888888</MERCHANT_NUMBER>
<PAN_NUMBER>454420******0382</PAN_NUMBER>
<PAN_EXPIRY>12/15</PAN_EXPIRY>
<TERMINAL_ID>04176421</TERMINAL_ID>
<TOKEN>454420bbbbbkqrm0382</TOKEN>
<TOTAL_AMOUNT>1.00</TOTAL_AMOUNT>
<TRANSACTION_DATA_SOURCE>keyed</TRANSACTION_DATA_SOURCE>
<TRANSACTION_DATE>14/02/2014</TRANSACTION_DATE>
<TRANSACTION_NUMBER>000001</TRANSACTION_NUMBER>
<TRANSACTION_RESPONSE>06130</TRANSACTION_RESPONSE>
<TRANSACTION_TIME>17:13:17</TRANSACTION_TIME>
<TRANSACTION_TYPE>purchase</TRANSACTION_TYPE>
<VERIFICATION_METHOD>unknown</VERIFICATION_METHOD>
<DUPLICATE>false</DUPLICATE>
</RECEIPT>]]>
</RECEIPT>
</INTERACTION>
</POI_MSG>
CDATA is an encoding mechanism to include arbitrary data within an XML file. Everything within CDATA is parsed as a single string when loading the XML into a Document instance. If you need to access the contents of the CDATA as a DOM document, you will need to instantiate a second Document object from the string contents, make your changes, then serialize that back to a string and put the string back into a CDATA in the original document.
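A rough sketch of that round trip with the standard JAXP classes is shown below. The class CdataEditor and its method names are invented for this example, and it assumes that the holder element contains nothing but the CDATA payload, that the payload is itself well-formed XML, and that the edited payload does not contain the sequence "]]>".

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class CdataEditor {

    static Document parse(DocumentBuilder db, String xml) throws Exception {
        return db.parse(new InputSource(new StringReader(xml)));
    }

    static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Replaces the text of the first <tag> inside the CDATA payload of the first <holderTag>. */
    static String changeInsideCdata(String outerXml, String holderTag, String tag, String value)
            throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document outer = parse(db, outerXml);
        Element holder = (Element) outer.getElementsByTagName(holderTag).item(0);

        // To the outer parser the CDATA payload is plain text, so parse it as a second document.
        Document inner = parse(db, holder.getTextContent().trim());
        inner.getElementsByTagName(tag).item(0).setTextContent(value);

        // Replace the holder's children with one CDATA section carrying the edited markup.
        while (holder.hasChildNodes()) {
            holder.removeChild(holder.getFirstChild());
        }
        holder.appendChild(outer.createCDATASection(serialize(inner)));
        return serialize(outer);
    }
}
```

With the message from the question, a call such as changeInsideCdata(message, "RECEIPT", "PAN_NUMBER", newPan) would target the outer RECEIPT element, because the inner <RECEIPT> is only text as far as the outer document is concerned.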
I don't think a CDATA section will be parsed like other regular elements in the XML. A CDATA section exists purely to escape syntax checks. My suggestion would be to use an element to represent the data in the CDATA section. If you still want to use a CDATA section, I guess you'll need to parse the section as a string and then load the data into a Document.
I want to parse XML elements using Java. I have succeeded in part, but I'm not sure how to do the rest. My XML is:
<MainTag>
<userid>user1</userid>
<country>US</country>
<city>LA</city>
<phone>
<number>1111111111</number>
</phone>
<phone>
<number>222222222</number>
</phone>
</MainTag>
<MainTag>
<userid>user2</userid>
<country>Aus</country>
<city>MB</city>
<phone>
<number>23233</number>
</phone>
<phone>
<number>8787822</number>
</phone>
<phone>
<number>10101</number>
</phone>
I am able to parse XML elements such as country, city, etc. as below.
public void endelement()
{
if (someText.equalsIgnoreCase("country"))
{
pojo.setCountry(Val);
}
else if(someText.equalsIgnoreCase("city"))
{
pojo.setCity(Val);
}
}
public void startelement()
{
............
}
But how can I parse phone, since one user has multiple phone numbers?
I want to find the multiple phone numbers for a particular user.
For example, in the above XML,
user1 has two phone numbers and
user2 has three phone numbers.
Can anybody help with this? Thanks in advance.
I would recommend using JAXB, since it appears you are attempting to bind your xml to a POJO.
Looking at the code you have written here (and assuming that the example XML you have provided is a snippet of well-formed XML), I am guessing that your POJO should have a member for phone numbers of type List<String>, and a method that allows you to add a phone number to the List (perhaps addPhoneNumber(String phoneNumber) {...}).
First, that is not well-formed XML (it has two root elements), and you can't parse it with any parser API unless it is well-formed. To parse the XML you would normally use one of the APIs meant for it, like SAX, DOM or StAX, or even better the JAXB binding API.
Since you seem to be new to this, I suggest you start learning JAXP. Use StAX instead of DOM or SAX.
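As a minimal sketch of the StAX route, the repeated <phone> entries can simply be appended to a list on the object currently being built. The classes User and UserParser below are hypothetical stand-ins for the pojo in the question, and the example assumes the <MainTag> elements have been wrapped in a single root element so that the input is well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical POJO: one instance per <MainTag>.
final class User {
    String userId;
    String country;
    String city;
    final List<String> phoneNumbers = new ArrayList<>();
}

// Hypothetical parser, assuming userid/country/city/number only occur inside <MainTag>.
final class UserParser {

    static List<User> parse(String xml) throws XMLStreamException {
        XMLStreamReader r =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<User> users = new ArrayList<>();
        User current = null;
        try {
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                switch (r.getLocalName()) {
                    case "MainTag":          // a new user starts here
                        current = new User();
                        users.add(current);
                        break;
                    case "userid":
                        current.userId = r.getElementText();
                        break;
                    case "country":
                        current.country = r.getElementText();
                        break;
                    case "city":
                        current.city = r.getElementText();
                        break;
                    case "number":           // may occur any number of times per user
                        current.phoneNumbers.add(r.getElementText());
                        break;
                    default:                 // <phone> and the wrapper root need no action
                        break;
                }
            }
        } finally {
            r.close();
        }
        return users;
    }
}
```

Each call to getElementText() reads the text of a leaf element and leaves the reader on its end tag, so the loop can continue with the next event.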
You can use the standard Java class DocumentBuilderFactory if you know the incoming XML format, for example how many nodes it has and their names. It is very simple; look at this code:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
//declared outside the try block so it is still visible after parsing
Document dom = null;
try {
//documentBuilder instance
DocumentBuilder db = dbf.newDocumentBuilder();
dom = db.parse("employees.xml");
}catch(ParserConfigurationException pce) {
pce.printStackTrace();
}catch(SAXException se) {
se.printStackTrace();
}catch(IOException ioe) {
ioe.printStackTrace();
}
//and then get the root element
Element de= dom.getDocumentElement();
//get the nodelist of main element
NodeList nl = de.getElementsByTagName("Employee");
if(nl != null && nl.getLength() > 0) {
for(int i = 0 ; i < nl.getLength();i++) {
//get the employee element
Element el = (Element)nl.item(i);
}
}
//and then get data
private void getEmployee(Element el) {
//for each <employee> element get values
String name = getTextValue(el,"Name");
int id = getIntValue(el,"Id");
int age = getIntValue(el,"Age");
//get any element attribute
//String type = el.getAttribute("type");
}
That's all.
I'm trying to parse an XML string, and the tagnames are variable; I haven't seen any examples on how to pull the information out without knowing them. For example, I will always know the <response> and <data> tags below, but what falls inside/outside of them could be anything from <employee> to you name it.
<?xml version="1.0" encoding="UTF-8"?>
<response>
<generic>
....
</generic>
<data>
<employee>
<name>Seagull</name>
<id>3674</id>
<age>34</age>
</employee>
<employee>
<name>Robin</name>
<id>3675</id>
<age>25</age>
</employee>
</data>
</response>
You could parse it into a generic dom object and traverse it. For example, you could use dom4j to do this.
From the dom4j quick start guide:
public void treeWalk(Document document) {
treeWalk( document.getRootElement() );
}
public void treeWalk(Element element) {
for ( int i = 0, size = element.nodeCount(); i < size; i++ ) {
Node node = element.node(i);
if ( node instanceof Element ) {
treeWalk( (Element) node );
}
else {
// do something....
}
}
}
public Document parse(URL url) throws DocumentException {
SAXReader reader = new SAXReader();
Document document = reader.read(url);
return document;
}
I have seen a similar situation in projects.
If you are going to deal with large XML files, you can use a StAX or SAX parser to read the XML. At every step (for example, on reaching an end element), enter the data into a Map or a data structure of your choice, keeping tag names as keys and their text as values. Once the parsing is done, use this Map to figure out which object to build, since by then you have a proper representation of the information in the XML.
If the XML is small, use DOM and build the entity object directly by reading the specific tag (like <employee>), or use XPath to check where you expect the tag to be present, which gives you a hint of the entity. Build that object directly by reading the specific information from the XML.
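The Map-based idea could look roughly like this with StAX. The class GenericRecordReader and the "_type" key are assumptions made for this sketch. It also assumes that every direct child of <data> is a flat record whose children contain only text, as in the <employee> example from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical reader: turns each child of <data> into a map without knowing its tag names.
final class GenericRecordReader {

    /** One map per record: "_type" holds the record's tag name, other keys are its child tags. */
    static List<Map<String, String>> read(String xml) throws XMLStreamException {
        XMLStreamReader r =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<Map<String, String>> records = new ArrayList<>();
        Map<String, String> current = null;
        int depth = 0;        // nesting depth of the element the reader is currently in
        int dataDepth = -1;   // depth of the open <data> element, or -1 when outside it
        try {
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    String name = r.getLocalName();
                    if (dataDepth < 0) {
                        if (name.equals("data")) {
                            dataDepth = depth;
                        }
                    } else if (depth == dataDepth + 1) {
                        // A record element such as <employee>; its name is not known in advance.
                        current = new LinkedHashMap<>();
                        current.put("_type", name);
                        records.add(current);
                    } else if (depth == dataDepth + 2) {
                        // A field of the record; getElementText() also consumes its end tag.
                        current.put(name, r.getElementText());
                        depth--;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth == dataDepth) {
                        dataDepth = -1;
                    }
                    depth--;
                }
            }
        } finally {
            r.close();
        }
        return records;
    }
}
```

After parsing, the "_type" entry of each map tells you which kind of entity to build from the remaining entries.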