In my current project, we are refactoring a Java class that constructs an XML document. In previous versions of the product delivered to the customer, the XML document was built with lower-case element and attribute names:
<rootElement attr = "abc">
<childElement childAttr = "xyz"/>
</rootElement>
But now we have a requirement to build the XML document with TitleCase element and attribute names. The user will set a flag in a properties file to indicate whether the document should be built in lower case or TitleCase. If the flag is set to TitleCase, the resulting document will look like:
<RootElement Attr = "abc">
<ChildElement ChildAttr = "xyz"/>
</RootElement>
Various approaches to solve the problem:
1. Plug in a transformer to convert the lowercase XML document to a TitleCase XML document. But this will hurt overall performance, as we deal with huge XML files spanning more than 10,000 lines.
2. Create two separate maps with the corresponding XML element and attribute names.
For example:
lowercase map: rootelement -> rootElement, attr -> attr ....
TitleCase map: rootelement -> RootElement, attr -> Attr ....
Based on the property set by the user, the corresponding map is chosen, and the element/attribute names from this map are used to build the XML document.
3. Use an enum to define the constants and their corresponding values.
public enum XMLConstants {
ROOTELEMENT("rootElement", "RootElement"),
ATTRIBUTE("attr", "Attr");
private String lowerCase;
private String titleCase;
private XMLConstants(String aLowerCase, String aTitleCase){
titleCase = aTitleCase;
lowerCase = aLowerCase;
}
public String getValue(boolean isLowerCase){
if(isLowerCase){
return lowerCase;
} else {
return titleCase;
}
}
}
--------------------------------------------------------------
// XML document builder
if(propertyFlag){
isLowerCase = false;
} else {
isLowerCase = true;
}
....
....
createRootElement(ROOTELEMENT.getValue(isLowerCase));
createAttribute(ATTRIBUTE.getValue(isLowerCase));
Please help me choose the right option keeping in mind the performance aspect of the entire solution. If you have any other suggestions, please let me know.
// set before generating the XML
boolean isUpperCase;
// use a function for each tag/attribute name instead of a string constant,
// something like getInCase("rootElement")
String getInCase(String initialName) {
    String initialFirstCharacter = initialName.substring(0, 1);
    String actualFirstCharacter;
    if (isUpperCase) {
        actualFirstCharacter = initialFirstCharacter.toUpperCase();
    } else {
        actualFirstCharacter = initialFirstCharacter.toLowerCase();
    }
    return actualFirstCharacter + initialName.substring(1);
}
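A minimal, self-contained sketch of the same idea, assuming the flag is passed in as a parameter instead of being kept in a field. The class name CaseNames and the method name inCase are illustrative and not taken from the project. Unlike substring(0, 1), it also tolerates null or empty names.

```java
class CaseNames {
    // Returns the name with its first character upper-cased (TitleCase)
    // or lower-cased, leaving the rest of the name untouched.
    static String inCase(String name, boolean titleCase) {
        if (name == null || name.isEmpty()) {
            return name; // nothing to convert
        }
        char first = name.charAt(0);
        char converted = titleCase
                ? Character.toUpperCase(first)
                : Character.toLowerCase(first);
        return converted + name.substring(1);
    }

    public static void main(String[] args) {
        // The flag would normally come from the properties file.
        boolean titleCase = true;
        System.out.println(inCase("rootElement", titleCase));
        System.out.println(inCase("childAttr", titleCase));
    }
}
```

The conversion only touches one character per name, so its cost is small compared with building a document of 10,000+ lines. If even that matters, the converted names could be computed once and cached.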
I have a large XML file and below is an extract from it:
...
<LexicalEntry id="Ait~ifAq_1">
<Lemma partOfSpeech="n" writtenForm="اِتِّفاق"/>
<Sense id="Ait~ifAq_1_tawaAfuq_n1AR" synset="tawaAfuq_n1AR"/>
<WordForm formType="root" writtenForm="وفق"/>
</LexicalEntry>
<LexicalEntry id="tawaA&um__1">
<Lemma partOfSpeech="n" writtenForm="تَوَاؤُم"/>
<Sense id="tawaA&um__1_AinosijaAm_n1AR" synset="AinosijaAm_n1AR"/>
<WordForm formType="root" writtenForm="وأم"/>
</LexicalEntry>
<LexicalEntry id="tanaAgum_2">
<Lemma partOfSpeech="n" writtenForm="تناغُم"/>
<Sense id="tanaAgum_2_AinosijaAm_n1AR" synset="AinosijaAm_n1AR"/>
<WordForm formType="root" writtenForm="نغم"/>
</LexicalEntry>
<Synset baseConcept="3" id="tawaAfuq_n1AR">
<SynsetRelations>
<SynsetRelation relType="hyponym" targets="AinosijaAm_n1AR"/>
<SynsetRelation relType="hyponym" targets="AinosijaAm_n1AR"/>
<SynsetRelation relType="hypernym" targets="ext_noun_NP_420"/>
</SynsetRelations>
<MonolingualExternalRefs>
<MonolingualExternalRef externalReference="13971065-n" externalSystem="PWN30"/>
</MonolingualExternalRefs>
</Synset>
...
I want to extract specific information from it. For a given writtenForm, whether from <Lemma> or <WordForm>, the program takes the synset value from the <Sense> of that writtenForm (same <LexicalEntry>) and searches for every <Synset> whose id equals that synset value. The program then lists all the relations of that Synset: for each one it displays the relType, goes back to the <LexicalEntry> elements, finds those whose <Sense> has a synset equal to the relation's targets value, and displays their writtenForm.
I think it's a little bit complicated, but the result should look like this:
اِتِّفاق hyponym تَوَاؤُم, اِنْسِجام
One possible solution is to use a stream reader because of memory consumption, but I don't know how I should proceed to get what I want. Please help.
The SAX parser is different from the DOM parser. It looks only at the current item; it cannot see later items until they become the current item. It is one of several parsers you can use when the XML file is extremely big. Here are some of the available parsers:
SAX PARSER
DOM PARSER
JDOM PARSER
DOM4J PARSER
STAX PARSER
You can find tutorials for all of them here.
In my opinion, after learning it, go straight to DOM4J or JDOM for a commercial product.
The logic of the SAX parser is that you have a handler class (UserHandler below) which extends DefaultHandler and overrides (@Override) some of its methods:
XML FILE:
<?xml version="1.0"?>
<class>
<student rollno="393">
<firstname>dinkar</firstname>
<lastname>kad</lastname>
<nickname>dinkar</nickname>
<marks>85</marks>
</student>
<student rollno="493">
<firstname>Vaneet</firstname>
<lastname>Gupta</lastname>
<nickname>vinni</nickname>
<marks>95</marks>
</student>
<student rollno="593">
<firstname>jasvir</firstname>
<lastname>singn</lastname>
<nickname>jazz</nickname>
<marks>90</marks>
</student>
</class>
Handler class:
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class UserHandler extends DefaultHandler {
boolean bFirstName = false;
boolean bLastName = false;
boolean bNickName = false;
boolean bMarks = false;
@Override
public void startElement(String uri,
String localName, String qName, Attributes attributes)
throws SAXException {
if (qName.equalsIgnoreCase("student")) {
String rollNo = attributes.getValue("rollno");
System.out.println("Roll No : " + rollNo);
} else if (qName.equalsIgnoreCase("firstname")) {
bFirstName = true;
} else if (qName.equalsIgnoreCase("lastname")) {
bLastName = true;
} else if (qName.equalsIgnoreCase("nickname")) {
bNickName = true;
}
else if (qName.equalsIgnoreCase("marks")) {
bMarks = true;
}
}
@Override
public void endElement(String uri,
String localName, String qName) throws SAXException {
if (qName.equalsIgnoreCase("student")) {
System.out.println("End Element :" + qName);
}
}
@Override
public void characters(char ch[],
int start, int length) throws SAXException {
if (bFirstName) {
System.out.println("First Name: "
+ new String(ch, start, length));
bFirstName = false;
} else if (bLastName) {
System.out.println("Last Name: "
+ new String(ch, start, length));
bLastName = false;
} else if (bNickName) {
System.out.println("Nick Name: "
+ new String(ch, start, length));
bNickName = false;
} else if (bMarks) {
System.out.println("Marks: "
+ new String(ch, start, length));
bMarks = false;
}
}
}
Main Class :
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class SAXParserDemo {
public static void main(String[] args){
try {
File inputFile = new File("input.txt");
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
UserHandler userhandler = new UserHandler();
saxParser.parse(inputFile, userhandler);
} catch (Exception e) {
e.printStackTrace();
}
}
}
XPath was designed for exactly this. Java provides support for it in the javax.xml.xpath package.
To do what you want, the code will look something like this:
List<String> findRelations(String word,
Path xmlFile)
throws XPathException {
String xmlLocation = xmlFile.toUri().toASCIIString();
XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setXPathVariableResolver(
name -> (name.getLocalPart().equals("word") ? word : null));
String id = xpath.evaluate(
"//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset",
new InputSource(xmlLocation));
xpath.setXPathVariableResolver(
name -> (name.getLocalPart().equals("id") ? id : null));
NodeList matches = (NodeList) xpath.evaluate(
"//Synset[@id=$id]/SynsetRelations/SynsetRelation",
new InputSource(xmlLocation),
XPathConstants.NODESET);
List<String> relations = new ArrayList<>();
int matchCount = matches.getLength();
for (int i = 0; i < matchCount; i++) {
Element match = (Element) matches.item(i);
String relType = match.getAttribute("relType");
String synset = match.getAttribute("targets");
xpath.setXPathVariableResolver(
name -> (name.getLocalPart().equals("synset") ? synset : null));
NodeList formNodes = (NodeList) xpath.evaluate(
"//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm",
new InputSource(xmlLocation),
XPathConstants.NODESET);
int formCount = formNodes.getLength();
StringJoiner forms = new StringJoiner(",");
for (int j = 0; j < formCount; j++) {
forms.add(
formNodes.item(j).getNodeValue());
}
relations.add(
String.format("%s %s %s", word, relType, forms));
}
return relations;
}
Some basic XPath information:
XPath uses a single file-path-like string to match parts of an XML document. The parts can be any structural part of the document: text, elements, attributes, even things like comments.
A Java XPath expression can attempt to match exactly one part, or several parts, or can even concatenate all matched parts as a String.
In an XPath expression, a name by itself represents an element. For example, WordForm in XPath means any <WordForm …> element in the XML document.
A name starting with @ represents an attribute. For example, @writtenForm refers to any writtenForm=… attribute in the XML document.
A slash indicates a parent and child in an XML document. LexicalEntry/Lemma means any <Lemma> element which is a direct child of a <LexicalEntry> element. Synset/@id means the id=… attribute of any <Synset> element.
Just as a path starting with / indicates an absolute (root-relative) path in Unix, an XPath starting with a slash indicates an expression relative to the root of an XML document.
Two slashes means a descendant which may be a direct child, a grandchild, a great-grandchild, etc. Thus, //LexicalEntry means any LexicalEntry in the document; /LexicalEntry only matches a LexicalEntry element which is the root element.
Square brackets indicate match qualifiers. Synset[@baseConcept='3'] matches any <Synset> element with a baseConcept attribute whose value is the string "3".
XPath can refer to variables, which are defined externally, using Unix-shell-like $ substitutions, like $word. How those variables are passed to an XPath expression depends on the engine. Java uses the setXPathVariableResolver method. Variable names are in a completely separate namespace from node names, so it is of no consequence if a variable name is the same as an element name or attribute name in the XML document.
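The points above can be tried out on a tiny in-memory document. This is only a sketch: the root element name LexicalResource, the sample entries, and the names XPathBasics and synsetFor are assumptions made for the demonstration. It shows a descendant match (//), a qualifier in square brackets, an attribute reference (written with @), and a $word variable supplied through setXPathVariableResolver.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathBasics {
    // Hypothetical miniature of the lexicon file, just for this demo.
    static final String XML =
          "<LexicalResource>"
        + "<LexicalEntry id='e1'><Lemma writtenForm='alpha'/><Sense synset='s1'/></LexicalEntry>"
        + "<LexicalEntry id='e2'><Lemma writtenForm='beta'/><Sense synset='s2'/></LexicalEntry>"
        + "</LexicalResource>";

    // Returns the synset attribute of the entry whose Lemma matches the word,
    // or an empty string when nothing matches.
    static String synsetFor(String word) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Supplies the value of $word used in the expression below.
        xpath.setXPathVariableResolver(
            name -> name.getLocalPart().equals("word") ? word : null);
        try {
            return xpath.evaluate(
                "//LexicalEntry[Lemma/@writtenForm=$word]/Sense/@synset",
                new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(synsetFor("beta"));
    }
}
```

Evaluating to a String returns the string value of the first matching node, which is why no NodeList handling is needed here.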
So, the XPath expressions in the code mean:
//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset
Match any <LexicalEntry> element anywhere in the XML document which has either
a WordForm child with a writtenForm attribute whose value is equal to the word variable
a Lemma child with a writtenForm attribute whose value is equal to the word variable
and for every such <LexicalEntry> element, return the value of the synset attribute of any <Sense> element which is a direct child of the <LexicalEntry> element.
The word variable is defined externally, by an xpath.setXPathVariableResolver, right before the XPath expression is evaluated.
//Synset[@id=$id]/SynsetRelations/SynsetRelation
Match any <Synset> element anywhere in the XML document whose id attribute is equal to the id variable. For each such <Synset> element, look for any direct SynsetRelations child element, and return each of its direct SynsetRelation children.
The id variable is defined externally, by an xpath.setXPathVariableResolver, right before the XPath expression is evaluated.
//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm
Match any <LexicalEntry> element anywhere in the XML document which has a <Sense> child element which has a synset attribute whose value is identical to the synset variable. For each matched element, find any <WordForm> child element and return that element’s writtenForm attribute.
The synset variable is defined externally, by an xpath.setXPathVariableResolver, right before the XPath expression is evaluated.
Logically, what the above should amount to is:
Locate the synset value for the requested word.
Use the synset value to locate SynsetRelation elements.
Locate writtenForm values corresponding to the targets value of each matched SynsetRelation.
If this XML file is too large to represent in memory, use SAX.
You will want to write your SAX parser to maintain a location. To do this, I typically use a StringBuffer, but a Stack of Strings would work just as nicely. This portion will be important because it will permit you to keep track of the path back to the root of the document, which will allow you to understand where in the document you are at a given point in time (useful when trying to only extract a little information).
The main logic flow looks like:
1. When entering a node, add the node's name to the stack.
2. When exiting a node, pop the node's name (top element) off the stack.
3. To know your location, read your current branch of the XML from the bottom of the stack to the top of the stack.
4. When entering a region you care about, clear the buffer you will capture the characters into
5. When exiting a region you care about, flush the buffer into the data structure you will return back as your output.
This way you can efficiently skip over all the branches of the XML tree that you don't care about.
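The flow described above can be sketched roughly as follows. This is an illustration under assumptions rather than a finished parser: the names PathTrackingHandler, wantedPath and extract are invented for the example, and a Deque is used as the stack instead of a StringBuffer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();   // element names, innermost on top
    private final StringBuilder text = new StringBuilder();  // characters of the wanted region
    private final String wantedPath;                         // e.g. "/class/student/firstname"
    final List<String> captured = new ArrayList<>();

    PathTrackingHandler(String wantedPath) {
        this.wantedPath = wantedPath;
    }

    // Reads the stack from bottom (root) to top (current element).
    private String currentPath() {
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = path.descendingIterator();
        while (it.hasNext()) {
            sb.append('/').append(it.next());
        }
        return sb.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.push(qName);                       // step 1: entering a node
        if (currentPath().equals(wantedPath)) {
            text.setLength(0);                  // step 4: clear the buffer
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentPath().equals(wantedPath)) {
            text.append(ch, start, length);     // may be called several times per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentPath().equals(wantedPath)) {
            captured.add(text.toString().trim()); // step 5: flush the buffer
        }
        path.pop();                             // step 2: exiting a node
    }

    // Convenience helper: parses an XML string and returns the captured texts.
    static List<String> extract(String xml, String wantedPath) {
        PathTrackingHandler handler = new PathTrackingHandler(wantedPath);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.captured;
    }
}
```

For example, PathTrackingHandler.extract(xml, "/class/student/firstname") would collect only the first names and ignore every other branch of the document.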
I did some research, and it seems that Jsoup makes this change (converting tag names to lower case) by default. Is there a way to configure this, is there some other parser whose output I can convert to a Jsoup document, or is there some other way to fix this?
Unfortunately not, the constructor of Tag class changes the name to lower case:
private Tag(String tagName) {
this.tagName = tagName.toLowerCase();
}
But there are two ways to change this behaviour:
1. If you want a clean solution, you can clone / download the JSoup Git repository and change this line.
2. If you want a dirty solution, you can use reflection.
Example for #2:
Field tagName = Tag.class.getDeclaredField("tagName"); // Get the field which contains the tagname
tagName.setAccessible(true); // Set accessible to allow changes
for( Element element : doc.select("*") ) // Iterate over all tags
{
Tag tag = element.tag(); // Get the tag of the element
String value = tagName.get(tag).toString(); // Get the value (= name) of the tag
if( !value.startsWith("#") ) // You can ignore all tags starting with a '#'
{
tagName.set(tag, value.toUpperCase()); // Set the tagname to the uppercase
}
}
tagName.setAccessible(false); // Revert to false
Here is a code sample (version >= 1.11.x):
Parser parser = Parser.htmlParser();
parser.settings(new ParseSettings(true, true));
Document doc = parser.parseInput(html, baseUrl);
The ParseSettings class was introduced in version 1.9.3.
It comes with options to preserve case for tags and attributes.
You must use xmlParser instead of htmlParser and the tags will remain unchanged. One line does the trick:
String html = "<camelCaseTag>some text</camelCaseTag>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
I am using the 1.11.1-SNAPSHOT version, which does not have this piece of code:
private Tag(String tagName) {
this.tagName = tagName.toLowerCase();
}
So I checked ParseSettings as suggested above and changed this piece of code from:
static {
htmlDefault = new ParseSettings(false, false);
preserveCase = new ParseSettings(true, true);
}
to:
static {
htmlDefault = new ParseSettings(true, true);
preserveCase = new ParseSettings(true, true);
}
and skipped test cases while building JAR.
I am currently working on an academic project, developing in Java and XML. The actual task is to parse XML and pass the required values, preferably in a HashMap, on for further processing. Here is a short snippet of the actual XML.
<root>
<BugReport ID = "1">
<Title>"(495584) Firefox - search suggestions passes wrong previous result to form history"</Title>
<Turn>
<Date>'2009-06-14 18:55:25'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "3.1"> Created an attachment (id=383211) [details] Patch v.2</Sentence>
<Sentence ID = "3.2"> Ah. So, there's a ._formHistoryResult in the....</Sentence>
<Sentence ID = "3.3"> The simple fix it to just discard the service's form history result.</Sentence>
<Sentence ID = "3.4"> Otherwise it's trying to use a old form history result that no longer applies for the search string.</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 12:07:34'</Date>
<From>'Gavin Sharp'</From>
<Text>
<Sentence ID = "4.1"> (From update of attachment 383211 [details])</Sentence>
<Sentence ID = "4.2"> Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 13:17:56'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "5.1"> (In reply to comment #3)</Sentence>
<Sentence ID = "5.2"> > (From update of attachment 383211 [details] [details])</Sentence>
<Sentence ID = "5.3"> > Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
<Sentence ID = "5.4"> Good point.</Sentence>
<Sentence ID = "5.5"> I renamed the one in the wrapper to _formHistResult. </Sentence>
<Sentence ID = "5.6"> fhResult seemed maybe a bit too short.</Sentence>
</Text>
</Turn>
.....
and so on
</BugReport>
There are many commenters like 'Justin Dolske' who have commented on this report, and what I am actually looking for is the list of commenters and all the sentences each of them has written in the whole XML file. Something like if(from == justin dolske) getHisAllSentences(), and similarly for all other commenters. I have tried many different ways to get the sentences for 'Justin Dolske' only, for the other commenters, or in a generic form for all of them, using XPath, SAX and DOM, but failed. I am quite new to these technologies, including Java, and don't know how to achieve it.
Can anyone guide me specifically on how I could do this with any of the above technologies, or is there a better strategy?
(Note: Later I want to put it in a HashMap (key, value), where the key is the name of the commenter (justin dolske) and the value is all of their sentences.)
Urgent help will be highly appreciated.
There are several ways in which you can achieve your requirement.
One way would be to use JAXB. There are several tutorials on it available on the web, so feel free to refer to them.
You can also create a DOM, extract the data from it, and put it into your HashMap.
One reference implementation would be something like this:
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
public class XMLReader {
private HashMap<String,ArrayList<String>> namesSentencesMap;
public XMLReader() {
namesSentencesMap = new HashMap<String, ArrayList<String>>();
}
private Document getDocument(String fileName){
Document document = null;
try{
document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File(fileName));
}catch(Exception exe){
//handle exception
}
return document;
}
private void buildNamesSentencesMap(Document document){
if(document == null){
return;
}
//Get each Turn block
NodeList turnList = document.getElementsByTagName("Turn");
String fromName = null;
NodeList sentenceNodeList = null;
for(int turnIndex = 0; turnIndex < turnList.getLength(); turnIndex++){
Element turnElement = (Element)turnList.item(turnIndex);
//Assumption: each <Turn> contains a <From> element; the first one is used
Element fromElement = (Element) turnElement.getElementsByTagName("From").item(0);
fromName = fromElement.getTextContent();
//Extracting sentences - First check whether the map contains
//an ArrayList corresponding to the name. If yes, then use that,
//else create a new one
ArrayList<String> sentenceList = namesSentencesMap.get(fromName);
if(sentenceList == null){
sentenceList = new ArrayList<String>();
}
//Extract sentences from the Turn node
try{
sentenceNodeList = turnElement.getElementsByTagName("Sentence");
for(int sentenceIndex = 0; sentenceIndex < sentenceNodeList.getLength(); sentenceIndex++){
sentenceList.add(((Element)sentenceNodeList.item(sentenceIndex)).getTextContent());
}
}finally{
sentenceNodeList = null;
}
//Put the list back in the map
namesSentencesMap.put(fromName, sentenceList);
}
}
public static void main(String[] args) {
XMLReader reader = new XMLReader();
reader.buildNamesSentencesMap(reader.getDocument("<your_xml_file>"));
for(String names: reader.namesSentencesMap.keySet()){
System.out.println("Name: "+names+"\tTotal Sentences: "+reader.namesSentencesMap.get(names).size());
}
}
}
Note: This is just a demonstration and you would need to modify it to suit your need. I've created it based on your XML to show one way of doing it.
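If the file is large, the same name-to-sentences map can also be built in a single streaming pass. The following is only a sketch using the JDK's StAX API (javax.xml.stream). The class name TurnGrouper and the method name group are made up for illustration, and it assumes that every <From> comes before the <Sentence> elements of its <Turn>, as in the snippet from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TurnGrouper {
    // Maps each commenter name to all sentences found in that commenter's turns.
    static Map<String, List<String>> group(String xml) {
        Map<String, List<String>> byAuthor = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            String from = null; // commenter of the turn currently being read
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if (name.equals("Turn")) {
                    from = null; // a new turn starts, forget the previous commenter
                } else if (name.equals("From")) {
                    from = reader.getElementText().trim();
                } else if (name.equals("Sentence") && from != null) {
                    String sentence = reader.getElementText().trim();
                    byAuthor.computeIfAbsent(from, k -> new ArrayList<>()).add(sentence);
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return byAuthor;
    }
}
```

Note that the keys keep the quotes from the XML, so the lookup would be group(xml).get("'Justin Dolske'").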
I suggest using JAXB to create a data model reflecting your XML structure.
Once done, you can load the XML into Java instances.
Put each 'Turn' into a Map< String, List< Turn >>, using Turn.From as the key.
Once done, you can write:
List< Turn > justinsTurn = allTurns.get( "'Justin Dolske'" );
I'm trying to create a SAML response. One of the attributes that makes up the assertion is called address and the attribute value needs to be a custom type that is defined in an XSD. How do I add custom attribute value types to the response?
If your attribute value XML is in String form:
String yourXMLFragment = "...";
AttributeStatementBuilder attributeStatementBuilder =
(AttributeStatementBuilder) builderFactory.getBuilder(AttributeStatement.DEFAULT_ELEMENT_NAME);
AttributeStatement attributeStatement = attributeStatementBuilder.buildObject();
AttributeBuilder attributeBuilder =
(AttributeBuilder) builderFactory.getBuilder(Attribute.DEFAULT_ELEMENT_NAME);
Attribute attr = attributeBuilder.buildObject();
attr.setName("yourAttributeName");
XSAnyBuilder sb2 = (XSAnyBuilder) builderFactory.getBuilder(XSAny.TYPE_NAME);
XSAny attrAny = sb2.buildObject(AttributeValue.DEFAULT_ELEMENT_NAME, XSAny.TYPE_NAME);
attrAny.setTextContent(yourXMLFragment.trim());
attr.getAttributeValues().add(attrAny);
attributeStatement.getAttributes().add(attr);
Actually, the above does not yield correct results. It can only be used to create an XSAny with text content, not XML content (XML content gets escaped).
So, after digging in the OpenSAML sources, the following worked as needed:
public XSAny createXSAny(Element dom)
{
XSAnyBuilder anyBuilder = (XSAnyBuilder) Configuration.getBuilderFactory().getBuilder(XSAny.TYPE_NAME);
XSAny any = anyBuilder.buildObject(AttributeValue.DEFAULT_ELEMENT_NAME, XSAny.TYPE_NAME);
// this builds only the root element not the whole dom
XSAny xo=anyBuilder.buildObject(dom);
// set/populate dom so whole dom gets into picture
xo.setDOM(dom);
any.getUnknownXMLObjects().add(xo);
return any;
}
I'm using the javax.xml.transform.Transformer class to transform a DOM source into an XML string. I have some empty elements in the DOM tree, and these are collapsed into a single self-closing tag, which I don't want.
How do I prevent <sampletag></sampletag> from becoming <sampletag/>?
I had the same problem.
This is a function that produces that result.
public static String fixClosedTag(String rawXml){
LinkedList<String[]> listTags = new LinkedList<String[]>();
String splittato[] = rawXml.split("<");
String prettyXML="";
int counter = 0;
for(int x=0;x<splittato.length;x++){
String tmpStr = splittato[x];
int indexEnd = tmpStr.indexOf("/>");
if(indexEnd>-1){
String nameTag = tmpStr.substring(0, (indexEnd));
String oldTag = "<"+ nameTag +"/>";
String newTag = "<"+ nameTag +"></"+ nameTag +">";
String tag[]=new String [2];
tag[0] = oldTag;
tag[1] = newTag;
listTags.add(tag);
}
}
prettyXML = rawXml;
for(int y=0;y<listTags.size();y++){
String el[] = listTags.get(y);
prettyXML = prettyXML.replace(el[0],el[1]); // literal replacement; replaceAll would treat the tag as a regex
}
return prettyXML;
}
If you want to control how XML is formatted, provide your own ContentHandler to prettify XML into "text". It should not matter to the receiving end (unless human) whether it receives <name></name> or <name/> - they both mean the same thing.
The two representations are equivalent to an XML parser, so it doesn't matter.
If you want to process XML with anything other than an XML parser, you will end up with a lot of work and an XML parser anyway.
If the process you are sending it through NEEDS the element not to be self-closing (which it should not), you can force the element not to be self-closing by placing content inside of it.
How does the PDF converter handle XML comments or processing instructions?
<sampletag><!--Sample Comment--></sampletag>
<sampletag><?SampleProcessingInstruction?></sampletag>
I tried the following to prevent empty tags from being collapsed into a single tag:
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.METHOD,"html");
It retains the empty tags.
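A small way to compare the two output methods side by side, as a sketch only. The class name EmptyTagDemo and the helper serialize are invented for this example, and the exact output depends on the transformer implementation in use (the one bundled with the JDK is assumed here). Keep in mind that the html method follows HTML serialization rules, so element names that HTML treats as void (such as br) and some escaping are handled differently from XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class EmptyTagDemo {
    // Serializes the given XML with the requested output method ("xml" or "html").
    static String serialize(String xml, String method) {
        try {
            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.METHOD, method);
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><sampletag></sampletag></root>";
        System.out.println("xml : " + serialize(xml, "xml"));
        System.out.println("html: " + serialize(xml, "html"));
    }
}
```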