Extracting XML Attributes in Java(ISBNDB)


I'm writing an Android app that needs to grab book price data from the web. I found isbndb.com, which seems to provide good resources and price comparison. The only issue is that their XML files are a bit complex.
I am new to parsing XML in Java. I know how to parse basic XML files with simple tags, and I usually use DocumentBuilder and DocumentBuilderFactory. However, this is the part of the file I'm trying to parse:
<Prices price_time="2012-04-08T20:05:49Z">
<Price store_isbn="" store_title="Discworld: Thief of Time" store_url="http://isbndb.com/x/book/thief_of_time/buy/isbn/ebay.html" store_id="ebay" currency_code="USD" is_in_stock="1" is_historic="0" check_time="2008-12-09T12:00:51Z" is_new="0" currency_rate="1" price="0.99"/>
<Price store_isbn="" store_title="" store_url="http://bookshop.blackwell.com/bobus/scripts/home.jsp?action=search&amp;type=isbn&amp;term=0061031321&amp;source=1154376025" store_id="blackwell" currency_code="USD" is_in_stock="0" is_historic="0" is_new="1" check_time="2011-11-08T02:54:15Z" currency_rate="1" price="7.99"/>
</Prices>
What I am trying to do is grab the attribute values, such as store_isbn or store_title. If anyone could help me with this, I would really appreciate it.
Thanks

You can parse the XML with a SAX parser. To retrieve the attribute values, override startElement in your handler as follows:
public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {
    // Only the <Price> elements carry store_title; <Prices> does not.
    if ("Price".equals(qName)) {
        System.out.println("Start Element :" + attributes.getValue("store_title"));
    }
}
attributes.getValue("store_title") returns the value of the named attribute, or null if the element has no such attribute. Hope this helps.
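For completeness, here is a minimal sketch of how that handler might be wired into a full parse. The class name PriceAttributes and the method readPrices are made up for illustration, and the XML is a shortened version of the sample from the question; a real app would read the response from isbndb.com instead of a hard-coded string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PriceAttributes {

    /** Collects "store_id=price" for every <Price> element in the given XML. */
    public static List<String> readPrices(String xml) throws Exception {
        final List<String> prices = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                // The default factory is not namespace-aware, so qName is the plain tag name.
                if ("Price".equals(qName)) {
                    prices.add(attributes.getValue("store_id") + "=" + attributes.getValue("price"));
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return prices;
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample; only a few of the original attributes are kept.
        String xml = "<Prices price_time=\"2012-04-08T20:05:49Z\">"
                + "<Price store_id=\"ebay\" store_title=\"Discworld: Thief of Time\" price=\"0.99\"/>"
                + "<Price store_id=\"blackwell\" store_title=\"\" price=\"7.99\"/>"
                + "</Prices>";
        System.out.println(readPrices(xml));
    }
}
```

The same getValue call works for any of the other attributes (store_url, currency_code, and so on); an attribute that is present but empty comes back as an empty string rather than null.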

Related

Create a DOCX reading data from Oracle database

I have a student database (Oracle 11g) and need to create a separate module that generates a student's details as a well-formatted Word document. Given a student ID, it should produce a presentable .docx file containing all of that student's information (a kind of biodata). I'm not sure how to start; I have been exploring python-docx and the Java library docx4j. How can I achieve this? Is there a tool that can do it?
Your help is highly appreciated.

You could extract the data from Oracle into an XML format, then use content control data binding in your Word document to bind elements in the XML.
All you need to do is inject the XML into the docx as a custom xml part, and Word will display the results automatically.
docx4j can help you inject the XML. If you don't want to rely on Word to display the results, you can also use docx4j to apply the bindings.
Or you could try simple variable replacement: https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/VariableReplace.java

If you want a simple way to format your Word document directly from Java, you can try pxDoc.
The screenshot below provides an example of code and the document generated from an Authors/Books model: however you request the data from your database, it is easy to render it in a well-formatted document.
[Screenshot: simple document generation example]
Regarding your use case, you could also generate a document for all students at once. In the context of the screenshot example:
for (author:library.authors) {
    var filename = 'c:/MyDocuments/'+author.name+'.docx'
    document fileName:filename {
        /** Content of my document */
    }
}

Edit the link of a XML Entity with java

I am trying to change the link of an entity in an XML file with Java.
The original link is an internet link, and I would like it to be treated as a local link while the document is being parsed.
I will download the content available at this link beforehand.
This is the original kind of link :
<!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML" "http://www.website.com/category/xml_schema/ISOEntities">
This is the result I would like to have:
<!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML" "C:\data\xml\entities\ISOEntities">
So, when the original link is detected, I would like to load the ISOEntities data from the local link instead of the internet link, but without changing the original link (I will not write to the file to change it).
How can I do that?
Thanks for your help !

The appropriate way will vary depending which XML library you are using to parse the data, but the essential concept is to plug in some configuration to your parser that intercepts requests to load a particular entity and redirects them to the local cached copy. For the SAX and DOM parsers of javax.xml.parsers this means an EntityResolver:
EntityResolver resolver = new EntityResolver() {
public InputSource resolveEntity(String publicId, String systemId) {
if("ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML".equals(publicId)) {
return new InputSource("file:/C:/data/xml/entities/ISOEntities");
} else {
return null; // use the default resolution logic
}
}
};
You can pass that entity resolver to the XMLReader (for SAX) or the DocumentBuilder (for DOM) and it will load the ISO entities from your local copy. The same mechanism will work for any other XML library that uses SAX or DOM internally to do its parsing (e.g. JDOM, Dom4J, XOM, ...) if you can pass in a suitably-configured XMLReader with your custom entity resolver.
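As a rough sketch of the DOM case, the resolver can be attached with DocumentBuilder.setEntityResolver before parsing. The class name LocalEntityDemo and the method parseWithLocalEntities are invented for this example, and to keep it self-contained the "local copy" is an in-memory string holding a single entity declaration rather than the real file under C:\data\xml\entities.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class LocalEntityDemo {
    public static final String ISO_PUBLIC_ID =
            "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML";

    /** Parses xml, serving the ISO entity set from localEntities instead of the network. */
    public static String parseWithLocalEntities(String xml, final String localEntities)
            throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                if (ISO_PUBLIC_ID.equals(publicId)) {
                    // Stand-in for new InputSource("file:/C:/data/xml/entities/ISOEntities")
                    return new InputSource(new StringReader(localEntities));
                }
                return null; // use the default resolution logic
            }
        });
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The document still points at the original internet link.
        String xml = "<!DOCTYPE doc [\n"
                + "<!ENTITY % ISOEntities PUBLIC \"" + ISO_PUBLIC_ID + "\" "
                + "\"http://www.website.com/category/xml_schema/ISOEntities\">\n"
                + "%ISOEntities;\n"
                + "]>\n"
                + "<doc>caf&eacute;</doc>";
        // Hypothetical one-line stand-in for the downloaded entity file.
        String localEntities = "<!ENTITY eacute \"&#233;\">";
        System.out.println(parseWithLocalEntities(xml, localEntities));
    }
}
```

The document itself is never modified: the system identifier in the DOCTYPE keeps pointing at the website, and only the parser's lookup is redirected.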

Parsing xml without namespace

I have a problem that appears when I try to parse a String containing XML into an org.w3c.dom.Document.
Here is an example of an XML String that I'm trying to parse:
<enviNFe xmlns="http://www.portalfiscal.inf.br/nfe" versao="2.00">
<idLote>123</idLote>
<NFe xmlns="http://www.portalfiscal.inf.br/nfe">
...
</NFe>
</enviNFe>
The problem is that, after the String has been parsed by the following code:
private Document documentFactory(String xml) throws SAXException,
IOException, ParserConfigurationException, DocumentException, TransformerException {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
Document document = factory.newDocumentBuilder().parse(
new ByteArrayInputStream(xml.getBytes()));
return document;
}
the NFe tag loads without the namespace (xmlns="http://www.portalfiscal.inf.br/nfe").
I want to know why this happens and what I can do to solve it.
Any help would be great.
Thanks, and sorry for my English.
------EDIT----
For better understanding:
This XML will be signed right after parsing and then sent to a government server (Brazil).
After this, I make another request to this server to verify whether it was processed. If it was, I get a positive response, or an error message if anything went wrong.
The first problem I had was that the XML was reported as malformed. This happened because I was sending the XML without that namespace on the NFe tag.
To solve this, I added the namespace directly in the file, after the XML had been signed.
That problem was solved, but another one appeared: a signature mismatch,
because I sign the XML without the namespace and send it with the namespace.

From what I can put together from your various comments, I think you are misunderstanding how XML namespaces work. You indicate that you manually added the namespace to the NFe element. However, in your XML example, the NFe node already has that namespace.
In this xml:
<enviNFe xmlns="http://www.portalfiscal.inf.br/nfe" versao="2.00">
<idLote>123</idLote>
<NFe>
...
</NFe>
</enviNFe>
all of the nodes have the "http://www.portalfiscal.inf.br/nfe" namespace. By putting the xmlns="..." attribute on the parent node, the namespace is applied to that node and to all of the child nodes with the same prefix (in this case, no prefix).

It is returning the correct document. To test it you can just walk through your document.
doc.getFirstChild().getFirstChild().getNextSibling().getNextSibling().getNextSibling().getNamespaceURI();
Or try to get the tag by its name:
NodeList tags = doc.getElementsByTagNameNS("http://www.portalfiscal.inf.br/nfe", "NFe");
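To check this for yourself, something like the sketch below could be used. The class name NamespaceCheck and the method namespaceOfNFe are made up for the example; it parses the document with a namespace-aware factory, as in the question, and reports the namespace URI of the first NFe element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceCheck {
    static final String NFE_NS = "http://www.portalfiscal.inf.br/nfe";

    /** Returns the namespace URI of the first NFe element in that namespace, or null. */
    public static String namespaceOfNFe(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this, getNamespaceURI() returns null
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList tags = doc.getElementsByTagNameNS(NFE_NS, "NFe");
        return tags.getLength() == 0 ? null : tags.item(0).getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        // NFe repeats the default namespace declaration.
        String explicit = "<enviNFe xmlns=\"" + NFE_NS + "\" versao=\"2.00\">"
                + "<idLote>123</idLote><NFe xmlns=\"" + NFE_NS + "\"/></enviNFe>";
        // NFe inherits the default namespace from enviNFe.
        String inherited = "<enviNFe xmlns=\"" + NFE_NS + "\" versao=\"2.00\">"
                + "<idLote>123</idLote><NFe/></enviNFe>";
        System.out.println(namespaceOfNFe(explicit));
        System.out.println(namespaceOfNFe(inherited));
    }
}
```

Both documents yield the same namespace URI for NFe. The redundant xmlns attribute only matters at the text level, which is presumably why the signature changes when it is added after signing.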

Are there any advantages to using an XSLT stylesheet compared to manually parsing an XML file using a DOM parser

For one of our applications, I've written a utility that uses java's DOM parser. It basically takes an XML file, parses it and then processes the data using one of the following methods to actually retrieve the data.
getElementByTagName()
getElementAtIndex()
getFirstChild()
getNextSibling()
getTextContent()
Now I have to do the same thing again, but I am wondering whether it would be better to use an XSLT stylesheet. The organisation that sends us the XML file keeps changing their schema, which means we have to change our code to cater for these schema changes. I'm not very familiar with XSLT, so I'm trying to find out whether I'm better off using XSLT stylesheets rather than "manual parsing".
The reason XSLT stylesheets look attractive is that, I think, if the schema for the XML file changes I will only need to change the stylesheet. Is this correct?
The other thing I would like to know is which of the two (XSLT transformer or DOM parser) is better performance-wise. For the manual option, I just use the DOM parser to parse the XML file. How does the XSLT transformer actually parse the file? Does it add overhead compared to manually parsing the XML file? I ask because performance is important, given the nature of the data I will be processing.
Any advice?
Thanks
Edit
Basically, what I am currently doing is parsing an XML file and processing the values in some of the XML elements. I don't transform the XML file into any other format. I just extract some values, read a row from an Oracle database, and save a new row into a different table. The XML file I parse just contains reference values that I use to retrieve some data from the database.
Is XSLT not suitable in this scenario? Is there a better approach that I can use to avoid code changes if the schema changes?
Edit 2
Apologies for not being clear enough about what I am doing with the XML data. Basically, there is an XML file which contains some information. I extract this information from the XML file and use it to retrieve more information from a local database. The data in the XML file is more like reference keys for the data I need in the database. I then take the content I extracted from the XML file, plus the content I retrieved from the database using a specific key from the XML file, and save that data into another database table.
The problem is that I know how to write a DOM parser to extract the information I need from the XML file, but I was wondering whether using an XSLT stylesheet was a better option, as I wouldn't have to change the code if the schema changes.
Reading the responses below, it sounds like XSLT is only used for transforming an XML file into another XML file or some other format. Given that I don't intend to transform the XML file, there is probably no need to add the overhead of parsing the XSLT stylesheet as well as the XML file.

Transforming XML documents into other formats is XSLT's reason for being. You can use XSLT to output HTML, JSON, another XML document, or anything else you need. You don't specify what kind of output you want. If you're just grabbing the contents of a few elements, then maybe you won't want to bother with XSLT. For anything more, XSLT offers an elegant solution. This is primarily because XSLT understands the structure of the document it's working on. Its processing model is tree traversal and pattern matching, which is essentially what you're manually doing in Java.
You could use XSLT to transform your source data into the representation of your choice. Your code will always work on this structure. Then, when the organization you're working with changes the schema, you only have to change your XSLT to transform the new XML into your custom format. None of your other code needs to change. Why should your business logic care about the format of its source data?
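A minimal sketch of that idea, using the javax.xml.transform API that ships with the JDK, might look like this. The element names (order, ref, keys, key), the class name NormalizeWithXslt, and the inline stylesheet are all invented for illustration; in practice the stylesheet would live in its own file so it can be changed without recompiling.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class NormalizeWithXslt {
    // Hypothetical stylesheet: copies every <ref> value, wherever it sits in the
    // supplier's layout, into a flat <keys><key>...</key></keys> structure.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
          + "<xsl:template match=\"/\">"
          + "<keys><xsl:for-each select=\"//ref\">"
          + "<key><xsl:value-of select=\".\"/></key>"
          + "</xsl:for-each></keys>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies the stylesheet to sourceXml and returns the transformed XML as a String. */
    public static String normalize(String sourceXml, String stylesheet) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(sourceXml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up supplier document with reference keys at two different depths.
        String source = "<order><item><ref>A1</ref></item><ref>B2</ref></order>";
        System.out.println(normalize(source, STYLESHEET));
    }
}
```

The Java code that consumes the keys structure stays the same when the supplier moves the reference values around; only the select expression in the stylesheet needs to follow the new schema.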

You are right that XSLT's processing model based on a rule-based event-driven approach makes your code more resilient to changes in the schema.
Because it's a different processing model to the procedural/navigational approach that you use with DOM, there is a learning and familiarisation curve, which some people find frustrating; if you want to go this way, be patient, because it will be a while before the ideas click into place. Once you are there, it's much easier than DOM programming.
The performance of a good XSLT processor will be good enough for your needs. It's of course possible to write very inefficient code, just as it is in any language, but I've rarely seen a system where XSLT was the bottleneck. Very often the XML parsing takes longer than the XSLT processing (and that's the same cost as with DOM or JAXB or anything else.)
As others have said, a lot depends on what you want to do with the XML data, which you haven't really explained.

I think that what you need is actually an XPath expression. You could configure that expression in some property file or whatever you use to retrieve your setup parameters.
In this way, you'd just change the XPath expression whenever your customer hides away the info you use in yet another place.
Basically, XSLT is overkill here; you just need XPath. A single XPath expression lets you home in on each value you are after.
Update
Since we are now talking about JDK 1.4 I've included below 3 different ways of fetching text in an XML file using XPath. (as simple as possible, no NPE guard fluff I'm afraid ;-)
Starting from the most up to date.
0. First the sample XML config file
<?xml version="1.0" encoding="UTF-8"?>
<config>
<param id="MaxThread" desc="MaxThread" type="int">250</param>
<param id="rTmo" desc="RespTimeout (ms)" type="int">5000</param>
</config>
1. Using JAXP 1.3 standard part of Java SE 5.0
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
public class TestXPath {
private static final String CFG_FILE = "test.xml" ;
private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";
public static void main(String[] args) {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
docFactory.setNamespaceAware(true);
DocumentBuilder builder;
try {
builder = docFactory.newDocumentBuilder();
Document doc = builder.parse(CFG_FILE);
XPathExpression expr = XPathFactory.newInstance().newXPath().compile(XPATH_FOR_PRM_MaxThread);
Object result = expr.evaluate(doc, XPathConstants.NUMBER);
if ( result instanceof Double ) {
System.out.println( ((Double)result).intValue() );
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
2. Using JAXP 1.2 standard part of Java SE 1.4-2
import javax.xml.parsers.*;
import org.apache.xpath.XPathAPI;
import org.w3c.dom.*;
public class TestXPath {
private static final String CFG_FILE = "test.xml" ;
private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";
public static void main(String[] args) {
try {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
docFactory.setNamespaceAware(true);
DocumentBuilder builder = docFactory.newDocumentBuilder();
Document doc = builder.parse(CFG_FILE);
Node param = XPathAPI.selectSingleNode( doc, XPATH_FOR_PRM_MaxThread );
if ( param instanceof Text ) {
System.out.println( Integer.decode(((Text)(param)).getNodeValue() ) );
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
3. Using JAXP 1.1 standard part of Java SE 1.4 + jdom + jaxen
You need to add these 2 jars (available from www.jdom.org - binaries, jaxen is included).
import java.io.File;
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import org.jdom.xpath.XPath;
public class TestXPath {
private static final String CFG_FILE = "test.xml" ;
private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";
public static void main(String[] args) {
try {
SAXBuilder sxb = new SAXBuilder();
Document doc = sxb.build(new File(CFG_FILE));
Element root = doc.getRootElement();
XPath xpath = XPath.newInstance(XPATH_FOR_PRM_MaxThread);
Text param = (Text) xpath.selectSingleNode(root);
Integer maxThread = Integer.decode( param.getText() );
System.out.println( maxThread );
} catch (Exception e) {
e.printStackTrace();
}
}
}

Since performance is important, I would suggest using a SAX parser for this. JAXB will give you roughly the same performance as DOM parsing, and it will be much easier to use and more maintainable. Schema changes should not affect you badly either if you are using JAXB: just get the new schema and regenerate the classes. If you have a bridge between the JAXB classes and your domain logic, the changes can be absorbed in that layer without the rest of the code worrying about XML. I prefer treating XML as just a message that is used in the messaging layer. All the application code should be agnostic of the XML schema.

Parsing XML file with preserving information about the line number

I am creating a tool that analyzes some XML files (XHTML files to be precise). The purpose of this tool is not only to validate the XML structure, but also to check the value of some attributes.
So I created my own org.xml.sax.helpers.DefaultHandler to handle events during the XML parsing. One of my requirements is to have the information about the current line number. So I decided to add a org.xml.sax.helpers.LocatorImpl to my own DefaultHandler. This solves almost all my problems, except one regarding the XML attributes.
Let's take an example:
<rootNode>
<foo att1="val1"/>
<bar att2="val2"
answerToEverything="43"
att3="val3"/>
</rootNode>
One of my rules indicates that if the attribute answerToEverything is defined on the node bar, its value should not be different from 42.
When encountering such XML, my tool should detect an error. As I want to give a precise error message to the user, such as:
Error in file "foo.xhtml", line #4: answerToEverything only allow "42" as value.
my parser must be able to keep the line number during the parsing, even for attributes. If we consider the following implementation for my own DefaultHandler class:
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
System.out.println("Start element <" + qName + ">" + x());
for (int i = 0; i < attributes.getLength(); i++) {
System.out.println("Att '" + attributes.getQName(i) + "' = '" + attributes.getValue(i) + "' at " + locator.getLineNumber() + ":" + locator.getColumnNumber());
}
}
then for the node <bar>, it will display the following output:
Start element <bar> at 5:23
Att 'att2' = 'val2' at 5:23
Att 'answerToEverything' = '43' at 5:23
Att 'att3' = 'val3' at 5:23
As you can see, the line number is wrong because the parser will consider the whole node, including its attributes as one block.
Ideally, if the interface ContentHandler would have defined the startAttribute and startElementBeforeReadingAttributes methods, I wouldn't have any problem here :o)
So my question is how can I solve my problem?
For information, I am using Java 6
ps: Maybe another title for this question could be Java SAX parsing with attributes parsing events, or something like that...

I think the only way to implement this is to create your own InputStream (or Reader) that counts lines and somehow communicates with your SAX handler. I have not tried to implement this myself, but I believe it is possible. Good luck, and I would be glad if you manage to do this and post your results here.

Look for an open source XML editor, its parser might have this information.
Editors don't use the same kind of parser that an application that just uses xml for data would use. Editors need more information, like you say line numbers and I would also think information about whitespace characters. A parser for an editor should not lose any information about characters in the file. That is the way you can implement for example a format function or "select enclosing element" (Alt-Shift-Up in Eclipse).

In both XmlBeans and JAXB it is possible to preserve line number information. You could consider using one of these tools (it is easier in XmlBeans).
