Edit the link of an XML entity with Java

I am trying to change the link of an entity in an XML file with Java.
The original link is an internet URL, and I would like to redirect it to a local path while the document is being parsed.
I will download the content available at this link.
This is the original kind of link:
<!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML" "http://www.website.com/category/xml_schema/ISOEntities">
This is the result I would like to have:
<!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML" "C:\data\xml\entities\ISOEntities">
So, when the original link is detected, I would like to load the ISOEntities data from the local path instead of the internet link, without changing the original link (I will not write to the file to change the link).
How can I do that?
Thanks for your help!

The appropriate way will vary depending on which XML library you are using to parse the data, but the essential concept is to plug some configuration into your parser that intercepts requests to load a particular entity and redirects them to the local cached copy. For the SAX and DOM parsers of javax.xml.parsers this means an EntityResolver:
EntityResolver resolver = new EntityResolver() {
    public InputSource resolveEntity(String publicId, String systemId) {
        // Match on the public identifier so the system URL in the document can stay unchanged
        if ("ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML".equals(publicId)) {
            return new InputSource("file:/C:/data/xml/entities/ISOEntities");
        } else {
            return null; // use the default resolution logic
        }
    }
};
You can pass that entity resolver to the XMLReader (for SAX) or the DocumentBuilder (for DOM) and it will load the ISO entities from your local copy. The same mechanism will work for any other XML library that uses SAX or DOM internally to do its parsing (e.g. JDOM, Dom4J, XOM, ...) if you can pass in a suitably-configured XMLReader with your custom entity resolver.
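To make the wiring concrete, here is a minimal sketch for the DOM case. It is only an illustration: the class name, the in-memory LOCAL_COPY string (standing in for the local ISOEntities file) and the made-up &hello; entity are assumptions, not part of the original setup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalEntityDemo {
    static final String PUBLIC_ID =
        "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML";

    // Hypothetical stand-in for the content of the local ISOEntities copy.
    static final String LOCAL_COPY = "<!ENTITY hello \"world\">";

    // Sample document that still points at the internet URL.
    static final String XML =
        "<!DOCTYPE doc [\n"
        + "<!ENTITY % ISOEntities PUBLIC \"" + PUBLIC_ID + "\"\n"
        + "  \"http://www.website.com/category/xml_schema/ISOEntities\">\n"
        + "%ISOEntities;\n"
        + "]>\n"
        + "<doc>&hello;</doc>";

    static String parseWithLocalEntities(String xml) {
        // Redirect the known public ID to the local copy; null means default resolution.
        EntityResolver resolver = (publicId, systemId) ->
            PUBLIC_ID.equals(publicId)
                ? new InputSource(new StringReader(LOCAL_COPY))
                : null;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(resolver);
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithLocalEntities(XML));
    }
}
```

In a real setup the resolver would return an InputSource for the file on disk, as in the snippet above, rather than an in-memory string.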

Related

Which is the best way to fetch values from XML: JAXB or DOM?

Which is the more efficient way of reading XML? I'm aware of two ways:
1) JAXB:
By annotating my classes with JAXB annotations, XML can be converted to Java objects and vice versa by marshalling and unmarshalling.
2) DOM:
Using a DOM parser to parse the XML, values can then be accessed with XPath.
Example of DOM:
File fXmlFile = new File("/Users/link1/input.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
Because of business demands, I need to pick the faster and better of the two approaches above. Suggestions and a few tactics would be appreciated.

First question to ask: does your XML always have the same structure, and can this structure be mapped onto a hierarchy of Java objects?
1. If yes -> use either JAXB or Jackson XmlMapper.
2. If no (the structure of your XML varies) -> do you require random access to the data in your XML, with many reads and possibly some writes (after which you convert the data back to XML)?
2.1. If yes -> use DOM (it is designed for in-memory handling of the XML document tree, but has more overhead).
2.2. If no (more efficient XML parsing) -> do you need to parse all information in the XML, or do you need XML validation?
2.2.1. If yes -> use SAX (it is included in the JDK and allows for validation).
2.2.2. If no -> use StAX (it is an XML pull parser that allows reading some values in the XML without having to parse the full XML, but it does not offer validation).
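As an illustration of the last branch, here is a minimal StAX sketch that pulls the first value of a given element and stops reading as soon as it is found. The class name, method name and the sample document are made up for the example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxFirstValue {
    // Returns the text of the first element with the given local name, or null if absent.
    static String firstText(String xml, String elementName) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        // Stop pulling events here; the rest of the document is never read.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<order><id>7</id><items><item>a</item></items></order>";
        System.out.println(firstText(xml, "id"));
    }
}
```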

Resolving which version of an XML Schema to use for XML documents with a version attribute

I have to write some code to handle reading and validating XML documents that use a version attribute in their root element to declare a version number, like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<Junk xmlns="urn:com:initech:tps"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:com:initech.tps:schemas/foo/Junk.xsd"
VersionAttribute="2.0">
There are a bunch of nested schemas, and my code has an org.w3c.dom.ls.LSResourceResolver to figure out which schema to use, implementing this method:
LSInput resolveResource(String type,
String namespaceURI,
String publicId,
String systemId,
String baseURI)
Previous versions of the schema embedded the schema version in the namespace, so I could use the namespaceURI and systemId to decide which schema to provide. Now the version number has been moved to an attribute in the root element, and my resolver doesn't have access to that. How am I supposed to figure out the version of the XML document in the LSResourceResolver?

I had never had to deal with schema versions before this and had no idea what was involved. When the version was part of the namespace, I could throw all the schemas in together and let them get sorted out, but with the version in the root element and the namespace shared across versions, there is no getting around reading the version information from the XML before starting the SAX parsing.
I'm going to do something very similar to what Pangea suggested (gets +1 from me), but I can't follow the advice exactly because the document is too big to read into memory, even once. By using StAX I can minimize the amount of work done to get the version from the file. See this developerWorks article, "Screen XML documents efficiently with StAX":
The screening or classification of XML documents is a common problem,
especially in XML middleware. Routing XML documents to specific
processors may require analysis of both the document type and the
document content. The problem here is obtaining the required
information from the document with the least possible overhead.
Traditional parsers such as DOM or SAX are not well suited to this
task. DOM, for example, parses the whole document and constructs a
complete document tree in memory before it returns control to the
client. Even DOM parsers that employ deferred node expansion, and thus
are able to parse a document partially, have high resource demands
because the document tree must be at least partially constructed in
memory. This is simply not acceptable for screening purposes.
The code to get the version information will look like:
def map = [:]
def startElementCount = 0
def inputStream = new File(inputFile).newInputStream()
try {
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader(inputStream)
    for (int event; (event = reader.next()) != XMLStreamConstants.END_DOCUMENT;) {
        if (event == XMLStreamConstants.START_ELEMENT) {
            // Only the root element is inspected; bail out at the second start tag
            if (startElementCount > 0) return map
            startElementCount += 1
            map.rootElementName = reader.localName
            for (int i = 0; i < reader.attributeCount; i++) {
                if (reader.getAttributeName(i).toString() == 'VersionAttribute') {
                    map.versionIdentifier = reader.getAttributeValue(i).toString()
                    return map
                }
            }
        }
    }
} finally {
    inputStream.close()
}
Then I can use the version information to figure out what resolver to use and what schema documents to set on the SaxFactory.

My Suggestion
1. Parse the document using SAX or DOM.
2. Get the version attribute.
3. Use the Validator.validate(Source) method and pass it the already parsed Document (from step 1), as shown below.
Building a DOMSource from the parsed document:
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new File(args[0]));
DOMSource domSource = new DOMSource(document);
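Putting the three steps together, a rough sketch might look like the following. The two inline schemas, the class name and the rule "2.0 selects the second schema" are purely hypothetical stand-ins for the real nested schemas; it also assumes the document is small enough to hold as a DOM tree.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class VersionedValidation {
    // Hypothetical version 1 schema: <Junk> has no children.
    static final String XSD_V1 =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"Junk\"><xs:complexType>"
        + "<xs:attribute name=\"VersionAttribute\" type=\"xs:string\" use=\"required\"/>"
        + "</xs:complexType></xs:element></xs:schema>";

    // Hypothetical version 2 schema: <Junk> may contain an optional <Extra> child.
    static final String XSD_V2 =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"Junk\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"Extra\" minOccurs=\"0\"/>"
        + "</xs:sequence>"
        + "<xs:attribute name=\"VersionAttribute\" type=\"xs:string\" use=\"required\"/>"
        + "</xs:complexType></xs:element></xs:schema>";

    static boolean isValid(String xml) {
        try {
            // Step 1: parse once into a DOM tree.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document =
                factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            // Step 2: read the version from the root element and pick a schema.
            String version = document.getDocumentElement().getAttribute("VersionAttribute");
            String xsd = "2.0".equals(version) ? XSD_V2 : XSD_V1;
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));

            // Step 3: validate the already parsed tree instead of re-reading the input.
            try {
                schema.newValidator().validate(new DOMSource(document));
                return true;
            } catch (SAXException invalid) {
                return false;
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or load schema", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<Junk VersionAttribute=\"2.0\"><Extra/></Junk>"));
    }
}
```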

Are there any advantages to using an XSLT stylesheet compared to manually parsing an XML file using a DOM parser

For one of our applications, I've written a utility that uses Java's DOM parser. It takes an XML file, parses it, and then retrieves the data using the following methods:
getElementsByTagName()
getElementAtIndex()
getFirstChild()
getNextSibling()
getTextContent()
Now I have to do the same thing again, but I am wondering whether it would be better to use an XSLT stylesheet. The organisation that sends us the XML file keeps changing their schema, meaning that we have to change our code to cater for these schema changes. I'm not very familiar with the XSLT process, so I'm trying to find out whether I'm better off using XSLT stylesheets rather than "manual parsing".
The reason XSLT stylesheets look attractive is that I think that if the schema for the XML file changes, I will only need to change the stylesheet. Is this correct?
The other thing I would like to know is which of the two (XSLT transformer or DOM parser) is better performance-wise. For the manual option, I just use the DOM parser to parse the XML file. How does the XSLT transformer actually parse the file? Does it add overhead compared to manually parsing the XML file? I ask because performance is important, given the nature of the data I will be processing.
Any advice?
Thanks
Edit
Basically, what I am currently doing is parsing an XML file and processing the values in some of the XML elements. I don't transform the XML file into any other format. I just extract some values, fetch a row from an Oracle database, and save a new row into a different table. The XML file I parse just contains reference values I use to retrieve some data from the database.
Is XSLT not suitable in this scenario? Is there a better approach that I can use to avoid code changes if the schema changes?
Edit 2
Apologies for not being clear enough about what I am doing with the XML data. There is an XML file which contains some information. I extract this information from the XML file and use it to retrieve more information from a local database. The data in the XML file acts as reference keys for the data I need in the database. I then take the content I extracted from the XML file, plus the content I retrieved from the database using a specific key from the XML file, and save that data into another database table.
The problem I have is that I know how to write a DOM parser to extract the information I need from the XML file, but I was wondering whether using an XSLT stylesheet would be a better option, as I wouldn't have to change the code if the schema changes.
Reading the responses below, it sounds like XSLT is only used for transforming an XML file into another XML file or some other format. Given that I don't intend to transform the XML file, there is probably no need to add the overhead of parsing the XSLT stylesheet as well as the XML file.

Transforming XML documents into other formats is XSLT's reason for being. You can use XSLT to output HTML, JSON, another XML document, or anything else you need. You don't specify what kind of output you want. If you're just grabbing the contents of a few elements, then maybe you won't want to bother with XSLT. For anything more, XSLT offers an elegant solution. This is primarily because XSLT understands the structure of the document it's working on. Its processing model is tree traversal and pattern matching, which is essentially what you're manually doing in Java.
You could use XSLT to transform your source data into the representation of your choice. Your code will always work on this structure. Then, when the organization you're working with changes the schema, you only have to change your XSLT to transform the new XML into your custom format. None of your other code needs to change. Why should your business logic care about the format of its source data?
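As a rough sketch of that idea with the JDK's built-in transformer: the stylesheet below maps a sender-specific element onto an internal one, so only the stylesheet would need to change when the sender's schema does. The element names (order, ref, key) and the class name are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class CanonicalFormat {
    // Hypothetical stylesheet: maps the sender's /order/ref onto our own <key> element.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/\">"
        + "<key><xsl:value-of select=\"/order/ref\"/></key>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String toCanonical(String sourceXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(sourceXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toCanonical("<order><ref>42</ref></order>"));
    }
}
```

The business logic would then read only the internal format, whatever the incoming schema looks like.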

You are right that XSLT's processing model based on a rule-based event-driven approach makes your code more resilient to changes in the schema.
Because it's a different processing model to the procedural/navigational approach that you use with DOM, there is a learning and familiarisation curve, which some people find frustrating; if you want to go this way, be patient, because it will be a while before the ideas click into place. Once you are there, it's much easier than DOM programming.
The performance of a good XSLT processor will be good enough for your needs. It's of course possible to write very inefficient code, just as it is in any language, but I've rarely seen a system where XSLT was the bottleneck. Very often the XML parsing takes longer than the XSLT processing (and that's the same cost as with DOM or JAXB or anything else.)
As others have said, a lot depends on what you want to do with the XML data, which you haven't really explained.

I think that what you need is actually an XPath expression. You could configure that expression in a property file or wherever you keep your setup parameters.
That way, you'd just change the XPath expression whenever your customer hides the info you use in yet another place.
Basically, XSLT is overkill here; you just need an XPath expression. A single XPath expression will let you home in on each value you are after.
Update
Since we are now talking about JDK 1.4 I've included below 3 different ways of fetching text in an XML file using XPath. (as simple as possible, no NPE guard fluff I'm afraid ;-)
Starting from the most up to date.
0. First the sample XML config file
<?xml version="1.0" encoding="UTF-8"?>
<config>
<param id="MaxThread" desc="MaxThread" type="int">250</param>
<param id="rTmo" desc="RespTimeout (ms)" type="int">5000</param>
</config>
1. Using JAXP 1.3, a standard part of Java SE 5.0
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.Document;

public class TestXPath {
    private static final String CFG_FILE = "test.xml";
    // @id selects the id attribute of <param>
    private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";

    public static void main(String[] args) {
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        docFactory.setNamespaceAware(true);
        DocumentBuilder builder;
        try {
            builder = docFactory.newDocumentBuilder();
            Document doc = builder.parse(CFG_FILE);
            XPathExpression expr = XPathFactory.newInstance().newXPath().compile(XPATH_FOR_PRM_MaxThread);
            Object result = expr.evaluate(doc, XPathConstants.NUMBER);
            if (result instanceof Double) {
                System.out.println(((Double) result).intValue());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
2. Using JAXP 1.2, a standard part of Java SE 1.4-2
import javax.xml.parsers.*;
import org.apache.xpath.XPathAPI;
import org.w3c.dom.*;

public class TestXPath {
    private static final String CFG_FILE = "test.xml";
    // @id selects the id attribute of <param>
    private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";

    public static void main(String[] args) {
        try {
            DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
            docFactory.setNamespaceAware(true);
            DocumentBuilder builder = docFactory.newDocumentBuilder();
            Document doc = builder.parse(CFG_FILE);
            Node param = XPathAPI.selectSingleNode(doc, XPATH_FOR_PRM_MaxThread);
            if (param instanceof Text) {
                System.out.println(Integer.decode(((Text) param).getNodeValue()));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
3. Using JAXP 1.1, a standard part of Java SE 1.4, plus jdom and jaxen
You need to add these 2 jars (available from www.jdom.org - binaries, jaxen is included).
import java.io.File;
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import org.jdom.xpath.XPath;

public class TestXPath {
    private static final String CFG_FILE = "test.xml";
    // @id selects the id attribute of <param>
    private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";

    public static void main(String[] args) {
        try {
            SAXBuilder sxb = new SAXBuilder();
            Document doc = sxb.build(new File(CFG_FILE));
            Element root = doc.getRootElement();
            XPath xpath = XPath.newInstance(XPATH_FOR_PRM_MaxThread);
            Text param = (Text) xpath.selectSingleNode(root);
            Integer maxThread = Integer.decode(param.getText());
            System.out.println(maxThread);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Since performance is important, I would suggest using a SAX parser for this. JAXB will give you roughly the same performance as DOM parsing, plus it will be much easier to use and more maintainable. Schema changes should not affect you badly either if you are using JAXB: just get the new schema and regenerate the classes. If you have a bridge between JAXB and your domain logic, the changes can be absorbed in that layer without worrying about XML. I prefer treating XML as just a message that is used in the messaging layer. All the application code should be agnostic of the XML schema.

Error accessing w3.org when applying an XSLT

I'm applying an XSLT to an HTML file (already filtered and tidied to make it parseable as XML).
My code looks like this:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
this.xslt = transformerFactory.newTransformer(xsltSource);
xslt.transform(sanitizedXHTML, result);
However, I receive an error like this for every DOCTYPE found:
ERROR: 'Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/loose.dtd'
I have no issue accessing the DTDs from my browser.
I have little control over the HTML being parsed, and I can't strip the DOCTYPE declarations since I need them for entities.
Any help is welcome.
EDIT:
I tried to disable DTD validation like this:
private Source getSource(StreamSource sanitizedXHTML) throws ParsingException {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setNamespaceAware(false);
    spf.setValidating(false); // Turn off validation
    XMLReader rdr;
    try {
        rdr = spf.newSAXParser().getXMLReader();
    } catch (SAXException e) {
        throw new ParsingException(e);
    } catch (ParserConfigurationException e) {
        throw new ParsingException(e);
    }
    InputSource inputSrc = new InputSource(sanitizedXHTML.getInputStream());
    return new SAXSource(rdr, inputSrc);
}
and then just calling it...
Source source = getSource(sanitizedXHTML);
xslt.transform(source, result);
The error persists.
EDIT 2:
I wrote an entity resolver and put the HTML 4.01 Transitional DTD on my local disk. However, I now get this error:
ERROR: 'The declaration for the entity "HTML.Version" must end with '>'.'
The DTD is as is, downloaded from w3.org.

I have some suggestions in an answer to a related question.
In particular, when parsing the XML document, you might want to turn DTD validation off, to prevent the parser from trying to fetch the DTD. Alternatively, you might use your own entity resolver to return a local copy of the DTD instead of fetching it over the network.
Edit: Just calling setValidating(false) on the SAX Parser Factory might not be enough to prevent the parser from loading the external DTD. The parser may need the DTD for other purposes, such as entity definitions. (Perhaps you could change your HTML sanitization/preprocessing phase to replace all entity references with the equivalent numeric character entity references, eliminating the need for the DTD?)
I don't think there is a standard SAX feature flag which would ensure that external DTD loading is completely disabled, so you might have to use something specific to your parser. So if you are using Xerces, for example, you might want to look up Xerces-specific features and call setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false) just to be sure.
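A minimal sketch of that last suggestion follows. It assumes a Xerces-based parser (the one bundled with the JDK is derived from Xerces); other parsers may reject the feature name. The class and method names are invented, and note that entities defined only in the DTD will not be expanded when it is not loaded.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NoExternalDtd {
    // Xerces-specific feature; an assumption about the parser in use.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Parses the document without fetching its DTD and counts the elements seen.
    static int countElements(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(false);
            spf.setFeature(LOAD_EXTERNAL_DTD, false);
            final int[] count = {0};
            spf.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        count[0]++;
                    }
                });
            return count[0];
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\""
            + " \"http://www.w3.org/TR/html4/loose.dtd\">"
            + "<html><body><p>hi</p></body></html>";
        System.out.println(countElements(xml));
    }
}
```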

Assuming you want the DTD loaded (for your entities), you will need to use a resolver. The basic problem that you are encountering is that the W3C limits access to the URLs for the DTDs for performance reasons (they don't get any performance if they don't).
Now you should be working with a local copy of the DTD and using a catalog to handle this. You should take a look at the Apache Commons Resolver. If you don't know how to use a catalog, they're well documented in Norm Walsh's article.
Of course, you will have problems if you do validate. That's an SGML DTD and you are trying to use it for XML. This will (probably) not work.
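For the plain resolver route (without a catalog), a sketch of how the resolver could be attached to the reader behind a SAXSource is shown below. Since the HTML 4.01 DTD is SGML, the example assumes an XML-compatible stand-in that only declares the entities needed; STUB_DTD, the class name and the matching on "loose.dtd" are all assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class LocalDtdTransform {
    // Hypothetical XML-compatible replacement for the SGML DTD: entity declarations only.
    static final String STUB_DTD = "<!ENTITY nbsp \"&#160;\">";

    // Runs an identity transform and returns the text content of the result.
    static String transformedText(String xhtml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Serve the DTD locally instead of fetching it from w3.org.
            reader.setEntityResolver((publicId, systemId) ->
                systemId != null && systemId.endsWith("loose.dtd")
                    ? new InputSource(new StringReader(STUB_DTD))
                    : null);
            SAXSource source = new SAXSource(reader, new InputSource(new StringReader(xhtml)));
            DOMResult result = new DOMResult();
            // A stylesheet-backed Transformer would be used here in the real code.
            TransformerFactory.newInstance().newTransformer().transform(source, result);
            return ((Document) result.getNode()).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\""
            + " \"http://www.w3.org/TR/html4/loose.dtd\">"
            + "<html><body>a&nbsp;b</body></html>";
        System.out.println(transformedText(xhtml));
    }
}
```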

How to read an XML file with Java?

I don't need to read complex XML files. I just want to read the following configuration file with the simplest possible XML reader:
<config>
<db-host>localhost</db-host>
<db-port>3306</db-port>
<db-username>root</db-username>
<db-password>root</db-password>
<db-name>cash</db-name>
</config>
How can I read the above XML file with an XML reader in Java?

I like JDOM:
SAXBuilder parser = new SAXBuilder();
Document docConfig = parser.build("config.xml");
Element elConfig = docConfig.getRootElement();
// The element in the sample config is named db-host
String host = elConfig.getChildText("db-host");

Since you want to parse config files, I think commons-configuration would be the best solution.
Commons Configuration provides a generic configuration interface which enables a Java application to read configuration data from a variety of sources (including XML)

You could use a simple DOM parser to read the XML representation.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document dom = db.parse("config.xml");
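Once the document is parsed, individual settings can be read by tag name. A small sketch, using an in-memory copy of the config from the question (the class and method names are made up):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomConfigReader {
    static final String CONFIG =
        "<config>"
        + "<db-host>localhost</db-host>"
        + "<db-port>3306</db-port>"
        + "<db-username>root</db-username>"
        + "<db-password>root</db-password>"
        + "<db-name>cash</db-name>"
        + "</config>";

    // Returns the text of the first element with the given tag name, or null if absent.
    static String read(String xml, String tag) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = dom.getElementsByTagName(tag);
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read config", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(CONFIG, "db-host") + ":" + read(CONFIG, "db-port"));
    }
}
```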

If you just need a simple solution that's included with the Java SDK (since 5.0), check out the XPath package. I'm sure others perform better, but this was all I needed. Here's an example:
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;
...
try {
    XPath xpath = XPathFactory.newInstance().newXPath();
    InputSource inputSource = new InputSource("strings.xml");
    // result will equal "Save My Changes" (see XML below)
    String result = xpath.evaluate("//string", inputSource);
}
catch (XPathExpressionException e) {
    // do something
}
strings.xml:
<?xml version="1.0" encoding="utf-8"?>
<resources>
<string name="saveLabel">Save My Changes</string>
</resources>

There are several XML parsers for Java. One I've used and found particularly developer friendly is JDOM. And by developer friendly, I mean "java oriented" (i.e., you work with objects in your program), instead of "document oriented", as some other tools are.

I would recommend Commons Digester, which allows you to parse a file without writing reams of code. It uses a series of rules to determine what action it should perform when encountering a given element or attribute (a typical rule might be to create a particular business object).

For a similar use case in my application I used JAXB. With JAXB, reading XML files is like interacting with Java POJOs. But to use JAXB you need to have the XSD for this XML file.

If you want to be able to read and write objects to XML directly, you can use XStream.

I have not tried XPath yet, as it has only just come to my attention, but I have tried a few other solutions and have not found anything that works for this scenario.
I decided to write a library that fulfills this need, as long as you are working under the assumptions mentioned in the readme. It uses SAX to parse the entire file and returns it to the user in the form of a map, so you can look up values as key -> value.
https://github.com/daaso-consultancy/ConciseXMLParser
If something is missing, kindly let me know, as I only develop it based on the needs of others and myself.
