Dynamic XML Shredding in Java Without Using a Database


Is there a "standardized" way (i.e., code pattern, or, even better, open source library) in Java for dynamically flattening ("shredding") a hierarchical XML file, of large size and unknown structure, with output not redirected to an RDBMS but directly accessible?
I am looking at a transformation like the one mentioned in this question, but all the code examples I have seen use some SQL command to inject the flattened XML input to a database table, via an RDBMS (e.g., MySQL).
What I would like to do is progressively extract the XML data into a string, or, at least, into a text file, which could be post-processed afterwards, without going through any RDBMS.
EDIT:
After working further on the issue, there are a couple of solutions using XSLT (including a fully parameterizable one) in this question.

You could do it with JDOM (see the example below; jdom.jar has to be on the classpath). But beware: the whole DOM is held in memory. If the XML is large, you should use XSLT or a SAX parser instead.
import java.io.IOException;
import java.io.StringReader;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
import org.junit.Test;
public class JDomFlatten {
@Test
public void testFlatten() {
final String xml = "<grandparent name=\"grandpa bob\">"//
+ "<parent name=\"papa john\">"//
+ "<children>"//
+ "<child name=\"mark\" />"//
+ "<child name=\"cindy\" />"//
+ "</children>"//
+ "</parent>"//
+ "<parent name=\"papa henry\">"//
+ "<children>" //
+ "<child name=\"mary\" />"//
+ "</children>"//
+ "</parent>" //
+ "</grandparent>";
final StringReader stringReader = new StringReader(xml);
final SAXBuilder builder = new SAXBuilder();
try {
final Document document = builder.build(stringReader);
final Element grandparentElement = document.getRootElement();
final StringBuilder outString = new StringBuilder();
for (final Object parentElementObject : grandparentElement.getChildren()) {
final Element parentElement = (Element) parentElementObject;
for (final Object childrenElementObject : parentElement.getChildren()) {
final Element childrenElement = (Element) childrenElementObject;
for (final Object childElementObject : childrenElement.getChildren()) {
final Element childElement = (Element) childElementObject;
outString.append(grandparentElement.getAttributeValue("name"));
outString.append(" ");
outString.append(parentElement.getAttributeValue("name"));
outString.append(" ");
outString.append(childElement.getAttributeValue("name"));
outString.append("\n");
}
}
}
System.out.println(outString);
} catch (final JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (final IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
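For input that is too large for an in-memory DOM, or whose structure is not known in advance, the SAX route mentioned above can be sketched roughly as follows. This is only an illustration, not a standard library: the class name SaxFlatten, the method flatten and the "path=value" row format are assumptions made for this example. It keeps just the current element path in memory and emits one row per attribute and per non-blank text node.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxFlatten {

    /** Returns one "path=value" row per attribute and per non-blank text node. */
    public static List<String> flatten(InputSource source) throws Exception {
        final List<String> rows = new ArrayList<>();
        final Deque<String> path = new ArrayDeque<>();   // current element path, innermost first
        final StringBuilder text = new StringBuilder();  // text collected since the last tag

        DefaultHandler handler = new DefaultHandler() {
            private String currentPath() {
                StringBuilder sb = new StringBuilder();
                // descendingIterator walks from the root element to the current one
                for (Iterator<String> it = path.descendingIterator(); it.hasNext();) {
                    sb.append('/').append(it.next());
                }
                return sb.toString();
            }

            private void flushText() {
                String value = text.toString().trim();
                if (!value.isEmpty()) {
                    rows.add(currentPath() + "=" + value);
                }
                text.setLength(0);
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                flushText();              // text seen so far belongs to the parent element
                path.push(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    rows.add(currentPath() + "/@" + atts.getQName(i) + "=" + atts.getValue(i));
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);  // may be called several times per text node
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flushText();
                path.pop();
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<grandparent name=\"grandpa bob\"><parent name=\"papa john\">"
                + "<children><child name=\"mark\"/></children></parent></grandparent>";
        for (String row : flatten(new InputSource(new StringReader(xml)))) {
            System.out.println(row);
        }
    }
}
```

For a really large file the rows would be written to a Writer as they are produced instead of being collected in a list; the list is used here only to keep the sketch short.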

Related

XML / Java: Precise line and character positions whilst parsing tags and attributes?

I’m trying to find a way to precisely determine the line number and character position of both tags and attributes whilst parsing an XML document. I want to do this so that I can report accurately to the author of the XML document (via a web interface) where the document is invalid.
Ultimately I want to set the caret in a to be at the invalid tag or just inside the open quote of the invalid attribute. (I’m not using XML Schema at this point because the exact format of the attributes matters in a way that cannot be validated by schema alone. I may even want to report some attributes as being invalid part-way through the attribute’s value. Or similarly, part-way through the text between a start and end tag.)
I’ve tried using SAX (org.xml.sax) and the Locator interface. This works up to a point but isn’t nearly good enough. It will only report the read position after an event; for example, the character immediately after an open tag ends, for startElement(). I can’t just subtract back the length of the tag name because attributes, self-closing tags and/or newlines within the open tag will throw this out. (And Locator provides no information about the position of attributes at all.)
Ideally I was looking to use an event-based approach, as I already have a SAX handler that is building an in-house DOM-like representation for further processing. However, I would be interested in knowing about any DOM or DOM-like library that includes exact position information for the model’s elements.
Has anyone solved this issue, or anything like it, with the required level of precision?

XML parsers will (and should) smooth over certain things like additional whitespace, so exact mapping back to the character stream is not feasible.
You should rather look into getting a lexer or 'token stream generator' for increased detail; in other words, go to the level of detail below XML parsers.
There are a few general frameworks for writing lexers in Java. This ANTLR 3-based page has a nice overview of lexers vs. parsers, and section one has some rudimentary XML lexer examples.
I'd also like to comment that for a user with a web interface, maybe you should consider a pure client-side (i.e. JavaScript) solution.

I wrote a quick SAX reader that gets the line numbers, throws an exception when it meets an unwanted attribute, and reports the text that precedes the point where the error was thrown.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Stack;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.log4j.Logger;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class LocatorTestSAXReader {
private static final Logger logger = Logger.getLogger(LocatorTestSAXReader.class);
private static final String XML_FILE_PATH = "lib/xml/test-instance1.xml";
public Document readXMLFile(){
Document doc = null;
SAXParser parser = null;
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
try {
parser = saxFactory.newSAXParser();
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
doc = docBuilder.newDocument();
} catch (ParserConfigurationException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
final StringBuilder text = new StringBuilder();
DefaultHandler eleHandler = new DefaultHandler(){
private Locator locator;
@Override
public void characters(char[] ch, int start, int length){
String thisText = new String(ch, start, length);
// keep only chunks that contain at least one letter
if(thisText.matches(".*[a-zA-Z]+.*")){
text.append(thisText);
logger.debug("element text: " + thisText);
}
}
@Override
public void setDocumentLocator(Locator locator){
this.locator = locator;
}
@Override
public void startElement(final String uri, final String localName, final String qName,
final Attributes attributes)
throws SAXException {
int lineNum = locator.getLineNumber();
logger.debug("I am now on line " + lineNum + " at element " + qName);
int len = attributes.getLength();
for(int i=0;i<len;i++){
String attVal = attributes.getValue(i);
String attName = attributes.getQName(i);
logger.debug("att " + attName + "=" + attVal);
if(attName.startsWith("bad")){
throw new SAXException("found attr : " + attName + "=" + attVal + " that starts with bad! at line : " +
locator.getLineNumber() + " at element " + qName + "\nelement occurs below text : " + text);
}
}
}
};
try {
parser.parse(new FileInputStream(new File(XML_FILE_PATH)), eleHandler);
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return doc;
}
}
With regard to the text: depending on where in the XML file the error occurs, there may not be any text. So with this XML:
<?xml version="1.0"?>
<root>
<section>
<para>This is a quick doc to test the ability to get line numbers via the Locator object. </para>
</section>
<section bad:attr="ok">
<para>another para.</para>
</section>
</root>
If the bad attribute is in the first element, the text will be blank. In this case, the exception thrown was:
org.xml.sax.SAXException: found attr : bad:attr=ok that starts with bad! at line : 6 at element section
element occurs below text : This is a quick doc to test the ability to get line numbers via the Locator object.
When you say you tried using the Locator object, what exactly was the problem?
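To see what the Locator actually reports, a stripped-down variant of the reader above can be run against an in-memory document. This is a minimal sketch; the class name LocatorSketch and the method elementLines are made up for the illustration. It records the line number that the Locator returns when startElement() fires, which is the position the parser has reached after reading the whole start tag, not the position where the tag begins.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

public class LocatorSketch {

    /** Returns "qName@line" for every start tag, using the SAX Locator. */
    public static List<String> elementLines(String xml) throws Exception {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;   // handed over by the parser before parsing starts
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // getColumnNumber() is available in the same way
                found.add(qName + "@" + locator.getLineNumber());
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementLines("<root>\n<a/>\n<b x='1'/>\n</root>"));
    }
}
```

When every start tag sits on a single line this is enough to point a user at the right line. For a start tag that spans several lines, or for individual attributes, the Locator gives no finer detail, which is the limitation described in the question.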

Java program to get variable name from a place holder available in a XML file

Hope you all are good.
I have a requirement to pick the variable name out of a placeholder in an XML file.
The XML file contains placeholders, and each placeholder starts with a $ symbol.
My task is to find each placeholder and get the variable name from it.
For example, if the XML file has a placeholder like $Variable1, the code should return Variable1.
Following is the code that I am using:
public static String replaceConfigParam(String xmlFile, Object structure) {
for (String key : getConstants(xmlFile)) {
String actualKey = (new StringBuilder().append("$*").append(key).append("$")).toString();
try {
String value = BeanUtils.getProperty(structure, key);
if (value != null) {
xmlFile = xmlFile.replace(actualKey, value);
}
} catch (IllegalAccessException e) {
logger.error("failed to get the property from object " + e.getMessage());
} catch (InvocationTargetException e) {
logger.error("failed to get the property from object" + e.getMessage());
} catch (NoSuchMethodException e) {
logger.error("failed to get the property from object " + e.getMessage());
} catch (Exception e) {
logger.error("failed to get value from the property " + e.getMessage());
}
}
return xmlFile;
}
Following is the getConstants method:
private static List<String> getConstants(String domainConfig) {
String[] arr = domainConfig.split("\\$");
List<String> paramsExtracted = new ArrayList<String>();
for (String key : arr) {
paramsExtracted.add(key.replace("$", ""));
}
return paramsExtracted;
}
Following is the XML file that contains a $ placeholder; I need to extract the variable name from it:
<tunnel>
<units>
<entry name="tunnel.1">
<ip>
<entry name="$ABC"/>
</ip>
<interface-management-profile>mgt</interface-management-profile>
</entry>
</units>
</tunnel>
Thanks in advance.

Still unsure what you are actually asking. If you have a specific problem, you should describe the problem. Make it clear what you want to know, where your problems are and what you need help with. https://stackoverflow.com/help/how-to-ask
I am assuming your question is:
"How to extract the entry name property for all variables identified by a $-sign at the start".
You can do this with regular expressions, but since you are working with XML, you may use XPath together with an XML parser. See here:
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class ExtractVar {
static String xml = "<tunnel>" +
"<units>" +
"<entry name=\"tunnel.1\">"+
"<ip>"+
" <entry name=\"$ABC\"/>"+
" </ip>"+
" <interface-management-profile>mgt</interface-management-profile>"+
" </entry>"+
" </units>"+
"</tunnel>";
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, XPathExpressionException {
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = db.parse(new InputSource(new StringReader(xml)));
XPathFactory xpathFactory = XPathFactory.newInstance();
XPathExpression xpath = xpathFactory.newXPath().compile("//entry");
NodeList entryNodes = (NodeList) xpath.evaluate(document, XPathConstants.NODESET);
for(int i =0; i<entryNodes.getLength(); i++) {
Node n = entryNodes.item(i);
String nodeValue = n.getAttributes().getNamedItem("name").getNodeValue();
if(nodeValue.startsWith("$")) {
System.out.println(nodeValue.substring(1, nodeValue.length()));
}
}
}
}
The code above does the following:
1. Parses the document (your XML) into a DOM model.
2. Creates an XPath expression for the property you wish to analyse. In this case, you want all nodes named "entry" regardless of where they are in the document. This is achieved by putting // at the start of the XPath expression: //entry.
3. Retrieves the name attribute of each entry node and checks whether it starts with a dollar sign.
4. Prints the attribute value (without the $) if it does start with a $ sign.
The code then prints:
ABC
I hope this is what you were looking for.
Alternatively this can be achieved with pure regex capturing every String that is surrounded by quote characters and starts with a dollar sign, as follows:
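// needs java.util.regex.Pattern and java.util.regex.Matcher
// [^\"]* stops at the closing quote; a greedy .* could run on to a later quote in the same line
Pattern pattern = Pattern.compile("\"\\$([^\"]*)\"");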
Matcher matcher = pattern.matcher(xml);
while(matcher.find()) {
System.out.println(matcher.group(1));
}
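If placeholders may appear in attributes other than name, the filtering can also be pushed into the XPath expression itself. The following is a sketch under that assumption; the class name PlaceholderNames and the method find are hypothetical. The expression //@*[starts-with(., '$')] selects every attribute, on any element, whose value starts with a dollar sign.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PlaceholderNames {

    /** Returns the variable names (without the leading $) found in any attribute value. */
    public static List<String> find(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // XPath 1.0: all attribute nodes whose string value starts with '$'
        NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//@*[starts-with(., '$')]", doc, XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            names.add(attrs.item(i).getNodeValue().substring(1)); // drop the '$'
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tunnel><units><entry name=\"tunnel.1\"><ip><entry name=\"$ABC\"/></ip>"
                + "</entry></units></tunnel>";
        System.out.println(find(xml));
    }
}
```

Placeholders inside element text would need a second expression such as //text()[starts-with(., '$')]; that case is not covered by this sketch.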
Artur

SLOW SPEED in using SAX Parser to parse XML data and save it to mysql localhost (JAVA)

I am writing a Java program and have run into a performance problem.
I have to parse a big .rdf file (XML format), 1.60 GB in size,
and then insert the parsed data into a MySQL server on localhost.
After googling, I decided to use a SAX parser in my code.
Many sites encourage using a SAX parser over a DOM parser,
saying that a SAX parser is much faster than a DOM parser.
However, when I ran my code, which uses a SAX parser, I found that
my program runs very slowly.
One senior in my lab told me that the slowness might come
from the file I/O.
In the code of 'javax.xml.parsers.SAXParser.class',
'InputStream' is used for file input, which could make the job slow compared
to using the 'Scanner' class or the 'BufferedReader' class.
My questions are:
1. Are SAX parsers good for parsing large-scale XML documents?
My program took 10 minutes to parse a 14 MB sample file and insert the data
into MySQL on localhost.
Actually, another senior in my lab, who wrote a similar program
to mine but with a DOM parser, parses the 1.60 GB XML file and saves the data
in an hour.
2. How can I use 'BufferedReader' instead of 'InputStream'
while using the SAX parser library?
This is my first question on Stack Overflow, so any kind of advice would be appreciated. Thank you for reading.
Added part after receiving initial feedback
I should have uploaded my code to clarify my problem; I apologize for that.
package xml_parse;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
public class Readxml extends DefaultHandler {
Connection con = null;
String[] chunk; // to check /A/, /B/, /C/ kind of stuff.
public Readxml() throws SQLException {
// connect to local mysql database
con = DriverManager.getConnection("jdbc:mysql://localhost/lab_first",
"root", "2030kimm!");
}
public void getXml() {
try {
// obtain and configure a SAX based parser
SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
// obtain object for SAX parser
SAXParser saxParser = saxParserFactory.newSAXParser();
// default handler for SAX handler class
// all three methods are written in handler's body
DefaultHandler default_handler = new DefaultHandler() {
String topic_gate = "close", category_id_gate = "close",
new_topic_id, new_catid, link_url;
java.sql.Statement st = con.createStatement();
public void startElement(String uri, String localName,
String qName, Attributes attributes)
throws SAXException {
if (qName.equals("Topic")) {
topic_gate = "open";
new_topic_id = attributes.getValue(0);
// apostrophe escape in SQL query
new_topic_id = new_topic_id.replace("'", "''");
if (new_topic_id.contains("International"))
topic_gate = "close";
if (new_topic_id.equals("") == false) {
chunk = new_topic_id.split("/");
for (int i = 0; i < chunk.length - 1; i++)
if (chunk[i].length() == 1) {
topic_gate = "close";
break;
}
}
if (new_topic_id.startsWith("Top/"))
// String.replace returns a new string, so the result has to be assigned
new_topic_id = new_topic_id.replace("Top/", "");
}
if (topic_gate.equals("open") && qName.equals("catid"))
category_id_gate = "open";
// add each new link to table "links" (MySQL)
if (topic_gate.equals("open") && qName.contains("link")) {
link_url = attributes.getValue(0);
link_url = link_url.replace("'", "''"); // take care of
// apostrophe
// escape
String insert_links_command = "insert into links(link_url, catid) values('"
+ link_url + "', " + new_catid + ");";
try {
st.executeUpdate(insert_links_command);
} catch (SQLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public void characters(char ch[], int start, int length)
throws SAXException {
if (category_id_gate.equals("open")) {
new_catid = new String(ch, start, length);
// add new row to table "Topics" (MySQL)
String insert_topics_command = "insert into topics(topic_id, catid) values('"
+ new_topic_id + "', " + new_catid + ");";
try {
st.executeUpdate(insert_topics_command);
} catch (SQLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public void endElement(String uri, String localName,
String qName) throws SAXException {
if (qName.equals("Topic"))
topic_gate = "close";
if (qName.equals("catid"))
category_id_gate = "close";
}
};
// BufferedInputStream!!
String filepath = null;
BufferedInputStream buffered_input = null;
/*
* // Content filepath =
* "C:/Users/Kim/Desktop/2016여름/content.rdf.u8/content.rdf.u8";
* buffered_input = new BufferedInputStream(new FileInputStream(
* filepath)); saxParser.parse(buffered_input, default_handler);
*
* // Adult filepath =
* "C:/Users/Kim/Desktop/2016여름/ad-content.rdf.u8"; buffered_input =
* new BufferedInputStream(new FileInputStream( filepath));
* saxParser.parse(buffered_input, default_handler);
*/
// Kids-and-Teens
filepath = "C:/Users/Kim/Desktop/2016여름/kt-content.rdf.u8";
buffered_input = new BufferedInputStream(new FileInputStream(
filepath));
saxParser.parse(buffered_input, default_handler);
System.out.println("Finished.");
} catch (SQLException sqex) {
System.out.println("SQLException: " + sqex.getMessage());
System.out.println("SQLState: " + sqex.getSQLState());
} catch (Exception e) {
e.printStackTrace();
}
}
}
This is the whole code of my program.
My original code from yesterday did the file I/O in the following way
(instead of using 'BufferedInputStream'):
saxParser.parse("file:///C:/Users/Kim/Desktop/2016여름/content.rdf.u8/content.rdf.u8",
default_handler);
I expected some speed improvement after switching to
'BufferedInputStream', but the speed didn't improve at all.
I am having trouble figuring out the bottleneck that causes the speed issue.
Thank you very much.
The rdf file being read in the code is about 14 MB in size, and it takes about
11 minutes for my computer to execute this code.

Are SAX parsers good for parsing large-scale xml documents?
Yes. SAX and StAX parsers are the best choices for parsing big XML documents because they consume little memory and CPU. DOM parsers load the whole document into memory, which is clearly not the right choice in this case.
Response Update:
Regarding your code: to me, the slowness is more related to how you store your data in the database. Your current code executes its queries in auto-commit mode, whereas you should use transactional mode for better performance since you have a lot of data to insert; read this for a better understanding. To reduce the round trips between your application and the database, you should also consider using batch updates, as in this good example.
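The auto-commit and batch-update advice can be sketched with plain JDBC as follows. This is a minimal illustration, not the code from the linked examples: the class name BatchInsertSketch, the method insertLinks and the batch size are assumptions, while the table and column names are taken from the question. A PreparedStatement also removes the need to escape apostrophes by hand.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertSketch {

    /**
     * Inserts rows of {link_url, catid} in one transaction, sending them to the
     * server in batches. Returns the number of batches that were executed.
     */
    public static int insertLinks(Connection con, List<String[]> rows, int batchSize)
            throws SQLException {
        boolean previousAutoCommit = con.getAutoCommit();
        con.setAutoCommit(false);                       // one transaction instead of one per row
        int batches = 0;
        try (PreparedStatement ps = con.prepareStatement(
                "insert into links(link_url, catid) values(?, ?)")) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);                // the driver handles quoting
                ps.setInt(2, Integer.parseInt(row[1]));
                ps.addBatch();
                if (++pending == batchSize) {           // flush a full batch
                    ps.executeBatch();
                    batches++;
                    pending = 0;
                }
            }
            if (pending > 0) {                          // flush the remainder
                ps.executeBatch();
                batches++;
            }
            con.commit();
        } catch (SQLException e) {
            con.rollback();                             // do not leave a half-written load behind
            throw e;
        } finally {
            con.setAutoCommit(previousAutoCommit);
        }
        return batches;
    }
}
```

In the SAX handler from the question, startElement() would then only call addBatch() and flush every few thousand rows, and the commit would happen once in endDocument(). The right batch size depends on the driver and the server, so it is worth measuring.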

With a SAX parser you should be able to achieve a parsing speed of 1 GB/minute without too much difficulty. If it's taking 10 minutes to parse 14 MB, then either you are doing something wrong, or the time is being spent on something other than SAX parsing (e.g. database updates).

You can stay with the SAX parser, and use a BufferedInputStream rather than a BufferedReader (since you then do not have to guess the charset encoding of the XML).
With XML in general, it could be that extra files are being read: DTDs and such. For instance, there is a huge number of named entities for (X)HTML. Using an XML catalog to keep those remote files locally then helps enormously.
Maybe you can switch off validation.
You might also weigh network traffic against computing power by using gzip compression. By setting and inspecting headers, a GZIPInputStream used case by case might be more efficient (or not).
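As a rough illustration of the point about DTDs, the handler can answer every external-entity request with an empty source, so the parser never goes to the network for a DTD. This is a sketch, not a full XML catalog; the class name SkipDtdSketch and the method countElements are made up for the example, and it only makes sense when the document does not rely on entities declared in that DTD.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SkipDtdSketch {

    /** Counts the elements of a document without fetching its external DTD. */
    public static int countElements(InputStream in) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Hand the parser an empty DTD instead of letting it open systemId
                return new InputSource(new StringReader(""));
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                count[0]++;
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);   // validation is off by default; stated here for clarity
        // Passing bytes (not a Reader) lets the parser detect the encoding itself
        factory.newSAXParser().parse(new BufferedInputStream(in), handler);
        return count[0];
    }
}
```

If the input were gzip-compressed, wrapping the stream in java.util.zip.GZIPInputStream before the BufferedInputStream would be the corresponding change.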

Parsing XML file containing HTML entities in Java without changing the XML

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &gt; and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.
Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.
I'd like to use:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( stream );
I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?
Here's a full example:
public class Main {
public static void main( String [] args ) throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( new FileInputStream( "test.xml" ));
}
}
with test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some text &nbsp; &mdash; invalid!</bar>
</foo>
Produces:
[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.
Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one layer on top of each other?
The key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.

I would use a library like Jsoup for this purpose. I tested the following below and it works. I don't know if this helps. It can be located here: http://jsoup.org/download
public static void main(String args[]){
String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
"<bar>Some text &nbsp; &mdash; invalid!</bar></foo>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for (Element e : doc.select("bar")) {
System.out.println(e);
}
}
Result:
<bar>
Some text — invalid!
</bar>
Loading from a file can be found here:
http://jsoup.org/cookbook/input/load-document-from-file

Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and
invalidly -- contain HTML entities such as &mdash;
XML has only five predefined entities, and &mdash; is not among them. It works only when used in plain HTML or in legacy JSP. So SAX will not help. It can be done using StAX, which has a high-level iterator-based API. (Collected from this link)
Issue - 2: I found that I can override resolveEntity in
org.xml.sax.helpers.DefaultHandler, but how do I use this with the
higher-level API?
Streaming API for XML, called StAX, is an API for reading and writing XML documents.
StAX is a pull-parsing model: the application takes control of parsing the XML document by pulling (taking) events from the parser.
The core StAX API falls into two categories:
Cursor-based API: the low-level API, which allows the application to process XML as a stream of tokens (aka events).
Iterator-based API: the higher-level API, which allows the application to process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.
The StAX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:
Requires the parser to replace internal entity references with their
replacement text and report them as characters
This can be set on an XMLInputFactory, which is then in turn used to construct an XMLEventReader or XMLStreamReader.
However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it not to replace them.
You may try it. Hope it will solve your issue. For your case,
Main.java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;
public class Main {
public static void main(String[] args) {
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(
XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
XMLEventReader reader;
try {
reader = inputFactory
.createXMLEventReader(new FileInputStream("F://test.xml"));
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isEntityReference()) {
EntityReference ref = (EntityReference) event;
System.out.println("Entity Reference: " + ref.getName());
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (XMLStreamException e) {
e.printStackTrace();
}
}
}
test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some text &nbsp; &mdash; invalid!</bar>
</foo>
Output:
Entity Reference: nbsp
Entity Reference: mdash
Credit goes to @skaffman.
Related Link:
http://www.journaldev.com/1191/how-to-read-xml-file-in-java-using-java-stax-api
http://www.journaldev.com/1226/java-stax-cursor-based-api-read-xml-example
http://www.vogella.com/tutorials/JavaXML/article.html
Is there a Java XML API that can parse a document without resolving character entities?
UPDATE:
Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them
with something else, for example) and still produce a Document at the
end of the process?
To create a new document using the StAX API, you need to create an XMLStreamWriter, which provides methods to produce XML opening and closing tags, attributes and character content.
Five XMLStreamWriter methods are enough to build a document:
xmlsw.writeStartDocument() - initialises an empty document to which elements can be added
xmlsw.writeStartElement(String s) - creates a new element named s
xmlsw.writeAttribute(String name, String value) - adds the attribute name with the corresponding value to the last element produced by a call to writeStartElement. It is possible to add attributes as long as no call to writeStartElement, writeCharacters or writeEndElement has been made.
xmlsw.writeEndElement() - closes the last started element
xmlsw.writeCharacters(String s) - creates a new text node with content s as content of the last started element
A sample example is attached with it:
StAXExpand.java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.util.Arrays;
public class StAXExpand {
static XMLStreamWriter xmlsw = null;
public static void main(String[] argv) {
try {
xmlsw = XMLOutputFactory.newInstance()
.createXMLStreamWriter(System.out);
CompactTokenizer tok = new CompactTokenizer(
new FileReader(argv[0]));
String rootName = "dummyRoot";
// ignore everything preceding the word before the first "["
while(!tok.nextToken().equals("[")){
rootName=tok.getToken();
}
// start creating new document
xmlsw.writeStartDocument();
ignorableSpacing(0);
xmlsw.writeStartElement(rootName);
expand(tok,3);
ignorableSpacing(0);
xmlsw.writeEndDocument();
xmlsw.flush();
xmlsw.close();
} catch (XMLStreamException e){
System.out.println(e.getMessage());
} catch (IOException ex) {
System.out.println("IOException"+ex);
ex.printStackTrace();
}
}
public static void expand(CompactTokenizer tok, int indent)
throws IOException,XMLStreamException {
tok.skip("[");
while(tok.getToken().equals("#")) {// add attributes
String attName = tok.nextToken();
tok.nextToken();
xmlsw.writeAttribute(attName,tok.skip("["));
tok.nextToken();
tok.skip("]");
}
boolean lastWasElement=true; // for controlling the output of newlines
while(!tok.getToken().equals("]")){ // process content
String s = tok.getToken().trim();
tok.nextToken();
if(tok.getToken().equals("[")){
if(lastWasElement)ignorableSpacing(indent);
xmlsw.writeStartElement(s);
expand(tok,indent+3);
lastWasElement=true;
} else {
xmlsw.writeCharacters(s);
lastWasElement=false;
}
}
tok.skip("]");
if(lastWasElement)ignorableSpacing(indent-3);
xmlsw.writeEndElement();
}
private static char[] blanks = "\n".toCharArray();
private static void ignorableSpacing(int nb)
throws XMLStreamException {
if(nb>blanks.length){// extend the length of space array
blanks = new char[nb+1];
blanks[0]='\n';
Arrays.fill(blanks,1,blanks.length,' ');
}
xmlsw.writeCharacters(blanks, 0, nb+1);
}
}
CompactTokenizer.java
import java.io.Reader;
import java.io.IOException;
import java.io.StreamTokenizer;
public class CompactTokenizer {
private StreamTokenizer st;
CompactTokenizer(Reader r){
st = new StreamTokenizer(r);
st.resetSyntax(); // remove parsing of numbers...
st.wordChars('\u0000','\u00FF'); // everything is part of a word
// except the following...
st.ordinaryChar('\n');
st.ordinaryChar('[');
st.ordinaryChar(']');
st.ordinaryChar('#');
}
public String nextToken() throws IOException{
st.nextToken();
while(st.ttype=='\n'||
(st.ttype==StreamTokenizer.TT_WORD &&
st.sval.trim().length()==0))
st.nextToken();
return getToken();
}
public String getToken(){
return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
}
public String skip(String sym) throws IOException {
if(getToken().equals(sym))
return nextToken();
else
throw new IllegalArgumentException("skip: "+sym+" expected but "+
getToken() +" found ");
}
}
For more, you can follow the tutorial
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
http://www.ibm.com/developerworks/library/x-tipstx2/index.html
http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch09s03.html
http://staf.sourceforge.net/current/STAXDoc.pdf

Another approach, since you're not using a rigid OXM approach anyway.
You might want to try using a less rigid parser such as JSoup?
This will stop immediate problems with invalid XML schemas etc, but it will just devolve the problem into your code.

Just to throw in a different approach to a solution:
You might wrap your input stream in a stream implementation that replaces the entities with something legal.
While this is a hack for sure, it should be a quick and easy solution (or, better said, a workaround).
It is not as elegant and clean as a solution inside the XML framework, though.
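A simplified version of that idea, working on a string rather than on a stream, could look like the sketch below. The class name EntityPrePass and its entity table are assumptions made for the example; a real table would have to cover every HTML entity that can occur in the files. Known HTML entity names are rewritten to numeric character references, which any XML parser accepts, and everything else (including the five predefined XML entities) is left alone.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityPrePass {

    // Deliberately tiny table for the example; extend as needed
    private static final Map<String, String> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("nbsp", "&#160;");
        HTML_ENTITIES.put("mdash", "&#8212;");
    }

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Rewrites known HTML entities to numeric character references. */
    public static String replaceHtmlEntities(String xml) {
        Matcher m = ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = HTML_ENTITIES.get(m.group(1));
            if (replacement == null) {
                replacement = m.group();   // unknown or predefined entity: keep as is
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Parses the repaired text with the ordinary DocumentBuilder API. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(replaceHtmlEntities(xml))));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<foo><bar>Some text &nbsp; &mdash; invalid!</bar></foo>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Note that this pre-pass is purely textual, so it would also rewrite matching names inside CDATA sections and comments. For very large files the same replacement would have to be done in a FilterReader instead of on a whole string.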

I did something similar yesterday: I needed to add values from an unzipped XML stream to a database.
// imports (I'm not sure all of them are necessary)
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
// I haven't tested this code just now because I'm at work. It should work,
// but you may need to make small changes.
InputSource is = new InputSource(new FileInputStream("test.xml"));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
String words= xpath.evaluate("/foo/bar", doc.getDocumentElement());
String decoded = ParsingHexToChar.parseToChar(words);
// library used: commons-lang3.jar
// method that does the decoding (in class ParsingHexToChar)
public static String parseToChar( String words){
String decode= org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);
return decode;
}

Try this, using the Apache Commons libraries (commons-io for IOUtils, commons-lang3 for the translators):
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
InputStream in = new FileInputStream(xmlfile);
String unescapeHtml4 = IOUtils.toString(in);
CharSequenceTranslator obj = new AggregateTranslator(new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE())
);
unescapeHtml4 = obj.translate(unescapeHtml4);
StringReader readerInput= new StringReader(unescapeHtml4);
InputSource is = new InputSource(readerInput);
Document doc = parser.parse(is);

Parsing xml file contents without knowing xml file structure

I've been learning some new tech, using Java to parse files, and for the most part it's going well. However, I'm at a loss as to how to parse an XML file whose structure is not known upon receipt. There are lots of examples of how to do so if you know the structure (getElementsByTagName seems to be the way to go), but no dynamic options, at least none that I've found.
So, the tl;dr version of this question: how can I parse an XML file when I cannot rely on knowing its structure?

Well, the parsing part is easy; as helderdarocha stated in the comments, the parser only requires well-formed XML and does not care about the structure. You can use Java's standard DocumentBuilder to obtain a Document:
InputStream in = new FileInputStream(...);
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
(If you're parsing multiple documents, you can keep reusing the same DocumentBuilder.)
Then you can start with the root document element and use familiar DOM methods from there on out:
Element root = doc.getDocumentElement(); // perform DOM operations starting here.
As for processing it, well it really depends on what you want to do with it, but you can use the methods of Node like getFirstChild() and getNextSibling() to iterate through children and process as you see fit based on structure, tags, and attributes.
Consider the following example:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class XML {
public static void main (String[] args) throws Exception {
String xml = "<objects><circle color='red'/><circle color='green'/><rectangle>hello</rectangle><glumble/></objects>";
// parse
InputStream in = new ByteArrayInputStream(xml.getBytes("utf-8"));
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
// process
Node objects = doc.getDocumentElement();
for (Node object = objects.getFirstChild(); object != null; object = object.getNextSibling()) {
if (object instanceof Element) {
Element e = (Element)object;
if (e.getTagName().equalsIgnoreCase("circle")) {
String color = e.getAttribute("color");
System.out.println("It's a " + color + " circle!");
} else if (e.getTagName().equalsIgnoreCase("rectangle")) {
String text = e.getTextContent();
System.out.println("It's a rectangle that says \"" + text + "\".");
} else {
System.out.println("I don't know what a " + e.getTagName() + " is for.");
}
}
}
}
}
The input XML document (hard-coded in the example) is:
<objects>
<circle color='red'/>
<circle color='green'/>
<rectangle>hello</rectangle>
<glumble/>
</objects>
The output is:
It's a red circle!
It's a green circle!
It's a rectangle that says "hello".
I don't know what a glumble is for.
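When nothing at all is known about the tags, the same getFirstChild()/getNextSibling() traversal can be applied recursively. The sketch below is one possible way to do that, assuming that one path=value line per attribute and per non-blank text node is a useful output; the class name UnknownXmlWalker and the path notation are invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class UnknownXmlWalker {

    // Parses the XML text and returns one "path=value" line per
    // attribute and per non-blank text node, in document order.
    static List<String> flatten(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> lines = new ArrayList<>();
        walk(doc.getDocumentElement(), "", lines);
        return lines;
    }

    // Visits an element, records its attributes and text, then recurses
    // into child elements without assuming any tag names.
    private static void walk(Element element, String parentPath, List<String> lines) {
        String path = parentPath + "/" + element.getTagName();

        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            lines.add(path + "/@" + attribute.getNodeName() + "=" + attribute.getNodeValue());
        }

        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk((Element) child, path, lines);
            } else if (child instanceof Text) {
                // Text also covers CDATA sections; whitespace-only nodes are skipped.
                String text = ((Text) child).getData().trim();
                if (!text.isEmpty()) {
                    lines.add(path + "=" + text);
                }
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<objects><circle color='red'/><circle color='green'/>"
                + "<rectangle>hello</rectangle><glumble/></objects>";
        for (String line : flatten(xml)) {
            System.out.println(line);
        }
    }
}
```

An element such as glumble, which has neither attributes nor text, contributes no line in this sketch. Like the example above, it builds the whole DOM in memory, so it suits small to medium documents.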
