Apache Tika : How to use XPath queries

I am parsing an XML file using Apache Tika. I would like to extract certain tags with their content from the XML and store them in a HashMap. Right now, I can extract the entire content of the XML, but the tags are lost:
//detecting the file type
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = null;
try
{
inputstream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
}
catch (URISyntaxException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
ParseContext pcontext = new ParseContext();
//Xml parser
XMLParser xmlparser = new XMLParser();
xmlparser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
This shows me the entire content of the XML.
Now I want to extract certain parts of the XML, and since Tika allows XPath queries, I tried this:
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher divContentMatcher = xhtmlParser.parse("/Product/Source/Publisher/PublisherName[@nameType='Person']");
ContentHandler xhandler = new MatchingContentHandler(
new ToXMLContentHandler(), divContentMatcher);
AutoDetectParser parser = new AutoDetectParser();
Metadata xmetadata = new Metadata();
try (FileInputStream stream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()))) {
parser.parse(stream, xhandler, xmetadata);
System.out.println(xhandler.toString());
} catch (URISyntaxException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
But it does not print any output! I was hoping it would give me only the nodes selected by the XPath query.
Any idea what's going on?
By the way, here is the corresponding XML:
<Product productID="xvc22" shortProductID="x" language="en">
<ProductStatus statusType="Published" />
<Source>
<Publisher sequence="1" primaryIndicator="Yes">
<PublisherID idType="Shortname">jjkjkj</PublisherID>
<PublisherID idType="BM">6666</PublisherID>
<PublisherName nameType="Legal">ABT</PublisherName>
<PublisherName nameType="Person">
<LastName>pppp</LastName>
<FirstName>lkkk</FirstName>
</PublisherName>
</Publisher>
</Source>
</Product>
Also, when I test the query on
http://www.freeformatter.com/xpath-tester.html
I see the correct result, i.e.
Element='<PublisherName nameType="Person">
<LastName>pppp</LastName>
<FirstName>lkkk</FirstName>
</PublisherName>'
Is this some syntax issue with Java or Tika?
EDIT
Note that if I parse without Tika, it works:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("/Product/Source/Publisher/PublisherName[@nameType='Person']");
System.out.println(expr.evaluate(doc, XPathConstants.STRING));
This prints:
pppp
lkkk
which is perfect. So why can't Tika handle the same XPath query?
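One thing worth checking: as far as I can tell from the Tika Javadoc, XPathParser only handles a very small XPath subset (plain element, attribute, text() and node() steps), so a predicate such as [@nameType='Person'] may simply never match. If the end goal is a map of tag names to content, the plain JAXP route from the EDIT can be extended. Below is a minimal sketch, not a Tika solution; the class name PublisherNameExtractor and the helper extractChildren are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PublisherNameExtractor {

    // Hypothetical helper: evaluates the XPath expression and returns the child
    // elements of the first matching node as a tag-name -> text map.
    static Map<String, String> extractChildren(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node match = (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);

        Map<String, String> result = new LinkedHashMap<>();
        if (match == null) {
            return result; // the expression selected nothing
        }
        NodeList children = match.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.put(child.getNodeName(), child.getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the XML from the question
        String xml = "<Product><Source><Publisher>"
                + "<PublisherName nameType=\"Legal\">ABT</PublisherName>"
                + "<PublisherName nameType=\"Person\">"
                + "<LastName>pppp</LastName><FirstName>lkkk</FirstName>"
                + "</PublisherName></Publisher></Source></Product>";
        System.out.println(extractChildren(xml,
                "/Product/Source/Publisher/PublisherName[@nameType='Person']"));
    }
}
```

A LinkedHashMap is used only so the entries keep document order; a HashMap works the same way if order does not matter.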

Related

JAVA how to find and delete the structure of sentences?

I have an XML file, and its structure is like this:
<?xml version="1.0" encoding="MS949"?>
<pmd-cpd>
<duplication lines="123" tokens="123">
<file line="1" path=".."/>
<file line="1" path=".."/>
<codefragment><![CDATA[........]]></codefragment>
</duplication>
<duplication>
...
</duplication>
</pmd-cpd>
I want to delete the 'codefragment' node, because my parser fails with the error 'invalid XML character (0x1)'.
My parsing code looks like this:
private void parseXML(File f){
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = null;
Document document = null;
try {
builder = factory.newDocumentBuilder();
document = builder.parse(f);
}catch(...)
The error happens in document = builder.parse(f); so I cannot use the parser to delete the codefragment node.
This is why I want to delete these lines without the parser.
How can I delete this node without the parser?

This is a followup answer to OP's self-answer, and the comment I made to that answer. Here's the recap, plus some extra:
Never do String += String in a loop. Use StringBuilder.
Read the XML in blocks, not lines.
Don't use String.replaceAll(). It has to recompile the regex every time, a regex you already have. Use Matcher.replaceAll().
Remember to close() the Reader. Better yet, use try-with-resources.
No need to save the clean XML back out, just use it directly.
Since XML is usually in UTF-8, read the file as UTF-8.
Don't print and ignore errors. Let caller handle errors.
private static void parseXML(File f) throws IOException, ParserConfigurationException, SAXException {
StringBuilder xml = new StringBuilder();
try (BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(f),
StandardCharsets.UTF_8))) {
Pattern badChars = Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]+");
char[] cbuf = new char[1024];
for (int len; (len = in.read(cbuf)) != -1; )
xml.append(badChars.matcher(CharBuffer.wrap(cbuf, 0, len)).replaceAll(""));
}
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder domBuilder = domFactory.newDocumentBuilder();
Document document = domBuilder.parse(new InputSource(new StringReader(xml.toString())));
// insert code using DOM here
}
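To see the effect of that character class in isolation, here is a small sketch (the class name BadCharDemo and the sample fragment are made up). It strips a 0x1 from a CDATA section and then parses the result, which would otherwise fail with the "invalid XML character" error from the question.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class BadCharDemo {

    // Same character class as in the answer: anything outside tab, LF, CR
    // and the two allowed BMP ranges is matched and removed.
    static final Pattern BAD_CHARS =
            Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]+");

    static String clean(CharSequence raw) {
        return BAD_CHARS.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) throws Exception {
        // Made-up fragment containing the control character 0x1
        String raw = "<codefragment><![CDATA[int a" + (char) 1 + " = 1;]]></codefragment>";
        String cleaned = clean(raw);

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(cleaned)));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

One caveat: as written, the class also removes characters outside the BMP (for example emoji), which XML 1.0 does allow. If the files can contain those, the pattern needs an extra allowed range.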

I solved this problem by removing the bad characters such as 0x01, saving the result as a new XML file, and then parsing the new file.
Because I could not even parse my old XML file, I could not remove the node with the parser.
The code for removing the invalid characters and saving a new file looks like this:
// parse the XML string into a DOM document
public static Document stringToDom(String xmlSource)
throws SAXException, ParserConfigurationException, IOException {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
return builder.parse(new InputSource(new StringReader(xmlSource)));
}
//get the file and remove bad characters in it
private static void cleanString(File fileName) {
try {
BufferedReader in = new BufferedReader(new FileReader(fileName));
String xmlLines, cleanXMLString="";
Pattern p = null;
Matcher m = null;
p = Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]");
while (((xmlLines = in.readLine()) != null)){
m = p.matcher(xmlLines);
if (m.find()){
cleanXMLString = cleanXMLString + xmlLines.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "")+"\n";
}else
cleanXMLString = cleanXMLString + xmlLines+"\n";
}
Document doc = stringToDom(cleanXMLString);
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(new File("\\new\\"+fileName.getName()));
transformer.transform(source, result);
} catch (IOException | SAXException | ParserConfigurationException | TransformerException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
This may not be a good method, since it takes quite a long time even for a small file (under 5 MB).
But if your file is small, you can try this.

Cannot create XML Document from String

I am trying to create an org.w3c.dom.Document from an XML string. I am using the question "How to convert string to xml file in java" as a basis. I am not getting an exception; the problem is that my document is always null. The XML is system generated and well formed. I wish to convert it to a Document object so that I can add new nodes etc.
public static org.w3c.dom.Document stringToXML(String xmlSource) throws Exception {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputStream input = IOUtils.toInputStream(xmlSource); //uses Apache commons to obtain InputStream
BOMInputStream bomIn = new BOMInputStream(input); //create BOMInputStream from InputStream
InputSource is = new InputSource(bomIn); // InputSource with BOM removed
Document document = builder.parse(new InputSource(new StringReader(xmlSource)));
Document document2 = builder.parse(is);
System.out.println("Document=" + document.getDoctype()); // always null
System.out.println("Document2=" + document2.getDoctype()); // always null
return document;
}
I have tried these things: I created a BOMInputStream, thinking that a BOM was causing the conversion to fail, but passing the BOMInputStream to the InputSource doesn't make a difference. I have even tried passing a literal String of simple XML and still get nothing but null. The toString method returns [#document:null]
I am using XPages, a JSF implementation that uses Java 6. The full name of the Document class is used to avoid confusion with the XPages-related class of the same name.

Don't rely on what toString is telling you. It provides diagnostic information that it thinks is useful about the current class, which in this case is nothing more than...
"["+getNodeName()+": "+getNodeValue()+"]";
Which isn't going to help you. Instead, you will need to try and transform the model back into a String, for example...
String text
= "<fruit>"
+ "<banana>yellow</banana>"
+ "<orange>orange</orange>"
+ "<pear>yellow</pear>"
+ "</fruit>";
InputStream is = null;
try {
is = new ByteArrayInputStream(text.getBytes());
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(is);
System.out.println("Document=" + document.toString()); // prints [#document: null], which does not mean the document is null
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.setOutputProperty(OutputKeys.METHOD, "xml");
tf.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
ByteArrayOutputStream os = null;
try {
os = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(document);
StreamResult sr = new StreamResult(os);
tf.transform(domSource, sr);
System.out.println(new String(os.toByteArray()));
} finally {
try {
os.close();
} finally {
}
}
} catch (ParserConfigurationException | SAXException | IOException | TransformerConfigurationException exp) {
exp.printStackTrace();
} catch (TransformerException exp) {
exp.printStackTrace();
} finally {
try {
is.close();
} catch (Exception e) {
}
}
Which outputs...
Document=[#document: null]
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<fruit>
<banana>yellow</banana>
<orange>orange</orange>
<pear>yellow</pear>
</fruit>
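If you only want to confirm that parsing worked, without serializing the whole tree, you can also look at the root element directly. A minimal sketch follows (the class name DocumentCheck is made up). It also shows why getDoctype() in the question returns null: the input simply has no DOCTYPE declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DocumentCheck {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document document = parse("<fruit><banana>yellow</banana><pear>yellow</pear></fruit>");
        Element root = document.getDocumentElement();

        System.out.println(document.getDoctype());            // null: there is no <!DOCTYPE ...> in the input
        System.out.println(root.getNodeName());               // fruit
        System.out.println(root.getChildNodes().getLength()); // 2

        // Adding a new node works as usual once the root element is available
        Element apple = document.createElement("apple");
        apple.setTextContent("red");
        root.appendChild(apple);
        System.out.println(root.getChildNodes().getLength()); // 3
    }
}
```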

You can try using this: http://www.wissel.net/blog/downloads/SHWL-8MRM36/$File/SimpleXMLDoc.java

Do I need server code in order to edit server files remotely?

I have a web server with an xml file that at some point is going to hold the information for posts on the website. This is the xml's structure.
<?xml version="1.0" encoding="ISO-8859-1"?>
<posts>
<post>
<date>7/9/2013 6:44 PM</date>
<category>general</category>
<poster>elfenari</poster>
<title>Test Post</title>
<content>This is a test post for the website</content>
</post>
</posts>
I created an applet using Swing in NetBeans, with the following code to build the XML from the UI objects in the applet.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = factory.newDocumentBuilder();
Document doc = docBuilder.parse(url.openStream());
Element root = doc.getDocumentElement();
Element ePost = doc.createElement("post");
Element eDate = doc.createElement("date");
eDate.setTextContent(time);
Element eCategory = doc.createElement("category");
eCategory.setTextContent(category);
Element eTitle = doc.createElement("title");
eTitle.setTextContent(title);
Element ePoster = doc.createElement("poster");
ePoster.setTextContent(poster);
Element eContent = doc.createElement("content");
eContent.setTextContent(post);
ePost.appendChild(eDate);
ePost.appendChild(eCategory);
ePost.appendChild(eTitle);
ePost.appendChild(ePoster);
ePost.appendChild(eContent);
root.appendChild(ePost);
TransformerFactory transfac = TransformerFactory.newInstance();
Transformer trans = transfac.newTransformer();
StringWriter sw = new StringWriter();
StreamResult result = new StreamResult(sw);
DOMSource source = new DOMSource(doc);
trans.transform(source, result);
String xmlString = sw.toString();
OutputStream f0;
byte buf[] = xmlString.getBytes();
f0 = new FileOutputStream(url);
for(int i=0;i<buf .length;i++) {
f0.write(buf[i]);
}
f0.close();
buf = null;
} catch (TransformerException ex) {
Logger.getLogger(xGrep.class.getName()).log(Level.SEVERE, null, ex);
} catch (ParserConfigurationException | SAXException | IOException ex) {
Logger.getLogger(xGrep.class.getName()).log(Level.SEVERE, null, ex);
}
}
I've done some research, and I think I need a Java program on my server to accept the change to the XML, but I'm not sure exactly how to do that. Could you tell me what I need in order to edit the file on the server, and how to code it if I do need it?
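In general the write does have to happen in code running on the server: FileOutputStream takes a local file or path, not a URL, so the applet can read the file over HTTP but cannot write it back that way. Below is a rough sketch of the server-side half only, assuming a hypothetical PostStore class that your servlet or HTTP handler would call with the submitted fields. How the applet sends those fields (for example an HTTP POST) is left out.

```java
import java.io.File;
import java.io.OutputStream;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class PostStore {

    // Appends one <post> element to the XML file on the server's local disk.
    // synchronized so two requests handled in the same JVM do not overwrite each other.
    static synchronized void appendPost(File xmlFile, String date, String category,
            String poster, String title, String content) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xmlFile);

        Element post = doc.createElement("post");
        addChild(doc, post, "date", date);
        addChild(doc, post, "category", category);
        addChild(doc, post, "poster", poster);
        addChild(doc, post, "title", title);
        addChild(doc, post, "content", content);
        doc.getDocumentElement().appendChild(post);

        // Write the updated document back over the original file
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        try (OutputStream out = Files.newOutputStream(xmlFile.toPath())) {
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }

    private static void addChild(Document doc, Element parent, String name, String text) {
        Element child = doc.createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
    }
}
```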

How to turn a string into an XML file? [duplicate]

Possible Duplicate:
Writing to a XML file in Java
I have the XML text below as a string.
<someNode>
<id>A124</id>
<status>404</status>
<message>No data</message>
</someNode>
Is it possible to convert this text into an XML file and archive the generated XML file?
Thanks!

DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbfac.newDocumentBuilder();
Document doc = docBuilder.parse(new InputSource(new StringReader(theString)));

public class StringToXML {
public static void main(String[] args) {
String xmlString = "<?xml version=\"1.0\" encoding=\"utf-8\"?><soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"></soap:Envelope>";
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder;
try
{
builder = factory.newDocumentBuilder();
// Use String reader
Document document = builder.parse( new InputSource(
new StringReader( xmlString ) ) );
TransformerFactory tranFactory = TransformerFactory.newInstance();
Transformer aTransformer = tranFactory.newTransformer();
Source src = new DOMSource( document );
Result dest = new StreamResult( new File( "xmlFileName.xml" ) );
aTransformer.transform( src, dest );
} catch (Exception e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
This information is helpful.
Thanks,
Pavan

It's as simple as that:
String text = "<your><xml>data</xml></your>";
Writer writer = new FileWriter("/tmp/filename.xml");
writer.write(text);
writer.flush();
writer.close();
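A variation on the same idea, as a sketch: FileWriter uses the platform default encoding, which may not match the encoding declared in the XML, and the writer above is not closed if write() throws. Using java.nio with an explicit charset and try-with-resources avoids both. The class name XmlStringWriter and the target path are placeholders.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class XmlStringWriter {

    // Writes the XML text to the target file as UTF-8; the writer is closed
    // automatically, even if write() throws.
    static void writeXml(Path target, String xml) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(xml);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "<your><xml>data</xml></your>";
        writeXml(Paths.get("filename.xml"), text); // placeholder file name
    }
}
```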

You can use java.io.FileWriter to save your file.
String fileData = "<sample><xml>data</xml></sample>";
File outputFile = new File("someFile.xml");
BufferedWriter bw = null;
try{
bw = new BufferedWriter(new FileWriter(outputFile));
bw.write(fileData);
}
catch(IOException e)
{
e.printStackTrace();
}
finally
{
try{bw.close();}catch(Exception e){}
}
In case you need to manipulate the XML, do as Kazekage Gaara said:
DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbfac.newDocumentBuilder();
Document doc = docBuilder.parse(new InputSource(new StringReader(theString)));
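And to save, you can do as I said above. To turn the document back into a string, serialize it with a Transformer (doc.toString() only returns something like [#document: null]):
StringWriter sw = new StringWriter();
TransformerFactory.newInstance().newTransformer().transform(new DOMSource(doc), new StreamResult(sw));
fileData = sw.toString();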

I would recommend using commons-io. It has a single method that will do everything you need.
Code would look something like
FileUtils.writeStringToFile(new File("filename.xml"), xml);

How to validate an XML document against an XSD schema using JDom

I am working on an application that uses JDOM for parsing XML documents.
The existing code is as follows:
private Document openDocumentAtPath(File file) {
// Create a sax builder for building the JDOM document
SAXBuilder builder = new SAXBuilder();
// JDOM document to be created from XML document
Document doc = null;
// Try to build the document
try {
// Get the file into a single string
BufferedReader input = new BufferedReader(
new FileReader( file ) );
String content = "";
String line = null;
while( ( line = input.readLine() ) != null ) {
content += "\n" + line;
}
StringReader reader = new StringReader( content );
doc = builder.build(reader);
}// Only thrown when a XML document is not well-formed
catch ( JDOMException e ) {
System.out.println(this.file + " is not well-formed!");
System.out.println("Error Message: " + e.getMessage());
}
catch (IOException e) {
System.out.println("Cannot access: " + this.file.toString());
System.out.println("Error Message: " + e.getMessage());
}
return doc;
}
Now I also want to validate the XML against an XSD. I read the API docs, which say to use JAXP, but I don't know how.
The application is using JDOM 1.1.1, and the examples I found online use classes that are not available in this version. Can someone explain how to validate an XML document against an XSD, especially for this version?

How about simply copy-pasting code from the JDOM FAQ?

Or, use JDOM 2.0.1, and change the line:
SAXBuilder builder = new SAXBuilder();
to be
SAXBuilder builder = new SAXBuilder(XMLReaders.XSDVALIDATING);
See the JDOM 2.0.1 javadoc (examples at bottom of page): http://hunterhacker.github.com/jdom/jdom2/apidocs/org/jdom2/input/sax/package-summary.html
Oh, and I should update the FAQs
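If upgrading JDOM is not an option, another route that does not depend on the JDOM version at all is to validate with JAXP's javax.xml.validation before handing the file to SAXBuilder. Below is a minimal sketch; the class name XsdCheck and the tiny inline schema are made up for illustration, and in practice you would pass StreamSource objects for your real XSD and XML files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class XsdCheck {

    // Returns true if the XML validates against the XSD, false otherwise.
    // A SAXException from newSchema (a broken schema) is passed on to the caller.
    static boolean isValid(String xsd, String xml) throws SAXException, IOException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            System.out.println("Not valid: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up schema: <someNode> with a string <id> and an integer <status>
        String xsd = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
                + "<xs:element name=\"someNode\"><xs:complexType><xs:sequence>"
                + "<xs:element name=\"id\" type=\"xs:string\"/>"
                + "<xs:element name=\"status\" type=\"xs:int\"/>"
                + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        System.out.println(isValid(xsd, "<someNode><id>A124</id><status>404</status></someNode>"));
        System.out.println(isValid(xsd, "<someNode><id>A124</id><status>abc</status></someNode>"));
    }
}
```

After a successful check, the existing openDocumentAtPath code can build the JDOM document as before.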
