Parse document structure with Java

We need to extract a tree-like structure from a given text document using Java. The file type should be common and open (RTF, ODT, ...). Currently we use Apache Tika to extract plain text from multiple documents.
Which file type and API should we use to get the correct structure parsed most reliably? If this is possible with Tika, I would be happy to see a demonstration.
For example, we should get this kind of data from the given document:
Main Heading
    Heading 1
        Heading 1.1
    Heading 2
        Heading 2.2
Main Heading is the title of the paper. The paper has two main headings, Heading 1 and Heading 2, and each of them has one subheading. We should also get the contents under each heading (the paragraph text).
Any help is appreciated.

An OpenDocument text file (.odt) is essentially a zip package containing multiple XML files. content.xml holds the actual textual content of the document. We are interested in the headings, which are found in text:h elements. Read more about ODT.
I found an implementation for extracting headings from .odt files with QueryPath.
Since the original question was about Java, here is a Java version. First we get access to content.xml by using ZipFile. Then we use SAX to parse the XML inside content.xml. The sample code simply prints the archive name, the entry name and size, and then every heading preceded by its outline level:
Test3.odt
content.xml
3764
1 My New Great Paper
2 Abstract
2 Introduction
2 Content
3 More content
3 Even more
2 Conclusions
Sample code:
public void printHeadingsOfOdtFIle(File odtFile) {
    // try-with-resources closes the zip archive once parsing is done
    try (ZipFile zFile = new ZipFile(odtFile)) {
        System.out.println(zFile.getName());
        ZipEntry contentFile = zFile.getEntry("content.xml");
        System.out.println(contentFile.getName());
        System.out.println(contentFile.getSize());
        XMLReader xr = XMLReaderFactory.createXMLReader();
        OdtDocumentContentHandler handler = new OdtDocumentContentHandler();
        xr.setContentHandler(handler);
        xr.parse(new InputSource(zFile.getInputStream(contentFile)));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
public static void main(String[] args) {
    new OdtDocumentStructureExtractor().printHeadingsOfOdtFIle(new File("Test3.odt"));
}
Relevant parts of used ContentHandler look like this:
@Override
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
    temp = "";
    if("text:h".equals(qName)) {
        String headingLevel = atts.getValue("text:outline-level");
        if(headingLevel != null) {
            System.out.print(headingLevel + " ");
        }
    }
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
    // the parser may deliver one text node in several chunks, so append instead of overwriting
    String chunk = new String(ch, start, length);
    temp += chunk;
    fullText.append(chunk);
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
    if("text:h".equals(qName)) {
        System.out.println(temp);
    }
}
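The level/title pairs printed above are enough to rebuild the tree the question asks for. What follows is only a minimal sketch, assuming the handler collects the headings into two lists instead of printing them; the class OutlineNode and its methods are hypothetical names and not part of the original answer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: turns (outline level, title) pairs into a tree.
class OutlineNode {
    final int level;
    final String title;
    final List<OutlineNode> children = new ArrayList<>();

    OutlineNode(int level, String title) {
        this.level = level;
        this.title = title;
    }

    // Builds the tree from headings given in document order.
    // levels.get(i) is the text:outline-level of titles.get(i); levels are assumed to be >= 1.
    static OutlineNode buildOutline(List<Integer> levels, List<String> titles) {
        OutlineNode root = new OutlineNode(0, "(document)");
        Deque<OutlineNode> stack = new ArrayDeque<>();
        stack.push(root);
        for (int i = 0; i < levels.size(); i++) {
            OutlineNode node = new OutlineNode(levels.get(i), titles.get(i));
            // Pop until the top of the stack is a shallower heading (the root has level 0).
            while (stack.peek().level >= node.level) {
                stack.pop();
            }
            stack.peek().children.add(node);
            stack.push(node);
        }
        return root;
    }

    // Renders the subtree below this node, two spaces of indentation per depth.
    String render() {
        StringBuilder sb = new StringBuilder();
        appendChildren(sb, 0);
        return sb.toString();
    }

    private void appendChildren(StringBuilder sb, int depth) {
        for (OutlineNode child : children) {
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            sb.append(child.title).append('\n');
            child.appendChildren(sb, depth + 1);
        }
    }
}
```

Paragraph text could be attached the same way, by adding each text:p to the node currently on top of the stack, but that part is left out of the sketch.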

Related

ContentHandler is taking a lot of time to go through 3MB XML parsed xlsx file

I'm using a SAX parser and the XSSFReader of Apache POI to parse an .xlsx file. My sheet contains up to 650 columns and 2000 rows (file size about 2.5 MB). My code looks like this:
public class MyClass {
    public static void main(String path){
        try {
            OPCPackage pkg = OPCPackage.open(new FileInputStream(path));
            XSSFReader reader = new XSSFReader(pkg);
            InputStream sheetData = reader.getSheet("rId3"); //the needed sheet
            MyHandler handler = new MyHandler();
            XMLReader parser = SAXHelper.newXMLReader();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(sheetData));
        }
        catch (Exception e){
            //or other catches with required exceptions
        }
    }
}
class MyHandler extends DefaultHandler {
    @Override
    public void startElement (String uri, String localName, String name, Attributes attributes) {
        if("row".equals(name)) {
            System.out.println("row: " + attributes.getValue("r"));
        }
    }
}
Unfortunately, it takes 2 or 3 seconds to go over one row, which means that going over the whole sheet takes more than 30 minutes(!!)
I am sure this is not how it is supposed to be. If it were, nobody would be suggesting the Apache POI event API for large files, would they?
I want to get to the <mergeCell> values at the end of the XML (after the closing </sheetData>). Is there a better way to do it? (I was thinking of handling the XML as a string and simply searching it with a regular expression for the required values. Is that possible?)
So I have two questions:
1. What is wrong with my code, and why does it take so long? (When I think about it, this sounds like an ordinary workload: about 600 cells per row, so why is a row not processed within a fraction of a second?)
2. Is there a way to treat the XML as a text file and simply search it using a regex?

SAX parser is not working properly when xml input is given as stream and some xml elements are empty

When XML input is given to the SAX parser as an input stream and some of the XML elements are empty, the handler's characters() method is not called for them and I get an unexpected result.
For example,
XML input:
<root>
<salutation>Hello Sir</salutation>
<userName />
<parent>
<child>a</child>
</parent>
<parent>
<child>b</child>
</parent>
<parent>
<child>c</child>
</parent>
<success>yes</success>
<hoursSpent />
</root>
Parser Implementation:
public class MyContentHandler implements ContentHandler {
    private String salutation;
    private String userName;
    private String success;
    private String hoursSpent;
    String tmpValue="";
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("salutation".equals(qName)) {
            salutation=tmpValue;
        } else if ("userName".equals(qName)) {
            userName=tmpValue;
        } else if ("success".equals(qName)) {
            success=tmpValue;
        } else if ("hoursSpent".equals(qName)) {
            hoursSpent=tmpValue;
        }
    }
    public void characters(char[] ch, int begin, int length) throws SAXException {
        tmpValue = new String(ch, begin, length).trim();
    }
    // getters and the remaining ContentHandler methods omitted
}
Main Program:
public class MainProgram{
    public static void main(String[] args) throws Exception {
        SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
        SAXParser saxParser = saxParserFactory.newSAXParser();
        XMLReader xmlReader = saxParser.getXMLReader();
        MyContentHandler contentHandler = new MyContentHandler(xmlReader);
        xmlReader.setContentHandler(contentHandler);
        String input = "<root><salutation>Hello Sir</salutation><userName /><parent><child>a</child></parent><parent><child>b</child></parent><success>yes</success><hoursSpent /></root>";
        InputStream stream = new ByteArrayInputStream(input.getBytes());
        xmlReader.parse(new InputSource(stream));
        System.out.println(contentHandler.getUserName()); //prints Hello Sir instead of null
        System.out.println(contentHandler.getHoursSpent()); //prints yes instead of null
    }
}
If an empty XML element is written in self-closing form, as below,
<userName />
instead of <userName></userName>, then the characters() method in the handler class is not called and the wrong value is set. This issue occurs only when I pass the XML as an input stream. Please help me solve this issue.

The parser is behaving exactly as specified; it is your code that is wrong.
In general the parser makes zero-to-many calls on the characters() method between a start tag and the corresponding end tag. You need to initialize an empty buffer in startElement(), append to the buffer in characters(), and then use the accumulated value in endElement().
The way you have written it, you will not only get the wrong result for an empty element, you will also get the wrong result if the parser breaks the text up into multiple calls, which often happens if (a) there are entity references in the text, or (b) the text is very long, or (c) the text happens to span two chunks that are read from the input stream in separate read() calls.
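The buffering pattern described above could look roughly like the following. This is a sketch rather than a drop-in replacement for the handler in the question: the class name BufferingHandler, the parse helper and the map of results are assumptions made for illustration, and the reset in startElement means it only suits elements that contain plain text (no mixed content).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: stores the text of every element under its qName.
class BufferingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // fresh, empty buffer for every element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may be called zero, one or many times
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        values.put(qName, buffer.toString().trim()); // empty element yields ""
        buffer.setLength(0);
    }

    // Parses the given XML text with the JDK's default (non-namespace-aware) SAX parser.
    static Map<String, String> parse(String xml) throws Exception {
        BufferingHandler handler = new BufferingHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.values;
    }
}
```

With the input string from the question, the entries for userName and hoursSpent come out as empty strings instead of inheriting the text of the preceding element.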

adding items to existing svg image

Good afternoon everyone,
I am currently working on a desktop tool written in Java. I use a framework (PlantUML) to obtain a graph in SVG format. However, I want to change an existing item (or component, so to speak) in the SVG output and display it in another way (e.g. by adding details). I have done some research and found two frameworks that might help me achieve this:
1.) http://xmlgraphics.apache.org/batik/
2.) http://svgsalamander.java.net/
Questions:
Is there any other framework that will help me manipulate an existing SVG?
Which one should I use, and how should I use it? I am kind of lost and don't know where exactly to start.
The assumption is that I cannot change anything about PlantUML, so the only thing I have is an .svg image.
Regards ...

I have heard about Batik and I know it is quite popular, but I have never used it. In the past I had to generate or alter SVG programmatically a few times, in Java, JavaScript or C++. I always did it by hand, which means one of:
Using Java's standard org.w3c.dom or some other DOM library;
Using Java's standard org.xml.sax or some other SAX library.
SVG is an XML application, so it is very easy to manipulate using a generic XML API like the two listed above. You basically load the SVG file and start adding/removing/altering elements by calling appropriate methods of the API.
Here is a little example using SAX for Java. I realized that I'm a little rusty; the code seems to work, but cleanup by SAX/XML/Java gurus is welcome. It operates on an SVG file generated with Inkscape, a vector editing program, but the concepts discussed here apply to any SVG (or even XML in general) document. Basically it works by altering a stream of XML events: it inserts a numbered label next to every object in the drawing.
SAX is event based; events are raised during XML parsing. The code handles 3 distinct events:
startDocument is raised at the beginning of the XML document; we use this event to reset the progressive counter.
startElement is raised at the beginning of an XML element; if the element is a path (a common element for describing shapes in SVG) we take note of its position (cx, cy; the sodipodi prefix comes from Sodipodi, the editor from which Inkscape was forked).
endElement is raised at the end of an XML element; if the element is a path, we raise events ourselves that generate the label. We use the SVG element text to add the label to the document.
import java.io.File;
import java.io.FileInputStream;
import javax.xml.transform.Result;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;
public class SVGMod {
    public static void main(String argv[]) {
        try {
            // serializer that writes the filtered SAX events to Output.svg
            SAXTransformerFactory factory = (SAXTransformerFactory)SAXTransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            Result result = new StreamResult(new File("Output.svg"));
            serializer.setResult(result);
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.setFeature("http://xml.org/sax/features/namespaces", true);
            reader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
            XMLFilter filter = new XMLFilterImpl() {
                private int x;
                private int y;
                private int cnt;
                @Override
                public void startDocument() throws SAXException {
                    super.startDocument();
                    cnt = 0;
                }
                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes atts) throws SAXException {
                    super.startElement(uri, localName, qName, atts);
                    if (qName.equals("path")) {
                        // remember the position of the shape for the label
                        int xIndex = atts.getIndex("sodipodi:cx");
                        int yIndex = atts.getIndex("sodipodi:cy");
                        if (xIndex != -1 && yIndex != -1) {
                            x = (int)Float.parseFloat(atts.getValue(xIndex));
                            y = (int)Float.parseFloat(atts.getValue(yIndex));
                            ++cnt;
                        }
                    }
                }
                @Override
                public void endElement(String uri, String localName,
                        String qName) throws SAXException {
                    super.endElement(uri, localName, qName);
                    if (qName.equals("path")) {
                        // emit a <text> element right after the path
                        AttributesImpl atts = new AttributesImpl();
                        atts.addAttribute(uri, "", "x", "CDATA", Integer.toString(x));
                        atts.addAttribute(uri, "", "y", "CDATA", Integer.toString(y));
                        atts.addAttribute(uri, "", "fill", "CDATA", "red");
                        super.startElement(uri, "", "text", atts);
                        char[] chars = ("Object #: " + cnt).toCharArray();
                        super.characters(chars, 0, chars.length);
                        super.endElement(uri, "", "text");
                    }
                }
            };
            filter.setContentHandler(serializer);
            filter.setParent(reader);
            filter.parse(new InputSource(new FileInputStream("Input.svg")));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
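The DOM route mentioned at the start of this answer can be sketched in a similar way. This is an illustration only, using the JDK's org.w3c.dom and javax.xml.transform classes; the class SvgDomLabel and the method addLabel are hypothetical names, and the sketch simply appends one text element to the root instead of locating a particular shape.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: adds a red text label to an SVG document held in a string.
class SvgDomLabel {
    static final String SVG_NS = "http://www.w3.org/2000/svg";

    static String addLabel(String svg, int x, int y, String label) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // SVG elements live in the SVG namespace
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(svg)));

        // Create <text x=".." y=".." fill="red">label</text> in the SVG namespace.
        Element text = doc.createElementNS(SVG_NS, "text");
        text.setAttribute("x", Integer.toString(x));
        text.setAttribute("y", Integer.toString(y));
        text.setAttribute("fill", "red");
        text.setTextContent(label);
        doc.getDocumentElement().appendChild(text);

        // Serialize the modified tree back to a string with an identity transform.
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
            .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

DOM keeps the whole document in memory, which is usually acceptable for diagrams and makes it easier to search for a specific element before changing it; the SAX filter above is the better fit for very large files.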

You might want to consider modifying PlantUML anyway.
In the PlantUML forum
* http://sourceforge.net/apps/phpbb/plantuml/viewforum.php?f=1
you could ask for a plugin architecture that allows the SVG to be modified in line with how PlantUML works. That way your changes will be more compatible with what PlantUML does than if you take your own modification approach.

What is the localName of this XML string?

I was going to ask an entirely different question, but magically managed to solve that. So, a new problem.
So I'm working on an Android app with a SAX parser. I have an XML file that contains mostly
<content:encoded>bla bla bla</content:encoded>
And I know I can use localName encoded to get that one.
@Override
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (localName.equalsIgnoreCase(descriptionId))
    {
        if (isItem){descriptionList.add(buff.toString());}
    }
... etc etc
but then there's this:
<enclosure url="SOME URL" length="100623688" type="audio/mpeg"/>
And I want to extract SOME URL. Does anyone know how I would do that?
Thanks a lot,

I have never developed for Android, but if I understand you correctly, you need to read the attributes of that XML element.
The startElement method of your SAX handler receives an argument of type Attributes (at least this is what I remember from the Xerces SAX parser).
That Attributes object contains the various... attributes of the element =)
It lets you look values up by name or by index; a quick debugging session will show you what it holds.
Hope it helps.
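For the enclosure element from the question, that could look like the sketch below. It assumes the JDK's default, non-namespace-aware SAX parser, so the element name arrives in qName; on a namespace-aware parser (the question compares localName) the check would use localName instead. The class EnclosureHandler and its extract method are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the url attribute of every <enclosure> element.
class EnclosureHandler extends DefaultHandler {
    final List<String> urls = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("enclosure".equals(qName)) {
            String url = atts.getValue("url"); // null when the attribute is absent
            if (url != null) {
                urls.add(url);
            }
        }
    }

    // Parses the given XML text and returns all enclosure URLs in document order.
    static List<String> extract(String xml) throws Exception {
        EnclosureHandler handler = new EnclosureHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.urls;
    }
}
```

Attributes are only available in startElement, which is why the lookup cannot be done in the endElement method shown in the question.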

Here SOME URL is the value of the attribute url which belongs to the enclosure tag.
Here is a sample taken from
http://www.exampledepot.com/egs/org.xml.sax/GetAttr.html
// Create a handler for SAX events
DefaultHandler handler = new MyHandler();
// Parse an XML file using SAX;
// The Quintessential Program to Parse an XML File Using SAX
parseXmlFile("infilename.xml", handler, true);
// This class listens for startElement SAX events
static class MyHandler extends DefaultHandler {
    // This method is called when an element is encountered
    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) {
        // Get the number of attributes
        int length = atts.getLength();
        // Process each attribute
        for (int i=0; i<length; i++) {
            // Get names and values for each attribute
            String name = atts.getQName(i);
            String value = atts.getValue(i);
            // The following methods are valid only if the parser is namespace-aware
            // The uri of the attribute's namespace
            String nsUri = atts.getURI(i);
            // This is the name without the prefix
            String lName = atts.getLocalName(i);
        }
    }
}

SAX Parser : Retrieving HTML tags from XML

I have an XML document to be parsed, which is given below
<feed>
<feed_id>12941450184d2315fa63d6358242</feed_id>
<content> <fieldset><table cellpadding='0' border='0' cellspacing='0' style="clear :both"><tr valign='top' ><td width='35' ><a href='http://mypage.rediff.com/android/32868898' class='space' onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.113&pos=0&feed_id=12941450184d2315fa63d6358242&prc_id=32868898&rowid=674061088')" ><div style='width:25px;height:25px;overflow:hidden;'><img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb' width='25' vspace='0' /></div></a></td> <td><span><a href='http://mypage.rediff.com/android/32868898' class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.113&pos=0&feed_id=12941450184d2315fa63d6358242&prc_id=32868898&rowid=674061088')" >Android </a> </span><span style='color:#000000 !important;'>testing</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/></content>
<action>status updated</action>
</feed>
The <content> tag contains HTML, which holds the data I need. I am using a SAX parser. Here's what I am doing:
private Timeline timeLine; //Object
private String tempStr;
public void characters(char[] ch, int start, int length)
        throws SAXException {
    tempStr = new String(ch, start, length);
}
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (localName.equalsIgnoreCase("content")) {
        if (timeLine != null) {
            timeLine.setContent(tempStr);
        }
    }
}
Will this logic work? If not, how should I extract the embedded HTML data from the XML using a SAX parser?

You can parse the HTML with SAX too, as long as it is well-formed XML (XHTML). There is a similar question on Stack Overflow that you can try: How to parse the html content in android using SAX PARSER

On start element:
    if the element is content, initialize your tempStr buffer;
    else if content has already started, capture the current start element and its attributes and append them to the tempStr buffer.
On characters:
    if content has started, append the characters to the buffer.
On end element:
    if content has started, capture the end tag and append it to the buffer.
My assumption:
    the XML will have only one content tag.
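The steps above can be sketched as the following handler. It is an illustration under a few assumptions: the embedded markup is well-formed XML, the JDK's default non-namespace-aware parser is used (so names arrive in qName), and empty elements are written back as an open/close pair. The names InnerMarkupHandler and extractContent are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: rebuilds the markup found inside the <content> element.
class InnerMarkupHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private int depth = 0; // 0 = outside <content>, 1 = directly inside it

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            // Nested element: write its start tag and attributes back out.
            buffer.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                buffer.append(' ').append(atts.getQName(i))
                      .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            buffer.append('>');
            depth++;
        } else if ("content".equals(qName)) {
            buffer.setLength(0);
            depth = 1;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            buffer.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 0) {
            return; // element outside <content>
        }
        depth--;
        if (depth > 0) {
            buffer.append("</").append(qName).append('>');
        }
        // depth == 0 here means </content> was reached and the buffer is complete
    }

    // The parser hands over decoded text, so special characters are re-escaped.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String extractContent(String xml) throws Exception {
        InnerMarkupHandler handler = new InnerMarkupHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.buffer.toString();
    }
}
```

Tracking a depth counter instead of a boolean keeps the handler correct even if the embedded markup itself contains an element named content.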

If the html is actually xhtml, you can parse it using SAX and extract the xhtml contents of the <content> tag, but not nearly this easily.
You would have to make your handler actually respond to the events that will be raised by all the xhtml tags inside the <content> tag, and either build something resembling a DOM structure, which you could then serialize back out to xml form, or on-the-fly directly write into an xml string buffer replicating the contents.
If you modify your XML so that the HTML inside the content tag is wrapped in a CDATA section, as suggested in How to parse the html content in android using SAX PARSER, something not too far from your code should indeed work.
But you can't just put the contents into your String tempStr variable in the characters method as you're doing. You'll need to have a startElement method that initializes a buffer for the string on seeing the <content> tag, collect into that buffer in the characters method, and put the result somewhere in the endElement for the <content> tag.

I found a solution this way:
Note: In this solution I want to get the HTML content between <chapter> tags (<chapter> ... html content ... </chapter>)
DefaultHandler handler = new DefaultHandler() {
    boolean chap = false;
    public char[] temp;
    int chapterStart;
    int chapterEnd;
    public void startElement(String uri, String localName,
            String qName, Attributes attributes)
            throws SAXException {
        System.out.println("Start Element :" + qName);
        if (qName.equalsIgnoreCase("chapter")) {
            chap = true;
        }
    }
    public void endElement(String uri, String localName,
            String qName) throws SAXException {
        if (qName.equalsIgnoreCase("chapter")) {
            System.out.println(new String(temp, chapterStart, chapterEnd-chapterStart));
        }
        System.out.println("End Element :" + qName);
    }
    public void characters(char ch[], int start, int length)
            throws SAXException {
        if (chap) {
            temp = ch;
            chapterStart = start;
            chap = false;
        }
        chapterEnd = start + length;
    }
};
Update:
My code has a bug, because the length of ch[] passed to the handler varies between calls, so offsets remembered from one call are not reliable in a later one!
