I have a rich text document (.rtf or .doc) containing many data elements that need to be read and converted into structured data objects, either XML or JSON. These documents follow a fixed data format. Are there any Java libraries I can use for the conversion? Has anyone come across this type of scenario?
Has anyone tried Apache POI or Apache Tika to convert such documents into XML?
I'd break this task into two parsers and two serializers:
1. Parse RTF to a Java model
2. Parse DOC to a Java model
3. Serialize the Java model to XML
4. Serialize the Java model to JSON
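For 1 and 2 it's pretty standard to use POI. (POI's HWPF module covers .doc; as far as I know POI has no RTF parser, so for RTF look at Apache Tika or a dedicated RTF parser.)
For 3 and 4 you have many more options; a popular one is Jackson.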
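As a rough illustration of steps 1 and 3 using only the JDK (neither POI nor Jackson is used here), the sketch below extracts plain text from RTF with Swing's `RTFEditorKit`, collects `Key: Value` lines into a map, and writes that map as XML through the standard DOM and Transformer APIs. The `Key: Value` layout, the class name `RtfToXmlSketch`, and the `record`/`field` element names are assumptions made for the example; real documents will need their own extraction rules.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RtfToXmlSketch {

    // Step 1: parse RTF into a simple Java model (here, an ordered map).
    // Assumption: each data element sits on its own line as "Key: Value".
    public static Map<String, String> parseRtf(byte[] rtf) throws Exception {
        RTFEditorKit kit = new RTFEditorKit();
        DefaultStyledDocument styled = new DefaultStyledDocument();
        kit.read(new ByteArrayInputStream(rtf), styled, 0);
        String text = styled.getText(0, styled.getLength());

        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(),
                           line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    // Step 3: serialize the model to XML, one <field name="..."> per entry.
    public static String toXml(Map<String, String> fields) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = dom.createElement("record");
        dom.appendChild(root);
        for (Map.Entry<String, String> e : fields.entrySet()) {
            Element field = dom.createElement("field");
            field.setAttribute("name", e.getKey());
            field.setTextContent(e.getValue());
            root.appendChild(field);
        }
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Tiny hand-written RTF sample; a real file would be read from disk.
        String rtf = "{\\rtf1\\ansi Name: Alice\\par Age: 30\\par}";
        Map<String, String> fields = parseRtf(rtf.getBytes(StandardCharsets.US_ASCII));
        System.out.println(toXml(fields));
    }
}
```

Keeping the intermediate model separate from both the parser and the serializer means a JSON writer (step 4) can be added later without touching the RTF side.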
I'd suggest looking at RTF Parser Kit which you can use to populate a Java data structure suitable for further processing or persistence.
Related
I work with Java and have a bunch of long XML files. I have to filter the documents by several fields, then extract other fields and create a Java object from them. How should I do this: SAX, JDOM, XSLT, XPath, JAXP, etc.? Sorry, I'm not familiar with XML technologies. Can you give me a general architecture for my use case?
You are asking about XML parsing on the Java platform.
You can parse XML using one of the Java XML parsers, and many are available: the DOM parser, JDOM, SAX, or StAX. The DOM, SAX, and StAX parsers are all part of JAXP.
I prefer the streaming parsers (SAX or StAX) for long files. Their advantage over DOM parsers is that they read the document piece by piece as needed instead of loading all of it into memory.
This site is a good place to learn the concepts:
Java XML Parsers - Tutorials Point
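A minimal sketch of one way to filter and extract in a single pass, using the StAX streaming parser (`javax.xml.stream`) that ships with the JDK: it keeps only records whose filter field matches and builds a small Java object from the other fields. The `<book>`, `<genre>`, `<title>`, and `<price>` element names and the `Book` class are hypothetical, chosen only to make the example concrete.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxFilterSketch {

    // Hypothetical target object built from the extracted fields.
    public static final class Book {
        public final String title;
        public final double price;

        Book(String title, double price) {
            this.title = title;
            this.price = price;
        }
    }

    // Streams through the XML once and returns the books whose <genre>
    // equals wantedGenre. Only one record is held in memory at a time.
    public static List<Book> extract(String xml, String wantedGenre) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<Book> result = new ArrayList<>();
        String genre = null;
        String title = null;
        String price = null;

        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("book")) {
                    // New record: reset the collected fields.
                    genre = null;
                    title = null;
                    price = null;
                } else if (name.equals("genre")) {
                    genre = reader.getElementText();
                } else if (name.equals("title")) {
                    title = reader.getElementText();
                } else if (name.equals("price")) {
                    price = reader.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("book")) {
                // Record complete: apply the filter, then build the object.
                if (wantedGenre.equals(genre) && title != null && price != null) {
                    result.add(new Book(title, Double.parseDouble(price.trim())));
                }
            }
        }
        reader.close();
        return result;
    }
}
```

For long files you would pass a `FileInputStream` or `Reader` on the file instead of a `String`; the loop itself stays the same.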
Could anyone please recommend a tutorial, or tell me how I can build a Java program that extracts information from XML files and produces the output as RDF triples using an existing ontology? An example would be really helpful.
Thanks
There are ready-made tools that address this problem, such as XSPARQL. You can write an XSPARQL query that queries the XML and produces RDF triples as output. This example should be pretty close to what you're looking for.
Your problem is really two problems:
parsing XML
writing RDF
For Java XML parsing, there are numerous examples on the web:
Java and XML - Tutorial
Java Examples in a Nutshell, Chapter 19, XML
Working with XML: The Java/XML Tutorial
For RDF there are fewer resources; it's a much more specialized field:
What are some good Java RDF libraries?
In the past I worked with Jena – it offers a friendly API to the semantic web stack.
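To make the two steps concrete, here is a small JDK-only sketch: it parses XML with DOM and produces one N-Triples line per child element. It does not use Jena or XSPARQL; in practice you would hand the same subject, predicate, and object values to an RDF library. The `http://example.org/` namespace, the `<person id="...">` input shape, and the mapping of element names to ontology properties are all assumptions, and the `id` values are assumed to be safe to place in an IRI.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlToTriplesSketch {

    // Hypothetical namespace standing in for the existing ontology.
    static final String NS = "http://example.org/";

    // Problem 1: parse the XML. Problem 2: write RDF (as N-Triples lines).
    public static List<String> toTriples(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> triples = new ArrayList<>();

        NodeList people = doc.getElementsByTagName("person");
        for (int i = 0; i < people.getLength(); i++) {
            Element person = (Element) people.item(i);
            String subject = "<" + NS + "person/" + person.getAttribute("id") + ">";

            NodeList children = person.getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and comments
                }
                // Assumption: each element name maps to an ontology property.
                String predicate = "<" + NS + "ontology#" + child.getNodeName() + ">";
                String object = "\"" + escape(child.getTextContent().trim()) + "\"";
                triples.add(subject + " " + predicate + " " + object + " .");
            }
        }
        return triples;
    }

    // Escapes the characters that must not appear raw in an N-Triples literal.
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<people><person id=\"1\"><name>Ann</name><age>41</age></person></people>";
        for (String triple : toTriples(xml)) {
            System.out.println(triple);
        }
    }
}
```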
I would recommend the XmlToRdf Java library.
XmlToRdf offers incredibly fast conversion by using the built-in Java SAX parser to stream-convert your XML file to RDF. A vast selection of configurations (with sane defaults) makes it simple to adjust the conversion for your needs, including element renaming and advanced IRI generation with composite identifiers.
Output from the conversion can be written directly to file as RDF Turtle or added to a Sesame Repository or Jena Dataset for further processing. With Sesame and Jena it is possible to do further SPARQL-based transformations on the data and output to formats such as RDF Turtle and JSON-LD.
Java serializes objects in a well-known and published manner (there's a spec).
What I'm looking for is a library that can parse a binary blob of serialized objects into something like a graph of Apache BeanUtils DynaBeans.
Such a library would be useful when I want to "read" (and work with) serialized objects without having the classes themselves on the classpath (or, as in my case, because the classes were refactored and renamed, rendering old data unreadable ...)
What is wrong with XML / JSON / BSON? They are well-defined, widely accepted, and language-agnostic formats, and there are a ton of serialisation libraries in different flavours.
Normally I would use JAXB, XMLBeans or Simple to convert an XML file to a Java object.
In this case, however, I can only use Java 5 and no external libraries (for several reasons).
What is the best way to do that? My XML input is very simple. What is the most flexible and elegant way to get the XML into a Java object? (I don't really need real JavaBeans, since I only need getters.)
Thanks!
Well, you can do that using a DOM implementation.
Java 5 provides JAXP, which includes DOM and SAX.
Which one to use depends largely on how big the XML document is and how fast you need to access elements. DOM will put the whole XML structure into memory, while SAX provides a serial streaming approach.
The most flexible way to do data binding is by using XPath; see the article below:
http://onjava.com/pub/a/onjava/2007/09/07/schema-less-java-xml-data-binding-with-vtd-xml.html
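A minimal sketch of XPath-based binding that relies only on the JAXP classes bundled since Java 5 (`javax.xml.parsers` and `javax.xml.xpath`), so no external library is needed. It uses the standard DOM parser, not the VTD-XML library from the linked article. The `<customer>` document layout and the getter-only `Customer` class are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathBindingSketch {

    // Hypothetical read-only value object: getters only, no setters.
    public static final class Customer {
        private final String name;
        private final int age;

        Customer(String name, int age) {
            this.name = name;
            this.age = age;
        }

        public String getName() {
            return name;
        }

        public int getAge() {
            return age;
        }
    }

    // Loads the XML into a DOM tree, then pulls each field out by XPath.
    public static Customer read(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // evaluate(expression, item) returns the string value of the first match.
        String name = xpath.evaluate("/customer/name", doc);
        String age = xpath.evaluate("/customer/age", doc);

        return new Customer(name.trim(), Integer.parseInt(age.trim()));
    }

    public static void main(String[] args) throws Exception {
        Customer c = read("<customer><name>Bob</name><age>42</age></customer>");
        System.out.println(c.getName() + " " + c.getAge());
    }
}
```

Because each field is addressed by its own path expression, a change in the XML layout usually means editing one string, which is where the flexibility comes from.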
I need to read an XML file using Java. Its contents are something like
<ReadingFile>
<csvFile>
<fileName>C:/Input.csv</fileName>
<delimiter>COMMA</delimiter>
<tableFieldNamesList>COMPANYNAME|PRODUCTNAME|PRICE</tableFieldNamesList>
<fieldProcessorDescriptorSize>20|20|20</fieldProcessorDescriptorSize>
<fieldName>company_name|product_name|price</fieldName>
</csvFile>
</ReadingFile>
Is there any special reader/JARs or should we read using FileInputStream?
Check out Java's JAXP APIs which come as standard. You can read the XML in from the file into a DOM (object model), or as SAX - a series of events (your code will receive an event for each start-of-element, end-of-element etc.). For both DOM and SAX, I would look at an API tutorial to get started.
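For instance, a DOM-based read of the file shown in the question could look roughly like this. The element names come from the question; the class and method names, and the decision to split the pipe-separated values into lists, are assumptions for illustration.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CsvConfigDomSketch {

    // Child elements of <csvFile> as they appear in the question.
    private static final String[] TAGS = {
        "fileName", "delimiter", "tableFieldNamesList",
        "fieldProcessorDescriptorSize", "fieldName"
    };

    // Returns the trimmed text of the first descendant with the given tag,
    // or an empty string when the element is missing.
    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // Parses the whole document into memory and maps each tag to its
    // pipe-separated values (single-valued tags yield a one-element list).
    public static Map<String, List<String>> read(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in);
        Element csvFile = (Element) doc.getElementsByTagName("csvFile").item(0);

        Map<String, List<String>> config = new LinkedHashMap<>();
        for (String tag : TAGS) {
            config.put(tag, Arrays.asList(text(csvFile, tag).split("\\|")));
        }
        return config;
    }

    public static void main(String[] args) throws Exception {
        // A plain FileInputStream is enough; the DOM parser does the reading.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(read(in));
        }
    }
}
```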
Alternatively, you may find JDOM easier/more intuitive to use.
Another suggestion: try out Commons Digester. This allows you to develop parsing code very quickly using a rule-based approach. There's a tutorial here and the library is available here.
I also agree with Brian and Alzoid in that JAXB is great to get you up and running quickly. You can use the xjc binding compiler that ships with the JDK to auto generate your Java classes given an XML schema.
XStream would do very nicely here. Check out the one-page tutorial.
You can use external libraries like
Castor https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1046622.html
I have used Castor in the past. Here are a few other links that might help.
http://www.xml-training-guide.com/e-xml27.html
http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/XMLReader.html
http://www.cafeconleche.org/books/xmljava/chapters/ch07.html
There are two major ways to parse XML with Java. The first is to use a SAX parser (see here), which is fairly simple.
The second option is to use a DOM parser (see here), which is more complicated but gives you more control.
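To show what the SAX option looks like in practice, here is a minimal sketch using the JDK's `SAXParser` with a `DefaultHandler` that records the text of every leaf element. The class name and the choice to store results in a map keyed by element name are assumptions; repeated element names would overwrite each other in this simple form.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxLeafReaderSketch {

    // Receives parser events and keeps the text content of each leaf element.
    public static Map<String, String> read(String xml) throws Exception {
        final Map<String, String> values = new LinkedHashMap<>();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                text.setLength(0); // start collecting text for this element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append.
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = text.toString().trim();
                if (!value.isEmpty()) {
                    values.put(qName, value);
                }
                text.setLength(0);
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ReadingFile><csvFile><fileName>C:/Input.csv</fileName>"
                + "<delimiter>COMMA</delimiter></csvFile></ReadingFile>";
        System.out.println(read(xml));
    }
}
```

The handler never sees the whole tree, which keeps memory use low; the price is that any context (which parent an element belongs to) has to be tracked by hand, and that is the control DOM gives you for free.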
JAXB is another technology that might suit your needs.