Apache Tika and Apache Solr integration through Java API

I am trying to integrate Apache Tika and Apache Solr so that I can index my parsed data. I'm using Solr version 4.3.1 and Tika version 2.11.6.
The code I am following is:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.UUID;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.DublinCore;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
public class Main {
private static SolrServer solr;
public static void main(String[] args) throws IOException, SAXException, TikaException {
try {
solr = new HttpSolrServer("http://localhost:8983/solr/#/"); //create solr connection
//solr.deleteByQuery( "*:*" ); //delete everything in the index; good for testing
//location of source documents
//later this will be switched to a database
String path = "C:\\content\\";
String file_html = path + "mobydick.htm";
String file_txt = path + "/home/ben/abc.warc";
String file_pdf = path + "callofthewild.pdf";
processDocument(file_html);
processDocument(file_txt);
processDocument(file_pdf);
solr.commit(); //after all docs are added, commit to the index
//now you can search at http://localhost:8983/solr/browse
}
catch (Exception ex) {
System.out.println(ex.getMessage());
}
}
private static void processDocument(String pathfilename) {
try {
InputStream input = new FileInputStream(new File(pathfilename));
//use Apache Tika to convert documents in different formats to plain text
ContentHandler textHandler = new BodyContentHandler(10*1024*1024);
Metadata meta = new Metadata();
Parser parser = new AutoDetectParser();
//handles documents in different formats:
ParseContext context = new ParseContext();
parser.parse(input, textHandler, meta, context); //convert to plain text
//collect metadata and content from Tika and other sources
//document id must be unique, use a GUID
UUID guid = java.util.UUID.randomUUID();
String docid = guid.toString();
//Dublin Core metadata (partial set)
String doctitle = meta.get(DublinCore.TITLE);
String doccreator = meta.get(DublinCore.CREATOR);
//other metadata
String docurl = pathfilename; //document url
//content
String doccontent = textHandler.toString();
//call to index
indexDocument(docid, doctitle, doccreator, docurl, doccontent);
}
catch (Exception ex) {
System.out.println(ex.getMessage());
}
}
private static void indexDocument(String docid, String doctitle, String doccreator, String docurl, String doccontent) {
try {
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", docid);
//map metadata fields to default schema
//location: path\solr-4.7.2\example\solr\collection1\conf\schema.xml
//Dublin Core
//thought: schema could be modified to use Dublin Core
doc.addField("title", doctitle);
doc.addField("author", doccreator);
//other metadata
doc.addField("url", docurl);
//content (and text)
//per schema, the content field is not indexed by default, used for returning and highlighting document content
//the schema "copyField" command automatically copies this to the "text" field which is indexed
doc.addField("content", doccontent);
//indexing
//when a field is indexed, like "text", Solr will handle tokenization, stemming, removal of stopwords etc, per the schema defn
//add to index
solr.add(doc);
}
catch (Exception ex) {
System.out.println(ex.getMessage());
} } }
The error I got:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/NoHttpResponseException
at Main.main(Main.java:28)
Caused by: java.lang.ClassNotFoundException: org.apache.http.NoHttpResponseException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more

This does not seem to be Tika-related, so I would focus on Solr.
Specifically, look at how you included the Solr libraries and their dependencies. If you pulled them in through Maven, they should have worked. But if you added them manually, perhaps you missed one or two.
The error message is about a missing class, org.apache.http.NoHttpResponseException, which is distributed with the Apache HttpComponents library (the httpcore jar). Perhaps you are missing it from your dependencies or from the classpath.
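One way to confirm which dependency is missing is to probe the classpath directly. The sketch below is only an illustration: the class name ClasspathCheck and its helper method are made up here and are not part of Solr or Tika. It tries to locate the class named in the stack trace and reports whether the running JVM can see it.

```java
// Hypothetical helper (not part of Solr or Tika): checks whether a class is visible on the classpath.
class ClasspathCheck {

    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class reported as missing in the stack trace above
        String missing = "org.apache.http.NoHttpResponseException";
        System.out.println(missing + " available: " + isClassAvailable(missing));
    }
}
```

If this prints false when run with the same classpath as the indexing program, the jar that provides the class has not been added yet.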

Related

Set up URI or catalog resolver with Saxon/XQuery

I am developing a simple command line application in Java to mine data from a large XML data set (15,000+ XML files). I have chosen to use Saxon S9API as the XQuery processor for this. Everything works fine so long as there is open access to the internet where the parser used by Saxon can resolve the xsi:noNamespaceSchemaLocation URI (or any other I will assume).
I have scoured Stack Overflow, as well as general Google searching, for answers on how to provide a catalog to the XQuery processor. I have not found a good explanation of how to do so.
This is the simple code I have at this point, which, as I stated, works fine when there is open access to the Internet:
package ipd.part.info.mining.app;
import java.io.File;
import java.util.List;
import java.util.Scanner;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import net.sf.saxon.Configuration;
import net.sf.saxon.TransformerFactoryImpl;
import net.sf.saxon.s9api.DOMDestination;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.QName;
import net.sf.saxon.s9api.SaxonApiException;
import net.sf.saxon.s9api.XQueryCompiler;
import net.sf.saxon.s9api.XQueryEvaluator;
import net.sf.saxon.s9api.XQueryExecutable;
import net.sf.saxon.s9api.XdmAtomicValue;
import net.sf.saxon.lib.*;
import static org.apache.xerces.jaxp.JAXPConstants.JAXP_SCHEMA_LANGUAGE;
import static org.apache.xerces.jaxp.JAXPConstants.W3C_XML_SCHEMA;
import org.apache.xerces.util.XMLCatalogResolver;
import org.apache.xml.resolver.tools.CatalogResolver;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
/**
*
* @author tfurst
*/
public class IPDPartInfoMiningApp {
/**
* @param args the command line arguments
*/
private static Scanner scanner = new Scanner(System.in);
private static String ietmPath;
private static String outputPath;
private static CatalogResolver resolver;
private static org.apache.xerces.util.XMLCatalogResolver xres;
private static ErrorHandler eHandler;
private static DocumentBuilderFactory DBF;
private static DocumentBuilder DB;
public static void main(String[] args) {
initDb();
try {
// TODO code application logic here
System.out.println("Enter path to complete IETM Export:");
ietmPath = scanner.nextLine();
System.out.println("Enter path to save report:");
outputPath = scanner.nextLine();
Processor proc = new Processor(true);
XQueryCompiler comp = proc.newXQueryCompiler();
//File xq = fixXquery(new File(XQ));
//XQueryExecutable exp = comp.compile(xq);
XQueryExecutable exp = comp.compile("declare variable $path external;\n" +
"\n" +
"let $coll := collection(concat($path,'?select=*.xml'))//itemSequenceNumber \n" +
"\n" +
"return\n" +
"<parts>\n" +
"{\n" +
" for $mod in $coll\n" +
" let $pn := normalize-space($mod/partNumber)\n" +
" let $nomen := $mod/partIdentSegment[1]/descrForPart\n" +
" let $smr := $mod/locationRcmdSegment/locationRcmd/sourceMaintRecoverability\n" +
" order by $pn\n" +
" return <part pn=\"{$pn}\" nomen=\"{$nomen}\" smr=\"{$smr}\"/>\n" +
"}\n" +
"</parts>");
//Serializer out = proc.newSerializer(System.out);
Document dom = DB.newDocument();
XQueryEvaluator ev = exp.load();
ev.setExternalVariable(new QName("path"), new XdmAtomicValue(new File(ietmPath).toPath().toUri().toString().substring(0, new File(ietmPath).toPath().toUri().toString().lastIndexOf("/"))));
ev.run(new DOMDestination(dom));
TransformerFactoryImpl tfact = new net.sf.saxon.TransformerFactoryImpl();
Transformer trans = tfact.newTransformer();
DOMSource src = new DOMSource(dom);
StreamResult res = new StreamResult(new File(outputPath + File.separator + "output.xml"));
trans.transform(src, res);
} catch (SaxonApiException | TransformerException ex) {
Logger.getLogger(IPDPartInfoMiningApp.class.getName()).log(Level.SEVERE, null, ex);
}
}
private static XMLCatalogResolver createXMLCatalogResolver(CatalogResolver resolver)
{
int i = 0;
List files = resolver.getCatalog().getCatalogManager().getCatalogFiles();
String[] catalogs = new String[files.size()];
XMLCatalogResolver xcr = new XMLCatalogResolver();
for(Object file : files)
{
catalogs[i++] = new File(file.toString()).getAbsolutePath(); // advance the index so each catalog gets its own slot
}
xcr.setCatalogList(catalogs);
return xcr;
}
private static void initDb()
{
try
{
resolver = new CatalogResolver();
eHandler = new DocumentErrorHandler();
xres = createXMLCatalogResolver(resolver);
DBF = DocumentBuilderFactory.newInstance();
DBF.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
DBF.setNamespaceAware(true);
DB = DBF.newDocumentBuilder();
DB.setEntityResolver(xres);
DB.setErrorHandler(eHandler);
}
catch (ParserConfigurationException ex)
{
ex.printStackTrace();
}
}
}
I am receiving this error when I disconnect my machine from the network:
C:\Users\tfurst\Desktop\XQuery Test\testXml\test\tool>java -jar IPD_Part_Info_Mining_App.jar
Enter path to complete IETM Export:
C:\Users\tfurst\Desktop\Wire Repl Testing
Enter path to save report:
C:\Users\tfurst\Desktop\Wire Repl Testing\report
Error on line 6 column 2
collection(): failed to parse XML file
file:/C:/Users/tfurst/Desktop/Wire%20Repl%20Testing/DMC-HH60W-A-52-21-0001-04AAA-520A-B.xml: I/O error reported by XML parser processing file:/C:/Users/tfurst/Desktop/Wire%20Repl%20Testing/DMC-HH60W-A-52-21-0001-04AAA-520A-B.xml: Read timed out
Aug 20, 2019 2:55:23 PM ipd.part.info.mining.app.IPDPartInfoMiningApp main
SEVERE: null
net.sf.saxon.s9api.SaxonApiException: collection(): failed to parse XML file file:/C:/Users/tfurst/Desktop/Wire%20Repl%20Testing/DMC-HH60W-A-52-21-0001-04AAA-520A-B.xml: I/O error reported by XML parser processing file:/C:/Users/tfurst/Desktop/Wire%20Repl%20Testing/DMC-HH60W-A-52-21-0001-04AAA-520A-B.xml: Read timed out
at net.sf.saxon.s9api.XQueryEvaluator.run(XQueryEvaluator.java:372)
at ipd.part.info.mining.app.IPDPartInfoMiningApp.main(IPDPartInfoMiningApp.java:80)
Caused by: net.sf.saxon.trans.XPathException: collection(): failed to parse XML file file:/C:/Users/tfurst/Desktop/Wire%20Repl%20Testing/DMC-HH60W-A-52-21-0001-04AAA-520A-B.xml: I/O error reported by XML parser processing file:/C:/Users/tfurst/Desktop/Wire%20Repl%20Testing/DMC-HH60W-A-52-21-0001-04AAA-520A-B.xml: Read timed out
at net.sf.saxon.resource.XmlResource.getItem(XmlResource.java:113)
at net.sf.saxon.functions.CollectionFn$2.mapItem(CollectionFn.java:246)
at net.sf.saxon.expr.ItemMappingIterator.next(ItemMappingIterator.java:113)
at net.sf.saxon.expr.ItemMappingIterator.next(ItemMappingIterator.java:108)
at net.sf.saxon.expr.ItemMappingIterator.next(ItemMappingIterator.java:108)
at net.sf.saxon.om.FocusTrackingIterator.next(FocusTrackingIterator.java:85)
at net.sf.saxon.expr.ContextMappingIterator.next(ContextMappingIterator.java:59)
at net.sf.saxon.expr.sort.DocumentOrderIterator.<init>(DocumentOrderIterator.java:47)
at net.sf.saxon.expr.sort.DocumentSorter.iterate(DocumentSorter.java:230)
at net.sf.saxon.expr.flwor.ForClausePush.processTuple(ForClausePush.java:34)
at net.sf.saxon.expr.flwor.FLWORExpression.process(FLWORExpression.java:841)
at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:337)
at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:284)
at net.sf.saxon.expr.instruct.Instruction.process(Instruction.java:151)
at net.sf.saxon.query.XQueryExpression.run(XQueryExpression.java:411)
at net.sf.saxon.s9api.XQueryEvaluator.run(XQueryEvaluator.java:370)
... 1 more
C:\Users\tfurst\Desktop\XQuery Test\testXml\test\tool>pause
Press any key to continue . . .
I am sure this is probably a relatively simple fix, most likely something I have overlooked. I know how to handle this when working with XSL transformations, by supplying a catalog and the location of the schemas. Thanks in advance for any help, much appreciated.

To use an XML catalog file, something like the following in your code should work:
Processor proc = new Processor(false); //false for Saxon-HE
XQueryCompiler compiler = proc.newXQueryCompiler();
XmlCatalogResolver.setCatalog("path/catalog.xml", proc.getUnderlyingConfiguration(), false);
...

Error indexing text from Apache Tika in Solr

I am trying to integrate Apache Tika with Solr so that text extracted by Tika could be indexed in Solr.
I tried the following code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.UUID;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.DublinCore;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
public class Main {
private static SolrServer solr;
public static void main(String[] args) throws IOException, SAXException, TikaException {
try {
solr = new HttpSolrServer("http://localhost:8983/solr/#/");
String path = "C:\\content\\";
String file_html = path + "mobydick.htm";
String file_txt = path + "/home/ben/abc.warc";
String file_pdf = path + "callofthewild.pdf";
processDocument(file_html);
processDocument(file_txt);
processDocument(file_pdf);
solr.commit();
}
catch (Exception ex) {
System.out.println(ex.getMessage());
}
}
private static void processDocument(String pathfilename) {
try {
InputStream input = new FileInputStream(new File(pathfilename));
//use Apache Tika to convert documents in different formats to plain text
ContentHandler textHandler = new BodyContentHandler(10*1024*1024);
Metadata meta = new Metadata();
Parser parser = new AutoDetectParser(); //handles documents in different formats:
ParseContext context = new ParseContext();
parser.parse(input, textHandler, meta, context); //convert to plain text
//collect metadata and content from Tika and other sources
//document id must be unique, use a GUID
UUID guid = java.util.UUID.randomUUID();
String docid = guid.toString();
//Dublin Core metadata (partial set)
String doctitle = meta.get(DublinCore.TITLE);
String doccreator = meta.get(DublinCore.CREATOR);
//other metadata
String docurl = pathfilename; //document url
//content
String doccontent = textHandler.toString();
//call to index
indexDocument(docid, doctitle, doccreator, docurl, doccontent);
}
catch (Exception ex) {
System.out.println(ex.getMessage());
}
}
private static void indexDocument(String docid, String doctitle, String
doccreator, String docurl, String doccontent) {
try {
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", docid);
//map metadata fields to default schema
//location: path\solr-4.7.2\example\solr\collection1\conf\schema.xml
//Dublin Core
//thought: schema could be modified to use Dublin Core
doc.addField("title", doctitle);
doc.addField("author", doccreator);
//other metadata
doc.addField("url", docurl);
//content (and text)
//per schema, the content field is not indexed by default, used for returning and highlighting document content
//the schema "copyField" command automatically copies this to the "text" field which is indexed
doc.addField("content", doccontent);
//indexing
//when a field is indexed, like "text", Solr will handle tokenization, stemming, removal of stopwords etc, per the schema defn
//add to index
solr.add(doc);
}
catch (Exception ex) {
System.out.println(ex.getMessage());
}
}
}
Unfortunately, I am hitting the error below:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/NoHttpResponseException
at Main.main(Main.java:28)
Caused by: java.lang.ClassNotFoundException: org.apache.http.NoHttpResponseException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more
Could you please help me resolve this issue?

Amazon Product Advertising API through Java/SOAP

I have been playing with Amazon's Product Advertising API, and I cannot get a request to go through and give me data. I have been working off of this: http://docs.amazonwebservices.com/AWSECommerceService/2011-08-01/GSG/ and this: Amazon Product Advertising API signed request with Java
Here is my code. I generated the SOAP bindings using this: http://docs.amazonwebservices.com/AWSECommerceService/2011-08-01/GSG/YourDevelopmentEnvironment.html#Java
On the Classpath, I only have: commons-codec.1.5.jar
import com.ECS.client.jax.AWSECommerceService;
import com.ECS.client.jax.AWSECommerceServicePortType;
import com.ECS.client.jax.Item;
import com.ECS.client.jax.ItemLookup;
import com.ECS.client.jax.ItemLookupRequest;
import com.ECS.client.jax.ItemLookupResponse;
import com.ECS.client.jax.ItemSearchResponse;
import com.ECS.client.jax.Items;
public class Client {
public static void main(String[] args) {
String secretKey = <my-secret-key>;
String awsKey = <my-aws-key>;
System.out.println("API Test started");
AWSECommerceService service = new AWSECommerceService();
service.setHandlerResolver(new AwsHandlerResolver(
secretKey)); // important
AWSECommerceServicePortType port = service.getAWSECommerceServicePort();
// Get the operation object:
com.ECS.client.jax.ItemSearchRequest itemRequest = new com.ECS.client.jax.ItemSearchRequest();
// Fill in the request object:
itemRequest.setSearchIndex("Books");
itemRequest.setKeywords("Star Wars");
// itemRequest.setVersion("2011-08-01");
com.ECS.client.jax.ItemSearch ItemElement = new com.ECS.client.jax.ItemSearch();
ItemElement.setAWSAccessKeyId(awsKey);
ItemElement.getRequest().add(itemRequest);
// Call the Web service operation and store the response
// in the response object:
com.ECS.client.jax.ItemSearchResponse response = port
.itemSearch(ItemElement);
String r = response.toString();
System.out.println("response: " + r);
for (Items itemList : response.getItems()) {
System.out.println(itemList);
for (Item item : itemList.getItem()) {
System.out.println(item);
}
}
System.out.println("API Test stopped");
}
}
Here is what I get back. I was hoping to see some Star Wars books available on Amazon dumped out to my console:
API Test started
response: com.ECS.client.jax.ItemSearchResponse@7a6769ea
com.ECS.client.jax.Items@1b5ac06e
API Test stopped
What am I doing wrong? (Note that no "item" in the second for loop is being printed out, because it's empty.) How can I troubleshoot this or get relevant error information?

I don't use the SOAP API, but your bounty requirements didn't state that it had to use SOAP, only that you wanted to call Amazon and get results. So I'll post this working example using the REST API, which will at least fulfill your stated requirement:
"I would like some working example code that hits the amazon server and returns results"
You'll need to download the following to fulfill the signature requirements:
http://associates-amazon.s3.amazonaws.com/signed-requests/samples/amazon-product-advt-api-sample-java-query.zip
Unzip it and grab the com.amazon.advertising.api.sample.SignedRequestsHelper.java file and put it directly into your project. This code is used to sign the request.
You'll also need to download Apache Commons Codec 1.3 from the following and add it to your classpath i.e. add it to your project's library. Note that this is the only version of Codec that will work with the above class (SignedRequestsHelper)
http://archive.apache.org/dist/commons/codec/binaries/commons-codec-1.3.zip
Now you can copy and paste the following making sure to replace your.pkg.here with the proper package name and replace the SECRET and the KEY properties:
package your.pkg.here;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
public class Main {
private static final String SECRET_KEY = "<YOUR_SECRET_KEY>";
private static final String AWS_KEY = "<YOUR_KEY>";
public static void main(String[] args) {
SignedRequestsHelper helper = SignedRequestsHelper.getInstance("ecs.amazonaws.com", AWS_KEY, SECRET_KEY);
Map<String, String> params = new HashMap<String, String>();
params.put("Service", "AWSECommerceService");
params.put("Version", "2009-03-31");
params.put("Operation", "ItemLookup");
params.put("ItemId", "1451648537");
params.put("ResponseGroup", "Large");
String url = helper.sign(params);
try {
Document response = getResponse(url);
printResponse(response);
} catch (Exception ex) {
Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
}
}
private static Document getResponse(String url) throws ParserConfigurationException, IOException, SAXException {
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(url);
return doc;
}
private static void printResponse(Document doc) throws TransformerException, FileNotFoundException {
Transformer trans = TransformerFactory.newInstance().newTransformer();
Properties props = new Properties();
props.put(OutputKeys.INDENT, "yes");
trans.setOutputProperties(props);
StreamResult res = new StreamResult(new StringWriter());
DOMSource src = new DOMSource(doc);
trans.transform(src, res);
String toString = res.getWriter().toString();
System.out.println(toString);
}
}
As you can see, this is much simpler to set up and use than the SOAP API. If you don't have a specific requirement for using the SOAP API, then I would highly recommend that you use the REST API instead.
One of the drawbacks of using the REST API is that the results aren't unmarshaled into objects for you. This could be remedied by creating the required classes based on the WSDL.
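Short of generating classes, individual values can be pulled out of a DOM Document with the JDK's XPath API. The following is a minimal sketch under stated assumptions: the class name TitleExtractor, the sample XML, and its namespace URI are invented for illustration and are not the real Amazon response format. Matching on local-name() sidesteps whatever namespace the actual response uses.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: reads one value from an XML response without unmarshaling it.
class TitleExtractor {

    /** Parses an XML string into a namespace-aware DOM Document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the text of the first element named Title, or "" if there is none. */
    static String firstTitle(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("(//*[local-name()='Title'])[1]", doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath expression", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample; the element layout and namespace are assumptions
        String sample = "<Response xmlns=\"http://example.com/hypothetical\">"
                + "<Items><Item><ItemAttributes><Title>Example Book</Title>"
                + "</ItemAttributes></Item></Items></Response>";
        System.out.println(firstTitle(parse(sample)));
    }
}
```

The same firstTitle call could be applied to the Document returned by getResponse(url) in the example above, provided the response actually contains an element with that name.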

This ended up working (I had to add my associateTag to the request):
public class Client {
public static void main(String[] args) {
String secretKey = "<MY_SECRET_KEY>";
String awsKey = "<MY AWS KEY>";
System.out.println("API Test started");
AWSECommerceService service = new AWSECommerceService();
service.setHandlerResolver(new AwsHandlerResolver(secretKey)); // important
AWSECommerceServicePortType port = service.getAWSECommerceServicePort();
// Get the operation object:
com.ECS.client.jax.ItemSearchRequest itemRequest = new com.ECS.client.jax.ItemSearchRequest();
// Fill in the request object:
itemRequest.setSearchIndex("Books");
itemRequest.setKeywords("Star Wars");
itemRequest.getResponseGroup().add("Large");
// itemRequest.getResponseGroup().add("Images");
// itemRequest.setVersion("2011-08-01");
com.ECS.client.jax.ItemSearch ItemElement = new com.ECS.client.jax.ItemSearch();
ItemElement.setAWSAccessKeyId(awsKey);
ItemElement.setAssociateTag("th0426-20");
ItemElement.getRequest().add(itemRequest);
// Call the Web service operation and store the response
// in the response object:
com.ECS.client.jax.ItemSearchResponse response = port
.itemSearch(ItemElement);
String r = response.toString();
System.out.println("response: " + r);
for (Items itemList : response.getItems()) {
System.out.println(itemList);
for (Item itemObj : itemList.getItem()) {
System.out.println(itemObj.getItemAttributes().getTitle()); // Title
System.out.println(itemObj.getDetailPageURL()); // Amazon URL
}
}
System.out.println("API Test stopped");
}
}

It looks like the response object does not override toString(), so if it contains some sort of error response, simply printing it will not tell you what the error is. You'll need to look at the API to see which fields are returned in the response object and print those individually. Either you'll get an obvious error message or you'll have to go back to their documentation to try to figure out what is wrong.
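To illustrate why printing such an object shows only something of the form ClassName@hash, here is a small sketch. Both classes are hypothetical stand-ins, not the generated Amazon bindings: one inherits Object.toString(), the other overrides it.

```java
// Hypothetical demo classes; they only mimic a response object with and without toString().
class ToStringDemo {

    /** Stand-in for a generated binding class that does not override toString(). */
    static class PlainResponse {
        String message = "request failed";
    }

    /** Same data, but toString() is overridden to expose the field. */
    static class DescriptiveResponse {
        String message = "request failed";

        @Override
        public String toString() {
            return "DescriptiveResponse[message=" + message + "]";
        }
    }

    public static void main(String[] args) {
        // Inherited Object.toString(): class name, '@', and a hex hash code
        System.out.println(new PlainResponse());
        // Overridden toString(): the field value is visible
        System.out.println(new DescriptiveResponse());
    }
}
```

Since generated classes usually cannot be edited conveniently, calling their getters individually is the practical equivalent of the second class.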

You need to call the get methods on the Item object to retrieve its details, e.g.:
for (Item item : itemList.getItem()) {
System.out.println(item.getItemAttributes().getTitle()); //Title of item
System.out.println(item.getDetailPageURL()); // Amazon URL
//etc
}
If there are any errors, you can get them by calling getErrors():
if (response.getOperationRequest().getErrors() != null) {
System.out.println(response.getOperationRequest().getErrors().getError().get(0).getMessage());
}

Convert DOC file to DOCX with Java

I need to use DOCX files (actually the XML contained in them) in a Java software I'm currently developing, but some people in my company still use the DOC format.
Do you know if there is a way to convert a DOC file to the DOCX format using Java? I know it's possible using C#, but that's not an option.
I googled it, but nothing came up.
Thanks

You may try Aspose.Words for Java. It allows you to load a DOC file and save it as DOCX format. The code is very simple as shown below:
// Open a document.
Document doc = new Document("input.doc");
// Save document.
doc.save("output.docx");
Please see if this helps in your scenario.
Disclosure: I work as developer evangelist at Aspose.

Check out JODConverter to see if it fits the bill. I haven't personally used it.

Use the newer versions of the JODConverter jars, jodconverter-core-4.2.2.jar and jodconverter-local-4.2.2.jar:
String inputFile = "*.doc";
String outputFile = "*.docx";
LocalOfficeManager localOfficeManager = LocalOfficeManager.builder()
.install()
.officeHome(getDefaultOfficeHome()) //your path to openoffice
.build();
try {
localOfficeManager.start();
final DocumentFormat format
= DocumentFormat.builder()
.from(DefaultDocumentFormatRegistry.DOCX)
.build();
LocalConverter
.make()
.convert(new FileInputStream(new File(inputFile)))
.as(DefaultDocumentFormatRegistry.getFormatByMediaType("application/msword"))
.to(new File(outputFile))
.as(format)
.execute();
} catch (OfficeException ex) {
Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
} catch (FileNotFoundException ex) {
Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
} finally {
OfficeUtils.stopQuietly(localOfficeManager);
}

JODConverter calls OpenOffice/LibreOffice via a network protocol. It can therefore 'do anything you can do in OpenOffice', including converting formats. But it only does as good a job as whatever version of OpenOffice you are running. I have some art in one of my docs, and it doesn't convert as I hoped.
JODConverter is no longer supported, according to the Google Code website for v3.
To get JODConverter to do the job, you need to do something like:
private static void transformBinaryWordDocToDocX(File in, File out)
{
OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
DocumentFormat docx = converter.getFormatRegistry().getFormatByExtension("docx");
docx.setStoreProperties(DocumentFamily.TEXT,
Collections.singletonMap("FilterName", "MS Word 2007 XML"));
converter.convert(in, out, docx);
}
private static void transformBinaryWordDocToW2003Xml(File in, File out)
{
OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
DocumentFormat w2003xml = new DocumentFormat("Microsoft Word 2003 XML", "xml", "text/xml");
w2003xml.setInputFamily(DocumentFamily.TEXT);
w2003xml.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "MS Word 2003 XML"));
converter.convert(in, out, w2003xml);
}
private static OfficeManager officeManager;
@BeforeClass
public static void setupStatic() throws IOException {
/*officeManager = new DefaultOfficeManagerConfiguration()
.setOfficeHome("C:/Program Files/LibreOffice 3.6")
.buildOfficeManager();
*/
officeManager = new ExternalOfficeManagerConfiguration().setConnectOnStart(true).setPortNumber(8100).buildOfficeManager();
officeManager.start();
}
@AfterClass
public static void shutdownStatic() throws IOException {
officeManager.stop();
}
For this to work, you need to be running LibreOffice as a networked server. (I could not get the 'run on demand' part of JODConverter to work very well under Windows with LO 3.6.)

To convert a DOC file to HTML, look at this:
(Convert Word doc to HTML programmatically in Java)
Use Apache POI: http://poi.apache.org/
Or use this to extract the text of a DOCX file:
XWPFDocument docx = new XWPFDocument(OPCPackage.openOrCreate(new File("hello.docx")));
XWPFWordExtractor wx = new XWPFWordExtractor(docx);
String text = wx.getText();
System.out.println("text = "+text);

I needed the same conversion. After a lot of research, I found that JODConverter can be useful for this. You can download the jar from
https://code.google.com/p/jodconverter/downloads/list
Add the jodconverter-core-3.0-beta-4-sources.jar file to your project lib
//1) Create OfficeManager object
OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
.setOfficeHome(new File("/opt/libreoffice4.4"))
.buildOfficeManager();
officeManager.start();
// 2) Create JODConverter converter
OfficeDocumentConverter converter = new OfficeDocumentConverter(
officeManager);
// 3)Create DocumentFormat for docx
DocumentFormat docx = converter.getFormatRegistry().getFormatByExtension("docx");
docx.setStoreProperties(DocumentFamily.TEXT,
Collections.singletonMap("FilterName", "MS Word 2007 XML"));
//4) Call the convert function on the converter object
converter.convert(new File("doc/AdvancedTable.doc"), new File(
"docx/AdvancedTable.docx"), docx);

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class TestCon {
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
POIFSFileSystem fs = null;
Document document = new Document();
try {
System.out.println("Starting the test");
fs = new POIFSFileSystem(new FileInputStream("C:/Users/312845/Desktop/a.doc"));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
OutputStream file = new FileOutputStream(new File("C:/Users/312845/Desktop/test.docx"));
System.out.println("Document testing completed");
} catch (Exception e) {
System.out.println("Exception during test");
e.printStackTrace();
} finally {
// close the document
document.close();
}
}
}

Identifying file type in Java

Please help me find out the type of a file that is being uploaded.
I want to distinguish between Excel and CSV files.
The MIME type returned is the same for both of these files. Please help.

I use Apache Tika which identifies the filetype using magic byte patterns and globbing hints (the file extension) to detect the MIME type. It also supports additional parsing of file contents (which I don't really use).
Here is a quick and dirty example on how Tika can be used to detect the file type without performing any additional parsing on the file:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import org.apache.tika.metadata.HttpHeaders;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaMetadataKeys;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.helpers.DefaultHandler;
public class Detector {
public static void main(String[] args) throws Exception {
File file = new File("/path/to/file.xls");
AutoDetectParser parser = new AutoDetectParser();
parser.setParsers(new HashMap<MediaType, Parser>());
Metadata metadata = new Metadata();
metadata.add(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
InputStream stream = new FileInputStream(file);
parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
stream.close();
String mimeType = metadata.get(HttpHeaders.CONTENT_TYPE);
System.out.println(mimeType);
}
}

I hope this will help. It is taken from an example that is not mine:
import javax.activation.MimetypesFileTypeMap;
import java.io.File;
class GetMimeType {
public static void main(String args[]) {
File f = new File("test.gif");
System.out.println("Mime Type of " + f.getName() + " is " +
new MimetypesFileTypeMap().getContentType(f));
// expected output :
// "Mime Type of test.gif is image/gif"
}
}
The same may be true for Excel and CSV types. Not tested.

I figured out a cheaper way of doing this with java.nio.file.Files
public String getContentType(File file) throws IOException {
return Files.probeContentType(file.toPath());
}
- or -
public String getContentType(Path filePath) throws IOException {
return Files.probeContentType(filePath);
}
Hope that helps.
Cheers.

A better way without using javax.activation.*:
String mimeType = URLConnection.guessContentTypeFromName(f.getAbsolutePath());

If you are already using Spring, this works for CSV and Excel:
import org.springframework.mail.javamail.ConfigurableMimeFileTypeMap;
import javax.activation.FileTypeMap;
import java.io.IOException;
public class ContentTypeResolver {
private FileTypeMap fileTypeMap;
public ContentTypeResolver() {
fileTypeMap = new ConfigurableMimeFileTypeMap();
}
public String getContentType(String fileName) throws IOException {
if (fileName == null) {
return null;
}
return fileTypeMap.getContentType(fileName.toLowerCase());
}
}
Or, with javax.activation, you can update the mime.types file.

A CSV file will start with text, whereas an Excel file is most likely binary.
However, the simplest approach is to try to load the file as an Excel document using POI. If this fails, try to load the file as a CSV; if that also fails, it is possibly neither type.
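The text-versus-binary observation can also be checked without POI by looking at the first bytes of the file. The sketch below is an assumption-laden illustration, not a complete detector: the class name SpreadsheetSniffer and its labels are made up. It recognises the OLE2 header used by legacy .xls workbooks and the ZIP header used by .xlsx, and otherwise guesses "text" when no control bytes are present. Note that both headers identify the container only, so a .doc or .docx file would match as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: classifies a file by its leading bytes.
class SpreadsheetSniffer {
    // OLE2 compound file header, used by legacy .xls workbooks
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header ("PK" 0x03 0x04), used by .xlsx and other OOXML files
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    /** Returns "ole2", "zip", "text", or "unknown" for the given leading bytes. */
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) return "ole2";
        if (startsWith(head, ZIP)) return "zip";
        return looksLikeText(head) ? "text" : "unknown";
    }

    /** Reads up to 512 bytes from the file and classifies them (a single read suffices for a sketch). */
    static String sniff(Path file) throws IOException {
        byte[] buf = new byte[512];
        try (InputStream in = Files.newInputStream(file)) {
            int n = in.read(buf);
            return sniff(Arrays.copyOf(buf, Math.max(n, 0)));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Treats data as text if it is non-empty and has no control bytes other than tab, LF, CR. */
    private static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') return false;
        }
        return data.length > 0;
    }
}
```

A result of "text" only suggests the upload might be CSV; actually parsing it, as described above, is still the more reliable confirmation.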
