How to extract text from docx with Tika - java

I'm trying to extract text from a docx file. tika-app does it well, but when I try to do the same thing in my code the result is empty, and the Tika parser reports the content type of my docx file as "application/zip".
What should I do? Should I use a recursive approach (like this), or is there another way?
UPDATE: The file content-type is now correctly detected if I add the filename to the metadata:
InputStream is = new FileInputStream(myFile);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, myFileFilename);
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(is, handler, metadata, context);
However, at parse() I now get this error:
java.lang.NoClassDefFoundError: org/apache/poi/openxml4j/exceptions/InvalidFormatException
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)

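For me, the most confusing thing about Apache Tika is that your code compiles without tika-parsers.jar, but it obviously can't work without it at runtime. The NoClassDefFoundError above names an Apache POI class, which means one of the jars that tika-parsers depends on is missing. So make sure tika-parsers.jar is on the classpath together with all of its dependencies (there are many).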
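A quick way to narrow down which jar is missing is to probe the classpath for the classes involved. The following is a minimal sketch and not part of Tika: the class name ClasspathCheck and the method isPresent are made up for illustration, and the probed names are the two classes from the stack trace in the question plus org.apache.tika.parser.AutoDetectParser.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    public static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.tika.parser.AutoDetectParser",                   // lives in tika-core
            "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",        // lives in tika-parsers
            "org.apache.poi.openxml4j.exceptions.InvalidFormatException" // Apache POI dependency
        };
        for (String name : required) {
            System.out.println((isPresent(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Run it with the same classpath as the failing application. Any line marked MISSING points at a jar that still has to be added.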

Related

How to parse octet-stream files using Apache Tika?

I have stored files of various types (txt, doc, pdf, etc.) in Azure Blob storage. However, all of them are stored with the content type 'octet-stream', and when I open the files to extract text from them with Tika, Tika can't detect the character encoding. How can I get around this problem?
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX+fileAdd);
InputStream stream = fs.open(pt);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());
If you are calling the Azure Storage REST API directly, you can set the "x-ms-blob-content-type" header via the Set Blob Properties operation.
If you are using the Azure Storage Client Library, you can write code similar to the following (shown here for the .NET client):
blockBlob.Properties.ContentType = "text/xml";
blockBlob.SetProperties();

Why does tika provide 2 identical content-types for a single file

Rather naively I apply
InputStream in = ...
Metadata tikaMeta = new Metadata();
Tika tika = new Tika();
tika.setMaxStringLength(-1);
body = tika.parseToString(in, tikaMeta);
to convert a file. The body stays empty, which is ok, because it is a binary executable. But a debug log of the content-type meta reveals:
Content-Type->[application/x-executable, application/x-executable]
Any ideas why Tika provides two content types?

Tika pass parser info during incremental read

I know Tika has a very nice wrapper that lets me get a Reader back from parsing a file, like so:
Reader parsedReader = tika.parse(in);
However, if I use this, I cannot specify the parser I want or the metadata I want to pass in. For example, I would like to pass in extra information such as which handler, parser, and context to use, but I can't with this method. As far as I know, it's the only one that lets me get a Reader instance back and read incrementally instead of getting the entire parsed string back.
Example of things I want to include:
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, fileName); //This aids in the content detection
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(is, handler, metadata, context);
However, calling parse on a parser directly does not return a Reader, and the only other option I have (that I noticed in the docs) is to get back a fully parsed string, which might not be great for memory usage. I know I can limit the length of the returned string, but I want to stay away from that, as I want the fully parsed content, just delivered incrementally. Is it possible to get the best of both worlds?
One of the many great things about Apache Tika is that it's open source, so you can see how it works. The class for the Tika facade you're using is here.
The part of that class that matters for you is this method:
public Reader parse(InputStream stream, Metadata metadata)
        throws IOException {
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    return new ParsingReader(parser, stream, metadata, context);
}
There you can see how Tika takes a parser and a stream and wraps them in a ParsingReader, which is itself a Reader. Construct a ParsingReader yourself with your own parser, metadata and context, and you're set. Alternatively, write your own ContentHandler and pass it to the parser directly for full control!
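Once you hold such a Reader, the incremental part is ordinary java.io code. The sketch below only illustrates chunked consumption. The class IncrementalRead and its consume method are hypothetical names, and a StringReader stands in for the ParsingReader so that the example runs without Tika on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

public class IncrementalRead {

    /**
     * Reads the reader in chunks of at most bufferSize chars, hands each chunk
     * to the sink, closes the reader, and returns the total number of chars read.
     */
    public static long consume(Reader reader, int bufferSize, Consumer<String> sink) {
        char[] buffer = new char[bufferSize];
        long total = 0;
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buffer)) != -1) {
                if (n > 0) {
                    sink.accept(new String(buffer, 0, n));
                    total += n;
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers can pass plain lambdas as the sink
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for: new ParsingReader(parser, stream, metadata, context)
        Reader parsed = new StringReader("text extracted by Tika");
        long chars = consume(parsed, 8, chunk -> System.out.println("chunk: " + chunk));
        System.out.println("total chars: " + chars);
    }
}
```

Only one buffer's worth of text is held at a time, so memory use stays bounded by bufferSize rather than by the size of the document.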

Getting paragraph count from Tika for both Word and PDF

I have a scenario where I need to reconcile two documents, a Word (.docx) document and a PDF. The two are supposed to be "identical" to each other (the PDF is just a PDF version of the DOCX file), meaning they should contain the same text, content, etc.
Specifically, I need to make sure that both documents contain the same number of paragraphs. So I need to read the DOCX, get the paragraph count, then read the PDF and grab its paragraph count. If both numbers are the same, then I'm in business.
It looks like Apache Tika (I'm interested in 1.3) is the right tool for the job here. I see in this source file that Tika supports the notion of paragraph counting, but I'm trying to figure out how to get the count from both documents. Here's my best attempt, but I'm choking on connecting some of the final dots:
InputStream docxStream = new FileInputStream("some-doc.docx");
InputStream pdfStream = new FileInputStream("some-doc.pdf");
ContentHandler handler = new DefaultContentHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
Parser parser = new OfficeParser();
ParseContext pc = new ParseContext();
parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);
docxStream.close();
pdfStream.close();
int docxParagraphCount = docxMeta.getXXX(???);
int pdfParagraphCount = pdfMeta.getXXX(???);
if(docxParagraphCount == pdfParagraphCount)
setInBusiness(myself, true);
So I ask: have I set this up correctly or am I way off base? If off-base, please lend me some help to get me back on track. And if I have set things up correctly, then how do I get the desired counts out of the two Metadata instances? Thanks in advance.
First up, Tika will only give you back the metadata contained within your documents. It won't compute anything for you. So, if one of your documents lacks the paragraph count metadata, you're out of luck. If one of your documents has duff data (i.e. the program that wrote the file out got it wrong), you'll be out of luck.
Otherwise, your code is nearly there, but not quite. You most likely want to use DefaultParser or AutoDetectParser - OfficeParser is for the Microsoft file formats only, while the others automatically load all the available parsers and pick the correct one.
The property you want is PARAGRAPH_COUNT, which comes from the Office metadata namespace. Your code would be something like:
TikaInputStream docxStream = TikaInputStream.get(new File("some-doc.docx"));
TikaInputStream pdfStream = TikaInputStream.get(new File("some-doc.pdf"));
// org.xml.sax.helpers.DefaultHandler ignores all content events
ContentHandler handler = new DefaultHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
ParseContext pc = new ParseContext();
Parser parser = TikaConfig.getDefaultConfig().getParser();
parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);
docxStream.close();
pdfStream.close();
// getInt returns null when the document does not carry the property
Integer docxParagraphCount = docxMeta.getInt(Office.PARAGRAPH_COUNT);
Integer pdfParagraphCount = pdfMeta.getInt(Office.PARAGRAPH_COUNT);
If you don't care about the text at all, only the metadata, pass in a dummy content handler such as org.xml.sax.helpers.DefaultHandler, which ignores every event it receives.
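If the paragraph-count metadata turns out to be missing, one possible fallback is to count paragraphs yourself with a custom handler, because Tika reports document content as XHTML SAX events in which paragraphs arrive as p elements. This is only a sketch under that assumption. ParagraphCounter and countIn are hypothetical names, and the JDK's own SAX parser is used to feed the handler so the example runs without Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParagraphCounter extends DefaultHandler {
    private int count;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the event source is not namespace-aware, so fall back to qName
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("p".equals(name)) {
            count++;
        }
    }

    public int getCount() {
        return count;
    }

    /** Counts p elements in an XHTML string, using the JDK SAX parser as the event source. */
    public static int countIn(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            ParagraphCounter counter = new ParagraphCounter();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), counter);
            return counter.getCount();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>first</p><p>second</p><div>not a paragraph</div></body></html>";
        System.out.println(countIn(xhtml)); // prints 2
    }
}
```

With Tika, a fresh ParagraphCounter per document would be passed as the handler argument of parser.parse(stream, handler, metadata, context) and read back with getCount(). Treat the result with caution: how a PDF parser splits text into paragraphs is heuristic, so the two counts may differ even for documents with the same content.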

Apache Tika: Parsing only metadata without content extraction

I'm using Apache Tika to extract metadata from documents. I'm mostly interested in a basic Dublin Core set such as Author, Title, Date, etc. I'm not interested in the content of the documents at all. Currently I'm simply doing the usual thing:
FileInputStream fis = new FileInputStream( uploadedFileLocation );
// Tika parsing
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fis, handler, metadata);
Is there some way to tell Tika to not parse the content? I'm hoping that this will speed things up as well as save memory.
