How to parse octet-stream files using Apache Tika?


I have stored many different types of files in Azure Blob storage: txt, doc, pdf, and so on. However, all of them are stored with the content type 'octet-stream', and when I open the files to extract their text with Tika, Tika cannot detect the character encoding. How can I get around this problem?
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX+fileAdd);
InputStream stream = fs.open(pt);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());

If you are calling the Azure Storage REST API directly, you can set the "x-ms-blob-content-type" header via the Set Blob Properties operation.
If you are using the Azure Storage Client Library, you can write code similar to the following (shown here for the .NET library):
blockBlob.Properties.ContentType = "text/xml";
blockBlob.SetProperties();
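The content type set this way has to be chosen per file at upload time. A minimal JDK-only sketch of one way to choose it from the file name follows. The class name BlobContentTypes and its extension table are hypothetical and not part of any Azure or Tika API; the returned string is what would be sent as "x-ms-blob-content-type" or assigned to ContentType.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: picks a Content-Type for a blob from its file name. */
final class BlobContentTypes {
    // Small illustrative table; extend it for the formats you actually store.
    private static final Map<String, String> BY_EXTENSION = Map.of(
        "txt", "text/plain",
        "pdf", "application/pdf",
        "doc", "application/msword",
        "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");

    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "application/octet-stream"; // no extension: keep the generic type
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        // Unknown extensions also fall back to the generic type.
        return BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(forFileName("report.PDF"));
        System.out.println(forFileName("notes"));
    }
}
```

An explicit table keeps the mapping predictable across JDK versions; a content-based detector would be the alternative when file names are unreliable.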

Related

How to extract text from docx with Tika

I'm trying to extract text from a docx file. tika-app does it well, but when I try to do the same thing in my own code the result is empty, and the Tika parser reports the content type of my docx file as "application/zip".
What can I do? Should I use a recursive approach (like this), or is there another way?
UPDATE: The file content-type is now correctly detected if I add the filename to the metadata:
InputStream is = new FileInputStream(myFile);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, myFileFilename);
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(is, handler, metadata, context);
However, parse() now fails with this error:
java.lang.NoClassDefFoundError: org/apache/poi/openxml4j/exceptions/InvalidFormatException
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)

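For me, the most confusing thing about Apache Tika is that your code compiles without tika-parsers.jar, but it obviously can't work without it. The NoClassDefFoundError above names a class from Apache POI's OOXML support, which is one of the dependencies of tika-parsers. So make sure that tika-parsers.jar is on the classpath together with all of its dependencies (there are many).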
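The "application/zip" result mentioned in the question has a simple cause: a .docx file is a ZIP package, so its first bytes are the ordinary ZIP signature. The JDK-only sketch below illustrates this. MagicSniffer is a hypothetical helper, not Tika's detector, and it looks only at the leading bytes, which is why a file name hint or a container-aware detector is needed to tell a .docx from a plain archive.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that classifies content by its leading bytes only. */
final class MagicSniffer {
    static String sniff(byte[] head) {
        if (startsWith(head, new byte[] {'P', 'K', 0x03, 0x04})) {
            return "application/zip"; // also matches .docx, .xlsx, .pptx, .jar
        }
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        return "application/octet-stream"; // nothing recognised
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A .docx starts like any other ZIP archive.
        byte[] docxHead = {'P', 'K', 0x03, 0x04, 0x14, 0x00};
        System.out.println(sniff(docxHead)); // prints "application/zip"
    }
}
```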

Why does tika provide 2 identical content-types for a single file

Rather naively I apply
InputStream in = ...
Metadata tikaMeta = new Metadata();
Tika tika = new Tika();
tika.setMaxStringLength(-1);
body = tika.parseToString(in, tikaMeta);
to convert a file. The body stays empty, which is fine because the file is a binary executable. But a debug log of the Content-Type metadata reveals:
Content-Type->[application/x-executable, application/x-executable]
Any ideas why Tika provides two content types?

Getting paragraph count from Tika for both Word and PDF

I have a scenario where I need to reconcile two documents, a Word (.docx) file and a PDF. The two are supposed to be "identical" to each other (the PDF is just a PDF version of the DOCX file), meaning they should contain the same text, content, etc.
Specifically, I need to make sure that both documents contain the same number of paragraphs. So I need to read the DOCX, get the paragraph count, then read the PDF and grab its paragraph count. If both numbers are the same, then I'm in business.
It looks like Apache Tika (I'm interested in 1.3) is the right tool for the job here. I see in this source file that Tika supports the notion of paragraph counting, but I can't figure out how to get the count from both documents. Here's my best attempt, but I'm stuck on connecting the final dots:
InputStream docxStream = new FileInputStream("some-doc.docx");
InputStream pdfStream = new FileInputStream("some-doc.pdf");
ContentHandler handler = new DefaultContentHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
Parser parser = new OfficeParser();
ParseContext pc = new ParseContext();
parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);
docxStream.close();
pdfStream.close();
int docxParagraphCount = docxMeta.getXXX(???);
int pdfParagraphCount = pdfMeta.getXXX(???);
if(docxParagraphCount == pdfParagraphCount)
setInBusiness(myself, true);
So I ask: have I set this up correctly or am I way off base? If off-base, please lend me some help to get me back on track. And if I have set things up correctly, then how do I get the desired counts out of the two Metadata instances? Thanks in advance.

First up, Tika will only give you back the metadata contained within your documents. It won't compute anything for you. So, if one of your documents lacks the paragraph count metadata, you're out of luck. If one of your documents has duff data (i.e. the program that wrote the file out got it wrong), you'll be out of luck.
Otherwise, your code is nearly there, but not quite. You most likely want to use DefaultParser or AutoDetectParser - OfficeParser is for the Microsoft file formats only, while the others automatically load all the available parsers and pick the correct one.
The property you want is PARAGRAPH_COUNT, which comes from the Office metadata namespace. Your code would be something like:
TikaInputStream docxStream = TikaInputStream.get(new File("some-doc.docx"));
TikaInputStream pdfStream = TikaInputStream.get(new File("some-doc.pdf"));
ContentHandler handler = new DefaultHandler(); // org.xml.sax.helpers.DefaultHandler ignores all content events
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
ParseContext pc = new ParseContext();
Parser parser = TikaConfig.getDefaultConfig().getParser();
parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);
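// getInt returns null when the document carries no paragraph count, so avoid unboxing to int
Integer docxParagraphCount = docxMeta.getInt(Office.PARAGRAPH_COUNT);
Integer pdfParagraphCount = pdfMeta.getInt(Office.PARAGRAPH_COUNT);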
If you don't care about the text at all, only the metadata, pass in a dummy content handler, such as org.xml.sax.helpers.DefaultHandler.
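If a document carries no paragraph-count metadata, one possible fallback is to extract the text and count paragraphs yourself. The JDK-only sketch below is not a Tika feature. ParagraphCounter is a hypothetical helper, and it assumes paragraphs in the extracted text are separated by blank lines, which is not guaranteed, particularly for PDF output.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: counts blank-line-separated paragraphs in plain text. */
final class ParagraphCounter {
    // Two or more line breaks, each optionally followed by spaces or tabs,
    // are treated as one paragraph separator.
    private static final Pattern BLANK_LINES = Pattern.compile("(?:\\r?\\n[ \\t]*){2,}");

    static int count(String text) {
        String trimmed = text.strip(); // requires Java 11+
        if (trimmed.isEmpty()) {
            return 0;
        }
        return BLANK_LINES.split(trimmed).length;
    }

    public static void main(String[] args) {
        String text = "First.\n\nSecond, line one\nline two.\n\n\nThird.";
        System.out.println(count(text)); // prints 3
    }
}
```

Comparing the two counts is then only as reliable as the text extraction on each side, so a mismatch does not by itself prove the documents differ.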

Apache Tika: Parsing only metadata without content extraction

I'm using Apache Tika to extract metadata from documents. I'm mostly interested in setting up a basic Dublin Core record: Author, Title, Date, etc. I'm not interested in the content of the documents at all. Currently I'm simply doing the usual thing:
FileInputStream fis = new FileInputStream( uploadedFileLocation );
// Tika parsing
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fis, handler, metadata);
Is there some way to tell Tika to not parse the content? I'm hoping that this will speed things up as well as save memory.

Apache Tika and document metadata

I'm doing simple processing of a variety of documents (ODS, MS Office, PDF) using Apache Tika. I have to get at least:
word count, author, title, timestamps, language, etc.
which is not so easy. My strategy is to use the Template Method pattern for 6 types of document: I determine the type of the document first and then process each type individually.
I know that Apache Tika should remove the need for this, but the document formats are quite different, right?
For instance:
InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc);
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new OfficeParser();
parser.parse(input, textHandler, metadata, new ParseContext());
input.close();
for(String s : metadata.names()) {
System.out.println("Metadata name : " + s);
}
I tried this for ODS, MS Office, and PDF documents, and the metadata differs a lot. There is an MSOffice interface that lists metadata keys for MS documents, and a Dublin Core metadata list. But how should one implement an application like this?
Could anybody with experience of this please share it? Thank you.

Generally the parsers should return the same metadata key for the same kind of thing across all document formats. However, there are some kinds of metadata that only occur in some file types, so you won't get those from others.
You might want to just use the AutoDetectParser and, if you need to do anything special with the metadata, handle that afterwards based on the MIME type, e.g.:
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, context);
// Constant-first comparison avoids a NullPointerException if no content type was set
if("application/pdf".equals(metadata.get(Metadata.CONTENT_TYPE))) {
// Do something special with the PDF metadata here
}
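When dispatching on the MIME type like this, note that the Content-Type value can carry parameters (for text files, something like "text/plain; charset=ISO-8859-1"), so an exact string comparison may miss. The JDK-only sketch below normalises the value before branching. MimeDispatch, its return labels, and the plain Map standing in for Tika's Metadata object are all hypothetical.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical post-processing step that branches on the detected MIME type. */
final class MimeDispatch {
    /** Strips parameters such as "; charset=UTF-8" and lower-cases the type. */
    static String baseType(String contentType) {
        if (contentType == null) {
            return "application/octet-stream"; // nothing detected
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** The map stands in for the metadata produced by a single auto-detected parse. */
    static String describe(Map<String, String> metadata) {
        switch (baseType(metadata.get("Content-Type"))) {
            case "application/pdf":
                return "pdf"; // PDF-specific metadata handling would go here
            case "application/msword":
                return "word"; // legacy Word handling would go here
            case "text/plain":
                return "text";
            default:
                return "generic"; // shared keys only
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(Map.of("Content-Type", "text/plain; charset=ISO-8859-1")));
    }
}
```

This keeps a single parse path for every format and confines the format-specific logic to one switch, instead of one template subclass per document type.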
