Apache Tika and document metadata - java

I'm doing simple processing of a variety of documents (ODS, MS Office, PDF) using Apache Tika. At a minimum I need the word count, author, title, timestamps and language, which is not so easy. My strategy is to use the Template Method pattern for 6 document types: I detect the type of document first, then process it individually based on that.
I know that Apache Tika should remove the need for this, but the document formats are quite different, right?
For instance:
InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc);
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new OfficeParser();
parser.parse(input, textHandler, metadata, new ParseContext());
input.close();
for (String s : metadata.names()) {
    System.out.println("Metadata name : " + s);
}
I tried this for ODS, MS Office and PDF documents, and the metadata differs a lot. There is an MSOffice interface that lists metadata keys for MS documents, and a Dublin Core metadata list. But how should one implement an application like this?
Could anybody who has experience with this please share it? Thank you.

Generally the parsers should return the same metadata key for the same kind of thing across all document formats. However, there are some kinds of metadata that only occur in some file types, so you won't get those from others.
You might want to just use the AutoDetectParser and, if you need to do anything special with the metadata, handle that afterwards based on the MIME type, e.g.:
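Metadata metadata = new Metadata();
// The file name gives the detector an extra hint
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
// input and textHandler are the stream and handler from the question
parser.parse(input, textHandler, metadata, context);
// Constant first, so a missing content type cannot cause a NullPointerException
if ("application/pdf".equals(metadata.get(Metadata.CONTENT_TYPE))) {
    // Do something special with the PDF metadata here
}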
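
The "handle it afterwards based on the MIME type" step can be sketched without Tika on the classpath. Everything below is hypothetical illustration: the MimeDispatch class, its method names and the plain Map standing in for Tika's Metadata object are assumptions, not part of Tika. The content type string is assumed to be whatever metadata.get(Metadata.CONTENT_TYPE) returned, which may carry parameters after a semicolon.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class MimeDispatch {
    // Hypothetical registry: base media type -> post-processing step.
    private final Map<String, Function<Map<String, String>, String>> steps = new HashMap<>();

    void register(String mediaType, Function<Map<String, String>, String> step) {
        steps.put(baseType(mediaType), step);
    }

    // Drops parameters such as "; charset=UTF-8" and lower-cases the rest.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "application/octet-stream";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Runs the step registered for the type, or a generic fallback.
    String process(String contentType, Map<String, String> metadata) {
        Function<Map<String, String>, String> step = steps.get(baseType(contentType));
        return step == null ? "generic" : step.apply(metadata);
    }

    public static void main(String[] args) {
        MimeDispatch dispatch = new MimeDispatch();
        dispatch.register("application/pdf", m -> "pdf, title=" + m.get("title"));

        // Stand-in for values copied out of a Tika Metadata object.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("title", "Quarterly report");
        System.out.println(dispatch.process("application/pdf; version=1.4", metadata));
        System.out.println(dispatch.process("text/plain", metadata));
    }
}
```

A lookup table like this keeps the common path format-agnostic, with one small step per format that really needs special treatment, instead of one template class per document type.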

Related

Apache Tika, cannot retrieve 'subject' metadata value

I am using Java and trying to extract some metadata with Apache Tika, but I cannot extract the expected value for the 'subject' metadata. The file is a JPG image. Here is my code.
First I parse the file like this:
inputStream = new FileInputStream(fileToExtract);
Parser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(inputStream, contentHandler, metadata, new ParseContext());
and then I try to print these:
metadata.get(Metadata.AUTHOR) --> "MyAuthor"
metadata.get(TikaCoreProperties.CREATOR) --> "MyCreator"
metadata.get(TikaCoreProperties.TITLE) --> "MyTitle"
metadata.get(Metadata.SUBJECT) --> null
metadata.get(TikaCoreProperties.KEYWORDS) --> null
So I get all the other values correctly, but null for the subject. I added the metadata manually (right click -> Properties on Windows).
Am I doing something wrong?
PS: Note that, according to the Apache Tika documentation, TikaCoreProperties.KEYWORDS is another way to retrieve the subject.
Apache Tika tries to return consistent metadata across all file formats. It shouldn't matter if one format calls it Author, another Creator, another Created By and another Creator[0]: Tika maps those all onto a consistent key. Typically, those keys are based on well-known external standards, such as Dublin Core.
If you want to see the mappings that Tika applies to Microsoft Office documents, you'll need to look in SummaryExtractor. If you want to know what all the metadata keys and values are that Tika can extract from a given file, either use the tika-app cli tool with --metadata, or call names() on the Metadata object to get the list of metadata keys Tika found.
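
Once the keys from names() are in hand, a small search over them can show where a value such as the subject actually ended up. This is a minimal sketch with no Tika dependency: the MetadataKeySearch class is hypothetical, the Map stands in for the pairs you would copy from metadata.names() and metadata.get(name), and the sample keys and values are illustrative only, not what Tika is guaranteed to return for your file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class MetadataKeySearch {
    // Returns the keys whose name contains the needle, ignoring case, in sorted order.
    static List<String> keysContaining(Map<String, String> metadata, String needle) {
        String wanted = needle.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String key : new TreeMap<>(metadata).keySet()) {
            if (key.toLowerCase(Locale.ROOT).contains(wanted)) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Illustrative values only; with Tika these would be copied from
        // metadata.names() and metadata.get(name) after parsing.
        Map<String, String> metadata = new TreeMap<>();
        metadata.put("dc:title", "MyTitle");
        metadata.put("dc:subject", "MySubject");
        metadata.put("Content-Type", "image/jpeg");
        System.out.println(keysContaining(metadata, "subject"));
    }
}
```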

Tika pass parser info during incremental read

I know Tika has a very nice wrapper that lets me get a Reader back from parsing a file like so:
Reader parsedReader = tika.parse(in);
However, if I use this, I cannot specify the parser I want or the metadata I want to pass in. For example, I would like to pass in extra information such as which handler, parser and context to use, but I can't with this method. As far as I know, it's the only one that lets me get a Reader instance back and read incrementally instead of getting the entire parsed string back.
Example of things I want to include:
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, fileName); //This aids in the content detection
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(is, handler, metadata, context);
However, calling parse on a parser directly does not return a Reader, and the only other option I noticed in the docs is to return a fully parsed string, which might not be great for memory usage. I know I can limit the length of the returned string, but I want to stay away from that because I want the fully parsed text, just delivered incrementally. Is it possible to get the best of both worlds?
One of the many great things about Apache Tika is that it's open source, so you can see how it works. The Tika facade you're using is the class org.apache.tika.Tika.
The key part of that class for your purposes is this:
public Reader parse(InputStream stream, Metadata metadata)
        throws IOException {
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    return new ParsingReader(parser, stream, metadata, context);
}
You see there how Tika is taking a parser and a stream, and processing it to a Reader. Do something similar and you're set. Alternately, write your own ContentHandler and call that directly for full control!
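
To make the "read incrementally" idea concrete, here is a minimal JDK-only sketch of the general producer/consumer pattern behind a Reader that is fed while you read it. It is not Tika code and makes no claim about how ParsingReader is implemented: the IncrementalText class is hypothetical, and the list of chunks stands in for text that a parser would emit through a ContentHandler.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

final class IncrementalText {
    // Starts a producer thread that writes the chunks into a pipe
    // and returns the reading end of that pipe.
    static Reader open(List<String> chunks) {
        try {
            PipedReader reader = new PipedReader();
            PipedWriter writer = new PipedWriter(reader);
            Thread producer = new Thread(() -> {
                // Closing the writer signals end-of-stream to the reader.
                try (Writer out = writer) {
                    for (String chunk : chunks) {
                        out.write(chunk);
                    }
                } catch (IOException e) {
                    // The reader went away early; stop producing.
                }
            });
            producer.setDaemon(true);
            producer.start();
            return reader;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains the reader through a small buffer, as a consumer would.
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[8];
        try (Reader in = reader) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(readAll(open(Arrays.asList("first chunk, ", "second chunk"))));
    }
}
```

Because the pipe has a bounded buffer, the producer blocks until the consumer has read, so memory use stays flat regardless of how much text is produced.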

Getting paragraph count from Tika for both Word and PDF

I have a scenario where I need to reconcile two documents, a Word (.docx) file and a PDF. The two are supposed to be "identical" to each other (the PDF is just a PDF version of the DOCX file), meaning they should contain the same text, content, etc.
Specifically, I need to make sure that both documents contain the same number of paragraphs. So I need to read the DOCX, get the paragraph count, then read the PDF and get its paragraph count. If both numbers are the same, then I'm in business.
It looks like Apache Tika (I'm interested in 1.3) is the right tool for the job here. I see in this source file that Tika supports the notion of paragraph counting, but I'm trying to figure out how to get the count from both documents. Here's my best attempt, but I'm stuck on connecting the final dots:
InputStream docxStream = new FileInputStream("some-doc.docx");
InputStream pdfStream = new FileInputStream("some-doc.pdf");
ContentHandler handler = new DefaultContentHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
Parser parser = new OfficeParser();
ParseContext pc = new ParseContext();
parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);
docxStream.close();
pdfStream.close();
int docxParagraphCount = docxMeta.getXXX(???);
int pdfParagraphCount = pdfMeta.getXXX(???);
if(docxParagraphCount == pdfParagraphCount)
setInBusiness(myself, true);
So I ask: have I set this up correctly or am I way off base? If off-base, please lend me some help to get me back on track. And if I have set things up correctly, then how do I get the desired counts out of the two Metadata instances? Thanks in advance.
First up, Tika will only give you back the metadata contained within your documents. It won't compute anything for you. So, if one of your documents lacks the paragraph count metadata, you're out of luck. If one of your documents has duff data (i.e. the program that wrote the file out got it wrong), you'll be out of luck.
Otherwise, your code is nearly there, but not quite. You most likely want to use DefaultParser or AutoDetectParser - OfficeParser is for the Microsoft file formats only, while the others automatically load all the available parsers and pick the correct one.
The property you want is PARAGRAPH_COUNT, which comes from the Office metadata namespace. Your code would be something like:
TikaInputStream docxStream = TikaInputStream.get(new File("some-doc.docx"));
TikaInputStream pdfStream = TikaInputStream.get(new File("some-doc.pdf"));
// org.xml.sax.helpers.DefaultHandler ignores every content event
ContentHandler handler = new DefaultHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
ParseContext pc = new ParseContext();
Parser parser = TikaConfig.getDefaultConfig().getParser();
parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);
docxStream.close();
pdfStream.close();
// getInt returns null when the document carries no paragraph count
Integer docxParagraphCount = docxMeta.getInt(Office.PARAGRAPH_COUNT);
Integer pdfParagraphCount = pdfMeta.getInt(Office.PARAGRAPH_COUNT);
If you don't care about the text at all, only the metadata, pass in a dummy content handler, for example org.xml.sax.helpers.DefaultHandler, whose callbacks do nothing.
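
Because Tika only reports counts that the documents themselves carry, the comparison needs a third outcome for "unknown" alongside match and mismatch. A minimal sketch, assuming the two counts arrive as nullable Integer values; the ParagraphCheck class and its Result enum are hypothetical names, not part of Tika:

```java
final class ParagraphCheck {
    enum Result { MATCH, MISMATCH, UNKNOWN }

    // Either count may be null when the document does not record one.
    static Result compare(Integer docxCount, Integer pdfCount) {
        if (docxCount == null || pdfCount == null) {
            return Result.UNKNOWN;
        }
        return docxCount.intValue() == pdfCount.intValue() ? Result.MATCH : Result.MISMATCH;
    }

    public static void main(String[] args) {
        System.out.println(compare(12, 12));
        System.out.println(compare(12, 11));
        System.out.println(compare(12, null));
    }
}
```

Treating a missing count as UNKNOWN rather than as zero avoids reporting two documents as different, or worse as identical, just because one of them lacks the metadata.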

Apache Tika: Parsing only metadata without content extraction

I'm using Apache Tika to extract metadata from documents. I'm mostly interested in a basic Dublin Core set, such as Author, Title and Date. I'm not interested in the content of the documents at all. Currently I'm simply doing the usual thing:
FileInputStream fis = new FileInputStream( uploadedFileLocation );
// Tika parsing
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fis, handler, metadata);
Is there some way to tell Tika to not parse the content? I'm hoping that this will speed things up as well as save memory.

Extracting text from documents of unknown content type

Is there a parser for the application/octet-stream type within Apache Tika? I suppose it's a non-parsable stream.
I just need to parse ODS documents, MS Office documents and PDF files. It seems that new Tika().parseToString(file); is enough. But I can't figure out what happens when the content type is not detected and the default, application/octet-stream, is returned. Is there still a chance to extract text from documents that are one of those types when the content type detector didn't detect their type?
What else should I try, instead of returning the document to the user and telling them that the format is not supported?
Or is a resulting application/octet-stream content type really a signal that we can't read this? Or does it mean "you must figure out your own way to deal with this"?
If the detector doesn't know what the file is, it'll return application/octet-stream.
And if the detector doesn't know what it is, then Tika won't be able to pick a suitable Parser for it. (You'll end up with the EmptyParser, which does nothing.)
If you can, pass in the name of your file when you do the detection and parsing, as that'll help with the detection in some cases:
Metadata metadata = new Metadata();
// The file name gives the detector an extra hint
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
// input is your InputStream, textHandler your ContentHandler
parser.parse(input, textHandler, metadata, context);
Also, it's worth checking the supported formats part of the Tika website to ensure that the documents you have are ones for which there's a Parser: http://tika.apache.org/0.9/formats.html
If your documents are in a format that isn't currently supported, then you have two choices (neither is an immediate fix). One is to help write a new parser, which requires finding a suitable Java library for the format. The other is to use a command-line-based parser, which requires finding an executable for your platform that can do the XHTML generation, then wiring that in.
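
If detection still comes back as application/octet-stream, one way to "figure out your own way" is a last-resort guess from the file extension before giving up. The sketch below is JDK-only and purely illustrative: the ExtensionFallback class and its small extension table are assumptions chosen to cover the formats from the question, and a guess like this is only as trustworthy as the file name.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ExtensionFallback {
    static final String UNKNOWN = "application/octet-stream";

    // Hypothetical table covering only the formats mentioned in the question.
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("doc", "application/msword");
        BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        BY_EXTENSION.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
        BY_EXTENSION.put("odt", "application/vnd.oasis.opendocument.text");
    }

    // Keeps a real detection result; otherwise guesses from the file extension.
    static String resolve(String detectedType, String filename) {
        if (detectedType != null && !UNKNOWN.equals(detectedType)) {
            return detectedType;
        }
        if (filename == null) {
            return UNKNOWN;
        }
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return UNKNOWN;
        }
        String extension = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(resolve(UNKNOWN, "Report.PDF"));
        System.out.println(resolve(UNKNOWN, "notes"));
    }
}
```

If the result is still application/octet-stream after this, reporting the format as unsupported is the honest answer.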
