Apache Tika, cannot retrieve 'subject' metadata value

I am using Java and trying to extract some metadata with Apache Tika, but I cannot extract the expected value for the 'subject' metadata. The file is a JPG image. Here is my code.
First I parse the file like this:
inputStream = new FileInputStream(fileToExtract);
Parser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(inputStream, contentHandler, metadata, new ParseContext());
Then I try to print these values:
metadata.get(Metadata.AUTHOR) --> "MyAuthor"
metadata.get(TikaCoreProperties.CREATOR) --> "MyCreator"
metadata.get(TikaCoreProperties.TITLE) --> "MyTitle"
metadata.get(Metadata.SUBJECT) --> **null**
metadata.get(TikaCoreProperties.KEYWORDS) --> **null**
So I get all the other values correctly, but null for the subject. I added the metadata manually (right click -> Properties, on Windows).
Am I doing something wrong?
PS: Note that TikaCoreProperties.KEYWORDS is another way to retrieve the subject, according to the Apache Tika documentation.

Apache Tika tries to return consistent metadata across all file formats. It shouldn't matter if one format calls it Author, another Creator, another Created By and another Creator[0]; Tika maps all of those onto one consistent key. Typically, those keys are based on well-known external standards, such as Dublin Core.
If you want to see the mappings that Tika applies to Microsoft Office documents, you'll need to look in SummaryExtractor. If you want to know all the metadata keys and values that Tika can extract from a given file, either use the tika-app CLI tool with --metadata, or call names() on the Metadata object to get the list of metadata keys Tika found.

Related

CSV Detector in Apache Tika

I'm using the Apache Tika Java library (tika-core version 1.10).
Is there an org.apache.tika.detect.Detector for CSV files?
The MIME type should be text/csv, but I cannot find anything like that.
I would like to use the nice detect method.

Currently (v1.10) tika-mimetypes.xml defines text/csv like this:
<mime-type type="text/csv">
<glob pattern="*.csv"/>
<sub-class-of type="text/plain"/>
</mime-type>
This means that Apache Tika detects CSV only by filename. If you use Tika#detect(File), Tika will add the filename (under the Metadata.RESOURCE_NAME_KEY key) to the Metadata object passed to the detector. There's similar behavior for URLs.
If you want to supply the filename yourself, you can use something like:
new Tika().detect(is, fileName)
If you want some content-based heuristics, feel free to check Tika's JIRA and file a ticket.
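For illustration, a content-based heuristic of the kind mentioned above could look like the sketch below. It is not part of Tika. The class name CsvHeuristic and the rule it applies (every non-empty sample line has the same non-zero number of unquoted delimiters) are assumptions made for this example, and it does not handle quoted fields that span several lines.

```java
import java.util.Arrays;
import java.util.List;

class CsvHeuristic {

    /** Counts delimiter characters that sit outside double-quoted fields. */
    static int countDelimiters(String line, char delimiter) {
        int count = 0;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
            } else if (c == delimiter && !inQuotes) {
                count++;
            }
        }
        return count;
    }

    /**
     * Guesses that the sample is CSV when it has at least two non-empty lines
     * and each of them contains the same non-zero number of delimiters.
     */
    static boolean looksLikeCsv(List<String> sampleLines, char delimiter) {
        int expected = -1;
        int seen = 0;
        for (String line : sampleLines) {
            if (line.isEmpty()) {
                continue;
            }
            int n = countDelimiters(line, delimiter);
            if (n == 0) {
                return false;
            }
            if (expected == -1) {
                expected = n;
            } else if (n != expected) {
                return false;
            }
            seen++;
        }
        return seen >= 2;
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList("id,name,city", "1,\"Doe, Jane\",Paris", "2,Smith,Rome");
        List<String> prose = Arrays.asList("Hello, world", "This is plain text.");
        System.out.println(looksLikeCsv(csv, ','));    // true
        System.out.println(looksLikeCsv(prose, ','));  // false
    }
}
```

A check like this only makes a guess, so it is probably best combined with the filename rather than used instead of it.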

Identify File Type in Java

I want to check that the user uploads only a particular file format (say, text files only).
I've written a verification mechanism that checks the extension after the file name, like this:
filename.txt
But this caused a problem, because it also accepts other files (such as Excel files) that are saved with a .txt extension. For example,
myexcelfile.txt is assumed to be a text file even though it is an Excel file.
So what would be the unique parameter to check to make sure that the uploaded file is of the required type?
I'm using the Apache Commons uploader and servlets.
======================EDIT=====================
Based on the answers below, I've tried:
FileInputStream my = new FileInputStream(uploadedFile2);
InputStream inputStream = new BufferedInputStream(my);
String mimeType = URLConnection.guessContentTypeFromStream(inputStream);
But it always returns null.
probeContentType is based on the filename extension, and there is also a bug with this approach; I checked that too.
I'd prefer not to use third-party file verifiers; I believe this problem has a logical solution.

Apache Tika has content detection capabilities for a wide range of file formats. From the documentation, one of the simplest ways to detect content type is based on the following code:
// default tika configuration can detect a lot of different file types
TikaConfig tika = new TikaConfig();
// meta data collected about the source file
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, uploadedFile2.toString());
// determine mime type from file contents (detect returns a MediaType, not a String)
MediaType mimetype = tika.getDetector().detect(
        TikaInputStream.get(uploadedFile2), metadata);
System.out.println("File " + uploadedFile2 + " is " + mimetype);
If mimetype is text/plain, then the file or stream contains plain text content.

You could open the file and read the first few bytes into a byte[] and check the values to see if it matches the known magic numbers for a particular file format. I tried finding out what that would be for an Excel file (pre-XML; the xlsx file format would identify as a zip file), but I haven't really found much data about that. The closest thing I've found so far was looking at the code for a Java Excel file parser library.
The old Excel data format used what's called BIFF. Check out the Apache POI library for parsers and such for those types of files. From the looks of it, the magic numbers for an Excel file would probably be 00 06 10 00 (for BIFF8 worksheet), or 00 05 10 00 (BIFF7 worksheet, sounds rather old).
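A minimal sketch of the "read the first few bytes and compare" idea follows. It deliberately does not use the BIFF values guessed above. Instead it checks two container signatures: the ZIP local file header (50 4B 03 04), which is what an xlsx file starts with, and the OLE2 compound document header (D0 CF 11 E0 A1 B1 1A E1), which legacy Office files such as .xls are normally wrapped in. The OLE2 signature and the class name MagicSniffer are additions for this example, not something taken from the answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class MagicSniffer {
    // ZIP local file header; xlsx and other zip-based formats start with it.
    static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound document header (assumption added for this sketch).
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true when data begins with the given magic bytes. */
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    /** Classifies a header as "zip", "ole2" or "unknown". */
    static String sniff(byte[] header) {
        if (startsWith(header, ZIP)) {
            return "zip";
        }
        if (startsWith(header, OLE2)) {
            return "ole2";
        }
        return "unknown";
    }

    /** Reads up to 8 bytes from the start of the file and classifies them. */
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8];
            int total = 0;
            int n;
            while (total < buffer.length
                    && (n = in.read(buffer, total, buffer.length - total)) > 0) {
                total += n;
            }
            return sniff(Arrays.copyOf(buffer, total));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(sniff(Paths.get(args[0])));
        }
    }
}
```

For the upload check in the question, a file named myexcelfile.txt that sniffs as "zip" or "ole2" is clearly not plain text. A result of "unknown" does not prove that it is text, so this only rules files out.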

Try
Files.probeContentType(Paths.get("~/a.xls"))
Note that the output depends on the system's content type detector, so it may differ between machines.
For me, this code returns
application/vnd.ms-excel
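Because the result is system dependent and can be null when no type is recognised, it may help to wrap the call with a fallback. The sketch below does that; the helper name probeOrDefault and the choice of application/octet-stream as the fallback are assumptions for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ProbeWithFallback {

    /** Returns the probed content type, or the fallback when probing yields nothing. */
    static String probeOrDefault(Path path, String fallback) {
        try {
            String type = Files.probeContentType(path);
            return type != null ? type : fallback;
        } catch (IOException e) {
            // Treat an I/O problem the same way as an unknown type.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // The printed value depends on the platform's detector.
        System.out.println(probeOrDefault(Paths.get("a.xls"), "application/octet-stream"));
    }
}
```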

private static String getMimeType(String fileUrl) {
String extension = MimeTypeMap.getFileExtensionFromUrl(fileUrl);
return MimeTypeMap.getSingleton().getMimeTypeFromExtension(extension);
}

Detect Content-Type Based on FileName

I'm trying to use Apache Tika to determine the content type (e.g. application/pdf for .pdf files). I would like to use Apache Tika's org.apache.tika.detect.NameDetector class. My problem is that its detect method only accepts an InputStream. I do not have access to the file's InputStream. I only have the file's name (e.g. myFile.pdf).
Is there any good way to use Apache Tika to determine the content type based only on the extension/name of the file? (Note: I would like to avoid creating a temp file with the desired name to determine its content type.)
Thanks.

You can use the normal Apache Tika Detector interface passing in null for the InputStream, and supplying the filename.
Your code would look something like:
TikaConfig config = new TikaConfig();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
String mimetype = config.getDetector().detect(null, metadata).toString();
To simplify things even more, if you use the Tika facade class you can just do:
Tika tika = new Tika();
String mimetype = tika.detect(filename);
And you'll get back the MIME type guessed from the filename alone.
For more information, see the "Ways of triggering Detection" documentation on the Apache Tika website.
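As a rough picture of what name-only detection amounts to, the sketch below maps a file extension to a MIME type using a small hand-written table. This is not Tika's implementation (Tika reads its patterns from tika-mimetypes.xml and knows far more types). The class NameOnlyDetector and its three table entries are hypothetical and only serve the illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class NameOnlyDetector {
    static final String UNKNOWN = "application/octet-stream";

    // Tiny illustrative table; a real detector would load many more entries.
    private static final Map<String, String> TYPES_BY_EXTENSION = new HashMap<>();
    static {
        TYPES_BY_EXTENSION.put("pdf", "application/pdf");
        TYPES_BY_EXTENSION.put("csv", "text/csv");
        TYPES_BY_EXTENSION.put("txt", "text/plain");
    }

    /** Guesses a MIME type from the extension only; no file content is read. */
    static String detect(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return UNKNOWN;  // no extension at all, or a trailing dot
        }
        String extension = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES_BY_EXTENSION.getOrDefault(extension, UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(detect("myFile.pdf"));  // application/pdf
        System.out.println(detect("README"));      // application/octet-stream
    }
}
```

Since nothing is read from the file, a renamed file gets whatever type its new extension suggests, which is the limitation of any name-only approach.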

I did some searching and found a blog post which contains a code example that determines the type using the org.apache.tika.Tika class's detect method.
So I could write something like this:
org.apache.tika.Tika tika = new org.apache.tika.Tika();
String mimeType = tika.detect("abc.pdf"); // replace abc.pdf with a string variable

Extracting text from documents of unknown content type

Is there a parser for the application/octet-stream type in Apache Tika? I suppose it's a non-parsable stream.
I just need to parse ODS documents, MS documents and PDF files. It seems that new Tika().parseToString(file); is enough. But I can't figure out what happens when the content type is not detected, in which case application/octet-stream is the default. Is there still a chance to extract text from documents that are one of those types when the content type detector didn't detect their type?
What else should I try, instead of returning the document to the user and telling them that the format is not supported?
Or is a resulting application/octet-stream content type really a signal that we can't read this? Or that "you must figure out your own way to deal with this"?

If the detector doesn't know what the file is, it'll return application/octet-stream.
And if the detector doesn't know what it is, then Tika won't be able to pick a suitable Parser for it. (You'll end up with the EmptyParser, which does nothing.)
If you can, pass in the name of your file when you do the detection and parsing, as that'll help with the detection in some cases:
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, context);
Also, it's worth checking the supported formats part of the Tika website to ensure that the documents you have are ones where there's a Parser - http://tika.apache.org/0.9/formats.html
If your documents are in a format that isn't currently supported, then you have two choices (neither is an immediate fix). One is to help write a new parser (this requires finding a suitable Java library for the format). The other is to use a command-line-based parser (this requires finding an executable for your platform that can do the XHTML generation, then wiring that in).

Library for writing XMP to a multipage TIFF

Can you recommend a library that lets me add XMP data to a TIFF file? Preferably a library that can be used with Java.

There is JempBox which is open source and allows the manipulation of XMP streams, but it doesn't look like it will read/write the XMP data in a TIFF file.
There is also Chilkat which is not open source, but does appear to do what you want.

It's been a while, but it may still be useful to someone: Apache Commons has a library called Sanselan suitable for this task. It's a bit dated and the documentation is sparse, but it does the job well nevertheless:
File file = new File("path/to/your/file");
// Get XMP xml data from a file
String xml = Sanselan.getXmpXml(file);
// Process the XML data
xml = processXml(xml);
// Write the XMP xml data back to the file
Map<String, Object> params = new HashMap<>();
params.put(Sanselan.PARAM_KEY_XMP_XML, xml);
BufferedImage image = Sanselan.getBufferedImage(file);
Sanselan.writeImage(image, file, Sanselan.guessFormat(file), params);
You may have to be careful with multipage TIFFs though, because Sanselan.getBufferedImage will probably only return the first page (so only the first page gets written back).
