I'm trying to use Apache Tika to determine the content-type (e.g. application/pdf for .pdf files). I would like to use Apache Tika's org.apache.tika.detect.NameDetector class. My problem is that its detect method only accepts an InputStream. I do not have access to the File's InputStream. I only have the File's name (e.g. myFile.pdf).
Is there any good way to use Apache Tika to determine the content-type based on only the extension/name of the file? (Note - I would like to avoid creating a temp file with the desired name to determine its content-type.)
Thanks.
You can use the normal Apache Tika Detector interface passing in null for the InputStream, and supplying the filename.
Your code would look something like:
TikaConfig config = new TikaConfig();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
String mimetype = config.getDetector().detect(null, metadata).toString(); // detect() returns a MediaType
To simplify things even more, if you use the Tika facade class you can just do:
Tika tika = new Tika();
String mimetype = tika.detect(filename);
And you'll get back the mimetype guessed from the filename alone.
For more information, see the "Ways of triggering Detection" documentation on the Apache Tika website.
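For intuition about what name-only detection amounts to, here is a minimal sketch that does not use Tika at all. The class name ExtensionMimeGuesser and its four-entry table are hypothetical, written only for illustration; Tika consults a much larger registry of filename patterns. The fallback of application/octet-stream mirrors what Tika reports for an unrecognised name.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Apache Tika.
final class ExtensionMimeGuesser {
    // Tiny hand-written table; a real registry covers far more extensions.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "txt", "text/plain",
            "json", "application/json",
            "jpg", "image/jpeg");

    private static final String UNKNOWN = "application/octet-stream";

    // Looks only at the text after the last dot; the file is never opened.
    static String guessFromName(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return UNKNOWN; // no extension at all
        }
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(guessFromName("myFile.pdf"));
        System.out.println(guessFromName("README"));
    }
}
```

Because nothing is read from disk, a renamed file will be misreported; that is the trade-off of any name-based approach.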
I did some searching and found a blog post which contains a code example that determines the type using the org.apache.tika.Tika class's detect method.
So I could write something like this:
org.apache.tika.Tika tika = new org.apache.tika.Tika();
String mimeType = tika.detect("abc.pdf"); // replace abc.pdf with a string variable
Related
I'm using Apache Tika to detect a file's MIME type from its base64 representation.
Unfortunately I don't have other info about the file (e.g. extension).
Is there something I can do to make Tika be more specific?
I'm currently using this:
Tika tika = new Tika();
tika.setMaxStringLength(-1);
String mimetype = tika.detect(Base64.decode(fileString));
and it gives me text/plain for JSON and PDF files, but I would like to obtain more specific information: application/json, application/pdf, etc.
Hope someone can help me!
Thanks.
Tika#detect(String)
Detects the media type of a document with the given file name.
Passing the content of a PDF or JSON file won't work, as this method expects a filename. Tika will fall back to text/plain because the string won't match any known filename pattern.
PDF
For PDF, you just need to either write some of the data to a stream, or pass it some of the bytes and have Tika read that using Mime Magic Detection by looking for special ("magic") patterns of bytes near the start of the file (which in plain text is %PDF):
String pdfContent = "%PDF-1.4\n%\\E2\\E3\\CF\\D3"; // i.e. base64 decoded
Tika tika = new Tika();
System.out.println(tika.detect(pdfContent.getBytes())); // "application/pdf"
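To make the magic-byte idea concrete without depending on Tika, here is a small sketch that decodes a base64 string and checks whether the result starts with the %PDF- marker. The class PdfMagicCheck and its method names are made up for this example, and it only recognises PDF; Tika's detector covers many more formats.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; checks a single signature only.
final class PdfMagicCheck {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True when the data begins with the ASCII bytes "%PDF-".
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Decodes standard base64 first, then applies the same check.
    static boolean looksLikePdfBase64(String base64) {
        return looksLikePdf(Base64.getDecoder().decode(base64));
    }

    public static void main(String[] args) {
        // "JVBERi0xLjQK" is the base64 encoding of "%PDF-1.4\n"
        System.out.println(looksLikePdfBase64("JVBERi0xLjQK"));
    }
}
```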
JSON
For JSON though, even this method will return text/plain, and Tika is correct: JSON has no magic bytes at the start, so application/json is effectively a subtype of plain text indicating that the text should be interpreted differently. So if you get text/plain, use a JSON library (e.g. Jackson) to parse the content and see if it's valid JSON:
String json = "[1, 2, 3]"; // an array in JSON
try {
final JsonParser parser = new ObjectMapper().getFactory().createParser(json);
while (parser.nextToken() != null) {
}
System.out.println("Probably JSON!");
} catch (Exception e) {
System.out.println("Definitely not JSON!");
}
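Just be careful about how strict you want to be: Jackson accepts a bare scalar such as 1 as valid JSON. The current JSON specification (RFC 8259) permits that, although the older RFC 4627 required the top-level value to be an object or array. If you only want to accept objects and arrays, first test that the string starts with either { or [ (possibly preceded by whitespace) using something like json.matches("(?s)^\\s*[{\\[].*") before even attempting to parse it as JSON. The (?s) flag lets . match line breaks, so multi-line documents pass the check.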
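The pre-check can be sketched on its own using only the JDK. The class JsonPrefilter is a hypothetical name, and the check is deliberately cheap: it says nothing about whether the text is valid JSON, only whether it is worth handing to a parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; a cheap filter, not a validator.
final class JsonPrefilter {
    // DOTALL lets '.' match line terminators, so matches() succeeds on
    // multi-line documents as well as single-line ones.
    private static final Pattern STARTS_LIKE_CONTAINER =
            Pattern.compile("^\\s*[{\\[].*", Pattern.DOTALL);

    // True when the first non-whitespace character is '{' or '['.
    static boolean mightBeJsonContainer(String text) {
        return text != null && STARTS_LIKE_CONTAINER.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(mightBeJsonContainer("[1, 2, 3]"));
        System.out.println(mightBeJsonContainer("1"));
    }
}
```

Only strings that pass this filter would then go through the Jackson parsing loop.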
Here's a DZone tutorial for Jackson.
In my past project I used TikaConfig
What I did is:
// Note: you can also use a byte[] instead of an InputStream
InputStream is = new FileInputStream(new File(YOUR_FILE));
TikaConfig tc = new TikaConfig();
Metadata md = new Metadata();
md.set(Metadata.RESOURCE_NAME_KEY, fileName);
String mimeType = tc.getDetector().detect(TikaInputStream.get(is), md).toString();
By using byte[]:
byte[] fileBytes = GET_BYTE_ARRAY_FROM_YOUR_FILE;
TikaConfig tc = new TikaConfig();
Metadata md = new Metadata();
md.set(Metadata.RESOURCE_NAME_KEY, fileName);
String mimeType = tc.getDetector().detect(TikaInputStream.get(fileBytes), md).toString();
I had no issue in getting the right mimeType....
I hope it is useful
Angelo
I am using Java and I am trying to extract some metadata with Apache Tika, but I cannot extract the expected value for the 'subject' metadata. The file is a jpg image. Here is my code:
First I am parsing the file like this:
inputStream = new FileInputStream(fileToExtract);
Parser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(inputStream, contentHandler, metadata, new ParseContext());
and then I am trying to print these:
metadata.get(Metadata.AUTHOR) --> "MyAuthor"
metadata.get(TikaCoreProperties.CREATOR) --> "MyCreator"
metadata.get(TikaCoreProperties.TITLE) --> "MyTitle"
metadata.get(Metadata.SUBJECT) --> **null**
metadata.get(TikaCoreProperties.KEYWORDS) --> **null**
So, I get all the other values correctly, but I get null for the subject. I added the metadata manually (right click -> Properties, on Windows).
Am I doing something wrong?
PS: Note that "TikaCoreProperties.KEYWORDS" is another way to retrieve the subject according to the Apache Tika documentation.
Apache Tika tries to return consistent metadata across all file formats. It shouldn't matter if one format calls it Author, another Creator, another Created By and another Creator[0]; Tika maps those all onto a consistent key. Typically, those keys are based on well-known external standards, such as Dublin Core.
If you want to see the mappings that Tika applies to Microsoft Office documents, you'll need to look in SummaryExtractor. If you want to know what all the metadata keys and values are that Tika can extract from a given file, either use the tika-app cli tool with --metadata, or call names() on the Metadata object to get the list of metadata keys Tika found.
I want to check that the user uploads only a particular file format (say text files only).
I've written a verification mechanism which checks for format after the file name like this
filename.txt
But this created a problem, because it also accepted other files (like Excel files) that are saved as .txt, e.g.
myexcelfile.txt is assumed to be a text file even when it is an Excel file
So, what would be the unique parameter to check to make sure that the uploaded file is of the required type?
Using apache-commons uploader, servlets.
======================EDIT=====================
Based on answers below, I've tried
FileInputStream my = new FileInputStream(uploadedFile2);
InputStream inputStream = new BufferedInputStream(my);
String mimeType = URLConnection.guessContentTypeFromStream(inputStream);
But it always returns null.
I also checked Files.probeContentType, but it is based on the filename extension, and there is a bug with that approach as well.
I don't prefer to use third party file verifiers, I believe that this problem will have a logical solution.
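A null from URLConnection.guessContentTypeFromStream is expected for plain text: the method only recognises a small, fixed set of leading byte signatures (and needs a stream that supports mark/reset), and a text file has no signature to find. The sketch below illustrates that with in-memory data; the class name StreamGuessDemo and the sample inputs are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
final class StreamGuessDemo {
    // Wraps the JDK call; returns null when no known signature is found.
    static String guess(byte[] content) {
        // ByteArrayInputStream supports mark/reset, which the JDK method requires.
        try (InputStream in = new ByteArrayInputStream(content)) {
            return URLConnection.guessContentTypeFromStream(in);
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] gifHeader = "GIF89a-rest-of-file".getBytes(StandardCharsets.US_ASCII);
        byte[] plainText = "just some plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(gifHeader)); // a recognised signature
        System.out.println(guess(plainText)); // no signature to match
    }
}
```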
Apache Tika has content detection capabilities for a wide range of file formats. From the documentation, one of the simplest ways to detect content type is based on the following code:
// default tika configuration can detect a lot of different file types
TikaConfig tika = new TikaConfig();
// meta data collected about the source file
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, uploadedFile2.toString());
// determine mime type from file contents (detect() returns a MediaType)
String mimetype = tika.getDetector().detect
(TikaInputStream.get(uploadedFile2), metadata).toString();
System.out.println("File " + uploadedFile2 + " is " + mimetype);
If mimetype is text/plain, then the file or stream contains plain text content.
You could open the file and read the first few bytes into a byte[] and check the values to see if it matches the known magic numbers for a particular file format. I tried finding out what that would be for an Excel file (pre-XML; the xlsx file format would identify as a zip file), but I haven't really found much data about that. The closest thing I've found so far was looking at the code for a Java Excel file parser library.
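The old Excel data format used what's called BIFF. Check out the Apache POI library for parsers and such for those types of files. From the looks of it, the magic numbers for an Excel file would probably be 00 06 10 00 (for BIFF8 worksheet), or 00 05 10 00 (BIFF7 worksheet, sounds rather old). Note, though, that BIFF5 and later workbooks are normally stored inside an OLE2 compound document, so the .xls file itself starts with the OLE2 signature D0 CF 11 E0 A1 B1 1A E1 rather than with a BIFF record.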
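Reading the first few bytes and comparing them against known signatures can be sketched as follows. Two signatures are used here: the OLE2 compound document header (D0 CF 11 E0 A1 B1 1A E1), which is the container most .xls files are stored in, and the ZIP local file header (50 4B 03 04), which is how an .xlsx file begins. The class SpreadsheetSignature and its return labels are hypothetical. Note that a match only identifies the container: other OLE2 files (e.g. .doc) and other ZIP files look the same at this level.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; identifies the container, not Excel itself.
final class SpreadsheetSignature {
    private static final byte[] OLE2 = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    // Classifies the leading bytes as "ole2", "zip" or "unknown".
    static String classify(byte[] head) {
        if (startsWith(head, OLE2)) return "ole2";
        if (startsWith(head, ZIP)) return "zip";
        return "unknown";
    }

    // Reads at most the first 8 bytes of the file and classifies them.
    static String classify(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return classify(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }

    public static void main(String[] args) {
        System.out.println(classify(new byte[] {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00}));
        System.out.println(classify("plain text".getBytes()));
    }
}
```

A file named myexcelfile.txt that classifies as "ole2" or "zip" is clearly not a plain text upload, whatever its extension says.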
Try
Files.probeContentType(Paths.get("~/a.xls"))
Note that the output depends on the system's content type detector; it may differ between machines.
As for me, this code returns
application/vnd.ms-excel
private static String getMimeType(String fileUrl) {
String extension = MimeTypeMap.getFileExtensionFromUrl(fileUrl);
return MimeTypeMap.getSingleton().getMimeTypeFromExtension(extension);
}
Rather naively I apply
InputStream in = ...
Metadata tikaMeta = new Metadata();
Tika tika = new Tika();
tika.setMaxStringLength(-1);
body = tika.parseToString(in, tikaMeta);
to convert a file. The body stays empty, which is ok, because it is a binary executable. But a debug log of the content-type meta reveals:
Content-Type->[application/x-executable, application/x-executable]
Any ideas why Tika provides two content types?
I have an URL to file which I can download. It looks like this:
http://<server>/recruitment-mantis/plugin.php?page=BugSynchronizer/getfile&fileID=139&filehash=3e7a52a242f90c23539a17f6db094d86
How do I get the content type of this file? I have to admit that in this case a simple:
URL url = new URL(stringUrl);
URLConnection urlConnection = url.openConnection();
urlConnection.connect();
String urlContent = urlConnection.getContentType();
returns the application/force-download content type for every file (no matter whether it is a jpg or pdf file).
I want to do this because I want to set the extension of the downloaded file (which can vary). How to 'get around' of this application/force-download content type? Thanks in advance for your help.
Check urlConnection.getHeaderField("Content-Disposition") for a filename. Usually that header is used for attachments in multipart content, but it doesn't hurt to check.
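Pulling the filename out of that header value could look like the sketch below. The class ContentDispositionNames is a hypothetical helper, and the regular expression is a simplification: it handles filename="x" and filename=x, but not the encoded filename*= form or escaped quotes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; a simplified parser, not a full RFC 6266 implementation.
final class ContentDispositionNames {
    // Group 1 captures a quoted value, group 2 an unquoted token.
    private static final Pattern FILENAME = Pattern.compile(
            "filename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the filename parameter if the header value carries one.
    static Optional<String> filename(String headerValue) {
        if (headerValue == null) {
            return Optional.empty(); // header absent
        }
        Matcher m = FILENAME.matcher(headerValue);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) != null ? m.group(1) : m.group(2));
    }

    public static void main(String[] args) {
        System.out.println(filename("attachment; filename=\"report.pdf\""));
        System.out.println(filename("inline"));
    }
}
```

The extension of the returned name can then be used for the downloaded file, bearing in mind that the server chooses this value too.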
If that header is not present, you can save the URL to a temporary file, and use probeContentType to get a meaningful MIME type:
Path tempFile = Files.createTempFile(null, null);
try (InputStream urlStream = urlConnection.getInputStream()) {
Files.copy(urlStream, tempFile, StandardCopyOption.REPLACE_EXISTING);
}
String mimeType = Files.probeContentType(tempFile);
Be aware that probeContentType may return null if it can't determine the type of the file.
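One way to handle the null case is to wrap the call and substitute a default. This is only a sketch: the class ProbeWithFallback is a hypothetical name, and treating an IOException the same as "unknown" is an assumption that may not suit every caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration.
final class ProbeWithFallback {
    // Returns the probed type, or the fallback when probing yields null or fails.
    static String probeOrDefault(Path file, String fallback) {
        try {
            String probed = Files.probeContentType(file);
            return probed != null ? probed : fallback;
        } catch (IOException e) {
            // Assumption: an I/O failure is treated like an unknown type.
            return fallback;
        }
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempFile("download", null);
        try {
            // The result depends on the platform's installed detectors.
            System.out.println(probeOrDefault(temp, "application/octet-stream"));
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```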
How to 'get around' of this application/force-download content type?
I had the same problem with my uploaded content-type. Although you can trust the content-type from the URL, I chose to go looking for a content-type utility to determine the type from the byte content.
After trying 5 or so implementations I decided to reinvent the wheel and released my SimpleMagic package which makes use of the magic(5) Unix content-type files to implement the same functionality as the Unix file(1) command. It uses either internal config files or can read /etc/magic, /usr/share/file/magic, or other magic(5) files and determine file content from File, InputStream, or byte[].
Location of the github sources, javadocs, and some documentation are available from the home page.
With SimpleMagic, you do something like the following:
ContentInfoUtil util = new ContentInfoUtil();
ContentInfo info = util.findMatch(byteArray);
It works from the contents of the data (File, InputStream, or byte[]), not the file name.
I guess this content type is set by the server you are downloading from. Some servers use this kind of content type to force browsers to download the file instead of trying to open it. For example, when my server returns the content type "application/pdf", Chrome will try to open it as a PDF, but when the server returns "application/force-download" the browser will save it to disk, because it has no clue what to do with it.
So you need to change the server to return the correct content type or, better, try some other heuristic to determine the correct file type, because the server can always lie to you by declaring a jpg but giving you an exe.
I see with Java 7 you can try this method:
http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType%28java.nio.file.Path%29