Identify File Type in Java - java

I want to check that the user uploads only a particular file format (say text files only).
I've written a check that looks at the extension at the end of the file name, like this:
filename.txt
But this causes a problem: it also accepts other files (such as Excel files) that were saved with a .txt extension. For example,
myexcelfile.txt is treated as a text file even though it is an Excel file.
So what is the unique property I should check to make sure that the uploaded file is of the required type?
I'm using the Apache Commons uploader with servlets.
======================EDIT=====================
Based on answers below, I've tried
FileInputStream my = new FileInputStream(uploadedFile2);
InputStream inputStream = new BufferedInputStream(my);
String mimeType = URLConnection.guessContentTypeFromStream(inputStream);
But it always returns null.
Files.probeContentType is based on the file name extension, and there is also a bug with that approach; I checked that too.
I'd prefer not to use third-party file verifiers; I believe this problem has a logical solution.

Apache Tika has content detection capabilities for a wide range of file formats. From the documentation, one of the simplest ways to detect content type is based on the following code:
// default tika configuration can detect a lot of different file types
TikaConfig tika = new TikaConfig();
// metadata collected about the source file
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, uploadedFile2.toString());
// determine mime type from file contents; detect() returns a MediaType
String mimetype = tika.getDetector().detect(
        TikaInputStream.get(uploadedFile2), metadata).toString();
System.out.println("File " + uploadedFile2 + " is " + mimetype);
If mimetype is text/plain, then the file or stream contains plain text content.

You could open the file, read the first few bytes into a byte[], and check whether they match the known magic number of a particular file format. I tried to find out what that would be for an Excel file (pre-XML; an xlsx file would identify as a zip file), but I haven't found much data about it. The closest thing I've found so far was the code of a Java Excel file parser library.
The old Excel data format uses what's called BIFF. Check out the Apache POI library for parsers for those types of files. From the looks of it, the magic numbers for an Excel file would probably be 00 06 10 00 (for a BIFF8 worksheet) or 00 05 10 00 (a BIFF7 worksheet, which sounds rather old). Note that a BIFF8 .xls file on disk is normally wrapped in an OLE2 compound document, which starts with the bytes D0 CF 11 E0 A1 B1 1A E1, so these BIFF values would appear inside the workbook stream rather than at the very start of the file.
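The magic-number check described above can be sketched roughly as follows. This is only an illustration, not a complete detector: the class and method names (MagicNumberCheck, classify, readHead) are made up for the example, and it only knows two signatures, the OLE2 compound document header used by legacy Office files and the ZIP local file header that xlsx files start with. Plain text has no magic number, so a check like this can only reject known binary containers; it cannot prove that a file is text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class MagicNumberCheck {
    // First eight bytes of an OLE2 compound document (container of legacy .xls/.doc/.ppt files).
    static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header signature "PK\3\4"; .xlsx files are ZIP archives.
    static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    // True if head begins with every byte of signature.
    static boolean startsWith(byte[] head, byte[] signature) {
        return head.length >= signature.length
                && Arrays.equals(Arrays.copyOf(head, signature.length), signature);
    }

    // Hypothetical labels; a real application would map these to its own policy.
    static String classify(byte[] head) {
        if (startsWith(head, OLE2)) return "ole2";
        if (startsWith(head, ZIP)) return "zip";
        return "unknown";
    }

    // Reads up to n bytes from the start of the file (fewer if the file is shorter).
    static byte[] readHead(Path file, int n) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read < 0) break;
                total += read;
            }
            return Arrays.copyOf(buf, total);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println(file + " looks like: " + classify(readHead(file, 8)));
    }
}
```

With a check like this, an upload named myexcelfile.txt whose first bytes classify as "ole2" or "zip" could be rejected regardless of its extension.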

Try
Files.probeContentType(Paths.get("~/a.xls"))
Note that the output depends on the system's content type detector, so it may differ between machines.
For me, this code returns
application/vnd.ms-excel

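// Android only (android.webkit.MimeTypeMap): looks the MIME type up
// from the file extension in the URL, not from the file content.
private static String getMimeType(String fileUrl) {
String extension = MimeTypeMap.getFileExtensionFromUrl(fileUrl);
return MimeTypeMap.getSingleton().getMimeTypeFromExtension(extension);
}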

Related

Is it possible to read a TIFF file like a text file, delete some header rows, and save it back as a TIFF file?

I'm trying to delete the first 3 rows of the content of a TIFF file generated by a scanner, because I can't open the file correctly with them in place.
Example of the rows to delete:
------=_Part_23XX49_-1XXXX3073.1XXXXX20715
ID: documento<br>
MimeType: image/tiff
I have no problem changing the content, but when I save the new file, I still can't open it correctly.
System.out.println(new InputStreamReader(in).getEncoding());
This call tells me that the encoding of the source file is "Cp1252", so I passed an argument to the JVM (-Dfile.encoding=Cp1252), but nothing appears to change.
This is what I do:
StringBuilder fileContent = new StringBuilder();
// working with content and save result content in fileContent variable
// save the file again
FileWriter fstreamWrite = new FileWriter(f.getAbsolutePath());
out = new BufferedWriter(fstreamWrite);
out.write(fileContent.toString());
Is it possible that something is going wrong with the encoding?
If I do the same operation with Notepad++, I get a correct TIFF that I can open without problems.
I found the TIFF Java library, which may be useful for your requirements.
Please take a look at its readme to see how to read and write a TIFF file.
I hope this helps.
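As a side note to the encoding question above: a TIFF is binary data, and pushing it through a Reader/Writer decodes and re-encodes every byte, which can alter bytes that have no mapping in the charset. A possible alternative, sketched below under the assumption that the unwanted header really is the first three newline-terminated lines, is to work on raw bytes only. The class and method names (StripLeadingLines, dropLines) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class StripLeadingLines {
    // Returns a copy of data without its first `lines` lines.
    // A line is taken to end at '\n' (0x0A); a preceding '\r' is dropped along with it.
    static byte[] dropLines(byte[] data, int lines) {
        int offset = 0;
        int remaining = lines;
        while (remaining > 0 && offset < data.length) {
            if (data[offset++] == '\n') {
                remaining--;
            }
        }
        return Arrays.copyOfRange(data, offset, data.length);
    }

    public static void main(String[] args) throws IOException {
        Path source = Paths.get(args[0]);   // file with the extra header rows
        Path target = Paths.get(args[1]);   // cleaned output file
        // No charset is involved at any point, so the image bytes are copied unchanged.
        Files.write(target, dropLines(Files.readAllBytes(source), 3));
    }
}
```

Because the whole file is read into memory, this sketch is only suitable for files of moderate size.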

Getting MIME type of a File

I want to get the MIME type of a file. Can anyone please help me?
I want the MIME type like this:
File file=new File("example.jpeg");
String MimeTypeOfFile=/*files mimetype*/;
Thank You in Advance
You can use the Apache Tika Library: It detects and extracts metadata and text from over a thousand different file types
http://tika.apache.org/0.7/detection.html
It has various methods, such as checking the extension or reading the file data, to detect the MIME type. Using it is easier and more efficient than writing the detection yourself.
Example :
System.out.println(new Tika().detect(new File(PATH_TO_FILE)));

JSP: Get MIME Type on File Upload

I'm doing a file upload, and I want to get the Mime type from the uploaded file.
I was trying to use the request.getContentType(), but when I call:
String contentType = req.getContentType();
It will return:
multipart/form-data; boundary=---------------------------310662768914663
How can I get the correct value?
Thanks in advance
It sounds as if you're writing your own multipart/form-data parser. I wouldn't recommend doing that. Rather use a decent one like Apache Commons FileUpload. For uploaded files, it offers FileItem#getContentType() to extract the client-specified content type, if any.
String contentType = item.getContentType();
If it returns null (just because the client didn't specify it), then you can fall back on ServletContext#getMimeType(), which is based on the file name.
String filename = FilenameUtils.getName(item.getName());
String contentType = getServletContext().getMimeType(filename);
This will be resolved based on <mime-mapping> entries in servletcontainer's default web.xml (in case of for example Tomcat, it's present in /conf/web.xml) and also on the web.xml of your webapp, if any, which can expand/override the servletcontainer's default mappings.
You however need to keep in mind that the value of the multipart content type is fully controlled by the client and also that the client-provided file extension does not necessarily need to represent the actual file content. For instance, the client could just edit the file extension. Be careful when using this information in business logic.
Related:
How to upload files in JSP/Servlet?
How to check whether an uploaded file is an image?
just use:
public String ServletContext.getMimeType(String file)
You could use MimetypesFileTypeMap:
String contentType = new MimetypesFileTypeMap().getContentType(fileName); // gets mime type
However, you would encounter the overhead of editing the mime.types file, if the file type is not already listed. (Sorry, I take that back, as you could add instances to the map programmatically and that would be the first place that it checks)

Get real file extension -Java code

I would like to determine the real file extension for security reasons.
How can I do that?
Supposing you really mean to get the true content type of a file (i.e. its MIME type), you should refer to this excellent answer.
You can try to detect the content type of a file from its leading bytes using the following code (it returns null for formats it does not recognize):
File file = new File("filename.asgdsag");
InputStream is = new BufferedInputStream(new FileInputStream(file));
String mimeType = URLConnection.guessContentTypeFromStream(is);
There are a number of ways that you can do this, some more complicated (and more reliable) than others. The page I linked to discusses quite a few of these approaches.
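To illustrate the limits of the snippet above: URLConnection.guessContentTypeFromStream only recognizes a small set of signatures and requires a stream that supports mark/reset. The sketch below feeds it in-memory bytes; the class name and sample data are made up for the example. On the JDKs I have seen, the PNG signature is recognized while plain text yields null, which would also explain a null result for text uploads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class GuessFromStream {
    // Wraps the bytes in a stream that supports mark/reset, as the JDK method requires.
    static String guess(byte[] content) throws IOException {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return URLConnection.guessContentTypeFromStream(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // The eight-byte PNG file signature.
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        // Plain text has no signature for the method to match.
        byte[] text = "name,age\nAlice,30\n".getBytes(StandardCharsets.US_ASCII);

        System.out.println("png bytes  -> " + guess(png));
        System.out.println("text bytes -> " + guess(text));
    }
}
```

A null result therefore means "not recognized", not "this is text", so it cannot be used on its own to accept a file as plain text.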
I'm not sure exactly what you mean, but however you do this, it is only going to work for the specific set of file formats that are known to you.
You could exclude executables (are you talking about Windows here?). There's some file header information at http://support.microsoft.com/kb/65122. You could scan for and block files that look like they have an exe header. Is this getting close to what you mean by 'real file extension'?
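As a rough illustration of blocking files with an exe header: DOS and Windows executables begin with the two ASCII bytes "MZ". The sketch below checks only that marker; the class and method names are hypothetical, and a text file that happens to start with "MZ" would also match, so this is a coarse filter rather than a reliable detector.

```java
import java.nio.charset.StandardCharsets;

class ExeHeaderCheck {
    // True if the data starts with the "MZ" marker of the DOS/Windows executable header.
    static boolean looksLikeWindowsExecutable(byte[] head) {
        return head.length >= 2 && head[0] == 'M' && head[1] == 'Z';
    }

    public static void main(String[] args) {
        byte[] exe = {'M', 'Z', (byte) 0x90, 0x00};
        byte[] txt = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeWindowsExecutable(exe)); // true
        System.out.println(looksLikeWindowsExecutable(txt)); // false
    }
}
```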

CSV file validation with Java

I'm reading a file line by line, like this:
FileReader myFile = new FileReader(file);
BufferedReader InputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = InputFile.readLine();
while(currentRecord != null) {
currentRecord = InputFile.readLine();
}
But if other types of files are uploaded, it will still read their contents. For instance, if the uploaded file is an image, it will output junk characters when reading the file. So my question is: how can I check the file is CSV for sure before reading it?
Checking extension of the file is kind of lame since someone can upload a file that is not CSV but has a .csv extension. Thanks in advance.
Determining the MIME type of a file is not easy, especially if ASCII sections can be mixed with binary ones.
In fact, when you look at how JavaMail determines the encoding of a mail part, it involves reading all of its bytes and applying some "rules".
Check out MimeUtility.java
If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
If there is even one non-US-ASCII character, the encoding is "base64".
@return "7bit", "quoted-printable" or "base64"
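The rules quoted above can be restated as a short sketch. This is not JavaMail's implementation, only a direct reading of the quoted documentation; the class and method names are hypothetical, and treating the "exactly half" case as quoted-printable is an assumption, since the quoted text does not cover it.

```java
class TransferEncodingRules {
    // Picks "7bit", "quoted-printable" or "base64" following the quoted rules.
    static String chooseEncoding(boolean primaryTypeIsText, byte[] data) {
        int nonAscii = 0;
        for (byte b : data) {
            if ((b & 0xFF) > 0x7F) {   // bytes above 0x7F are not US-ASCII
                nonAscii++;
            }
        }
        if (nonAscii == 0) {
            return "7bit";             // all US-ASCII, for text and non-text alike
        }
        if (!primaryTypeIsText) {
            return "base64";           // non-text with at least one non-ASCII byte
        }
        int ascii = data.length - nonAscii;
        // Assumption: a tie (exactly half non-ASCII) falls back to quoted-printable.
        return nonAscii > ascii ? "base64" : "quoted-printable";
    }

    public static void main(String[] args) {
        byte[] mostlyAscii = {'a', 'b', 'c', (byte) 0xE9};
        System.out.println(chooseEncoding(true, mostlyAscii));
        System.out.println(chooseEncoding(false, mostlyAscii));
    }
}
```

Note that every branch needs the full byte count, which is why this kind of rule requires reading the whole input.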
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
it has been dead since 2006
it also involves reading the entire content!
For example:
File file = new File("/home/bibi/monfichieratester");
InputStream inputStream = new FileInputStream(file);
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
int readByte;
while ((readByte = inputStream.read()) != -1) {
byteArrayStream.write(readByte);
}
String mimetype = "";
byte[] bytes = byteArrayStream.toByteArray();
MagicMatch m = Magic.getMagicMatch(bytes);
mimetype = m.getMimeType();
So... since you are reading the entire content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
Java Mime Magic may be of use. It'll analyse mime-types from files and inputstreams. I can't vouch for its functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain. e.g. determining the number of comma-separated values per line and rejecting if it's not within certain limits. Then split on the commas and parse each entry according to requirements (e.g. are they doubles/floats/valid Strings - and if strings, what encoding). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted half-way through.
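A minimal sketch of that idea, counting comma-separated values per line and rejecting input outside given limits, might look like this. The names (CsvShapeCheck, hasConsistentColumns) and the extra requirement that every line has the same number of columns are assumptions made for the example. The naive split on "," does not handle quoted fields that contain commas, so a real CSV parser would be needed for such data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CsvShapeCheck {
    // True if the input has at least one line and every line has the same number of
    // comma-separated values, within [minColumns, maxColumns].
    static boolean hasConsistentColumns(Reader source, int minColumns, int maxColumns)
            throws IOException {
        BufferedReader reader = new BufferedReader(source);
        int expected = -1;
        String line;
        while ((line = reader.readLine()) != null) {
            // A limit of -1 keeps trailing empty fields, e.g. "a,b," counts as 3 values.
            int columns = line.split(",", -1).length;
            if (columns < minColumns || columns > maxColumns) {
                return false;
            }
            if (expected == -1) {
                expected = columns;       // first line defines the expected width
            } else if (columns != expected) {
                return false;             // e.g. a file corrupted half-way through
            }
        }
        return expected != -1;            // empty input is rejected
    }

    public static void main(String[] args) throws IOException {
        System.out.println(hasConsistentColumns(new StringReader("a,b,c\n1,2,3\n"), 2, 5));
        System.out.println(hasConsistentColumns(new StringReader("a,b,c\n1,2\n"), 2, 5));
    }
}
```

Per-field checks (numbers, allowed strings, encoding) would then run on the split values, as suggested above.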
