I'm reading a file line by line, like this:
FileReader myFile = new FileReader(file);
BufferedReader inputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = inputFile.readLine();
while (currentRecord != null) {
    // process the record, then read the next line
    currentRecord = inputFile.readLine();
}
But if a file of another type is uploaded, this will still read its contents. For instance, if the uploaded file is an image, reading it outputs junk characters. So my question is: how can I check for sure that the file is a CSV before reading it?
Checking the file's extension is kind of lame, since someone can upload a file that is not CSV but has a .csv extension. Thanks in advance.
Determining the MIME type of a file is not easy, especially when ASCII sections can be mixed with binary ones.
Actually, when you look at how a Java mail system determines the MIME type of an email, it involves reading all of its bytes and applying some "rules".
Check out MimeUtility.java
If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
If there is even one non-US-ASCII character, the encoding is "base64".
@return "7bit", "quoted-printable" or "base64"
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
it has been dead since 2006
it involves reading the whole content!
For instance:
File file = new File("/home/bibi/monfichieratester");
InputStream inputStream = new FileInputStream(file);
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
// Read the whole file into memory
int readByte;
while ((readByte = inputStream.read()) != -1) {
    byteArrayStream.write(readByte);
}
inputStream.close();
byte[] bytes = byteArrayStream.toByteArray();
// jMimeMagic: match the content against its magic-number database
// (getMagicMatch throws checked exceptions that must be handled)
MagicMatch m = Magic.getMagicMatch(bytes);
String mimetype = m.getMimeType();
So... since you are reading the whole content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
Java Mime Magic may be of use. It'll analyse MIME types from files and input streams. I can't vouch for its functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain, e.g. determining the number of comma-separated values per line and rejecting the file if it's not within certain limits. Then split on the commas and parse each entry according to your requirements (e.g. are they doubles/floats/valid Strings, and if Strings, what encoding). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted half-way through.
Related
I want to check that the user uploads only a particular file format (say text files only).
I've written a verification mechanism which checks the format from the extension at the end of the file name, like this:
filename.txt
But this created a problem: it also accepts other files (like Excel files) which are saved with a .txt extension, so
myexcelfile.txt is assumed to be a text file even though it is an Excel file.
So, what would be the unique parameter to check to make sure the uploaded file is of the required type?
I'm using the Apache Commons uploader with servlets.
======================EDIT=====================
Based on answers below, I've tried
FileInputStream my = new FileInputStream(uploadedFile2);
InputStream inputStream = new BufferedInputStream(my);
String mimeType = URLConnection.guessContentTypeFromStream(inputStream);
But it always returns a null value.
Files.probeContentType is based on the filename extension, and there is also a known bug with that approach; I checked that too.
I'd prefer not to use third-party file verifiers; I believe this problem has a logical solution.
Apache Tika has content detection capabilities for a wide range of file formats. From the documentation, one of the simplest ways to detect content type is based on the following code:
// The default Tika configuration can detect a lot of different file types
TikaConfig tika = new TikaConfig();
// Metadata collected about the source file; the file name can help detection
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, uploadedFile2.toString());
// Determine the media type from the file contents
MediaType mimetype = tika.getDetector().detect(
        TikaInputStream.get(uploadedFile2), metadata);
System.out.println("File " + uploadedFile2 + " is " + mimetype);
If mimetype is text/plain, then the file or stream contains plain text content.
You could open the file and read the first few bytes into a byte[] and check the values to see if it matches the known magic numbers for a particular file format. I tried finding out what that would be for an Excel file (pre-XML; the xlsx file format would identify as a zip file), but I haven't really found much data about that. The closest thing I've found so far was looking at the code for a Java Excel file parser library.
The old Excel data format used what's called BIFF. Check out the Apache POI library for parsers and such for those types of files. From the looks of it, the magic numbers for an Excel file would probably be 00 06 10 00 (for BIFF8 worksheet), or 00 05 10 00 (BIFF7 worksheet, sounds rather old).
Try:
Files.probeContentType(Paths.get("~/a.xls"))
Note that the output depends on the system's content type detector, so it may differ between machines.
On my machine, this code returns
application/vnd.ms-excel
// Android: look up the MIME type from the file extension via MimeTypeMap
// (extension-based, so this won't catch mislabeled files)
private static String getMimeType(String fileUrl) {
    String extension = MimeTypeMap.getFileExtensionFromUrl(fileUrl);
    return MimeTypeMap.getSingleton().getMimeTypeFromExtension(extension);
}
I'm currently reading through a file which contains meta-data and a TIFF image, like so:
private String readFile(String filename) throws IOException {
    File file = new File(filename);
    int size = (int) file.length();
    byte[] bytes = new byte[size];
    BufferedInputStream buf = new BufferedInputStream(new FileInputStream(file));
    // note: a single read() call is not guaranteed to fill the whole array
    buf.read(bytes, 0, bytes.length);
    buf.close();
    ...
}
I parse the meta-data + image content, then I try to output the tiff like this, where img is a String:
writer = new BufferedWriter( new FileWriter( "img.tiff"));
writer.write(img);
writer.close();
Why is the encoding of the TIFF image file being lost?
Why are you trying to rewrite the file?
If the answer is "I'm trying to alter some metadata within the file," I strongly suggest that you use a set of tools specifically geared towards working with TIFF metadata, especially if you intend to manipulate/alter that metadata, as there are several special-case data elements in TIFF files that really don't like being moved around blithely.
My day-to-day job involves understanding the TIFF spec, so I always get a little antsy when I see people mucking around with the internals of TIFFs without first consulting the spec or being concerned with some of the bizarre special cases that exist in the wild. These now need to be handled because of someone else who didn't fully grok the spec and created a commercial product that generated thousands of these beasts (I'm looking at you, Microsoft, for making "old style JPEG compression" TIFFs; I've also seen a Java product that defined a type of image using floating-point numbers for the component values without bothering to (1) normalize them as the spec would have you do or (2) define a standard for the expected min and max of the component values).
In my code base (and this is a commercial product), you can do your work like this:
TiffFile myTiff = new TiffFile();
myTiff.read(someImageInputStream);
for (TiffDirectory dir : myTiff.getImages())
{
// a TiffDirectory contains a collection of TiffTag objects, from which the
// metadata for each image in the document can be read/edited
// TiffTag definitions can be found in the TIFF 6.0 specification.
}
myTiff.save(someImageOutputStream); // writes the whole TIFF back
and in general, we've found that it's really advanced customers who want to do this. For the most part, we find that customers are more concerned with higher-level operations like combining TIFF files into a single document or extracting out pages, for which we have a different API which is much lighter weight and doesn't require you to know the TIFF specification (as it should be).
Try specifying the encoding in your writer.
http://docs.oracle.com/javase/7/docs/api/java/io/OutputStreamWriter.html#OutputStreamWriter%28java.io.OutputStream,%20java.nio.charset.CharsetEncoder%29
Wrap your stream:
out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
For images you should look into the ImageIO package.
http://docs.oracle.com/javase/7/docs/api/javax/imageio/ImageIO.html#getImageWriter%28javax.imageio.ImageReader%29
I wrote the following method to see whether particular file contains ASCII text characters only or control characters in addition to that. Could you glance at this code, suggest improvements and point out oversights?
The logic is as follows: "If first 500 bytes of a file contain 5 or more Control characters - report it as binary file"
Thank you.
public boolean isAsciiText(String fileName) throws IOException {
InputStream in = new FileInputStream(fileName);
byte[] bytes = new byte[500];
in.read(bytes, 0, bytes.length);
int x = 0;
short bin = 0;
for (byte thisByte : bytes) {
char it = (char) thisByte;
if (!Character.isWhitespace(it) && Character.isISOControl(it)) {
bin++;
}
if (bin >= 5) {
return false;
}
x++;
}
in.close();
return true;
}
Since you call this method "isAsciiText", you know exactly what you're looking for. In other words, it's not "isTextInCurrentLocaleEncoding". Thus you can be more accurate with:
if (thisByte < 32 || thisByte > 127) bin++;
Edit, a long time later: it's been pointed out in a comment that this simple check would be tripped up by a text file that starts with a lot of newlines. It'd probably be better to use a table of "ok" bytes, including printable characters (plus carriage return, newline, and tab, and possibly form feed, though I don't think many modern documents use those), and then check against the table, as sketched below.
x doesn't appear to do anything.
What if the file is less than 500 bytes?
Some binary files have a situation where you can have a header for the first N bytes of the file which contains some data that is useful for an application but that the library the binary is for doesn't care about. You could easily have 500+ bytes of ASCII in a preamble like this followed by binary data in the following gigabyte.
Should handle exception if the file can't be opened or read, etc.
Fails badly if file size is less than 500 bytes
The line char it = (char) thisByte; is conceptually dubious: it mixes the byte and char concepts, i.e. it implicitly assumes that the encoding is one byte = one character (thus it excludes Unicode encodings). In particular, it fails if the file is UTF-16 encoded.
The return inside the loop (slightly bad practice IMO) forgets to close the file.
The first thing I noticed (unrelated to your actual question): you should be closing your input stream in a finally block to ensure it's always closed. Usually this merely handles exceptions, but in your case you won't even close the stream when returning false.
Aside from that, why the comparison to ISO control characters? That's not a "binary" file, that's a "file that contains 5 or more control characters". A better way to approach the situation, in my opinion, would be to invert the check: write an isAsciiText function which asserts that all the characters in the file (or in the first 500 bytes, if you so wish) are in a set of bytes that are known good.
Theoretically, only checking the first few hundred bytes of a file could get you into trouble if it was a composite file of sorts (e.g. text with embedded pictures), but in practice I suspect every such file will have binary header data at the start so you're probably OK.
This would not work with the JDK install packages for Linux or Solaris; they have a shell-script start followed by a binary data blob.
Why not check the MIME type using some library like jMimeMagic (http://sourceforge.net/projects/jmimemagic/) and decide, based on the MIME type, how to handle the file?
One could parse and compare against a list of known binary file header bytes, like the one provided here.
The problem is that one needs a sorted list of binary-only headers, and the list might not be complete at all, for example when reading and parsing the binary files contained in some Equinox framework jar. If one needs to identify specific file types, though, this should work.
If you're on Linux, for existing files on the disk, native file command execution should work well:
String command = "file -i [ZIP FILE...]";
Process process = Runtime.getRuntime().exec(command);
...
It will output information on the files:
...: application/zip; charset=binary
which you can filter further with grep, or in Java, depending on whether you simply need an estimation of the files' binary character or need to find out their MIME types.
If you're parsing InputStreams, like the content of nested files inside archives, this doesn't work, unfortunately, unless you resort to shell-only programs like unzip, assuming you want to avoid creating temporary unzipped files.
For this, a rough estimation of examining the first 500 bytes has worked out OK for me so far, as was hinted in the examples above; instead of Character.isWhitespace/isISOControl(char), I used Character.isIdentifierIgnorable(codePoint), assuming UTF-8 as the default encoding:
private static boolean isBinaryFileHeader(byte[] headerBytes) {
return new String(headerBytes).codePoints().filter(Character::isIdentifierIgnorable).count() >= 5;
}
public void printNestedZipContent(String zipPath) {
try (ZipFile zipFile = new ZipFile(zipPath)) {
int zipHeaderBytesLen = 500;
zipFile.entries().asIterator().forEachRemaining( entry -> {
String entryName = entry.getName();
if (entry.isDirectory()) {
System.out.println("FOLDER_NAME: " + entryName);
return;
}
// Get content bytes from ZipFile for ZipEntry
try (InputStream zipEntryStream = new BufferedInputStream(zipFile.getInputStream(entry))) {
// read and store header bytes
byte[] headerBytes = zipEntryStream.readNBytes(zipHeaderBytesLen);
// Skip entry, if nested binary file
if (isBinaryFileHeader(headerBytes)) {
return;
}
// Continue reading zipInputStream bytes, if non-binary
byte[] zipContentBytes = zipEntryStream.readAllBytes();
int zipContentBytesLen = zipContentBytes.length;
// Join the already-read header bytes and the rest of the content
// (headerBytes.length may be less than zipHeaderBytesLen at end of stream)
byte[] joinedZipEntryContent = Arrays.copyOf(headerBytes, headerBytes.length + zipContentBytesLen);
System.arraycopy(zipContentBytes, 0, joinedZipEntryContent, headerBytes.length, zipContentBytesLen);
// Output (default/UTF-8) encoded text file content
System.out.println(new String(joinedZipEntryContent));
} catch (IOException e) {
System.out.println("ERROR getting ZipEntry content: " + entry.getName());
}
});
} catch (IOException e) {
System.out.println("ERROR opening ZipFile: " + zipPath);
e.printStackTrace();
}
}
You ignore what read() returns; what if the file is shorter than 500 bytes?
When you return false, you don't close the file.
When converting byte to char, you assume your file is 7-bit ASCII.
I have some strings in Java (originally from an Excel sheet) that I presume are in the Windows-1252 codepage. I want them converted to Java's own Unicode format. The Excel file was parsed using the JXL package, in case that matters.
I will clarify: apparently the strings obtained from the Excel file already look pretty much like some kind of Unicode.
WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
row = s.getRow(4);
String contents = row[0].getContents();
This is where contents seems to contain something Unicode: the åäö are multibyte characters, while the ASCII ones are normal single-byte characters. It is most definitely not Latin-1. If I print the contents string with println and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex (195 and 182 in decimal).
[edit]
I have tried the suggestions with different codepages etc. given below, and tried converting from Cp1252 etc. There was some kind of conversion, because I would get some other kind of gibberish instead. As a reference I always printed an "ö" string hand-coded into the source code, to verify that there was nothing wrong with my terminal or typefaces or anything. The manually typed "ö" always worked.
[edit]
I also tried WorkbookSettings as suggested in the comments, but looking at the JXL code, the characterSet seems to be ignored by the parsing code. I think the parsing code just looks at whatever encoding the XLS file claims to be in.
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1250");
Worked for me.
If none of the answers above solve the problem, the trick might be done like this:
String myOutput = new String(myInputBytes, "UTF-8");
This decodes the incoming bytes (myInputBytes must be a byte[]) as UTF-8; it won't help if they were actually written in some other encoding.
When Java parses a file it uses some encoding to read the bytes on the disk and create bytes in memory. The default encoding varies from platform to platform. Java's internal String representation is Unicode already, so if it parses the file with the right encoding then you are already done; just write out the data in any encoding you want.
If your strings appear corrupted when you look at them in Java, it is probably because you are using the wrong encoding to read the data. Excel is probably using UTF-16 (Little-Endian I think) but I'd expect a library like JXL should be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings. I imagine it auto-detects any encodings as it needs to.
Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:
String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
FileOutputStream fos = new FileOutputStream("test.txt"); // Open file
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-16"); // Specify character encoding
PrintWriter pw = new PrintWriter(osw);
pw.print(text); // repeat as needed
pw.close(); // cleanup
osw.close();
fos.close();
If your problem is something else please edit your question and provide more details.
You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.
JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.
"windows-1252"/"Cp1252" is not required to be supported by JREs, but is by Sun's (and presumably most others). See the "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.
FileInputStream fis = new FileInputStream (yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis,"CP1250"));
And do with the reader whatever you'd do directly with the file.
Your description indicates that the encoding is UTF-8 and indeed C3 B6 is the UTF-8 encoding for 'ö'.
This problem seems to happen inconsistently. We are using a Java applet to download a file from our site, which we store temporarily on the client's machine.
Here is the code that we are using to save the file:
URL targetUrl = new URL(urlForFile);
InputStream content = (InputStream)targetUrl.getContent();
BufferedInputStream buffered = new BufferedInputStream(content);
File savedFile = File.createTempFile("temp",".dat");
FileOutputStream fos = new FileOutputStream(savedFile);
int letter;
while ((letter = buffered.read()) != -1) {
    fos.write(letter);
}
fos.close();
Later, I try to access that file by using:
ObjectInputStream keyInStream = new ObjectInputStream(new FileInputStream(savedFile));
Most of the time it works without a problem, but every once in a while we get the error:
java.io.StreamCorruptedException: invalid stream header: 0D0A0D0A
which makes me believe that it isn't saving the file correctly.
I'm guessing that the operations you've done with getContent and BufferedInputStream have treated the file like an ASCII file, which has converted newlines or carriage returns into carriage return + newline (0x0D0A), and that has confused ObjectInputStream (which expects serialized data objects).
If you are using an FTP URL, the transfer may be occurring in ASCII mode.
Try appending ";type=I" to the end of your URL.
Why are you using ObjectInputStream to read it?
As per the javadoc:
An ObjectInputStream deserializes primitive data and objects previously written using an ObjectOutputStream.
Probably the error comes from the fact you didn't write it with ObjectOutputStream.
Try reading it with a FileInputStream only.
Here's a sample for binary (although not the most efficient way).
Here's another used for text files.
There are 3 big problems in your sample code:
You're not just treating the input as bytes
You're needlessly pulling the entire object into memory at once
You're doing multiple method calls for every single byte read and written -- use the array based read/write!
Here's a redo:
URL targetUrl = new URL(urlForFile);
InputStream is = targetUrl.openStream();
File savedFile = File.createTempFile("temp", ".dat");
FileOutputStream fos = new FileOutputStream(savedFile);
int count;
byte[] buff = new byte[16 * 1024];
while ((count = is.read(buff)) != -1) {
    fos.write(buff, 0, count);
}
fos.close();
is.close();
You could also step back from the code and check whether the file on your client is the same as the file on the server. If you get both files onto an XP machine, you should be able to use the FC utility to compare them (check FC's help for the switch that runs a binary compare). If you're on Unix, cmp will do a byte-by-byte compare.
If the files are identical, then you're looking at a problem with the code that reads the file.
If the files are not identical, focus on the code that writes your file.
Good luck!