I wrote the following method to check whether a particular file contains only ASCII text characters, or control characters in addition to them. Could you glance at this code, suggest improvements, and point out oversights?
The logic is as follows: "if the first 500 bytes of a file contain 5 or more control characters, report it as a binary file".
Thank you.
public boolean isAsciiText(String fileName) throws IOException {
    InputStream in = new FileInputStream(fileName);
    byte[] bytes = new byte[500];
    in.read(bytes, 0, bytes.length);
    int x = 0;
    short bin = 0;
    for (byte thisByte : bytes) {
        char it = (char) thisByte;
        if (!Character.isWhitespace(it) && Character.isISOControl(it)) {
            bin++;
        }
        if (bin >= 5) {
            return false;
        }
        x++;
    }
    in.close();
    return true;
}
Since you call this method "isAsciiText", you know exactly what you're looking for. In other words, it's not "isTextInCurrentLocaleEncoding". Thus you can be more accurate with:
if (thisByte < 32 || thisByte > 127) bin++;
Edit, a long time later: it's pointed out in a comment that this simple check would be tripped up by a text file that started with a lot of newlines. It'd probably be better to use a table of "ok" bytes which includes the printable characters (plus carriage return, newline, and tab, and possibly form feed, though I don't think many modern documents use that), and then check against the table.
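For instance, a minimal sketch of such a table (the exact set of "ok" bytes is a judgment call; this one allows printable ASCII plus tab, newline, carriage return, and form feed):
private static final boolean[] OK_BYTES = new boolean[256];
static {
    for (int i = 32; i <= 126; i++) OK_BYTES[i] = true; // printable ASCII
    OK_BYTES['\t'] = OK_BYTES['\n'] = OK_BYTES['\r'] = OK_BYTES['\f'] = true;
}

private static boolean isOkByte(byte b) {
    return OK_BYTES[b & 0xFF]; // mask so negative byte values index 128..255
}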
x doesn't appear to do anything.
What if the file is less than 500 bytes?
Some binary files have a situation where you can have a header for the first N bytes of the file which contains some data that is useful for an application but that the library the binary is for doesn't care about. You could easily have 500+ bytes of ASCII in a preamble like this followed by binary data in the following gigabyte.
Should handle exception if the file can't be opened or read, etc.
Fails badly if file size is less than 500 bytes
The line char it = (char) thisByte; is conceptually dubious: it mixes the byte and char concepts, i.e. it implicitly assumes that the encoding is one byte = one character (thus it excludes multi-byte Unicode encodings). In particular, it fails if the file is UTF-16 encoded.
The return inside the loop (slightly bad practice IMO) forgets to close the file.
The first thing I noticed - unrelated to your actual question, but you should be closing your input stream in a finally block to ensure it's always closed. Usually that only matters when an exception occurs, but in your case you won't even close the stream when returning false.
Aside from that, why the comparison to ISO control characters? That's not a "binary" file, that's a "file that contains 5 or more control characters". A better way to approach the situation, in my opinion, would be to invert the check - write an isAsciiText function which asserts that all the characters in the file (or in the first 500 bytes, if you so wish) are in a set of bytes that are known good.
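For example, a minimal sketch of that inverted check, with the stream closed on every path via try-with-resources (the allowed set below - printable ASCII plus tab, CR, and LF - is just one reasonable choice):
public boolean isAsciiText(String fileName) throws IOException {
    try (InputStream in = new FileInputStream(fileName)) {
        byte[] bytes = new byte[500];
        int n = in.read(bytes, 0, bytes.length); // may be less than 500; -1 if the file is empty
        for (int i = 0; i < n; i++) {
            int b = bytes[i] & 0xFF;
            boolean known = (b >= 32 && b <= 126) || b == '\t' || b == '\n' || b == '\r';
            if (!known) {
                return false; // the stream is still closed by try-with-resources
            }
        }
        return true; // an empty file or an all-known prefix counts as text
    }
}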
Theoretically, only checking the first few hundred bytes of a file could get you into trouble if it was a composite file of sorts (e.g. text with embedded pictures), but in practice I suspect every such file will have binary header data at the start so you're probably OK.
This would not work with the JDK install packages for Linux or Solaris: they have a shell-script start and then a binary data blob.
Why not check the MIME type using a library like jMimeMagic (http://sourceforge.net/projects/jmimemagic/) and decide how to handle the file based on the MIME type?
One could parse and compare against a list of known binary file header bytes, like the one provided here.
The problem is that one needs to have a sorted list of binary-only headers, and the list might never be complete - for example, when reading and parsing binary files contained in some Equinox framework jar. If one needs to identify the specific file types, though, this should work.
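A sketch of that idea with a few well-known signatures (this short list is only illustrative; a real one would be much longer):
import java.util.Arrays;
import java.util.List;

public class MagicNumberCheck {
    // A handful of well-known file signatures ("magic numbers").
    private static final List<byte[]> SIGNATURES = List.of(
            new byte[] {0x50, 0x4B, 0x03, 0x04},                            // ZIP (also jar, docx, ...)
            new byte[] {(byte) 0x89, 0x50, 0x4E, 0x47},                     // PNG
            new byte[] {0x47, 0x49, 0x46, 0x38},                            // GIF
            new byte[] {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE} // Java class file
    );

    public static boolean hasKnownBinaryHeader(byte[] header) {
        for (byte[] sig : SIGNATURES) {
            if (header.length >= sig.length
                    && Arrays.equals(Arrays.copyOfRange(header, 0, sig.length), sig)) {
                return true;
            }
        }
        return false;
    }
}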
If you're on Linux, then for existing files on disk, executing the native file command should work well:
String command = "file -i [ZIP FILE...]";
Process process = Runtime.getRuntime().exec(command);
...
It will output information on the files:
...: application/zip; charset=binary
which you can further filter with grep, or in Java, depending on whether you simply need an estimate of the files' binary character or need to find out their MIME types.
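For instance, a rough sketch of that approach (it assumes a Unix-like system with the file command on the PATH; -b suppresses the file name in the output):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class FileCommand {
    public static String mimeTypeOf(String path) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("file", "-b", "--mime", path).start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line = reader.readLine(); // e.g. "application/zip; charset=binary"
            process.waitFor();
            return line;
        }
    }
}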
Unfortunately, this doesn't work when parsing InputStreams, like the content of nested files inside archives, unless you resort to shell-only programs such as unzip - if you want to avoid creating temporary unzipped files.
For this, a rough estimate based on examining the first 500 bytes has worked out OK for me so far, as was hinted at in the examples above; instead of Character.isWhitespace/isISOControl(char), I used Character.isIdentifierIgnorable(codePoint), assuming UTF-8 as the default encoding:
private static boolean isBinaryFileHeader(byte[] headerBytes) {
    return new String(headerBytes, StandardCharsets.UTF_8)
            .codePoints().filter(Character::isIdentifierIgnorable).count() >= 5;
}
public void printNestedZipContent(String zipPath) {
try (ZipFile zipFile = new ZipFile(zipPath)) {
int zipHeaderBytesLen = 500;
zipFile.entries().asIterator().forEachRemaining( entry -> {
String entryName = entry.getName();
if (entry.isDirectory()) {
System.out.println("FOLDER_NAME: " + entryName);
return;
}
// Get content bytes from ZipFile for ZipEntry
try (InputStream zipEntryStream = new BufferedInputStream(zipFile.getInputStream(entry))) {
// read and store header bytes
byte[] headerBytes = zipEntryStream.readNBytes(zipHeaderBytesLen);
// Skip entry, if nested binary file
if (isBinaryFileHeader(headerBytes)) {
return;
}
// Continue reading zipInputStream bytes, if non-binary
byte[] zipContentBytes = zipEntryStream.readAllBytes();
int zipContentBytesLen = zipContentBytes.length;
// Join the already-read header bytes and the rest of the content bytes, header first
byte[] joinedZipEntryContent = Arrays.copyOf(headerBytes, headerBytes.length + zipContentBytesLen);
System.arraycopy(zipContentBytes, 0, joinedZipEntryContent, headerBytes.length, zipContentBytesLen);
// Output (default/UTF-8) encoded text file content
System.out.println(new String(joinedZipEntryContent));
} catch (IOException e) {
System.out.println("ERROR getting ZipEntry content: " + entry.getName());
}
});
} catch (IOException e) {
System.out.println("ERROR opening ZipFile: " + zipPath);
e.printStackTrace();
}
}
You ignore what read() returns; what if the file is shorter than 500 bytes?
When you return false, you don't close the file.
When converting byte to char, you assume your file is 7-bit ASCII.
Related
My assignment is to create a program that does compression using the Huffman algorithm. My program must be able to compress any type of file, hence why I'm not using a Reader, which works with characters.
I don't understand how to build a frequency table when encoding a binary file.
EDIT!! Problem solved.
public static void main(String[] args) {
    int[] frequency = new int[256]; // one counter per possible byte value
    try (FileInputStream in = new FileInputStream("./src/hello.jpg")) {
        int currentByte;
        while ((currentByte = in.read()) != -1) {
            // read() returns each byte as 0..255, so it can index the table directly
            frequency[currentByte]++;
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I'm not sure what you mean by "reading from an image and looking at the characters", but for text files (as you're reading one in your code example) this works most of the time by casting the read byte to char:
char charVal = (char) currentByte;
It mostly works because most data is ASCII and most charsets contain ASCII. It gets more complicated with non-ASCII characters, because a simple cast is equivalent to using the charset ISO-8859-1. That will still produce correct results most of the time, because e.g. Windows' cp1252 (on German systems) differs from ISO-8859-1 only around the Euro sign.
Things start to run havoc with charsets like UTF-8, where non-ASCII characters are encoded as multiple bytes, so you will see things like Ã¤ instead of an ä. The same goes for files encoded as UTF-16, where every second byte is most likely a binary zero.
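You can reproduce that effect directly; a tiny demonstration (not part of the original code):
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "ä".getBytes(StandardCharsets.UTF_8); // two bytes: 0xC3 0xA4
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong); // prints "Ã¤" - each byte decoded as its own character
    }
}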
You could use Files.readAllBytes and then iterate over that array.
Path path = Paths.get("hello.txt");
try {
    byte[] array = Files.readAllBytes(path);
} catch (IOException e) {
    e.printStackTrace();
}
I'm a Java newbie with a real hair puller. Hope someone can help.
I have a binary file that loads OK from the applet's directory, but which only partially loads from the applet's jar file.
The code below loads the file both ways and compares them. They should be identical, but the output is "divergence at byte 8181".
int spx_data_length = 158994;
byte[] spx_buf = new byte[spx_data_length];
byte[] spx_buf2 = new byte[spx_data_length];
// binary file in jar
InputStream is = Vocals.class.getResourceAsStream("0.raw");
is.read(spx_buf, 0, spx_data_length);
is.close();
// same binary file in applet directory
URL srcURL=new URL(getCodeBase(),"0.raw");
URLDataSource u_dat = new URLDataSource(srcURL);
is=u_dat.getInputStream();
is.read(spx_buf2, 0, spx_data_length);
is.close();
// compare them
for (int i = 0; i < spx_data_length; i++) {
    if (spx_buf[i] != spx_buf2[i]) {
        Obj[0] = "divergence at byte " + i;
        win.call("show_string", Obj);
        i = spx_data_length;
    }
}
InputStream.read(byte[], int, int) will read up to spx_data_length bytes, but may very well read less. Particularly in the case of compressed data (i.e. reading from the JAR), it might return one decompression buffer worth of data at a time. You should either loop until the read returns -1, or use something like DataInputStream.readFully(byte[], int, int). And you should compare the number of bytes read: if that differs, there is little point in comparing the bytes past the smaller of the two counts.
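For instance, a sketch of such a loop, reusing the variables from the question (DataInputStream.readFully(byte[], int, int) would achieve the same in one call):
// Keep reading until the buffer is full or the stream ends early.
InputStream is = Vocals.class.getResourceAsStream("0.raw");
int off = 0;
while (off < spx_data_length) {
    int n = is.read(spx_buf, off, spx_data_length - off);
    if (n == -1) {
        break; // end of stream: only the first 'off' bytes are valid
    }
    off += n;
}
is.close();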
I'm outputting a byte array to a text file using the following method:
try {
    FileOutputStream fos = new FileOutputStream(filePath + ".8102");
    fos.write(concatenatedIVCipherMAC);
    fos.close();
} catch (Exception e) {
    e.printStackTrace();
}
which outputs what looks like UTF-16 encoded data to the file, for example:
¢¬6î)ªÈP~m˜LïiƟê•Àe»/#Ó ö¹¥‘þ²XhÃ&¼lG:Öé )GU3«´DÃ{+í—Ã]íò
However, when I'm reading it back in, I get þÿ prepended to the front of the data, e.g.:
þÿ¢¬6î)ªÈP~m˜LïiƟê•Àe»/?#Ó ö¹¥‘þ²XhÃ&¼lG:Öé )GU3«´DÃ{+í—Ã]íò
This is the method I'm using to read in the file:
private String getFilesContents() {
    String fileContents = "";
    Scanner sc = null;
    try {
        sc = new Scanner(file, "UTF-16");
        System.out.println("Can read file: " + file.canRead());
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    while (sc.hasNextLine()) {
        fileContents += sc.nextLine();
    }
    sc.close();
    return fileContents;
}
and then byte[] contentsOfFile = fileContents.getBytes("UTF-16"); to convert the String into a byte array.
A quick Google told me that þÿ represents the byte order mark, but is it Java putting that there, or Windows? How can I avoid having the þÿ prepended at the start of the data I'm reading in? I was thinking of just ignoring the first two bytes, but if it is Windows then this will obviously break the program on other platforms.
edit: changed appended to prepended.
The file is the IV+data+MAC. It's not meant to be readable text. Should I be doing something differently?
Yes. You shouldn't be trying to treat it as text anywhere.
If you really need to convert arbitrary binary data into text, use Base64 to convert it. Other than that, stick to byte arrays, InputStream and OutputStream.
I don't know exactly why you're supposedly getting extra characters, but the fact that you haven't got real text to start with suggests that it's not really worth diagnosing that side. Just start handling binary data as binary data instead.
EDIT: Have a look at Guava's IO helpers for simplicity...
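If the encrypted bytes really must pass through a text file, a Base64 round trip might look like this (a sketch using java.util.Base64 from Java 8+; the file name is made up):
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] concatenatedIVCipherMAC = {1, 2, 3}; // stand-in for the real IV+data+MAC
        Path path = Paths.get("out.8102");

        // Encode the binary data as ASCII text and write it out
        Files.write(path, Base64.getEncoder().encode(concatenatedIVCipherMAC));

        // Read the text back and decode to the original bytes - no BOM, no charset guessing
        byte[] restored = Base64.getDecoder().decode(Files.readAllBytes(path));
        System.out.println(restored.length); // 3
    }
}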
þÿ is the Unicode byte order mark (BOM) character saved as UTF-16BE and interpreted as ISO-8859-1.
You shouldn't treat binary data as text (in whatever encoding), if you want to avoid such errors.
I have a string that has 823237 characters in it. It's actually an XML file, and for testing purposes I want to return it as a response from a servlet.
I have tried everything I can possibly think of:
1) creating a constant with the whole string - in this case Eclipse complains (with a red line under the servlet class name):
The type generates a string that requires more than 65535 bytes to encode in Utf8 format in the constant pool
2) breaking the whole string into 20 string constants and writing them to the out object directly, something like:
out.println( CONSTANT_STRING_PART_1 + CONSTANT_STRING_PART_2 +
CONSTANT_STRING_PART_3 + CONSTANT_STRING_PART_4 +
CONSTANT_STRING_PART_5 + CONSTANT_STRING_PART_6 +
// add all the string constants till .... CONSTANT_STRING_PART_20);
In this case the build fails, complaining:
[javac] D:\xx\xxx\xxx.java:87: constant string too long
[javac] CONSTANT_STRING_PART_19 + CONSTANT_STRING_PART_20);
^
3) reading the XML file as a string and writing it to the out object; in this case I get:
SEVERE: Allocate exception for servlet MyServlet
Caused by: org.apache.xmlbeans.XmlException: error: Content is not allowed in prolog.
Finally, my question is: how can I return such a big string (as a response) from the servlet?
You can avoid loading all the text into memory by using streams:
InputStream is = new FileInputStream("path/to/your/file");
// or the following line if the file is on the classpath:
// InputStream is = MyServlet.class.getResourceAsStream("path/to/file/in/classpath");
byte[] buff = new byte[4 * 1024];
int read;
while ((read = is.read(buff)) != -1) {
    out.write(buff, 0, read);
}
is.close();
The second approach might work in the following way:
out.print(CONSTANT_STRING_PART_1);
out.print(CONSTANT_STRING_PART_2);
out.print(CONSTANT_STRING_PART_3);
out.print(CONSTANT_STRING_PART_4);
// ...
out.print(CONSTANT_STRING_PART_N);
out.println();
You can do this in a loop, of course (which is highly recommended ;)).
The way you do it now, you just temporarily create the large string again in order to pass it to println(), which runs into the same problem as the first approach.
Ropes: Theory and practice
Why and when to use Ropes for Java for string manipulations
You can read an 823K file into a String. Maybe not the most elegant method, but totally doable. Method 3 should have worked. There was an XML error, but that has nothing to do with reading from a file into a String, or with the length of the data.
It has to be an external file, though, because it is too big to be inlined into a class file (there are size limits for those).
I recommend Commons IO FileUtils#readFileToString.
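A sketch of that call (Commons IO 2.3+; the file name is hypothetical):
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;

public class ReadWholeFile {
    public static void main(String[] args) throws Exception {
        // Reads the entire file into one String; fine for an ~823 KB document
        String xml = FileUtils.readFileToString(new File("response.xml"), StandardCharsets.UTF_8);
        System.out.println(xml.length());
    }
}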
You have to deal with a ByteArrayOutputStream and not with the String itself. If you want to send your string in the HTTP response, all you have to do is read from that byte array stream and write to the response stream like this:
ByteArrayOutputStream baos = new ByteArrayOutputStream(823237);
baos.write(constant1.getBytes());
baos.write(constant2.getBytes());
...
baos.writeTo(response.getOutputStream());
Both problem 1) and 2) are due to the same fundamental issue: a String literal (or constant String expression) cannot be more than 65535 bytes when encoded in (modified) UTF-8, because there is a hard limit on string constants in the class file format.
The third problem sounds like a bug in the way you've implemented it rather than a fundamental problem. In fact, it sounds like you are trying to load the XML as a DOM and then unparse it (which is unnecessary), and that somehow you have managed to mangle the XML in the process. (Or maybe it is mangled in the file you are trying to read ...)
The simple and elegant solution is to save the stuff in a file, and then read it as plain text.
Or ... less elegant, but just as effective:
String[] strings = new String[] {
    "longString1",
    "longString2",
    ...
    "longStringN"
};
for (String str : strings) {
out.write(str);
}
Of course, the problem with embedding test data as string literals is that you have to escape certain characters in the string to keep the compiler happy. That's tedious if you have to do it by hand.
I'm reading a file line by line, like this:
FileReader myFile = new FileReader(file);
BufferedReader InputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = InputFile.readLine();
while(currentRecord != null) {
currentRecord = InputFile.readLine();
}
But if other types of files are uploaded, it will still read their contents. For instance, if the uploaded file is an image, it will output junk characters when reading the file. So my question is: how can I check that the file really is a CSV before reading it?
Checking the file's extension is kind of lame, since someone can upload a file that is not CSV but has a .csv extension. Thanks in advance.
Determining the MIME type of a file is not something easy to do, especially if ASCII sections can be mixed with binary ones.
Actually, when you look at how a Java mail system determines the MIME type of an email, it involves reading all the bytes in it and applying some "rules".
Check out MimeUtility.java
If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
If there is even one non-US-ASCII character, the encoding is "base64".
@return "7bit", "quoted-printable" or "base64"
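A rough sketch of how those rules might look in code (my own simplification: "non-US-ASCII" below just means any byte with the high bit set):
private static String chooseEncoding(byte[] bytes, boolean isText) {
    int nonAscii = 0;
    for (byte b : bytes) {
        if ((b & 0x80) != 0) {
            nonAscii++;
        }
    }
    if (nonAscii == 0) return "7bit"; // pure US-ASCII, text or not
    if (!isText) return "base64";     // non-text with any non-ASCII byte
    // text: more non-ASCII than ASCII bytes -> base64, otherwise quoted-printable
    return nonAscii > bytes.length - nonAscii ? "base64" : "quoted-printable";
}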
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
it has been dead since 2006
it does involve reading the whole content!
For example:
File file = new File("/home/bibi/monfichieratester");
InputStream inputStream = new FileInputStream(file);
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
int readByte;
while ((readByte = inputStream.read()) != -1) {
byteArrayStream.write(readByte);
}
String mimetype = "";
byte[] bytes = byteArrayStream.toByteArray();
MagicMatch m = Magic.getMagicMatch(bytes);
mimetype = m.getMimeType();
So... since you are reading the whole content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
Java Mime Magic may be of use. It'll analyse MIME types from files and input streams. I can't vouch for its functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain, e.g. determining the number of comma-separated values per line and rejecting the file if it's not within certain limits, then splitting on the commas and parsing each entry according to requirements (e.g. are they doubles/floats/valid Strings - and if Strings, what encoding). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted half-way through.
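For instance, a rough sketch of such a check (the 50-line sample and the minimum of two columns are arbitrary choices, and it is naive about quoted commas):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvSniffer {
    // Rough plausibility check: the first lines all split into the same number of fields.
    public static boolean looksLikeCsv(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String first = reader.readLine();
            if (first == null) {
                return false; // empty file
            }
            int columns = first.split(",", -1).length; // -1 keeps trailing empty fields
            if (columns < 2) {
                return false; // a single column is not much of a CSV
            }
            String line;
            int checked = 1;
            while ((line = reader.readLine()) != null && checked < 50) {
                if (line.split(",", -1).length != columns) {
                    return false;
                }
                checked++;
            }
            return true;
        }
    }
}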