File behaves differently when loaded from jar or directory - java

I'm a java newbie with a real hair puller. Hope someone can help.
I have a binary file that loads ok from the applet's directory,
but which only partially loads from the applet's jar file.
The code below loads the file both ways and compares them. They
should be identical, but the output is "divergence at byte 8181".
int spx_data_length = 158994;
byte[] spx_buf = new byte[spx_data_length];
byte[] spx_buf2 = new byte[spx_data_length];
// binary file in jar
InputStream is = Vocals.class.getResourceAsStream("0.raw");
is.read(spx_buf, 0, spx_data_length);
is.close();
// same binary file in applet directory
URL srcURL=new URL(getCodeBase(),"0.raw");
URLDataSource u_dat = new URLDataSource(srcURL);
is=u_dat.getInputStream();
is.read(spx_buf2, 0, spx_data_length);
is.close();
// compare them
for(int i=0;ispx_data_length;i++){
if(spx_buf[i] != spx_buf2[i]){
Obj[0]="divergence at byte "+i; win.call("show_string", Obj);
i=spx_data_length;
}
}

InputStream.read(byte[], int, int) will read up to spx_data_length bytes, but may very well read less. Particularly in the case of compressed data (i.e. reading from the JAR), it might return one decompression buffer worth of data at a time. You should either loop until the read returns -1, or use something like DataInputStream.readFully(byte[], int, int). And you should compare the number of bytes read: if that differes, there is little point in comparing the bytes past the smaller of these counts.

Related

android write to disk unreliable - written file.length !=expected.length

I have a write method that will write a byte[] to disk. On very few devices I'm running into some strange problems where the written file.length() != byte[].length after a successful write operation.
Code and Problem
The code to write a file to disk
private static boolean writeByteFile(File file, byte[] byteData) throws IOException {
if (!file.exists()) {
boolean fileCreated = file.createNewFile();
if (!fileCreated) {
return false;
}
}
FileOutputStream fos = new FileOutputStream(file);
BufferedOutputStream bos = new BufferedOutputStream(fos);
bos.write(byteData);
bos.flush();
fos.getFD().sync(); // sync to disk as recommended: http://android-developers.blogspot.com/2010/12/saving-data-safely.html
fos.close();
if (file.length() != byteData.length) {
final byte[] originalMD5Hash = md.digest(byteData);
InputStream is = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(is);
byte[] buffer = new byte[4096];
while(bis.read(buffer) > -1) {
md.update(buffer);
}
is.close();
final byte[] writtenFileMD5Hash = md.digest();
if(!Arrays.equals(originalMD5Hash, writtenFileMD5Hash)) {
String message = String.format(
"After an fsync, the file's length is not equal to the number of bytes we wrote!\npath=%s, expected=%d, actual=%d. >> " +
"Original MD5 Hash: %s, written file MD5 hash: %s",
file.getAbsolutePath(), byteData.length, file.length(),
digestToHex(originalMD5Hash), digestToHex(writtenFileMD5Hash));
throw new GiantWtfException(message);
}
}
return true;
}
I'm running into the if-statement where I compare file length on a few devices.
One example output:
After an fsync, the file's length is not equal to the number of bytes we wrote! path=/mnt/sdcard/.folder/filename, expected=233510, actual=229376 >> Original MD5 Hash: f1d298c0484672c52d9c26d04a3a21dc, written file MD5 hash: ab30660bd2b476d9551c15b340207a8a
I currently see this problem on 5 devices as I'm slowly rolling out the code. Some device data:
Question
Is there anything else I can do or improve?
More stats and observations
Current system version
2.3.5
2.3.6
Model
N860 (LG)
GT-I9100G (Samsung)
GT-S5300 (Samsung)
GT-S7500 (Samsung)
LG-VS410PP (LG)
Other stats
In the general crash analytics (from Crittercism) there is always more then enough free disk space at the time the problem happens. Still some (not all) of the devices have thrown IOExceptions around no free disk space at a different point in time.
As always I've never been able to reproduce that problem on any test phone I have.
Assumptions / Observations:
Generally I would expect a IOException when the disk is full. Still all the exceptions that I catch have less bytes written then they should have.
Interestingly enough all the number of bytes that actually have been written to disk are a multiple of 2^15.
EDIT:
I added a MD5 check sum validation that also fails and simplified the example code a little for better readability. It still fails in the wild with different MD5 hashes.
philipp, file.length() is the file size as reported by the OS. It might be the space the file takes up on disk or the number of bytes in the file.
If the number returned is size on disk, it is related to the number of clusters that hold the file. For example NTFS generally uses 4KB clusters. If you save a text document with 3 ascii encoded characters in on an NTFS formatted volume, the size of the file is 3 bytes, the size of the file on disk is 4096 bytes. On NTFS with a 4KB cluster all files are a multiple of 4096 bytes on disk. See http://en.wikipedia.org/wiki/Data_cluster for more.
If the number returned is the length of the file in bytes (from the underlying file-system's meta-data) then you should have an exact match to how many bytes you wrote, though I wouldn't bet my life on it.
Android uses YAFFS or EXT4, if that helps at all.
I strongly agree with admdrew, use a hash. MD5 would work great. SHA or even CRC should work fine for this task. As you write bytes to the disk, feed the stream to your hash algorithm as well. Once the file is written, read it back and feed that to your hasher. Compare the results. If you want to be sure the data is clean, file size is not enough.

How to check whether the file is binary?

I wrote the following method to see whether particular file contains ASCII text characters only or control characters in addition to that. Could you glance at this code, suggest improvements and point out oversights?
The logic is as follows: "If first 500 bytes of a file contain 5 or more Control characters - report it as binary file"
thank you.
public boolean isAsciiText(String fileName) throws IOException {
InputStream in = new FileInputStream(fileName);
byte[] bytes = new byte[500];
in.read(bytes, 0, bytes.length);
int x = 0;
short bin = 0;
for (byte thisByte : bytes) {
char it = (char) thisByte;
if (!Character.isWhitespace(it) && Character.isISOControl(it)) {
bin++;
}
if (bin >= 5) {
return false;
}
x++;
}
in.close();
return true;
}
Since you call this class "isASCIIText", you know exactly what you're looking for. In other words, it's not "isTextInCurrentLocaleEncoding". Thus you can be more accurate with:
if (thisByte < 32 || thisByte > 127) bin++;
edit, a long time later — it's pointed out in a comment that this simple check would be tripped up by a text file that started with a lot of newlines. It'd probably be better to use a table of "ok" bytes, and include printable characters (including carriage return, newline, and tab, and possibly form feed though I don't think many modern documents use those), and then check the table.
x doesn't appear to do anything.
What if the file is less than 500 bytes?
Some binary files have a situation where you can have a header for the first N bytes of the file which contains some data that is useful for an application but that the library the binary is for doesn't care about. You could easily have 500+ bytes of ASCII in a preamble like this followed by binary data in the following gigabyte.
Should handle exception if the file can't be opened or read, etc.
Fails badly if file size is less than 500 bytes
The line char it = (char) thisByte; is conceptually dubious, it mixes bytes and chars concepts, ie. assumes implicitly that the encoding is one-byte=one character (them, it excludes unicode encodings). In particular, it fails if the file is UTF-16 encoded.
The return inside the loop (slightly bad practice IMO) forgets to close the file.
The first thing I noticed - unrelated to your actual question, but you should be closing your input stream in a finally block to ensure it's always done. Usually this merely handles exceptions, but in your case you won't even close the streams of files when returning false.
Asides from that, why the comparison to ISO control characters? That's not a "binary" file, that's a "file that contains 5 or more control characters". A better way to approach the situation in my opinion, would be to invert the check - write an isAsciiText function instead which asserts that all the characters in the file (or in the first 500 bytes if you so wish) are in a set of bytes that are known good.
Theoretically, only checking the first few hundred bytes of a file could get you into trouble if it was a composite file of sorts (e.g. text with embedded pictures), but in practice I suspect every such file will have binary header data at the start so you're probably OK.
This would not work with the jdk install packages for linux or solaris. they have a shell-script start and then a bi data blob.
why not check the mime type using some library like jMimeMagic (http://http://sourceforge.net/projects/jmimemagic/) and deside based on the mimetype how to handle the file.
One could parse and compare ageinst a list of known binary file header bytes, like the one provided here.
Problem is, one needs to have a sorted list of binary-only headers, and the list might not be complete at all. For example, reading and parsing binary files contained in some Equinox framework jar. If one needs to identify the specific file types though, this should work.
If you're on Linux, for existing files on the disk, native file command execution should work well:
String command = "file -i [ZIP FILE...]";
Process process = Runtime.getRuntime().exec(command);
...
It will output information on the files:
...: application/zip; charset=binary
which you can furtherly filter with grep, or in Java, depending on, if you simply need estimation of the files' binary character, or if you need to find out their MIME types.
If parsing InputStreams, like content of nested files inside archives, this doesn't work, unfortunately, unless resorting to shell-only programs, like unzip - if you want to avoid creating temp unzipped files.
For this, a rough estimation of examining the first 500 Bytes worked out ok for me, so far, as was hinted in the examples above; instead of Character.isWhitespace/isISOControl(char), I used Character.isIdentifierIgnorable(codePoint), assuming UTF-8 default encoding:
private static boolean isBinaryFileHeader(byte[] headerBytes) {
return new String(headerBytes).codePoints().filter(Character::isIdentifierIgnorable).count() >= 5;
}
public void printNestedZipContent(String zipPath) {
try (ZipFile zipFile = new ZipFile(zipPath)) {
int zipHeaderBytesLen = 500;
zipFile.entries().asIterator().forEachRemaining( entry -> {
String entryName = entry.getName();
if (entry.isDirectory()) {
System.out.println("FOLDER_NAME: " + entryName);
return;
}
// Get content bytes from ZipFile for ZipEntry
try (InputStream zipEntryStream = new BufferedInputStream(zipFile.getInputStream(zipEntry))) {
// read and store header bytes
byte[] headerBytes = zipEntryStream.readNBytes(zipHeaderBytesLen);
// Skip entry, if nested binary file
if (isBinaryFileHeader(headerBytes)) {
return;
}
// Continue reading zipInputStream bytes, if non-binary
byte[] zipContentBytes = zipEntryStream.readAllBytes();
int zipContentBytesLen = zipContentBytes.length;
// Join already read header bytes and rest of content bytes
byte[] joinedZipEntryContent = Arrays.copyOf(zipContentBytes, zipContentBytesLen + zipHeaderBytesLen);
System.arraycopy(headerBytes, 0, joinedZipEntryContent, zipContentBytesLen, zipHeaderBytesLen);
// Output (default/UTF-8) encoded text file content
System.out.println(new String(joinedZipEntryContent));
} catch (IOException e) {
System.out.println("ERROR getting ZipEntry content: " + entry.getName());
}
});
} catch (IOException e) {
System.out.println("ERROR opening ZipFile: " + zipPath);
e.printStackTrace();
}
}
You ignore what read() returns, what if the files is shorter than 500 bytes?
When you return false, you don't close the file.
When converting byte to char, you assume your file is 7-bit ASCII.

how to write a file without allocating the whole byte array into memory?

This is a newbie question, I know. Can you guys help?
I'm talking about big files, of course, above 100MB. I'm imagining some kind of loop, but I don't know what to use. Chunked stream?
One thins is for certain: I don't want something like this (pseudocode):
File file = new File(existing_file_path);
byte[] theWholeFile = new byte[file.length()]; //this allocates the whole thing into memory
File out = new File(new_file_path);
out.write(theWholeFile);
To be more specific, I have to re-write a applet that downloads a base64 encoded file and decodes it to the "normal" file. Because it's made with byte arrays, it holds twice the file size in memory: one base64 encoded and the other one decoded. My question is not about base64. It's about saving memory.
Can you point me in the right direction?
Thanks!
From the question, it appears that you are reading the base64 encoded contents of a file into an array, decoding it into another array before finally saving it.
This is a bit of an overhead when considering memory. Especially given the fact that Base64 encoding is in use. It can be made a bit more efficient by:
Reading the contents of the file using a FileInputStream, preferably decorated with a BufferedInputStream.
Decoding on the fly. Base64 encoded characters can be read in groups of 4 characters, to be decoded on the fly.
Writing the output to the file, using a FileOutputStream, again preferably decorated with a BufferedOutputStream. This write operation can also be done after every single decode operation.
The buffering of read and write operations is done to prevent frequent IO access. You could use a buffer size that is appropriate to your application's load; usually the buffer size is chosen to be some power of two, because such a number does not have an "impedance mismatch" with the physical disk buffer.
Perhaps a FileInputStream on the file, reading off fixed length chunks, doing your transformation and writing them to a FileOutputStream?
Perhaps a BufferedReader? Javadoc: http://download-llnw.oracle.com/javase/1.4.2/docs/api/java/io/BufferedReader.html
Use this base64 encoder/decoder, which will wrap your file input stream and handle the decoding on the fly:
InputStream input = new Base64.InputStream(new FileInputStream("in.txt"));
OutputStream output = new FileOutputStream("out.txt");
try {
byte[] buffer = new byte[1024];
int readOffset = 0;
while(input.available() > 0) {
int bytesRead = input.read(buffer, readOffset, buffer.length);
readOffset += bytesRead;
output.write(buffer, 0, bytesRead);
}
} finally {
input.close();
output.close();
}
You can use org.apache.commons.io.FileUtils. This util class provides other options too beside what you are looking for. For example:
FileUtils.copyFile(final File srcFile, final File destFile)
FileUtils.copyFile(final File input, final OutputStream output)
FileUtils.copyFileToDirectory(final File srcFile, final File destDir)
And so on.. Also you can follow this tut.

Copy binary data from URL to file in Java without intermediate copy

I'm updating some old code to grab some binary data from a URL instead of from a database (the data is about to be moved out of the database and will be accessible by HTTP instead). The database API seemed to provide the data as a raw byte array directly, and the code in question wrote this array to a file using a BufferedOutputStream.
I'm not at all familiar with Java, but a bit of googling led me to this code:
URL u = new URL("my-url-string");
URLConnection uc = u.openConnection();
uc.connect();
InputStream in = uc.getInputStream();
ByteArrayOutputStream out = new ByteArrayOutputStream();
final int BUF_SIZE = 1 << 8;
byte[] buffer = new byte[BUF_SIZE];
int bytesRead = -1;
while((bytesRead = in.read(buffer)) > -1) {
out.write(buffer, 0, bytesRead);
}
in.close();
fileBytes = out.toByteArray();
That seems to work most of the time, but I have a problem when the data being copied is large - I'm getting an OutOfMemoryError for data items that worked fine with the old code.
I'm guessing that's because this version of the code has multiple copies of the data in memory at the same time, whereas the original code didn't.
Is there a simple way to grab binary data from a URL and save it in a file without incurring the cost of multiple copies in memory?
Instead of writing the data to a byte array and then dumping it to a file, you can directly write it to a file by replacing the following:
ByteArrayOutputStream out = new ByteArrayOutputStream();
With:
FileOutputStream out = new FileOutputStream("filename");
If you do so, there is no need for the call out.toByteArray() at the end. Just make sure you close the FileOutputStream object when done, like this:
out.close();
See the documentation of FileOutputStream for more details.
I don't know what you mean with "large" data, but try using the JVM parameter
java -Xmx 256m ...
which sets the maximum heap size to 256 MByte (or any value you like).
If you need the Content-Length and your web-server is somewhat standard conforming, then it should provide you a "Content-Length" header.
URLConnection#getContentLength() should give you that information upfront so that you are able to create your file. (Be aware that if your HTTP server is misconfigured or under control of an evil entity, that header may not match the number of bytes received. In that case, why dont you stream to a temp-file first and copy that file later?)
In addition to that: A ByteArrayInputStream is a horrible memory allocator. It always doubles the buffer size, so if you read a 32MB + 1 byte file, then you end up with a 64MB buffer. It might be better to implement a own, smarter byte-array-stream, like this one:
http://source.pentaho.org/pentaho-reporting/engines/classic/trunk/core/source/org/pentaho/reporting/engine/classic/core/util/MemoryByteArrayOutputStream.java
subclassing ByteArrayOutputStream gives you access to the buffer and the number of bytes in it.
But of course, if all you want to do is to store de data into a file, you are better off using a FileOutputStream.

Corrupt file when using Java to download file

This problem seems to happen inconsistently. We are using a java applet to download a file from our site, which we store temporarily on the client's machine.
Here is the code that we are using to save the file:
URL targetUrl = new URL(urlForFile);
InputStream content = (InputStream)targetUrl.getContent();
BufferedInputStream buffered = new BufferedInputStream(content);
File savedFile = File.createTempFile("temp",".dat");
FileOutputStream fos = new FileOutputStream(savedFile);
int letter;
while((letter = buffered.read()) != -1)
fos.write(letter);
fos.close();
Later, I try to access that file by using:
ObjectInputStream keyInStream = new ObjectInputStream(new FileInputStream(savedFile));
Most of the time it works without a problem, but every once in a while we get the error:
java.io.StreamCorruptedException: invalid stream header: 0D0A0D0A
which makes me believe that it isn't saving the file correctly.
I'm guessing that the operations you've done with getContent and BufferedInputStream have treated the file like an ascii file which has converted newlines or carriage returns into carriage return + newline (0x0d0a), which has confused ObjectInputStream (which expects serialized data objects.
If you are using an FTP URL, the transfer may be occurring in ASCII mode.
Try appending ";type=I" to the end of your URL.
Why are you using ObjectInputStream to read it?
As per the javadoc:
An ObjectInputStream deserializes primitive data and objects previously written using an ObjectOutputStream.
Probably the error comes from the fact you didn't write it with ObjectOutputStream.
Try reading it wit FileInputStream only.
Here's a sample for binary ( although not the most efficient way )
Here's another used for text files.
There are 3 big problems in your sample code:
You're not just treating the input as bytes
You're needlessly pulling the entire object into memory at once
You're doing multiple method calls for every single byte read and written -- use the array based read/write!
Here's a redo:
URL targetUrl = new URL(urlForFile);
InputStream is = targetUrl.getInputStream();
File savedFile = File.createTempFile("temp",".dat");
FileOutputStream fos = new FileOutputStream(savedFile);
int count;
byte[] buff = new byte[16 * 1024];
while((count = is.read(buff)) != -1) {
fos.write(buff, 0, count);
}
fos.close();
content.close();
You could also step back from the code and check to see if the file on your client is the same as the file on the server. If you get both files on an XP machine, you should be able to use the FC utility to do a compare (check FC's help if you need to run this as a binary compare as there is a switch for that). If you're on Unix, I don't know the file compare program, but I'm sure there's something.
If the files are identical, then you're looking at a problem with the code that reads the file.
If the files are not identical, focus on the code that writes your file.
Good luck!

Categories

Resources