Calculating checksum using message digest from ByteBuffer - java

I receive the data as a series of 32 KB ByteBuffers and want to calculate the checksum of the whole data. Using a MessageDigest, I keep feeding the bytes into it as they arrive, and at the end I call the digest method and convert the result into a checksum. The checksum calculated this way is wrong. Below is the code. Any idea how to get it right?
private MessageDigest messageDigest;

// Keep getting a 32 KB ByteBuffer until EOF is read
public int write(ByteBuffer src) throws IOException {
    try {
        ByteBuffer copiedByteBuffer = src.duplicate();
        try {
            messageDigest = MessageDigest.getInstance(MD5_CHECKSUM);
            while (copiedByteBuffer.hasRemaining()) {
                messageDigest.update(copiedByteBuffer.get());
            }
        } catch (Exception e) {
            throw new IOException(e);
        }
        copiedByteBuffer = null;
    } catch (Exception e) {
    }
}

// Called after the whole file has been read through write()
public void calculateDigest() {
    if (messageDigest != null) {
        byte[] digest = messageDigest.digest();
        checkSumMultiPartFile = toHex(digest); // convert the digest bytes to hexadecimal
    }
}
Updated try #2
// Will keep getting a 32 KB ByteBuffer until EOF is read
public int write(ByteBuffer original) throws IOException {
    try {
        ByteBuffer copiedByteBuffer = cloneByteBuffer(original);
        messageDigest = MessageDigest.getInstance(MD5_CHECKSUM);
        messageDigest.update(copiedByteBuffer);
        copiedByteBuffer = null;
    } catch (Exception e) {
    }
}

public static ByteBuffer cloneByteBuffer(ByteBuffer original) {
    final ByteBuffer clone = (original.isDirect())
            ? ByteBuffer.allocateDirect(original.capacity())
            : ByteBuffer.allocate(original.capacity());
    final ByteBuffer readOnlyCopy = original.asReadOnlyBuffer();
    readOnlyCopy.flip();
    clone.put(readOnlyCopy);
    clone.position(original.position());
    clone.limit(original.limit());
    clone.order(original.order());
    return clone;
}
After trying the above code, I could see that the message digest was being updated with all the bytes read: for example, for a file of 5,242,892 bytes, it was updated with 5,242,892 bytes. But the checksum calculated with this method still does not match the one computed with certutil -hashfile <file> MD5 in CMD.
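One likely culprit in both attempts: MessageDigest.getInstance(MD5_CHECKSUM) is called inside write(), so every 32 KB chunk starts a brand-new digest and the final digest() only covers the last chunk. A minimal sketch of the fix, assuming MD5_CHECKSUM is "MD5"; the class name and constructor here are hypothetical stand-ins for whatever wraps write():

import java.io.IOException;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksummingWriter {
    private final MessageDigest messageDigest;
    private String checkSumMultiPartFile;

    // Create the digest once; it accumulates state across write() calls.
    public ChecksummingWriter() throws NoSuchAlgorithmException {
        messageDigest = MessageDigest.getInstance("MD5");
    }

    // Called repeatedly with 32 KB buffers until EOF.
    public int write(ByteBuffer src) throws IOException {
        int written = src.remaining();
        // duplicate() shares content but has its own position/limit,
        // so consuming the copy leaves src intact for the real write.
        messageDigest.update(src.duplicate());
        // ... hand src to the underlying channel here ...
        return written;
    }

    // Called once, after the whole file has been written.
    public void calculateDigest() {
        checkSumMultiPartFile = toHex(messageDigest.digest());
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}

With the digest created once, update(ByteBuffer) also replaces the byte-at-a-time loop from the first attempt, and no clone of the buffer is needed.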

Related

Reading binary from any type of file

I'm looking for a way to read the binary data of a file into a string.
I've found one that reads the bytes directly and converts them to binary; the only problem is that it takes up a significant amount of RAM.
Here's the code I'm currently using:
try {
    byte[] fileData = new byte[(int) sellect.length()];
    FileInputStream in = new FileInputStream(sellect);
    in.read(fileData);
    in.close();
    getBinary(fileData[0]);
    getBinary(fileData[1]);
    getBinary(fileData[2]);
} catch (IOException e) {
    e.printStackTrace();
}
And the getBinary() method
public String getBinary(byte bite) {
    String output = String.format("%8s", Integer.toBinaryString(bite & 0xFF)).replace(' ', '0');
    System.out.println(output); // 10000001
    return output;
}
Can you do something like this?

int buffersize = 1000;
byte[] fileData = new byte[buffersize];
int numBytesRead;
String string;
while ((numBytesRead = in.read(fileData, 0, buffersize)) != -1) {
    string = getBinary(fileData, numBytesRead); // adjust getBinary() so it can work with a whole array of bytes at once
    out.write(string);
}

This way, you never store more information in RAM than one buffer's worth of bytes and the resulting string. The file is read 1000 bytes at a time, translated to a string, and then put into a new file as a string. read() returns how many bytes it read, or -1 at end of stream.
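One way to make that adjustment (a sketch; the two-argument getBinary() overload is mine, not from the original code):

// Convert the first 'length' bytes of the buffer to a binary string,
// so a partially filled final buffer is handled correctly.
public String getBinary(byte[] bytes, int length) {
    StringBuilder sb = new StringBuilder(length * 8);
    for (int i = 0; i < length; i++) {
        sb.append(String.format("%8s", Integer.toBinaryString(bytes[i] & 0xFF)).replace(' ', '0'));
    }
    return sb.toString();
}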
This link can help you :
File to byte[] in Java
public static byte[] toByteArray(InputStream input) throws IOException
Gets the contents of an InputStream as a byte[]. This method buffers the input internally, so there is no need to use a BufferedInputStream.

Parameters: input - the InputStream to read from
Returns: the requested byte array
Throws: NullPointerException - if the input is null; IOException - if an I/O error occurs
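That javadoc matches IOUtils.toByteArray from Apache Commons IO. Assuming that dependency, a minimal usage sketch (the file name is a placeholder):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.IOUtils;

public class ToByteArrayExample {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("data.bin")) {
            byte[] fileData = IOUtils.toByteArray(in); // buffers internally
            System.out.println("Read " + fileData.length + " bytes");
        }
    }
}

Note that this loads the whole file into memory, so for very large files the chunked approach above is still preferable.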

Why the result is different for twice common-codec md5

When I use Apache commons-codec's md5Hex to get an InputStream's MD5, I get different results from two successive calls. Example code below:
public static void main(String[] args) {
    String data = "D:\\test.jpg";
    File file = new File(data);
    InputStream is = null;
    try {
        is = new FileInputStream(file);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    String digest = null, digest2 = null;
    try {
        System.out.println(is.hashCode());
        digest = DigestUtils.md5Hex(is);
        System.out.println(is.hashCode());
        digest2 = DigestUtils.md5Hex(is);
        System.out.println(is.hashCode());
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println("Digest = " + digest);
    System.out.println("Digest2 = " + digest2);
}
and the result is:
1888654590
1888654590
1888654590
Digest = 5cc6c20f0b3aa9b44fe952da20cc928e
Digest2 = d41d8cd98f00b204e9800998ecf8427e
Thank you for your answers!
d41d8cd98f00b204e9800998ecf8427e is the md5 hash of the empty string ("").
That is because is is a stream: once you've read it (in DigestUtils.md5Hex(is)), the "cursor" is at the end of the stream, where there is no more data to read, so attempting to read anything returns 0 bytes.
I suggest reading the contents of the stream to a byte[] instead, and hashing that.
For how to get a byte[] from an InputStream, see this question.
The InputStream can be traversed only once. The first call traverses it and returns the MD5 for your input file. When you call md5hex the second time, the InputStream points to the end-of-file, thus the digest2 is the MD5 for empty input.
You cannot move back within InputStream. So invoking twice:
DigestUtils.md5Hex(is);
is not the same. Better read into byte array and use:
public static String md5Hex(byte[] data)
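A sketch of that fix: read the file into a byte[] once, then hash it as many times as needed (same commons-codec dependency and file path as in the question):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.codec.digest.DigestUtils;

public class Md5TwiceExample {
    public static void main(String[] args) throws IOException {
        // A byte[] has no read position, so hashing it is repeatable.
        byte[] data = Files.readAllBytes(Paths.get("D:\\test.jpg"));
        String digest = DigestUtils.md5Hex(data);
        String digest2 = DigestUtils.md5Hex(data);
        System.out.println(digest.equals(digest2)); // true
    }
}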

Efficiently hashing all the files of a directory (1000 2MB files)

I would like to hash (MD5) all the files of a given directory, which holds 1000 2MB photos.
I tried just running a for loop and hashing a file at a time, but that caused memory issues.
I need a method to hash each file in an efficient manner (memory wise).
I have posted 3 questions with my problem, but now instead of fixing my code, I want to see what would be the best general approach to my requirement.
Thank you very much for the help.
public class MD5 {

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/itaihay/Desktop/test");
        for (File f : file.listFiles()) {
            try {
                model.MD5.hash(f);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private static MessageDigest md;
    private static BufferedInputStream fis;
    private static byte[] dataBytes;
    private static byte[] mdbytes;

    private static void clean() throws NoSuchAlgorithmException {
        md = MessageDigest.getInstance("MD5");
        dataBytes = new byte[8192];
    }

    public static void hash(File file) {
        try {
            clean();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
        try {
            fis = new BufferedInputStream(new FileInputStream(file));
            int nread = 0;
            while ((nread = fis.read(dataBytes)) != -1) {
                md.update(dataBytes, 0, nread);
            }
            mdbytes = md.digest();
            System.out.println(javax.xml.bind.DatatypeConverter.printHexBinary(mdbytes).toLowerCase());
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                fis.close();
                dataBytes = null;
                md = null;
                mdbytes = null;
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
As others have said, using the built-in Java MD5 code you should be able to keep your memory footprint very small. I do something similar when hashing a large number of JAR files (up to a few MB apiece, usually 500 MB worth at a time) and get decent performance. You'll definitely want to play around with different buffer sizes until you find the optimal size for your system configuration. The following code snippet uses no more than bufSize + 128 bytes at a time, plus a negligible amount of overhead for the File, MessageDigest, and InputStream objects used to compute the MD5 hash:
InputStream is = null;
File f = ...
int bufSize = ...
byte[] md5sum = null;
try {
    MessageDigest digest = MessageDigest.getInstance("MD5");
    is = new FileInputStream(f);
    byte[] buffer = new byte[bufSize];
    int read = 0;
    while ((read = is.read(buffer)) > 0) {
        digest.update(buffer, 0, read);
    }
    md5sum = digest.digest();
} catch (Exception e) {
} finally {
    try {
        if (is != null) is.close();
    } catch (IOException e) {}
}
Increasing your Java heap space could solve it short term.
Long term, you want to look into reading images into a fixed-size queue that can fit in the memory. Don't read them all in at once. Enqueue the most recent image and dequeue the earliest image.
MD5 updates its state in 64-byte chunks, so you only need 64 bytes of a file in memory at a time. The MD5 state itself is 128 bits (16 bytes), as is the output size.
The most memory-conservative approach would be to read 64 bytes at a time from each file, file by file, and use each read to update that file's MD5 state. You would need at most 999 * 16 + 64 = 16048 ~= 16K of memory: one 16-byte state per pending file plus one 64-byte read buffer.
But such small reads would be very inefficient, so from there you can increase the read size per file to fit within your memory constraints.
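A buffered per-file sketch along those lines; the directory path is the one from the question, while the 8 KB buffer and the hex helper are choices of mine:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DirectoryMd5 {
    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        byte[] buffer = new byte[8192]; // one shared buffer, reused for every file
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("/Users/itaihay/Desktop/test"))) {
            for (Path p : dir) {
                MessageDigest md = MessageDigest.getInstance("MD5");
                try (InputStream in = Files.newInputStream(p)) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        md.update(buffer, 0, read); // stream through; never the whole file in memory
                    }
                }
                System.out.println(p.getFileName() + " " + toHex(md.digest()));
            }
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}

Memory use stays at roughly one buffer plus one digest state, regardless of how many files are processed.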

Checking MD5 of a file

I am getting an error while trying to check the MD5 hash of a file.
The file, notice.txt has the following contents:
My name is sanjay yadav . i am in btech computer science .>>
When I checked online with onlineMD5.com it gave the MD5 as: 90F450C33FAC09630D344CBA9BF80471.
My program output is:
My name is sanjay yadav . i am in btech computer science .
Read 58 bytes
d41d8cd98f00b204e9800998ecf8427e
Here's my code:
import java.io.*;
import java.math.BigInteger;
import java.security.DigestException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MsgDgt {
    public static void main(String[] args) throws IOException, DigestException, NoSuchAlgorithmException {
        FileInputStream inputstream = null;
        byte[] mybyte = new byte[1024];
        inputstream = new FileInputStream("e://notice.txt");
        int total = 0;
        int nRead = 0;
        MessageDigest md = MessageDigest.getInstance("MD5");
        while ((nRead = inputstream.read(mybyte)) != -1) {
            System.out.println(new String(mybyte));
            total += nRead;
            md.update(mybyte, 0, nRead);
        }
        System.out.println("Read " + total + " bytes");
        md.digest();
        System.out.println(new BigInteger(1, md.digest()).toString(16));
    }
}
There's a bug in your code and I believe the online tool is giving the wrong answer. Here, you're currently computing the digest twice:
md.digest();
System.out.println(new BigInteger(1, md.digest()).toString(16));
Each time you call digest(), it resets the internal state. You should remove the first call to digest(). That then leaves you with this as the digest:
2f4c6a40682161e5b01c24d5aa896da0
That's the same result I get from C#, and I believe it to be correct. I don't know why the online checker is giving an incorrect result. (If you put it into the text part of the same site, it gives the right result.)
A couple of other points on your code though:
You're currently using the platform default encoding when converting the bytes to a string. I would strongly discourage you from doing that.
You're currently converting the whole buffer to a string, instead of only the bit you've read.
I don't like using BigInteger as a way of converting binary data to hex. You potentially need to pad it with 0s, and it's basically not what the class was designed for. Use a dedicated hex conversion class, e.g. from Apache Commons Codec (or various Stack Overflow answers which provide standalone classes for the purpose); see the sketch after this list.
You're not closing your input stream. You should do so in a finally block, or using a try-with-resources statement in Java 7.
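On the hex-conversion point, a minimal standalone converter of the kind meant there (a sketch, not from the original answer):

private static final char[] HEX = "0123456789abcdef".toCharArray();

// Every byte always becomes exactly two characters, so leading
// zeros survive (unlike the BigInteger.toString(16) approach).
public static String toHex(byte[] bytes) {
    char[] out = new char[bytes.length * 2];
    for (int i = 0; i < bytes.length; i++) {
        out[2 * i] = HEX[(bytes[i] >> 4) & 0xF];
        out[2 * i + 1] = HEX[bytes[i] & 0xF];
    }
    return new String(out);
}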
I use this function:
public static String md5Hash(File file) {
    try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream is = new FileInputStream(file);
        byte[] buffer = new byte[1024];
        try {
            is = new DigestInputStream(is, md);
            while (is.read(buffer) != -1) { }
        } finally {
            is.close();
        }
        byte[] digest = md.digest();
        BigInteger bigInt = new BigInteger(1, digest);
        String output = bigInt.toString(16);
        while (output.length() < 32) {
            output = "0" + output;
        }
        return output;
    } catch (NoSuchAlgorithmException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

Java decompressing array of bytes

On the server (C++), binary data is compressed using the ZLib function compress2() and sent over to the client (Java).
On client side (Java), data should be decompressed using the following code snippet:
public static String unpack(byte[] packedBuffer) {
    InflaterInputStream inStream = new InflaterInputStream(new ByteArrayInputStream(packedBuffer));
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    int readByte;
    try {
        while ((readByte = inStream.read()) != -1) {
            outStream.write(readByte);
        }
    } catch (Exception e) {
        JMDCLog.logError(" unpacking buffer of size: " + packedBuffer.length);
        e.printStackTrace();
    }
    // ... the rest of the code follows
The problem is that the read in the while loop always throws:
java.util.zip.ZipException: invalid stored block lengths
Before I check for other possible causes, can someone please tell me whether I can compress on one side with compress2() and decompress on the other side using the above code, so I can eliminate this as a problem? Also, does someone have a possible clue about what might be wrong here? (I know I didn't provide much of the code, but the projects are rather big.)
Thanks.
I think the problem is not with the unpack method but in the packedBuffer content. unpack works fine:
public static byte[] pack(String s) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DeflaterOutputStream dout = new DeflaterOutputStream(out);
    dout.write(s.getBytes());
    dout.close();
    return out.toByteArray();
}

public static void main(String[] args) throws Exception {
    byte[] a = pack("123");
    String s = unpack(a); // calls your unpack
    System.out.println(s);
}
output
123
public static String unpack(byte[] packedBuffer) {
    try (InflaterInputStream inStream = new InflaterInputStream(
            new ByteArrayInputStream(packedBuffer));
         ByteArrayOutputStream outStream = new ByteArrayOutputStream()) {
        inStream.transferTo(outStream);
        //...
        return outStream.toString(StandardCharsets.UTF_8);
    } catch (Exception e) {
        JMDCLog.logError(" unpacking buffer of size: " + packedBuffer.length);
        e.printStackTrace();
        throw new IllegalArgumentException(e);
    }
}
compress2() emits zlib-format data (RFC 1950), which is exactly what InflaterInputStream expects; gzip (GZIPInputStream) is a different wrapper and would reject it, so the stream type stays as in your own code.
As you seem to expect the bytes to represent text, and hence to be in some encoding, pass that encoding (a Charset) to the conversion to String (which always holds Unicode).
Note that UTF-8 here is an assumption about the bytes; in your case it might be another encoding.
The ugly try-with-resources syntax closes the streams even on an exception or, as here, the return.
I rethrow a RuntimeException (IllegalArgumentException) since it seems dangerous to carry on with no result.
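A round-trip sketch of the zlib pairing; here DeflaterOutputStream stands in for the server's compress2(), since both emit RFC 1950 zlib framing:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class ZlibRoundTrip {
    public static void main(String[] args) throws IOException {
        byte[] original = "hello zlib".getBytes(StandardCharsets.UTF_8);

        // Compress: zlib-wrapped deflate, the same framing compress2() produces.
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        try (DeflaterOutputStream dout = new DeflaterOutputStream(packed)) {
            dout.write(original);
        }

        // Decompress with InflaterInputStream, which expects zlib framing.
        ByteArrayOutputStream unpacked = new ByteArrayOutputStream();
        try (InflaterInputStream in = new InflaterInputStream(
                new ByteArrayInputStream(packed.toByteArray()))) {
            in.transferTo(unpacked);
        }

        System.out.println(unpacked.toString(StandardCharsets.UTF_8)); // hello zlib
    }
}

If this round trip works but the real buffer still fails with "invalid stored block lengths", the bytes are most likely being corrupted or truncated in transport rather than mis-decompressed.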
