Efficiently hashing all the files of a directory (1000 2MB files) - java

I would like to hash (MD5) all the files of a given directory, which holds 1000 2MB photos.
I tried just running a for loop and hashing one file at a time, but that caused memory issues.
I need a way to hash each file that is efficient memory-wise.
I have posted three questions about my problem, but now, instead of fixing my code, I want to know what the best general approach to this requirement would be.
Thank you very much for the help.
package model;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MD5 {

    private static MessageDigest md;
    private static BufferedInputStream fis;
    private static byte[] dataBytes;
    private static byte[] mdbytes;

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/itaihay/Desktop/test");
        for (File f : file.listFiles()) {
            try {
                model.MD5.hash(f);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private static void clean() throws NoSuchAlgorithmException {
        md = MessageDigest.getInstance("MD5");
        dataBytes = new byte[8192];
    }

    public static void hash(File file) {
        try {
            clean();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
        try {
            fis = new BufferedInputStream(new FileInputStream(file));
            int nread;
            while ((nread = fis.read(dataBytes)) != -1) {
                md.update(dataBytes, 0, nread);
            }
            mdbytes = md.digest();
            System.out.println(javax.xml.bind.DatatypeConverter.printHexBinary(mdbytes).toLowerCase());
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (fis != null) {
                    fis.close();
                }
                dataBytes = null;
                md = null;
                mdbytes = null;
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

As others have said, using built-in Java MD5 code, you should be able to keep your memory footprint very small. I do something similar when hashing a large number of Jar files (up to a few MB apiece, usually 500MB-worth at a time) and get decent performance. You'll definitely want to play around with different buffer sizes until you find the optimal size for your system configuration. The following code-snippet uses no more than bufSize+128 bytes at a time, plus a negligible amount of overhead for the File, MessageDigest, and InputStream objects used to compute the md5 hash:
InputStream is = null;
File f = ...
int bufSize = ...
byte[] md5sum = null;
try {
    MessageDigest digest = MessageDigest.getInstance("MD5");
    is = new FileInputStream(f);
    byte[] buffer = new byte[bufSize];
    int read = 0;
    while ((read = is.read(buffer)) > 0) {
        digest.update(buffer, 0, read);
    }
    md5sum = digest.digest();
} catch (Exception e) {
    // handle NoSuchAlgorithmException / IOException as appropriate
} finally {
    try {
        if (is != null) is.close();
    } catch (IOException e) {
        // nothing useful to do if close() fails
    }
}
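A variation on the same loop, for anyone on Java 7+, folds the update into the read with java.security.DigestInputStream and lets try-with-resources handle the close. This is a minimal sketch, not the answerer's code; the buffer is passed in so the caller can reuse one allocation across all files:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestStreamSketch {
    // Hashes one file; at most buffer.length bytes of file data are in
    // memory at a time, and the same buffer can be reused for every file.
    static byte[] md5Of(Path file, byte[] buffer)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            while (in.read(buffer) != -1) {
                // reading drives the digest; nothing else to do here
            }
        }
        return md.digest();
    }
}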

Increasing your Java heap space might solve the problem in the short term.
Longer term, you want to look into reading the images into a fixed-size queue that fits in memory, rather than reading them all in at once: enqueue the most recent image and dequeue the earliest, as sketched below.
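A rough sketch of that bounded-queue idea (assuming Java 8+; the capacity of 8 and the 2 MB stand-in payloads are illustrative, not recommendations):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        // put() blocks once 8 payloads are queued, so memory use stays
        // bounded no matter how many files there are.
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(8);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 1000; i++) {
                    queue.put(new byte[2 * 1024 * 1024]); // stand-in for one 2 MB image
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 1000; i++) {
                    byte[] image = queue.take();
                    // hash or otherwise process 'image'; it becomes eligible
                    // for garbage collection as soon as this iteration ends
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}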

MD5 processes its input in 64-byte blocks, so you only ever need 64 bytes of a file in memory at a time. The MD5 state itself is 128 bits, as is the output size.
The most memory-conservative approach would be to read 64 bytes at a time from each file, file by file, and use each read to update that file's MD5 state. You would need at most 999 * 16 + 64 = 16048 ≈ 16 KB of memory: a 16-byte saved state for each of the other 999 files, plus one 64-byte read buffer.
But such small reads would be very inefficient, so from there you can increase the read size per file to fit within your memory constraints.
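Concretely, the per-file loop might look like the sketch below. This is an illustration, not prescribed code: the 64 KB buffer is a size to tune, and it assumes args[0] names the directory:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChunkedHash {
    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        byte[] buffer = new byte[64 * 1024]; // one shared buffer, reused for every file
        for (File f : new File(args[0]).listFiles()) {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (InputStream in = new FileInputStream(f)) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    md.update(buffer, 0, n); // never more than one buffer of file data in memory
                }
            }
            System.out.printf("%s  %032x%n", f.getName(), new BigInteger(1, md.digest()));
        }
    }
}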

Related

Calculating checksum using message digest from ByteBuffer

I receive the data as 32 KB ByteBuffers and want to calculate the checksum of the whole data. Using a MessageDigest, I keep updating it with the bytes as they arrive, and at the end I call the digest method to compute the checksum from everything that was read. The checksum calculated by the method below is wrong. Any idea how to get it right?
private MessageDigest messageDigest;

// Keep getting 32 KB ByteBuffers until EOF is read
public int write(ByteBuffer src) throws IOException {
    try {
        ByteBuffer copiedByteBuffer = src.duplicate();
        try {
            messageDigest = MessageDigest.getInstance(MD5_CHECKSUM);
            while (copiedByteBuffer.hasRemaining()) {
                messageDigest.update(copiedByteBuffer.get());
            }
        } catch (Exception e) {
            throw new IOException(e);
        }
        copiedByteBuffer = null;
    } catch (Exception e) {
    }
}
// Called after the whole file has been read in write()
public void calculateDigest() {
    if (messageDigest != null) {
        byte[] digest = messageDigest.digest();
        checkSumMultiPartFile = toHex(digest); // converting bytes into hexadecimal
    }
}
Updated try #2
// Will keep getting 32 KB ByteBuffers until EOF is read
public int write(ByteBuffer original) throws IOException {
    try {
        ByteBuffer copiedByteBuffer = cloneByteBuffer(original);
        messageDigest = MessageDigest.getInstance(MD5_CHECKSUM);
        messageDigest.update(copiedByteBuffer);
        copiedByteBuffer = null;
    } catch (Exception e) {
    }
}
public static ByteBuffer cloneByteBuffer(ByteBuffer original) {
    final ByteBuffer clone = original.isDirect()
            ? ByteBuffer.allocateDirect(original.capacity())
            : ByteBuffer.allocate(original.capacity());
    final ByteBuffer readOnlyCopy = original.asReadOnlyBuffer();
    readOnlyCopy.flip();
    clone.put(readOnlyCopy);
    clone.position(original.position());
    clone.limit(original.limit());
    clone.order(original.order());
    return clone;
}
After trying the above code I could see that the message digest was being updated with all the bytes read: for example, if the file size is 5,242,892 bytes, then it was updated with 5,242,892 bytes. But the checksum calculated with certutil -hashfile <file> MD5 on the command line does not match the one calculated by the method above.
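One likely culprit in both attempts: MessageDigest.getInstance is called inside write(), which creates a fresh digest for every 32 KB chunk and discards the state accumulated so far. A minimal sketch of the incremental pattern with the digest created once (the names mirror the question's code, and it assumes each incoming buffer's position..limit spans exactly the new bytes):

import java.io.IOException;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class IncrementalDigest {
    private final MessageDigest messageDigest;

    public IncrementalDigest() throws NoSuchAlgorithmException {
        // Created once, before the first chunk, never re-created per write()
        messageDigest = MessageDigest.getInstance("MD5");
    }

    // Called repeatedly with 32 KB chunks until EOF.
    public int write(ByteBuffer src) throws IOException {
        ByteBuffer view = src.duplicate(); // independent position/limit, same data
        int consumed = view.remaining();
        messageDigest.update(view); // consumes position..limit of the view
        return consumed;
    }

    // Called once, after the whole file has been written.
    public String digestHex() {
        return String.format("%032x", new BigInteger(1, messageDigest.digest()));
    }
}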

compress base 64 png image in java

Hi, I would like to know whether there is any way in Java to reduce the size of an image. My front end is iOS; it sends Base64-encoded data, which I decode and store in a byte array. Now I want to compress the PNG image in Java. My method looks something like this:
public String processFile(String strImageBase64, String strImageName, String donorId) {
    FileOutputStream fos = null;
    File savedFile = null;
    try {
        String fileItemRefPath = propsFPCConfig.getProperty("fileCreationReferencePath");
        String imageURLReferncePath = propsFPCConfig.getProperty("imageURLReferncePath");
        File f = new File(fileItemRefPath + "\\" + "productimages" + "\\" + donorId);
        String strException = "Actual File " + f.getName();
        if (!f.exists()) {
            boolean isdirCreationStatus = f.mkdir();
        }
        String strDateTobeAppended = new SimpleDateFormat("yyyyMMddhhmm").format(new Date());
        String fileName = strImageName + strDateTobeAppended;
        savedFile = new File(f.getAbsolutePath() + "\\" + fileName);
        strException = strException + " savedFile " + savedFile.getName();
        Base64 decoder = new Base64();
        byte[] decodedBytes = decoder.decode(strImageBase64);
        if (decodedBytes != null && decodedBytes.length != 0) {
            System.out.println("Decoded bytes length: " + decodedBytes.length);
            fos = new FileOutputStream(savedFile);
            fos.write(decodedBytes, 0, decodedBytes.length);
            fos.flush();
        }
        if (fos != null) {
            fos.close();
            return savedFile.getAbsolutePath();
        } else {
            return null;
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (fos != null) {
                fos.close();
            } else {
                savedFile = null;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return savedFile.getName();
}
I store this decoded data under the image name; now I want to store the compressed image at another URL.
I don't think this is worth the effort.
PNGs already have a very high level of compression; it is hard to significantly reduce their size with additional compression.
If you are really sending the image or the response Base64-encoded to the client, there is a way to improve transfer rates: enable gzip compression on your server so that HTTP responses are gzip-compressed. This reduces the actual number of bytes to transfer quite a bit when the payload is Base64-encoded (which means you are only using 6 of the 8 bits in every byte). Enabling gzip compression is transparent to your server code and is just a configuration switch away for most web servers.
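To see the effect, here is a small self-contained sketch (assuming Java 8+ for java.util.Base64) that measures how much of the Base64 overhead gzip wins back; the exact numbers vary with the input:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class Base64GzipDemo {
    public static void main(String[] args) throws IOException {
        byte[] binary = new byte[100_000]; // stand-in for PNG bytes
        new Random(42).nextBytes(binary);

        // Base64 inflates the payload to ~133% of the original size.
        String base64 = Base64.getEncoder().encodeToString(binary);

        // Gzip over the Base64 text recovers most of that inflation,
        // because only 64 distinct symbols appear in the stream.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(base64.getBytes(StandardCharsets.US_ASCII));
        }

        System.out.println("binary bytes:      " + binary.length);
        System.out.println("base64 bytes:      " + base64.length());
        System.out.println("base64+gzip bytes: " + bos.size());
    }
}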

Converting from FSDataInputStream to FileInputStream

I'm kind of new to Hadoop HDFS and quite rusty with Java and I need some help. I'm trying to read a file from HDFS and calculate the MD5 hash of this file. The general Hadoop configuration is as below.
private FSDataInputStream hdfsDIS;
private FileInputStream FinputStream;
private FileSystem hdfs;
private Configuration myConfig;
myConfig.addResource("/HADOOP_HOME/conf/core-site.xml");
myConfig.addResource("/HADOOP_HOME/conf/hdfs-site.xml");
hdfs = FileSystem.get(new URI("hdfs://NodeName:54310"), myConfig);
hdfsDIS = hdfs.open(hdfsFilePath);
The function hdfs.open(hdfsFilePath) returns an FSDataInputStream
The problem is that I can only get an FSDataInputStream out of HDFS, but I'd like a FileInputStream.
The code below performs the hashing part and is adapted from something I found somewhere on Stack Overflow (I can't find the link now).
FileInputStream FinputStream = hdfsDIS; // <--- This is where the problem is
MessageDigest md;
try {
    md = MessageDigest.getInstance("MD5");
    FileChannel channel = FinputStream.getChannel();
    ByteBuffer buff = ByteBuffer.allocate(2048);
    while (channel.read(buff) != -1) {
        buff.flip();
        md.update(buff);
        buff.clear();
    }
    byte[] hashValue = md.digest();
    return toHex(hashValue);
} catch (NoSuchAlgorithmException e) {
    return null;
} catch (IOException e) {
    return null;
}
The reason I need a FileInputStream is that the hashing code uses a FileChannel, which supposedly increases the efficiency of reading data from the file.
Could someone show me how to convert the FSDataInputStream into a FileInputStream?
Use it as an InputStream:
MessageDigest md;
try {
    md = MessageDigest.getInstance("MD5");
    byte[] buff = new byte[2048];
    int count;
    while ((count = hdfsDIS.read(buff)) != -1) {
        md.update(buff, 0, count);
    }
    byte[] hashValue = md.digest();
    return toHex(hashValue);
} catch (NoSuchAlgorithmException e) {
    return null;
} catch (IOException e) {
    return null;
}
the code that does the hashing uses a FileChannel which supposedly increases the efficiency of reading the data from the file
Not in this case. A FileChannel only improves efficiency if you're just copying the data to another channel, or if you use a DirectByteBuffer. If you're processing the data, as here, it makes no difference: a read is still a read.
You can use the FSDataInputStream as just a regular InputStream, and pass that to Channels.newChannel to get back a ReadableByteChannel instead of a FileChannel. Here's an updated version:
InputStream inputStream = hdfsDIS;
MessageDigest md;
try {
    md = MessageDigest.getInstance("MD5");
    ReadableByteChannel channel = Channels.newChannel(inputStream);
    ByteBuffer buff = ByteBuffer.allocate(2048);
    while (channel.read(buff) != -1) {
        buff.flip();
        md.update(buff);
        buff.clear();
    }
    byte[] hashValue = md.digest();
    return toHex(hashValue);
} catch (NoSuchAlgorithmException e) {
    return null;
} catch (IOException e) {
    return null;
}
You can't do that assignment because of the class hierarchy:
java.lang.Object
  extended by java.io.InputStream
    extended by java.io.FilterInputStream
      extended by java.io.DataInputStream
        extended by org.apache.hadoop.fs.FSDataInputStream
FSDataInputStream is not a FileInputStream.
That said, to convert from FSDataInputStream to FileInputStream, you could use the FSDataInputStream's FileDescriptor to create a FileInputStream, according to the API:
new FileInputStream(hdfsDIS.getFileDescriptor());
Not sure it will work.

Checking MD5 of a file

I am getting an error while trying to check the MD5 hash of a file.
The file, notice.txt has the following contents:
My name is sanjay yadav . i am in btech computer science .>>
When I checked online with onlineMD5.com it gave the MD5 as: 90F450C33FAC09630D344CBA9BF80471.
My program output is:
My name is sanjay yadav . i am in btech computer science .
Read 58 bytes
d41d8cd98f00b204e9800998ecf8427e
Here's my code:
import java.io.*;
import java.math.BigInteger;
import java.security.DigestException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MsgDgt {
    public static void main(String[] args) throws IOException, DigestException, NoSuchAlgorithmException {
        FileInputStream inputstream = null;
        byte[] mybyte = new byte[1024];
        inputstream = new FileInputStream("e://notice.txt");
        int total = 0;
        int nRead = 0;
        MessageDigest md = MessageDigest.getInstance("MD5");
        while ((nRead = inputstream.read(mybyte)) != -1) {
            System.out.println(new String(mybyte));
            total += nRead;
            md.update(mybyte, 0, nRead);
        }
        System.out.println("Read " + total + " bytes");
        md.digest();
        System.out.println(new BigInteger(1, md.digest()).toString(16));
    }
}
There's a bug in your code and I believe the online tool is giving the wrong answer. Here, you're currently computing the digest twice:
md.digest();
System.out.println(new BigInteger(1, md.digest()).toString(16));
Each time you call digest(), it resets the internal state. You should remove the first call to digest(). That then leaves you with this as the digest:
2f4c6a40682161e5b01c24d5aa896da0
That's the same result I get from C#, and I believe it to be correct. I don't know why the online checker is giving an incorrect result. (If you put it into the text part of the same site, it gives the right result.)
A couple of other points on your code, though:
- You're currently using the platform default encoding when converting the bytes to a string. I would strongly discourage you from doing that.
- You're currently converting the whole buffer to a string, instead of only the part you've read.
- I don't like using BigInteger as a way of converting binary data to hex. You potentially need to pad it with 0s, and it's basically not what the class was designed for. Use a dedicated hex conversion class, e.g. from Apache Commons Codec, or one of the various Stack Overflow answers that provide standalone classes for the purpose (a minimal helper is sketched below).
- You're not closing your input stream. You should do so in a finally block, or use a try-with-resources statement in Java 7.
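On the hex point, a dedicated helper is only a few lines; here is a minimal sketch:

public class Hex {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Converts bytes to lowercase hex, preserving leading zeros
    // (unlike the BigInteger.toString(16) trick).
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(DIGITS[(b >> 4) & 0xF]).append(DIGITS[b & 0xF]);
        }
        return sb.toString();
    }
}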
I use this function:
public static String md5Hash(File file) {
    try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream is = new FileInputStream(file);
        byte[] buffer = new byte[1024];
        try {
            is = new DigestInputStream(is, md);
            while (is.read(buffer) != -1) {
                // reading updates the digest as a side effect
            }
        } finally {
            is.close();
        }
        byte[] digest = md.digest();
        BigInteger bigInt = new BigInteger(1, digest);
        String output = bigInt.toString(16);
        while (output.length() < 32) {
            output = "0" + output;
        }
        return output;
    } catch (NoSuchAlgorithmException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

How can I write a byte array to a file in Java?

How to write a byte array to a file in Java?
As Sebastian Redl points out, the most straightforward way is now java.nio.file.Files.write. Details can be found in the Reading, Writing, and Creating Files tutorial.
Old answer:
FileOutputStream.write(byte[]) would be the most straightforward. What is the data you want to write?
The Java I/O tutorials may be of some use to you.
You can use IOUtils.write(byte[] data, OutputStream output) from Apache Commons IO.
KeyGenerator kgen = KeyGenerator.getInstance("AES");
kgen.init(128);
SecretKey key = kgen.generateKey();
byte[] encoded = key.getEncoded();
FileOutputStream output = new FileOutputStream(new File("target-file"));
IOUtils.write(encoded, output);
As of Java 1.7, there's a new way: java.nio.file.Files.write
import java.nio.file.Files;
import java.nio.file.Paths;
KeyGenerator kgen = KeyGenerator.getInstance("AES");
kgen.init(128);
SecretKey key = kgen.generateKey();
byte[] encoded = key.getEncoded();
Files.write(Paths.get("target-file"), encoded);
Java 1.7 also resolves the embarrassment that Kevin describes: reading a file is now:
byte[] data = Files.readAllBytes(Paths.get("source-file"));
A commenter asked "why use a third-party library for this?" The answer is that it's way too much of a pain to do it yourself. Here's an example of how to properly do the inverse operation of reading a byte array from a file (sorry, this is just the code I had readily available, and it's not like I want the asker to actually paste and use this code anyway):
public static byte[] toByteArray(File file) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    boolean threw = true;
    InputStream in = new FileInputStream(file);
    try {
        byte[] buf = new byte[BUF_SIZE]; // BUF_SIZE and log come from the surrounding class
        while (true) {
            int r = in.read(buf);
            if (r == -1) {
                break;
            }
            out.write(buf, 0, r);
        }
        threw = false;
    } finally {
        try {
            in.close();
        } catch (IOException e) {
            if (threw) {
                log.warn("IOException thrown while closing", e);
            } else {
                throw e;
            }
        }
    }
    return out.toByteArray();
}
Everyone ought to be thoroughly appalled by what a pain that is.
Use Good Libraries. I, unsurprisingly, recommend Guava's Files.write(byte[], File).
To write a byte array to a file, use the method
public void write(byte[] b) throws IOException
from the BufferedOutputStream class.
java.io.BufferedOutputStream implements a buffered output stream. By setting up such an output stream, an application can write bytes to the underlying output stream without necessarily causing a call to the underlying system for each byte written.
For your example you need something like:
String filename = "C:/SO/SOBufferedOutputStreamAnswer";
BufferedOutputStream bos = null;
try {
    // create a FileOutputStream and wrap it in a BufferedOutputStream
    FileOutputStream fos = new FileOutputStream(new File(filename));
    bos = new BufferedOutputStream(fos);
    KeyGenerator kgen = KeyGenerator.getInstance("AES");
    kgen.init(128);
    SecretKey key = kgen.generateKey();
    byte[] encoded = key.getEncoded();
    bos.write(encoded);
}
// catch and handle exceptions...
Apache Commons IO Utils has a FileUtils.writeByteArrayToFile() method. Note that if you're doing any file/IO work then the Apache Commons IO library will do a lot of work for you.
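For instance, a one-call sketch (assuming commons-io is on the classpath):

import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

public class WriteWithCommonsIO {
    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3};
        // Creates parent directories as needed and closes the stream for you.
        FileUtils.writeByteArrayToFile(new File("target-file"), data);
    }
}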
No need for external libraries to bloat things, especially when working with Android. Here is a native solution that does the trick. This is a piece of code from an app that stores a byte array as an image file.
// Byte array with image data.
final byte[] imageData = params[0];

// Write bytes to tmp file.
final File tmpImageFile = new File(ApplicationContext.getInstance().getCacheDir(), "scan.jpg");
FileOutputStream tmpOutputStream = null;
try {
    tmpOutputStream = new FileOutputStream(tmpImageFile);
    tmpOutputStream.write(imageData);
    Log.d(TAG, "File successfully written to tmp file");
} catch (FileNotFoundException e) {
    Log.e(TAG, "FileNotFoundException: " + e);
    return null;
} catch (IOException e) {
    Log.e(TAG, "IOException: " + e);
    return null;
} finally {
    if (tmpOutputStream != null) {
        try {
            tmpOutputStream.close();
        } catch (IOException e) {
            Log.e(TAG, "IOException: " + e);
        }
    }
}
File file = ...
byte[] data = ...
try {
    FileOutputStream fos = new FileOutputStream(file);
    fos.write(data);
    fos.flush();
    fos.close();
} catch (Exception e) {
    // handle the exception
}

Note that write(byte[]) writes the whole array in a single call; you only need a loop if you want to write the data in fixed-size chunks.
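If you do want chunked writes, e.g. for progress reporting, here is a sketch of that loop (the 1024-byte chunk size is arbitrary):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ChunkedWrite {
    // Writes 'data' in fixed-size pieces; useful for progress callbacks,
    // not required for correctness.
    static void writeChunked(OutputStream out, byte[] data, int chunkSize) throws IOException {
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            out.write(data, off, len);
            // report progress here if desired
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[5000];
        try (OutputStream out = new FileOutputStream("target-file")) {
            writeChunked(out, data, 1024);
        }
    }
}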
