I want to compute the MD5 of many different files. I am following this answer to do that, but the main problem is that the time taken to compute the MD5 of the files (there may be hundreds of them) is considerable.
Is there any way to find the MD5 of a file without consuming much time?
Note: the size of a file may be large (it may go up to 300 MB).
This is the code which I am using:
import java.io.*;
import java.security.MessageDigest;

public class MD5Checksum {

    public static byte[] createChecksum(String filename) throws Exception {
        InputStream fis = new FileInputStream(filename);
        byte[] buffer = new byte[1024];
        MessageDigest complete = MessageDigest.getInstance("MD5");
        int numRead;
        do {
            numRead = fis.read(buffer);
            if (numRead > 0) {
                complete.update(buffer, 0, numRead);
            }
        } while (numRead != -1);
        fis.close();
        return complete.digest();
    }

    // see this How-to for a faster way to convert
    // a byte array to a HEX string
    public static String getMD5Checksum(String filename) throws Exception {
        byte[] b = createChecksum(filename);
        String result = "";
        for (int i = 0; i < b.length; i++) {
            result += Integer.toString((b[i] & 0xff) + 0x100, 16).substring(1);
        }
        return result;
    }

    public static void main(String args[]) {
        try {
            System.out.println(getMD5Checksum("apache-tomcat-5.5.17.exe"));
            // output :
            // 0bb2827c5eacf570b6064e24e0e6653b
            // ref :
            // http://www.apache.org/dist/
            // tomcat/tomcat-5/v5.5.17/bin
            // /apache-tomcat-5.5.17.exe.MD5
            // 0bb2827c5eacf570b6064e24e0e6653b *apache-tomcat-5.5.17.exe
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
You cannot use hashes to determine any similarity of content.
For instance, generating the MD5 of hellostackoverflow1 and hellostackoverflow2 produces two hashes where none of the characters of the string representation match (7c35[...]85fa vs b283[...]3d19). That's because a hash is calculated from the binary data of the file, so two different formats of the same thing - e.g. a .txt and a .docx of the same text - have different hashes.
But as already noted, some speed might be gained by using native code via the NDK. Additionally, if you still want to compare files for exact matches, first compare the size in bytes, and after that use a hashing algorithm with enough speed and a low risk of collisions. As stated, CRC32 is fine.
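For instance, a minimal sketch of a CRC32 checksum over a file using java.util.zip.CRC32 (the class name, method name, and buffer size are just placeholders, not a tuned recommendation) could look like this:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class Crc32Checksum {
    // Streams the file through a small buffer and feeds it to a CRC32 instance.
    public static long crc32(String filename) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192];
        try (InputStream in = new FileInputStream(filename)) {
            int numRead;
            while ((numRead = in.read(buffer)) != -1) {
                crc.update(buffer, 0, numRead);
            }
        }
        return crc.getValue();
    }
}

Comparing two files would then be a matter of checking the sizes first and only then comparing the two CRC32 values.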
Hash/CRC calculation takes some time as the file has to be read completely.
The code of createChecksum you presented is nearly optimal. The only part that can be tweaked is the read buffer size (I would use a buffer of 2048 bytes or larger). However, this may get you at most a 1-2% speed improvement.
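For illustration, a sketch of the same method with a bigger buffer and try-with-resources (Java 7+); the 8 KB size is just an assumption, not a measured optimum, and the imports from the class above are reused:

// Same algorithm as above, only with a larger read buffer.
public static byte[] createChecksum(String filename) throws Exception {
    MessageDigest complete = MessageDigest.getInstance("MD5");
    byte[] buffer = new byte[8192]; // 8 KB instead of 1 KB
    try (InputStream fis = new FileInputStream(filename)) {
        int numRead;
        while ((numRead = fis.read(buffer)) != -1) {
            complete.update(buffer, 0, numRead);
        }
    }
    return complete.digest();
}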
If this is still too slow, the only option left is to implement the hashing in C/C++ and use it as a native method. Besides that, there is nothing you can do.
Related
I'm trying to take hash of gzipped string in Python and need it to be identical to Java's. But Python's gzip implementation seems to be different from Java's GZIPOutputStream.
Python gzip:
import gzip
import hashlib
gzip_bytes = gzip.compress(bytes('test', 'utf-8'))
gzip_hex = gzip_bytes.hex().upper()
md5 = hashlib.md5(gzip_bytes).hexdigest().upper()
>>> gzip_hex
'1F8B0800678B186002FF2B492D2E01000C7E7FD804000000'
>>> md5
'C4C763E9A0143D36F52306CF4CCC84B8'
Java GZIPOutputStream:
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
public class HelloWorld {

    private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray();

    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = HEX_ARRAY[v >>> 4];
            hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];
        }
        return new String(hexChars);
    }

    public static String md5(byte[] bytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] thedigest = md.digest(bytes);
            return bytesToHex(thedigest);
        }
        catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 Failed", e);
        }
    }

    public static void main(String[] args) {
        String string = "test";
        final byte[] bytes = string.getBytes();
        try {
            final ByteArrayOutputStream bos = new ByteArrayOutputStream();
            final GZIPOutputStream gout = new GZIPOutputStream(bos);
            gout.write(bytes);
            gout.close();
            final byte[] encoded = bos.toByteArray();
            System.out.println("gzip: " + bytesToHex(encoded));
            System.out.println("md5: " + md5(encoded));
        }
        catch (IOException e) {
            throw new RuntimeException("Failed", e);
        }
    }
}
Prints:
gzip: 1F8B08000000000000002B492D2E01000C7E7FD804000000
md5: 1ED3B12D0249E2565B01B146026C389D
So, both gzip bytes outputs seem to be very similar, but slightly different.
1F8B0800678B186002FF2B492D2E01000C7E7FD804000000
1F8B08000000000000002B492D2E01000C7E7FD804000000
Python's gzip.compress() method accepts a compresslevel argument in the range 0-9. I tried all of them, but none gives the desired result.
Any way to get same result as Java's GZIPOutputStream in Python?
Your requirement "hash of gzipped string in Python and need it to be identical to Java's" cannot be met in general. You need to change your requirement and implement your need differently. I would recommend requiring simply that the decompressed data have identical hashes. In fact, there is already a 32-bit hash (a CRC-32) of the decompressed data in the two gzip strings, and those CRCs are identical (0xd87f7e0c). If you want a longer hash, you can append one. The last four bytes are the uncompressed length, modulo 2^32, so you can compare those as well. Just compare the last eight bytes of the two strings and check that they are the same.
The difference between the two gzip strings in your question illustrates the issue. One has a time stamp in the header, and the other does not (it is set to zeros). Even if they both had time stamps, they would still very likely be different. They also differ in some other header bytes, such as the originating operating system.
Furthermore, the compressed data in your examples is extremely short, so it just so happens to be identical in this case. However, for any reasonable amount of data, the compressed data generated by two gzippers will be different, unless they happen to be made with exactly the same deflate code, the same version of that code, and the same memory size and compression level settings. If you are not in control of all of those, you will never be able to assure the same compressed data coming out of them, given identical uncompressed data.
In short, don't waste your time trying to get identical compressed strings.
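If it helps, here is a hedged Java sketch of the trailer comparison described above; the method name is made up for this example, and it assumes both arrays hold complete gzip streams:

import java.util.Arrays;

// Compares only the gzip trailer: CRC-32 of the uncompressed data plus the
// uncompressed length modulo 2^32, i.e. the last eight bytes of each stream.
public static boolean sameUncompressedContent(byte[] gzipA, byte[] gzipB) {
    if (gzipA.length < 8 || gzipB.length < 8) {
        throw new IllegalArgumentException("not a complete gzip stream");
    }
    byte[] trailerA = Arrays.copyOfRange(gzipA, gzipA.length - 8, gzipA.length);
    byte[] trailerB = Arrays.copyOfRange(gzipB, gzipB.length - 8, gzipB.length);
    return Arrays.equals(trailerA, trailerB);
}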
We are really stuck on this topic. This is the only code we have which converts a file into hex, but we need to open a file and then have the Java code read the hex and extract certain bytes (e.g. the first 4 bytes for the file extension):
import java.io.*;

public class FileInHexadecimal
{
    public static void main(String[] args) throws Exception
    {
        FileInputStream fis = new FileInputStream("H://Sample_Word.docx");
        int i = 0;
        while ((i = fis.read()) != -1) {
            if (i != -1) {
                System.out.printf("%02X\n ", i);
            }
        }
        fis.close();
    }
}
Do not confuse internal and external representation - converting to hex only creates a different representation of the same bytes.
There is no need to convert to hex if you just want to read some bytes from the file - just read them. For example, to read the first four bytes, you can use something like
byte[] buffer = new byte[4];
FileInputStream fis = new FileInputStream("H://Sample_Word.docx");
int read = fis.read(buffer);
if (read != buffer.length) {
    System.out.println("Short file!");
}
If you need to read data from an arbitrary position within the file, you might want to check RandomAccessFile instead of using a stream. RandomAccessFile allows you to set the position from which to start reading.
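A minimal sketch of reading four bytes from an arbitrary position with RandomAccessFile (the offset of 128 is only an example value):

// java.io.RandomAccessFile
try (RandomAccessFile raf = new RandomAccessFile("H://Sample_Word.docx", "r")) {
    byte[] buffer = new byte[4];
    raf.seek(128); // jump to the byte offset we want to read from
    int read = raf.read(buffer);
    if (read != buffer.length) {
        System.out.println("Short file!");
    }
}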
I'm searching for a way to parse large files (about 5-10 GB) and find the positions (in bytes) of some recurring strings, as fast as possible.
I've tried to use RandomAccessFile by doing something like below:
RandomAccessFile lecteurFichier = new RandomAccessFile(<MyFile>, "r");
while (currentPointeurPosition < lecteurFichier.length()) {
    char currentFileChar = (char) lecteurFichier.readByte();
    // Test each char for a match with my string (by appending chars until I find the string)
    // and keep track of every found string's position
}
The problem is that this code is too slow (maybe because I read byte by byte?).
I also tried the solution below, which is perfect in terms of speed, but I can't get my strings' positions.
FileInputStream is = new FileInputStream(fichier.getFile());
FileChannel f = is.getChannel();
ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
long len = 0;
while ((len = f.read(buf)) != -1) {
    buf.flip();
    String data = "";
    try {
        int old_position = buf.position();
        data = decoder.decode(buf).toString();
        // reset buffer's position to its original so it is not altered:
        buf.position(old_position);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    buf.clear();
}
f.close();
Does anyone have a better solution to propose?
Thank you in advance (and sorry for my spelling, I'm French).
Since your input data is encoded in an 8-bit encoding*, you can speed up the search by encoding the search string rather than decoding the file:
byte[] encoded = searchString.getBytes("ISO-8859-1");
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
int b;
long pos = -1;
while ((b = bis.read()) != -1) {
    pos++;
    if (encoded[0] == b) {
        // see if rest of string matches
    }
}
A BufferedInputStream should be pretty fast. Using ByteBuffer might be faster, but it is going to make the search logic more complicated because of the possibility of a string match that spans a buffer boundary.
Then there are various clever ways to optimize string searches that could be adapted to this situation ... where you are searching a stream of bytes / characters rather than an array of bytes / characters. The Wikipedia page on String Searching is a good place to start.
Note that since we are reading and matching in a byte-wise fashion, the position is just the count of bytes read (or skipped), so there is no need to use a random access file.
* In fact this trick will work with many multibyte encodings too.
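One hedged way to fill in the "see if rest of string matches" part above is a sliding window over the last pattern-length bytes; this sketch is written for clarity rather than speed, and it reuses the searchString and file variables from above (java.util.Arrays is assumed to be imported):

byte[] encoded = searchString.getBytes("ISO-8859-1");
try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))) {
    byte[] window = new byte[encoded.length]; // holds the last N bytes read
    long pos = -1;
    int b;
    while ((b = bis.read()) != -1) {
        pos++;
        // shift the window left by one byte and append the new byte
        System.arraycopy(window, 1, window, 0, window.length - 1);
        window[window.length - 1] = (byte) b;
        // once at least N bytes have been read, compare the window to the pattern
        if (pos >= encoded.length - 1 && Arrays.equals(window, encoded)) {
            long matchStart = pos - encoded.length + 1;
            System.out.println("Found match at byte " + matchStart);
        }
    }
}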
Searching for a 'needle' in a 'haystack' is a well-studied problem - here's a related link on StackOverflow itself. I am sure Java implementations of the algorithms discussed are available too. Why not try some of them to see if they fit the job?
The Java project I am working on requires me to write the Java equivalent of this C code:
void read_hex_char(char *filename, unsigned char *image)
{
    int i;
    FILE *ff;
    short tmp_short;

    ff = fopen(filename, "r");
    for (i = 0; i < 100; i++)
    {
        fscanf(ff, "%hx", &tmp_short);
        image[i] = tmp_short;
    }
    fclose(ff);
}
I have written this Java code.
void read_hex_char(String filename, char[] image) throws IOException
{
    Scanner s = new Scanner(new BufferedReader(new FileReader(filename)));
    for (int i = 0; i < 100; i++)
    {
        image[i] = s.nextShort();
    }
    s.close();
}
Is this code correct? If it's not, what corrections should be made?
I would go with a FileInputStream and read byte by byte (a short is just two bytes; char is more "complex" than just a short: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html). Simple byte extraction code from my project:
public static byte[] readFile(File file) throws IOException {
    FileInputStream in = new FileInputStream(file);
    try {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        int ch;
        while ((ch = in.read()) != -1)
            bos.write(ch);
        return bos.toByteArray();
    } finally {
        in.close(); // release the stream even if reading fails
    }
}
For your example, the simplest approach is to take a few sample files: run the C function on them, then the Java one, and compare the results. That should give you useful information.
Keep in mind that the Java char type is pretty smart (smarter than just a byte) and represents a Unicode character (and reader classes process the incoming byte stream into Unicode characters, possibly decoding or modifying bytes according to the actual locale and charset settings). From your source I guess that it is just the actual bytes you want in a memory buffer. Here is a really good explanation of how to do this:
Convert InputStream to byte array in Java
For parsing hex values, you can use Short.parseShort(String s, int radix) with radix 16. In Java, char is a bit different from short, so if you intend to perform bitmap operations, short is probably the better type to use. However, note that Java doesn't have unsigned types, which may make some of the operations likely to be used in image processing (like bitwise operations) tricky.
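As a hedged sketch of that idea (the method name is made up, and it assumes the file holds at least 100 whitespace-separated hex tokens, as the fscanf loop does):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

// Reads 100 hex tokens such as "1A" or "ff" into a short[] buffer.
static void readHexValues(String filename, short[] image) throws IOException {
    try (Scanner s = new Scanner(new BufferedReader(new FileReader(filename)))) {
        for (int i = 0; i < 100; i++) {
            image[i] = Short.parseShort(s.next(), 16); // parse each token as base-16
        }
    }
}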
I'm working on moving some files to a different directory in my project and it's working great, except for the fact that I can't verify it's moved properly.
I want to verify the length of the copy is the same as the original and then I want to delete the original. I'm closing both FileStreams before I do my verification but it still fails because the sizes are different. Below is my code for closing the streams, verification and deletion.
in.close();
out.close();
if (encCopyFile.exists() && encCopyFile.length() == encryptedFile.length())
    encryptedFile.delete();
The rest of the code before this is using a Util to copy the streams, and it's all working fine so really I just need a better verification method.
One wonderful way you can check is to compare MD5 hashes. Matching file lengths don't mean the files are the same. While matching MD5 hashes doesn't strictly guarantee they are the same either, it is a better check than the length, albeit a longer process.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Main {

    public static void main(String[] args) throws NoSuchAlgorithmException, IOException {
        System.out.println("Are identical: " + isIdentical("c:\\myfile.txt", "c:\\myfile2.txt"));
    }

    public static boolean isIdentical(String leftFile, String rightFile) throws IOException, NoSuchAlgorithmException {
        return md5(leftFile).equals(md5(rightFile));
    }

    private static String md5(String file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        File f = new File(file);
        InputStream is = new FileInputStream(f);
        byte[] buffer = new byte[8192];
        int read;
        try {
            while ((read = is.read(buffer)) > 0) {
                digest.update(buffer, 0, read);
            }
            byte[] md5sum = digest.digest();
            BigInteger bigInt = new BigInteger(1, md5sum);
            // pad to 32 characters so digests with leading zero bytes are rendered in full
            return String.format("%032x", bigInt);
        } finally {
            is.close();
        }
    }
}
You could include a checksum in your copy operation. Perform a checksum on the destination file and see that it matches a checksum on the source.
You could use commons io:
org.apache.commons.io.FileUtils.contentEquals(File file1, File file2)
or you could use checksum methods:
org.apache.commons.io.FileUtils:
static Checksum checksum(File file, Checksum checksum) //Computes the checksum of a file using the specified checksum object.
static long checksumCRC32(File file) //Computes the checksum of a file using the CRC32 checksum routine.
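A quick sketch of using those methods (commons-io on the classpath is assumed, exception handling is omitted, and the file names are placeholders):

// import java.io.File; import org.apache.commons.io.FileUtils;
File original = new File("encrypted.dat");
File copy = new File("copy/encrypted.dat");

// either compare the full contents ...
boolean sameContent = FileUtils.contentEquals(original, copy);

// ... or compare CRC32 checksums of both files
boolean sameChecksum = FileUtils.checksumCRC32(original) == FileUtils.checksumCRC32(copy);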
If you get no exception while copying the streams, you should be OK. Make sure you don't ignore exceptions thrown by the close method!
Update: If you use FileOutputStream, you can also make sure everything was written properly by calling fileOutputStream.getFD().sync() before closing your fileOutputStream.
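For instance, a minimal sketch, assuming out is a FileOutputStream:

out.flush();        // push any buffered bytes down to the OS
out.getFD().sync(); // ask the OS to force them to the physical device
out.close();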
Of course, if you want to be absolutely sure that the files are the same, you can compare their checksums/digests, but that sounds a bit paranoid to me.
If the sizes are different, perhaps you are not flushing the output stream before closing it.
Which file is bigger? What are the sizes of each file? Have you actually looked at the two files to see what is different?