Is there any suggestion (piece of code) that helps me to understand the concept of how to compare two different image files whether they are same or not?
Thanks in advance.
EDIT:
I mean, for example, I should check the CRC32 (that which I do not know how to do) with size check of the file. Then it means they are identically same pictures...
EDIT2: When I say images are the same, I mean images are looks exactly the same to the user.
You can use CRC32 to sum any file. However if you want to find out if two images are the same you have to decide whether two images which look the same are the same. e.g. The following images all have different sizes let alone different CRC32 numbers.
The checksum for the zip entry has the meaning: when different the files are different.
The CRC32 class allows you to calculate the checksum yourself of bytes.
To efficiently check whether two images are almost equal, there are many means, like making a small 8x8 image and comparing the differences in color values.
Related
I need to compare many images in database with one and mark it if their equals. I have two ideas for solve this problem:
Get hash of both images (MD5) and compare their hashes. This method allows calculate etalon image's hash only one time.
Compare all images with etalon pixel-to-pixel.
What method will be faster? Comparing all pixels or calculate hash for all images in database?
It simply depends how often you want to compare an image with the DB.
If you only need to do this once or twice, go for pixel-to-pixel compare.
because to create all the hash-values you need also to read all pixels of all images once.
If you often need to do this, go for the hash approach. Of cause you need to compare the images with same hash-value pixel by pixel, but that are far less than all images. (You might even be able to keep all hash-values in RAM, if your DB is not too big)
You do not need to use md5. You could even go for far simpler (and faster to calculate) hash functions. (You do not need the features needed for cryptographic hash-functions like md5 (even if md5 is not secure anymore)).
You simply want to reduce the number of images to compare pixel by pixel.
I'm using metadata-extractor to write a Java application that organizes images and finds duplicates. The API is great, but there's something I cannot figure out.
Suppose I have two JPG images. These images, visually, are exactly the same (i.e. same pixel-wise). However, maybe something within the metadata encapsulated in the file differs.
If I calculate MD5 hashes on each complete file, I will get two different hashes. However, I want to calculate a hash of only the image/pixel data, which would yield the same hash for both files.
So - Is there a way to pull out the raw image/pixel data from the JPG using metadata-extractor so that I can calculate my hash on this?
Also, is Javadoc available for this API? I cannot seem to find it.
You can achieve this using the library's JpegSegmentReader class. It'll let you pull out the JPEG segments that contain image data and ignore metadata segments.
I discussed this technique in another answer and the asker indicated they had success with the approach.
This would actually make a nice sample application for the library. If you come up with something and feel like sharing, please do.
I am presently comparing the lsit of files using MD5sum. How to group similar kind of files into a folder using these hash values? Will the hash difference between the two files will be less?
For example: I am having a file which contains a name "HELLO" and the other pdf file contains "hello", these both are more or less same. so these files needs to be grouped. will my idea of finding hash difference help?
Or any other idea? Please help me to sort this out.
No. The hashes will be completely different and there will be no correlation. You can use hashes if you want to divide them uniformly into different buckets, but it doesn't work with grouping similar files.
WhatsApp creates duplicate copies of images upon sharing. Although the resolution of the images are same, the MD5 checksum of the original image and it's copy are different. Why is this? How do I get my app to realize that this is a duplicate image.
I've tried MD5 and Sha-1, both of the algorithms generated different checksums for the two images.
Sounds like there's probably a difference in the metadata - e.g. the timestamp might have been changed by the WhatsApp servers when the copy was made.
I suggest you retrieve the pixel data for the images and run your checksums on that. You can use the Bitmap.getPixels() method. e.g.: myBitmap.getPixels(pixels, 0, myBitmap.getWidth(), 0, 0, myBitmap.getWidth(), myBitmap.getHeight());
Remember, just because the checksum is the same that doesn't necessarily mean the images are! If your checksums match, you'll have to do an element-by-element comparison of the data to be 100% sure that the images are identical.
Edit:
There's a good example of how to do a pixel-by-pixel test for equality here. Note you can use the Bitmap.sameAs() method if you're using API 12+!
Currently, I have a server to which 2 clients can connect. Both of those two clients have a text file on their HDD which is read by the program as soon as it starts up. This textfile should contain the EXACT same data (it's just plain text) on both clients (which should get validated by the server) or the server may not serve the clients.
I'm wondering how to do this correctly. What should I do? Calculate an hashcode, or use MD5/SHA1/SHA2 for something like this? Should I first read the file and calculate an hashcode on the created objects or calculate the MD5 directly on the file?
Thanks
To be really, really sure, you have to transfer the contents of both text files to the server and compare them as strings.
For all practical purposes, you can calculate a hash code and compare that value on the server. Have a look at the FileUtil class in apache commons. It defines a checksumCRC32(File file) method that you can use to compute a checksum for a file. If the checksum is equal for both files, the contents can assumed to be equal. The probability that they are different nonetheless is 1 / 2^32.
You can easily compute hash of the file using DigestUtils from Apache Commons. It has nice methods for computing hashes be it MD5 or SHA1. Then just compare the hashes of files for each client.
Also, you should be aware that exact hashes doesn't guarantee for 100% that the files are identical. It would be very rare situation where the files are not identical given their hashes are equal. However, depending on whether this determination is critical in your app, you may have to compare the files byte-by-byte when the hashes are equal to confirm for sure that they have the exact data.