Duplicate has different MD5 checksum - java

WhatsApp creates duplicate copies of images upon sharing. Although the resolution of the images are same, the MD5 checksum of the original image and it's copy are different. Why is this? How do I get my app to realize that this is a duplicate image.
I've tried MD5 and Sha-1, both of the algorithms generated different checksums for the two images.

Sounds like there's probably a difference in the metadata - e.g. the timestamp might have been changed by the WhatsApp servers when the copy was made.
I suggest you retrieve the pixel data for the images and run your checksums on that. You can use the Bitmap.getPixels() method. e.g.: myBitmap.getPixels(pixels, 0, myBitmap.getWidth(), 0, 0, myBitmap.getWidth(), myBitmap.getHeight());
Remember, just because the checksum is the same that doesn't necessarily mean the images are! If your checksums match, you'll have to do an element-by-element comparison of the data to be 100% sure that the images are identical.
Edit:
There's a good example of how to do a pixel-by-pixel test for equality here. Note you can use the Bitmap.sameAs() method if you're using API 12+!

Related

Comparing many images with one

I need to compare many images in database with one and mark it if their equals. I have two ideas for solve this problem:
Get hash of both images (MD5) and compare their hashes. This method allows calculate etalon image's hash only one time.
Compare all images with etalon pixel-to-pixel.
What method will be faster? Comparing all pixels or calculate hash for all images in database?
It simply depends how often you want to compare an image with the DB.
If you only need to do this once or twice, go for pixel-to-pixel compare.
because to create all the hash-values you need also to read all pixels of all images once.
If you often need to do this, go for the hash approach. Of cause you need to compare the images with same hash-value pixel by pixel, but that are far less than all images. (You might even be able to keep all hash-values in RAM, if your DB is not too big)
You do not need to use md5. You could even go for far simpler (and faster to calculate) hash functions. (You do not need the features needed for cryptographic hash-functions like md5 (even if md5 is not secure anymore)).
You simply want to reduce the number of images to compare pixel by pixel.

How to access the raw image data

I'm using metadata-extractor to write a Java application that organizes images and finds duplicates. The API is great, but there's something I cannot figure out.
Suppose I have two JPG images. These images, visually, are exactly the same (i.e. same pixel-wise). However, maybe something within the metadata encapsulated in the file differs.
If I calculate MD5 hashes on each complete file, I will get two different hashes. However, I want to calculate a hash of only the image/pixel data, which would yield the same hash for both files.
So - Is there a way to pull out the raw image/pixel data from the JPG using metadata-extractor so that I can calculate my hash on this?
Also, is Javadoc available for this API? I cannot seem to find it.
You can achieve this using the library's JpegSegmentReader class. It'll let you pull out the JPEG segments that contain image data and ignore metadata segments.
I discussed this technique in another answer and the asker indicated they had success with the approach.
This would actually make a nice sample application for the library. If you come up with something and feel like sharing, please do.

Java library or strategy to tell if an image contains a signature

I have an image file, and I need to determine if a specified area of this image contains a signature. Or to put it in end-user terms, "Has this document been signed?"
What I have done so far is to examine all the pixels contained in the area, to calculate an average "darkness", and compare that to a reference value. If the difference in darkness exceeds some threshold, then I consider it signed.
The problem with this (admittedly simplistic) approach is that because the pixels of the signature itself are such a small fraction of area, I have to use a very low darkness threshold, which results in a large number of false positives. I can't distinguish a real signature from stray markings, smudges, fax artifacts, etc.
To be clear...I'm not trying to match any specific signature or set of signatures. That is, I don't care who signed it, only whether it is signed.
Is anyone aware of a Java library that can do this, or of a better approach to this problem than what I am currently doing?
EDIT:
This is an example of the kinds of images I am working with. This document would be faxed to the recipient, signed and faxed back. It won't be this clean-looking by the time I need to look for a signature.
I do not know of any simple solutions. You can wrap over queXF or write something similar in Java. This paper talks about color code algorithm to recognize signatures.
This is what I believe can be done (although not a very good solution) but may still work. It would involve a bit of machine learning. I am assuming that your image does not contain hand written text and its just an image.
First thing to do would be to create a dataset of images which contain a signature and those which do not. The positive samples of the dataset should only contain signatures (you can learn a classifier for multiple aspect ratios) and negative samples should contain random images of the same aspect ratio/dimension. Now, you can compute some feature over these samples (HoG can be used as a feature, although I do not claim it is the best one to use for this application) and learn a SVM for each aspect ratio.
The next step would be to slide a detection window (of different aspect ratios) throughout the image and use the multiple SVMs you have learnt and check if any of them gives a positive response.
Although this approach may not work always, but should give a decent amount of accuracy. The more data you will use to learn, the better the results would get (and if you can come up with a good feature vector to represent a signature, it would help you case even further)

java.util.zip.CRC32 comparing image files

Is there any suggestion (piece of code) that helps me to understand the concept of how to compare two different image files whether they are same or not?
Thanks in advance.
EDIT:
I mean, for example, I should check the CRC32 (that which I do not know how to do) with size check of the file. Then it means they are identically same pictures...
EDIT2: When I say images are the same, I mean images are looks exactly the same to the user.
You can use CRC32 to sum any file. However if you want to find out if two images are the same you have to decide whether two images which look the same are the same. e.g. The following images all have different sizes let alone different CRC32 numbers.
The checksum for the zip entry has the meaning: when different the files are different.
The CRC32 class allows you to calculate the checksum yourself of bytes.
To efficiently check whether two images are almost equal, there are many means, like making a small 8x8 image and comparing the differences in color values.

Java validate file

Currently, I have a server to which 2 clients can connect. Both of those two clients have a text file on their HDD which is read by the program as soon as it starts up. This textfile should contain the EXACT same data (it's just plain text) on both clients (which should get validated by the server) or the server may not serve the clients.
I'm wondering how to do this correctly. What should I do? Calculate an hashcode, or use MD5/SHA1/SHA2 for something like this? Should I first read the file and calculate an hashcode on the created objects or calculate the MD5 directly on the file?
Thanks
To be really, really sure, you have to transfer the contents of both text files to the server and compare them as strings.
For all practical purposes, you can calculate a hash code and compare that value on the server. Have a look at the FileUtil class in apache commons. It defines a checksumCRC32(File file) method that you can use to compute a checksum for a file. If the checksum is equal for both files, the contents can assumed to be equal. The probability that they are different nonetheless is 1 / 2^32.
You can easily compute hash of the file using DigestUtils from Apache Commons. It has nice methods for computing hashes be it MD5 or SHA1. Then just compare the hashes of files for each client.
Also, you should be aware that exact hashes doesn't guarantee for 100% that the files are identical. It would be very rare situation where the files are not identical given their hashes are equal. However, depending on whether this determination is critical in your app, you may have to compare the files byte-by-byte when the hashes are equal to confirm for sure that they have the exact data.

Categories

Resources