I need to compare many images in a database against one reference image and mark those that are identical. I have two ideas for solving this problem:
Get the hash (e.g. MD5) of both images and compare the hashes. This way the reference image's hash only needs to be calculated once.
Compare every image with the reference pixel by pixel.
Which method will be faster: comparing all pixels, or calculating a hash for every image in the database?
It simply depends on how often you want to compare an image against the database.
If you only need to do this once or twice, go for the pixel-by-pixel compare, because creating all the hash values also requires reading every pixel of every image once.
If you need to do this often, go for the hash approach. Of course you still have to compare the images with the same hash value pixel by pixel, but those are far fewer than all images. (You might even be able to keep all hash values in RAM if your database is not too big.)
You do not need to use MD5. You could go for far simpler (and faster to calculate) hash functions, since you do not need the properties of a cryptographic hash function like MD5 (and MD5 is not secure anymore anyway).
You simply want to reduce the number of images that have to be compared pixel by pixel.
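A minimal sketch of that workflow, assuming a cheap non-cryptographic checksum (Adler-32 here, purely as an example) computed over the raw file bytes; only database images whose checksum equals the reference's would then be compared pixel by pixel:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Adler32;

public class ChecksumFilter {
    // Cheap, non-cryptographic checksum used only to pre-filter candidates.
    // Equal checksums are NOT proof of equality; follow up pixel by pixel.
    static long quickChecksum(Path imageFile) throws IOException {
        Adler32 sum = new Adler32();
        sum.update(Files.readAllBytes(imageFile));
        return sum.getValue();
    }
}
```

Compute the reference checksum once, keep it (or all database checksums) in memory, and only fall back to a full pixel comparison when the checksums match.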
I have about 5000 images with watermarks on them and 5000 identical images without watermarks. The file names of the two sets are not correlated with each other in any way. I'm looking for an API, preferably in Java, that I can use to pair each watermarked image with its non-watermarked counterpart.
You can use the OpenCV library, which can be used from Java. Please follow http://docs.opencv.org/doc/tutorials/introduction/desktop_java/java_dev_intro.html
Regarding image comparison, you can find another useful answer here: Checking images for similarity with OpenCV
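A rough sketch with the OpenCV Java bindings, just to show the shape of such a comparison (package names assume OpenCV 3.x, where imread lives in Imgcodecs; the working size and the use of an L2 norm are arbitrary choices, not a recommendation from the linked answer):

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class SimilarityCheck {
    static { System.loadLibrary(Core.NATIVE_LIBRARY_NAME); }

    // Lower score = more similar; 0 means the resized images are pixel-identical.
    // Resizing to a common size makes scores comparable across image pairs.
    static double distance(String pathA, String pathB) {
        Mat a = Imgcodecs.imread(pathA);
        Mat b = Imgcodecs.imread(pathB);
        Size common = new Size(256, 256); // arbitrary working size
        Imgproc.resize(a, a, common);
        Imgproc.resize(b, b, common);
        return Core.norm(a, b, Core.NORM_L2);
    }
}
```

For the watermark-pairing problem you would compute such a score (or a perceptual hash) for candidate cross-pairs and keep the best-scoring match per image.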
I think this is more about performance than about the image comparison itself, and the answer is written in that spirit, so if you need help with the comparison itself, leave a comment.
create a simplified histogram for each image
say 8 bins per channel, limited to 4 bits per intensity level; that leads to 3*8*4 = 96 bits per image (a sketch of this step follows after the notes below)
sort the images
take the above histogram, treat it as a single number, and sort the images of group A (originals) by it; it does not matter whether ascending or descending
match the A and B groups
now the corresponding images should have similar histograms, so take an image from the unsorted group B (watermarked), binary-search the closest matches in the sorted group A (originals), and then compare only those selected images with more robust methods instead of all 5000
flag images in group A that are already matched
so you can ignore already matched images in step 3 to gain more speed
[Notes]
there are other ways to improve this, such as using perceptual hash algorithms
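A rough Java sketch of the simplified-histogram step, assuming 8 bins per channel with each count squeezed into 4 bits, packed into two longs so the signature can be sorted and binary-searched as a number (the exact scaling of counts to 4 bits is a free choice):

```java
import java.awt.image.BufferedImage;

public class HistogramSignature {
    // 8 bins per channel * 3 channels * 4 bits per bin = 96 bits per image,
    // packed here as { high 64 bits, low 32 bits }.
    static long[] signature(BufferedImage img) {
        int[] bins = new int[24]; // R: 0..7, G: 8..15, B: 16..23
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                bins[((rgb >> 16) & 0xFF) >> 5]++;
                bins[8 + (((rgb >> 8) & 0xFF) >> 5)]++;
                bins[16 + ((rgb & 0xFF) >> 5)]++;
            }
        }
        long total = (long) img.getWidth() * img.getHeight();
        long hi = 0, lo = 0;
        for (int i = 0; i < 24; i++) {
            // Scale each bin count to 0..15 so it fits in 4 bits.
            long v = Math.min(15, bins[i] * 16L / Math.max(1, total));
            if (i < 16) hi = (hi << 4) | v; else lo = (lo << 4) | v;
        }
        return new long[] { hi, lo };
    }
}
```

Sorting group A by these two longs (hi first, then lo) lets you binary-search group B's signatures against it as described in step 3.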
I need to search a huge image database for possible duplicates using pHash, assuming those image records already have hash codes generated with pHash.
Now I have to compare a new image, creating its hash with pHash, against the existing records. But as far as I understand, the hash comparison is NOT as straightforward as
hash1 - hash2 < threshold
It looks like I need to pass both hash codes into a pHash API to do the matching, so I would have to retrieve all hash codes from the DB in batches and compare them one by one using the pHash API.
But this does not look like the best approach when I have about 1000 images queued to be compared against millions of already existing images.
I need to know the following:
Is my understanding/approach of using pHash to compare against an existing image DB correct?
Is there a better approach to handle this (without using CBIR libraries like LIRE)?
I have heard of an algorithm called dHash which can also be used for image comparison with hash codes. Are there any Java libraries for it, and can it be used together with pHash to optimize this task of large-scale, repeated image processing?
Thanks in advance.
I think part of this question is discussed on the pHash support forum.
You will need to use the MVP-tree storage mechanism:
http://lists.phash.org/htdig.cgi/phash-support-phash.org/2011-May/000122.html
and
http://lists.phash.org/htdig.cgi/phash-support-phash.org/2010-October/000103.html
Depending on your definition of "huge", a good solution here is to implement a BK-tree (human-readable description).
I'm working on a similar project, and I implemented a BK-tree in Cython. It's fairly performant: searching with a Hamming distance of 2 takes less than 50 ms on a 12-million-item dataset and touches ~0.01-0.02% of the tree nodes.
Larger searches (edit distance of 8) take longer (~500 ms) and touch about 5% of the tree nodes.
This is with a 64-bit hash size.
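For reference, a minimal BK-tree over 64-bit hashes with Hamming distance as the metric could look like this in Java (an illustrative sketch, not the Cython implementation mentioned above):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BkTree {
    private static final class Node {
        final long hash;
        final Map<Integer, Node> children = new HashMap<>();
        Node(long hash) { this.hash = hash; }
    }

    private Node root;

    private static int distance(long a, long b) {
        return Long.bitCount(a ^ b); // Hamming distance between two 64-bit hashes
    }

    public void add(long hash) {
        if (root == null) { root = new Node(hash); return; }
        Node node = root;
        while (true) {
            int d = distance(hash, node.hash);
            if (d == 0) return; // already stored
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(hash)); return; }
            node = child;
        }
    }

    // Return all stored hashes within maxDistance of the query.
    public List<Long> search(long query, int maxDistance) {
        List<Long> result = new ArrayList<>();
        if (root != null) search(root, query, maxDistance, result);
        return result;
    }

    private void search(Node node, long query, int maxDistance, List<Long> result) {
        int d = distance(query, node.hash);
        if (d <= maxDistance) result.add(node.hash);
        // Triangle inequality: only subtrees whose edge label lies within
        // [d - maxDistance, d + maxDistance] can contain matches.
        for (int i = Math.max(1, d - maxDistance); i <= d + maxDistance; i++) {
            Node child = node.children.get(i);
            if (child != null) search(child, query, maxDistance, result);
        }
    }
}
```

The pruning in the recursive search is what keeps the fraction of visited nodes small for tight distance thresholds.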
WhatsApp creates duplicate copies of images upon sharing. Although the resolution of the images is the same, the MD5 checksums of the original image and its copy are different. Why is this, and how do I get my app to recognize that an image is a duplicate?
I've tried MD5 and SHA-1; both algorithms generate different checksums for the two images.
Sounds like there's probably a difference in the metadata - e.g. the timestamp might have been changed by the WhatsApp servers when the copy was made.
I suggest you retrieve the pixel data for the images and run your checksums on that. You can use the Bitmap.getPixels() method. e.g.: myBitmap.getPixels(pixels, 0, myBitmap.getWidth(), 0, 0, myBitmap.getWidth(), myBitmap.getHeight());
Remember, just because the checksums are the same doesn't necessarily mean the images are! If your checksums match, you'll still have to do an element-by-element comparison of the data to be 100% sure the images are identical.
Edit:
There's a good example of how to do a pixel-by-pixel test for equality here. Note you can use the Bitmap.sameAs() method if you're using API 12+!
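A sketch of that idea: decode both images to Bitmaps and hash the decoded pixel buffer instead of the file bytes, so metadata changes no longer affect the result (the helper name is made up; it assumes both bitmaps were decoded at the same size and config):

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.security.MessageDigest;

import android.graphics.Bitmap;

public class PixelHash {
    // MD5 over the raw ARGB pixel values of a decoded Bitmap.
    static String pixelMd5(Bitmap bitmap) throws Exception {
        int[] pixels = new int[bitmap.getWidth() * bitmap.getHeight()];
        bitmap.getPixels(pixels, 0, bitmap.getWidth(), 0, 0,
                         bitmap.getWidth(), bitmap.getHeight());
        ByteBuffer buf = ByteBuffer.allocate(pixels.length * 4);
        buf.asIntBuffer().put(pixels);
        byte[] digest = MessageDigest.getInstance("MD5").digest(buf.array());
        return new BigInteger(1, digest).toString(16);
    }
}
```

If two files produce the same pixel hash, finish with Bitmap.sameAs() (API 12+) or a manual pixel-by-pixel check to be certain.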
Is there any suggestion (a piece of code, ideally) that would help me understand how to compare two different image files and tell whether they are the same or not?
Thanks in advance.
EDIT:
I mean, for example, I could check the CRC32 (which I do not know how to do) together with a file-size check; if both match, it means they are identically the same pictures...
EDIT2: When I say the images are the same, I mean the images look exactly the same to the user.
You can use CRC32 to checksum any file. However, if you want to find out whether two images are the same, you first have to decide whether two images that merely look the same count as the same: e.g. the following images all have different sizes, let alone different CRC32 values.
The checksum of a zip entry has this meaning: when the checksums differ, the files are different.
The CRC32 class allows you to calculate the checksum of any sequence of bytes yourself.
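A minimal sketch of checksumming a file's raw bytes with java.util.zip.CRC32 (paths are placeholders):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.CRC32;

public class FileCrc {
    static long crc32Of(String path) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(Files.readAllBytes(Paths.get(path)));
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        // Different checksums => different files; equal checksums still
        // warrant a byte-by-byte compare before declaring them identical.
        System.out.println(crc32Of("a.jpg") == crc32Of("b.jpg"));
    }
}
```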
To check efficiently whether two images are almost equal, there are many approaches, such as scaling both down to a small 8x8 image and comparing the differences in the color values.
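A rough sketch of that 8x8 idea with plain java.awt, purely as an illustration (the tolerance is an arbitrary per-pixel threshold, not a standard value):

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class TinyCompare {
    // True if every pixel of the 8x8 thumbnails differs by at most
    // 'tolerance' summed over the R, G and B channels.
    static boolean roughlyEqual(BufferedImage a, BufferedImage b, int tolerance) {
        BufferedImage sa = shrink(a), sb = shrink(b);
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                int p = sa.getRGB(x, y), q = sb.getRGB(x, y);
                int diff = Math.abs(((p >> 16) & 0xFF) - ((q >> 16) & 0xFF))
                         + Math.abs(((p >> 8) & 0xFF) - ((q >> 8) & 0xFF))
                         + Math.abs((p & 0xFF) - (q & 0xFF));
                if (diff > tolerance) return false;
            }
        }
        return true;
    }

    static BufferedImage shrink(BufferedImage src) {
        BufferedImage out = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.drawImage(src, 0, 0, 8, 8, null); // scale down to 8x8
        g.dispose();
        return out;
    }
}
```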
I have an Android application that iterates through an array of thousands of integers and uses them as keys to access pairs of integers (let us call them IDs) in order to make calculations with them. It needs to do this as fast as possible, and in the end it returns a result which is crucial to the application.
I tried loading a HashMap into memory for fast access to those numbers, but it resulted in an OOM exception. I also tried writing those IDs to a RandomAccessFile and storing their offsets within the file in another HashMap, but that was way too slow. Also, the new HashMap that only stores the offsets still occupies a large amount of memory.
Now I am considering SQLite, but I am not sure whether it will be any faster. Are there any structures or libraries that could help me with this?
EDIT: There are more than 20 million keys, whereas I only need to access thousands of them. I do not know beforehand which ones I will access, because it changes with user input.
You could use Trove's TIntLongHashMap to map primitive ints to primitive longs (which store the ints of your value pair). This saves you the object overhead of a plain vanilla Map, which forces you to use wrapper types.
EDIT
Since your update states you have more than 20 million mappings, there will likely be more space-efficient structures than a hash map. An approach that partitions your keys into buckets, combined with some sub-key compression, will likely save you half the memory compared with even the most efficient hash map implementation.
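A small sketch of the TIntLongHashMap idea, packing the two ints of each value pair into one long so nothing is boxed (class and package names are from Trove 3; adjust to the version you use):

```java
import gnu.trove.map.hash.TIntLongHashMap;

public class PairStore {
    // int key -> long value, where the long holds both ints of the pair.
    private final TIntLongHashMap map = new TIntLongHashMap();

    public void put(int key, int first, int second) {
        map.put(key, ((long) first << 32) | (second & 0xFFFFFFFFL));
    }

    public int[] get(int key) {
        long packed = map.get(key); // returns the no-entry value (0) if absent
        return new int[] { (int) (packed >>> 32), (int) packed };
    }

    public boolean contains(int key) {
        return map.containsKey(key);
    }
}
```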
SQLite is an embedded relational database that uses indexes. I would bet it is much faster than using a RandomAccessFile. You can give it a try.
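If you try it, a bare-bones indexed lookup with Android's built-in SQLite could look roughly like this (table and column names are made up for illustration):

```java
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

public class PairDb {
    private final SQLiteDatabase db;

    public PairDb(String path) {
        db = SQLiteDatabase.openOrCreateDatabase(path, null);
        // INTEGER PRIMARY KEY gives an indexed lookup on k for free.
        db.execSQL("CREATE TABLE IF NOT EXISTS pairs (k INTEGER PRIMARY KEY, a INTEGER, b INTEGER)");
    }

    public void put(int key, int a, int b) {
        db.execSQL("INSERT OR REPLACE INTO pairs VALUES (?, ?, ?)",
                   new Object[] { key, a, b });
    }

    public int[] get(int key) {
        Cursor c = db.rawQuery("SELECT a, b FROM pairs WHERE k = ?",
                               new String[] { String.valueOf(key) });
        try {
            return c.moveToFirst() ? new int[] { c.getInt(0), c.getInt(1) } : null;
        } finally {
            c.close();
        }
    }
}
```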
My suggestion is to rearrange the keys into buckets. What I mean is: identify (more or less) the distribution of your keys, then create files that correspond to ranges of keys (the point is that every file must contain only as many integers as fit into memory and no more). When you search for a key, you just read the whole corresponding file into memory and look for it there; a sketch follows below.
For example, assuming the distribution of the keys is uniform, store the 500k values corresponding to key values 0-500k in one file, the 500k values corresponding to keys 500k-1M in the next, and so on...
EDIT: If you try this approach and it is still too slow, I still have some tricks up my sleeve:
First make sure that your division is actually close to equal across all the buckets.
Try making the buckets smaller by creating more buckets.
The point of dividing into buckets by ranges correctly is that when you search for a key, you go to the corresponding range bucket, and the key is either in it or not in the whole collection at all, so there is no point in concurrently reading another bucket.
I have never done this, because I'm not sure concurrency helps with I/O, but it may be helpful to read the whole file with two threads, one from top to bottom and the other from bottom to top, until they meet (or something like that).
Once you have read a whole bucket into memory, split it into 3-4 ArrayLists and run 3-4 worker threads to search for your key in each of the arrays; the search should finish much faster that way.
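A simple sketch of the range-bucket layout described above, with each record stored as three ints (the bucket width and file naming are arbitrary choices):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class RangeBuckets {
    static final int BUCKET_WIDTH = 500_000; // keys 0..499999 -> bucket 0, etc.

    static String fileFor(int key) {
        return "bucket_" + (key / BUCKET_WIDTH) + ".bin";
    }

    // Append one (key, a, b) record to the bucket that owns the key.
    static void append(int key, int a, int b) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream(fileFor(key), true))) {
            out.writeInt(key);
            out.writeInt(a);
            out.writeInt(b);
        }
    }

    // Read only the one bucket that can contain the key and scan it.
    static int[] find(int key) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new FileInputStream(fileFor(key)))) {
            while (in.available() >= 12) {
                int k = in.readInt(), a = in.readInt(), b = in.readInt();
                if (k == key) return new int[] { a, b };
            }
        }
        return null; // not in this bucket => not in the collection at all
    }
}
```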