I am currently comparing a list of files using MD5 checksums. How can I group similar files into a folder using these hash values? Will the hash difference between two similar files be small?
For example, I have a file that contains the word "HELLO" and another PDF file that contains "hello"; these two are more or less the same, so they need to be grouped together. Will my idea of computing the hash difference help?
Or is there any other idea? Please help me sort this out.
No. The hashes will be completely different and show no correlation at all. Hashes are useful if you want to divide files uniformly into buckets, but they don't work for grouping similar files.
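To see why, here is a tiny sketch (plain java.security.MessageDigest, nothing else assumed) comparing the MD5 of "HELLO" and "hello"; the two digests differ in essentially every hex digit even though the inputs differ only in case:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class HashDemo {
        static String md5Hex(String s) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            // Near-identical inputs, completely unrelated digests.
            System.out.println(md5Hex("HELLO"));
            System.out.println(md5Hex("hello"));
        }
    }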
I'm using metadata-extractor to write a Java application that organizes images and finds duplicates. The API is great, but there's something I cannot figure out.
Suppose I have two JPG images. Visually, these images are exactly the same (i.e. identical pixel for pixel). However, something within the metadata encapsulated in the file may differ.
If I calculate MD5 hashes on each complete file, I will get two different hashes. However, I want to calculate a hash of only the image/pixel data, which would yield the same hash for both files.
So - Is there a way to pull out the raw image/pixel data from the JPG using metadata-extractor so that I can calculate my hash on this?
Also, is Javadoc available for this API? I cannot seem to find it.
You can achieve this using the library's JpegSegmentReader class. It'll let you pull out the JPEG segments that contain image data and ignore metadata segments.
I discussed this technique in another answer and the asker indicated they had success with the approach.
This would actually make a nice sample application for the library. If you come up with something and feel like sharing, please do.
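In the meantime, here is a rough sketch of the general idea using only standard ImageIO and MessageDigest rather than the library's segment API (class and method names are mine); it hashes the decoded pixel values, so metadata-only differences don't change the result:

    import javax.imageio.ImageIO;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class PixelHash {
        // Hashes only the decoded pixel data, so EXIF, comments and
        // embedded thumbnails have no effect on the digest.
        public static String md5OfPixels(File jpeg) throws IOException, NoSuchAlgorithmException {
            BufferedImage image = ImageIO.read(jpeg);
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            ByteBuffer row = ByteBuffer.allocate(image.getWidth() * 4);
            for (int y = 0; y < image.getHeight(); y++) {
                row.clear();
                for (int x = 0; x < image.getWidth(); x++) {
                    row.putInt(image.getRGB(x, y));
                }
                md5.update(row.array(), 0, row.position());
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        }
    }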
I have two JSON files which contain a large number of records (objects). One file has about 1200 records in it and the other has about 600. I'm sorry that I can't post them here, but I want to compare both of them and get back the records that are common to both. The tricky part is that I can't iterate through them, as there is a large number of records and the tool I'm using can't support that. I'm posting a sample of my JSON below:
{"xyz":{"string":"hello"},"abc:{"string":"rts","event":"file","value":"100"}}
{"xyz":{"string":"hello"},"thg{"Integer":"rts","event":"file","value":"100"}}
My question is whether any libraries are available where I can directly compare two JSON files using predefined methods. If no such libraries are available, can you suggest an efficient way to find the common records, such as "xyz" in the example above?
I'm not supposed to use GSON as it is incompatible with the tool.
I don't know about libraries, but the algorithm will definitely involve first sorting both files by record key, and after that, yes, a single pass of record-wise comparison over the two sorted lists. The overall complexity is O(n log n).
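If your tool tolerates Jackson (which I can't verify, and the file names below are placeholders), a rough sketch would be: parse both documents, then keep the top-level records whose values are deeply equal in both. This uses map lookups instead of explicit sorting, since the parsed objects are already keyed by name:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.File;
    import java.io.IOException;
    import java.util.Iterator;

    public class CommonRecords {
        public static void main(String[] args) throws IOException {
            ObjectMapper mapper = new ObjectMapper();
            JsonNode first = mapper.readTree(new File("first.json"));
            JsonNode second = mapper.readTree(new File("second.json"));

            // Keep every top-level record that also exists in the second
            // document with an identical value (JsonNode.equals is deep).
            Iterator<String> names = first.fieldNames();
            while (names.hasNext()) {
                String name = names.next();
                JsonNode other = second.get(name);
                if (other != null && other.equals(first.get(name))) {
                    System.out.println("common record: " + name);
                }
            }
        }
    }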
Take a look at the link below.
http://tlrobinson.net/projects/javascript-fun/jsondiff/
It will compare two JSON documents.
Hope this is what you were looking for.
Is there any suggestion (a piece of code) that would help me understand how to compare two image files and tell whether they are the same or not?
Thanks in advance.
EDIT:
I mean, for example, that I could check the CRC32 (which I do not know how to compute) together with a check of the file size. If both match, it means the pictures are identical...
EDIT2: When I say the images are the same, I mean the images look exactly the same to the user.
You can use CRC32 to checksum any file. However, if you want to find out whether two images are the same, you first have to decide whether two images that merely look the same count as the same. For example, the same picture saved at different sizes produces files with different contents, let alone different CRC32 values.
A checksum (such as the one stored for each ZIP entry) only carries this meaning: when the checksums differ, the files are different.
The java.util.zip.CRC32 class allows you to calculate the checksum of a stream of bytes yourself.
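A minimal sketch of checksumming a file that way (the path is a placeholder):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.CRC32;
    import java.util.zip.CheckedInputStream;

    public class FileCrc32 {
        // Streams the file through CheckedInputStream so large files
        // never have to be loaded into memory in one piece.
        public static long checksum(String path) throws IOException {
            try (InputStream in = Files.newInputStream(Paths.get(path));
                 CheckedInputStream checked = new CheckedInputStream(in, new CRC32())) {
                byte[] buffer = new byte[8192];
                while (checked.read(buffer) != -1) {
                    // reading is enough; the checksum accumulates as a side effect
                }
                return checked.getChecksum().getValue();
            }
        }
    }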
To check efficiently whether two images are almost equal, there are many approaches, such as scaling both down to a small 8x8 image and comparing the differences in the color values.
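As an illustration of that last idea (the 8x8 size and the threshold are arbitrary choices here, not a standard):

    import javax.imageio.ImageIO;
    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;

    public class TinyImageCompare {
        // Scale an image down to 8x8 so small encoding differences wash out.
        private static BufferedImage shrink(File file) throws IOException {
            BufferedImage small = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = small.createGraphics();
            g.drawImage(ImageIO.read(file), 0, 0, 8, 8, null);
            g.dispose();
            return small;
        }

        // An average per-channel difference below an arbitrary threshold
        // is treated as "looks the same".
        public static boolean lookAlike(File a, File b) throws IOException {
            BufferedImage ia = shrink(a), ib = shrink(b);
            long diff = 0;
            for (int y = 0; y < 8; y++) {
                for (int x = 0; x < 8; x++) {
                    int p = ia.getRGB(x, y), q = ib.getRGB(x, y);
                    diff += Math.abs(((p >> 16) & 0xFF) - ((q >> 16) & 0xFF));
                    diff += Math.abs(((p >> 8) & 0xFF) - ((q >> 8) & 0xFF));
                    diff += Math.abs((p & 0xFF) - (q & 0xFF));
                }
            }
            return diff / (8.0 * 8.0 * 3.0) < 10; // average difference out of 255
        }
    }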
Currently, I have a server to which two clients can connect. Both clients have a text file on their HDD which the program reads as soon as it starts up. This text file should contain the EXACT same data (it's just plain text) on both clients, which should be validated by the server, or the server may not serve the clients.
I'm wondering how to do this correctly. What should I do: calculate a hash code, or use MD5/SHA-1/SHA-2 for something like this? Should I first read the file and calculate a hash code of the created objects, or calculate the MD5 directly on the file?
Thanks
To be really, really sure, you have to transfer the contents of both text files to the server and compare them as strings.
For all practical purposes, you can calculate a hash code and compare that value on the server. Have a look at the FileUtils class in Apache Commons IO. It defines a checksumCRC32(File file) method that you can use to compute a checksum for a file. If the checksum is equal for both files, the contents can be assumed to be equal; the probability that they nonetheless differ is about 1 in 2^32.
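A minimal sketch of that comparison (assuming commons-io is on the classpath; the file names are placeholders):

    import org.apache.commons.io.FileUtils;
    import java.io.File;
    import java.io.IOException;

    public class ChecksumCheck {
        public static void main(String[] args) throws IOException {
            long first = FileUtils.checksumCRC32(new File("client-a.txt"));
            long second = FileUtils.checksumCRC32(new File("client-b.txt"));
            // Different checksums prove the files differ; equal checksums make
            // it overwhelmingly likely, though not certain, that they match.
            System.out.println(first == second ? "match" : "mismatch");
        }
    }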
You can easily compute the hash of a file using DigestUtils from Apache Commons Codec. It has convenient methods for computing hashes, be it MD5 or SHA-1. Then just compare the hashes of the files from each client.
Also, you should be aware that equal hashes don't guarantee with 100% certainty that the files are identical. It would be a very rare situation for the files not to be identical when their hashes are equal. However, depending on how critical this determination is in your app, you may want to compare the files byte by byte when the hashes match, to confirm that they really contain the exact same data.
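For illustration, a short sketch with commons-codec's DigestUtils (paths are placeholders):

    import org.apache.commons.codec.digest.DigestUtils;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class FileHashCompare {
        // Returns true when both files hash to the same MD5 value.
        public static boolean sameHash(String pathA, String pathB) throws IOException {
            try (InputStream a = new FileInputStream(pathA);
                 InputStream b = new FileInputStream(pathB)) {
                return DigestUtils.md5Hex(a).equals(DigestUtils.md5Hex(b));
            }
        }
    }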
What algorithms or Java libraries are available to do N-way, recursive diff/merge of directories?
I need to be able to generate a list of folder trees that have many identical files, and have subdirectories with many similar files. I want to be able to use 2-way merge operations to quickly remove as much redundancy as possible.
Goals:
Find pairs of directories that have many similar files between them.
Generate a short list of directory pairs that can be synchronized with a 2-way merge to eliminate duplicates
Should operate recursively (there may be nested duplicates of higher-level directories)
Run time and storage should be O(n log n) in the number of directories and files
Should be able to use an embedded DB or page to disk for processing more files than fit in memory (100,000+).
Optional: generate an ancestry and change-set between folders
Optional: sort the merge operations by how many duplicates they can eliminate
I know how to use hashes to find duplicate files in roughly O(n) space, but I'm at a loss for how to go from this to finding partially overlapping sets between folders and their children.
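For reference, the hash-based duplicate detection I mean is roughly this sketch (it just groups paths by the MD5 of their contents; anything beyond that is the part I'm missing):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Stream;

    public class DuplicateFinder {
        // Groups every regular file under root by the hex MD5 of its contents;
        // any entry with more than one path is a set of duplicates.
        public static Map<String, List<Path>> groupByHash(Path root) throws IOException {
            Map<String, List<Path>> groups = new HashMap<>();
            try (Stream<Path> files = Files.walk(root)) {
                for (Path file : (Iterable<Path>) files.filter(Files::isRegularFile)::iterator) {
                    groups.computeIfAbsent(hash(file), k -> new ArrayList<>()).add(file);
                }
            }
            return groups;
        }

        private static String hash(Path file) throws IOException {
            try (InputStream in = Files.newInputStream(file)) {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    md5.update(buffer, 0, read);
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : md5.digest()) hex.append(String.format("%02x", b));
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }
    }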
EDIT: some clarification
The tricky part is the difference between "exact same" contents (otherwise hashing the file hashes would work) and "similar" (which it will not). Basically, I want to feed a set of directories to this algorithm and have it return a set of 2-way merge operations I can perform in order to reduce duplicates as much as possible with as few conflicts as possible. It's effectively constructing an ancestry tree showing which folders are derived from each other.
The end goal is to let me incorporate a bunch of different folders into one common tree. For example, I may have a folder holding programming projects, and then copy some of its contents to another computer to work on it. Then I might back up an intermediate version to a flash drive. Except I may have 8 or 10 different versions, with slightly different organizational structures or folder names. I need to be able to merge them one step at a time, so I can choose how to incorporate changes at each step of the way.
This is actually more or less what I intend to do with my utility (bring together a bunch of scattered backups from different points in time). I figure if I can do it right I may as well release it as a small open source util. I think the same tricks might be useful for comparing XML trees though.
It seems desirable just to work on the filenames and sizes (and timestamps if you find that they are reliable), to avoid reading in all those files and hashing or diffing them.
Here's what comes to mind.
Load all the data from the filesystem. It'll be big, but it'll fit in memory.
Make a list of candidate directory-pairs with similarity scores. For each directory-name that appears in both trees, score 1 point for all pairs of directories that share that name. For each filename that appears in both trees (but not so often that it's meaningless), score 1 point for all pairs of directories that contain a file with that name. Score bonus points if the two files are identical. Score bonus points if the filename doesn't appear anywhere else. Each time you give points, also give some points to all ancestor-pairs, so that if a/x/y/foo.txt is similar to b/z/y/foo.txt, then the pairs (a/x/y, b/z/y) and (a/x, b/z) and (a, b) all get points.
Optionally, discard all pairs with scores too low to bother with, and critically examine the other pairs. Up to now we've only considered ways that directories are similar. Look again, and penalize directory-pairs that show signs of not having common ancestry. (A general way to do this would be to calculate the maximum score the two directories could possibly have, if they both had all the files and they were all identical; and reject the pair if only a small fraction of that possible score was actually achieved. But it might be better to do something cheap and heuristic, or to skip this step entirely.)
Choose the best-scoring candidate directory-pair. Output it. Eliminate those directories and all their subdirectories from contention. Repeat.
Choosing the right data structures is left as an exercise.
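A rough sketch of the filename-scoring idea above, with relative paths standing in for directories; the bonus points, the "too common to be meaningful" filter, the penalty pass, and the final greedy selection are all left out:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class DirPairScorer {
        // Maps each filename to the set of directories (relative to root) containing it.
        private static Map<String, Set<Path>> index(Path root) throws IOException {
            try (Stream<Path> files = Files.walk(root)) {
                return files.filter(Files::isRegularFile)
                        .collect(Collectors.groupingBy(
                                p -> p.getFileName().toString(),
                                Collectors.mapping(p -> root.relativize(p.getParent()),
                                        Collectors.toSet())));
            }
        }

        // Scores one point per shared filename for each directory pair that
        // contains it, and credits the same point to every ancestor pair.
        public static Map<String, Integer> score(Path rootA, Path rootB) throws IOException {
            Map<String, Set<Path>> left = index(rootA);
            Map<String, Set<Path>> right = index(rootB);
            Map<String, Integer> scores = new HashMap<>();

            for (Map.Entry<String, Set<Path>> entry : left.entrySet()) {
                Set<Path> rightDirs = right.get(entry.getKey());
                if (rightDirs == null) continue;
                for (Path a : entry.getValue()) {
                    for (Path b : rightDirs) {
                        Path pa = a, pb = b;
                        while (pa != null && pb != null) {
                            scores.merge(pa + " <-> " + pb, 1, Integer::sum);
                            pa = pa.getParent();
                            pb = pb.getParent();
                        }
                    }
                }
            }
            return scores;
        }
    }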
This algorithm makes no attempt to find similar files with different filenames. You can do that across large sets of files using something like the rsync algorithm, but I'm not sure you need it.
This algorithm makes no serious attempt to determine whether two files are actually similar. It just scores 1 point for the same filename and bonus points for the same size and timestamp. You certainly could diff them to assign a more precise score. I doubt it's worth it.