Currently, I have a server to which two clients can connect. Each client has a text file on its HDD which is read by the program as soon as it starts up. This text file should contain the EXACT same data (it's just plain text) on both clients, which the server should validate; otherwise the server may not serve the clients.
I'm wondering how to do this correctly. What should I do? Calculate a hash code, or use MD5/SHA-1/SHA-2 for something like this? Should I first read the file and calculate a hash code on the created objects, or calculate the MD5 directly on the file?
Thanks
To be really, really sure, you have to transfer the contents of both text files to the server and compare them as strings.
For all practical purposes, you can calculate a hash code and compare that value on the server. Have a look at the FileUtils class in Apache Commons IO. It defines a checksumCRC32(File file) method that you can use to compute a checksum for a file. If the checksums are equal, the contents can be assumed to be equal; the probability that the files differ nonetheless is about 1 / 2^32.
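The same comparison can be sketched without any dependency, since the JDK ships a CRC32 class of its own (the file names and contents below are made-up placeholders):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

public class FileChecksum {

    // CRC32 of an in-memory byte array.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // CRC32 of a whole file (fine for small text files; stream for large ones).
    static long crc32Of(Path file) throws IOException {
        return crc32(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("client-a", ".txt");
        Path b = Files.createTempFile("client-b", ".txt");
        Files.writeString(a, "shared configuration text");
        Files.writeString(b, "shared configuration text");
        // Identical contents yield identical checksums.
        System.out.println(crc32Of(a) == crc32Of(b)); // true
    }
}
```

In your setup, each client would compute this locally and send only the checksum to the server for comparison.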
You can easily compute the hash of a file using DigestUtils from Apache Commons Codec. It has convenient methods for computing hashes, be it MD5 or SHA-1. Then just compare the hashes of the files for each client.
Also, be aware that equal hashes don't guarantee with 100% certainty that the files are identical. It would be a very rare situation for the files to differ while their hashes are equal. However, depending on how critical this determination is in your app, you may have to compare the files byte by byte when the hashes are equal, to confirm for sure that they contain exactly the same data.
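DigestUtils is a thin wrapper around the JDK's java.security.MessageDigest, so the same idea can be sketched without the dependency (the class and method names here are my own, for illustration):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileHashes {

    // Hex-encoded digest of a byte array; algorithm is e.g. "MD5" or "SHA-1".
    static String hashHex(String algorithm, byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] clientA = "file contents".getBytes();
        byte[] clientB = "file contents".getBytes();
        // Equal contents give equal hashes; compare these strings on the server.
        System.out.println(hashHex("MD5", clientA).equals(hashHex("MD5", clientB))); // true
    }
}
```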
Is there a way to identify or inspect an AES encrypted file based on the file content (like the way a ZIP file can be identified by looking for letters "PK" at the beginning of the file)? Is there any magic number associated with AES encrypted files?
We have multiple files in the workflow repository that are either plain text (could be Excel, XML, JSON, text, etc.) or AES-256 encrypted, and we have no idea which ones are AES encrypted. I need to write Java code to identify the AES-encrypted files and decrypt them automatically. Thanks!
In the absence of any standard header, you could look at the byte frequency. AES encrypted data (or indeed anything encrypted with a decent algorithm) will appear to be a random sequence of bytes. This means that the distribution of byte values 0-255 will be approximately flat (i.e. all byte values are equally likely).
However, textual documents will mostly contain printable characters, some much more than others. Spaces, newlines, vowels, etc. will be disproportionately common.
So, you could build histograms of byte counts for your various files, and look for a simple way to classify them into encrypted or not-encrypted. For example, look at the ratio of the total count of the 5 least common byte values and the total count of the 5 most common byte values. I would expect this ratio to be close to 1.0 for an encrypted file, and quite far from 1.0 for a normal textual document (I'm sure there are much more sophisticated statistical metrics that could be used...).
This might not work so well for extremely short documents, of course.
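A minimal version of the histogram idea above (the 5-vs-5 ratio and the sample inputs are arbitrary illustrative choices, not a tuned classifier):

```java
import java.util.Arrays;
import java.util.Random;

public class EntropyCheck {

    // Ratio of the total count of the 5 rarest byte values to the total
    // count of the 5 most common ones: close to 1.0 suggests random-looking
    // (possibly encrypted or compressed) data; close to 0 suggests text.
    static double rareToCommonRatio(byte[] data) {
        long[] counts = new long[256];
        for (byte b : data) counts[b & 0xFF]++;
        long[] sorted = counts.clone();
        Arrays.sort(sorted);
        long rare = 0, common = 0;
        for (int i = 0; i < 5; i++) {
            rare += sorted[i];           // 5 least common byte values
            common += sorted[255 - i];   // 5 most common byte values
        }
        return common == 0 ? 0 : (double) rare / common;
    }

    public static void main(String[] args) {
        byte[] random = new byte[1 << 16];
        new Random(42).nextBytes(random); // stands in for encrypted data
        byte[] text = "The quick brown fox jumps over the lazy dog. ".repeat(1000).getBytes();
        System.out.printf("random: %.2f  text: %.2f%n",
                rareToCommonRatio(random), rareToCommonRatio(text));
    }
}
```

As noted above, short files make the histogram too sparse for this to be reliable.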
See also:
https://www.researchgate.net/post/How_to_detect_if_data_are_encrypted_or_not
AES is a block cipher. On its own, it can only transform a 128 bit value into another seemingly random 128 bit value. In order to encrypt more data, a mode of operation and possibly a padding scheme are added. If you want to go further like producing encrypted files, you really need to define a file format, because that's not provided by the previously mentioned mechanisms.
So, if you say you have an AES-encrypted file, it doesn't mean anything aside from your file being encrypted in some way.
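The distinction can be made concrete with the JDK's javax.crypto API. This is only a sketch: AES-128, CBC and PKCS5 padding are arbitrary choices here, and a real design would also define how the key, IV and ciphertext are stored in the file format:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.security.SecureRandom;

public class AesRoundTrip {

    static SecretKey newKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128); // AES-128 for this sketch
        return kg.generateKey();
    }

    // "AES/CBC/PKCS5Padding" = cipher / mode of operation / padding scheme.
    static byte[] encrypt(SecretKey key, byte[] iv, byte[] plaintext) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        return c.doFinal(plaintext);
    }

    static byte[] decrypt(SecretKey key, byte[] iv, byte[] ciphertext) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return c.doFinal(ciphertext);
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = newKey();
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        byte[] ct = encrypt(key, iv, "hello".getBytes());
        // Nothing in the ciphertext says "AES": it is 16 random-looking
        // bytes (5 bytes of input, padded up to one 16-byte block).
        System.out.println(ct.length); // 16
        System.out.println(new String(decrypt(key, iv, ct))); // hello
    }
}
```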
The result of modern encryption looks like random noise, so you can compare the hamming weight of an encrypted file to that of a non-compressed structured file. There will likely be differences as DNA mentioned. Compressed files also look like random noise, but they may contain biases which might be significant enough if the file is long enough.
There are some file formats that contain an identifier of how the data was encrypted. Most self-made formats don't have anything close to an identifier, because they are written for a specific application and the protocol or file format doesn't change that often. The developer settled on some "cipher suite" and never bothered to make it flexible. If you know the program that produced the files, then you can likely find out whether they are encrypted. If that program is open source, this is easy. If it is closed source, you can still reverse-engineer it.
I'm using metadata-extractor to write a Java application that organizes images and finds duplicates. The API is great, but there's something I cannot figure out.
Suppose I have two JPG images. Visually, these images are exactly the same (i.e. identical pixel for pixel). However, something within the metadata encapsulated in the files may differ.
If I calculate MD5 hashes on each complete file, I will get two different hashes. However, I want to calculate a hash of only the image/pixel data, which would yield the same hash for both files.
So - Is there a way to pull out the raw image/pixel data from the JPG using metadata-extractor so that I can calculate my hash on this?
Also, is Javadoc available for this API? I cannot seem to find it.
You can achieve this using the library's JpegSegmentReader class. It'll let you pull out the JPEG segments that contain image data and ignore metadata segments.
I discussed this technique in another answer and the asker indicated they had success with the approach.
This would actually make a nice sample application for the library. If you come up with something and feel like sharing, please do.
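I can't vouch for metadata-extractor's exact JpegSegmentReader calls from memory, so here is a dependency-free sketch of the same goal using only the JDK: decode the image (for a real file, via ImageIO.read) and digest only the pixel values, so metadata never enters the hash. This assumes both files decode to identical pixels, which holds when only the metadata differs:

```java
import java.awt.image.BufferedImage;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PixelHash {

    // Digest only the decoded pixel values; file metadata never enters the hash.
    static String pixelMd5(BufferedImage img) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (int y = 0; y < img.getHeight(); y++)
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                md.update(new byte[]{ (byte) (rgb >> 16), (byte) (rgb >> 8), (byte) rgb });
            }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Two in-memory images with identical pixels stand in for two JPGs
        // that differ only in metadata (load real files with ImageIO.read).
        BufferedImage a = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage b = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        a.setRGB(1, 1, 0xFF0000);
        b.setRGB(1, 1, 0xFF0000);
        System.out.println(pixelMd5(a).equals(pixelMd5(b))); // true
    }
}
```

The JpegSegmentReader approach from the answer above is cheaper, since it hashes the encoded image segments without decoding them.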
I am presently comparing a list of files using md5sum. How can I group similar kinds of files into a folder using these hash values? Will the hash difference between two similar files be small?
For example: I have a file which contains the word "HELLO" and another PDF file contains "hello"; these are more or less the same, so these files need to be grouped. Will my idea of finding the hash difference help?
Or is there any other idea? Please help me sort this out.
No. The hashes will be completely different and there will be no correlation. You can use hashes if you want to divide them uniformly into different buckets, but it doesn't work with grouping similar files.
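The "no correlation" point is easy to demonstrate: cryptographic hashes are designed so that even a tiny input change flips roughly half the output bits (the avalanche effect), so "HELLO" and "hello" give unrelated digests:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashDistance {

    static String md5Hex(String s) throws NoSuchAlgorithmException {
        byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes());
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(md5Hex("HELLO"));
        System.out.println(md5Hex("hello"));

        // Count how many of the 128 digest bits differ: for a good hash
        // this lands near 64, no matter how similar the inputs are.
        byte[] a = MessageDigest.getInstance("MD5").digest("HELLO".getBytes());
        byte[] b = MessageDigest.getInstance("MD5").digest("hello".getBytes());
        int diff = 0;
        for (int i = 0; i < a.length; i++) diff += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        System.out.println(diff + " of 128 bits differ");
    }
}
```

For grouping near-identical *content*, you would need a similarity measure on the text itself (or a locality-sensitive hash), not a cryptographic digest.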
WhatsApp creates duplicate copies of images upon sharing. Although the resolution of the images is the same, the MD5 checksum of the original image and its copy are different. Why is this? How do I get my app to recognize that this is a duplicate image?
I've tried MD5 and SHA-1; both algorithms generated different checksums for the two images.
Sounds like there's probably a difference in the metadata - e.g. the timestamp might have been changed by the WhatsApp servers when the copy was made.
I suggest you retrieve the pixel data for the images and run your checksums on that. You can use the Bitmap.getPixels() method. e.g.: myBitmap.getPixels(pixels, 0, myBitmap.getWidth(), 0, 0, myBitmap.getWidth(), myBitmap.getHeight());
Remember, just because the checksums are the same doesn't necessarily mean the images are! If your checksums match, you'll have to do an element-by-element comparison of the data to be 100% sure that the images are identical.
Edit:
There's a good example of how to do a pixel-by-pixel test for equality here. Note you can use the Bitmap.sameAs() method if you're using API 12+!
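Outside Android, the same getPixels/sameAs idea can be sketched with the JDK's BufferedImage. This is an illustrative analog, not the Android API itself; BufferedImage.getRGB plays the role of Bitmap.getPixels:

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;

public class SamePixels {

    // Element-by-element comparison of decoded pixels, analogous to
    // Android's Bitmap.sameAs() built on top of Bitmap.getPixels().
    static boolean samePixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) return false;
        int w = a.getWidth(), h = a.getHeight();
        int[] pa = a.getRGB(0, 0, w, h, null, 0, w); // bulk pixel extraction
        int[] pb = b.getRGB(0, 0, w, h, null, 0, w);
        return Arrays.equals(pa, pb);
    }

    public static void main(String[] args) {
        BufferedImage a = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        BufferedImage b = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        System.out.println(samePixels(a, b)); // true
        b.setRGB(0, 0, 0xFF0000);
        System.out.println(samePixels(a, b)); // false
    }
}
```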
Is there any suggestion (piece of code) that would help me understand how to compare two different image files to tell whether they are the same or not?
Thanks in advance.
EDIT:
I mean, for example, I could check the CRC32 (which I do not know how to do) together with a size check of the file. If both match, that would mean they are identical pictures...
EDIT2: When I say the images are the same, I mean the images look exactly the same to the user.
You can use CRC32 to checksum any file. However, if you want to find out whether two images are the same, you first have to decide whether two images that merely look the same count as the same. e.g. the following images all have different sizes, let alone different CRC32 values.
The checksum stored in a ZIP entry only tells you this much: when the checksums differ, the files are different.
The CRC32 class lets you calculate the checksum of a byte array yourself.
To efficiently check whether two images are almost equal, there are many approaches, such as scaling both down to a small 8x8 image and comparing the differences in colour values.
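A sketch of that 8x8 idea, often called an "average hash" (the 8x8 size and the simple grayscale average are conventional choices here, not the only ones):

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class AverageHash {

    // "Average hash": shrink to 8x8, convert to grayscale, then set one bit
    // per pixel depending on whether it is brighter than the mean. Visually
    // similar images produce hashes with a small Hamming distance.
    static long averageHash(BufferedImage src) {
        BufferedImage small = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = small.createGraphics();
        g.drawImage(src, 0, 0, 8, 8, null); // downscale to 8x8
        g.dispose();

        int[] gray = new int[64];
        long sum = 0;
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++) {
                int rgb = small.getRGB(x, y);
                int v = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
                gray[y * 8 + x] = v;
                sum += v;
            }
        long mean = sum / 64, hash = 0;
        for (int i = 0; i < 64; i++)
            if (gray[i] > mean) hash |= 1L << i; // one bit per cell
        return hash;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 64; y++)
            for (int x = 32; x < 64; x++)
                img.setRGB(x, y, 0xFFFFFF); // right half white, left half black
        System.out.println(Long.bitCount(averageHash(img))); // 32: the bright half
    }
}
```

Compare two hashes with Long.bitCount(h1 ^ h2); a small Hamming distance suggests visually similar images, while CRC32 equality only catches byte-identical files.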