Producing a checksum in Java

I am implementing code to produce a checksum from a string, and I would just like to know the following:
Why is the checksum produced directly from a string different from the checksum produced from a file containing the same string, where the string was manually copy-pasted into the file?
Edit: I'm not asking for the implementation. For those who may have encountered this, I'm asking why the two checksums are different.
Another example: why is the checksum produced from a file created by code different from the checksum produced from a file created manually, where the string was copy-pasted?
Yet when I compare the two strings with a tool like WinMerge, it shows them as identical.
Any enlightening answers are appreciated
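A likely cause is that the file contains bytes the editor does not show: a trailing newline, CRLF instead of LF line endings, or a UTF-8 BOM, so the bytes being hashed are not identical even though the visible text is. A minimal sketch to check this, assuming a hypothetical example.txt and SHA-256 (any digest algorithm shows the same effect):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class ChecksumDemo {
    static String sha256Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String text = "hello world";

        // Digest of the in-memory string; the charset chosen here matters.
        System.out.println(sha256Hex(text.getBytes(StandardCharsets.UTF_8)));

        // Digest of a file that "contains the same string". If the editor added
        // a trailing newline, used CRLF line endings, or wrote a UTF-8 BOM,
        // these bytes are not identical to the string's bytes, so the hash differs.
        byte[] fileBytes = Files.readAllBytes(Paths.get("example.txt"));
        System.out.println(sha256Hex(fileBytes));
    }
}
Printing the lengths of the two byte arrays on both sides usually reveals the extra newline or BOM immediately.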

Related

EBCDIC unpacking comp-3 data returns 40404** in Java

I have used the unpacking logic for Java provided in the link below:
How to unpack COMP-3 digits using Java?
But for null data in the source, the Java unpacking code returns 404040404. I understand that 0x40 is a space in EBCDIC, but how do I unpack the data while handling (or avoiding) these spaces?
There are two problems we have to deal with. First, is the data valid COMP-3 data, and second, is the data considered "valid" by older language implementations such as COBOL, since COMP-3 was mentioned?
If the offsets are not misaligned, it would appear that spaces are being interpreted by existing programs as 0 instead of spaces. This would be incorrect, but it could be an artifact of older programs that were engineered to tolerate this bad behaviour.
The approach I would take in a legacy shop (assuming no misalignment) is to treat "spaces" (sequences of 0x40 bytes, e.g. 0x404040404040) as zero. This would be a legacy check: compare the field with spaces and, if it matches, substitute 0x00000000000F as the actual value. This is something an individual shop would have to determine; it is not a generally recognized programming approach.
In terms of Java, one has to remember that bytes are "signed", so comparisons can be tricky depending on how the code is written. The only "unsigned" data type I recall in Java is char, which is really two bytes (a uint16).
This is less of a programming problem than it is recognizing historical tolerance and remediation.
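A minimal sketch of that legacy-style check, assuming the field arrives as a byte[]; the class and method names here are illustrative, not from the linked answer:
/** Treats an all-spaces (0x40) COMP-3 field as zero, otherwise unpacks it.
 *  Illustrative only -- each shop must decide whether spaces really mean zero. */
public final class Comp3 {
    private static final int EBCDIC_SPACE = 0x40;

    public static long unpackOrZero(byte[] field) {
        // Mask with 0xFF because Java bytes are signed.
        boolean allSpaces = true;
        for (byte b : field) {
            if ((b & 0xFF) != EBCDIC_SPACE) {
                allSpaces = false;
                break;
            }
        }
        if (allSpaces) {
            return 0L;  // legacy convention: space-filled field treated as zero
        }
        return unpack(field);
    }

    /** Standard COMP-3 unpacking: two digits per byte, low nibble of the last byte is the sign. */
    private static long unpack(byte[] field) {
        long value = 0;
        for (int i = 0; i < field.length; i++) {
            int hi = (field[i] & 0xF0) >>> 4;
            int lo = field[i] & 0x0F;
            value = value * 10 + hi;
            if (i < field.length - 1) {
                value = value * 10 + lo;
            } else if (lo == 0x0D) {
                // 0x0D means negative; 0x0C / 0x0F mean positive or unsigned.
                value = -value;
            }
        }
        return value;
    }
}
The & 0xFF mask is the usual way around Java's signed bytes when comparing against 0x40.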

Torch-rnn sample.lua binary file exporting

I've been experimenting with Torch for a while now, and I have written my own audio file format.
In it I want my data stored as bytes, so I use all 256 possible byte values in the file.
I put my file through preprocess.py with 'bytes' encoding and it threw no exceptions. The training goes well too, but when I generate sample data it is not really raw bytes: some characters are written out literally and some byte values appear in brackets.
[158][170][171][147][164][199][201][179][170][185][184][163][134][130][151][164][150][130]xnjlbUQcq]Vg|ysx{[130]|svzv[144][168][152][137]m[136][150][134][135][135][177][167][130][128][150][167][159][146][132][131][135]Wm{[155]}mqm[143]x[138]r[140][131][135]yv[135]}enj[138][145][141][140][150][128]mrj[132]vv[133][150][152][155][136][140][159][149][152][131]{[139]wmTPQ\bqveMYk[128]uvt[141][147][139][132][132][143][143][132][148][178][187][174][166][164][150]zt[137]xeo~xjt|x~zxx[130]tgp}[147][141][137][139]
How could I change sample.lua's output? I made a change, but I do not know Lua. This is what I wrote:
local sample = model:sample(opt)
local out = io.open(opt.output, "wb")
out:write(sample)
out:close()
instead of
local sample = model:sample(opt)
print(sample)
That produced the same output. What could I do to get it working?

In dalvik, what expression will generate instructions 'not-int' and 'const-string/jumbo'?

I am new to learning Dalvik, and I want to dump out every instruction in Dalvik.
But there are still 3 instructions I cannot generate, no matter how I write the code.
They are 'not-int', 'not-long', and 'const-string/jumbo'.
I wrote this to get 'not-int', but it failed:
int y = ~x;
Dalvik generated an 'xor x, -1' instead.
And I know 'const-string/jumbo' means that there are more than 65535 strings in the code and the index is 32-bit. But when I declared 70000 strings in the code, the compiler said the code was too long.
So the question is: how do I get 'not-int' and 'const-string/jumbo' in Dalvik from Java code?
const-string/jumbo is easy. As you noted, you just need to define more than 65535 strings, and reference one of the later ones. They don't all need to be in a single class file, just in the same DEX file.
Take a look at dalvik/tests/056-const-string-jumbo, in particular the "build" script that generates a Java source file with a large number of strings.
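If you don't have the AOSP tree handy, a rough sketch of such a generator might look like this; the class names, counts, and layout are my own guesses, not the actual build script. Splitting the literals across many classes keeps each method well under the 64K bytecode limit that caused the "code too long" error, while the combined DEX still ends up with more than 65535 distinct strings.
import java.io.IOException;
import java.io.PrintWriter;

/** Generates source files whose combined string pool exceeds 65535 entries,
 *  so dx has to emit const-string/jumbo for the later string indices.
 *  (Illustrative generator, not the actual dalvik/tests build script.) */
public class GenJumboStrings {
    public static void main(String[] args) throws IOException {
        final int classes = 80;           // 80 * 1000 = 80,000 distinct strings
        final int stringsPerClass = 1000; // keeps each method far below the 64K bytecode limit
        for (int c = 0; c < classes; c++) {
            try (PrintWriter out = new PrintWriter("Strings" + c + ".java")) {
                out.println("public class Strings" + c + " {");
                out.println("    public static String[] get() {");
                out.println("        return new String[] {");
                for (int s = 0; s < stringsPerClass; s++) {
                    // every literal must be unique to count as a separate DEX string entry
                    out.println("            \"string-" + c + "-" + s + "\",");
                }
                out.println("        };");
                out.println("    }");
                out.println("}");
            }
        }
    }
}
Compile the generated files together with a class that references them, run dx over the result, and dexdump should show const-string/jumbo for the higher string indices.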
As far as not-int and not-long go, I don't think they're ever generated. I ran dexdump -d across a pile of Android 4.4 APKs and didn't find a single instance of either.

Why are MD5 hash values different for two Excel files which appear the same?

I have two Excel files saved at different locations. One was downloaded directly from the browser and the other was downloaded using the Selenium driver. I manually checked both files and they are exactly the same, but the MD5 hash values generated for the two files are different. How do I fix this issue?
MD5 is a hashing function. People use hashing functions to verify the integrity of a file, stream, or other resource. When you verify the integrity of a file with a hashing function, you are verifying that the files are the same at the bit level.
The ramification of this is that when you are comparing files with integrity constraints at the bitwise level, a hashing function works perfectly.
However, given the nature of Excel spreadsheets, if so much as one bit is added, removed, or moved in the document, then the hash of that file will be completely different. (Not always, but don't worry about that.)
Since the driver for Excel is quite different from the driver that Selenium uses, especially given compression and other alterations/optimizations that may be made to the file by Selenium, then -- of course -- the hash is going to be different.
My recommendations:
Firstly: pull up the two files in a diff tool and find out what is different between them. It's almost (but not quite) axiomatic that if the hashes of two files are different, then those files are also different.
Secondly: write a driver that compares the information in those spreadsheets to verify the integrity of the document (and you can take hashes of that information), rather than verifying the files at the bitwise level; a sketch of this follows below.
I'd recommend exporting both as CSV and comparing the two line by line.
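For that second recommendation, here is one possible content-level comparison, assuming Apache POI is available on the classpath; the class name and the choice of comparing formatted cell values are my own, not the answer's:
import java.io.File;
import java.util.Objects;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class SheetContentCompare {
    /** Returns true if the two workbooks hold the same formatted cell values. */
    public static boolean sameContent(File a, File b) throws Exception {
        DataFormatter fmt = new DataFormatter();
        try (Workbook wa = WorkbookFactory.create(a);
             Workbook wb = WorkbookFactory.create(b)) {
            if (wa.getNumberOfSheets() != wb.getNumberOfSheets()) {
                return false;
            }
            for (int s = 0; s < wa.getNumberOfSheets(); s++) {
                Sheet sa = wa.getSheetAt(s);
                Sheet sb = wb.getSheetAt(s);
                if (sa.getLastRowNum() != sb.getLastRowNum()) {
                    return false;
                }
                for (int r = 0; r <= sa.getLastRowNum(); r++) {
                    Row ra = sa.getRow(r);
                    Row rb = sb.getRow(r);
                    int cols = Math.max(ra == null ? 0 : ra.getLastCellNum(),
                                        rb == null ? 0 : rb.getLastCellNum());
                    for (int c = 0; c < cols; c++) {
                        Cell ca = ra == null ? null : ra.getCell(c);
                        Cell cb = rb == null ? null : rb.getCell(c);
                        String va = ca == null ? "" : fmt.formatCellValue(ca);
                        String vb = cb == null ? "" : fmt.formatCellValue(cb);
                        if (!Objects.equals(va, vb)) {
                            return false;   // first differing cell value found
                        }
                    }
                }
            }
        }
        return true;
    }
}
If you want a hash of the content rather than a boolean, feed the formatted values into a MessageDigest instead of comparing them directly.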
The MD5 algorithm computes over the file entirely, including any metadata (file name, dates, etc.) that is stored inside the file, so two files can be identical in their "main content" but still differ in some bytes.
It can be hard to determine which part of the file is really relevant for an MD5 check.
Try this kind of tools if you are on Windows, and interested only on Excel files : http://www.formulasoft.com/download.html
Are you sure metadata is included in the hash? It would be wise to do some research on this.
If this were true you would never find matching hashes, because the likelihood of timestamps matching would be very low; you can also change the filename a thousand times and the hash will stay the same. Also, when an AV scans a file it changes the accessed-timestamp property, so your hashes would be changing constantly on a machine that is continually scanned by an AV if metadata were included in the hash.
Old post, new perspective:
TL;DR - The zip specification includes a timestamp. See: the wikipedia entry on zip. The following sequence will answer the question "Do my two spreadsheets actually contain the same data?"
unzip file1.xlsx -d dir1/
unzip file2.xlsx -d dir2/
diff -rq dir1/ dir2/
If the diff command at the end comes up empty, your spreadsheets are the same, despite the different hashes of the two different files.
The accepted answer from alvonellos is correct about hashing. MD5 hashing will, almost certainly, give you different results for files that differ in any way, even by a single bit. In the 10 years since the original question, MD5 has been deprecated in favor of more cryptographically secure hashes, but it is still generally fine for the OP's use case -- validating files on your local filesystem. Accidental collisions are roughly one in several hundred million, depending on the input, and importantly, files that are very similar but not identical are more likely to have different hashes. In other words, crafting two files that have the same hash is actually difficult to do and requires making very specific changes in many places throughout one of the files. If you don't trust MD5, you can use any flavor of SHA or another hashing algorithm and you'll get similar results.
Deep dive into .XLSX:
The .xlsx format is just a .zip archive under the hood. You can use the Linux unzip utility to decompress an .xlsx:
unzip file.xlsx -d dir/
Previous responses suggest calculating a diff on the two files, but have not described the best way to do this. Well, once you have used unzip on the .xlsx file, you will then have a directory structure with the "guts" of your spreadsheet:
dir/
    [Content_Types].xml
    _rels/
        .rels
    xl/
        workbook.xml
        worksheets/
            sheet1.xml
            sheet2.xml
            sheet3.xml
            . . .
Once you have done this to two different spreadsheets, say file1.xlsx expanded to dir1/ and file2.xlsx expanded to dir2/, you can do a recursive diff on the two directories:
diff -rq dir1/ dir2/ # <-- The -rq flags mean recursive, file-name-only
Note that, if what you really want to know is whether the two files have different content, then this command will answer that question. If there is no output from this command, then there is no difference in content between the directories, i.e. no difference between the two original spreadsheets' content.
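If you would rather do this comparison from Java instead of the shell (the original question was Java-based), a sketch using java.util.zip could look like the following; the class and method names are mine:
import java.io.InputStream;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class XlsxContentDiff {
    /** Returns true if every entry in the two .xlsx archives has identical uncompressed bytes. */
    public static boolean sameEntries(String pathA, String pathB) throws Exception {
        try (ZipFile zipA = new ZipFile(pathA); ZipFile zipB = new ZipFile(pathB)) {
            if (zipA.size() != zipB.size()) {
                return false;   // different number of internal files
            }
            Enumeration<? extends ZipEntry> entries = zipA.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entryA = entries.nextElement();
                ZipEntry entryB = zipB.getEntry(entryA.getName());
                if (entryB == null) {
                    return false;   // file present in one archive but not the other
                }
                try (InputStream a = zipA.getInputStream(entryA);
                     InputStream b = zipB.getInputStream(entryB)) {
                    if (!Arrays.equals(a.readAllBytes(), b.readAllBytes())) {
                        return false;   // same internal file name, different content
                    }
                }
            }
        }
        return true;
    }
}
Note that docProps/core.xml inside the archive carries created/modified timestamps, so that one entry can still differ even when the sheet data matches.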
If you are curious about the differences in the .xlsx files themselves, you can dig into the bits of the headers with the linux xxd utility:
xxd file1.xlsx | head -n1 # <-- look at the first line (16 bytes)
00000000: 504b 0304 1400 0000 0800 acab a354 7d3d PK...........T}=
xxd file2.xlsx | head -n1
00000000: 504b 0304 1400 0000 0800 66ac a354 7d3d PK........f..T}=
The time-stamp shows up in the sixth two-byte group (in this example, acab and 66ac respectively). The date is in the seventh group (in this example, a354 for both).
Keep in mind that a .XLSX file is just a .ZIP file with a set of directories and files, following Microsoft's standard, zipped up inside of it. Each of the contained files will have its own CRC-32 hash.
The eighth and ninth two-byte groups contain the CRC-32 hash that was generated by whatever zip utility created the file. So, if you have the xxd utility handy, you can skip all the unzipping steps mentioned above and simply do:
xxd -s 14 -l 4 file1.xlsx
xxd -s 14 -l 4 file2.xlsx
With output that will look something like this:
xxd -s 14 -l 4 file1.xlsx
0000000e: 7d3d 6d31 }=m1
xxd -s 14 -l 4 file2.xlsx
0000000e: 7d3d 6d31 }=m1
thus confirming that the two internal files have the same hashes (regardless of timestamp). This is a handy check for the very first file contained within the .ZIP (i.e. the .XLSX).
For an exhaustive view of the CRC-32 hashes of all of the files contained within the .XLSX archive, you can use the following algorithm (pseudocode):
bytes := "504b0102" as binary            // central-directory record signature, as it appears on disk
chunks_array := file_content split on bytes
crc32_hashes := []
for each chunk in chunks_array starting at 1:    // skip index 0, everything before the first record
    append substring(24, 8) of hex(chunk) to crc32_hashes
    // hex offset 24 = byte 12 into the chunk; add the 4 signature bytes and that is offset 16
    // of the central-directory record, which is where its CRC-32 field lives
The magic number 504b0102 at the top is the separator for the file summaries (the central directory records) at the end of the .ZIP file, with the bytes flipped for endian-ness relative to the 0x02014b50 in the spec.
The resulting crc32_hashes array contains the CRC-32 hash of each of the files contained therein. Again, because of internal timestamp mechanisms and other implementation-specific metadata in the internal XMLs, the human-readable parts of a spreadsheet could be identical while the CRC-32 hashes differ.
Nevertheless, this is an inexpensive way to get a "fingerprint" of two Excel files to find out whether they are in fact two copies of the exact same .XLSX. It's just string manipulation, which is much less time- and processor-intensive than re-hashing; it relies on the hashing that was already done at the moment the .XLSX file was created.
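An equivalent fingerprint can be read in Java through java.util.zip, which exposes the stored CRC-32 values without any hex slicing; the class name below is mine:
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class XlsxFingerprint {
    /** Collects "entryName=crc32" strings for every file stored inside the .xlsx archive. */
    public static List<String> crcFingerprint(String xlsxPath) throws Exception {
        List<String> fingerprint = new ArrayList<>();
        try (ZipFile zip = new ZipFile(xlsxPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // getCrc() returns the CRC-32 recorded in the archive; no re-hashing is done here.
                fingerprint.add(entry.getName() + "=" + Long.toHexString(entry.getCrc()));
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) throws Exception {
        // Two spreadsheets contain the same payload if their per-entry CRC lists match,
        // even when the archives' own timestamps make the whole-file MD5s differ.
        System.out.println(crcFingerprint("file1.xlsx").equals(crcFingerprint("file2.xlsx")));
    }
}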

Magic number file checking

I'm attempting to read magic numbers/bytes to check the format of a file. Will reading a file byte by byte work in the same way on a Linux machine?
Edit: The following link shows how to get the magic bytes from a class file using an int. I'm trying to do the same for a variable number of bytes.
http://www.rgagnon.com/javadetails/java-0544.html
I'm not sure that I understand what you are trying to do, but it sounds like it isn't the same thing as what the code you are linking to is doing.
The Java class file format is specified to start with a magic number, so that code can only be used to check whether a file might be a Java class or not. You can't take the same logic and apply it to arbitrary file formats.
Edit: ...or do you only want to check for WAV files?
Edit 2: Everything in Java is big-endian, which means you can use DataInputStream.readInt to read the first four bytes from the file and then compare the returned int with 0x52494646 ("RIFF" as a big-endian integer).
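A minimal sketch of that check, extended to a variable number of magic bytes as the question asks; the RIFF constant follows the answer, while the file name and helper name are only examples:
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

public class MagicNumberCheck {
    /** Reads the first magic.length bytes of the file and compares them to the expected magic. */
    public static boolean startsWith(String path, byte[] magic) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            byte[] header = new byte[magic.length];
            in.readFully(header);               // throws EOFException if the file is too short
            return Arrays.equals(header, magic);
        }
    }

    public static void main(String[] args) throws IOException {
        // "RIFF" = 0x52 0x49 0x46 0x46; a WAV file is a RIFF container.
        byte[] riff = {0x52, 0x49, 0x46, 0x46};
        System.out.println(startsWith("example.wav", riff));

        // The single-int variant from the answer, for exactly four magic bytes:
        try (DataInputStream in = new DataInputStream(new FileInputStream("example.wav"))) {
            System.out.println(in.readInt() == 0x52494646);
        }
    }
}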
