I'm trying to replicate the behavior of a Python 2.7 function in Java, but I'm getting different results when running a (seemingly) identical sequence of bytes through a SHA-256 hash. The bytes are generated by manipulating a very large integer (exactly 2048 bits long) in a specific way (2nd line of my Python code example).
For my examples, the original 2048-bit integer is stored as big_int and bigInt in Python and Java respectively, and both variables contain the same number.
Python2 code I'm trying to replicate:
import hashlib
import struct
from pprint import pprint

raw_big_int = ("%x" % big_int).decode("hex")
buff = struct.pack(">i", len(raw_big_int) + 1) + "\x00" + raw_big_int
pprint("Buffer contains: " + buff)
pprint("Encoded: " + buff.encode("hex").upper())
digest = hashlib.sha256(buff).digest()
pprint("Digest contains: " + digest)
pprint("Encoded: " + digest.encode("hex").upper())
Running this code prints the following (note that the only result I'm actually interested in is the last one - the hex-encoded digest. The other 3 prints are just to see what's going on under the hood):
'Buffer contains: \x00\x00\x01\x01\x00\xe3\xbb\xd3\x84\x94P\xff\x9c\'\xd0P\xf2\xf0s,a^\xf0i\xac~\xeb\xb9_\xb0m\xa2&f\x8d~W\xa0\xb3\xcd\xf9\xf0\xa8\xa2\x8f\x85\x02\xd4&\x7f\xfc\xe8\xd0\xf2\xe2y"\xd0\x84ck\xc2\x18\xad\xf6\x81\xb1\xb0q\x19\xabd\x1b>\xc8$g\xd7\xd2g\xe01\xd4r\xa3\x86"+N\\\x8c\n\xb7q\x1c \x0c\xa8\xbcW\x9bt\xb0\xae\xff\xc3\x8aG\x80\xb6\x9a}\xd9*\x9f\x10\x14\x14\xcc\xc0\xb6\xa9\x18*\x01/eC\x0eQ\x1b]\n\xc2\x1f\x9e\xb6\x8d\xbfb\xc7\xce\x0c\xa1\xa3\x82\x98H\x85\xa1\\\xb2\xf1\'\xafmX|\x82\xe7%\x8f\x0eT\xaa\xe4\x04*\x91\xd9\xf4e\xf7\x8c\xd6\xe5\x84\xa8\x01*\x86\x1cx\x8c\xf0d\x9cOs\xebh\xbc1\xd6\'\xb1\xb0\xcfy\xd7(\x8b\xeaIf6\xb4\xb7p\xcdgc\xca\xbb\x94\x01\xb5&\xd7M\xf9\x9co\xf3\x10\x87U\xc3jB3?vv\xc4JY\xc9>\xa3cec\x01\x86\xe9c\x81F-\x1d\x0f\xdd\xbf\xe8\xe9k\xbd\xe7c5'
'Encoded: 0000010100E3BBD3849450FF9C27D050F2F0732C615EF069AC7EEBB95FB06DA226668D7E57A0B3CDF9F0A8A28F8502D4267FFCE8D0F2E27922D084636BC218ADF681B1B07119AB641B3EC82467D7D267E031D472A386222B4E5C8C0AB7711C200CA8BC579B74B0AEFFC38A4780B69A7DD92A9F101414CCC0B6A9182A012F65430E511B5D0AC21F9EB68DBF62C7CE0CA1A382984885A15CB2F127AF6D587C82E7258F0E54AAE4042A91D9F465F78CD6E584A8012A861C788CF0649C4F73EB68BC31D627B1B0CF79D7288BEA496636B4B770CD6763CABB9401B526D74DF99C6FF3108755C36A42333F7676C44A59C93EA36365630186E96381462D1D0FDDBFE8E96BBDE76335'
'Digest contains: Q\xf9\xb9\xaf\xe1\xbey\xdc\xfa\xc4.\xa9 \xfckz\xfeB\xa0>\xb3\xd6\xd0*S\xff\xe1\xe5*\xf0\xa3i'
'Encoded: 51F9B9AFE1BE79DCFAC42EA920FC6B7AFE42A03EB3D6D02A53FFE1E52AF0A369'
Now, below is my Java code so far. When I test it, I get the same value for the input buffer, but a different value for the digest. (bigInt contains a BigInteger object containing the same number as big_int in the Python example above)
byte[] rawBigInt = bigInt.toByteArray();
ByteBuffer buff = ByteBuffer.allocate(rawBigInt.length + 4);
buff.order(ByteOrder.BIG_ENDIAN);
buff.putInt(rawBigInt.length).put(rawBigInt);
System.out.print("Buffer contains: ");
System.out.println( DatatypeConverter.printHexBinary(buff.array()) );
MessageDigest hash = MessageDigest.getInstance("SHA-256");
hash.update(buff);
byte[] digest = hash.digest();
System.out.print("Digest contains: ");
System.out.println( DatatypeConverter.printHexBinary(digest) );
Notice that in my Python example, I started the buffer off with len(raw_big_int) + 1 packed, where in Java I started with just rawBigInt.length. I also omitted the extra 0-byte ("\x00") when writing in Java. I did both of these for the same reason - in my tests, calling toByteArray() on a BigInteger returned a byte array already beginning with a 0-byte that was exactly 1 byte longer than Python's byte sequence. So, at least in my tests, len(raw_big_int) + 1 equaled rawBigInt.length, since rawBigInt began with a 0-byte and raw_big_int did not.
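For illustration, here is a toy example of that sign-byte behavior (the value is made up, not my actual number):

import java.math.BigInteger;

BigInteger n = new BigInteger("ff", 16); // 255: the high bit of the top byte is set
byte[] bytes = n.toByteArray();
// bytes is {0x00, (byte) 0xff}: toByteArray() prepends a zero byte so the
// two's-complement value stays positive, the same 0-byte Python adds explicitly with "\x00"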
Alright, that aside, here is the Java code's output:
Buffer contains: 0000010100E3BBD3849450FF9C27D050F2F0732C615EF069AC7EEBB95FB06DA226668D7E57A0B3CDF9F0A8A28F8502D4267FFCE8D0F2E27922D084636BC218ADF681B1B07119AB641B3EC82467D7D267E031D472A386222B4E5C8C0AB7711C200CA8BC579B74B0AEFFC38A4780B69A7DD92A9F101414CCC0B6A9182A012F65430E511B5D0AC21F9EB68DBF62C7CE0CA1A382984885A15CB2F127AF6D587C82E7258F0E54AAE4042A91D9F465F78CD6E584A8012A861C788CF0649C4F73EB68BC31D627B1B0CF79D7288BEA496636B4B770CD6763CABB9401B526D74DF99C6FF3108755C36A42333F7676C44A59C93EA36365630186E96381462D1D0FDDBFE8E96BBDE76335
Digest contains: E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
As you can see, the buffer contents appear the same in both Python and Java, but the digests are obviously different. Can someone point out where I'm going wrong?
I suspect it has something to do with the strange way Python seems to store bytes - the variables raw_big_int and buff show as type str in the interpreter, and when printed out by themselves have that strange format with the '\x's that is almost the same as the bytes themselves in some places, but is utter gibberish in others. I don't have enough Python experience to understand exactly what's going on here, and my searches have turned up fruitless.
Also, since I'm trying to port the Python code into Java, I can't just change the Python - my goal is to write Java code that takes the same input and produces the same output. I've searched around (this question in particular seemed related) but didn't find anything to help me out. Thanks in advance, if for nothing else than for reading this long-winded question! :)
In Java, you've got the data in the buffer, but the cursor positions are all wrong. After you've written your data to the ByteBuffer it looks like this, where the x's represent your data and the 0's are unwritten bytes in the buffer:
xxxxxxxxxxxxxxxxxxxx00000000000000000000000000000000000000000
                    ^ position                               ^ limit
The cursor is positioned after the data you've written. A read at this point will read from position to limit, which is the bytes you haven't written.
Instead, you want this:
xxxxxxxxxxxxxxxxxxxx00000000000000000000000000000000000000000
^ position          ^ limit
where the position is 0 and the limit is the number of bytes you've written. To get there, call flip(). Flipping a buffer conceptually switches it from write mode to read mode. I say "conceptually" because ByteBuffers don't have explicit read and write modes, but you should think of them as if they do.
(The opposite operation is compact(), which goes back to write mode.)
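Applied to the question's code, a minimal sketch of the fix (only the flip() call is new; everything else is from the question):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.security.MessageDigest;

byte[] rawBigInt = bigInt.toByteArray();
ByteBuffer buff = ByteBuffer.allocate(rawBigInt.length + 4);
buff.order(ByteOrder.BIG_ENDIAN);
buff.putInt(rawBigInt.length).put(rawBigInt);
buff.flip(); // position -> 0, limit -> end of the written data

MessageDigest hash = MessageDigest.getInstance("SHA-256");
hash.update(buff); // now consumes the bytes you wrote, not the empty tail
byte[] digest = hash.digest();

Alternatively, since the buffer here is allocated to exactly the size of the data, you could skip the cursor bookkeeping and call hash.update(buff.array()) instead.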
I'm looking to read an InputStream in sections, because I need the first n bytes of the file and the last m bytes, as well as the contents in between.
byte[] beginning = inputStream.readNBytes(16);
This works just fine, but to get the last m bytes, I tried the following:
byte[] middle = inputStream.readNBytes(inputStream.available() - 32);
byte[] end = inputStream.readNBytes(inputStream.available());
The end variable looks how I expect it to but not the middle variable, which ends up cutting out part of the stream.
I'm also a bit confused why the buf parameter size in the input stream doesn't seem to be equal to the byte array size when converting one to the other.
Anyway, I assume this isn't working how I want it to because (inputStream.available() - 32) is not adding up to a value compatible with readNBytes, so part of the stream is lost.
Is there a way to go about doing this?
EDIT: What I ended up doing, which seemed to (mostly) work, was to prepend both pieces I will later be extracting when creating the file, instead of prepending one and appending the other. That way I can just call inputStream.readAllBytes() on the last piece.
I also had to change where I'm writing to the file. I was writing to a CipherOutputStream when I should have been writing to the FileOutputStream and using that to create the CipherOutputStream.
Even after doing this I still have an extra 16 bytes at the end of the file, which confuses me, but I can easily ignore that last bit if I can't figure out why it's doing that.
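For reference, a sketch that sidesteps available() entirely by reading the whole stream up front (this assumes the data fits comfortably in memory; the method name is mine):

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

static byte[][] readSections(InputStream in, int n, int m) throws IOException {
    // available() only reports what can be read without blocking, not the total
    // remaining, so slicing a fully-read array is more reliable
    byte[] all = in.readAllBytes();
    byte[] beginning = Arrays.copyOfRange(all, 0, n);
    byte[] middle = Arrays.copyOfRange(all, n, all.length - m);
    byte[] end = Arrays.copyOfRange(all, all.length - m, all.length);
    return new byte[][] { beginning, middle, end };
}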
Story
While conducting an experiment I was saving a stream of random bytes generated by a hardware RNG device. After the experiment finished, I realized that the saving method was incorrect. I hope I can find a way to fix the corrupted file so that I get the correct stream of random numbers back.
Example
The story of the problem can be explained in the following simple example.
Let's say I have a stream of random numbers in an input file randomInput.bin. I will simulate the stream of random numbers coming from the hardware RNG device by sending the input file to stdout via cat. I found two ways to save this stream to a file:
A) Harmless saving method
This method gives me back exactly the original stream of random bytes.
import scala.sys.process._
import java.io.File
val res = ("cat randomInput.bin" #> new File(outputFile))!
B) Saving method leading to corruption
Unfortunately, this is the original saving method I chose.
import scala.sys.process._
import java.io.PrintWriter
val randomBits = "cat randomInput.bin".!!
val out = new PrintWriter(outputFile)
out.println(randomBits)
if (out != null) {
  out.close()
  Seq("chmod", "600", outputFile).!
}
The file saved using method B) is still binary; however, it is approximately 2x larger than the file saved by method A). Further analysis shows that the stream of random bits is significantly less random.
Summary
I suspect that saving method B) adds something to almost every byte; however, understanding exactly what is beyond my expertise in Java/Scala I/O.
I would very much appreciate it if somebody explained the low-level difference between methods A) and B). The goal is to revert the changes introduced by saving method B) and recover the original stream of random bytes.
Thank you very much in advance!
The problem is probably that println is meant for text: the process output was first decoded from bytes into a String using the platform's default character encoding, then re-encoded when written out. Byte sequences that aren't valid in that encoding get replaced or expanded into multiple bytes along the way, and println also appends a line separator.
If the file is exactly 2x larger than it should be, then you've probably got a null byte every other byte, which could be easy to fix. Otherwise, it may be harder to figure out what you would need to do to recover the binary data. Viewing the corrupted file in a hex editor may help you see what happened. Either way, I think it may be easier to just generate new random data and save it correctly.
Especially if this is for an experiment, if your random data has been corrupted and then fixed, it may be harder to justify that the data is truly random compared to just generating it properly in the first place.
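For contrast with method B), here is a minimal byte-safe save in plain Java (a sketch; the file names are placeholders). The raw bytes are copied directly, so no charset decoding can touch them:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

Process p = new ProcessBuilder("cat", "randomInput.bin").start();
try (InputStream in = p.getInputStream()) {
    // copy bytes as-is: no Reader/Writer, no String, no println
    Files.copy(in, Path.of("randomOutput.bin"), StandardCopyOption.REPLACE_EXISTING);
}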
We are building a Python 3 program which calls a Java program. The Java program (which is a 3rd party program we cannot modify) is used to tokenize strings (find the words) and provide other annotations. Those annotations are in the form of character offsets.
As an example, we might provide the program with string data such as "lovely weather today". It provides something like the following output:
0,6
7,14
15,20
Where 0,6 are the offsets corresponding to word "lovely", 7,14 are the offsets corresponding to the word "weather" and 15,20 are offsets corresponding to the word "today" within the source string. We read these offsets in Python to extract the text at those points and perform further processing.
All is well and good as long as the characters are within the Basic Multilingual Plane (BMP). However, when they are not, the offsets reported by this Java program show up all wrong on the Python side.
For example, given the string "I feel 🙂 today", the Java program will output:
0,1
2,6
7,9
10,15
On the Python side, these translate to:
0,1 "I"
2,6 "feel"
7,9 "🙂 "
10,15 "oday"
Where the last index is technically invalid. Java sees "🙂" as length 2, which causes all the annotations after that point to be off by one from the Python program's perspective.
Presumably this occurs because Java encodes strings internally in a UTF-16-like way, and all string operations act on those UTF-16 code units. Python strings, on the other hand, appear to operate on the actual Unicode characters (code points). So when a character shows up outside the BMP, the Java program sees it as length 2, whereas Python sees it as length 1.
So now the question is: what is the best way to "correct" those offsets before Python uses them, so that the annotation substrings are consistent with what the Java program intended to output?
You could convert the string to a bytearray in UTF16 encoding, then use the offsets (multiplied by 2 since there are two bytes per UTF-16 code-unit) to index that array:
x = "I feel 🙂 today"
y = bytearray(x, "UTF-16LE")
offsets = [(0,1),(2,6),(7,9),(10,15)]
for word in offsets:
    print(str(y[word[0]*2:word[1]*2], 'UTF-16LE'))
Output:
I
feel
🙂
today
Alternatively, you could convert every python character in the string individually to UTF-16 and count the number of code-units it takes. This lets you map the indices in terms of code-units (from Java) to indices in terms of Python characters:
from itertools import accumulate
x = "I feel 🙂 today"
utf16offsets = [(0,1),(2,6),(7,9),(10,15)] # from java program
# map python string indices to an index in terms of utf-16 code units
chrLengths = [len(bytearray(ch, "UTF-16LE"))//2 for ch in x]
utf16indices = [0] + list(accumulate(chrLengths))
# reverse the map so that it maps utf16 indices to python indices
index_map = dict((u, i) for i, u in enumerate(utf16indices))
# convert the offsets from utf16 code-unit indices to python string indices
offsets = [(index_map[o[0]], index_map[o[1]]) for o in utf16offsets]
# now you can just use those indices as normal
for word in offsets:
    print(x[word[0]:word[1]])
Output:
I
feel
🙂
today
The above code is messy and can probably be made clearer, but you get the idea.
This solves the problem given the proper encoding, which, in our situation, appears to be 'UTF-16BE':
def correct_offsets(input, offsets, encoding):
    offset_list = [{'old': o, 'new': [o[0], o[1]]} for o in offsets]
    utf16_idx = 0  # current position in UTF-16 code units, the same units as the Java offsets
    for idx in range(0, len(input)):
        width = len(input[idx].encode(encoding)) // 2  # code units taken by this character
        if width > 1:
            for o in offset_list:
                # compare in UTF-16 units, since the old offsets are UTF-16 offsets
                if o['old'][0] > utf16_idx:
                    o['new'][0] -= width - 1
                if o['old'][1] > utf16_idx:
                    o['new'][1] -= width - 1
        utf16_idx += width
    return [o['new'] for o in offset_list]
This may be pretty inefficient though. I gladly welcome any performance improvements.
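If a thin Java wrapper around the third-party tokenizer is an option, the same correction can also be done on the Java side before the offsets ever reach Python, using String.codePointCount (a sketch; the class and method names are mine):

public class OffsetFixer {
    /** Convert a UTF-16 code-unit offset into a Unicode code-point offset. */
    static int toCodePointOffset(String text, int utf16Offset) {
        return text.codePointCount(0, utf16Offset);
    }

    public static void main(String[] args) {
        String text = "I feel \ud83d\ude42 today"; // "I feel 🙂 today"
        // the Java-reported offsets 10,15 for "today" map back to 9,14
        System.out.println(toCodePointOffset(text, 10) + "," + toCodePointOffset(text, 15));
    }
}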
I am creating an easy to use server-client model with an extensible protocol, where the server is in Java and clients can be Java, C#, what-have-you.
I ran into this issue: Java data streams write strings with a short designating the length, followed by the data.
C# lets me specify the encoding I want, but it only reads one byte for the length. (actually, it says '7 bits at a time'...this is odd. This might be part of my problem?)
Here is my setup: The server sends a string to the client once it connects. It's a short string, so the first byte is 0 and the second byte is 9; the string is 9 bytes long.
//...
_socket.Connect(host, port);
var stream = new NetworkStream(_socket);
_in = new BinaryReader(stream, Encoding.UTF8);
Console.WriteLine(_in.ReadString()); //outputs nothing
Reading a single byte before reading the string of course outputs the expected string. But, how can I set up my stream reader to read a string using two bytes as the length, not one? Do I need to subclass BinaryReader and override ReadString()?
The C# BinaryWriter/BinaryReader behavior uses, if I recall correctly, the high bit of each length byte to signal whether another byte of the count follows. This allows counts up to 127 to fit in a single byte while still allowing actual count values much larger (i.e. up to 2^31-1); it's a bit like UTF-8 in that respect.
For your own purposes, note that you are writing the whole protocol (presumably), so you have complete control over both ends. Both behaviors you describe, in C# and Java, are implemented by what are essentially helper classes in each language. There's nothing saying that you have to use them, and both languages offer a way to simply encode text directly into an array of bytes which you can send however you like.
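For example, on the Java side the same two-byte framing can be produced explicitly rather than through writeUTF (a sketch; note it uses standard UTF-8 where writeUTF uses modified UTF-8, and writeLengthPrefixed is my own name):

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

static void writeLengthPrefixed(OutputStream out, String text) throws IOException {
    byte[] data = text.getBytes(StandardCharsets.UTF_8);
    DataOutputStream dataOut = new DataOutputStream(out);
    dataOut.writeShort(data.length); // two bytes, big-endian
    dataOut.write(data);
    dataOut.flush();
}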
If you do want to stick with the Java-based protocol, you can use BitConverter to convert between a short to a byte[] so that you can send and receive those two bytes explicitly. For example:
_in = new BinaryReader(stream, Encoding.UTF8);
byte[] header = _in.ReadBytes(2);
if (BitConverter.IsLittleEndian)
    Array.Reverse(header); // Java writes the length big-endian; BitConverter uses the platform's byte order
short count = BitConverter.ToInt16(header, 0);
byte[] data = _in.ReadBytes(count);
string text = Encoding.UTF8.GetString(data);
Console.WriteLine(text); // outputs something
I've run into a bit of a problem when it comes to writing specific bits to a file. I apologise if this is a duplicate of anything but I could not find a reasonable answer with the searches I ran.
I have a number of difficulties with the following:
Writing a header (Long) bit by bit (converted to a byte array so the FileOutputStream can utilise it) to the file.
Writing single bits to the file. For example, at one stage I am required to write a single bit set to 0 to the file, so my initial thought was to use a BitSet, but Java seems to treat this as a null?
BitSet initialPadding = new BitSet();
initialPadding.set(0, false);
fileOutputStream.write(initialPadding.toByteArray());
1)
I create a FileOutputStream as shown below with the necessary file name:
FileOutputStream fileOutputStream = new FileOutputStream(file.getAbsolutePath());
I am attempting to create an ".amr" file so the first step before I perform any bit manipulation is to write a header to the beginning of the file. This has the following value:
Long defaultHeader = 0x2321414d520aL;
I've tried writing this to the file using the following method but I am pretty sure it does not write the correct result:
fileOutputStream.write(defaultHeader.byteValue());
Am I using the correct streams? Are my conversions completely wrong?
2)
I have a public BitSet fileBitSet; which has bits read in from a ".raw" file as input. I need to be able to extract certain bits from the BitSet in order to write them to the file later. I do this using the following method:
public int getOctetPayloadHeader(int startPoint) {
    int readLength = 0;
    octetCMR = fileBitSet.get(0, 3);
    octetRES = fileBitSet.get(4, 7);
    if (octetRES.get(0, 3).isEmpty()) {
        /* Keep constructing the payload header. */
        octetFBit = fileBitSet.get(8, 8);
        octetMode = fileBitSet.get(9, 12);
        octetQuality = fileBitSet.get(13, 13);
        octetPadding = fileBitSet.get(14, 15);
        ... }
What would be the best way to write these bits to a file, bearing in mind that I may sometimes be required to write a single bit, or 81 bits, at a particular offset in the fileBitSet?
There is only one thing you can write to an OutputStream: bytes. You have to do the composing of your bits into bytes yourself; only you know the rules how the bits are to be put together into bytes.
As for stuff like:
Long defaultHeader = 0x2321414d520aL;
fileOutputStream.write(defaultHeader.byteValue());
You should take a close look at the javadocs for the methods you are using. byteValue() returns a single byte, so of course it's not doing what you expect. Working with streams is well explained in Oracle's tutorials: http://docs.oracle.com/javase/tutorial/essential/io/streams.html
For writing single bits or groups of bits, you will need a custom OutputStream that handles grouping the bits into bytes to be written. That's commonly called a BitStream (there is no such class in the JDK); you have to either write it yourself (which I highly recommend, it's a very good exercise to teach you about bits and bytes) or find one on the web.
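To illustrate, here is a minimal sketch of such a bit stream (the class name and API are my own, not anything standard); it accumulates bits most-significant-first and writes each completed byte to the underlying stream:

import java.io.IOException;
import java.io.OutputStream;

public class BitOutputStream implements AutoCloseable {
    private final OutputStream out;
    private int buffer = 0; // bits accumulated so far
    private int count = 0;  // how many bits are in the buffer

    public BitOutputStream(OutputStream out) {
        this.out = out;
    }

    /** Write a single bit (0 or 1). */
    public void writeBit(int bit) throws IOException {
        buffer = (buffer << 1) | (bit & 1);
        if (++count == 8) {
            out.write(buffer);
            buffer = 0;
            count = 0;
        }
    }

    /** Write the low numBits bits of value, most significant first. */
    public void writeBits(long value, int numBits) throws IOException {
        for (int i = numBits - 1; i >= 0; i--) {
            writeBit((int) (value >>> i));
        }
    }

    /** Pad the final partial byte with 0 bits, then close the stream. */
    @Override
    public void close() throws IOException {
        while (count != 0) {
            writeBit(0);
        }
        out.close();
    }
}

With something like this, the header from part 1) could be written as bits.writeBits(0x2321414d520aL, 48), and the single 0 bit from part 2) as bits.writeBit(0).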