I have a an array of bytes I'm calculating a checksum for in Java. And I'm trying to convert it to Kotlin. But the problem is that I am getting different values when calculating the -128 & 0xff in Java than it's equivalent in Kotlin. When passing in the -128, when I make the calculation in Java, it gives me a positive 128, but when I run it in Kotlin, it gives me a -128.
public class Bytes {
public static byte[] getByteArray() {
return new byte [] {-128};
}
public static int getJavaChecksum() {
int checksum = 0;
for (Byte b : getByteArray()) {
checksum += (b & 0xff);
}
return checksum;
}
}
This is my Kotlin code. I'm calling into the above bytes class to get the "byte array" I'm working with. So both pieces are running on the same input.
fun getKotlinChecksum(array: ByteArray): Byte {
var checksum = 0
for (b in array) {
checksum += (b and 0xFF.toByte())
}
return checksum.toByte()
}
fun main(args: Array<String>) {
println(Bytes.getJavaChecksum())
print(getKotlinChecksum(Bytes.getByteArray()))
}
The Java code is from a legacy code base that uses I2C to send over these bytes to a microcontroller. Is the Java just wrong? Or is there a way that I can get the 128 in the Kotlin code?
Returning a Byte means the function cannot return 128 no matter what, so that needs to be Int. The Java checksum does a "full" sum, not just the lowest byte of the sum. Similarly the bitwise AND cannot be done on bytes, that is useless, the sign extension (the AND with 0xFF is to remove the extended sign bits) would still happen. Then it could be written as a loop, Java style, but in Kotlin it probably makes more sense to write something like this:
fun getKotlinChecksum(array: ByteArray): Int {
return array.map({ it.toInt() and 0xFF }).sum()
}
I've been writing a port of a networking library from Java and this is the last line of code I have yet to decipher and move on over. The line of code is as follows:
Float.floatToIntBits(Float);
Which returns an integer.
The code of floatToIntBits in Java
public static int floatToIntBits(float value) {
int result = floatToRawIntBits(value);
// Check for NaN based on values of bit fields, maximum
// exponent and nonzero significand.
if ( ((result & FloatConsts.EXP_BIT_MASK) ==
FloatConsts.EXP_BIT_MASK) &&
(result & FloatConsts.SIGNIF_BIT_MASK) != 0)
result = 0x7fc00000;
return result;
}
I'm not nearly experienced enough with memory and hex values to port this over myself, not to mention the bit shifting that's all over the place that's been driving me absolutely mad.
Take a look at the BitConverter class. For doubles it has methods DoubleToInt64Bits and Int64BitsToDouble. For floats you could do something like this:
float f = ...;
int i = BitConverter.ToInt32(BitConverter.GetBytes(f), 0);
Or changing endianness:
byte[] bytes = BitConverter.GetBytes(f);
Array.Reverse(bytes);
int i = BitConverter.ToInt32(bytes, 0);
If you can compile with unsafe, this becomes trivial:
public static unsafe uint FloatToUInt32Bits(float f) {
return *((uint*)&f);
}
Replace uint with int if you want to work with signed values, but I would say unsigned makes more sense. This is actually equivalent to Java's floatToRawIntBits(); floatToIntBits() is identical except that it always returns the same bitmask for all NaN values. If you want that functionality, you can just replicate that if statement from the Java version, but it's probably unnecesssary.
You'll need to switch on 'unsafe' support for your assembly, so it's up to you whether you want to go this route. It's not at all uncommon for high performance networking libraries to use unsafe code.
We have an app that the Python module will write data to redis shards and the Java module will read data from redis shards, so I need to implement the exact same consistent hashing algorithm for Java and Python to make sure the data can be found.
I googled around and tried several implementations, but found the Java and Python implementations are always different, can't be used togather. Need your help.
Edit, online implementations I have tried:
Java: http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
Python: http://techspot.zzzeek.org/2012/07/07/the-absolutely-simplest-consistent-hashing-example/
http://amix.dk/blog/post/19367
Edit, attached Java (Google Guava lib used) and Python code I wrote. Code are based on the above articles.
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;
import com.google.common.hash.HashFunction;
public class ConsistentHash<T> {
private final HashFunction hashFunction;
private final int numberOfReplicas;
private final SortedMap<Long, T> circle = new TreeMap<Long, T>();
public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
Collection<T> nodes) {
this.hashFunction = hashFunction;
this.numberOfReplicas = numberOfReplicas;
for (T node : nodes) {
add(node);
}
}
public void add(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.put(hashFunction.hashString(node.toString() + i).asLong(),
node);
}
}
public void remove(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.remove(hashFunction.hashString(node.toString() + i).asLong());
}
}
public T get(Object key) {
if (circle.isEmpty()) {
return null;
}
long hash = hashFunction.hashString(key.toString()).asLong();
if (!circle.containsKey(hash)) {
SortedMap<Long, T> tailMap = circle.tailMap(hash);
hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
}
return circle.get(hash);
}
}
Test code:
ArrayList<String> al = new ArrayList<String>();
al.add("redis1");
al.add("redis2");
al.add("redis3");
al.add("redis4");
String[] userIds =
{"-84942321036308",
"-76029520310209",
"-68343931116147",
"-54921760962352"
};
HashFunction hf = Hashing.md5();
ConsistentHash<String> consistentHash = new ConsistentHash<String>(hf, 100, al);
for (String userId : userIds) {
System.out.println(consistentHash.get(userId));
}
Python code:
import bisect
import md5
class ConsistentHashRing(object):
"""Implement a consistent hashing ring."""
def __init__(self, replicas=100):
"""Create a new ConsistentHashRing.
:param replicas: number of replicas.
"""
self.replicas = replicas
self._keys = []
self._nodes = {}
def _hash(self, key):
"""Given a string key, return a hash value."""
return long(md5.md5(key).hexdigest(), 16)
def _repl_iterator(self, nodename):
"""Given a node name, return an iterable of replica hashes."""
return (self._hash("%s%s" % (nodename, i))
for i in xrange(self.replicas))
def __setitem__(self, nodename, node):
"""Add a node, given its name.
The given nodename is hashed
among the number of replicas.
"""
for hash_ in self._repl_iterator(nodename):
if hash_ in self._nodes:
raise ValueError("Node name %r is "
"already present" % nodename)
self._nodes[hash_] = node
bisect.insort(self._keys, hash_)
def __delitem__(self, nodename):
"""Remove a node, given its name."""
for hash_ in self._repl_iterator(nodename):
# will raise KeyError for nonexistent node name
del self._nodes[hash_]
index = bisect.bisect_left(self._keys, hash_)
del self._keys[index]
def __getitem__(self, key):
"""Return a node, given a key.
The node replica with a hash value nearest
but not less than that of the given
name is returned. If the hash of the
given name is greater than the greatest
hash, returns the lowest hashed node.
"""
hash_ = self._hash(key)
start = bisect.bisect(self._keys, hash_)
if start == len(self._keys):
start = 0
return self._nodes[self._keys[start]]
Test code:
import ConsistentHashRing
if __name__ == '__main__':
server_infos = ["redis1", "redis2", "redis3", "redis4"];
hash_ring = ConsistentHashRing()
test_keys = ["-84942321036308",
"-76029520310209",
"-68343931116147",
"-54921760962352",
"-53401599829545"
];
for server in server_infos:
hash_ring[server] = server
for key in test_keys:
print str(hash_ring[key])
You seem to be running into two issues simultaneously: encoding issues and representation issues.
Encoding issues come about particularly since you appear to be using Python 2 - Python 2's str type is not at all like Java's String type, and is actually more like a Java array of byte. But Java's String.getBytes() isn't guaranteed to give you a byte array with the same contents as a Python str (they probably use compatible encodings, but aren't guaranteed to - even if this fix doesn't change things, it's a good idea in general to avoid problems in the future).
So, the way around this is to use a Python type that behaves like Java's String, and convert the corresponding objects from both languages to bytes specifying the same encoding. From the Python side, this means you want to use the unicode type, which is the default string literal type if you are using Python 3, or put this near the top of your .py file:
from __future__ import unicode_literals
If neither of those is an option, specify your string literals this way:
u'text'
The u at the front forces it to unicode. This can then be converted to bytes using its encode method, which takes (unsurprisingly) an encoding:
u'text'.encode('utf-8')
From the Java side, there is an overloaded version of String.getBytes that takes an encoding - but it takes it as a java.nio.Charset rather than a string - so, you'll want to do:
"text".getBytes(java.nio.charset.Charset.forName("UTF-8"))
These will give you equivalent sequences of bytes in both languages, so that the hashes have the same input and will give you the same answer.
The other issue you may have is representation, depending on which hash function you use. Python's hashlib (which is the preferred implementation of md5 and other cryptographic hashes since Python 2.5) is exactly compatible with Java's MessageDigest in this - they both give bytes, so their output should be equivalent.
Python's zlib.crc32 and Java's java.util.zip.CRC32, on the other hand, both give numeric results - but Java's is always an unsigned 64 bit number, while Python's (in Python 2) is a signed 32 bit number (in Python 3, its now an unsigned 32-bit number, so this problem goes away). To convert a signed result to an unsigned one, do: result & 0xffffffff, and the result should be comparable to the Java one.
According to this analysis of hash functions:
Murmur2, Meiyan, SBox, and CRC32 provide good performance for all kinds of keys. They can be recommended as general-purpose hashing functions on x86.
Hardware-accelerated CRC (labeled iSCSI CRC in the table) is the fastest hash function on the recent Core i5/i7 processors. However, the CRC32 instruction is not supported by AMD and earlier Intel processors.
Python has zlib.crc32 and Java has a CRC32 class. Since it's a standard algorithm, you should get the same result in both languages.
MurmurHash 3 is available in Google Guava (a very useful Java library) and in pyfasthash for Python.
Note that these aren't cryptographic hash functions, so they're fast but don't provide the same guarantees. If these hashes are important for security, use a cryptographic hash.
Differnt language implementations of a hashing algorithm does not make the hash value different. The SHA-1 hash whether generated in java or python will be the same.
I'm not familiar with Redis, but the Python example appears to be hashing keys, so I'm assuming we're talking about some sort of HashMap implementation.
Your python example appears to be using MD5 hashes, which will be the same in both Java and Python.
Here is an example of MD5 hashing in Java:
http://www.dzone.com/snippets/get-md5-hash-few-lines-java
And in Python:
http://docs.python.org/library/md5.html
Now, you may want to find a faster hashing algorithm. MD5 is focused on cryptographic security, which isn't really needed in this case.
Here is a simple hashing function that produces the same result on both python and java for your keys:
Python
def hash(key):
h = 0
for c in key:
h = ((h*37) + ord(c)) & 0xFFFFFFFF
return h;
Java
public static int hash(String key) {
int h = 0;
for (char c : key.toCharArray())
h = (h * 37 + c) & 0xFFFFFFFF;
return h;
}
You don't need a cryptographically secure hash for this. That's just overkill.
Let's get this straight: the same binary input to the same hash function (SHA-1, MD5, ...) in different environments/implementations (Python, Java, ...) will yield the same binary output. That's because these hash functions are implemented according to standards.
Hence, you will discover the sources of the problem(s) you experience when answering these questions:
do you provide the same binary input to both hash functions (e.g. MD5 in Python and Java)?
do you interpret the binary output of both hash functions (e.g. MD5 in Python and Java) equivalently?
#lvc's answer provides much more detail on these questions.
For the java version, I would recommend using MD5 which generates 128bit string result and it can then be converted into BigInteger (Integer and Long are not enough to hold 128bit data).
Sample code here:
private static class HashFunc {
static MessageDigest md5;
static {
try {
md5 = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException e) {
//
}
}
public synchronized int hash(String s) {
md5.update(StandardCharsets.UTF_8.encode(s));
return new BigInteger(1, md5.digest()).intValue();
}
}
Note that:
The java.math.BigInteger.intValue() converts this BigInteger to an int. This conversion is analogous to a narrowing primitive conversion from long to int. If this BigInteger is too big to fit in an int, only the low-order 32 bits are returned. This conversion can lose information about the overall magnitude of the BigInteger value as well as return a result with the opposite sign.
I have integers from 0 to 255, and I need to pass them along to an OutputStream encoded as unsigned bytes. I've tried to convert using a mask like so, but if i=1, the other end of my stream (a serial device expecting uint8_t) thinks I've sent an unsigned integer = 6.
OutputStream out;
public void writeToStream(int i) throws Exception {
out.write(((byte)(i & 0xff)));
}
I'm talking to an Arduino at /dev/ttyUSB0 using Ubuntu if this makes things any more or less interesting.
Here's the Arduino code:
uint8_t nextByte() {
while(1) {
if(Serial.available() > 0) {
uint8_t b = Serial.read();
return b;
}
}
}
I also have some Python code that works great with the Arduino code, and the Arduino happily receives the correct integer if I use this code in Python:
class writerThread(threading.Thread):
def __init__(self, threadID, name):
threading.Thread.__init__(self)
self.threadID = threadID
self.name = name
def run(self):
while True:
input = raw_input("[W}Give Me Input!")
if (input == "exit"):
exit("Goodbye");
print ("[W]You input %s\n" % input.strip())
fval = [ int(input.strip()) ]
ser.write("".join([chr(x) for x in fval]))
I'd also eventually like to do this in Scala, but I'm falling back to Java to avoid the complexity while I solve this issue.
I think you just want out.write(i) here. Only the eight low-order bits are written from the int argument i.
Cast, then mask: ((byte)(i)&0xff)
But, something is very strange since:
(dec)8 - (binary)1000
(dec)6 - (binary)0110
[edit]
How is your Arduino receiving 6 (binary)0110 when you send 1 (binary)0001?
[/edit]
byte s[] = getByteArray()
for(.....)
Integer.toHexString((0x000000ff & s[i]) | 0xffffff00).substring(6);
I understand that you are trying to convert the byte into hex string. What I don't understand is how that is done. For instance if s[i] was 00000001 (decimal 1) than could you please explain:
Why 0x000000ff & 00000001 ? Why not directly use 00000001?
Why result from #1 | 0xffffff00?
Finally why substring(6) is applied?
Thanks.
It's basically because bytes are signed in Java. If you promote a byte to an int, it will sign extend, meaning that the byte 0xf2 will become 0xfffffff2. Sign extension is a method to keep the value the same when widening it, by copying the most significant (sign) bit into all the higher-order bits. Both those values above are -14 in two's complement notation. If instead you had widened 0xf2 to 0x000000f2, it would be 242, probably not what you want.
So the & operation is to strip off any of those extended bits, leaving only the least significant 8 bits. However, since you're going to be forcing those bits to 1 in the next step anyway, this step seems a bit of a waste.
The | operation following that will force all those upper bits to be 1 so that you're guaranteed to get an 8-character string from ffffff00 through ffffffff inclusive (since toHexString doesn't give you leading zeroes, it would translate 7 into "7" rather than the "07" that you want).
The substring(6) is then applied so that you only get the last two of those eight hex digits.
It seems a very convoluted way of ensuring you get a two-character hex string to me when you can just use String.format ("%02x", s[i]). However, it's possible that this particular snippet of code may predate Java 5 when String.format was introduced.
If you run the following program:
public class testprog {
public static void compare (String s1, String s2) {
if (!s1.equals(s2))
System.out.println ("Different: " + s1 + " " + s2);
}
public static void main(String args[]) {
byte b = -128;
while (b < 127) {
compare (
Integer.toHexString((0x000000ff & b) | 0xffffff00).substring(6),
String.format("%02x", b, args));
b++;
}
compare (
Integer.toHexString((0x000000ff & b) | 0xffffff00).substring(6),
String.format("%02x", b, args));
System.out.println ("Done");
}
}
you'll see that the two expressions are identical - it just spits out Done since the two expressions produce the same result in all cases.