Overview
Java's UUID class implements Comparable, but the order it implements appears to be incompatible with the specification given in RFC 4122.
In particular, it is inconsistent with the natural order implied by its string representation (uuid1.toString().compareTo(uuid2.toString())), which lines up with the RFC.
Example
You can reproduce and observe the problem by using the following code:
UUID uuid1 = UUID.randomUUID();
UUID uuid2 = UUID.randomUUID();
Assert.assertEquals(
Math.signum((int) uuid1.compareTo(uuid2)),
Math.signum((int) uuid1.toString().compareTo(uuid2.toString())));
Details
My main problem with this is that almost all other tools and languages seem to be consistent and compatible with RFC 4122, but Java is not.
In my particular case, I am using PostgreSQL 13 and order by a column that contains the UUID, e.g. myColumnd::UUID or myColumnd::text (using uuid_v4), but the order I obtain this way differs from the order obtained with Java.
Well, in one case you compare UUIDs, in the other case two strings in lexical order.
According to the Javadoc:
The first of two UUIDs is greater than the second if the most significant field in which the UUIDs differ is greater for the first UUID.
Reason
What you are observing here is a known bug which will not be fixed anymore to preserve backwards compatibility.
For details, see JDK-7025832:
Though the bug is accurate that the compareTo implementation is not consistent with other implementations the Java UUID.compareTo() method must remain consistent among versions of Java. The compareTo() function is used primarily for sorting and the sort order of UUIDs must remain stable from version to version of Java.
Signed comparison
The underlying root problem is that Java's long type is a signed type, but the reference implementation from RFC 4122, and implementations in other tools and languages, do the math with unsigned types.
This results in small differences in the resulting order, since the point where the numbers over-/underflow is different. E.g. Long.MAX_VALUE is bigger than Long.MAX_VALUE + 1 when both are compared as signed values, but not when the same bit patterns are compared as unsigned values.
The issue with Java's implementation was detected too late and now we have to live with this incompatibility.
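The effect of signedness on a single 64-bit half can be seen in a minimal sketch. The class name SignedVsUnsigned is made up for illustration; Long.compare and Long.compareUnsigned are standard methods since Java 8.

```java
// Illustrative only: compares the same two bit patterns as signed and as unsigned values.
class SignedVsUnsigned {
    public static void main(String[] args) {
        long a = Long.MAX_VALUE;      // 0x7fffffffffffffff
        long b = Long.MAX_VALUE + 1;  // wraps around to Long.MIN_VALUE, 0x8000000000000000

        // Signed view: b is the smallest long, so a is greater than b.
        System.out.println(Long.compare(a, b) > 0);          // prints true
        // Unsigned view: b is 2^63, one more than a, so a is less than b.
        System.out.println(Long.compareUnsigned(a, b) < 0);  // prints true
    }
}
```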
Implementation Appendix
Here is the correct reference implementation from RFC 4122:
/* uuid_compare -- Compare two UUID's "lexically" and return */
#define CHECK(f1, f2) if (f1 != f2) return f1 < f2 ? -1 : 1;
int uuid_compare(uuid_t *u1, uuid_t *u2)
{
int i;
CHECK(u1->time_low, u2->time_low);
CHECK(u1->time_mid, u2->time_mid);
CHECK(u1->time_hi_and_version, u2->time_hi_and_version);
CHECK(u1->clock_seq_hi_and_reserved, u2->clock_seq_hi_and_reserved);
CHECK(u1->clock_seq_low, u2->clock_seq_low)
for (i = 0; i < 6; i++) {
if (u1->node[i] < u2->node[i])
return -1;
if (u1->node[i] > u2->node[i])
return 1;
}
return 0;
}
#undef CHECK
defined on the struct
typedef struct {
unsigned32 time_low;
unsigned16 time_mid;
unsigned16 time_hi_and_version;
unsigned8 clock_seq_hi_and_reserved;
unsigned8 clock_seq_low;
byte node[6];
} uuid_t;
As you can see, they compare the node fields, which are bytes, one by one (in the correct order).
Java's implementation, however, is this:
@Override
public int compareTo(UUID val) {
// The ordering is intentionally set up so that the UUIDs
// can simply be numerically compared as two numbers
return (this.mostSigBits < val.mostSigBits ? -1 :
(this.mostSigBits > val.mostSigBits ? 1 :
(this.leastSigBits < val.leastSigBits ? -1 :
(this.leastSigBits > val.leastSigBits ? 1 :
0))));
}
based on the two (signed) longs:
private final long mostSigBits;
private final long leastSigBits;
It's because Java doesn't have unsigned types, and UUIDs are compared by comparing two pairs of signed longs. It's frustrating.
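If you need the RFC-style order on the Java side (for example to match what the database returns), one option is a separate Comparator that treats both halves as unsigned. This is only a sketch under that assumption, not something the JDK ships; the names UnsignedUuidOrder and UNSIGNED are made up here.

```java
import java.util.Comparator;
import java.util.UUID;

// Sketch: orders UUIDs by treating both 64-bit halves as unsigned values,
// which agrees with the order of the canonical lowercase string form.
class UnsignedUuidOrder {
    static final Comparator<UUID> UNSIGNED = (x, y) -> {
        int c = Long.compareUnsigned(x.getMostSignificantBits(), y.getMostSignificantBits());
        return c != 0
                ? c
                : Long.compareUnsigned(x.getLeastSignificantBits(), y.getLeastSignificantBits());
    };

    public static void main(String[] args) {
        // The two UUIDs differ in the top bit of the most significant half.
        UUID low = UUID.fromString("00000000-0000-4000-8000-000000000000");
        UUID high = UUID.fromString("80000000-0000-4000-8000-000000000000");
        System.out.println("built-in: " + low.compareTo(high));
        System.out.println("unsigned: " + UNSIGNED.compare(low, high));
        System.out.println("string:   " + Integer.signum(low.toString().compareTo(high.toString())));
    }
}
```

Passing UnsignedUuidOrder.UNSIGNED to List.sort or to a TreeMap constructor leaves UUID.compareTo untouched while still giving the unsigned order where you need it.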
Related
Why does the Integer == operator not work for Integer values of 128 and above? Can someone explain this situation?
This is my Java environment:
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)
Sample Code:
Integer a;
Integer b;
a = 129;
b = 129;
for (int i = 0; i < 200; i++) {
a = i;
b = i;
if (a != b) {
System.out.println("Value:" + i + " - Different values");
} else {
System.out.println("Value:" + i + " - Same values");
}
}
Some part of console output:
Value:124 - Same values
Value:125 - Same values
Value:126 - Same values
Value:127 - Same values
Value:128 - Different values
Value:129 - Different values
Value:130 - Different values
Value:131 - Different values
Value:132 - Different values
Check out the source code of Integer. You can see the caching of values there.
The caching happens only if you use Integer.valueOf(int), not if you use new Integer(int). The autoboxing used by you uses Integer.valueOf.
According to the JLS, you can always count on the fact that for values between -128 and 127, you get the identical Integer objects after autoboxing, and on some implementations you might get identical objects even for higher values.
Actually in Java 7 (and I think in newer versions of Java 6), the implementation of the IntegerCache class has changed, and the upper bound is no longer hardcoded, but it is configurable via the property "java.lang.Integer.IntegerCache.high", so if you run your program with the VM parameter -Djava.lang.Integer.IntegerCache.high=1000, you get "Same values" for all values.
But the JLS still guarantees it only until 127:
Ideally, boxing a given primitive value p, would always yield an identical reference. In practice, this may not be feasible using existing implementation techniques. The rules above are a pragmatic compromise. The final clause above requires that certain common values always be boxed into indistinguishable objects. The implementation may cache these, lazily or eagerly.
For other values, this formulation disallows any assumptions about the identity of the boxed values on the programmer's part. This would allow (but not require) sharing of some or all of these references.
This ensures that in most common cases, the behavior will be the desired one, without imposing an undue performance penalty, especially on small devices. Less memory-limited implementations might, for example, cache all characters and shorts, as well as integers and longs in the range of -32K - +32K.
Integer is a wrapper class for int.
Integer != Integer compares the actual object reference, where int != int will compare the values.
As already stated, values -128 to 127 are cached, so the same objects are returned for those.
If outside that range, separate objects will be created so the reference will be different.
To fix it:
Make the types int or
Cast the types to int or
Use .equals()
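The fixes listed above can be sketched as follows. BoxedCompare and sameValue are made-up names for illustration; Objects.equals is the standard null-safe helper from java.util.

```java
import java.util.Objects;

// Sketch: ways to compare boxed Integers by value instead of by reference.
class BoxedCompare {
    // Null-safe value comparison for two boxed Integers.
    static boolean sameValue(Integer a, Integer b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        Integer a = 1000;
        Integer b = 1000;
        System.out.println(a.equals(b));                   // compares the wrapped values
        System.out.println(a.intValue() == b.intValue());  // compares unboxed ints
        System.out.println(sameValue(a, null));            // null-safe, no NullPointerException
    }
}
```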
According to Java Language Specifications:
If the value p being boxed is true, false, a byte, a char in the range
\u0000 to \u007f, or an int or short number between -128 and 127, then
let r1 and r2 be the results of any two boxing conversions of p. It is
always the case that r1 == r2.
JLS Boxing Conversions
Refer to this article for more information on int caching
The Integer object has an internal cache mechanism:
private static class IntegerCache {
static final int high;
static final Integer cache[];
static {
final int low = -128;
// high value may be configured by property
int h = 127;
if (integerCacheHighPropValue != null) {
// Use Long.decode here to avoid invoking methods that
// require Integer's autoboxing cache to be initialized
int i = Long.decode(integerCacheHighPropValue).intValue();
i = Math.max(i, 127);
// Maximum array size is Integer.MAX_VALUE
h = Math.min(i, Integer.MAX_VALUE - -low);
}
high = h;
cache = new Integer[(high - low) + 1];
int j = low;
for(int k = 0; k < cache.length; k++)
cache[k] = new Integer(j++);
}
private IntegerCache() {}
}
Also see valueOf method:
public static Integer valueOf(int i) {
if(i >= -128 && i <= IntegerCache.high)
return IntegerCache.cache[i + 128];
else
return new Integer(i);
}
This is why you should use valueOf instead of new Integer. Autoboxing uses this cache.
Also see this post: https://effective-java.com/2010/01/java-performance-tuning-with-maximizing-integer-valueofint/
Using == is not a good idea, use equals to compare the values.
Use .equals() instead of ==.
Integer values are only cached for numbers between -128 and 127, because they are used most often.
if (a.equals(b)) { ... }
Depending on how you get your Integer instances, it may not work for any value:
System.out.println(new Integer(1) == new Integer(1));
prints
false
This is because the == operator applied to reference-typed operands has nothing to do with the value those operands represent.
It's because of the Integer class implementation logic. It has prepared objects for numbers up to 127. You can check out the http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/Integer.java source of OpenJDK, for example (search for cache[]).
Basically, objects shouldn't be compared using == at all, with the one exception of enums.
We have an app in which a Python module writes data to Redis shards and a Java module reads data from Redis shards, so I need to implement exactly the same consistent hashing algorithm in Java and Python to make sure the data can be found.
I googled around and tried several implementations, but found that the Java and Python implementations always differ and can't be used together. I need your help.
Edit, online implementations I have tried:
Java: http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
Python: http://techspot.zzzeek.org/2012/07/07/the-absolutely-simplest-consistent-hashing-example/
http://amix.dk/blog/post/19367
Edit: attached are the Java (using the Google Guava library) and Python code I wrote. The code is based on the above articles.
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;
import com.google.common.hash.HashFunction;
public class ConsistentHash<T> {
private final HashFunction hashFunction;
private final int numberOfReplicas;
private final SortedMap<Long, T> circle = new TreeMap<Long, T>();
public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
Collection<T> nodes) {
this.hashFunction = hashFunction;
this.numberOfReplicas = numberOfReplicas;
for (T node : nodes) {
add(node);
}
}
public void add(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.put(hashFunction.hashString(node.toString() + i).asLong(),
node);
}
}
public void remove(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.remove(hashFunction.hashString(node.toString() + i).asLong());
}
}
public T get(Object key) {
if (circle.isEmpty()) {
return null;
}
long hash = hashFunction.hashString(key.toString()).asLong();
if (!circle.containsKey(hash)) {
SortedMap<Long, T> tailMap = circle.tailMap(hash);
hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
}
return circle.get(hash);
}
}
Test code:
ArrayList<String> al = new ArrayList<String>();
al.add("redis1");
al.add("redis2");
al.add("redis3");
al.add("redis4");
String[] userIds =
{"-84942321036308",
"-76029520310209",
"-68343931116147",
"-54921760962352"
};
HashFunction hf = Hashing.md5();
ConsistentHash<String> consistentHash = new ConsistentHash<String>(hf, 100, al);
for (String userId : userIds) {
System.out.println(consistentHash.get(userId));
}
Python code:
import bisect
import md5

class ConsistentHashRing(object):
    """Implement a consistent hashing ring."""

    def __init__(self, replicas=100):
        """Create a new ConsistentHashRing.

        :param replicas: number of replicas.
        """
        self.replicas = replicas
        self._keys = []
        self._nodes = {}

    def _hash(self, key):
        """Given a string key, return a hash value."""
        return long(md5.md5(key).hexdigest(), 16)

    def _repl_iterator(self, nodename):
        """Given a node name, return an iterable of replica hashes."""
        return (self._hash("%s%s" % (nodename, i))
                for i in xrange(self.replicas))

    def __setitem__(self, nodename, node):
        """Add a node, given its name.

        The given nodename is hashed
        among the number of replicas.
        """
        for hash_ in self._repl_iterator(nodename):
            if hash_ in self._nodes:
                raise ValueError("Node name %r is "
                                 "already present" % nodename)
            self._nodes[hash_] = node
            bisect.insort(self._keys, hash_)

    def __delitem__(self, nodename):
        """Remove a node, given its name."""
        for hash_ in self._repl_iterator(nodename):
            # will raise KeyError for nonexistent node name
            del self._nodes[hash_]
            index = bisect.bisect_left(self._keys, hash_)
            del self._keys[index]

    def __getitem__(self, key):
        """Return a node, given a key.

        The node replica with a hash value nearest
        but not less than that of the given
        name is returned. If the hash of the
        given name is greater than the greatest
        hash, returns the lowest hashed node.
        """
        hash_ = self._hash(key)
        start = bisect.bisect(self._keys, hash_)
        if start == len(self._keys):
            start = 0
        return self._nodes[self._keys[start]]
Test code:
# Assumes the class above is saved as ConsistentHashRing.py;
# a plain "import ConsistentHashRing" would bind the module, not the class.
from ConsistentHashRing import ConsistentHashRing

if __name__ == '__main__':
    server_infos = ["redis1", "redis2", "redis3", "redis4"]
    hash_ring = ConsistentHashRing()
    test_keys = ["-84942321036308",
                 "-76029520310209",
                 "-68343931116147",
                 "-54921760962352",
                 "-53401599829545"]

    for server in server_infos:
        hash_ring[server] = server

    for key in test_keys:
        print str(hash_ring[key])
You seem to be running into two issues simultaneously: encoding issues and representation issues.
Encoding issues come about particularly since you appear to be using Python 2 - Python 2's str type is not at all like Java's String type, and is actually more like a Java array of byte. But Java's String.getBytes() isn't guaranteed to give you a byte array with the same contents as a Python str (they probably use compatible encodings, but aren't guaranteed to - even if this fix doesn't change things, it's a good idea in general to avoid problems in the future).
So, the way around this is to use a Python type that behaves like Java's String, and convert the corresponding objects from both languages to bytes specifying the same encoding. From the Python side, this means you want to use the unicode type, which is the default string literal type if you are using Python 3, or put this near the top of your .py file:
from __future__ import unicode_literals
If neither of those is an option, specify your string literals this way:
u'text'
The u at the front forces it to unicode. This can then be converted to bytes using its encode method, which takes (unsurprisingly) an encoding:
u'text'.encode('utf-8')
From the Java side, String.getBytes has overloads that take an encoding, either as a charset name (a String) or as a java.nio.charset.Charset. The Charset variant avoids the checked UnsupportedEncodingException, so you'll want to do:
"text".getBytes(java.nio.charset.Charset.forName("UTF-8"))
These will give you equivalent sequences of bytes in both languages, so that the hashes have the same input and will give you the same answer.
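On the Java side, feeding the UTF-8 bytes into MD5 and reading the whole digest as one number could look like the sketch below. Md5AsNumber is a made-up name. The assumption is that the Python side computes int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16), which for ASCII keys is the same value as the long(md5.md5(key).hexdigest(), 16) in the question.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: MD5 over the UTF-8 bytes of a key, interpreted as a 128-bit unsigned number.
class Md5AsNumber {
    static BigInteger md5(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            // Signum 1 reads the 16 digest bytes as a non-negative big-endian integer.
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Zero-padded to 32 hex digits, the same shape as Python's hexdigest().
        System.out.println(String.format("%032x", md5("abc"))); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```

Note that Guava's HashCode.asLong(), used in the question's Java code, only reads the first eight digest bytes in little-endian order, so it will not match a number built from the full hex digest. That is probably one reason the two rings disagree.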
The other issue you may have is representation, depending on which hash function you use. Python's hashlib (which is the preferred implementation of md5 and other cryptographic hashes since Python 2.5) is exactly compatible with Java's MessageDigest in this - they both give bytes, so their output should be equivalent.
Python's zlib.crc32 and Java's java.util.zip.CRC32, on the other hand, both give numeric results - but Java's getValue() always returns a non-negative 32-bit value held in a long, while Python's (in Python 2) is a signed 32-bit number (in Python 3, it's now an unsigned 32-bit number, so this problem goes away). To convert a signed result to an unsigned one, do: result & 0xffffffff, and the result should be comparable to the Java one.
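The signed and unsigned views of the same CRC can be reproduced in Java alone, as in this small sketch. Crc32Views and unsignedCrc are made-up names; the input "123456789" is the conventional CRC-32 check string.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch: the same CRC-32 seen as an unsigned value and as a signed 32-bit int.
class Crc32Views {
    static long unsignedCrc(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // in the range 0 .. 2^32 - 1, stored in a long
    }

    public static void main(String[] args) {
        long unsigned = unsignedCrc("123456789");
        int signed = (int) unsigned;           // signed 32-bit view of the same bits
        long restored = signed & 0xffffffffL;  // masking recovers the unsigned value
        System.out.println(Long.toHexString(unsigned)); // prints cbf43926
        System.out.println(signed < 0);                 // prints true
        System.out.println(restored == unsigned);       // prints true
    }
}
```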
According to this analysis of hash functions:
Murmur2, Meiyan, SBox, and CRC32 provide good performance for all kinds of keys. They can be recommended as general-purpose hashing functions on x86.
Hardware-accelerated CRC (labeled iSCSI CRC in the table) is the fastest hash function on the recent Core i5/i7 processors. However, the CRC32 instruction is not supported by AMD and earlier Intel processors.
Python has zlib.crc32 and Java has a CRC32 class. Since it's a standard algorithm, you should get the same result in both languages.
MurmurHash 3 is available in Google Guava (a very useful Java library) and in pyfasthash for Python.
Note that these aren't cryptographic hash functions, so they're fast but don't provide the same guarantees. If these hashes are important for security, use a cryptographic hash.
Different language implementations of a hashing algorithm do not make the hash value different. A SHA-1 hash will be the same whether it is generated in Java or Python.
I'm not familiar with Redis, but the Python example appears to be hashing keys, so I'm assuming we're talking about some sort of HashMap implementation.
Your python example appears to be using MD5 hashes, which will be the same in both Java and Python.
Here is an example of MD5 hashing in Java:
http://www.dzone.com/snippets/get-md5-hash-few-lines-java
And in Python:
http://docs.python.org/library/md5.html
Now, you may want to find a faster hashing algorithm. MD5 is focused on cryptographic security, which isn't really needed in this case.
Here is a simple hashing function that produces the same result on both python and java for your keys:
Python
def hash(key):
    h = 0
    for c in key:
        h = ((h*37) + ord(c)) & 0xFFFFFFFF
    return h
Java
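public static long hash(String key) {
    // Use long arithmetic and a long mask so the result stays in 0..2^32-1,
    // like the Python version; with int, & 0xFFFFFFFF is a no-op and the
    // result can come out negative.
    long h = 0;
    for (char c : key.toCharArray())
        h = (h * 37 + c) & 0xFFFFFFFFL;
    return h;
}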
You don't need a cryptographically secure hash for this. That's just overkill.
Let's get this straight: the same binary input to the same hash function (SHA-1, MD5, ...) in different environments/implementations (Python, Java, ...) will yield the same binary output. That's because these hash functions are implemented according to standards.
Hence, you will discover the sources of the problem(s) you experience when answering these questions:
do you provide the same binary input to both hash functions (e.g. MD5 in Python and Java)?
do you interpret the binary output of both hash functions (e.g. MD5 in Python and Java) equivalently?
@lvc's answer provides much more detail on these questions.
For the Java version, I would recommend using MD5, which generates a 128-bit result that can then be converted into a BigInteger (Integer and Long are not enough to hold 128 bits of data).
Sample code here:
private static class HashFunc {
static MessageDigest md5;
static {
try {
md5 = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException e) {
//
}
}
public synchronized int hash(String s) {
md5.update(StandardCharsets.UTF_8.encode(s));
return new BigInteger(1, md5.digest()).intValue();
}
}
Note that:
The java.math.BigInteger.intValue() converts this BigInteger to an int. This conversion is analogous to a narrowing primitive conversion from long to int. If this BigInteger is too big to fit in an int, only the low-order 32 bits are returned. This conversion can lose information about the overall magnitude of the BigInteger value as well as return a result with the opposite sign.
The java.lang.Comparable#compareTo method states as its first provision:
The implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
for all x and y. (This implies that x.compareTo(y) must throw
an exception if and only if y.compareTo(x) throws an exception.)
and according to Joshua Bloch in Effective Java, Item 12:
This trick works fine here but should be used with extreme caution.
Don’t use it unless you’re certain the fields in question are
non-negative or, more generally, that the difference between the
lowest and highest possible field values is less than or equal to
Integer.MAX_VALUE (2^31 - 1). The reason this trick doesn’t always work
is that a signed 32-bit integer isn’t big enough to hold the
difference between two arbitrary signed 32-bit integers. If i is a
large positive int and j is a large negative int, (i - j) will
overflow and return a negative value. The resulting compareTo method
will return incorrect results for some arguments and violate the first
and second provisions of the compareTo contract. This is not a purely
theoretical problem: it has caused failures in real systems. These
failures can be difficult to debug, as the broken compareTo method
works properly for most input values.
With integer overflow you can supposedly violate the first provision, but I can't find out how. This example is meant to show how the first provision would be violated:
public class ProblemsWithLargeIntegers implements Comparable<ProblemsWithLargeIntegers> {
private int zas;
@Override
public int compareTo(ProblemsWithLargeIntegers o) {
return zas - o.zas;
}
public ProblemsWithLargeIntegers(int zas) {
this.zas = zas;
}
public static void main(String[] args) {
int value1 = ...;
int value2 = ...;
ProblemsWithLargeIntegers d = new ProblemsWithLargeIntegers(value1);
ProblemsWithLargeIntegers e = new ProblemsWithLargeIntegers(value2);
if (!(Math.signum(d.compareTo(e)) == -Math.signum(e.compareTo(d)))){
System.out.println("hey!");
}
}
}
So which value1 and value2 would produce that? Any ideas? Or was Joshua wrong?
Well, this violates the general contract to start with. For example, take value1 = Integer.MIN_VALUE and value2 = 1. That will report that Integer.MIN_VALUE > 1, effectively.
EDIT: Actually, I was wrong - it's easy to violate the first provision:
int value1 = Integer.MIN_VALUE;
int value2 = 0;
You'll get a negative result for both comparisons, because Integer.MIN_VALUE - 0 == 0 - Integer.MIN_VALUE.
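A small sketch makes the difference visible and shows the usual overflow-safe alternative, Integer.compare (available since Java 7). SafeIntCompare, bySubtraction and byCompare are made-up names for illustration.

```java
// Sketch: subtraction-based comparison versus Integer.compare on the same inputs.
class SafeIntCompare {
    // The trick from the question: overflows when the true difference does not fit in an int.
    static int bySubtraction(int a, int b) {
        return a - b;
    }

    // Overflow-safe: only the sign of the result is meaningful.
    static int byCompare(int a, int b) {
        return Integer.compare(a, b);
    }

    public static void main(String[] args) {
        int x = Integer.MIN_VALUE;
        int y = 0;
        // Both directions come out negative, which breaks sgn(x.compareTo(y)) == -sgn(y.compareTo(x)).
        System.out.println(bySubtraction(x, y) < 0 && bySubtraction(y, x) < 0); // prints true
        // Integer.compare gives opposite signs, as the contract requires.
        System.out.println(byCompare(x, y) < 0 && byCompare(y, x) > 0);         // prints true
    }
}
```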
I need to generate a unique integer id for a string.
Reason:
I have a database application that can run on different databases. These databases contain parameters with parameter types that are generated from external XML data.
The current situation is that I use the ordinal number of the enum. But when a parameter is inserted or removed, the ordinals get mixed up:
(FOOD = 0 , TOYS = 1) <--> (FOOD = 0, NONFOOD = 1, TOYS = 2)
The number of parameter types is between 200 and 2000, so I am a bit wary of using hashCode() on a string.
P.S.: I am using Java.
Thanks a lot
I would use a mapping table in the database to map these strings to an auto-increment value. These mappings should then be cached in the application.
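As a rough in-memory stand-in for such a mapping table, the idea looks like the sketch below. StringIdRegistry and idFor are made-up names, and the counter plays the role of the auto-increment column. In a real setup the ids would have to be loaded from and written back to the database, otherwise they change between runs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: assigns each distinct string the next free integer, once, and remembers it.
class StringIdRegistry {
    private final Map<String, Integer> ids = new ConcurrentHashMap<>();
    private final AtomicInteger next = new AtomicInteger();

    // Returns the existing id for a name, or assigns the next sequence value.
    int idFor(String name) {
        return ids.computeIfAbsent(name, k -> next.getAndIncrement());
    }

    public static void main(String[] args) {
        StringIdRegistry registry = new StringIdRegistry();
        System.out.println(registry.idFor("FOOD"));    // prints 0
        System.out.println(registry.idFor("TOYS"));    // prints 1
        System.out.println(registry.idFor("NONFOOD")); // prints 2
        System.out.println(registry.idFor("TOYS"));    // prints 1, unchanged by the later insert
    }
}
```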
Use a cryptographic hash. MD5 would probably be sufficient and relatively fast. It will be unique enough for your set of input.
How can I generate an MD5 hash?
The only problem is that the hash is 128 bits, so a standard 64-bit integer won't hold it.
If you need to be absolutely certain that the ids are unique (no collisions), your strings are up to 32 chars long, and your number must have no more than 10 digits (approx. 32 bits), you obviously cannot do it with a one-way function id = F(string).
The natural way is to keep some mapping of the string to unique numbers (typically a sequence), either in the DB or in the application.
If you know the type of string values (length, letter patterns), you can count the total number of strings in this set and if it fits within 32 bits, the count function is your integer value.
Otherwise, the string itself is your integer value (integer in math terms, not Java).
By Enum, do you mean a Java enum? Then you could give each enum value a unique int yourself instead of using its ordinal number:
public enum MyEnum {
FOOD(0),
TOYS(1);
private final int id;
private MyEnum(int id)
{
this.id = id;
}
}
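A slightly fuller sketch of the same idea adds an accessor and a reverse lookup, so the stored int can be turned back into the constant. ParameterType, id() and fromId are made-up names; NONFOOD stands for a constant added later, which takes a fresh id and leaves the existing ones alone.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: enum constants with explicit, stable ids that do not depend on declaration order.
enum ParameterType {
    FOOD(0),
    NONFOOD(2), // added later, in the middle: ordinals shift, ids do not
    TOYS(1);

    private static final Map<Integer, ParameterType> BY_ID = new HashMap<>();

    static {
        for (ParameterType t : values()) {
            BY_ID.put(t.id, t);
        }
    }

    private final int id;

    ParameterType(int id) {
        this.id = id;
    }

    int id() {
        return id;
    }

    // Reverse lookup from the persisted id to the constant.
    static ParameterType fromId(int id) {
        ParameterType t = BY_ID.get(id);
        if (t == null) {
            throw new IllegalArgumentException("Unknown id: " + id);
        }
        return t;
    }
}
```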
I came across this post that's sensible: How to convert string to unique identifier in Java
In it the author describes his implementation:
public static long longHash(String string) {
long h = 98764321261L;
int l = string.length();
char[] chars = string.toCharArray();
for (int i = 0; i < l; i++) {
h = 31*h + chars[i];
}
return h;
}