I need to generate a unique integer id for a string.
Reason:
I have a database application that can run on different databases. These databases contain parameters whose parameter types are generated from external XML data.
Currently I use the ordinal number of the enum as the id. But when a parameter type is inserted or removed, the ordinals shift:
(FOOD = 0 , TOYS = 1) <--> (FOOD = 0, NONFOOD = 1, TOYS = 2)
The number of parameter types is between 200 and 2000, so I am a bit wary of using hashCode() on the string.
P.S.: I am using Java.
Thanks a lot
I would use a mapping table in the database to map these strings to an auto-increment value. These mappings should then be cached in the application.
Use a cryptographic hash. MD5 would probably be sufficient and relatively fast. It will be unique enough for your set of input.
How can I generate an MD5 hash?
The only problem is that the hash is 128 bits, so a standard 64-bit integer won't hold it.
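A minimal sketch of the MD5 approach, assuming the standard `java.security.MessageDigest` API. The class name `Md5Ids` and its methods are hypothetical. It shows both ways around the 128-bit problem: keep the full digest as a `BigInteger`, or truncate it to the first 8 bytes so it fits a `long`. Truncating raises the collision probability compared with the full digest, although it stays tiny for a few thousand inputs.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Ids {
    // Full 128-bit digest as a non-negative BigInteger.
    static BigInteger md5AsBigInteger(String s) {
        return new BigInteger(1, md5(s));
    }

    // First 8 bytes of the digest read as a big-endian long (may be negative).
    static long md5AsLong(String s) {
        return ByteBuffer.wrap(md5(s)).getLong();
    }

    private static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

Neither variant guarantees uniqueness; if a collision must never happen, a mapping table remains the safer choice.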
If you need to be absolutely certain that the ids are unique (no collisions), your strings are up to 32 chars long, and your number must have no more than 10 digits (approx. 32 bits), you obviously cannot do it with a one-way function id=F(string).
The natural way is to keep some mapping of the string to unique numbers (typically a sequence), either in the DB or in the application.
If you know the type of string values (length, letter patterns), you can count the total number of strings in this set and if it fits within 32 bits, the count function is your integer value.
Otherwise, the string itself is your integer value (integer in math terms, not Java).
By Enum you mean a Java Enum? Then you could give each enum value a unique int yourself instead of using its ordinal number:
public enum MyEnum {
    FOOD(0),
    TOYS(1);

    // Explicit id that stays stable when constants are added or reordered.
    private final int id;

    private MyEnum(int id)
    {
        this.id = id;
    }

    public int getId()
    {
        return id;
    }
}
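To illustrate why explicit ids survive insertions, here is a small sketch. The enum name `ParameterType`, the id chosen for `NONFOOD`, and the `fromId` lookup are all assumptions made for the example. `NONFOOD` is inserted between the existing constants but receives a fresh id, so the ids already stored in the database keep their meaning.

```java
import java.util.HashMap;
import java.util.Map;

enum ParameterType {
    FOOD(0),
    NONFOOD(2), // added later: new id, so FOOD and TOYS keep theirs
    TOYS(1);

    // Reverse lookup table, built once; also guards against duplicate ids.
    private static final Map<Integer, ParameterType> BY_ID = new HashMap<>();
    static {
        for (ParameterType t : values()) {
            if (BY_ID.put(t.id, t) != null) {
                throw new IllegalStateException("Duplicate id " + t.id);
            }
        }
    }

    private final int id;

    ParameterType(int id) {
        this.id = id;
    }

    int getId() {
        return id;
    }

    static ParameterType fromId(int id) {
        ParameterType t = BY_ID.get(id);
        if (t == null) {
            throw new IllegalArgumentException("Unknown id " + id);
        }
        return t;
    }
}
```

Here `TOYS.ordinal()` has changed from 1 to 2, while `TOYS.getId()` is still 1.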
I came across this post that's sensible: How to convert string to unique identifier in Java
In it the author describes his implementation:
public static long longHash(String string) {
long h = 98764321261L;
int l = string.length();
char[] chars = string.toCharArray();
for (int i = 0; i < l; i++) {
h = 31*h + chars[i];
}
return h;
}
Related
I'm to implement a hash function, and here is my hash function (the first draft version that is)
public int hashCode(){
String fixedISBN = getIsbn().toString().replace("-", "");
fixedISBN = fixedISBN.substring(fixedISBN.length()-4, fixedISBN.length());
int ISBN = Integer.parseInt(fixedISBN);
int ASCII = 0;
for (int i = 0; i < getTitle().toString().length(); i++) {
ASCII += getTitle().toString().charAt(i);
}
int hashValue = (ISBN * 37 + ASCII*23);
return hashValue;
}
I am meant to hash books, and to do so I initially thought to use the ISBN of a book, which serves as a wholly unique identifier for every book. Then I looked at the list of ISBNs and saw that using the entire ISBN would not help much, since there isn't a lot of variation between the numbers. As such I use only the last four digits of the ISBN, since those tend to be the ones that vary. I also plan to use the ASCII values of the title's characters for my hashValue, but I believe a problem arises: an ASCII value can be at most 127, so a short title of, say, 8 characters or fewer produces a sum of at most 1016. If the table size is very large, say 10 007, that wouldn't produce a very even spread. Is there any way I could make the ASCII values more suitable for producing a hash value for a large table?
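One common way to address the small range of a plain character sum is a polynomial hash: multiply the running value by a constant before adding each character, which is the same scheme `String.hashCode()` uses. The sketch below is only an illustration; the class `BookHash` and its method names are hypothetical and not part of the original draft.

```java
final class BookHash {
    // Polynomial hash: position matters, so "ab" and "ba" differ,
    // and even short titles reach values far above 1016.
    static int titleHash(String title) {
        int h = 0;
        for (int i = 0; i < title.length(); i++) {
            h = 31 * h + title.charAt(i);
        }
        return h;
    }

    // Combine the last four ISBN digits with the title hash.
    static int combine(int isbnLast4, String title) {
        return 37 * isbnLast4 + titleHash(title);
    }

    // Map a possibly negative hash to a bucket in [0, tableSize).
    static int bucket(int hash, int tableSize) {
        return Math.floorMod(hash, tableSize);
    }
}
```

`Math.floorMod` is used instead of `%` because integer overflow can make the hash negative.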
I'm trying to add a row with id number 3791318595.
long l = 3791318595L;
myTable.addRow(l,"testing");
However, Jackcess converts it to an int and caps it at the maximum int value:
2,147,483,647.
How can I correctly add the above number with Jackcess?
If the "id number" field in the Access table is defined as Number (Long Integer) then the short answer to your question is:
You can't.
A Long Integer in Access is a 32-bit signed integer whose maximum value is (2^31)-1 as you have seen. That is, a Long Integer in Access corresponds to int (not long) in Java. The value you are trying to insert into the Access field simply will not fit "as is", so there is no way that Jackcess (or any other application) can do it.
If your "id numbers" are all non-negative integers, then one possible workaround would be to have your code mimic unsigned 32-bit integers by wrapping values between 2,147,483,648 and 4,294,967,295 to their negative signed counterparts:
long unsignedAdjustment = 4294967296L; // 2^32
long l = 3791318595L; // our test value
if (l < 0 || l >= unsignedAdjustment) {
System.out.println("Error: ID value out of range; it will not fit even if wrapped to a negative integer.");
}
else {
int signedID = 0;
if (l > 2147483647L) { // (2^31)-1
signedID = (int) (l - unsignedAdjustment);
}
else {
signedID = (int) l;
}
myTable.addRow(signedID, "testing");
}
That would store the row in the Access table with an "id number" of -503,648,701. Of course...
it would be up to your code to perform the corresponding conversion when retrieving rows, and
this approach has obvious implications for searching and ordering by "id number"
...but if the "id number" is really just a unique row identifier then it may not be too much of an inconvenience.
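The conversion in both directions can be sketched as follows. The class and method names are hypothetical. `Integer.toUnsignedLong` (available since Java 8) performs the reverse mapping when rows are read back.

```java
final class UnsignedIds {
    // Store: reinterpret an unsigned 32-bit value as a signed int.
    static int toStored(long unsignedId) {
        if (unsignedId < 0 || unsignedId > 0xFFFFFFFFL) {
            throw new IllegalArgumentException(
                    "Out of unsigned 32-bit range: " + unsignedId);
        }
        return (int) unsignedId; // keeps the low 32 bits
    }

    // Load: widen the stored int back to the original unsigned value.
    static long fromStored(int storedId) {
        return Integer.toUnsignedLong(storedId);
    }
}
```

For the test value, `toStored(3791318595L)` yields -503648701 and `fromStored(-503648701)` returns 3791318595.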
How can I convert a non-numeric String to an Integer?
I got for instance:
String unique = "FUBAR";
What's a good way to represent the String as an Integer with no collisions? E.g. "FUBAR" should always be represented by the same number and must not collide with any other String. For instance, String a = "A"; should be represented as the Integer 1 and so on. Which method does this (preferably for all Unicode strings, but in my case ASCII values could be sufficient)?
This is impossible. Think about it: an Integer has only 32 bits, so it can take at most 2^32 distinct values, while there are infinitely many strings. By the pigeonhole principle, there must exist at least two strings that map to the same Integer no matter what technique you use for the conversion. In fact, at least one value is shared by infinitely many strings.
If you're just looking for an efficient mapping, then I suggest that you just use the int returned by hashCode(). Note that it is a full 32-bit signed value, so it can be negative.
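A concrete collision makes the point. The two-character strings "Aa" and "BB" have the same `hashCode()`, because 65*31 + 97 and 66*31 + 66 both equal 2112. The demo class below is hypothetical.

```java
final class HashCollisionDemo {
    // True when two different strings share a hash code.
    static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());      // prints 2112
        System.out.println("BB".hashCode());      // prints 2112
        System.out.println(collide("Aa", "BB"));  // prints true
    }
}
```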
You can map Strings to unique IDs using a table. There is no way to do this generically.
final Map<String, Integer> map = new HashMap<>();
public int idFor(String s) {
Integer id = map.get(s);
if (id == null)
map.put(s, id = map.size());
return id;
}
Note: having unique id's doesn't guarantee no collisions in a hash collection.
http://vanillajava.blogspot.co.uk/2013/10/unique-hashcodes-is-not-enough-to-avoid.html
If you know the character set used in your strings, then you can think of the string as number with base other than 10. For example, hexadecimal numbers contain letters from A to F.
Therefore, if you know that your strings only contain letters from an 8-bit character set, you can treat the string as a 256-base number. In pseudo code this would be:
number n = 0;
for each letter in string
    n = 256 * n + (letter's position in character set)
If your character set contains 65535 characters, then just multiply 'n' by that number on each step. But beware, the 32 bits of an integer will overflow quickly. You probably need to use a type that can hold a larger number.
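The pseudocode above could look like this in Java, using `BigInteger` so that overflow is not an issue. This is a sketch under one assumption: ISO-8859-1 stands in for the 8-bit character set, so characters outside it are replaced by `?` during encoding. The class name `Base256` is hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

final class Base256 {
    private static final BigInteger BASE = BigInteger.valueOf(256);

    // Reads the string as a base-256 number, most significant letter first.
    static BigInteger toNumber(String s) {
        BigInteger n = BigInteger.ZERO;
        for (byte b : s.getBytes(StandardCharsets.ISO_8859_1)) {
            // Mask with 0xFF because Java bytes are signed.
            n = n.multiply(BASE).add(BigInteger.valueOf(b & 0xFF));
        }
        return n;
    }
}
```

One caveat: leading characters with position 0 behave like leading zeros, so "\u0000A" and "A" map to the same number.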
private BigDecimal createBigDecimalFromString(String data)
{
    BigDecimal value = BigDecimal.ZERO;
    // StandardCharsets.UTF_8 (java.nio.charset) avoids the checked
    // UnsupportedEncodingException that was silently swallowed before.
    byte[] tmp = data.getBytes(StandardCharsets.UTF_8);
    for (int i = tmp.length - 1; i >= 0; i--)
    {
        BigDecimal exponent = new BigDecimal(256).pow(i);
        // Mask with 0xFF: Java bytes are signed, so non-ASCII bytes
        // would otherwise contribute negative digits.
        value = value.add(exponent.multiply(new BigDecimal(tmp[i] & 0xFF)));
    }
    return value;
}
Maybe a little bit late, but I'm going to give my 10 cents to simplify it (internally it is similar to the BigDecimal approach suggested by @Romain Hippeau):
public static BigInteger getNumberId(final String value) {
    // StandardCharsets.UTF_8 (java.nio.charset) replaces the lookup through Charset.availableCharsets().
    return new BigInteger(value.getBytes(StandardCharsets.UTF_8));
}
Regardless of the accepted answer, it is possible to represent any String as an integer by computing that String's Gödel number, which is a unique product of prime powers for every possible String. That said, it is quite impractical and slow to implement. For most Strings you would also need a BigInteger rather than a normal Integer, and to decode a Gödel number into its corresponding String you need a defined Charset.
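A minimal sketch of such an encoding, assuming the usual construction: the i-th prime is raised to the code of the i-th character. The class name `Godel` is hypothetical, and the sketch works on UTF-16 code units. It adds 1 to each code so that a `'\u0000'` character still contributes a factor. Primes are stepped with `BigInteger.nextProbablePrime()`, which is probabilistic in general.

```java
import java.math.BigInteger;

final class Godel {
    // Encodes s as 2^(c0+1) * 3^(c1+1) * 5^(c2+1) * ...
    static BigInteger encode(String s) {
        BigInteger result = BigInteger.ONE;
        BigInteger prime = BigInteger.valueOf(2);
        for (int i = 0; i < s.length(); i++) {
            int exponent = s.charAt(i) + 1; // +1 keeps '\u0000' visible
            result = result.multiply(prime.pow(exponent));
            prime = prime.nextProbablePrime();
        }
        return result;
    }
}
```

Even the single character "A" already yields 2^66, which no longer fits in a `long`.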
I am trying to generate a unique identifier of a fixed length such as the IDs that are generated by Megaupload for the uploaded files.
For example:
ALGYTAB5
BCLD23A6
In this example using from A-Z and 0-9 and with a fixed length of 8 the total different combinations are 2,821,109,907,456.
What if one of the generated IDs is already taken? The IDs are going to be stored in a database, and none of them should be used more than once.
How can I achieve that in Java?
Thank you.
Hmm... You could imitate a smaller GUID the following way. Let the first 4 bytes of your string be the encoded current time, i.e. the seconds elapsed since the Unix epoch, and the last 4 just a random combination. In this case the only way two IDs could coincide is if they were built in the same second, and the chances of that are very low because of the other 4 random characters.
Pseudocode:
get current time (4-byte integer)
id[0] = 1st byte of current time (encoded to be a digit or a letter)
id[1] = 2nd
id[2] = 3rd
id[3] = 4th
id[4] = random character
id[5] = random character
id[6] = random character
id[7] = random character
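The pseudocode above could be sketched in Java as follows. One deviation is needed and is an assumption of this sketch: four characters from a 36-symbol alphabet can only distinguish 36^4 = 1,679,616 values, so the seconds counter is taken modulo 36^4 and wraps around roughly every 19 days. The class name `TimeRandomId` is hypothetical.

```java
import java.security.SecureRandom;

final class TimeRandomId {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final int BASE = ALPHABET.length();               // 36
    private static final int TIME_SPACE = BASE * BASE * BASE * BASE; // 36^4
    private static final SecureRandom RANDOM = new SecureRandom();

    // Builds an 8-character id: 4 time characters, then 4 random ones.
    static String newId(long epochSeconds) {
        char[] id = new char[8];
        int t = (int) Math.floorMod(epochSeconds, (long) TIME_SPACE);
        for (int i = 3; i >= 0; i--) { // base-36 digits, least significant last
            id[i] = ALPHABET.charAt(t % BASE);
            t /= BASE;
        }
        for (int i = 4; i < 8; i++) {
            id[i] = ALPHABET.charAt(RANDOM.nextInt(BASE));
        }
        return new String(id);
    }

    static String newId() {
        return newId(System.currentTimeMillis() / 1000L);
    }
}
```

Two IDs created in the same second can still collide in the random half, so a uniqueness check in the database remains necessary.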
I have tried @Armen's solution; however, I would like to offer another one:
UUID idOne = UUID.randomUUID();
UUID idTwo = UUID.randomUUID();
UUID idThree = UUID.randomUUID();
UUID idFour = UUID.randomUUID();
String time = idOne.toString().replace("-", "");
String time2 = idTwo.toString().replace("-", "");
String time3 = idThree.toString().replace("-", "");
String time4 = idFour.toString().replace("-", "");
StringBuffer data = new StringBuffer();
data.append(time);
data.append(time2);
data.append(time3);
data.append(time4);
SecureRandom random = new SecureRandom();
int beginIndex = random.nextInt(100); //Begin index + length of your string < data length
int endIndex = beginIndex + 10; //Length of string which you want
String yourID = data.substring(beginIndex, endIndex);
Hope this helps!
We're using the database to check whether they already exist. If the number of IDs is low compared to the possible number you should be relatively safe.
You might also have a look at the UUID class (although it's 16-byte UUIDs).
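The check-and-retry idea can be sketched like this. The in-memory `Set` is only a stand-in for the database, and the class name `UniqueIdGenerator` is hypothetical; in a real application a UNIQUE constraint on the id column would play the role of `taken.add`.

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

final class UniqueIdGenerator {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final int LENGTH = 8;

    private final SecureRandom random = new SecureRandom();
    // Stand-in for the database table of ids already handed out.
    private final Set<String> taken = new HashSet<>();

    // Draws random candidates until one is not taken yet.
    String next() {
        while (true) {
            String candidate = randomId();
            if (taken.add(candidate)) { // add() returns false on a duplicate
                return candidate;
            }
        }
    }

    private String randomId() {
        StringBuilder sb = new StringBuilder(LENGTH);
        for (int i = 0; i < LENGTH; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}
```

With 2,821,109,907,456 possible IDs, retries stay rare as long as only a small fraction of them is in use.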
Sounds like a job for a hash function. You're not 100% guaranteed that a hash function will return a unique identifier, but it works most of the time. Hash collisions must be dealt with separately, but there are many standard techniques for you to look into.
Specifically how you deal with collisions depends on what you're using this unique identifier for. If it's a simple one-way identifier where you give your program the ID and it returns the data, then you can simply use the next available ID in the case of a collision.
I want to devise an algorithm which takes a set of values and distributes it uniformly over a much larger range, e.g. I have 1000 values and want to distribute them over a range of 2^16 values.
Also, the input values can change continuously, and I need to keep passing each input value through the hash function so that it gets distributed uniformly over my output range.
What hashing algorithm should I use for this?
I am writing the code in Java.
If you're just hashing integers, here's one way.
public class Hasho {
private static final long LARGE_PRIME = 948701839L;
private static final long LARGE_PRIME2 = 6920451961L;
public static void main(String[] args) {
for (int i = 0; i < 100; i++) {
System.out.println(i + " -> " + hash(i));
}
}
public static int hash(int i) {
// Spread out values
long scaled = (long) i * LARGE_PRIME;
// Fill in the lower bits
long shifted = scaled + LARGE_PRIME2;
// Add to the lower 32 bits the upper bits which would be lost in
// the conversion to an int.
long filled = shifted + ((shifted & 0xFFFFFFFF00000000L) >> 32);
// Pare it down to 31 bits in this case. Replace 7 with F if you
// want negative numbers or leave off the `& mask` part entirely.
int masked = (int) (filled & 0x7FFFFFFF);
return masked;
}
}
This is merely an example to show how it can be done. There is some serious math in a professional quality hash function.
I'm sure this has a name, but this is what we used to do with ISAM files back in the dark ages:
Increment a number, e.g. 16001
Reverse the string, i.e. 10061, and you have your hash
You might want to reverse the string bitwise instead
This produces a nice even spread. We used to use it with job numbers so that you could retrieve the job fairly easily, so if you have a 'magic number' candidate this can be useful.
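Both variants of the reversal trick can be sketched in a few lines. The class and method names are hypothetical. The digit version assumes a non-negative counter and returns a string so that leading zeros survive (16000 becomes "00061"). The bit version assumes a 16-bit counter, matching the 2^16 output range from the question.

```java
final class ReversedCounter {
    // Reverses the decimal digits of a non-negative counter.
    static String reverseDigits(int n) {
        return new StringBuilder(Integer.toString(n)).reverse().toString();
    }

    // Reverses the low 16 bits: Integer.reverse flips all 32 bits,
    // then the unsigned shift moves the result back into 0..65535.
    static int reverseBits16(int n) {
        return Integer.reverse(n) >>> 16;
    }
}
```

Consecutive counters 1, 2, 3 map to 32768, 16384, 49152 under `reverseBits16`, which spreads neighbouring inputs far apart.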