I have an object with a String that holds a unique id (such as "ocx7gf" or "67hfs8").
I need to supply it with an implementation of int hashCode() which will be unique, obviously.
How do I convert a String to a unique int in the easiest/fastest way?
Thanks.
Edit - OK. I already know that String.hashCode() is possible, but it is not recommended anywhere. And if a method is not recommended, should I use it anyway when my object is in a collection and I need the hash code? Should I concatenate the id to another string to make it more likely to be unique?
No, you don't need to have an implementation that returns a unique value, "obviously", as obviously the majority of implementations would be broken.
What you want to do, is to have a good spread across bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, then just using the hashcode of the string itself would be best.
With special knowledge of the limits of your id format, it may be possible to customise and result in better performance, though false assumptions are more likely to make things worse than better.
Edit: On good spread of bits.
As stated here and in other answers, being completely unique is impossible and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact upon performance, so we want collisions to be rare.
Further, hashes are generally re-hashed, so our 32-bit number may end up being reduced to, e.g., one in the range 0 to 22, and we want as good a distribution within that as possible too.
We also want to balance this with not taking so long to compute our hash, that it becomes a bottleneck in itself. An imperfect balancing act.
A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:
return X ^ Y;
While this does a perfectly good job of spreading the 2^64 possible inputs across the 2^32 possible outputs, in real-world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on) which all hash to zero, or matching pairs ({2, 3} and {3, 2}) which will hash to the same number. We are likely better served by:
return ((X << 16) | (X >>> 16)) ^ Y;
Now, there are just as many possible values for which this is dreadful as for the former, but it tends to serve better in real-world cases.
Of course, the job is different if you are writing a general-purpose class (no idea what the possible inputs are) versus having a better idea of the purpose at hand. For example, if I were using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, month and lower digits of the year, over the standard one. The writer of Date, though, can't work on such knowledge and has to try to cater for everyone.
Hence, if I knew for instance that a given string is always going to consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, but it isn't clear from your question that it does), then I might use an algorithm that assigns a value from 0 to 35 (the 36 possible values for each character) to each character, and then walks through the string, each time multiplying the current value by 36 and adding the value of the next char.
Assuming a good spread in the ids, this would be the way to go, especially if I made the order such that the lower-significant digits in my hash matched the most frequently changing char in the id (if such a call could be made), hence surviving re-hashing to a smaller range well.
However, lacking such knowledge of the format for sure, I can't make that call with certainty, and I could well be making things worse (slower algorithm for little or even negative gain in hash quality).
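As a sketch of that idea (assuming the ids really are case-insensitive [a-z0-9]; the method name is mine, not from the question):

```java
// Rolling base-36 hash: assign each character a value 0-35, then
// accumulate hash = hash * 36 + value for each character in turn.
static int base36Hash(String id) {
    int h = 0;
    for (int i = 0; i < id.length(); i++) {
        char c = Character.toLowerCase(id.charAt(i));
        int v = (c <= '9') ? c - '0' : c - 'a' + 10; // 0-35
        h = h * 36 + v;
    }
    return h;
}
```

For 6-character ids this covers 36^6 values, slightly more than the positive int range, so the very top of the range wraps around; for a hash (unlike a unique id) that is acceptable.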
One advantage you have is that since it's an ID in itself, then presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.
You can't get a unique integer from a String of unlimited length. There are 4 billionish (2^32) unique integers, but an almost infinite number of unique strings.
String.hashCode() will not give you unique integers, but it will do its best to give you differing results based on the input string.
EDIT
Your edited question says that String.hashCode() is not recommended. This is not true, it is recommended, unless you have some special reason not to use it. If you do have a special reason, please provide details.
Looks like you've got a base-36 number there (a-z + 0-9). Why not convert it to an int using Integer.parseInt(s, 36)? Obviously, if there are too many unique IDs, it won't fit into an int, but in that case you're out of luck with unique integers and will need to get by using String.hashCode(), which does its best to be close to unique.
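For example (with the overflow caveat spelled out: a six-character base-36 id can be as large as 36^6 - 1 = 2,176,782,335, which exceeds Integer.MAX_VALUE, so parseInt can throw NumberFormatException for some ids):

```java
// Parse the id as a base-36 number. "ocx7gf" happens to fit in an int,
// but the largest six-character id does not, so a long may be needed.
int id = Integer.parseInt("ocx7gf", 36);
long big = Long.parseLong("zzzzzz", 36); // too big for an int
```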
Unless your strings are limited in some way or your integers hold more bits than the strings you're trying to convert, you cannot guarantee the uniqueness.
Let's say you have a 32 bit integer and a 64-character character set for your strings. That means six bits per character. That will allow you to store five characters into an integer. More than that and it won't fit.
Represent each character by a five-bit code, e.g. a as 00001, b as 00010, and so on, which gives 32 possible combinations. For example, cat becomes 00011 00001 10100 (c = 3, a = 1, t = 20); converting that binary to decimal gives 3124, so cat would be 3124. Similarly, you can get cat back from 3124 by converting it to binary first and mapping each five-bit group back to its letter.
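A minimal sketch of that packing scheme (lowercase a-z only, up to six characters in 32 bits; the method names are illustrative):

```java
// Pack each letter into 5 bits (a = 1 ... z = 26).
static int pack(String s) {
    int n = 0;
    for (char c : s.toCharArray()) n = (n << 5) | (c - 'a' + 1);
    return n;
}

// Reverse the packing: read 5 bits at a time from the low end.
static String unpack(int n) {
    StringBuilder sb = new StringBuilder();
    while (n != 0) {
        sb.append((char) ('a' + (n & 31) - 1));
        n >>>= 5;
    }
    return sb.reverse().toString();
}
```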
One way to do it is to assign each letter a value (a = 1, b = 2, and so on) and to give each position in the string its own multiplier: everything in the first position (reading left to right) is multiplied by one prime, the next position by the next prime, and so on, such that the final position is multiplied by a prime larger than the number of possible values in that position (26 + 1 for a space, or 52 + 1 with capitals, and so on for other supported characters). If the number can be mapped back to the first position (the leftmost character), then any number you generate from a unique string maps back to its first letter and gives a unique value.
Dog might be 30, 3(15), 101(7) or 782, while God 33, 3(15), 101(4) or 482. More important than the strings generated being unique, they can be useful in generation if the original digit is kept: 30(782) would be unique compared to some 12(782), for the purposes of differentiating similar strings if you ever managed to go over the unique possibilities. Dog would always be Dog, but it would never be Cat or Mouse.
Related
From a given string I am generating a 32-digit hash code using MD5:
MessageDigest.getInstance("MD5")
  .digest("SOME-BIG-STRING".getBytes).map("%02x".format(_)).mkString
//output: 47a8899bdd7213fb1baab6cd493474b4
Is it possible to generate a 30-digit hash instead of 32 digits, and what problems would that cause?
Is there another hash algorithm that supports 30-character output with a low collision probability across 1 trillion unique strings?
Security is not important; uniqueness is required.
For generating unique IDs from strings, hash functions are never the correct answer.
What you would need is to define a one-to-one mapping of text strings (such as "v1.0.0") onto 30-character-long strings (such as "123123..."). This is also known as a bijection, although in your case an injection (a simple one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As the other answer at the time of this writing notes, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.
However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.
See also this question: How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?.
Indeed, if you use hash functions, the chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there). If we take MD5 as an example (which produces any of 2^128 hash codes), then roughly speaking, the chance of accidental collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion.
But MD5 and other hash functions are not the right way to do what you want to do. This is discussed next.
If you can't restrict the format of your input strings to 30 digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs.
This database table should have two columns: one column holds your input strings (e.g., "<UUID>-NAME-<UUID>"), and the other column holds randomly generated IDs associated with those strings. Since random numbers don't ensure uniqueness, every time you create a new random ID you will need to check whether the random ID already exists in the database, and if it does exist, try a new random ID (but the chance that a duplicate is found will shrink as the size of the ID grows).
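A rough sketch of that retry loop, with an in-memory Set standing in for the database table (the class name and the ~100-bit ID width are my own assumptions, not from the answer):

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

class IdAllocator {
    private final Set<String> taken = new HashSet<>(); // stand-in for the DB column
    private final SecureRandom rnd = new SecureRandom();

    String newId() {
        while (true) {
            // ~100 random bits, rendered in base 36
            String candidate = new BigInteger(100, rnd).toString(36);
            // Set.add() returns false on a duplicate, so we retry
            if (taken.add(candidate)) return candidate;
        }
    }
}
```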
Is it possible to generate a 30-digit hash instead of 32 digits, and what problems would that cause?
Yes.
You increase the probability of a collision by a factor of 2^8 = 256, since dropping two hex digits discards 8 bits of the hash.
Is there another hash algorithm that supports 30-character output with a low collision probability across 1 trillion unique strings?
Probably. Taking the first 30 hex digits of a hash produced by any crypto-strength hash algorithm has roughly equivalent uniqueness properties.
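A hedged sketch of that truncation in Java (the helper name is mine):

```java
// Hex-encode an MD5 digest and keep the first 30 of its 32 hex digits.
static String md5Hex30(String input) throws Exception {
    java.security.MessageDigest md = java.security.MessageDigest.getInstance("MD5");
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest(input.getBytes("UTF-8"))) {
        hex.append(String.format("%02x", b)); // two hex digits per byte
    }
    return hex.substring(0, 30);
}
```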
Security is not important; uniqueness is required?
In that case, the fact that MD5 is no longer considered secure is moot. (Note that the reason MD5 is no longer considered secure is that it is computationally feasible to engineer a collision; i.e. to find two different inputs that produce the same MD5 hash.)
However, uniqueness of hashes cannot be guaranteed. Even with a "perfect" crypto-strength hash function that generates N-bit hashes, the probability of a collision for any 2 arbitrary (different) inputs is one in 2^N. For large enough values of N, the probability is very small. But it is never zero.
I am looking for a mechanism by which I can represent a set of strings using unique numbers, so that when I want to sort them I can use the numbers to sort these values.
For example, this is what I have in mind:
I am keeping a fixed-length number of 20 digits.
Each letter is represented by its alphabetical-order value (a = 01, b = 02, and so on).
cat - (03)(01)(20)(00)(00)(00)(00)(00)(00)(00) - 03012000000000000000
cataract - (03)(01)(20)(01)(18)(01)(03)(20)(00)(00) - 03012001180103200000
capital - (03)(01)(16)(09)(20)(01)(12)(00)(00)(00) - 03011609200112000000
So if I sort it based on the numbers, it will sort and say
capital, cat, cataract
Is this a good way of doing this?
Is there any other way for doing this so that I have more accuracy?
Thanks,
Sen
If your string length is fixed and your character set is fixed to, say, 100 different characters, you could treat each character of your string as a digit of a base-100 number to turn the string into a double.
If your set of strings is much smaller than the set of possible strings you could hash them and for collisions define the sort order arbitrarily but consistently.
In a specific case I probably wouldn't recommend either of those, but as a SUPER GENERAL solution to what you stated it works. But if you ask what seems to be a theoretical question, a theoretical answer seems appropriate.
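The asker's fixed-width scheme itself can be sketched like this (a = 01 through z = 26, zero-padded to 20 digits; note that a 20-digit value overflows a long, so compare the encoded strings directly, or use BigInteger):

```java
// Encode each letter as a two-digit value and pad to 20 digits, so that
// numeric (or plain string) comparison sorts words alphabetically.
static String encode(String word) {
    StringBuilder sb = new StringBuilder();
    for (char c : word.toLowerCase().toCharArray())
        sb.append(String.format("%02d", c - 'a' + 1));
    while (sb.length() < 20) sb.append('0');
    return sb.toString();
}
```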
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Short version - I'm looking for a Java algorithm that given a String and an integer representing a number of buckets returns which bucket to place the String into.
Long version - I need to distribute a large number of objects into bins, evenly (or approximately evenly). The number of bins/buckets will vary, so the algorithm can't assume a particular number of bins. It may be 1, 30, or 200. The key for these objects will be a String.
The String has some predictable qualities that are important. The first 2 characters of the string actually appear to be a hex representation of a byte, i.e. 00-ff, and the strings themselves are quite evenly distributed within that range. There are a couple of outliers that start differently though, so this can't be relied on 100% (though easily 99.999%). This just means that edge cases do need to be handled.
It's critical that once all the strings have been distributed that there is zero overlap in range between what values appear in any 2 bins. So, that if I know what range of values appear in a bin, I don't have to look in any other bins to find the object. So for example, if I had 2 bins, it could be that bin 0 has Strings starting with letters a-m and bin 1 starting with n-z. However, that wouldn't satisfy the need for even distribution given what we know about the Strings.
Lastly, the implementation can have no knowledge of the current state of the bins. The method signature should literally be:
public int determineBucketIndex(String key, int numBuckets);
I believe that the foreknowledge about the distribution of the Strings should be sufficient.
EDIT: Clarifying for some questions
Number of buckets can exceed 256. The strings do contain additional characters after the first 2, so this can be leveraged.
The buckets should hold a range of Strings to enable fast lookup later. In fact, that's why they're being binned to begin with. With only the knowledge of ranges, I should be able to look in exactly 1 bucket to see if the value is there or not. I shouldn't have to look in others.
Hashcodes won't work. I need the buckets to contain only String within a certain range of the String value (not the hash). Hashing would lose that.
EDIT 2: Apparently not communicating well.
After bins have been chosen, these values are written out to files. 1 file per bin. The system that uses these files after binning is NOT Java. It's already implemented, and it needs values in the bins that fit within a range. I repeat, hashcode will not work. I explicitly said the ranges for strings cannot overlap between two bins, using hashcode cannot work.
I have read through your question twice and I still don't understand the constraints. Therefore, I am making a suggestion here and you can give feedback on it. If this won't work, please explain why.
First, do some math on the number of bins to determine how many bits you need for a unique bin number: take the logarithm to base 2 of the number of bins, then take the ceiling of that bit count divided by 8. This is the number of bytes of data you need, numBytes.
Take the first two letters and convert them to a byte. Then grab numBytes - 1 characters and convert them to bytes. Take the ordinal value of the character ('A' becomes 65, and so on). If the next characters could be Unicode, pick some rule to convert them to bytes... probably grab the least significant byte (modulus by 256). Get numBytes bytes total, including the byte made from the first two letters, and convert to an integer. Make the byte from the first two letters the least significant 8 bits of the integer, the next byte the next 8 significant bits, and so on. Now simply take the modulus of this value by the number of bins, and you have an integer bin number.
If the string is too short and there are no more characters to turn into byte values, use 0 for each missing character.
If there are any predictable characters (for example, the third character is always a space) then don't use those characters; skip past them.
Now, if this doesn't work for you, please explain why, and then maybe we will understand the question well enough to answer it.
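Under those assumptions, the scheme might be sketched like this (the fallback for non-hex outliers is my own addition, and note that taking a modulus does not keep each bucket to a contiguous range of keys):

```java
static int determineBucketIndex(String key, int numBuckets) {
    // Bits needed for a bucket number, rounded up to whole bytes.
    int bits = 32 - Integer.numberOfLeadingZeros(Math.max(numBuckets - 1, 1));
    int numBytes = (bits + 7) / 8;

    int value;
    try {
        value = Integer.parseInt(key.substring(0, 2), 16); // leading hex byte
    } catch (NumberFormatException | IndexOutOfBoundsException e) {
        value = key.isEmpty() ? 0 : key.charAt(0) & 0xFF;  // outlier fallback
    }
    // Add further character bytes as the more significant bytes,
    // using 0 for each missing character.
    for (int i = 1; i < numBytes; i++) {
        int b = (i + 1 < key.length()) ? key.charAt(i + 1) & 0xFF : 0;
        value |= b << (8 * i);
    }
    return Math.floorMod(value, numBuckets);
}
```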
answer edited after 2 updates to original post
It would have been an excellent idea to include all the information in your question from the start. With your new edits, your description already gives you the answer: stick your objects into a balanced tree (giving you the homogeneous distribution you say you need) based on the hashCode of your string's substring(0, 2) or something similarly head-based. Then write each leaf (being a set of strings) in the B-tree to a file.
I seriously doubt that the problem, as described, can be done perfectly. How about this:
Create 257 bins.
Put all normal Strings into bins 0-255.
Put all the outliers into bin 256.
Other than the "even distribution", doesn't this meet all your requirements?
At this point, if you really want a more even distribution, you could reorganize bins 0-255 into a smaller number of more evenly distributed bins. But I think you may just have to lessen the requirements there.
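A sketch of that 257-bin approach (the helper names are mine):

```java
// Keys with a leading hex byte map to bins 0-255, preserving range
// order; anything else is an outlier and goes to bin 256.
static int binFor(String key) {
    if (key.length() >= 2 && isHex(key.charAt(0)) && isHex(key.charAt(1))) {
        return Integer.parseInt(key.substring(0, 2), 16);
    }
    return 256; // the outlier bin
}

static boolean isHex(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
}
```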
I am storing certain entities in my database with integer Ids of size 32 bits thus using the range of -2.14 billion to +2.14 billion.
I have tried giving some meaning to my ids, due to which the positive range has been used up rather quickly. I am now looking to use the negative integer range of -2.14 billion to 0.
Wanted to know, if you could see any downsides of using negative integers as ids, though personally I don't see any downsides.
There is an old saying in database design that goes like this: "Intelligent keys are not". You should never design for special meaning in an id when a descriptive attribute is more appropriate.
Given that dumb keys are only compared for equality, sign or lack thereof has no impact.
I'm curious why Object.toString() returns this:
return getClass().getName() + "#" + Integer.toHexString(hashCode());
as opposed to this:
return getClass().getName() + "#" + hashCode();
What benefits does displaying the hash code as a hex rather than a decimal buy you?
The Short Answer:
Hash Codes are usually displayed in hexadecimal because this way it is easier for us to retain them in our short-term memory, since hexadecimal numbers are shorter and have a larger character variety than the same numbers expressed in decimal.
Also, (as supercat states in a comment,) hexadecimal representation tends to prevent folks from trying to assign some meaning to the numbers, because they don't have any. (To use supercat's example, Fnord#194 is absolutely not the 194th Fnord; it is just Fnord with some unique number next to it.)
The Long Answer:
Decimal is convenient for two things:
Doing arithmetic
Estimating magnitude
However, these operations are inapplicable to hashcodes. You are certainly not going to be adding hashcodes together in your head, nor would you ever care how big a hashcode is compared to another hashcode.
What you are likely to be doing with hashcodes is the one and only thing that they were intended for: to tell whether two hash codes possibly refer to the same object, or definitely refer to different objects.
In other words, you will be using them as unique identifiers or mnemonics for objects. Thus, the fact that a hashcode is a number is in fact entirely irrelevant; you might as well think of it as a hash string.
Well, it just so happens that our brains find it a lot easier to retain in short-term memory (for the purpose of comparison) short strings consisting of 16 different characters, than longer strings consisting of only 10 different characters.
To further illustrate the analogy by taking it to absurdity, imagine if hash codes were represented in binary, where each number is far longer than in decimal, and has a much smaller character variety. If you saw the hash code 010001011011100010100100101011 now, and again 10 seconds later, would you stand the slightest chance of being able to tell that you are looking at the same hash code? (I can't, even if I am looking at the two numbers simultaneously. I have to compare them digit by digit.)
On the opposite end lies the tetrasexagesimal numbering system, which means base 64. Numbers in this system consist of:
the digits 0-9, plus:
the uppercase letters A-Z, plus:
the lowercase letters a-z, plus:
a couple of symbols like '+' and '/' to reach 64.
Tetrasexagesimal obviously has a much greater character variety than lower-base systems, and it should come as no surprise that numbers expressed in it are admirably terse. (I am not sure why the JVM is not using this system for hashcodes; perhaps some prude feared that chance might lead to certain inconvenient four-letter words being formed?)
So, on a hypothetical JVM with 32-bit object hash codes, the hash code of your "Foo" object could look like any of the following:
Binary: com.acme.Foo#11000001110101010110101100100011
Decimal: com.acme.Foo#3251989283
Hexadecimal: com.acme.Foo#C1D56B23
Tetrasexagesimal: com.acme.Foo#31rMiZ
Which one would you prefer?
I would definitely prefer the tetrasexagesimal, and in lack of that, I would settle for the hexadecimal one. Most people would agree.
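To experiment with these representations in Java (base 64 isn't built into Integer, so only the first three are one-liners; a tetrasexagesimal rendering would need a custom alphabet):

```java
int hash = 0xC1D56B23; // the example hash code from above

String binary  = Integer.toBinaryString(hash);
String decimal = Integer.toUnsignedString(hash); // unsigned, as in the answer
String hex     = Integer.toHexString(hash);      // what Object.toString() uses
```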
One web site where you can play with conversions is here:
https://www.mobilefish.com/services/big_number/big_number.php
Object.hashCode used to be computed based on a memory location where the object is located. Memory locations are almost universally displayed as hexadecimal.
The default return value of toString isn't so much interested in the hash code but rather in a way to uniquely identify the object for the purpose of debugging, and the hash code serves well for the purpose of identification (in fact, the combination of class name + memory address is truly unique; and while a hash code isn't guaranteed to be unique, it often comes close).