I'm attempting to write my own hash function in Java. I'm aware that this is the same one that Java implements, but I wanted to test it out myself. I'm getting collisions when I input different values and am not sure why.
public static int hashCodeForString(String s) {
    int m = 1;
    int myhash = 0;
    for (int i = 0; i < s.length(); i++, m++) {
        myhash += s.charAt(i) * Math.pow(31, (s.length() - m));
    }
    return myhash;
}
Kindly remember just how a hash-table (in any language ...) actually works: it consists of a (usually, prime) number of "buckets." The purpose of the hash-function is simply to convert any incoming key-value into a bucket-number. (The worst-case scenario is always that 100% of the incoming keys wind-up in a single bucket, leaving you with "a linked list.") You simply strive to devise a hash-function that will "typically" produce a "widely scattered" distribution of values so that, when calculated modulo the (prime ...) number of buckets, "most of the time, most of the buckets" will be "more-or-less equally" filled. (But remember: you can never be sure.)
"Collisions" are entirely to be expected: in fact, "they happen all the time."
In my humble opinion, you're "over-thinking" the hash-function: I see no compelling reason to use Math.pow() at all. Expect that any value which you produce will be converted to a hash-bucket number by taking its absolute value modulo the number of buckets. The best way to see if you came up with a good one (for your data ...) is to observe the resulting distribution of bucket-sizes. (Is it "good enough" for your purposes yet?)
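One concrete source of collisions in the question's code is worth spelling out: Math.pow returns a double, and the compound assignment myhash += ... narrows the double sum back to int. That narrowing saturates at Integer.MAX_VALUE instead of wrapping around, so once the terms get large (strings of roughly seven characters or more), every input ends up with the same value. A minimal sketch that avoids Math.pow by evaluating the same polynomial with Horner's rule in plain int arithmetic (the class name HornerHash is made up for this example):

```java
final class HornerHash {
    // Same polynomial as String.hashCode(): s[0]*31^(n-1) + ... + s[n-1],
    // evaluated with Horner's rule so that int overflow simply wraps around.
    static int hashCodeForString(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        String sample = "a fairly long sample string";
        // Compare against the built-in implementation.
        System.out.println(hashCodeForString(sample) == sample.hashCode());
    }
}
```

Collisions between distinct strings are still possible with this version, as with any 32-bit hash, but they should no longer be caused by saturation.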
The arrays below are sorted and contain no duplicates (unique positive integers). They are small (fewer than 5000 elements), and the intersection (see below) is computed billions of times, so any micro-optimization matters. This article nicely describes how to speed up the code below in C.
int i = 0, j = 0, c = 0, la = a.length, lb = b.length;
intersection = new int[Math.min(la, lb)];
while (i < la && j < lb) {
    if (a[i] < b[j]) i++;
    else if (a[i] > b[j]) j++;
    else {
        intersection[c] = a[i];
        i++; j++; c++;
    }
}
int[] intersectionZip = new int[c];
System.arraycopy(intersection, 0, intersectionZip, 0, c);
In Java I guess it is impossible to call those low-level instructions. But they mention that "it is possible to improve this approach using branchless implementation". How would one do it? Using switch? Or maybe by replacing the a[i] < b[j], a[i] > b[j] or a[i] == b[j] comparisons with binary operations on integer operands?
A binary search approach (with complexity O(la log(lb))) does not apply here because la is not much smaller than lb. I'm interested in how to change the if statements.
I don't think there's much you could do to improve the performance of that Java code. However, I would note that it is not doing the same thing as the C version. The C version is putting the intersection into an array that was preallocated by the caller. The Java version allocates the array itself ... and then reallocates and copies to a smaller array when it is finished.
I guess you could change the Java version to make two passes over the input arrays, with the first one working out how big the output array needs to be ... but whether it helps or hinders will depend on the inputs.
There might be other special cases you could optimize for; e.g. if there were likely to be long runs of numbers in one array with nothing in that range in the other array you might be able to "optimistically" try to skip multiple numbers in one go; i.e. increment i or j by a larger number than 1.
But they mention that "it is possible to improve this approach using branchless implementation". How one would do it? Using switch?
Not a Java switch ... or a conditional expression because they both involve branches when translated to the native code.
I think he is referring to something like this: Branchless code that maps zero, negative, and positive to 0, 1, 2
FWIW it is a bad idea to try to do this kind of thing in Java. The problem is that the performance of tricky code sequences like that is dependent on details of the hardware architecture, instruction set, clock counts, etc that vary from one platform to the next. The Java JIT compiler's optimizer can do a pretty good job of optimizing your code ... but if you include tricky sequences:
it is not at all obvious or predictable how they will be translated to native code, and
you may well find that the trickiness actually inhibits useful optimizations that the JIT compiler might otherwise be able to do.
Having said that, it is not impossible that some future release of Java might include a superoptimizer ... along the lines of the one mentioned on the linked Q&A above ... that would be able to generate branchless sequences automatically. But bear in mind that superoptimization is very expensive to perform.
Maybe using ? : operator:
(a[i] < b[j]) ? i++ : ((a[i] > b[j]) ? j++ : ....
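For illustration only, here is a minimal sketch of what a branchless merge step could look like in Java. Whether it beats the branching loop depends on the JIT compiler and the hardware, as discussed above, so measure before adopting it. It relies on the question's assumption that all values are positive ints, so x - y cannot overflow and its sign bit tells which side is smaller. The class and method names are made up for this example, and the caller is assumed to pass an output array of at least min(a.length, b.length) elements.

```java
final class BranchlessIntersect {
    // Writes the intersection of two sorted, duplicate-free arrays of positive ints
    // into out and returns the number of matches. The loop body has no
    // data-dependent if/else; only the loop condition itself remains a branch.
    static int intersect(int[] a, int[] b, int[] out) {
        int i = 0, j = 0, c = 0;
        while (i < a.length && j < b.length) {
            int x = a[i], y = b[j];
            out[c] = x;               // always store; only kept if c advances
            int lt = (x - y) >>> 31;  // 1 if x < y, else 0
            int gt = (y - x) >>> 31;  // 1 if x > y, else 0
            c += 1 - (lt | gt);       // advance output only on equality
            i += 1 - gt;              // advance i when x <= y
            j += 1 - lt;              // advance j when x >= y
        }
        return c;
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7};
        int[] b = {3, 4, 5, 8};
        int[] out = new int[Math.min(a.length, b.length)];
        int n = intersect(a, b, out);
        System.out.println(java.util.Arrays.toString(java.util.Arrays.copyOf(out, n)));
    }
}
```

Passing in a preallocated output array also mirrors the C version, which avoids the allocation and copy at the end of the original Java loop.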
This was a question asked in one of the interviews that I recently attended.
As far as I know, a random number between two numbers can be generated as follows:
public static int rand(int low, int high) {
    return low + (int)(Math.random() * (high - low + 1));
}
But here I am using Math.random() to generate a random number between 0 and 1 and using that to help me generate a number between low and high. Is there any other way to do this directly, without using external functions?
Typical pseudo-random number generators calculate new numbers based on previous ones, so in theory they are completely deterministic. The only randomness is guaranteed by providing a good seed (initialization of the random number generation algorithm). As long as the random numbers aren't very security critical (this would require "real" random numbers), such a recursive random number generator often satisfies the needs.
The recursive generation can be expressed without any "external" functions, once a seed was provided. There are a couple of algorithms solving this problem. A good example is the Linear Congruential Generator.
A pseudo-code implementation might look like the following:
long a = 25214903917L;   // These values for a and c are the actual values found
long c = 11L;            // in the implementation of java.util.Random, see link
long m = (1L << 48) - 1; // java.util.Random keeps only the low 48 bits of state
long previous = 0;
void rseed(long seed) {
    previous = seed;
}
long rand() {
    long r = (a * previous + c) & m;
    // Note: typically, one chooses only a couple of bits of this value, see link
    previous = r;
    return r;
}
You still need to seed this generator with some initial value. This can be done by doing one of the following:
Using something like the current time (good in most non-security-critical cases like games)
Using hardware noise (good for security-critical randomness)
Using a constant number (good for debugging, since you always get the same sequence)
If you can't use any function and don't want to use a constant seed, and if you are using a language which allows this, you could also use some uninitialized memory. In C and C++ for example, define a new variable, don't assign anything to it and use its value to seed the generator. But note that reading an uninitialized variable yields an indeterminate value and is generally undefined behavior, so this is far from being a "good seed" and only a hack to fulfill your requirements. Never use this in real code.
Note that there is no algorithm which can generate different values for different runs with the same inputs without access to some external sources like the system environment. Every well-seeded random number generator makes use of some external sources.
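To make the idea concrete, here is a minimal Java sketch of such a generator. The multiplier, increment, 48-bit mask, seed scrambling and the choice of the 32 high-order bits follow java.util.Random; the class name Lcg and the method names next32 and rand are made up for this example. The seed is the only external input.

```java
final class Lcg {
    private static final long A = 25214903917L;      // multiplier used by java.util.Random
    private static final long C = 11L;               // increment used by java.util.Random
    private static final long MASK = (1L << 48) - 1; // keep 48 bits of state

    private long state;

    Lcg(long seed) {
        // java.util.Random scrambles the seed this way before first use.
        state = (seed ^ A) & MASK;
    }

    // Advance the state and return its 32 high-order bits.
    int next32() {
        state = (A * state + C) & MASK;
        return (int) (state >>> 16);
    }

    // Value in [low, high]; assumes low <= high and that high - low + 1 fits in an int.
    // The modulo introduces a slight bias, which is ignored in this sketch.
    int rand(int low, int high) {
        int range = high - low + 1;
        return low + (next32() >>> 1) % range;
    }

    public static void main(String[] args) {
        Lcg lcg = new Lcg(System.nanoTime()); // the current time serves as the seed
        for (int i = 0; i < 5; i++) {
            System.out.println(lcg.rand(1, 6));
        }
    }
}
```

With a constant seed the sequence is fully reproducible, which is handy for debugging but unsuitable for anything security-related.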
Here are some possible seed sources, with comments you may find helpful:
System time: monotonic within a day, so poor randomness. Fast and easy.
Mouse position: random, but not useful on a standalone system.
Raw socket / local network (the info part of packets): good randomness, but technical and time-consuming. It is possible to model an attack to reduce the randomness.
Some input text with permutation: fast, common, and good too (in my opinion).
Timing of interrupts from the keyboard, disk drive and other events: a common approach, but error-prone if not used carefully.
Analog noise signal: another approach is to feed in an analog noise signal, for example temperature.
/proc file data: on Linux systems. I feel you should use this.
/proc/sys/kernel/random:
This directory contains various parameters controlling the operation of the file /dev/random.
The character special files /dev/random and /dev/urandom (present since Linux
1.3.30) provide an interface to the kernel's random number generator.
Try these commands:
$ cat /dev/urandom
and
$ cat /dev/random
You can write a file-read function that reads from one of these files.
Read (also suggests): Is a rand from /dev/urandom secure for a login key?
Does System.currentTimeMillis() count as external? You could always get this and take it modulo the size of the range:
int rand = (int)(System.currentTimeMillis() % (high - low + 1)) + low;
You can get near randomness (actually chaotic and definitely not uniform*) from the logistic map x = 4x(1-x) starting with a "non-rational" x between 0 and 1.
The "randomness" appears because of the rounding errors at the edge of the accuracy of the floating point representation.
(*)You can undo the skewing once you know it is there.
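A minimal sketch of the logistic map idea, assuming a starting value strictly between 0 and 1. The raw values cluster near 0 and 1; for this particular map the transformation (2/pi) * asin(sqrt(x)) is known to flatten that skew into a uniform distribution on [0, 1]. The class and method names are made up for this example, and the Math calls are of course "external functions" in the strict sense of the question.

```java
final class LogisticMap {
    private double x;

    // start is assumed to lie strictly between 0 and 1 (and not be 0.5 or 0.75,
    // which lead straight to fixed points of the map).
    LogisticMap(double start) {
        x = start;
    }

    // One step of x = 4x(1-x); the raw values are not uniformly distributed.
    double nextRaw() {
        x = 4.0 * x * (1.0 - x);
        return x;
    }

    // Undo the skew of the raw values.
    double nextUniform() {
        return 2.0 / Math.PI * Math.asin(Math.sqrt(nextRaw()));
    }

    public static void main(String[] args) {
        LogisticMap map = new LogisticMap(0.3);
        for (int i = 0; i < 5; i++) {
            System.out.println(map.nextUniform());
        }
    }
}
```

Successive values are strongly correlated (each one determines the next), so this is a curiosity rather than a substitute for a proper generator.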
You may use the address of a variable or combine the address of more variables to make a more complex one...
You could get the current system time, but that would also require a function in most languages.
You can do it without external functions if you are allowed to use some external state (e.g. a long initialised with the current system time). This is enough for you to implement a simple pseudo-random number generator.
In each call to your random function, you would use the state to create a new random value, and update the state, so that subsequent calls get different results.
You can do this with just regular Java arithmetic and/or bitwise operations, so no external functions are required.
import java.util.HashMap;
import java.util.Map;

public class RandomNumberGenerator {
    // Returns a value in [min, max] derived from the current time in milliseconds.
    int generateRandomNumber(int min, int max) {
        return (int) (System.currentTimeMillis() % (max - min + 1)) + min;
    }

    public static void main(String[] args) {
        RandomNumberGenerator rn = new RandomNumberGenerator();
        int cv = 0;
        int min = 1, max = 4;
        Map<Integer, Integer> hmap = new HashMap<Integer, Integer>();
        int count = min;
        // Busy-wait until every value in [min, max] has been printed once.
        while (count <= max) {
            cv = rn.generateRandomNumber(min, max);
            if ((hmap.get(cv) == null) && cv >= min && cv <= max) {
                System.out.print(cv + ",");
                hmap.put(cv, 1);
                count++;
            }
        }
    }
}
Poisson Random Generator
Let's say we start with an expected value 'v' of the random numbers. To say that a sequence of non-negative integers follows a Poisson distribution with expected value v means that, over long subsequences, the mean (average) of the values approaches v.
The Poisson distribution is part of statistics, and the details can be found on Wikipedia.
Here, the main advantages of using this function are:
1. Only integer values are generated.
2. The mean of those integers will approach the value we initially provided.
It is helpful in applications where fractional values don't make sense. For example, saying that 2.5 planes arrive at an airport in 1 minute doesn't make sense, but it implies that 5 planes arrive in 2 minutes.
#include <math.h>   /* exp */
#include <stdlib.h> /* rand, RAND_MAX */

int poissonRandom(double expectedValue) {
    int n = 0; // counter of iterations
    double limit;
    double x; // pseudo random number
    limit = exp(-expectedValue);
    x = (double) rand() / RAND_MAX; // cast first: integer division would yield 0
    while (x > limit) {
        n++;
        x *= (double) rand() / RAND_MAX;
    }
    return n;
}
The expression
(double) rand() / RAND_MAX
should generate a random number between 0 and 1. If rand() is not available, the system time could be used instead:
seconds / 60.0 will serve the purpose.
Which function we should use is totally application dependent.
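For readers following along in Java, the same multiplication loop can be sketched as below. Using java.util.Random as the source of uniform values is an assumption of this example (the answer above leaves the source open), and the class name PoissonSketch is made up. The main method checks the second property informally by averaging many samples.

```java
import java.util.Random;

final class PoissonSketch {
    // Multiply uniform samples until the product drops to exp(-expectedValue) or below;
    // the number of extra multiplications needed is the Poisson-distributed result.
    static int poissonRandom(double expectedValue, Random rng) {
        double limit = Math.exp(-expectedValue);
        int n = 0;
        double x = rng.nextDouble();
        while (x > limit) {
            n++;
            x *= rng.nextDouble();
        }
        return n;
    }

    public static void main(String[] args) {
        Random rng = new Random(1L); // fixed seed so that runs are repeatable
        int samples = 100_000;
        long sum = 0;
        for (int i = 0; i < samples; i++) {
            sum += poissonRandom(2.5, rng);
        }
        System.out.println("mean = " + (double) sum / samples);
    }
}
```

Because exp(-expectedValue) becomes tiny for large expected values, this loop is best suited to small means.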
In the Wikipedia article on binary search there is a section called "Deferred detection of equality" which presents a somewhat "optimized" version of binary search as follows:
int binary_search(int A[], int key, int imin, int imax)
{
    while (imax > imin)
    {
        int imid = (imin + imax) / 2;
        if (A[imid] < key)
            imin = imid + 1;
        else
            imax = imid;
    }
    if((imax == imin) && (A[imin] == key))
        return imin;
    else
        return KEY_NOT_FOUND;
}
It is claimed that this is a better version than the conventional textbook binary search since the .... algorithm uses only one conditional branch per iteration.
Is this true? I mean, the if instructions are translated into CMP and branch instructions in assembly, so I cannot see how an if-else is better than an if-else if-else.
Is there such a difference that I should take into account in higher level languages? The code of the "deferred" version seems tighter, I admit, but are there optimizations or penalties in how you form if-else statements?
The key concept is that it uses one less conditional per iteration. That is, the equality check has been moved outside the while loop so that it only runs once, while in the basic version it would need to be checked every time¹.
That said, I'm not sure if there would actually be a measurable difference when using the optimized form. For example, consider that:
If all you are comparing is two integers then the compiler can detect that it can compute the comparison result just once and then evaluate which branch to take just as well.
Binary search is O(logN), so the number of iterations taken would actually be very small even if the number of elements to search is quite large. It's arguable whether you'd see any difference.
The implementation of modern CPUs features such as speculative execution and branch prediction (especially in "nice" algorithms like binary search) might very well have more visible effects than this optimization (out of my league to check though).
Notes:
¹ Actually it is another condition that doesn't need to be checked when the equality comparison moves out, but conceptually there is no difference.
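For comparison, here is a minimal Java sketch of the deferred-equality variant. It assumes a sorted int array, uses -1 as the not-found marker, and computes the midpoint in a way that cannot overflow; the class and method names are made up for this example.

```java
final class DeferredBinarySearch {
    static final int KEY_NOT_FOUND = -1;

    // One comparison against the key per iteration; equality is tested once at the end.
    static int search(int[] a, int key) {
        if (a.length == 0) {
            return KEY_NOT_FOUND;
        }
        int imin = 0, imax = a.length - 1;
        while (imax > imin) {
            int imid = imin + (imax - imin) / 2; // avoids overflow of imin + imax
            if (a[imid] < key) {
                imin = imid + 1;
            } else {
                imax = imid;
            }
        }
        return a[imin] == key ? imin : KEY_NOT_FOUND;
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 5, 7, 9};
        System.out.println(search(data, 7));
        System.out.println(search(data, 4));
    }
}
```

A side effect of deferring the equality test is that, when the key occurs several times, this variant returns the index of its first occurrence. Whether it is measurably faster than the three-way version is something only a benchmark on the target platform can tell.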
I want to implement a String comparison function that doesn't take a different amount of time depending on the number of characters that match or the position of the first mismatch. I assume there must be a library out there somewhere that provides this, but I was unable to find it via a quick search.
So far, the best idea I've got is to sum the XOR of each character and return whether or not the sum is 0. However, I'm pretty sure this wouldn't work so well with Unicode. I also have a vague concern that HotSpot would do some optimizations that would change my constant-time property, but I can't think of a specific optimization that would do this off the top of my head.
Thanks.
UPDATE: Sorry, I don't believe I was clear. I'm not looking for O(1), I'm looking for something that won't leak timing information. This would be used to compare hashed password values, and if the time it took to compare was different based on where the first mismatch occurred, that would be leaking information to an attacker.
I want to implement a String comparison function that doesn't take a
different amount of time depending on the number of characters that
match or the position of the first mismatch. I assume there must be a
library out there somewhere that provides this, but I was unable to
find it via a quick search.
So far, the best idea I've got is to sum the XOR of each character
Do you see the contradiction?
update:
To the updated, and therefore different question:
Can you first measure how much time comparing two Strings takes, as a constant part plus parts that depend on the lengths of the two strings?
a + b*s(1).length + c*s(2).length + d*f(s(1), s(2))?
Is there an upper bound of characters for String 1 and 2?
If so, pick a target time that depends on the machine, for example 0.01ms for the longest strings you expect. Measure the time the comparison takes and stay idle until the target time is reached, maybe plus a random factor of up to 10% of that time.
If the length of the input is not limited, you could choose the target time so that it fits 99%, 99.9% or 99.99% of typical input, depending on your security needs and the speed of the machine. If the program is interacting with the user, a delay of up to 0.2s is normally experienced as an instant reaction, so it wouldn't annoy the user if your code sleeps for 0.19994s while doing real calculations for 0.00006s.
I see two immediate possibilities for not leaking password-related information in timing:
1/ Pad both the password string and candidate string out to 1K, with a known, fixed character (like A). Then run the following (pseudo-code):
match = true
for i = 0 to 1023:
    if password[i] != candidate[i]:
        match = false
That way, you always run the same number of loop iterations for the comparison, regardless of where the strings match.
There's no need to muck about with xor since you can still do a simple comparison, but without exiting the loop early.
Just set the match flag to false if a mismatch is found and keep going. Once the loop exits (taking the same time regardless of size or content of password and candidate), then check whether it matched.
2/ Just add a large (relative to the normal comparison time) but slightly random delay at the end of the comparison. For example, a random value between 0.9 and 1.1 seconds. The time taken for the comparison should be swamped by the delay and the randomness should fully mask any information leakage (unless your randomness algorithm leaks information, of course).
That also has the added advantage of preventing brute force attacks since a password check takes at least about a second.
This should take approximately the same time for any Strings of matching length. It's constant-time with a big constant.
public static boolean areEqualConstantTime(String a, String b) {
    if (a.length() != b.length()) {
        return false;
    }
    if (a.isEmpty()) {
        return true; // both empty; also avoids a modulo by zero below
    }
    boolean equal = true;
    for (long i = 0; i < (long) Integer.MAX_VALUE; i++) {
        if (a.charAt((int) (i % a.length())) != b.charAt((int) (i % b.length()))) {
            equal = false;
        }
    }
    return equal;
}
Edit
Wow, if you're just trying to avoid leaking timing information this facetious answer got pretty close to the mark! We can start with a naive approach like this:
public static boolean arePasswordsEqual(String a, String b) {
    if (a.isEmpty() || b.isEmpty()) {
        return a.isEmpty() && b.isEmpty(); // avoids a modulo by zero below
    }
    boolean equal = true;
    if (a.length() != b.length()) {
        equal = false;
    }
    for (int i = 0; i < MAX_PASSWORD_LENGTH; i++) {
        if (a.charAt(i % a.length()) != b.charAt(i % b.length())) {
            equal = false;
        }
    }
    return equal;
}
We need the MAX_PASSWORD_LENGTH constant because we can't simply use either the max or the min of the two input lengths as that would also leak timing information. An attacker could start with a very small guess and see how long the function takes. When the function time plateaus, he would know his password has the right length which eliminates much of the range of values he needs to try.
The following code is one common way to do a constant-time byte[] comparison in Java and avoid leaking password info via the time taken:
public static boolean isEqual(byte[] a, byte[] b) {
    if (a.length != b.length) {
        return false;
    }
    int result = 0;
    for (int i = 0; i < a.length; i++) {
        result |= a[i] ^ b[i];
    }
    return result == 0;
}
See http://codahale.com/a-lesson-in-timing-attacks/ for more discussion of this issue.
For inspiration, the current OpenJDK implementation is available in the isEqual method of java.security.MessageDigest.
(This assumes that the length of the secret is not sensitive, for example if it is a hash. You should pad both sides to the same length if that is not true.)
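As a usage sketch, the library method can be applied to two hash values like this. The helper names sha256 and sameSecret and the class name HashCompare are made up for this example, and plain SHA-256 is used here only to obtain fixed-length byte arrays; it is not meant as a recommendation for how to store passwords.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashCompare {
    // Hash a string to a fixed-length byte array so that length reveals nothing.
    static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Compare the two digests with the JDK's time-constant comparison.
    static boolean sameSecret(String expected, String candidate) {
        return MessageDigest.isEqual(sha256(expected), sha256(candidate));
    }

    public static void main(String[] args) {
        System.out.println(sameSecret("correct horse", "correct horse"));
        System.out.println(sameSecret("correct horse", "correct h0rse"));
    }
}
```

Working on encoded bytes rather than on chars also sidesteps the Unicode question raised above, since both sides are encoded the same way before hashing.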
This is essentially the same as your first suggestion:
So far, the best idea I've got is to sum the XOR of each character and return whether or not the sum is 0.
You asked:
However, I'm pretty sure this wouldn't work so well with Unicode.
This is a valid concern, but you need to clarify what you will accept as "equal" for a solution to be proposed. Luckily, you also say "This would be used to compare hashed password values", so I don't think that any of the unicode concerns will be in play.
I also have a vague concern that HotSpot would do some optimizations that would change my constant-time property,
Hopefully that's not true. I expect that the literature on how to avoid timing attacks in Java would address this if it were true, but I can't offer you any citations to back this up :-)
You can do constant time string comparisons if you intern your strings. Then you can compare them with == operator resulting in a constant time equality check.
Read this for further details.
The usual solution is to set a timer, and not return the result until the timer has expired.
The time taken to compare strings or hashes is not important, just set the time-out value sufficiently large.
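A minimal sketch of the timer idea, assuming the check is passed in as a BooleanSupplier and the minimum duration is given in nanoseconds (both are choices made for this example, as are the class and method names). It only hides timing differences if the deadline is longer than the slowest possible check.

```java
import java.util.function.BooleanSupplier;

final class FixedTimeCheck {
    // Runs the check, then waits until at least minNanos have passed since the start.
    static boolean checkWithDeadline(BooleanSupplier check, long minNanos) {
        long start = System.nanoTime();
        boolean result = check.getAsBoolean();
        long remaining;
        while ((remaining = minNanos - (System.nanoTime() - start)) > 0) {
            try {
                Thread.sleep(remaining / 1_000_000, (int) (remaining % 1_000_000));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // give up waiting if the thread is interrupted
            }
        }
        return result;
    }

    public static void main(String[] args) {
        boolean ok = checkWithDeadline(() -> "secret".equals("guess"), 50_000_000L);
        System.out.println(ok);
    }
}
```

On a busy server, holding a thread asleep for every check has a cost of its own, so the deadline should be chosen with that in mind.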
I have a string which is lost forever. The only thing I have about it is some magic hash number. Now I have a new string, which could be similar or equal to the lost one. I need to find out how close it is.
Integer savedHash = 352736;
String newText = "this is new string";
if (Math.abs(hash(newText) - savedHash) < 100) {
    // wow, they are very close!
}
Are there any algorithms for this purpose?
ps. The length of the text is not fixed.
pps. I know how usual hash codes work. I'm interested in an algorithm that will work differently, giving me the functionality explained above.
ppps. In a very simple scenario this hash() method would look like:
public int hash(String txt) {
    return txt.length();
}
Standard hashing will not work in this case since close hash values do not imply close strings. In fact, most hash functions are designed to give close strings very different values, so as to create a random distribution of hash values for any given set of input strings.
If you had access to both strings, then you could use some kind of string distance function, such as Levenshtein distance. This calculates the edit distance between two strings, or the number of edits required to transform one to the other.
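For reference, the Levenshtein distance mentioned here can be computed with a small dynamic-programming table. The sketch below keeps only two rows of that table; the class and method names are made up for this example. It needs both strings, so it does not solve the lost-string problem by itself.

```java
final class EditDistance {
    // Minimum number of single-character insertions, deletions and substitutions
    // needed to turn s into t.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // distance from the empty prefix of s
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```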
In this case however, the best approach might be to use some kind of fuzzy hashing technique. That way you don't have to store the original string, and can still get some measure of similarity.
If the hashes don't match then the strings are different.
If the hashes match then the strings are probably the same.
There is nothing else you can infer from the hash value.
No, this isn't going to work. The similarity of a hash bears no relation to the similarity of the original strings. In fact, it is entirely possible for 2 different strings to have the same hash. All you can say for sure is that if the hashes are different the strings were different.
[Edited in light of comment, possibility of collision is of course very real]
Edit for clarification:
If you only have the hash of the old string then there is no way you are going to find the original value of that string. There is no algorithm that would tell you if the hashes of 2 different strings represented strings that were close, and even if there was it wouldn't help. Even if you find a string that has an exact hash match with your old string there is still no way you would know if it was your original string, as any number of strings can produce the same hash value. In fact, there is a vast* number of strings that can produce the same hash.
[In theory this vast number is actually infinite but on any real storage system you can't generate an infinite number of strings. In any case your chance of matching an unknown string via this approach is very slim unless your hashes are large in relation to the input string, and even then you will need to brute force your way through every possible string]
As others have pointed out, with a typical hash algorithm, it just doesn't work like that at all.
There are, however, a few people who've worked out algorithms that are at least somewhat similar to that. For one example, there's a company called "Xpriori" that has some hashing (or at least hash-like) algorithms that allow things like that. They'll let you compare for degree of similarity, or (for example) let you combine hashes so hash(a) + hash(b) == hash(a+b) (for some definition of +, not just simple addition of the numbers). Like with most hashes, there's always a possibility of collision, so you have some chance of a false positive (but by picking the hash size, you can set that chance to an arbitrarily small value).
As such, if you're dealing with existing data, you're probably out of luck. If you're creating something new, and want capabilities on this order, it is possible -- though trying to do it on your own is seriously non-trivial.
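One publicly described technique in this spirit is SimHash, where similar inputs tend to produce fingerprints that differ in only a few bits. The sketch below is a simplified illustration, not the proprietary algorithm mentioned above. The 3-character shingles, the 32-bit width and the mixing function are arbitrary choices made for this example, and all names are made up. It only helps if the stored "magic number" was produced with the same scheme in the first place.

```java
final class SimHashSketch {
    // 32-bit fingerprint: every 3-character shingle votes on every bit position.
    // Strings shorter than 3 characters have no shingles and yield 0.
    static int simHash(String text) {
        int[] votes = new int[32];
        for (int i = 0; i + 3 <= text.length(); i++) {
            int h = mix(text.substring(i, i + 3).hashCode());
            for (int bit = 0; bit < 32; bit++) {
                votes[bit] += ((h >>> bit) & 1) == 1 ? 1 : -1;
            }
        }
        int fingerprint = 0;
        for (int bit = 0; bit < 32; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1 << bit;
            }
        }
        return fingerprint;
    }

    // Bit mixer to spread the shingle hash over all 32 bits.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Number of differing bits between two fingerprints (0 means identical).
    static int distance(int f1, int f2) {
        return Integer.bitCount(f1 ^ f2);
    }

    public static void main(String[] args) {
        int saved = simHash("The quick brown fox jumps over the lazy dog");
        int fresh = simHash("The slow brown fox jumps over the crazy dog");
        System.out.println("differing bits: " + distance(saved, fresh));
    }
}
```

A small bit distance suggests, but does not prove, that the texts are similar; with only 32 bits, unrelated strings can land close together by chance.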
No. Hashes are designed so that minor variations in the input string cause huge differences in the resulting hash. This is very useful for dictionary implementations, as well as verifying the integrity of a file (a single changed bit will cause a completely different hash). So no, it's not some kind of thing you can ever use as an inequality comparison.
If the hashCodes are different it cannot be the same String, however many Strings can have the same hashCode().
Depending on the nature of the Strings, doing a plain comparison could be more efficient than comparing the hashCode(): the hash has to inspect and perform a calculation on every character, whereas a comparison can stop early, e.g. if the lengths differ or as soon as it sees a different character.
Any good hashing algorithm will by definition NEVER yield similar hashes for similar arguments. Otherwise, it would be too easy to crack. If the hashed value of "aaaa" looks similar to that of "aaab", then that is a poor hash. I have cracked ones like that before without too much difficulty (fun puzzle to solve!). But you never know, maybe your hash algorithm is poor. Any idea what it is?
If you have time, you can just brute force this solution by hashing every possible word. Not elegant, but possible. Easier if you know the length of the original word as well.
If it is a standard hash algorithm, like MD5, you can find websites that already have large mappings of source and hash, and get the answer that way. Try http://hashcrack.com/
I used this website successfully after one of our devs left and I needed to recover a password.
Cheers,
Daniel
You can treat the string as a really big number, but that's about the extent of your abilities in the general situation. If you have a specific problem domain, you may be able to compress a representation of the string to something smaller without losses, but still it will not be very useful.
For example, if you are working with individual words, you can use soundex to compare how similar two words will sound...
The best you can do with traditional hash codes will be to compare two strings for equality vs likely inequality. False positives are possible, but there will be no false negatives. You cannot compare for similarity this way, though.
A normal hash code changes a lot when the object changes a bit. It is made to distinguish different objects and doesn't care how much they resemble each other. Therefore the answer is no.
Well, it seems you want not a real hash of the string, but some kind of fingerprint of it. Because you want it to be 32 bits, one way could be:
Calculate the Pearson correlation coefficient between the first and second half of the string (if the string has an odd number of chars, add some padding) and store this number as a 32-bit floating point number. But I'm not sure how reliable this method will be.
==EDIT==
Here is C example code (un-optimized) which implements this idea (a little bit modified):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
float mean(char *str) {
    char *x;
    float sum = 0.0;
    for(x=str; *x!='\0'; x++) {
        sum += (float) *x;
    }
    return sum/strlen(str);
}
float stddev(char *str) {
    char *x;
    float sum = 0.0;
    float u = mean(str);
    for(x=str; *x!='\0'; x++) {
        sum += ((float)*x - u)*((float)*x - u);
    }
    return sqrt(sum/strlen(str));
}
float covariance(char *str1, char *str2) {
    int i;
    int im = fmin(strlen(str1),strlen(str2));
    float sum = 0.0;
    float u1 = mean(str1);
    float u2 = mean(str2);
    for(i=0; i<im; i++) {
        sum += ((float)str1[i] - u1)*((float)str2[i] - u2);
    }
    return sum/im;
}
float correlation(char *str1, char *str2) {
    float cov = covariance(str1,str2);
    float dev1 = stddev(str1);
    float dev2 = stddev(str2);
    return cov/(dev1*dev2);
}
float string_fingerprint(char *str) {
    int len = strlen(str);
    char *rot = (char*) malloc((len+1)*sizeof(char));
    int i;
    // rotate string by CHAR_COUNT/2
    for(i=0; i<len; i++){
        rot[i] = str[(i+len/2)%len];
    }
    rot[len] = '\0';
    // now calculate correlation between original and rotated strings
    float corr = correlation(str,rot);
    free(rot);
    return corr;
}
int main() {
    char string1[] = "The quick brown fox jumps over the lazy dog";
    char string2[] = "The slow brown fox jumps over the crazy dog";
    float f1 = string_fingerprint(string1);
    float f2 = string_fingerprint(string2);
    if (fabs(f1 - f2) < 0.2) {
        printf("wow, they are very close!\n");
    }
    return 0;
}
hth!