Mapping hash values to a range, with minimal collisions


Context
Hi, I'm working on an assignment for school that asks us to implement a hash table in Java. There are no requirements that collisions be kept to a minimum, but low collision rate and speed seem to be the two most sought-after qualities in all the reading that I've done.
Problem
I'd like some guidance on how to map the output of a hash function to a smaller range, without having >20% of my keys collide (yikes).
In all of the algorithms that I've explored, keys are mapped to the entire range of an unsigned 32 bit integer (or in many cases, 64, even 128 bit). I'm not finding much about this on here, Wikipedia, or in any of the hash-related articles / discussions I've come across.
In terms of the specifics of my implementation, I'm working in Java (mandate of my school), which is problematic since there are no unsigned types to work with. To get around this, I've been using the 64-bit long integer type, then using a bit mask to map back down to 32 bits. Instead of simply truncating, I XOR the top 32 bits with the bottom 32, then perform a bitwise AND to mask out any upper bits that might result in a negative value when I cast it down to a 32 bit integer. After all that, a separate function compresses the resulting hash value down to fit into the bounds of the hash table's inner array.
It ends up looking like:
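int hash( String key ) {
long h = 0;
for( int i = 0; i < key.length(); i++ ) {
// do some stuff with each character in the key
}
h = h ^ ( h >>> 32 ); // fold the top 32 bits into the bottom 32
return (int) ( h & 2147483647 ); // clear the sign bit so the int is non-negative
}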
Where the inner-loop depends on the hash function (I've implemented a few: polynomial hashing, FNV1, SuperFastHash, and a custom one tailored to the input data).
They basically all perform horribly. I have yet to see <20% of keys collide. Even before I compress the hash values down to array indices, none of my hash functions will get me fewer than 10k collisions. My inputs are two text files, each ~220,000 lines. One is English words, the other is random strings of varying length.
My lecture notes recommend the following, for compressing the hashed keys:
(hashed key) % P
Where P is the largest prime < the size of the inner array.
Is this an accepted method of compressing hash values? I have a feeling it isn't, but since performance is so poor even before compression, I have a feeling it's not the primary culprit, either.
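For reference, here is a minimal sketch of the compression step described in the lecture notes, assuming the hash is an int that may be negative. The class and method names (HashCompression, largestPrimeBelow, compress) are made up for illustration and are not from the assignment. Math.floorMod is used instead of % so that negative hashes still land on a valid index.

```java
class HashCompression {
    // Largest prime strictly below n, found by trial division (fine for table-sized n).
    // Returns -1 if no such prime exists (n <= 2).
    static int largestPrimeBelow(int n) {
        for (int c = n - 1; c >= 2; c--) {
            if (isPrime(c)) {
                return c;
            }
        }
        return -1;
    }

    static boolean isPrime(int c) {
        if (c < 2) {
            return false;
        }
        for (long d = 2; d * d <= c; d++) {
            if (c % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Maps any int hash, including negative ones, to an index in [0, p).
    static int compress(int hash, int p) {
        return Math.floorMod(hash, p);
    }
}
```

One consequence of taking P smaller than the array size is that the slots from P up to the end of the array are never used as a home index, so in practice it is simpler to make the array size itself a prime.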

I don't know if I understand your concrete problem well, but I will try to help with hash performance and collisions.
Hash-based collections decide which bucket stores a key-value pair based on the key's hash value. Inside each bucket there is a structure (a linked list in HashMap's case) where the pair is stored.
If the hash value is usually the same, the bucket will usually be the same, so performance degrades a lot. Let's see an example:
Consider this class
package hashTest;
import java.util.Hashtable;
public class HashTest {
public static void main (String[] args) {
Hashtable<MyKey, String> hm = new Hashtable<>();
long ini = System.currentTimeMillis();
for (int i=0; i<100000; i++) {
MyKey a = new HashTest().new MyKey(String.valueOf(i));
hm.put(a, String.valueOf(i));
}
System.out.println(hm.size());
long fin = System.currentTimeMillis();
System.out.println("tiempo: " + (fin-ini) + " mls");
}
private class MyKey {
private String str;
public MyKey(String i) {
str = i;
}
public String getStr() {
return str;
}
@Override
public int hashCode() {
return 0;
}
@Override
public boolean equals(Object o) {
if (o instanceof MyKey) {
MyKey aux = (MyKey) o;
if (this.str.equals(aux.getStr())) {
return true;
}
}
return false;
}
}
}
Note that hashCode in class MyKey always returns 0. That is allowed by the hashCode contract (http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()). If we run the program, this is the result:
100000
tiempo: 62866 mls
That is very poor performance. Now let's change MyKey's hashCode:
package hashTest;
import java.util.Hashtable;
public class HashTest {
public static void main (String[] args) {
Hashtable<MyKey, String> hm = new Hashtable<>();
long ini = System.currentTimeMillis();
for (int i=0; i<100000; i++) {
MyKey a = new HashTest().new MyKey(String.valueOf(i));
hm.put(a, String.valueOf(i));
}
System.out.println(hm.size());
long fin = System.currentTimeMillis();
System.out.println("tiempo: " + (fin-ini) + " mls");
}
private class MyKey {
private String str;
public MyKey(String i) {
str = i;
}
public String getStr() {
return str;
}
@Override
public int hashCode() {
return str.hashCode() * 31;
}
@Override
public boolean equals(Object o) {
if (o instanceof MyKey) {
MyKey aux = (MyKey) o;
if (this.str.equals(aux.getStr())) {
return true;
}
}
return false;
}
}
}
Note that only hashCode in MyKey has changed. Now when we run the code, the result is:
100000
tiempo: 47 mls
Performance is dramatically better now, with only a minor change. It is very common practice to return the hash code multiplied by a prime number (in this case 31), computed from the same members that the equals method uses to decide whether two objects are the same (in this case only str).
I hope this little example points you toward a solution for your problem.
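To make the bucket argument concrete, here is a rough sketch that counts how many distinct bucket indices the keys "0" to "n-1" would occupy. It assumes a bucket index of floorMod(hash, tableSize), which is a simplification chosen for illustration and not necessarily what Hashtable does internally. BucketSpread and distinctBuckets are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

class BucketSpread {
    // Counts the distinct bucket indices used by the keys "0".."n-1".
    // constantHash = true mimics the hashCode() that always returns 0;
    // constantHash = false mimics str.hashCode() * 31.
    static int distinctBuckets(int n, int tableSize, boolean constantHash) {
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < n; i++) {
            int h = constantHash ? 0 : String.valueOf(i).hashCode() * 31;
            used.add(Math.floorMod(h, tableSize)); // assumed bucket mapping
        }
        return used.size();
    }
}
```

With the constant hash every key shares one bucket, so each insert has to walk past all the earlier entries; with the string-based hash the keys spread over many buckets.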

Related

Direct Recursion vs While Loop for time complexity performance

I was wondering how time complexity compares between these two methods. I have written the first findEmpty function and a friend wrote the 2nd. Both more or less achieve the same thing, however, I'm unsure which exactly computes faster (if at all) and why?
These examples come from an implementation of a hashtable class we've been working on. This function finds the next empty location in the array after the given parameters and returns it. Data is stored in the array "arr" as a Pair object containing a key and a value.
I believe this would run at O(1):
private int findEmpty(int startPos, int stepNum, String key) {
if (arr[startPos] == null || ((Pair) arr[startPos]).key.equals(key)) {
return startPos;
} else {
return findEmpty(getNextLocation(startPos), stepNum++, key);
}
}
I believe this would run at O(n):
private int findEmpty(int startPos, int stepNum, String key) {
while (arr[startPos] != null) {
if (((Pair) arr[startPos]).key.equals(key)) {
return startPos;
}
startPos = getNextLocation(startPos);
}
return startPos;
}
Here is the code for the Pair object and getNextLocation:
private class Pair {
private String key;
private V value;
public Pair(String key, V value) {
this.key = key;
this.value = value;
}
}
private int getNextLocation(int startPos) {
int step = startPos;
step++;
return step % arr.length;
}
I expect my understanding is off and probably haven't approached this question as concisely as possible, but I appreciate and welcome any corrections.

Your solution has the same time complexity as your friend's. Both are linear in the length of your array. Recursion does not reduce the time complexity to O(1), because it keeps calling getNextLocation until it finds the key or an empty slot.
Also, in your getNextLocation function,
private int getNextLocation(int startPos, int stepNum) {
int step = startPos;
step++;
return step % arr.length;
}
The second parameter, stepNum, is never used in this function, and it should be removed from all your functions to make them easier to read and understand. Please write concise and clean code from the beginning.
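As an illustration of why the probe is O(n) in the worst case, here is a sketch of the same linear probe with an explicit bound on the number of steps. It uses a plain String[] of keys instead of the Pair array for brevity, and ProbeSketch is a made-up name. Note that both versions in the question never terminate normally when the table is full and the key is absent (the loop spins forever, the recursion overflows the stack); the bound below avoids that.

```java
class ProbeSketch {
    // Returns the index of the first null slot, or of the slot already holding key,
    // probing linearly from startPos. Gives up with -1 after keys.length probes,
    // which happens only when the table is full and the key is not present.
    static int findEmpty(String[] keys, int startPos, String key) {
        int pos = startPos;
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[pos] == null || keys[pos].equals(key)) {
                return pos;
            }
            pos = (pos + 1) % keys.length; // same step as getNextLocation
        }
        return -1;
    }
}
```

The loop body runs at most keys.length times, which is exactly the linear bound mentioned above.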

Severe Java performance drop after changing ID strings

Summary
We have recently changed our String-based ID schema in a complex retrieval engine and observed a severe performance drop. In essence, we changed the IDs from XXX-00000001 to X384840564 (see below for details on the ID schema) and suffer from almost doubled runtimes. Choosing a different string hash function solved the problem, but we still lack a good explanation. Thus, our questions are:
Why do we see such a strong performance drop when changing from the old to the new ID schema?
Why does our solution of using the “parent hash” actually work?
To approach the problem, we hereafter provide (a) detailed information about the ID schemata and hash functions used, (b) a minimal working example in Java that highlights the performance defect, and (c) our performance results and observations.
(Despite the lengthy description, we have already massively reduced the code example to 4 performance critical lines – see phase 2 in the listing.)
(a) Old and new ID schema; hash functions
Our ID objects consist of a parent ID object (string of 16 characters in [A-Z0-9]) and a child ID string. The same parent ID string is on average used by 1–10 child IDs. The old child IDs had a three-letter prefix, a dash, and a zero-padded running index number of length 8, for example, XXX-00000001 (12 characters in total; X may be any letter [A-Z]). The new child IDs have one letter and 9 non-consecutive digits, for example, X384840564 (10 characters in total, X may be any letter [A-Z]). An obvious difference is that the old child ID strings are often recurring (i.e., the string ABC-00000002 occurs with multiple different parent IDs, as the running index typically starts with 1), while the new child IDs with their arbitrary digit combinations typically occur only a few times or even only with a single parent ID.
Since the ID objects are put into HashSets and HashMaps, the choice of a hash function seems crucial. Currently, the system uses the standard string hash for the parent IDs. For the child IDs, we used to XOR the string hashes of parent and child ID – called XOR hash henceforth. In theory, this should distribute different child IDs quite well. As a variant, we experimented with using only the string hash of the parent ID as the hash code of the child ID – called parent hash henceforth. That is, all child IDs sharing the same parent ID share the same hash. In theory, the parent hash could be suboptimal, as all children sharing the same parent ID end up in the same bucket, while the XOR hash should yield a better data distribution.
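For readers comparing ways to combine two hash codes, here is a small sketch contrasting the XOR hash with an order-sensitive combination (multiply by 31 and add). The second variant is not something we used in the system; it is shown only as a common alternative, and CombineHashes and its method names are hypothetical. Nothing here is meant to explain the timings reported below.

```java
class CombineHashes {
    // The XOR hash: symmetric, and identical inputs cancel out to 0.
    static int xorCombine(int parentHash, int childHash) {
        return parentHash ^ childHash;
    }

    // An order-sensitive alternative: swapping the arguments generally changes the result.
    static int orderedCombine(int parentHash, int childHash) {
        return 31 * parentHash + childHash;
    }
}
```

For example, xorCombine(a, b) always equals xorCombine(b, a), while orderedCombine(1, 2) is 33 and orderedCombine(2, 1) is 63.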
(b) Minimal working example
Please refer to the following listing (explanation below):
package mwe;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
public class Main {
private static final Random RANDOM = new Random(42);
private static final String DIGITS = "0123456789";
private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + DIGITS;
private static final int NUM_IDS = 5_000_000;
private static final int MAX_CHILDREN = 5;
private static final int REPETITIONS = 5;
private static final boolean RUN_ID_OLD = true; // e.g., 8IBKMAO2T1ORICNZ__XXX-00000002
private static final boolean RUN_ID_NEW = false; // e.g., 6TEG9R5JP1KHJN55__X580104176
private static final boolean USE_PARENT_HASH = false;
private static final boolean SHUFFLE_SET = false;
private abstract static class BaseID {
protected int hash;
public abstract BaseID getParentID();
@Override
public int hashCode() {
return this.hash;
}
}
private static class ParentID extends BaseID {
private final String id;
public ParentID(final String id) {
this.id = id;
this.hash = id.hashCode();
}
@Override
public BaseID getParentID() {
return null;
}
@Override
public boolean equals(final Object obj) {
if (this == obj) {
return true;
}
if (obj instanceof ParentID) {
final ParentID o = (ParentID) obj;
return this.id.equals(o.id);
}
return false;
}
@Override
public String toString() {
return this.id;
}
}
private static class ChildID extends BaseID {
private final String id;
private final BaseID parent;
public ChildID(final String id, final BaseID parent) {
this.id = id;
this.parent = parent;
// Initialize the hash code of the child ID:
if (USE_PARENT_HASH) {
// Only use the parent hash (i.e., all children have the same hash).
this.hash = parent.hashCode();
} else {
// XOR parent and child hash.
this.hash = parent.hashCode() ^ id.hashCode();
}
}
@Override
public BaseID getParentID() {
return this.parent;
}
@Override
public boolean equals(final Object obj) {
if (this == obj) {
return true;
}
if (this.hash != obj.hashCode()) {
return false;
}
if (obj instanceof ChildID) {
final ChildID o = (ChildID) obj;
final BaseID oParent = o.getParentID();
if (this.parent == null && oParent != null) {
return false;
}
if (this.parent != null && oParent == null) {
return false;
}
if (this.parent == null || !this.parent.equals(oParent)) {
return false;
}
return this.id.equals(o.id);
}
return false;
}
@Override
public String toString() {
return this.parent.toString() + "__" + this.id;
}
}
public static void run(final int repetitions, final boolean useVariant2IDs) throws IOException {
for (int i = 0; i < repetitions; i++) {
System.gc(); // Force memory reset for the next repetition.
// -- PHASE 1: CREATE DATA --------------------------------------------------------------------------------
// Fill a set of several millions random IDs. Each ID is a child ID with a reference to its parent ID.
// Each parent ID has between 1 and MAX_CHILDREN children.
Set<BaseID> ids = new HashSet<>(NUM_IDS);
for (int parentIDIdx = 0; parentIDIdx < NUM_IDS; parentIDIdx++) {
// Generate parent ID: 16 random characters.
final StringBuilder parentID = new StringBuilder();
for (int k = 0; k < 16; k++) {
parentID.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
}
// Generate between 1 and MAX_CHILDREN child IDs.
final int childIDCount = RANDOM.nextInt(MAX_CHILDREN) + 1;
for (int childIDIdx = 0; childIDIdx < childIDCount; childIDIdx++) {
final StringBuilder childID = new StringBuilder();
if (useVariant2IDs) {
// Variant 2: Child ID = letter X plus 9 random digits.
childID.append("X");
for (int k = 0; k < 9; k++) {
childID.append(DIGITS.charAt(RANDOM.nextInt(DIGITS.length())));
}
} else {
// Variant 1: Child ID = XXX- plus zero-padded index of length 8.
childID.append("XXX-").append(String.format("%08d", childIDIdx + 1));
}
final BaseID id = new ChildID(childID.toString(), new ParentID(parentID.toString()));
ids.add(id);
}
}
System.out.print(ids.iterator().next().toString());
System.out.flush();
if (SHUFFLE_SET) {
final List<BaseID> list = new ArrayList<>(ids);
Collections.shuffle(list);
ids = new LinkedHashSet<>(list);
}
System.gc(); // Clean up temp data before starting the timer.
// -- PHASE 2: INDEX DATA ---------------------------------------------------------------------------------
// Iterate over the ID set and fill a map indexed by parent IDs. The map values are irrelevant here, so
// use empty objects.
final long timer = System.currentTimeMillis();
final HashMap<BaseID, Object> map = new HashMap<>();
for (final BaseID id : ids) {
map.put(id.getParentID(), new Object());
}
System.out.println("\t" + (System.currentTimeMillis() - timer));
// Ensure that map and IDs are not GC:ed before the timer stops.
if (map.get(new ParentID("_do_not_gc")) == null) {
map.put(new ParentID("_do_not_gc"), new Object());
}
ids.add(new ParentID("_do_not_gc"));
}
}
public static void main(final String[] args) throws IOException {
if (RUN_ID_OLD) {
run(REPETITIONS, false);
}
if (RUN_ID_NEW) {
run(REPETITIONS, true);
}
}
}
In essence, the program first generates a HashSet of IDs and then indexes these IDs by their parent ID in a HashMap. In detail:
The first phase (PHASE 1) generates 5 million parent IDs, each with 1 to MAX_CHILDREN child IDs (5 in the listing), using either the old (e.g., XXX-00000001) or the new ID schema (e.g., X384840564) and one of the two hash functions. The generated child IDs are collected in a HashSet. We explicitly create new parent ID objects for each child ID to match the functionality of the original system. For experimentation, the IDs can optionally be shuffled in a LinkedHashSet to distort the hash-based ordering (cf. boolean SHUFFLE_SET).
The second phase (PHASE 2) simulates the performance-critical path. It reads all IDs (child IDs with their parents) from the HashSet and puts them into a HashMap with the parent IDs as keys (i.e., aggregate IDs by parent).
Note: The actual retrieval system has a more complex logic, such as reading IDs from multiple sets and merging child IDs as the map entry’s values, but it turned out that none of these steps was responsible for the strong performance gap in question.
The remaining lines try to control for the GC, such that the data structures are not GC:ed too early. We’ve tried different alternatives for controlling the GC, but the results seemed pretty stable overall.
When running the program, the constants RUN_ID_OLD and RUN_ID_NEW activate the old and the new ID schema, respectively (best activate only one at a time). USE_PARENT_HASH switches between the XOR hash (false) and the parent hash (true). SHUFFLE_SET distorts the item order in the ID set. All other constants can remain as they are.
(c) Results
All results here are based on a typical Windows desktop with OpenJDK 11. We also tested Oracle JDK 8 and a Linux machine, but observed similar effects in all cases. For the following figure, we tested each configuration in independent runs, whereas each run repeats the timing 5 times. To avoid outliers, we report the median of the repetitions. Note, however, that the timings of the repetitions do not differ much. The performance is measured in milliseconds.
Observations:
Using the XOR hash yields a substantial performance drop in the HashSet setting when switching to the new ID schema. This hash function seems suboptimal, but we lack a good explanation.
Using the parent hash function speeds up the process regardless of the ID schema. We speculate that the internal HashSet order is beneficial, since the resulting HashMap will build up the same order (because ID.hash = ID.parent.hash). Interestingly, this effect can also be observed if the HashSet is split into, say, 50 parts, each holding a random partition of the full HashSet. This leaves us puzzled.
The entire process seems to depend heavily on the reading order in the for loop of the second phase (i.e., the internal order of the HashSet). If we distort the order in the shuffled LinkedHashSet, we end up in a worst-case scenario, regardless of the ID schema.
In a separate experiment, we also diagnosed the number of collisions when filling the HashMap, but could not find obvious differences when changing the ID schema.
Who can shed more light on explaining these results?
Update
The image below shows some profiling results (using VisualVM) for the non-shuffled runs. Indent indicates nested calls. All percentage values are relative to the phase 2 timing (100%).
An obvious difference seems to be HashMap.putVal's self time. There was no obvious difference for treeifying buckets.

Why is my HashMap implementation 10 times slower than the JDK's?

I would like to know what makes the difference and what I should be aware of when I'm writing code.
I used the same parameters and methods, put() and get(), when testing, without printing.
I used System.nanoTime() to measure the runtime.
I tried it with int keys 1-10 and 10 values, so every single hash maps to a unique index, which is the optimal scenario.
My HashSet implementation, which is based on this, is almost as fast as the JDK's.
Here's my simple implementation:
public MyHashMap(int s) {
this.TABLE_SIZE=s;
table = new HashEntry[s];
}
class HashEntry {
int key;
String value;
public HashEntry(int k, String v) {
this.key=k;
this.value=v;
}
public int getKey() {
return key;
}
}
int TABLE_SIZE;
HashEntry[] table;
public void put(int key, String value) {
int hash = key % TABLE_SIZE;
while(table[hash] != null && table[hash].getKey() != key)
hash = (hash +1) % TABLE_SIZE;
table[hash] = new HashEntry(key, value);
}
public String get(int key) {
int hash = key % TABLE_SIZE;
while(table[hash] != null && table[hash].key != key)
hash = (hash+1) % TABLE_SIZE;
if(table[hash] == null)
return null;
else
return table[hash].value;
}
Here's the benchmark:
public static void main(String[] args) {
long start = System.nanoTime();
MyHashMap map = new MyHashMap(11);
map.put(1,"A");
map.put(2,"B");
map.put(3,"C");
map.put(4,"D");
map.put(5,"E");
map.put(6,"F");
map.put(7,"G");
map.put(8,"H");
map.put(9,"I");
map.put(10,"J");
map.get(1);
map.get(2);
map.get(3);
map.get(4);
map.get(5);
map.get(6);
map.get(7);
map.get(8);
map.get(9);
map.get(10);
long end = System.nanoTime();
System.out.println(end-start+" ns");
}

If you read the documentation of the HashMap class, you will see that it implements a hash table based on the hashCode of the keys. This is dramatically more efficient than a brute-force search if the map contains a non-trivial number of entries, assuming a reasonable key distribution among the "buckets" that it sorts the entries into.
That said, benchmarking on the JVM is non-trivial and easy to get wrong. If you're seeing big differences with small numbers of entries, it could easily be a benchmarking error rather than the code.
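To illustrate the benchmarking point, here is a rough sketch of a slightly more careful timing loop: it runs the task a few times untimed first, then takes the median of several timed runs. TimingSketch and medianNanos are made-up names, and this is still a crude approach; a dedicated harness such as JMH handles warm-up and JIT effects far more thoroughly.

```java
import java.util.Arrays;

class TimingSketch {
    // Runs task 'warmups' times untimed, then 'reps' times timed,
    // and returns the median duration of the timed runs in nanoseconds.
    static long medianNanos(Runnable task, int warmups, int reps) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[reps];
        for (int i = 0; i < reps; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[reps / 2];
    }
}
```

Timing twenty map operations once, from a cold start, mostly measures class loading and interpretation, so the same task should be passed to something like medianNanos for both implementations before comparing them.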

When it comes to performance, never assume anything.
Your assumption was "My HashSet implementation which is based on this is almost as fast as the JDK's". No, obviously it is not.
That is the tricky part when doing performance work: doubt everything unless you have measured with great accuracy. Worse, you even measured, and the measurement told you that your implementation is slower; and instead of checking your source, and the source of the thing you are measuring against; you decided that the measuring process must be wrong ...

Comparing Array Values and HashMap

I am making a rock paper scissors game and I'm supposed to save the last four throws of the user in a HashMap. The last four throws are held in a Pattern class. If the pattern is already in the HashMap, its value is incremented by one, showing that the user has repeated that pattern one more time. The patterns will be used to predict the user's next move. However, when I compare the two patterns, the one in the HashMap and the one I just passed in, it reports that they are the same even though they are not. I have been looking into this for a while but couldn't find out what's wrong. Some help would be greatly appreciated! The error comes right at the second input. If I input R, it is saved in the HashMap, but when I input anything else, it throws a NullPointerException. I think that is because the new pattern is not stored in the HashMap, yet I try to get its value since the program thinks it is equal to the one already inside the HashMap. I think the problem is inside equals() in Pattern, but I'm not entirely sure.
import java.util.*;
public class RockPaperScisors{
public static void main(String[] args){
Scanner key = new Scanner(System.in);
Pattern pattern = new Pattern();
Pattern pattern1;
Computer comp = new Computer();
boolean stop = false;
int full=0;
while ( !stop ){
System.out.println("Enter R P S. Enter Q to quit.");
char a = key.next().charAt(0);
if ( a == 'Q' ){
stop = true;
break;
}
pattern.newPattern(a);
char[] patt = pattern.getPattern();
for ( int i = 0; i < patt.length; i++ ){
System.out.print(patt[i] + " ");
}
pattern1 = new Pattern(patt);
comp.storePattern(pattern1);
System.out.println();
System.out.println("Patterns: " + comp.getSize());
full++;
}
}
}
import java.util.*;
public class Pattern{
private char[] pattern;
private int full = 0;
public Pattern(){
pattern = new char[4];
}
public Pattern(char[] patt){
pattern = patt;
}
public char[] getPattern(){
return pattern;
}
public void newPattern(char p){
if ( full <= 3 ){
pattern[full] = p;
full ++;
}
else{
for (int i = 0; i <= pattern.length-2; i++) {
pattern[i] = pattern[i+1];
}
pattern[pattern.length-1] = p;
}
}
public int HashCode(){
char[] a = pattern;
return a.hashCode();
}
public boolean equals( Object o ) {
if( o == this ) { return true; }
if(!(o instanceof Pattern)){ return false; }
Pattern s = (Pattern) o;
if (Arrays.equals(s.getPattern(), pattern))
System.out.println("Yes");
return Arrays.equals(s.getPattern(), pattern);
}
}
import java.util.*;
import java.util.Map.Entry;
public class Computer{
private HashMap<Pattern, Integer> map;
public Computer(){
map = new HashMap <Pattern, Integer>();
}
public void storePattern(Pattern p){
boolean contains = false;
for (Entry<Pattern, Integer> entry : map.entrySet())
{
Pattern patt = entry.getKey();
if ( p.equals(patt) ){
contains = true;
}
}
if ( contains ){
int time = map.get(p);
time++;
map.put(p, time);
}
else
map.put(p, 0);
}
public int getSize(){
return map.size();
}
}

Your HashCode is wrong.
The method name must start with a lower-case letter:
public int hashCode()
To make sure that the method is actually overridden, use the @Override annotation.

As noted by another answer, the first thing to do is rename your HashCode() method to hashCode() and annotate it.
And then, you also have to fix it. It uses
char[] a = pattern;
return a.hashCode();
This means it uses the char[] object's hashCode() function. But that function is inherited directly from Object, and gives you a different hash code for two equal character arrays. For example, try this:
char[] c = { 'a','b','c' };
char[] d = { 'a','b','c' };
System.out.printf("%d %d%n", c.hashCode(), d.hashCode());
And you'll see that it prints two different numbers.
So you need to use a better hash code function. You can make your own, or use Arrays.hashCode(pattern) (there is no need for the local a variable). The important thing is that when two Patterns are equal according to the equals() method, they should have the same hash code.
What happens in your case is that you look up the key by testing equality against all the entry keys (I'll get to that in a minute, it's a bad thing to do), so equals tells you that the key is in the hash map. But the hash map itself uses the hashCode() method in get() to locate the object. And according to the hashCode() method, there is no object in the hash map that has the same key!
So they must always agree when the objects are equal.
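Here is a minimal sketch of a key class whose hashCode() and equals() agree, using Arrays.hashCode and Arrays.equals on the character array. ThrowPattern is a hypothetical stand-in for the Pattern class, not a drop-in replacement. The constructor copies the array as an assumption on my part: as far as I can tell from the posted main, pattern.getPattern() hands the same internal array to every new Pattern, so without a copy all stored keys would change together each time a throw is recorded.

```java
import java.util.Arrays;

final class ThrowPattern {
    private final char[] throwsSoFar;

    ThrowPattern(char[] source) {
        // Defensive copy so later changes to 'source' cannot alter this key.
        this.throwsSoFar = source.clone();
    }

    @Override
    public int hashCode() {
        // Content-based hash: equal arrays give equal hash codes.
        return Arrays.hashCode(throwsSoFar);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ThrowPattern)) {
            return false;
        }
        return Arrays.equals(throwsSoFar, ((ThrowPattern) o).throwsSoFar);
    }
}
```

A key that changes after it has been put into a HashMap can no longer be found reliably, which is why the copy matters as much as the hash function.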
Now, as for your method of looking up the object:
boolean contains = false;
for (Entry<Pattern, Integer> entry : map.entrySet())
{
Pattern patt = entry.getKey();
if ( p.equals(patt) ){
contains = true;
}
}
if ( contains ){
int time = map.get(p);
time++;
map.put(p, time);
} else
map.put(p, 0);
This is not how you use a Map. The whole point of a HashMap is that you can check whether it contains a certain key in O(1). What you are doing is iterating over it and comparing, which is O(N) and very wasteful.
If you implement your hashCode() properly, you can just look it up by doing map.containsKey(p) instead of that loop. And if you are certain that you are not putting null values in the map, you can simply use get() to get your pattern:
Integer time = map.get(p);
if ( time == null ) {
map.put( p, 0 );
} else {
map.put( p, time+1);
}
(You don't need to use ++, because you are not actually using time after you put it in the map).
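On Java 8 or later, the same counting can also be written with Map.merge. The sketch below uses String keys instead of Pattern objects to stay self-contained, and PatternCounter is a made-up name; it keeps the question's convention that the first sighting stores 0 and each repeat adds 1.

```java
import java.util.HashMap;
import java.util.Map;

class PatternCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    // First sighting stores 0; every later sighting adds 1 to the stored value.
    void store(String pattern) {
        counts.merge(pattern, 0, (old, unused) -> old + 1);
    }

    // Number of repeats recorded for the pattern, or -1 if it was never stored.
    int repeats(String pattern) {
        return counts.getOrDefault(pattern, -1);
    }

    int size() {
        return counts.size();
    }
}
```

This only works for a custom key class once its hashCode() and equals() are consistent, as discussed above.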

It's entirely possible that the issue is in Pattern#HashCode.
The first issue is that it's not being used (it should be Pattern#hashCode); the second is that it's not calculating what you think it is.
You may find java.util.Arrays#hashCode very useful; changing the backing store from an array to a List would also work.
As a side note, Pattern is not a great choice for the name of that class, as it clashes with java.util.regex.Pattern. This is more of a problem in this case than it might be otherwise, as it can be used with Scanner.

Fast Incremental Hash in Java

I'm looking for a hash function to hash Strings. For my purposes (identifying changed objects during an import) it should have the following properties:
fast
can be used incrementally, i.e. I can use it like this:
Hasher h = new Hasher();
h.add("somestring");
h.add("another part");
h.add("eveno more");
Long hash = h.create();
without compromising the other properties or keeping the strings in memory during the complete process.
Secure against collisions. If I compare two hash values from different strings 1 million times per day for the rest of my life, the risk that I get a collision should be negligible.
It does not have to be secure against malicious attempts to create collisions.
What algorithm can I use? An algorithm with an existent free implementation in Java is preferred.
Clarification
The hash doesn't have to be a long. A String for example would be just fine.
The data to be hashed will come from a file or a db, with many 10MB or up to a few GB of data, that will get distributed into different Hashes. So keeping the complete Strings in memory is not really an option.

Hashes are a sensitive topic, and it is hard to recommend one based on your question alone. You might want to ask this question on https://security.stackexchange.com/ to get expert opinions on the usability of hashes in certain use cases.
What I understand so far is that most hashes are implemented incrementally at their very core; the execution time, on the other hand, is not that easy to predict.
Here are two Hasher implementations that rely on "an existent free implementation in Java". ShaHasher is constructed so that you can split your Strings arbitrarily before calling add() and get the same result, as long as you do not change the order of the characters in them. JavaHasher does not have that property: it mixes in one String.hashCode() per add() call, so its result depends on where the strings are split. Both are shown here:
import java.math.BigInteger;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
/**
* Created for https://stackoverflow.com/q/26928529/1266906.
*/
public class Hashs {
public static class JavaHasher {
private int hashCode;
public JavaHasher() {
hashCode = 0;
}
public void add(String value) {
hashCode = 31 * hashCode + value.hashCode();
}
public int create() {
return hashCode;
}
}
public static class ShaHasher {
public static final Charset UTF_8 = Charset.forName("UTF-8");
private final MessageDigest messageDigest;
public ShaHasher() throws NoSuchAlgorithmException {
messageDigest = MessageDigest.getInstance("SHA-256");
}
public void add(String value) {
messageDigest.update(value.getBytes(UTF_8));
}
public byte[] create() {
return messageDigest.digest();
}
}
public static void main(String[] args) {
javaHash();
try {
shaHash();
} catch (NoSuchAlgorithmException e) {
e.printStackTrace(); // TODO: implement catch
}
}
private static void javaHash() {
JavaHasher h = new JavaHasher();
h.add("somestring");
h.add("another part");
h.add("eveno more");
int hash = h.create();
System.out.println(hash);
}
private static void shaHash() throws NoSuchAlgorithmException {
ShaHasher h = new ShaHasher();
h.add("somestring");
h.add("another part");
h.add("eveno more");
byte[] hash = h.create();
System.out.println(Arrays.toString(hash));
System.out.println(new BigInteger(1, hash));
}
}
Here obviously "SHA-256" could be replaced with other common hash-algorithms; Java ships quite a few of them.
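As a quick check of the incremental property, the following sketch feeds the same characters to SHA-256 in different chunkings and compares the digests. ChunkedDigest is a hypothetical helper written for this illustration; it wraps the checked exception only to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ChunkedDigest {
    // Hashes the concatenation of all parts without ever building the concatenated string.
    static byte[] sha256(String... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String part : parts) {
                md.update(part.getBytes(StandardCharsets.UTF_8));
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling sha256("some", "string") and sha256("somestring") yields the same 32 bytes, because update() only appends to the stream being hashed.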
Now you called out for a Long as return-value which would imply you are looking for a 64bit-Hash. If this really was on purpose have a look at the answers to What is a good 64bit hash function in Java for textual strings?. The accepted answer is a slight variant of the JavaHasher as String.hashCode() does basically the same calculation, but with lower overflow-boundary:
public static class Java64Hasher {
private long hashCode;
public Java64Hasher() {
hashCode = 1125899906842597L;
}
public void add(CharSequence value) {
final int len = value.length();
for(int i = 0; i < len; i++) {
hashCode = 31*hashCode + value.charAt(i);
}
}
public long create() {
return hashCode;
}
}
On to your points:
fast
Although SHA-256 is slower than the other two, I would still call all three presented approaches fast.
can be used incrementally without compromising the other properties or keeping the strings in memory during the complete process.
I cannot guarantee that property for the ShaHasher, as I understand it is block-based and I lack the source code. Still, I would expect that at most one block, the hash, and some internal state are kept. The other two obviously store only the partial hash between calls to add().
Secure against collisions. If I compare two hash values from different strings 1 million times per day for the rest of my life, the risk that I get a collision should be negligible.
For every hash there are collisions. Given a good distribution, the bit size of the hash is the main factor in how often a collision happens. The JavaHasher's calculation is used in e.g. HashMaps and seems to be "collision-free" enough to distribute similar keys far apart from each other. As for any deeper analysis: do your own tests or ask your local security engineer, sorry.
I hope this gives a good starting point, details are probably mainly opinion-based.
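To put rough numbers on the bit-size argument, here is a sketch of the usual birthday approximation. It assumes the hash values are uniformly distributed, which a simple polynomial hash like String.hashCode() does not guarantee, so treat the results as a best case. BirthdayBound is a made-up name.

```java
class BirthdayBound {
    // Approximate probability that at least two of n uniformly random
    // 'bits'-bit hash values coincide: p = 1 - exp(-n(n-1) / 2^(bits+1)).
    static double collisionProbability(double n, int bits) {
        double exponent = -n * (n - 1) / Math.pow(2.0, bits + 1);
        return -Math.expm1(exponent); // 1 - e^exponent, accurate for tiny values
    }
}
```

Under that assumption, about 77,000 distinct inputs already give a roughly even chance of some collision with a 32-bit hash, while a million inputs with a 64-bit hash stay well below one in a million.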

Not intended as an answer, just to demonstrate that hash collisions are much more likely than human intuition tends to assume.
The following tiny program generates 2^31 distinct strings and checks if any of their hashes collide. It does this by keeping a tracking bit per possible hash value (so you need >512MB heap to run it), to mark each hash value as "used" as they are encountered. It takes several minutes to complete.
public class TestStringHashCollisions {
public static void main(String[] argv) {
long collisions = 0;
long testcount = 0;
StringBuilder b = new StringBuilder(64);
for (int i=0; i>=0; ++i) {
// construct distinct string
b.setLength(0);
b.append("www.");
b.append(Integer.toBinaryString(i));
b.append(".com");
// check for hash collision
String s = b.toString();
++testcount;
if (isColliding(s.hashCode()))
++collisions;
// progress printing
if ((i & 0xFFFFFF) == 0) {
System.out.println("Tested: " + testcount + ", Collisions: " + collisions);
}
}
System.out.println("Tested: " + testcount + ", Collisions: " + collisions);
System.out.println("Collision ratio: " + (collisions / (double) testcount));
}
// storage for 2^32 bits in 2^27 ints
static int[] bitSet = new int[1 << 27];
// test if hash code has appeared before, mark hash as "used"
static boolean isColliding(int hash) {
int index = hash >>> 5;
int bitMask = 1 << (hash & 31);
if ((bitSet[index] & bitMask) != 0)
return true;
bitSet[index] |= bitMask;
return false;
}
}
You can adjust the string generation part easily to test different patterns.
