Context
Hi, I'm working on an assignment for school that asks us to implement a hash table in Java. There are no requirements that collisions be kept to a minimum, but a low collision rate and speed seem to be the two most sought-after qualities in all the reading that I've done.
Problem
I'd like some guidance on how to map the output of a hash function to a smaller range, without having >20% of my keys collide (yikes).
In all of the algorithms that I've explored, keys are mapped to the entire range of an unsigned 32 bit integer (or in many cases, 64, even 128 bit). I'm not finding much about this on here, Wikipedia, or in any of the hash-related articles / discussions I've come across.
In terms of the specifics of my implementation, I'm working in Java (mandate of my school), which is problematic since there are no unsigned types to work with. To get around this, I've been using the 64-bit long integer type, then using a bit mask to map back down to 32 bits. Instead of simply truncating, I XOR the top 32 bits with the bottom 32, then perform a bitwise AND to mask out any upper bits that might result in a negative value when I cast it down to a 32 bit integer. After all that, a separate function compresses the resulting hash value down to fit into the bounds of the hash table's inner array.
It ends up looking like:
int hash( String key ) {
    long h = 0;
    for( int i = 0; i < key.length(); i++ ) {
        // do some stuff with each character in the key
    }
    h = h ^ ( h >>> 32 );           // fold the top 32 bits into the bottom 32
    return (int)( h & 2147483647 ); // mask off the sign bit so the result is non-negative
}
Where the inner-loop depends on the hash function (I've implemented a few: polynomial hashing, FNV1, SuperFastHash, and a custom one tailored to the input data).
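For illustration, here is a sketch of what one such inner loop can look like, using 32-bit FNV-1a (a close cousin of the FNV1 mentioned above), with the arithmetic masked to emulate an unsigned 32-bit integer:
int hashFnv1a( String key ) {
    long h = 2166136261L;                    // 32-bit FNV offset basis
    for( int i = 0; i < key.length(); i++ ) {
        h ^= key.charAt( i );                // FNV-1a: XOR the character first...
        h = ( h * 16777619L ) & 0xFFFFFFFFL; // ...then multiply by the FNV prime, masked to 32 bits
    }
    return (int)( h & 2147483647 );          // clear the sign bit
}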
They basically all perform horribly. I have yet to see fewer than 20% of keys collide. Even before I compress the hash values down to array indices, none of my hash functions gets me less than 10,000 collisions. My inputs are two text files, each ~220,000 lines. One is English words, the other is random strings of varying length.
My lecture notes recommend the following, for compressing the hashed keys:
(hashed key) % P
Where P is the largest prime < the size of the inner array.
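For illustration, the whole compression step could look like the sketch below, where largestPrimeBelow is a hypothetical helper (not a library function) that finds the largest prime smaller than its argument:
int compress( int hashedKey, int capacity ) {
    int p = largestPrimeBelow( capacity ); // hypothetical helper
    return hashedKey % p;                  // assumes hashedKey is non-negative, as produced above
}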
Is this an accepted method of compressing hash values? I have a feeling it isn't, but since performance is so poor even before compression, I have a feeling it's not the primary culprit, either.
I'm not sure I fully understand your specific problem, but I'll try to help with hash performance and collisions.
A hash-based container decides which bucket to store each key-value pair in based on the key's hash value. Inside each bucket there is a structure (in HashMap's case, a linked list) where the pairs are stored.
If the hash value is usually the same, the bucket will usually be the same, so performance degrades badly. Let's see an example:
Consider this class
package hashTest;

import java.util.Hashtable;

public class HashTest {

    public static void main(String[] args) {
        Hashtable<MyKey, String> hm = new Hashtable<>();
        long ini = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            MyKey a = new HashTest().new MyKey(String.valueOf(i));
            hm.put(a, String.valueOf(i));
        }
        System.out.println(hm.size());
        long fin = System.currentTimeMillis();
        System.out.println("time: " + (fin - ini) + " ms");
    }

    private class MyKey {
        private String str;

        public MyKey(String i) {
            str = i;
        }

        public String getStr() {
            return str;
        }

        @Override
        public int hashCode() {
            return 0;
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof MyKey) {
                MyKey aux = (MyKey) o;
                if (this.str.equals(aux.getStr())) {
                    return true;
                }
            }
            return false;
        }
    }
}
Note that hashCode in class MyKey always returns '0' as the hash. That is allowed by the hashCode contract (http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()). If we run that program, this is the result:
100000
time: 62866 ms
That is very poor performance. Now let's change MyKey's hashCode:
package hashTest;

import java.util.Hashtable;

public class HashTest {

    public static void main(String[] args) {
        Hashtable<MyKey, String> hm = new Hashtable<>();
        long ini = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            MyKey a = new HashTest().new MyKey(String.valueOf(i));
            hm.put(a, String.valueOf(i));
        }
        System.out.println(hm.size());
        long fin = System.currentTimeMillis();
        System.out.println("time: " + (fin - ini) + " ms");
    }

    private class MyKey {
        private String str;

        public MyKey(String i) {
            str = i;
        }

        public String getStr() {
            return str;
        }

        @Override
        public int hashCode() {
            return str.hashCode() * 31;
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof MyKey) {
                MyKey aux = (MyKey) o;
                if (this.str.equals(aux.getStr())) {
                    return true;
                }
            }
            return false;
        }
    }
}
Note that only hashCode in MyKey has changed. Now when we run the code, the result is
100000
time: 47 ms
There is incredibly better performance now, with a minor change. It is a very common practice to return the hash code multiplied by a prime number (in this case 31), using the same members in hashCode that you use inside the equals method to determine whether two objects are the same (in this case, only str).
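For instance, a common multi-field version of this pattern looks like the following sketch (otherField is hypothetical, standing in for any extra field that equals compares):
@Override
public int hashCode() {
    int result = 17;                              // arbitrary non-zero seed
    result = 31 * result + str.hashCode();        // one line per field used in equals
    result = 31 * result + otherField.hashCode(); // hypothetical extra field
    return result;
}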
I hope this little example can point you toward a solution for your problem.
I would like to know what makes the difference and what I should be aware of when I'm writing code.
Used the same parameters and methods put(), get() when testing
without printing
Used System.nanoTime() to measure runtime
I tried it with int keys 1-10 and 10 values, so every single hash returns a unique index, which is the optimal scenario
My HashSet implementation which is based on this is almost as fast as the JDK's
Here's my simple implementation:
public MyHashMap(int s) {
    this.TABLE_SIZE = s;
    table = new HashEntry[s];
}

class HashEntry {
    int key;
    String value;

    public HashEntry(int k, String v) {
        this.key = k;
        this.value = v;
    }

    public int getKey() {
        return key;
    }
}

int TABLE_SIZE;
HashEntry[] table;

// Linear probing: on a collision, step to the next slot, wrapping around.
// Note this loops forever if the table is full, so TABLE_SIZE must exceed
// the number of distinct keys.
public void put(int key, String value) {
    int hash = key % TABLE_SIZE;
    while (table[hash] != null && table[hash].getKey() != key)
        hash = (hash + 1) % TABLE_SIZE;
    table[hash] = new HashEntry(key, value);
}

public String get(int key) {
    int hash = key % TABLE_SIZE;
    while (table[hash] != null && table[hash].key != key)
        hash = (hash + 1) % TABLE_SIZE;
    if (table[hash] == null)
        return null;
    else
        return table[hash].value;
}
Here's the benchmark:
public static void main(String[] args) {
    long start = System.nanoTime();
    MyHashMap map = new MyHashMap(11);
    map.put(1, "A");
    map.put(2, "B");
    map.put(3, "C");
    map.put(4, "D");
    map.put(5, "E");
    map.put(6, "F");
    map.put(7, "G");
    map.put(8, "H");
    map.put(9, "I");
    map.put(10, "J");
    map.get(1);
    map.get(2);
    map.get(3);
    map.get(4);
    map.get(5);
    map.get(6);
    map.get(7);
    map.get(8);
    map.get(9);
    map.get(10);
    long end = System.nanoTime();
    System.out.println(end - start + " ns");
}
If you read the documentation of the HashMap class, you see that it implements a hash table based on the hashCode of the keys. This is dramatically more efficient than a brute-force search if the map contains a non-trivial number of entries, assuming a reasonable distribution of keys amongst the "buckets" that it sorts the entries into.
That said, benchmarking the JVM is non-trivial and easy to get wrong. If you're seeing big differences with small numbers of entries, it could easily be a benchmarking error rather than a real difference in the code.
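For example, a hand-rolled micro-benchmark is less misleading if it warms up the JIT first and averages over many operations. This is only a sketch; a harness like JMH is the robust way to do this:
import java.util.HashMap;
import java.util.Map;

public class MapBench {
    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>();
        // Warm-up: let the JIT compile the hot paths before measuring.
        for (int i = 0; i < 1_000_000; i++) {
            map.put(i % 10_000, "v");
            map.get(i % 10_000);
        }
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            map.get(i % 10_000);
        }
        long elapsed = System.nanoTime() - start;
        System.out.println(elapsed / 1_000_000.0 + " ns per get, averaged over 1M calls");
    }
}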
When it comes to performance, never assume anything.
Your assumption was "My HashSet implementation which is based on this is almost as fast as the JDK's". No, obviously it is not.
That is the tricky part of doing performance work: doubt everything unless you have measured with great accuracy. Worse, you even measured, and the measurement told you that your implementation is slower; but instead of checking your source, and the source of the thing you are measuring against, you decided that the measuring process must be wrong...
I wanted to create a method that takes an enum and uses it directly in a computation.
private static int getEntries(List<Integer> vector, Sign sign)
{
    //assert isPrimitiveTypeCompliant(vector) : "Vector has null components!";
    int entries = 0;
    for (Integer entry : vector)
        if (entry * sign > 0) // does not compile
            entries++;
    return entries;
}
I thought something like that was possible, since I assumed System.out.println(Object) does an implicit type conversion too. It doesn't; it uses the following approach:
public void println(Object x) {
    String s = String.valueOf(x);
    synchronized (this) {
        print(s);
        newLine();
    }
}

public static String valueOf(Object obj) {
    return (obj == null) ? "null" : obj.toString();
}
Question
So is it possible to achieve this in Java? Or is this reserved for C++ and operator overloading? What are the common workarounds? Utility/adapter classes that do the work?
Btw, I eventually ended up with this approach
private enum Sign
{
    POSITIVE(+1), NEGATIVE(-1);

    private int sign;

    private Sign(int sign)
    {
        this.sign = sign;
    }

    // Returns 1 if n has this sign, 0 otherwise, so callers can count matches.
    public int process(int n)
    {
        if (n * sign > 0)
        {
            return 1;
        }
        return 0;
    }
}

private static int getEntries(List<Integer> vector, Sign sign)
{
    //assert isPrimitiveTypeCompliant(vector) : "Vector has null components";
    int entries = 0;
    for (Integer entry : vector)
        entries += sign.process(entry);
    return entries;
}
Yes, it is possible to achieve it. In fact, you did in the second piece of code.
Java doesn't have operator overloading or implicit conversions (beyond numerical conversions and "widening" type casts). So, there is no way of allowing syntax like entry * sign (except the one you used).
What do you mean by workarounds? This is not a problem; it is a language design decision. And you already arrived at the appropriate Java idiom.
Why not just use the int value of the sign?
if (entry * sign.value > 0)
with the enum exposing the value as a public final field:
private enum Sign
{
    POSITIVE(+1), NEGATIVE(-1);

    public final int value;

    private Sign(int value)
    {
        this.value = value;
    }
}
I think for this case it would work better for you to use final static variables.
public final class Sign {
    public final static int POSITIVE = 1;
    public final static int NEGATIVE = -1;

    private Sign() {
    }
}
Then you can use Sign.POSITIVE and Sign.NEGATIVE for the operations you want.
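For example, a hypothetical use in the original counting loop:
for (Integer entry : vector)
    if (entry * Sign.POSITIVE > 0)
        entries++;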
For my data structures class, our homework is to create a generic heap ADT. In the siftUp() method I need to do a comparison, and if the parent is smaller I need to do a swap. The problem I am having is that the comparison operators are not valid on generic types. I believe I need to use the Comparable interface, but from what I've read it's not a good idea to use with arrays. I have also searched this site, and while I found good information related to this post, none of it helped me find the solution.
I removed some of the code that wasn't relevant.
Thanks
public class HeapQueue<E> implements Cloneable {
    private int highest;
    private int manyItems = 0;
    private E[] data;

    public HeapQueue(int a_highest) {
        data = (E[]) new Object[10];
        highest = a_highest;
    }

    public void add(E item, int priority) {
        // check to see if the priority value is within range
        if (priority < 0 || priority > highest) {
            throw new IllegalArgumentException
                ("Priority value is out of range: " + priority);
        }

        // increase the heap's capacity if the array is out of space
        if (manyItems == data.length)
            ensureCapacity();

        manyItems++;
        data[manyItems - 1] = item;
        siftUp(manyItems - 1);
    }

    private void siftUp(int nodeIndex) {
        int parentIndex;
        E tmp;
        if (nodeIndex != 0) {
            parentIndex = parent(nodeIndex);
            if (data[parentIndex] < data[nodeIndex]) { // <-- problem ****
                tmp = data[parentIndex];
                data[parentIndex] = data[nodeIndex];
                data[nodeIndex] = tmp;
                siftUp(parentIndex);
            }
        }
    }

    private int parent(int nodeIndex) {
        return (nodeIndex - 1) / 2;
    }
}
Technically you're using the Comparable interface on an item, not an array; one item in the array, specifically. I think the best solution here is to accept, in the constructor, a Comparator that the user can pass in to compare their generic objects.
Comparator<E> comparator;

public HeapQueue(int a_highest, Comparator<E> compare)
{
    this.comparator = compare;
    // ... rest of the constructor as before (allocate data, set highest)
}
Then, you would store that comparator in a member variable and use
if (comparator.compare(data[parentIndex], data[nodeIndex]) < 0)
in place of the less-than operator.
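For instance, a hypothetical usage, assuming Java 8's Comparator factory methods are available:
HeapQueue<String> queue =
    new HeapQueue<>(10, Comparator.comparingInt(String::length));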
If I am reading this right, E simply needs to extend Comparable, and then your problem line becomes...
if (data[parentIndex].compareTo(data[nodeIndex]) < 0)
This is not breaking any best-practice rules that I know of.
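Concretely, that means bounding the type parameter in the class declaration, e.g. (a sketch of the one-line change this assumes):
public class HeapQueue<E extends Comparable<E>> implements Cloneable {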
I'm pretty new to the idea of recursion and this is actually my first attempt at writing a recursive method.
I tried to implement a recursive function Max, to which I pass an array along with a variable that holds the array's size, in order to print the largest element.
It works, but it just doesn't feel right!
I have also noticed that I seem to use the static modifier much more than my classmates in general...
Can anybody please provide any general tips as well as feedback as to how I can improve my code?
public class RecursiveTry {
    static int[] n = new int[] {1, 2, 4, 3, 3, 32, 100};
    static int current = 0;
    static int maxValue = 0;
    static int SIZE = n.length;

    public static void main(String[] args) {
        System.out.println(Max(n, SIZE));
    }

    public static int Max(int[] n, int SIZE) {
        if (current <= SIZE - 1) {
            if (maxValue <= n[current]) {
                maxValue = n[current];
                current++;
                Max(n, SIZE);
            } else {
                current++;
                Max(n, SIZE);
            }
        }
        return maxValue;
    }
}
Your use of static variables for holding state outside the function will be a source of difficulty.
An example of a recursive implementation of a max() function in pseudocode might be:
function Max(data, size) {
    assert(size > 0)
    if (size == 1) {
        return data[0]
    }
    maxtail = Max(data[1..size], size-1)
    if (data[0] > maxtail) {
        return data[0]
    } else {
        return maxtail
    }
}
The key here is the recursive call to Max(), where you pass everything except the first element, and one less than the size. The general idea is this function says "the maximum value in this data is either the first element, or the maximum of the values in the rest of the array, whichever is larger".
This implementation requires no static data outside the function definition.
One of the hallmarks of recursive implementations is a so-called "termination condition" which prevents the recursion from going on forever (or, until you get a stack overflow). In the above case, the test for size == 1 is the termination condition.
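For comparison, a direct Java translation of the pseudocode above might look like this sketch (Arrays.copyOfRange stands in for the data[1..size] slice, at the cost of extra allocation):
static int max(int[] data) {
    assert data.length > 0;
    if (data.length == 1) {  // termination condition
        return data[0];
    }
    // the maximum of everything except the first element
    int maxTail = max(java.util.Arrays.copyOfRange(data, 1, data.length));
    return data[0] > maxTail ? data[0] : maxTail;
}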
Making your function dependent on static variables is not a good idea. Here is a possible implementation of a recursive Max function:
int Max(int[] array, int currentPos, int maxValue) {
    // Ouch!
    if (currentPos < 0) {
        throw new IllegalArgumentException("currentPos must not be negative");
    }

    // We reached the end of the array: return the latest maxValue
    if (currentPos >= array.length) {
        return maxValue;
    }

    // Is the current value greater than the latest maxValue?
    int currentValue = array[currentPos];
    if (currentValue > maxValue) {
        // currentValue is the new maxValue
        return Max(array, currentPos + 1, currentValue);
    } else {
        // maxValue is still the max value
        return Max(array, currentPos + 1, maxValue);
    }
}
...
int[] array = new int[] { /* ... */ };
int currentPos = 0;
int maxValue = Integer.MIN_VALUE; // or seed with array[currentPos]
maxValue = Max(array, currentPos, maxValue);
A "max" function is the wrong type of thing to write a recursive function for -- and the fact you're using static values for "current" and "maxValue" makes your function not really a recursive function.
Why not do something a little more amenable to a recursive algorithm, like factorial?
"not-homework"?
Anyway. First things first. The
static int[] n = new int[] {1,2,4,3,3,32,100};
static int SIZE = n.length;
have nothing to do with the parameters of Max() with which they share their names. Move these over to main and lose the "static" specifiers. They are used only once, when calling the first instance of Max() from inside main(). Their scope shouldn't extend beyond main().
There is no reason for all invocations of Max() to share a single "current" index. "current" should be local to Max(). But then how would successive recurrences of Max() know what value of "current" to use? (Hint: Max() is already passing other Max()'s lower down the line some data. Add "current" to this data.)
The same thing goes for maxValue, though the situation here is a bit more complex. Not only do you need to pass a current "maxValue" down the line, but when the recursion finishes, you have to pass it back up all the way to the first Max() function, which will return it to main(). You may need to look at some other examples of recursion and spend some time with this one.
Finally, Max() itself is static. Once you've eliminated the need to refer to external data (the static variables), however, it doesn't really matter. It just means that you can call Max() without having to instantiate an object.
As others have observed, there is no need for recursion to implement a Max function, but it can be instructive to use a familiar algorithm to experiment with a new concept. So, here is the simplified code, with an explanation below:
public class RecursiveTry
{
    public static void main(String[] args)
    {
        System.out.println(Max(new int[] {1, 2, 4, 3, 3, 32, 100}, 0, 0));
    }

    public static int Max(int[] n, int current, int maxValue)
    {
        if (current < n.length)
        {
            if (maxValue <= n[current] || current == 0)
            {
                return Max(n, current + 1, n[current]);
            }
            return Max(n, current + 1, maxValue);
        }
        return maxValue;
    }
}
All of the static state is gone as unnecessary; instead, everything is passed on the stack. The internal logic of the Max function is streamlined, and we recurse in two different ways, just for fun.
Here's a Java version for you.
public class Recursion {
    public static void main(String[] args) {
        int[] data = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        System.out.println("Max: " + max(0, data));
    }

    public static int max(int i, int[] arr) {
        if (i == arr.length - 1) {
            return arr[i];
        }
        int memo = max(i + 1, arr);
        if (arr[i] > memo) {
            return arr[i];
        }
        return memo;
    }
}
The recurrence relation is that the maximum element of an array is either the first element, or the maximum of the rest of the array. The stop condition is reached at the end of the array. Note how the recursive result is stored in the local variable memo so it is computed only once; compared with calling max() twice, this (roughly) halves the number of recursive calls.
You are essentially writing an iterative version but using tail recursion for the looping. Also, by making so many variables static, you are essentially using global variables instead of objects. Here is an attempt at something closer to a typical recursive implementation. Of course, in real life if you were using a language like Java that doesn't optimize tail calls, you would implement a "Max" function using a loop.
public class RecursiveTry {
    private int[] n;

    public static void main(String[] args) {
        RecursiveTry t = new RecursiveTry(new int[] {1, 2, 4, 3, 3, 32, 100});
        System.out.println(t.Max());
    }

    RecursiveTry(int[] arg) {
        n = arg;
    }

    public int Max() {
        return MaxHelper(0);
    }

    private int MaxHelper(int index) {
        if (index == n.length - 1) {
            return n[index];
        } else {
            int maxrest = MaxHelper(index + 1);
            int current = n[index];
            if (current > maxrest)
                return current;
            else
                return maxrest;
        }
    }
}
In Scheme this can be written very concisely:
(define (max l)
  (if (= (length l) 1)
      (first l)
      (local ([define maxRest (max (rest l))])
        (if (> (first l) maxRest)
            (first l)
            maxRest))))
Granted, this uses linked lists and not arrays, which is why I didn't pass it a size element, but I feel this distills the problem to its essence. This is the pseudocode definition:
define max of a list as:
if the list has one element, return that element
otherwise, the max of the list will be the max between the first element and the max of the rest of the list
A nicer way of getting the max value of an array recursively would be to implement quicksort (which is a nice, recursive sorting algorithm), and then just return the last value of the ascending result (or sort in descending order and return the first).
Here is a minimal quicksort sketch in Java (using the Lomuto partition scheme; any standard implementation would do):
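// In-place quicksort, Lomuto partition scheme; after sorting ascending,
// the maximum element sits at a[a.length - 1].
static void quicksort(int[] a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[hi];
    int i = lo;                            // boundary of the "less than pivot" region
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t; // move the pivot into place
    quicksort(a, lo, i - 1);
    quicksort(a, i + 1, hi);
}

// Usage: quicksort(arr, 0, arr.length - 1); int max = arr[arr.length - 1];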
Smallest codesize I could get:
public class RecursiveTry {
    public static void main(String[] args) {
        int[] x = new int[] {1, 2, 4, 3, 3, 32, 100};
        System.out.println(Max(x, 0));
    }

    public static int Max(int[] arr, int currPos) {
        if (arr.length == 0) return -1;
        if (currPos == arr.length) return arr[0];
        int len = Max(arr, currPos + 1);
        if (len < arr[currPos]) return arr[currPos];
        return len;
    }
}
A few things:
1/ If the array is zero-size, it returns a max of -1 (you could use another marker value, say -MAX_INT, or throw an exception). For code clarity, I've assumed here that all values are zero or more. Otherwise I would have peppered the code with all sorts of unnecessary stuff (in regards to answering the question).
2/ Most recursions are 'cleaner' in my opinion if the terminating case is no-data rather than last-data, hence I return a value guaranteed to be less than or equal to the max when we've finished the array. Others may differ in their opinion but it wouldn't be the first or last time that they've been wrong :-).
3/ The recursive call just gets the max of the rest of the list and compares it to the current element, returning the maximum of the two.
4/ The 'ideal' solution would have been to pass a modified array on each recursive call so that you're only comparing the first element with the rest of the list, removing the need for currPos. But that would have been inefficient and would have brought down the wrath of SO.
5/ This may not necessarily be the best solution. It may be that my gray matter has been compromised from too much use of LISP with its CAR, CDR and those interminable parentheses.
First, let's take care of the static scope issue ... Your class defines an object, but never actually instantiates one. Since main is statically scoped, the first thing to do is get an object, then execute its methods like this:
public class RecursiveTry {
    private int[] n = {1, 2, 4, 3, 3, 32, 100};

    public static void main(String[] args) {
        RecursiveTry maxObject = new RecursiveTry();
        System.out.println(maxObject.Max(maxObject.n, 0));
    }

    public int Max(int[] n, int start) {
        if (start == n.length - 1) {
            return n[start];
        } else {
            int maxRest = Max(n, start + 1);
            if (n[start] > maxRest) {
                return n[start];
            }
            return maxRest;
        }
    }
}
So now we have a RecursiveTry object named maxObject that does not require static scope. I'm not sure that finding a maximum is best done using recursion, as the number of iterations in the traditional looping method is roughly equivalent, while the amount of stack used is larger with recursion. But for this example, I'd pare it down a lot.
One of the advantages of recursion is that your state doesn't generally need to be persisted during the repeated tests like it does in iteration. Here, I've conceded to the use of a variable to hold the starting point, because it's less CPU-intensive than passing a new int[] that contains all the items except the first one.