Java In-Memory Efficient Key-Value Store

I have to store 111 million key-value pairs (one key can have multiple values, at most 2-3) whose keys are 50-bit integers and whose values are 32-bit (maximum) integers. Now, my requirements are:
Fast insertion of (key, value) pairs [allowing duplicates]
Fast retrieval of the value or values for a given key.
A nice solution based on MultiMap is given here. However, I want to store more key-value pairs in main memory with no or only a small performance penalty. I have read in web articles that a B+ tree, R+ tree, B tree, compact multimap, etc. can be a nice solution for that. Can anybody help me:
Is there any Java library which satisfies all those needs properly
(the data structures mentioned above, or others, are also acceptable, no issue with that)?
Actually, I want an efficient Java library data structure to store/retrieve
key-value/values pairs which takes a small memory footprint and must be
built in-memory.
NB: I have tried HashMultiMap (Guava with some modification with Trove), as mentioned by Louis Wasserman, Kyoto/Tokyo Cabinet, etc. My experience with disk-backed solutions is not good, so please avoid those :). Another point: for choosing a library/data structure, one important consideration is that the keys are 50-bit (so if we assign 64 bits, 14 bits will be wasted) and the values are 32-bit ints at most - mostly they are 10-14 bits - so we can save space there as well.

I don't think there's anything in the JDK which will do this.
However, implementing such a thing is a simple matter of programming. Here is an open-addressed hashtable with linear probing, with keys and values stored in parallel arrays:
public class LongIntParallelHashMultimap {

    private static final long NULL = 0L;

    private final long[] keys;
    private final int[] values;
    private int size;

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
    }

    public void put(long key, int value) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
        if (size == keys.length) throw new IllegalStateException("map is full");

        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }

    public int[] get(long key) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);

        int index = indexFor(key);
        int count = countHits(key, index);

        int[] hits = new int[count];
        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
            }
            index = successor(index);
        }

        return hits;
    }

    private int countHits(long key, int index) {
        int numHits = 0;
        while (keys[index] != NULL) {
            if (keys[index] == key) ++numHits;
            index = successor(index);
        }
        return numHits;
    }

    private int indexFor(long key) {
        // the hashing constant is (the golden ratio * Long.MAX_VALUE) + 1
        // see The Art of Computer Programming, section 6.4
        // the constant has two important properties:
        // (1) it is coprime with 2^64, so multiplication by it is a bijective function, and does not generate collisions in the hash
        // (2) it has a 1 in the bottom bit, so it does not add zeroes in the bottom bits of the hash, and does not generate (gratuitous) collisions in the index
        long hash = key * 5700357409661598721L;
        return Math.abs((int) (hash % keys.length));
    }

    private int successor(int index) {
        return (index + 1) % keys.length;
    }

    public int size() {
        return size;
    }
}
Note that this is a fixed-size structure. You will need to create it big enough to hold all your data - 110 million entries for me takes up 1.32 GB. The bigger you make it, beyond what you need to store the data, the faster insertions and lookups will be. I found that for 110 million entries, with a load factor of 0.5 (2.64 GB, twice as much space as needed), it took on average 403 nanoseconds to look up a key, but with a load factor of 0.75 (1.76 GB, a third more space than is needed), it took 575 nanoseconds. Decreasing the load factor below 0.5 usually doesn't make much difference, and indeed, with a load factor of 0.33 (4.00 GB, three times more space than needed), I get an average time of 394 nanoseconds. So, even though you have 5 GB available, don't use it all.
Note also that zero is not allowed as a key. If this is a problem, change the null value to be something else, and pre-fill the keys array with that on creation.
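For clarity, here is a minimal usage sketch of the class above (the capacity and sample values are made up):

LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(1024); // size this for your real data volume
map.put(123456789L, 42);
map.put(123456789L, 43);                  // the same key can be stored more than once
int[] hits = map.get(123456789L);         // {42, 43}, in probe order
System.out.println(java.util.Arrays.toString(hits));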

Is there any Java library which satisfies all those needs properly?
AFAIK, no. Or at least, not one that minimises the memory footprint.
However, it should be easy to write a custom map class that is specialized to these requirements.

It's a good idea to look at databases, because problems like these are what they are designed for. In recent years key-value databases have become very popular, e.g. for web services (keyword "NoSQL"), so you should find something.
The choice of a custom data structure also depends on whether you want to use a hard drive to store your data (and how safe that has to be), or whether it may be completely lost on program exit.
If implementing manually and the whole DB fits into memory somewhat easily, I'd just implement a hashmap in C. Create a hash function that gives a (well-spread) memory address from a value. Insert there, or next to it if that slot is already assigned. Assignment and retrieval are then O(1). If you implement it in Java, you'll have the 4-byte overhead for each (primitive) object.

Based on @Tom Anderson's solution I removed the need to allocate objects, and added a performance test.
import java.util.Arrays;
import java.util.Random;
public class LongIntParallelHashMultimap {

    private static final long NULL = Long.MIN_VALUE;

    private final long[] keys;
    private final int[] values;
    private int size;

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
        Arrays.fill(keys, NULL);
    }

    public void put(long key, int value) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
        if (size == keys.length) throw new IllegalStateException("map is full");

        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }

    public int get(long key, int[] hits) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);

        int index = indexFor(key);
        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
                if (hitIndex == hits.length)
                    break;
            }
            index = successor(index);
        }

        return hitIndex;
    }

    private int indexFor(long key) {
        return Math.abs((int) (key % keys.length));
    }

    private int successor(int index) {
        index++;
        return index >= keys.length ? index - keys.length : index;
    }

    public int size() {
        return size;
    }

    public static class PerfTest {
        public static void main(String... args) {
            int values = 110 * 1000 * 1000;
            long start0 = System.nanoTime();
            long[] keysValues = generateKeys(values);

            LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(222222227);
            long start = System.nanoTime();
            addKeyValues(values, keysValues, map);
            long mid = System.nanoTime();
            int sum = lookUpKeyValues(values, keysValues, map);
            long time = System.nanoTime();
            System.out.printf("Generated %.1f M keys/s, Added %.1f M/s and looked up %.1f M/s%n",
                    values * 1e3 / (start - start0), values * 1e3 / (mid - start), values * 1e3 / (time - mid));
            System.out.println("Expected " + values + " got " + sum);
        }

        private static long[] generateKeys(int values) {
            Random rand = new Random();
            long[] keysValues = new long[values];
            for (int i = 0; i < values; i++)
                keysValues[i] = rand.nextLong();
            return keysValues;
        }

        private static void addKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
            for (int i = 0; i < values; i++) {
                map.put(keysValues[i], i);
            }
            assert map.size() == values;
        }

        private static int lookUpKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
            int[] found = new int[8];
            int sum = 0;
            for (int i = 0; i < values; i++) {
                sum += map.get(keysValues[i], found);
            }
            return sum;
        }
    }
}
prints
Generated 34.8 M keys/s, Added 11.1 M/s and looked up 7.6 M/s
Run on a 3.8 GHz i7 with Java 7 update 3.
This is much slower than the previous test because you are accessing main memory at random, rather than the cache. This is really a test of the speed of your memory. The writes are faster because they can be performed asynchronously to main memory.
Using this collection
final SetMultimap<Long, Integer> map = Multimaps.newSetMultimap(
        TDecorators.wrap(new TLongObjectHashMap<Collection<Integer>>()),
        new Supplier<Set<Integer>>() {
            public Set<Integer> get() {
                return TDecorators.wrap(new TIntHashSet());
            }
        });
Running the same test with 50 million entries (which used about 16 GB) and -mx20g, I got the following result.
Generated 47.2 M keys/s, Added 0.5 M/s and looked up 0.7 M/s
For 110 M entries you will need about 35 GB of memory and a machine 10 x faster than mine (3.8 GHz) to perform 5 million adds per second.

If you must use Java, then implement your own hashtable/hashmap. An important property of your table is to use a linked list to handle collisions; then, when you do a lookup, you can return all the elements on the list.
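A minimal sketch of that suggestion, for illustration only (the class and names here are mine, not the answerer's); note that the per-entry node objects carry exactly the object overhead the parallel-array answers above avoid:

import java.util.ArrayList;
import java.util.List;

class ChainedLongIntMultimap {
    private static final class Node {
        final long key;
        final int value;
        final Node next;                      // next entry in the same bucket's chain
        Node(long key, int value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] buckets;

    ChainedLongIntMultimap(int bucketCount) {
        buckets = new Node[bucketCount];
    }

    private int bucketOf(long key) {
        int b = (int) (key % buckets.length);
        return b < 0 ? b + buckets.length : b;           // keep the index non-negative
    }

    void put(long key, int value) {
        int b = bucketOf(key);
        buckets[b] = new Node(key, value, buckets[b]);   // prepend to the chain
    }

    List<Integer> get(long key) {
        List<Integer> hits = new ArrayList<Integer>();
        for (Node n = buckets[bucketOf(key)]; n != null; n = n.next) {
            if (n.key == key) {
                hits.add(n.value);                       // collect every value stored for this key
            }
        }
        return hits;
    }
}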

I might be late in answering this question, but Elasticsearch will solve your problem.

Related

Why does this implementation of Quadratic Probing fail when not overriding values on collision?

My current implementation of Quadratic Probing overrides the item being stored at the current index with the new item when a collision occurs. I insert three Person objects which are stored by using their lastname as key. To test the collision resolution of the implementation they all have the same last name which is "Windmill".
I need the implementation to keep all person objects but just move them to a different index instead of overriding them.
The list size has been set as 7, stored in variable "M" used for modulo in the insert function.
Insert function
@Override
public void put(String key, Person value) {
    int tmp = hash(key);
    int i, h = 0;
    for (i = tmp; keys[i] != null; i = (i + h * h++) % M) {
        collisionCount++;
        if (keys[i].equals(key)) {
            values[i] = value;
            return;
        }
    }
    keys[i] = key;
    values[i] = value;
    N++;
}
Hash function
private int hash(String key) {
    return (key.hashCode() & 0x7fffffff) % M;
}
Get function
@Override
public List<Person> get(String key) {
    List<Person> results = new ArrayList<>();
    int tmp = hash(key);
    int i = hash(key), h = 0;
    while (keys[i] != null)
    {
        if (keys[i].equals(key))
            results.add(values[i]);
        i = (i + h * h++) % M;
    }
    return results;
}
When I remove the piece of code that overrides previous values, the index int overflows and turns into a negative number, causing the program to crash.
You get the overflow because you apply % M only after some operations on ints that have already overflowed.
You need to replace i = (i + h * h++) % M with additional operations based on the modulo operation's properties (https://en.wikipedia.org/wiki/Modulo_operation):
(a + b) mod n = [(a mod n) + (b mod n)] mod n.
ab mod n = [(a mod n)(b mod n)] mod n.
I think there are two issues with your code:
You don't check whether the (multi-)map is full. In practice you want to do 2 checks:
check if N==M (or maybe some smaller threshold like 90% of M)
make collisionCount a local variable and when it reaches N (unfortunately this check is also necessary to avoid some pathological cases)
in both cases you should extend your storage area and copy old data into it (re-insert). This alone should fix your bug for small values of M but for really big sizes of the map you still need the next thing.
You didn't take into account how the mod (%) operator works in Java. In particular, for a negative value of a, the value of a % b is also negative. So when you insert a lot of values and compute the next index, i + h^2 might overflow Integer.MAX_VALUE and become negative. To fix this you might use a method like this:
static int safeMod(int a, int b) {
    int m = a % b;
    return (m >= 0) ? m : (m + b);
}
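Applied to the put method from the question, the probe loop might then look like this (a sketch using the question's fields; the overwrite-on-equal-key branch is dropped because the goal is to keep all entries, and the "map full" check described above is still needed to avoid an endless loop):

@Override
public void put(String key, Person value) {
    int i = hash(key);
    int h = 0;
    while (keys[i] != null) {            // walk the quadratic probe sequence
        collisionCount++;
        i = safeMod(i + h * h, M);       // same step as (i + h * h++) % M,
        h++;                             // but the index stays in [0, M)
    }
    keys[i] = key;                       // store in the first free slot found
    values[i] = value;
    N++;
}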

Smallest java structure with relatively decent contains() solution

Alright, here's the lowdown: I'm writing a class in Java that finds the Nth Hardy's Taxi number (a number that can be expressed as the sum of two cubes in two different ways). I have the discovery itself down, but I am in desperate need of some space saving. To that end, I need the smallest possible data structure where I can relatively easily use or create a method like contains(). I'm not particularly worried about speed, as my current solution can certainly compute it well within the time restrictions.
In short, the data structure needs:
To be able to relatively simply implement a contains() method
To use a low amount of memory
To be able to store very large number of entries
To be easily usable with the primitive long type
Any ideas? I started with a hash map (because I needed to test the values that led to the sum to ensure accuracy), then moved to a hash set once I guaranteed reliable answers.
Any other general ideas on how to save some space would be greatly appreciated!
I don't think you'd need the code to answer the question, but here it is in case you're curious:
public class Hardy {
// private static HashMap<Long, Long> hm;
/**
* Find the nth Hardy number (start counting with 1, not 0) and the numbers
* whose cubes demonstrate that it is a Hardy number.
* @param n
* @return the nth Hardy number
*/
public static long nthHardyNumber(int n) {
// long i, j, oldValue;
int i, j;
int counter = 0;
long xyLimit = 2147483647; // xyLimit is the max value of a 32bit signed number
long sum;
// hm = new HashMap<Long, Long>();
int hardyCalculations = (int) (n * 1.1);
HashSet<Long> hs = new HashSet<Long>(hardyCalculations * hardyCalculations, (float) 0.95);
long[] sums = new long[hardyCalculations];
// long binaryStorage, mask = 0x00000000FFFFFFFF;
for (i = 1; i < xyLimit; i++){
for (j = 1; j <= i; j++){
// binaryStorage = ((i << 32) + j);
// long y = ((binaryStorage << 32) >> 32) & mask;
// long x = (binaryStorage >> 32) & mask;
sum = cube(i) + cube(j);
if (hs.contains(sum) && !arrayContains(sums, sum)){
// oldValue = hm.get(sum);
// long oldY = ((oldValue << 32) >> 32) & mask;
// long oldX = (oldValue >> 32) & mask;
// if (oldX != x && oldX != y){
sums[counter] = sum;
counter++;
if (counter == hardyCalculations){
// Arrays.sort(sums);
bubbleSort(sums);
return sums[n - 1];
}
} else {
hs.add(sum);
}
}
}
return 0;
}
private static void bubbleSort(long[] array){
long current, next;
int i;
boolean ordered = false;
while (!ordered) {
ordered = true;
for (i = 0; i < array.length - 1; i++){
current = array[i];
next = array[i + 1];
if (current > next) {
ordered = false;
array[i] = next;
array[i+1] = current;
}
}
}
}
private static boolean arrayContains(long[] array, long n){
for (long l : array){
if (l == n){
return true;
}
}
return false;
}
private static long cube(long n){
return n*n*n;
}
}
Have you considered using a standard tree? In java that would be a TreeSet. By sacrificing speed, a tree generally gains back space over a hash.
For that matter, sums might be a TreeMap, transforming the linear arrayContains to a logarithmic operation. Being naturally ordered, there would also be no need to re-sort it afterwards.
EDIT
The complaint against using a java tree structure for sums is that java's tree types don't support the k-select algorithm. On the assumption that Hardy numbers are rare, perhaps you don't need to sweat the complexity of this container (in which case your array is fine.)
If you did need to improve time performance of this aspect, you could consider using a selection-enabled tree such as the one mentioned here. However that solution works by increasing the space requirement, not lowering it.
Alternately we can incrementally throw out Hardy numbers we know we don't need. Suppose during the running of the algorithm, sums already contains n Hardy numbers and we discover a new one. We insert it and do whatever we need to preserve collection order, and so now contains n+1 sorted elements.
Consider that last element. We already know about n smaller Hardy numbers, and so there is no possible way this last element is our answer. Why keep it? At this point we can shrink sums again down to size n and toss the largest element out. This is both a space savings, and time savings as we have fewer elements to maintain in sorted order.
The natural data structure for sums in that approach is a max heap. In Java there is no dedicated max-heap class, though java.util.PriorityQueue with a reversed comparator behaves as one, and a few 3rd-party implementations are floating around. You could "make it work" with TreeMap::lastKey, which will be slower in the end, but still faster than the quadratic bubbleSort.
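A sketch of that bounded "throw out the largest" idea (this is my illustration, not part of the answer; PriorityQueue is a min-heap by default, so a reversed comparator makes it behave as a max heap):

import java.util.Collections;
import java.util.PriorityQueue;

class BoundedHardyHeap {
    private final int n;                            // how many Hardy numbers we need
    private final PriorityQueue<Long> heap;         // max heap via reversed comparator

    BoundedHardyHeap(int n) {
        this.n = n;
        this.heap = new PriorityQueue<Long>(n, Collections.<Long>reverseOrder());
    }

    void offer(long sum) {
        if (heap.size() < n) {
            heap.add(sum);                          // still collecting the first n
        } else if (sum < heap.peek()) {
            heap.poll();                            // evict the current largest
            heap.add(sum);                          // keep only the n smallest seen
        }
    }

    long largestKept() {                            // once the search is done, this is
        return heap.peek();                         // the nth Hardy number found
    }
}

Duplicate candidates would still need the same "already seen" guard the question's code applies with hs/arrayContains.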
If you have an extremely large number of elements, and you effectively want an index to allow fast tests for containment in the underlying dataset, then take a look at Bloom Filters. These are space-efficient indexes whose sole purpose is to enable fast tests for containment in a dataset.
Bloom Filters are probabilistic, which means if they return true for containment, then you actually need to check your underlying dataset to confirm that the element is really present.
If they return false, the element is guaranteed not to be contained in the underlying dataset, and in that case the test for containment would be very cheap.
So it depends on whether, most of the time, you expect a candidate to really be contained in the dataset or not.
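For example, with Guava's BloomFilter (shown here only to illustrate the idea; the expected size and false-positive rate are made up):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomDemo {
    public static void main(String[] args) {
        // Expect roughly 100 million sums, accept ~1% false positives.
        BloomFilter<Long> seenSums = BloomFilter.create(Funnels.longFunnel(), 100000000, 0.01);

        seenSums.put(1729L);                     // record a sum we have seen
        if (seenSums.mightContain(1729L)) {
            // possibly present: confirm against the real (exact) dataset
        } else {
            // definitely not present: skip the expensive exact check
        }
    }
}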
This is the core function to find whether a given number is an HR number; it's in C, but one should get the idea:
#include <math.h>
#include <stdbool.h>

/* true if value can be written as i^3 + j^3 with i >= j >= 1 */
bool is_sum_of_cubes(int value)
{
    int i = (int) round(pow(value, 1.0/3));  /* round, so 2.999... becomes 3 */
    int j = 1;
    while (j <= i)                           /* was "j < m && i >= 0", which missed i == j */
    {
        int element = i*i*i + j*j*j;
        if (value == element)
        {
            return true;
        }
        if (element < value)
        {
            ++j;
        }
        else
        {
            --i;
        }
    }
    return false;
}
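Since the rest of the thread is Java, a direct port might look like this (a sketch; the method name is mine):

static boolean isSumOfTwoCubes(long value) {
    long i = Math.round(Math.cbrt(value));   // upper candidate cube root
    long j = 1;
    while (j <= i) {
        long element = i * i * i + j * j * j;
        if (element == value) {
            return true;
        }
        if (element < value) {
            ++j;                             // sum too small: raise the smaller cube
        } else {
            --i;                             // sum too large: lower the bigger cube
        }
    }
    return false;
}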

Regarding HashMap implementation in java

I was trying to do research on hashmap and came up with the following analysis:
https://stackoverflow.com/questions/11596549/how-does-javas-hashmap-work-internally/18492835#18492835
Q1: Can you guys show me with a simple map how the hashcode for a given key is calculated in detail, and how the position (bucket number) where the element should be placed is found using hash % (arrayLength-1)? Let's say I have this HashMap:
HashMap map=new HashMap();//HashMap key random order.
map.put("Amit","Java");
map.put("Saral","J2EE");
Q2: Sometimes it might happen that the hashCodes for 2 different objects are the same. In this case the 2 objects will be saved in one bucket and will be presented as a linked list. The entry point is the most recently added object. This object refers to the next object with its next field, and so on. The last entry refers to null. Can you guys show me this with a real example?
"Amit" will be distributed to the 10th bucket, because of the bit twiddling. If there were no bit twiddling it would go to the 7th bucket, because 2044535 & 15 = 7. How is this possible? Please explain the whole calculation in detail.
how the hashcode for the given key is calculated in detail by using
this formula
In case of String this is calculated by String#hashCode(); which is implemented as follows:
public int hashCode() {
int h = hash;
int len = count;
if (h == 0 && len > 0) {
int off = offset;
char val[] = value;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}
return h;
}
Basically following the equation in the java doc
hashcode = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
One interesting thing to note on this implementation is that String actually caches its hash code. It can do this, because String is immutable.
If I calculate the hashcode of the String "Amit", it will yield to this integer:
System.out.println("Amit".hashCode());
> 2044535
Let's walk through a simple put on a map, but first we have to determine how the map is built.
The most interesting fact about a Java HashMap is that it always has 2^n buckets. With the default constructor, the number of buckets is 16, which is obviously 2^4.
When doing a put operation on this map, it will first get the hashcode of the key. Some fancy bit twiddling happens on this hashcode to ensure that poor hash functions (especially those that do not differ in the lower bits) don't "overload" a single bucket.
The real function that is actually responsible for distributing your key to the buckets is the following:
h & (length-1); // length is the current number of buckets, h the hashcode of the key
This only works for power-of-two bucket sizes, because it uses & to map the key to a bucket instead of a modulo.
"Amit" will be distributed to the 10th bucket, because of the bit twiddling. If there were no bit twiddling it would go to the 7th bucket, because 2044535 & 15 = 7.
Now that we have an index, we can find the bucket. If the bucket contains elements, we have to iterate over them and replace an equal entry if we find one.
If no item with an equal key is found in the linked list, we just add the new entry at the beginning of the linked list.
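Going back to the "Amit" example, those two bucket numbers can be reproduced directly. The spreading function below is the one used in JDK 6/7 (later versions spread bits differently), so treat it as an illustration of the idea rather than a universal rule:

public class AmitBucketDemo {
    // The JDK 6/7 supplemental hash: folds higher-order bits into the lower ones.
    static int supplementalHash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int h = "Amit".hashCode();                      // 2044535
        System.out.println(h & 15);                     // 7  -> bucket without bit twiddling
        System.out.println(supplementalHash(h) & 15);   // 10 -> bucket with bit twiddling
    }
}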
The next important thing in HashMap is resizing: if the actual size of the map rises above a threshold (determined by the current number of buckets and the load factor, in our case 16 * 0.75 = 12), it will resize the backing array.
The resize always doubles the number of buckets, which keeps the count a power of two so that the bucket-index function above continues to work.
Since the number of buckets changes, we have to rehash all the current entries in the table.
This is quite costly, so if you know how many items there will be, you should initialize the HashMap with that count so it does not have to resize all the time.
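A rough example of such presizing (the entry count is made up; 0.75f is the default load factor):

import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    public static void main(String[] args) {
        // Sized so that this many insertions never push the map past its resize threshold.
        int expectedEntries = 1000000;
        Map<String, String> map = new HashMap<String, String>((int) (expectedEntries / 0.75f) + 1);
        System.out.println(map.isEmpty());   // just to use the map
    }
}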
Q1: look at hashCode() method implementation for String object
Q2: Create a simple class and implement its hashCode() method as return 1. That means every object of that class will have the same hashCode and will therefore be saved in the same bucket in the HashMap.
Understand that there are two basic requirements for a hash code:
When the hash code is recalculated for a given object (that has not been changed internally in a way that would alter its identity) it must produce the same value as the previous calculation. Similarly, two "identical" objects must produce the same hash codes.
When the hash code is calculated for two different objects (which are not considered "identical" from the standpoint of their internal content) there should be a high probability that the two hash codes would be different.
How these goals are accomplished is the subject of much interest to the math nerds who work on such things, but understanding the details is not at all important to understanding how hash tables work.
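A minimal class that satisfies both requirements might look like this (an illustrative example, not taken from the thread):

import java.util.Objects;

final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;          // "identical" here means same coordinates
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);            // equal points always produce the same hash
    }
}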
import java.util.Arrays;
public class Test2 {
public static void main(String[] args) {
Map<Integer, String> map = new Map<Integer, String>();
map.put(1, "A");
map.put(2, "B");
map.put(3, "C");
map.put(4, "D");
map.put(5, "E");
System.out.println("Iterate");
for (int i = 0; i < map.size(); i++) {
System.out.println(map.values()[i].getKey() + " : " + map.values()[i].getValue());
}
System.out.println("Get-> 3");
System.out.println(map.get(3));
System.out.println("Delete-> 3");
map.delete(3);
System.out.println("Iterate again");
for (int i = 0; i < map.size(); i++) {
System.out.println(map.values()[i].getKey() + " : " + map.values()[i].getValue());
}
}
}
class Map<K, V> {
private int size;
private Entry<K, V>[] entries = new Entry[16];
public void put(K key, V value) {
boolean flag = true;
for (int i = 0; i < size; i++) {
if (entries[i].getKey().equals(key)) {
entries[i].setValue(value);
flag = false;
break;
}
}
if (flag) {
this.ensureCapacity();
entries[size++] = new Entry<K, V>(key, value);
}
}
public V get(K key) {
V value = null;
for (int i = 0; i < size; i++) {
if (entries[i].getKey().equals(key)) {
value = entries[i].getValue();
break;
}
}
return value;
}
public boolean delete(K key) {
boolean flag = false;
Entry<K, V>[] entry = new Entry[size];
int j = 0;
int total = size;
for (int i = 0; i < total; i++) {
if (!entries[i].getKey().equals(key)) {
entry[j++] = entries[i];
} else {
flag = true;
size--;
}
}
entries = flag ? entry : entries;
return flag;
}
public int size() {
return size;
}
public Entry<K, V>[] values() {
return entries;
}
private void ensureCapacity() {
if (size == entries.length) {
entries = Arrays.copyOf(entries, size * 2);
}
}
#SuppressWarnings("hiding")
public class Entry<K, V> {
private K key;
private V value;
public K getKey() {
return key;
}
public V getValue() {
return value;
}
public void setValue(V value) {
this.value = value;
}
public Entry(K key, V value) {
super();
this.key = key;
this.value = value;
}
}
}

Calculating Percentiles on the fly

I'm programming in Java. Every 100 ms my program gets a new number.
It has a cache which contains the history of the last n = 180 numbers.
When I get a new number x I want to calculate how many numbers there are in the cache which are smaller than x.
Afterwards I want to delete the oldest number in the cache.
Every 100 ms I want to repeat the process of calculating how many smaller numbers there are and delete the oldest number.
Which algorithm should I use? I would like to optimize for making the calculation fast, as it's not the only thing calculated in those 100 ms.
For practical reasons and reasonable values of n, you're best off with a ring buffer of primitive ints (to keep track of the oldest entry) and a linear scan to determine how many values are smaller than x.
For this to be O(log n) you would have to use something like Guava's TreeMultiset. Here is an outline of how it would look.
class Statistics {
    private final static int N = 180;

    Queue<Integer> queue = new LinkedList<Integer>();
    SortedMap<Integer, Integer> counts = new TreeMap<Integer, Integer>();

    public int insertAndGetSmallerCount(int x) {
        queue.add(x);                                // O(1)
        counts.put(x, getCount(x) + 1);              // O(log N)

        int lessCount = 0;                           // O(N), unfortunately
        for (int i : counts.headMap(x).values())     // use Guavas TreeMultiset
            lessCount += i;                          // for O(log n)

        if (queue.size() > N) {                      // O(1)
            int oldest = queue.remove();             // O(1)
            int newCount = getCount(oldest) - 1;     // O(log N)
            if (newCount == 0)
                counts.remove(oldest);               // O(log N)
            else
                counts.put(oldest, newCount);        // O(log N)
        }

        return lessCount;
    }

    private int getCount(int x) {
        return counts.containsKey(x) ? counts.get(x) : 0;
    }
}
On my 1.8 GHz laptop, this solution performs 1,000,000 iterations in about 13 seconds (i.e. one iteration takes about 0.013 ms, well under 100 ms).
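For reference, the Guava variant suggested in the comments above could look roughly like this (a sketch; TreeMultiset.headMultiset(x, BoundType.OPEN) gives the count of elements strictly smaller than x, which should avoid the O(N) summing loop):

import com.google.common.collect.BoundType;
import com.google.common.collect.TreeMultiset;
import java.util.ArrayDeque;
import java.util.Deque;

class MultisetStatistics {
    private static final int N = 180;
    private final Deque<Integer> window = new ArrayDeque<Integer>();   // insertion order
    private final TreeMultiset<Integer> sorted = TreeMultiset.create();

    public int insertAndGetSmallerCount(int x) {
        window.addLast(x);
        sorted.add(x);
        int lessCount = sorted.headMultiset(x, BoundType.OPEN).size(); // values strictly < x
        if (window.size() > N) {
            sorted.remove(window.removeFirst());   // drop one occurrence of the oldest value
        }
        return lessCount;
    }
}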
You can keep an array of 180 numbers and save an index to the oldest one so that when a new number comes in you overwrite the number at the oldest index and increment the index modulo 180 (it's a bit more complex than that since you need special behaviour for the first 180 numbers).
As for calculating how many numbers are smaller I would use the brute force way (iterate all the numbers and count).
Edit: I find it funny to see that the "optimized" version runs five times slower than this trivial implementation (thanks to @Eiko for the analysis). I think this is because with trees and maps you lose data locality and suffer many more cache misses (not to mention memory allocation and garbage collection).
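A sketch of the ring-buffer-plus-linear-scan approach described above (the class and method names are mine):

class RingBufferWindow {
    private final int[] values = new int[180];
    private int next;      // slot that will be overwritten next (the oldest value)
    private int filled;    // how many slots hold real data so far

    /** Stores x (evicting the oldest value once 180 are held) and
     *  returns how many of the stored values are smaller than x. */
    int addAndCountSmaller(int x) {
        values[next] = x;                        // overwrite the oldest slot
        next = (next + 1) % values.length;       // advance the ring index
        if (filled < values.length) {
            filled++;
        }
        int smaller = 0;
        for (int i = 0; i < filled; i++) {       // brute-force scan of the window
            if (values[i] < x) {
                smaller++;
            }
        }
        return smaller;
    }
}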
Add your numbers to a list. If size > 180, remove the first number.
Counting is just iterating over the 180 elements which is probably fast enough. It's hard to beat performance wise.
You can use a LinkedList implementation.
With this structure, you can easily manipulate the first and the last elements of the List.
(addFirst, removeFirst, ...)
For the algorithm (finding how many numbers are lower/greater), a simple loop over the list is enough, and will give you the result in less than 100 ms on a 180-element list.
You could try a custom linked list data structure where each node maintains next/prev as well as sorted next/prev references. Then inserting becomes a two phase process, first always insert node at tail, and the insert sort, and the insert sort will return the count of numbers less than x. Deleting is simply removing the head.
Here is an example, NOTE: THIS IS VERY NASTY JAVA, IT IS EXAMPLE CODE TO PURELY DEMONSTRATE THE IDEA. You get the idea! ;) Also, I'm only adding a few items, but it should give you an idea of how it would work... The worst case for this is a full iteration through the sorted linked list - which is no worse than the examples above I guess?
import java.util.*;
class SortedLinkedList {
public static class SortedLL<T>
{
public class SortedNode<T>
{
public SortedNode(T value)
{
_value = value;
}
T _value;
SortedNode<T> prev;
SortedNode<T> next;
SortedNode<T> sortedPrev;
SortedNode<T> sortedNext;
}
public SortedLL(Comparator comp)
{
_comp = comp;
_head = new SortedNode<T>(null);
_tail = new SortedNode<T>(null);
// Setup the pointers
_head.next = _tail;
_tail.prev = _head;
_head.sortedNext = _tail;
_tail.sortedPrev = _head;
_sortedHead = _head;
_sortedTail = _tail;
}
int insert(T value)
{
SortedNode<T> nn = new SortedNode<T>(value);
// always add node at end
nn.prev = _tail.prev;
nn.prev.next = nn;
nn.next = _tail;
_tail.prev = nn;
// now second insert sort through..
int count = 0;
SortedNode<T> ptr = _sortedHead.sortedNext;
while(ptr.sortedNext != null)
{
if (_comp.compare(ptr._value, nn._value) >= 0)
{
break;
}
++count;
ptr = ptr.sortedNext;
}
// update the sorted pointers..
nn.sortedNext = ptr;
nn.sortedPrev = ptr.sortedPrev;
if (nn.sortedPrev != null)
nn.sortedPrev.sortedNext = nn;
ptr.sortedPrev = nn;
return count;
}
void trim()
{
// Remove from the head...
if (_head.next != _tail)
{
// trim.
SortedNode<T> tmp = _head.next;
_head.next = tmp.next;
_head.next.prev = _head;
// Now updated the sorted list
if (tmp.sortedPrev != null)
{
tmp.sortedPrev.sortedNext = tmp.sortedNext;
}
if (tmp.sortedNext != null)
{
tmp.sortedNext.sortedPrev = tmp.sortedPrev;
}
}
}
void printList()
{
SortedNode<T> ptr = _head.next;
while (ptr != _tail)
{
System.out.println("node: v: " + ptr._value);
ptr = ptr.next;
}
}
void printSorted()
{
SortedNode<T> ptr = _sortedHead.sortedNext;
while (ptr != _sortedTail)
{
System.out.println("sorted: v: " + ptr._value);
ptr = ptr.sortedNext;
}
}
Comparator _comp;
SortedNode<T> _head;
SortedNode<T> _tail;
SortedNode<T> _sortedHead;
SortedNode<T> _sortedTail;
}
public static class IntComparator implements Comparator
{
public int compare(Object v1, Object v2){
Integer iv1 = (Integer)v1;
Integer iv2 = (Integer)v2;
return iv1.compareTo(iv2);
}
}
public static void main(String[] args){
SortedLL<Integer> ll = new SortedLL<Integer>(new IntComparator());
System.out.println("inserting: " + ll.insert(1));
System.out.println("inserting: " + ll.insert(3));
System.out.println("inserting: " + ll.insert(2));
System.out.println("inserting: " + ll.insert(5));
System.out.println("inserting: " + ll.insert(4));
ll.printList();
ll.printSorted();
System.out.println("inserting new value");
System.out.println("inserting: " + ll.insert(3));
ll.trim();
ll.printList();
ll.printSorted();
}
}
Let the cache be a list, so you can insert at the start and let the oldest be at the end and be removed.
Then after every insertion just scan the whole list and calculate the number you need.
Take a look at the commons-math implementation of the DescriptiveStatistics class (Percentile.java)
180 values is not many, and a simple array with a brute-force search and System.arraycopy() should take less than 1 microsecond (1/1000 of a millisecond) and incurs no GC. It could be faster than playing with more complex collections.
I suggest you keep it simple and measure how long it takes before assuming you need to optimise it.

how to Compute the average probe length for success and failure - Linear probe (Hash Tables) [closed]

I'm doing an assignment for my Data Structures class. We were asked to study linear probing with load factors of .1, .2, .3, ..., and .9. The formula for testing is:
The average probe length using linear probing is roughly
Success --> (1 + 1/(1-L))/2
or
Failure --> (1 + 1/(1-L)^2)/2.
We are required to find the theoretical values using the formula above, which I did (just plug the load factor into the formula); then we have to calculate the empirical values (which I am not quite sure how to do). Here is the rest of the requirements:
For each load factor, 10,000 randomly generated positive ints between 1 and 50000 (inclusive) will be inserted into a table of the "right" size, where "right" is strictly based upon the load factor you are testing. Repeats are allowed. Be sure that your formula for randomly generated ints is correct. There is a class called Random in java.util. USE it!
After a table of the right (based upon L) size is loaded with 10,000 ints, do 100 searches of newly generated random ints from the range of 1 to 50000. Compute the average probe length for each of the two formulas and indicate the denominators used in each calculation. So, for example, each test for a .5 load would have a table of size approximately 20,000 (adjusted to be prime), and similarly each test for a .9 load would have a table of approximate size 10,000/.9 (again adjusted to be prime).
The program should run displaying the various load factors tested, the average probe length for each search (the two denominators used to compute the averages will add to 100), and the theoretical answers using the formula above.
how do I calculate the empirical success?
here is my code so far:
import java.util.Random;
/**
*
* @author Johnny
*/
class DataItem
{
private int iData;
public DataItem(int it)
{iData = it;}
public int getKey()
{
return iData;
}
}
class HashTable
{
private DataItem[] hashArray;
private int arraySize;
public HashTable(int size)
{
arraySize = size;
hashArray = new DataItem[arraySize];
}
public void displayTable()
{
int sp=0;
System.out.print("Table: ");
for(int j=0; j<arraySize; j++)
{
if(sp>50){System.out.println("");sp=0;}
if(hashArray[j] != null){
System.out.print(hashArray[j].getKey() + " ");sp++;}
else
{System.out.print("** "); sp++;}
}
System.out.println("");
}
public int hashFunc(int key)
{
return key %arraySize;
}
public void insert(DataItem item)
{
int key = item.getKey();
int hashVal = hashFunc(key);
while(hashArray[hashVal] != null &&
hashArray[hashVal].getKey() != -1)
{
++hashVal;
hashVal %= arraySize;
}
hashArray[hashVal]=item;
}
public int hashFunc1(int key)
{
return key % arraySize;
}
public int hashFunc2(int key)
{
// non-zero, less than array size, different from hF1
// array size must be relatively prime to 5, 4, 3, and 2
return 5 - key % 5;
}
public DataItem find(int key) // find item with key
// (assumes table not full)
{
int hashVal = hashFunc1(key); // hash the key
int stepSize = hashFunc2(key); // get step size
while(hashArray[hashVal] != null) // until empty cell,
{ // is correct hashVal?
if(hashArray[hashVal].getKey() == key)
return hashArray[hashVal]; // yes, return item
hashVal += stepSize; // add the step
hashVal %= arraySize; // for wraparound
}
return null; // can’t find item
}
}
public class n00645805 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
double b=1;
double L;
double[] tf = new double[9];
double[] ts = new double[9];
double d=0.1;
DataItem aDataItem;
int aKey;
HashTable h1Table = new HashTable(100003); //L=.1
HashTable h2Table = new HashTable(50051); //L=.2
HashTable h3Table = new HashTable(33343); //L=.3
HashTable h4Table = new HashTable(25013); //L=.4
HashTable h5Table = new HashTable(20011); //L=.5
HashTable h6Table = new HashTable(16673); //L=.6
HashTable h7Table = new HashTable(14243); //L=.7
HashTable h8Table = new HashTable(12503); //L=.8
HashTable h9Table = new HashTable(11113); //L=.9
fillht(h1Table);
fillht(h2Table);
fillht(h3Table);
fillht(h4Table);
fillht(h5Table);
fillht(h6Table);
fillht(h7Table);
fillht(h8Table);
fillht(h9Table);
pm(h1Table);
pm(h2Table);
pm(h3Table);
pm(h4Table);
pm(h5Table);
pm(h6Table);
pm(h7Table);
pm(h8Table);
pm(h9Table);
for (int j=1;j<10;j++)
{
//System.out.println(j);
L=Math.round((b-d)*100.0)/100.0;
System.out.println(L);
System.out.println("ts "+(1+(1/(1-L)))/2);
System.out.println("tf "+(1+(1/((1-L)*(1-L))))/2);
tf[j-1]=(1+(1/((1-L)*(1-L))))/2;
ts[j-1]=(1+(1/(1-L)))/2;
d=d+.1;
}
display(ts,tf);
}
public static void fillht(HashTable a)
{
Random r = new Random();
for(int j=0; j<10000; j++)
{
int aKey;
DataItem y;
aKey =1+Math.round(r.nextInt(50000));
y = new DataItem(aKey);
a.insert(y);
}
}
public static void pm(HashTable a)
{
DataItem X;
int numsuc=0;
int numfail=0;
int aKey;
Random r = new Random();
for(int j=0; j<100;j++)
{
aKey =1+Math.round(r.nextInt(50000));
X = a.find(aKey);
if(X != null)
{
//System.out.println("Found " + aKey);
numsuc++;
}
else
{
//System.out.println("Could not find " + aKey);
numfail++;
}
}
System.out.println("# of succ is "+ numsuc+" # of failures is "+ numfail);
}
public static void display(double[] s, double[] f)
{
}
}
You should take into account that Java's Hashtable uses a closed addressing (separate chaining, no probing) implementation, so you have separate buckets in which many items can be placed. This is not what you are looking for in your benchmarks. I'm not sure about the HashMap implementation, but I think it uses chaining as well.
So forget about the JDK classes. Since you want to calculate empirical values, you should write your own version of a hashtable that uses open addressing with linear probing, and take care of counting the probe length whenever you try to get a value from the hashmap.
For example, you could write your hashmap along these lines:
class YourHashMap
{
    int empiricalGet(K key)
    {
        // search for the key but store the probe length of this get operation
        return probeLength;
    }
}
Then you can easily benchmark it by searching for as many keys as you want and calculating the average probe length.
Otherwise you can just give the hashmap the ability to store the total probe length and the number of gets requested, and read them back after the benchmark run to calculate the average value.
This kind of exercise must show that the empirical value agrees with the theoretical one. So also take into account that you may need many benchmark runs, and then average them all, making sure the variance is not too high.
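As a concrete illustration (a sketch only, reusing the field and method names from the question's HashTable class), the probe-counting lookup could look like this:

// Counts how many slots are inspected before the search succeeds or fails.
// hashArray, arraySize and hashFunc are assumed to be the question's fields/methods.
public int findProbeLength(int key) {
    int probeLength = 1;                        // at least one slot is always inspected
    int hashVal = hashFunc(key);
    while (hashArray[hashVal] != null) {
        if (hashArray[hashVal].getKey() == key) {
            return probeLength;                 // successful search
        }
        hashVal = (hashVal + 1) % arraySize;    // linear probe step
        probeLength++;
    }
    return probeLength;                         // unsuccessful search (hit an empty slot)
}
// Average the returned lengths separately over the successful and the failed
// searches out of the 100 random lookups to get the two empirical values.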
