Calculating Percentiles on the fly - java

I'm programming in Java. Every 100 ms my program gets a new number.
It has a cache which contains the history of the last n = 180 numbers.
When I get a new number x I want to calculate how many numbers there are in the cache which are smaller than x.
Afterwards I want to delete the oldest number in the cache.
Every 100 ms I want to repeat the process of calculating how many smaller numbers there are and delete the oldest number.
Which algorithm should I use? I would like to optimize for making the calculation fast, as it's not the only thing calculated in those 100 ms.

For practical reasons and reasonable values of n you're best off with a ring buffer of primitive ints (to keep track of the oldest entry) and a linear scan to determine how many values are smaller than x.
For this to be O(log n) you would have to use something like Guava's TreeMultiset. Here is an outline of how it would look.
class Statistics {

    private final static int N = 180;
    Queue<Integer> queue = new LinkedList<Integer>();
    SortedMap<Integer, Integer> counts = new TreeMap<Integer, Integer>();

    public int insertAndGetSmallerCount(int x) {
        queue.add(x);                             // O(1)
        counts.put(x, getCount(x) + 1);           // O(log N)
        int lessCount = 0;                        // O(N), unfortunately
        for (int i : counts.headMap(x).values())  // use Guava's TreeMultiset
            lessCount += i;                       // for O(log N)
        if (queue.size() > N) {                   // O(1)
            int oldest = queue.remove();          // O(1)
            int newCount = getCount(oldest) - 1;  // O(log N)
            if (newCount == 0)
                counts.remove(oldest);            // O(log N)
            else
                counts.put(oldest, newCount);     // O(log N)
        }
        return lessCount;
    }

    private int getCount(int x) {
        return counts.containsKey(x) ? counts.get(x) : 0;
    }
}
On my 1.8 GHz laptop, this solution performs 1,000,000 iterations in about 13 seconds (i.e. one iteration takes about 0.013 ms, well under 100 ms).
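For comparison, here is a minimal sketch of the O(log N) variant hinted at above, assuming Guava is on the classpath. The class name is mine, and it relies on headMultiset(...).size() being answered from the tree's subtree counts rather than by a scan:

import java.util.ArrayDeque;
import java.util.Queue;
import com.google.common.collect.BoundType;
import com.google.common.collect.TreeMultiset;

class RollingRankStatistics {
    private static final int N = 180;
    private final Queue<Integer> queue = new ArrayDeque<Integer>();
    private final TreeMultiset<Integer> sorted = TreeMultiset.create();

    /** Inserts x and returns how many retained values are strictly smaller than x. */
    public int insertAndGetSmallerCount(int x) {
        queue.add(x);                 // O(1)
        sorted.add(x);                // O(log N)
        // View of all elements < x; its size comes from subtree counts, not a scan.
        int lessCount = sorted.headMultiset(x, BoundType.OPEN).size();
        if (queue.size() > N) {
            int oldest = queue.remove();  // O(1)
            sorted.remove(oldest);        // O(log N), removes a single occurrence
        }
        return lessCount;
    }
}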

You can keep an array of 180 numbers and save an index to the oldest one so that when a new number comes in you overwrite the number at the oldest index and increment the index modulo 180 (it's a bit more complex than that since you need special behaviour for the first 180 numbers).
As for calculating how many numbers are smaller I would use the brute force way (iterate all the numbers and count).
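A minimal sketch of that ring buffer plus brute-force count (the names are mine, and the warm-up for the first 180 numbers is handled simply by tracking how many slots are filled):

class RollingWindow {
    private final int[] buf;
    private int next = 0;    // slot to overwrite next (the oldest once the buffer is full)
    private int filled = 0;  // number of valid slots during the first 180 samples

    RollingWindow(int capacity) {
        buf = new int[capacity];
    }

    /** Stores x, evicting the oldest value once full, and counts stored values smaller than x. */
    int addAndCountSmaller(int x) {
        buf[next] = x;
        next = (next + 1) % buf.length;
        if (filled < buf.length) filled++;
        int count = 0;
        for (int i = 0; i < filled; i++) {
            if (buf[i] < x) count++;   // linear scan, n = 180 at most
        }
        return count;
    }
}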
Edit: I find it funny to see that the "optimized" version runs five times slower than this trivial implementation (thanks to @Eiko for the analysis). I think this is due to the fact that when you use trees and maps you lose data locality and have many more memory faults (not to mention memory allocation and garbage collection).

Add your numbers to a list. If size > 180, remove the first number.
Counting is just iterating over the 180 elements, which is probably fast enough. It's hard to beat performance-wise.

You can use a LinkedList implementation.
With this structure, you can easily manipulate the first and the last elements of the List.
(addFirst, removeFirst, ...)
For the algorithm (find how many numbers are lower/greater), a simple loop over the list is enough, and will give you the result in less than 100 ms on a 180-element list.

You could try a custom linked list data structure where each node maintains next/prev as well as sorted next/prev references. Then inserting becomes a two-phase process: first always insert the node at the tail, then insertion-sort it into the sorted chain; the insertion sort returns the count of numbers less than x. Deleting is simply removing the head.
Here is an example, NOTE: THIS IS VERY NASTY JAVA, IT IS EXAMPLE CODE TO PURELY DEMONSTRATE THE IDEA. You get the idea! ;) Also, I'm only adding a few items, but it should give you an idea of how it would work... The worst case for this is a full iteration through the sorted linked list - which is no worse than the examples above I guess?
import java.util.*;
class SortedLinkedList {
public static class SortedLL<T>
{
public class SortedNode<T>
{
public SortedNode(T value)
{
_value = value;
}
T _value;
SortedNode<T> prev;
SortedNode<T> next;
SortedNode<T> sortedPrev;
SortedNode<T> sortedNext;
}
public SortedLL(Comparator comp)
{
_comp = comp;
_head = new SortedNode<T>(null);
_tail = new SortedNode<T>(null);
// Setup the pointers
_head.next = _tail;
_tail.prev = _head;
_head.sortedNext = _tail;
_tail.sortedPrev = _head;
_sortedHead = _head;
_sortedTail = _tail;
}
int insert(T value)
{
SortedNode<T> nn = new SortedNode<T>(value);
// always add node at end
nn.prev = _tail.prev;
nn.prev.next = nn;
nn.next = _tail;
_tail.prev = nn;
// now second insert sort through..
int count = 0;
SortedNode<T> ptr = _sortedHead.sortedNext;
while(ptr.sortedNext != null)
{
if (_comp.compare(ptr._value, nn._value) >= 0)
{
break;
}
++count;
ptr = ptr.sortedNext;
}
// update the sorted pointers..
nn.sortedNext = ptr;
nn.sortedPrev = ptr.sortedPrev;
if (nn.sortedPrev != null)
nn.sortedPrev.sortedNext = nn;
ptr.sortedPrev = nn;
return count;
}
void trim()
{
// Remove from the head...
if (_head.next != _tail)
{
// trim.
SortedNode<T> tmp = _head.next;
_head.next = tmp.next;
_head.next.prev = _head;
// Now updated the sorted list
if (tmp.sortedPrev != null)
{
tmp.sortedPrev.sortedNext = tmp.sortedNext;
}
if (tmp.sortedNext != null)
{
tmp.sortedNext.sortedPrev = tmp.sortedPrev;
}
}
}
void printList()
{
SortedNode<T> ptr = _head.next;
while (ptr != _tail)
{
System.out.println("node: v: " + ptr._value);
ptr = ptr.next;
}
}
void printSorted()
{
SortedNode<T> ptr = _sortedHead.sortedNext;
while (ptr != _sortedTail)
{
System.out.println("sorted: v: " + ptr._value);
ptr = ptr.sortedNext;
}
}
Comparator _comp;
SortedNode<T> _head;
SortedNode<T> _tail;
SortedNode<T> _sortedHead;
SortedNode<T> _sortedTail;
}
public static class IntComparator implements Comparator
{
public int compare(Object v1, Object v2){
Integer iv1 = (Integer)v1;
Integer iv2 = (Integer)v2;
return iv1.compareTo(iv2);
}
}
public static void main(String[] args){
SortedLL<Integer> ll = new SortedLL<Integer>(new IntComparator());
System.out.println("inserting: " + ll.insert(1));
System.out.println("inserting: " + ll.insert(3));
System.out.println("inserting: " + ll.insert(2));
System.out.println("inserting: " + ll.insert(5));
System.out.println("inserting: " + ll.insert(4));
ll.printList();
ll.printSorted();
System.out.println("inserting new value");
System.out.println("inserting: " + ll.insert(3));
ll.trim();
ll.printList();
ll.printSorted();
}
}

Let the cache be a list, so you can insert at the start and let the oldest be at the end and be removed.
Then after every insertion just scan the whole list and calculate the number you need.

Take a look at the commons-math implementation of the DescriptiveStatistics class (Percentile.java)

180 values is not many, and a simple array with a brute-force search and System.arraycopy() should take less than 1 microsecond (1/1000 of a millisecond) and incurs no GC. It could be faster than playing with more complex collections.
I suggest you keep it simple and measure how long it takes before assuming you need to optimise it.
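For illustration, a sketch of that array-plus-System.arraycopy() idea (the window size and names are assumptions):

class ArrayWindow {
    private final int[] window = new int[180];
    private int size = 0;

    /** Appends x, dropping the oldest value once the window is full, and counts smaller values. */
    int addAndCountSmaller(int x) {
        if (size == window.length) {
            // shift everything left by one slot; trivial for 180 elements
            System.arraycopy(window, 1, window, 0, window.length - 1);
            size--;
        }
        window[size++] = x;
        int count = 0;
        for (int i = 0; i < size; i++) {
            if (window[i] < x) count++;
        }
        return count;
    }
}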

Related

What is the space complexity of computing linkedlist intersection

I am computing the intersection of 2 linked lists, where one linked list is of size 'n' and the second is of size 'm'. The code below stores the items of the smaller linked list in a set. Thus the space complexity is O(m), where m < n, i.e. m is the length of the smaller linked list.
But it is possible that the 2 linked lists are of equal size, m = n. So is the complexity O(n)?
public IntersectionAndUnionLinkedList<T> intersection(IntersectionAndUnionLinkedList<T> list) {
final Set<T> items = new HashSet<>();
Node<T> firstSmall = null;
Node<T> firstBig = null;
if (size <= list.size) {
firstSmall = first;
firstBig = list.first;
} else {
firstSmall = list.first;
firstBig = first;
}
Node<T> n = null;
for (n = firstSmall; n != null; n = n.next) {
items.add(n.item);
}
IntersectionAndUnionLinkedList<T> intersectionlist = new IntersectionAndUnionLinkedList<>();
for (n = firstBig; n != null; n = n.next) {
if (items.contains(n.item)) {
intersectionlist.add(n.item);
}
}
return intersectionlist;
}
Well in case m=n, O(m)=O(n), but it is safe to state that the memory complexity is O(m) since it's the only real factor.
On the other hand, a HashSet<T> can under extreme circumstances be less memory efficient: after all it uses buckets, and the buckets can be filled in a bad way. It depends on the exact implementation of the HashMap<T>. One still expects linear memory complexity, though, so O(m).

Smallest java structure with relatively decent contains() solution

Alright, here's the lowdown: I'm writing a class in Java that finds the Nth Hardy's Taxi number (a number that can be written as the sum of two cubes in two different ways). I have the discovery itself down, but I am in desperate need of some space saving. To that end, I need the smallest possible data structure where I can relatively easily use or create a method like contains(). I'm not particularly worried about speed, as my current solution can certainly get it to compute well within the time restrictions.
In short, the data structure needs:
To be able to relatively simply implement a contains() method
To use a low amount of memory
To be able to store very large number of entries
To be easily usable with the primitive long type
Any ideas? I started with a hash map (because I needed to test the values that led to the sum to ensure accuracy), then moved to a hash set once I guaranteed reliable answers.
Any other general ideas on how to save some space would be greatly appreciated!
I don't think you'd need the code to answer the question, but here it is in case you're curious:
public class Hardy {
// private static HashMap<Long, Long> hm;
/**
* Find the nth Hardy number (start counting with 1, not 0) and the numbers
* whose cubes demonstrate that it is a Hardy number.
* @param n the index (1-based) of the Hardy number to find
* @return the nth Hardy number
*/
public static long nthHardyNumber(int n) {
// long i, j, oldValue;
int i, j;
int counter = 0;
long xyLimit = 2147483647; // xyLimit is the max value of a 32bit signed number
long sum;
// hm = new HashMap<Long, Long>();
int hardyCalculations = (int) (n * 1.1);
HashSet<Long> hs = new HashSet<Long>(hardyCalculations * hardyCalculations, (float) 0.95);
long[] sums = new long[hardyCalculations];
// long binaryStorage, mask = 0x00000000FFFFFFFF;
for (i = 1; i < xyLimit; i++){
for (j = 1; j <= i; j++){
// binaryStorage = ((i << 32) + j);
// long y = ((binaryStorage << 32) >> 32) & mask;
// long x = (binaryStorage >> 32) & mask;
sum = cube(i) + cube(j);
if (hs.contains(sum) && !arrayContains(sums, sum)){
// oldValue = hm.get(sum);
// long oldY = ((oldValue << 32) >> 32) & mask;
// long oldX = (oldValue >> 32) & mask;
// if (oldX != x && oldX != y){
sums[counter] = sum;
counter++;
if (counter == hardyCalculations){
// Arrays.sort(sums);
bubbleSort(sums);
return sums[n - 1];
}
} else {
hs.add(sum);
}
}
}
return 0;
}
private static void bubbleSort(long[] array){
long current, next;
int i;
boolean ordered = false;
while (!ordered) {
ordered = true;
for (i = 0; i < array.length - 1; i++){
current = array[i];
next = array[i + 1];
if (current > next) {
ordered = false;
array[i] = next;
array[i+1] = current;
}
}
}
}
private static boolean arrayContains(long[] array, long n){
for (long l : array){
if (l == n){
return true;
}
}
return false;
}
private static long cube(long n){
return n*n*n;
}
}
Have you considered using a standard tree? In java that would be a TreeSet. By sacrificing speed, a tree generally gains back space over a hash.
For that matter, sums might be a TreeMap, transforming the linear arrayContains to a logarithmic operation. Being naturally ordered, there would also be no need to re-sort it afterwards.
EDIT
The complaint against using a java tree structure for sums is that java's tree types don't support the k-select algorithm. On the assumption that Hardy numbers are rare, perhaps you don't need to sweat the complexity of this container (in which case your array is fine.)
If you did need to improve time performance of this aspect, you could consider using a selection-enabled tree such as the one mentioned here. However that solution works by increasing the space requirement, not lowering it.
Alternately we can incrementally throw out Hardy numbers we know we don't need. Suppose during the running of the algorithm, sums already contains n Hardy numbers and we discover a new one. We insert it and do whatever we need to preserve collection order, and so now contains n+1 sorted elements.
Consider that last element. We already know about n smaller Hardy numbers, and so there is no possible way this last element is our answer. Why keep it? At this point we can shrink sums again down to size n and toss the largest element out. This is both a space savings, and time savings as we have fewer elements to maintain in sorted order.
The natural data structure for sums in that approach is a max heap. In java there is no native implementation available, but a few 3rd party ones are floating around. You could "make it work" with TreeMap::lastKey, which will be slower in the end, but still faster than quadratic bubbleSort.
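A hedged sketch of that "keep only the n smallest candidates" idea; a PriorityQueue with a reversed comparator can stand in for the max heap (the class and method names are mine):

import java.util.Collections;
import java.util.PriorityQueue;

/** Retains only the n smallest values offered so far; the largest retained value is at the head. */
class SmallestN {
    private final int n;
    private final PriorityQueue<Long> maxHeap =
            new PriorityQueue<Long>(11, Collections.<Long>reverseOrder());

    SmallestN(int n) {
        this.n = n;
    }

    void offer(long candidate) {
        if (maxHeap.size() < n) {
            maxHeap.add(candidate);
        } else if (candidate < maxHeap.peek()) {
            maxHeap.poll();            // the current largest can no longer be the answer
            maxHeap.add(candidate);
        }
    }

    /** Once n values have been collected, the head is the n-th smallest seen so far. */
    long currentNth() {
        return maxHeap.peek();
    }
}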
If you have an extremely large number of elements, and you effectively want an index to allow fast tests for containment in the underlying dataset, then take a look at Bloom Filters. These are space-efficient indexes whose sole purpose is to enable fast tests for containment in a dataset.
Bloom Filters are probabilistic, which means if they return true for containment, then you actually need to check your underlying dataset to confirm that the element is really present.
If they return false, the element is guaranteed not to be contained in the underlying dataset, and in that case the test for containment would be very cheap.
So it depends on whether, most of the time, you expect a candidate to really be contained in the dataset or not.
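A hedged sketch of how that might look with Guava's BloomFilter; the expected-insertion count and false-positive rate are placeholders, and the HashSet stands in for whatever the underlying dataset actually is:

import java.util.HashSet;
import java.util.Set;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

class SumIndex {
    // Sized for roughly 100 million sums at a 1% false-positive rate (placeholder numbers).
    private final BloomFilter<Long> probablySeen =
            BloomFilter.create(Funnels.longFunnel(), 100000000, 0.01);
    private final Set<Long> actuallySeen = new HashSet<Long>();  // stand-in for the real dataset

    void add(long sum) {
        probablySeen.put(sum);
        actuallySeen.add(sum);
    }

    boolean contains(long sum) {
        // A negative from the filter is definitive and cheap; a positive must be confirmed
        // against the underlying dataset because of possible false positives.
        return probablySeen.mightContain(sum) && actuallySeen.contains(sum);
    }
}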
This is the core function to find whether a given number is an HR number; it's in C, but one should get the idea:
#include <math.h>
#include <stdbool.h>

bool is_sum_of_cubes(int value)
{
int m = pow(value, 1.0/3);
int i = m;
int j = 1;
while(j < m && i >= 0)
{
int element = i*i*i + j*j*j;
if( value == element )
{
return true;
}
if(element < value)
{
++j;
}
else
{
--i;
}
}
return false;
}

Solve n-puzzle in Java

I'm trying to implement a program to solve the n-puzzle problem.
I have written a simple implementation in Java that has a state of the problem characterized by a matrix representing the tiles. I am also able to auto-generate the graph of all the states giving the starting state. On the graph, then, I can do a BFS to find the path to the goal state.
But the problem is that I run out of memory and I cannot even create the whole graph.
I tried with 2x2 tiles and it works. Also with some 3x3 boards (it depends on the starting state and how many nodes are in the graph). But in general this way is not suitable.
So I tried generating the nodes at runtime, while searching. It works, but it is slow (sometimes after some minutes it still has not ended and I terminate the program).
Btw: I give as starting state only solvable configurations and I don't create duplicated states.
So, I cannot create the graph. This leads to my main problem: I have to implement the A* algorithm and I need the path cost (i.e. for each node the distance from the starting state), but I think I cannot calculate it at runtime. I need the whole graph, right? Because A* does not follow a BFS exploration of the graph, so I don't know how to estimate the distance for each node. Hence, I don't know how to perform an A* search.
Any suggestion?
EDIT
State:
private int[][] tiles;
private int pathDistance;
private int misplacedTiles;
private State parent;
public State(int[][] tiles) {
this.tiles = tiles;
pathDistance = 0;
misplacedTiles = estimateHammingDistance();
parent = null;
}
public ArrayList<State> findNext() {
ArrayList<State> next = new ArrayList<State>();
int[] coordZero = findCoordinates(0);
int[][] copy;
if(coordZero[1] + 1 < Solver.SIZE) {
copy = copyTiles();
int[] newCoord = {coordZero[0], coordZero[1] + 1};
switchValues(copy, coordZero, newCoord);
State newState = checkNewState(copy);
if(newState != null)
next.add(newState);
}
if(coordZero[1] - 1 >= 0) {
copy = copyTiles();
int[] newCoord = {coordZero[0], coordZero[1] - 1};
switchValues(copy, coordZero, newCoord);
State newState = checkNewState(copy);
if(newState != null)
next.add(newState);
}
if(coordZero[0] + 1 < Solver.SIZE) {
copy = copyTiles();
int[] newCoord = {coordZero[0] + 1, coordZero[1]};
switchValues(copy, coordZero, newCoord);
State newState = checkNewState(copy);
if(newState != null)
next.add(newState);
}
if(coordZero[0] - 1 >= 0) {
copy = copyTiles();
int[] newCoord = {coordZero[0] - 1, coordZero[1]};
switchValues(copy, coordZero, newCoord);
State newState = checkNewState(copy);
if(newState != null)
next.add(newState);
}
return next;
}
private State checkNewState(int[][] tiles) {
State newState = new State(tiles);
for(State s : Solver.states)
if(s.equals(newState))
return null;
return newState;
}
@Override
public boolean equals(Object obj) {
if(this == null || obj == null)
return false;
if (obj.getClass().equals(this.getClass())) {
for(int r = 0; r < tiles.length; r++) {
for(int c = 0; c < tiles[r].length; c++) {
if (((State)obj).getTiles()[r][c] != tiles[r][c])
return false;
}
}
return true;
}
return false;
}
Solver:
public static final HashSet<State> states = new HashSet<State>();
public static void main(String[] args) {
solve(new State(selectStartingBoard()));
}
public static State solve(State initialState) {
TreeSet<State> queue = new TreeSet<State>(new Comparator1());
queue.add(initialState);
states.add(initialState);
while(!queue.isEmpty()) {
State current = queue.pollFirst();
for(State s : current.findNext()) {
if(s.goalCheck()) {
s.setParent(current);
return s;
}
if(!states.contains(s)) {
s.setPathDistance(current.getPathDistance() + 1);
s.setParent(current);
states.add(s);
queue.add(s);
}
}
}
return null;
}
Basically here is what I do:
- Solver's solve has a SortedSet. Elements (States) are sorted according to Comparator1, which calculates f(n) = g(n) + h(n), where g(n) is the path cost and h(n) is a heuristic (the number of misplaced tiles).
- I give the starting configuration and look for all the successors.
- If a successor has not been already visited (i.e. if it is not in the global set States) I add it to the queue and to States, setting the current state as its parent and parent's path + 1 as its path cost.
- Dequeue and repeat.
I think it should work because:
- I keep all the visited states so I'm not looping.
- Also, there won't be any useless edges because I immediately store the current node's successors. E.g.: if from A I can go to B and C, and from B I could also go to C, there won't be the edge B->C (since the path cost is 1 for each edge and A->C is cheaper than A->B->C).
- Each time I choose to expand the path with the minimum f(n), according to A*.
But it does not work. Or at least, after a few minutes it still can't find a solution (and I think that is a lot of time in this case).
If I try to create a tree structure before executing A*, I run out of memory building it.
EDIT 2
Here are my heuristic functions:
private int estimateManhattanDistance() {
int counter = 0;
int[] expectedCoord = new int[2];
int[] realCoord = new int[2];
for(int value = 1; value < Solver.SIZE * Solver.SIZE; value++) {
realCoord = findCoordinates(value);
expectedCoord[0] = (value - 1) / Solver.SIZE;
expectedCoord[1] = (value - 1) % Solver.SIZE;
counter += Math.abs(expectedCoord[0] - realCoord[0]) + Math.abs(expectedCoord[1] - realCoord[1]);
}
return counter;
}
private int estimateMisplacedTiles() {
int counter = 0;
int expectedTileValue = 1;
for(int i = 0; i < Solver.SIZE; i++)
for(int j = 0; j < Solver.SIZE; j++) {
if(tiles[i][j] != expectedTileValue)
if(expectedTileValue != Solver.ZERO)
counter++;
expectedTileValue++;
}
return counter;
}
If I use a simple greedy algorithm they both work (using Manhattan distance is really quick (around 500 iterations to find a solution), while with number of misplaced tiles it takes around 10k iterations). If I use A* (evaluating also the path cost) it's really slow.
The comparators look like this:
public int compare(State o1, State o2) {
if(o1.getPathDistance() + o1.getManhattanDistance() >= o2.getPathDistance() + o2.getManhattanDistance())
return 1;
else
return -1;
}
EDIT 3
There was a little error. I fixed it and now A* works. Or at least, for the 3x3 it finds the optimal solution with only 700 iterations. For the 4x4 it's still too slow. I'll try with IDA*, but one question: how long could it take with A* to find the solution? Minutes? Hours? I left it for 10 minutes and it didn't end.
There is no need to generate all state-space nodes to solve a problem using BFS, A* or any tree search; you just add the states you can explore from the current state to the fringe, and that's why there is a successor function.
It is normal for BFS to consume a lot of memory, but I don't know exactly for what n it would become a problem. Use DFS instead.
For A* you know how many moves you made to reach the current state, and you can estimate the moves needed to solve the problem simply by relaxing the problem. As an example you can assume that any two tiles can swap places and then count the moves needed to solve the problem. Your heuristic just needs to be admissible, i.e. your estimate must not exceed the actual number of moves needed to solve the problem.
Add a path cost to your state class, and every time you go from a parent state P to another state C do this: C.cost = P.cost + 1. This will compute the path cost for every node automatically.
This is also a very good and simple implementation in C# of an 8-puzzle solver with A*; take a look at it and you will learn many things:
http://geekbrothers.org/index.php/categories/computer/12-solve-8-puzzle-with-a
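To tie the successor-function and path-cost points together, here is a compact, hedged sketch of the usual A* loop with a PriorityQueue. It assumes a State class roughly like the one in the question (getPathDistance(), getManhattanDistance(), goalCheck(), setParent(), setPathDistance(), a findNext() that returns all legal successors, and equals()/hashCode() over the tile array):

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class AStarSolver {
    static State solve(State start) {
        // f(n) = g(n) + h(n): path cost so far plus the admissible Manhattan estimate
        Comparator<State> byF = Comparator.comparingInt(
                s -> s.getPathDistance() + s.getManhattanDistance());
        PriorityQueue<State> open = new PriorityQueue<>(byF);
        Map<State, Integer> bestCost = new HashMap<>();  // best g(n) found so far per state
        open.add(start);
        bestCost.put(start, 0);
        while (!open.isEmpty()) {
            State current = open.poll();
            if (current.goalCheck()) {
                return current;                     // test the goal on expansion, not generation
            }
            for (State next : current.findNext()) {
                int g = current.getPathDistance() + 1;  // every move costs 1
                Integer known = bestCost.get(next);
                if (known == null || g < known) {
                    next.setPathDistance(g);
                    next.setParent(current);
                    bestCost.put(next, g);
                    open.add(next);
                }
            }
        }
        return null;  // no solution reachable from this start state
    }
}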

Java On-Memory Efficient Key-Value Store

I have to store 111 million key-value pairs (one key can have multiple values - maximum 2/3) whose keys are 50-bit integers and whose values are 32-bit (maximum) integers. Now, my requirements are:
Fast Insertion of (Key, Value) pair [allowing duplicates]
Fast retrieval of value/values based on key.
A nice solution is given here based on MultiMap. However, I want to store more key-value pairs in main memory with no/little performance penalty. I studied from web articles that a B+ tree, R+ tree, B tree, compact multimap etc. can be a nice solution for that. Can anybody help me:
Is there any Java library which satisfies all those needs properly (the above-mentioned or other data structures are also acceptable, no issue with that)?
Actually, I want an efficient Java library data structure to store/retrieve key-value/values pairs which has a small memory footprint and must be built in-memory.
NB: I have tried HashMultimap (Guava with some modification with Trove) as mentioned by Louis Wasserman, Kyoto/Tokyo Cabinet, etc. My experience is not good with disk-backed solutions, so please avoid those :). Another important point for choosing a library/data structure: keys are 50 bits (so if we assign 64 bits, 14 bits will be lost) and values are 32-bit ints (maximum) - mostly they are 10-12-14 bits. So we can save space there too.
I don't think there's anything in the JDK which will do this.
However, implementing such a thing is a simple matter of programming. Here is an open-addressed hashtable with linear probing, with keys and values stored in parallel arrays:
public class LongIntParallelHashMultimap {
private static final long NULL = 0L;
private final long[] keys;
private final int[] values;
private int size;
public LongIntParallelHashMultimap(int capacity) {
keys = new long[capacity];
values = new int[capacity];
}
public void put(long key, int value) {
if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
if (size == keys.length) throw new IllegalStateException("map is full");
int index = indexFor(key);
while (keys[index] != NULL) {
index = successor(index);
}
keys[index] = key;
values[index] = value;
++size;
}
public int[] get(long key) {
if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
int index = indexFor(key);
int count = countHits(key, index);
int[] hits = new int[count];
int hitIndex = 0;
while (keys[index] != NULL) {
if (keys[index] == key) {
hits[hitIndex] = values[index];
++hitIndex;
}
index = successor(index);
}
return hits;
}
private int countHits(long key, int index) {
int numHits = 0;
while (keys[index] != NULL) {
if (keys[index] == key) ++numHits;
index = successor(index);
}
return numHits;
}
private int indexFor(long key) {
// the hashing constant is (the golden ratio * Long.MAX_VALUE) + 1
// see The Art of Computer Programming, section 6.4
// the constant has two important properties:
// (1) it is coprime with 2^64, so multiplication by it is a bijective function, and does not generate collisions in the hash
// (2) it has a 1 in the bottom bit, so it does not add zeroes in the bottom bits of the hash, and does not generate (gratuitous) collisions in the index
long hash = key * 5700357409661598721L;
return Math.abs((int) (hash % keys.length));
}
private int successor(int index) {
return (index + 1) % keys.length;
}
public int size() {
return size;
}
}
Note that this is a fixed-size structure. You will need to create it big enough to hold all your data - 110 million entries for me takes up 1.32 GB. The bigger you make it, in excess of what you need to store the data, the faster insertions and lookups will be. I found that for 110 million entries, with a load factor of 0.5 (2.64 GB, twice as much space as needed), it took on average 403 nanoseconds to look up a key, but with a load factor of 0.75 (1.76 GB, a third more space than is needed), it took 575 nanoseconds. Decreasing the load factor below 0.5 usually doesn't make much difference, and indeed, with a load factor of 0.33 (4.00 GB, three times more space than needed), I get an average time of 394 nanoseconds. So, even though you have 5 GB available, don't use it all.
Note also that zero is not allowed as a key. If this is a problem, change the null value to be something else, and pre-fill the keys array with that on creation.
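For example, sizing it for the 0.5 load factor measured above might look like this (a sketch only; it needs a correspondingly large heap, and the key is just an arbitrary non-zero 50-bit example):

public class MultimapSizingDemo {
    public static void main(String[] args) {
        // 110 million entries at a load factor of 0.5 -> roughly 220 million slots (about 2.64 GB)
        int entries = 110000000;
        double loadFactor = 0.5;
        LongIntParallelHashMultimap map =
                new LongIntParallelHashMultimap((int) (entries / loadFactor));
        map.put(123456789012L, 42);              // keys must be non-zero in this variant
        int[] values = map.get(123456789012L);   // every value stored under that key
        System.out.println(values.length + " value(s) found");
    }
}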
Is there any Java library which satisfies all those needs properly?
AFAIK no. Or at least, not one that minimizes the memory footprint.
However, it should be easy to write a custom map class that is specialized to these requirements.
It's a good idea to look for databases, because problems like these are what they are designed for. In recent years Key-Value databases became very popular, e.g. for web services (keyword "NoSQL"), so you should find something.
The choice of a custom data structure also depends on whether you want to use a hard drive to store your data (and how safe that has to be) or whether it can be completely lost on program exit.
If implementing manually and the whole db fits into memory somewhat easily, I'd just implement a hashmap in C. Create a hash function that gives a (well-spread) memory address from a value. Insert there, or next to it if already assigned. Insertion and retrieval are then O(1). If you implement it in Java, you'll have the 4-byte overhead for each (primitive) object.
Based on @Tom Anderson's solution I removed the need to allocate objects and added a performance test.
import java.util.Arrays;
import java.util.Random;
public class LongIntParallelHashMultimap {
private static final long NULL = Long.MIN_VALUE;
private final long[] keys;
private final int[] values;
private int size;
public LongIntParallelHashMultimap(int capacity) {
keys = new long[capacity];
values = new int[capacity];
Arrays.fill(keys, NULL);
}
public void put(long key, int value) {
if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
if (size == keys.length) throw new IllegalStateException("map is full");
int index = indexFor(key);
while (keys[index] != NULL) {
index = successor(index);
}
keys[index] = key;
values[index] = value;
++size;
}
public int get(long key, int[] hits) {
if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
int index = indexFor(key);
int hitIndex = 0;
while (keys[index] != NULL) {
if (keys[index] == key) {
hits[hitIndex] = values[index];
++hitIndex;
if (hitIndex == hits.length)
break;
}
index = successor(index);
}
return hitIndex;
}
private int indexFor(long key) {
return Math.abs((int) (key % keys.length));
}
private int successor(int index) {
index++;
return index >= keys.length ? index - keys.length : index;
}
public int size() {
return size;
}
public static class PerfTest {
public static void main(String... args) {
int values = 110* 1000 * 1000;
long start0 = System.nanoTime();
long[] keysValues = generateKeys(values);
LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(222222227);
long start = System.nanoTime();
addKeyValues(values, keysValues, map);
long mid = System.nanoTime();
int sum = lookUpKeyValues(values, keysValues, map);
long time = System.nanoTime();
System.out.printf("Generated %.1f M keys/s, Added %.1f M/s and looked up %.1f M/s%n",
values * 1e3 / (start - start0), values * 1e3 / (mid - start), values * 1e3 / (time - mid));
System.out.println("Expected " + values + " got " + sum);
}
private static long[] generateKeys(int values) {
Random rand = new Random();
long[] keysValues = new long[values];
for (int i = 0; i < values; i++)
keysValues[i] = rand.nextLong();
return keysValues;
}
private static void addKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
for (int i = 0; i < values; i++) {
map.put(keysValues[i], i);
}
assert map.size() == values;
}
private static int lookUpKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
int[] found = new int[8];
int sum = 0;
for (int i = 0; i < values; i++) {
sum += map.get(keysValues[i], found);
}
return sum;
}
}
}
prints
Generated 34.8 M keys/s, Added 11.1 M/s and looked up 7.6 M/s
Run on a 3.8 GHz i7 with Java 7 update 3.
This is much slower than the previous test because you are accessing main memory at random, rather than the cache. This is really a test of the speed of your memory. The writes are faster because they can be performed asynchronously to main memory.
Using this collection
final SetMultimap<Long, Integer> map = Multimaps.newSetMultimap(
TDecorators.wrap(new TLongObjectHashMap<Collection<Integer>>()),
new Supplier<Set<Integer>>() {
public Set<Integer> get() {
return TDecorators.wrap(new TIntHashSet());
}
});
Running the same test with 50 million entries (which used about 16 GB) and -mx20g, I got the following result.
Generated 47.2 M keys/s, Added 0.5 M/s and looked up 0.7 M/s
For 110 M entries you will need about 35 GB of memory and a machine 10 x faster than mine (3.8 GHz) to perform 5 million adds per second.
If you must use Java, then implement your own hashtable/hashmap. An important property of your table is to use a linked list to handle collisions. Hence when you do a lookup, you can return all the elements on the list.
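A hedged sketch of that chaining idea (fixed bucket count, no resizing, names are mine); note that the per-entry node objects trade away some of the memory savings of the parallel-array versions above:

import java.util.ArrayList;
import java.util.List;

class ChainedLongIntMultimap {
    private static class Entry {
        final long key;
        final int value;
        final Entry next;   // collision chain within the bucket
        Entry(long key, int value, Entry next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Entry[] buckets;

    ChainedLongIntMultimap(int bucketCount) {
        buckets = new Entry[bucketCount];
    }

    private int bucketFor(long key) {
        // spread the 50-bit keys, then reduce to a non-negative bucket index
        return (int) Math.floorMod(key * 0x9E3779B97F4A7C15L, (long) buckets.length);
    }

    void put(long key, int value) {
        int b = bucketFor(key);
        buckets[b] = new Entry(key, value, buckets[b]);   // prepend to the chain
    }

    /** Returns every value stored under the key by walking that bucket's chain. */
    List<Integer> get(long key) {
        List<Integer> hits = new ArrayList<Integer>();
        for (Entry e = buckets[bucketFor(key)]; e != null; e = e.next) {
            if (e.key == key) {
                hits.add(e.value);
            }
        }
        return hits;
    }
}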
I might be late in answering this question, but Elasticsearch will solve your problem.

Why does Java's ArrayList's remove function seem to cost so little?

I have a function which manipulates a very large list, exceeding about 250,000 items. For the majority of those items, it simply replaces the item at position x. However, for about 5% of them, it must remove them from the list.
Using a LinkedList seemed to be the most obvious solution to avoid expensive removals. However, naturally, accessing a LinkedList by index becomes increasingly slow as time goes on. The cost here is minutes (and a lot of them).
Using an Iterator over that LinkedList is also expensive, as I appear to need a separate copy to avoid Iterator concurrency issues while editing that list. The cost here is minutes.
However, here's where my mind is blown a bit. If I change to an ArrayList, it runs almost instantly.
For a list with 297515 elements, removing 11958 elements and modifying everything else takes 909ms. I verified that the resulting list is indeed 285557 in size, as expected, and contains the updated information I need.
Why is this so fast? I looked at the source for ArrayList in JDK6 and it appears to be using an arraycopy function as expected. I would love to understand why an ArrayList works so well here when common sense would seem to indicate that an array for this task is an awful idea, requiring shifting several hundred thousand items.
I ran a benchmark, trying each of the following strategies for filtering the list elements:
Copy the wanted elements into a new list
Use Iterator.remove() to remove the unwanted elements from an ArrayList
Use Iterator.remove() to remove the unwanted elements from a LinkedList
Compact the list in-place (moving the wanted elements to lower positions)
Remove by index (List.remove(int)) on an ArrayList
Remove by index (List.remove(int)) on a LinkedList
Each time I populated the list with 100000 random instances of Point and used a filter condition (based on the hash code) that would accept 95% of elements and reject the remaining 5% (the same proportion stated in the question, but with a smaller list because I didn't have time to run the test for 250000 elements.)
And the average times (on my old MacBook Pro: Core 2 Duo, 2.2GHz, 3Gb RAM) were:
CopyIntoNewListWithIterator : 4.24ms
CopyIntoNewListWithoutIterator: 3.57ms
FilterLinkedListInPlace : 4.21ms
RandomRemoveByIndex : 312.50ms
SequentialRemoveByIndex : 33632.28ms
ShiftDown : 3.75ms
So removing elements by index from a LinkedList was more than 300 times more expensive than removing them from an ArrayList, and probably somewhere between 6000-10000 times more expensive than the other methods (that avoid linear search and arraycopy)
Here there doesn't seem to be much difference between the four faster methods, but I ran just those four again with a 500000-element list with the following results:
CopyIntoNewListWithIterator : 92.49ms
CopyIntoNewListWithoutIterator: 71.77ms
FilterLinkedListInPlace : 15.73ms
ShiftDown : 11.86ms
I'm guessing that with the larger size cache memory becomes the limiting factor, so the cost of creating a second copy of the list becomes significant.
Here's the code:
import java.awt.Point;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
public class ListBenchmark {
public static void main(String[] args) {
Random rnd = new SecureRandom();
Map<String, Long> timings = new TreeMap<String, Long>();
for (int outerPass = 0; outerPass < 10; ++ outerPass) {
List<FilterStrategy> strategies =
Arrays.asList(new CopyIntoNewListWithIterator(),
new CopyIntoNewListWithoutIterator(),
new FilterLinkedListInPlace(),
new RandomRemoveByIndex(),
new SequentialRemoveByIndex(),
new ShiftDown());
for (FilterStrategy strategy: strategies) {
String strategyName = strategy.getClass().getSimpleName();
for (int innerPass = 0; innerPass < 10; ++ innerPass) {
strategy.populate(rnd);
if (outerPass >= 5 && innerPass >= 5) {
Long totalTime = timings.get(strategyName);
if (totalTime == null) totalTime = 0L;
timings.put(strategyName, totalTime - System.currentTimeMillis());
}
Collection<Point> filtered = strategy.filter();
if (outerPass >= 5 && innerPass >= 5) {
Long totalTime = timings.get(strategyName);
timings.put(strategy.getClass().getSimpleName(), totalTime + System.currentTimeMillis());
}
CHECKSUM += filtered.hashCode();
System.err.printf("%-30s %d %d %d%n", strategy.getClass().getSimpleName(), outerPass, innerPass, filtered.size());
strategy.clear();
}
}
}
for (Map.Entry<String, Long> e: timings.entrySet()) {
System.err.printf("%-30s: %9.2fms%n", e.getKey(), e.getValue() * (1.0/25.0));
}
}
public static volatile int CHECKSUM = 0;
static void populate(Collection<Point> dst, Random rnd) {
for (int i = 0; i < INITIAL_SIZE; ++ i) {
dst.add(new Point(rnd.nextInt(), rnd.nextInt()));
}
}
static boolean wanted(Point p) {
return p.hashCode() % 20 != 0;
}
static abstract class FilterStrategy {
abstract void clear();
abstract Collection<Point> filter();
abstract void populate(Random rnd);
}
static final int INITIAL_SIZE = 100000;
private static class CopyIntoNewListWithIterator extends FilterStrategy {
public CopyIntoNewListWithIterator() {
list = new ArrayList<Point>(INITIAL_SIZE);
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
ArrayList<Point> dst = new ArrayList<Point>(list.size());
for (Point p: list) {
if (wanted(p)) dst.add(p);
}
return dst;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final ArrayList<Point> list;
}
private static class CopyIntoNewListWithoutIterator extends FilterStrategy {
public CopyIntoNewListWithoutIterator() {
list = new ArrayList<Point>(INITIAL_SIZE);
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
int inputSize = list.size();
ArrayList<Point> dst = new ArrayList<Point>(inputSize);
for (int i = 0; i < inputSize; ++ i) {
Point p = list.get(i);
if (wanted(p)) dst.add(p);
}
return dst;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final ArrayList<Point> list;
}
private static class FilterLinkedListInPlace extends FilterStrategy {
public String toString() {
return getClass().getSimpleName();
}
FilterLinkedListInPlace() {
list = new LinkedList<Point>();
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
for (Iterator<Point> it = list.iterator();
it.hasNext();
) {
Point p = it.next();
if (! wanted(p)) it.remove();
}
return list;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final LinkedList<Point> list;
}
private static class RandomRemoveByIndex extends FilterStrategy {
public RandomRemoveByIndex() {
list = new ArrayList<Point>(INITIAL_SIZE);
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
for (int i = 0; i < list.size();) {
if (wanted(list.get(i))) {
++ i;
} else {
list.remove(i);
}
}
return list;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final ArrayList<Point> list;
}
private static class SequentialRemoveByIndex extends FilterStrategy {
public SequentialRemoveByIndex() {
list = new LinkedList<Point>();
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
for (int i = 0; i < list.size();) {
if (wanted(list.get(i))) {
++ i;
} else {
list.remove(i);
}
}
return list;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final LinkedList<Point> list;
}
private static class ShiftDown extends FilterStrategy {
public ShiftDown() {
list = new ArrayList<Point>();
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
int inputSize = list.size();
int outputSize = 0;
for (int i = 0; i < inputSize; ++ i) {
Point p = list.get(i);
if (wanted(p)) {
list.set(outputSize++, p);
}
}
list.subList(outputSize, inputSize).clear();
return list;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final ArrayList<Point> list;
}
}
Array copy is a rather inexpensive operation. It is done at a very basic level (it's a Java native static method) and you are not yet in the range where the performance becomes really important.
In your example you copy an array of size 150000 (on average) approximately 12000 times. This does not take much time. I tested it here on my laptop and it took less than 500 ms.
Update: I used the following code to measure on my laptop (Intel P8400).
import java.util.Random;
public class PerformanceArrayCopy {
public static void main(String[] args) {
int[] lengths = new int[] { 10000, 50000, 125000, 250000 };
int[] loops = new int[] { 1000, 5000, 10000, 20000 };
for (int length : lengths) {
for (int loop : loops) {
Object[] list1 = new Object[length];
Object[] list2 = new Object[length];
for (int k = 0; k < 100; k++) {
System.arraycopy(list1, 0, list2, 0, list1.length);
}
int[] len = new int[loop];
int[] ofs = new int[loop];
Random rnd = new Random();
for (int k = 0; k < loop; k++) {
len[k] = rnd.nextInt(length);
ofs[k] = rnd.nextInt(length - len[k]);
}
long n = System.nanoTime();
for (int k = 0; k < loop; k++) {
System.arraycopy(list1, ofs[k], list2, ofs[k], len[k]);
}
n = System.nanoTime() - n;
System.out.print("length: " + length);
System.out.print("\tloop: " + loop);
System.out.print("\truntime [ms]: " + n / 1000000);
System.out.println();
}
}
}
}
Some results:
length: 10000 loop: 10000 runtime [ms]: 47
length: 50000 loop: 10000 runtime [ms]: 228
length: 125000 loop: 10000 runtime [ms]: 575
length: 250000 loop: 10000 runtime [ms]: 1198
I think the difference in performance likely comes down to the fact that ArrayList supports random access where LinkedList does not.
If I call get(1000) on an ArrayList I am specifying a specific index to access; however, LinkedList doesn't support this as it is organized through node references.
If I call get(1000) on a LinkedList, it will iterate the list until it finds index 1000, and this can be exorbitantly expensive if you have a large number of items in the LinkedList.
Interesting and unexpected results. This is just a hypothesis, but...
On average one of your array element removals will require moving half of your list (everything after it) back one element. If each item is a 64-bit pointer to an object (8 bytes), then this means copying 125000 items x 8 Bytes per pointer = 1 MB.
A modern CPU can copy a contiguous block of 1 MB of RAM to RAM pretty quickly.
Compared to looping over a linked list for every access, which requires comparisons and branching and other CPU unfriendly activities, the RAM copy is fast.
You should really try benchmarking the various operations independently and see how efficient they are with various list implementations. Share your results here if you do!
I'm skipping over some implementation details on purpose here, just to explain the fundamental difference.
To remove the N-th element of a list of M elements, the LinkedList implementation will navigate up to this element, then simply remove it and update the pointers of the N-1 and N+1 elements accordingly. This second operation is very simple, but it's getting up to this element that costs you time.
For an ArrayList however, the access time is instantaneous as it is backed by an array, meaning contiguous memory spaces. You can jump directly to the right memory address to perform, broadly speaking, the following:
reallocate a new array of M - 1 elements
put everything from 0 to N - 1 at index 0 in the new arraylist's array
put everything N + 1 to M at index N in the arraylist's array.
Thinking of it, you'll notice you can even reuse the same array as Java can use ArrayList with pre-allocated sizes, so if you remove elements you might as well skip steps 1 and 2 and directly do step 3 and update your size.
Memory accesses are fast, and copying a chunk of memory is probably sufficiently fast on modern hardware that it beats walking a linked list to the N-th position.
However, should you use your LinkedList in such a way that it allows you to remove multiple elements that follow each other and keep track of your position, you would see a gain.
But clearly, on a long list, doing a simple remove(i) will be costly.
To add a bit of salt and spice to this:
See the note on Efficiency on the Array Data Structure and the note on Performance on the Dynamic Array Wikipedia entries, which describe your concern.
Keep in mind that using a memory structure that requires contiguous memory requires, well, contiguous memory. Which means your virtual memory will need to be able to allocate contiguous chunks. Or even with Java, you'll see your JVM happily going down with an obscure OutOfMemoryException taking its cause in a low-level crash.
