Fastest Java HashSet<Integer> library [closed]

In addition to this quite old post, I need something that will use primitives and give a speedup for an application that contains lots of HashSets of Integers:
Set<Integer> set = new HashSet<Integer>();
People mention libraries like Guava, Javolution, and Trove, but there is no thorough comparison of them in terms of benchmarks and performance results, or at least a good answer based on real experience. From what I see, many recommend Trove's TIntHashSet, but others say it is not that good; some say Guava is super cool and maintainable, but I do not need beauty and maintainability, only execution time, so the Pythonic-style Guava goes home :) Javolution? I've visited the website; it seems too old to me and thus wacky.
The library should provide the best achievable time, memory does not matter.
Looking at "Thinking in Java", there is an idea of creating custom HashMap with int[] as keys. So I would like to see something similar with a HashSet or simply download and use an amazing library.
EDIT (in response to the comments below)
In my project I start with about 50 HashSet<Integer> collections, then I call a function about 1000 times that internally creates up to 10 HashSet<Integer> collections each time. If I change the initial parameters, those numbers may grow exponentially. I only use the add(), contains() and clear() methods on those collections, which is why a HashSet was chosen.
Now I'm looking for a library that implements a HashSet or something similar, but does it faster by avoiding the Integer autoboxing overhead and maybe other costs I'm not aware of. In fact, my data comes in as ints, and I store them in those HashSets.

Trove is an excellent choice.
The reason it is much faster than the generic collections is memory use.
A java.util.HashSet<Integer> uses a java.util.HashMap<Integer, Integer> internally. In a HashMap, each element is wrapped in an Entry<Integer, Integer> object. These take an estimated 24 bytes for the Entry + 16 bytes for the boxed Integer + 4 bytes in the actual hash table. That yields 44 bytes, as opposed to 4 bytes in Trove: an up to 11x memory overhead (note that unoccupied slots in the main table will make the difference smaller in practice).
See also these experiments:
http://www.takipiblog.com/2014/01/23/java-scala-guava-and-trove-collections-how-much-can-they-hold/
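For illustration, here is a rough sketch of what the switch might look like with Trove 3 (assuming the gnu.trove jar is on the classpath); it keeps the add()/contains()/clear() usage from the question but works directly on primitive ints:

import gnu.trove.set.hash.TIntHashSet;

public class TroveSetExample {
    public static void main(String[] args) {
        // No boxing: the set stores primitive ints in an open-addressed table.
        TIntHashSet set = new TIntHashSet();

        set.add(42);
        set.add(7);

        System.out.println(set.contains(42)); // true
        System.out.println(set.contains(13)); // false

        set.clear();                          // reuse the backing array
        System.out.println(set.size());       // 0
    }
}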

Take a look at the High Performance Primitive Collections for Java (HPPC). It is an alternative to Trove, mature and carefully designed for efficiency. See the JavaDoc for IntOpenHashSet.
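A minimal sketch of the equivalent usage in HPPC (note the class is called IntOpenHashSet in older releases and IntHashSet in newer ones, so check the version you depend on):

import com.carrotsearch.hppc.IntOpenHashSet;

public class HppcSetExample {
    public static void main(String[] args) {
        IntOpenHashSet set = new IntOpenHashSet(); // primitive ints, no autoboxing

        set.add(42);
        set.add(7);

        System.out.println(set.contains(42)); // true
        set.clear();
    }
}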

Have you tried working with the initial capacity and load factor parameters while creating your HashSet?
HashSet doc
Initial capacity, as you might guess, refers to how big the empty hash set will be when it is created, and the load factor is a threshold that determines when to grow the hash table. Normally you would like to keep the ratio between used buckets and total buckets below two thirds, which is often regarded as the best ratio for good, stable hash table performance.
Dynamic resizing of a hash table
So basically, try to set an initial capacity that fits your needs (to avoid re-creating and rehashing the table as it grows), and fiddle with the load factor until you find a sweet spot.
It might be that for your particular data distribution and pattern of setting/getting values, a lower load factor helps (a higher one hardly will, but your mileage may vary).
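As a sketch of what that tuning looks like (the numbers below are placeholders, not recommended values):

import java.util.HashSet;
import java.util.Set;

public class PresizedSetExample {
    public static void main(String[] args) {
        int expectedElements = 10_000;   // assumption: a rough upper bound is known in advance
        float loadFactor = 0.66f;        // grow when about two thirds of the buckets are used

        // Size the table so it does not need to resize while filling up.
        int initialCapacity = (int) Math.ceil(expectedElements / loadFactor);
        Set<Integer> set = new HashSet<>(initialCapacity, loadFactor);

        for (int i = 0; i < expectedElements; i++) {
            set.add(i);                  // no rehashing happens during this loop
        }
        System.out.println(set.size());
    }
}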


Direction for Implementing EntityCollection task [closed]

I have the following task:
Guidelines: You may implement your solution in Java/C#.
You are asked to implement the EntityCollection interface which is specified in the attached Java file.
Your implementation should support the following operations:
a. Add - adds the entity which is given as input to the collection.
b. Remove Max Value - removes the entity with the maximal value from the collection and returns it.
You should provide 3 implementations for the following use-cases (A-C), according to the frequencies of performing Add & Remove Max Value in these use-cases.
Each use-case implementation should be optimized in terms of its worst-case time complexity:
If one operation is more frequent than the other (e.g. high vs. low), then the frequent operation should have the lowest possible complexity, whereas the other operation may have higher complexity but should still be optimized as much as possible.
If both operations are equally frequent (e.g. medium vs. medium), then both should have similar complexity, as low as possible for each operation while also taking into account the need for similar complexity in the other operation.
The given Java code:
public interface Entity
{
    public int getValue(); // unique
}

public interface EntityCollection
{
    public void add(Entity entity);
    public Entity removeMaxValue();
}
Notes: You may use any existing collections/data structures in your
solution.
My question: Do you think this assignment is clear enough? I feel a bit foggy about how to approach it.
I think they are asking me to write some collection, but I can't see what the use cases/operations are supposed to mean.
Any directions hints/code examples would be appreciated.
The importance of this assignment is in your understanding of data structures and algorithms.
Doing a lot of adding and not a lot of removing the max value? Use a linked list. A linked list is O(1) for adding a new value, so use that, and use an easy-to-implement search for your second operation since it isn't used much.
For the second use case, you need to balance the speed of both operations, so choose a data structure that has decent speed for both. Maybe a Binary Search Tree.
And so on for the final case.
Here is a nice link outlining data structures and their speeds: Cheat Sheet
You could choose a hash table for some of these, but note that despite the hash table's speed, it trades a lot of extra memory to achieve it. However, that is only a concern if memory is a problem or you are working with large data sets.
IMO you need to look into data structures that efficiently support the add or remove operation required by each use case, then internally use the corresponding built-in data structure in the language of your choice; if you want more flexibility, implement that data structure yourself.
For example, in use case A the add frequency is high and the remove-max-value frequency is low, so you may use a data structure that supports addition in O(1) (constant) time. For remove, something with O(n) (linear) time complexity, or anything below O(n^2), is good enough for a low-frequency operation.
So for use case A you can use a linked list: addition is O(1), and for remove-max you scan the list for the maximum, which is O(n) (sorting first would cost O(n log n) and is not necessary).
For use case C you can go with a priority queue, which gives O(log n) for adding and O(log n) for removing the max (peeking at the max is O(1)). In Java, PriorityQueue is a binary heap; it is a min-heap by default, so you need a reversed comparator to get the maximum first.
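For instance, here is a minimal sketch (assuming the Entity and EntityCollection interfaces quoted in the question) of a heap-backed implementation suited to the case where both operations are frequent:

import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch only: a max-heap-backed collection, O(log n) add and O(log n) removeMaxValue.
class HeapEntityCollection implements EntityCollection {
    // Java's PriorityQueue is a min-heap by default, so reverse the order to get max-first.
    private final PriorityQueue<Entity> heap =
            new PriorityQueue<>(Comparator.comparingInt(Entity::getValue).reversed());

    public void add(Entity entity) {
        heap.add(entity);               // O(log n)
    }

    public Entity removeMaxValue() {
        return heap.poll();             // O(log n); returns null if the collection is empty
    }
}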

Compact alternatives to Java ArrayList<String> [duplicate]

This question already has answers here:
HashSet of Strings taking up too much memory, suggestions...?
I need to store a large dictionary of natural language words -- up to 120,000, depending on the language. These need to be kept in memory as profiling has shown that the algorithm which utilises the array is the time bottleneck in the system. (It's essentially a spellchecking/autocorrect algorithm, though the details don't matter.) On Android devices with 16MB memory, the memory overhead associated with Java Strings is causing us to run out of space. Note that each String has a 38 byte overhead associated with it, which gives up to a 5MB overhead.
At first sight, one option is to substitute char[] for String. (Or even byte[], as UTF-8 is more compact in this case.) But again, the memory overhead is an issue: each Java array has a 32 byte overhead.
One alternative to ArrayList<String>, etc. is to create a class with much the same interface that internally concatenates all the strings into one gigantic string, e.g. represented as a single byte[], and then stores offsets into that huge array. Each offset would take up 4 bytes, giving a much more space-efficient solution.
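For concreteness, a minimal sketch of that idea (class and method names are hypothetical): all words are appended to one growing byte[], and only int offsets are kept, so the per-word overhead is roughly 4 bytes.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CompactWordList {
    private byte[] data = new byte[1 << 12]; // concatenated UTF-8 bytes of all words
    private int dataSize = 0;                // bytes actually used in data
    private int[] offsets = new int[256];    // start offset of each word within data
    private int count = 0;                   // number of words stored

    void add(String word) {
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        if (dataSize + utf8.length > data.length) {
            data = Arrays.copyOf(data, Math.max(data.length * 2, dataSize + utf8.length));
        }
        if (count == offsets.length) {
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
        }
        offsets[count++] = dataSize;
        System.arraycopy(utf8, 0, data, dataSize, utf8.length);
        dataSize += utf8.length;
    }

    String get(int index) {
        int start = offsets[index];
        int end = (index + 1 < count) ? offsets[index + 1] : dataSize;
        return new String(data, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        CompactWordList words = new CompactWordList();
        words.add("hello");
        words.add("world");
        System.out.println(words.get(1)); // world
    }
}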
My questions are a) are there any other solutions to the problem with similarly low overheads* and b) is any solution available off-the-shelf? Searching through the Guava, trove and PCJ collection libraries yields nothing.
*I know one can get the overhead down below 4 bytes, but there are diminishing returns.
NB. Support for Compressed Strings being Dropped in HotSpot JVM? suggests that the JVM option -XX:+UseCompressedStrings isn't going to help here.
I had to develop a word dictionary for a class project. We ended up using a trie as the data structure. I'm not sure about the size difference between an ArrayList and a trie, but the performance is a lot better.
Here are some resources that could be helpful.
https://en.wikipedia.org/wiki/Trie
https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/
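For reference, a toy trie sketch in Java (a real implementation for the memory-constrained case above would use compact child arrays instead of a HashMap per node, which this sketch does not do):

import java.util.HashMap;
import java.util.Map;

class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        Trie trie = new Trie();
        trie.add("cat");
        System.out.println(trie.contains("cat")); // true
        System.out.println(trie.contains("ca"));  // false: a prefix, not a stored word
    }
}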

Scalability in Java

If millions of words are added to a HashSet, on average, roughly, will there be any performance issues?
In my opinion, since a HashSet has expected O(1) complexity for add and contains, on average it will stay below O(log N), resulting in no performance issues. I would still like to hear answers from others. This was one of the questions asked to me in an interview.
HashSet provides good add performance, so it should work fine. It is also important to implement hashCode() correctly for the element type.
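As a sketch of what that means for a custom element type (Point is a made-up example class, not from the question):

import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class Point {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override public int hashCode() {
        return Objects.hash(x, y);  // equal points always produce equal hash codes
    }

    public static void main(String[] args) {
        Set<Point> set = new HashSet<>();
        set.add(new Point(1, 2));
        System.out.println(set.contains(new Point(1, 2))); // true only because equals/hashCode agree
    }
}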

What is the general purpose of using hashtables as a collection? [duplicate]

Possible Duplicate:
What exactly are hashtables?
I understand the purpose of using hash functions to securely store passwords. I have used arrays and ArrayLists for class projects for sorting and searching data. What I am having trouble understanding is the practical value of hashtables for something like sorting and searching.
I got a lecture on hashtables but we never had to use them in school, so it hasn't clicked. Can someone give me a practical example of a task a hashtable is useful for that couldn't be done with a numerical array or ArrayList? Also, a very simple low-level example of a hash function would be helpful.
There are all sorts of collections out there. Collections are used for storing and retrieving things, so one of the most important properties of a collection is how fast these operations are. To estimate "fastness", people in computer science use big-O notation, which roughly says how many individual operations you need to invoke a certain method (be it get or set, for example). For example, to get an element of an ArrayList by index you need exactly 1 operation, which is O(1); if you have a LinkedList of length n and you need to get something from the middle, you have to traverse from the start of the list to the middle, taking n/2 operations, so get has complexity O(n). The same applies to key-value stores such as hashtables: there are implementations that give you O(log n) complexity to get a value by its key, whereas a hashtable does it in O(1). Basically, this means that getting a value from a hashtable by its key is really cheap.
Basically, hashtables have performance characteristics similar to arrays with numerical indices (cheap lookup, cheap appending; for hashtables, adding is cheap partly because they are unordered), but are much more flexible in terms of what the key may be. Given a contiguous chunk of memory and a fixed size per item, you can get the address of the nth item very easily and cheaply. That's thanks to the indices being integers - you can't do that with, say, strings, at least not directly. Hashing allows reducing any object (that implements it) to a number, and then you're back to arrays. You still need to check for hash collisions and resolve them (which incurs mostly a memory overhead, since you need to store the original value), but with a halfway decent implementation this is not much of an issue.
So you can now associate any (hashable) object with any value whatsoever. This has countless uses (although I have to admit, I can't think of one that's directly applicable to sorting or searching). You can build caches with small overhead (because checking whether the cache can help in a given case is O(1)), implement a relatively performant object system (several dynamic languages do this), go through a list of (id, value) pairs and accumulate the values for identical ids in any way you like, and many other things.
Very simple. Hashtables are often called "associative arrays." Arrays let you access your data by index; hash tables let you access your data by any other identifier, e.g. a name. For example:
"one" is associated with 1
"two" is associated with 2
So, when you have the word "one" you can find its value 1 using a hashtable where the key is "one" and the value is 1. An array only allows the opposite mapping.
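In Java that might look like this trivial sketch:

import java.util.HashMap;
import java.util.Map;

public class WordToNumber {
    public static void main(String[] args) {
        Map<String, Integer> numbers = new HashMap<>();
        numbers.put("one", 1);
        numbers.put("two", 2);

        // Look the value up by the word itself, not by a numeric index.
        System.out.println(numbers.get("one")); // 1
    }
}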
For n data elements:
Hashtables allow O(k) searches (where k usually depends only on the hashing function). This is better than the O(log n) of binary search (which requires an O(n log n) sort first; if the data is not sorted you are worse off).
However, on the flip side, hashtables tend to take roughly 3n space.

How to implement hash function in Java?

I've used an array as the hash table for a hashing algorithm, with values:
int[] arr = {4, 5, 64, 432};
and keys as consecutive integers in an array:
int[] keys = {1, 2, 3, 4};
Could anyone please tell me what would be a good approach for mapping those integer keys to those array locations? Is the following a short and sound approach with little or no collision (also for larger values)?
keys[i] % arr.length // where i indexes the elements of the keys array
Thanks in advance.
I assume you're trying to implement some kind of hash table as an exercise. Otherwise, you should just use a java.util.HashMap or java.util.TreeMap or similar.
For a small set of values, as you have given above, your solution is fine. The real question will come when your data grows much bigger.
You have identified that collisions are undesirable - that is true. Sometimes, some knowledge of the likely keys can help you design a good hash function. Sometimes, you can assume that the key class will have a good hashCode() method. Since hashCode() is defined by Object, every class has one. It would be neatest for you to be able to reuse the hashCode() of your key rather than build a new algorithm specially for your map.
If all integer keys are equally likely, then a mod function will spread them out evenly amongst the different buckets, minimising collisions. However, if you know that the keys are going to be numbered consecutively, it might be better to use a List than a HashMap - this will guarantee no collisions.
Any reason not to use the built-in HashMap? You will have to use Integer though, not int.
java.util.Map<Integer, Integer> myMap = new java.util.HashMap<Integer, Integer>();
Since you want to implement your own, then first brush-up on hash tables by reading the Wikipedia article. After that, you could study the HashMap source code.
This StackOverflow question contains interesting links for implementing fast hashmaps (for C++ though), as does this one (for Java).
Get yourself a book about algorithms and data structures and read the chapter about hash tables (the Wikipedia article would also be a good entry point). It's a complex topic and far beyond the scope of a Q&A site like this.
For starters, using the array-size modulo is in general a horrible hash function, because it results in massive collisions when the values are multiples of the array size or one of its divisors. How bad that is depends on the array size: the more divisors it has, the more likely are collisions; when it's a prime number, it's not too bad (but not really good either).
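To illustrate the point about divisors with a toy example (not part of the question's code): with a table of size 8, values that are all multiples of 4 land in only two buckets.

public class ModuloCollisions {
    public static void main(String[] args) {
        int tableSize = 8;                      // composite size with divisors 2 and 4
        int[] values = {4, 8, 12, 16, 20, 24};  // all multiples of 4

        for (int v : values) {
            System.out.println(v + " -> bucket " + (v % tableSize));
        }
        // Only buckets 0 and 4 out of 8 are ever used: a worst case for value % tableSize.
    }
}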
