I am implementing my own hashtable in Java, since the built-in hashtable has a ridiculous memory overhead per entry. I'm making an open-addressed table with a variant of quadratic probing, backed internally by two arrays, one for keys and one for values. Java arrays can't be resized in place, so the obvious way to grow is to create larger arrays and rehash all of the (key, value) pairs from the old arrays into the new ones. This falls apart, though, when my old arrays take up over 50% of available memory, since I can't fit both the old and new arrays in memory at the same time. Is there any way to resize my hashtable in this situation?
Edit: the figures I have for the built-in hashtable's memory overhead per entry come from here: How much memory does a Hashtable use?
Also, for my current application my values are ints, so rather than storing references to Integer objects, I keep an array of ints as my values.
The simple answer is "no, there is no way to extend the length of an existing array". That said, you could add extra complexity to your hashtable and use arrays-of-arrays (or specialize and hard-code support for just two arrays).
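For illustration, here is a minimal sketch of the arrays-of-arrays idea: a chunked int store where growing means allocating one more fixed-size chunk instead of copying everything into one bigger array. The class and constant names are just placeholders.

// Sketch: a growable int store backed by fixed-size chunks instead of a single array.
// Growing only allocates one new chunk; existing data is never copied.
class ChunkedIntArray {
    private static final int CHUNK_BITS = 20;               // 1M ints per chunk (4 MB)
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private int[][] chunks = new int[0][];

    int get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    void set(long index, int value) {
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    // Adds room for CHUNK_SIZE more ints without touching existing chunks.
    void addChunk() {
        int[][] bigger = java.util.Arrays.copyOf(chunks, chunks.length + 1);
        bigger[chunks.length] = new int[CHUNK_SIZE];
        chunks = bigger;
    }

    long capacity() {
        return (long) chunks.length << CHUNK_BITS;
    }
}

The table's probe sequence would then be taken modulo capacity(), and with quadratic probing you would still want the total capacity to stay a power of two (or a prime, depending on the variant).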
You could partition your hash table. e.g. you could have 2-16 partitions based on 1-4 bits in the hashCode. This would allow you to resize a portion of the hash table at a time.
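A rough sketch of that idea, with a hypothetical SubTable interface standing in for your existing open-addressed table; only the partition that overflows ever has its arrays rebuilt, so at most one sixteenth of the data is duplicated during a resize.

// Sketch of partitioning by hash bits. "SubTable" is a stand-in for your own
// open-addressed table, not a real library type.
interface SubTable {
    void put(int key, int value);
    double loadFactor();
    void resize();                             // rehashes only this partition's arrays
}

class PartitionedTable {
    private final SubTable[] partitions;       // e.g. 16 partitions

    PartitionedTable(SubTable[] partitions) {
        this.partitions = partitions;
    }

    void put(int key, int value) {
        int h = spread(key);
        SubTable p = partitions[h >>> 28];     // top 4 bits pick one of 16 partitions
        if (p.loadFactor() > 0.7) {
            p.resize();                        // only ~1/16th of the table is rehashed
        }
        p.put(key, value);
    }

    private static int spread(int key) {       // mix the bits so the top bits vary
        int h = key * 0x9E3779B9;
        return h ^ (h >>> 16);
    }
}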
If you have a single hash table which is a large percentage of your memory size, you have a serious design issue IMHO. Are you using a mobile device? What is your memory restriction? Have you looked at using Trove4j, which doesn't use entry objects either?
Maybe a solution for the problem is:
-> Create a list to store the contents of the matrix (one list per row), freeing the memory of each old row, if possible, one by one;
-> Create the new matrix;
-> Fill the new matrix with the values stored in the lists (removing each element from the list right after copying its data).
This can be easier if the matrix elements are pointers to the elements themselves.
This is a very theoretical approach to the problem, but I hope it helps; a rough sketch follows below.
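A rough sketch of that staged copy for an int matrix, assuming rows can be handed off and freed one at a time so only about one extra row is live at any moment (the method and names here are purely illustrative):

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: rebuild a row-based matrix without ever holding the full old and full new
// structure at once. Old rows are handed to a queue, their slots in the old outer
// array are nulled, and each old row becomes garbage right after it is copied.
class MatrixRegrow {
    static int[][] regrow(int[][] old, int newRows, int newCols) {
        Deque<int[]> staged = new ArrayDeque<int[]>(old.length);
        for (int r = 0; r < old.length; r++) {
            staged.addLast(old[r]);
            old[r] = null;                         // drop the old outer reference
        }
        int[][] fresh = new int[newRows][];        // outer array only; rows come later
        int r = 0;
        while (!staged.isEmpty() && r < newRows) {
            int[] oldRow = staged.pollFirst();
            fresh[r] = new int[newCols];
            System.arraycopy(oldRow, 0, fresh[r], 0, Math.min(oldRow.length, newCols));
            r++;                                   // oldRow is now unreferenced and collectable
        }
        while (r < newRows) {
            fresh[r++] = new int[newCols];         // any remaining brand-new rows
        }
        return fresh;
    }
}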
Related
As I have studied the HashSet class, it uses the concept of a fill ratio, which says that if the HashSet is filled up to this limit, a larger HashSet is created and the values are copied into it. Why don't we let the HashSet get completely full with objects and only then create a new HashSet? Why was a new concept introduced for HashSet?
Both ArrayList and Vector are accessed by positional index, so that there are no conflicts and access is always O(1).
A hash-based data structure is accessed by a hashed value, which can collide and degrade into access to a second-level "overflow" data structure (list or tree). If you have no such collisions, access is O(1), but if you have many collisions, it can be significantly worse. You can control this a bit by allocating more memory (so that there are more buckets and hopefully fewer collisions).
As a result, there is no need to grow an ArrayList to a capacity more than you need to store all elements, but it does make sense to "waste" a bit (or a lot) in the case of a HashSet. The parameter is exposed to allow the programmer to choose what works best for her application.
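For example, both knobs are exposed on the constructor, so you can size a HashSet up front if you know roughly how many elements it will hold:

// Expecting ~1,000 elements: pick a capacity so that 1,000 <= capacity * loadFactor,
// so the set never has to rehash while it is being filled.
Set<String> names = new HashSet<String>(1334, 0.75f);   // 1334 is roughly 1000 / 0.75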
As Jonny Henly has described, it is because of the way data is stored.
ArrayList is a linear data structure, while HashSet is not. In a HashSet, data is stored in an underlying array based on hash codes. In a way, the performance of a HashSet is linked to how many buckets are filled and how well the data is distributed among them. Once this distribution of data goes beyond a certain level (called the load factor), re-hashing is done.
HashSet is primarily used to ensure that the basic operations (such as adding, fetching, modifying and deleting) are performed in constant time regardless of the number of entries being stored in the HashSet.
Though a well-designed hash function can achieve this, designing one might take time. So if performance is a critical requirement for the application, we can use the load factor to help keep the operations near constant time. In that sense the load factor and the hash function complement each other.
I agree that this may not be a perfect explanation, but I hope it does bring some clarity on the subject.
I have an Android application that iterates through an array of thousands of integers and it uses them as key values to access pairs of integers (let us call them id's) in order to make calculations with them. It needs to do it as fast as possible and in the end, it returns a result which is crucial to the application.
I tried loading a HashMap into memory for fast access to those numbers, but it resulted in an OutOfMemoryError. I also tried writing those ids to a RandomAccessFile and storing their offsets in the file in another HashMap, but that was way too slow. Also, the new HashMap that only stores the offsets still occupies a large amount of memory.
Now I am considering SQLite but I am not sure if it will be any faster. Are there any structures or libraries that could help me with that?
EDIT: The number of keys is more than 20 million, whereas I only need to access thousands of them. I do not know which ones I will access beforehand because it changes with user input.
You could use Trove's TIntLongHashMap to map primitive ints to primitive longs (which store the ints of your value pair). This saves you the object overhead of a plain vanilla Map, which forces you to use wrapper types.
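A minimal sketch, assuming Trove 3's gnu.trove.map.hash.TIntLongHashMap is on the classpath (the package name differs in older Trove releases); the small wrapper class is just illustrative:

import gnu.trove.map.hash.TIntLongHashMap;

// The high 32 bits of each long hold the first id, the low 32 bits the second id,
// so no Integer/Long/Entry objects are allocated per mapping.
class IdPairMap {
    private final TIntLongHashMap pairs = new TIntLongHashMap();

    void put(int key, int idA, int idB) {
        pairs.put(key, ((long) idA << 32) | (idB & 0xFFFFFFFFL));
    }

    int firstId(int key)  { return (int) (pairs.get(key) >>> 32); }
    int secondId(int key) { return (int) pairs.get(key); }
    boolean contains(int key) { return pairs.containsKey(key); }
}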
EDIT
Since your update states you have more than 20 million mappings, there are likely more space-efficient structures than a hash map. An approach that partitions your keys into buckets, combined with some sub-key compression, will likely save you half the memory over even the most efficient hash map implementation.
SQLite is an embedded relational database, which uses indexes. I would bet it is much faster than using RandomAccessFile. You can give it a try.
My suggestion is to rearrange the keys into buckets. What I mean is: identify (more or less) the distribution of your keys, then create files that correspond to ranges of keys (the point is that every file must contain only as many integers as can fit in memory, and no more than that). Then, when you search for a key, you just read the whole corresponding file into memory and look for it there.
For example, assuming the distribution of the keys is uniform, store the 500k values corresponding to keys 0-500k in one file, the 500k values corresponding to keys 500k-1M in the next, and so on.
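A rough sketch of the lookup side, assuming each bucket file stores fixed-width (key, idA, idB) records written as raw ints (e.g. with DataOutputStream); the file layout and method names are assumptions, not a finished design:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Each bucket file covers one key range; a lookup streams only that one bucket.
class BucketLookup {
    static int[] lookup(File bucketFile, int key) throws IOException {
        long records = bucketFile.length() / 12;            // 3 ints = 12 bytes per record
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(bucketFile)))) {
            for (long i = 0; i < records; i++) {
                int k = in.readInt();
                int idA = in.readInt();
                int idB = in.readInt();
                if (k == key) {
                    return new int[] { idA, idB };
                }
            }
        }
        return null;   // not in this bucket, so not in the collection at all
    }
}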
EDIT: if you did try this approach and it is still too slow, I have a few more tricks up my sleeve:
First, make sure that your division is actually close to equal across all the buckets.
Try to make the buckets smaller by creating more of them.
The idea behind dividing into buckets by range is that when you search for a key, you go to the corresponding range bucket, and the key is either in it or not in the whole collection, so there is no point in concurrently reading another bucket.
I have never done this, since I'm not sure concurrency helps with I/O, but it may be helpful to read the whole file with two threads, one from top to bottom and the other from bottom to top, until they meet (or something like that).
Once you have read the whole bucket into memory, split it into 3-4 ArrayLists and run 3-4 worker threads to search for your key in each of them; the search should finish much faster.
I'm processing some generated data files (hundreds of MB) which contain several G objects. I need random access to these objects. A possible implementation, I guess, might be a big hash table. My program is written in Java, and it seems java.util.HashMap cannot handle this (somehow it's extremely slow). Could anyone recommend a solution for random access to these objects?
If a HashMap is extremely slow, then the two most likely causes are as follows:
The hashCode() and/or equals(Object) methods on your key class could be very expensive. For instance, if you use an array or a collection as a key, the hashCode() method will access every element each time you call it, and the equals method will do the same for equal keys.
Your key class could have a poor hashCode() method that is giving the same value for a significant percentage of the (distinct) keys used by the program. When this occurs you get many key collisions, and that can be really bad for performance when the hash table gets large.
I suggest you look at these possibilities first ... before changing your data structure.
Note: if "several G objects" means several billion objects, then you'll have trouble holding the files' contents in memory ... unless you are running this application on a machine with 100's of gigabytes of RAM. I advise you do some "back of the envelope" calculations to see if what you are trying to do is feasible.
Whatever your keys are, make sure you're generating a good hash for each one via hashCode(). A lot of times bad HashMap performance can be blamed on colliding hashes. When there's a collision, HashMap generates a linked list for the colliding objects.
Worst-case if you're returning the same hash for all objects, HashMap essentially becomes a linked list. Here's a good starting place for writing hash functions: http://www.javamex.com/tutorials/collections/hash_function_guidelines.shtml
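For example, a two-field key using the standard 31-multiplier recipe, so that distinct keys rarely land in the same bucket:

// A key class whose hashCode spreads both fields instead of returning a constant.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return p.x == x && p.y == y;
    }

    @Override public int hashCode() {
        return 31 * x + y;    // cheap and well distributed for typical coordinate ranges
    }
}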
A few hundred MB cannot hold several billion objects unless each object is a bit (which is not really an object IMHO).
How I would approach this is to use a memory-mapped file to map in the contents of the data, and to build your own hash table in another memory-mapped file (which requires scanning the data once to build the keys).
Depending on the layout of the data, it is worth remembering that random access is not the most cache-friendly way to read data, i.e. your cache loads lines of 64 bytes (depending on architecture), and if your structure doesn't fit in memory, record-based tables may be more efficient.
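A minimal sketch of the memory-mapped, record-based approach using java.nio; the record size and field offset are assumptions about your file layout:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Map the data file once and read fixed-width records by offset,
// letting the OS page cache decide what actually stays in RAM.
class MappedData {
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // a single mapping is limited to ~2 GB; split into regions for larger files
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // Example accessor assuming 32-byte records with a long field at offset 8.
    static long fieldAt(MappedByteBuffer data, int recordIndex) {
        return data.getLong(recordIndex * 32 + 8);
    }
}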
I usually do e.g.
HashMap<String,String> dictionary = new HashMap<String,String>();
I started to think about it, and as far as I know a HashMap is implemented under the hood via a hash table.
The objects are stored in the table using a hash to find where they should be stored in the table.
Does the fact that I do not set a size on the construction of the dictionary make the performance decrease?
I.e. what would be the size of the hash table during construction? Would it need to allocate new memory for the table as elements increase?
Or I am confused on the concept here?
Are the default capacity and load adequate or should I be spending time for the actual numbers?
The nice thing about Java is that it is open-source, so you can pull up the source code, which answers a number of questions:
No, HashMap does not use the legacy Hashtable class: it derives from AbstractMap and manages its own internal bucket array (which is, conceptually, a hash table) for its data.
Whether or not omitting an explicit size will decrease performance will depend upon your usage model (or more specifically, how many things you put into the map). The map will automatically double in size every time a certain threshold is hit (0.75 * <current map capacity>), and the doubling operation is expensive. So if you know approximately how many elements will be going into the map, you can specify a size and prevent it from ever needing to allocate additional space.
The default capacity of the map, if none is specified in the constructor, is 16, so with the default load factor of 0.75 it will double its capacity to 32 when the 13th element is added to the map, then again to 64 around the 25th, and so on.
Yes, it needs to allocate new memory when the capacity increases. And it's a fairly costly operation (see the resize() and transfer() functions).
Unrelated to your question but still worth noting, I would recommend declaring/instantiating your map like:
Map<String,String> dictionary = new HashMap<String,String>();
...and of course, if you happen to know how many elements will be placed in the map, you should specify that as well.
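For example, if you expect around 10,000 entries, something like this avoids any intermediate resizing (the expected count is of course just an illustration):

// Capacity chosen so that expectedSize <= capacity * 0.75, i.e. no resize while filling.
int expectedSize = 10000;
Map<String,String> dictionary = new HashMap<String,String>((int) (expectedSize / 0.75f) + 1);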
Does the fact that I do not set a size on the construction of the dictionary make the performance decrease?
Depends on how much you're going to store in the HashMap and how your code will use it afterward. If you can give it a ballpark figure up front, it might be faster, but: "it's very important not to set the initial capacity too high [...] if iteration performance is important" (from the HashMap Javadoc), because iteration time is proportional to the capacity.
Doing this in non-performance-critical pieces of code would be considered premature optimization. If you're going to outsmart the JDK authors, make sure you have measurements that show that your optimization matters.
what would be the size of the hash table during construction?
According to the API docs, 16.
Would it need to allocate new memory for the table as elements increase?
Yes. Every time it gets fuller than the load factor (default 0.75) times the current capacity, it reallocates.
Are the default capacity and load adequate
Only you can tell. Profile your program to see whether it's spending too much time in HashMap.put. If it's not, don't bother.
HashMap will automatically increase its size if it needs to. The best way to initialize it is to anticipate roughly how many elements you might need; if that figure is large, set the initial capacity to a number that will not require constant resizing. Furthermore, if you read the Javadoc for HashMap, you will see that the default size is 16 and the load factor is 0.75, which means that once the HashMap is 75% full it will automatically resize. So if you expect to hold a million elements, it is natural to want a larger initial size than the default one.
First of all, I would declare it using the Map interface:
Map<String,String> dictionary = new HashMap<String,String>();
Does the fact that I do not set a size on the construction of the dictionary make the performance decrease?
Yes, initial capacity should be set for better performance.
Would it need to allocate new memory for the table as elements increase?
Yes, and the load factor also affects performance.
More detail in the docs.
As stated here, the default initial capacity is 16 and the default load factor is 0.75. You can change either one with the various constructors, depending on your usage (though the defaults are generally good for general purposes).
What is the best data structure to use when programming a 2-dimensional grid of tiles in Java? Tiles on the grid should be easily referenced by their location, so that neighbors and paths can be efficiently computed. Should it be a 2D array? An ArrayList? Something else?
If you're not worrying about speed or memory too much, you can simply use a 2D array - this should work well enough.
If speed and/or memory are issues for you then this depends on memory usage and the access pattern.
A single dimensional array is the way to go if you need high performance. You compute the proper index as y * wdt + x. There are 2 potential problems with this: cache misses and memory usage.
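A minimal sketch of that layout (wdt is spelled out as width here):

// A width-by-height grid stored row-major in one flat array; index = y * width + x.
class Grid {
    final int width, height;
    final int[] tiles;

    Grid(int width, int height) {
        this.width = width;
        this.height = height;
        this.tiles = new int[width * height];
    }

    int get(int x, int y)            { return tiles[y * width + x]; }
    void set(int x, int y, int tile) { tiles[y * width + x] = tile; }
}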
If you know that your access pattern is such that you fetch neighbours of an element most of the time, then mapping a 2D space into a 1D array as described above may cause cache misses - you want the neighbours to be close in memory, and neighbours from 2 different rows are not. You may have to map your 2d tiles in a different order to your 1d array. See Hilbert curves for example.
For better memory usage, if you know that most of your tiles are always the same (e.g. always grass), you might want to implement a sparse array or a quad tree. Both can be implemented quite efficiently, with cache awareness in mind (the sparse array link is good example for this). Another benefit is that these can be dynamically extended. However, you will always have to pay extra levels of indirection in the end for this to work.
NOTE: Be careful with using generic classes such as HashMaps with the key type being some primitive type or a special location class if you're worried about performance - you will either have to allocate an object each time you index the hash map or pay the price of boxing/unboxing. In addition to this, hash maps will not allow you efficient spatial queries (e.g. give me all objects existing in the radius R of a given object - quad trees are better for this).
If you have a fixed dimension for your grid, use a 2D array. If you need the size to be dynamic, use an ArrayList of ArrayLists.
A 2D array seems like a good bet if you plan on inserting stuff into specific locations, as long as it's a fixed size.
The data structure to use really depends on the type of operations you will perform:
In case the number of meaningful positions (non-zero/non-default) in the grid is rather low (<< n x m), it might be more space efficient to use a hash map that maps (x, y) positions to specific tiles. You can also iterate over the meaningful positions a lot more efficiently. In addition, you could store references to the neighboring tiles in each tile to speed up path/neighborhood traversal.
If your grid is densely filled with "information" you should consider using a 2d array or ArrayList (in case you will at some point have generic types involved as "tile-type", you have to use ArrayLists, since Java does not allow native arrays of generic type).
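A rough sketch of the sparse variant; note that the packed keys are boxed to Long here, which is exactly the overhead the earlier note about generic maps warns about (a primitive-keyed map would avoid it):

import java.util.HashMap;
import java.util.Map;

// Sparse grid: only non-default tiles are stored, keyed by the packed (x, y) position.
class SparseGrid<T> {
    private final Map<Long, T> tiles = new HashMap<Long, T>();
    private final T defaultTile;

    SparseGrid(T defaultTile) { this.defaultTile = defaultTile; }

    private static long key(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    T get(int x, int y) {
        T t = tiles.get(key(x, y));
        return t != null ? t : defaultTile;
    }

    void set(int x, int y, T t) {
        tiles.put(key(x, y), t);   // keys are boxed Longs
    }
}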
If you simply need to iterate over the grid and random addressing of cells, then MyCellType[][] should be fine. This is most efficient in terms of space and (one would expect) time for these use-cases.