Hash table collisions and increase in size - Java

I have a HashMap with around 12 elements in it, all at the same index. Now, if one more element is inserted at that index, the map has reached its threshold, so it will add the element and double the HashMap's capacity. Similarly, if 12 more elements are added at the same index, it will resize again and the capacity will be doubled once more. This leads to wasted space (the buckets at the other indices will be empty).
Any lead/help will be appreciated.

Like separate chaining, open addressing is a method for handling collisions. In Open Addressing, all elements are stored in the hash table itself. So at any point, the size of the table must be greater than or equal to the total number of keys.
Open addressing has several advantages over chaining:
Open addressing provides better cache performance as everything is stored in the same table.
A slot can be used even if the input doesn’t map to it.
You can refer to Hashing overview for more details.
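For illustration, here is a minimal sketch of open addressing with linear probing in Java. The class and method names are made up for this example, and it leaves out removal and other details a production table would need:

    // Minimal open-addressing (linear probing) table, for illustration only.
    // All entries live in one array; on a collision we probe the next slot.
    public class LinearProbingMap<K, V> {
        private Object[] keys;
        private Object[] values;
        private int size;

        public LinearProbingMap(int capacity) {
            keys = new Object[capacity];
            values = new Object[capacity];
        }

        private int indexFor(Object key) {
            return (key.hashCode() & 0x7fffffff) % keys.length;
        }

        public void put(K key, V value) {
            if (size >= keys.length / 2) {
                resize(keys.length * 2);      // keep the table sparse enough to probe
            }
            int i = indexFor(key);
            while (keys[i] != null && !keys[i].equals(key)) {
                i = (i + 1) % keys.length;    // collision: try the next slot
            }
            if (keys[i] == null) {
                size++;
            }
            keys[i] = key;
            values[i] = value;
        }

        @SuppressWarnings("unchecked")
        public V get(K key) {
            int i = indexFor(key);
            while (keys[i] != null) {
                if (keys[i].equals(key)) {
                    return (V) values[i];
                }
                i = (i + 1) % keys.length;
            }
            return null;
        }

        @SuppressWarnings("unchecked")
        private void resize(int newCapacity) {
            Object[] oldKeys = keys;
            Object[] oldValues = values;
            keys = new Object[newCapacity];
            values = new Object[newCapacity];
            size = 0;
            for (int i = 0; i < oldKeys.length; i++) {
                if (oldKeys[i] != null) {
                    put((K) oldKeys[i], (V) oldValues[i]);   // rehash into the larger table
                }
            }
        }
    }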

Related

What data structure is used in apps for modifiable lists?

In some apps you have lists of items where you can move items around, delete items, add or insert items, etc.
Normally I'd say an ArrayList would work but apparently a lot of operations are linear time.
Is there a better data structure most people use for this?
If your priority is inserting and/or removing elements from a collection that maintains an arbitrary order, then the LinkedList class bundled with Java meets that need. The insertion or removal itself is very quick once the position in the list has been reached.
Each link in the chain of a doubly-linked list knows its predecessor and its successor. Each element holds a reference/pointer to the element in front of it and another reference/pointer to the one following it. So insertion means telling a linked pair to consider the new element as their successor or predecessor. The rest of the chain remains untouched.
The downside to LinkedList is that access by index number is expensive as finding the nth element means traversing n links going from one element to the next in the chain. A linked-list inherently means sequential access. So, getting to an element is expensive but once there the mechanics of the insertion/deletion is cheap.
Another downside to LinkedList is searching, for similar reason (sequential access). Since the ordering is arbitrary and not sorted, there is no way to approximately predict/expect where the element might be found. So searching means traversing the chain from one element to the next and performing a comparison on each one.
On the other hand, if indexed access is your priority, then ArrayList is the way to go. Directly accessing the nth element is the speciality of ArrayList. Inserting and removing elements are expensive operations, requiring elements of the backing array to be shifted (and sometimes the array to be reallocated), unless you are dealing with the last element. For large arrays this has implications for memory management, as the array must be in contiguous memory.
Both LinkedList and ArrayList allow duplicates.
Neither LinkedList nor ArrayList are thread-safe. So if accessing either from more than one thread, you have a whole other category of concerns to address.
To understand the nuances, study linked lists and arrays in general.
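As a rough illustration of the trade-off (the classes are java.util.ArrayList and java.util.LinkedList; the sizes and indices are arbitrary):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.ListIterator;

    public class ListTradeoffDemo {
        public static void main(String[] args) {
            List<Integer> array = new ArrayList<>();
            List<Integer> linked = new LinkedList<>();
            for (int i = 0; i < 100_000; i++) {
                array.add(i);
                linked.add(i);
            }

            // Indexed access: a direct jump for ArrayList,
            // a traversal of ~50,000 links for LinkedList.
            int fromArray = array.get(50_000);
            int fromLinked = linked.get(50_000);

            // Insertion in the middle: ArrayList shifts everything after the index,
            // LinkedList just relinks neighbours once the position has been reached.
            array.add(50_000, -1);
            ListIterator<Integer> it = linked.listIterator(50_000);
            it.add(-1);

            System.out.println(fromArray + " " + fromLinked);
        }
    }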

Why need HashMap to relocate elements while rehashing?

When a HashMap reaches its allowed size (capacity * loadFactor), it is automatically grown, and after that all elements are relocated to new indices. Why does this relocation need to be performed?
Because it keeps the hash table sparse, allowing elements to sit in their own buckets instead of piling up in a small number of buckets.
When several elements hit the same bucket, HashMap has to create a list (and sometimes even a tree), which is bad for both memory footprint and retrieval performance. So, to reduce the number of such collisions, HashMap grows its internal hash table and rehashes.
The rehashing is required because the calculation used when mapping a key value to a bucket is dependent on the total number of buckets. When the number of buckets changes (to increase capacity), the new mapping calculation may map a given key to a different bucket.
In other words, lookups for some or all of the previous entries may fail to behave properly because the entries are in the wrong buckets after growing the backing store.
While this may seem unfortunate, you actually want the mapping function to take into account the total number of buckets that are available. In this way, all buckets can be utilized and no entries get mapped to buckets that do not exist.
There are other data structures that do not have this property, but this is the standard way that hash maps work.
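A small sketch of why the bucket changes when the table grows: HashMap keeps a power-of-two number of buckets, so the index is essentially hash & (capacity - 1) (the real implementation also spreads the hash bits first, which is ignored here):

    public class BucketIndexDemo {
        public static void main(String[] args) {
            int hash = "example".hashCode();

            // Simplified HashMap-style index calculation:
            // the bucket depends on the current capacity.
            int indexAt16 = hash & (16 - 1);   // table with 16 buckets
            int indexAt32 = hash & (32 - 1);   // after doubling to 32 buckets

            // The same key can land in a different bucket once the table grows,
            // which is why every entry has to be relocated (rehashed).
            System.out.println("index at capacity 16: " + indexAt16);
            System.out.println("index at capacity 32: " + indexAt32);
        }
    }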

Memory overhead of shrinking collections

I have been studying Java Collections recently. I noticed that ArrayList, ArrayDeque and HashMap contain helper functions which expand the capacity of the containers if necessary, but none of them has a function to shrink the capacity if the container becomes mostly empty.
If I am correct, is the memory cost of the references (4 bytes each) really so irrelevant?
You're correct, most of the collections have an internal capacity that is expanded automatically and that never shrinks. The exception is ArrayList, which has methods ensureCapacity() and trimToSize() that let the application manage the list's internal capacity explicitly. In practice, I believe these methods are rarely used.
The policy of growing but not shrinking automatically is based on some assumptions about the usage model of collections:
applications often don't know how many elements they want to store, so the collections will expand themselves automatically as elements are added;
once a collection is fully populated, the number of elements will generally remain around that number, neither growing nor shrinking significantly;
the per-element overhead of a collection is generally small compared to the size of the elements themselves.
For applications that fit these assumptions, the policy seems to work out reasonably well. For example, suppose you insert a million key-value pairs into a HashMap. The default load factor is 0.75, so the internal table size would be 1.33 million. Table sizes are rounded up to the next power of two, which would be 2^21 (2,097,152). In a sense, that's a million or so "extra" slots in the map's internal table. Since each slot is typically a 4-byte object reference, that's 4MB of wasted space!
But consider, you're using this map to store a million key-value pairs. Suppose each key and value is 50 bytes (which seems like a pretty small object). That's 100MB to store the data. Compared to that, 4MB of extra map overhead isn't that big of a deal.
Suppose, though, that you've stored a million mappings, and you want to run through them all and delete all but a hundred mappings of interest. Now you're storing 10KB of data, but your map's table of 2^21 elements is occupying 8MB of space. That's a lot of waste.
But it also seems that performing 999,900 deletions from a map is kind of an unlikely thing to do. If you want to keep 100 mappings, you'd probably create a new map, insert just the 100 mappings you want to keep, and throw away the original map. That would eliminate the space wastage, and it would probably be a lot faster as well. Given this, the lack of an automatic shrinking policy for the collections is usually not a problem in practice.
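A sketch of the "rebuild a smaller map" approach described above; retainOnly and keysToKeep are made-up names for this example:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class ShrinkByRebuilding {
        // Rather than deleting ~999,900 entries in place (which leaves the large
        // internal table allocated), copy the few survivors into a fresh map.
        static <K, V> Map<K, V> retainOnly(Map<K, V> original, Set<K> keysToKeep) {
            Map<K, V> smaller = new HashMap<>();   // grows only as far as the survivors need
            for (K key : keysToKeep) {
                V value = original.get(key);
                if (value != null) {
                    smaller.put(key, value);
                }
            }
            return smaller;                        // drop the reference to the old map
        }
    }

(For ArrayList, the analogous explicit operation is the trimToSize() method mentioned above.)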

Large 2D Array Storage in Java (Android)

I'm creating a matrix in Java, which:
Can be up to 10,000 x 10,000 elements in the worst case
May change size from time to time (assume on the order of days)
Stores an integer in the range 0-5 inclusive (presumably a byte)
Has elements accessed by referring to a pair of Long IDs (system-determined)
Is symmetrical (so can be done in half the space, if needed, although it makes things like summing the rows harder (or impossible if the array is unordered))
Doesn't necessarily need to be ordered (unless halved into a triangle, as explained above)
Needs to be persistent after the app closes (currently it's being written to file)
My current implementation is using a HashMap<Pair<Long,Long>,Integer>, which works fine on my small test matrix (10x10), but according to this article, is probably going to hit unmanageable memory usage when expanded to 10,000 x 10,000 elements.
I'm new to Java and Android and was wondering: what is the best practice for this sort of thing?
I'm thinking of switching back to a bog-standard 2D array byte[][] with a HashMap lookup table for my Long IDs (a rough sketch of this idea follows below). Will I take a noticeable performance hit on matrix access? Also, I take it there's no way of modifying the array size without either:
Pre-allocating for the assumed worst-case (which may not even be the worst case, and would take an unnecessary amount of memory)
Copying the array into a new array if a size change is required (momentarily doubling my memory usage)
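For reference, here is a rough sketch of that byte[][]-plus-lookup-table idea. The names are illustrative; it assumes at most maxSize distinct IDs and still keeps the whole array (around 100 MB at 10,000 x 10,000) in memory:

    import java.util.HashMap;
    import java.util.Map;

    // Dense byte[][] holding values 0-5, with a HashMap translating
    // system-assigned Long IDs to array indices.
    public class DenseMatrix {
        private final byte[][] values;
        private final Map<Long, Integer> idToIndex = new HashMap<>();
        private int nextIndex = 0;

        public DenseMatrix(int maxSize) {
            values = new byte[maxSize][maxSize];
        }

        private int indexOf(long id) {
            // Assign the next free row/column index the first time an ID is seen.
            return idToIndex.computeIfAbsent(id, k -> nextIndex++);
        }

        public void set(long rowId, long colId, byte value) {
            int r = indexOf(rowId);
            int c = indexOf(colId);
            values[r][c] = value;
            values[c][r] = value;   // keep the matrix symmetric
        }

        public byte get(long rowId, long colId) {
            return values[indexOf(rowId)][indexOf(colId)];
        }
    }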
Thought I'd answer this for posterity. I've gone with Fildor's suggestion of using an SQL database with two look-up columns to represent the row and column indices of my "matrix". The value is stored in a third column.
The main benefit of this approach is that the entire matrix doesn't need to be loaded into RAM in order to read or update elements, with the added benefit of access to summing functions (and any other features inherently in SQL databases). It's a particularly easy method on Android, because of the built-in SQL functionality.
One performance drawback is that the initialisation of the matrix is extraordinarily slow. However, the approach I've taken is to assume that if an entry isn't found in the database, it takes a default value. This eliminates the need to populate the entire matrix (and is especially useful for sparse matrices), but has the downside of not throwing an error when trying to access an invalid index. It is recommended that this approach be coupled with a pair of lists holding the valid rows and columns, and that these lists be checked before attempting to access the database. If you're trying to sum rows using the built-in SQL features, this will also not work correctly if your default is non-zero, although that can be remedied by returning the number of entries found in the row/column being summed and multiplying the "missing" elements by the default value.
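For concreteness, here is a minimal sketch of that database-backed approach on Android. The table and column names are made up, error handling is omitted, and the symmetric pair is stored in canonical (min, max) order to use half the space:

    import android.content.ContentValues;
    import android.content.Context;
    import android.database.Cursor;
    import android.database.sqlite.SQLiteDatabase;
    import android.database.sqlite.SQLiteOpenHelper;

    // Sparse symmetric "matrix" stored as (row_id, col_id, value) rows.
    // Entries that are missing from the table take a default value.
    public class MatrixDbHelper extends SQLiteOpenHelper {
        private static final int DEFAULT_VALUE = 0;

        public MatrixDbHelper(Context context) {
            super(context, "matrix.db", null, 1);
        }

        @Override
        public void onCreate(SQLiteDatabase db) {
            db.execSQL("CREATE TABLE matrix ("
                    + "row_id INTEGER NOT NULL, "
                    + "col_id INTEGER NOT NULL, "
                    + "value INTEGER NOT NULL, "
                    + "PRIMARY KEY (row_id, col_id))");
        }

        @Override
        public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
            // No migrations needed in this sketch.
        }

        public void set(long rowId, long colId, int value) {
            long r = Math.min(rowId, colId);   // canonical order for the symmetric pair
            long c = Math.max(rowId, colId);
            ContentValues cv = new ContentValues();
            cv.put("row_id", r);
            cv.put("col_id", c);
            cv.put("value", value);
            getWritableDatabase().insertWithOnConflict(
                    "matrix", null, cv, SQLiteDatabase.CONFLICT_REPLACE);
        }

        public int get(long rowId, long colId) {
            long r = Math.min(rowId, colId);
            long c = Math.max(rowId, colId);
            Cursor cursor = getReadableDatabase().rawQuery(
                    "SELECT value FROM matrix WHERE row_id = ? AND col_id = ?",
                    new String[] { String.valueOf(r), String.valueOf(c) });
            try {
                return cursor.moveToFirst() ? cursor.getInt(0) : DEFAULT_VALUE;
            } finally {
                cursor.close();
            }
        }
    }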

Does a HashMap collision cause a resize?

When there is a collision during a put in a HashMap is the map resized or is the entry added to a list in that particular bucket?
When you say 'collision', do you mean the same hashcode? The hashcode is used to determine which bucket in a HashMap is to be used, and the bucket is made up of a linked list of all the entries whose hashcodes map to that bucket. The entries are then compared for equality (using .equals()) before being returned or replaced (get/put).
Note that this is the HashMap specifically (since that's the one you asked about), and with other implementations, YMMV.
Either could happen - it depends on the fill ratio of the HashMap.
Usually however, it will be added to the list for that bucket - the HashMap class is designed so that resizes are comparatively rare (because they are more expensive).
The documentation of java.util.HashMap explains exactly when the map is resized:
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor.
The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created.
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
The default initial capacity is 16, the default load factor is 0.75. You can supply other values in the map's constructor.
Resizing is done when the load factor threshold is reached.
When there is a collision during a put in a HashMap, the entry is added to a list in that particular "bucket". If the load factor threshold is reached, the HashMap is resized.
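A short demonstration of both points; CollidingKey is a made-up class whose constant hashCode forces every entry into the same bucket:

    import java.util.HashMap;
    import java.util.Map;

    public class CollisionDemo {
        // Every instance reports the same hashCode, so all entries collide and
        // share one bucket, where they are kept in a list (or a tree in newer JDKs).
        static final class CollidingKey {
            final int id;
            CollidingKey(int id) { this.id = id; }
            @Override public int hashCode() { return 42; }
            @Override public boolean equals(Object o) {
                return o instanceof CollidingKey && ((CollidingKey) o).id == id;
            }
        }

        public static void main(String[] args) {
            // Initial capacity 16 and load factor 0.75: the documented resize
            // threshold is 12 entries in total.
            Map<CollidingKey, String> map = new HashMap<>(16, 0.75f);
            for (int i = 0; i < 20; i++) {
                map.put(new CollidingKey(i), "value-" + i);
            }
            // Lookups still work: equals() distinguishes the keys inside the bucket.
            System.out.println(map.get(new CollidingKey(7)));   // value-7
            System.out.println(map.size());                     // 20
        }
    }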
