Efficient alternative to Map<Integer, Integer> in Java, with regards to autoboxing? - java

I'm using a LinkedHashMap<Integer, Integer> to store values of layers on a tile in a 2D game. Higher numbers are drawn over the lower numbers.
In my draw function, I iterate through the value set and draw each one. This means I'm unboxing values (width * height * numLayers) times. I'm planning to port to Android so I want to be as efficient as possible, and I'm thinking this is too much?
The reason I'm using a Map is that the layer number (key) matters: keys above 4 are drawn over players, etc. So I'll frequently need to skip over a bunch of keys.
I could probably just use an int[10] since I won't need that many layers, but then every unused layer would take up 32 bits for nothing, compared to my current map, which can hold just keys 0 and 9 and take up only 64 bits.

Efficient alternative to Map?
SparseIntArray is more efficient than HashMap<Integer, Integer>. According to the documentation:
SparseIntArrays map integers to integers. Unlike a normal array of
integers, there can be gaps in the indices. It is intended to be more
memory efficient than using a HashMap to map Integers to Integers,
both because it avoids auto-boxing keys and values and its data
structure doesn't rely on an extra entry object for each mapping. For containers holding up to hundreds of items, the performance difference is not significant, less than 50%.
See the Android documentation for SparseIntArray for more detail.
For non-Android Java:
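A rough sketch of how the tile-layer case from the question might look with SparseIntArray (Android only); the drawTile method and the tile ids are placeholders for the game's own code, not part of any API:
import android.util.SparseIntArray;

class TileLayers {
    // key = layer number, value = tile/sprite id
    private final SparseIntArray layers = new SparseIntArray();

    void setLayer(int layer, int tileId) {
        layers.put(layer, tileId);   // no boxing of key or value
    }

    void draw() {
        // SparseIntArray keeps its keys sorted ascending, so iterating by
        // index draws lower layer numbers before higher ones.
        for (int i = 0; i < layers.size(); i++) {
            drawTile(layers.keyAt(i), layers.valueAt(i));
        }
    }

    private void drawTile(int layer, int tileId) {
        // placeholder for the game's actual rendering code
    }
}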
Write your own hash-based map class (not implementing java.util.Map). It's relatively simple to do with a 'linear probe' in an array of cells; the other common technique, a linked list of entries, will (again) end up about as large as the 'direct array' option.
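A minimal sketch of that linear-probe idea, assuming non-negative int keys, a power-of-two capacity, and a table that is never filled past about 70% (no resizing or deletion):
// Open-addressing int -> int map using linear probing; -1 marks empty slots.
class IntIntMap {
    private static final int EMPTY = -1;
    private final int[] keys;
    private final int[] values;

    IntIntMap(int capacityPowerOfTwo) {
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        java.util.Arrays.fill(keys, EMPTY);
    }

    void put(int key, int value) {
        int i = index(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & (keys.length - 1);   // probe the next cell, wrapping around
        }
        keys[i] = key;
        values[i] = value;
    }

    int get(int key, int defaultValue) {
        int i = index(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & (keys.length - 1);
        }
        return defaultValue;
    }

    private int index(int key) {
        return key & (keys.length - 1);        // trivial hash; fine for small int keys
    }
}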
GNU Trove has primitive maps that will do what you want. But if you are not trying to eke out every byte of memory, I'd second Thomas's suggestion to just use an array.

Related

Linked lists as matrix and efficiency

When a big matrix needs to be used in an algorithm, we were told to use linked lists to speed things up if the matrix is sparse. Meaning that if the data is mostly one value, we can store only the entries that are not that value.
But how do we identify the point where using a sparse representation is no longer useful?
For a square matrix of side length n, how do we calculate the point at which the matrix has too many non-zero entries to be worth storing as a linked list?
I imagine we need to use the memory size of a value, of a link between two nodes, and then our density factor. But what is the calculation that lets us safely say "this matrix has x% non-zero data, so it is better to use a linked list"?
The answer to your question depends on what you optimize for. Do you optimize for space or time?
Let's say you optimize for space. To keep the data of a square matrix of side length n, you need n*n numbers (to simplify, let's say each value is an integer). In the case of a linked list, you need to keep the actual value, the coordinates of the value in the matrix, and the pointer to the next non-zero value. To simplify, let's say each of those fields is of integer size. So for a linked list you need 4 integers per stored value (plus additional data such as the head of the linked list).
IMHO, once less than 1/4 of the values in the matrix are non-zero, it's more space-efficient to use a linked list than an array of arrays.
Obviously, there are other options to keep the matrix values; then the ratio can be different.
To optimize for time, again, it depends which operations you want to run...
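To make the 1/4 rule of thumb above concrete, here is the same back-of-the-envelope space estimate in code, under the same simplifying assumptions (every field counted as one int, list header overhead ignored):
class SparseBreakEven {
    // Dense storage: n*n ints. Linked list: 4 ints per non-zero entry
    // (value, row, column, pointer to the next entry).
    static boolean linkedListIsSmaller(long n, long nonZeroCount) {
        return 4 * nonZeroCount < n * n;   // i.e. density below 1/4
    }

    public static void main(String[] args) {
        // 1000 x 1000 matrix with 200,000 non-zero values (20% density)
        System.out.println(linkedListIsSmaller(1000, 200_000)); // prints true
    }
}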

UUID mapping to bitset

There are many UUIDs, each 128 bits long. I want to map every UUID to an integer and flag it at that position in a bitset, but 128 bits seems too long to use as an index.
How can I implement this mapping so that there are no collisions?
Using a bitset directly, you would need 2^128 bits, which is around 3.4 x 10^38 bits.
You can certainly use something that costs less memory, but if you want it to have absolutely no collisions, that is impossible, simply by the pigeonhole principle.
But why do you want "no collisions"? For example, if you use a HashMap with a reasonably normal hashing function, pre-initializing the HashMap to the expected size will already save you a lot of collisions. And even if some collisions remain, they should not have a big impact on performance (unless the hashing method is really poorly done).
A workaround if your "add to bitset" is explicit (hence you do not need to determine if a UUID is already in the bit set):
Assuming you need to store the status of 100,000,000 devices, you will need at least 100,000,000 bits.
Using a reasonable hashing algorithm, derive a 27-bit hash; that hash determines which bit is used to store the status. Hence you will need a bitmap of 2^27 = 134,217,728 bits, roughly 17 MB.
Have 2 BitSets of that size (costing around 34 MB in total): one for keeping the status, one for keeping track of which bits are taken (the "bit availability" set).
Have an extra Map<UUID, Integer> as the "exceptional device bit" map.
For a new UUID, calculate the 27-bit hash. If that position is not occupied in the "bit availability" BitSet, turn it on.
For a new UUID whose hash position is already occupied in the "bit availability" BitSet, find the index of the next unoccupied bit, turn that on, and add the UUID-to-index pair to the "exceptional device bit" map.
Do the reverse when looking up or updating: first check whether the UUID is in the "exceptional device bit" map; if so, use the index stored there. If not, simply calculate the 27-bit hash and use it as the index.
Given a relatively good hashing algorithm, collisions should not be frequent, so the extra overhead of the "exceptional device bit" map should be small. You can further adjust the size of the bitset to trade memory for a lower collision rate.
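A minimal sketch of that scheme; the 27-bit hash below (mixing the UUID's two longs and keeping the low bits) is only an illustrative choice, and the class and method names are made up for this example:
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class DeviceStatusTable {
    private static final int BITS = 1 << 27;               // 134,217,728 slots

    private final BitSet status = new BitSet(BITS);         // the actual flags
    private final BitSet occupied = new BitSet(BITS);       // "bit availability"
    private final Map<UUID, Integer> exceptions = new HashMap<>();

    // Registers a new UUID and returns the bit index assigned to it.
    int register(UUID id) {
        int i = hash27(id);
        if (occupied.get(i)) {
            // Collision: take the next free bit and remember the exception.
            i = occupied.nextClearBit(i);
            exceptions.put(id, i);
        }
        occupied.set(i);
        return i;
    }

    void setStatus(UUID id, boolean value) {
        status.set(indexOf(id), value);
    }

    boolean getStatus(UUID id) {
        return status.get(indexOf(id));
    }

    private int indexOf(UUID id) {
        // Exceptional devices are looked up in the map; everything else
        // goes straight to its hash position.
        Integer exceptional = exceptions.get(id);
        return exceptional != null ? exceptional : hash27(id);
    }

    private int hash27(UUID id) {
        long mixed = id.getMostSignificantBits() ^ id.getLeastSignificantBits();
        return (int) (mixed & (BITS - 1));                  // keep the low 27 bits
    }
}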

Memory efficiency: HashMap versus Array

I was thinking about the following situation: I want to count the occurrence of characters in a string (for example for a permutation check).
One way to do it would be to allocate an array with 256 integers (I assume that the characters are UTF-8), to fill it with zeros and then to go through the string and increment the integers on the array positions corresponding to the int value of the chars.
However, for this approach you would have to allocate a 256-element array each time, even when the analyzed string is very short (and consequently uses only a small part of the array).
Another approach would be to use a Character-to-Integer hash table and to store a count for each encountered char. That way, you only have keys for chars that actually occur in the string.
As my understanding of hash tables is rather theoretical and I do not really know how they are implemented in Java, my question is: which of the two approaches would be more memory efficient?
Edit:
During the discussion of this question (thank you for your answers everyone) I did realize that I had a very fuzzy understanding of the nature of UTF-8. After some searching, I have found this great video that I want to share, in case someone has the same problem.
I wonder why you chose 256 as the length of your array when you assume that your String is UTF-8. In UTF-8 a character can be composed of up to 4 bytes, which means far more possible characters than just 256.
Anyway: using a HashTable/HashMap carries a huge memory overhead. First, all your characters and integers need to be wrapped in objects (Integer/Character), and an Integer consumes about 3x as much memory as an int. For arrays the difference can be even larger, due to the optimizations Java performs on arrays (e.g. the Java stack works only in multiples of 4 bytes, while in an array Java allows smaller types such as a char to consume only 2 bytes).
Then the hash table itself creates a memory overhead because it needs to maintain an array (which is usually not fully used) and linked lists for all entries that produce the same hash.
Additionally, access times will be dramatically faster for arrays. You save multiple method invocations (add, hashCode, iterator, ...) and there are dedicated opcodes in Java bytecode that make working with arrays more efficient.
Anyway, your question was:
Which of the two approaches would be more memory efficient?
And it is safe to say that arrays will be more memory efficient.
However, you should make absolutely sure what your requirements are. Do you need more memory efficiency? (That could be true if you process large amounts of data or run on a slow device (mobile devices?).) How important is readability of the code? What about code size? Reusability?
And is 256 really the correct size?
Without looking in the code I know that a HashMap requires, at minimum, a base object, a hashtable array, and individual objects for each hash entry. Generally an int value would have to be stored as an Integer object so that's more objects. Let's assume you have 30 unique characters:
32 bytes for the base object
256 bytes for a minimum-size hashtable array
32 bytes for each of the 30 table entries
16 bytes (if highly optimized) for each of 30 Integers
32 + 256 + 960 + 480 = 1728 bytes. That's for a minimal, non-fancy implementation.
The array of 256 ints would be about 1056 bytes.
I would use the array. From a performance standpoint you get guaranteed constant-time access, which is better than what a hash table can give you.
As it also uses only a constant amount of memory, I see no downside. The HashMap will most likely need more memory, even if you only store a few elements.
By the way, the memory footprint should not be a concern, as you will only need the data structure for as long as the counting takes. After that it will be garbage collected anyway.
Well here are the facts.
HashMap uses an array for its table behind the scenes.
So if you were actually limited by finding a contiguous space in memory, HashMap's benefit is only that the array may be smaller.
HashMap is generic and therefore uses objects.
Objects take up extra space. As I remember, it's typically 8 or 16 bytes minimum depending on whether it's a 32- or 64-bit system. This means the HashMap may very well not be smaller, even if the number of characters in the String is small. HashMap will require 3 extra objects for each entry: an Entry, a Character and an Integer. HashMap also needs to store the int for the index locally whereas the array does not.
That's on top of the extra computation involved in using the HashMap.
I would also say space optimization is not something you should worry about here. Either way, the memory footprint is actually very small.
Initialize an array of integers indexed by the int value of a char; for example, the int value of 'f' is 102, which is its ASCII value:
http://www.asciitable.com/
char c = 'f';
int x = (int)c;
If you know the range of chars you're dealing with, then it is easier.
For each occurrence of a char, increment the value at that char's index in the array by one. This approach would be slow if you have to iterate, and complicated if you need to sort, but it won't be memory intensive.
Just be aware that when you sort, you lose the indexes.
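A short sketch of the counting approach described in these answers, assuming (as the question does) that all characters of interest fit below 256; the class name is just for illustration:
class CharCounter {
    // Count occurrences with a plain int array indexed by the char's value.
    static int[] countChars(String s) {
        int[] counts = new int[256];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 256) {          // ignore anything outside the assumed range
                counts[c]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countChars("coffee");
        System.out.println(counts['f']);   // prints 2
    }
}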

Datastructure to use for storing a huge list of long values

I am caching a list of Long indexes in my Java program and it is causing the memory to overflow.
So I decided to cache only the start and end indexes of each run of continuous indexes and rewrite the required ArrayList APIs. Now, what data structure would be best for implementing the start-end index cache? Is it better to go for a TreeMap and keep the start index as key and the end index as value?
If I were you, I would use some variation of bit string storage.
In Java bit strings are implemented by BitSet.
For example, to represent an arbitrary list of unique 32-bit integers, you could store it as a single bit string 4 billion bits long, which would take 4 billion / 8 = 512 MB of memory. That is a lot, but it is the worst possible case.
But you can be a lot smarter than that. For example, you could store it as a list or binary tree of smaller fixed (or dynamically) sized bit strings, say 65536 bits (8 KB) or less each. In other words, each leaf object in this tree would have a small header holding the start offset and length (probably a power of 2 for simplicity, though it does not have to be), plus the bit string storing the actual array members. For efficiency, you could optionally compress these bit strings using gzip or a similar algorithm; that makes access slower but can improve memory efficiency by a factor of 10 or more.
If your 20 million index elements are almost consecutive (not very sparse), it should take only around 20 million bits, i.e. roughly 2.5 MB, to represent them in memory. If you gzip that, it will probably be under 1 MB overall.
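A minimal sketch of that dense case using java.util.BitSet and a base offset; it assumes the spread between the smallest and largest index fits in an int, which holds for the "20 million almost consecutive" scenario above:
import java.util.BitSet;

class DenseLongIndexSet {
    private final long base;                 // smallest index we expect to see
    private final BitSet bits = new BitSet();

    DenseLongIndexSet(long smallestIndex) {
        this.base = smallestIndex;
    }

    void add(long index) {
        // Roughly one bit per possible index between base and the maximum.
        bits.set(Math.toIntExact(index - base));
    }

    boolean contains(long index) {
        long offset = index - base;
        return offset >= 0 && offset <= Integer.MAX_VALUE && bits.get((int) offset);
    }
}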
The most compact representation will depend greatly on the distribution of indices in your specific application.
If your indices are densely clustered, the range-based representation suggested by mvp will probably work well (you might look at implementations of run-length encoding for raster graphics, since they're similar problems).
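For illustration, a hedged sketch of that range-based idea (the start index of each consecutive run mapped to its end, roughly what the original question proposed); merging of overlapping runs on insert is left out:
import java.util.Map;
import java.util.TreeMap;

class IndexRanges {
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    // Assumes callers add non-overlapping, already-merged runs.
    void addRange(long start, long endInclusive) {
        ranges.put(start, endInclusive);
    }

    boolean contains(long index) {
        // The only run that could contain index is the one with the
        // greatest start <= index.
        Map.Entry<Long, Long> run = ranges.floorEntry(index);
        return run != null && index <= run.getValue();
    }
}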
If your indices aren't clustered in dense runs, that encoding will actually increase memory consumption. For sparsely-populated lists, you might look into primitive data structures such as LongArrayList or LongOpenHashSet in FastUtil, or similar structures in Gnu Trove or Colt. In most VMs, each Long object in your ArrayList consumes 20+ bytes, whereas a primitive long consumes only 8. So you can often realize a significant memory savings with type-specific primitive collections instead of the standard Collections framework.
I've been very pleased with FastUtil, but you might find another solution suits you better. A little simulation and memory profiling should help you determine the most effective representation for your own data.
Most BitSet (compressed or uncompressed) implementations are for integers. Here's one for longs: http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset which works like an ordered primitive long hash set or long to long hash map.

Performance of HashMap with different initial capacity and load factor

Here is my situation. I am using two java.util.HashMaps to store some frequently used data in a Java web app running on Tomcat. I know the exact number of entries in each HashMap. The keys will be Strings and ints, respectively.
My question is, what is the best way to set the initial capacity and loadfactor?
Should I set the capacity equal to the number of elements it will hold and the load factor to 1.0? I would like the absolute best performance without using too much memory. I am afraid, however, that the table would not fill optimally. With a table of exactly the size needed, won't there be key collisions, causing a (usually short) scan to find the correct element?
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys, wouldn't that mean that keys 5, 10, 15 would hit the same bucket and then cause a seek to fill the buckets next to them? Would a larger initial capacity increase performance?
Also, if there is a better data structure than a HashMap for this, I am completely open to that as well.
In the absence of a perfect hashing function for your data, and assuming that this is really not a micro-optimization of something that really doesn't matter, I would try the following:
Assume the default load factor (0.75) used by HashMap is a good value in most situations. That being the case, you can use it and set the initial capacity of your HashMap based on your own knowledge of how many items it will hold: set it so that initial capacity x 0.75 = number of items (rounding up).
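For example, a small helper along those lines (the class and method names here are made up; the arithmetic just inverts the 0.75 default load factor):
import java.util.HashMap;
import java.util.Map;

class Maps {
    // Capacity chosen so that expectedEntries <= capacity * 0.75 (the default
    // threshold), which normally avoids a resize while the map is being filled.
    static <K, V> Map<K, V> withExpectedSize(int expectedEntries) {
        int initialCapacity = (int) Math.ceil(expectedEntries / 0.75);
        return new HashMap<>(initialCapacity);
    }
}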
If it were a larger map, in a situation where high-speed lookup was really critical, I would suggest using some sort of trie rather than a hash map. For long strings, in large maps, you can save space, and some time, by using a more string-oriented data structure, such as a trie.
Assuming that your hash function is "good", the best thing to do is to set the initial size to the expected number of elements, assuming that you can get a good estimate cheaply. It is a good idea to do this because when a HashMap resizes, it has to redistribute every entry into the new table.
Leave the load factor at 0.75. The value of 0.75 has been chosen empirically as a good compromise between hash lookup performance and space usage for the primary hash array. As you push the load factor up, the average lookup time will increase significantly.
If you want to dig into the mathematics of hash table behaviour: Donald Knuth (1998). The Art of Computer Programming'. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.
I find it best not to fiddle around with default settings unless I really really need to.
Hotspot does a great job of doing optimizations for you.
In any case; I would use a profiler (Say Netbeans Profiler) to measure the problem first.
We routinely store maps with 10000s of elements and if you have a good equals and hashcode implementation (and strings and Integers do!) this will be better than any load changes you may make.
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys
It's not. From HashMap.java:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
I'm not even going to pretend I understand that, but it looks like that's designed to handle just that situation.
Note also that the number of buckets is also always a power of 2, no matter what size you ask for.
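In HashMap.java of that same generation, the bucket index is then taken from the low bits of the spread hash, which only distributes evenly because the table length is a power of two:
static int indexFor(int h, int length) {
    return h & (length - 1);   // masking with length-1 keeps the low-order bits
}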
Entries are allocated to buckets in a random-like way. So even if you have as many buckets as entries, some of the buckets will have collisions.
If you have more buckets, you'll have fewer collisions. However, more buckets also means the table is more spread out in memory, and therefore slower. Generally a load factor in the range 0.7-0.8 is roughly optimal, so it is probably not worth changing.
As ever, it's probably worth profiling before you get hung up on microtuning these things.
