Map Lookup Efficiency of TestForNull - java

Referencing a previous answer to a question on SO, there is a method used called TestForNull. This was my original code before I was told I could make it more efficient:
My original code:
for (int i = 0; i < temp.length; i++) {
    if (map.containsKey(temp[i]))
        map.put(temp[i], map.get(temp[i]) + 1);
    else
        map.put(temp[i], 1);
}
In this snippet, I'm doing three look-ups to the map. I was told that this could be accomplished in just one lookup, so I ended up looking for an answer on SO and found the linked answer, and modified my code to look like:
My modified code:
for (int i = 0; i < temp.length; i++) {
    Integer value = map.get(temp[i]);
    if (value != null)
        map.put(temp[i], value + 1);
    else
        map.put(temp[i], 1);
}
Even though it seems better, it looks like two look-ups to me and not one. I was wondering if there was an implementation of this that only uses one, and if it can be done without the use of third-party libraries. If it helps I'm using a HashMap for my program.

Java 8 has added a number of default methods to the Map interface that could help, including merge:
map.merge(temp[i], 1, Integer::sum);
And compute:
map.compute(temp[i], (k, v) -> v == null ? 1 : v + 1);
HashMap's implementations of these methods are appropriately optimized to effectively only perform a single key lookup. (Curiously, the same cannot be said for TreeMap.)
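For the counting use case in the question, the whole loop then collapses to a single merge call per element. A minimal sketch, assuming temp is a String[] and the map is a HashMap<String, Integer> as implied by the question:
for (String s : temp) {
    map.merge(s, 1, Integer::sum);   // one lookup: inserts 1, or adds 1 to the existing count
}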

@John Kugelman's answer is the best (as long as you can use Java 8).
The first example has a worst case of 3 map calls (in the case of a value already present):
containsKey
get
put
The second example always has exactly 2 calls (and a null check):
get
put
So you are basically trading containsKey for a null check.
In a HashMap these operations are roughly constant time, assuming good hash code distribution (and that the distribution works well with the size of the HashMap). Other Map implementations (such as TreeMap) have log(n) execution time. Even in the case of HashMap, a null check will be faster than containsKey, making the second option the winner. However, you are unlikely to see a measurable difference unless you have poorly distributed hash codes, poorly performing equals checks, or this is the only thing your application is doing.
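For completeness, Java 8's getOrDefault gives a third option that keeps the two map operations of the second example but hides the null check. A sketch, again assuming a String[] temp and a HashMap<String, Integer>:
for (String s : temp) {
    map.put(s, map.getOrDefault(s, 0) + 1);   // one get-or-default plus one put: still two lookups
}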

Related

HashMap resize method implementation detail

As the title suggests this is a question about an implementation detail from HashMap#resize - that's when the inner array is doubled in size.
It's a bit wordy, but I've really tried to prove that I did my best understanding this...
This happens at a point when the entries in this particular bucket/bin are stored as a linked list - thus having an exact order, and in the context of the question this is important.
Generally the resize could be called from other places as well, but let's look at this case only.
Suppose you put these strings as keys in a HashMap (on the right is the hash code after HashMap#hash - the internal re-hashing). Yes, these are carefully generated, not random.
DFHXR - 11111
YSXFJ - 01111
TUDDY - 11111
AXVUH - 01111
RUTWZ - 11111
DEDUC - 01111
WFCVW - 11111
ZETCU - 01111
GCVUR - 11111
There's a simple pattern to notice here - the last 4 bits are the same for all of them - which means that when we insert 8 of these keys (there are 9 total), they will end up in the same bucket; and on the 9th HashMap#put, the resize will be called.
So if there are currently 8 entries (with the keys above) in the HashMap, there are 16 buckets in this map, and the last 4 bits of the key decide where each entry ends up.
We put the ninth key. At this point TREEIFY_THRESHOLD is hit and resize is called. The number of buckets is doubled to 32, and one more bit of the key decides where each entry will go (so, 5 bits now).
Ultimately this piece of code is reached (when resize happens):
Node<K,V> loHead = null, loTail = null;
Node<K,V> hiHead = null, hiTail = null;
Node<K,V> next;
do {
    next = e.next;
    if ((e.hash & oldCap) == 0) {
        if (loTail == null)
            loHead = e;
        else
            loTail.next = e;
        loTail = e;
    }
    else {
        if (hiTail == null)
            hiHead = e;
        else
            hiTail.next = e;
        hiTail = e;
    }
} while ((e = next) != null);
if (loTail != null) {
    loTail.next = null;
    newTab[j] = loHead;
}
if (hiTail != null) {
    hiTail.next = null;
    newTab[j + oldCap] = hiHead;
}
It's actually not that complicated... what it does is split the current bin into entries that will move to another bin and entries that will not move - they stay in this one for sure.
And it's actually pretty smart how it does that - it's via this piece of code:
if ((e.hash & oldCap) == 0)
What this does is check whether the next bit (the 5th in our case) is zero - if it is, the entry stays where it is; if it's not, the entry moves to the new table at a power-of-two offset (old index + oldCap).
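To make that concrete, here is a small worked example of the bit test using the numbers from this question (the variable names are mine, not the JDK's):
int oldCap   = 16;                                    // 16 buckets before the resize
int hash     = 0b1_1111;                              // a key whose low 5 bits are 11111
int oldIndex = hash & (oldCap - 1);                   // 0b1111 = 15: bucket before the resize
boolean stays = (hash & oldCap) == 0;                 // tests the 5th bit; false for this hash
int newIndex = stays ? oldIndex : oldIndex + oldCap;  // 15 + 16 = 31 in the doubled table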
And now finally the question: that piece of code in resize is carefully written so that it preserves the order the entries had in that bin.
So after you put those 9 keys in the HashMap the order is going to be :
DFHXR -> TUDDY -> RUTWZ -> WFCVW -> GCVUR (one bin)
YSXFJ -> AXVUH -> DEDUC -> ZETCU (another bin)
Why would you want to preserve the order of some entries in a HashMap? Relying on order in a Map is really bad, as detailed here or here.
The design consideration has been documented within the same source file, in a code comment at line 211:
* When bin lists are treeified, split, or untreeified, we keep
* them in the same relative access/traversal order (i.e., field
* Node.next) to better preserve locality, and to slightly
* simplify handling of splits and traversals that invoke
* iterator.remove. When using comparators on insertion, to keep a
* total ordering (or as close as is required here) across
* rebalancings, we compare classes and identityHashCodes as
* tie-breakers.
Since removing mappings via an iterator can’t trigger a resize, the reasons to retain the order specifically in resize are “to better preserve locality, and to slightly simplify handling of splits”, as well as being consistent regarding the policy.
There are two common reasons for maintaining order in bins implemented as a linked list:
One is that you maintain order by increasing (or decreasing) hash-value.
That means when searching a bin you can stop as soon as the current item is greater (or less, as applicable) than the hash being searched for.
Another approach involves moving entries to the front (or nearer the front) of the bucket when accessed or just adding them to the front. That suits situations where the probability of an element being accessed is high if it has just been accessed.
I've looked at the source for JDK 8 and it appears to be (at least for the most part) doing the passive version of the latter (add to front):
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java
While it's true that you should never rely on iteration order from containers that don't guarantee it, that doesn't mean that it can't be exploited for performance if it's structural. Also notice that the implementation of a class is in a privileged position to exploit details of its implementation in a formal way that a user of that class should not.
If you look at the source, understand how it's implemented and exploit it, you're taking a risk. If the implementer does it, that's a different matter!
Note:
I have an implementation of an algorithm that relies heavily on a hash table, called Hashlife. It uses this model: keep a hash table whose size is a power of two because (a) you can locate the entry by bit-masking (& mask) rather than by division, and (b) rehashing is simplified because you only ever 'unzip' hash bins.
Benchmarking shows that the algorithm gains around 20% by actively moving patterns to the front of their bin when accessed.
The algorithm pretty much exploits repeating structures in cellular automata, which are common so if you've seen a pattern the chances of seeing it again are high.
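For illustration, a minimal sketch of that move-to-front step on a singly linked bucket (the Node class and method here are hypothetical stand-ins, not the JDK's or Hashlife's internals):
final class Node<K, V> {
    final K key;
    V value;
    Node<K, V> next;
    Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
}

// On a hit, unlink the node and relink it at the head so recently accessed keys are found first.
static <K, V> Node<K, V> getAndMoveToFront(Node<K, V> head, K key) {
    Node<K, V> prev = null, cur = head;
    while (cur != null && !cur.key.equals(key)) {
        prev = cur;
        cur = cur.next;
    }
    if (cur == null || prev == null) {
        return head;            // not found, or already at the front
    }
    prev.next = cur.next;       // unlink the hit node
    cur.next = head;            // relink it at the head
    return cur;                 // cur is the new head of the bucket
}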
Order in a Map is really bad [...]
It's not bad; it's (in academic terminology) simply unspecified. What Stuart Marks wrote at the first link you posted:
[...] preserve flexibility for future implementation changes [...]
Which means (as I understand it) that right now the implementation happens to keep the order, but if a better implementation is found in the future, it will be used whether it keeps the order or not.

ImmutableCollections SetN implementation detail

I have sort of a hard time understanding an implementation detail from the Java 9 ImmutableCollections.SetN; specifically why the internal array needs to be twice the size of the element count.
Suppose you do this:
Set.of(1,2,3,4) // 4 elements, but internal array is 8
To be precise: I perfectly understand why this is done (a double expansion) in the case of a HashMap, where you (almost) never want the load factor to be one. A value != 1 improves search time, since entries are better dispersed across buckets, for example.
But in the case of an immutable Set, I can't really tell - especially given the way an index into the internal array is chosen.
Let me provide some details. First how the index is searched:
int idx = Math.floorMod(pe.hashCode() ^ SALT, elements.length);
pe is the actual value we put in the set. SALT is just 32 bits generated at start-up, once per JVM (this is the actual randomization if you want). elements.length for our example is 8 (4 elements, but 8 here - double the size).
This expression is like a negative-safe modulo operation. Notice that the same logical thing is done in HashMap for example ((n - 1) & hash) when the bucket is chosen.
So if elements.length is 8 for our case, then this expression will return any positive value that is less than 8 (0, 1, 2, 3, 4, 5, 6, 7).
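As a small aside, floorMod is used rather than the plain % operator because hashCode() ^ SALT can be negative (a tiny sketch):
int hash = -7;                          // e.g. a negative pe.hashCode() ^ SALT
int plain = hash % 8;                   // -7: plain % keeps the sign of the dividend
int floored = Math.floorMod(hash, 8);   //  1: always in the range [0, 8)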
Now the rest of the method:
while (true) {
    E ee = elements[idx];
    if (ee == null) {
        return -idx - 1;
    } else if (pe.equals(ee)) {
        return idx;
    } else if (++idx == elements.length) {
        idx = 0;
    }
}
Let's break it down:
if (ee == null) {
return -idx - 1;
This is good, it means that the current slot in the array is empty - we can put our value there.
} else if (pe.equals(ee)) {
return idx;
This is bad - the slot is occupied and the entry already in place is equal to the one we want to put. Sets can't contain duplicate elements, so an exception is thrown later.
else if (++idx == elements.length) {
idx = 0;
}
This means that the slot is occupied (hash collision), but the elements are not equal. In a HashMap this entry would be put into the same bucket as a linked Node or TreeNode - but that's not the case here.
So index is incremented and the next position is tried (with the small caveat that it moves in a circular way when it reaches the last position).
And here is the question: if nothing too fancy (unless I'm missing something) is being done when searching for the index, why does the array need to be twice as big? Or why wasn't the function written like this:
int idx = Math.floorMod(pe.hashCode() ^ SALT, input.length);
// notice the difference: input.length (4) rather than elements.length (8)
The current implementation of SetN is a fairly simple closed hashing scheme, as opposed to the separate chaining approach used by HashMap. ("Closed hashing" is also confusingly known as "open addressing".) In a closed hashing scheme, elements are stored in the table itself, instead of being stored in a list or tree of elements that are linked from each table slot, which is separate chaining.
This implies that if two different elements hash to the same table slot, this collision needs to be resolved by finding another slot for one of the elements. The current SetN implementation resolves this using linear probing, where the table slots are checked sequentially (wrapping around at the end) until an open slot is found.
If you want to store N elements, they'll certainly fit into a table of size N. You can always find any element that's in the set, though you might have to probe several (or many) successive table slots to find it, because there will be lots of collisions. But if the set is probed for an object that's not a member, linear probing will have to check every table slot before it can determine that object isn't a member. With a full table, most probe operations will degrade to O(N) time, whereas the goal of most hash-based approaches is for operations to be O(1) time.
Thus we have a classic space-time tradeoff. If we make the table larger, there will be empty slots sprinkled throughout the table. When storing items, there should be fewer collisions, and linear probing will find empty slots more quickly. The clusters of full slots next to each other will be smaller. Probes for non-members will proceed more quickly, since they're more likely to encounter an empty slot sooner while probing linearly - possibly after not having to reprobe at all.
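To make the probing behaviour concrete, here is a minimal, generic linear-probe lookup (a sketch of the general technique, not the JDK's SetN code; the probe-count guard against a completely full table is my addition):
static <E> boolean probeContains(E[] table, E target) {
    int idx = Math.floorMod(target.hashCode(), table.length);
    for (int probes = 0; probes < table.length; probes++) {
        E e = table[idx];
        if (e == null) {
            return false;             // empty slot: target cannot be in the table
        }
        if (target.equals(e)) {
            return true;              // found after 'probes' collisions
        }
        if (++idx == table.length) {
            idx = 0;                  // wrap around to the start of the table
        }
    }
    return false;                     // table completely full: every slot was checked
}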
In bringing up the implementation, we ran a bunch of benchmarks using different expansion factors. (I used the term EXPAND_FACTOR in the code whereas most literature uses load factor. The reason is that the expansion factor is the reciprocal of the load factor, as used in HashMap, and using "load factor" for both meanings would be confusing.) When the expansion factor was near 1.0, the probe performance was quite slow, as expected. It improved considerably as the expansion factor was increased. The improvement was really flattening out by the time it got up to 3.0 or 4.0. We chose 2.0 since it got most of the performance improvement (close to O(1) time) while providing good space savings compared to HashSet. (Sorry, we haven't published these benchmark numbers anywhere.)
Of course, all of these are implementation specifics and may change from one release to the next, as we find better ways to optimize the system. I'm certain there are ways to improve the current implementation. (And fortunately we don't have to worry about preserving iteration order when we do this.)
A good discussion of open addressing and performance tradeoffs with load factors can be found in section 3.4 of
Sedgewick, Robert and Kevin Wayne. Algorithms, Fourth Edition. Addison-Wesley, 2011.
The online book site is here but note that the print edition has much more detail.

Fastest algorithm to find frequencies of each element of an array of reals?

The problem is to find frequencies of each element of an array of reals.
double[] a = new double[n];
int[] freq = new int[n];
I have come up with two solutions:
First solution O(n^2):
for (int i = 0; i < a.length; i++) {
    if (freq[i] != -1) {
        for (int j = i + 1; j < a.length; j++) {
            if (a[i] == a[j]) {
                freq[i]++;
                freq[j] = -1;
            }
        }
    }
}
Second solution O(nlogn):
quickSort(a, 0, a.length - 1);
int j = 0;
freq[j] = 1;
for (int i = 0; i < a.length - 1; i++) {
    if (a[i] == a[i + 1]) {
        freq[j]++;
    } else {
        j = i + 1;
        freq[j] = 1;
    }
}
Is there any faster algorithm for this problem (O(n) maybe)?
Thank you in advance for any help you can provide.
Let me start by saying that checking doubles for exact equality is not good practice. For more details see: What every programmer should know about floating point.
You should use more robust double comparisons.
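For example, a tolerance-based comparison along these lines (a sketch; the choice of EPSILON depends entirely on your data):
static final double EPSILON = 1e-9;

// Relative-tolerance comparison: treats a and b as equal if they differ by at most
// EPSILON scaled by their magnitude (with a floor of 1.0 to handle values near zero).
static boolean nearlyEqual(double a, double b) {
    return Math.abs(a - b) <= EPSILON * Math.max(1.0, Math.max(Math.abs(a), Math.abs(b)));
}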
Now that we are done with that, let's face your problem.
You are dealing with a variation of the Element Distinctness Problem for floating-point numbers.
Generally speaking, under the algebraic computation tree model, one cannot do better than Omega(nlogn) (references in this thread: https://stackoverflow.com/a/7055544/572670).
However, if you are going to stick with exact double equality checks (please don't), you can use a stronger model and a hash table to achieve an O(n) solution, by maintaining a hash-table-based histogram (implemented as a HashMap<Double, Integer>) of the elements; when you are done, scan the histogram and read off the count of each element.
(Please don't do it)
There is a complex way to achieve O(n) time based on hashing, even when dealing with floating points. It is based on adding each element to multiple entries of the hash table and assuming a hash function that maps a whole range of values [x - delta/2, x + delta/2) to the same hash value (so it hashes in chunks: [x1, x2) -> h1, [x2, x3) -> h2, [x3, x4) -> h3, ...). You then create a hash table where an element x is hashed to 3 values: x - (3/4)delta, x, and x + (3/4)delta.
This guarantees when checking an equal value later, it will have a match in at least one of the 3 places you put the element.
This is significantly more complex to implement, but it should work. A variant of this can be found in Cracking the Coding Interview, mathematical question 6. (Just make sure you look at the 5th edition; the answer in the 4th edition is wrong and was fixed in the newer edition.)
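A rough sketch of the chunking idea above (the helper names are mine, and delta is the tolerance within which two values are considered equal):
// Values are bucketed into chunks of width delta: [k*delta, (k+1)*delta) -> chunk k.
static long chunk(double x, double delta) {
    return (long) Math.floor(x / delta);
}

// Each element is stored under three chunk keys, so a later lookup of a value within
// delta/2 of x is guaranteed to find it under at least one of them.
static long[] probeKeys(double x, double delta) {
    return new long[] {
        chunk(x - 0.75 * delta, delta),
        chunk(x, delta),
        chunk(x + 0.75 * delta, delta)
    };
}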
As another side note, you don't need to implement your own sort. Use Arrays.sort().
If your doubles have already been rounded appropriately and you are confident there isn't a representation error to worry about, you can use a hash map like:
Map<Double, Long> freqCount = DoubleStream.of(reals).boxed()
.collect(Collectors.groupingBy(d -> d, Collectors.counting()));
This uses quite a bit of memory, but is O(n).
The alternative is to use the following as a first pass
NavigableMap<Double, Long> freqCount = DoubleStream.of(reals).boxed()
.collect(Collectors.groupingBy(d -> d, TreeMap::new, Collectors.counting()));
This will count all the values which are exactly the same, and you can then use a grouping strategy to combine double values which are almost the same but should be considered equal for your purposes. This is O(n log n).
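One possible grouping strategy is to round to a fixed precision before counting (a sketch; the six decimal places here are arbitrary and must match whatever "equal for your purposes" means for your data):
Map<Double, Long> roundedCount = DoubleStream.of(reals).boxed()
        .collect(Collectors.groupingBy(d -> Math.round(d * 1e6) / 1e6, Collectors.counting()));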
Using a Trie would perform in pretty much linear time, because insertions are going to be extremely fast (proportional to the length of the key rather than to the number of elements).
Sorting and counting is definitely too slow if all you need is the frequencies. Your friend is the trie: https://en.wikipedia.org/wiki/Trie
If you were using a Trie, you would convert each number into a String (simple enough in Java). The cost of an insertion into a Trie varies slightly based on the implementation, but in general it will be proportional to the length of the String.
If you need an implementation of a Trie, I suggest looking into Robert Sedgewick's implementation from his Algorithms course here:
http://algs4.cs.princeton.edu/52trie/TrieST.java.html

Adding 2 lists element by element

I have two Lists and want to add them element by element, so that result[i] = list1[i] + list2[i].
Is there an easier and probably better-performing way than using a for loop to iterate over the first list and add each element to the result list?
I appreciate your answers!
Depends on what kind of list and what kind of for loop.
Iterating over the elements (rather than indices) would almost certainly be plenty fast enough.
On the other hand, iterating over indices and repeatedly getting the element by index could work rather poorly for certain types of lists (e.g. a linked list).
My understanding is that you have list1 and list2 and that you want to find the best-performing way to compute result[index] = list1[index] + list2[index].
My main suggestion is that before you start optimising for performance is to measure whether you need to optimise at all. You can iterate through the lists as you said, something like:
for (int i = 0; i < listSize; i++)
{
    result.add(list1.get(i) + list2.get(i));
}
In most cases this is fine. See NPE's answer for a description of where this might be expensive, i.e. a linked list. Also see this answer and note that each step of the for loop does a get - on an array-backed list it is done in one step, but on a linked list it takes as many steps as are needed to reach that element.
Assuming an array-backed list such as ArrayList, this is O(n) and (depending on list size) will be done so quickly that it hardly registers as a blip on your performance profile.
As a twist, since the operations are completely independent - result[0] = list1[0] + list2[0] is independent of result[1] = list1[1] + list2[1], and so on - you can run them in parallel. For example, you could run the first half of the calculations (indices below size/2) on one thread and the other half on a second thread and expect the elapsed time to roughly halve (assuming at least 2 free CPUs). The best number of threads to use depends on the size of your data, the number of CPUs, and other operations happening at the same time, and is normally best decided by testing and modelling under different conditions. All this adds complexity to your program, so my main recommendation is to start simple, then measure, and only then decide whether you need to optimise.
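If you do go parallel, a stream-based sketch saves you from managing threads by hand (this assumes random-access List<Integer> instances named list1 and list2 of equal size; whether parallel() actually wins still has to be measured):
List<Integer> result = IntStream.range(0, list1.size())
        .parallel()                                    // split the index range across worker threads
        .mapToObj(i -> list1.get(i) + list2.get(i))    // each index is computed independently
        .collect(Collectors.toList());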
Looping is inevitable unless you have a matrix API (e.g. OpenGL). You could implement a List<Integer> which is backed by the original lists:
public class CalcList extends AbstractList<Integer> {
    private final List<Integer> l1, l2;
    public CalcList(List<Integer> l1, List<Integer> l2) { this.l1 = l1; this.l2 = l2; }
    @Override
    public Integer get(int index) {
        return l1.get(index) + l2.get(index);
    }
    @Override
    public int size() { return l1.size(); }
}
This avoids copy operations and defers the calculations until the values are actually read:
CalcList results1 = new CalcList(list1, list2);
CalcList results2 = new CalcList(results1, list3);
// no calculation or extra memory allocated until now
for (int result : results2) {
    // the calculation happens here, still without unnecessary memory
}
This could give an advantage if the compiler is able to translate it into:
for (int i = 0; i < list1.size(); i++) {
    int result = list1.get(i) + list2.get(i) + list3.get(i) + …;
}
But I doubt that. You have to run a benchmark for your specific use case to find out if this implementation has an advantage.
Java doesn't come with a map-style function, so the way to do this kind of operation is with a for loop.
Even if you use some other construct, the looping will be done anyway. An alternative is using the GPU for computations, but this is not a standard Java feature.
Also, using arrays should be faster than operating on linked lists.

How to obtain counts of each of the elements of the list?

Given a sorted list of something (a,a,b,c,c)
What would be the most efficient way to recognize that a exists in the list 2 times, b once and c 2 times?
Aside from the obvious approach of making a map of counts - can we do better than this?
if (map.containsKey(key)) {
    map.put(key, map.get(key) + 1);
} else {
    map.put(key, 1);
}
Ultimately the goal is to iterate over the list and know at any given point how many times a key has been seen before. Putting things in a map seems like a step we don't really need.
I would use a Multiset implementation in Guava - probably a HashMultiset. That avoids having to do a put/get on each iteration - if the item already exists when you add it, it just increments the count. It's a bit like using a HashMap<Foo, AtomicInteger>.
See the Guava User's Guide entry on Multiset for more details.
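A minimal sketch of what that looks like (assumes Guava on the classpath and a List<String> named sortedList):
Multiset<String> counts = HashMultiset.create();
for (String item : sortedList) {
    counts.add(item);                // bumps the element's count in a single operation
}
int timesSeenA = counts.count("a");  // 2 for the list (a, a, b, c, c)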
Your method, at each iteration, makes
one lookup for containsKey
one lookup for get
one unboxing from Integer to int
one boxing from int to Integer
one put
You could simply compare the current element to the previous one, increment a count if it's equal, and put the count if not (and reset the counter to 1).
But even if you keep your algorithm, using get and comparing the result to null would at least avoid one unnecessary lookup.
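A sketch of that previous-element comparison on a sorted list, with no map at all (seenSoFar tells you, at each position, how many times the current key has appeared so far):
List<String> sorted = List.of("a", "a", "b", "c", "c");
int seenSoFar = 0;
for (int i = 0; i < sorted.size(); i++) {
    if (i > 0 && sorted.get(i).equals(sorted.get(i - 1))) {
        seenSoFar++;                 // same key as the previous element: extend the run
    } else {
        seenSoFar = 1;               // new key: start counting again
    }
    // here, sorted.get(i) has been seen 'seenSoFar' times so far
}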
