Using a temp variable with Collectors - Java

I have the following code. I am trying to understand whether the two approaches make any difference to memory usage.
Approach 1: Using collectors I can directly return map like so:
List<Customer> customerList = new ArrayList<>();
customerList.add(new Customer("1", "pavan"));
customerList.add(new Customer("2", "kumar"));
return customerList.stream().collect(Collectors.toMap(t->t.getId(), t->t));
Approach 2: Using an explicit map to collect results, like so:
Map<String,Customer> map = new HashMap<String, Customer>();
map = customerList.stream().collect(Collectors.toMap(t->t.getId(), t->t));
return map;
Compared to the first, does the second approach make any difference to memory/GC if I run it a million times?

Aside from the second example instantiating a Map instance that you don't need, both pieces of code are effectively identical: you immediately replace the Map reference you created with the one returned by the stream. Most likely the JIT compiler would eliminate that as redundant code.
The collect method of the streams API will instantiate a Map for you; the code has been well optimised, which is one of the advantages of using the Stream API over doing it yourself.
To answer your specific question, you can iterate over both sections of code as many times as you like and it won't make any difference to the GC impact.

The code is by no means identical; specifically, the documentation of Collectors.toMap says that it will return a Map:
There are no guarantees on the type, mutability, serializability, or thread-safety of the Map returned.
There are absolutely no guarantees whatsoever that the returned Map is actually a HashMap; it could be any other Map implementation. So assuming it is a HashMap is just wrong.
Then there is the way you build the Collector. It could be simplified to:
customerList.stream().collect(Collectors.toMap(Customer::getId, Function.identity()));
The method reference Customer::getId, as opposed to the lambda expression, will create one less synthetic method (since lambda expressions are de-sugared into methods, while method references are not).
Also, Function.identity() instead of t -> t will create fewer objects if used in multiple places.
Then there is the matter of how a HashMap works internally. If you don't specify an initial capacity, it might have to resize - which is an expensive operation. By default, Collectors.toMap will start with a HashMap of 16 buckets and a load factor of 0.75 - which means you can put 12 entries into it before the next resize.
You can't avoid that with the two-argument Collectors.toMap, since its supplier is always HashMap::new - with the default 16 buckets and load factor of 0.75. (The four-argument overload does accept a custom map supplier; see the sketch below.)
Knowing this, you can drop the stream entirely:
Map<String, Customer> map = new HashMap<String, Customer>((int)Math.ceil(customerList.size() / 0.75));
customerList.forEach(x -> map.put(x.getId(), x));
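If you do want to keep the stream, here is a sketch using that four-argument toMap overload, which lets you pass the pre-sized map yourself (the merge function is required by this overload; ids are assumed unique, so it should never fire):
Map<String, Customer> map = customerList.stream()
        .collect(Collectors.toMap(
                Customer::getId,
                Function.identity(),
                (first, second) -> first, // ids assumed unique; keep the first on a collision
                () -> new HashMap<String, Customer>((int) Math.ceil(customerList.size() / 0.75))));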

In the second approach, you are instantiating a Map, and reassigning the reference to the one returned by the call to stream.collect().
Obviously, the first Map object referenced by "map" is lost.
The first approach does not have this problem.
In short, yes, this makes a minor difference in terms of memory usage, but it is likely negligible even over a million iterations, since the discarded empty HashMap is cheap to allocate and collect.

Related

Java unmodifiableMap can be replaced with 'Map.copyOf' call

I'm new to Java and I recently learnt that sometimes it's important to deep-copy a Collection and make an unmodifiable view of it so that the data inside remains safe and unchanged.
When I try to practice this (unmodifiableMap2), I get a warning from IDEA that
unmodifiableMap Can be replaced with 'Map.copyOf' call
That's weird to me because I think unmodifiableMap is not merely a copy of the underlying map. Besides, when I try to create the same unmodifiable map in another way (unmodifiableMap1), the warning doesn't pop up!
How should I understand this behavior of IDEA?
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class test {
    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<>();
        map.put(1, 1);
        map.put(2, 2);
        Map<Integer, Integer> map1 = new HashMap<>(map);
        Map<Integer, Integer> unmodifiableMap1 = Collections.unmodifiableMap(map1);
        Map<Integer, Integer> unmodifiableMap2 = Collections.unmodifiableMap(new HashMap<>(map));
    }
}
Map.copyOf() makes a copy of the given Map instance, but it requires that no key or value in the map is null. Usually this is the case, but it is not a strict requirement for a Map in general.
java.util.Collections.unmodifiableMap() just wraps a reference to the given Map instance. This means that the receiver is unable to modify the map, but modifications to the original map (that one that was the argument to unmodifiableMap()) are visible to the receiver.
Assume we have two threads: one iterates over the unmodifiable map while the other modifies the original one. As a result, you may get a ConcurrentModificationException for an operation on the unmodifiable map - not fun to debug!
This cannot happen with the copy created by Map.copyOf(). But this has a price: with a copy, you need roughly twice the amount of memory for the map. For really large maps, this may cause memory shortages, up to an OutOfMemoryError. Also not fun to debug!
In addition, just wrapping the existing map is presumably much faster than copying it.
So there is no best solution in general, but for most scenarios, I have a preference for using Map.copyOf() when I need an unmodifiable map.
The sample in the question does not wrap the original Map instance; it makes a copy before wrapping it (either in a line of its own, or on the fly). This eliminates the potential problem with the under-the-hood modification, but may bring in the memory issue.
From my experience so far, Map.copyOf(map) looks to be more efficient than Collections.unmodifiableMap(new HashMap<>(map)).
By the way: Map.copyOf() returns a map that resembles a HashMap: when you copy a TreeMap with it, the sort order is lost, while wrapping with unmodifiableMap() keeps the underlying Map implementation and therefore also the sort order. So when that is important, you can use Collections.unmodifiableMap(new TreeMap<>(map)), while Map.copyOf() does not work there.
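A minimal sketch illustrating the view-versus-copy difference described above (the class name and sample data are mine):
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ViewVsCopy {
    public static void main(String[] args) {
        Map<String, Integer> original = new HashMap<>();
        original.put("a", 1);

        Map<String, Integer> view = Collections.unmodifiableMap(original);
        Map<String, Integer> copy = Map.copyOf(original);

        original.put("b", 2); // mutate the backing map

        System.out.println(view.size()); // 2 - the view reflects the change
        System.out.println(copy.size()); // 1 - the copy does not
    }
}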
An unmodifiable map using an existing reference to a map is perfectly fine, and there are many reasons you might want to do this.
Consider this class:
class Foo {
    private final Map<String, String> fooMap = new HashMap<>();

    // some methods which mutate the map

    public Map<String, String> getMap() {
        return Collections.unmodifiableMap(fooMap);
    }
}
What this class does is provide a read-only view of the map it encapsulates. The class can be sure that clients who consume the map cannot alter it; they can just see its contents. They will also be able to see any updates to the entries if they hold on to the reference for some time.
If we had tried to expose a read-only view by copying the map, it would take time and memory to perform the copy and the client would not see any changes because both maps are then distinct instances - the source and the copy.
However in the case of this:
Collections.unmodifiableMap(new HashMap<>(map));
You are first copying the map into a new HashMap and then passing that copy into Collections.unmodifiableMap. The result is effectively constant: you do not have a reference to the copy you created with new HashMap<>(map), nor can you get one*.
If what you want is a constant map, then Map.copyOf is a more concise way of achieving that, so IntelliJ suggests you should use that instead.
In the first case, since the reference to the map already exists, IntelliJ cannot make the same inference about your intent so it gives no such suggestion.
You can see the IntelliJ ticket for this feature if you like, though it doesn't explain why the two are essentially equivalent, just that they are.
* well, you probably could via reflection, but IntelliJ is assuming that you won't
Map.copyOf(map) is essentially equivalent to Collections.unmodifiableMap(new HashMap<>(map)), apart from the null restrictions mentioned above.
Neither does any kind of deep copying. But it's strictly shorter to write Map.copyOf(map).

How to make access to a value in a Java HashMap synchronized?

Let's say I have a Java HashMap where the keys are strings or whatever, and the values are lists of other values, for example:
Map<String,List<String>> myMap=new HashMap<String,List<String>>();
//adding value to it would look like this
myMap.put("catKey", new ArrayList<String>(){{add("catValue1");}} );
If we have many threads adding and removing values from the lists (not changing the keys, just the values of the HashMap), is there a way to make only the access to the lists thread-safe, so that many threads can edit many values at the same time?
Use a synchronized or concurrent list implementation instead of ArrayList, e.g.
new Vector() (synchronized)
Collections.synchronizedList(new ArrayList<>()) (synchronized wrapper)
new CopyOnWriteArrayList<>() (concurrent)
new ConcurrentLinkedDeque<>() (concurrent, not a List)
The last is not a list, but is useful if you don't actually need access-by-index (i.e. random access), because it performs better than the others.
Note, since you likely need concurrent insertion of the initial empty list into the map for a new key, you should use a ConcurrentHashMap for the Map itself, instead of a plain HashMap.
Recommendation
Map<String, Deque<String>> myMap = new ConcurrentHashMap<>();
// Add new key/value pair
String key = "catKey";
String value = "catValue1";
myMap.computeIfAbsent(key, k -> new ConcurrentLinkedDeque<>()).add(value);
The above code is fully thread-safe when adding a new key to the map, and fully thread-safe when adding a new value to a list. The code doesn't spend time obtaining synchronization locks, and doesn't suffer the degradation that CopyOnWriteArrayList has when the list grows large.
The only problem is that it uses a Deque, not a List, but the reality is that most uses of List could just as easily use a Deque; they specify a List out of habit, so this is likely an acceptable change.
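A runnable sketch of this pattern under contention (the class name and the parallel-stream driver are mine, purely for demonstration):
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.stream.IntStream;

public class ConcurrentMapDemo {
    public static void main(String[] args) {
        Map<String, Deque<String>> myMap = new ConcurrentHashMap<>();

        // 1000 additions from many threads, with no explicit locking anywhere
        IntStream.range(0, 1_000).parallel().forEach(i ->
                myMap.computeIfAbsent("catKey", k -> new ConcurrentLinkedDeque<>())
                     .add("catValue" + i));

        System.out.println(myMap.get("catKey").size()); // always 1000
    }
}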
There is a ConcurrentHashMap class, which implements ConcurrentMap and can be used for thread-safe Map handling. Its compute, putIfAbsent, and merge methods all handle, in a thread-safe way, multiple threads trying to affect the same value at once.
Firstly, use a ConcurrentHashMap, which synchronizes on the particular bucket being modified.
Secondly, atomic compound operations must be used; otherwise, between one thread calling the get method and acting on the result, another thread can call the put method. Like below:
// wrong: the check-then-act is not atomic
if (myMap.get("catKey") == null) {
    myMap.put("catKey", new ArrayList<String>(){{add("catValue1");}});
}
// correct: the check-then-act happens atomically inside compute
myMap.compute("catKey", (key, value) -> {
    if (value == null) {
        value = new ArrayList<>();
        value.add("catValue1");
    }
    return value;
});

Java 8 forEach use cases

Let's say you have a collection with some strings and you want to return the first two characters of each string (or some other manipulation...).
In Java 8, for this case, you can use either the map or the forEach method on the stream() which you get from the collection (maybe something else would work too, but that is not important right now).
Personally I would use map, primarily because I associate forEach with mutating the collection, and I want to avoid that. I also created a really small performance test but could not see any improvement when using forEach (I perfectly understand that small tests cannot give reliable results, but still).
So what are the use-cases where one should choose forEach?
map is the better choice for this, because you're not trying to do anything with the strings yet, just map them to different strings.
forEach is designed to be the "final operation." As such, it doesn't return anything, and is all about mutating some state -- though not necessarily that of the original collection. For instance, you might use it to write elements to a file, having used other constructs (including map) to get those elements.
forEach terminates the stream and is executed for the side effect of the passed Consumer. It does not necessarily mutate the stream's members.
map maps each stream element to a different value/object using a provided Function. A Stream<R> is returned, on which more steps can act.
The forEach terminal operation can be useful in several cases: when you want to collect into some older class for which you don't have a proper collector, or when you don't want to collect at all but instead send your data somewhere outside (write it into a database, print it into an OutputStream, etc.). There are many cases where the best way is to use both map (as an intermediate operation) and forEach (as the terminal operation).
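A minimal sketch contrasting the two (the sample data is mine):
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

List<String> words = Arrays.asList("alpha", "beta", "gamma");

// map: transform each element and stay inside the stream pipeline
List<String> prefixes = words.stream()
        .map(s -> s.substring(0, 2))
        .collect(Collectors.toList());

// forEach: terminal operation, used purely for its side effect
prefixes.forEach(System.out::println);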

What is the fastest way to get the elements of a collection?

I have a List<Pair<String, String>> into which I want to copy the data from a Collection.
What is the best way to read the collection and add items to list?
List<Pair<String, String>> identityMemoPairs = new LinkedList<Pair<String, String>>();
Collection result = handler.getResult();
while (result.iterator().hasNext()) {
    IdentityM im = (IdentityM) result.iterator().next();
    identityMemoPairs.add(Pair.of(im.identity, im.memo));
}
Your code is wrong, because you create a new iterator at each iteration of the while loop (actually you create two of them). Each fresh iterator will point to the beginning of the result collection. Thus you have created an infinite loop.
To fix this, call result.iterator() only once and store the result in a variable.
But even better (easier to read, less error-prone) would be the for-each loop, which is (almost always) the preferred way to iterate over collections:
for (IdentityM im : (Collection<IdentityM>) result) {
    identityMemoPairs.add(Pair.of(im.identity, im.memo));
}
The compiler will automatically transform this into code using the iterator (a sketch of the desugared form follows below), so there is no performance difference. Generally, performance is not a problem when iterating over a collection anyway, as long as you avoid a few bad things like calling get(i) on a LinkedList.
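Roughly, the for-each loop above desugars to the corrected iterator form:
// the iterator is created once, outside the loop - that is the crucial difference
Iterator<IdentityM> it = ((Collection<IdentityM>) result).iterator();
while (it.hasNext()) {
    IdentityM im = it.next();
    identityMemoPairs.add(Pair.of(im.identity, im.memo));
}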
Note that the compiler will give you a warning here, which has nothing to do with the iteration, but with the use of the raw type Collection (instead of Collection<IdentityM>). Please check if handler.getResult() actually returns a Collection<IdentityM> and change the type of the result variable to this if possible.
Another question is whether you really need that list to be a list of pairs. Usually using plain pair classes is not recommended, because their name does not show the meaning of the object they represent. For example, it is better to use a class PersonName which has fields for first and last name instead of a Pair<String, String>. Why can't you just use a List<IdentityM>? And if you can use this, are you sure you can't use a Collection<IdentityM>? (List and Collection are often exchangeable.) Then you could avoid the copying altogether.
Your code is IMHO good as it is. But you could probably make the handler return a collection of Pairs directly, so you could call identityMemoPairs.addAll() instead of iterating the collection yourself. But that only makes it "prettier"; it doesn't give more performance.
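A sketch of that variant, assuming getResult() could be changed to return Collection<Pair<String, String>> (a hypothetical signature):
// hypothetical: getResult() now returns Collection<Pair<String, String>>
Collection<Pair<String, String>> result = handler.getResult();
identityMemoPairs.addAll(result);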

Iterating through the union of several Java Map key sets efficiently

In one of my Java 6 projects I have an array of LinkedHashMap instances as input to a method which has to iterate through all keys (i.e. through the union of the key sets of all maps) and work with the associated values. Not all keys exist in all maps and the method should not go through each key more than once or alter the input maps.
My current implementation looks like this:
Set<Object> keyset = new HashSet<Object>();
for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        if (keyset.add(key)) {
            ...
        }
    }
}
The HashSet instance ensures that no key will be acted upon more than once.
Unfortunately this part of the code is rather critical performance-wise, as it is called very frequently. In fact, according to the profiler over 10% of the CPU time is spent in the HashSet.add() method.
I am trying to optimise this code as much as possible. The use of LinkedHashMap with its more efficient iterators (in comparison to the plain HashMap) was a significant boost, but I was hoping to reduce what is essentially book-keeping time to the minimum.
Putting all the keys in the HashSet beforehand using addAll() proved to be less efficient, due to the cost of calling HashSet.contains() afterwards.
At the moment I am looking at whether I can use a bitmap (well, a boolean[] to be exact) to avoid the HashSet completely, but it may not be possible at all, depending on my key range.
Is there a more efficient way to do this? Preferably something that will not pose restrictions on the keys?
EDIT:
A few clarifications and comments:
I do need all the values from the maps - I cannot drop any of them.
I also need to know which map each value came from. The missing part (...) in my code would be something like this:
for (Map<Object, Object> m : input) {
    Object v = m.get(key);
    // Do something with v
}
A simple example to get an idea of what I need to do with the maps would be to print all maps in parallel like this:
Key  Map0  Map1  Map2
F    1     null  2
B    2     3     null
C    null  null  5
...
That's not what I am actually doing, but you should get the idea.
The input maps are extremely variable. In fact, each call of this method uses a different set of them. Therefore I would not gain anything by caching the union of their keys.
My keys are all String instances. They are sort-of-interned on the heap using a separate HashMap, since they are pretty repetitive, therefore their hash code is already cached and most hash validations (when the HashMap implementation is checking whether two keys are actually equal, after their hash codes match) boil down to an identity comparison (==). The profiler confirms that only 0.5% of the CPU time is spent on String.equals() and String.hashCode().
EDIT 2:
Based on the suggestions in the answers, I made a few tests, profiling and benchmarking along the way. I ended up with roughly a 7% increase in performance. What I did:
I set the initial capacity of the HashSet to double the collective size of all input maps. This gained me something in the region of 1-2%, by eliminating most (all?) resize() calls in the HashSet.
I used Map.entrySet() for the map I am currently iterating. I had originally avoided this approach due to the additional code and the fear that the extra checks and Map.Entry getter method calls would outweigh any advantages. It turned out that the overall code was slightly faster.
I am sure that some people will start screaming at me, but here it is: Raw types. More specifically I used the raw form of HashSet in the code above. Since I was already using Object as its content type, I do not lose any type safety. The cost of that useless checkcast operation when calling HashSet.add() was apparently important enough to produce a 4% increase in performance when removed. Why the JVM insists on checking casts to Object is beyond me...
Can't provide a replacement for your approach, but here are a few suggestions to (slightly) optimize the existing code; a sketch combining them follows this list.
Consider initializing the hash set with a capacity (the sum of the sizes of all maps). This avoids/reduces resizing of the set during an add operation.
Consider not using keySet(), as it will always create a new set in the background. Use entrySet(); that should be much faster.
Have a look at the implementations of equals() and hashCode() - if they are "expensive", then you have a negative impact on the add method.
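A sketch combining the first two suggestions, based on the code from the question:
// pre-size the set from the collective size of all maps, and iterate entries instead of keys
int total = 0;
for (Map<Object, Object> map : input) {
    total += map.size();
}
Set<Object> keyset = new HashSet<Object>(2 * total); // roomy capacity, avoids resizing
for (Map<Object, Object> map : input) {
    for (Map.Entry<Object, Object> entry : map.entrySet()) {
        if (keyset.add(entry.getKey())) {
            // process entry.getValue() here, and fetch the key's values from the other maps
        }
    }
}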
How you avoid using a HashSet depends on what you are doing.
I would only calculate the union once, each time the input is changed. This should be relatively rare compared with the number of lookups.
// on an update:
Map<Key, Value> union = new LinkedHashMap<Key, Value>();
for (Map<Key, Value> map : input)
    union.putAll(map);

// on a lookup:
Value value = union.get(key);

// process each key once:
for (Entry<Key, Value> entry : union.entrySet()) {
    // do something.
}
Option A is to use the .values() method and iterate through it. But I suppose you have already thought of that.
If the code is called that often, then it might be worth creating additional structures (depending on how often the data is changed). Create a new HashMap: every key in any of your hashmaps is a key in this one, and its value is the list of the HashMaps where that key appears.
This will help if the data is somewhat static (relative to the frequency of queries), so the overhead of managing the structure is relatively small, and if the key space is not very dense (keys do not repeat themselves a lot across different HashMaps), as it will save a lot of unneeded contains() calls.
Of course, if you are mixing data structures it is better if you encapsulate all in your own data structure.
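A sketch of such an index structure (Java 6 style; all names are mine; rebuild it whenever the data changes):
// maps each key to the list of input maps that contain it
Map<Object, List<Map<Object, Object>>> index = new HashMap<Object, List<Map<Object, Object>>>();
for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        List<Map<Object, Object>> maps = index.get(key);
        if (maps == null) {
            maps = new ArrayList<Map<Object, Object>>();
            index.put(key, maps);
        }
        maps.add(map);
    }
}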
You could take a look at Guava's Sets.union(): http://guava-libraries.googlecode.com/svn/tags/release04/javadoc/com/google/common/collect/Sets.html#union(java.util.Set,%20java.util.Set)
