mapPartitions Vs foreach plus accumulator approach - java

There are some cases in which I can obtain the same results by using the mapPartitions or the foreach method.
For example, in a typical MapReduce approach one would perform a reduceByKey immediately after a mapPartitions that transforms the original RDD into a collection of (key, value) tuples. I think it is possible to achieve the same result by using, for instance, an array of accumulators where each executor sums a value at a given index and the index itself acts as the key.
Since reduceByKey will perform a shuffle on disk, I think that when it is possible the foreach approach should be better, even though foreach has the side effect of summing a value into an accumulator.
I am asking to see whether my reasoning is correct. I hope I was clear.

Don't use accumulators for this. They are less reliable. (For example, they can be double-counted if speculative execution is enabled.)
But the approach you describe has its merits.
With reduceByKey there is a shuffle. The upside is that it can handle more keys than would fit on a single machine.
With the foreach + accumulator method you avoid the shuffle. But now you cannot handle more keys than fit on one machine. You also have to know the keys in advance so that you can create the accumulators, and the code becomes a mess too.
If you have a low number of keys, the reduceByKeyLocally method is what you need. It's basically the same as your accumulator trick, except it doesn't use accumulators, you don't have to know the keys in advance, and it's a drop-in replacement for reduceByKey.
reduceByKeyLocally creates a hashmap for each partition, sends the hashmaps to the driver and merges them there.
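A minimal sketch of the difference, assuming an existing JavaSparkContext named sc (the variable name is made up for illustration):

import java.util.Arrays;
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Integer> pairs = sc.parallelize(Arrays.asList("a", "b", "a"))
        .mapToPair(s -> new Tuple2<>(s, 1));

// reduceByKey shuffles: the result stays distributed and scales to many keys.
JavaPairRDD<String, Integer> reduced = pairs.reduceByKey(Integer::sum);

// reduceByKeyLocally avoids the shuffle: per-partition hash maps are merged on
// the driver, so the full key set must fit in driver memory.
Map<String, Integer> counts = pairs.reduceByKeyLocally(Integer::sum);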

Related

Spark RDD- map vs mapPartitions

I have read through the theoretical differences between map and mapPartitions, and I am fairly clear on when to use them in various situations.
But my problem, described below, is more about GC activity and memory (RAM):
=> I wrote a map function to convert a Row to a String, so an input of RDD[org.apache.spark.sql.Row] is mapped to RDD[String]. With this approach an object is created for every row of the RDD, and creating such a large number of objects may increase GC activity.
=> To address this, I thought of using mapPartitions, so that the number of objects becomes equivalent to the number of partitions. mapPartitions takes an Iterator as input and expects a java.lang.Iterable to be returned. But most Iterables, like Array, List, etc., live in memory. So if I have a huge amount of data, could creating an Iterable this way lead to an OutOfMemoryError? Is there another collection (Java or Scala) that should be used here (one that spills to disk in case memory starts to fill)? Or should we only use mapPartitions when the RDD fits completely in memory?
Thanks in advance. Any help would be greatly appreciated.
If you look at JavaRDD.mapPartitions, it takes a FlatMapFunction (or a variant like DoubleFlatMapFunction) which is expected to return an Iterator, not an Iterable. If the underlying collection is lazy then you have nothing to worry about.
RDD.mapPartitions takes a function from Iterator to Iterator.
In general, if you use reference data, you can replace mapPartitions with map and use a static member to store the data. This will have the same footprint and will be easier to write.
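A minimal sketch of the lazy-iterator point, assuming the Spark 2.x Java API (where FlatMapFunction returns an Iterator); rows is a hypothetical JavaRDD<Row>:

import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;

JavaRDD<String> strings = rows.mapPartitions(rowIter -> new Iterator<String>() {
    // Wraps the partition's iterator: one Row is converted at a time, so the
    // partition is never materialized as a whole.
    @Override public boolean hasNext() { return rowIter.hasNext(); }
    @Override public String next() { return rowIter.next().toString(); }
});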
To answer your question about mapPartitions(f: Iterator => Iterator): it is lazy and does not hold the whole partition in memory. Spark takes this Iterator => Iterator function (we can consider it a functor in FP terms) and composes it into its own execution pipeline. If a partition is too big, it will spill to disk before the next shuffle point, so don't worry about it.
One thing that needs to be mentioned: you can force your function to materialize the data in memory simply by doing:
rdd.mapPartitions(
    partitionIter => {
        // doYourLogic stands in for the original "do your logic" placeholder
        partitionIter.map(doYourLogic).toList.iterator
    }
)
toList will force Spark to materialize the data of the whole partition in memory, so watch out for this: operations like toList break the laziness of the function chain.

Java 8 forEach use cases

Let's say you have a collection with some strings and you want to return the first two characters of each string (or some other manipulation...).
In Java 8 for this case you can use either the map or the forEach methods on the stream() which you get from the collection (maybe something else but that is not important right now).
Personally I would use the map primarily because I associate forEach with mutating the collection and I want to avoid this. I also created a really small test regarding the performance but could not see any improvements when using forEach (I perfectly understand that small tests cannot give reliable results but still).
So what are the use-cases where one should choose forEach?
map is the better choice for this, because you're not trying to do anything with the strings yet, just map them to different strings.
forEach is designed to be the "final operation." As such, it doesn't return anything, and is all about mutating some state -- though not necessarily that of the original collection. For instance, you might use it to write elements to a file, having used other constructs (including map) to get those elements.
forEach terminates the stream and is executed for the side effect of the Consumer passed to it. It does not necessarily mutate the stream members.
map maps each stream element to a different value/object using the provided Function. A Stream<R> is returned, on which further operations can act.
The forEach terminal operation can be useful in several cases: when you want to collect into some older class for which you don't have a proper collector, or when you don't want to collect at all but send your data somewhere else (write it to a database, print it to an OutputStream, etc.). There are many cases where the best approach is to use both map (as the intermediate operation) and forEach (as the terminal operation).
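A minimal sketch of that combination, with map doing the transformation and forEach performing the terminal side effect:

import java.util.Arrays;
import java.util.List;

public class MapThenForEach {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("alpha", "beta", "gamma");

        words.stream()
             .map(s -> s.substring(0, 2))   // intermediate: first two characters of each string
             .forEach(System.out::println); // terminal: side effect only, returns nothing
    }
}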

Is there a preferred way to collect a stream of lists into a flat list?

I was wondering, whether there is a preferred way to get from a stream of lists to a collection containing the elements of all the lists in the stream.
I can think of two ways to get there:
final Stream<List<Integer>> stream = Stream.empty();
final List<Integer> one = stream.collect(ArrayList::new, ArrayList::addAll, ArrayList::addAll);
final List<Integer> two = stream.flatMap(List::stream).collect(Collectors.toList());
The second option looks much nicer to me, but I guess the first one is more efficient in parallel streams.
Are there further arguments for or against one of the two methods?
The main difference is that flatMap is an intermediate operation, while collect is a terminal operation.
So flatMap is the only way to process the flattened stream items if you want to do other operations than collecting immediately.
Further collect(ArrayList::new, ArrayList::addAll, ArrayList::addAll) is very hard to read given the fact that you have two identical method references ArrayList::addAll with completely different semantics.
Regarding parallel processing, your guess is wrong. The first option has weaker parallel-processing capabilities, as it relies on ArrayList.addAll applied to the stream items (the sub-lists), which can't be broken into parallel sub-steps. In contrast, Collectors.toList() applied after a flatMap can process the sub-list items in parallel if the particular Lists encountered in the stream support it. But this will only be relevant if you have a rather small stream of rather big sub-lists.
The only drawback of flatMap is the intermediate stream creation which adds an overhead in the case that you have a lot of very small sub-lists.
But in your example, the stream is empty so it doesn’t matter (scnr).
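A minimal sketch of that intermediate-vs-terminal distinction: with flatMap you can keep operating on the flattened elements before collecting (the sample data here is made up):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlattenSketch {
    public static void main(String[] args) {
        Stream<List<Integer>> stream = Stream.of(Arrays.asList(1, 2), Arrays.asList(3, 4));

        List<Integer> evens = stream
                .flatMap(List::stream)    // intermediate: Stream<List<Integer>> -> Stream<Integer>
                .filter(i -> i % 2 == 0)  // keep working on the flattened elements
                .collect(Collectors.toList());

        System.out.println(evens); // prints [2, 4]
    }
}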
I think the intent of option two is much clearer than that of option one. It took me a few seconds to work out what was happening with the first one; it doesn't look "right", although it seems valid. Option two was more obvious to me.
Essentially, the intent of what you are doing is a flatMap. If that's the case, I'd expect to see flatMap used rather than addAll().

Working with huge maps (putIfAbsent)

I have this Map definition:
TreeMap<String, Set<Integer>>
It may contain millions of entries, and I also need a "natural order" (that's why I've chosen a TreeMap, though I could write a Comparator if needed).
So, what I have to do in order to add an element to the map is:
Check if a key already exists.
If not, create a new Set and add the value.
If it exists, I have to add the value to the Set
I have this implementation which works fine:
private void addToMap(String key, Integer value) {
    Set<Integer> vs = dataMap.get(key);
    if (vs == null) {
        vs = new TreeSet<Integer>();
        dataMap.put(key, vs);
    }
    vs.add(value);
}
But I would like to avoid searching for the key and then putting the element if it doesn't exist (it will perform a new search over the huge map).
I think I could use ConcurrentHashMap.putIfAbsent method, but then :
I will not have the natural ordering of the keys (and I would need to sort the millions of keys afterwards).
I may have (I don't know) additional overhead because of synchronization in the ConcurrentHashMap, and since my process is single-threaded this may hurt performance.
Reading this post : Java map.get(key) - automatically do put(key) and return if key doesn't exist?
there's an answer that mentions Guava's MapMaker.makeComputingMap, but it looks like that method is no longer there.
Performance is critical in this situation (as always :D), so please let me know your recommendations.
Thanks in advance.
NOTE :
Thanks a lot for so many helping answers in just some minutes.
(I don't know which one to select as the best).
I will do some performance tests on the suggestions (TreeMultiMap, ConcurrentSkipListMap, TreeSet + HashMap) and update with the results. I will select the one with the best performance then, as I'd like to select all three of them but I cannot.
NOTE2
So, I did some performance testing with 1.5 million entries, and these are the results :
ConcurrentSkipListMap: it doesn't work as I expected, because it replaces the existing value with the new empty set I provided. I thought it would set the value only if the key didn't exist, so I cannot use this one (my mistake).
TreeSet + HashMap, works fine but doesn't give the best performance. It is like 1.5 times slower than TreeMap alone or TreeMultiMap.
TreeMultiMap gives the best performance, but it is almost the same as the TreeMap alone. I will check this one as the answer.
Again, thanks a lot for your contributions and help.
A concurrent map does not do magic: it checks for existence and then inserts if the key does not exist.
Guava has Multimaps; for example, TreeMultimap may be what you need.
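A minimal sketch, assuming Guava is on the classpath; TreeMultimap keeps keys and values in natural order and handles the "create the set if absent" step internally:

import com.google.common.collect.TreeMultimap;

public class MultimapSketch {
    public static void main(String[] args) {
        TreeMultimap<String, Integer> dataMap = TreeMultimap.create();

        dataMap.put("b", 2);
        dataMap.put("a", 3);
        dataMap.put("a", 1); // no need to check whether a set already exists for "a"

        System.out.println(dataMap); // prints {a=[1, 3], b=[2]}
    }
}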
If performance is critical I wouldn't use a TreeSet of Integer; I would find a more lightweight structure like TIntArrayList or something else that wraps int values. I would also use a HashMap, as its lookup is O(1) instead of O(log N). If you also need to keep the keys sorted, I would use a second collection for that.
I agree that putIfAbsent on ConcurrentHashMap is overkill and that get/put on a HashMap is likely to be the fastest option.
ConcurrentSkipListMap might be a good option for using putIfAbsent, but I would make sure it's not slower.
BTW, even worse than doing a get/put is creating a HashSet you don't need.
putIfAbsent has the benefit of concurrency; that is, if many threads call it at the same time, they don't have to wait (it doesn't use synchronized internally). However, this comes at a minor cost in execution speed, so if you work single-threaded only, this will slow things down.
If you need this sorted, try the ConcurrentSkipListMap.
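A minimal sketch of the putIfAbsent pattern on a ConcurrentSkipListMap, which keeps the keys sorted; dataMap mirrors the map from the question, and in truly concurrent use the value sets would need to be thread-safe as well:

import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListMap;

public class SkipListSketch {
    private final ConcurrentSkipListMap<String, Set<Integer>> dataMap = new ConcurrentSkipListMap<>();

    void addToMap(String key, Integer value) {
        Set<Integer> vs = dataMap.get(key);
        if (vs == null) {
            Set<Integer> fresh = new TreeSet<>();
            // putIfAbsent returns the set that was already mapped, or null if `fresh` was inserted
            vs = dataMap.putIfAbsent(key, fresh);
            if (vs == null) {
                vs = fresh;
            }
        }
        vs.add(value);
    }
}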

java concurrent map sorted by value

I'm looking for a way to have a concurrent map or similar key->value storage that can be sorted by value and not by key.
So far I was looking at ConcurrentSkipListMap but I couldn't find a way to sort it by value (using Comparator), since compare method receives only the keys as parameters.
The map has keys as String and values as Integer. What I'm looking is a way to retrieve the key with the smallest value(integer).
I was also thinking about using two maps, creating a second map with Integer keys and String values so that I would have a map sorted by integer as I wanted; however, more than one key can have the same integer value, which could lead to more problems.
Example
"user1"=>3
"user2"=>1
"user3"=>3
sorted list:
"user2"=>1
"user1"=>3
"user3"=>3
Is there a way to do this, or are there any 3rd-party libraries that can do this?
Thanks
To sort by value when several keys can share the same value, you need a MultiMap keyed by the value. This needs to be synchronized, as there is no concurrent version.
This doesn't mean the performance will be poor, as that depends on how often you access this data structure; e.g., it could add up to a microsecond per call.
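A minimal sketch of that idea, assuming Guava's TreeMultimap with external synchronization; the class and method names are made up for illustration, and updating an existing user's value would still need the old entry removed first:

import com.google.common.collect.TreeMultimap;

public class ScoreIndex {
    // Keyed by the integer value, so natural ordering puts the smallest value first;
    // several users may share a value because each key maps to a sorted set of users.
    private final TreeMultimap<Integer, String> byValue = TreeMultimap.create();

    public synchronized void put(String user, int value) {
        byValue.put(value, user);
    }

    public synchronized String userWithSmallestValue() {
        Integer smallest = byValue.keySet().first(); // e.g. 1 for the example data
        return byValue.get(smallest).first();        // e.g. "user2"
    }
}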
I recently had to do this and ended up using a ConcurrentSkipListMap where the keys contain a string and an integer. I ended up using the answer proposed below. The core insight is that you can structure your code to allow for a duplicate of a key with a different value before removing the previous one.
Atomic way to reorder keys in a ConcurrentSkipListMap / ConcurrentSkipListSet?
The problem was to keep a dynamic set of strings which were associated with integers that could change concurrently from different threads, described below. It sounds very similar to what you wanted to do.
Is there an embeddable Java alternative to Redis?
Here's the code for my implementation:
https://github.com/HarvardEconCS/TurkServer/blob/master/turkserver/src/main/java/edu/harvard/econcs/turkserver/util/UserItemMatcher.java
The principle of a ConcurrentMap is that it can be accessed concurrently - if you want it sorted at any time, performance will suffer significantly as that map would need to be fully synchronized (like a hashtable), resulting in poor throughput.
So I think your best bet is to return a sorted view of your map by putting all elements in an unmodifiable TreeMap for example (although sorting a TreeMap by values needs a bit of tweaking).
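A minimal sketch of that sorted-view idea (a snapshot, not a live view); the "tweaking" here is a comparator that looks up values in the source map and breaks ties by key so entries with equal values are not dropped:

import java.util.Collections;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class SortedViewSketch {
    static SortedMap<String, Integer> sortedByValue(Map<String, Integer> source) {
        TreeMap<String, Integer> view = new TreeMap<>((a, b) -> {
            int cmp = Integer.compare(source.get(a), source.get(b));
            return cmp != 0 ? cmp : a.compareTo(b); // tie-break by key so equal values survive
        });
        view.putAll(source); // snapshot of the current entries
        return Collections.unmodifiableSortedMap(view);
    }

    public static void main(String[] args) {
        Map<String, Integer> scores = new ConcurrentHashMap<>();
        scores.put("user1", 3);
        scores.put("user2", 1);
        scores.put("user3", 3);
        System.out.println(sortedByValue(scores).firstKey()); // prints user2
    }
}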
