Spark RDD - map vs mapPartitions - Java

I have read through the theoretical differences between map and mapPartitions, and I am fairly clear on when to use each in various situations.
But my problem, described below, is more about GC activity and memory (RAM). Please read on:
=> I wrote a map function to convert Row to String, so an input RDD[org.apache.spark.sql.Row] is mapped to RDD[String]. With this approach, an object is created for every row of the RDD, and creating such a large number of objects may increase GC activity.
=> To avoid this, I thought of using mapPartitions, so that the number of objects becomes equivalent to the number of partitions. mapPartitions takes an Iterator as input and expects a java.lang.Iterable in return. But most Iterables, such as Array and List, are held in memory. So, if I have a huge amount of data, could building an Iterable this way lead to running out of memory? Is there another collection (Java or Scala) that should be used here (one that spills to disk if memory starts to fill)? Or should mapPartitions only be used when the RDD fits completely in memory?
Thanks in advance. Any help would be greatly appreciated.

If you look at JavaRDD.mapPartitions, it takes a FlatMapFunction (or some variant like DoubleFlatMapFunction), which is expected to return an Iterator, not an Iterable. If the underlying collection is lazy, then you have nothing to worry about.
RDD.mapPartitions takes a function from Iterator to Iterator.
In general, if you use reference data, you can replace mapPartitions with map and use a static member to store the data. This will have the same footprint and will be easier to write.
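For illustration, here is a minimal Java sketch of that last suggestion; the RowToString name, the loadRefData() helper, and the first-column lookup are hypothetical stand-ins for whatever reference data and per-row logic you actually use:

import java.util.Collections;
import java.util.Map;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;

public class RowToString implements Function<Row, String> {

    // Reference data loaded once per executor JVM (static initializer),
    // not once per row and not once per partition.
    private static final Map<Integer, String> REF = loadRefData();

    private static Map<Integer, String> loadRefData() {
        // Placeholder: read your lookup table from a file, JDBC, etc.
        return Collections.emptyMap();
    }

    @Override
    public String call(Row row) {
        // Hypothetical logic: prefix each row with a label looked up by its first column.
        String label = REF.getOrDefault(row.getInt(0), "unknown");
        return label + ":" + row.mkString(",");
    }
}

// usage: JavaRDD<String> strings = rowRdd.map(new RowToString());

Because the lookup table lives in a static field, it is built once per executor JVM rather than once per record, so each map call only creates the output String.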

To answer your question about mapPartitions(f: Iterator => Iterator): it is lazy and does not hold the whole partition in memory. Spark uses this Iterator => Iterator function (we can consider it a functor in FP terms) and compiles it into its own code to execute. If a partition is too big, it will spill to disk before the next shuffle point, so don't worry about it.
One thing worth mentioning is that you can force your function to materialize data in memory, simply by doing:
rdd.mapPartitions(
  partitionIter => {
    partitionIter.map(row => yourLogic(row)).toList.toIterator  // yourLogic is your per-row transformation
  }
)
toList forces Spark to materialize the whole partition's data in memory, so watch out for this: operations like toList break the laziness of the function chain.

Related

Java Stream API - is it possible to create Collector that is collecting result to a new stream?

So, the question is pretty self-explanatory: is there a way to create a Collector that collects the stream passed to it into a new stream?
I am aware that it can be done with some tricks like:
Collectors.collectingAndThen(Collectors.toList(), List::stream);
But this code allocates redundant memory.
An explanation of why I need this in the first place: sometimes I want to pass something to Collectors.groupingBy and then perform a stream operation on the downstream, without collecting it an additional time.
Is there a simple way to do it (without writing my own class implementing the collector interface)?
EDIT: This question has been marked as a duplicate of this question, but that is not what I'm looking for. I do not want to duplicate the original stream; I want to close it and produce a new stream consisting of the same elements in the same order, by means of a Collector and without allocating memory in between.
The collect method will accept a collector.
The parts of a collector are the following:
Supplier - For a stream, it could be Stream.empty(), for example.
Accumulator - For streams, we can use Stream.concat().
Combiner - This takes a BinaryOperator; you'll need to work on the first element, because it will be the reference. For lists, for example, you get l1 and l2 and usually do l1.addAll(l2).
So, beyond being quite complicated, the overhead in terms of memory allocation will be bigger than just collecting everything into a collection first.
Everything is possible; however, you'll probably need to write your own collector to suit your needs.
You can even rewrite the same exact collector as groupingBy, but is it really worth it?
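That said, if you really want a Collector that finishes as a Stream, one possible sketch (my own, not from the answer above; the toStream name is made up) builds on Stream.Builder. Note that the builder still buffers every element, which is exactly the allocation overhead discussed above:

import java.util.stream.Collector;
import java.util.stream.Stream;

public class StreamCollector {

    // A Collector whose mutable container is a Stream.Builder and whose
    // finisher turns that builder into the new Stream.
    public static <T> Collector<T, Stream.Builder<T>, Stream<T>> toStream() {
        return Collector.of(
                Stream::builder,             // supplier: a fresh, empty builder
                Stream.Builder::accept,      // accumulator: add one element to the builder
                (left, right) -> {           // combiner: drain the right builder into the left one
                    right.build().forEach(left);
                    return left;
                },
                Stream.Builder::build);      // finisher: produce the resulting Stream
    }

    public static void main(String[] args) {
        Stream.of("a", "bb", "ccc")
              .collect(toStream())
              .map(String::length)
              .forEach(System.out::println); // prints 1, 2, 3
    }
}

This keeps the pipeline in one expression, but it does not avoid intermediate storage of the elements.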

Memory allocation: primitive Stream vs primitive array

My problem: I'm downloading a huge amount of data from the database for JFreeChart. I want to optimize memory usage without using a primitive array.
Working with collections requires using objects.
I'm wondering if I can optimize memory usage by using a primitive stream such as IntStream instead of, for example, LinkedList<Integer>.
I don't know how to make a reliable benchmark.
If your starting point is a LinkedList<Integer>, just replacing it with ArrayList<Integer> will reduce the memory consumption significantly, as explained in When to use LinkedList over ArrayList?. Since boxed integers are small objects and some of them will even be reused when boxing the same value, the graphs of this answer have significance.
If you want to have it simpler, just use int[]. If you need something you can fill incrementally, the Stream API has indeed an option. Use IntStream.builder() to get an IntStream.Builder to which you can repeatedly add new int values. Once it contains all values, you can call build().toArray() on it to get an int[] array containing all values or you can directly perform filtering and aggregate operations on the IntStream returned by build() (if you can express the aggregate operation as reduction).
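A rough sketch of that incremental approach (the loop below is just a stand-in for reading values from your ResultSet):

import java.util.stream.IntStream;

public class PrimitiveBuffer {
    public static void main(String[] args) {
        IntStream.Builder builder = IntStream.builder();

        // Stand-in for looping over a JDBC ResultSet and calling builder.add(rs.getInt(...)).
        for (int i = 0; i < 1_000; i++) {
            builder.add(i);
        }

        // Materialize once as a primitive array: no per-element boxing.
        int[] values = builder.build().toArray();

        // Aggregations can also run on an IntStream without boxing.
        double mean = IntStream.of(values).average().orElse(Double.NaN);
        System.out.println(values.length + " values, mean = " + mean);
    }
}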

Apache Spark: Effectively using mapPartitions in Java

In the currently early-release textbook titled High Performance Spark, the developers of Spark note that:
To allow Spark the flexibility to spill some records
to disk, it is important to represent your functions inside of mapPartitions in such a
way that your functions don’t force loading the entire partition in-memory (e.g.
implicitly converting to a list). Iterators have many methods we can write functional style
transformations on, or you can construct your own custom iterator. When a
transformation directly takes and returns an iterator without forcing it through
another collection, we call these iterator-to-iterator transformations.
However, the textbook lacks good examples using mapPartitions or similar variations of the method, and there are few good code examples online, most of which are in Scala. For example, we see this Scala code using mapPartitions written by zero323 on How to add columns into org.apache.spark.sql.Row inside of mapPartitions:
def transformRows(iter: Iterator[Row]): Iterator[Row] = iter.map(transformRow)
sqlContext.createDataFrame(df.rdd.mapPartitions(transformRows), newSchema).show
Unfortunately, Java doesn't provide anything as nice as iter.map(...) for iterators. So the question is: how can one effectively use iterator-to-iterator transformations with mapPartitions without materializing the entire partition as a list?
JavaRDD<OutObj> collection = prevCollection.mapPartitions((Iterator<InObj> iter) -> {
    ArrayList<OutObj> out = new ArrayList<>();
    while (iter.hasNext()) {
        InObj current = iter.next();
        out.add(someChange(current));
    }
    return out.iterator();
});
This seems to be the general syntax for using mapPartitions in Java examples, but I don't see how this would be the most efficient, supposing you have a JavaRDD with tens of thousands of records (or even more, since Spark is for big data). You'd eventually end up with a list of all the objects from the iterator, just to turn it back into an iterator (which suggests that a map function of some sort would be much more efficient here).
Note: while these 8 lines of code using mapPartitions could be written as 1 line with a map or flatMap, I'm intentionally using mapPartitions to take advantage of the fact that it operates over each partition rather than each element in the RDD.
Any ideas, please?
One way to prevent forcing the "materialization" of the entire partition is by converting the Iterator into a Stream, and then using Stream's functional API (e.g. map function).
How to convert an iterator to a stream? suggests a few good ways to convert an Iterator into a Stream, so taking one of the options suggested there we can end up with:
rdd.mapPartitions((Iterator<InObj> iter) -> {
    Iterable<InObj> iterable = () -> iter;
    return StreamSupport.stream(iterable.spliterator(), false)
            .map(s -> transformRow(s)) // or whatever transformation
            .iterator();
});
This should be an "Iterator-to-Iterator" transformation, because all the intermediate APIs used (Iterable, Stream) are lazily evaluated.
EDIT: I haven't tested it myself, but the OP commented, and I quote, that "there is no efficiency increase by using a Stream over a list". I don't know why that is, and I don't know if that would be true in general, but worth mentioning.

mapPartitions Vs foreach plus accumulator approach

There are some cases in which I can obtain the same results by using the mapPartitions or the foreach method.
For example, in a typical MapReduce approach one would perform a reduceByKey immediately after a mapPartitions that transforms the original RDD into a collection of (key, value) tuples. I think it is possible to achieve the same result by using, for instance, an array of accumulators where each executor sums a value at a given index and the index itself acts as the key.
Since reduceByKey will perform a shuffle to disk, I think that, when possible, the foreach approach should be better, even though foreach has the side effect of adding a value to an accumulator.
I am asking this to check whether my reasoning is correct. I hope I was clear.
Don't use accumulators for this. They are less reliable. (For example, they can be double-counted if speculative execution is enabled.)
But the approach you describe has its merits.
With reduceByKey there is a shuffle. The upside is that it can handle more keys than would fit on a single machine.
With the foreach + accumulator method you avoid the shuffle. But now you cannot handle more keys than can fit on one machine. You also have to know the keys in advance so that you can create the accumulators. The code becomes a mess too.
If you have a low number of keys, then the reduceByKeyLocally method is what you need. It's basically the same as your accumulator trick, except it doesn't use accumulators, you don't have to know the keys in advance, and it's a drop-in replacement for reduceByKey.
reduceByKeyLocally creates a hashmap for each partition, sends the hashmaps to the driver and merges them there.
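For illustration, a minimal, self-contained Java sketch of reduceByKeyLocally (the local-mode setup and the sample data are just for demonstration):

import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class ReduceByKeyLocallyDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("reduceByKeyLocally-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("a", 1),
                    new Tuple2<>("b", 2),
                    new Tuple2<>("a", 3)));

            // A hash map is built per partition and the maps are merged on the driver:
            // no shuffle, but the whole key set must fit in driver memory.
            Map<String, Integer> sums = pairs.reduceByKeyLocally(Integer::sum);
            System.out.println(sums); // {a=4, b=2}
        }
    }
}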

Java 8 forEach use cases

Let's say you have a collection with some strings and you want to return the first two characters of each string (or some other manipulation...).
In Java 8, for this case, you can use either the map or the forEach method on the stream() you get from the collection (maybe something else, but that is not important right now).
Personally, I would use map, primarily because I associate forEach with mutating the collection and I want to avoid this. I also created a really small performance test but could not see any improvement when using forEach (I perfectly understand that small tests cannot give reliable results, but still).
So what are the use-cases where one should choose forEach?
map is the better choice for this, because you're not trying to do anything with the strings yet, just map them to different strings.
forEach is designed to be the terminal operation. As such, it doesn't return anything and is all about mutating some state, though not necessarily that of the original collection. For instance, you might use it to write elements to a file, having used other constructs (including map) to get those elements.
forEach terminates the stream and is executed for the side effect of the supplied Consumer. It does not necessarily mutate the stream members.
map maps each stream element to a different value/object using a provided Function. A Stream<R> is returned, on which further operations can act.
The forEach terminal operation might be useful in several cases: when you want to collect into some older class for which you don't have a proper collector, or when you don't want to collect at all but send your data somewhere else (write it into a database, print it to an OutputStream, etc.). There are many cases where the best approach is to use both map (as an intermediate operation) and forEach (as the terminal operation).
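Taking the "first two characters" example from the question, a typical combination of the two looks like this (a small sketch, not from the answers above):

import java.util.Arrays;
import java.util.List;

public class MapThenForEach {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("stream", "map", "forEach");

        words.stream()
             .map(s -> s.substring(0, Math.min(2, s.length()))) // intermediate: transform each string
             .forEach(System.out::println);                     // terminal: side effect only, returns nothing
    }
}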
