Learning Apache Spark (Java API) and trying to understand what code executes on the Driver versus what code executes on the remote Spark Cluster.
Given the following code snippet from this Baeldung article:
private static final Pattern SPACE = Pattern.compile(" "); // defined in the article's full class, used by flatMap below

public static void main(String[] args) throws Exception {
if (args.length < 1) {
System.err.println("Usage: JavaWordCount <file>");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = ctx.textFile(args[0], 1);
JavaRDD<String> words
= lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
JavaPairRDD<String, Integer> ones
= words.mapToPair(word -> new Tuple2<>(word, 1));
JavaPairRDD<String, Integer> counts
= ones.reduceByKey((Integer i1, Integer i2) -> i1 + i2);
List<Tuple2<String, Integer>> output = counts.collect();
for (Tuple2<?, ?> tuple : output) {
System.out.println(tuple._1() + ": " + tuple._2());
}
ctx.stop();
}
Is everything between the ctx instantiation and ctx.stop() somehow ported over from the Driver's JVM to the Spark cluster, or just the JavaRDD operations?
In general, the lines of Spark code that you write are not executed line-by-line by Spark.
Before actually executing this code, Spark turns it into an execution plan. For the DataFrame/Dataset API this is done by the Catalyst optimizer (plain RDD code like the snippet above is only compiled into a DAG of stages), and Catalyst also makes a bunch of optimizations, for example:
predicate pushdown of filters
rewriting joins more efficiently
adding necessary filters for operations not to fail
removing unnecessary operations (for example removing a sort operation if the maximum number of rows is 1)
...
This means that you can't exactly reason about code in the same way as with classically compiled/interpreted code: your code will get rewritten/optimized by the Catalyst optimizer.
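To make this visible (a small illustrative sketch, not from the question; the SparkSession, file path and column names are placeholders), you can ask Spark to print the plans it will actually run. With the DataFrame/Dataset API, explain(true) shows the parsed, analyzed, optimized and physical plans, including things like pushed-down filters:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// assumes an existing SparkSession called `spark`
Dataset<Row> people = spark.read().parquet("/path/to/people.parquet");
people.select("name", "age")
      .filter(functions.col("age").gt(21))
      .explain(true); // prints all four plans; the optimized plan reflects Catalyst's rewrites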
But still, it can help to think about your lines of code in the way that your question is asked. So I'll give you a very rough answer. In general:
Driver:
Keeps track of the execution of your whole program.
Non-distributed work happens here: creating your Spark context, keeping track of for loops, ...
Will send over tasks to each executor for distributed operations
Executor:
Distributed operations happen here: any kind of Spark transformation on RDDs/DataFrames/Datasets such as map, flatMap, filter, join, ... The driver sends tasks to each executor to perform these operations (so in a sense there is some activity on the driver even for these operations).
Both:
Spark actions. These are operations where you ask for some result of a calculation on distributed datasets (RDDs, Dataframes, Datasets) to go back to the driver. For example: collect, count, take, show, ... In your case, the result of collect ends up back on the driver.
So, you could say (but it would not be entirely correct, remember that a bunch of optimizations happen) that:
All your code until ctx instantiation happens on the driver.
Then, all the RDD operations happen on the driver/executors (the driver sends tasks to the executors and the executors do the work).
Then, when you do your collect, the executors send over all of the data to the driver.
From then on, the rest happens on the driver.
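As a rough annotated version of the snippet from the question (the where-it-runs comments are approximations, with the caveats above in mind):

SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount"); // driver
JavaSparkContext ctx = new JavaSparkContext(sparkConf);            // driver

// The driver only builds the lineage/plan here; the lambdas passed to
// flatMap, mapToPair and reduceByKey are serialized and later executed on the executors.
JavaRDD<String> lines = ctx.textFile(args[0], 1);
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
JavaPairRDD<String, Integer> ones = words.mapToPair(word -> new Tuple2<>(word, 1));
JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

// Action: the executors do the work and the results are shipped back to the driver.
List<Tuple2<String, Integer>> output = counts.collect();

// Plain Java again: this loop runs on the driver.
for (Tuple2<?, ?> tuple : output) {
    System.out.println(tuple._1() + ": " + tuple._2());
}
ctx.stop(); // driver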
Related
I am writing a command-line application in Java 8. There's a part that involves some computation, and I believe it could benefit from running in parallel using multiple threads. However, I don't have much experience writing multi-threaded applications, so I hope you can steer me in the right direction on how to design the parallel part of my code.
For simplicity, let's pretend the method in question receives a relatively big array of longs, and it should return a Set containing only prime numbers:
public final static boolean checkIfNumberIsPrime(long number) {
// algorithm implementation, not important here
// ...
}
// a single-threaded version
public Set<Long> extractPrimeNumbers(long[] inputArray) {
Set<Long> result = new HashSet<>();
for (long number : inputArray) {
if (checkIfNumberIsPrime(number)) {
result.add(number);
}
}
return result;
}
Now, I would like to refactor method extractPrimeNumbers() in such way that it would be executed by four threads in parallel, and when all of them are finished, return the result. Off the top of my head, I have the following questions:
Which approach would be more suitable for the task: ExecutorService or Fork/Join? (each element of inputArray[] is completely independent and they can be processed in any order whatsoever)
Assuming there are 1 million elements in inputArray[], should I "ask" thread #1 to process all indexes 0..249999, thread #2 - 250000..499999, thread #3 - 500000..749999 and thread #4 - 750000..999999? Or should I rather treat each element of inputArray[] as a separate task to be queued and then executed by an applicable worker thread?
If a prime number is detected, it should be added to the result Set, which therefore needs to be thread-safe (synchronized). So perhaps it would be better if each thread maintained its own local result set and, only when it is finished, transferred its contents to the global result in one go?
Is Spliterator of any use here? Should it be used to partition inputArray[] somehow?
Parallel stream
Use none of these. Parallel streams are going to be enough to deal with this problem much more straightforwardly than any of the alternatives you list.
// stream over the array in parallel, keep only the primes, box and collect
return Arrays.stream(inputArray)
        .parallel()
        .filter(n -> checkIfNumberIsPrime(n))
        .boxed()
        .collect(Collectors.toSet());
For more info, see The Java™ Tutorials > Aggregate Operations > Parallelism.
Introduction
I'm currently developing a program in which I use java.util.Collection.parallelStream(), and I am wondering if it's possible to make it more multi-threaded.
Several small maps
I was wondering if using multiple map calls might allow java.util.Collection.parallelStream() to distribute the tasks better:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(gson::toJson)
.map(Document::parse)
.map(InsertOneModel::new)
.toList();
Single big map
That is, whether it gives a better distribution than this single big map:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(puzzle -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle))))
.toList();
Question
Is one of the two solutions more suitable for java.util.Collection.parallelStream(), or is there no big difference between them?
I looked into the Stream source code. The result of a map operation is just fed into the next operation, so there is almost no difference between one big map() call and several small map() calls.
And for the map() operation, a parallel stream makes no difference at all: each input object will be processed all the way through by the same thread in any case.
Also note: A parallel Stream only splits up the work if the operation chain allows it and there is enough data to process. So for a small Collection or a Collection that allows no random access, a parallel Stream behaves like a sequential Stream.
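A quick way to see this (a small illustrative snippet, not from the question) is to log the thread name inside each map step; for any given element, both steps run back to back on the same thread:

import java.util.List;

public class MapFusionDemo {
    public static void main(String[] args) {
        List.of("a", "b", "c", "d").parallelStream()
                .map(s -> { System.out.println(Thread.currentThread().getName() + " step1 " + s); return s; })
                .map(s -> { System.out.println(Thread.currentThread().getName() + " step2 " + s); return s; })
                .forEach(s -> { });
    }
}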
I don't think it will do any better if you chain it with multiple maps. In case your code is not very complex I would prefer to use a single big map.
To understand this, we have to check the code inside the map function (it lives in java.util.stream.ReferencePipeline in the JDK sources):
public final <R> Stream<R> map(Function<? super P_OUT, ? extends R> mapper) {
    Objects.requireNonNull(mapper);
    return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
                                     StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
        @Override
        Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
            return new Sink.ChainedReference<P_OUT, R>(sink) {
                @Override
                public void accept(P_OUT u) {
                    downstream.accept(mapper.apply(u));
                }
            };
        }
    };
}
As you can see, quite a lot happens behind the scenes: multiple objects are created and multiple methods are called, and this is repeated for every chained map call.
Now, coming back to parallel streams, they work on the concept of parallelism.
Streams Documentation
A parallel stream is a stream that splits its elements into multiple chunks, processing each chunk with a different thread. Thus, you can automatically partition the workload of a given operation on all the cores of your multicore processor and keep all of them equally busy.
Parallel streams internally use the default ForkJoinPool, which by default has as many threads as you have processors, as returned by Runtime.getRuntime().availableProcessors(). But you can change the size of this pool using the system property java.util.concurrent.ForkJoinPool.common.parallelism.
ParallelStream calls spliterator() on the collection object which returns a Spliterator implementation that provides the logic of splitting a task. Every source or collection has their own spliterator implementations. Using these spliterators, parallel stream splits the task as long as possible and finally when the task becomes too small it executes it sequentially and merges partial results from all the sub tasks.
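For example, the java.util.concurrent.ForkJoinPool.common.parallelism property mentioned above can be changed, provided it is set before the common pool is first used (e.g. via -D on the command line or at the very start of main); a minimal sketch:

// must run before anything touches the common ForkJoinPool
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");

// later, parallel streams (and other users of the common pool) will see that parallelism
System.out.println(java.util.concurrent.ForkJoinPool.commonPool().getParallelism()); // prints 4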
So I would prefer parallelStream when:
I have a huge amount of data to process at a time
I have multiple cores to process the data
there are performance issues with the existing implementation
I don't already have other multi-threaded processing running, as that would add to the complexity
Performance Implications
Overhead: when the dataset is small, converting a sequential stream into a parallel one can result in worse performance; the overhead of managing threads, sources and results is then more expensive than doing the actual work.
Splitting: arrays can split cheaply and evenly, while a LinkedList has none of these properties. TreeMap and HashSet split better than LinkedList but not as well as arrays.
Merging: the merge operation is really cheap for some operations, such as reduction and addition, but merge operations like grouping into sets or maps can be quite expensive.
Conclusion: A large amount of data and many computations done per element indicate that parallelism could be a good option.
The three steps (toJson/parse/new) have to be executed sequentially, so all you're effectively doing is comparing s.map(g.compose(f)) and s.map(f).map(g). By virtue of being a monad, Java Streams are functors, and the 2nd functor law states that, in essence, s.map(g.compose(f)) == s.map(f).map(g), meaning that the two alternative ways of expressing the computation will produce identical results. From a performance standpoint the difference between the two is likely to be minimal.
However, in general you should be careful using Collection.parallelStream. It uses the common ForkJoinPool, essentially a fixed pool of threads shared across the entire JVM, whose size is determined by the number of cores on the host. The problem with using the common pool is that other threads in the same process may also be using it at the same time as your code. This can lead to your code randomly and inexplicably slowing down, for example if another part of the code has temporarily exhausted the common thread pool.
It is preferable to create your own ExecutorService using one of the factory methods on Executors, and then submit your tasks to that.
private static final ExecutorService EX_SVC = Executors.newFixedThreadPool(16);
public static List<InsertOneModel<Document>> process(Stream<Puzzle> puzzles) throws InterruptedException {
final Collection<Callable<InsertOneModel<Document>>> callables =
puzzles.map(puzzle ->
(Callable<InsertOneModel<Document>>)
() -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle)))
).collect(Collectors.toList());
return EX_SVC.invokeAll(callables).stream()
.map(fut -> {
try {
return fut.get();
} catch (ExecutionException|InterruptedException ex) {
throw new RuntimeException(ex);
}
}).collect(Collectors.toList());
}
I doubt that there is much difference in performance, but even if you proved one style was quicker I would still prefer to see and use the first style in code I had to maintain.
The first multi-map style is easier for others to understand, it is easier to maintain and easier to debug - for example adding peek stages for any stage of the processing chain.
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(gson::toJson)
// easy to make changes for debug, moving peek up/down
// .peek(System.out::println)
.map(Document::parse)
// easy to filter:
// .filter(this::somecondition)
.map(InsertOneModel::new)
.toList();
If your requirements change, such as needing to filter the output or capture the intermediate data by splitting into two collections, the first approach beats the second every time.
I have a method
public boolean contains(int valueToFind, List<Integer> list) {
//
}
How can I split the list into x chunks and have a new thread search each chunk for the value? As soon as one chunk's search returns true, I would like to stop the other threads from searching.
I see there are lots of examples for simply splitting work between threads, but how do I structure it so that once one thread returns true, all threads stop and that result is returned as the answer?
I do not want to use parallel streams for this reason (from source):
If you do, please look at the previous example again. There is a big
error. Do you see it? The problem is that all parallel streams use
common fork-join thread pool, and if you submit a long-running task,
you effectively block all threads in the pool. Consequently, you block
all other tasks that are using parallel streams. Imagine a servlet
environment, when one request calls getStockInfo() and another one
countPrimes(). One will block the other one even though each of them
requires different resources. What's worse, you can not specify thread
pool for parallel streams; the whole class loader has to use the same
one.
You could use the built-in Stream API:
//For a List
public boolean contains(int valueToFind, List<Integer> list) {
return list.parallelStream().anyMatch(Integer.valueOf(valueToFind)::equals);
}
//For an array
public boolean contains(int valueToFind, int[] arr){
return Arrays.stream(arr).parallel().anyMatch(x -> x == valueToFind);
}
Executing Streams in Parallel:
You can execute streams in serial or in parallel. When a stream executes in parallel, the Java runtime partitions the stream into multiple substreams. Aggregate operations iterate over and process these substreams in parallel and then combine the results.
When you create a stream, it is always a serial stream unless otherwise specified. To create a parallel stream, invoke the operation Collection.parallelStream.
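If the objection is specifically about hogging the common pool (the concern quoted in the question), one commonly used workaround is to run the terminal operation from inside your own ForkJoinPool, so the parallel work is executed by that pool instead. Note that this relies on an implementation detail of the stream machinery rather than a documented guarantee; a sketch:

import java.util.List;
import java.util.concurrent.ForkJoinPool;

public class ContainsSearch {
    public boolean contains(int valueToFind, List<Integer> list) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(4); // dedicated pool; size chosen only for illustration
        try {
            // anyMatch short-circuits, so remaining work is abandoned once a match is found
            return pool.submit(
                () -> list.parallelStream().anyMatch(Integer.valueOf(valueToFind)::equals)
            ).get();
        } finally {
            pool.shutdown();
        }
    }
}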
In the following code, it appears that the functions fn1 and fn2 are applied to inRDD sequentially, as I can see in the Stages section of the Spark Web UI.
DstreamRDD1.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> inRDD) {
        inRDD.foreach(fn1);
        inRDD.foreach(fn2);
    }
});
How is it different when the streaming job is run this way? Are the functions below run in parallel on the input DStream?
DStreamRDD1.foreachRDD(fn1);
DStreamRDD2.foreachRDD(fn2);
Both foreach on an RDD and foreachRDD on a DStream will run sequentially because they are output operations (actions), meaning they cause the materialization of the graph. This would not be the case for a general lazy transformation in Spark, which can run in parallel when the execution graph diverges into multiple separate stages.
For example:
val dStream: DStream[String] = ???
val first = dStream.filter(x => x.contains("h"))
val second = dStream.filter(x => !x.contains("h"))
first.print()
second.print()
The first part (building the two filtered streams) need not execute sequentially when you have sufficient cluster resources to run the underlying stages in parallel. Calling print, which again is an output operation, will then cause the outputs to be printed one after the other.
I have an ExecutorService that I use to multithread the processing of some text, and the unit of work to be applied to a given chunk of text is defined in my ParserCallable, which returns a PayLoad.
So I have a
List<Future<PayLoad>> list;
which is populated by handing off chunks of work to each ParserCallable.
I want to debug a ParserCallable in this multithreaded environment (I don't care which one), but I'm not sure how I can step into the execution of any of the ParserCallables from my code.
The code looks something like this:
List<Future<PayLoad>> list = new ArrayList<Future<PayLoad>>();
ExecutorService executor = Executors.newFixedThreadPool(25);
for (int i = 0; i < blocks_of_work.size(); i++) {
Callable<PayLoad> worker = new ParserCallable(blocks_of_work.get(i));
Future<PayLoad> submit = executor.submit(worker);
list.add(submit);
}
How can I debug any given ParserCallable in order to troubleshoot some errors I'm getting? Based on my code, I'm not sure how to step into one of these callables.
You need to put a breakpoint in the callable itself.
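For example, assuming the ParserCallable looks roughly like the sketch below (the question doesn't show its internals, so the field and parse step are hypothetical), you would set the breakpoint on the first line of call(); the debugger then suspends whichever pool thread happens to run that callable:

import java.util.concurrent.Callable;

public class ParserCallable implements Callable<PayLoad> {

    private final String block;

    public ParserCallable(String block) {
        this.block = block;
    }

    @Override
    public PayLoad call() throws Exception {
        // put the breakpoint here; it is hit once per submitted callable,
        // on whatever worker thread of the ExecutorService picks it up
        return parse(block);
    }

    private PayLoad parse(String text) {
        // parsing logic, not shown in the question
        return new PayLoad();
    }
}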
From a computer science standpoint it would be possible to enable smooth debugging (after all, why should your language care whether the operation was sync or async?), but I don't believe this is supported in any mainstream language, Java included, yet.