Java 8 parallel stream confusion/issue

I am new to parallel streams and am trying to write a sample program that computes value * 100 (for values 1 to 100) and stores the results in a map.
When executing the code I get a different count on each run.
I may be wrong somewhere, so please can anyone guide me on the proper way to do this.
Code:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.stream.Collectors;

public class Main {
    static int l = 0;

    public static void main(String[] args) throws java.lang.Exception {
        letsGoParallel();
    }

    public static int makeSomeMagic(int data) {
        l++;
        return data * 100;
    }

    public static void letsGoParallel() {
        List<Integer> dataList = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            dataList.add(i);
        }
        Map<Integer, Integer> resultMap = new HashMap<>();
        dataList.parallelStream().map(f -> {
            Integer xx = 0;
            {
                xx = makeSomeMagic(f);
            }
            resultMap.put(f, xx);
            return 0;
        }).collect(Collectors.toList());
        System.out.println("Input Size: " + dataList.size());
        System.out.println("Size: " + resultMap.size());
        System.out.println("Function Called: " + l);
    }
}
Last output:
Input Size: 100
Size: 100
Function Called: 98
The output differs each time I run it.
I want to use parallel streams in my own application, but because of this confusion/issue I can't.
In my application I have 100-200 unique numbers on which the same operation needs to be performed. In short, there is a function which processes each one.

Your accesses to both the HashMap and the l variable are not thread safe, which is why the output is different in each run.
The correct way to do what you are trying to do is to collect the Stream elements into a Map:
Map<Integer, Integer> resultMap =
    dataList.parallelStream()
            .collect(Collectors.toMap(Function.identity(), Main::makeSomeMagic));
(This requires java.util.function.Function to be imported.)
EDIT: The l variable is still updated in a non-thread-safe way with this code, so you'll have to add your own thread safety if the final value of the variable is important to you.
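If the call count matters, here is a minimal self-contained sketch (my addition, not part of the answer) that combines this collector with an AtomicInteger, so that both the map and the counter come out right:

import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ToMapWithCounter {
    // AtomicInteger gives a thread-safe counter without locking the whole method
    static final AtomicInteger callCount = new AtomicInteger();

    public static int makeSomeMagic(int data) {
        callCount.incrementAndGet(); // safe under parallel execution
        return data * 100;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> resultMap = IntStream.rangeClosed(1, 100)
                .boxed()
                .parallel()
                .collect(Collectors.toMap(Function.identity(), ToMapWithCounter::makeSomeMagic));
        System.out.println("Size: " + resultMap.size());           // always 100
        System.out.println("Function Called: " + callCount.get()); // always 100
    }
}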

By putting values into resultMap inside map you're relying on a side effect:
dataList.parallelStream().map(f -> {
    Integer xx = 0;
    {
        xx = makeSomeMagic(f);
    }
    resultMap.put(f, xx);
    return 0;
})
The API states:
Stateless operations, such as filter and map, retain no state from
previously seen element when processing a new element -- each element
can be processed independently of operations on other elements.
The documentation goes on:
Stream pipeline results may be nondeterministic or incorrect if the
behavioral parameters to the stream operations are stateful. A
stateful lambda (or other object implementing the appropriate
functional interface) is one whose result depends on any state which
might change during the execution of the stream pipeline.
It is followed by an example similar to yours, showing:
... if the mapping operation is performed in parallel, the results for
the same input could vary from run to run, due to thread scheduling
differences, whereas, with a stateless lambda expression the results
would always be the same.
That explains your observation that the output differs on each run.
The right approach is shown by @Eran.
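To make the quoted behavior concrete, here is a minimal sketch (my own illustration, not part of the original answer) contrasting a stateful and a stateless mapping lambda:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StatefulLambdaDemo {
    public static void main(String[] args) {
        List<Integer> data = IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());

        // Stateful: the lambda mutates shared state. Unsynchronized updates to
        // the int[] cell can be lost under parallel execution, so the printed
        // count may be below 100 and may vary between runs.
        int[] seen = {0};
        data.parallelStream().map(i -> { seen[0]++; return i * 100; }).collect(Collectors.toList());
        System.out.println("stateful count (varies): " + seen[0]);

        // Stateless: each element is mapped independently, so the result is deterministic.
        List<Integer> result = data.parallelStream().map(i -> i * 100).collect(Collectors.toList());
        System.out.println("stateless size (always 100): " + result.size());
    }
}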

Hopefully this works fine: make makeSomeMagic a synchronized function, use the thread-safe data structure ConcurrentHashMap,
and write the simple statement
dataList.parallelStream().forEach(f -> resultMap.put(f, makeSomeMagic(f)));
The whole code is here:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.concurrent.ConcurrentHashMap; // needed for ConcurrentHashMap
import java.util.stream.Collectors;

public class Main {
    static int l = 0;

    public static void main(String[] args) throws java.lang.Exception {
        letsGoParallel();
    }

    public synchronized static int makeSomeMagic(int data) { // make it synchronized
        l++;
        return data * 100;
    }

    public static void letsGoParallel() {
        List<Integer> dataList = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            dataList.add(i);
        }
        Map<Integer, Integer> resultMap = new ConcurrentHashMap<>(); // use ConcurrentHashMap
        dataList.parallelStream().forEach(f -> resultMap.put(f, makeSomeMagic(f)));
        System.out.println("Input Size: " + dataList.size());
        System.out.println("Size: " + resultMap.size());
        System.out.println("Function Called: " + l);
    }
}
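A note on the design choice: synchronizing makeSomeMagic serializes every call, which gives away much of the parallelism. If only the counter needs protection, an AtomicInteger (my suggestion, not part of the answer above) keeps the map writes parallel:

import java.util.concurrent.atomic.AtomicInteger;

// Assumed alternative: protect only the counter instead of the whole method.
static final AtomicInteger l = new AtomicInteger();

public static int makeSomeMagic(int data) { // no method-level lock needed
    l.incrementAndGet(); // thread-safe increment
    return data * 100;
}

// ...and read the final value with l.get() when printing.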

There is no need to count how many times the method is invoked; the Stream does the looping for you.
Pass your logic (a function) to the Stream, and do not use non-thread-safe variables in multi-threaded code (including parallelStream),
like this:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelStreamClient {
    // static int l = 0; ---> no need to count times

    public static void main(String[] args) throws java.lang.Exception {
        letsGoParallel();
    }

    public static int makeSomeMagic(int data) {
        // l++; -----> this is not a thread-safe way
        return data * 100;
    }

    public static void letsGoParallel() {
        List<Integer> dataList = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            dataList.add(i);
        }
        Map<Integer, Integer> resultMap =
            dataList.parallelStream().collect(Collectors.toMap(i -> i, ParallelStreamClient::makeSomeMagic));
        System.out.println("Input Size: " + dataList.size());
        System.out.println("Size: " + resultMap.size());
        // System.out.println("Function Called: " + l);
    }
}

Related

Spark Dataset Foreach function does not iterate

Context
I want to iterate over a Spark Dataset and update a HashMap for each row.
Here is the code I have:
// At this point, I have a my_dataset variable containing 300 000 rows and 10 columns
// - my_dataset.count() == 300 000
// - my_dataset.columns().length == 10

// Declare my HashMap
HashMap<String, Vector<String>> my_map = new HashMap<String, Vector<String>>();

// Initialize the map
for (String col : my_dataset.columns()) {
    my_map.put(col, new Vector<String>());
}

// Iterate over the dataset and update the map
my_dataset.foreach((ForeachFunction<Row>) row -> {
    for (String col : my_map.keySet()) {
        my_map.get(col).add(row.get(row.fieldIndex(col)).toString());
    }
});
Issue
My issue is that the foreach doesn't iterate at all; the lambda is never executed and I don't know why.
I implemented it as indicated here: How to traverse/iterate a Dataset in Spark Java?
At the end, all the inner Vectors remain empty (as they were initialized) even though the Dataset is not (take a look at the first comments in the given code sample).
I know that the foreach never iterates because I did two tests:
Add an AtomicInteger to count the iterations, incrementing it right at the beginning of the lambda with the incrementAndGet() method. => The counter value remains 0 at the end of the process.
Print a debug message right at the beginning of the lambda. => The message is never displayed.
I'm not used to Java (even less to Java lambdas), so maybe I missed an important point, but I can't find what.
I am probably a little old school, but I never like lambdas too much, as they can get pretty complicated.
Here is a full example of foreach():
package net.jgp.labs.spark.l240_foreach.l000;

import java.io.Serializable;
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ForEachBookApp implements Serializable {
    private static final long serialVersionUID = -4250231621481140775L;

    private final class BookPrinter implements ForeachFunction<Row> {
        private static final long serialVersionUID = -3680381094052442862L;

        @Override
        public void call(Row r) throws Exception {
            System.out.println(r.getString(2) + " can be bought at " + r.getString(4));
        }
    }

    public static void main(String[] args) {
        ForEachBookApp app = new ForEachBookApp();
        app.start();
    }

    private void start() {
        SparkSession spark = SparkSession.builder().appName("For Each Book").master("local").getOrCreate();

        String filename = "data/books.csv";
        Dataset<Row> df = spark.read().format("csv").option("inferSchema", "true")
                .option("header", "true")
                .load(filename);
        df.show();

        df.foreach(new BookPrinter());
    }
}
As you can see, this example reads a CSV file and prints a message from the data. It is fairly simple.
The foreach() instantiates a new class, where the work is done.
df.foreach(new BookPrinter());
The work is done in the call() method of the class:
private final class BookPrinter implements ForeachFunction<Row> {
    @Override
    public void call(Row r) throws Exception {
        ...
    }
}
As you are new to Java, make sure you have the right signatures (for classes and methods) and the right imports.
You can also clone the example from https://github.com/jgperrin/net.jgp.labs.spark/tree/master/src/main/java/net/jgp/labs/spark/l240_foreach/l000. This should help you with foreach().
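As a side note on the original question: foreach runs on Spark's executors, so mutations to a HashMap created on the driver will not be visible back on the driver. A hedged sketch of one driver-side alternative, assuming the 300 000 rows fit in driver memory (identifiers as in the question):

// Sketch only: collectAsList() pulls every row to the driver, where the
// map update is an ordinary local mutation.
HashMap<String, Vector<String>> my_map = new HashMap<>();
for (String col : my_dataset.columns()) {
    my_map.put(col, new Vector<String>());
}
for (Row row : my_dataset.collectAsList()) {
    for (String col : my_map.keySet()) {
        my_map.get(col).add(row.get(row.fieldIndex(col)).toString());
    }
}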

Java Static class for saving middle variables

I created a middle class to save many variables in it, for example:
public static class Middle {
    public static List<Student> listStudent = new ArrayList<>();
    public static int level = 1; // e.g. the level of a character in the game
}
And I assign values to those variables:
class A {
    Middle.listStudent = GetData();
    Middle.level++;
    Intent intent = new Intent(A.this, B.class);
    startActivity(intent);
}
And then in the next class (or activity) we use those variables with the new data:
class B {
    ShowResult(Middle.listStudent);
    ShowResult(Middle.level);
}
I am using this approach because I don't want to transfer the data via an Intent.
My question is: can we use this approach heavily throughout the whole application without any issue, and if the middle class is shut down for any reason, does that lose the data?
If some static class gets shut down, it is because some serious error occurred in your application and the JVM had to exit.
In a multithreaded environment, this approach can cause dirty reads and lead to strange behavior.
You can try the code below and see what's going on.
public static void main(String[] args) {
    // create three threads to run it
    for (int i = 0; i < 3; i++) {
        // simulate a multi-threaded environment
        new Thread(() -> {
            for (int j = 0; j < 10; j++) {
                StaticData.listStudent.add(Thread.currentThread().getName() + ":" + j);
                StaticData.level++;
            }
        }).start();
    }
    // show the final result; in a single thread the results must be 30 and 31,
    // but in a multi-threaded environment they may not be
    System.out.println("Total Result listStudent's size is :" + StaticData.listStudent.size());
    System.out.println("Total Result level is :" + StaticData.level);
}

public static class StaticData {
    public static List<String> listStudent = new ArrayList<>();
    public static Integer level = 1;
}

ArrayList calling method

import acm.program.*;
import java.util.*;

public class ReverseArrayList extends ConsoleProgram {
    public void run() {
        println("This program reverses the elements in an ArrayList.");
        println("Use 0 to signal the end of the list.");
        ArrayList<Integer> list = readArrayList();
        reverseArrayList(list);
        printArrayList(list);
    }

    /* Reads the data into the list */
    private ArrayList<Integer> readArrayList() {
        ArrayList<Integer> list = new ArrayList<Integer>();
        while (true) {
            int value = readInt(" ? ");
            if (value == 0) break;
            list.add(value);
        }
        return list;
    }
I don't understand the following code:
ArrayList<Integer> list = readArrayList();
I don't understand why I can't do the following instead:
list.getInput();
Why do I need to assign the method's result to the ArrayList? This confuses me because now I'm unsure which way is needed whenever I want to call a method in Java.
Your code shows that the method getInput() does not take an ArrayList as an argument, but returns one instead. So it is reasonable that
Arrlist = getInput()
is the correct syntax: you are assigning the ArrayList returned from getInput() to Arrlist. On the other hand,
Arrlist.getInput()
refers to a method that must be implemented in the ArrayList class or one of its superclasses, which is not the case here. I would recommend revising OOP concepts.
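To make the distinction concrete, here is a tiny sketch (the names are made up for the illustration):

import java.util.ArrayList;
import java.util.List;

public class AssignVsCall {
    // A method that RETURNS a list: you capture the result with '='.
    static List<Integer> readList() {
        List<Integer> list = new ArrayList<>();
        list.add(42);
        return list;
    }

    public static void main(String[] args) {
        List<Integer> list = readList(); // correct: assign the returned value
        // list.readList();  // would not compile: List has no such method
        System.out.println(list);
    }
}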
One way you might be able to pass it is by using a constructor. I mocked up working code that does the same:
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ArrayListExample {
    ArrayList<Integer> ofNumbers;

    public ArrayListExample() {
        ofNumbers = new ArrayList<>();
        createArray();
    }

    private void createArray() {
        ofNumbers = IntStream.range(0, 10)
                .boxed()
                .collect(Collectors.toCollection(ArrayList::new));
    }

    public ArrayList<Integer> getInput() {
        return ofNumbers;
    }

    public void getArray() {
        ArrayList<Integer> newList = new ArrayList<>(ofNumbers);
        for (Integer num : newList) {
            System.out.println(num);
        }
    }
}
I also agree with Andrew. Keep it up; with a little more practice this will become second nature to you.

Java: Filtering lots of data

I have ~10M rows of data, each containing ~1000 columns (String & Numeric). What I need is to be able to apply simple filters (>, <, RANGE, ==) to this data set as quickly as possible (less than a second to get a 10K slice of this data).
What kind of production-ready technology that can be used from Java exists for this?
Where is your data coming from? This sounds like a task for a database.
Use a SQL database with an index on the fields you're filtering. The index can be based on the numeric value, which will make range and equality queries pretty quick; see the sketch below.
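A hedged sketch of that approach from Java using plain JDBC (the connection URL, table, and column names are assumptions for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RangeQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumes a table 'data_rows' with an indexed numeric column 'price':
        //   CREATE INDEX idx_data_rows_price ON data_rows(price);
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, price FROM data_rows WHERE price BETWEEN ? AND ? LIMIT 10000")) {
            ps.setDouble(1, 10.0);
            ps.setDouble(2, 20.0);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process each matching row; the index makes the range scan fast
                }
            }
        }
    }
}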
If it's not from a database,
you can do it in a few threads and then combine the results in order to improve performance.
Like this, where AMOUNT is the number of elements in your map:
package com.stackoverflow.test;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Test6 {
    private static final int AMOUNT = 10000000;
    private static final int CORES = Runtime.getRuntime().availableProcessors();
    private static final int PART = AMOUNT / CORES;

    private static final class MapFilterTask implements Callable<Map<String, Number>> {
        private Integer fromElement;
        private Integer toElement;
        private Map<String, Number> map;

        private MapFilterTask(Map<String, Number> map, Integer fromElement, Integer toElement) {
            this.map = map;
            this.fromElement = fromElement;
            this.toElement = toElement;
        }

        public Map<String, Number> call() throws Exception {
            Map<String, Number> filtered = new HashMap<String, Number>();
            for (int i = fromElement; i <= toElement; i++) {
                // filter your slice of the map and collect the filtered result here
            }
            return filtered;
        }
    }

    public static void main(String[] args) throws InterruptedException, ExecutionException {
        Map<String, Number> yourMap = new HashMap<String, Number>();
        ExecutorService taskExecutor = Executors.newFixedThreadPool(CORES);
        List<Callable<Map<String, Number>>> tasks = new ArrayList<Callable<Map<String, Number>>>();
        for (int i = 0; i < CORES; i++) {
            tasks.add(new MapFilterTask(yourMap, i * PART, (i + 1) * PART));
        }
        List<Future<Map<String, Number>>> futures = taskExecutor.invokeAll(tasks);
        Map<String, Number> newMap = new HashMap<String, Number>();
        for (Future<Map<String, Number>> future : futures) {
            newMap.putAll(future.get());
        }
    }
}
For me it runs 4 times faster, but only with the VM args -Xms2048M -Xmx2048M.
Without the VM args I got a 1.7x improvement on my laptop with a 4-core processor and Linux Mint OS.
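If the data is already in memory, a simpler alternative (my suggestion, not from the answer above) is a parallel stream, which does the partitioning and merging for you:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class ParallelFilterExample {
    public static void main(String[] args) {
        // 10M sample values; replace with your real data source.
        List<Long> data = LongStream.range(0, 10_000_000L).boxed().collect(Collectors.toList());

        long start = System.nanoTime();
        List<Long> slice = data.parallelStream()
                .filter(v -> v >= 1_000_000L && v < 1_010_000L) // simple range predicate
                .collect(Collectors.toList());
        System.out.printf("matched %d rows in %d ms%n",
                slice.size(), (System.nanoTime() - start) / 1_000_000);
    }
}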

java+spark: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException

I'm new to Spark and was trying to run the example JavaSparkPi.java. It runs well, but because I have to use it in another Java program, I copied everything from main into a method in the class and tried to call the method in main. It says
org.apache.spark.SparkException: Job aborted: Task not serializable:
java.io.NotSerializableException
The code looks like this:
public class JavaSparkPi {

    public void cal() {
        JavaSparkContext jsc = new JavaSparkContext("local", "JavaLogQuery");
        int slices = 2;
        int n = 100000 * slices;

        List<Integer> l = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) {
            l.add(i);
        }

        JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);
        System.out.println("count is: " + dataSet.count());
        dataSet.foreach(new VoidFunction<Integer>() {
            public void call(Integer i) {
                System.out.println(i);
            }
        });

        int count = dataSet.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer integer) throws Exception {
                double x = Math.random() * 2 - 1;
                double y = Math.random() * 2 - 1;
                return (x * x + y * y < 1) ? 1 : 0;
            }
        }).reduce(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer integer, Integer integer2) throws Exception {
                return integer + integer2;
            }
        });

        System.out.println("Pi is roughly " + 4.0 * count / n);
    }

    public static void main(String[] args) throws Exception {
        JavaSparkPi myClass = new JavaSparkPi();
        myClass.cal();
    }
}
Does anyone have an idea about this? Thanks!
The nested functions hold a reference to the containing object (JavaSparkPi), so that object will get serialized. For this to work, it needs to be serializable. That is simple to do:
public class JavaSparkPi implements Serializable {
...
The main problem is that when you create an anonymous class in Java, it is passed a reference to the enclosing class.
This can be fixed in many ways:
Declare the enclosing class Serializable
This works in your case, but it will fall flat if your enclosing class has some field that is not serializable. I would also say that serializing the parent class is a total waste.
Create the closure in a static function
Creating the closure by invoking a static function doesn't pass a reference to the enclosing class into the closure, so there is no need to make the class serializable this way; see the sketch below.
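A hedged sketch of the second option (the factory method name samplePoint is mine):

import org.apache.spark.api.java.function.Function;

public class JavaSparkPi {
    // Because this factory method is static, the lambda it returns captures
    // no reference to a JavaSparkPi instance, so only the lambda itself
    // needs to be serialized.
    private static Function<Integer, Integer> samplePoint() {
        return integer -> {
            double x = Math.random() * 2 - 1;
            double y = Math.random() * 2 - 1;
            return (x * x + y * y < 1) ? 1 : 0;
        };
    }

    // ... then inside cal(): int count = dataSet.map(samplePoint()).reduce(...);
}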
This error occurs because you have multiple physical CPUs on your local machine or cluster, and the Spark engine tries to send this function to multiple CPUs over the network.
Your function
dataSet.foreach(new VoidFunction<Integer>() {
    public void call(Integer i) {
        System.out.println(i);
    }
});
uses println(), which is not serializable, so the exception is thrown by the Spark engine.
The solution is to use the following instead:
// collect() brings the elements to the driver as a java.util.List,
// whose forEach takes a java.util.function.Consumer, so a lambda works:
dataSet.collect().forEach(i -> System.out.println(i));
