What is the danger of side effects in Java 8 Streams?

I'm trying to understand the warnings I found in the documentation on Streams. I've gotten in the habit of using forEach() as a general-purpose iterator, and that's led me to writing this kind of code:
public class FooCache {
    private static Map<Integer, Integer> sortOrderCache = new ConcurrentHashMap<>();
    private static Map<Integer, String> codeNameCache = new ConcurrentHashMap<>();

    public static void populateCache() {
        List<Foo> myThings = getThings();
        myThings.forEach(thing -> {
            sortOrderCache.put(thing.getId(), thing.getSortOrder());
            codeNameCache.put(thing.getId(), thing.getCodeName());
        });
    }
}
This is a trivialized example. I understand that this code violates Oracle's warning against stateful lambdas and side effects, but I don't understand why this warning exists.
When running this code it appears to behave as expected. So how do I break this to demonstrate why it's a bad idea?
In short, I read this:
If executed in parallel, the non-thread-safety of ArrayList would
cause incorrect results, and adding needed synchronization would cause
contention, undermining the benefit of parallelism.
But can anyone add clarity to help me understand the warning?

From the Javadoc:
Note also that attempting to access mutable state from behavioral
parameters presents you with a bad choice with respect to safety and
performance; if you do not synchronize access to that state, you have
a data race and therefore your code is broken, but if you do
synchronize access to that state, you risk having contention undermine
the parallelism you are seeking to benefit from. The best approach is
to avoid stateful behavioral parameters to stream operations entirely;
there is usually a way to restructure the stream pipeline to avoid
statefulness.
The problem here is that if you access mutable state, you lose on two sides:
Safety, because you need synchronization, which the Stream API tries to minimize
Performance, because the required synchronization costs you (in your example, using a ConcurrentHashMap has a cost)
Now, in your example, there are several points here:
If you want a multi-threaded stream, you need to use parallelStream(), as in myThings.parallelStream(); as it stands, the forEach method provided by java.util.Collection is a simple for-each loop.
If you were mutating a plain HashMap held as a static member, that would be a problem: HashMap is not thread-safe, and you would need a ConcurrentHashMap (which your example already uses).
In the lambda, and in the case of a Stream, you must not mutate the source of your stream:
myThings.stream().forEach(thing -> myThings.remove(thing));
This may appear to work (though I suspect it will throw a ConcurrentModificationException), but the following will likely not work:
myThings.parallelStream().forEach(thing -> myThings.remove(thing));
That's because the ArrayList is not thread safe.
If you use a synchronized view (Collections.synchronizedList), then you pay a performance hit because you synchronize on each access.
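For illustration, a minimal sketch (class name is mine) of that trade-off: the synchronized view makes the parallel writes correct, but every thread contends for the same lock:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

public class SyncViewDemo {
    public static void main(String[] args) {
        List<Integer> safeView = Collections.synchronizedList(new ArrayList<>());
        // Correct (always prints 100000), but every add() takes the same
        // monitor, which is exactly the contention the javadoc warns about.
        IntStream.range(0, 100_000).parallel().forEach(safeView::add);
        System.out.println(safeView.size());
    }
}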
In your example, you would rather use:
sortOrderCache = myThings.stream()
        .collect(Collectors.toMap(Foo::getId, Foo::getSortOrder));
codeNameCache = myThings.stream()
        .collect(Collectors.toMap(Foo::getId, Foo::getCodeName));
The collector (here toMap) does the work you were doing, and it handles parallelism safely: the stream may be split across several threads, each thread accumulates into its own intermediate map, and those partial maps are then merged by the collector's combiner.
By the way, you might eventually drop the codeNameCache/sortOrderCache pair and simply store the id -> Foo mapping.
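A minimal sketch of that alternative, with a stand-in Foo since the real class isn't shown:

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FooCacheAlt {
    // Minimal stand-in for the question's Foo.
    static class Foo {
        final int id; final int sortOrder; final String codeName;
        Foo(int id, int sortOrder, String codeName) {
            this.id = id; this.sortOrder = sortOrder; this.codeName = codeName;
        }
        int getId() { return id; }
        int getSortOrder() { return sortOrder; }
        String getCodeName() { return codeName; }
    }

    public static void main(String[] args) {
        List<Foo> myThings = List.of(new Foo(1, 10, "A"), new Foo(2, 20, "B"));
        // One cache instead of two: look up the Foo by id, then read any field.
        Map<Integer, Foo> fooById = myThings.stream()
                .collect(Collectors.toMap(Foo::getId, Function.identity()));
        System.out.println(fooById.get(2).getCodeName()); // B
    }
}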

I believe the documentation is referring to the kind of side effect demonstrated by the code below:
List<Integer> matched = new ArrayList<>();
List<Integer> elements = new ArrayList<>();
for (int i = 0; i < 10000; i++) {
    elements.add(i);
}
elements.parallelStream()
        .forEach(e -> {
            if (e >= 100) {
                matched.add(e);
            }
        });
System.out.println(matched.size());
System.out.println(matched.size());
This code streams through the list in parallel and tries to add elements to another list if they match a certain criterion. As the result list is not synchronized, you will likely get a java.lang.ArrayIndexOutOfBoundsException while executing the above code (or a silently wrong result, depending on timing).
The fix is to have the stream itself produce the new list, e.g.:
List<Integer> elements = new ArrayList<>();
for (int i = 0; i < 10000; i++) {
    elements.add(i);
}
List<Integer> matched = elements.parallelStream()
        .filter(e -> e >= 100)
        .collect(Collectors.toList());
System.out.println(matched.size());

Side effects frequently make assumptions about state and context. In parallel you are not guaranteed a specific order in which you see the elements, and multiple threads may run at the same time.
Unless you code for this, it can produce very subtle bugs that are very hard to track down and fix when you try to go parallel.


Java ParallelStream: several map or single map

Introduction
I'm currently developing a program in which I use java.util.Collection.parallelStream(), and I'm wondering if it's possible to make it more multi-threaded.
Several small maps
I was wondering whether using multiple map calls might allow java.util.Collection.parallelStream() to distribute the tasks better:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(gson::toJson)
.map(Document::parse)
.map(InsertOneModel::new)
.toList();
Single big map
For example, would the above give a better distribution than:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(puzzle -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle))))
.toList();
Question
Is one of these solutions more suitable for java.util.Collection.parallelStream(), or is there no big difference between the two?
I looked into the Stream source code. The result of a map operation is just fed into the next operation, so there is almost no difference between one big map() call and several small map() calls.
And for the map() operation a parallel stream makes no difference at all: each input element is processed through the whole pipeline by the same thread in either case.
Also note: a parallel stream only splits up the work if the operation chain allows it and there is enough data to process. So for a small collection, or a collection that does not allow random access, a parallel stream behaves like a sequential stream.
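A minimal sketch (class name is mine) to observe this yourself: each element records the thread name in both map stages, and for any given element both stages report the same thread:

import java.util.List;
import java.util.stream.Collectors;

public class FusionDemo {
    public static void main(String[] args) {
        List.of("a", "b", "c", "d").parallelStream()
                .map(s -> s + " stage1@" + Thread.currentThread().getName())
                .map(s -> s + " stage2@" + Thread.currentThread().getName())
                .collect(Collectors.toList())
                .forEach(System.out::println);
    }
}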
I don't think it will do any better if you chain multiple maps. If your code is not very complex, I would prefer a single big map.
To understand this, we have to check the code inside the map function (from the JDK's ReferencePipeline):
public final <R> Stream<R> map(Function<? super P_OUT, ? extends R> mapper) {
    Objects.requireNonNull(mapper);
    return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
                                     StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
        @Override
        Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
            return new Sink.ChainedReference<P_OUT, R>(sink) {
                @Override
                public void accept(P_OUT u) {
                    downstream.accept(mapper.apply(u));
                }
            };
        }
    };
}
As you can see, quite a lot happens behind the scenes: multiple objects are created and multiple methods are called, and all of this is repeated for each chained map call.
Now, coming back to parallel streams: they work on the concept of parallelism.
Streams Documentation
A parallel stream is a stream that splits its elements into multiple chunks, processing each chunk with a different thread. Thus, you can automatically partition the workload of a given operation on all the cores of your multicore processor and keep all of them equally busy.
Parallel streams internally use the default ForkJoinPool, which by default has as many threads as you have processors, as returned by Runtime.getRuntime().availableProcessors(). But you can change the size of this pool using the system property java.util.concurrent.ForkJoinPool.common.parallelism.
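A quick sketch (class name is mine) to inspect and change the common pool size; note that the property must be set before anything in the JVM touches the common pool:

import java.util.concurrent.ForkJoinPool;

public class PoolSizeDemo {
    public static void main(String[] args) {
        // Must run before the common pool is initialized anywhere in the JVM.
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");
        System.out.println("Cores: " + Runtime.getRuntime().availableProcessors());
        System.out.println("Common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
    }
}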
parallelStream() calls spliterator() on the collection, which returns a Spliterator implementation that provides the logic for splitting a task. Every source or collection has its own spliterator implementation. Using these spliterators, the parallel stream splits the task as long as possible; when a task becomes too small, it is executed sequentially and the partial results from all the subtasks are merged.
So I would prefer parallelStream when:
I have a huge amount of data to process at a time
I have multiple cores to process the data
There are performance issues with the existing implementation
I don't already have multiple threaded processes running, as that would add to the complexity
Performance Implications
Overhead: sometimes, when the dataset is small, converting a sequential stream into a parallel one results in worse performance. The overhead of managing threads, sources and results is more expensive than doing the actual work.
Splitting: arrays split cheaply and evenly, while LinkedList has none of these properties. TreeMap and HashSet split better than LinkedList but not as well as arrays.
Merging: the merge operation is really cheap for some operations, such as reduction and addition, but merge operations like grouping to sets or maps can be quite expensive (see the sketch after the conclusion below).
Conclusion: A large amount of data and many computations done per element indicate that parallelism could be a good option.
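Where merging maps dominates the cost, the concurrent collector variants accumulate into one shared map instead of merging per-thread maps; a minimal sketch (names are mine) trading encounter order for a cheaper combine step:

import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

public class ConcurrentGroupingDemo {
    public static void main(String[] args) {
        List<String> words = List.of("a", "bb", "cc", "ddd");
        // All threads write into a single ConcurrentMap, so there is no merge step.
        ConcurrentMap<Integer, List<String>> byLength = words.parallelStream()
                .collect(Collectors.groupingByConcurrent(String::length));
        System.out.println(byLength); // e.g. {1=[a], 2=[bb, cc], 3=[ddd]}
    }
}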
The three steps (toJson/parse/new) have to be executed sequentially, so all you're effectively doing is comparing s.map(g.compose(f)) and s.map(f).map(g). By virtue of being a monad, Java Streams are functors, and the 2nd functor law states that, in essence, s.map(g.compose(f)) == s.map(f).map(g), meaning that the two alternative ways of expressing the computation will produce identical results. From a performance standpoint the difference between the two is likely to be minimal.
However, in general you should be careful using Collection.parallelStream. It uses the common ForkJoinPool, essentially a fixed pool of threads shared across the entire JVM, whose size is determined by the number of cores on the host. The problem with using the common pool is that other threads in the same process may be using it at the same time as your code. This can lead to your code randomly and inexplicably slowing down, for example if another part of the code has temporarily exhausted the common pool.
It is preferable to create your own ExecutorService using one of the factory methods on Executors, and then submit your tasks to that:
private static final ExecutorService EX_SVC = Executors.newFixedThreadPool(16);

public static List<InsertOneModel<Document>> process(Stream<Puzzle> puzzles) throws InterruptedException {
    final Collection<Callable<InsertOneModel<Document>>> callables =
            puzzles.map(puzzle ->
                    (Callable<InsertOneModel<Document>>)
                            () -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle)))
            ).collect(Collectors.toList());
    return EX_SVC.invokeAll(callables).stream()
            .map(fut -> {
                try {
                    return fut.get();
                } catch (ExecutionException | InterruptedException ex) {
                    throw new RuntimeException(ex);
                }
            }).collect(Collectors.toList());
}
I doubt that there is much difference in performance, but even if you proved one was quicker, I would still prefer to see and use the first style in code I had to maintain.
The first multi-map style is easier for others to understand, easier to maintain and easier to debug - for example, by adding peek stages at any point in the processing chain:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
        .map(gson::toJson)
        // easy to make changes for debug, moving peek up/down
        // .peek(System.out::println)
        .map(Document::parse)
        // easy to filter:
        // .filter(this::somecondition)
        .map(InsertOneModel::new)
        .toList();
If your requirements change, such as needing to filter the output or capture the intermediate data by splitting it into two collections, the first approach beats the second every time.

Java 8 updating a map in parallel stream

I have two loops. In the inner loop, I hit a database, get the result, perform some computations on the result (which involves calling another private method), and put the result in a map.
Will this approach cause any problem, like putting null for any of the keys?
No two threads will update the same value; i.e. the key that is computed will be unique (if it loops n times, there will be n keys).
Map<String, String> m = new ConcurrentHashMap<>();
obj1.getProp().parallelStream().forEach(k1 -> {  // obj1.getProp() returns a list
    obj2.parallelStream().forEach(k2 -> {        // obj2 is a list
        String key = constructKey(k1, k2);
        // Hit a DB and get the result
        // Computations on the result
        // Call some other methods
        m.put(key, result);
    });
});
You should not use the Stream API unless you've fully understood that it is more than an alternative spelling for loops. Generally, if your code contains a forEach on a stream, you should ask yourself at least once whether this is really the best solution for your task; but if your code contains nested forEach calls, you should know it can't be the right thing.
It might work, as adding to a concurrent map, like in your question, is safe; however, it defeats the purpose of the Stream API.
Besides that, arrays don't have a parallelStream() method; thus, when the result type of obj1.getProp() and the type of obj2 are arrays, as your comments say, you have to use Arrays.stream(…) to construct a stream.
What you want to do can be implemented as
Map<String, String> m =
        Arrays.stream(obj1.getProp()).parallel()
                .flatMap(k1 -> Arrays.stream(obj2).map(k2 -> constructKey(k1, k2)))
                .collect(Collectors.toConcurrentMap(key -> key, key -> {
                    // Hit a DB and get the result
                    // Computations on the result
                    // Call some other methods
                    return result;
                }));
The benefit of this is not only better utilization of parallel processing, but also that it works even if you use Collectors.toMap, creating a non-concurrent Map, instead of Collectors.toConcurrentMap; the framework takes care of producing the result in a thread-safe manner.
So unless you definitely need a concurrent map for subsequent concurrent processing, you can use either; which one performs better depends on factors whose discussion would exceed the scope of this answer.
So with the correct usage of the Stream API, it will be thread safe, regardless of which Map type you produce, and the remaining question is whether the database access is thread safe, which, as already explained in this answer depends on a lot of factors which you didn’t include in your question, so we can’t answer that.
Your question boils down to the parts "can I add to a concurrent hash map from multiple threads?" and "can I access my database in parallel?"
The answer to the first is: "yes", the answer to the second is "it depends"
Or a little longer: the two parallel streams which you use basically just start the inner lambda on multiple threads in the execution pool. The adding to the map itself is not a problem, that is what the concurrent hash map was made for.
Regarding the database, it depends on how you query it and on which level you share the object. If you use a connection pool with a different connection for each thread, you will probably be fine. For most databases, sharing a connection and getting a new statement per thread is also fine. Sharing a statement and getting a new result set leads to problems for quite a number of database drivers.
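A minimal sketch of the safe pattern, assuming a pooled javax.sql.DataSource and a hypothetical table t; each concurrent call borrows its own connection, so no Connection, Statement or ResultSet is ever shared between threads:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class PerThreadConnection {
    // Safe to call from parallel stream tasks: every call borrows a fresh
    // connection from the pool and returns it when the try block closes it.
    static String fetchResult(DataSource pool, String key) throws SQLException {
        try (Connection con = pool.getConnection();
             PreparedStatement ps = con.prepareStatement("SELECT val FROM t WHERE k = ?")) {
            ps.setString(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}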

Thread safe or not? Updating a not-thread-safe-map from a parallel stream

The code snippet below updates a not-thread-safe map (itemsById is not thread-safe) from a parallel stream's forEach block:
// Update stuff in `itemsById` by iterating over all stuff in newItemsById:
newItemsById.entrySet()
        .parallelStream()
        .unordered()
        .filter(...)
        .forEach(entry -> {
            itemsById.put(entry.getKey(), entry.getValue());  // <-- look
        });
To me, this looks like not-thread-safe, because the parallel stream will call the forEach block in many threads at the same time, and thus call itemsById.put(..) in many threads at the same time, and itemsById isn't thread safe. (However, with a ConcurrentMap the code would be safe I think)
I wrote to a colleague: "Please note that the map might allocate new memory when you insert new data. That's likely not thread safe, since the collection is not thread safe. -- Whether or not writing to different keys from many threads, is thread safe, is implementation dependent, I would think. It's nothing I would choose to rely on."
He however says that the above code is thread safe. -- Is it?
((Please note: I don't think this question is too localized. Actually, now with Java 8, I think fairly many people will do something like parallelStream()...forEach(...), and then it might be good to know about thread-safety issues, for many people.))
You're right: this code is not thread-safe, and depending on the Map implementation and the race conditions you hit, it may produce any random effect: a correct result, silent loss of data, some exception, or an endless loop. You can easily check it like this:
int equal = 0;
for (int i = 0; i < 100; i++) {
    // create test input map like {0 -> 0, 1 -> 1, 2 -> 2, ...}
    Map<Integer, Integer> input = IntStream.range(0, 200).boxed()
            .collect(Collectors.toMap(x -> x, x -> x));
    Map<Integer, Integer> result = new HashMap<>();
    // write it into another HashMap in a parallel way without key collisions
    input.entrySet().parallelStream().unordered()
            .forEach(entry -> result.put(entry.getKey(), entry.getValue()));
    if (result.equals(input)) equal++;
}
System.out.println(equal);
On my machine this code usually prints something between 20 and 40 instead of 100. If I change HashMap to TreeMap, it usually fails with a NullPointerException or becomes stuck in an endless loop inside the TreeMap implementation.
I'm no expert on streams but I assume there is no fancy synchronization employed here and thus I wouldn't consider adding elements to itemsById in parallel as threadsafe.
One of the things that could happen is an endless loop: if two elements happen to end up in the same bucket, the underlying list can get messed up and elements can refer to each other in a cycle (A.next = B, B.next = A). A ConcurrentHashMap prevents that by synchronizing write access on the bucket, i.e. unless the elements end up in the same bucket it does not block, but if they do, the adds are performed sequentially.
This code is not thread-safe.
Oracle docs state:
Operations like forEach and peek are designed for side effects; a
lambda expression that returns void, such as one that invokes
System.out.println, can do nothing but have side effects. Even so, you
should use the forEach and peek operations with care; if you use one
of these operations with a parallel stream, then the Java runtime may
invoke the lambda expression that you specified as its parameter
concurrently from multiple threads.

Sequential streams and shared state

The javadoc for java.util.stream implies that "behavioral operations" in a stream pipeline must usually be stateless. However, the examples it shows of how not to write a pipeline all seem to involve parallel streams.
To what extent does this apply to sequential streams?
In particular, I was looking over a colleague's code that looked essentially like this:
List<SomeClass> list = ...;
Map<SomeClass, String> map = new HashMap<>();
list.stream()
        .filter(x -> [some boolean expression])
        .forEach(x -> {
            if (map.containsKey(x)) {
                throw new UserDefinedException("duplicates detected in input");
            } else {
                map.put(x, aStringFunction(x));
            }
        });
[The author had tried using Collectors.toMap(), but it threw an IllegalStateException when there were duplicates, and neither of us knew about the toMap that takes a mergeFunction. That last would have been the best solution, but I'd like an answer anyway because of the more general principle involved.]
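For reference, the three-argument toMap that neither of us knew about expresses the duplicate check directly; a sketch that keeps the question's placeholders and assumes UserDefinedException is unchecked:

Map<SomeClass, String> map = list.stream()
        .filter(x -> [some boolean expression])
        .collect(Collectors.toMap(
                x -> x,
                x -> aStringFunction(x),
                // the merge function only runs when two elements share a key
                (a, b) -> { throw new UserDefinedException("duplicates detected in input"); }));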
I was nervous about this code, since it wasn't clear to me whether the execution of the block in the forEach could overlap for different elements, even for a sequential stream. The javadoc for forEach() is a bit ambiguous whether synchronization is necessary for accessing shared state in a sequential stream. Eventually the author changed the code to use a ConcurrentHashMap and map.putIfAbsent().
My question is: was I right to be nervous, or is the code above trustworthy?
Suppose the expression in the filter() did something that used some shared state. Can we trust that it will work OK when using a sequential stream?
A sequential stream by definition executes everything in the caller thread, so if you are not going to parallelize your stream in the future, you can safely use shared state without additional synchronization or concurrency-safe collections. So the current code is safe. Note, however, that it just looks dirty.
If you rely on your forEach to be executed sequentially, consider using forEachOrdered instead even if the stream is sequential. Not only will that get the explicit guarantee from the api that the code will be executed sequentially, it will make the code more self-documenting and provide some measure of protection against somebody coming along and changing your stream to parallel.
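Applied to the code above (placeholders kept), that suggestion looks like this; forEachOrdered guarantees the actions run one at a time in encounter order, even if someone later changes stream() to parallelStream():

list.stream()
        .filter(x -> [some boolean expression])
        .forEachOrdered(x -> {
            if (map.containsKey(x)) {
                throw new UserDefinedException("duplicates detected in input");
            }
            map.put(x, aStringFunction(x));
        });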

Mutating instance or local object variables in Lambda java 8

I know that for concurrency reasons I cannot update the value of a local variable in a lambda in Java 8. So this is illegal:
double d = 0;
orders.forEach((o) -> {
    d += o.getTotal();
});
But what about updating an instance variable or changing the state of a local object? For example, in a Swing application I have a button and a label declared as instance variables; when I click the button I want to hide the label:
jButton1.addActionListener((e) -> {
    jLabel.setVisible(false);
});
I get no compiler errors and it works fine, but... is it right to change the state of an object in a lambda? Will I have concurrency problems or something bad in the future?
Here is another example. Imagine that the following code is in the doGet method of a servlet.
Will I have some problem here? If the answer is yes: why?
String key = request.getParameter("key");
Map<String, String> resultMap = new HashMap<>();
Map<String, String> map = new HashMap<>();
// Load map
map.forEach((k, v) -> {
    if (k.equals(key)) {
        resultMap.put(k, v);
    }
});
response.getWriter().print(resultMap);
What I want to know is: When is it right to mutate the state of an object instance in a lambda?
Your assumptions are incorrect.
You can only capture effectively final local variables in lambdas, because lambdas are syntactic sugar* over anonymous inner classes.
*They are actually more than only syntactic sugar, but that is not relevant here.
And in anonymous inner classes you can only capture effectively final local variables, hence the same holds for lambdas: you cannot reassign d at all, concurrency or not.
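As an aside, the usual way to get what the question's first snippet wanted, without mutating a local at all, is a reduction; a sketch assuming an Order type whose getTotal() returns a double:

double d = orders.stream()
        .mapToDouble(Order::getTotal)
        .sum();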
You can do anything you want with lambdas as long as the compiler allows it. Now, on to the behaviour part:
If you modify state that depends on other state, in a parallel setting, then you are in trouble.
If you modify state that depends on other state, in a linear setting, then everything is fine.
If you modify state that does not depend on anything else, then everything is fine as well.
Some examples:
class MutableNonSafeInt {
    private int i = 0;

    public void increase() {
        i++;
    }

    public int get() {
        return i;
    }
}

MutableNonSafeInt integer = new MutableNonSafeInt();
IntStream.range(0, 1000000)
        .forEach(i -> integer.increase());
System.out.println(integer.get());
This will print 1000000 as expected no matter what happens, even though it depends on the previous state.
Now let's parallelize the stream:
MutableNonSafeInt integer = new MutableNonSafeInt();
IntStream.range(0, 1000000)
        .parallel()
        .forEach(i -> integer.increase());
System.out.println(integer.get());
Now it prints different integers, like 199205 or 249165, because i++ is not atomic and there is no synchronization, so threads do not always see the changes other threads have made and updates get lost.
But say that we now get rid of our dummy class and use the AtomicInteger, which is thread-safe, we get the following:
AtomicInteger integer = new AtomicInteger(0);
IntStream.range(0, 1000000)
        .parallel()
        .forEach(i -> integer.getAndIncrement());
System.out.println(integer.get());
Now it correctly prints 1000000 again.
Synchronization is costly, however, and we have lost nearly all benefits of parallelization here.
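The idiomatic escape is to replace the side effect with a reduction, which needs no shared mutable state at all; a minimal sketch (class name is mine):

import java.util.stream.IntStream;

public class ReductionDemo {
    public static void main(String[] args) {
        // Each thread sums its own chunk and the framework merges the partial
        // sums, so no synchronization happens per element.
        long total = IntStream.range(0, 1_000_000).parallel().mapToLong(i -> 1L).sum();
        System.out.println(total); // 1000000
    }
}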
In general: yes, you may get concurrency problems, but only the ones you already had. Lambdafying it won't make code non-threadsafe where it was before, or vice versa. In the example you give, your code is (probably) threadsafe because an ActionListener is only ever called on the event-dispatching thread. Provided you have observed the Swing single-threaded rule, no other thread ever accesses jLabel, and if so there can be no thread interference on it. But that question is orthogonal to the use of lambdas.
In case forEach is distributed to different threads/cores, you might have concurrency issues; consider using atomics or concurrent structures (like ConcurrentHashMap).
