Java 8 updating a map in parallel stream

Java 8 updating a map in parallel stream - java

I have two loops. In the inner loop, I hit a Database, get the result and perform some computatiosn on the result (which involves calling other private method) and put the result it in a map.
Will this approach cause any problem like putting null for any of the keys?
No two threads will update the same value. i.e)the key that is computed will be unique. (If it loops n times, there will be n keys)
Map<String,String> m = new ConcurrentHashMap<>();
obj1.getProp().parallelStream().forEach(k1 -> { //obj.getProp() returns a list
obj2.parallelStream().forEach(k2-> { //obj2 is a list
String key = constructKey(k1,k2);
//Hit a DB and get the result
//Computations on the result
//Call some other methods
m.put(key, result);
});
});

You should not use the Stream API unless you’ve fully understood that it is more than an alternative spelling for loops. Generally, if your code contains a forEach on a stream, you should ask yourself at least once whether this is really the best solution for your task, but if your code contains a nested forEach calls, you should know that it can’t be the right thing.
It might work, as when adding to a concurrent map, like in your question, however, it defeats the purpose of the Stream API.
Besides that, arrays don’t have a parallelStream() method, thus, when the result type of obj.getProp() and the type of obj2 are arrays, as your comments say, you have to use Arrays.stream(…) to construct a stream.
What you want to do can be implemented as
Map<String,String> m =
Arrays.stream(obj1.getProp()).parallel()
.flatMap(k1 -> Arrays.stream(obj2).map(k2 -> constructKey(k1, k2)))
.collect(Collectors.toConcurrentMap(key -> key, key -> {
//Hit a DB and get the result
//Computations on the result
//Call some other methods
return result;
}));
The benefit of this is not only a better utilization of parallel processing, but also that it works even if you use Collectors.toMap, creating a non-concurrent Map, instead of Collectors.toConcurrentMap; the framework will take care of producing it in a thread-safe manner.
So unless you definitely need a concurrent map for concurrent later-one processing, you can use either; which one will perform better depends on factors whose discussion would exceed the scope of this answer.
So with the correct usage of the Stream API, it will be thread safe, regardless of which Map type you produce, and the remaining question is whether the database access is thread safe, which, as already explained in this answer depends on a lot of factors which you didn’t include in your question, so we can’t answer that.

Your question boils down to the parts "can I add to a concurrent hash map from multiple threads?" and "can I access my database in parallel?"
The answer to the first is: "yes", the answer to the second is "it depends"
Or a little longer: the two parallel streams which you use basically just start the inner lambda on multiple threads in the execution pool. The adding to the map itself is not a problem, that is what the concurrent hash map was made for.
Regarding the database, it depends on how you query it and on which level you share the object. If you use a connection pool with a different connection for each thread, you will probably be fine. For most databases, sharing a connection and getting a new statement per thread is also fine. Sharing a statement and getting a new result set leads to problems for quite a number of database drivers.

Related

How to prevent threads from reading inconsistent data from ConcurrentHashMap?

I'm using a ConcurrentHashMap<String, String> that works as a cache, and where read operations are performed to validate if an element is already in the cache and write operations to add an element to the cache.
So, my question is: what are the best practices to always read the most recent ConcorrentHashMap values?
I want to ensure data consistency and not have cases like:
With the map.get("key") method, the first thread validates that this key does not yet exist in the map, then it does the map.put("value")
The second thread reads the data before the first thread puts the element on the map, leading to inconsistent data.
Code example:
Optional<String> cacheValue = Optional.ofNullable(cachedMap.get("key"));
if (cacheValue.isPresent()) {
// Perform actions
} else {
cachedMap.putIfAbsent("key", "value");
// Perform actions
}
How can I ensure that my ConcurrentHashMap is synchronized and doesn't retrieve inconsistent data?
Should I perform these map operations inside a synchronized block?

You probably need to do it this way:
if (cachedMap.putIfAbsent("key", "value") == null) {
// Perform actions "IS NOT PRESENT"
} else {
// Perform actions "IS PRESENT"
}
Doing it in two checks is obviously not atomic, so if you're having problems with the wrong values getting put in the cache, then that's likely your problem.

what are the best practices to always read the most recent ConcurrentHashMap values?
Oracle's Javadoc for ConcurrentHashMap says, "Retrievals reflect the results of the most recently completed update operations holding upon their onset." In other words, any time you call map.get(...) or any other method on the map, you are always working with the "most recent" content.
*BUT*
Is that enough? Maybe not. If your program threads expect any kind of consistency between two or more keys in the map, or if your threads expect any kind of consistency between something that is stored in the map and something that is stored elsewhere, then you are going to need to provide some explicit higher-level synchronization between the threads.
I can't provide an example that would be specific to the problem that's puzzling you because your question doesn't really say what that problem is.

Java ParallelStream: several map or single map

Introduction
I'm currently developing a program in which I use Java.util.Collection.parallelStream(), and wondering if it's possible to make it more Multi-threaded.
Several small map
I was wondering if using multiple map might allow the Java.util.Collection.parallelStream() to distribute the tasks better:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(gson::toJson)
.map(Document::parse)
.map(InsertOneModel::new)
.toList();
Single big map
For example a better distribution than:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(puzzle -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle))))
.toList();
Question
Is there one of the solutions that is more suitable for Java.util.Collection.parallelStream(), or the two have no big difference?

I looked into the Stream source code. The result of a map operation is just fed into the next operation. So there is almost no difference between one big map() call or several small map() calls.
And for the map() operation a parallel Stream makes no difference at all. Meaning each input object will be processed until the end by the same Thread in any case.
Also note: A parallel Stream only splits up the work if the operation chain allows it and there is enough data to process. So for a small Collection or a Collection that allows no random access, a parallel Stream behaves like a sequential Stream.

I don't think it will do any better if you chain it with multiple maps. In case your code is not very complex I would prefer to use a single big map.
To understand this we have to check the code inside the map function. link
public final <R> Stream<R> map(Function<? super P_OUT, ? extends R> mapper) {
Objects.requireNonNull(mapper);
return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
#Override
Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
return new Sink.ChainedReference<P_OUT, R>(sink) {
#Override
public void accept(P_OUT u) {
downstream.accept(mapper.apply(u));
}
};
}
};
}
As you can see a lot many things happen behind the scenes. Multiple objects are created and multiple methods are called. Hence, for each chained map function call all these are repeated.
Now coming back to ParallelStreams, they work on the concept of Parallelism .
Streams Documentation
A parallel stream is a stream that splits its elements into multiple chunks, processing each chunk with a different thread. Thus, you can automatically partition the workload of a given operation on all the cores of your multicore processor and keep all of them equally busy.
Parallel streams internally use the default ForkJoinPool, which by default has as many threads as you have processors, as returned by Runtime.getRuntime().availableProcessors(). But you can change the size of this pool using the system property java.util.concurrent.ForkJoinPool.common.parallelism.
ParallelStream calls spliterator() on the collection object which returns a Spliterator implementation that provides the logic of splitting a task. Every source or collection has their own spliterator implementations. Using these spliterators, parallel stream splits the task as long as possible and finally when the task becomes too small it executes it sequentially and merges partial results from all the sub tasks.
So I would prefer parallelStream when
I have huge amount of data to process at a time
I have multiple cores to process the data
Performance issues with the existing implementation
I already don't have multiple threaded process running, as it will add to the complexity.
Performance Implications
Overhead : Sometimes when dataset is small converting a sequential stream into a parallel one results in worse performance. The overhead of managing threads, sources and results is a more expensive operation than doing the actual work.
Splitting: Arrays can split cheaply and evenly, while LinkedList has none of these properties. TreeMap and HashSet split better than LinkedList but not as well as arrays.
Merging:The merge operation is really cheap for some operations, such as reduction and addition, but merge operations like grouping to sets or maps can be quite expensive.
Conclusion: A large amount of data and many computations done per element indicate that parallelism could be a good option.

The three steps (toJson/parse/new) have to be executed sequentially, so all you're effectively doing is comparing s.map(g.compose(f)) and s.map(f).map(g). By virtue of being a monad, Java Streams are functors, and the 2nd functor law states that, in essence, s.map(g.compose(f)) == s.map(f).map(g), meaning that the two alternative ways of expressing the computation will produce identical results. From a performance standpoint the difference between the two is likely to be minimal.
However, in general you should be careful using Collection.parallelStream. It uses the common forkJoinPool, essentially a fixed pool of threads shared across the entire JVM. The size of the pool is determined by the number of cores on the host. The problem with using the common pool is that other threads in the same process may also be using it at the same time as your code. This can lead to your code randomly and inexplicably slowing down - if another part of the code has temporarily exhausted the common thread pool, for example.
More preferable is to create your own ExecutorService by using one of the creator methods on Executors, and then submit your tasks to that.
private static final ExecutorService EX_SVC = Executors.newFixedThreadPool(16);
public static List<InsertOneModel<Document>> process(Stream<Puzzle> puzzles) throws InterruptedException {
final Collection<Callable<InsertOneModel<Document>>> callables =
puzzles.map(puzzle ->
(Callable<InsertOneModel<Document>>)
() -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle)))
).collect(Collectors.toList());
return EX_SVC.invokeAll(callables).stream()
.map(fut -> {
try {
return fut.get();
} catch (ExecutionException|InterruptedException ex) {
throw new RuntimeException(ex);
}
}).collect(Collectors.toList());
}

I doubt that there is much different in performance, but even if you proved it did have quicker performance I would still prefer to see and use the first style in code I had to maintain.
The first multi-map style is easier for others to understand, it is easier to maintain and easier to debug - for example adding peek stages for any stage of the processing chain.
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(gson::toJson)
// easy to make changes for debug, moving peek up/down
// .peek(System.out::println)
.map(Document::parse)
// easy to filter:
// .filter(this::somecondition)
.map(InsertOneModel::new)
.toList();
If your requirements change - such as needing to filter the output, or capture the intermediate data by splitting to 2 collections, the first approach beats second every time.

How to safely consume Java Streams safely without isFinite() and isOrdered() methods?

There is the question on whether java methods should return Collections or Streams, in which Brian Goetz answers that even for finite sequences, Streams should usually be preferred.
But it seems to me that currently many operations on Streams that come from other places cannot be safely performed, and defensive code guards are not possible because Streams do not reveal if they are infinite or unordered.
If parallel was a problem to the operations I want to perform on a Stream(), I can call isParallel() to check or sequential to make sure computation is in parallel (if i remember to).
But if orderedness or finity(sizedness) was relevant to the safety of my program, I cannot write safeguards.
Assuming I consume a library implementing this fictitious interface:
public interface CoordinateServer {
public Stream<Integer> coordinates();
// example implementations:
// finite, ordered, sequential
// IntStream.range(0, 100).boxed()
// final AtomicInteger atomic = new AtomicInteger();
// // infinite, unordered, sequential
// Stream.generate(() -> atomic2.incrementAndGet())
// infinite, unordered, parallel
// Stream.generate(() -> atomic2.incrementAndGet()).parallel()
// finite, ordered, sequential, should-be-closed
// Files.lines(Path.path("coordinates.txt")).map(Integer::parseInt)
}
Then what operations can I safely call on this stream to write a correct algorithm?
It seems if I maybe want to do write the elements to a file as a side-effect, I need to be concerned about the stream being parallel:
// if stream is parallel, which order will be written to file?
coordinates().peek(i -> {writeToFile(i)}).count();
// how should I remember to always add sequential() in such cases?
And also if it is parallel, based on what Threadpool is it parallel?
If I want to sort the stream (or other non-short-circuit operations), I somehow need to be cautious about it being infinite:
coordinates().sorted().limit(1000).collect(toList()); // will this terminate?
coordinates().allMatch(x -> x > 0); // will this terminate?
I can impose a limit before sorting, but which magic number should that be, if I expect a finite stream of unknown size?
Finally maybe I want to compute in parallel to save time and then collect the result:
// will result list maintain the same order as sequential?
coordinates().map(i -> complexLookup(i)).parallel().collect(toList());
But if the stream is not ordered (in that version of the library), then the result might become mangled due to the parallel processing. But how can I guard against this, other than not using parallel (which defeats the performance purpose)?
Collections are explicit about being finite or infinite, about having an order or not, and they do not carry the processing mode or threadpools with them. Those seem like valuable properties for APIs.
Additionally, Streams may sometimes need to be closed, but most commonly not. If I consume a stream from a method (of from a method parameter), should I generally call close?
Also, streams might already have been consumed, and it would be good to be able to handle that case gracefully, so it would be good to check if the stream has already been consumed;
I would wish for some code snippet that can be used to validate assumptions about a stream before processing it, like>
Stream<X> stream = fooLibrary.getStream();
Stream<X> safeStream = StreamPreconditions(
stream,
/*maxThreshold or elements before IllegalArgumentException*/
10_000,
/* fail with IllegalArgumentException if not ordered */
true
)

After looking at things a bit (some experimentation and here) as far as I see, there is no way to know definitely whether a stream is finite or not.
More than that, sometimes even it is not determined except at runtime (such as in java 11 - IntStream.generate(() -> 1).takeWhile(x -> externalCondition(x))).
What you can do is:
You can find out with certainty if it is finite, in a few ways (notice that receiving false on these does not mean it is infinite, only that it may be so):
stream.spliterator().getExactSizeIfKnown() - if this has an known exact size, it is finite, otherwise it will return -1.
stream.spliterator().hasCharacteristics(Spliterator.SIZED) - if it is SIZED will return true.
You can safe-guard yourself, by assuming the worst (depends on your case).
stream.sequential()/stream.parallel() - explicitly set your preferred consumption type.
With potentially infinite stream, assume your worst case on each scenario.
For example assume you want listen to a stream of tweets until you find one by Venkat - it is a potentially infinite operation, but you'd like to wait until such a tweet is found. So in this case, simply go for stream.filter(tweet -> isByVenkat(tweet)).findAny() - it will iterate until such a tweet comes along (or forever).
A different scenario, and probably the more common one, is wanting to do something on all the elements, or only to try a certain amount of time (similar to timeout). For this, I'd recommend always calling stream.limit(x) before calling your operation (collect or allMatch or similar) where x is the amount of tries you're willing to tolerate.
After all this, I'll just mention that I think returning a stream is generally not a good idea, and I'd try to avoid it unless there are large benefits.

Does double-checked locking work with a final Map in Java?

I'm trying to implement a thread-safe Map cache, and I want the cached Strings to be lazily initialized. Here's my first pass at an implementation:
public class ExampleClass {
private static final Map<String, String> CACHED_STRINGS = new HashMap<String, String>();
public String getText(String key) {
String string = CACHED_STRINGS.get(key);
if (string == null) {
synchronized (CACHED_STRINGS) {
string = CACHED_STRINGS.get(key);
if (string == null) {
string = createString();
CACHED_STRINGS.put(key, string);
}
}
}
return string;
}
}
After writing this code, Netbeans warned me about "double-checked locking," so I started researching it. I found The "Double-Checked Locking is Broken" Declaration and read it, but I'm unsure if my implementation falls prey to the issues it mentioned. It seems like all the issues mentioned in the article are related to object instantiation with the new operator within the synchronized block. I'm not using the new operator, and Strings are immutable, so I'm not sure that if the article is relevant to this situation or not. Is this a thread-safe way to cache strings in a HashMap? Does the thread-safety depend on what action is taken in the createString() method?

No it's not correct because the first access is done out side of a sync block.
It's somewhat down to how get and put might be implemented. You must bare in mind that they are not atomic operations.
For example, what if they were implemented like this:
public T get(string key){
Entry e = findEntry(key);
return e.value;
}
public void put(string key, string value){
Entry e = addNewEntry(key);
//danger for get while in-between these lines
e.value = value;
}
private Entry addNewEntry(key){
Entry entry = new Entry(key, ""); //a new entry starts with empty string not null!
addToBuckets(entry); //now it's findable by get
return entry;
}
Now the get might not return null when the put operation is still in progress, and the whole getText method could return the wrong value.
The example is a bit convoluted, but you can see that correct behaviour of your code relies on the inner workings of the map class. That's not good.
And while you can look that code up, you cannot account for compiler, JIT and processor optimisations and inlining which effectively can change the order of operations just like the wacky but correct way I chose to write that map implementation.

Consider use of a concurrent hashmap and the method Map.computeIfAbsent() which takes a function to call to compute a default value if key is absent from the map.
Map<String, String> cache = new ConcurrentHashMap<>( );
cache.computeIfAbsent( "key", key -> "ComputedDefaultValue" );
Javadoc: If the specified key is not already associated with a value, attempts to compute its value using the given mapping function and enters it into this map unless null. The entire method invocation is performed atomically, so the function is applied at most once per key. Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple, and must not attempt to update any other mappings of this map.

Non-trivial problem domains:
Concurrency is easy to do and hard to do correctly.
Caching is easy to do and hard to do correctly.
Both are right up there with Encryption in the category of hard to get right without an intimate understanding of the problem domain and its many subtle side effects and behaviors.
Combine them and you get a problem an order of magnitude harder than either one.
This is a non-trivial problem that your naive implementation will not solve in a bug free manner. The HashMap you are using is not going to threadsafe if any accesses are not checked and serialized, it will not be performant and will cause lots of contention that will cause lot of blocking and latency depending on the use.
The proper way to implement a lazy loading cache is to use something like Guava Cache with a Cache Loader it takes care of all the concurrency and cache race conditions for you transparently. A cursory glance through the source code shows how they do it.

No, and ConcurrentHashMap would not help.
Recap: the double check idiom is typically about assigning a new instance to a variable/field; it is broken because the compiler can reorder instructions, meaning the field can be assigned with a partially constructed object.
For your setup, you have a distinct issue: the map.get() is not safe from the put() which may be occurring thus possibly rehashing the table. Using a Concurrent hash map fixes ONLY that but not the risk of a false positive (that you think the map has no entry but it is actually being made). The issue is not so much a partially constructed object but the duplication of work.
As for the avoidable guava cacheloader: this is just a lazy-init callback that you give to the map so it can create the object if missing. This is essentially the same as putting all the 'if null' code inside the lock, which is certainly NOT going to be faster than good old direct synchronization. (The only times it makes sense to use a cacheloader is for pluggin-in a factory of such missing objects while you are passing the map to classes who don't know how to make missing objects and don't want to be told how).

Sequential streams and shared state

The javadoc for java.util.stream implies that "behavioral operations" in a stream pipeline must usually be stateless. However, the examples it shows of how not to write a pipeline all seem to involve parallel streams.
To what extent does this apply to sequential streams?
In particular, I was looking over a colleague's code that looked essentially like this:
List<SomeClass> list = ...;
Map<SomeClass, String> map = new HashMap<>();
list.stream()
.filter(x -> [some boolean expression])
.forEach(x -> {
if (map.containsKey(x) {
throw new UserDefinedException("duplicates detected in input");
} else {
map.put(x, aStringFunction(x));
}
});
[The author had tried using Collectors.toMap(), but it threw an IllegalStateException when there were duplicates, and neither of us knew about the toMap that takes a mergeFunction. That last would have been the best solution, but I'd like an answer anyway because of the more general principle involved.]
I was nervous about this code, since it wasn't clear to me whether the execution of the block in the forEach could overlap for different elements, even for a sequential stream. The javadoc for forEach() is a bit ambiguous whether synchronization is necessary for accessing shared state in a sequential stream. Eventually the author changed the code to use a ConcurrentHashMap and map.putIfAbsent().
My question is: was I right to be nervous, or is the code above trustworthy?
Suppose the expression in the filter() did something that used some shared state. Can we trust that it will work OK when using a sequential stream?

The sequential stream is by definition executes everything in the caller thread, thus if you are not going to parallelize your stream in future, you can safely use shared state without additional synchronization and concurrent-safe collections. So the current code is safe. Note however that it just looks dirty.

If you rely on your forEach to be executed sequentially, consider using forEachOrdered instead even if the stream is sequential. Not only will that get the explicit guarantee from the api that the code will be executed sequentially, it will make the code more self-documenting and provide some measure of protection against somebody coming along and changing your stream to parallel.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java 8 updating a map in parallel stream - java

Related

How to prevent threads from reading inconsistent data from ConcurrentHashMap?

Java ParallelStream: several map or single map

How to safely consume Java Streams safely without isFinite() and isOrdered() methods?

Does double-checked locking work with a final Map in Java?

Sequential streams and shared state

Categories

Resources