There is a question about whether Java methods should return Collections or Streams, in which Brian Goetz answers that even for finite sequences, Streams should usually be preferred.
But it seems to me that many operations on Streams obtained from elsewhere cannot currently be performed safely, and defensive guards are not possible because Streams do not reveal whether they are infinite or unordered.
If parallelism were a problem for the operations I want to perform on a Stream, I could call isParallel() to check, or call sequential() to make sure the computation is not parallel (if I remember to).
But if orderedness or finiteness (sizedness) is relevant to the safety of my program, I cannot write such safeguards.
Assuming I consume a library implementing this fictitious interface:
public interface CoordinateServer {
    public Stream<Integer> coordinates();

    // example implementations:

    // finite, ordered, sequential:
    // IntStream.range(0, 100).boxed()

    // given: final AtomicInteger atomic = new AtomicInteger();
    // infinite, unordered, sequential:
    // Stream.generate(() -> atomic.incrementAndGet())

    // infinite, unordered, parallel:
    // Stream.generate(() -> atomic.incrementAndGet()).parallel()

    // finite, ordered, sequential, should-be-closed:
    // Files.lines(Paths.get("coordinates.txt")).map(Integer::parseInt)
}
Then what operations can I safely call on this stream to write a correct algorithm?
It seems that if I want to write the elements to a file as a side effect, I need to be concerned about the stream being parallel:
// if stream is parallel, which order will be written to file?
coordinates().peek(i -> writeToFile(i)).count();
// how should I remember to always add sequential() in such cases?
And if it is parallel, which thread pool does it run on?
If I want to sort the stream (or apply other non-short-circuiting operations), I somehow need to be cautious about it being infinite:
coordinates().sorted().limit(1000).collect(toList()); // will this terminate?
coordinates().allMatch(x -> x > 0); // will this terminate?
I can impose a limit before sorting, but which magic number should that be, if I expect a finite stream of unknown size?
Finally maybe I want to compute in parallel to save time and then collect the result:
// will result list maintain the same order as sequential?
coordinates().map(i -> complexLookup(i)).parallel().collect(toList());
But if the stream is not ordered (in that version of the library), then the result might come back mangled by the parallel processing. How can I guard against this, other than not using parallel() (which defeats the performance purpose)?
Collections are explicit about being finite or infinite, about having an order or not, and they do not carry the processing mode or threadpools with them. Those seem like valuable properties for APIs.
Additionally, Streams may sometimes need to be closed, but most commonly not. If I consume a stream from a method (or from a method parameter), should I generally call close?
Also, a stream might already have been consumed, and it would be good to be able to detect that case and handle it gracefully.
I would wish for some code snippet that can be used to validate assumptions about a stream before processing it, like:
Stream<X> stream = fooLibrary.getStream();
Stream<X> safeStream = StreamPreconditions(
    stream,
    /* max threshold of elements before IllegalArgumentException */
    10_000,
    /* fail with IllegalArgumentException if not ordered */
    true
);
After looking into this a bit (some experimentation and reading), as far as I can see there is no way to know definitively whether a stream is finite or not.
More than that, sometimes it is not even determined until runtime (such as, in Java 11, IntStream.generate(() -> 1).takeWhile(x -> externalCondition(x))).
What you can do is:
You can find out with certainty that it is finite in a few ways (note that getting a negative result from these does not mean it is infinite, only that it may be); a small sketch follows the two checks below:
stream.spliterator().getExactSizeIfKnown() - if this returns a known exact size the stream is finite; otherwise it returns -1.
stream.spliterator().hasCharacteristics(Spliterator.SIZED) - returns true if the stream is SIZED.
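For example, a minimal sketch of such a check (the helper name and exception choice are mine, not an existing API). Note that calling spliterator() takes over the stream, so a stream has to be rebuilt from the spliterator afterwards via StreamSupport:
import java.util.Spliterator;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

static <T> Stream<T> requireSizedAndOrdered(Stream<T> stream) {
    Spliterator<T> sp = stream.spliterator(); // consumes the original stream
    if (!sp.hasCharacteristics(Spliterator.SIZED))
        throw new IllegalArgumentException("stream size is unknown (possibly infinite)");
    if (!sp.hasCharacteristics(Spliterator.ORDERED))
        throw new IllegalArgumentException("stream has no encounter order");
    // hand back a sequential stream built from the same elements
    return StreamSupport.stream(sp, false);
}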
You can safeguard yourself by assuming the worst (depending on your case).
stream.sequential()/stream.parallel() - explicitly set your preferred consumption type.
With a potentially infinite stream, assume the worst case in each scenario.
For example, assume you want to listen to a stream of tweets until you find one by Venkat - it is a potentially infinite operation, but you'd like to wait until such a tweet is found. So in this case, simply go for stream.filter(tweet -> isByVenkat(tweet)).findAny() - it will iterate until such a tweet comes along (or forever).
A different scenario, and probably the more common one, is wanting to do something on all the elements, or only to try a certain number of times (similar to a timeout). For this, I'd recommend always calling stream.limit(x) before your terminal operation (collect, allMatch or similar), where x is the number of attempts you're willing to tolerate.
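For instance, using the fictitious coordinates() stream from the question (10_000 is an arbitrary ceiling, not a recommendation):
coordinates().limit(10_000).sorted().collect(Collectors.toList()); // bounded even if the source is infinite
coordinates().limit(10_000).allMatch(x -> x > 0); // only checks the first 10_000 elements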
After all this, I'll just mention that I think returning a stream is generally not a good idea, and I'd try to avoid it unless there are large benefits.
Related
Consider a Java Stream object generated by a method:
// get Stream object
Stream<String> dataCollection = myFunctionToGenerateStreamObj(...);
// get count for logging purposes
long total = dataCollection.count();
// iterate over each object and do something
dataCollection.forEach(obj -> {
    anotherFunctionToProcessObj(obj);
});
Obviously the above code will throw
IllegalStateException: stream has already been operated upon or closed
on the iteration line. I understand there is a solution using a Supplier. I'm just wondering what the best practice is for coding the above logic in the most common and efficient way.
The point is that the act of creating and iterating a stream is potentially expensive.
Consider this simple one: Files.lines. It opens the file and reads it, line by line, exposing that as a Stream<String>.
You can, of course, wrap this in a supplier: () -> Files.lines(somePath) and pass that around. Anybody you gave that supplier to can invoke it, which results in the OS opening a file handle, and asking the disk to go spit out some bits.
But the point is, the cost here is in producing the stream. The actual thing you do with the stream? Dwarfed, utterly, by the disk access (though with fast SSDs this matters less; if it helps, imagine instead a file loaded over a metered network connection).
Some streams aren't anything like that. If you have an ArrayList and you call .stream() on it, the 'processing' cost of the stream itself (making the stream object and streaming through the list's elements) is an ignorable cost; it's just walking a memory pointer through contiguous memory, no less. It doesn't even cache-miss much.
But Stream is an abstraction - the point is that it sucks to have to care. It's bad to have to document "if you pass a supplier that supplies a stream with high setup and processing costs, this code is quite slow, and if it's loaded across a metered network it'll really rack up your bill".
So, don't. Which means, the best thing to do is to stream only once.
Unfortunately there is no simple way to make two Stream objects that are powered by a single pass through the underlying source.
One way to do this count thing would be:
AtomicInteger counter = new AtomicInteger();
Stream<String> dataCollection = ....;
dataCollection.peek(x -> counter.incrementAndGet()).forEach(...);
It's annoying to have to do it like this. A key problem here is that lambdas don't like changing state, and methods can only return one thing.
If it is not an unbounded stream, what about map()?
List<Integer> integers = Arrays.asList(1, 2, 3, 4, 5, 6);
long count = integers.stream().map(x -> {
System.out.println("x = " + x);
return 1;
}).count();
System.out.println("count = " + count);
I think what you have can be represented as a special case of reduction. More specifically, Stream.reduce(identity, accumulator, combiner).
int totalCount = dataCollection
.reduce(0, (currentCount, obj) -> {
anotherFunctionToProcessObj(obj);
return currentCount + 1;
}, Integer::sum);
System.out.println(totalCount);
The identity is the initial value for the counter: 0.
The accumulator function processes the object and returns the current count increased by 1.
The combiner sums the results of two accumulators.
As stated in the stream-docs, the preferred way is reduction:
Many computations where one might be tempted to use side effects can be more safely and efficiently expressed without side-effects, such as using reduction instead of mutable accumulators.
As pointed out in the comments, if the stream is infinite you won't be able to reduce it (and it's not possible to get the total count anyway). In that case an external counter can be used to keep track of the number of processed elements:
AtomicInteger counter = new AtomicInteger(0);
dataCollection.forEach(obj -> {
anotherFunctionToProcessObj(obj);
counter.incrementAndGet();
});
Keep in mind, this is ok for this particular corner case only, as changing external state via side effects is bad practice.
As another option, you can just get the stream's iterator and use a simple while loop.
Iterator<String> iterator = dataCollection.iterator();
int count = 0;
while (iterator.hasNext()) {
count++;
anotherFunctionToProcessObj(iterator.next());
}
System.out.println(count);
The stream you're pulling data from should not care about how large it is. After all, streams can be unbounded in size. This is also important in the context of what it is you're streaming; if you're streaming whole objects then it's a lot different than if you were streaming individual characters from a data source that was whole terabytes in size.
So... the only place you could realistically keep track of what you've seen is where the data is processed. In anotherFunctionToProcessObj, since you're just consuming the data (it doesn't look like you have to worry about threads too much), you could keep a tally inside that class.
Because you're leveraging forEach, this already creates a side effect, and while you shouldn't look to create side effects when operating in streams, there's not a better way to do this without simply disregarding the requirement to count the elements.
This way, you only ever stream the data once and you keep track of what you process in the process.
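A minimal sketch of that idea (the class and method names here are made up for illustration):
import java.util.concurrent.atomic.AtomicLong;

class CountingProcessor {
    private final AtomicLong processed = new AtomicLong();

    void process(String obj) {
        // ... the actual processing of obj goes here ...
        processed.incrementAndGet(); // tally each element as it is handled
    }

    long processedCount() {
        return processed.get();
    }
}

// usage: stream once, count as a by-product of processing
CountingProcessor processor = new CountingProcessor();
dataCollection.forEach(processor::process);
System.out.println("total = " + processor.processedCount());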
Introduction
I'm currently developing a program in which I use java.util.Collection.parallelStream(), and I'm wondering if it's possible to make it more multi-threaded.
Several small map
I was wondering if using multiple map calls might allow java.util.Collection.parallelStream() to distribute the tasks better:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(gson::toJson)
.map(Document::parse)
.map(InsertOneModel::new)
.toList();
Single big map
For example, a better distribution than:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(puzzle -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle))))
.toList();
Question
Is one of the solutions more suitable for java.util.Collection.parallelStream(), or is there no big difference between the two?
I looked into the Stream source code. The result of a map operation is just fed into the next operation, so there is almost no difference between one big map() call and several small map() calls.
And for the map() operation, a parallel Stream makes no difference at all, meaning each input object will be processed through to the end by the same thread in any case.
Also note: A parallel Stream only splits up the work if the operation chain allows it and there is enough data to process. So for a small Collection or a Collection that allows no random access, a parallel Stream behaves like a sequential Stream.
I don't think it will do any better if you chain multiple maps. If your code is not very complex, I would prefer to use a single big map.
To understand this we have to check the code inside the map function (from the JDK source):
public final <R> Stream<R> map(Function<? super P_OUT, ? extends R> mapper) {
    Objects.requireNonNull(mapper);
    return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
                                     StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
        @Override
        Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
            return new Sink.ChainedReference<P_OUT, R>(sink) {
                @Override
                public void accept(P_OUT u) {
                    downstream.accept(mapper.apply(u));
                }
            };
        }
    };
}
As you can see, quite a lot happens behind the scenes: multiple objects are created and multiple methods are called. Hence, for each chained map call all of this is repeated.
Now, coming back to parallel streams, they work on the concept of parallelism.
Streams Documentation
A parallel stream is a stream that splits its elements into multiple chunks, processing each chunk with a different thread. Thus, you can automatically partition the workload of a given operation on all the cores of your multicore processor and keep all of them equally busy.
Parallel streams internally use the default ForkJoinPool, which by default has as many threads as you have processors, as returned by Runtime.getRuntime().availableProcessors(). But you can change the size of this pool using the system property java.util.concurrent.ForkJoinPool.common.parallelism.
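For example (this has to happen before the common pool is first used; the comment shows the equivalent JVM flag):
// equivalent to starting the JVM with -Djava.util.concurrent.ForkJoinPool.common.parallelism=4
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");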
ParallelStream calls spliterator() on the collection object which returns a Spliterator implementation that provides the logic of splitting a task. Every source or collection has their own spliterator implementations. Using these spliterators, parallel stream splits the task as long as possible and finally when the task becomes too small it executes it sequentially and merges partial results from all the sub tasks.
So I would prefer parallelStream when:
I have a huge amount of data to process at a time
I have multiple cores available to process the data
There are performance issues with the existing implementation
I am not already running other multi-threaded processing, as it would add to the complexity
Performance Implications
Overhead: Sometimes, when the dataset is small, converting a sequential stream into a parallel one results in worse performance. The overhead of managing threads, sources and results is more expensive than doing the actual work.
Splitting: Arrays can split cheaply and evenly, while LinkedList has none of these properties. TreeMap and HashSet split better than LinkedList but not as well as arrays.
Merging: The merge operation is really cheap for some operations, such as reduction and addition, but merge operations like grouping into sets or maps can be quite expensive.
Conclusion: A large amount of data and many computations done per element indicate that parallelism could be a good option.
The three steps (toJson/parse/new) have to be executed sequentially, so all you're effectively doing is comparing s.map(g.compose(f)) and s.map(f).map(g). By virtue of being a monad, Java Streams are functors, and the 2nd functor law states that, in essence, s.map(g.compose(f)) == s.map(f).map(g), meaning that the two alternative ways of expressing the computation will produce identical results. From a performance standpoint the difference between the two is likely to be minimal.
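In Java terms, a small sketch of that equivalence, reusing the gson, Document and puzzles names from the question:
Function<Puzzle, String> f = gson::toJson;
Function<String, Document> g = Document::parse;

// second functor law: both pipelines produce the same result
List<Document> viaComposition = puzzles.stream().map(g.compose(f)).toList();
List<Document> viaChaining = puzzles.stream().map(f).map(g).toList();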
However, in general you should be careful using Collection.parallelStream. It uses the common ForkJoinPool, essentially a fixed pool of threads shared across the entire JVM, whose size is determined by the number of cores on the host. The problem with using the common pool is that other threads in the same process may be using it at the same time as your code. This can lead to your code randomly and inexplicably slowing down - if another part of the code has temporarily exhausted the common thread pool, for example.
It is preferable to create your own ExecutorService using one of the factory methods on Executors, and then submit your tasks to that.
private static final ExecutorService EX_SVC = Executors.newFixedThreadPool(16);

public static List<InsertOneModel<Document>> process(Stream<Puzzle> puzzles) throws InterruptedException {
    final Collection<Callable<InsertOneModel<Document>>> callables =
        puzzles.map(puzzle ->
            (Callable<InsertOneModel<Document>>)
                () -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle)))
        ).collect(Collectors.toList());
    return EX_SVC.invokeAll(callables).stream()
        .map(fut -> {
            try {
                return fut.get();
            } catch (ExecutionException | InterruptedException ex) {
                throw new RuntimeException(ex);
            }
        }).collect(Collectors.toList());
}
I doubt that there is much difference in performance, but even if you proved one was quicker, I would still prefer to see and use the first style in code I had to maintain.
The first, multi-map style is easier for others to understand, easier to maintain and easier to debug - for example, by adding peek stages at any point in the processing chain.
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(gson::toJson)
// easy to make changes for debug, moving peek up/down
// .peek(System.out::println)
.map(Document::parse)
// easy to filter:
// .filter(this::somecondition)
.map(InsertOneModel::new)
.toList();
If your requirements change - such as needing to filter the output, or to capture the intermediate data by splitting it into two collections - the first approach beats the second every time.
In the javadoc for the stream package, at the end of the section Parallelism, I read:
Most stream operations accept parameters that describe user-specified behavior, which are often lambda expressions. To preserve correct behavior, these behavioral parameters must be non-interfering, and in most cases must be stateless.
I have a hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?
I mean, I know it is possible, especially when using sequential streams, but the same javadoc clearly states:
Except for operations identified as explicitly nondeterministic, such as findAny(), whether a stream executes sequentially or in parallel should not change the result of the computation.
And also:
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; [...] The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.
So, my question is: in which circumstances is it a good practice to use a stateful stream operation (and not for methods working by side-effect, such as forEach)?
A related question could be: why are there operations working by side effect, such as forEach? I always end up doing a good old for loop to avoid having side-effects in my lambda expression.
Examples of stateful stream lambdas:
collect(Collector): The Collector is by definition stateful, since it has to collect all the elements in a collection (state).
forEach(Consumer): The Consumer is by definition stateful, well except if it's a black hole (no-op).
peek(Consumer): The Consumer is by definition stateful, because why peek if not to store it somewhere (e.g. log).
So, Collector and Consumer are two lambda interfaces that by definition are stateful.
All the others, e.g. Predicate, Function, UnaryOperator, BinaryOperator, and Comparator, should be stateless.
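As an illustration (the names and variables here are my own), a common stateful Predicate is de-duplication through a shared Set; it happens to work on a sequential stream, but its result under parallel execution is not guaranteed:
Set<String> seen = ConcurrentHashMap.newKeySet();
List<String> distinct = names.stream()
    .filter(seen::add) // stateful predicate: the result depends on what was seen before
    .collect(Collectors.toList());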
I have a hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?
Suppose the following scenario: you have a Stream<String> and you need to list the items in natural order, prefixing each one with its order number. So, for example, the input is Banana, Apple and Grape, and the output should be:
1. Apple
2. Banana
3. Grape
How do you solve this task with the Java Stream API? Pretty easily:
List<String> f = asList("Banana", "Apple", "Grape");
AtomicInteger number = new AtomicInteger(0);
String result = f.stream()
.sorted()
.sequential()
.map(i -> String.format("%d. %s", number.incrementAndGet(), i))
.collect(Collectors.joining("\n"));
Now if you look at this pipeline you'll see 3 stateful operations:
sorted() – stateful by definition. See the documentation for Stream.sorted():
This is a stateful intermediate operation
map() – by itself could be stateless or not, but in this case it is not. To label positions you need to keep track of how many items have already been labeled;
collect() – is a mutable reduction operation (from the docs for Stream.collect()). Mutable operations are stateful by definition, because they change (mutate) shared state.
There is some controversy about why sorted() is stateful. From the Stream API documentation:
Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element -- each element can be processed independently of operations on other elements. Stateful operations, such as distinct and sorted, may incorporate state from previously seen elements when processing new elements.
So when applying the terms stateful/stateless to the Stream API, we're talking about the function processing an element of the stream, not about a function processing the stream as a whole.
Also note that there is some confusion between terms stateless and deterministic. They are not the same.
A deterministic function provides the same result given the same arguments.
A stateless function retains no state from previous calls.
These are different definitions and, in the general case, do not depend on each other. Determinism is about a function's result value, while statelessness is about a function's implementation.
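A small illustration of the difference (my own example, not from the javadoc):
// nondeterministic but stateless: different results for the same input, no state retained between calls
Function<String, Long> timestamp = s -> System.currentTimeMillis();

// deterministic but stateful: same result for the same input, but it keeps a cache between calls
Map<Integer, Integer> cache = new HashMap<>();
Function<Integer, Integer> memoizedSquare = x -> cache.computeIfAbsent(x, k -> k * k);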
When in doubt simply check the documentation to the specific operation. Examples:
Stream.map mapper parameter:
mapper - a non-interfering, stateless function to apply to each element
Here documentation explicitly says that the function must be stateless.
Stream.forEach action parameter:
action - a non-interfering action to perform on the elements
Here it's not specified that the action is stateless, thus it can be stateful.
In general it's always explicitly written on every method documentation.
A stateless function returns the same output for the same inputs, "no matter what".
It's easy to create non-stateless functions in an imperative language like Java. e.g.
Function<Object, Long> func = input -> System.currentTimeMillis(); // a different result each time it is called
If we do stream.map(func) with a stateful func, the resulting stream will depend on how func is invoked at runtime; the behavior of the application will be hard to understand (but not that hard).
If func is stateless, stream.map(func) will always produce the same stream, no matter how map is implemented and executed. This is nice and desirable.
Note that "no matter what" implies that a stateless function must be thread-safe.
If a function returns void, isn't it always stateless? Well... there's another connotation of stateless - invoking a stateless function should not have side effects that are "important" to the application.
If func has no "important" side effects, it's safe to invoke func arbitrarily. For example, stream.map(func) can safely invoke func multiple times, even on the same element. (But don't worry, Stream is never gonna do that.)
What is an "important" side effect? That is very subjective.
At the very least, invoking func will cost some CPU time, which is not exactly free. This might be concerning for performance-critical applications, or on expensive platforms (cough AWS).
If func logs something to the hard disk, it may or may not be an "important" side effect. (It too costs $$.)
If func queries an external service that costs dearly, it is very concerning, it can bankrupt you.
Now, forget about money. Purely from an application-logic point of view, func could cause mutation to some state that the application depends on; even if func returns the same output for the same inputs, it still cannot be considered "stateless". For example, if in stream.map(func), func adds each element to a list, and later the application uses that list, the resulting list will depend on how func is invoked at runtime. This is frowned upon by functional programmers.
If we do stream.forEach( e->log(e) ), is it stateless? We can consider it stateless if
we don't care about the cost of log
log() can be invoked concurrently
we don't care about the order of log entries
log entries have no impact on this application's logic
The javadoc for java.util.stream implies that "behavioral operations" in a stream pipeline must usually be stateless. However, the examples it shows of how not to write a pipeline all seem to involve parallel streams.
To what extent does this apply to sequential streams?
In particular, I was looking over a colleague's code that looked essentially like this:
List<SomeClass> list = ...;
Map<SomeClass, String> map = new HashMap<>();
list.stream()
    .filter(x -> [some boolean expression])
    .forEach(x -> {
        if (map.containsKey(x)) {
            throw new UserDefinedException("duplicates detected in input");
        } else {
            map.put(x, aStringFunction(x));
        }
    });
[The author had tried using Collectors.toMap(), but it threw an IllegalStateException when there were duplicates, and neither of us knew about the toMap that takes a mergeFunction. That last would have been the best solution, but I'd like an answer anyway because of the more general principle involved.]
I was nervous about this code, since it wasn't clear to me whether the execution of the block in the forEach could overlap for different elements, even for a sequential stream. The javadoc for forEach() is a bit ambiguous about whether synchronization is necessary for accessing shared state in a sequential stream. Eventually the author changed the code to use a ConcurrentHashMap and map.putIfAbsent().
My question is: was I right to be nervous, or is the code above trustworthy?
Suppose the expression in the filter() did something that used some shared state. Can we trust that it will work OK when using a sequential stream?
A sequential stream by definition executes everything in the caller thread, so if you are not going to parallelize your stream in the future, you can safely use shared state without additional synchronization or concurrency-safe collections. The current code is therefore safe. Note, however, that it just looks dirty.
If you rely on your forEach to be executed sequentially, consider using forEachOrdered instead even if the stream is sequential. Not only will that get the explicit guarantee from the api that the code will be executed sequentially, it will make the code more self-documenting and provide some measure of protection against somebody coming along and changing your stream to parallel.
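Applied to the code from the question, that would look roughly like this (keeping the question's placeholder expression and hypothetical aStringFunction/UserDefinedException, assuming the exception is unchecked):
list.stream()
    .filter(x -> [some boolean expression])
    .forEachOrdered(x -> {
        if (map.containsKey(x)) {
            throw new UserDefinedException("duplicates detected in input");
        }
        map.put(x, aStringFunction(x));
    });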
How do I effectively parallelize my computation of pi (just as an example)?
This works (and takes about 15secs on my machine):
Stream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).limit(999999999L).mapToDouble(d->4.0d/d).sum()
But all of the following parallel variants run into an OutOfMemoryError
DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).parallel().limit(999999999L).map(d->4.0d/d).sum();
DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).limit(999999999L).parallel().map(d->4.0d/d).sum();
DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).limit(999999999L).map(d->4.0d/d).parallel().sum();
So, what do I need to do to get parallel processing of this (large) stream?
I already checked if autoboxing is causing the memory consumption, but it is not. This works also:
DoubleStream.iterate(1, d->-(d+Math.abs(2*d)/d)).boxed().limit(999999999L).mapToDouble(d->4/d).sum()
The problem is that you are using constructs which are hard to parallelize.
First, Stream.iterate(…) creates a sequence of numbers where each calculation depends on the previous value, hence, it offers no room for parallel computation. Even worse, it creates an infinite stream which will be handled by the implementation like a stream with unknown size. For splitting the stream, the values have to be collected into arrays before they can be handed over to other computation threads.
Second, providing a limit(…) doesn’t improve the situation, it makes the situation even worse. Applying a limit removes the size information which the implementation just had gathered for the array fragments. The reason is that the stream is ordered, thus a thread processing an array fragment doesn’t know whether it can process all elements as that depends on the information how many previous elements other threads are processing. This is documented:
“… it can be quite expensive on ordered parallel pipelines, especially for large values of maxSize, since limit(n) is constrained to return not just any n elements, but the first n elements in the encounter order.”
That’s a pity as we perfectly know that the combination of an infinite sequence returned by iterate with a limit(…) actually has an exactly known size. But the implementation doesn’t know. And the API doesn’t provide a way to create an efficient combination of the two. But we can do it ourselves:
static DoubleStream iterate(double seed, DoubleUnaryOperator f, long limit) {
    return StreamSupport.doubleStream(new Spliterators.AbstractDoubleSpliterator(limit,
            Spliterator.ORDERED | Spliterator.SIZED | Spliterator.IMMUTABLE | Spliterator.NONNULL) {
        long remaining = limit;
        double value = seed;

        @Override
        public boolean tryAdvance(DoubleConsumer action) {
            if(remaining == 0) return false;
            double d = value;
            if(--remaining > 0) value = f.applyAsDouble(d);
            action.accept(d);
            return true;
        }
    }, false);
}
Once we have such an iterate-with-limit method we can use it like
iterate(1d, d -> -(d+2*(Math.abs(d)/d)), 999999999L).parallel().map(d->4.0d/d).sum()
this still doesn’t benefit much from parallel execution due to the sequential nature of the source, but it works. On my four core machine it managed to get roughly 20% gain.
This is because the default ForkJoinPool implementation used by the parallel() method does not limit the number of threads that get created. The solution is to provide a custom ForkJoinPool that limits the number of threads it executes in parallel. This can be achieved as shown below:
ForkJoinPool forkJoinPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
// submit() returns a ForkJoinTask; call get() on it to wait for and retrieve the computed sum
forkJoinPool.submit(() -> DoubleStream.iterate(1d, d -> -(d + 2 * (Math.abs(d) / d))).parallel().limit(999999999L).map(d -> 4.0d / d).sum());