Is it OK to process data and count the processed records in such a way?
long count = userDao.findApprovedWithoutData().parallelStream().filter(u -> {
    Data d = dataDao.findInfoByEmail(u.getEmail());
    boolean ret = false;
    if (d != null) {
        String result = "";
        result += getFieldValue(d::getName, ". \n");
        result += getFieldValue(d::getOrganization, ". \n");
        result += getFieldValue(d::getAddress, ". \n");
        if (!result.isEmpty()) {
            u.setData(d.getInfo());
            userDao.update(u);
            ret = true;
        }
    }
    return ret;
}).count();
So, in short: iterate over the incomplete records, update each one if data is present, and count the number of updated records?
IMHO this is bad code, because:
The filter predicate has (quite significant) side effects
Predicates should not have side effects (just like getters shouldn't). It's unexpected, and that makes it bad.
The filter predicate is very inefficient
Each execution of the predicate causes a large chain of queries to fire, which makes this code not scalable.
At first glance, the main purpose seems to be getting a count, but really that's a minor (dispensable) bit of info
Good code makes it obvious what is going on (unlike this code)
You should change the code to use a (fairly simple) single update query (that employs a join) and get the count from the "number of rows updated" info in the result from the persistence API.
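For illustration, here is a rough sketch of what that could look like with a native query, assuming a JPA EntityManager and hypothetical table/column names (users, data, email, info, approved); the join-update syntax is MySQL-flavoured and would need adjusting for your dialect:
// Minimal sketch of the single-update-query idea, assuming a JPA EntityManager
// and hypothetical table/column names (users, data, email, info, approved).
// The join-update syntax below is MySQL-flavoured; adjust it to your dialect or to JPQL.
int updatedRows = entityManager.createNativeQuery(
        "UPDATE users u " +
        "JOIN data d ON d.email = u.email " +
        "SET u.data = d.info " +
        "WHERE u.approved = 1 AND u.data IS NULL")
    .executeUpdate();            // executeUpdate() reports the number of rows updated
long count = updatedRows;        // the count falls out of the update itself
This way the database does the join and the update in one round trip, and the "processed" count is simply the number of rows the statement touched.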
It depends on your definition of process. I cannot give you a clear yes or no, because it is hard to conclude without understanding your code and how it is implemented.
You are using a parallel Stream, and what happens there is that the Java runtime splits the Stream into sub-streams based on the number of available threads in the ForkJoinPool's common pool.
When using parallelism you need to be careful about possible side effects:
Interference (lambda expressions in a stream should not interfere)
Lambda expressions in stream operations should not interfere. Interference occurs when the source of a stream is modified while a pipeline processes the stream.
Stateful lambda expressions
Avoid using stateful lambda expressions as parameters in stream operations. A stateful lambda expression is one whose result depends on any state that might change during the execution of a pipeline.
Looking at your question and applying the above points to it.
Non-interference strongly states that lambda expressions should not interfere with the source of the stream (unless the stream source is concurrent) during the pipeline operation, because that can cause:
Exceptions (e.g. ConcurrentModificationException)
Incorrect answers
Nonconformant behaviour
The exception is well-behaved streams, where the modification takes place during an intermediate operation (i.e. filter); read more here.
Your lambda expression does interfere with the source of the stream, which is not advised, but the interference happens within an intermediate operation, so everything comes down to whether the stream is well-behaved or not. So you might consider re-thinking your lambda expression when it comes to interference. It may also come down to how you update the source of the stream via userDao.update, which is not clear from your question.
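For contrast, here is a minimal, hypothetical sketch of what interference with the stream source looks like: the lambda modifies the very collection that backs the stream, which typically fails fast:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of interference: the lambda modifies the very list
// that backs the stream, which typically throws ConcurrentModificationException
// once the pipeline detects the structural change.
List<String> source = new ArrayList<>(Arrays.asList("a", "b", "c"));
long n = source.stream()
        .filter(s -> {
            source.add(s + "!");   // interferes with the stream source
            return true;
        })
        .count();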
Stateful lambda expressions: your lambda expression does not seem to be stateful, because the result of the lambda depends on values that do not change during the execution of the pipeline. So this does not apply to your case.
I advise you to go through the documentation of Java 8 Stream, as well as this blog, which explains Java 8 Stream really well with examples.
Related
While looking into the source code of WrappingSpliterator::trySplit, I was very misled by its implementation:
@Override
public Spliterator<P_OUT> trySplit() {
    if (isParallel && buffer == null && !finished) {
        init();
        Spliterator<P_IN> split = spliterator.trySplit();
        return (split == null) ? null : wrap(split);
    }
    else
        return null;
}
And if you are wondering why this matters, it is because, for example, this:
Arrays.asList(1,2,3,4,5)
      .stream()
      .filter(x -> x != 1)
      .spliterator();
is using it. In my understanding, adding any intermediate operation to a stream will cause that code to be triggered.
Basically this method says that unless the stream is parallel, this Spliterator is treated as one that cannot be split at all. And this matters to me. In one of my methods (this is how I got to that code), I get a Stream as input and "parse" it into smaller pieces, manually, with trySplit. You can think, for example, that I am trying to do a findLast on a Stream.
And this is where my desire to split in smaller chunks is nuked, because as soon as I do:
Spliterator<T> sp = stream.spliterator();
Spliterator<T> prefixSplit = sp.trySplit();
I find out that prefixSplit is null, meaning that I basically can't do anything else other than consume the entire sp with forEachRemaining.
And this is a bit weird; maybe it makes some sense when filter is present, because in that case the only way (in my understanding) a Spliterator could be returned is by using some kind of buffer, maybe even one with a predefined size (much like Files::lines). But why this:
Arrays.asList(1,2,3,4)
      .stream()
      .sorted()
      .spliterator()
      .trySplit();
returns null is something I don't understand. sorted is a stateful operation that buffers the elements anyway, without actually reducing or increasing their initial number, so at least theoretically this can return something other than null...
When you invoke spliterator() on a Stream, there are only two possible outcomes with the current implementation.
If the stream has no intermediate operations you’ll get the source spliterator that has been used to construct the stream and whose splitting capability is entirely independent from the stream’s parallel state, as in fact, the spliterator doesn’t know anything about the stream.
Otherwise, you’ll get a WrappingSpliterator, which will encapsulate a source Spliterator and a pipeline state, expressed as PipelineHelper. This combination of Spliterator and PipelineHelper does not need to work in parallel and, in fact, would not work in case of distinct(), as the WrappingSpliterator will get an entirely different combination, depending on whether the Stream is parallel or not.
For stateless intermediate operations, it would not make a difference though. But, as discussed in “Why the tryAdvance of stream.spliterator() may accumulate items into a buffer?”, the WrappingSpliterator is a “one-fits-all implementation” that doesn’t consider the actual nature of the pipeline, so its limitations are the superset of all possible limitations of all supported pipeline stages. So the existence of one scenario that wouldn’t work when ignoring the parallel flag is enough to forbid splitting for all pipelines when not being parallel.
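A quick way to see the two outcomes side by side (a hedged sketch; the exact behaviour is an implementation detail of the current JDK):
import java.util.Arrays;
import java.util.Spliterator;

// No intermediate operations: you get the source spliterator, which splits fine.
Spliterator<Integer> plain = Arrays.asList(1, 2, 3, 4, 5).stream().spliterator();
System.out.println(plain.trySplit() != null);        // true

// Intermediate operation on a sequential stream: WrappingSpliterator refuses to split.
Spliterator<Integer> wrappedSeq = Arrays.asList(1, 2, 3, 4, 5).stream()
        .filter(x -> x != 1)
        .spliterator();
System.out.println(wrappedSeq.trySplit() != null);   // false

// Same pipeline, but parallel: splitting is allowed again.
Spliterator<Integer> wrappedPar = Arrays.asList(1, 2, 3, 4, 5).parallelStream()
        .filter(x -> x != 1)
        .spliterator();
System.out.println(wrappedPar.trySplit() != null);   // may be non-null, source permitting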
While going through articles on sequential streams, a question came to my mind: are there any performance benefits to using sequential streams over traditional for loops, or are streams just syntactic sugar with an additional performance overhead?
Consider the example below, where I cannot see any performance benefit from using sequential streams:
Stream.of("d2", "a2", "b1", "b3", "c")
.filter(s -> {
System.out.println("filter: " + s);
return s.startsWith("a");
})
.forEach(s -> System.out.println("forEach: " + s));
Using classic Java:
String[] strings = {"d2", "a2", "b1", "b3", "c"};
for (String s : strings) {
    System.out.println("Before filtering: " + s);
    if (s.startsWith("a")) {
        System.out.println("After Filtering: " + s);
    }
}
The point here is that in streams, processing of a2 starts only after all the operations on d2 are complete (earlier I thought that while d2 was being processed by forEach, filter would have started operating on a2, but that is not the case, as per this article: https://winterbe.com/posts/2014/07/31/java8-stream-tutorial-examples/). The same is the case with classic Java. So what should the motivation for using streams be, beyond an "expressive" and "elegant" coding style? I know there are performance overheads while handling streams; has anyone seen or experienced any performance benefits while using sequential streams?
First of all, leaving special cases aside, like omitting a redundant sorted operation or returning the known size on count(), the time complexity of an operation usually doesn’t change, so all differences in execution timing are usually about a constant offset or a (rather small) factor, not fundamental changes.
You can always write a manual loop doing basically the same as the Stream implementation does internally. So, internal optimizations, as mentioned by this answer could always get dismissed with “but I could do the same in my loop”.
But… when we compare “the Stream” with “a loop”, is it really reasonable to assume that all manual loops are written in the most efficient manner for the particular use case? A particular Stream implementation will apply its optimizations to all use cases where applicable, regardless of the experience level of the calling code’s author. I’ve already seen loops missing the opportunity to short-circuit or performing redundant operations not needed for a particular use case.
Another aspect is the information needed to perform certain optimizations. The Stream API is built around the Spliterator interface which can provide characteristics of the source data, e.g. it allows to find out whether the data has a meaningful order needed to be retained for certain operations or whether it is already pre-sorted, to the natural order or with a particular comparator. It may also provide the expected number of elements, as an estimate or exact, when predictable.
A method receiving an arbitrary Collection, to implement an algorithm with an ordinary loop, would have a hard time to find out, whether there are such characteristics. A List implies a meaningful order, whereas a Set usually does not, unless it’s a SortedSet or a LinkedHashSet, whereas the latter is a particular implementation class, rather than an interface. So testing against all known constellations may still miss 3rd party implementations with special contracts not expressible by a predefined interface.
Of course, since Java 8, you could acquire a Spliterator yourself, to examine these characteristics, but that would change your loop solution to a non-trivial thing and also imply repeating the work already done with the Stream API.
There’s also another interesting difference between Spliterator based Stream solutions and conventional loops, using an Iterator when iterating over something other than an array. The pattern is to invoke hasNext on the iterator, followed by next, unless hasNext returned false. But the contract of Iterator does not mandate this pattern. A caller may invoke next without hasNext, even multiple times, when it is known to succeed (e.g. you do already know the collection’s size). Also, a caller may invoke hasNext multiple times without next in case the caller did not remember the result of the previous call.
As a consequence, Iterator implementations have to perform redundant operations, e.g. the loop condition is effectively checked twice, once in hasNext, to return a boolean, and once in next, to throw a NoSuchElementException when not fulfilled. Often, the hasNext has to perform the actual traversal operation and store the result into the Iterator instance, to ensure that the result stays valid until the subsequent next call. The next operation in turn, has to check whether such a traversal did already happen or whether it has to perform the operation itself. In practice, the hot spot optimizer may or may not eliminate the overhead imposed by the Iterator design.
In contrast, the Spliterator has a single traversal method, boolean tryAdvance(Consumer<? super T> action), which performs the actual operation and returns whether there was an element. This simplifies the loop logic significantly. There’s even the void forEachRemaining(Consumer<? super T> action) for non-short-circuiting operations, which allows the actual implementation to provide the entire looping logic. E.g., in case of ArrayList the operation will end up at a simple counting loop over the indices, performing a plain array access.
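As a rough sketch of the difference in loop shape (an illustration of the two APIs, not a claim about real-world performance):
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;

List<String> data = Arrays.asList("a", "b", "c");

// Iterator style: the "is there more?" check and the "give me the next element"
// step are separate calls, so implementations typically validate the position twice.
Iterator<String> it = data.iterator();
while (it.hasNext()) {
    String s = it.next();
    System.out.println(s);
}

// Spliterator style: a single tryAdvance call both checks and consumes, and
// forEachRemaining can hand the whole loop over to the source implementation.
Spliterator<String> sp = data.spliterator();
while (sp.tryAdvance(s -> System.out.println(s))) {
    // the element has already been handled by the consumer
}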
You may compare such design with, e.g. readLine() of BufferedReader, which performs the operation and returns null after the last element, or find() of a regex Matcher, which performs the search, updates the matcher’s state and returns the success state.
But the impact of such design differences is hard to predict in an environment with an optimizer designed specifically to identify and eliminate redundant operations. The takeaway is that there is some potential for Stream based solutions to turn out to be even faster, while it depends on a lot of factors whether it will ever materialize in a particular scenario. As said at the beginning, it’s usually not changing the overall time complexity, which would be more important to worry about.
Streams might apply (and already do apply) some tricks under the hood that a traditional for-loop does not. For example:
Arrays.asList(1,2,3)
      .stream()
      .map(x -> x + 1)
      .count();
Since Java 9, map will be skipped here, because its result cannot change the count, which is all you asked for.
Or the internal implementation might check whether a certain data structure is already sorted, for example:
someSource.stream()
.sorted()
....
If someSource is already sorted (like a TreeSet), sorted would be a no-op in that case. Many of these optimizations are done internally, and there is room for even more that may be added in the future.
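How the library can know this is visible through the Spliterator characteristics; a small, hedged illustration (a TreeSet reports SORTED for its natural order):
import java.util.Arrays;
import java.util.Spliterator;
import java.util.TreeSet;

// A TreeSet's spliterator advertises that its elements are already sorted in
// natural order, which is the information that can let sorted() become a no-op.
TreeSet<Integer> someSource = new TreeSet<>(Arrays.asList(3, 1, 2));
Spliterator<Integer> sp = someSource.spliterator();
System.out.println(sp.hasCharacteristics(Spliterator.SORTED));   // true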
If you still wanted to use streams, you could have created a stream out of your array using Arrays.stream and used forEach:
Arrays.stream(strings).forEach(s -> {
    System.out.println("Before filtering: " + s);
    if (s.startsWith("a")) {
        System.out.println("After Filtering: " + s);
    }
});
On the performance note, since you are going to traverse the entire array anyway, there is no specific benefit from using streams over loops. More about it has been discussed in In Java, what are the advantages of streams over loops? and other linked questions.
If using a stream, we can use parallel(), as below:
Stream<String> stringStream = Stream.of("d2", "a2", "b1", "b3", "c")
.parallel()
.filter(s -> s.startsWith("d"));
It can be faster because your computer will normally be able to run more than one thread at the same time.
A test:
@Test
public void forEachVsStreamVsParallelStream_Test() {
    IntStream range = IntStream.range(Integer.MIN_VALUE, Integer.MAX_VALUE);
    StopWatch stopWatch = new StopWatch();

    stopWatch.start("for each");
    int forEachResult = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
        if (i % 15 == 0)
            forEachResult++;
    }
    stopWatch.stop();

    stopWatch.start("stream");
    long streamResult = range
            .filter(v -> (v % 15 == 0))
            .count();
    stopWatch.stop();

    range = IntStream.range(Integer.MIN_VALUE, Integer.MAX_VALUE);
    stopWatch.start("parallel stream");
    long parallelStreamResult = range
            .parallel()
            .filter(v -> (v % 15 == 0))
            .count();
    stopWatch.stop();

    System.out.println(String.format("forEachResult: %s%s" +
                    "parallelStreamResult: %s%s" +
                    "streamResult: %s%s",
            forEachResult, System.lineSeparator(),
            parallelStreamResult, System.lineSeparator(),
            streamResult, System.lineSeparator()));
    System.out.println("prettyPrint: " + stopWatch.prettyPrint());
    System.out.println("Time Elapsed: " + stopWatch.getTotalTimeSeconds());
}
I am trying to use Java Streams to make the sequential processing of a list of customers run in parallel. This is a short-term band-aid to a problem that we are solving as part of re-architecting our entire system.
What I am starting with is a List<Customer> Customers structure that contains the customer contact information and all the relevant transaction data. Conceptually, the code I am replacing looks like:
long emailsSent = 0;
List<Customer> customers = methodLoadingAllrelevantData();
for (Customer customer : customers) {
    boolean isEmailSent = sendEmail(customer);
    if (isEmailSent) {
        emailsSent++;
    }
}
The sendEmail(customer) function:
Determines if an email should be sent
Formats the email
Attempts to send the email
Returns true if the email was sent successfully
Not great code, but I am just trying to get some more speed from the existing code, not trying to make it better. The method and all its calls are 100-percent thread-safe.
I put it in the following stream structure:
ForkJoinPool limitedParallelThreadPool = new ForkJoinPool(numberOfThreads);
emailsSent = limitedParallelThreadPool.submit(() ->
        customers.stream().parallel()
                .map(this::_emailCustomer)
                .filter(b -> b == true).count()
).get();
This does work as expected, returning the same data as the sequential version.
My questions are: because the purpose of my method is to generate an email, is it poor practice for me to use a map function? Is there a better answer? I am, in effect, mapping a Customer to a boolean, but part of this mapping requires the process to trigger an email.
I was originally trying to use a forEach() operator, but I could not figure out how to get the count without adding state information to the sendEmail function, which interferes with it being thread-safe.
Returns true if the email was sent successfully
It wouldn't be the worst idea to take advantage of the fact that your _emailCustomer method returns a boolean, so you can use Stream#filter instead of a combination of both Stream#map and Stream#filter:
customers.parallelStream()
         .filter(this::_emailCustomer)
         .count()
To answer your question, though, it would depend on the use-case whether or not Stream#map is the correct intermediate operation to use. According to the documentation of Stream#map, the Function argument that the method accepts must be:
a non-interfering, stateless function to apply to each element
If your _emailCustomer method is either interfering or stateful, then I would refrain from calling it within Stream#map, especially in a parallel context.
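Putting that together with the bounded pool from the question, a rough sketch (keeping the question's own hypothetical names _emailCustomer, customers and numberOfThreads) might look like this:
import java.util.concurrent.ForkJoinPool;

// Rough sketch combining the question's bounded ForkJoinPool with filter instead
// of map + filter; _emailCustomer, customers and numberOfThreads are the question's
// own (hypothetical) names. get() can throw InterruptedException/ExecutionException,
// which should be handled as appropriate for your code.
ForkJoinPool pool = new ForkJoinPool(numberOfThreads);
try {
    long emailsSent = pool.submit(() ->
            customers.parallelStream()
                     .filter(this::_emailCustomer)   // true means the email was sent
                     .count()
    ).get();
} finally {
    pool.shutdown();   // release the pool's worker threads when done
}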
Since you don't care in which order those emails get sent, I'd say you are kind of OK in this example. It's just that you are relying on side effects of the map intermediate operation, and that can potentially bite you. For example, this:
Stream.of(1,2,3,4)
      .map(x -> x + 1)
      .count();
will not execute map at all (starting with Java 9), since all you want is the count and map cannot change it. Your example is safe from that because you are filtering, so the final count is not known up front and map has to be executed. As said, in a parallel environment there is no guarantee about the order in which map will be executed.
It's a pity, though, that your sendEmail returns anything at all; all the email services I have written were more of an event thing - fire and forget - but I can't tell what your exact scenario needs.
Also think about the fact that your map operation will block until you get a response back, and that touches on this part of the ForkJoinPool documentation that you need to look at:
A ForkJoinPool is constructed with a given target parallelism level; by default, equal to the number of available processors. The pool attempts to maintain enough active (or available) threads by dynamically adding, suspending, or resuming internal worker threads, even if some tasks are stalled waiting to join others
According to the Javadocs for SE 8, Stream.map() does the following:
Returns a stream consisting of the results of applying the given function to the elements of this stream.
However, a book I'm reading (Learning Network Programming with Java, Richard M. Reese) on networking implements roughly the following code snippet in an echo server.
Supplier<String> inputLine = () -> {
    try {
        return br.readLine();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
};
Stream.generate(inputLine).map((msg) -> {
    System.out.println("Received: " + (msg == null ? "end of stream" : msg));
    out.println("echo: " + msg);
    return msg;
}).allMatch((msg) -> msg != null);
This is supposed to be a functional way to accomplish getting user input and printing it to the socket output stream. It works as intended, but I don't quite understand how. Is it because map knows the stream is infinite, so it executes lazily as new stream tokens become available? It seems like adding something to a collection currently being iterated over by map is a little black magic. Could someone please help me understand what is going on behind the scenes?
Here is how I restated this in order to avoid the confusing map usage. I believe the author was trying to avoid an infinite loop since you can't break out of a forEach.
Stream.generate(inputLine).allMatch((msg) -> {
    boolean alive = msg != null;
    System.out.println("Received: " + (alive ? msg : "end of stream"));
    out.println("echo: " + msg);
    return alive;
});
Streams are lazy. Think of them as workers in a chain that pass buckets to each other. The laziness is in the fact that they will only ask the worker behind them for the next bucket if the worker in front of them asks them for it.
So it's best to think about this as allMatch - being a final action, thus eager - asking the map stream for the next item, and the map stream asking the generate stream for the next item, and the generate stream going to its supplier, and providing that item as soon as it arrives.
It stops when allMatch stops asking for items. And it does so when it knows the answer. Are all items in this stream not null? As soon as the allMatch receives an item that is null, it knows the answer is false, and will finish and not ask for any more items. Because the stream is infinite, it will not stop otherwise.
So you have two factors causing this to work the way it work - one is allMatch asking eagerly for the next item (as long as the previous ones weren't null), and the generate stream that - in order to supply that next item - may need to wait for the supplier that waits for the user to send more input.
But it should be said that map shouldn't have been used here. There should not be side effects in map - it should be used for mapping an item of one type to an item of another type. I think this example was used only as a study aid. The much simpler and more straightforward way would be to use BufferedReader's lines() method, which gives you a finite Stream of the lines coming from the buffered reader.
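A minimal sketch of that alternative, assuming the question's br and out variables (lines() ends the stream when the reader reaches end of stream, so no null sentinel is needed):
// Minimal sketch using BufferedReader.lines() instead of Stream.generate + map;
// br and out are the question's own (hypothetical) reader and writer.
br.lines().forEach(msg -> {
    System.out.println("Received: " + msg);
    out.println("echo: " + msg);
});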
Yes - Streams are set up lazily until and unless you perform a terminal operation (final action) on the Stream. Or, put more simply:
For as long as the operations on your stream return another Stream, you do not have a terminal operation, and you keep on chaining until you have something returning anything other than a Stream, including void.
This makes sense, as to be able to return anything other than a Stream, the operations earlier in your stream will need to be evaluated to actually be able to provide the data.
In this case, and as per documentation, allMatch returns a boolean, and thus final execution of your stream is required to calculate that boolean. This is the point also where you provide a Predicate limiting your resulting Stream.
Also note that in the documentation it states:
This is a short-circuiting terminal operation.
Follow that link for more information on those terminal operations, but a terminal operation basically means that it will actually execute the operation. Additionally, the limiting of your infinite stream is the 'short-circuiting' aspect of that method.
Here are the two most relevant parts of the java-stream documentation. The snippet you provided is a perfect example of them working together:
The documentation of Stream::generate(Supplier<T> s) says:
Returns an infinite sequential unordered stream where each element is generated by the provided Supplier.
The third bullet of the Stream package documentation:
Laziness-seeking. Many stream operations, such as filtering, mapping, or duplicate removal, can be implemented lazily, exposing opportunities for optimization. For example, "find the first String with three consecutive vowels" need not examine all the input strings. Stream operations are divided into intermediate (Stream-producing) operations and terminal (value- or side-effect-producing) operations. Intermediate operations are always lazy.
In short, the generated stream awaits further elements until the terminal operation decides to stop. As long as the supplied Supplier<T> keeps producing values, the stream pipeline continues.
As an example, if you provide the following Supplier, the execution has no chance to stop and will continue infinitely:
Supplier<String> inputLine = () -> {
    return "Hello world";
};
This question already has answers here:
In Java, what are the advantages of streams over loops? [closed]
Is there any advantage to using the stream filter operation over an iterator with a continue statement?
Example for iteration:
for (ApiSite apiSite : sites) {
    Site mSite = Site.getSiteByName(apiSite.getName());
    if (mSite == null || deletedSitesToSkip.contains(mSite)) {
        LOGGER.info("Skipping site: {} as this has been deleted ", apiSite.getName());
        continue;
    }
    // operation
}
stream with filter example:
sites.stream().filter(apiSite -> {
    Site mSite = Site.getSiteByName(apiSite.getName());
    return (mSite != null && !deletedSitesToSkip.contains(mSite));
}).map(//some operation);
First, the filter behavior depends on which terminal operation is invoked.
If a short-circuiting terminal operation, e.g. anyMatch, is invoked, the pipeline exits as soon as the first satisfying element is found.
If a non-short-circuiting terminal operation, e.g. count or collect, is invoked, the filter behaves just like yours: it continues and processes the next element until the last one (see the sketch below).
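A small illustration of that short-circuiting difference, using hypothetical values and a print inside the filter to make the evaluations visible:
import java.util.stream.Stream;

// With anyMatch (short-circuiting), filtering stops as soon as a match is found;
// with count (non-short-circuiting), every element passes through the filter.
boolean any = Stream.of("a1", "b1", "a2", "b2")
        .filter(s -> {
            System.out.println("filter: " + s);
            return s.startsWith("a");
        })
        .anyMatch(s -> true);    // prints "filter: a1" and then stops

long total = Stream.of("a1", "b1", "a2", "b2")
        .filter(s -> {
            System.out.println("filter: " + s);
            return s.startsWith("a");
        })
        .count();                // prints all four elements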
Second, the Stream API will make your code more expressive.
Third, the Stream API allows you to do some time-consuming operations in parallel, whereas a for loop takes more effort to achieve that, for example:
stream.parallel().map(it-> remoteCall(it)).collect(...);
Fourth, stream-style operations can be serialized and cached; the Apache Spark core and streaming APIs form a framework that processes distributed operations on nodes in a cluster.
The stream version has better semantics: you clearly operate on a set of data, filter it by a condition, and apply an operation on the matching elements. This adds to the readability, and allows you to easily extend the processing with elementary building blocks (filter, map) without the risk of breaking much (maintainability).
Moreover, the stream operations (for filtering, mapping, ...) can be designed as reusable building blocks (DRY principle), which can be individually tested, thus gaining reliability (see the sketch below).
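For example, the filtering condition could be extracted into a named, individually testable predicate; a sketch reusing the question's hypothetical ApiSite, Site and deletedSitesToSkip names:
import java.util.function.Predicate;

// A reusable, individually testable building block; ApiSite, Site and
// deletedSitesToSkip are the question's own (hypothetical) names.
Predicate<ApiSite> isActiveSite = apiSite -> {
    Site mSite = Site.getSiteByName(apiSite.getName());
    return mSite != null && !deletedSitesToSkip.contains(mSite);
};

sites.stream()
     .filter(isActiveSite)
     .forEach(apiSite -> {
         // some operation on the sites that passed the filter
     });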
All in all, those are non-functional benefits (the functionality of both variants is the same), adding to the quality of the code.
Besides, continue is worth avoiding completely, as it quickly leads to awkward code when nesting loop blocks, making it hard to read and maintain.