Akka stream sort by id in java - java

I need to sort my akka stream list by id in java
I have list of objects in akka source:
SystemCodeTracking{id=9, EntityName='table3'}
SystemCodeTracking{id=2, EntityName='table2'}
SystemCodeTracking{id=10, EntityName='table1'}
I need to sort it to:
SystemCodeTracking{id=2, EntityName='table2'}
SystemCodeTracking{id=9, EntityName='table3'}
SystemCodeTracking{id=10, EntityName='table1'}
Code should be like following:
Source<SystemCodeTracking, SourceQueueWithComplete<SystemCodeTracking>> loggedSource = source.map(single -> sortingFunction(single));
My question is how to do the sortingFunction ?

A stream is by definition unbounded, so providing a perfect ordering requires having observed all the data before emitting the first element.
However, there are plenty of situations where it can be assumed that a stream is only partially unsorted: data gets slightly mixed up due to concurrent processing, but each element ends up no further than, say, 1000 positions from its real position.
When that is the case, you can use a sort method with a buffer, like so:
/**
 * partial sort of a stream: wait for <bufferSize> elements to be buffered, then start flushing them out in order
 */
def sort[T, S](bufferSize: Int, order: T => S)(implicit ordering: Ordering[S]): () => T => Iterable[T] = () => {
var buffer = List.empty[T]
t: T => {
buffer = (buffer :+ t).sortBy(order)
if (buffer.size < bufferSize) Iterable.empty[T]
else {
val r = buffer.head
buffer = buffer.tail
List(r)
}
}
}
which can simply be used as part of a statefulMapConcat, as follows:
someSource
// sort the stream by timestamp, using a buffer of 1000
.statefulMapConcat(sort(1000, _.timestamp))
...
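Since the question asks for Java, here is a rough Java counterpart of the same buffering idea, offered as a sketch rather than a drop-in Akka stage. BufferedSorter, accept, drain and demo are hypothetical names, not part of any library. It uses a PriorityQueue instead of re-sorting a list on every element, and drain() is an addition: the Scala version above never emits the last bufferSize - 1 elements when the stream completes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper (not an Akka class): buffered partial sort of a stream of elements.
class BufferedSorter<T> {
    private final int bufferSize;
    private final PriorityQueue<T> buffer;

    BufferedSorter(int bufferSize, Comparator<? super T> order) {
        this.bufferSize = bufferSize;
        this.buffer = new PriorityQueue<>(order);
    }

    // Adds one element; once the buffer holds bufferSize elements, emits the smallest one.
    List<T> accept(T element) {
        buffer.add(element);
        if (buffer.size() < bufferSize) {
            return Collections.emptyList();
        }
        return Collections.singletonList(buffer.poll());
    }

    // Assumption: called once at end of stream to emit the remaining elements in order.
    List<T> drain() {
        List<T> rest = new ArrayList<>();
        while (!buffer.isEmpty()) {
            rest.add(buffer.poll());
        }
        return rest;
    }

    // Small demonstration with a buffer of 3 on slightly shuffled ids.
    static List<Integer> demo() {
        BufferedSorter<Integer> sorter = new BufferedSorter<>(3, Comparator.naturalOrder());
        List<Integer> out = new ArrayList<>();
        for (int id : new int[] {9, 2, 10, 5, 11}) {
            out.addAll(sorter.accept(id));
        }
        out.addAll(sorter.drain());
        return out; // [2, 5, 9, 10, 11]
    }
}
```

The output is only fully sorted when no element arrives more than bufferSize - 1 positions late, which is exactly the assumption stated above.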

Sorting basically kills the nature of your stream, as you'll need to consume the whole stream - i.e. fitting it into memory - to apply a sorting function. Anyway, it is possible by exhausting the whole source to a Sink.seq, and then sort the result.
source.runWith(Sink.seq(), materializer)
and then on the completion stage result call
sortingFunction(result)
If you want to sort chunks of the source, and not the whole thing, you can do something like
source.grouped(10).map(batch -> sortingFunction(batch))
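For either variant, sortingFunction only has to sort an in-memory list. A minimal sketch in plain Java, assuming SystemCodeTracking exposes its id through a getId() accessor (the nested class below is a hypothetical stand-in for the real one):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class BatchSorting {
    // Hypothetical stand-in for the question's class; only the id matters here.
    static final class SystemCodeTracking {
        final int id;
        final String entityName;

        SystemCodeTracking(int id, String entityName) {
            this.id = id;
            this.entityName = entityName;
        }

        int getId() {
            return id;
        }
    }

    // Returns a new list sorted by id; the input is copied because it may be unmodifiable.
    static List<SystemCodeTracking> sortingFunction(List<SystemCodeTracking> batch) {
        List<SystemCodeTracking> sorted = new ArrayList<>(batch);
        sorted.sort(Comparator.comparingInt(SystemCodeTracking::getId));
        return sorted;
    }

    // Sorts the three sample objects from the question and returns their ids.
    static List<Integer> demoIds() {
        List<SystemCodeTracking> batch = Arrays.asList(
                new SystemCodeTracking(9, "table3"),
                new SystemCodeTracking(2, "table2"),
                new SystemCodeTracking(10, "table1"));
        List<Integer> ids = new ArrayList<>();
        for (SystemCodeTracking s : sortingFunction(batch)) {
            ids.add(s.getId());
        }
        return ids; // [2, 9, 10]
    }
}
```

With grouped(10), each batch is sorted independently, so the overall stream is only sorted within each group of ten.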

You cannot sort a sequence (be it Akka Stream or anything else) if you can't find a minimum element of that sequence (which is to be emitted first in the resulting sorted stream). This is usually the case if the stream is infinite.
In some cases, you can sort even an infinite stream within a low memory limit, e.g. if you can determine the next minimal element by looking at only the last N elements. Or you can use case-specific knowledge that, once some_condition is satisfied, elements of some_class will never occur in your stream again.
Otherwise your only option is to escalate sorting problem one level up: do you really need strong order in your stream? Maybe it's enough to just partition the stream - i. e. split it into sub-streams of elements having the same key?
I guess the reason why Akka Streams has no Flow.sort out of the box is that Akka Streams are all about boundedness of resource consumption, and sorting streams requires O(n) of memory.

Related

How do elements go through the stream?

How do the elements of a stream go through the stream itself? Does it take one element and pass it through all functions (map, then sort, then collect), then take the second element and repeat the cycle? Or does it take all elements, map them, then sort them and finally collect them?
new ArrayList<Integer>().stream()
.map(x -> x.byteValue())
.sorted()
.collect(Collectors.toList());
It depends entirely on the stream. It is usually evaluated lazily, which means it takes it one element at a time, but under certain conditions it needs to get all the elements before it continues to the next step. For example, consider the following code:
IntStream.generate(() -> (int) (Math.random() * 100))
.limit(20)
.filter(i -> i % 2 == 0)
.sorted()
.forEach(System.out::println);
This stream generates random numbers from 0 to 99, limits them to 20 elements, and then filters them by checking whether they are even; only the even ones continue. Up to this point, everything is done one element at a time. The change comes when you request a sorting of the stream. The sorted() method sorts the stream by the natural ordering of the elements, or by a provided comparator. To sort something you need access to all elements, because you don't know the last element's value until you get it, and it could turn out to be the first element after sorting. So this method waits for the entire stream, sorts it and returns the sorted stream. After that, this code just prints the sorted stream one element at a time.
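The barrier introduced by sorted() can be made visible by recording when each element passes the stages before and after it. The sketch below uses a small fixed input instead of random numbers so the trace is reproducible; SortedBarrierDemo and trace are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class SortedBarrierDemo {
    // Records the order in which elements reach the stages before and after sorted().
    static List<String> trace() {
        List<String> trace = new ArrayList<>();
        Stream.of(3, 1, 2)
              .peek(x -> trace.add("before " + x))
              .sorted()
              .forEach(x -> trace.add("after " + x));
        return trace;
    }

    public static void main(String[] args) {
        // Prints [before 3, before 1, before 2, after 1, after 2, after 3]
        System.out.println(trace());
    }
}
```

All "before" entries appear ahead of the first "after" entry: nothing leaves sorted() until the whole input has been consumed.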
That depends on the actual Stream implementation. This mostly applies to parallel streams, because spliterators tend to split the data into chunks and you don't know which element will be processed when.
In general, a stream goes through each element in order (but doesn't have to). The simplest way to check this behaviour is to put in some breakpoints and see when they actually hit.
Also, certain operations may wait until all prior operations are executed (namely collect()).
I advise to check the javadoc and read it carefully, because it gives away enough hints to get an expectation.
Something like this, yes.
If you have a stream of integers, let's say 1,2,3,4,5, and you do some operations on it, let's say stream().map(x -> x*3).filter(x -> x%2==0).findFirst(),
it will first take the first value (1), it will be multiplied by 3, and then it will check if it's even.
Because it's not, it will take the second one (2), multiply by 3 (=6), check if it is even (it is), find first.
this will be the first one and now it stops and returns.
Which means the other integers from the stream won't be evaluated (multiplied and checked if even) as it is not necessary
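The walkthrough above can be checked by logging each step of that pipeline. This is a small sketch; FindFirstDemo and its method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

class FindFirstDemo {
    // Runs map -> filter -> findFirst on 1..5 and records every call made along the way.
    static List<String> trace() {
        List<String> trace = new ArrayList<>();
        Optional<Integer> first = Stream.of(1, 2, 3, 4, 5)
                .map(x -> { trace.add("map " + x); return x * 3; })
                .filter(x -> { trace.add("filter " + x); return x % 2 == 0; })
                .findFirst();
        trace.add("result " + first.get());
        return trace;
    }

    public static void main(String[] args) {
        // Prints [map 1, filter 3, map 2, filter 6, result 6]
        System.out.println(trace());
    }
}
```

The values 3, 4 and 5 never reach map or filter, because findFirst stops the pipeline as soon as it has a result.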

Length of an infinite IntStream?

I have created a random int stream like this:
final static PrimitiveIterator.OfInt startValue = new Random().ints(0, 60).iterator();
The documentation says this stream is actually endless.
I want to understand what happens there in the background.
ints(0,60) generates an infinite stream of integers. If it is infinite, why is my machine not leaking any memory?
I wonder how many numbers are actually generated, and whether this implementation can cause an error at the point where the stream eventually ends. Or is this stream constantly filled with new integers on the fly, so that it really never ends?
And while I am asking: what is the best practice nowadays for generating random numbers?
The stream is infinite¹ so you can generate as many ints as you want without running out. It does not mean that it keeps generating them when you aren't asking for any.
How many numbers are actually generated depends on the code you write. Every time you retrieve a value from the iterator, a value is generated. None is generated in the background, so there's no "extra" memory being used.
¹ as far as your lifetime is concerned, see Eran's answer
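One way to see that nothing is generated in the background is to count how often the generator is actually invoked. Random.ints offers no hook for that, so the sketch below swaps in IntStream.generate with a counting supplier, which is an assumption made purely for the demonstration; OnDemandDemo is an illustrative name.

```java
import java.util.PrimitiveIterator;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

class OnDemandDemo {
    // Returns the number of generator calls before any read and after three reads.
    static int[] callsBeforeAndAfterThreeReads() {
        Random random = new Random();
        AtomicInteger calls = new AtomicInteger();
        PrimitiveIterator.OfInt it = IntStream.generate(() -> {
            calls.incrementAndGet();       // count every value that is really produced
            return random.nextInt(60);
        }).iterator();

        int before = calls.get();          // creating the iterator produces nothing
        it.nextInt();
        it.nextInt();
        it.nextInt();
        return new int[] {before, calls.get()};
    }

    public static void main(String[] args) {
        int[] calls = callsBeforeAndAfterThreeReads();
        System.out.println("before reads: " + calls[0] + ", after three reads: " + calls[1]);
    }
}
```

The counter stays at zero until the first nextInt() call and then grows only as values are read, which is why an effectively endless stream costs no extra memory.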
To be exact,
IntStream java.util.Random.ints(int randomNumberOrigin, int randomNumberBound) returns:
an effectively unlimited stream of pseudorandom int values, each conforming to the given origin (inclusive) and bound (exclusive).
This doesn't mean infinite. Looking at the Javadoc, you'll see an implementation note stating that it actually limits the returned IntStream to Long.MAX_VALUE elements:
Implementation Note:
This method is implemented to be equivalent to ints(Long.MAX_VALUE, randomNumberOrigin, randomNumberBound).
Of course Long.MAX_VALUE is a very large number, and therefore the returned IntStream can be seen as "effectively" without limit. For example, if you consume 1000000 ints of that stream every second, it will take you about 292471 years to run out of elements.
That said, as mentioned by the other answers, that IntStream only generates as many numbers as are required by its consumer (i.e. the terminal operation that consumes the ints).
Streams do not (in general; exceptions are noted further down) store all of their elements in any kind of a data structure:
No storage. A stream is not a data structure that stores elements; instead, it conveys elements from a source such as a data structure, an array, a generator function, or an I/O channel, through a pipeline of computational operations.
Instead, each stream element is computed one-by-one each time the stream advances. In your example, each random int would actually be computed when you invoke startValue.nextInt().
So when we do e.g. new Random().ints(0,60), the fact that the stream is effectively infinite isn't a problem, because no random ints are actually computed until we perform some action that traverses the stream. Once we do traverse the stream, ints are only computed when we request them.
Here's a small example using Stream.generate (also an infinite stream) which shows this order of operations:
Stream.generate(() -> {
System.out.println("generating...");
return "hello!";
})
.limit(3)
.forEach(elem -> {
System.out.println(elem);
});
The output of that code is:
generating...
hello!
generating...
hello!
generating...
hello!
Notice that our generator supplier is called once just before every call of our forEach consumer, and no more. If we didn't use limit(3), the program could run forever, but it wouldn't run out of memory.
If we did new Random().ints(0,60).forEach(...), it would work the same way. The stream would do random.nextInt(60) once before every call to the forEach consumer. The elements wouldn't be accumulated anywhere unless we used some action that required it, such as distinct() or a collector instead of forEach.
Some streams probably use a data structure behind the scenes for temporary storage. For example, it's common to use a stack during tree traversal algorithms. Also, some streams such as those created using a Stream.Builder will require a data structure to put their elements in.
As @Kayaman said in his answer, the stream is infinite in the sense that numbers can be generated forever. The point lies in the word can: it only generates numbers when you actually request them. It will not generate X numbers up front and store them somewhere (unless you tell it to do so).
So if you want to generate n random numbers (where n is an integer), you can call the overload ints(n, 0, 60) on the Random instance instead of ints(0, 60):
new Random().ints(n, 0, 60)
Above will still not generate n random numbers, because it is an IntStream which is lazily executed. So when not using a terminal operation (e.g. collect() or forEach()) nothing really happens.
Creating a generator does not generate any numbers. In concept, this generator will continually generate new numbers; there is no point at which it would not return the next value when asked.
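A short sketch of the bounded overload combined with a terminal operation, so that the numbers are actually produced. BoundedRandomDemo and randomInts are illustrative names; toArray() is just one possible terminal operation.

```java
import java.util.Arrays;
import java.util.Random;

class BoundedRandomDemo {
    // Generates exactly n pseudorandom ints in [0, 60); toArray() triggers the generation.
    static int[] randomInts(int n) {
        return new Random().ints(n, 0, 60).toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(randomInts(5)));
    }
}
```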

Sorted stream derived from Infinite Stream fails to iterate

import java.util.stream.*;
import java.util.*;
class TestInfiniteStream {
public static void main(String args[]) {
IntStream infiniteStream = new Random().ints();
IntStream sortedStream = infiniteStream.sorted();
sortedStream.forEach(i -> System.out.println(i));
}
}
After compiling and executing this code I get the following error.
Exception in thread "main" java.lang.IllegalArgumentException: Stream size exceeds max array size
Does sorting a stream fail on an infinite stream?
The simple answer to “Does sorting a stream fail on an infinite stream?” is “Yes.” sorted() is a stateful intermediate operation which has been implemented by buffering the entire contents and sorting it, before passing any elements to the downstream operations.
In theory, it doesn’t need to be that way. Since you are using forEach, which has been explicitly specified as processing the elements in an undefined order, the sorting step could be omitted in your new Random().ints().sorted().forEach(System.out::println); use case. But even if you used forEachOrdered, there is a theoretically achievable correct answer. Since your stream is infinite and will repeatedly contain all int values, a correct sorted output would print -2147483648 (==Integer.MIN_VALUE) forever, as that’s the smallest value that is contained infinite times in that stream.
However, to give this correct answer, the implementation would need specific code to handle this scenario, which is of not much practical value. Instead, the implementation handles this case like any other sorting of a stream scenario, which will fail for infinite streams.
In this specific case, the stream has an optimization that leads to a different, unusual exception message. As Eugene pointed out, this stream behaves like a fixed size stream of Long.MAX_VALUE (==2⁶³−1) elements rather than a truly infinite stream. That’s fair, considering that the stream produced by Random will repeat after 2⁴⁸ values, so the entire sequence will have been repeated about 32768 times before the stream ends instead of running forever. You are unlikely to witness this “sudden” ending after processing 9223372036854775807 elements anyway. But a consequence of this optimization is that the stream will fail fast with the “Stream size exceeds max array size” message instead of failing with an “OutOfMemoryError” after some processing.
If you eliminate the size information, e.g. via
new Random().ints().filter(x -> true).sorted().forEach(System.out::println);
the operation will try to buffer until failing with java.lang.OutOfMemoryError. The same happens with
IntStream.generate(new Random()::nextInt).sorted().forEach(System.out::println);
which provides no size information to the stream in the first place. In either case, it never gets to sort anything as the buffering happens before the sorting starts.
If you want to get “sorted runs for some limit of elements” as you said in a comment, you have to apply a limit before sorting, e.g.
new Random().ints().limit(100).sorted().forEach(System.out::println);
though it will be more efficient to still use a sized stream, e.g.
new Random().ints(100).sorted().forEach(System.out::println);
No, you cannot sort an infinite stream.
Your infinite stream new Random().ints() produces more integers than can be stored in any array, and an array is what is used behind the scenes to hold the integers to be sorted. An array of course cannot hold an infinite number of integers; its capacity is at most close to Integer.MAX_VALUE elements.
Taking a step back, how would anyone or anything sort an infinite number of integers? Nothing can and no one can; it would take at least an infinite amount of time.
Does sorting a stream fail on an infinite stream?
You've kind of answered your own question; the IllegalArgumentException is the specific cause of the failure. It only occurred because you made a program that attempted to do it, and you ran up against a Java array limitation.
The sorted method will attempt to read the entire stream before sorting anything. It won't do any intermediate sorting before it has read the entire stream, so no sorting will be done, partial or full.
Well that is an interesting question, unfortunately your key description is in comments. First of all, there isn't really an "infinite" stream in your case, the responsible spliterator RandomIntsSpliterator even says that:
... and also by treating "infinite" as equivalent to Long.MAX_VALUE...
And now the interesting part: that spliterator will report these characteristics:
public int characteristics() {
return (Spliterator.SIZED | Spliterator.SUBSIZED |
Spliterator.NONNULL | Spliterator.IMMUTABLE);
}
Well, I don't know the reasons why, but reporting SIZED or SUBSIZED for an infinite stream... Maybe it does not matter either (because you usually chain these with a limit).
Well because SIZED is reported (with a size of Long.MAX_VALUE), there is a sink for this internally SizedIntSortingSink that has a check:
if (size >= Nodes.MAX_ARRAY_SIZE)
throw new IllegalArgumentException(Nodes.BAD_SIZE);
which obviously will fail.
In contrast, IntStream.generate does not report SIZED, which makes sense to me, so the entire input has to be buffered by sorted and later handed to the terminal operation; obviously this will fail with an OutOfMemoryError.
It's also interesting to prove that distinct will not act as a full barrier here, waiting for all values to be processed. Instead it can pass them to the terminal operation, once it knows that this has not been seen before:
Random r = new Random();
IntStream.generate(() -> r.nextInt())
.distinct()
.forEach(System.out::println);
This will run for quite a while before finally dying with an OutOfMemoryError. Or, as very nicely added by Holger in the comments, it could hang too:
IntStream.generate(() -> new Random().nextInt(2))
.distinct()
.forEach(System.out::println);
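That distinct forwards each new value immediately can also be shown with a variant that terminates. The sketch below adds limit(3) behind distinct on an infinite generator; if distinct were a full barrier like sorted, this call would never return. The class name and the bound of 10 are arbitrary choices for the illustration.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.stream.IntStream;

class DistinctPassThroughDemo {
    // Takes the first three distinct values from an infinite stream of ints in [0, 10).
    static int[] firstThreeDistinct() {
        Random r = new Random();
        return IntStream.generate(() -> r.nextInt(10))
                        .distinct()   // forwards a value as soon as it is seen for the first time
                        .limit(3)     // short-circuits once three values have passed
                        .toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(firstThreeDistinct()));
    }
}
```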

Java8 - Count after filter on stream

I hope this question was not asked before.
In java 8, I have an array of String myArray in input and an integer maxLength.
I want to count the number of strings in my array shorter than maxLength. I WANT to use a stream to solve this.
For that I thought to do this :
long solution = Arrays.stream(myArray).filter(s -> s.length() <= maxLength).count();
However I'm not sure if it is the right way to do this. It will need to go through first array once and then go through the filtered array to count.
But if I don't use a stream, I could easily write an algorithm that loops over myArray only once.
My questions are very easy: Is there a way to solve this with the same time performance as a loop? Is it always a "good" solution to use a stream?
However I'm not sure if it is the right way to do this. It will need
to go through first array once and then go through the filtered array
to count.
Your assumption that it will perform multiple passes is wrong. There is something called operation fusion, i.e. multiple operations can be executed in a single pass over the data.
In this case, Arrays.stream(myArray) creates a stream object (a cheap operation and a lightweight object), and filter(s -> s.length() <= maxLength).count() is executed as a single pass over the data because there is no stateful operation in the pipeline. It does not first filter all the elements of the stream and then count the ones that passed the predicate in a second pass.
A quote from Brian Goetz post here states:
Stream pipelines, in contrast, fuse their operations into as few
passes on the data as possible, often a single pass. (Stateful
intermediate operations, such as sorting, can introduce barrier points
that necessitate multipass execution.)
As for:
My questions are very easy: Is there a way to resolve this issue with
the same time performance than with a loop ?
It depends on the amount of data and the cost per element. In any case, for a small number of elements an imperative for loop will almost always win, if not always.
Is it always a "good" solution to use stream ?
No, if you really care about performance then measure, measure and measure.
Use streams because they are declarative, for their abstraction and composition, and for the possibility of benefiting from parallelism, that is, when you know you will actually benefit from it.
You can use range instead of stream and filter the output.
long solution = IntStream.range(0, myArray.length)
.filter(index -> myArray[index].length() <= maxLength)
.count();
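The single-pass claim can be verified by counting how often the filter predicate runs. A minimal sketch with made-up sample data; FilterCountDemo and its method names are illustrative. Note that count() returns a long.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

class FilterCountDemo {
    // Returns {number of matching strings, number of predicate invocations}.
    static long[] countWithVisits(String[] words, int maxLength) {
        AtomicInteger visits = new AtomicInteger();
        long count = Arrays.stream(words)
                .filter(s -> {
                    visits.incrementAndGet();   // one increment per element looked at
                    return s.length() <= maxLength;
                })
                .count();
        return new long[] {count, visits.get()};
    }

    // The equivalent plain loop, for comparison.
    static long countWithLoop(String[] words, int maxLength) {
        long count = 0;
        for (String s : words) {
            if (s.length() <= maxLength) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] words = {"a", "bbbb", "cc", "ddddd"};
        long[] result = countWithVisits(words, 2);
        // Prints count=2 visits=4 loop=2
        System.out.println("count=" + result[0] + " visits=" + result[1]
                + " loop=" + countWithLoop(words, 2));
    }
}
```

Each of the four elements is visited exactly once, the same as in the loop; any remaining difference is per-element overhead, which is best measured rather than guessed.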

Java 8 stream peek and limit interaction

Why this code in java 8:
IntStream.range(0, 10)
.peek(System.out::print)
.limit(3)
.count();
outputs:
012
I'd expect it to output 0123456789, because peek precedes limit.
It seems to me even more peculiar because of the fact that this:
IntStream.range(0, 10)
.peek(System.out::print)
.map(x -> x * 2)
.count();
outputs 0123456789 as expected (not 024681012141618).
P.S.: .count() here is used just to consume stream, it can be replaced with anything else
The most important thing to know about streams is that they do not contain elements themselves (like collections do) but work like a pipe whose values are lazily evaluated. That means that the statements that build up a stream, including mapping, filtering, or whatever, are not evaluated until the terminal operation runs.
In your first example, the stream tries to count from 0 to 9, one value at a time, doing the following for each:
print out the value
check whether 3 values are passed (if yes, terminate)
So you really get the output 012.
In your second example, the stream again counts from 0 to 9, one value at a time, doing the following for each:
print out the value
map x to x*2, thus forwarding double the value to the next step
As you can see, the output happens before the mapping, and thus you get the result 0123456789. If you swap the peek and the map calls, the doubled values are printed instead.
From the docs:
limit() is a short-circuiting stateful intermediate operation.
map() is an intermediate operation
Again from the docs what that essentially means is that limit() will return a stream with x values from the stream it received.
An intermediate operation is short-circuiting if, when presented with infinite input, it may produce a finite stream as a result.
Streams are defined to do lazy processing. So in order to complete your count() operation it doesn’t need to look at the other items. Otherwise, it would be broken, as limit(…) is defined to be a proper way of processing infinite streams in a finite time (by not processing more than limit items).
In principle, it would be possible to complete your request without ever looking at the int values at all, as the operation chain limit(3).count() doesn’t need any processing of the previous operations (other than verifying whether the stream has at least 3 items).
Streams use lazy evaluation: the intermediate operations, e.g. peek(), are not executed until the terminal operation runs.
For instance, the following code will print just 1. As soon as the first element of the stream, 1, reaches the terminal operation findAny(), the stream execution ends.
Arrays.asList(1,2,3)
.stream()
.peek(System.out::print)
.filter((n)->n<3)
.findAny();
Conversely, the following example prints 123, because the terminal operation noneMatch() needs to evaluate all the elements of the stream in order to make sure there is no match with its predicate n > 4:
Arrays.asList(1, 2, 3)
.stream()
.peek(System.out::print)
.noneMatch(n -> n > 4);
For future readers struggling to understand why the count method doesn't execute the peek method before it, I thought I'd add this additional note:
As of Java 9, the Java documentation for the count method states that:
An implementation may choose to not execute the stream pipeline
(either sequentially or in parallel) if it is capable of computing the
count directly from the stream source.
This means terminating the stream with count is no longer enough to ensure the execution of all previous steps, such as peek.
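If the goal is to observe which elements peek sees, a terminal operation that has to consume each element, such as forEach, is a safer choice than count. The sketch below repeats both pipelines from the question with forEach and collects what peek sees into a list instead of printing; PeekDemo and its method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

class PeekDemo {
    // peek before limit(3): only the elements that limit lets through are ever pulled.
    static List<Integer> peekThenLimit() {
        List<Integer> seen = new ArrayList<>();
        IntStream.range(0, 10)
                 .peek(x -> seen.add(x))
                 .limit(3)
                 .forEach(x -> { });
        return seen; // [0, 1, 2]
    }

    // peek before map: peek sees the original values, not the doubled ones.
    static List<Integer> peekThenMap() {
        List<Integer> seen = new ArrayList<>();
        IntStream.range(0, 10)
                 .peek(x -> seen.add(x))
                 .map(x -> x * 2)
                 .forEach(x -> { });
        return seen; // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }

    public static void main(String[] args) {
        System.out.println(peekThenLimit());
        System.out.println(peekThenMap());
    }
}
```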
