Lazy stream operations and unresolved reference for stream() - java

I'm developing a log-analyzing tool in Kotlin. I have a large amount of incoming logs, so it is impossible to load them all into memory; I need to process them in a "pipeline" manner.
And I found two things that disappoint me:
As I understand it, all stream-like methods on Kotlin collections (filter, map and so on) are not lazy. E.g. if I have 1 GB of logs and want to get the lengths of the first ten lines that match a given regexp, then, written naively, the filtering and transformation will be applied to the whole gigabyte of strings in memory.
I can't write l.stream(), where l is defined as val l = ArrayList<String>(). The compiler says: "Unresolved reference: stream".
So the questions are: are you going to make collection functions lazy? And why can't I access the stream() method?

Kotlin does not use Java 8 Streams; instead, it provides the lazy Sequence<T>. Its API is mostly unified with that of Iterable<T>, so you can learn more about its usage here.
Sequence<T> is similar to Stream<T>, but it offers more when it comes to sequential data (e.g. takeWhile), though it has no support for parallel operations at the moment*.
Another reason for introducing a replacement for the Stream API is that Kotlin targets Java 6, which has no Streams, so they were dropped from the Kotlin stdlib in favor of Sequence<T>.
A Sequence<T> can be created from an Iterable<T> (which every Collection<T> is) with the asSequence() method:
val l = ArrayList<String>()
val sequence = l.asSequence()
This is equivalent to .stream() in Java and will let you process a collection lazily. Otherwise, transformations are eagerly applied to a collection.
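For instance, the scenario from the question (the lengths of the first ten lines matching a regexp) could be expressed lazily with Java 8 Streams like this. This is a minimal sketch: the file name and the regexp are placeholders, and it assumes an enclosing method that handles IOException:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

Pattern pattern = Pattern.compile("ERROR.*");   // placeholder regexp

try (Stream<String> lines = Files.lines(Paths.get("app.log"))) {  // hypothetical log file
    List<Integer> lengths = lines
            .filter(s -> pattern.matcher(s).matches())
            .map(String::length)
            .limit(10)   // short-circuits: reading stops after ten matches
            .collect(Collectors.toList());
}

Because limit(10) short-circuits, only as many lines are read, filtered and mapped as are needed to produce ten results; the same holds for a Sequence<T> pipeline ending in take(10).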
* If you need it, the workaround is to roll back to Java 8 Streams:
(collection as java.util.Collection<T>).parallelStream()

Related

Why are Stream operations duplicated by Collectors?

Please allow me to make some complaints; it may be boring, but I want to describe why this question was raised.
Last night I answered some questions differently from other answerers, here, here and here.
After digging into it, I found that there is a lot of duplicated logic between Stream and Collector, which violates the Don't Repeat Yourself principle, e.g.: Stream#map & Collectors#mapping, Stream#filter & Collectors#filtering in JDK 9, etc.
But it seems reasonable, since Stream abides by the Tell, Don't Ask principle/Law of Demeter, and Collector abides by the Composition over Inheritance principle.
I can only think of a few reasons why Stream operations are duplicated by Collectors, listed below:
We don't care how the Stream was created in a larger context. In this case a Stream operation is more effective and faster than a Collector, since it can simply map a Stream onto another Stream, for example:
consuming(stream.map(...));
consuming(stream.collect(mapping(..., toList())).stream());

void consuming(Stream<?> stream) { ... }
A Collector is more powerful: Collectors can be composed together to collect the elements of a stream. A Stream, however, provides only a number of useful/frequently used operations. For example:
stream.collect(groupingBy(
        ..., mapping(
                ..., collectingAndThen(reducing(...), ...)
        )
));
Stream operations are more expressive than Collectors when doing simpler work, but they are slower than Collectors, since each operation creates a new stream, and a Stream is heavier and more abstract than a Collector. For example:
stream.map(...).collect(collector);
stream.collect(mapping(..., collector));
A Collector can't apply a short-circuiting terminal operation the way a Stream can. For example:
stream.filter(...).findFirst();
Can anyone come up with other disadvantages/advantages of Stream operations being duplicated by Collectors? I'd like to re-understand them. Thanks in advance.
Chaining a dedicated terminal stream operation might be considered more expressive by those who are used to chained method calls rather than the “LISP style” of composed collector factory calls. But it also allows optimized execution strategies for the stream implementation, as it knows the actual operation instead of just seeing a Collector abstraction.
On the other hand, as you noted yourself, Collectors can be composed, allowing these operations to be performed embedded in another collector, at places where stream operations are not possible anymore. I suppose this mirroring became apparent only at a late stage of the Java 8 development, which is the reason why some operations lacked their counterparts, like filtering or flatMapping, which arrive only in Java 9. So having two different APIs that do similar things was not a design decision made at the start of development.
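To make the composition point concrete, here is a sketch of JDK 9's filtering collector embedded in groupingBy, which a plain stream.filter(...) before the grouping cannot reproduce, because the pre-filter also drops the keys of groups with no matching elements. The Employee record is a hypothetical stand-in (records need a recent JDK; any class with these two accessors works):

import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.*;

record Employee(String dept, int salary) { }    // hypothetical stand-in type

List<Employee> employees = List.of(
        new Employee("sales", 120_000),
        new Employee("hr", 60_000));

// Filtering before grouping: departments without a match disappear.
Map<String, List<Employee>> pre = employees.stream()
        .filter(e -> e.salary() > 100_000)
        .collect(groupingBy(Employee::dept));                        // {sales=[...]}

// Filtering inside the collector: every department stays a key,
// and groups without a match become empty lists.
Map<String, List<Employee>> embedded = employees.stream()
        .collect(groupingBy(Employee::dept,
                filtering(e -> e.salary() > 100_000, toList())));    // {sales=[...], hr=[]}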
The Collectors methods that seem to duplicate Stream methods are offering additional functionality. They make sense when used in combination with other Collectors.
For example, if we consider Collectors.mapping(), the most common use is to pass it to a Collectors.groupingBy Collector.
Consider this example (taken from the Javadoc):
List<Person> people = ...
Map<City, Set<String>> namesByCity
        = people.stream().collect(groupingBy(Person::getCity, TreeMap::new,
                mapping(Person::getLastName, toSet())));
mapping is used here to transform the element type of the value Collection of each group from Person to String.
Without it (and the toSet() Collector) the output would be Map<City, List<Person>>.
Now, you can certainly map a Stream<Person> to a Stream<String> using people.stream().map(Person::getLastName), but then you will lose the ability to group these last names by some other property of Person (Person::getCity in this example).
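For contrast, here is a sketch of the mapping-first version, with people and Person as in the Javadoc example above; once the stream has been mapped, the City needed as a grouping key is gone:

// After map, this is a Stream<String>; the city of each person is lost.
Set<String> lastNames = people.stream()
        .map(Person::getLastName)
        .collect(toSet());   // a flat Set<String>, not a Map<City, Set<String>>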

Apache Spark: Effectively using mapPartitions in Java

In the currently early-release textbook titled High Performance Spark, the developers of Spark note that:
To allow Spark the flexibility to spill some records to disk, it is important to represent your functions inside of mapPartitions in such a way that your functions don't force loading the entire partition in-memory (e.g. implicitly converting to a list). Iterators have many methods we can write functional style transformations on, or you can construct your own custom iterator. When a transformation directly takes and returns an iterator without forcing it through another collection, we call these iterator-to-iterator transformations.
However, the textbook lacks good examples using mapPartitions or similar variations of the method. And there are few good code examples online, most of which are in Scala. For example, we see this Scala code using mapPartitions written by zero323 on How to add columns into org.apache.spark.sql.Row inside of mapPartitions.
def transformRows(iter: Iterator[Row]): Iterator[Row] = iter.map(transformRow)
sqlContext.createDataFrame(df.rdd.mapPartitions(transformRows), newSchema).show
Unfortunately, Java doesn't provide anything as nice as iter.map(...) for iterators. So the question arises: how can one effectively use iterator-to-iterator transformations with mapPartitions without entirely spilling the RDD to disk as a list?
JavaRDD<OutObj> collection = prevCollection.mapPartitions((Iterator<InObj> iter) -> {
    ArrayList<OutObj> out = new ArrayList<>();
    while (iter.hasNext()) {
        InObj current = iter.next();
        out.add(someChange(current));
    }
    return out.iterator();
});
This seems to be the general syntax for using mapPartitions in Java examples, but I don't see how it could be the most efficient, supposing you have a JavaRDD with tens of thousands of records (or even more, since Spark is meant for big data). You'd eventually end up with a list of all the objects in the iterator, just to turn it back into an iterator (which suggests that a map function of some sort would be much more efficient here).
Note: while these 8 lines of code using mapPartitions could be written as 1 line with a map or flatMap, I'm intentionally using mapPartitions to take advantage of the fact that it operates over each partition rather than each element in the RDD.
Any ideas, please?
One way to prevent forcing the "materialization" of the entire partition is by converting the Iterator into a Stream, and then using Stream's functional API (e.g. map function).
How to convert an iterator to a stream? suggests a few good ways to convert an Iterator into a Stream, so taking one of the options suggested there we can end up with:
rdd.mapPartitions((Iterator<InObj> iter) -> {
    Iterable<InObj> iterable = () -> iter;
    return StreamSupport.stream(iterable.spliterator(), false)
            .map(s -> transformRow(s)) // or whatever transformation
            .iterator();
});
This should be an "iterator-to-iterator" transformation, because all the intermediate APIs used (Iterable, Stream) are lazily evaluated.
EDIT: I haven't tested it myself, but the OP commented, and I quote, that "there is no efficiency increase by using a Stream over a list". I don't know why that is, and I don't know if that would be true in general, but worth mentioning.
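If the Stream detour turns out not to help, the same iterator-to-iterator contract can be met with a hand-rolled wrapper that adds no intermediate machinery at all. A sketch, reusing the placeholder names rdd, transformRow, InObj and OutObj from above:

rdd.mapPartitions((Iterator<InObj> iter) -> new Iterator<OutObj>() {
    // Delegates element by element, so at most one InObj is
    // transformed at a time and nothing is materialized.
    @Override public boolean hasNext() { return iter.hasNext(); }
    @Override public OutObj next() { return transformRow(iter.next()); }
});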

Why doesn't Collection<T> Implement Stream<T>? [duplicate]

This question already has an answer here:
Why doesn't java.util.Collection implement the new Stream interface?
(1 answer)
Closed 8 years ago.
This is a question about API design. When extension methods were added in C#, IEnumerable got all the methods that enabled using lambda expressions directly on all collections.
With the advent of lambdas and default methods in Java, I would expect that Collection would implement Stream and provide default implementations for all its methods. This way, we would not need to call stream() in order to leverage the power it provides.
What is the reason the library architects opted for the less convenient approach?
From Maurice Naftalin's Lambda FAQ:
Why are Stream operations not defined directly on Collection?
Early drafts of the API exposed methods like filter, map, and reduce on Collection or Iterable. However, user experience with this design led to a more formal separation of the “stream” methods into their own abstraction. Reasons included:
Methods on Collection such as removeAll make in-place modifications, in contrast to the new methods which are more functional in nature. Mixing two different kinds of methods on the same abstraction forces the user to keep track of which are which. For example, given the declaration
Collection<String> strings;
the two very similar-looking method calls
strings.removeAll(s -> s.length() == 0);
strings.filter(s -> s.length() == 0); // not supported in the current API
would have surprisingly different results; the first would remove all empty String objects from the collection, whereas the second would return a stream containing all the empty Strings, while having no effect on the collection.
Instead, the current design ensures that only an explicitly-obtained stream can be filtered:
strings.stream().filter(s -> s.length() == 0)...;
where the ellipsis represents further stream operations, ending with a terminating operation. This gives the reader a much clearer intuition about the action of filter;
With lazy methods added to Collection, users were confused by a perceived—but erroneous—need to reason about whether the collection was in “lazy mode” or “eager mode”. Rather than burdening Collection with new and different functionality, it is cleaner to provide a Stream view with the new functionality;
The more methods added to Collection, the greater the chance of name collisions with existing third-party implementations. By only adding a few methods (stream, parallel) the chance for conflict is greatly reduced;
A view transformation is still needed to access a parallel view; the asymmetry between the sequential and the parallel stream views was unnatural. Compare, for example
coll.filter(...).map(...).reduce(...);
with
coll.parallel().filter(...).map(...).reduce(...);
This asymmetry would be particularly obvious in the API documentation, where Collection would have many new methods to produce sequential streams, but only one to produce parallel streams, which would then have all the same methods as Collection. Factoring these into a separate interface, StreamOps say, would not help; that would still, counterintuitively, need to be implemented by both Stream and Collection;
A uniform treatment of views also leaves room for other additional views in the future.
A Collection is an object model
A Stream is a subject model
Collection definition in the doc:
A collection represents a group of objects, known as its elements.
Stream definition in the doc:
A sequence of elements supporting sequential and parallel aggregate operations
Seen this way, a stream is a specific collection. Not the other way around. Thus Collection should not implement Stream, regardless of backward compatibility.
So why doesn't Stream<T> implement Collection<T>? Because it is another way of looking at a bunch of objects: not as a group of elements, but through the operations you can perform on them. That is why I say a Collection is an object model while a Stream is a subject model.
First, from the documentation of Stream:
Collections and streams, while bearing some superficial similarities, have different goals. Collections are primarily concerned with the efficient management of, and access to, their elements. By contrast, streams do not provide a means to directly access or manipulate their elements, and are instead concerned with declaratively describing their source and the computational operations which will be performed in aggregate on that source.
So you want to keep the concepts of stream and collection apart. If Collection implemented Stream, every collection would be a stream, which conceptually it is not. The way it is done now, every collection can give you a stream that works on that collection, which is something different if you think about it.
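A small sketch of that difference: a stream is a one-shot view, while the collection behind it can hand out fresh streams at any time (the list contents here are arbitrary):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

List<String> names = Arrays.asList("a", "bb", "ccc");
Stream<String> view = names.stream();    // a one-shot view over the collection
view.forEach(System.out::println);       // consumes the view...
// view.count();                         // ...reusing it would throw IllegalStateException
long n = names.stream().count();         // the collection simply provides a fresh stream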
Another factor that comes to mind is cohesion/coupling as well as encapsulation. If every class that implements Collection had to implement the operations of Stream as well, it would have two (kind of) different purposes and might become too long.
My guess would be that it was made that way to avoid breakage with existing code that implements Collection. It would be hard to provide a default implementation that worked correctly with all existing implementations.

Does a sequential stream in Java 8 use the combiner parameter on calling collect?

If I call collect on a sequential stream (e.g. one from calling Collection.stream()), will it use the combiner parameter I pass to collect? I presume not, but I see nothing about it in the documentation. If I'm correct, then it seems unfortunate to have to supply something that I know will not be used (if I know it is a sequential stream).
Keep in mind to develop against interface specifications, not against the implementation. The implementation might change with the next Java version, whereas the specification should remain stable.
The specification does not differentiate between sequential and parallel streams. For that reason, you should assume that the combiner might be used. Actually, there are good examples showing that combiners for sequential streams can improve performance. For example, the following reduce operation concatenates a list of strings. Executing the code without a combiner has quadratic complexity; a smart execution with a combiner can reduce the runtime by orders of magnitude.
List<String> tokens = ...;
String result = tokens.stream().reduce("", String::concat, String::concat);
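One way to see what the current implementation actually does, without relying on it, is to pass a combiner that reports when it runs. With current JDK implementations a sequential collect does not appear to invoke it, but the specification gives no such guarantee; a minimal sketch:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

List<String> tokens = Arrays.asList("a", "b", "c");
List<String> result = tokens.stream().collect(
        ArrayList::new,                        // supplier
        List::add,                             // accumulator
        (left, right) -> {                     // combiner
            System.out.println("combiner called");
            left.addAll(right);
        });
// Run sequentially, "combiner called" is (observably) never printed;
// with tokens.parallelStream() it typically is. Either way the combiner
// must be correct, because the specification allows its use.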

How to create a Scala parallel collection from a Java collection

The easiest way to convert a Java Collection to a Scala equivalent is JavaConversions, available since Scala 2.8. These implicit defs return wrappers for the contained Java Collection.
Scala 2.9 introduced parallel collections, where operations on a collection can be executed in parallel and the result collected later. This is easy to put to use: converting an existing collection into a parallel one is as simple as:
myCollection.par
But there's a problem with using 'par' on collections converted from Java collections using JavaConversions. As described in Parallel Collection Conversions, inherently sequential collections are 'forced' into a new parallel collection by evaluating all of the values and adding them to the new parallel collection:
Other collections, such as lists, queues or streams, are inherently sequential in the sense that the elements must be accessed one after the other. These collections are converted to their parallel variants by copying the elements into a similar parallel collection. For example, a functional list is converted into a standard immutable parallel sequence, which is a parallel vector.
This causes problems when the original Java collection is intended to be lazily evaluated. For instance, if only a Java Iterable is returned, later converted to a Scala Iterable, there's no way to know whether the contents of the Iterable are meant to be accessed eagerly or not. So how should a parallel collection be created from a Java collection without incurring the cost of evaluating each element? It is this cost I am trying to avoid by using a parallel collection to execute operations in parallel and hopefully 'take' the first n results that are offered.
According to Parallel Collection Conversions, there is a series of collection types whose conversion costs constant time, but there doesn't appear to be a way to get a guarantee that these types can be created by JavaConversions (e.g. a 'Set' can be created, but is it a 'HashSet'?).
First, no collection obtained via JavaConversions from a Java collection is a by-default parallelizable Scala collection; this means that it will always be re-evaluated into its corresponding parallel collection implementation. The reason for this is that parallel execution relies at least on the concept of Splitters: the collection has to be splittable into smaller subsets that different processors can then work on.
I don't know how your Java collection looks in the data-structure sense, but if it's a tree-like thing or an array underneath whose elements are evaluated lazily, chances are that you can easily implement a Splitter.
If you do not want to eagerly force a lazy collection that implements a Java collection API, then your only option is to implement a new type of parallel collection for that particular lazy Java collection. In this new implementation you have to provide means of splitting the iterator (that is, a Splitter).
Once you implement this new parallel collection which knows how to split your data-structure, you should create a custom Scala wrapper for your specific Java collection (at this point it's just a little bit of extra boilerplate, see how it's done in JavaConversions) and override its par to return your specific parallel collection.
You might even be able to do this generically for indexed sequences. Given that your Java collection is a sequence (in Java, a List) with a particularly efficient get method, you could implement the Splitter as an iterator that calls get within the initial range from 0 to size - 1, and is split by subdividing this range.
If you do, patches to the Standard library are always welcome.
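The Scala Splitter API itself is beyond the scope of this answer, but the range-subdivision idea maps directly onto Java's analogous java.util.Spliterator abstraction. A hedged Java sketch, not the Scala implementation: a spliterator over a random-access list that is split by halving its index range:

import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;

// Covers list indices [from, to); splitting halves the range.
final class RangeSpliterator<T> implements Spliterator<T> {
    private final List<T> list;   // assumed to have an efficient get()
    private int from;             // inclusive
    private final int to;         // exclusive

    RangeSpliterator(List<T> list, int from, int to) {
        this.list = list; this.from = from; this.to = to;
    }

    @Override public boolean tryAdvance(Consumer<? super T> action) {
        if (from >= to) return false;
        action.accept(list.get(from++));
        return true;
    }

    @Override public Spliterator<T> trySplit() {
        int mid = from + (to - from) / 2;
        if (mid <= from) return null;            // too small to split further
        Spliterator<T> prefix = new RangeSpliterator<>(list, from, mid);
        from = mid;                              // this half keeps [mid, to)
        return prefix;
    }

    @Override public long estimateSize() { return to - from; }
    @Override public int characteristics() { return ORDERED | SIZED | SUBSIZED; }
}

Parallel streams drive exactly this trySplit mechanism to hand index sub-ranges to different threads; the Scala Splitter plays the same role for parallel collections.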
Parallel requires random access and java.lang.Iterable doesn't provide it. This is a fundamental mismatch that no amount of conversions will comfortably get you past.
To use a non-programming analogy, you cannot get a person from Australia to England by sending one person from Singapore to England and another from Australia to Singapore at the same time.
Or in programming if you're processing a live stream of data you cannot parallelise it by processing the data from now at the same time as the data from five minutes ago without adding latency.
You would need something that provides at least some random access, like java.util.List.listIterator(Int) instead of Iterable.
