Java - Intersection of multiple collections using stream + lambdas

I have the following function for the unification of multiple collections (includes repeated elements):
public static <T> List<T> unify(Collection<T>... collections) {
return Arrays.stream(collections)
.flatMap(Collection::stream)
.collect(Collectors.toList());
}
It would be nice to have a function with a similar signature for the intersection of collections (using type equality). For example:
public static <T> List<T> intersect(Collection<T>... collections) {
//Here is where the magic happens
}
I found an implementation of the intersect function, but it doesn't use streams:
public static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
Set<T> common = new LinkedHashSet<T>();
if (!collections.isEmpty()) {
Iterator<? extends Collection<T>> iterator = collections.iterator();
common.addAll(iterator.next());
while (iterator.hasNext()) {
common.retainAll(iterator.next());
}
}
return common;
}
Is there any way to implement something similar to the unify function making use of streams? I'm not very experienced with the Java 8 stream API, so any advice would be really helpful.

You can write your own collector in some utility class and use it:
public static <T, S extends Collection<T>> Collector<S, ?, Set<T>> intersecting() {
class Acc {
Set<T> result;
void accept(S s) {
if(result == null) result = new HashSet<>(s);
else result.retainAll(s);
}
Acc combine(Acc other) {
if(result == null) return other;
if(other.result != null) result.retainAll(other.result);
return this;
}
}
return Collector.of(Acc::new, Acc::accept, Acc::combine,
acc -> acc.result == null ? Collections.emptySet() : acc.result,
Collector.Characteristics.UNORDERED);
}
The usage would be pretty simple:
Set<T> result = Arrays.stream(collections).collect(MyCollectors.intersecting());
Note, however, that this collector cannot short-circuit: even if the intermediate result becomes an empty collection, it will still process the rest of the stream.
Such a collector is readily available in my free StreamEx library (see MoreCollectors.intersecting()). It works with normal streams like above, but if you use it with StreamEx (which extends the normal stream) it becomes short-circuiting: the processing may actually stop early.
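For reference, a usage sketch with that library (assuming the one.util.streamex package is on the classpath; MoreCollectors.intersecting() is the collector mentioned above):
Set<T> result = StreamEx.of(collections).collect(MoreCollectors.intersecting());
// with a plain Stream the same collector works but processes every collection;
// with StreamEx the pipeline may stop early once the intermediate result is empty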

While it's tempting to think of retainAll as a black-box bulk operation that must be the most efficient way to implement an intersection, it just implies iterating over the entire collection and testing for each element whether it is contained in the collection passed as argument. The fact that you are calling it on a Set does not imply any advantage, as it is the other collection whose contains method will determine the overall performance.
This implies that linearly scanning a set and testing each element for containment within all other collections will be on par with performing retainAll for each collection. Bonus points for iterating over the smallest collection in the first place:
public static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
if(collections.isEmpty()) return Collections.emptySet();
Collection<T> smallest
= Collections.min(collections, Comparator.comparingInt(Collection::size));
return smallest.stream().distinct()
.filter(t -> collections.stream().allMatch(c -> c==smallest || c.contains(t)))
.collect(Collectors.toSet());
}
or, alternatively
public static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
if(collections.isEmpty()) return Collections.emptySet();
Collection<T> smallest
= Collections.min(collections, Comparator.comparingInt(Collection::size));
HashSet<T> result=new HashSet<>(smallest);
result.removeIf(t -> collections.stream().anyMatch(c -> c!=smallest&& !c.contains(t)));
return result;
}

I think maybe it would make more sense to use Set instead of List (maybe that was a typo in your question):
public static <T> Set<T> intersect(Collection<T>... collections) {
//Here is where the magic happens
return (Set<T>) Arrays.stream(collections).reduce(
(a,b) -> {
Set<T> c = new HashSet<>(a);
c.retainAll(b);
return c;
}).orElseGet(HashSet::new);
}
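For example, an illustrative call (the usual unchecked-varargs warning from the generic varargs parameter applies):
Set<Integer> common = intersect(Arrays.asList(1, 2, 3), Arrays.asList(2, 3, 4), Arrays.asList(3, 2)); // contains 2 and 3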

And here's a Set implementation. retainAll() is a Collection method, so it works on all of them.
public static <T> Set<T> intersect(Collection<T>... collections)
{
return new HashSet<T>(Arrays.stream(collections).reduce(
((a, b) -> {
a.retainAll(b);
return a;
})
).orElse(new HashSet<T>()));
}
And with List<> if order is important.
public static <T> List<T> intersect2(Collection<T>... collections)
{
return new ArrayList<T>(Arrays.stream(collections).reduce(
((a, b) -> {
a.retainAll(b);
return a;
})
).orElse(new ArrayList<T>()));
}
The Java collections API lets them look almost identical. If required, you could filter the List to be distinct, as it may contain duplicates.
public static <T> List<T> intersect2(Collection<T>... collections)
{
return Arrays.stream(collections).reduce(
((a, b) -> {
a.retainAll(b);
return a;
})
).orElse(new ArrayList<T>()).stream().distinct().collect(Collectors.toList());
}

You can write it with streams as follows:
return collections.stream()
.findFirst() // find the first collection
.map(HashSet::new) // make a set out of it
.map(first -> collections.stream()
.skip(1) // don't need to process the first one
.collect(() -> first, Set::retainAll, Set::retainAll)
)
.orElseGet(HashSet::new); // if the input collection was empty, return empty set
The 3-argument collect replicates your retainAll logic.
The streams implementation gives you the flexibility to tweak the logic more easily. For example, if all your collections are sets, you might want to start with the smallest set instead of the first one (for performance). To do that, replace findFirst() with min(comparing(Collection::size)) and get rid of the skip(1). Or you could see whether you get better performance with the type of data you work with by running the second stream in parallel; all you would need to do is change stream to parallelStream.
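A rough sketch of that smallest-set variant (my adaptation of the snippet above, sequential use assumed; the smallest collection is harmlessly retained against itself since skip(1) is gone):
return collections.stream()
.min(Comparator.comparingInt(Collection::size)) // pick the smallest collection
.map(HashSet::new) // make a set out of it
.map(smallest -> collections.stream()
.collect(() -> smallest, Set::retainAll, Set::retainAll))
.orElseGet(HashSet::new);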

Related

Java general lambda memoization

I want to write a utility for general memoization in Java, I want the code be able to look like this:
Util.memoize(() -> longCalculation(1));
where
private Integer longCalculation(Integer x) {
try {
Thread.sleep(1000);
} catch (InterruptedException ignored) {}
return x * 2;
}
To do this, I was thinking I could do something like this:
public class Util{
private static final Map<Object, Object> cache = new ConcurrentHashMap<>();
public interface Operator<T> {
T op();
}
public static<T> T memoize(Operator<T> o) {
ConcurrentHashMap<Object, T> memo = cache.containsKey(o.getClass()) ? (ConcurrentHashMap<Object, T>) cache.get(o.getClass()) : new ConcurrentHashMap<>();
if (memo.containsKey(o)) {
return memo.get(o);
} else {
T val = o.op();
memo.put(o, val);
return val;
}
}
}
I was expecting this to work, but I see no memoization being done. I have tracked it down to the o.getClass() being different for each invocation.
I was thinking that I could try to get the run-time type of T but I cannot figure out a way of doing that.
The answer by Lino points out a couple of flaws in the code, but doesn't work if not reusing the same lambda.
This is because o.getClass() does not return the class of what is returned by the lambda, but the class of the lambda itself. As such, the below code returns two different classes:
Util.memoize(() -> longCalculation(1));
Util.memoize(() -> longCalculation(1));
I don't think there is a good way to find out the class of the returned type without actually executing the potentially long running code, which of course is what you want to avoid.
With this in mind I would suggest passing the class as a second parameter to memoize(). This would give you:
@SuppressWarnings("unchecked")
public static <T> T memoize(Operator<T> o, Class<T> clazz) {
return (T) cache.computeIfAbsent(clazz, k -> o.op());
}
This is based on that you change the type of cache to:
private static final Map<Class<?>, Object> cache = new ConcurrentHashMap<>();
Unfortunately, you have to downcast the Object to a T, but you can guarantee that it is safe with the #SuppressWarnings("unchecked") annotation. After all, you are in control of the code and know that the class of the value will be the same as the key in the map.
An alternative would be to use Guava's ClassToInstanceMap:
private static final ClassToInstanceMap<Object> cache = MutableClassToInstanceMap.create(new ConcurrentHashMap<>());
This, however, doesn't allow you to use computeIfAbsent() without casting, since it returns an Object, so the code would become a bit more verbose:
public static <T> T memoize(Operator<T> o, Class<T> clazz) {
T cachedCalculation = cache.getInstance(clazz);
if (cachedCalculation != null) {
return cachedCalculation;
}
T calculation = o.op();
cache.put(clazz, calculation);
return calculation;
}
As a final side note, you don't need to specify your own functional interface, but you can use the Supplier interface:
@SuppressWarnings("unchecked")
public static <T> T memoize(Supplier<T> o, Class<T> clazz) {
return (T) cache.computeIfAbsent(clazz, k -> o.get());
}
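A quick usage sketch (my own illustration) with the Supplier-based variant, assuming Util holds the cache and memoize shown above:
Integer result = Util.memoize(() -> longCalculation(1), Integer.class);
// longCalculation runs at most once per result class; later calls with the same Class key are served from the cache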
The problem you have is in the line:
ConcurrentHashMap<Object, T> memo = cache.containsKey(o.getClass()) ? (ConcurrentHashMap<Object, T>) cache.get(o.getClass()) : new ConcurrentHashMap<>();
You check whether an entry with the key o.getClass() exists. If it does, you get() it; otherwise you use a newly initialized ConcurrentHashMap. The problem with that is that you don't save this newly created map back into the cache.
So either:
Place cache.put(o.getClass(), memo); after the line above
Or even better use the computeIfAbsent() method:
ConcurrentHashMap<Object, T> memo = cache.computeIfAbsent(o.getClass(),
k -> new ConcurrentHashMap<>());
Also because you know the structure of your cache you can make it more typesafe, so that you don't have to cast everywhere:
private static final Map<Object, Map<Operator<?>, Object>> cache = new ConcurrentHashMap<>();
Also you can shorten your method even more by using the earlier mentioned computeIfAbsent():
public static <T> T memoize(Operator<T> o) {
return (T) cache
.computeIfAbsent(o.getClass(), k -> new ConcurrentHashMap<>())
.computeIfAbsent(o, k -> o.op());
}
(T): simply casts the unknown return type of Object to the required output type T
.computeIfAbsent(o.getClass(), k -> new ConcurrentHashMap<>()): invokes the provided lambda k -> new ConcurrentHashMap<>() when there is no mapping for the key o.getClass() in cache
.computeIfAbsent(o, k -> o.op());: this is invoked on the value returned by the computeIfAbsent call above. If o doesn't exist in the nested map, the lambda k -> o.op() is executed; its return value is then stored in the map and returned.
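A small usage sketch (my own illustration, not from the answer): the nested map is keyed on the Operator instance itself, so memoization only takes effect when the same lambda object is passed again, as discussed above:
Util.Operator<Integer> op = () -> longCalculation(1);
Integer first = Util.memoize(op); // runs longCalculation
Integer second = Util.memoize(op); // same Operator instance, served from the cache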

Collector to split stream up into chunks of given size

I've got a problem at hand that I'm trying to solve with something I'm pretty sure I'm not supposed to do but don't see an alternative. I'm given a List of Strings and should split it up into chunks of a given size. The result then has to be passed to some method for further processing. As the list might be huge the processing should be done asynchronously.
My approach is to create a custom Collector that takes the Stream of Strings and converts it to a Stream<List<Long>>:
final Stream<List<Long>> chunks = list
.stream()
.parallel()
.collect(MyCollector.toChunks(CHUNK_SIZE))
.flatMap(p -> doStuff(p))
.collect(MyCollector.toChunks(CHUNK_SIZE))
.map(...)
...
The code for the Collector:
public final class MyCollector<T, A extends List<List<T>>, R extends Stream<List<T>>> implements Collector<T, A, R> {
private final AtomicInteger index = new AtomicInteger(0);
private final AtomicInteger current = new AtomicInteger(-1);
private final int chunkSize;
private MyCollector(final int chunkSize){
this.chunkSize = chunkSize;
}
@Override
public Supplier<A> supplier() {
return () -> (A)new ArrayList<List<T>>();
}
@Override
public BiConsumer<A, T> accumulator() {
return (A candidate, T acc) -> {
if (index.getAndIncrement() % chunkSize == 0){
candidate.add(new ArrayList<>(chunkSize));
current.incrementAndGet();
}
candidate.get(current.get()).add(acc);
};
}
@Override
public BinaryOperator<A> combiner() {
return (a1, a2) -> {
a1.addAll(a2);
return a1;
};
}
@Override
public Function<A, R> finisher() {
return (a) -> (R)a.stream();
}
@Override
public Set<Characteristics> characteristics() {
return Collections.unmodifiableSet(EnumSet.of(Characteristics.CONCURRENT, Characteristics.UNORDERED));
}
public static <T> MyCollector<T, List<List<T>>, Stream<List<T>>> toChunks(final int chunkSize){
return new MyCollector<>(chunkSize);
}
}
This seems to work in most cases, but I get an NPE sometimes. I'm sure the code in the accumulator is not thread-safe, as there might be two threads interfering when adding new Lists to the main List. I don't mind a chunk having a few too many or too few elements, though.
I've tried this instead of the current supplier function:
return () -> (A)new ArrayList<List<T>>(){{add(new ArrayList<T>());}};
To make sure there is always a List present. This doesn't work at all and results in empty lists.
Issues:
I'm pretty sure a custom Spliterator would be a good solution. It would not work for synchronous scenarios, however. Also, can I be sure the Spliterator is actually called?
I'm aware I shouldn't have state at all but not sure how to change it.
Questions:
Is this approach completely wrong or somehow fixable?
If I use a Spliterator - can I be sure it's called or is that decided by the underlying implementation?
I'm pretty sure the casts to (A) and (R) in the supplier and finisher are not necessary but IntelliJ complains. Is there something I'm missing?
EDIT:
I've added some more to the client code as the suggestions with IntStream.range won't work when chained.
I realize I could do it differently as suggested in a comment but it's also a little bit about style and knowing if it's possible.
I have the CONCURRENT characteristic because I assume the Stream API would otherwise fall back to synchronous handling. The solution is not thread-safe, as stated before.
Any help would be greatly appreciated.
Best,
D
I can't comment yet, but I wanted to post the following link to a very similar issue (though not a duplicate, as far as I understand): Java 8 Stream with batch processing
You might also be interested in the following issue on GitHub: https://github.com/jOOQ/jOOL/issues/296
Now, your use of the CONCURRENT characteristic is wrong - the docs say the following about Collector.Characteristics.CONCURRENT:
Indicates that this collector is concurrent, meaning that the result container can support the accumulator function being called concurrently with the same result container from multiple threads.
This means that the supplier only gets called once, and the combiner actually never gets called (cf. the source of ReferencePipeline.collect() method). That's why you got NPEs sometimes.
As a result, I suggest a simplified version of what you came up with:
public static <T> Collector<T, List<List<T>>, Stream<List<T>>> chunked(int chunkSize) {
return Collector.of(
ArrayList::new,
(outerList, item) -> {
if (outerList.isEmpty() || last(outerList).size() >= chunkSize) {
outerList.add(new ArrayList<>(chunkSize));
}
last(outerList).add(item);
},
(a, b) -> {
a.addAll(b);
return a;
},
List::stream,
Collector.Characteristics.UNORDERED
);
}
private static <T> T last(List<T> list) {
return list.get(list.size() - 1);
}
Alternatively, you could write a truly concurrent Collector using proper synchronization, but if you don't mind having more than one list with a size less than chunkSize (which is the effect you can get with a non-concurrent Collector like the one I proposed above), I wouldn't bother.
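If you did want to go the synchronized route, a rough sketch (my own illustration, not part of the answer) could look like the following: it declares CONCURRENT so that a single result container is shared, and guards the "start a new chunk" decision with a lock. Every element funnels through that lock, so under heavy parallelism this mostly trades combiner overhead for contention:
public static <T> Collector<T, List<List<T>>, Stream<List<T>>> chunkedConcurrent(int chunkSize) {
return Collector.of(
ArrayList::new,
(outer, item) -> {
// the whole accumulate step is guarded, so threads never race on the "is the last chunk full?" check
synchronized (outer) {
if (outer.isEmpty() || outer.get(outer.size() - 1).size() >= chunkSize) {
outer.add(new ArrayList<>(chunkSize));
}
outer.get(outer.size() - 1).add(item);
}
},
(a, b) -> { a.addAll(b); return a; }, // rarely invoked for CONCURRENT collectors
List::stream,
Collector.Characteristics.CONCURRENT,
Collector.Characteristics.UNORDERED);
}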
Here is one way, in the spirit of doing it all in one expression, which is oddly satisfying: first associate each string with its index in the list, then use that in the collector to pick a string list to put each string into. Then stream those lists in parallel to your converter method.
final Stream<List<Long>> longListStream = IntStream.range(0, strings.size())
.parallel()
.mapToObj(i -> new AbstractMap.SimpleEntry<>(i, strings.get(i)))
.collect(
() -> IntStream.range(0, strings.size() / CHUNK_SIZE + 1)
.mapToObj(i -> new LinkedList<String>())
.collect(Collectors.toList()),
(stringListList, entry) -> {
stringListList.get(entry.getKey() / CHUNK_SIZE).add(entry.getValue());
},
(stringListList1, stringListList2) -> { })
.parallelStream()
.map(this::doStuffWithStringsAndGetLongsBack);

Java 8, how to group stream elements to sets using BiPredicate

I have stream of files, and a method which takes two files as an argument, and return if they have same content or not.
I want to reduce this stream of files to a set (or map) of sets grouping all the files with identical content.
I know this is possible by refactoring the compare method to take one file, returning a hash and then grouping the stream by the hash returned by the function given to the collector. But what is the cleanest way to achieve this with a compare method, which takes two files and returns a boolean?
For clarity, here is an example of the obvious way with the one argument function solution
file.stream().collect(groupingBy(f -> Utility.getHash(f))
But in my case I have the following method which I want to utilize in the partitioning process
public boolean isFileSame(File f, File f2) {
return Files.equal(f, f2);
}
If all you have is a BiPredicate without an associated hash function that would allow an efficient lookup, you can only use linear probing. There is no builtin collector doing that, but a custom collector working close to the original groupingBy collector can be implemented like this:
public static <T> Collector<T,?,Map<T,Set<T>>> groupingBy(BiPredicate<T,T> p) {
return Collector.of(HashMap::new,
(map,t) -> {
for(Map.Entry<T,Set<T>> e: map.entrySet())
if(p.test(t, e.getKey())) {
e.getValue().add(t);
return;
}
map.computeIfAbsent(t, x->new HashSet<>()).add(t);
}, (m1,m2) -> {
if(m1.isEmpty()) return m2;
m2.forEach((t,set) -> {
for(Map.Entry<T,Set<T>> e: m1.entrySet())
if(p.test(t, e.getKey())) {
e.getValue().addAll(set);
return;
}
m1.put(t, set);
});
return m1;
}
);
but, of course, the more resulting groups you have, the worse the performance will be.
For your specific task, it will be much more efficient to use
public static ByteBuffer readUnchecked(Path p) {
try {
return ByteBuffer.wrap(Files.readAllBytes(p));
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
}
and
Set<Set<Path>> groupsByContents = your stream of Path instances
.collect(Collectors.collectingAndThen(
Collectors.groupingBy(YourClass::readUnchecked, Collectors.toSet()),
map -> new HashSet<>(map.values())));
which will group the files by contents and does the hashing implicitly. Keep in mind that an equal hash does not imply equal contents, but this solution already takes care of that. The finishing function map -> new HashSet<>(map.values()) ensures that the resulting collection does not keep the files' contents in memory after the operation.
A possible solution by the helper class Wrapper:
files.stream()
.collect(groupingBy(f -> Wrapper.of(f, Utility::getHash, Files::equals)))
.keySet().stream().map(Wrapper::value).collect(toList());
If you don't want to use Utility.getHash for some reason, try using File.length() as the hash function. The Wrapper provides a general solution to customize the hash/equals functions for any type (e.g. arrays); it's useful to keep in your toolkit. Here is a sample implementation of the Wrapper:
public class Wrapper<T> {
private final T value;
private final ToIntFunction<? super T> hashFunction;
private final BiFunction<? super T, ? super T, Boolean> equalsFunction;
private int hashCode;
private Wrapper(T value, ToIntFunction<? super T> hashFunction, BiFunction<? super T, ? super T, Boolean> equalsFunction) {
this.value = value;
this.hashFunction = hashFunction;
this.equalsFunction = equalsFunction;
}
public static <T> Wrapper<T> of(T value, ToIntFunction<? super T> hashFunction, BiFunction<? super T, ? super T, Boolean> equalsFunction) {
return new Wrapper<>(value, hashFunction, equalsFunction);
}
public T value() {
return value;
}
@Override
public int hashCode() {
if (hashCode == 0) {
hashCode = value == null ? 0 : hashFunction.applyAsInt(value);
}
return hashCode;
}
@Override
public boolean equals(Object obj) {
return (obj == this) || (obj instanceof Wrapper && equalsFunction.apply(((Wrapper<T>) obj).value, value));
}
// TODO ...
}

Java flatmap Iterator<Pair<Stream<A>, Stream<B>>> to Pair<Stream<A>, Stream<B>>

I'm trying to implement a method with the following signature:
public static <A,B> Pair<Stream<A>, Stream<B>> flatten(Iterator<Pair<Stream<A>, Stream<B>>> iterator);
Where the goal of the method is to flatten each of the stream types into a single stream and wrap the output in a pair. I only have an Iterator (not an Iterable) and I can't alter the method signature, so I have to perform the flattening in a single iteration.
My current best implementation is
public static <A,B> Pair<Stream<A>, Stream<B>> flatten(Iterator<Pair<Stream<A>, Stream<B>>> iterator) {
Stream<A> aStream = Stream.empty();
Stream<B> bStream = Stream.empty();
while(iterator.hasNext()) {
Pair<Stream<A>, Stream<B>> elm = iterator.next();
aStream = Stream.concat(aStream, elm.first);
bStream = Stream.concat(bStream, elm.second);
}
return Pair.of(aStream, bStream);
}
But while this is technically correct I'm not super happy with this for two reasons:
The documentation of Stream.concat warns against this kind of repeated concatenation because it may lead to a StackOverflowError.
Stylistically I'd rather it be purely functional if possible instead of having to loop over the iterator and re-assign the streams throughout.
It feels like Stream#flatMap should be suited here (after transforming the input Iterator to a Stream using Guava's Streams.stream(Iterator)), but it doesn't seem to work because of the Pair type in the middle.
One additional requirement is that any of the iterator/streams may be very large (the input could contain anywhere from a single pair of exceedingly large streams to many of one item streams, for example) so solutions ideally shouldn't contain collecting results into in-memory collections.
Well, Guava's Streams.stream is no magic; internally it's actually just:
StreamSupport.stream(Spliterators.spliteratorUnknownSize(iterator, 0), false);
So there's probably no need to pull that into your method when you could just use it directly.
And you could use Stream.Builder just for that:
public static <A, B> Pair<Stream<A>, Stream<B>> flatten(Iterator<Pair<Stream<A>, Stream<B>>> iterator) {
Stream.Builder<Stream<A>> builderA = Stream.builder();
Stream.Builder<Stream<B>> builderB = Stream.builder();
iterator.forEachRemaining(pair -> {
builderA.add(pair.first);
builderB.add(pair.second);
});
return Pair.of(builderA.build().flatMap(Function.identity()), builderB.build().flatMap(Function.identity()));
}
Avoiding collecting the whole Iterator (as you actually do in the question) is quite difficult, since you don't know how the resulting streams will be consumed: one could be consumed entirely, requiring the iterator to be consumed entirely as well, while the other is not consumed at all, requiring all produced pairs to be kept around – effectively collecting them somewhere.
Only if the streams are consumed at more or less the same pace could you benefit from not collecting the whole iterator. But such consumption implies either working with the iterator of one of the resulting streams, or consuming the streams in parallel threads – which would require additional synchronization.
I thus suggest to collect all pairs into a List instead, and then generate the new Pair from that list:
public static <A,B> Pair<Stream<A>, Stream<B>> flatten(Iterator<Pair<Stream<A>, Stream<B>>> iterator) {
Iterable<Pair<Stream<A>, Stream<B>>> iterable = () -> iterator;
final List<Pair<Stream<A>, Stream<B>>> allPairs =
StreamSupport.stream(iterable.spliterator(), false)
.collect(Collectors.toList());
return Pair.of(
allPairs.stream().flatMap(p -> p.first),
allPairs.stream().flatMap(p -> p.second)
);
}
This does not consume any of the original streams yet, while keeping a simple solution that avoids nested stream concatenations.
First of all, this would be a "more functional" version of your code, which you say you'd prefer stylistically:
<A, B> Pair<Stream<A>, Stream<B>> flattenFunctional(Iterator<Pair<Stream<A>, Stream<B>>> iterator) {
return Streams.stream(iterator)
.reduce(Pair.of(Stream.empty(), Stream.empty()),
(a, b) -> Pair.of(
Stream.concat(a.first, b.first),
Stream.concat(a.second, b.second)));
}
The warning about a possible StackOverflowError still applies here as Stream.concat is used.
To avoid that, and also thinking about performance and memory use for large datasets, I have the following suggestion (not functional at all). You can create a pair of custom Iterators (for the A and B types) and use Guava's Streams.stream() to get a pair of streams. Put these custom iterators in a class holding a pair of stacks of iterators. If, for instance, in the first pair of the input iterator Stream<A> has fewer elements than Stream<B>, then after Stream<A> is exhausted, iterator.next() is called and an iterator of B is pushed onto its stack. Here is the class with the pair of stacks (add a constructor):
class PairStreamIterator<A, B> {
private final Iterator<Pair<Stream<A>, Stream<B>>> iterator;
private final Queue<Iterator<A>> stackA = new ArrayDeque<>();
private final Queue<Iterator<B>> stackB = new ArrayDeque<>();
Iterator<A> getItA() {
return new Iterator<A>() {
@Override public boolean hasNext() {
if (!stackA.isEmpty() && !stackA.peek().hasNext()) {
stackA.remove();
return hasNext();
} else if (!stackA.isEmpty() && stackA.peek().hasNext()) {
return true;
} else if (iterator.hasNext()) {
Pair<Stream<A>, Stream<B>> pair = iterator.next();
stackA.add(pair.first.iterator());
stackB.add(pair.second.iterator());
return hasNext();
}
return false;
}
@Override public A next() {
return stackA.peek().next();
}
};
}
// repeat for Iterator<B>
}
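For completeness, the constructor the answer asks you to add and the mirrored getItB() could look roughly like this (a sketch; these members go inside the class above, and getItB simply swaps the roles of the two queues):
PairStreamIterator(Iterator<Pair<Stream<A>, Stream<B>>> iterator) {
this.iterator = iterator;
}
Iterator<B> getItB() {
return new Iterator<B>() {
@Override public boolean hasNext() {
if (!stackB.isEmpty() && !stackB.peek().hasNext()) {
stackB.remove();
return hasNext();
} else if (!stackB.isEmpty() && stackB.peek().hasNext()) {
return true;
} else if (iterator.hasNext()) {
Pair<Stream<A>, Stream<B>> pair = iterator.next();
stackA.add(pair.first.iterator());
stackB.add(pair.second.iterator());
return hasNext();
}
return false;
}
@Override public B next() {
return stackB.peek().next();
}
};
}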
and the flatten method:
<A, B> Pair<Stream<A>, Stream<B>> flattenIt(Iterator<Pair<Stream<A>, Stream<B>>> iterator) {
final PairStreamIterator<A, B> pair = new PairStreamIterator<>(iterator);
return Pair.of(Streams.stream(pair.getItA()), Streams.stream(pair.getItB()));
}
The 2 stacks will typically hold 1 or 2 iterators if you consume the 2 streams in the result pair of flatten at the same rate. The worst-case scenario is if you consume one of the streams of the resulting pair completely and then the other. In that case all the iterators required for the second flattened stream will remain in the stack of iterators. I don't think there is any way around that, I'm afraid. As these are stored on the heap, you won't get a StackOverflowError, although you may still get an OutOfMemoryError.
A possible caveat is the use of recursion in hasNext. That will only be a problem if you encounter many consecutive empty streams in your input.

filter, sort and limit an unmodifiable list in Java

I want to filter, then sort, and then limit an unmodifiable list using Java and Guava. Is there any smart way to do this more efficiently than below?
public static <T> List<T> execute(final List<T> list, final Predicate<? super T> filter, final Comparator<? super T> sort, final Integer limit) {
final List<T> newList = Lists.newArrayList(Iterables.filter(list, filter));
Collections.sort(newList, sort);
if (limit > newList.size()) {
return newList;
}
return newList.subList(0, limit);
}
Thank you!
if (limit != null) {
return newList.subList(0, limit);
}
This will blow up whenever limit > newList.size().
I guess only one thing can be optimized: you don't really have to sort the whole collection if you want only a part of it. Doing this is a bit complicated, so you should measure first whether you really need it.
As stated in the comments, it's actually easy, as Guava already does it:
public static <T> List<T> execute(
final List<T> list,
final Predicate<? super T> filter,
final Comparator<? super T> sort,
final Integer limit) {
final FluentIterable<T> filtered = FluentIterable.from(list).filter(filter);
return Ordering.from(sort).leastOf(filtered, limit);
}
List<T> newList = FluentIterable.from(list)
.filter(predicate)
.limit(limit)
.toSortedList(comparator);
// (This sorts last so is behaviorally different from the original example, but it
// gives an idea of a somewhat more readable approach to this type of thing.)
// Java 8's Stream, unlike Guava's FluentIterable, has a sorted method which
// makes this easier:
List<T> newList = list.stream()
.filter(predicate)
.sorted(comparator)
.limit(limit)
.collect(toList());
Just use that instead of your execute method, and just don't call filter if you don't have a predicate to filter with, etc. Note how it's clear what's being done when you look at that code. Compare it to:
List<T> newList = execute(list, predicate, comparator, limit);
or even worse:
List<T> newList = execute(list, null, comparator, null);
Defining a single method like this only obfuscates your code.
Your null checks rather complicate things... If you want the user to be able to call this method without using a filter (for example), just overload your method. null should not be a legal value for that kind of parameter.
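For example, an overload without the filter might just delegate (a sketch, assuming Guava's Predicates.alwaysTrue() as the no-op predicate):
public static <T> List<T> execute(final List<T> list, final Comparator<? super T> sort, final int limit) {
// delegate to the full version with a predicate that accepts everything
return execute(list, Predicates.alwaysTrue(), sort, limit);
}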
