Java 8: how to group stream elements into sets using a BiPredicate - java

I have a stream of files, and a method which takes two files as arguments and returns whether they have the same content or not.
I want to reduce this stream of files to a set (or map) of sets, grouping all the files with identical content.
I know this is possible by refactoring the compare method to take one file and return a hash, then grouping the stream by the hash returned by the function given to the collector. But what is the cleanest way to achieve this with a compare method which takes two files and returns a boolean?
For clarity, here is an example of the obvious way with the one argument function solution
files.stream().collect(groupingBy(f -> Utility.getHash(f)))
But in my case I have the following method which I want to utilize in the partitioning process
public boolean isFileSame(File f, File f2) throws IOException {
    return Files.equal(f, f2);
}

If all you have is a BiPredicate without an associated hash function that would allow an efficient lookup, you can only do a linear search. There is no builtin collector doing that, but a custom collector working close to the original groupingBy collector can be implemented like this:
public static <T> Collector<T,?,Map<T,Set<T>>> groupingBy(BiPredicate<T,T> p) {
    return Collector.of(HashMap::new,
        (map, t) -> {
            for(Map.Entry<T,Set<T>> e: map.entrySet())
                if(p.test(t, e.getKey())) {
                    e.getValue().add(t);
                    return;
                }
            map.computeIfAbsent(t, x -> new HashSet<>()).add(t);
        },
        (m1, m2) -> {
            if(m1.isEmpty()) return m2;
            m2.forEach((t, set) -> {
                for(Map.Entry<T,Set<T>> e: m1.entrySet())
                    if(p.test(t, e.getKey())) {
                        e.getValue().addAll(set);
                        return;
                    }
                m1.put(t, set);
            });
            return m1;
        });
}
But, of course, the more resulting groups you have, the worse the performance will be.
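As a self-contained sketch of this collector in action, here is a version that groups strings with equalsIgnoreCase standing in for the file-content BiPredicate (class and variable names are illustrative):

```java
import java.util.*;
import java.util.function.BiPredicate;
import java.util.stream.Collector;

public class BiPredicateGrouping {
    // Same logic as the collector above, reproduced so this sketch is self-contained.
    static <T> Collector<T, ?, Map<T, Set<T>>> groupingBy(BiPredicate<T, T> p) {
        return Collector.of(HashMap::new,
            (map, t) -> {
                for (Map.Entry<T, Set<T>> e : map.entrySet())
                    if (p.test(t, e.getKey())) { e.getValue().add(t); return; }
                map.computeIfAbsent(t, x -> new HashSet<>()).add(t);
            },
            (m1, m2) -> {
                if (m1.isEmpty()) return m2;
                m2.forEach((t, set) -> {
                    for (Map.Entry<T, Set<T>> e : m1.entrySet())
                        if (p.test(t, e.getKey())) { e.getValue().addAll(set); return; }
                    m1.put(t, set);
                });
                return m1;
            });
    }

    public static void main(String[] args) {
        // "a"/"A" and "b"/"B" are "the same" under the predicate, just as two
        // files with identical content would be under isFileSame.
        Map<String, Set<String>> groups = Arrays.asList("a", "A", "b", "B", "c").stream()
            .collect(groupingBy(String::equalsIgnoreCase));
        System.out.println(groups.size()); // 3 groups
    }
}
```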
For your specific task, it will be much more efficient to use
public static ByteBuffer readUnchecked(Path p) {
try {
return ByteBuffer.wrap(Files.readAllBytes(p));
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
}
and
Set<Set<Path>> groupsByContents = your stream of Path instances
.collect(Collectors.collectingAndThen(
Collectors.groupingBy(YourClass::readUnchecked, Collectors.toSet()),
map -> new HashSet<>(map.values())));
which will group the files by contents and does the hashing implicitly. Keep in mind that an equal hash does not imply equal contents, but this solution already takes care of that. The finishing function map -> new HashSet<>(map.values()) ensures that the resulting collection does not keep the files’ contents in memory after the operation.
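A runnable sketch of this approach, using temporary files so it is self-contained (class name and file contents are illustrative):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class GroupByContents {
    static ByteBuffer readUnchecked(Path p) {
        try {
            return ByteBuffer.wrap(Files.readAllBytes(p));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) throws IOException {
        // Three temporary files; a and b have identical contents.
        Path dir = Files.createTempDirectory("grp");
        Path a = Files.write(dir.resolve("a.txt"), "same".getBytes());
        Path b = Files.write(dir.resolve("b.txt"), "same".getBytes());
        Path c = Files.write(dir.resolve("c.txt"), "different".getBytes());

        // ByteBuffer's equals/hashCode are content-based, so grouping by the
        // wrapped file contents groups files with identical bytes together.
        Set<Set<Path>> groupsByContents = Stream.of(a, b, c)
            .collect(Collectors.collectingAndThen(
                Collectors.groupingBy(GroupByContents::readUnchecked, Collectors.toSet()),
                map -> new HashSet<>(map.values())));

        System.out.println(groupsByContents.size()); // 2 groups: {a, b} and {c}
    }
}
```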

A possible solution using the helper class Wrapper:
files.stream()
.collect(groupingBy(f -> Wrapper.of(f, Utility::getHash, Files::equal)))
.keySet().stream().map(Wrapper::value).collect(toList());
If you don't want to use Utility.getHash for some reason, try using File.length() as the hash function. The Wrapper provides a general solution for customizing the hash/equals functions of any type (e.g. arrays); it's useful to keep in your toolkit. Here is a sample implementation of the Wrapper:
public class Wrapper<T> {
    private final T value;
    private final ToIntFunction<? super T> hashFunction;
    private final BiFunction<? super T, ? super T, Boolean> equalsFunction;
    private int hashCode;

    private Wrapper(T value, ToIntFunction<? super T> hashFunction,
                    BiFunction<? super T, ? super T, Boolean> equalsFunction) {
        this.value = value;
        this.hashFunction = hashFunction;
        this.equalsFunction = equalsFunction;
    }

    public static <T> Wrapper<T> of(T value, ToIntFunction<? super T> hashFunction,
                                    BiFunction<? super T, ? super T, Boolean> equalsFunction) {
        return new Wrapper<>(value, hashFunction, equalsFunction);
    }

    public T value() {
        return value;
    }

    @Override
    public int hashCode() {
        if (hashCode == 0) {
            hashCode = value == null ? 0 : hashFunction.applyAsInt(value);
        }
        return hashCode;
    }

    @Override
    @SuppressWarnings("unchecked")
    public boolean equals(Object obj) {
        return (obj == this)
            || (obj instanceof Wrapper && equalsFunction.apply(((Wrapper<T>) obj).value, value));
    }
    // TODO ...
}
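To see the Wrapper idea in action without touching the file system, here is a self-contained sketch that groups strings case-insensitively (it uses a BiPredicate instead of BiFunction<…, Boolean> for brevity; all names are illustrative):

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.Collectors;

public class WrapperDemo {
    // Compact copy of the Wrapper above, sufficient for the demo.
    static final class Wrapper<T> {
        final T value;
        final ToIntFunction<? super T> hash;
        final BiPredicate<? super T, ? super T> eq;
        Wrapper(T value, ToIntFunction<? super T> hash, BiPredicate<? super T, ? super T> eq) {
            this.value = value; this.hash = hash; this.eq = eq;
        }
        T value() { return value; }
        @Override public int hashCode() { return value == null ? 0 : hash.applyAsInt(value); }
        @SuppressWarnings("unchecked")
        @Override public boolean equals(Object o) {
            return o == this || (o instanceof Wrapper && eq.test(((Wrapper<T>) o).value, value));
        }
    }

    public static void main(String[] args) {
        // Hash by the lower-cased value; compare with equalsIgnoreCase.
        List<String> names = Arrays.asList("Foo", "foo", "Bar");
        Map<Wrapper<String>, List<String>> grouped = names.stream()
            .collect(Collectors.groupingBy(s -> new Wrapper<>(s,
                v -> v.toLowerCase().hashCode(), String::equalsIgnoreCase)));
        List<String> distinct = grouped.keySet().stream()
            .map(Wrapper::value).collect(Collectors.toList());
        System.out.println(grouped.size()); // 2 groups: {Foo, foo} and {Bar}
    }
}
```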

Related

Order elements of a Map of generic objects --- by a generic attribute of the object --- Lambda - Java

I need a method to order a Map of generic objects by a generic attribute of the object. I tried the code below, similar to other examples I found on Stack Overflow, but I didn't find any example with a generic attribute. I'm no expert in lambdas, so some of the logic is hard for me to understand clearly.
I get an error on compareTo; NetBeans tells me:
"cannot find symbol symbol: method compareTo(CAP#1)
location: class Object where CAP#1 is a fresh type-variable:
CAP#1 extends Object from capture of ?"
Example:
I have an object 'car' with attribute 'name'
I have a HashMap<Integer, Car> containing items: key 1, object with name=Ford --- key 2, object with name=Audi --- key 3, object with name=Fiat
The first element of the map has key 1, the second has key 2, the third has key 3
I would like the output to be an ArrayList where the first element is the object 'Audi', the second is 'Fiat', the third is 'Ford', so that the 3 names are sorted.
In order to invoke this method I would use for example:
ArrayList<Car> SORTED_Cars = get_ListOfObject_SortedByAttribute(myHashMap, car -> car.getName() );
I should get an ArrayList of object 'car' ordered by attribute 'name'.
The final task is to have a method that I'll use with Map of different Objects, then with different attributes.
Note that I use this checking condition
if (MyMap_Arg!=null && MyMap_Arg.size()>0 && MyMap_Arg.values()!=null)
because I prefer to get null when the ordering is not possible or the map is empty.
How should the code below be changed to work?
private static <T> List<T> get_ListOfObject_SortedByAttribute(final Map<?, T> MyMap_Arg, final Function<T, ?> MY_AttributeValueExtractor__Arg) {
    List<T> result = null;
    try {
        if (MyMap_Arg != null && MyMap_Arg.size() > 0 && MyMap_Arg.values() != null) {
            if (MY_AttributeValueExtractor__Arg != null) {
                //_____________________________________________________
                // Build a list of objects for which MY_AttributeValueExtractor__Arg != null;
                // otherwise applying '.compare' would throw an exception
                List<T> MY_LIST__SenzaNull = MyMap_Arg.values().stream().filter(o -> MY_AttributeValueExtractor__Arg.apply(o) != null).collect(Collectors.toList());
                //_____________________________________________________
                // TEST ********* Sort the list of objects alphabetically based on MY_AttributeValueExtractor__Arg
                result = MY_LIST__SenzaNull.stream().sorted(
                        (o1, o2) -> MY_AttributeValueExtractor__Arg.apply(o1).
                                compareTo(MY_AttributeValueExtractor__Arg.apply(o2))
                ).collect(Collectors.toList());
                //_____________________________________________________
            }
        }
    } catch (Exception ex) {
        result = null;
    }
    return result;
}
Given your code (with refactorings for clarity)
static <T> List<T> getListSortedByAttribute(Map<?, T> aMap,
Function<T, ?> attributeValueExtractor) {
List<T> result = null;
try {
if (aMap != null && aMap.size() > 0 && aMap.values() != null) {
if (attributeValueExtractor != null) {
List<T> listWithoutElementsReturningNullAsAttributeValue = aMap
    .values()
    .stream()
    .filter(o -> attributeValueExtractor.apply(o) != null)
    .collect(Collectors.toList());
result = listWithoutElementsReturningNullAsAttributeValue
.stream()
.sorted((o1, o2) -> attributeValueExtractor.apply(o1).
compareTo(attributeValueExtractor.apply(o2)))
.collect(Collectors.toList());
}
}
} catch (Exception ex) {
result = null;
}
return result;
}
you want to use a compareTo method to compare (sort) list elements by one of their attributes given as function
(o1, o2) -> attributeValueExtractor.apply(o1)
.compareTo(attributeValueExtractor.apply(o2))
With which you get the compile-time error
Error:(149, 78) java: cannot find symbol
symbol: method compareTo(capture#1 of ?)
location: class java.lang.Object
Meaning, there's nothing in your code which ensures that your attributes have such a method (and can be compared for sorting). In particular, Function<T, ?> attributeValueExtractor says that some type T is mapped (by your function) to type ?, which can only be treated as Object; and Object doesn't have a compareTo method. (Note that Object doesn't have such a method because there's simply no meaningful way of comparing arbitrary objects with each other.)
To fix it (at the very least), you need to ensure that your objects implement such a method. So your method signature
static <T> List<T> getListSortedByAttribute(
Map<?, T> aMap, Function<T, ?> attributeValueExtractor)
needs to change to
static <T, U extends Comparable<U>> List<T> getListSortedByAttribute(
Map<?, T> aMap, Function<T, U> attributeValueExtractor)
where U types are required to implement the interface Comparable which has the method compareTo.
With that you get (with a few more refactorings and addition of an example)
public static void main(String[] args) {
Map<Integer, Car> map = new HashMap<>();
map.put(1, new Car("Ford"));
map.put(2, new Car("Audi"));
map.put(3, new Car("Fiat"));
map.put(4, new Car(null));
List<Car> list = getListSortedByAttribute(map, Car::getName);
System.out.println(list);
}
static <T, U extends Comparable<U>> List<T> getListSortedByAttribute(
        Map<?, T> aMap, Function<T, U> attributeExtractor) {
List<T> result = null;
try {
if (aMap != null && aMap.size() > 0) {
if (attributeExtractor != null) {
result = aMap
.values()
.stream()
.filter(o -> attributeExtractor.apply(o) != null)
.sorted((o1, o2) -> attributeExtractor.apply(o1).
compareTo(attributeExtractor.apply(o2)))
.collect(Collectors.toList());
}
}
} catch (Exception ex) {
result = null;
}
return result;
}
static class Car {
private final String name;
Car(String name) { this.name = name; }
String getName() { return name; }
@Override
public String toString() { return name; }
// code for equals() and hashCode() omitted for brevity
}
which prints
[Audi, Fiat, Ford]
As for List<Car> list = getListSortedByAttribute(map, Car::getName);, note that you can use a lambda expression (Car car) -> car.getName() or a method reference Car::getName as the method argument. I chose the latter because I find it shorter.
The invocation then works because the return type of Car::getName is a String which implements Comparable and therefore has a compareTo method implementation.
Notes about your code (my opinion)
When using a method which returns a Collection type like yours it's very surprising if such a method returns null instead of an empty collection – and it'll be the cause of many (and also surprising) NullPointerExceptions down the road.
As pointed out in the comments, aMap.values() != null is always true because (surprise) the Java API designers/implementers decided that this method should always return a non-null collection, i.e. an empty collection if there are no values. So that condition has no effect.
For aMap != null and attributeValueExtractor != null, simply throw instead, because calling your method with null arguments shouldn't be a valid invocation which is allowed to continue (fail-fast principle).
aMap.size() > 0 isn't really needed, as the subsequent stream code handles an empty map without any problem, i.e. the result is simply an empty list. And intentionally returning null for that case isn't something I'd ever do (as explained above).
Don't swallow exceptions; either throw (fail-fast principle) or recover meaningfully. Returning null as outlined above isn't a meaningful recovery.
With respect to the above and further changes you then get
static <T, U extends Comparable<U>> List<T> getListSortedByAttribute(
        Map<?, T> aMap, Function<T, U> attributeExtractor) {
Objects.requireNonNull(aMap, "Map to be sorted cannot be null");
Objects.requireNonNull(attributeExtractor, "Function to extract a value cannot be null");
return aMap
.values()
.stream()
.filter(o -> attributeExtractor.apply(o) != null)
.sorted(Comparator.comparing(attributeExtractor))
.collect(Collectors.toList());
}
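Packaged as a runnable example (method name per the original question; the Car class is the minimal version from above):

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SortByAttribute {
    static <T, U extends Comparable<U>> List<T> getListSortedByAttribute(
            Map<?, T> aMap, Function<T, U> attributeExtractor) {
        Objects.requireNonNull(aMap, "Map to be sorted cannot be null");
        Objects.requireNonNull(attributeExtractor, "Function to extract a value cannot be null");
        return aMap.values().stream()
                .filter(o -> attributeExtractor.apply(o) != null)
                .sorted(Comparator.comparing(attributeExtractor))
                .collect(Collectors.toList());
    }

    static class Car {
        private final String name;
        Car(String name) { this.name = name; }
        String getName() { return name; }
    }

    public static void main(String[] args) {
        Map<Integer, Car> map = new HashMap<>();
        map.put(1, new Car("Ford"));
        map.put(2, new Car("Audi"));
        map.put(3, new Car("Fiat"));
        map.put(4, new Car(null)); // filtered out instead of causing an exception
        List<String> sortedNames = getListSortedByAttribute(map, Car::getName).stream()
                .map(Car::getName).collect(Collectors.toList());
        System.out.println(sortedNames); // [Audi, Fiat, Ford]
    }
}
```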

Java general lambda memoization

I want to write a utility for general memoization in Java, I want the code be able to look like this:
Util.memoize(() -> longCalculation(1));
where
private Integer longCalculation(Integer x) {
try {
Thread.sleep(1000);
} catch (InterruptedException ignored) {}
return x * 2;
}
To do this, I was thinking I could do something like this:
public class Util{
private static final Map<Object, Object> cache = new ConcurrentHashMap<>();
public interface Operator<T> {
T op();
}
public static<T> T memoize(Operator<T> o) {
ConcurrentHashMap<Object, T> memo = cache.containsKey(o.getClass()) ? (ConcurrentHashMap<Object, T>) cache.get(o.getClass()) : new ConcurrentHashMap<>();
if (memo.containsKey(o)) {
return memo.get(o);
} else {
T val = o.op();
memo.put(o, val);
return val;
}
}
}
I was expecting this to work, but I see no memoization being done. I have tracked it down to o.getClass() being different for each invocation.
I was thinking that I could try to get the run-time type of T but I cannot figure out a way of doing that.
The answer by Lino points out a couple of flaws in the code, but it doesn't work if you don't reuse the same lambda.
This is because o.getClass() does not return the class of what is returned by the lambda, but the class of the lambda itself. As such, the code below produces two different classes:
Util.memoize(() -> longCalculation(1));
Util.memoize(() -> longCalculation(1));
I don't think there is a good way to find out the class of the returned type without actually executing the potentially long running code, which of course is what you want to avoid.
With this in mind I would suggest passing the class as a second parameter to memoize(). This would give you:
@SuppressWarnings("unchecked")
public static <T> T memoize(Operator<T> o, Class<T> clazz) {
return (T) cache.computeIfAbsent(clazz, k -> o.op());
}
This is based on that you change the type of cache to:
private static final Map<Class<?>, Object> cache = new ConcurrentHashMap<>();
Unfortunately, you have to downcast the Object to a T, but you can suppress the warning with the @SuppressWarnings("unchecked") annotation. After all, you are in control of the code and know that the class of the value will be the same as the key in the map.
An alternative would be to use Guava's ClassToInstanceMap:
private static final ClassToInstanceMap<Object> cache = MutableClassToInstanceMap.create(new ConcurrentHashMap<>());
This, however, doesn't allow you to use computeIfAbsent() without casting, since it returns an Object, so the code would become a bit more verbose:
public static <T> T memoize(Operator<T> o, Class<T> clazz) {
T cachedCalculation = cache.getInstance(clazz);
if (cachedCalculation != null) {
return cachedCalculation;
}
T calculation = o.op();
cache.put(clazz, calculation);
return calculation;
}
As a final side note, you don't need to specify your own functional interface, but you can use the Supplier interface:
@SuppressWarnings("unchecked")
public static <T> T memoize(Supplier<T> o, Class<T> clazz) {
return (T) cache.computeIfAbsent(clazz, k -> o.get());
}
The problem you have is in the line:
ConcurrentHashMap<Object, T> memo = cache.containsKey(o.getClass()) ? (ConcurrentHashMap<Object, T>) cache.get(o.getClass()) : new ConcurrentHashMap<>();
You check whether an entry with the key o.getClass() exists. If yes, you get() it; else you use a newly initialized ConcurrentHashMap. The problem with that is that you don't save this newly created map back into the cache.
So either:
Place cache.put(o.getClass(), memo); after the line above
Or even better use the computeIfAbsent() method:
ConcurrentHashMap<Object, T> memo = cache.computeIfAbsent(o.getClass(),
k -> new ConcurrentHashMap<>());
Also because you know the structure of your cache you can make it more typesafe, so that you don't have to cast everywhere:
private static final Map<Object, Map<Operator<?>, Object>> cache = new ConcurrentHashMap<>();
Also you can shorten your method even more by using the earlier mentioned computeIfAbsent():
public static <T> T memoize(Operator<T> o) {
return (T) cache
.computeIfAbsent(o.getClass(), k -> new ConcurrentHashMap<>())
.computeIfAbsent(o, k -> o.op());
}
(T): simply casts the unknown return type of Object to the required output type T
.computeIfAbsent(o.getClass(), k -> new ConcurrentHashMap<>()): invokes the provided lambda k -> new ConcurrentHashMap<>() when there is no mapping for the key o.getClass() in cache
.computeIfAbsent(o, k -> o.op());: this is invoked on the map returned by the computeIfAbsent call in step 2. If o doesn't exist in the nested map, the lambda k -> o.op() is executed; the return value is then stored in the map and returned.
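A self-contained sketch of this fixed version, with a call counter to show that the computation runs only once when the same lambda instance is reused (as discussed above, a fresh lambda per call would still miss the cache; names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class Memo {
    interface Operator<T> { T op(); }

    private static final Map<Object, Map<Operator<?>, Object>> cache = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    static <T> T memoize(Operator<T> o) {
        return (T) cache
                .computeIfAbsent(o.getClass(), k -> new ConcurrentHashMap<>())
                .computeIfAbsent(o, k -> o.op());
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Store the lambda once; reusing the same instance is what makes the cache hit.
        Operator<Integer> op = () -> { calls.incrementAndGet(); return 42; };
        int first = memoize(op);
        int second = memoize(op);
        System.out.println(first + " " + second + " calls=" + calls.get()); // 42 42 calls=1
    }
}
```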

Transform list to mapping using java streams

I have the following pattern repeated throughout my code:
class X<T, V>
{
V doTransform(T t) {
return null; // dummy implementation
}
Map<T, V> transform(List<T> item) {
return item.stream().map(x->new AbstractMap.SimpleEntry<>(x, doTransform(x))).collect(toMap(x->x.getKey(), x->x.getValue()));
}
}
Requiring the use of AbstractMap.SimpleEntry is messy and clunky. LINQ's use of anonymous types is more elegant.
Is there a simpler way to achieve this using streams?
Thanks in advance.
You can call doTransform in the value mapper:
Map<T, V> transform(List<T> item) {
return item.stream().collect(toMap(x -> x, x -> doTransform(x)));
}
Unfortunately, Java doesn't have an exact equivalent of C#'s anonymous types.
In this specific case, you don't need the intermediate map operation, as @Jorn Vernee has suggested. Instead, you can perform the key and value extraction in the toMap collector.
However, when it gets to cases where you think you need something as such of C#'s anonymous types you may consider:
anonymous objects (may not always be what you want depending on your use case)
Arrays.asList(...), List.of(...) (may not always be what you want depending on your use case)
an array (may not always be what you want depending on your use case)
Ultimately, if you really need to map to something that can contain two different types of elements, then I'd stick with the AbstractMap.SimpleEntry.
That said, your current example can be simplified to:
Map<T, V> transform(List<T> items) {
return items.stream().collect(toMap(Function.identity(), this::doTransform));
}
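For instance, with concrete types (a hypothetical String-to-length transform standing in for doTransform):

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TransformDemo {
    // Stand-in for doTransform: map each string to its length.
    static Map<String, Integer> transform(List<String> items) {
        // Key and value extraction happen directly in the toMap collector;
        // no intermediate entry objects are needed.
        return items.stream()
                .collect(Collectors.toMap(Function.identity(), String::length));
    }

    public static void main(String[] args) {
        Map<String, Integer> m = transform(Arrays.asList("a", "bb", "ccc"));
        System.out.println(m.get("bb")); // 2
    }
}
```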
In this specific example, there is no need to do the intermediate storage at all:
Map<T, V> transform(List<T> item) {
return item.stream().collect(toMap(x -> x, x -> doTransform(x)));
}
But if you need it, Java 9 offers a simpler factory method,
Map<T, V> transform(List<T> item) {
return item.stream()
.map(x -> Map.entry(x, doTransform(x)))
.collect(toMap(x -> x.getKey(), x -> x.getValue()));
}
as long as you don’t have to deal with null.
You can use an anonymous inner class here,
Map<T, V> transform(List<T> item) {
return item.stream()
.map(x -> new Object(){ T t = x; V v = doTransform(x); })
.collect(toMap(x -> x.t, x -> x.v));
}
but it’s less efficient. It’s an inner class which captures a reference to the surrounding this, also it captures x, so you have two fields, t and the synthetic one for capturing x, for the same thing.
The latter could be circumvented by using a method, e.g.
Map<T, V> transform(List<T> item) {
return item.stream()
.map(x -> new Object(){ T getKey() { return x; } V v = doTransform(x); })
.collect(toMap(x -> x.getKey(), x -> x.v));
}
But it doesn’t add to readability.
The only true anonymous types are the types generated for lambda expressions, which could be used to store information via higher order functions:
Map<T, V> transform(List<T> item) {
return item.stream()
.map(x -> capture(x, doTransform(x)))
.collect(HashMap::new, (m,f) -> f.accept(m::put), HashMap::putAll);
}
public static <A,B> Consumer<BiConsumer<A,B>> capture(A a, B b) {
return f -> f.accept(a, b);
}
but you’d soon hit the limitations of Java’s type system (it still isn’t a functional programming language) if you try this with more complex scenarios.

Generics with optional multiple bounds, e.g. List<? extends Integer OR String>

I have a method that should only accept a Map whose key is of type String and value of type Integer or String, but not, say, Boolean.
For example,
map.put("prop1", 1); // allowed
map.put("prop2", "value"); // allowed
map.put("prop3", true); // compile time error
It is not possible to declare a Map as below (to enforce compile time check).
void setProperties(Map<String, ? extends Integer || String> properties)
What is the best alternative other than declaring the value type as an unbounded wildcard and validating for Integer or String at runtime?
void setProperties(Map<String, ?> properties)
This method accepts a set of properties to configure an underlying service entity. The entity supports property values of type String and Integer alone. For example, a property maxLength=2 is valid, defaultTimezone=UTC is also valid, but allowDuplicate=false is invalid.
Another solution would be a custom Map implementation and overrides of the put and putAll methods to validate the data:
public class ValidatedMap extends HashMap<String, Object> {
@Override
public Object put(final String key, final Object value) {
validate(value);
return super.put(key, value);
}
@Override
public void putAll(final Map<? extends String, ?> m) {
m.values().forEach(v -> validate(v));
super.putAll(m);
}
private void validate(final Object value) {
if (value instanceof String || value instanceof Integer) {
// OK
} else {
// TODO: use some custom exception
throw new RuntimeException("Illegal value type");
}
}
}
NB: use the Map implementation that fits your needs as base class
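A runnable sketch of this approach (using IllegalArgumentException in place of the TODO custom exception; class and property names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class ValidatedMapDemo {
    static class ValidatedMap extends HashMap<String, Object> {
        @Override
        public Object put(final String key, final Object value) {
            validate(value);
            return super.put(key, value);
        }
        @Override
        public void putAll(final Map<? extends String, ?> m) {
            m.values().forEach(this::validate);
            super.putAll(m);
        }
        private void validate(final Object value) {
            // Only String and Integer values are accepted.
            if (!(value instanceof String || value instanceof Integer))
                throw new IllegalArgumentException("Illegal value type: " + value);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> props = new ValidatedMap();
        props.put("maxLength", 2);           // allowed
        props.put("defaultTimezone", "UTC"); // allowed
        boolean rejected = false;
        try {
            props.put("allowDuplicate", false); // Boolean: rejected at runtime
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        System.out.println(props.size() + " " + rejected); // 2 true
    }
}
```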
Since the closest common ancestor of Integer and String in the class hierarchy is Object, you cannot achieve what you are trying to do: you can only help the compiler narrow the type down to Object.
You can either
wrap your value in a class which can contain either an Integer or a String, or
extend Map as in @RC's answer, or
wrap 2 Maps in a class
You can’t declare a type variable to be either of two types. But you can create a helper class to encapsulate values not having a public constructor but factory methods for dedicated types:
public static final class Value {
private final Object value;
private Value(Object o) { value=o; }
}
public static Value value(int i) {
// you could verify the range here
return new Value(i);
}
public static Value value(String s) {
// could reject null or invalid string contents here
return new Value(s);
}
// these helper methods may be superseded by Java 9’s Map.of(...) methods
public static <K,V> Map<K,V> map(K k, V v) { return Collections.singletonMap(k, v); }
public static <K,V> Map<K,V> map(K k1, V v1, K k2, V v2) {
final HashMap<K, V> m = new HashMap<>();
m.put(k1, v1);
m.put(k2, v2);
return m;
}
public static <K,V> Map<K,V> map(K k1, V v1, K k2, V v2, K k3, V v3) {
final Map<K, V> m = map(k1, v1, k2, v2);
m.put(k3, v3);
return m;
}
public void setProperties(Map<String, Value> properties) {
Map<String,Object> actual;
if(properties.isEmpty()) actual = Collections.emptyMap();
else {
actual = new HashMap<>(properties.size());
for(Map.Entry<String, Value> e: properties.entrySet())
actual.put(e.getKey(), e.getValue().value);
}
// proceed with actual map
}
If you are using 3rd party libraries with map builders, you don’t need the map methods, they’re convenient for short maps only. With this pattern, you may call the method like
setProperties(map("maxLength", value(2), "timezone", value("UTC")));
Since there are only the two Value factory methods, for int and String, no other type can be passed to the map. Note that this also allows using int as the parameter type, so widening of byte, short, etc. to int is possible here.
Define two overloads:
void setIntegerProperties(Map<String, Integer> properties)
void setStringProperties(Map<String, String> properties)
They have to be called different things, because you can't have two methods with the same erasure.
I'm fairly certain that if any language were going to disallow multiple accepted types for a value, it would be Java. If you really need this kind of capability, I'd suggest looking into other languages; Python can definitely do it.
What's the use case for having both Integers and Strings as the values of your map? If we are really dealing with just Integers and Strings, you're going to have to either:
Define a wrapper object that can hold either a String or an Integer. I would advise against this though, because it will look a lot like the other solution below.
Pick either String or Integer to be the value (String seems like the easier choice), and then just do extra work outside of the map to work with both data types.
Map<String, String> map = new HashMap<>();
Object myValue = 5;
if (myValue instanceof Integer) {
String temp = myValue.toString();
map.put(key, temp);
}
// taking things out of the map requires more delicate care.
try { // parseInt() can throw a NumberFormatException
Integer result = Integer.parseInt(map.get(key));
}
catch (NumberFormatException e) {} // do something here
This is a very ugly solution, but it's probably one of the only reasonable solutions that can be provided using Java to maintain some sense of strong typing to your values.

Java - Intersection of multiple collections using stream + lambdas

I have the following function for the unification of multiple collections (includes repeated elements):
public static <T> List<T> unify(Collection<T>... collections) {
return Arrays.stream(collections)
.flatMap(Collection::stream)
.collect(Collectors.toList());
}
It would be nice to have a function with a similar signature for the intersection of collections (using type equality). For example:
public static <T> List<T> intersect(Collection<T>... collections) {
//Here is where the magic happens
}
I found an implementation of the intersect function, but it doesn't use streams:
public static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
Set<T> common = new LinkedHashSet<T>();
if (!collections.isEmpty()) {
Iterator<? extends Collection<T>> iterator = collections.iterator();
common.addAll(iterator.next());
while (iterator.hasNext()) {
common.retainAll(iterator.next());
}
}
return common;
}
Is there any way to implement something similar to the unify function making use of streams? I'm not so experienced with the Java 8 stream API, so some advice would be really helpful.
You can write your own collector in some utility class and use it:
public static <T, S extends Collection<T>> Collector<S, ?, Set<T>> intersecting() {
class Acc {
Set<T> result;
void accept(S s) {
if(result == null) result = new HashSet<>(s);
else result.retainAll(s);
}
Acc combine(Acc other) {
if(result == null) return other;
if(other.result != null) result.retainAll(other.result);
return this;
}
}
return Collector.of(Acc::new, Acc::accept, Acc::combine,
acc -> acc.result == null ? Collections.emptySet() : acc.result,
Collector.Characteristics.UNORDERED);
}
The usage would be pretty simple:
Set<T> result = Arrays.stream(collections).collect(MyCollectors.intersecting());
Note however that this collector cannot short-circuit: even if an intermediate result is an empty collection, it will still process the rest of the stream.
Such collector is readily available in my free StreamEx library (see MoreCollectors.intersecting()). It works with normal streams like above, but if you use it with StreamEx (which extends normal stream) it becomes short-circuiting: the processing may actually stop early.
While it’s tempting to think of retainAll as a black-box bulk operation that must be the most efficient way to implement an intersection operation, it just implies iterating over the entire collection and testing for each element whether it is contained in the collection passed as argument. The fact that you are calling it on a Set does not imply any advantage, as it is the other collection, whose contains method will determine the overall performance.
This implies that linearly scanning a set and testing each element for containment within all other collections will be on par with performing retainAll for each collection. Bonus points for iterating over the smallest collection in the first place:
public static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
if(collections.isEmpty()) return Collections.emptySet();
Collection<T> smallest
= Collections.min(collections, Comparator.comparingInt(Collection::size));
return smallest.stream().distinct()
.filter(t -> collections.stream().allMatch(c -> c==smallest || c.contains(t)))
.collect(Collectors.toSet());
}
or, alternatively
public static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
if(collections.isEmpty()) return Collections.emptySet();
Collection<T> smallest
= Collections.min(collections, Comparator.comparingInt(Collection::size));
HashSet<T> result=new HashSet<>(smallest);
result.removeIf(t -> collections.stream().anyMatch(c -> c != smallest && !c.contains(t)));
return result;
}
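Either variant can be exercised like this (a self-contained sketch of the first one; the sample data is illustrative):

```java
import java.util.*;
import java.util.stream.Collectors;

public class IntersectDemo {
    static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
        if (collections.isEmpty()) return Collections.emptySet();
        // Iterate over the smallest collection; test membership in all others.
        Collection<T> smallest =
                Collections.min(collections, Comparator.comparingInt(Collection::size));
        return smallest.stream().distinct()
                .filter(t -> collections.stream().allMatch(c -> c == smallest || c.contains(t)))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<Integer> common = intersect(Arrays.asList(
                Arrays.asList(1, 2, 3, 4),
                Arrays.asList(2, 3, 5),
                Arrays.asList(0, 2, 3)));
        System.out.println(common); // elements present in all three lists: 2 and 3
    }
}
```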
I think maybe it would make more sense to use Set instead of List (maybe that was a typo in your question):
public static <T> Set<T> intersect(Collection<T>... collections) {
//Here is where the magic happens
return (Set<T>) Arrays.stream(collections).reduce(
(a,b) -> {
Set<T> c = new HashSet<>(a);
c.retainAll(b);
return c;
}).orElseGet(HashSet::new);
}
and here's a Set implementation. retainAll() is a Collection method, so it works on all of them.
public static <T> Set<T> intersect(Collection<T>... collections)
{
    return new HashSet<T>(Arrays.stream(collections).reduce(
        (a, b) -> {
            a.retainAll(b);
            return a;
        })
        .orElse(new HashSet<T>()));
}
And with List<> if order is important.
public static <T> List<T> intersect2(Collection<T>... collections)
{
return new ArrayList<T>(Arrays.stream(collections).reduce(
((a, b) -> {
a.retainAll(b);
return a;
})
).orElse(new ArrayList<T>()));
}
Java Collections lets them look almost identical. If required, you could filter the List to be distinct as it may contain duplicates.
public static <T> List<T> intersect2(Collection<T>... collections)
{
    return Arrays.stream(collections).reduce(
        (a, b) -> {
            a.retainAll(b);
            return a;
        })
        .orElse(new ArrayList<T>())
        .stream().distinct().collect(Collectors.toList());
}
You can write it with streams as follows:
return collections.stream()
.findFirst() // find the first collection
.map(HashSet::new) // make a set out of it
.map(first -> collections.stream()
.skip(1) // don't need to process the first one
.collect(() -> first, Set::retainAll, Set::retainAll)
)
.orElseGet(HashSet::new); // if the input collection was empty, return empty set
The 3-argument collect replicates your retainAll logic
The streams implementation gives you the flexibility to tweak the logic more easily. For example, if all your collections are sets, you might want to start with the smallest set instead of the first one (for performance). To do that, you would replace findFirst() with min(comparing(Collection::size)) and get rid of the skip(1). Or you could see if you get better performance with the type of data you work with by running the second stream in parallel and all you would need to do is change stream to parallelStream.
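Packaged as a runnable method (shown with sequential streams; class name and sample data are illustrative):

```java
import java.util.*;

public class StreamIntersect {
    static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
        return collections.stream()
                .findFirst()                 // find the first collection
                .<Set<T>>map(HashSet::new)   // make a mutable set out of it
                .map(first -> collections.stream()
                        .skip(1)             // the first one is already in the set
                        .collect(() -> first, Set::retainAll, Set::retainAll))
                .orElseGet(HashSet::new);    // empty input: return empty set
    }

    public static void main(String[] args) {
        Set<String> common = intersect(Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("b", "c", "d"),
                Arrays.asList("c", "b")));
        System.out.println(common); // only "b" and "c" occur in all three lists
    }
}
```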
