filter KeyValueGrouped Dataset in spark - java

I have a typed dataset of a custom class and call the groupByKey method on it, which returns a KeyValueGroupedDataset. I want to filter this new dataset, but there is no filter method for this type. So my question is: how can I filter a dataset of this type? (A Java solution is needed; Spark version: 2.3.1.)
Sample data:
"id":1,"fname":"Gale","lname":"Willmett","email":"gwillmett0@nhs.uk","gender":"Female"
"id":2,"fname":"Chantalle","lname":"Wilcher","email":"cwilcher1@blinklist.com","gender":"Female"
"id":3,"fname":"Polly","lname":"Grandisson","email":"pgrandisson2@linkedin.com","gender":"Female"
"id":3,"fname":"Moshe","lname":"Pink","email":"mpink3@twitter.com","gender":"Male"
"id":2,"fname":"Yorke","lname":"Ginnelly","email":"yginnelly4@apple.com","gender":"Male"
And what I did:
Dataset<Person> peopleDS = spark.read().format("parquet").load("/path").as(Encoders.bean(Person.class));
KeyValueGroupedDataset<String, Person> KVDS = peopleDS.groupByKey((MapFunction<Person, String>) f -> f.getGender(), Encoders.STRING());
// How can I filter on KVDS's id field?
Update1 (use of flatMapGroups):
Dataset<Person> persons = KVDS.flatMapGroups((FlatMapGroupsFunction<String, Person, Person>) (f, k) -> (Iterator<Person>) k, Encoders.bean(Person.class));
Update2 (use of mapGroups):
Dataset<Person> peopleMap = KVDS.mapGroups((MapGroupsFunction<String, Person, Person>) (f, g) -> {
    while (g.hasNext()) {
        // What can I do here?
    }
}, Encoders.bean(Person.class));
Update3: I want to keep only those groups whose number of distinct ids is greater than 1. For example, in the picture below I want just the Female group, because its number of distinct ids is greater than 1 (the first field is id; the others are fname, lname, email and gender).
Update4: I did what I want with an RDD, but I want to do exactly this part of the code with a Dataset:
List<Tuple2<String, Iterable<Person>>> f = PersonRDD
    .mapToPair(s -> new Tuple2<>(s.getGender(), s)).groupByKey()
    .filter(t -> ((Collection<Person>) t._2()).stream().mapToInt(e -> e.getId()).distinct().count() > 1)
    .collect();

Why don't you filter on id before grouping? groupByKey is an expensive operation because it shuffles the data, so filtering first should be faster.
If you really want to group first, you may then have to use .flatMapGroups and return the (filtered) rows of each group.
I'm not sure about the Java code, but the Scala version would be something like the following:
peopleDS
  .groupByKey(_.gender)
  .flatMapGroups { case (gender, persons) => persons.filter(your condition) }
But again, you should filter first :), especially since your id field is already available before grouping.

Grouping is meant for aggregation; the "KeyValueGroupedDataset" class provides functions such as "agg". If you apply an aggregation function, e.g. "count", you get a "Dataset" back, and the "filter" function is available on it.
"groupBy" without an aggregation function looks strange; other functions, e.g. "distinct", can be used instead.
Filtering example with "FlatMapGroupsFunction":
.flatMapGroups(
    (FlatMapGroupsFunction<String, Person, Person>) (f, k) -> {
        List<Person> result = new ArrayList<>();
        while (k.hasNext()) {
            Person value = k.next();
            // filter condition here
            if (value != null) {
                result.add(value);
            }
        }
        return result.iterator();
    },
    Encoders.bean(Person.class))
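For the condition in Update3 (keep a group only if it contains more than one distinct id), the function body can buffer the group, count the distinct ids, and emit either the whole group or nothing. The sketch below is plain Java with no Spark dependency so it can be tried in isolation: keepIfMultipleIds has the same (key, Iterator) -> Iterator shape as FlatMapGroupsFunction.call, and the Person class and the method name are hypothetical stand-ins, not the asker's actual code. Buffering assumes each group fits in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the asker's Person bean (only the fields used here).
class Person {
    private final int id;
    private final String gender;

    Person(int id, String gender) {
        this.id = id;
        this.gender = gender;
    }

    int getId() { return id; }
    String getGender() { return gender; }
}

class GroupFilterSketch {
    // Same shape as FlatMapGroupsFunction<String, Person, Person>.call(key, values):
    // returns every row of the group if it has more than one distinct id, else nothing.
    static Iterator<Person> keepIfMultipleIds(String gender, Iterator<Person> group) {
        List<Person> buffered = new ArrayList<>();
        group.forEachRemaining(buffered::add);
        long distinctIds = buffered.stream().mapToInt(Person::getId).distinct().count();
        return distinctIds > 1 ? buffered.iterator() : Collections.emptyIterator();
    }
}
```

With Spark on the classpath, this method should be usable as KVDS.flatMapGroups((FlatMapGroupsFunction<String, Person, Person>) GroupFilterSketch::keepIfMultipleIds, Encoders.bean(Person.class)), which gives a Dataset<Person> holding only the rows of the qualifying groups.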

Related

Merging two stream operation into one in Java for performance improvement

I have this object
class A {
    int count;
    String name;
}
I have a list of my above custom object as below :
List<A> aList = new ArrayList<>();
A a = new A(1,"abc");
A b = new A(0,"def");
A c = new A(0,"xyz");
aList.add(a);
aList.add(b);
aList.add(c);
I will get this list as input in my service. Now based upon some scenario, first I need to set "count" to ZERO for all elements in the list and based on a check with "name" I need to set the count as ONE for a particular name.
This is how I am doing now :
String tempName = "Some Name like abc/def/xyz";
aList.stream().forEach(x -> x.setCount(0));
aList.stream().filter(x -> x.getName().equalsIgnoreCase(tempName))
    .findFirst()
    .ifPresent(y -> y.setCount(1));
This is doing my job, but I want to know if I can simplify the above logic and use one single stream instead of two and improve the performance by avoiding looping through the list twice.

Just check whether the name matches in the first loop:
aList.forEach(x -> x.setCount(x.getName().equalsIgnoreCase(tempName) ? 1 : 0));
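Note that this one-liner sets the count to 1 for every element whose name matches, whereas the findFirst() version in the question marks only the first match. If that difference matters, a plain loop with a flag keeps the single pass. The sketch below assumes a hypothetical Item class standing in for A (with a constructor, getters and a setter) and a hypothetical helper name.

```java
import java.util.List;

// Hypothetical stand-in for class A from the question.
class Item {
    private int count;
    private final String name;

    Item(int count, String name) {
        this.count = count;
        this.name = name;
    }

    int getCount() { return count; }
    String getName() { return name; }
    void setCount(int count) { this.count = count; }
}

class SinglePassReset {
    // One pass: count becomes 1 on the first case-insensitive name match
    // and 0 on every other element, including later matches.
    static void resetAndMarkFirst(List<Item> items, String targetName) {
        boolean marked = false;
        for (Item item : items) {
            boolean match = !marked && item.getName().equalsIgnoreCase(targetName);
            item.setCount(match ? 1 : 0);
            marked |= match;
        }
    }
}
```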

How to identify duplicate records in a list?

I have the following problem:
I want to remove duplicate data from a list of MyVo objects when their registered field is the same. Below is the solution I am trying. This is the data in the list I am building:
List<MyVo> dataList = new ArrayList<MyVo>();
MyVo data1 = new MyVo();
data1.setValidated(1);
data1.setName("Fernando");
data1.setRegistered("008982");
MyVo data2 = new MyVo();
data2.setValidated(0);
data2.setName("Orlando");
data2.setRegistered("008986");
MyVo data3 = new MyVo();
data3.setValidated(1);
data3.setName("Magda");
data3.setRegistered("008982");
MyVo data4 = new MyVo();
data4.setValidated(1);
data4.setName("Jess");
data4.setRegistered("006782");
dataList.add(data1);
dataList.add(data2);
dataList.add(data3);
dataList.add(data4);
The first thing I have to do is split it into two different lists depending on whether the data is validated or not, based on the value of the validated field.
List<MyVo> registeredBusinesses = new ArrayList<MyVo>();
List<MyVo> unregisteredBusinesses = new ArrayList<MyVo>();
for (MyVo map : dataList) {
    if (map.getValidated() == 0) {
        unregisteredBusinesses.add(map);
    } else {
        registeredBusinesses.add(map);
    }
}
Now, from the list of registered businesses, I want to take out the entries that share the same value in their registered field and put them into a new list. This is what I tried, but it doesn't work right:
List<MyVo> duplicateList = registeredBusinesses.stream().filter(distictByRegistered(MyVo::getRegistered)).collect(Collectors.toList());
public static <T> Predicate<T> distictByRegistered(Function<? super T, ?> keyExtractor) {
Set<Object> seen = ConcurrentHashMap.newKeySet();
return t -> seen.add(keyExtractor.apply(t));
}
however using this method I get the following output:
{["validated":1,"name":"Fernando","registered":"008982"],
["validated":1,"name":"Jess","registered":"006782"]}
the output I want to obtain is the following:
the unregisteredBusinesses list:
{["validated":0,"name":"Orlando","registered":"008986"]}
the registeredBusinesses list:
{["validated":1,"name":"Jess","registered":"006782"]}
the registeredDuplicateBusinesses list:
{["validated":1,"name":"Fernando","registered":"008982"],
["validated":1,"name":"Magda","registered":"008982"]}
I don't know how to do it, could you help me? I would like to use lambdas to reduce the code, for example of the first for when I separate into two lists

Your approach looks almost correct. Grouping by Function.identity() will properly flag duplicates (based on the equals() implementation!); you could also group by a unique property/id of your object if you have one. What you're missing is manipulating the resulting map to get a list with all the duplicates. I've added comments describing what's happening here.
List<MyVo> duplicateList = registeredBusinesses.stream()
    .collect(Collectors.groupingBy(Function.identity()))
    .entrySet()
    .stream()
    .filter(e -> e.getValue().size() > 1) // this is a stream of Map.Entry<MyVo, List<MyVo>>, so we check value.size() > 1
    .map(Map.Entry::getValue) // we convert this into a Stream<List<MyVo>>
    .flatMap(Collection::stream) // we want all duplicates in the same stream, so we flatMap it using Collection::stream
    .collect(Collectors.toList()); // at this stage we have a Stream<MyVo> with all duplicates, so we can collect it to a list
Additionally, you could also use the stream API to split dataList into registered and unregistered.
First we create a method isUnregistered in MyVo:
public boolean isUnregistered() {
    return getValidated() == 0;
}
Then:
Map<Boolean, List<MyVo>> registeredMap = dataList.stream().collect(Collectors.groupingBy(MyVo::isUnregistered));
where registeredMap.get(true) will be unregisteredBusinesses and registeredMap.get(false) registeredBusinesses.

Familiarizing yourself with Collectors.partitioningBy should help you solve this. There are two places in your current requirement where it could be applied.
First, you are looking for both registered and unregistered businesses. Here, instead of using 0 and 1, you could implement the attribute as a boolean isRegistered, with 0 mapping to false and 1 to true going forward. Your existing if-else code could then be rewritten as:
Map<Boolean, List<MyVo>> partitionBasedOnRegistered = dataList.stream()
.collect(Collectors.partitioningBy(MyVo::isRegistered));
List<MyVo> unregisteredBusinesses = partitionBasedOnRegistered.get(Boolean.FALSE); // here
List<MyVo> registeredBusinesses = partitionBasedOnRegistered.get(Boolean.TRUE);
Second, after you group the registered businesses by registration number (rather than by identity), you need both the duplicate elements and the unique ones. That is effectively all entries, again partitioned into two buckets: one with value size == 1 and the other with size > 1. Since grouping ensures at least one element for each key, you can collect the required output with an additional mapping.
Map<String, List<MyVo>> groupByRegistrationNumber = // group registered businesses by number
Map<Boolean, List<List<MyVo>>> partitionBasedOnDuplicates = groupByRegistrationNumber
.entrySet().stream()
.collect(Collectors.partitioningBy(e -> e.getValue().size() > 1,
Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
If you access the FALSE values of the above map, you get the groupedRegisteredUniqueBusiness; the values against the TRUE key give you the groupedRegisteredDuplicateBusiness.
Note that if you were to flatten this List<List<MyVo>> to get a List<MyVo> as output, you could also use the flatMapping collector, which is built into the JDK from Java 9 onwards.
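Putting the two steps together, here is a sketch that produces the three lists from the question. It partitions on the existing validated int rather than on a boolean isRegistered (an assumption, so that MyVo keeps the shape shown in the question). The MyVo class below is a minimal hypothetical version with only the fields used, and BusinessSplitter is an illustrative name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal hypothetical version of MyVo from the question.
class MyVo {
    private final int validated;
    private final String name;
    private final String registered;

    MyVo(int validated, String name, String registered) {
        this.validated = validated;
        this.name = name;
        this.registered = registered;
    }

    int getValidated() { return validated; }
    String getName() { return name; }
    String getRegistered() { return registered; }
}

class BusinessSplitter {
    final List<MyVo> unregistered;
    final List<MyVo> registeredUnique = new ArrayList<>();
    final List<MyVo> registeredDuplicates = new ArrayList<>();

    BusinessSplitter(List<MyVo> dataList) {
        // Step 1: split on the validated flag (true = validated != 0).
        Map<Boolean, List<MyVo>> byValidated = dataList.stream()
                .collect(Collectors.partitioningBy(vo -> vo.getValidated() != 0));
        unregistered = byValidated.get(false);

        // Step 2: group the validated entries by registration number, keeping input order.
        Map<String, List<MyVo>> byNumber = byValidated.get(true).stream()
                .collect(Collectors.groupingBy(MyVo::getRegistered, LinkedHashMap::new, Collectors.toList()));

        // Step 3: groups of size 1 are unique; larger groups are duplicates.
        for (List<MyVo> group : byNumber.values()) {
            (group.size() > 1 ? registeredDuplicates : registeredUnique).addAll(group);
        }
    }
}
```

partitioningBy always returns both the true and the false key, so the two get calls never return null even when one side is empty.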

MongoTemplate database query, instead of in-service code, to get required data

I need to create a MongoTemplate database query to get a specific number of elements into a list.
At the moment I just get all the elements with findAll(), and then I modify the obtained data using code that I have written within the service class.
Initially, I have a Laptop class with fields price::BigDecimal and name::String and I use findAll() to get a list of them.
Then I put those in a HashMap, where the key is the name field and each list is sorted from most expensive to cheapest.
Map<String, List<Laptop>> laptopsMap = laptopsFrom.stream()
    .collect(Collectors.groupingBy(Laptop::getName,
        Collectors.collectingAndThen(Collectors.toList(),
            l -> l.stream()
                .sorted(Comparator.comparing(Laptop::getPrice).reversed())
                .collect(Collectors.toList())
        ))
    );
So the results are like below:
[{"MSI", [2200, 1100, 900]},
{"HP", [3200, 900, 800]},
{"Dell", [2500, 2000, 700]}]
Then, I use the code in the bottom of the question, to create a Laptop list with the following contents:
[{"HP", 3200}, {"Dell", 2500}, {"MSI", 2200},
{"Dell", 2000}, {"MSI", 1100}, {"HP", 900},
{"MSI", 900}, {"HP", 800}, {"Dell", 700}]
So basically, I iterate the map and from each key, I extract the next in line element of the list.
do {
    for (Map.Entry<String, List<Laptop>> entry :
            laptopsMap.entrySet()) {
        String key = entry.getKey();
        List<Laptop> value = entry.getValue();
        finalResultsList.add(value.get(0));
        value.remove(0);
        if (value.size() == 0) {
            laptopsMap.entrySet()
                .removeIf(pr -> pr.getKey().equals(key));
        } else {
            laptopsMap.replace(key, value);
        }
    }
} while(!laptopsMap.isEmpty());
Instead of all this in-class code, I need to use a mongoTemplate database query, but I can't seem to figure out how to create such a complex query. I have read material about Aggregation but have not found anything helpful enough. At the moment, I have started putting a query together as shown below:
Query query = new Query();
query.limit(numOfLaptops);
query.addCriteria(Criteria.where(Laptop.PRICE).gte(minPrice));
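Whatever form the final query takes, the target ordering can be stated compactly: rank each laptop within its name group by descending price, then emit rank by rank, ordering each round by descending price. The sketch below does this in memory without mutating the map during iteration (the loop above removes entries from laptopsMap while iterating over its entrySet(), which can throw ConcurrentModificationException on a HashMap). It is not a MongoTemplate query. Laptop is a minimal hypothetical class, RoundRobinOrder is an illustrative name, and the ordering within a round is inferred from the expected output listed above.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal hypothetical Laptop with the two fields named in the question.
class Laptop {
    private final String name;
    private final BigDecimal price;

    Laptop(String name, BigDecimal price) {
        this.name = name;
        this.price = price;
    }

    String getName() { return name; }
    BigDecimal getPrice() { return price; }
}

class RoundRobinOrder {
    static List<Laptop> interleave(List<Laptop> laptops) {
        Comparator<Laptop> priceDesc = Comparator.comparing(Laptop::getPrice).reversed();

        // Sort once by descending price; grouping a sorted sequential stream
        // keeps that order inside each per-name list.
        Map<String, List<Laptop>> byName = laptops.stream()
                .sorted(priceDesc)
                .collect(Collectors.groupingBy(Laptop::getName));

        int rounds = byName.values().stream().mapToInt(List::size).max().orElse(0);
        List<Laptop> result = new ArrayList<>();
        for (int rank = 0; rank < rounds; rank++) {
            final int r = rank;
            // Round r: take the r-th most expensive laptop of every name that still has one.
            byName.values().stream()
                    .filter(group -> group.size() > r)
                    .map(group -> group.get(r))
                    .sorted(priceDesc)
                    .forEach(result::add);
        }
        return result;
    }
}
```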

Stream Filter of 1 list based on another list

I am posting my query after having searched in this forum & google, but was unable to resolve the same.
eg: Link1 Link2 Link3
I am trying to filter List 2 (multi column) based on the values in List 1.
List1:
- [Datsun]
- [Volvo]
- [BMW]
- [Mercedes]
List2:
- [1-Jun-1995, Audi, 25.3, 500.4, 300]
- [7-Apr-1996, BMW, 35.3, 250.2, 500]
- [3-May-1996, Porsche, 45.3, 750.8, 200]
- [2-Nov-1998, Volvo, 75.3, 150.2, 100]
- [7-Dec-1999, BMW, 95.3, 850.2, 900]
expected o/p:
- [7-Apr-1996, BMW, 35.3, 250.2, 500]
- [2-Nov-1998, Volvo, 75.3, 150.2, 100]
- [7-Dec-1999, BMW, 95.3, 850.2, 900]
Code
// List 1 in above eg
List<dataCarName> listCarName = new ArrayList<>();
// List 2 in above eg
List<dataCar> listCar = new ArrayList<>();
// Values to the 2 lists are populated from excel
List<dataCar> listOutput = listCar.stream().filter(e -> e.getName().contains("BMW")).collect(Collectors.toList());
In the above code I can filter if I provide a specific value, but I am not sure how to check whether a Car Name in List 2 exists in List 1.
Hope the issue I face is clear, await guidance (Am still relatively new to Java, hence forgive if the above query is very basic).
Edit
I believe the link-3 provided above should resolve it, but in my case it is not working. Maybe because the values in list-1 are populated as
org.gradle04.Main.Cars.dataCarName@4148db48 .. etc.
I am able to get the value in human readable format only when I do a forEach on List 1 by calling the getName method.

It's not clear why you have a List<DataCarName> in first place instead of a List/Set<String>.
The predicate you have to provide must check if for the corresponding data car instance, there's its name in the list.
e -> e.getName().contains("BMW") will only check if the name of the data car contains BMW which is not what you want. Your first attempt then may be
e -> listCarName.contains(e.getName())
but since listCarName is a List<DataCarName> and e.getName() a string (I presume), you'll get an empty list as a result.
The first option you have is to change the predicate so that you get a stream from the list of data car names, map them to their string representation and check that any of these names corresponds to the current data car instance's name you are currently filtering:
List<DataCar> listOutput =
    listCar.stream()
        .filter(e -> listCarName.stream().map(DataCarName::getName).anyMatch(name -> name.equals(e.getName())))
        .collect(Collectors.toList());
Now this is very expensive because you create a stream for each instance in the data car stream pipeline. A better way would be to build a Set<String> with the cars' name upfront and then simply use contains as a predicate on this set:
Set<String> carNames =
    listCarName.stream()
        .map(DataCarName::getName)
        .collect(Collectors.toSet());
List<DataCar> listOutput =
    listCar.stream()
        .filter(e -> carNames.contains(e.getName()))
        .collect(Collectors.toList());

In your DataCar type, does getName() return a String or the DataCarName enum type? If it is the enum, you might follow Alexis C's approach, but instead of building a HashSet using Collectors.toSet(), build an EnumSet, which gives O(1) performance. Modifying Alexis' suggestion, the result would look like:
Set<DataCarName> carNames =
    listCarName.stream()
        .collect(Collectors.toCollection(
            () -> EnumSet.noneOf(DataCarName.class)));
List<DataCar> listOutput =
    listCar.stream()
        .filter(car -> carNames.contains(car.getName()))
        .collect(Collectors.toList());

Try this:
SortedMap<String, Account> accountMap, List<AccountReseponse> accountOwnersList
List<Map.Entry<String, Account>> entryList = accountMap.entrySet().stream().filter(account -> accountOwnersList.stream()
.anyMatch(accountOwner -> accountOwner.getAccount()
.getIdentifier().equals(account.getValue().getIdentifier())))
.collect(Collectors.toList());
Can also use .noneMatch().

@Alexis's answer is nice, but I have another approach that takes advantage of Map lookup performance and improves on the part where you run listCarName.stream().map(DataCarName::getName).anyMatch(name -> name.equals(e.getName())) for each item. First I build a map from listCar, keyed by the field I want to compare (here the car's name), and then I map each entry of List1 to its DataCar, filtering out the null values.
So it should be something like:
final Map<String, DataCar> allCarsMap = listCar // Your List2
    .stream().collect(Collectors.toMap(DataCar::getName, o -> o));
final List<DataCar> listOutput = // Your expected result
    listCarName // Your List1
        .stream()
        .map(DataCarName::getName) // extract the name string used as the map key
        .map(allCarsMap::get) // will map each name to a value in the map
        .filter(Objects::nonNull) // filter out the null value for any car name that does not exist in the map
        .collect(Collectors.toList());
I hope this helps; it may perform a little better in some scenarios.
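One caveat with this map-based variant: Collectors.toMap throws IllegalStateException when two cars share a name, and List2 in the question contains BMW twice. The sketch below groups the cars by name instead, so every matching row is kept. DataCar and DataCarName are minimal hypothetical classes, CarLookup is an illustrative name, and the output follows the order of List1 rather than List2.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

// Minimal hypothetical versions of the two classes from the question.
class DataCarName {
    private final String name;
    DataCarName(String name) { this.name = name; }
    String getName() { return name; }
}

class DataCar {
    private final String date;
    private final String name;

    DataCar(String date, String name) {
        this.date = date;
        this.name = name;
    }

    String getDate() { return date; }
    String getName() { return name; }
}

class CarLookup {
    static List<DataCar> filterByNames(List<DataCar> cars, List<DataCarName> names) {
        // Group instead of toMap so that several cars may share one name.
        Map<String, List<DataCar>> carsByName = cars.stream()
                .collect(Collectors.groupingBy(DataCar::getName));
        return names.stream()
                .map(DataCarName::getName)
                .distinct()                // avoid emitting a group twice if a name repeats
                .map(carsByName::get)      // null when no car has that name
                .filter(Objects::nonNull)
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }
}
```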

Multiple conditions to filter a result set using Java 8

I am looking for some help in converting some code I have to use the really nifty Java 8 Stream library. Essentially I have a bunch of student objects and I would like to get back a list of filtered objects as seen below:
List<Integer> classRoomList;
Set<ScienceStudent> filteredStudents = new HashSet<>();
// Return only 5 students in the end
int limit = 5;
for (MathStudent s : mathStudents)
{
    // Get the scienceStudent with the same id as the math student
    ScienceStudent ss = scienceStudents.get(s.getId());
    if (classRoomList.contains(ss.getClassroomId()))
    {
        if (!exclusionStudents.contains(ss))
        {
            if (limit > 0)
            {
                filteredStudents.add(ss);
                limit--;
            }
        }
    }
}
Of course the above is a super contrived example I made up for the sake of learning more Java 8. Assume all students extend a Student object with studentId and classRoomId. An additional requirement is to have the result be an immutable set.

A quite literal translation (and the required classes to play around)
interface ScienceStudent {
    String getClassroomId();
}
interface MathStudent {
    String getId();
}
Set<ScienceStudent> filter(
        Collection<MathStudent> mathStudents,
        Map<String, ScienceStudent> scienceStudents,
        Set<ScienceStudent> exclusionStudents,
        List<String> classRoomList) {
    return mathStudents.stream()
        .map(s -> scienceStudents.get(s.getId()))
        .filter(ss -> classRoomList.contains(ss.getClassroomId()))
        .filter(ss -> !exclusionStudents.contains(ss))
        .limit(5)
        .collect(Collectors.toSet());
}
Multiple conditions to filter really just translate into multiple .filter calls or one big combined filter like ss -> classRoomList.contains(ss.getClassroomId()) && !exclusion...
Regarding the immutable set: collect expects a mutable collection that can be filled from the stream and returned once finished, so the simplest option is to wrap the result manually afterwards, for example with Collections.unmodifiableSet.
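One way to keep the wrapping inside the pipeline is Collectors.collectingAndThen, which applies a finishing function to the collected set. Below is a minimal sketch on plain strings rather than student objects; the class name, method name and parameters are made up for illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class ImmutableResult {
    // Applies two filter conditions, keeps at most `limit` elements,
    // and returns the result wrapped as an unmodifiable set.
    static Set<String> filterToUnmodifiableSet(List<String> ids, Set<String> allowed,
                                               Set<String> excluded, int limit) {
        return ids.stream()
                .filter(allowed::contains)
                .filter(id -> !excluded.contains(id))
                .limit(limit)
                .collect(Collectors.collectingAndThen(Collectors.toSet(), Collections::unmodifiableSet));
    }
}
```

The returned set rejects add and remove with UnsupportedOperationException. It is an unmodifiable view over the set built by the collector, which nothing else references, so it behaves as immutable in practice.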
The null paranoid version
return mathStudents.stream().filter(Objects::nonNull) // math students could be null
    .map(MathStudent::getId).filter(Objects::nonNull) // their id could be null
    .map(scienceStudents::get).filter(Objects::nonNull) // and the mapped science student
    .filter(ss -> classRoomList.contains(ss.getClassroomId()))
    .filter(ss -> !exclusionStudents.contains(ss))
    .limit(5)
    .collect(Collectors.toSet());
