Efficient ways to traverse and group similar objects from a huge collection


I am currently working on an implementation that takes an ArrayList of objects, say 1000 of them, finds commonalities in their properties, and groups them.
For example:
ArrayList<CustomJaxbObj> itemList = {obj1, obj2, ..., objn} // pseudocode; n can reach 1000
Object attributes - year of registration, location, amount
Grouping criteria - for objects with the same year of registration and location, add up the amount
If there are 10 objects, of which 8 share the same location and year of registration and the other 2 share a different location and year of registration, add the amounts of the 8 and, separately, of the 2. At the end of the operation I am left with 2 objects: one holding the total of the 8 matched objects and one holding the total of the 2 matched objects.
Currently I am using two nested traditional for loops. Enhanced for loops are nicer, but they don't offer much control over indices, which I need for the grouping: the indices let me keep track of which individual entries were combined to form a new grouped entry.
for (int i = 0; i < objList.size(); i++) {
    for (int j = i + 1; j < objList.size(); j++) {
        // Perform the check with an if/else condition and traverse the whole list
    }
}
Although this does the job, it looks very inefficient and process-heavy. Is there a better way to do this? I have seen other answers that suggest Java 8 streams, but the operations are complex, so the grouping needs to be done explicitly. The example above only adds amounts when there is a match, but there is more to it than just adding.
Is there a better approach to this? A better data structure to hold data of this kind which makes searching and grouping easier?
Adding more perspective; apologies for not providing this info earlier.
The ArrayList is a collection of JAXB objects from an incoming XML payload.
XML hierarchy:
<Item>
  <Item1>
    <Item-Loc/>
    <ItemID/>
    <Item-YearofReg/>
    <Item-Details>
      <ItemID/>
      <Item-RefurbishMentDate/>
      <ItemRefurbLoc/>
    </Item-Details>
  </Item1>
  <Item2></Item2>
  <Item3></Item3>
  ....
</Item>
So the JAXB object of Item has a list of 900-1000 items. Each item might have an Item-Details subsection, which has a refurbishment date. The problem I face is that the nested loops work fine only when there is no Item-Details section and every item can be traversed and checked. The requirement says that if an item has been refurbished, we ignore its year of registration and use its year of refurbishment to match the criteria instead.
Another point is that Item-Details need not sit under the item it belongs to: Item1's details can show up in Item2's Item-Details section. The item ID is the field we use to map an item to its details.
This means I cannot start making changes until I have read through the complete list. A normal for loop could do that, but it would increase the cyclomatic complexity, which is already high because of the nested loops.
Hence the question: I need a data structure that first stores and analyses the list of objects before performing the grouping.
Apologies for not mentioning this before. This is my first question on Stack Overflow, hence the inexperience.

I'm not 100% sure what your end goal is, but here is something to get you started. To group by the two properties, you can do something like:
Map<String, Map<Integer, List<MyObjectType>>> map = itemList.stream()
.collect(Collectors.groupingBy(MyObjectType::getLoc,
Collectors.groupingBy(MyObjectType::getYear)));
The solution above assumes getLoc returns a String and getYear returns an Integer. You can then perform further stream operations to get the sum you want.

You can use a hash map to add up the amounts of elements that have the same year of registration and location.

You can use Collectors.groupingBy(classifier, downstream) with Collectors.summingInt as the downstream collector. You didn't post the class of the objects, so I took the liberty of defining my own, but the idea is similar. I also used AbstractMap.SimpleEntry as the key of the final map.
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public class GroupByYearAndLoc {
static class Node {
private Integer year;
private String loc;
private int value;
Node(final Integer year, final String loc, final int value) {
this.year = year;
this.loc = loc;
this.value = value;
}
}
public static void main(String[] args) {
List<Node> nodes = new ArrayList<>();
nodes.add(new Node(2017, "A", 10));
nodes.add(new Node(2017, "A", 12));
nodes.add(new Node(2017, "B", 13));
nodes.add(new Node(2016, "A", 10));
Map<AbstractMap.SimpleEntry<Integer, String>, Integer> sums = nodes.stream()
// group by year and location, then sum the value.
.collect(Collectors.groupingBy(n-> new AbstractMap.SimpleEntry<>(n.year, n.loc), Collectors.summingInt(x->x.value)));
sums.forEach((k, v)->{
System.out.printf("(%d, %s) = %d\n", k.getKey(), k.getValue(), v);
});
}
}
And the output:
(2017, A) = 22
(2016, A) = 10
(2017, B) = 13

I would make "Year+Location" concatenated be the key in a hashmap, and then let that map hold whatever is associated with each unique key. Then you can just have one "for loop" (not nested looping). That's the simplest approach.
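A minimal sketch of that single-loop idea, extended to the refurbishment rule from the question. Everything named here is an assumption for illustration: Item is a simplified stand-in for the JAXB class, Group and ItemGrouper are hypothetical helpers, and refurbYearById is assumed to have been filled in a first pass over all Item-Details sections (item ID to refurbishment year). The "|" separator assumes locations never contain that character.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the JAXB item class.
class Item {
    final String id;
    final String loc;
    final int yearOfReg;
    final int amount;

    Item(String id, String loc, int yearOfReg, int amount) {
        this.id = id;
        this.loc = loc;
        this.yearOfReg = yearOfReg;
        this.amount = amount;
    }
}

// One group: the summed amount plus the IDs of the items merged into it.
class Group {
    int total;
    final List<String> memberIds = new ArrayList<>();
}

class ItemGrouper {
    // refurbYearById maps item ID -> refurbishment year and is assumed to be
    // collected beforehand, because details may sit under a different item.
    static Map<String, Group> group(List<Item> items, Map<String, Integer> refurbYearById) {
        Map<String, Group> groups = new LinkedHashMap<>();
        for (Item item : items) {
            // A refurbished item is matched on its refurbishment year instead.
            int year = refurbYearById.getOrDefault(item.id, item.yearOfReg);
            String key = year + "|" + item.loc;
            Group g = groups.computeIfAbsent(key, k -> new Group());
            g.total += item.amount;
            g.memberIds.add(item.id);
        }
        return groups;
    }
}
```

Each item is visited once, and memberIds keeps the record of which entries were combined, which is what the index bookkeeping in the nested loops was for.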

Related

Java 8 : functional way to write sort, filter and count at same time

I am pretty new to Java, and I am trying to write the logic below in a functional way.
I have a list of objects that have many fields: List<someObject>
The fields of interest for now are long timestamp and String bookType
The problem statement is: I want to count the objects in the given list that have the same bookType as the object with the lowest timestamp.
For example, if we sort the given list of objects by timestamp in ascending order, and the first object in the sorted list has SOMETYPE as its bookType, then I want to find out how many objects in the list have the bookType SOMETYPE.
I have written this logic the plain old non-functional way, by maintaining 2 temp variables and iterating over the list once to find the lowest timestamp and the corresponding bookType, plus a count of each bookType.
But this is not acceptable to be run in a lambda, as it requires variables to be final
I could only write the part that sorts the given list by timestamp:
n.stream().sorted(Comparator.comparingLong(someObject::timestamp)).collect(Collectors.toList());
I am stuck on how to proceed with counting the objects that share the bookType of the lowest timestamp.

But this is not acceptable to be run in a lambda, as it requires variables to be final
First of all, this is not a problem, since you can work around the effectively-final restriction by creating, e.g., a single-element array and letting the lambda use its single (first) element.
Second, there's basically no sense in putting everything into one lambda. Think about it: how is finding the min value logically connected with counting objects grouped by some attribute? It isn't, and forcing both into one stream will just obfuscate your code.
What you should do is write a method that finds the item with the min timestamp and returns its bookType, then stream the collection, group it by bookType, and return the size of the group with that key.
It could look like this sketch:
public class Item {
private final long timestamp;
private final String bookType;
// Constructors, getters etc
}
// ...
public int getSizeOfBookTypeByMinTimestamp() {
return items.stream()
.collect(Collectors.groupingBy(Item::getBookType))
.get(getMin(items))
.size();
}
private String getMin(List<Item> items) {
return items
.stream()
.min(Comparator.comparingLong(Item::getTimestamp))
.orElseThrow(() -> new NoSuchElementException("items is empty")) // or use orElse(...) with a default Item
.getBookType();
}

The best way is to first find the item with the lowest timestamp and then filter the list for items with the same bookType. So in two steps (assuming someObject has accessor methods timestamp() and bookType()):
someObject first = n.stream().min(Comparator.comparingLong(someObject::timestamp)).orElseThrow(NoSuchElementException::new);
List<someObject> result = n.stream().filter(b -> b.bookType().equals(first.bookType())).collect(Collectors.toList()); // result.size() is the count
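As a self-contained illustration of the two-step approach (find the earliest item, then count items with its bookType), here is a small sketch. Book and BookTypeCounter are hypothetical names, and returning 0 for an empty list is an assumption about the desired behaviour, not something the question specifies.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the question's object type.
class Book {
    private final long timestamp;
    private final String bookType;

    Book(long timestamp, String bookType) {
        this.timestamp = timestamp;
        this.bookType = bookType;
    }

    long getTimestamp() { return timestamp; }
    String getBookType() { return bookType; }
}

class BookTypeCounter {
    // Counts the books sharing the bookType of the earliest book; 0 for an empty list.
    static long countTypeOfEarliest(List<Book> books) {
        // Step 1: one pass to find the book with the lowest timestamp (no sort needed).
        Optional<Book> earliest = books.stream()
                .min(Comparator.comparingLong(Book::getTimestamp));
        // Step 2: a second pass to count the books with the same bookType.
        return earliest
                .map(first -> books.stream()
                        .filter(b -> b.getBookType().equals(first.getBookType()))
                        .count())
                .orElse(0L);
    }
}
```

Using min instead of sorted keeps this at two linear passes, and nothing inside the lambdas has to be mutated.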

Sorting by, which is better - hashmap, treemap, custom implementation

I have an ArrayList of Subjects, each with a Date parameter. I want to group them into a sorted array (maybe not an array, but still a sorted structure) of Day objects, so that every Day has a Date parameter and contains only the subjects with that date. So what I want to do is somehow group them by date and then get them. I have seen implementations of grouping with a HashMap, but that gives me a grouped structure that is not sorted, so afterwards I would have to convert it to an ArrayList, for example. Or maybe I should use a TreeMap, which does the same but gives me back a sorted structure. Or maybe the best way is simply to write my own sorter that takes an ArrayList<Subject> and returns an ArrayList<Day>. I could also use a LinkedHashMap, which would work too.
So now I have no idea which is better and which I should choose. The important thing is that I will most likely not put new values into the structure or delete values from it; I will only get them.
UPD: If I use a map, then Date will be the key and the Day object will be the value.
By "get them" I mean iterate through them.
I'm doing all this in order to fill my UI elements with this info, so most likely I will not search for anything in my structure later.

Here's what I think you are asking for, but hopefully my answer can help even if it's not exactly it:
fast lookup using a Day as the key
the result of that lookup should be sorted (i.e. multiple times of the same day are ordered)
the possibility to see all subjects sorted by their Day
Here's one option. Use a Map that associates a Day to a sorted list of Subjects, so Map<Day, List<Subject>>. Since you don't need to add to it, you can build your mapping at the start and then sort it before you do any lookups. Here's an outline:
Map<Day, List<Subject>> buildMap(List<Subject> subjects) {
Map<Day, List<Subject>> map = new HashMap<Day, List<Subject>>();
// create a list of subjects for each day
for (Subject subject : subjects) {
if (!map.containsKey(subject.getDate().getDay())) {
map.put(subject.getDate().getDay(), new ArrayList<Subject>());
}
map.get(subject.getDate().getDay()).add(subject);
}
// go through and sort everything now that you have grouped them
for (Day day : map.keySet()) {
Collections.sort(map.get(day));
}
return map;
}
If you also need to be able to 'get' every entry sorted throughout the map, you could maintain a sorted list of days. Like so:
List<Day> buildSortedDaysList(Map<Day, List<Subject>> map) {
List<Day> sortedDays = new ArrayList<Day>(map.keySet());
// again, many ways to sort, but I assume Day implements Comparable
Collections.sort(sortedDays);
return sortedDays;
}
You could then wrap it in a class, of which I recommend you create a better name:
class SortedMapThing {
Map<Day, List<Subject>> map;
List<Day> orderedDays;
SortedMapThing(List<Subject> subjects) {
map = buildMap(subjects);
orderedDays = buildSortedDaysList(map);
}
List<Subject> getSubject(Day day) {
return map.get(day);
}
List<Subject> getAllSubjects() {
List<Subject> subjects = new ArrayList<Subject>();
for (Day day : orderedDays) {
subjects.addAll(map.get(day));
}
return subjects;
}
}
This implementation puts the work up front and gives you efficient lookup speed. If I misunderstood your question slightly, you should be able to adjust it accordingly. If I misunderstood your question entirely...I will be sad. Cheers!
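For comparison, here is a hedged sketch of the TreeMap option the question mentions, which keeps the keys sorted without a separate list of days. It assumes the date is a java.time.LocalDate (the question does not say which date type is used), and Subject and SubjectsByDay are hypothetical names.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in: a subject with a name and a date.
class Subject {
    final String name;
    final LocalDate date;

    Subject(String name, LocalDate date) {
        this.name = name;
        this.date = date;
    }

    LocalDate getDate() { return date; }
}

class SubjectsByDay {
    // Groups subjects by date; the TreeMap iterates its keys in ascending date order.
    static TreeMap<LocalDate, List<Subject>> groupByDate(List<Subject> subjects) {
        return subjects.stream()
                .collect(Collectors.groupingBy(Subject::getDate, TreeMap::new, Collectors.toList()));
    }
}
```

Iterating over entrySet() of the result visits the days in order, which matches the "only iterate, never modify" usage described in the question. Subjects within a day stay in their original list order.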

Iterating and comparing big data set

Basically I receive 2 big data lists from 2 different databases. The lists look like this:
List 1:
=============
A000001
A000002
A000003
.
.
A999999
List 2:
=============
121111
000111
000003
000001
.
.
I need to compare the two lists and find each item in List 1 that is also available in List 2 (after prefixing the List 2 item with a standard key), and if it is available, put it in a 3rd list for further manipulation. As an example, A000001 is available in List 1 as well as in List 2 (after adding the standard key), so I need to put it in the 3rd list.
Basically I have this code: for each row in List 1, I iterate through all the data in List 2 and compare. (Both are ArrayLists.)
List<String> list1 = //Data of list 1 from db
List<String> list2 = //Data of list 2 from db
for(String list1Item:list1) {
for(String list2Item:list2) {
String list2ItemAfterAppend = "A" + list2Item;
if(list1Item.equalsIgnoreCase(list2ItemAfterAppend)) {
//Add it to 3rd list
}
}
}
Yes, this logic works fine, but I feel this is not an efficient way to iterate over the lists. After adding timers, I found it takes 13444 milliseconds on average for 2000x5000 items. My question is: is there any other logic you can think of or suggest to improve the performance of this code?
I hope I'm clear; if not, please let me know how I can improve the question.

You can sort both lists and then iterate over both in a single loop, advancing whichever index points at the smaller value. Something like:
Collections.sort(list1);
Collections.sort(list2);
List<String> list3 = new ArrayList<>();
int index1 = 0;
int index2 = 0;
while (index1 < list1.size() && index2 < list2.size()) {
    String val1 = list1.get(index1);
    String val2 = "A" + list2.get(index2);
    int compare = val1.compareTo(val2);
    if (compare == 0) {
        list3.add(val1);
        index1++;
        index2++;
    } else if (compare > 0) {
        index2++; // val2 is smaller, so advance list2
    } else { // compare < 0
        index1++; // val1 is smaller, so advance list1
    }
}
Be careful about what kind of List you're using. get(int i) on a LinkedList is expensive, whereas on an ArrayList it is not. Calling list1.size() and list2.size() on every iteration is fine for these list types, since they just return a stored count. I'm not sure it's really useful, but you can create list3 with an initial capacity equal to the size of the smaller list, so it doesn't have to resize as it grows.
The code above is not tested, but you get the idea. I believe it's faster than yours because the merge loop is O(n+m) rather than O(n*m), but I'll let you measure it: sort() adds O(n log n + m log m), and compareTo() adds some time compared to your method, though normally it shouldn't be too much. Note that compareTo() is case-sensitive, unlike equalsIgnoreCase(). If you can, have your RDBMS sort both lists when you fetch them, so you don't have to do it in the Java code.

I think it depends on how big the lists are and how much memory you have.
For fewer than 1 million records, I would use a HashSet to make it faster.
The code may look like:
Set<String> set1 = //Data of list 1 from db, when you get the data you make it a Set instead of a List. HashSet is enough for you to use.
List<String> list2 = //Data of list 2 from db
Then you just need to:
for (String list2Item : list2) {
    if (set1.contains("A" + list2Item)) {
        // Add it to the 3rd list
    }
}
Hope this can help you.
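A self-contained sketch of the HashSet approach, under two assumptions: the "A" prefix from the question is the standard key, and case should be ignored as in the original equalsIgnoreCase check (handled here by upper-casing both sides once). ListMatcher is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ListMatcher {
    // Returns the prefixed list2 items that also occur in list1, ignoring case.
    static List<String> matches(List<String> list1, List<String> list2) {
        // Build the lookup set once: O(n) instead of rescanning list1 for every item.
        Set<String> keys = new HashSet<>();
        for (String s : list1) {
            keys.add(s.toUpperCase(Locale.ROOT));
        }
        List<String> list3 = new ArrayList<>();
        for (String s : list2) {
            String candidate = ("A" + s).toUpperCase(Locale.ROOT);
            if (keys.contains(candidate)) { // expected O(1) per lookup
                list3.add(candidate);
            }
        }
        return list3;
    }
}
```

The overall work is roughly O(n + m) hash operations, compared with the O(n*m) string comparisons of the nested loops.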

You can use the intersection method from Apache Commons Collections. Example:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import org.apache.commons.collections4.CollectionUtils;
public class NewClass {
public static void main(String[] args) {
List<String> list1 = Arrays.asList("A000001","A000002","A000003");
List<String> list2 = Arrays.asList("121111","000111","000001");
List<String> list3 = new ArrayList<>();
list2.stream().forEach((s) -> {list3.add("A"+s);});
Collection<String> common = CollectionUtils.intersection(list1, list3);
}
}

You could try the Stream API for this. The code to create the new list with streams is very concise and straightforward, and probably very similar in performance:
List<String> list3 = list2.stream()
.map(s->"A"+s)
.filter(list1::contains)
.collect(Collectors.toList());
If the lists are big, you could try processing them in parallel with multiple threads. This may or may not improve performance, so it is important to measure whether parallel processing actually helps.
To process the stream in parallel, you only need to call the method parallel on the stream:
List<String> list3 = list2.stream()
.parallel()
.map(s->"A"+s)
.filter(list1::contains)
.collect(Collectors.toList());

Your code does a lot of String manipulation: equalsIgnoreCase converts characters to upper/lower case. This runs in your inner loop, and your lists are 5000x2000 in size, so the String manipulation is done millions of times.
Ideally, get your Strings in either upper or lower case from the database and avoid the conversion inside the inner loop. If this is not possible, converting the case of the Strings once at the beginning will probably improve performance.
Then you could create a new list from the elements of one list and retain only the elements that are present in the other list. The code with the uppercase conversion could be:
list1.replaceAll(String::toUpperCase);
List<String> list3 = new ArrayList<>(list2);
list3.replaceAll(s->"A"+s.toUpperCase());
list3.retainAll(list1);

Storing data for a linked list in a pair in Java

I am trying to create a linked list that will take a large amount of data, either integers or strings, and get the frequency with which each value occurs. I know how to create a basic linked list that would achieve this, but since the amount of data is so large, I want to find a quicker way to sort through the data instead of going through the entire linked list every time I call a certain method. To do this I need to make a Pair of <Object, Integer>, where the Object is the data and the Integer is the frequency with which it occurs.
So far I have tried creating arrays and lists that would help me sort out the data but cannot figure out how to get it into a Pair that represents the data and frequency. If you have any ideas that can help me at least get started that would be much appreciated.

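First of all, you must define your own data type, let's say:
public class FrequencyCount<T> implements Comparable<FrequencyCount<T>> {
    public final T data;
    public int frequency;

    public FrequencyCount(T data, int frequency) {
        this.data = data;
        this.frequency = frequency;
    }

    @Override
    public int compareTo(FrequencyCount<T> other) {
        // implement this method to choose your correct natural ordering
        throw new UnsupportedOperationException("not implemented yet");
    }
}
With such an object, everything becomes trivial: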
List<FrequencyCount<Some>> data = new ArrayList<FrequencyCount<Some>>();
Collections.sort(data);
Set<FrequencyCount<Some>> sortedData = new TreeSet<FrequencyCount<Some>>(data);

You could place all values into a List, create a Set from it and then iterate over the Set to find the frequency in the List using Collections.frequency: http://docs.oracle.com/javase/7/docs/api/java/util/Collections.html#frequency(java.util.Collection,%20java.lang.Object)
List<Integer> allValues = ...;
Set<Integer> uniqueValues = new HashSet<Integer>(allValues);
for(Integer val : uniqueValues) {
int frequency = Collections.frequency(allValues, val);
// use val and frequency as key and value as you wish
}
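Collections.frequency rescans the whole list for every unique value. As an alternative, here is a hedged sketch that counts all frequencies in a single pass with a HashMap, where the map entries play the role of the <Object, Integer> pairs from the question. FrequencyCounter is a hypothetical name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrequencyCounter {
    // Maps each distinct value to the number of times it occurs in values.
    static <T> Map<T, Integer> count(List<T> values) {
        Map<T, Integer> frequencies = new HashMap<>();
        for (T value : values) {
            // merge inserts 1 for a new key, otherwise adds 1 to the stored count.
            frequencies.merge(value, 1, Integer::sum);
        }
        return frequencies;
    }
}
```

This relies on the values having sensible equals and hashCode implementations, which holds for Integer and String.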

Implementing search based on 2 fields in a java class

I am trying to present a simplified version of my requirement here for ease of understanding.
I have this class
public class MyClass {
private byte[] data1;
private byte[] data2;
private long hash1; // Hash value for data1
private long hash2; // Hash value for data2
// getters and setters
}
Now I need to compare 2 List instances of this class: find how many hash1 values match between the 2 lists and, for all matches, how many of the corresponding hash2 values match. Each of the 2 lists will have about 10 million MyClass objects.
I am planning to iterate over the first list and search in the second one. Is there a way I can optimize the search by sorting or ordering in any particular way? Should I sort both lists or only one?

The best solution would be to iterate; there is no faster solution than this. You could create a HashMap and take advantage of the fact that a map does not add the same key twice, but it has its own creation overhead.

Sort only the second list, iterate over the first, and do a binary search in the second: the sort is O(n log n), and the binary searches for n items are O(n log n).
Or use a HashSet for the second list, iterate over the first, and look each item up in the second: O(n).

If you have to check all the elements, I think you should iterate over the first list and have a HashMap for the second one, as AmitD said.
You just have to correctly override equals and hashCode in your MyClass class. Finally, I recommend using basic types as much as possible. For example, for the first list, a simple array would be better than a List.
Also, at the beginning you could check which of the two lists is shorter (if they differ in size) and iterate over that one.

I think you should create a hashmap for one of the lists (say list1) -
Map<Long, MyClass> map = new HashMap<Long, MyClass>(list1.size());//specify the capacity
//populate map like - put(myClass.getHash1(), myClass) : for each element in the list
Now just iterate through the second list (there is no point in sorting both) -
int hash1MatchCount = 0;
int hash2MatchCount = 0;
for(MyClass myClass : list2) {
MyClass mc = map.get(myClass.getHash1());
if(mc != null) {
hash1MatchCount++;
if(myClass.getHash2() == mc.getHash2()) {
hash2MatchCount++;
}
}
}
Note: Assuming that there is no problem regarding hash1 being duplicates.
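If hash1 values can repeat within a list, one possible variation is to index the first list by hash1 and keep the set of hash2 values seen for each. The sketch below assumes that reading of "match" (counted per element of the second list). Hashed and HashMatcher are hypothetical names, and for 10 million objects the boxed Long keys would cost noticeable memory.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical reduced form of MyClass: only the two hash fields matter here.
class Hashed {
    final long hash1;
    final long hash2;

    Hashed(long hash1, long hash2) {
        this.hash1 = hash1;
        this.hash2 = hash2;
    }
}

class HashMatcher {
    // Returns {hash1 matches, hash2 matches among those}, counted per element of list2.
    static long[] countMatches(List<Hashed> list1, List<Hashed> list2) {
        // Index list1: hash1 -> every hash2 value seen together with it.
        Map<Long, Set<Long>> index = new HashMap<>();
        for (Hashed h : list1) {
            index.computeIfAbsent(h.hash1, k -> new HashSet<>()).add(h.hash2);
        }
        long hash1Matches = 0;
        long hash2Matches = 0;
        for (Hashed h : list2) {
            Set<Long> hash2Values = index.get(h.hash1);
            if (hash2Values != null) {
                hash1Matches++;
                if (hash2Values.contains(h.hash2)) {
                    hash2Matches++;
                }
            }
        }
        return new long[] {hash1Matches, hash2Matches};
    }
}
```

Neither list needs to be sorted: building the index is one pass over the first list, and counting is one pass over the second.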
