I am new to JavaSpark.
I have a requirement to compare and process millions of records. I used plain Java multithreading, but I want to do it the Spark way to increase performance.
Problem Statement:
There are millions of records in our database; I need to compare them with another list and process them.
Example:
Step 1: We have List1 with a million strings fetched from the filesystem (this step is not the problem).
Step 2: We fetch another million records from the database and add them to List2.
Step 3: Iterate and compare List1 elements with List2 (if a List1 element exists in List2, update that List2 element in the database).
The challenge
Steps 2 and 3 are taking a lot of time. How can I convert this problem statement into the JavaSpark way to increase performance?
What I have tried:
List<String> paths; // this contains a million strings

Iterator<T> oneMillionRecords = database.fetching(); // this is taking time
Iterable<T> iterable = () -> oneMillionRecords;
JavaRDD<T> parentPathRDDs = javaSparkContext.parallelize(
        StreamSupport.stream(iterable.spliterator(), false)
                .collect(Collectors.toList()));

List<T> availableResources = parentPathRDDs
        .filter(r -> paths.contains(r.getPath()))
        .map(dr -> { dr.setXXX("YYY"); return dr; })
        .collect();

List<T> unreachableResources = parentPathRDDs
        .filter(r -> !paths.contains(r.getPath()))
        .map(dr -> { dr.setX("ZZZ"); return dr; })
        .collect();

List<T> allRes = new ArrayList<>();
allRes.addAll(availableResources);
allRes.addAll(unreachableResources);
resourcesToUpdate.addAll(allRes);
The above code didn't make much impact on performance. Can anyone suggest a better solution?
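For what it's worth, here is a sketch of one possible improvement (keeping the question's placeholder type T, and assuming paths fits comfortably in memory): List#contains is an O(n) scan per record, so building a HashSet from paths and broadcasting it lets every executor do O(1) lookups, and a single map pass replaces the two filter/map passes:

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

// Broadcast a hash-based copy of the paths so each executor gets one
// read-only copy and each lookup is O(1) instead of an O(n) list scan.
Set<String> pathSet = new HashSet<>(paths);
Broadcast<Set<String>> broadcastPaths = javaSparkContext.broadcast(pathSet);

// One pass over the RDD instead of two filter/map passes.
List<T> allRes = parentPathRDDs.map(r -> {
    if (broadcastPaths.value().contains(r.getPath())) {
        r.setXXX("YYY"); // available resource
    } else {
        r.setX("ZZZ");   // unreachable resource
    }
    return r;
}).collect();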
Related
Is there any way to assert that two lists of maps are equal, ignoring the order? Thanks.
Example:
List<Map<String, Object>> dataList1 = new ArrayList<>();
List<String> headers1 = new ArrayList<>();
headers1.add("Header1");
headers1.add("Header2");
headers1.add("Header3");
Map<String, Object> dataMap1 = new LinkedHashMap<>();
dataMap1.put(headers1.get(0), "testData1");
dataMap1.put(headers1.get(1), "testData2");
dataMap1.put(headers1.get(2), "testData3");
Map<String, Object> dataMap2 = new LinkedHashMap<>();
dataMap2.put(headers1.get(0), "testData4");
dataMap2.put(headers1.get(1), "testData5");
dataMap2.put(headers1.get(2), "testData6");
dataList1.add(dataMap1);
dataList1.add(dataMap2);
List<Map<String, Object>> dataList2 = new ArrayList<>();
List<String> headers3 = new ArrayList<>();
headers3.add("Header1");
headers3.add("Header2");
headers3.add("Header3");
Map<String, Object> dataMap3 = new LinkedHashMap<>();
dataMap3.put(headers3.get(0), "testData1");
dataMap3.put(headers3.get(1), "testData2");
dataMap3.put(headers3.get(2), "testData3");
Map<String, Object> dataMap4 = new LinkedHashMap<>();
dataMap4.put(headers3.get(0), "testData4");
dataMap4.put(headers3.get(1), "testData5");
dataMap4.put(headers3.get(2), "testData6");
dataList2.add(dataMap4);
dataList2.add(dataMap3);
System.out.println(dataList1);
System.out.println(dataList2);
and the results would be:
[{Header1=testData1, Header2=testData2, Header3=testData3}, {Header1=testData4, Header2=testData5, Header3=testData6}]
[{Header1=testData4, Header2=testData5, Header3=testData6}, {Header1=testData1, Header2=testData2, Header3=testData3}]
I want to get a TRUE result since they are actually the same, just in a different order. Thank you in advance!
EDIT:
Just to add: I am trying to check whether the two lists of maps are equal across two different sources (Excel file vs. database data), so there's a chance that the lists contain duplicates.
That is not what a list is for. If you don't care about the order, you should use a Set instead.
However, if the order matters elsewhere but you just want to make sure the lists contain the same Maps, you can convert them to Sets for the assertion:
new HashSet<Map<String, Object>>(dataList1).equals(new HashSet<Map<String, Object>>(dataList2))
In the case you're describing, the value lists have "bag" semantics, i.e. they behave like sets that may contain duplicates.
To compare those you need to write your own comparison logic (or find a library that provides it, e.g. Apache Commons Collections' CollectionUtils#isEqualCollection() or Hamcrest's Matchers#containsInAnyOrder()), since the default methods won't help. The assert would then be something like assertTrue(mapsAreEqual(actual, expected)) or assertEquals(new MapEqualWrapper(actual), new MapEqualWrapper(expected)), where MapEqualWrapper implements the logic in its equals().
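For example, with Apache Commons Collections 4 on the classpath (and JUnit 4's assertTrue assumed), the bag comparison is a single call, using dataList1/dataList2 as defined in the question:

import static org.junit.Assert.assertTrue;
import org.apache.commons.collections4.CollectionUtils;

// Compares the two lists as bags: same elements, same cardinalities, any order.
assertTrue(CollectionUtils.isEqualCollection(dataList1, dataList2));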
For the check, you could sort the lists (or copies of them) and do a traditional comparison, or use a frequency map (after other checks, of course), as sketched below:
First check the sizes: if they differ, the lists aren't equal.
Build a frequency map for the first list, i.e. increment the value by 1 for each occurrence.
Walk the second list, decreasing the occurrences and removing any frequency that hits 0.
If you hit an element that has no entry in the frequency map, you can stop right away, since the bags aren't equal.
Sorting and comparing is easier to implement, just a couple of lines, but the time complexity is O(n log n) due to the sorting.
The frequency map, on the other hand, is basically O(n): the iterations are O(n), and map put/get/remove should be O(1) in theory. This shouldn't matter unless you need to quickly compare large lists, so I'd go with the sort-and-compare method first.
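A minimal sketch of the frequency-map approach described above (the generic helper name bagsAreEqual is mine, not from any library):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

static <T> boolean bagsAreEqual(List<T> first, List<T> second) {
    if (first.size() != second.size()) {
        return false; // different sizes, the bags can't be equal
    }
    Map<T, Integer> frequencies = new HashMap<>();
    for (T element : first) {
        frequencies.merge(element, 1, Integer::sum); // count occurrences
    }
    for (T element : second) {
        Integer count = frequencies.get(element);
        if (count == null) {
            return false; // element missing or already used up
        }
        if (count == 1) {
            frequencies.remove(element); // frequency hit 0, drop the entry
        } else {
            frequencies.put(element, count - 1);
        }
    }
    return true; // sizes matched and every element was accounted for
}

Usage would then be assertTrue(bagsAreEqual(dataList1, dataList2)).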
I have two collections
The first contains all the elements.
The second contains the elements that I am interested in from the first collection.
The data is alphanumeric, in formats like:
AAA
AA.12.AA
BBB.234.B1
CC.89
…
The first collection contains roughly 300,000 records.
Now, if I want to pick 10 thousand records out of the first collection, it takes up to 40 seconds to find them.
Collection types: firstColl = ArrayList, secondColl = List.
Action: I iterate over all elements of firstColl, and for every element I check whether secondColl contains it.
I just want to know if anyone knows a more performant way to do this, maybe using BigList, Streams, ...
CODE:
List<RegionPolygon> regionPolygons = new ArrayList<>();
for (RegionPolygon regionPolygon: result) {
if (regionsArray.contains(regionPolygon.getRegionRef())) {
regionPolygons.add(regionPolygon);
}
}
Note: RegionPolygon has a String property with a very long value (easily more than 2,000 characters), although I am not using that property to check whether it is the region I am looking for. I just mention it because I don't know if it is part of the problem.
result = firstColl
regionsArray = secondColl
Thanks,
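A likely fix, sketched with the question's own names: List#contains is an O(n) scan, so copying regionsArray into a HashSet once makes each membership check an O(1) hash lookup and the whole filter roughly O(n + m):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Build the lookup set once; HashSet.contains is O(1) on average.
Set<String> regionRefs = new HashSet<>(regionsArray);

List<RegionPolygon> regionPolygons = new ArrayList<>();
for (RegionPolygon regionPolygon : result) {
    if (regionRefs.contains(regionPolygon.getRegionRef())) {
        regionPolygons.add(regionPolygon);
    }
}

The long String property is irrelevant here since it is never touched by the lookup; the repeated linear contains scan is the likely culprit.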
I want to sweep over a list of files and in each step get just a validation subset and a training subset.
In Java I can use List#subList(int from, int to) to get a sublist, but is there a nice and easy way to get all the elements except those in that range? E.g.
List<File> valid = this.sampleFiles.subList(fromIndex, toIndex);
List<File> train = this.sampleFiles.notInSubList(fromIndex, toIndex);
Maybe the Stream API and Collectors.partitioningBy could help you. That way you get a map with the partitions as you wish them to be.
sampleFiles.stream()
.collect(Collectors.partitioningBy(/* your partition function */));
In the above example you would get a Map<Boolean, List<File>>, but other partitions are possible too. Maybe IntStream.range would also work.
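For instance, one way to combine the two ideas, partitioning by index with IntStream.range (a sketch, assuming the question's sampleFiles, fromIndex, and toIndex):

import java.io.File;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Partition indices: [fromIndex, toIndex) goes to the "true" bucket.
Map<Boolean, List<File>> parts = IntStream.range(0, sampleFiles.size())
        .boxed()
        .collect(Collectors.partitioningBy(
                i -> i >= fromIndex && i < toIndex,
                Collectors.mapping(sampleFiles::get, Collectors.toList())));

List<File> valid = parts.get(true);   // the validation window
List<File> train = parts.get(false);  // everything else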
If you don't like streams, the following may also work...
List<String> sampleFiles = ...
List<String> valid = sampleFiles.subList(fromIndex, toIndex);
List<String> train = new ArrayList<>(sampleFiles);
train.removeAll(valid);
I don't think there is an easier way to get the remainder of a list, beside those variants.
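For reference, a two-subList variant (sketched with the question's File lists and fromIndex/toIndex) also yields the complement directly, avoiding removeAll's O(n*m) scan and its side effect of dropping duplicates:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Everything before the validation window, then everything after it.
List<File> train = new ArrayList<>(sampleFiles.subList(0, fromIndex));
train.addAll(sampleFiles.subList(toIndex, sampleFiles.size()));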
Suppose I read whole files:
JavaPairRDD<String, String> filesRDD = sc.wholeTextFiles(inputDataPath);
Then I have the following mapper:
JavaRDD<List<String>> processingFiles = filesRDD.map(fileNameContent -> {
    List<String> results = new ArrayList<String>();
    for ( /* some loop */ ) {
        if (condition) {
            results.add(someString);
        }
    }
    // . . .
    return results;
});
For the sake of argument, suppose that inside the mapper I need to build a list of strings, which I return from each file. Each string in each list can be viewed independently, and each needs to be processed independently later on. I don't want Spark to process each list as a unit, but each string of each list as a unit. Later, when I use collect(), I get a list of lists.
One way to put this is: how do I parallelize this for each string individually, not for each list individually?
Instead of mapping filesRDD to get a list of lists, flatMap it and you get an RDD of strings.
EDIT: adding my comment by request.
map is a 1:1 function: 1 input row -> 1 output row. flatMap is a 1:N function: 1 input row -> many (or 0) output rows. If you use flatMap, you can design it so your output RDD is an RDD of strings, whereas currently your output RDD is an RDD of lists of strings. It sounds like this is what you want. I'm not a Java Spark user, so I can't give you syntax specifics; check the documentation for help with the syntax.
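A sketch of that shape in the Java API (Spark 2.x, where flatMap takes a function returning an Iterator; the line-splitting logic is a hypothetical stand-in for the question's loop and condition):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// flatMap emits each produced string as its own RDD element.
JavaRDD<String> processedStrings = filesRDD.flatMap(fileNameContent -> {
    String content = fileNameContent._2(); // wholeTextFiles pairs are (path, content)
    return Arrays.stream(content.split("\n"))
            .filter(line -> !line.isEmpty()) // stand-in for the real condition
            .iterator();
});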
I have a for loop structure like this:
for(T element1 : list1) {
for(T element2 : element1.getSubElements()) {
...
}
}
list1 contains about 10,000 elements, and element1.getSubElements() returns around 10-20 elements per iteration.
The loop takes around 2 minutes to finish.
Any ideas about how to improve this?
The looping itself doesn't take that long; the work you do in the loop takes the time.
Your options are:
use a profiler to optimise the work inside the loop.
use parallelStream() to see if doing the work across multiple threads improves the time it takes.
I have an if; when the condition is true, I add element2 to an ArrayList. Maybe the problem is in the if?
To use parallelStream you can do
List<T> list2 = list1.parallelStream()
.flatMap(e -> e.getSubElements().stream())
.filter(e -> e.isConditionTrue())
.collect(Collectors.toList());
Try a lambda expression:
list1.forEach(new Consumer<E>() {
    @Override
    public void accept(E value) {
        System.out.println(value);
    }
});
Actually, both the forEach method and the Consumer interface were added in Java 8, but you could already do something very similar in Java 5+ using libraries like Guava or lambdaj. However, Java 8 lambda expressions let you achieve the same result in a less verbose and more readable way:
list1.forEach((E value) -> System.out.println(value));