How to parallelize a list of lists with Spark? - java

Suppose I read whole files:
JavaPairRDD<String, String> filesRDD = sc.wholeTextFiles(inputDataPath);
Then, I have the following mapper:
JavaRDD<List<String>> processingFiles = filesRDD.map(fileNameContent -> {
    List<String> results = new ArrayList<String>();
    for ( some loop ) {
        if (condition) {
            results.add(someString);
        }
    }
    . . .
    return results;
});
For the sake of argument, suppose that inside the mapper I need to build a list of strings, which I return for each file. Now, each string in each list can be viewed independently and needs to be processed independently later on. I don't want Spark to process each list as a single unit, but each string of each list as a unit. Later, when I use collect(), I get a list of lists.
One way to put this is: how do I parallelize this list of lists over each string individually, not over each list individually?

Instead of mapping filesRDD to get a list of lists, flatMap it and you can get an RDD of strings.
EDIT: Adding my comment by request
Map is a 1:1 function where 1 input row -> 1 output row. Flatmap is a 1:N function where 1 input row -> many (or 0) output rows. If you use flatMap, you can design it so your output RDD is an RDD of strings, whereas currently your output RDD is an RDD of lists of strings. It sounds like this is what you want. I'm not a java-spark user, so I can't give you syntax specifics; check the Spark documentation for help with the syntax.
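A rough sketch of what that could look like in the Java API (assuming Spark 2.x, where the flatMap function returns an Iterator; in 1.x it returns an Iterable):
// Sketch only: the per-file logic stays the same, but returning an iterator
// lets Spark flatten the list into individual RDD elements.
JavaRDD<String> processedStrings = filesRDD.flatMap(fileNameContent -> {
    List<String> results = new ArrayList<>();
    // ... same per-file loop as above, adding each extracted string to results ...
    return results.iterator(); // Spark 2.x: the flatMap function returns an Iterator
});
// Each string is now its own RDD element, so later transformations (and collect())
// operate on strings rather than on lists of strings.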

Related

Combine List of streams into one single stream

I have a List<Stream<String>> that I get by doing a series of transactions.
The list size is dynamic (Maximum 3 elements) so I can't do:
Stream<String> finalStream = Stream.concat(list.get(0), Stream.concat(list.get(1), list.get(2)));
I need to concatenate the list of Streams into one single Stream<String>.
Is there any simple way to do this?
If you have a list of lists, or a stream of streams, or any collection of collections, you can use flatMap to, well, flatten them. flatMap applies a mapping function, which must return a stream, to each input element, and then streams each element of the stream returned by that mapping function.
In your case, you could do:
var finalStream = list.stream().flatMap(x -> x);
x -> x is the identity function, which returns its input unmodified. If you prefer, you can replace it with the expression Function.identity().
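Put together, a minimal self-contained sketch (the sample data here is made up):
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ConcatStreams {
    public static void main(String[] args) {
        // A dynamically sized list of streams (contents are just placeholders).
        List<Stream<String>> list = Arrays.asList(
                Stream.of("a", "b"),
                Stream.of("c"),
                Stream.of("d", "e", "f"));

        // Flatten the list of streams into a single Stream<String>.
        Stream<String> finalStream = list.stream().flatMap(x -> x);

        System.out.println(finalStream.collect(Collectors.toList())); // [a, b, c, d, e, f]
    }
}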

How to effectively process million records using JavaSpark

I am new to JavaSpark.
I have a requirement to compare and process millions of records. I used plain Java multithreading, but I want to do it the Spark way to increase performance.
Problem Statement:
There are millions of records in our database, and I need to compare them with another list and process them.
Example:
Step 1: We have List1 with a million strings fetched from the filesystem (this is not an issue).
Step 2: We fetch another million records from the database and add them to List2.
Step 3: Iterate and compare List1 elements with List2 (if a List1 element exists in List2, update the List2 element in the database).
The challenge
Steps 2 and 3 are taking a lot of time. How can I convert this problem statement into the JavaSpark way to increase performance?
What I have tried:
List<String> paths; // this contains a million strings

Iterator<T> oneMillionRecords = database.fetching(); // this is taking time
Iterable<T> iterable = () -> oneMillionRecords;
JavaRDD<T> parentPathRDDs = javaSparkContext.parallelize(
        StreamSupport.stream(iterable.spliterator(), false)
                .collect(Collectors.toList()));

List<T> avaliableResources = parentPathRDDs
        .filter(r -> paths.contains(r.getPath()))
        .map(dr -> { dr.setXXX("YYY"); return dr; })
        .collect();

List<T> unreachableResources = parentPathRDDs
        .filter(r -> !paths.contains(r.getPath()))
        .map(dr -> { dr.setX("ZZZ"); return dr; })
        .collect();

List<T> allRes = new ArrayList<>();
allRes.addAll(avaliableResources);
allRes.addAll(unreachableResources);
resourcesToUpdate.addAll(allRes);
The above code didn't have much impact on performance. Can anyone please suggest a better solution?

Can we collect two lists from Java 8 streams?

Consider that I have a list with two types of data, one valid and the other invalid.
If I start filtering through this list, can I collect two lists at the end?
Can we collect two lists from Java 8 streams?
You cannot do it with a plain filter, but you can if you have a way to group the elements of the list according to a condition.
In this case you could, for example, use Collectors.groupingBy(), which will return a Map<Foo, List<Bar>> where the values of the map are the two lists.
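A quick sketch of that, with a made-up validity check (here, "valid" means length 3) as the grouping condition:
List<String> input = Arrays.asList("one", "two", "three", "four", "five");
Map<Boolean, List<String>> grouped = input.stream()
        .collect(Collectors.groupingBy(s -> s.length() == 3));
// grouped.get(true)  -> [one, two]              (the "valid" elements)
// grouped.get(false) -> [three, four, five]     (the "invalid" elements)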
Note that in your case you don't need a stream just to filter.
Filter the invalid list with removeIf() and add all the remaining elements of it to the first list:
invalidList.removeIf(o -> conditionToRemove);
goodList.addAll(invalidList);
If you don't want to change the state of goodList, you can make a shallow copy of it:
invalidList.removeIf(o -> conditionToRemove);
List<Foo> terminalList = new ArrayList<>(goodList);
terminalList.addAll(invalidList);
Here is a way using the Java 8 Stream API. Consider a List of String elements: the input list has strings of various lengths, and only strings with length 3 are valid.
List<String> input = Arrays.asList("one", "two", "three", "four", "five");
Map<Boolean, List<String>> map = input.stream()
        .collect(Collectors.partitioningBy(s -> s.length() == 3));
System.out.println(map); // {false=[three, four, five], true=[one, two]}
The resulting Map has only two entries: one with the valid and one with the invalid input elements. The entry with key=true holds the valid strings as a List: one, two. The entry with key=false holds the invalid strings: three, four, five.
Note that Collectors.partitioningBy always produces two entries in the resulting map, irrespective of whether any valid or invalid values exist.
Another suggestion: filter using a lambda that adds the elements that you want to filter out of the stream to a separate list.
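A rough sketch of that idea, reusing the input list from the example above (note that side effects in a filter predicate are generally discouraged, and the side list would need to be thread-safe for a parallel stream):
List<String> rejected = new ArrayList<>();
List<String> accepted = input.stream()
        .filter(s -> {
            if (s.length() == 3) {
                return true;       // keep in the main result
            }
            rejected.add(s);       // stash the filtered-out element
            return false;
        })
        .collect(Collectors.toList());
// accepted -> [one, two], rejected -> [three, four, five]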
The collector can only return a single object!
But you could create a custom collector that simply puts the stream elements into two lists, and then returns a list of lists containing these two lists.
There are many examples of how to do that.
If you need a binary classification (true|false), you can use Collectors.partitioningBy, which returns a Map<Boolean, List<T>> (or another downstream collection such as a Set, if you additionally specify one).
If you need more than two categories, use Collectors.groupingBy, or just Collectors.toMap of collections.
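For example, partitioningBy with a downstream collector to get Sets instead of Lists (same input list and assumed length-3 condition as above):
Map<Boolean, Set<String>> partitioned = input.stream()
        .collect(Collectors.partitioningBy(s -> s.length() == 3, Collectors.toSet()));
// {false=[three, four, five], true=[one, two]}, but as Sets, so element order is not guaranteed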

Java Stream. Extracting distinct values of multiple maps

I am trying to extract and count the number of different elements in the values of a map. The thing is that it's not just a map, but many of them and they're to be obtained from a list of maps.
Specifically, I have a Tournament class with a List<Event> event member. An event has a Map<Localization, Set<Timeslot>> unavailableLocalizations member. I would like to count the distincts for all those timeslots values.
So far I managed to count the distincts in just one map like this:
event.getUnavailableLocalizations().values().stream().distinct().count()
But what I can't figure out is how to do that for all the maps instead of for just one.
I guess I would need some way to take each event's map's values and put all that into the stream, then the rest would be just as I did.
Let's do it step by step:
listOfEvents.stream()                           // stream the events
    .map(Event::getUnavailableLocalizations)    // for each event, get the map
    .map(Map::values)                           // get the values
    .flatMap(Collection::stream)                // flatMap to merge all the values into one stream
    .distinct()                                 // remove duplicates
    .count();                                   // count
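Note that distinct() here compares the Set<Timeslot> values themselves. If the goal is instead to count distinct Timeslot objects, one more flatMap over the sets does it; a sketch, assuming the event list is exposed by an accessor on Tournament (getEvents() is an assumed name):
long distinctTimeslots = tournament.getEvents().stream()  // getEvents() assumed for the List<Event> member
        .map(Event::getUnavailableLocalizations)
        .map(Map::values)
        .flatMap(Collection::stream)   // Stream<Set<Timeslot>>
        .flatMap(Set::stream)          // Stream<Timeslot>
        .distinct()
        .count();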

what is an efficient way to get arraylist of arraylist?

I have this string:
string1="A.1,B.2,C.4"
I want to get the following arraylist of arraylists:
<<"A","1">,<"B","2">,<"C","4">>
Is there any way other than using a for loop?
Suppose that the arraylists are unique, so I would have:
Set<String> set = new HashSet<>(Arrays.asList(string1.split(",")));
Now I want to split each element in the above set on ".", without using a for loop.
I was thinking of first splitting on ",". Then, split each of those on ".", put both values in a list, and then put that list in another list. Something pseudo-code-y that should work. :)
string1="A.1,B.2,C.4"
stringsWithDots[] = string1.split(",");
List<List<String>> result = new ArrayList<List<String>>();
for(String stringWithDots: stringsWithDots) {
finalSplit[] = stringWithDots.split(".");
List<String> list1 = Arrays.asList(finalSplit);
result.add(list1);
}
"A.1,B.2,C.4" would look like [[A,1],[B,2],[C,4]]
Edit:
split, which is used to split a string, has a loop inside it (see its source).
ArrayList is backed by an array, and asList simply wraps a reference to that array (see its source).
So, about that point on not using a loop: be it a stream or some internal function, loops happen; you just might not see them in the immediate code that you write.
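For completeness, here is a stream-based version of the same thing, using string1 from the example above (it needs java.util.stream.Collectors); the looping just moves inside the stream machinery:
List<List<String>> result = Arrays.stream(string1.split(","))
        .map(s -> Arrays.asList(s.split("\\.")))   // same escaped-dot split as above
        .collect(Collectors.toList());
// result -> [[A, 1], [B, 2], [C, 4]]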
