Java 8 variable scope in lambda (Spark specific)

I would like to fill a map with a String as key and Row as value, my code:
private Map<String,Row> getMapFromDataset(Dataset<Row> dataset, List<String> mapColumns) {
Map<String, Row> map = new HashMap<>();
dataset.foreach((ForeachFunction<Row>) row ->
map.put(getKey(mapColumns,row),row) //This works
);
return map; //Map is empty when returning!
}
My getKey() method (although I don't think it is the cause of the issue):
private String getKey(List<String> mapColumns, Row row) {
StringBuffer sb = new StringBuffer(256);
for(String col : mapColumns){
sb.append((String)row.getAs(col));
}
return sb.toString();
}
Although it compiles and runs without errors, the map is always empty.
What I have noticed is that if I check the size of the map right after the first insertion, the map has size 1, so the insertion works, but the returned map is empty.
I also read that variables used within a lambda should be final; this might explain the problem.
Any hints?

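I found out that the map is initialized on the driver, while the foreach lambda is serialized and sent to the executors. Each executor therefore fills its own deserialized copy of the map, and the driver's map stays empty.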
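The effect can be imitated without a cluster. The sketch below is plain Java, not Spark: PutTask and shipToExecutor are hypothetical stand-ins for a Spark function object and for the serialization step that ships it to an executor. It only illustrates why writes to a captured map never reach the original.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class ClosureCopyDemo {

    /** Hypothetical stand-in for a Spark function object: it carries the map it captured. */
    static class PutTask implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, Integer> captured;

        PutTask(Map<String, Integer> captured) {
            this.captured = captured;
        }

        void call(String row) {
            captured.put(row, row.length());
        }
    }

    /** Round-trips the task through Java serialization, as shipping it to another JVM would. */
    static PutTask shipToExecutor(PutTask task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (PutTask) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns {driver map size, executor copy size} after the "executor" handles two rows. */
    static int[] run() {
        Map<String, Integer> driverMap = new HashMap<>();
        PutTask remote = shipToExecutor(new PutTask(driverMap));
        remote.call("a");
        remote.call("bb");
        return new int[] { driverMap.size(), remote.captured.size() };
    }

    public static void main(String[] args) {
        int[] sizes = run();
        System.out.println("driver map size: " + sizes[0] + ", executor copy size: " + sizes[1]);
    }
}
```

In Spark itself, a common remedy is to bring the rows back to the driver first, for example with dataset.collectAsList(), and fill the map there. That assumes the data fits in driver memory.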

Related

How to collect data from a stream in different lists based on a condition?

I have a stream of data as shown below and I wish to collect the data based on a condition.
Stream of data:
452857;0;L100;csO;20220411;20220411;EUR;000101435;+; ;F;1;EUR;000100000;+;
452857;0;L120;csO;20220411;20220411;EUR;000101435;+; ;F;1;EUR;000100000;+;
452857;0;L121;csO;20220411;20220411;EUR;000101435;+; ;F;1;EUR;000100000;+;
452857;0;L126;csO;20220411;20220411;EUR;000101435;+; ;F;1;EUR;000100000;+;
452857;0;L100;csO;20220411;20220411;EUR;000101435;+; ;F;1;EUR;000100000;+;
452857;0;L122;csO;20220411;20220411;EUR;000101435;+; ;F;1;EUR;000100000;+;
I wish to collect the data based on the index = 2 (L100,L121 ...) and store it in different lists of L120,L121,L122 etc using Java 8 streams. Any suggestions?
Note: splittedLine array below is my stream of data.
For instance: I have tried the following but I think there's a shorter way:
List<String> L100_ENTITY_NAMES = Arrays.asList("L100", "L120", "L121", "L122", "L126");
List<List<String>> list= L100_ENTITY_NAMES.stream()
.map(entity -> Arrays.stream(splittedLine)
.filter(line -> {
String[] values = line.split(String.valueOf(DELIMITER));
if(values.length > 0){
return entity.equals(values[2]);
}
else{
return false;
}
}).collect(Collectors.toList())).collect(Collectors.toList());
I'd rather change the order and also collect the data into a Map<String, List<String>> where the key would be the entity name.
Assuming splittedLine is the array of lines, I'd probably do something like this:
Set<String> L100_ENTITY_NAMES = Set.of("L100", ...);
String delimiter = String.valueOf(DELIMITER);
Map<String, List<String>> result =
Arrays.stream(splittedLine)
.map(line -> {
String[] values = line.split(delimiter );
if( values.length < 3) {
return null;
}
return new AbstractMap.SimpleEntry<>(values[2], line);
})
.filter(Objects::nonNull)
.filter(entry -> L100_ENTITY_NAMES.contains(entry.getKey()))
.collect(Collectors.groupingBy(Map.Entry::getKey,
Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
Note that this isn't necessarily shorter but has a couple of other advantages:
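It's not O(n*m): each line is split once and checked against the set, so the cost is at most O(n * log(m)) (and close to O(n) with a hash-based set such as HashSet), which should be faster for non-trivial stream sizes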
You get an entity name for each list rather than having to rely on the indices in both lists
It's easier to understand because you use distinct steps:
split and map the line
filter null values, i.e. lines that aren't valid in the first place
filter lines that don't have any of the L100 entity names
collect the filtered lines by entity name so you can easily access the sub lists
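The steps above can be sketched as a small self-contained program. The class name GroupLinesByEntity, the helper toEntry and the shortened sample lines are assumptions made for illustration; the code stays Java 8 compatible, so it uses HashSet instead of Set.of.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class GroupLinesByEntity {

    static final Set<String> ENTITY_NAMES =
            new HashSet<>(Arrays.asList("L100", "L120", "L121", "L122", "L126"));

    /** Step 1: split the line; returns null when the line has no field at index 2. */
    static Map.Entry<String, String> toEntry(String line, String delimiter) {
        String[] values = line.split(Pattern.quote(delimiter));
        if (values.length < 3) {
            return null;
        }
        return new AbstractMap.SimpleEntry<>(values[2], line);
    }

    static Map<String, List<String>> group(String[] lines, String delimiter) {
        return Arrays.stream(lines)
                .map(line -> toEntry(line, delimiter))
                .filter(Objects::nonNull)                               // step 2: drop invalid lines
                .filter(entry -> ENTITY_NAMES.contains(entry.getKey())) // step 3: keep known entities
                .collect(Collectors.groupingBy(Map.Entry::getKey,       // step 4: group by entity name
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        String[] lines = {
            "452857;0;L100;csO", "452857;0;L120;csO", "452857;0;L100;csO", "bad;line"
        };
        group(lines, ";").forEach((entity, grouped) ->
                System.out.println(entity + " -> " + grouped.size() + " line(s)"));
    }
}
```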
I would convert the semicolon-delimited lines to objects as soon as possible, instead of keeping them around as a serialized bunch of data.
First, I would create a record that models the data:
public record LBasedEntity(long id, int zero, String lcode, …) { }
Then, create a method to parse the line. An external parsing library would work as well, since this looks like CSV with a semicolon as the delimiter.
private static LBasedEntity parse(String line) {
String[] parts = line.split(";");
if (parts.length < 3) {
return null;
}
long id = Long.parseLong(parts[0]);
int zero = Integer.parseInt(parts[1]);
String lcode = parts[2];
…
return new LBasedEntity(id, zero, lcode, …);
}
Then the mapping is trivial:
Map<String, List<LBasedEntity>> result = Arrays.stream(lines)
.map(line -> parse(line))
.filter(Objects::nonNull)
.filter(lBasedEntity -> L100_ENTITY_NAMES.contains(lBasedEntity.lcode()))
.collect(Collectors.groupingBy(LBasedEntity::lcode));
map(line -> parse(line)) parses the line into an LBasedEntity object (or whatever you call it);
filter(Objects::nonNull) filters out all null values produced by the parse method;
The next filter selects all entities of which the lcode property is contained in the L100_ENTITY_NAMES list (I would turn this into a Set, to speed things up);
Finally, groupingBy builds a Map with key-value pairs of entity name (lcode) → List<LBasedEntity>.
You're effectively asking for what languages like Scala provide on collections: groupBy. In Scala you could write:
splitLines.groupBy(_(2)) // Map[String, List[String]]
Of course, you want this in Java, and in my opinion, not using streams here makes sense due to Java's lack of a fold or groupBy function.
HashMap<String, ArrayList<String>> map = new HashMap<>();
for (String[] line : splitLines) {
if (line.length < 3) continue; // line[2] must exist
ArrayList<String> xs = map.getOrDefault(line[2], new ArrayList<>());
xs.addAll(Arrays.asList(line));
map.put(line[2], xs);
}
As you can see, it's very easy to understand, and actually shorter than the stream based solution.
I'm leveraging two key methods on a HashMap.
The first is getOrDefault; basically, if no value is associated with our key, we can provide a default. In our case, an empty ArrayList.
The second is put, which actually acts like a putOrReplace because it lets us override the previous value associated with the key.
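As a possible variation, the getOrDefault/put pair can be collapsed into a single Map.computeIfAbsent call (available since Java 8). The sketch below is an assumption-laden illustration, not the answer's code: the class GroupWithLoop is hypothetical, it splits on a hard-coded ";", and it stores whole lines per key rather than individual fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupWithLoop {

    static Map<String, List<String>> group(String[] lines) {
        Map<String, List<String>> byEntity = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(";");
            if (fields.length < 3) {
                continue; // no field at index 2, skip the line
            }
            // Creates the list on first sight of the key, then appends to it.
            byEntity.computeIfAbsent(fields[2], key -> new ArrayList<>()).add(line);
        }
        return byEntity;
    }

    public static void main(String[] args) {
        String[] lines = { "452857;0;L100;csO", "452857;0;L120;csO", "452857;0;L100;csO" };
        System.out.println(group(lines).keySet());
    }
}
```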
I hope that was helpful. :)
You're asking for a shorter way to achieve the same result, but your code is actually fine. I guess the only part that makes it look lengthy is the if/else check in the stream.
if (values.length > 0) {
return entity.equals(values[2]);
} else {
return false;
}
I would suggest introducing two tiny private methods to improve readability, like this:
List<List<String>> list = L100_ENTITY_NAMES.stream()
.map(entity -> getLinesByEntity(splittedLine, entity)).collect(Collectors.toList());
private List<String> getLinesByEntity(String[] splittedLine, String entity) {
return Arrays.stream(splittedLine).filter(line -> isLineMatched(entity, line)).collect(Collectors.toList());
}
private boolean isLineMatched(String entity, String line) {
String[] values = line.split(String.valueOf(DELIMITER));
return values.length > 2 && entity.equals(values[2]); // index 2 needs at least 3 fields
}

Accessing HashMap inside flatMapToPair

Edit: Already solved using RDD.collectAsMap()
I am trying to replicate the solution to the problem from pages 28-30 of http://on-demand.gputechconf.com/gtc/2016/presentation/S6424-michela-taufer-apache-spark.pdf
I have a HashMap that I instantiate outside of the map function. The HashMap contains the following data:
{1:2, 2:3, 3:2, 4:2, 5:3}
A previously defined RDD, previousRDD, has the type:
JavaPairRDD<Integer, Iterable<Tuple2<Integer, Integer>>>
and has the data:
1: [(1,2), (1,5)]
2: [(2,1), (2,3), (2,5)]
3: [(3,2), (3,4)]
4: [(4,3), (4,5)]
5: [(5,1), (5,2), (5,4)]
I try to create a new RDD with a flatMapToPair:
JavaPairRDD<Integer, Integer> newRDD = previousRDD.flatMapToPair(new PairFlatMapFunction<Tuple2<Integer, Iterable<Tuple2<Integer, Integer>>>, Integer, Integer>() {
@Override
public Iterator<Tuple2<Integer, Integer>> call(Tuple2<Integer, Iterable<Tuple2<Integer, Integer>>> integerIterableTuple2) throws Exception {
Integer count;
ArrayList<Tuple2<Integer, Integer>> list = new ArrayList<>();
count = hashMap.get(integerIterableTuple2._1);
for (Tuple2<Integer, Integer> t : integerIterableTuple2._2) {
Integer tcount = hashMap.get(t._2);
if (count < tcount || (count.equals(tcount) && integerIterableTuple2._1 < t._2)) {
list.add(t);
}
}
return list.iterator();
}
});
But here, hashMap.get(t._2) inside the for loop returns null most of the time, even though I have checked that the proper values are in the HashMap.
Is there a way to properly get the values of a HashMap inside a Spark function?
It should work. Spark should capture your variable, serialize it and send it to each worker with each task. You might try broadcasting this map:
sc.broadcast(hashMap)
and use the result (read through its value() method) instead of hashMap. It is more memory-efficient too (shared storage per executor).
I had a similar problem with class variables. You can try making your variable local or declaring one more, like this:
Map<Integer, Integer> localMap = hashMap;
JavaPairRDD<Integer, Integer> newRDD = previousRDD.flatMapToPair(
...
Integer tcount = localMap.get(t._2);
...
);
I think this is due to Spark's serialization mechanism. You can read more about it here.
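One way to see why a local copy can matter: a lambda that reads a field captures the enclosing object (this), while a lambda that reads a local variable captures only that variable. The sketch below is plain Java, not Spark, and all names (FieldCaptureDemo, Job, SerFunction, isSerializable) are hypothetical. It shows only the capture difference under Java serialization; it does not reproduce the null lookups described in the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class FieldCaptureDemo {

    /** A serializable function type, playing the role of Spark's function interfaces. */
    interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    /** Deliberately NOT Serializable, like many driver-side classes. */
    static class Job {
        private final Map<Integer, Integer> hashMap = new HashMap<>();

        Job() {
            hashMap.put(1, 2);
        }

        /** Reading the field inside the lambda captures `this`, i.e. the whole Job. */
        SerFunction<Integer, Integer> viaField() {
            return key -> hashMap.get(key);
        }

        /** Copying the reference to a local first means only the map is captured. */
        SerFunction<Integer, Integer> viaLocal() {
            Map<Integer, Integer> localMap = hashMap;
            return key -> localMap.get(key);
        }
    }

    /** Returns true if the object can be written with Java serialization. */
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        System.out.println("via field serializable: " + isSerializable(job.viaField()));
        System.out.println("via local serializable: " + isSerializable(job.viaLocal()));
    }
}
```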

Optimising the java code for approach of Map computeIfPresent

I have the method below, which extracts values from the entity and puts each one into a map. Every key is set explicitly, so the method will grow as the number of keys grows. Can I write a common method based on the Map.computeIfPresent approach instead? Please advise how I can achieve this.
private void setMap(AbcLoginDTO abcLoginDTO, Map<String, Object> getMap) {
getMap.put("XXNAME", abcLoginDTO.getUsername());
getMap.put("XX_ID", abcLoginDTO.getClientId());
getMap.put("RR_ID", abcLoginDTO.getUserId());
getMap.put("QQ_TIME", abcLoginDTO.getLocktime());
}
I am thinking of something like the approach below:
static <E> void setIfPresent(Map<String, Object> map, String key, Consumer<E> setter, Function<Object, E> mapper) {
Object value = map.get(key);
if (value != null) {
setter.accept(mapper.apply(value));
}
}
but my point is that for each key I am setting the value explicitly so
if the count of keys grows that method code will also grow
You need to populate the Map with different values from the DTO, so you don't have other choices.
The method is long because you don't have a mapping between the key to add in the Map and the value to retrieve from the DTO.
You could write your code with a function such as:
static void setValueInMap(Map<String, Object> map, String key, Supplier<Object> mapper) {
map.put(key, mapper.get());
}
And use it like this:
Map<String, Object> map = ...;
AbcLoginDTO dto = ...;
setValueInMap(map, "keyUserName", dto::getUsername);
// and so on
But this offers no real advantage.
Your second snippet has no relationship to the first one.
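If an explicit key-to-getter mapping is acceptable, it can live in one table so that the populating method itself never grows. This is only a sketch: LoginDto is a hypothetical stand-in for AbcLoginDTO, and its field types are guesses.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class DtoToMap {

    /** Hypothetical stand-in for AbcLoginDTO; the field types are assumptions. */
    static class LoginDto {
        private final String username;
        private final String clientId;
        private final String userId;
        private final long locktime;

        LoginDto(String username, String clientId, String userId, long locktime) {
            this.username = username;
            this.clientId = clientId;
            this.userId = userId;
            this.locktime = locktime;
        }

        String getUsername() { return username; }
        String getClientId() { return clientId; }
        String getUserId() { return userId; }
        long getLocktime() { return locktime; }
    }

    /** One row per key: adding a key means adding a row here, nothing else. */
    static final Map<String, Function<LoginDto, Object>> EXTRACTORS = new LinkedHashMap<>();
    static {
        EXTRACTORS.put("XXNAME", LoginDto::getUsername);
        EXTRACTORS.put("XX_ID", LoginDto::getClientId);
        EXTRACTORS.put("RR_ID", LoginDto::getUserId);
        EXTRACTORS.put("QQ_TIME", LoginDto::getLocktime);
    }

    static void setMap(LoginDto dto, Map<String, Object> target) {
        EXTRACTORS.forEach((key, getter) -> target.put(key, getter.apply(dto)));
    }

    public static void main(String[] args) {
        Map<String, Object> target = new HashMap<>();
        setMap(new LoginDto("alice", "c-1", "u-7", 42L), target);
        System.out.println(target);
    }
}
```

The keys still have to be listed once, so this moves the growth into data rather than removing it.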
If I understand correctly, what you want to do is iterate over all of the object's members, get their values, and put them into a map according to their names. If so, then what you're looking for is called Reflection.
Every object can give you an array of its fields or methods (even private ones!) and then you can manipulate them using the Field / Method object.
Field[] members = AbcLoginDTO.class.getDeclaredFields();
Map<String, Object> values = new HashMap<>();
for(Field member : members) {
member.setAccessible(true);
values.put(member.getName(), member.get(abcLoginDTO));
}
What you end up with here is a "Map representation" of your AbcLoginDTO instance. From here you can do with it what you want.
Notice that I am "inspecting" the class itself in line 1, and then using the instance in line 5.
This code is not complete (for example, Field.get throws the checked IllegalAccessException, which must be handled), but it's a start, and it can also be adapted to work for ANY object.
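A more complete version of the reflection idea might look like the sketch below. ReflectiveMapper and LoginDto are hypothetical names, and skipping static and synthetic fields is a design choice added here, not something the answer above prescribes.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

class ReflectiveMapper {

    /** Hypothetical stand-in for AbcLoginDTO. */
    static class LoginDto {
        private final String username;
        private final String clientId;

        LoginDto(String username, String clientId) {
            this.username = username;
            this.clientId = clientId;
        }
    }

    /** Builds a field-name to value map for the declared instance fields of any object. */
    static Map<String, Object> toMap(Object bean) {
        Map<String, Object> values = new LinkedHashMap<>();
        for (Field field : bean.getClass().getDeclaredFields()) {
            if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())) {
                continue; // ignore compiler-generated and static fields
            }
            field.setAccessible(true); // allow reading private fields
            try {
                values.put(field.getName(), field.get(bean));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(toMap(new LoginDto("alice", "c-1")));
    }
}
```

Note that the keys are the Java field names, so custom keys such as "XXNAME" would still need a separate name mapping.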
I don't know if I understood correctly, but if I did, all you need is a way to build distinct keys for the values of your AbcLoginDTO class.
If so, that can be done easily.
Let's assume that abcLoginDTO.getClientId() is different for every AbcLoginDTO object:
private void setMap(AbcLoginDTO abcLoginDTO, Map<String, Object> getMap) {
getMap.put(Integer.toString(abcLoginDTO.getClientId())+"_NAME", abcLoginDTO.getUsername());
getMap.put(Integer.toString(abcLoginDTO.getClientId())+"_ID", abcLoginDTO.getClientId());
getMap.put(Integer.toString(abcLoginDTO.getClientId())+"_USER_ID", abcLoginDTO.getUserId()); // distinct suffix so it does not overwrite the client ID entry
getMap.put(Integer.toString(abcLoginDTO.getClientId())+"_TIME", abcLoginDTO.getLocktime());
}

Size of HashMap not increasing

I go through a loop three times and call this method each time:
// class variable
private HashMap<String, ffMotorskillsSession> tempMap;
tempMap = new HashMap<String, ffMotorskillsSession>();
for loop {
addMotorskillsSession(session);
}
private void addMotorskillsSession(ffMotorskillsSession pSession) {
StringBuilder sb = new StringBuilder();
sb.append(pSession.period).append(":").append(pSession.section)
.append(":").append(pSession.class_name).append(":")
.append(pSession.semester).append(":").append(pSession.grade);
tempMap.put(sb.toString(), pSession);
Log.d("Size: ", String.valueOf(tempMap.size()));
}
Every time I log the size as it passes through, it stays at one.
Can anyone see why?
A Map stores key/value pairs, with only one value per key. So if you're calling put with the same key multiple times, then it will correctly stick to the same size, only having a single entry for that key.
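A minimal sketch of that behavior, using hypothetical names (DuplicateKeyDemo, key, sizeAfterPuts) and plain strings in place of ffMotorskillsSession:

```java
import java.util.HashMap;
import java.util.Map;

class DuplicateKeyDemo {

    /** Builds a composite key in the same "a:b:c" style as the question. */
    static String key(String period, String section, String className) {
        return period + ":" + section + ":" + className;
    }

    /** Puts one entry per session and reports the resulting map size. */
    static int sizeAfterPuts(String[][] sessions) {
        Map<String, String> map = new HashMap<>();
        for (String[] s : sessions) {
            map.put(key(s[0], s[1], s[2]), s[2]); // same key: the old value is replaced
        }
        return map.size();
    }

    public static void main(String[] args) {
        String[][] same = { {"1", "A", "Gym"}, {"1", "A", "Gym"}, {"1", "A", "Gym"} };
        String[][] different = { {"1", "A", "Gym"}, {"2", "A", "Gym"}, {"3", "A", "Gym"} };
        System.out.println("identical keys: " + sizeAfterPuts(same));
        System.out.println("distinct keys: " + sizeAfterPuts(different));
    }
}
```

Logging the key itself (sb.toString()) next to the size is a quick way to check whether the three sessions really produce different keys.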

Changing LinkedHashMapValues

Below is data from 2 linkedHashMaps:
valueMap: { y=9.0, c=2.0, m=3.0, x=2.0}
formulaMap: { y=null, ==null, m=null, *=null, x=null, +=null, c=null, -=null, (=null, )=null, /=null}
What I want to do is put the values from the first map into the corresponding positions in the second map. Both maps use <String, Double> as type parameters.
Here is my attempt so far:
for(Map.Entry<String,Double> entryNumber: valueMap.entrySet()){
double doubleOfValueMap = entryNumber.getValue();
for(String StringFromValueMap: strArray){
for(Map.Entry<String,Double> entryFormula: formulaMap.entrySet()){
String StringFromFormulaMap = entryFormula.toString();
if(StringFromFormulaMap.contains(StringFromValueMap)){
entryFormula.setValue(doubleOfValueMap);
}
}
}
}
The problem with doing this is that it sets all of the values, i.e. y, m, x, c, to the value of the last double. Iterating through the values won't work either, as the values are normally in a different order from those in the formulaMap. Ideally, what I need to say is: if the string in formulaMap is the same as the string in valueMap, set the value in formulaMap to the same value as in valueMap.
Let me know if you have any ideas as to what I can do.
This is quite simple:
formulaMap.putAll(valueMap);
If your value map contains keys that are not in formulaMap, and you don't want to alter the original, do:
final Map<String, Double> map = new LinkedHashMap<String, Double>(valueMap);
map.keySet().retainAll(formulaMap.keySet());
formulaMap.putAll(map);
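Put together as a runnable sketch (the class FillFormulaMap and the method fill are hypothetical names; the sample entries are a subset of the maps in the question plus one extra key):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FillFormulaMap {

    /** Copies values into formulaMap, but only for keys formulaMap already has. */
    static void fill(Map<String, Double> formulaMap, Map<String, Double> valueMap) {
        Map<String, Double> copy = new LinkedHashMap<>(valueMap); // leave valueMap untouched
        copy.keySet().retainAll(formulaMap.keySet());             // drop keys unknown to formulaMap
        formulaMap.putAll(copy);                                  // existing keys keep their position
    }

    public static void main(String[] args) {
        Map<String, Double> formulaMap = new LinkedHashMap<>();
        for (String token : new String[] { "y", "=", "m", "*", "x", "+", "c" }) {
            formulaMap.put(token, null);
        }
        Map<String, Double> valueMap = new LinkedHashMap<>();
        valueMap.put("y", 9.0);
        valueMap.put("c", 2.0);
        valueMap.put("m", 3.0);
        valueMap.put("x", 2.0);
        valueMap.put("z", 1.0); // not part of the formula, must be ignored

        fill(formulaMap, valueMap);
        System.out.println(formulaMap);
    }
}
```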
Edit (due to a comment): it appears the problem was not at all what I thought, so here goes:
// The result map
final Map<String, Double> map = new LinkedHashMap<String, Double>();
for (final String key: formulaMap.keySet()) {
map.put(key, valueMap.get(key));
}
// Either return the new map, or do:
valueMap.clear();
valueMap.putAll(map);
for(Map.Entry<String,Double> valueFormula: valueMap.entrySet()){
formulaMap.put(valueFormula.getKey(), valueFormula.getValue());
}
