Running computation in Spark in a single node - java

I have an RDD like so:
JavaPairRDD<PointFlag, Point> keyValuePair = ...
I want to output an RDD like so:
JavaRDD<Point> valuesRemainingAfterProcessing = processAndOutputSkylinePoints(keyValuePair)
The processing has to take place on a single node because all the values are needed at once (they are compared with each other and with their flags).
What I thought of doing is:
Map everything to a single ID: JavaPairRDD<Integer, Tuple2<PointFlag, Point>> singleIdRDD = keyValuePair.mapToPair(fp -> new Tuple2(0, fp));
Do the processing: JavaRDD<Iterable<Point>> iterableGlobalSkylines = singleIdRDD.map(ifp -> calculateGlobalSkyline(ifp)); (calculateGlobalSkyline() returns a List<Point>)
Convert to JavaRDD<Point>: JavaRDD<Point> globalSkylines = iterableGlobalSkylines.flatMap(p -> p);
All of this looks like a dirty hack to me, though, and I would like to know if there is a better way of doing it.

A good solution I found (definitely way less verbose) is to use the glom() function from the Spark API. This function turns each partition of the RDD into a single List holding that partition's elements, or in official terms:
Return an RDD created by coalescing all elements within each partition into a list.
First, though, you have to reduce the RDD to a single partition, so that glom() produces exactly one list. Here is the solution:
JavaPairRDD<PointFlag, Point> keyValuePair = ...;
JavaPairRDD<PointFlag, Point> singlePartition = keyValuePair.coalesce(1);
JavaRDD<List<Tuple2<PointFlag, Point>>> groupedOnASingleList = singlePartition.glom();
JavaRDD<Point> globalSkylinePoints = groupedOnASingleList.flatMap(singleList -> getGlobalSkylines(singleList));
If anyone has a better answer feel free to post it.
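To illustrate why the coalesce(1) step matters before glom(), here is a minimal plain-Java model of the idea. It is not Spark code: partitions are modelled as a list of lists, and the class and method names (GlomModel, glom, coalesceToOne) are hypothetical stand-ins for the Spark operations of similar name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of RDD partitions; hypothetical names, not the Spark API.
class GlomModel {

    // Models glom(): every partition becomes one list in the result.
    static <T> List<List<T>> glom(List<List<T>> partitions) {
        List<List<T>> result = new ArrayList<>();
        for (List<T> partition : partitions) {
            result.add(new ArrayList<>(partition));
        }
        return result;
    }

    // Models coalesce(1): all partitions are merged into a single one.
    static <T> List<List<T>> coalesceToOne(List<List<T>> partitions) {
        List<T> merged = new ArrayList<>();
        for (List<T> partition : partitions) {
            merged.addAll(partition);
        }
        List<List<T>> result = new ArrayList<>();
        result.add(merged);
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5));

        // Without coalescing, glom yields one list per partition.
        System.out.println(glom(partitions).size());

        // After coalescing to one partition, glom yields a single list
        // that holds every element, which is what the skyline step needs.
        System.out.println(glom(coalesceToOne(partitions)).size());
    }
}
```

In this model, calling glom on the uncoalesced data would hand the skyline function several partial lists instead of one complete list.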


Populate additional/extra fields only in list from another collection (with lambdas)

I have an original collection, List<Reviewers>, and a new one List<ReviewPerson> where some fields from Reviewers will be copied to ReviewPerson.
The new list is constructed in a special way and not directly from reviewers.stream().map(...). But at the end, I need to copy 2 additional columns that exist in each bean, status and comments.
List<Reviewers> originalList = ... // from DAO
if (!originalList.isEmpty()) {
List<ReviewPerson> newList = new ArrayList<ReviewPerson>();
// this fills out some columns of the ReviewPerson, not all;
// must use this partial construction from service class
newList.addAll(service.initialPopulation());
// At the end, need to copy: (1) status, (2) comments
// ...
}
The problem is I can't do this,
originalList.stream()
.map(obj -> new ReviewPerson(obj.getField1(), obj.getField2(),
// ...
obj.getStatus(), obj.getComments()))
.collect(Collectors.toList());
because I'm not constructing new objects in the collection. What should I do?
One common solution is to stream the (presumably corresponding) indexes of the lists and use the same index to access both lists:
IntStream.range(0, originalList.size()).forEach(i -> {
newList.get(i).setFieldA(originalList.get(i).getFieldA());
newList.get(i).setFieldB(originalList.get(i).getFieldB());
// etc...
});
But to be honest, this may be streaming for the sake of streaming. Sometimes a good old-fashioned, straightforward for loop is just a better solution.
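For comparison, a plain for loop version might look like the sketch below. The Reviewers and ReviewPerson classes here are stripped-down, hypothetical stand-ins with only the two fields from the question, and the sketch assumes both lists have the same size and the same order.

```java
import java.util.List;

// Hypothetical minimal versions of the beans from the question.
class Reviewers {
    private final String status;
    private final String comments;

    Reviewers(String status, String comments) {
        this.status = status;
        this.comments = comments;
    }

    String getStatus() { return status; }
    String getComments() { return comments; }
}

class ReviewPerson {
    private String status;
    private String comments;

    String getStatus() { return status; }
    String getComments() { return comments; }
    void setStatus(String status) { this.status = status; }
    void setComments(String comments) { this.comments = comments; }
}

class CopyFields {
    // Copies status and comments by position; assumes the element at
    // index i in both lists describes the same reviewer.
    static void copyStatusAndComments(List<Reviewers> originalList,
                                      List<ReviewPerson> newList) {
        if (originalList.size() != newList.size()) {
            throw new IllegalArgumentException("lists must have the same size");
        }
        for (int i = 0; i < originalList.size(); i++) {
            Reviewers source = originalList.get(i);
            ReviewPerson target = newList.get(i);
            target.setStatus(source.getStatus());
            target.setComments(source.getComments());
        }
    }
}
```

The explicit size check makes the "corresponding lists" assumption fail loudly instead of silently copying the wrong values.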
If you don't mind an extra dependency, and if Reviewers and ReviewPersons are indeed corresponding, I'd suggest using the jOOλ library and its Seq.zip() method (Seq is a subtype of Stream) together with a Tuple.consumer overload.
With the above, you can end up with such a concise piece of code:
Seq.zip(newList, originalList).forEach(Tuple.consumer((reviewPerson, reviewer) -> {
reviewPerson.setStatus(reviewer.getStatus());
reviewPerson.setComments(reviewer.getComments());
}));

Converting Linq queries to Java 8

I'm translating an old enterprise app that uses C# with LINQ queries to Java 8. There are some queries that I'm not able to reproduce using lambdas, as I don't know how C# handles them.
For example, in this Linq:
(from register in registers
group register by register.muleID into groups
select new Petition
{
Data = new PetitionData
{
UUID = groups.Key
},
Registers = groups.ToList<AuditRegister>()
}).ToList<Petition>()
I understand this as a groupingBy in Java 8 streams, but what is the "select new Petition" inside the query? I don't know how to code it in Java.
I have this at this moment:
Map<String, List<AuditRegister>> groupByMuleId =
registers.stream().collect(Collectors.groupingBy(AuditRegister::getMuleID));
Thank you and regards!
The select LINQ operation is similar to the map method of Stream in Java. They both transform each element of the sequence into something else.
collect(Collectors.groupingBy(AuditRegister::getMuleID)) returns a Map<String, List<AuditRegister>> as you know. But the groups variable in the C# version is an IEnumerable<IGrouping<string, AuditRegister>>. They are quite different data structures.
What you need is the entrySet method of Map. It turns the map into a Set<Map.Entry<String, List<AuditRegister>>>. Now, this data structure is more similar to IEnumerable<IGrouping<string, AuditRegister>>. This means that you can create a stream from the return value of entrySet, call map, and transform each element into a Petition.
groups.Key is simply x.getKey(), groups.ToList() is simply x.getValue(). It should be easy.
I suggest creating a separate method to pass into the map method:
// you can probably come up with a more meaningful name
public static Petition mapEntryToPetition(Map.Entry<String, List<AuditRegister>> entry) {
Petition petition = new Petition();
PetitionData data = new PetitionData();
data.setUUID(entry.getKey());
petition.setData(data);
petition.setRegisters(entry.getValue());
return petition;
}
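Putting the pieces together, the whole LINQ query could be sketched as a single stream pipeline. The AuditRegister, PetitionData and Petition classes below are minimal, hypothetical versions built only from the names used in the question, and PetitionGrouping and toPetitions are made-up names for this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical minimal beans based on the names in the question.
class AuditRegister {
    private final String muleID;
    AuditRegister(String muleID) { this.muleID = muleID; }
    String getMuleID() { return muleID; }
}

class PetitionData {
    private String uuid;
    String getUUID() { return uuid; }
    void setUUID(String uuid) { this.uuid = uuid; }
}

class Petition {
    private PetitionData data;
    private List<AuditRegister> registers;
    PetitionData getData() { return data; }
    void setData(PetitionData data) { this.data = data; }
    List<AuditRegister> getRegisters() { return registers; }
    void setRegisters(List<AuditRegister> registers) { this.registers = registers; }
}

class PetitionGrouping {
    // Plays the role of "select new Petition { ... }" for one group.
    static Petition mapEntryToPetition(Map.Entry<String, List<AuditRegister>> entry) {
        Petition petition = new Petition();
        PetitionData data = new PetitionData();
        data.setUUID(entry.getKey());
        petition.setData(data);
        petition.setRegisters(entry.getValue());
        return petition;
    }

    // group by muleID, then turn every (key, group) entry into a Petition.
    static List<Petition> toPetitions(List<AuditRegister> registers) {
        return registers.stream()
                .collect(Collectors.groupingBy(AuditRegister::getMuleID))
                .entrySet().stream()
                .map(PetitionGrouping::mapEntryToPetition)
                .collect(Collectors.toList());
    }
}
```

Note that groupingBy returns a HashMap by default, so the order of the resulting petitions is not guaranteed to match the order in which the groups first appear.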

Can Java 8 Streams use multiple items from mapping pipeline

I have some data stored in a JPA Repository that I am trying to process. I would like to be able to use Java 8 Streams to do so, but can not figure out how to get the required information. This particular 'Entity' is actually only for recovery, so it holds items that would need to be processed after something like a power-fail/restart.
Using pre-Java 8 for-loops the code would look like:
List<MyEntity> deletes = myEntityJpaRepository.findByDeletes();
for (MyEntity item : deletes) {
String itemJson = item.getData();
// use a Jackson 'objectMapper' already setup to de-serialize
MyEventClass deleteEvent = objectMapper.readValue(itemJson, MyEventClass.class);
processDelete(deleteEvent, item.getId());
}
The problem arises from the two parameter method called at the very end. Using Streams, I believe I would do:
// deletes.stream()
// .map(i -> i.getData())
// .map(event -> objectMapper.readValue(event, MyEventClass.class))
// .forEach(??? can't get 'id' here to invoke 2 parameter method);
I have a solution (without Streams) that I can live with. However I would think this problem comes up a lot, thus my question is: IN GENERAL, is there a way using Streams to accomplish what I am trying to do?
Why not return a Pair from your map operation:
.map(i -> new Pair<>(i.getData(), i.getId()))
.map(pair -> new Pair<>(objectMapper.readValue(pair.getLeft(), MyEventClass.class), pair.getRight()))
.forEach(pair -> processDelete(pair.getLeft(), pair.getRight()));
I did not compile this, so there might be minor things to fix (for example, readValue throws a checked exception that has to be handled inside the lambda). But in general, you would need a holder to pass your objects to the next stage in such a case: either a Pair, some type of your own, or even an array.
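A self-contained sketch of the holder idea, using the JDK's AbstractMap.SimpleImmutableEntry as the pair type so that no extra library is needed. MyEntity and MyEventClass are hypothetical minimal versions of the question's classes, and parse is a made-up stand-in for the Jackson objectMapper.readValue call, so the checked exception does not get in the way here.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical minimal versions of the classes from the question.
class MyEntity {
    private final Long id;
    private final String data;
    MyEntity(Long id, String data) { this.id = id; this.data = data; }
    Long getId() { return id; }
    String getData() { return data; }
}

class MyEventClass {
    private final String payload;
    MyEventClass(String payload) { this.payload = payload; }
    String getPayload() { return payload; }
}

class PairPipeline {
    // Stand-in for objectMapper.readValue(json, MyEventClass.class).
    static MyEventClass parse(String json) {
        return new MyEventClass(json.trim());
    }

    // Carries the id alongside the parsed event through the pipeline,
    // so the two-parameter handler can be called at the end.
    static void processAll(List<MyEntity> deletes,
                           BiConsumer<MyEventClass, Long> processDelete) {
        deletes.stream()
                .map(item -> new SimpleImmutableEntry<>(parse(item.getData()), item.getId()))
                .forEach(pair -> processDelete.accept(pair.getKey(), pair.getValue()));
    }
}
```

Usage would be something like PairPipeline.processAll(deletes, this::processDelete), assuming a processDelete(MyEventClass, Long) method exists in the calling class.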
Why not simply do it this way?
deletes.forEach(item ->
processDelete(objectMapper.readValue(item.getData(), MyEventClass.class),
item.getId()));
This is a start, at least. I guess it depends on why you want to use streams and how functional you want to make it.
List<MyEntity> deletes = myEntityJpaRepository.findByDeletes();
deletes.stream().forEach(item -> {
String itemJson = item.getData();
// use a Jackson 'objectMapper' already setup to de-serialize
MyEventClass deleteEvent = objectMapper.readValue(itemJson, MyEventClass.class);
processDelete(deleteEvent, item.getId());
});

Create a stream of the values in maps that are values in another map in Java

Sorry about the title of the question; it was kind of hard for me to make sense of it. If you guys have a better title, let me know and I can change it.
I have two types of objects, Bookmark and Revision. I have one large Map, like so:
Map<Long, Bookmark> mapOfBookmarks;
it contains key: value pairs like so:
1L: Bookmark1,
2L: Bookmark2,
...
Each Bookmark has a 'getRevisions()' method that returns a Map
public Map<Long, Revision> getRevisions();
I want to create a Stream that contains all revisions that exist under mapOfBookmarks. Essentially I want to do this:
List<Revision> revisions = new ArrayList<>();
for (Bookmark bookmark : mapOfBookmarks.values()) { // loop through each bookmark in the map of bookmarks ( Map<Long, Bookmark> )
for (Revision revision : bookmark.getRevisions().values()) { // loop through each revision in the map of revisions ( Map<Long, Revision> )
revisions.add(revision); // add each revision of each map to the revisions list
}
}
return revisions.stream(); // return a stream of revisions
However, I'd like to do it using the functionality of Stream, so more like:
return mapOfBookmarks.values().stream().everythingElseThatIsNeeded();
Which would essentially be like saying:
return Stream.of(revision1, revision2, revision3, revision4, ...);
How would I write that out? Something to note is that the dataset that it is looping through can be huge, making the list method a poor approach.
I'm using Windows 7 and Java 8
flatMap is what you are looking for. When you have streams contained within a stream that you wish to flatten, flatMap is the answer:
List<Revision> all =
mapOfBookmarks.values().stream()
.flatMap(c -> c.getRevisions().values().stream())
.collect(Collectors.toList());
You are looking for the flatMap(mapper) operation:
Returns a stream consisting of the results of replacing each element of this stream with the contents of a mapped stream produced by applying the provided mapping function to each element.
In this case, we're making a Stream<Bookmark> by calling stream(), flat mapping it to the revisions of each bookmark and, finally, collecting that into a list with toList().
List<Revision> revisions =
mapOfBookmarks.values()
.stream()
.flatMap(bookmark -> bookmark.getRevisions().values().stream())
.collect(Collectors.toList());
Note that your current code could also be improved by calling addAll instead of looping over each revision:
for (Bookmark bookmark : mapOfBookmarks.values()) { // loop through each bookmark in the map of bookmarks ( Map<Long, Bookmark> )
revisions.addAll(bookmark.getRevisions().values());
}
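Since the question asks for a Stream rather than a List, and mentions that the dataset can be huge, the collect step can simply be left out. A minimal sketch, with hypothetical bare-bones Bookmark and Revision classes and a made-up helper name allRevisions:

```java
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical minimal versions of the classes from the question.
class Revision {
    private final String name;
    Revision(String name) { this.name = name; }
    String getName() { return name; }
}

class Bookmark {
    private final Map<Long, Revision> revisions;
    Bookmark(Map<Long, Revision> revisions) { this.revisions = revisions; }
    public Map<Long, Revision> getRevisions() { return revisions; }
}

class RevisionStreams {
    // Returns the flattened stream directly; no list holding all
    // revisions is built along the way.
    static Stream<Revision> allRevisions(Map<Long, Bookmark> mapOfBookmarks) {
        return mapOfBookmarks.values().stream()
                .flatMap(bookmark -> bookmark.getRevisions().values().stream());
    }
}
```

The caller then decides what to do with the stream (filter, count, collect), so the revisions are only visited when a terminal operation runs.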

How to get Spark MLlib RandomForestModel.predict response as text value YES/NO?

I am trying to use the RandomForest algorithm from Apache Spark MLlib. I have the dataset in CSV format with the following features:
DayOfWeek(int),AlertType(String),Application(String),Router(String),Symptom(String),Action(String)
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO
I want to use RandomForest MLlib and do prediction on last field Action and I want response as YES/NO.
I am following code from GitHub to create the RandomForest model. Since all my features are categorical except for one int feature, I have used the following code to convert them into a JavaRDD<LabeledPoint>. Is any of that wrong?
// Load and parse the data file.
JavaRDD<String> data = jsc.textFile("/tmp/xyz/data/training-dataset.csv");
// I have 14 features so giving 14 as arg to the following
final HashingTF tf = new HashingTF(14);
// Create LabeledPoint datasets for Actionable and nonactionable
JavaRDD<LabeledPoint> labledData = data.map(new Function<String, LabeledPoint>() {
@Override public LabeledPoint call(String alert) {
List<String> featureList = Arrays.asList(alert.trim().split(","));
String actionType = featureList.get(featureList.size() - 1).toLowerCase();
return new LabeledPoint(actionType.equals("yes") ? 1 : 0, tf.transform(featureList));
}
});
In the same way, I create testData and use it in the following code to do the prediction:
JavaPairRDD<Double, Double> predictionAndLabel =
testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
@Override
public Tuple2<Double, Double> call(LabeledPoint p) {
return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
}
});
How do I get a prediction for my last field, Action, as YES/NO? The current predict method returns a double, and I don't understand how to turn that into the text value. Also, am I following the correct approach for converting categorical features into a LabeledPoint? I am new to machine learning and Spark MLlib.
I am more familiar with the Scala version, but I'll try to help.
You need to map the target variable (Action) and all categorical features to levels starting at 0: 0, 1, 2, 3, ... For example, router1, router2, ... router5 become 0, 1, 2, ... 4. The same goes for your target variable, which I think is the only one you actually mapped (yes/no to 1/0); I am not sure what your tf.transform(featureList) is actually doing.
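One way to picture this mapping is a small indexer that hands out 0, 1, 2, ... to each distinct string of one feature. The sketch below is plain Java with a hypothetical class name (CategoryIndexer); it is not part of Spark MLlib, and in a distributed job the mapping would have to be built up front (for example from the distinct values of the column) and shared with the workers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: assigns levels 0, 1, 2, ... to the distinct
// values of ONE categorical feature, in order of first appearance.
class CategoryIndexer {
    private final Map<String, Integer> levels = new LinkedHashMap<>();

    // Returns the level of the value, registering it if it is new.
    double indexOf(String value) {
        Integer level = levels.get(value);
        if (level == null) {
            level = levels.size();
            levels.put(value, level);
        }
        return level;
    }

    // Number of distinct levels seen so far; this is the count that
    // categoricalFeaturesInfo expects for the feature.
    int numLevels() {
        return levels.size();
    }
}
```

One indexer per categorical column (AlertType, Application, Router, Symptom) would then produce the double values that go into the feature vector.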
Once you have done this, you can train your RandomForest classifier, specifying the map of categorical features. Basically, you need to tell it which features are categorical and how many levels each one has. This is the Scala version, but you can easily translate it into Java:
val categoricalFeaturesInfo = Map[Int, Int]((2,2),(3,5))
This says that, in your list of features, the 3rd one (index 2) has 2 levels (2,2) and the 4th one (index 3) has 5 levels (3,5). The rest are treated as continuous (Double) features.
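In Java, the same information can be put into a java.util.Map<Integer, Integer>, which, as far as I know, is the type the Java-friendly trainClassifier overload accepts. A minimal sketch; the class and method names (CategoricalInfo, build) are made up for illustration, and the indices and level counts are just the ones from the Scala line above:

```java
import java.util.HashMap;
import java.util.Map;

class CategoricalInfo {
    // Key: zero-based feature index. Value: number of levels of that feature.
    static Map<Integer, Integer> build() {
        Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
        categoricalFeaturesInfo.put(2, 2); // 3rd feature has 2 levels
        categoricalFeaturesInfo.put(3, 5); // 4th feature has 5 levels
        return categoricalFeaturesInfo;
    }
}
```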
Now you pass the categoricalFeaturesInfo when training the classifier together with the other parameters as:
val modelRF = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
Now, when you need to evaluate it, the predict function returns a double, 0.0 or 1.0, which you can use to compute accuracy, precision, or any other metric you need.
This is an example (sorry, Scala again), assuming a testData on which you applied the same transformations as before:
val predictionAndLabels = testData.map { point =>
val prediction = modelRF.predict(point.features)
(point.label, prediction)
}
Here your results are clear: the label is 1/0 and the predicted value is also 1/0, so computing accuracy, precision, and recall is straightforward.
I hope it helps!!
You're heading in the correct direction, and you've already managed to train a model which is great.
For binary classification it will return either 0.0 or 1.0, and it's up to you to map this back to your string values.
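Mapping in both directions can be as small as the sketch below, assuming the encoding from the question (YES is 1, NO is 0). The class and method names (ActionLabels, toLabel, toAction) are hypothetical.

```java
class ActionLabels {
    // Text value from the CSV to the numeric label used for training.
    static double toLabel(String action) {
        return "YES".equalsIgnoreCase(action.trim()) ? 1.0 : 0.0;
    }

    // Numeric prediction from the model back to the text value.
    static String toAction(double prediction) {
        return prediction == 1.0 ? "YES" : "NO";
    }
}
```

With this in place, the text answer for a test point would be ActionLabels.toAction(model.predict(p.features())).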
