I have a use case in which I get an MLlib model and a stream of data, and I want to score (predict) the stream.
There are some examples and material on this topic in Scala, but I can't translate them to Java.
Trying to run predict inside the map function (as shown in the Spark documentation):
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(
    new Function<LabeledPoint, Tuple2<Object, Object>>() {
        public Tuple2<Object, Object> call(LabeledPoint p) {
            Double score = model.predict(p.features());
            return new Tuple2<Object, Object>(score, p.label());
        }
    }
);
results in error:
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation
My input is two comma-separated integers, which I map into:
JavaDStream<Tuple2<Integer, Integer>> pairs
Then I want to transform it into:
JavaPairDStream<Integer, Double> scores
where the Double is the prediction result and the Integer is a key, so that I can later reduce by key.
This approach requires creating a new DStream inside an existing one, which I failed to do.
The predict method can be applied to an RDD, but I couldn't create a DStream back from the result (foreachRDD must return void):
pairs.foreachRDD(new Function<JavaRDD<Tuple2<Object, Object>>, Void>() {
    @Override
    public Void call(JavaRDD<Tuple2<Object, Object>> arg0) throws Exception {
        RDD<Rating> a = sameModel.predict(arg0.rdd());
        return null;
    }
});
Any ideas on how this might be achieved?
As far as I can tell, the problem here is not really the translation to Java but the specific model you use. MLlib provides two types of models - local and distributed. Local models can be serialized and used inside a map.
MatrixFactorizationModel falls into the second category. It internally uses distributed data structures for predictions, hence it cannot be used from an action or transformation. If you want to use it for predictions on a whole RDD, you have to pass that RDD to the predict method like this:
model.predict(JavaRDD.toRDD(test))
See the Java examples in the Collaborative Filtering documentation for details about the required format of the test data.
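For the streaming part of the question, one way to avoid calling predict inside a map is transformToPair, which hands you the underlying RDD of each micro-batch on the driver and returns a new DStream. Below is a minimal sketch, assuming model is the MatrixFactorizationModel, pairs is the JavaDStream<Tuple2<Integer, Integer>> from the question, and keying by the first integer (the user id) is what you want; the variable names are assumptions. It uses Function/PairFunction from org.apache.spark.api.java.function, Rating from org.apache.spark.mllib.recommendation and scala.Tuple2.
JavaPairDStream<Integer, Double> scores = pairs.transformToPair(
    new Function<JavaRDD<Tuple2<Integer, Integer>>, JavaPairRDD<Integer, Double>>() {
        @Override
        public JavaPairRDD<Integer, Double> call(JavaRDD<Tuple2<Integer, Integer>> rdd) {
            // predict() expects the whole RDD with Object-typed tuple elements
            JavaRDD<Tuple2<Object, Object>> userProducts = rdd.map(
                new Function<Tuple2<Integer, Integer>, Tuple2<Object, Object>>() {
                    @Override
                    public Tuple2<Object, Object> call(Tuple2<Integer, Integer> t) {
                        return new Tuple2<Object, Object>(t._1(), t._2());
                    }
                });
            // this call runs on the driver once per batch, so the distributed model is fine here
            JavaRDD<Rating> ratings = model.predict(JavaRDD.toRDD(userProducts)).toJavaRDD();
            // key each prediction by the user id so it can be reduced by key afterwards
            return ratings.mapToPair(new PairFunction<Rating, Integer, Double>() {
                @Override
                public Tuple2<Integer, Double> call(Rating r) {
                    return new Tuple2<Integer, Double>(r.user(), r.rating());
                }
            });
        }
    });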
I'm trying to understand what Flink's Collector.collect() does and how it handles incoming/outgoing data:
Example taken from Flink DataSet API:
The following code transforms a DataSet of text lines into a DataSet of words:
DataSet<String> output = input.flatMap(new Tokenizer());

public class Tokenizer implements FlatMapFunction<String, String> {
    @Override
    public void flatMap(String value, Collector<String> out) {
        for (String token : value.split("\\W")) {
            out.collect(token);
        }
    }
}
So the text lines get split into tokens and each of them gets "collected". As intuitive as that sounds, I'm missing the actual dynamics behind Collector.collect(). Where is the collected data stored before it gets assigned to output, i.e. does Flink put it in some sort of buffer? And if yes, how is the data transferred to the network?
From the official source code documentation:
Collects a record and forwards it. The collector is the "push"
counterpart of the {@link java.util.Iterator}, which "pulls" data in.
So it receives a value and, instead of letting the next operator pull values out of an Iterator, it pushes each value on to the next operator. Everything after that is a matter of the network stack and its buffers.
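To make the push idea concrete, here is a deliberately simplified, hypothetical collector (not Flink's actual implementation): the producer calls collect(), the record goes into an output buffer, and from there it is handed to whatever sits downstream (in real Flink, that would be the record serializer and the network buffers).
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for org.apache.flink.util.Collector, for illustration only.
interface SimpleCollector<T> {
    void collect(T record);
}

// "Push" style: the producer drives and hands each record to the collector,
// which buffers it and forwards it downstream.
class ForwardingCollector<T> implements SimpleCollector<T> {
    private final Queue<T> outputBuffer = new ArrayDeque<>(); // stand-in for Flink's output buffers

    @Override
    public void collect(T record) {
        outputBuffer.add(record);
        flush(); // in real Flink: serialize and write into network buffers
    }

    private void flush() {
        while (!outputBuffer.isEmpty()) {
            System.out.println("forwarding downstream: " + outputBuffer.poll());
        }
    }
}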
I am using Apache Flink to get JSON records from Kafka into InfluxDB, splitting each JSON record into multiple InfluxDB points in the process.
I found the flatMap transform and it feels like it fits the purpose. The core code looks like this:
DataStream<InfluxDBPoint> dataStream = stream.flatMap(new FlatMapFunction<JsonConsumerRecord, InfluxDBPoint>() {
    @Override
    public void flatMap(JsonConsumerRecord record, Collector<InfluxDBPoint> out) throws Exception {
        Iterator<Entry<String, JsonNode>> iterator = //...
        while (iterator.hasNext()) {
            // extract point from input
            InfluxDBPoint point = //...
            out.collect(point);
        }
    }
});
For some reason, I only get one of those collected points streamed into the database.
Even when I print out all mapped entries, it seems to work just fine: dataStream.print() yields:
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#144fd091
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#57256d1
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#28c38504
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#2d3a66b3
Am I misunderstanding flatMap or might there be some bug in the Influx connector?
The problem was actually that a series in Influx (defined by its measurement and tag set, as seen here) can only have one point per timestamp. So even though my fields differed, each point overwrote the previous ones that carried the same time value.
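For anyone hitting the same issue, one way out is to make sure each extracted point either carries a distinguishing tag or a distinct timestamp, so the points land in different series instead of overwriting each other. A rough sketch, assuming the connector's InfluxDBPoint(measurement, timestamp, tags, fields) constructor, java.util.HashMap, and hypothetical measurement, tag and accessor names (record.getTimestamp() is made up):
Map<String, String> tags = new HashMap<>();
tags.put("field_name", entry.getKey()); // hypothetical tag that differs per extracted point

Map<String, Object> fields = new HashMap<>();
fields.put("value", entry.getValue().asDouble());

// the same timestamp is fine now, because the tag set differs per point
InfluxDBPoint point = new InfluxDBPoint("measurement_name", record.getTimestamp(), tags, fields);
out.collect(point);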
I am trying to use Byte Buddy to generate a class, and methods in it, based on configuration that is only available at runtime. The class creates a Hazelcast Jet pipeline to join multiple IMaps.
Based on the provided configuration, the number of IMaps to join can vary. In the sample below, I am joining three IMaps.
private Pipeline getPipeline(IMap<String, Object1> object1Map, IMap<String, Object2> object2Map,
        IMap<String, Object3> object3Map) {
    Pipeline p = Pipeline.create();
    BatchStage<Entry<String, Object1>> obj1 = p.drawFrom(Sources.map(object1Map));
    BatchStage<Entry<String, Object2>> obj2 = p.drawFrom(Sources.map(object2Map));
    BatchStage<Entry<String, Object3>> obj3 = p.drawFrom(Sources.map(object3Map));

    DistributedFunction<Tuple2<Object1, Object2>, String> obj1Obj2JoinFunc = entry -> entry.f1().getField31();
    DistributedBiFunction<Tuple2<Object1, Object2>, Object3, Tuple2<Tuple2<Object1, Object2>, Object3>> output =
            (in1, in2) -> Tuple2.tuple2(in1, in2);

    BatchStage<Tuple2<Object1, Object2>> obj1_obj2 = obj1.map(entry -> entry.getValue())
            .hashJoin(obj2.map(entry -> entry.getValue()),
                    JoinClause.onKeys(Object1::getField11, Object2::getField21), Tuple2::tuple2)
            .filter(entry -> entry.getValue() != null);

    BatchStage<Tuple2<Tuple2<Object1, Object2>, Object3>> obj1_obj2_obj3 = obj1_obj2
            .hashJoin(obj3.map(entry -> entry.getValue()),
                    JoinClause.onKeys(obj1Obj2JoinFunc, Object3::getField31), output)
            .filter(entry -> entry.getValue() != null);

    // transformResult extracts the required fields from the join above and creates AllObjectJoinClass objects
    BatchStage<Entry<String, AllObjectJoinClass>> result = transformResult(obj1_obj2_obj3);
    result.drainTo(Sinks.map("obj1_obj2_obj3"));
    return p;
}
The problem here is that the number of arguments to my method depends on the runtime configuration, and so does the method body.
I am able to generate the method signature using TypeDescription.Generic.Builder.parameterizedType.
But I am having trouble generating the method body. I tried using MethodDelegation.to so that the method body resides in a separate class. The trouble with this approach is that the method in the separate class has to be very generic, so that it can take an arbitrary number of arguments of different types, and it also needs to know about the fields of each of the objects in the IMaps.
I wonder if there is an alternate approach to achieving this, maybe with templates of some kind, so that a separate class can be generated for each pipeline with this body. I didn't find any documentation on generating a method with a defined body (maybe I missed something).
-- Anoop
It very much depends on what you are trying to do:
With Advice, you can write a template as byte code that is inlined into your method.
With StackManipulations you can compose individual byte code instructions.
It seems to me that option (2) is what you are aiming for. For individually composed code, this is often the easiest option.
Writing individual byte code is of course not the most convenient option, but if the handling of each input is easy to express in plain Java, you might be able to compose multiple Advice classes instead and avoid writing byte code instructions directly.
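To illustrate option (1), here is a minimal, self-contained sketch of inlining an Advice template into an existing method; the target class, the method name and the timing logic are placeholders, not part of the original question:
import net.bytebuddy.ByteBuddy;
import net.bytebuddy.asm.Advice;
import net.bytebuddy.dynamic.DynamicType;
import static net.bytebuddy.matcher.ElementMatchers.named;

public class AdviceSketch {

    // Hypothetical target, standing in for a generated pipeline class.
    public static class PipelineTarget {
        public void buildPipeline() {
            // existing method body that the advice code is inlined around
        }
    }

    // Advice template: its byte code is copied into buildPipeline().
    public static class TimingAdvice {
        @Advice.OnMethodEnter
        static long enter() {
            return System.nanoTime();
        }

        @Advice.OnMethodExit
        static void exit(@Advice.Enter long start) {
            System.out.println("buildPipeline took " + (System.nanoTime() - start) + " ns");
        }
    }

    public static void main(String[] args) {
        DynamicType.Unloaded<PipelineTarget> redefined = new ByteBuddy()
                .redefine(PipelineTarget.class)
                .visit(Advice.to(TimingAdvice.class).on(named("buildPipeline")))
                .make();
        // redefined can then be loaded with a class loading strategy or saved to disk
    }
}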
We are using Spark for file processing. We process pretty big files, each around 30 GB with about 40-50 million lines. These files are formatted and we load them into a data frame. The initial requirement was to identify records matching certain criteria and load them into MySQL. We were able to do that.
The requirement changed recently: records not meeting the criteria are now to be stored in an alternate DB. This is causing issues because the collection is too big to collect at once. We are trying to collect each partition independently and merge it into a list, as suggested here:
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/dont_collect_large_rdds.html
We are not familiar with Scala, so we are having trouble converting this to Java. How can we iterate over the partitions one by one and collect them?
Thanks
Use df.foreachPartition to process each partition independently; it does not return anything to the driver, and you can save the matching results into the DB at the executor level. If you want to collect the results on the driver, use mapPartitions, which is not recommended for your case.
Please refer to the link below:
Spark - Java - foreachPartition
dataset.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> r) throws Exception {
        while (r.hasNext()) {
            Row row = r.next();
            System.out.println(row.getString(1));
        }
        // do your business logic and load into MySQL.
    }
});
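A hedged sketch of the "load into MySQL" part, opening one JDBC connection per partition (java.sql.Connection, DriverManager, PreparedStatement) and batching the inserts; the connection URL, credentials, table and column names are made up for illustration:
dataset.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://db-host:3306/mydb", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                 "INSERT INTO unmatched_records (col1) VALUES (?)")) {
            while (rows.hasNext()) {
                Row row = rows.next();
                stmt.setString(1, row.getString(1));
                stmt.addBatch(); // batch the inserts instead of one round trip per row
            }
            stmt.executeBatch();
        }
    }
});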
For mapPartitions:
// You could use Row itself, but for clarity I am defining this.
public class ResultEntry implements Serializable {
    // define your df properties ...
}

Dataset<ResultEntry> mappedData = data.mapPartitions(new MapPartitionsFunction<Row, ResultEntry>() {
    @Override
    public Iterator<ResultEntry> call(Iterator<Row> it) {
        List<ResultEntry> filteredResult = new ArrayList<ResultEntry>();
        while (it.hasNext()) {
            Row row = it.next();
            if (somecondition) {
                filteredResult.add(convertToResultEntry(row));
            }
        }
        return filteredResult.iterator();
    }
}, Encoders.javaSerialization(ResultEntry.class));
Hope this helps.
Ravi
JavaRDD<Text> tx = counts2.map(new Function<Object, Text>() {
    @Override
    public Text call(Object o) throws Exception {
        if (o.getClass() == Dict.class) {
            Dict rkd = (Dict) o;
            return new Text(rkd.getId());
        } else {
            return null;
        }
    }
});
tx.saveAsTextFile("/rowkey/Rowkey_new");
I am new to Spark. I want to save this file, but I get a NullPointerException. I don't want to replace return null with return new Text(), because that would insert blank lines into my file. So how can I solve this problem?
Instead of putting an if condition inside your map, simply use that if condition to build an RDD filter. The Spark Quick Start is a good place to start, and there is also a nice overview of other transformations and actions.
Basically your code can look as follows (if you are using Java 8):
counts2
    .filter(o -> o instanceof Dict)
    .map(o -> new Text(((Dict) o).getId()))
    .saveAsTextFile("/rowkey/Rowkey_new");
You had the intention to map one incoming record to either zero or one outgoing record, which cannot be done with a map. However, filter maps to zero or one records where each outgoing record is identical to the incoming one, and flatMap gives you more flexibility by allowing you to map to zero or more outgoing records of any type.
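If you ever do need that extra flexibility, here is a sketch of the flatMap variant, using java.util.Collections and assuming Spark 2.x, where the lambda returns an Iterator (on 1.x it would return an Iterable):
JavaRDD<Text> tx = counts2.flatMap(o -> o instanceof Dict
        ? Collections.singletonList(new Text(((Dict) o).getId())).iterator()
        : Collections.<Text>emptyIterator());
tx.saveAsTextFile("/rowkey/Rowkey_new");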
It is strange, but not inconceivable, that you create non-Dict objects that are going to be filtered out further downstream anyway. Possibly you can push your filter even further upstream to make sure you only create Dict instances in the first place. Without knowing the rest of your code this is only an assumption, of course, and it is not part of your original question anyway.