Java Spark saveAsTextFile NULL

JavaRDD<Text> tx = counts2.map(new Function<Object, Text>() {
    @Override
    public Text call(Object o) throws Exception {
        if (o.getClass() == Dict.class) {
            Dict rkd = (Dict) o;
            return new Text(rkd.getId());
        } else {
            return null;
        }
    }
});
tx.saveAsTextFile("/rowkey/Rowkey_new");
I am new to Spark. I want to save this file, but I get a NullPointerException. I don't want to replace return null with return new Text(), because that would insert a blank line into my file. How can I solve this problem?

Instead of putting an if condition in your map, simply use that if condition to build an RDD filter. The Spark Quick Start is a good place to start. There is also a nice overview of other transformations and actions.
Basically your code can look as follows (if you are using Java 8):
counts2
    .filter(o -> o instanceof Dict)
    .map(o -> new Text(((Dict) o).getId()))
    .saveAsTextFile("/rowkey/Rowkey_new");
Your intention was to map one incoming record to either zero or one outgoing record. That cannot be done with a map. However, filter maps each incoming record to zero or one outgoing record (where the outgoing record equals the incoming one), and flatMap gives you more flexibility by allowing you to map each incoming record to zero or more outgoing records of any type.
It is strange, but not inconceivable, that you create non-Dict objects that are going to be filtered out further downstream anyway. You might consider pushing the filter even further upstream so that you only create Dict instances in the first place. Without knowing the rest of your code, this is only an assumption, of course, and not part of your original question anyway.
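For completeness, a hedged sketch of the zero-or-one mapping done in a single flatMap step (assuming Spark 2.x, where the flatMap function returns an Iterator; on Spark 1.x it returns an Iterable, so drop the .iterator() call and return the list instead):
// A minimal sketch, assuming counts2 is a JavaRDD<Object> and Dict exposes getId().
// Non-Dict objects produce zero output records; each Dict produces exactly one Text.
JavaRDD<Text> tx = counts2.flatMap(o ->
        o instanceof Dict
                ? Collections.singletonList(new Text(((Dict) o).getId())).iterator()
                : Collections.<Text>emptyIterator());
tx.saveAsTextFile("/rowkey/Rowkey_new");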

Related

How to flatMap to database in Apache Flink?

I am using Apache Flink, trying to get JSON records from Kafka into InfluxDB, splitting each JSON record into multiple InfluxDB points in the process.
I found the flatMap transform, and it feels like it fits the purpose. The core code looks like this:
DataStream<InfluxDBPoint> dataStream = stream.flatMap(new FlatMapFunction<JsonConsumerRecord, InfluxDBPoint>() {
    @Override
    public void flatMap(JsonConsumerRecord record, Collector<InfluxDBPoint> out) throws Exception {
        Iterator<Entry<String, JsonNode>> iterator = //...
        while (iterator.hasNext()) {
            // extract point from input
            InfluxDBPoint point = //...
            out.collect(point);
        }
    }
});
For some reason, I only get one of those collected points streamed into the database.
Even when I print out all mapped entries, it seems to work just fine: dataStream.print() yields:
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#144fd091
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#57256d1
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#28c38504
org.apache.flink.streaming.connectors.influxdb.InfluxDBPoint#2d3a66b3
Am I misunderstanding flatMap or might there be some bug in the Influx connector?
The problem was actually related to the fact that a series in Influx (defined by its measurement and tag set, as seen here) can only have one point per timestamp. Therefore, even though my fields differed, the last point overwrote all previous points with the same time value.
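A hedged sketch of one fix: give each extracted entry a distinguishing tag so it lands in its own series. The tag name, the getTimestamp() accessor on the record, and the (measurement, timestamp, tags, fields) constructor are assumptions; check the exact InfluxDBPoint API in your connector version:
// A minimal sketch inside the flatMap loop: each entry gets its own tag value, so points
// that share a timestamp no longer belong to the same series and do not overwrite
// each other. Names marked "hypothetical" are not from the original code.
HashMap<String, String> tags = new HashMap<>();
tags.put("entry", entry.getKey());                 // hypothetical distinguishing tag

HashMap<String, Object> fields = new HashMap<>();
fields.put("value", entry.getValue().asDouble());  // hypothetical field extraction

InfluxDBPoint point = new InfluxDBPoint(
        "measurementName",                          // hypothetical measurement name
        record.getTimestamp(),                      // assumed accessor on JsonConsumerRecord
        tags,
        fields);
out.collect(point);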

Using ComputeIfAbsent in HashMap

I have read similar posts on this, but is this the right way to use the computeIfAbsent function? cookieMap is a HashMap, and responses is an object that contains all the headers, cookies, response bodies, status code, etc.
cookieMap.computeIfAbsent("Varlink", varLink -> {
    if (responses.getCookie("VARLINK").length() < 1) {
        throw new ProviderException("Varlink not present in response, check response status!!!");
    }
    return responses.getCookie("VARLINK");
});
I will need to add multiple keys like this to the cookieMap. My initial thought was to put everything inside an if condition, but due to certain restrictions we are not supposed to have nested if-else conditions (I guess the code reviewer took the book Clean Code too seriously).
If responses and cookieMap are two different sources of data, then your snippet is correct. The only concern is calling responses.getCookie twice, which might be resolved by using a variable, as someone suggested in the comments.
I'd shorten the entire expression using Optional to:
cookieMap.computeIfAbsent("Varlink", v ->
    Optional.of(responses.getCookie("VARLINK"))                 // gets the cookie
            .filter(c -> c.length() >= 1)                       // keeps it only if non-empty
            .orElseThrow(() -> new ProviderException("...")));  // otherwise throws
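Since multiple keys need to be added the same way, a small helper keeps the pattern in one place. This is a minimal sketch; the helper name, the Map<String, String> value type, and the second key/cookie pair are assumptions, not part of the original code:
// A minimal sketch, assuming responses.getCookie(String) returns a String (possibly empty)
// and cookieMap holds String values. Names below are illustrative only.
private void putRequiredCookie(Map<String, String> cookieMap, String key, String cookieName) {
    cookieMap.computeIfAbsent(key, k ->
            Optional.of(responses.getCookie(cookieName))
                    .filter(c -> !c.isEmpty())
                    .orElseThrow(() -> new ProviderException(
                            cookieName + " not present in response, check response status!")));
}

// usage
putRequiredCookie(cookieMap, "Varlink", "VARLINK");
putRequiredCookie(cookieMap, "Session", "SESSION"); // hypothetical second key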

How to run multiple synchronous functions asynchronously?

I am writing Java on the Vert.x framework, and I have an architecture question regarding blocking code.
I have a JsonObject that consists of 10 entries, like so:
{
    "system0":"CD0",
    "system1":"CD1",
    "system2":"CD2",
    "system3":"CD3",
    "system4":"CD4",
    "system5":"CD5",
    "system6":"CD6",
    "system7":"CD7",
    "system8":"CD8",
    "system9":"CD9"
}
I also have a synchronous function that takes an entry from the JsonObject and consumes a SOAP web service, sending the entry's value to it.
The SOAP web service receives the content (e.g. CD0) and, after a few seconds, returns an enum.
I then want to take the returned enum value and save it in some data structure (like a hash table).
What I ultimately want is a function that iterates over all the JsonObject's entries and, for each one, runs the blocking code in parallel.
I want it to run in parallel so that even if one of the calls has to wait 20 seconds, it won't block the other calls.
How can I do such a thing in Vert.x?
P.S. I would appreciate it if you corrected any mistakes I made.
Why not use RxJava and zip the separate calls? Vert.x has great support for RxJava too. Assuming you are calling the same method 10 times with different String arguments and returning another String, you could do something like this:
private Single<String> callWs(String arg) {
    return Single.fromCallable(() -> {
        // call the web service here
        return "yourResult";
    });
}
and then just use it with some array of arguments:
String[] array = new String[10]; // get your arguments
List<Single<String>> wsCalls = new ArrayList<>();
for (String s : array) {
    wsCalls.add(callWs(s));
}
Single.zip(wsCalls, r -> r).subscribe(allYourResults -> {
    // do whatever you like with the results
});
More about the zip function and reactive programming in general: reactivex.io
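One caveat, as a hedged note: zip subscribes to the sources, but Single.fromCallable runs the blocking call on whatever thread subscribes unless told otherwise, so the ten SOAP calls would effectively run one after another. Subscribing each Single on its own scheduler (assuming RxJava 2's Schedulers.io(), and that blocking a worker thread is acceptable in your Vert.x setup) makes them actually run in parallel:
// A minimal sketch: subscribeOn moves each blocking call onto the io() scheduler,
// so all ten calls run concurrently and zip completes when the slowest one finishes.
List<Single<String>> wsCalls = new ArrayList<>();
for (String s : array) {
    wsCalls.add(callWs(s).subscribeOn(Schedulers.io()));
}
Single.zip(wsCalls, results -> results).subscribe(allYourResults -> {
    // every result is available here once all calls have completed
});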

Using Java for running MLlib model with streaming

I have a use case in which I get an MLlib model and a stream, and I want to score (predict on) the stream of data.
There are some examples and material on this topic using Scala, but I can't translate them to Java.
Trying to run predict inside the map function (as shown in the Spark documentation):
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(
    new Function<LabeledPoint, Tuple2<Object, Object>>() {
        public Tuple2<Object, Object> call(LabeledPoint p) {
            Double score = model.predict(p.features());
            return new Tuple2<Object, Object>(score, p.label());
        }
    }
);
results in the following error:
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation
My input is two comma-separated integers, which I map into:
JavaDStream<Tuple2<Integer, Integer>> pairs
Then I want to transform it into:
JavaPairDStream<Integer, Double> scores
where the Double is the prediction result and the Integer is a key, so that I can reduce by key.
This approach requires creating a new DStream inside an existing one, which I failed to do.
The predict method can be applied to an RDD, but I couldn't create a DStream back from it (the function must return void):
pairs.foreachRDD(new Function<JavaRDD<Tuple2<Object, Object>>, Void>() {
    @Override
    public Void call(JavaRDD<Tuple2<Object, Object>> arg0) throws Exception {
        RDD<Rating> a = sameModel.predict(arg0.rdd());
        return null;
    }
});
Any ideas on how this might be achieved?
As far as I can tell, the problem here is not really the translation to Java but the specific model you use. MLlib provides two types of models, local and distributed. Local models can be serialized and used inside a map.
MatrixFactorizationModel falls into the second category. That means it internally uses distributed data structures for predictions, and hence cannot be used from inside an action or transformation. If you want to use it for predictions on a whole RDD, you have to pass the RDD to the predict method like this:
model.predict(JavaRDD.toRDD(test))
See the Java examples in the Collaborative Filtering documentation for details about the required format of the test data.
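For reference, a hedged sketch of scoring a batch of user/product pairs in Java (usersProducts and the assumption that your two comma-separated integers are a user ID and a product ID are mine, not from the question):
// A minimal sketch, assuming usersProducts is a JavaPairRDD<Integer, Integer> of
// (user, product) pairs; MatrixFactorizationModel also has a predict overload that
// takes a JavaPairRDD<Integer, Integer> and returns a JavaRDD<Rating>.
JavaRDD<Rating> predictions = model.predict(usersProducts);
JavaPairRDD<Integer, Double> scores = predictions.mapToPair(
        r -> new Tuple2<Integer, Double>(r.user(), r.rating()));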

How to deal with code that runs before foreach block in Apache Spark?

I'm trying to deal with some code that runs differently in Spark stand-alone mode and on a Spark cluster. Basically, for each item in an RDD, I'm trying to add it to a list, and once this is done, I want to send this list to Solr.
This works perfectly fine when I run the following code in Spark's stand-alone mode, but it does not work when the same code is run on a cluster. On a cluster, it is as if the "send to Solr" part of the code is executed before the list to be sent to Solr is filled with items. I tried to force execution with solrInputDocumentJavaRDD.collect(); after the foreach, but it seems to have no effect.
// For each RDD
solrInputDocumentJavaDStream.foreachRDD(
    new Function<JavaRDD<SolrInputDocument>, Void>() {
        @Override
        public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
            // For each item in a single RDD
            solrInputDocumentJavaRDD.foreach(
                new VoidFunction<SolrInputDocument>() {
                    @Override
                    public void call(SolrInputDocument solrInputDocument) {
                        // Add the solrInputDocument to the list of SolrInputDocuments
                        SolrIndexerDriver.solrInputDocumentList.add(solrInputDocument);
                    }
                });
            // Try to force execution
            solrInputDocumentJavaRDD.collect();
            // After having finished adding every SolrInputDocument to the list,
            // add it to the solrServer and commit, waiting for the commit to be flushed
            try {
                if (SolrIndexerDriver.solrInputDocumentList != null
                        && SolrIndexerDriver.solrInputDocumentList.size() > 0) {
                    SolrIndexerDriver.solrServer.add(SolrIndexerDriver.solrInputDocumentList);
                    SolrIndexerDriver.solrServer.commit(true, true);
                    SolrIndexerDriver.solrInputDocumentList.clear();
                }
            } catch (SolrServerException | IOException e) {
                e.printStackTrace();
            }
            return null;
        }
    }
);
What should I do so that the sending-to-Solr part executes after the list of SolrInputDocuments has been added to solrInputDocumentList (and also works in cluster mode)?
As I mentioned on the Spark Mailing list:
I'm not familiar with the Solr API, but provided that SolrIndexerDriver is a singleton, I guess what's going on when running on a cluster is that the call to:
SolrIndexerDriver.solrInputDocumentList.add(elem)
is happening on different singleton instances of the SolrIndexerDriver on different JVMs while
SolrIndexerDriver.solrServer.commit
is happening on the driver.
In practical terms, the lists on the executors are being filled in but never committed, while on the driver the opposite is happening.
The recommended way to handle this is to use foreachPartition like this:
rdd.foreachPartition { iter =>
    // prepare connection
    Stuff.connect(...)
    // add elements
    iter.foreach(elem => Stuff.add(elem))
    // submit
    Stuff.commit()
}
This way you can add the data of each partition and commit the results in the local context of each executor. Be aware that this add/commit must be thread safe in order to avoid data loss or corruption.
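The snippet above is Scala-like pseudocode; in Java, a hedged sketch of the same pattern, keeping the original anonymous-class style, could look like this (it assumes SolrIndexerDriver.solrServer is created lazily on each executor JVM, or is replaced by a per-partition connection):
// A minimal sketch: each executor batches and commits its own partition's documents,
// instead of relying on a driver-side static list.
solrInputDocumentJavaDStream.foreachRDD(
    new Function<JavaRDD<SolrInputDocument>, Void>() {
        @Override
        public Void call(JavaRDD<SolrInputDocument> rdd) throws Exception {
            rdd.foreachPartition(new VoidFunction<Iterator<SolrInputDocument>>() {
                @Override
                public void call(Iterator<SolrInputDocument> docs) throws Exception {
                    List<SolrInputDocument> batch = new ArrayList<>();
                    while (docs.hasNext()) {
                        batch.add(docs.next());
                    }
                    if (!batch.isEmpty()) {
                        SolrIndexerDriver.solrServer.add(batch);
                        SolrIndexerDriver.solrServer.commit(true, true);
                    }
                }
            });
            return null;
        }
    });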
Have you checked the Spark UI to see the execution plan of this job?
Check how it is split into stages and what their dependencies are. That should hopefully give you an idea.
