This is the main body of my really simple Spark job...
def hBaseRDD = sc.newAPIHadoopRDD(config, TableInputFormat.class, ImmutableBytesWritable.class, Result.class)
println "${hBaseRDD.count()} records counted"

def filteredRDD = hBaseRDD.filter({ scala.Tuple2 result ->
    def val = result._2.getValue(family, qualifier)
    val ? new String(val) == 'twitter' : false
} as Function<Result, Boolean>)
println "${filteredRDD.count()} counted from twitter."
println "Done!"
I noticed in the spark-submit output that it appeared to go to HBase twice: the first time was when it called count on hBaseRDD, and the second was when it called filter to create filteredRDD. Is there a way to cache the results of the newAPIHadoopRDD call in hBaseRDD so that filter works on an in-memory-only copy of the data?
Calling hBaseRDD.cache() before counting will do the trick.
The docs cover the options in detail: http://spark.apache.org/docs/1.2.0/programming-guide.html#rdd-persistence
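For illustration, here is a minimal Java sketch of the same pattern, where sc, config, family and qualifier stand in for the context from the question. Calling cache() right after creating the RDD means the first count materializes the scan results in memory, and the later filter and count reuse the cached partitions instead of going back to HBase:

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;

// config, family and qualifier are placeholders for the HBase configuration and column.
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
    sc.newAPIHadoopRDD(config, TableInputFormat.class,
                       ImmutableBytesWritable.class, Result.class)
      .cache();                                               // keep the scan results in memory

System.out.println(hBaseRDD.count() + " records counted");    // triggers the HBase scan once

JavaPairRDD<ImmutableBytesWritable, Result> filteredRDD = hBaseRDD.filter(tuple -> {
    byte[] val = tuple._2().getValue(family, qualifier);
    return val != null && "twitter".equals(Bytes.toString(val));
});
System.out.println(filteredRDD.count() + " counted from twitter.");  // served from the cache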
I want to zip two Monos/Fluxes, but the second one (the one I'll zip with) depends on the first, which I already own.
For example:
// ...
fun addGroup(
    input: GroupInput
): Mono<Group> = Mono.just(Group(title = input.title, description = input.description))
    .flatMap { g -> groupsRepository.save(g) } // Gives me back the new db ID
    .zipWith(Mono.just(GroupMember(g.id /* <-- ?? */, input.ownerId, true)))
// ...
Is it possible?
I would say no. You can only zip things that can run in parallel.
You could use map or flatMap in this case; is there any reason you need to use zip?
No, your code cannot even compile, because the variable g belongs to the lambda inside the flatMap() call.
zipWith() is not intended for this use case; use map() or flatMap() instead.
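For illustration, a rough Java sketch of the map-based approach, assuming Group, GroupMember and groupsRepository have shapes analogous to the question's Kotlin code (the constructors and getters used here are assumptions):

import reactor.core.publisher.Mono;

// The saved group stays in scope inside the downstream operator,
// so its freshly assigned db id can be used directly.
Mono<GroupMember> addGroupOwner(GroupInput input) {
    return Mono.just(new Group(input.getTitle(), input.getDescription()))
        .flatMap(g -> groupsRepository.save(g))          // emits the persisted Group with its new id
        .map(saved -> new GroupMember(saved.getId(),     // the id is available here, no zip needed
                                      input.getOwnerId(),
                                      true));
}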
I have been trying to insert a list of records into HBase using the HBase client library.
It works for a single Put with either Table or the (deprecated) HTable, but it does not accept a List of Puts.
The error says: Expected: util.List, but was List
I could not understand the meaning of this error. I tried converting to a Java list but did not succeed.
Any quick advice would be highly appreciated.
val quoteRecords = new ListBuffer[Put]()
val listQuotes = lines.foreachRDD(rdd => {
  rdd.foreach(record => addPutToList(buildPut(record)))
})
table.put(quoteRecords.toList)
quoteRecords.foreach(table.put)
println(listQuotes)
quoteRecords.toList returns a scala.List, but Table.put expects a java.util.List, so you have to convert it first (for example by importing scala.collection.JavaConverters._ and calling .asJava on the list). This SO thread will give you some insight.
I have a simple workflow:
[start_workflow] -> [user_task] -> [exclusive_gateway] -> (two routes, see below) -> [end_workflow]
The [exclusive_gateway] has two outgoing routes:
1.) ${if user_task output parameter == null} -> [NULL_service_task] -> [end_workflow]
2.) ${if user_task output parameter != null} -> [NOT_null_service_task] -> [end_workflow]
In Camunda Modeler, I've added an output parameter (named out) to the [user_task].
Q:
How do I set that output parameter through the Java API before completing the task via:
taskService.complete(taskId);
On the [exclusive_gateway] arrows, I've set this:
Condition type = expression
Expression = ${out != null}
But there's more:
If I delete the output parameter of the [user_task] and set a runtimeService variable before completing the task:
runtimeService.setVariable(processInstanceId, "out", name);
The [exclusive_gateway] does handle the parameter, and routes the flow as expected.
Without deleting the output parameter of the [user_task], it seems like:
1. it is never set (so == null)
2. this null value overwrites the value set by
runtimeService.setVariable(processInstanceId, "out", name);
So can I set a task's output parameter via the Java API, or can I only use process variables?
I guess you are looking for:
taskService.complete(<taskId>, Variables.putValue("out", <name>));
The communication between the task and the gateway (forwarding of the value) happens by setting the process variable "out" on complete.
For more info, check the Javadoc.
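As a minimal sketch of that call in Java, where taskId and name stand in for your actual values:

import org.camunda.bpm.engine.variable.Variables;

// taskId and name are placeholders from the question; taskService is the engine's TaskService.
// Variables.putValue(...) builds a VariableMap that complete(...) stores as process variables,
// so the gateway condition ${out != null} sees "out" once the token reaches it.
taskService.complete(taskId, Variables.putValue("out", name));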
I recently upgraded to Spark 2.0 and I'm seeing some strange behavior when trying to create a simple Dataset from JSON strings. Here's a simple test case:
SparkSession spark = SparkSession.builder().appName("test").master("local[1]").getOrCreate();
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

JavaRDD<String> rdd = sc.parallelize(Arrays.asList(
    "{\"name\":\"tom\",\"title\":\"engineer\",\"roles\":[\"designer\",\"developer\"]}",
    "{\"name\":\"jack\",\"title\":\"cto\",\"roles\":[\"designer\",\"manager\"]}"
));

JavaRDD<String> mappedRdd = rdd.map(json -> {
    System.out.println("mapping json: " + json);
    return json;
});

Dataset<Row> data = spark.read().json(mappedRdd);
data.show();
And the output:
mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
+----+--------------------+--------+
|name| roles| title|
+----+--------------------+--------+
| tom|[designer, develo...|engineer|
|jack| [designer, manager]| cto|
+----+--------------------+--------+
It seems that the "map" function is being executed twice even though I'm only performing one action. I thought that Spark would lazily build an execution plan, then execute it when needed, but this makes it seem that in order to read data as JSON and do anything with it, the plan will have to be executed at least twice.
In this simple case it doesn't matter, but when the map function is long-running, this becomes a big problem. Is this right, or am I missing something?
It happens because you don't provide a schema for the DataFrameReader. As a result, Spark has to eagerly scan the data set to infer the output schema.
Since mappedRdd is not cached, it will be evaluated twice:
once for schema inference
once when you call data.show
If you want to prevent this, you should provide a schema for the reader (Scala syntax):
val schema: org.apache.spark.sql.types.StructType = ???
spark.read.schema(schema).json(mappedRdd)
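In Java, a rough equivalent is to build the schema explicitly; the field names and types below are taken from the sample JSON in the question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Explicit schema matching the sample records, so no inference pass over mappedRdd is needed.
StructType schema = new StructType()
    .add("name", DataTypes.StringType)
    .add("title", DataTypes.StringType)
    .add("roles", DataTypes.createArrayType(DataTypes.StringType));

Dataset<Row> data = spark.read().schema(schema).json(mappedRdd);
data.show();  // mappedRdd is now evaluated only once, by this action

Alternatively, caching mappedRdd before the read avoids the second evaluation, at the cost of keeping the mapped strings in memory.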
I have an RDD like so:
JavaPairRDD<PointFlag, Point> keyValuePair = ...
I want to output an RDD like so:
JavaRDD<Point> valuesRemainingAfterProcessing = processAndOutputSkylinePoints(keyValuePair)
The processing will take place on a single node, because all the values are needed for the processing to occur (comparisons between them and their flags).
What I thought of doing is:
1. Map everything to a single ID: JavaPairRDD<Integer, Tuple2<PointFlag, Point>> singleIdRDD = keyValuePair.mapToPair(fp -> new Tuple2(0, fp));
2. Do the processing: JavaRDD<Iterable<Point>> iterableGlobalSkylines = singleIdRDD.map(ifp -> calculateGlobalSkyline(ifp)); (calculateGlobalSkyline() returns a List<Point>)
3. Convert to JavaRDD<Point>: JavaRDD<Point> globalSkylines = iterableGlobalSkylines.flatMap(p -> p);
This all looks like a dirty hack to me, though, and I would like to know if there is a better way of doing it.
A good solution I found (definitely way less verbose) is to use the glom() function from the Spark API. It collects all the elements of each partition into a single list, or in official terms:
Return an RDD created by coalescing all elements within each partition into a list.
First, though, you have to reduce the RDD to a single partition. Here is the solution:
JavaPairRDD<PointFlag, Point> keyValuePair = ...;
JavaPairRDD<PointFlag, Point> singlePartition = keyValuePair.coalesce(1);
JavaRDD<List<Tuple2<PointFlag, Point>>> groupedOnASingleList = singlePartition.glom(); // glom the coalesced RDD, not the original one
JavaRDD<Point> globalSkylinePoints = groupedOnASingleList.flatMap(singleList -> getGlobalSkylines(singleList));
If anyone has a better answer feel free to post it.