Serializing RDD

Serializing RDD - java

I have an RDD which I am trying to serialize and then reconstruct by deserializing. I am trying to see if this is possible in Apache Spark.
static JavaSparkContext sc = new JavaSparkContext(conf);
static SerializerInstance si = SparkEnv.get().closureSerializer().newInstance();
static ClassTag<JavaRDD<String>> tag = scala.reflect.ClassTag$.MODULE$.apply(JavaRDD.class);
..
..
JavaRDD<String> rdd = sc.textFile(logFile, 4);
System.out.println("Element 1 " + rdd.first());
ByteBuffer bb= si.serialize(rdd, tag);
JavaRDD<String> rdd2 = si.deserialize(bb, Thread.currentThread().getContextClassLoader(),tag);
System.out.println(rdd2.partitions().size());
System.out.println("Element 0 " + rdd2.first());
I get an exception on the last line when I perform an action on the newly created RDD. The way I am serializing is similar to how it is done internally in Spark.
Exception in thread "main" org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
at org.apache.spark.rdd.RDD.sc(RDD.scala:87)
at org.apache.spark.rdd.RDD.take(RDD.scala:1177)
at org.apache.spark.rdd.RDD.first(RDD.scala:1189)
at org.apache.spark.api.java.JavaRDDLike$class.first(JavaRDDLike.scala:477)
at org.apache.spark.api.java.JavaRDD.first(JavaRDD.scala:32)
at SimpleApp.sparkSend(SimpleApp.java:63)
at SimpleApp.main(SimpleApp.java:91)
The RDD is created and loaded within the same process, so I don't understand how this error happens.

I'm the author of this warning message.
Spark does not support performing actions and transformations on copies of RDDs that are created via deserialization. RDDs are serializable so that certain methods on them can be invoked in executors, but end users shouldn't try to manually perform RDD serialization.
When an RDD is serialized, it loses its reference to the SparkContext that created it, preventing jobs from being launched with it (see here). In earlier versions of Spark, your code would result in a NullPointerException when Spark tried to access the private, null RDD.sc field.
This error message was worded this way because users were frequently running into confusing NullPointerExceptions when trying to do things like rdd1.map { _ => rdd2.count() }, which caused actions to be invoked on deserialized RDDs on executor machines. I didn't anticipate that anyone would try to manually serialize / deserialize their RDDs on the driver, so I can see how this error message could be slightly misleading.

Related

Spark: show dataframe content in logging (Java)

I want to know how to show dataframe content (rows) in Java ?
I tried using log.info(df.showString()), but it prints unreadable characters. I want to use df.collectAsList() but I have to do filter afterwards so I can't do this.
Thank you.

There are several options to log the data:
Collecting the data to the driver
You can call collectAsList() and continue processing afterwards. Spark datasets are immutable, so collecting them to the driver will trigger the execution, but you can re-use the dataset afterwards for further processing steps:
Dataset<Data> ds = ... //1
List<Data> collectedDs = ds.collectAsList(); //2
doSomeLogging(collectedDs);
ds = ds.filter(<filter condition>); //3
ds.show();
The code above will collect the data in line //2, log it and then continue processing in line //3.
Depending on how complex the creation of the dataset in line //1 was, you would want to cache the dataset, so that the processing in line //1 is run only once.
Dataset<Data> ds = ... //1
ds = ds.cache();
List<Data> collectedDs = ds.collectAsList(); //2
....
Using map
Calling collectAsList() will send all your data to the driver. Usually you use Spark in order to distribute the data over several executor nodes, so your driver might not be large enough to hold all of the data at the same time. In this case, you can log the data in a map call:
Dataset<Data> ds = ... //1
ds = ds.map(d -> {
System.out.println(d); //2
return d; //3
}, Encoders.bean(Data.class));
ds = ds.filter(<filter condition>);
ds.show();
In this example, line //2 does the logging and line //3 simply returns the original object, so that the dataset remains unchanged. I assume that the Data class comes with a readable toString() implementation. Otherwise, line //2 needs some more logic. It always might be helpful to a log library (like log4j) instead of writing directly to standard out.
In this second approach, the logs will not be written on the driver but on each executor. You would have to collect the logs after the Spark job has finished and combine them into one file.
If you have an untyped dataframe instead of a dataset like above, the same code would work. You only would have to operate directly on a Row object using the getXXX methods for creating logging output instead of the Data class.
All logging operations will have an impact on the performance of your code.

Apache Beam in Dataflow Large Side Input

This is most similar to this question.
I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a very large dataset that comes from Google BigQuery and have all the relevant values attached to it (based on a key) before being written to a database.
The trouble is that the mapping dataset from BigQuery is very large - any attempt to use it as a side input fails with the Dataflow runners throwing the error "java.lang.IllegalArgumentException: ByteString would be too long". I have attempted the following strategies:
1) Side input
As stated,the mapping data is (apparently) too large to do this. If I'm wrong here or there is a work-around for this, please let me know because this would be the simplest solution.
2) Key-Value pair mapping
In this strategy, I read the BigQuery data and Pubsub message data in the first part of the pipeline, then run each through ParDo transformations that change every value in the PCollections to KeyValue pairs. Then, I run a Merge.Flatten transform and a GroupByKey transform to attach the relevant mapping data to each message.
The trouble here is that streaming data requires windowing to be merged with other data, so I have to apply windowing to the large, bounded BigQuery data as well. It also requires that the windowing strategies are the same on both datasets. But no windowing strategy for the bounded data makes sense, and the few windowing attempts I've made simply send all the BQ data in a single window and then never send it again. It needs to be joined with every incoming pubsub message.
3) Calling BQ directly in a ParDo (DoFn)
This seemed like a good idea - have each worker declare a static instance of the map data. If it's not there, then call BigQuery directly to get it. Unfortunately this throws internal errors from BigQuery every time (as in the entire message just says "Internal error"). Filing a support ticket with Google resulted in them telling me that, essentially, "you can't do that".
It seems this task doesn't really fit the "embarrassingly parallelizable" model, so am I barking up the wrong tree here?
EDIT :
Even when using a high memory machine in dataflow and attempting to make the side input into a map view, I get the error java.lang.IllegalArgumentException: ByteString would be too long
Here is an example (psuedo) of the code I'm using:
Pipeline pipeline = Pipeline.create(options);
PCollectionView<Map<String, TableRow>> mapData = pipeline
.apply("ReadMapData", BigQueryIO.read().fromQuery("SELECT whatever FROM ...").usingStandardSql())
.apply("BQToKeyValPairs", ParDo.of(new BQToKeyValueDoFn()))
.apply(View.asMap());
PCollection<PubsubMessage> messages = pipeline.apply(PubsubIO.readMessages()
.fromSubscription(String.format("projects/%1$s/subscriptions/%2$s", projectId, pubsubSubscription)));
messages.apply(ParDo.of(new DoFn<PubsubMessage, TableRow>() {
#ProcessElement
public void processElement(ProcessContext c) {
JSONObject data = new JSONObject(new String(c.element().getPayload()));
String key = getKeyFromData(data);
TableRow sideInputData = c.sideInput(mapData).get(key);
if (sideInputData != null) {
LOG.info("holyWowItWOrked");
c.output(new TableRow());
} else {
LOG.info("noSideInputDataHere");
}
}
}).withSideInputs(mapData));
The pipeline throws the exception and fails before logging anything from within the ParDo.
Stack trace:
java.lang.IllegalArgumentException: ByteString would be too long: 644959474+1551393497
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.concat(ByteString.java:524)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:576)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.copyFrom(ByteString.java:559)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString$Output.toByteString(ByteString.java:1006)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillBag.persistDirectly(WindmillStateInternals.java:575)
com.google.cloud.dataflow.worker.WindmillStateInternals$SimpleWindmillState.persist(WindmillStateInternals.java:320)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillCombiningState.persist(WindmillStateInternals.java:951)
com.google.cloud.dataflow.worker.WindmillStateInternals.persist(WindmillStateInternals.java:216)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.flushState(StreamingModeExecutionContext.java:513)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:363)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1000)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$800(StreamingDataflowWorker.java:133)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:771)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

Check out the section called "Pattern: Streaming mode large lookup tables" in Guide to common Cloud Dataflow use-case patterns, Part 2. It might be the only viable solution since your side input doesn't fit into memory.
Description:
A large (in GBs) lookup table must be accurate, and changes often or
does not fit in memory.
Example:
You have point of sale information from a retailer and need to
associate the name of the product item with the data record which
contains the productID. There are hundreds of thousands of items
stored in an external database that can change constantly. Also, all
elements must be processed using the correct value.
Solution:
Use the "Calling external services for data enrichment" pattern
but rather than calling a micro service, call a read-optimized NoSQL
database (such as Cloud Datastore or Cloud Bigtable) directly.
For each value to be looked up, create a Key Value pair using the KV
utility class. Do a GroupByKey to create batches of the same key type
to make the call against the database. In the DoFn, make a call out to
the database for that key and then apply the value to all values by
walking through the iterable. Follow best practices with client
instantiation as described in "Calling external services for data
enrichment".
Other relevant patterns are described in Guide to common Cloud Dataflow use-case patterns, Part 1:
Pattern: Slowly-changing lookup cache
Pattern: Calling external services for data enrichment

Mapping the values of a RDD to their dictionary values

I've this piece of code:
List tmp = colRDD.collect();
int ctr = 0;
for(Object o : tmp){
if (!dictionary.containsKey(o)) {
dictionary.put(o, ctr++);
}
}
revDictionary = dictionary.entrySet().stream()
.collect(Collectors.toMap(Entry::getValue, c -> c.getKey()));
colRDD = colRDD.map(x -> {return dictionary.get(x);});
At the start, I materialize the RDD and put each value in a hash table where the RDD values are keys.
Then, I sımple want to map each value in the RDD to their dictionary value.
However, I get a Task not serializable error. Why is that ?

This will be caused by trying to access a variable scoped to the driver, from within code that is evaluated by an executor.
Given your sample code, the most likely culprit is dictionary in this line of code:
colRDD = colRDD.map(x -> {return dictionary.get(x);});
However the issue could also be coming from further up in your code than you have supplied here, so you might need to check that too.
The reason for this is because dictionary resides in memory of your driver, which is likely running in a separate JVM instance than your executors. The lambda you have passed to colRDD.map is evaluated by an executor, not the driver. The function is serialised as the task to be executed, sent to an executor to be run. But the Spark engine is unable to serialise the task because of the 'closure' around dictionary and hence, exception.

Is it possible to execute a command on all workers within Apache Spark?

I have a situation where I want to execute a system process on each worker within Spark. I want this process to be run an each machine once. Specifically this process starts a daemon which is required to be running before the rest of my program executes. Ideally this should execute before I've read any data in.
I'm on Spark 2.0.2 and using dynamic allocation.

You may be able to achieve this with a combination of lazy val and Spark broadcast. It will be something like below. (Have not compiled below code, you may have to change few things)
object ProcessManager {
lazy val start = // start your process here.
}
You can broadcast this object at the start of your application before you do any transformations.
val pm = sc.broadcast(ProcessManager)
Now, you can access this object inside your transformation like you do with any other broadcast variables and invoke the lazy val.
rdd.mapPartition(itr => {
pm.value.start
// Other stuff here.
}

An object with static initialization which invokes your system process should do the trick.
object SparkStandIn extends App {
object invokeSystemProcess {
import sys.process._
val errorCode = "echo Whatever you put in this object should be executed once per jvm".!
def doIt(): Unit = {
// this object will construct once per jvm, but objects are lazy in
// another way to make sure instantiation happens is to check that the errorCode does not represent an error
}
}
invokeSystemProcess.doIt()
invokeSystemProcess.doIt() // even if doIt is invoked multiple times, the static initialization happens once
}

A specific answer for a specific use case, I have a cluster with 50 nodes and I wanted to know which ones have CET timezone set:
(1 until 100).toSeq.toDS.
mapPartitions(itr => {
sys.process.Process(
Seq("bash", "-c", "echo $(hostname && date)")
).
lines.
toIterator
}).
collect().
filter(_.contains(" CET ")).
distinct.
sorted.
foreach(println)
Notice I don't think it's guaranteed 100% you'll get a partition for every node so the command might not get run on every node, even using using a 100 elements Dataset in a cluster with 50 nodes like the previous example.

Save a spark RDD using mapPartition with iterator

I have some intermediate data that I need to be stored in HDFS and local as well. I'm using Spark 1.6. In HDFS as intermediate form I'm getting data in /output/testDummy/part-00000 and /output/testDummy/part-00001. I want to save these partitions in local using Java/Scala so that I could save them as /users/home/indexes/index.nt(by merging both in local) or /users/home/indexes/index-0000.nt and /home/indexes/index-0001.nt separately.
Here is my code:
Note: testDummy is same as test, output is with two partitions. I want to store them separately or combined but local with index.nt file. I prefer to store separately in two data-nodes. I'm using cluster and submit spark job on YARN. I also added some comments, how many times and what data I'm getting. How could I do? Any help is appreciated.
val testDummy = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).saveAsTextFile(outputFilePathForHDFS+"/testDummy")
println("testDummy done") //1 time print
def savesData(iterator: Iterator[(String)]): Iterator[(String)] = {
println("Inside savesData") // now 4 times when coalesce(Constants.INITIAL_PARTITIONS)=2
println("iter size"+iterator.size) // 2 735 2 735 values
val filenamesWithExtension = outputPath + "/index.nt"
println("filenamesWithExtension "+filenamesWithExtension.length) //4 times
var list = List[(String)]()
val fileWritter = new FileWriter(filenamesWithExtension,true)
val bufferWritter = new BufferedWriter(fileWritter)
while (iterator.hasNext){ //iterator.hasNext is false
println("inside iterator") //0 times
val dat = iterator.next()
println("datadata "+iterator.next())
bufferWritter.write(dat + "\n")
bufferWritter.flush()
println("index files written")
val dataElements = dat.split(" ")
println("dataElements") //0
list = list.::(dataElements(0))
list = list.::(dataElements(1))
list = list.::(dataElements(2))
}
bufferWritter.close() //closing
println("savesData method end") //4 times when coal=2
list.iterator
}
println("before saving data into local") //1
val test = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).mapPartitions(savesData)
println("testRDD partitions "+test.getNumPartitions) //2
println("testRDD size "+test.collect().length) //0
println("after saving data into local") //1
PS: I followed, this and this but not exactly same what I'm looking for, I did somehow but not getting anything in index.nt

A couple of things:
Never call Iterator.size if you plan to use data later. Iterators are TraversableOnce. The only way to compute Iterator size is to traverse all its element and after that there is no more data to be read.
Don't use transformations like mapPartitions for side effects. If you want to perform some type of IO use actions like foreach / foreachPartition. It is a bad practice and doesn't guarantee that given piece of code will be executed only once.
Local path inside action or transformations is a local path of particular worker. If you want to write directly on the client machine you should fetch data first with collect or toLocalIterator. It could be better though to write to distributed storage and fetch data later.

Java 7 provides means to watch directories.
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
The idea is to create a watch service, register it with the directory of interest (mention the events of your interest, like file creation, deletion, etc.,), do watch, you will be notified of any events like creation, deletion, etc., you can take whatever action you want then.
You will have to depend on Java hdfs api heavily wherever applicable.
Run the program in background since it waits for events forever. (You can write logic to quit after you do whatever you want)
On the other hand, shell scripting will also help.
Be aware of coherency model of hdfs file system while reading files.
Hope this helps with some idea.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Serializing RDD - java

Related

Spark: show dataframe content in logging (Java)

Apache Beam in Dataflow Large Side Input

Mapping the values of a RDD to their dictionary values

Is it possible to execute a command on all workers within Apache Spark?

Save a spark RDD using mapPartition with iterator

Categories

Resources