Mapping the values of an RDD to their dictionary values - java

I've this piece of code:
List tmp = colRDD.collect();
int ctr = 0;
for (Object o : tmp) {
    if (!dictionary.containsKey(o)) {
        dictionary.put(o, ctr++);
    }
}
revDictionary = dictionary.entrySet().stream()
        .collect(Collectors.toMap(Entry::getValue, c -> c.getKey()));
colRDD = colRDD.map(x -> {return dictionary.get(x);});
At the start, I materialize the RDD and put each value in a hash table where the RDD values are keys.
Then, I simply want to map each value in the RDD to its dictionary value.
However, I get a Task not serializable error. Why is that?

This is caused by trying to access a variable scoped to the driver from within code that is evaluated by an executor.
Given your sample code, the most likely culprit is dictionary in this line of code:
colRDD = colRDD.map(x -> {return dictionary.get(x);});
However, the issue could also be coming from further up in your code than what you have supplied here, so you might need to check that too.
The reason is that dictionary resides in the memory of your driver, which is likely running in a separate JVM instance from your executors. The lambda you have passed to colRDD.map is evaluated by an executor, not the driver: the function is serialised as the task to be executed and sent to an executor to be run. The Spark engine is unable to serialise that task because of the closure around dictionary, hence the exception.
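One common way around this (a minimal sketch, assuming dictionary is a HashMap of Object keys to Integer indices built on the driver and that sc is your JavaSparkContext) is to broadcast a copy of the map and look values up through the broadcast handle inside the lambda, so the closure no longer drags in driver-side state such as the enclosing class:
import java.util.HashMap;
import org.apache.spark.broadcast.Broadcast;

// Ship a read-only copy of the dictionary to every executor once.
final Broadcast<HashMap<Object, Integer>> dictBroadcast = sc.broadcast(new HashMap<>(dictionary));
colRDD = colRDD.map(x -> dictBroadcast.value().get(x));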

Related

Optimizing method with list of 500k+ elements

I'm looking for some help since I don't know how to optimize a process.
I have to invoke a service that returns a list with more than 500K elements (I don't know why; these services belong to the client). For each element of the list, I have to invoke 2 more services and then save some attributes in our database. This last step is not the problem, but the whole process takes between 1 and 2 seconds per element, so at this rate it is going to take more than 100 hours to complete.
My approach is the following: in my main method I get the large list, then I use a parallelStream to iterate over the elements of the list and a CompletableFuture to call the method that invokes the 2 services mentioned above. I've tried changing the parallelStream to a plain stream with for-each, tried splitting the main list into smaller lists, and many other things, but I don't see better performance. I think the problem is the invocation of those 2 services, but I want to try my luck asking here.
I'm using Java 11 and Spring, and for the invocation of the services I'm using RestTemplate. This is my code:
public void updateDiscount() {
    //List with 500k elements
    var relationshipList = relationshipService.getLargeList();
    //CompletableFuture to make the async calls to the method above
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1));
    });
}
//Second class
@Async("nameOfThePool")
public void asyncDiscountSave(ElementOfList element) {
    //Logic to create request
    //.........
    var responseClients = anotherClass.getClients(element.getGroup1()); //get the first response with restTemplate
    var responseProducts = anotherClass.getProducts(element.getGroup2()); //get the second response with restTemplate
    for (var client : responseClients) {
        for (var product : responseProducts) {
            //Here we just save some attributes of these objects on our DB
        }
    }
}
Thanks for the help.
UPDATE:
For this particular case, the only improvement I can make is to pass a thread pool to the CompletableFuture; the real problem is the response time of the services that I need to invoke.
I decided to follow the second approach, and it took about 5 hours to complete, which is acceptable compared with the first approach.
As you haven't defined an executor, you are using the default pool. Adding an executor allows you to create as many threads as you need and as the server resources can manage:
public void updateDiscount() {
    Executor executor = Executors.newFixedThreadPool(100); //Define the number according to server resources performance
    //List with 500k elements
    var relationshipList = relationshipService.getLargeList();
    //CompletableFuture to make the async calls to the method above
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1), executor);
    });
}
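One caveat, sketched below on the assumption that updateDiscount should not return until every record has been processed (pool size and method names are taken from the code above): runAsync returns immediately, so to wait for completion and release the pool afterwards you would need to collect and join the futures.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public void updateDiscount() {
    ExecutorService executor = Executors.newFixedThreadPool(100);
    var relationshipList = relationshipService.getLargeList();

    // Keep the futures so we can wait for all of them before shutting the pool down.
    var futures = relationshipList.stream()
            .map(level1 -> CompletableFuture.runAsync(
                    () -> relationshipService.asyncDiscountSave(level1), executor))
            .collect(Collectors.toList());

    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    executor.shutdown();
}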

Spark - Collect partitions using foreachpartition

We are using Spark for file processing. We are processing pretty big files, each around 30 GB with about 40-50 million lines. These files are formatted, and we load them into a data frame. The initial requirement was to identify records matching certain criteria and load them into MySQL. We were able to do that.
The requirement changed recently: records not meeting the criteria are now to be stored in an alternate DB. This is causing an issue because the size of the collection is too big. We are trying to collect each partition independently and merge it into a list, as suggested here:
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/dont_collect_large_rdds.html
We are not familiar with Scala, so we are having trouble converting this to Java. How can we iterate over the partitions one by one and collect them?
Thanks
Use df.foreachPartition to execute your logic for each partition independently; it does not return anything to the driver. You can save the matching results into the DB at the executor level. If you want to collect the results on the driver, use mapPartitions, which is not recommended for your case.
Please refer to the link below:
Spark - Java - foreachPartition
dataset.foreachPartition(new ForeachPartitionFunction<Row>() {
    public void call(Iterator<Row> r) throws Exception {
        while (r.hasNext()) {
            Row row = r.next();
            System.out.println(row.getString(1));
        }
        // do your business logic and load into MySQL.
    }
});
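For the "load into MySQL" part, a rough sketch of a per-partition write (the JDBC URL, credentials, table name my_table and column are placeholders, not from the question) would open one connection per partition and batch the inserts:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

dataset.foreachPartition(new ForeachPartitionFunction<Row>() {
    public void call(Iterator<Row> rows) throws Exception {
        // One connection and one batched statement per partition, not per row.
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "password");
             PreparedStatement ps = conn.prepareStatement("INSERT INTO my_table (col1) VALUES (?)")) {
            while (rows.hasNext()) {
                Row row = rows.next();
                ps.setString(1, row.getString(1));
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
});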
For mapPartitions:
// You can use the same as Row but for clarity I am defining this.
public class ResultEntry implements Serializable {
    //define your df properties ..
}
Dataset<ResultEntry> mappedData = data.mapPartitions(new MapPartitionsFunction<Row, ResultEntry>() {
    @Override
    public Iterator<ResultEntry> call(Iterator<Row> it) {
        List<ResultEntry> filteredResult = new ArrayList<ResultEntry>();
        while (it.hasNext()) {
            Row row = it.next();
            if (somecondition)
                filteredResult.add(convertToResultEntry(row));
        }
        return filteredResult.iterator();
    }
}, Encoders.javaSerialization(ResultEntry.class));
Hope this helps.
Ravi

Is it possible to execute a command on all workers within Apache Spark?

I have a situation where I want to execute a system process on each worker within Spark. I want this process to be run on each machine once. Specifically, this process starts a daemon which is required to be running before the rest of my program executes. Ideally this should execute before I've read any data in.
I'm on Spark 2.0.2 and using dynamic allocation.
You may be able to achieve this with a combination of a lazy val and a Spark broadcast. It would be something like the following. (I have not compiled the code below, so you may have to change a few things.)
object ProcessManager {
  lazy val start = {
    // start your process here.
  }
}
You can broadcast this object at the start of your application before you do any transformations.
val pm = sc.broadcast(ProcessManager)
Now, you can access this object inside your transformation like you do with any other broadcast variables and invoke the lazy val.
rdd.mapPartitions(itr => {
  pm.value.start
  // Other stuff here.
})
An object with static initialization which invokes your system process should do the trick.
object SparkStandIn extends App {
  object invokeSystemProcess {
    import sys.process._
    val errorCode = "echo Whatever you put in this object should be executed once per jvm".!
    def doIt(): Unit = {
      // this object will construct once per jvm, but objects are lazy in Scala,
      // so nothing runs until the object is first referenced.
      // Another way to make sure instantiation happened is to check that errorCode does not represent an error.
    }
  }
  invokeSystemProcess.doIt()
  invokeSystemProcess.doIt() // even if doIt is invoked multiple times, the static initialization happens once
}
A specific answer for a specific use case: I have a cluster with 50 nodes and I wanted to know which ones have the CET timezone set:
(1 until 100).toSeq.toDS.
  mapPartitions(itr => {
    sys.process.Process(
      Seq("bash", "-c", "echo $(hostname && date)")
    ).lines.toIterator
  }).
  collect().
  filter(_.contains(" CET ")).
  distinct.
  sorted.
  foreach(println)
Note that I don't think it's 100% guaranteed you'll get a partition on every node, so the command might not get run on every node, even when using a 100-element Dataset in a cluster with 50 nodes as in the previous example.
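If you prefer Java, a rough equivalent of the snippet above (a sketch only, assuming a JavaSparkContext named sc and Spark 2.x, where mapPartitions lambdas return an Iterator) shells out once per partition and collects the output on the driver:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

List<Integer> dummy = IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());
List<String> output = sc.parallelize(dummy, 100)
        .mapPartitions(it -> {
            // Run the shell command once for this partition and capture its output.
            Process p = new ProcessBuilder("bash", "-c", "echo $(hostname && date)").start();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                return reader.lines().collect(Collectors.toList()).iterator();
            }
        })
        .collect();
output.stream().distinct().sorted().forEach(System.out::println);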

Aerospike Exception Error Code 4 Parameter Error when doing a put through java client

I am processing an avro file with a list of records and doing a client.put for each record to my local Aerospike store.
For some reason, the put succeeds for a certain number of records and fails for the rest. I am doing this -
client.put(writePolicy, recordKey, bins);
The related values for the failed call are -
namespace = test
setname = test_set
userkey = some_string
write policy = null
Bins -
is_user:1
prof_loc:530049,530046,530032,530031,530017,530016,500046
rfm:Platinum
store_browsed:some_string
store_purch:some_string
city_id:null
Log Snippet -
com.aerospike.client.AerospikeException: Error Code 4: Parameter error
at com.aerospike.client.command.WriteCommand.parseResult(WriteCommand.java:72)
at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:56)
at com.aerospike.client.AerospikeClient.put(AerospikeClient.java:338)
What could possibly be the issue?
Finally. Resolved!
I was using the REPLACE RecordExistsAction in this case. Any bin with a null value will fail in this configuration. Aerospike treats a null value in a bin as equivalent to removing that bin from the record for a key, so the REPLACE configuration doesn't make sense for such an operation, hence the parameter error - an invalid DB operation.
The UPDATE config, on the other hand, works perfectly fine.
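For illustration, switching the policy looks like the sketch below (the client, key and bins are the ones from the question):
import com.aerospike.client.policy.RecordExistsAction;
import com.aerospike.client.policy.WritePolicy;

WritePolicy writePolicy = new WritePolicy();
writePolicy.recordExistsAction = RecordExistsAction.UPDATE;

// With UPDATE, a null-valued bin simply removes that bin instead of failing the whole put.
client.put(writePolicy, recordKey, bins);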
Aerospike allows reading and writing with great flexibility. For developers to harness this functionality, Aerospike exposes a great number of settings on both Policy and WritePolicy, which at times can be intimidating and error-prone for beginners. A parameter error simply means that some of the configured settings are not coherent with each other. An easy start is to use the default read or write policy, which can be obtained by passing null as the policy.
Eg:
aeroClient.put(null, key, new Bin("binName", object));
Below is aerospike put method code snippet
public final void put(WritePolicy policy, Key key, Bin... bins) throws AerospikeException {
    if (policy == null) {
        policy = writePolicyDefault;
    }
    WriteCommand command = new WriteCommand(cluster, policy, key, bins, Operation.Type.WRITE);
    command.execute();
}
I recently got this error because the expiration value I was using in the writePolicy was greater than the default expiration time for the cache.
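For reference, the record TTL is set per write on the policy; a minimal sketch (the one-hour value is only an example) looks like this:
import com.aerospike.client.policy.WritePolicy;

WritePolicy writePolicy = new WritePolicy();
writePolicy.expiration = 3600; // record TTL in seconds; keep it within what the namespace allows
client.put(writePolicy, recordKey, bins);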

Serializing RDD

I have an RDD which I am trying to serialize and then reconstruct by deserializing. I am trying to see if this is possible in Apache Spark.
static JavaSparkContext sc = new JavaSparkContext(conf);
static SerializerInstance si = SparkEnv.get().closureSerializer().newInstance();
static ClassTag<JavaRDD<String>> tag = scala.reflect.ClassTag$.MODULE$.apply(JavaRDD.class);
..
..
JavaRDD<String> rdd = sc.textFile(logFile, 4);
System.out.println("Element 1 " + rdd.first());
ByteBuffer bb= si.serialize(rdd, tag);
JavaRDD<String> rdd2 = si.deserialize(bb, Thread.currentThread().getContextClassLoader(),tag);
System.out.println(rdd2.partitions().size());
System.out.println("Element 0 " + rdd2.first());
I get an exception on the last line when I perform an action on the newly created RDD. The way I am serializing is similar to how it is done internally in Spark.
Exception in thread "main" org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
at org.apache.spark.rdd.RDD.sc(RDD.scala:87)
at org.apache.spark.rdd.RDD.take(RDD.scala:1177)
at org.apache.spark.rdd.RDD.first(RDD.scala:1189)
at org.apache.spark.api.java.JavaRDDLike$class.first(JavaRDDLike.scala:477)
at org.apache.spark.api.java.JavaRDD.first(JavaRDD.scala:32)
at SimpleApp.sparkSend(SimpleApp.java:63)
at SimpleApp.main(SimpleApp.java:91)
The RDD is created and loaded within the same process, so I don't understand how this error happens.
I'm the author of this warning message.
Spark does not support performing actions and transformations on copies of RDDs that are created via deserialization. RDDs are serializable so that certain methods on them can be invoked in executors, but end users shouldn't try to manually perform RDD serialization.
When an RDD is serialized, it loses its reference to the SparkContext that created it, preventing jobs from being launched with it (see here). In earlier versions of Spark, your code would result in a NullPointerException when Spark tried to access the private, null RDD.sc field.
This error message was worded this way because users were frequently running into confusing NullPointerExceptions when trying to do things like rdd1.map { _ => rdd2.count() }, which caused actions to be invoked on deserialized RDDs on executor machines. I didn't anticipate that anyone would try to manually serialize / deserialize their RDDs on the driver, so I can see how this error message could be slightly misleading.
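If the underlying goal is to persist an RDD and rebuild it later, a supported route is to serialize the RDD's data rather than the RDD object itself, for example with object files (a sketch, reusing the sc and rdd from the question; the output path is just an example):
// Write the RDD's elements to storage using Java serialization.
rdd.saveAsObjectFile("/tmp/rdd-snapshot");

// Later (possibly from another driver program), reconstruct an equivalent RDD.
JavaRDD<String> restored = sc.objectFile("/tmp/rdd-snapshot");
System.out.println("Element 0 " + restored.first());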
