The Apache Spark documentation says that "within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads". Can someone explain how to achieve this concurrency for the following sample code?
SparkConf conf = new SparkConf().setAppName("Simple_App");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> file1 = sc.textFile("/path/to/test_doc1");
JavaRDD<String> file2 = sc.textFile("/path/to/test_doc2");
System.out.println(file1.count());
System.out.println(file2.count());
These two jobs are independent and must run concurrently.
Thank You.
Try something like this:
final JavaSparkContext sc = new JavaSparkContext("local[2]","Simple_App");
ExecutorService executorService = Executors.newFixedThreadPool(2);
// Start thread 1
Future<Long> future1 = executorService.submit(new Callable<Long>() {
@Override
public Long call() throws Exception {
JavaRDD<String> file1 = sc.textFile("/path/to/test_doc1");
return file1.count();
}
});
// Start thread 2
Future<Long> future2 = executorService.submit(new Callable<Long>() {
@Override
public Long call() throws Exception {
JavaRDD<String> file2 = sc.textFile("/path/to/test_doc2");
return file2.count();
}
});
// Wait thread 1
System.out.println("File1:"+future1.get());
// Wait thread 2
System.out.println("File2:"+future2.get());
Using the Scala parallel collections feature:
Range(0,10).par.foreach {
project_id =>
{
spark.table("store_sales").selectExpr(project_id+" as project_id", "count(*) as cnt")
.write
.saveAsTable(s"counts_$project_id")
}
}
PS: The above launches up to 10 parallel Spark jobs, but it could be fewer depending on the number of cores available on the Spark driver. The Futures-based method by GQ above is more flexible in this regard.
I have an application that subscribes to a topic in GCP, and when there are messages there it downloads them and sends them to a queue on ActiveMQ.
To make this process fast, I am using an ExecutorService and launching multiple threads to send messages to ActiveMQ. Since the subscription is supposed to be an ongoing task, I am putting the code in a while(true) loop, and hence I can't shut down the ExecutorService in the normal fashion, as I would be creating and shutting down the executor service on every iteration.
I am searching for an elegant way to shut down the ExecutorService when the subscription has been empty (no data in the topic) for 2 or 3 minutes, or some other inactivity window, and then of course to start it again when there is new data.
The following is my current idea, which I don't like: it is just a counter that I increment when the subscription retrieves no data.
I am looking for a more elegant way of doing that.
@Service
@Slf4j
public class PubSubSubscriberService {
private static final int EMPTY_SUBSCRIPTION_COUNTER = 4;
private static final Logger businessLogger = LoggerFactory.getLogger("BusinessLogger");
private Queue<PubsubMessage> messages = new ConcurrentLinkedQueue<>();
public void pullMessagesAndSendToBroker(CompositeConfigurationElement cce) {
var patchSize = cce.getSubscriber().getPatchSize();
var nThreads = cce.getSubscriber().getSendingParallelThreads();
var scheduledTasks = 0;
var subscribeCounter = 0;
ThreadPoolExecutor threadPoolExecutor = null;
while (true) {
try {
if (subscribeCounter < EMPTY_SUBSCRIPTION_COUNTER) {
log.info("Creating Executor Service for uploading to broker with a thread pool of Size: " + nThreads);
threadPoolExecutor = getThreadPoolExecutor(nThreads);
}
var subscriber = this.getSubscriber(cce);
this.startSubscriber(subscriber, cce);
this.checkActivity(threadPoolExecutor, subscribeCounter++);
// send patches of {{ messagesPerIteration }}
while (this.messages.size() > patchSize) {
if (poolIsReady(threadPoolExecutor, nThreads)) {
UploadTask task = new UploadTask(this.messages, cce, cf, patchSize);
threadPoolExecutor.submit(task);
scheduledTasks ++;
}
subscribeCounter = 0;
}
// send the rest
if (this.messages.size() > 0) {
UploadTask task = new UploadTask(this.messages, cce, cf, patchSize);
threadPoolExecutor.submit(task);
scheduledTasks ++;
subscribeCounter = 0;
}
if (scheduledTasks > 0) {
businessLogger.info("Scheduled " + scheduledTasks + " upload tasks of size upto: " + patchSize + ", preparing to start subscribing for 30 more sec") ;
scheduledTasks = 0;
}
} catch ( Exception e) {
e.printStackTrace();
businessLogger.error(e.getMessage());
}
}
Your pool takes up little space and memory and consumes almost no CPU when it is not in use. Set a maximum limit on your pool capacity and keep using the same pool without trying to downscale it. If you have too many messages to process, the tasks are simply queued until a free executor thread can pick them up.
If scaling up and down is a real concern, your design could be reviewed. Instead of an executor pool internal to the pod, you could trigger events in your cluster and process them in parallel on other pods. These pods can then scale up and down according to the traffic (have a look at Knative).
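To illustrate the first point - keep one long-lived pool instead of recreating it inside the loop - here is a minimal sketch of a bounded ThreadPoolExecutor that shrinks itself when idle, so there is nothing to shut down during quiet periods. The pool size and keep-alive values are illustrative only, not taken from the question:
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BrokerUploadPool {

    // One long-lived pool for the whole subscriber loop: bounded at nThreads,
    // but idle workers (including core ones) time out after 60 seconds, so the
    // pool shrinks back to zero threads by itself when the subscription goes quiet.
    static ThreadPoolExecutor createUploadPool(int nThreads) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                nThreads, nThreads,
                60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());
        pool.allowCoreThreadTimeOut(true);  // let core threads die when idle
        return pool;
    }
}
Created once before the while(true) loop, such a pool simply queues submitted UploadTasks when all workers are busy and lets idle worker threads expire on their own.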
As a Kafka learning exercise, I have written a Java program TsdbMetricToKafkaTopic to copy data from openTSDB to a Kafka topic, and another Java program DumpKafkaTopic to print out the results; below is the key method of DumpKafkaTopic.
I have confirmed, by using the Kafka utility kafka-console-consumer.sh, that the data I expect are indeed getting written to the intended topic. However, the behavior of DumpKafkaTopic is strange: When I run the producer and then DumpKafkaTopic, it prints results as I'd expect. However, if I re-run it immediately, it prints nothing.
I thought that because I set auto.offset.reset to earliest, my program would be idempotent, that is, every time I run it, it should produce the same results (until I write something else to the topic). Why isn't this happening?
public void dump( String kafka_topic ) {
// Serializers/deserializers (serde) for key and value types
final Serde<Long> long_serde = Serdes.Long();
final Serde< TsdbObject > tsdb_object_serde =
Serdes.serdeFrom( new TsdbObject.TsdbObjectSerializer(),
new TsdbObject.TsdbObjectDeserializer() );
StreamsBuilder streams_builder = new StreamsBuilder();
KStream< Long, TsdbObject > kstream =
streams_builder.stream( kafka_topic, Consumed.with( long_serde, tsdb_object_serde ) );
// Add final operator, to print results to stdout:
Printed< Long, TsdbObject > printed = Printed.toSysOut();
kstream.print( printed );
Map<String, Object> kstreams_props = new HashMap<>();
kstreams_props.put(StreamsConfig.APPLICATION_ID_CONFIG, "DumpKafkaTopic");
kstreams_props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// make sure to consume the complete topic via "auto.offset.reset = earliest"
kstreams_props.put( ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
StreamsConfig kstreams_config = new StreamsConfig(kstreams_props);
KafkaStreams kstreams = new KafkaStreams( streams_builder.build(), kstreams_config );
System.out.println( "Starting DumpKafkaTopic stream " );
kstreams.start();
// Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams (from https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/)
Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
@Override
public void run() {
System.out.println( "Stopping DumpKafkaTopic stream " );
kstreams.close();
}
}));
}
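One thing worth noting about that setting: auto.offset.reset only applies when the consumer group (which Kafka Streams derives from application.id) has no committed offsets yet. Once a run has committed offsets, a later run with the same application.id resumes from them, which would explain seeing nothing on an immediate re-run. As a purely illustrative (not production) tweak, a fresh application.id per run has no committed offsets and therefore falls back to earliest each time:
// Hypothetical experiment only: a unique application.id per run means the group
// has no committed offsets, so "auto.offset.reset = earliest" applies every time
// and the whole topic is printed again.
kstreams_props.put(StreamsConfig.APPLICATION_ID_CONFIG,
        "DumpKafkaTopic-" + System.currentTimeMillis());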
I've got an app that sends simple SQS messages to multiple queues. Previously, this sending happened serially, but now that we've got more queues we need to send to, I decided to parallelize it by doing all the sending in a thread pool (up to 10 threads).
However, I've noticed that sqs.sendMessage latency seems to increase when I throw more threads at the job!
I've created a sample program below to reproduce the problem (Note that numIterations is just to get more data, and this is just a simplified version of the code for demo purposes).
Running on an EC2 instance in the same region and using 7 queues, I'm typically getting average results of around 12-15ms with 1 thread and 21-25ms with 7 threads - nearly double the latency!
Even running from my laptop remotely (when creating this demo), I'm getting average latency of ~90ms with 1 thread and ~120ms with 7 threads.
public static void main(String[] args) throws Exception {
AWSCredentialsProvider creds = new AWSStaticCredentialsProvider(new BasicAWSCredentials(A, B));
final int numThreads = 7;
final int numQueues = 7;
final int numIterations = 100;
final long sleepMs = 10000;
AmazonSQSClient sqs = new AmazonSQSClient(creds);
List<String> queueUrls = new ArrayList<>();
for (int i=0; i<numQueues; i++) {
queueUrls.add(sqs.getQueueUrl("testThreading-" + i).getQueueUrl());
}
Queue<Long> resultQueue = new ConcurrentLinkedQueue<>();
sqs.addRequestHandler(new MyRequestHandler(resultQueue));
runIterations(sqs, queueUrls, numThreads, numIterations, sleepMs);
System.out.println("Average: " + resultQueue.stream().mapToLong(Long::longValue).average().getAsDouble());
System.exit(0);
}
private static void runIterations(AmazonSQS sqs, List<String> queueUrls, int threadPoolSize, int numIterations, long sleepMs) throws Exception {
ExecutorService executor = Executors.newFixedThreadPool(threadPoolSize);
List<Future<?>> futures = new ArrayList<>();
for (int i=0; i<numIterations; i++) {
for (String queueUrl : queueUrls) {
final String message = String.valueOf(i);
futures.add(executor.submit(() -> sendMessage(sqs, queueUrl, message)));
}
Thread.sleep(sleepMs);
}
for (Future<?> f : futures) {
f.get();
}
}
private static void sendMessage(AmazonSQS sqs, String queueUrl, String messageBody) {
final SendMessageRequest request = new SendMessageRequest()
.withQueueUrl(queueUrl)
.withMessageBody(messageBody);
sqs.sendMessage(request);
}
// Use RequestHandler2 to get accurate timing metrics
private static class MyRequestHandler extends RequestHandler2 {
private final Queue<Long> resultQueue;
public MyRequestHandler(Queue<Long> resultQueue) {
this.resultQueue = resultQueue;
}
public void afterResponse(Request<?> request, Response<?> response) {
TimingInfo timingInfo = request.getAWSRequestMetrics().getTimingInfo();
Long start = timingInfo.getStartEpochTimeMilliIfKnown();
Long end = timingInfo.getEndEpochTimeMilliIfKnown();
if (start != null && end != null) {
long elapsed = end-start;
resultQueue.add(elapsed);
}
}
}
I'm sure this is some weird client configuration issue, but the default ClientConfiguration should be able to handle 50 concurrent connections.
Any suggestions?
Update: It looks like the key to this problem is something I left out of the original simplified version - there is a delay between batches of messages being sent (related to processing work being done). The latency issue isn't there when the delay is ~2s, but it is an issue when the delay between batches is ~10s. I've tried different values for ClientConfiguration.validateAfterInactivityMillis with no effect.
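For experimenting with the connection-pool angle, here is a hedged sketch of passing an explicit ClientConfiguration (com.amazonaws.ClientConfiguration) to the client; the values are guesses to experiment with, not known-good settings:
private static AmazonSQSClient buildClient(AWSCredentialsProvider creds) {
    // Sketch only: control how long pooled HTTP connections may sit idle before
    // being discarded or re-validated, so they can survive the ~10s gap between
    // batches. The numbers below are illustrative, not recommendations.
    ClientConfiguration clientConfig = new ClientConfiguration()
            .withMaxConnections(50)                    // 50 is already the SDK default
            .withConnectionMaxIdleMillis(60000)        // keep idle connections up to 60s
            .withValidateAfterInactivityMillis(2000);  // re-check connections idle > 2s
    return new AmazonSQSClient(creds, clientConfig);
}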
I am trying to write a Spark Streaming application in Java. My Spark application reads a continuous feed from a Hadoop
directory using textFileStream() at an interval of 1 minute.
I need to perform a Spark aggregation (group by) operation on the incoming DStream. After aggregation, I join the aggregated DStream<Key, Value1>
with an RDD<Key, Value2> created from a static dataset read via textFile() from the Hadoop directory.
The problem comes when I enable checkpointing. With an empty checkpoint directory, it runs fine. After running 2-3 batches I close it using Ctrl+C and run it again.
On the second run it immediately throws the Spark exception "SPARK-5063":
Exception in thread "main" org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063
Following is the block of code of the Spark application:
private void compute(JavaSparkContext sc, JavaStreamingContext ssc) {
JavaRDD<String> distFile = sc.textFile(MasterFile);
JavaDStream<String> file = ssc.textFileStream(inputDir);
// Read Master file
JavaRDD<MasterParseLog> masterLogLines = distFile.flatMap(EXTRACT_MASTER_LOGLINES);
final JavaPairRDD<String, String> masterRDD = masterLogLines.mapToPair(MASTER_KEY_VALUE_MAPPER);
// Continuous Streaming file
JavaDStream<ParseLog> logLines = file.flatMap(EXTRACT_CKT_LOGLINES);
// calculate the sum of required field and generate group sum RDD
JavaPairDStream<String, Summary> sumRDD = logLines.mapToPair(CKT_GRP_MAPPER);
JavaPairDStream<String, Summary> grpSumRDD = sumRDD.reduceByKey(CKT_GRP_SUM);
//GROUP BY Operation
JavaPairDStream<String, Summary> grpAvgRDD = grpSumRDD.mapToPair(CKT_GRP_AVG);
// Join Master RDD with the DStream //This is the block causing error (without it code is working fine)
JavaPairDStream<String, Tuple2<String, String>> joinedStream = grpAvgRDD.transformToPair(
new Function2<JavaPairRDD<String, String>, Time, JavaPairRDD<String, Tuple2<String, String>>>() {
private static final long serialVersionUID = 1L;
public JavaPairRDD<String, Tuple2<String, String>> call(
JavaPairRDD<String, String> rdd, Time v2) throws Exception {
return masterRDD.join(rdd);
}
}
);
joinedStream.print(10);
}
public static void main(String[] args) {
JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
public JavaStreamingContext create() {
// Create the context with a 60 second batch size
SparkConf sparkConf = new SparkConf();
final JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaStreamingContext ssc1 = new JavaStreamingContext(sc, Durations.seconds(duration));
app.compute(sc, ssc1);
ssc1.checkpoint(checkPointDir);
return ssc1;
}
};
JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkPointDir, contextFactory);
// start the streaming server
ssc.start();
logger.info("Streaming server started...");
// wait for the computations to finish
ssc.awaitTermination();
logger.info("Streaming server stopped...");
}
I know that the block of code which joins the static dataset with the DStream is causing the error, but it is taken from the Spark Streaming
page of the Apache Spark website (sub-heading "stream-dataset join" under "Join Operations"). Please help me get it working, even if
there is a different way of doing it. I need to enable checkpointing in my streaming application.
Environment Details:
CentOS 6.5: 2-node cluster
Java: 1.8
Spark: 1.4.1
Hadoop: 2.7.1
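One possible direction, sketched here and not verified against this exact setup: since the function passed to transformToPair runs on the driver for every batch, the static dataset can be rebuilt inside it from the streaming RDD's own context rather than captured as the masterRDD created in the enclosing method outside the checkpointed lineage. Re-reading the master file every batch is wasteful, so in practice a cached or lazily initialized holder would be worth adding; MasterFile, EXTRACT_MASTER_LOGLINES and MASTER_KEY_VALUE_MAPPER below are simply reused from the code above.
// Sketch (untested): build the static RDD inside the transform function from the
// streaming RDD's own context instead of capturing a driver-side RDD in the closure.
JavaPairDStream<String, Tuple2<String, Summary>> joinedStream = grpAvgRDD.transformToPair(
    new Function2<JavaPairRDD<String, Summary>, Time, JavaPairRDD<String, Tuple2<String, Summary>>>() {
        private static final long serialVersionUID = 1L;
        public JavaPairRDD<String, Tuple2<String, Summary>> call(
                JavaPairRDD<String, Summary> rdd, Time time) throws Exception {
            // Rebuild the master RDD from the current (possibly checkpoint-recovered) context.
            JavaSparkContext currentSc = JavaSparkContext.fromSparkContext(rdd.context());
            JavaPairRDD<String, String> master = currentSc.textFile(MasterFile)
                    .flatMap(EXTRACT_MASTER_LOGLINES)
                    .mapToPair(MASTER_KEY_VALUE_MAPPER);
            return master.join(rdd);
        }
    });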
I have multiple resources - for the sake of understanding, say 3 resources, namely XResource, YResource and ZResource (Java classes - Runnables) - which are able to perform a certain Task. There is a list of Tasks which needs to be done in parallel among the 3 resources. I need the resources to be locked, and if one of the resources is locked then the task should go to some other resource; if none of the resources are available, it should wait until one becomes available. I am currently trying to lock a resource using a Semaphore, but the thread gets assigned to one Runnable only and the other Runnables are always idle. I am very new to multithreading so I might be overlooking something obvious. I am using Java SE 1.6.
Below is my code:
public class Test {
private final static Semaphore xResourceSphore = new Semaphore(1, true);
private final static Semaphore yResourceSphore = new Semaphore(1, true);
private final static Semaphore zResourceSphore = new Semaphore(1, true);
public static void main(String[] args) {
ArrayList<Task> listOfTasks = new ArrayList<Task>();
Task task1 = new Task();
Task task2 = new Task();
Task task3 = new Task();
Task task4 = new Task();
Task task5 = new Task();
Task task6 = new Task();
Task task7 = new Task();
Task task8 = new Task();
Task task9 = new Task();
listOfTasks.add(task1);
listOfTasks.add(task2);
listOfTasks.add(task3);
listOfTasks.add(task4);
listOfTasks.add(task5);
listOfTasks.add(task6);
listOfTasks.add(task7);
listOfTasks.add(task8);
listOfTasks.add(task9);
//Runnables
XResource xThread = new XResource();
YResource yThread = new YResource();
ZResource zThread = new ZResource();
ExecutorService executorService = Executors.newFixedThreadPool(3);
for (int i = 0; i < listOfTasks.size(); i++) {
if (xResourceSphore.tryAcquire()) {
try {
xThread.setTask(listOfTasks.get(i));
executorService.execute(xThread );
} finally {
xResourceSphore.release();
}
}else if (yResourceSphore.tryAcquire()) {
try {
yThread.setTask(listOfTasks.get(i));
executorService.execute(yThread );
} finally {
yResourceSphore.release();
}
}else if (zResourceSphore.tryAcquire()) {
try {
zThread.setTask(listOfTasks.get(i));
executorService.execute(zThread );
} finally {
zResourceSphore.release();
}
}
}
executorService.shutdown();
}
}
You need to move the resource-locking logic into the task which is run in another thread.
By doing the locking in the current thread, you are not waiting for the task to be performed before releasing the resource. The reason you are seeing the problem you describe is that you are not waiting for the task to complete (or even start) before calling setTask() on the same resource, which replaces the previously set task.
Queue<Resource> resources = new ConcurrentLinkedQueue<>();
resources.add(new XResource());
resources.add(new YResource());
resources.add(new ZResource());
ExecutorService service = Executors.newFixedThreadPool(resources.size());
// Each pool thread lazily takes its own Resource from the queue the first time it is used.
ThreadLocal<Resource> resourceToUse = ThreadLocal.withInitial(() -> resources.remove());
for (int i = 1; i < 9; i++) {
service.execute(() -> {
Task task = new Task();
resourceToUse.get().setTask(task);
});
}
Following Peter Lawrey's suggestion, I passed the Semaphore into the Runnable and released it after execution finished. However, I still faced the issue that I was unable to allocate all the tasks to the threads within the for loop. So I made a while(true) loop that spins until one of the resources is available for a task. Below is the code:
ExecutorService executorService = Executors.newFixedThreadPool(3);
for (int i = 0; i < listOfTasks.size(); i++) {
while(true){
if (xResourceSphore.tryAcquire()) {
xThread.setTask(listOfTasks.get(i));
xThread.setSemaphore(xResourceSphore);
executorService.execute(xThread );
break;
}else if (yResourceSphore.tryAcquire()) {
yThread.setTask(listOfTasks.get(i));
yThread.setSemaphore(yResourceSphore);
executorService.execute(yThread );
break;
}else if (zResourceSphore.tryAcquire()) {
zThread.setTask(listOfTasks.get(i));
zThread.setSemaphore(zResourceSphore);
executorService.execute(zThread );
break;
}
}
}
executorService.shutdown();
I don't like this solution much because it cannot be extended if my resources perform different types of tasks: if I need a particular resource for a particular kind of task, my other tasks would keep waiting until that particular task gets done. But for now I couldn't find any other way, even after a lot of research!
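For what it's worth, a common way to express "take whichever resource is free, or wait" without spinning is to keep the resources in a BlockingQueue and have each worker take one, use it, and put it back. The sketch below is hypothetical: it assumes a Resource interface with a perform(Task) method, which does not exist in the original code, and is written in Java 6 style to match the question's environment.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class ResourcePoolExample {

    // Hypothetical abstraction, not from the original code: a resource that can perform a task.
    interface Resource {
        void perform(Task task);
    }

    static class Task { /* task payload omitted */ }

    static class XResource implements Resource {
        public void perform(Task task) { /* X-specific work */ }
    }
    static class YResource implements Resource {
        public void perform(Task task) { /* Y-specific work */ }
    }
    static class ZResource implements Resource {
        public void perform(Task task) { /* Z-specific work */ }
    }

    public static void main(String[] args) {
        // Pool of available resources; take() blocks until one is free, so no busy-waiting.
        final BlockingQueue<Resource> pool = new LinkedBlockingQueue<Resource>();
        pool.add(new XResource());
        pool.add(new YResource());
        pool.add(new ZResource());

        ExecutorService executor = Executors.newFixedThreadPool(3);
        for (int i = 0; i < 9; i++) {
            final Task task = new Task();
            executor.execute(new Runnable() {
                public void run() {
                    try {
                        Resource resource = pool.take();   // wait for a free resource
                        try {
                            resource.perform(task);        // use it exclusively
                        } finally {
                            pool.put(resource);            // return it for the next task
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        executor.shutdown();
    }
}
If different task types need different resource types, the same idea extends to one queue per resource type (or a map of queues keyed by task type), so only tasks competing for the same kind of resource wait on each other.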