As a continuation of this question, could you please tell me which properties I can change with SparkContext.setLocalProperty? Could I change cores, RAM, etc.?
According to the documentation, localProperties is a protected[spark] property of SparkContext that holds the properties through which you can create logical job groups. They are also inheritable thread-local variables, which are used in preference to ordinary thread-local variables when the per-thread attribute being maintained must be automatically transmitted to any child threads that are created. Propagation of local properties to workers starts when SparkContext is requested to run or submit a Spark job, which in turn passes them along to DAGScheduler.
In general, local properties are used to group jobs into pools in the FAIR job scheduler through the spark.scheduler.pool per-thread property, and in the method SQLExecution.withNewExecutionId to set spark.sql.execution.id.
I have no experience assigning thread-local properties in a standalone Spark cluster. It is worth trying and checking.
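The inheritance behaviour described above can be illustrated in plain Java, without Spark. This is only a sketch of how an InheritableThreadLocal behaves; the class name, the POOL variable and the value "production" are made up for the illustration and are not part of Spark's API.

```java
class InheritableLocalDemo {
    // Hypothetical per-thread setting, standing in for a Spark local property.
    private static final InheritableThreadLocal<String> POOL = new InheritableThreadLocal<>();

    // Sets the value in a parent thread and reports what a child started from it reads.
    static String valueSeenByChild(String parentValue) {
        String[] seen = new String[1];
        Thread parent = new Thread(() -> {
            POOL.set(parentValue);
            // The child copies the parent's inheritable values when it is created.
            Thread child = new Thread(() -> seen[0] = POOL.get());
            child.start();
            join(child);
        });
        parent.start();
        join(parent);
        return seen[0];
    }

    // Reports what a thread reads when its creator never set the value.
    static String valueSeenByUnrelatedThread() {
        String[] seen = new String[1];
        Thread other = new Thread(() -> seen[0] = POOL.get());
        other.start();
        join(other);
        return seen[0];
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("child sees: " + valueSeenByChild("production"));
        System.out.println("unrelated thread sees: " + valueSeenByUnrelatedThread());
    }
}
```

The child thread reads the value set by its parent, while a thread created elsewhere does not, which matches the per-thread confinement described above.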
I did some testing with the property spark.executor.memory (the available properties are here). On a very simple local Spark, two threads started with different settings each keep their own value. I used the code at the end of this post (probably not code you would deploy into production), interleaving the threads to make sure the result is not down to sheer scheduling luck, and obtained the following output (after cleaning the Spark output from my console):
Thread 1 Before sleeping mem: 512
Thread 2 Before sleeping mem: 1024
Thread 1 After sleeping mem: 512
Thread 2 After sleeping mem: 1024
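It is pretty neat to observe that a property declared in a thread stays inside that thread. Note that the test only reads the value back with getLocalProperty; it does not show that the executor memory itself changed. I am pretty sure this can easily lead to nonsensical situations, so I'd still recommend caution before applying such techniques.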
public class App {
private static JavaSparkContext sc;
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local")
.setAppName("Testing App");
sc = new JavaSparkContext(conf);
SparkThread Thread1 = new SparkThread(1);
SparkThread Thread2 = new SparkThread(2);
ExecutorService executor = Executors.newFixedThreadPool(2);
Future ThreadCompletion1 = executor.submit(Thread1);
try {
Thread.sleep(5000);
} catch (InterruptedException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
Future ThreadCompletion2 = executor.submit(Thread2);
try {
ThreadCompletion1.get();
ThreadCompletion2.get();
} catch (InterruptedException | ExecutionException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private static class SparkThread implements Runnable{
private int i = 1;
public SparkThread(int i) {
this.i = i;
}
@Override
public void run() {
int mem = 512;
sc.setLocalProperty("spark.executor.memory", Integer.toString(mem * i));
JavaRDD<String> input = sc.textFile("test" + i);
FlatMapFunction<String, String> tt = s -> Arrays.asList(s.split(" "))
.iterator();
JavaRDD<String> words = input.flatMap(tt);
System.out.println("Thread " + i + " Before sleeping mem: " + sc.getLocalProperty("spark.executor.memory"));
try {
Thread.sleep(7000);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
//do some work
JavaPairRDD<String, Integer> counts = words.mapToPair(t -> new Tuple2(t, 1))
.reduceByKey((x, y) -> (int) x + (int) y);
counts.saveAsTextFile("output" + i);
System.out.println("Thread " + i + " After sleeping mem: " + sc.getLocalProperty("spark.executor.memory"));
}
}
}
LocalProperties provide an easy mechanism to pass (user-defined) configurations from the driver to the executors. You can use the TaskContext on the executor to access them. An example of this is the SQL execution ID.
I'm writing a console application to read json files and then do some processing with them. I have 200k json files to process, so I'm creating a thread per file. But I would like to have only 30 active threads running. I don't know how to control it in Java.
This is the piece of code I have so far:
for (String jsonFile : result) {
final String jsonFilePath = jsonFile;
Thread thread = new Thread(new Runnable() {
String filePath = jsonFilePath;
@Override
public void run() {
// Do stuff here
}
});
thread.start();
}
result is an array with the paths of the 200k files. From this point, I'm not sure how to control it. I thought about keeping a List<Thread> and having each thread notify when it finishes so it can be removed from the list. But then I would have to make the main thread sleep and then wake up, which feels weird.
How can I achieve this?
I would suggest not creating one thread per file. Threads are a limited resource. Creating too many can lead to starvation or even cause the program to abort.
From the information provided, I would use a ThreadPoolExecutor. Constructing such an executor with a limited number of threads is quite simple thanks to Executors::newFixedThreadPool:
ExecutorService service = Executors.newFixedThreadPool(30);
Looking at the ExecutorService interface, the method <T> Future<T> submit(Callable<T> task) might be a good fit.
For this, some changes will be necessary. The tasks (i.e. what is currently a Runnable in the given implementation) must be converted to a Callable<T>, where T should be substituted with the return type. The returned Future<T> objects should then be collected into a list and waited on. When all Futures have completed, the result list can be constructed, e.g. through streaming.
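The steps above could be sketched as follows. This is a minimal illustration, not a drop-in solution: the class name and the process method are hypothetical stand-ins for the real JSON handling, and the result type Integer is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedFileProcessor {
    // Hypothetical stand-in for the real JSON processing; returns the path length.
    static int process(String path) {
        return path.length();
    }

    // Processes every path on at most poolSize threads; results keep the input order.
    static List<Integer> processAll(List<String> paths, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String path : paths) {
                Callable<Integer> task = () -> process(path);
                futures.add(pool.submit(task)); // queued until a pool thread is free
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("a.json", "bb.json", "ccc.json"), 30));
    }
}
```

Only 30 threads ever exist here; the remaining tasks wait in the executor's internal queue, so the main thread does not need any sleep/wake-up logic.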
With parallel streams and a ForkJoinPool you may get more straightforward code, plus an easy way to collect the results of your files after processing. For parallel processing, I use threads directly only as a last resort, when parallelStream can't be used.
boolean doStuff( String file){
// do your magic here
System.out.println( "The file " + file + " has been processed." );
// return the status of the processed file
return true;
}
List<String> jsonFiles = new ArrayList<String>();
jsonFiles.add("file1");
jsonFiles.add("file2");
jsonFiles.add("file3");
...
jsonFiles.add("file200000");
ForkJoinPool forkJoinPool = null;
try {
final int parallelism = 30;
forkJoinPool = new ForkJoinPool(parallelism);
forkJoinPool.submit(() ->
jsonFiles.parallelStream()
.map( jsonFile -> doStuff( jsonFile) )
.collect(Collectors.toList()) // you can collect this to a List<Boolean> results
).get();
} catch (InterruptedException | ExecutionException e) {
e.printStackTrace();
} finally {
if (forkJoinPool != null) {
forkJoinPool.shutdown();
}
}
Put your jobs (filenames) into a queue, start 30 threads to process them, then wait until all threads are done. For example:
static ConcurrentLinkedDeque<String> jobQueue = new ConcurrentLinkedDeque<String>();
private static class Worker implements Runnable {
int threadNumber;
public Worker(int threadNumber) {
this.threadNumber = threadNumber;
}
public void run() {
try {
System.out.println("Thread " + threadNumber + " started");
while (true) {
// get the next filename from job queue
String fileName;
try {
fileName = jobQueue.pop();
} catch (NoSuchElementException e) {
// The queue is empty, exit the loop
break;
}
System.out.println("Thread " + threadNumber + " processing file " + fileName);
Thread.sleep(1000); // do something useful here
System.out.println("Thread " + threadNumber + " finished file " + fileName);
}
System.out.println("Thread " + threadNumber + " finished");
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
public static void main(String[] args) throws InterruptedException {
// Create dummy filenames for testing:
for (int i = 1; i <= 200; i++) {
jobQueue.push("Testfile" + i + ".json");
}
System.out.println("Starting threads");
// Create 30 worker threads
List<Thread> workerThreads = new ArrayList<Thread>();
for (int i = 1; i <= 30; i++) {
Thread thread = new Thread(new Worker(i));
workerThreads.add(thread);
thread.start();
}
// Wait until the threads are all finished
for (Thread thread : workerThreads) {
thread.join();
}
System.out.println("Finished");
}
}
I have two threads running in parallel in a Java program, as below:
// Threading
new Thread(new Runnable() {
@Override
public void run() {
try {
gpTableCount = getGpTableCount();
} catch (SQLException e) {
e.printStackTrace();
} catch(Exception e) {
e.printStackTrace();
}
}
}).start();
new Thread(new Runnable() {
@Override
public void run() {
try {
hiveTableCount = getHiveTableCount();
} catch (SQLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}).start();
while(!(gpTableCount != null && gpTableCount.size() > 0 && hiveTableCount != null && hiveTableCount.size() > 0)) {
Thread.sleep(5000);
}
// Threading
Both methods have the same functionality. Below is the code from getHiveTableCount(). The other method differs slightly (by a line or two), but the functionality remains the same.
public Map<String, String> getHiveTableCount() throws IOException, SQLException {
hiveDataMap = new HashMap<String, String>();
hiveTableErrs = new HashMap<String, String>();
Iterator<String> hiveIterator = filteredList.iterator();
Connection hiveConnection = DbManager.getHiveConnection();
PreparedStatement hive_pstmnt = null;
String hiveExcpnMsg;
String ssn;
String hiveMaxUpdTms;
Long hiveCount;
String gpHiveRec;
String[] hiveArray;
String[] hiveDetails;
String hiveQuery;
while(hiveIterator.hasNext()) {
gpHiveRec = hiveIterator.next();
hiveArray = gpHiveRec.split(",");
hiveDetails = hiveArray[1].split("\\.");
hiveQuery = "select '" + hiveDetails[1] + "' as TableName, count(*) as Count, source_system_name, max(xx_last_update_tms) from " + hiveArray[1] + " where source_system_name='" + hiveArray[2] + "' group by source_system_name";
try {
hive_pstmnt = hiveConnection.prepareStatement(hiveQuery);
ResultSet hiveCountRs = hive_pstmnt.executeQuery();
while(hiveCountRs.next()) {
hiveCount = hiveCountRs.getLong(2);
ssn = hiveCountRs.getString(3);
hiveMaxUpdTms = hiveCountRs.getTimestamp(4).toString();
hiveDataMap.put(hiveDetails[1] + "," + ssn, hiveCount + "," + hiveMaxUpdTms);
}
} catch(org.postgresql.util.PSQLException e) {
hiveExcpnMsg = e.getMessage();
hiveTableErrs.put(hiveDetails[1] + ": for the SSN: " + hiveArray[2], hiveExcpnMsg + "\n");
} catch(SQLException e) {
hiveExcpnMsg = e.getMessage();
hiveTableErrs.put(hiveDetails[1] + ": for the SSN: " + hiveArray[2], hiveExcpnMsg + "\n");
} catch(Exception e) {
hiveExcpnMsg = e.getMessage();
hiveTableErrs.put(hiveDetails[1] + ": for the SSN: " + hiveArray[2], hiveExcpnMsg + "\n");
}
}
return hiveDataMap;
}
These two threads run concurrently. I recently read online that:
Future class represents a future result of an asynchronous computation
– a result that will eventually appear in the Future after the
processing is complete.
I understood the concept theoretically, but I don't know how to apply the java.util.concurrent.Future API to the code above instead of creating threads explicitly.
Could anyone let me know how I can run the methods getGpTableCount() and getHiveTableCount() concurrently using the java.util.concurrent.Future API instead of explicitly creating new threads with new Thread(new Runnable() ...)?
You are submitting your tasks using the Runnable interface, which doesn't allow your threads to return a value at the end of the computation (and forces you to use shared variables: gpTableCount and hiveTableCount).
The Callable interface is a later addition that allows your tasks to return a value (in your case, Map<String, String>).
As an alternative to working with threads directly, the Concurrency API introduces the ExecutorService as a higher-level object that manages thread pools and is able to execute tasks asynchronously.
When submitting a task of type Callable to an ExecutorService, you expect the task to produce a value, but since the submission point and the end of the computation aren't coupled, the ExecutorService returns a Future, which allows you to get this value and blocks if the value isn't available yet. Hence, a Future can be used to synchronize your different threads.
As an alternative to ExecutorService you can also take a look at FutureTask<V>, which is an implementation of RunnableFuture<V>:
This class provides a base implementation of Future, with methods to start and cancel a computation, query to see if the computation is complete, and retrieve the result of the computation
A FutureTask can be used to wrap a Callable or Runnable object.
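A FutureTask wrapping a Callable might look like the sketch below. The countTables method is a hypothetical stand-in for getHiveTableCount() and returns made-up data; only the FutureTask usage itself is the point.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

class FutureTaskDemo {
    // Hypothetical stand-in for getHiveTableCount(); returns a fixed map.
    static Map<String, String> countTables() {
        Map<String, String> counts = new HashMap<>();
        counts.put("orders,SYS1", "42");
        return counts;
    }

    static Map<String, String> runWithFutureTask() {
        Callable<Map<String, String>> callable = FutureTaskDemo::countTables;
        // FutureTask is both a Runnable (so a Thread can run it) and a Future.
        FutureTask<Map<String, String>> task = new FutureTask<>(callable);
        new Thread(task).start();
        try {
            return task.get(); // blocks until countTables() has returned
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("computation failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithFutureTask());
    }
}
```

Compared with the shared-variable version, the result travels through the Future, so no polling loop with Thread.sleep is needed.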
If you are using Java 8+, you can use CompletableFuture.supplyAsync for that, in short like this:
import static java.util.concurrent.CompletableFuture.supplyAsync;
.....
Future<Map<String, String>> f= supplyAsync(()->{
try{
return getHiveTableCount();
} catch(Exception e) {
throw new RuntimeException(e);
}
});
By default, CompletableFuture.supplyAsync runs the supplier on ForkJoinPool.commonPool(). It also has another overload that takes an Executor as a parameter, if you want to use your own:
public class CompletableFuture<T>
extends Object
implements Future<T>, CompletionStage<T>
and it has:
public static <U> CompletableFuture<U> supplyAsync(Supplier<U> supplier)
public static <U> CompletableFuture<U> supplyAsync(Supplier<U> supplier,
Executor executor)
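As a rough sketch of the two-argument overload, the example below runs two suppliers on a caller-provided pool and combines their results. The gpCount and hiveCount methods and their return values are hypothetical placeholders for the real table-count queries.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SupplyAsyncDemo {
    // Hypothetical stand-ins for the two table-count queries.
    static int gpCount() {
        return 10;
    }

    static int hiveCount() {
        return 32;
    }

    static int totalCount() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Both suppliers run on our own pool instead of ForkJoinPool.commonPool().
            CompletableFuture<Integer> gp = CompletableFuture.supplyAsync(SupplyAsyncDemo::gpCount, pool);
            CompletableFuture<Integer> hive = CompletableFuture.supplyAsync(SupplyAsyncDemo::hiveCount, pool);
            // thenCombine fires once both results are available; join waits for it.
            return gp.thenCombine(hive, Integer::sum).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(totalCount());
    }
}
```

Passing your own Executor keeps long-running queries from occupying the common pool, which other parts of the application may share.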
First, create the executor service that best suits your needs, for example:
ExecutorService ex = Executors.newFixedThreadPool(2);
(more on executors: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Executors.html)
Instead of a Runnable object, use a Callable, which is similar to Runnable but returns a value (more on Callable: https://docs.oracle.com/javase/8/docs/api/index.html?java/util/concurrent/Callable.html):
Callable<Map<String, String>> callable1 = // your Callable class
The type parameter should be the same as the type you would like to return as a result.
Next, create a list of your tasks:
List<Callable<Map<String, String>>> tasks = new LinkedList<>();
tasks.add(callable1);
tasks.add(callable2);
and execute them:
List<Future<Map<String, String>>> results = ex.invokeAll(tasks);
The method above returns when all tasks have completed (if I understand your case correctly, this is what you would like to achieve). However, a completed task could have terminated either normally or by throwing an exception.
At the end, close the executor service:
ex.shutdown();
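Put together, and with the "terminated by throwing an exception" case handled, the steps might look like this sketch. The two tasks are invented for the illustration: one returns a small map, the other always fails.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class InvokeAllDemo {
    static List<String> runBoth() {
        ExecutorService ex = Executors.newFixedThreadPool(2);
        try {
            List<Callable<Map<String, String>>> tasks = new ArrayList<>();
            // Hypothetical tasks: the first succeeds, the second always throws.
            tasks.add(() -> Collections.singletonMap("source", "gp"));
            tasks.add(() -> {
                throw new IllegalStateException("hive unavailable");
            });

            List<String> outcomes = new ArrayList<>();
            // invokeAll blocks until every task is done and keeps the task order.
            for (Future<Map<String, String>> future : ex.invokeAll(tasks)) {
                try {
                    outcomes.add("ok: " + future.get());
                } catch (ExecutionException e) {
                    // A task that threw surfaces here, wrapped in ExecutionException.
                    outcomes.add("failed: " + e.getCause().getMessage());
                }
            }
            return outcomes;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            ex.shutdown();
        }
    }

    public static void main(String[] args) {
        runBoth().forEach(System.out::println);
    }
}
```

Checking each Future individually lets one failed query be reported without losing the result of the other.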
I'm hoping some concurrency experts can advise as I'm not looking to rewrite something that likely exists.
Picture the problem: I have a web connection that comes calling, looking for its unique computed result (with a key that it provides in order to retrieve the result). However, the result may not have been computed YET, so I would like the connection to wait (block) for UP TO n seconds before giving up and telling the caller that I don't (yet) have the result (the time needed to compute the value is non-deterministic). Something like:
String getValue (String key)
{
String value = [MISSING_PIECE_OF_PUZZLE].getValueOrTimeout(key, 10, TimeUnit.SECONDS)
if (value == null)
return "Not computed within 10 Seconds";
else
return "Value was computed and was " + value;
}
and then have another thread (one of the computation threads) that is doing the calculations, something like:
public void writeValues()
{
....
[MISSING_PIECE_OF_PUZZLE].put(key, computedValue)
}
In this scenario, there are a number of threads working in the background to compute the values that will ultimately be picked up by web connections. The web connections have NO control or authority over what is computed and when the computations execute. As I've said, this is done in a pool in the background, but these threads can publish when a computation has completed (how they do so is the gist of this question). The published message may be consumed or not, depending on whether any subscribers are interested in this computed value.
As these are web connections that will be blocking, I could potentially have thousands of concurrent connections waiting (subscribing) for their specific computed value, so such a solution needs to be very light on blocking resources. The closest I've come is this SO question, which I will explore further, but I wanted to check that I'm not missing something blindingly obvious before writing this myself.
I think you should use a Future. It gives you the ability to compute data in a separate thread and block for the requested time period while waiting for an answer. Notice how it throws an exception if more than 3 seconds pass:
public class MyClass {
// Simulates heavy work that takes 10 seconds
private static int getValueOrTimeout() throws InterruptedException {
TimeUnit.SECONDS.sleep(10);
return 123;
}
public static void main(String... args) throws InterruptedException, ExecutionException {
Callable<Integer> task = () -> {
Integer val = null;
try {
val = getValueOrTimeout();
} catch (InterruptedException e) {
throw new IllegalStateException("task interrupted", e);
}
return val;
};
ExecutorService executor = Executors.newFixedThreadPool(1);
Future<Integer> future = executor.submit(task);
System.out.println("future done? " + future.isDone());
try {
Integer result = future.get(3, TimeUnit.SECONDS);
System.out.print("Value was computed and was : " + result);
} catch (TimeoutException ex) {
System.out.println("Not computed within 3 seconds");
}
}
}
After looking at the changes in your question, I want to suggest a different approach using a BlockingQueue. In this case the producer logic is completely separated from the consumer, so you could do something like this:
public class MyClass {
private static BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
private static Map<String, String> dataComputed = new ConcurrentHashMap<>();
public static void writeValues(String key) {
Random r = new Random();
try {
// Simulate working for long time
TimeUnit.SECONDS.sleep(r.nextInt(11));
String value = "Hello there fdfsd" + Math.random();
queue.offer(value);
dataComputed.putIfAbsent(key, value);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
private static String getValueOrTimeout(String key) throws InterruptedException {
String result = dataComputed.get(key);
if (result == null) {
result = queue.poll(10, TimeUnit.SECONDS);
}
return result;
}
public static void main(String... args) throws InterruptedException, ExecutionException {
String key = "TheKey";
Thread producer = new Thread(() -> {
writeValues(key);
});
Thread consumer = new Thread(() -> {
try {
String message = getValueOrTimeout(key);
if (message == null) {
System.out.println("No message in 10 seconds");
} else {
System.out.println("The message:" + message);
}
} catch (InterruptedException e) {
e.printStackTrace();
}
});
consumer.start();
producer.start();
}
}
With that said, I have to agree with @earned that making the client thread wait is not a good approach. Instead, I would suggest using a WebSocket, which gives you the ability to push data to the client when it is ready. You can find lots of tutorials on WebSocket; here is one, for example: ws tutorial
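If a blocking lookup per key is still wanted, one possible sketch of the missing piece from the question is a map from key to CompletableFuture. This is a different technique from the BlockingQueue code above, not a refinement of it; the class name ResultRegistry is hypothetical, and the method names simply mirror the pseudocode in the question.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ResultRegistry {
    private final ConcurrentMap<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Readers and writers share one future per key, whoever asks first.
    private CompletableFuture<String> slot(String key) {
        return pending.computeIfAbsent(key, k -> new CompletableFuture<>());
    }

    // Called by a computation thread when the value for the key is ready.
    void put(String key, String value) {
        slot(key).complete(value);
    }

    // Waits up to the timeout; returns null if the value is not ready in time.
    String getValueOrTimeout(String key, long timeout, TimeUnit unit) {
        try {
            return slot(key).get(timeout, unit);
        } catch (TimeoutException e) {
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        ResultRegistry registry = new ResultRegistry();
        // Simulated computation thread that publishes after one second.
        new Thread(() -> {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                return;
            }
            registry.put("job-1", "done");
        }).start();
        System.out.println(registry.getValueOrTimeout("job-1", 10, TimeUnit.SECONDS));
    }
}
```

This sketch never removes entries from the map, so a real service would need some eviction policy, and each waiting caller still occupies a thread for the duration of the wait.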
My Android application implements data protection and works with the cloud.
The application consists of a UI and a standalone service (running in its own process).
I'm using IPC (Messages & Handlers) to communicate between the UI and the service.
I have the following situation: before doing any work with the data, I need to know the data size and the number of data items (I have to enumerate contacts, photos, etc. and collect total information for progress reporting).
The problem:
When enumeration starts on the service side (it uses 4 running threads in a thread pool), my UI freezes for several seconds (depending on the total data size).
Does anybody know a way to keep the UI responsive, without freezing at this moment?
Update:
Here is my ThreadPoolExecutor wrapper that I am using in the service to execute the estimate tasks (created like new ThreadPoolWorker(4,4,10)):
public class ThreadPoolWorker {
private Object threadPoolLock = new Object();
private ThreadPoolExecutor threadPool = null;
private ArrayBlockingQueue<Runnable> queue = null;
private List<Future<?>> futures = null;
public ThreadPoolWorker(int poolSize, int maxPoolSize, int keepAliveTime){
queue = new ArrayBlockingQueue<Runnable>(5);
threadPool = new ThreadPoolExecutor(poolSize, maxPoolSize, keepAliveTime, TimeUnit.SECONDS, queue);
threadPool.prestartAllCoreThreads();
}
public void runTask(Runnable task){
try{
synchronized (threadPoolLock) {
if(futures == null){
futures = new ArrayList<Future<?>>();
}
futures.add(threadPool.submit(task));
}
}catch(Exception e){
log.error("runTask failed. " + e.getMessage() + " Stack: " + OperationsHelper.StringOperations.getStackToString(e.getStackTrace()));
}
}
public void shutDown()
{
synchronized (threadPoolLock) {
threadPool.shutdown();
}
}
public void joinAll() throws Exception{
synchronized (threadPoolLock) {
try {
if(futures == null || (futures != null && futures.size() <= 0)){
return;
}
for(Future<?> f : futures){
f.get();
}
} catch (ExecutionException e){
log.error("ExecutionException Error: " + e.getMessage() + " Stack: " + OperationsHelper.StringOperations.getStackToString(e.getStackTrace()));
throw e;
} catch (InterruptedException e) {
log.error("InterruptedException Error: " + e.getMessage() + " Stack: " + OperationsHelper.StringOperations.getStackToString(e.getStackTrace()));
throw e;
}
}
}
}
Here is the way I start enumeration tasks:
estimateExecutor.runTask(contactsEstimate);
I must say you did not provide enough information (the part of the code you suspect as the cause),
but from my knowledge and experience I can make an educated guess:
you are probably running code on the UI thread (main thread) whose execution takes a while. I can also guess that this code is querying the contacts / gallery provider for all the data.
In case you don't know: Service callback methods are also executed on the main thread (the UI thread) unless you explicitly run them from an AsyncTask or another thread. Querying content providers and processing the returned cursor can also be a heavy operation that needs to be executed on another thread so that it does not block the main UI thread.
After moving the code performing these expensive queries to another thread, there is no reason you should experience any freezing.
As in the question. My current code is overkill for the OS because it runs every wget process in a separate thread, which is fine, but I have almost 15k files to download, so I want to use a thread pool for this job. Unfortunately, I must use wget for the download process.
ExecutorService executor = Executors.newFixedThreadPool(5);
for(String filename: files) {
try {
String encodedFilename = URLEncoder.encode(filename, "UTF-8");
final String cmd = "wget --no-check-certificate -O " + filename +" " + BipDownloader.bipUrl + encodedFilename;
Runnable run = new Runnable()
{
public void run() {
try {
System.out.println(cmd);
Process process = Runtime.getRuntime().exec(cmd);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
};
executor.submit(run);
} catch(IOException e) {
System.err.println(e.getMessage());
}
}
EDIT
Updated the source code to use a thread pool, but my system is still unstable during the download.
Assuming that you do need to use wget, you can use an ExecutorService to handle a threadpool for you:
ExecutorService executor = Executors.newFixedThreadPool(100); //pool of 100 threads
...
Runnable r = new Runnable() {
public void run() {
try {
System.out.println(cmd);
Process process = Runtime.getRuntime().exec(cmd);
} catch (IOException e) {
e.printStackTrace();
}
}
};
executor.submit(r);
The optimal size of the pool depends on various factors and it is best to test several numbers. Something between 100 and 1000 should be ok.
If you need to monitor the progress of the executions, you can store the futures returned by executor.submit, or you can use an ExecutorCompletionService.
EDIT
As noted in the comments, exec is non-blocking, so in theory it is possible that all processes will be started before any of them has finished, even if the size of the pool is limited. To prevent that, you should wait in your run method until the process finishes:
Process process = Runtime.getRuntime().exec(cmd);
int exitVal = process.waitFor();
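A fuller sketch of this idea is shown below, assuming Java 9+ (for ProcessBuilder.Redirect.DISCARD). It uses ProcessBuilder rather than Runtime.exec, which is a substitution on my part, and the class and method names are hypothetical. Each pool thread starts one process and waits for it, so at most poolSize processes run at the same time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedProcessRunner {
    // Runs each command as an external process and returns the exit codes in input order.
    static List<Integer> runAll(List<List<String>> commands, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> command : commands) {
                Callable<Integer> task = () -> {
                    Process process = new ProcessBuilder(command)
                            .redirectErrorStream(true)
                            // Discard output so a full pipe buffer cannot stall the child.
                            .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                            .start();
                    // Blocking here keeps the pool thread busy until the process exits.
                    return process.waitFor();
                };
                futures.add(pool.submit(task));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                exitCodes.add(future.get());
            }
            return exitCodes;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("could not run a command", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Demo command: the JVM's own launcher, since wget may not be installed.
        String javaBin = System.getProperty("java.home") + "/bin/java";
        System.out.println(runAll(Arrays.asList(Arrays.asList(javaBin, "-version")), 5));
    }
}
```

For the download case, each command would be the wget invocation split into a list of arguments. Passing the arguments as a list also avoids problems with spaces in file names that a single concatenated command string can cause.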