I'm new to the Akka toolkit. I need to run a time-consuming process on multiple files, so I created one actor per file and started the processing. I'm creating these actors in a POJO class as follows:
public class ProcessFiles {
private static final Logger logger = LoggerFactory.getLogger(ProcessFiles.class.getSimpleName());
public static void main(String[] args) throws IOException, InterruptedException {
long startTime = System.currentTimeMillis();
logger.info("Creating actor system");
ActorSystem system = ActorSystem.create("actor_system");
Set<String> files = new HashSet<>();
Stream<String> stringStream = Files.lines(Paths.get(fileName));
stringStream.forEach(line -> files.addAll(Arrays.asList(line.split(","))));
List<CompletableFuture<Object>> futureList = new ArrayList<>();
files.forEach((String file) -> {
ActorRef actorRef = system.actorOf(Props.create(ProcessFile.class, file));
futureList.add(PatternsCS.ask(actorRef, file, DEFAULT_TIMEOUT).toCompletableFuture());
});
boolean isDone;
do {
Thread.sleep(30000);
isDone = true;
int count = 0;
for (CompletableFuture<Object> future : futureList) {
isDone = isDone & (future.isDone() || future.isCompletedExceptionally() || future.isCancelled());
if (future.isDone() || future.isCompletedExceptionally() || future.isCancelled()) {
++count;
}
}
logger.info("Process is completed for " + count + " files out of " + files.size() + " files.");
} while (!isDone);
logger.info("Process is done in " + (System.currentTimeMillis() - startTime) + " ms");
system.terminate();
}
}
Here, ProcessFile is the actor class. After starting all the actors, the main process checks every 30 seconds whether all of them have finished, so that the program can exit. Is there a better way to implement this kind of functionality?
I would suggest creating one more actor that keeps track of the termination of all the other actors and shuts down the actor system once all of them have stopped.
So in your application:
The ProcessFile actor can send a PoisonPill to itself after processing the file.
A WatcherActor will watch each ProcessFile actor (context.watch(processFileActor)) and maintain a count of all the registered ProcessFile actors.
When an actor terminates, the WatcherActor receives a Terminated message.
It then decrements the count and, when the count reaches 0, terminates the ActorSystem.
I'm writing a console application to read json files and then do some processing with them. I have 200k json files to process, so I'm creating a thread per file. But I would like to have only 30 active threads running. I don't know how to control it in Java.
This is the piece of code I have so far:
for (String jsonFile : result) {
final String jsonFilePath = jsonFile;
Thread thread = new Thread(new Runnable() {
String filePath = jsonFilePath;
@Override
public void run() {
// Do stuff here
}
});
thread.start();
}
result is an array with the paths of the 200k files. From this point, I'm not sure how to control it. I thought about keeping a List<Thread> and having each thread notify when it finishes so that it can be removed from the list. But then I would have to make the main thread sleep and wake up, which feels weird.
How can I achieve this?
I would suggest not creating one thread per file. Threads are a limited resource; creating too many can lead to starvation or even cause the program to abort.
From the information provided, I would use a ThreadPoolExecutor. Constructing such an executor with a limited number of threads is quite simple thanks to Executors::newFixedThreadPool:
ExecutorService service = Executors.newFixedThreadPool(30);
Looking at the ExecutorService interface, the method <T> Future<T> submit(Callable<T> task) might be fitting.
For this, some changes will be necessary. The tasks (i.e. what is currently a Runnable in the given implementation) must be converted to a Callable<T>, where T is the return type. The returned Future<T>s should then be collected into a list and waited upon. When all Futures have completed, the result list can be constructed, e.g. through streaming.
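As a rough sketch of the steps described above (not the asker's actual processing), the following submits one Callable per file to a fixed pool and collects the Futures. The class name FixedPoolSketch, the helper processFile, and its Boolean result are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FixedPoolSketch {

    // Hypothetical stand-in for the real per-file work.
    static boolean processFile(String path) {
        return path.endsWith(".json");
    }

    // Runs processFile for every path on at most `threads` worker threads
    // and returns the results in the same order as the input.
    static List<Boolean> processAll(List<String> paths, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (String path : paths) {
                Callable<Boolean> task = () -> processFile(path);
                futures.add(pool.submit(task));
            }
            List<Boolean> results = new ArrayList<>();
            for (Future<Boolean> future : futures) {
                try {
                    results.add(future.get()); // blocks until this task is done
                } catch (ExecutionException e) {
                    results.add(false); // the task threw; record it as failed
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
            }
            return results;
        } finally {
            pool.shutdown(); // lets the worker threads exit once the queue is empty
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("a.json", "b.txt", "c.json"), 30));
    }
}
```

With 200k files the pool's queue simply holds the pending tasks, so only 30 of them run at any moment.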
With parallel streams and a ForkJoinPool you may get more straightforward code, plus an easy way to collect the results of your files after processing. For parallel processing, I use threads directly only as a last resort, when a parallel stream can't be used.
boolean doStuff( String file){
// do your magic here
System.out.println( "The file " + file + " has been processed." );
// return the status of the processed file
return true;
}
List<String> jsonFiles = new ArrayList<String>();
jsonFiles.add("file1");
jsonFiles.add("file2");
jsonFiles.add("file3");
...
jsonFiles.add("file200000");
ForkJoinPool forkJoinPool = null;
try {
final int parallelism = 30;
forkJoinPool = new ForkJoinPool(parallelism);
forkJoinPool.submit(() ->
jsonFiles.parallelStream()
.map( jsonFile -> doStuff( jsonFile) )
.collect(Collectors.toList()) // you can collect this to a List<Boolean> results
).get();
} catch (InterruptedException | ExecutionException e) {
e.printStackTrace();
} finally {
if (forkJoinPool != null) {
forkJoinPool.shutdown();
}
}
Put your jobs (filenames) into a queue, start 30 threads to process them, then wait until all threads are done. For example:
static ConcurrentLinkedDeque<String> jobQueue = new ConcurrentLinkedDeque<String>();
private static class Worker implements Runnable {
int threadNumber;
public Worker(int threadNumber) {
this.threadNumber = threadNumber;
}
public void run() {
try {
System.out.println("Thread " + threadNumber + " started");
while (true) {
// get the next filename from job queue
String fileName;
try {
fileName = jobQueue.pop();
} catch (NoSuchElementException e) {
// The queue is empty, exit the loop
break;
}
System.out.println("Thread " + threadNumber + " processing file " + fileName);
Thread.sleep(1000); // do something useful here
System.out.println("Thread " + threadNumber + " finished file " + fileName);
}
System.out.println("Thread " + threadNumber + " finished");
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
public static void main(String[] args) throws InterruptedException {
// Create dummy filenames for testing:
for (int i = 1; i <= 200; i++) {
jobQueue.push("Testfile" + i + ".json");
}
System.out.println("Starting threads");
// Create 30 worker threads
List<Thread> workerThreads = new ArrayList<Thread>();
for (int i = 1; i <= 30; i++) {
Thread thread = new Thread(new Worker(i));
workerThreads.add(thread);
thread.start();
}
// Wait until the threads are all finished
for (Thread thread : workerThreads) {
thread.join();
}
System.out.println("Finished");
}
}
I'm testing processing of a large file (10.000.100 rows) with java.
I wrote a piece of code which reads from the file and spawns a specified number of Threads (at most equal to the cores of the CPU) which, then, print the content of the rows of the file to the standard output.
The Main class is like the following:
public class Main
{
public static void main(String[] args)
{
int maxThread;
ArrayList<String> linesForWorker = new ArrayList<String>();
if ("MAX".equals(args[1]))
maxThread = Runtime.getRuntime().availableProcessors();
else
maxThread = Integer.parseInt(args[1]);
ExecutorService executor = Executors.newFixedThreadPool(maxThread);
String readLine;
Thread.sleep(1000L);
long startTime = System.nanoTime();
BufferedReader br = new BufferedReader(new FileReader(args[0]));
do
{
readLine= br.readLine();
if ("X".equals(readLine))
{
executor.execute(new WorkerThread((ArrayList) linesForWorker.clone()));
linesForWorker.clear(); // Wrote to avoid storing a list with ALL the lines of the file in memory
}
else
{
linesForWorker.add(readLine);
}
}
while (readLine!= null);
executor.shutdown();
br.close();
if (executor.awaitTermination(1L, TimeUnit.HOURS))
System.out.println("END\n\n");
long endTime = System.nanoTime();
long durationInNano = endTime - startTime;
System.out.println("Duration in hours:" + TimeUnit.NANOSECONDS.toHours(durationInNano));
System.out.println("Duration in minutes:" + TimeUnit.NANOSECONDS.toMinutes(durationInNano));
System.out.println("Duration in seconds:" + TimeUnit.NANOSECONDS.toSeconds(durationInNano));
System.out.println("Duration in milliseconds:" + TimeUnit.NANOSECONDS.toMillis(durationInNano));
}
}
The WorkerThread class is structured as follows:
class WorkerThread implements Runnable
{
private List<String> linesToPrint;
public WorkerThread(List<String> linesToPrint) { this.linesToPrint = linesToPrint; }
public void run()
{
for (String lineToPrint : this.linesToPrint)
{
System.out.println(String.valueOf(Thread.currentThread().getName()) + ": " + lineToPrint);
}
this.linesToPrint = null; // Wrote to help garbage collector know I don't need the object anymore
}
}
I ran the application specifying 1 and "MAX" (i.e. the number of CPU cores, which is 4 in my case) as the maximum number of threads of the FixedThreadPool, and I observed:
An execution time of about 40 minutes when executing the application with 1 single thread in the FixedThreadPool.
An execution time of about 44 minutes when executing the application with 4 threads in the FixedThreadPool.
Could someone explain this strange (at least to me) behaviour? Why didn't multithreading help here?
P.S. I have an SSD in my machine.
EDIT: I modified the code so that the threads now create a file and write their set of lines to that file on the SSD. The execution time has dropped to about 5 s, but the 1-thread version of the program still runs in about 5292 ms, while the multithreaded (4 threads) version runs in about 5773 ms.
Why does the multithreaded version still take longer? Maybe every thread, even when writing its own file, has to wait for the other threads to release the SSD in order to access it and write?
I'm hoping some concurrency experts can advise, as I'm not looking to rewrite something that likely already exists.
Picture the problem: I have a web connection that comes calling for its unique computed result (with a key that it provides in order to retrieve the result). However, the result may not have been computed YET, so I would like the connection to wait (block) for UP TO n seconds before giving up and telling the caller that I don't (yet) have the result (the time needed to compute a value is non-deterministic). Something like:
String getValue (String key)
{
String value = [MISSING_PIECE_OF_PUZZLE].getValueOrTimeout(key, 10, TimeUnit.SECONDS)
if (value == null)
return "Not computed within 10 Seconds";
else
return "Value was computed and was " + value;
}
and then have other threads (the computation threads) doing the calculations, something like:
public void writeValues()
{
....
[MISSING_PIECE_OF_PUZZLE].put(key, computedValue)
}
In this scenario, a number of threads work in the background to compute the values that will ultimately be picked up by the web connections. The web connections have NO control or authority over what is computed or when the computations execute. As I've said, this is done in a pool in the background, but these threads can publish when a computation has completed (how they do so is the gist of this question). The published message may or may not be consumed, depending on whether any subscribers are interested in this computed value.
As these are web connections that will be blocking, I could potentially have thousands of concurrent connections waiting (subscribing) for their specific computed value, so the solution needs to be very light on blocking resources. The closest I've come is this SO question, which I will explore further, but I wanted to check that I'm not missing something blindingly obvious before writing this myself.
I think you should use a Future. It lets you compute data in a separate thread and block for the requested time period while waiting for an answer. Notice how it throws an exception if more than 3 seconds pass:
public class MyClass {
// Simulates heavy work that takes 10 seconds
private static int getValueOrTimeout() throws InterruptedException {
TimeUnit.SECONDS.sleep(10);
return 123;
}
public static void main(String... args) throws InterruptedException, ExecutionException {
Callable<Integer> task = () -> {
Integer val = null;
try {
val = getValueOrTimeout();
} catch (InterruptedException e) {
throw new IllegalStateException("task interrupted", e);
}
return val;
};
ExecutorService executor = Executors.newFixedThreadPool(1);
Future<Integer> future = executor.submit(task);
System.out.println("future done? " + future.isDone());
try {
Integer result = future.get(3, TimeUnit.SECONDS);
System.out.print("Value was computed and was : " + result);
} catch (TimeoutException ex) {
System.out.println("Not computed within 3 seconds");
}
}
}
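One way the Future idea might be adapted to the keyed lookup from the question is to keep one CompletableFuture per key in a ConcurrentHashMap. This is only a sketch under assumptions: the class name KeyedResults is invented here, the method names put and getValueOrTimeout mirror the question's pseudocode, and entries are never removed, which a real service would need to handle.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class KeyedResults {

    // One future per key; created by whichever side (reader or writer) arrives first.
    private final ConcurrentMap<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Called by a computation thread when the value for `key` is ready.
    void put(String key, String value) {
        pending.computeIfAbsent(key, k -> new CompletableFuture<>()).complete(value);
    }

    // Called by a web thread; waits up to the timeout and returns null if no value arrived.
    String getValueOrTimeout(String key, long timeout, TimeUnit unit) {
        CompletableFuture<String> future = pending.computeIfAbsent(key, k -> new CompletableFuture<>());
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            return null; // not computed in time
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        KeyedResults results = new KeyedResults();
        new Thread(() -> results.put("TheKey", "42")).start();
        System.out.println(results.getValueOrTimeout("TheKey", 10, TimeUnit.SECONDS));
        System.out.println(results.getValueOrTimeout("OtherKey", 100, TimeUnit.MILLISECONDS));
    }
}
```

Each waiting caller still occupies a thread while it blocks, so this does not by itself address the concern about thousands of concurrent connections.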
After looking at the changes in your question, I wanted to suggest a different approach using a BlockingQueue. In that case the producer logic is completely separated from the consumer, so you could do something like this:
public class MyClass {
private static BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
private static Map<String, String> dataComputed = new ConcurrentHashMap<>();
public static void writeValues(String key) {
Random r = new Random();
try {
// Simulate working for long time
TimeUnit.SECONDS.sleep(r.nextInt(11));
String value = "Hello there fdfsd" + Math.random();
queue.offer(value);
dataComputed.putIfAbsent(key, value);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
private static String getValueOrTimeout(String key) throws InterruptedException {
String result = dataComputed.get(key);
if (result == null) {
result = queue.poll(10, TimeUnit.SECONDS);
}
return result;
}
public static void main(String... args) throws InterruptedException, ExecutionException {
String key = "TheKey";
Thread producer = new Thread(() -> {
writeValues(key);
});
Thread consumer = new Thread(() -> {
try {
String message = getValueOrTimeout(key);
if (message == null) {
System.out.println("No message in 10 seconds");
} else {
System.out.println("The message:" + message);
}
} catch (InterruptedException e) {
e.printStackTrace();
}
});
consumer.start();
producer.start();
}
}
That said, I have to agree with @earned that making the client thread wait is not a good approach. Instead, I would suggest using a WebSocket, which lets you push data to the client when it is ready. You can find lots of tutorials on WebSocket; here is one for example: ws tutorial
The host's features are declared in the platform.xml file:
<host id="Tier1_1" core="2" speed="100f"/>
The worker process lives on this host.
How can the worker receive and execute two tasks simultaneously (given that the number of cores is 2)?
Currently I use the following code, but it doesn't work in this case (it can only receive one task at a time, not two):
while(true) {
commReceived = Task.irecv("Tier1_" + num);
commReceived.waitCompletion();
if (commReceived.test()){
task = commReceived.getTask();
commReceived = null;
Msg.info("Receive " + task.getName());
task.execute();
Msg.info("End to execute " + task.getName());
}
}
UPD:
Now I use this code. There are two processes with the same mailbox "Tier1_2". I send with isend to mailbox ("Tier1_2"):
for (int j=0; j<2; j++){
Process process = new Process(getHost().getName(), "Tier1_2_" + j) {
@Override
public void main(String[] strings) throws MsgException {
while (true){
commReceived = Task.irecv("Tier1_2");
commReceived.waitCompletion();
if (commReceived.test()){
task = commReceived.getTask();
commReceived = null;
Msg.info("Receive " + task.getName());
}
}
}
};
process.start();
}
But it gives:
Exception in thread "Thread-5" java.lang.NullPointerException
at LHCb.Tier1$1.main(Tier1.java:46)
at org.simgrid.msg.Process.run(Process.java:338)
How should I declare the processes correctly?
The idea is to have the worker process spawn other processes that listen on different mailboxes. For instance, something like this (which I haven't tested):
for (int i = 0; i < 2; i++) {
Process p = new Process(getHost().getName(), "Tier1_" + i) {
public void main(String[] args) throws MsgException {
String mailbox = getName();
while(true) {
commReceived = Task.irecv(mailbox);
commReceived.waitCompletion();
if (commReceived.test()){
task = commReceived.getTask();
commReceived = null;
Msg.info("Receive " + task.getName());
task.execute();
Msg.info("End to execute " + task.getName());
}
}
}
};
p.start();
}
The Process constructor used here takes two arguments: the name of the host on which the process runs, and the name of the process itself. Here we give each process a unique name that is also used as the mailbox name (hence mailbox = getName()).
Don't forget to kill these processes at some point, as they run forever. You might want to keep all the spawned processes in a vector to make that easier.
I am currently learning to use the concurrency features of Java provided by the package java.util.concurrent. As an exercise I tried to write a little program that could be used to performance-test an HTTP API. But my program very often does not terminate correctly. It even crashes my OS.
The following is the pseudocode of my program:
Instantiate Request objects that query an HTTP API (in the example I just query one random site).
Instantiate multiple Callables, where each one represents an HTTP call.
Iterate over the Callables and schedule them via a ScheduledExecutorService (how many requests should be performed per second can be configured at the beginning of the code).
After scheduling all Callables, I begin to iterate over the Futures. If a Future is done, retrieve the response. Do this every second. If no new Future has finished, quit the loop.
What problems am I experiencing in detail?
Often the program does not finish correctly. I see all log prints in the console, as if the program were finishing correctly. But the stop button in Eclipse remains active. If I click it, it says that the program could not be terminated correctly. It does not finish no matter how long I wait (NOTE: I am starting the program inside Eclipse).
I can provoke the error easily by increasing the number of requests. If I turn it up to 2000, this will happen for sure. When it happens my OS even crashes: I can still use Eclipse, but other apps do not work anymore.
My environment is Eclipse 3.7 on Mac OS X 10.7 with Java 1.6 and Apache httpclient 4.2.2.
Do you spot any major errors in my code? I have never before had a Java program crash my OS without showing any exceptions at all.
The code:
public class ConcurrentHttpRequestsTest {
/**
* @param args
*/
public static void main(String[] args) {
ScheduledExecutorService scheduledExecutorService = Executors.newScheduledThreadPool(25);
Integer standardTimeout = 5000;
Float numberOfRequestsPerSecond = 50.0f;
Integer numberOfRequests = 500;
Integer durationBetweenRequests = Math.round(1000 / numberOfRequestsPerSecond);
// build Http Request
HttpGet request = null;
request = new HttpGet("http://www.spiegel.de");
// request.addHeader("Accept", "application/json");
HttpParams params = new BasicHttpParams();
HttpConnectionParams.setConnectionTimeout(params, standardTimeout);
HttpConnectionParams.setSoTimeout(params, standardTimeout);
request.setParams(params);
// setup concurrency logic
Collection<Callable<Long>> callables = new LinkedList<Callable<Long>>();
for (int i = 1; i <= numberOfRequests; i++) {
HttpClient client = new DefaultHttpClient();
callables.add(new UriCallable(request, client));
}
// start performing requests
int i = 1;
Collection<Future<Long>> futures = new LinkedList<Future<Long>>();
for (Callable<Long> callable : callables) {
ScheduledFuture<Long> future = scheduledExecutorService.schedule(callable, i * durationBetweenRequests, TimeUnit.MILLISECONDS);
futures.add(future);
i++;
}
// process futures (check whether they are ready yet)
Integer maximumNoChangeCount = 5;
boolean futuresAreReady = false;
int noChangeCount = 0;
int errorCount = 0;
List<Long> responses = new LinkedList<Long>();
while (!futuresAreReady) {
boolean allFuturesAreDone = true;
boolean atLeast1FutureIsDone = false;
Iterator<Future<Long>> iterator = futures.iterator();
while (iterator.hasNext()) {
Future<Long> future = iterator.next();
allFuturesAreDone = allFuturesAreDone && (future.isDone());
if (future.isDone()) {
try {
atLeast1FutureIsDone = true;
responses.add(future.get());
iterator.remove();
} catch (Exception e) {
// remove failed futures (e.g. timeout)
// System.out.println("Reached catch of future.get()" +
// e.getClass() + " " + e.getCause().getClass() + " " +
// e.getMessage());
iterator.remove();
errorCount++;
}
}
if (future.isCancelled()) {
// this code is never reached. Just here to make sure that
// this is not the cause of problems.
System.out.println("Found a cancelled future. Will remove it.");
iterator.remove();
}
}
if (!atLeast1FutureIsDone) {
System.out.println("At least 1 future was not done. Current noChangeCount:" + noChangeCount);
noChangeCount++;
} else {
// reset noChangeCount
noChangeCount = 0;
}
futuresAreReady = allFuturesAreDone;
// log the current state of responses, errors and remaining futures
System.out.println("Size of responses :" + responses.size() + "; Size of futures:" + futures.size() + " Errors:" + errorCount);
if (noChangeCount >= maximumNoChangeCount) {
System.out.println("Breaking while loop because no new future finished in the last " + maximumNoChangeCount + " iterations");
break;
}
// check every second
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
for (Long responsetime : responses) {
// analyze responsetimes or whatever
}
// clean up
// .shutdown() made even more problems than shutdownNow()
scheduledExecutorService.shutdownNow();
System.out.println("Executors have been shutdown - Main Method finished. Will exit System.");
System.out.flush();
System.exit(0);
}
private static class UriCallable implements Callable<Long> {
private HttpUriRequest request;
private HttpClient client;
public UriCallable(HttpUriRequest request, HttpClient client) {
super();
this.request = request;
this.client = client;
}
public Long call() throws Exception {
Long start = System.currentTimeMillis();
HttpResponse httpResponse = client.execute(request);
Long end = System.currentTimeMillis();
return end - start;
}
}
}
Never do this in a loop:
} catch (InterruptedException e) {
e.printStackTrace();
}
It might cause problems on shutdown.
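The reason, as far as I can tell, is that catching InterruptedException clears the thread's interrupt flag, so a loop that only prints the stack trace keeps running after something like shutdownNow() has asked the thread to stop. A minimal sketch of the usual alternative, restoring the flag and leaving the loop, could look like this; the class and method names are made up for illustration.

```java
class InterruptAwareLoop {

    // Sleeps up to maxIterations times, but stops as soon as the thread is
    // interrupted. Returns the number of sleeps that completed.
    static int pollUntilInterrupted(int maxIterations, long sleepMillis) {
        int completed = 0;
        while (completed < maxIterations) {
            try {
                Thread.sleep(sleepMillis);
                completed++;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag for callers
                break;                              // and leave the loop
            }
        }
        return completed;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() ->
                System.out.println("completed sleeps: " + pollUntilInterrupted(1000, 1000)));
        worker.start();
        worker.interrupt(); // similar to what shutdownNow() does to pool threads
        worker.join();
    }
}
```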
Also, most of your code could be replaced by a single call to ExecutorService.invokeAll(), so try that and see if you have more luck.
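A rough sketch of what the invokeAll() variant might look like follows. The tasks here are trivial stand-ins rather than real HTTP calls, and the names InvokeAllSketch and runAll are assumptions for illustration. Note that invokeAll() starts tasks as soon as pool threads are free, so it does not reproduce the requests-per-second pacing of the scheduled version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class InvokeAllSketch {

    // Runs all tasks on a fixed pool, waits at most timeoutSeconds in total,
    // and returns the results of the tasks that finished normally.
    static List<Long> runAll(List<Callable<Long>> tasks, int threads, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Long> results = new ArrayList<>();
        try {
            // invokeAll blocks until every task is done or the timeout expires;
            // tasks that did not finish in time are cancelled.
            List<Future<Long>> futures = pool.invokeAll(tasks, timeoutSeconds, TimeUnit.SECONDS);
            for (Future<Long> future : futures) {
                if (future.isCancelled()) {
                    continue; // timed out
                }
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    // the task threw; skip it (a real test would count errors)
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return results;
    }

    public static void main(String[] args) {
        List<Callable<Long>> tasks = new ArrayList<>();
        for (long i = 1; i <= 5; i++) {
            final long n = i;
            tasks.add(() -> n * n); // hypothetical stand-in for a timed HTTP call
        }
        System.out.println(runAll(tasks, 2, 10));
    }
}
```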
Lastly, when you don't know what your Java application is doing, run jconsole, attach to the application, and look at the thread stacks to see what code is currently in progress.