I have a file with millions of lines in it that I need to process. Each line of the file will result in an HTTP call. I'm trying to figure out the best way to attack the problem.
I obviously could just read the file and make the calls sequentially, but it would be incredibly slow. I'd like to parallelize the calls, but I'm not sure if I should read the entire file into memory (something I'm not a huge fan of) or try to parallelize the reading of the file as well (which I'm not sure would make sense).
Just looking for some thoughts here on the best way to attack the problem. If there is an existing framework or library that does something similar I'm happy to use that as well.
Thanks.
I'd like to parallelize the calls, but I'm not sure if I should read the entire file into memory
You should use an ExecutorService with a bounded BlockingQueue. As you read in your millions of lines, you submit jobs to the thread pool until the BlockingQueue is full. This way you can run 100 (or whatever number is optimal) HTTP requests simultaneously without having to read all of the lines of the file beforehand.
You'll need to set up a RejectedExecutionHandler that blocks if the queue is full. This is better than a caller runs handler.
BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(100);
// NOTE: you want the min and max thread numbers here to be the same value
ThreadPoolExecutor threadPool =
new ThreadPoolExecutor(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS, queue);
// we need our RejectedExecutionHandler to block if the queue is full
threadPool.setRejectedExecutionHandler(new RejectedExecutionHandler() {
@Override
public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
try {
// this will block the producer until there's room in the queue
executor.getQueue().put(r);
} catch (InterruptedException e) {
// restore the interrupt flag before reporting the failure
Thread.currentThread().interrupt();
throw new RejectedExecutionException(
"Unexpected InterruptedException", e);
}
}
});
// now read in the urls
String url;
while ((url = urlReader.readLine()) != null) {
// submit them to the thread-pool. this may block.
threadPool.submit(new DownloadUrlRunnable(url));
}
// after we submit we have to shutdown the pool
threadPool.shutdown();
// wait for them to complete
threadPool.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
...
private class DownloadUrlRunnable implements Runnable {
private final String url;
public DownloadUrlRunnable(String url) {
this.url = url;
}
public void run() {
// download the URL
}
}
Gray's approach seems to be good. The other approach I would suggest is to split the files into chunks (you will have to write the logic), and process those with multiple threads.
I have a producer-consumer model using a blocking queue, where 4 producer threads read files from a directory and put them on the blocking queue, and 4 consumer threads read from the blocking queue.
My problem is that only one consumer ever reads from the BlockingQueue; the other 3 consumer threads are not reading:
final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(QUEUE_SIZE);
CompletableFuture<Void> completableFutureProducer = produceUrls(files, queue, checker);
// Not providing the code for produceData; it is working fine, with all 4 threads writing to the blocking queue. Here is the consumer code.
private CompletableFuture<Validator> consumeData(
final Response checker,
final CompletableFuture<Void> urls
) {
return CompletableFuture.supplyAsync(checker, 4)
.whenComplete((result, err) -> {
if (err != null) {
LOG.error("consuming url worker failed!", err);
urls.cancel(true);
}
});
}
completableFutureProducer.join();
completableFutureConsumer.join();
This is my code. Can someone tell me what I am doing wrong? Or help with correct code.
Why is only one consumer reading from the blocking queue?
Adding code for Response class reading from Blocking queue :
@Slf4j
public final class Response implements Supplier<Check> {
private final BlockingQueue<byte[]> data;
private final AtomicBoolean producersDone;
private final Calendar calendar = Calendar.getInstance();
public Response(
final BlockingQueue<byte[]> data
) {
this.data = data;
producersDone = new AtomicBoolean();
}
public void notifyProducersDone() {
producersDone.set(true);
}
@Override
public Check get() {
try {
Check check = null;
try {
while (!data.isEmpty() || !producersDone.get()) {
final byte[] item = data.poll(1, TimeUnit.SECONDS);
if (item != null) {
LOG.info("{}",new String(item));
// I see only one thread printing result here .
check = validateData(item);
}
}
} catch (InterruptedException | IOException e) {
Thread.currentThread().interrupt();
throw new WriteException("Exception occurred while data validation", e);
}
return check;
} finally {
LOG.info("Done reading data from BlockingQueue");
}
}
}
It's hard to diagnose from this alone, but it's probably not correct to check for data.isEmpty() because the queue may happen to be temporarily empty (but later get items). So your threads might exit as soon as they encounter a temporarily empty queue.
Instead, you can exit if producers were done AND you got an empty result from the poll. That way the threads only exit when there are truly no more items to process.
It's a bit odd though that you are returning the result of the last item (alone). Are you sure this is what you want?
EDIT: I've done something very similar recently. Here is a class that reads from a file, transforms the lines in a multi-threaded way, then writes to a different file (the order of lines is preserved).
It also uses a BlockingQueue. It's very similar to your code, but it doesn't check queue.isEmpty(), for the aforementioned reason. It works fine for me.
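For illustration only (this is not the class mentioned above), here is a minimal sketch of the exit condition described in this answer: a consumer stops only when the producers were already done before the poll and the poll still came back empty. The class name PollingConsumers, the drain method and the 100 ms timeout are assumptions made for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class, not part of the original question.
class PollingConsumers {

    // Starts `consumers` threads that drain the queue and returns how many items they processed.
    static int drain(BlockingQueue<String> queue, AtomicBoolean producersDone, int consumers)
            throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        // Read the flag BEFORE polling: if it was already set and the
                        // poll still times out, no further items can arrive.
                        boolean done = producersDone.get();
                        String item = queue.poll(100, TimeUnit.MILLISECONDS);
                        if (item != null) {
                            processed.incrementAndGet(); // stand-in for real processing
                        } else if (done) {
                            return;
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10);
        AtomicBoolean producersDone = new AtomicBoolean();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 1000; i++) {
                    queue.put("item-" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                producersDone.set(true); // set only after the last put
            }
        });
        producer.start();
        System.out.println(drain(queue, producersDone, 4)); // prints 1000
    }
}
```

A temporarily empty queue no longer ends a consumer here, because an empty poll only counts once the done flag was already observed.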
4+4 threads is not that many, so you'd be better off not using asynchronous tools like CompletableFuture. A plain multithreaded program would be simpler and run faster.
Having
BlockingQueue<byte[]> data;
don't use data.poll();
use data.take();
When you have, let's say, 1 item in the queue and 4 consumers, one of them will poll the item, leaving the queue empty. The other 3 consumers then check queue.isEmpty() and, since it is empty, quit the loop.
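One caveat with take() is that it blocks forever once the queue is empty, so the consumers need some other signal to stop. A common option, which is not part of the answer above, is a "poison pill": a sentinel object the producer enqueues once per consumer. The sketch below assumes that approach; the names PoisonPillConsumers, POISON and consume are made up for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class illustrating take() with a poison pill.
class PoisonPillConsumers {

    // Sentinel that tells a consumer to stop; compared by identity, never by content.
    static final byte[] POISON = new byte[0];

    // Starts `consumers` threads and returns the number of real items they processed.
    static int consume(BlockingQueue<byte[]> queue, int consumers) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        Thread[] threads = new Thread[consumers];
        for (int i = 0; i < consumers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    while (true) {
                        byte[] item = queue.take(); // blocks until an item is available
                        if (item == POISON) {
                            return; // each consumer swallows exactly one pill
                        }
                        processed.incrementAndGet(); // stand-in for real validation
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        int consumers = 4;
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 100; i++) {
            queue.put(("payload-" + i).getBytes());
        }
        // After the last real item, enqueue one pill per consumer.
        for (int i = 0; i < consumers; i++) {
            queue.put(POISON);
        }
        System.out.println(consume(queue, consumers)); // prints 100
    }
}
```

Because the pills are queued after the real items, every item is processed before any consumer exits.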
I'm trying to load images from a folder into a ConcurrentHashMap using multiple threads to save time. Unfortunately, some threads get stuck while they are trying to load an image and put it in my map. As a result, after shutdown() is called the program carries on even though some threads haven't finished their tasks. When I set the ExecutorService thread pool size to 1, everything works properly, but I waste a lot of time waiting for all the images to load. It seems to me that there is a race condition somewhere, but as far as I know ConcurrentHashMap is safe for multithreaded operations. I'm still a beginner, so please help me understand where the problem is and what I'm doing wrong.
Here is the code:
public abstract class ImageContainer {
private final static Map<String, BufferedImage> imageMap = loadImages();
private static long loadingTime;
public static Map<String, BufferedImage> loadImages() {
loadingTime = System.currentTimeMillis();
ConcurrentHashMap<String, BufferedImage> imageMap = new ConcurrentHashMap<>();
ExecutorService es = Executors.newFixedThreadPool(5);
File imageDirectory = new File("Images/");
if (!imageDirectory.isDirectory()) {
System.out.println("Image directory error");
}
File[] files = imageDirectory.listFiles();
if (files != null) {
for (File file : files) {
if (file.isFile()) {
es.submit(new Runnable(){
@Override
public void run() {
try{
if(file.getAbsolutePath().contains(".jpg")) {
imageMap.put(file.getName().replace(".jpg",""),ImageIO.read(file));
}
else if (file.getAbsolutePath().contains(".png")) {
imageMap.put(file.getName().replace(".png",""),ImageIO.read(file));
}
}
catch (IOException e)
{
System.out.println("Cannot load image");
}
}
});
}
}
}
else
{
System.out.println("Image folder empty!");
}
es.shutdown();
try {
if(!es.awaitTermination(5L, TimeUnit.SECONDS)) {
System.out.println("Images did not load successfully!");
es.shutdownNow();
}
loadingTime = System.currentTimeMillis() - loadingTime;
}
catch(InterruptedException e) {
System.out.println("Loading images interrupted!");
}
System.out.println(imageMap.size());
return imageMap;
}
};
The problem most likely has nothing to do with ConcurrentHashMap. It is designed for concurrent put calls: at worst, a thread has to wait briefly until another thread has finished a conflicting put, and that does not cause any race conditions.
I executed your code on my machine and everything works (no error message, and it prints the number of loaded images). Maybe your computer is not as fast as mine at loading images, and therefore the awaitTermination call times out.
I'm not sure that your approach (loading images with multiple threads) is such a good idea. Your hard drive (or SSD) will be the bottleneck, and your threads will end up waiting for it in the ImageIO.read call.
Also, spinning up an executor service (that is, starting new threads) is not very cheap, so maybe you're better off without multithreading. Especially because you only need to load the images once (afterwards they are cached in the map), the speedup will probably never be significant. I would consider loading the images sequentially.
ImageIO is pretty slow and very I/O intensive, so adding many threads often won't help on typical PCs. Are you sure you don't simply need a larger timeout for awaitTermination?
Another option is to use a LinkedBlockingQueue with a limited length for the thread pool, so that your main application thread slows down when the consumers are slow. That way the 5-second wait at the end is a realistic allowance for the calls still in progress to finish.
See the JDK source for newFixedThreadPool(n) below, and try a queue size of, say, 2 or 3 x nThreads in the constructor of LinkedBlockingQueue:
public static ExecutorService newFixedThreadPool(int nThreads) {
return new ThreadPoolExecutor(nThreads, nThreads,
0L, TimeUnit.MILLISECONDS,
new LinkedBlockingQueue<Runnable>());
}
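A bounded queue on its own makes the executor reject tasks once the queue is full, so some rejection policy is needed for the submitting thread to actually slow down. The sketch below is one way to do it, assuming ThreadPoolExecutor.CallerRunsPolicy is acceptable (a blocking RejectedExecutionHandler would be another option). The class name BoundedPool and the factory method newBoundedFixedThreadPool are made up for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: a fixed-size pool like newFixedThreadPool, but with a bounded queue.
class BoundedPool {

    static ExecutorService newBoundedFixedThreadPool(int nThreads, int queueSize) {
        return new ThreadPoolExecutor(nThreads, nThreads,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>(queueSize),
                // When the queue is full, the submitting thread runs the task itself,
                // which throttles submission to the speed of the workers.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public static void main(String[] args) throws InterruptedException {
        int nThreads = 5;
        ExecutorService es = newBoundedFixedThreadPool(nThreads, 3 * nThreads);
        AtomicInteger loaded = new AtomicInteger();
        for (int i = 0; i < 100; i++) {
            es.submit(() -> {
                try {
                    Thread.sleep(5); // stand-in for a slow ImageIO.read call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                loaded.incrementAndGet();
            });
        }
        es.shutdown();
        // At most queueSize + nThreads tasks are still outstanding at this point.
        boolean finished = es.awaitTermination(5L, TimeUnit.SECONDS);
        System.out.println(finished + " " + loaded.get());
    }
}
```

With this setup the submission loop cannot run far ahead of the workers, so the wait in awaitTermination only has to cover a small, known backlog.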
I have a web application that, on a single request, may need to load hundreds of data items. The problem is that the data is scattered, so I have to load it from several places, apply filters to it, process it, and then respond. Performing all these operations sequentially makes the servlet slow!
So I have thought of loading all the data in separate threads like t[i] = new Thread(loadData).start();, waiting for all threads to finish using while(i < count) t[i].join();, and, when done, joining the data and responding.
Now I am not sure whether this approach is right or whether there is a better method. I have read somewhere that spawning threads in servlets is not advisable.
My desired code will look something like this.
protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
{
Iterable<?> requireddata = requiredData(request);
Thread[] t = new Thread[requireddata.size];
int i = 0;
while (requireddata.hasNext())
{
t[i] = new Thread(new loadData(requiredata.next())).start();
i++;
}
for(i = 0 ; i < t.length ; i++)
t[i].join();
// after getting the data process and respond!
}
The main problem is that you'll bring the server to its knees if many concurrent requests come in for your servlet, because you don't limit the number of threads that can be spawned. Another problem is that you keep creating new threads instead of reusing them, which is inefficient.
These two problems are solved easily by using a thread pool. And Java has native support for them. Read the tutorial.
Also, make sure to shutdown the thread pool when the webapp is shut down, using a ServletContextListener.
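As a rough sketch of the thread-pool approach described above: one pool is shared by all requests, each request submits its loading tasks with invokeAll (which returns once every task has completed), and the pool is shut down when the application stops. The class ParallelLoader, the load method and the pool size of 8 are assumptions for the example. In a real webapp the shutdown() call would go into the ServletContextListener mentioned above, which is left out here to keep the example free of the servlet API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; names and pool size are illustrative.
class ParallelLoader {

    // One bounded pool shared by all requests, instead of new threads per request.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

    // Loads every source in parallel and returns the results in the order of `sources`.
    static List<String> loadAll(List<String> sources)
            throws InterruptedException, ExecutionException {
        List<Callable<String>> tasks = new ArrayList<>();
        for (String source : sources) {
            tasks.add(() -> load(source));
        }
        List<String> results = new ArrayList<>();
        // invokeAll blocks until all tasks are done; futures come back in task order.
        for (Future<String> future : POOL.invokeAll(tasks)) {
            results.add(future.get());
        }
        return results;
    }

    // Stand-in for fetching and filtering data from one place.
    static String load(String source) {
        return "data from " + source;
    }

    // Call once when the application shuts down.
    static void shutdown() {
        POOL.shutdown();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(loadAll(Arrays.asList("db", "cache", "remote")));
        shutdown();
    }
}
```

Because the pool size is fixed, a burst of requests queues work instead of spawning an unbounded number of threads.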
Sounds like a problem for the CyclicBarrier.
For example:
ExecutorService executor = Executors.newFixedThreadPool(requireddata.size);
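public <T extends Runnable> void executeAllAndAwaitCompletion(List<T> threads)
throws InterruptedException, BrokenBarrierException {
final CyclicBarrier barrier = new CyclicBarrier(threads.size() + 1);
for (final T thread : threads) {
executor.submit(new Runnable() {
public void run() {
// it is not a mistake to call run() here
thread.run();
try {
barrier.await();
} catch (InterruptedException | BrokenBarrierException e) {
Thread.currentThread().interrupt();
}
}
});
}
// the caller waits here until every task has reached the barrier
barrier.await();
}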
The last thread from threads will be executed once all the others finish.
Instead of calling Executors.newFixedThreadPool(requireddata.size);, it is better to reuse some existing thread pool.
You may consider using the Executor framework from the java.util.concurrent API. For example, you can create your computation task as a Callable and then submit that task to a ThreadPoolExecutor. Sample code from Java Concurrency in Practice:
public class Renderer {
private final ExecutorService executor;
Renderer(ExecutorService executor) { this.executor = executor; }
void renderPage(CharSequence source) {
final List<ImageInfo> info = scanForImageInfo(source);
CompletionService<ImageData> completionService =
new ExecutorCompletionService<ImageData>(executor);
for (final ImageInfo imageInfo : info)
completionService.submit(new Callable<ImageData>() {
public ImageData call() {
return imageInfo.downloadImage();
}
});
renderText(source);
try {
for (int t = 0, n = info.size(); t < n; t++) {
Future<ImageData> f = completionService.take();
ImageData imageData = f.get();
renderImage(imageData);
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
} catch (ExecutionException e) {
throw launderThrowable(e.getCause());
}
}
}
Since you are waiting for all the threads to complete and then you are providing the response, IMO multiple threads won't help if you are using just CPU cycles. It will only increase the response time by adding the context switch delay in the threads. A single thread will be better. However if network/IO etc are involved you can make use of thread pool.
But you may want to reconsider your approach. Processing a huge amount of data synchronously in an HTTP request is not advisable; it will not be a good experience for the end user. What you can do instead is start a thread to process the data and immediately respond with "It is processing". You can then give the web user some way to check the status whenever they want.
I want to implement something like this.
1. A background process which will be running forever.
2. The background process will check the database for any requests in pending state. If any are found, it will assign a separate thread to process the request, so one thread per request. Max threads at any point in time should be 10. Once the thread has finished execution, the status of the request will be updated to something, say "completed".
My code outline looks something like this.
public class SimpleDaemon {
private static final int MAXTHREADS = 10;
public static void main(String[] args) {
ExecutorService executor = Executors.newFixedThreadPool(MAXTHREADS);
RequestService requestService = null; //init code omitted
while(true){
List<Request> pending = requestService.findPendingRequests();
List<Future<MyAppResponse>> completed = new ArrayList<Future<MyAppResponse>>(pending.size());
for (Request req:pending) {
Callable<MyAppResponse> worker = new MyCallable(req);
Future<MyAppResponse> submit = executor.submit(worker);
completed.add(submit);
}
// Now retrieve the result
for (Future<MyAppResponse> future : completed) {
try {
// get() waits for the task; getRequestId() is assumed to exist on MyAppResponse
requestService.updateStatus(future.get().getRequestId());
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
}
try {
Thread.sleep(10000); // Sleep sometime
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
Can anyone spend some time reviewing this and suggest improvements or optimizations (from a multithreading perspective)? Thanks.
Using a max threads of ten seems somewhat arbitrary. Is this the maximum available connections to your database?
I'm a little confused as to why you are purposefully introducing latency into your applications. Why aren't pending requests submitted to the Executor immediately?
The task submitted to the Executor could then update the RequestService, or you could have a separate worker Thread belonging to the RequestService which calls poll on a BlockingQueue of Future<MyAppResponse>.
You have no shutdown/termination strategy. Nothing indicates that main is run on a Thread that is set to Daemon. If it is, I think the ExecutorService's worker threads will inherit the daemon status, but then your application could shutdown with live connection to the database, no? Isn't that bad?
If the thread isn't really a Daemon, then you need to handle that InterruptedException and treat it as an indication that you are being asked to exit the application.
Your calls to requestService appear to be single-threaded, which means that any long-running query prevents queries that have already finished from being marked as completed.
Unless the updateStatus has to be called in a specific order, I suggest you call this as part of your query in MyCallable. This could simplify your code and allow results to be processed as they become available.
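You need to handle a possible RejectedExecutionException from executor.submit(). With Executors.newFixedThreadPool the work queue is unbounded, so this happens only once the pool has been shut down; with a bounded queue it also happens whenever the queue is full.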
You'd probably be better off using an ExecutorCompletionService rather than an ExecutorService because the former can tell you when a task completes.
I strongly recommend reading Brian Goetz's book "Java Concurrency in Practice".
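To make the shutdown point above concrete, here is a minimal sketch of a polling loop that treats InterruptedException as a request to exit and shuts its executor down on the way out. The class PollingDaemon and the awaitWorkers method are hypothetical, and the database lookup is replaced by a placeholder task.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical example class; the real daemon would query RequestService instead.
class PollingDaemon implements Runnable {

    private final ExecutorService executor = Executors.newFixedThreadPool(10);
    private final long pollIntervalMillis;

    PollingDaemon(long pollIntervalMillis) {
        this.pollIntervalMillis = pollIntervalMillis;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Placeholder: find pending requests and submit one task per request.
                executor.execute(() -> { /* process one request, then update its status */ });
                Thread.sleep(pollIntervalMillis);
            }
        } catch (InterruptedException e) {
            // Interruption is the signal to stop; restore the flag for callers.
            Thread.currentThread().interrupt();
        } finally {
            // Stop accepting new work; tasks already submitted still finish.
            executor.shutdown();
        }
    }

    // Waits for in-flight tasks after the loop has ended.
    boolean awaitWorkers(long timeout, TimeUnit unit) throws InterruptedException {
        return executor.awaitTermination(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        PollingDaemon daemon = new PollingDaemon(100);
        Thread loop = new Thread(daemon);
        loop.start();
        Thread.sleep(300);
        loop.interrupt(); // ask the daemon to exit
        loop.join();
        System.out.println(daemon.awaitWorkers(5, TimeUnit.SECONDS));
    }
}
```

In a real deployment the interrupt would typically come from a JVM shutdown hook or a management command rather than from main.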
I'm wrestling with the best way to implement my processing pipeline.
My producers feed work to a BlockingQueue. On the consumer side, I poll the queue, wrap what I get in a Runnable task, and submit it to an ExecutorService.
while (!isStopping())
{
String work = workQueue.poll(1000L, TimeUnit.MILLISECONDS);
if (work == null)
{
break;
}
executorService.execute(new Worker(work)); // needs to block if no threads!
}
This is not ideal; the ExecutorService has its own queue, of course, so what's really happening is that I'm always fully draining my work queue and filling the task queue, which slowly empties as the tasks complete.
I realize that I could queue tasks at the producer end, but I'd really rather not do that - I like the indirection/isolation of my work queue being dumb strings; it really isn't any business of the producer what's going to happen to them. Forcing the producer to queue a Runnable or Callable breaks an abstraction, IMHO.
But I do want the shared work queue to represent the current processing state. I want to be able to block the producers if the consumers aren't keeping up.
I'd love to use Executors, but I feel like I'm fighting their design. Can I partially drink the Kool-Aid, or do I have to gulp it? Am I being wrong-headed in resisting queueing tasks? (I suspect I could set up ThreadPoolExecutor to use a 1-task queue and override its execute method to block rather than reject-on-queue-full, but that feels gross.)
Suggestions?
I want the shared work queue to represent the current processing state.
Try using a shared BlockingQueue and have a pool of Worker threads taking work items off of the Queue.
I want to be able to block the producers if the consumers aren't keeping up.
Both ArrayBlockingQueue and LinkedBlockingQueue support bounded queues such that they will block on put when full. Using the blocking put() methods ensures that producers are blocked if the queue is full.
Here is a rough start. You can tune the number of workers and queue size:
public class WorkerTest<T> {
private final BlockingQueue<T> workQueue;
private final ExecutorService service;
public WorkerTest(int numWorkers, int workQueueSize) {
workQueue = new LinkedBlockingQueue<T>(workQueueSize);
service = Executors.newFixedThreadPool(numWorkers);
for (int i=0; i < numWorkers; i++) {
service.submit(new Worker<T>(workQueue));
}
}
public void produce(T item) {
try {
workQueue.put(item);
} catch (InterruptedException ex) {
Thread.currentThread().interrupt();
}
}
private static class Worker<T> implements Runnable {
private final BlockingQueue<T> workQueue;
public Worker(BlockingQueue<T> workQueue) {
this.workQueue = workQueue;
}
@Override
public void run() {
while (!Thread.currentThread().isInterrupted()) {
try {
T item = workQueue.take();
// Process item
} catch (InterruptedException ex) {
Thread.currentThread().interrupt();
break;
}
}
}
}
}
"find an available existing worker thread if one exists, create one if necessary, kill them if they go idle."
Managing all those worker states is as unnecessary as it is perilous. I would create one monitor thread that constantly runs in the background, whose only task is to fill up the queue and spawn consumers... why not make the worker threads daemons so they die as soon as they complete? If you attach them all to one ThreadGroup you can dynamically resize the pool... for example:
// threadGroup is the ThreadGroup that all worker threads are attached to
for (int i = 0; i < queue.size() && threadGroup.activeCount() < UPPER_LIMIT; i++) {
spawnDaemonWorkers(queue.poll());
}
You could have your consumer execute Runnable::run directly instead of starting up a new thread. Combine this with a blocking queue with a maximum size and I think you will get what you want. Your consumer becomes a worker that executes tasks inline based on the work items on the queue. Consumers will only dequeue items as fast as they can process them, so your producer blocks when your consumers stop consuming.