Sequentially processing file in threadpool executor - java

We use the JDK 7 WatchService to watch a directory that can receive XML or CSV files. The files are submitted to a thread pool, processed, and pushed into a database. The application runs forever, watching the directory and processing files as they arrive. The XML files are small and quick to process, but each CSV file can contain more than 80 thousand records, so loading it into the database takes time. The application throws an OutOfMemoryError when 15 CSV files are being processed by the thread pool at once. Is there a way to process CSV files serially, i.e. only one at a time, when they are submitted to the thread pool?

If I understand correctly, you want to stop adding to the pool once you are over some threshold. An easy way to do that is to use a bounded blocking queue together with a rejected execution handler.
See the following answer:
Process Large File for HTTP Calls in Java
To summarize it, you do something like the following:
// only allow 100 jobs to queue
final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(100);
ThreadPoolExecutor threadPool =
    new ThreadPoolExecutor(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS, queue);
// we need our RejectedExecutionHandler to block if the queue is full
threadPool.setRejectedExecutionHandler(new RejectedExecutionHandler() {
    @Override
    public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
        try {
            // this will block the producer until there's room in the queue
            executor.getQueue().put(r);
        } catch (InterruptedException e) {
            // restore the interrupt flag before reporting the failure
            Thread.currentThread().interrupt();
            throw new RejectedExecutionException(
                "Unexpected InterruptedException", e);
        }
    }
});
This means the producer blocks whenever the queue is full, so pending jobs cannot pile up and exhaust memory.

I would take a different route to solve your problem. I guess you have everything right except that you read too much data into memory at once.
I'm not sure how you are reading the CSV files, but I would suggest using a LineReader (or any line-oriented reader): read e.g. 500 lines, process them, then read the next 500 lines. All large files should be handled this way, because no matter how much you increase your memory arguments, you will hit out-of-memory again as soon as a bigger file arrives. So use an implementation that handles records in batches. This requires some extra coding effort, but memory use then no longer depends on how big the file is.
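As a rough illustration of the batching idea, here is a minimal sketch. It assumes Java 8 (for the lambda) and a plain BufferedReader; the class name BatchCsvReader, the method processInBatches and the sink callback are made up for this example, and the sink stands in for whatever writes a batch to the database.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchCsvReader {

    /**
     * Reads lines one at a time and hands them to the sink in batches of at most
     * batchSize lines, so only one batch is held in memory at a time.
     * Returns the total number of lines read.
     */
    public static int processInBatches(BufferedReader reader, int batchSize,
                                       Consumer<List<String>> sink) throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batch = new ArrayList<>(batchSize);
        int total = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            batch.add(line);
            total++;
            if (batch.size() == batchSize) {
                sink.accept(batch);
                // start a fresh list so the sink may keep the old one if it wants
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // flush the final, possibly smaller, batch
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a CSV file; a real caller would wrap a FileReader.
        String csv = "a,1\nb,2\nc,3\nd,4\ne,5\n";
        List<Integer> batchSizes = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(csv))) {
            int total = processInBatches(reader, 2, b -> batchSizes.add(b.size()));
            System.out.println(total + " lines in batches " + batchSizes);
            // prints: 5 lines in batches [2, 2, 1]
        }
    }
}
```

In a real loader the sink would typically run one JDBC batch insert per call, so the batch size also bounds the size of each database round trip.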
Cheers !!

You can try:
Increase the memory available to the JVM using the -Xmx option
Use a different executor to reduce the number of files processed at a time. A drastic solution is to use a SingleThreadExecutor:
public class FileProcessor implements Runnable {
    private final String name;

    public FileProcessor(String name) {
        this.name = name; // keep the file name so run() knows what to process
    }

    public void run() {
        // process file
    }
}
// ...
ExecutorService executor = Executors.newSingleThreadExecutor();
// ...
public void onNewFile(String fileName) {
    executor.submit(new FileProcessor(fileName));
}
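Since only the CSV files are heavy, one possible refinement is to serialize just those and keep a small pool for the XML files. The sketch below is only an illustration under that assumption: the class FileDispatcher, its methods, the pool size of 4 and the extension check are invented here, and it uses Java 8 syntax.

```java
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FileDispatcher {
    // CSV jobs share one worker thread, so they run strictly one at a time.
    private final ExecutorService csvExecutor = Executors.newSingleThreadExecutor();
    // Small XML jobs may still run in parallel; 4 is an arbitrary example size.
    private final ExecutorService xmlExecutor = Executors.newFixedThreadPool(4);

    /** Picks the executor by file extension (hypothetical routing rule). */
    public ExecutorService selectExecutor(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".csv") ? csvExecutor : xmlExecutor;
    }

    public Future<?> onNewFile(String fileName, Runnable task) {
        return selectExecutor(fileName).submit(task);
    }

    /** Stops accepting new work and waits for queued jobs to finish. */
    public void shutdownAndWait() throws InterruptedException {
        csvExecutor.shutdown();
        xmlExecutor.shutdown();
        csvExecutor.awaitTermination(1, TimeUnit.MINUTES);
        xmlExecutor.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws Exception {
        FileDispatcher dispatcher = new FileDispatcher();
        AtomicInteger running = new AtomicInteger();
        AtomicInteger maxRunning = new AtomicInteger();
        // Simulated CSV job that records how many jobs run at the same time.
        Runnable csvWork = () -> {
            int now = running.incrementAndGet();
            maxRunning.accumulateAndGet(now, Math::max);
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            running.decrementAndGet();
        };
        for (int i = 0; i < 5; i++) {
            dispatcher.onNewFile("data" + i + ".csv", csvWork);
        }
        dispatcher.shutdownAndWait();
        System.out.println("max concurrent CSV jobs: " + maxRunning.get());
        // prints: max concurrent CSV jobs: 1
    }
}
```

Note that the single-thread executor still has an unbounded queue of pending jobs; that is harmless as long as each queued task only holds a file name and not the file contents.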

Related

Camel: File consumer component "bites off more than it can chew", pipeline dies from out-of-memory error

I have a route defined in Camel that goes something like this: GET request comes in, a file gets created in the file system. File consumer picks it up, fetches data from external web services, and sends the resulting message by POST to other web services.
Simplified code below:
// Update request goes on queue:
from("restlet:http://localhost:9191/update?restletMethod=post")
    .routeId("Update via POST")
    [...some magic that defines a directory and file name based on request headers...]
    .to("file://cameldest/queue?allowNullBody=true&fileExist=Ignore");
// Update gets processed
from("file://cameldest/queue?delay=500&recursive=true&maxDepth=2&sortBy=file:parent;file:modified&preMove=inprogress&delete=true")
    .routeId("Update main route")
    .streamCaching() // otherwise stuff can't be sent to multiple endpoints
    [...enrich message from some web service using http4 component...]
    .multicast()
        .stopOnException()
        .to("direct:sendUpdate", "direct:dependencyCheck", "direct:saveXML")
    .end();
The three endpoints in the multicast are simply POSTing the resulting message to other web services.
This all works rather well when the queue (i.e. the file directory cameldest) is fairly empty. Files are being created in cameldest/<subdir>, picked up by the file consumer and moved into cameldest/<subdir>/inprogress, and stuff is being sent to the three outgoing POSTs no problem.
However, once the incoming requests pile up to about 300,000 files progress slows down and eventually the pipeline fails due to out-of-memory errors (GC overhead limit exceeded).
By increasing logging I can see that the file consumer polling basically never runs: it appears to take responsibility for all the files it sees on each poll, waits for them to finish processing, and only then starts another poll round. Besides (I assume) causing the resource bottleneck, this also interferes with my sorting requirements. Once the queue is jammed with thousands of messages waiting to be processed, new messages that would naively be sorted higher up are (if they even still get picked up) stuck waiting behind those that are already "started".
Now, I've tried the maxMessagesPerPoll and eagerMaxMessagesPerPoll options. They seem to alleviate the problem at first, but after a number of poll rounds I still end up with thousands of files in "started" limbo.
The only thing that sort of worked was making the bottleneck of delay and maxMessages... so narrow that processing would, on average, finish faster than the file polling cycle.
Clearly, that is not what I want. I would like my pipeline to process files as fast as possible, but not faster. I was expecting the file consumer to wait when the route is busy.
Am I making an obvious mistake?
(I'm running a somewhat older Camel 2.14.0 on a Redhat 7 machine with XFS, if that is part of the problem.)
Try setting maxMessagesPerPoll to a low value on the from file endpoint so that at most X files are picked up per poll; this also limits the total number of in-flight messages in your Camel application.
You can find more information about that option in the Camel documentation for the file component
The short answer is that there is no answer: the sortBy option of Camel's file component is simply too memory-inefficient to accommodate my use case, which has these requirements:
Uniqueness: I don't want to put a file on queue if it's already there.
Priority: Files flagged as high priority should be processed first.
Performance: Having a few hundred thousands of files, or maybe even a few million, should be no problem.
FIFO: (Bonus) Oldest files (by priority) should be picked up first.
The problem appears to be, if I read the source code and the documentation correctly, that all file details are in memory to perform the sorting, no matter whether the built-in language or a custom pluggable sorter is used. The file component always creates a list of objects containing all details, and that apparently causes an insane amount of garbage collection overhead when polling many files often.
I got my use case to work, mostly, without having to resort to using a database or writing a custom component, using the following steps:
Move from one file consumer on the parent directory cameldest/queue that sorts recursively the files in the subdirectories (cameldest/queue/high/ before cameldest/queue/low/) to two consumers, one for each directory, with no sorting at all.
Set up only the consumer from /cameldest/queue/high/ to process files through my actual business logic.
Set up the consumer from /cameldest/queue/low to simply promote files from "low" to "high" (copying them over, i.e. .to("file://cameldest/queue/high");)
Crucially, in order to promote from "low" to "high" only when "high" is not busy, attach a route policy to "high" that throttles the other route, i.e. suspends "low" while there are any messages in flight in "high".
Additionally, I added a ThrottlingInflightRoutePolicy to "high" to prevent it from inflighting too many exchanges at once.
Imagine this like at check-in at the airport, where tourist travellers are invited over into the business class lane if that is empty.
This worked like a charm under load, and even while hundreds of thousands of files were on queue in "low", new messages (files) dropped directly into "high" got processed within seconds.
The only requirement that this solution doesn't cover is ordering: there is no guarantee that older files are picked up first; rather, they are picked up in no particular order. One could imagine a situation where a steady stream of incoming files results in one particular file X always being unlucky and never being picked up. The chance of that happening, though, is very low.
Possible improvement: Currently the threshold for allowing / suspending the promotion of files from "low" to "high" is set to 0 messages inflight in "high". On the one hand, this guarantees that files dropped into "high" will be processed before another promotion from "low" is performed, on the other hand it leads to a bit of a stop-start-pattern, especially in a multi-threaded scenario. Not a real problem though, the performance as-is was impressive.
Source:
My route definitions:
ThrottlingInflightRoutePolicy trp = new ThrottlingInflightRoutePolicy();
trp.setMaxInflightExchanges(50);
SuspendOtherRoutePolicy sorp = new SuspendOtherRoutePolicy("lowPriority");
from("file://cameldest/queue/low?delay=500&maxMessagesPerPoll=25&preMove=inprogress&delete=true")
.routeId("lowPriority")
.log("Copying over to high priority: ${in.headers."+Exchange.FILE_PATH+"}")
.to("file://cameldest/queue/high");
from("file://cameldest/queue/high?delay=500&maxMessagesPerPoll=25&preMove=inprogress&delete=true")
.routeId("highPriority")
.routePolicy(trp)
.routePolicy(sorp)
.threads(20)
.log("Before: ${in.headers."+Exchange.FILE_PATH+"}")
.delay(2000) // This is where business logic would happen
.log("After: ${in.headers."+Exchange.FILE_PATH+"}")
.stop();
My SuspendOtherRoutePolicy, loosely built like ThrottlingInflightRoutePolicy
public class SuspendOtherRoutePolicy extends RoutePolicySupport implements CamelContextAware {
private CamelContext camelContext;
private final Lock lock = new ReentrantLock();
private String otherRouteId;
public SuspendOtherRoutePolicy(String otherRouteId) {
super();
this.otherRouteId = otherRouteId;
}
@Override
public CamelContext getCamelContext() {
return camelContext;
}
@Override
public void onStart(Route route) {
super.onStart(route);
if (camelContext.getRoute(otherRouteId) == null) {
throw new IllegalArgumentException("There is no route with the id '" + otherRouteId + "'");
}
}
@Override
public void setCamelContext(CamelContext context) {
camelContext = context;
}
@Override
public void onExchangeDone(Route route, Exchange exchange) {
//log.info("Exchange done on route " + route);
Route otherRoute = camelContext.getRoute(otherRouteId);
//log.info("Other route: " + otherRoute);
throttle(route, otherRoute, exchange);
}
protected void throttle(Route route, Route otherRoute, Exchange exchange) {
// this works the best when this logic is executed when the exchange is done
Consumer consumer = otherRoute.getConsumer();
int size = getSize(route, exchange);
boolean stop = size > 0;
if (stop) {
try {
lock.lock();
stopConsumer(size, consumer);
} catch (Exception e) {
handleException(e);
} finally {
lock.unlock();
}
}
// re-read the size in case of a race condition where too many are invoked at once,
// so that we use the most current size and start the consumer if we are already too low
size = getSize(route, exchange);
boolean start = size == 0;
if (start) {
try {
lock.lock();
startConsumer(size, consumer);
} catch (Exception e) {
handleException(e);
} finally {
lock.unlock();
}
}
}
private int getSize(Route route, Exchange exchange) {
return exchange.getContext().getInflightRepository().size(route.getId());
}
private void startConsumer(int size, Consumer consumer) throws Exception {
boolean started = super.startConsumer(consumer);
if (started) {
log.info("Resuming the other consumer " + consumer);
}
}
private void stopConsumer(int size, Consumer consumer) throws Exception {
boolean stopped = super.stopConsumer(consumer);
if (stopped) {
log.info("Suspending the other consumer " + consumer);
}
}
}
I would propose an alternative solution unless you really need to save the data as files.
From your restlet consumer, send each request to a message queuing application such as ActiveMQ or RabbitMQ or something similar. You will quickly end up with lots of messages on that queue, but that is OK.
Then replace your file consumer with a queue consumer. It will take some time, but each message should be processed separately and sent wherever you want. I have tested RabbitMQ with about 500,000 messages and that worked fine. This should reduce the load on the consumer as well.

ExecutorService asynchronous never end the main method

I have to do a massive upload to a certain server. My program does each upload in about 1 second, but I will upload around 10 thousand documents each time.
To do this, I thought of using parallelism to send the documents in a reasonable time. (My server already accepts a lot of simultaneous requests.)
So I created a task (implementing Callable) that uploads my document,
and put it in an ExecutorService.
ExecutorUpload.java
public void execute() {
    ExecutorService executor = Executors.newWorkStealingPool();
    // this code creates the InputStream objects from
    // real files on disk using java.io.InputStream and java.io.File
    List<CallbackCreateDocumentTask> tasks = createTasks(...);
    // this list holds the Future objects, to try to terminate the ExecutorService
    List<Future<DocumentDTO>> futures = new CopyOnWriteArrayList<>();
    // iterate over the list and submit the tasks
    tasks.stream().forEach((task) -> futures.add(executor.submit(task)));
    executor.shutdown();
    // here I was trying to stop the executor,
    // but this makes me lose the asynchronous upload because
    // the application stops to wait for this one ExecutorService to terminate,
    // and I have more than one ExecutorService doing uploads
    //executor.awaitTermination(10, TimeUnit.MINUTES);
}
App.java
public static void main (String[] args){
new ExecutorUpload().execute();
}
This code works. All my documents were uploaded by the tasks, in every ExecutorService instance I created.
But my application never ends. It's as if it stays waiting for more unfinished ExecutorServices.
Does anybody know why my application never ends?
I suspect a few things:
Java 8 never closes the main method
My ExecutorServices run in a never-ending thread, or something like that
My InputStreams are not being closed, making the main method wait for them
Something with the File, not the InputStream, related to the Files not being closed either...

Parallel java application not receiving CPU to finish task

I have a parallel Java application that consumes huge log files and applies some custom logic. Each log row is processed in a separate thread using a fire-and-forget approach.
However, sometimes the Java process just stops processing. What I mean is that the application doesn't seem to be assigned any CPU, even though it hasn't finished consuming the file.
Running top, I get quite a low load average considering the 16 cores that I have.
Running vmstat, I can see that neither user processes nor kernel processes are running; the machine is 99% idle.
The output of iostat shows that there are no pending I/O tasks either.
I also haven't spotted any deadlocks or starvation in a thread dump. Most of the threads are WAITING or RUNNABLE.
What am I missing? I'm lost, and I don't really know where to investigate further.
=UPDATE=
This is the part that initiates the parallel execution. After it there are thousands of lines of code applying modifications, incl. Elasticsearch, Akka, etc.,
so I don't really know which code would be relevant to the problem.
BlockingQueue<Runnable> workQueue = new ArrayBlockingQueue<Runnable>(100);
ExecutorService executorService = new MetricsThreadPoolExecutor(numThreadCore, numThreadCore, idleTime, TimeUnit.SECONDS, workQueue, new ThreadPoolExecutor.AbortPolicy(), "process.concurrent", metrics);
FileInputStream fileStream = new FileInputStream(file);
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new GZIPInputStream(fileStream)));
String strRow = bufferedReader.readLine();
while (strRow != null) {
    final Row row = new Row(strRow);
    try {
        executorService.submit(new Runnable() {
            @Override
            public void run() {
                if (!StringUtil.isBlank(row.getLine())) {
                    processor.process(row);
                }
            }
        });
        // advance only after a successful submit; a rejected row is retried
        strRow = bufferedReader.readLine();
    } catch (RejectedExecutionException ree) {
        // the queue is full: back off briefly, then retry the same row
        try {
            logger.warn(ree.getMessage());
            Thread.sleep(50L);
        } catch (InterruptedException ie) {
            logger.warn("Wait interrupted", ie);
        }
    }
}
Don't think about this at the CPU/vmstat/iostat level. That's just confusing the debugging of the problem. You should think about this in terms of threads only and trust the OS to schedule them appropriately.
I see no reason why the main thread shouldn't finish after all of the rows have been submitted for processing. As an aside, you may want to just block the producer instead of regenerating the rows in your spin/sleep loop like you are doing. See: RejectedExecutionException free threads but full queue
If your application is not completing, then either one of the worker threads is hung while processing a row, or the MetricsThreadPoolExecutor has not been shut down. I suspect the latter. After it exits the while (strRow != null) { loop, the producer thread should call executorService.shutdown(). Otherwise the worker threads will keep waiting for more rows to be added.
You could take a thread dump of your application to see if it is stuck in a worker. You could also add logging when the producer thread finishes, which would tell you whether it completed its work. Both might help figure out where the problem lies.
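To make the shutdown point concrete, here is a small self-contained sketch. It uses a plain ThreadPoolExecutor rather than the question's MetricsThreadPoolExecutor, and it swaps in the JDK's CallerRunsPolicy as one simple way to slow the producer down when the queue is full; the class and method names are invented for the example, which assumes Java 8.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ProducerShutdownDemo {

    /** Submits one task per row, then shuts the pool down and waits for it to drain. */
    public static int processAll(List<String> rows, int threads) throws InterruptedException {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(100),
                // when the queue is full the submitting thread runs the task itself,
                // which throttles the producer without a sleep/retry loop
                new ThreadPoolExecutor.CallerRunsPolicy());
        final AtomicInteger processed = new AtomicInteger();
        for (final String row : rows) {
            executor.execute(() -> {
                if (!row.trim().isEmpty()) {
                    processed.incrementAndGet(); // stand-in for processor.process(row)
                }
            });
        }
        // Without this call the non-daemon core threads stay parked waiting for
        // more work, and the JVM does not exit.
        executor.shutdown();
        if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
            executor.shutdownNow(); // give up on stragglers after the timeout
        }
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int count = processAll(Arrays.asList("row 1", "", "row 2", "row 3"), 4);
        System.out.println("processed " + count + " non-blank rows");
        // prints: processed 3 non-blank rows
    }
}
```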

64 bit Centos Java JVM unable to create a native thread

I am using the Java ExecutorService to create a single thread.
Code:
ExecutorService executor = Executors.newSingleThreadExecutor();
try {
    executor.submit(new Runnable() {
        @Override
        public void run() {
            Iterator<FileObject> itr = mysortedList.iterator();
            while (itr.hasNext()) {
                myWebFunction(itr.next());
            }
        }
    }).get(Timeout * mysortedList.size() - 10, TimeUnit.SECONDS);
} catch (Exception ex) {
    // exceptions (including TimeoutException) are silently swallowed here
} finally {
    executor.shutdownNow();
}
Details: myWebFunction processes files of different sizes and content. Processing involves extracting the entire content and applying further actions to it.
The program runs on 64-bit CentOS.
Problem: When myWebFunction gets a file larger than some threshold, say 10 MB, the executor service is unable to create a native thread. I tried various -Xmx and -Xms settings, but the executor service still throws the same error.
My guess is that you are calling this many times and not waiting for the thread that has timed out, leaving lots of threads lying around. When you run out of stack space, or you reach about 32K threads, you cannot create any more.
I suggest using a different approach that doesn't use so many threads, or that kills them off when you know you don't need them any more. E.g. have the while loop check for interrupts, and call Future.cancel(true) to interrupt it.
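A minimal sketch of that suggestion follows, assuming Java 8. The processFile method is a hypothetical stand-in for myWebFunction that just sleeps, and the class and method names are invented. The worker loop checks the interrupt flag between files, and the caller waits for the worker thread to exit before returning, so abandoned threads cannot accumulate across calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CancellableLoop {

    /** Hypothetical stand-in for myWebFunction: pretends each file takes 50 ms. */
    static void processFile(String name) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the flag so the loop can see it
        }
    }

    /** Returns true if all files were processed within the timeout, false if cancelled. */
    public static boolean runWithTimeout(final List<String> files, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Runnable job = () -> {
            for (String file : files) {
                if (Thread.currentThread().isInterrupted()) {
                    return; // stop promptly once cancelled
                }
                processFile(file);
            }
        };
        Future<?> future = executor.submit(job);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return false;
        } finally {
            executor.shutdownNow();
            // wait until the worker thread has really gone away
            executor.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> small = new ArrayList<>();
        small.add("a.txt");
        small.add("b.txt");
        List<String> large = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            large.add("file" + i);
        }
        System.out.println("small batch finished: " + runWithTimeout(small, 2000));
        System.out.println("large batch finished: " + runWithTimeout(large, 200));
        // prints "small batch finished: true" then "large batch finished: false"
    }
}
```

Interruption only helps if the real work responds to it; blocking I/O that ignores interrupts would need its own timeout.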

Out of Memory exception while running multithreaded code

I am working on a project that has different bundles. For example, suppose I have 5 bundles, and each of them has a method named process.
Currently, I call the process method of all 5 bundles in parallel using the multithreaded code below.
But every time I run the multithreaded code, it gives me an out-of-memory exception. If I run it sequentially, meaning I call the process methods one by one, I don't get any out-of-memory exception.
Below is the code:
public void callBundles(final Map<String, Object> eventData) {
    // Three threads: one thread for the database writer, two threads for the plugin processors
    final ExecutorService executor = Executors.newFixedThreadPool(3);
    final Map<String, String> outputs = (Map<String, String>)eventData.get(Constants.EVENT_HOLDER);
    for (final BundleRegistration.BundlesHolderEntry entry : BundleRegistration.getInstance()) {
        executor.submit(new Runnable () {
            public void run() {
                try {
                    final Map<String, String> response = entry.getPlugin().process(outputs);
                    //process the response and update database.
                    System.out.println(response);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }
}
Below is the exception I get whenever I run the multithreaded code above.
JVMDUMP006I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" - please wait.
JVMDUMP032I JVM requested Heap dump using 'S:\GitViews\Stream\goldseye\heapdump.20130904.175256.12608.0001.phd' in response to an event
JVMDUMP010I Heap dump written to S:\GitViews\Stream\goldseye\heapdump.20130904.175256.12608.0001.phd
JVMDUMP032I JVM requested Java dump using 'S:\GitViews\Stream\goldseye\javacore.20130904.175256.12608.0002.txt' in response to an event
UTE430: can't allocate buffer
UTE437: Unable to load formatStrings for j9mm
JVMDUMP010I Java dump written to S:\GitViews\Stream\goldseye\javacore.20130904.175256.12608.0002.txt
JVMDUMP032I JVM requested Snap dump using 'S:\GitViews\Stream\goldseye\Snap.20130904.175256.12608.0003.trc' in response to an event
UTE001: Error starting trace thread for "Snap Dump Thread": -1
JVMDUMP010I Snap dump written to S:\GitViews\Stream\goldseye\Snap.20130904.175256.12608.0003.trc
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
ERROR: Bundle BullseyeModellingFramework [1] EventDispatcher: Error during dispatch. (java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830, errno 12)
java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830, errno 12
JVMDUMP006I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" - please wait.
JVMDUMP032I JVM requested Heap dump using 'S:\GitViews\Stream\goldseye\heapdump.20130904.175302.12608.0004.phd' in response to an event
JVMDUMP010I Heap dump written to S:\GitViews\Stream\goldseye\heapdump.20130904.175302.12608.0004.phd
JVMDUMP032I JVM requested Java dump using 'S:\GitViews\Stream\goldseye\javacore.20130904.175302.12608.0005.txt' in response to an event
I am using JDK1.6.0_26 as the installed JRE in Eclipse.
Each call of callBundles() creates a new thread pool by creating its own executor. Each thread has its own stack space! So after the JVM starts, the first call creates three threads that together reserve about 3 MB of stack memory (1024k is the default stack size on a 64-bit JVM), the next call another 3 MB, and so on. 1000 calls/s would need 3 GB/s!
The second problem is that you never call shutdown() on the executor services you create, so their threads live on. Do not rely on the garbage collector here: finalize() does call shutdown(), but only if the executor is ever collected, and collection is driven by heap usage, not by stack memory. So if stack memory is the problem and the heap is not full, the GC will never help!
You need to use one ExecutorService, let's say with 10 to 30 threads, or a custom ThreadPoolExecutor with 3-30 cached threads and a LinkedBlockingQueue. Call shutdown() on the service before your application stops, if possible.
Check the physical RAM, load and response time of your application to tune the heap size, the maximum number of threads and the keep-alive time of the threads in the pool. Have a look at other contended parts of the code (size of a database connection pool, ...) and the number of CPUs/cores of your server. A starting point for the thread pool size may be the number of CPUs/cores plus 1; with a lot of I/O wait, more threads become useful.
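As a rough sketch of the "one ExecutorService" advice, the example below creates a single pool once, reuses it for every call, and shuts it down at the end. It assumes Java 8; the class SharedBundlePool, the Callable-based bundles and the returned strings are invented for illustration and are not the question's BundleRegistration API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SharedBundlePool {
    // One pool for the lifetime of the application, not one per call.
    private final ExecutorService executor;

    public SharedBundlePool(int threads) {
        this.executor = Executors.newFixedThreadPool(threads);
    }

    /** Runs all bundles on the shared pool and collects their results in order. */
    public List<String> callBundles(List<Callable<String>> bundles)
            throws InterruptedException, ExecutionException {
        List<String> results = new ArrayList<>();
        // invokeAll blocks until every task has completed
        for (Future<String> future : executor.invokeAll(bundles)) {
            results.add(future.get());
        }
        return results;
    }

    /** Call once, before the application stops. */
    public void shutdown() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws Exception {
        // cores + 1 is only the starting point suggested above; tune by measuring
        int threads = Runtime.getRuntime().availableProcessors() + 1;
        SharedBundlePool pool = new SharedBundlePool(threads);
        List<Callable<String>> bundles = new ArrayList<>();
        bundles.add(() -> "bundle-1 done");
        bundles.add(() -> "bundle-2 done");
        // Many calls reuse the same threads instead of creating new ones each time.
        List<String> last = null;
        for (int call = 0; call < 1000; call++) {
            last = pool.callBundles(bundles);
        }
        System.out.println(last);
        // prints: [bundle-1 done, bundle-2 done]
        pool.shutdown();
    }
}
```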
The main problem is that you aren't really using the thread pooling properly. If all of your "process" threads are of equal priority, there's no good reason not to make one large thread pool and submit all of your Runnable tasks to that. Note - "large" in this case is determined via experimentation and profiling: adjust it until your performance in terms of speed and memory is what you expect.
Here is an example of what I'm describing:
// Using 10000 purely as a concrete example - you should define the correct number
public static final int LARGE_NUMBER_OF_THREADS = 10000;
// Elsewhere in code, you defined a static thread pool
public static final ExecutorService EXECUTOR =
    Executors.newFixedThreadPool(LARGE_NUMBER_OF_THREADS);

public void callBundles(final Map<String, Object> eventData) {
    final Map<String, String> outputs =
        (Map<String, String>) eventData.get(Constants.EVENT_HOLDER);
    for (final BundleRegistration.BundlesHolderEntry entry : BundleRegistration.getInstance()) {
        // "Three threads: one thread for the database writer,
        // two threads for the plugin processors"
        // so you'll need to repeat this future = E.submit() pattern two more times
        Future<?> processFuture = EXECUTOR.submit(new Runnable() {
            public void run() {
                final Map<String, String> response =
                    entry.getPlugin().process(outputs);
                //process the response and update database.
                System.out.println(response);
            }
        });
        // Note, I'm catching the exception out here instead of inside the task
        // This also allows me to force order on the three component threads
        try {
            processFuture.get();
        } catch (Exception e) {
            System.err.println("Should really do something more useful");
            e.printStackTrace();
        }
        // If you wanted to ensure that the three component tasks run in order,
        // you could future = executor.submit(); future.get();
        // for each one of them
    }
}
For completeness, you could also use a cached thread pool to avoid repeated creation of short-lived Threads. However, if you're already worried about memory consumption, a fixed pool might be better.
When you get to Java 7, you might find that Fork-Join is a better pattern than a series of Futures. Whatever fits your needs best, though.
