In my application, I have a set of scheduled tasks which process all users' data and perform various operations such as earnings and tax calculations and statement generation for all users. Since these processes need to run for each user, they take a lot of time (many hours) because of the large number of users. Data processing for one user is completely independent of another user's, so the work can run in parallel. What are my options here? What are the best practices for doing such large/bulk operations?
We are using the J2SE platform with Spring, JPA and Hibernate.
You should be doing this via batch processing.
Since you have mentioned that you are using Spring, you can consider Spring Batch.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that will enable extremely high-volume and high performance batch jobs through optimization and partitioning techniques.
Check out the reference manual for implementation details.
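For the per-user processing in the question, the usual Spring Batch shape is a chunk-oriented step: read users, process each one (earnings, tax, statement), write the results, optionally running chunks on multiple threads. A rough sketch in Java config; User, Statement and the reader/processor/writer beans are illustrative placeholders, not from the question:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
@EnableBatchProcessing
public class UserProcessingJobConfig {

    // The reader would typically be a paging reader over the user table;
    // the processor does the per-user calculation, the writer stores the result.
    @Bean
    public Step processUsersStep(StepBuilderFactory steps,
                                 ItemReader<User> userReader,
                                 ItemProcessor<User, Statement> statementProcessor,
                                 ItemWriter<Statement> statementWriter) {
        return steps.get("processUsersStep")
                .<User, Statement>chunk(100)              // commit every 100 users
                .reader(userReader)
                .processor(statementProcessor)
                .writer(statementWriter)
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .throttleLimit(8)                         // process users on up to 8 threads
                .build();
    }

    @Bean
    public Job userProcessingJob(JobBuilderFactory jobs, Step processUsersStep) {
        return jobs.get("userProcessingJob").start(processUsersStep).build();
    }
}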
Have a look at the Spring documentation for scheduled tasks: http://static.springsource.org/spring/docs/3.0.x/reference/scheduling.html
I find the easiest way to do scheduled tasks is to use the @Scheduled annotation, and you can use cron-style timing. For each of your users, you could run the task in a separate Thread:
@Scheduled(cron = "*/5 * * * * MON-FRI")
public void doSomething() {
    List<User> users = getUsers();
    for (User user : users) {
        MyTask task = new MyTask(user);
        Thread t = new Thread(task);
        t.start();
    }
}
Make sure MyTask is a Runnable.
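If the user list is large, starting one raw Thread per user can exhaust memory and CPU. A bounded pool is usually safer; here is a sketch under the same assumptions (MyTask and User as in the snippet above, with an illustrative class name):

import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class UserTaskScheduler {

    // Bounded pool: at most 8 user tasks run concurrently instead of one thread per user.
    private final ExecutorService userTaskPool = Executors.newFixedThreadPool(8);

    @Scheduled(cron = "*/5 * * * * MON-FRI")
    public void doSomething() {
        for (User user : getUsers()) {
            userTaskPool.submit(new MyTask(user)); // MyTask is the Runnable from above
        }
    }

    private List<User> getUsers() {
        return Collections.emptyList(); // placeholder for the real user lookup
    }
}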
If you have mountains of data and users, you could look at Spring Batch: http://static.springsource.org/spring-batch/
We have a mobile app which presents a feed to users. The feed REST API is implemented on Tomcat and makes parallel calls to different data sources such as Couchbase and MySQL to assemble the content. The simplified code is given below:
Future<List<CardDTO>> pnrFuture = null;
Future<List<CardDTO>> newsFuture = null;

ExecutionContext ec = ExecutionContexts.fromExecutorService(executor);

final List<CardDTO> combinedDTOs = new ArrayList<CardDTO>();

// Array list of futures
List<Future<List<CardDTO>>> futures = new ArrayList<Future<List<CardDTO>>>();

futures.add(future(new PNRFuture(pnrService, userId), ec));
futures.add(future(new NewsFuture(newsService, userId), ec));
futures.add(future(new SettingsFuture(userPreferenceManager, userId), ec));

Future<Iterable<List<CardDTO>>> futuresSequence = sequence(futures, ec);

// combine the cards
Future<List<CardDTO>> futureSum = futuresSequence.map(
        new Mapper<Iterable<List<CardDTO>>, List<CardDTO>>() {
            @Override
            public List<CardDTO> apply(Iterable<List<CardDTO>> allDTOs) {
                for (List<CardDTO> cardDTOs : allDTOs) {
                    if (cardDTOs != null) {
                        combinedDTOs.addAll(cardDTOs);
                    }
                }
                Collections.sort(combinedDTOs);
                return combinedDTOs;
            }
        }
);

Await.result(futureSum, Duration.Inf());
return combinedDTOs;
Right now we have around 4-5 parallel tasks per request, but this is expected to grow to almost 20-25 parallel tasks as we introduce new kinds of items in the feed.
My question is: how can I improve this design? What kind of tuning is required in Tomcat to make sure such 20-25 parallel calls can be served optimally under heavy load?
I understand this is a broad topic, but any suggestions would be very helpful.
Tomcat just manages the incoming HTTP connections and pushes the bytes back and forth. There is no Tomcat optimization that can be done to make your application run any better.
If you need 25 parallel processes to run for each incoming HTTP request, and you think that's crazy, then you need to re-think how your application works.
No Tomcat configuration will help with what you've presented in your question.
I understand you are calling this from a mobile app and the number of feeds could go up.
Based on the amount of data being returned, would it be possible to return the results of some feeds in the same call?
That way the server does the work.
You are in control of the server - you are not in control of the users device and their connection speed.
As nickebbit suggested, things like DeferredResult are really easy to implement.
Is it possible that the data from these feeds is not updated very frequently? If so, you should investigate the use of EHCache and the @Cacheable annotation.
You could come up with a solution where the user is always pulling a cached version of your content from your Tomcat server, while the server constantly refreshes that cache in the background.
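A minimal sketch of that approach using Spring's cache abstraction (which can be backed by EHCache); the service, method and cache names are illustrative, and CardDTO is assumed to be the type from the question:

import java.util.Collections;
import java.util.List;

import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

@Service
public class FeedService {

    // Requests read from the "feedCards" cache; only a cache miss hits the data sources.
    @Cacheable(value = "feedCards", key = "#userId")
    public List<CardDTO> getFeed(String userId) {
        return loadFeedFromSources(userId); // the slow Couchbase/MySQL fan-out
    }

    // Background refresh: evict periodically so the next request rebuilds the entry
    // (or rebuild entries proactively here instead of evicting).
    @CacheEvict(value = "feedCards", allEntries = true)
    @Scheduled(fixedDelay = 60_000)
    public void refreshFeedCache() {
    }

    private List<CardDTO> loadFeedFromSources(String userId) {
        return Collections.emptyList(); // placeholder for the existing fan-out code
    }
}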
It's an extra piece of work, but at the end of the day, if the user experience is not fast, users will not want to use the app.
It looks like you're using Akka but not really embracing the actor model; doing so will likely increase the parallelism and therefore the scalability of your app.
If it were me, I'd hand requests off from the REST API to a single coordinating actor (or a pool of them) that processes the request asynchronously. With Spring's @RestController this can be done using a Callable or DeferredResult, but there will obviously be an equivalent in whatever framework you are using.
This coordinating actor would then in turn hand off processing to other actors (i.e. workers) that take care of the I/O bound tasks (preferably using their own dispatcher to ensure other CPU bound threads do not get blocked) and respond to the coordinator with their results.
Once all workers have fetched their data and replied to the coordinator with their results, the original request can be completed with the full result set.
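A sketch of the DeferredResult hand-off in a Spring @RestController; the controller and method names are illustrative, CardDTO is assumed to be the question's type, and a plain executor stands in for the coordinating actor:

import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.context.request.async.DeferredResult;

@RestController
public class FeedController {

    // Stand-in for the coordinating actor: in the actor variant, the coordinator's
    // reply (e.g. via an ask) would complete the DeferredResult instead.
    private final ExecutorService coordinator = Executors.newFixedThreadPool(4);

    @RequestMapping("/feed/{userId}")
    public DeferredResult<List<CardDTO>> feed(@PathVariable String userId) {
        DeferredResult<List<CardDTO>> result = new DeferredResult<>(30_000L); // 30s timeout
        coordinator.submit(() -> {
            try {
                result.setResult(buildFeed(userId));   // fan-out to the workers happens here
            } catch (Exception e) {
                result.setErrorResult(e);
            }
        });
        return result;                                 // the servlet thread is released immediately
    }

    private List<CardDTO> buildFeed(String userId) {
        return Collections.emptyList();                // placeholder for the real aggregation
    }
}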
I have run into a case where I have to use a persistent scheduler, since I have a web application that can crash or close due to some problems and might lose its job details if that happens. I have tried the following:
Use Quartz scheduler:
I used RAMJobStore first, but since it isn't persistent, it wasn't of much help. I can't set up JDBCJobStore because that would require huge changes to my existing code base.
In light of this scenario, I have the following queries:
If I use Spring's built-in @Scheduled annotation, will my jobs be persistent? I don't mind if the jobs get scheduled after the application starts. All I want is for the jobs not to lose their details and triggers.
If not, are there any other alternatives, keeping in mind that I need to schedule multiple jobs with my scheduler?
If yes, how can I achieve this? My triggers are different for each job; for example, I might have a job scheduled at 9 AM and another at 8:30 AM, and so on.
If not a scheduler, is there another mechanism to handle this?
One thing I found is that the documentation for Quartz isn't very descriptive. It's fine for top-level configuration, but configuring it in your own application is a pain. This is just a side note, nothing to do with the question.
Appreciate the help. :)
No, Spring's @Scheduled annotation will typically only instruct Spring at what times a certain task should be scheduled to run within the current VM. As far as I know there is no context for the execution either. The schedule is static.
I had a similar requirement and created db-scheduler (https://github.com/kagkarlsson/db-scheduler), a simple, persistent and cluster-friendly scheduler. It stores the next execution-time in the database, and triggers execution once it is reached.
A very simple example for a RecurringTask without context could look like this:
final RecurringTask myDailyTask = ComposableTask.recurringTask("my-daily-task",
        Schedules.daily(LocalTime.of(8, 0)),
        () -> System.out.println("Executed!"));

final Scheduler scheduler = Scheduler
        .create(dataSource)
        .startTasks(myDailyTask)
        .threads(5)
        .build();

scheduler.start();
It will execute the task named my-daily-task at 08:00 every day. It will be scheduled in the database when the scheduler is first started, unless it already exists in the database.
If you want to schedule an ad-hoc task some time in the future with context, you can use the OneTimeTask:
final OneTimeTask oneTimeTask = ComposableTask.onetimeTask("my-onetime-task",
        (taskInstance, context) -> System.out.println("One-time task with identifier " + taskInstance.getId() + " executed!"));

scheduler.scheduleForExecution(LocalDateTime.now().plusDays(1), oneTimeTask.instance("1001"));
See the example above. Any number of tasks can be scheduled, as long as the task name and instance identifier are unique.
@Scheduled has nothing to do with the actual executor. The default Java executors aren't persistent (maybe there are some app-server-specific ones that are); if you want persistence, you have to use Quartz for job execution.
In my web service, all method calls submit jobs to a queue. These operations take a long time to execute, so each one submits a Job to a queue and returns a status saying "Submitted". The client then keeps polling another service method to check the status of the job.
Presently, what I do is create my own Queue and Job classes that are Serializable and persist these jobs (i.e., their serialized byte streams) into the database. So an UpdateLogistics operation just queues up an "UpdateLogisticsJob" and returns. I have written my own JobExecutor which wakes up every N seconds, scans the database table for pending jobs, and executes them. Note the jobs have to be persisted because they must survive app-server crashes.
This was done a long time ago, and I used bespoke classes for my Queues, Jobs, Executors, etc. But now I would like to know whether someone has done something similar before. In particular:
Are there frameworks available for this? Something in Spring/Apache, etc.?
Any framework that is easy to adapt/debug and plays well with libraries like Spring would be great.
EDIT - Quartz
Sorry if I had not explained enough. Quartz is good for stateless jobs (and also for some stateful ones), but the key for me is very stateful, persisted "job instances" (not just jobs or tasks). So, for example, an operation like executeWorkflow("SUBMIT_LEAVE") might actually create 5 job instances, each with at least 5-10 parameters like userId, accountId, etc., to be saved into the database.
I was looking for support in that area, where job instances can be saved into the DB and recreated, etc.
Take a look at JBoss jBPM. It's a workflow definition package that lets you mix automated and manual processes. Tasks are persisted to a DB back end, and it looks like it has some asynchronous execution properties.
I haven't used Quartz for a long time, but I suspect it would be capable of everything you want to do.
spring-batch plus quartz
Depending upon the nature of your job, you might also look into spring-integration to assist with queue processing. But spring-batch will probably handle most of your requirements.
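The "spring-batch plus quartz" combination usually means a Quartz job whose only responsibility is to launch a Spring Batch job: Quartz handles triggering (and, with a JDBC job store, persistence), while Batch handles restart/skip and the actual processing. A rough sketch; the class name is illustrative and the wiring of jobLauncher and batchJob is omitted:

import org.quartz.JobExecutionContext;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.scheduling.quartz.QuartzJobBean;

// Quartz entry point that simply hands off to Spring Batch.
public class LaunchBatchJob extends QuartzJobBean {

    private JobLauncher jobLauncher; // injected
    private Job batchJob;            // injected

    public void setJobLauncher(JobLauncher jobLauncher) { this.jobLauncher = jobLauncher; }
    public void setBatchJob(Job batchJob) { this.batchJob = batchJob; }

    @Override
    protected void executeInternal(JobExecutionContext context) {
        try {
            // A fresh timestamp parameter makes each run a new Batch job instance.
            jobLauncher.run(batchJob, new JobParametersBuilder()
                    .addLong("run.timestamp", System.currentTimeMillis())
                    .toJobParameters());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}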
Please try ted-driver (https://github.com/labai/ted)
Its purpose is similar to what you need: you create a task (or many of them), which is saved in the DB, and then ted-driver is responsible for executing it. On error you can postpone a retry for later or finish with an error status.
Unlike other Java frameworks, here tasks are kept in a simple and clear structure in the database, where you can manually search or update them using standard SQL.
I have a web application that synchronizes with a central database four times per hour. The process usually takes 2 minutes. I would like to run this process as a thread at X:55, X:10, X:25, and X:40 so that users know that at X:00, X:15, X:30, and X:45 they have a clean copy of the database. It is just about managing expectations. I have gone through the executor in java.util.concurrent, but the scheduling is done with scheduleAtFixedRate, which I believe provides no guarantee about when the task actually runs relative to the hour. I could use an initial delay so that the first run is close to the launch time and then schedule every 15 minutes, but it seems this would probably drift over time. Is there an easier way to schedule the thread to run 5 minutes before every quarter hour?
You can let the Runnable schedule its "next run".
For example:
class Task implements Runnable {
    private final ScheduledExecutorService service;

    public Task(ScheduledExecutorService service) {
        this.service = service;
    }

    public void run() {
        try {
            // do stuff
        } finally {
            // Prevent this task from stalling due to RuntimeExceptions.
            // Calculate how many ms until the next launch, then re-schedule.
            long untilNextInvocation = millisUntilNextLaunch();
            service.schedule(new Task(service), untilNextInvocation, TimeUnit.MILLISECONDS);
        }
    }
}
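For the schedule in the question (five minutes before every quarter hour), the delay calculation could look like this; a sketch using java.time, with millisUntilNextLaunch being the illustrative helper called above:

import java.time.Duration;
import java.time.LocalDateTime;

// Milliseconds until the next X:10, X:25, X:40 or X:55
// (i.e. five minutes before every quarter hour).
static long millisUntilNextLaunch() {
    LocalDateTime now = LocalDateTime.now();
    LocalDateTime base = now.withSecond(0).withNano(0);
    for (int minute : new int[] {10, 25, 40, 55}) {
        LocalDateTime candidate = base.withMinute(minute);
        if (candidate.isAfter(now)) {
            return Duration.between(now, candidate).toMillis();
        }
    }
    // All launch times this hour have passed; roll over to the next hour's X:10.
    return Duration.between(now, base.plusHours(1).withMinute(10)).toMillis();
}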
Quartz would be a good fit since your application is web-based. It will provide the fine-grained, time-based scheduling you need.
Quartz is a full-featured, open source job scheduling service that can be integrated with, or used alongside, virtually any Java EE or Java SE application - from the smallest stand-alone application to the largest e-commerce system. Quartz can be used to create simple or complex schedules for executing tens, hundreds, or even tens-of-thousands of jobs; jobs whose tasks are defined as standard Java components that may execute virtually anything you may program them to do. The Quartz Scheduler includes many enterprise-class features, such as JTA transactions and clustering.
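For the times in the question, a cron trigger covers it directly: fire at X:10, X:25, X:40 and X:55. A sketch using the Quartz 2.x API, with SyncJob standing in for the actual synchronization code:

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class SyncJob implements Job {

    @Override
    public void execute(JobExecutionContext context) {
        // run the ~2 minute database synchronization here
    }

    public static void main(String[] args) throws Exception {
        JobDetail job = JobBuilder.newJob(SyncJob.class)
                .withIdentity("db-sync")
                .build();

        // Fire at X:10, X:25, X:40 and X:55 - five minutes before every quarter hour.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(CronScheduleBuilder.cronSchedule("0 10,25,40,55 * * * ?"))
                .build();

        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}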
TimerTask handles this case.
See schedule(TimerTask, Date)
If you don't want to have to keep scheduling the jobs, you may want to look into a job scheduling tool like Quartz.
EDIT: This is basically a "how to properly implement a data flow engine in Java" question, and I feel this cannot be adequately answered in a single answer (it's like asking, "how to properly implement an ORM layer" and getting someone to write out the details of Hibernate or something), so consider this question "closed".
Is there an elegant way to model a dynamic dataflow in Java? By dataflow, I mean there are various types of tasks, and these tasks can be "connected" arbitrarily, such that when a task finishes, successor tasks are executed in parallel using the finished task's output as input, or when multiple tasks finish, their output is aggregated in a successor task (see flow-based programming). By dynamic, I mean that the type and number of successor tasks when a task finishes depends on the output of that finished task, so for example, task A may spawn task B if it has a certain output, but may spawn task C if it has a different output. Another way of putting it is that each task (or set of tasks) is responsible for determining what the next tasks are.
Sample dataflow for rendering a webpage: I have as task types: file downloader, HTML/CSS renderer, HTML parser/DOM builder, image renderer, JavaScript parser, JavaScript interpreter.
File downloader task for HTML file
HTML parser/DOM builder task
File downloader task for each embedded file/link
If image, image renderer
If external JavaScript, JavaScript parser
JavaScript interpreter
Otherwise, just store in some var/field in HTML parser task
JavaScript parser for each embedded script
JavaScript interpreter
Wait for above tasks to finish, then HTML/CSS renderer (obviously not optimal or perfectly correct, but this is simple)
I'm not saying the solution needs to be some comprehensive framework (in fact, the closer to the JDK API, the better), and I absolutely don't want something as heavyweight as, say, Spring Web Flow or some declarative markup or other DSL.
To be more specific, I'm trying to think of a good way to model this in Java with Callables, Executors, ExecutorCompletionServices, and perhaps various synchronizer classes (like Semaphore or CountDownLatch). There are a couple use cases and requirements:
Don't make any assumptions on what executor(s) the tasks will run on. In fact, to simplify, just assume there's only one executor. It can be a fixed thread pool executor, so a naive implementation can result in deadlocks (e.g. imagine a task that submits another task and then blocks until that subtask is finished, and now imagine several of these tasks using up all the threads).
To simplify, assume that the data is not streamed between tasks (task output->succeeding task input) - the finishing task and succeeding task don't have to exist together, so the input data to the succeeding task will not be changed by the preceding task (since it's already done).
There are only a couple operations that the dataflow "engine" should be able to handle:
A mechanism where a task can queue more tasks
A mechanism whereby a successor task is not queued until all the required input tasks are finished
A mechanism whereby the main thread (or other threads not managed by the executor) blocks until the flow is finished
A mechanism whereby the main thread (or other threads not managed by the executor) blocks until certain tasks have finished
Since the dataflow is dynamic (depends on input/state of the task), the activation of these mechanisms should occur within the task code, e.g. the code in a Callable is itself responsible for queueing more Callables.
The dataflow "internals" should not be exposed to the tasks (Callables) themselves - only the operations listed above should be available to the task.
Note that the type of the data is not necessarily the same for all tasks, e.g. a file download task may accept a File as input but will output a String.
If a task throws an uncaught exception (indicating some fatal error requiring all dataflow processing to stop), it must propagate up to the thread that initiated the dataflow as quickly as possible and cancel all tasks (or something fancier like a fatal error handler).
Tasks should be launched as soon as possible. This along with the previous requirement should preclude simple Future polling + Thread.sleep().
As a bonus, I would like the dataflow engine itself to perform some action (like logging) every time a task finishes, or when no task has finished within X time since the last one finished. Something like: ExecutorCompletionService<T> ecs; while (hasTasks()) { Future<T> future = ecs.poll(1 minute); some_action_like_logging(); if (future != null) { future.get() ... } ... }
Are there straightforward ways to do all this with the Java concurrency API? Or, if it's going to be complex no matter what with what's available in the JDK, is there a lightweight library that satisfies the requirements? I already have a partial solution that fits my particular use case (it cheats in a way, since I'm using two executors, and just so you know, it's not related at all to the web browser example I gave above), but I'd like to see a more general-purpose and elegant solution.
How about defining an interface such as:
interface Task extends Callable {
    boolean isReady();
}
Your "dataflow engine" would then just need to manage a collection of Task objects i.e. allow new Task objects to be queued for excecution and allow queries as to the status of a given task (so maybe the interface above needs extending to include id and/or type). When a task completes (and when the engine starts of course) the engine must just query any unstarted tasks to see if they are now ready, and if so pass them to be run on the executor. As you mention, any logging, etc. could also be done then.
One other thing that may help is to use Guice (http://code.google.com/p/google-guice/) or a similar lightweight DI framework to help wire up all the objects correctly (e.g. to ensure that the correct executor type is created, and to make sure that Tasks that need access to the dataflow engine (either for their isReady method or for queuing other tasks, say) can be provided with an instance without introducing complex circular relationships).
HTH, but please do comment if I've missed any key aspects...
Paul.
Look at https://github.com/rfqu/df4j — a simple but powerful dataflow library. If it lacks some desired features, they can be added easily.