I'm looking to use Cassandra to store data that a scheduling framework will pick up and process asynchronously at some point in the future. Does anyone know of a scheduling framework that can be hooked up to Cassandra so that multiple instances of the scheduler can work on the data set in parallel? The challenge is that once a row has been picked up by one scheduler instance, it must not be picked up by any other instance. In an RDBMS I know this can be achieved with row locking, but I'm not sure whether there is a clean way to achieve it in Cassandra. Please let me know if there is a framework of that nature for picking up those tasks.
Related
I have a terminal server monitoring project. In the backend I use Spring MVC, MyBatis and PostgreSQL. Basically, I query the session information from the DB, send it to the front end and display it to users. But some large queries (such as counting total users, total sessions, etc.) slow the system down when a user opens the website, so I want to run these queries as asynchronous tasks so that the website opens quickly instead of waiting for the query. I would also like to check the terminal server state in the DB periodically (every hour) and notify the admins if a terminal server fails or its average load is too high. I do not know what I should use, maybe Akka or something else, to do these two jobs (1. run the large queries asynchronously, 2. run some periodic queries). Please help me, thanks!
You can achieve this using Spring and caching where necessary.
If the data you're displaying is not required to be "in real-time", but it can be "near real-time" you can read the data from the DB periodically and cache it. Your app then reads from the cache.
There are different approaches you can explore.
You can try to create a materialized view in PostgreSQL which will hold the statistic data you need. Depending on your requirements you have to see how to handle refresh intervals etc.
Another approach is to use an application-level cache; you can leverage Spring for that (Spring docs). You can populate the cache on startup and refresh it as necessary.
The task that runs every hour can likewise be implemented with Spring's @Scheduled annotation (Spring docs).
To answer your question - don't use Akka - you have all the tools necessary to achieve the task in the Spring ecosystem.
Akka is not very relevant here; it provides an event-driven programming model for dealing with concurrency when building highly scalable multithreaded applications.
You can use the Spring task scheduler to run the heavy queries periodically. If you want to keep it simple, you can solve your problem by storing data such as total users and total sessions in the global application context and updating it periodically from the database using the Spring scheduler. You can also store the same data in a separate database table, so that it can be loaded easily at initialization time.
I really don't see why you would need "memcached", "materialized views", "WebSockets" and other heavy technologies and frameworks to cache a small set of data. All you need is to maintain a set of global parameters in your application context and keep them updated with a scheduled task as frequently as desired.
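To illustrate the "keep a few values in memory and refresh them on a schedule" idea, here is a minimal sketch using only the JDK, not Spring itself. The class name RefreshingCache and the loader supplier are made up for the example; in the real application the loader would wrap the slow MyBatis query, and Spring's scheduler could take the place of the executor.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Holds the latest result of an expensive query and refreshes it in the background. */
final class RefreshingCache<T> {
    private final Supplier<T> loader; // hypothetical: wraps the slow database query
    private final AtomicReference<T> latest = new AtomicReference<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "cache-refresh");
                t.setDaemon(true); // do not keep the JVM alive just for refreshes
                return t;
            });

    RefreshingCache(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Loads once synchronously, then refreshes at a fixed interval. */
    void start(long period, TimeUnit unit) {
        refresh();
        scheduler.scheduleAtFixedRate(this::safeRefresh, period, period, unit);
    }

    /** Runs the loader and publishes the new value. */
    void refresh() {
        latest.set(loader.get());
    }

    private void safeRefresh() {
        try {
            refresh();
        } catch (RuntimeException e) {
            // Keep serving the old value; an uncaught exception here would
            // cancel all further runs of the scheduled task.
        }
    }

    /** Returns the cached value without touching the database. */
    T get() {
        return latest.get();
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

The page request then only calls get(), so it never waits for the query; how stale the value may be is decided by the period passed to start(...).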
I'm trying to determine all the things I need to consider when deploying jobs to a clustered environment.
I'm not concerned about parallel processing or other scaling things at the moment; I'm more interested in how I make everything act as if it was running on a single server.
So far I've determined that triggering a job should be done via messaging.
The thing that's throwing me for a loop right now is how to utilize something like the Spring Batch Admin UI (even if it's a hand rolled solution) in a clustered deployment. Getting the job information from a JobExplorer seems like one of the keys.
Is Will Schipp's spring-batch-cluster project the answer, or is there a more agreed upon community answer?
Or do I not even need to worry because the JobRepository will be pulling from a shared database?
Or do I need to publish job execution info to a message queue to update the separate Job Repositories?
Are there other things I should be concerned about, like the jobIncrementers?
BTW, if it wasn't clear that I'm a total noob to Spring batch, let it now be known :-)
Spring XD (http://projects.spring.io/spring-xd/) provides a distributed runtime for deploying clusters of containers for batch jobs. It manages the job repository and provides a way to deploy, start, restart, etc. the jobs on the cluster. It addresses fault tolerance (for example, if a node goes down, the job is redeployed) as well as many other features that are needed to maintain a clustered Spring Batch environment.
I'm adding the answer that I think we're going to roll with unless someone comments on why it's dumb.
If Spring Batch is configured to use a shared database for all the DAOs that the JobExplorer will use, then running in a cluster isn't much of a concern.
We plan on using Quartz jobs to create JobRequest messages which will be put on a queue. The first server to get to the message will actually kick off the Spring Batch job.
Monitoring running jobs will not be an issue because the JobExplorer gets all of its information from the database and it doesn't look like it's caching information, so we won't run into cluster issues there either.
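As a rough illustration of "the first server to get to the message kicks off the job", here is a small in-process sketch. It assumes a plain java.util.concurrent queue as a stand-in for the real message broker and threads as stand-ins for cluster nodes; JobRequestQueueDemo and dispatch are names invented for the example, and the "launch" is just a counter where the real code would start the Spring Batch job.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

final class JobRequestQueueDemo {
    /**
     * Starts nodeCount consumer threads on one shared queue and returns how
     * many times each job request was launched. Job request names are
     * assumed to be distinct.
     */
    static Map<String, Integer> dispatch(List<String> jobRequests, int nodeCount) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(jobRequests);
        Map<String, Integer> launchCounts = new ConcurrentHashMap<>();
        CountDownLatch done = new CountDownLatch(nodeCount);

        for (int i = 0; i < nodeCount; i++) {
            Thread node = new Thread(() -> {
                try {
                    String request;
                    // poll() removes the head, so each request goes to exactly one node.
                    while ((request = queue.poll()) != null) {
                        // Placeholder for actually launching the batch job.
                        launchCounts.merge(request, 1, Integer::sum);
                    }
                } finally {
                    done.countDown();
                }
            }, "node-" + i);
            node.start();
        }

        try {
            done.await(); // wait until every node has drained the queue
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for nodes", e);
        }
        return launchCounts;
    }
}
```

With a real broker the same property comes from point-to-point (queue) delivery rather than from poll(); a publish/subscribe topic would deliver the request to every node instead.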
So to directly answer the questions...
Is Will Schipp's spring-batch-cluster project the answer, or is there a more agreed upon community answer?
There is some cool stuff in there, but it seems like overkill when just getting started. I'm not sure if there is a "community" agreed upon answer.
Or do I not even need to worry because the JobRepository will be pulling from a shared database?
This seems correct. If using a shared database, all of the nodes in the cluster can read and write all the job information. You just need a way to ensure a timer job isn't getting triggered more than once. Quartz already has a cluster solution.
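For reference, Quartz's cluster support is switched on through its JDBC job store settings. The sketch below just builds those settings as a java.util.Properties object; the class name QuartzClusterConfig and the data source name "sharedDs" are assumptions for the example, and the connection settings for that data source are left out.

```java
import java.util.Properties;

final class QuartzClusterConfig {
    /** Builds the Quartz settings that every node of the cluster would share. */
    static Properties clustered(String schedulerName) {
        Properties p = new Properties();
        // All nodes of one cluster use the same scheduler name...
        p.setProperty("org.quartz.scheduler.instanceName", schedulerName);
        // ...but each node needs its own id; AUTO lets Quartz generate one.
        p.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        // Clustering requires a JDBC job store backed by the shared database.
        p.setProperty("org.quartz.jobStore.class",
                "org.quartz.impl.jdbcjobstore.JobStoreTX");
        p.setProperty("org.quartz.jobStore.driverDelegateClass",
                "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
        p.setProperty("org.quartz.jobStore.isClustered", "true");
        // How often (in ms) a node checks in with the rest of the cluster.
        p.setProperty("org.quartz.jobStore.clusterCheckinInterval", "20000");
        // Hypothetical data source name; its connection settings are omitted.
        p.setProperty("org.quartz.jobStore.dataSource", "sharedDs");
        return p;
    }
}
```

These properties can be passed to Quartz's StdSchedulerFactory or kept in a quartz.properties file; with them in place, a given trigger firing should be handled by only one node.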
Or do I need to publish job execution info to a message queue to update the separate Job Repositories?
Again, this shouldn't be needed because the execution info is written to the database.
Are there other things I should be concerned about, like the jobIncrementers?
It doesn't seem like this is a concern. When using the JDBC DAO implementations, it uses a database sequence to increment values.
My application is split into 2 web applications running in the same container sharing one db.
The first war does only background processing and the other is for the client GUI plus some background work.
The application with the client GUI allows the user to configure the scheduling of some tasks that will be executed by the "background application". Basically it configures the Quartz jobs and triggers.
I'd like that the scheduler of the background application handles only the jobs of a certain group (bg-jobs), and that the other scheduler handles the other group (fg-jobs).
Is it possible to configure this kind of isolation with quartz?
Note: I'd like to keep it simple, and I'd rather avoid Quartz Where if I can; it seems like a sledgehammer to crack a nut and is probably overkill for my needs.
Thanks in advance
The simplest and quickest way is to create a separate set of tables for each application. So have one set of Quartz tables prefixed with "bg-" and another prefixed with "fg-". Then just change each scheduler's config to point at the appropriate tables. I know it might be a little awkward but you did say you wanted to keep it simple :).
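A sketch of what the two scheduler configs might look like, again as plain java.util.Properties. The class name QuartzPrefixConfig and the scheduler names are made up; the prefixes are written as "BG_" and "FG_" here on the assumption that underscores are easier than hyphens, since a hyphen in a table name would need quoting in SQL.

```java
import java.util.Properties;

final class QuartzPrefixConfig {
    /** Builds settings for a scheduler that only sees tables with the given prefix. */
    static Properties forTables(String schedulerName, String tablePrefix) {
        Properties p = new Properties();
        p.setProperty("org.quartz.scheduler.instanceName", schedulerName);
        p.setProperty("org.quartz.jobStore.class",
                "org.quartz.impl.jdbcjobstore.JobStoreTX");
        // Quartz prepends this prefix to every table name it uses
        // (the default is QRTZ_), so each war gets its own set of tables.
        p.setProperty("org.quartz.jobStore.tablePrefix", tablePrefix);
        return p;
    }

    /** Settings for the background war (hypothetical names). */
    static Properties background() {
        return forTables("bgScheduler", "BG_");
    }

    /** Settings for the GUI war (hypothetical names). */
    static Properties foreground() {
        return forTables("fgScheduler", "FG_");
    }
}
```

Both sets of tables have to be created from the Quartz schema scripts with the matching prefix. Note that the GUI application would then need a scheduler configured with the background prefix as well if it is the one storing the bg-jobs.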
There is a requirement in the project that will have a scheduled task that will do some job.
The project is Spring based and the scheduled job will be part of the application war. I have never implemented this kind of functionality before.
I have heard of Quartz. I also read somewhere that Spring provides some functionality for scheduling tasks. So I was wondering: if I am already using Spring, why go for another API (Quartz)?
I am not sure which one to use. What are the pros and cons of one over the other?
Please suggest what will be the best way to approach my requirement.
I have used Spring's Task execution and scheduling - http://static.springsource.org/spring/docs/3.0.x/reference/scheduling.html
I am looking for a pattern and/or framework which can model the following problem in an easily configurable way.
Every 3 minutes or so, I need a set of jobs to kick off in a web application context; they will concurrently hit web services to obtain the latest version of the data and push it to a database. The problem is that the database will also be heavily used for reads, to do tons of complex calculations on the data. We are currently using Spring, so I have been looking at Spring Batch to run this process. Does anyone have any suggestions/patterns/examples of using Spring or other technologies for a similar system?
We have used ServletContextListeners to kick off TimerTasks in our web applications when we needed processes to run repeatedly. The ServletContextListener fires when the app server starts the application or when the application is restarted. The timer tasks then run on a separate thread that repeats your code at the specified interval.
ServletContextListener
http://www.javabeat.net/examples/2009/02/26/servletcontextlistener-example/
TimerTask
http://enos.itcollege.ee/~jpoial/docs/tutorial/essential/threads/timer.html
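A minimal sketch of the timer part, using only java.util.Timer so that it runs without a servlet container. RepeatingJob is a made-up name; in a web application, start(...) would be called from the listener's contextInitialized(...) and stop() from contextDestroyed(...).

```java
import java.util.Timer;
import java.util.TimerTask;

final class RepeatingJob {
    // Daemon timer thread, so it cannot keep the JVM alive on its own.
    private final Timer timer = new Timer("repeating-job", true);
    private final Runnable work; // hypothetical: the code to repeat

    RepeatingJob(Runnable work) {
        this.work = work;
    }

    /** Runs the work immediately, then again every periodMillis milliseconds. */
    void start(long periodMillis) {
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                try {
                    work.run();
                } catch (RuntimeException e) {
                    // Swallow and continue: an uncaught exception would
                    // terminate the timer thread and stop all future runs.
                }
            }
        }, 0L, periodMillis);
    }

    /** Cancels the timer so the thread does not outlive the application. */
    void stop() {
        timer.cancel();
    }
}
```

Cancelling the timer on shutdown matters on redeploy: otherwise the old timer thread can keep running next to the newly started one.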
Is refactoring the job out of the web application and into a standalone app a possibility?
That way you could stick the batch job onto a separate batch server (so that the extra load of the batch job wouldn't impact your web application), which then calls the web services and updates the database. The job can then be kicked off using something like cron or Autosys.
We're using Spring-Batch for exactly this purpose.
The database design would also depend on what the batched data is used for. If it is for reporting purposes, I would recommend separating the operational database from the reporting database, using a database link to obtain the required data from the operational database into the reporting database and then running the complex queries on the reporting database. That way the load is shifted off the operational database.
I think it's also worth looking into integration frameworks like Apache Camel. Also take a look at the so-called Enterprise Integration Patterns. Check the catalog; it might provide you with some useful vocabulary for thinking about the scaling/scheduling problem at hand.
The framework itself integrates really well with Spring.