How to run a very long process in a Java-based web application?

I need to run a very long process in a Java-based Spring Boot web application. The process consists of the following steps:
1. Get details for about 300,000 users from the database.
2. Iterate over them.
3. Generate a PDF file for each user using iText.
4. Save the PDF file on the filesystem.
5. Update the database to record that the PDF file for the given user has been created.
6. Update the PDF path for the user in the database.
Now, this entire process can take a lot of time, maybe many hours or even days, as it consists of creating a PDF file for each user followed by many database updates.
Also, I need this process to run in the background so that the rest of the web application can run smoothly.
I am thinking of using Spring Batch or a message queue. I haven't really used either of them, so I'm not sure whether they are suitable for this kind of problem, or which of the two is the better fit.
What is the ideal way to implement this kind of task?

If you can't name a requirement you expect to be satisfied by a framework / library, you most likely won't need one...
Generating PDFs might need a lot of processing power, so you might want to keep this background process away from your main web application, on its own machines.
If it's a simple Java process, it's usually easier to control and to move around your environment.
To me this looks like a simple task for "plain" Java - KISS. Or am I missing something?
I'd make sure the Finder used to fetch the users from the database is:
restartable, i.e. only fetches unprocessed users (in case you have to stop the processing because shit happens :-)
batched, i.e. fetches users in batches to keep the DB round trips and load low
multi-threadable, i.e. can fetch users split across a given number of threads (userId mod numberOfThreads, assuming userId is evenly distributed), so you can add more machines / threads if necessary.
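
A minimal sketch of such a Finder in plain Java, under stated assumptions: the `UserStore` class is a hypothetical in-memory stand-in for the users table, the `/tmp/pdfs/...` path is made up, and the PDF generation is reduced to a placeholder comment. It only illustrates the three properties above (restartable, batched, partitioned by userId mod numberOfThreads), not a production implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical in-memory stand-in for the users table (userId -> PDF path once processed). */
final class UserStore {
    private final List<Long> userIds = new ArrayList<>();
    private final Map<Long, String> pdfPaths = new ConcurrentHashMap<>();

    UserStore(int userCount) {
        for (long id = 1; id <= userCount; id++) {
            userIds.add(id);
        }
    }

    /** Returns at most batchSize unprocessed users belonging to the given partition. */
    List<Long> findUnprocessed(int partition, int partitions, int batchSize) {
        List<Long> batch = new ArrayList<>();
        for (long id : userIds) {
            if (id % partitions == partition && !pdfPaths.containsKey(id)) {
                batch.add(id);
                if (batch.size() == batchSize) {
                    break;
                }
            }
        }
        return batch;
    }

    /** Records that the PDF exists; a restart will skip this user. */
    void markProcessed(long userId, String pdfPath) {
        pdfPaths.put(userId, pdfPath);
    }

    int processedCount() {
        return pdfPaths.size();
    }
}

final class PdfBatchRunner {
    /** Runs one worker per partition until no unprocessed users remain. */
    static void run(UserStore store, int partitions, int batchSize) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        for (int p = 0; p < partitions; p++) {
            final int partition = p;
            pool.submit(() -> {
                List<Long> batch;
                while (!(batch = store.findUnprocessed(partition, partitions, batchSize)).isEmpty()) {
                    for (long userId : batch) {
                        // Placeholder: generate the PDF and write it to disk here.
                        String path = "/tmp/pdfs/user-" + userId + ".pdf"; // hypothetical path
                        store.markProcessed(userId, path);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because each worker only ever asks for unprocessed users in its own partition, stopping the process and calling `PdfBatchRunner.run` again simply picks up where it left off. With a real database, `findUnprocessed` would presumably become a query with the same three conditions.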

You should use Spring Batch for this process. When the user presses the button, you would launch the job asynchronously. It will then run in a separate thread and process all your records. The current status of the job can be obtained from the job repository. Spring Batch is made for this type of processing.

Related

Should I use AKKA for the periodical task

I have a terminal server monitoring project. In the backend, I use Spring MVC, MyBatis and PostgreSQL. Basically, I query the session information from the DB, send it back to the front end and display it to users. But there are some large queries (like searching total users, total sessions, etc.) which slow down the system when a user opens the website, so I want to run these queries as asynchronous tasks so that the website opens quickly rather than waiting for the queries. Also, I would check the terminal server state periodically from the DB (every hour), and if a terminal server fails or the average load is too high, I would notify the admins. I do not know what I should use, maybe Akka, or some other way to do these two jobs (1. run the large queries asynchronously, 2. run some periodic queries)? Please help me, thanks!

You can achieve this using Spring and caching where necessary.
If the data you're displaying is not required to be "in real-time", but it can be "near real-time" you can read the data from the DB periodically and cache it. Your app then reads from the cache.
There are different approaches you can explore.
You can try to create a materialized view in PostgreSQL which will hold the statistical data you need. Depending on your requirements, you have to see how to handle refresh intervals etc.
Another approach is to use an application-level cache; you can leverage Spring for that (Spring docs). You can populate the cache on start-up and refresh it as necessary.
The task that runs every hour can likewise be implemented with Spring, using the @Scheduled annotation (Spring docs).
To answer your question - don't use Akka - you have all the tools necessary to achieve the task in the Spring ecosystem.

Akka is not very relevant here; it provides an event-driven programming model that deals with concurrency issues in order to build highly scalable multithreaded applications.
You can use the Spring task scheduler for running heavy queries periodically. If you want to keep it simple, you can solve your problem by simply storing the data, like total users, total sessions, etc., in the global application context, and periodically updating this data from the database using the Spring scheduler. You can also store the same data in a separate database table, so that it can be easily loaded at initialization time.
I really don't see why you need "memcached", "materialized views", "Websockets" and other heavy technologies and frameworks for caching a small set of data. All you need is to maintain a set of global parameters in your application context and keep them updated using a scheduled task as frequently as desired.
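
To make the "global parameters refreshed by a scheduled task" idea concrete, here is a rough sketch in plain Java rather than Spring. It swaps the Spring scheduler for the JDK's `ScheduledExecutorService`; the `Stats` and `StatsCache` classes and the `loader` supplier (standing in for the slow queries) are assumptions made for illustration only.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Immutable snapshot of the expensive statistics. */
final class Stats {
    final long totalUsers;
    final long totalSessions;

    Stats(long totalUsers, long totalSessions) {
        this.totalUsers = totalUsers;
        this.totalSessions = totalSessions;
    }
}

/** Holds the latest snapshot; web requests read it instead of querying the database. */
final class StatsCache {
    private final AtomicReference<Stats> latest = new AtomicReference<>();
    private final Supplier<Stats> loader; // hypothetical: wraps the slow queries

    StatsCache(Supplier<Stats> loader) {
        this.loader = loader;
        refresh(); // populate once at start-up
    }

    /** Re-runs the loader and atomically swaps in the new snapshot. */
    void refresh() {
        latest.set(loader.get());
    }

    /** Cheap read used by request handlers. */
    Stats current() {
        return latest.get();
    }

    /** Refreshes in the background at a fixed period; the caller shuts the scheduler down. */
    ScheduledExecutorService startRefreshing(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::refresh, period, period, unit);
        return scheduler;
    }
}
```

One caveat with this sketch: if the task passed to `scheduleAtFixedRate` throws, later executions are suppressed, so a real loader would probably catch and log its own exceptions.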

Real time data consumption from mysql

I have a use case in which my data is stored in MySQL.
For each new row inserted into MySQL, I have to perform analytics on the new data.
This is how I am currently solving the problem:
My application is a Spring Boot application in which I use a scheduler that checks every 2 seconds for new rows in the database.
The problem with the current approach is:
Even if there is no new data in the MySQL table, the scheduler still fires a MySQL query to check whether new data is available.
One way to solve this type of problem in any SQL database is triggers.
But so far I have not succeeded in creating MySQL triggers which can call a Java-based Spring application or a simple Java application.
My question is:
Is there any better way to solve the above use case? I am even open to changing to another storage (database) system if it is built for this type of use case.

This fundamentally sounds like an architecture issue. You're essentially using a database as an API which, as you can see, causes all kinds of issues. Ideally, this db would be wrapped in a service that can manage the notification of systems that need to be notified. Let's look at a few different options going forward.
Continue to poll
You didn't outline what the actual issue is with your current polling approach. Is running the job when it's not needed causing an issue of some kind? I'd be a proponent for just leaving it unless you're interested in making a larger change.
Database Trigger
While I'm unaware of a way to launch a java process via a db trigger, you can do an HTTP POST from one. With that in mind, you can have your batch job staged in a web app that uses a POST to launch the job when the trigger fires.
Wrap existing datastore in a service
This is, IMHO, the best option. It gives you a system of record that provides an API that can be versioned, etc. Any logic around who to notify would also be encapsulated in this service.
Replace data store with something that allows for better notifications
Without any real information on what the data being stored is, it's hard to say how practical this is. But something like Apache Kafka or Apache Geode would be an option, as both provide the ability to be notified when new data is persisted (Kafka by listening to the topic, Geode via a continuous query).
For the record, I'd advocate for the wrapping of the existing database in a service. That service would be the only way into the db and take on responsibility for any notifications required.
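
As a rough illustration of the "wrap the datastore in a service" option, the sketch below shows a service that is the only write path and notifies subscribers after each insert, so consumers no longer need to poll. Everything here is hypothetical: `RecordService` is an invented name, and an in-memory list stands in for the MySQL table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical service that owns all writes and notifies subscribers on each insert. */
final class RecordService {
    private final List<String> rows = new ArrayList<>(); // stands in for the MySQL table
    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    /** Registers a consumer, e.g. the analytics component. */
    void subscribe(Consumer<String> listener) {
        listeners.add(listener);
    }

    /** The only way to add a row: persist first, then notify every subscriber. */
    void insert(String row) {
        synchronized (rows) {
            rows.add(row);
        }
        for (Consumer<String> listener : listeners) {
            listener.accept(row);
        }
    }

    int size() {
        synchronized (rows) {
            return rows.size();
        }
    }
}
```

In this sketch the listeners run synchronously on the inserting thread; a real service might hand the notification to a queue or executor instead.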

What is the right way to run background task in Play 2.1 (Java)?

In my app I need to process uploaded documents and put the results of processing in the DB.
Documents are stored in the file system and metadata is stored in the DB.
For each document, the file needs to be opened from disk and processed, and then the metadata in the DB updated accordingly. Processing may be expensive and take a long time.
What I plan to do is:
1. Spawn N tasks, with each task processing a single document at a time.
2. Each task will go and find the oldest "unprocessed" document.
3. The task will mark it as "in progress" in the DB and start processing it.
4. After processing the document, the task will update the metadata and mark it in the DB as "processed".
5. The task will go to step 2 after that.
What is the right / easiest way to implement this leveraging Play and Akka, assuming the application is written in Java, not Scala? Source code examples would also be appreciated.

The right way is "Don't run any background tasks in a Play app". Play is a web framework for writing web apps, and a background task, by definition, does not use a web interface. So set up a separate background task runner and send it messages/events via Akka. In fact, you will have a far more scalable application if you push as much business logic as possible into background tasks.
For an example of this model taken to its logical conclusion, have a look at the Mongrel2 web server http://mongrel2.org/manual/book-final.html
Given that we have tools like Akka and Camel in the JVM world, and that frameworks like Play are weaning us off the servlet architecture, I think it is about time to follow Mongrel2's lead and get back to more of a 3 tier architecture where the web app layer only does the minimum of work.
If you follow this architecture, you would bundle up all the information needed to run the background task into a message, send that to an external actor which does the work, and then possibly have that actor send a completion message to another actor which would update the database.
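
The message flow described above can be sketched without Akka, using plain JDK blocking queues as stand-in mailboxes. This is not Akka's API: the message classes, `DocumentPipeline`, the word-count "processing" and the in-memory `database` map are all assumptions made to keep the example self-contained.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Message carrying everything the worker needs to process one document. */
final class ProcessDocument {
    final long documentId;
    final String content;

    ProcessDocument(long documentId, String content) {
        this.documentId = documentId;
        this.content = content;
    }
}

/** Completion message passed on to the component that updates the database. */
final class DocumentProcessed {
    final long documentId;
    final int wordCount;

    DocumentProcessed(long documentId, int wordCount) {
        this.documentId = documentId;
        this.wordCount = wordCount;
    }
}

final class DocumentPipeline {
    private static final ProcessDocument STOP = new ProcessDocument(-1, "");
    private static final DocumentProcessed DONE = new DocumentProcessed(-1, 0);

    private final BlockingQueue<ProcessDocument> workerInbox = new LinkedBlockingQueue<>();
    private final BlockingQueue<DocumentProcessed> updaterInbox = new LinkedBlockingQueue<>();
    final Map<Long, Integer> database = new ConcurrentHashMap<>(); // stands in for the metadata table

    /** Called by the web layer: enqueue a message and return immediately. */
    void submit(long documentId, String content) {
        workerInbox.add(new ProcessDocument(documentId, content));
    }

    /** Starts a worker and an updater thread and waits until all queued messages are handled. */
    void runUntilDrained() {
        Thread worker = new Thread(() -> {
            try {
                for (ProcessDocument m = workerInbox.take(); m != STOP; m = workerInbox.take()) {
                    String text = m.content.trim();
                    int words = text.isEmpty() ? 0 : text.split("\\s+").length; // placeholder work
                    updaterInbox.put(new DocumentProcessed(m.documentId, words));
                }
                updaterInbox.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread updater = new Thread(() -> {
            try {
                for (DocumentProcessed m = updaterInbox.take(); m != DONE; m = updaterInbox.take()) {
                    database.put(m.documentId, m.wordCount);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        updater.start();
        workerInbox.add(STOP); // processed after everything submitted so far
        try {
            worker.join();
            updater.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a long-running deployment the worker and updater would stay alive rather than drain and stop; the stop markers are only there so the sketch terminates.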

Java web application storing large objects available best options

Java - spring - webapplication
I have a web application which has wizard-based processes to create a complex entity, and there are at least 10 screens to complete one process. The problem is that at any step between 1 and 10 the user can leave without completing the process, and we want to store that data so that the user can resume the process later. There are multiple tables involved in this process.
I am worried about saving data into the database on every wizard step, because after some time data will become clustered and orphan into the database and it will become garbage.
I'd like to discuss the solution with you; please advise.

You could serialize the data to XML or JSON and store it temporarily somewhere in the DB. This would avoid dealing with multiple tables. You can use a timeout and remove those entries after a while (some days maybe). Once the process is completed, do the real save and remove the temp data on success.
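
A small sketch of that idea, assuming the wizard state has already been serialized to a JSON string: `DraftStore` is a hypothetical class whose in-memory map stands in for a single temporary table keyed by user, and the time-to-live is whatever "some days" turns out to be.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical store for unfinished wizard state; the map stands in for one temp table. */
final class DraftStore {
    private static final class Draft {
        final String json;
        final Instant savedAt;

        Draft(String json, Instant savedAt) {
            this.json = json;
            this.savedAt = savedAt;
        }
    }

    private final Map<String, Draft> drafts = new ConcurrentHashMap<>();
    private final Duration ttl;

    DraftStore(Duration ttl) {
        this.ttl = ttl;
    }

    /** Saves (or overwrites) the serialized wizard state after each step. */
    void save(String userId, String json, Instant now) {
        drafts.put(userId, new Draft(json, now));
    }

    /** Returns the saved state so the user can resume, if any exists. */
    Optional<String> load(String userId) {
        return Optional.ofNullable(drafts.get(userId)).map(d -> d.json);
    }

    /** Called after the real save succeeded: the temp data is no longer needed. */
    void complete(String userId) {
        drafts.remove(userId);
    }

    /** Removes drafts older than the time-to-live and returns how many were dropped. */
    int purgeExpired(Instant now) {
        int before = drafts.size();
        drafts.values().removeIf(d -> d.savedAt.plus(ttl).isBefore(now));
        return before - drafts.size();
    }
}
```

`purgeExpired` would typically be called from a periodic clean-up job, which keeps abandoned drafts from accumulating.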

after some time data will become clustered and orphan into the database
eh ?
Delete the temporary/incomplete data when the user has successfully completed the process.

Fast Multithreaded Online Processing Application Framework Suggestions

I am looking for a pattern and/or framework which can model the following problem in an easily configurable way.
Every 3 minutes or so, I need to have a set of jobs kick off in a web application context that will concurrently hit web services to obtain the latest version of the data and push it to a database. The problem is that the database will also be heavily used for reads, to do tons of complex calculations on the data. We are currently using Spring, so I have been looking at Spring Batch to run this process. Does anyone have any suggestions/patterns/examples of using Spring or other technologies for a similar system?

We have used ServletContextListeners to kick off TimerTasks in our web applications when we needed processes to run repeatedly. The ServletContextListener fires when the app server starts the application or when the application is restarted. The timer tasks then run on a separate thread that repeats your code at the specified interval.
ServletContextListener
http://www.javabeat.net/examples/2009/02/26/servletcontextlistener-example/
TimerTask
http://enos.itcollege.ee/~jpoial/docs/tutorial/essential/threads/timer.html
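
For illustration, the timer part of this approach might look like the sketch below, using only `java.util.Timer` and `java.util.TimerTask`. `RepeatingJob` is a hypothetical class and the job body is a placeholder counter; in a servlet container, `start` would presumably be called from the listener's `contextInitialized` method and `stop` from `contextDestroyed`.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical wrapper around a Timer that repeats a job at a fixed rate. */
final class RepeatingJob {
    // Daemon timer thread, so it does not keep the JVM alive on shutdown.
    private final Timer timer = new Timer("repeating-job", true);
    final AtomicInteger runs = new AtomicInteger();

    /** Schedules the job to run immediately and then every periodMillis milliseconds. */
    void start(long periodMillis) {
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                // Placeholder for the real work (call web services, write to the database).
                runs.incrementAndGet();
            }
        }, 0, periodMillis);
    }

    /** Cancels the timer; no further runs are scheduled. */
    void stop() {
        timer.cancel();
    }
}
```

For the three-minute case in the question, the call would be `start(3 * 60 * 1000L)`.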

Is refactoring the job out of the web application and into a standalone app a possibility?
That way you could stick the batch job onto a separate batch server (so that the extra load of the batch job wouldn't impact your web application), which then calls the web services and updates the database. The job can then be kicked off using something like cron or Autosys.
We're using Spring-Batch for exactly this purpose.
The database design would also depend on what the batched data is used for. If it is for reporting purposes, I would recommend separating the operational database from the reporting database, using a database link to obtain the required data from the operational database into the reporting database and then running the complex queries on the reporting database. That way the load is shifted off the operational database.

I think it's also worth looking into integration frameworks like Apache Camel. Also take a look at the so-called Enterprise Integration Patterns. Check the catalog - it might provide you with some useful vocabulary to think about the scaling/scheduling problem at hand.
The framework itself integrates really well with Spring.
