Fast Multithreaded Online Processing Application Framework Suggestions

Fast Multithreaded Online Processing Application Framework Suggestions - java

I am looking for a pattern and/or framework which can model the following problem in an easily configurable way.
Every say 3 minutes, I needs to have a set of jobs kick off in a web application context that will concurrently hit web services to obtain the latest version of data, and push it off to a database. The problem is the database will be being heavily used to read the data from to do tons of complex calculations on the data. We are currently using spring so I have been looking at Spring Batch to run this process does anyone have any suggestions/patterns/examples of using Spring or other technologies of a similar system?

We have used ServletContextlisteners to kick off TimerTasks in our web applications when we needed processes to run repeatedly. The ServletContextListener kicks off when the app server starts the application or when the application is restarted. Then the timer tasks act like a separate thread that repeats your code for the specified period of time.
ServletContextListener
http://www.javabeat.net/examples/2009/02/26/servletcontextlistener-example/
TimerTask
http://enos.itcollege.ee/~jpoial/docs/tutorial/essential/threads/timer.html

Is refactoring the job out of the web application and into a standalone app a possibility?
That way you could stick the batch job onto a separate batch server (so that the extra load of the batch job wouldn't impact your web application), which then calls the web services and updates the database. The job can then be kicked off using something like cron or Autosys.
We're using Spring-Batch for exactly this purpose.
The database design would also depend on what the batched data is used for. If it is for reporting purposes, I would recommend separating the operational database from the reporting database, using a database link to obtain the required data from the operational database into the reporting database and then running the complex queries on the reporting database. That way the load is shifted off the operational database.

I think it's worth also looking into frameworks like camel-integration. Also take a look at the so called Enterprise Integration Patterns. Check the catalog - it might provide you with some useful vocabulary to think about the scaling/scheduling problem at hand.
The framework itself integrates really well with Spring.

Related

Disable Runtime DB in Camunda

I have a simple process like this:
The process doesn't have any user tasks, there are only some service tasks but the process will be started a lot of times and performance issue is important.
I set HistoryLevel to none for better performance and it was effective in a load test.
I have a question which I couldn't find in searching on the web.
Is there any way for disabling Runtime DB in Camunda? I'm not sure this is a rational goal or not but I want to know about that.

The process models are not read from the classpath, but deployed to the database and read from there. So Camunda requires a relation database even if you disable the history and have no asynchronous continuations or wait states in your process model.
However, if you do not require persistence at all, then you can simply configured an in-memory database, such as the H2 database Camunda ships for development purposes in its different distributions. You can switch the database url to e.g. jdbc:h2:mem:camunda-db (see https://www.h2database.com/html/features.html#in_memory_databases) to switch to an in-memory configuration.
24 Hour Fitness is running millions of process instances daily using a similar approach. You may be interested in this talk they gave at CamundaCon 2020.1:
https://vimeo.com/440715573

rest interface to configure, spawn, monitor, stop long tasks java

I'm looking for suggestion about this project.
I would like to build a web ui from which configure, launch, monitor, long lived task (potentially weeks).
I was thinking of a mongodb database where to write configuration and a monitoring service from which detect change and eventually spawn new tasks.
However this solution has lots of issues and drawbacks:
If a service died? how should I know, present part of the result (or the cuase of fail) to the user?
If i want to monitor the service (i.e. how much data it is collecting) where shoud i write this info? (maybe i can write them to mongo)
If i want to stop a service? I may write the stop condition to mongo (a boolean in the configuration document?)
Do you know some example project? It can be considered a task list with a real running task behind it.
Other info that could be related to the problem: I write usually in kotlin (coroutines look like a good alternative I need to study them thoroughly), I use Spring, Camel, and Kafka. I have an ELK stack where to put the message, I'm planning to write the UI with Vue.js (the web server is node.js at the moment). I don't want to use any cloud service. I'm alone on this project.

Should I use AKKA for the periodical task

I have a terminal server monitor project. In the backend, I use the Spring MVC, MyBatis and PostgreSQL. Basically I query the session information from DB and send back to front-end and display it to users. But there is some large queries(like searching total users, total sessions, etc.), which slow down the system when user opens the website, So I want to do these queries as asynchronous tasks so the website could be opened fast rather than waiting for the query. Also, I would check terminal server state periodically from DB(every hour), and if terminal server fails or average load is too high, I would notifying admins. I do not know what should I use, maybe AKKA, or any other way to do these two jobs(1.do the large query asynchronously 2. do some periodical query)? Please help me, thanks!

You can achieve this using Spring and caching where necessary.
If the data you're displaying is not required to be "in real-time", but it can be "near real-time" you can read the data from the DB periodically and cache it. Your app then reads from the cache.
There's different approaches you can explore.
You can try to create a materialized view in PostgreSQL which will hold the statistic data you need. Depending on your requirements you have to see how to handle refresh intervals etc.
Another approach is to use application level cache - you can leverage Spring for that(Spring docs). You can populate the cache on start up and refresh it as necessary.
The task that runs every hour can be implemented again leveraging Spring (Spring docs) #Scheduled annotation.
To answer your question - don't use Akka - you have all the tools necessary to achieve the task in the Spring ecosystem.

Akka is not very relevant here, it is for event-driven programming model which deals with concurrency issues to build highly scalable multithreaded applications.
You can use Spring task scheduler for running heavy queries periodically. If you want to keep it simple, you can solve your problem by simply storing the data like total users, total sessions etc, in the global application context. And periodically update this data from database using spring scheduler. You can also store the same in a separate database table, so that this data can be easily loaded at the initialization time.
I really don't see why you need "memcached", "materialized views", "Websockets" and other heavy technologies and frameworks, for a caching a small set of data. All you need is maintain a set of global parameters in your application context, keep them updated using a scheduled task as frequently as desired.

How can I avoid duplication of business logic when batch processing?

I have a web application dedicated to batch processing (batch service here on out, api driven) and I have the main web application that is dedicated to everything else. I've been struggling with making a decision on what the best way is to avoid duplication of business logic in the batch service. Both applications are clustered. The separation for batch processing has been okay for simple jobs, but I have more complex jobs where it would just cause chaos if the business logic were duplicated. Here's my use case for the purposes of this question.
Customer schedules a cron job for user updates.
Batch service is given a CSV file with 20,000 user records.
The batch service rips through the file performing validation on the records, basically a dry run.
The batch service will check the allowable change and error thresholds (percentages are counts)
If validation thresholds pass, the batch service will begin creating/updating users.
When users are created or updated, there are a number of modules/features that need to know about these events.
Job progress is tracked and customer can view progress, logs, and status of job.
Here are a few solutions I have been thinking about:
Jar up the business logic and share it across the two applications. This wouldn't necessarily be easy because the main application is a Grails application and it's got GORM littered throughout.
Have the batch service hit APIs on the main application for the create and updates and possibly the more complex validation scenarios. Worried about the toll this would take on tomcat, but calls would be going through the load balancer so they would be distributed.
Have the batch service hit APIs on the main application for validation, then queue create/update requests and let the main application retrieve them. Same as above, queue would help reduce http calls. Also would need a queue to report status back to batch service.
Duplicate some logic by having batch service do it's own validation and inserts/updates, but then fire a user created event or user updated event so modules/features in the main app can deal with the changes.
Embed the batch processing service into the main application
Other details:
The batch service and web application are both clustered
Both are running on AWS, so I have tools like SQS and SNS easily accessible
Java 1.7 applications
Tomcat containers
Main application is Grails
Batch service uses Spring Batch and Quartz at it's core
So my question is what are accepted ways to avoid duplication of business logic based on the details above? Can/Should the architecture be changed to better accommodate this?
Another idea to consider is what would this look like and a "microservices" architecture. That word has been tossed around a number of times in the office and we have been considering the idea of breaking up the main web application into services. So for example, we may end up with a service for user management.

Say for example you are using a Java EE 6 application.
Your CSV batch updater could be nothing more than a Timer that every once in a while reads a CSV file dumped in a folder and for each user update encoded on that file pumps a message to a queue encoding the update you want to do.
Somewhere else, you have a message driven bean that reacts to the update request message and triggers the update business logic for the user reported on the JMS message.
After the transaction is committed successfuly, if you have ten differn application that are interested in knowing that the user was updated, you could post a message to, for example, a notification topic with - say - messageType='userUpdated'.
Each of your 10 applications that cares about this could be a consumer on this topic.
They would be informed that a user was updated and maybe internally publish a local event (e.g. a CDI event - a guava event - whatever - and the internal satke holders would now of it).
Normally, there are always Java EE replacements in every technlogy stack.
Every decent technology stack offers ways to promote loose coupling between UI and business logic, precisely so that HTML / WEB is just viewed as one of many entry points to an application's business logic.
In scala, i there is an AKKA framework that looks super interesting.
The point is, as long as your business logic is not written in some place that only the web application can tap into, your fine. Otherwise, you've made the design decision to couple your business logic with UI.

In your case, I would suggest to make a separation by concern, I mean a plugin that gathers only the domain classes if using Grails, the other plugins that take care of Services ... these would represent the application core, I think it's much easier this way if your application contain too much KLOC, using microservices will take you time too much time if you have a lot of calls between modules.
Communication between functional modules aka. plugins can be made via events, see events-si or rabbit MQ plugin.

Two different Java applications sharing the same database

In my web application I have a part which needs to continuously crawl the Web, process those data and present it to a user. So I was wondering if it is a good approach to split it up into two separate applications where one would do the crawling, data processing and store the data in the database. And the other app would be a web application (mounted on some web server) which would present to a user the data from the database and allow him a certain interaction with the data.
The reason I think I need this split is because if I make certain changes to my web app (like adding new functionalities, change the interface etc.) I wouldn't like the crawling to be interrupted.
My application stack is Tapestry (web layer), Spring, Hibernate (over MySQL) and my own implementation of the crawler independent from the others.
Is it good for the integration to be done just by using the same database? This might cause an issue with accessing the database from the both applications at the same time. Or can the integration be done on the Hibernate level, so both applications could use the same Hibernate session? But can the app from one JVM instance access the object from another JVM instance?
I would be grateful for any suggestions regarding this matter.
UPDATE
The user (from web app's interface) would enter the URLs for crawler to parse. The crawler app would just read the tables with URLs the web app populates. And vice versa, the data processed by the crawler would just be presented on the user interface. So, I think I shouldn't concern about any kind locking, right?
Thanks,
Nikola

I would definitely keep them separated like you are planning. The web crawling is more a "batch" process than a request driven web application. The web crawling app will run in its own JVM and your web app will be running in a servlet/Java EE container.
How often will the crawler run or is it a continuously running process? You may want to consider the frequency based on your requirements.
Will the users from web app be updating the same tables that the crawler will post data to? In that case you will need to take precaution otherwise a potential deadlock may arise. If you want your web app to auto refresh data based on new inserts in the tables then you can create a message driven bean (using JMS) to asynchronously notify the web app from the crawler app. When a new data insert message arrives you can either do a form submit on your page or use ajax to update the data on the page itself.
The web app should use connection pooling and the batch app could use DBCP or C3P0. I am not sure you gain much benefit by trying to share the database sessions in this scenario.
This way you have the integration between the two apps while not slowing down each other waiting on other to process.
HTH!

You are right, splitting the application into two could be reasonable in your case.
Disadvantages of separating into two applications -
You can not cache in Hibernate or any other cached mutable objects that are modifiable from both applications in any one of them. Optimistic locking should work fine with two hibernate applications. I don't see any other problems.
Advantages you have already specified in your code.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.