Is Spring Integration suitable for web-farm processing of "reliable queue"? - java

Sorry if the title is confusing; let me explain my question.
Our team needs to develop a web service that is supposed to run on several nodes (web farm - horizontal scaling). We know how to implement this "manually", but we're pretty excited about Spring Integration, which is new to us - so we're really trying to understand whether it is a good fit for our scenario, and if so, we'll try to make use of it.
Typical scenario:
Several servers ("nodes") running the same web application (let's call it "OurWebService")
We need to pull files from external systems ("InboundExtSystems")
Process this data with the help of other external systems, which involves local resource-consuming operations ("UtilityExtServices")
Submit processing results to another set of external systems ("OutboundExtSystems")
Non-functional requirements:
For performance reasons we cannot query UtilityExtServices on demand - and the local processing is also CPU-intensive. So we need a queue in order to control the pace at which we perform requests and process results.
We expect several nodes to pull tasks from this queue evenly and process them.
We need to make sure that every queued task pulled from InboundExtSystems will be handled - we need to guarantee that none of them disappears.
We need to make sure timeouts are handled as well. If task processing times out, we need to requeue the task (and make sure the previous handler will not submit results for it).
We need to be able to perform rolling updates. Say 5 nodes are processing the queue; we want to be able to sequentially stop, upgrade and restart each node without noticeably impacting system performance.
So the question is: is Spring Integration a good fit for such a case?
If the answer is "yes", could you kindly name the primary components we should use?
p.s. Sure enough, we would probably also need to pick something as a message bus and a queue accessible by every node (maybe Redis, Hazelcast or maybe RabbitMQ; not sure which is most appropriate).

Yes, it's a good fit. I would suggest RabbitMQ for the transport/queuing and the Spring Integration AMQP endpoints.
Rolling updates shouldn't be an issue unless you change the format of the messages sent between nodes. But even then you could handle it relatively easily by moving to a new set of queues.
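For illustration, here is a minimal sketch of a consuming flow with the Spring Integration Java DSL, assuming a shared RabbitMQ queue named "tasks" (the queue name and the process() step are placeholders, not part of your system):

    import org.springframework.amqp.rabbit.connection.ConnectionFactory;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.integration.amqp.dsl.Amqp;
    import org.springframework.integration.config.EnableIntegration;
    import org.springframework.integration.dsl.IntegrationFlow;
    import org.springframework.integration.dsl.IntegrationFlows;

    @Configuration
    @EnableIntegration
    public class TaskFlowConfig {

        // Every node runs this same flow, so the broker load-balances tasks
        // across nodes. The message is acknowledged only after the handler
        // returns, so a node that dies mid-task leaves the message on the
        // queue for redelivery - which covers the "no task disappears" and
        // rolling-update requirements.
        @Bean
        public IntegrationFlow taskConsumerFlow(ConnectionFactory connectionFactory) {
            return IntegrationFlows
                    .from(Amqp.inboundAdapter(connectionFactory, "tasks"))
                    .handle(message -> process(message.getPayload()))
                    .get();
        }

        private void process(Object task) {
            // call UtilityExtServices, then submit results to OutboundExtSystems
        }
    }

The adapter's listener container settings (concurrent consumers, prefetch count) give each node a knob for the pacing requirement.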

Related

Spring event lifecycle

To understand whether Spring events fit the task I'm working on, I need to understand how they work - where are they stored?
My guess is that they are stored in the Spring application context and disappear if the application crashes. Is that correct?
Spring events are intended for use when calling methods directly would create too much coupling. If you need to keep events for auditing or replay purposes, you have to save them yourself. Based on your comments, there are many ways to achieve this, depending on the topology and purpose of the application (list not complete):
Model entities that represent the events and store them in a repository
Incorporate a message broker such as Kafka that supports message persistence
Install an in-memory cache such as Hazelcast
Use a cloud message service such as AWS SQS
Lastly, please make sure that you carefully evaluate which option suits your needs best. Options 2 to 4 all introduce heavy complexity, and distributed applications can bring sorrow and misery to your life. Go for the simplest option if you can, and only resort to the other options if absolutely necessary.
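If you go with option 1, a minimal sketch of persisting events from a listener could look like this - note that EventRecord, EventRepository and OrderPlacedEvent are hypothetical names standing in for your own types:

    import java.time.Instant;
    import org.springframework.context.event.EventListener;
    import org.springframework.stereotype.Component;

    @Component
    public class AuditingEventListener {

        private final EventRepository repository; // hypothetical repository

        public AuditingEventListener(EventRepository repository) {
            this.repository = repository;
        }

        // Runs whenever the event is published; the saved row survives a
        // crash, unlike the in-memory event itself.
        @EventListener
        public void on(OrderPlacedEvent event) {
            repository.save(new EventRecord(event.getOrderId(), Instant.now()));
        }
    }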

With CQRS, is it a good design to use Kafka for sending commands and to make sure aggregates are cached locally on one node?

With a CQRS architecture, in write-intensive real-time applications like trading systems, the traditional approach of loading aggregates from a database + distributed cache + distributed lock does not perform well.
The actor model (Akka) fits well here, but I am looking for an alternative solution. What I have in mind is to use Kafka for sending commands and to make use of topic partitioning so that commands for the same aggregate always arrive on the same node. Then use a database + local cache + local pessimistic lock to load aggregate roots and handle commands. This brings 3 main benefits:
aggregates are distributed across multiple nodes
no network traffic for looking up a central cache or acquiring distributed locks
no serialization & deserialization when saving and loading aggregates
One problem with this approach is that a consumer-group rebalance may leave stale aggregate state in the local cache; setting short cache timeouts should work most of the time.
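For illustration, the routing guarantee comes from keying each command by its aggregate id: Kafka's default partitioner hashes the record key, so all commands for one aggregate land on the same partition and hence on the same consumer. A minimal producer sketch (the topic name and payload are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class CommandSender {

        // In real code you would reuse one producer per process; it is
        // created inline here only to keep the sketch self-contained.
        public static void send(String aggregateId, String commandJson) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Key = aggregate id: the default partitioner hashes the key,
                // so all commands for this aggregate land on one partition
                // and therefore on exactly one consumer in the group.
                producer.send(new ProducerRecord<>("commands", aggregateId, commandJson));
            }
        }
    }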
Has anyone used this approach in real projects?
Is it a good design, and what are the downsides?
Please share your thoughts and experiences. Thank you.
IMHO Kafka will do the job for you. You need to make sure the network is fast enough.
In our project we react in soft real time to customer needs and purchases, and we send information over Kafka to different services that perform the business logic. This works well.
Confirmations at the network level are handled well within the Kafka broker.
For example, when one of the broker nodes crashes, we do not lose messages.
Another matter is whether you need very strong transactional confirmations for all actions; then you need to be careful in the design - perhaps you need more topics, to send the information and all the needed logical confirmations.
If you need to implement some more logic, such as a confirmation once the message has been processed by another external service, you will perhaps also need to disable auto-commits.
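For example, with auto-commit disabled you commit the offset only after your own processing (and any downstream confirmation) has finished, so an unprocessed message is re-read after a crash. A minimal sketch, assuming a "commands" topic and a hypothetical handle() method:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CommandConsumer {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "command-handlers");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("commands"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        handle(record); // business logic / external confirmation
                    }
                    consumer.commitSync(); // offsets advance only after successful handling
                }
            }
        }

        static void handle(ConsumerRecord<String, String> record) {
            // process the command here
        }
    }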
I do not know if this is a complete answer to your question.

Multi Threading vs JMS Queue for Asynchronous Logging

Requirement: log events like page views and form submits. Each page has a ~1 second SLA. The application can have hundreds of concurrent users at a time.
Log events are stored into the Database.
Solution: my initial thought was to use an async logging approach where control returns to the application and the logging happens in a different thread (via Spring's ThreadPoolTaskExecutor).
However, someone suggested that using JMS would be a more robust approach. Is the added work (setting up queues, writing to the queues, reading from the queues) required by this approach worthwhile?
What are some of the best practices / things to look out for (in a production environment) when implementing something like this?
Both approaches are valid, but one is vulnerable if your app unexpectedly stops. In your first scenario, events yet to be written to the database will be lost. Using a persistent JMS queue means that those events will be read from the queue and persisted to the database upon restart.
Of course, if your DB writes are that much slower than placing a message of similar size onto a JMS queue, you may be solving the wrong problem.
Using JMS for logging is a complete mismatch. JMS is a Java abstraction for a middleware tool like MQ Series. That is complete overkill and will put you through setup and configuration hell. JMS also lets you place messages in a transactional context, so you quickly get the idea that JMS might not be much better than database writes, as @rjsang suggested.
That is not to say JMS is not a nice technology. It is a good technology where it is applied properly.
For asynchronous logging, you are better off depending on a logging API that directly supports it, such as Log4j2. In your case, you might be looking at configuring an AsyncAppender with a JDBCAppender. Log4j2 has many more appenders as additional options, including one for JMS. However, by at least using a logging abstraction, you make it all configurable and make it possible to change your mind later.
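For example, a minimal log4j2.xml sketch wiring an Async appender in front of a JDBC appender - the table, column and JNDI names here are placeholders to adapt to your schema:

    <?xml version="1.0" encoding="UTF-8"?>
    <Configuration status="WARN">
      <Appenders>
        <JDBC name="db" tableName="EVENT_LOG">
          <DataSource jndiName="java:comp/env/jdbc/LoggingDS"/>
          <Column name="EVENT_DATE" isEventTimestamp="true"/>
          <Column name="LOG_LEVEL" pattern="%level"/>
          <Column name="MESSAGE" pattern="%message"/>
        </JDBC>
        <!-- Callers return immediately; a background thread drains to the DB -->
        <Async name="asyncDb">
          <AppenderRef ref="db"/>
        </Async>
      </Appenders>
      <Loggers>
        <Root level="info">
          <AppenderRef ref="asyncDb"/>
        </Root>
      </Loggers>
    </Configuration>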
In the future we might have something similar to asynchronous CDI events, which should work like JMS but would be much more lightweight. Maybe you can get something similar to work by combining CDI events with EJB asynchronous methods. As long as you don't use EJBs with a remote interface, it should also be pretty lightweight.
You could give fully async and external tooling a try if you want to. If you have to stick to your SLA at any price and resilience is important to you, you could try either Logstash or processing your logs offline. By doing so, you decouple your application from the database and no longer depend on database performance. If the database is slow and you're using async loggers, the queues might run full.
With Logstash using GELF, the whole log processing is handled in a different (or even remote) JVM. Offline processing (e.g. writing CSV logs) lets you load the log data into the database afterwards.

Combining java spring/thread and database access for time-critical web applications

I'm developing an MVC spring web app, and I would like to store the actions of my users (what they click on, etc.) in a database for offline analysis. Let's say an action is a tuple (long userId, long actionId, Date timestamp). I'm not specifically interested in the actions of my users, but I take this as an example.
I expect a lot of actions by a lot of (different) users per minute (even per second). Hence the processing time is crucial.
In my current implementation, I've defined a datasource with a connection pool to store the actions in a database. I call a service from the request method of a controller, and this service calls a DAO which saves the action into the database.
This implementation is not efficient because it waits for the call to go from the controller all the way down to the database before returning the response to the user. Therefore I was thinking of wrapping this "action saving" in a thread, so that the response to the user is faster; the thread does not need to finish for the response to be sent.
I've no experience in these massive, concurrent and time-critical applications. So any feedback/comments would be very helpful.
Now my questions are:
How would you design such system?
Would you implement a service and then wrap it in a thread spawned on every action?
What should I use?
I checked Spring Batch and its JobLauncher, but I'm not sure it is the right thing for me.
What happens when there are concurrent accesses at the controller, service, DAO and datasource levels?
In more general terms, what are the best practices for designing such applications?
Thank you for your help!
Keep a singleton object at the application level and update it with every user action.
This singleton should hold the actions in a map and, once it reaches a threshold of say 10,000 entries, save them to the DB as a batch (e.g. via Spring Batch).
Also, periodically clean it up to the last record it has processed; you could even re-initialize the singleton instance weekly or monthly. Remember, this can lead to inconsistent updates if your app is deployed across multiple JVMs, since each JVM holds its own instance. You also need to protect the singleton from being duplicated (e.g. make it non-cloneable by throwing CloneNotSupportedException). A sketch of the buffering idea follows below.
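A minimal sketch of that buffering idea, assuming hypothetical UserAction and ActionRepository types (the latter doing the batched insert):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicInteger;
    import org.springframework.stereotype.Component;

    @Component
    public class ActionBuffer {

        private static final int THRESHOLD = 10_000;

        private final Queue<UserAction> buffer = new ConcurrentLinkedQueue<>();
        private final AtomicInteger count = new AtomicInteger();
        private final ActionRepository repository; // hypothetical batch-insert DAO

        public ActionBuffer(ActionRepository repository) {
            this.repository = repository;
        }

        public void record(UserAction action) {
            buffer.add(action);
            if (count.incrementAndGet() >= THRESHOLD) {
                flush();
            }
        }

        // Drain and write in one batched statement; synchronized so only one
        // thread flushes at a time. Note the caveat above: each JVM has its
        // own buffer, and anything still buffered is lost on a crash.
        synchronized void flush() {
            List<UserAction> batch = new ArrayList<>();
            UserAction action;
            while ((action = buffer.poll()) != null) {
                batch.add(action);
            }
            count.addAndGet(-batch.size());
            if (!batch.isEmpty()) {
                repository.saveAll(batch);
            }
        }
    }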
Here's what I did for that :
Used AspectJ to mark all the actions of the user I wanted to collect.
Then I sent these to Log4j with an asynchronous DB appender...
This lets you turn it on or off with the Log4j logging level.
Works perfectly.
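For illustration, the aspect side of this can be as small as one advice that routes matched calls into a dedicated logger - the @TrackedAction annotation and the logger name here are my own placeholders, not from the answer above:

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;
    import org.aspectj.lang.JoinPoint;
    import org.aspectj.lang.annotation.AfterReturning;
    import org.aspectj.lang.annotation.Aspect;
    import org.springframework.stereotype.Component;

    @Aspect
    @Component
    public class ActionLoggingAspect {

        private static final Logger LOG = LogManager.getLogger("user-actions");

        // Whatever appender "user-actions" is bound to (e.g. an async DB
        // appender) receives the entry; setting the logger level to OFF
        // switches collection off without touching application code.
        @AfterReturning("@annotation(com.example.TrackedAction)")
        public void logAction(JoinPoint joinPoint) {
            LOG.info("action={} args={}", joinPoint.getSignature().getName(),
                    joinPoint.getArgs());
        }
    }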
If you are interested in the actions your users take, you should be able to figure that out from the HTTP requests they send, so you might be better off logging the incoming requests in an Apache webserver that forwards to your application server. Putting a cluster of web servers in front of application servers is a typical practice (they're good for serving static content) and they are usually logging requests anyway. That way the logging will be fast, your application will not have to deal with it, and the biggest work will be writing a script to slurp the logs into a database where you can do analysis.
Typically it is considered bad form to spawn your own threads in a Java EE application.
A better approach would be to write to a local queue via JMS and then have a separate component, e.g. a message-driven bean (pretty easy with EJB or Spring), which persists it to the database.
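A minimal sketch of that consumer side with Spring JMS, assuming an "actions" queue, @EnableJms on a configuration class, a Serializable UserAction, and a hypothetical ActionDao:

    import org.springframework.jms.annotation.JmsListener;
    import org.springframework.stereotype.Component;

    @Component
    public class ActionPersister {

        private final ActionDao dao; // hypothetical

        public ActionPersister(ActionDao dao) {
            this.dao = dao;
        }

        // Runs on the listener container's threads, off the request thread;
        // with a persistent queue, unprocessed actions survive a restart.
        @JmsListener(destination = "actions")
        public void onAction(UserAction action) {
            dao.save(action);
        }
    }

The producing side in the controller is then a one-liner: jmsTemplate.convertAndSend("actions", action).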
Another approach would be to just write to a log file and then have a process read the log file and write to the database once a day or whenever.
The things to consider are:
How up-to-date do you need the information to be?
How critical is the information, can you lose some?
How reliable does the order need to be?
All of these will factor into how many threads you have processing your queue/log file, whether you need a persistent JMS queue and whether you should have the processing occur on a remote system to your main container.
Hope this answers your questions.

Java enterprise architecture for delegating tasks between applications

In my environment I need to schedule long-running tasks. I have application A, which just shows the client the list of currently running tasks and allows them to schedule new ones. There is also application B, which does the actual hard work.
So app A needs to schedule a task in app B. The only thing they have in common is the database. The simplest thing to do seems to be adding a table with a list of tasks and having app B query that table every once in a while and execute newly scheduled tasks.
Yet, it doesn't seem to be the proper way of doing it. At first glance it seems that the tool for this job in an enterprise environment is a message queue. App A sends a message with the task description to the queue, app B reads a message from the queue and executes the task. Is it possible in such a case for app A to get the status of all the scheduled tasks (persistent queue?) without creating a table like the one mentioned above, to which app B would write the status of completed tasks? Note also that there may be multiple instances of app A, and each of them needs to know about all tasks of all instances.
The disadvantage of the 'table approach' is that I need to have DB polling.
The disadvantage of the 'message queue approach' is that I'm introducing a new communication channel into the infrastructure (yet another thing that can fail).
What do you think? Any other ideas?
Thank you in advance for any advice :)
========== UPDATE ==========
Eventually I decided on the following approach. There are two sides to this problem: one is the communication between A and B; the other is getting information about the tasks.
For communication the right tool for the job is JMS. For getting data the right tool is the database.
So I'll have app A add a new row to the 'tasks' table describing a task (I can query this table later on to get the list of all tasks). Then A will send a message to B via JMS just to say "you have work to do". B will do the work and update the task status in the table.
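A minimal sketch of that write-then-notify step on the A side, assuming a hypothetical TaskRepository over the 'tasks' table and a configured JmsTemplate:

    import org.springframework.jms.core.JmsTemplate;
    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;

    @Service
    public class TaskSchedulingService {

        private final TaskRepository tasks; // hypothetical DAO for the tasks table
        private final JmsTemplate jms;

        public TaskSchedulingService(TaskRepository tasks, JmsTemplate jms) {
            this.tasks = tasks;
            this.jms = jms;
        }

        @Transactional
        public long schedule(String description) {
            long id = tasks.insert(description, "PENDING"); // table is the source of truth
            jms.convertAndSend("task.notifications", id);   // "you have work to do"
            // Caveat: if the send is not part of the DB transaction, B may see
            // the notification before the row is visible; B can fall back to
            // polling the table in that case.
            return id;
        }
    }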
Thank you for all responses!
You need to think about your deployment environment both now and likely changes in the future.
You're effectively looking at two problems, both of which can be solved in several ways, depending on how much infrastructure you are able to obtain and are also willing to introduce; but it's also important to "right-size" your design for your problems.
Whilst you're correct to think about the use of both databases and messaging, you need to consider whether these items are overkill for your domain, and only you and others who know your domain can really answer that.
My advice would be to look at what is already in use in your area. If you already have database infrastructure that you can build on, then monitoring task activity and scheduling jobs in a database are not a bad idea. However, if you would have to run your own database, get new hardware, or don't have sufficient support resources, then the introduction of a database may not be a sensible option, and you could look at a simpler, but potentially more fragile, approach of having your processes write files to schedule jobs and report tasks.
At the same time, don't look at the introduction of a DB or JMS as inherently error prone. Correctly implemented they are stable and proven technologies that will make your system scalable and manageable.
As @kan says, exposing a web service interface is also a useful option.
Another option is to make B a service, e.g. expose control and status interfaces as REST or SOAP endpoints. In this case A will just be a client application of B. B stores its state in the database. A is a stateless application which just communicates with B.
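For example, B's control and status interface could be a small REST layer - a sketch only, where TaskService, TaskRequest and TaskStatus are hypothetical types inside app B:

    import java.util.List;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    @RequestMapping("/tasks")
    public class TaskController {

        private final TaskService service; // hypothetical service inside app B

        public TaskController(TaskService service) {
            this.service = service;
        }

        @PostMapping
        public long submit(@RequestBody TaskRequest request) {
            return service.schedule(request); // returns the new task id
        }

        // Any number of A instances can poll this instead of the database.
        @GetMapping
        public List<TaskStatus> status() {
            return service.listAll();
        }
    }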
BTW, using Spring Remoting you could expose an interface and use any of JMS, REST, SOAP or RMI as the transport layer, which could be changed later if necessary.
You have messages (JMS) in enterprise architecture. Use them; they are available in Java EE containers such as GlassFish. Messages can be persisted, to be sure they will be delivered even if the server reboots while they are in the queue. And you do not even need to care how all this is implemented.
There can be a couple of approaches here. First, as @kan suggested, have app B expose a web service for the interactions. This will allow heterogeneous clients to communicate with app B. Seems a good approach. App B can internally use whatever persistent store it deems fit.
Alternatively, you can have app B expose a management interface via JMX and have applications like app A talk to app B through this management interface. Implementing task submission and retrieving statistics etc. would be simpler. Additionally, you can leverage JMX notifications for real-time updates on task submissions and completions. The downside is that this would be a Java-specific solution, so supporting heterogeneous clients would be a distant dream.
