What ways exist to distribute asynchronous batch tasks?

What ways exist to distribute asynchronous batch tasks? - java

I am currently investigating what Java compatible solutions exist to address my requirements as follows:
Timer based / Schedulable tasks to batch process
Distributed, and by that providing the ability to scale horizontally
Resilience, no SPFs please
The nature of these tasks (heavy XML generation, and the delivery to web based receiving nodes) means running them on a single server using something like Quartz isn't viable.
I have heard of technologies like Hadoop and JavaSpaces which have addressed the scaling and resilience end of the problem effectively. Not knowing whether these are quite suited to my requirements, its hard to know what other technologies might fit well.
I was wondering really what people in this space felt were options available, and how each plays its strengths, or suits certain problems better than others.
NB: Its worth noting that schedule-ability is perhaps a hangover from how we do things presently. Yes there are tasks which ought to go at certain times. It has also been used to throttle throughput at times when no mandate for set times exists.

Asynchronous always brings JMS to mind for me. Send the request message to a queue; a MessageListener is plucked out of the pool to handle it.
This can scale, because the queue and listener can be on a remote server. The size of the listener thread pool can be configured. You can have different listeners for different tasks.
UPDATE: You can avoid having a single point of failure by clustering and load balancing.
You can get JMS without cost using ActiveMQ (open source), JBOSS (open source version available), or any Java EE app server, so budget isn't a consideration.
And no lock-in, because you're using JMS, besides the fact that you're using Java.
I'd recommend doing it with Spring message driven POJOs. The community edition is open source, of course.
If that doesn't do it for you, have a look at Spring Batch and Spring Integration. Both of those might be useful, and the community editions are open source.

Have you looked into GridGain? I am pretty sure it won't solve the scheduling problem, but you can scale it and it happens like "magic", the code to be executed is sent to a node and it is executed in there. It works fine when you don't have a database connection to be sent (or anything that is not serializable).

Related

MQ choice for point-to-point high throughput

I have a scenario with these particular demands:
Production ready & stable.
Point to point connection, with the producer behind a firewall and a consumer in the cloud. It might be possible to split the traffic between a couple of producers\consumers, but all the traffic still has to traverse a single WAN connection which will probably be the bottleneck.
High throughput - something along the order of 300 Mb/sec (may be up to 1Gb!). Message sizes vary from ~1KB to possibly several MBs.
Guaranteed delivery a must - every message has to arrive at the consumer eventually, so we need to start saving messages to disk in the event of a momentary network outage or risk running out of memory.
Message order is not important, messages are timestamped and can be re-arranged at the consumer.
Highly preferable but not as important - should run on both linux & windows (JVM seems the obvious choice)
I've been looking at so many MQs lately, and I don't have any hands-on experience with any.
Thought it will be a better idea to ask someone with experience.
We're considering mostly Kafka, but I'm not sure it's the best for our use case, seems to be tailored to distributed deployment & mutliple topics\consumers\producers. Also, definitely not production ready on windows.
What about Apache ActiveMQ or Apollo\Artemis? RabbitMQ seems not to be a good fit for our performance requirements. Or maybe there's some Java library that has the features we need without a middleman broker?
Any help making sense of this kludge would be greatly appreciated.

If anyone comes across this, we went with Kafka in the end. Its performance is impressive and so far it's very stable on linux. No attempt yet to run it on windows in production deployments.
UPDATE 12/3/2017:
Works fine and very stable on Linux, but on Windows this is not usable in production. Old data never gets deleted due to leaky file handles, the relevant Jira is being ignored since 2013: https://issues.apache.org/jira/browse/KAFKA-1194

Decision to go for distributed application?

I have a legacy product in financial domain.Using tomcat 6. We get millions of request 10k of request in hour. I am wondering at high level
should i go for ditributed application where my mvc component is on one system and service/dao on another box(can use spring remote/EJB).
The reason i am planning to go in this direction so that load is distribute and get better performance With this it becomes scalable also.
I only see the positive side of it but somehow not able to figure out what can be the negative aspect of it?
If some expert can help
what is the criteria i should consider to go for distributed model and pros/cons of it? I also tried googling where i could get some stats
like how much load a given webserver (tomcat in my case)handle efiiciently with given hardware(16 gb ram, windows 7, processor ).
Yes i am going
to do POC where i will be measuring performance with distributed model vs without bit high level input will be highly appreciated?

It is impossible to answer this questions without more details - how long does it take to reply to one request on the current server? How many resources are allocated for one request?
having 10k requests per hour means ~3 requests per second. If performing the necessary operations and replying to a request, using 1 CPU takes ~300ms - one simple machine is totally fine. This is simple math, and doesn't always work. I guess you still have peaks within those 10k requests per hour and they aren't gradually distributed.
If we assume, one reply can take up to 1 second, than you can handle as many replies per second as your system has CPUs (given that a CPU would be the bottle neck) If the CPU isn't the bottle neck for your application server, there's probably something wrong. You should set up the database(s) on a different machine and only perform computation tasks on the application server machine.
Especially in the financial sector with a legacy software, I wouldn't try splitting a running product. How old is the current server? I believe that a new Server should be cheaper than rewriting an application. Unless you expect 50-100k requests per hour very soon, I don't think, splitting up such small parts makes sense.
Instead - run it on an up to date server hardware, split application server and data storage and you should be fine.

I am wondering at high level if should i go for ditributed application where my mvc component is on one system and service/dao on another box(can use spring remote/EJB).
I'm not sure what you mean for "system" in this context, but if it means that you are planning to run your application in two servers,
one dedicated to presentation and other dedicated to business layer, take in mind that a simpler approach (and probably more suitable for your app)
is build a co-located architecture.
Basically, the idea is to replicate your app in several servers (at least two) and put in front of them a load balancer that routes the incoming requests among the available servers.
All servers share the same database instance. This will give you vertical scalability and also will improve the availability of your system.
I only see the positive side of it but somehow not able to figure out what can be the negative aspect of it?
Distributing your business logic will probably involve a refactor of your application code, if the system is working well you will add some bugs for sure.
The necessary remote calls will add latency and the fact that you execute your business logic in several servers doesn't resolve the performance problems on the presentation tier.
In Expert One-on-One J2EE Development Without EJB (pag. 65), you can find a good reading about why not distribute your business logic.

How to properly throttle web requests to external systems?

My Java web application pulls some data from external systems (JSON over HTTP) both live whenever the users of my application request it and batch (nightly updates for cases where no user has requested it). The data changes so caching options are likely exhausted.
The external systems have some throttling in place, the exact parameters of which I don't know, and which likely change depending on system load (e.g., peak times 10 requests per second from one IP address, off-peak times 100 requests per second from open IP address). If the requests are too frequent, they time out or return HTTP 503.
Right now I am attempting the request 5 times with 2000ms delay between each, giving up if an error is received each time. This is not optimal as sometimes at peak-times nearly all requests fail; I could avoid making these requests and perhaps get at least some to succeed instead.
My goals are to have a somewhat simple, reliable design, and enough flexibility so that I could both pull some metrics from the throttler to understand how well the external systems are responding (and thus adjust how often they are invoked), and to auto-adjust the interval with which I call them (individually per system) so that it is optimal both on off-peak and peak hours.
My infrastructure is Java with RabbitMQ over MongoDB over Linux.
I'm thinking of three main options:
Since I already have RabbitMQ used for batch processing, I could just introduce a queue to which the web processes would send the requests they have for external systems, then worker processes would read from that queue, throttle themselves as needed, and return the results. This would allow running multiple parallel worker processes on more servers if needed. My main concern is that it isn't a very simple solution, and how to manage peak-hour throughput being low and thus the web processes waiting for a long while. Also this converts my RabbitMQ into a critical single failure point; if it dies the whole system stops (as opposed to the nightly batch processes just not running any more, which is less critical). I suppose rpc is the correct pattern of RabbitMQ usage, but not sure. Edit - I've posted a related question How to properly implement RabbitMQ RPC from Java servlet web container? on how to implement this.
Introduce nginx (e.g. ngx_http_limit_req_module), HAProxy (link) or other proxy software to the mix (as reverse proxies?), have them take care of the throttling through some configuration magic. The pro is that I don't have to make code changes. The con is that it is more technology used, and one I've not used before, so chances of misconfiguring something are quite high. It would also likely not be easy to do dynamic throttling depending on external server load, or prioritizing live requests over batch requests, or get statistics of how the throttling is doing. Also, most documentation and examples will likely be on throttling incoming requests, not outgoing.
Do a pure-Java solution (e.g., leaky bucket implementation). Would be simple in the sense that it is "just code", but the devil is in the details; debugging all the deadlocks, starvations and race conditions isn't always fun.
What am I missing here?
Which is the best solution in this case?
P.S. Somewhat related question - what's the proper approach to log all the external system invocations, so that statistics are collected as to how often I invoke them, and what the success rate is?
E.g., after every invocation I'd invoke something like .logExternalSystemInvocation(externalSystemName, wasSuccessful, elapsedTimeMills), and then get some aggregate data out of it whenever needed.
Is there a standard library/tool to use, or do I have to roll my own?
If I use option 1. with RabbitMQ, is there a way to organize the flow so that I get this out of the box from the RabbitMQ console? I wouldn't want to send all failed messages to poison queue, it would fill up too quickly though and in most cases there is no need to re-process these failed requests as the user has already sadly moved on.

Perhaps this open source system can help you a little: http://code.google.com/p/valogato/

Hard time choosing ... IO vs NIO

I would like to ask what would be more appropriate to choose when developing a server similar to SmartFoxServer. I intend to develop a similar yet different server. In the benchmarks made by the ones that developed the above server they had something like 10000 concurrent clients.
I made a bit of research regarding the cost of using too many threads(>500) but cannot decide which way to go. I once made a server in java but that was for a small application and had nothing to do with heavy loads.
Thanks

Take a look at Apache Mina. They've done alot of the heavy lifting required to use NIO effectively in a networking application. Whether or not NIO increases your ability to process concurrent connections really depends on your implementation, but the performance boosts in Tomcat, JBoss and Jetty are plenty evidence to you already in the positive.

i'm not familiar with smartfoxserver, so i can only speak generically (which is not always good :P but here i go)
i think those are 2 different questions. on one hand, the io performance when using native java sockets vs. native sockets written in c (like tomcat).
the other question is how to scale up to that kind of concurrency level. other than that, i'd always choose native sockets (i.e: c).
now, how to scale: it's not a good idea to have a lot of threads running at the same time (os constraints, etc), so i'd choose to scale horizontally, meaning to add a load balancer that can send the requests to different servers that can be linked by using messages (using jms, like rabbitmq or activemq, or even using a protocol like stomp or amqp).
other solution, a cloud environment that allows you to grow your installation as you need

In most benchmarks which test 10K or 100K connections, the server is doing no work and unless your server does next to nothing, these test are unrealistic.
You need to take a clear idea of mow many concurrent connections you want to support.
If you have less than 1K connection, using a thread per connection will work ok. This is the simplest approach to take. Using a dispatcher model with NIO will work better if your request are very simple. Otherwise it won't matter much.
If you have more than 1K connections it is likely you want to use more than one server as each connection is getting less than 1% of a core and the cost of a basic server is relatively cheap these days.

Critically efficient server

I am developing a client-server based application for financial alerts, where the client can set a value as the alert for a chosen financial instrument , and when this value will be reached the monitoring server will somehow alert the client (email, sms ... not important) .The server will monitor updates that come from a data generator program. Now, the server has to be very efficient as it has to handle many clients (possible over 50-100.000 alerts ,with updates coming at 1,2 seconds) .I've written servers before , but never with such imposed performances and I'm simply afraid that a basic approach(like before) will just not do it . So how should I design the server ?, what kind of data structures are best suited ?..what about multithreading ?....in general what should I do (and what I should not do) to squeeze every drop of performance out of it ?
Thanks.

I've worked on servers like this before. They were all written in C (or fairly simple C++). But they were even higher performance -- handling 20K updates per second (all updates from most major stock exchanges).
We would focus on not copying memory around. We were very careful in what STL classes we used. As far as updates, each financial instrument would be an object, and any clients that wanted to hear about that instrument would subscribe to it (ie get added to a list).
The server was multi-threaded, but not heavily so -- maybe a thread handing incoming updates, one handling outgoing client updates, one handling client subscribe/release notifications (don't remember that part -- just remember it had fewer threads than I would have expected, but not just one).
EDIT: Oh, and before I forget, the number of financial transactions happening is growing at an exponential rate. That 20K/sec server was just barely keeping up and the architects were getting stressed about what to do next year. I hear all major financial firms are facing similar problems.

You might want to look into using a proven message queue system, as it sounds like this is basically what you are doing in your application.
Projects like Apache's ActiveMQ or RabbitMQ are already widely used and highly tuned, and should be able to support the type of load you are talking about outside of the box.

I would think that squeezing every drop of performance out of it is not what you want to do, as you really never want that server to be under load significant enough to take it out of a real-time response scenario.
Instead, I would use a separate machine to handle messaging clients, and let that main, critical server focus directly on processing input data in "real time" to watch for alert criteria.

Best advice is to design your server so that it scales horizontally.
This means distributing your input events to one or more servers (on the same or different machines), that individually decide whether they need to handle a particular message.
Will you be supporting 50,000 clients on day 1? Then that should be your focus: how easily can you define a single client's needs, and how many clients can you support on a single server?
Second-best advice is not to artificially constrain yourself. If you say "we can't afford to have more than one machine," then you've already set yourself up for failure.

Beware of any architecture that needs clustered application servers to get a reasonable degree of performance. London Stock Exchange had just such a problem recently when they pulled an existing Tandem-based system and replaced it with clustered .Net servers.
You will have a lot of trouble getting this type of performance from a single Java or .Net server - really you need to consider C or C++. A clustered architecture is much more error prone to build and deploy and harder to guarantee uptime from.
For really high volumes you need to think in terms of using asynchronous I/O for networking (i.e. poll(), select() and asynchronous writes or their Windows equivalents), possibly with a pool of worker threads. Read up about the C10K problem for some more insight into this.
There is a very mature C++ framework called ACE (Adaptive Communications Environment) which was designed for high volume server applications in telecommunications. It may be a good foundation for your product - it has support for quite a variety of concurrency models and deals with most of the nuts and bolts of synchronisation within the framework. You might find that the time spent learning how to drive this framework pays you back in less development and easier implementation and testing.

One Thread for the receiving of instrument updates which will process the update and put it in a BlockingQueue.
One Thread to take the update from the BlockingQueue and hand it off to the process that handles that instrument, or set of instruments. This process will need to serialize the events to an instrument so the customer will not receive notices out-of-order.
This process (Thread) will need to iterated through the list of customers registered to receive notification and create a list of customers who should be notified based on their criteria. The process should then hand off the list to another process that will notify the customer of the change.
The notification process should iterate through the list and send each notification event to another process that handles how the customer wants to be notified (email, etc.).
One of the problems will be that with 100,000 customers synchronizing access to the list of customers and their criteria to be monitored.

You should try to find a way to organize the alerts as a tree and be able to quickly decide what alerts can be triggered by an update.
For example let's assume that the alert is the level of a certain indicator. Said indicator can have a range of 0, n. I would groups the clients who want to be notified of the level of the said indicator in a sort of a binary tree. That way you can scale it properly (you can actually implement a subtree as a process on a different machine) and the number of matches required to find the proper subset of clients will always be logarithmic.

Probably the Apache Mina network application framework as well as Apache Camel for messages routing are the good start point. Also Kilim message-passing framework looks very promising.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.