I am working as a developer on a batch processing solution. It works by splitting a big file and processing it across JVMs: we have 4 processor JVMs, each of which takes a chunk of the file and processes it, and 1 gateway JVM. The gateway JVM's job is to split the file into as many chunks as there are processor JVMs (i.e. 4) and send a REST request, which is consumed by the processor JVMs. The REST request carries all the details: the file location each processor has to pick its chunk up from, and some other details.
Now, if I want to add another processor JVM without any downtime, is there any way to do it? Currently we maintain the URLs for the 4 JVMs in a property file. Is there a better way to do this that would give me the ability to add more JVMs without restarting any component?
You can consider setting up a load balancer and putting your JVMs behind it. The load balancer would be responsible for distributing the incoming requests across the JVMs.
This way you can scale the number of JVMs up or down depending on the workload. Also, if one of the JVMs stops working, the rest of your system doesn't need to care about it anymore.
I'm not sure what your use case and tech stack are, but it sounds like you need a distributed system with auto-scaling and dynamic provisioning capabilities. Have you considered Hadoop, Spark clusters, or Akka?
If you cannot use any of those, then the solution is to maintain the list of JVMs in some datastore (let's say in a table); it is dynamic data, meaning one can add/remove/update JVMs. Then you need a resource manager that can decide whether to spin up a new JVM based on load or any other conditional logic; this resource manager needs to monitor the entire system. Also, whenever you create a task, chunk, or slice of data, distribute it using a message queue such as Apache ActiveMQ. You can also consider Kafka for more complex use cases. Nowadays, application servers such as WebSphere (Liberty profile) and WebLogic also provide auto-scaling capabilities, so if you are already using such an application server, you can think of making use of that capability. I hope this helps.
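For illustration, here is a minimal sketch of that registry idea. The `processor_jvms` table, its columns, and the `/process` endpoint are all assumptions for the sketch, not anything from the question; the point is just that the gateway re-reads the active list on every dispatch, so a newly registered JVM is picked up without restarting anything:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class DynamicDispatcher {

    private final HttpClient http = HttpClient.newHttpClient();

    // Re-read the active processor list on every dispatch, so a JVM
    // added to the table is used without restarting the gateway.
    List<String> activeProcessors(Connection db) throws SQLException {
        List<String> urls = new ArrayList<>();
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT url FROM processor_jvms WHERE active = TRUE");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                urls.add(rs.getString("url"));
            }
        }
        return urls;
    }

    // Round-robin the chunks across whatever processors are active right now.
    void dispatchChunks(Connection db, List<String> chunkLocations) throws Exception {
        List<String> processors = activeProcessors(db);
        if (processors.isEmpty()) {
            throw new IllegalStateException("no active processor JVMs registered");
        }
        for (int i = 0; i < chunkLocations.size(); i++) {
            String target = processors.get(i % processors.size());
            HttpRequest request = HttpRequest.newBuilder(URI.create(target + "/process"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(
                            "{\"fileLocation\":\"" + chunkLocations.get(i) + "\"}"))
                    .build();
            http.send(request, HttpResponse.BodyHandlers.ofString());
        }
    }
}
```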
I've been playing around with a Spring Boot application, the Micrometer façade, the StatsD implementation for Micrometer, and the AWS OpenTelemetry distro deployed on ECS/Fargate. So far, I've been able to export many different metrics (JVM, Tomcat, data source, etc.) to CloudWatch, adding the cluster name, the service name, and the task ID as dimensions.
My problem now is that I don't know how to handle that information. In a production deployment I may have more than one container, and I may need to scale them out/in. This makes it impossible (or at least I don't know how to do it) to create a dashboard, as I would need to select the task IDs up front. Another problem is that there is no way to add a filter to the dashboard that just shows the list of available task IDs, so I could select the one I want to monitor at that moment and remove the noise from the other ones. This is something QuickSight can do.
Am I better off just moving to something like Prometheus/Grafana for this? How do people handle monitoring of containers, especially Java applications?
AWS gives you the option to alert based on ECS metrics, but only at the service level (so I guess based on the average or max of CPU usage, for example), and that isn't enough when you have a workload that is not evenly spread across your instances. Is alerting not possible at the container level (something like "alert me when the service is at 60% CPU or a single container is at 80%", for example)?
I am using Spring Boot mail and ActiveMQ to build an email system. I followed this example project. Because our application's QPS is small, one server is enough to handle the requests. In the example project, ActiveMQ, the sender, and the receiver are all on the same server. Is this good practice for a small application? Or should I put ActiveMQ, the sender, and the receiver on three separate machines?
It depends...
The size of the application is irrelevant. It depends more on your requirements for availability, scalability, and data safety.
If you have everything on the same machine, you have a single point of failure: if the machine crashes, you lose everything on it. But this setup is the simplest one (also for maintenance), and the chance that the server will crash is low. Modern machines are able to handle a big load.
If you have a really high load and/or a requirement for guaranteed delivery, you should use multiple systems, with producers that send messages to an ActiveMQ cluster (itself distributed over multiple machines) and consumers likewise on more than one machine. Also use load balancers to connect to the machines.
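As a rough sketch of the producer side of such a setup: with ActiveMQ's failover transport, a producer connects to whichever broker in the cluster is available and reconnects transparently if one dies. The broker hostnames and queue name below are placeholders:

```java
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class MailProducer {
    public static void main(String[] args) throws Exception {
        // The failover: transport retries across the listed brokers automatically.
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                "failover:(tcp://broker1:61616,tcp://broker2:61616)");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("email.outbound");
        MessageProducer producer = session.createProducer(queue);
        // DeliveryMode.PERSISTENT is the default, so messages survive a broker restart.
        TextMessage message = session.createTextMessage("{\"to\":\"user@example.com\"}");
        producer.send(message);
        connection.close();
    }
}
```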
You can also have a setup somewhere in the middle of these two examples (simple and complex).
If you are able to reproduce all the messages (email messages in your example), and the load is not too high, I would advise you to keep it simple and put everything on the same machine.
The short answer is: it depends. The long answer is: measure it. The "small application" criterion is flawed. You can have both on the same server if that server has all the resources required by your application and the message queue broker, without impacting performance for end users.
I would suggest running performance tests against your criteria and then deciding on your target environment setup.
The simplest setup is everything on the same box. If this one box has enough CPU and disk space, why not? One (performance) advantage is that nothing needs to go over the network.
If you are concerned about fault tolerance, replicate the whole setup on a second machine.
I have multiple VMs, each running all of the modules in the project. Each request created by the user has to be processed by all modules, but each module needs to run only once per request. So if VM1 picks up a request, module1 can process the request partially; next, VM1 or VM2 or any other VM in the cluster can pick it up and process it for module2, and so on.
Since each VM has limited capacity, I would like to use a load balancer to allocate work among the individual VMs.
Are there load balancers (open source, for Java) available which can solve this, or do I need to implement one myself using load-balancing algorithms (round robin, weighted, etc.)?
Edit 1:
Each module is a Java class which is independent in itself, but needs the previous modules to be done before it is started. Each VM is listening to a message bus; as and when a message appears on the bus, any of the VMs can pick it up and start working on it.
You can try HAProxy (a TCP/HTTP load balancer), which is open source, feature-rich, and quite widely used. Apart from good documentation, you can find lots of information about it available.
Depending on the exact semantics of the problem you're trying to parallelize, you might get good results by chunking your problem into "work packets" of some size and keeping them in a central queue. Then, have each VM poll a packet from said queue as soon as it finishes the previous packet. This is called self-scheduling.
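A minimal in-process sketch of self-scheduling, assuming packets are plain strings; in the real setup the queue would live on the message bus, but each worker's poll loop looks the same:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SelfScheduling {
    public static void main(String[] args) {
        BlockingQueue<String> packets = new LinkedBlockingQueue<>();
        for (int i = 0; i < 100; i++) {
            packets.add("packet-" + i);
        }
        int workers = 4; // one per VM in the real setup
        for (int w = 0; w < workers; w++) {
            new Thread(() -> {
                String packet;
                // Each worker pulls the next packet as soon as it is free,
                // so faster workers naturally take on more of the load.
                while ((packet = packets.poll()) != null) {
                    process(packet);
                }
            }).start();
        }
    }

    static void process(String packet) {
        System.out.println(Thread.currentThread().getName() + " processed " + packet);
    }
}
```

The key property is that no central scheduler decides who gets what: faster workers simply come back for the next packet sooner, so the load evens out by itself.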
Currently I have two separate applications.
The first is a RESTful API.
The second is a data storage service that can process raw data and store the processed data on the file system. This data is grouped by folders, and the folder IDs are grouped by user IDs.
These applications are connected through a message queue (ActiveMQ) using queueCount queues.
Files are also sent through this queue, using an embedded file server.
I want to distribute this data storage across several nodes.
1) First variant
On each of the n nodes, set up ActiveMQ and the current storage application.
Create a master node that will serve queries to these shards.
This way, data for different users will be stored on different nodes.
2) Second variant
Set up n nodes with the storage app and a single ActiveMQ instance. Create n*queueCount queues in ActiveMQ, and have each storage node consume messages from its corresponding queues.
But neither variant is perfect; maybe you can give me some advice?
Thanks in advance.
Update:
What is the best way to evenly distribute the data based on UUID?
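For concreteness, a minimal sketch of the simplest approach I'm aware of: hashing the UUID onto a node index (this assumes a fixed nodeCount and is illustration only, not from the original question):

```java
import java.util.UUID;

public class UuidPartitioner {
    // Map a user's UUID to one of n storage nodes. Math.floorMod keeps
    // the result non-negative even when hashCode() is negative.
    static int nodeFor(UUID userId, int nodeCount) {
        return Math.floorMod(userId.hashCode(), nodeCount);
    }

    public static void main(String[] args) {
        UUID user = UUID.randomUUID();
        System.out.println("route user " + user + " to node " + nodeFor(user, 4));
    }
}
```

The caveat is that changing nodeCount remaps almost every user to a different node; consistent hashing avoids that if nodes are expected to come and go.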
Why don't you use a distributed file system like HDFS to distribute your data store? That way replication is covered, data is distributed, and you can even use Hadoop to send jobs that process your data in parallel.
#vvsh, what you are attempting is distributed storage with load balancing (though I did not understand how you plan to keep a specific user's files on a specific node and at the same time get even load distribution). Anyway, before I go any further: the mechanism you are attempting is quite difficult to achieve in a stable manner. Instead, consider using some of the infrastructures mentioned in the comments; they may not fit your requirements 100%, but they will do a much better job.
Now, to achieve even distribution, your architecture essentially needs to be some kind of hub-and-spoke model, where the hub (in your case the master server) collects the load from a single queue, with multiple JMS clients running on multiple threads. The master server essentially has to do round-robin dispatching (you may choose a different scheme: based on file count if file sizes are fairly constant, or based on file size and the net total dispatched to each node).
Persistence agents must run on every node to actually take the files, process them, and persist them in the datastore. The communication between the master and the agents could be through a web service or a direct socket (depending on the performance you require); queue-based communication with the agents could potentially choke your JMS server.
One observation is that the files could be staged in another location, like a document store/CMS, with only the ID communicated to the master and the agents, thereby reducing the network load and the JMS persistence load.
The above mechanism needs to take care of exceptions, failures, and re-dispatching (i.e. guaranteed delivery), horizontal scaling, and concurrency handling, and be optimized for performance. In my view you would be better off using some proven infrastructure, but if you really want to do it yourself, the above architecture will get the job done.
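A minimal sketch of the size-based dispatching scheme mentioned above, where the hub routes each file to the agent with the smallest running total of dispatched bytes (the agent names are placeholders):

```java
import java.util.HashMap;
import java.util.Map;

public class LeastLoadedDispatcher {
    private final Map<String, Long> bytesDispatched = new HashMap<>();

    public LeastLoadedDispatcher(String... agents) {
        for (String agent : agents) {
            bytesDispatched.put(agent, 0L);
        }
    }

    // Pick the agent that has received the least data so far, then
    // charge the new file's size against its running total.
    public synchronized String dispatch(long fileSizeBytes) {
        String target = null;
        for (Map.Entry<String, Long> e : bytesDispatched.entrySet()) {
            if (target == null || e.getValue() < bytesDispatched.get(target)) {
                target = e.getKey();
            }
        }
        bytesDispatched.merge(target, fileSizeBytes, Long::sum);
        return target;
    }

    public static void main(String[] args) {
        LeastLoadedDispatcher d = new LeastLoadedDispatcher("agent-1", "agent-2", "agent-3");
        System.out.println(d.dispatch(500)); // each call returns the least-loaded agent
        System.out.println(d.dispatch(200));
        System.out.println(d.dispatch(100));
    }
}
```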
My Java web application pulls some data from external systems (JSON over HTTP), both live (whenever the users of my application request it) and in batch (nightly updates for cases where no user has requested it). The data changes, so caching options are likely exhausted.
The external systems have some throttling in place, the exact parameters of which I don't know, and which likely change depending on system load (e.g., at peak times 10 requests per second from one IP address, at off-peak times 100 requests per second from one IP address). If the requests are too frequent, they time out or return HTTP 503.
Right now I attempt the request 5 times with a 2000 ms delay between attempts, giving up if an error is received each time. This is not optimal, as sometimes at peak times nearly all requests fail; I could avoid making these requests and perhaps get at least some to succeed instead.
My goals are a somewhat simple, reliable design with enough flexibility that I could both pull some metrics from the throttler to understand how well the external systems are responding (and thus adjust how often they are invoked), and auto-adjust the interval with which I call them (individually per system) so that it is optimal during both off-peak and peak hours.
My infrastructure is Java with RabbitMQ and MongoDB, on Linux.
I'm thinking of three main options:
Since I already use RabbitMQ for batch processing, I could just introduce a queue to which the web processes would send their requests for external systems; worker processes would then read from that queue, throttle themselves as needed, and return the results. This would allow running multiple parallel worker processes on more servers if needed. My main concerns are that it isn't a very simple solution, and how to manage the web processes waiting for a long while when peak-hour throughput is low. Also, this turns my RabbitMQ into a critical single point of failure: if it dies, the whole system stops (as opposed to the nightly batch processes just not running anymore, which is less critical). I suppose RPC is the correct RabbitMQ usage pattern here, but I'm not sure. Edit: I've posted a related question, "How to properly implement RabbitMQ RPC from Java servlet web container?", on how to implement this.
Introduce nginx (e.g. ngx_http_limit_req_module), HAProxy (link), or other proxy software into the mix (as reverse proxies?) and have them take care of the throttling through some configuration magic. The pro is that I don't have to make code changes. The cons are that it is more technology, and technology I've not used before, so the chances of misconfiguring something are quite high. It would also likely not be easy to do dynamic throttling depending on external server load, to prioritize live requests over batch requests, or to get statistics on how the throttling is doing. Also, most documentation and examples will likely cover throttling incoming requests, not outgoing ones.
Do a pure-Java solution (e.g., a leaky bucket implementation; see the sketch after this list). It would be simple in the sense that it is "just code", but the devil is in the details; debugging all the deadlocks, starvations, and race conditions isn't always fun.
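For what it's worth, here is a minimal thread-safe token-bucket sketch of option 3 (a close cousin of the leaky bucket), one instance per external system, with the rate taken from configuration so it can be adjusted at runtime:

```java
import java.util.concurrent.locks.ReentrantLock;

public class TokenBucket {
    private final double capacity;     // maximum burst size, in requests
    private final double refillPerSec; // steady-state requests per second
    private double tokens;
    private long lastRefillNanos;
    private final ReentrantLock lock = new ReentrantLock();

    public TokenBucket(double capacity, double refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Non-blocking: returns false if the caller should back off or queue.
    public boolean tryAcquire() {
        lock.lock();
        try {
            long now = System.nanoTime();
            // Top the bucket up in proportion to the time elapsed, capped at capacity.
            tokens = Math.min(capacity,
                    tokens + (now - lastRefillNanos) / 1e9 * refillPerSec);
            lastRefillNanos = now;
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return true;
            }
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

Guava's RateLimiter provides essentially this ready-made, which sidesteps most of the concurrency debugging mentioned above.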
What am I missing here?
Which is the best solution in this case?
P.S. A somewhat related question: what's the proper approach to logging all the external system invocations, so that statistics are collected on how often I invoke them and what the success rate is?
E.g., after every invocation I'd call something like .logExternalSystemInvocation(externalSystemName, wasSuccessful, elapsedTimeMillis), and then get some aggregate data out of it whenever needed.
Is there a standard library/tool to use, or do I have to roll my own?
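One hedged possibility, sketched under the assumption that pulling in Micrometer is acceptable: a timer tagged with the system name and the outcome gives invocation counts, success rates, and latency percentiles without rolling your own aggregation. The meter and tag names below are made up:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.util.concurrent.TimeUnit;

public class InvocationMetrics {
    private final MeterRegistry registry;

    public InvocationMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // Equivalent of logExternalSystemInvocation(name, wasSuccessful, elapsedMillis):
    // each (system, outcome) pair becomes its own timer series in the registry.
    public void record(String externalSystemName, boolean wasSuccessful, long elapsedTimeMillis) {
        Timer.builder("external.system.invocation")
                .tag("system", externalSystemName)
                .tag("outcome", wasSuccessful ? "success" : "failure")
                .register(registry)
                .record(elapsedTimeMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        InvocationMetrics metrics = new InvocationMetrics(new SimpleMeterRegistry());
        metrics.record("billing-api", true, 120);
    }
}
```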
If I use option 1 with RabbitMQ, is there a way to organize the flow so that I get this out of the box from the RabbitMQ console? I wouldn't want to send all failed messages to a poison queue: it would fill up too quickly, and in most cases there is no need to re-process these failed requests, as the user has sadly already moved on.
Perhaps this open source system can help you a little: http://code.google.com/p/valogato/