I'm looking for an existing system to replace an existing slow and complicated self written mechanism of job management.
The existing system:
1 MySQL DB with a long massive table of jobs - the queue
Multiple servers (written in java) all extracting jobs from the queue and processing them
a job might NOT be deleted from the queue after processing it, to rerun it later
a job might create other jobs and insert them to the queue
The limitations:
As more and more jobs are created and inserted in to the queue, it takes longer to extract jobs from it. (The jobs are chosen by priority and type) - create a bottle neck
I'm looking for an existing system that can replace this one, and improve it's performance.
Any suggestions?
Thanks
I don't generally recommend JMS, but it sounds like it really is what you need here. Distributed, transactional, persistent job queue management is what JMS is all about.
Popular open-source implementations include HornetQ and ActiveMQ.
You could:
submit your jobs to Amazon's Simple Queue Service (maybe JAXB marshalled)
dynamically start some EC2 instances according to your queue's length and probably
submit the results (or availability notice for some files on S3) to Simple Notification Service (again JAXB marshalled).
That exactly what we do, using EC2 Spot instances to minimize costs. And that's what I call serious cloud computing ;)
Related
We are now in the process of refactoring our messaging application written in Vert.x. The application processes incoming messages from users. Initially, it was implemented so that there is a single verticle instance that listens to a single queue in the event bus and processes all the incoming messages.
What we are thinking of doing is to refactor it so that it works a bit similar to actor model: we deploy an instance of a verticle for each active user and make it listen to a user-specific queue. This way the verticle instance can maintain user-specific state and the parallelization of the message processing becomes much easier.
The issue, however, is that this would lead to a huge number of verticles deployed (30k - 50k in parallel) and huge amount of queues in the eventbus. And also we would need to maintain the verticles manually (undeploy unused verticles and deploy the ones when there is a message from a new user).
Question is - is this actor-style architecture good for vert.x and can it handle large amount of deployed verticles and eventbus queues at the same time?
There's one major correction to be made here - EventBus is a single queue. So, you won't have "huge number of queues". There will be only one. You'll have huge number of addresses on a single queue.
But is this number so huge? Well, is a HashMap of 50K elements can be considered huge? Probably not, at least in terms of keys. Now note that this applies only to Vert.x in non-clustered mode. Clustered Vert.x is different (still should work, though).
Now having those verticles is another matter. Each verticle is a separate object, and if you plan to store some data in it, it will be even larger. But if you can afford machines with some decent RAM (16GB+), it should work just fine.
What does concern me in this solution, though, is that you plan to deploy verticles on demand, then undeploy them. It does incur delays, so your users will experience degraded performance for first message they send.
What you call "actor-style" does not mean, that you have to inflate a new verticle instance per user. If you do so, you are going to get a system with 98% redundancy.
It's absolutely enough to register an event-bus address for each user and use some sort of persistant storage to keep track of them. Such a storage can be any DB for long-term persistance or a cluster-wide SharedMap for short-term, or a combination of both.
Perhaps you don't even need a address-per-user scheme. Such a scheme is nice when the users are connected constantly to your system via some sort of EventBusBridge. If this is not a case, you can register a single event-bus address for all users and process messages based on payload.
I'm working in a Java EE application and I want that some WebServices are executed in parallel.
I would like to know the pros and cons of 2 different approaches:
Use JMS queues and MDBs, so each message I put in the queue would be executed in parallel. This way the application part that put the message into the queue would have a while, that waits the MDBs to response in a RS Queue.
Use the java concurrent API (Future / Callable).
ADDED
This is what the application needs to do:
The application already does it via an MDB, but I was thinking about a refactoring.
TODAY'S SCENARIO:
//CALLER CLASS
FOREACH INTEGRATION
PUT MESSAGE INTO A QUEUE AND STORE AN ARRAY OF CORRELATION_IDs
END
THREAD.SLEEP(X) // SOMETIME FOR INTEGRATION TO FINISH
WHILE (true){
GET RESPONSE FROM THE RESPONSE QUEUE FOR EACH INTEGRATION USING THE CORRELATION PREVIOUSLY STORED
}
//MDB CLASS
HAS A HUGE SWITCH CASE THAT PROCESS EACH INTEGRATION
RETURN THE RESULT INTO THE RESPONSE QUEUE;
Questions:
Is it ok to use the concurrent API in java? In my opinion using the concurrent API will eliminate a layer of failure (JMS).
My deployment environment is Websphere. Is it a good practice to create your own threads with the concurrent java API.
Thanks in advance
Whatever solution you go with, you will eventually need to cope with a burst of traffic. The JMS/MDB the burst is controlled by the queue effectively. Also a point to consider is that the queue can be made persistence, so it will survive a server restart. Also a queue can be distributed across many servers, giving you horizontal scalability.
The thread approach is of course quicker to develop, test and deploy. However, I would consider using a BlockingQueue so that your threads do not run amock.
jms pros: you can have persistence, you can connect to existing infrastructure
jms cons: seems to heavy to be used only as a dispatcher
manual concurrency cons: well, it's manual. and parallel programming is difficult. some webservers (especially clouds) may forbid to create your own threads
not sure what exactly you want to do but webserver by default processes requests in parallel, so maybe you don't need anything else?
I need some design and developments inputs on reading messages from queue. i have following requirements and constraints
i need read message from queue and inert to db.
messages can come at any interval (100's at same time or 1 by one with few mins gap)
don't have any MDB container to host (just plain tomcat server)
Need to write java application to perform the above.
so not very sure how to put this simple application.
if is use quartz scheduler to trigger job to read all messages in the queue then not sure before even that complete next instance of scheduler might start and create problem.
please suggest me any inputs.
this is basically some utility so i don't want to spend too long time nor too much resources on this.
thanks & regards
LR
The usage of an ESB like Mule or Camel would simplify a lot your development. You'd find already developed components (called endpoints) for reading from a queue, and writing into a db. Also for scheduling jobs with quartz.
I am currently investigating what Java compatible solutions exist to address my requirements as follows:
Timer based / Schedulable tasks to batch process
Distributed, and by that providing the ability to scale horizontally
Resilience, no SPFs please
The nature of these tasks (heavy XML generation, and the delivery to web based receiving nodes) means running them on a single server using something like Quartz isn't viable.
I have heard of technologies like Hadoop and JavaSpaces which have addressed the scaling and resilience end of the problem effectively. Not knowing whether these are quite suited to my requirements, its hard to know what other technologies might fit well.
I was wondering really what people in this space felt were options available, and how each plays its strengths, or suits certain problems better than others.
NB: Its worth noting that schedule-ability is perhaps a hangover from how we do things presently. Yes there are tasks which ought to go at certain times. It has also been used to throttle throughput at times when no mandate for set times exists.
Asynchronous always brings JMS to mind for me. Send the request message to a queue; a MessageListener is plucked out of the pool to handle it.
This can scale, because the queue and listener can be on a remote server. The size of the listener thread pool can be configured. You can have different listeners for different tasks.
UPDATE: You can avoid having a single point of failure by clustering and load balancing.
You can get JMS without cost using ActiveMQ (open source), JBOSS (open source version available), or any Java EE app server, so budget isn't a consideration.
And no lock-in, because you're using JMS, besides the fact that you're using Java.
I'd recommend doing it with Spring message driven POJOs. The community edition is open source, of course.
If that doesn't do it for you, have a look at Spring Batch and Spring Integration. Both of those might be useful, and the community editions are open source.
Have you looked into GridGain? I am pretty sure it won't solve the scheduling problem, but you can scale it and it happens like "magic", the code to be executed is sent to a node and it is executed in there. It works fine when you don't have a database connection to be sent (or anything that is not serializable).
I have a situation here where I need to distribute work over to multiple JAVA processes running in different JVMs, probably different machines.
Lets say I have a table with records 1 to 1000. I am looking for work to be collected and distributed is sets of 10. Lets say records 1-10 to workerOne. Then records 11-20 to workerThree. And so on and so forth. Needless to say workerOne never does the work of workerTwo unless and until workerTwo couldnt do it.
This example was purely based on database but could be extended to any system, I believe be it File processing, email processing and so forth.
I have a small feeling that the immediate response would be to go for a Master/Worker approach. However here we are talking about different JVMs. Even if one JVM were to come down the other JVM should just keep doing its work.
Now the million dollar question would be: Are there any good frameworks(production ready) that would give me facility to do this. Even if there are concrete implementations of specific needs like Database records, File processing, Email processing and their likes.
I have seen the Java Parallel Execution Framework, but am not sure if it can be used for different JVMs and if one were to come down would the other keep going.I believe Workers could be on multiple JVMs, but what about the Master?
More Info 1: Hadoop would be a problem because of the JDK 1.6 requirement. Thats bit too much.
Thanks,
Franklin
Might want to look into MapReduce and Hadoop
You could also use message queues. Have one process that generates the list of work and packages it in nice little chunks. It then plops those chunks on a queue. Each one of the workers just keeps waiting on the queue for something to show up. When it does, the worker pulls a chunk off the queue and processes it. If one process goes down, some other process will pick up the slack. Simple and people have been doing it that way for a long time so there's a lot information about it on the net.
Check out Hadoop
I believe Terracotta can do this. If you are dealing with web pages, JBoss can be clustered.
If you want to do this yourself you will need a work manager which keeps track of jobs to do, jobs in progress and jobs never done which needs to be rescheduled. The workers then ask for something to do, do it, and send the result back, asking for more.
You may want to elaborate on what kind of work you want to do.
The problem you've described is definitely best solved using the master/worker pattern.
You should have a look into JavaSpaces (part of the Jini framework), it's really well suited to this kind of thing. Basically you just want to encapsulate each task to be carried out inside a Command object, subclassing as necesssary. Dump these into the JavaSpace, let your workers grab and process one at a time, then reassemble when done.
Of course your performance gains will totally depend on how long it takes you to process each set of records, but JavaSpaces won't cause any problems if distributed across several machines.
If you work on records in a single database, consider performing the work within the database itself using stored procedures. The gain for processing the records on different machine might be negated by the cost of retrieving and transmitting the work between the database and the computing nodes.
For file processing it could be a similar case. Working on files in (shared) filesystem might introduce large I/O pressure for OS.
And the cost for maintaining multiple JVM's on multiple machines might be an overkill too.
And for the question: I used the JADE (Java Agent Development Environment) for some distributed simulation once. Its multi-machine suppord and message passing nature might help you.
I would consider using Jgroups for that. You can cluster your jvms and one of your nodes can be selected as master and then can distribute the work to the other nodes by sending message over network. Or you can already partition your work items and then manage in master node the distribution of the partitions like partion-1 one goes to JVM-4 , partion-2 goes to JVM-3, partion-3 goes to JVM-2 and so on. And if JVM-4 goes down it will be realized by the master node and then master node will tell to one of the other nodes to start pick up partition-1 as well.
One other alternative which is easier to use is redis pub sub support. http://redis.io/topics/pubsub . But then you will have to maintain redis servers which i dont like.