I'm looking for an orchestration framework/engine/toolkit with which to replace/upgrade an existing system, mainly because of scalability limitations. By orchestration I mean asynchronous and distributed execution of generic tasks and workflows.
More specifically the requirements are pretty much these:
Wrapping and execution of generic tasks, in Java if language dependent
API for tasks and workflows on-demand triggering
Scheduling would be nice as well
Support for distributed architecture & scalability (mainly for big numbers of small tasks)
Persistency and resilience
Advanced workflow configuration capabilities (do this, then these 3 tasks in parallel, then this, having priorities, dependencies...)
Monitoring and administration UI (or at least API)
The existing system is an old-fashioned monolithic service (in Java) that has most of that, including the execution logic itself, which should remain as untouched as possible.
Does anyone have experience with a similar problem? It seems to me this should be pretty common; it would be strange if I had to implement it entirely myself. I found some questions here (like this and this) discussing the theory of orchestration and choreography systems, but no real examples of tools implementing it. Also, I think we're not exactly talking about microservices - the tasks are not prolonged and heavy, they're just many, running in the background executing short jobs of many types. I wouldn't create a service for every job type.
I'm also not looking for cloud and container services at this point - to my understanding the deployment is a different issue.
The closest I got is the Netflix Conductor engine, which answers most of the requirements by running an orchestration server that manages tasks implemented in servlets (or any web services in any language - a plus). However, it seems to be built mainly for arranging heavy tasks in a workflow rather than running a huge number of small tasks, which makes me wonder what the overhead of invoking many small tasks in servlets, for example, would be.
Does anyone have experience or any input on the Conductor or other tools I could use? Or even my entire approach to the problem?
EDIT: I realize it's kind of a "research advice needed" so let's put it simply in 3 questions:
Am I right to look for an orchestration solution for the requirements above?
Does anyone have experience with the Netflix Conductor? Any feedback on it?
Does it have good competitors?
The main competitor of Netflix Conductor is Temporal Workflow. It scales better and is more developer-friendly, as it uses code instead of a JSON DSL to implement the orchestration logic.
It also works OK with fine-grained tasks thanks to specific optimizations (local activities) that allow batching multiple small tasks into a single database update.
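As an illustration of the code-based approach and local activities, here's a minimal sketch against the Temporal Java SDK; the workflow and activity names are made up for the example, not taken from the question:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.LocalActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@ActivityInterface
interface JobActivities {
    String runSmallTask(String input);
}

@WorkflowInterface
interface JobWorkflow {
    @WorkflowMethod
    String process(String input);
}

class JobWorkflowImpl implements JobWorkflow {
    // Local activities skip the full task-queue round trip, which suits
    // large numbers of short tasks.
    private final JobActivities activities = Workflow.newLocalActivityStub(
            JobActivities.class,
            LocalActivityOptions.newBuilder()
                    .setStartToCloseTimeout(Duration.ofSeconds(10))
                    .build());

    @Override
    public String process(String input) {
        // The orchestration logic is plain Java control flow, not a JSON DSL.
        String step1 = activities.runSmallTask(input);
        return activities.runSmallTask(step1);
    }
}
```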
Temporal has been production-hardened for over five years at Uber, Coinbase, HashiCorp, Datadog, Stripe, and hundreds of other companies.
Perhaps you are looking for something like Airflow https://airflow.apache.org/ ?
Wrapping and execution of generic tasks, in Java if language dependent
https://github.com/apache/incubator-airflow/tree/master/airflow/hooks
https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/operators
API for tasks and workflows on-demand triggering
https://airflow.apache.org/api.html (experimental)
Scheduling would be nice as well
Think of cron on steroids:
https://airflow.apache.org/scheduler.html
Support for distributed architecture & scalability (mainly for big numbers of small tasks)
Scale out with Celery or Dask worker nodes; see the article "Airflow + celery or dask. For what, when?"
Persistency and resilience
Uses a Postgres DB & RabbitMQ; if your deployment architecture is stateless (e.g. repeatable containers & volumes with Docker) you should be in good shape with WAL replication.
If you use Kubernetes or Consul there are other ways to add more resilience to the other components.
Advanced workflow configuration capabilities (do this, then these 3 tasks in parallel, then this, having priorities, dependencies...)
Airflow uses DAGs. The capabilities can be called fairly advanced. You also have parameter sharing via XComs if you really need that.
Monitoring and administration UI (or at least API)
It has one: it shows tasks & schedules, has a Gantt view, lets you see logs & run details easily, and lets you schedule tasks manually directly from the UI.
Also look at Oozie & Azkaban.
Did this help?
You could take a look at unify-flowret, a lightweight Java orchestration engine I created as part of developing a new platform at American Express. If Netflix Conductor seems like a good fit for your problem, you should definitely take a look at unify-flowret, as Netflix Conductor was one of the options we evaluated before building unify-flowret.
Unify-flowret provides the core orchestration functionality and depends upon the application to provide everything else. You define the workflow in a very simple JSON file using steps and routes. Then, in the application which wants to use flowret, you provide certain implementations, e.g. an implementation for persisting state to a database (this way it is possible to use any data store), or an implementation that returns an object to flowret on which flowret will invoke the step function. This way, rather than implementing all types of requirements within the orchestration engine, most are deferred to the application, which keeps things simple.
Unify-flowret runs in an embedded mode and so is horizontally scalable. It is resilient in the face of crashes and will resume from the last recorded position. It provides true technical parallel processing via the workflow JSON definition. It provides an SLA framework that informs the application of milestones to be set up in the future. It provides work management functionality in the form of work baskets. And many other features!
We have had great success in using it within American Express for really complex orchestration requirements.
You can check out unify-flowret at https://github.com/americanexpress/unify-flowret.
In an attempt to design, implement & test a distributed capabilities system, Remote Promises [1][2][3], bit-identical between Squeak & Java, I have hit some shortcomings. I am seeking work-arounds.
With Remote Promises, proxies can change state, which changes the class implementing the proxy. In Squeak this is done with #becomeForward:, while in Java it requires a secondary proxy, one that can change its implementation. This does work.
Exceptions should be non-blocking to allow the event loop to continue, yet also display the problem stack for debugging, out of a quarantine. This is good in Squeak but an open issue with Java. I suppose the answer is to do all your logging and then close the exception, allowing the event loop to proceed: it is server-style log debugging.
Using a meta-repository, it should be possible to demand-load consumers of a particular event type: dynamically load the latest released code into the consumer servers and spread out the load to speed up throughput, updating the system at runtime for continuous, seamless operation. I suppose the solution here is to build a dynamic jar class-loader system. Are there any examples of this? An Apache project perhaps?
[1] Remote Promises in Squeak
[2] Cryptography in Squeak
[3] Remote Promises in Java, called Raven
Use cloud technologies made for that kind of use case
I would say that in today's world, to get the latest version of some code, you don't use a class loader or any other advanced capability of your programming language. You would more likely use some kind of cloud service.
That may be a serverless cloud implementation or a container/Kubernetes (https://kubernetes.io/) implementation. You can then control exactly when the new release is loaded, and choose whether you want to do a canary, blue/green or progressive rollout, or even implement your own strategy.
Because it works with containers, this would be fine whatever the language, be it C++, Java, Python, shell, Squeak or anything else.
That layer would also provide auto-scaling of your various services, redundancy and load balancing, and would distribute the workload across your cluster.
You can go to the next step with GitOps: a PR merge in git automatically triggers the load of the new version in production (https://www.weave.works/technologies/gitops/)
Dynamic loading of jars in Java
Still, Java, thanks to its class loader API, does allow you to load classes dynamically. This is what web servers do, and several implementations of this exist, like OSGi; also check the response of dimo414.
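A bare-bones sketch of that class-loader route, using only the standard URLClassLoader API (the jar path and class name below are placeholders for illustration):

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

public class PluginLoader {
    public static void main(String[] args) throws Exception {
        // Point a child class loader at the newly released jar.
        URL jar = Path.of("plugins/consumer-1.2.jar").toUri().toURL();
        try (URLClassLoader loader = new URLClassLoader(
                new URL[] { jar }, PluginLoader.class.getClassLoader())) {
            // Load and instantiate a class from the jar at runtime.
            Class<?> cls = loader.loadClass("com.example.EventConsumer");
            Object consumer = cls.getDeclaredConstructor().newInstance();
            // In practice you'd cast to a shared interface that lives in the
            // parent class loader and invoke the consumer through that.
            System.out.println("Loaded " + consumer.getClass());
        }
    }
}
```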
Conclusion
It would seem that the Java route makes more sense for a generic plugin system like Eclipse's (OSGi), and that the container solution makes more sense for a globally distributed system with auto-scaling & resilience in clusters.
Kubernetes scales to thousands of nodes and provides a whole ecosystem for dealing with distributed systems; it can scale and operate any Linux or Windows process. It is the de-facto standard, pushed by Google and used by thousands of companies over the world.
demand load consumers of a particular event type.
This is typically done via the ServiceLoader API. See the AutoService project to simplify working with services.
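A minimal sketch of that pattern; EventConsumer here is a hypothetical service interface, not part of the original question:

```java
import java.util.ServiceLoader;

// Each provider jar declares its implementation class in
// META-INF/services/EventConsumer (fully qualified interface name).
interface EventConsumer {
    boolean supports(String eventType);
    void consume(String payload);
}

class ConsumerDispatch {
    static void dispatch(String eventType, String payload) {
        // Discovers every implementation visible on the classpath (or in a
        // child class loader, if combined with dynamic jar loading).
        for (EventConsumer c : ServiceLoader.load(EventConsumer.class)) {
            if (c.supports(eventType)) {
                c.consume(payload);
            }
        }
    }
}
```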
This may not be what you need; your question is still very broad, and there are many plausible approaches. Searches for [dynamically load jars] find existing posts like Load jar dynamically at runtime? that may be of interest.
What products/projects could help me with the following scenario?
More than one server (same location)
Some state should be shared between servers (for instance, information about whether a scheduled task is running, and on which server).
The obvious answer could of course be a database, but we are using Seam and there doesn't seem to be a good way to nest transactions inside a Seam bean, so I need to find a way where I don't have to go crazy over configuration (I tried to use EJBs, but persistence.xml wasn't pretty afterwards). So I need another way around this problem until Seam supports nested transactions.
This is basically the same scenario as I have if you need more details: https://community.jboss.org/thread/182126.
Any ideas?
Sounds like you need to do distributed job management.
The reality is that in the Java EE world, you are going to end up having to use queues, as in MoM [Message-Oriented Middleware]. Seam will work with JMS, and you can have publish-and-subscribe queues.
Where you might want to take a look for an alternative is at Akka. It gives you the ability to distribute jobs across machines using an actor/agent model that is transparent. That's to say, your agents can cooperate with each other whether they are on the same instance or across the network from each other, and you are not writing a ton of code to make that happen, or having to special-case things up and down the message chain.
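To make the actor model concrete, here is a minimal sketch using Akka's classic Java API; the actor and message names are illustrative, not from the original answer:

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

// A worker actor that processes string "jobs" and reports back to the sender.
public class JobWorker extends AbstractActor {
    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, job -> {
                    // ... do the actual work here ...
                    getSender().tell("done: " + job, getSelf());
                })
                .build();
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("jobs");
        // The same tell() works whether the actor is local or remote.
        ActorRef worker = system.actorOf(Props.create(JobWorker.class), "worker");
        worker.tell("nightly-report", ActorRef.noSender());
    }
}
```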
The other thing Akka has going for it is the notion of supervision, aka Go Ahead and Fail, or Let It Crash. This is the idea (followed by the telcos for years) that systems will fail, and you should design for it and have a means of making things resilient.
Finally, the state of other options job-wise in the Java world is dismal. I have used Seam for years. It's great, but they decided to just support Quartz for jobs, which is useless.
Akka is built on Netty, too, which does some pretty crazy stuff in terms of concurrency and performance.
[Not a TypeSafe employee, btw…]
If you needed to build a highly scalable web application using Java, what framework would you use and why?
I'm just reading Thinking in Java, Head First Servlets and Manning's Spring framework book, but really I want to focus on highly scalable architectures etc.
Would you use Tomcat, Hibernate, Ehcache?
(Just assume you have to design for scale; I'm not looking for the 'worry about it when you get traffic' type of responses.)
The answer depends on what we mean by "scalable". A lot depends on your application, not on the framework you choose to implement it with.
No matter what framework you choose, the fact is that the hardware you deploy it on will have an upper limit on the number of simultaneous requests it'll be able to handle. If you want to handle more traffic, you'll have to throw more hardware at it and include load balancing, etc.
The part that's pertinent in that case has to do with shared state. If you have a lot of shared state, you'll have to make sure that it's thread safe, "sticky" when it needs to be, replicated throughout a cluster, etc. All that has to do with the app server you deploy it to and the way you design your app, not the framework.
Tomcat's not a "framework", it's a servlet/JSP engine. It's got clustering capabilities, but so do most other Java EE app servers. You can use Tomcat if you've already chosen Spring, because it implies that you don't have EJBs. Jetty, Resin, WebLogic, JBoss, GlassFish - any of them will do.
Spring is a good choice if you already know it well. I think following the Spring idiom will make it more likely that your app is layered and architecturally sound, but that's not the deciding factor when it comes to scalability.
Hibernate will make your development life easier, but the scalability of your database depends a great deal on the schema, indexes, etc. Hibernate isn't a guarantee.
"Scalable" is one of those catch-all terms (like "lightweight") that is easy to toss off but encompasses many considerations. I'm not sure that a simple choice of framework will solve the issue once and for all.
I would check out Apache Mina. From the home page:
Apache MINA is a network application framework which helps users develop high performance and high scalability network applications easily. It provides an abstract, event-driven, asynchronous API over various transports such as TCP/IP and UDP/IP via Java NIO.
It has an HTTP engine AsyncWeb built on top of it.
A less radical suggestion (!) is Jetty - a servlet container geared towards performance and a small footprint.
The two keywords I would mainly focus on are asynchronous and stateless. Or at least "as stateless as possible": of course you need state, but maybe, instead of going for a full-fledged RDBMS, have a look at document-centered datastores.
Have a look at Akka concerning async, and at CouchDB or MongoDB as datastores...
Frameworks are more geared towards speeding up development, not performance. There will be some overhead with any framework because of use cases it handles that you don't need. Granted, the overhead may be low, and most frameworks will point you towards patterns that have been proven to scale, but those patterns can be used without the framework as well.
So I would design your architecture assuming 'bare metal', i.e. pure servlets (yes, you could go even lower level, but I'm assuming you don't want to write your own http socket layer), straight JDBC, etc. Then go back and figure out which frameworks best fit your architecture, speed up your development, and don't add too much overhead. Tomcat versus other containers, Hibernate versus other ORMs, Struts versus other web frameworks - none of that matters if you make the wrong decisions about the key performance bottlenecks.
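For reference, "bare metal" in this sense means something as plain as the following sketch, written against the raw Servlet API with no framework involved (the servlet name is made up):

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// A minimal servlet: no framework, no container-specific features, just the
// Servlet API. Everything else (ORM, caching, web framework) layers on later.
public class PingServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("text/plain");
        resp.getWriter().println("pong");
    }
}
```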
However, a better approach might be to choose a framework that optimizes for development time and then find the bottlenecks and address those as they occur. Otherwise, you could spin your wheels optimizing prematurely for cases that never occur. But that probably falls in the category of 'worry about it when you get traffic'.
All popular modern frameworks (and "stacks") are well-written and don't pose any threat to performance and scaling, if used correctly. So focus on what stack will be best for your requirements, rather than starting with the scalability upfront.
If you have a particular requirement, then you can ask a question about it and get recommendations about what's best for handling it.
There is no framework that is magically going to make your web service scalable.
The key to scalability is replicating the functionality that is (or would otherwise be) a bottleneck. If you are serious about making your service, you need to start with a good understanding of the characteristics of your application, and hence an idea of where the bottlenecks are likely to be:
Is it a read-only service or do user requests cause primary data to change?
Do you have / need sessions, or is the system RESTful?
Are the requests normal HTTP requests with HTML responses, or are you doing AJAX, callbacks or something else?
Are user requests computation intensive, I/O intensive, rendering intensive?
How big/complicated is your backend database?
What are the availability requirements?
Then you need to decide how scalable you want it to be. Do you need to support hundreds, thousands, millions of simultaneous users? (Different degrees of scalability require different architectures, and different implementation approaches.)
Once you have figured these things out, then you decide whether there is an existing framework that can cope with the level of traffic that you need to support. If not, you need to design your own system architecture to be scalable in the problem areas.
If you are able to work with a commercial system, then I'd suggest taking a look at Jazz Foundation at http://jazz.net. It's the base for IBM Rational's new products. The project is led by the guys that developed Eclipse within IBM before it was open-sourced. It has pluggable DB layer as well as supporting multiple App Servers. It's designed to handle clustering and multi-site type deployments. It has nice capabilities like OAuth support and License management.
In addition to the above:
Take a good look at JMS (Java Message Service). This is a much underrated technology. There are vendor solutions such as TIBCO EMS, Oracle, etc., but there are also free stacks such as ActiveMQ.
JMS will allow you to build synchronous and asynchronous solutions using queues. You can choose to have persistent or non-persistent queues.
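For example, posting a job to a persistent queue with the plain JMS 1.1 API looks roughly like this; the ActiveMQ broker URL and queue name are assumptions for the sketch:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

// Enqueue a job so any server in the cluster can pick it up.
public class EnqueueJob {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session =
                    connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("scheduled.tasks");
            MessageProducer producer = session.createProducer(queue);
            // Persistent delivery survives broker restarts.
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);
            producer.send(session.createTextMessage("run-report-42"));
        } finally {
            connection.close();
        }
    }
}
```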
As others have already replied, scalability isn't about what framework you use. Sure, it is nice to squeeze out as much performance as possible from each node, but what you ideally want is that adding another node scales your app in a linear fashion.
The application should be architected in distinct layers so it is possible to add more power to different layers of the application without a rewrite, and also to add different layered caching. Caching is key to achieving speed.
One example of layers for a big webapp:
Load balancers (TCP level)
Caching reverse proxies
CDN for static content
Front end webservers
Appservers (business logic of the app)
Persistent storage (RDBMS, key/value, document)
I've got a doozy of a problem here. I'm aiming to build a framework to allow for the integration of different traffic simulation models. This integration is based upon the sharing of link connectivities, link costs, and vehicles between simulations.
To make a distributed simulation, I plan to have a 'coordinator' (star topology). All participating simulations simply register with it, and talk only to the coordinator. The coordinator then coordinates the execution of various tasks between each simulation.
A quick example of a distribution problem, is when one simulation is 'in charge' of certain objects, like a road. And another is 'in charge' of other roads. However, these roads are interconnected (and hence, we need synchronisation between these simulations, and need to be able to exchange data / invoke methods remotely).
I've had a look at RMI and am thinking it may be suited for this task (to abstract away having to create an over-the-wire signalling discipline).
Is this sane? The issue here is that simulation participants need to centralize some of their data storage in the 'coordinator' to ensure explicit synchronisation between simulations. Furthermore, some simulations may require components or methods from other simulations (hence the idea of using RMI).
My basic approach is to have the 'coordinator' run a giant RMI registry. And every simulation simply looks up everything in the registry, ensuring that the correct objects are used at each step.
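For concreteness, a rough sketch of that registry-based plan using the standard RMI API; the interface, binding name and port are hypothetical:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Remote interface shared by the coordinator and all simulations.
interface RoadSegment extends Remote {
    double currentCost() throws RemoteException;
}

class RoadSegmentImpl implements RoadSegment {
    public double currentCost() { return 1.0; } // placeholder cost model
}

public class CoordinatorSketch {
    public static void main(String[] args) throws Exception {
        // Coordinator side: host the registry and bind a shared object.
        Registry registry = LocateRegistry.createRegistry(1099);
        RoadSegment stub = (RoadSegment)
                UnicastRemoteObject.exportObject(new RoadSegmentImpl(), 0);
        registry.rebind("road/A-12", stub);

        // Simulation side: look up the shared object and invoke it remotely.
        Registry remote = LocateRegistry.getRegistry("localhost", 1099);
        RoadSegment road = (RoadSegment) remote.lookup("road/A-12");
        System.out.println("cost = " + road.currentCost());
    }
}
```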
Anyone have any tips for heading down this path?
You may want to check out Hazelcast also. Hazelcast is an open source transactional, distributed/partitioned implementation of queue, topic, map, set, list, lock and executor service. It is super easy to work with; just add hazelcast.jar into your classpath and start coding. Almost no configuration is required.
If you are interested in executing your Runnable and Callable tasks in a distributed fashion, then please check out the Distributed Executor Service documentation at http://code.google.com/docreader/#p=hazelcast
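As a flavor of that, here is a minimal sketch of submitting a task to Hazelcast's distributed executor, written against a Hazelcast 3.x-style API; the task itself is a placeholder:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

// Nodes that start this program discover each other and form a cluster;
// tasks submitted to the executor may run on any member.
public class HazelcastSketch {
    static class CountRoads implements Callable<Integer>, Serializable {
        public Integer call() { return 42; } // placeholder work
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService executor = hz.getExecutorService("sim-tasks");
        Future<Integer> result = executor.submit(new CountRoads());
        System.out.println("result = " + result.get());
    }
}
```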
Hazelcast is released under Apache license and enterprise grade support is also available.
Is this sane? IMHO no. And I'll tell you why. But first I'll add the disclaimer that this is a complicated topic, so any answer has to be viewed as barely scratching the surface.
First, instead of repeating myself, I'll point you to a summary of Java grid/cluster technologies that I wrote a while ago. It's a mostly complete list.
The star topology is "natural" for a "naive" (I don't mean that in a bad way) implementation because point-to-point is simple and centralizing key controller logic is also simple. It is, however, not fault-tolerant. It introduces scalability problems and a single bottleneck. It introduces communication inefficiencies (namely, the points communicate via a two-step process through the center).
What you really want for this is probably a cluster (rather than a data/compute grid) solution and I'd suggest you look at Terracotta. Ideally you'd look at Oracle Coherence but it's no doubt expensive (compared to free). It is a fantastic product though.
These two products can be used in a number of ways, but the core of both is to treat a cache like a distributed map. You put things in, you take things out, and you fire off code that alters the cache. Coherence (with which I'm more familiar) scales fantastically well in this regard. These are more "server"-based products, though, for a true cluster.
If you're looking at a more distributed model then perhaps you should be looking at more of an SOA based approach.
Have a look at http://www.terracotta.org/
It's a distributed Java VM, so it has the advantage that a clustered application looks no different from a standard Java application.
I have used it in applications and the speed is very impressive so far.
Paul
Have you considered using a message queue approach? You could use JMS to communicate/coordinate tasks and results among a set of servers/nodes. You could even use Amazon's SQS (Simple Queue Service: aws.amazon.com/sqs) and have your servers running on EC2 to allow you to scale up and down as required.
Just my 2 cents.
Take a look at Jini; it might be of some use to you.
Well, Jini, or more specifically JavaSpaces, is a good place to start for a simple approach to the problem. JavaSpaces lets you implement a master-worker model, where your master (coordinator in your case) writes tasks to the JavaSpace, and the workers query for and process those tasks, writing the results back for the master. Since your problem is not embarrassingly parallel, and your workers need to synchronize/exchange data, this will add some complexity to your solution.
Using JavaSpaces will add a whole lot more abstraction to your implementation than using plain RMI (which the Jini framework uses internally as the default "wire protocol").
Have a look at this article from Sun for an intro.
And Jan Newmarch's Jini Tutorial is a pretty good place to start learning Jini
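For a flavor of the model, here is a minimal master-worker sketch over the JavaSpace API; the entry type is hypothetical, and discovery of the space is assumed to have happened already:

```java
import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

// JavaSpaces entries use public fields and a public no-arg constructor.
class TaskEntry implements Entry {
    public String simulationId;
    public String payload;
    public TaskEntry() {}
}

class MasterWorkerSketch {
    // Master: publish a task for any worker to pick up.
    static void publish(JavaSpace space, TaskEntry task) throws Exception {
        space.write(task, null, Lease.FOREVER);
    }

    // Worker: block up to 10 s for a matching task, then process it.
    static void work(JavaSpace space) throws Exception {
        TaskEntry template = new TaskEntry(); // null fields act as wildcards
        TaskEntry task = (TaskEntry) space.take(template, null, 10_000);
        if (task != null) {
            // ... process the task, then write a result entry back
            // to the space for the master to collect.
        }
    }
}
```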
Just as an addition to the other answers, which as far as I have seen all focus on grid and cloud computing: you should notice that simulation models have one unique characteristic, namely simulation time.
When running distributed simulation models in parallel and keeping them synchronized, I see two options:
Either each simulation model has its own simulation clock and event list, and these are synchronized over the network.
Alternatively, there could be a single simulation clock and event list which "ticks the time" for all distributed (sub-)models.
The first option has been extensively researched for the High Level Architecture (HLA); see for example http://en.wikipedia.org/wiki/IEEE_1516 as a starter.
However, the second option seems simpler and lower-overhead to me.
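As a sketch of that second option, a single central clock and event list could look like the following; all names are illustrative, and a distributed version would push each tick out to remote sub-models:

```java
import java.util.PriorityQueue;
import java.util.function.DoubleConsumer;

// One global clock and event list that "ticks the time" for all sub-models.
class CentralEventList {
    static final class Event {
        final double time;
        final DoubleConsumer handler;
        Event(double time, DoubleConsumer handler) {
            this.time = time;
            this.handler = handler;
        }
    }

    private final PriorityQueue<Event> events =
            new PriorityQueue<>((a, b) -> Double.compare(a.time, b.time));
    private double now = 0.0;

    void schedule(double time, DoubleConsumer handler) {
        events.add(new Event(time, handler));
    }

    // Advance the global clock to the next event and fire it; no sub-model
    // ever runs ahead of 'now', which is what keeps them synchronized.
    boolean tick() {
        Event next = events.poll();
        if (next == null) return false;
        now = next.time;
        next.handler.accept(now);
        return true;
    }
}
```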
GridGain is a good alternative. They have a map/reduce implementation with "direct API support for split and aggregation" and "distributed task session". You can browse their examples and see if some of them fits with your needs.
My company is considering using web services as a means of running an ETL process. However, I don't think web services fit this purpose, for several reasons:
1. a web service could possibly consume a lot of memory when generating large XML.
2. XML is a bloated format.
3. possible time-outs if the server takes a huge amount of time to generate data.
4. file size limitations? (for Windows it's 2 GB, if my memory serves me right)
I am not a web service expert, so I need your opinions. :)
Thanks.
There are plenty of technologies in the web services tool shed that circumvent all the problems you list. There is stream-oriented XML shredding, there are XML compression formats for delivery, there are protocols that deal with fragmentation and fairness, and there are many storage systems that can hold terabytes upon terabytes of data.
If by web service you imagine some college freshman's homework concoction of an interface that accepts a single glop argument with a 2 GB serialized table in it, then all your arguments are valid. But if you give your requirements to an experienced team with knowledge of the concepts involved in WS-ReliableMessaging and WS-Transaction, then there is no reason not to build an ETL process around web services. Note that I do not advocate the SOAP protocols per se, but I do advocate knowledge and understanding of the concepts involved.
Now, that being said, whether a web-service-oriented ETL process makes sense for you depends on a whole set of other reasons. However, your rebuttal of web service technologies does not hold water.
I would not use a web service for an ETL task. There are specialized tools for that task (e.g., Ab Initio, Informatica, etc.) that are better suited.
If you have a large amount of data, I'd say that the price of the extra latency that the network would introduce would be prohibitive.
It really does depend on what you are doing and how you are trying to accomplish it. In general web services require more care and feeding than you would normally put into an ETL process, but they can be surprisingly effective at the task as well. I did not get enough specifics about your scenario to say whether it would work.
I have worked on web services which transmit and receive 100+ MB documents, some encoded in XML, some not, and do it in seconds (on a closed local network). These services required a good deal of tuning and planning, but they did work well for our scenario, and they allowed a wide variety of clients to connect and transmit differing amounts of data through a fairly standard interface. This differed from some of the other ETL jobs we had, where the job was specific to each client and had to be set up and maintained for each client.
It all depends on what you are doing and what your constraints are.
If you are going to pursue this route sit down and draft out the process from beginning to end, including how you want clients to connect, verify that the data was received and verify that the job is finished. Consider some of the scenarios, the clients and the types of data being transmitted and then work out what would be needed. Contrast that with what is already available in other tools, and how much time you have to get it done.
I'm really wondering why your company is not considering a real ETL tool like those mentioned by duffymo in his answer, or Talend or CloverETL if open source is an option.
They are in general good for ETL purposes :)
Building your own solution sounds like reinventing the wheel.
Many of them have web-service-oriented features (see "Export a job as webservice" in Talend's wiki or CloverETL Server HTTP Launch Services, for example).
I'm not an ETL product expert and I didn't check them all but I'm pretty sure this is something to consider.
Look up MTOM, to start with, which allows arbitrary non-XML data to be streamed in a web service.
Web services are just fine for ETL tasks. Remember that each task is going to get handled in its own thread for free, and you're guaranteed proper cleanup between requests. Using web services inside something like Tomcat wouldn't be nearly as heavy as you think.
If you're concerned over the bloat of XML, consider JSON format.