How do applications with I/O threads and worker threads work? - java

I would like to know the best architecture style for creating 1) a server application, or 2) an application from scratch (not necessarily with a server, etc.).
My question is: "Do you know a source where I can read about the usage of I/O threads and worker threads?"
As I understand it, when you want to create a good application it is good practice to separate I/O threads from worker threads. But I cannot find a good explanation of this with examples on the web. Can someone describe how this architecture should work?
How does Spring Boot, for example, apply this? There are many Spring Boot examples on the web, but I have not found one that demonstrates the separation of these two types of threads, or that at least describes the principles with examples.
Thank you

Check out books like Pattern-Oriented Software Architecture, especially Volume 2: https://en.wikipedia.org/wiki/Pattern-Oriented_Software_Architecture
Understand how the Multi-Processing Modules in Apache httpd work. Why are there three of them (prefork, worker, event)? How did they evolve?
Look at thread-pool usage in application servers like WildFly.

The typical solution is to use a SEDA design: Staged Event-Driven Architecture. Often each stage is made of reactors, i.e. threads executing an event loop and moving tasks from one stage to the next (e.g. from socket read -> process -> socket write). Hazelcast, Kafka, and Cassandra are good examples of that.
Recently, however, high-performance systems have been shifting to thread-per-core designs. Instead of having different stages, there is just a single stage in which all activities are performed: reading/writing sockets, reading/writing disk, the actual logic, etc. The processing of a single request is thus done on a single thread. The big advantage is that this scales a lot better and can provide superior performance. Scylla, Redpanda, and Dragonfly are good examples of that.
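To make the stage hand-off concrete, here is a minimal sketch (not production code) of the simplest form of the separation in plain Java: one I/O thread accepts connections, and a fixed pool of worker threads does the per-request work. Real reactor designs (Netty, or the systems above) also keep the socket reads/writes on dedicated event-loop threads; the port and pool size here are arbitrary.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IoWorkerServer {
    public static void main(String[] args) throws IOException {
        ExecutorService workers = Executors.newFixedThreadPool(8); // worker stage
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket socket = server.accept();       // I/O thread: accepts only
                workers.submit(() -> handle(socket));  // hand the task to the worker stage
            }
        }
    }

    private static void handle(Socket socket) {
        try (Socket s = socket;
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            String request = in.readLine();        // worker: read
            out.println("processed: " + request);  // worker: process + write
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```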

Related

Orchestration engines and frameworks?

I'm looking for an orchestration framework/engine/toolkit with which to replace/upgrade an existing system, mainly because of scalability limitations. By orchestration I mean asynchronous and distributed execution of generic tasks and workflows.
More specifically the requirements are pretty much these:
Wrapping and execution of generic tasks, in Java if language dependent
API for tasks and workflows on-demand triggering
Scheduling would be nice as well
Support for distributed architecture & scalability (mainly for big numbers of small tasks)
Persistency and resilience
Advanced workflow configuration capabilities (do this, then these 3 tasks in parallel, then this, having priorities, dependencies...)
Monitoring and administration UI (or at least API)
The existing system is an old-fashioned monolithic service (in Java) that has most of that, including the execution logic itself, which should remain as untouched as possible.
Does anyone have experience with a similar problem? It seems to me this is supposed to be pretty common; it would be strange if I had to implement it entirely myself. I found some questions here (like this and this) discussing the theory of orchestration and choreography systems, but no real examples of tools implementing it. Also, I don't think we're exactly talking about microservices: the tasks are not prolonged and heavy, there are just many of them, running in the background and executing short jobs of many types. I wouldn't create a service for every job type.
I'm also not looking at cloud and container services at this point; to my understanding, deployment is a separate issue.
The closest I got is the Netflix Conductor engine, which answers most of the requirements by running an orchestration server that manages tasks implemented as servlets (or any web services in any language, which is a plus). However, it seems to be built mainly for arranging heavy tasks in a workflow rather than running a huge number of small tasks, which makes me wonder what the overhead of invoking many small tasks in servlets, for example, would be.
Does anyone have experience with or any input on Conductor, or other tools I could use? Or even on my entire approach to the problem?
EDIT: I realize it's kind of a "research advice needed" question, so let's put it simply in 3 questions:
Am I right to look for an orchestration solution for the requirements above?
Does anyone have experience with the Netflix Conductor? Any feedback on it?
Does it have good competitors?
The main competitor of Netflix Conductor is Temporal Workflow. It scales better and is more developer-friendly, using code instead of a JSON DSL to implement the orchestration logic.
It also works fine with fine-grained tasks, implementing specific optimizations (local activities) that allow batching multiple small tasks into a single database update.
Temporal has been production-hardened for over five years at Uber, Coinbase, HashiCorp, Datadog, Stripe, and hundreds of other companies.
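For a flavour of the code-instead-of-JSON approach, here is a minimal sketch using Temporal's Java SDK; the interfaces, names, and timeout are illustrative, not anything from the question:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@ActivityInterface
interface PaymentActivities {
    void debit(String account, int amount);
    void credit(String account, int amount);
}

@WorkflowInterface
interface TransferWorkflow {
    @WorkflowMethod
    void transfer(String from, String to, int amount);
}

// The orchestration logic is ordinary Java; Temporal persists each step,
// so the workflow resumes after a crash instead of restarting.
class TransferWorkflowImpl implements TransferWorkflow {
    private final PaymentActivities activities = Workflow.newActivityStub(
            PaymentActivities.class,
            ActivityOptions.newBuilder()
                    .setStartToCloseTimeout(Duration.ofSeconds(30))
                    .build());

    @Override
    public void transfer(String from, String to, int amount) {
        activities.debit(from, amount);  // each activity call is durable
        activities.credit(to, amount);
    }
}
```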
Perhaps you are looking for something like Airflow? https://airflow.apache.org/
Wrapping and execution of generic tasks, in Java if language dependent
https://github.com/apache/incubator-airflow/tree/master/airflow/hooks
https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/operators
API for tasks and workflows on-demand triggering
https://airflow.apache.org/api.html (experimental)
Scheduling would be nice as well
Think of cron on steroids:
https://airflow.apache.org/scheduler.html
Support for distributed architecture & scalability (mainly for big numbers of small tasks)
Scale with Dask or Celery nodes; see "Airflow + celery or dask. For what, when?"
Persistency and resilience
Airflow uses a Postgres DB & RabbitMQ; if your deployment architecture is stateless (e.g. repeatable containers & volumes with Docker) you should be in good shape with WAL replication.
If you use Kubernetes or Consul there are other ways to implement more resilience in the other components.
Advanced workflow configuration capabilities (do this, then these 3 tasks in parallel, then this, having priorities, dependencies...)
Airflow uses DAGs. The capabilities can be called fairly advanced. You also have parameter sharing between tasks via XComs if you really need that.
Monitoring and administration UI (or at least API)
It has one: it shows tasks & schedules and has a Gantt view. You can also see logs & run details easily, and manually trigger tasks directly from the UI.
Also look at Oozie & Azkaban.
Did this help?
You could take a look at unify-flowret, a lightweight Java orchestration engine I created as part of developing a new platform at American Express. If you think Netflix Conductor seems like a good fit for your problem, you should definitely take a look at unify-flowret, as Netflix Conductor was one of the options we evaluated before building unify-flowret.
Unify-flowret provides the core orchestration functionality and depends upon the application to provide everything else. You define the workflow in a very simple JSON file using steps and routes. Then, in the application that wants to use flowret, you create certain implementations, e.g. an implementation for persisting state to a database (this way it is possible to use any data store), or an implementation that returns an object to flowret on which flowret will invoke the step function. This way, rather than implementing all types of requirements within the orchestration engine, most are deferred to the application, which keeps things simple.
Unify-flowret runs in embedded mode and so is horizontally scalable. It is resilient in the face of crashes and will resume from the last recorded position. It provides true technical parallel processing via definitions in the workflow JSON. It provides an SLA framework that informs the application of milestones to be set up in the future. It provides work-management functionality in the form of work baskets. And many other features!
We have had great success in using it within American Express for really complex orchestration requirements.
You can check out unify-flowret at https://github.com/americanexpress/unify-flowret.

Share data between Java EE servers

What products/projects could help me with the following scenario?
More than one server (same location)
Some state should be shared between the servers (for instance, whether a scheduled task is running and on which server).
The obvious answer would of course be a database, but we are using Seam and there doesn't seem to be a good way to nest transactions inside a Seam bean, so I need to find a way where I don't have to go crazy over configuration (I tried to use EJBs, but persistence.xml wasn't pretty afterwards). So I need another way around this problem until Seam supports nested transactions.
This is basically the same scenario I have, if you need more details: https://community.jboss.org/thread/182126.
Any ideas?
Sounds like you need to do distributed job management.
The reality is that in the Java EE world, you are going to end up using queues, as in MoM [message-oriented middleware]. Seam will work with JMS, and you can have publish and subscribe queues.
Where you might want to look for an alternative is Akka. It gives you the ability to distribute jobs across machines using an actor/agent model that is transparent. That is to say, your agents can cooperate with each other whether they are on the same instance or across the network from each other, and you are not writing a ton of code to make that happen, or special-handling things up and down the message chain.
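For a taste of the model, here is a minimal sketch using the classic Akka actor API for Java; the class and message names are invented for illustration:

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

// A worker that processes job messages; with remoting configured, the same
// code works whether the sender is local or on another machine.
class JobWorker extends AbstractActor {
    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, job -> {
                    // ... do the actual work here ...
                    getSender().tell("done: " + job, getSelf());
                })
                .build();
    }
}

public class JobMain {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("jobs");
        ActorRef worker = system.actorOf(Props.create(JobWorker.class), "worker-1");
        worker.tell("reindex-customers", ActorRef.noSender()); // fire-and-forget message
    }
}
```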
The other thing Akka has going for it is the notion of supervision, a.k.a. Go Ahead and Fail, or Let It Crash. This is the idea (followed by the telcos for years) that systems will fail, and you should design for it and have a means of making things resilient.
Finally, the state of other job options in the Java world is dismal. I have used Seam for years. It's great, but they decided to support only Quartz for jobs, which is useless.
Akka is built on Netty, too, which does some pretty crazy stuff in terms of concurrency and performance.
[Not a TypeSafe employee, btw…]

Huge data processing / HPC in Java - suggest how to begin

I am thinking of working on a programming problem for which, I suppose, I will need to know a lot of advanced programming concepts. For various reasons I have decided to code it in Java, even though I am not proficient in it.
So I would like your suggestions, guidance, pointers to resources, books, tutorials, or any generic advice that you think is pertinent.
Here is the basic nature of my problem:
I need to create a client-server architecture. The server supports multiple concurrent clients. Clients send it simple instructions (maybe the server exposes some kind of API / runs a listener on a specific port), the server executes the instructions and sends the result back to the client.
The main job of the server is to do huge volumes of data processing based on the instructions given to it. It takes data from a backend database / file system. The data volume can easily surge up to ~200GB-700GB. Data will usually be streamed to it, but it may need to hold huge volumes of data in an in-memory cache during processing (and if RAM is not enough, page it to disk). Computations are generally numerically intensive in nature (say, taking the inverse of a matrix).
The server should be able to do multithreading (I don't know exactly what this term means in Java; what I want is for the server to be able to distribute the job across multiple parallel sub-processes).
The server itself should be very lightweight. I do NOT need any GUI.
It would be great if I design it in a way that lets me integrate it later with HPC frameworks like Hadoop.
Now, if I am to do this, what kind of programming do I need to learn? By the way, I have a good understanding of OOP, I am somewhat familiar with data structures and algorithms, and I know basic Java (I have never done any network or multithreaded programming in Java, but I have used typical OOP concepts, generics, the Comparable interface, etc.). I basically work in database programming, but I have also done a lot of C, C++, C#, and Python in the past.
Given the requirements and my background, please suggest:
How should I begin to work on this project? What is the way to architect the project?
Should I create some basic API definitions first and then start working on the details?
Should I follow any particular design pattern? Where to learn them from?
What are the things I need to learn in Java and where to learn them from?
What is the best way to read huge data into memory? Is Java NIO a good solution?
If I instantiate a class with a huge amount of data, would it work? (For example, say I have a Vector class representing a matrix with millions of elements, and the constructor reads a huge data set into memory.) What's the best way to handle that?
You will want to define how the client and server talk to each other. The easiest way is to use an established protocol such as HTTP by creating REST services that the client can call without much coding.
Most frameworks that support HTTP create several listeners that run on different threads. This gives you multithreading out of the box.
I'd suggest looking into Spring controllers. Spring is fairly lightweight.
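As a sketch of what that can look like (using present-day Spring Boot for brevity; the endpoint and payload are invented, tied to the matrix example in the question):

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// The servlet container serves each request on a thread from its pool,
// so a stateless controller is multithreaded without any thread code.
@SpringBootApplication
@RestController
public class ComputeServer {

    @PostMapping("/invert")
    public double[][] invert(@RequestBody double[][] matrix) {
        // ... numerically intensive work goes here, one request per thread ...
        return matrix; // placeholder result
    }

    public static void main(String[] args) {
        SpringApplication.run(ComputeServer.class, args);
    }
}
```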
If you want to use these frameworks, you will want to find them quickly and incorporate them into your application for compilation and packaging.
I would suggest looking into Maven for this. It's a big time saver, in particular using archetypes to create your project's folder structure and automatically downloading dependencies (and their dependencies).
Finally, my words of wisdom: ensure your services are singleton, stateless services. This means you create the objects only once, and each thread uses the same objects. There is a lot less garbage collection happening, which makes a huge difference when processing large numbers of requests.
Be careful not to use class-level variables to hold state in these services. If you do, different threads will overwrite each other's data.
First of all, from your explanation of things you seem to be in pretty good shape to use Java as your server-side language.
The kind of client-server architecture you choose may depend on what kind of clients you are actually serving: typical GUI- or console-based desktop clients, or web clients.
In the latter case you could use the Spring Framework in the normal fashion, and for the former you could explore Spring's support for RESTful web services. I would advise against going with socket- or TCP-based networking solutions or using raw Java networking.
Spring's RESTful API gives you a very nice abstraction over things like networking and multithreading, even for a desktop-based client. In the case of a desktop client you can use JSON/XML as the response format and the HttpClient library for making calls to the server, which is a very nice abstraction of the underlying networking.
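For instance, a desktop client needs only a few lines with the HTTP client built into the JDK since Java 11 (shown here as one option alongside the Apache HttpClient library mentioned above; the endpoint below is hypothetical):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DesktopClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/invert")) // hypothetical endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("[[1,2],[3,4]]"))
                .build();
        // The client never touches sockets or threads directly.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```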
Furthermore, Spring's design patterns follow a very linear flow of data. A lot of your fundamental design considerations are catered for by Spring itself, using Dependency Injection and Inversion of Control, which are extremely simple to incorporate.
For a detailed analysis of design patterns related to specific requirements, I would suggest you read the book Java Design Patterns: A Tutorial (Addison-Wesley) by James W. Cooper.
One more thing about the API design: it would be preferable to first create an API specification and only then implement it.

PHP Java combination for multithreaded processing - good or bad?

I need to make multiple calls to different web services using PHP, and I was wondering if a PHP-Java combination would be more appropriate for dealing with this issue.
The multiple calls to the services, if made sequentially, will create a significant amount of delay, so I am looking for ways to overcome that.
I have read articles that 'simulate' concurrent processing in PHP and deal with this particular issue, but I was wondering if introducing, say, a Java socket server that accepts requests and creates worker threads would be more efficient (faster).
Any comments appreciated.
Regards,
Interestingly I've been thinking about this issue as well. You have a number of options:
Use PHP calls to fork new processes;
Use a worker framework like beanstalkd to create work requests and have something pick them up;
Use something else like memcache to create work requests.
(2) is the interesting one (to me). You could run CLI PHP scripts to do the processing of beanstalkd requests, or you could use Java; which is better depends on a large number of factors. I'd generally favour a single-language environment over a multi-language one where possible and practical, but I can also envision instances where a Java backend would be a good idea.
That's exactly the reason why we switched from PHP to Java: multithreading. We had an app that reads RSS feeds over HTTP. Switching from a single-threaded PHP app to several threads in Java gave about a 10x boost. I can't say anything about PHP threading simulation, though.
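The Java side of such a speedup is just a thread pool. A minimal sketch (URLs invented) that fans the web-service calls out and collects the results:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        List<String> urls = List.of("https://example.com/a", "https://example.com/b");
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(urls.size());

        // One Callable per web-service call; invokeAll runs them concurrently.
        List<Callable<String>> calls = urls.stream()
                .map(url -> (Callable<String>) () -> client.send(
                        HttpRequest.newBuilder(URI.create(url)).build(),
                        HttpResponse.BodyHandlers.ofString()).body())
                .collect(Collectors.toList());

        for (Future<String> f : pool.invokeAll(calls)) {
            System.out.println(f.get().length()); // total wait ≈ slowest call, not the sum
        }
        pool.shutdown();
    }
}
```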

Java - Distributed Programming, RMI?

I've got a doozy of a problem here. I'm aiming to build a framework to allow for the integration of different traffic simulation models. This integration is based upon the sharing of link connectivities, link costs, and vehicles between simulations.
To make a distributed simulation, I plan to have a 'coordinator' (star topology). All participating simulations simply register with it, and talk only to the coordinator. The coordinator then coordinates the execution of various tasks between each simulation.
A quick example of a distribution problem is when one simulation is 'in charge' of certain objects, like a road, and another is 'in charge' of other roads. However, these roads are interconnected (hence we need synchronisation between these simulations, and need to be able to exchange data / invoke methods remotely).
I've had a look at RMI and am thinking it may be suited to this task (to abstract away having to create an over-the-wire signalling discipline).
Is this sane? The issue here is that simulation participants need to centralize some of their data storage in the 'coordinator' to ensure explicit synchronisation between simulations. Furthermore, some simulations may require components or methods from other simulations (hence the idea of using RMI).
My basic approach is to have the 'coordinator' run a giant RMI registry, with every simulation simply looking everything up in the registry, ensuring that the correct objects are used at each step.
Anyone have any tips for heading down this path?
You may want to check out Hazelcast as well. Hazelcast is an open-source transactional, distributed/partitioned implementation of queue, topic, map, set, list, lock, and executor service. It is super easy to work with: just add hazelcast.jar to your classpath and start coding. Almost no configuration is required.
If you are interested in executing your Runnable and Callable tasks in a distributed fashion, then please check out the Distributed Executor Service documentation at http://code.google.com/docreader/#p=hazelcast
Hazelcast is released under the Apache license, and enterprise-grade support is also available.
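A minimal sketch of the distributed-map idea, written against the classic Hazelcast 3.x API (the map and key names are invented for the road example from the question):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class RoadRegistry {
    public static void main(String[] args) {
        // Each JVM that runs this joins the same cluster via multicast discovery.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // A cluster-wide map: every simulation and the coordinator see the same
        // view of which simulation "owns" which road.
        IMap<String, String> roadOwners = hz.getMap("road-owners");
        roadOwners.put("road-42", "simulation-A");
        System.out.println(roadOwners.get("road-42"));
    }
}
```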
Is this sane? IMHO, no, and I'll tell you why. But first I'll add the disclaimer that this is a complicated topic, so any answer has to be viewed as barely scratching the surface.
First, instead of repeating myself I'll point you to a summary of Java grid/cluster technologies that I wrote a while ago. It's a mostly complete list.
The star topology is "natural" for a "naive" (I don't mean that in a bad way) implementation because point-to-point is simple and centralizing key controller logic is also simple. It is, however, not fault-tolerant. It introduces scalability problems and a single bottleneck. It also introduces communication inefficiencies (namely, the points communicate via a two-step process through the center).
What you really want for this is probably a cluster solution (rather than a data/compute grid), and I'd suggest you look at Terracotta. Ideally you'd look at Oracle Coherence, but it's no doubt expensive (compared to free). It is a fantastic product, though.
These two products can be used in a number of ways, but the core of both is to treat a cache like a distributed map: you put things in, you take things out, and you fire off code that alters the cache. Coherence (with which I'm more familiar) scales fantastically well in this regard. These are more "server"-based products, though, for a true cluster.
If you're looking at a more distributed model then perhaps you should be looking at more of an SOA based approach.
Have a look at http://www.terracotta.org/
It's a distributed Java VM, so it has the advantage that a clustered application looks no different from a standard Java application.
I have used it in applications, and the speed is very impressive so far.
Paul
Have you considered a message-queue approach? You could use JMS to communicate/coordinate tasks and results among a set of servers/nodes. You could even use Amazon's SQS (Simple Queue Service: aws.amazon.com/sqs) and run your servers on EC2, which would let you scale up and down as required.
Just my 2 cents.
Take a look at Jini; it might be of some use to you.
Well, Jini, or more specifically JavaSpaces, is a good place to start for a simple approach to the problem. JavaSpaces lets you implement a master-worker model, where your master (coordinator in your case) writes tasks to the space and the workers query for and process those tasks, writing the results back for the master. Since your problem is not embarrassingly parallel, and your workers need to synchronise/exchange data, this will add some complexity to your solution.
Using JavaSpaces adds a whole lot more abstraction to your implementation than using plain RMI (which the Jini framework uses internally as the default "wire protocol").
Have a look at this article from Sun for an intro.
And Jan Newmarch's Jini Tutorial is a pretty good place to start learning Jini.
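The master-worker exchange itself is small. A rough sketch of the idea, with the Jini space lookup omitted and the entry fields invented:

```java
import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

// JavaSpaces entries are plain objects with public fields and a no-arg constructor.
public class SimTask implements Entry {
    public String roadId;
    public Integer step;

    public SimTask() {}
    public SimTask(String roadId, Integer step) { this.roadId = roadId; this.step = step; }

    static void demo(JavaSpace space) throws Exception {
        // Master: publish a task into the space.
        space.write(new SimTask("road-42", 1), null, Lease.FOREVER);

        // Worker (possibly on another machine): block until a matching task appears.
        SimTask task = (SimTask) space.take(new SimTask(), null, Long.MAX_VALUE);
        // ... process the task, then write a result entry back for the master ...
    }
}
```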
Just as an addition to the other answers, which as far as I have seen all focus on grid and cloud computing, you should note that simulation models have one unique characteristic: simulation time.
When running distributed simulation models in parallel and synchronised, I see two options:
Each simulation model has its own simulation clock and event list, and these are synchronised over the network.
Alternatively, there is a single simulation clock and event list which "ticks the time" for all distributed (sub-)models.
The first option has been extensively researched for the High Level Architecture (HLA); see for example http://en.wikipedia.org/wiki/IEEE_1516 as a starter.
However, the second option seems simpler and lower-overhead to me.
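A sketch of what the second option's control loop could look like; the interface and names are made up, and in a distributed setting advanceTo would be a remote call through the coordinator:

```java
import java.util.List;

// Hypothetical single-clock coordinator: one authoritative simulation time
// that "ticks" every registered sub-model forward in lockstep.
interface SimulationModel {
    void advanceTo(double simTime); // process all local events up to simTime
}

class CentralClock {
    private final List<SimulationModel> models;
    private double now = 0.0;

    CentralClock(List<SimulationModel> models) {
        this.models = models;
    }

    void run(double endTime, double step) {
        while (now < endTime) {
            now += step;
            for (SimulationModel m : models) {
                m.advanceTo(now); // no model may run ahead of the shared clock
            }
        }
    }
}
```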
GridGain is a good alternative. They have a map/reduce implementation with "direct API support for split and aggregation" and "distributed task session". You can browse their examples and see if any of them fit your needs.
