I need to automate a workflow that runs after an event occurs. I have experience with CRUD applications but not with workflow/batch processing, and I need help designing the system.
Requirement
The workflow involves 5 steps. Each step is a REST call and depends on the previous step.
Example steps: (VerifyIfUserInSystem, CreateUserIfNeeded, EnrollInOpt1, EnrollInOpt2, ...)
My plan is to maintain 2 DB tables:
a WORKFLOW_STATUS table, which contains columns like
(foreign key referring to the primary table, workflow status (NEW, INPROGRESS, FINISHED, FAILED), completed step (STEP1, STEP2, ...), processed time, ...)
an EVENT_LOG table to keep track of events/exceptions for a particular record
(foreign key, STEP, ExceptionLog)
Question
#1. Is this a correct approach to orchestrating the system (which is not that complex)?
#2. As the steps involve REST calls, I might have to stop the process when a service is unavailable and resume it at a later point in time. I am not sure how many retry attempts should be made, or how to track the number of attempts made before marking the record as FAILED. (My guess is to create another column in the WORKFLOW_STATUS table called RETRY_ATTEMPT and set some limit before marking it FAILED; see the sketch after these questions.)
#3. Is the EVENT_LOG table a correct design, and what datatype (CLOB or VARCHAR(2048)) should I use for the exception log? Every step/retry attempt will be inserted as a new record in this table.
#4. How do I reset/restart a FAILED entry after a dependent service is back up?
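For #2 and #4, here is a minimal sketch of the retry/resume job I have in mind. Only the status values and column names come from the tables above; WorkflowRow, the exception type, and the helper methods are placeholders, not real code I have yet.

import java.util.List;

// Rough sketch only: the status values and column names come from the tables above;
// everything else (WorkflowRow, the step runner, the persistence calls) is a placeholder.
public class WorkflowRetryJob {

    private static final int MAX_RETRY_ATTEMPTS = 3;

    static class WorkflowRow {                       // mirrors a WORKFLOW_STATUS record
        long id;
        String status;                               // NEW, INPROGRESS, FINISHED, FAILED
        String completedStep;                        // STEP1, STEP2, ...
        int retryAttempt;                            // the proposed RETRY_ATTEMPT column
    }

    static class ServiceUnavailableException extends Exception { }

    // Runs on a schedule and picks up rows that are NEW, or not yet FINISHED with retries left.
    public void run(List<WorkflowRow> pendingRows) {
        for (WorkflowRow row : pendingRows) {
            try {
                resumeFromStep(row);                 // re-run the steps after row.completedStep
                row.status = "FINISHED";
            } catch (ServiceUnavailableException e) {
                row.retryAttempt++;
                row.status = (row.retryAttempt >= MAX_RETRY_ATTEMPTS) ? "FAILED" : "INPROGRESS";
                insertEventLog(row.id, row.completedStep, e);   // one EVENT_LOG record per attempt
            }
            updateWorkflowStatus(row);               // persist status, completed step, retry count
        }
    }

    private void resumeFromStep(WorkflowRow row) throws ServiceUnavailableException { /* REST calls */ }
    private void insertEventLog(long id, String step, Exception e) { /* insert into EVENT_LOG */ }
    private void updateWorkflowStatus(WorkflowRow row) { /* update WORKFLOW_STATUS */ }
}

Resetting a FAILED row (#4) would then just mean setting its status back to NEW or INPROGRESS and zeroing RETRY_ATTEMPT so the same job picks it up again.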
Please direct me to any blogs/videos/resources if available.
Thanks in advance.
Have you considered using a workflow orchestration engine like Netflix's Conductor? docs, Github.
Conductor comes with a lot of the features you are looking for built in.
Here's an example workflow that uses two sequential HTTP requests (where the 2nd requires a response from the first):
The input supplies an IP address (and an AccuWeather API key):
{
"ipaddress": "98.11.11.125"
}
HTTP request 1 locates the zipCode of the IP address.
HTTP request 2 uses the zip code (and the API key) to report the weather.
The output from this workflow is:
{
"zipcode": "04043",
"forecast": "rain"
}
Your questions:
I'd use an orchestration tool like Conductor.
Each of these tasks (defined in Conductor) has retry logic built in. How you configure it will vary based on expected timings, etc. Since the 2 APIs I'm calling here are public (and relatively fast), I don't wait very long between retries:
"retryCount": 3,
"retryLogic": "FIXED",
"retryDelaySeconds": 5,
Inside the connection, there are more parameters you can tweak:
"connectionTimeOut": 1600,
"readTimeOut": 1600
There is also exponential retry logic if desired.
The event log is stored in ElasticSearch.
You can build error pathways for all your workflows.
I have this workflow up and running in the Conductor Playground, called "Stack_overflow_sequential_http". Create a free account, then run the workflow: click "Run Workflow", select "Stack_overflow_sequential_http", and use the JSON above to see it in action.
The get_weather connection is a very slow API, so it may fail a few times before succeeding. Copy the workflow and play with the timeout values to improve the success rate.
You describe an Enterprise Integration Pattern with enrichments/transformations from REST calls and stateful aggregation of the results over time (which means many such flows may be in progress at any one time). Apache Camel was designed for exactly these scenarios.
See What exactly is Apache Camel?
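To give a feel for it, a Camel route for this kind of step-by-step flow might look roughly like the sketch below. The endpoint URIs, route names, and retry settings are invented for illustration, not taken from any real system.

import org.apache.camel.builder.RouteBuilder;

// Illustrative sketch only: the endpoint URIs and route names are made up.
public class EnrollmentRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Retry transient failures a few times, then park the message for later handling.
        errorHandler(deadLetterChannel("direct:failedEnrollments")
                .maximumRedeliveries(3)
                .redeliveryDelay(5000));

        from("direct:startEnrollment")
            .to("http://user-service/verify")       // VerifyIfUserInSystem
            .to("http://user-service/create")       // CreateUserIfNeeded
            .to("http://enrollment-service/opt1")   // EnrollInOpt1
            .to("http://enrollment-service/opt2")   // EnrollInOpt2
            .to("direct:recordSuccess");
    }
}

Camel's error handler keeps the retry/redelivery bookkeeping out of your own tables, which is a large part of what you were planning to build by hand.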
Related
I have a requirement where, whenever there's a new entry in a table, I want to trigger an event. I have used EntityListeners (a Spring Data JPA concept) for this, which works perfectly fine; but the issue is that the insert can also happen through stored procedures or manual entry. I searched online and found the Spring JPA inbound and outbound channel adapter concept, but I don't think it helps me achieve what I want. Can anybody clarify whether this concept applies here (I don't know much about it), or suggest other solutions for how I can achieve this?
There are no "great" mechanisms for raising events "from the data layer" in SQL Server.
There are three "OK" ones:
Triggers (only arguably OK)
Triggers seem like an obvious solution, but then you have to ask yourself... what will the trigger actually do? If it just writes data into another table, you still haven't gotten yourself outside the database. There are various arcane tricks you could try to use for this, like CLR procedures, or a few extended procedures.
But if you go down that route, you have to start thinking about another consideration: Triggers happen in the same transaction as the DML operation that caused them to fire. If they take time to execute, you'll be slowing down your OLTP workloads. If they do anything that is potentially unreliable they could fail, causing your transaction to roll back.
Triggers plus service broker
Service Broker provides a mechanism - perhaps the only even-half-sensible mechanism - to get your data out of SQL and into some kind of listener in a "push" based manner. You still have a trigger, but the trigger writes data to a Service Broker queue. A listener can use a special WAITFOR (RECEIVE ...) statement to listen for data as it appears in the queue. The nice thing about this is that once the trigger has pushed data into a broker queue, its job is done. The "receipt" of that data is decoupled from the transaction that caused it to be enqueued in the first place. This sort of Service Broker mechanism is what is used by things like the SqlDependency built into .NET.
The two main issues with service broker are complexity and performance. Service broker has a steep learning curve, and it's easy to get things wrong. Performance becomes complex if you need to scale, because while it's "easy" to build xml or json payloads, large set based data changes can mean those payloads are massive.
In any case, if you want to explore this route, you're going to want to read (all of) the excellent articles on the subject by Remus Rusanu
Bear in mind that this is an asynchronous "near real time" mechanism, not a synchronous "real time" mechanism like triggers.
Polling a built in change detection mechanism: CDC or Change Tracking.
Sql server comes with two flavours of technology that can natively "watch" changes that happen in tables, and record them: Change Tracking, and Change Data Capture
Neither of these push data out of the database, they're both "pull" based. What they do is store additional data in the database when changes happen. CDC can provide a complete log of every change, whereas change tracking "points to" rows that have changed via the primary key values. Though both of these involve "polling history", there are significant differences between them, so read the fine print.
Note that CDC is "doubly asynchronous" - the data is read from the transaction log, so recording the data is not part of the original transaction. And then you have to poll the CDC data, it's not pushed out to you. Furthermore, the functions generated by Microsoft when you enable CDC can be unbelievably slow as soon as you ask for something useful, like net changes with mask (which can tell you which columns really changed their value), and your ability to enable CDC comes with a lot of caveats and limitations (again, read the docs for all of this).
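For comparison, a polling pass against Change Tracking can be as simple as the JDBC sketch below. The dbo.Orders table, its OrderId column, and the version bookkeeping are assumptions for illustration; CHANGETABLE and CHANGE_TRACKING_CURRENT_VERSION() are the built-in SQL Server functions.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Sketch only: assumes change tracking is enabled on dbo.Orders and that lastSyncVersion
// was saved from a previous poll (or seeded from CHANGE_TRACKING_CURRENT_VERSION()).
public class ChangeTrackingPoller {

    private final DataSource dataSource;
    private long lastSyncVersion;

    public ChangeTrackingPoller(DataSource dataSource, long initialVersion) {
        this.dataSource = dataSource;
        this.lastSyncVersion = initialVersion;
    }

    public void pollOnce() throws SQLException {
        String sql = "SELECT ct.SYS_CHANGE_VERSION, ct.SYS_CHANGE_OPERATION, ct.OrderId "
                   + "FROM CHANGETABLE(CHANGES dbo.Orders, ?) AS ct";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, lastSyncVersion);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long version = rs.getLong("SYS_CHANGE_VERSION");
                    String op = rs.getString("SYS_CHANGE_OPERATION");  // I, U or D
                    long orderId = rs.getLong("OrderId");
                    // raise your application-level event here, then remember the highest version seen
                    lastSyncVersion = Math.max(lastSyncVersion, version);
                }
            }
        }
    }
}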
As to which of these is "best", well, that's a matter of opinion and circumstance. I have used CDC extensively, service broker rarely, and triggers almost never, as a way of getting events out of SQL. I have never actually used change tracking in a production environment, but if I had the choice again I would probably have chosen change tracking rather than change data capture, at least until or unless there were requirements that mandated the use of CDC because of its additional functionality beyond what change tracking can provide.
One last note: If you need to "guarantee" that the events that get raised have in fact been collected by a listener and successfully forwarded to subscribers, well, you have some work ahead of you! Guaranteed messaging is hard.
I have developed a Java console application which does the following:
Fetch product details like product ID, name, cost, etc. from an Oracle database and put them in a map (say dbMap) - one product can have multiple records, as there are sub-products.
Fetch similar product details from a REST server and store them in a map (say restMap).
Since the DB has the correct data, compare the two maps - dbMap and restMap - and identify what should be added, replaced, and removed on the REST server.
For this purpose, I create one JSON Patch request for each product - with add, replace, and remove operations (around a hundred or so for each product) - and send it to the REST server.
However, I see that it takes a few minutes to perform all these operations, and they all happen sequentially - the database call, the REST server call, the comparison, and finally the PATCH to the REST server.
I am assuming that, instead of tackling all the data in a single thread, it might be faster to get the list of products and go product by product, with each product processed in its own thread and the threads running in parallel.
So each thread might do the following: fetch the details of one product from the database and from the REST server, compare them, generate a patch request (with add/remove/replace operations) for that product, and send it to the REST server.
Could you please suggest how I can implement this type of thread architecture in Java? (There seem to be several ways, like thread pools, Akka, etc., and I am confused.)
Since the DB call and REST call do not depend on each other, you can make them in parallel in 2 threads, with one dedicated thread to process the comparison.
You could use a producer-consumer approach here.
You can use thread pools for this via ExecutorService:
http://tutorials.jenkov.com/java-util-concurrent/executorservice.html
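For example, something along these lines (a rough sketch: Product, PatchRequest, the fetch/diff/sendPatch methods, and the pool size are placeholders for your existing code and your own tuning):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch only: Product, PatchRequest and the fetch/diff/sendPatch methods stand in for your code.
public class ProductSyncRunner {

    public void syncAll(List<String> productIds) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10);  // tune to what Oracle and the REST server can handle
        for (String id : productIds) {
            pool.submit(() -> {
                Product dbProduct = fetchFromDb(id);        // Oracle query for one product
                Product restProduct = fetchFromRest(id);    // REST call for the same product
                PatchRequest patch = diff(dbProduct, restProduct);
                sendPatch(id, patch);                       // PATCH back to the REST server
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.MINUTES);
    }

    static class Product { }
    static class PatchRequest { }
    private Product fetchFromDb(String id) { return new Product(); }
    private Product fetchFromRest(String id) { return new Product(); }
    private PatchRequest diff(Product db, Product rest) { return new PatchRequest(); }
    private void sendPatch(String id, PatchRequest patch) { }
}

Each submitted task handles one product end to end (DB fetch, REST fetch, comparison, PATCH), so products are processed in parallel while each product's own steps stay sequential.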
My question of the day is about combining operations when building microservices.
Let us use a fictional scenario: I want to build a dashboard. The dashboard is composed of a bunch of people and their info (history, reviews, purchases, last products searched).
Having read about spring-cloud and spring-reactor, I would like a non-blocking solution calling multiple microservices: user service, review service, search engine service, ...
My first guess was to do something like
load the users,
for each one load its reviews then
load its history then
combine all the data
In pseudo-code, something like loadUsers().flatMap(u -> loadReviews(u))....reduce(). It's very approximate here, as you can see.
When loading 1 user, we can estimate that we need 4 more HTTP calls; for 100 users, 400 additional calls, and so on. The number of calls grows quickly.
In the worst case, where a microservice itself delegates data loading to an XYZ microservice, we get, for 1 user -> N calls, including 1 review call -> 1 XYZ call, so the fan-out compounds at each level (quadratic, perhaps?).
To avoid that, we can perhaps load all the users, extract their ids, and call each microservice with a batch of ids. Each microservice can load all the data at once (a list of reviews mapped by id, perhaps) and the original caller will merge all these lists (a kind of zip function).
Summary: I just read this question about Observable composition. My question can be summarized as: "Do you use the same strategy when you don't have a single user at the start of the chain but hundreds of users?" (Performance can be a problem, no?)
You will likely want to use batching to reduce the number of downstream calls. Instead of sending a single user through the observable, you will want to send a batch.
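A hedged sketch of the batched variant with Reactor is shown below. The loadUsers/loadReviewsByIds/loadPurchasesByIds clients, the batch size of 50, and the domain types are all invented for illustration; the point is that each downstream service receives one call per batch instead of one call per user.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Sketch only: the domain types and service clients below are placeholders.
public class DashboardService {

    public Mono<Dashboard> buildDashboard() {
        return loadUsers()                               // Flux<User>, one call to the user service
            .buffer(50)                                  // group users into batches of 50
            .flatMap(batch -> {
                List<Long> ids = batch.stream().map(User::getId).collect(Collectors.toList());
                // one batched call per downstream service instead of one call per user
                Mono<Map<Long, List<Review>>> reviews = loadReviewsByIds(ids);
                Mono<Map<Long, List<Purchase>>> purchases = loadPurchasesByIds(ids);
                return Mono.zip(reviews, purchases)
                           .map(t -> toPartialDashboard(batch, t.getT1(), t.getT2()));
            })
            .reduce(Dashboard::merge);                   // combine the per-batch results
    }

    // Placeholders for real microservice clients and domain types.
    private Flux<User> loadUsers() { return Flux.empty(); }
    private Mono<Map<Long, List<Review>>> loadReviewsByIds(List<Long> ids) { return Mono.just(Map.of()); }
    private Mono<Map<Long, List<Purchase>>> loadPurchasesByIds(List<Long> ids) { return Mono.just(Map.of()); }
    private Dashboard toPartialDashboard(List<User> users, Map<Long, List<Review>> r, Map<Long, List<Purchase>> p) { return new Dashboard(); }

    static class User { Long getId() { return 0L; } }
    static class Review { }
    static class Purchase { }
    static class Dashboard { Dashboard merge(Dashboard other) { return this; } }
}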
I have a problem with duplicate records arriving in our database via a Java web service, and I think it has to do with Oracle processing threads.
Using an iPhone app we built, users add bird observations to a new site they visit on holiday. They create three records at "New Site A" (for example). The iPhone packages each of these three records into separate JSON strings containing the same date and location details.
On Upload, the web service iterates through each JSON string.
Iteration/Observation 1. It checks the database to see if the site exists, and if not, creates a new site and adds the observation into a hanging table.
Iteration/Obs 2. The site should now exist in the database, but it isn't found by the same site-existence check used in Iteration 1, and a second new site is created.
Iteration/Obs 3. The check for existing site NOW WORKS, and the third observation is attached to one of the existing sites. So the web service and database code does work.
The web service commits at the end of each iteration.
Is the reason the second iteration doesn't find the new site a delay in the Oracle commit after it's called from the Java code - so that iteration 2 has already started processing by the time iteration 1 is truly complete - OR is it possible that Oracle is running each iteration on a separate thread?
One solution we thought about was to use Thread.sleep(1000) in the web service, but I'd rather not penalize the iPhone users.
Thanks for any help you can provide.
Iain
Sounds like a race condition to me. Probably observations 1 and 2 are arriving very close to each other, so that 1 is still being processed when 2 arrives. Oracle is ACID-compliant, meaning your transaction for observation 2 cannot see the changes made in transaction 1 unless that transaction completed before transaction 2 started.
If you need check-then-create functionality, you'd best synchronize it at a single point in your back end.
Also, add a constraint in your DB to avoid the duplication at all costs.
It's not an Oracle problem; Thread.sleep would be a poor solution, especially since you don't know the root cause.
Your description is confusing. Are the three JSON strings sent in one HTTP request? Does the order matter, or does processing any of them first set up the new location for the ones that follow?
What's a "hanging table"?
Is this a parent-child relation between location and observation? So the unit of work is to INSERT a new location into the parent table followed by three observations in the child table that refer back to the parent?
I think it's a problem with your queries and how they're written. I can promise you that Oracle is fast enough for this trivial problem. If it can handle NASDAQ transaction rates, it can handle your site.
I'd write your DAO for Observation this way:
public interface ObservationDao {
void saveOrUpdate(Observation observation);
}
Keep all the logic inside the DAO. Test it outside the servlet and put it aside. Once you have it working you can concentrate on the web app.
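A rough sketch of what a saveOrUpdate implementation could look like under that advice is below. It uses plain JDBC; the helper methods and any table/column names behind them are invented, and it relies on a UNIQUE constraint on the site's natural key so that a concurrent duplicate insert fails instead of silently creating a second site.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;
import javax.sql.DataSource;

// Sketch only: Observation comes from the interface above; the SQL helper methods are placeholders.
public class JdbcObservationDao implements ObservationDao {

    private final DataSource dataSource;

    public JdbcObservationDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public void saveOrUpdate(Observation observation) {
        try (Connection con = dataSource.getConnection()) {
            con.setAutoCommit(false);
            Long siteId = findSiteId(con, observation);        // SELECT by the site's natural key
            if (siteId == null) {
                try {
                    siteId = insertSite(con, observation);     // INSERT the new site
                } catch (SQLIntegrityConstraintViolationException e) {
                    siteId = findSiteId(con, observation);     // another request won the race; reuse its site
                }
            }
            insertObservation(con, siteId, observation);
            con.commit();
        } catch (SQLException e) {
            throw new RuntimeException("saveOrUpdate failed", e);
        }
    }

    // The actual SQL is left out of this sketch.
    private Long findSiteId(Connection con, Observation o) throws SQLException { return null; }
    private Long insertSite(Connection con, Observation o) throws SQLException { return 0L; }
    private void insertObservation(Connection con, Long siteId, Observation o) throws SQLException { }
}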
I am doing a geoquery over 300 user entities with a result range of 10.
I've made the query 120 times. For each query I got 10 user entity objects.
After this, my App Engine read operations reached 52% (26,000 operations).
My user entity has 12 single-value properties and 3 multi-value properties (List type).
The user entity has 2 indexes on single-value properties and 2 indexes on list-type properties.
Can anyone please help me understand how Google App Engine counts datastore read operations?
As a start, use appstats. It'll show you where your costs are coming from in your app:
https://developers.google.com/appengine/docs/java/tools/appstats
To keep your application fast, you need to know: Is your application making unnecessary RPC calls? Should it cache data instead of making repeated RPC calls to get the same data? Will your application perform better if multiple requests are executed in parallel rather than serially? The Appstats library helps you answer these questions and verify that your application is using RPC calls in the most efficient way by allowing you to profile your RPC calls. Appstats allows you to trace all RPC calls for a given request and reports on the time and cost of each call.
Once you understand where your costs are coming from, you can optimise.
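If the Appstats trace shows repeated reads for the same entities, caching them with the App Engine memcache API is the usual first optimisation. A rough sketch, with the key scheme and the datastore lookup left as placeholders for your geoquery code:

import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

// Sketch only: the key scheme and loadUserFromDatastore are placeholders for your own code.
public class CachedUserLookup {

    private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    public Object getUser(String userKey) {
        Object cached = cache.get(userKey);            // a cache hit avoids the datastore read entirely
        if (cached != null) {
            return cached;
        }
        Object user = loadUserFromDatastore(userKey);  // this is what costs read operations
        cache.put(userKey, user);
        return user;
    }

    private Object loadUserFromDatastore(String userKey) {
        // your existing datastore / geoquery lookup goes here
        return null;
    }
}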
If you just want to know what the prices are, they are here:
https://developers.google.com/appengine/docs/billing
You can analyse what is going on under the hood with appstats: https://developers.google.com/appengine/docs/java/tools/appstats