Making an existing Spring Batch application run on multiple nodes

We have an existing Spring Batch application that we want to make scalable so it can run on multiple nodes.
The scalability documentation for Spring Batch involves both code changes and configuration changes.
I am just wondering if this can be achieved by configuration changes alone (adding new classes and wiring them in configuration is fine, but I want to avoid code changes to existing classes).
Thanks a lot for the help in advance.

It really depends on your situation. Specifically, why do you want to run on multiple nodes? What is the bottleneck you're attempting to overcome? The two typical scenarios that Spring Batch handles out of the box for scaling across multiple nodes are remote chunking and remote partitioning. Both are master/slave configurations, but each has a different use case.
Remote chunking is used when the processor in a step is the bottleneck. In this case, the master node reads the input and sends it via a Spring Integration channel to remote nodes for processing. Once the item has been processed, the result is returned to the master for writing. In this case, reading and writing are done locally on the master. While this helps parallelize processing, it takes an I/O hit because every item is sent over the wire (and requires guaranteed delivery, à la JMS for example).
Remote partitioning is the other scenario. In this case, the master generates a description of the input to be processed for each slave, and only that description is sent over the wire. For example, if you're processing records in a database, the master may send a range of row ids to each slave (1-100, 101-200, etc.). Reading and writing occur locally on the slaves, and guaranteed delivery is not required (although useful in certain situations).
Both of these options can be done with minimal (or no) new classes depending on your use case. There are a couple different places to look for information on these capabilities:
Spring Batch Integration Github repository - Spring Batch Integration is the project that supports the above use cases. You can read more about it here: https://github.com/spring-projects/spring-batch-admin/tree/master/spring-batch-integration
My remote partitioning example - This talk walks through remote partitioning and provides a working example to run on CloudFoundry (currently it only works on CF v1, but updates for CF2 are coming in a couple of days). The configuration is almost the same; only the connection pool for Rabbit is different: https://github.com/mminella/Spring-Batch-Talk-2.0 The video for this presentation can be found on YouTube here: http://www.youtube.com/watch?v=CYTj5YT7CZU
Gunnar Hillert's presentation on Spring Batch and Spring Integration: This was presented at SpringOne2GX 2013 and contains a number of examples: https://github.com/ghillert/spring-batch-integration-sample
In any of these cases, remote chunking should be achievable with zero new classes. Remote partitioning typically requires you to implement one new class (the Partitioner); a minimal sketch follows.
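For illustration, here is a hedged sketch of such a Partitioner along the lines of the row-id-range idea above. The class name, the id bounds passed into the constructor, and the minValue/maxValue keys are placeholders invented for this example, not anything prescribed by Spring Batch:

    import java.util.HashMap;
    import java.util.Map;

    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;

    public class ColumnRangePartitioner implements Partitioner {

        private final long minId;
        private final long maxId;

        public ColumnRangePartitioner(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            Map<String, ExecutionContext> partitions = new HashMap<String, ExecutionContext>();
            long targetSize = (maxId - minId) / gridSize + 1;
            long start = minId;

            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                // Each slave receives only a description of its slice (a range of row ids),
                // not the data itself; reading and writing stay local to that slave.
                context.putLong("minValue", start);
                context.putLong("maxValue", Math.min(start + targetSize - 1, maxId));
                partitions.put("partition" + i, context);
                start += targetSize;
            }

            return partitions;
        }
    }

A step-scoped reader on each slave can then pull minValue/maxValue out of the step execution context and restrict its query to that range.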

Related

Microservices - Stubbing/Mocking

I am developing a product using microservices and am running into a bit of an issue. In order to do any work, I need to have all 9 services running on my local development environment. I am using Cloud Foundry to run the applications, but when running locally I am just running the Spring Boot jars themselves. Is there any way to set up a more lightweight environment so that I don't need everything running? Ideally, only the service I am currently working on would need to be real.
I believe this is a matter of your testing strategy. If you have a lot of microservices in your system, it is not wise to always perform end-to-end testing at development time -- it costs you productivity, and the setup is usually complex (as you observed).
You should really think about what it is you want to test. Within one service, it is usually good to decouple the core logic from the integration points with other services. Ideally, you should be able to write simple unit tests for your core logic. If you want to test integration points with other services, use a mocking library (a quick Google search shows this to be promising: http://spring.io/blog/2007/01/15/unit-testing-with-stubs-and-mocks/); a small sketch is given at the end of this answer.
If you don't have one already, I would highly recommend setting up a separate staging environment with all microservices running. You should perform all your end-to-end testing there, before deploying to production.
This post from Martin Fowler has a more comprehensive take on microservice testing strategy:
https://martinfowler.com/articles/microservice-testing
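As a rough sketch of the "decouple the core logic and mock the integration points" idea above, assuming JUnit 4 and Mockito on the classpath. The PricingService and OrderService types are invented purely for this example:

    import static org.junit.Assert.assertEquals;
    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.when;

    import java.math.BigDecimal;
    import org.junit.Test;

    public class OrderServiceTest {

        // Hypothetical collaborator that would normally call another microservice.
        interface PricingService {
            BigDecimal priceFor(String sku);
        }

        // Hypothetical core logic under test, decoupled from the remote call.
        static class OrderService {
            private final PricingService pricing;
            OrderService(PricingService pricing) { this.pricing = pricing; }
            BigDecimal totalFor(String sku) { return pricing.priceFor(sku); }
        }

        @Test
        public void calculatesTotalWithoutCallingTheRealPricingService() {
            PricingService pricing = mock(PricingService.class);
            when(pricing.priceFor("SKU-1")).thenReturn(new BigDecimal("9.99"));

            assertEquals(new BigDecimal("9.99"), new OrderService(pricing).totalFor("SKU-1"));
        }
    }

The core logic runs entirely in-process; only the integration point is replaced with a mock, so none of the other eight services needs to be running.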
It boils down to the testing technique that you use. Here is my recent answer in another topic that you might find useful: https://stackoverflow.com/a/44486519/2328781.
In general, I think that WireMock is a good choice for the following reasons (a minimal stub sketch follows this list):
It has out-of-the-box support in Spring Boot.
It has out-of-the-box support in Spring Cloud Contract, which makes it possible to use a very powerful technique called Consumer Driven Contracts.
It has a recording feature. Set up WireMock as a proxy and make requests through it. This will generate stubs for you automatically, based on your requests and responses.
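As a minimal sketch of such a stub, assuming the plain WireMock Java API (the port and the /users/42 endpoint are made up for the example):

    import com.github.tomakehurst.wiremock.WireMockServer;

    import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
    import static com.github.tomakehurst.wiremock.client.WireMock.configureFor;
    import static com.github.tomakehurst.wiremock.client.WireMock.get;
    import static com.github.tomakehurst.wiremock.client.WireMock.stubFor;
    import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;

    public class UserServiceStub {

        public static void main(String[] args) {
            // Start a standalone WireMock server on a local port in place of the real user service.
            WireMockServer server = new WireMockServer(8081);
            server.start();
            configureFor("localhost", 8081);

            // Stub a GET endpoint so the service under development receives a canned response.
            stubFor(get(urlEqualTo("/users/42"))
                    .willReturn(aResponse()
                            .withStatus(200)
                            .withHeader("Content-Type", "application/json")
                            .withBody("{\"id\": 42, \"name\": \"Test User\"}")));
        }
    }

Point the service you are working on at http://localhost:8081 instead of the real dependency and it will see the stubbed response.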
There are multiple tools out there that let you create mocked versions of your microservices.
When I encountered this exact problem myself I decided to create my own tool which is tailored for microservice testing. The goal is to never have to run all microservices at once, only the one that you are working on.
You can read more about the tool and how to use it to mock microservices here: https://mocki.io/mock-api-microservices. If you only want to run them locally, that is possible using the open-source CLI tool.
It can be solved if your microservices allow passing metadata along with requests.
A good microservice architecture should use central service discovery; in addition, every service should be able to accept a metadata map along with the request payload. Known fields of this map can be interpreted and modified by a service and then passed on to the next service.
The most popular use of per-request metadata is request tracing (i.e. collecting the tree of nodes used to process a request and the timings for every node), but it can also be used to tell the entire system which nodes to use.
Thus the plan is (a minimal propagation sketch follows this list):
register your local node in the dev environment's service discovery
send a request to the entry node of your system along with metadata telling everyone to use your local service instance instead of the default one
the metadata will propagate and your local node will be called by the dev environment; the local node will then pass the processed results back to the dev environment
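As a very rough sketch of the propagation step, assuming Spring's RestTemplate is used for inter-service calls: a hypothetical X-Route-Override header is captured from the incoming request (for example by a servlet filter, not shown) and copied onto every outgoing call so the override travels down the request chain. The header name and the ThreadLocal wiring are assumptions made for this example, not a standard:

    import java.io.IOException;

    import org.springframework.http.HttpRequest;
    import org.springframework.http.client.ClientHttpRequestExecution;
    import org.springframework.http.client.ClientHttpRequestInterceptor;
    import org.springframework.http.client.ClientHttpResponse;

    public class RouteOverridePropagationInterceptor implements ClientHttpRequestInterceptor {

        // Hypothetical metadata field telling downstream services which instance to route to.
        public static final String ROUTE_OVERRIDE_HEADER = "X-Route-Override";

        // Value captured from the incoming request, e.g. stored here by a servlet filter.
        private static final ThreadLocal<String> CURRENT_OVERRIDE = new ThreadLocal<String>();

        public static void setCurrentOverride(String value) {
            CURRENT_OVERRIDE.set(value);
        }

        @Override
        public ClientHttpResponse intercept(HttpRequest request, byte[] body,
                                            ClientHttpRequestExecution execution) throws IOException {
            String override = CURRENT_OVERRIDE.get();
            if (override != null) {
                // Propagate the routing metadata so the next service can pick the
                // developer's local instance instead of the default one.
                request.getHeaders().add(ROUTE_OVERRIDE_HEADER, override);
            }
            return execution.execute(request, body);
        }
    }

Registered on each service's RestTemplate, this keeps the override flowing through the whole call tree without touching the business logic.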
Alternatively:
use code generation for inter-service communication to reduce the risk of failures caused by mistakes in RPC code
resort to integration tests, mocking all client APIs for the microservice under development
fully automate the deployment of your system to your local machine. You will possibly need to run nodes with reduced memory (which is generally OK, as memory is commonly consumed only under load) or buy more RAM.
An approach would be to use or deploy an app which maps paths/URLs to JSON response files. I personally haven't used it, but I believe http://wiremock.org/ might help you.
For Java microservices, you should try stubby4j. This will mock the JSON responses of other microservices using the Stubby server. If you feel that mocking is not enough to cover all the features of your microservices, you should set up a local Docker environment to deploy the dependent microservices.

A way to analyze and compute huge Oracle data on a daily basis

I need to calculate summary data from various transaction tables in the primary Oracle database of our core engine. I have planned to write this as a multi-threaded Java program that will be scheduled as a job that runs every midnight; the program will extract the data from various transaction log tables, joining other tables with them from the database, calculate, and store the result back to a separate table. The log tables will usually contain millions of rows, with some tables partitioned daily and some monthly.
The GUI (dashboard) platform would request this information through a separate web service that already exists to provide various other details. Almost all the modules in the project use the Spring framework, so I thought of using Spring Batch with its scheduling capability. As I started some research before beginning the design, I found various other techniques in use, such as ETL tools, scheduling in the database itself, real-time data analysis, and other similar techniques.
Am I overcomplicating the problem at hand? Was my earlier approach the right one? Or is there a way, or a Java framework, to do this processing in real time, say while the data is being processed (while the core engine is processing the data), so that there is no need to write a separate job to do this calculation?
You can have a look at Spring XD, which is an engine for processing high-volume data.
Spring XD offers a lot of readers (jdbc, file, jms), processors, and writers (jdbc, file, jms) out of the box, and you can write your own readers, writers, and processors easily.
Spring XD uses the Unix-style source, pipe, sink model to connect multiple processing steps. You can see this post for a small example of applying Spring XD to high-volume Twitter data.
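As a rough illustration of that source | pipe | sink style, a stream definition entered in the XD shell looks something like the following. The stream name and module choices are placeholders only; the real module options would depend on your tables and aggregation logic:

    xd:> stream create --name daily-summary --definition "jdbc | transform | jdbc" --deploy

Here a jdbc source reads rows, a transform processor computes the summary, and a jdbc sink writes the result table.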

Is Spring Integration suitable for web-farm processing of "reliable queue"?

Sorry if the title is confusing; let me explain my question.
Our team needs to develop a web service which is supposed to run on several nodes (web farm, horizontal scaling). We know how to implement this "manually", but we're pretty excited about Spring Integration, which is new to us, so we're really trying to understand whether this is a good fit for our scenario; if so, we'll try to make use of it.
Typical scenario:
Several servers ("nodes") running the same web application (let's call it "OurWebService")
We need to pull files from external systems ("InboundExtSystems")
Process this data with the help of other external systems (this involves local resource-consuming operations) ("UtilityExtServices")
Submit processing results to another set of external systems ("OutboundExtSystems")
Non-functional requirements:
For performance reasons we cannot query UtilityExtServices on demand, and local processing is also CPU-intensive. So we need to have a queue, in order to control the pace at which we perform requests and process results
We expect several nodes to pull tasks equally from this queue and process them
We need to make sure that every queued task pulled from InboundExtSystems will be handled - we need to guarantee that none of them will disappear.
We need to make sure timeouts are handled as well. If task processing times out, we need to requeue the task (and make sure the previous handler will not submit results for it).
We need to be able to perform rolling updates. Let's say 5 nodes are processing the queue; we want to be able to sequentially stop, upgrade, and start each node without noticeably impacting system performance.
So the question is: is Spring Integration a good fit for such a case?
If the answer is "yes", could you kindly name the primary components we should use?
P.S. Sure enough, we would probably also need to pick something as a message bus and a queue accessible by every node (maybe Redis, Hazelcast, or RabbitMQ; not sure what is most appropriate).
Yes, it's a good fit. I would suggest RabbitMQ for the transport/queuing and the Spring Integration AMQP endpoints.
Rolling updates shouldn't be an issue unless you change the format of the messages sent between nodes. But even then, you could handle it relatively easily by moving to a new set of queues.
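As a rough sketch of the consuming side on each node, assuming Spring Integration's AMQP support with Java configuration and a ConnectionFactory bean provided elsewhere. The queue name, channel name, and handler are placeholders for the example:

    import org.springframework.amqp.rabbit.connection.ConnectionFactory;
    import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.integration.amqp.inbound.AmqpInboundChannelAdapter;
    import org.springframework.integration.annotation.ServiceActivator;
    import org.springframework.integration.channel.DirectChannel;
    import org.springframework.integration.config.EnableIntegration;
    import org.springframework.messaging.MessageChannel;

    @Configuration
    @EnableIntegration
    public class WorkQueueConfig {

        @Bean
        public MessageChannel taskChannel() {
            return new DirectChannel();
        }

        @Bean
        public SimpleMessageListenerContainer listenerContainer(ConnectionFactory connectionFactory) {
            SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(connectionFactory);
            container.setQueueNames("ourwebservice.tasks"); // placeholder queue name
            container.setConcurrentConsumers(2);            // pace of local processing per node
            return container;
        }

        @Bean
        public AmqpInboundChannelAdapter inboundAdapter(SimpleMessageListenerContainer container) {
            // Every node runs this adapter; RabbitMQ hands each queued task to exactly one
            // competing consumer, and unacknowledged messages are redelivered if a node dies.
            AmqpInboundChannelAdapter adapter = new AmqpInboundChannelAdapter(container);
            adapter.setOutputChannel(taskChannel());
            return adapter;
        }

        @ServiceActivator(inputChannel = "taskChannel")
        public void handleTask(Object payload) {
            // CPU-intensive processing and calls to UtilityExtServices would go here.
            // The payload type depends on the configured message converter.
        }
    }

With this layout, stopping one node during a rolling update simply means its consumers stop pulling from the shared queue while the others keep draining it.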

How can I avoid duplication of business logic when batch processing?

I have a web application dedicated to batch processing (the batch service from here on out, API driven) and I have the main web application that is dedicated to everything else. I've been struggling to make a decision on the best way to avoid duplication of business logic in the batch service. Both applications are clustered. The separation for batch processing has been okay for simple jobs, but I have more complex jobs where it would just cause chaos if the business logic were duplicated. Here's my use case for the purposes of this question.
Customer schedules a cron job for user updates.
Batch service is given a CSV file with 20,000 user records.
The batch service rips through the file performing validation on the records, basically a dry run.
The batch service will check the allowable change and error thresholds (percentages or counts).
If validation thresholds pass, the batch service will begin creating/updating users.
When users are created or updated, there are a number of modules/features that need to know about these events.
Job progress is tracked and customer can view progress, logs, and status of job.
Here are a few solutions I have been thinking about:
Jar up the business logic and share it across the two applications. This wouldn't necessarily be easy because the main application is a Grails application and it's got GORM littered throughout.
Have the batch service hit APIs on the main application for the creates and updates, and possibly the more complex validation scenarios. I'm worried about the toll this would take on Tomcat, but calls would go through the load balancer, so they would be distributed.
Have the batch service hit APIs on the main application for validation, then queue create/update requests and let the main application retrieve them. Same as above, but the queue would help reduce HTTP calls. It would also need a queue to report status back to the batch service.
Duplicate some logic by having the batch service do its own validation and inserts/updates, but then fire a "user created" or "user updated" event so modules/features in the main app can deal with the changes.
Embed the batch processing service into the main application
Other details:
The batch service and web application are both clustered
Both are running on AWS, so I have tools like SQS and SNS easily accessible
Java 1.7 applications
Tomcat containers
Main application is Grails
Batch service uses Spring Batch and Quartz at its core
So my question is what are accepted ways to avoid duplication of business logic based on the details above? Can/Should the architecture be changed to better accommodate this?
Another idea to consider is what this would look like in a "microservices" architecture. That word has been tossed around a number of times in the office, and we have been considering the idea of breaking up the main web application into services. So, for example, we may end up with a service for user management.
Say for example you are using a Java EE 6 application.
Your CSV batch updater could be nothing more than a timer that every once in a while reads a CSV file dumped in a folder and, for each user update encoded in that file, pumps a message to a queue describing the update you want to do.
Somewhere else, you have a message-driven bean that reacts to the update request message and triggers the update business logic for the user referenced in the JMS message (a sketch follows below).
After the transaction is committed successfully, if you have ten different applications that are interested in knowing that the user was updated, you could post a message to, for example, a notification topic with - say - messageType='userUpdated'.
Each of your 10 applications that cares about this could be a consumer on this topic.
They would be informed that a user was updated and maybe internally publish a local event (e.g. a CDI event, a Guava event, whatever), and the internal stakeholders would know of it.
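A hedged sketch of the message-driven bean step described above (the queue JNDI name and the payload format are invented for the example; in a Java EE 6 container the activation config properties may vary slightly by server):

    import javax.ejb.ActivationConfigProperty;
    import javax.ejb.MessageDriven;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    @MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
        @ActivationConfigProperty(propertyName = "destination", propertyValue = "jms/userUpdateQueue")
    })
    public class UserUpdateListener implements MessageListener {

        @Override
        public void onMessage(Message message) {
            try {
                String userUpdate = ((TextMessage) message).getText();
                // Invoke the shared business logic for this user update here, inside the
                // container-managed transaction. After a successful commit, publish a
                // 'userUpdated' message to a notification topic for interested applications.
            } catch (JMSException e) {
                throw new RuntimeException(e);
            }
        }
    }

Because the business logic is invoked from the listener rather than being duplicated, both the web application and the batch path converge on the same code.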
Normally, there are replacements for these Java EE facilities in every technology stack.
Every decent technology stack offers ways to promote loose coupling between UI and business logic, precisely so that HTML/web is just viewed as one of many entry points to an application's business logic.
In Scala, there is the Akka framework, which looks super interesting.
The point is, as long as your business logic is not written in some place that only the web application can tap into, you're fine. Otherwise, you've made the design decision to couple your business logic with the UI.
In your case, I would suggest making a separation by concern; I mean a plugin that gathers only the domain classes (if using Grails), and other plugins that take care of services... These would represent the application core. I think it's much easier this way if your application contains too many KLOC; moving to microservices will take too much time if you have a lot of calls between modules.
Communication between functional modules (aka plugins) can be done via events; see the events-si or RabbitMQ plugin.

Is there a java monitoring/alerts framework for a cluster?

I have a cluster of servers. Common tasks I manually code are:
collect various stats (failures, successes, times) with the Metrics library.
aggregate and combine those across the cluster.
depending on conditions, check the aggregated stats across the cluster and, based on that, send alerts (instead of having each server send an alert, increase global metrics which are then polled into Graphite).
if a specific node sends an alert, it is first accumulated, and based on alerts from other nodes (again a cross-cluster scenario) I would then decide which alert to send (so if I have 100 servers, not each of them sends a separate alert, but a single one is sent).
I looked into a few frameworks, but none of them that I can see achieves this: Metrics, JavaMelody, Netflix Servo, Netflix Zuul.
None of them supports, for example, my cross-cluster scenario where I want to aggregate stats over time and send an alert only if certain conditions apply (as a way to avoid duplicating alerts across servers). Do I need to build my own framework for that, or does something already exist?
(And in case my use case sounds so specific that I should just code it: I have many more similar use cases, which makes me wonder why there isn't such a framework. Before I start coding something, I don't want to find out that I have just duplicated some other framework.)
Have you looked at using a combination of either Graphite or OpenTSDB with Riemann? You can aggregate your information in Graphite (with or without statsd) or dump everything into OpenTSDB and use Riemann for event processing. Riemann's config is in Clojure, but I believe you can use client libraries in multiple languages (unless you want to do the event processing yourself using Esper/Siddhi). Another option could be to look at Rocksteady (which uses Graphite/Esper). Graphite is a Python/Django application (there are multiple forks of statsd - not just the one in NodeJS - and besides, you can simply use Metrics in place of that). OpenTSDB is Java on HBase (if you're looking to store time-series information). For event processing, you could also choose to look into Storm (and use Esper/Siddhi as a bolt in Storm).
