Using Spring Batch to horizontally scale external process execution - java

One of the steps in our job involves running an external process (R in this case) to do some processing on large files in the file system. The external process then outputs files which are fed back into the Spring Batch system.
The external process can take several minutes for each task to complete. We would effectively launch the external process for every file to be processed, so there could easily be on the order of dozens or hundreds of these executions during the life of the overall job. We would like to scale this execution horizontally (and vertically).
Using Spring Batch, would either Remote Chunking or Remote Partitioning be a viable solution for this step? The system really just needs to say "For each of these input files, launch an R script to process it", so there really is not any item or chunk-oriented processing involved.
Remote Chunking/Partitioning has proven difficult to implement sensibly for this without seeming like overkill. I have thought about instead making this task run "out of band". For example, in the Processor, I would put each "external task" on a JMS queue, let something pull it off the queue and process it, and wait for a response that it has finished. This seems like it would be a lot easier than using Remote Chunking/Partitioning.
Other alternative solutions besides Spring Batch are welcome too, but I would like to focus on integrating this solution with Spring Batch for now.

What you are describing is exactly what partitioning does. Even your "out of band" option still falls into what partitioning does.
Let's walk through what I would expect the job to look like.
Job and Master Step
The job, as you noted, is a single step job. What I would envision is that the single step is a partitioned step. With a partitioned step, the two main pieces you need to configure are the Partitioner (the component that knows how to divide the work up) and the PartitionHandler (the component that knows how to send the work to the workers). For the Partitioner, I'd expect using the MultiResourcePartitioner would work. This Partitioner implementation provided by Spring Batch creates one partition per file as defined by its configuration.
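As a rough sketch (Java config inside your batch @Configuration class; the file pattern and bean name are assumptions), the Partitioner piece could look something like this:

@Bean
public MultiResourcePartitioner filePartitioner() throws IOException {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    // One partition per matching file; each partition's ExecutionContext holds the
    // file URL under the key "fileName" by default.
    partitioner.setResources(
            new PathMatchingResourcePatternResolver().getResources("file:/data/input/*.csv"));
    return partitioner;
}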
The PartitionHandler is where you choose if you're going to be executing the slaves locally (via the TaskExecutorPartitionHandler) or remotely (via the MessageChannelPartitionHandler). The PartitionHandler is also responsible for aggregating the results of the executing slaves into a single status so the step's result can be evaluated.
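For the local case, a rough sketch of the TaskExecutorPartitionHandler plus the partitioned master step (bean names and grid size are assumptions):

@Bean
public PartitionHandler partitionHandler(Step workerStep) {
    TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
    handler.setStep(workerStep);                             // the slave step to run for each partition
    handler.setTaskExecutor(new SimpleAsyncTaskExecutor());  // run partitions on parallel threads
    handler.setGridSize(4);                                  // partition-count hint (MultiResourcePartitioner ignores it and uses one per file)
    return handler;
}

@Bean
public Step masterStep(StepBuilderFactory stepBuilderFactory,
                       MultiResourcePartitioner filePartitioner,
                       PartitionHandler partitionHandler) {
    return stepBuilderFactory.get("masterStep")
            .partitioner("workerStep", filePartitioner)      // divide the work up per file
            .partitionHandler(partitionHandler)              // hand each partition to a worker
            .build();
}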
Slave Step
For the slave step, there are two pieces. The first is the configuration of the step itself. This is no different than if you were running the step inline. In this example, I'd expect you to use the SystemCommandTasklet to run your R process (unless you're running it on the JVM).
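A rough sketch of that slave step with the SystemCommandTasklet (the Rscript command and timeout are assumptions; fileName comes from the partition's ExecutionContext created by the Partitioner above):

@Bean
@StepScope
public SystemCommandTasklet rTasklet(
        @Value("#{stepExecutionContext['fileName']}") String fileName) {
    SystemCommandTasklet tasklet = new SystemCommandTasklet();
    tasklet.setCommand("Rscript /opt/scripts/process.R " + fileName);  // assumed R invocation
    tasklet.setTimeout(30 * 60 * 1000);  // fail the step if R runs longer than 30 minutes
    return tasklet;
}

@Bean
public Step workerStep(StepBuilderFactory stepBuilderFactory, SystemCommandTasklet rTasklet) {
    return stepBuilderFactory.get("workerStep")
            .tasklet(rTasklet)
            .build();
}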
How the step is launched depends on remote vs local partitioning but is also straightforward.
For the record, I did a talk a while back demonstrating remote partitioning that's available on YouTube here: https://www.youtube.com/watch?v=CYTj5YT7CZU The code for that demo is also available on Github here: https://github.com/mminella/Spring-Batch-Talk-2.0

Related

Java: Parallel stream pipeline -> iterator() with "prepare X elements"

I would like to use the Java Streams API to create a pipeline and then terminate it with iterator(), but I want it to "prepare X elements" asynchronously in advance. (Streams API not required, but preferred).
The situation I have is:
Our application loads data from remote files (over network) to a database
Opening a remote file (i.e. executing the pipeline synchronously for a single element) takes a non-trivial amount of time
I cannot open too many files at once (results in connection timeout)
The order in which the files are loaded matters
The Java Streams API can create the pipeline of operations, but to my knowledge it cannot satisfy both requirements above. It can do either single execution:
files.stream().map(this::convertToInputStream).iterator()
which exacerbates the first requirement. Or it can do wholesale execution:
files.stream().map(this::convertToInputStream).collect(toList())
which fails the second.
I have implemented the requirements using a Deque<InputStream> and Thread logic to keep it populated up to a certain element count, but it is not pretty. I am asking if anyone knows of a way to create pipelines like so (perhaps using libraries) in a more elegant fashion. It seemed like a reasonable use case for the Streams API.
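For what it's worth, a rough sketch of that bounded-prefetch idea using plain java.util.concurrent rather than the Streams API (the prefetch size and the open function are whatever your convertToInputStream does):

import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PrefetchingIterator<T> implements Iterator<InputStream> {
    private final ExecutorService pool;                        // callers should shut this down when done
    private final Queue<Future<InputStream>> pending = new ArrayDeque<>();
    private final Iterator<T> files;
    private final Function<T, InputStream> open;

    PrefetchingIterator(List<T> files, Function<T, InputStream> open, int prefetch) {
        this.files = files.iterator();
        this.open = open;                                      // e.g. this::convertToInputStream
        this.pool = Executors.newFixedThreadPool(prefetch);
        for (int i = 0; i < prefetch && this.files.hasNext(); i++) {
            submitNext();                                      // start with the window full
        }
    }

    private void submitNext() {
        T file = files.next();
        pending.add(pool.submit(() -> open.apply(file)));      // open ahead of time, off the caller's thread
    }

    @Override public boolean hasNext() {
        return !pending.isEmpty();
    }

    @Override public InputStream next() {
        try {
            Future<InputStream> head = pending.remove();       // FIFO keeps the original order
            if (files.hasNext()) submitNext();                 // top the window back up
            return head.get();                                 // blocks only if this one isn't open yet
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}

Usage would be something like new PrefetchingIterator<>(files, this::convertToInputStream, 4), which keeps at most four files open or opening at any time while preserving order.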

Batchlet vs EJB Timer

I built an application to do some testing on network nodes, like ping tests, retrieving disk space and so on.
I use a scheduled batchlet to run the actions, but I wonder if this is the right use of a batchlet?
Would an EJB timer be more appropriate? Also, when I run a batchlet, my GlassFish server keeps a log of the batch job, and I don't necessarily need it (especially with the number of batch jobs generated during a day).
If I need to run several jobs at the same scheduled time, I think a batchlet can do it, but can an EJB timer too?
Could you give me your input on the right way to achieve this?
Thanks,
Ersch
This isn't a question with a clear answer, but there is a bit of a cost in factoring your application as a batch job, and I would look at what I'm getting to see if it's worth doing so.
So you're thinking about a job consisting of a single batchlet step. Well, there'd be nothing gained from the "restart" functions, neither restarting at the failing step within a job nor leveraging checkpoints within a chunk step. The batchlet programming model is quite simple... and even if you really like @BatchProperty, you'd now have to deal with job XML to use it.
This only starts to get more interesting if you want to start, view, and manage these executions along with the rest of your batch jobs. This might be because you're working with an implementation that offers some kind of implementation-specific add-on function. An example of this could be an integration with external scheduler software, allowing jobs to be scheduled by it. At the other extreme, if you found value in having a persisted record of all your batch job executions in one place (the job repository, usually a persistent DB), then that could also make this worthwhile for you.
But if you don't care for any of that, then an EJB timer could be the way to go instead.
Using an EJB timer is appropriate when your task executes in an eye blink (or thereabouts).
Otherwise use the batching mechanism.
Long running tasks executed from EJB timers can be problematic because they execute in transactions, which normally time out after a short period of time. Increasing this transaction timeout also increases the chances of database and perhaps other resource locks, which can impact normal operation of your application.
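If the timer route fits, a minimal sketch (NetworkCheckService is a made-up placeholder for the ping/disk-space work) that also sidesteps the transaction-timeout concern:

import javax.ejb.Schedule;
import javax.ejb.Singleton;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;
import javax.inject.Inject;

@Singleton
public class NodeCheckTimer {

    @Inject
    private NetworkCheckService checks;   // hypothetical bean doing the ping/disk-space tests

    // Runs every day at 01:00. persistent = false means no stored timer record, much like
    // not wanting the batch job history the question mentions.
    @Schedule(hour = "1", minute = "0", persistent = false)
    @TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED)  // keep the long-running work out of a transaction
    public void runChecks() {
        checks.runAll();   // assumed method on the placeholder service
    }
}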

Scheduled job in a multi node environment

I am working on a scheduled job that will run at certain interval (eg. once a day at 1pm), scheduled through Cron. I am working with Java and Spring.
Writing the scheduled job is easy enough - it does: grab a list of people with certain criteria from the db, and for each person do some calculation and trigger a message.
I am working on a single-node environment locally and in testing, however when we go to production it will be a multi-node environment (with a load balancer, etc). My concern is how would a multi-node environment affect the scheduled job?
My guess is I could (or very likely would) end up triggering duplicate messages.
Machine 1: Grab list of people, do calculation
Machine 2: Grab list of people, do calculation
Machine 1: Trigger message
Machine 2: Trigger message
Is my guess correct?
What would be the recommended solution to avoid the above issue? Do I need to create a master/slave distributed system solution to manage multi node environment?
If you have something like three Tomcat instances load balanced behind Apache, for example, and your application runs on each, then you will have three different triggers and your job will run three times. I don't think you will have a multi-node environment with distributed job execution unless some kind of mechanism for distributing the parts of the job is in place.
If you haven't looked at this project yet, take a peek at Spring XD. It handles Spring Batch Jobs and can be run in distributed mode.
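Separately from Spring XD, if the work itself is a Spring Batch job and all nodes share one job repository database, the repository will normally refuse a second concurrent launch of the same job instance. A rough sketch (job and bean names are assumptions), where every node fires the trigger but only one launch goes through:

import java.time.LocalDate;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecutionException;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.scheduling.annotation.Scheduled;

public class DailyMessageScheduler {

    private final JobLauncher jobLauncher;
    private final Job dailyMessageJob;   // assumed job that grabs the people, calculates, and triggers messages

    public DailyMessageScheduler(JobLauncher jobLauncher, Job dailyMessageJob) {
        this.jobLauncher = jobLauncher;
        this.dailyMessageJob = dailyMessageJob;
    }

    @Scheduled(cron = "0 0 13 * * *")   // 1pm every day, on every node
    public void launch() {
        JobParameters params = new JobParametersBuilder()
                .addString("runDate", LocalDate.now().toString())   // same identifying value on every node
                .toJobParameters();
        try {
            jobLauncher.run(dailyMessageJob, params);
        } catch (JobExecutionException e) {
            // another node already launched (or completed) today's instance, so this node skips
        }
    }
}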

Spring Batch: how it works in a distributed environment

I am planning to use Spring Batch in a distributed environment to do some batch processing tasks.
Now when I say distributed env, I mean I have a set of boxes fronted by a web service. A load balancer then distributes the jobs to the boxes.
Now I have a few questions:
1) What happens if a job is terminated halfway through (say the box got restarted)? Will Spring Batch automatically restart the job? Or do I need to write my own custom watcher and then call the Spring Batch API to restart the job?
2) If Spring Batch has this kind of auto restart, can two boxes pick up and execute the same job at once?
Is this the case?
Spring Batch has four strategies to handle scalability, see here for further details:
Multi-threaded Step (single process)
Parallel Steps (single process)
Remote Chunking of Step (multi process)
Partitioning a Step (single or multi process)
Yours is a multi-process scenario, so you can choose between remote chunking and partitioning of the step, depending on the cost of the read part compared to the process/write.
But in both cases there cannot be two instances that do duplicate work; it's all designed to avoid that. That could only happen if you accidentally deployed one of the two single-process mechanisms on different machines, which would cause the problem you mention.
Restart logic is also provided; see the Restartability section here for further details.
Upon restart, the job will go on reading, processing and writing the next chunk of data. If the reader/processor/writer are configured/written taking into account that the task is chunked, it will all work out of the box.
Usually this involves the write part marking the read items in that chunk as 'processed'.
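A rough sketch of that marking idea (a hypothetical ITEM table with a PROCESSED flag, builder-style Java config):

@Bean
public JdbcCursorItemReader<Long> unprocessedItemReader(DataSource dataSource) {
    return new JdbcCursorItemReaderBuilder<Long>()
            .name("unprocessedItemReader")
            .dataSource(dataSource)
            // only rows not yet marked are picked up, so a restarted job simply
            // continues with whatever is left over
            .sql("SELECT ID FROM ITEM WHERE PROCESSED = 0 ORDER BY ID")
            .rowMapper((rs, rowNum) -> rs.getLong("ID"))
            .build();
}

@Bean
public ItemWriter<Long> markProcessedWriter(JdbcTemplate jdbcTemplate) {
    return items -> {
        for (Long id : items) {
            // the real output happens alongside this; flipping the flag inside the chunk's
            // transaction means it rolls back together with the rest of the chunk on failure
            jdbcTemplate.update("UPDATE ITEM SET PROCESSED = 1 WHERE ID = ?", id);
        }
    };
}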

java load balancer and Queue management (like PBS)

We have a simple java app which runs user requests (commands) on the command line and outputs results. So, as an example, say the user submits a job along the lines of:
cmd.exe ping -10 google.com
Typically, we have around 100 users submitting hundreds of requests to be executed. Currently, the java app simply queues the jobs and runs them sequentially, without being "democratic" or "greedy" about it.
What we would like to do is have the ability to prioritize jobs, run jobs in a more "even" fashion (say 2 users have submitted 100 jobs each, we would like to run 1 job per user and switch back and forth).
To this end, I was wondering if there are any opensource tools such as PBS:
http://code.google.com/p/pbs4java/
which could integrate with java. A search on google does not reveal a lot - any comments or suggestions would be much appreciated.
UPDATE: The most important criteria I am after are:
[1] Should be opensource
[2] Should be able to integrate with Java.
I suggest taking a look at Akka's dispatchers and priority features. You would manage how the priorities get assigned to tasks; in your case, as users submit more tasks, their tasks could degrade to lower priorities over time. The metrics should be defined by your problem requirements and injected into the library.
As a side note, Akka uses HawtDispatch ideas, so you might also take a look at that library if it suits you.
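And if you end up rolling it yourself rather than pulling in a library, a rough sketch of the per-user round-robin idea in plain java.util.concurrent (pool size and class names are made up):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class FairJobRunner {
    private final Map<String, Deque<Runnable>> perUser = new HashMap<>();
    private final Deque<String> rotation = new ArrayDeque<>();   // users with pending work, in turn order
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    synchronized void submit(String user, Runnable job) {
        Deque<Runnable> queue = perUser.computeIfAbsent(user, u -> new ArrayDeque<>());
        if (queue.isEmpty()) {
            rotation.addLast(user);    // user joins the rotation when it goes from idle to having work
        }
        queue.addLast(job);
        notifyAll();
    }

    private synchronized Runnable nextJob() throws InterruptedException {
        while (rotation.isEmpty()) {
            wait();
        }
        String user = rotation.pollFirst();
        Deque<Runnable> queue = perUser.get(user);
        Runnable job = queue.pollFirst();
        if (!queue.isEmpty()) {
            rotation.addLast(user);    // still has work, so it goes to the back of the line
        }
        return job;
    }

    void start() {
        for (int i = 0; i < 4; i++) {  // four jobs can run at once
            workers.submit(() -> {
                try {
                    while (true) {
                        nextJob().run();   // run() would wrap the ProcessBuilder call for the command
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }
}

submit(user, job) is called as requests come in; with two users each submitting 100 jobs, the workers alternate between them one job at a time.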
