Avoiding concurrency in Spring Batch jobs in a cluster environment

I want to ensure that a Spring job is not started a second time while it still runs. This would be trivial in a single JVM environment.
However, how can I achieve this in a cluster environment? More specifically, I am on JBoss 5.1 (a bit antiquated, I know); if solutions exist for later versions, I'd be interested in those as well.
So, it should be kind of a Singleton pattern across all cluster nodes.
I am considering using database locks or a message queue. Is there a simpler / better performing solution?

You need to synchronize threads that know nothing about each other, so the easiest way is to share some state in a common place. Valid alternatives are:
A shared database
A shared file
An external web service holding the status of the batch process
If you prefer a shared database, consider a store like Redis to improve performance. It is an in-memory database with persistence to disk, so accessing the status of the batch process should be fast enough.

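This is late, but for future lookups: Spring Batch keeps job state in a shared, database-backed JobRepository (JDBC rather than JPA) and refuses to start a job instance that already has a running execution, so you can rely on it to avoid concurrency as long as all nodes point at the same repository.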

You can add a job listener and, in its before-job callback, use JobExecutionDao to find all running JobExecutions. If more than one is running, throw an exception and exit the job.

Related

Synchronize Batch Jobs across multiple Application Instances

I am writing a spring batch application which should only run one Job Instance at a time. This should also be true if multiple application instances are started. Sadly, the jobs can’t be parallelized and are invoked at random.
So, what I am looking for is a spring boot configuration which allows me to synchronize the job execution within one processor as well as in the distributed case. I have already found some approaches like the JobLauncherSynchronizer (https://docs.spring.io/spring-batch-admin/trunk/apidocs/org/springframework/batch/admin/launch/JobLauncherSynchronizer.html) but all the solutions I have found work either only on one processor or protect just a fraction of the job execution.
Is there any spring boot configuration which prevents multiple concurrent executions of the same job, even across multiple concurrently running application instances (which share the same database)?
Thank you in advance.

Is there any spring boot configuration which prevents multiple concurrent executions of the same job, even across multiple concurrently running application instances (which share the same database)?
Not to my knowledge. If you really want global synchronization at the job level (i.e. a single job instance at a time), you need a global synchronizer like the JobLauncherSynchronizer you linked to.

ETL process to transfer data from one Db to another using Apache Spark

I need to create an ETL process that will extract, transform and then load 100+ tables from several instances of SQL Server to as many instances of Oracle, in parallel, on a daily basis. I understand that I can create multiple threads in Java to accomplish this, but if all of them run on the same machine this approach won't scale. Another approach could be to get a bunch of EC2 instances and transfer the tables for each database instance on a different EC2 instance. With this approach, though, I would have to take care of "elasticity" myself by adding and removing machines from my pool.
Somehow I think I can use "Apache Spark on Amazon EMR" to accomplish this, but in the past I've used Spark only to handle data on HDFS/Hive, so not sure if transferring data from one Db to another Db is a good use case for Spark - or - is it?

Starting from your last question:
"Not sure if transferring data from one Db to another Db is a good use case for Spark":
It is, within the limitations of the Spark JDBC connector. These include the lack of support for updates, and the way parallelism works when reading a table (it requires splitting the table by a numeric column).
Considering the IO cost and the overall performance of RDBMS, running the jobs in a FIFO mode does not sound like a good idea. You can submit each one of the jobs with a configuration that requires 1/x of cluster resources so x tables will be processed in parallel.

Threads on Multiple VMs accessing a table on single Instance of DB causing low performance and Exceptions occasionally

Application is hosted on multiple Virtual Machines and DB is on single server. All VMs are pointing to single Instance of DB.
In this architecture, I have a table with very few records. But this table is read and updated very heavily by threads running on the VMs. This causes a performance bottleneck and, sometimes, record-level exceptions. Database-level locking does not seem to be the best option, as it introduces significant delays in request processing.
Please suggest if there is any other technique to solve this problem.

A few questions first:
Is your application using connection pooling? If not, please use it. Creating a JDBC connection is expensive!
Is your application read-heavy or write-heavy?
What kind of storage engine are you using for your MySQL tables, InnoDB or MyISAM? If your application is write-heavy, please use InnoDB tables, as InnoDB uses row-level locking and will serve concurrent requests better.

One special case - if you are implementing queues on top of database tables, find a database that has a built-in queue operation and use that, or use a reliable messaging service. Building queues on top of databases is typically not efficient. See e.g. http://mikehadlow.blogspot.co.uk/2012/04/database-as-queue-anti-pattern.html
In general, running transactions on databases is slow because at the end of each transaction the database needs to be sure that enough has been saved out to disk that if the system died right now the changes made by the transaction would be safely preserved. If you don't need this you might find it faster to write a single non-database application that does what the database does but doesn't write anything out to disk, or still does database IO but does the minimum possible. Then instead of all of the VMs talking to the database directly they would all talk to this application.

Batch jobs - prevent concurrency

I have several batch jobs running on a SAP Java system using SAP Java Scheduler. Unfortunately, I haven't come across any documentation that shows how to prevent concurrent executions of periodic jobs. All I've seen is "a new instance of the job will be executed at the next interval". This ruins my FIFO processing logic, so I need to find a way to prevent it. If the scheduler API had a way of checking for running executions of the same job, this would be solved, but I haven't seen an example yet.
As a general architectural approach, the other ways to do this seem to be a DB table (an indicator table for marking current executions) or a JNDI parameter, either of which would be checked first when the job starts. I could also "attempt" to use a static integer, but that would fail on clustered instances. The system is J2EE 5 compliant and supports EJB 3.0, so a "singleton EJB" is not available either. I could maybe set the max pool size for a bean and achieve a similar result.
I'd like to hear your opinions on how to achieve this goal using different architectures.
Kind Regards,
S. Gökhan Topçu

You can route each job to a fixed node, much like database sharding, and then handle it as if it were running on a single node.
Otherwise you need a central coordinator, such as a database, a memory cache, or ZooKeeper, to prevent the same job from running on different nodes.

Maintaining a single instance over multiple JVM

I am creating a distributed service and I am looking at restricting a set of time-consuming operations to a single thread of execution across all JVMs at any given time. (I will have to deal with 3 JVMs at most.)
My initial investigations point me towards java.util.concurrent.Executors and java.util.concurrent.Semaphore. However, using the singleton pattern with Executors or Semaphore does not guarantee a single thread of execution across multiple JVMs.
I am looking for a core Java API (or at least a pattern) that I can use to accomplish my task.
P.S.: I have access to ActiveMQ within my existing project, which I was planning to use to achieve a single thread of execution across multiple JVMs only if I don't have another choice.

There is no simple solution for this with a core java API. If the 3 JVMs have access to a shared file system you could use it to track state across JVMs.
So basically you do something like create a lock file when you start the expensive operation and delete it at the conclusion. And then have each JVM check for the existence of this lock file before starting the operation. However there are some issues with this approach like what if the JVM dies in the middle of the expensive operation and the file isn't deleted.
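The lock-file idea above could look roughly like the following sketch. It assumes all JVMs see the same directory and that the file system honours the atomic create-if-absent behaviour of `Files.createFile` (some network file systems may not). The class name `LockFileGuard` and the lock path are hypothetical, and the stale-file problem mentioned above is deliberately left unsolved.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical lock-file guard: only the caller that creates the file may run. */
final class LockFileGuard {
    private final Path lockFile; // assumed to live on the shared file system

    LockFileGuard(Path lockFile) {
        this.lockFile = lockFile;
    }

    /** Returns true if this caller created the lock file and now owns it. */
    boolean tryAcquire() throws IOException {
        try {
            // createFile checks for existence and creates the file in one step,
            // so two callers cannot both succeed.
            Files.createFile(lockFile);
            return true;
        } catch (FileAlreadyExistsException alreadyRunning) {
            return false;
        }
    }

    /** Deletes the lock file so that another JVM can start the operation. */
    void release() throws IOException {
        Files.deleteIfExists(lockFile);
    }

    /** Runs the task only if the lock was acquired; returns whether it ran. */
    boolean runExclusively(Runnable task) throws IOException {
        if (!tryAcquire()) {
            return false;
        }
        try {
            task.run();
            return true;
        } finally {
            // Not reached if the JVM is killed, which is how the file goes stale.
            release();
        }
    }
}
```

Each JVM would call something like `new LockFileGuard(sharedDir.resolve("expensive-op.lock")).runExclusively(task)` and skip the work when it returns false. One possible way to soften the stale-file issue is to hold an OS-level lock via `FileChannel.tryLock()` instead, since such locks are released when the owning process exits, though their behaviour on shared file systems varies.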
ZooKeeper is a nice solution for problems like this and for any other cross-process synchronization issue. Check it out if that is a possibility for you. I think it's a much more natural way to solve a problem like this than a JMS queue.
