I am a newbie to Spring Boot and microservice development, and I have a question about designing a Spring Boot service.
Requirement -
We have a requirement where a Spring Boot service needs to listen to 3 or 4 different Kafka topics individually and create 3 or 4 CSV files respectively (after filtering some of the attributes from each event message), then upload the files to an FTP server at different times of day.
Design and Inputs Needed
I am thinking of a solution like the one below:
#1 A Kafka consumer that reads from one Kafka topic, applies the filters, and keeps appending to a file throughout the day. Once a file reaches 100 MB, it gets rotated: A1.csv, A2.csv, etc. (a minimal consumer sketch follows this list).
#2 A job manager that creates cron jobs, which stitch the files together once a day and upload the result to FTP.
https://spring.io/guides/gs/scheduling-tasks/
#3 Jobs should be created from configuration, so that if we want to add a new job tomorrow, it can be done quickly.
#4 How do I design this so that it scales? The incoming events will be huge in number.
#5 Would it be recommended to use an elastic cache instead of creating multiple files and then stitching them together into a single file?
#6 I also want fail-safe logic, so that if the service fails, I can continue from where I left off.
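Here is a minimal sketch of #1, assuming Spring for Apache Kafka is on the classpath; the topic name, group id, output directory, and the filter logic are all placeholders:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class TopicACsvWriter {

    private static final long MAX_BYTES = 100L * 1024 * 1024; // rotate at 100 MB

    private final Path dir = Paths.get("/data/topic-a"); // placeholder output directory
    private int fileIndex = 1;

    @KafkaListener(topics = "topic-a", groupId = "csv-writer")
    public synchronized void onMessage(String message) throws IOException {
        String csvLine = filter(message); // keep only the attributes the CSV needs
        Files.createDirectories(dir);
        Path current = dir.resolve("A" + fileIndex + ".csv");
        Files.write(current, (csvLine + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        if (Files.size(current) >= MAX_BYTES) {
            fileIndex++; // the next write opens A2.csv, A3.csv, ...
        }
    }

    private String filter(String rawEvent) {
        // hypothetical: parse the event and drop the unwanted attributes
        return rawEvent;
    }
}
```

One such listener per topic keeps the consumers independent, so a backlog on one topic does not block the others.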
Please point me to any existing solutions I can refer to, and any APIs that can help with job/batch scheduling and with managing configurations.
Regards,
Dan
You can have checkpointing in Kafka via consumer offsets, so if the application fails, you can resume consuming events from the same point. The checkpoint only advances once your job is done.
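A minimal sketch of that pattern with the plain Kafka consumer API, with auto-commit disabled so the offset only advances after processing succeeds (topic and group names are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CheckpointedConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "csv-writer");
        props.put("enable.auto.commit", "false"); // we commit only after the work is done
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("topic-a"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // filtering + file write
                }
                consumer.commitSync(); // the "checkpoint" moves only after success
            }
        }
    }

    private static void process(String value) { /* application logic */ }
}
```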
At this scale, it is good to have Spark jobs or similar to speed up the computation.
You can have a distributed cache cluster like Redis or Aerospike to cache some of the files for fast reads.
It's good to make the job trigger time configurable so it can be changed later.
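For example, a sketch with Spring's scheduler, assuming @EnableScheduling is present on a configuration class; upload.cron is a hypothetical property key:

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class FtpUploadJob {

    // e.g. upload.cron=0 0 2 * * * in application.properties (02:00 daily);
    // changing the trigger time is then a config change, not a code change
    @Scheduled(cron = "${upload.cron}")
    public void stitchAndUpload() {
        // stitch the day's rotated files and push the result to the FTP server
    }
}
```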
Let me know what else you would need to know.
Related
I need help deciding what frameworks I can use in this scenario. I'm exploring ZooKeeper, but I'm not completely sure how to design a solution for this use case.
Background:
Say there is an application that connects to a streaming source (Kafka, ActiveMQ, etc.) and writes the messages it processes from the stream to a file.
This application is deployed as 4 instances. Each instance writes the messages it processed in the last hour to its own file, e.g. filename servername_8.00 for messages processed from 8 to 9.
The requirement is to transfer all the files created in the last hour, but only if every instance created a file in that window, and also to send a single consolidated file listing all 4 file names and their record counts.
What I'm looking for:
1. How do I make sure each application instance knows whether the other instances also created files, so that they transmit only when every instance has created one?
2. Whichever instance sends the files, the consolidated file should record what was transmitted.
What frameworks can I use to solve this?
You can definitely use ZooKeeper for this. I would use Apache Curator as well (note: I'm the main author of Curator).
Do all the instances share a file server? i.e. can each instance see all of the created files? If so, you can nominate a leader using ZooKeeper/Curator and only the leader does all of the work. You can see sample leader election code here: https://github.com/apache/curator/tree/master/curator-examples/src/main/java/leader
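For illustration, a minimal leader-election sketch with Curator (the connection string and ZooKeeper path are placeholders):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class FileTransferLeader {

    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory
                .newClient("zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderSelector selector = new LeaderSelector(client, "/app/transfer-leader",
                new LeaderSelectorListenerAdapter() {
                    @Override
                    public void takeLeadership(CuratorFramework client) throws Exception {
                        // only the elected leader runs this; returning relinquishes leadership
                        transferFilesAndWriteConsolidatedList();
                    }
                });
        selector.autoRequeue(); // re-join the election after giving up leadership
        selector.start();

        Thread.currentThread().join(); // keep the process alive for the demo
    }

    static void transferFilesAndWriteConsolidatedList() { /* the hourly work */ }
}
```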
If the instances do not share a file server, you could still use ZooKeeper to coordinate the writing of the shared file. You'd again nominate a leader who exposes an API of some kind that all the instances can write to and the leader creates the shared file.
You also might find the Curator barrier recipes useful: http://curator.apache.org/curator-recipes/double-barrier.html and http://curator.apache.org/curator-recipes/barrier.html
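As a sketch, the double barrier maps naturally onto "transmit only when all 4 instances have their file ready" (the path and member count are placeholders):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.barriers.DistributedDoubleBarrier;

public class HourlyTransferBarrier {

    public void transferWhenAllReady(CuratorFramework client) throws Exception {
        DistributedDoubleBarrier barrier =
                new DistributedDoubleBarrier(client, "/app/hourly-barrier", 4);
        barrier.enter(); // blocks until all 4 instances have written their file
        // ... transmit this instance's file ...
        barrier.leave(); // blocks until all 4 instances are done transmitting
    }
}
```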
You'd have to give a lot more detail on your use case if you want a more detailed design.
We have a microservice written using Spring Boot which has its own NoSQL datastore. We are working on functionality whereby we want to delete some old data (on the order of 0.5 million documents) on a regular basis (once a day), based on the presence of records of a particular type in the data store.
Is a scheduler which runs once every day and does the deletion the correct approach for this? Also, since it's a microservice and several instances of it will be running, how do we ensure that the scheduler runs on only one instance?
There are multiple options I can think of now:
If there is a single instance of the microservice deployed, you can use something like Quartz to time the job.
Create a RESTful API for cleanup and invoke it using a script; see https://stackoverflow.com/a/15090893/2817980 for an example. This will make sure that only one instance of the service works on cleanup.
If there is a master-slave replica, have the master allocate the job to only one instance.
Create a scheduled job using something like Quartz, and then check whether the job has already been taken up by some other scheduler in ZooKeeper/Redis/a DB or any other storage (a sketch of this option follows).
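A sketch of that last option using a Redis lock via Spring Data Redis (the key name, TTL, and cron expression are arbitrary choices; @EnableScheduling is assumed, and the Duration overload needs Spring Data Redis 2.1+):

```java
import java.time.Duration;

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class CleanupJob {

    private final StringRedisTemplate redis;

    public CleanupJob(StringRedisTemplate redis) {
        this.redis = redis;
    }

    @Scheduled(cron = "0 0 3 * * *") // 03:00 daily, fired on every instance
    public void cleanup() {
        // SET key value NX EX — succeeds on exactly one instance
        Boolean acquired = redis.opsForValue()
                .setIfAbsent("cleanup-lock", "owner", Duration.ofHours(1));
        if (Boolean.TRUE.equals(acquired)) {
            deleteOldDocuments(); // the other instances simply skip this run
        }
    }

    private void deleteOldDocuments() { /* batch-delete the old records */ }
}
```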
I can discuss more on this.
Sorry if the title is confusing; let me explain my question.
Our team needs to develop a web service which is supposed to run on several nodes (web farm, horizontal scaling). We know how to implement this "manually", but we're pretty excited about Spring Integration, which is new to us, so we're really trying to understand whether it is a good fit for our scenario; if so, we'll try to make use of it.
Typical scenario:
Several servers ("nodes") running the same web application (let's call it "OurWebService")
We need to pull files from external systems ("InboundExtSystems")
Process this data with the help of other external systems, which involves local resource-consuming operations ("UtilityExtServices")
Submit processing results to another set of external systems ("OutboundExtSystems")
Non-functional requirements:
For performance reasons we cannot query UtilityExtServices on demand, and the local processing is also CPU-intensive. So we need a queue, in order to control the pace at which we perform requests and process results.
We expect several nodes to pull tasks from this queue equally and process them.
We need to make sure that every task pulled from InboundExtSystems and queued will be handled; we need to guarantee that none of them disappears.
We need to make sure timeouts are handled as well. If a task's processing times out, we need to requeue the task (and make sure the previous handler will not submit results for it).
We need to be able to perform rolling updates. Say 5 nodes are processing the queue; we want to be able to sequentially stop, upgrade, and restart each node without noticeably impacting system performance.
So the question is: is Spring Integration a good fit for such a case?
If the answer is "yes", could you kindly name the primary components we should use?
P.S. Sure enough, we would probably also need to pick a message bus and a queue accessible by every node (maybe Redis, Hazelcast, or RabbitMQ; not sure what is most appropriate).
Yes, it's a good fit. I would suggest RabbitMQ for the transport/queuing and the Spring Integration AMQP endpoints.
Rolling updates shouldn't be an issue unless you change the format of the messages sent between nodes. Even then you could handle it relatively easily by moving to a new set of queues.
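To make that concrete, here is a rough sketch of an AMQP-backed Spring Integration flow using the Java DSL; the queue name and the handler body are placeholders, not a prescribed design:

```java
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.amqp.dsl.Amqp;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Configuration
public class TaskFlowConfig {

    @Bean
    public IntegrationFlow taskFlow(ConnectionFactory connectionFactory) {
        return IntegrationFlows
                .from(Amqp.inboundAdapter(connectionFactory, "tasks")) // every node consumes
                .handle(message -> process(message.getPayload()))      // CPU-intensive work
                .get();
    }

    private void process(Object payload) {
        // call UtilityExtServices, then submit results to OutboundExtSystems
    }
}
```

Because every node runs the same inbound adapter against the same queue, RabbitMQ's competing-consumers semantics give you the equal task-pulling you describe, and let you stop one node for an upgrade while the others keep draining the queue.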
I'm trying to determine all the things I need to consider when deploying jobs to a clustered environment.
I'm not concerned about parallel processing or other scaling things at the moment; I'm more interested in how I make everything act as if it was running on a single server.
So far I've determined that triggering a job should be done via messaging.
The thing that's throwing me for a loop right now is how to utilize something like the Spring Batch Admin UI (even if it's a hand-rolled solution) in a clustered deployment. Getting the job information from a JobExplorer seems like one of the keys.
Is Will Schipp's spring-batch-cluster project the answer, or is there a more agreed upon community answer?
Or do I not even need to worry because the JobRepository will be pulling from a shared database?
Or do I need to publish job execution info to a message queue to update the separate Job Repositories?
Are there other things I should be concerned about, like the jobIncrementers?
BTW, if it wasn't clear that I'm a total noob to Spring batch, let it now be known :-)
Spring XD (http://projects.spring.io/spring-xd/) provides a distributed runtime for deploying clusters of containers for batch jobs. It manages the job repository and provides ways to deploy, start, restart, etc., the jobs on the cluster. It addresses fault tolerance (if a node goes down, the job is redeployed, for example) as well as many other features needed to maintain a clustered Spring Batch environment.
I'm adding the answer that I think we're going to roll with unless someone comments on why it's dumb.
If Spring Batch is configured to use a shared database for all the DAOs that the JobExplorer will use, then running in a cluster isn't much of a concern.
We plan on using Quartz jobs to create JobRequest messages which will be put on a queue. The first server to get the message will actually kick off the Spring Batch job (see the sketch below).
Monitoring running jobs will not be an issue because the JobExplorer gets all of its information from the database, and it doesn't look like it caches information, so we won't run into cluster issues there either.
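A sketch of that messaging piece, assuming RabbitMQ as the queue; the queue name, job bean, and parameter are illustrative:

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.stereotype.Component;

@Component
public class JobRequestListener {

    private final JobLauncher jobLauncher;
    private final Job importJob;

    public JobRequestListener(JobLauncher jobLauncher, Job importJob) {
        this.jobLauncher = jobLauncher;
        this.importJob = importJob;
    }

    // competing consumers: only one node in the cluster receives each request
    @RabbitListener(queues = "job-requests")
    public void onJobRequest(String requestId) throws Exception {
        jobLauncher.run(importJob, new JobParametersBuilder()
                .addString("requestId", requestId) // makes each execution unique
                .toJobParameters());
    }
}
```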
So to directly answer the questions...
Is Will Schipp's spring-batch-cluster project the answer, or is there a more agreed upon community answer?
There is some cool stuff in there, but it seems like overkill when just getting started. I'm not sure there is a community-agreed-upon answer.
Or do I not even need to worry because the JobRepository will be pulling from a shared database?
This seems correct. If using a shared database, all of the nodes in the cluster can read and write all the job information. You just need a way to ensure a timer job isn't triggered more than once; Quartz already has a cluster solution (sketched below).
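For reference, a sketch of enabling Quartz's clustering in code, assuming a JDBC job store backed by the shared database (the data source wiring and driver delegate are omitted):

```java
import java.util.Properties;

import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class ClusteredScheduler {

    public static Scheduler create() throws Exception {
        Properties props = new Properties();
        props.put("org.quartz.scheduler.instanceId", "AUTO");
        props.put("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
        props.put("org.quartz.jobStore.isClustered", "true"); // one node fires each trigger
        props.put("org.quartz.jobStore.dataSource", "quartzDS");
        // plus the org.quartz.dataSource.quartzDS.* connection properties

        Scheduler scheduler = new StdSchedulerFactory(props).getScheduler();
        scheduler.start();
        return scheduler;
    }
}
```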
Or do I need to publish job execution info to a message queue to update the separate Job Repositories?
Again, this shouldn't be needed because the execution info is written to the database.
Are there other things I should be concerned about, like the jobIncrementers?
It doesn't seem like this is a concern. The JDBC DAO implementations use a database sequence to increment values.
We have existing Spring Batch Application, that we want to make scalable to run on multiple nodes.
The scalability docs for Spring Batch involve code changes and configuration changes.
I am just wondering if this can be achieved with just configuration changes (adding new classes and wiring them in configuration is fine; I just want to avoid code changes to existing classes).
Thanks a lot for the help in advance.
It really depends on your situation. Specifically, why do you want to run on multiple nodes? What is the bottleneck you're attempting to overcome? The two typical scenarios that Spring Batch handles out of the box for scaling across multiple nodes are remote chunking and remote partitioning. Both are master/slave configurations, but each has a different use case.
Remote chunking is used when the processor in a step is the bottleneck. In this case, the master node reads the input and sends it via a Spring Integration channel to remote nodes for processing. Once the item has been processed, the result is returned to the master for writing. In this case, reading and writing are done locally to the master. While this helps parallelize processing, it takes an I/O hit because every item is being sent over the wire (and requires guaranteed delivery, à la JMS for example).
Remote partitioning is the other scenario. In this case, the master generates a description of the input to be processed for each slave and only that description is sent over the wire. For example, if you're processing records in a database, the master may send a range of row ids to each slave (1-100, 101-200, etc). Reading and writing occur local to the slaves and guaranteed delivery is not required (although useful in certain situations).
Both of these options can be done with minimal (or no) new classes depending on your use case. There are a couple of different places to look for information on these capabilities:
Spring Batch Integration Github repository - Spring Batch Integration is the project that supports the above use cases. You can read more about it here: https://github.com/spring-projects/spring-batch-admin/tree/master/spring-batch-integration
My remote partitioning example - This talk walks through remote partitioning and provides a working example to run on CloudFoundry (currently it only works on CF v1, but updates for CF v2 are coming in a couple of days). The configuration is almost the same; only the connection pool for Rabbit is different: https://github.com/mminella/Spring-Batch-Talk-2.0 The video for this presentation can be found on YouTube here: http://www.youtube.com/watch?v=CYTj5YT7CZU
Gunnar Hillert's presentation on Spring Batch and Spring Integration: This was presented at SpringOne2GX 2013 and contains a number of examples: https://github.com/ghillert/spring-batch-integration-sample
In any of these cases, remote chunking should be achievable with zero new classes. Remote partitioning typically requires you to implement one new class (the Partitioner); a minimal sketch follows.
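For instance, a minimal Partitioner that splits a row-id range into per-slave chunks, in the spirit of the 1-100 / 101-200 example above (the key names and range math are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class RowRangePartitioner implements Partitioner {

    private final long minId;
    private final long maxId;

    public RowRangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long range = (maxId - minId) / gridSize + 1;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long start = minId;
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", start);                              // e.g. 1
            context.putLong("maxId", Math.min(start + range - 1, maxId)); // e.g. 100
            partitions.put("partition" + i, context); // each slave reads only its own range
            start += range;
        }
        return partitions;
    }
}
```

Each slave's step then reads its ExecutionContext values (minId/maxId) to scope its reader, so reading and writing stay local to the slave as described above.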