I have 50,000,000 files that need to be processed using 3-5 different filters configured in workflows
I plan to use a microservice architecture.
My Questions
I want to use Spring Integration and Spring Batch to design and run the workflows. Do you agree, or is there another Java-based system you would recommend?
Can Spring Batch handle long-running (i.e. multi-day) workflows?
Can Spring Batch/Integration load XML files on the fly?
I think Spring Batch is a good fit for this job; my answers are below.
I recommend Spring Batch for this job. It's easy to use and, in combination with Spring Integration, works well for designing the workflows.
Yes, it handles long-running jobs well, but you need to configure it properly.
I'm not sure what you mean by "on the fly" (batch input files or configuration files). For batch input files, yes. For configuration files, it depends on how you load the configuration and how you use the application context.
IMHO Spring Batch can process files based on multiple filters. It can also be easily customized to fit most of your needs and has really fast processing speeds. However, I haven't tried it with anything close to 50,000,000 files, so I can't vouch for that scale.
To run a Spring Batch application as a microservice, take a look at Spring Boot and Spring Cloud Task. Also, look into Spring Cloud Data Flow for orchestration.
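As a starting point, here is a minimal sketch of a Spring Batch job packaged as a Spring Cloud Task, assuming Spring Boot 2.x, Spring Batch 4.x, and an embedded database such as H2 on the classpath for the job repository. The class, job, and step names are illustrative, not from the question:

```java
import java.util.List;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
@EnableTask                 // records each run in the Task repository
@EnableBatchProcessing
public class FileProcessingTaskApplication {

    public static void main(String[] args) {
        SpringApplication.run(FileProcessingTaskApplication.class, args);
    }

    @Bean
    public Job fileProcessingJob(JobBuilderFactory jobs, StepBuilderFactory steps) {
        // placeholder for the 3-5 filters; in practice you could chain them
        // with a CompositeItemProcessor
        ItemProcessor<String, String> filters = item -> item.toUpperCase();

        Step filterStep = steps.get("filterStep")
                .<String, String>chunk(1000)
                // placeholder reader; a MultiResourceItemReader would scan
                // the real input files
                .reader(new ListItemReader<>(List.of("file-1.xml", "file-2.xml")))
                .processor(filters)
                .writer(items -> items.forEach(System.out::println))
                .build();

        return jobs.get("fileProcessingJob").start(filterStep).build();
    }
}
```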
Hi folks, I need some opinions here.
I already have a Spring Boot application holding all my REST APIs, running on the Tomcat that ships with spring-boot-starter-web.
I would like to set up jobs using Spring Batch that will be scheduled via Kubernetes. The idea is to share the same business logic instead of creating a standalone batch project, which would force me to maintain the business logic twice.
Question: scheduling via Kubernetes means I will be firing java -jar someJar --spring.batch.jobNames=xxx in a container. Doing that will also start up all my REST APIs, right? That is unnecessary and a waste of resources. Is there any way to mitigate this, or is my understanding wrong?
The way I would implement this is by extracting the common business logic into a separate module and making both the batch app and the web app depend on that module.
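To avoid starting the REST layer when the batch module runs, the batch entry point can opt out of the embedded web server. A minimal sketch, assuming Spring Boot 2.x (the class name is hypothetical):

```java
import org.springframework.boot.WebApplicationType;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.builder.SpringApplicationBuilder;

// Hypothetical entry point of the batch module; the shared business logic
// lives in the common module both apps depend on.
@SpringBootApplication
public class BatchApplication {

    public static void main(String[] args) {
        new SpringApplicationBuilder(BatchApplication.class)
                .web(WebApplicationType.NONE) // don't start Tomcat or the REST layer
                .run(args);
    }
}
```

Alternatively, setting spring.main.web-application-type=none in the batch module's application properties achieves the same thing.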
We're running a Spring Batch web application for importing CSV files into a database. This application is constantly evolving and being extended with new jobs.
The current update procedure looks like this:
1. Write new Code
2. Build a war file
3. Deploy the newly built war file, replacing the whole web application on the Tomcat server
This can get us into trouble when the running system is in the middle of importing files / writing to the database.
Is there a smart way to upgrade the Spring Batch jobs separately?
I already thought about splitting the project into several separate web applications, but that might mean a lot of overhead, with all the libraries bundled into each war file.
Are there any best practices for building this sort of application?
Thanks for your help!
This packaging model is known to cause a lot of issues like the one you are facing. I recommend packaging your jobs as separate jars and making your application launch those jobs in separate processes. With this model, you can deploy/upgrade jobs without impacting the web application used to launch them.
For the record, Spring Batch Admin suffered from this packaging model (as described here), and the recommended replacement is Spring Cloud Data Flow (which uses the model I described previously).
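For illustration, a sketch of what launching a packaged job jar in its own JVM could look like; the jar path, job name, and log file naming are assumptions to adapt to your setup:

```java
import java.io.File;
import java.io.IOException;

// Hedged sketch: the web app launches a packaged job jar in its own JVM, so
// jobs can be redeployed without touching the web app.
public class JobProcessLauncher {

    public static Process launch(String jobJar, String jobName) throws IOException {
        return new ProcessBuilder(
                "java", "-jar", jobJar,
                // Spring Boot 2 property selecting which job to run
                "--spring.batch.job.names=" + jobName)
                .redirectErrorStream(true)
                .redirectOutput(new File(jobName + ".log"))
                .start();
    }

    public static void main(String[] args) throws IOException {
        launch("jobs/csv-import-job.jar", "csvImportJob");
    }
}
```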
I want to build an application which listens to a queue and performs a series of steps.
Basically the application should listen to Queue1 and:
- Get some data from ServiceA [small amount of data]
- Get some data from ServiceB [small amount of data]
- Update some information in ServiceC [based on the data]
- Create a number of messages [based on the data] on Queue2.
Due to the flow-based nature of this application, I was looking into job-execution systems in Spring. However, all the steps are designed to be idempotent and the data transferred between steps is small, so I did not want a database with this application.
I started exploring Spring Batch and Spring Task for this. Spring Batch provides really good constructs like Tasklets and Steps, but there are a number of comments recommending connecting Spring Batch to a database, and pointing out that it is designed to manage massive amounts of data reliably (I don't need that reliability here, since the queue and the idempotent steps already provide it). While I could pass data between steps using the ExecutionContext, there were recommendations against doing so.
Question:
- Are there simpler starters in the Spring Boot ecosystem that provide a workflow/job-like interface I should use?
- Is this a valid use case for Spring Batch, or is that over-engineering/misuse of its steps?
Thanks a lot for the help,
Ayushman
P.S.: I can provide exact details of the job, but did not want to clutter the question.
I have two projects' worth of experience with Spring Batch. I haven't tried Spring Task.
Having said that, my answer is somewhat biased. Spring Batch is a bit notorious for its configuration overhead. If your application is simple enough, just use "spring-boot-starter-amqp". It will be enough.
If, by any chance, you decide to use Spring Batch (for its Job and Step features, or others), you may want to configure it to use an in-memory database, because you don't need the retry/rollback features it provides.
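For the simple-enough case, a minimal sketch of the flow with plain spring-boot-starter-amqp. The queue names match the question; the ServiceA/B/C client interfaces are hypothetical placeholders:

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Component;

// Hypothetical client interfaces standing in for Services A, B and C.
interface ServiceAClient { String fetch(String key); }
interface ServiceBClient { String fetch(String key); }
interface ServiceCClient { void update(String a, String b); }

@Component
public class Queue1Listener {

    private final ServiceAClient serviceA;
    private final ServiceBClient serviceB;
    private final ServiceCClient serviceC;
    private final RabbitTemplate rabbitTemplate;

    public Queue1Listener(ServiceAClient serviceA, ServiceBClient serviceB,
                          ServiceCClient serviceC, RabbitTemplate rabbitTemplate) {
        this.serviceA = serviceA;
        this.serviceB = serviceB;
        this.serviceC = serviceC;
        this.rabbitTemplate = rabbitTemplate;
    }

    @RabbitListener(queues = "queue1")
    public void handle(String message) {
        String dataA = serviceA.fetch(message);  // small amount of data
        String dataB = serviceB.fetch(message);  // small amount of data
        serviceC.update(dataA, dataB);           // idempotent update
        // fan out the resulting messages onto Queue2
        rabbitTemplate.convertAndSend("queue2", dataA + ":" + dataB);
    }
}
```

Since every step is idempotent, a failed message can simply be redelivered and replayed, which is why no job repository or database is needed here.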
I am using Spring Batch to download a big file in order to process it.
The scenario is pretty simple:
1. Download the file via HTTP
2. Process it (validations, transformations)
3. Send it to a queue
There is no need to save the input file data.
We might have multiple job instances (of the same scenario) running at the same time.
I am looking for the best practice to handle this situation.
Should I create a Tasklet to download the file locally and then start processing it via regular steps?
In that case I need to consider some temp-file concerns
(making sure I delete it, making sure I am not overwriting another instance's temp file, etc.).
On the other hand, I could download it and keep it in memory, but then I'm afraid that if I run many job instances I'll run out of memory very soon.
How would you suggest handling this scenario? Should I use a Tasklet at all?
Thank you.
If you have a large file, I'd recommend storing it on disk unless there is a good reason not to. Saving the file to disk allows you to restart the job without re-downloading the file if an error occurs.
With regards to the Tasklet vs Spring Integration, we typically recommend Spring Integration for this type of functionality since FTP functionality is already available there. That being said, Spring XD uses a Tasklet for FTP functionality so it's not uncommon to take that approach as well.
A good video to watch about the integration of Spring Batch and Spring Integration is the talk Gunnar Hillert and I gave at SpringOne2GX. You can find the entire video here: https://www.youtube.com/watch?v=8tiqeV07XlI. The section that talks about using Spring Batch Integration for FTP before Spring Batch is at about 29:37.
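To make the Tasklet option concrete, here is a hedged sketch of a download Tasklet that writes to a unique temp file per job instance and hands the path to later steps; the URL handling and the "inputFile" context key are illustrative choices:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

// Hedged sketch of the Tasklet approach: download to a unique temp file per
// job instance, then share the path with the processing steps.
public class DownloadFileTasklet implements Tasklet {

    private final String url;

    public DownloadFileTasklet(String url) {
        this.url = url;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution,
                                ChunkContext chunkContext) throws Exception {
        // a unique temp file avoids clashes between concurrent job instances
        Path target = Files.createTempFile("input-", ".csv");
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        // hand the path to the processing steps via the job execution context
        chunkContext.getStepContext().getStepExecution()
                .getJobExecution().getExecutionContext()
                .putString("inputFile", target.toString());
        return RepeatStatus.FINISHED;
    }
}
```

A final cleanup step or a JobExecutionListener would then delete the temp file, addressing the cleanup concern from the question.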
I believe the example below is a classic solution to your problem:
http://docs.spring.io/spring-batch/trunk/reference/html/springBatchIntegration.html#launching-batch-jobs-through-messages
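That page's pattern boils down to transforming an incoming message into a JobLaunchRequest, roughly like this sketch; the payload type and the "input.file.name" parameter are placeholders:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.integration.launch.JobLaunchRequest;
import org.springframework.integration.annotation.Transformer;
import org.springframework.messaging.Message;

// Sketch of the message-to-job transformer from the linked docs.
public class FileMessageToJobRequest {

    private Job job; // the batch job to launch for each message

    public void setJob(Job job) {
        this.job = job;
    }

    @Transformer
    public JobLaunchRequest toRequest(Message<String> message) {
        JobParametersBuilder params = new JobParametersBuilder();
        params.addString("input.file.name", message.getPayload());
        return new JobLaunchRequest(job, params.toJobParameters());
    }
}
```

A JobLaunchingGateway downstream of this transformer then launches the job for each message.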
I am looking for a pattern and/or framework which can model the following problem in an easily configurable way.
Every 3 minutes or so, I need a set of jobs to kick off in a web application context that will concurrently hit web services to obtain the latest version of the data and push it to a database. The problem is that the database will simultaneously be under heavy read load for tons of complex calculations on that data. We are currently using Spring, so I have been looking at Spring Batch to run this process. Does anyone have suggestions/patterns/examples of using Spring or other technologies for a similar system?
We have used ServletContextListeners to kick off TimerTasks in our web applications when we needed processes to run repeatedly. The ServletContextListener runs when the app server starts the application or when the application is restarted. The timer task then acts like a separate thread that repeats your code at the specified interval; see the sketch after the links below.
ServletContextListener
http://www.javabeat.net/examples/2009/02/26/servletcontextlistener-example/
TimerTask
http://enos.itcollege.ee/~jpoial/docs/tutorial/essential/threads/timer.html
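A minimal sketch of that setup, assuming the servlet 3.x API; the listener name and the task body are placeholders:

```java
import java.util.Timer;
import java.util.TimerTask;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

// Minimal sketch of the listener-plus-timer approach described above.
@WebListener
public class RefreshJobListener implements ServletContextListener {

    private Timer timer;

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        timer = new Timer("data-refresh", true); // daemon thread
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                // hit the web services and push the results to the database
            }
        }, 0, 3 * 60 * 1000); // every 3 minutes, as in the question
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        timer.cancel(); // stop the thread when the app is undeployed
    }
}
```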
Is refactoring the job out of the web application and into a standalone app a possibility?
That way you could stick the batch job onto a separate batch server (so that the extra load of the batch job wouldn't impact your web application), which then calls the web services and updates the database. The job can then be kicked off using something like cron or Autosys.
We're using Spring-Batch for exactly this purpose.
The database design would also depend on what the batched data is used for. If it is for reporting purposes, I would recommend separating the operational database from the reporting database, using a database link to obtain the required data from the operational database into the reporting database and then running the complex queries on the reporting database. That way the load is shifted off the operational database.
I think it's also worth looking into frameworks like Apache Camel. Also take a look at the so-called Enterprise Integration Patterns; the catalog might give you some useful vocabulary for thinking about the scaling/scheduling problem at hand.
The framework itself integrates really well with Spring.
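For a flavor of it, a hedged sketch of a Camel route for the 3-minute polling flow; the HTTP endpoint and the persistence logic are illustrative placeholders, not from the original post:

```java
import org.apache.camel.builder.RouteBuilder;

// Hedged sketch: poll a web service every 3 minutes and persist the result.
public class DataRefreshRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("timer:refresh?period=180000")        // fire every 3 minutes
            .to("http://example.com/api/latest")   // fetch the latest data
            .process(exchange -> {
                String payload = exchange.getIn().getBody(String.class);
                // persist the payload to the database, e.g. via a DAO bean
            });
    }
}
```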