Best approach using Spring Batch to process a big file - java

I am using Spring Batch to download a big file in order to process it.
The scenario is pretty simple:
1. Download the file via HTTP
2. Process it (validations, transformations)
3. Send it to a queue
There is no need to save the input file data.
We might have multiple job instances (of the same scenario) running at the same time.
I am looking for the best practice to handle this situation.
Should I create a Tasklet to download the file locally and then start processing it via regular steps?
In that case I need to consider some temp-file concerns
(making sure I delete it, making sure I am not overriding another job's temp filename, etc.).
On the other hand, I could download it and keep it in memory, but then I am afraid that if I run many job instances I will run out of memory very soon.
How would you suggest handling this scenario? Should I use a Tasklet at all?
Thank you.

If you have a large file, I'd recommend storing it on disk unless there is a good reason not to. Saving the file to disk lets you restart the job without re-downloading the file if an error occurs.
With regard to Tasklet vs. Spring Integration, we typically recommend Spring Integration for this type of functionality, since FTP support is already available there. That being said, Spring XD uses a Tasklet for its FTP functionality, so it's not uncommon to take that approach either.
A good video to watch about the integration of Spring Batch and Spring Integration is the talk Gunnar Hillert and I gave at SpringOne2GX. You can find the entire video here: https://www.youtube.com/watch?v=8tiqeV07XlI. The section that talks about using Spring Batch Integration for FTP before Spring Batch is at about 29:37.
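The temp-file concerns from the question (unique names, guaranteed cleanup) can mostly be handled by `java.nio.file`, whether or not the download runs inside a Tasklet. A minimal sketch, using no Spring classes; the class and method names are hypothetical:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class FileDownloadStep {

    // Downloads the resource at 'source' into a uniquely named temp file.
    // Files.createTempFile guarantees the name does not collide with temp
    // files of other job instances running on the same machine.
    static Path downloadToTempFile(URL source) throws Exception {
        Path tmp = Files.createTempFile("batch-input-", ".tmp");
        try (InputStream in = source.openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (Exception e) {
            Files.deleteIfExists(tmp); // don't leave partial downloads behind
            throw e;
        }
        return tmp;
    }

    public static void main(String[] args) throws Exception {
        Path input = downloadToTempFile(new URL(args[0]));
        try {
            // ... hand 'input' to the regular processing steps here ...
            System.out.println("downloaded " + Files.size(input) + " bytes");
        } finally {
            Files.deleteIfExists(input); // the cleanup concern from the question
        }
    }
}
```

`Files.createTempFile` generates a fresh name on every call, so concurrent job instances never overwrite each other, and the try/finally keeps partial or leftover files from piling up.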

I believe the example below is a classic solution to your problem:
http://docs.spring.io/spring-batch/trunk/reference/html/springBatchIntegration.html#launching-batch-jobs-through-messages

Related

Identifying the Right Spring Boot Starter

I want to build an application which listens to a queue and does a series of steps.
Basically the application should listen to Queue1 and:
- Get some data from ServiceA [small amount of data]
- Get some data from ServiceB [small amount of data]
- Update some information in ServiceC [based on the data]
- Create a number of messages [based on the data] on Queue2.
Due to the flow-based nature of this application I was looking into the job execution systems in Spring. However, all the steps are designed to be idempotent and the data being transferred between steps is small, so I did not want a database with this application.
I started exploring Spring Batch and Spring Task for this. Spring Batch provides really good constructs like Tasklets and Steps, but there are a number of comments recommending connecting Spring Batch to a database, and it is designed to manage massive amounts of data reliably (I don't need reliability here, since the queue and the idempotent nature of the steps provide that). While I can pass data using the ExecutionContext, there were recommendations against that too.
Questions:
- Are there simpler starters in the Spring Boot ecosystem which provide a workflow/job-like interface that I should use?
- Is this a valid use case for Spring Batch, or is that over-engineering/misuse of the Steps?
Thanks a lot for the help.
Ayushman
P.S.: I can provide exact details of the job, but did not want to conflate the question.
I have two projects' worth of experience with Spring Batch; I haven't tried Spring Task.
Having said that, my answer is somewhat biased. Spring Batch is a bit notorious to configure. If your application is simple enough, just use spring-boot-starter-amqp. It will be enough.
If, by any chance, you do decide to use Spring Batch (for its Job and Step abstractions or other features), you may want to configure it to use just an in-memory database, because you don't need the retry/rollback features it provides.
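To illustrate how small this flow is without any Batch constructs, here is a framework-free sketch of the four steps; all the service and queue names are hypothetical stand-ins:

```java
import java.util.List;

public class QueuePipeline {

    // Hypothetical stand-in for the data gathered from ServiceA/ServiceB.
    record Enriched(String original, String fromA, String fromB) {}

    static Enriched fetchFromServices(String msg) {
        // small lookups from ServiceA and ServiceB
        return new Enriched(msg, "a:" + msg, "b:" + msg);
    }

    static Enriched updateServiceC(Enriched e) {
        // idempotent update in ServiceC; repeating it is safe
        return e;
    }

    static List<String> toQueue2Messages(Enriched e) {
        // fan out: one input message produces N messages for Queue2
        return List.of(e.fromA(), e.fromB());
    }

    // One message in, N messages out. Because every step is idempotent,
    // a failure can simply be handled by re-delivering the queue message
    // and running the whole handler again -- no job repository needed.
    static List<String> handle(String msg) {
        return toQueue2Messages(updateServiceC(fetchFromServices(msg)));
    }
}
```

Wired to a `@RabbitListener` (or any queue consumer), the broker's redelivery plus the idempotent steps give you the retry story that Spring Batch's database would otherwise provide.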

Java Spring/Workflow

I have 50,000,000 files that need to be processed using 3-5 different filters configured in workflows.
I plan to use a microservice architecture.
My questions:
- I want to use Spring Integration and Spring Batch to design and run the workflows. Do you agree, or is there another Java-based system you recommend?
- Can Spring Batch handle long-running (i.e., days) workflows?
- Can Spring Batch/Integration load XML files on the fly?
I think Spring Batch is pretty good for this job; my answers are below.
- I recommend Spring Batch for this. It's easy to use, and in combination with Spring Integration it is a good fit for the workflow design.
- Yes, it handles long-running workflows well; you just need to configure it properly.
- I'm not sure what you mean by "on the fly" (batch files or configuration files). For batch files, yes. For configuration files, it depends on how you load the configuration and how you will use the context.
IMHO Spring Batch can process files based on multiple filters. It can also be easily customized to fit most needs, and it has really fast processing speeds. However, I haven't tried it with anything close to 50,000,000 files, so I can't vouch for that scale.
To run a Spring Batch application as a microservice, take a look at Spring Boot and Spring Cloud Task. Also look into Spring Cloud Data Flow for orchestration.

Is it best practice to design a Spring application in such a way that we need to create the same context multiple times?

I am going to start a new project using the Spring framework. As I don't have much experience with Spring, I need your help to sort out a few confusions.
Let's look at the use case.
My application uses the Spring Integration framework. The core functionality of my app is:
1. Poll multiple directories on the file system,
2. read the files (mostly CSV),
3. process some operations on them and insert them into a database.
Currently I have set up a Spring Integration flow for this, which has an inbound-channel-adapter for polling; the file then traverses through the channels and at the end is inserted into the database.
My concerns are:
- The number of directories the application is supposed to poll will be decided at runtime. Hence I need to create each inbound-channel-adapter at runtime (as one channel adapter can poll only one directory) and can't define them statically in my Spring context XML (as I don't know how many I will need).
- Each directory has certain properties which should be applied to its files while processing (while going through the integration flow).
So right now what I am doing is loading new ClassPathXmlApplicationContext("/applicationContext.xml"); for each directory, caching the required properties in that newly created context, and using them at processing time (in <int:service-activator>).
Drawbacks of the current design:
- A separate context is created for each directory.
- Beans are unnecessarily duplicated (database session factories and the like).
So is there any way to design the application such that the context is not duplicated, while I can still use the properties of each directory throughout the integration flow?
Thanks in advance.
See the dynamic ftp sample and the links in its readme about creating child contexts when needed, containing new inbound components.
Also see my answer to a similar question for multiple IMAP mail adapters using Java configuration and then a follow-up question.
You can also use a message source advice to reconfigure the FileReadingMessageSource on each poll to look at different directories. See Smart polling.
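If you prefer to avoid per-directory child contexts entirely, the underlying idea can also be expressed without duplicating anything: keep one shared set of beans and attach each directory's properties to the files it produces. A plain-Java sketch of that shape (all names hypothetical, not Spring Integration API):

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiDirPoller {

    // Hypothetical per-directory settings that travel with each file,
    // instead of living in a duplicated application context.
    record DirProps(String encoding, String targetTable) {}

    private final Map<Path, DirProps> registered = new LinkedHashMap<>();

    // Directories are discovered at runtime and registered here.
    void register(Path dir, DirProps props) {
        registered.put(dir, props);
    }

    // One shared "flow": every poll walks all registered directories and
    // hands each file to the same processing code with its own properties,
    // so session factories etc. exist exactly once.
    List<String> pollOnce() throws Exception {
        List<String> processed = new ArrayList<>();
        for (Map.Entry<Path, DirProps> e : registered.entrySet()) {
            try (DirectoryStream<Path> files =
                    Files.newDirectoryStream(e.getKey(), "*.csv")) {
                for (Path f : files) {
                    processed.add(process(f, e.getValue()));
                }
            }
        }
        return processed;
    }

    private String process(Path file, DirProps props) {
        // stand-in for validate / transform / insert-into-database
        return file.getFileName() + " -> " + props.targetTable();
    }
}
```

In Spring Integration terms this corresponds to one flow whose message headers carry the directory's properties, rather than one context per directory.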

What is the simplest transaction framework in Java?

Given that I have a simple task: process some piece of data and append it to a file. It's fine if there are no exceptions, but they may happen, and if something goes wrong I would like to remove all the changes from the file.
Also, I may have set some variables during the processing, and I would like to return them to their previous state too.
Also, I may be working with a database that doesn't support transactions (to the best of my knowledge, MongoDB does not), so I would like to be able to roll back changes in the DB somehow.
Yes, I can fix the issue with my file manually just by backing it up and then replacing it. But in general it looks like I need a transaction framework.
I don't want to pull in the Spring monster for this; it's too much. And I don't have an EJB container to manage EJBs. I have a simple stand-alone Java application, but it needs transaction support.
Do I have other options besides plugging in Spring or EJB?
If you don't want to use Spring, try implementing a simple two-phase commit mechanism: Two-Phase Commit Protocol
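A two-phase commit coordinator is small enough to sketch in plain Java; the `Participant` interface and the coordinator below are hypothetical illustrations of the protocol, not a library API:

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseCommit {

    // A participant (file writer, DB updater, ...) must be able to
    // promise (prepare) before it is asked to commit.
    interface Participant {
        boolean prepare();   // vote: can I apply my change?
        void commit();
        void rollback();
    }

    // Phase 1: ask every participant to prepare. Phase 2: commit only if
    // all voted yes; otherwise roll back the ones already prepared.
    static boolean run(List<Participant> participants) {
        List<Participant> prepared = new ArrayList<>();
        for (Participant p : participants) {
            if (p.prepare()) {
                prepared.add(p);
            } else {
                for (Participant q : prepared) q.rollback();
                return false;
            }
        }
        for (Participant p : prepared) p.commit();
        return true;
    }
}
```

For the file participant, "prepare" would mean writing the new content to a temp file and "commit" renaming it over the original; the variable snapshot and the DB write would be two more participants.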
I am no Java expert, but this sounds simple.
In fact, I would not use transactions in an ACID-compliant database, since that doesn't sound like the right tool here.
Instead, I would write to a temporary file and, once your records have been written, merge it with the original file. That way, if some records cannot be written for whatever reason, you just drop the temporary file; merging and saving the new file can be made atomic through the OS's file system.
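The temp-file-then-merge idea can be made concrete with `java.nio.file`. This sketch assumes a POSIX-style file system, where an atomic move replaces the target in one step; the class and method names are hypothetical:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class AtomicAppend {

    // "Transactional" append: build the merged result in a temp file in the
    // same directory, then atomically swap it over the original. If any
    // record fails before the move, the original file is never touched.
    static void appendAll(Path target, List<String> records) throws Exception {
        Path tmp = Files.createTempFile(
                target.toAbsolutePath().getParent(), "merge-", ".tmp");
        try {
            List<String> merged = new ArrayList<>(
                    Files.exists(target) ? Files.readAllLines(target) : List.of());
            merged.addAll(records); // a failure here aborts without side effects
            Files.write(tmp, merged);
            // On POSIX file systems this rename atomically replaces 'target'.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
    }
}
```

Readers of the file see either the complete old content or the complete new content, never a half-written state, which covers the rollback requirement without any transaction framework.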

How do I save server state between webapp deployments not relying on a database?

We have a utility spring-mvc application that doesn't use a database, it is just a soap/rest wrapper. We would like to store an arbitrary message for display to users that persists between deployments. The application must be able to both read and write this data. Are there any best practices for this?
Multiple options:
- Write something to the file system - great for persistence, a little slow. The primary drawback is that it would probably have to be a shared file system, as any type of clustering wouldn't deal well with this; then you get into file-locking issues. Very easy implementation.
- Embedded DB - similar benefits and pitfalls to writing to the file system, but probably deals better with locking/transactional issues. Somewhat more difficult implementation.
- Distributed cache (like Memcached) - a bit faster than a file, though not much. Deals with the clustering and locking issues; however, it's not persistent. Fairly reliable across a short webapp restart, but definitely not 100%. More difficult implementation, plus you need another server.
Why not use an embedded database? Options are:
- H2
- HSQLDB
- Derby
Just include the jar file in the webapp's classpath (WEB-INF/lib) and configure the JDBC URL as normal.
Perfect for demos, and easy to substitute when you want to switch to a bigger database server.
I would simply store that in a file on the filesystem. It's possible to use an embedded database or something like that, but for one message, a file will be fine.
I'd recommend you store the file outside of the application directory.
It might be alongside (next to) it, but don't store it inside your "webapps/" directory or anything like that.
You'll probably also need to manage concurrency. A global (static) read/write lock should do fine.
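The file-plus-global-lock approach fits in a few lines; a minimal sketch, where the class name and file location are hypothetical (keep the file outside the webapp directory so redeployments don't wipe it):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class MessageStore {

    // One global (static) lock guards the message file against concurrent
    // readers and writers within this JVM.
    private static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();

    private final Path file;

    MessageStore(Path file) {
        this.file = file;
    }

    // Many threads may read at once; reads block only while a write runs.
    String read() throws Exception {
        LOCK.readLock().lock();
        try {
            return Files.exists(file) ? new String(Files.readAllBytes(file)) : "";
        } finally {
            LOCK.readLock().unlock();
        }
    }

    // Writes are exclusive: no reader sees a half-written message.
    void write(String message) throws Exception {
        LOCK.writeLock().lock();
        try {
            Files.write(file, message.getBytes());
        } finally {
            LOCK.writeLock().unlock();
        }
    }
}
```

Note this only coordinates threads inside one JVM; if the app is ever clustered, you're back to the shared-file-system locking caveats from the first answer.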
I would use JNDI. Why over-complicate?
