Spring Batch read from CSV, how does it work? - java

I am new to Spring Batch, and I wonder how the reader/processor/writer works if I am reading a CSV file that contains 10k rows, using a chunk size of 10 and writing to another CSV file.
My question is:
Does Spring Batch load all 10k rows from the CSV at once, process them individually (10k times), and then store all of them in the destination file in one go? If so, what's the point of using Spring Batch? I could have three methods doing the same job, right?
Or:
Does Spring Batch open a stream over the 10k-row CSV, read 10 rows at a time, process those 10 rows, and open an output stream to write/append those 10 rows to the destination file? Basically repeating 10k / 10 = 1k times.
@Configuration
public class SampleJob3 {

    @Bean
    public Job job3(JobRepository jobRepository, PlatformTransactionManager transactionManager) {
        return new JobBuilder("Job3", jobRepository)
                .incrementer(new RunIdIncrementer()) // works with program args
                .start(step(jobRepository, transactionManager))
                .build();
    }

    private Step step(JobRepository jobRepository, PlatformTransactionManager transactionManager) {
        return new StepBuilder("Job3 Step started ")
                .<Student, Student>chunk(3)
                .repository(jobRepository)
                .transactionManager(transactionManager)
                .reader(reader(true))
                .processor(student -> {
                    System.out.println("processor");
                    return new Student(student.getId(), student.getFirstName() + "!", student.getLastName() + "!", student.getEmail() + "!");
                })
                .writer(writer())
                .build();
    }

    private FlatFileItemReader<Student> reader(boolean isValid) {
        System.out.println("reader");
        FlatFileItemReader<Student> reader = new FlatFileItemReader<>();
        // use FileSystemResource if the file is stored in a directory instead of the resources folder
        reader.setResource(new PathMatchingResourcePatternResolver().getResource(isValid ? "input/students.csv" : "input/students_invalid.csv"));
        reader.setLineMapper(new DefaultLineMapper<>() {
            {
                setLineTokenizer(new DelimitedLineTokenizer() {{
                    setNames("ID", "First Name", "Last Name", "Email");
                }});
                setFieldSetMapper(new BeanWrapperFieldSetMapper<>() {{
                    setTargetType(Student.class);
                }});
            }
        });
        reader.setLinesToSkip(1);
        return reader;
    }

    // @Bean
    public FlatFileItemWriter<Student> writer() {
        System.out.println("writer");
        FlatFileItemWriter<Student> writer = new FlatFileItemWriter<>();
        writer.setResource(new FileSystemResource("output/students.csv"));
        writer.setHeaderCallback(writer1 -> writer1.write("Id,First Name,Last Name, Email"));
        writer.setLineAggregator(new DelimitedLineAggregator<>() {{
            setFieldExtractor(new BeanWrapperFieldExtractor<>() {{
                setNames(new String[]{"id", "firstName", "lastName", "email"});
            }});
        }});
        writer.setFooterCallback(writer12 -> writer12.write("Created @ " + Instant.now()));
        return writer;
    }
}
My last question is basically the same, but the data source is a database, e.g. reading a table that contains 10k rows from dbA and writing to dbB. Am I able to read 10 rows at a time, process them, and write them to dbB? If so, can you share some pseudocode?

A chunk-oriented step in Spring Batch will not read the entire file or table at once. It will rather stream data from the source in chunks (of a configurable size).
You can find more details about the processing model in the reference documentation here: Chunk-oriented Processing.
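For the database-to-database case, here is a minimal sketch of what such a chunk-oriented step could look like, assuming the Spring Batch 5 builder style, a JdbcCursorItemReader streaming from dbA and a JdbcBatchItemWriter batch-inserting into dbB. The table, column names and the Student mapping are made up for illustration, and imports are omitted as in the rest of the post:

@Bean
public JdbcCursorItemReader<Student> dbAReader(DataSource dataSourceA) {
    // Streams rows from dbA with a cursor; rows are fetched as the step asks for them.
    return new JdbcCursorItemReaderBuilder<Student>()
            .name("dbAReader")
            .dataSource(dataSourceA)
            .sql("SELECT id, first_name, last_name, email FROM students") // hypothetical table
            .rowMapper(new BeanPropertyRowMapper<>(Student.class))
            .build();
}

@Bean
public JdbcBatchItemWriter<Student> dbBWriter(DataSource dataSourceB) {
    // Inserts each chunk into dbB as one JDBC batch.
    return new JdbcBatchItemWriterBuilder<Student>()
            .dataSource(dataSourceB)
            .sql("INSERT INTO students (id, first_name, last_name, email) VALUES (:id, :firstName, :lastName, :email)")
            .beanMapped()
            .build();
}

@Bean
public Step copyStep(JobRepository jobRepository, PlatformTransactionManager transactionManager,
                     JdbcCursorItemReader<Student> dbAReader, JdbcBatchItemWriter<Student> dbBWriter) {
    return new StepBuilder("copyStep", jobRepository)
            // read 10, process 10, write 10, commit, then repeat until the cursor is exhausted
            .<Student, Student>chunk(10, transactionManager)
            .reader(dbAReader)
            .processor(student -> student) // put any transformation here
            .writer(dbBWriter)
            .build();
}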

Related

Spring Batch EmptyResultDataAccessException while deleting

I have a batch job that reads data from multiple tables on a datasource with a complicated select query with many joins and writes to a table on a different datasource using an insert query.
@Bean
public JdbcCursorItemReader<Employee> myReader2() {
    JdbcCursorItemReader<Employee> reader = new JdbcCursorItemReader<>();
    reader.setSql(COMPLICATED_QUERY_WITH_MANY_JOINS);
    reader.setDataSource(dataSourceOne);
    reader.setPreparedStatementSetter(new MyPrepStSetterOne());
    reader.setRowMapper(new EmployeeRowMapper());
    return reader;
}

@Bean
public JdbcBatchItemWriter<Employee> myWriter2(DataSource dataSource) {
    JdbcBatchItemWriter<Employee> writer = new JdbcBatchItemWriter<>();
    writer.setSql(INSERT_QUERY);
    writer.setPreparedStatementSetter(new MyPrepStSetterTwo());
    writer.setDataSource(dataSourceTwo);
    return writer;
}
I have the above reader and writer in a step.
I want to delete the employee records inserted by the previous day's job (not all of them), only those that could be duplicated by today's records.
So I added another step before the above step, with the same select query in the reader but a delete query in the writer.
@Bean
public JdbcCursorItemReader<Employee> myReader1() {
    JdbcCursorItemReader<Employee> reader = new JdbcCursorItemReader<>();
    reader.setSql(COMPLICATED_QUERY_WITH_MANY_JOINS);
    reader.setDataSource(dataSourceOne);
    reader.setPreparedStatementSetter(new MyPrepStSetterOne());
    reader.setRowMapper(new EmployeeRowMapper());
    return reader;
}

@Bean
public JdbcBatchItemWriter<Employee> myWriter1(DataSource dataSource) {
    JdbcBatchItemWriter<Employee> writer = new JdbcBatchItemWriter<>();
    writer.setSql(DELETE_QUERY_WHERE_EMPLOYEE_NAME_IS);
    writer.setPreparedStatementSetter(new MyPrepStSetterZero());
    writer.setDataSource(dataSourceTwo);
    return writer;
}
I am getting EmptyResultDataAccessException: Item 3 of 10 did not update any rows, because not all of today's records may have been inserted yesterday.
How can I make myWriter1 ignore records that do not already exist and proceed to the next one?
Your approach seems correct. You can set JdbcBatchItemWriter#setAssertUpdates to false in your writer and this should ignore the case where no records have been updated by your query (which is a valid business case according to your description).
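For example, with the delete writer shown above, this would be a one-line change (sketch; the rest of myWriter1 stays as posted):

@Bean
public JdbcBatchItemWriter<Employee> myWriter1(DataSource dataSource) {
    JdbcBatchItemWriter<Employee> writer = new JdbcBatchItemWriter<>();
    writer.setSql(DELETE_QUERY_WHERE_EMPLOYEE_NAME_IS);
    writer.setPreparedStatementSetter(new MyPrepStSetterZero());
    writer.setDataSource(dataSourceTwo);
    writer.setAssertUpdates(false); // do not throw EmptyResultDataAccessException when a delete matches zero rows
    return writer;
}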

How to read and parse only 1 column from a CSV file Java

So I have a CSV file in which I have several attributes of books, namely bookName, authorName, yearOfPublishing, etc. I am using Spring Batch to process the file. The problem is that I want to read and parse only the authorName field, but my app doesn't recognise it: it only maps the first column of my CSV file into my authorName field. Here are my item reader and line mapper.
@Bean
public FlatFileItemReader<Author> reader() {
    FlatFileItemReader<Author> itemReader = new FlatFileItemReader<>();
    itemReader.setResource(new FileSystemResource("src/main/resources/BX-Books.csv"));
    itemReader.setName("csvReader");
    itemReader.setLinesToSkip(1);
    itemReader.setLineMapper(lineMapper());
    itemReader.setStrict(false);
    return itemReader;
}

private LineMapper<Author> lineMapper() {
    DefaultLineMapper<Author> lineMapper = new DefaultLineMapper<>();
    DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
    lineTokenizer.setDelimiter(";");
    lineTokenizer.setStrict(false);
    lineTokenizer.setNames("Book-Author");
    BeanWrapperFieldSetMapper<Author> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(Author.class);
    lineMapper.setLineTokenizer(lineTokenizer);
    lineMapper.setFieldSetMapper(fieldSetMapper);
    return lineMapper;
}
THIS would be the way to do it.
That would be the method setIncludedFields in the class org.springframework.batch.item.file.transform.DelimitedLineTokenizer, in case the above link goes bad.
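A minimal sketch of how the posted lineMapper() could use it, assuming Author exposes a bookAuthor property and that the author column sits at zero-based index 1 in BX-Books.csv (adjust the index to the real position):

private LineMapper<Author> lineMapper() {
    DefaultLineMapper<Author> lineMapper = new DefaultLineMapper<>();

    DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
    lineTokenizer.setDelimiter(";");
    lineTokenizer.setStrict(false);
    // Keep only the author column; all other columns are ignored by the tokenizer.
    lineTokenizer.setIncludedFields(1);
    // One name per included field, matching the Author property it should bind to.
    lineTokenizer.setNames("bookAuthor");

    BeanWrapperFieldSetMapper<Author> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(Author.class);

    lineMapper.setLineTokenizer(lineTokenizer);
    lineMapper.setFieldSetMapper(fieldSetMapper);
    return lineMapper;
}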

Spring batch doesn't seem to be closing item writers properly

I have a job that writes each item to a separate file. In order to do this, the job uses a ClassifierCompositeItemWriter whose classifier returns a new FlatFileItemWriter for each item (code below).
@Bean
@StepScope
public ClassifierCompositeItemWriter<MyItem> writer(@Value("#{jobParameters['outputPath']}") String outputPath) {
    ClassifierCompositeItemWriter<MyItem> compositeItemWriter = new ClassifierCompositeItemWriter<>();
    compositeItemWriter.setClassifier((item) -> {
        String filePath = outputPath + "/" + item.getFileName();
        BeanWrapperFieldExtractor<MyItem> fieldExtractor = new BeanWrapperFieldExtractor<>();
        fieldExtractor.setNames(new String[]{"content"});
        DelimitedLineAggregator<MyItem> lineAggregator = new DelimitedLineAggregator<>();
        lineAggregator.setFieldExtractor(fieldExtractor);
        FlatFileItemWriter<MyItem> itemWriter = new FlatFileItemWriter<>();
        itemWriter.setResource(new FileSystemResource(filePath));
        itemWriter.setLineAggregator(lineAggregator);
        itemWriter.setShouldDeleteIfEmpty(true);
        itemWriter.setShouldDeleteIfExists(true);
        itemWriter.open(new ExecutionContext());
        return itemWriter;
    });
    return compositeItemWriter;
}
Here's how the job is configured:
@Bean
public Step step1() {
    return stepBuilderFactory
            .get("step1")
            .<String, MyItem>chunk(1)
            .reader(reader(null))
            .processor(processor(null, null, null))
            .writer(writer(null))
            .build();
}

@Bean
public Job job() {
    return jobBuilderFactory
            .get("job")
            .incrementer(new RunIdIncrementer())
            .flow(step1())
            .end()
            .build();
}
Everything works perfectly. All the files are generated as I expected. However, one of the files cannot be deleted. Just one. If I try to delete it, I get a message saying that "OpenJDK Platform binary" is using it. If I increase the chunk size to a number bigger than the number of files I'm generating, none of the files can be deleted. It seems like there's an issue with deleting the files generated in the last chunk, as if the respective writer is not being closed properly by the Spring Batch lifecycle or something.
If I kill the application process, I can delete the file.
Any idea why this could be happening? Thanks in advance!
PS: I'm calling itemWriter.open(new ExecutionContext()) because if I don't, I get "org.springframework.batch.item.WriterNotOpenException: Writer must be open before it can be written to".
EDIT:
If someone is facing a similar problem, I suggest reading Mahmoud's answer to this question: Spring batch : ClassifierCompositeItemWriter footer not getting called.
You are probably using the item writer outside of the step scope when doing this:
itemWriter.open(new ExecutionContext());
Please check this question, hope that this helps you.
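If it helps, here is one possible sketch for keeping the open/close calls inside the step lifecycle: a custom writer that implements ItemStream (so the step registers it as a stream and calls close() when the step ends) and that lazily creates one FlatFileItemWriter per target file. It is written against the Spring Batch 4 API used in the question and is only an illustration, not the approach from the linked answer:

public class PerFileItemWriter implements ItemStreamWriter<MyItem> {

    private final String outputPath;
    private final Map<String, FlatFileItemWriter<MyItem>> writers = new HashMap<>();

    public PerFileItemWriter(String outputPath) {
        this.outputPath = outputPath;
    }

    @Override
    public void write(List<? extends MyItem> items) throws Exception {
        for (MyItem item : items) {
            writerFor(item).write(Collections.singletonList(item));
        }
    }

    private FlatFileItemWriter<MyItem> writerFor(MyItem item) {
        return writers.computeIfAbsent(outputPath + "/" + item.getFileName(), filePath -> {
            BeanWrapperFieldExtractor<MyItem> fieldExtractor = new BeanWrapperFieldExtractor<>();
            fieldExtractor.setNames(new String[]{"content"});
            DelimitedLineAggregator<MyItem> lineAggregator = new DelimitedLineAggregator<>();
            lineAggregator.setFieldExtractor(fieldExtractor);
            FlatFileItemWriter<MyItem> writer = new FlatFileItemWriter<>();
            writer.setResource(new FileSystemResource(filePath));
            writer.setLineAggregator(lineAggregator);
            writer.open(new ExecutionContext()); // opened here, closed in close() below
            return writer;
        });
    }

    @Override
    public void open(ExecutionContext executionContext) {
        // nothing to do up front; delegate writers are created lazily per file
    }

    @Override
    public void update(ExecutionContext executionContext) {
        // no restart state is kept in this sketch
    }

    @Override
    public void close() {
        // called by the step at the end, so no file handle outlives the step
        writers.values().forEach(FlatFileItemWriter::close);
    }
}

It could then replace the composite writer in step1(), e.g. via .writer(new PerFileItemWriter(outputPath)).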

How to process larger files from JSON To CSV using Spring batch

I am trying to implement a batch job for the following use-case (I am new to Spring Batch).
Use-case
From one source system I will get 200+ compressed (.gz) files every day. Each .gz file gives a 1GB file when unzipped, which means 200GB of files in my input directory. The content type is JSON.
Sample Format of JSON File
{"name":"abc1","age":20}
{"name":"abc2","age":20}
{"name":"abc3","age":20}
.....
I need to process these files from JSON to CSV into an output directory, and the CSV generation should roll by size, similar to size-based rolling in Log4j. After writing, I need to remove each processed file from the input directory.
Question 1
Can Spring Batch handle this huge amount of data? For a single day I am getting nearly 200GB.
Question 2
I think Spring Batch can handle it, so I implemented the code with a partitioner using Spring Batch. But while reading I am seeing some dirty lines without any end of line.
Faulty lines structure
{"name":"abc1","age":20,....}
{"name":"abc2","age":20......}
{"name":"abc3","age":20
{"name":"abc1","age":20,....}
{"name":"abc1","age":20,....}
.....
For this I have written a skip policy, but it's not working as expected: it skips all lines from the error line onwards instead of just that one line. How do I skip only the error line?
I am sharing my sample snippet below. Please give some suggestions or corrections on my code and on the above questions and issues.
JobConfig.java
@Bean
public Job myJob() throws Exception {
    return jobBuilderFactory.get(Constants.JOB.JOB_NAME)
            .incrementer(new RunIdIncrementer())
            .listener(jobCompleteListener())
            .start(masterStep())
            .build();
}

// master
@Bean
public Step masterStep() throws Exception {
    return stepBuilderFactory.get("step")
            .listener(new UnzipListener())
            .partitioner(slaveStep())
            .partitioner("P", partitioner())
            .gridSize(10)
            .taskExecutor(executor())
            .build();
}

// slaveStep
@Bean
public Step slaveStep() throws Exception {
    return stepBuilderFactory.get("slavestep")
            .reader(reader(null))
            .writer(customWriter)
            .faultTolerant()
            .skipPolicy(fileVerificationSkipper())
            .build();
}

@Bean
public SkipPolicy fileVerificationSkipper() {
    return new LineVerificationSkipper();
}

@Bean
@StepScope
public Partitioner partitioner() throws Exception {
    MultiResourcePartitioner part = new MultiResourcePartitioner();
    PathMatchingResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();
    Resource[] res = resolver.getResources("...path of files...");
    part.setResources(res);
    part.partition(20);
    return part;
}
Skip Policy Code
public class LineVerificationSkipper implements SkipPolicy {

    @Override
    public boolean shouldSkip(Throwable exception, int skipCount) throws SkipLimitExceededException {
        if (exception instanceof FileNotFoundException) {
            return false;
        } else if (exception instanceof FlatFileParseException && skipCount <= 5) {
            FlatFileParseException ffpe = (FlatFileParseException) exception;
            StringBuilder errorMessage = new StringBuilder();
            errorMessage.append("An error occurred while processing the " + ffpe.getLineNumber()
                    + " line of the file. Below was the faulty " + "input.\n");
            errorMessage.append(ffpe.getInput() + "\n");
            System.err.println(errorMessage.toString());
            return true;
        } else {
            return false;
        }
    }
}
Question 3
How do I delete the input source files after processing each file? I am not getting any info like the file path or name in the ItemWriter.

Flink Dynamic Update Stream job

I am receiving a set of events in Avro format on different topics. I want to consume these and write them to S3 in Parquet format.
I have written the job below, which creates a different stream for each event and fetches its schema from the Confluent schema registry to create a Parquet sink for that event.
This is working fine, but the only problem I am facing is that whenever a new event starts coming in, I have to change the YAML config and restart the job. Is there any way to avoid restarting the job so that it starts consuming the new set of events automatically?
YamlReader reader = new YamlReader(topologyConfig);
EventTopologyConfig eventTopologyConfig = reader.read(EventTopologyConfig.class);

long checkPointInterval = eventTopologyConfig.getCheckPointInterval();
topics = eventTopologyConfig.getTopics();
List<EventConfig> eventTypesList = eventTopologyConfig.getEventsType();

CachedSchemaRegistryClient registryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 1000);

FlinkKafkaConsumer flinkKafkaConsumer = new FlinkKafkaConsumer(topics,
        new KafkaGenericAvroDeserializationSchema(schemaRegistryUrl),
        properties);

DataStream<GenericRecord> dataStream = streamExecutionEnvironment.addSource(flinkKafkaConsumer).name("source");

try {
    for (EventConfig eventConfig : eventTypesList) {
        LOG.info("creating a stream for {}", eventConfig.getEvent_name());

        final StreamingFileSink<GenericRecord> sink = StreamingFileSink.forBulkFormat
                (path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema(eventConfig.getSchema_subject(), registryClient)))
                .withBucketAssigner(new EventTimeBucketAssigner())
                .build();

        DataStream<GenericRecord> outStream = dataStream.filter((FilterFunction<GenericRecord>) genericRecord ->
                genericRecord != null && genericRecord.get(EVENT_NAME).toString().equals(eventConfig.getEvent_name()));

        outStream.addSink(sink).name(eventConfig.getSink_id()).setParallelism(parallelism);
    }
} catch (Exception e) {
    e.printStackTrace();
}
YAML file:
!com.bounce.config.EventTopologyConfig
eventsType:
  - !com.bounce.config.EventConfig
    event_name: "search_list_keyless"
    schema_subject: "search_list_keyless-com.bounce.events.keyless.bookingflow.search_list_keyless"
    topic: "search_list_keyless"
  - !com.bounce.config.EventConfig
    event_name: "bike_search_details"
    schema_subject: "bike_search_details-com.bounce.events.keyless.bookingflow.bike_search_details"
    topic: "bike_search_details"
  - !com.bounce.config.EventConfig
    event_name: "keyless_bike_lock"
    schema_subject: "analytics-keyless-com.bounce.events.keyless.bookingflow.keyless_bike_lock"
    topic: "analytics-keyless"
  - !com.bounce.config.EventConfig
    event_name: "keyless_bike_unlock"
    schema_subject: "analytics-keyless-com.bounce.events.keyless.bookingflow.keyless_bike_unlock"
    topic: "analytics-keyless"
checkPointInterval: 1200000
topics: ["search_list_keyless","bike_search_details","analytics-keyless"]
Thanks.
I think you want to use a custom BucketAssigner that uses the genericRecord.get(EVENT_NAME).toString() value as the bucket ID, along with whatever event time bucketing was being done by the EventTimeBucketAssigner.
Then you don't need to create multiple streams, and it should be dynamic (whenever a new event name value occurs in a record being written, you'll get a new output sink).
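A minimal sketch of what such a BucketAssigner could look like, assuming the StreamingFileSink API from the question and reusing its EVENT_NAME constant; the time component below is only a placeholder, since the question does not show what EventTimeBucketAssigner actually does:

public class EventNameBucketAssigner implements BucketAssigner<GenericRecord, String> {

    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd--HH").withZone(ZoneOffset.UTC);

    @Override
    public String getBucketId(GenericRecord record, Context context) {
        // Route by the event name carried in the record, plus a time component,
        // so a single sink can handle event names that were not known at startup.
        String eventName = record.get(EVENT_NAME).toString();
        String timeBucket = FORMATTER.format(Instant.ofEpochMilli(context.currentProcessingTime()));
        return eventName + "/" + timeBucket;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

The sink would then be built once with .withBucketAssigner(new EventNameBucketAssigner()) and attached directly to dataStream, instead of creating one filtered stream and one sink per entry in the YAML config.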
