I have a batch job that reads data from multiple tables on one datasource, using a complicated select query with many joins, and writes to a table on a different datasource using an insert query.
@Bean
public JdbcCursorItemReader<Employee> myReader2() {
JdbcCursorItemReader<Employee> reader = new JdbcCursorItemReader<>();
reader.setSql(COMPLICATED_QUERY_WITH_MANY_JOINS);
reader.setDataSource(dataSourceOne);
reader.setPreparedStatementSetter(new MyPrepStSetterOne());
reader.setRowMapper(new EmployeeRowMapper());
return reader;
}
@Bean
public JdbcBatchItemWriter<Employee> myWriter2(DataSource dataSource) {
JdbcBatchItemWriter<Employee> writer = new JdbcBatchItemWriter<>();
writer.setSql(INSERT_QUERY);
writer.setItemPreparedStatementSetter(new MyPrepStSetterTwo());
writer.setDataSource(dataSourceTwo);
return writer;
}
I have the above reader and writer in a step.
I want to delete the employee records that were inserted by the previous day's job (not all of them), but only those that could be duplicated by today's records.
So I added another step before the above step, with the same select query in the reader but a delete query in the writer.
@Bean
public JdbcCursorItemReader<Employee> myReader1() {
JdbcCursorItemReader<Employee> reader = new JdbcCursorItemReader<>();
reader.setSql(COMPLICATED_QUERY_WITH_MANY_JOINS);
reader.setDataSource(dataSourceOne);
reader.setPreparedStatementSetter(new MyPrepStSetterOne());
reader.setRowMapper(new EmployeeRowMapper());
return reader;
}
@Bean
public JdbcBatchItemWriter<Employee> myWriter1(DataSource dataSource) {
JdbcBatchItemWriter<Employee> writer = new JdbcBatchItemWriter<>();
writer.setSql(DELETE_QUERY_WHERE_EMPLOYEE_NAME_IS);
writer.setItemPreparedStatementSetter(new MyPrepStSetterZero());
writer.setDataSource(dataSourceTwo);
return writer;
}
I am getting EmptyResultDataAccessException: Item 3 of 10 did not update any rows, because not all of today's records may have been inserted yesterday.
How can I make myWriter1 ignore records that do not already exist and proceed to the next one?
Your approach seems correct. You can set JdbcBatchItemWriter#setAssertUpdates to false in your writer, and it will ignore the case where no rows are updated by your query (which is a valid business case according to your description).
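For example, a minimal sketch of myWriter1 with the assertion disabled (the bean, query and setter names are taken from your snippet):
@Bean
public JdbcBatchItemWriter<Employee> myWriter1() {
    JdbcBatchItemWriter<Employee> writer = new JdbcBatchItemWriter<>();
    writer.setSql(DELETE_QUERY_WHERE_EMPLOYEE_NAME_IS);
    writer.setItemPreparedStatementSetter(new MyPrepStSetterZero());
    writer.setDataSource(dataSourceTwo);
    // Do not fail the chunk when the delete statement affects zero rows
    writer.setAssertUpdates(false);
    return writer;
}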
I am new to Spring Batch, and I wonder how the reader/processor/writer works if I am reading a CSV file which contains 10k rows, using a chunk size of 10, and writing to another CSV file.
My question is:
Does Spring Batch load all 10k rows from the CSV at once, process them individually (10k times), and then store all of them in the destination file in one go? If so, what's the point of using Spring Batch? I could have three methods doing the same job, right?
Or:
Does Spring Batch open a stream over the CSV, read 10 rows at a time, process those 10 rows, and open an output stream to write/append those 10 rows to the destination file? Basically repeating 10k/10 = 1k times.
@Configuration
public class SampleJob3 {
@Bean
public Job job3(JobRepository jobRepository, PlatformTransactionManager transactionManager) {
return new JobBuilder("Job3", jobRepository)
.incrementer(new RunIdIncrementer()) // work with program args
.start(step(jobRepository, transactionManager))
.build();
}
private Step step(JobRepository jobRepository, PlatformTransactionManager transactionManager) {
return new StepBuilder("Job3Step", jobRepository)
.<Student, Student>chunk(3, transactionManager)
.reader(reader(true))
.processor(student -> {
System.out.println("processor");
return new Student(student.getId(), student.getFirstName() + "!", student.getLastName() + "!", student.getEmail() + "!");
})
.writer(writer())
.build();
}
private FlatFileItemReader<Student> reader(boolean isValid) {
System.out.println("reader");
FlatFileItemReader<Student> reader = new FlatFileItemReader<>();
// use FileSystemResource if the file is stored in a directory instead of the resources folder
reader.setResource(new PathMatchingResourcePatternResolver().getResource(isValid ? "input/students.csv" : "input/students_invalid.csv"));
reader.setLineMapper(new DefaultLineMapper<>() {
{
setLineTokenizer(new DelimitedLineTokenizer() {{
setNames("ID", "First Name", "Last Name", "Email");
}});
setFieldSetMapper(new BeanWrapperFieldSetMapper<>() {{
setTargetType(Student.class);
}});
}
});
reader.setLinesToSkip(1);
return reader;
}
//@Bean
public FlatFileItemWriter<Student> writer() {
System.out.println("writer");
FlatFileItemWriter<Student> writer = new FlatFileItemWriter<>();
writer.setResource(new FileSystemResource("output/students.csv"));
writer.setHeaderCallback(writer1 -> writer1.write("Id,First Name,Last Name,Email"));
writer.setLineAggregator(new DelimitedLineAggregator<>() {{
setFieldExtractor(new BeanWrapperFieldExtractor<>() {{
setNames(new String[]{"id", "firstName", "lastName", "email"});
}});
}});
writer.setFooterCallback(writer12 -> writer12.write("Created @ " + Instant.now()));
return writer;
}
}
My last question is basically the same, but the datasource is a database, e.g. reading a table containing 10k rows from dbA and writing to dbB. Am I able to read 10 rows at a time, process them, and write them to dbB? If so, can you share some pseudocode?
A chunk-oriented step in Spring Batch will not read the entire file or table at once. It will rather stream data from the source in chunks (of a configurable size).
You can find more details about the processing model in the reference documentation here: Chunk-oriented Processing.
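To make that concrete, here is a minimal sketch of a database-to-database chunk step; the table, column and query names are made up for the example, and the Student class is reused from your config. With a chunk size of 10, the step reads 10 items one by one, processes each of them, writes the 10 items in one transaction, and repeats until the reader is exhausted:
@Bean
public Step copyStep(JobRepository jobRepository, PlatformTransactionManager transactionManager,
                     DataSource dbA, DataSource dbB) {
    JdbcCursorItemReader<Student> reader = new JdbcCursorItemReaderBuilder<Student>()
            .name("dbAReader")
            .dataSource(dbA)
            .sql("SELECT id, first_name, last_name, email FROM student") // hypothetical query
            .rowMapper(new BeanPropertyRowMapper<>(Student.class))
            .build();

    JdbcBatchItemWriter<Student> writer = new JdbcBatchItemWriterBuilder<Student>()
            .dataSource(dbB)
            .sql("INSERT INTO student (id, first_name, last_name, email) "
                    + "VALUES (:id, :firstName, :lastName, :email)") // hypothetical query
            .beanMapped()
            .build();

    return new StepBuilder("copyStep", jobRepository)
            .<Student, Student>chunk(10, transactionManager) // read/process 10, write 10, repeat
            .reader(reader)
            .writer(writer)
            .build();
}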
So I have a CSV file in which I have several attributes of books, such as bookName, authorName, yearOfPublishing, etc. In order to process the file I am implementing the batch framework. The problem is that I want to read and parse only the authorName field, but my app doesn't recognise it; it only passes the first column of my CSV file into my authorName field. Here are my item reader and line mapper.
@Bean
public FlatFileItemReader<Author> reader(){
FlatFileItemReader<Author> itemReader = new FlatFileItemReader<>();
itemReader.setResource(new FileSystemResource("src/main/resources/BX-Books.csv"));
itemReader.setName("csvReader");
itemReader.setLinesToSkip(1);
itemReader.setLineMapper(lineMapper());
itemReader.setStrict(false);
return itemReader;
}
private LineMapper<Author> lineMapper() {
DefaultLineMapper<Author> lineMapper = new DefaultLineMapper<>();
DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
lineTokenizer.setDelimiter(";");
lineTokenizer.setStrict(false);
lineTokenizer.setNames("Book-Author");
BeanWrapperFieldSetMapper<Author> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
fieldSetMapper.setTargetType(Author.class);
lineMapper.setLineTokenizer(lineTokenizer);
lineMapper.setFieldSetMapper(fieldSetMapper);
return lineMapper;
}
THIS would be the way to do it.
That is, the method setIncludedFields in the class org.springframework.batch.item.file.transform.DelimitedLineTokenizer, in case the above link goes bad.
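For example, a minimal sketch of your tokenizer with only the author column included (I'm assuming the author is the second column of BX-Books.csv, i.e. index 1; adjust the index to match your file):
DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
lineTokenizer.setDelimiter(";");
lineTokenizer.setStrict(false);
// Only tokenize the column at index 1 (0-based); the single resulting field
// keeps the name "Book-Author" given by setNames.
lineTokenizer.setIncludedFields(1);
lineTokenizer.setNames("Book-Author");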
I am writing a service that obtains data from a large SQL query in the database (over 100,000 records) and streams it into a CSV file served by an API. Is there any Java library function that does this, or any way to make the code below more efficient? Currently using Java 8 in a Spring Boot environment.
The code is below, with the SQL repository method and the service for the CSV. Preferably I'm trying to write to the CSV file while data is still being fetched from SQL concurrently, as the query may take 2-3 minutes for the user.
We are using Snowflake DB.
public class ProductService {
private final ProductRepository productRepository;
private final ExecutorService executorService;
public ProductService(ProductRepository productRepository) {
this.productRepository = productRepository;
this.executorService = Executors.newFixedThreadPool(20);
}
public InputStream getproductExportFile(productExportFilters filters) throws IOException {
PipedInputStream is = new PipedInputStream();
PipedOutputStream os = new PipedOutputStream(is);
executorService.execute(() -> {
try {
Stream<productExport> productStream = productRepository.getproductExportStream(filters);
Field[] fields = Stream.of(productExport.class.getDeclaredFields())
.peek(f -> f.setAccessible(true))
.toArray(Field[]::new);
String[] headers = Stream.of(fields)
.map(Field::getName).toArray(String[]::new);
CSVFormat csvFormat = CSVFormat.DEFAULT.builder()
.setHeader(headers)
.build();
OutputStreamWriter outputStreamWriter = new OutputStreamWriter(os);
CSVPrinter csvPrinter = new CSVPrinter(outputStreamWriter, csvFormat);
productStream.forEach(productExport -> writeProductExportToCsv(productExport, csvPrinter, fields));
csvPrinter.close(); // closes the underlying OutputStreamWriter as well
} catch (Exception e) {
logger.warn("Unable to complete writing to csv stream.", e);
} finally {
try {
os.close();
} catch (IOException ignored) { }
}
});
return is;
}
private void writeProductExportToCsv(productExport productExport, CSVPrinter csvPrinter, Field[] fields) {
Object[] values = Stream.of(fields).
map(f -> {
try {
return f.get(productExport);
} catch (IllegalAccessException e) {
return null;
}
})
.toArray();
try {
csvPrinter.printRecord(values);
csvPrinter.flush();
} catch (IOException e) {
logger.warn("Unable to write record to file.", e);
}
}
public Stream<productExport> getproductExportStream(productExportFilters filters) {
MapSqlParameterSource parameterSource = new MapSqlParameterSource();
parameterSource.addValue("customerId", filters.getCustomerId().toString());
parameterSource.addValue("practiceId", filters.getPracticeId().toString());
StringBuilder sqlQuery = new StringBuilder("SELECT * FROM dbo.Product ");
sqlQuery.append("\nWHERE CUSTOMERID = :customerId\n" +
"AND PRACTICEID = :practiceId\n"
);
Streaming allows you to transfer the data, little by little, without having to load it all into the server's memory. You can do your operations using the extractData() method of a ResultSetExtractor. You can find the javadoc for ResultSetExtractor here.
You can view an example using ResultSetExtractor here.
You can also easily run your queries and process the ResultSet with JdbcTemplate and a ResultSetExtractor. You can take a look at an example here.
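A minimal sketch of that idea, assuming a JdbcTemplate wired to your Snowflake DataSource and the same Commons CSV CSVPrinter you already use (the query and column names are placeholders):
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.commons.csv.CSVPrinter;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.ResultSetExtractor;

public class ProductCsvStreamer {

    // Writes rows to the CSV as the driver delivers them, so the full result set
    // is never materialized in memory.
    public void streamProductsToCsv(JdbcTemplate jdbcTemplate, CSVPrinter csvPrinter, String customerId) {
        jdbcTemplate.setFetchSize(5000); // pull rows from the DB in larger batches
        jdbcTemplate.query(
                "SELECT ID, NAME, PRICE FROM dbo.Product WHERE CUSTOMERID = ?", // placeholder query
                (ResultSetExtractor<Void>) rs -> {
                    try {
                        while (rs.next()) {
                            csvPrinter.printRecord(rs.getObject("ID"), rs.getObject("NAME"), rs.getObject("PRICE"));
                        }
                        csvPrinter.flush();
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                    return null;
                },
                customerId);
    }
}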
There is a product which we bought some time ago for our company; we even got the source code back then: https://northconcepts.com/. We were also evaluating Apache Camel, which had similar support, but it didn't suit our goal. If you really need speed you should go to the lowest level possible: pure JDBC and as simple a CSV writer as possible.
The NorthConcepts library itself provides the capability to read from JDBC and write to CSV at a lower level. We found a few tweaks which sped up the transmission and processing. With a single thread we are able to stream 100,000 records (with 400 columns) within 1-2 minutes.
Given that you didn't specify which database you use I can give you only generic answers.
In general, code like this is network limited, as the JDBC result set is usually transferred in packages of "only n rows", and only when you exhaust one does the database trigger fetching of the next packet. This property is often called fetch size, and you should greatly increase it. With default settings, most databases transfer 10-100 rows in one fetch. In Spring you can use the setFetchSize property. Some benchmarks here.
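For example, a small sketch of bumping the fetch size on a plain JdbcTemplate (the value 10000 is just an illustration; tune it for your driver and row width):
JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
// Ask the driver to transfer rows in much larger batches than the default.
jdbcTemplate.setFetchSize(10_000);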
There is other similar low-level stuff you could do. For example, the Oracle JDBC driver has "InsensitiveResultSetBufferSize", which controls how big (in bytes) the buffer holding the result set is. But those things tend to be database specific.
That being said, the best way to really increase the speed of your transfer is to launch multiple queries. Divide your data on some column value, and then launch multiple parallel queries. Essentially, if you can design the data to support parallel queries working on easily distinguished subsets, the bottleneck moves to network or CPU throughput.
For example, one of your columns might be 'timestamp'. Instead of having one query fetch all rows, fetch multiple subsets of rows with a query like this:
SELECT * FROM dbo.Product
WHERE CUSTOMERID = :customerId
AND PRACTICEID = :practiceId
AND :lowerLimit <= timestamp AND timestamp < :upperLimit
Launch this query in parallel with different timestamp ranges. Aggregate the results of those subqueries in a shared ConcurrentLinkedQueue and build the CSV there; a rough sketch follows.
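A rough sketch of that fan-out, assuming non-overlapping timestamp ranges are precomputed and a NamedParameterJdbcTemplate is available (the pool size, range representation and selected columns are placeholders):
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.springframework.jdbc.core.RowCallbackHandler;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class ParallelProductFetcher {

    private final NamedParameterJdbcTemplate jdbcTemplate;

    public ParallelProductFetcher(NamedParameterJdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Fetches non-overlapping timestamp ranges in parallel and aggregates the raw rows.
    public ConcurrentLinkedQueue<Object[]> fetchInParallel(String customerId, String practiceId,
                                                           List<Timestamp[]> ranges) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // placeholder pool size
        ConcurrentLinkedQueue<Object[]> rows = new ConcurrentLinkedQueue<>();
        List<Future<?>> futures = new ArrayList<>();
        for (Timestamp[] range : ranges) { // each range is {lowerLimit, upperLimit}
            futures.add(pool.submit(() -> jdbcTemplate.query(
                    "SELECT * FROM dbo.Product WHERE CUSTOMERID = :customerId "
                            + "AND PRACTICEID = :practiceId "
                            + "AND :lowerLimit <= timestamp AND timestamp < :upperLimit",
                    new MapSqlParameterSource()
                            .addValue("customerId", customerId)
                            .addValue("practiceId", practiceId)
                            .addValue("lowerLimit", range[0])
                            .addValue("upperLimit", range[1]),
                    (RowCallbackHandler) rs -> {
                        // Each worker fills the shared queue with its own subset of rows.
                        rows.add(new Object[] { rs.getObject("ID"), rs.getObject("NAME") }); // placeholder columns
                    })));
        }
        for (Future<?> f : futures) {
            f.get(); // wait for all subqueries before building the CSV from 'rows'
        }
        pool.shutdown();
        return rows;
    }
}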
With a similar approach I regularly read 100,000 rows/sec from an 80-column table in an Oracle DB. That is a 40-60 MB/sec sustained transfer rate from a table which is not even locked.
I have a job that writes each item to a separate file. In order to do this, the job uses a ClassifierCompositeItemWriter whose classifier returns a new FlatFileItemWriter for each item (code below).
@Bean
@StepScope
public ClassifierCompositeItemWriter<MyItem> writer(@Value("#{jobParameters['outputPath']}") String outputPath) {
ClassifierCompositeItemWriter<MyItem> compositeItemWriter = new ClassifierCompositeItemWriter<>();
compositeItemWriter.setClassifier((item) -> {
String filePath = outputPath + "/" + item.getFileName();
BeanWrapperFieldExtractor<MyItem> fieldExtractor = new BeanWrapperFieldExtractor<>();
fieldExtractor.setNames(new String[]{"content"});
DelimitedLineAggregator<MyItem> lineAggregator = new DelimitedLineAggregator<>();
lineAggregator.setFieldExtractor(fieldExtractor);
FlatFileItemWriter<MyItem> itemWriter = new FlatFileItemWriter<>();
itemWriter.setResource(new FileSystemResource(filePath));
itemWriter.setLineAggregator(lineAggregator);
itemWriter.setShouldDeleteIfEmpty(true);
itemWriter.setShouldDeleteIfExists(true);
itemWriter.open(new ExecutionContext());
return itemWriter;
});
return compositeItemWriter;
}
Here's how the job is configured:
@Bean
public Step step1() {
return stepBuilderFactory
.get("step1")
.<String, MyItem>chunk(1)
.reader(reader(null))
.processor(processor(null, null, null))
.writer(writer(null))
.build();
}
@Bean
public Job job() {
return jobBuilderFactory
.get("job")
.incrementer(new RunIdIncrementer())
.flow(step1())
.end()
.build();
}
Everything works perfectly. All the files are generated as I expect. However, one of the files cannot be deleted. Just one. If I try to delete it, I get a message saying that "OpenJDK Platform binary" is using it. If I increase the chunk size to something bigger than the number of files I'm generating, none of the files can be deleted. It seems like there's an issue deleting the files generated in the last chunk, as if the respective writer is not being closed properly by the Spring Batch lifecycle or something.
If I kill the application process, I can delete the file.
Any idea why this could be happening? Thanks in advance!
PS: I'm calling itemWriter.open(new ExecutionContext()) because if I don't, I get "org.springframework.batch.item.WriterNotOpenException: Writer must be open before it can be written to".
EDIT:
If someone is facing a similar problem, I suggest reading Mahmoud's answer to this question: Spring batch : ClassifierCompositeItemWriter footer not getting called.
Probably you are using the item writer outside of the step scope when doing this:
itemWriter.open(new ExecutionContext());
Please check this question; I hope that it helps you.
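One way to make sure every dynamically created delegate gets closed (a sketch of the general idea, not necessarily the exact solution from the linked answer): cache the per-file writers in the classifier and close them all when the step finishes, via a StepExecutionListener.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.BeanWrapperFieldExtractor;
import org.springframework.batch.item.file.transform.DelimitedLineAggregator;
import org.springframework.classify.Classifier;
import org.springframework.core.io.FileSystemResource;

public class PerFileWriterClassifier implements Classifier<MyItem, ItemWriter<? super MyItem>>, StepExecutionListener {

    private final String outputPath;
    private final Map<String, FlatFileItemWriter<MyItem>> writers = new ConcurrentHashMap<>();

    public PerFileWriterClassifier(String outputPath) {
        this.outputPath = outputPath;
    }

    @Override
    public ItemWriter<? super MyItem> classify(MyItem item) {
        // Reuse one writer per file instead of creating (and never closing) a new one per item.
        return writers.computeIfAbsent(item.getFileName(), fileName -> {
            BeanWrapperFieldExtractor<MyItem> fieldExtractor = new BeanWrapperFieldExtractor<>();
            fieldExtractor.setNames(new String[]{"content"});
            DelimitedLineAggregator<MyItem> lineAggregator = new DelimitedLineAggregator<>();
            lineAggregator.setFieldExtractor(fieldExtractor);
            FlatFileItemWriter<MyItem> itemWriter = new FlatFileItemWriter<>();
            itemWriter.setResource(new FileSystemResource(outputPath + "/" + fileName));
            itemWriter.setLineAggregator(lineAggregator);
            itemWriter.setShouldDeleteIfEmpty(true);
            itemWriter.setShouldDeleteIfExists(true);
            itemWriter.open(new ExecutionContext());
            return itemWriter;
        });
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // nothing to do before the step
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // Close every delegate so no file handle is left open by the last chunk.
        writers.values().forEach(FlatFileItemWriter::close);
        return stepExecution.getExitStatus();
    }
}
You would set the same instance on the ClassifierCompositeItemWriter with setClassifier(...) and register it on the step with .listener(...), so afterStep releases the file handles that were still open for the last chunk.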
I'm using a RepositoryItemReader to read transactions from the DB, process them, and write them to a file using a FlatFileItemWriter. I'm using fault tolerance to skip a record if it is faulty. To check whether a record is faulty, we do some validations in the processor and throw a custom exception.
batch config
return stepBuilderFactory.get("test")
.<Transaction, Transaction>chunk(1)
.reader(this.reader.read())
.processor(transactionProcessor)
.writer(transactionWriter)
.faultTolerant()
.skipPolicy(new UnlimitedSkipPolicy())
.listener(new TransactionSkipListenerTransaction())
.stream(this.transactionWriter)
.build();
transaction reader bean
@Override
@StepScope
public RepositoryItemReader<Transaction> read() throws Exception {
final ZonedDateTime zonedDateTime = LocalDate.now().atStartOfDay(ZoneId.systemDefault());
final Date today = Date.from(zonedDateTime.toInstant());
final Date yesterday = Date.from(zonedDateTime.minusDays(1L).toInstant());
RepositoryItemReader<Transaction> reader = new RepositoryItemReader<>();
reader.setRepository(transactionDao);
reader.setMethodName("findByTransactionDateGreaterThanEqualAndTransactionDateLessThan");
reader.setArguments(Arrays.asList(yesterday, today));
reader.setSort(Collections.singletonMap("transactionId", Sort.Direction.ASC));
reader.setPageSize(10000);
return reader;
}
Processor bean
@Override
public Transaction process(Transaction transaction) throws Exception {
if (utility.isInvalidTransaction(transaction)) {
log.error("Invalid transaction {} and skipped from reporting", transaction.getTransactionId());
throw new CustomException("invalid transaction " + transaction.getTransactionId());
}
return transaction;
}
And the writer is just a flat file writer.
The issue is that after a few transactions are read, in my case after reading and processing 9 invalid transactions, the batch job gets stuck, and I'm not sure why.
One thing to note is that if I add noRetry(Exception.class) and noRollback(Exception.class) to the config, the batch runs fine. Can anyone please explain what is happening internally?
.reader(this.reader.read())
.processor(transactionProcessor)
.writer(transactionWriter)
.faultTolerant()
.skipPolicy(new UnlimitedSkipPolicy())
.noRetry(Exception.class)
.noRollback(Exception.class)
Output
Invalid transaction 1
Invalid transaction 2
Invalid transaction 9
Stuck here
In my view, the responsibility of the processor bean is to apply the custom transformation or logic that decides which data is passed to the writer. In this case I would simply log the error and return null, so that the invalid transaction is filtered out and never reaches the writer.
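A minimal sketch of that filtering processor (class and utility names are taken from your snippet; treat it as an illustration rather than a drop-in fix):
@Override
public Transaction process(Transaction transaction) throws Exception {
    if (utility.isInvalidTransaction(transaction)) {
        log.error("Invalid transaction {} skipped from reporting", transaction.getTransactionId());
        // Returning null filters the item: it is not passed to the writer and no exception
        // is thrown, so the skip/rollback machinery is never involved.
        return null;
    }
    return transaction;
}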