A process I've been working on for a little while now was running fine until performance started taking a hit. I figured out a way to get it to perform very fast, but I'm really unsure what is happening behind the scenes, and it's now throwing warnings and errors I don't know what to do about. The file is getting processed, but I'm not sure whether all the threads are complete, and I don't believe I am shutting down the app correctly. Here is everything you need to know...
The file is read using a buffered reader and we run data quality checks on each record. Every record that is read and passes the checks is turned into a Java object (this object is basically the table we are loading) and added to a List. Once the List is 1000 objects big, we call an OracleService class, which has a Repo autowired, and execute a saveAll with the List. We then continue reading the file and repeating this until the file is done being read. I am passing an ExecutorService object in to the service, so every time we call the service it gets a new List containing my objects and the ExecutorService object. The process runs fine, but a ton of exceptions get thrown once I try to shut down. Here is all my code...
My Controller class's run method. This gets called from another class which implements CommandLineRunner:
public void run() throws ParseException, IOException, InterruptedException {
logger.info("******************** Aegis Check Inclearing DDA Trial Balance Table Load starting ********************");
try (BufferedReader reader = new BufferedReader(new FileReader(inputFile))) {
String line = reader.readLine();
int count = 0;
TrialBalanceBuilder builder = new TrialBalanceBuilder();
while (line != null) {
if (line.startsWith("D")) {
if (dataQuality(line)) {
TrialBalance trialBalance = builder.buildTrialBalanceObject(line, procDt, time);
insertList.add(trialBalance);
count++;
if (count == 1000) {
oracleService.loadToTableTrialBalance(insertList, executorService);
count = 0;
insertList.clear();
}
} else {
logger.info("Data quality check FAILED for record: " + line);
oracleService.revertInserts("DDA_TRIAL_BAL_STG",procDt.toString());
System.exit(111);
}
}
line = reader.readLine();
}
logger.info("Leftover record count is " + insertList.size());
oracleService.loadToTableTrialBalance(insertList, executorService);
} catch (IOException e) {
e.printStackTrace();
}
logger.info("Updating Metadata table with new batch proc date");
InclearingBatchMetadataBuilder inclearingBatchMetadataBuilder = new InclearingBatchMetadataBuilder();
InclearingBatchMetadata inclearingBatchMetadata = inclearingBatchMetadataBuilder.buildInclearingBatchMetadataObject("DDA_TRIAL_BAL_STG", procDt, time, Constants.bankID);
oracleService.insertBatchProcDtIntoMetaTable(inclearingBatchMetadata);
logger.info("Successfully updated Metadata table with new batch proc date: " + procDt);
Thread.sleep(10000);
oracleService.cleanUpGOS("DDA_TRIAL_BAL_STG",1);
executorService.shutdownNow();
logger.info("******************** Aegis Check Inclearing DDA Trial Balance Table Load ended successfully ********************");
}
I'm passing in an ExecutorService object to the service class. This is defined as...
private final ThreadFactory threadFactory = new ThreadFactoryBuilder().setNameFormat("Orders-%d").setDaemon(true).build();
private ExecutorService executorService = Executors.newFixedThreadPool(10, threadFactory);
My service class looks like this...
@Service("oracleService")
public class OracleService {
private static final Logger logger = LoggerFactory.getLogger(OracleService.class);
@Autowired
TrialBalanceRepo trialBalanceRepo;
@Transactional
public void loadToTableTrialBalance(List<TrialBalance> trialBalanceList, ExecutorService executorService) {
logger.debug("Let's load to the database");
logger.debug(trialBalanceList.toString());
List<TrialBalance> multiThreadList = new ArrayList<>(trialBalanceList);
try {
executorService.execute(() -> trialBalanceRepo.saveAll(multiThreadList));
} catch (ConcurrentModificationException | DataIntegrityViolationException ignored) {}
logger.debug("Successfully loaded to database");
}
In my run method I then call a few more methods in that service class which create native queries and execute them against the database (for purging etc.).
Anyway, I never know when the threads are complete, and I am finding in pre-production, when running with a lot of data, that we shut down the app before all the data is completely loaded. I also don't know if this is even the best design. Do I keep passing in these ExecutorService objects? The whole point of this was to get optimal parallelism going so that our performance was better. Perhaps there is a better way (preferably without redesigning the entire app and using something other than JPA).
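For reference, the usual way to find out whether submitted tasks have finished is to call shutdown() and then awaitTermination(), rather than shutdownNow(), which interrupts tasks that are still running. A minimal sketch, assuming the executor is only used for these saveAll tasks and that a suitable timeout is chosen:

// Sketch only: stop accepting new work, then wait for in-flight saveAll tasks to finish.
executorService.shutdown();
try {
    // Pick a timeout that comfortably covers the slowest 1000-row batch.
    if (!executorService.awaitTermination(10, TimeUnit.MINUTES)) {
        logger.warn("Executor did not finish in time; forcing shutdown");
        executorService.shutdownNow();
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    executorService.shutdownNow();
}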
Related
I am writing a service that obtains data from a large SQL query in the database (over 100,000 records) and streams it into a CSV file through an API. Is there any Java library function that does this, or any way to make the code below more efficient? Currently using Java 8 in a Spring Boot environment.
The code is below, with the SQL repository method and the service for the CSV. Ideally I want to write to the CSV file while the data is being fetched from SQL concurrently, as the query may take 2-3 minutes for the user.
We are using Snowflake DB.
public class ProductService {
private static final Logger logger = LoggerFactory.getLogger(ProductService.class);
private final ProductRepository productRepository;
private final ExecutorService executorService;
public ProductService(ProductRepository productRepository) {
this.productRepository = productRepository;
this.executorService = Executors.newFixedThreadPool(20);
}
public InputStream getproductExportFile(productExportFilters filters) throws IOException {
PipedInputStream is = new PipedInputStream();
PipedOutputStream os = new PipedOutputStream(is);
executorService.execute(() -> {
try {
Stream<productExport> productStream = productRepository.getproductExportStream(filters);
Field[] fields = Stream.of(productExport.class.getDeclaredFields())
.peek(f -> f.setAccessible(true))
.toArray(Field[]::new);
String[] headers = Stream.of(fields)
.map(Field::getName).toArray(String[]::new);
CSVFormat csvFormat = CSVFormat.DEFAULT.builder()
.setHeader(headers)
.build();
OutputStreamWriter outputStreamWriter = new OutputStreamWriter(os);
CSVPrinter csvPrinter = new CSVPrinter(outputStreamWriter, csvFormat);
productStream.forEach(productExport -> writeProductExportToCsv(productExport, csvPrinter, fields));
csvPrinter.close();
outputStreamWriter.close();
} catch (Exception e) {
logger.warn("Unable to complete writing to csv stream.", e);
} finally {
try {
os.close();
} catch (IOException ignored) { }
}
});
return is;
}
private void writeProductExportToCsv(productExport productExport, CSVPrinter csvPrinter, Field[] fields) {
Object[] values = Stream.of(fields).
map(f -> {
try {
return f.get(productExport);
} catch (IllegalAccessException e) {
return null;
}
})
.toArray();
try {
csvPrinter.printRecord(values);
csvPrinter.flush();
} catch (IOException e) {
logger.warn("Unable to write record to file.", e);
}
}
public Stream<productExport> getproductExportStream(productExportFilters filters) {
MapSqlParameterSource parameterSource = new MapSqlParameterSource();
parameterSource.addValue("customerId", filters.getCustomerId().toString());
parameterSource.addValue("practiceId", filters.getPracticeId().toString());
StringBuilder sqlQuery = new StringBuilder("SELECT * FROM dbo.Product ");
sqlQuery.append("\nWHERE CUSTOMERID = :customerId\n" +
"AND PRACTICEID = :practiceId\n"
);
Streaming allows you to transfer the data little by little, without having to load it all into the server's memory. You can do your operations using the extractData() method of ResultSetExtractor. You can find the javadoc for ResultSetExtractor here.
You can view an example using ResultSetExtractor here.
You can also easily run your JPA queries as a ResultSet using JdbcTemplate. You can take a look at an example of using a ResultSetExtractor here.
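For illustration, a rough sketch of streaming rows straight into the CSVPrinter with a ResultSetExtractor, assuming a NamedParameterJdbcTemplate is available alongside the sqlQuery, parameterSource and csvPrinter from the code above:

// Sketch only: stream the ResultSet row by row into the CSV without building a List first.
ResultSetExtractor<Void> csvExtractor = rs -> {
    try {
        int columnCount = rs.getMetaData().getColumnCount();
        while (rs.next()) {
            Object[] row = new Object[columnCount];
            for (int i = 0; i < columnCount; i++) {
                row[i] = rs.getObject(i + 1); // JDBC columns are 1-based
            }
            csvPrinter.printRecord(row);
        }
        csvPrinter.flush();
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
    return null;
};
namedParameterJdbcTemplate.query(sqlQuery.toString(), parameterSource, csvExtractor);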
There is a product we bought for our company some time ago; we even got the source code back then: https://northconcepts.com/ We also evaluated Apache Camel, which had similar support, but it didn't suit our goal. If you really need speed you should go to the lowest level possible: pure JDBC and as simple a CSV writer as possible.
The NorthConcepts library itself provides the capability to read from JDBC and write to CSV at a lower level. We found a few tweaks which sped up the transmission and processing. With a single thread we are able to stream 100,000 records (with 400 columns) within 1-2 minutes.
Given that you didn't specify which database you use, I can only give you generic answers.
In general, code like this is network-limited: the JDBC result set is usually transferred in packets of "only n rows", and only when you exhaust one does the database trigger fetching of the next packet. This property is usually called fetch size, and you should greatly increase it. With default settings, most databases transfer 10-100 rows in one fetch. In Spring you can use the setFetchSize property. Some benchmarks here.
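For example, with Spring's JdbcTemplate the fetch size is a single setter call (the value 1000 is only an illustration; tune it for your row width and network):

// Sketch: raise the JDBC fetch size so each network round trip carries many rows.
JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
jdbcTemplate.setFetchSize(1000); // default is driver-specific and often only 10-100 rows
NamedParameterJdbcTemplate namedParameterJdbcTemplate = new NamedParameterJdbcTemplate(jdbcTemplate);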
There is other similar low-level stuff you could do. For example, the Oracle JDBC driver has "InsensitiveResultSetBufferSize", which controls how big, in bytes, the buffer holding the result set is. But those things tend to be database specific.
That being said, the best way to really increase the speed of your transfer is to launch multiple queries. Divide your data on some column value, and then launch multiple parallel queries. Essentially, if you can design the data so that parallel queries work on easily distinguished subsets, the bottleneck is transferred to network or CPU throughput.
For example, one of your columns might be 'timestamp'. Instead of having one query fetch all rows, fetch multiple subsets of rows with a query like this:
SELECT * FROM dbo.Product
WHERE CUSTOMERID = :customerId
AND PRACTICEID = :practiceId
AND :lowerLimit <= timestamp AND timestamp < :upperLimit
Launch this query in parallel with different timestamp ranges. Aggregate the results of those subqueries in a shared ConcurrentLinkedQueue and build the CSV there.
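A rough sketch of that idea; RANGE_QUERY, productRowMapper, TimestampRange and splitting the time window into ranges are placeholder names, not from the question's code:

// Sketch only: fan a timestamp-partitioned query out over a small pool and merge the results.
List<productExport> fetchInParallel(String customerId, String practiceId,
                                    List<TimestampRange> ranges) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
    ConcurrentLinkedQueue<productExport> rows = new ConcurrentLinkedQueue<>();
    List<Future<?>> futures = new ArrayList<>();
    for (TimestampRange range : ranges) {
        futures.add(pool.submit(() -> {
            MapSqlParameterSource params = new MapSqlParameterSource()
                    .addValue("customerId", customerId)
                    .addValue("practiceId", practiceId)
                    .addValue("lowerLimit", range.getLower())
                    .addValue("upperLimit", range.getUpper());
            rows.addAll(namedParameterJdbcTemplate.query(RANGE_QUERY, params, productRowMapper));
        }));
    }
    for (Future<?> f : futures) {
        f.get(); // propagate failures and wait for every partition before writing the CSV
    }
    pool.shutdown();
    return new ArrayList<>(rows);
}

The merged rows can then be handed to the CSV-writing code shown in the question.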
With a similar approach I regularly read 100,000 rows/sec from an 80-column table in an Oracle DB. That is a 40-60 MB/sec sustained transfer rate from a table which is not even locked.
I am trying to integrate multithreading with the FileWatcher service in Java. That is, I am constantly listening to a particular directory, and whenever a new file is created I need to spawn a new thread which processes the file (say, prints the file contents). I have managed to write code which compiles and works, but not as expected: it works sequentially, meaning file2 is processed after file1 and file3 is processed after file2. I want this to execute in parallel.
Adding the code snippet:
while (true) {
    WatchKey key;
    try {
        key = watcher.take();
        Path dir = keys.get(key);
        for (WatchEvent<?> event : key.pollEvents()) {
            WatchEvent.Kind<?> kind = event.kind();
            if (kind == StandardWatchEventKinds.OVERFLOW) {
                continue;
            }
            if (kind == StandardWatchEventKinds.ENTRY_CREATE) {
                boolean valid = key.reset();
                if (!valid) {
                    break;
                }
                log.info("New entry is created in the listening directory, Calling the FileProcessor");
                WatchEvent<Path> ev = (WatchEvent<Path>) event;
                Path newFileCreatedResolved = dir.resolve(ev.context());
                try {
                    FileProcessor processFile = new FileProcessor(newFileCreatedResolved.getFileName().toString());
                    Future<String> result = executor.submit(processFile);
                    try {
                        System.out.println("Processed File" + result.get());
                    } catch (ExecutionException e) {
                        e.printStackTrace();
                    }
                    //executor.shutdown(); add logic to shut down
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
}
and the FileProcessor class
public class FileProcessor implements Callable<String> {

    private final String triggerFile;

    FileProcessor(String triggerFile) throws FileNotFoundException, IOException {
        this.triggerFile = triggerFile;
    }

    public String call() throws Exception {
        // logic to write to another file; this new file is specific to the input file
        // returns success
        return "success";
    }
}
What is happening now: if I transfer 3 files at a time, they are processed sequentially. First file1 is written to its destination file, then file2, then file3, and so on.
Am I making sense? Which part do I need to change to make it parallel? Or is ExecutorService designed to work like that?
The call to Future.get() is blocking. The result isn't available until processing is complete, of course, and your code doesn't submit another task until then.
Wrap your Executor in a CompletionService and submit() tasks to it instead. Have another thread consume the results of the CompletionService to do any processing that is necessary after the task is complete.
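A minimal sketch of that arrangement, reusing the executor and FileProcessor from the question (the consumer thread and its log line are illustrative):

// Sketch: wrap the existing executor in a CompletionService; the watch loop only submits,
// and a separate consumer thread handles results, so the loop never blocks on get().
CompletionService<String> completionService = new ExecutorCompletionService<>(executor);

// inside the watch loop (the FileProcessor constructor can still throw IOException,
// so keep this inside the existing try block):
completionService.submit(processFile);

// consumer, started once, on its own thread:
Thread resultConsumer = new Thread(() -> {
    while (!Thread.currentThread().isInterrupted()) {
        try {
            Future<String> done = completionService.take(); // blocks until any task completes
            System.out.println("Processed File" + done.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            e.printStackTrace();
        }
    }
});
resultConsumer.setDaemon(true);
resultConsumer.start();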
Alternatively, you can use the helper methods of CompletableFuture to set up an equivalent pipeline of actions.
A third, simpler, but perhaps less flexible option is simply to incorporate the post-processing into the task itself, via a simple task wrapper.
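A sketch of such a wrapper, assuming the post-processing is just printing the result as in the question:

// Sketch: push the "print the result" step into the task itself, so the watch loop
// can fire-and-forget instead of blocking on Future.get().
class PrintingFileTask implements Runnable {
    private final FileProcessor delegate;

    PrintingFileTask(FileProcessor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            String result = delegate.call();
            System.out.println("Processed File" + result); // post-processing happens here
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

// in the watch loop:
executor.execute(new PrintingFileTask(processFile));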
I have a situation where, before I process an input file, I want to check whether certain information is set up in the database. In this particular case it is a client's name and the parameters used for processing. If this information is not set up, the file import shall fail.
On many StackOverflow pages, users handle the EmptyResultDataAccessException thrown when queryForObject returns no rows by catching it in the Java code.
The issue is that Spring Integration is catching the exception well before my code catches it, and in theory I would not be able to tell this error apart from any number of other EmptyResultDataAccessException exceptions which may be thrown by other queries in the code.
Example code segment showing try...catch with queryForObject:
MapSqlParameterSource mapParameters = new MapSqlParameterSource();
// Step 1 check if client exists at all
mapParameters.addValue("clientname", clientName);
try {
clientID = this.namedParameterJdbcTemplate.queryForObject(FIND_BY_NAME, mapParameters, Long.class);
} catch (EmptyResultDataAccessException e) {
SQLException sqle = (SQLException) e.getCause();
logger.debug("No client was found");
logger.debug(sqle.getMessage());
return null;
}
return clientID;
In the above code, no row was returned and I want to handle that properly (I have not coded that portion yet). But the catch block is never triggered; my generic error handler and its associated error channel are triggered instead.
Segment from file BatchIntegrationConfig.java:
@Bean
@ServiceActivator(inputChannel="errorChannel")
public DefaultErrorHandlingServiceActivator errorLauncher(JobLauncher jobLauncher){
logger.debug("====> Default Error Handler <====");
return new DefaultErrorHandlingServiceActivator();
}
Segment from file DefaultErrorHandlingServiceActivator.java:
public class DefaultErrorHandlingServiceActivator {
@ServiceActivator
public void handleThrowable(Message<Throwable> errorMessage) throws Throwable {
// error handling code should go here
}
}
Tested facts:
queryForObject expects a row to be returned and will throw an exception otherwise, therefore you have to handle the exception or use a different query which returns a row.
Spring Integration is monitoring exceptions and catching them before my own code can handle them.
What I want to be able to do:
Catch the very specific condition and log it or let the end user know what they need to do to fix the problem.
Edit on 10/26/2016, per the recommendation from @Artem:
Changed my existing input channel to use the Spring-provided handler advice:
@Transformer(inputChannel = "memberInputChannel", outputChannel = "commonJobGateway", adviceChain = "handleAdvice")
Added a supporting bean and method for the advice:
@Bean
ExpressionEvaluatingRequestHandlerAdvice handleAdvice() {
ExpressionEvaluatingRequestHandlerAdvice advice = new ExpressionEvaluatingRequestHandlerAdvice();
advice.setOnFailureExpression("payload");
advice.setFailureChannel(customErrorChannel());
advice.setReturnFailureExpressionResult(true);
advice.setTrapException(true);
return advice;
}
private MessageChannel customErrorChannel() {
return new DirectChannel();
}
I initially had some issues wiring up this feature, but in the end I realized that it creates yet another channel which would need to be monitored for errors and handled appropriately. For simplicity, I have chosen not to use another channel at this time.
Although potentially not the best solution, I switched to checking row counts instead of returning actual data, so the data-access exception is avoided in this situation.
The main code above moved to:
MapSqlParameterSource mapParameters = new MapSqlParameterSource();
mapParameters.addValue("clientname", clientName);
// Step 1 check if client exists at all; if exists, continue
// Step 2 check if client enrollment rules are available
if (this.namedParameterJdbcTemplate.queryForObject(COUNT_BY_NAME, mapParameters, Integer.class) == 1) {
if (this.namedParameterJdbcTemplate.queryForObject(CHECK_RULES_BY_NAME, mapParameters, Integer.class) != 1) return null;
} else return null;
return findClientByName(clientName);
I then check the data upon return to the calling method in Spring Batch:
if (clientID != null) {
logger.info("Found client ID ====> " + clientID);
}
else {
throw new ClientSetupJobExecutionException("Client " +
fileNameParts[1] + " does not exist or is improperly setup in the database.");
}
Although not needed, I created a custom Java Exception which could be useful at a later point in time.
The Spring Integration Service Activator can be supplied with an ExpressionEvaluatingRequestHandlerAdvice, which works like a try...catch and lets you perform some logic in the onFailureExpression: http://docs.spring.io/spring-integration/reference/html/messaging-endpoints-chapter.html#expression-advice
Your problem might be that you catch (EmptyResultDataAccessException e), but by the time it reaches your code it is the cause of another exception, not the root exception thrown from the this.namedParameterJdbcTemplate.queryForObject() invocation.
I've been testing out DynamoDB as a potential option for a scalable, steady-throughput database for a site that will be hit pretty frequently and requires a very fast response time (< 50 ms). I'm seeing pretty slow responses (both locally and on an EC2 instance) for the following code:
public static void main(String[] args) {
try {
AWSCredentials credentials = new PropertiesCredentials(new File("aws_credentials.properties"));
long start = System.currentTimeMillis();
AmazonDynamoDBClient client = new AmazonDynamoDBClient(credentials);
System.out.println((System.currentTimeMillis() - start) + " (ms) to connect");
DynamoDBMapper mapper = new DynamoDBMapper(client);
start = System.currentTimeMillis();
Model model = mapper.load(Model.class, "hashkey1", "rangekey1");
System.out.println((System.currentTimeMillis() - start) + " (ms) to load Model");
} catch (Exception e) {
e.printStackTrace();
}
}
The connection to the DB alone takes about 800 ms on average, and the load using the mapper takes an additional 200 ms. According to Amazon's page about DynamoDB we should expect "Average service-side latencies...typically single-digit milliseconds." I wouldn't expect the full round-trip HTTP request to add that much overhead. Are these expected numbers, even on an EC2 instance?
I think a better test would be to avoid the initial costs/latency incurred in starting up the JVM and loading the classes. Something like:
public class TestDynamoDBMain {
public static void main(String[] args) {
try {
AWSCredentials credentials = new PropertiesCredentials(new File("aws_credentials.properties"));
AmazonDynamoDBClient client = new AmazonDynamoDBClient(credentials);
DynamoDBMapper mapper = new DynamoDBMapper(client);
// Warm up
for (int i=0; i < 10; i++) {
testrun(mapper, false);
}
// Time it
for (int i=0; i < 10; i++) {
testrun(mapper, true);
}
} catch (Exception e) {
e.printStackTrace();
}
}
private static void testrun(DynamoDBMapper mapper, boolean timed) {
long start = System.nanoTime();
Model model = mapper.load(Model.class, "hashkey1", "rangekey1");
if (timed)
System.out.println(
TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
+ " (ms) to load Model");
}
}
Furthermore, you may consider enabling the default metrics of the AWS SDK for Java to see the fine-grained time allocation in Amazon CloudWatch. For more details, see:
http://java.awsblog.com/post/Tx1O0S3I51OTZWT/Taste-of-JMX-Using-the-AWS-SDK-for-Java
Hope this helps.
DynamoDB is located in a specific region (they don't yet support cross-region replication), which you choose when you create a table. Unless you are calling the APIs from the same region, it is bound to be slow.
It looks like you are trying to call DynamoDB from your development desktop. You can re-do the same test from an EC2 instance started in the same region; this will considerably speed up the responses. It is also a more realistic test, since when you deploy your production system it will be in the same region as DynamoDB anyway.
Again, if you really need a very quick response, consider using ElastiCache between your code and DynamoDB. On every read, store the result in the cache before returning it; the next read should come from the cache (say with an expiry time of 10 minutes). For read-heavy apps this is the suggested route, and I have seen many-fold better response times using this approach.
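The read path is the usual cache-aside pattern. A rough sketch against a hypothetical cache client (ElastiCache is hosted Memcached/Redis, so substitute whichever client you use; the get/set signatures here are illustrative):

// Sketch of cache-aside: check the cache first, fall back to DynamoDB, then populate the cache.
// "cache" stands for whatever Memcached/Redis client you use.
Model loadModel(String hashKey, String rangeKey) {
    String cacheKey = "model:" + hashKey + ":" + rangeKey;
    Model cached = cache.get(cacheKey);
    if (cached != null) {
        return cached;                      // cache hit: no DynamoDB round trip
    }
    Model fresh = mapper.load(Model.class, hashKey, rangeKey);
    if (fresh != null) {
        cache.set(cacheKey, fresh, 600);    // keep it for ~10 minutes, as suggested above
    }
    return fresh;
}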
I'm loading a local Couchbase instance with application-specific JSON objects.
The relevant code is:
CouchbaseClient getCouchbaseClient()
{
List<URI> uris = new LinkedList<URI>();
uris.add(URI.create("http://localhost:8091/pools"));
CouchbaseConnectionFactoryBuilder cfb = new CouchbaseConnectionFactoryBuilder();
cfb.setFailureMode(FailureMode.Retry);
cfb.setMaxReconnectDelay(1500); // to enqueue an operation
cfb.setOpTimeout(10000); // wait up to 10 seconds for an operation to succeed
cfb.setOpQueueMaxBlockTime(5000); // wait up to 5 seconds when trying to
// enqueue an operation
return new CouchbaseClient(cfb.buildCouchbaseConnection(uris, "my-app-bucket", ""));
}
Method to store an entry (I'm using the suggestions from Bulk Load and Exponential Backoff):
void continuosSet(CouchbaseClient cache, String key, int exp, Object value, int tries)
{
OperationFuture<Boolean> result = null;
OperationStatus status = null;
int backoffexp = 0;
do
{
if (backoffexp > tries)
{
throw new RuntimeException(MessageFormat.format("Could not perform a set after {0} tries.", tries));
}
result = cache.set(key, exp, value);
try
{
if (result.get())
{
break;
}
else
{
status = result.getStatus();
LOG.warn(MessageFormat.format("Set failed with status \"{0}\" ... retrying.", status.getMessage()));
if (backoffexp > 0)
{
double backoffMillis = Math.pow(2, backoffexp);
backoffMillis = Math.min(1000, backoffMillis); // 1 sec max
Thread.sleep((int) backoffMillis);
LOG.warn("Backing off, tries so far: " + tries);
}
backoffexp++;
}
}
catch (ExecutionException e)
{
LOG.error("ExecutionException while doing set: " + e.getMessage());
}
catch (InterruptedException e)
{
LOG.error("InterruptedException while doing set: " + e.getMessage());
}
}
while (status != null && status.getMessage() != null && status.getMessage().indexOf("Temporary failure") > -1);
}
When the continuosSet method is called for a large number of objects to store (single thread), e.g.
CouchbaseClient cache = getCouchbaseClient();
do
{
SerializableData data = queue.poll();
if (data != null)
{
final String key = data.getClass().getSimpleName() + data.getId();
continuosSet(cache, key, 0, gson.toJson(data, data.getClass()), 100);
...
it generates a CheckedOperationTimeoutException inside the continuosSet method, in the result.get() operation:
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: 127.0.0.1/127.0.0.1:11210
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:160) ~[spymemcached-2.8.12.jar:2.8.12]
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:133) ~[spymemcached-2.8.12.jar:2.8.12]
Can someone shed light on how to overcome and recover from this situation? Is there a good technique/workaround for bulk loading with the Java client for Couchbase? I have already explored the Performing a Bulk Set documentation, which unfortunately is for the PHP Couchbase client.
My suspicion is that you may be running this in a JVM spawned from the command line that doesn't have much memory. If that's the case, you could hit longer GC pauses, which could cause the timeout you're mentioning.
I think the best thing to do is to try a couple of things. First, raise the -Xmx argument to the JVM to use more memory, and see if the timeout happens later or goes away. If so, then my suspicion about memory is correct.
If that doesn't work, raise the setOpTimeout() and see if that reduces the error or makes it go away.
Also, make sure you're using the latest client.
By the way, I don't think this is directly related to bulk loading. It may happen owing to a lot of resource consumption during bulk loading, but it looks like the regular backoff must be working, or you're never hitting it.