JPA SQL Server Batch Inserts - java

Problem Statement
I am trying to improve the performance of an insert process in JPA. Currently it takes 4 minutes to insert ~350,000 records into my database. To speed up performance I want to use batch inserting. Below I outline the code I started with, the modifications I made to try to improve performance and fix memory issues, the results of those modifications, and some other attempts not shown in the modifications. Please let me know how I can improve my code to allow for large amounts of inserts using SQL Server and Hibernate. I can provide more information about the overall insert process if needed.
Starting Code
To enable this, I put the following in application.yml:
jpa:
  properties:
    hibernate:
      jdbc:
        batch_size: 1000
        batch_versioned_data: true
        order_inserts: true
I started with code that looks like this in my Loader class:
try (Stream<String> lines = Files.lines(Paths.get("/path/to/file"))) {
    Iterators.partition(lines.iterator(), 1000).forEachRemaining(batchList -> {
        List<CustomEntity> mappedEntities = batchList.stream().map(mapLineToEntity).collect(Collectors.toList());
        // Insert batch
        repository.saveAll(mappedEntities);
        repository.flush();
    });
}
Code Modifications
But this caused memory issues, prompting a custom SQL implementation using the EntityManager to flush and clear the persisted entities. To do this I created a CustomEntityServiceCustom.java interface and its implementation. Here are my two attempts below, along with the modified Loader class:
public interface CustomEntityServiceCustom {
    void batchInsertProcess(List<CustomEntity> customEntities, int start); // try 1
    void batchInsertProcess(List<CustomEntity> customEntities, AtomicInteger start); // try 2
}
public class CustomEntityServiceCustomImpl implements CustomEntityServiceCustom {

    @PersistenceContext
    private EntityManager em;

    // Try 1
    @Override
    @Transactional
    public void batchInsertProcess(List<CustomEntity> customEntities, int start) {
        for (CustomEntity ent : customEntities) {
            em.persist(ent);
        }
        em.flush();
        em.clear();
    }

    // Try 2
    @Override
    @Transactional
    public void batchInsertProcess(List<CustomEntity> customEntities, AtomicInteger start) {
        final int numRecsPerInsert = 25;
        Iterators.partition(customEntities.iterator(), numRecsPerInsert).forEachRemaining(batchList -> {
            /*
            Code not included, but createInsert builds an insert statement like the following
            for 25 records at a time, and then the AtomicInteger is updated:
            INSERT INTO table (col1, col2) VALUES (rec1val1, rec1val2), (rec2val1, rec2val2), ...
            */
            String multiLineInsert = createInsert(batchList, start.get());
            start.addAndGet(numRecsPerInsert);
            em.createNativeQuery(multiLineInsert).executeUpdate();
        });
        em.flush();
        em.clear();
    }
}
//updated loader
try (Stream<String> lines = Files.lines(Paths.get("/path/to/file"))) {
    AtomicInteger start = new AtomicInteger(1);
    Iterators.partition(lines.iterator(), 1000).forEachRemaining(batchList -> {
        List<CustomEntity> mappedEntities = batchList.stream().map(mapLineToEntity).collect(Collectors.toList());
        repository.batchInsertProcess(mappedEntities, start);
    });
}
Both Try 1 and Try 2 took advantage of not using autogenerated ids, i.e. the id field is not annotated with:
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
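For reference, Hibernate cannot use JDBC batching when ids come from GenerationType.IDENTITY, since every insert has to return its generated key immediately; manually assigned ids avoid that. If generated ids are still needed, a sequence-based mapping is one way to keep batching possible. A minimal sketch, assuming a custom_entity_seq sequence exists in the database (SQL Server 2012+):
@Entity
public class CustomEntity {

    // Sketch only: a sequence with a matching allocationSize lets Hibernate assign ids
    // in memory and keep grouping inserts into JDBC batches.
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "custom_entity_seq")
    @SequenceGenerator(name = "custom_entity_seq", sequenceName = "custom_entity_seq", allocationSize = 50)
    private Long id;

    // other columns ...
}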
Current Metrics
With all these changes there was no significant performance improvement from the starting code.
Try 1 was achieving an insert of ~50,000 records in 35 seconds, while Try 2 performed the same insert in 1 min. I do not understand this, as Try 1 was inserting 1 record at a time like:
INSERT INTO table (COL1, COL2) VALUES (VAL1, VAL2);
whereas Try 2 was inserting 25 records in 1 statement. I also tried 4 records per insert as recommended here, but this still took 52 seconds, which is much slower than the 35 seconds without multi-row inserts.
Other Considerations
I attempted to let Hibernate handle the multiple records per statement following this, but I do not see a rewriteBatchedStatements option for SQL Server. I have tried adding useBulkCopyForBatchInsert=true to the connection string as detailed here, but am unsure whether I will have to modify my code to see the benefits of this change.
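From what I can tell, useBulkCopyForBatchInsert only takes effect when the driver sees a parameterized insert executed through the JDBC batch API, not a hand-built multi-values statement. A rough sketch of the kind of code path it targets (table name, columns, and getters are illustrative):
// Sketch only: as far as I understand, the bulk copy optimization applies to
// PreparedStatement batches like this, so native multi-row VALUES strings would not benefit.
try (Connection conn = dataSource.getConnection();
     PreparedStatement ps = conn.prepareStatement("INSERT INTO custom_table (col1, col2) VALUES (?, ?)")) {
    conn.setAutoCommit(false);
    for (CustomEntity ent : mappedEntities) {
        ps.setString(1, ent.getCol1()); // illustrative getters
        ps.setString(2, ent.getCol2());
        ps.addBatch();
    }
    ps.executeBatch();
    conn.commit();
}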
I am also unsure whether the EntityManager needs to be flushed after executeUpdate(), since in the logs I got a message that 0 nanoseconds were spent executing 0 flushes and 6596474 nanoseconds were spent executing 2074 partial flushes. This could be another bottleneck, but I'm not sure exactly what happens behind the scenes.

If you run the insert statement with a multi-values clause like this, you will probably do hard-parses all the time. You should be using parameters instead so that the statement is cached on the server side. Also, you should reuse the Query object and just re-bind values. This is what Hibernate does for you behind the scenes when using batch inserting. It will also use the JDBC Batch API, which can do some tricks on the protocol level as well to improve performance further.
To wrap this up, I don't think that a multi-values clause can be better than batch inserts through the JDBC Batch API. If it really performs better, I'd say it's a bug in the JDBC driver and should be fixed, even if the fix is just to use a multi-values statement behind the scenes.
Anyway, if you want to try it out, you should probably structure this in the following way:
@Override
@Transactional
public void batchInsertProcess(List<CustomEntity> customEntities, AtomicInteger start) {
    final int numRecsPerInsert = 25;
    // creates "insert into ... values (?, ?, ?), (?, ?, ?), ..."
    String multiLineInsert = createInsert(numRecsPerInsert);
    Query query = em.createNativeQuery(multiLineInsert);
    Iterators.partition(customEntities.iterator(), numRecsPerInsert).forEachRemaining(batchList -> {
        bindValues(query, batchList, start.get(), numRecsPerInsert);
        start.addAndGet(numRecsPerInsert);
        query.executeUpdate();
    });
}
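createInsert and bindValues are not shown in the answer; a minimal sketch of what bindValues could look like, assuming the generated statement uses 1-based positional parameters and each row binds a manually assigned id plus two illustrative columns:
// Hypothetical helper: re-binds the reused multi-values INSERT for one partition.
private void bindValues(Query query, List<CustomEntity> batchList, int start, int numRecsPerInsert) {
    int position = 1;
    int id = start;
    for (CustomEntity ent : batchList) {
        query.setParameter(position++, id++);          // manually assigned id
        query.setParameter(position++, ent.getCol1()); // illustrative getters
        query.setParameter(position++, ent.getCol2());
    }
}
Note that the final partition can hold fewer than numRecsPerInsert rows, so it needs a shorter statement built separately (or padding); otherwise the reused query has unbound parameters.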

Related

Spring JPA and JDBC Template - very slow select query execution with IN clause

I am trying to execute the following query from my Java project.
I am using MySQL as the data store and have configured HikariCP as the DataSource.
SELECT iv.* FROM identifier_definition id
INNER JOIN identifier_list_values iv on id.definition_id = iv.definition_id
where
id.status IN (:statuses)
AND id.type = :listType
AND iv.identifier_value IN (:valuesToAdd)
MySQL connection String:
jdbc:mysql://hostname:3306/DBNAME?useSSL=true&allowPublicKeyRetrieval=true&useServerPrepStmts=true&generateSimpleParameterMetadata=true
When I execute this same query from MySQL workbench it returns results in 0.5 sec.
However, when I do the same from a JPA repository or Spring JDBC Template it takes almost 50 secs to execute.
This query has 2 IN clauses, where the statuses collection has only 3 items whereas the identifierValues collection has 10000 items.
When I execute the raw SQL query without named params using JDBC template it returns results in 2 secs. However, this approach is susceptible to SQL injection.
Both JPA and JDBC Template make use of a Java PreparedStatement under the hood. My hunch is that the underlying PreparedStatement is causing the performance issue when the large parameter set is added.
How do I improve my query performance?
Following is the JDBC template code that I am using:
@Component
public class ListValuesDAO {

    private static final Logger LOGGER = LoggerFactory.getLogger(ListValuesDAO.class);

    private final NamedParameterJdbcTemplate jdbcTemplate;

    @Autowired
    public ListValuesDAO(DataSource dataSource) {
        jdbcTemplate = new NamedParameterJdbcTemplate(dataSource);
    }

    public void validateListOverlap(List<String> valuesToAdd, ListType listType) {
        String query = "SELECT iv.* FROM identifier_definition id " +
            "INNER JOIN identifier_list_values iv on id.definition_id = iv.definition_id where " +
            "id.status IN (:statuses) AND id.type = :listType AND iv.identifier_value IN (:valuesToAdd)";
        List<String> statuses = Arrays.stream(ListStatus.values())
            .map(ListStatus::getValue)
            .collect(Collectors.toList());
        MapSqlParameterSource parameters = new MapSqlParameterSource();
        parameters.addValue("statuses", statuses);
        parameters.addValue("listType", listType.toString());
        parameters.addValue("valuesToAdd", valuesToAdd);
        List<String> duplicateValues = jdbcTemplate.query(query, parameters, new DuplicateListValueMapper());
        if (isNotEmpty(duplicateValues)) {
            LOGGER.info("Fetched duplicate list value entities");
        } else {
            LOGGER.info("Could not find duplicate list value entities");
        }
    }
}
EDIT - 1
I came across this post where others faced a similar issue while running a select query using PreparedStatement on MS SQL Server. Is there any such property like "sendStringParametersAsUnicode" available in MySQL?
EDIT - 2
Tried enabling a few MySQL performance-related properties. Still the same result.
jdbc:mysql://localhost:3306/DBNAME?useSSL=true&allowPublicKeyRetrieval=true&useServerPrepStmts=true&generateSimpleParameterMetadata=true&rewriteBatchedStatements=true&cacheResultSetMetadata=true&cachePrepStmts=true&cacheCallableStmts=true
I think you should enable "show_sql" in JPA and then try; I think it's running multiple queries because of lazy loading, which may be why it's taking time.
Composite indexes to add to the tables:
id: INDEX(type, status, definition_id)
id: INDEX(definition_id, type, status)
iv: INDEX(identifier_value, definition_id)
iv: INDEX(definition_id, identifier_value)
For jdbc, the connection parameters should include something like
?useUnicode=yes&characterEncoding=UTF-8
For further discussion, please provide SHOW CREATE TABLE for each table and EXPLAIN SELECT... for any query in question.
Instead of passing the list to the IN clause, pass the list as a comma-separated string and split it in the query using
select value from string_split(:valuesToAdd, ',')
So your query will look like this
SELECT iv.* FROM identifier_definition id
INNER JOIN identifier_list_values iv on id.definition_id = iv.definition_id
where id.status IN (:statuses) AND id.type = :listType AND iv.identifier_value
IN (select value from string_split(:valuesToAdd, ','))
string_split is a function in SQL Server; MySQL might have a similar one.
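If MySQL has no direct equivalent, one alternative worth trying is to load the values into a temporary table with a batched insert and join against it instead of using the huge IN list; a rough sketch (temporary table name and column size are illustrative):
// Sketch only: the 10k values go into a temp table via a PreparedStatement batch,
// and the main query joins against it, keeping a single cacheable statement.
try (Connection conn = dataSource.getConnection()) {
    try (Statement st = conn.createStatement()) {
        st.execute("CREATE TEMPORARY TABLE tmp_values (identifier_value VARCHAR(255) PRIMARY KEY)");
    }
    try (PreparedStatement ps = conn.prepareStatement("INSERT INTO tmp_values (identifier_value) VALUES (?)")) {
        for (String value : valuesToAdd) {
            ps.setString(1, value);
            ps.addBatch();
        }
        ps.executeBatch();
    }
    String query = "SELECT iv.* FROM identifier_definition id "
        + "INNER JOIN identifier_list_values iv ON id.definition_id = iv.definition_id "
        + "INNER JOIN tmp_values tv ON tv.identifier_value = iv.identifier_value "
        + "WHERE id.status IN (?, ?, ?) AND id.type = ?";
    // bind the three statuses and the list type, execute, and map the rows;
    // drop the temporary table afterwards (or rely on the connection being reset)
}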

Insert row if query condition met JPA and spring

I think this question is similar to Data base pessimistic locks with Spring data JPA (Hibernate under the hood) but thought I would ask separately as not exactly the same.
I have a multi-threaded/multi-node Spring Boot application on top of a MariaDB database with a table like:
CREATE TABLE job (
id INT PRIMARY KEY AUTO_INCREMENT,
owner VARCHAR(50),
status VARCHAR(10) );
Have a Job domain class as you'd expect.
Have a JobRepository interface which extends CrudRepository<Job,Integer> and a service class.
The application rule is that we cannot insert a new job if one already exists with the same owner and a status in a given set of values. For example, if this were old-school native SQL I would just:
START TRANSACTION;
INSERT INTO job (owner, status)
SELECT 'fred', 'init' FROM DUAL
WHERE NOT EXISTS
( SELECT 1 FROM job
WHERE owner = 'fred' AND status IN ('init', 'running')
);
COMMIT;
But how do I do this in JPA/CrudRepository?
I split it into two DB operations. I defined a repository method:
@Lock(LockModeType.READ)
long countByOwnerAndStatusIn(String owner, List<String> status);
And then had a service method like:
@Transactional
public Job createJob(Job job) throws InterruptedException {
    if (jobRepository.countByOwnerAndStatusIn(job.getOwner(), Job.PROGRESS_STATUS) == 0) {
        // Sleeps just to ensure conflicts
        Thread.sleep(1000);
        Job newJob = jobRepository.save(job);
        Thread.sleep(1000);
        return newJob;
    } else {
        return null;
    }
}
But with this I do not get the desired effect.
LockModeType of READ and WRITE allow duplicates to be created.
LockModeType of PESSIMISTIC_READ and PESSIMISTIC_WRITE can result in deadlock errors.
So I guess I am after one of two options:
Is there a way to get the INSERT...SELECT WHERE NOT EXISTS into a JPA/CrudRepository method?
Is there a way to get the service method to effectively wrap the count check and the insert in the same lock?
If there is no way to do either, I guess I'll try to get access to the underlying JDBC connection and explicitly run a LOCK TABLE statement (or the insert...select, but I don't like the idea of that; keeping it JPA-like is probably better).
Hope I have explained myself properly. Thanks in advance for your help.
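One way to keep the INSERT ... SELECT ... WHERE NOT EXISTS shape while staying inside Spring Data is a native modifying query; a rough sketch, assuming collection binding works for the native IN clause on your Hibernate version (method and parameter names are illustrative):
public interface JobRepository extends CrudRepository<Job, Integer> {

    // Sketch only: expresses the native SQL from the question as a Spring Data query,
    // returning the number of rows actually inserted (0 if a matching job already exists).
    @Modifying
    @Query(value = "INSERT INTO job (owner, status) " +
                   "SELECT :owner, :status FROM DUAL " +
                   "WHERE NOT EXISTS (SELECT 1 FROM job " +
                   "WHERE owner = :owner AND status IN (:activeStatuses))",
           nativeQuery = true)
    int insertIfNoActiveJob(@Param("owner") String owner,
                            @Param("status") String status,
                            @Param("activeStatuses") List<String> activeStatuses);
}
The calling service method still needs to be @Transactional, and the concurrency caveats from the question still depend on the database's isolation level.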

Java MySQL query execution time

I'm having a problem with mysql queries taking a little too long to execute.
private void sqlExecute(String sql, Map<Integer, Object> params) {
    try (Connection conn = dataSource.getConnection(); PreparedStatement statement = conn.prepareStatement(sql)) {
        if (params.size() > 0) {
            for (Integer key : params.keySet()) {
                statement.setObject(key, params.get(key));
            }
        }
        statement.executeUpdate();
    } catch (SQLException e) {
        e.printStackTrace();
    }
}
I've narrowed the problem down to the executeUpdate() line specifically. Everything else runs smoothly, but this particular line (and when I run executeQuery() as well) takes around 70ms to execute. This may not seem like an unreasonable amount of time, but currently this is a small test db table with under 100 rows. Columns are indexed, so a typical query is only looking at around 15 rows.
Ultimately however, we'll need to scan much larger tables with thousands of rows. Additionally, we're running numerous queries at a time (they can't really be batched because the results of each query are used for future queries), so all of those queries together are taking more like 7s.
Here's an example of a method for running a mysql query:
public void addRating(String db, int user_id, int item_id) {
    Map<Integer, Object> parameters = new HashMap<>();
    parameters.put(1, user_id);
    parameters.put(2, item_id);
    String sql = "INSERT IGNORE INTO " + db + " (user_id, item_id) VALUES (?, ?)";
    sqlExecute(sql, parameters);
}
A couple of things to note:
The column indexing is probably not the problem. When running the same mysql statements in our phpMyAdmin console, execution time is more like 0.3ms.
Also of note is that the execution time is consistently 70ms, regardless of the actual mysql statement.
I'm using connection pooling and wonder if this is possibly a source of the problem. In other tests, dataSource.getConnection() also takes about 70ms. I'm running this code locally.
The above example is for an INSERT using executeUpdate(), but the same problem happens for SELECT statements using executeQuery().
I have tried using /dev/urandom per Oracle's suggestion, but this made no difference.
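Since dataSource.getConnection() itself takes roughly the same 70 ms, it may be worth checking whether the pool actually keeps idle connections open, so each call reuses one instead of opening a new connection. A minimal sketch with HikariCP (assuming that or a similar pool is in use; values are illustrative):
// Sketch only: minimumIdle keeps warm connections in the pool, and the Connector/J
// statement cache avoids re-preparing the same statements on every call.
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://localhost:3306/DBNAME");
config.setUsername("user");      // illustrative credentials
config.setPassword("password");
config.setMinimumIdle(5);
config.setMaximumPoolSize(20);
config.addDataSourceProperty("cachePrepStmts", "true");
config.addDataSourceProperty("prepStmtCacheSize", "250");
HikariDataSource dataSource = new HikariDataSource(config);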

JDBC batch update skipping out on initial set of update statements

We are using JDBC batch update (Statement - void addBatch( String sql ) and int[] executeBatch()) in our Java code. The job is supposed to insert about 27k records in a table and then update about 18k records in a subsequent batch.
When our job runs at 6am, it is missing a few thousand records (we observed this from the database audit logs). We can see from the job logs that the update statements are being generated for all the 18k records. We understand that all the update statements get added in sequence to the batch; however, only records from the beginning of the batch seem to be missing. Also, it is not a fixed number every day - one day it skips the first 4534 update statements, another day it skips the first 8853 records, and another day it skips 5648 records.
We initially thought this could be a thread issue but have since moved away from that thought process as the block being skipped out does not always contain the same number of update statements. If we assume that the first few thousand updates are happening even before the insert, then the updates should at least show up in the database audit logs. However, this is not the case.
We are thinking this is due to a memory/heap issue as running the job at any other time picks up all the 18k update statements and they are executed successfully. We reviewed the audit logs from the Oracle database and noticed that the missing update statements are never executed on the table during the 6am run. At any other time, all the update statements are showing up in the database audit logs.
This job was running successfully for almost 3 years now and this behavior started only from a few weeks ago. We tried to look at any changes to the server/environment but nothing jumps out at us.
We are trying to pinpoint why this is happening, specifically, if there are any processes that are using up too much of the JVM heap and as a result, our update statements are getting overwritten/not being executed.
Database: Oracle 11g Enterprise Edition Release 11.2.0.3.0 - 64bit
Java: java version "1.6.0_51"
Java(TM) SE Runtime Environment (build 1.6.0_51-b11)
Java HotSpot(TM) Server VM (build 20.51-b01, mixed mode)
void main()
{
    DataBuffer dataBuffer; // assume that all the selected data to be updated is stored in this object
    List<String> transformedList = transform(dataBuffer);
    int status = bulkDML(transformedList);
}

public List<String> transform(DataBuffer i_SourceData)
{
    // i_SourceData has all the data selected from
    // the source table that has to be updated
    List<Row> allRows = i_SourceData.getAllRows();
    List<String> allColumns = i_SourceData.getColumnNames();
    List<String> transformedList = new ArrayList<String>();
    for (Row row : allRows)
    {
        int index = allColumns.indexOf("unq_idntfr_col");
        String unq_idntfr_val = (String) row.getFieldValues().get(index);
        index = allColumns.indexOf("col1");
        String val1 = (String) row.getFieldValues().get(index);
        // this query is not the issue either - it is parameterized in our code
        String query = "UPDATE TABLE SET col1 = " + val1 + " where unq_idntfr_col=" + unq_idntfr_val;
        transformedList.add(query);
    }
    return transformedList;
}

public int bulkDML(List<String> i_QueryList)
{
    Connection connection = getConnection();
    Statement statement = getStatement(connection);
    try
    {
        connection.setAutoCommit(false);
        for (String query : i_QueryList)
        {
            statement.addBatch(query);
        }
        statement.executeBatch();
        connection.commit();
    }
    // handle various exceptions and all of them return -1
    // not pertinent to the issue at hand
    catch (Exception e)
    {
        return -1;
    }
    CloseResources(connection, statement, null);
    return 0;
}
Any suggestions would be greatly appreciated, thank you.
If you want to execute multiple updates on the same table then I suggest modifying your query to use binds and a PreparedStatement because that's really the only way to do real DML batching with the Oracle Database. For example your query would become:
UPDATE TABLE SET col1=? WHERE unq_idntfr_col=?
and then use JDBC batching with the same PreparedStatement. This change would require you to revisit your bulkDML method to make it take bind values as parameters instead of SQL.
The JDBC pseudo code would then look like this:
PreparedStatement pstmt = connection.prepareStatement("UPDATE TABLE SET col1=? WHERE unq_idntfr_col=?");
pstmt.setXXX(1, x);
pstmt.setYYY(2, y);
pstmt.addBatch();
pstmt.setXXX(1, x);
pstmt.setYYY(2, y);
pstmt.addBatch();
pstmt.setXXX(1, x);
pstmt.setYYY(2, y);
pstmt.addBatch();
pstmt.executeBatch();
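Applied to the bulkDML method from the question, the reworked version might look roughly like this (a sketch; it assumes each element carries the col1 value and the unique identifier as a pair, and it reuses the existing getConnection/CloseResources helpers):
// Sketch only: one parameterized UPDATE reused for the whole batch instead of
// thousands of distinct SQL strings.
public int bulkDML(List<Object[]> bindValues)
{
    Connection connection = getConnection();
    PreparedStatement pstmt = null;
    try
    {
        connection.setAutoCommit(false);
        pstmt = connection.prepareStatement("UPDATE TABLE SET col1 = ? WHERE unq_idntfr_col = ?");
        for (Object[] row : bindValues)
        {
            pstmt.setObject(1, row[0]); // col1 value
            pstmt.setObject(2, row[1]); // unq_idntfr_col value
            pstmt.addBatch();
        }
        pstmt.executeBatch();
        connection.commit();
    }
    catch (Exception e)
    {
        return -1;
    }
    finally
    {
        CloseResources(connection, pstmt, null);
    }
    return 0;
}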

How should I reuse prepared statements to perform multiple inserts in Java?

So, I have a collection of DTOs that I need to save off. They are backed with a temporary table, and they also need to have their data inserted into a "real" table.
I don't have time to do the proper batch process of these records, and the expected number of results, while it can be theoretically very high, is probably around 50 or less anyways. There are several other issues with this application (it's a real cluster**), so I just want to get something up and running for testing purposes.
I was thinking of doing the following pseudocode (in a transaction):
PreparedStatement insert1 = con.prepareStatement(...);
PreparedStatement insert2 = con.prepareStatement(...);

for (DTO dto : dtos) {
    prepareFirstInsertWithParameters(insert1, dto);
    insert1.executeUpdate();

    prepareSecondInsertWithParameters(insert2, dto);
    insert2.executeUpdate();
}
First off, will this work as is - can I reuse the prepared statement without executing clearParameters(), or do I have to do a close() on them, or keep getting more prepared statements?
Secondly, aside from batching, is there a more efficient (and cleaner) way of doing this?
This is easy:
Connection conn = dataSource.getConnection();
conn.setAutoCommit(false);
PreparedStatement pStatement = conn.prepareStatement(sqlStr);

ListIterator<DTO> dtoIterator = dtoList.listIterator();
while (dtoIterator.hasNext()) {
    DTO myDTO = dtoIterator.next();
    pStatement.setInt(1, myDTO.getFlibble());
    pStatement.setInt(2, myDTO.getNuts());
    pStatement.addBatch();
}

int[] recordCount = pStatement.executeBatch();
conn.commit();
MetroidFan2002,
I don't know what you mean by 'aside from batching', but I'm assuming you mean executing a single batch SQL statement. You can, however, batch the prepared statement calls, which will improve performance by submitting multiple calls at a time:
PreparedStatement insert1 = con.prepareStatement(...);
PreparedStatement insert2 = con.prepareStatement(...);

for (DTO dto : dtos) {
    prepareFirstInsertWithParameters(insert1, dto);
    prepareSecondInsertWithParameters(insert2, dto);
    insert1.addBatch();
    insert2.addBatch();
}

insert1.executeBatch();
insert2.executeBatch();
// cleanup
Now if your dataset can grow large, like you alluded to, you'll want to put some logic in to flush the batch every N number of rows, where N is a value tuned to the optimal performance for your setup.
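Something along these lines (a sketch; the batch size of 500 is illustrative):
// Sketch only: execute the accumulated batches every N rows so they never grow unbounded.
final int batchSize = 500;
int count = 0;
for (DTO dto : dtos) {
    prepareFirstInsertWithParameters(insert1, dto);
    prepareSecondInsertWithParameters(insert2, dto);
    insert1.addBatch();
    insert2.addBatch();
    if (++count % batchSize == 0) {
        insert1.executeBatch();
        insert2.executeBatch();
    }
}
insert1.executeBatch(); // flush any remaining rows
insert2.executeBatch();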
JDBC supports Batch Insert/Update. See example here.
