I've got an application that parses log files and inserts a huge amount of data into a database. It's written in Java and talks to a MySQL database over JDBC. I've experimented with different ways to insert the data to find the fastest one for my particular use case. The one that currently seems to be the best performer is to issue an extended insert (i.e. a single insert with multiple rows), like this:
INSERT INTO the_table (col1, col2, ..., colN) VALUES
(v1, v2, v3, ..., vN),
(v1, v2, v3, ..., vN),
...,
(v1, v2, v3, ..., vN);
The number of rows can be tens of thousands.
I've tried using prepared statements, but it's nowhere near as fast, probably because each insert is still sent to the DB separately and the table needs to be locked and whatnot. My colleague who worked on the code before me tried using batching, but that didn't perform well enough either.
The problem is that using extended inserts means that, as far as I can tell, I need to build the SQL string myself (since the number of rows is variable), and that opens up all sorts of SQL injection vectors that I'm nowhere near intelligent enough to find myself. There's got to be a better way to do this.
Obviously I escape the strings I insert, but only with something like str.replace("\"", "\\\""); (repeated for ', ? and \), and I'm sure that isn't enough.
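For what it's worth, one way to keep the extended-insert shape without concatenating values by hand is to generate only the placeholder groups dynamically and let a PreparedStatement bind every value. A minimal sketch (table and column names mirror the example above; three columns assumed):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ExtendedInsertSketch {
    // Each Object[] holds the column values for one row (three columns assumed).
    static void insertChunk(Connection con, List<Object[]> rows) throws SQLException {
        StringBuilder sql = new StringBuilder(
                "INSERT INTO the_table (col1, col2, col3) VALUES ");
        for (int i = 0; i < rows.size(); i++) {
            sql.append(i == 0 ? "(?, ?, ?)" : ", (?, ?, ?)");
        }
        try (PreparedStatement ps = con.prepareStatement(sql.toString())) {
            int idx = 1;
            for (Object[] row : rows) {
                for (Object value : row) {
                    ps.setObject(idx++, value); // the driver handles escaping and typing
                }
            }
            ps.executeUpdate();
        }
    }
}
Only the statement structure (the number of placeholder groups) is built as a string; the values themselves never touch the SQL text.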
prepared statements + batch insert:
PreparedStatement stmt = con.prepareStatement(
"INSERT INTO employees VALUES (?, ?)");
stmt.setInt(1, 101);
stmt.setString(2, "Paolo Rossi");
stmt.addBatch();
stmt.setInt(1, 102);
stmt.setString(2, "Franco Bianchi");
stmt.addBatch();
// as many as you want
stmt.executeBatch();
I would try batching your inserts and see how that performs.
Have a read of this (http://www.onjava.com/pub/a/onjava/excerpt/javaentnut_2/index3.html?page=2) for more information on batching.
If you are loading tens of thousands of records then you're probably better off using a bulk loader.
http://dev.mysql.com/doc/refman/5.0/en/load-data.html
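If you go the bulk-loader route from Java, the LOAD DATA statement can be issued over the same JDBC connection. A rough sketch (the CSV path, URL and credentials are hypothetical, and depending on driver/server settings the LOCAL variant may require allowLoadLocalInfile=true):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadDataSketch {
    public static void main(String[] args) throws Exception {
        // allowLoadLocalInfile controls whether Connector/J permits LOCAL loads.
        String url = "jdbc:mysql://localhost:3306/mydb?allowLoadLocalInfile=true";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement stmt = con.createStatement()) {
            // /tmp/rows.csv is a hypothetical file produced by the log parser.
            stmt.execute("LOAD DATA LOCAL INFILE '/tmp/rows.csv' "
                    + "INTO TABLE the_table "
                    + "FIELDS TERMINATED BY ',' "
                    + "LINES TERMINATED BY '\\n' "
                    + "(col1, col2, col3)");
        }
    }
}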
Regarding the difference between extended inserts and batching single inserts: the reason I decided to use extended inserts is that I noticed my code took a lot longer to insert a lot of rows than MySQL does from the terminal, even though I was batching inserts in batches of 5000. The solution in the end was to use extended inserts.
I quickly retested this theory.
I took two dumps of a table with 1.2 million rows. One using the default extended insert statements you get with mysqldump and the other using:
mysqldump --skip-extended-insert
Then I simply imported the files again into new tables and timed it.
The extended insert test finished in 1m35s and the other in 3m49s.
The full answer is to use the rewriteBatchedStatements=true configuration option along with dfa's answer of using a batched statement.
The relevant mysql documentation
A worked MySQL example
Related
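For completeness, a minimal sketch of combining rewriteBatchedStatements=true with the batched PreparedStatement shown earlier (host, schema and credentials are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class RewriteBatchSketch {
    public static void main(String[] args) throws Exception {
        // With rewriteBatchedStatements=true, Connector/J can collapse the batch
        // into multi-row INSERTs behind the scenes.
        String url = "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement stmt = con.prepareStatement(
                     "INSERT INTO employees VALUES (?, ?)")) {
            stmt.setInt(1, 101);
            stmt.setString(2, "Paolo Rossi");
            stmt.addBatch();
            stmt.setInt(1, 102);
            stmt.setString(2, "Franco Bianchi");
            stmt.addBatch();
            stmt.executeBatch();
        }
    }
}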
Requirement:
I have to insert a row into a database table multiple times (perhaps 50,000 times in a batch job).
Database used: MS SQL Server
Approach taken:
I used NamedParameterJdbcTemplate and PreparedStatement to implement the above.
For example,
String query = "insert into table_Name (Column1, Column2, column3) values (?,?,?)";
Then, with the help of the PreparedStatement, I assigned dynamic values to the ? fields in the insert query and triggered the executeUpdate function of NamedParameterJdbcTemplate to execute the insert query.
The process above is repeated many times, each time creating a new insert query with different values in the ? fields with the help of the PreparedStatement.
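As I read it, the per-row flow is roughly the following (a sketch only; the real wiring and values are not shown in the question, and NamedParameterJdbcTemplate normally takes named :parameters rather than ? placeholders):
import java.util.List;
import javax.sql.DataSource;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

class PerRowInsertSketch {
    // One update() round trip per record, repeated ~50,000 times in the batch job.
    static void insertOneByOne(DataSource dataSource, List<Object[]> rows) {
        NamedParameterJdbcTemplate jdbc = new NamedParameterJdbcTemplate(dataSource);
        String query = "insert into table_Name (Column1, Column2, column3) "
                + "values (:v1, :v2, :v3)";
        for (Object[] row : rows) {
            MapSqlParameterSource params = new MapSqlParameterSource()
                    .addValue("v1", row[0])
                    .addValue("v2", row[1])
                    .addValue("v3", row[2]);
            jdbc.update(query, params); // separate statement execution for every row
        }
    }
}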
Issue:
Performance of the executeUpdate function of NamedParameterJdbcTemplate is very slow.
After some research, I came across the explanation below:
"The cost based optimizer (that 's what we 're talking about, not) makes its
choices based on the availability of indexes (among other objects), and the
distribution of values in the indexes (how selective the index will be for a
given value). Obviuosly, when working with bind variables, the suitability
of the index from a distribution point of view is harder to determine. The
optimizer has no way to determine beforehand to what value matches will be
sought. This might (should) lead to another execution plan. No surprise
here, as far as I am concerned."
https://bytes.com/topic/oracle/answers/65559-jdbc-oracle-beware-bind-variables
A PreparedStatement has two advantages over a regular Statement:
We add parameters to the SQL using methods instead of putting them inside the SQL query itself. With this we avoid SQL injection attacks and also let the driver do type conversions for us.
The same PreparedStatement can be called with different parameters, and the database engine can reuse the query execution plan.
It seems that NamedParameterJdbcTemplate helps us with the first advantage, but does nothing for the latter.
Query:
If NamedParameterJdbcTemplate does not help us with the second advantage, what is the alternative solution to NamedParameterJdbcTemplate?
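One option to look at (a hedged sketch, not something from the original post) is to keep NamedParameterJdbcTemplate but push all rows through its batchUpdate method, so the statement is prepared once and executed for each parameter set:
import java.util.List;
import javax.sql.DataSource;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;
import org.springframework.jdbc.core.namedparam.SqlParameterSource;

class NamedParamBatchSketch {
    // Column names follow the question; the Object[3] row shape is a hypothetical stand-in.
    static int[] insertAll(DataSource dataSource, List<Object[]> rows) {
        NamedParameterJdbcTemplate jdbc = new NamedParameterJdbcTemplate(dataSource);
        String sql = "insert into table_Name (Column1, Column2, column3) "
                + "values (:v1, :v2, :v3)";
        SqlParameterSource[] batch = rows.stream()
                .map(r -> new MapSqlParameterSource()
                        .addValue("v1", r[0])
                        .addValue("v2", r[1])
                        .addValue("v3", r[2]))
                .toArray(SqlParameterSource[]::new);
        return jdbc.batchUpdate(sql, batch); // one prepared statement, many executions
    }
}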
I came across the statement below about the performance improvement we get with the JDBC PreparedStatement class.
If you submit a new, full SQL statement for every query or update to
the database, the database has to parse the SQL and for queries create
a query plan. By reusing an existing PreparedStatement you can reuse
both the SQL parsing and query plan for subsequent queries. This
speeds up query execution, by decreasing the parsing and query
planning overhead of each execution.
Let's say I am creating the statement and providing different values when running the queries, like this:
String sql = "update people set firstname=? , lastname=? where id=?";
PreparedStatement preparedStatement =
connection.prepareStatement(sql);
preparedStatement.setString(1, "Gary");
preparedStatement.setString(2, "Larson");
preparedStatement.setLong (3, 123);
int rowsAffected = preparedStatement.executeUpdate();
preparedStatement.setString(1, "Stan");
preparedStatement.setString(2, "Lee");
preparedStatement.setLong (3, 456);
rowsAffected = preparedStatement.executeUpdate();
Will I still get the performance benefit, given that I am setting different values each time, so the final query that is generated changes based on the values?
Can you please explain exactly when we get the performance benefit? Do the values also have to be the same?
When you use a prepared statement (i.e. a pre-compiled statement), as soon as the DB gets the statement it compiles it and caches it, so that the compiled statement can be reused for successive calls of the same statement. So it is pre-compiled for successive calls.
You generally use a prepared statement with bind variables, where you provide the variables at run time. On successive executions of the prepared statement you can provide variables that differ from the previous calls. From the DB's point of view, it does not have to compile the statement every time; it just plugs in the bind variables at run time. So it is faster.
Another advantage of prepared statements is protection against SQL injection attacks.
So the values do not have to be the same.
Although it is not obvious, SQL is not a scripting language but a "compiled" one, and this compilation, a.k.a. optimization, a.k.a. hard parse, is a very expensive task. Oracle has a lot of work to do: it must parse the query, resolve table names, validate access privileges, perform some algebraic transformations, and then find an effective execution plan. Oracle (and other databases too) can join only TWO tables at a time, not more. It means that when you join several tables in SQL, Oracle has to join them one by one, i.e. if you join n tables in a query there can be up to n! possible execution plans. By default, Oracle is limited to 8000 permutations when searching for an "optimal" (not the best) execution plan.
So the compilation (hard parse) might be more expensive than the query execution itself. In order to spare resources, Oracle shares execution plans between sessions in a memory structure called the library cache. And here another problem can occur: too much parsing requires exclusive access to a shared resource.
So if you do too much (hard) parsing, your application cannot scale; sessions end up blocking each other.
On the other hand, there are situations where bind variables are NOT helpful.
Imagine such a query:
update people set firstname=? , lastname=? where group=? and deleted='N'
Since the column deleted is indexed and Oracle knows that 98% of the values are 'Y' and only 2% are 'N', it will decide to use the index on the deleted column. If you used a bind variable for the condition on the deleted column, Oracle could not find an effective execution plan, because the plan also depends on input that is unknown at compilation time.
(PS: since 11g it is more complicated with bind variable peeking)
I was writing test cases for a query that uses the CONNECT BY hierarchical clause.
It seems that there is no support for this clause in HSQLDB.
Are there any alternatives for testing the query, or for writing a different query that does the same thing?
The query is simple:
SELECT seq.nextval
FROM DUAL
CONNECT BY level <= ?
Thanks.
You don't need a recursive query for that.
To generate a sequence of numbers you can use sequence_array
select *
from unnest(sequence_array(1, ?, 1))
More details are in the manual:
http://hsqldb.org/doc/2.0/guide/builtinfunctions-chapt.html#N14088
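In a JDBC test case the count can be bound as a parameter, just like in the original Oracle query. A small sketch (connection setup omitted):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class SequenceArraySketch {
    // Prints the numbers 1..count generated by HSQLDB's SEQUENCE_ARRAY.
    static void printNumbers(Connection con, int count) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "select * from unnest(sequence_array(1, ?, 1))")) {
            ps.setInt(1, count);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1));
                }
            }
        }
    }
}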
If you need this to advance a sequence by a specific number of entries, you can use something like this:
select NEXT VALUE FOR seq
from unnest(sequence_array(1, 20, 1));
If you need this to set the sequence to a new value, that is much easier in HSQLDB:
ALTER SEQUENCE seq restart with 42;
If you are looking for a recursive query, then HSQLDB supports the ANSI SQL standard for that: recursive common table expressions, which are documented in the manual:
http://hsqldb.org/doc/2.0/guide/dataaccess-chapt.html#dac_with_clause
According to this 2-year-old ticket, only Oracle and a database called CUBRID have CONNECT BY capability. If you really want it, maybe you could vote on the ticket. However, as far as I have been able to tell, there are only two people working on the project, so don't hold your breath.
I'm trying to set up a while loop that inserts multiple rows into a MySQL table using the JDBC driver in Java. The idea is that I end up with a statement along the lines of:
INSERT INTO table (column1, column2) VALUES (column1, column2), (column1, column2);
I want to set up this statement using a java.sql.PreparedStatement, but I'd like to prepare small bits of the statement, one row at a time - mainly because the number of entries will be dynamic, and this seems like the best way to create one big query.
This requires the small parts to be 'merged' together every time another chunk is generated. How do I merge them? Or would you suggest forgetting about this idea and simply executing thousands of INSERT statements?
Thank you,
Patrick
It somewhat depends on how often you plan to run this loop to execute thousands of statements, but this is one of the exact purposes of prepared statements and stored procedures: since the query does not have to be recompiled on each execution, you get potentially large performance gains when querying in a loop compared with plain SQL statement execution, which must be compiled and executed on every loop iteration.
Those gains may still not match the performance of a prepared statement built up into one long multi-row insert in a loop, as you're asking, but it will be simpler to code. I would recommend staying with an execution loop unless the performance becomes problematic.
Better to prepare one PreparedStatement and reuse it as much as you want:
INSERT INTO your_table (column1, column2) VALUES (?, ?)
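A small sketch of that reuse, combined here with addBatch()/executeBatch() so the rows still go to the server in groups (the table name and row shape are placeholders, not from the question):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class ReuseStatementSketch {
    static void insertAll(Connection con, List<String[]> rows) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO your_table (column1, column2) VALUES (?, ?)")) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch(); // queue the row; nothing is sent yet
            }
            ps.executeBatch(); // send all queued rows in one go
        }
    }
}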
All,
I have to redesign an existing logging system used in a web application. The existing system reads records from an Excel sheet, processes them (data validation), records the error messages for each entry into the database as soon as an error is found, and displays the result for all the records at the end. So,
if I have 2 records in the Excel sheet, R1 and R2, and both fail with 3 validation errors each, an insert query is fired 6 times, once per validation message, and the user sees all 6 messages at the end of the validation process.
This method worked for a smaller set of entries, but for 20,000 records it has obviously become a bottleneck.
As per my initial redesign approach, the following are the options I need suggestions on from everyone at SO:
1> Create a custom logger class with all the required information for logging and for each record in error, store the record ID as key and the Logger class object as value in a HashMap. When all the records are processed completely, perform database inserts for all the records in the HashMap in one shot.
2> Fire SQL inserts periodically, i.e. for X records in total, process Y <= X records each time, perform the insert operation once, and then continue processing the remaining records.
We really do not have a set criteria at this point except for definitely improving the performance.
Can everyone please provide feedback as to what would be an efficient logging system design, and whether there are better approaches than what I mentioned above?
I would guess your problems are due to the fact that you are doing row-based operations rather than set-based ones?
A set-based operation would be the quickest way to load the data. If that is not possible, I would go with inserting X records at a time, as it is more scalable; inserting them all at once would require ever-increasing amounts of memory (but would probably be quicker).
There is a good discussion on Ask Tom here: http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1583402705463
Instead of memorizing every error in a HashMap, you could try (provided the DBMS supports it) batching all those insert statements together and firing them at the end. Somewhat like this:
PreparedStatement ps = connection.prepareStatement("INSERT INTO table (...) values (?, ?, ...)");
for (...) {
    ps.setString(1, ...);
    ...
    ps.addBatch();
}
int[] results = ps.executeBatch();
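If holding 20,000 queued rows in a single batch ever becomes a memory concern, the same idea can be flushed every N rows instead. A sketch under assumed table and column names (error_log, record_id, message) and an arbitrary chunk size of 1,000:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class ChunkedBatchSketch {
    // errors: one {recordId, message} pair per validation error (hypothetical shape).
    static void logErrors(Connection connection, List<String[]> errors) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO error_log (record_id, message) VALUES (?, ?)")) {
            int count = 0;
            for (String[] row : errors) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch(); // flush every 1,000 rows to bound memory
                }
            }
            ps.executeBatch(); // flush the remainder
        }
    }
}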