Insert to database a lot of data with batch insert

Insert to database a lot of data with batch insert - java

I create a program that inserts to MySql database millions of values.
I read about batch insert that will optimize my program and make it faster but when I tried to do it, it worked in the same way.
Instead of inserting each value to the database I kept in a list each time 500 values and then insert them in one big loop like this:
for(int i=0;i<500;i++)
{
insertData(list.get(i));
}
Then i remove all the values in the list and start collecting 500 values again.
Shouldn't it work better?
My Insert code is:
public void insertToNameTable(String id,String name) throws SQLException
{
PreparedStatement ps=null;
ps= conn.prepareStatement("INSERT INTO NameTable values(?,?,?)",user.getId(),user.getName());
ps.setString(1,id);
ps.setString(2,name);
ps.setBoolean(3,false);
ps.executeUpdate();
}
I have some questions:
1.why it isn't work faster when i do batch insert?
2.how many values I should enter each time in order to make it faster?(500,1000,10000) the more values enter together is better?
3. is the way I insert the values to my database is the best way?

This is the efficient way for batch insert.
Connection connection = new getConnection();
Statement statement = connection.createStatement();
for (String query : queries) {
statement.addBatch(query);
}
statement.executeBatch();
statement.close();
connection.close();

Questions 1&2:
User Neil Coffey some time ago said:
The notion that prepared statements are primarily about performance is something of a misconception, although it's quite a common one.
Another poster mentioned that he noted a speed improvement of about 20% in Oracle and SQL Server. I've noted a similar figure with MySQL. It turns out that parsing the query just isn't such a significant part of the work involved. On a very busy database system, it's also not clear that query parsing will affect overall throughput: overall, it'll probably just be using up CPU time that would otherwise be idle while data was coming back from the disk.
So as a reason for using prepared statements, the protection against SQL injection attacks far outweighs the performance improvement. And if you're not worried about SQL injection attacks, you probably should be...
Here is the original post:
PreparedStatements and performanceand in my opinion all answers are worth reading. I think you expect PreparedStatement to be some kind of magician that would increase your inserting speed significantly and that is why you are disappointed with the improvement you get.
Question 3:
The proper way of using PreparedStatement is preparing a statement and then setting values and updating database in a loop. Here is a good example: Reusing a PreparedStatement multiple times

Related

How prepared statement improves performance [duplicate]

This question already has answers here:
Difference between Statement and PreparedStatement
(15 answers)
Closed 7 years ago.
I came across below statement that tells about the performance improvement that we get with JDBC PreparedStatement class.
If you submit a new, full SQL statement for every query or update to
the database, the database has to parse the SQL and for queries create
a query plan. By reusing an existing PreparedStatement you can reuse
both the SQL parsing and query plan for subsequent queries. This
speeds up query execution, by decreasing the parsing and query
planning overhead of each execution.
Let's say I am creating the statement and providing different values while running the queries like this :
String sql = "update people set firstname=? , lastname=? where id=?";
PreparedStatement preparedStatement =
connection.prepareStatement(sql);
preparedStatement.setString(1, "Gary");
preparedStatement.setString(2, "Larson");
preparedStatement.setLong (3, 123);
int rowsAffected = preparedStatement.executeUpdate();
preparedStatement.setString(1, "Stan");
preparedStatement.setString(2, "Lee");
preparedStatement.setLong (3, 456);
int rowsAffected = preparedStatement.executeUpdate();
Then will I still get performance benefit, because I am trying to set different values so I can the final query generated is changing based on values.
Can you please explain exactly when we get the performance benefit? Should the values also be same?

When you use prepared statement(i.e pre-compiled statement), As soon as DB gets this statement, it compiles it and caches it so that it can use the last compiled statement for successive call of same statement. So it becomes pre-compiled for successive calls.
You generally use prepared statement with bind variables where you provide the variables at run time. Now what happens for successive execution of prepared statements, you can provide the variables which are different from previous calls. From DB point of view, it does not have to compile the statement every time, will just insert the bind variables at rum time. So becomes faster.
Other advantages of prepared statements is its protection against SQL-injection attack
So the values does not have to be same

Although it is not obvious SQL is not scripting but a "compiled" language. And this compilation aka. optimization aka hard-parse is very exhaustive task. Oracle has a lot of work to do, it must parse the query, resolve table names, validate access privileges, perform some algebraic transformations and then it has to find effective execution plan. Oracle (and other databases too) can join only TWO tables - not more. It means then when you join several tables in SQL, Oracle has to join them one-by-one. i.e. if you join n tables in a query there can be at least up to n! possible execution plans. By default Oracle is limited up to 8000 permutations when search for "optimal" (not the best one) execution plan.
So the compilation(hard-parse) might be more exhaustive then query execution itself. In order to spare resources, Oracle shares execution plans between sessions in a memory structure called library cache. And here another problem might occur, too many parsing require exclusive access to a shared resource.
So if you do too many (hard) parsing your application can not scale - sessions are blocking each other.
On the other hand, there are situations where bind variables are NOT helpful.
Imagine such a query:
update people set firstname=? , lastname=? where group=? and deleted='N'
Since the column deleted is indexed and Oracle knows that there are 98% of values ='Y' and only 2% of values = 'N' it will deduce to use and index in the column deleted. If you used bind variable for condition on deleted column Oracle could not find effective execution plan, because it also depends on input which is unknown in the time of the compilation.
(PS: since 11g it is more complicated with bind variable peeking)

JAVA Getting data from Database Without a loop

I am Programming a software with JAVA and using the Oracle DB.
Normally we obtain the values from the Database using a Loop like
Resultset rt = (Resultset) cs.getObject(1);
while(rt.next){
....
}
But it sound is more slowly when fetch thousand of data from the database.
My question is:
In Oracle DB: I created a Procedure like this and it is the Iterating data and assign to the cursor.
Ex.procedure test_pro(sysref_cursor out info) as
open info select * from user_tbl ......
end test_pro;
In JAVA Code: As I mentioned before I Iterate a the resultset for obtain values, but the side of database, even I select the values, why should I use a loop for getting that values?
(another fact in the .net frameworks, there are using the database binding concept. So is any way in the java, binding the database procedures like .net 's, without the iterating.
)

Depending on what you are going to do with that data and at which frequence, the choice for a ref_cursor might be a good or a bad one. Ref_cursors are intended to give non Oracle aware programs a way to pass it data, for reporting purposes.
In you case, stick to the looping but don't forget to implement array fetching because this has a tremendous effect on the performance. The database passes blocks of rows to your jdbc buffer at the client and your code fetches rows from that buffer. By the time you hit the end of the buffer, the Jdbc layer requests the next chunk of rows from the database, eliminating lot's of network round trips. The default already fetches 10 rows at a time. For larger sets, use bigger numbers, if memory can provide the room.
See Oracle® Database JDBC Developer's Guide and Reference

If you know for sure there will always be exactly one result, like in this case, you can even skip the if and just call rs.next() once:
For example :
ResultSet resultset = statement.executeQuery("SELECT MAX (custID) FROM customer");
resultset.next(); // exactly one result so allowed
int max = resultset.getInt(1); // use indexed retrieval since the column has no name

Yes,you can call procedure in java.
http://www.mkyong.com/jdbc/jdbc-callablestatement-stored-procedure-out-parameter-example/

You can't avoid looping. For performance reasons you need to adjust your prefetch on Statement or Resultset object (100 is a solid starting point).
Why is done this way? It's similar to reading streams - you never know how big it can be - so you read by chunk/buffer, one after another...

CachedRowSet slower than ResultSet?

In my java code, I access an oracle database table with an select statement.
I receive a lot of rows (about 50.000 rows), so the rs.next() needs some time to process all of the rows.
using ResultSet, the processing of all rows (rs.next) takes about 30 secs
My goal is to speed up this process, so I changed the code and now using a CachedRowSet:
using CachedRowSet, the processing of all rows takes about 35 secs
I don't understand why the CachedRowSet is slower than the normal ResultSet, because the CachedRowSet retrieves all data at once, while the ResultSet retrieves the data every time the rs.next is called.
Here is a part of the code:
try {
stmt = masterCon.prepareStatement(sql);
rs = stmt.executeQuery();
CachedRowSet crset = new CachedRowSetImpl();
crset.populate(rs);
while (rs.next()) {
int countStar = iterRs.getInt("COUNT");
...
}
} finally {
//cleanup
}

CachedRowSet caches the results in memory i.e. that you don't need the connection anymore. Therefore it it "slower" in the first place.
A CachedRowSet object is a container for rows of data that caches its
rows in memory, which makes it possible to operate without always
being connected to its data source.
-> http://download.oracle.com/javase/1,5.0/docs/api/javax/sql/rowset/CachedRowSet.html

There is an issue with CachedRowSet coupled together with a postgres jdbc driver.
CachedRowSet needs to know the types of the columns so it knows which java objects to create
(god knows what else it fetches from DB behind the covers!).
It therefor makes more roundtrips to the DB to fetch column metadata.
In very high volumes this becomes a real problem.
If the DB is on a remote server, this is a real problem as well because of network latency.
We've been using CachedRowSet for years and just discovered this. We now implement our own CachedRowSet, as we never used any of it's fancy stuff anyway.
We do getString for all types and convert ourselves as this seems the quickest way.
This clearly wasn't an issue with fetch size as postgres driver fetches everything by default.

What makes you think that ResultSet will retrieve the data each time rs.next() is called? It's up to the implementation exactly how it works - and I wouldn't be surprised if it fetches a chunk at a time; quite possibly a fairly large chunk.
I suspect you're basically seeing the time it takes to copy all the data into the CachedRowSet and then access it all - basically you've got an extra copying operation for no purpose.

Using normal ResultSet you can get more optimization options with RowPrefetch and FetchSize.
Those optimizes the network transport chunks and processing in the while loop, so the rs.next() has always a data to work with.
FetchSize has a default set to 10(Oracle latest versions), but as I know RowPrefetch is not set. Thus means network transport is not optimized at all.

What is the fastest way to retrieve sequential data from database?

I have a lot of rows in a database and it must be processed, but I can't retrieve all the data to the memory due to memory limitations.
At the moment, I using LIMIT and OFFSET to retrieve the data to get the data in some especified interval.
I want to know if the is the faster way or have another method to getting all the data from a table in database. None filter will be aplied, all the rows will be processed.

SELECT * FROM table ORDER BY column
There's no reason to be sucking the entire table in to RAM. Simply open a cursor and start reading. You can play games with fetch sizes and what not, but the DB will happily keep its place while you process your rows.
Addenda:
Ok, if you're using Java then I have a good idea what your problem is.
First, just by using Java, you're using a cursor. That's basically what a ResultSet is in Java. Some ResultSets are more flexible than others, but 99% of them are simple, forward only ResultSets that you call 'next' upon to get each row.
Now as to your problem.
The problem is specifically with the Postgres JDBC driver. I don't know why they do this, perhaps it's spec, perhaps it's something else, but regardless, Postgres has the curious characteristic that if your Connection has autoCommit set to true, then Postgres decides to suck in the entire result set on either the execute method or the first next method. Not really important as to where, only that if you have a gazillion rows, you get a nice OOM exception. Not helpful.
This can easily be exactly what you're seeing, and I appreciate how it can be quite frustrating and confusing.
Most Connection default to autoCommit = true. Instead, simply set autoCommit to false.
Connection con = ...get Connection...
con.setAutoCommit(false);
PreparedStatement ps = con.prepareStatement("SELECT * FROM table ORDER BY columm");
ResultSet rs = ps.executeQuery();
while(rs.next()) {
String col1 = rs.getString(1);
...and away you go here...
}
rs.close();
ps.close();
con.close();
Note the distinct lack of exception handling, left as an exercise for the reader.
If you want more control over how many rows are fetched at a time into memory, you can use:
ps.setFetchSize(numberOfRowsToFetch);
Playing around with that might improve your performance.
Make sure you have an appropriate index on the column you use in the ORDER BY if you care about sequencing at all.

Since its clear your using Java based on your comments:
If you are using JDBC you will want to use:
http://download.oracle.com/javase/1.5.0/docs/api/java/sql/ResultSet.html
If you are using Hibernate it gets trickier:
http://docs.jboss.org/hibernate/core/3.3/reference/en/html/batch.html

PreparedStatements and performance

So I keep hearing that PreparedStatements are good for performance.
We have a Java application in which we use the regular 'Statement' more than we use the 'PreparedStatement'. While trying to move towards using more PreparedStatements, I am trying to get a more thorough understanding of how PreparedStatements work - on the client side and the server side.
So if we have some typical CRUD operations and update an object repeatedly in the application, does it help to use a PS? I understand that we will have to close the PS every time otherwise it will result in a cursor leak.
So how does it help with performance? Does the driver cache the precompiled statement and give me a copy the next time I do connection.prepareStatement? Or does the DB server help?
I understand the argument about the security benefits of PreparedStatements and I appreciate the answers below which emphasize it. However I really want to keep this discussion focused on the performance benefits of PreparedStatements.
Update: When I say update data, I really mean more in terms of that method randomly being called several times. I understand the advantage in the answer offered below which asks to re-use the statement inside a loop.
// some code blah blah
update();
// some more code blah blah
update();
....
public void update () throws SQLException{
try{
PreparedStatement ps = connection.prepareStatement("some sql");
ps.setString(1, "foobar1");
ps.setString(2, "foobar2");
ps.execute();
}finally {
ps.close();
}
}
There is no way to actually reuse the 'ps' java object and I understand that the actual connection.prepareStatement call is quite expensive.
Which is what brings me back to the original question. Is this "some sql" PreparedStatement still being cached and reused under the covers that I dont know about?
I should also mention that we support several databases.
Thanks in advance.

The notion that prepared statements are primarily about performance is something of a misconception, although it's quite a common one.
Another poster mentioned that he noted a speed improvement of about 20% in Oracle and SQL Server. I've noted a similar figure with MySQL. It turns out that parsing the query just isn't such a significant part of the work involved. On a very busy database system, it's also not clear that query parsing will affect overall throughput: overall, it'll probably just be using up CPU time that would otherwise be idle while data was coming back from the disk.
So as a reason for using prepared statements, the protection against SQL injection attacks far outweighs the performance improvement. And if you're not worried about SQL injection attacks, you probably should be...

Prepared statements can improve performance when re-using the same statement that you prepared:
PreparedStatement ps = connection.prepare("SOME SQL");
for (Data data : dataList) {
ps.setInt(1, data.getId());
ps.setString(2, data.getValue();
ps.executeUpdate();
}
ps.close();
This is much faster than creating the statement in the loop.
Some platforms also cache prepared statements so that even if you close them they can be reconstructed more quickly.
However even if the performance were identical you should still use prepared statements to prevent SQL Injection. At my company this is an interview question; get it wrong and we might not hire you.

Prepared statements are indeed cached after their first use, which is what they provide in performance over standard statements. If your statement doesn't change then it's advised to use this method. They are generally stored within a statement cache for alter use.
More info can be found here:
http://www.theserverside.com/tt/articles/article.tss?l=Prepared-Statments
and you might want to look at Spring JDBCTemplate as an alternative to using JDBC directly.
http://static.springframework.org/spring/docs/2.0.x/reference/jdbc.html

Parsing the SQL isn't the only thing that's going on. There's validating that the tables and columns do indeed exist, creating a query plan, etc. You pay that once with a PreparedStatement.
Binding to guard against SQL injection is a very good thing, indeed. Not sufficient, IMO. You still should validate input prior to getting to the persistence layer.

So how does it help with performance? Does the driver cache the
precompiled statement and give me a copy the next time I do
connection.prepareStatement? Or does the DB server help?
I will answer in terms of performance. Others here have already stipulated that PreparedStatements are resilient to SQL injection (blessed advantage).
The application (JDBC Driver) creates the PreparedStatement and passes it to the RDBMS with placeholders (the ?). The RDBMS precompiles, applying query optimization (if needed) of the received PreparedStatement and (in some) generally caches them. During execution of the PreparedStatement, the precompiled PreparedStatement is used, replacing each placeholders with their relevant values and calculated. This is in contrast to Statement which compiles it and executes it directly, the PreparedStatement compiles and optimizes the query only once. Now, this scenario explained above is not an absolute case by ALL JDBC vendors but in essence that's how PreparedStatement are used and operated on.

Anecdotally: I did some experiments with prepared vs. dynamic statements using ODBC in Java 1.4 some years ago, with both Oracle and SQL Server back-ends. I found that prepared statements could be as much as 20% faster for certain queries, but there were vendor-specific differences regarding which queries were improved to what extent. (This should not be surprising, really.)
The bottom line is that if you will be re-using the same query repeatedly, prepared statements may help improve performance; but if your performance is bad enough that you need to do something about it immediately, don't count on the use of prepared statements to give you a radical boost. (20% is usually nothing to write home about.)
Your mileage may vary, of course.

Which is what brings me back to the original question. Is this "some sql" PreparedStatement still being cached and reused under the covers that I dont know about?
Yes at least with Oracle. Per Oracle® Database JDBC Developer's Guide Implicit Statement Caching (emphasis added),
When you enable implicit Statement caching, JDBC automatically caches the prepared or callable statement when you call the close method of this statement object. The prepared and callable statements are cached and retrieved using standard connection object and statement object methods.
Plain statements are not implicitly cached, because implicit Statement caching uses a SQL string as a key and plain statements are created without a SQL string. Therefore, implicit Statement caching applies only to the OraclePreparedStatement and OracleCallableStatement objects, which are created with a SQL string. You cannot use implicit Statement caching with OracleStatement. When you create an OraclePreparedStatement or OracleCallableStatement, the JDBC driver automatically searches the cache for a matching statement.

1. PreparedStatement allows you to write dynamic and parametric query
By using PreparedStatement in Java you can write parametrized sql queries and send different parameters by using same sql queries which is lot better than creating different queries.
2. PreparedStatement is faster than Statement in Java
One of the major benefits of using PreparedStatement is better performance. PreparedStatement gets pre compiled
In database and there access plan is also cached in database, which allows database to execute parametric query written using prepared statement much faster than normal query because it has less work to do. You should always try to use PreparedStatement in production JDBC code to reduce load on database. In order to get performance benefit its worth noting to use only parametrized version of sql query and not with string concatenation
3. PreparedStatement prevents SQL Injection attacks in Java
Read more: http://javarevisited.blogspot.com/2012/03/why-use-preparedstatement-in-java-jdbc.html#ixzz3LejuMnVL

Short answer:
PreparedStatement helps performance because typically DB clients perform the same query repetitively, and this makes it possible to do some pre-processing for the initial query to speed up the following repetitive queries.
Long answer:
According to Wikipedia, the typical workflow of using a prepared statement is as follows:
Prepare: The statement template is created by the application and sent
to the database management system (DBMS). Certain values are left
unspecified, called parameters, placeholders or bind variables
(labelled "?" below): INSERT INTO PRODUCT (name, price) VALUES (?, ?)
(Pre-compilation): The DBMS parses, compiles, and performs query optimization on the
statement template, and stores the result without executing it.
Execute: At a later time, the application supplies (or binds) values
for the parameters, and the DBMS executes the statement (possibly
returning a result). The application may execute the statement as many
times as it wants with different values. In this example, it might
supply 'Bread' for the first parameter and '1.00' for the second
parameter.
Prepare:
In JDBC, the "Prepare" step is done by calling java.sql.Connection.prepareStatement(String sql) API. According to its Javadoc:
This method is optimized for handling parametric SQL statements that benefit from precompilation. If the driver supports precompilation, the method prepareStatement will send the statement to the database for precompilation. Some drivers may not support precompilation. In this case, the statement may not be sent to the database until the PreparedStatement object is executed. This has no direct effect on users; however, it does affect which methods throw certain SQLException objects.
Since calling this API may send the SQL statement to database, it is an expensive call typically. Depending on JDBC driver's implementation, if you have the same sql statement template, for better performance, you may have to avoiding calling this API multiple times in client side for the same sql statement template.
Precompilation:
The sent statement template will be pre-compiled on database and cached in db server. The database will probably use the connection and sql statement template as the key, and the pre-compiled query and the computed query plan as value in the cache. Parsing query may need to validate table, columns to be queried, so it could be an expensive operation, and computation of query plan is an expensive operation too.
Execute:
For following queries from the same connection and sql statement template, the pre-compiled query and query plan will be looked up directly from cache by database server without re-computation again.
Conclusion:
From performance perspective, using prepare statement is a two-phase process:
Phase 1, prepare-and-precompilation, this phase is expected to be
done once and add some overhead for the performance.
Phase 2,
repeated executions of the same query, since phase 1 has some pre
processing for the query, if the number of repeating query is large
enough, this can save lots of pre-processing effort for the same
query.
And if you want to know more details, there are some articles explaining the benefits of PrepareStatement:
http://javarevisited.blogspot.com/2012/03/why-use-preparedstatement-in-java-jdbc.html
http://docs.oracle.com/javase/tutorial/jdbc/basics/prepared.html

Prepared statements have some advantages in terms of performance with respect to normal statements, depending on how you use them. As someone stated before, if you need to execute the same query multiple times with different parameters, you can reuse the prepared statement and pass only the new parameter set. The performance improvement depends on the specific driver and database you are using.
As instance, in terms of database performance, Oracle database caches the execution plan of some queries after each computation (this is not true for all versions and all configuration of Oracle). You can find improvements even if you close a statement and open a new one, because this is done at RDBMS level. This kind of caching is activated only if the two subsequent queries are (char-by-char) the same. This does not holds for normal statements because the parameters are part of the query and produce different SQL strings.
Some other RDBMS can be more "intelligent", but I don't expect they will use complex pattern matching algorithms for caching the execution plans because it would lower performance. You may argue that the computation of the execution plan is only a small part of the query execution. For the general case, I agree, but.. it depends. Keep in mind that, usually, computing an execution plan can be an expensive task, because the rdbms needs to consult off-memory data like statistics (not only Oracle).
However, the argument about caching range from execution-plans to other parts of the extraction process. Giving to the RDBMS multiple times the same query (without going in depth for a particular implementation) helps identifying already computed structures at JDBC (driver) or RDBMS level. If you don't find any particular advantage in performance now, you can't exclude that performance improvement will be implemented in future/alternative versions of the driver/rdbms.
Performance improvements for updates can be obtained by using prepared statements in batch-mode but this is another story.

Ok finally there is a paper that tests this, and the conclusion is that it doesn't improve performance, and in some cases its slower:
https://ieeexplore.ieee.org/document/9854303
PDF: https://www.bib.irb.hr/1205158/download/1205158.Performance_analysis_of_SQL_Prepared_Statements_in_CRUD_operations_final.pdf

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.