I was curious how fast I could execute MySQL INSERTs through a loop in Java, and with the code I wrote it is taking an extremely long time. I am sure there is a faster way to do it. Here is the piece of code that executes the query:
PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO books.author VALUES (?,?,?)");
for (int i = 0; i < 100000; i++)
{
    ps.setString(1, test[i][0]);
    ps.setString(2, test[i][1]);
    ps.setString(3, test[i][2]);
    ps.addBatch();            // queue the row instead of executing it right away
}
int[] p = ps.executeBatch();  // send all queued rows to the server in one batch
Any help would be much appreciated. Thank you
Your basic approach is correct. Since you are using MySQL you may want to add rewriteBatchedStatements=true to your connection URL as discussed in this answer. It has been known to significantly speed up batch INSERT operations to MySQL databases.
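For example, a minimal sketch of a connection URL with that flag turned on (host, schema, and credentials here are placeholders, not from your setup):

// Hypothetical connection URL - rewriteBatchedStatements=true is the relevant part
String url = "jdbc:mysql://localhost:3306/books?rewriteBatchedStatements=true";
Connection connection = DriverManager.getConnection(url, "user", "password");

With that flag, Connector/J rewrites the batched INSERTs into multi-row INSERT statements under the hood, which is usually where the big speedup comes from.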
There are a number of other things that kick in when you have a huge batch, so going "too" big will slow down the rows per second. This depends on a few settings, the specific table schema, and other things. I have seen a too-big batch run twice as slow as a more civilized batch.
Think of the overhead of parsing the query as being equivalent to inserting an extra 10 rows. Based on that Rule of Thumb, a batch insert of 1000 rows is within 1% of the theoretical maximum speed.
It may be faster to create a file with all the rows and do LOAD DATA. But, again, this depends on various settings, etc.
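If you go that route from Java, a rough sketch looks like the following; it assumes the rows were already written out to a CSV file and that allowLoadLocalInfile=true is enabled on the connection (file path, table, and terminators are placeholders):

// Bulk-load a pre-written CSV file instead of running batched INSERTs.
// Requires allowLoadLocalInfile=true on the connection; the path is a placeholder.
try (Statement stmt = connection.createStatement()) {
    stmt.execute("LOAD DATA LOCAL INFILE '/tmp/authors.csv' "
            + "INTO TABLE books.author "
            + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");
}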
Related
I have a scenario where the user will select a bulk input of up to 100K entries, and I need to validate that this data belongs to the user and satisfies X other conditions. Should I use a complex Oracle SQL DB query - a composite IN (id, column) - to validate it, OR
should I fetch the data for this user that satisfies the conditions into application memory and use List.containsAll, by first getting all the data (with all the other conditions) for this particular user into a dbList and then validating dbList.containsAll(inputList)?
Which one will be better performance-wise: a composite IN in the DB to send the bulk input, vs. getting the data and validating it with containsAll?
I tried running the SQL query in the SIT environment; the query takes around 70-90 seconds, which is too bad. It would be better in prod, but I still feel the database has to sort through a huge amount of data, even though it is indexed by user ID.
In the DB I am using count(*) with IN, like below:
SQL query:
select count(*) from user_table where user_id='X123' and X conditions and user_input IN(
('id','12344556'),
('id','789954334'),
('id','343432443'),
('id','455543545'),
------- 50k entries
);
There are also other AND conditions for validating that the user_input values are valid entries.
Sample Java code:
List<String> userInputList = request.getInputList();
List<String> userDBList = sqlStatement.execute(getConditionedQuery);
Boolean validData = userDBList.containsAll(userInputList);
getConditionedQuery = "select user_backedn_id from user_table where user_id='X123' AND X complex conditions";
The SQL query with the composite IN condition takes around 70-90 seconds in lower environments; however, the Java code with containsAll looks much faster.
Incidentally, I don't want to use a temp table and execute a procedure, because getting the bulk input into the DB is again a hassle. I am using the ATG framework and the module is RESTful, so performance is the most important thing here.
I personally believe that you should apply all filters on the database side only, for many reasons. First, exchanging that much data over the network will consume unnecessary bandwidth. Second, bringing all that data into the JVM and processing it will consume more memory. Third, databases can be tuned and optimised for complex queries. Talk to your DBA, give him the query, and ask him to run an analysis. The analysis will tell you whether you need to add any indexes to optimise your query.
Also, contrary to your belief, my experience says that if a query takes 70-90 seconds in SIT, it will take MORE time in prod. Although the PROD machines are much faster, the amount of data in PROD is much, much higher compared to SIT, so it will take longer. But that does not mean you should haul it over the network and process it in the JVM. Plus, the JVM's heap memory is much, much smaller than the database's memory.
Also, as we move to cloud-enabled, containerised application architectures, network bandwidth is charged for. E.g. if your application is in the cloud and the database is on premise, imagine the amount of data you would move back and forth to finally filter out 10 rows from a million.
I recommend that you write a good query, optimise it, and process as many conditions as possible on the database side. Hope it helps!
In general, it's a good idea to push as much of the processing as possible to the database. Even though it might look like a bottleneck, the database is generally well optimised and can work over large amounts of data faster than you would.
For read queries like the one you're describing, you can even offload the work to read replicas, so it doesn't overwhelm the master.
I need to update / insert a large number of entries very fast. I see 2 options
creating many queries and sending them via executeBatch
creating one big query (containing all updates/inserts in DB-specific syntax) and just executing it. Since the number of updates is fixed (the "batch size"), I can prepare this statement too
The target DB is Oracle. The number of inserts/updates in a batch is a fixed number between 1000 and 10000 (does this number have any impact on performance?)
So which way should I go?
Your options are essentially the same. In fact they may be identical, unless your second option is implemented in a poor way.
Using the built-in PreparedStatement batching is safer, since the driver knows what to do a lot better than you do. There's less chance of programmer error, and should you ever change your database provider, you won't need to double-check whether your solution is still valid.
Make sure to check out how to properly perform the batching. For example, the batch size is commonly 100 instead of the full number of rows you wish to insert (so you would call executeBatch() 10 times to insert your 1000 rows).
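A rough sketch of that chunked pattern (the table, columns, and rows collection here are placeholders, not taken from your code):

// Flush the batch every batchSize rows instead of accumulating everything at once.
final int batchSize = 100;
try (PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO my_table (col1, col2) VALUES (?, ?)")) {
    int count = 0;
    for (String[] row : rows) {
        ps.setString(1, row[0]);
        ps.setString(2, row[1]);
        ps.addBatch();
        if (++count % batchSize == 0) {
            ps.executeBatch();   // send a full chunk to the database
        }
    }
    ps.executeBatch();           // send whatever is left over
}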
I am new to Cassandra and am migrating my application from MySQL to Cassandra. From what I have read, Cassandra reduces read and write times compared to MySQL. When I tried a simple example on a single node using Hector, the read operation was quite fast compared to MySQL, and inserting a single column was very fast as well. But when I try to insert a single row with multiple columns, it takes a long time compared to MySQL. Is there any way to improve write performance, or please let me know if there is something wrong with my way of coding.
My SQL code is:
INSERT INTO energy_usage (meter_id, reading_datetime, reading_date, reading_time, asset_id, assetid)
VALUES ('164', '2012-12-07 00:30:00', '2012-12-07', '00:00:00', '1', '1')
My Cassandra code is:
Mutator<String> mutator = HFactory.createMutator(keyspaceOperator, StringSerializer.get());
mutator.addInsertion("888999", DYN_CF, HFactory.createStringColumn("assetid","1")).
addInsertion("888999", DYN_CF, HFactory.createStringColumn("meterid", "164")).
addInsertion("888999", DYN_CF, HFactory.createStringColumn("energyusage","10")).
addInsertion("888999", DYN_CF, HFactory.createStringColumn("readdate","2012-12-07")).
addInsertion("888999", DYN_CF, HFactory.createStringColumn("readdatetime","2012-12-07 00:30:00"));
mutator.execute();
Well, if you can provide us with some benchmarks stating what you consider 'very slow', maybe we can help you further. But one important thing to note is that, as a feature of Cassandra, as the amount of data in your cluster grows you are (loosely) guaranteed to get somewhat more constant read/write performance than with a relational DB engine.
Have you tried executing the same query in cassandra-cli/cqlsh, for example? What kind of performance do you get then? If the difference between the cassandra-cli/cqlsh statement and the piece of code you posted is dramatic, it probably means it is a driver issue, so I would dig deeper into the Hector code.
I've been looking around trying to determine some Hibernate behavior that I'm unsure about. In a scenario where Hibernate batching is properly set up, will it only ever use multiple insert statements when a batch is sent? Is it not possible to use a DB independent multi-insert statement?
I guess I'm trying to determine if I actually have the batching set up correctly. I see the multiple insert statements but then I also see the line "Executing batch size: 25."
There's a lot of code I could post but I'm trying to keep this general. So, my questions are:
1) What can you read in the logs to be certain that batching is being used?
2) Is it possible to make Hibernate use a multi-row insert versus multiple insert statements?
Hibernate uses multiple insert statements (one per entity to insert), but sends them to the database in batch mode (using Statement.addBatch() and Statement.executeBatch()). This is the reason you're seeing multiple insert statements in the log, but also "Executing batch size: 25".
The use of batched statements greatly reduces the number of roundtrips to the database, and I would be surprised if it were less efficient than executing a single statement with multiple inserts. Moreover, it also allows mixing updates and inserts, for example, in a single database call.
I'm pretty sure it's not possible to make Hibernate use multi-row inserts, but I'm also pretty sure it would be useless.
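For reference, the usual setup is the hibernate.jdbc.batch_size property plus a periodic flush/clear in the saving loop. A minimal sketch, assuming hibernate.jdbc.batch_size is set to 25 in the configuration and Author is a hypothetical mapped entity:

// Save in a loop, flushing and clearing every batch_size entities so the driver can
// send them via addBatch()/executeBatch() and the session does not grow unbounded.
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for (int i = 0; i < 1000; i++) {
    session.save(new Author("name-" + i));
    if (i % 25 == 0) {       // same value as hibernate.jdbc.batch_size
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();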
I know that this is an old question, but I had the same issue: I thought that Hibernate batching means Hibernate would combine multiple inserts into one statement, which it doesn't seem to do.
After some testing I found this answer, which says that a batch of multiple inserts is just as good as a multi-row insert. I did a test inserting 1000 rows, one time using Hibernate batching and one time without. Both tests took about 20s, so there was no performance gain from using Hibernate batching.
To be sure, I tried the rewriteBatchedStatements option from MySQL Connector/J, which actually combines multiple inserts into one statement. It reduced the time to insert 1000 records down to 3s.
So after all, Hibernate batching seems to be useless, and a real multi-row insert seems to be much better. Am I doing something wrong, or what causes my test results?
Oracle bulk insert collects an array of entities and passes it to the DB in a single block, associating with it a single cyclic insert/update/delete.
It is the only way to speed up network throughput.
Oracle suggests doing this by calling a stored procedure from Hibernate and passing it an array of data.
http://biemond.blogspot.it/2012/03/oracle-bulk-insert-or-select-from-java.html?m=1
It is not only a software problem but an infrastructural one!
The problem is network data flow optimization and TCP stack fragmentation.
MySQL has an equivalent function.
You have to do something like what is described in this article.
Transferring the correct volume of data over the network in a normal way is the solution.
You also have to check the network MTU and the Oracle SDU/TDU usage with respect to the data transferred between the application and the database.
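A rough sketch of the stored-procedure-with-array approach the linked article describes, assuming the ojdbc driver; BULK_INSERT_PROC and NUMBER_TAB are hypothetical names that would have to exist in your own schema:

// Pass a whole array of values to an Oracle stored procedure in a single call.
OracleConnection oraConn = connection.unwrap(OracleConnection.class);
Array ids = oraConn.createOracleArray("NUMBER_TAB", new int[] {1, 2, 3});
try (CallableStatement cs = connection.prepareCall("{call BULK_INSERT_PROC(?)}")) {
    cs.setArray(1, ids);
    cs.execute();
}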
The problem is, we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code, it's not a move from another table, so INSERT/SELECT won't help.
Currently, my bottleneck is the INSERT statements. I'm using PreparedStatement to speed up the process, but I can't get more than 50 records per second on a normal server. The table is not complicated at all, and there are no indexes defined on it.
The process takes too long, and the time it takes will cause problems.
What can I do to get the maximum speed (INSERT per second) possible?
Database: MS SQL 2008. Application: Java-based, using Microsoft JDBC driver.
Batch the inserts. That is, only send 1000 rows at a time rather than one row at a time, so you hugely reduce round trips / server calls.
See Performing Batch Operations on MSDN for the JDBC driver. This is the easiest method that doesn't require re-engineering to use genuine bulk methods.
Each insert must be parsed, compiled, and executed. A batch means a lot less parsing/compiling, because 1000 inserts (for example) will be compiled in one go.
There are better ways, but this works if you are limited to generated INSERTs
Use BULK INSERT - it is designed for exactly what you are asking and significantly increases the speed of inserts.
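If the generated rows can be staged in a file first, a hedged sketch of running it from JDBC looks like this (the file must be readable by the SQL Server service itself; table name, path, and terminators are placeholders):

// Execute T-SQL BULK INSERT against a staged CSV file.
try (Statement stmt = connection.createStatement()) {
    stmt.execute("BULK INSERT dbo.my_table "
            + "FROM 'C:\\data\\rows.csv' "
            + "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')");
}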
Also (just in case you really do have no indexes), you may want to consider adding indexes - some indexes (most importantly an index on the primary key) may improve the performance of inserts.
The actual rate at which you should be able to insert records will depend on the exact data, the table structure and also on the hardware / configuration of the SQL server itself, so I can't really give you any numbers.
Have you looked into bulk operations?
Have you considered using batch updates?
Is there any integrity constraint or trigger on the table?
If so, dropping it before the inserts will help, but you have to be sure that you can afford the consequences.
Look into SQL Server's bcp utility.
This would mean a big change in your approach in that you'd be generating a delimited file and using an external utility to import the data. But this is the fastest method for inserting a large number of records into a SQL Server DB and will speed up your load time by many orders of magnitude.
Also, is this a one-time operation you have to perform or something that will occur on a regular basis? If it's one time I would suggest not even coding this process but performing an export/import with a combination of db utilities.
I would recommend using an ETL engine for this. You can use Pentaho; it's free. ETL engines are optimized for bulk loading data and for any transformation/validation that is required.