MySQL Updates are taking forever - java

Hey, I'm trying to write about 600,000 tokens into my MySQL database table. The engine I'm using is InnoDB. The update process is taking forever :( so my best guess is that I'm totally missing something in my code and that what I'm doing is just plain stupid.
Perhaps someone has a spontaneous idea about what is eating my performance.
Here is my code:
public void writeTokens(Collection<Token> tokens) {
    try {
        PreparedStatement updateToken = dbConnection.prepareStatement(
                "UPDATE tokens SET `idTag`=?, `Value`=?, `Count`=?, `Frequency`=? WHERE `idToken`=?;");
        for (Token token : tokens) {
            updateToken.setInt(1, 0);
            updateToken.setString(2, token.getWord());
            updateToken.setInt(3, token.getCount());
            updateToken.setInt(4, token.getFrequency());
            updateToken.setInt(5, token.getNounID());
            updateToken.executeUpdate();
        }
    } catch (SQLException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
Thanks a lot!

I don't have a Java-specific answer for you, but wrap the whole shebang in a transaction. If you don't, then MySQL (when writing against InnoDB) starts and commits a new transaction per update statement.
Just execute START TRANSACTION before you start your calls, and execute COMMIT after all your updates/inserts are done. I also think that MySQL defers index updates until the end of the transaction, which should improve performance considerably if you're updating indexed fields.
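In JDBC the same thing is usually done by turning off auto-commit and committing once at the end, rather than issuing START TRANSACTION by hand. A rough, untested sketch of the question's method rewritten that way (assuming it may declare throws SQLException):
public void writeTokens(Collection<Token> tokens) throws SQLException {
    dbConnection.setAutoCommit(false); // one transaction for the whole batch instead of one per UPDATE
    try (PreparedStatement updateToken = dbConnection.prepareStatement(
            "UPDATE tokens SET `idTag`=?, `Value`=?, `Count`=?, `Frequency`=? WHERE `idToken`=?")) {
        for (Token token : tokens) {
            updateToken.setInt(1, 0);
            updateToken.setString(2, token.getWord());
            updateToken.setInt(3, token.getCount());
            updateToken.setInt(4, token.getFrequency());
            updateToken.setInt(5, token.getNounID());
            updateToken.executeUpdate();
        }
        dbConnection.commit(); // nothing becomes durable until here
    } catch (SQLException e) {
        dbConnection.rollback(); // undo the partial work on failure
        throw e;
    } finally {
        dbConnection.setAutoCommit(true);
    }
}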

If you have an index on one or more of the fields in your table, each update forces those indices to be updated as well, which can take a while once you get to several hundred thousand entries.
PreparedStatement comes with an addBatch() method - I haven't used it much, but if I understand it correctly, you can queue up several records in your prepared statement and then send them to the database in one go. This reduces the number of index updates from 600,000 rounds to one batch - you should feel the difference :)
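Roughly, applied to the question's UPDATE, that would look like the sketch below: addBatch() queues the parameter sets client-side and executeBatch() sends them all at once (untested against the asker's schema):
PreparedStatement updateToken = dbConnection.prepareStatement(
        "UPDATE tokens SET `idTag`=?, `Value`=?, `Count`=?, `Frequency`=? WHERE `idToken`=?");
for (Token token : tokens) {
    updateToken.setInt(1, 0);
    updateToken.setString(2, token.getWord());
    updateToken.setInt(3, token.getCount());
    updateToken.setInt(4, token.getFrequency());
    updateToken.setInt(5, token.getNounID());
    updateToken.addBatch();                  // queue the parameter set instead of executing it
}
int[] counts = updateToken.executeBatch();   // send all queued updates in one go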

Each update statement requires a round trip to the database. This will give you a huge performance hit.
There are a couple of ways you can get this data into the database without performing hundreds of thousands of individual statements. The trick is to first load the new values into a temporary table:
Use a bulk insert (LOAD DATA INFILE).
Or use insert statements that each insert multiple rows at once. You could, for example, insert 100 rows per insert statement.
Then you can use a single update statement to copy the data from the temporary table into the target table. This reduces the number of server round trips, improving the performance.
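A hedged sketch of that idea: load everything into a staging table first, then apply one UPDATE that joins it against the target table. The staging table tokens_staging and its definition are assumptions, and a batched INSERT is used here instead of LOAD DATA INFILE for brevity:
try (Statement stmt = dbConnection.createStatement()) {
    stmt.executeUpdate("CREATE TEMPORARY TABLE tokens_staging (`idToken` INT PRIMARY KEY, "
            + "`idTag` INT, `Value` VARCHAR(255), `Count` INT, `Frequency` INT)");
}
try (PreparedStatement insert = dbConnection.prepareStatement(
        "INSERT INTO tokens_staging (`idToken`, `idTag`, `Value`, `Count`, `Frequency`) VALUES (?,?,?,?,?)")) {
    for (Token token : tokens) {
        insert.setInt(1, token.getNounID());
        insert.setInt(2, 0);
        insert.setString(3, token.getWord());
        insert.setInt(4, token.getCount());
        insert.setInt(5, token.getFrequency());
        insert.addBatch();
    }
    insert.executeBatch(); // one round of inserts into the staging table
}
try (Statement stmt = dbConnection.createStatement()) {
    // a single server-side statement copies everything from the staging table into the target table
    stmt.executeUpdate("UPDATE tokens t JOIN tokens_staging s ON t.`idToken` = s.`idToken` "
            + "SET t.`idTag` = s.`idTag`, t.`Value` = s.`Value`, t.`Count` = s.`Count`, t.`Frequency` = s.`Frequency`");
}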

Related

fast way to insert into database using java

I was curious about how fast we can execute MySQL queries through a loop in Java, and it is taking an extremely long time with the code I wrote. I am sure there is a faster way to do it. Here is the piece of code that executes the query:
PreparedStatement ps = connection.prepareStatement("INSERT INTO books.author VALUES (?,?,?)");
for (int i = 0; i < 100000; i++) {
    ps.setString(1, test[i][0]);
    ps.setString(2, test[i][1]);
    ps.setString(3, test[i][2]);
    ps.addBatch();
}
int[] p = ps.executeBatch();
Any help would be much appreciated. Thank you
Your basic approach is correct. Since you are using MySQL you may want to add rewriteBatchedStatements=true to your connection URL as discussed in this answer. It has been known to significantly speed up batch INSERT operations to MySQL databases.
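For reference, the flag goes on the JDBC connection URL; the host, database and credentials below are placeholders:
String url = "jdbc:mysql://localhost:3306/books"
        + "?rewriteBatchedStatements=true"; // lets Connector/J rewrite the batch into multi-row INSERTs
Connection connection = DriverManager.getConnection(url, "user", "password");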
There are a number of other things that kick in when you have a huge batch, so going "too" big will actually slow down the rows/second. This depends on a few settings, the specific table schema, and other things. I have seen a too-big batch run twice as slow as a more civilized batch.
Think of the overhead of parsing the query as being equivalent to inserting an extra 10 rows. Based on that Rule of Thumb, a batch insert of 1000 rows is within 1% of the theoretical maximum speed.
It may be faster to create a file with all the rows and do LOAD DATA. But, again, this depends on various settings, etc.
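If you take the rule of thumb above at face value, flushing the batch every ~1000 rows is a reasonable middle ground; a sketch of the question's loop with that one change:
int batchSize = 1000; // large enough to amortize parsing overhead, small enough to avoid a "too big" batch
for (int i = 0; i < 100000; i++) {
    ps.setString(1, test[i][0]);
    ps.setString(2, test[i][1]);
    ps.setString(3, test[i][2]);
    ps.addBatch();
    if ((i + 1) % batchSize == 0) {
        ps.executeBatch(); // send this chunk and start a new one
    }
}
ps.executeBatch(); // flush the final, possibly partial, chunk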

Optimization for fetching from a bulky table

I have a PostgreSQL table that has millions of records. I need to process every row, and for that I am using a column in that table named 'isProcessed': by default it's false, and when I process a row I change it to true.
Now the problem is that there are too many records, and due to exceptions the code bypasses some records, leaving them with isProcessed=false, and that makes the execution really slow.
I was thinking of using an index, but with a boolean column that normally does not help.
Please provide some optimization technique or some better practice.
UPDATE:
I don't have the code; it's just a problem my colleagues were asking for my opinion on.
Normally an index on a boolean isn't a good idea, but in PostgreSQL you can create an index that contains entries for only one value by using a partial index: http://www.postgresql.org/docs/9.3/interactive/indexes-partial.html. It ends up being a queue of things for you to process; items drop off the index once they are done.
CREATE INDEX "yourtable_isProcessed_idx" ON "public"."yourtable"
USING btree ("isProcessed")
WHERE (isProcessed IS NOT TRUE);
This will make life easier when it is looking for the next item to process. Ideally you should be processing more than one at a time, particularly if you can do it in a single query, though doing millions at once may be prohibitive. In that situation, you might be able to do
update yourtable
set ....
where id in (select id from yourtable where isProcessed = false limit 100 )
If you have to do things one at a time, I'd still limit what you retrieve, so potentially retrieve
select id from yourtable where isProcessed = false limit 1
Without seeing your code, it would be tough to say what is really going on. Doing any processing row by row, which it sounds like is what is going on, is going to take a VERY long time.
Generally speaking, the best way to work with data is in sets. At the end of your process, you're going to ultimately have a set of records where isProcessed needs to be true (where the operation was successful), and a set where isProcessed needs to be false (where the operation failed). As you process the data, keep track of which records could be updated successfully, as well as which could not be updated. You could do this by making a list or array of the primary key or whatever other data you use to identify the rows. Then, after you're done processing your data, do one update to flag the records that were successful, and one to update the records that were not successful. This will be a bit more code, but updating each row individually after you process it is going to be awfully slow.
Again, seeing code would help, but if you're updating each record after you process it, I suspect this is what's slowing you down.
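A minimal sketch of that bookkeeping, assuming an integer primary key column named id and using PostgreSQL array binding (idsToProcess and processRow() are placeholders, not from the question):
List<Integer> succeeded = new ArrayList<>();
List<Integer> failed = new ArrayList<>();
for (int id : idsToProcess) {            // idsToProcess and processRow() stand in for your own code
    try {
        processRow(id);
        succeeded.add(id);
    } catch (Exception e) {
        failed.add(id);
    }
}
// then flag all the successful rows in a single statement instead of one update per row
try (PreparedStatement ps = connection.prepareStatement(
        "UPDATE yourtable SET isProcessed = true WHERE id = ANY (?)")) {
    ps.setArray(1, connection.createArrayOf("integer", succeeded.toArray()));
    ps.executeUpdate();
}
// the failed list can be handled the same way, or simply left with isProcessed = false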
Here is the approach I use. You should be able to store the processing state, including errors. It can be one column with the values PENDING, PROCESSED, ERROR, or two columns is_processed, is_error.
This lets you skip records which couldn't be successfully processed and which, if not skipped, slow down the processing of good tasks. You may try to reprocess them later, or give DevOps the possibility to move tasks from ERROR back to PENDING if the reason for the failure was, for example, a temporarily unavailable resource.
Then you create a conditional (partial) index on the table which includes only PENDING tasks.
Processing is done using the following algorithm (using Spring: transaction and nestedTransaction are Spring transaction templates):
while (!(batch = getNextBatch()).isEmpty()) {
    transaction.execute((TransactionStatus status) -> {
        for (Element element : batch) {
            try {
                nestedTransaction.execute((TransactionStatus nestedStatus) -> {
                    processElement(element);
                    markAsProcessed(element);
                });
            } catch (Exception e) {
                markAsFailed(element);
            }
        }
    });
}
Several important notes:
getting the records in batches at least saves round trips to the database and is quicker than one-by-one retrieval
processing of individual elements is done in a nested transaction (implemented using PostgreSQL SAVEPOINTs). This is quicker than processing each element in its own transaction, but still has the benefit that a failure in processing one element will not lose the results of processing the other elements in the batch.
This is good when the processing is complex enough that it cannot be done in SQL with a single query per batch. If processElement is a rather simple update of the element, then the whole batch may be updated via a single update statement.
processing of the elements of a batch may be done in parallel. This requires propagating the transaction to the worker threads.
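For completeness, a rough guess at what getNextBatch() might look like with plain JDBC; the table and column names are assumptions, and in real code it would return your Element objects rather than ids:
private List<Long> getNextBatch() throws SQLException {
    List<Long> ids = new ArrayList<>();
    try (PreparedStatement ps = connection.prepareStatement(
            "SELECT id FROM tasks WHERE status = 'PENDING' LIMIT 100");
         ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            ids.add(rs.getLong("id")); // the partial index on PENDING keeps this lookup cheap
        }
    }
    return ids;
}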

SQL - INSERT IF NOT EXIST, CHECK if the same OR UPDATE

I want the DBMS to help me gain speed when doing a lot of inserts.
Today I do an INSERT Query in Java and catch the exception if the data already is in the database.
The exception I get is :
SQLite Exception : [19] DB[1] exec() columns recorddate, recordtime are not unique.
If I get an exception I do a SELECT Query with the primary keys (recorddate, recordtime) and compare the result with the data I am trying to insert in Java. If it is the same I continue with next insert, otherwise I evaluate the data and decide what to save and maybe do an UPDATE.
This process takes time and I would like to speed it up.
I have thought of INSERT IF NOT EXISTS, but this just ignores the insert if there is any data with the same primary keys, am I right? And I want to make sure it is exactly the same data before I ignore the insert.
I would appreciate any suggestions for how to make this faster.
I'm using Java to handle large amount of data to insert into a SQLite database (SQLite v. 3.7.10). As the connection between Java and SQLite I am using sqlite4java (http://code.google.com/p/sqlite4java/)
I do not think letting the DBMS handle more of that logic would be faster, at least not with plain SQL; as far as I can think of, there is no "create or update" there.
When handling lots of entries, latency is often an important issue, especially with databases accessed via the network, so at least in that case you want to use mass operations wherever possible. Even if it were provided, a "create or update" instead of select-then-update-or-insert would only halve the latency.
I realize that is not what you asked for, but I would try to optimize in a different way: process chunks of data, select all of them into a map, then partition the input into creates, updates, and ignores. That way the ignores are almost free, and further lookups are guaranteed to be done in memory. It is unlikely that the DBMS can be significantly faster.
If unsure if that is the right approach for you, profiling of overhead times should help.
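A rough sketch of that partitioning (Reading is a placeholder class and yourtable/value are assumed names; the real keys are recorddate and recordtime from the question):
// 1. load the existing rows for this chunk into a map keyed by (recorddate, recordtime)
Map<String, Reading> existing = new HashMap<>();
try (PreparedStatement ps = connection.prepareStatement(
        "SELECT recorddate, recordtime, value FROM yourtable WHERE recorddate = ?")) {
    ps.setString(1, chunkDate);                          // fetch one chunk's worth of existing rows
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            String key = rs.getString("recorddate") + "|" + rs.getString("recordtime");
            existing.put(key, new Reading(rs.getString("recorddate"),
                                          rs.getString("recordtime"),
                                          rs.getString("value")));
        }
    }
}
// 2. partition the incoming data against the map, entirely in memory
List<Reading> creates = new ArrayList<>();
List<Reading> updates = new ArrayList<>();
for (Reading incoming : chunk) {                         // chunk = the input rows being processed
    Reading current = existing.get(incoming.key());
    if (current == null) {
        creates.add(incoming);                           // not in the database yet
    } else if (!current.equals(incoming)) {
        updates.add(incoming);                           // present, but with different data
    }                                                    // identical rows are ignored for free
}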
Wrap all of your inserts and updates into a transaction. In SQL this will be written as follows.
BEGIN;
INSERT OR REPLACE INTO Table(Col1,Col2) VALUES(Val1,Val2);
COMMIT;
There are two things to note here. The first is that the database pages and commits will not be written to disk until COMMIT is called, which speeds up your queries significantly. The second is the INSERT OR REPLACE syntax, which does precisely what you want for UNIQUE or PRIMARY KEY fields.
Most database wrappers have a special syntax for managing transactions. You can certainly execute a query, BEGIN, followed by your inserts and updates, and finish by executing COMMIT. Read the database wrapper documentation.
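The asker is on sqlite4java, but the shape is the same with any wrapper; here is an untested sketch with plain JDBC, reusing the table and columns from the snippet above (rows is a placeholder for your own data):
connection.setAutoCommit(false);                       // BEGIN
try (PreparedStatement ps = connection.prepareStatement(
        "INSERT OR REPLACE INTO Table(Col1, Col2) VALUES (?, ?)")) {
    for (String[] row : rows) {                        // rows stands in for your own records
        ps.setString(1, row[0]);
        ps.setString(2, row[1]);
        ps.executeUpdate();
    }
    connection.commit();                               // COMMIT: one disk sync for the whole batch
} catch (SQLException e) {
    connection.rollback();                             // undo the partial batch on failure
    throw e;
}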
One more thing you can do is switch to Write-Ahead Logging. Run the following command, only once, on the database.
PRAGMA journal_mode = wal;
Without further information, I would:
BEGIN;
UPDATE table SET othervalues=... WHERE recorddate=... AND recordtime=...;
INSERT OR IGNORE INTO table(recorddate, recordtime, ...) VALUES(...);
COMMIT;
The UPDATE will update all existing rows, ignoring non-existent ones because of the WHERE clause.
The INSERT will then add new rows, ignoring existing ones because of IGNORE.

How to check if I'm inserting in db from java

just a guide here, please!
I want to insert some values into my db from Java.
I have my Oracle prepared statement and stuff, and it inserts OK, but my other requirement is to send via email the fields that for some reason weren't inserted.
So, I'm thinking of making my method return an int: 0 if there's an error, 1 if it's OK, 2 if some fields didn't insert... etc... BUT
I'm lost figuring out how to tie it together. I mean, if there's an error, in the catch for the SQLException in Java I can put something like:
catch (SQLException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
    return result = 0;
}
And obviously I can put an error message and stuff where I call it, but how will I know at which item it stopped, and then retrieve the rest of the fields that weren't inserted...
any idea? anyone has an example?
Thanks in advance
First of all, try to do as many validations as possible in Java and identify the bad records before persisting them to the database; you can catch those records before they hit the database and send the email.
In case you are worried about db-level issues, e.g. constraints etc., then you need to choose between performance and flexible features:
Option 1: Don't use batching; insert one record at a time. Whatever fails you can send by email. This is bad from a performance perspective but will solve your purpose.
Option 2: Divide your records into batches of 50 or 100; whatever batch fails, you can report that the whole batch failed. This will also report some good records, but you will have information about what was not persisted.
Option 3: Divide your records into batches; if a batch fails, try saving one record at a time for that particular batch. That way you will be able to identify the bad record and performance will be optimized, but this will require more coding and testing.
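A hedged sketch of Option 3, assuming each record is a String[] of the statement's parameters and ps is your already-prepared INSERT (names and layout are illustrative, not from the question):
List<String[]> failedRows = new ArrayList<>();
try {
    for (String[] row : batch) {                       // batch = the next 50-100 records
        ps.setString(1, row[0]);
        ps.setString(2, row[1]);
        ps.setString(3, row[2]);
        ps.addBatch();
    }
    ps.executeBatch();
} catch (BatchUpdateException e) {
    ps.clearBatch();                                   // discard whatever is left queued on the statement
    // retry this batch one record at a time to pinpoint the bad rows
    for (String[] row : batch) {
        try {
            ps.setString(1, row[0]);
            ps.setString(2, row[1]);
            ps.setString(3, row[2]);
            ps.executeUpdate();
        } catch (SQLException ex) {
            failedRows.add(row);                       // collect these for the notification email
        }
    }
}
One caveat: depending on the driver, some rows of the failed batch may already have been inserted, so the per-record retry can report those as failures too; BatchUpdateException.getUpdateCounts() tells you how far the batch got if you need to be precise.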
So, you can choose what ever suits you, but I would recommend following:
Do maximum validations in Java
Split into smaller batches and use Option 3.
Cheers !!
When you run an insert, either all fields included in the statement are inserted, or none are. So if running your statement throws an exception, then you would send an email.
If no exception is thrown then you would assume that it has worked.
Alternatively, after running the insert, you could then attempt to do a select to check to see if the data you inserted was actually there.
If the statement failed then none of your fields will be inserted, because Oracle internally rolls the statement back to maintain consistency in your data.

Fastest way to iterate through large table using JDBC

I'm trying to create a java program to cleanup and merge rows in my table. The table is large, about 500k rows and my current solution is running very slowly. The first thing I want to do is simply get an in-memory array of objects representing all the rows of my table. Here is what I'm doing:
pick an increment of say 1000 rows at a time
use JDBC to fetch a resultset on the following SQL query
SELECT * FROM TABLE WHERE ID > 0 AND ID < 1000
add the resulting data to an in-memory array
continue querying all the way up to 500,000 in increments of 1000, each time adding results.
This is taking way too long. In fact it's not even getting past the second increment, from 1000 to 2000. The query takes forever to finish (although when I run the same thing directly through a MySQL browser it's decently fast). It's been a while since I've used JDBC directly. Is there a faster alternative?
First of all, are you sure you need the whole table in memory? Maybe you should consider (if possible) selecting rows that you want to update/merge/etc. If you really have to have the whole table you could consider using a scrollable ResultSet. You can create it like this.
// make sure autocommit is off (postgres)
con.setAutoCommit(false);
Statement stmt = con.createStatement(
        ResultSet.TYPE_SCROLL_INSENSITIVE, // or ResultSet.TYPE_FORWARD_ONLY
        ResultSet.CONCUR_READ_ONLY);
ResultSet srs = stmt.executeQuery("select * from ...");
It enables you to move to any row you want by using 'absolute' and 'relative' methods.
One thing that helped me was Statement.setFetchSize(Integer.MIN_VALUE). I got this idea from Jason's blog. This cut down execution time by more than half. Memory consumed went down dramatically (as only one row is read at a time.)
This trick doesn't work for PreparedStatement, though.
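Concretely, the streaming setup for MySQL Connector/J looks roughly like this; it has to be a forward-only, read-only, plain Statement (con is the connection from the snippet above):
Statement stmt = con.createStatement(
        ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);   // tells Connector/J to stream rows instead of buffering them all
ResultSet rs = stmt.executeQuery("SELECT * FROM TABLE");
while (rs.next()) {
    // handle one row at a time; the whole table never sits in memory at once
}
rs.close();
stmt.close();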
Although it's probably not optimum, your solution seems like it ought to be fine for a one-off database cleanup routine. It shouldn't take that long to run a query like that and get the results (I'm assuming that since it's a one off a couple of seconds would be fine). Possible problems -
is your network (or at least your connection to mysql) very slow? If so, you could try running the process locally on the mysql box, or on something better connected.
is there something in the table structure that's causing it? Pulling down 10k of data for every row? 200 fields? Calculating the id values to fetch based on a non-indexed column? You could try finding a more db-friendly way of pulling the data (e.g. just the columns you need, having the db aggregate values, etc.)
If you're not getting through the second increment, something is really wrong - efficient or not, you shouldn't have any problem dumping 2,000 or 20,000 rows into memory on a running JVM. Maybe you're storing the data redundantly or extremely inefficiently?
