I want the DBMS to help me gain speed when doing a lot of inserts.
Today I do an INSERT Query in Java and catch the exception if the data already is in the database.
The exception I get is :
SQLite Exception : [19] DB[1] exec() columns recorddate, recordtime are not unique.
If I get an exception I do a SELECT Query with the primary keys (recorddate, recordtime) and compare the result with the data I am trying to insert in Java. If it is the same I continue with next insert, otherwise I evaluate the data and decide what to save and maybe do an UPDATE.
This process takes time and I would like to speed it up.
I have thought of INSERT IF NOT EXIST but this just ignore the insert if there is any data with the same primary keys, am I right? And I want to make sure it is exactly the same data before I ignore the insert.
I would appreciate any suggestions for how to make this faster.
I'm using Java to handle large amount of data to insert into a SQLite database (SQLite v. 3.7.10). As the connection between Java and SQLite I am using sqlite4java (http://code.google.com/p/sqlite4java/)
I do not think letting the dbms handling more of that logic would be faster, at least not with plain SQL, as far as I can think of there is no "create or update" there.
When handling lots of entries often latency is an important issue, especially with dbs accessed via network, so at least in that case you want want to use mass operations whereever possible. Even if provided, "create or update" instead of select and update or insert (if even) would only half the latency.
I realize that is not what you asked for, but I would try to optimize in a different way, processing chunks of data, select all of them into a map then partition the input in creates, updates and ignores. That way ignores are almost for free, and further lookups are guaranteed to be done in memory. Unlikely that the dbms can be significantly faster.
If unsure if that is the right approach for you, profiling of overhead times should help.
Wrap all of your inserts and updates into a transaction. In SQL this will be written as follows.
BEGIN;
INSERT OR REPLACE INTO Table(Col1,Col2) VALUES(Val1,Val2);
COMMIT;
There are two things to note here: the database paging and commits will not be written to disk until COMMIT is called, speeding up your queries significantly; the second thing is the INSERT OR REPLACE syntax, which does precisely what you want for UNIQUE or PRIMARY KEY fields.
Most database wrappers have a special syntax for managing transactions. You can certainly execute a query, BEGIN, followed by your inserts and updates, and finish by executing COMMIT. Read the database wrapper documentation.
One more thing you can do is switch to Write-Ahead Logging. Run the following command, only once, on the database.
PRAGMA journal_mode = wal;
Without further information, I would:
BEGIN;
UPDATE table SET othervalues=... WHERE recorddate=... AND recordtime=...;
INSERT OR IGNORE INTO table(recorddate, recordtime, ...) VALUES(...);
COMMIT;
UPDATE will update all existing rows, ignoring non existent because of WHERE clause.
INSERT will then add new rows, ignoring existing because of IGNORE.
Related
Situation: I need to change many records in database (10 000 records, in example), using ORMLite DAO. All records change only in one table, in one column and changing records, which have specified id.
Question: how update many records in database at once, using ORMLite DAO?
Now I update records, using this code:
imagesDao.update(imageOrmRecord);
But updating records in cycle very slow (100 records\sec).
I think that real update records, using SQL-code, but this is undesirable...
SQL is a set-oriented language. The whole point of an ORM is to abstract this away into objects.
So when you want to update a bunch of objects, you have to go through these objects.
(You have run into the object-relational impedance mismatch; also read The Vietnam of Computer Science.)
ORMLite gives you a backdoor to execute raw SQL:
someDao.executeRaw("UPDATE ...");
But if your only problem is performance, this is likely to be caused by the auto-commit mode, which adds transaction overhead to each single statement. Using callBatchTasks() would fix this.
Question: how update many records in database at once, using ORMLite DAO?
It depends a bit on what updates you are making. You can certainly use the UpdateBuilder which will make wholesale updates to objects.
UpdateBuilder<Account, String> updateBuilder = accountDao.updateBuilder();
// update the password to be "none"
updateBuilder.updateColumnValue("password", "none");
// only update the rows where password is null
updateBuilder.where().isNull(Account.PASSWORD_FIELD_NAME);
updateBuilder.update();
Or something like:
// update hasDog boolean to true if dogC > 0
updateBuilder.updateColumnExpression("hasDog", "dogC > 0");
You should be able to accomplish a large percentage of the updates that you would do using raw SQL this way.
But if you need to make per-entity updates then you will need to do dao.update(...) for each one. What I'd do then is to do it in a transaction to make the updates go faster. See this answer.
I am going to generate simple CSV file report in Java using Hibernate and MySQL.
I am using Native SQL (because query is too complex which is not possible with HQL or Criteria query and also this doesn't matter here) part of Hibernate to fetch the data and simply writing it using any of CSVWriter api (this doesn't matter here.)
As far all is well, but the problem starts now.
Requirements:
The report size can be with 5000K to 15000K records with 25 fields.
It can be run on real time.
There is one report column (let's say finalValue) for which I want sorting and it can be extract like this, (sum(b.quantity*c.unit_gross_price) - COALESCE(sum(pai.value),0)).
Problem:
MySQL Indexing can not be used for finalValue column (mentioned above) as it is complex combination of aggregate functions. So if execute the query (with or without limit) with sorting, it is taking 40sec, otherwise 0.075sec.
The Solutions:
These are the some solutions, that I can think but each have some limitations.
Sorting using java.util.TreeSet : It will throw the OutOfMemoryError, which is obvious as heap space will be exceed if I will put 15000K heavy objects.
Using limit in MySQL query and write file for each iteration : It will take much time as every query will take same time around 50sec as without sorting limit can't be use.
So the main problem here is to overcome two parameters : Memory and Time. I need to balance both of them.
Any ideas, suggestions?
NOTE: I am not given here any snaps of code that doesn't mean question details is not enough. Code doe's not require here.
I think you can use a streaming ResultSet here. As documeted on this page under the ResultSet section.
Here are the main points from the documentation.
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.
There are some caveats with this approach. You must read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.
The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.
If using streaming results, process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.
So, with a streaming result-set, write your order by query, and then start writing the results into your CSV file.
This still probably doesn't solve the sorting issue, but I think if you can't pre-generate that value and put an index on it, the sorting is going to take some time.
However, there might be some server config variables that you can use to optimize the sorting performance.
From the MySQL Order-By optimization page
I think you can set the read_rnd_buffer_size value, which, according to the docs, can:
Setting the variable to a large value can improve ORDER BY performance by a lot
Another one is sort_buffer_size, for which, the docs say the follwing:
If you see many Sort_merge_passes per second in SHOW GLOBAL STATUS output, you can consider increasing the sort_buffer_size value to speed up ORDER BY or GROUP BY operations that cannot be improved with query optimization or improved indexing.
Another variable that can probably help is the innodb_buffer_pool_size. Which allows innodb to keep as much table data in memory as possible and avoid some disk-seeks.
However, all of these variables require some tuning. Some trial-and-error and probably some kind of benchmarking to get right.
There are some other suggestions on that MySQL Order-By optimization page as well.
Use a temporary table to store your select result with an index on finalValue. This will store and index your intermediate result.
CREATE TEMPORARY TABLE my_temp_table (INDEX my_index_name (finalValue))
SELECT ... -- your select
Note that complex expressions will require an alias in your SELECT to be used as a part of a CREATE TABLE SELECT. I assume that your SELECT has the alias finalValue (the column you mentioned).
Then select the temporary table ordered by the finalValue (the index will be used).
SELECT * FROM my_temp_table ORDER BY finalValue;
And finally drop the temporary table (or reuse it if you want, but remember that when client session terminates temporary data is automatically deleted).
Summary tables. (Let's see more details to be sure this is Data Warehouse type data.) Summary tables are augmented periodically with subtotals and counts. Then when the report is needed, the data is readily available almost directly from the summary table, rather than scanning lots of raw data and doing aggregates.
My blog on Summary Tables. Let's see your schema and report query; we can discuss this in more detail.
I have a multi-threaded client/server system with thousands of clients continuously sending data to the server that is stored in a specific table. This data is only important for a few days, so it's deleted afterwards.
The server is written in J2SE, database is MySQL and my table uses InnoDB engine. It contains some millions of entries (and is indexed properly for the usage).
One scheduled thread is running once a day to delete old entries. This thread could take a large amount of time for deleting, because the number of rows to delete could be very large (some millions of rows).
On my specific system deletion of 2.5 million rows would take about 3 minutes.
The inserting threads (and reading threads) get a timeout error telling me
Lock wait timeout exceeded; try restarting transaction
How can I simply get that state from my Java code? I would prefer handling the situation on my own instead of waiting. But the more important point is, how to prevent that situation?
Could I use
conn.setIsolationLevel( Connection.TRANSACTION_READ_UNCOMMITTED )
for the reading threads, so they will get their information regardless if it is most currently accurate (which is absolutely OK for this usecase)?
What can I do to my inserting threads to prevent blocking? They purely insert data into the table (primary key is the tuple userid, servertimemillis).
Should I change my deletion thread? It is purely deleting data for the tuple userid, greater than specialtimestamp.
Edit:
When reading the MySQL documentation, I wonder if I cannot simply define the connection for inserting and deleting rows with
conn.setIsolationLevel( Connection.TRANSACTION_READ_COMMITTED )
and achieve what I need. It says that UPDATE- and DELETE statements, that use a unique index with a unique search pattern only lock the matching index entry, but not the gap before and with that, rows can still be inserted into that gap. It would be great to get your experience on that, since I can't simply try it on production - and it is a big effort to simulate it on test environment.
Try in your deletion thread to first load the IDs of the records to be deleted and then delete one at a time, committing after each delete.
If you run the thread that does the huge delete once a day and it takes 3 minutes, you can split it to smaller transactions that delete a small number of records, and still manage to get it done fast enough.
A better solution :
First of all. Any solution you try must be tested prior to deployment in production. Especially a solution suggested by some random person on some random web site.
Now, here's the solution I suggest (making some assumptions regarding your table structure and indices, since you didn't specify them):
Alter your table. It's not recommended to have a primary key of multiple columns in InnoDB, especially in large tables (since the primary key is included automatically in any other indices). See the answer to this question for more reasons. You should add some unique RecordID column as primary key (I'd recommend a long identifier, or BIGINT in MySQL).
Select the rows for deletion - execute "SELECT RecordID FROM YourTable where ServerTimeMillis < ?".
Commit (to release the lock on the ServerTimeMillis index, which I assume you have, quickly)
For each RecordID, execute "DELETE FROM YourTable WHERE RecordID = ?"
Commit after each record or after each X records (I'm not sure whether that would make much difference). Perhaps even one Commit at the end of the DELETE commands will suffice, since with my suggested new logic, only the deleted rows should be locked.
As for changing the isolation level. I don't think you have to do it. I can't suggest whether you can do it or not, since I don't know the logic of your server, and how it will be affected by such a change.
You can try to replace your one huge DELETE with multiple shorter DELETE ... LIMIT n with n being determined after testing (not too small to cause many queries and not too large to cause long locks). Since the locks would last for a few ms (or seconds, depending on your n) you could let the delete thread run continuously (provided it can keep-up; again n can be adjusted so it can keep-up).
Also, table partitioning can help.
I've been looking around trying to determine some Hibernate behavior that I'm unsure about. In a scenario where Hibernate batching is properly set up, will it only ever use multiple insert statements when a batch is sent? Is it not possible to use a DB independent multi-insert statement?
I guess I'm trying to determine if I actually have the batching set up correctly. I see the multiple insert statements but then I also see the line "Executing batch size: 25."
There's a lot of code I could post but I'm trying to keep this general. So, my questions are:
1) What can you read in the logs to be certain that batching is being used?
2) Is it possible to make Hibernate use a multi-row insert versus multiple insert statements?
Hibernate uses multiple insert statements (one per entity to insert), but sends them to the database in batch mode (using Statement.addBatch() and Statement.executeBatch()). This is the reason you're seeing multiple insert statements in the log, but also "Executing batch size: 25".
The use of batched statements greatly reduces the number of roundtrips to the database, and I would be surprised if it were less efficient than executing a single statement with multiple inserts. Moreover, it also allows mixing updates and inserts, for example, in a single database call.
I'm pretty sure it's not possible to make Hibernate use multi-row inserts, but I'm also pretty sure it would be useless.
I know that this is an old question but i had the same problem that i thought that hibernate batching means that hibernate would combine multiple inserts into one statement which it doesn't seem to do.
After some testing i found this answer that a batch of multiple inserts is just as good as a multi-row insert. I did a test inserting 1000 rows one time using hibernate batch and one time without. Both tests took about 20s so there was no performace gain in using hibernate batch.
To be sure i tried using the rewriteBatchedStatements option from the MySQL Connector/J which actually combines multiple inserts into one statement. It reduced the time to insert 1000 records down to 3s.
So after all hibernate batch seems to be useless and a real multi-row insert to be much better. Am i doing something wrong or what causes my test results?
The Oracle bulk insert collect an array of entyty and pass in a single block to the db associating to it a unic ciclic insert/update/delete.
Is unic way to speed network throughput .
Oracle suggest to do it calling a stored procedure from hibernate passing it an array of datas.
http://biemond.blogspot.it/2012/03/oracle-bulk-insert-or-select-from-java.html?m=1
Is not only a software problem but infrastructural!
Problem is network data flow optimization and TCP stack fragmentation.
Mysql have function.
You have to do something like what is described in this article.
Normal transfer on network the correct volume of data is the solution
You have also to verify network mtu and Oracle sdu/tdu utilization respect data transferred between application and database
I have an existing application that I am working w/ and the customer has defined the table structure they would like for an audit log. It has the following columns:
storeNo
timeChanged
user
tableChanged
fieldChanged
BeforeValue
AfterValue
Usually I just have simple audit columns on each table that provide a userChanged, and timeChanged value. The application that will be writing to these tables is a java application, and the calls are made via jdbc, on an oracle database. The question I have is what is the best way to get the before/after values. I hate to compare objects to see what changes were made to populate this table, this is not going to be efficient. If several columns change in one update, then this new table will have several entries. Or is there a way to do this in oracle? What have others done in the past to track not only changes but changed values?
This traditionally what oracle triggers are for. Each insert or update triggers a stored procedure which has access to the "before and after" data, which you can do with as you please, such as logging the old values to an audit table. It's transparent to the application.
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:59412348055
If you use Oracle 10g or later, you can use built in auditing functions. You paid good money for the license, might as well use it.
Read more at http://www.oracle.com/technology/pub/articles/10gdba/week10_10gdba.html
"the customer has defined the table structure they would like for an audit log"
Dread words.
Here is how you would implement such a thing:
create or replace trigger emp_bur before insert on emp for each row
begin
if :new.ename = :old.ename then
insert_audit_record('EMP', 'ENAME', :old.ename, :new.ename);
end if;
if :new.sal = :old.sal then
insert_audit_record('EMP', 'SAL', :old.sal, :new.sal);
end if;
if :new.deptno = :old.deptno then
insert_audit_record('EMP', 'DEPTNO', :old.deptno, :new.deptno);
end if;
end;
/
As you can see, it involves a lot of repetition, but that is easy enough to handle, with a code generator built over the data dictionary. But there are more serious problems with this approach.
It has a sizeable overhead: an
single update which touches ten
field will generate ten insert
statements.
The BeforeValue and AfterValue
columns become problematic when we
have to handle different datatypes -
even dates and timestamps become
interesting, let alone CLOBs.
It is hard to reconstruct the state
of a record at a point in time. We
need to start with the earliest
version of the record and apply the
subsequent changes incrementally.
It is not immediately obvious how
this approach would handle INSERT
and DELETE statements.
Now, none of those objections are a problem if the customer's underlying requirement is to monitor changes to a handful of sensitive columns: EMPLOYEES.SALARY, CREDIT_CARDS.LIMIT, etc. But if the requirement is to monitor changes to every table, a "whole record" approach is better: just insert a single audit record for each row affected by the DML.
I'll ditto on triggers.
If you have to do it at the application level, I don't see how it would be possible without going through these steps:
start a transaction
SELECT FOR UPDATE of the record to be changed
for each field to be changed, pick up the old value from the record and the new value from the program logic
for each field to be changed, write an audit record
update the record
end the transaction
If there's a lot of this, I think I would be creating an update-record function to do the compares, either at a generic level or a separate function for each table.