I have a table that contains approximately 10 million rows. This table is periodically updated (a few times a day) by an external process. The table contains information that, if not in the update, should be deleted. Of course, you don't know whether a row is in the update until the update has finished.
Right now, we take the timestamp of when the update began. When the update finishes, anything that has an "updated" value less than the start timestamp is wiped. This works for now, but is problematic when the updater process crashes for whatever reason - we have to start again with a new timestamp value.
It seems to me that there must be something more robust, as this is a common problem. Any advice?
Instead of a timestamp, use an integer revision number. Increment it ONLY when you have a complete update, and then delete the rows with out-of-date revisions.
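A minimal sketch of that idea with plain JDBC, assuming a hypothetical items table that carries an integer revision column (all names here are made up for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class RevisionPurge {

    // During the load, every row present in the update is stamped with
    // newRevision (done by the loader itself). Call this method only after
    // the load has completed successfully. If the loader crashes first,
    // nothing is deleted; you simply rerun the load with the same revision.
    static int purgeStaleRows(Connection conn, int newRevision) throws SQLException {
        try (PreparedStatement purge = conn.prepareStatement(
                "DELETE FROM items WHERE revision < ?")) {
            purge.setInt(1, newRevision);
            return purge.executeUpdate();   // number of out-of-date rows removed
        }
    }
}

One way to make retries safe is to record the completed revision number only after the purge, so a crash at any point just means rerunning the load with the same number.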
If you use a storage engine that supports transactions, like InnoDB (you're using MySQL, right?), you can consider using transactions, so that if the update process crashes, the modifications are not committed.
Here is the official documentation.
We don't know anything about your architecture or how you do this update (pure SQL, a web service?), but you might already have a transaction management layer.
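For example, with plain JDBC the loader could be wrapped roughly like this (a sketch only; applyUpdate stands in for whatever your update process actually does):

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

final class TransactionalLoader {

    // Wraps the whole update in one InnoDB transaction: if the process dies
    // or throws before commit(), MySQL rolls the work back and the table is
    // left exactly as it was before the update started.
    static void load(DataSource dataSource) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try {
                applyUpdate(conn);   // hypothetical: your inserts/updates/deletes go here
                conn.commit();
            } catch (SQLException | RuntimeException e) {
                conn.rollback();
                throw e;
            }
        }
    }

    private static void applyUpdate(Connection conn) throws SQLException {
        // ... actual update statements ...
    }
}

Bear in mind that a single transaction touching millions of rows makes for a large undo log, so test it against production-sized data first.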
I am working on a Spring Boot project. The task is: I should lock the editing capability of a product for 15 minutes after creation, so basically if the user creates a product, this product will be locked for editing for 15 minutes; after that it can be changed or deleted from the DB.
My question is: what is the best approach to achieve that:
1- Should I add a field to the DB table called lastUpdate and then check whether 15 minutes have passed?
2- Should I save all the newly created products in an array and clear this array every 15 minutes?
Or is there any better way with regard to performance and best practice?
I am using Spring Boot with JPA and MySQL.
Thanks.
You should not use the locking available in InnoDB.
Instead, you should have some column in some table that controls the lock. It should probably be a TIMESTAMP so you can decide whether the 15 minutes has been used up.
If the 'expiration' and 'deletion' are triggered by some db action (attempt to use the item, etc), check it as part of that db action. The expiration check (and delete) should be part of the transaction that includes that action; this will use InnoDB locking, but only briefly.
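A rough sketch of that approach with JDBC, assuming a hypothetical product table with a created_at TIMESTAMP column (adjust the names to your schema):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class ProductEdits {

    // Returns true if the edit went through, false if the product is either
    // missing or still inside its 15-minute edit lock. The check and the
    // update happen in one statement, so InnoDB holds the row lock only for
    // the duration of this single UPDATE.
    static boolean rename(Connection conn, long productId, String newName) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE product SET name = ? " +
                "WHERE id = ? AND created_at <= NOW() - INTERVAL 15 MINUTE")) {
            ps.setString(1, newName);
            ps.setLong(2, productId);
            return ps.executeUpdate() == 1;
        }
    }
}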
If there is no such action, then use either a MySQL EVENT or an OS "cron job" to run every few minutes and purge anything older than 15 minutes. (There will be a slight delay in purging, but that should not matter.)
If you provide the possible SQL statements that might occur during the lifetime of the items, I may be able to be more specific.
You can add a check in your update and delete methods. If there are many such methods, you can use AOP.
You can make use of both the functionalities you have mentioned.
First, it's good to have a lastUpdated field in your tables; it will also help you with other functionality in the future.
And then you can have an internal cache (a map holding the creation time per object reference), store newly created objects in it, and restrict editing for them. You can run a scheduler that checks every minute, clears expired objects from your map, and makes them available for updating.
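A rough sketch of that idea in Spring (class and field names are invented; note that this keeps the lock purely in application memory, so it is lost on restart and not shared across multiple instances):

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ProductEditLockCache {

    private static final Duration LOCK_DURATION = Duration.ofMinutes(15);

    // productId -> creation time of the product
    private final Map<Long, Instant> lockedProducts = new ConcurrentHashMap<>();

    public void lock(Long productId) {
        lockedProducts.put(productId, Instant.now());
    }

    public boolean isLocked(Long productId) {
        Instant createdAt = lockedProducts.get(productId);
        return createdAt != null && Instant.now().isBefore(createdAt.plus(LOCK_DURATION));
    }

    // Clears expired entries once a minute so the map does not grow forever.
    @Scheduled(fixedRate = 60_000)
    public void evictExpired() {
        Instant cutoff = Instant.now().minus(LOCK_DURATION);
        lockedProducts.values().removeIf(createdAt -> createdAt.isBefore(cutoff));
    }
}

Remember that @Scheduled needs @EnableScheduling on one of your configuration classes.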
You could put your new products in an "incoming_products" table and give that table a timestamp column that you set to DATE_ADD(NOW(), INTERVAL 15 MINUTE).
Then have a @Scheduled method in your Boot application run every minute to check for incoming products where the timestamp column is < NOW(), insert them as products and delete the corresponding incoming_products records.
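A rough sketch of such a scheduled job with Spring's JdbcTemplate (the release_at column and the product columns are assumptions based on the description above):

import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class IncomingProductPromoter {

    private final JdbcTemplate jdbcTemplate;

    public IncomingProductPromoter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Runs every minute: moves rows whose release time has passed from
    // incoming_products into products, then removes them from the staging table.
    @Scheduled(fixedRate = 60_000)
    @Transactional
    public void promoteReleasedProducts() {
        // Collect the due ids first so the INSERT and DELETE operate on exactly the same rows.
        List<Long> dueIds = jdbcTemplate.queryForList(
                "SELECT id FROM incoming_products WHERE release_at < NOW()", Long.class);
        for (Long id : dueIds) {
            jdbcTemplate.update(
                    "INSERT INTO products (id, name, price) " +
                    "SELECT id, name, price FROM incoming_products WHERE id = ?", id);
            jdbcTemplate.update("DELETE FROM incoming_products WHERE id = ?", id);
        }
    }
}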
I have a multi-threaded client/server system with thousands of clients continuously sending data to the server that is stored in a specific table. This data is only important for a few days, so it's deleted afterwards.
The server is written in J2SE, database is MySQL and my table uses InnoDB engine. It contains some millions of entries (and is indexed properly for the usage).
One scheduled thread is running once a day to delete old entries. This thread could take a large amount of time for deleting, because the number of rows to delete could be very large (some millions of rows).
On my specific system deletion of 2.5 million rows would take about 3 minutes.
The inserting threads (and reading threads) get a timeout error telling me
Lock wait timeout exceeded; try restarting transaction
How can I simply get that state from my Java code? I would prefer handling the situation on my own instead of waiting. But the more important point is, how to prevent that situation?
Could I use
conn.setTransactionIsolation( Connection.TRANSACTION_READ_UNCOMMITTED )
for the reading threads, so they will get their information regardless of whether it is completely up to date (which is absolutely OK for this use case)?
What can I do to my inserting threads to prevent blocking? They purely insert data into the table (primary key is the tuple userid, servertimemillis).
Should I change my deletion thread? It is purely deleting data for the tuple userid, greater than specialtimestamp.
Edit:
Reading the MySQL documentation, I wonder whether I can simply configure the connections used for inserting and deleting rows with
conn.setTransactionIsolation( Connection.TRANSACTION_READ_COMMITTED )
and achieve what I need. It says that UPDATE and DELETE statements that use a unique index with a unique search condition lock only the matching index record, not the gap before it, so rows can still be inserted into that gap. It would be great to get your experience on that, since I can't simply try it on production - and it is a big effort to simulate it on a test environment.
In your deletion thread, try first loading the IDs of the records to be deleted and then deleting them one at a time, committing after each delete.
If you run the thread that does the huge delete once a day and it takes 3 minutes, you can split it into smaller transactions that each delete a small number of records, and still manage to get it done fast enough.
A better solution:
First of all. Any solution you try must be tested prior to deployment in production. Especially a solution suggested by some random person on some random web site.
Now, here's the solution I suggest (making some assumptions regarding your table structure and indices, since you didn't specify them):
Alter your table. It's not recommended to have a primary key of multiple columns in InnoDB, especially in large tables (since the primary key is included automatically in any other indices). See the answer to this question for more reasons. You should add some unique RecordID column as primary key (I'd recommend a long identifier, or BIGINT in MySQL).
Select the rows for deletion - execute "SELECT RecordID FROM YourTable where ServerTimeMillis < ?".
Commit, to quickly release any locks on the ServerTimeMillis index (which I assume you have).
For each RecordID, execute "DELETE FROM YourTable WHERE RecordID = ?"
Commit after each record or after every X records (I'm not sure whether that would make much difference). Perhaps even a single commit at the end of all the DELETE commands will suffice, since with my suggested new logic only the deleted rows should be locked.
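A sketch of that select-then-delete loop in JDBC (assuming the suggested RecordID primary key is in place and ServerTimeMillis is stored as a BIGINT of epoch milliseconds):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class OldRowPurger {

    static void purgeOlderThan(Connection conn, long cutoffMillis) throws SQLException {
        conn.setAutoCommit(false);

        // Collect the ids to delete, then commit right away so any locks on
        // the ServerTimeMillis index are released quickly.
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement select = conn.prepareStatement(
                "SELECT RecordID FROM YourTable WHERE ServerTimeMillis < ?")) {
            select.setLong(1, cutoffMillis);
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong(1));
                }
            }
        }
        conn.commit();

        // Delete by primary key, committing every 1000 rows so each
        // transaction locks only a small number of records, briefly.
        try (PreparedStatement delete = conn.prepareStatement(
                "DELETE FROM YourTable WHERE RecordID = ?")) {
            int deleted = 0;
            for (Long id : ids) {
                delete.setLong(1, id);
                delete.executeUpdate();
                if (++deleted % 1000 == 0) {
                    conn.commit();
                }
            }
        }
        conn.commit();
    }
}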
As for changing the isolation level. I don't think you have to do it. I can't suggest whether you can do it or not, since I don't know the logic of your server, and how it will be affected by such a change.
You can try to replace your one huge DELETE with multiple shorter DELETE ... LIMIT n statements, with n determined by testing (not so small that it causes many queries, and not so large that it causes long locks). Since each lock would last only a few ms (or seconds, depending on your n) you could let the delete thread run continuously (provided it can keep up; again, n can be adjusted so it can keep up).
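For example, a sketch of such a chunked delete loop (the chunk size here is just a starting point to tune by testing):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class ChunkedDelete {

    // Repeats small DELETEs until nothing old is left; with auto-commit on,
    // each statement is its own short transaction, so locks are only held briefly.
    static void deleteInChunks(Connection conn, long cutoffMillis, int chunkSize) throws SQLException {
        conn.setAutoCommit(true);
        try (PreparedStatement delete = conn.prepareStatement(
                "DELETE FROM YourTable WHERE ServerTimeMillis < ? LIMIT ?")) {
            int deleted;
            do {
                delete.setLong(1, cutoffMillis);
                delete.setInt(2, chunkSize);
                deleted = delete.executeUpdate();
            } while (deleted == chunkSize);   // a full chunk means there may be more to delete
        }
    }
}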
Also, table partitioning can help.
One of my Java application's functions is to read and parse an XML file very frequently (about every 5 minutes) and populate a database table. I have created a cron job to do that. Most of the columns' values remain the same, but certain columns may be updated frequently. I was wondering what is the most efficient way of doing that:
1) Delete the table every time and re-create it or
2) Update the table data and specifically the column where a change in the source file has appeared.
The number of rows parsed and persisted every time is about 40000-50000.
I would assume that around 2000-3000 rows need to be updated on every cron job run.
I am using JPA to persist data to a mysql server and I have gone for the first option so far.
Obviously for both options the job would execute as a single transaction.
Any ideas which one is better and possibly any optimization suggestions?
I would suggest scheduling your jobs using something more sophisticated than cron. For instance, Quartz.
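For example, with Quartz 2.x the 5-minute schedule could be wired up roughly like this (XmlImportJob is an invented job class standing in for your parse-and-persist logic):

import org.quartz.CronScheduleBuilder;
import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class XmlImportScheduling {

    // Prevents two imports from overlapping if one run takes longer than 5 minutes.
    @DisallowConcurrentExecution
    public static class XmlImportJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // parse the XML file and update the table here
        }
    }

    public static void main(String[] args) throws SchedulerException {
        JobDetail job = JobBuilder.newJob(XmlImportJob.class)
                .withIdentity("xmlImport")
                .build();

        // Fire at second 0 of every 5th minute.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0/5 * * * ?"))
                .build();

        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}

Quartz also gives you misfire handling and persistent job stores if you need them.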
I am developing an application using a normal JDBC connection. The application is developed with Java/Java EE and Spring MVC 3.0, with SQL Server 2008 as the database. I am required to update a table based on a non-primary-key column.
Now, before updating the table we had to decide on an approach, as the table may contain a huge amount of data. The update query will be executed in a batch, and we are required to design the application so that it doesn't hog system resources.
Now, we had to decide between two approaches:
1. SELECT DATA BEFORE YOU UPDATE or
2. UPDATE DATA AND THEN SELECT MISSING DATA.
Selecting data before the update is only beneficial if the chance of failure is high, i.e. if out of a batch of 100 update queries only 20 rows are updated successfully, then this approach should be taken.
Updating data and then checking for missing data is beneficial only when failed records are far fewer. With this approach one database select call can be avoided: after a batch update, the count of records updated can be taken, and the select query is executed if and only if there is a mismatch between that count and the number of queries.
We are totally unaware of the production environment, but we want to cover all possibilities and want a faster system. I need your inputs on which is the better approach.
Since there is a 50:50 chance of successful updates versus faster selects, it's hard to tell from the scenario described. You probably want a fuzzy-logic approach: keep getting feedback on how many updates were successful over a period of time, and then decide on the basis of that data whether to select before updating or to update and then select.
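Whichever way you lean, the "update first, then look for misses" variant mostly comes down to inspecting the counts JDBC already gives back from the batch; a sketch (table and column names are invented):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class BatchUpdateWithMissCheck {

    // Runs the batch, then returns the keys whose UPDATE matched no row, so the
    // caller only needs a follow-up SELECT/INSERT when the list is non-empty.
    static List<String> updateAndFindMisses(Connection conn, List<String> keys, String newValue)
            throws SQLException {
        List<String> missed = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE my_table SET some_column = ? WHERE business_key = ?")) {
            for (String key : keys) {
                ps.setString(1, newValue);
                ps.setString(2, key);
                ps.addBatch();
            }
            int[] counts = ps.executeBatch();
            for (int i = 0; i < counts.length; i++) {
                // Some drivers may report Statement.SUCCESS_NO_INFO (-2) instead of
                // exact counts; in that case this check needs a different strategy.
                if (counts[i] == 0) {
                    missed.add(keys.get(i));
                }
            }
        }
        return missed;
    }
}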
I have an existing application that I am working w/ and the customer has defined the table structure they would like for an audit log. It has the following columns:
storeNo
timeChanged
user
tableChanged
fieldChanged
BeforeValue
AfterValue
Usually I just have simple audit columns on each table that provide a userChanged and a timeChanged value. The application that will be writing to these tables is a Java application, and the calls are made via JDBC, on an Oracle database. The question I have is: what is the best way to get the before/after values? I hate to compare objects to see what changes were made in order to populate this table; that is not going to be efficient. If several columns change in one update, then this new table will have several entries. Or is there a way to do this in Oracle? What have others done in the past to track not only changes but the changed values?
This is traditionally what Oracle triggers are for. Each insert or update fires a trigger, which has access to the "before and after" data that you can do with as you please, such as logging the old values to an audit table. It's transparent to the application.
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:59412348055
If you use Oracle 10g or later, you can use built in auditing functions. You paid good money for the license, might as well use it.
Read more at http://www.oracle.com/technology/pub/articles/10gdba/week10_10gdba.html
"the customer has defined the table structure they would like for an audit log"
Dread words.
Here is how you would implement such a thing:
create or replace trigger emp_bur before update on emp for each row
begin
    if :new.ename != :old.ename then
        insert_audit_record('EMP', 'ENAME', :old.ename, :new.ename);
    end if;
    if :new.sal != :old.sal then
        insert_audit_record('EMP', 'SAL', :old.sal, :new.sal);
    end if;
    if :new.deptno != :old.deptno then
        insert_audit_record('EMP', 'DEPTNO', :old.deptno, :new.deptno);
    end if;
end;
/
As you can see, it involves a lot of repetition, but that is easy enough to handle, with a code generator built over the data dictionary. But there are more serious problems with this approach.
1. It has a sizeable overhead: a single update which touches ten fields will generate ten insert statements.
2. The BeforeValue and AfterValue columns become problematic when we have to handle different datatypes - even dates and timestamps become interesting, let alone CLOBs.
3. It is hard to reconstruct the state of a record at a point in time. We need to start with the earliest version of the record and apply the subsequent changes incrementally.
4. It is not immediately obvious how this approach would handle INSERT and DELETE statements.
Now, none of those objections are a problem if the customer's underlying requirement is to monitor changes to a handful of sensitive columns: EMPLOYEES.SALARY, CREDIT_CARDS.LIMIT, etc. But if the requirement is to monitor changes to every table, a "whole record" approach is better: just insert a single audit record for each row affected by the DML.
I'll ditto on triggers.
If you have to do it at the application level, I don't see how it would be possible without going through these steps (sketched in code below):
1. start a transaction
2. SELECT ... FOR UPDATE the record to be changed
3. for each field to be changed, pick up the old value from the record and the new value from the program logic
4. for each field to be changed, write an audit record
5. update the record
6. end the transaction
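A sketch of those steps in JDBC for a single field, using the audit columns from the question (the emp table, the audit_log table name and the salary example are only illustrative; "user" is a reserved word in Oracle, so that column appears here as user_name):

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

final class AuditedUpdate {

    static void updateSalary(Connection conn, int empNo, String user, BigDecimal newSalary)
            throws SQLException {
        conn.setAutoCommit(false);                      // 1. start a transaction
        try {
            BigDecimal oldSalary;
            try (PreparedStatement lock = conn.prepareStatement(
                    "SELECT sal FROM emp WHERE empno = ? FOR UPDATE")) {   // 2. lock the row
                lock.setInt(1, empNo);
                try (ResultSet rs = lock.executeQuery()) {
                    rs.next();
                    oldSalary = rs.getBigDecimal(1);    // 3. old value from the record
                }
            }

            if (oldSalary.compareTo(newSalary) != 0) {
                try (PreparedStatement audit = conn.prepareStatement(
                        "INSERT INTO audit_log (storeNo, timeChanged, user_name, tableChanged, "
                        + "fieldChanged, BeforeValue, AfterValue) VALUES (?, ?, ?, ?, ?, ?, ?)")) {
                    audit.setInt(1, 1);                 // whatever identifies the store
                    audit.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
                    audit.setString(3, user);
                    audit.setString(4, "EMP");
                    audit.setString(5, "SAL");
                    audit.setString(6, oldSalary.toPlainString());
                    audit.setString(7, newSalary.toPlainString());
                    audit.executeUpdate();              // 4. write the audit record
                }

                try (PreparedStatement update = conn.prepareStatement(
                        "UPDATE emp SET sal = ? WHERE empno = ?")) {
                    update.setBigDecimal(1, newSalary);
                    update.setInt(2, empNo);
                    update.executeUpdate();             // 5. update the record
                }
            }

            conn.commit();                              // 6. end the transaction
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}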
If there's a lot of this, I think I would be creating an update-record function to do the compares, either at a generic level or a separate function for each table.