I have a J2EE server, currently running only one thread (the problem arises even within one single request) to save its internal model of data to MySQL/INNODB-tables.
Basic idea is to read data from flat files, do a lot of calculation and then write the result to MySQL. Read another set of flat files for the next day and repeat with step 1. As only a minor part of the rows change, I use a recordset of already written rows, compare to the current result in memory and then update/insert it correspondingly (no delete, just setting a deletedFlag).
Problem: Despite a purely sequential process I get lock timeout errors (#1204) and Innodump show record locks (though I do not know how to figure the details). To complicate things under my windows machine everything works, while the production system (where I can't install innotop) has some record locks.
To the critical code:
Read data and calculate (works)
Get Connection from Tomcat Pool and set to autocommit=false
Use Statement to issue "LOCK TABLES order WRITE"
Open Recordset (Updateable) on table order
For each row in Recordset --> if difference, update from in-memory-object
For objects not yet in the database --> Insert data
Commit Connection, Close Connection
The Steps 5/6 have an Commitcounter so that every 500 changes the rows are committed (to avoid having 50.000 rows uncommitted). In the first run (so w/o any locks) this takes max. 30sec / table.
As stated above right now I avoid any other interaction with the database, but it in future other processes (user requests) might read data or even write some fields. I would not mind for those processes to read either old/new data and to wait for a couple of minutes to save changes to the db (that is for a lock).
I would be happy to any recommendation to do better than that.
Summary: Complex code calculates in-memory objects which are to be synchronized with database. This sync currently seems to lock itself despite the fact that it sequentially locks, changes unlocks the tables without any exceptions thrown. But for some reason row locks seem to remain.
Kind regards
Additional information:
Mysql: show processlist lists no active connections (all asleep or alternatively waiting for table locks on table order) while "show engine INNODB" reports a number of row locks (unfortuantely I can't understand which transaction is meant as output is quite cryptic).
Solved: I wrongly declared a ResultSet as updateable. The ResultSet was closed only on a "finalize()" method via Garbage Collector which was not fast enough - before I reopended the ResultSet and tried therefore to aquire a lock on an already locked table.
Yet it was odd, that innotop showed another query of mine to hang on a completely different table. Though as it works for me, I do not care about oddities:-)
Related
I have a use case where in I need to process some queries in batch ( stmt.addBatch() ) and some queries need to be executed as soon as they are created (stmt.executeQuery() ) because their result will be used in the queries going for batch processing. The database is (obviously) common for both. My question is can I keep two open connections throughout for the above use case ? Will that be a good idea considering consistency & viability in mind ?
Edit : What if the queries in question were to act upon different records, can 2 simultaneous active connection objects be kept alive for handling each scenario ?
P.S. - I am relatively new to backend & databases specifically, request you to include some explanation / further reading pointers for the same.
In your case it is basically not predictable which will start first as also processors can shift commands ad risc level.
in a view of a rdms, you can have simultaneous access to the same table/ row for that you can read more about locking
for a executebatch is one transaction for the hole batch, so if it would start first, your single command would be locked out, till the lock was lifted, and then proceed if not the timeout hits first.
so for one rdms the consistency is guaranteed, for replicated system it gets more complicated
Oh I see an edit:
Row level locks are mentioned in the first link, which would answer your edit also:
I am fetching records (of large data set, around 1 Million records)from MariaDB in batches of size 500 (by using 'limit').
For each fetch iteration I am opening and closing the connection.
In my peer review I was advised to fetch the result set once and batch process by iterating on the result set itself, i.e. without closing the connection.
Is the second method right way of doing it ?
Edit : After I fetch records in batches of size 500 I am updating a field for each record and putting it on a messaging queue.
Yes, the second method is the right way to do it. Here are some reasons:
There is overhead to running the query multiple times.
The underlying data might change in the tables you are using, and the separate batches might be inconsistent.
You are depending on the ordering of the results and the sorting might have duplicates.
Your program starts
Connect to database
do some SQL (selects/inserts/whatever)
do some more SQL (selects/inserts/whatever)
do some more SQL (selects/inserts/whatever)
...
Disconnect from database
Your program ends
That is, keep the connection open as long as needed during the program. (Even if you don't explicitly disconnect, the termination of your program will do the termination. This is important to note when doing a web site -- each 'page' is essentially a separate 'program'; the db connection cannot be held between pages.)
You have another implied question... "Should I grab a batch of rows at once, then process them in the client?" The answer is "It depends".
If the processing can be done in SQL, it is probably much more efficient to do it there. Example: summing up some numbers.
If you fetch some rows from one table, then for each of those rows, fetch row(s) from another table... It will be much more efficient to use an SQL JOIN.
"Batching" may not be relevant. The client interface is probably
Fetch rows from a table (possibly all rows, even if millions)
Look at each row in turn.
Please provide the specifics of what you will be doing with the million rows so we can discuss more specifically.
Polling loop:
If you check for new things to do only once a minute, do reconnect each time.
Given that, it does not make sense to hang onto a resultset between pools.
Opening and closing a database connection is quite time consuming. Keeping the connection open will save a lot of time.
In my application I have the problem that sometimes SELECT statements run into a java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction exception. Sadly I can't create an example as the circumstances are very complex. So the question is just about general understanding.
A little bit of background information: I'm using MySQL (InnoDB) with READ_COMMITED isolation level.
Actually I don't understand how a SELECT can ever run into a lock timeout with that setup. I thought that a SELECT would never lock as it will just return the latest commited state (managed by MySQL). Anyway according to what is happening this seems to be wrong. So how is it really?
I already read this https://dev.mysql.com/doc/refman/8.0/en/innodb-locking.html but that didn't really give me a clue. No SELECT ... FOR UPDATE or something like that is used.
That is probably due to your database. Usually this kind of problems come from that side, not from the programming side that access it.In my experience with db's, these problems are usually due to that. In the end, the programming side is just "go and get that for me in that db" thing.
I found this without much effort.
It basically explains that:
Lock wait timeout occurs typically when a transaction is waiting on row(s) of data to update which is already been locked by some other transaction.
You should also check this answer that has a specific transaction problem, which might help you, as trying to change different tables might do the timeout
the query was attempting to change at least one row in one or more InnoDB tables. Since you know the query, all the tables being accessed are candidates for being the culprit.
To speed up queries in a DB, several transactions can be executed at the same time. For example if someone runs a select query over a table for the wages of the employees of a company (each employee identified by an id) and another one changes the last name of someone who e.g. has married, you can execute both queries at the same time because they don't interfere.
But in other cases even a SELECT statement might interfere with another statement.
To prevent unexpected results in a SQL transactions, transactions follow the ACID-model which stands for Atomicity, Consistency, Isolation and Durability (for further information read wikipedia).
Let's say transaction 1 starts to calculate something and then wants to write the results to table A. Before writing it it locks all SELECT statements to table A. Otherwise this would interfere with the Isolation requirement. Because if a transaction 2 would start while 1 is still writing, 2's results depend on where 1 has already written to and where not.
Now, it might even produce a dead-lock. E.g. before transaction 1 can write the last field in table A, it still has to write something to table B but transaction 2 has already blocked table B to read safely from it after it read from A and now you have a deadlock. 2 wants to read from A which is blocked by 1, so it waits for 1 to finish but 1 waits for 2 to unlock table B to finish by itself.
To solve this problem one strategy is to rollback certain transactions after a certain timeout. (more here)
So that might be a read on for your select statement to get a lock wait timeout exceeded.
But a dead-lock usually just happens by coincidence, so if transaction 2 was forced to rollback, transaction 1 should be able to finish so that 2 should be able to succeed on a later try.
I'd like to realize following scenario in PosgreSql from java:
User selects data
User starts transaction: inserts, updates, deletes data
User commits transaction
I'd like data not be available for other users during the transaction. It would be enough if I'd get an exception when other user tries to update the table.
I've tried to use select for update or select for share, but it locks data for reading also. I've tried to use lock command, but I'm not able to get a lock (ERROR: could not obtain lock on relation "fppo10") or another transaction gets lock when trying to commit transaction, not when updating the data.
Does it exist a way to lock data in a moment of transaction start to prevent any other call of update, insert or delete statement?
I have this scenario working successfully for a couple of years on DB2 database. Now I need the same application to work also for PostgreSql.
Finally, I think I get what you're going for.
This isn't a "transaction" problem per se (and depending on the number of tables to deal with and the required statements, you may not even need one), it's an application design problem. You have two general ways to deal with this; optimistic and pessimistic locking.
Pessimistic locking is explicitly taking and holding a lock. It's best used when you can guarantee that you will be changing the row plus stuff related to it, and when your transactions will be short. You would use it in situations like updating "current balance" when adding sales to an account, once a purchase has been made (update will happen, short transaction duration time because no further choices to be made at that point). Pessimistic locking becomes frustrating if a user reads a row and then goes to lunch (or on vacation...).
Optimistic locking is reading a row (or set of), and not taking any sort of db-layer lock. It's best used if you're just reading through rows, without any immediate plan to update any of them. Usually, row data will include a "version" value (incremented counter or last updated timestamp). If your application goes to update the row, it compares the original value(s) of the data to make sure it hasn't been changed by something else first, and alerts the user if the data changed. Most applications interfacing with users should use optimistic locking. It does, however, require that users notice and pay attention to updated values.
Note that, because a lock is rarely (and for a short period) taken in optimistic locking, it usually will not conflict with a separate process that takes a pessimistic lock. A pessimistic locking app would prevent an optimistic one from updating locked rows, but not reading them.
Also note that this doesn't usually apply to bulk updates, which will have almost no user interaction (if any).
tl;dr
Don't lock your rows on read. Just compare the old value(s) with what the app last read, and reject the update if they don't match (and alert the user). Train your users to respond appropriately.
Instead of select for update try a "row exclusive" table lock:
LOCK TABLE YourTable IN ROW EXCLUSIVE MODE;
According to the documentation, this lock:
The commands UPDATE, DELETE, and INSERT acquire this lock mode on the
target table (in addition to ACCESS SHARE locks on any other
referenced tables). In general, this lock mode will be acquired by any
command that modifies data in a table.
Note that the name of the lock is confusing, but it does lock the entire table:
Remember that all of these lock modes are table-level locks, even if
the name contains the word "row"; the names of the lock modes are
historical
I want to iterate over records in the database and update them. However since that updating is both taking some time and prone to errors, I need to a) don't keep the db waiting (as e.g. with a ScrollableResults) and b) commit after each update.
Second thing is that this is done in multiple threads, so I need to ensure that if thread A is taking care of a record, thread B is getting another one.
How can I implement this sensibly with hibernate?
To give a better idea, the following code would be executed by several threads, where all threads share a single instance of the RecordIterator:
Iterator<Record> iter = db.getRecordIterator();
while(iter.hasNext()){
Record rec = iter.next();
// do something lengthy here
db.save(rec);
}
So my question is how to implement the RecordIterator. If on every next() I perform a query, how to ensure that I don't return the same record twice? If I don't, which query to use to return detached objects? Is there a flaw in the general approach (e.g. use one RecordIterator per thread and let the db somehow handle synchronization)? Additional info: there are way to many records to locally keep them (e.g. in a set of treated records).
Update: Because the overall process takes some time, it can happen that the status of Records changes. Due to that the ordering of the result of a query can change. I guess to solve this problem I have to mark records in the database once I return them for processing...
Hmmm, what about pushing your objects from a reader thread in some bounded blocking queue, and let your updater threads read from that queue.
In your reader, do some paging with setFirstResult/setMaxResults. E.g. if you have 1000 elements maximum in your queue, fill them up 500 at a time. When the queue is full, the next push will automatically wait until the updaters take the next elements.
My suggestion would be, since you're sharing an instance of the master iterator, is to run all of your threads using a shared Hibernate transaction, with one load at the beginning and a big save at the end. You load all of your data into a single 'Set' which you can iterate over using your threads (be careful of locking, so you might want to split off a section for each thread, or somehow manage the shared resource so that you don't overlap).
The beauty of the Hibernate solution is that the records aren't immediately saved to the database, since you're using a transaction, and are stored in hibernate's cache. Then at the end they'd all be written back to the database at once. This would save on those expensive database writes you're worried about, plus it gives you an actual object to work with on each iteration, instead of just a database row.
I see in your update that the status of the records may change during processing, and this could always cause a problem. If this is a constantly running process or long running, then my advice using a hibernate solution would be to work in smaller sets, and yes, add a flag to mark records that have been updated, so that when you move to the next set you can pick up ones that haven't been touched.