I have a multi-threaded Java program where each thread gets one username for some processing, which takes about 10 minutes or so.
Right now it gets the usernames via a SQL query that returns one username at random, and the problem is that the same username can be handed to more than one thread at a time.
I don't want a username that is being processed by one thread to be fetched again by another thread. What is a simple and easy way to achieve this?
Step-by-step solution:
Create a threads table where you store each thread's state. Among other columns, you need to store the owning user's ID there as well.
When a thread is associated with a user, create a record, storing the owner along with all the other juicy stuff.
When a thread is no longer associated with a user, set its owner to null.
When a thread finishes its job, remove its record.
When you pick a random user for a thread, filter out all the users who are already associated with at least one thread. This way you know any user left at the end of the randomization is threadless.
Make sure everything stays consistent: if, while working on the feature, some thread records were created that should have been removed or detached from their owner, clean them up.
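A minimal JDBC sketch of that claim step might look like this (the threads/users tables and column names, the MySQL-style RAND(), and the FOR UPDATE locking are all assumptions, not a definitive implementation):

// Sketch: pick a random user that no thread currently owns, and claim it
// in the same transaction so two threads can't grab the same user.
String pickSql =
    "SELECT u.id, u.username FROM users u " +
    "WHERE u.id NOT IN (SELECT t.owner_user_id FROM threads t " +
    "                   WHERE t.owner_user_id IS NOT NULL) " +
    "ORDER BY RAND() LIMIT 1 FOR UPDATE";
try (Connection con = dataSource.getConnection()) {
    con.setAutoCommit(false);
    try (Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(pickSql)) {
        if (rs.next()) {
            long userId = rs.getLong(1);
            String username = rs.getString(2);
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO threads (thread_name, owner_user_id) VALUES (?, ?)")) {
                ins.setString(1, Thread.currentThread().getName());
                ins.setLong(2, userId);
                ins.executeUpdate();
            }
            // hand the username to the worker thread here
        }
        con.commit();
    }
}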
There are lots of ways to do this... I can think of three solutions to this problem:
1) A singleton class with a collection that contains all the users already in use. Be sure that access to the collection is synchronized, and remove users from it once they are no longer being processed (see the sketch after this list).
2) A flag in the user table that contains a unique ID referencing the thread that is using the user. Afterwards you have to manage when the flag is removed from the table.
-> As an alternative, have you checked whether a pool of connections shared by all the threads could be the solution to your problem?
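Option 1) could look like this minimal sketch (the class and method names are hypothetical; a Set is used rather than an array so a duplicate reservation is rejected atomically):

import java.util.HashSet;
import java.util.Set;

public final class UsersInUse {
    private static final UsersInUse INSTANCE = new UsersInUse();
    private final Set<String> inUse = new HashSet<>();

    public static UsersInUse getInstance() { return INSTANCE; }

    // Returns true if the username was free and is now reserved by the caller.
    public synchronized boolean tryReserve(String username) {
        return inUse.add(username);
    }

    // Called by a thread once it is done with the username.
    public synchronized void release(String username) {
        inUse.remove(username);
    }
}

A thread that draws a random username from the database would call tryReserve() and simply ask the database for another name whenever it returns false.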
You could do one batch query that returns all of the usernames you want from the database and store them in a List (or some type of collection).
Then ensure synchronised access to this list to prevent two threads taking the same username at the same time: use a synchronised list, or a synchronised method that both reads a username and removes it from the list in one step.
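For example (a minimal sketch; how the batch query fills the list is up to you):

import java.util.ArrayList;
import java.util.List;

public class UsernameDispenser {
    private final List<String> usernames;

    public UsernameDispenser(List<String> fromBatchQuery) {
        this.usernames = new ArrayList<>(fromBatchQuery);
    }

    // Synchronized, so two threads can never receive the same username.
    public synchronized String takeNext() {
        return usernames.isEmpty() ? null : usernames.remove(usernames.size() - 1);
    }
}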
One way to do it is to add another column to your users table. This column is a simple flag that shows whether a user has an assigned thread or not.
When you query the DB, though, you have to wrap the queries in a transaction:
you begin the transaction, first select a user that doesn't have a thread, then update the flag column, and finally commit or roll back.
Since the queries are wrapped in a transaction, the DB handles all the issues that come up in scenarios like this.
With this solution there is no need to implement synchronization mechanisms in your code, since the database will do it for you.
If you still have problems after doing this, I think you have to configure the isolation levels of your DB server.
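As a rough sketch (the thread_flag column is an assumption, and SELECT ... FOR UPDATE is used so the selected row stays locked until the transaction ends):

// Sketch: claim an unassigned user inside a single transaction.
try (Connection con = dataSource.getConnection()) {
    con.setAutoCommit(false);                     // begin transaction
    try {
        String username = null;
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT username FROM users WHERE thread_flag IS NULL LIMIT 1 FOR UPDATE")) {
            if (rs.next()) {
                username = rs.getString(1);
            }
        }
        if (username != null) {
            try (PreparedStatement up = con.prepareStatement(
                    "UPDATE users SET thread_flag = ? WHERE username = ?")) {
                up.setString(1, Thread.currentThread().getName());
                up.setString(2, username);
                up.executeUpdate();
            }
        }
        con.commit();
    } catch (SQLException e) {
        con.rollback();                           // undo the claim on any error
        throw e;
    }
}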
You appear to want a work queue system. Don't reinvent the wheel - use a well-established existing work queue.
Robust, reliable concurrent work queuing is unfortunately tricky with relational databases. Most "solutions" end up:
failing to cope with work items not being completed because a worker restarted or crashed;
serializing all work on a lock, so all but one worker are just waiting; and/or
allowing a work item to be processed more than once.
PostgreSQL 9.5's new FOR UPDATE SKIP LOCKED feature will make it easier to do what you want in the database. For now, use a canned reliable task/work/message queue engine.
If you must do this yourself, you'll want a table of active work items where you record the process ID / thread ID of the worker that's processing each row. You will also need a cleanup process that runs periodically, on thread crash, and on program startup, removing entries for failed jobs (where the worker process no longer exists) so they can be retried.
Note that unless the work the workers do is committed to the database in the same transaction that marks the work queue item as done, you will have timing issues where the work can be completed but the DB entry for it isn't marked as done, leading to work being repeated. To absolutely prevent that, you must either commit the work to the DB in the same transaction as the change that marks the work as done, or use two-phase commit and an external transaction manager.
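For illustration, claiming one item with SKIP LOCKED might look like this (a sketch assuming a hypothetical work_queue table and PostgreSQL 9.5+ syntax):

// Sketch: atomically claim one unclaimed work item; locked rows are skipped,
// not waited on, so workers don't serialize behind each other.
String claimSql =
    "SELECT id, payload FROM work_queue " +
    "WHERE status = 'NEW' " +
    "ORDER BY created_at " +
    "LIMIT 1 FOR UPDATE SKIP LOCKED";
try (Connection con = dataSource.getConnection()) {
    con.setAutoCommit(false);
    try (Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(claimSql)) {
        if (rs.next()) {
            long id = rs.getLong("id");
            // ... do the work, then mark it done in the SAME transaction ...
            try (PreparedStatement done = con.prepareStatement(
                    "UPDATE work_queue SET status = 'DONE' WHERE id = ?")) {
                done.setLong(1, id);
                done.executeUpdate();
            }
        }
        con.commit();
    }
}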
Related
I would like to ask about the best way to do the following. Currently I have many rows being inserted into a database table with some status like 'NEW'.
One thread (ThreadA) reads 20 rows of data from the table with the query select * from TABLE where status = 'NEW' order by some_date asc and puts the data into a queue. It only reads data when the number of elements in the queue is less than 20.
Another thread (ThreadB) reads data from the queue and processes it; during processing it changes the status of the row to something like 'IN PROGRESS'.
My fear is that while ThreadB is processing one row but hasn't yet updated its status, ThreadA may see the queue drop below 20 elements, fetch another 20 rows, and put them into the queue, so there is a possibility of duplicates in the queue, since the data might come back still carrying the status 'NEW'.
I thought I could mark the data as read with some flag (something like 'FETCHED'), and clear the flag after processing.
I feel like I am missing something, so I would like to ask if there is a best practice for handling tasks like this.
PS: The number of threads that read the data might increase in the future; that is what I try to keep in mind.
Right, since no-one seems to be picking this one up, I'll continue here what was started in the comments:
There are lots of solutions to this one. In your case, with just one processing thread, you might for example store just the record IDs in the queue. Then ThreadB can fetch the row itself to make sure the status is indeed NEW, or use optimistic locking with update table set status='IN_PROGRESS' where id=rowId and status='NEW' and quit processing the row when the update doesn't go through.
Optimistic locking is fun, and you could also use it to get rid of the producer thread altogether. Imagine a few threads processing records from the database. Each could pick up a record and try to set its status with optimistic locking, as in the first example. It's quite possible to get a lot of contention for records this way, so each thread could fetch N rows, where N is the number of threads, or twice that much, and then try to process the first row for which it succeeded in setting IN_PROGRESS. This makes for a less complicated system, and one less thing to take care of and synchronize with.
And you can have the threads pick up not only records that are NEW, but also those which are IN_PROGRESS with started_date < sysdate - timeout; that would include records that were not processed because of a system failure (e.g. a thread managed to set one row to IN_PROGRESS and then your system went down). So you get some resilience here.
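A sketch of that claim-by-update idea in JDBC (the table and column names are assumptions, and the interval syntax varies by database):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class RecordClaimer {
    // Try to claim a candidate row with an optimistic update.
    // Zero updated rows means another thread got there first.
    static boolean claim(Connection con, long rowId) throws SQLException {
        String sql =
            "UPDATE records SET status = 'IN_PROGRESS', started_date = CURRENT_TIMESTAMP "
          + "WHERE id = ? AND (status = 'NEW' OR (status = 'IN_PROGRESS' "
          + "AND started_date < CURRENT_TIMESTAMP - INTERVAL '10' MINUTE))";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, rowId);
            return ps.executeUpdate() == 1;   // exactly one row means we won the claim
        }
    }
}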
I've been reading a bit about the Kafka concurrency model, but I still struggle to understand whether I can have local state in a Kafka Processor, or whether that will fail in bad ways?
My use case is: I have a topic of updates, I want to insert these updates into a database, but I want to batch them up first. I batch them inside a Java ArrayList inside the Processor, and send them and commit them in the punctuate call.
Will this fail in bad ways? Am I guaranteed that the ArrayList will not be accessed concurrently?
I realize that there will be multiple Processors and multiple ArrayLists, depending on the number of threads and partitions, but I don't really care about that.
I also realize I will lose the ArrayList if the application crashes, but I don't care if some events are inserted twice into the database.
This works fine in my simple tests, but is it correct? If not, why?
Whatever you use for local state in your Kafka consumer application is up to you, so you can guarantee that only the current thread/consumer will be able to access the local state data in your ArrayList. If you have multiple threads, one per Kafka consumer, each thread can have its own private ArrayList or HashMap to store state in. You could also have something like a local RocksDB database for persistent local state.
A few things to look out for:
If you're batching updates together to send to the DB, are those updates in any way related, say, because they're part of a transaction? If not, you might run into problems. An easy way to ensure this is the case is to set a key for your messages with a transaction ID, or some other unique identifier for the transaction; that way all the updates with that transaction ID will end up in one specific partition, so whoever consumes them is sure to always have the complete set of updates for that transaction.
How are you validating that you got ALL the transactions before your batch update? Again, this is important if you're dealing with database updates inside transactions. You could simply wait for a pre-determined amount of time to ensure you have all the updates (say, maybe 30 seconds is enough in your case). Or maybe you send an "EndOfTransaction" message that details how many messages you should have gotten, as well as maybe a CRC or hash of the messages themselves. That way, when you get it, you can either use it to validate you have all the messages already, or you can keep waiting for the ones that you haven't gotten yet.
Make sure you're not committing to Kafka the messages you're keeping in memory until after you've batched them, sent them to the database, and confirmed that the updates went through successfully. This way, if your application dies, the next time it comes back up it will get the messages you haven't committed in Kafka yet all over again.
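A minimal sketch of the batching idea with a Kafka Streams Processor (the DB helper is hypothetical; punctuators scheduled via ProcessorContext.schedule run on the same stream thread as process(), which is why the ArrayList needs no synchronization):

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

public class BatchingProcessor implements Processor<String, String> {
    private final List<String> batch = new ArrayList<>();
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        // Punctuators run on the same stream thread as process(),
        // so the ArrayList is never touched by two threads at once.
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME,
                timestamp -> flush());
    }

    @Override
    public void process(String key, String value) {
        batch.add(value);
    }

    private void flush() {
        if (batch.isEmpty()) {
            return;
        }
        writeBatchToDatabase(batch); // hypothetical JDBC batch insert
        batch.clear();
        context.commit(); // request an offset commit only after the DB write succeeded
    }

    @Override
    public void close() {
        flush();
    }

    private void writeBatchToDatabase(List<String> rows) {
        // ... JDBC batch insert would go here ...
    }
}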
I have a J2EE server, currently running only one thread (the problem arises even within one single request), to save its internal model of data to MySQL/InnoDB tables.
The basic idea is to read data from flat files, do a lot of calculation, and then write the result to MySQL; then read another set of flat files for the next day and repeat from step 1. As only a minor part of the rows changes, I use a result set of already-written rows, compare it to the current result in memory, and then update/insert correspondingly (no deletes, just setting a deletedFlag).
Problem: despite a purely sequential process I get lock timeout errors (#1204), and the InnoDB status output shows record locks (though I do not know how to figure out the details). To complicate things, on my Windows machine everything works, while the production system (where I can't install innotop) has some record locks.
On to the critical code:
1) Read data and calculate (works)
2) Get a Connection from the Tomcat pool and set autocommit=false
3) Use a Statement to issue "LOCK TABLES order WRITE"
4) Open an updatable ResultSet on table order
5) For each row in the ResultSet --> if there is a difference, update it from the in-memory object
6) For objects not yet in the database --> insert the data
7) Commit the Connection, close the Connection
Steps 5/6 use a commit counter so that the rows are committed after every 500 changes (to avoid having 50,000 rows uncommitted). In the first run (i.e. without any locks) this takes at most 30 seconds per table.
As stated above, right now I avoid any other interaction with the database, but in the future other processes (user requests) might read data or even write some fields. I would not mind those processes reading either old or new data, or waiting a couple of minutes to save their changes to the DB (that is, waiting for a lock).
I would be happy to hear any recommendation on how to do better than that.
Summary: complex code calculates in-memory objects which are to be synchronized with the database. This sync currently seems to lock against itself, despite the fact that it sequentially locks, changes, and unlocks the tables without any exceptions thrown. But for some reason row locks seem to remain.
Kind regards
Additional information:
MySQL: SHOW PROCESSLIST lists no active connections (all asleep, or alternatively waiting for table locks on table order), while SHOW ENGINE INNODB STATUS reports a number of row locks (unfortunately I can't tell which transaction is meant, as the output is quite cryptic).
Solved: I had wrongly declared a ResultSet as updatable. The ResultSet was closed only in a finalize() method via the garbage collector, which was not fast enough: before that happened I reopened the ResultSet and therefore tried to acquire a lock on an already-locked table.
Yet it was odd that innotop showed another query of mine hanging on a completely different table. Though as it works for me now, I do not care about the oddities :-)
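For reference, closing the ResultSet deterministically with try-with-resources avoids relying on finalize(); a minimal sketch of the update loop (the table name, the comparison helpers, and cursor holdability across commits are assumptions):

// Sketch: deterministic cleanup so locks are released before the next pass.
try (Connection con = pool.getConnection()) {
    con.setAutoCommit(false);
    try (Statement st = con.createStatement(
             ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
         ResultSet rs = st.executeQuery("SELECT * FROM `order`")) {
        int pending = 0;
        while (rs.next()) {
            if (differsFromMemory(rs)) {    // hypothetical comparison helper
                applyInMemoryValues(rs);    // hypothetical: rs.updateXxx(...) calls
                rs.updateRow();
                if (++pending % 500 == 0) {
                    con.commit();           // commit counter: flush every 500 changes
                }
            }
        }
        con.commit();
    } // ResultSet and Statement close here, releasing their locks promptly
}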
This kind of thing has been done a million times I'm sure, but my search-fu appears weak today, and I'd like to get opinions on what is generally considered the best way to accomplish this goal.
My application keeps track of sessions for online users in a system. Each session corresponds to a single record in a database. A session can be ended in one of two ways. Either a "stop" message is received, or the session can timeout. The former case is easy, it is handled in the message processing thread and everything is fine. The latter case is where the concern comes from.
In order to process timeouts, each record has an ending time column that is updated each time a message is received for that session. To make timeouts work, I have a thread that returns all records from the database whose endtime < NOW() (has an end time in the past), and goes through the processing to close those sessions. The problem here is that it's possible that I might receive a message for a session while the timeout thread is going through processing for the same session. I end up with a race between the timeout thread and message processing thread.
I could use a semaphore or the like and just prevent the message thread from processing while timeout is taking place as it only needs to run every 30 seconds or a minute. However, as the user table gets large this is going to run into some serious performance issues. What I think I would like is a way to know in the message thread that this record is currently being processed by the timeout thread. If I could achieve that I could either discard the message or wait for timeout thread to end but only in the case of conflicts now instead of always.
Currently my application uses JDBC directly. Would there be an easier/standard method for solving this issue if I used a framework such as Hibernate?
This is a great opportunity for all kinds of crazy bugs to occur, and some of the cures can cause performance issues.
The classic solution would be to use transactions (http://dev.mysql.com/doc/refman/5.0/en/commit.html). This allows you to guarantee the consistency of your data - but a long-running transaction on the database turns it into a huge bottleneck; if your "find timed-out sessions" code runs for a minute, the transaction may run for that entire period, effectively locking write access to the affected table(s). Most systems would not deal well with this.
My favoured solution for this kind of situation is to have a "state machine" for status; I like to implement this as a history table, but that does tend to lead to a rapidly growing database.
You define the states of a session as "initiated", "running", "timed-out - closing", "timed-out - closed", and "stopped by user" (for example).
You implement code which honours the state transition logic in whatever data access logic you've got. The pseudo code for your "clean-up" script might then be:
update all records whose endtime < now() and whose status is "running", setting status = "timed-out - closing"
for each record whose status is "timed-out - closing"
    do whatever other stuff you need to do
    update that record to set status = "timed-out - closed" where status = "timed-out - closing"
next record
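In JDBC, that clean-up might look roughly like this (the table and column names and the resource clean-up helper are assumptions):

// Sketch: two-phase timeout sweep using guarded status transitions.
try (Connection con = dataSource.getConnection()) {
    // Phase 1: atomically claim all timed-out sessions.
    try (Statement st = con.createStatement()) {
        st.executeUpdate(
            "UPDATE sessions SET status = 'timed-out - closing' " +
            "WHERE endtime < NOW() AND status = 'running'");
    }
    // Phase 2: process each claimed session, then mark it closed.
    try (Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(
             "SELECT session_id FROM sessions WHERE status = 'timed-out - closing'")) {
        while (rs.next()) {
            long id = rs.getLong(1);
            closeSessionResources(id);   // hypothetical per-session clean-up work
            try (PreparedStatement done = con.prepareStatement(
                    "UPDATE sessions SET status = 'timed-out - closed' " +
                    "WHERE session_id = ? AND status = 'timed-out - closing'")) {
                done.setLong(1, id);
                done.executeUpdate();
            }
        }
    }
}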
All other attempts to modify the current state of the session record must check that the current status is valid for the attempted change.
For instance, the "manual" stop code should be something like this:
update sessions
set status = "stopped by user"
where session_id = xxxxx
and status = 'running'
If the auto-close routine has kicked off in the time between showing the user interface and running the database code, the where clause won't match any records, so the rest of the code simply doesn't run.
For this to work, all code that modifies the session status must check its pre-conditions; the most maintainable way is to encode status and allowed transitions into a separate database table.
You could also write triggers to enforce this logic, though I'm normally not a fan of triggers - only do this if you have to.
I don't think this adds significant performance worries - but test and optimize. The majority of the extra work on the database comes from adding extra "where" clauses to your update statements; assuming you have an index on status, it's unlikely to have a measurable impact.
I want to iterate over records in the database and update them. However, since the updating both takes some time and is prone to errors, I need to a) not keep the db waiting (as e.g. with a ScrollableResults) and b) commit after each update.
The second thing is that this is done in multiple threads, so I need to ensure that if thread A is taking care of a record, thread B gets another one.
How can I implement this sensibly with Hibernate?
To give a better idea, the following code would be executed by several threads, where all threads share a single instance of the RecordIterator:
Iterator<Record> iter = db.getRecordIterator();
while (iter.hasNext()) {
    Record rec = iter.next();
    // do something lengthy here
    db.save(rec);
}
So my question is how to implement the RecordIterator. If I perform a query on every next(), how do I ensure that I don't return the same record twice? If I don't, which query do I use to return detached objects? Is there a flaw in the general approach (e.g. use one RecordIterator per thread and let the db somehow handle synchronization)? Additional info: there are way too many records to keep them locally (e.g. in a set of treated records).
Update: Because the overall process takes some time, the status of records can change in the meantime, and with it the ordering of a query's results. I guess to solve this problem I have to mark records in the database once I return them for processing...
Hmmm, what about pushing your objects from a reader thread into some bounded blocking queue, and letting your updater threads read from that queue?
In your reader, do some paging with setFirstResult/setMaxResults. E.g. if you have 1000 elements maximum in your queue, fill it up 500 at a time. When the queue is full, the next push will automatically wait until the updaters take the next elements.
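A rough sketch of that reader (the entity name, session handling, and page size are assumptions):

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.hibernate.Session;
import org.hibernate.SessionFactory;

public class RecordReader implements Runnable {
    private final SessionFactory sessionFactory;
    final BlockingQueue<Record> queue = new ArrayBlockingQueue<>(1000);

    RecordReader(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    @Override
    public void run() {
        final int pageSize = 500;
        int first = 0;
        while (true) {
            List<Record> page;
            try (Session session = sessionFactory.openSession()) {
                page = session.createQuery("from Record", Record.class)
                        .setFirstResult(first)
                        .setMaxResults(pageSize)
                        .list();
            }
            if (page.isEmpty()) {
                break;                   // no more records to hand out
            }
            first += page.size();
            for (Record rec : page) {
                try {
                    queue.put(rec);      // blocks while the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}

The updater threads then simply take() detached Record instances from the queue, do their lengthy work, and save each one in its own short transaction.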
My suggestion, since you're sharing an instance of the master iterator, would be to run all of your threads using a shared Hibernate transaction, with one load at the beginning and a big save at the end. You load all of your data into a single Set, which you can iterate over using your threads (be careful of locking, so you might want to split off a section for each thread, or somehow manage the shared resource so that you don't overlap).
The beauty of the Hibernate solution is that the records aren't immediately saved to the database, since you're using a transaction, and are stored in hibernate's cache. Then at the end they'd all be written back to the database at once. This would save on those expensive database writes you're worried about, plus it gives you an actual object to work with on each iteration, instead of just a database row.
I see in your update that the status of the records may change during processing, and this could always cause a problem. If this is a constantly running or long-running process, then my advice with a Hibernate solution would be to work in smaller sets and, yes, add a flag to mark records that have been updated, so that when you move to the next set you can pick up the ones that haven't been touched.