Java concurrency for periodic database batch insert


Scenario: One thread is being called up to thousands of times per second to do inserts to the same table and is currently doing them one-by-one.
Goal: Do periodic batch inserts instead to improve performance.
I'm trying to use a TimerTask for this: as the thread's saveItem method gets called, the objects to be saved are added to a list instead, and every 2 seconds or so they are combined into a single batch insert.
My first thought was to have two Lists, call them toSave and toSaveBackup. When the thread's saveItem method is called to save something, the item is added to the toSave list. Once the TimerTask kicks off and needs to save everything to the database, it sets an AtomicBoolean flag saveInProgress to true. saveItem checks this flag and adds to toSaveBackup instead of toSave while saveInProgress is true. When the batch save is complete, all items in toSaveBackup are moved to the toSave list, probably with a synchronized block on the lists.
Is this a reasonable approach? Or is there a better best practice? My googling skills have failed me so any help is welcome.
Misc info:
All these inserts are to the same table
Inserts are driven by receipt of MQTT messages, so I can't combine them in a batch before this point
Update: A tweak on CKing's answer below achieved the desired result: a TimerTask runs every 100 ms and checks the size of the saveQueue and how long it has been since a batch was saved. If either of these values exceeds the configured limit (save every 2 seconds or every 1000 records, etc.), then we save. A LinkedBlockingQueue is used to simplify synchronization.
Thanks again to everyone for their help!

It looks like your primary objective is to wait for a predefined amount of time and then trigger an insert. While an insert is in progress, you want other insert requests to wait until the insert is complete. After the insert is complete, you want to repeat the same process for the next insert requests.
I would propose the following solution with the above understanding in mind. You don't need to have two separate lists to achieve your goal. Also note that I am proposing an old fashioned solution for the sake of explanation. I cover some other APIs you can use at the end of my explanation. Here goes :
Define a Timer and a TimerTask that will run every N seconds.
Define an ArrayList that will be used for queuing up insert requests sent to saveItem method.
The saveItem method can define a synchronized block around this ArrayList. You can add items to the ArrayList within this synchronized block as and when saveItem is called.
On the other side of the equation, TimerTask should have a synchronized block on the same ArrayList as well inside its run method. It should insert all the records present in the ArrayList at that given moment into the database. Once the insert is complete, the TimerTask should clear the ArrayList and finally come out of the synchronized block.
You will no longer need to explicitly monitor if an insert is in progress or create a copy of your ArrayList when an insert is in progress. Your ArrayList becomes the shared resource in this case.
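The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not code from the original answer: the class name SynchronizedBatchSaver, the batchWriter callback (a stand-in for the real JDBC batch insert) and the flush and stop methods are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;
import java.util.function.Consumer;

// Hypothetical sketch: saveItem() and the TimerTask share one ArrayList guarded by one lock.
class SynchronizedBatchSaver<T> {
    private final List<T> pending = new ArrayList<>();
    private final Consumer<List<T>> batchWriter; // assumed stand-in for the real batch insert
    private final Timer timer = new Timer("batch-saver", true); // daemon timer thread

    SynchronizedBatchSaver(Consumer<List<T>> batchWriter) {
        this.batchWriter = batchWriter;
    }

    // Called by the message-handling thread for every item.
    void saveItem(T item) {
        synchronized (pending) {
            pending.add(item);
        }
    }

    // Starts the periodic flush, for example every 2000 ms.
    void start(long periodMillis) {
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                flush();
            }
        }, periodMillis, periodMillis);
    }

    // Writes everything queued so far, then clears the list.
    // saveItem() callers block while the write is in progress.
    void flush() {
        synchronized (pending) {
            if (pending.isEmpty()) {
                return;
            }
            batchWriter.accept(new ArrayList<>(pending)); // the writer gets its own copy
            pending.clear();
        }
    }

    void stop() {
        timer.cancel();
    }
}
```

Because the lock is held for the whole write, saveItem stalls for as long as the database insert takes. Whether that is acceptable depends on how long a batch insert takes compared with the incoming message rate.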
If you also want size to be a deciding factor for proceeding with inserts, you can do this :
Define an int called waitAttempts in TimerTask. This field indicates the number of consecutive wake ups for which the TimerTask should do nothing if the size of the list is not big enough.
Every time the TimerTask wakes up, it can do something like if(waitAttempts%3==0 || list.size() > 10) { insert data } else { increment waitAttempts and do nothing. Exit the synchronized block and the run method }. You can change 3 and 10 to whatever numbers suit your throughput requirements.
Note: Intrinsic locking was used as a means of explaining the approach. One can always take this approach and implement it using modern constructs such as a BlockingQueue, which would eliminate the need to synchronize manually on the ArrayList. I would also recommend the use of Executors.newSingleThreadScheduledExecutor() instead of a TimerTask, as it ensures that there will only be one thread running at any given time and there won't be an overlap of threads. Also, the logic for waitAttempts is indicative and will need to be adjusted to work correctly.
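For comparison, here is a minimal sketch of the BlockingQueue plus single-threaded scheduled executor variant. It flushes when either a size limit or a time limit is reached, which replaces the waitAttempts counter with an elapsed-time check. The names QueueBatchSaver, batchWriter, maxBatchSize and maxDelayMillis are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: producers add to a queue, one scheduler thread drains it in batches.
class QueueBatchSaver<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Consumer<List<T>> batchWriter; // assumed stand-in for the real batch insert
    private final int maxBatchSize;
    private final long maxDelayNanos;
    private long lastFlushNanos = System.nanoTime();

    QueueBatchSaver(Consumer<List<T>> batchWriter, int maxBatchSize, long maxDelayMillis) {
        this.batchWriter = batchWriter;
        this.maxBatchSize = maxBatchSize;
        this.maxDelayNanos = TimeUnit.MILLISECONDS.toNanos(maxDelayMillis);
    }

    // Never waits for a database write; the queue is unbounded.
    void saveItem(T item) {
        queue.add(item);
    }

    // Checks the thresholds every checkPeriodMillis, for example every 100 ms.
    void start(long checkPeriodMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkAndFlush();
            } catch (RuntimeException e) {
                e.printStackTrace(); // an uncaught exception would cancel all future runs
            }
        }, checkPeriodMillis, checkPeriodMillis, TimeUnit.MILLISECONDS);
    }

    // Flushes if enough items are queued or enough time has passed since the last flush.
    synchronized void checkAndFlush() {
        long elapsed = System.nanoTime() - lastFlushNanos;
        if (queue.size() >= maxBatchSize || elapsed >= maxDelayNanos) {
            List<T> batch = new ArrayList<>();
            queue.drainTo(batch);
            lastFlushNanos = System.nanoTime();
            if (!batch.isEmpty()) {
                batchWriter.accept(batch);
            }
        }
    }

    void stop() {
        scheduler.shutdown();
    }
}
```

Here the producer only touches the queue, so a slow insert delays the next flush but not saveItem. The trade-off is that an unbounded queue can grow without limit if the database falls behind.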

Related

Let a queue build up to a certain amount before processing

So let me give you an idea of what I'm trying to do:
I've got a program that records statistics, lots and lots of them, but it records them as they happen one at a time and puts them into an ArrayList, for example:
Please note this is an example, I'm not recording these stats, I'm just simplifying it a bit
User clicks -> Add user_click to array
User clicks -> Add user_click to array
Key press -> Add key_press to array
After each event(clicks, key presses, etc) it checks the size of the ArrayList, if it is > 150 the following happens:
A new thread is created
That thread is given a copy of the ArrayList
The original ArrayList is .clear()'ed
The new thread combines similar items so user_click would now be one item with a quantity of 2, instead of 2 items with a quantity of 1 each
The thread processes the data to a MySQL db
I would love to find a better approach to this, although this works just fine. The issue with thread pools and processing immediately is that there would be literally thousands of MySQL queries per day without combining them first.
Is there a better way to accomplish this? Is my method okay?
The other thing to keep in mind is the thread where events are fired and recorded can't be slowed down so I don't really want to combine items in the main thread.
If you've got code examples that would be great; if not, just an idea of a good way to do this would be awesome as well!
For anyone interested, this project is hosted on GitHub, the main thread is here, the queue processor is here and please forgive my poor naming conventions and general code cleanliness, I'm still(always) learning!

The logic described seems pretty good, with two adjustments:
Don't copy the list and clear the original. Send the original and create a new list for future events. This eliminates the O(n) processing time of copying the entries.
Don't create a new thread each time. Events are delayed anyway, since you're collecting them, so timeliness of writing to database is not your major concern. Two choices:
Start a single thread up front, then use a BlockingQueue to send list from thread 1 to thread 2. If thread 2 is falling behind, the lists will simply accumulate in the queue until thread 2 can catch up, without delaying thread 1, and without overloading the system with too many threads.
Submit the job to a thread pool, e.g. using an Executor. This would allow multiple (but limited number of) threads to process the lists, in case processing is slower than event generation. Disadvantage is that events may be written out of order.
For the purpose of separation of concern and reusability, you should encapsulate the logic of collecting events, and sending them to thread in blocks for processing, in a separate class, rather than having that logic embedded in the event-generation code.
That way you can easily add extra features, e.g. a timeout for flushing pending events before reaching normal threshold (150), so events don't sit there too long if event generation slows down.
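A rough sketch of such a class, combining the list swap with the single consumer thread and BlockingQueue option, might look like the following. EventBatcher, record, flush, takeBatch and pollBatch are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical helper: collects events and hands whole lists over to a processing thread.
class EventBatcher<E> {
    private final int threshold;
    private final BlockingQueue<List<E>> handoff = new LinkedBlockingQueue<>();
    private List<E> current = new ArrayList<>();

    EventBatcher(int threshold) {
        this.threshold = threshold;
    }

    // Called on the event thread. The list is never copied, only replaced.
    synchronized void record(E event) {
        current.add(event);
        if (current.size() >= threshold) {
            flush();
        }
    }

    // Hands the current list to the queue and starts a fresh one.
    // A timeout-based flush could call this from a scheduled task.
    synchronized void flush() {
        if (!current.isEmpty()) {
            handoff.add(current);
            current = new ArrayList<>();
        }
    }

    // Called on the processing thread; blocks until a batch is available.
    List<E> takeBatch() throws InterruptedException {
        return handoff.take();
    }

    // Non-blocking variant; returns null when no batch is waiting.
    List<E> pollBatch() {
        return handoff.poll();
    }
}
```

The processing thread would loop on takeBatch(), combine similar items, and write the result to the database, so the event thread only ever pays for an add and an occasional list swap.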

constantly check database [duplicate]

I'm using JDBC, need to constantly check the database against changing values.
What I have currently is an infinite loop running, inner loop iterating over a changing values, and each iteration checking against the database.
public void runInBG() { // this method is called from another thread
    while (true) {
        while (els.hasElements()) {
            Test el = (Test) els.next();
            String sql = "SELECT * FROM Test WHERE id = '" + el.getId() + "'";
            // this function makes the connection, calls executeQuery etc. and returns a Record object with values
            Record r = db.getTestRecord(sql);
            if (r != null) {
                // do something
            }
        }
    }
}
I think this isn't the best way.
The other way I'm thinking is the reverse, to keep iterating over the database.
UPDATE
Thank you for the feedback regarding timers, but I don't think it will solve my problem.
Once a change occurs in the database I need to process the results almost instantaneously against the changing values ("els" from the example code).
Even if the database does not change it still has to check constantly against the changing values.
UPDATE 2
OK, to anyone interested in the answer, I believe I have the solution now. Basically the solution is NOT to use the database for this. Load, update, add, etc. only what's needed from the database into memory.
That way you don't have to open and close the database constantly, you only deal with the database when you make a change to it, and reflect those changes back into memory and only deal with whatever is in memory at the time.
Sure this is more memory intensive but performance is absolute key here.
As to the periodic "timer" answers, I'm sorry but this is not right at all. Nobody has explained how the use of timers would solve this particular situation.
But thank you again for the feedback, it was still helpful nevertheless.

Another possibility would be using ScheduledThreadPoolExecutor.
You could implement a Runnable containing your logic and register it to the ScheduledExecutorService as follows:
ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(10);
executor.scheduleAtFixedRate(myRunnable, 0, 5, TimeUnit.SECONDS);
The code above creates a ScheduledThreadPoolExecutor with 10 threads in its pool and registers a Runnable that runs every 5 seconds, starting immediately.
To schedule your runnable you could use:
scheduleAtFixedRate
Creates and executes a periodic action that becomes enabled first after the given initial delay, and subsequently with the given period; that is executions will commence after initialDelay then initialDelay+period, then initialDelay + 2 * period, and so on.
scheduleWithFixedDelay
Creates and executes a periodic action that becomes enabled first after the given initial delay, and subsequently with the given delay between the termination of one execution and the commencement of the next.
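As a small illustration of the second method, the sketch below schedules a task with scheduleWithFixedDelay, waits until it has run a given number of times, and then shuts the executor down. PollingDemo and runRepeatedly are hypothetical names, and the extra 5 seconds of waiting is an arbitrary safety margin.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo of scheduleWithFixedDelay on a single-threaded scheduler.
class PollingDemo {
    // Runs the task at least `runs` times, with delayMillis between the end of one run
    // and the start of the next. Returns true if all runs completed before the timeout.
    static boolean runRepeatedly(Runnable task, int runs, long delayMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(runs);
        executor.scheduleWithFixedDelay(() -> {
            task.run();
            remaining.countDown();
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
        try {
            return remaining.await(runs * delayMillis + 5_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow(); // stop scheduling further runs
        }
    }
}
```

With scheduleWithFixedDelay a slow database check simply pushes the next check back, whereas scheduleAtFixedRate tries to keep to the original timetable and may run the next check immediately after a slow one.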
It is worth looking at the advantages of ThreadPoolExecutor to see whether it fits your requirements. I also recommend the question "Java Timer vs ExecutorService?" to help you make a good decision.

Keeping the while(true) in runInBG() is a bad idea; you had better remove it. Instead you can have a scheduler (use Timer and TimerTask) which calls runInBG() periodically and checks for updates in the DB.

You could use a timer:
Timer timer = new Timer("runInBG");
// MyClass must extend TimerTask and contain the repeated logic in its run() method.
MyClass t = new MyClass();
// Run immediately, then every 2000 ms.
timer.schedule(t, 0, 2000);

As you said in the comment above, if the application controls the updates and inserts, then you can create a framework which notifies the 'BG' thread or process about changes in the database. Notification can be over the network via JMS, intra-VM using the observer pattern, or both local and remote.
You can have a generic notification message like the following (it can be a class for local notifications or a text message for remote notifications):
<Notification>
  <Type>update/insert</Type>
  <Entity>
    <Name>Account/Customer</Name>
    <Id>id</Id>
  </Entity>
</Notification>
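For the intra-VM case, a minimal observer-pattern sketch could look like this. The class and method names (ChangeNotification, ChangeListener, ChangeNotifier, publish) are assumptions for illustration; the fields simply mirror the message above.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical local counterpart of the notification message.
final class ChangeNotification {
    final String type;       // "update" or "insert"
    final String entityName; // e.g. "Account" or "Customer"
    final String entityId;

    ChangeNotification(String type, String entityName, String entityId) {
        this.type = type;
        this.entityName = entityName;
        this.entityId = entityId;
    }
}

// Implemented by whatever needs to react to database changes, e.g. the 'BG' logic.
interface ChangeListener {
    void onChange(ChangeNotification notification);
}

// The code that performs updates and inserts calls publish() after each change.
class ChangeNotifier {
    // CopyOnWriteArrayList allows listeners to be added while notifications are delivered.
    private final List<ChangeListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(ChangeListener listener) {
        listeners.add(listener);
    }

    void publish(ChangeNotification notification) {
        for (ChangeListener listener : listeners) {
            listener.onChange(notification);
        }
    }
}
```

In this sketch listeners run on the publishing thread, so a listener that does real work should hand the notification off to its own queue or thread.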

To avoid a 'busy loop', I would try to use triggers. H2 also supports a DatabaseEventListener API, that way you wouldn't have to create a trigger for each table.
This may not always work, for example if you use a remote connection.


DelayQueue with higher speed remove()?

I have a project that keeps track of state information in over 500k objects, the program receives 10k updates/second about these objects, the updates consist of new, update or delete operations.
As part of the program house keeping must be performed on these objects roughly every five minutes, for this purpose I've placed them in a DelayQueue implementing the Delayed interface, allowing the blocking functionality of the DelayQueue to control house keeping of these objects.
Upon new, an object is placed on the DelayQueue.
Upon update, the object is remove()'d from the DelayQueue, updated and then reinserted at its new position dictated by the updated information.
Upon delete, the object is remove()'d from the DelayQueue.
The problem I'm facing is that the remove() method becomes a prohibitively long operation once the queue passes around 450k objects.
The program is multithreaded: one thread handles updates and another the house keeping. Due to the remove() delay, we get nasty locking performance issues, and eventually the update thread's buffers consume all of the heap space.
I've managed to work around this by creating a DelayedWeakReference (extends WeakReference implements Delayed), which allows me to leave "shadow" objects in the queue until they would expire normally.
This takes the performance issue away, but causes a significant increase in memory requirements. Doing this results in around 5 DelayedWeakReferences for every object that actually needs to be in the queue.
Is anyone aware of a DelayQueue with additional tracking that permits fast remove() operations? Or has any suggestions of better ways to handle this without consuming significantly more memory?

It took me some time to think about this, but after reading your interesting question for a few minutes, here are my ideas:
A. If your objects have some sort of ID, use it as a hash key, and instead of one delay queue have N delay queues. This will reduce the locking factor by N. There will be a central data structure holding these N queues. Since N is preconfigured, you can create all N queues when the system starts.
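A minimal sketch of idea A, assuming each object carries a numeric id. TrackedItem and ShardedDelayQueue are hypothetical names, and remove relies on identity equality because TrackedItem does not override equals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical queue element with a numeric id used for choosing a shard.
class TrackedItem implements Delayed {
    final long id;
    private final long expiresAtNanos;

    TrackedItem(long id, long delayMillis) {
        this.id = id;
        this.expiresAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(expiresAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }
}

// N independent DelayQueues; an item always lands in the shard chosen by its id.
class ShardedDelayQueue {
    private final List<DelayQueue<TrackedItem>> shards = new ArrayList<>();

    ShardedDelayQueue(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new DelayQueue<>());
        }
    }

    private DelayQueue<TrackedItem> shardFor(TrackedItem item) {
        return shards.get(Math.floorMod(Long.hashCode(item.id), shards.size()));
    }

    void add(TrackedItem item) {
        shardFor(item).add(item);
    }

    // Still a linear scan, but only over one shard and under that shard's lock.
    boolean remove(TrackedItem item) {
        return shardFor(item).remove(item);
    }

    int size() {
        int total = 0;
        for (DelayQueue<TrackedItem> shard : shards) {
            total += shard.size();
        }
        return total;
    }
}
```

The housekeeping side then needs either one consumer per shard calling take(), or a single thread that polls the shards in turn; which is better depends on how evenly the ids spread.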

If you only need to perform housekeeping "roughly every five minutes", this is a lot of work to maintain that.
What I would do is have a task which runs every minute (or less, as required) to see if it has been five minutes since the last update. If you use this approach, there is no additional collection to maintain and no data structure is altered on an update. The overhead of scanning the components is increased, but is constant. The overhead of performing updates becomes trivial (setting a field with the last time updated).
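A sketch of this scan, assuming the objects already live in some collection and each one carries a last-updated timestamp. TrackedObject, HousekeepingScan and findDue are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical tracked object; an update only writes one field.
class TrackedObject {
    final String id;
    volatile long lastUpdatedMillis; // written by the update thread, read by the scanner

    TrackedObject(String id, long lastUpdatedMillis) {
        this.id = id;
        this.lastUpdatedMillis = lastUpdatedMillis;
    }

    void touch(long nowMillis) {
        lastUpdatedMillis = nowMillis;
    }
}

class HousekeepingScan {
    // Returns every object whose last update is at least maxAgeMillis old.
    // Intended to be called from a task scheduled every minute or so.
    static List<TrackedObject> findDue(Collection<TrackedObject> all, long nowMillis, long maxAgeMillis) {
        List<TrackedObject> due = new ArrayList<>();
        for (TrackedObject object : all) {
            if (nowMillis - object.lastUpdatedMillis >= maxAgeMillis) {
                due.add(object);
            }
        }
        return due;
    }
}
```

If the collection is modified while the scan runs, it has to be one that tolerates concurrent iteration, for example the values of a ConcurrentHashMap.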

If I understand your problem correctly, you want to do something to an object, if it hasn't been touched for 5 minutes.
You can have a custom linked list; the tail is the most recently touched. Removing a node is fast.
The book keeping thread can simply wake up every 1 second and remove heads that are 5 minutes old. However, if the 1 second delay is unacceptable, calculate the exact pause time (pseudocode):
// book keeping thread
void run()
    synchronized(list)
        while(true)
            if(head == null)
                list.wait();
            else if(head.time + 5_min > now)
                list.wait(head.time + 5_min - now);
            else
                remove head
                process it

// update thread
void add(node)
    synchronized(list)
        append node
        if size == 1
            list.notify()

void remove(node)
    synchronized(list)
        remove node
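As a runnable variation on this idea, the sketch below swaps the custom linked list for a LinkedHashMap, which keeps insertion order and removes by key in constant time. It covers only the simpler "wake up periodically" variant without wait/notify, and it assumes timestamps never go backwards. IdleTracker, touch and pollExpired are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tracker: the oldest-touched key is always first in iteration order.
class IdleTracker<K> {
    private final LinkedHashMap<K, Long> lastTouched = new LinkedHashMap<>();
    private final long idleMillis;

    IdleTracker(long idleMillis) {
        this.idleMillis = idleMillis;
    }

    // New or updated object: move it to the tail with a fresh timestamp.
    synchronized void touch(K key, long nowMillis) {
        lastTouched.remove(key);
        lastTouched.put(key, nowMillis);
    }

    // Deleted object: constant-time removal by key.
    synchronized void remove(K key) {
        lastTouched.remove(key);
    }

    // Called by the book keeping thread; removes and returns keys idle for idleMillis or more.
    synchronized List<K> pollExpired(long nowMillis) {
        List<K> expired = new ArrayList<>();
        Iterator<Map.Entry<K, Long>> it = lastTouched.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<K, Long> entry = it.next();
            if (entry.getValue() + idleMillis > nowMillis) {
                break; // everything after this entry was touched even more recently
            }
            expired.add(entry.getKey());
            it.remove();
        }
        return expired;
    }
}
```

The book keeping thread would call pollExpired(System.currentTimeMillis()) every second or so and process the returned keys outside the lock.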

Calculating the response time of queries using threads

I have a set of queries which have to be executed simultaneously. For this I am starting a thread with a Runnable for each query, calling it from a while loop iterating through the List of queries.
Each thread executes one of the queries and also calculates the time taken by that query to execute. This is done by capturing the start time and end time and taking the difference. This time should be specific to each query, and it needs to be printed out along with the corresponding query.
In this scenario, do I just need to capture times and display? Will it lead to any synchronization problems?

This question is too generic. Will it lead to concurrency problems? Maybe. Do you share objects between the Runnables? If you do, you MIGHT have an issue; if you don't, then no.
If your loop looks like this, for example (this code only illustrates the point; the query execution itself is left out):
for (int i = 0; i < queries.size(); ++i) {
    final String query = queries.get(i);
    new Thread(new Runnable() {
        public void run() {
            // Execute the query
        }
    }).start(); // without start() the thread would never run
}
Each Thread will execute its own query WITHOUT accessing any shared data, so no, you will not have concurrency problems, at least in the Java code. You could have problems in the database, for example multiple queries trying to update the same row.

In this scenario, do I just need to capture times and display? Will it lead to any synchronization problems?
I don't know, do you? Why don't you just have a member field in each of your Runnable instances that keeps track of the time it takes to execute, and then, when all threads are finished, iterate over the Runnables and display the data? Who besides you knows what your true intentions are? There are only two synchronization problems I know of: deadlocks and race conditions. Neither one applies to this scenario.
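A minimal sketch of that suggestion follows. TimedTask and TimingRunner are hypothetical names, and the work callback is a stand-in for actually executing the query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Runnable that remembers how long its own query took.
class TimedTask implements Runnable {
    final String query;
    private final Runnable work; // assumed stand-in for executing the query
    private long elapsedNanos = -1; // -1 until the task has run

    TimedTask(String query, Runnable work) {
        this.query = query;
        this.work = work;
    }

    @Override
    public void run() {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            elapsedNanos = System.nanoTime() - start;
        }
    }

    long elapsedNanos() {
        return elapsedNanos;
    }
}

class TimingRunner {
    // Starts one thread per task, waits for all of them, then prints each query with its time.
    static void runAll(List<TimedTask> tasks) {
        List<Thread> threads = new ArrayList<>();
        for (TimedTask task : tasks) {
            Thread thread = new Thread(task);
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join(); // join() makes the task's writes visible to this thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        for (TimedTask task : tasks) {
            System.out.println(task.query + " took " + task.elapsedNanos() / 1_000_000 + " ms");
        }
    }
}
```

Each task writes only its own field and the results are read after join(), so no extra synchronization is needed for the timings themselves.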

How to iterate over db records correctly with hibernate

I want to iterate over records in the database and update them. However, since that updating both takes some time and is prone to errors, I need to a) not keep the db waiting (as e.g. with a ScrollableResults) and b) commit after each update.
Second thing is that this is done in multiple threads, so I need to ensure that if thread A is taking care of a record, thread B is getting another one.
How can I implement this sensibly with hibernate?
To give a better idea, the following code would be executed by several threads, where all threads share a single instance of the RecordIterator:
Iterator<Record> iter = db.getRecordIterator();
while(iter.hasNext()){
Record rec = iter.next();
// do something lengthy here
db.save(rec);
}
So my question is how to implement the RecordIterator. If I perform a query on every next(), how do I ensure that I don't return the same record twice? If I don't, which query should I use to return detached objects? Is there a flaw in the general approach (e.g. should I use one RecordIterator per thread and let the db somehow handle synchronization)? Additional info: there are way too many records to keep them locally (e.g. in a set of treated records).
Update: Because the overall process takes some time, it can happen that the status of Records changes. Due to that the ordering of the result of a query can change. I guess to solve this problem I have to mark records in the database once I return them for processing...

Hmmm, what about pushing your objects from a reader thread into a bounded blocking queue, and letting your updater threads read from that queue?
In your reader, do some paging with setFirstResult/setMaxResults. E.g. if you have 1000 elements maximum in your queue, fill them up 500 at a time. When the queue is full, the next push will automatically wait until the updaters take the next elements.
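The reader side could be sketched as below. The Hibernate query is deliberately left out: fetchPage is an assumed callback taking (firstResult, maxResults) that would wrap a query using setFirstResult/setMaxResults, and PagedReader is a hypothetical name. The sketch also assumes a stable ordering between pages, which the update in the question suggests may not hold without marking records.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.function.BiFunction;

// Hypothetical reader: loads records page by page and pushes them into a bounded queue.
class PagedReader<R> implements Runnable {
    private final BiFunction<Integer, Integer, List<R>> fetchPage; // (firstResult, maxResults)
    private final BlockingQueue<R> queue;
    private final int pageSize;

    PagedReader(BiFunction<Integer, Integer, List<R>> fetchPage, BlockingQueue<R> queue, int pageSize) {
        this.fetchPage = fetchPage;
        this.queue = queue;
        this.pageSize = pageSize;
    }

    @Override
    public void run() {
        try {
            for (int first = 0; ; first += pageSize) {
                List<R> page = fetchPage.apply(first, pageSize);
                for (R record : page) {
                    queue.put(record); // blocks while the queue is full
                }
                if (page.size() < pageSize) {
                    return; // a short page means there are no more records
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Updater threads would take() records from the same queue, for example an ArrayBlockingQueue with capacity 1000, and each record is handed to exactly one updater. Signalling the updaters that the reader has finished (for example with a sentinel element) is left out here.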

Since you're sharing an instance of the master iterator, my suggestion would be to run all of your threads using a shared Hibernate transaction, with one load at the beginning and a big save at the end. You load all of your data into a single 'Set' which you can iterate over using your threads (be careful of locking, so you might want to split off a section for each thread, or somehow manage the shared resource so that you don't overlap).
The beauty of the Hibernate solution is that the records aren't immediately saved to the database, since you're using a transaction, and are stored in hibernate's cache. Then at the end they'd all be written back to the database at once. This would save on those expensive database writes you're worried about, plus it gives you an actual object to work with on each iteration, instead of just a database row.
I see in your update that the status of the records may change during processing, and this could always cause a problem. If this is a constantly running process or long running, then my advice using a hibernate solution would be to work in smaller sets, and yes, add a flag to mark records that have been updated, so that when you move to the next set you can pick up ones that haven't been touched.
