I am writing a program with threads that insert into a db.
Example
public static void save(String name) {
    try (PreparedStatement preparedStatement = ...insert...) {
        preparedStatement.setString(1, name);
        preparedStatement.executeUpdate();
        preparedStatement.close(); // redundant: try-with-resources already closes it
    } catch (...) {
    }
}
Question: when multiple threads run this insert into the same table simultaneously, could one thread end up using (via preparedStatement.executeUpdate()) the PreparedStatement of another thread?
Absolutely. You should not be doing this - each thread needs to have its own database connection (which therefore implies it necessarily also ends up having its own PreparedStatement).
Better yet - don't do this. You're just making things confusing and slower, it's lose-lose-lose. There is no benefit at all to your plan. The database isn't going to magically do the job faster if you insert from 2 threads simultaneously.
The conclusion is simple: threads are a really bad idea when INSERTing a lot of data into the same table, so DO NOT DO IT!
But I really want to speed up my INSERTs!
My data gathering is slow
IF (big if!!) gathering the data for insertion is slower than the database can insert records for you, and the data gathering job lends itself well to multi-threading, then have threads that gather the data, but have these threads put objects onto a queue, and have a separate 'DB inserter' thread (the only thread that even has a connection to the DB) that pulls objects off this queue and runs an INSERT.
If you can gather the data quickly, or the source does not lend itself to multi-threading, this only makes your code longer, harder to understand, considerably harder to test, and slower. No point at all.
Useful tools: LinkedBlockingQueue - an instance of this is the one piece of shared data all threads have. Your data gatherer threads toss objects onto this queue, and your single DB inserter thread fetches objects off of it.
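A minimal sketch of that layout. The Data record, the JDBC URL, the 'example' table and the 'poison pill' sentinel used to signal the end of the stream are all illustrative, not from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class GatherAndInsert {
    // Hypothetical carrier for one row's worth of gathered data.
    record Data(String name) {}

    // Sentinel ("poison pill") telling the inserter thread that gathering is finished.
    private static final Data DONE = new Data(null);

    public static void main(String[] args) throws Exception {
        BlockingQueue<Data> queue = new LinkedBlockingQueue<>(1000); // bounded, so gatherers can't run away

        // Gatherer thread (there could be several): no DB access here, it only produces Data objects.
        Thread gatherer = new Thread(() -> {
            try {
                for (int i = 0; i < 10_000; i++) {
                    queue.put(new Data("name-" + i)); // blocks if the queue is full
                }
                queue.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        gatherer.start();

        // The single DB inserter thread (here: the main thread) - the only code that touches the connection.
        try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pass");
             PreparedStatement ps = con.prepareStatement("INSERT INTO example (name) VALUES (?)")) {
            for (Data d = queue.take(); d != DONE; d = queue.take()) {
                ps.setString(1, d.name());
                ps.executeUpdate();
            }
        }
    }
}

The bounded queue also acts as back-pressure: if the database can't keep up, the gatherers simply block instead of filling up the heap.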
General insert speed advice 1: bundling
DBs work in transactions. If you have autocommit mode on (and Connections start in this mode), that's not 'no transactions'. It merely means (hence the name) that the DB commits after every statement - each statement runs in its own little transaction. You can't do 'non-transactional' in proper databases. A commit() is heavy (takes a long time to process), but so is an excessively long transaction (doing thousands of things in a single transaction before committing). Thus, you get the goldilocks principle: you want to run about 500 or so inserts, then commit.
Note that this has a downside: if an error occurs halfway through this process, then some records have been committed and some haven't been. Keep that in mind - either your process is idempotent, or a partially committed batch is likely not acceptable and you need to make the operation recoverable (e.g. by having a column that lists the 'insert session' id, so you can delete the partial batch if the operation cannot be completed properly). And if your DB is simultaneously used by other stuff, you need more complexity as well (some sort of flag or filter so that other simultaneous code doesn't consider any of the committed, inserted records until the entire batch is completely added).
Relevant methods:
con.setAutoCommit(false);
con.commit()
This general structure:
try (PreparedStatement ps = con.prepare.......) {
    int inserted = 0;
    while (!allGenerationDone) {
        Data d = queue.take();
        ps.setString(1, d.getName());
        ps.setDate(2, d.getBirthDate());
        // set the other stuff
        ps.executeUpdate();
        // commit every 500 inserts instead of after every single one
        if (++inserted % 500 == 0) con.commit();
    }
}
con.commit(); // commit whatever remains of the final partial batch
General insert speed advice 2: bulk
Most DB engines have special commands for bulk insertion. From a DB engine's perspective, various cleanup and care tasks take a ton of time and may not even be necessary, or can be combined to save a load of time, when doing bulk inserts. Specifically, checking of constraints (particularly reference constraints) and building of indices takes most of the time of processing an INSERT, and these things can either be skipped entirely or sped up considerably by doing them in bulk all at once at the end.
The way to do this is highly dependent on the underlying database. For example, in postgres, you can turn off constraint checking and index building, run your inserts, then re-enable them. You can even choose to omit constraint checks entirely (meaning your DB could be in an invalid state if your code is messed up, but if speed is more important than safety this can be the right way to go about it). Index building is considerably faster if done at the end.
Other databases generally have similar strategies. Alternatively, there are commands that combine it all, generally called COPY (instead of INSERT). Check your DB engine's docs.
Read this SO question for some info and benchmarks on how COPY compares to INSERT, and try a web search for e.g. 'mysql bulk insert'.
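As a concrete (hedged) illustration of the COPY route: PostgreSQL's JDBC driver lets you stream rows through its CopyManager API. This is a sketch only - the table example(id, name), the connection URL and the tiny CSV payload are made up, and it assumes the org.postgresql driver is on the classpath:

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CopyExample {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")) {
            CopyManager copy = new CopyManager((BaseConnection) con);
            // Rows are streamed as CSV text in one round trip instead of many individual INSERTs.
            String csv = "1,Alice\n2,Bob\n";
            long rows = copy.copyIn("COPY example (id, name) FROM STDIN WITH (FORMAT csv)",
                                    new StringReader(csv));
            System.out.println("Copied " + rows + " rows");
        }
    }
}

In real use you would stream the CSV from a file or a pipe rather than build it in memory.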
Related
I am fetching records (from a large data set, around 1 million records) from MariaDB in batches of size 500 (by using 'limit').
For each fetch iteration I am opening and closing the connection.
In my peer review I was advised to fetch the result set once and batch process by iterating on the result set itself, i.e. without closing the connection.
Is the second method the right way of doing it?
Edit: After I fetch records in batches of size 500, I am updating a field for each record and putting it on a messaging queue.
Yes, the second method is the right way to do it. Here are some reasons:
There is overhead to running the query multiple times.
The underlying data might change in the tables you are using, and the separate batches might be inconsistent.
You are depending on the ordering of the results; if the sort key is not unique, rows can be skipped or duplicated between batches.
Your program starts
Connect to database
do some SQL (selects/inserts/whatever)
do some more SQL (selects/inserts/whatever)
do some more SQL (selects/inserts/whatever)
...
Disconnect from database
Your program ends
That is, keep the connection open as long as needed during the program. (Even if you don't explicitly disconnect, the termination of your program will close the connection. This is important to note when doing a web site -- each 'page' is essentially a separate 'program'; the db connection cannot be held between pages.)
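A rough sketch of that shape - connect once, run the query once, walk the rows, disconnect at the end. The table and column names are invented, and the Integer.MIN_VALUE fetch-size hint is specific to MySQL Connector/J (other drivers use a positive fetch size), so treat it as an illustration and check your driver's docs:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SinglePass {
    public static void main(String[] args) throws Exception {
        // Connect once...
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT id, payload FROM big_table",          // ...run the query once...
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Hint the driver to stream rows rather than buffer the whole result in memory
            // (MySQL Connector/J convention; other drivers use a positive fetch size).
            ps.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    process(rs.getLong("id"), rs.getString("payload")); // ...and handle each row in turn.
                }
            }
        } // ...disconnect at the end of the program.
    }

    private static void process(long id, String payload) {
        // placeholder: update the field and put the record on the messaging queue
    }
}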
You have another implied question... "Should I grab a batch of rows at once, then process them in the client?" The answer is "It depends".
If the processing can be done in SQL, it is probably much more efficient to do it there. Example: summing up some numbers.
If you fetch some rows from one table, then for each of those rows, fetch row(s) from another table... It will be much more efficient to use an SQL JOIN.
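A small sketch of that point - tables a and b, their columns, and the handle step are all invented for illustration. One joined statement replaces "select from a, then one select from b per row of a":

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

class JoinInsteadOfLoop {
    // One joined query instead of a query per parent row.
    static void crunch(DataSource dataSource) throws SQLException {
        String sql = "SELECT a.id, b.amount FROM a JOIN b ON b.a_id = a.id";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                handle(rs.getInt("id"), rs.getLong("amount")); // hypothetical per-row number crunching
            }
        }
    }

    private static void handle(int id, long amount) {
        // placeholder for the real processing
    }
}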
"Batching" may not be relevant. The client interface is probably
Fetch rows from a table (possibly all rows, even if millions)
Look at each row in turn.
Please provide the specifics of what you will be doing with the million rows so we can discuss more specifically.
Polling loop:
If you check for new things to do only once a minute, do reconnect each time.
Given that, it does not make sense to hang onto a resultset between polls.
Opening and closing a database connection is quite time consuming. Keeping the connection open will save a lot of time.
I have a J2EE server, currently running only one thread (the problem arises even within one single request) to save its internal model of data to MySQL/INNODB-tables.
Basic idea is to read data from flat files, do a lot of calculation and then write the result to MySQL. Read another set of flat files for the next day and repeat with step 1. As only a minor part of the rows change, I use a recordset of already written rows, compare to the current result in memory and then update/insert it correspondingly (no delete, just setting a deletedFlag).
Problem: despite a purely sequential process I get lock timeout errors (#1204) and the InnoDB status output shows record locks (though I do not know how to figure out the details). To complicate things, on my Windows machine everything works, while the production system (where I can't install innotop) has some record locks.
To the critical code:
Read data and calculate (works)
Get Connection from Tomcat Pool and set to autocommit=false
Use Statement to issue "LOCK TABLES order WRITE"
Open Recordset (Updateable) on table order
For each row in Recordset --> if difference, update from in-memory-object
For objects not yet in the database --> Insert data
Commit Connection, Close Connection
Steps 5/6 have a commit counter so that the rows are committed every 500 changes (to avoid having 50,000 rows uncommitted). In the first run (so without any locks) this takes at most 30 sec per table.
As stated above, right now I avoid any other interaction with the database, but in future other processes (user requests) might read data or even write some fields. I would not mind those processes reading either old or new data, or waiting a couple of minutes (i.e. on a lock) while the changes are saved to the db.
I would be happy about any recommendation to do better than that.
Summary: complex code calculates in-memory objects which are to be synchronized with the database. This sync currently seems to lock itself, despite the fact that it sequentially locks, changes and unlocks the tables without any exceptions thrown. But for some reason row locks seem to remain.
Kind regards
Additional information:
MySQL: SHOW PROCESSLIST lists no active connections (all asleep, or alternatively waiting for table locks on table order), while SHOW ENGINE INNODB STATUS reports a number of row locks (unfortunately I can't tell which transaction is meant, as the output is quite cryptic).
Solved: I had wrongly declared a ResultSet as updatable. The ResultSet was only closed in a finalize() method via the garbage collector, which was not fast enough - before that happened I reopened the ResultSet and therefore tried to acquire a lock on an already locked table.
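For anyone hitting the same symptom, the fix boils down to closing the updatable ResultSet (and its Statement) deterministically, e.g. with try-with-resources, instead of leaving it to a finalizer. A sketch only - the query and the updateRow() placeholder stand in for the real sync logic:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class OrderSync {
    static void syncOrders(Connection con) throws SQLException {
        try (Statement st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
             ResultSet rs = st.executeQuery("SELECT * FROM `order`")) {
            while (rs.next()) {
                // compare with the in-memory object and call rs.updateRow() if it differs
            }
        } // rs and st are closed here, releasing their locks before the table is touched again
    }
}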
Yet it was odd that innotop showed another query of mine hanging on a completely different table. Though as it works for me, I do not care about oddities :-)
One jdbc "select" statement takes 5 secs to complete.
So doing 5 statements takes 25 secs.
Now I try to do the job in parallel. The db is mysql with innodb.
I start 5 threads and give each thread its own db connection. But it still takes 25 secs for all to complete?
Note: I give Java enough heap and have 8 cores, but only one HD (maybe having only one HD is the bottleneck here?).
Is this the expected behaviour with mysql out of the box?
here is example code:
public void doWork(int n) {
    try (Connection conn = pool.getConnection();
         PreparedStatement stmt = conn.prepareStatement(
                 "select id from big_table where id between " + (n * 1000000) + " and " + (n * 1000000 + 1000000))) {
        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                Long itemId = rs.getLong("id");
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
}
public void doWorkBatch() {
    for (int i = 1; i < 5; i++)
        doWork(i);
}
public void doWorkParrallel() {
    for (int i = 1; i < 5; i++) {
        final int n = i; // lambdas can only capture (effectively) final variables
        new Thread(() -> doWork(n)).start();
    }
    System.console().readLine();
}
(I don't recall where but I read that a standard mysql installation can easily handle 1000 connections in parallel)
Looking at your problem, multi-threading will definitely improve your performance; I once converted a 4-5 hour batch job into a 7-10 minute job by doing exactly what you're thinking of. But you need to know the following things beforehand while designing:
1) You need to think about inter-task dependencies, i.e. between tasks getting executed on different threads.
2) Using a connection pool is a good sign, since creating database connections is a slow process in Java and takes a long time.
3) Each thread needs its own JDBC connection. Connections can't be shared between threads because each connection is also a transaction.
4) Cut the work into several units where each unit does one job.
5) Particularly for your case, i.e. using MySQL: which storage engine you use also affects performance, as the InnoDB engine uses row-level locking. This way it can handle much higher traffic. The usual alternative, MyISAM, does not support row-level locking; it uses table locking. I'm talking about the case where another thread comes in and wants to update the same row before the first thread commits.
6) Another way to improve the performance of a Java database application is to run queries with setAutoCommit(false). By default a new JDBC connection has its auto-commit mode ON, which means every individual SQL statement is executed in its own transaction, while without auto-commit you can group SQL statements into a logical transaction, which can be either committed or rolled back by calling commit() or rollback(). A sketch combining points 2, 3 and 6 follows below.
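A rough sketch tying points 2, 3 and 6 together, assuming a javax.sql.DataSource-backed pool and the hypothetical big_table ranges from the question:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.sql.DataSource;

class ParallelRanges {
    static void run(DataSource pool) throws InterruptedException {
        ExecutorService ex = Executors.newFixedThreadPool(5);
        for (int n = 0; n < 5; n++) {
            final int range = n;
            ex.execute(() -> {
                // Point 3: each task borrows its own connection; nothing JDBC-related is shared.
                try (Connection conn = pool.getConnection();                       // point 2: from the pool
                     PreparedStatement ps = conn.prepareStatement(
                             "SELECT id FROM big_table WHERE id BETWEEN ? AND ?")) {
                    conn.setAutoCommit(false);                                     // point 6: group the work
                    ps.setLong(1, range * 1_000_000L);
                    ps.setLong(2, range * 1_000_000L + 1_000_000L);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            long id = rs.getLong("id"); // real per-row work would go here
                        }
                    }
                    conn.commit(); // only really pays off once the task also writes
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        ex.shutdown();
        ex.awaitTermination(1, TimeUnit.HOURS);
    }
}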
You can also checkout springbatch which is designed for batch processing.
Hope this helps.
It depends where the bottleneck in your system is...
If your queries spend a few seconds each establishing the connection to the database, and only a fraction of that actually running the query, you'd see a nice improvement.
However if the time is spent in mysql, running the actual query, you wouldn't see as much of a difference.
The first thing I'd do, rather than trying concurrent execution is to optimize the query, maybe add indices to your tables, and so forth.
Concurrent execution may be faster. You should also consider batch execution.
Concurrent execution will help if there is any room for parallelization. In your case, there seems to be no room for parallelization, because you have a very simple query which performs a sequential read of a huge amount of data, so your bottleneck is probably the disk transfer and then the data transfer from the server to the client.
When we say that RDBMS servers can handle thousands of requests per second we are usually talking about the kind of requests that we usually see in web applications, where each SQL query is slightly more complicated than yours, but results in much smaller disk reads (so they are likely to be found in a cache) and much smaller data transfers (stuff that fit within a web page.)
I want to iterate over records in the database and update them. However, since that updating both takes some time and is prone to errors, I need to a) not keep the db waiting (as e.g. with a ScrollableResults) and b) commit after each update.
The second thing is that this is done in multiple threads, so I need to ensure that if thread A is taking care of a record, thread B gets another one.
How can I implement this sensibly with hibernate?
To give a better idea, the following code would be executed by several threads, where all threads share a single instance of the RecordIterator:
Iterator<Record> iter = db.getRecordIterator();
while (iter.hasNext()) {
    Record rec = iter.next();
    // do something lengthy here
    db.save(rec);
}
So my question is how to implement the RecordIterator. If I perform a query on every next(), how do I ensure that I don't return the same record twice? If I don't query each time, which query should I use to return detached objects? Is there a flaw in the general approach (e.g. use one RecordIterator per thread and let the db somehow handle synchronization)? Additional info: there are way too many records to keep them locally (e.g. in a set of treated records).
Update: Because the overall process takes some time, the status of records can change in the meantime. Because of that, the ordering of the result of a query can change. I guess to solve this problem I have to mark records in the database once I return them for processing...
Hmmm, what about pushing your objects from a reader thread into some bounded blocking queue, and letting your updater threads read from that queue?
In your reader, do some paging with setFirstResult/setMaxResults. E.g. if your queue holds at most 1000 elements, fill it up 500 at a time. When the queue is full, the next push will automatically wait until the updaters take the next elements.
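A sketch of such a reader, assuming Hibernate 5+ and the Record entity from the question; the stable 'order by id' matters, because without it pages can skip or repeat rows:

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.hibernate.Session;
import org.hibernate.SessionFactory;

class PagedReader implements Runnable {
    private final SessionFactory sessionFactory;
    // Bounded queue: put() blocks once 1000 records are waiting, throttling the reader.
    private final BlockingQueue<Record> queue = new ArrayBlockingQueue<>(1000);

    PagedReader(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    BlockingQueue<Record> getQueue() {
        return queue; // the updater threads take() from this
    }

    @Override
    public void run() {
        final int pageSize = 500;
        for (int first = 0; ; first += pageSize) {
            List<Record> page;
            try (Session session = sessionFactory.openSession()) {
                page = session.createQuery("from Record order by id", Record.class)
                              .setFirstResult(first)
                              .setMaxResults(pageSize)
                              .list();
            }
            if (page.isEmpty()) {
                break; // nothing left to read
            }
            for (Record rec : page) {
                try {
                    queue.put(rec); // blocks while the updaters catch up
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}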
My suggestion, since you're sharing an instance of the master iterator, is to run all of your threads using a shared Hibernate transaction, with one load at the beginning and a big save at the end. Load all of your data into a single Set which you can iterate over with your threads (be careful of locking, so you might want to split off a section for each thread, or otherwise manage the shared resource so that you don't overlap).
The beauty of the Hibernate solution is that the records aren't immediately saved to the database, since you're using a transaction, and are stored in Hibernate's cache. At the end they're all written back to the database at once. This saves on those expensive database writes you're worried about, plus it gives you an actual object to work with on each iteration, instead of just a database row.
I see in your update that the status of the records may change during processing, and this could always cause a problem. If this is a constantly running or long-running process, then my advice, using a Hibernate solution, would be to work in smaller sets and, yes, add a flag to mark records that have been updated, so that when you move to the next set you can pick up the ones that haven't been touched.
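One hedged way to do the 'flag' part so that two workers never pick the same record: claim a batch atomically with an UPDATE, then load what was just claimed. The column names, status values, the worker_id column and the use of MySQL's UPDATE ... LIMIT are all assumptions for illustration:

import java.util.List;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

class BatchClaimer {
    // Atomically mark up to 500 unprocessed rows as belonging to this worker, then load them.
    static List<Record> claimBatch(SessionFactory sessionFactory, String workerId) {
        try (Session session = sessionFactory.openSession()) {
            Transaction tx = session.beginTransaction();
            session.createNativeQuery(
                        "UPDATE record SET status = 'IN_PROGRESS', worker_id = :w "
                      + "WHERE status = 'NEW' LIMIT 500")       // UPDATE ... LIMIT is MySQL-specific
                   .setParameter("w", workerId)
                   .executeUpdate();
            List<Record> claimed = session.createQuery(
                        "from Record where status = 'IN_PROGRESS' and workerId = :w", Record.class)
                   .setParameter("w", workerId)
                   .list();
            tx.commit();
            return claimed; // detached once the session closes; safe to process at leisure
        }
    }
}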
I'm writing an application to analyse a MySQL database, and I need to execute several DMLs simultaneously; for example:
// In ResultSet rsA: Select * from A;
rsA.beforeFirst();
while (rsA.next()) {
id = rsA.getInt("id");
// Retrieve data from table B: Select * from B where B.Id=" + id;
// Crunch some numbers using the data from B
// Close resultset B
}
I'm declaring an array of data objects, each with its own Connection to the database, which in turn calls several methods for the data analysis. The problem is that all threads use the same connection, thus all tasks throw exceptions: "Lock wait timeout exceeded; try restarting transaction".
I believe there is a way to write the code in such a way that any given object has its own connection and executes the required tasks independently of any other object. For example:
DataObject[] dataObject = new DataObject[N + 1];
dataObject[0] = new DataObject(id[0]);
dataObject[1] = new DataObject(id[1]);
dataObject[2] = new DataObject(id[2]);
...
dataObject[N] = new DataObject(id[N]);
// The 'DataObject' class has its own connection to the database,
// so each instance of the object should use its own connection.
// It also has a "run" method, which contains all the tasks required.
Executor ex = Executors.newFixedThreadPool(10);
for (int i = 0; i <= N; i++) {
    ex.execute(dataObject[i]);
}
// Here is where the problem is: each instance creates a new connection,
// but every DML from any of the objects is funnelled through just one connection
// (in the MySQL command line, "SHOW PROCESSLIST;" shows every connection, and all but
// one are idle).
Can you point me in the right direction?
Thanks
I think the problem is that you've conflated a lot of middle-tier, transactional, and persistence logic into one class.
If you're dealing directly with ResultSet, you're not thinking about things in a very object-oriented fashion.
You're smart if you can figure out how to get the database to do some of your calculations.
If not, I'd recommend keeping Connections open for the minimum time possible. Open a Connection, get the ResultSet, map it into an object or data structure, close the ResultSet and Connection in local scope, and return the mapped object/data structure for processing.
You keep persistence and processing logic separate this way. You save yourself a lot of grief by keeping connections short-lived.
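A small sketch of that shape - the Customer type, the customer table and the DataSource are placeholders. The connection lives only inside the method, and the caller only ever sees plain mapped objects:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

class CustomerDao {
    record Customer(long id, String name) {}

    static List<Customer> loadCustomers(DataSource dataSource) throws SQLException {
        List<Customer> result = new ArrayList<>();
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement("SELECT id, name FROM customer");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                result.add(new Customer(rs.getLong("id"), rs.getString("name")));
            }
        } // ResultSet, Statement and Connection are all closed here, in local scope
        return result; // the caller processes plain objects, not JDBC resources
    }
}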
If a stored procedure solution is slow it could be due to poor indexing. Another solution will perform equally poorly, if not worse. Try running EXPLAIN on your queries and see if any of them are doing a full table scan. If yes, you have some indexes to add. It could also be due to large rollback logs if your transactions are long-running. There's a lot you could and should do to ensure you've done everything possible with the solution you have before switching. You could go to a great deal of effort and still not address the root cause.
After some time of brain-breaking, I figured out my own mistakes... I want to share this new knowledge, so... here I go.
I made a very big mistake by declaring the Connection object as a static field in my code... so obviously, even though I created a new Connection for each new data object, every transaction went through a single, static connection.
With that first issue corrected, I went back to the design table, and realized that my process was:
Read an Id from an input table
Take a block of data related to the Id read in step 1, stored in other input tables
Crunch numbers: Read the related input tables and process the data stored in them
Save the results in one or more output tables
Repeat the process while I have pending Ids in the input table
Just by using a dedicated connection for input reading and a dedicated connection for output writing, the performance of my program increased... but I needed a lot more!
My original approach for steps 3 and 4 was to save each result to the output as soon as I had it... but I found a better approach:
Read the input data
Crunch the numbers, and put the results in a bunch of queues (one for each output table)
A separate thread checks every second whether there's data in any of the queues; if there is, it writes it to the tables.
So, by dividing input and output tasks across different connections, redirecting the core process output to queues, and using a dedicated thread for the output storage tasks, I finally achieved what I wanted: multithreaded DML execution!
I know there are better approaches to this particular problem, but this one works quite fine.
So... if anyone is stuck with a problem like this... I hope this helps.