Is it worth parallelizing queries with JDBC and MySQL? - java

One JDBC "select" statement takes 5 seconds to complete.
So doing 5 statements takes 25 seconds.
Now I try to do the job in parallel. The db is MySQL with InnoDB.
I start 5 threads and give each thread its own db connection. But it still takes 25 seconds for all to complete?
Note that I give Java enough heap and have 8 cores, but only one hard disk (maybe the single disk is the bottleneck here?).
Is this the expected behaviour with MySQL out of the box?
Here is example code:
public void doWork(int n) {
    try (Connection conn = pool.getConnection();
         PreparedStatement stmt = conn.prepareStatement(
                 "select id from big_table where id between ? and ?")) {
        // bind the range instead of concatenating it into the SQL string
        stmt.setLong(1, n * 1_000_000L);
        stmt.setLong(2, n * 1_000_000L + 1_000_000L);
        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                long itemId = rs.getLong("id");
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
}
public void doWorkBatch() {
    for (int i = 0; i < 5; i++)
        doWork(i);
}
public void doWorkParallel() {
    for (int i = 0; i < 5; i++) {
        final int n = i; // a lambda can only capture an (effectively) final variable
        new Thread(() -> doWork(n)).start();
    }
    System.console().readLine(); // crude wait so the JVM stays alive while the threads run
}
(I don't recall where, but I read that a standard MySQL installation can easily handle 1000 connections in parallel.)

Looking at your problem, multi-threading will definitely improve your performance; I once converted a 4-5 hour batch job into a 7-10 minute job by doing exactly what you're planning. But you need to know the following things beforehand while designing:
1) You need to think about inter-task dependencies, i.e. between tasks executed on different threads.
2) Using a connection pool is a good sign, since creating database connections is a slow process in Java and takes a long time.
3) Each thread needs its own JDBC connection. Connections can't be shared between threads, because each connection is also a transaction.
4) Cut the work into several units, where each unit does one job.
5) Particularly for your case, i.e. using MySQL: which database engine you use also affects performance. The InnoDB engine uses row-level locking, so it can handle much higher traffic. The (usual) alternative, MyISAM, does not support row-level locking; it uses table locking. I'm talking about the case where another thread comes in and wants to update the same row before the first thread commits.
6) Running queries with setAutoCommit(false) improves the performance of a Java database application. By default a new JDBC connection has its auto-commit mode ON, which means every individual SQL statement is executed in its own transaction; without auto-commit you can group SQL statements into a logical transaction, which can either be committed or rolled back by calling commit() or rollback(). See the sketch just below.
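A minimal sketch of point 6, assuming a connection pool named pool; the table, column, and messages variable are made up for illustration:
try (Connection conn = pool.getConnection()) {
    conn.setAutoCommit(false); // one logical transaction instead of one per statement
    try (PreparedStatement ps = conn.prepareStatement(
            "insert into audit_log (message) values (?)")) {
        for (String message : messages) {
            ps.setString(1, message);
            ps.executeUpdate();
        }
        conn.commit(); // commit the whole group at once
    } catch (SQLException e) {
        conn.rollback(); // undo the partial work on failure
        throw e;
    }
}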
You can also check out Spring Batch, which is designed for batch processing.
Hope this helps.

It depends where the bottleneck in your system is...
If your queries spend a few seconds each establishing the connection to the database, and only a fraction of that actually running the query, you'd see a nice improvement.
However if the time is spent in mysql, running the actual query, you wouldn't see as much of a difference.
The first thing I'd do, rather than trying concurrent execution, is optimize the query: add indices to your tables, and so forth.

Concurrent execution may be faster. You should also consider batch execution; a sketch follows below.
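As a hedged illustration of JDBC batching (the pool, table, and column names are assumptions, and note that batching applies to INSERT/UPDATE statements rather than SELECTs): statements are queued on the client and sent to the server together, saving round trips.
try (Connection conn = pool.getConnection();
     PreparedStatement ps = conn.prepareStatement(
             "insert into big_table (id) values (?)")) {
    conn.setAutoCommit(false);
    for (long id = 0; id < 1_000; id++) {
        ps.setLong(1, id);
        ps.addBatch();   // queue the statement, don't execute yet
    }
    ps.executeBatch();   // send all queued statements in one go
    conn.commit();
}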

Concurrent execution will help if there is any room for parallelization. In your case, there seems to be no room for parallelization, because you have a very simple query which performs a sequential read of a huge amount of data, so your bottleneck is probably the disk transfer and then the data transfer from the server to the client.
When we say that RDBMS servers can handle thousands of requests per second, we are usually talking about the kind of requests we see in web applications, where each SQL query is slightly more complicated than yours but results in much smaller disk reads (so they are likely to be found in a cache) and much smaller data transfers (stuff that fits within a web page).

Related

PreparedStatement in Threads (java)

I am writing a program in which threads insert into a db.
Example
public static void save(String name) {
    try (PreparedStatement preparedStatement = ...insert...) {
        preparedStatement.setString(1, name);
        preparedStatement.executeUpdate();
        // no explicit close() needed: try-with-resources closes the statement
    } catch (...) {
    }
}
Question: Could it happen that, when threads insert into a table simultaneously, one thread calls preparedStatement.executeUpdate() on the PreparedStatement of another thread?
Absolutely. You should not be doing this - each thread needs to have its own database connection (which therefore implies it necessarily also ends up having its own PreparedStatement).
Better yet - don't do this. You're just making things confusing and slower, it's lose-lose-lose. There is no benefit at all to your plan. The database isn't going to magically do the job faster if you insert from 2 threads simultaneously.
The conclusion is simple: threads are a really bad idea when INSERTing a lot of data into the same table, so DO NOT DO IT!
But I really want to speed up my INSERTs!
My data gathering is slow
IF (big if!!) gathering the data for insertion is slower than the database can insert records for you, and the data gathering job lends itself well to multi-threading, then have threads that gather the data, but have these threads put objects onto a queue, and have a separate 'DB inserter' thread (the only thread that even has a connection to the DB) that pulls objects off this queue and runs an INSERT.
If you can gather the data quickly, or the source does not lend itself to multi-threading, this only makes your code longer, harder to understand, considerably harder to test, and slower. No point at all.
Useful tools: LinkedBlockingQueue - an instance of this is the one piece of shared data all threads have. Your data-gatherer threads toss objects onto this queue, and your single DB-inserter thread fetches objects off of it. A sketch follows below.
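A minimal sketch of that hand-off (Data, gatherOne(), and insertOne() are hypothetical placeholders, not from the original post):
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueuedInserter {
    record Data(String name) {}

    private final BlockingQueue<Data> queue = new LinkedBlockingQueue<>(10_000);

    public void startGatherers(int threads) {
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                try {
                    while (true) queue.put(gatherOne()); // blocks when the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }

    // The only thread that ever touches the database connection.
    public void runInserter() throws InterruptedException {
        while (true) {
            Data d = queue.take(); // blocks until a gatherer hands something off
            insertOne(d);
        }
    }

    private Data gatherOne() { return new Data("example"); } // stand-in for slow gathering
    private void insertOne(Data d) { /* run the INSERT here */ }
}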
General insert speed advice 1: bundling
DBs work in transactions. If you have autocommit mode on (and Connections start in this mode), that's not 'no transactions'. That's merely (hence the name): the DB commits after every statement. You can't do 'non-transactional' in proper databases. A commit() is heavy (takes a long time to process), but so are excessively long transactions (doing thousands of things in a single transaction before committing). Thus you get the goldilocks principle: run about 500 or so inserts, then commit.
Note that this has a downside: if an error occurs halfway through this process, some records have been committed and some haven't been. Keep that in mind - if that is not acceptable, you need to make your process recoverable (e.g. by having a column that records the 'insert session' id, so you can delete them all if the operation cannot be completed properly) - and if your DB is simultaneously used by other stuff, you need more complexity as well (some sort of flag or filter so that other simultaneous code doesn't consider any of the committed, inserted records until the entire batch is completely added).
Relevant methods:
con.setAutoCommit(false);
con.commit()
This general structure:
try (PreparedStatement ps = con.prepare.......) {
    int inserted = 0;
    while (!allGenerationDone) {
        Data d = queue.take();
        ps.setString(1, d.getName());
        ps.setDate(2, d.getBirthDate());
        // set the other stuff
        ps.execute();
        if (++inserted % 500 == 0) con.commit(); // commit every 500 inserts
    }
}
con.commit(); // commit whatever remains of the final, partial batch
General insert speed advice 2: bulk
Most DB engines have special commands for bulk insertion. From a DB engine's perspective, various cleanup and care tasks take a ton of time and may not even be necessary, or can be combined to save a load of time, when doing bulk inserts. Specifically, checking of constraints (particularly, reference constraints) and building of indices take most of the time of processing an INSERT, and these things can either be skipped entirely or sped up considerably by doing them in bulk all at once at the end.
The way to do this is highly dependent on the underlying database. For example, in postgres, you can turn off constraint checking and turn off index building, then run your inserts, then re-enable. You can even choose to omit constraint checks entirely (meaning, your DB could be in an invalid state if your code is messed up, but if speed is more important than safety this can be the right way to go about it). Index building is considerably faster if done at the end.
Other databases generally have similar strategies. Alternatively, there are commands that combine it all, generally called COPY (instead of INSERT). Check your DB engine's docs.
Read this SO question for some info and benchmarks on how COPY compares to INSERT, and do a web search for e.g. mysql bulk insert. A MySQL-specific sketch follows below.
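One MySQL-specific option worth knowing (stated as an assumption here; verify against the Connector/J documentation for your driver version) is the rewriteBatchedStatements connection flag, which lets the driver rewrite addBatch() inserts into far fewer multi-row INSERT statements:
// Host, database, credentials, and table are placeholders.
String url = "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true";
try (Connection conn = DriverManager.getConnection(url, "user", "password");
     PreparedStatement ps = conn.prepareStatement(
             "insert into big_table (id) values (?)")) {
    for (long id = 0; id < 10_000; id++) {
        ps.setLong(1, id);
        ps.addBatch();
    }
    ps.executeBatch(); // sent as multi-row INSERTs instead of 10,000 single ones
}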

Why do I see a large number of connections in V$SESSION for a period of time?

We are tuning a server where clients can create queries and send them to Oracle. This server can create a pool and keep connections on standby. The number of standby connections is something we can control, and that is what we are trying to tune. While tuning this minimal number of standby connections, we were checking the V$SESSION view to see the standby connections and when they became active. At that moment we saw the number of "connections" grow to 70 or 80 at a time while the query was executing. My guess is that these are not connections per se; they look like the places where it is reading the data from? I am not sure, and that is why I'd like to know: what are those? They only show while the query executes. Here is the query I am using to check in Oracle what my connections are doing:
select TO_CHAR(s.prev_exec_start, 'DD-MON-YYYY HH24:MI:SS') as "LAST_RAN", s.*
from V$SESSION s
where username = 'MY_USER_NAME';
It would help if you could show some of the output from your query, but it is very possible that what you are seeing are parallel execution threads for some of the queries. These will spawn automatically to handle the parallel processing requirements and generally have process names like 'P000', 'P001', etc.
The level of parallelism for a query can be defined as part of the related object's properties, or within query hints. Short of making changes to parallel-processing behavior at the system level (see Documentation), or disabling parallel processing entirely for these sessions (which might completely kill query performance), there isn't much you can do.
A logon trigger to disable parallel execution might look like this:
CREATE OR REPLACE TRIGGER APP_SCHEMA_LOGON_TRG
AFTER LOGON ON APP_USER.SCHEMA
BEGIN
execute immediate 'ALTER SESSION DISABLE PARALLEL QUERY';
END;
/

Oracle JDBC connection timed out issue

I have a scenario in production for a web app, where when a form is submitted the data gets stored in 3 tables in Oracle DB through JDBC. Sometimes I am seeing connection time out errors in logs while the app is trying to connect to Oracle DB through Java code. This is intermittent.
Below is the exception:
SQL exception while storing data in table
java.sql.SQLRecoverableException: IO Error: Connection timed out
Most of the time the web app is able to connect to the database and insert values, but sometimes I get this timeout error and am unable to insert data. I am not sure why I am getting this intermittent issue. When I checked the connection pool config in my application, I noticed the following:
Pool Size (Maximum number of Connections that this pool can open) : 10
Pool wait (Maximum wait time, in milliseconds, before throwing an Exception if all pooled Connections are in use) : 1000
Since the pool size is just 10, will this connection timeout issue occur if there are multiple users trying to connect to the database?
Also, since there are 3 tables where the data insertion occurs, we do the whole insertion on just one connection. We do not open a separate DB connection for each individual table.
NOTE: This application is deployed on AEM (Content Management system) server and connections pool config is provided by them.
Update: I tried setting the validation query in the connection pool, but I am still getting the connection timeout error. I am not sure whether the connection pool has checked the validation query or not. I have attached the connection pool config above for reference.
I would try two things:
Try setting a validation query so that each time the pool leases a connection, you're sure it's actually available; select 1 from dual should work. On recent JDBC drivers that should not be required, but you might give it a go (see the sketch at the end of this answer).
Estimate the concurrency of your form. A 10-connection pool is not too small, depending on the complexity of your work on the DB. It seems you're saving a form, so it should not be THAT complex. How many users per day do you expect? Then, at peak time, how many users do you expect to be using the form at the same time? A 10-connection pool often leases and retrieves connections quite fast, so it can handle several transactions per second. If you expect more, increase the size slightly (more than 25-30 connections actually degrades DB performance, as more queries compete for resources there).
If nothing seems to work, it would be good to check what's happening on your DB. If possible, use Enterprise Manager to see if there are latches while doing stuff on those three tables.
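If you were managing the pool yourself rather than relying on the AEM-provided one, a validation query could be configured like this with, for example, Apache Commons DBCP2 (the connect string, credentials, and sizes are illustrative assumptions):
import org.apache.commons.dbcp2.BasicDataSource;

BasicDataSource ds = new BasicDataSource();
ds.setUrl("jdbc:oracle:thin:@//dbhost:1521/SERVICE"); // placeholder connect string
ds.setUsername("app_user");                           // placeholder credentials
ds.setPassword("secret");
ds.setMaxTotal(10);                                   // pool size
ds.setMaxWaitMillis(1000);                            // pool wait before throwing
ds.setValidationQuery("select 1 from dual");
ds.setTestOnBorrow(true);                             // validate on every lease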
I'll answer this from a programming point of view. There are multiple possible causes for this problem; they follow below, each with an appropriate solution. A connection timeout means a new thread does not get database access within the configured wait time, which can happen because:
Possibility I: Connections are not being closed; there may be a connection leak somewhere in your application.
Solution: Check for the leak and make sure every connection is closed after use.
Possibility II: A big transaction.
Solution:
i. If the insertions are synchronized, use synchronization very carefully: at block level, not method level, and keep the synchronized block as small as possible. With a big synchronized block, a thread holds its connection but spends a long time inside the block, so the waiting time of every other thread grows. Suppose we have 100 users, each with threads for this operation: the first executes and takes a long time while the others wait, so the 80th or 90th thread may hit the timeout. So you must reduce the size of the synchronized block.
ii. Also check whether the transaction is big; if so, try to cut it into smaller ones where possible. For example, the first insertion in one small transaction, the second in another, so that three small transactions complete the operation.
Possibility III: The pool size is not enough for the application's load.
Solution: Increase the pool size. (This is applicable only if you properly close all connections after use.)
You can also use a Java ExecutorService in this case: one thread, one connection, all asynchronous. Once a transaction completes, release the connection back to the pool. That way you can get rid of this timeout issue. A sketch follows below.
If one connection is inserting the data into 3 tables while the other threads trying to get a connection are waiting, a timeout is bound to happen.
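A minimal sketch of the ExecutorService idea (pool, the thread count, and insertIntoThreeTables are placeholders for illustration):
ExecutorService executor = Executors.newFixedThreadPool(10); // sized to match the pool

executor.submit(() -> {
    try (Connection conn = pool.getConnection()) { // close() returns it to the pool
        conn.setAutoCommit(false);
        insertIntoThreeTables(conn); // all three inserts in one transaction
        conn.commit();
    } catch (SQLException e) {
        // log, and retry or surface the failure
    }
});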

Concurrent use of same JDBC connection by multiple threads

I'm trying to better understand what will happen if multiple threads try to execute different sql queries, using the same JDBC connection, concurrently.
Will the outcome be functionally correct?
What are the performance implications?
Will thread A have to wait for thread B to be completely done with its query?
Or will thread A be able to send its query immediately after thread B has sent its query, after which the database will execute both queries in parallel?
I see that the Apache DBCP uses synchronization protocols to ensure that connections obtained from the pool are removed from the pool, and made unavailable, until they are closed. This seems more inconvenient than it needs to be. I'm thinking of building my own "pool" simply by creating a static list of open connections, and distributing them in a round-robin manner.
I don't mind the occasional performance degradation, and the convenience of not having to close the connection after every use seems very appealing. Is there any downside to me doing this?
I ran the following set of tests using an AWS RDS Postgres database and Java 11:
Create a table with 11M rows, each row containing a single TEXT column, populated with a random 100-char string
Pick a random 5 character string, and search for partial-matches of this string, in the above table
Time how long the above query takes to return results. In my case, it takes ~23 seconds. Because there are very few results returned, we can conclude that the majority of this 23 seconds is spent waiting for the DB to run the full-table-scan, and not in sending the request/response packets
Run multiple queries in parallel (with different keywords), using different connections. In my case, I see that they all complete in ~23 seconds. Ie, the queries are being efficiently parallelized
Run multiple queries on parallel threads, using the same connection. I now see that the first result comes back in ~23 seconds. The second result comes back in ~46 seconds. The third in ~1 minute. etc etc. All the results are functionally correct, in that they match the specific keyword queried by that thread
To add on to what Joni mentioned earlier, his conclusion matches the behavior I'm seeing on Postgres as well. It appears that all "correctness" is preserved, but all parallelism benefits are lost, if multiple queries are sent on the same connection at the same time.
Since the JDBC spec doesn't give guarantees of concurrent execution, this question can only be answered by testing the drivers you're interested in, or reading their source code.
In the case of MySQL Connector/J, all methods to execute statements lock the connection with a synchronized block. That is, if one thread is running a query, other threads using the connection will be blocked until it finishes.
Doing things the wrong way will have undefined results... if someone runs some tests, maybe they'll answer all your questions exactly, but then a new JVM comes out, or someone tries it on another JDBC driver or database version, or hits a different set of race conditions, or tries another platform or JVM implementation, and a different undefined result happens.
If two threads modify the same state at the same time, anything could happen depending on the timing. Maybe the second overwrites the first's query, and then both run the same query. Maybe the library will detect your error and throw an exception. I don't know and wouldn't bother testing... (or maybe someone already knows, or it should be obvious what would happen), so this isn't "the answer", just some advice: use a connection pool, or use a synchronized block to ensure problems don't happen. A sketch of the latter follows below.
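A minimal sketch of the synchronized-block approach (sharedConnection, the table, and the value are placeholders; a pool is usually the better fix):
// Every thread must synchronize on the same object, here the shared connection.
synchronized (sharedConnection) {
    try (PreparedStatement ps = sharedConnection.prepareStatement(
            "select id from some_table where id = ?")) {
        ps.setLong(1, 42L);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) { /* consume the row */ }
        }
    }
}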
We had to disable the statement cache on WebSphere, because it was throwing an ArrayIndexOutOfBoundsException at the PreparedStatement level.
The issue was that someone thought it was smart to share a connection with multiple threads.
He said it was to save connections, but there is no point in multithreading queries because the db won't run them in parallel anyway.
There was also an issue with Java runnables blocking each other because they used the same connection.
So that's just something not to do; there is nothing to gain.
There is an option in WebSphere to detect this multithreaded access.
I implemented my own since we use Jetty in development.

c3p0 getConnection Hangs as Number of Connections are Increased

i"m running a web service on Heroku and using New Relic to monitor its performance. I'm using MySQL with Hibernate on top. My non default c3p0 settings are the following
hibernate.c3p0.maxStatementsPerConnection, 5
hibernate.c3p0.maxPoolSize, 35
hibernate.c3p0.minPoolSize, 5
hibernate.c3p0.initialPoolSize, 10
hibernate.c3p0.acquireIncrement, 10
Every single request to my web service hits the database at least a couple of times. After running a load test of about 200 requests/minute for 10min I see most of time is spent in
com.mchange.v2.c3p0.impl.AbstractPoolBackedDataSource.getConnection
My guess is that it's waiting for a connection in the connection pool? The interesting part is that when I increased
hibernate.c3p0.maxPoolSize, 40
the performance got worse (longer wait time in the same getConnection call). During the test I can see that the max number of c3p0 connections is indeed open at the MySQL server (the max connection count on MySQL's end is 300, definitely not exhausted).
All of my database functions use the same format
public void executeTransaction( Session session, IGenericQuery<T> query, T entity )
{
Transaction tx = null;
try
{
tx = session.beginTransaction();
query.execute( session, entity );
tx.commit();
}
catch ( RuntimeException e )
{
try
{
tx.rollback();
}
catch ( RuntimeException e2 )
{
}
throw e;
}
finally
{
if ( session != null )
{
session.close();
}
}
}
so I'm certain all sessions are closed, which should translate into connections closing. Why does the wait time grow as I increase the max number of connections? Performance seems to improve from hibernate.c3p0.maxPoolSize, 25 to hibernate.c3p0.maxPoolSize, 30, but drops after hibernate.c3p0.maxPoolSize, 35. Are my values far off?
Thanks!
as a guess, i would try increasing numHelperThreads. you have a heavy load; maybe c3p0's administrative Threads are getting backed up. (You should be able to see this if you dump stack traces or use JMX to monitor c3p0. If you have enough helper threads, they should generally be idle(), wait()ing. If they are getting backed up, you'll see them mostly active and runnable, and by JMX you'll see tasks queued.)
an insufficiency of helper threads is consistent with your observed better-then-worse performance with maxPoolSize. initially you get what you want, more Connections at the ready, but then the helper Threads fail to keep up and adding more Connections just makes things worse.
given your settings, helper Threads shouldn't have too much work to do, UNLESS maxStatementsPerConnection is too small. if your app has more than 5 PreparedStatements that are run frequently, then you will end up churning through Statements and tying up helper Threads with Statement close() tasks. you might try making this value larger. it should be approximately (rounding up) the number of distinct PreparedStatements used on an ongoing basis by your application. (You can ignore single or very rarely used PreparedStatements, involved for example in setup or cleanup.) again, monitoring what helper threads are up to would give you information about whether this is the issue. (you'd see backed-up Statement close() tasks.)
so, things to try: increase numHelperThreads, increase maxStatementsPerConnection (or set it to zero, to turn off Statement caching entirely.)
good luck!
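As a hedged illustration of these suggestions (the values are guesses to tune, and depending on your Hibernate version some c3p0 settings may need to go into c3p0.properties rather than the Hibernate properties), the tweaks in the same property style as above:
hibernate.c3p0.numHelperThreads, 10
hibernate.c3p0.maxStatementsPerConnection, 30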
