I am using multiple threads to insert records into different tables. In addition, I am using batch processing for the insertion to improve efficiency.
Note: the number of records to be inserted is in the millions.
My question is: should I use connection pooling in this multi-threaded environment?
My Concern:
Each thread is going to run for quite some time to perform its database operation. So if the size of my connection pool is 2 and the number of threads is 4, then at any given moment only 2 threads can run. Consequently, the other 2 threads will sit idle for a long time waiting to get a connection, as the DB operations for millions of records are time consuming. Moreover, such connection pooling would defeat the purpose of using multiple threads.
Using a connection pool in a batch job is a matter of convenience. It will help you limit the number of open connections and the abandoned time, close connections if you forget to close them, verify whether a connection is open, etc.
Check out the Plain Ol' Java example here
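For illustration, here is a minimal sketch of that setup, assuming Apache Commons DBCP2 and a hypothetical records table. The point relevant to the concern above is sizing the pool (maxTotal) to match the number of worker threads, so no thread sits idle waiting for a connection:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.commons.dbcp2.BasicDataSource;

public class BatchInsertDemo {
    private static final int THREADS = 4;

    public static void main(String[] args) {
        BasicDataSource ds = new BasicDataSource();
        ds.setUrl("jdbc:mysql://localhost:3306/mydb"); // hypothetical URL
        ds.setUsername("user");
        ds.setPassword("pass");
        ds.setMaxTotal(THREADS); // pool size == thread count, so no thread waits

        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            final int partition = t;
            pool.submit(() -> {
                // try-with-resources returns the connection to the pool when done
                try (Connection con = ds.getConnection();
                     PreparedStatement ps = con.prepareStatement(
                             "INSERT INTO records (id, payload) VALUES (?, ?)")) {
                    con.setAutoCommit(false);
                    int count = 0;
                    for (int i = partition; i < 1_000_000; i += THREADS) {
                        ps.setInt(1, i);
                        ps.setString(2, "row-" + i);
                        ps.addBatch();
                        if (++count % 500 == 0) { // send a batch every 500 rows
                            ps.executeBatch();
                            con.commit();
                        }
                    }
                    ps.executeBatch(); // flush the remaining rows
                    con.commit();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}
```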
Related
I am working on a task where I need to delete a very large number of records from MongoDB; sometimes between 2M and 3M records. I am trying to make that as fast as possible.
My idea was to use some kind of thread pool and divide the work among something like 20 threads that each delete a part of the collection. Before I go further with this approach, I would like to know whether it is a good (promising) approach or not. My main concern is that this may not be possible in Mongo, and I would get blocking behaviour in the DB, with the threads basically waiting for each other to finish deleting.
I would also be happy if any other approaches/solutions were suggested.
The project language is Java/Spring.
Before making anything "as fast as it could be" you need to understand where the bottleneck is (typically CPU, memory or disk) so that your changes actually make a difference.
When it comes to deletes, there is some overhead in the delete operation (client has to send the command to the server, server has to parse it, etc.).
Assuming you have a large number of deletes, using 2 application threads for deleting may be a good idea to reduce this overhead when measuring wallclock time.
The size of documents being deleted doesn't matter.
If you are assuming that the server will be I/O bound due to document size, then sending more requests to it concurrently wouldn't help at all (in fact that would be counterproductive).
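If you do want to experiment with client-side parallelism, a rough sketch with the MongoDB Java sync driver might look like the following; the connection string, database, collection, and the numeric seq field used to partition the range are all assumptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class ParallelDelete {
    public static void main(String[] args) throws InterruptedException {
        // connection string, database, collection and field names are hypothetical
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> col =
                client.getDatabase("mydb").getCollection("events");

        int threads = 4; // start small and measure, per the advice above
        long total = 3_000_000L;
        long slice = total / threads;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            long lo = t * slice;
            long hi = (t == threads - 1) ? total : (t + 1) * slice;
            // each task deletes one contiguous range of the numeric "seq" field
            pool.submit(() -> col.deleteMany(
                    Filters.and(Filters.gte("seq", lo), Filters.lt("seq", hi))));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.close();
    }
}
```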
As per my understanding, the execution of Java programs is pretty fast; the things that slow down an application are mainly network and I/O operations.
For example, say I have a for loop running 10,000 times which opens a file, processes some data, and saves the data back into the file. If the application is slow, it is not because the loop executes 10,000 times but because of the file opening and closing within the loop.
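A minimal sketch of that difference, writing to a hypothetical data.txt; the loop body is the same, only where the file is opened changes:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LoopIoDemo {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("data.txt"); // hypothetical file

        // Slow: the file is opened and closed on every iteration
        for (int i = 0; i < 10_000; i++) {
            try (BufferedWriter w = Files.newBufferedWriter(file)) {
                w.write("line " + i);
            }
        }

        // Faster: open once, reuse the handle, close once
        try (BufferedWriter w = Files.newBufferedWriter(file)) {
            for (int i = 0; i < 10_000; i++) {
                w.write("line " + i);
                w.newLine();
            }
        }
    }
}
```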
I have an MVC application where before I view a page I go through a Controller which in turn calls Services, which finally calls some DAO methods.
The problem is that so many queries are fired before the page loads that the page load time is 2 minutes, which is pathetic.
Since the service calls various DAO methods and each DAO method uses a different connection object, I thought of doing this: "Create only one DAO method that the Service would call and this DAO method would fire all queries on one Connection object."
So this would save the time of connecting and disconnecting to the database.
But the connection object in my application comes from a connection pool, and most connection pools don't actually close connections; they just return them to the pool. So my solution above would have no effect, as there is no real opening and closing of connections anyway.
How can I enhance the performance of my application?
Firstly, you should accurately determine where the time is spent, using a tool such as a profiler.
Once the root cause is known, you can see if the operations can be optimized, i.e., unnecessary steps removed. If not, then you can see if the result of the operations can be cached and reused.
Without an accurate understanding of the processing that is taking time, it will be difficult to make any reasonable optimization.
If you reuse connection objects from the pool, this means that the connection/disconnection does not create any performance problem.
I agree with Ashwinee K Jha that a profiler would give you some clear information about what you could optimize.
Meanwhile, some other ideas/suggestions:
Could you maintain some cache of answers? I guess that not all of the 10,000 queries are distinct! (See the sketch after this list.)
Try tuning the number of Connection objects in the Pool. There should be an optimal number.
Is your query execution already multi-threaded? I guess it is, so try tuning the number of threads. Generally the number of cores is a good number of threads, but in the case of I/O a much larger number is optimal (the big cost is the I/O, not the CPU).
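As an illustration of the caching idea, here is a minimal sketch built on ConcurrentHashMap; the QueryCache class and the DAO call in the usage note are hypothetical, and a real cache would also need eviction and invalidation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical service-layer cache: repeated queries with the same key
// are answered from memory instead of hitting the database again.
public class QueryCache {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // computeIfAbsent runs the DAO call only on a cache miss
    public Object lookup(String queryKey, Function<String, Object> daoCall) {
        return cache.computeIfAbsent(queryKey, daoCall);
    }
}
```

A call such as cache.lookup("customersByRegion:EU", k -> dao.findCustomersByRegion("EU")) (the DAO method is hypothetical) would then query the database only once per key.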
Some suggestions:
Scale your database. Perhaps the database itself is just slow.
Use 'second level caching' or application session caching to potentially speed things up and reduce the need to query the database.
Change your queries, application or schemas to reduce the number of calls made.
You can use Apache DBCP, which provides a connection pool. Database I/O calls are costly, but it is mostly the opening and closing of DB connections that takes a good chunk of the time.
You can also increase maxIdle (the maximum number of connections that can remain idle in the pool); see the configuration sketch below.
You can also look into an in-memory data grid, e.g. Hazelcast.
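A concrete sketch of those pool settings, assuming Apache Commons DBCP2 (the URL and credentials are hypothetical):

```java
import org.apache.commons.dbcp2.BasicDataSource;

public class PoolConfig {
    public static BasicDataSource newPool() {
        BasicDataSource ds = new BasicDataSource();
        ds.setUrl("jdbc:oracle:thin:@//dbhost:1521/app"); // hypothetical URL
        ds.setUsername("app");
        ds.setPassword("secret");
        ds.setMaxTotal(20);        // hard cap on open connections
        ds.setMaxIdle(20);         // how many idle connections may remain pooled
        ds.setMinIdle(5);          // keep a few warm connections ready
        ds.setMaxWaitMillis(5000); // fail fast instead of blocking forever
        return ds;
    }
}
```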
I'm working on a small CRUD application that uses plain JDBC, with a Connection held in an enum-based singleton. After reading the first part of Java Concurrency in Practice, I liked the ThreadLocal approach to writing thread-safe code. My question is:
Is wrapping a global JDBC connection in a ThreadLocal considered a good practice?
Is wrapping a global JDBC connection in a ThreadLocal considered a good practice?
It depends a lot on the particulars. If there are a large number of threads, then each one of them is going to open its own connection, which may be prohibitive. You will also end up with connections that stagnate as threads lie dormant.
It would be better to use a reentrant connection pool. Then you can reuse connections that are already open but not currently in use, while limiting the number of connections to the minimum you need to work concurrently. Apache's DBCP is a good example and is well thought of.
To quote from their docs:
Creating a new connection for each user can be time consuming (often requiring multiple seconds of clock time), in order to perform a database transaction that might take milliseconds. Opening a connection per user can be unfeasible in a publicly-hosted Internet application where the number of simultaneous users can be very large. Accordingly, developers often wish to share a "pool" of open connections between all of the application's current users. The number of users actually performing a request at any given time is usually a very small percentage of the total number of active users, and during request processing is the only time that a database connection is required.
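For comparison, here is a minimal sketch of what the ThreadLocal approach from the question looks like (the JDBC URL is hypothetical). The drawbacks noted above are visible in the code: every thread opens its own physical connection, and a dormant thread keeps its connection alive until release() is called:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Each thread lazily opens its own Connection. Note the drawbacks: one
// physical connection per thread, and idle threads hold connections open.
public final class ConnectionHolder {
    private static final ThreadLocal<Connection> LOCAL =
            ThreadLocal.withInitial(() -> {
                try {
                    return DriverManager.getConnection(
                            "jdbc:h2:mem:crud", "sa", ""); // hypothetical URL
                } catch (SQLException e) {
                    throw new IllegalStateException(e);
                }
            });

    public static Connection get() {
        return LOCAL.get();
    }

    // Must be called when the thread is done, or connections leak
    public static void release() throws SQLException {
        LOCAL.get().close();
        LOCAL.remove();
    }
}
```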
First of all, is there possibly a ready-made solution for this?
Hibernate 3.6, JDBC batch_size 500, using the HiLo generator on the 200,000 entities.
In my example case, I have a request that takes 56 seconds, during which I create 200,000 entities in the session. The session.flush() call takes 32 of those 56 seconds, with only one CPU core at 100%.
Is there a way to get the list of entities that need to be updated and create the SQL statements, say in four threads?
You cannot simply flush() in different threads, because what flush() does is basically send all pending SQL INSERT statements to the database using the underlying connection. JDBC connections aren't thread-safe, which means you would have to use 4 different connections and thus 4 different transactions. If all inserts need to take place in one transaction, there is nothing you can do here.
If you can live with 4 separate transactions, just create a thread pool and store the records in smaller batches. The pool will distribute the INSERT operations across several threads.
Also, are you sure this will really help? I would guess flush() is not CPU-bound but I/O- or network-bound. However, your experience with 100% CPU usage suggests otherwise, so I might be wrong. Also try optimizing the INSERTs: use a stateless session, raw JDBC/native queries, batch inserts, etc. Splitting into separate threads is much harder.
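If separate transactions are acceptable, a rough sketch of that thread-pool approach might look like this; MyEntity is a hypothetical mapped class, and each worker owns its own Session (and therefore its own connection and transaction), flushing and clearing every 500 entities to match the JDBC batch_size:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class ParallelInsert {
    // Sessions and JDBC connections are not thread-safe, so each worker
    // opens its own Session and commits its own transaction.
    static void insertAll(SessionFactory factory, List<List<MyEntity>> partitions)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        for (List<MyEntity> part : partitions) {
            pool.submit(() -> {
                Session session = factory.openSession();
                Transaction tx = session.beginTransaction();
                int count = 0;
                for (MyEntity entity : part) { // MyEntity is hypothetical
                    session.persist(entity);
                    if (++count % 500 == 0) { // match the JDBC batch_size of 500
                        session.flush();
                        session.clear();      // keep the session from growing
                    }
                }
                tx.commit();
                session.close();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```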
I am developing a JMS application (a standalone multithreaded Java application) which can receive 100 messages at a time; they need to be processed, and database procedures need to be called to insert/update data. The procedures are very heavy, as validations are also performed in them. Each procedure takes about 30 to 50 seconds to execute, and they are capable of running concurrently.
My concern is executing 100 procedures for all 100 messages while still sending the reply within the JMS application's time limit of 90 seconds.
No application server is to be used (a requirement), and the database is Teradata (an RDBMS).
I am using a connection pool and a thread pool in the Java code and am testing with 90 connections.
My questions are:
(1) What should be the limit on number of connections with database at a time?
(2) How many threads at a time are recommended?
Thanks,
Jyoti
90 seems like a lot. My recommendation is to benchmark this. Your criteria are unique to your situation, and you need to make sure you get the maximum throughput.
I would make the code configurable in the number of concurrent connections it uses, and run it with 10 to 100 connections, going up 10 at a time. This should not take long. When things start slowing down, you know you have exceeded the benefits of running concurrently.
Do it several times to make sure your results are predictable.
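A hypothetical harness for that measurement, assuming DBCP2 for the pool and a runWorkload placeholder standing in for the real 100-message processing:

```java
import org.apache.commons.dbcp2.BasicDataSource;

public class PoolSizeBenchmark {
    public static void main(String[] args) throws Exception {
        for (int connections = 10; connections <= 100; connections += 10) {
            try (BasicDataSource ds = new BasicDataSource()) {
                ds.setUrl("jdbc:teradata://dbhost/DATABASE=app"); // hypothetical URL
                ds.setUsername("app");
                ds.setPassword("secret");
                ds.setMaxTotal(connections);

                long start = System.nanoTime();
                runWorkload(ds); // hypothetical: process the 100 messages under test
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println(connections + " connections -> " + elapsedMs + " ms");
            }
        }
    }

    private static void runWorkload(BasicDataSource ds) {
        // placeholder for the real message-processing workload
    }
}
```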
Another concern is your statement that a 'procedure is taking about 30 to 50 seconds to run'. How much of this time is processing in Java, and how much is waiting for the database to process an SQL statement? Should both times really be added together to determine the max number of connections you need?
Generally speaking, you should get a connection, use it, and close it as quickly as possible. If possible, avoid getting a connection, doing a bunch of Java-side processing, calling the database, doing more Java processing, and only then closing the connection; there is probably no need to hold the connection open that long. A consideration to keep in mind with this approach is which processing (including database access) needs to stay in a single transaction.
If, for example, of the 50 seconds to run, only 1 second of database access is necessary, then you probably don't need such a high max number of connections.
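A minimal sketch of that 'borrow late, return early' pattern, where the stored procedure name and the validation step are hypothetical:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

public class MessageProcessor {
    private final DataSource pool; // the existing connection pool

    MessageProcessor(DataSource pool) {
        this.pool = pool;
    }

    void process(String message) throws SQLException {
        // do the Java-side work first, before borrowing a connection
        String validated = validate(message);

        // borrow the connection only for the database call itself
        try (Connection con = pool.getConnection();
             CallableStatement cs = con.prepareCall("{call insert_record(?)}")) { // hypothetical procedure
            cs.setString(1, validated);
            cs.execute();
        } // the connection goes back to the pool immediately
    }

    private String validate(String message) {
        return message.trim(); // placeholder for the heavy validation logic
    }
}
```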