I am reading data from vertica database using multiple threads in java.
I have around 20 million records and I am opening 5 different threads, each running select queries like this:
start = threadnum;
while (start * 20000 <= totalRecords) {
    select * from tableName order by colname limit 20000 offset start*20000;
    start += 5;
}
The above loop assigns each thread distinct blocks of 20K records to read from the db.
For example, the first thread reads the first 20K records, then the 20K records starting at position 100,000, and so on.
But I am not seeing any performance improvement. In fact, if a single thread takes x seconds to read 20 million records, each of the five threads is also taking almost x seconds to read from the database.
Shouldn't there be some improvement from x seconds (was expecting x/5 seconds)?
Can anybody pinpoint what is going wrong?
Your database presumably lies on a single disk; that disk is connected to a motherboard using a single data cable; if the database server is on a network, then it is connected to that network using a single network cable; so, there is just one path that all that data has to pass through before it can arrive at your different threads and be processed.
The result is, of course, bad performance.
The lesson to take home is this:
Massive I/O from the same device can never be improved by multithreading.
To put it in different terms: parallelism never increases performance when the bottleneck is the transferring of the data, and all the data come from a single sequential source.
If you had 5 different databases stored on 5 different disks, that would work better.
If transferring the data was only taking half of the total time, and the other half of the time was consumed in doing computations with the data, then you would be able to halve the total time by desynchronizing the transferring from the processing, but that would require only 2 threads. (And halving the total time would be the best that you could achieve: more threads would not increase performance.)
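A minimal sketch of that two-thread split, with a BlockingQueue handing batches from a reader to a processor. The batch contents and the per-batch "work" here are stand-ins for the real database fetch and computation, not actual JDBC calls:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReadProcessPipeline {
    private static final int[] EOF = new int[0]; // sentinel: no more batches

    /** Overlaps "transfer" and "compute": one thread fetches batches (the
     *  single sequential source), the other processes them concurrently.
     *  Returns the number of batches processed. */
    static int run(int batches) throws InterruptedException {
        BlockingQueue<int[]> handoff = new ArrayBlockingQueue<>(8);
        int[] processed = {0};

        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < batches; i++) handoff.put(new int[]{i}); // "fetch"
                handoff.put(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread worker = new Thread(() -> {
            try {
                for (int[] b = handoff.take(); b != EOF; b = handoff.take()) {
                    processed[0]++; // stand-in for the per-batch computation
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        reader.start(); worker.start();
        reader.join(); worker.join();
        return processed[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(100)); // 100
    }
}
```

With the real calls plugged in, the reader is I/O-bound and the worker CPU-bound, which is exactly the case where two threads (and only two) help.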
As for why reading 20 thousand records appears to perform almost as badly as reading 20 million records, I am not sure why this is happening, but it could be due to a silly implementation of the database system that you are using.
What may be happening is that your database system is implementing the offset and limit clauses on the database driver, meaning that it implements them on the client instead of on the server. If this is in fact what is happening, then all 20 million records are being sent from the server to the client each time, and then the offset and limit clauses on the client throw most of them away and only give you the 20 thousand that you asked for.
You might think that you could trick the system into working correctly by turning the query into a subquery nested inside another query, but my experience when I tried this a long time ago with some database system I no longer remember is that it produced an error saying that offset and limit cannot appear in a subquery; they must always appear in a top-level query. (Precisely because the database driver needed to be able to do its incredibly counter-productive filtering on the client.)
Another approach would be to assign an incrementing unique integer id to each row, with no gaps in the numbering, so that you can select ... where unique_id >= start and unique_id < (start + 20000), which will definitely be executed on the server rather than on the client.
However, as I wrote above, this will probably not allow you to achieve any increase in performance by parallelizing things, because you will still have to wait for a total of 20 million rows to be transmitted from the server to the client, and it does not matter whether this is done in one go or in 1,000 goes of 20 thousand rows each. You cannot have two streams of rows simultaneously flying down a single wire.
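To make the id-range scheme concrete, the sketch below mirrors the question's numbers (5 threads, 20K-row chunks, 20 million rows) and assumes a gap-free id column starting at 1; each chunk then becomes a server-side WHERE predicate instead of an offset:

```java
import java.util.ArrayList;
import java.util.List;

public class RangePartition {
    static final long TOTAL = 20_000_000L; // total records, per the question
    static final long CHUNK = 20_000L;     // rows per query
    static final int THREADS = 5;

    /** Half-open id ranges [lo, hi) that thread t should query. Each range
     *  would be issued as:
     *  SELECT ... WHERE unique_id >= lo AND unique_id < hi */
    static List<long[]> rangesFor(int t) {
        List<long[]> ranges = new ArrayList<>();
        for (long chunk = t; chunk * CHUNK < TOTAL; chunk += THREADS) {
            long lo = chunk * CHUNK + 1;              // ids assumed gap-free, starting at 1
            long hi = Math.min(lo + CHUNK, TOTAL + 1);
            ranges.add(new long[]{lo, hi});
        }
        return ranges;
    }

    public static void main(String[] args) {
        long covered = 0;
        for (int t = 0; t < THREADS; t++) {
            for (long[] r : rangesFor(t)) covered += r[1] - r[0];
        }
        System.out.println(covered); // 20000000 -- every row assigned exactly once
    }
}
```

The ranges partition the table with no overlap, but as noted above, all 20 million rows still cross the same wire, so this fixes correctness and server-side cost, not transfer time.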
I will not repeat what Mike Nakis says, as it is true and well explained:
I/O from a physical disk cannot be improved by multithreading
Nevertheless I would like to add something.
When you execute a query like this:
select * from tableName order by colname limit 20000 offset start*20000;
From the client side, you may improve how you handle the result of the query, for example by using multiple threads.
But on the database side you have no control over the processing of the query, and Vertica is probably designed to execute your query with parallel tasks according to the machine's capabilities.
So while you may split the execution of your query across one, two, or three parallel client threads, it should not change much in the end: a professional database is designed to optimize its response time according to the number of requests it receives and the machine's capabilities.
No, you shouldn't get x/5 seconds. You are overlooking the fact that you are getting 5 times the number of records in the same amount of time. It's about throughput, not about time.
In my opinion, the following is a good solution. It has worked for us to stream and process millions of records without much memory or processing overhead.
// A forward-only, read-only statement lets the driver stream the result set.
PreparedStatement pstmt = conn.prepareStatement(sql,
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
// Note: Integer.MIN_VALUE is MySQL Connector/J's signal for row-by-row
// streaming; most other drivers expect a positive fetch size here.
pstmt.setFetchSize(Integer.MIN_VALUE);
ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
    // Do the thing
}
Using OFFSET x LIMIT 20000 results in the same query being executed again and again. For 20 million records at 20K records per execution, the query gets executed 1,000 times.
OFFSET 0 LIMIT 20000 will perform well, but OFFSET 19980000 LIMIT 20000 on its own will take a long time: the query is executed in full, and the server then has to skip 19,980,000 records from the top to hand back the last 20,000.
But using the ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY options and setting the fetch size to Integer.MIN_VALUE results in the query being executed only ONCE, with the records streamed in chunks that can be processed in a single thread.
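A back-of-the-envelope count makes the pagination cost concrete. Assuming the server must materialize offset + limit ordered rows for each execution before discarding the skipped prefix (which the ORDER BY forces in the worst case), the 1,000 paginated executions touch hundreds of times more rows than a single streamed pass:

```java
public class OffsetCost {
    /** Rows the server must produce across all paginated executions,
     *  assuming execution k materializes offset + limit = (k+1) * limit rows. */
    static long paginatedRowsTouched(long limit, long executions) {
        long total = 0;
        for (long k = 0; k < executions; k++) {
            total += (k + 1) * limit;
        }
        return total;
    }

    public static void main(String[] args) {
        long touched = paginatedRowsTouched(20_000L, 1_000L);
        System.out.println(touched);               // 10010000000 rows materialized in total
        System.out.println(touched / 20_000_000L); // ~500x the cost of one streamed pass
    }
}
```

Indexes or smarter plans can soften this in practice, but the asymmetry between OFFSET 0 and OFFSET 19980000 in the answer above is exactly this triangular sum.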
I have a table of one million records. For each record I have to do a math operation in Java on one of its columns, and it takes a fair amount of time. To improve the total time, I created threads. If I create 10 threads, each thread queries 100,000 records using a SQL OFFSET and FETCH FIRST, so between them the threads cover the whole million.
It does improve the processing; I have 12 free processors. But I noticed that while 4 threads are much better than one (half the total time), 8 threads take almost the same time as 4.
On each thread, I am using something like
Connection conn = DriverManager.getConnection(...);
// Scroll-sensitive, updatable result set so each row can be updated in place.
Statement stmt = conn.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,
        ResultSet.CONCUR_UPDATABLE);
ResultSet rs = stmt.executeQuery(
        "select COL from DATA OFFSET X ROWS FETCH FIRST Y ROWS ONLY");
while (rs.next()) {
    // do math stuff on COL
    rs.updateString("COL", new_math_stuff);
    rs.updateRow();
}
I was wondering whether having different cursors, on different connections, over different parts of the table can still cause too much I/O on the table, so that more threads don't help, or whether the problem might be somewhere else. (I have very limited profiling capabilities because I am working on a remote computer, and to compile the code I have to send it to someone who approves and runs it.)
My SQL query takes just 1 minute to execute, but fetching the total rows, which can be up to 1 million, takes almost 11 minutes.
Following is the pseudocode:
stmt.setFetchSize(5000);
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
    // Do some task
}
I have tried changing the fetch size from 100 to 80,000, but nothing helps; it still takes more than 10 minutes.
If I remove the while loop and simply execute the query, the whole process takes just a few minutes, which implies that the query itself executes fast.
Any suggestions on how to improve the time for fetching of rows?
The only things you can do at the app level are to:
1. return less data (either fewer rows or fewer columns)
2. increase the array fetch size (thereby reducing network round trips via SQL*Net, which is a "chatty" protocol)
3. get a faster network (thereby reducing the time spent in network round trips)
4. co-locate the app server on the same local area network as the DB (essentially a faster network).
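Point 2 is easy to quantify: the number of network round trips is roughly the row count divided by the fetch size. A rough sketch (the 1 ms round-trip latency is an assumed figure, not a measurement):

```java
public class FetchSizeMath {
    /** Network round trips needed to pull `rows` rows at a given fetch size. */
    static long roundTrips(long rows, int fetchSize) {
        return (rows + fetchSize - 1) / fetchSize; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;   // row count from the question
        double rttMillis = 1.0;   // assumed per-round-trip latency
        for (int fetch : new int[]{100, 5_000, 80_000}) {
            long trips = roundTrips(rows, fetch);
            System.out.println("fetch size " + fetch + ": " + trips
                    + " round trips, ~" + (trips * rttMillis / 1000.0) + " s of pure latency");
        }
    }
}
```

This also shows why the asker saw no gain past a few thousand: once round trips drop to the hundreds, latency stops being the bottleneck and raw transfer time dominates.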
While coding the solution to the problem of downloading a huge dynamic zip with low RAM impact, an idea started besieging me, and led to this question, asked out of pure curiosity / hunger for knowledge:
What kind of drawbacks could I run into if, instead of loading the InputStreams one at a time (with separate queries to the database), I loaded all the InputStreams in a single query, returning a List of (n, potentially thousands of, "open") InputStreams?
Current (safe) version: n queries, one inputStream instantiated at a time
for (long id : ids) {
    InputStream in = getMyService().loadStreamById(id);
    IOUtils.copyStream(in, out);
    in.close();
}
Hypothetical version: one query, n instantiated inputStreams
List<InputStream> streams = getMyService().loadAllStreams();
for (InputStream in : streams) {
    IOUtils.copyStream(in, out);
    in.close();
    in = null;
}
What are the pros and cons of the second approach, excluding the (I suppose small) amount of memory used to keep multiple Java InputStreams instantiated?
Could it lead to some kind of network freeze or database stress (or locks, or problems if others read/write the same BLOB field a stream is pointing to, etc.), more than multiple queries would?
Or are the streams smart enough to be almost invisible until asked for data, so that 1 query + 1000 active streams could be better than 1000 queries + 1 active stream?
The short answer is that you risk hitting a limit of your operating system and/or DBMS.
The longer answer depends on the specific operating system and DBMS, but here are a few things to think about:
On Linux there are a maximum number of open file descriptors that any process can hold. The default is/was 1024, but it's relatively easy to increase. The intent of this limit IMO is to kill a poorly-written process, as the amount of memory required per file/socket is minimal (on a modern machine).
If the open stream represents an individual socket connection to the database, there's a hard limit on the total number of client sockets that a single machine may open to a single server address/port. This is driven by the client's dynamic port range, and it's typically either 16K or 32K (but can be modified). This limit applies across all processes on the machine, so excessive consumption by one process may starve other processes trying to access the same server.
Depending on how the DBMS manages the connections used to retrieve BLOBs, you may run into a much smaller limit enforced by the DBMS. Oracle, for example, defaults to a total of 50 "cursors" (active retrieval operations) per user connection.
Aside from these limits, you won't get any benefit given your code as written, since it runs through the connections sequentially. If you were to use multiple threads to read, you may see some benefit from having multiple concurrent connections. However, I'd still open those connections on an as-needed basis. And lest you think of spawning a thread for each connection (and running into the physical limit of number of threads), you'll probably reach a practical throughput limit before you hit any physical limits.
I tested it in PostgreSQL, and it works.
Since PostgreSQL does not seem to have a predefined max cursor limit, I still don't know whether the simple assignment of a cursor/pointer from a BLOB field to a Java InputStream object via java.sql.ResultSet.getBinaryStream("blob_field") counts as an active retrieval operation (I guess not, but who knows...).
Loading all the InputStreams at once with something like SELECT blob_field FROM table WHERE length(blob_field) > 0 produced a very long query execution time, followed by very fast access to the binary content (sequentially, as above).
With a test case of 200 MB in 20 files of 10 MB each:
The old way took circa 1 second per query, plus 0.XX seconds for the other operations (reading each InputStream and writing it to the OutputStream, among other things).
Total elapsed time: 35 seconds.
The experimental way took circa 22 seconds for the big query, and 12 seconds total for iterating and performing the other operations.
Total elapsed time: 34 seconds.
This makes me think that while the BinaryStream from the database is being assigned to the Java InputStream object, the complete read is already being performed :/ making an InputStream behave like a byte[] (but worse in this case, because of the memory overhead of having all the items instantiated at once).
Conclusion
Reading all at once is a bit faster (~1 second saved per 30 seconds of execution),
but it could easily make the big query time out, besides causing RAM exhaustion and, potentially, max-cursor hits.
Do not try this at home; just stick with one query at a time...
We have a SELECT statement which takes approx. 3 secs to execute. We are calling this DB2 query inside a nested while loop.
Ex:
while (hashmap1.hasNext()) {
    while (hashmap2.hasNext()) {
        // SQL query
    }
}
The problem is that the outer while loop executes approx. 1200 times and the inner while loop 200 times, which means the SQL gets called 1200 * 200 = 240,000 times. Each iteration of the outer while loop takes approx. 150 secs, so 1200 * 150 secs = 50 hrs.
We can afford only around 12-15hrs of time, before we kick off the next process.
Is there any way to do this process quickly? Any new technology which can help us in fetching these records faster from DB2.
Any help would be highly appreciated.
Note: We already looked into all possible ways to cut down the no.of iterations.
Sounds to me like you're trying to use the middle tier for something that the database itself is better suited for. It's a classic "N+1" query problem.
I'd rewrite this logic to execute entirely on the database as a properly indexed JOIN. That'll not only cut down on all that network back and forth, but it'll bring the database optimizer to bear and save you the expense of bringing all that data to the middle tier for processing.
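The arithmetic, using the numbers straight from the question, shows the scale of what a single JOIN eliminates:

```java
public class NPlusOneCost {
    public static void main(String[] args) {
        long outerIterations = 1_200;   // outer while loop, per the question
        long innerIterations = 200;     // inner while loop
        long secondsPerOuter = 150;     // measured time per outer iteration

        long queriesIssued = outerIterations * innerIterations;
        long totalHours = outerIterations * secondsPerOuter / 3600;

        System.out.println(queriesIssued); // 240000 separate statements
        System.out.println(totalHours);    // 50 hours of wall-clock time
        // A single properly indexed JOIN issues ONE statement, so the
        // per-query network round trip and execution overhead is paid once
        // instead of 240,000 times.
    }
}
```

Even if the joined query itself takes minutes, eliminating 239,999 round trips is where the 50-to-12 hour gap closes.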
I am working on a JMS application (a standalone multithreaded Java application) which can receive 100 messages at a time; they need to be processed, and database procedures need to be called to insert/update data. The procedures are very heavy, as validations are also performed in them. Each procedure takes about 30 to 50 seconds to execute, and they are capable of running concurrently.
My concern is to execute 100 procedures for all 100 messages and also have the JMS application send its reply within the 90-second time limit.
No application server is to be used (a requirement), and the database is Teradata (an RDBMS).
I am using a connection pool and a thread pool in the Java code, and I am testing with 90 connections.
Question is :
(1) What should be the limit on number of connections with database at a time?
(2) How many threads at a time are recommended?
Thanks,
Jyoti
90 seems like a lot. My recommendation is to benchmark this. Your requirements are unique, and you need to make sure you get the maximum throughput.
I would make the number of concurrent connections configurable and run the code with 10 to 100 connections, going up 10 at a time. This should not take long. When things start slowing down, you know you have exceeded the benefit of running concurrently.
Do it several times to make sure your results are predictable.
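A minimal harness for that kind of stepping could look like the sketch below. Thread.sleep stands in for the real procedure call here, so the absolute numbers are meaningless until you swap in the actual JDBC work; the shape of the curve as the pool grows is what matters:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrencyBenchmark {
    /** Runs `tasks` copies of `work` on a fixed pool of `poolSize` threads
     *  and returns the wall-clock milliseconds taken. `work` is a stand-in
     *  for one stored-procedure call. */
    static long timeWith(int poolSize, int tasks, Runnable work) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) pool.submit(work);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated procedure: replace with the real connection-pool + JDBC call.
        Runnable fakeProcedure = () -> {
            try { Thread.sleep(20); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        };
        // Step the pool size up and watch where throughput stops improving.
        for (int poolSize : new int[]{1, 2, 4, 8}) {
            System.out.println(poolSize + " threads: "
                    + timeWith(poolSize, 16, fakeProcedure) + " ms");
        }
    }
}
```

Running the whole sweep several times, as suggested, smooths out JIT warm-up and scheduling noise in the measurements.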
Another concern is your statement that a "procedure is taking about 30 to 50 seconds to run". How much of this time is processing in Java, and how much is waiting for the database to process a SQL statement? Should both times really be added together when determining the maximum number of connections you need?
Generally speaking, you should get a connection, use it, and close it as quickly as your Java logic allows. Avoid getting a connection, doing a bunch of Java-side processing, calling the database, doing more Java processing, and only then closing the connection; there is usually no need to hold the connection open that long. One consideration to keep in mind with this approach is which processing (including database access) needs to stay within a single transaction.
If for example, of the 50 seconds to run, only 1 second of database access is necessary, then you probably don't need such a high max number of connections.