I have an Oracle database.
I have a very basic Java program that queries a table every second to check the status of each record and update it.
"SELECT * FROM MYTABLE WHERE STATUS = 10 AND MODUS < 10"
I'm using OJB for this program.
The Java program runs 10 threads.
This program causes high CPU utilization, on average 40% of the total CPU on the Sun server. I have created an index for that specific query.
Yes, every second that table will have new data, and the program has to process it.
I want to know what the better way is, in Java or in Oracle, to minimize the CPU utilization while still running this kind of check every second.
My target is to process 200 records every minute.
Thanks
This sounds like a convoluted design. I'd recommend looking into Oracle AQ (Advanced Queuing).
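With AQ the processing threads block on a dequeue instead of re-running the SELECT every second, so they consume no CPU while there is nothing to do. Below is a rough sketch using the standard JMS API; the QueueConnectionFactory and Queue would be obtained from Oracle's AQ JMS client (oracle.jms.AQjmsFactory), and the worker and queue names here are made up for illustration, so treat it as an outline rather than a drop-in implementation.

import javax.jms.*;

// Sketch of one AQ consumer thread. The factory and queue are assumed to be created
// elsewhere via Oracle's AQ JMS client (oracle.jms.AQjmsFactory); names are illustrative.
public class StatusQueueWorker implements Runnable {
    private final QueueConnectionFactory factory;
    private final Queue queue;

    public StatusQueueWorker(QueueConnectionFactory factory, Queue queue) {
        this.factory = factory;
        this.queue = queue;
    }

    public void run() {
        try {
            QueueConnection connection = factory.createQueueConnection();
            QueueSession session = connection.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
            QueueReceiver receiver = session.createReceiver(queue);
            connection.start();
            while (!Thread.currentThread().isInterrupted()) {
                // Blocks until a record message arrives; no per-second polling query needed.
                Message message = receiver.receive(5000);
                if (message != null) {
                    processRecord(message);
                }
            }
            connection.close();
        } catch (JMSException e) {
            e.printStackTrace();
        }
    }

    private void processRecord(Message message) {
        // Placeholder for your existing per-record STATUS/MODUS update logic.
    }
}

At 200 records per minute this comfortably fits the target, and the 10 threads spend their time blocked in receive() rather than burning CPU on the same SELECT.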
I need to read a large result set from MS SQL Server into a Java program. I need a consistent view of the data, so it's all running under a single transaction; I don't want dirty reads.
I can split the read using OFFSET and FETCH NEXT and have each set of rows processed by a separate thread.
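Roughly, each reader thread runs something like the following (the table name, key column and page size are placeholders I've put in for illustration):

import java.sql.*;

// One reader thread's slice of the table, fetched with OFFSET / FETCH NEXT.
public class PageReader implements Runnable {
    private final Connection conn;   // connection used by this reader
    private final long offset;
    private final int pageSize;

    public PageReader(Connection conn, long offset, int pageSize) {
        this.conn = conn;
        this.offset = offset;
        this.pageSize = pageSize;
    }

    public void run() {
        String sql = "SELECT * FROM mytable ORDER BY id "
                   + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, offset);
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process the row
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}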
However, when doing this, it seems that the overall performance is ~30k rows read per second, which is pretty lame. I'd like to get to ~1M rows per second.
I've checked that I have no memory pressure using VisualVM; there are no GC pauses. Looking at machine utilisation, it seems that there is no CPU limitation either.
I believe that the upstream source (MS SQL) is the limiting factor.
Any ideas on what I should look at?
I am reading data from a Vertica database using multiple threads in Java.
I have around 20 million records, and I am opening 5 different threads running select queries like this:
start = threadnum;
while (start * 20000 <= totalRecords) {
    // run: select * from tableName order by colname limit 20000 offset start*20000
    start += 5;
}
The above loop assigns distinct blocks of 20K records to each thread to read from the database.
For example, the first thread reads the first 20K records, then the 20K records starting at position 100,000, and so on.
But I am not getting any performance improvement. In fact, if it takes x seconds for a single thread to read the 20 million records, each thread is taking almost x seconds to read its share from the database.
Shouldn't there be some improvement over x seconds (I was expecting x/5 seconds)?
Can anybody pinpoint what is going wrong?
Your database presumably lies on a single disk; that disk is connected to a motherboard using a single data cable; if the database server is on a network, then it is connected to that network using a single network cable; so, there is just one path that all that data has to pass through before it can arrive at your different threads and be processed.
The result is, of course, bad performance.
The lesson to take home is this:
Massive I/O from the same device can never be improved by multithreading.
To put it in different terms: parallelism never increases performance when the bottleneck is the transferring of the data, and all the data come from a single sequential source.
If you had 5 different databases stored on 5 different disks, that would work better.
If transferring the data was only taking half of the total time, and the other half of the time was consumed in doing computations with the data, then you would be able to halve the total time by desynchronizing the transferring from the processing, but that would require only 2 threads. (And halving the total time would be the best that you could achieve: more threads would not increase performance.)
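To make that two-thread version concrete, here is a minimal sketch of desynchronizing the transfer from the processing with a bounded hand-off queue; the Row type is a stand-in for whatever you map a result-set row to.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical Row type standing in for whatever you map a result-set row to.
class Row { }

public class FetchAndProcess {
    private static final Row POISON_PILL = new Row(); // marks the end of the stream
    private final BlockingQueue<Row> buffer = new ArrayBlockingQueue<>(10_000);

    // Reader thread: only transfers data from the database into the buffer.
    Runnable reader(Iterable<Row> resultRows) {
        return () -> {
            try {
                for (Row row : resultRows) {
                    buffer.put(row);
                }
                buffer.put(POISON_PILL);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    // Worker thread: only does the computation, overlapping with the transfer.
    Runnable worker() {
        return () -> {
            try {
                for (Row row = buffer.take(); row != POISON_PILL; row = buffer.take()) {
                    // do the per-row computation here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }
}

One thread is always transferring and the other is always computing, and at best they overlap completely, which is exactly the halving of the total time described above.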
As for why reading 20 thousand records appears to perform almost as badly as reading 20 million records, I am not sure why this is happening, but it could be due to a silly implementation of the database system that you are using.
What may be happening is that your database system is implementing the offset and limit clauses on the database driver, meaning that it implements them on the client instead of on the server. If this is in fact what is happening, then all 20 million records are being sent from the server to the client each time, and then the offset and limit clauses on the client throw most of them away and only give you the 20 thousand that you asked for.
You might think that you should be able to trick the system to work correctly by turning the query into a subquery nested inside another query, but my experience when I tried this a long time ago with some database system that I do not remember anymore is that it would result in an error saying that offset and limit cannot appear in a subquery, they must always appear in a top-level query. (Precisely because the database driver needed to be able to do its incredibly counter-productive filtering on the client.)
Another approach would be to assign an incrementing unique integer id to each row which has no gaps in the numbering, so that you can select ... where unique_id >= start and unique_id <= (start + 20000) which will definitely be executed on the server rather than on the client.
However, as I wrote above, this will probably not allow you to achieve any increase in performance by parallelizing things, because you will still have to wait for a total of 20 million rows to be transmitted from the server to the client, and it does not matter whether this is done in one go or in 1,000 goes of 20 thousand rows each. You cannot have two streams of rows simultaneously flying down a single wire.
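For completeness, here is a minimal sketch of that keyed-range approach; tableName, unique_id and the Vertica connection details are assumptions, and as explained above it mainly guarantees server-side filtering rather than a parallel speed-up.

import java.sql.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RangePartitionedRead {
    public static void main(String[] args) throws InterruptedException {
        // 20 million ids split into ranges of 20,000, handed to 5 workers.
        ExecutorService pool = Executors.newFixedThreadPool(5);
        long total = 20_000_000L;
        long chunk = 20_000L;
        for (long start = 0; start < total; start += chunk) {
            long lo = start;
            long hi = start + chunk;
            pool.submit(() -> readRange(lo, hi));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // The range predicate is evaluated on the server, unlike a client-side limit/offset.
    static void readRange(long lo, long hi) {
        String sql = "SELECT * FROM tableName WHERE unique_id >= ? AND unique_id < ?";
        // Connection URL and credentials below are placeholders.
        try (Connection conn = DriverManager.getConnection("jdbc:vertica://host:5433/db", "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process the row
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}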
I will not repeat what Mike Nakis says, as it is true and well explained:
I/O from a physical disk cannot be improved by multithreading
Nevertheless I would like to add something.
When you execute a query like this:
select * from tableName order by colname limit 20000 offset start*20000
on the client side you handle the result of the query, and that part you could improve by using multiple threads.
But on the database side you have no control over how the query is processed, and the Vertica database is probably designed to execute your query with parallel tasks according to what the machine allows.
So from the client side you may split the execution of your query across one, two or three parallel threads, but in the end it should not change much, as a professional database is designed to optimize response time according to the number of requests it receives and the resources of the machine.
No, you shouldn't get x/5 seconds. You are not thinking about the fact that you are getting 5 times the number of records in the same amount of time. It's about throughput, not about time.
In my opinion, the following is a good solution. It has worked for us to stream and process millions of records without much memory or processing overhead.
PreparedStatement pstmt = conn.prepareStatement(sql,
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
pstmt.setFetchSize(Integer.MIN_VALUE); // tell the driver to stream rows instead of buffering the whole result
ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
    // Do the thing
}
Using OFFSET x LIMIT 20000 will result in the same query being executed again and again. For 20 million records and for 20K records per execution, the query will get executed 1000 times.
OFFSET 0 LIMIT 20000 will perform well, but OFFSET 19980000 LIMIT 20000 will itself take a long time, because the query has to be executed fully and then skip 19,980,000 records from the top before returning the last 20,000.
But using the ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY options and setting the fetch size to Integer.MIN_VALUE will result in the query being executed only ONCE, with the records streamed in chunks that can be processed in a single thread.
We have a SELECT statement which takes approx. 3 seconds to execute. We are calling this DB2 query inside a nested while loop.
Ex:
while (hashmap1.hasNext()) {
    while (hashmap2.hasNext()) {
        // SQL query
    }
}
The problem is that the outer while loop executes approx. 1,200 times and the inner while loop 200 times, which means the SQL will be called 1200 * 200 = 240,000 times. Each iteration of the outer while loop takes approx. 150 seconds, so 1200 * 150 seconds is about 50 hours.
We can only afford around 12-15 hours before we have to kick off the next process.
Is there any way to do this process more quickly? Is there any technique that can help us fetch these records from DB2 faster?
Any help would be highly appreciated.
Note: We have already looked into all possible ways to cut down the number of iterations.
Sounds to me like you're trying to use the middle tier for something that the database itself is better suited for. It's a classic "N+1" query problem.
I'd rewrite this logic to execute entirely on the database as a properly indexed JOIN. That'll not only cut down on all that network back and forth, but it'll bring the database optimizer to bear and save you the expense of bringing all that data to the middle tier for processing.
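As a rough illustration (the table and column names below are invented, since the actual query isn't shown), the idea is to express both lookup sets in SQL and let DB2 join everything in one statement, instead of issuing 240,000 single-row queries from the loop:

import java.sql.*;

// Hypothetical rewrite: outer_keys and inner_keys stand in for wherever the two
// hashmaps' values really come from (existing tables, staging tables, etc.).
public class JoinedFetch {
    static void fetchAll(Connection conn) throws SQLException {
        String sql =
              "SELECT o.key1, i.key2, d.* "
            + "FROM outer_keys o "
            + "JOIN inner_keys i ON 1 = 1 "              // every outer/inner key combination
            + "JOIN detail_table d ON d.col1 = o.key1 "
            + "                   AND d.col2 = i.key2";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // one pass over the combined result instead of 240,000 round trips
            }
        }
    }
}

With suitable indexes on the joined columns, the optimizer can do the matching far more cheaply than the application loop ever could.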
I inherited a...well, I guess I can call it a piece-of-#### Struts application, and am tasked with optimizing a Levey-Jennings process that checks if our quality control standards are up to snuff.
The process itself runs fine, but there has always been a huge spike in run time, even when the dataset is small. I timed each part of the algorithm and discovered that the big time hog was the JDBC executeQuery() method.
Most recently I ran the application and logged the execution time to be 10 seconds. The executeQuery() took six of those seconds by itself. Curious to see what the problem was, I took the query into TOAD and ran it verbatim -- it only took 1 second to run.
I ran an even larger dataset, which took 60 seconds to run in the Levey-Jennings application -- however, in TOAD, it took 10.
Is this a problem with the query at all, or is using executeQuery() typically a precursor to extreme slowdown?
When you run a query in TOAD (or any other IDE), the tool wants to show you results as fast as possible. Typically it shows you a grid with between 10 and 40 rows. To show you those first 10-40 rows as fast as possible, it hints the query or changes the optimizing environment to produce those first rows as fast as possible.
Here you can see more information about the FIRST_ROWS hint: http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/sql_elements006.htm#SQLRF50302
The query in your application likely doesn't use a FIRST_ROWS hint. It wants ALL the rows as fast as possible. It doesn't care if the first row shows up immediately. So, the optimizing environment for those two queries is different.
It also doesn't help that TOAD displays the time it took to produce the first rows, because it leads you to think that that's the time it takes to get all the rows. There is an option to navigate to the last row, though. Press that and you'll see that it now takes longer.
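One quick way to compare like with like is to time the full fetch on the Java side as well, not just the executeQuery() call; a small sketch (the query string would be your Levey-Jennings SQL):

import java.sql.*;

// Times both the executeQuery() call and the full fetch, since the second part
// is what TOAD hides behind its first-rows grid.
public class FullFetchTimer {
    static void timeQuery(Connection conn, String sql) throws SQLException {
        long start = System.currentTimeMillis();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            long firstRows = System.currentTimeMillis();
            int count = 0;
            while (rs.next()) {
                count++; // touch every row so the driver actually fetches them all
            }
            long done = System.currentTimeMillis();
            System.out.println("executeQuery: " + (firstRows - start) + " ms, "
                + "full fetch of " + count + " rows: " + (done - firstRows) + " ms");
        }
    }
}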
Hope this helps.
Regards,
Rob.
I am working on a JMS application (a standalone, multithreaded Java application) which can receive 100 messages at a time; they need to be processed, and database procedures need to be called to insert/update data. The procedures are very heavy, as validations are also performed in them. Each procedure takes about 30 to 50 seconds to execute, and they are capable of running concurrently.
My concern is to execute all 100 procedures for the 100 messages and also send the reply from the JMS application within the time limit of 90 seconds.
No application server is to be used (a requirement), and the database is Teradata (RDBMS).
I am using a connection pool and a thread pool in the Java code, and I am testing the code with 90 connections.
My questions are:
(1) What should be the limit on the number of database connections at a time?
(2) How many threads at a time are recommended?
Thanks,
Jyoti
90 seems like a lot. My recommendation is to benchmark this. Your criteria are unique, and you need to make sure you get the maximum throughput.
I would make the number of concurrent connections configurable and run the code with 10 to 100 connections, going up 10 at a time. This should not take long. When throughput starts to drop off, you know you have exceeded the benefit of running concurrently.
Do it several times to make sure your results are predictable.
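A bare-bones harness for that benchmark might look like the sketch below; the procedure call is stubbed out and the message count is fixed at 100, so treat the names as placeholders to be wired to your real pool and CallableStatement.

import java.util.concurrent.*;

// Runs `messages` dummy tasks at each concurrency level and reports the elapsed time;
// replace runProcedure() with the real call through your connection pool.
public class ConcurrencyBenchmark {
    public static void main(String[] args) throws InterruptedException {
        int messages = 100;
        for (int threads = 10; threads <= 100; threads += 10) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            long start = System.currentTimeMillis();
            for (int i = 0; i < messages; i++) {
                pool.submit(ConcurrencyBenchmark::runProcedure);
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
            System.out.println(threads + " threads: " + (System.currentTimeMillis() - start) + " ms");
        }
    }

    static void runProcedure() {
        // placeholder for the CallableStatement that invokes the Teradata procedure
    }
}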
Another concern is your statement that each 'procedure is taking about 30 to 50 seconds to run'. How much of this time is processing in Java, and how much is waiting for the database to process an SQL statement? Should both times really be added together to determine the maximum number of connections you need?
Generally speaking, you should get a connection, use it, and close it as quickly as possible. If possible, avoid getting a connection, doing a bunch of Java-side processing, calling the database, doing more Java processing, and only then closing the connection; there is probably no need to hold the connection open that long. One consideration to keep in mind with this approach is which processing (including database access) needs to stay within a single transaction.
If for example, of the 50 seconds to run, only 1 second of database access is necessary, then you probably don't need such a high max number of connections.
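In code, that usually means doing the Java-side work first and borrowing a pooled connection only for the call itself, with try-with-resources returning it immediately; the procedure name and validation step below are placeholders.

import java.sql.*;
import javax.sql.DataSource;

public class MessageHandler {
    private final DataSource pool;

    public MessageHandler(DataSource pool) {
        this.pool = pool;
    }

    void handle(String message) throws SQLException {
        // 1. Do all the Java-side parsing/validation before borrowing a connection.
        String payload = message.trim();

        // 2. Borrow the connection only for the database call and return it immediately.
        try (Connection conn = pool.getConnection();
             CallableStatement cs = conn.prepareCall("{call my_heavy_procedure(?)}")) {
            cs.setString(1, payload);
            cs.execute();
        }
        // 3. Any post-processing of the result happens after the connection is back in the pool.
    }
}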