Optimizing the data access layer - java

I have a web service (JAX-RS/Spring) that generates SQL queries which run against a temp table in Oracle. The data is then archived to another table (through 3 MERGE statements). The execution of each "job" (querying and merging) is done in the background through a JMS broker (ActiveMQ). The sequence of operations of each job is something like:
insert/update into table Q (select from table F) -- done between 4 and 20 times
merge into table P (select from table Q) -- twice
merge into table P (select from table F)
merge into table P (select from table F)
create a view V as select from table P
(table Q is a temp table).
When I dispatch two or three jobs like that, each job takes around 6-7 minutes to execute. But when I dispatch up to 15 at the same time, the duration stretches out far longer.
Is this happening because all these processes are trying to insert/update into the temp table Q, and are therefore fighting over that resource? What techniques should I be looking at to optimize this? For example, I thought of making 5 or 6 duplicates of table Q and "load balancing" the data access object queries across them.
Thanks

When I dispatch two or three jobs like that, each job takes around 6-7 minutes to execute. But when I dispatch up to 15 at the same time, the duration stretches out far longer.
There's any number of resources your processes could be contending for, not just the temporary table.
For starters, how many processors (CPUs/cores) does your database have? There's a pretty good rule of thumb that we shouldn't run more than 1.8 background jobs per processor. So there's no point in worrying about cloning your temporary table if you don't have enough processors to support a high degree of parallelism.
The key thing with tuning is: don't guess. Unlike some other DBMS products, Oracle has lots of instrumentation we can use to find out exactly where the time goes. It's called the Wait Interface. It's not perfect but it's a lot better than blindly re-designing your database schemas. Find out more.
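For a first look from Java, you could query the wait-event views over JDBC. A minimal sketch, assuming your account can read the v$ views (in practice you would drill down to the sessions running your jobs):

// Rough sketch: list system-wide wait events ordered by accumulated wait time,
// using an open java.sql.Connection to the Oracle instance.
String sql = "SELECT event, total_waits, time_waited_micro "
           + "FROM v$system_event ORDER BY time_waited_micro DESC";
try (Statement stmt = connection.createStatement();
     ResultSet rs = stmt.executeQuery(sql)) {
    while (rs.next()) {
        System.out.printf("%-50s %12d waits %15d us%n",
                rs.getString("event"),
                rs.getLong("total_waits"),
                rs.getLong("time_waited_micro"));
    }
}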

If Q is really a temp table (as in a GLOBAL TEMPORARY TABLE) then each session will have a separate physical object, so they won't contend for locks or at the data level.
You are more likely to get contention on the permanent table P, or on server resources (memory and disk).
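For illustration, this is the kind of definition that makes Q a true session-private temp table; a minimal sketch executed over JDBC (the column list is hypothetical):

// Each session sees only its own rows in Q; ON COMMIT PRESERVE ROWS keeps them
// for the life of the session rather than just the transaction.
try (Statement stmt = connection.createStatement()) {
    stmt.execute("CREATE GLOBAL TEMPORARY TABLE Q (id NUMBER, payload VARCHAR2(4000)) "
               + "ON COMMIT PRESERVE ROWS");
}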

Related

Best way to optimize DB inserts using Java

The program is written in Java. 25 threads are processing 1 million tasks. Each task saves data into the DB, hence the DB insert happens 1 million times. In order to optimize this, we tried the following approach:
Tasks save their data into a ConcurrentLinkedDeque.
A thread polls the deque at a periodic interval and gets all the objects available at that point in time.
Once the available objects' count reaches a threshold (say 100K), a thread is created to save them.
But this approach is not improving overall performance.
I would like to reduce the number of times (currently 1 million) the DB insert happens, in order to improve performance. Is there an alternative solution, such as a high-performing multiple-concurrent-publisher, single-concurrent-subscriber kind of implementation?
Reduce the overhead of row-by-row processing by batching commands. Many APIs include ways to batch commands, or you can combine them yourself with a statement like this:
INSERT INTO products (product_no, name, price)
SELECT 1, 'Cheese', 9.99 FROM dual UNION ALL
SELECT 2, 'Bread' , 1.99 FROM dual UNION ALL
SELECT 3, 'Milk' , 2.99 FROM dual;
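On the Java side, JDBC batching is one such API-level way to batch commands. A minimal sketch for the deque-draining approach described in the question (the table, columns, and Task accessors are hypothetical):

// Drain whatever has accumulated in the deque, then write it all in one JDBC batch.
List<Task> pending = new ArrayList<>();
Task t;
while ((t = deque.poll()) != null) {
    pending.add(t);
}

try (PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO results (task_id, payload) VALUES (?, ?)")) {
    for (Task task : pending) {
        ps.setLong(1, task.getId());
        ps.setString(2, task.getPayload());
        ps.addBatch();
    }
    ps.executeBatch();   // one round trip for the whole batch (driver-dependent)
}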

Copy data from Excel file to Database using Multithreading

I have a requirement to copy a huge amount of data from Excel (500,000 rows) to a database. Should I go with the blocking-queue method of multithreading, or is there another way to leverage multithreading on a more efficient scale?
If you want to use multithreading to improve performance, you can create one thread per table and have each thread perform the DB operations for its table. For thread management you can use an ExecutorService, ThreadPoolExecutor, etc.
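A minimal sketch of that idea, assuming one worker task per target table (copySheetToTable is a hypothetical method that does the actual JDBC inserts for one table):

ExecutorService pool = Executors.newFixedThreadPool(4);    // size to your CPU and DB capacity
for (String table : tables) {
    pool.submit(() -> copySheetToTable(table));            // one task per table
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);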
500,000 rows is not that huge an amount for a database nowadays.
I think you should, first of all, optimize the DB access, and only go with more advanced techniques if you don't reach the desired performance.
You've stated Java; two optimizations like this come to mind:
Use PreparedStatement and not Statement from JDBC (or, if you use any abstraction over JDBC, make sure that's the case under the hood). This allows the DB to avoid reparsing the query every time.
Use batch operations. These alone will boost the speed by an order of magnitude or so, depending on your RDBMS setup:
PreparedStatement pstmt = connection.prepareStatement(
        "INSERT INTO my_table (col1, col2) VALUES (?, ?)"); // placeholder SQL; substitute your own
int count = 0;
for (MyRecord rec : records) {           // MyRecord is a placeholder for your row type
    pstmt.setString(1, rec.getCol1());   // bind the parameters
    pstmt.setString(2, rec.getCol2());
    pstmt.addBatch();                    // add to the batch
    if (++count % 1000 == 0) {           // choose a batch size like 500 to 2000
        pstmt.executeBatch();            // one round trip for the whole chunk
    }
}
pstmt.executeBatch();                    // flush the remaining rows
It can take less than a minute to perform all these operations, or 1-2 minutes, but not, say, hours (depending of course on where you insert the data and on the network quality, but typically that is the case).
If that's not enough, you can of course go with parallel access, using a number of connections simultaneously. But again, if it's a one-time operation, I wouldn't bother; after all, it will take you more time to write the multithreaded code than the difference in performance you'll gain :)

Reading from Database through multiple threads in java

I am reading data from a Vertica database using multiple threads in Java.
I have around 20 million records, and I am opening 5 different threads with select queries like this:
start = threadnum;
while (start * 20000 <= totalRecords) {
    select * from tableName order by colname limit 20000 offset start*20000;
    start += 5;
}
The above loop assigns each thread 20K distinct records to read from the DB.
For example, the first thread reads the first 20K records, then the 20K records starting at position 100,000, and so on.
But I am not getting any performance improvement. In fact, if it takes x seconds to read 20 million records using a single thread, it takes almost x seconds for each thread to read from the database.
Shouldn't there be some improvement over x seconds (I was expecting x/5 seconds)?
Can anybody pinpoint what is going wrong?
Your database presumably lies on a single disk; that disk is connected to a motherboard using a single data cable; if the database server is on a network, then it is connected to that network using a single network cable; so, there is just one path that all that data has to pass through before it can arrive at your different threads and be processed.
The result is, of course, bad performance.
The lesson to take home is this:
Massive I/O from the same device can never be improved by multithreading.
To put it in different terms: parallelism never increases performance when the bottleneck is the transferring of the data, and all the data come from a single sequential source.
If you had 5 different databases stored on 5 different disks, that would work better.
If transferring the data was only taking half of the total time, and the other half of the time was consumed in doing computations with the data, then you would be able to halve the total time by desynchronizing the transferring from the processing, but that would require only 2 threads. (And halving the total time would be the best that you could achieve: more threads would not increase performance.)
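A minimal sketch of that two-thread decoupling, using a bounded queue so the reader cannot run far ahead of the worker (hasMoreRows, readNextRow and process are hypothetical placeholders):

// One thread transfers rows from the database, the other processes them.
BlockingQueue<Object[]> buffer = new ArrayBlockingQueue<>(10_000);

Thread reader = new Thread(() -> {
    try {
        while (hasMoreRows()) {
            buffer.put(readNextRow());   // blocks when the worker falls behind
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});

Thread worker = new Thread(() -> {
    try {
        while (true) {
            process(buffer.take());      // blocks while waiting for the reader
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});

reader.start();
worker.start();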
As for why reading 20 thousand records appears to perform almost as bad as reading 20 million records, I am not sure why this is happening, but it could be due to a silly implementation of the database system that you are using.
What may be happening is that your database system is implementing the offset and limit clauses on the database driver, meaning that it implements them on the client instead of on the server. If this is in fact what is happening, then all 20 million records are being sent from the server to the client each time, and then the offset and limit clauses on the client throw most of them away and only give you the 20 thousand that you asked for.
You might think that you should be able to trick the system to work correctly by turning the query into a subquery nested inside another query, but my experience when I tried this a long time ago with some database system that I do not remember anymore is that it would result in an error saying that offset and limit cannot appear in a subquery, they must always appear in a top-level query. (Precisely because the database driver needed to be able to do its incredibly counter-productive filtering on the client.)
Another approach would be to assign an incrementing unique integer id, with no gaps in the numbering, to each row, so that you can select ... where unique_id >= start and unique_id < start + 20000, which will definitely be executed on the server rather than on the client.
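A minimal sketch of that range-based selection over JDBC, assuming the gap-free unique_id column exists:

String sql = "SELECT * FROM tableName WHERE unique_id >= ? AND unique_id < ?";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setLong(1, start);
    ps.setLong(2, start + 20000);        // half-open range: exactly 20,000 ids
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // process the row
        }
    }
}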
However, as I wrote above, this will probably not allow you to achieve any increase in performance by parallelizing things, because you will still have to wait for a total of 20 million rows to be transmitted from the server to the client, and it does not matter whether this is done in one go or in 1,000 goes of 20 thousand rows each. You cannot have two streams of rows simultaneously flying down a single wire.
I will not repeat what Mike Nakis says, as it is true and well explained:
I/O from a physical disk cannot be improved by multithreading
Nevertheless, I would like to add something.
When you execute a query like this:
select * from tableName order by colname limit 20000 offset start*20000.
on the client side you may handle the result of the query, and that handling is something you could improve by using multiple threads.
But on the database side you have no control over the processing of the query, and the Vertica database is probably already designed to execute your query using parallel tasks according to the machine's capabilities.
So on the client side you may split the execution of your query into one, two or three parallel threads; it should not change much in the end, as a professional database is designed to optimize its response time according to the number of requests it receives and the machine's capabilities.
No, you shouldn't get x/5 seconds. You are not thinking about the fact that you are getting 5 times the number of records in the same amount of time. It's about throughput, not about time.
In my opinion, the following is a good solution. It has worked for us to stream and process millions of records without much memory or processing overhead.
PreparedStatement pstmt = conn.prepareStatement(sql,
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
pstmt.setFetchSize(Integer.MIN_VALUE);
ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
    // Do the thing
}
Using OFFSET x LIMIT 20000 results in the same query being executed again and again. For 20 million records, at 20K records per execution, the query gets executed 1,000 times.
OFFSET 0 LIMIT 20000 will perform well, but OFFSET 19980000 LIMIT 20000 itself will take a lot of time: the query has to be executed fully, and then the first 19,980,000 records are skipped in order to return the last 20,000.
But using the ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY options, and setting the fetch size to Integer.MIN_VALUE, results in the query being executed only ONCE; the records are streamed in chunks and can be processed in a single thread.

Distribute database records evenly across multiple processes

I have a database table with 3 million records. A Java thread reads 10,000 records from the table and processes them. After processing, it jumps to the next 10,000, and so on. In order to speed this up, I have 25 threads doing the same task (reading + processing), and I have 4 physical servers running the same Java program. So effectively I have 100 threads doing the same work (reading + processing).
The strategy I have used is to have a SQL procedure which does the work of grabbing the next 10,000 records and marking them as being processed by a particular thread. However, I have noticed that the threads seem to spend some time waiting to invoke the procedure and get a response back. What other strategies can I use to speed up this data selection?
My database server is MySQL and the programming language is Java.
The idiomatic way of handling such a scenario is the producer-consumer design pattern, and the idiomatic way of implementing it in Java land is by using JMS.
Essentially you need one master server reading records and pushing them to a JMS queue. Then you have an arbitrary number of consumers reading from that queue and competing with each other. It is up to you how you implement this in detail: do you want to send a message with the whole record or only its ID? All 10,000 records in one message, or one record per message?
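A minimal sketch of the producer side with ActiveMQ, sending one message per record ID (the broker URL, queue name, and fetchNextBatchOfIds are assumptions for illustration):

ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
Connection connection = factory.createConnection();
connection.start();
try {
    Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    MessageProducer producer = session.createProducer(session.createQueue("records.to.process"));
    for (long id : fetchNextBatchOfIds()) {                       // hypothetical DAO call
        producer.send(session.createTextMessage(Long.toString(id)));
    }
} finally {
    connection.close();
}

The consumers then simply read from the queue and process each record, so the work spreads itself across them without the custom locking procedure.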
Another approach is map-reduce; check out Hadoop. But the learning curve is a bit steeper.
Sounds like a job for Hadoop to me.
I would suspect that you are majorly database IO bound with this scheme. If you are trying to increase performance of your system, I would suggest partitioning your data across multiple database servers if you can do so. MySQL has some partitioning modes that I have no experience with. If you do partition yourself, it can add a lot of complexity to a database schema and you'd have to add some sort of routing layer using a hash mechanism to divide up your records across the multiple partitions somehow. But I suspect you'd get a significant speed increase and your threads would not be waiting nearly as much.
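A minimal sketch of such a routing layer, assuming a fixed list of shard DataSources and a numeric record ID (all names are illustrative):

// Pick the shard for a record by hashing its ID; floorMod keeps the index non-negative.
List<DataSource> shards = Arrays.asList(shard0, shard1, shard2, shard3);
int index = Math.floorMod(Long.hashCode(recordId), shards.size());
try (Connection c = shards.get(index).getConnection()) {
    // read and process the record against its shard
}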
If you cannot partition your data, then moving your database to a SSD memory drive would be a huge win I suspect -- anything to increase the IO rates on those partitions. Stay away from RAID5 because of the inherent performance issues. If you need a reliable file system then mirroring or RAID10 would have much better performance with RAID50 also being an option for a large partition.
Lastly, you might find that your application performs better with fewer threads if you are thrashing your database IO bus. This depends on a number of factors, including concurrent queries, database layout, etc. You might try dialing down the per-client thread count to see if that makes a difference. The effect may be minimal, however.

Durable map to map to queue for fair scheduling?

Our system needs to process billions of queries from thousands of clients for millions of resources. Some resources will be queried much more often than others. Each client will submit anywhere from hundreds to hundreds-of-millions of queries at a time. Because each resource can only support thousands of queries per minute, the queries will be enqueued and their results will be determined asynchronously.
Now, here's the rub: Each client's queries need to be given equal priority with respect to each resource. That is, if one client submits a million queries for a particular resource, and then another client submits a dozen, immediately after, then the second client should not have to wait for the first client's queries to be processed before theirs are. Rather, first the one client's first query should be handled, and then the other's first query, then the first's second query, and so on, back and forth. (And the analogous idea for more than two clients, and multiple resources; also, it can be a little less granular, as long as this basic idea is preserved).
If this were small enough to fit in memory, we'd just have a map from resources to a map from accounts to a queue of queries, and circularly iterate accounts, per resource; but it's not, so we need a disk-based solution. We also need it to be robust, highly available, transactional, etc. What are my options? I'm using Java SE.
Thanks in advance!
Ahead of time, I know HBase much better than I do Cassandra. Some aspects of my response are HBase specific, and I'll mark them as such.
Assuming that you provision enough hardware, then a BigTable implementation like Cassandra or HBase would give you the following:
The ability to store and retrieve your queries at an extremely high rate
The ability to absorb deletes at an extremely high rate (though with HBase and Cassandra, flushing writes to disk can cause periodic delays)
Trivially, I could see a schema where you used a combination of resource-id as row key and account-id and perhaps timestamp as column key, but (in HBase specifically) this could lead to hotspots in the servers hosting certain popular resources (in both HBase and Cassandra, a single server is responsible for hosting the master copy of any given row at a time). In Cassandra you can reduce the overhead of updates by using async writes (writing to only one or two nodes, and allowing gossip to replicate them), but this could result in old records being around dramatically longer than you expect in situations where network traffic is high. In HBase writes are always consistent and always written to the RegionServer hosting the row, so hotspotting is definitely a potential problem.
You can reduce the impact of hotspotting by making your row key a combination of resource ID and account id, but then you need to scan all row keys to determine the list of accounts that have outstanding queries for a resource.
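For illustration, such a composite row key might be assembled with HBase's Bytes utility (a sketch; the field names are hypothetical):

// Prefixing with the resource ID keeps a resource's queries together; appending the
// account ID distinguishes the entries within that prefix.
byte[] rowKey = Bytes.add(
        Bytes.toBytes(resourceId),
        Bytes.toBytes(accountId));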
One other potential advantage that you may not have considered is the potential capability to run your queries directly from the HBase or Cassandra data nodes, saving you the need to ship your query over the network again to an executor process to actually run that query. You might want to look into HBase Coprocessors or Cassandra Plugins to do something like that. Specifically I am talking about turning this workflow:
/-> Query -> Executor -> Resource -> Results -> \
Client -> Query -> Query Storage --> Query -> Executor -> Resource -> Results -> --> Client
\-> Query -> Executor -> Resource -> Results -> /
into something like:
/-> Query -> Resource -> Results -> \
Client -> Query -> Query Storage --> Query -> Resource -> Results -> --> Client
\-> Query -> Resource -> Results -> /
This may not make sense in your use case though.
I can give you some answers with respect to Cassandra.
Cassandra internally writes only new data files and only does so sequentially, never overwriting or modifying existing files, and has an append-only write-ahead log like transactional relational databases. Cassandra internally sees deletes as essentially just like any other writes.
Cassandra is linearly scalable across many nodes and has no single point of failure. It is linearly scalable for both reads and writes. That is to say, a single cluster can support any number of concurrent reads and writes you wish to throw at it, so long as you add enough nodes to the cluster and give the cluster time to rebalance data across the new nodes. Netflix recently load-tested Cassandra on EC2 and found linear scalability, with the largest cluster they tested at 288 nodes supporting 1,000,000 writes/sec sustained for an hour.
Cassandra supports many consistency levels. When performing each read or write from Cassandra, you specify with what consistency level you want that read or write to be executed. This lets you determine, per-read and per-write, whether that read or write must be fast or must be done consistently across all nodes hosting that row.
Cassandra does not support multi-operation transactions.
If the Cassandra data model works well in your case, Cassandra may well be the simplest solution, at least at the operations level. Every node is configured exactly alike. There are no masters and no slaves, only peers of equals. It is not necessary to set up separate load balancing, failover, heartbeats, log shipping, replication, etc.
But the only way to find out for sure is to test it out.
