I'm trying to create a Java program to clean up and merge rows in my table. The table is large, about 500k rows, and my current solution is running very slowly. The first thing I want to do is simply get an in-memory array of objects representing all the rows of my table. Here is what I'm doing:
pick an increment of say 1000 rows at a time
use JDBC to fetch a resultset on the following SQL query
SELECT * FROM TABLE WHERE ID > 0 AND ID < 1000
add the resulting data to an in-memory array
continue querying all the way up to 500,000 in increments of 1000, each time adding results.
This is taking way too long. In fact, it's not even getting past the second increment, from 1000 to 2000. The query takes forever to finish (although when I run the same thing directly through a MySQL browser it's decently fast). It's been a while since I've used JDBC directly. Is there a faster alternative?
First of all, are you sure you need the whole table in memory? Maybe you should consider (if possible) selecting only the rows that you want to update/merge/etc. If you really have to have the whole table, you could consider using a scrollable ResultSet. You can create it like this:
// make sure autocommit is off (postgres)
con.setAutoCommit(false);
Statement stmt = con.createStatement(
        ResultSet.TYPE_SCROLL_INSENSITIVE, // or ResultSet.TYPE_FORWARD_ONLY
        ResultSet.CONCUR_READ_ONLY);
ResultSet srs = stmt.executeQuery("select * from ...");
It enables you to move to any row you want by using the absolute() and relative() methods.
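For instance, a quick sketch of jumping around in that ResultSet (the row numbers are arbitrary):

srs.absolute(500);               // jump straight to row 500 (rows are 1-based)
String value = srs.getString(1); // read a column from that row
srs.relative(-2);                // step back two rows from the current position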
One thing that helped me was Statement.setFetchSize(Integer.MIN_VALUE). I got this idea from Jason's blog. This cut execution time by more than half, and memory consumption went down dramatically (since only one row is read at a time).
This trick doesn't work for PreparedStatement, though.
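A rough sketch of that streaming setup with MySQL Connector/J (the table name is a placeholder); the statement has to be forward-only and read-only for the driver to stream:

Statement stmt = con.createStatement(
        ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE); // tells Connector/J to stream rows one at a time

ResultSet rs = stmt.executeQuery("SELECT * FROM my_table"); // placeholder table name
while (rs.next()) {
    // process the current row; only one row is held in memory at a time
}
rs.close();
stmt.close();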
Although it's probably not optimal, your solution seems like it ought to be fine for a one-off database cleanup routine. It shouldn't take that long to run a query like that and get the results (I'm assuming that since it's a one-off, a couple of seconds would be fine). Possible problems:
Is your network (or at least your connection to MySQL) very slow? If so, you could try running the process locally on the MySQL box, or on something better connected.
Is there something in the table structure that's causing it? Pulling down 10k of data for every row? 200 fields? Calculating the ID values to fetch based on a non-indexed column? You could try finding a more DB-friendly way of pulling the data (e.g. just the columns you need, having the DB aggregate values, etc.).
If you're not getting through the second increment, something is really wrong - efficient or not, you shouldn't have any problem dumping 2,000 or 20,000 rows into memory on a running JVM. Maybe you're storing the data redundantly or extremely inefficiently?
Related
I need to update / insert a large number of entries very fast. I see 2 options:
creating many queries and send them via executeBatch
create one big query (containing all updates/inserts in DB-specific syntax) and just execute it. Since the number of updates is fixed (the "batch size"), I can prepare this statement too
The target DB is Oracle. The number of inserts/updates in a batch is a fixed number between 1000 and 10000 (does this number have some impact on performance?)
So what way to go?
Your options are essentially the same. In fact they may be identical, unless your second option is implemented in a poor way.
Using built-in PreparedStatement batching is safer, since the driver knows what to do a lot better than you do. There's less chance of programmer error, and should you ever change your database provider, you won't need to double-check whether your solution is still valid.
Make sure to check out how to properly perform the batching. For example, the batch size is commonly 100 instead of the full number of rows you wish to insert (so you would have 10 executeBatch() calls to insert your 1000 rows).
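A minimal sketch of that pattern, assuming a hypothetical table my_table with columns id and value, and autocommit already turned off:

PreparedStatement ps = con.prepareStatement(
        "INSERT INTO my_table (id, value) VALUES (?, ?)"); // hypothetical table and columns
int batchSize = 100;
int count = 0;
for (Row row : rows) {              // Row and rows stand in for your own data
    ps.setLong(1, row.getId());
    ps.setString(2, row.getValue());
    ps.addBatch();
    if (++count % batchSize == 0) {
        ps.executeBatch();          // one round trip for every 100 statements
    }
}
ps.executeBatch();                  // flush any remainder
con.commit();
ps.close();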
I am trying to improve a data transfer program that I wrote. I am looking for suggestions on how to make it quicker.
My program extracts data from a database (usually Oracle 11g) by filling a ResultSet and writing the result to a file. The program periodically looks into the tables and queries whether a special column has changed. For example, this could be such a query:
select columnA, columnB from scheme.table where changeColumn = '1'
Now comes the critical part. After extracting the data I need to update this changeColumn to '0'. Since I have just used the ResultSet for exporting the data into a file I have to rewind it, so the code looks like this:
extractedData.beforeFirst();
while (extractedData.next()) {
    extractedData.updateString("changeColumn", "0");
    extractedData.updateRow();
}
Now if this ResultSet is bigger (let's say more than 100,000 entries), this loop can take hours. Does anyone have any suggestions on how to improve its performance?
I have heard of setting the fetch size to a bigger value, but usually the ResultSet contains fewer than a dozen entries. Is there a way to set the fetch size dynamically?
Use a JDBC batch update. For each row that needs updating, take its primary key, add an update for it to a batch, and execute the batch.
A good example from Mkyong shows how to do a JDBC batch update with a JDBC PreparedStatement.
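A hedged sketch of that approach, assuming the table has a numeric primary key column named id and that you collected those keys while writing the file:

PreparedStatement ps = con.prepareStatement(
        "UPDATE scheme.table SET changeColumn = '0' WHERE id = ?"); // id is an assumed key column
int count = 0;
for (long id : exportedIds) {   // exportedIds collected during the export
    ps.setLong(1, id);
    ps.addBatch();
    if (++count % 1000 == 0) {
        ps.executeBatch();
    }
}
ps.executeBatch();
con.commit();                   // assumes autocommit is off
ps.close();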
I want to get the table definitions for all my tables.
And I want to do it fast (it is part of a script that I'm running a lot).
I am using Oracle 11g and I have 700 tables. Plain JDBC code takes 4 minutes and does:
Statement s = con.createStatement();
ResultSet rs = s.executeQuery(
    "select dbms_metadata.get_ddl(object_type, object_name) " +
    "from user_objects where object_type = 'TABLE'");
while (rs.next()) {
    rs.getString(1);
}
So I want to optimize this code and get it down to around 20 seconds.
I have already got it down to 40-50 seconds by creating 14 threads, each of which opens a connection to the database and reads a part of the information, using mod on the rownum.
But this is not enough.
I am thinking in these directions:
http://docs.oracle.com/cd/B10501_01/java.920/a96654/connpoca.htm#1063660 - connection caching. Can it help speed things up if I replace my 14 connections with connection caching?
Is it possible to keep the tables accessed by this function, in the KEEP buffer cache area?
Any way of indexing some of the information here?
Any other suggestions will be greatly appreciated.
Thank you
Is it required to always get the DDL even if the tables haven't been changed? Otherwise only get the DDL of those tables where ALL_OBJECTS.LAST_DDL_TIME has changed since you last retrieved it.
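A sketch of that check in plain JDBC; how you persist the last run time is up to you (the lastRunTimestamp variable here is assumed):

PreparedStatement ps = con.prepareStatement(
        "select object_name from user_objects " +
        "where object_type = 'TABLE' and last_ddl_time > ?");
ps.setTimestamp(1, lastRunTimestamp); // when the DDL was last extracted
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    String changedTable = rs.getString(1);
    // call GET_DDL only for these tables
}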
Another option would be to write your own GET_DDL in a way that is able to get more than one table at once.
I'm afraid there is no easy way to make it faster. The whole GET_DDL thing is implemented in Java and uses XSLT transformation as part of the generation process.
Maybe you will find this faster.
http://metacpan.org/pod/DDL::Oracle
I would firstly go for HAL's suggestion of only capturing changes, but I'd also look at eliminating any options that I do not need -- STORAGE clauses, for example?
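For instance, the STORAGE and segment-attribute clauses can be switched off for the session before calling GET_DDL; a sketch using a CallableStatement (which transform parameters you drop is up to you):

CallableStatement call = con.prepareCall(
        "begin"
        + " dbms_metadata.set_transform_param(dbms_metadata.session_transform, 'STORAGE', false);"
        + " dbms_metadata.set_transform_param(dbms_metadata.session_transform, 'SEGMENT_ATTRIBUTES', false);"
        + " end;");
call.execute();
call.close();
// GET_DDL calls on this session now omit those clauses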
I am programming software in Java, using an Oracle database.
Normally we obtain values from the database using a loop like this:
ResultSet rt = (ResultSet) cs.getObject(1);
while (rt.next()) {
    // ...
}
But this seems slow when fetching thousands of rows from the database.
My question is:
In the Oracle DB, I created a procedure like this, which selects the data and assigns it to the cursor:
procedure test_pro(info out sys_refcursor) as
begin
  open info for select * from user_tbl ......
end test_pro;
In the Java code: as I mentioned before, I iterate over the ResultSet to obtain the values. But on the database side the values are already selected, so why should I have to loop to get them?
(Another point: the .NET framework has a data-binding concept. Is there any way in Java to bind to database procedures the way .NET does, without the iterating?)
Depending on what you are going to do with that data and at what frequency, the choice of a ref cursor might be a good or a bad one. Ref cursors are intended to give non-Oracle-aware programs a way to pass them data, for reporting purposes.
In your case, stick to the looping, but don't forget to implement array fetching, because it has a tremendous effect on performance. The database passes blocks of rows to your JDBC buffer at the client, and your code fetches rows from that buffer. By the time you hit the end of the buffer, the JDBC layer requests the next chunk of rows from the database, eliminating lots of network round trips. The default fetches 10 rows at a time; for larger sets, use bigger numbers if memory can provide the room.
See Oracle® Database JDBC Developer's Guide and Reference
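A minimal sketch of bumping the fetch size on the ref cursor's ResultSet (100 is just a starting point, not a tuned value):

ResultSet rt = (ResultSet) cs.getObject(1); // the ref cursor returned by the procedure
rt.setFetchSize(100); // fetch 100 rows per round trip instead of the default 10
while (rt.next()) {
    // process the current row
}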
If you know for sure there will always be exactly one result, like in this case, you can even skip the if and just call rs.next() once:
For example :
ResultSet resultset = statement.executeQuery("SELECT MAX (custID) FROM customer");
resultset.next(); // exactly one result so allowed
int max = resultset.getInt(1); // use indexed retrieval since the column has no name
Yes, you can call a procedure in Java:
http://www.mkyong.com/jdbc/jdbc-callablestatement-stored-procedure-out-parameter-example/
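A sketch of calling the test_pro procedure above and reading its ref cursor; OracleTypes.CURSOR comes from the Oracle JDBC driver:

CallableStatement cs = con.prepareCall("{ call test_pro(?) }");
cs.registerOutParameter(1, oracle.jdbc.OracleTypes.CURSOR);
cs.execute();

ResultSet rt = (ResultSet) cs.getObject(1);
while (rt.next()) {
    // read the columns of user_tbl here
}
rt.close();
cs.close();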
You can't avoid looping. For performance reasons you need to adjust the prefetch (fetch size) on your Statement or ResultSet object (100 is a solid starting point).
Why is it done this way? It's similar to reading streams - you never know how big the result can be - so you read it chunk by chunk, one buffer after another...
I have a lot of rows in a database and they must be processed, but I can't retrieve all the data into memory due to memory limitations.
At the moment, I am using LIMIT and OFFSET to retrieve the data in some specified interval.
I want to know if this is the fastest way, or whether there is another method for getting all the data from a table in the database. No filter will be applied; all the rows will be processed.
SELECT * FROM table ORDER BY column
There's no reason to be sucking the entire table into RAM. Simply open a cursor and start reading. You can play games with fetch sizes and whatnot, but the DB will happily keep its place while you process your rows.
Addenda:
Ok, if you're using Java then I have a good idea what your problem is.
First, just by using Java, you're using a cursor. That's basically what a ResultSet is in Java. Some ResultSets are more flexible than others, but 99% of them are simple, forward-only ResultSets that you call next() on to get each row.
Now as to your problem.
The problem is specifically with the Postgres JDBC driver. I don't know why they do this - perhaps it's the spec, perhaps it's something else - but regardless, Postgres has the curious characteristic that if your Connection has autoCommit set to true, then Postgres decides to suck in the entire result set on either the execute method or the first next method. It's not really important where, only that if you have a gazillion rows, you get a nice OOM exception. Not helpful.
This can easily be exactly what you're seeing, and I appreciate how it can be quite frustrating and confusing.
Most Connections default to autoCommit = true. Instead, simply set autoCommit to false.
Connection con = ...get Connection...
con.setAutoCommit(false);
PreparedStatement ps = con.prepareStatement("SELECT * FROM table ORDER BY column");
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    String col1 = rs.getString(1);
    // ...and away you go here...
}
rs.close();
ps.close();
con.close();
Note the distinct lack of exception handling, left as an exercise for the reader.
If you want more control over how many rows are fetched at a time into memory, you can use:
ps.setFetchSize(numberOfRowsToFetch);
Playing around with that might improve your performance.
Make sure you have an appropriate index on the column you use in the ORDER BY if you care about sequencing at all.
Since it's clear you're using Java based on your comments:
If you are using JDBC you will want to use:
http://download.oracle.com/javase/1.5.0/docs/api/java/sql/ResultSet.html
If you are using Hibernate it gets trickier:
http://docs.jboss.org/hibernate/core/3.3/reference/en/html/batch.html
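The scrollable-iteration pattern from that chapter looks roughly like this (the MyRow entity name is a placeholder):

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

ScrollableResults rows = session.createQuery("from MyRow") // placeholder entity
        .setCacheMode(CacheMode.IGNORE)
        .scroll(ScrollMode.FORWARD_ONLY);

int count = 0;
while (rows.next()) {
    MyRow row = (MyRow) rows.get(0);
    // process the row
    if (++count % 100 == 0) {
        session.flush();  // keep the first-level cache from growing without bound
        session.clear();
    }
}

tx.commit();
session.close();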