I am using a Java JDBC application to fetch about 500,000 records from the database. The database is Oracle. I write the data to a file as soon as each row is fetched. Since it takes about an hour to fetch all the data, I am trying to increase the fetch size of the result set. I have seen in multiple links that when increasing the fetch size one should be careful about memory consumption. Does increasing the fetch size actually increase the heap memory used by the JVM?
Suppose the fetch size is 10 and the query returns 100 rows in total. During the first fetch the result set contains 10 records. Once I read the first 10 records, the result set fetches the next 10. Does this mean that after the second fetch the result set will contain 20 records? Are the earlier 10 records still kept in memory, or are they released when the newer batch is fetched?
Any help is appreciated.
It depends. Different drivers may behave differently and different ResultSet settings may behave differently.
If you have a CONCUR_READ_ONLY, FETCH_FORWARD, TYPE_FORWARD_ONLY ResultSet, the driver will almost certainly hold in memory only about as many rows as your fetch size (of course, data for earlier rows will remain in memory for some time until it is garbage collected). If you have a TYPE_SCROLL_INSENSITIVE ResultSet, on the other hand, the driver will very likely keep all the data fetched so far in memory so that you can scroll backwards and forwards through it. That is not the only possible way to implement this behavior, so different drivers (and different versions of a driver) may behave differently, but it is the simplest approach and the one most drivers I've come across take.
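A minimal sketch of the two cases, just to make the difference concrete; the connection details, table name, and fetch size below are placeholders rather than anything from the question:

import java.sql.*;

public class FetchSizeSketch {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details.
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")) {

            // Forward-only, read-only: the driver only needs to buffer roughly
            // one fetch-size worth of rows at a time.
            try (Statement forwardOnly = con.createStatement(
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                forwardOnly.setFetchSize(500);   // rows per round trip
                try (ResultSet rs = forwardOnly.executeQuery("select * from big_table")) {
                    while (rs.next()) {
                        // write the row to the file here; earlier rows become GC-eligible
                    }
                }
            }

            // Scroll-insensitive: the driver typically caches every row fetched so far
            // to allow scrolling backwards, so memory grows with the total number of
            // rows read rather than with the fetch size.
            try (Statement scrollable = con.createStatement(
                    ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY)) {
                scrollable.setFetchSize(500);
                // ...
            }
        }
    }
}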
While increasing the fetch size may help performance a bit, I would also look into tuning the SDU size, which controls the size of the packets at the SQL*Net layer. Increasing the SDU size can speed up data transfers.
Of course, the time it takes to fetch these 500,000 rows largely depends on how much data you're fetching. If it takes an hour, I'm guessing you're fetching a lot of data and/or doing it from a remote client over a WAN.
To change the SDU size:
First, change the default SDU size on the server to 32 KB (starting with 11.2.0.3 you can even use 64 KB, and up to 2 MB starting with 12c) by changing or adding this line in sqlnet.ora on the server:
DEFAULT_SDU_SIZE=32767
Then modify your JDBC URL:
jdbc:oracle:thin:@(DESCRIPTION=(SDU=32767)(ADDRESS=(PROTOCOL=tcp)(HOST=...)(PORT=...))(CONNECT_DATA=...))
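For completeness, here is a sketch of how a descriptor-style URL like that gets used from JDBC; the host, port, service name, and credentials are placeholders to replace with your own values:

// Placeholders only: substitute your own host, port, service name and credentials.
String url = "jdbc:oracle:thin:@(DESCRIPTION=(SDU=32767)"
        + "(ADDRESS=(PROTOCOL=tcp)(HOST=dbhost)(PORT=1521))"
        + "(CONNECT_DATA=(SERVICE_NAME=myservice)))";
Connection con = DriverManager.getConnection(url, "user", "password");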
Related
I have a case where I need to efficiently load about 1 million rows into memory for processing. I am using Oracle and plain JDBC for this.
If I do not set the fetch size, the Oracle driver default of 10 is used, meaning it will require 100k round trips, which makes performance super inefficient. If I raise the fetch size to something very large, such as 500k or 1M, the data is loaded in about 5 seconds.
Unfortunately I can't set the fetch size to something like INT_MAX because the Oracle driver pre-allocates its buffer based on the fetch size.
What I really want is a way to force JDBC to simply get all of the rows, without a cursor or any incremental fetching, and to do that in the most memory-efficient way possible.
Is there a way to tell Oracle to just return all the data and not do any incremental fetching?
The 12c driver allocates 15 bytes per column per row of the fetch size for bookkeeping, plus the actual data size. So if you set the fetch size to 1G, the 12c driver will allocate around 15 GB per column for bookkeeping, plus whatever the actual data takes. So depending on how much memory you have, you can set the fetch size to whatever your memory can support.
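As a back-of-the-envelope sketch of that bookkeeping estimate (the 15-bytes-per-column-per-row figure is the one quoted above; the fetch size and column count are placeholders):

// Rough bookkeeping estimate for the 12c driver: fetch size x columns x 15 bytes,
// before any actual row data.
long fetchSize = 500_000L;   // placeholder
int columns = 20;            // placeholder

long bookkeepingBytes = fetchSize * columns * 15L;
System.out.printf("~%.0f MB of bookkeeping before any row data%n",
        bookkeepingBytes / (1024.0 * 1024.0));
// 500,000 rows x 20 columns x 15 bytes ≈ 143 MB, plus the data itself.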
Before 12c, the drivers allocated much, much more memory, so the fetch size would have to be much smaller.
There's no way to tell the driver to just fetch all the rows in one round trip.
Edit 2017-03-10: We have tested with up to 2G rows and the recently released 12.2 drivers handle this fine with a FORWARD_ONLY result set. They will handle up to 1,802,723,000 rows with a SCROLL_INSENSITIVE result set. That is an expected limit. Sort of obviously this is with a huge machine and a ginormous heap.
Until now I was using something like this to query my database, and it was working perfectly fine:
PreparedStatement prepStmt = dbCon.prepareStatement(mySql);
ResultSet rs = prepStmt.executeQuery();
But then I needed to use rs.first() in order to be able to iterate over my ResultSet multiple times, so now I use:
PreparedStatement prepStmt = dbCon.prepareStatement(mySql,ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_UPDATABLE);
My question is about the performance of the two. What do I lose if I use the second option? Will using the second option have any negative effect on the code I have written so far?
PS: Note that my application is a multi-user, database-intensive web application (on WebLogic 10.3.4) that uses a back-end Oracle 11g database.
Thanks all for your attention.
UPDATE
My maximum result set size will be less than 1000 rows and 15-20 columns.
If you're using scrollability (your second option), pay attention to this:
Important: Because all rows of any scrollable result set are stored in the client-side cache, a situation where the result set contains many rows, many columns, or very large columns might cause the client-side Java Virtual Machine (JVM) to fail. Do not specify scrollability for a large result set.
Source: Oracle Database JDBC Developer's Guide and Reference
Since an Oracle cursor is a forward-only structure, the JDBC driver generally has to cache the results in memory in order to simulate a scrollable cursor and guarantee that the same results come back when you iterate through them a second time. Depending on the number and size of the rows the query returns, that can mean a substantial amount of additional memory consumed on the application server. On the other hand, it also means that iterating through the ResultSet a second time should be much more efficient than the first time.
Whether the extra memory required is meaningful depends on your application. You say that the largest ResultSet will have 1000 rows. If you figure that each row is 500 bytes (this obviously depends on the data types: if your ResultSet is just a bunch of numbers it will be much smaller, while a bunch of long description strings may be much larger), 1000 rows is 500 KB per user. If you've got 1000 simultaneous users, that's only 500 MB, which probably isn't prohibitive. If you've got 1 million simultaneous users, on the other hand, that's 500 GB, which probably means you're buying a few new servers. And if your rows are 5000 bytes rather than 500, you're talking about 5 GB of RAM, which could be a large fraction of the memory the application server needs to run your application.
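If the only reason for scrollability is iterating the results more than once, one alternative is to keep the forward-only, read-only ResultSet and copy the rows into a plain list once, then iterate that list as many times as you like; the memory cost is then explicit and under your control. A rough sketch, reusing dbCon and mySql from the question and treating every column as a String for brevity:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

// Read once with a forward-only, read-only ResultSet...
List<String[]> rows = new ArrayList<>();
try (PreparedStatement prepStmt = dbCon.prepareStatement(
        mySql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
     ResultSet rs = prepStmt.executeQuery()) {
    int cols = rs.getMetaData().getColumnCount();
    while (rs.next()) {
        String[] row = new String[cols];
        for (int i = 0; i < cols; i++) {
            row[i] = rs.getString(i + 1);   // use typed getters in real code
        }
        rows.add(row);
    }
}

// ...then iterate the list as often as needed; no rs.first() required.
for (String[] row : rows) { /* first pass */ }
for (String[] row : rows) { /* second pass */ }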
I am using Spring JDBC on WebLogic, and I set the fetch size to 500 to fetch data from the DB faster. But this causes memory problems. Here is an example:
http://webmoli.com/2009/02/01/jdbc-performance-tuning-with-optimal-fetch-size/
My question is: how do I free this memory? Running GC does not work; I guess it is because the connection stays alive in the connection pool.
Code:
public List<Msisdn> getNewMsisdnsForBulkSmsId(String bulkSmsId, String scheduleId, final int msisdnCount) throws SQLException {
    JdbcTemplate jdbcTemplate = getJdbcTemplate();
    jdbcTemplate.setFetchSize(500);
    jdbcTemplate.setMaxRows(msisdnCount);
    // Note: the alias after "? as" was lost in the original post; SCHEDULEID is assumed here.
    List<Msisdn> msisdns = jdbcTemplate.query(
            "select BULKSMS_ID, ? as SCHEDULEID, STATUSSELECTDATE, DELIVERYTIME, ID, MESSAGE"
                    + " from ada_msisdn partition (ID_" + bulkSmsId + ")"
                    + " where bulksms_id = ? and status = 0 and ERRORCODE = 0 and SCHEDULEID is null"
                    + " for update skip locked",
            new Object[]{scheduleId, bulkSmsId},
            MsisdnRowMapper.INSTANCE);
    // I also tried closing the connection and running GC; neither frees the memory.
    // jdbcTemplate.getDataSource().getConnection().close();
    // System.gc();
    return msisdns;
}
When I set the fetch size to 10, the heap usage is 12 MB; when I set it to 500, the heap usage is 206 MB.
Thanks
Updates for added sample code, etc:
It sounds like you just need to use a value less than 500, but that makes me think you are returning a lot more data than your result set mapper is actually using.
Now that I see that you're storing all of the mapped results in a List, I would say that the problem seen with the fetch size is likely to be a secondary issue. The combined memory space needed for the List<Msisdn> and a single group of fetched ResultSet rows is pushing you past available memory.
What is the value of msisdnCount? If it's larger than 500, then you are probably using more memory in list than in the ResultSet's 500 records. If it's less than 500, then I would expect that the memory problem also occurs when you set the fetch size to msisdnCount, and the error would go away at some value between min(msisdnCount, 500) and 10.
Loading all of the results into a list and then processing them is a pattern that very often leads to memory exhaustion. The common solution is streaming: if you can process each row as it comes in, rather than storing all of the mapped results in your list, you can avoid the memory problems.
Spring JDBC's RowCallbackHandler (in the core package) gives you exactly that kind of streaming: it hands you each ResultSet row as it is read, so nothing has to be accumulated in a List.
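A minimal sketch, loosely based on the code in the question (the query is trimmed down and the per-row handling is a placeholder):

import java.sql.ResultSet;
import java.sql.SQLException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

JdbcTemplate jdbcTemplate = getJdbcTemplate();
jdbcTemplate.setFetchSize(100);   // keep the per-round-trip buffer modest

jdbcTemplate.query(
        "select ID, MESSAGE from ada_msisdn where bulksms_id = ? and status = 0",
        new Object[]{bulkSmsId},
        new RowCallbackHandler() {
            public void processRow(ResultSet rs) throws SQLException {
                long id = rs.getLong("ID");
                String message = rs.getString("MESSAGE");
                // handle the row here (send, write, etc.); nothing is kept in a List
            }
        });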
--
If the data in the rows you are retrieving is large enough that fetching 500 of them uses up your heap, then you must either return less data per row or fetch fewer rows at a time.
You may find that you are storing the fetched rows somewhere in your code, which means that it's not the ResultSet using up your memory. For example, you might be copying all of the rows to some collection instance.
I would look at the size of the data in each row and try to drop unneeded columns, especially ones with large data types. Then try simply loading the data and iterating through the results, without doing your normal processing (which may be storing the data somewhere), to see how many rows you can load at a time with the memory you have. If you're running out of memory fetching 500 rows, you must be pulling a lot of data over; and if you're not actually using that data, you're wasting CPU and network resources as well as memory.
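A quick way to run that experiment, as a sketch (con and sql stand in for your own connection and query):

// Iterate the results without storing or processing them, just to see how many
// rows you can pull through with the current fetch size and heap.
int count = 0;
try (PreparedStatement ps = con.prepareStatement(sql)) {
    ps.setFetchSize(500);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            count++;             // touch nothing, keep nothing
        }
    }
}
System.out.println("Iterated " + count + " rows without storing any of them");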
edit: You may also want to set the cursor behavior to give the JDBC driver more hints about what it can throw away, for example by preparing your statements with ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY: http://docs.oracle.com/javase/6/docs/api/index.html?java/sql/ResultSet.html
After investigating a bit at work, I noticed that the application I'm working on uses the default fetch size (which is 10 for Oracle, from what I know). The problem is that in the majority of cases users fetch large amounts of data (from a few thousand rows up to hundreds of thousands), and the default of 10 is a huge bottleneck.
So the obvious conclusion here would be to make the fetch size larger. At first I was thinking about setting the default to 100 and bumping it up to 1000 for several queries. But then I read on the net that the default is kept so small to prevent memory issues (i.e. when the JVM heap cannot handle so much data). Should I be worried about that?
I haven't seen this explained in more detail anywhere. Does it mean that a bigger fetch size means more memory overhead while fetching the result set? Or does it just mean that with the default I can fetch 10 records, let them be garbage collected, fetch another 10, and so on (whereas fetching, say, 10,000 at once would result in an OutOfMemoryError)? In that case I wouldn't really care, since I need all the records in memory anyway. If it's the former (a bigger result set means a bigger memory overhead), I guess I should load test it first.
By setting the fetch size too big, you are risking an OutOfMemoryError.
The claim that you need all these records in memory anyway is probably not quite right. More likely you need the entities built from the returned ResultSets. Setting the fetch size to 10,000 means you're holding 10,000 records represented by JDBC classes on the heap at once. Of course, you don't pass those around through your application: you first transform them into your business-logic entities and then hand them to your business-logic executor. That way, the records from the first fetch batch become eligible for GC as soon as JDBC fetches the next batch.
Typically, this transformation is done a small batch at a time, precisely because of the memory pressure mentioned above.
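A rough sketch of that pattern; Entity, mapRow, process, connection and sql are all hypothetical names rather than anything from the question:

final int batchSize = 1_000;                 // transform/process this many at a time
List<Entity> batch = new ArrayList<>(batchSize);

try (PreparedStatement stmt = connection.prepareStatement(
        sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    stmt.setFetchSize(1_000);                // roughly one round trip per batch
    try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
            batch.add(mapRow(rs));           // JDBC row -> business-logic entity
            if (batch.size() == batchSize) {
                process(batch);              // hand off to the business-logic executor
                batch.clear();               // earlier JDBC rows become GC-eligible
            }
        }
    }
}
if (!batch.isEmpty()) {
    process(batch);                          // trailing partial batch
}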
One thing you're absolutely right about, though: you should test performance against well-defined requirements before tweaking.
So the obvious conclusion here would be to make the fetch size larger.
Perhaps an equally obvious conclusion should be: "Let's see if we can cut down on the number of objects that users bring back." When Google returns results, it does so in batches of 25 or 50 sorted by greatest likelihood to be considered useful by you. If your users are bringing back thousands of objects, perhaps you need to think about how to cut down on that. Can the database do more of the work? Are there other operations that could be written to eliminate some of those objects? Could the objects themselves be smarter?
Here is the scenario I am researching a solution for at work. We have a table in Postgres that stores events happening on the network. Currently it works like this: rows are inserted as network events come in, and at the same time older records matching a specific timestamp are deleted in order to keep the table size limited to some 10,000 records, basically the same idea as log rotation. Network events come in bursts of thousands at a time, so the transaction rate is very high and performance degrades; after a while the server either crashes or becomes very slow. On top of that, the customer is asking to keep the table at up to a million records, which will accelerate the performance degradation (since we have to keep deleting records matching a specific timestamp) and cause space-management issues. We are using plain JDBC to read and write the table. Can the tech community suggest a better-performing way to handle the inserts and deletes on this table?
I think I would use partitioned tables, perhaps 10 × the total desired size, inserting into the newest partition and dropping the oldest.
http://www.postgresql.org/docs/9.0/static/ddl-partitioning.html
This makes the load of dropping the oldest data much smaller than querying and deleting it.
Update: I agree with nos' comment, though; the inserts/deletes may not be your bottleneck. Maybe do some investigation first.
Some things you could try -
Write to a log, and have a separate batch process write to the table.
Keep the writes as they are, do the deletes periodically or at times of lower traffic.
Do the writes to a buffer/cache, have the actual db writes happen from the buffer.
A few general suggestions -
Since you're deleting based on timestamp, make sure the timestamp is indexed. You could also do this with a counter / auto-incremented row ID (e.g. delete where id < currentId - 1000000).
Also, JDBC batch write is much faster than individual row writes (order of magnitude speedup, easily). Batch writing 100 rows at a time will help tremendously, if you can buffer the writes.
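A minimal sketch of the batched-insert idea (the Event class and the table/column names are placeholders):

// Buffer incoming events elsewhere, then flush them to the table in JDBC batches.
void flush(Connection con, List<Event> buffer) throws SQLException {
    String sql = "insert into network_events (event_time, payload) values (?, ?)";
    try (PreparedStatement ps = con.prepareStatement(sql)) {
        int inBatch = 0;
        for (Event e : buffer) {
            ps.setTimestamp(1, e.getTimestamp());
            ps.setString(2, e.getPayload());
            ps.addBatch();
            if (++inBatch % 100 == 0) {      // ~100 rows per batch, as suggested above
                ps.executeBatch();
            }
        }
        ps.executeBatch();                   // flush any remainder
    }
}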