We are storing data in serialized form in Oracle tables. Column which holds the serialized data is of Type BLOB. We do have a stored procedure to query the table and return the results as sys_refcursor. The stored procedure accepts list of keys in comma separated form for which the blobs are to be fetched.
Environment - JRE 1.6
Oracle 11g
Problem : While the application requests the data for say 8 or 10 rows, we want the stored procedure to return all the result rows with binary data in a single db round trip. Is this possible? or what is the best way to fetch the results with minimal database round trip.
(We have set the fetchRowsize to 10, so that multiple rows are fetched. We have also Set the setLOBPrefetchSize to sufficient big value (3 Meg) . However these does not seem to make a difference. The resultSet.getBytes() takes a significant time indicating the data is fetched on request)
What am i missing? Do we have any trace to enable/check what is going on underneath?
BLOB trigger intentionally some specific behaviour, since they can be extremely large (depends on block size, but maximum size is at least 8 TB). The drivers do not always know in advance how much data will be transferred, so the driver takes the safe route by default.
I understand that you want it in just one roundtrip to get over, probably because of your network latency? If not, please state.
We've had the same problem with a standard software package we develop and in which the client retrieves on connection setup a lot of data across the globe. We've solved it as follows:
Initially we used a stored procedure that takes a parameter with the hash value of the last retrieved data rows. Stored procedure checks hash of current data rows. If not different, it returns just a header. If different, it puts everything together in one large piece and sends that back. In that way, it avoids transferring lots of information across the globe, because high network latency often comes with low bandwidth. Plus the number of round trips is reduced.
The last two years we've inserted a different product of our company in between (a webservice provider). It does the compression and deduplication by itself and further reduces bandwidth need. Plus it only uses SQL*Net between the webservice and the Oracle RDBMS. The webservice communicates using bson (json in binary format).
With a low labour effort, I would recommend returning just one LOB from the stored procedure with everything in it. Maybe even compressing it or taking use of a client's side cache.
Option 2 described above takes a lot more labour to introduce.
If you happen to use Oracle's thin JDBC driver, you might want to try setting the connection property useFetchSizeWithLongColumn to true, see in http://download.oracle.com/otn_hosted_doc/jdeveloper/905/jdbc-javadoc/index.html?oracle/jdbc/OracleDriver.html. However, as the property name indicates it is meant for LONG columns and it may not have any effect on BLOB columns. (Unfortunaetly, I am not able to try it myself here. However, source code under http://ora-jdbc-source.googlecode.com/svn/trunk/OracleJDBC/src/oracle/jdbc/driver/OracleResultSetImpl.java does not seem to make LONG/LOB distinction with useFetchSizeWithLongColumn, just "re-opens streams".)
For verification/tracing see the example code in: http://steveracanovic.blogspot.de/2009/08/using-usefetchsizewithlongcolumn-and.html
If the connection property does not work for BLOBs, you might try returning LONGs. However, there does not seem to be a nice method of conversion from BLOBs to LONGs. The following discussion (see answers) mentions a way using RAW variables in PL/SQL: http://oracle.mscer.com/q_oracle_41102.html. As you use mentioned using a stored procedure, you might want to try this.
(Note that setting the connection property implies that prefetched LONG data may consume considerable amounts of memory.)
(Again, I'm sorry I cannot try this approach myself and cannot assist with details as I do not have an Oracle DB here. But as nobody came up with an answer during past two hours ...)
Related
I have a web application in which I'm maintaining many static Maps to store my relevant information. Since the application is deployed on a server. Each and every hit to the server side java uses these maps to match the key and get appropriate result and send back to the client side. My code contains a rank and retrieval feature so I have to read the entire keySet of each of these Maps.
My question is:
1. Is working with static variables better than storing this data in a local embedded DB like Apache Derby and then using it?
2. The use of this data is very frequent. So if I use database will that be faster approach? Since I read the full keyset the where clause may not come handy in many operations.
3. How does the server's memory gets impacted on holding data in static variables?
My no. of maps are fixed but the size of the Maps keeps increasing? Please suggest the better solution.
If you want the data to be saved regularly an embedded database like H2 makes sense. You then also have snapshots of the data, and development, structural changes are a bit more safe.
A real database also has an incredible power behind it: concurrency, caching and so on. An embedded (when file based) database less so.
The problem with maps is that the data extraction can become several indirections. It is more versatile to have SQL queries with joins on the tables.
So SQL is more abstract (does not prescribe the actual query implementation), and easier to test. SQL for instance releases the developer of programming reports.
So go for a database IMHO, when you are really doing hard work.
What you might want to consider is to store the data searched in map when it's searched.
For instance, if a user searches for something specific, that something is stored in the map so that the next user who searches for that gets the data directly from the map rather than the database.
There are some downsides though, as you need to make sure that if the data is changed on the database, the hashmap/cache should be cleared or updated with the new data, as to prevent feeding outdated data to the user.
As for the impact on the server's memory, it depends on the size of the data you're storing. It's hard to give you a precise answer, but you can however test that on your own:
long memoryBefore = Runtime.getRuntime().freeMemory();
// populate your map
long memoryAfter = Runtime.getRuntime().freeMemory();
System.out.println(memoryBefore - memoryAfter);
That should give you the amount of bytes used (more or less, depending on the operations you run between memoryBefore and memoryAfter, as you may have instantiated other classes/variables unrelated to the hashmap)
I would like to ask the experts of what is the recommendation for fetching 3000-5000 records from oracle 11g database from Java application (using JDBC). Our standard is to always invoke a stored procedure.
I did some research and found that ref cursor makes multiple round trips to the database based on the JDBC fetch count property. (can somebody throw more light on this of the end to end flow of how data is stored in memory in oracle and JVM when processing ref cursors)
I was thinking collections are more efficient because the data is sent in one shot to the caller (Java) from oracle db (use bulk collect). With this approach we can avoid multiple network calls from Java to Oracle servers. is this a true assumption?
Appreciate your help!
This is a much bigger topic than anyone is willing to commit to in a posting. Here's a link that discusses how Oracle manages read consistency. That entire page is probably a good read to get some of idea of what's going on in the server. There's also an article here that discusses what happens using collections. How would you return the collection to a JDBC Client (not something I've ever tried)?
Essentially, there's a lot involved in performance, from how your database is configured to how your network is tuned to disk performance, to client performance, etc.
The short answer is you need to try things. Retrieving 3-5k records isn't a lot, and it depends on how big the record is, that your bringing back across the network. If they are 20 byte records, and your network (MTU?) size is 4k blocks, you can fit about 200 records in a block. At some point, you run into the law of diminishing returns.
I use stored procedures as a matter of habit, but you don't need to. It would depend on the complexity of the query (number of tables and the type of joins) and the ability for someone like a DBA to be able to go in and see what the query is doing.
Worrying about network trips is a little less critical, because there's only so much data you can stuff in a packet. There's going to be a number of network trips no matter what you use, it really depends on your use case to determine how critical it is to get that to a bare minimum.
I am going to generate simple CSV file report in Java using Hibernate and MySQL.
I am using Native SQL (because query is too complex which is not possible with HQL or Criteria query and also this doesn't matter here) part of Hibernate to fetch the data and simply writing it using any of CSVWriter api (this doesn't matter here.)
As far all is well, but the problem starts now.
Requirements:
The report size can be with 5000K to 15000K records with 25 fields.
It can be run on real time.
There is one report column (let's say finalValue) for which I want sorting and it can be extract like this, (sum(b.quantity*c.unit_gross_price) - COALESCE(sum(pai.value),0)).
Problem:
MySQL Indexing can not be used for finalValue column (mentioned above) as it is complex combination of aggregate functions. So if execute the query (with or without limit) with sorting, it is taking 40sec, otherwise 0.075sec.
The Solutions:
These are the some solutions, that I can think but each have some limitations.
Sorting using java.util.TreeSet : It will throw the OutOfMemoryError, which is obvious as heap space will be exceed if I will put 15000K heavy objects.
Using limit in MySQL query and write file for each iteration : It will take much time as every query will take same time around 50sec as without sorting limit can't be use.
So the main problem here is to overcome two parameters : Memory and Time. I need to balance both of them.
Any ideas, suggestions?
NOTE: I am not given here any snaps of code that doesn't mean question details is not enough. Code doe's not require here.
I think you can use a streaming ResultSet here. As documeted on this page under the ResultSet section.
Here are the main points from the documentation.
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.
There are some caveats with this approach. You must read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.
The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.
If using streaming results, process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.
So, with a streaming result-set, write your order by query, and then start writing the results into your CSV file.
This still probably doesn't solve the sorting issue, but I think if you can't pre-generate that value and put an index on it, the sorting is going to take some time.
However, there might be some server config variables that you can use to optimize the sorting performance.
From the MySQL Order-By optimization page
I think you can set the read_rnd_buffer_size value, which, according to the docs, can:
Setting the variable to a large value can improve ORDER BY performance by a lot
Another one is sort_buffer_size, for which, the docs say the follwing:
If you see many Sort_merge_passes per second in SHOW GLOBAL STATUS output, you can consider increasing the sort_buffer_size value to speed up ORDER BY or GROUP BY operations that cannot be improved with query optimization or improved indexing.
Another variable that can probably help is the innodb_buffer_pool_size. Which allows innodb to keep as much table data in memory as possible and avoid some disk-seeks.
However, all of these variables require some tuning. Some trial-and-error and probably some kind of benchmarking to get right.
There are some other suggestions on that MySQL Order-By optimization page as well.
Use a temporary table to store your select result with an index on finalValue. This will store and index your intermediate result.
CREATE TEMPORARY TABLE my_temp_table (INDEX my_index_name (finalValue))
SELECT ... -- your select
Note that complex expressions will require an alias in your SELECT to be used as a part of a CREATE TABLE SELECT. I assume that your SELECT has the alias finalValue (the column you mentioned).
Then select the temporary table ordered by the finalValue (the index will be used).
SELECT * FROM my_temp_table ORDER BY finalValue;
And finally drop the temporary table (or reuse it if you want, but remember that when client session terminates temporary data is automatically deleted).
Summary tables. (Let's see more details to be sure this is Data Warehouse type data.) Summary tables are augmented periodically with subtotals and counts. Then when the report is needed, the data is readily available almost directly from the summary table, rather than scanning lots of raw data and doing aggregates.
My blog on Summary Tables. Let's see your schema and report query; we can discuss this in more detail.
I am currently using raw JDBC to query records in a MySql database; each record in the subsequent Resultset is ultimately extracted, placed in a domain specific model, and stored to a List Instance.
My query is: in circumstances where there is a requirement to further filter that data (incidentally based on columns that exist in the SAME Table) which of the following approaches would generally be considered best practice:
1.The issuance of further WHERE clause calls into the database. This will effectively offload the filtering process to the database but obviously results in an additional query or queries where multiple filters are applied consecutively.
2.Explicitly filtering the aforementioned preprocessed List at the Application level, thus negating the need to have to make additional calls into the database each time the records are filtered.
3.Some hybrid combination of the above two approaches, perhaps where all filtering operations are initially undertaken by the database server but THEN preprocessed to a application specific model and implicitly cached to a collection for some finite amount of time. Further filter queries, received within this interval, would then be serviced from the data stored in the cache.
It is important to note that the Database Server in this scenario is actually located on
an external machine, therefore the overhead and latency of sending query traffic over the local network also has to be factored into the approach we ultimately elect to take.
I am patently aware of the age-old mantra that stipulates that: "The database server should be used to do what its good at." however in this scenario it just seems like a less than adequate solution to be making numerous calls into the database to filter data that I ALREADY HAVE at the application level.
Your thoughts and insights would be greatly appreciated.
I have used the hybrid approach on many applications with good results.
Database filtering works good especially for columns that are indexed. This reduces network overhead since fewer rows are sent to application.
Database filtering can be really slow for some columns depending upon the quantity of rows in the results and the lack of indexes. The network overhead can be negligible compared to database query time so application filtering may be faster for this situation.
I also find that application filtering in Java easier to write and understand instead of complex SQL.
I usually experiment manually to get the fewest rows in a reasonable time with plain SQL. Then write Java to refine to the desired rows.
i appreciate this question first...as i too faced similar situation few days back...as you already discussed all available options i prefer to go with the second option....i mean handling at application level rather than filtering at DB level.
For a thick-client project I'm working on, I have to remotely connect to a database (IBM i-series) and perfom a number of SQL related tasks:
Download/Update a set of local/offline 'control' data - this data may have changed between runs unnoticed.
On command, download data from multiple (15-20) tables and store separately into a single Java object. The names of the tables are known, but the schema name changes between runs and can change inter-run (as far as I know, PreparedStatements do not allow one to dynamically insert the schema).
I had considered using joins/unions/etc to perform all of these queries as one, but the project requires me to have in-memory separations between table data (instead of one big joined lump).
Perform between 2 and 100+ repetitions of (2)
The last factor is that this needs to be run on high-latency (potentially dial-up) network connections using Java 1.5 on the oldest computers possible.
Currently I run 15-20 dynamically constructed PreparedStatements but I know this to be rather inefficient (I measured, so as to avoid premature optimization ala Knuth).
What would be the most efficient and error-tolerant method of performing these tasks?
My thoughts:
Regarding (1), I really have no idea other than checking the entire table against the new table, at which point I feel I might as well just download the new (potentially and likely unchanged) table and replace the old one, but this takes more time.
For (2): Ideally I'd be able to construct something similar to an array of SELECT statements, send them all at once, and have the database return one ResultSet per internal query. From what I understand, however, neither Statement nor PreparedStatement support returning multiple ResultSet objects.
Lastly, the best way I can think of doing (3) is to batch a number of (2) operations.
There is nothing special about having moving requirements, but the single most important thing to use when talking to most databases is having a connection pool in your Java application and use it properly.
This also applies here. The IBM i DB2/400 database is quite fast, and the database driver available in the jt400 project (type 4, no native code) is quite good, so you can pull over quite a bit of data in a short while simply by generating SQL on the fly.
Note that if you only have a single schema you can tell in the conneciton which one you need, and can then use non-qualified table names in your SQL statements. Read the JDBC properties in the InfoCenter very carefully - it is a bit tricky to get right. If you need multiple schemaes, the "naming=system" allows for library lists - i.e. a list of schemaes to look for the tables, which can be very useful when done correctly. The IBM i folks can help you here.
That said, if the connection is the limiting factor, you might have a very strong case for running the "create object from tables" Java code directly on the IBM i. You should already now prepare for being able to measure the traffic to the database - either with network monitoring tooling, using p6spy or simply going through a proxy (perhaps even a throtteling one)
Ideally, you would have the database group provide you with a set of stored procedures to optimize the access to the database.
Since you don't have access, you may want to ask them if they have timestamp data in the database at the row level to see when records were modified, this way you can select only the data that's changed since some point in time.
What #ThorbjørnRavnAndersen is suggesting is moving the database code on to the IBM host and connecting to it via RMI or JMS from the client. So the server code would be a RMI or JMS Server that accesses the database on your behalf and returns you java objects instead of bringing SQL resultsets across the wire.
I would pass along your requirements to the database team and see if they can't do something for you. I'm sure they don't want all these remote clients bringing all the data down each time, so it would benefit them as much as it would benefit you.