Large SQL dataset query using java

I have the following configuration:
SQL Server 2008
Java as backend technology - Spring + Hibernate
Basically what I want to do is a select with a where clause on a table. The problem is the table has about 700M entries and the query takes a really long time.
Can you please give me some pointers on where to optimize the query, or what sort of techniques I can use to improve performance?
Thanks.

Using indexes is the standard technique used to deal with this problem. As requested, here are some pointers that should get you started:
http://odetocode.com/articles/70.aspx
http://www.simple-talk.com/sql/learn-sql-server/sql-server-index-basics/
http://www.petri.co.il/introduction-to-sql-server-indexes.htm

The first thing I do in this case is isolate whether the amount of data being returned is the problem (an I/O issue). A simple, non-scientific way to do this is to change your query to return just the count:
select count(*) --just return a count, no data!
from MyTable
inner join MyOtherTable on ...
where ...
If this runs very quickly, it tells you your indexes are in order (assuming no sub-selects in your WHERE clause). If not, then you need to work on indexes, the WHERE clause, or your query construction itself (JOINs being done, etc).
Once that is satisfactory, add back in your SELECT clause. If it is slow, you are going to have to look at your data access pattern:
Can you return fewer columns?
Can you return fewer rows at once?
Is there caching you can do in the application layer?
Is this query a candidate for partitioned/materialized views (if your database supports those)?

I would run Profiler to find the exact query that is being generated. ORMs can create less than optimal queries. Once you know the query, you can run it in SSMS and see the execution plan. This will give you clues as to where you have performance problems.
Several things that can cause performance problems:
Lack of correct indexing (foreign keys should be indexed if you have joins, as should the criteria in the where clause)
Lack of sargability in the where clause, forcing the query to not use existing indexes
Returning more columns than are needed
Correlated subqueries and scalar functions that cause row-by-agonizing-row operations
Returning too much data (will anybody really be looking at 1 million records returned? You only want to return the amount you show on a page, not the whole possible recordset)
Locking and blocking
There's more (after all, whole very long books are written on this subject), but that should be enough to get you started on where to look.

You should create indexes on the columns you often use to restrict the result. The other thing is pagination of the result set.

Regardless of the specific DB, I would do the following:
Run an explain analyze.
Make sure you have an index for the columns that are part of your where clause.
If the indexes are OK, it's very likely that you are fetching a lot of records from disk, which is very slow. If you really cannot refine your query so that you fetch fewer records, consider clustering your table to improve the disk locality of your records.

Related

How to improve performance of a simple select query in oracle

I was recently asked this question in an interview:
We have a table employee(id, name). In our Java code, we have logic that fetches data from this table and displays it in the UI. The query is
Select id,name from employee
The problem was that during debugging, we found that the JDBC call that fires the query and gets the output takes, say, 20 seconds, and we want to reduce this to, say, 5 seconds or to the optimal time. How can we do that, or how should I tackle this problem?
As there is no where clause in the query, I didn't suggest indexing the column.
As this logic takes 20 seconds every time, some other code holding a lock on this table is also out of the question.
I suggested that limiting the number of records fetched from the table should help, but the interviewer didn't look convinced.
Is there anything else we can do as developers to optimize the call? I guess a DBA might tune database settings to improve the performance of this query, but is there any other way?

OK, so this is an interview question, so both the problem and the solutions are hypothetical. The interviewer is asking for possible optimizations and / or approaches. Here are some that are most likely to help:
Modify the query to page the data rather than fetching the whole lot. This looks applicable for the example query. Note that this is not just "limiting the number of rows selected from the table" ... which is probably why the interviewer looked doubtful when you said that!
If you do need to display the entire selected record set but in a reduced form (e.g. summed, averaged, sorted, collated etc), do the reduction in the query rather than by fetching the records and doing it in the client.
Tune the fetch size (Statement.setFetchSize()) as suggested by Ivan.
Here are some other ideas that are less likely to help and / or will require extensive reworking.
Look at the network configs. For example, you may be able to get better throughput by tuning TCP buffers at the OS level, or by optimizing physical or virtual network paths.
Run the query on the database server itself (to eliminate network overheads)
Use an in-memory table
Query a secondary database server; e.g. a readonly snapshot or a slave

You can try increasing the fetch size (setFetchSize() on the Statement/PreparedStatement) to decrease the number of network round trips between the application server/desktop and the database server.
You can start several threads that each query a portion of the data, and then merge the data from all of the threads.
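The multi-threaded idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the table has a numeric id whose range is known, and `rangeLoader` is a hypothetical callback that would run something like `select id, name from employee where id between ? and ?` on its own connection. Neither `ParallelRangeFetch` nor `rangeLoader` comes from the question or from any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

final class ParallelRangeFetch {

    /**
     * Splits the id range [minId, maxId] into contiguous chunks, loads each chunk
     * on a worker thread, and concatenates the partial results in chunk order.
     * rangeLoader is a hypothetical callback: (fromInclusive, toInclusive) -> rows.
     */
    static <T> List<T> fetchAll(long minId, long maxId, int threads,
                                BiFunction<Long, Long, List<T>> rangeLoader) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long total = maxId - minId + 1;
            long chunk = (total + threads - 1) / threads; // ceiling division
            List<Future<List<T>>> futures = new ArrayList<>();
            for (long from = minId; from <= maxId; from += chunk) {
                final long lo = from;
                final long hi = Math.min(from + chunk - 1, maxId);
                Callable<List<T>> task = () -> rangeLoader.apply(lo, hi);
                futures.add(pool.submit(task));
            }
            List<T> merged = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                merged.addAll(future.get()); // blocks until that chunk has been loaded
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Chunk load failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this helps depends on where the time goes: if the database or the network link is already saturated, extra threads may only add contention, so it is worth measuring before and after.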

EDIT: doesn't apply to this situation because id and name are the only columns on this table, but still useful for other readers to note.
If you create an index covering both id and name, then the database can use that index to read the data faster, since it won't even have to read the table.
See this link for a more thorough explanation.
if the index contains all the columns you’re requesting it doesn’t even need to look in the table. That concept is known as index coverage.

Sorting functionality Optimization using MySQL and Java

I am going to generate a simple CSV report in Java using Hibernate and MySQL.
I am using Hibernate's native SQL support to fetch the data (the query is too complex for HQL or a Criteria query, though that doesn't matter here) and simply writing it out with one of the CSVWriter APIs (which doesn't matter here either).
So far all is well, but this is where the problem starts.
Requirements:
The report can contain 5000K to 15000K records with 25 fields.
It can be run in real time.
There is one report column (let's say finalValue) that I want to sort on, and it is computed like this: (sum(b.quantity*c.unit_gross_price) - COALESCE(sum(pai.value),0)).
Problem:
MySQL indexing cannot be used for the finalValue column (mentioned above), as it is a complex combination of aggregate functions. So if I execute the query (with or without limit) with sorting, it takes 40sec; otherwise it takes 0.075sec.
The Solutions:
These are some solutions that I can think of, but each has limitations.
Sorting using java.util.TreeSet: It will throw an OutOfMemoryError, which is expected, as the heap space will be exceeded if I put 15000K heavy objects in it.
Using limit in the MySQL query and writing to the file on each iteration: It will take a long time, as every query will take about the same time (around 50sec), because limit can't be used without sorting.
So the main problem here is to overcome two constraints: memory and time. I need to balance both of them.
Any ideas or suggestions?
NOTE: I have not included any code snippets here, but that doesn't mean the question lacks detail. Code is not required here.

I think you can use a streaming ResultSet here, as documented on this page under the ResultSet section.
Here are the main points from the documentation.
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.
There are some caveats with this approach. You must read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.
The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.
If using streaming results, process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.
So, with a streaming result-set, write your order by query, and then start writing the results into your CSV file.
This still probably doesn't solve the sorting issue, but I think if you can't pre-generate that value and put an index on it, the sorting is going to take some time.
However, there might be some server config variables that you can use to optimize the sorting performance.
From the MySQL Order-By optimization page:
I think you can set the read_rnd_buffer_size value. According to the docs:
Setting the variable to a large value can improve ORDER BY performance by a lot
Another one is sort_buffer_size, for which the docs say the following:
If you see many Sort_merge_passes per second in SHOW GLOBAL STATUS output, you can consider increasing the sort_buffer_size value to speed up ORDER BY or GROUP BY operations that cannot be improved with query optimization or improved indexing.
Another variable that can probably help is innodb_buffer_pool_size, which allows InnoDB to keep as much table data in memory as possible and avoid some disk seeks.
However, all of these variables require some tuning: some trial and error, and probably some kind of benchmarking, to get right.
There are some other suggestions on that MySQL Order-By optimization page as well.

Use a temporary table to store your select result with an index on finalValue. This will store and index your intermediate result.
CREATE TEMPORARY TABLE my_temp_table (INDEX my_index_name (finalValue))
SELECT ... -- your select
Note that complex expressions will require an alias in your SELECT to be used as a part of a CREATE TABLE SELECT. I assume that your SELECT has the alias finalValue (the column you mentioned).
Then select from the temporary table, ordered by finalValue (the index will be used).
SELECT * FROM my_temp_table ORDER BY finalValue;
And finally drop the temporary table (or reuse it if you want, but remember that when the client session terminates, temporary data is automatically deleted).

Summary tables. (Let's see more details to be sure this is Data Warehouse type data.) Summary tables are augmented periodically with subtotals and counts. Then when the report is needed, the data is readily available almost directly from the summary table, rather than scanning lots of raw data and doing aggregates.
My blog on Summary Tables. Let's see your schema and report query; we can discuss this in more detail.

Query is slow in the database even with indexed columns in the join conditions: what can we do in code to optimize it?

If a query is still slow in the database even after using indexed columns in the join conditions, what can we do in code to minimize the execution time in Oracle and MySQL?
I am seeing some delay when executing a query in Oracle from the Java layer, even though the query's condition is on an indexed numeric column.
I am using a Java PreparedStatement, and the query is executed from Java.

You are asking us to diagnose something without symptoms. You should provide output of EXPLAIN PLAN (or set autotrace on) and also the schema in question.
There is more to tuning than indexing columns. Without knowing more, and assuming you've done all the optimization you can, it may be time to do pre-calculation with either tables or materialized views.
Other options include solid state disks or parallelism (partitioning and/or parallel query), and so forth.
I'm not sure what you mean by "Java layer"; I find that Java is often a hindrance to performance in Oracle. Stick with PL/SQL for stored procedures and daily jobs, if possible. To a Java programmer, every problem appears to be a Java problem. But Java brings little to the table as far as speeding up queries.

Most performant way of querying database with JDBC?

I need to get data from several tables, so I used a query with N left outer joins. It seems to me that it may be a waste of performance, since I get the cartesian product of lots of data. Which is the preferable way to do this in order to achieve better performance? I'm thinking of doing N+1 little queries. Am I on the right track?
I know, this has little to do with JDBC specifics. I want to retrieve data from a single table, and make left outer joins to other N tables. The result set gets very big because I get a cartesian product. For example:
table1data1, table2data1, table3data1
table1data1, table2data2, table3data1
table1data1, table2data1, table3data2
table1data1, table2data2, table3data2
I know that if I make several queries to the database (in my example I get 1 record for table 1, 2 records for table 2 and 2 records for table 3) I'll make a lot of round trips to the database. But I've tested it this way and it looks a lot faster.
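The row-count arithmetic behind this can be sketched as below. It is a back-of-the-envelope illustration only, assuming each child table is joined to the parent independently of the others; `JoinFanOut` and its methods are hypothetical names, not part of JDBC.

```java
final class JoinFanOut {

    /**
     * Rows returned for ONE parent row when it is left-outer-joined to several
     * independent child tables: the child row counts multiply.
     */
    static long joinedRows(int... childRowCounts) {
        long rows = 1;
        for (int count : childRowCounts) {
            // A left outer join still yields one row when there are no children.
            rows *= Math.max(count, 1);
        }
        return rows;
    }

    /**
     * Rows fetched when the parent row and each child table are queried
     * separately: the child row counts add up.
     */
    static long separateRows(int... childRowCounts) {
        long rows = 1; // the parent row itself
        for (int count : childRowCounts) {
            rows += count;
        }
        return rows;
    }
}
```

For the example above (2 rows in each of two child tables), the join returns 4 rows against 5 rows spread over three queries, so the join is not yet wasteful. With three child tables of 10 rows each, it would be 1000 joined rows against 31, which is where separate queries can start to win despite the extra round trips.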

This really isn't JDBC specific. Generally speaking, depending on the amount of data being returned, you'll get better performance retrieving everything in a single result set. N+1 queries tends to make for a lot of round trips to the database. Does the result set contain fields you don't need? Can you trim the columns being returned? That would be a first step, if possible.

I think your current approach of getting a lot of data in one trip to the database is the right approach. However, if you find yourself executing the same query many times with different parameters, it is more performant to write it as a stored procedure using bind variables. But I would definitely shy away from breaking your JOIN into several smaller queries.

Java coding best-practices for reusing part of a query to count

The implementing-result-paging-in-hibernate-getting-total-number-of-rows question triggered another question for me, about an implementation concern:
Now that you know you have to reuse part of the HQL query to do the count, how do you reuse it efficiently?
The differences between the two HQL queries are:
the selection is count(?), instead of the pojo or property (or list of)
the fetches should not happen, so some tables should not be joined
the order by should disappear
Are there other differences?
Do you have coding best-practices to achieve this reuse efficiently (concerns: effort, clarity, performance)?
Example for a simple HQL query:
select a from A a join fetch a.b b where a.id=66 order by a.name
select count(a.id) from A a where a.id=66
UPDATED
I received answers on:
using Criteria (but we use HQL mostly)
manipulating the String query (but everybody agrees it seems complicated and not very safe)
wrapping the query, relying on database optimization (but there is a feeling that this is not safe)
I was hoping someone would give options along another path, more related to String concatenation.
Could we build both HQL queries using common parts?
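One possible shape for building both queries from common parts is sketched below. `PagedHql` is a hypothetical helper, not a Hibernate API. It assumes the identifier property is named `id`, and that any join the where clause depends on is placed in the from clause rather than among the fetch joins.

```java
final class PagedHql {
    private final String alias;       // entity alias, e.g. "a"
    private final String fromClause;  // shared by both queries, e.g. "from A a"
    private final String fetchJoins;  // data query only, e.g. "join fetch a.b b"
    private final String whereClause; // shared by both queries, e.g. "where a.id=66"
    private final String orderBy;     // data query only, e.g. "order by a.name"

    PagedHql(String alias, String fromClause, String fetchJoins,
             String whereClause, String orderBy) {
        this.alias = alias;
        this.fromClause = fromClause;
        this.fetchJoins = fetchJoins;
        this.whereClause = whereClause;
        this.orderBy = orderBy;
    }

    /** Full query: selection, fetch joins, restrictions and ordering. */
    String dataQuery() {
        return joinParts("select " + alias, fromClause, fetchJoins, whereClause, orderBy);
    }

    /** Count query: same from/where parts, but no fetch joins and no order by. */
    String countQuery() {
        return joinParts("select count(" + alias + ".id)", fromClause, whereClause);
    }

    /** Joins the non-empty parts with single spaces. */
    private static String joinParts(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            if (part == null || part.trim().isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(part.trim());
        }
        return sb.toString();
    }
}
```

With the example above, `new PagedHql("a", "from A a", "join fetch a.b b", "where a.id=66", "order by a.name")` would produce the two queries shown. The parts are still plain strings, so nothing here validates them; queries using DISTINCT or group by would need more care than this sketch provides.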

Have you tried making your intentions clear to Hibernate by setting a projection on your (SQL?)Criteria?
I've mostly been using Criteria, so I'm not sure how applicable this is to your case, but I've been using
getSession().createCriteria(persistentClass).
setProjection(Projections.rowCount()).uniqueResult()
and letting Hibernate figure out the caching / reusing / smart stuff by itself. I'm not really sure how much smart stuff it actually does, though. Anyone care to comment on this?

Well, I'm not sure this is a best practice, but it is my practice :)
If I have a query like:
select A.f1,A.f2,A.f3 from A, B where A.f2=B.f2 order by A.f1, B.f3
And I just want to know how many results I will get, I execute:
select count(*) from ( select A.f1, ... order by A.f1, B.f3 )
And then get the result as an Integer, without mapping results in a POJO.
Parsing your query to remove some parts, like 'order by', is very complicated. A good RDBMS will optimize your query for you.
Good question.

Nice question. Here's what I've done in the past (many things you've mentioned already):
1. Check whether SELECT clause is present.
1.1. If it's not, add select count(*)
1.2. Otherwise check whether it has DISTINCT or aggregate functions in it. If you're using ANTLR to parse your query, it's possible to work around those but it's quite involved. You're likely better off just wrapping the whole thing with select count(*) from ().
2. Remove fetch all properties
3. Remove fetch from joins if you're parsing HQL as string. If you're truly parsing the query with ANTLR you can remove left join entirely; it's rather messy to check all possible references.
4. Remove order by
5. Depending on what you've done in 1.2 you'll need to remove / adjust group by / having.
The above applies to HQL, naturally. For Criteria queries you're quite limited with what you can do because it doesn't lend itself to manipulation easily. If you're using some sort of a wrapper layer on top of Criteria, you will end up with equivalent of (limited) subset of ANTLR parsing results and could apply most of the above in that case.
Since you'd normally hold on to the offset of your current page and the total count, I usually run the actual query with the given limit / offset first, and only run the count(*) query if the number of results returned is greater than or equal to the limit AND the offset is zero (in all other cases I've either run the count(*) before or I've got all the results back anyway). This is an optimistic approach with regard to concurrent modifications, of course.
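That decision can be sketched as a small helper. `PageCount`, `resolveTotal`, `cachedTotal` and `countQuery` are hypothetical names used for illustration only; `countQuery` stands in for whatever actually executes the count(*) query.

```java
import java.util.function.LongSupplier;

final class PageCount {

    /**
     * Returns the total row count for a paged query, running the (hypothetical)
     * countQuery only when the first page came back completely full.
     *
     * offset      - offset used for the page that was just loaded
     * limit       - page size used for that page
     * returned    - number of rows the page query actually returned
     * cachedTotal - total remembered from an earlier page (ignored when offset is 0)
     */
    static long resolveTotal(int offset, int limit, int returned,
                             long cachedTotal, LongSupplier countQuery) {
        if (offset == 0 && returned >= limit) {
            return countQuery.getAsLong(); // there may be more rows than one page
        }
        if (offset == 0) {
            return returned;               // everything fit on the first page
        }
        return cachedTotal;                // the total was established on the first page
    }
}
```

As in the description above, this is optimistic: the cached total is not refreshed if rows are inserted or deleted while the user pages through.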
Update (on hand-assembling HQL)
I don't particularly like that approach. When mapped as named query, HQL has the advantage of build-time error checking (well, run-time technically, because SessionFactory has to be built although that's usually done during integration testing anyway). When generated at runtime it fails at runtime :-) Doing performance optimizations isn't exactly easy either.
Same reasoning applies to Criteria, of course, but it's a bit harder to screw up due to well-defined API as opposed to string concatenation. Building two HQL queries in parallel (paged one and "global count" one) also leads to code duplication (and potentially more bugs) or forces you to write some kind of wrapper layer on top to do it for you. Both ways are far from ideal. And if you need to do this from client code (as in over API), the problem gets even worse.
I've actually pondered quite a bit on this issue. Search API from Hibernate-Generic-DAO seems like a reasonable compromise; there are more details in my answer to the above linked question.

In a freehand HQL situation I would use something like this, but it is not reusable, as it is quite specific to the given entities.
int count = ((Number) session.createQuery("select count(*) from ....").uniqueResult()).intValue(); // count(*) comes back as a Long in Hibernate 3.2+, so go through Number
Do this once and adjust starting number accordingly till you page through.
For criteria though I use a sample like this
final Criteria criteria = session.createCriteria(clazz);
List<Criterion> restrictions = factory.assemble(command.getFilter());
for (Criterion restriction : restrictions)
    criteria.add(restriction);
criteria.add(Restrictions.conjunction());
if (this.projections != null)
    criteria.setProjection(factory.loadProjections(this.projections));
criteria.addOrder(command.getDir().equals("ASC") ? Order.asc(command.getSort()) : Order.desc(command.getSort()));
ScrollableResults scrollable = criteria.scroll(ScrollMode.SCROLL_INSENSITIVE);
if (scrollable.last()) { // returns true if there is a resultset
    genericDTO.setTotalCount(scrollable.getRowNumber() + 1);
    criteria.setFirstResult(command.getStart())
            .setMaxResults(command.getLimit());
    genericDTO.setLineItems(Collections.unmodifiableList(criteria.list()));
}
scrollable.close();
return genericDTO;
But this does the count every time by calling ScrollableResults.last().
