I need to get data from several tables, so I used a query with N left outer joins. It seems to me that it may be a waste of performance since I get the cartesian product of lots of data. What is the preferable way to do this in order to achieve better performance? I'm thinking of doing N+1 small queries. Am I on the right track?
I know, this has little to do with JDBC specifics. I want to retrieve data from a single table, and make left outer joins to other N tables. The result set gets very big because I get a cartesian product. For example:
table1data1, table2data1, table3data1
table1data1, table2data2, table3data1
table1data1, table2data1, table3data2
table1data1, table2data2, table3data2
I know that if I make several queries to the database (in my example I get 1 record from table 1, 2 records from table 2 and 2 records from table 3) I'll make a lot of round trips to the database. But I've tested it this way and it looks a lot faster.
This really isn't JDBC specific. Generally speaking, depending on the amount of data being returned, you'll get better performance retrieving everything in a single result set. N+1 queries tend to make for a lot of round trips to the database. Does the result set contain fields you don't need? Can you trim the columns being returned? That would be a first step, if possible.
I think your current approach of getting a lot of data in one trip to the database is the right approach. However, if you find yourself executing the same query many times with different parameters, it is more performant to write it as a stored procedure using bind variables. But I would definitely shy away from breaking your JOIN into several smaller queries.
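To make the trade-off concrete, here is a rough sketch of the row counts involved. It assumes one parent row and child tables that are each joined to the parent independently (not chained to each other), as in the question's example; the class and method names are hypothetical.

```java
final class JoinRowEstimate {

    /** Rows returned when one parent row is LEFT JOINed to several independent child tables. */
    static long joinedRows(long... childRowCounts) {
        long rows = 1;
        for (long count : childRowCounts) {
            // A LEFT JOIN still yields one row when a child table has no match.
            rows *= Math.max(1, count);
        }
        return rows;
    }

    /** Total rows returned by one parent query plus one query per child table. */
    static long separateQueryRows(long... childRowCounts) {
        long rows = 1; // the parent row itself
        for (long count : childRowCounts) {
            rows += count;
        }
        return rows;
    }
}
```

For the example in the question (2 and 2 child rows) the join returns 4 wide rows and the separate queries return 5 narrow ones, so there is little difference. With 10 matching rows in each of three child tables the join returns 1000 rows against 31, which may explain why separate queries tested faster despite the extra round trips.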
Related
I need to get the distinct values from a column (which is not indexed) and the table contains billions of rows.
When I use distinct in the select query, the query times out, because the timeout is set to 3 minutes.
Would it be a good approach to fetch all the data from the table and then use a Set to get the unique values?
Please suggest the best approach here.
Thanks in advance !! :)
It is not a good idea to fetch all rows (especially if there are billions of them), because it is not memory efficient and you'll likely get an OutOfMemoryError.
The best of course is to rearrange unstructured data.
Also, for big sets of data common practice is to use a paging (pagination) mechanism, that allows you to take the data in small chunks, so you'll bypass this timeout issue.
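A minimal sketch of such a paging loop, applied to the distinct-values question, follows. It assumes a hypothetical `fetchPage(lastSeen, limit)` callback that returns up to `limit` distinct values strictly greater than `lastSeen`, in ascending order; against a real database that callback might run something like `SELECT DISTINCT col FROM t WHERE col > ? ORDER BY col LIMIT ?` (the syntax varies by database). The in-memory stand-in exists only to exercise the loop. Note that without an index on the column each page may still require a full scan, so this bounds memory use but is not guaranteed to beat the timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.function.BiFunction;

final class PagedDistinct {

    /** Collects distinct values one page at a time until a short page signals the end. */
    static List<String> collect(BiFunction<String, Integer, List<String>> fetchPage, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<String> result = new ArrayList<>();
        String lastSeen = null; // null means "start from the beginning"
        while (true) {
            List<String> page = fetchPage.apply(lastSeen, pageSize);
            result.addAll(page);
            if (page.size() < pageSize) {
                return result; // fewer rows than requested: nothing left to fetch
            }
            lastSeen = page.get(page.size() - 1);
        }
    }

    /** In-memory stand-in for the database query (hypothetical, for illustration only). */
    static BiFunction<String, Integer, List<String>> inMemory(List<String> column) {
        TreeSet<String> distinct = new TreeSet<>(column);
        return (lastSeen, limit) -> {
            NavigableSet<String> rest =
                    (lastSeen == null) ? distinct : distinct.tailSet(lastSeen, false);
            List<String> page = new ArrayList<>();
            for (String value : rest) {
                if (page.size() == limit) {
                    break;
                }
                page.add(value);
            }
            return page;
        };
    }
}
```

Continuing from the last value seen, rather than using a growing OFFSET, means the database does not have to skip over previously returned rows on every page.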
I am using JDBC with MySQL. Let's assume there is a table in my db called Test, and it has 700k rows. Fetching all the rows takes a huge amount of time. I am using a PreparedStatement, but I want to use multithreading with, say, 10 threads: e.g. the 1st thread will fetch 70k rows, then the 2nd will fetch the next 70k, and so on. How do I implement this?
Forgive me if this is too obvious and you tried it or it won't work in your situation, but caching might be very helpful here.
Regarding actually doing it with multi-threading, it might make sense to have some procedure you run (you might need a new column in your table to do this) that would assign ids that you can query, something like " WHERE id BETWEEN value1 AND value2". Each thread would query a different range. This would be faster than using order by, since this way avoids the need for the database to sort.
If you do want to go the order by route though, consider indexing your database so that that ordering doesn't take extra time.
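The range-per-thread idea can be sketched as below. This is an illustration under assumptions, not a drop-in implementation: `RangePartitioner`, `split` and `fetchInParallel` are hypothetical names, the ids are assumed to be roughly dense between a known minimum and maximum, and `fetchRange` stands in for whatever runs `... WHERE id BETWEEN ? AND ?` for one range (each worker should use its own Connection, since a JDBC connection should not be shared across concurrent queries).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class RangePartitioner {

    /** Splits [minId, maxId] into at most `parts` contiguous, non-overlapping {start, end} ranges. */
    static List<long[]> split(long minId, long maxId, int parts) {
        if (parts <= 0 || maxId < minId) {
            throw new IllegalArgumentException("need parts > 0 and minId <= maxId");
        }
        long total = maxId - minId + 1;
        long base = total / parts;  // minimum number of ids per range
        long extra = total % parts; // the first `extra` ranges take one more id
        List<long[]> ranges = new ArrayList<>();
        long start = minId;
        for (int i = 0; i < parts && start <= maxId; i++) {
            long end = start + base + (i < extra ? 1 : 0) - 1;
            ranges.add(new long[] {start, end});
            start = end + 1;
        }
        return ranges;
    }

    /** Runs fetchRange once per range on a fixed thread pool and concatenates the results. */
    static <T> List<T> fetchInParallel(List<long[]> ranges,
                                       Function<long[], List<T>> fetchRange,
                                       int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (long[] range : ranges) {
                Callable<List<T>> task = () -> fetchRange.apply(range);
                futures.add(pool.submit(task));
            }
            List<T> rows = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                rows.addAll(future.get()); // collected in range order
            }
            return rows;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a range query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With ids 1 to 700000 and 10 parts, `split` yields ranges of 70k ids each, matching the question. Whether 10 threads actually help depends on where the time goes: if the bottleneck is the network or the database's disk, parallel queries may not be faster.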
I have a requirement to run 'n' select queries at fixed time intervals and store the results. These results need to be pulled later upon a client's demand.
My question is:
1) Is it okay to store it as csv files? Or could you suggest another format?
2) Or, should it be stored as clob variable in a db?
Please suggest any compression techniques to store these query results; also, is it possible to store only revisions of previous resultsets instead of storing the whole resultset?
note:
The minimum time interval is hourly.
The number of queries (n) will be varying (currently 10 to 200 queries.)
The resultset size of each query is also varying (say 10 to 1,000,000 but mostly around 10k.)
The resultset data fetched in consecutive time intervals doesn't differ much. (The row values will not be updated frequently.)
I am new to computer science and programming and also not very familiar with storage or db design.
It sounds like you should be building a data warehouse.
Performance-wise I suppose it would be better to have a table whose purpose is to store the query results.
I think you need to store the data in a database. An SQL database would serve you best.
Regarding storing the data at fixed intervals of time, you just need to record the changes to the data set instead of storing the whole data set again and again. I don't know what your requirements are or how much infrastructure you can afford. If you have such huge queries, I recommend working with a distributed system. Use a NoSQL database for better performance.
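As a rough illustration of storing only the changes between two runs of the same query, the sketch below diffs two snapshots. It assumes each row can be reduced to a unique key (for example the primary key) and a non-null string value (for example the serialized row); `SnapshotDelta` and its members are hypothetical names.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SnapshotDelta {
    /** Rows that are new or whose value changed since the previous snapshot. */
    final Map<String, String> upserts = new TreeMap<>();
    /** Keys of rows that disappeared since the previous snapshot. */
    final Set<String> deletes = new TreeSet<>();

    /** Computes what has to be stored to get from `previous` to `current`. */
    static SnapshotDelta between(Map<String, String> previous, Map<String, String> current) {
        SnapshotDelta delta = new SnapshotDelta();
        for (Map.Entry<String, String> row : current.entrySet()) {
            if (!row.getValue().equals(previous.get(row.getKey()))) {
                delta.upserts.put(row.getKey(), row.getValue());
            }
        }
        for (String key : previous.keySet()) {
            if (!current.containsKey(key)) {
                delta.deletes.add(key);
            }
        }
        return delta;
    }

    /** Rebuilds the later snapshot from the earlier one plus the stored delta. */
    static Map<String, String> apply(Map<String, String> previous, SnapshotDelta delta) {
        Map<String, String> result = new TreeMap<>(previous);
        result.keySet().removeAll(delta.deletes);
        result.putAll(delta.upserts);
        return result;
    }
}
```

Since the results reportedly change little between runs, each delta should be small. One design choice to consider is storing a full snapshot periodically (say, once a day) so that answering a client's request never requires replaying a long chain of deltas.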
I have the following configuration:
SQL Server 2008
Java as backend technology - Spring + Hibernate
Basically what I want to do is a select with a where clause on a table. The problem is the table has about 700M entries and the query takes a really long time.
Can you please give me some pointers on where to optimize the query, or what sort of techniques I can use to improve performance?
Thanks.
Using indexes is the standard technique used to deal with this problem. As requested, here are some pointers that should get you started:
http://odetocode.com/articles/70.aspx
http://www.simple-talk.com/sql/learn-sql-server/sql-server-index-basics/
http://www.petri.co.il/introduction-to-sql-server-indexes.htm
The first thing I do in this case is isolate whether it is the amount of data I am returning that is the problem or not (an i/o issue). A simple non-scientific way to do this is change your query to just return the count:
select count(*) --just return a count, no data!
from MyTable
inner join MyOtherTable on ...
where ...
If this runs very quickly, it tells you your indexes are in order (assuming no sub-selects in your WHERE clause). If not, then you need to work on indexes, the WHERE clause, or your query construction itself (JOINs being done, etc).
Once that is satisfactory, add back in your SELECT clause. If it is slow, you are going to have to look at your data access pattern:
Can you return fewer columns?
Can you return fewer rows at once?
Is there caching you can do in the application layer?
Is this query a candidate for partitioned/materialized views (if your database supports those)?
I would run Profiler to find the exact query that is being generated. ORMs can create less than optimal queries. Once you know the query, you can run it in SSMS and see the execution plan. This will give you clues as to where you have performance problems.
Several things that can cause performance problems:
Lack of correct indexing (foreign keys should be indexed if you have joins, as should the criteria in the where clause)
Lack of sargability in the where clause, forcing the query to not use existing indexes
Returning more columns than are needed
Correlated subqueries and scalar functions that cause row-by-agonizing-row operations
Returning too much data (will anybody really be looking at 1 million records returned? You only want to return the amount you show on a page, not the whole possible recordset)
Locking and blocking
There's more (after all, whole very long books are written on this subject), but that should be enough to get you started on where to look.
You should provide indexes for the columns you often use to restrict the result. The other thing is pagination of the result set.
Regardless of the specific DB, I would do the following:
run an explain analyze
make sure you have an index for the columns that are part of your where clause
If indexes are ok, it's very likely that you are fetching a lot of records from disk, which is very slow: if you really cannot refine your query so that you fetch fewer records, consider clustering your table, to improve disk locality of your records.
I currently have hibernate set up in my project. It works well for most things. However today I needed to have a query return a couple hundred thousand rows from a table. It was ~2/3s of the total rows in the table. The problem is the query is taking ~7 minutes. Using straight JDBC and executing what I assumed was an identical query, it takes < 20 seconds. Because of this I assume I am doing something completely wrong. I'll list some code below.
DetachedCriteria criteria = DetachedCriteria.forClass(MyObject.class);
criteria.add(Restrictions.eq("booleanFlag", false));
List<MyObject> list = getHibernateTemplate().findByCriteria(criteria);
Any ideas on why it would be slow and/or what I could do to change it?
You have probably answered your own question already, use straight JDBC.
Hibernate is creating at best an instance of some Object for every row, or worse, multiple Object instances for each row. Hibernate has some really degenerate code generation and instantiation behavior that can be difficult to control, especially with large data sets, and even worse if you have any of the caching options enabled.
Hibernate is not suited for large results sets, and processing hundreds of thousands of rows as objects isn't very performance oriented either.
Raw JDBC is just that: raw types for rows and columns. That is orders of magnitude less data.
I'm not sure Hibernate is the right thing to use if you need to pull hundreds of thousands of records. The query execution time might be under 20 seconds, but the fetch time will be huge and consume a lot of memory. After you get all those records, how do you output them? It's far more data than you could display to a user. Hibernate isn't really a good solution for doing data warehouse style data crunching.
Probably you have several references to other classes in your MyObject class and in your mapping you set eager loading or something like that. It's very hard to find the issue using the code you wrote because it's OK.
Probably it will be better for you to use Hibernate Profiler - http://hibernateprofiler.com/ . It will show you all the problems with your mappings, configurations and queries.