I was recently looking into caching to improve website performance and thought of implementing it in my application.
I have a scenario where I execute a large query to get a list of orders based on order status and customer/supplier. The user can also select other filters at runtime, and the query itself is complex, with a lot of joins amongst tables.
Can we use caching (memcached/Redis) in any way to improve performance here, for example by storing each order in the cache with the order id as key and the order object as value when it gets created, and then writing some logic to fetch the list of orders with the runtime filters applied? Or do I need to run the query itself to fetch the orders?
There are also simple queries, such as getting an order object by orderNumber. Can we cache these to avoid a DB query?
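For the simple lookup case, the kind of cache-aside logic I have in mind looks roughly like this (a sketch only; the Jedis client, the OrderRepository, and the Json helper are assumptions, not existing code):

// Cache-aside lookup: try Redis first, fall back to the DB, then populate the cache.
Jedis jedis = new Jedis("localhost", 6379); // redis.clients.jedis.Jedis

public Order getByOrderNumber(String orderNumber) {
    String key = "order:" + orderNumber;
    String cached = jedis.get(key);                               // 1. try the cache
    if (cached != null) {
        return Json.fromJson(cached, Order.class);                // hypothetical JSON helper
    }
    Order order = orderRepository.findByOrderNumber(orderNumber); // 2. fall back to the DB
    if (order != null) {
        jedis.setex(key, 300, Json.toJson(order));                // 3. cache it for 5 minutes
    }
    return order;
}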
Please let me know of any possible solutions for improving site performance.
Thanks in advance :)
Should we query the table with more filtering, by adding multiple conditions/WHERE clauses to the SQL query, to get only the specific data?
Or should we pull all the data and do the filtering in our Java class?
I'm looking for efficient coding practices.
Example:
A table with columns Id, Name, Place.
I need to pull the list of ids where Place is in placesList and Name is in namesList.
1)
@SqlQuery("SELECT id " +
          "FROM Person p " +
          "WHERE p.name IN (<name_list>) " +
          "AND p.place IN (<place_list>) " +
          "ORDER BY p.id ASC")
public List<Long> getIds(@BindIn("name_list") List<String> name_list,
                         @BindIn("place_list") List<String> place_list);
or
2)
@SqlQuery("SELECT id, name, place FROM Person p")
public List<Person> getAllPersons(); // needs name and place too, so the filters can run in memory
and then apply Java 8 stream filters to the result.
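For reference, the in-memory filtering in option 2 might look like this (a sketch; the Person accessors and the namesList/placesList collections from the question are assumed):

// Filter the full table contents in memory with Java 8 streams.
List<Long> ids = getAllPersons().stream()
        .filter(p -> namesList.contains(p.getName()))
        .filter(p -> placesList.contains(p.getPlace()))
        .map(Person::getId)
        .sorted()
        .collect(Collectors.toList()); // java.util.stream.Collectors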
Note: I used Name and Place above for easy understanding. In reality the data is huge, the table has many fields and rows, and the lists used for filtering are also large.
The best approach is to query with the required filters on the database. This reduces the amount of data fetched back and forth between the application and the DB, and it also reduces time spent on I/O operations (transferring a large amount of data over the network involves noticeable latency).
It also reduces the memory overhead of processing a large amount of data on the application side.
Also, when you are querying and filtering on multiple fields, you can add indexes (if necessary) on those fields; this will improve query fetch time.
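For the Person example above, a composite index covering both filter columns might look like this (exact syntax may vary by vendor):

CREATE INDEX idx_person_name_place ON Person (name, place);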
Hope this answers your question.
You always want to perform things in the database if possible. You want to avoid transferring data from the database to your application, using up memory, just to remove it there.
Databases are very efficient at doing those things, so you'll want to use them to their full extent.
Query the database directly instead of downloading the data into your Java application; this avoids the latency of shipping unneeded rows from the database to your Java application.
But be very careful when using user input in the filters. Make sure you have sanitized the user input, or better, bound it through parameterized queries, before using it in a query to the database, to avoid SQL injection.
If you are worried about security more than performance, then filter the data in the Java app (provided the data is not massive in size).
But I strongly recommend filtering the data on the database itself, with the necessary safeguards in place.
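For example, a parameterized query keeps user input out of the SQL string entirely. A plain JDBC sketch (the connection handling and the ids list are assumed):

String sql = "SELECT id FROM Person WHERE name = ? AND place = ?";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setString(1, userSuppliedName);   // bound as data, never interpreted as SQL
    ps.setString(2, userSuppliedPlace);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            ids.add(rs.getLong("id"));
        }
    }
}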
In my current project we have multiple search pages where we fetch a lot of data from the database to show in a large table element in the UI. We're using JPA for data access (our provider is Hibernate). The data for most of the pages is gathered from multiple database tables (around 10 in many cases), including some aggregate data from OneToMany relationships (e.g. "number of associated entities of type X"). In order to improve performance, we're using result set pagination with TypedQuery.setFirstResult() and TypedQuery.setMaxResults() to lazy-load additional rows from the database as the user scrolls the table. As the searches are very dynamic, we're using the JPA CriteriaQuery API to build the queries.

However, we're currently suffering from the N+1 SELECT problem. It's pretty bad in some cases, as we might be iterating through 3 levels of nested OneToMany relationships, where on each level the data is lazy-loaded. We can't really declare those collections as eagerly loaded in the entity mappings, as we're only interested in them on some of our pages. That is, we might fetch data from the same table on several different pages, but we're showing different data from the table and from different associated tables on each page.
In order to alleviate this, we started experimenting with JPA entity graphs, and they seem to help a lot with the N+1 SELECT problem. However, when you use entity graphs, Hibernate apparently applies the pagination in-memory. I can somewhat understand why it does that, but this behavior negates a lot (if not all) of the benefits of the entity graphs in many cases. When we didn't use entity graphs, we could load data without applying any WHERE restrictions (i.e. considering the whole table as the result set), no matter how many millions of rows the table had, as only a very limited amount of rows were actually fetched due to the pagination. Now that the pagination is done in-memory, Hibernate basically fetches the whole database table (plus all relationships defined in the entity graph), and then applies the pagination in-memory, throwing the rest of the rows away. Not good.
So the question is, is there an efficient way to apply both pagination and entity graphs with JPA (Hibernate)? If JPA does not offer a solution to this, Hibernate-specific extensions are also acceptable. If that's not possible either, what are the other alternatives? Using database Views? Views would be a bit cumbersome, as we support several database vendors. Creating all of the necessary views for different vendors would increase development effort quite a bit.
Another idea I've had would be to apply both the entity graphs and pagination as we currently do, and simply not trigger any queries if they would return too many rows. I already need to do COUNT queries to get the lazy-loading of rows to work properly in the UI.
I'm not sure I fully understand your problem, but we faced something similar: we have paged lists of entities that may contain data from multiple joined entities. Those lists might be sorted and filtered (some of those sorts/filters have to be applied in memory due to missing capabilities in the DBMS, but that's just a side note), and the paging should be applied afterwards.
Keeping all that data in memory doesn't work well so we took the following approach (there might be better/more standard ones):
Use a query to load the primary keys (simple longs in our case) of the main entities. Join only what is needed for sorting and filtering to make the query as simple as possible.
In our case the query would actually load more data to apply sorts and filters in memory where necessary but that data is released asap and only the primary keys are kept.
When displaying a specific page we extract the corresponding primary keys for a page and use a second query to load everything that is to be displayed on that page. This second query might contain more joins and thus be more complex and slower than the one in step 1 but since we only load data for that page the actual burden on the system is quite low.
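A sketch of that two-step approach with JPA (the Invoice entity, the named entity graph, and the page variables are illustrative):

// Step 1: page over primary keys only; no fetch joins, so pagination stays in SQL.
TypedQuery<Long> idQuery = em.createQuery(
        "SELECT i.id FROM Invoice i ORDER BY i.id", Long.class);
idQuery.setFirstResult(page * pageSize);
idQuery.setMaxResults(pageSize);
List<Long> pageIds = idQuery.getResultList();

// Step 2: load the full data for just those ids, applying the entity graph.
TypedQuery<Invoice> dataQuery = em.createQuery(
        "SELECT DISTINCT i FROM Invoice i WHERE i.id IN :ids ORDER BY i.id", Invoice.class);
dataQuery.setParameter("ids", pageIds);
dataQuery.setHint("javax.persistence.loadgraph", em.getEntityGraph("Invoice.withDetails"));
List<Invoice> pageData = dataQuery.getResultList();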
I have a table from which I extract 8 columns, said columns will be properties of a pojo, say MyPojo.
I want to remove duplicates.
I came up with three strategies.
1) Let Oracle take care of this with the DISTINCT keyword:
SELECT DISTINCT c1, c2, ..., c8 FROM TABLE WHERE ...
2) Do this in Java with cqengine (https://code.google.com/p/cqengine/wiki/DeduplicationStrategies#Logical_Elimination_Strategy):
// requires: import static com.googlecode.cqengine.query.QueryFactory.*;
DeduplicationOption deduplication = deduplicate(DeduplicationStrategy.LOGICAL_ELIMINATION);
ResultSet<Car> results = cars.retrieve(query, queryOptions(deduplication));
3) Do this in Java with a Set, simply storing the rows in a Set<MyPojo>.
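Note that option 3 only removes duplicates if MyPojo defines value equality, i.e. overrides equals() and hashCode() over the extracted columns (c1..c8 stand in for your eight fields):

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof MyPojo)) return false;
    MyPojo other = (MyPojo) o;
    return Objects.equals(c1, other.c1) && Objects.equals(c2, other.c2); // ... and so on for all 8 columns
}

@Override
public int hashCode() {
    return Objects.hash(c1, c2 /* , ... all 8 columns */); // java.util.Objects
}

// then:
Set<MyPojo> unique = new LinkedHashSet<>(rows); // keeps the first occurrence, preserves order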
From a performance point of view, which one is better?
Let the database do the work. That way you don't send unnecessary data over the network, which will probably have the biggest positive impact on performance.
It is also the most compact solution in terms of code size.
The best way to decide these things is to model it.
What are the access patterns in your application?
If this would be a one-off request: have the database do the filtering.
If you expect to get many such identical requests: have the database do the filtering, and consider caching results in the application.
If you expect to get a variety of queries on the same dataset, consider caching the unfiltered dataset into the application tier, and querying it with CQEngine.
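To illustrate that last case, a sketch of caching the dataset in the application tier with CQEngine (2.x API; the Car attribute and the loader method are assumptions):

// One-time load into an in-memory, queryable collection.
IndexedCollection<Car> cars = new ConcurrentIndexedCollection<>();
cars.addAll(loadAllCarsFromDatabase()); // hypothetical DB loader

// Attribute definition plus an index on it.
public static final Attribute<Car, String> COLOR =
        new SimpleAttribute<Car, String>("color") {
            public String getValue(Car car, QueryOptions queryOptions) { return car.getColor(); }
        };
cars.addIndex(HashIndex.onAttribute(COLOR));

// Subsequent requests are served from memory, without touching the database.
ResultSet<Car> blueCars = cars.retrieve(equal(COLOR, "blue")); // QueryFactory.equal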
There is no rule of thumb such as "always have the database do the work". If your application operates at any kind of scale, you will not want every request to hit the database. You need to scale out your application tier.
On the other hand, you should not over-engineer. The answer depends on the traffic volume and data access patterns that you expect.
I am currently using raw JDBC to query records in a MySQL database; each record in the resulting ResultSet is extracted, mapped into a domain-specific model, and stored in a List instance.
My question is: in circumstances where there is a requirement to further filter that data (based on columns that exist in the SAME table), which of the following approaches would generally be considered best practice?
1. Issuing further WHERE-clause queries to the database. This effectively offloads the filtering to the database, but obviously results in an additional query (or queries) each time filters are applied consecutively.
2. Explicitly filtering the aforementioned pre-fetched List at the application level, removing the need for additional calls into the database each time the records are filtered.
3. Some hybrid of the two, where all filtering operations are initially undertaken by the database server, but the results are THEN mapped to an application-specific model and implicitly cached in a collection for some finite amount of time. Further filter queries received within this interval would then be serviced from the data in the cache.
It is important to note that the database server in this scenario is located on an external machine, so the overhead and latency of sending query traffic over the local network also has to be factored into the approach we ultimately elect to take.
I am patently aware of the age-old mantra that stipulates that "the database server should be used to do what it's good at." However, in this scenario it just seems like a less than adequate solution to be making numerous calls into the database to filter data that I ALREADY HAVE at the application level.
Your thoughts and insights would be greatly appreciated.
I have used the hybrid approach on many applications with good results.
Database filtering works well, especially on columns that are indexed. It reduces network overhead, since fewer rows are sent to the application.
Database filtering can be really slow for some columns, though, depending on the number of rows in the results and the lack of indexes. The network overhead can be negligible compared to the database query time, so application filtering may be faster in that situation.
I also find application filtering in Java easier to write and understand than complex SQL.
I usually experiment manually to get the fewest rows in a reasonable time with plain SQL, then write Java to refine down to the desired rows.
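A sketch of that split (names are illustrative): let the database apply the selective, indexed filter, then refine the already-small result in Java.

List<Order> candidates = fetchOrdersByStatus("OPEN");       // SQL: WHERE status = ? on an indexed column
List<Order> result = candidates.stream()
        .filter(o -> o.getTotal().compareTo(threshold) > 0) // cheap in-memory refinement
        .filter(o -> matchesComplexBusinessRule(o))         // logic awkward to express in SQL
        .collect(Collectors.toList());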
I appreciate this question, as I faced a similar situation a few days back. Since you have already discussed all the available options, I would prefer the second option: handling it at the application level rather than filtering at the DB level.
I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List<Key> myFriendKeys = fetchFriendKeys(); // datastore Key identifiers assumed
Query query = pm.newQuery(User.class, ":p.contains(key)");
List<User> friends = (List<User>) query.execute(myFriendKeys); // one trip to the datastore
You could also rely on the low-level API get() that accepts multiple keys, or do like me and use Objectify.
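The low-level batch get looks roughly like this (a sketch, reusing the key list from above):

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Map<Key, Entity> friends = ds.get(myFriendKeys); // one round trip for all keys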
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
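In JDO terms, that single query might look like this (assuming User maintains a List<Key> friendOf property that is updated whenever friendships change):

Query q = pm.newQuery(User.class, "friendOf == :userKey");
List<User> friends = (List<User>) q.execute(theUserKey); // matches if any element of the list equals the key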
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, optimize for speed rather than data size. Use extra structures as indexes, and save objects in multiple forms so you can read them as quickly as possible. Data is cheap, CPU time is not.
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
(trace screenshot: http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all of a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements (if it's an indexed property of the user), as described at http://code.google.com/appengine/docs/java/datastore/overview.html.
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either, and it can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries; compute and store the results directly for the few supported queries.
In your case, you probably have just one type of query: return the data of this, that, and the other that should be displayed on a user page. You can precompute this big ball of mess, so that later one query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment into it, and save it.
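A low-level sketch of that read-modify-write (the "UserWall" kind and its properties are hypothetical):

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Key wallKey = KeyFactory.createKey("UserWall", userBId);
try {
    Entity wall = ds.get(wallKey);
    List<String> comments = (List<String>) wall.getProperty("comments");
    if (comments == null) comments = new ArrayList<String>();
    comments.add(userAId + ": " + commentText);
    wall.setProperty("comments", comments);
    ds.put(wall); // one read and one write instead of a fan-out of queries
} catch (EntityNotFoundException e) {
    // first comment for this user: create the wall entity here
}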
Of course, there are a lot of problems with this approach. For giant internet companies, they probably don't have a choice, generic query engines just don't cut it. But for others? Wouldn't you be happier if you can just use the good old RDBMS?
If it is a frequently used query, you can consider preparing indexes for it.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically, just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.
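A low-level sketch of that split (the kind and property names are hypothetical):

int chunkSize = 5000; // stay under the per-entity indexed-property limit
for (int i = 0; i < allFriendKeys.size(); i += chunkSize) {
    Entity chunk = new Entity("UserFriends", userKey); // child entity of the User
    chunk.setProperty("friendKeys", new ArrayList<Key>(
            allFriendKeys.subList(i, Math.min(i + chunkSize, allFriendKeys.size()))));
    ds.put(chunk);
}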