SQL query speed: multiple queries vs sorting within Java - java

Lets say I have a table with 2 columns:
city
name (of a person).
I also have a Java "city" object which contains:
city name
a list of all the people in that city
So now I have two options to get the data:
First use DISTINCT to get a list of all the cities. Then, for each city, query the database again, using WHERE to get only records where the person lives in that city. Then I can store this in a City object.
Get a list of all the data, using ORDER BY to order by the city name. Then loop through all the records and start storing them in City objects. When I detect that the city name changes then I can create a new City object and store the records in that.
Which of these methods is faster / better practice? Or is there some better way of getting this information than these two methods? I am using Oracle database.

A database query is a relatively expensive operation - you need to communicate with another server over the network, it then may need to access its disk, compute a result, return it to you, etc. You'd want to minimize these as much as possible. Having a single query and going over its results is by far a better idea than having multiple queries, unless you have some killer reason not to do so - which doesn't seem to be the case here, at least not from the information you shared.

Sort answer is #2. You wish to make as less queries to the database as possible. #2 if i got it correct you will make a join of city/people and then create the object.
Better way: Use JPA/Hibernate. i.e check http://www.baeldung.com/hibernate-one-to-many

Answer number #2 is optimal, in all cases.
You'll need to code the logic in Java to differentiate when you change from one city to the next one.
Alternatively, if you were using MyBatis the solution becomes very simple by using "collections". These perform a single database call and retrieve the whole Java tree you specify, including all sublists in multiple levels. Very performant and also easy to code.

Related

java object querying ( applying SQL logic on List of java objects)

I have a Employee object for example, Employee contains several fields (300+) like name, department, salary, age, account, etc.
Entire Employee table data cached into java List object , which contains 2+ million records.
Requirement
user can search on any filed presents in Employee object like employee name like Sehwag and age > 30 or salary > 100000, Based on user search we have to show filtered list of Employee list.
due to performance issues we are not querying DB, we want to apply the user search criteria on cached java List object earlier
is there way api / frameworks / any other solution where we can query on java objects?
below approach I am trying but I am feeling not a good approach
Iterating the Employee list and applying condition user search criteria on employee object, to know the user selected search criteria among 300 fields is challenging, written a lot enum mapping logic and some additional logic for every filed to make it work.
with current requirement it may works but thinking to use api or framework or better way to solve the requirement!
thanks in advance for your help.
First of all. If you either don't want or don't have chance to change the existing solution, take a look at querydsl.
There is a querydsl-collection module which fits exactly with your need. See more at http://www.querydsl.com/ and http://www.querydsl.com/static/querydsl/latest/reference/html/ch02s08.html
However, if you have a chance to review/rebuild the solution, you should consider something more appropriate for large volume querying. I suggest you exploring more about nonsql databases (mongodb) or indexing tools such as lucene or elasticsearch which adds a RESTFul layer on top of lucene.
I hope it helps.
tried CQEngine's SQL based queries https://github.com/npgall/cqengine suits to my requirement,
below are some useful links
https://dzone.com/articles/getting-started-cqengine-linq
https://mvnrepository.com/artifact/com.googlecode.cqengine/cqengine

How to store database data with lots of attributes into cache?

Let's say that I have a table with columns TABLE_ID, CUSTOMER_ID, ACCOUNT_NUMBER, PURCHASE_DATE, PRODUCT_CATEGORY, PRODUCT_PRICE.
This table contains all purchases made in some store.
Please don't concentrate on changing the database model (there are obvious improvement possibilities) because this is a made-up example and I can't change the actual database model, which is far from perfect.
The only thing I can change is the code which uses the already existing database model.
Now, I don't want to access the database all the time, so I have to store the data into cache and then read it from there. The problem is, my program has to support all sorts of things:
What is the total value of purchases made by customer X on date Y?
What is the total value of purchases made for products from category X?
Give me a list of total amounts spent grouped by customer_id.
etc.
I have to be able to preserve this hierarchy in my cache.
One possible solution is to have a map inside a map inside a map... etc.
However, that gets messy very quickly, because I need an extra nesting level for every attribute in the table.
Is there a smarter way to do this?
Have you already established that you need a cache? Are you sure the performance of your application requires it? The database itself can optimize queries, have things in memory, etc.
If you're sure you need a cache, you also need to think about cache invalidation: is the data changing from beneath your feet, i.e. is another process changing the data in the database, or is the database data immutable, or is your application the only process modifying your data.
What do you want your cache to do? Just keep track of queries and results that have been requested so the second time a query is run, you can return the result from the cache? Or do you want to aggressively pre calculate some aggregates? Can the cache data fit into your app memory or do you want to use ReferenceMaps for example that shrink when memory gets tight?
For your actual question, why do you need maps inside maps? You probably should design something that's closer to your business model, and store objects that represent the data in a meaningful way. You could have each query (PurchasesByCustomer, PurchasesByCategory) represented as an object and store them in different maps so you get some type safety. Similarly don't use maps for the result but the actual objects you want.
Sorry, your question is quite vague, but hopefully I've given you some food for thoughts.

Google app engine: Poor Performance with JDO + Datastore

I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on the memcache. This should speedup your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
query.execute(myFriendKeys);
You could also rely on the low-level API get() that accept multiple keys, or do like me and use objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, rather optimize for speed and not data size. Use extra structures as indexes, save objects in multiple ways so you can read it as quickly as possible. Data is cheap, CPU time is not.
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
alt text http://img293.imageshack.us/img293/7227/slowquery.png
I'd like to be able to bulk-fetch comments, for example, from all a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements long (if it's an indexed property of the user) as described at: http://code.google.com/appengine/docs/java/datastore/overview.html .
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 Terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either. It can't be used to store a gazillion pieces of small items. "Denomalization" is the key. Such applications do not need to support ad-hoc queries. Compute and store the results directly for the few supported queries.
in your case, you probably have just 1 type of query - return data of this, that and the others that should be displayed on a user page. You can precompute this big ball of mess, so later one query based on userId can fetch it all.
when userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment in it, and save it.
Of course, there are a lot of problems with this approach. For giant internet companies, they probably don't have a choice, generic query engines just don't cut it. But for others? Wouldn't you be happier if you can just use the good old RDBMS?
If it is a frequently used query, you can consider preparing indexes for the same.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.

Java data structure to use with Hibernate to store unknown number of parameters?

Following problem: I want to render a news stream of short messages based on localized texts. In various places of these messages I have to insert parameters to "customize" them. I guess you know what I mean ;)
My question probably falls into the "Which is the best style to do it?" category: How would you store these parameters (they may be Strings and Numbers that need to be formatted according to Locale) in the database? I'm using Hibernate to do the ORM and I can think of the following solutions:
build a combined String and save it as such (ugly and hard to maintain I think)
do some kind of fancy normalization and and make every parameter a single row on the database (clean I guess, but a performance nightmare)
Put the params into an Array, Map or other Java data structure and save it in binary format (probably causes a lot of overhead size-wise)
I tend towards option #3 but I'm afraid that it might be to costly in terms of size in the database. What do you think?
If you can afford the performance hit of using the normalized approach of having a separate table I would go with this approach. We use the same approach as your first suggestion at work, and it gets messy, especially when you reach the column limit and key/values start getting truncated!
Do the normalization.
I would suggest something like:
Table Message
id
Table Params
message_id
key
value
Storing serialized Java objects in the database is quite a bad thing in most cases. As they are hard to maintain and you cannot access them with 'simple' SQL tools.
The performance impact is not as big, as you can fetch all together in a single select using a join.
It depends a bit. Is the number of parameters huge for each entity? If it is not probable second option is the best.
If you don't want to add extra queries caused by the lazy load you can always change fetch type for the variable number of parameters that would only add one join to a query you were always doing. In normal conditions it is not a big price to pay.
Also the third and the first one forbids forever any type of queries over the parameters. A huge technical debt for the future I would not be willing to pay.
directly put it as string and save it ..

When to 'IN' and when not to?

Let's presume that you are writing an application for a retail store chain. So, you would design your object model such that you would define 'Store' as the core business object and lots of supporting objects. Let's say 'Store' looks like follows:
class Store implements Validatable{
int storeNo;
int storeName;
... etc....
}
So, your client tells you that you have to import store schedule from a excel sheet into the application and you would have to run a series of validations on 'em. For instance, 'StoreIsInSameCountry';'StoreIsValid'... etc. So, you would design a Rule interface for checking all business conditions. Something like this:
interface Rule T extends Validatable> {
public Error check(T value) throws Exception;
}
Now, here comes the question. I am uploading 2000 stores from this excel sheet. So, I would end up running each rule defined for a store that many times. If I were to have 4 rules = 8000 queries to the database, i.e, 16000 hits to the connection pool. For a simple check where I would just have to check whether the store exists or not, the query would be:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID = ?
That way I would obtain get my 'Store' object. When I don't get anything from the database, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
Alternatively, I could just do:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID in (1,2,3..... )
This query would actually return much faster than doing the one above it 2000 times.
However, it doesn't go well with the design that a Rule can be run for a single store only.
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, coz it gives better performance in this scenario? Or should I change my design?
What would you do if you were in my shoes, and what is the best practice?
That way I would obtain get my 'Store' object from the database. When I don't get anything from the database, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
This is what you should not do.
Create a temporary table, fill the table with your values and JOIN this table, like this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM temptable tt
JOIN STORE s
ON s.STORE_ID = t.id
or this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM STORE s
WHERE s.STORE_ID IN
(
SELECT id
FROM temptable tt
)
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, coz it gives better performance in this scenario? Or should I change my design?
IN filters duplicates out.
If you want each eligible row to be selected for each duplicate value in the list, use JOIN.
IN is in no way a "not suggested methology".
In fact, there was a time when some databases did not support IN queries effciently, that's why folk wisdom still advices against using it.
But if your store_id is indexed properly (and it most probably is, if it's a PRIMARY KEY which it looks like), then all modern versions of major databases (that is Oracle, SQL Server, MySQL and PostgreSQL) will use an efficient plan to perform this query.
See this article in my blog for performance details in SQL Server:
IN vs. JOIN vs. EXISTS
Note, that in a properly designed database, validation rules are also set-based.
I. e. you implement your validation rules as queries against the temptable.
However, to support legacy rules, you can select values from temptable row-by-agonizing-row, apply the rules, and delete values which did not pass validation.
SELECT store_id FROM store WHERE store_active = 1
or even
SELECT store_id FROM store
will tell you all the active stores in a single query. You can now conduct the other tests on stores you know to exist, and you've saved yourself 1,999 hits to the database.
If you've got relatively uncontested database access, and no time constraint on how long the whole thing is going to take then you've no real need to worry about hitting the connection pool over and over again. That's what it's designed for, after all!
I think it's more of a business question with parameter of how often does the client run the import, how long would it take for you to implement either of the solution, and how expensive is your time per hour.
If it's something that runs once in a while, a bit of bad performance is acceptable in my opinion, especially if you can get the job done quick using clean code.
...a Rule can be run for a single store only.
Managing business rules along with performance is a tricky task, so there is a library ("Persistence Layer") that does exactly that. You define rules, then execute a bulk of commands, then the library fetch from DB whatever the rules require in a single query (by using temp tables rather than 'IN') and then passes it to the rules.
There is an example of a validator in here.

Categories

Resources