Problem
I'm trying to get the most frequent item in a collection that belongs to a table. So for instance, if I have Table 'Library' and Table 'Book' and 'Library' has a collection of 'Book's, I would like to retrieve ALL books from ALL Libraries. From this result, I would like the most frequent book. The problem is I need this in one query, if possible. It would also be OK if I just got a list of ALL the books, but sorted by occurrence.
What I've Tried
SELECT l.books, COUNT(l.books) AS occur FROM Library l
SELECT b FROM Library l, l.books b ORDER BY b.name
The second one sadly does not order by ALL books, it sorts each collection on its own.
If more information is needed I can provide it, of course.
I hope somebody can help me :(
Your first query is pretty close; it is just missing a GROUP BY. The ANSI standard version of the query is:
SELECT l.books, COUNT(l.books) AS occur
FROM Library l
GROUP BY l.books
ORDER BY occur DESC
FETCH FIRST 1 ROW ONLY;
Some databases use TOP or LIMIT to fetch only one row.
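Since your attempts look like JPQL, a rough JPA-side equivalent might be the following sketch (it assumes an EntityManager em and a Book entity mapped through Library's books collection; setMaxResults plays the role of FETCH FIRST):
import java.util.List;
import javax.persistence.EntityManager;

// Groups all books across all libraries and sorts by occurrence.
// Drop setMaxResults(1) to get the full list sorted by frequency.
List<Object[]> rows = em.createQuery(
        "SELECT b, COUNT(b) FROM Library l JOIN l.books b " +
        "GROUP BY b ORDER BY COUNT(b) DESC", Object[].class)
    .setMaxResults(1)
    .getResultList();
Object mostFrequentBook = rows.isEmpty() ? null : rows.get(0)[0];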
Related
I'm dealing with up to a billion records in Oracle and I really need efficiency.
The first table is notification. I need to obtain the following data.
src_data_id | match_data_id
The second table is person_info. id is the same as src_data_id and match_data_id from the notification table.
id | name
The third table is sample_info, in which self_object_id is the foreign key for person_info.
id | self_object_id
The fourth table is sample_dna_gene, where sample_id is the same as id in sample_info.
sample_id | gene_info
I am writing a program in Java and I want to encapsulate a list of objects. Each object contains the name (from person_info) and gene_info (from sample_dna_gene).
Originally, I did it in 2 steps. I joined notification and person_info to obtain the ids. Then I joined person_info, sample_info and sample_dna_gene to obtain the names and their corresponding gene_info.
This would be fine for a smaller database, but dealing with up to a billion records, I need to worry about speed. I should not join the three tables like I did, but use simple SQLs for each table, and join the pieces in Java instead.
It was easy to get the ids from person_info with separate SQLs, but I'm having trouble obtaining their corresponding gene_info. I can get sample_info.id with a simple SQL using in(id1,id2,id3...). I can then find the gene_info with another simple SQL using in(id1,id2,id3...).
I can obtain all these lists in Java, but how do I put them together? I'm using Spring and MyBatis. Originally I could make one big messy SQL and encapsulate all the elements in the mapper. I'm not sure what to do now.
Edit: The messy SQL I have right now is
select to_char(sdg.gene_info), max(aa.pid), max(aa.sid), max(aa.id_card_no)
from (select max(pi.person_name) person_name,
             max(pi.id) pid,
             si.id sid,
             max(pi.id_card_no) id_card_no,
             max(pi.race) race
      from person_info pi
      join sample_info si
        on pi.id = si.self_object_id
      group by si.id) aa
join sample_dna_gene sdg
  on sdg.sample_id = aa.sid
where aa.pid in ('...')
group by to_char(sdg.gene_info)
It's a little more complicated than the original question. I need to group by id in sample_info first, then group by gene_info in sample_dna_gene. I had to use a lot of max() so the group by would work, and even then, I still could not get the gene_info group by to work properly. I'm not sure how inefficient the max() calls are and how much they will slow down the query, but you can clearly see why I wanted to avoid such a messy SQL now.
I had a similar case. It was dealt with using 4 separate readers, one for each table, and the merging was done on the Java side. Unfortunately, a prerequisite for that was sorting the incoming streams on the database side.
You read a single record from stream one, then you read records from stream 2 until the key changes (since you sorted by that key and the key is common to all tables), then the same for the following streams. In my case that made sense, as the first table was very wide and the next 3 had many rows for a single key in table 1. If in your case there are no 1:n relations (where n is big), I don't see why such an approach would be better than a join. A sketch of that merge step is below.
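To make that concrete, here is a minimal sketch of the merge for two of the streams. The row classes are hypothetical, and both lists must already arrive sorted by the shared key (an ORDER BY on the database side):
import java.util.ArrayList;
import java.util.List;

class SortedStreamMerge {
    // Hypothetical row types; PersonRow.id and GeneRow.ownerId are the shared key.
    record PersonRow(long id, String name) {}
    record GeneRow(long ownerId, String geneInfo) {}
    record NameGene(String name, String geneInfo) {}

    static List<NameGene> mergeSorted(List<PersonRow> people, List<GeneRow> genes) {
        List<NameGene> out = new ArrayList<>();
        int g = 0;
        for (PersonRow p : people) {
            // skip gene rows whose key is smaller than the current person's key
            while (g < genes.size() && genes.get(g).ownerId() < p.id()) g++;
            // consume every gene row that shares the current person's key
            while (g < genes.size() && genes.get(g).ownerId() == p.id()) {
                out.add(new NameGene(p.name(), genes.get(g).geneInfo()));
                g++;
            }
        }
        return out;
    }
}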
I'm not a pro in SQL at all :)
Having a very critical performance issue.
Here is the info directly related to problem.
I have 2 tables in my DB: table condos and table items.
table condos has the fields:
id (PK)
name
city
country
table items:
id (PK)
name
multiple fields not related to issue
condo_id (FK)
I have 1000+ entities in condos table and 1000+ in items table.
The problem is how I perform the items search. Currently it is:
For example, I want to get all the items for city = Sydney:
Perform a SELECT condos.id FROM public.condos WHERE city = 'Sydney'
Make a SELECT * FROM public.items WHERE items.condo_id = ? for each id I get in step 1.
The issue is that once I have 1000+ entities in the condos table, the second request is performed 1000+ times, once for each condo_id belonging to 'Sydney'. The execution of this request takes more than 2 minutes, which is a critical performance issue.
So, the question is:
What is the best way for me to perform such a search? Should I put the 1000+ ids in a single WHERE clause? Or?
For additional info, I use PostgreSQL 9.4 and Spring MVC.
Use a table join to perform the query so that you do not need an additional query per condo. In your case you can join condos and items on the condo id, which is something like:
SELECT i.*
FROM public.items i JOIN public.condos c ON i.condo_id = c.id
WHERE c.city = 'Sydney'
Note that performance tuning is a broad topic. It varies from environment to environment, depending on how you structure the data in the tables and how you organize the data access in your code.
Here are some other suggestions that may also help:
Try to add an index on the fields you use for searching and joining, e.g. city in condos and condo_id in items. There is a good answer explaining how indexing works.
I also recommend you run EXPLAIN to see the query plan for your query and check whether there is a full table scan causing the performance issue.
Hope this can help.
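Since the question mentions Spring MVC, here is a sketch of running that join through JdbcTemplate (the jdbcTemplate bean and the Item row class are assumptions, not from the question):
import java.util.List;
import org.springframework.jdbc.core.BeanPropertyRowMapper;
import org.springframework.jdbc.core.JdbcTemplate;

// "Item" is a hypothetical POJO whose properties match the items columns.
List<Item> itemsInSydney = jdbcTemplate.query(
        "SELECT i.* FROM public.items i " +
        "JOIN public.condos c ON i.condo_id = c.id " +
        "WHERE c.city = ?",
        new BeanPropertyRowMapper<>(Item.class),
        "Sydney");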
Essentially what you need is to eliminate the N+1 queries, and at the same time ensure that your city field is indexed. You have 3 mechanisms to choose from. One is already stated in one of the other answers you have received: the SUBSELECT approach. Beyond this approach you have another two.
You can use what you have stated:
SELECT condos.id FROM public.condos WHERE city = 'Sydney'
SELECT *
FROM public.items
WHERE items.condo_id IN (up to 1000 ids here)
The reason why I am stating up to 1000 is that some SQL providers have limitations on the size of an IN list; a sketch of chunking around this follows below.
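For illustration, a sketch of chunking the id list to stay under such a cap, assuming the Spring setup from the question (JdbcTemplate; the Item row class is hypothetical):
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.jdbc.core.BeanPropertyRowMapper;
import org.springframework.jdbc.core.JdbcTemplate;

// Splits the ids into chunks of at most 1000 and runs one IN query per chunk.
static List<Item> findItemsByCondoIds(JdbcTemplate jdbc, List<Long> condoIds) {
    List<Item> items = new ArrayList<>();
    for (int from = 0; from < condoIds.size(); from += 1000) {
        List<Long> chunk = condoIds.subList(from, Math.min(from + 1000, condoIds.size()));
        String placeholders = chunk.stream().map(id -> "?").collect(Collectors.joining(","));
        items.addAll(jdbc.query(
                "SELECT * FROM public.items WHERE condo_id IN (" + placeholders + ")",
                new BeanPropertyRowMapper<>(Item.class),
                chunk.toArray()));
    }
    return items;
}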
You can also do a join as a way to eliminate the N+1 selects:
SELECT *
FROM public.items JOIN public.condos ON items.condo_id = condos.id AND condos.city = 'Sydney'
Now, what is the difference between the 3 approaches?
Pros of the subselect query: you get everything at once.
Cons: if you have too many elements, the performance may suffer.
Pros of the simple IN clause: it effectively solves the N+1 problem.
Cons: it may lead to some extra queries compared to the subselect.
Pros of the joined query: you can initialize both Condo and Item in one go.
Cons: it leads to some data duplication on the Condo side.
If we have a look into a framework like Hibernate, we find that in most cases the fetch strategy used is either the JOIN or the IN strategy. Subselect is used rarely.
Also, if you have critical performance requirements, you may consider reading everything into memory and serving it from there. Judging from the content of these two tables, it should be fairly easy to just load them into a Map.
Effectively, anything that eliminates your N+1 query problem is a solution in your case, if we are talking about just 2 tables of 1000+ rows. All three options are solutions.
You could use the first query as a subquery in an IN operator in the second query:
SELECT *
FROM public.items
WHERE items.condo_id IN (SELECT condos.id
                         FROM public.condos
                         WHERE city = 'Sydney')
I look up a bunch of model ids:
List<Long> ids = lookupIds(searchCriteria);
And then I run a query to order them:
fooModelList = (List<FooModel>) query.execute(ids);
The log shows the GQL this is compiled to:
Compiling "SELECT FROM com.foo.FooModel WHERE
:p.contains(id) ORDER BY createdDateTime desc RANGE 0,10"
When the ids ArrayList is small this works fine.
But over a certain size (40 maybe?) I get this error:
IllegalArgumentException: Splitting the provided query requires
that too many subqueries are merged in memory.
Is there a way to work around this or is this a fixed limit in GAE?
This is a fixed limit. If you're looking up entities by ID, though, you shouldn't be doing queries in the first place - you should be doing fetches by key. If you're querying by a foreign key, you'll need to do separate queries yourself if you want to go over the limit of 40 - but you should probably reconsider your design, since this is extremely inefficient.
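For example, with the low-level datastore API, a batch fetch by key might look like the following sketch (it assumes the ids are numeric FooModel entity IDs):
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

// One batch round trip, no query and no subquery merging.
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
List<Key> keys = new ArrayList<>();
for (Long id : ids) {
    keys.add(KeyFactory.createKey("FooModel", id));
}
Map<Key, Entity> byKey = ds.get(keys);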
I could not verify this using the GAE documentation, so my answer might not be complete. Yet I found that "ORDER BY createdDateTime desc" sets this limit, which is 30 by the way. My hypothesis is that if GAE doesn't need to sort, it does not need to process the query in memory.
If you do need to 'sort it', do this (which is the way to go with time-based stuff in GAE anyway):
Add a field 'week' or 'month' or something to the entity, containing an integer that uniquely identifies the week/month (so you need something other than 0..52 or 0..11, as the values also need to be unique across years). Then you make your query and state that you are only interested in those of this week, and maybe also last week (or month). So if we are in week 4353, your query has something like ":week IN [4353, 4352]". That should give you a relatively small result set. Then filter out the posts that are too old, and sort the rest in memory. A sketch follows.
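A sketch of that idea in JDO, assuming FooModel gains an indexed long field week holding weeks since the Unix epoch (unique across years), plus a hypothetical getCreatedDateTime accessor:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import javax.jdo.Query;

// Weeks since the epoch: unique across years, unlike 0..52.
long week = System.currentTimeMillis() / (7L * 24 * 60 * 60 * 1000);

Query q = pm.newQuery(FooModel.class, ":weeks.contains(week)");
@SuppressWarnings("unchecked")
List<FooModel> recent =
        new ArrayList<>((List<FooModel>) q.execute(Arrays.asList(week, week - 1)));
// The week bucket keeps the result set small enough to sort in memory.
recent.sort(Comparator.comparing(FooModel::getCreatedDateTime).reversed());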
I have an application in which there are Courses, Topics, and Tags. Each Topic can be in many Courses and have many Tags. I want to look up every Topic that has a specific Tag x and is in specific Course y.
Naively, I give each Topic a list of Course ids and Tag ids, so I can select * from Topic where tagIds = x && courseIds = y. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 I'm well over the 5000-entry limit.
I could just select * from Topic where tagIds = x, and then use a for loop to go through the result, choosing only Topics whose courseIds.contain(y). This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.
I could select __KEY__ from Topic where tagIds = x AND select __KEY__ from Topic where courseIds = y and find the intersection in my application code. If the sets are small this might not be unreasonable.
I could make a sort of join table, TopicTagLookup with a tagId and courseId field. The parent key of these entities would point to the relevant Topic. Then I would need to make one of these TopicTagLookup entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!
I could precalculate each query. A TopicTagQueryCache entity would hold a tagId, courseId, and a List<TopicId>. Then the query looks like select * from TopicTagQueryCache where tagId=x && courseId = y, fetching the list of topic ids, and then using a getAllById call on the list. Similar to #3, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
App Engine seems great for queries you can precalculate. I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
Options 4 and 5 are similar to the relation index pattern documented in this talk.
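For illustration, a sketch of the option-1 query with the low-level API, using only equality filters and no sort order so the datastore can use the merge join (tagId and courseId are assumed variables):
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.CompositeFilterOperator;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

// Two equality filters, no ORDER BY and no inequality filter: the
// datastore can satisfy this internally, so no exploding composite
// index is required.
Query q = new Query("Topic").setFilter(CompositeFilterOperator.and(
        new FilterPredicate("tagIds", FilterOperator.EQUAL, tagId),
        new FilterPredicate("courseIds", FilterOperator.EQUAL, courseId)));
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
for (Entity topic : ds.prepare(q).asIterable()) {
    // each result has the requested tag AND is in the requested course
}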
I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to manually maintain it (next paragraph), and retrieving the Topic entity will require an extra query (first you query TopicTagQueryCache to get the topic ID and then you need to actually retrieve the topic).
Updating the TopicTagQueryCache you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally - this "index" will just be stale for a short period of time when you update a Topic (at worst, your Topic will temporarily show up in results it should no longer show up in, and perhaps take a moment before it shows up in new results which it should show up in - this doesn't seem so bad). You can even do this update on the task queue (to make sure this potentially large number of database writes all succeeds, and so that you can quickly finish the request so your user isn't waiting).
As you said yourself, you should arrange your data to facilitate the scaling of your app. So, to the question "What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?":
You can hold your own indexes of these sets by creating CourseRef and TopicRef objects which consist of a Key only, with the ID portion being the actual Key of the corresponding entity. These "Ref" entities live under a specific Tag, so there are no actual Key duplicates. The structure for a given Tag is: Tag\CourseRef...\TopicRef...
This way given a Tag and Course, you construct the Key Tag\CourseRef and do an ancestor Query which gets you a set of keys you can fetch. This is extremely fast as it is actually a direct access, and this should handle large lists of courses or topics without the issues of List properties.
This method will require you to use the DataStore API to some extent.
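A sketch of that key construction and ancestor query with the low-level API (the kind names and numeric ids are assumptions):
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;

// Build the Tag\CourseRef key path, then fetch every TopicRef key below it.
Key tagKey = KeyFactory.createKey("Tag", tagId);
Key courseRefKey = KeyFactory.createKey(tagKey, "CourseRef", courseId);
Query q = new Query("TopicRef").setAncestor(courseRefKey).setKeysOnly();
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
for (Entity ref : ds.prepare(q).asIterable()) {
    // ref.getKey().getId() is the ID of an actual Topic entity to fetch
}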
As you can see, this answers the specific question, but the model will do no good for other types of set operations.
I have trouble understanding how to avoid the N+1 selects problem in JPA or Hibernate.
From what I read, there's the 'left join fetch', but I'm not sure if it still works with more than one list (OneToMany)...
Could someone explain it to me, or give me a link with a clear, complete explanation, please?
I'm sorry if this is a noob question, but I can't find a really clear article or doc on this issue.
Thanks
Apart from the join, you can also use subselect(s). This results in 2 queries being executed (or in general m + 1, if you have m lists), but it scales well for a large number of lists too, unlike join fetching.
With join fetching, if you fetch 2 tables (or lists) with your entity, you get a cartesian product, i.e. all combinations of pairs of rows from the two tables. If the tables are large, the result can be huge, e.g. if both tables have 1000 rows, the cartesian product contains 1 million rows!
A better alternative for such cases is to use subselects. In this case, you would issue 2 selects - one for each table - on top of the main select (which loads the parent entity). E.g. if both lists have 100 elements, altogether you load 1 + 100 + 100 rows with 3 queries.
For the record, the same with lazy loading would result in 201 separate selects, each loading a single row.
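For illustration, a minimal sketch of configuring subselect fetching in Hibernate (the entities are hypothetical; @Fetch is a Hibernate-specific annotation, not plain JPA):
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;
import org.hibernate.annotations.Fetch;
import org.hibernate.annotations.FetchMode;

// With FetchMode.SUBSELECT, initializing the "children" collection of
// one Parent issues a single extra SELECT that loads the children of
// every Parent returned by the original query.
@Entity
class Parent {
    @Id @GeneratedValue
    Long id;

    @OneToMany(mappedBy = "parent")
    @Fetch(FetchMode.SUBSELECT)
    List<Child> children;
}

@Entity
class Child {
    @Id @GeneratedValue
    Long id;

    @ManyToOne
    Parent parent;
}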
Update: here are some examples:
a tutorial: Tuning Lazy Fetching, with a section on subselects towards the end (btw it also explains the n+1 selects problem and all strategies to deal with it),
examples of HQL subqueries from the Hibernate reference,
just in case, the chapter on fetching strategies from the Hibernate reference - similar content as the first one, but much more thorough