I have code that gathers data from various sources and sorts and orders it before sending it to the user.
I fetch the data by firing a query containing multiple joins into a list of DTOs, then firing another query, also containing multiple joins, into a second list of DTOs. I then add the two lists of DTOs together to present them to the user.
Query 1:
Select * from TABLE1, TABLE2....
Query 2:
Select * from TABLE5, TABLE7....
dto1.addAll(dto2);
dto1.sort(Comparator....);
The reason I am sorting again programmatically is the following:
Query 1 returned sorted data, let's assume:
1,2,3,4
Query 2 returned sorted data, let's assume:
1,2,3,4
After combining both lists, I will get:
1,2,3,4,1,2,3,4
Expected data:
1,1,2,2,3,3,4,4
My question is: in which case will the performance be better?
Fetch the sorted data from both queries, combine the lists, and then sort and order them again.
Fetch the unsorted data from both queries, combine the lists, and then sort and order only once.
In the first case the data gets sorted three times; in the second case it gets sorted and ordered only once.
When I tested with hundreds of thousands of records in the table, I didn't find much difference; the second case was a bit faster than the first.
So, in terms of efficiency and performance, which one is recommended?
Do it all in MySQL:
( SELECT ... )
UNION ALL
( SELECT ... )
ORDER BY ...
Don't worry about sorting in the two selects; wait until the end to do it.
ALL assumes that there are no duplicates you need to get rid of.
This approach may be fastest simply because it is a single SQL request to the database, and because it does only one sort.
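On the Java side this becomes a single statement and a single result set. A minimal JDBC sketch; the join conditions, col1/col2 and the Dto constructor are placeholders standing in for your real queries:
String sql =
    "(SELECT col1, col2 FROM TABLE1 JOIN TABLE2 ON TABLE1.id = TABLE2.t1_id)"
  + " UNION ALL "
  + "(SELECT col1, col2 FROM TABLE5 JOIN TABLE7 ON TABLE5.id = TABLE7.t5_id)"
  + " ORDER BY col1";
List<Dto> dtos = new ArrayList<>();
try (PreparedStatement ps = connection.prepareStatement(sql);
     ResultSet rs = ps.executeQuery()) {
    while (rs.next()) {
        dtos.add(new Dto(rs.getString("col1"), rs.getString("col2")));
    }
}
// dtos is already in the final order; no addAll() and no extra sort needed.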
I think all three approaches (your two options and the single combined query) will have similar performance. You could get a little more speed with one or another, but I don't think the difference will be significant.
Now, in terms of load, that's a different story. Are you more limited by CPU resources (on your local machine) or by database resources (on the remote DB server)? Most of the time the database will be sitting there idle while your application is processing a lot of other stuff. If that's the case, I would prefer to put the load on the database rather than the application itself: that is, I would let the database combine and sort the data in a single SQL call; the application would then simply use the ready-to-use data.
Edit on Dec 22, 2018:
If both queries run on the same database, you can run them as a single one and combine the results using a CTE (Common Table Expression). For example:
with x (col1, col2, col3, col4, ...) as (
    select * from TABLE1, TABLE2...   -- query 1
    union all
    select * from TABLE5, TABLE7...   -- query 2
)
select * from x
order by col1
The ORDER BY at the end operates over the combined result. Alternatively, if your database doesn't support CTEs, you can write:
select * from (
    select * from TABLE1, TABLE2...   -- query 1
    union all
    select * from TABLE5, TABLE7...   -- query 2
) x
order by col1
I think the second is the better performer: you run a single sorting pass after merging your two lists, so you don't need the database to sort either query, and you save that sorting cost on both queries.
On the other hand, if you retrieve the data in sorted order and then run a sorting algorithm again anyway, it must add some extra cost, although it is negligible.
Related
I'm dealing with up to a billion records in Oracle and I really need efficiency.
The first table is notification. I need to obtain the following data.
src_data_id | match_data_id
The second table is person_info. Its id is the same as src_data_id and match_data_id from the notification table.
id | name
The third table is sample_info, in which self_object_id is the foreign key for person_info.
id | self_object_id
The fourth table is sample_dna_gene, where sample_id is the same as id in sample_info.
sample_id | gene_info
I am writing a program in Java and I want to encapsulate a list of objects. Each object contains the name (from person_info) and gene_info (from sample_dna_gene).
Originally, I did it in 2 steps. I joined notification and person_info to obtain the ids. Then I joined person_info, sample_info and sample_dna_gene to obtain the names and their corresponding gene_info.
This would be fine for a smaller database, but dealing with up to a billion records, I need to worry about speed. Perhaps I should not join the three tables as I did, but use simple SQL statements for each table and join the pieces in Java instead.
It was easy to get the ids from person_info with separate SQL statements, but I'm having trouble obtaining their corresponding gene_info. I can get sample_info.id with a simple SQL statement using in(id1,id2,id3...). I can then find gene_info with another simple SQL statement using in(id1,id2,id3...).
I can obtain all these lists in Java, but how do I put them together? I'm using Spring and MyBatis. Originally I could make one big messy SQL statement and encapsulate all the elements in the mapper. I'm not sure what to do now.
Edit: the messy SQL I have right now is:
select to_char(sdg.gene_info), max(aa.pid), max(aa.sid), max(aa.id_card_no)
from (select max(pi.person_name) person_name,
             max(pi.id) pid,
             si.id sid,
             max(pi.id_card_no) id_card_no,
             max(pi.race) race
      from person_info pi
      join sample_info si
        on pi.id = si.self_object_id
      group by si.id) aa
join sample_dna_gene sdg
  on sdg.sample_id = aa.sid
where aa.pid in ('...')
group by to_char(sdg.gene_info)
It's a little more complicated than the original question. I need to group by id in sample_info first, then group by gene_info in sample_dna_gene. I had to use a lot of max() so the group by would work, and even then I still could not get the gene_info group by to work properly. I'm not sure how inefficient the max() calls are and how much they will slow down the query, but you can clearly see why I wanted to avoid such a messy SQL statement.
I had a similar case. It was dealt with using 4 separate readers, one for each table, and the merging was done on the Java side. Unfortunately, a prerequisite for that was sorting the incoming streams on the database side.
You read a single record from stream one, then you read records from stream 2 until the key changes (as you sorted by that key, and the key is common to all tables), then do the same for the following streams, as in the sketch below. In my case that made sense because the first table was very wide and the next 3 had many rows per key from table 1. If in your case there are no 1:n relations (where n is big), I don't see why such an approach would be better than a join.
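A minimal sketch of that merge in Java/JDBC, assuming both result sets are sorted by the same numeric key (mapPerson, mapGene, the statements and the column names are hypothetical placeholders, not from the question):
ResultSet parent = parentStmt.executeQuery();  // e.g. person rows, ORDER BY id
ResultSet child = childStmt.executeQuery();    // e.g. gene rows, ORDER BY person_id
List<Person> result = new ArrayList<>();
boolean childHasRow = child.next();
while (parent.next()) {
    long key = parent.getLong("id");
    Person p = mapPerson(parent);              // hypothetical row-to-object mapper
    // skip child rows whose key is behind the current parent key
    while (childHasRow && child.getLong("person_id") < key) {
        childHasRow = child.next();
    }
    // consume child rows until the key changes
    while (childHasRow && child.getLong("person_id") == key) {
        p.addGene(mapGene(child));             // hypothetical mapper
        childHasRow = child.next();
    }
    result.add(p);
}
Each stream is read exactly once, so the cost is linear in the total number of rows, at the price of the ORDER BY on the database side.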
If I open a single JDBC connection (for Oracle), and execute multiple select queries, will it be less efficient than calling a procedure that executes those queries, and returns the result in cursors?
Edit: Sample queries are:
select id, name from animals;
select * from animal_reservoir where animal_id = :id;
(The actual first query would be quite complicated, and the id returned would be used as an input multiple times in the second query. As such, the first query would be inefficient to use as a subquery in the second query. Also, the queries can't be combined.)
The two main differences are:
fewer roundtrips (important if there are many small queries, otherwise not so much)
no need to send "intermediate" results (that are only needed for the next query, but not in the end) back to the client
How much of an impact this has completely depends on the application.
And often there may be other alternatives to consider as well (such as issuing different kinds of queries in the first place; someone mentioned a JOIN in the comments -- or caching -- or indexing -- or data denormalization -- or ...).
As usual, do what feels most natural first and optimize when you find there is an issue.
You haven't provided SQL queries that would require a procedure.
You can run it as one SQL query with multiple "inner" queries, using a WITH clause, for example:
with animal_ids (id, name) as (
    select id, name from animals
)
select * from animal_reservoir, animal_ids where animal_id = animal_ids.id;
As the title states, I want to retrieve a maximum of, for example, 1000 rows, but if the query's result would be 1001 rows, I would like to know that in some way. I have seen examples that check the number of rows in the result with a second query, but I would like to have it in the same query I use to get the 1000 rows. I am using Hibernate and Criteria to retrieve results from my database. The database is MS SQL.
What you want is not possible in a generic way.
The 2 usual patterns for pagination are:
use 2 queries: a first one that counts, and a second one that gets a page of results
use only one query, in which you fetch one result more than what you show on the page
With the first pattern, your pagination has more functionality, because you can display the total number of pages and allow the user to jump directly to the page he wants, but you get this possibility at the cost of an additional SQL query.
With the second pattern you can only tell the user whether there is one more page of data or not. The user can then just jump to the next page (or any previous page he has already seen), as in the sketch below.
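For example, with Hibernate the second pattern is just a matter of asking for pageSize + 1 rows. A sketch, where Employee is a placeholder entity and session is an open Hibernate Session:
int pageSize = 1000;
int pageNumber = 0;  // zero-based page index
List<Employee> rows = session.createQuery("from Employee e order by e.id", Employee.class)
        .setFirstResult(pageNumber * pageSize)
        .setMaxResults(pageSize + 1)   // ask for one extra row as a sentinel
        .getResultList();
boolean hasNextPage = rows.size() > pageSize;
if (hasNextPage) {
    rows.remove(pageSize);             // drop the sentinel before displaying the page
}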
You want two pieces of information, which come from two distinct queries:
select count(...) from ...
select col1, col2 from ...
You cannot do it in a single executed Criteria or JPQL query.
But you can do it with a native SQL query (using a subquery), in a way that differs according to the DBMS used.
By doing that, you would make your code more complex, make it more dependent on a specific DBMS, and probably not really gain anything in terms of performance.
I think you should rather use a count and a second query to get the rows, as in the sketch below.
And if later you want to exploit the result of the count to fetch further results, you should favor the pagination mechanisms provided by Hibernate rather than doing it in a custom way.
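A sketch of that count-plus-page pattern, with the same placeholder Employee entity as above:
// Query 1: the count, so the UI can show the total number of pages
long total = session.createQuery("select count(e) from Employee e", Long.class)
        .getSingleResult();
// Query 2: one page of rows
List<Employee> page = session.createQuery("from Employee e order by e.id", Employee.class)
        .setFirstResult(pageNumber * pageSize)
        .setMaxResults(pageSize)
        .getResultList();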
I have a list of 100,000 employee names in Java and I need to fetch details for all these employees from a database (which holds around 400,000 employee records). I tried it with the IN operator, but it takes 10-15 minutes to do the fetch. Is there a better way to do this?
Most DBMSs have a limit on the number of values an IN clause may contain.
You have a few choices:
Run a separate SELECT statement for each name.
Run a separate SELECT statement for each name, but batch them. Limit the batch size to a reasonable number, e.g. 1000; larger batches use more memory and don't improve performance.
Chop the list of names into blocks of 1000, and run SELECT ... IN for each block.
If you're already doing #3, then you're doing it the best way you can.
Splitting the task into chunks of 1000 (SELECT batches or blocks of IN lists) will not perform much differently from doing all 100,000 in one operation.
That is, unless there is no index on the name column, in which case the database has to do a full table scan. In that case, chunking would cause many full table scans, and that would be bad.
Solution #1: Create an index. If you do name lookups on tables with 100,000+ records, you really(!) need an index.
Solution #2: Insert all the names into a temporary staging table, then do SELECT ... WHERE name IN ( SELECT name FROM temptable ), which is what @JamesZ suggested in the comments. This ensures that only one full table scan is needed.
I strongly suggest solution #1.
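For reference, a sketch of option 3 in plain JDBC; names is the list of 100,000 employee names, connection is an open java.sql.Connection, and mapRow is a hypothetical row-to-DTO mapper (table and column names are assumptions):
List<EmployeeDetail> result = new ArrayList<>();
for (int i = 0; i < names.size(); i += 1000) {
    List<String> chunk = names.subList(i, Math.min(i + 1000, names.size()));
    String placeholders = String.join(",", Collections.nCopies(chunk.size(), "?"));
    String sql = "SELECT * FROM employee WHERE name IN (" + placeholders + ")";
    try (PreparedStatement ps = connection.prepareStatement(sql)) {
        for (int j = 0; j < chunk.size(); j++) {
            ps.setString(j + 1, chunk.get(j));
        }
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                result.add(mapRow(rs));
            }
        }
    }
}
With an index on name, each of the 100 chunks is a fast index lookup rather than a table scan.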
Suppose that I track the 'events' a user performs on a website; events can be things like:
viewed homepage
added item to cart
checkout
paid for order
Now each of those events is stored in a database like:
session_id event_name created_date ..
So now I want to build a report to display a particular funnel that I will define like:
Step#1 event_n
Step#2 event_n2
Step#3 event_n3
So this particular funnel has 3 steps, and each step is associated with ANY event.
How can I build a report for this now given the above data I have?
Note: just want to be clear, I want to be able to create any funnel that I define, and be able to create a report for it.
The most basic way I can think of is:
get all the events for each step I have in my database
Step #1 will be: x% of people performed event_n
Then I will have to query the data for step #2, for those who ALSO performed step #1, and display the %
Same as #2, but for step #3, with the condition of having performed step #2
I'm curious how these online services can display these types of reports in a hosted SaaS environment. Does map-reduce somehow make this easier?
First the answer, using standard SQL, given your hypothesis:
there is a table EVENTS with a simple layout:
EVENTS
-----------------------------
SESSION_ID , EVENT_NAME , TMST
To get the sessions that performed step#1 at some time:
-- QUERY 1
SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID;
Here I make the assumption that event1 can happen more than once per session. The result is a list of unique sessions that demonstrated event1 at some time.
In order to get step2 and step3, I can just do the same:
-- QUERY 2
SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID;
-- QUERY 3
SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event3' GROUP BY SESSION_ID;
Now, you want to select sessions that performed step1, step2 and step3 - in that order.
More precisely, you need to count sessions that performed step1, then count sessions that performed step2, then count sessions that performed step3.
Basically, we just need to combine the 3 queries above with left joins to list the sessions that entered the funnel and the steps they performed:
-- FUNNEL FOR S1/S2/S3
SELECT
    Q1.SESSION_ID,
    Q1.TMST IS NOT NULL AS PERFORMED_STEP1,
    Q2.TMST IS NOT NULL AS PERFORMED_STEP2,
    Q3.TMST IS NOT NULL AS PERFORMED_STEP3
FROM
    -- QUERY 1
    (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID) AS Q1
LEFT JOIN
    -- QUERY 2
    (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID) AS Q2
    -- Q1 & Q2
    ON Q1.SESSION_ID=Q2.SESSION_ID AND Q1.TMST<Q2.TMST
LEFT JOIN
    -- QUERY 3
    (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event3' GROUP BY SESSION_ID) AS Q3
    -- Q2 & Q3
    ON Q2.SESSION_ID=Q3.SESSION_ID AND Q2.TMST<Q3.TMST
The result is a list of unique sessions that entered the funnel at step1 and may have continued to step2 and step3... e.g.:
SESSION_ID_1,TRUE,TRUE,TRUE
SESSION_ID_2,TRUE,TRUE,FALSE
SESSION_ID_3,TRUE,FALSE,FALSE
...
Now we just have to compute some stats, for example:
SELECT
STEP1_COUNT,
STEP1_COUNT-STEP2_COUNT AS EXIT_AFTER_STEP1,
STEP2_COUNT*100.0/STEP1_COUNT AS PERCENTAGE_TO_STEP2,
STEP2_COUNT-STEP3_COUNT AS EXIT_AFTER_STEP2,
STEP3_COUNT*100.0/STEP2_COUNT AS PERCENTAGE_TO_STEP3,
STEP3_COUNT*100.0/STEP1_COUNT AS COMPLETION_RATE
FROM
(-- QUERY TO COUNT sessions at each step
SELECT
SUM(CASE WHEN PERFORMED_STEP1 THEN 1 ELSE 0 END) AS STEP1_COUNT,
SUM(CASE WHEN PERFORMED_STEP2 THEN 1 ELSE 0 END) AS STEP2_COUNT,
SUM(CASE WHEN PERFORMED_STEP3 THEN 1 ELSE 0 END) AS STEP3_COUNT
FROM
[... insert the funnel query here ...]
) AS COMPUTE_STEPS
Et voilà!
Now for the discussion.
First point: the result is pretty straightforward once you adopt the "set" (or functional) way of thinking rather than the "procedural" approach. Don't visualize the database as a collection of fixed tables with columns and rows... that is how it is implemented, but it is not the way you interact with it. It's all sets, and you can arrange the sets the way you need!
Second point: the query will automatically be optimized to run in parallel if you are using an MPP database, for instance. You don't even need to program the query differently, use map-reduce, or whatever... I ran the same query on my test dataset of more than 100 million events and got results in seconds.
Last but not least, the query opens endless possibilities. Just group the results by referrer, keywords, landing page, or user information, and analyze which provides the best conversion rate, for instance!
The core problem in the way you are thinking about this is that you are thinking in a SQL/table model. Each event is one record. One of the nice things about NoSQL technologies (which you feel an inkling towards) is that you can naturally store the data as one session per record. Once you store the data in a session-based manner, you can write a routine that checks whether that session complies with the pattern or not. No need to do joins or anything, just a loop over the list of transactions in a session. Such is the power of semi-structured data.
What if you stored your sessions together? Then all you have to do is iterate through each session and see if it matches.
This is a fantastic use case for HBase, in my opinion.
With HBase, you store the session ID as the row key, then each of the events as values with the time stamp as the column qualifier. What this leaves you with is data that is grouped together by session ID, then sorted by time.
Ok, so now you want to figure out what % of sessions enacted behavior 1, then 2, then 3. You run a MapReduce job over this data. The MapReduce job will provide you with one session per row key/value pair. Write a loop over the data to check whether it matches the pattern, as in the sketch below; if it does, count + 1; if not, don't.
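The per-session check itself can be as small as this (plain Java; the method and names are mine, not part of HBase or Hadoop):
// Returns true if the session's events, already sorted by timestamp,
// contain the funnel steps in order (other events may be interleaved).
static boolean matchesFunnel(List<String> eventsInTimeOrder, List<String> steps) {
    int next = 0;                           // index of the next step to find
    for (String event : eventsInTimeOrder) {
        if (next < steps.size() && event.equals(steps.get(next))) {
            next++;                         // found this step, look for the next one
        }
    }
    return next == steps.size();            // all steps seen, in order
}
Counting the sessions for which this returns true for each prefix of the step list gives the same percentages as the SQL funnel above.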
Without going all out with HBase, you can use MapReduce to sessionize your unorganized data at rest. Group by the session ID; then, in the reducer, you'll have all of the events associated with that session grouped together. Now you're basically where you were with HBase: you can write a method in the reducer that checks for the pattern.
HBase might be overkill if you don't have a ridiculous amount of data. Any sort of database that can store data hierarchically will be good in this situation. MongoDB, Cassandra, Redis all come to mind and have their strengths and weaknesses.
I recently released an open source Hive UDF to do this: hive-funnel-udf
It's pretty simple to use for this sort of funnel analysis task: you can just write Hive, with no need to write custom Java MapReduce code.
This will only work if you are using Hive/Hadoop to store and query your data, though.