Funnel analysis calculation, how would you calculate a funnel? - java

Suppose that I track an 'event' a user takes on a website; events can be things like:
viewed homepage
added item to cart
checkout
paid for order
Now each of those events is stored in a database like:
session_id event_name created_date ..
So now I want to build a report to display a particular funnel that I will define like:
Step#1 event_n
Step#2 event_n2
Step#3 event_n3
So this particular funnel has 3 steps, and each step is associated with ANY event.
How can I build a report for this now given the above data I have?
Note: just want to be clear, I want to be able to create any funnel that I define, and be able to create a report for it.
The most basic way I can think of is:
get all events for each step I have in my database
step#1 will be, x% of people performed event_n
Now for step#2 I will have to query for sessions that ALSO performed step#1, and display the %
Same as #2, but for step#3 with the condition on step#2
I'm curious how these online services can display these types of reports in a hosted SaaS environment. Does map-reduce make this easier somehow?

First the answer, using standard SQL, given your assumptions:
there is a table EVENTS with a simple layout:
EVENTS
-----------------------------
SESSION_ID , EVENT_NAME , TMST
To get the session that performed step#1 at some time:
-- QUERY 1
SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID;
Here I assume that event1 can happen more than once per session. The result is a list of unique sessions that demonstrated event1 at some time.
In order to get step2 and step3, I can just do the same:
-- QUERY 2
SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID;
-- QUERY 3
SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event3' GROUP BY SESSION_ID;
Now, you want to select sessions that performed step1, step2 and step3 - in that order.
More precisely, you need to count sessions that performed step1, then count sessions that performed step2, then count sessions that performed step3.
Basically we just need to combine the 3 queries above with LEFT JOINs to list the sessions that entered the funnel and which steps they performed:
-- FUNNEL FOR S1/S2/S3
SELECT
  Q1.SESSION_ID,
  Q1.TMST IS NOT NULL AS PERFORMED_STEP1,
  Q2.TMST IS NOT NULL AS PERFORMED_STEP2,
  Q3.TMST IS NOT NULL AS PERFORMED_STEP3
FROM
  -- QUERY 1
  (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID) AS Q1
LEFT JOIN
  -- QUERY 2
  (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID) AS Q2
LEFT JOIN
  -- QUERY 3
  (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event3' GROUP BY SESSION_ID) AS Q3
-- Q2 & Q3
ON Q2.SESSION_ID=Q3.SESSION_ID AND Q2.TMST<Q3.TMST
-- Q1 & Q2
ON Q1.SESSION_ID=Q2.SESSION_ID AND Q1.TMST<Q2.TMST
The result is a list of unique sessions that entered the funnel at step1 and may have continued to step2 and step3... e.g.:
SESSION_ID_1,TRUE,TRUE,TRUE
SESSION_ID_2,TRUE,TRUE,FALSE
SESSION_ID_3,TRUE,FALSE,FALSE
...
Now we just have to compute some stats, for example:
SELECT
STEP1_COUNT,
STEP1_COUNT-STEP2_COUNT AS EXIT_AFTER_STEP1,
STEP2_COUNT*100.0/STEP1_COUNT AS PERCENTAGE_TO_STEP2,
STEP2_COUNT-STEP3_COUNT AS EXIT_AFTER_STEP2,
STEP3_COUNT*100.0/STEP2_COUNT AS PERCENTAGE_TO_STEP3,
STEP3_COUNT*100.0/STEP1_COUNT AS COMPLETION_RATE
FROM
(-- QUERY TO COUNT sessions at each step
SELECT
SUM(CASE WHEN PERFORMED_STEP1 THEN 1 ELSE 0 END) AS STEP1_COUNT,
SUM(CASE WHEN PERFORMED_STEP2 THEN 1 ELSE 0 END) AS STEP2_COUNT,
SUM(CASE WHEN PERFORMED_STEP3 THEN 1 ELSE 0 END) AS STEP3_COUNT
FROM
[... insert the funnel query here ...]
) AS COMPUTE_STEPS
Et voilĂ  !
Now for the discussion.
First point: the result is pretty straightforward once you take the "set" (or functional) way of thinking rather than the "procedural" approach. Don't visualize the database as a collection of fixed tables with columns and rows... that is how it is implemented, but it is not the way you interact with it. It's all sets, and you can arrange the sets the way you need!
Second point: that query will automatically be optimized to run in parallel if you are using an MPP database, for instance. You don't even need to program the query differently, use map-reduce or whatever... I ran the same query on a test dataset with more than 100 million events and got results in seconds.
Last but not least, the query opens endless possibilities. Just group the results by referrer, keywords, landing page, or user information, and analyze which provides the best conversion rate, for instance!
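Since the goal is to report on any funnel you define, the 3-step query also generalizes mechanically to N steps. As a minimal sketch (class and method names are hypothetical, not part of any library), here is a Java builder emitting the equivalent chained LEFT JOIN form, with one ? placeholder per step so the event names can be bound in order through a PreparedStatement:

// Hypothetical builder for the N-step funnel query above (names are made up).
public class FunnelQueryBuilder {

    public static String build(int steps) {
        StringBuilder sql = new StringBuilder("SELECT Q1.SESSION_ID");
        for (int i = 1; i <= steps; i++) {
            sql.append(", Q").append(i).append(".TMST IS NOT NULL AS PERFORMED_STEP").append(i);
        }
        sql.append(" FROM (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS")
           .append(" WHERE EVENT_NAME=? GROUP BY SESSION_ID) AS Q1");
        for (int i = 2; i <= steps; i++) {
            sql.append(" LEFT JOIN (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS")
               .append(" WHERE EVENT_NAME=? GROUP BY SESSION_ID) AS Q").append(i)
               // each step must happen in the same session, strictly after the previous step
               .append(" ON Q").append(i - 1).append(".SESSION_ID=Q").append(i).append(".SESSION_ID")
               .append(" AND Q").append(i - 1).append(".TMST<Q").append(i).append(".TMST");
        }
        return sql.toString();
    }
}

Each generated ON clause requires the step to occur in the same session and strictly after the previous step, which is exactly the semantics of the 3-step query above; if a step is missed, the NULL session id stops all later steps from matching.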

The core problem in the way you are thinking about this is that you are thinking in a SQL/table type model, where each event is one record. One of the nice things about NoSQL technologies (which you seem to feel an inkling towards) is that you can naturally store the data as one record per session. Once you store the data in a session-based manner, you can write a routine that checks whether that session complies with the pattern or not. No need to do joins or anything, just a loop over a list of transactions in a session. Such is the power of semi-structured data.
What if you store your sessions together? Then, all you have to do is iterate through each session and see if it matches.
This is a fantastic use case for HBase, in my opinion.
With HBase, you store the session ID as the row key, then each of the events as values with the time stamp as the column qualifier. What this leaves you with is data that is grouped together by session ID, then sorted by time.
Ok, so now you want to figure out what % of sessions enacted behavior 1, then 2, then 3. You run a MapReduce job over this data. The MapReduce job will hand you one session per row key/value pair. Write a loop over the data to check whether it matches the pattern. If it does, count + 1; if not, don't.
Without going all out with HBase, you can use MapReduce to sessionize your unorganized data at rest. Group by the session ID, then in the reducer you'll have all of the events associated with that session grouped together. Now, you're basically where you were with HBase where you can write a method in the reducer that checks for the pattern.
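To make that reducer-side check concrete, here is a minimal sketch (class and method names are illustrative, not from Hadoop or HBase), assuming the session's events arrive already sorted by timestamp:

import java.util.List;

// Given one session's events in time order, does the session walk
// through the funnel steps in order? (Names are illustrative.)
public class FunnelMatcher {

    public static boolean matchesFunnel(List<String> sessionEventsInTimeOrder,
                                        List<String> funnelSteps) {
        int step = 0; // index of the next funnel step we are looking for
        for (String event : sessionEventsInTimeOrder) {
            if (step < funnelSteps.size() && event.equals(funnelSteps.get(step))) {
                step++; // found the current step, advance to the next one
            }
        }
        return step == funnelSteps.size(); // true if all steps were seen in order
    }
}

If you record how far step advanced instead of returning a boolean, you get the per-step drop-off counts for free.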
HBase might be overkill if you don't have a ridiculous amount of data. Any sort of database that can store data hierarchically will be good in this situation. MongoDB, Cassandra, Redis all come to mind and have their strengths and weaknesses.

I recently released an open source Hive UDF to do this: hive-funnel-udf
It's pretty simple to use for this sort of funnel analysis task, you can just write Hive, no need to write custom Java MapReduce code.
This will only work if you are using Hive/Hadoop to store and query your data though.

Related

SQL Merge vs Check and Insert/Update in Java

I have a Java (Spring) REST API endpoint where I get 3 data inputs, and I need to insert them into an Oracle database based on some unique ID using JdbcTemplate. But just to be sure nothing breaks, I first want to check whether I need to insert or just update.
1st Approach
Make a database call with a simple query like
SELECT COUNT(*) FROM TABLENAME WHERE ID='ABC' AND ROWNUM=1
And based on the value of count, make a separate database call for insert or update (count would never exceed 1).
2nd Approach
Make a single MERGE query via jdbcTemplate.update() that would look like:
MERGE INTO TABLENAME
USING DUAL ON (ID='ABC')
WHEN MATCHED THEN UPDATE
SET COL1='A', COL2='B'
WHERE ID='ABC'
WHEN NOT MATCHED THEN
INSERT (ID, COL1, COL2) VALUES ('ABC','A','B')
Based on what I read on different sites, using MERGE is a bit more costly in terms of CPU reads, according to an experiment on this site. But that experiment was done purely for DB-script use, between 2 tables, whereas my context of use is via an API call and using DUAL.
I also read on this question that MERGE could result in ORA-00001: unique constraint violated, and in some concurrency issues.
I want to do this on a table on which some other operation may happen at the same time for a different row, with a very, very small chance of it being the same row. So I want to know which approach to follow for such a use case. I know this might be a common question, but I could not find the answer I'm looking for anywhere. I want to know the performance/reliability of both approaches.
Looking at code running in a concurrent-sessions environment, after each atomic statement we need to ask "what if another session has just broken our assumption?" and make adjustments accordingly.
Option 1. Count and decide INSERT or UPDATE
declare
  v_count int;
begin
  SELECT count(1) INTO v_count FROM my_table WHERE ...;
  IF v_count = 0 THEN
    -- what if another session inserted the same row just before this point?
    -- this statement will fail
    INSERT INTO my_table ...;
  ELSE
    UPDATE my_table ...;
  END IF;
end;
Option 2. UPDATE, if nothing is updated - INSERT
begin
  UPDATE my_table SET ... WHERE ...;
  IF SQL%ROWCOUNT = 0 THEN
    -- what if another session inserted the same row just before this point?
    -- this statement will fail
    INSERT INTO my_table ...;
  END IF;
end;
Option 3. INSERT, if failed - UPDATE
begin
  INSERT INTO my_table ...;
exception when DUP_VAL_ON_INDEX then
  -- what if another session updated the same row just before this point?
  -- this statement will override previous changes
  -- what if another session deleted this row?
  -- this statement will do nothing silently - is it satisfactory?
  -- what if another session locked this row for update?
  -- this statement will fail
  UPDATE my_table SET ... WHERE ...;
end;
Option 4. use MERGE
MERGE INTO my_table
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
-- We have no place to put our "what if" question,
-- but unfortunately MERGE is not atomic,
-- it is just syntactic sugar for option #1
Option 5. use interface for DML on my_table
-- Create single point of modifications for my_table and prevent direct DML.
-- For instance, if client has no direct access to my_table,
-- use locks to guarantee that only one session at a time
-- can INSERT/UPDATE/DELETE a particular table row.
-- This could be achieved with a stored procedure or a view "INSTEAD OF" trigger.
-- Client has access to the interface only (view and procedures),
-- but the table is hidden.
my_table_v -- VIEW AS SELECT * FROM my_table
my_table_ins_or_upd_proc -- PROCEDURE (...) BEGIN ...DML on my_table ... END;
PROCEDURE my_table_ins_or_upd_proc(pi_row my_table%ROWTYPE) is
  l_lock_handle CONSTANT VARCHAR2(100) := 'my_table_' || pi_row.id;
  -- an independent lock handle for each id allows
  -- operating on different ids in parallel
begin
  begin
    request_lock(l_lock_handle);
    -->> this code is exactly as in option #2
    UPDATE my_table SET ... WHERE ...;
    IF SQL%ROWCOUNT = 0 THEN
      -- what if another session inserted the same row just before this point?
      -- NOPE, it cannot happen: another session is waiting for a lock on the request_lock(...) line
      INSERT INTO my_table ...;
    END IF;
    --<<
  exception when others then
    release_lock(l_lock_handle);
    raise;
  end;
  release_lock(l_lock_handle);
end;
Not going too deep into low-level details here; see this article to find out how to use locks in Oracle DBMS.
Thus, we see that options 1, 2, 3 and 4 have potential problems that cannot be avoided in the general case. But they can be applied if safety is guaranteed by domain rules or particular design conventions.
Option 5 is bulletproof and fast, as it relies on guarantees provided by the DBMS.
Nevertheless, this is the prize for a clean design: it cannot be implemented if my_table is exposed directly and clients rely on straightforward DML against it.
I believe that performance is less important than data integrity, but let's mention that for completeness.
After proper consideration it is easy to see that the ordering of the options by "theoretical" average performance is:
2 -> 5 -> (1,4) -> 3
Of course, the step of performance measuring goes after obtaining at least two properly working solutions, and should be done exclusively for a particular application under a given workload profile. And that is another story. At this moment no need to bother about theoretical nanoseconds in some synthetic benchmarks.
I guess by now we can see that there is no magic. Somewhere in the application it is required to ensure that every id inserted into my_table is unique.
If id values do not matter (95% of cases) - just go for using a SEQUENCE.
Otherwise, create a single point of manipulation for my_table (either in Java or in the DBMS schema as PL/SQL) and control uniqueness there. If the application can guarantee that at most a single session at a time manipulates data in my_table, then it is possible to just apply option #2.
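On the Java side, here is a minimal JdbcTemplate sketch of option #2 (update first, insert when nothing was updated); the table and column names are illustrative. The catch block is the "what if another session inserted just before us" adjustment from above, assuming Spring's exception translation turns the ORA-00001 violation into a DuplicateKeyException:

import org.springframework.dao.DuplicateKeyException;
import org.springframework.jdbc.core.JdbcTemplate;

// Option #2 in plain JdbcTemplate code (table/column names are illustrative).
public class UpsertDao {

    private final JdbcTemplate jdbcTemplate;

    public UpsertDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void insertOrUpdate(String id, String col1, String col2) {
        int updated = jdbcTemplate.update(
                "UPDATE TABLENAME SET COL1 = ?, COL2 = ? WHERE ID = ?",
                col1, col2, id);
        if (updated == 0) {
            try {
                jdbcTemplate.update(
                        "INSERT INTO TABLENAME (ID, COL1, COL2) VALUES (?, ?, ?)",
                        id, col1, col2);
            } catch (DuplicateKeyException e) {
                // another session inserted the row between our UPDATE and INSERT:
                // fall back to UPDATE, which must now hit the existing row
                jdbcTemplate.update(
                        "UPDATE TABLENAME SET COL1 = ?, COL2 = ? WHERE ID = ?",
                        col1, col2, id);
            }
        }
    }
}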

Better Performance with SQL and Java Program

I have code where I am getting data from various sources, then sorting and ordering it to be sent to the user.
I fetch the data by firing a query containing multiple joins into a list of DTOs, then fire another query, which also contains multiple joins, into a second list of DTOs. Then I combine both lists to present to the user.
Query 1:
Select * from TABLE1, TABLE2....
Query 2:
Select * from TABLE5, TABLE7....
dto1.addAll(dto2);
dto1.sort(Comparator....);
The reason I sort again programmatically is the following:
Query 1 returned sorted data lets assume
1,2,3,4
Query 2 returned sorted data lets assume
1,2,3,4
After combining both the lists, I will get
1,2,3,4,1,2,3,4
Expected data
1,1,2,2,3,3,4,4
My question is, on which case performance will be better?
fetch the sorted data from both queries, combine the lists, then sort and order them again.
fetch the unsorted data from both queries, combine the lists, then sort and order only once.
In the first case, the data gets sorted three times; in the second case, it is sorted and ordered only once.
When I tested with hundreds of thousands of records in the tables, I didn't find much difference; the second case was a bit faster than the first one.
So, in terms of efficiency and performance, which one should be recommended?
Do it all in MySQL:
( SELECT ... )
UNION ALL
( SELECT ... )
ORDER BY ...
Don't worry about sorting in the two selects; wait until the end to do it.
ALL assumes that there are no dups you need to get rid of.
This approach may be fastest simply because it is a single SQL request to the database. And because it does only one sort.
I think all three will have similar performance. You could get a little bit higher speed using one or the other but I don't think it will be significant.
Now, in terms of load, that's a different story. Are you more limited by CPU resources (in your local machine) or by database resources (in the remote DB server)? Most of the time the database will be sitting there idle while your application will be processing a lot of other stuff. If that's the case, I would prefer to put the load on the database, rather than the application itself: that is, I would let the database combine and sort the data in a single SQL call; then the application would simply use the ready-to-use data.
Edit on Dec 22, 2018:
If both queries run on the same database, you can run them as a single one and combine the results using a CTE (Common Table Expression). For example:
with
x (col1, col2, col3, col4, ...) as (
  select * from TABLE1, TABLE2... -- query 1
  union all
  select * from TABLE5, TABLE7... -- query 2
)
select * from x
order by col1
The ORDER BY at the end operates over the combined result. Alternatively, if your database doesn't support CTEs, you can write:
select * from (
  select * from TABLE1, TABLE2... -- query 1
  union all
  select * from TABLE5, TABLE7... -- query 2
) x
order by col1
I think the second option performs better, because you run the sorting algorithm once in the application after merging your two lists, and you don't ask the database to sort either query - so you avoid the sorting cost of both database queries.
Whereas if you retrieve the data already sorted and then run a sorting algorithm again anyway, it must take some more cost to execute, although it is negligible.
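One more variant worth noting: if both lists really do arrive sorted, a full re-sort is not needed at all; a linear merge combines them in O(n). A minimal generic sketch (class name hypothetical), assuming both inputs are sorted by the same comparator:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Linear merge of two lists already sorted by the same comparator;
// avoids the O(n log n) re-sort of the combined list.
public class SortedMerge {

    public static <T> List<T> merge(List<T> a, List<T> b, Comparator<T> cmp) {
        List<T> result = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            // take the smaller head element from either list
            if (cmp.compare(a.get(i), b.get(j)) <= 0) {
                result.add(a.get(i++));
            } else {
                result.add(b.get(j++));
            }
        }
        while (i < a.size()) result.add(a.get(i++));
        while (j < b.size()) result.add(b.get(j++));
        return result;
    }
}

In practice the measured difference stays small either way, because List.sort uses Timsort, which detects and exploits the two pre-sorted runs in the concatenated list.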

Hibernate limit amount of result but check for more

As the title states, I want to retrieve a maximum of, for example, 1000 rows, but if the query's result would be 1001, I would like to know that in some way. I have seen examples which check the number of rows in the result with a second query, but I would like to have it in the same query I use to get the 1000 rows. I am using Hibernate and Criteria to retrieve results from my database. The database is MS SQL.
What you want is not possible in a generic way.
The 2 usual patterns for pagination are:
use 2 queries: a first one that counts, and a second one that gets a page of results
use only one query, where you fetch one result more than what you show on the page
With the first pattern, your pagination has more functionality, because you can display the total number of pages and allow the user to jump directly to the page he wants, but you get this possibility at the cost of an additional SQL query.
With the second pattern you can only tell the user whether there is one more page of data or not. The user can then jump to the next page (or any previous page he has already seen), as sketched below.
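As a minimal sketch of that second pattern with the classic Criteria API the question mentions (the wrapper class and parameters are illustrative): fetch one row more than the page size and use it only as a "there is more" flag.

import java.util.List;

import org.hibernate.Criteria;
import org.hibernate.Session;

// Pattern 2: fetch pageSize + 1 rows; the extra row is never displayed,
// it only tells us whether a next page exists.
public class PageFetcher {

    public static List<?> fetchPage(Session session, Class<?> entityClass,
                                    int pageNumber, int pageSize) {
        Criteria criteria = session.createCriteria(entityClass)
                .setFirstResult(pageNumber * pageSize)
                .setMaxResults(pageSize + 1); // one more row than we show
        List<?> rows = criteria.list();
        boolean hasMore = rows.size() > pageSize;
        if (hasMore) {
            rows = rows.subList(0, pageSize); // drop the sentinel row
        }
        // expose 'hasMore' alongside the rows in whatever wrapper you use
        return rows;
    }
}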
You want two pieces of information that result from two distinct queries:
select count(*) from ...
select col1, col2 from ...
You cannot get both in a single executed Criteria or JPQL query.
But you can do it with a native SQL query (using a subquery, by the way), in a way that differs according to the DBMS used.
By doing so, you would make your code more complex and more dependent on a specific DBMS, and you would probably not really gain anything in terms of performance.
I think you should rather use a count and a second query to get the rows.
And if later you want to exploit the result of the count to fetch the next results, you should favor the pagination mechanisms provided by Hibernate rather than doing it in a custom way.

Strange Cassandra ReadTimeoutExceptions, depending on which client is querying

I have a cluster of three Cassandra nodes with more or less default configuration. On top of that, I have a web layer consisting of two nodes for load balancing, both web nodes querying Cassandra all the time. After some time, with the data stored in Cassandra becoming non-trivial, one and only one of the web nodes started getting ReadTimeoutException on a specific query. The web nodes are identical in every way.
The query is very simple (? is a placeholder for a date, usually a few minutes before the current moment):
SELECT * FROM table WHERE time > ? LIMIT 1 ALLOW FILTERING;
The table is created with this query:
CREATE TABLE table (
  user_id varchar,
  article_id varchar,
  time timestamp,
  PRIMARY KEY (user_id, time));
CREATE INDEX articles_idx ON table(article_id);
When it times out, the client waits a bit more than 10s, which, not surprisingly, is the timeout configured in cassandra.yaml for most connects and reads.
There are a couple of things that are baffling me:
the query only times out when one of the web nodes executes it - one of the nodes always fails, the other always succeeds.
the query returns instantaneously when I run it from cqlsh (although it seems it only hits one node when I run it from there)
there are other queries issued which take 2-3 minutes (a lot longer than the 10s timeout) that do not time out at all
I cannot trace the query in Java because it times out. Tracing the query in cqlsh didn't provide much insight. I'd rather not change the Cassandra timeouts, as this is a production system and I'd like to exhaust non-invasive options first. The Cassandra nodes all have plenty of heap, their heap is far from full, and GC times seem normal.
Any ideas/directions will be much appreciated, I'm totally out of ideas. Cassandra version is 2.0.2, using com.datastax.cassandra:cassandra-driver-core:2.0.2 Java client.
A few things I noticed:
While you are using time as a clustering key, it doesn't really help you because your query is not restricting by your partition key (user_id). Cassandra only orders by clustering keys within a partition. So right now your query is pulling back the first row which satisfies your WHERE clause, ordered by the hashed token value of user_id. If you really do have tens of millions of rows, then I would expect this query to pull back data from the same user_id (or same select few) every time.
"although it seems it only hits one node when I run it from there" Actually, your queries should only hit one node when you run them. Introducing network traffic into a query makes it really slow. I think the default consistency in cqlsh is ONE. This is where Carlo's idea comes into play.
What is the cardinality of article_id? Remember, secondary indexes work the best on "middle-of-the-road" cardinality. High (unique) and low (boolean) are both bad.
The ALLOW FILTERING clause should not be used in (production) application-side code. Like ever. If you have 50 million rows in this table, then ALLOW FILTERING is first pulling all of them back, and then trimming down the result set based on your WHERE clause.
Suggestions:
Carlo might be on to something with the suggestion of trying a different (lower) consistency level. Try setting a consistency level of ONE in your application and see if that helps.
Either perform an ALLOW FILTERING query, or a secondary index query. They both suck, but definitely do not do both together. I would not use either. But if I had to pick, I would expect a secondary index query to suck less than an ALLOW FILTERING query.
To solve this adequately at the scale you are describing, I would duplicate the data into a query table, since it looks like you are concerned with organizing time-sensitive data and getting the most-recent data. A query table like this should do it:
CREATE TABLE tablebydaybucket (
  user_id varchar,
  article_id varchar,
  time timestamp,
  day_bucket varchar,
  PRIMARY KEY (day_bucket, time))
WITH CLUSTERING ORDER BY (time DESC);
Populate this table with your data, and then this query will work:
SELECT * FROM tablebydaybucket
WHERE day_bucket='20150519' AND time > '2015-05-19 15:38:49-0500' LIMIT 1;
This will partition your data by day_bucket and cluster your data by time. This way, you won't need ALLOW FILTERING or a secondary index. Also, your query is guaranteed to hit only one node, and Cassandra will not have to pull all of your rows back and apply your WHERE clause after-the-fact. And clustering on time in DESCending order helps your most-recent rows come back quicker.
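If you want to combine this with Carlo's consistency-level suggestion in the Java driver the question uses, here is a minimal sketch (the class name is illustrative, and in real code you would prepare the statement once and cache it rather than per call):

import java.util.Date;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Runs the query-table lookup above at consistency level ONE.
public class RecentRowFetcher {

    public static Row fetchMostRecent(Session session, String dayBucket, Date since) {
        PreparedStatement ps = session.prepare(
                "SELECT * FROM tablebydaybucket WHERE day_bucket = ? AND time > ? LIMIT 1");
        BoundStatement bound = ps.bind(dayBucket, since);
        bound.setConsistencyLevel(ConsistencyLevel.ONE); // Carlo's suggestion
        return session.execute(bound).one(); // null if nothing matched
    }
}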

Avoiding exploding indices and entity-group write-rate limits with appengine

I have an application in which there are Courses, Topics, and Tags. Each Topic can be in many Courses and have many Tags. I want to look up every Topic that has a specific Tag x and is in specific Course y.
Naively, I give each Topic a list of Course ids and Tag ids, so I can select * from Topic where tagIds = x && courseIds = y. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 I'm well over the 5000-entry limit.
I could just select * from Topic where tagIds = x, and then use a for loop to go through the result, choosing only Topics whose courseIds.contain(y). This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.
I could select __KEY__ from Topic where tagIds = x AND select __KEY__ from Topic where courseIds = y and find the intersection in my application code. If the sets are small this might not be unreasonable.
I could make a sort of join table, TopicTagLookup with a tagId and courseId field. The parent key of these entities would point to the relevant Topic. Then I would need to make one of these TopicTagLookup entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!
I could precalculate each query. A TopicTagQueryCache entity would hold a tagId, courseId, and a List<TopicId>. Then the query looks like select * from TopicTagQueryCache where tagId=x && courseId = y, fetching the list of topic ids, and then using a getAllById call on the list. Similar to #3, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
Appengine seems great for queries you can precalculate. I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
Options 4 and 5 are similar to the relation index pattern documented in this talk.
I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to manually maintain it (next paragraph), and retrieving the Topic entity will require an extra query (first you query TopicTagQueryCache to get the topic ID and then you need to actually retrieve the topic).
Updating the TopicTagQueryCache you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally - this "index" will just be stale for a short period of time when you update a Topic (at worst, your Topic will temporarily show up in results it should no longer show up in, and it may take a moment before it shows up in new results which it should show up in - this doesn't seem so bad). You can even do this update on the task queue (to make sure this potentially large number of database writes all succeeds, and so that you can finish the request quickly and your user isn't kept waiting).
As you said yourself, you should arrange your data to facilitate the scaling of your app. Thus, on the question of "What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?":
You can hold your own indexes of these sets by creating CourseRef and TopicRef entities which consist of a Key only, with the ID portion being the actual Key of the corresponding entity. These "Ref" entities live under a specific Tag, so there are no actual Key duplicates. The structure for a given Tag is thus: Tag\CourseRef...\TopicRef...
This way, given a Tag and a Course, you construct the Key Tag\CourseRef and do an ancestor query, which gets you a set of keys you can fetch. This is extremely fast as it is effectively direct access, and it should handle large lists of courses or topics without the issues of list properties.
This method will require you to use the DataStore API to some extent.
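A minimal sketch of that ancestor query with the low-level datastore API (kind names follow the Tag\CourseRef\TopicRef structure above; encoding the Topic's Key as the ref entity's websafe key name is an assumption for illustration):

import java.util.ArrayList;
import java.util.List;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;

// Build the Tag\CourseRef key and fetch all TopicRef keys underneath it.
public class TopicLookup {

    public static List<Key> topicKeysFor(String tagId, String courseId) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Key tagKey = KeyFactory.createKey("Tag", tagId);
        Key courseRefKey = KeyFactory.createKey(tagKey, "CourseRef", courseId);
        Query q = new Query("TopicRef").setAncestor(courseRefKey).setKeysOnly();
        List<Key> topicKeys = new ArrayList<>();
        for (Entity e : ds.prepare(q).asIterable()) {
            // assumed convention: the ref's name is the websafe Key of the Topic
            topicKeys.add(KeyFactory.stringToKey(e.getKey().getName()));
        }
        return topicKeys;
    }
}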
As you can see, this answers a specific question; the model will do no good for other types of set operations.
