I have a Java (Spring) REST API endpoint where I receive 3 data inputs and need to insert them into an Oracle database based on some unique ID using JdbcTemplate. But just to be sure nothing breaks, I first want to check whether I need to insert or just update.
1st Approach
Make a database call with a simple query like
SELECT COUNT(*) FROM TABLENAME WHERE ID='ABC' AND ROWNUM=1
And based on the value of count, make a separate database call for Insert or Update (the count would never exceed 1).
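As a rough sketch, the first approach with JdbcTemplate might look like the following (TABLENAME and the column names are the placeholders from the question, not a definitive implementation):

import org.springframework.jdbc.core.JdbcTemplate;

// Sketch of the count-then-decide approach using Spring's JdbcTemplate.
public void upsertCountFirst(JdbcTemplate jdbcTemplate, String id, String col1, String col2) {
    Integer count = jdbcTemplate.queryForObject(
            "SELECT COUNT(*) FROM TABLENAME WHERE ID = ? AND ROWNUM = 1",
            Integer.class, id);

    if (count != null && count == 0) {
        jdbcTemplate.update(
                "INSERT INTO TABLENAME (ID, COL1, COL2) VALUES (?, ?, ?)",
                id, col1, col2);
    } else {
        jdbcTemplate.update(
                "UPDATE TABLENAME SET COL1 = ?, COL2 = ? WHERE ID = ?",
                col1, col2, id);
    }
}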
2nd Approach
Make a single MERGE call using JdbcTemplate.update() that would look like
MERGE INTO TABLENAME
USING DUAL ON (ID='ABC')
WHEN MATCHED THEN UPDATE
SET COL1='A', COL2='B'
WHERE ID='ABC'
WHEN NOT MATCHED THEN
INSERT (ID, COL1, COL2) VALUES ('ABC','A','B')
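With JdbcTemplate that single round trip could be parameterized roughly as below; this is a sketch that assumes jdbcTemplate and the three input values are already in scope, and it moves the literals into a USING subquery so they can be bound once:

// One round trip: Oracle decides between UPDATE and INSERT inside the MERGE.
int rows = jdbcTemplate.update(
        "MERGE INTO TABLENAME t " +
        "USING (SELECT ? AS ID, ? AS COL1, ? AS COL2 FROM DUAL) src " +
        "ON (t.ID = src.ID) " +
        "WHEN MATCHED THEN UPDATE SET t.COL1 = src.COL1, t.COL2 = src.COL2 " +
        "WHEN NOT MATCHED THEN INSERT (ID, COL1, COL2) VALUES (src.ID, src.COL1, src.COL2)",
        id, col1, col2);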
From what I read on different sites, MERGE is a bit more costly in terms of CPU and reads, based on an experiment on this site. But that test was done purely for DB-script use with 2 tables, whereas my context of use is via an API call and using DUAL.
I also read on this question that MERGE could result in ORA-00001: unique constraint violated and some concurrency issues.
I want to do this on a table on which other operations may happen at the same time for different rows, with a very, very small chance of hitting the same row value. So I want to know which approach to follow for such a use case. I know this might be a common one, but I could not find the answer I'm looking for anywhere. I want to know the performance/reliability of both approaches.
Looking at code running in a concurrent-sessions environment, after each atomic statement we need to ask "what if another session has just broken our assumption?" and make adjustments accordingly.
Option 1. Count and decide INSERT or UPDATE
declare
v_count int;
begin
SELECT count(1) INTO v_count FROM my_table WHERE ...;
IF v_count = 0 THEN
-- what if another session inserted the same row just before this point?
-- this statement will fail
INSERT INTO my_table ...;
ELSE
UPDATE my_table ...;
END IF;
end;
Option 2. UPDATE, if nothing is updated - INSERT
begin
UPDATE my_table SET ... WHERE ...;
IF SQL%ROWCOUNT = 0 THEN
-- what if another session inserted the same row just before this point?
-- this statement will fail
INSERT INTO my_table ...;
END IF;
end;
Option 3. INSERT, if failed - UPDATE
begin
INSERT INTO my_table ...;
exception when DUP_VAL_ON_INDEX then
-- what if another session updated the same row just before this point?
-- this statement will override previous changes
-- what if another session deleted this row?
-- this statement will do nothing silently - is it satisfactory?
-- what if another session locked this row for update?
-- this statement will fail
UPDATE my_table SET ... WHERE ...;
end;
Option 4. use MERGE
MERGE INTO my_table
USING ... ON (...)
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
-- We have no place to put our "what if" question,
-- but unfortunately MERGE is not atomic,
-- it is just syntactic sugar for option #1
Option 5. use interface for DML on my_table
-- Create single point of modifications for my_table and prevent direct DML.
-- For instance, if client has no direct access to my_table,
-- use locks to guarantee that only one session at a time
-- can INSERT/UPDATE/DELETE a particular table row.
-- This could be achieved with a stored procedure or a view "INSTEAD OF" trigger.
-- Client has access to the interface only (view and procedures),
-- but the table is hidden.
my_table_v -- VIEW AS SELECT * FROM my_table
my_table_ins_or_upd_proc -- PROCEDURE (...) BEGIN ...DML on my_table ... END;
PROCEDURE my_table_ins_or_upd_proc(pi_row my_table%ROWTYPE) is
l_lock_handle CONSTANT VARCHAR2(100) := 'my_table_' || pi_row.id;
-- independent lock handle for each id allows
-- operating on different ids in parallel
begin
begin
request_lock(l_lock_handle);
-->> this code is exactly as in option #2
UPDATE my_table SET ... WHERE ...;
IF SQL%ROWCOUNT = 0 THEN
-- what if another session inserted the same row just before this point?
-- NOPE it cannot happen: another session is waiting for a lock on the line # request_lock(...)
INSERT INTO my_table ...;
END IF;
--<<
exception when others then
release_lock(l_lock_handle);
raise;
end;
release_lock(l_lock_handle);
end;
Not going too deep into low-level details here; see this article to find out how to use locks in the Oracle DBMS.
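On the Java side the client then calls only the procedure, never the table directly. A minimal Spring sketch, assuming a hypothetical scalar-parameter wrapper named my_table_ins_or_upd (a %ROWTYPE parameter cannot be bound directly from JDBC, so the parameter names below are assumptions):

import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.simple.SimpleJdbcCall;

// Call the interface procedure; the table itself stays hidden from the client.
public void upsertViaProcedure(JdbcTemplate jdbcTemplate, String id, String col1, String col2) {
    SimpleJdbcCall call = new SimpleJdbcCall(jdbcTemplate)
            .withProcedureName("my_table_ins_or_upd");   // hypothetical wrapper procedure
    call.execute(Map.of(
            "p_id", id,        // parameter names are assumptions
            "p_col1", col1,
            "p_col2", col2));
}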
Thus, we see that options 1, 2, 3 and 4 have potential problems that cannot be avoided in the general case. But they can be applied if safety is guaranteed by domain rules or particular design conventions.
Option 5 is bulletproof and fast, as it relies on the DBMS contracts.
Nevertheless, this is the payoff of a clean design, and it cannot be implemented if my_table is exposed directly and clients rely on straightforward DML against it.
I believe that performance is less important than data integrity, but let's mention that for completeness.
After proper consideration, it is easy to see that the order of the options by "theoretical" average performance is:
2 -> 5 -> (1,4) -> 3
Of course, the performance-measuring step comes after obtaining at least two properly working solutions, and should be done exclusively for a particular application under a given workload profile. And that is another story. At this point there is no need to bother about theoretical nanoseconds in synthetic benchmarks.
I guess by now we can see that there will be no magic. Somewhere in the application it is required to ensure that every id inserted into my_table is unique.
If the id values do not matter (95% of cases), just go with a SEQUENCE.
Otherwise, create a single point of manipulation for my_table (either in Java or in the DBMS schema as PL/SQL) and control the uniqueness there. If the application can guarantee that at most a single session at a time manipulates data in my_table, then it is possible to just apply option #2.
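Under that single-writer guarantee, option #2 maps directly onto JdbcTemplate; a sketch reusing the placeholder table and column names from the question:

import org.springframework.jdbc.core.JdbcTemplate;

// Option #2 from above: UPDATE first, INSERT only when nothing was updated.
// Safe only if at most one session at a time manipulates a given id.
public void upsertUpdateFirst(JdbcTemplate jdbcTemplate, String id, String col1, String col2) {
    int updated = jdbcTemplate.update(
            "UPDATE TABLENAME SET COL1 = ?, COL2 = ? WHERE ID = ?",
            col1, col2, id);

    if (updated == 0) {
        jdbcTemplate.update(
                "INSERT INTO TABLENAME (ID, COL1, COL2) VALUES (?, ?, ?)",
                id, col1, col2);
    }
}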
Related
I'm observing a strange situation with an INSERT INTO command.
I'll try to explain the situation from my point of view.
There is a TEMP_LINKS table in my database and the application inserts data into it.
Say the query lives in insert1.sql:
insert into TEMP_LINK (ID, SIDE)
select ID, SIDE
from //inner query//
group by ID, SIDE;
commit;
and there is a java1 class which executes it
...
executeSqlScript(getResource("path-to-query1"));
...
After that, another class, java2, makes another insert into the same TEMP_LINK table
...
executeSqlScript(getResource("path-to-query2"));
...
where query2 looks like
insert into TEMP_LINK (ID, SIDE)
select
ID, 'B'
from (
select ID
from ...tables
where ..conditions
minus (
select ID
from ..tables
union
select ID
from TEMP_LINKS
)
);
commit;
Both java1 and java2 are executed in different threads, and java1 finishes earlier than java2.
But from time to time the second insert (from query2) doesn't insert any data at all. I see Update count 0 in the log, and TEMP_LINKS contains only the data from query1.
If I run the application again, the issue disappears and both queries insert their data properly.
Earlier I tried putting both queries into one SQL file, but the issue appeared then as well.
So maybe someone has ideas about what I should do, because I'm out of them. One interesting fact: the SQL MINUS operation is used only once, in that query2.
A big difference between Oracle and SQL Server is that Oracle NEVER blocks a read. This is true even when records are locked. The following is a simplified explanation. Oracle uses the System Change Number (SCN) at the time a transaction starts to determine the state of the database for that transaction. All sorts of things can happen (inserts, updates, and deletes), but the transaction sees the database as it was at the start of that transaction. Changes only matter at the point where the commit/rollback is executed.
In your situation, if the second query starts before the first has committed, the second won't see any changes the first has made, even after the first commits. You need to synchronize those transactions. The easiest way is to combine them into a single sequential execution. Oracle has many more complex synchronization methods, but I would not go that route in this situation.
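As a minimal sketch of the sequential variant, reusing the helper calls from the question (whatever executeSqlScript already does in your project), so that query2 only starts after query1's commit has completed:

// Run both scripts from the same thread, one after the other;
// query2 then starts only after query1 has committed.
executeSqlScript(getResource("path-to-query1"));
executeSqlScript(getResource("path-to-query2"));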
If I open a single JDBC connection (for Oracle), and execute multiple select queries, will it be less efficient than calling a procedure that executes those queries, and returns the result in cursors?
Edit: Sample queries are:
select id, name from animals;
select * from animal_reservoir where animal_id=id;
(The actual first query would be quite complicated, and the id returned would be used as an input multiple times in the second query. As such, the first query would be inefficient to use as a subquery in the second query. Also, the queries can't be combined.)
The two main differences are
fewer roundtrips (important if there are many small queries, otherwise not so much)
no need to send "intermediate" results (that are only needed for the next query, but not in the end) back to the client
How much of an impact this has completely depends on the application.
And often there may be other alternatives to consider as well, such as issuing a different kind of query in the first place (someone mentioned a JOIN in the comments), caching, indexing, data denormalization, or ...
As usual, do what feels most natural first and optimize when you find there is an issue.
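For completeness, a plain-JDBC sketch of the procedure variant, assuming a hypothetical procedure get_animal_report with a single SYS_REFCURSOR OUT parameter that runs both queries server-side; everything here except the java.sql and oracle.jdbc.OracleTypes classes is an assumption:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;
import oracle.jdbc.OracleTypes;

// Single round trip: the procedure runs both queries server-side and hands back
// only the final rows through a ref cursor.
static void readAnimalReport(DataSource dataSource) throws SQLException {
    try (Connection con = dataSource.getConnection();
         CallableStatement cs = con.prepareCall("{call get_animal_report(?)}")) {
        cs.registerOutParameter(1, OracleTypes.CURSOR);
        cs.execute();
        try (ResultSet rs = (ResultSet) cs.getObject(1)) {
            while (rs.next()) {
                // process each final row; the intermediate ids never leave the database
            }
        }
    }
}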
You haven't provided SQL queries that would require a procedure.
You can do it in 1 SQL query with multiple "inner" queries using a WITH clause, for example:
with animal_list as (
select id, name from animals
)
select * from animal_reservoir, animal_list where animal_id = animal_list.id;
Sorry if the question title is misleading or not accurate enough, but I didn't see how to ask it in one sentence.
Let's say we have a table where the PK is a string (numbers from '100,000' to '999,999'; the comma is for readability only).
Let's also say the PK is not used sequentially.
Now I want to insert a new row into the table using java.sql and show the PK of the inserted row to the user. Since the PK is not generated by default (e.g. inserting values without the PK didn't work, and something like generated_keys is not available in the given environment), I've seen two different approaches:
in two different statements: first find a possible next key, then try to insert (and expect that another transaction may have used the same key in the time between the two statements). Is it valid to retry until success, or could some SQL trick with transaction settings/locks help here? How can I realize that in java.sql?
for me, that's a disappointing solution because of the non-deterministic behaviour (perhaps you can convince me of the contrary), so I searched for another one:
insert with a nested select statement that looks up the next possible PK. Looking at other answers on generating the PK myself, I came close to a working solution with this statement (casts from string to int left out):
INSERT INTO mytable (pk,othercolumns)
VALUES(
(SELECT MIN(empty_numbers.empty_number)
FROM (SELECT t1.pk + 1 as empty_number
FROM mytable t1
LEFT OUTER JOIN mytable t2
ON t1.pk + 1 = t2.pk
WHERE t2.pk IS NULL
AND t1.pk > 100000)
as empty_numbers),
othervalues);
That works like a charm and is (afaik) a more predictable and stable solution than my first approach, but: how can I possibly retrieve the generated PK from that statement? I've read that there is no way to return the inserted row (or any of its columns) directly, and most of the Google results I've found point to returning generated keys. Even though my key is generated, it's not generated by the DBMS directly, but by my statement.
Note that the DBMS used in development is MSSQL 2008 and the productive system is currently DB2 on AS/400 (I don't know which version), so I have to stick close to the SQL standards. I can't change the db structure in any way (e.g. use generated keys; I'm not sure about stored procedures).
DB2 for i allows generated keys, stored procedures, user defined functions - pretty much all of the things SQL Server can do. The exact implementation is different, but that's what manuals are for :-) Ask your admin what version of IBM i they're running, then hit up the Infocenter for specifics.
The constraining factor is that you can't alter the database design; you are stuck with apparently multiple processes trying to INSERT while backfilling 'holes' in the existing keyspace. That's a very tough nut to crack. Because you can't change the DB design, there's nothing to be done except to allow for and handle PK collisions. There's no SQL trick that'll help - the SQL way is to have the DB generate the PK, not the application.
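A hedged java.sql sketch of "allow for and handle PK collisions": compute a candidate key with the gap query from the question, attempt the INSERT, and retry on a duplicate-key failure. Table and column names are the question's; casts and the empty-table case are omitted as in the original statement, the retry limit is arbitrary, and some drivers report duplicates as a plain SQLException (check SQLState 23505) rather than the subclass caught here:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

// Returns the key actually inserted, retrying when another transaction grabbed it first.
static String insertWithRetry(Connection con, String otherValue) throws SQLException {
    String findGap =
        "SELECT MIN(t1.pk + 1) FROM mytable t1 " +
        "LEFT OUTER JOIN mytable t2 ON t1.pk + 1 = t2.pk " +
        "WHERE t2.pk IS NULL AND t1.pk > 100000";

    for (int attempt = 0; attempt < 10; attempt++) {
        long candidate;
        try (PreparedStatement ps = con.prepareStatement(findGap);
             ResultSet rs = ps.executeQuery()) {
            rs.next();
            candidate = rs.getLong(1);
        }
        try (PreparedStatement ins = con.prepareStatement(
                "INSERT INTO mytable (pk, othercolumns) VALUES (?, ?)")) {
            ins.setString(1, String.valueOf(candidate));   // pk is stored as a string
            ins.setString(2, otherValue);
            ins.executeUpdate();
            return String.valueOf(candidate);              // we know the PK we just used
        } catch (SQLIntegrityConstraintViolationException dup) {
            // another transaction used the same candidate; loop and try the next gap
        }
    }
    throw new SQLException("Could not find a free pk after 10 attempts");
}

As a side effect, this also answers the "how do I retrieve the generated PK" part for the retry approach: the application computed the key itself, so it already has it.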
There are several alternatives to suggest, in the event that some change is allowed. All have issues needing a workaround, but that is unavoidable at this point due to the application design.
Create a UDF that all INSERT clients use to retrieve the next available PK. Use a table of 'available numbers' and delete them as they are issued.
Pre-INSERT all the available numbers. Force clients to do an UPDATE. Make them FETCH...FOR UPDATE where (rest of data = not populated). This will lock the row, avoiding collisions as well as make the PK immediately available.
Leave the DB and the other application programs using this table as-is, but have your INSERT process draw from a block of keys that's been set aside for your use. Keep the next available number in an SQL SEQUENCE or an IBM i data area. This only works if there's a very large hole in the keyspace that's not yet used.
Suppose that I track an 'event' a user takes on a website, events can be things like:
viewed homepage
added item to cart
checkout
paid for order
Now each of those events are stored in a database like:
session_id | event_name | created_date | ...
So now I want to build a report to display a particular funnel that I will define like:
Step#1 event_n
Step#2 event_n2
Step#3 event_n3
So this particular funnel has 3 steps, and each step is associated with ANY event.
How can I build a report for this now given the above data I have?
Note: just want to be clear, I want to be able to create any funnel that I define, and be able to create a report for it.
The most basic way I can think of is:
get all events for each step I have in my database
step#1 will be: x% of people performed event_n
Now I will have to query the data for step#2 for those who ALSO performed step#1, and display the %
Same for step#3, with the condition that step#2 was also performed
I'm curious how these online services display these types of reports in a hosted SaaS environment. Does map-reduce make this easier somehow?
First the answer, using standard SQL, given your hypothesis:
there is a table EVENTS with a simple layout:
EVENTS
-----------------------------
SESSION_ID, EVENT_NAME, TMST
To get the sessions that performed step#1 at some time:
-- QUERY 1
SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID;
Here I make the assumption that event1 can happen more than once per session. The result is a list of unique sessions that demonstrated event1 at some time.
In order to get step2 and step3, I can just do the same:
-- QUERY 2
SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID;
-- QUERY 3
SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event3' GROUP BY SESSION_ID;
Now, you want to select sessions that performed step1, step2 and step3 - in that order.
More precisely, you need to count sessions that performed step1, then count sessions that performed step2, then count sessions that performed step3.
Basically we just need to combine the 3 queries above with left joins to list the sessions that entered the funnel and which steps they performed:
-- FUNNEL FOR S1/S2/S3
SELECT
Q1.SESSION_ID,
Q1.TMST IS NOT NULL AS PERFORMED_STEP1,
Q2.TMST IS NOT NULL AS PERFORMED_STEP2,
Q3.TMST IS NOT NULL AS PERFORMED_STEP3
FROM
-- QUERY 1
(SELECT SESSION_ID,MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID) AS Q1
LEFT JOIN
-- QUERY 2
(SELECT SESSION_ID,MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID) AS Q2
LEFT JOIN
-- QUERY 3
(SELECT SESSION_ID,MIN(TMST) AS TMST FROM EVENTS WHERE EVENT_NAME='event3' GROUP BY SESSION_ID) AS Q3
-- Q2 & Q3
ON Q2.SESSION_ID=Q3.SESSION_ID AND Q2.TMST<Q3.TMST
-- Q1 & Q2
ON Q1.SESSION_ID=Q2.SESSION_ID AND Q1.TMST<Q2.TMST
The result is a list of unique sessions that entered the funnel at step1 and may have continued to step2 and step3... e.g.:
SESSION_ID_1,TRUE,TRUE,TRUE
SESSION_ID_2,TRUE,TRUE,FALSE
SESSION_ID_3,TRUE,FALSE,FALSE
...
Now we just have to compute some stats, for example:
SELECT
STEP1_COUNT,
STEP1_COUNT-STEP2_COUNT AS EXIT_AFTER_STEP1,
STEP2_COUNT*100.0/STEP1_COUNT AS PERCENTAGE_TO_STEP2,
STEP2_COUNT-STEP3_COUNT AS EXIT_AFTER_STEP2,
STEP3_COUNT*100.0/STEP2_COUNT AS PERCENTAGE_TO_STEP3,
STEP3_COUNT*100.0/STEP1_COUNT AS COMPLETION_RATE
FROM
(-- QUERY TO COUNT session at each step
SELECT
SUM(CASE WHEN PERFORMED_STEP1 THEN 1 ELSE 0 END) AS STEP1_COUNT,
SUM(CASE WHEN PERFORMED_STEP2 THEN 1 ELSE 0 END) AS STEP2_COUNT,
SUM(CASE WHEN PERFORMED_STEP3 THEN 1 ELSE 0 END) AS STEP3_COUNT
FROM
[... insert the funnel query here ...]
) AS COMPUTE_STEPS
Et voilà!
Now for the discussion.
First point: the result is pretty straightforward once you take the "set" (or functional) way of thinking rather than the "procedural" approach. Don't visualize the database as a collection of fixed tables with columns and rows... that is how it is implemented, but it is not the way you interact with it. It's all sets, and you can arrange the sets the way you need!
Second point: that query will automatically be optimized to run in parallel if you are using an MPP database, for instance. You don't even need to program the query differently, use map-reduce or whatever... I ran the same query on my test dataset of more than 100 million events and got results in seconds.
Last but not least, the query opens endless possibilities. Just group the results by referrer, keywords, landing page, or user information, and analyse which provides the best conversion rate, for instance!
The core problem in the way you are thinking about this is that you are thinking in a SQL/table type of model. Each event is one record. One of the nice things about NoSQL technologies (which you feel an inkling towards) is that you can naturally store the data as one session per record. Once you store the data in a session-based manner, you can write a routine that checks whether that session complies with the pattern or not. No need to do joins or anything, just a loop over the list of transactions in a session. Such is the power of semi-structured data.
What if you store your sessions together? Then, all you have to do is iterate through each session and see if it matches.
This is a fantastic use case for HBase, in my opinion.
With HBase, you store the session ID as the row key, then each of the events as values with the time stamp as the column qualifier. What this leaves you with is data that is grouped together by session ID, then sorted by time.
Ok, so now you want to figure out what % of sessions enacted behavior 1, then 2, then 3. You run a MapReduce job over this data. The MapReduce job will provide you one session per row key/value pair. Write a loop over the data to check whether it matches the pattern. If it does, count + 1; if not, don't.
Without going all-out with HBase, you can use MapReduce to sessionize your unorganized data at rest. Group by the session ID; then, in the reducer, you'll have all of the events associated with that session grouped together. Now you're basically where you were with HBase, and you can write a method in the reducer that checks for the pattern.
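The per-session check itself is just a scan over that session's time-ordered events. A sketch of what such a reducer-side method might look like (pure Java, independent of HBase or Hadoop; names are illustrative):

import java.util.List;

// Returns true if the session performed the funnel steps in this order
// (other events may occur in between).
static boolean matchesFunnel(List<String> eventsInTimeOrder, List<String> funnelSteps) {
    int nextStep = 0;
    for (String event : eventsInTimeOrder) {
        if (nextStep < funnelSteps.size() && event.equals(funnelSteps.get(nextStep))) {
            nextStep++;
        }
    }
    return nextStep == funnelSteps.size();
}

For the three-step funnel above you would call it with the step names in order and count how many sessions return true.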
HBase might be overkill if you don't have a ridiculous amount of data. Any sort of database that can store data hierarchically will be good in this situation. MongoDB, Cassandra, Redis all come to mind and have their strengths and weaknesses.
I recently released an open source Hive UDF to do this: hive-funnel-udf
It's pretty simple to use for this sort of funnel analysis task, you can just write Hive, no need to write custom Java MapReduce code.
This will only work if you are using Hive/Hadoop to store and query your data though.
Currently I have a database that is not managed by me and I can't make any changes to it; the id field is an unsigned smallint(2), which gives you up to 65535 id entries.
My problem is that I need to reuse the ids because of the above limitation. How could I get the next usable ID in order, or what would you do to manage the inserts with the above limitation?
Check if 1 is free. If not:
SELECT MIN(a.id) + 1 AS smallestAvailableId
FROM your_table AS a
LEFT JOIN your_table AS a2
ON a2.id = a.id + 1
WHERE a2.id IS NULL
From the tags I deduce that you need the id in Java.
I personally would avoid joining the table with itself. Since you have at most 64K rows, I would select the ids from the table into Java and search for a free id in Java. One way to search for gaps is by sorting the array first (either in SQL or in Java); finding a gap then becomes trivial.
If you do this repeatedly, you can cache the array and avoid having to run an SQL statement every time you need an id.
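A sketch of that Java-side search with plain JDBC (your_table and the id column are the placeholder names from the other answer, and ids are assumed to start at 1):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Load the used ids in sorted order, then return the smallest unused one (>= 1).
static int smallestAvailableId(Connection con) throws SQLException {
    int candidate = 1;
    try (Statement st = con.createStatement();
         ResultSet rs = st.executeQuery("SELECT id FROM your_table ORDER BY id")) {
        while (rs.next()) {
            int id = rs.getInt(1);
            if (id > candidate) {
                break;           // found a gap before this id
            }
            candidate = id + 1;  // contiguous so far, keep moving up
        }
    }
    return candidate;            // max(id)+1 when there are no gaps
}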
Regardless of what you do, if there are multiple clients writing to the database, you have to be prepared to deal with race conditions, where multiple clients attempt to use the same id. Your code would need to either use locking or be able to recover gracefully by re-trying the failed insert with a different id (I assume there is a uniqueness constraint on the id column).
Whichever approach you take is likely to cause you problems because of race conditions unless you know you will have exactly one client accessing the db at any single moment.
To answer your question: what do you consider a "usable" id? Please shed some light on that. Until all ids have been used, a simple
SELECT MAX(id) + 1 FROM table;
should do. If you establish a criterion for "usable" ids, such as, for example, reusing all ids that have been flagged as old, then you can do:
SELECT MIN(id) FROM table WHERE is_old = 1;
Then just unflag the selected id.
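If several clients can claim ids concurrently, the select-and-unflag should happen atomically. A plain-JDBC sketch under the assumptions that the table is actually named your_table (the generic "table" above) and that MySQL-style LIMIT/FOR UPDATE syntax is available; error-handling and rollback on failure are left out:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Claim one reusable id: lock the chosen row, unflag it, commit, and return the id.
static Integer claimReusableId(Connection con) throws SQLException {
    con.setAutoCommit(false);
    try (PreparedStatement pick = con.prepareStatement(
            "SELECT id FROM your_table WHERE is_old = 1 ORDER BY id LIMIT 1 FOR UPDATE");
         ResultSet rs = pick.executeQuery()) {
        if (!rs.next()) {
            con.rollback();
            return null;                       // no reusable id available right now
        }
        int id = rs.getInt(1);
        try (PreparedStatement unflag = con.prepareStatement(
                "UPDATE your_table SET is_old = 0 WHERE id = ?")) {
            unflag.setInt(1, id);
            unflag.executeUpdate();
        }
        con.commit();
        return id;
    }
}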