I am curious about the following, on two levels.
Is it considered bad practice, or worse for performance, to keep trying to insert duplicate data and let the DBMS enforce the entity's constraints to deny those inserts? Or is it better to do some sort of SELECT COUNT(1) first and only insert if the count is not 1?
Assuming, from the first question, that it is more efficient from the DBMS's perspective to let the constraints do the work and not make multiple calls: will the application code (Java, .NET, etc.) suffer a greater performance impact because it unnecessarily enters an exception block, even though the exception will not be handled?
In terms of performance, you are better off using the in-database feature to enforce constraints.
When you attempt to enforce the constraint outside the database you have two issues. The first is that you have overhead of running a separate query, returning the results, and performing logic -- several database operations. Using the constraint, on the other hand, might do the same work, but it does it all inside the database without the extra overhead of passing things back and forth.
Second, when you attempt to enforce the constraint yourself, you introduce race conditions. This means that you might run the count() and it returns 0. Another transaction, meanwhile, inserts the value and then your insert fails anyway. You really want to avoid such race conditions. One solution, of course, is to put all the logic in a single transaction. This introduces its own overhead.
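To make that concrete, here is a minimal Java/JDBC sketch of the constraint-enforced approach -- one round trip, with the duplicate case handled by catching the violation. The customer table and its columns are placeholders, and whether your driver raises the specific SQLIntegrityConstraintViolationException subclass or a plain SQLException with a constraint-violation SQLState depends on the database and driver.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class ConstraintInsert {

    // Attempts the insert and lets the database's unique constraint decide.
    // Returns true if the row was inserted, false if it already existed.
    static boolean insertIfAbsent(Connection conn, int id, String name) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO customer (id, name) VALUES (?, ?)")) {
            ps.setInt(1, id);
            ps.setString(2, name);
            ps.executeUpdate();
            return true;
        } catch (SQLIntegrityConstraintViolationException e) {
            // Duplicate key: the constraint did its job, no separate SELECT needed.
            return false;
        }
    }
}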
If you have to do a select and then insert each time that will be considerably slower as it requires two round trips. The cost to check an exception is nothing compared to the time required to do a database statement.
Some databases allow "upsert" statements that update the row, or insert it if it doesn't exist, in a single call.
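The exact syntax varies by vendor (MERGE in the SQL standard and Oracle, INSERT ... ON DUPLICATE KEY UPDATE in MySQL, INSERT ... ON CONFLICT in PostgreSQL). A sketch of the MySQL flavour through JDBC, assuming an already-open java.sql.Connection conn and a placeholder customer table:

String upsert =
    "INSERT INTO customer (id, name) VALUES (?, ?) " +
    "ON DUPLICATE KEY UPDATE name = VALUES(name)";
try (PreparedStatement ps = conn.prepareStatement(upsert)) {
    ps.setInt(1, 42);
    ps.setString(2, "Alice");
    ps.executeUpdate();   // one round trip: inserts the row or updates the existing one
}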
Really if you are doing this a lot you need to step back and think about the overall algorithm and architecture. Why are you constantly trying to insert values that already exist and is there something you can change so that it isn't happening at all - rather than handling the failure at the database point.
You just increased the round-trip time: one trip to check and another to insert if there is no duplicate.
Maybe that is not good. But what about catching the exception and handling it? In terms of time, check-then-insert versus catching an exception makes quite a difference.
Edit:
One issue that arises with check-then-insert is synchronization. Some other thread may update or insert the data after your thread has checked that there is no duplicate, and the result will be an exception at the end anyway. So you must handle that case; you cannot leave it out.
Personally, I would probably write a cursor to check that a record exists. To me it is a cleaner approach than relying on an Exception in order to control your program flow. Yes, there is a performance implication, but it is likely to be marginal. Clean, readable code would take the priority for me.
I am hitting a REST API to get data from a service. I transform this data and store it in a database. I will have to do this on some interval, 15 minutes, and then make sure this database has latest information.
I am doing this in a Java program. I am wondering if it would be better, after I have queried all data, to do
1. SELECT statements and compare against the transformed data, then do UPDATEs (DELETE all records associated with what changed and INSERT the new ones), OR
2. DELETE ALL and INSERT ALL every time.
Option 1 has the potential for far fewer write transactions, at the cost of a guaranteed SELECT on every record because we are comparing, though probably not many UPDATEs since I don't expect the data to change much. Its downside is having to compare every record to detect a change.
I am planning on doing this using Spring Boot, JPA layer and possibly postgres
The short answer is "It depends. Test and see for your usecase."
The longer answer: this feels like premature optimization, and the general advice on premature optimization is "don't." Especially in the DB realm, what would be best in one situation can be awful in another. There are a number of factors, including (but not limited to) schema, indexes, HDD speed, concurrency, amount of data, network speed, latency, and so on. A reasonable process:
1. First, get it working
2. Identify what's wrong → get a metric
3. Measure against that metric
4. Make any obvious or necessary changes
5. Repeat steps 1 through 4 as appropriate
The first question I would ask of you is "What does better mean?" Once you define that, the path forward will likely become clearer.
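If it helps, here is a minimal sketch of steps 2 and 3: pick wall-clock time as the metric and measure both strategies against it. The two sync methods are hypothetical stand-ins for your two options.

public class SyncBenchmark {
    public static void main(String[] args) {
        // Hypothetical stand-ins for the two strategies under test.
        Runnable syncByCompareAndUpdate = () -> { /* option 1: SELECT, compare, UPDATE changed rows */ };
        Runnable syncByDeleteAllAndInsertAll = () -> { /* option 2: DELETE ALL, then INSERT ALL */ };

        long t0 = System.nanoTime();
        syncByCompareAndUpdate.run();
        long compareMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        syncByDeleteAllAndInsertAll.run();
        long replaceMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.printf("compare+update: %d ms, delete all+insert all: %d ms%n",
                compareMs, replaceMs);
    }
}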
My Java code reads an Excel file and writes (inserts) data from it into an Oracle database.
For example, I need to read similar cells in 2000 rows of the Excel file; my code reads them, inserts them into the database, and then commits.
The first thousand or so rows insert very fast, but the next thousand take a very long time.
Possibly the reason is a lack of memory.
So I am thinking of committing frequently while the data is being loaded into the database (e.g. commit after every 50 rows read).
Is it good practice to do this, or are there other ways to solve the problem?
Commits are for atomic operations in the database. You don't just throw them around because you feel like it. Each transaction is generally (depending on isolation level, but assuming serial isolation) a distinct, all-or-nothing operation.
If you don't know what is causing the database transaction to take a "long time", you should read the logs or talk to someone who knows how to diagnose the cause of the "slowdown" and remedy it. The most likely reason is bad configuration.
The bottom line is, people have transactions that insert 100,000 or even millions of rows as a single transaction without causing issues. And generally, it is better not to commit often for performance reasons.
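As an illustration of the "one big transaction" route in Java, here is a hedged JDBC sketch that batches all 2000 rows and commits once at the end; the table, columns, and connection details are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/orcl", "user", "password")) {
            conn.setAutoCommit(false);                       // keep everything in one transaction
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO imported_rows (row_no, cell_value) VALUES (?, ?)")) {
                for (int i = 1; i <= 2000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "value from Excel row " + i);
                    ps.addBatch();                           // queue the insert client-side
                    if (i % 500 == 0) {
                        ps.executeBatch();                   // send a batch, but do not commit yet
                    }
                }
                ps.executeBatch();
                conn.commit();                               // a single commit for all 2000 rows
            } catch (Exception e) {
                conn.rollback();                             // all-or-nothing, as described above
                throw e;
            }
        }
    }
}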
Databases always have to be consistent, i.e. only commit if your data is consistent, even if your program crashes afterwards.
(If you don't need that consistency, then why do you use a DB?)
PS: You won't go out of memory that fast.
I know (or think I know) that using things like prepared statements can help future executions of the same query run faster. However, I was wondering: if you're using prepared statements but the actual values are the same every time, will the database optimize further based on the value?
To give a little more context, I want to test performance for a service request that uses an underlying database. The easy route would be to send in the same data each time. The more arduous route would be to ensure the data values were different each time. However, in either case, the same SQL query would be generated -- just the values would be different. So, will these scenarios end up testing the same thing or something different because of potential DB optimization?
I've tried to research this topic but I feel like a lot of what I'm reading is over my head. Any good links for someone that knows little about DB optimization would also be welcomed in addition to the central question.
It depends on exactly what you are doing and measuring. I would expect, though, that you'd need to use different values in order to get realistic results.
Caching
If you send the same values every time, you can probably guarantee that the particular row(s) that you're interested in are always going to be cached (in the buffer cache, in the file system cache, in the SAN cache, etc.) which is probably not terribly realistic if the set of possible inputs is large. On the other hand, if there are a small number of potential inputs and you're reasonably confident that the rows of interest will always be cached (for example, if you know that some other activity that takes place just before your service is called will cause the data you're interested in to be cached in memory before your service is called) then perhaps this is a realistic assumption.
Optimization
Ignoring caching, we can look at how the optimizer would treat the two cases. If you are generating SQL queries with embedded literals (a bad practice that is particularly harmful in Oracle but one that is very common), then you are generating different SQL statements. As far as Oracle is concerned
SELECT *
FROM emp
WHERE deptno = 10
is a completely different statement from
SELECT *
FROM emp
WHERE deptno = 20
There are some settings (i.e. cursor_sharing) you can tweak to ask Oracle to treat these two as identical queries (by having Oracle force them into using bind variables) but that is not without its own downsides and is generally only recommended when you're trying to apply a band-aid to a poorly written application while you work on refactoring the application to use bind variables properly.
Assuming that you are generating queries using bind variables in your application, preparing the statement, and then binding different values before executing the query multiple times, i.e.
SELECT *
FROM emp
WHERE deptno = :1
then you get into the realm of histograms, bind variable peeking, and adaptive cursor sharing. This can get pretty involved and depends heavily on the version of Oracle you're using, the edition you're using, and how you've configured the optimizer to work. I'll try to give a simplified high-level overview here -- if you want to delve much deeper into one of these, we'll probably want a separate question.
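For reference, "bind different values before executing the query multiple times" looks roughly like this in Java/JDBC: the statement text is prepared once and only the bind value changes (this assumes the classic EMP demo table and placeholder connection details).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BindVariableDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/orcl", "scott", "tiger");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT * FROM emp WHERE deptno = ?")) {   // one statement text for all executions
            for (int deptno : new int[] {10, 20, 30}) {
                ps.setInt(1, deptno);                           // only the bind value changes
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("ename"));
                    }
                }
            }
        }
    }
}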
Histograms
By default, the optimizer assumes that data is equally spaced and equally likely. So, for example, if the deptno column has 50 distinct values, the optimizer assumes by default that each value is equally likely. That's probably a pretty reasonable assumption for most columns but it's obviously not reasonable for all columns. If I have a table with all active duty military members, for example, and one of the columns is birth_year, there will be far more rows with a birth_year of 1994 (20 years ago) than 1934 (80 years ago). In these cases, you gather histograms on the column in question in order to tell the optimizer that the data isn't evenly distributed and to let the optimizer gather information about which values are more common and how common they are.
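If you wanted to tell the optimizer about a skewed column like that, gathering statistics with a histogram from Java might look like the sketch below; the MILITARY_MEMBERS table is hypothetical, and DBMS_STATS is called through an anonymous PL/SQL block.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class GatherHistogram {
    public static void main(String[] args) throws Exception {
        // Connection details and the MILITARY_MEMBERS table are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/orcl", "scott", "tiger")) {
            String plsql =
                "BEGIN "
              + "  DBMS_STATS.GATHER_TABLE_STATS("
              + "    ownname    => USER, "
              + "    tabname    => 'MILITARY_MEMBERS', "
              + "    method_opt => 'FOR COLUMNS BIRTH_YEAR SIZE 254'); "
              + "END;";
            try (CallableStatement cs = conn.prepareCall(plsql)) {
                cs.execute();   // asks Oracle to build a histogram on BIRTH_YEAR
            }
        }
    }
}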
The optimizer doesn't care about the values you are passing for your bind variable values unless there is a histogram on one of the columns in your predicate (I'll ignore for the moment the possibility that you are passing a value that is out of range).
Bind variable peeking
If you do have a histogram on one or more columns, then Oracle (9.1 and later if memory serves) will "peek" at the first value that is passed in for a bind variable and use that value with the histogram to determine the best plan for all subsequent executions. This works reasonably well the vast majority of the time but it occasionally leads to hair-pullingly painful problems (and much swearing) when Oracle peeks at a "bad" value and generates a plan that is efficient for that one execution but terrible for all future executions. This is summed up by Tom Kyte's story about the database that has to be restarted if it's rainy on a Monday morning. If you have a histogram on the column and different values that you might pass in would likely benefit from different query plans, you'd likely want to take bind variable peeking into consideration to determine if passing in values in a different order created any performance issues.
Adaptive cursor sharing
In recent versions (if memory serves 11.1 and later) and depending on your configuration, Oracle can use adaptive cursor sharing to maintain multiple query plans for a single statement and to use the most appropriate version for the particular bind variable value that is passed in. This is a much more sophisticated version of bind variable peeking that peeks for each set of values you pass in and figures out whether it is close enough to some other set of values to use the previously generated plan or whether it needs to compute a new plan for the new set of values. Figuring out what constitutes "close enough" and how this interacts with various features for ensuring plan stability is a rather involved topic in its own right.
You could use DB caching:
http://www.oracle.com/technetwork/articles/sql/11g-caching-pooling-088320.html
If the app is making network round trips and calculating results, that will still eat considerable time.
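For example, Oracle's server result cache (one of the features covered in that article) can be requested with a hint. A small sketch, assuming the classic EMP demo table, an open Connection conn, and the usual java.sql imports:

// The RESULT_CACHE hint asks Oracle to keep this query's result set in the
// server-side result cache, so identical executions can skip re-reading the table.
String sql = "SELECT /*+ RESULT_CACHE */ ename, sal FROM emp WHERE deptno = ?";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setInt(1, 10);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            System.out.println(rs.getString("ename") + " " + rs.getInt("sal"));
        }
    }
}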
I want the DBMS to help me gain speed when doing a lot of inserts.
Today I do an INSERT Query in Java and catch the exception if the data already is in the database.
The exception I get is :
SQLite Exception : [19] DB[1] exec() columns recorddate, recordtime are not unique.
If I get an exception I do a SELECT Query with the primary keys (recorddate, recordtime) and compare the result with the data I am trying to insert in Java. If it is the same I continue with next insert, otherwise I evaluate the data and decide what to save and maybe do an UPDATE.
This process takes time and I would like to speed it up.
I have thought of INSERT IF NOT EXIST, but this just ignores the insert if there is any data with the same primary keys, am I right? And I want to make sure it is exactly the same data before I ignore the insert.
I would appreciate any suggestions for how to make this faster.
I'm using Java to handle large amount of data to insert into a SQLite database (SQLite v. 3.7.10). As the connection between Java and SQLite I am using sqlite4java (http://code.google.com/p/sqlite4java/)
I do not think letting the DBMS handle more of that logic would be faster, at least not with plain SQL; as far as I can tell there is no "create or update" there.
When handling lots of entries, latency is often an important issue, especially with databases accessed over a network, so at least in that case you want to use bulk operations wherever possible. Even if it were available, a "create or update" instead of a select followed by an update or insert would at best only halve the latency.
I realize that is not what you asked for, but I would try to optimize in a different way: process chunks of data, select all of them into a map, then partition the input into creates, updates and ignores. That way the ignores are almost free, and the further lookups are guaranteed to happen in memory. It is unlikely that the DBMS can be significantly faster.
If you are unsure whether that is the right approach for you, profiling the overhead should help.
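A rough sketch of that chunked approach, using plain JDBC; the records table, its payload column, and the Row type are placeholders built around the keys mentioned in the question (recorddate, recordtime).

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkedSync {

    // Placeholder row type; payload stands in for the non-key columns.
    record Row(String recordDate, String recordTime, String payload) {}

    static void sync(Connection conn, List<Row> incoming) throws Exception {
        // 1. Load the existing rows for this chunk into memory, keyed by primary key.
        Map<String, Row> existing = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT recorddate, recordtime, payload FROM records");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                Row r = new Row(rs.getString(1), rs.getString(2), rs.getString(3));
                existing.put(r.recordDate() + "|" + r.recordTime(), r);
            }
        }
        // 2. Partition the input into creates, updates and ignores -- all in memory.
        for (Row in : incoming) {
            Row old = existing.get(in.recordDate() + "|" + in.recordTime());
            if (old == null) {
                // not present yet: queue it for a batched INSERT
            } else if (!old.payload().equals(in.payload())) {
                // present but different: evaluate and queue an UPDATE
            }
            // identical: ignore, no database call at all
        }
    }
}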
Wrap all of your inserts and updates into a transaction. In SQL this will be written as follows.
BEGIN;
INSERT OR REPLACE INTO Table(Col1,Col2) VALUES(Val1,Val2);
COMMIT;
There are two things to note here. First, database pages and commits will not be written to disk until COMMIT is called, which speeds up your queries significantly. Second, the INSERT OR REPLACE syntax does precisely what you want for UNIQUE or PRIMARY KEY fields.
Most database wrappers have a special syntax for managing transactions. You can certainly execute a query, BEGIN, followed by your inserts and updates, and finish by executing COMMIT. Read the database wrapper documentation.
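If you happen to go through JDBC, the transaction wrapping looks roughly like the sketch below (sqlite4java has its own calls for BEGIN/COMMIT, so check its documentation). The measurements table and the list of rows are placeholders.

// Assumes an open java.sql.Connection conn and a List<String[]> rows where each
// entry holds {recorddate, recordtime, value} -- all placeholder names.
conn.setAutoCommit(false);                         // implicit BEGIN
try (PreparedStatement ps = conn.prepareStatement(
        "INSERT OR REPLACE INTO measurements (recorddate, recordtime, value) VALUES (?, ?, ?)")) {
    for (String[] row : rows) {
        ps.setString(1, row[0]);
        ps.setString(2, row[1]);
        ps.setString(3, row[2]);
        ps.executeUpdate();
    }
    conn.commit();                                 // everything hits disk once, here
} catch (Exception e) {
    conn.rollback();                               // nothing partial is left behind
    throw e;
}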
One more thing you can do is switch to Write-Ahead Logging. Run the following command, only once, on the database.
PRAGMA journal_mode = wal;
Without further information, I would:
BEGIN;
UPDATE table SET othervalues=... WHERE recorddate=... AND recordtime=...;
INSERT OR IGNORE INTO table(recorddate, recordtime, ...) VALUES(...);
COMMIT;
UPDATE will update all existing rows, ignoring non-existent ones because of the WHERE clause.
INSERT will then add the new rows, ignoring the existing ones because of IGNORE.
I have to create a MySQL InnoDB table that uses a strictly sequential ID for each element in the table (row). There cannot be any gaps in the IDs: each element has to have a different ID, and the IDs HAVE TO be assigned sequentially. Concurrent users create data in this table.
I have experienced MySQL's auto-increment behaviour, where if a transaction fails the PK number is not used, leaving a gap. I have read complicated solutions online that did not convince me, and some others that don't really address my problem (Emulate auto-increment in MySQL/InnoDB, Setting manual increment value on synchronized mysql servers).
I want to maximise write concurrency. I can't afford to have users writing to the table and waiting a long time.
I might need to shard the table... but still keep the ID count.
The sequence of the elements in the table is NOT important, but the IDs have to be sequential (i.e. an element created before another does not need to have a lower ID, but gaps between IDs are not allowed).
The only solution I can think of is to use an additional COUNTER table to keep the count. Then create the element in the table with an empty "ID" (not the PK), lock the COUNTER table, get the number, write it on the element, increase the number, and unlock the table. I think this will work fine, but it has an obvious bottleneck: while the table is locked, nobody else can get an ID.
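Roughly what I have in mind, sketched with JDBC and placeholder table/column names (id_counter, next_id, items); the row lock taken by SELECT ... FOR UPDATE would be the serialisation point:

// Assumes an open Connection conn to MySQL/InnoDB; all names are placeholders.
conn.setAutoCommit(false);
long id;
try (PreparedStatement sel = conn.prepareStatement(
        "SELECT next_id FROM id_counter FOR UPDATE");   // locks the counter row
     ResultSet rs = sel.executeQuery()) {
    rs.next();
    id = rs.getLong(1);
}
try (PreparedStatement upd = conn.prepareStatement(
        "UPDATE id_counter SET next_id = next_id + 1")) {
    upd.executeUpdate();
}
try (PreparedStatement ins = conn.prepareStatement(
        "INSERT INTO items (seq_id, payload) VALUES (?, ?)")) {
    ins.setLong(1, id);
    ins.setString(2, "example payload");
    ins.executeUpdate();
}
conn.commit();   // releases the counter lock; a failure before this point rolls everything back, so no number is skipped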
It is also a single point of failure if the node holding the COUNTER table is not available. I could set up "master-master" replication, but I am not sure whether that risks using an out-of-date ID counter (I have never used replication).
Thanks.
I am sorry to say this, but allowing high concurrency to achieve high performance and at the same time asking for a strictly sequential, gap-free ID sequence are conflicting requirements.
Either you have a single point of control/failure that issues the IDs and makes sure there are neither duplicates nor is one skipped, or you will have to accept the chance of one or both of these situations.
As you have stated, there are attempts to circumvent this kind of problem, but in the end you will always find that you need to make a tradeoff between speed and correctness, because as soon as you allow concurrency you can run into split-brain situations or race-conditions.
Maybe a gap-free sequence would be OK for each of possibly many servers/databases/tables?