Suppose we want to insert a record into some table, but we are only allowed to do so if the table does not already contain a record with the same values in certain fields. The database primary key is not enough to enforce this, so the check has to be done in the application code. If the code for inserting a record looked like this...
check duplicates
if no duplicates:
insert the record
else:
show the user an error
That code would be wrong, because two different threads could perform the duplicate check at the same time, both pass it, and then both insert the same record, leaving duplicates in the table and an inconsistent state.
Since the code is a Java web application, I guess it would be enough to synchronize the critical section on the same static object (see the sketch below), so that any request entering the critical section must wait for the one that is already inside it. But is that enough? Is there a more elegant way of doing it?
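For reference, this is the kind of thing I have in mind; RecordDao, its methods and the exception are placeholders for my real code:

public class RecordService {

    /** Placeholder DAO interface; the real one does the actual SQL. */
    public interface RecordDao {
        boolean existsDuplicate(Object record);
        void insert(Object record);
    }

    // Shared lock: every request thread that reaches the critical section
    // synchronizes on the same static object.
    private static final Object LOCK = new Object();

    private final RecordDao recordDao;

    public RecordService(RecordDao recordDao) {
        this.recordDao = recordDao;
    }

    public void insertRecord(Object record) {
        synchronized (LOCK) {
            // The check and the insert are atomic with respect to other threads
            // in this JVM, so two requests cannot both pass the check.
            if (recordDao.existsDuplicate(record)) {
                throw new IllegalStateException("A record with these values already exists");
            }
            recordDao.insert(record);
        }
    }
}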
You can use database triggers for that; the right one in this case is a BEFORE INSERT trigger. If the use case is simple enough, another option is to use checks and constraints.
A third option is to do it in the Java code; synchronizing the section as you described would work too.
According to the clarification in the question comments, a unique index is exactly what's needed here. Create a unique index covering both the name column and the boolean column; the database will then refuse a second entry with the same values in both columns.
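For example, something along these lines; the table and column names (mytable, name, flag) are placeholders since the question doesn't give the real ones, and you can of course run the same DDL directly against the database instead of through JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateUniqueIndex {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders as well.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             Statement st = con.createStatement()) {
            // After this, the database rejects any second row with the same (name, flag) pair.
            st.executeUpdate("CREATE UNIQUE INDEX uq_name_flag ON mytable (name, flag)");
        }
    }
}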
Consider this:
I have a database with 10 rows.
Each row has a unique id (int) paired with some value e.g. name (varchar).
These ids are incremented from 1 to 10.
I delete 2 of the records - 2 and 8.
I add 2 more records, 11 and 12.
Questions:
Is there a good way to redistribute the unique ids in this database so they run from 1 to 10 again?
Would this be considered bad practice?
I ask because, after some use of this database (adding and deleting values), the ids would end up with significant gaps.
One way to approach this would be to just generate the row numbers you want at the time you actually query, something like this:
SET @rn = 0;

SELECT
    (@rn := @rn + 1) AS rn, name
FROM yourTable
ORDER BY id;
Generally speaking, you should not worry about the auto-increment values that MySQL assigns. MySQL will make sure that the values are unique without your intervention.
If you set the ID column as the primary key with auto-increment, "resetting" is not really necessary, because MySQL will keep assigning unique IDs anyway.
If what bothers you are the "gaps" among the existing values, you might resort to "soft deletion" by adding an is_deleted column with bit/boolean values, defaulting to 0 (or b'0'). In fact, soft deletion is advisable whenever the data might still be useful later on, especially for things like payment-related entries that a user could delete by mistake or deliberately.
There is no simple way to delete one row and re-arrange the remaining IDs so the sequence has no gaps. A workaround might be the following steps:
DELETE the entry first, i.e. delete from <table> where ID = _value
INSERT INTO ... SELECT (without the id column) into a backup table. Note that the tables need to have identical columns and types for this query to work properly; you can also use a temporary table as the backup_table, i.e. insert into <backup_table> select <column1, column2, ...> from <table>
TRUNCATE your table, i.e. truncate table <table>
Copy the values from the backup table back into the existing table. You can use INSERT INTO ... SELECT once again, but make sure to drop the temporary table at the end.
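Put together, those steps might look roughly like this over JDBC (connection details, the table name mytable and its single name column are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ReassignIds {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             Statement st = con.createStatement()) {
            // 1. Delete the entry you no longer want.
            st.executeUpdate("DELETE FROM mytable WHERE id = 8");
            // 2. Back up the remaining rows without the id column.
            st.executeUpdate("CREATE TEMPORARY TABLE backup_table AS SELECT name FROM mytable");
            // 3. Empty the original table; in MySQL this also resets AUTO_INCREMENT.
            st.executeUpdate("TRUNCATE TABLE mytable");
            // 4. Copy the rows back so fresh, gap-free ids are assigned, then drop the backup.
            st.executeUpdate("INSERT INTO mytable (name) SELECT name FROM backup_table");
            st.executeUpdate("DROP TEMPORARY TABLE backup_table");
        }
    }
}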
Please note that I would NOT advise you to do this, mainly because most applications use some sort of caching and rely on specific ways of deciding whether two objects represent the same thing.
In Java, for example, the equals() and hashCode() methods of POJOs are overridden, and programmers generally rely on IDs as a permanent way of identifying a specific object. The method above breaks that whole concept, which is the first reason I would not advise changing an object's auto-increment ID.
Essentially, what you want to do is an anti-pattern: it turns patterns and practices commonly relied on by experienced programmers into solutions that are prone to unexpected issues and failures. This applies all the more when advanced features are involved, such as a Galera cluster or application-level caching.
I have a table users with primary key column email.
I have a piece of code that stores the user and simply invokes userDao.store(user);
Since the primary key constraint exists, I can catch the resulting exception and show the error in the UI. This approach works fine.
Another solution is to check first whether the user exists and only then store them in the database. This results in two consecutive queries, a select and then an insert: if the user exists, I show the error. The issue I see here is that two users might try to register with the same email at the same time. Both threads could check for the user's existence and find nothing; then the first thread saves the user and the second one throws an exception.
The third approach is to use a MERGE query (I use HSQLDB). In a single query I insert the user only if they do not exist, and then check the result: if no rows were changed, the user already exists and I can show the error. None of these approaches violates the consistency of my data, but I am looking for the best practice for handling this kind of problem.
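For reference, the MERGE variant looks roughly like this (schema simplified to email and name; this is the syntax I understand HSQLDB 2.x to accept):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class InsertUserIfAbsent {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "SA", "");
             Statement st = con.createStatement()) {
            int affected = st.executeUpdate(
                "MERGE INTO users USING (VALUES('jane@example.com', 'Jane')) AS vals(email, name) "
                + "ON users.email = vals.email "
                + "WHEN NOT MATCHED THEN INSERT (email, name) VALUES (vals.email, vals.name)");
            // 0 affected rows means the email already existed, so the UI can show the error.
            System.out.println(affected == 0 ? "duplicate email" : "user created");
        }
    }
}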
Your first instinct was correct. To protect against duplicates, define a UNIQUE constraint on that column. Then catch any exception resulting from a violation of that constraint.
SQL lacks an atomic insert-if-not-exists command. You will see code using a nested SELECT statement, but such code is not atomic, so you would still need to trap for the UNIQUE constraint violations.
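A rough sketch of that approach with plain JDBC, catching the violation via its SQL state (the users table is the one from the question; the name column is added just for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class StoreUser {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "SA", "");
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO users (email, name) VALUES (?, ?)")) {
            ps.setString(1, "jane@example.com");
            ps.setString(2, "Jane");
            ps.executeUpdate();
        } catch (SQLException e) {
            // SQL state class 23 covers integrity constraint violations in most drivers.
            if (e.getSQLState() != null && e.getSQLState().startsWith("23")) {
                System.out.println("A user with that email already exists");
            } else {
                throw e;
            }
        }
    }
}

With Hibernate or a DAO layer in between, the same violation typically arrives wrapped (for example as org.hibernate.exception.ConstraintViolationException), and you can handle it in the service layer the same way.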
This Question is basically a duplicate. Search StackOverflow for more discussion and examples.
By the way, I would recommend against using an email address as a primary key. If a user wants to change the email address on their account, you will have to update all related records that use that value as a Foreign Key. I suggest almost always using a Surrogate Key instead of a Natural Key.
The chance of that happening is so remote you really don't have to consider it, especially if you require email validation before someone can use the system. If you are still worried, you can minimize the chance by synchronizing the call that checks for the existence of the email. The only case where this would not work is a clustered environment with the code running on two or more load-balanced servers.
I suspect this is a fairly common case, but I'm probably using the wrong keywords when googling around.
I just need to create a new table record with a completely random key. Assume I have obtained a key with good randomness (almost truly random). However, I can't be 100% sure no row with that key exists yet. So what I need to do, atomically, is:
Given the row key, check that no such row exists yet.
Reject the operation if the row exists.
Create the row if it does not exist.
The most useful piece of information I found on this topic is an article about HBase row locks.
I see HBase row locks as a suitable solution, but I'd like to do it in a better way, without explicit row locking.
ICV does not look suitable because I really do want the key to be random.
CAS would be great if it could be applied to a 'row does not exist yet' condition, but it looks like it can't.
Explicit row locks have disadvantages, such as issues on region splits.
Could somebody please add useful advice?
The preferred API is Java-based, but this is really more about the concept than the implementation.
The 'good enough' solution for this case turned out to be based on the checkAndPut() method. What I intended to do was insert new rows with a duplicate-key check, and for individual inserts this solution is perfect:
HTable's checkAndPut() method can check that a certain column is not set (by checking it against a null value).
Since rows always contain some 'ID' field that is mandatory for all objects (you can use any other field that you always set for your objects), it is possible to check whether the row already exists.
The Put object passed to checkAndPut() should contain the initial object state, with that mandatory field set.
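A sketch of this with the HBase 1.x client API; the table, column family and qualifier names are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class InsertIfAbsent {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("mytable"))) {

            byte[] row = Bytes.toBytes("some-random-key");
            Put put = new Put(row);
            // The mandatory 'ID' field carries the initial object state.
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("id"), Bytes.toBytes("some-random-key"));

            // Passing null as the expected value means "apply the Put only if this
            // cell does not exist yet"; the check-and-put is atomic per row.
            boolean created = table.checkAndPut(row, Bytes.toBytes("cf"), Bytes.toBytes("id"), null, put);
            if (!created) {
                System.out.println("Row already exists, operation rejected");
            }
        }
    }
}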
However, for bulk insertion (what I really needed) this turned out to be too slow, so I moved to UUIDs used as row keys without any check on new row insertion, which works much better for me. The only consideration in this case is a really good random generator; the standard java.util.UUID class contains everything I need, including the fact that it is based on the somewhat slow but fairly strong java.security.SecureRandom generator.
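The UUID variant needs no check at all; generating a key is as simple as:

import java.util.UUID;

public class RandomRowKeys {
    // java.util.UUID.randomUUID() is backed by java.security.SecureRandom,
    // so the chance of two equal keys is negligible.
    static String newRowKey() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println(newRowKey());
    }
}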
Just a note: it looks like the HBase user row locking feature is going to be dropped due to security and other risks related to its usage.
Consider that I am using Java, Struts, Hibernate and Oracle. How can I prevent duplicate entries from being stored in the database? One way is to make the field unique. For example, I am entering the country "USA" in a JSP page; if "USA" is already available, how can I prevent the duplicate entry? Please let me know.
Regards,
sara
You should indeed always put a unique constraint on fields that must stay unique. This will, however, lead to a cryptic exception at commit time. If you want to be more user-friendly, check whether the entry already exists (using a query) before inserting it, and display a useful, readable error message to the user if it does.
This still allows two concurrent users to check at the same time, then insert at the same time, but it greatly reduces the probability, and the unique constraint makes sure that one of the commits will fail, leaving your database in a consistent state.
Query your database to see whether it already contains "USA" or not. If it does, don't store it; if not, do.
Add a unique index to your database table on the country column.
Additionally, you can annotate the country attribute of your Hibernate entity with @Column(unique=true).
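For example, a minimal mapped entity might look like this (class and field names are just illustrative):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

@Entity
public class Country {

    @Id
    @GeneratedValue
    private Long id;

    // unique = true only affects the DDL Hibernate generates; on an existing
    // table you still have to add the unique index yourself.
    @Column(unique = true, nullable = false)
    private String name;

    protected Country() {
    }

    public Country(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}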
Let's say you have a large text file. Each row contains an email id and some other information (say a product id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use a MapReduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data. (j/k :))
Unable to fit all rows in memory
Even the result won't fit: use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (this sample probably helps). This can be multi-threaded if you want.
The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or similar and keep adding to the HashSet as you go. You can use a ConcurrentHashMap with more than one thread reading files and adding to the map. Another multi-threaded option is a ConcurrentSkipListSet; in that case you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to the SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method and put them all in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
Oh, and if I were you, I would put the unique constraint on the DB anyway...
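A minimal sketch of the line-iterator plus HashSet variant described above; the file name and the 'email,other-info' line format are assumptions:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class DedupByEmail {
    public static void main(String[] args) throws IOException {
        Set<String> seenEmails = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String email = line.split(",", 2)[0];
                // add() returns false if this email was already seen,
                // so the duplicate row is simply skipped.
                if (seenEmails.add(email)) {
                    // load the row into the database here (e.g. a JDBC batch insert)
                }
            }
        }
    }
}

Swapping the HashSet for a ConcurrentSkipListSet (or a set view of a ConcurrentHashMap) gives you the multi-threaded variants mentioned above.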
I will start with the obvious answer. Make a HashMap with the email id as the key and the rest of the information as the value (or make an object to hold all the information). When you get to a new line, check whether the key already exists; if it does, move on to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will matter if you have a "gazillion" rows.
You have two options:
do it in Java: you could use something like a HashSet for the test, adding the email id of each incoming item if it doesn't already exist in the set.
do it in the database: put a unique constraint on the table, such that dups will not be added to the table. An added bonus to this is that you can repeat the process and remove dups from previous runs.
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record-linkage engine written in Java. It uses Lucene to index the data and reduce the number of comparisons (avoiding the unacceptable Cartesian product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and is extremely extensible and configurable.
Can you not index the table by email and product id? Then, reading via the index, duplicates of either the email or the email+productId pair are readily identified by sequential reads that simply compare against the previous record.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
You load your data into an import schema;
Do every transformation you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.