I suspect this should be a common case, but I'm probably using the wrong keywords when googling around.
I just need to create a new table record with a completely random key. Assume I have obtained a key with good randomness (almost random). However, I can't be 100% sure that no row with that key already exists. So what I need to do atomically is:
Given the row key, check that no row exists yet.
Reject the operation if the row exists.
Create the row if it does not exist.
The most useful piece of information I found on this topic is an article about HBase row locks.
I see HBase row locks as a suitable solution, but I'd like to do it in a better way, without explicit row locking.
ICV doesn't look suitable because I really do want the key to be random.
CAS would be great if it could work on a 'row does not exist' condition, but it looks like it can't.
Explicit row locks have disadvantages, such as issues during region splits.
Could somebody please offer some useful advice?
The preferred API is Java-based, but this is really more about the concept than the implementation.
The 'good enough' solution for this case turned out to be based on the checkAndPut() method. What I intended to do was new row insertion with a key duplication check, and for individual inserts the solution is perfect (a sketch follows these points):
The HTable checkAndPut() method can check that a certain column is not set (check it against a null value).
As rows contain some 'ID' field which is mandatory for all objects (you can use any other field that you always set for your objects), it is possible to check whether a row already exists.
The Put object passed to checkAndPut() should contain the initial object state with the mandatory field set.
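For illustration, a minimal sketch of such an insert, assuming the older HTable-based client API (the table, family and qualifier names here are made up):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    String randomKey = "the-random-key-obtained-elsewhere";

    // The Put carries the initial object state, including the mandatory 'id' column.
    Put put = new Put(Bytes.toBytes(randomKey));
    put.add(Bytes.toBytes("data"), Bytes.toBytes("id"), Bytes.toBytes(randomKey));

    // Atomic check-and-insert: succeeds only if the 'id' column is still unset (null)
    // for this row key, i.e. the row does not exist yet.
    boolean inserted = table.checkAndPut(
            Bytes.toBytes(randomKey), Bytes.toBytes("data"), Bytes.toBytes("id"), null, put);
    if (!inserted) {
        // A row with this key already exists - reject the operation.
    }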
Well, for bulk insertion (which is what I really needed) it turned out to be too slow, so I moved to UUIDs used as row keys, without any checks on new row insertion. For me this is much better. The only consideration in this case is a really good random generator. The standard Java java.util.UUID class contains everything I need, including the fact that it is based on the somewhat slow but pretty strong java.security.SecureRandom generator.
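A minimal sketch of generating such a row key (the string encoding is just one possible choice):

    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    // randomUUID() produces a type 4 (random) UUID backed by SecureRandom.
    String key = UUID.randomUUID().toString();
    byte[] rowKey = key.getBytes(StandardCharsets.UTF_8);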
Just a note: it looks like the HBase user row locking feature is going to be dropped due to security and other risks related to its usage.
Related
I'm currently sourcing some static data from a third party. It's a simple one-to-many relationship, like this:
garage:
    id
    name
    desc
    location

garage_price:
    id
    garage_id
    price_type
    price
Sometimes, the data is incorrect, and I will need to correct it. At the same time, I'd like to preserve the original sourced data somewhere and potentially run some queries to show the changes.
My question is whether anyone is doing something like this with SQL, Java and Hibernate, and what approach you've taken, or would take.
I could add a boolean column, "original_data", to both tables, and before an update happens, run a trigger to copy the row from garage or garage_price into an "original_garage" or "original_price" table as long as original_data is true. Then set original_data to false, and all further updates will just happen on the garage/garage_price tables.
Anything wrong with that approach, and how do people typically work with multiple tables holding the same data in Hibernate/JPA? Previously, I'd create a class that holds all the data and subclass it twice, once per table, while setting
@Inheritance(strategy = InheritanceType.TABLE_PER_CLASS)
on the parent.
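For reference, that mapping might look roughly like this (the entity and field names are illustrative; note that TABLE_PER_CLASS does not work with IDENTITY id generation):

    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.GenerationType;
    import javax.persistence.Id;
    import javax.persistence.Inheritance;
    import javax.persistence.InheritanceType;

    @Entity
    @Inheritance(strategy = InheritanceType.TABLE_PER_CLASS)
    public abstract class GarageBase {
        @Id
        @GeneratedValue(strategy = GenerationType.TABLE) // IDENTITY is not portable with TABLE_PER_CLASS
        private Long id;
        private String name;
        private String description;
        private String location;
        // getters and setters omitted
    }

    @Entity
    class Garage extends GarageBase { }

    @Entity
    class OriginalGarage extends GarageBase { }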
As so often there are various options:
Use Hibernate Envers. It will keep a complete history of changes, so if you make multiple changes, each will result in a row in the auditing tables. These tables are separate from your main data tables, which might be a pro or a con depending on your requirements. (A minimal sketch follows this list of options.)
Use the approach that you described: write the original dataset and copy it before modifying it. You'll need two additional attributes:
a flag marking the original, and a technical id so that you still have a unique primary key.
The same as the second option, but you could actually do it in a trigger in the database. That is probably faster, works no matter how the data gets inserted, and copying rows in the database is actually really easy, while it feels rather cumbersome in Java. Of course, writing triggers is considered a PITA in itself by many Java developers. If your application doesn't usually use triggers and stored procedures, it is also really easy to forget about the trigger and to end up rather confused about where these additional rows come from.
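For the first option (Hibernate Envers), enabling auditing is mostly a matter of annotating the entity. A minimal sketch, assuming Envers is on the classpath (audit rows then go to a separate table, by default named something like Garage_AUD):

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.hibernate.envers.Audited;

    @Entity
    @Audited  // every insert/update/delete also writes a revision row to the audit table
    public class Garage {
        @Id
        private Long id;
        private String name;
        private String description;
        private String location;
        // getters and setters omitted
    }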
Consider this:
I have a database with 10 rows.
Each row has a unique id (int) paired with some value, e.g. a name (varchar).
These ids are incremented from 1 to 10.
I delete 2 of the records: 2 and 8.
I add 2 more records: 11 and 12.
Questions:
Is there a good way to redistribute the unique ids in this database so that they go from 1 to 10 again?
Would this be considered bad practice?
I ask this question because after some use of this database (adding and deleting values), the ids would end up differing significantly from a contiguous sequence.
One way to approach this would be to just generate the row numbers you want at the time you actually query, something like this:
SET @rn = 0;
SELECT
    (@rn := @rn + 1) AS rn, name
FROM yourTable
ORDER BY id;
Generally speaking, you should not worry about the auto-increment values that MySQL assigns. MySQL will make sure that the values are unique without your intervention.
If you set the ID column to be the primary key and auto-increment as well, "resetting" is not really necessary because it will keep assigning unique IDs anyway.
If the thing that bothers you is the "gaps" among the existing values, then you might resort to "soft deletion" by employing an is_deleted column with bit/boolean values. The default value would be 0 (or b0), of course. In fact, soft deletion is advised if there is really important data that might be useful later on, especially if it involves payment-related entries that a user could delete either by accident or deliberately.
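A minimal sketch of such a soft delete (the table and column names are assumptions, and conn is an open JDBC Connection):

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Mark the row as deleted instead of physically removing it.
    try (PreparedStatement ps = conn.prepareStatement(
            "UPDATE your_table SET is_deleted = 1 WHERE id = ?")) {
        ps.setInt(1, idToDelete);
        ps.executeUpdate();
    }
    // Normal reads then filter on the flag:
    // SELECT ... FROM your_table WHERE is_deleted = 0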
There is no simple way to perform a deletion where you simply remove one value and re-arrange the remaining IDs to retain the sequence. A workaround might be to do the following steps (sketched in code after the list):
DELETE the entry first, i.e. delete from <table> where ID = _value
INSERT INTO ... SELECT (without the id column). Please note that the tables need to be identical in terms of columns and types in order for this query to work properly, so to speak... and you can also use a temporary table as the backup_table, i.e. insert into <backup_table> select <column1, column2, ...> from <table>
TRUNCATE your table, i.e. truncate table <table>
Copy the values from the temp table back into the existing table. You can use INSERT INTO ... SELECT once again, but make sure to drop the temp table at the end.
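Sketched with plain JDBC, the steps above would look roughly like this (table and column names are assumptions, conn is an open JDBC Connection, and, as explained below, this is not something I would actually recommend):

    import java.sql.Connection;
    import java.sql.Statement;

    try (Statement st = conn.createStatement()) {
        // 1. delete the unwanted entry
        st.executeUpdate("DELETE FROM your_table WHERE id = 8");
        // 2. copy the remaining rows (without the id column) into a temporary backup table
        st.executeUpdate("CREATE TEMPORARY TABLE backup_table AS SELECT name FROM your_table");
        // 3. empty the original table (this also resets AUTO_INCREMENT)
        st.executeUpdate("TRUNCATE TABLE your_table");
        // 4. copy the values back so they receive fresh, gap-free ids, then drop the temp table
        st.executeUpdate("INSERT INTO your_table (name) SELECT name FROM backup_table");
        st.executeUpdate("DROP TEMPORARY TABLE backup_table");
    }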
Please note that I would NOT advise you to do this, mainly because most people use some sort of caching in their applications, and they also rely on specific ways of evaluating whether two objects are the same.
I.e. in Java, the equals() and hashCode() methods of POJOs are overridden, and programmers generally rely on IDs as a permanent way of identifying a specific object. By using the above method you essentially break that whole concept, and I would not advise you to change an object's auto-increment ID value for this reason before anything else.
Essentially, what you want to do is simply an anti-pattern, and it will generally turn common patterns and practices employed by experienced programmers into solutions that are prone to unexpected issues and/or failures... and this especially applies if/when advanced features are involved, such as introducing this anti-pattern into an application that uses a Galera cluster and/or application caching.
Suppose that we want to insert a record into some table, but in order to be allowed to do that, the table must not contain any record with duplicated values in certain fields, in such a way that database primary keys are not enough to enforce that control and it must be done by the application's code. If the code for inserting a record looked like this...
check duplicates
if no duplicates:
    insert the record
else:
    show the user an error
That code would be wrong because two different threads could perform the duplicate check at the same time, both pass the check, and then both insert the same record, producing duplicates and leaving the table in an inconsistent state.
As the code is a web application written in Java, I guess it would be enough to synchronize the critical section on the same static object, so that any user whose execution flow enters the critical section must wait for any other that previously entered it. But is that enough? Is there a more elegant way of doing this?
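For reference, the synchronized critical section described above might look like this (a sketch only; the service, DAO and exception types are made up, and it only helps while every insert goes through this single JVM):

    public class RecordService {

        private static final Object INSERT_LOCK = new Object();

        private final RecordDao dao; // hypothetical data-access object

        public RecordService(RecordDao dao) {
            this.dao = dao;
        }

        public void insertRecord(Record record) {
            synchronized (INSERT_LOCK) {
                // check duplicates
                if (dao.duplicateExists(record)) {
                    // show the user an error
                    throw new DuplicateRecordException();
                }
                // insert the record
                dao.insert(record);
            }
        }
    }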
You can use database triggers for that. The correct one to use in this case is a "before insert" trigger. If the use case is simpler, another option would be to use checks and constraints.
A third option would be to do it in the Java code; synchronizing the section like you described would work too.
According to the clarification from the question comments, a unique index is exactly what's needed here. Create a unique index with the name column as well as the boolean column in it. It will then not allow two entries where both columns have the same values.
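For example, the index could be created once with a plain DDL statement (table and column names here are assumptions). A concurrent duplicate insert then fails with a constraint violation (typically surfacing in JDBC as an SQLIntegrityConstraintViolationException) instead of silently creating a second row:

    import java.sql.Connection;
    import java.sql.Statement;

    try (Statement st = conn.createStatement()) {
        st.executeUpdate(
            "CREATE UNIQUE INDEX uq_name_flag ON your_table (name, your_boolean_column)");
    }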
I have to create a MySQL InnoDB table that assigns a strictly sequential ID to each element in the table (row). There cannot be any gap in the IDs: each element has to have a different ID and they HAVE TO be assigned sequentially. Concurrent users create data in this table.
I have experienced the MySQL "auto-increment" behaviour where, if a transaction fails, the PK number is not used, leaving a gap. I have read complicated solutions online that did not convince me, and some others that don't really address my problem (Emulate auto-increment in MySQL/InnoDB, Setting manual increment value on synchronized mysql servers).
I want to maximise write concurrency. I can't afford to have users writing to the table and waiting for long times.
I might need to shard the table... but still keep the ID count.
The order of the elements in the table is NOT important, but the IDs have to be sequential (i.e. an element created before another does not need to have a lower ID, but gaps between IDs are not allowed).
The only solution I can think of is to use an additional COUNTER table to keep the count. Then create the element in the table with an empty "ID" (not the PK), lock the COUNTER table, get the number, write it on the element, increase the number, and unlock the table. I think this will work fine, but it has an obvious bottleneck: while the lock is held, nobody else is able to obtain an ID.
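For illustration, that counter approach might be sketched like this (the id_counter table name is made up, conn is an open JDBC Connection, and the SELECT ... FOR UPDATE keeps the counter row locked until commit, serializing concurrent writers):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    conn.setAutoCommit(false);
    long newId;
    try (Statement st = conn.createStatement()) {
        // Lock the single counter row; concurrent writers queue up here.
        try (ResultSet rs = st.executeQuery("SELECT next_id FROM id_counter FOR UPDATE")) {
            rs.next();
            newId = rs.getLong(1);
        }
        st.executeUpdate("UPDATE id_counter SET next_id = next_id + 1");
        // ... insert the element with newId in the same transaction ...
        conn.commit();
    }
    // If the insert fails and the transaction rolls back, the counter update
    // rolls back with it, so no gap is created.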
Also, it is a single point of failure if the node holding the table is not available. I could create "master-master" replication, but I am not sure whether that way I would risk using an out-of-date ID counter (I have never used replication).
Thanks.
I am sorry to say this, but allowing high concurrency to achieve high performance and at the same time asking for a strictly monotonic sequence are conflicting requirements.
Either you have a single point of control/failure that issues the IDs and makes sure there are neither duplicates nor skipped values, or you will have to accept the chance of one or both of these situations.
As you have stated, there are attempts to circumvent this kind of problem, but in the end you will always find that you need to make a trade-off between speed and correctness, because as soon as you allow concurrency you can run into split-brain situations or race conditions.
Maybe a strictly monotone sequence would be ok for each of possibly many servers/databases/tables?
Let's say you have a large text file. Each row contains an email id and some other information (say a product-id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use a MapReduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data, though. (j/k :))
Unable to fit all rows in memory
Even the result won't fit: use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (probably this sample helps). This can be multi-threaded if you want.
The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or something similar and keep adding to the HashSet (sketched in code further below). You can use a ConcurrentHashMap and more than one thread to read files and add to this map. Another multi-threaded option is to use a ConcurrentSkipListSet. In this case, you would implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to this SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method and put them all in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
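A minimal sketch of the line-iterator-plus-HashSet variant mentioned above, keyed on the email id only (the file layout, with the email as the first comma-separated field, is an assumption, and writeSqlInsert() is a hypothetical load step):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashSet;
    import java.util.Set;

    Set<String> seenEmails = new HashSet<>();
    try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String email = line.split(",")[0];
            if (seenEmails.add(email)) {     // add() returns false if the email was seen before
                writeSqlInsert(line);        // load only the first occurrence into the DB
            }
        }
    }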
Oh, and if I were you, I would put the unique constraint on the DB anyway...
I will start with the obvious answer. Make a HashMap and put the email id in as the key and the rest of the information into the value (or make an object to hold all the information). When you get to a new line, check whether the key exists; if it does, move to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
You have two options:
Do it in Java: you could put together something like a HashSet for testing, adding the email id for each item that comes in if it doesn't already exist in the set.
Do it in the database: put a unique constraint on the table, such that dups will not be added to the table. An added bonus to this is that you can repeat the process and remove dups from previous runs.
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record linkage engine written in Java. It uses Lucene to index the data and reduce the number of comparisons (to avoid the unacceptable Cartesian product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and it is extremely extensible and configurable.
Can you not index the table by email and product ID? Then reading by index should make duplicates of either email or email+prodId readily identifiable via sequential reads, simply by matching against the previous record.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
You load your data into an import schema;
Do whatever transformations you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.