So I have a main table in Hive that will store all my data.
I want to be able to load an incremental data update about every month,
with a large amount of data, a couple billion rows. There will be new data
as well as updated entries.
What is the best way to approach this? I know Hive recently added support for update/insert/delete.
What I've been thinking is to somehow find the entries that will be updated, remove them from the main table, and then just insert the new incremental update. However, after trying this, the inserts are very fast, but the deletes are very slow.
The other way is to use an UPDATE statement to match the key values between the main table and the incremental update and update their fields. I haven't tried this yet, and it also sounds painfully slow, since Hive would have to update each entry one by one.
Does anyone have ideas on how to do this most efficiently and effectively?
I'm pretty new to Hive and databases in general.
If MERGE in ACID mode is not applicable, then it's possible to update using a FULL OUTER JOIN or using UNION ALL + row_number().
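For reference, an ACID-mode MERGE would look roughly like this (a sketch using the same placeholder key PK and column COL1 as the examples below; extend the column lists as needed):

MERGE INTO target_data AS t
USING increment_data AS i
ON t.PK = i.PK
WHEN MATCHED THEN UPDATE SET COL1 = i.COL1       -- overwrite existing rows with new values
WHEN NOT MATCHED THEN INSERT VALUES (i.PK, i.COL1); -- append genuinely new rows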
To find all entries that will be updated, you can join the increment data with the old data:

insert overwrite table target_data [partition() if applicable]
SELECT
    -- take the new value if the increment has the row, else keep the old one
    case when i.PK is not null then i.PK    else t.PK    end as PK,
    case when i.PK is not null then i.COL1  else t.COL1  end as COL1,
    ...
    case when i.PK is not null then i.COL_n else t.COL_n end as COL_n
FROM
    target_data t --restrict partitions if applicable
    FULL JOIN increment_data i ON (t.PK = i.PK);
It's possible to optimize this by restricting the partitions in target_data that will be overwritten and joined, using WHERE partition_col in (select distinct partition_col from increment_data), or, if possible, passing the partition list as a parameter and using it in the WHERE clause; that will work even faster.
Also, if you want to update all columns with new data, you can apply this solution with UNION ALL + row_number(), which works faster than a full join: https://stackoverflow.com/a/44755825/2700344
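A minimal sketch of that variant, again with just PK and COL1 as placeholder columns; the increment gets priority 1, so its rows win:

insert overwrite table target_data
select PK, COL1
from (
    select PK, COL1,
           row_number() over (partition by PK order by src) as rn
    from (
        select 1 as src, PK, COL1 from increment_data -- new data wins
        union all
        select 2 as src, PK, COL1 from target_data
    ) all_data
) ranked
where rn = 1;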
Here is my solution/workaround if you are using an old Hive version. It works better when you have a large amount of data in the target table, which you can't drop and recreate with the full data every time.
Create one more table, say a delete_keys table. This will hold all the keys from the main table which are deleted, along with their surrogate keys.
While loading incremental data into the main table, do a left join with the main table. For all the matching records, we ideally should update the main table. But instead, we take the keys (along with surrogate keys) from the main table for all matching records and insert those into the delete_keys table. Now we can insert all delta records into the main table as-is, irrespective of whether they are to be updated or inserted.
Create a view on the main table using the delete_keys table so that rows whose keys match the delete_keys table are not fetched. This view will be the final target table; it will not show the records from the main table that have been superseded by the latest records.
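A hypothetical sketch of that flow (table and column names are assumptions; surrogate_key is assumed to identify a row version uniquely):

-- 1. record the old versions of rows that the delta replaces
insert into table delete_keys
select m.pk, m.surrogate_key
from main_table m
join increment_data i on m.pk = i.pk;

-- 2. append the delta as-is
insert into table main_table
select * from increment_data;

-- 3. the view hides superseded row versions
create view main_table_v as
select m.*
from main_table m
left join delete_keys d on m.surrogate_key = d.surrogate_key
where d.surrogate_key is null;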
I have a very large table in the database; the table has a column called
"unique_code_string" and almost 100,000,000 records.
Every 2 minutes, I receive 100,000 code strings. They arrive in an array and are unique among themselves. I need to insert them into the large table if they are all "good".
The meaning of "good" is this:
None of the 100,000 codes in the array already occurs in the large table.
If one or more codes occur in the large table, the whole array is not used at all;
no codes from the array are inserted into the large table.
Currently, I do it this way:
First, I loop over the array and check each code to see if the same code is already in the large table.
Second, if all the codes are "new", I do the real insert.
But this way is very slow; I must finish everything within 2 minutes.
I am thinking of other ways:
Join the 100,000 codes into a SQL IN clause. Each code is 32 characters long, so I think no database will accept an IN clause that is 32 * 100,000 characters long.
Use a database transaction: force-insert the codes anyway and roll the transaction back if an error happens. This causes some performance issues.
Use a database temporary table. I am not good at writing SQL queries; please give me an example if this idea may work.
Now, can any experts give me some advice or some solutions?
I am not a native English speaker; I hope the issue I am facing is clear.
Thank you very much.
Load the 100,000 rows into a table!
Create a unique index on the original table:
create unique index unq_bigtable_uniquecodestring on bigtable (unique_code_string);
Now, you have the tools you need. I think I would go for a transaction, something like this:
insert into bigtable ( . . . )
select . . .
from smalltable;
If any row fails (due to the unique index), then the transaction will fail and nothing is inserted. You can also be explicit:
insert into bigtable ( . . . )
select . . .
from smalltable
where not exists (select 1
                  from smalltable st join
                       bigtable bt
                       on st.unique_code_string = bt.unique_code_string
                 );

Note that the subquery is deliberately not correlated with the outer smalltable: if even one code already exists in bigtable, the NOT EXISTS is false for every row, so nothing at all is inserted, which is exactly the all-or-nothing behavior you want. For this version, you should also have an index/unique constraint on smalltable(unique_code_string).
It's hard to find an optimal solution with so little information. Often it depends on the network latency between the application and the database server and on hardware resources.
You could load the 100,000,000 unique_code_string values from the database and use a HashSet or TreeSet to de-duplicate in memory before inserting into the database. If your database server is resource-constrained or there is considerable network latency, this might be faster.
Depending on how you receive the 100,000-record delta, you could load it into the database; e.g. a CSV file can be read using an external table. If you can get the data efficiently into a temporary table and the database server is not overloaded, you can do it very efficiently with SQL or a stored procedure.
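In MySQL, for instance, the delta could be bulk-loaded into a staging table like this (a sketch; the file path and table name are assumptions):

-- staging table for the 100,000 incoming codes
CREATE TEMPORARY TABLE delta_codes (
    unique_code_string CHAR(32) NOT NULL,
    PRIMARY KEY (unique_code_string)
);

-- bulk-load the delta from a CSV file (path is hypothetical)
LOAD DATA LOCAL INFILE '/tmp/delta.csv'
INTO TABLE delta_codes
LINES TERMINATED BY '\n'
(unique_code_string);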
You should spend some time understanding how real-time the update has to be, e.g. how many SQL queries are reading the 100,000,000-row table and whether you can allow some of those queries to be cancelled or blocked while you update the rows. Often it's a good idea to create a shadow table:
Create a new table as a copy of the existing 100,000,000-row table.
Disable the indexes on the new table.
Load the delta rows into the new table.
Rebuild the indexes on the new table.
Drop the existing table.
Rename the new table to the existing 100,000,000-row table's name.
The approach here is database-specific. It will depend on how your database defines the indexes; e.g. if you have a partitioned table it might not be necessary.
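In MySQL, for example, the shadow-table swap might look roughly like this (a sketch; table names are assumptions, and note that DISABLE KEYS only skips non-unique indexes on MyISAM):

CREATE TABLE bigtable_new LIKE bigtable;

ALTER TABLE bigtable_new DISABLE KEYS;              -- skip index maintenance during the load
INSERT INTO bigtable_new SELECT * FROM bigtable;    -- copy the existing rows
INSERT INTO bigtable_new SELECT * FROM delta_codes; -- load the delta
ALTER TABLE bigtable_new ENABLE KEYS;               -- rebuild indexes in one pass

RENAME TABLE bigtable TO bigtable_old,              -- atomic swap
             bigtable_new TO bigtable;
DROP TABLE bigtable_old;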
I'm working on a Java project which needs to be able to alter all the primary keys in a table, and in most cases some other values in the row as well.
The problem I have is that if I update a row by selecting on its old primary key (SET pk=new_pk WHERE pk=old_pk), I get duplicate rows, since the old PK value may be equal to another row's new PK value, and both rows are then updated.
I figured that in Oracle and some other DBs I might be able to do this with ROWNUM or something similar, but the system should work with most DB systems, and right now we can't get this to work for MySQL.
I should add that I don't have access to change the schema of the DB, so I can't add a column.
What I have tried:
Updating ResultSets directly with RS.updateRow(). This seems to work, but is very slow.
Hashing the PKs in the table, storing the hashes in code, and selecting on the hashed PK. The hash acts as a kind of signature, since a hashed PK indicates that the row has been read but not yet updated, and I can skip such rows that way. The issue with this seems to have been hash collisions, as I was getting duplicate PKs.
PS:
I realise this sounds like either a terrible idea or terrible design, but we really have no choice. The system I'm working on aims to anonymize private data, and that may entail changing PKs in some tables. Don't worry, we do account for FK references.
In this case you can use a simple update with a delta equal to the current max PK of the table being updated.
First select the delta:

select max(pk) as delta from mytable;

and then use it in the update:

update mytable set pk = pk + delta + 1;

Every shifted key is now greater than the old maximum, so the update cannot collide with any existing key. Before this operation you need to disable constraints. And don't forget that you should also update the foreign keys.
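A hypothetical MySQL sketch of the full two-step flow (table name, column names, and the example key values are assumptions):

-- step 1: shift every key above the current maximum so old and new ranges are disjoint
SELECT MAX(pk) INTO @delta FROM mytable;
UPDATE mytable SET pk = pk + @delta + 1;

-- step 2: apply the intended mapping; each old key is now old_pk + @delta + 1
UPDATE mytable SET pk = 42 WHERE pk = 7 + @delta + 1;  -- example: old key 7 becomes 42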
I'm looking at querying a table in a MySQL database. The primary key is composite, made of two columns, a name and a number (compared as strings), and the table could have anywhere from very few rows to upwards of hundreds of millions. Now, for efficiency, I'm not exactly sure how costly an INSERT query actually is, but I have a few options for going about it:
I could query the database to see if the element EXISTS and then call an INSERT query if it doesn't.
I could brute-force the INSERT into the database and, whether it succeeds or fails, so be it.
I could, on program startup, create a cache/store: grab the primary key columns, store them in a Map<String, List<Integer>>, and check whether the name exists as a key and, if it does, whether the number exists in its List<Integer>; only if it doesn't, INSERT into the database.
Option one really isn't something I would implement, it's just on the list of possible choices. Option two would most likely average better for unique occurrences, i.e. rows that aren't in the table already. Option three would be favoured if common occurrences are the rule, so that many lookups hit the cache.
Bear in mind that the chosen option will be iterated over potentially millions of times. Memory usage aside (for option 3), from my calculations it's nothing significant with respect to the capacity available.
Let the database do the work.
You should use the second method. If you don't want to get a failure, you can use on duplicate key update:
insert into t(pk1, pk2, . . . )
    values ( . . . )
    on duplicate key update pk1 = values(pk1);
The only purpose of the on duplicate key update clause here is to do nothing useful; it is there so that a duplicate does not return an error.
Why is this the best solution? In a database, a primary key (or columns declared unique) has an index structure, which is efficient for the database to use.
Second, this requires only one round-trip to the database.
Third, there are no race conditions, if you have multiple threads or applications that might be attempting to insert the same record(s).
Fourth, the method with on duplicate key update works for inserting multiple rows at once. (Without on duplicate key update, a multi-row statement would fail entirely if a single row were duplicated.) Combining multiple inserts into a single statement can be another big efficiency gain.
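For example (a sketch; the table and the name/number columns are hypothetical stand-ins for the composite key from the question):

INSERT INTO t (name, num)
VALUES ('alpha', 1), ('beta', 2), ('gamma', 3)
ON DUPLICATE KEY UPDATE name = VALUES(name);  -- duplicates become no-ops instead of errors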
Your second option is really the right way to go.
Rather than fetching the whole result in the third option, you could try using LIMIT 1. Given that the combination of name and number forms the primary key, a SELECT with LIMIT 1 is enough to check existence; if the result is empty, you can insert your data. It would be a lot faster that way.
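A sketch of that existence check (table and column names are hypothetical):

SELECT 1
FROM t
WHERE name = 'alpha' AND num = 1
LIMIT 1;  -- an empty result means the key is free and the INSERT can proceed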
MySQL has a neat way to perform a special kind of insertion. INSERT ... ON DUPLICATE KEY UPDATE is a MySQL extension to the INSERT statement. If you specify the ON DUPLICATE KEY UPDATE option and the new row would cause a duplicate value in a UNIQUE or PRIMARY KEY index, MySQL instead updates the old row using the new values:
INSERT INTO table(column_list)
VALUES(value_list)
ON DUPLICATE KEY UPDATE column_1 = new_value_1, column_2 = new_value_2;
We have some useless historical data in a database, adding up to 190 million (19 crore) rows and contributing 33 GB. Now I have been given the task of deleting these rows in one go, and if anything breaks, I should be able to roll back the transaction.
I will select the rows based on a flag like deleted = '1', which by my estimate matches 190 million out of 200 million rows. So first I have to do a select operation and then delete those ids.
As mentioned in this article, it takes 4 hours to delete 1.5 million records, far fewer than in my case, and I am wondering how long it would take to delete 190 million records with the single-delete approach.
Should I use Spring Batch to select the row ids and then delete them batch by batch, or issue a single statement passing the ids in an IN clause?
What would be the better approach? Please suggest.
Why not move the required data from the historical table into a new table and drop the old table entirely? You can rename the new table to the old table name afterwards.
You can copy the required data from the historical table into a new table, drop the old table entirely, and rename the new table to the old table name later, as said by Raj in the above post. This is the best way to do it.
You can also use the nologging and parallel options to speed things up, for example:

create table History_new parallel 4 nologging as
select /*+ parallel(h 4) */ * from History h where col1 = 1 and ... ;
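Once the new table is verified, the swap itself is quick (a sketch; indexes, constraints, and grants from the old table have to be recreated separately):

-- keep the old table around briefly as a safety net
RENAME History TO History_old;
RENAME History_new TO History;
-- when satisfied, reclaim the space
DROP TABLE History_old PURGE;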
If doing it in Java is not mandatory, I'd create a PL/SQL procedure, open a cursor, and use DELETE ... WHERE CURRENT OF. Maybe it's not super fast, but it's safe, because you will have no rollback segment problems. A normal DELETE, even without an explicit transaction, is an atomic operation that must be rolled back entirely if something fails.
Maybe what you describe is the usual, normal performance for Java, but on my notebook deleting 1M records takes about a minute, without Java, of course.
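A minimal PL/SQL sketch of that cursor-based delete (the table name is an assumption; the deleted = '1' flag is taken from the question):

DECLARE
  -- FOR UPDATE is required so that WHERE CURRENT OF can reference the cursor row
  CURSOR c IS
    SELECT * FROM history WHERE deleted = '1' FOR UPDATE;
BEGIN
  FOR r IN c LOOP
    DELETE FROM history WHERE CURRENT OF c;  -- delete the row the cursor is positioned on
  END LOOP;
  COMMIT;
END;
/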
If you want to do it properly, I'd say you should use partitions. First of all, convert the plain table(s) into partitioned one(s) with all the data in one (current) partition. Then prepare "historical" partitions and move the unnecessary data into them. After that you'll be ready to do anything: you can take that data offline (and restore it when needed), and you'll be able to exclude it in seconds using EXCHANGE PARTITION, and so on.
I am looking for a fast way to move records from a MEMORY table to a MyISAM table. The MEMORY table has around 0.5 million records. Both tables have exactly the same structure (same number of columns, data types, etc.), but the MyISAM table is indexed (B-TREE) on a few columns. There are around 25 columns, most of which are unsigned integers.
I have already tried using an "INSERT INTO ... SELECT * FROM ..." query, but is there any faster way to do this?
Appreciate your help.
Prashant
As others pointed out, you should not maintain the indexes during the insert.
You can disable updating them on every insert (using placeholder table names):

ALTER TABLE myisam_tbl DISABLE KEYS;
INSERT INTO myisam_tbl SELECT * FROM memory_tbl;
ALTER TABLE myisam_tbl ENABLE KEYS;
You can also lock the table to get a single index write:

LOCK TABLES myisam_tbl WRITE;
INSERT INTO myisam_tbl SELECT * FROM memory_tbl;
UNLOCK TABLES;
Anyway, if you are doing it as a single INSERT ... SELECT, you might not get a significant performance gain.
You can also tune the bulk_insert_buffer_size setting in the server config.
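For example (the 64 MB value here is just an assumption to illustrate; the default is 8 MB):

SET SESSION bulk_insert_buffer_size = 64 * 1024 * 1024;  -- value in bytes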
More on: http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
In principle, you should get good performance by:
Create the target table without secondary indexes.
Sort the contents of the source table on the target table's primary key.
Insert sorted records into target table.
Add the secondary indexes one at a time.
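A hypothetical sketch of those steps (table and column names are assumptions):

-- 1. create the target table with only the primary key, no secondary indexes yet
CREATE TABLE target (
    id  INT UNSIGNED NOT NULL,
    val INT UNSIGNED,
    PRIMARY KEY (id)
) ENGINE=MyISAM;

-- 2 and 3: insert the source rows sorted on the target's primary key
INSERT INTO target
SELECT id, val FROM source ORDER BY id;

-- 4. add the secondary indexes one at a time afterwards
ALTER TABLE target ADD INDEX idx_val (val);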
It's probably mostly about tuning. Is the MyISAM table initially empty? If so, you can use a few cheats: disable the indexes during the load, then enable them afterwards (this is NOT a good idea on a non-empty table).
Doing an ORDER BY on a MEMORY table is not a particularly good idea: MEMORY tables usually use hash indexes, which cannot do an in-order index scan, so the sort would introduce an extra filesort(), which is probably bad.