Insert into Impala table vs write to HDFS


I have about 10 thousand records (stored in an ArrayList in Java) that I want to insert into Impala.
Should I use INSERT INTO ... PARTITION ... VALUES to insert directly into Impala? (I am not sure how many records can be inserted in one SQL statement.)
Or should I write these records to HDFS and then alter the Impala table?
Which way is preferred? Or are there any other solutions?
Also, if I do this every 5 minutes, how can I avoid ending up with many small files in one partition (the table is partitioned by hour)? This will produce 12 small files in each partition; will that affect query speed?

The best approach is:
Create your table in Impala as an external table associated with an HDFS path.
Write the data directly to HDFS, daily if possible; hourly batches are probably too small.
Run the INVALIDATE METADATA $TABLE_NAME command so that the data becomes visible.
I hope this answer helps.
Regards!
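
One way to follow this advice while still receiving data every 5 minutes is to append each batch to a single staging file per hour, and move that file into the table's HDFS location only once the hour is complete. The following is a minimal Java sketch under stated assumptions: the class name HourlyPartitionWriter, the hour=yyyyMMddHH directory layout, the data.csv file name, and the local staging directory are all hypothetical, and the copy to HDFS itself is not shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

// Hypothetical helper: stages 5-minute batches locally, one file per hour partition.
final class HourlyPartitionWriter {
    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyyMMddHH");

    // Directory name for the hour partition (assumed layout, adjust to your table).
    static String partitionDir(LocalDateTime ts) {
        return "hour=" + HOUR.format(ts);
    }

    // Appends one batch of delimited lines to the staging file of its hour,
    // creating the directory and file on first use. Returns the staging file.
    static Path appendBatch(Path stagingRoot, LocalDateTime ts, List<String> csvLines)
            throws IOException {
        Path dir = stagingRoot.resolve(partitionDir(ts));
        Files.createDirectories(dir);
        Path file = dir.resolve("data.csv");
        Files.write(file, csvLines, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return file;
    }
}
```

With this layout, the twelve 5-minute batches of one hour end up in one file instead of twelve. After the hour closes, the file could be copied into the table's HDFS location and the metadata refreshed as in the last step above.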

Related

Efficient way to check whether a large number of strings exist in a database

I have a very large table in the database with a column called
"unique_code_string"; the table has almost 100,000,000 records.
Every 2 minutes I receive 100,000 code strings in an array, all unique within the array. I need to insert them into the large table if they are all "good".
"Good" means the following:
None of the 100,000 codes in the array already occurs in the large table.
If one or more codes already occur in the large table, the whole array is discarded;
that is, none of the codes in the array are inserted into the large table.
Currently, I do it this way:
First, I loop over the array and check each code to see whether the same code already exists in the large table.
Second, if all codes are "new", I do the real insert.
But this is very slow, and I must finish everything within 2 minutes.
I am thinking of other ways:
Put the 100,000 codes in a SQL IN clause. Each code is 32 characters long, and I doubt any database will accept an IN clause of 32*100,000 characters.
Use a database transaction: insert the codes anyway and, if an error happens, roll the transaction back. This causes some performance issues.
Use a database temporary table. I am not good at writing SQL queries, so please give me an example if this idea may work.
Can any experts give me some advice or solutions?
I am not a native English speaker; I hope you can understand the issue I am facing.
Thank you very much.

Load the 100,000 rows into a table!
Create a unique index on the original table:
create unique index unq_bigtable_uniquecodestring on bigtable (unique_code_string);
Now, you have the tools you need. I think I would go for a transaction, something like this:
insert into bigtable ( . . . )
select . . .
from smalltable;
If any row fails (due to the unique index), then the transaction will fail and nothing is inserted. You can also be explicit:
insert into bigtable ( . . . )
    select . . .
    from smalltable
    where not exists (select 1
                      from smalltable st join
                           bigtable bt
                           on st.unique_code_string = bt.unique_code_string
                     );
For this version, you should also have an index/unique constraint on smalltable(unique_code_string).
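
From Java, the first variant could be driven over JDBC roughly as follows. This is only a sketch under assumptions: the class name AllOrNothingInsert is hypothetical, smalltable is assumed to exist with a single unique_code_string column (the real column list is elided above), and the Connection is obtained elsewhere. As a simplification, any SQLException is treated as a rejected batch, not only a unique-index violation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Hypothetical helper: stages the codes in smalltable, then copies them to bigtable
// in one transaction so that either all codes are inserted or none are.
final class AllOrNothingInsert {
    static final String CLEAR_STAGING = "delete from smalltable";
    static final String LOAD_STAGING =
            "insert into smalltable (unique_code_string) values (?)";
    static final String COPY_TO_BIG =
            "insert into bigtable (unique_code_string) "
            + "select unique_code_string from smalltable";

    // Returns true if every code was inserted, false if the batch was rolled back.
    static boolean insertAll(Connection conn, List<String> codes) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement();
             PreparedStatement load = conn.prepareStatement(LOAD_STAGING)) {
            st.executeUpdate(CLEAR_STAGING);
            for (String code : codes) {
                load.setString(1, code);
                load.addBatch();
            }
            load.executeBatch();
            // Fails if any code already exists, because of the unique index on bigtable.
            st.executeUpdate(COPY_TO_BIG);
            conn.commit();
            return true;
        } catch (SQLException e) {
            conn.rollback(); // undo the staging load and the partial copy
            return false;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```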

It's hard to find an optimal solution with so little information. Often this depends on the network latency between application and database server and hardware resources.
You can load the 100,000,000 unique_code_string values from the database and use a HashSet or TreeSet to check for duplicates in memory before inserting into the database. If your database server is resource constrained or there is considerable network latency, this might be faster.
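
A minimal sketch of the in-memory check, assuming the existing codes have already been loaded from the database (the class and method names are hypothetical). Note that holding 100,000,000 strings of 32 characters in a HashSet will likely need several gigabytes of heap.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: keeps all known codes in memory and applies the
// "all codes new, or reject the whole batch" rule from the question.
final class CodeBatchChecker {
    private final Set<String> known;

    CodeBatchChecker(Collection<String> existingCodes) {
        this.known = new HashSet<>(existingCodes);
    }

    // Returns true and remembers the codes only if none of them is already known.
    // If any code clashes, nothing is added and the batch is rejected.
    boolean acceptIfAllNew(List<String> batch) {
        for (String code : batch) {
            if (known.contains(code)) {
                return false;
            }
        }
        known.addAll(batch);
        return true;
    }
}
```

Only when acceptIfAllNew returns true would the batch then be inserted into the database, so the set and the table stay in step as long as this process is the only writer.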
Depending on how you receive the 100,000-record delta, you could load it into the database directly; for example, a CSV file can be read using an external table. If you can get the data into a temporary table efficiently and the database server is not overloaded, you can do the check very efficiently with SQL or a stored procedure.
You should spend some time understanding how real-time the update has to be, e.g. how many SQL queries are reading the 100,000,000-row table, and whether you can allow some of them to be cancelled or blocked while you update the rows. Often it's a good idea to create a shadow table:
Create new table as copy of the existing 100,000,000 rows table.
Disable the indexes on the new table
Load the delta rows to the new table
Rebuild the indexes on new table
Delete the existing table
Rename the new table to the existing 100,000,000 rows table
The approach here is database specific. It will depend on how your database defines indexes; e.g. if you have a partitioned table, it might not be necessary.

How much time will it take to insert 500,000 records into my SQL server?

I have one table. From Java I am inserting records into the table in batches (batch size: 5000).
There is no idle time on the Java side.
Can you please let me know how much time it will take to insert
500,000 records (with indexes)?
How much time will it take to insert 500,000 records (without
indexes)?

It depends on your hardware and your tables. For a single table with 20 columns and no BLOB columns, it should take around 3-4 minutes.
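
Since the answer depends on hardware and table layout, the most reliable number is one you measure yourself, once with and once without indexes. Below is a small Java sketch for that; the class and method names are hypothetical, and the sink is a placeholder where the real JDBC batch insert (for example addBatch/executeBatch on a PreparedStatement) would go.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: splits records into fixed-size batches and times a task.
final class BatchTimer {
    // Passes consecutive batches of at most batchSize records to the sink
    // and returns how many batches were sent.
    static <T> int sendInBatches(List<T> records, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        for (int from = 0; from < records.size(); from += batchSize) {
            int to = Math.min(from + batchSize, records.size());
            sink.accept(records.subList(from, to));
            batches++;
        }
        return batches;
    }

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

With 500,000 records and a batch size of 5000 this sends 100 batches; wrapping the call in timeMillis gives the total insert time for your own hardware and table.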

Deleting 190 million records from Oracle

We have some useless historical data in a database: 190 million (19 crore) rows, taking up about 33 GB. I have been given the task of deleting all of these rows in one go, and if anything breaks, I should be able to roll back the transaction.
I will select them based on a flag such as deleted = '1', which by my estimate matches 190 million out of 200 million rows. So first I have to do a select operation and then delete those IDs.
As mentioned in this article, it takes 4 hours to delete 1.5 million records, which is far fewer than in my case, so I am wondering how long it would take to delete 190 million records if I proceed with a single DELETE.
Should I use Spring Batch to select the IDs of the rows and then delete them batch by batch, or issue a single statement passing the IDs in an IN clause?
Please suggest which approach would be better.

Why not move the required data from the historical table to a new table and drop the old table entirely? You can rename the new table to the old table name later on.

You can copy the required data from the historical table to a new table, drop the old table entirely, and later rename the new table to the old table name, as Raj said in the post above. This is the best way to do it.
You can also use the nologging and parallel options to speed it up, for example:
create table History_new parallel 4 nologging as
select /*+ parallel(source 4) */ * from History source where col1 = 1 and ... ;

If doing it in Java is not mandatory, I'd create a PL/SQL procedure, open a cursor and use DELETE ... WHERE CURRENT OF. Maybe it's not super fast, but it's safe because you will have no rollback segment problems. A normal DELETE, even without an explicit transaction, is an atomic operation that must be rolled back if something fails.

Maybe what you describe is usual, normal performance for Java, but on my notebook deleting 1M records takes about a minute, without Java, of course.
If you want to do it properly, I'd say you should use partitions. First, convert the plain table(s) into partitioned one(s) with all data in one (current) partition. Then prepare "historical" partitions and move the unnecessary data into them. After that you'll be ready to do anything: you can move that data offline (and restore it when needed), you can exclude it in seconds using EXCHANGE PARTITION, and so on.

How to store user logging

Recently, our system has needed to store millions of records per day. Each record is very simple: the userid and the clicked weburl. Afterwards we run some machine learning algorithms on the data logs.
We tried neo4j, but queries are very slow. For example: get all pairs of userids that viewed the same weburl.
So, any suggestions?

Here is how I did it for a database that supports more than 1 billion transactions per day:
Create a front table that acts as a buffer, named TBUFFER for example.
Insert into that table the information that you want to end up in your log table.
Every second, from a job, read TBUFFER and distribute the data into your final tables.
Why do that? To be able to make massive inserts.
The key is to insert in packets, which reduces the number of transactions and therefore locks.
You can also pass XML data containing many user log entries to your database and insert it in a single transaction.
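
The same "insert in packets" idea can also be applied on the application side, before the data reaches the database. The sketch below is a swapped-in variant, not the TBUFFER table described above: records are collected in an in-memory queue, and a job that runs every second drains up to one packet and inserts it in a single transaction. The class name ClickBuffer and the String[] record shape are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical in-memory buffer for (userId, url) click records.
final class ClickBuffer {
    private final BlockingQueue<String[]> queue = new LinkedBlockingQueue<>();

    // Called by request threads; cheap and thread-safe.
    void add(String userId, String url) {
        queue.add(new String[] {userId, url});
    }

    // Called by the periodic job: removes and returns up to maxPacket records,
    // which the caller would then insert in one transaction.
    List<String[]> drain(int maxPacket) {
        List<String[]> packet = new ArrayList<>();
        queue.drainTo(packet, maxPacket);
        return packet;
    }
}
```

A scheduled task could call drain once per second and write each returned packet with one batched insert, so the number of transactions is bounded by the job frequency and not by the click rate.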

I think Neo4j is not the right database to store billions of simple, non-connected records. Use a key-value store (like riak, redis etc) for that.

Duplicate set of columns from one table to another table

My requirement is to read a set of columns from a table.
The source table has many numeric columns, around 20-30. I would like to read only some of those columns and keep appending their values to the destination table. My DB is Oracle and the programming language is Java with JDBC.
The source table is very dynamic: there are frequent inserts and deletes on
it. In the destination table, on the other hand, I would like to keep the data for at least 30
days.
My setup is as follows:
Database is Oracle.
Number of rows in the source table = 20 million rows with 30 columns
Number of rows in the destination table = 300 million rows with 2-3 columns
The columns are all numeric.
I would rather not open a vanilla JDBC connection and transfer the data that way,
since it might be pretty slow given the size of the tables.
I am trying to dump the selected columns of the source table using some
SQL like this:
SQL> spool on
SQL> select c1,c5,c6 from SRC_Table;
SQL> spool off
And later use SQL*Loader to load the data into the destination database.
The source table stores time series data, and the data gets purged/deleted from the source table within 2 days. It is part of an OLTP environment. The destination table has a longer retention period (30 days of data can be stored there) and is part of an OLAP environment. So a view on the source table that selects only a set of its columns does not work in this environment.
Any suggestions or review comments on this approach are welcome.
EDIT
My tables are partitioned. The easiest way to copy data is to exchange partitions between tables:
ALTER TABLE <table_name>
EXCHANGE PARTITION <partition_name>
WITH TABLE <new_table_name>
<including | excluding> INDEXES
<with | without> VALIDATION
EXCEPTIONS INTO <schema.table_name>;
But since my source and destination tables have different columns, I think exchange partition will not work.

Shamik, okay, you're loading an OLAP database with OLTP data.
What's the acceptable latency? Does your OLAP need today's data before people come into the office tomorrow morning, or is it closer to real time?
Saying the inserts are "frequent" doesn't mean anything. Some of us are used to thousands of txns/sec; to others 1/sec is a lot.
And you say there's a lot of data. Same idea. I've read people's posts where they have HUGE tables with a couple of million records. I have tables with hundreds of billions of records. So again, a real number is very helpful.
Do not go with the trigger suggested by Schwern. If you believe your insert volume is large, it means you've probably had issues in that area. A trigger will just make it worse.
Oracle provides lots of different choices for getting data from OLTP to OLAP. Instead of reinventing the wheel, use something already written. Oracle Streams was BORN to do this exact job. You can roll your own streams using Oracle AQ. You can capture inserted rows without a trigger by using either Database Change Notification or Change Data Capture.
This is an extremely common problem, which is why I've listed 4 technologies designed to solve it.
Advanced Queuing
Streams
Change Data Capture
Database Change Notification
Start googling these terms and come back with questions on those. You'll be better off than building your own from the ground up or using triggers.

The problem seems a little vague, and frankly a little odd. The fact that there are hundreds of columns in a single table, and that you're duplicating data within the database, suggests a hosed database design.
Rather than do it manually, it sounds like a job for a trigger. Create an insert trigger on the source table to copy columns to the destination table just after they're inserted.
Another possibility: since it seems all you want is a slice of the data in your original table, rather than duplicating it (a cardinal sin of database design), create a view which includes only the columns and ranges you want. Then just access that view like any other table.
I'm willing to guess that the root of the problem is that accessing just the information you want in your source table is too slow. This suggests you might be able to fix that with better indexing. Also, your source table is probably just too damn wide.
Since I'm not an Oracle person, I leave the syntax of this as an exercise for the reader, but the concept should be sound.

On a tangential note, you might want to look at Oracle's partitioning here and here.
Partitioning enables tables and indexes to be split into smaller, more manageable components and is a key requirement for any large database with high performance and high availability requirements. Oracle Database 11g offers the widest choice of partitioning methods including interval, reference, list, and range in addition to composite partitions of two methods such as order date (range) and region (list) or region (list) and customer type (list).
Faster Performance—Lowers query times from minutes to seconds
Increases Availability—24 by 7 access to critical information
Improves Manageability—Manage smaller 'chunks' of data
Enables Information Lifecycle Management—Cost-efficient use of storage
Partitioning the table into daily partitions would make archiving easier as described here
