My requirement is to read some set of columns from a table.
The source table has many numeric columns (around 20-30), and I would like to read only a subset of them and keep appending their values to the destination table. My database is Oracle and I am using Java with JDBC.
The source table is very dynamic: inserts and deletes happen on it frequently.
At the destination table, on the other hand, I would like to keep the data for at
least 30 days.
My Setup is described as below -
Database is Oracle.
Number of rows in the source table = 20 Million rows with 30 columns
Number of rows in destination table = 300 Million rows with 2-3 columns
The columns are all Numeric.
I would rather not just open a vanilla JDBC connection and transfer the data,
which might be pretty slow given the size of the tables.
I am trying to take the dump of the selected columns of the source table using some
sql like -
SQL> spool on
SQL> select c1,c5,c6 from SRC_Table;
SQL> spool off
And later use SQL*Loader to load the data into the destination database.
The source table stores time series data, and the data gets purged/deleted from the source table within 2 days. It is part of an OLTP environment. The destination table has a longer retention period: 30 days of data can be stored there, and it is part of an OLAP environment. So a view on the source table that selects only a subset of the columns does not work in this environment.
Any suggestions or review comments on this approach are welcome.
EDIT
My tables are partitioned. The easiest way to copy data is to exchange a partition between tables:
ALTER TABLE <table_name>
EXCHANGE PARTITION <partition_name>
WITH TABLE <new_table_name>
<including | excluding> INDEXES
<with | without> VALIDATION
EXCEPTIONS INTO <schema.table_name>;
However, since my source and destination tables have different columns, I think exchange partition will not work.
Shamik, okay, you're loading an OLAP database with OLTP data.
What's the acceptable latency? Does your OLAP need today's data before people come in to the office tomorrow morning, or is it closer to real time?
Saying the Inserts are "frequent" doesn't mean anything. Some of us are used to thousands of txns/sec - to others 1/sec is a lot.
And you say there's a lot of data. Same idea. I've read people's posts where they have HUGE tables with a couple million records. I have tables with hundreds of billions of records. So again, a real number is very helpful.
Do not go with the trigger suggested by Schwern. If you believe your insert volume is large, it means you've probably had issues in that area. A trigger will just make it worse.
Oracle provides lots of different choices for getting data from OLTP to OLAP. Instead of reinventing the wheel, use something already written. Oracle Streams was BORN to do this exact job. You can roll your own streams using Oracle AQ. You can capture inserted rows without a trigger by using either Database Change Notification or Change Data Capture.
This is an extremely common problem, which is why I've listed 4 technologies designed to solve it.
Advanced Queuing
Streams
Change Data Capture
Database Change Notification
Start googling these terms and come back with questions on those. You'll be better off than building your own from the ground up or using triggers.
The problem seems a little vague, and frankly a little odd. The fact that there's hundreds of columns in a single table, and that you're duplicating data within the database, suggests a hosed database design.
Rather than do it manually, it sounds like a job for a trigger. Create an insert trigger on the source table to copy columns to the destination table just after they're inserted.
Another possibility is that since it seems all you want is a slice of the data in your original table, rather than duplicating it, a cardinal sin of database design, create a view which only includes the columns and ranges you want. Then just access that view like any other table.
I'm willing to guess that the root of the problem is that accessing just the information you want in your source table is too slow. This suggests you might be able to fix that with better indexing. Also, your source table is probably just too damn wide.
Since I'm not an Oracle person, I leave the syntax of this as an exercise for the reader, but the concept should be sound.
On a tangential note, you might want to look at Oracle's partitioning here and here.
Partitioning enables tables and indexes to be split into smaller, more manageable components and is a key requirement for any large database with high performance and high availability requirements. Oracle Database 11g offers the widest choice of partitioning methods including interval, reference, list, and range in addition to composite partitions of two methods such as order date (range) and region (list) or region (list) and customer type (list).
Faster Performance—Lowers query times from minutes to seconds
Increases Availability—24 by 7 access to critical information
Improves Manageability—Manage smaller 'chunks' of data
Enables Information Lifecycle Management—Cost-efficient use of storage
Partitioning the table into daily partitions would make archiving easier as described here
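With daily partitions, enforcing the 30-day retention on the destination table comes down to dropping one partition per day. A minimal Java sketch of generating that statement follows. The `P_yyyyMMdd` partition naming convention, the table name and the class itself are assumptions for illustration, not something taken from the question.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class PartitionRetention {
    // Hypothetical naming convention: one partition per day, named P_yyyyMMdd.
    private static final DateTimeFormatter NAME = DateTimeFormatter.ofPattern("yyyyMMdd");

    static String partitionName(LocalDate day) {
        return "P_" + day.format(NAME);
    }

    /**
     * Builds the statement that drops the daily partition which has just
     * fallen out of the retention window (today minus retentionDays minus 1).
     */
    static String dropExpired(String table, LocalDate today, int retentionDays) {
        LocalDate expired = today.minusDays(retentionDays + 1L);
        return "ALTER TABLE " + table + " DROP PARTITION " + partitionName(expired);
    }
}
```

The resulting string could be run once a day through a plain JDBC `Statement`. Dropping a partition is a metadata operation, which is what makes it attractive compared with deleting millions of rows.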
Related
I'm writing an application that allows each user to label English words in three categories (some lexical exercise).
The main DB table, Word, contains ~4K different rows of words.
The Label table contains 3 labels.
--> The Word-Label table (which contains 3 columns: word_id, label_id, user_id) will gain 4K rows per user (let's assume all the words start with some pre-defined label when a user registers with the system).
The problem is that the table will grow very fast. 1:4000 (user/row) is bad in my opinion.
What can you suggest here to eliminate such a huge table? I've read that table-per-user is also considered bad practice.
In addition, I'm using Spring & Hibernate, and the 4K insertions when a user registers for the first time are pretty heavy and take time.
I can consider some NoSQL solution or a tool other than Hibernate, but I insist on using Spring and Java, so please suggest something suitable.
Will be glad for your help here!
There is no issue with data size. You may have an issue with Hibernate, but that is another issue.
If you end up with thousands of users, you'll have a few tens of millions of rows. That is not a large number of rows. If you want to insert default labels for a new user, then the code would look something like this:
insert into userLabels (userId, wordId, label)
select :userId, w.wordId, <default label>
from words w;
I would be surprised if this took more than a second or two.
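From Java, the statement above could be issued through plain JDBC for this one bulk step, bypassing Hibernate. This is only a sketch: it assumes the table and column names from the SQL above and a numeric default label, and the class and method names are made up. Some drivers may need an explicit cast around the bound parameters in the select list.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class DefaultLabelSeeder {
    // One set-based statement: the database copies every word id itself,
    // so the 4K rows never travel through the application.
    static final String SEED_SQL =
        "INSERT INTO userLabels (userId, wordId, label) "
      + "SELECT ?, w.wordId, ? FROM words w";

    /** Inserts one row per word for the given user and returns the row count. */
    static int seed(Connection conn, long userId, int defaultLabel) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SEED_SQL)) {
            ps.setLong(1, userId);
            ps.setInt(2, defaultLabel);
            return ps.executeUpdate();
        }
    }
}
```

In a Spring application the `Connection` would typically come from the configured `DataSource`, so the call can take part in the same transaction as the user registration.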
If you knew that you would be having millions of users, then size might be more of an issue. The best solution would require better understanding of the application. The solution might vary from partitioning the tables, using arrays, or coming up with a different structure for representing your data.
You probably want various indexes on your tables to speed performance, but that depends on the queries you want to run. You might consider using a native interface to the database. Your use-case doesn't seem particularly complicated, so I don't know what advantage Hibernate or similar layers gets you.
First approach: add a row to the word-label table for a user only after the user performs an action. That way, not every user will have 4K rows in that table. If the queries around that functionality later become a bottleneck, fix the issue and improve performance at that point.
There are many performance tricks you can use in SQL databases. For example, you wrote about table-per-user, which is not the best solution. In MySQL, on the other hand, you can create table partitions; the table is still handled as one table, but with a performance improvement.
Second approach: for this type of data, of course, a NoSQL store like MongoDB would perform great.
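The first approach above (store a row only once a user changes a label) can be modelled in a few lines of Java. A rough sketch, where the class name and the integer ids are assumptions, and the map stands in for the rows that would actually be stored in the word-label table:

```java
import java.util.HashMap;
import java.util.Map;

final class SparseLabels {
    private final int defaultLabel;
    // Only words whose label differs from the default are stored (wordId -> labelId).
    private final Map<Integer, Integer> overrides = new HashMap<>();

    SparseLabels(int defaultLabel) {
        this.defaultLabel = defaultLabel;
    }

    /** Records a label; setting a word back to the default removes its row. */
    void label(int wordId, int labelId) {
        if (labelId == defaultLabel) {
            overrides.remove(wordId);
        } else {
            overrides.put(wordId, labelId);
        }
    }

    /** Words without a stored row fall back to the default label. */
    int labelOf(int wordId) {
        return overrides.getOrDefault(wordId, defaultLabel);
    }

    /** Number of rows that would actually exist for this user. */
    int storedRows() {
        return overrides.size();
    }
}
```

With this layout a newly registered user costs zero rows, and the 4K inserts at registration disappear entirely.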
You could encode the user's response map into a 4000-entry bit array, or a string, if you don't need the relational capabilities of the database.
Then it would be one record per user.
create table user_words (userid int, worddata text);
insert into user_words values (1,'YNYYNmmmYY'/* ... */ );
Your application would need to have the list of words and know which word each character refers to.
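On the Java side, the one-character-per-word string could be wrapped in a small helper like the one below. This is a sketch under the assumption that the position in the string is the word's index in a fixed word list; the class and method names are hypothetical.

```java
import java.util.Arrays;

final class UserWordLabels {
    // One character per word; the array index is the word's position in the word list.
    private final char[] labels;

    /** Creates the initial state for a new user: every word carries the default label. */
    UserWordLabels(int wordCount, char defaultLabel) {
        labels = new char[wordCount];
        Arrays.fill(labels, defaultLabel);
    }

    private UserWordLabels(String encoded) {
        labels = encoded.toCharArray();
    }

    /** Rebuilds the labels from the text column read from the database. */
    static UserWordLabels decode(String encoded) {
        return new UserWordLabels(encoded);
    }

    void set(int wordIndex, char label) {
        labels[wordIndex] = label;
    }

    char get(int wordIndex) {
        return labels[wordIndex];
    }

    /** Produces the string to store in the single per-user row. */
    String encode() {
        return new String(labels);
    }
}
```

The trade-off is the one already mentioned: the database can no longer answer questions such as "which users labelled this word Y" without the application decoding every row.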
Currently I am facing a distinct issue: I receive data from a web service call, and it needs to be loaded into an Oracle table.
Scenario:
- I have a very large table with 500 columns, all mandatory, and no option to split the table.
- The dataset is 50M records, which I am trying to export from the source system, and it is continuously growing.
- I receive 50 columns of data per request to the web service (at the source system), so I need to submit 10 requests of 50 columns each to get a full record.
- Also, I can only receive 100,000 (1 lakh) records per request for a specific set of columns.
Now, to import this data into the Oracle DB at the destination system, I have the following two choices:
1. First export the data into temporary tables of 50 columns each, then run a join across all of them to create the final table with all 500 columns
2. Fire 10 parallel requests of 50 columns each, stitch the data in my Java program, and then send an insert query with all 500 columns
I would like to know which technique works better: the Oracle-based table join, or stitching on the Java side using the primary key column?
As the data set is very large, I am looking purely at the performance aspect. Are there any more optimized ways to solve the same problem?
From performance point of view the Oracle based solution would clearly win. From implementation point of view (aiming for a clear and simple solution) Oracle tables win again. Here is why:
Architecture point of view: Combining the data in your app will make your app stateful. From a simple stateless (receive-save-forget) application you would turn it into a complex state-aware (save-look for joint records-did not find anything-store-wait-look again-etc). This is much harder to develop, maintain or debug.
Performance point of view: Saving data into multiple tables and later combining them into one (either by views, stored procedures or simple selects) is something Oracle is designed for. An immense amount of development time has been spent on optimizing these basic features. Whatever you come up with to implement the same features (even though you are aware of some specifics) would likely perform worse.
So overall I would strongly suggest Option #1, leave it for Oracle to do the hard part. Depending on how you want to use this data after the import (almost real-time / once in a while / after extra filtering applied) you can choose how you construct the final records by using one of these:
stored procedures
Oracle jobs
views.
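To make Option #1 concrete, here is a rough Java sketch that builds the `INSERT ... SELECT` joining the staging tables on the primary key. The table names (`STAGE_1`, `STAGE_2`, ...), the key column and the builder class are all hypothetical. The names are concatenated into the SQL, so they must come from trusted configuration, not from user input.

```java
import java.util.List;
import java.util.StringJoiner;

final class StitchSqlBuilder {
    /**
     * Builds an INSERT ... SELECT that joins staging tables on a shared key.
     * columnsPerStage.get(i) lists the non-key columns held by staging table i + 1.
     * An inner join is used, so a key only reaches the target once every
     * staging table contains it.
     */
    static String build(String target, String stagePrefix, String key,
                        List<List<String>> columnsPerStage) {
        StringJoiner insertCols = new StringJoiner(", ");
        StringJoiner selectCols = new StringJoiner(", ");
        insertCols.add(key);
        selectCols.add("s1." + key);
        StringBuilder from = new StringBuilder(" FROM " + stagePrefix + "1 s1");
        for (int i = 0; i < columnsPerStage.size(); i++) {
            String alias = "s" + (i + 1);
            for (String col : columnsPerStage.get(i)) {
                insertCols.add(col);
                selectCols.add(alias + "." + col);
            }
            if (i > 0) {
                // Every further staging table is joined back to the first one on the key.
                from.append(" JOIN ").append(stagePrefix).append(i + 1).append(' ').append(alias)
                    .append(" ON ").append(alias).append('.').append(key)
                    .append(" = s1.").append(key);
            }
        }
        return "INSERT INTO " + target + " (" + insertCols + ")"
             + " SELECT " + selectCols + from;
    }
}
```

The Java program then stays stateless: each web service response is bulk-inserted into its own staging table, and the generated statement is run once all ten slices for a batch have arrived.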
I have an HBase table with a couple of million records. Each record has a couple of properties describing the record, each stored in a column qualifier (mostly int or string values).
I have a requirement that I should be able to see the records paginated and sorted based on a column qualifier (or even more than one, in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one for each sort property, which would be included in the rowkey, and then redirecting queries to those tables. But this seems very tedious, as I have a few so-called properties already.
Thanks for any suggestions.
You need your NoSQL database to work just like a RDBMS, and given the size of your data your life would be a lot simpler if you stick to it, unless you expect exponential growth :) Also, you don't mention if your data gets updated, this is very important to make a good decision.
Having said that, you have a lot of options, here are some:
If you can wait for the results: Write a MapReduce task to do the scan, sort it and retrieve the top X rows. Do you really need more than 1000 pages (20-50k rows) for each sort type? Another option would be using something like Hive.
If you can aggregate the data and "reduce" the dataset: Write a MapReduce task to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times and it works like a charm, but it depends on your requirements.
If you have plenty of storage: Write a MapReduce task to periodically regenerate (or append the data to) a new table for each property (sorting by it in the rowkey). You don't need multiple tables, just use a prefix in your rowkeys for each case. Or, if you do not want tables and you won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS, where they can easily be read by your frontend app.
Manually maintain a secondary index: This would not be very tolerant of schema updates and new properties, but it would work great for near real-time results. To do it, you have to update your code to also write to the secondary table, with a good buffer to help with performance while avoiding hot regions. Think about this type of rowkey: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B timestamp], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields, perform a SCAN using the SORT FIELD ID as the start row plus the starting sort field value as the pivot for pagination (ignore it to get the first page, then set it to the last value retrieved). That gives you the rowkeys of the main table, and you can then perform a multiget against it to retrieve the full data. Keep in mind that you'll need a small script to scan the main table and write the data to the index table for the existing rows.
Rely on any of the automatic secondary indexing through coprocessors like you mentioned, although I do not like this option at all.
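The rowkey layout suggested for the manually maintained index ([4B field id][8B sort value][8B timestamp]) could be assembled as shown below. This sketch uses only `java.nio` instead of the HBase client utilities, and the class name is made up. It also adds one assumption that the answer does not spell out: HBase orders rowkeys as unsigned bytes, so the sign bit of the value is flipped to keep negative numbers sorted before positive ones.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class IndexRowKey {
    static final int LENGTH = 4 + 8 + 8;

    /** Builds [4-byte field id][8-byte sort value][8-byte timestamp], big-endian. */
    static byte[] of(String fieldId, long sortValue, long timestamp) {
        byte[] id = fieldId.getBytes(StandardCharsets.US_ASCII);
        if (id.length != 4) {
            throw new IllegalArgumentException("field id must be exactly 4 ASCII chars");
        }
        return ByteBuffer.allocate(LENGTH)
            .put(id)
            // Flip the sign bit so unsigned byte order matches numeric order.
            .putLong(sortValue ^ Long.MIN_VALUE)
            .putLong(timestamp)
            .array();
    }

    /** Recovers the original sort value from an index rowkey. */
    static long sortValue(byte[] key) {
        return ByteBuffer.wrap(key).getLong(4) ^ Long.MIN_VALUE;
    }
}
```

For pagination, the scan's start row for the next page would be built from the field id plus the last sort value and timestamp already returned.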
You have mostly enumerated the options. HBase natively does not support secondary indexes as you are aware. In addition to hindex you may consider phoenix
https://github.com/forcedotcom/phoenix
(from Salesforce), which in addition to secondary indexes has a JDBC driver and SQL support.
Facts
Database: PostgreSQL (latest)
Programming language: Java
Problem statement (simplified)
We have 2 tables - overview and details. There could be millions of rows in "overview" and each row of "overview" can have millions of rows associated with it in "details". The foreign key details.overview_id refers to overview.id. Most queries are of the general form SELECT * FROM details WHERE overview_id = xxx AND details.id > yyy AND details.id < zzz; If we have a single table for details, the queries will be too slow (although the queries on details are almost always on primary keys). More on the nature of DB activities: INSERT and UPDATE on overview happens infrequently. INSERT on details happen at a rapid pace, while UPDATE on the same table almost never happens and bulk DELETE happens sometimes.
What we already have
In the past we used raw SQL to partition the table "details" against each row in "overview". (In practice, we did not actually partition, instead we created new tables based on a template. These tables did not have any column called overview_id (saving storage space), instead we had a separate table that did the mapping between overview.id and the table-name of the specific partition table.) So, as you can understand, the partitions had to be generated on the fly as new rows were inserted in overview and partitions were dropped as rows were deleted from overview. All of this was managed inside the application. The application-database interaction has been blazing fast, but the application code is fairly complex, implying it is hard to maintain. Also, with raw SQL lying around everywhere, it is hard to scale the DB horizontally - we have to reinvent what most JPA providers have already done.
Current goal
Currently we are exploring options for a mechanism by which this partitioning can happen behind the scene - possibly by a JPA provider (I understand that this is not part of the JPA spec), so that we can focus on the application while the underlying framework/layer takes care of the scalability issues.
I looked at openJPA Slice and EclipseLink. Both of them provide partition (shard) management across hosts. We certainly need that. But we also need partition management within a single host. However, if there is a better or more elegant solution to this or if there is a totally different angle to look at this, I will be really glad to know about that.
I will appreciate any insight you can provide.
Thanks.
Prajesh
Have you looked into using Postgres's table partitioning?
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
Thank you all for your comments/answers till date. We decided to stick to what we already have (see the section named "what we already have"), with minor modifications.
I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How can I do multi-threading across different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce style things on a mysql database. It might be what you are looking for: see here. There is a recording on the mysql webpage here (have to register for mysql.com though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to record the set of rows processed by each node. Write the Java program so that it first fetches the last row claimed by the other nodes and then adds an entry to the same table telling the other nodes which range of rows it is going to process (you can decide the range size). In our case, let's assume each node processes 1000 rows at a time. Node 1 reads table B and finds it empty. Node 1 then inserts a row ('Node1', 1000), announcing that it is processing rows whose primary key in A is <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along and finds that the first 1000 primary keys have been claimed by another node. Hence it inserts a row ('Node2', 2000), telling the others that it is processing rows 1001 to 2000. Please note that access to table B should be synchronized, i.e. only one node can work on it at a time.
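The claiming protocol can be tried out in plain Java before wiring it to the database. In the sketch below an in-memory list stands in for table B and all names are hypothetical. In a real deployment the nodes are separate processes, so the `synchronized` method would have to be replaced by a database-level lock or transaction around the read-then-insert on table B.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeClaims {
    /** One row of table B: the node and the key range (fromExclusive, toInclusive] it claimed. */
    static final class Claim {
        final String node;
        final long fromExclusive;
        final long toInclusive;

        Claim(String node, long fromExclusive, long toInclusive) {
            this.node = node;
            this.fromExclusive = fromExclusive;
            this.toInclusive = toInclusive;
        }
    }

    private final List<Claim> claims = new ArrayList<>(); // stand-in for table B
    private final long batchSize;
    private final long maxKey;

    RangeClaims(long batchSize, long maxKey) {
        this.batchSize = batchSize;
        this.maxKey = maxKey;
    }

    /** Claims the next unprocessed range, or returns null once every key is taken. */
    synchronized Claim claimNext(String node) {
        long last = claims.isEmpty() ? 0 : claims.get(claims.size() - 1).toInclusive;
        if (last >= maxKey) {
            return null;
        }
        Claim claim = new Claim(node, last, Math.min(last + batchSize, maxKey));
        claims.add(claim);
        return claim;
    }
}
```

Each node would loop on `claimNext`, process the rows of A whose primary key falls in the returned range, and stop when it gets `null`.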
Since you only have one mysql server, make sure you're using the innodb engine to reduce table locking on updates.
Also I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits, as well as reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, thereby decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL back to the cluster transparently.
You could set up sharding with a mysql on each machine but the set up time, maintenance and the changes to database access layer might be a lot of work compared to a gearman solution. You might also want to look at the experimental spider engine that could allow you to use multiple mysqls in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best option would be to do the update in pure SQL if that is at all possible, or to use a stored procedure, so that all processing can take place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on some application key.