so, say I have a SQL(Mysql) database containing 4 tables {A, B, C, D}
and I want to create a testing database which contains a subset of data from the first database (both in time and type).
so for example:
"I want to create a new (identical in structure) database containing all of the data for user "bob" for the last two weeks."
the naive approach is to dump two weeks of data from the first database, use vagrant / chef to spin up a new empty database and import the dump data.
however, this does not work as each table has foreign keys with each other.
so, if I have two weeks of data of "A", it might rely on a year old data from "D".
My current solution is to use the data layer of my java application load the data in to memory and then inserted it into the database. however, this is not sustainable / scalable.
so, in a roundabout way, my question is, does anyone know of any tools or tricks to migrate a "complete" set of data from one database to another considering a time period of 1 table and including all other related data from the other tables as well?
any suggestions would be fantastic :)
Try your "naive approach", but SET FOREIGN_KEY_CHECKS=0; first, then run your backup queries, then SET FOREIGN_KEY_CHECKS=1;
There is the way to recreate similar database with part of data, not so simple, but can be used:
Create new database (only schema), for example using buckup or schema comparer tool.
The next step is to copy table data. You could do it with a help of data comparer tool (look at schema comparer link). Select tables you need and check record data to synchronize.
Related
I have an Hbase table with a couple of million records. Each record has a couple of properties describing the record, stored each in a column qualifier.(Mostly int or string values)
I have a a requirement that I should be able to see the records paginated and sorted based on a column qualifier (or even more than one, in the future). What would be a best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one for each sort property, which would be included in the rowkey and then redirect queries to those tables. But this seems very tedious as I have a few so called properties already..
Thanks for any suggestions.
You need your NoSQL database to work just like a RDBMS, and given the size of your data your life would be a lot simpler if you stick to it, unless you expect exponential growth :) Also, you don't mention if your data gets updated, this is very important to make a good decision.
Having said that, you have a lot of options, here are some:
If you can wait for the results: Write a MapReduce task to do the scan, sort it and retrieve the top X rows, do you really need more than 1000 pages (20-50k rows) for each sort type?. Another option would be using something like Hive.
If you can aggregate the data and "reduce" the dataset: Write a MapReduce task to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times to and it works like a charm, but it depends on your requirements.
If you have plenty of storage: Write a MapReduce task to periodically regenerate (or append the data) a new table for each property (sorting by it in the row-key). You don't need multiple tables, just use a prefix in your rowkeys for each case, or, if you do not want tables and you won't have a lot queries, simply write the sorted data to csv files and store them in the HDFS, they could be easily read by your frontend app.
Manually maintain a secondary index: Which would not very tolerant to schema updates and new properties but would work great for near real-time results. To do it, you have to update your code to also to write to the secondary table with a good buffer to help with performance while avoiding hot regions. Think about this type of rowkeys: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B timestamp], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields just perform a SCAN using the SORT FIELD ID as start row + the starting sort field value as pivot for pagination (ignore it to get the first page, then set the last one retrieved), that way you'll have the rowkeys of the main table, and you can just perform a multiget to it to retrieve the full data. Keep in mind that you'll need a small script to scan the main table and write the data to the index table for the existing rows.
Rely on any of the automatic secondary indexing through coprocessors like you mentioned, although I do not like this option at all.
You have mostly enumerated the options. HBase natively does not support secondary indexes as you are aware. In addition to hindex you may consider phoenix
https://github.com/forcedotcom/phoenix
( from SalesForce) which in addition to secondary indexes has jdbc driver and sql support.
I'm not sure if something special exists for this use case - but it felt like a case where someone was likely to have made some sort of useful structure/technique/design-pattern.
My Situation
I have a set of SQL commands executed from middle tier (Java) to insert/update/delete data to any of a set of very large tables via joins from a related staging table.
I have more SQL commands which update various derived tables based on the staging table/actual table contents. Different tables will interact with different derived tables via different queries (as usual). These commands may have to be interleaved with the first set depending on the use case - so, I can't necessarily execute set 1 then set 2 all at once.
My Question
So, I need to build a chain of commands that get executed sequentially, and I need to trigger a rollback if any of them fail. I'd like to do this in the most clear, documented way possible.
Does anyone know a standard way of coding this? I'm sure anyone migrating from stored procedure code to middle tier code has done this before and I don't want to reinvent the wheel if there are good options out there.
Additional Information
One of my main concerns is making everything clear. To elaborate, I'll have a set of queries specifically designed to:
Truncate staging table A' and populate it with primary keys targeting deletion records
Delete from actual table A based on join with A'
Truncate staging table A' and populate it with full data for upserts
Update/Insert records from A' to A based on joins
The same logic will apply to tables B, C, D, etc. Unfortunately, it can be the case where just A and C need an extra step, like syncing deletes to a certain derived table, to be done after the deletions but before the upserts.
I'd obviously like to group all the logic for updating a table, and I'd like to group all the logic for updating a derived table as well, but at execution time they have to be intelligently interleaved and this sounds messy to me.
Don't write such a thing yourself. This is what JTA was born for.
You can use either JPA or Spring to do it.
Annotate the unit of work as transactional and let the database and JDBC handle it.
If you must do it yourself, follow the aspect-oriented approach and make it a decorative "before & after" implementation.
I want to read items from SQL table which are in Hierarchical manner. e.g. My table having Parent-Child relations.
Following is my table structure:
id name parentId
1 abc null
2 xyz 1
3 lmn 1
4 qwe null
5 asd 4
So I want to load all the Item hierarchically in one go, How can I do that?
I tried HashMap but it feels very complex....
thanks in advance...
You can retrieve the whole hierarchy through one SQL query, but the query depends on the database system - for instance, Oracle supports a "CONNECT BY" query for hierarchical queries. MySQL does not support this directly - but there are some very good approaches explained at Hierachical Queries in MySQL. I suggest that you try one of these. The basic approach is to create a function which imitates the CONNECT BY.
First of all, form your question it is not clear if you need help on SQL side or java data structure side.
I assume both i.e. you need to fetch data from database & then populate in appropriate java data structure.
So there is no direct way for this. Your data is mainly TREE STRUCTURE, so mainly you need to focus on java data structure for tree. So below are the steps that you need to take.
1) First create java class structure for tree structure. Refer below links for that.
Java tree data-structure?
http://www.java2s.com/Code/Java/Collections-Data-Structure/Yourowntreewithgenericuserobject.htm
http://www.codeproject.com/Articles/14799/Populating-a-TreeView-Control-from-the-Database
2) Then you don't need much from sql side if you finally need java objects. SO you can do plain select query to fetch data.
3) Now you need to have a helper method for preparing tree for you data using class structure form step 1. If you do better design, you tree class structure can be made self constructing using SQL data.
Following situations:
I got two databases featuring an identical structure. On top of each of these databases runs an instance of the same app using Hibernate for ORM. The two are completely independent.
Now I have to merge both applications into one. In some tables, adjustments need to be made to avoid violating unique key constraints.
Since both databases are identical in terms of structure and the same Hibernate mapping is used, is there a way to use Hibernate for the task? I'm thinking of loading an Object from database A, modifying it in code and simply saving it to a Session from a SessionFactory based on database B. I'm wondering whether Hibernate would be able to update the primary and foreign key values accordingly and how difficult it would be to handle dependencies to objects that are not copied from the database A (because they are not needed any more).
Any recommendations?
isn't it easier to just do a database dump from database A and import it into database B? Or as an alternative use insert into B.table (col1,col2) values (select col1,col3 from A.table) ?
If your databases are MySQL, you use the MERGE storage engine. Here are the steps:
-In one of your databases, update all your id via Hibernate using the cascade all. All your id have to be increment by the last id of your other database on each table:
User1 (2000 rows, lastId: 2000) and User2 (3000 rows, lastId: 3000) -> User1 (2000 rows, lastId: 2000) and User2 (3000 rows, firstId:3000, lastId: 6000)
-Create an other database that merge all your databases
-Extract a dump from your new database and load this dump in your final database -> http://dev.mysql.com/doc/refman/5.0/en/merge-storage-engine.html
This is one possible way :)
I know it is an old thread, but I had a similar problem.
I solved including two date fields : included_date and changed_date to my tables, and also, I included another field to save the date I last sync the databases somewhere else (I have a table with configuration info).
When my system connects to the server I send the date from the last sync, then my routine can compare which rows hava been included or changed since my last sync.
Every new row I set the date into the included_date field, so when I sync I know which rows were created after my last sync, then I can do an INSERT. The same happens with row changes and the changed_date field, then I do an UPDATE.
My requirement is to read some set of columns from a table.
The source table has many - around 20-30 numeric columns and I would like to read only a set of those columns from the source table and keep appending the values of those columns to the destination table. My DB is on Oracle and the programming language is JDBC/Java.
The source table is very dynamic - there are frequent inserts and deletes happen on
it. Whereas at the destination table, I would like to keep the data for at least 30
days.
My Setup is described as below -
Database is Oracle.
Number of rows in the source table = 20 Million rows with 30 columns
Number of rows in destinationt table = 300 Million rows with 2-3 columns
The columns are all Numeric.
I am thinking of not doing a vanilla JDBC connection open and transfer the data,
which might be pretty slow looking at the size of the tables.
I am trying to take the dump of the selected columns of the source table using some
sql like -
SQL> spool on
SQL> select c1,c5,c6 from SRC_Table;
SQL> spool off
And later use SQLLoader to load the data into the destination database.
The source table is storing time series data and the data gets purged/deleted from source table within 2 days. Its part of OLTP environment. The destination table has larger retention period - 30days of data can be stored here and it is a part of OLAP environment. So, the view on source table where view selects only set of columns from the source table, does not work in this environment.
Any suggestion or review comments on this approach is welcome.
EDIT
My tables are partitioned. The easiest way to copy data is to exchange partition netween tables
*ALTER TABLE <table_name>
EXCHANGE PARTITION <partition_name>
WITH TABLE <new_table_name>
<including | excluding> INDEXES
<with | without> VALIDATION
EXCEPTIONS INTO <schema.table_name>;*
but since my source and destination tables have different columns so I think exchange partition will not work.
Shamik, okay, you're loading an OLAP database with OLTP data.
What's the acceptable latency? Does your OLAP need today's data before people come in to the office tomorrow morning, or is it closer to real time.
Saying the Inserts are "frequent" doesn't mean anything. Some of us are used to thousands of txns/sec - to others 1/sec is a lot.
And you say there's a lot of data. Same idea. I've read people's post where they have HUGE tables with a couple million records. i have table with hundreds of billions of records. SO again. A real number is very helpful.
Do not go with the trigger suggested by Schwern. If you believe your insert volume is large, it means you've probably have had issues in that area. A trigger will just make it worse.
Oracle provide lots of different choices for getting data from OLTP to OLAP. Instead of reinventing the wheel, use something already written. Oracle Streams was BORN to do this exact job. You can roll your own streams with using Oracle AQ. You can capture inserted rows without a trigger by using either Database Change Notification or Change Data Capture.
This is an extremely common problem, which is why I've listed 4 technologies designed to solve it.
Advanced Queuing
Streams
Change Data Capture
Database Change Notification
Start googling these terms and come back with questions on those. you'll be better off than building your own from the ground up or using triggers.
The problem seems a little vague, and frankly a little odd. The fact that there's hundreds of columns in a single table, and that you're duplicating data within the database, suggests a hosed database design.
Rather than do it manually, it sounds like a job for a trigger. Create an insert trigger on the source table to copy columns to the destination table just after they're inserted.
Another possibility is that since it seems all you want is a slice of the data in your original table, rather than duplicating it, a cardinal sin of database design, create a view which only includes the columns and ranges you want. Then just access that view like any other table.
I'm willing the guess that the root of the problem is accessing just the information you want in your source table is too slow. This suggests you might be able to fix that with better indexing. Also, your source table is probably just too damn wide.
Since I'm not an Oracle person, I leave the syntax of this as an exercise for the reader, but the concept should be sound.
On a tangential note, you might want to look at Oracle's partitioning here and here.
Partitioning enables tables and indexes to be split into smaller, more manageable components and is a key requirement for any large database with high performance and high availability requirements. Oracle Database 11g offers the widest choice of partitioning methods including interval, reference, list, and range in addition to composite partitions of two methods such as order date (range) and region (list) or region (list) and customer type (list).
Faster Performance—Lowers query times from minutes to seconds
Increases Availability—24 by 7 access to critical information
Improves Manageability—Manage smaller 'chunks' of data
Enables Information Lifecycle Management—Cost-efficient use of storage
Partitioning the table into daily partitions would make archiving easier as described here