1) Is it possible to create a Phoenix table backed by an existing HBase table?
Based on the information here
http://phoenix.apache.org/language/#create_table
it should be possible, but I am not sure exactly which options need to be passed in; I see no examples there.
"HBase table and column configuration options may be passed through as key/value pairs to configure the HBase table as desired. Note that when using the IF NOT EXISTS clause, if a table already exists, then no change will be made to it."
2) Also, is it possible, as part of table creation, to control the mapping of Phoenix column names to HBase column names myself?
3) I know that a Phoenix view (backed by an HBase table) has certain problems/limitations in maintaining its indexes if the writing process writes directly to the underlying HBase table and not to the Phoenix view.
https://issues.apache.org/jira/browse/PHOENIX-1499
https://issues.apache.org/jira/browse/PHOENIX-1619
So... will that also be a problem if we create a Phoenix table backed by an HBase table (and write to the HBase table behind Phoenix's back)? I mean, if the answer to 1) is positive, will I have the same problem with a Phoenix table as with a Phoenix view (provided that my writes don't go through Phoenix)?
Taking a direct stab at the answers here.
a) Specific properties can be passed to column families or to the table in general; the available options are listed here: http://phoenix.apache.org/language/#options. You can create a view that references an existing HBase table. I prefer views because I can drop and recreate them without issues, unlike tables, whose drop will cause the underlying HBase table to vanish as well.
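A rough sketch of both forms, assuming an existing HBase table t1 with a column family f1 and a qualifier firstName (all names here are placeholders); the quoted identifiers preserve the exact, case-sensitive HBase names:
-- map the existing HBase table as a view; pk is the HBase row key,
-- "f1"."firstName" is column family "f1", qualifier "firstName"
CREATE VIEW "t1" (
    pk VARCHAR PRIMARY KEY,
    "f1"."firstName" VARCHAR
);
-- or as a full Phoenix table (dropping it also drops the HBase table);
-- the option at the end is one example of an HBase property passed through as a key/value pair
CREATE TABLE "t1" (
    pk VARCHAR PRIMARY KEY,
    "f1"."firstName" VARCHAR
) DATA_BLOCK_ENCODING='NONE';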
b) This is not possible, AFAIK. There are no mapping options between existing HBase tables and corresponding Phoenix views (i.e. having fN in Phoenix refer to firstName in HBase).
c) That is correct. At least in the 4.x versions of Phoenix, this is true.
i) If you create a Phoenix table (the HBase table will get created automatically) and write to HBase directly, you will have to take care to use the Phoenix types for writing so that the values can be read back properly through Phoenix (https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/schema/types). Also note that you will have to account for the Phoenix salt if you have defined the table with SALT_BUCKETS.
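As an illustration of that last point (a sketch with made-up names): for a table declared like the one below, Phoenix prepends a one-byte salt to every row key, so rows written directly through the HBase API without that salt byte will not line up with what Phoenix expects.
-- a salted Phoenix table; direct HBase writes must reproduce the salt byte
CREATE TABLE metrics (
    host VARCHAR NOT NULL PRIMARY KEY,
    "stats"."cpu" DOUBLE
) SALT_BUCKETS = 8;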
Related
How do I migrate data from one schema to another schema (the tables have also changed), both belonging to different databases, meaning I have to establish two connections? Can someone help me with inputs on how I can achieve this using Java?
Can I use Liquibase to migrate data from one database to another? Please note I have to establish two DB connections since my schemas belong to different databases, and the table design has also changed.
Another option: let SQL do all the work, no Java needed. Let's call the databases dbfrom and dbto. Now sign in to dbto and create a database link. Then your task basically becomes an insert statement.
-- in database dbto
create database link link_to_dbfrom;
-- ensure user has appropriate access on both databases.
-- copy data from dbfrom to dbto
insert into schema_in_dbto.table_in_dbto( column list)
select (column list)
from schema_in_dbfrom.table_in_dbfrom@link_to_dbfrom;
How do I read from Hive without MapReduce? I am trying to read a column from a table created on Hive, but I don't want the overhead that comes with MapReduce. Basically, I want to retrieve the values from a table created on Hive without that overhead and get them the fastest way possible.
Instead of MapReduce, you can use Tez or Spark as your execution engine in Hive.
See hive.execution.engine in Hive Configuration Properties.
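For example, assuming Tez or Spark is already installed and configured on the cluster, the engine can be switched per session:
-- switch the execution engine for the current Hive session
SET hive.execution.engine=tez;
-- or
SET hive.execution.engine=spark;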
There are also quite a few SQL engines compatible with the Hive metastore, e.g. Presto, Spark SQL, Impala.
Generally, if you do a SELECT * FROM a table in Hive, MapReduce won't run.
In your case, where you are just selecting a column from a Hive table, MapReduce won't run either.
Alternatively, you can create a sub-table from the main table with only the needed columns and rows, and just do a SELECT * on that table.
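As a sketch (table and column names are placeholders): whether such a simple projection is served by a fetch task instead of MapReduce is controlled by hive.fetch.task.conversion, which defaults to more in recent Hive versions.
-- let simple SELECTs be answered by a fetch task instead of MapReduce
SET hive.fetch.task.conversion=more;
-- typically no MapReduce job is launched for a plain projection like this
SELECT my_column
FROM my_table
LIMIT 100;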
I'm relatively new to working with JDBC and SQL. I have two tables, CustomerDetails and Cakes. I want to create a third table, called Transactions, which uses the 'Names' column from CustomerDetails, 'Description' column from Cakes, as well as two new columns of 'Cost' and 'Price'. I'm aware this is achievable through the use of relational databases, but I'm not exactly sure about how to go about it. One website I saw said this can be done using ResultSet, and another said using the metadata of the column. However, I have no idea how to go about either.
What you're probably looking to do is create an SQL view (to simplify: a virtual table); see this documentation:
CREATE VIEW view_transactions AS
SELECT customerdetails.Name, cakes.Description
FROM customerdetails
JOIN cakes ON cakes.customer_id = customerdetails.id; -- adjust the join condition to your schema
Or something along those lines
That way you can then query the view view_transactions, for example, as if it were a proper table.
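For instance (the filter value is just an example):
SELECT * FROM view_transactions WHERE Name = 'Alice';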
Also, why have you tagged this as mysql when you are using sqlite?
You should create the new table manually, i.e. outside of your program. Use the command-line client sqlite3, for example.
If you need to, you can use the command .schema CustomerDetails in that tool to show the DDL ("metadata" if you want) of the table.
Then you can write your new CREATE TABLE Transactions (...) defining your new columns, plus those from the old tables as they're shown by the .schema command before.
Note that the .schema is only used here to show you the exact column definitions of the existing tables, so you can create matching columns in your new table. If you already know the present column definitions, because you created those tables yourself, you can of course skip that step.
Also note that SELECT Name from CUSTOMERDETAILS will always return the data from that table, but never the structure, i.e. the column definition. That data is useless when trying to derive a column definition from it.
If you really want/have to access the DB's metadata programmatically, the documented way is to do so by querying the sqlite_master system table. See also SQLite Schema Information Metadata, for example.
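For example, the stored DDL of an existing table can be read back with a query like this:
-- returns the CREATE TABLE statement SQLite stored for the table
SELECT sql
FROM sqlite_master
WHERE type = 'table'
  AND name = 'CustomerDetails';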
You should read up on the concept of data modelling and how relational databases can help you with it, then your transaction table might look just like this:
CREATE TABLE transactions (
id int not null primary key
, customer_id int not null references customerdetails( id )
, cake_id int not null references cakes( id )
, price numeric( 8, 2 ) not null
, quantity int not null
);
This way, you can ensure that for each transaction (which in this case would be just a single line item of an invoice), the cake and customer exist.
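A minimal usage sketch, assuming customerdetails and cakes each have an integer id primary key plus the Name and Description columns from the question (all values are made up):
-- record one sale
INSERT INTO transactions (id, customer_id, cake_id, price, quantity)
VALUES (1, 42, 7, 12.50, 2);
-- report it with the customer name and cake description joined back in
SELECT c.Name, k.Description, t.price, t.quantity
FROM transactions t
JOIN customerdetails c ON c.id = t.customer_id
JOIN cakes k ON k.id = t.cake_id;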
And I agree with @hanno-binder that it's not the best idea to create all this in plain JDBC.
I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in a column qualifier (mostly int or string values).
I have a requirement that I should be able to see the records paginated and sorted based on a column qualifier (or even more than one, in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one for each sort property (which would be included in the rowkey), and then redirecting queries to those tables. But this seems very tedious, as I already have quite a few such properties.
Thanks for any suggestions.
You need your NoSQL database to work just like an RDBMS, and given the size of your data your life would be a lot simpler if you stuck with an RDBMS, unless you expect exponential growth :) Also, you don't mention whether your data gets updated; this is very important for making a good decision.
Having said that, you have a lot of options, here are some:
If you can wait for the results: write a MapReduce job to do the scan, sort the results, and retrieve the top X rows. Do you really need more than 1000 pages (20-50k rows) for each sort type? Another option would be using something like Hive.
If you can aggregate the data and "reduce" the dataset: write a MapReduce job to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times and it works like a charm, but it depends on your requirements.
If you have plenty of storage: write a MapReduce job to periodically regenerate (or append to) a new table for each property (sorting by it in the row key). You don't need multiple tables; just use a prefix in your rowkeys for each case, or, if you don't want tables and you won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS, where they can easily be read by your frontend app.
Manually maintain a secondary index: this would not be very tolerant of schema updates and new properties, but it works great for near-real-time results. To do it, you have to update your code to also write to the secondary table, with a good buffer to help with performance while avoiding hot regions. Think about this type of rowkey: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B timestamp], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields, just perform a SCAN using the SORT FIELD ID as the start row plus the starting sort field value as the pivot for pagination (ignore it to get the first page, then set it to the last value retrieved); that way you'll have the rowkeys of the main table, and you can just perform a multi-get against it to retrieve the full data. Keep in mind that you'll need a small script to scan the main table and write the data to the index table for the existing rows.
Rely on one of the automatic secondary indexing solutions implemented through coprocessors, like you mentioned, although I do not like this option at all.
You have mostly enumerated the options. HBase does not natively support secondary indexes, as you are aware. In addition to hindex, you may consider Phoenix
https://github.com/forcedotcom/phoenix
(from Salesforce), which, in addition to secondary indexes, has a JDBC driver and SQL support.
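With Phoenix, a secondary index on a column comes down to a single statement like this (the table and column names are placeholders); Phoenix then maintains the index table for writes that go through Phoenix:
-- create and maintain a secondary index on my_table(my_column)
CREATE INDEX idx_my_table_my_column ON my_table (my_column);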
The question is mainly this: in my project, I want to create a table with 3 column families. The default replication number is 3, but I want to change this replication number for a certain column family, just because we don't need so much replication for it.
For example, a table named table1 has 3 column families, f1, f2, f3. In this case, we want to set the replication number of f3 to 1. How can I set this config? Is there any solution that doesn't require changing the source code?
PS: via the HBase shell or Java?
First we should clarify that the term "replication" is a little overloaded.
HBase uses HDFS as its storage. HDFS will replicate, to multiple DataNodes, the blocks that make up any files that HBase generates (see http://hadoop.apache.org/docs/stable/hdfs_design.html#Data+Replication ). This value isn't configurable per column family or per table; it's only configurable per server (see http://hbase.apache.org/book.html#hdfs_client_conf ).
If this is something you'd like to change, then I would suggest filing a JIRA requesting a new feature.
HBase also has the ability to replicate edits from one HBase cluster to another cluster. This replication is done via the write-ahead log and is configurable per column family. Setting REPLICATION_SCOPE to 1 will tell HBase to apply the edits from this region server onto another cluster; setting it to 0 will turn replication off.
I looked into this a lot. As I see it, you cannot define a different replication factor per table, let alone per column family.
The replication factor is defined in hbase-site.xml and applies across the board, not per table.
You can define whether or not to replicate a column family (to another cluster) using REPLICATION_SCOPE.