HBase sort on column qualifiers - Java

I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in its own column qualifier (mostly int or string values).
I have a requirement that I should be able to view the records paginated and sorted on a column qualifier (or possibly more than one in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one per sort property, with that property included in the rowkey, and then redirecting queries to those tables. But this seems very tedious, as I already have quite a few such properties.
Thanks for any suggestions.

You need your NoSQL database to work just like an RDBMS, and given the size of your data your life would be a lot simpler if you just stuck with an RDBMS, unless you expect exponential growth :) Also, you don't mention whether your data gets updated; that is very important for making a good decision.
Having said that, you have a lot of options; here are some:
If you can wait for the results: write a MapReduce job to do the scan, sort it, and retrieve the top X rows. Do you really need more than 1000 pages (20-50k rows) for each sort type? Another option would be using something like Hive.
If you can aggregate the data and "reduce" the dataset: write a MapReduce job to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times and it works like a charm, but it depends on your requirements.
If you have plenty of storage: write a MapReduce job to periodically regenerate (or append to) a new table for each property (sorted by it in the rowkey). You don't even need multiple tables; just use a prefix in your rowkeys for each case. Or, if you do not want tables and won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS, where they can easily be read by your frontend app.
Manually maintain a secondary index: this would not be very tolerant to schema updates and new properties, but would work great for near real-time results. To do it, you have to update your code to also write to the secondary table, with a good buffer to help with performance while avoiding hot regions. Think about this type of rowkey: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B timestamp], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields, just perform a SCAN using the SORT FIELD ID as the start row plus the starting sort field value as the pivot for pagination (ignore it to get the first page, then set it to the last value retrieved). That way you get the rowkeys of the main table, and you can perform a multiget against it to retrieve the full data. Keep in mind that you'll also need a small script to scan the main table and backfill the index table for the existing rows. A minimal sketch of this index-table approach follows these options.
Rely on automatic secondary indexing through coprocessors, like the one you mentioned, although I do not like this option at all.
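To make the "manually maintain a secondary index" option a bit more concrete, here is a minimal, hedged sketch. It assumes the HBase 2.x Java client, an index table "main_idx" with a column family "d" and a qualifier "ref" holding the main-table rowkey, and a numeric (long) sort value; all of those names and types are illustrative assumptions, not part of the original design.

// A minimal sketch of the manual index-table option described above.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexSketch {
  private static final byte[] CF = Bytes.toBytes("d");
  private static final byte[] REF = Bytes.toBytes("ref");

  // Index rowkey: [4-char sort field id][8B sort field value][8B timestamp]
  static byte[] indexKey(String fieldId, long value, long ts) {
    return Bytes.add(Bytes.toBytes(fieldId), Bytes.toBytes(value), Bytes.toBytes(ts));
  }

  // Call this alongside every write to the main table (ideally buffered).
  static void writeIndexEntry(Table idx, String fieldId, long value,
                              byte[] mainRowKey) throws IOException {
    Put p = new Put(indexKey(fieldId, value, System.currentTimeMillis()));
    p.addColumn(CF, REF, mainRowKey);
    idx.put(p);
  }

  // One page: scan the index from the last sort value seen, collect the
  // main-table rowkeys, then multiget the full rows from the main table.
  // (A stop row bounding the fieldId prefix is omitted for brevity.)
  static Result[] page(Connection conn, String fieldId, long fromValue,
                       int pageSize) throws IOException {
    List<Get> gets = new ArrayList<>();
    try (Table idx = conn.getTable(TableName.valueOf("main_idx"));
         ResultScanner scanner = idx.getScanner(
             new Scan().withStartRow(Bytes.add(Bytes.toBytes(fieldId),
                                               Bytes.toBytes(fromValue)))
                       .setLimit(pageSize))) {
      for (Result r : scanner) {
        gets.add(new Get(r.getValue(CF, REF)));
      }
    }
    try (Table main = conn.getTable(TableName.valueOf("main"))) {
      return main.get(gets);
    }
  }
}

For the first page, pass the field's minimum value as fromValue; for subsequent pages, pass the last value retrieved, exactly as described above.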

You have mostly enumerated the options. As you are aware, HBase does not natively support secondary indexes. In addition to hindex you may consider Phoenix
https://github.com/forcedotcom/phoenix
(from Salesforce), which in addition to secondary indexes provides a JDBC driver and SQL support.

Related

~4K rows insertion to DB per user - design & performance

I'm writing an application that allows each user to label English words in three categories (some lexical exercise).
The main DB table, Word, contains ~4K different rows of words.
The Label table contains 3 labels.
The Word-Label table (which contains 3 columns: word_id, label_id, user_id) will gain 4K rows per user (let's assume all the words start with some pre-defined label when a user registers with the system).
The problem is that the table will grow very fast. 1:4000 (user/row) is bad in my opinion.
What can you suggest here to eliminate such a huge table? I've read that table-per-user is also considered bad practice.
In addition, I'm using Spring & Hibernate, and the 4K insertions after a user registers for the first time are pretty heavy and take time.
I can consider a NoSQL solution or a different tool than Hibernate, but I'm committed to Spring & Java - so please suggest accordingly.
Will be glad for your help here!
There is no issue with the data size. You may have an issue with Hibernate, but that is a separate matter.
If you end up with thousands of users, you'll have a few tens of millions of rows. That is not a large number of rows. If you want to insert default labels for a new user, then the code would look something like this:
insert into userLabels (userId, wordId, label)
select :userId, w.wordId, <default label>
from words w;
I would be surprised if this took more than a second or two.
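If you end up issuing that statement over JDBC (with or without Spring), a minimal sketch might look like the following; the table and column names mirror the SQL above rather than your actual schema, and the default label id of 1 is just a placeholder.

// Minimal JDBC sketch of the set-based default-label insert shown above.
// Table/column names and the default label id (1) are placeholders.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DefaultLabels {
  public static int insertDefaults(Connection conn, long userId) throws SQLException {
    String sql = "insert into userLabels (userId, wordId, label) "
               + "select ?, w.wordId, 1 from words w";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setLong(1, userId);
      return ps.executeUpdate();   // one round trip; ~4K rows are inserted server-side
    }
  }
}

The point is that this is a single set-based statement executed inside the database, rather than ~4K individual Hibernate inserts.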
If you knew that you would be having millions of users, then size might be more of an issue. The best solution would require better understanding of the application. The solution might vary from partitioning the tables, using arrays, or coming up with a different structure for representing your data.
You probably want various indexes on your tables to speed performance, but that depends on the queries you want to run. You might consider using a native interface to the database. Your use-case doesn't seem particularly complicated, so I don't know what advantage Hibernate or similar layers gets you.
First approach: just add a new row to Word-Label for a user after each action. That way, probably not every user will end up with 4K rows in that table. Then, if the database queries around that functionality actually become a problem (a bottleneck), try to fix the issue and improve performance.
There are many performance tricks in SQL databases you can use. For example, you wrote about a table per user; that's not quite the best solution. As another example, in MySQL you can create table partitions, which are handled as one table but with a performance improvement.
Second approach: for this type of data, of course some NoSQL store like MongoDB would perform great.
You could encode the user's response map into a 4000-entry bit array, or a string, if you don't need the relational capabilities of the database.
Then it would be one record per user.
create table user_words (userid int, worddata text);
insert into user_words values (1,'YNYYNmmmYY'/* ... */ );
Your application would need to have the list of words and know which word each character refers to.
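A rough sketch of what that could look like on the application side (the character-per-word encoding and the fixed word ordering are assumptions carried over from the example above):

// Sketch: one label character per word, indexed by the application's fixed word list.
// 'Y', 'N', 'm' are the example label codes from the insert above.
import java.util.List;

public class WordData {
  private final List<String> words;      // fixed, ordered word list (~4000 entries)
  private final StringBuilder labels;    // one label character per word

  public WordData(List<String> words, String stored) {
    this.words = words;
    this.labels = new StringBuilder(stored);
  }

  public char labelOf(String word) {
    return labels.charAt(words.indexOf(word));
  }

  public void setLabel(String word, char label) {
    labels.setCharAt(words.indexOf(word), label);
  }

  // Value to write back into user_words.worddata
  public String encoded() {
    return labels.toString();
  }
}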

Is it better to use HBase columns or serialize data using Avro?

I'm working on a project that stores key/value information on a user using HBase. We are in the process of redesigning the HBase schema we are using. The two options being discussed are:
Use HBase column qualifiers as names for the keys. This would make rows wide, but very sparse.
Dump all the data into a single column and serialize it using Avro or Thrift.
What are the design tradeoffs of the two approaches? Is one preferable to the other? Are there any reasons not to store the data using Avro or Thrift?
In summary, I lean towards using distinct columns per key.
1) Obviously, you are requiring the client to use Avro/Thrift, which is another dependency. That dependency may also rule out certain tooling, like BI tools that expect to find values in the data without transformation.
2) Under the Avro/Thrift scheme, you are pretty much forced to bring the entire value across the wire. Depending on how much data is in a row, this may not matter. But if you are only interested in the 'city' field/column qualifier, you still have to get 'payments', 'credit-card-info', etc. This may also pose a security issue (see the sketch after these points).
3) Updates, if required, will be more challenging with Avro/Thrift. Example: you decide to add a 'hasIphone6' key. With Avro/Thrift, you will be forced to delete the row and create a new one with the added field. Under the column scheme, a new entry is appended with only the new column. For a single row that's not a big deal, but if you do this to a billion rows, there will need to be a big compaction operation.
4) If configured, you can use compression in HBase, which may beat the Avro/Thrift serialization in size, since it can compress across a column family instead of just a single record.
5) BigTable implementations like HBase do very well with very wide, sparse tables, so there won't be a performance hit like you might expect.
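To illustrate point 2, here is a small sketch with the HBase Java client; the table name ("users"), column family ("d"), and qualifiers are assumptions for illustration only.

// Sketch: with one column per key you can fetch just the qualifier you need;
// with a single Avro/Thrift blob you always fetch (and parse) the whole value.
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CityOnly {
  static byte[] cityOf(Connection conn, byte[] userRowKey) throws java.io.IOException {
    try (Table users = conn.getTable(TableName.valueOf("users"))) {
      // Distinct-columns schema: ask the server for only the 'city' cell.
      Get get = new Get(userRowKey).addColumn(Bytes.toBytes("d"), Bytes.toBytes("city"));
      Result r = users.get(get);
      return r.getValue(Bytes.toBytes("d"), Bytes.toBytes("city"));
      // Under the Avro/Thrift-blob schema you would instead fetch the single
      // serialized cell and deserialize the whole record just to read one field.
    }
  }
}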
The right answer to this is a bit more complicated, so I'll give you the tl;dr first.
Use Avro/Thrift/Protobuf
You will need to strike a balance between how many fields to pack in a record vs. columns.
You'll typically want to put fields ("keys" in your original question) that are frequently accessed together into something like an Avro record, because, as mentioned by cmonkey, you don't want the overhead of retrieving extra data you won't use.
By making your row very wide, you'll increase seek times when fetching a subset of columns, because of how HFiles are stored. Again, determining what is optimal comes down to your access patterns.
I would also like to point out that by using something like Avro, you're also providing yourself with evolvability. You don't need to delete the row and re-add it with a record containing a new field. Avro has rules for backward compatibility and forward compatibility. This actually makes your life much, much easier, because you can read both new and old records WITHOUT rewriting your data or forcing updates to older client code.
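A brief sketch of that schema-resolution behaviour with the Avro Java API (the 'hasIphone6' field is borrowed from the example above; the schemas themselves are assumptions):

// Sketch: old bytes written with schema v1 remain readable after adding a field,
// because the reader schema supplies a default for the new field.
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroEvolution {
  public static void main(String[] args) throws Exception {
    Schema v1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"city\",\"type\":\"string\"}]}");
    Schema v2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"city\",\"type\":\"string\"},"
      + "{\"name\":\"hasIphone6\",\"type\":\"boolean\",\"default\":false}]}");

    // Write an "old" record with v1 (this is what already sits in the HBase cell).
    GenericRecord oldRecord = new GenericData.Record(v1);
    oldRecord.put("city", "Oslo");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(v1).write(oldRecord, enc);
    enc.flush();

    // Read it back with writer schema v1 and reader schema v2: no rewrite needed,
    // the new field simply gets its declared default.
    GenericRecord upgraded = new GenericDatumReader<GenericRecord>(v1, v2)
        .read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    System.out.println(upgraded.get("hasIphone6")); // false
  }
}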
You should nearly always use compression in HBase (SNAPPY is always a good choice).

Cassandra: Best way to deal with historical data?

I'm using Cassandra to store historical data. It's a collection of various objects that change their values over time.
Column Family: Object type
Row: Object Id
Column Name: Timestamp
Column Value: Value at given time
At some point the data becomes 'old', and instead of deleting it I want to store it somewhere else (like another column family) or 'tag' it in some way so it is not retrieved along with the rest of the data.
What is the fastest way to do this? At the moment I'm using Hector to do it:
1. Read the data (using SliceQuery)
2. Write the data to another column family (using ColumnFamilyUpdater)
3. Delete the old data (also using ColumnFamilyUpdater)
Not sure if this is best practice, but I'm quite new to Cassandra...
Thanks.
Your data will not only take up space on disk, it will also consume JVM heap, because row bloom filters are always read on start-up - that's important to remember.
Your solution is fine - you need to read this data and move it somewhere else. Now there are two options:
Generate a reverse index, so that you can access old data quickly.
Go over all the data to find old records. If your data set is divided over many Cassandra nodes, consider Hadoop MapReduce.
The first solution provides fast access to old data, but each insert operation also has to update the index, which in Cassandra's case is still super fast.
The second solution does not require extra inserts during daily usage, but it requires a full table scan when you move old data. That is perfect if you can run such jobs at night.

What is the best way to iterate over and process an entire table from a database?

I have a table called Token in my database that represents tokenized texts.
Each row has attributes like textblock, sentence and position (for identifying which text the token is from) and logical fields like text, category, chartype, etc.
What I want to do is iterate over all tokens to find patterns and perform some operations. For example, merging two adjacent tokens whose category is Name into one (and, after this, resetting the positions). I think that I will need some kind of list.
What is the best way to do this? With SQL queries to find the patterns, or by iterating over all tokens in the table? I think the queries will get very complex, and maybe iterating over a list will be simpler, but I don't know which way to go (for example, retrieving into a Java list, or using a language that lets me iterate and make changes directly in the database).
So that this question doesn't get closed: what is the most recommended way to do this? I'm using Java, but if another language is better, no problem; I think I will need to use R for some statistical calculations.
Edit: The table is large (millions of rows), so loading it entirely into memory is not possible.
If you are working with a small table, or proving out a merge strategy, then just set up a query that finds all of the candidate duplicate lines and dump the relevant columns out to a table. Then view that table in a text editor or spreadsheet to see if your hypothesis about the duplication is correct.
Keep in mind that any time you try to merge two rows into one, you will be deleting data. Worst case is that you might merge ALL of your rows into one. Proceed with caution!
This is an engineering decision to be made, based mostly on the size of the corpus you want to maintain, and the kind of operations you want to perform on them.
If the size gets bigger than "what fits in the editor", you'll need some kind of database. That may or may not be an SQL database. But there is also the code part: if you want to perform non-trivial operations on the data, you might need a real programming language (could be anything: C, Java, Python; anything goes). In that case, the communication with the database will become a bottleneck: you need to generate queries that produce results that fit in the application program's memory. SQL is powerful enough to represent and store N-grams and do some calculations on them, but that is about as far as you are going to get. In any case the database has to be fully normalised, and that will make it harder for non-DBAs to understand.
My own toy project, http://sourceforge.net/projects/wakkerbot/ used a hybrid approach:
the data was obtained by a python crawler
the corpus was stored as-is in the database
the actual (modified MegaHal) Markov code stores its own version of the corpus in a (binary) flatfile, containing the dictionary, N-grams, and the associated coefficients.
the training and text generation is done by a highly optimised C program
the output was picked up by another python script, and submitted to the target.
[in another life, I would probably have done some more normalisation, and stored N-grams or trees in the database. That would possibly cause the performance to drop to only a few generated sentences per second. It now is about 4000/sec]
My gut feeling is that what you want is more like a "linguistic workbench" than a program that does exactly one task efficiently (like wakkerbot). In any case you'll need to normalise a bit more: store the tokens as {tokennumber,tokentext} and refer to them only by number. Basically, a text is just a table (or array) containing a bunch of token numbers. An N-gram is just a couple of tokennumbers+the corresponding coefficients.
This is not the most optimized method but it's a design that allows you to write the code easily.
Write an entity class that represents a row in your table.
Write a factory method that gives you the entity object for a given row id, i.e. a method that creates an object of the entity class with the values from the specified row.
Write methods that remove and insert a given row object into the table.
Write a row-counting method.
Now you can try to iterate over your table using your Java code. Remember that if you merge two rows, you need to correctly adjust the next index.
This method keeps memory usage small, but you will be issuing a lot of queries to create the row objects.
The concept is very similar or identical to ORM (Object Relational Mapping). If you know how to use Hibernate or another ORM, then try those libraries.
IMO it'd be easier, and likely faster overall, to load everything into Java and do your operations there to avoid continually re-querying the DB.
There are some pretty strong numerical libs for Java and statistics, too; I wouldn't dismiss it out-of-hand until you're sure what you need isn't available (or is too slow).
This sounds like you're designing a text search engine. You should first see if pgsql's full text search engine is right for you.
If you do it without full text search, loading PL/R into pgsql and learning to drive it is likely to be the fastest and most efficient solution. It'll allow you to put all this work into a few well-thought-out lines of R, and do it all in the db where access to the data is closest. The only time to avoid such a plan is when it would make the database server work VERY hard, like holding the dataset in memory and cranking a single CPU core across it. Then it's OK to do it app side.
Whether you use PL/R or not, access large data sets through a cursor; it's by far the most efficient way to get either single rows or smaller subsets of rows. If you do it with a select with a where clause for each thing you want to process, then you don't have to hold all those rows in memory at once. You can grab and discard parts of result sets while doing things like running averages, etc.
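If you do end up doing this app-side in Java, a minimal sketch of that cursor-style access could look like the following. It assumes a PostgreSQL JDBC connection and hypothetical Token column names (id, textblock, sentence, position, text, category); the merge of adjacent Name tokens mirrors the example from the question.

// Sketch: stream the Token table with a fetch-size cursor and merge adjacent
// tokens whose category is 'Name'. Column names are hypothetical; for brevity it
// ignores textblock/sentence boundaries, which a real merge would have to check,
// and it leaves renumbering of positions as a separate pass.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TokenMerger {
  public static void mergeAdjacentNames(Connection conn) throws SQLException {
    conn.setAutoCommit(false);           // required for cursor-based fetch in PostgreSQL
    String select = "select id, text, category from token "
                  + "order by textblock, sentence, position";
    try (Statement st = conn.createStatement();
         PreparedStatement extend = conn.prepareStatement(
             "update token set text = ? where id = ?");
         PreparedStatement remove = conn.prepareStatement(
             "delete from token where id = ?")) {
      st.setFetchSize(1000);             // stream 1000 rows at a time instead of all of them
      try (ResultSet rs = st.executeQuery(select)) {
        long prevId = -1;
        String prevText = null, prevCategory = null;
        while (rs.next()) {
          long id = rs.getLong("id");
          String text = rs.getString("text");
          String category = rs.getString("category");
          if ("Name".equals(category) && "Name".equals(prevCategory)) {
            prevText = prevText + " " + text;   // fold this token into the previous one
            extend.setString(1, prevText);
            extend.setLong(2, prevId);
            extend.executeUpdate();
            remove.setLong(1, id);
            remove.executeUpdate();
          } else {
            prevId = id;
            prevText = text;
            prevCategory = category;
          }
        }
      }
      conn.commit();
    }
  }
}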
Think about scale here. If you had a 5 TB database, how would you access it to do this the fastest? A poor scaling solution will come back to bite you even if it's only accessing 1% of the data set. And if you're already starting on a pretty big dataset today, it'll just get worse with time.
pl/R http://www.joeconway.com/plr/

Duplicate set of columns from one table to another table

My requirement is to read some set of columns from a table.
The source table has many columns, around 20-30 numeric ones, and I would like to read only a subset of those columns from the source table and keep appending their values to the destination table. My DB is Oracle and the programming language is Java/JDBC.
The source table is very dynamic - frequent inserts and deletes happen on it - whereas in the destination table I would like to keep the data for at least 30 days.
My setup is described below:
Database is Oracle.
Number of rows in the source table = 20 million rows with 30 columns
Number of rows in the destination table = 300 million rows with 2-3 columns
The columns are all Numeric.
I am thinking of not just opening a vanilla JDBC connection and transferring the data, which might be pretty slow given the size of the tables.
I am trying to take a dump of the selected columns of the source table using some SQL like:
SQL> spool <output file>
SQL> select c1,c5,c6 from SRC_Table;
SQL> spool off
And later use SQL*Loader to load the data into the destination database.
The source table stores time-series data, and the data gets purged/deleted from it within 2 days; it is part of an OLTP environment. The destination table has a larger retention period - 30 days of data can be stored there - and it is part of an OLAP environment. So a view on the source table that selects only a set of its columns does not work in this environment.
Any suggestion or review comments on this approach is welcome.
EDIT
My tables are partitioned. The easiest way to copy data is to exchange partitions between tables:
ALTER TABLE <table_name>
EXCHANGE PARTITION <partition_name>
WITH TABLE <new_table_name>
<including | excluding> INDEXES
<with | without> VALIDATION
EXCEPTIONS INTO <schema.table_name>;
but since my source and destination tables have different columns, I think exchange partition will not work.
Shamik, okay, you're loading an OLAP database with OLTP data.
What's the acceptable latency? Does your OLAP need today's data before people come into the office tomorrow morning, or does it need to be closer to real time?
Saying the Inserts are "frequent" doesn't mean anything. Some of us are used to thousands of txns/sec - to others 1/sec is a lot.
And you say there's a lot of data. Same idea. I've read people's posts where they have HUGE tables with a couple million records; I have tables with hundreds of billions of records. So again, a real number is very helpful.
Do not go with the trigger suggested by Schwern. If you believe your insert volume is large, it means you've probably had issues in that area already. A trigger will just make it worse.
Oracle provides lots of different choices for getting data from OLTP to OLAP. Instead of reinventing the wheel, use something already written. Oracle Streams was BORN to do this exact job. You can roll your own streams using Oracle AQ. You can capture inserted rows without a trigger by using either Database Change Notification or Change Data Capture.
This is an extremely common problem, which is why I've listed 4 technologies designed to solve it.
Advanced Queuing
Streams
Change Data Capture
Database Change Notification
Start googling these terms and come back with questions on those. You'll be better off than building your own from the ground up or using triggers.
The problem seems a little vague, and frankly a little odd. The fact that there are hundreds of columns in a single table, and that you're duplicating data within the database, suggests a hosed database design.
Rather than do it manually, it sounds like a job for a trigger. Create an insert trigger on the source table to copy columns to the destination table just after they're inserted.
Another possibility: since it seems all you want is a slice of the data in your original table, rather than duplicating it (a cardinal sin of database design), create a view which only includes the columns and ranges you want. Then just access that view like any other table.
I'm willing the guess that the root of the problem is accessing just the information you want in your source table is too slow. This suggests you might be able to fix that with better indexing. Also, your source table is probably just too damn wide.
Since I'm not an Oracle person, I leave the syntax of this as an exercise for the reader, but the concept should be sound.
On a tangential note, you might want to look at Oracle's partitioning here and here.
Partitioning enables tables and indexes to be split into smaller, more manageable components and is a key requirement for any large database with high performance and high availability requirements. Oracle Database 11g offers the widest choice of partitioning methods including interval, reference, list, and range in addition to composite partitions of two methods such as order date (range) and region (list) or region (list) and customer type (list).
Faster Performance—Lowers query times from minutes to seconds
Increases Availability—24 by 7 access to critical information
Improves Manageability—Manage smaller 'chunks' of data
Enables Information Lifecycle Management—Cost-efficient use of storage
Partitioning the table into daily partitions would make archiving easier as described here
