I am working on a project where I have to report hourly unique visitors per source. That is, I have to calculate unique visitors for each source for each hour. Visitors are identified by a unique id. What should the design be so that calculating hourly unique visitors is efficient, considering the data is on the order of 20k entries per 8 hours?
At present I am using sourceid+visitorid as the row key.
Let's start by saying that 2,500 entries per hour is a pretty low volume of data (not even 1/second). Unless you want to scale massively, your project would be easily achievable with a single SQL server.
Anyway, you have 2 options:
1. Non-realtime
Log every visitorid+source and run a job (like MapReduce) to analyze the data every hour, or every day, depending on your needs. In this case you can even completely avoid HBase and just stick to Hadoop. You can log the data to a different file each hour, process it afterwards, and store the results in SQL (or in HBase if you wish). Performance-wise this would be the best approach.
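As a rough illustration of that batch step (the log format and the aggregation below are assumptions on my part, not something you have to follow), counting hourly uniques is just a set per source:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the hourly batch job, assuming one log file per hour where each
// line looks like "visitorid,sourceid".
public class HourlyUniques {
    public static void main(String[] args) throws IOException {
        Map<String, Set<String>> uniquesBySource = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            String[] parts = line.split(",");
            String visitorId = parts[0];
            String sourceId = parts[1];
            uniquesBySource.computeIfAbsent(sourceId, s -> new HashSet<>()).add(visitorId);
        }
        // Unique visitors per source for this hour; store these counts in SQL or HBase.
        uniquesBySource.forEach((source, visitors) ->
                System.out.println(source + ": " + visitors.size()));
    }
}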
2. Realtime
Track the data in realtime by making use of HBase counters. In this case I'd consider using 2 tables:
Table unique_users: to track the last time a visitorid visited the site (the rowkey would be visitorid+source or just visitorid, depending on whether a visitor id can have different sources or just one). This table can have a TTL of 3600 seconds if you want to automatically discard old data as soon as you can, but I would keep a few days of data.
Table date_source_stats: to track the number of unique visitorids per source per hour. This table can have a TTL of a few weeks or even years depending on your retention requirements.
When a visitor enters your site, you read the unique_users table to check the last access date; if that date is older than 1 hour, consider it a new visit and increment the counter for the date+hour+sourceid combination in the date_source_stats table. Afterwards, update unique_users to set the last visit time to the current time.
That way, you can easily retrieve all the unique visits for a particular date+hour with a scan and get all the sources. You may also consider a source_date_stats table in case you want to perform queries for a specific source, e.g. an hourly report for the last 7 days for source X (you can even store all the stats in the same table by using different rowkeys).
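A minimal sketch of that read-then-increment flow with the HBase client API (the column family "d", the qualifier names and the rowkey formats are my own assumptions, not a prescribed schema):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class UniqueVisitTracker {
    private static final byte[] CF = Bytes.toBytes("d");   // assumed column family

    // Called on every hit: bump the unique counter only when the visitor's last
    // visit is unknown or more than an hour old, then record the new visit time.
    public static void trackVisit(Connection conn, String visitorId, String sourceId,
                                  String dateHour) throws IOException {
        long now = System.currentTimeMillis();
        try (Table uniqueUsers = conn.getTable(TableName.valueOf("unique_users"));
             Table stats = conn.getTable(TableName.valueOf("date_source_stats"))) {

            byte[] userRow = Bytes.toBytes(visitorId + "+" + sourceId);
            Result last = uniqueUsers.get(new Get(userRow));
            byte[] lastVisit = last.getValue(CF, Bytes.toBytes("last_visit"));
            boolean newUnique = lastVisit == null
                    || now - Bytes.toLong(lastVisit) > 3600_000L;

            if (newUnique) {
                // rowkey like "2016041509+sourceid", counter column "uniques"
                byte[] statsRow = Bytes.toBytes(dateHour + "+" + sourceId);
                stats.incrementColumnValue(statsRow, CF, Bytes.toBytes("uniques"), 1L);
            }

            Put put = new Put(userRow);
            put.addColumn(CF, Bytes.toBytes("last_visit"), Bytes.toBytes(now));
            uniqueUsers.put(put);
        }
    }
}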
Please notice a few things about this approach:
- I've not been too detailed about the schemas; let me know if you need me to be.
- I would also store total visits in another counter (incremented on every visit, unique or not); it's a useful value.
- This proposal can easily be extended to also track daily, weekly, and even monthly unique visitors; you'll just need more counters and rowkeys: date+sourceid, month+sourceid... In this case you can have multiple column families with distinct TTL properties to adjust the retention policy of each set.
- This proposal could face hotspotting issues due to rowkeys being sequential if you have thousands of requests per second; you can read more about it here.
An alternative approach for date_source_stats could be to opt for a wide design in which you have just a sourceid as rowkey and the date_hour as columns.
I have a table in DynamoDB, and I need to get a list of records (in Java) which are from the last day. They all have a dateTime attribute.
Relevant attributes of the table I'm referring to:
customerUrl(string, hashkey), dateTime(number, range key), and a few other attributes which aren't relevant
I've already tried setting a Global Secondary Index with a hashkey of dateTime and no range key. This index is named 'performanceIndex'. I then tried to query it as follows:
Map<String, AttributeValue> eav = new HashMap<>();
eav.put(":val1", new AttributeValue().withN(maximumAgeMillis));
DynamoDBQueryExpression<PingLog> pinglogQuery = new DynamoDBQueryExpression<PingLog>();
pinglogQuery.setKeyConditionExpression("dateTime > :val1");
pinglogQuery.setExpressionAttributeValues(eav);
pinglogQuery.setIndexName("performanceIndex");
pinglogQuery.setConsistentRead(false);
List<PingLog> pinglogs = PostDatabaseMapper.getInstance().query(PingLog.class, pinglogQuery);
However, the query just keeps running and never returns. I added a println statement before and after it, and only the first one actually printed.
Before this query I just did a scan with a filter, and that worked, but now we have so many records (80 million) that a scan takes forever. What should I do? Do I need a different secondary index? Is my query wrong?
You should create a GSI with yyyy-mm-dd as the partition key, and hh:mm:ss as the sort key. (This might require backfilling the entire table, but if you query by date often, it will be worth it.) Check out this answer to a related question, which has some more details on this approach.
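For instance, once that index exists, fetching a given day becomes a plain key-condition query against it. A sketch in the same mapper style as your code (the index name "dateIndex" and the attributes "datePartition"/"timeSort" are illustrative, not your actual schema):

Map<String, AttributeValue> eav = new HashMap<>();
eav.put(":d", new AttributeValue().withS("2017-06-01"));   // the day you want
eav.put(":t", new AttributeValue().withS("00:00:00"));     // everything from midnight onwards

DynamoDBQueryExpression<PingLog> query = new DynamoDBQueryExpression<PingLog>()
        .withIndexName("dateIndex")
        .withConsistentRead(false)   // GSIs only support eventually consistent reads
        .withKeyConditionExpression("datePartition = :d AND timeSort >= :t")
        .withExpressionAttributeValues(eav);

List<PingLog> pinglogs = PostDatabaseMapper.getInstance().query(PingLog.class, query);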
There is a potential complication depending on what sort of data access patterns you have. Is it fairly steady, or is it bursty? Will current items have a much higher write throughput than any other day?
If you’re dealing with time-series data, such as IoT sensor readings, this strategy may not work for you. You could have a hot partition in your GSI, which could put back-pressure on your main table and cause writes to be throttled. This is unlikely because of DynamoDB’s adaptive capacity, but it is possible.
In this case, you should consider DynamoDB’s recommended best practice for handling time-series data. It discusses how to deal with data that has different access requirements over time. The gist of their solution is to create separate tables for each period of time (day/month/year/whatever) so that data from different time frames can have different provisioned capacity.
Does it sound bad to have 180 unindexed properties (columns) of Integer/Long type per entity in Datastore?
I need to count 6 types of requests per user, saved by day, for analytics reasons, and I'm doing everything based on the sharding counters article and webcast:
https://cloud.google.com/appengine/articles/sharding_counters
So basically it's 6 values per day, incremented on every new request, so I'm thinking of having:
1 Kind per month
6 types of analytics * days in the month = 180 properties
How much is too much in Google Datastore properties?
Thank you
Probably not a good idea.
Keep in mind that every time you want to update a single property value the entire entity will have to be re-written (i.e. retrieved from the datastore, deserialized, updated, re-serialized and re-sent to the datastore). The bigger the entity, the slower the performance.
IMHO it's better to have multiple smaller entities than one big one in such a case. It is possible to split a single big entity into multiple smaller ones, efficiently related to each other - see re-using an entity's ID for other entities of different kinds - sane idea?
Along the same lines, I believe it's even possible to find a way to encode the day info and the user ID into unique custom key IDs for easy access. Something like <userid>_YYMMDD or just <userid>_DD.
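A minimal sketch of that per-user-per-day layout with the Cloud Datastore client (the kind "DailyStats", the property names and the key format are assumptions for illustration):

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.Transaction;

public class DailyStatsWriter {
    private final Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

    // Encode user and day directly into the key, e.g. "user123_160415".
    public void increment(String userId, String yymmdd, String counterName) {
        Key key = datastore.newKeyFactory().setKind("DailyStats").newKey(userId + "_" + yymmdd);
        // Read-modify-write inside a transaction so concurrent increments don't clobber each other.
        Transaction txn = datastore.newTransaction();
        try {
            Entity current = txn.get(key);
            long value = (current != null && current.contains(counterName))
                    ? current.getLong(counterName) : 0L;
            Entity.Builder builder = (current != null)
                    ? Entity.newBuilder(current) : Entity.newBuilder(key);
            txn.put(builder.set(counterName, value + 1).build());
            txn.commit();
        } finally {
            if (txn.isActive()) txn.rollback();
        }
    }
}

Each entity then holds just the 6 counters for one user and one day, so every update only re-serializes a tiny entity; if a single entity still sees heavy write contention, you would shard it as in the counters article.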
I am trying to write an algorithm which records how frequently data is searched.
Let's say a user can search different combinations of two entities (Source-Destination). Each time a user searches, I want to store the combination with a count, and if they search the same combination (Source-Destination) again, I will update the count.
In this case the number of users is 1000, each user searches for different combinations (Source-Destination), and the data will be stored for 30 days.
So the total number of rows will be 100000*30*30=13500000 (1.3 billion) rows (using MySQL).
Please suggest a better way to design this, if there is one.
GOAL: I want to get the top 10 search combinations of users at any point in time.
1,000 users and 60,000 rows are nothing by today's standards. Don't even think about it, there is no performance concern whatsoever, so just focus on doing it properly instead of worrying about slowness. There will be no slowness.
The proper way of doing it is to create a table in which each row contains the search terms ([source, destination] in your case) and a sum, with a unique index on the [source, destination] pair of columns, which is the same as making those two columns the primary key.
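A minimal sketch of that design from JDBC, assuming a table named search_counts with (source, destination) as the primary key (all names are illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SearchCounter {
    // Assumed schema:
    //   CREATE TABLE search_counts (
    //     source      VARCHAR(64) NOT NULL,
    //     destination VARCHAR(64) NOT NULL,
    //     cnt         BIGINT      NOT NULL DEFAULT 0,
    //     PRIMARY KEY (source, destination)
    //   );

    // Record one search: insert the pair, or bump the count if it already exists.
    public static void recordSearch(Connection conn, String source, String destination) throws SQLException {
        String sql = "INSERT INTO search_counts (source, destination, cnt) VALUES (?, ?, 1) "
                   + "ON DUPLICATE KEY UPDATE cnt = cnt + 1";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, source);
            ps.setString(2, destination);
            ps.executeUpdate();
        }
    }

    // Top 10 combinations at any point in time.
    public static void printTop10(Connection conn) throws SQLException {
        String sql = "SELECT source, destination, cnt FROM search_counts ORDER BY cnt DESC LIMIT 10";
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("source") + " -> "
                        + rs.getString("destination") + ": " + rs.getLong("cnt"));
            }
        }
    }
}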
If you had 100,000,000 rows, and performance was critical, and you also had a huge budget affording you the luxury to do whatever weird thing it takes to make ends meet, then you would perhaps want to do something exotic, like appending each search to an indexless table (allowing the fastest appends possible) and then computing the sums in a nightly batch process. But with less than a million rows such an approach would be complete overkill.
Edit:
Aha, so the real issue is the OP's need for a "sliding window". Well, in that case, I cannot see any approach other than saving every single search, along with the time that it happened, and then, in a batch process, a) computing the sums, and b) deleting entries that are older than the "window".
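A rough sketch of that batch step from JDBC, assuming a raw_searches(source, destination, searched_at) log table next to the search_counts summary table (names are again illustrative):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class SlidingWindowJob {
    // Rebuild the 30-day sums, then prune raw searches that fell out of the window.
    public static void refresh(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("DELETE FROM search_counts");
            st.executeUpdate(
                "INSERT INTO search_counts (source, destination, cnt) "
                + "SELECT source, destination, COUNT(*) FROM raw_searches "
                + "WHERE searched_at >= NOW() - INTERVAL 30 DAY "
                + "GROUP BY source, destination");
            st.executeUpdate(
                "DELETE FROM raw_searches WHERE searched_at < NOW() - INTERVAL 30 DAY");
        }
    }
}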
I'm considering HBase as a trading system database due in part to its first-class timestamp-based versioning of rows. Trades get modified all the time, and we typically have to cope with that explicitly in the data model when working with SQL databases. In HBase, I'd model trades (not individual trade versions) as rows and let HBase do the hard work of giving me access to trade versions that were live at a previous point in time.
The query side of the system needs access to trade data in 3 major ways:
Current versions of all trades. This is clearly supported by HBase.
Versions of all trades that were active at a specific timestamp. Again, clear support from HBase. The purpose of this query is typically end of day processes that want to report the trade population at a specific time of day.
Activity between 2 timestamps. This is useful for feeding information at end of day to systems which are already in sync with the trading system and want to know what's changed. It also forms the basis for a daily Profit and Loss (P&L) calculation, which indicates how the P&L has changed from yesterday's value.
So, my question is: does HBase have any built-in support for performing a "diff" between two timestamps? Alternatively, is there any best-practice way of meeting this requirement at the database level? If not, I'd need to consider building a process that fetches two timestamp-based queries out of the database and performs a difference operation (sketched further below).
I'd expect the output of the process to be:
A list of trades at the new timestamp that have changed since the old timestamp.
A separate list of trades as-of the old timestamp that have been made obsolete at the new timestamp.
Those pieces of information allow me to apply the positive changes and "back out" the negative changes. For example, if a trade has changed notional from 1m USD to 2m USD, I want to be able to apply a +2m change and then back out a -1m change to result in a net change of +1m for the day.
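If I do end up building it myself, I'm picturing something like this client-side sketch (simplified to one representative value per trade row; the table and column layout are placeholders, and calling changedSince with the arguments swapped gives the "backed out" list):

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class TradeDiff {

    // Latest version of every trade as of ts (only cells written before ts are visible).
    static Map<String, byte[]> snapshotAt(Table trades, long ts) throws IOException {
        Scan scan = new Scan();
        scan.setTimeRange(0, ts);   // newest cell within the range wins per column
        Map<String, byte[]> snapshot = new HashMap<>();
        try (ResultScanner scanner = trades.getScanner(scan)) {
            for (Result r : scanner) {
                snapshot.put(Bytes.toString(r.getRow()), r.value());
            }
        }
        return snapshot;
    }

    // Trades in the new snapshot that are absent from, or different to, the old one.
    static Map<String, byte[]> changedSince(Map<String, byte[]> oldSnap, Map<String, byte[]> newSnap) {
        Map<String, byte[]> changed = new HashMap<>();
        for (Map.Entry<String, byte[]> e : newSnap.entrySet()) {
            byte[] before = oldSnap.get(e.getKey());
            if (before == null || !Bytes.equals(before, e.getValue())) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        return changed;
    }
}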
My requirement is to read some set of columns from a table.
The source table has many columns - around 20-30 numeric columns - and I would like to read only a subset of those columns from the source table and keep appending their values to the destination table. My DB is Oracle and the programming language is JDBC/Java.
The source table is very dynamic - frequent inserts and deletes happen on it - whereas at the destination table I would like to keep the data for at least 30 days.
My setup is described below:
Database is Oracle.
Number of rows in the source table = 20 million rows with 30 columns
Number of rows in the destination table = 300 million rows with 2-3 columns
The columns are all numeric.
I am thinking of not doing a vanilla JDBC connection-open-and-transfer of the data, which might be pretty slow given the size of the tables.
Instead, I am trying to take a dump of the selected columns of the source table using some SQL like:
SQL> spool <output_file>
SQL> select c1,c5,c6 from SRC_Table;
SQL> spool off
And later use SQL*Loader to load the data into the destination database.
The source table stores time-series data and the data gets purged/deleted from the source table within 2 days; it is part of an OLTP environment. The destination table has a larger retention period - 30 days of data can be stored there - and it is part of an OLAP environment. So a view on the source table that selects only the required set of columns does not work in this environment.
Any suggestions or review comments on this approach are welcome.
EDIT
My tables are partitioned. The easiest way to copy data is to exchange a partition between tables:
ALTER TABLE <table_name>
EXCHANGE PARTITION <partition_name>
WITH TABLE <new_table_name>
<including | excluding> INDEXES
<with | without> VALIDATION
EXCEPTIONS INTO <schema.table_name>;
but since my source and destination tables have different columns, I think exchange partition will not work.
Shamik, okay, you're loading an OLAP database with OLTP data.
What's the acceptable latency? Does your OLAP need today's data before people come in to the office tomorrow morning, or is it closer to real time?
Saying the inserts are "frequent" doesn't mean anything. Some of us are used to thousands of txns/sec - to others, 1/sec is a lot.
And you say there's a lot of data. Same idea. I've read people's posts where they have HUGE tables with a couple million records. I have tables with hundreds of billions of records. So again, a real number is very helpful.
Do not go with the trigger suggested by Schwern. If you believe your insert volume is large, it means you've probably already had issues in that area. A trigger will just make it worse.
Oracle provides lots of different choices for getting data from OLTP to OLAP. Instead of reinventing the wheel, use something already written. Oracle Streams was BORN to do this exact job. You can roll your own streams using Oracle AQ. You can capture inserted rows without a trigger by using either Database Change Notification or Change Data Capture.
This is an extremely common problem, which is why I've listed 4 technologies designed to solve it.
Advanced Queuing
Streams
Change Data Capture
Database Change Notification
Start googling these terms and come back with questions on those. You'll be better off than building your own from the ground up or using triggers.
The problem seems a little vague, and frankly a little odd. The fact that there's hundreds of columns in a single table, and that you're duplicating data within the database, suggests a hosed database design.
Rather than do it manually, it sounds like a job for a trigger. Create an insert trigger on the source table to copy columns to the destination table just after they're inserted.
Another possibility: since it seems all you want is a slice of the data in your original table, rather than duplicating it (a cardinal sin of database design), create a view which only includes the columns and ranges you want. Then just access that view like any other table.
I'm willing to guess that the root of the problem is that accessing just the information you want in your source table is too slow. This suggests you might be able to fix that with better indexing. Also, your source table is probably just too damn wide.
Since I'm not an Oracle person, I leave the syntax of this as an exercise for the reader, but the concept should be sound.
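For what it's worth, a rough and untested sketch of the trigger idea, issued through JDBC since that's the OP's stack (the destination table name is a placeholder; c1/c5/c6 are the columns from the question):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class CreateCopyTrigger {
    // One-time DDL: after each insert into SRC_Table, copy the selected columns across.
    public static void create(Connection conn) throws SQLException {
        String ddl =
            "CREATE OR REPLACE TRIGGER src_copy_trg "
            + "AFTER INSERT ON SRC_Table "
            + "FOR EACH ROW "
            + "BEGIN "
            + "  INSERT INTO dest_table (c1, c5, c6) VALUES (:NEW.c1, :NEW.c5, :NEW.c6); "
            + "END;";
        try (Statement st = conn.createStatement()) {
            st.execute(ddl);
        }
    }
}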
On a tangential note, you might want to look at Oracle's partitioning here and here.
Partitioning enables tables and indexes to be split into smaller, more manageable components and is a key requirement for any large database with high performance and high availability requirements. Oracle Database 11g offers the widest choice of partitioning methods including interval, reference, list, and range in addition to composite partitions of two methods such as order date (range) and region (list) or region (list) and customer type (list).
Faster Performance—Lowers query times from minutes to seconds
Increases Availability—24 by 7 access to critical information
Improves Manageability—Manage smaller 'chunks' of data
Enables Information Lifecycle Management—Cost-efficient use of storage
Partitioning the table into daily partitions would make archiving easier, as described here.