I'm considering HBase as a trading system database due in part to its first-class timestamp-based versioning of rows. Trades get modified all the time, and we typically have to cope with that explicitly in the data model when working with SQL databases. In HBase, I'd model trades (not individual trade versions) as rows and let HBase do the hard work of giving me access to trade versions that were live at a previous point in time.
The query side of the system needs access to trade data in 3 major ways:
Current versions of all trades. This is clearly supported by HBase.
Versions of all trades that were active at a specific timestamp. Again, clear support from HBase. The purpose of this query is typically end of day processes that want to report the trade population at a specific time of day.
Activity between 2 timestamps. This is useful for feeding information at end of day to systems which are already in sync with the trading system and want to know what's change. It also forms the basis for a daily Profit and Loss (P&L) calculation which indicates what the change of P&L is from yesterday's daily value.
So, my question is: does HBase have any built-in support for performing a "diff" between two timestamps? Alternatively, is there any best-practice way of meeting this requirement at the database level? If not, I'd need consider building a process that fetches two timestamp-based queries out of the database and performs a difference operation.
I'd expect the output of the process to be:
A list of trades trades at the new timestamp that have changed since the old timestamp.
A separate list of trades as-of the old timestamp that have been made obsolete at the new timestamp.
Those pieces of information allow me to apply the positive changes and "back out" the negative changes. For example, if a trade has changed notional from 1m USD to 2m USD, I want to be able to apply a +2m change and then back out a -1m change to result in a net change of +1m for the day.
Related
I am attempting to export a large amount of data from multiple separate tables from Oracle 11 into a NoSQL database via a Java app utilising JDBI.
The data is being read from the following tables: store, store2, staff and product.
The final desired data structure is a multi-tiered structure like so;
Country
Store1
StoreFloorSize
StoreAddress1
StorePostcode
StoreStaff
StaffMember1
StaffForename
StaffMember2
StaffForename
StoreProducts
Product1
ProductName
Product2
ProductName
Store2
...
There will be many countries and each country can have many stores and each store can have many staff members/products.
So far I've attempted to perform this export by querying the data from Oracle in the structure utilising cursor expressions (refcursors) and then
mapping the results to Java objects before saving to the new NoSQL database.
A very simplfied version of the query used to extract data from Oracle is below;
select countryName, cursor(storeFloorSize, storeAddress1, storePostcode, cursor(select staffForename from staff where staff.storeId = store.Id),
cursor(select productName from product where product.storeid = store.id)
from (select * from store union all select * from store2) store WHERE store.countryid = country.id) from country
This approach works however due to the volume of data it's taking a long time (a number of days to complete) and there are a few constraints with it.
The entire process takes a two to three days to complete however when looking at Oracle stats the time spent executing on Oracle is only approximately 6 hours.
So far in trying to track down where this additional time is taken I've done/checked the following;
First the NoSQL database has been removed from the equation entirely and the performance remains the same.
The Oracle server and machine on which the Java application is running on are both fine in terms of CPU and Memory resources (very little usage on both machines for both resources)
I've broken the task up across multiple threads each working on separate partitions of the table (country in the above example); each thread performs select from oracle-> map to java objects -> save to NoSQL - This parallel processing when done across a large number of threads reduced the execution time on Oracle but had no real affect on the overall time. (These are separate threads in Java and each has their own separate connection to Oracle via a connection pool)
I've tried modifying the fetchSize property however it seems to have a very small difference (This adds another complication as each result row contains three cursors and when parallised
across a large number of threads the MAX_OPEN_CURSORS setting on Oracle needs to increase drastically very quickly).
I can't seem to identify any particular bottle necks however resource utilisation is still very low.
As mentioned in the first line I'm using the JDBI wrapper around JDBC to perform the query and map the results to Java objects however if this was the bottle neck I believe that I'd see high usage on the machine running the Java application.
Is there anything I may have overlooked with regard to the above or might I be better of moving back to pure SQL queries and performing the transformation in Java?
I am hitting a REST API to get data from a service. I transform this data and store it in a database. I will have to do this on some interval, 15 minutes, and then make sure this database has latest information.
I am doing this in a Java program. I am wondering if it would be better, after I have queried all data, to do
1. SELECT statements and compare vs transformed data and do UPDATEs (DELETE all associated records to what was changed and INSERT new)
OR
DELETE ALL and INSERT ALL every time.
Option 1 has potential to be a lot less transactions, guaranteed SELECT on all records because we are comparing, but potentially not a lot of UPDATEs since I don't expect data to be changing much. But it has downside of doing comparisons on all records to detect a change
I am planning on doing this using Spring Boot, JPA layer and possibly postgres
The short answer is "It depends. Test and see for your usecase."
The longer answer: this feels like preoptimization. And the general response for preoptimization is "don't." Especially in DB realms like this, what would be best in one situation can be awful in another. There are a number of factors, including (and not exclusive to) schema, indexes, HDD backing speed, concurrency, amount of data, network speed, latency, and so on:
First, get it working
Identify what's wrong → get a metric
Measure against that metric
Make any obvious or necessary changes
Repeat 1 through 4 as appropriate
The first question I would ask of you is "What does better mean?" Once you define that, the path forward will likely become clearer.
I'm currently working on an analytics tool that every night (with a Java program) parses huge event logs (approx. 1 GB each) to a MySQL database - for each event there's about 40 attributes. The event logs are parsed "raw" to the database.
The user of the application needs to see different graphs and charts based on complicated calculations on the log data. For the user not to wait several minuts for a chart-request to be fulfilled, we need to store the preprocessed data somehow ready to display for the user (the user is able to filter by dates, units etc., but the largest parts of the calculations can be done on beforehand). My question is concerned about how to maintain such preprocessed data - currently, all calculations are expressed in SQL as we assume is the most efficient way (is this a correct assumption?). We need to be able to easily expand with new calculations for new charts, customer specific wishes etc.
Some kind of materialized view jumps to my mind, but MySQL doesn't seem to support this feature. Similarly, we could execute the SQL calculation each night after the event logs has been imported, but in this way each calculation/preprocessed data table needs to know which events it has processed and which it hasn't. The table will contain up to a year worth of data (i.e. events) so simply truncating the table and doing all calculations once again seems not to be the solution? Using triggers doesn't seem right neither, as some calculations need to consider for example the time difference between to specific kinds of events?
I'm having a hard time weighing the pros and cons of possible solutions.
"Materialized Views" are not directly supported by MySQL. "Summary Tables" is another name for them in this context. Yes, that is the technique to use. You must create and maintain the summary table(s) yourself. They would be updated either as you insert data into the 'Fact' table, or periodically through a cron job, or simply after uploading the nightly dump.
The details of such are far more than can be laid out in this forum, and the specific techniques that would work best for you involve many questions. I have covered much of it in three blogs: DW, Summary Tables, and High speed ingestion. If you have further, more specific, questions, please open a new Question and I will dig into more details as needed.
I have done such in several projects; usually the performance is 10x better than reading the Fact table; in one extreme case, it was 1000x. I always end up with UI-friendly "reports" coming from the Summary Table(s).
In some situations, you are actually better off building the Summary Tables and not saving the Fact rows in a table. Optionally, you could simply keep the source file in case of a need to reprocess it. Not building the Fact table will get the summary info to the end-user even faster.
If you are gathering data for a year, and then purging the 'old' data, see my blog on partitioning. I often use that on the Fact table, but rarely feel the need on a Summary Table, since the Summary table is much smaller (that is, not filling up disk).
One use case had a 1GB dump every hour. A perl script moved the data to a Fact table, plus augmented 7 Summary Tables, in less than 10 minutes. The system was also replicated, that added some extra challenges. So, I can safely say that 1GB a day is not a problem.
I am working on a project where I have to report the hourly unique visitors per source. That is I have to calculate unique visitors for each source for each hour. Visitors are identified by a unique id. What should be the design so that calculation of hourly unique visitors is efficient considering the data is of the order of 20k entries per 8 hours.
At present I am using sourceid+
visitorid as the row key.
Let's start by saying that 2500k entries per hour is a pretty low volume of data (not even 1/second). Unless you want to scale massively your project would be easily achievable with a single SQL server.
Anyway, you have 2 options:
1. Non-realtime
Log every visitorid+source and run a job (like mapreduce) to analyze the data every hour, or every day depending on your needs. In this case you can even completely avoid hbase and just stick to hadoop. You can log the data to a different file each hour, process it afterwards and store the results in SQL (or in HBase if you wish). Performance wise this would be the best approach.
2. Realtime
Track the data realtime by making use of HBase counters, in this case I'd consider using 2 tables:
Table unique_users: to track the last time a visitorid has visited the site (rowkey would be visitorid+source or just visitorid, depending on if a visitor id can have different sources or just one). This table can have a TTL of 3600 seconds if you want to automatically discard old data as soon as you can but I would let a few days of data.
Table date_source_stats: to track the unique visitorid per source per hour. This table can have a TTL of a few weeks or even years depending on your retention requirements.
When a visitor enters your site you read the unique_users table to check the last access date, if that date is older than 1 hour consider it a new visit and increment the counter for the date+hour+sourceid combination in the date_source_stats table. Afterwards, update the unique_users to set the last visit time to the current time.
That way, you can easily retrieve all the unique visits for a particular date+hour with a scan and get all the sources. You may also consider a source_date_stats table in case you want to perform queries for an specific source, i.e, an hourly report for last 7 days for X source... (you can even store all the stats in the same table by using different rowkeys).
Please notice a few things about this approach:
I've not being too detailed about the schemas, let me know if you need me to.
I would also store total visits in another counter (which would be incremented always regardless of if it's unique or not), it's an
useful value.
This proposal can be easily extended as much as you want to also track daily, weekly, and even monthly unique visitors, you'll just
need more counters and rowkeys: date+sourceid, month+sourceid... In this case you can have multiple column families with distinct TTL properties to adjust the retention policy of each set.
This proposal could face hotspotting issues due rowkeys being sequential if you have thousands of reqs per second, you can read more
about it here.
An alternative approach for date_source_stats could be to opt for a wide design in which you have just a sourceid as rowkey and the date_hour as columns.
I know (or think I know) that using things like prepared statements can help future executions of the same query execute faster. However, I was wondering, if you're using prepared statements but the actual values are the same every time, will it then also additionally optimize using the value?
To give a little more context, I want to test performance for a service request that uses an underlying database. The easy route would be to send in the same data each time. The more arduous route would be to ensure the data values were different each time. However, in either case, the same SQL query would be generated -- just the values would be different. So, will these scenarios end up testing the same thing or something different because of potential DB optimization?
I've tried to research this topic but I feel like a lot of what I'm reading is over my head. Any good links for someone that knows little about DB optimization would also be welcomed in addition to the central question.
It depends on exactly what you are doing and measuring. I would expect, though, that you'd need to use different values in order to get realistic results.
Caching
If you send the same values every time, you can probably guarantee that the particular row(s) that you're interested in are always going to be cached (in the buffer cache, in the file system cache, in the SAN cache, etc.) which is probably not terribly realistic if the set of possible inputs is large. On the other hand, if there are a small number of potential inputs and you're reasonably confident that the rows of interest will always be cached (for example, if you know that some other activity that takes place just before your service is called will cause the data you're interested in to be cached in memory before your service is called) then perhaps this is a realistic assumption.
Optimization
Ignoring caching, we can look at how the optimizer would treat the two cases. If you are generating SQL queries with embedded literals (a bad practice that is particularly harmful in Oracle but one that is very common), then you are generating different SQL statements. As far as Oracle is concerned
SELECT *
FROM emp
WHERE deptno = 10
is a completely different statement from
SELECT *
FROM emp
WHERE deptno = 20
There are some settings (i.e. cursor_sharing) you can tweak to ask Oracle to treat these two as identical queries (by having Oracle force them into using bind variables) but that is not without its own downsides and is generally only recommended when you're trying to apply a band-aid to a poorly written application while you work on refactoring the application to use bind variables properly.
Assuming that you are generating queries using bind variables in your application, preparing the statement, and then binding different values before executing the query multiple times, i.e.
SELECT *
FROM emp
WHERE deptno = :1
then you get into the realm of histograms, bind variable peeking, and adaptive cursor sharing. This can get pretty involved and depends heavily on the version of Oracle you're using, the edition you're using, and how you've configured the optimizer to work. I'll try to give a simplified high-level overview here-- if you want to delve too much deeper into one of these, we'll probably want a separate question.
Histograms
By default, the optimizer assumes that data is equally spaced and equally likely. So, for example, if the deptno column has 50 distinct values, the optimizer assumes by default that each value is equally likely. That's probably a pretty reasonable assumption for most columns but it's obviously not reasonable for all columns. If I have a table with all active duty military members, for example, and one of the columns is birth_year, there will be far more rows with a birth_year of 1994 (20 years ago) than 1934 (80 years ago). In these cases, you gather histograms on the column in question in order to tell the optimizer that the data isn't evenly distributed and to let the optimizer gather information about which values are more common and how common they are.
The optimizer doesn't care about the values you are passing for your bind variable values unless there is a histogram on one of the columns in your predicate (I'll ignore for the moment the possibility that you are passing a value that is out of range).
Bind variable peeking
If you do have a histogram on one or more columns, then Oracle (9.1 and later if memory serves) will "peek" at the first value that is passed in for a bind variable and use that value with the histogram to determine the best plan for all subsequent executions. This works reasonably well the vast majority of the time but it occasionally leads to hair-pullingly painful problems (and much swearing) when Oracle peeks at a "bad" value and generates a plan that is efficient for that one execution but terrible for all future executions. This is summed up by Tom Kyte's story about the database that has to be restarted if it's rainy on a Monday morning. If you have a histogram on the column and different values that you might pass in would likely benefit from different query plans, you'd likely want to take bind variable peeking into consideration to determine if passing in values in a different order created any performance issues.
Adaptive cursor sharing
In recent versions (if memory serves 11.1 and later) and depending on your configuration, Oracle can use adaptive cursor sharing to maintain multiple query plans for a single statement and to use the most appropriate version for the particular bind variable value that is passed in. This is a much more sophisticated version of bind variable peeking that peeks for each set of values you pass in and figures out whether it is close enough to some other set of values to use the previously generated plan or whether it needs to compute a new plan for the new set of values. Figuring out what constitutes "close enough" and how this interacts with various features for ensuring plan stability is a rather involved topic in its own right.
you could use db caching
http://www.oracle.com/technetwork/articles/sql/11g-caching-pooling-088320.html
if the app is making network roundtrip and caculating results, that will still eat considerable time