why don't almost all databases implement row cache?

why don't almost all databases implement row cache? - java

I noticed that almost all database don't implement row cache internally. I ask this question because I find someone add row-cache patch for innodb. At least, it has one advantage beyond performance gain, i.e. it is transparent for the client.
Is there any difficult technical reason which prevent doing so, or just because it's useful for very specific access pattern?
Thanks

Quite frankly, if you're getting the same version of the same row multiple times, you're doing it wrong. Data should be cached, as a general rule, if it's unlikely to change and needs to be accessed multiple times. Given this rule, if caching a DB row on the server side ever helps, it means you're making too many round trips to the database for the data you're interested in. You should instead be caching it client-side to cut down on the round trips. If the data changes often and so you need to access it often, caching still won't help because the cached data is out of date and the query must be re-executed. Only getting the data that's different from the cached data doesn't help; you have to figure out what's different and you're still making a query of the DB.
On top of that, most databases are designed for high-concurrency performance. Caching one guy's massive result set is going to eat into resources available for the next guy's massive result set and so on. In a high-user-count scenario, building a cache would likely simply result in the cached data being thrown away to make room for more cached data; it wouldn't be able to stick around long enough to be of use.

Related

Cache query results

Let's say, we have a highly configurable report system, which allows users to select columns, filters, and sorting.
All this configuration comes to BE, where it's being transformed to SQL, executed against DB and then the user sees his report and can continue to work with it. But on each operation, like sorting, we still build a query.
The transformation itself takes few milliseconds, but the query execution against DB can take 3-5 seconds (up to 20 if there are a lot of parallel executions).
So, I'm thinking about adding some sort of cache.
Currently, I see 3 ways:
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on much less amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Like the first option, but LRU cache on Java side. We can fit in memory 2-3k report results. It will be usually faster than in the first option since we don't have a lot of parallel users, just users with lots of reports.
The cache invalidation will be a few times a day.
What do you see is the best way to make it faster? What cons and pros in proposed solutions from yours perspective? What would you do if you are free in selecting Database and technology (Java stack)?

OK, let's make sure I got it right.
there are more than 10k different reports
So it doesn't make sense to pre-calculate and pre-cache them, they have to be generated on-demand.
there is not a lot of data in rows, just short strings, dates and integers. It’s not costly to fetch it in memory and even save there for a while
So caching a small amount of data can avoit a big costly query, that's good.
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Problem is, most likely every report query will have different columns, with different names, so that doesn't fit a single table well unless you use a format like JSON, storing each cached result row as a JSON dictionary... And in this case indexing it would be a problem, even if you create indexes on fields inside JSON values, if you have a zillion different column names from your many reports you'll need a zillion indexes too...
Smells like a can of worms.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on much less amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Pros: each cache table can have the proper columns, data types and indexes. It is easy to invalidate the cache, just truncate it. You can set all the cache tables to UNLOGGED to make them faster. And you can do all the extra sorting/filtering on the cached result using the same SQL queries you were using before, so this might be the simpler option to code. It is also nice for pagination if you only want to fetch part of the result. And that will be the fastest option as far as copying the results of reporting queries into cache since the cache is already in postgres, there is no need to transfer data. You can also store the cache on another drive/SSD.
Cons: I've heard the main issue with tons of tables is if your filesystem slows down on directories with large numbers of files. That shouldn't be an issue on modern filesystems though, and I don't think postgres itself is going to be bothered at all by 10k tables.
It might make queries on information_schema slow, and stuff like "\dt" in psql problematic, so the cache tables would be better hidden away in a "cache" schema so they don't interfere. This will also make it easier to exclude them from backups.
It will also use some RAM on postgres server to cache the cache tables, that depends on the number of online users.
I'd say it would be worth a little bit of benchmarking. Create a schema, add 10k tables, see if something breaks.
Like the first option, but LRU cache on Java side. We can fit in memory 2-3k report results. It will be usually faster than in the first option since we don't have a lot of parallel users, just users with lots of reports.
That's a bit of reinventing the wheel, and you got to reimplement the sort/filter in java... plus the cache algos... meeeh.
There are other options though:
Put the cache in another database, on another machine. This may be a postgres instance, or another database (which may require rewriting some queries). Could be interesting only if the cache eats too much RAM on your database.
Put the cache in the web browser, and use javascript to filter/sort. That could be faster depending on speed of internet connection, and it would reduce server load, but you'll have to write lots of javascript code.
IMO you're cautious about the large number of tables, it is good to be cautious, but if it works well, it really is the simplest solution...

Collection processing or database request ? which one is better

This is my first post on stackoverflow, so please be nice to me :-)
So let me explain the context. I'm developing a web service with a standard layer (resources, services, DAO Layer...). I use JPA with hibernate implementation for my object model with the database.
For a class A parent and a class B child, most of the time when i want to find an object B on the collection, I use the streamAPI to filter the collection based on what i want. My question here is more general, is it better to search an object by requesting the database (from my point of view this gonna cause a lot of calls to the database but it's gonna use less CPU), or do the opposite by searching over the model object and process over collection (this gonna cause less database calls, but more CPU process)

If you consider latency, the database will always be slower.
So you gotta ask yourself some questions:
how far away is the database (latency)?
how big is the dataset?
How do I process them ?
do I have any major runtime issues ?
from my point of view this gonna cause a lot of calls to the database but it's gonna use less CPU), or do the opposite by searching over the model object and process over collection (this gonna cause less database calls, but more CPU process)
You're program is probably not very performant programmed. I suggest you check the O-Notation if you have any major runtime leaks.
Your Question is very broad, so it's hard to tell you, for your use-case, which might be the best.

Use database to return data what you need and Java to perform processing on them that would be complicated to do in a JPQL/SQL query.
Databases are designed to perform queries more efficiently than Java (stream or no).
Besides, fetching many data from a database to finally keep only a part of them is not efficient.

The database is usually faster since it is optimized for requesting specific data. Usually one would add indexes to speed up querying on certain fields.
TLDR: Filter your data in the database and process them from java.

This isn't an easy question to answer, since there are many different factors that would influence my decision to go to the db or not. First, I think it's fair to say that, for almost every app I've worked on in the past 20 years, hitting the DB for information is the default strategy. More recently (say past 10 or so years) data access through web service calls has become common as well.
For me, the main question would be something along the lines of, "Are there any situations when I would not hit an external resource (DB, Service, or even file read) for data every time I need it?"
So, I'll outline some of the things I would consider.
Is the data search space very small?
If you are searching a data space of tens of different records, then this information might be a candidate for non-db storage. On the other hand, once you get past a fairly small set records, this approach becomes increasingly untenable. Examples of these "small sets" might be something like salutations (Mr., Ms., Dr., Mrs., Lord). I looks for small sets of data that rarely change, which I, as a lazy developer, wouldn't mind typing into a configuration file. Once I get past something like 50 different records (like US States, for example), I want to pull that info from a DB or service call.
Are the data cacheable?
If you have multiple requests that could legitimately use the exact same data, then leverage caching in your application. Examine the data and expected usage of your service for opportunities to leverage regularities in data and likely requests to cache data whenever possible. Remember to consider cache keys, how long items should be cached, and when cached items should be evicted.
In many web usage scenarios, it's not uncommon that each display could include a fairly large amount of cached information, and a small amount of dynamic data. Menu and other navigation items are good candidates for caching. User-specific data, such as contract-sepcific pricing in an eCommerce app are often poor candidates.
Can you pre-load some data into cache?
Some items can be read once and cached for the entire duration of your application. A list of US States and/or Canadian Provinces is a good example here. These almost never change, so once read from the db, you would rarely need to read them again. Consider application components that can load such data on startup, and then hold this data in an appropriate collection.

More efficient to do SELECT and compare in Java or DELETE and INSERT

I am hitting a REST API to get data from a service. I transform this data and store it in a database. I will have to do this on some interval, 15 minutes, and then make sure this database has latest information.
I am doing this in a Java program. I am wondering if it would be better, after I have queried all data, to do
1. SELECT statements and compare vs transformed data and do UPDATEs (DELETE all associated records to what was changed and INSERT new)
OR
DELETE ALL and INSERT ALL every time.
Option 1 has potential to be a lot less transactions, guaranteed SELECT on all records because we are comparing, but potentially not a lot of UPDATEs since I don't expect data to be changing much. But it has downside of doing comparisons on all records to detect a change
I am planning on doing this using Spring Boot, JPA layer and possibly postgres

The short answer is "It depends. Test and see for your usecase."
The longer answer: this feels like preoptimization. And the general response for preoptimization is "don't." Especially in DB realms like this, what would be best in one situation can be awful in another. There are a number of factors, including (and not exclusive to) schema, indexes, HDD backing speed, concurrency, amount of data, network speed, latency, and so on:
First, get it working
Identify what's wrong → get a metric
Measure against that metric
Make any obvious or necessary changes
Repeat 1 through 4 as appropriate
The first question I would ask of you is "What does better mean?" Once you define that, the path forward will likely become clearer.

Does Oracle SQL database optimize by value?

I know (or think I know) that using things like prepared statements can help future executions of the same query execute faster. However, I was wondering, if you're using prepared statements but the actual values are the same every time, will it then also additionally optimize using the value?
To give a little more context, I want to test performance for a service request that uses an underlying database. The easy route would be to send in the same data each time. The more arduous route would be to ensure the data values were different each time. However, in either case, the same SQL query would be generated -- just the values would be different. So, will these scenarios end up testing the same thing or something different because of potential DB optimization?
I've tried to research this topic but I feel like a lot of what I'm reading is over my head. Any good links for someone that knows little about DB optimization would also be welcomed in addition to the central question.

It depends on exactly what you are doing and measuring. I would expect, though, that you'd need to use different values in order to get realistic results.
Caching
If you send the same values every time, you can probably guarantee that the particular row(s) that you're interested in are always going to be cached (in the buffer cache, in the file system cache, in the SAN cache, etc.) which is probably not terribly realistic if the set of possible inputs is large. On the other hand, if there are a small number of potential inputs and you're reasonably confident that the rows of interest will always be cached (for example, if you know that some other activity that takes place just before your service is called will cause the data you're interested in to be cached in memory before your service is called) then perhaps this is a realistic assumption.
Optimization
Ignoring caching, we can look at how the optimizer would treat the two cases. If you are generating SQL queries with embedded literals (a bad practice that is particularly harmful in Oracle but one that is very common), then you are generating different SQL statements. As far as Oracle is concerned
SELECT *
FROM emp
WHERE deptno = 10
is a completely different statement from
SELECT *
FROM emp
WHERE deptno = 20
There are some settings (i.e. cursor_sharing) you can tweak to ask Oracle to treat these two as identical queries (by having Oracle force them into using bind variables) but that is not without its own downsides and is generally only recommended when you're trying to apply a band-aid to a poorly written application while you work on refactoring the application to use bind variables properly.
Assuming that you are generating queries using bind variables in your application, preparing the statement, and then binding different values before executing the query multiple times, i.e.
SELECT *
FROM emp
WHERE deptno = :1
then you get into the realm of histograms, bind variable peeking, and adaptive cursor sharing. This can get pretty involved and depends heavily on the version of Oracle you're using, the edition you're using, and how you've configured the optimizer to work. I'll try to give a simplified high-level overview here-- if you want to delve too much deeper into one of these, we'll probably want a separate question.
Histograms
By default, the optimizer assumes that data is equally spaced and equally likely. So, for example, if the deptno column has 50 distinct values, the optimizer assumes by default that each value is equally likely. That's probably a pretty reasonable assumption for most columns but it's obviously not reasonable for all columns. If I have a table with all active duty military members, for example, and one of the columns is birth_year, there will be far more rows with a birth_year of 1994 (20 years ago) than 1934 (80 years ago). In these cases, you gather histograms on the column in question in order to tell the optimizer that the data isn't evenly distributed and to let the optimizer gather information about which values are more common and how common they are.
The optimizer doesn't care about the values you are passing for your bind variable values unless there is a histogram on one of the columns in your predicate (I'll ignore for the moment the possibility that you are passing a value that is out of range).
Bind variable peeking
If you do have a histogram on one or more columns, then Oracle (9.1 and later if memory serves) will "peek" at the first value that is passed in for a bind variable and use that value with the histogram to determine the best plan for all subsequent executions. This works reasonably well the vast majority of the time but it occasionally leads to hair-pullingly painful problems (and much swearing) when Oracle peeks at a "bad" value and generates a plan that is efficient for that one execution but terrible for all future executions. This is summed up by Tom Kyte's story about the database that has to be restarted if it's rainy on a Monday morning. If you have a histogram on the column and different values that you might pass in would likely benefit from different query plans, you'd likely want to take bind variable peeking into consideration to determine if passing in values in a different order created any performance issues.
Adaptive cursor sharing
In recent versions (if memory serves 11.1 and later) and depending on your configuration, Oracle can use adaptive cursor sharing to maintain multiple query plans for a single statement and to use the most appropriate version for the particular bind variable value that is passed in. This is a much more sophisticated version of bind variable peeking that peeks for each set of values you pass in and figures out whether it is close enough to some other set of values to use the previously generated plan or whether it needs to compute a new plan for the new set of values. Figuring out what constitutes "close enough" and how this interacts with various features for ensuring plan stability is a rather involved topic in its own right.

you could use db caching
http://www.oracle.com/technetwork/articles/sql/11g-caching-pooling-088320.html
if the app is making network roundtrip and caculating results, that will still eat considerable time

Database Data Filtering Best Practice

I am currently using raw JDBC to query records in a MySql database; each record in the subsequent Resultset is ultimately extracted, placed in a domain specific model, and stored to a List Instance.
My query is: in circumstances where there is a requirement to further filter that data (incidentally based on columns that exist in the SAME Table) which of the following approaches would generally be considered best practice:
1.The issuance of further WHERE clause calls into the database. This will effectively offload the filtering process to the database but obviously results in an additional query or queries where multiple filters are applied consecutively.
2.Explicitly filtering the aforementioned preprocessed List at the Application level, thus negating the need to have to make additional calls into the database each time the records are filtered.
3.Some hybrid combination of the above two approaches, perhaps where all filtering operations are initially undertaken by the database server but THEN preprocessed to a application specific model and implicitly cached to a collection for some finite amount of time. Further filter queries, received within this interval, would then be serviced from the data stored in the cache.
It is important to note that the Database Server in this scenario is actually located on
an external machine, therefore the overhead and latency of sending query traffic over the local network also has to be factored into the approach we ultimately elect to take.
I am patently aware of the age-old mantra that stipulates that: "The database server should be used to do what its good at." however in this scenario it just seems like a less than adequate solution to be making numerous calls into the database to filter data that I ALREADY HAVE at the application level.
Your thoughts and insights would be greatly appreciated.

I have used the hybrid approach on many applications with good results.
Database filtering works good especially for columns that are indexed. This reduces network overhead since fewer rows are sent to application.
Database filtering can be really slow for some columns depending upon the quantity of rows in the results and the lack of indexes. The network overhead can be negligible compared to database query time so application filtering may be faster for this situation.
I also find that application filtering in Java easier to write and understand instead of complex SQL.
I usually experiment manually to get the fewest rows in a reasonable time with plain SQL. Then write Java to refine to the desired rows.

i appreciate this question first...as i too faced similar situation few days back...as you already discussed all available options i prefer to go with the second option....i mean handling at application level rather than filtering at DB level.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.