My question is: when should we worry about how much data we request from an SQL database for a given task?
Let's say I have a PostgreSQL database. I have a product table, and product has 20 fields.
In the system, some places actually need only the product id, name and price. Some people argue that I should request only those 3 fields from the database to be efficient. But I feel it's much easier for developers to always call productService.getProduct(id) and then pick the fields they need than to write a separate class or query for those specific fields.
Does it really matter, for the speed of the query, whether I ask for 3 or 20 fields? How much load could it add?
(I'm a Java developer with the mindset of "premature optimization is the root of all evil".)
As with all optimization, when it matters. The profiler and other tools (for example postgres' EXPLAIN ANALYZE) will let you know.
The actual mechanisms depend on lots of things; the database being used, table/schema, tablespace settings etc. etc. so it's impossible to give any definite answer, but since the amount of data being moved is different, it will naturally make a difference whether you're moving 100,000 rows of 10 columns or 100,000 rows of 3 columns.
The actual query may not see a significant difference if the same number of pages is being read from disk, but the memory and network use will naturally differ.
Thankfully you can refactor code and queries to select less data if the original query becomes a bottleneck.
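To put rough numbers on the row/column comparison above, here is a back-of-the-envelope sketch. The 20-byte average value width is purely an assumption for illustration; real widths depend on your schema, and the wire protocol adds its own overhead.

```java
// Rough payload estimate: rows x columns x average bytes per value.
// AVG_BYTES_PER_VALUE is an assumed figure, not a measurement.
final class PayloadEstimate {
    static final long AVG_BYTES_PER_VALUE = 20;

    static long estimatedBytes(long rows, int columns) {
        return rows * columns * AVG_BYTES_PER_VALUE;
    }
}
```

Under that assumption, 100,000 rows of 10 columns come to about 20 MB and 100,000 rows of 3 columns to about 6 MB. Whether the difference matters for your case is still something only the profiler can tell you.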
Related
Let's say, we have a highly configurable report system, which allows users to select columns, filters, and sorting.
All this configuration goes to the back end, where it is transformed into SQL and executed against the DB; the user then sees the report and can continue to work with it. But on each operation, like sorting, we still build a query.
The transformation itself takes a few milliseconds, but the query execution against the DB can take 3-5 seconds (up to 20 if there are a lot of parallel executions).
So, I'm thinking about adding some sort of cache.
Currently, I see 3 ways:
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on much less amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Like the first option, but LRU cache on Java side. We can fit in memory 2-3k report results. It will be usually faster than in the first option since we don't have a lot of parallel users, just users with lots of reports.
The cache invalidation will be a few times a day.
What do you see as the best way to make it faster? What are the pros and cons of the proposed solutions from your perspective? What would you do if you were free to choose the database and technology (Java stack)?
OK, let's make sure I got it right.
there are more than 10k different reports
So it doesn't make sense to pre-calculate and pre-cache them, they have to be generated on-demand.
there is not a lot of data in rows, just short strings, dates and integers. It’s not costly to fetch it in memory and even save there for a while
So caching a small amount of data can avoid a big costly query, that's good.
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Problem is, most likely every report query will have different columns, with different names, so that doesn't fit a single table well unless you use a format like JSON, storing each cached result row as a JSON dictionary... And in this case indexing it would be a problem, even if you create indexes on fields inside JSON values, if you have a zillion different column names from your many reports you'll need a zillion indexes too...
Smells like a can of worms.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on much less amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Pros: each cache table can have the proper columns, data types and indexes. It is easy to invalidate the cache, just truncate it. You can set all the cache tables to UNLOGGED to make them faster. And you can do all the extra sorting/filtering on the cached result using the same SQL queries you were using before, so this might be the simpler option to code. It is also nice for pagination if you only want to fetch part of the result. And that will be the fastest option as far as copying the results of reporting queries into cache since the cache is already in postgres, there is no need to transfer data. You can also store the cache on another drive/SSD.
Cons: I've heard the main issue with tons of tables is if your filesystem slows down on directories with large numbers of files. That shouldn't be an issue on modern filesystems though, and I don't think postgres itself is going to be bothered at all by 10k tables.
It might make queries on information_schema slow, and stuff like "\dt" in psql problematic, so the cache tables would be better hidden away in a "cache" schema so they don't interfere. This will also make it easier to exclude them from backups.
It will also use some RAM on postgres server to cache the cache tables, that depends on the number of online users.
I'd say it would be worth a little bit of benchmarking. Create a schema, add 10k tables, see if something breaks.
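A minimal sketch of how that benchmark could be set up from Java: it only generates the DDL strings, which you would then run through a JDBC Statement. The schema name, the table naming pattern and the three columns are made up for illustration; real cache tables would mirror each report's columns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Generates DDL for a throwaway benchmark: one schema plus N unlogged cache tables.
// Schema name, table names and column list are hypothetical.
final class CacheTableDdl {
    static List<String> statements(int tableCount) {
        List<String> ddl = new ArrayList<>(tableCount + 1);
        ddl.add("CREATE SCHEMA IF NOT EXISTS cache");
        for (int i = 1; i <= tableCount; i++) {
            ddl.add(String.format(Locale.ROOT,
                "CREATE UNLOGGED TABLE cache.report_%05d (id integer, label text, created date)", i));
        }
        return ddl;
    }
}
```

Calling CacheTableDdl.statements(10_000) and executing each string gives you the 10k tables; after that you can time "\dt", information_schema queries and a few typical sort/filter queries against them.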
Like the first option, but LRU cache on Java side. We can fit in memory 2-3k report results. It will be usually faster than in the first option since we don't have a lot of parallel users, just users with lots of reports.
That's a bit of reinventing the wheel, and you got to reimplement the sort/filter in java... plus the cache algos... meeeh.
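For reference, the eviction part on its own is small in Java, so most of the work in this option is the sort/filter reimplementation rather than the cache itself. A minimal sketch, assuming a single-threaded or externally synchronized caller (the class name is hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache built on LinkedHashMap in access order.
// Not thread-safe: wrap it with Collections.synchronizedMap or guard it externally.
final class ReportLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    ReportLruCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, least recently used entry first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the eldest entry once the limit is exceeded
    }
}
```

Invalidating a few times a day is then just a call to clear(). A production setup would more likely use an existing caching library than this hand-rolled class.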
There are other options though:
Put the cache in another database, on another machine. This may be a postgres instance, or another database (which may require rewriting some queries). Could be interesting only if the cache eats too much RAM on your database.
Put the cache in the web browser, and use javascript to filter/sort. That could be faster depending on speed of internet connection, and it would reduce server load, but you'll have to write lots of javascript code.
IMO you're cautious about the large number of tables, it is good to be cautious, but if it works well, it really is the simplest solution...
I recently got into an interview and I was asked a question
We have a table employee(id, name). And in our java code, we are writing a logic to fetch data from this table and display it in UI. The query is
Select id,name from employee
The question was that, during debugging, we found that the JDBC call that fires this query and gets the output takes, say, 20 seconds, and we want to reduce this to, say, 5 seconds or to the optimal time. How can we do that, or how would I tackle this problem?
As there is no where clause in the query, I didn't suggest indexing the column.
As this logic takes 20 seconds every time, some other code holding a lock on this table is also out of the question.
I suggested that limiting the number of records fetched from the table should help, but the interviewer didn't look convinced.
Is there anything else we can do as developers to optimize the call? I guess a DBA might tune database settings to improve the performance of this query, but is there any other way?
OK, so this is an interview question, so both the problem and the solutions are hypothetical. The interviewer is asking for possible optimizations and / or approaches. Here are some that are most likely to help:
Modify the query to page the data rather than fetching the whole lot. This looks applicable for the example query. Note that this is not just "limiting the number of rows selected from the table" ... which is probably why the interviewer looked doubtful when you said that!
If you do need to display the entire selected record set but in a reduced form (e.g. summed, averaged, sorted, collated etc), do the reduction in the query rather than by fetching the records and doing it in the client.
Tune the fetch size (setFetchSize()) as suggested by Ivan.
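The paging suggestion in the list above could be sketched as follows. This is one possible variant (keyset paging) and assumes that id is numeric and unique and that the database accepts LIMIT (PostgreSQL, MySQL); the class and record names are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Keyset paging: each call fetches the next pageSize rows after the last id already shown.
final class EmployeePager {
    // LIMIT is PostgreSQL/MySQL syntax; other databases spell this differently.
    static final String PAGE_SQL =
        "SELECT id, name FROM employee WHERE id > ? ORDER BY id LIMIT ?";

    record Employee(long id, String name) {}

    static List<Employee> nextPage(Connection conn, long lastSeenId, int pageSize)
            throws SQLException {
        List<Employee> page = new ArrayList<>(pageSize);
        try (PreparedStatement ps = conn.prepareStatement(PAGE_SQL)) {
            ps.setLong(1, lastSeenId);
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    page.add(new Employee(rs.getLong("id"), rs.getString("name")));
                }
            }
        }
        return page;
    }
}
```

The UI would pass the id of the last row on the current page to get the next one, so each call moves only pageSize rows over the wire instead of the whole table.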
Here are some other ideas that are less likely to help and / or will require extensive reworking.
Look at the network configs. For example, you may be able to get better throughput by tuning TCP buffers at the OS level, or by optimizing physical or virtual network paths.
Run the query on the database server itself (to eliminate network overheads)
Use an in-memory table
Query a secondary database server; e.g. a readonly snapshot or a slave
You can try to increase the fetch size with setFetchSize() on the Statement/PreparedStatement to decrease the number of network round trips between the application server/desktop and the database server.
You can start several threads that each query a slice of the data, and then merge the results from all threads.
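A minimal sketch of the fetch-size idea, with a small helper that shows why it matters. The class and method names are hypothetical; setFetchSize() itself is the standard JDBC call, but it is only a hint and drivers differ in how they honour it.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class FetchSizeDemo {
    // Number of fetch round trips needed to stream `rows` rows, fetchSize rows at a time.
    static long roundTrips(long rows, int fetchSize) {
        return (rows + fetchSize - 1) / fetchSize; // ceiling division
    }

    // Reads the whole table, asking the driver for fetchSize rows per round trip.
    static long countRows(Connection conn, int fetchSize) throws SQLException {
        long n = 0;
        try (PreparedStatement ps = conn.prepareStatement("SELECT id, name FROM employee")) {
            ps.setFetchSize(fetchSize); // a hint; the driver may ignore or cap it
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) n++;
            }
        }
        return n;
    }
}
```

With a million rows, a fetch size of 10 means 100,000 round trips, while 1,000 means 1,000. As far as I know, the Oracle driver defaults to 10 rows per fetch, and the PostgreSQL driver reads the whole result at once unless autocommit is off and a fetch size is set, so check your driver's documentation before tuning.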
EDIT: doesn't apply to this situation because id and name are the only columns on this table, but still useful for other readers to note.
If you create an index covering both id and name, then the database can use that index to read the data faster, since it won't even have to read the table.
See this link for a more thorough explanation.
if the index contains all the columns you’re requesting it doesn’t even need to look in the table. That concept is known as index coverage.
I am hitting a REST API to get data from a service. I transform this data and store it in a database. I will have to do this on some interval, 15 minutes, and then make sure this database has latest information.
I am doing this in a Java program. I am wondering if it would be better, after I have queried all data, to do
1. SELECT statements and compare vs transformed data and do UPDATEs (DELETE all associated records to what was changed and INSERT new)
OR
2. DELETE ALL and INSERT ALL every time.
Option 1 has the potential for far fewer write transactions: a SELECT on all records is guaranteed because we are comparing, but there are potentially not many UPDATEs, since I don't expect the data to change much. Its downside is that every record has to be compared to detect a change.
I am planning on doing this using Spring Boot, JPA layer and possibly postgres
The short answer is "It depends. Test and see for your usecase."
The longer answer: this feels like premature optimization. And the general response to premature optimization is "don't." Especially in DB realms like this, what would be best in one situation can be awful in another. There are a number of factors, including (but not limited to) schema, indexes, HDD backing speed, concurrency, amount of data, network speed, latency, and so on. The usual approach:
1. First, get it working
2. Identify what's wrong → get a metric
3. Measure against that metric
4. Make any obvious or necessary changes
5. Repeat 1 through 4 as appropriate
The first question I would ask of you is "What does better mean?" Once you define that, the path forward will likely become clearer.
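If the comparison route (option 1 in the question) is what you end up measuring, the comparison step itself could look roughly like this. It is a sketch that assumes each record has a stable key and a value type with a meaningful equals(); the class and record names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Compares what is stored with what the API returned and reports the minimal set of changes.
final class SyncDiff {
    record Changes<K, V>(Map<K, V> toInsert, Map<K, V> toUpdate, Set<K> toDelete) {}

    static <K, V> Changes<K, V> diff(Map<K, V> stored, Map<K, V> incoming) {
        Map<K, V> toInsert = new HashMap<>();
        Map<K, V> toUpdate = new HashMap<>();
        for (Map.Entry<K, V> e : incoming.entrySet()) {
            if (!stored.containsKey(e.getKey())) {
                toInsert.put(e.getKey(), e.getValue());      // new record
            } else if (!Objects.equals(stored.get(e.getKey()), e.getValue())) {
                toUpdate.put(e.getKey(), e.getValue());      // changed record
            }
        }
        Set<K> toDelete = new HashSet<>(stored.keySet());
        toDelete.removeAll(incoming.keySet());               // records that disappeared upstream
        return new Changes<>(toInsert, toUpdate, toDelete);
    }
}
```

Timing this against the plain delete-all/insert-all run on realistic data gives you the metric from step 2; if few records change per interval, the three result collections will usually be small.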
I know (or think I know) that using things like prepared statements can help future executions of the same query execute faster. However, I was wondering, if you're using prepared statements but the actual values are the same every time, will it then also additionally optimize using the value?
To give a little more context, I want to test performance for a service request that uses an underlying database. The easy route would be to send in the same data each time. The more arduous route would be to ensure the data values were different each time. However, in either case, the same SQL query would be generated -- just the values would be different. So, will these scenarios end up testing the same thing or something different because of potential DB optimization?
I've tried to research this topic but I feel like a lot of what I'm reading is over my head. Any good links for someone that knows little about DB optimization would also be welcomed in addition to the central question.
It depends on exactly what you are doing and measuring. I would expect, though, that you'd need to use different values in order to get realistic results.
Caching
If you send the same values every time, you can probably guarantee that the particular row(s) that you're interested in are always going to be cached (in the buffer cache, in the file system cache, in the SAN cache, etc.) which is probably not terribly realistic if the set of possible inputs is large. On the other hand, if there are a small number of potential inputs and you're reasonably confident that the rows of interest will always be cached (for example, if you know that some other activity that takes place just before your service is called will cause the data you're interested in to be cached in memory before your service is called) then perhaps this is a realistic assumption.
Optimization
Ignoring caching, we can look at how the optimizer would treat the two cases. If you are generating SQL queries with embedded literals (a bad practice that is particularly harmful in Oracle but one that is very common), then you are generating different SQL statements. As far as Oracle is concerned
SELECT *
FROM emp
WHERE deptno = 10
is a completely different statement from
SELECT *
FROM emp
WHERE deptno = 20
There are some settings (i.e. cursor_sharing) you can tweak to ask Oracle to treat these two as identical queries (by having Oracle force them into using bind variables) but that is not without its own downsides and is generally only recommended when you're trying to apply a band-aid to a poorly written application while you work on refactoring the application to use bind variables properly.
Assuming that you are generating queries using bind variables in your application, preparing the statement, and then binding different values before executing the query multiple times, i.e.
SELECT *
FROM emp
WHERE deptno = :1
then you get into the realm of histograms, bind variable peeking, and adaptive cursor sharing. This can get pretty involved and depends heavily on the version of Oracle you're using, the edition you're using, and how you've configured the optimizer to work. I'll try to give a simplified high-level overview here-- if you want to delve too much deeper into one of these, we'll probably want a separate question.
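From the Java side, the bind-variable pattern described above looks roughly like this with plain JDBC. It is a sketch: the class and method names are hypothetical, and JDBC writes the placeholder as ? where the SQL above shows :1.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BindVariableDemo {
    // One statement text for every department; only the bound value changes.
    static final String EMP_BY_DEPT = "SELECT * FROM emp WHERE deptno = ?";

    // Prepares once, then binds and executes for each department number.
    static Map<Integer, Integer> countByDept(Connection conn, List<Integer> deptNos)
            throws SQLException {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(EMP_BY_DEPT)) {
            for (int deptNo : deptNos) {
                ps.setInt(1, deptNo);
                int rows = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) rows++;
                }
                counts.put(deptNo, rows);
            }
        }
        return counts;
    }

    // The embedded-literal anti-pattern, for contrast: a distinct statement text per value.
    static String withLiteral(int deptNo) {
        return "SELECT * FROM emp WHERE deptno = " + deptNo;
    }
}
```

For a performance test, passing a list of varied department numbers to countByDept exercises the same statement text with different values, which is the scenario the following sections are about.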
Histograms
By default, the optimizer assumes that data is equally spaced and equally likely. So, for example, if the deptno column has 50 distinct values, the optimizer assumes by default that each value is equally likely. That's probably a pretty reasonable assumption for most columns but it's obviously not reasonable for all columns. If I have a table with all active duty military members, for example, and one of the columns is birth_year, there will be far more rows with a birth_year of 1994 (20 years ago) than 1934 (80 years ago). In these cases, you gather histograms on the column in question in order to tell the optimizer that the data isn't evenly distributed and to let the optimizer gather information about which values are more common and how common they are.
The optimizer doesn't care about the values you are passing for your bind variable values unless there is a histogram on one of the columns in your predicate (I'll ignore for the moment the possibility that you are passing a value that is out of range).
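As a simplified illustration of the "equally likely" assumption (ignoring NULLs and out-of-range values; the row counts are made up), the estimate without a histogram is the same for every value:

```java
final class UniformEstimate {
    // Without a histogram, an equality predicate is assumed to match
    // tableRows / distinctValues rows, whatever value is bound.
    static long estimatedRows(long tableRows, long distinctValues) {
        return Math.round((double) tableRows / distinctValues);
    }
}
```

For a 10,000-row table with 50 distinct deptno values, that gives 200 rows for any deptno, even if one department actually holds most of the rows. That gap is what a histogram is there to close.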
Bind variable peeking
If you do have a histogram on one or more columns, then Oracle (9i and later if memory serves) will "peek" at the first value that is passed in for a bind variable and use that value with the histogram to determine the best plan for all subsequent executions. This works reasonably well the vast majority of the time but it occasionally leads to hair-pullingly painful problems (and much swearing) when Oracle peeks at a "bad" value and generates a plan that is efficient for that one execution but terrible for all future executions. This is summed up by Tom Kyte's story about the database that has to be restarted if it's rainy on a Monday morning. If you have a histogram on the column and different values that you might pass in would likely benefit from different query plans, you'd likely want to take bind variable peeking into consideration to determine if passing in values in a different order created any performance issues.
Adaptive cursor sharing
In recent versions (if memory serves 11.1 and later) and depending on your configuration, Oracle can use adaptive cursor sharing to maintain multiple query plans for a single statement and to use the most appropriate version for the particular bind variable value that is passed in. This is a much more sophisticated version of bind variable peeking that peeks for each set of values you pass in and figures out whether it is close enough to some other set of values to use the previously generated plan or whether it needs to compute a new plan for the new set of values. Figuring out what constitutes "close enough" and how this interacts with various features for ensuring plan stability is a rather involved topic in its own right.
You could use DB caching:
http://www.oracle.com/technetwork/articles/sql/11g-caching-pooling-088320.html
If the app is making a network round trip and calculating results, that will still eat considerable time.
I want to use a long timestamp value (maybe generated by System.currentTimeMillis()) as column names in my database. Can the System.currentTimeMillis() method guarantee always-increasing values? I have seen people complaining that it sometimes becomes slower.
I am also open to other alternatives for generating increasing column names. I just want to guarantee uniqueness (if two fall in the same millisecond, I can consider that OK) and an increasing sequence (and perhaps also a smaller size, i.e. fewer bytes, if at all possible).
Edit: I have a NoSQL database where column names (and hence columns) are sorted within a row as an ascending/descending number sequence. Thus I am looking to generate timestamps as column names that would let me sort the columns by time.
I am looking to store comments of a blog post in a single row using timestamp values as column names to enable sort by time. I think I wouldn't mind even if 10 ms is the resolution, since the probability of someone commenting in the same 1/100 of a second on the same blog post in my application would be very low.
Edit: Thank you all for your comments and suggestions. Really helpful. I think I have a solution to work around the occasional failures of System.currentTimeMillis(). I could implement it like this:
When a user adds a new comment to a post, the frontend will send an id, 'suggestedId', which is one greater than the id of the last comment (the frontend would know this from the previous database read). This id would be compared with the id generated using System.nanoTime(). If suggestedId is less than generatedId, then generatedId will be used; otherwise suggestedId will be used. So it simply means: whichever is greater, use that id. This guarantees monotonicity.
It is not truly perfect, but it sounds good enough for practical usage!
Would you guys like to share your thoughts on this? Thanks!
The general database design issues have been addressed by other commenters, but just on this point:
Can the System.currentTimeMillis() method guarantee always-increasing values? I have seen people complaining that it sometimes becomes slower.
For future reference, the word for this (always-increasing values) is monotonicity. No, System.currentTimeMillis() is not monotonic. Not only can it run more slowly or speed up (if, say, the system it's running on is using NTP for time correction), but it can also jump arbitrarily forwards or backwards (if the user, or a script, changes the system time).
System.nanoTime() does not formally guarantee monotonicity; however, the HotSpot JVM does if and only if the underlying system supports it (modern Linux kernels on modern hardware certainly do). That sounds better, with the caveat that some processors use power-management techniques etc. which can screw this up in the presence of multiple cores. So it's better, but still not perfect.
On many systems, System.currentTimeMillis() does not resolve below 10 ms increments. So two different calls can easily return the same value.
I suggest that you keep an auxiliary table with a counter that you can increment to give the next value.
Why do you want this for column names? It seems a very odd sort of database design.
I am looking to store comments of a blog post in a single row using timestamp values as column names to enable sort by time.
I'm no NoSQL expert, but I'd say it's not a good idea to store comments as columns in one row. Why don't you add a row per comment along with a timestamp you can sort by?
Using a traditional relational database the table could look like this:
comments
--------
id (PK)
blog_id (FK)
created_on (timestamp)
text
Selecting the comments in order would then be in SQL:
SELECT * from comments WHERE blog_id = ? ORDER BY created_on
System.currentTimeMillis() typically has around 10-20ms granularity, but even if it had 1ms granularity, in principle, 1ms is an eternity in computing time and it would be quite plausible, depending on what you're doing, for two calls to end up with the same value. However, I'm guessing that even 20ms is probably not an eternity compared to how frequently people make blog comments.
So, if two people post a comment within the same 20ms (or whatever), just sorting on this value will not define an order for the posts in question. But do you particularly care about this unlikely situation? If you do, then you need to build in a little bit of extra logic (have a counter for the number of messages posted "this millisecond"). I personally wouldn't bother in your use case.
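If you did want that extra logic, one way to sketch it is an id that follows the clock but never repeats or goes backwards. This is an illustration for a single JVM only; the class name is hypothetical, and the clock is injected so it can be tested.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hands out strictly increasing ids that track wall-clock milliseconds when possible.
// If the clock repeats a value or steps backwards, the previous id plus one is used instead.
final class MonotonicIds {
    private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);
    private final LongSupplier clock;

    MonotonicIds(LongSupplier clock) {
        this.clock = clock;
    }

    long next() {
        long now = clock.getAsLong();
        return last.updateAndGet(prev -> Math.max(prev + 1, now));
    }
}
```

In an application you would create it as new MonotonicIds(System::currentTimeMillis). With several application servers this no longer guarantees a global order, and a counter or sequence kept in the database is the safer choice.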
As far as I can understand, you're also storing the data in a fundamentally silly way. Why not just have a "Comments" table with a row per comment and a single time column, which you can sort on as required?
Many databases provide a way to get serial numbers into a column. For example, see this: PostgreSQL Autoincrement