Caching newest rows with H2 - java

We have a relatively large table in an H2 database with up to 12 million rows. The table contains status information that a user needs to see on a web interface. The user is mainly interested in the last couple of hundred or thousand entries, or entries over the last n days. Of course it will sometimes also be necessary to query all entries, but we can assume that this happens seldom and can take its time. Our main problem is that the target platform is not a full-blown server but a more embedded solution, and with tables of that size the embedded system takes a couple of seconds to respond, so the web UI (with AJAX etc.) feels sluggish.
To make the query faster we already added indexes, max_row_memory and caching. This makes the query impressively faster, but it is still not in the range where we would like it to be.
As I understand it, H2 flushes the table's cache whenever an INSERT / UPDATE / DELETE is performed on the table. A large part of the application depends on the last n rows, and I am searching for a way to always keep these n rows in cache, so that a SELECT query for the last n rows is served from the cache even after a previous INSERT.
As I did not find any solution in H2 directly, my first approach would be to implement the caching as a second level inside the application. That solution would be OK, but from a design point of view I find it more appealing to have it inside H2. Does anybody have an idea how I could solve this with H2?
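For what it's worth, a minimal sketch of the application-level second cache mentioned above, assuming every INSERT goes through the application (the generic row type and the capacity are placeholders):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Keeps the newest rows in memory; every INSERT goes through add(),
// so a "last n rows" SELECT never has to touch H2 at all.
public class LastNCache<T> {
    private final int capacity;
    private final Deque<T> newest;

    public LastNCache(int capacity) {
        this.capacity = capacity;
        this.newest = new ArrayDeque<>(capacity);
    }

    public synchronized void add(T row) {
        if (newest.size() == capacity) {
            newest.removeLast(); // evict the oldest cached row
        }
        newest.addFirst(row);    // newest row first
    }

    public synchronized List<T> lastN(int n) {
        List<T> result = new ArrayList<>(Math.min(n, newest.size()));
        for (T row : newest) {
            if (result.size() == n) break;
            result.add(row);
        }
        return result;
    }
}

This only helps if all writes go through the application; writes from other processes would make the cache stale.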

H2 doesn't flush the cache on modification.
Your best bet is to run a profiler over your application to see where the time is actually going.

Related

Cache query results

Let's say we have a highly configurable report system which allows users to select columns, filters, and sorting.
All this configuration is sent to the back end, where it is transformed into SQL and executed against the DB, and then the user sees the report and can continue to work with it. But on each operation, like sorting, we still rebuild and rerun the query.
The transformation itself takes a few milliseconds, but the query execution against the DB can take 3-5 seconds (up to 20 if there are a lot of parallel executions).
So, I'm thinking about adding some sort of cache.
Currently, I see 3 ways:
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on a much smaller amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Like the first option, but with an LRU cache on the Java side. We can fit 2-3k report results in memory. It will usually be faster than the first option since we don't have a lot of parallel users, just users with lots of reports.
The cache will be invalidated a few times a day.
What do you see as the best way to make it faster? What are the pros and cons of the proposed solutions from your perspective? What would you do if you were free to select the database and technology (Java stack)?
OK, let's make sure I got it right.
there are more than 10k different reports
So it doesn't make sense to pre-calculate and pre-cache them, they have to be generated on-demand.
there is not a lot of data in rows, just short strings, dates and integers. It's not costly to fetch it into memory and even keep it there for a while
So caching a small amount of data can avoid a big, costly query; that's good.
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Problem is, most likely every report query will have different columns, with different names, so that doesn't fit a single table well unless you use a format like JSON, storing each cached result row as a JSON dictionary... And in that case indexing it would be a problem: even if you create indexes on fields inside JSON values, if you have a zillion different column names across your many reports you'll need a zillion indexes too...
Smells like a can of worms.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on a much smaller amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Pros: each cache table can have the proper columns, data types and indexes. It is easy to invalidate the cache: just truncate it. You can set all the cache tables to UNLOGGED to make them faster (a JDBC sketch follows below). And you can do all the extra sorting/filtering on the cached result using the same SQL queries you were using before, so this might be the simplest option to code. It is also nice for pagination if you only want to fetch part of the result. And it will be the fastest option for copying the results of reporting queries into the cache, since the cache is already in Postgres and there is no need to transfer data. You can also store the cache on another drive/SSD.
Cons: I've heard the main issue with tons of tables is that some filesystems slow down on directories with large numbers of files. That shouldn't be an issue on modern filesystems though, and I don't think Postgres itself is going to be bothered at all by 10k tables.
It might make queries on information_schema slow, and stuff like "\dt" in psql problematic, so the cache tables would be better hidden away in a "cache" schema so they don't interfere. This will also make it easier to exclude them from backups.
It will also use some RAM on the Postgres server to cache the cache tables; how much depends on the number of online users.
I'd say it would be worth a little bit of benchmarking. Create a schema, add 10k tables, see if something breaks.
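As a rough sketch of what this option could look like over JDBC (PostgreSQL assumed; the "cache" schema, the report_42 table and its query are invented for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ReportCacheRefresher {
    // url/user/password and the report query are placeholders.
    public static void refresh(String url, String user, String password) throws Exception {
        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE SCHEMA IF NOT EXISTS cache");
            stmt.execute("DROP TABLE IF EXISTS cache.report_42");
            // UNLOGGED skips the WAL, which is fine for a cache that can be rebuilt.
            stmt.execute("CREATE UNLOGGED TABLE cache.report_42 AS "
                       + "SELECT order_date, customer, total FROM orders");
            stmt.execute("CREATE INDEX ON cache.report_42 (order_date)");
            // Later: sort/filter/paginate against cache.report_42 with plain SQL;
            // invalidation is just TRUNCATE cache.report_42.
        }
    }
}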
Like the first option, but with an LRU cache on the Java side. We can fit 2-3k report results in memory. It will usually be faster than the first option since we don't have a lot of parallel users, just users with lots of reports.
That's a bit of reinventing the wheel, and you have to reimplement the sort/filter in Java... plus the cache algorithms... meh.
There are other options though:
Put the cache in another database, on another machine. This may be a postgres instance, or another database (which may require rewriting some queries). Could be interesting only if the cache eats too much RAM on your database.
Put the cache in the web browser, and use JavaScript to filter/sort. That could be faster depending on the speed of the internet connection, and it would reduce server load, but you'll have to write lots of JavaScript code.
IMO you're being cautious about the large number of tables, and it is good to be cautious, but if it works well, it really is the simplest solution...

How to improve performance of a simple select query in oracle

I recently got into an interview and I was asked a question
We have a table employee(id, name). In our Java code, we are writing logic to fetch data from this table and display it in the UI. The query is
SELECT id, name FROM employee
The question was: during debugging, we found that this JDBC call to fire the query and get the output is taking, say, 20 seconds, and we want to reduce this to, say, 5 seconds or to the optimal time. How can we do that, or how would you tackle this problem?
As there is no WHERE clause in the query, I didn't suggest indexing the column.
As this logic takes 20 seconds every time, some other code holding a lock on this table is also out of the question.
I suggested that limiting the number of records fetched from the table should help, but the interviewer didn't look convinced.
Is there anything else we can do as developers to optimize the call? I guess a DBA might tune database settings to improve the performance of this query, but is there any other way?
OK, so this is an interview question, so both the problem and the solutions are hypothetical. The interviewer is asking for possible optimizations and / or approaches. Here are some that are most likely to help:
Modify the query to page the data rather than fetching the whole lot (see the paging sketch below). This looks applicable for the example query. Note that this is not just "limiting the number of rows selected from the table" ... which is probably why the interviewer looked doubtful when you said that!
If you do need to display the entire selected record set but in a reduced form (e.g. summed, averaged, sorted, collated etc), do the reduction in the query rather than by fetching the records and doing it in the client.
Tune the fetchSize() as suggested by Ivan.
Here are some other ideas that are less likely to help and / or will require extensive reworking.
Look at the network configuration. For example, you may be able to get better throughput through OS-level tuning of TCP buffers, or by optimizing physical or virtual network paths.
Run the query on the database server itself (to eliminate network overheads)
Use an in-memory table
Query a secondary database server; e.g. a readonly snapshot or a slave
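A minimal paging sketch over JDBC for the first point, assuming Oracle 12c or later (which supports OFFSET/FETCH); the page size is arbitrary:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class EmployeePager {
    private static final int PAGE_SIZE = 100;

    // Fetch one page at a time instead of the whole table in a single call.
    public static void printPage(Connection conn, int pageNo) throws Exception {
        String sql = "SELECT id, name FROM employee ORDER BY id "
                   + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, pageNo * PAGE_SIZE);
            ps.setInt(2, PAGE_SIZE);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}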
You can try increasing fetchSize() on the Statement/PreparedStatement to decrease the number of network round trips between the application server/desktop and the database server (a sketch follows below).
You can also start several threads that each query a piece of the data, and then merge the results from all the threads.
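A minimal fetchSize sketch (500 is just an example value; the Oracle JDBC driver defaults to 10 rows per round trip):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class EmployeeStreamer {
    // Larger fetch sizes mean fewer network round trips when streaming many rows.
    public static void streamAll(Connection conn) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement("SELECT id, name FROM employee")) {
            ps.setFetchSize(500); // default is 10 for the Oracle driver
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}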
EDIT: doesn't apply to this situation because id and name are the only columns on this table, but still useful for other readers to note.
If you create an index covering both id and name, then the database can use that index to read the data faster, since it won't even have to read the table.
See this link for a more thorough explanation.
if the index contains all the columns you’re requesting it doesn’t even need to look in the table. That concept is known as index coverage.
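For example, over JDBC (the index name is invented; with both columns in the index, the example query can be answered from the index alone):

import java.sql.Connection;
import java.sql.Statement;

public class CoveringIndex {
    public static void create(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // After this, SELECT id, name FROM employee never touches the table blocks.
            stmt.execute("CREATE INDEX employee_id_name_ix ON employee (id, name)");
        }
    }
}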

Is there a good patterns for distributed software and one backend database for this problem?

I'm looking for a high-level answer, but here are some specifics in case it helps: I'm deploying a J2EE app to a cluster in WebLogic, with one Oracle database at the backend.
The normal flow of the app is:
- users feed data (to be inserted as rows) to the app
- the app waits for the data to reach a certain size and does a batch insert into the database (only 1 commit)
There's a constraint in the database preventing "duplicate" data insertions. If the app gets a constraint violation, it will have to rollback and re-insert one row at a time, so the duplicate rows can be "renamed" and inserted.
Suppose I had 2 running instances of the app. Each of the instances is about to insert 1000 rows. Even if there is only 1 duplicate, one instance will have to rollback and insert rows one by one.
I can easily see that it would be smarter to re-insert the non-conflicting 999 rows as a batch in this instance, but what if I had 3 running apps and the 999 rows also had a chance of duplicates?
So my question is this: is there a design pattern for this kind of situation?
This is a long question, so please let me know where to clarify. Thank you for your time.
EDIT:
The 1000 rows of data is in memory for each instance, but they cannot see the rows of each other. The only way they know if a row is a duplicate is when it's inserted into the database.
And if the current application design doesn't make sense, feel free to suggest better ways of tackling this problem. I would appreciate it very much.
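For concreteness, a sketch of the batch-then-fallback flow described in the question (table and column names are invented, and the exact SQLException subtype your driver throws on a constraint violation may differ):

import java.sql.BatchUpdateException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;
import java.util.List;

public class BatchInserter {
    // Fast path: one batch, one commit. If a duplicate violates the constraint,
    // roll back and fall back to row-by-row inserts so conflicts can be renamed.
    public static void insert(Connection conn, List<String> names) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO items (name) VALUES (?)")) {
            for (String name : names) {
                ps.setString(1, name);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } catch (BatchUpdateException e) {
            conn.rollback();
            insertOneByOne(conn, names);
        }
    }

    private static void insertOneByOne(Connection conn, List<String> names) throws Exception {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO items (name) VALUES (?)")) {
            for (String name : names) {
                try {
                    ps.setString(1, name);
                    ps.executeUpdate();
                } catch (SQLIntegrityConstraintViolationException dup) {
                    ps.setString(1, name + "_renamed"); // application-specific rename rule
                    ps.executeUpdate();
                }
            }
            conn.commit();
        }
    }
}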
http://www.oracle-developer.net/display.php?id=329
The simplest approach would be to avoid parallel processing of the same data. For example, your size- or time-based event could run only on one node, or could post a message to a JMS queue so that only one of the nodes processes it (for instance, by using a similar duplicate check, e.g. based on a timestamp of the message/batch).
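A minimal sketch of the JMS variant (the JNDI names jms/ConnectionFactory and jms/BatchQueue are assumptions; in WebLogic they would be whatever you configured):

import java.util.ArrayList;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.ObjectMessage;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.InitialContext;

public class BatchPublisher {
    // Instead of inserting directly, each app instance posts its batch to a
    // queue; a single consumer drains it, so inserts never race each other.
    public static void publish(ArrayList<String> batch) throws Exception {
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("jms/BatchQueue");
        Connection conn = cf.createConnection();
        try {
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            ObjectMessage msg = session.createObjectMessage(batch); // ArrayList is Serializable
            producer.send(msg);
        } finally {
            conn.close();
        }
    }
}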

Fast way to replicate a huge database table

We are currently trying to solve a performance problem: searching for data and presenting it in a paginated way takes about 2-3 minutes.
Upon further investigation (and after several rounds of SQL tuning), it seems that searching is slow simply because of the sheer amount of data.
A possible solution that I'm currently investigating is to replicate the data into a searchable cache. Now this cache can be in the database (e.g. a materialized view) or outside the DB (a NoSQL approach). However, since I would like the cache to be horizontally scalable, I am leaning towards caching it outside the database.
I've created a proof of concept, and indeed, searching in my cache is faster than in the DB. However, the initial full replication takes a long time to complete. Although the full replication happens just once, and subsequent replications are incremental, covering only what changed since the last replication, it would still be great if I could speed up the initial full replication.
However, during full replication, aside from the slowness of the query's execution, I also have to battle against network latency. In fact, I can deal with the slow query execution time. But the network latency is really really slowing the replication down.
Which leads me to my question: how can I speed up my replication? Should I spawn several threads, each one doing a query? Should I use a scrollable result set?
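On the threading idea: one way to sketch it is to split the key space into ranges and copy them in parallel, each thread on its own connection (the table and column names, connection URL, pool size and chunking are all placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelReplicator {
    // Copy rows in id ranges so several threads can stream chunks concurrently.
    public static void replicate(String url, long maxId, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long chunk = maxId / threads + 1;
        for (int i = 0; i < threads; i++) {
            final long from = i * chunk;
            final long to = from + chunk;
            pool.submit(() -> copyRange(url, from, to));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for all chunks to finish
    }

    private static void copyRange(String url, long from, long to) {
        String sql = "SELECT id, payload FROM big_table WHERE id >= ? AND id < ?";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setFetchSize(1000); // stream instead of buffering everything
            ps.setLong(1, from);
            ps.setLong(2, to);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // push rs.getLong("id"), rs.getString("payload") into the cache here
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}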
Replicating the data in a cache seems like replicating the functionality of the database.
From reading other comments, I see that you are not doing this to avoid network roundtrips, but because of costly joins. In many DBMS you can create temporary tables - like this:
CREATE TEMPORARY TABLE abTable AS SELECT * FROM a, b;
If a and b are large (relatively permanent) tables, then you will have a one-time cost of 2-3 minutes to create the temporary table. However, if you use abTable for many queries, the subsequent per-query cost will be much smaller than
SELECT name, city, ... FROM a, b;
Other database systems have a view concept which lets you do something like this:
CREATE VIEW abView AS SELECT * FROM a, b;
Changes in the underlying a and b tables will be reflected in abView.
If you really are concerned about network round trips, then you may be able to replicate parts of the database on the local computer.
A good database management system should be able to handle your data needs. So why reinvent the wheel?
- SELECT * FROM YOUR_TABLE
- Map results into an object or data structure
- Assign a unique key for each object or data structure
- Load the key and object or data structure into a WeakHashMap to act as your cache.
I don't see why you need sorting, because your cache should access values by unique key in O(1) time. What is sorting buying you?
Be sure to think about thread safety.
I'm assuming that this is a read-only cache, and you're doing this to avoid the constant network latency. I'm also assuming that you'll do this once on start up.
How much data per record? 12M records at 1KB per record means you'll need 12GB of RAM just to hold your cache.
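A minimal sketch of the steps above (Record is a stand-in row type; note the caveat that WeakHashMap holds weak references to its keys, so cached entries can disappear after a GC):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class RowCache {
    // Wrapped for thread safety, as noted above. Caution: boxed Long keys with
    // no other strong reference can be collected, silently dropping entries.
    private final Map<Long, Record> cache =
            Collections.synchronizedMap(new WeakHashMap<Long, Record>());

    public static class Record {
        final long id;
        final String name;
        Record(long id, String name) { this.id = id; this.name = name; }
    }

    // One pass over the table at start-up: map each row to an object and
    // key it by its unique id, exactly as the steps above describe.
    public void load(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM your_table")) {
            while (rs.next()) {
                Record r = new Record(rs.getLong("id"), rs.getString("name"));
                cache.put(r.id, r);
            }
        }
    }

    public Record get(long id) { return cache.get(id); }
}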

speed up operation on mysql

I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How do I do multi-threading across different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce style things on a mysql database. It might be what you are looking for: see here. There is a recording on the mysql webpage here (have to register for mysql.com though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by each node. Write the Java program so that it first fetches the last row processed by the other nodes, and then adds an entry to the same table telling the other nodes what range of rows it is going to process (you can decide this number). In our case, let's assume each node can process 1000 rows at a time. Node 1 fetches table B and finds it empty, so Node 1 inserts a row ('Node1', 1000) announcing that it is processing rows of A with primary key <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along and finds that the first 1000 primary keys are being handled by some other node, so it inserts a row ('Node2', 2000) informing the others that it is processing rows between 1001 and 2000. Note that access to table B must be synchronized, i.e. only one node can work on it at a time. A sketch of this claim step follows.
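A sketch of that claim step over JDBC (table and column names are invented; LOCK TABLES is used to serialize access to table B, following the MySQL-documented pattern of committing before UNLOCK TABLES):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class RangeClaimer {
    static final int BATCH = 1000;

    // Claim the next range of table A by appending a row to table B.
    // Locking table B makes the read-then-insert atomic across nodes.
    public static long claimUpperBound(Connection conn, String nodeName) throws Exception {
        conn.setAutoCommit(false);
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("LOCK TABLES b WRITE");
            long last = 0;
            try (ResultSet rs = stmt.executeQuery("SELECT MAX(last_id) FROM b")) {
                if (rs.next()) last = rs.getLong(1); // 0 if table B is empty
            }
            long upper = last + BATCH;
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO b (node, last_id) VALUES (?, ?)")) {
                ps.setString(1, nodeName);
                ps.setLong(2, upper);
                ps.executeUpdate();
            }
            conn.commit();
            stmt.execute("UNLOCK TABLES");
            return upper; // this node now owns rows (upper - BATCH, upper]
        }
    }
}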
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chance of query cache hits, as well as reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL to the cluster transparently.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance, and changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best options would be to do the update in pure SQL if that is at all possible, or to use a stored procedure so that all processing can take place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on some application key.
