I'm using the H2 database - running in embedded mode - and when my app starts up I load the H2 database with data from a MySQL database. I'm using linked tables to point to the MySQL tables.
My issue is that I'm trying to reduce the time that H2 takes to create the indexes on the tables, particularly for larger tables (5 million+ rows).
Does anyone know if it is safe to run the CREATE INDEX commands in a separate thread while I load the next table's data into H2?
For example:
Thread 1: Loads table 1 -> tells Thread 2 to start creating indexes and then Thread 1 loads table 2, etc.
I can't use the MVCC mode when loading the tables because later on I need to use the MULTI_THREADED mode when I do my selects. When I tried using the MULTI_THREADED mode I got locking errors even though I was loading data into separate tables.
Many thanks!
What might work (but I'm not sure if it's faster) is to create the tables and indexes first, and then load the tables in parallel. This should avoid locking problems in the system table.
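A rough sketch of the "load the tables in parallel" part, assuming the tables and indexes already exist. The class name ParallelLoader and the loadOne callback are made up for illustration; in a real app the callback would open its own connection and copy one table from its linked table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class ParallelLoader {
    /**
     * Runs loadOne once per table name on a fixed-size thread pool and
     * waits for every load to finish. Returns the number of tables loaded.
     * loadOne is a hypothetical callback supplied by the caller.
     */
    public static int loadAll(List<String> tables, int threads, Consumer<String> loadOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String table : tables) {
                pending.add(pool.submit(() -> loadOne.accept(table)));
            }
            int finished = 0;
            for (Future<?> f : pending) {
                f.get(); // blocks until done; surfaces any failure from the load task
                finished++;
            }
            return finished;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a table load failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this is actually faster than loading sequentially depends on where the time goes, so it is worth measuring with two or three threads before committing to it.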
I would also add that rst.findColumn("columnName") can be used to look up a column's index AFTER getting the result set of the table, where rst is a ResultSet object. This is what I have used.
Another way to dramatically improve H2 loading and especially indexing performance is to set the initial heap size close to the expected memory requirement. As one example, this one change allowed an app with about a 1.5GB requirement to start up in 47 seconds instead of failing after 15 - 20 minutes. Prior to this, we were seeing "GC overhead limit exceeded" and JVMTI errors.
Add the following to your VM arguments (as an example):
-Xms2g
-Xmx4g
Related
I'm running queries in parallel against a MySQL database. Each query takes less than a second to run and another half a second to a second to fetch.
This is acceptable for me. But when I run 10 of these queries in parallel and then attempt another set in a different session everything slows down and a single query can take some 20 plus seconds.
My ORM is Hibernate and I'm using C3P0 with <property name="hibernate.c3p0.max_size">20</property>. I'm sending the queries in parallel by using Java threads. But I don't think these are related because the slowdown also happens when I run the queries in MySQL Workbench. So I'm assuming something in my MySQL config is missing, or the machine is not fast enough.
This is the query:
select
*
FROM
schema.table
where
site = 'sitename' and (description like '% family %' or title like '% family %')
limit 100 offset 0;
How can I make this go faster when facing let's say 100 concurrent queries?
I'm guessing that this is slow because the where clause is doing a substring match (LIKE with a leading wildcard) on the description and title columns. An ordinary index can't help with that, so the database has to look through the entire field on every record, and that's never going to scale.
Each of those 10 concurrent queries must read the 1 million rows to fulfill the query. If you have a bottleneck anywhere in the system - disk i/o, memory, CPU - you may not hit that bottleneck with a single query, but you do hit it with 10 concurrent queries. You could use one of these tools to find out which bottleneck you're hitting.
Most of the time, those bottlenecks (CPU, memory, disk) are too expensive to upgrade - especially if you need to scale to 100 concurrent queries. So it's better to optimize the query/ORM approach.
I'd consider using Hibernate's built-in free text capability here - it requires some additional configuration, but works MUCH better when looking for arbitrary strings in a textual field.
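For comparison, here is a sketch of the same idea done directly in MySQL rather than through Hibernate: a FULLTEXT index queried with MATCH ... AGAINST over plain JDBC. This is a different technique from the Hibernate one mentioned above. The table name listings is a placeholder, and it assumes a FULLTEXT index already exists on exactly (description, title).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class FullTextQuery {
    // "listings" is a hypothetical table name. Assumes a FULLTEXT index
    // on (description, title); the MATCH column list must match that index.
    public static final String SQL =
        "SELECT title FROM listings"
        + " WHERE site = ?"
        + " AND MATCH(description, title) AGAINST (? IN NATURAL LANGUAGE MODE)"
        + " LIMIT 100 OFFSET 0";

    /** Returns the titles of up to 100 rows on the given site that match the term. */
    public static List<String> titles(Connection conn, String site, String term)
            throws SQLException {
        List<String> out = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            ps.setString(1, site);
            ps.setString(2, term);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    out.add(rs.getString(1));
                }
            }
        }
        return out;
    }
}
```

Full-text matching works on whole words, so the results will not be identical to a LIKE '% family %' search, but it lets the database use an index instead of scanning every row.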
I have Java application that uses Spring JPA and Hibernate to connect to ORACLE 11g database.
From time to time, I need to drop partition in the DB and rebuild all the UNUSABLE global indexes to USABLE state. (The indexes become unusable due to drop partition command)
Between the time when my partition is dropped and the time the UNUSABLE indexes are rebuilt, my online application fails with an ORA-01502 error like the one below.
Caused by: java.sql.BatchUpdateException: ORA-01502: index 'USER.INDEX_NAME' or partition of such index is in unusable state
at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10070)
at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:213)
at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:70)
at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:268)
... 94 more
In Oracle there is an option to ignore UNUSABLE indexes by setting skip_unusable_indexes=TRUE. With it, the query optimizer selects a different (more expensive) execution plan that does not use the index, and DML statements do not fail because of unusable indexes.
Is there a similar option in Hibernate that I can use so that it does not fail while indexes are in the UNUSABLE state?
Versions I am using
Hibernate: 3.6.9
Oracle: 11g
Java: 7
You can rebuild the index:
ALTER INDEX USER.INDEX_NAME REBUILD;
You may try to execute:
ALTER SESSION SET skip_unusable_indexes=true
like this, but the session will be returned to the connection pool and reused, so the setting will affect more than one query.
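A minimal JDBC sketch of that, assuming you can get at the underlying Connection (in Hibernate 3.x that should be possible through Session.doWork). The class name SkipUnusable is made up for illustration.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class SkipUnusable {
    /**
     * Turns skip_unusable_indexes on or off for the Oracle session
     * behind this connection. No trailing semicolon: JDBC sends the
     * statement as-is.
     */
    public static void set(Connection conn, boolean skip) {
        String sql = "ALTER SESSION SET skip_unusable_indexes=" + (skip ? "true" : "false");
        try (Statement st = conn.createStatement()) {
            st.execute(sql);
        } catch (SQLException e) {
            throw new IllegalStateException("could not change session setting", e);
        }
    }
}
```

Because of the pooling issue, one option is to call set(conn, true) before the work and set(conn, false) in a finally block, so the connection goes back to the pool in a known state.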
If I were you I would ask myself "Why are my indexes unusable?" This is a situation that should not happen unless you are doing maintenance or running a very large batch process. You may have a 24/7 system where you don't really want to stop the system for maintenance. In that case you can set the option system-wide without a single change to your code. The system will be slower but will behave better during maintenance. Just remember that indexes used to enforce constraints can't be ignored, so insert/update queries that touch them will fail anyway. Also add an automatic check that reports unusable indexes in production at certain times; a simple PL/SQL process that sends emails can be enough.
Another alternative is to change the option only during changes in the database:
ALTER SYSTEM SET skip_unusable_indexes=true;
ALTER TABLE T1 DROP PARTITION P1;
ALTER INDEX I1 REBUILD ONLINE;
ALTER SYSTEM SET skip_unusable_indexes=false;
On dba.stackexchange.com there is a discussion about the best way to drop a partition. So you are not alone, but the solution there is for Oracle 12c.
We have a relatively large table in an H2 database with up to 12 million rows. The table contains status information that a user needs to see on a web interface. The user is mainly interested in the last couple of hundred / thousand entries, or entries over the last n days. Of course sometimes it will also be necessary to query all entries, but we can assume that this happens rarely and can take its time. Our main problem is that we do not have a full-blown server as the target platform but a more embedded solution, and with tables that size the embedded system takes a couple of seconds to respond and the web UI (with AJAX etc.) feels sluggish.
To make the query faster we already added indexes, max_row_memory and caching. This makes the query impressively faster, but is still not in the range where we would like to be.
As I understand, H2 flushes the cache of the table if an INSERT / UPDATE / DELETE is performed on the table. A large part of the application depends on the last n rows and I am searching for a way to always keep these n-rows in cache, so that if a SELECT query to get the last n rows is called, even after a previous INSERT, the rows are collected from the cache.
As I did not find any solution in H2 directly, my first approach would be to implement the caching as a second level inside the application. That solution would be OK, but from a design point of view I find it more appealing to have it inside H2. Does anybody have an idea how I could solve this with H2?
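For reference, the application-level approach could be as small as a bounded buffer that is updated on every insert. This is only a sketch; the class name RecentRows and its methods are invented for illustration, and it only handles inserts (updates or deletes of cached rows would need extra invalidation).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class RecentRows<T> {
    private final int capacity;
    private final Deque<T> rows = new ArrayDeque<>();

    public RecentRows(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Call after each successful INSERT; evicts the oldest row when full. */
    public synchronized void onInsert(T row) {
        if (rows.size() == capacity) {
            rows.removeFirst();
        }
        rows.addLast(row);
    }

    /** Returns up to n cached rows, newest first. */
    public synchronized List<T> latest(int n) {
        List<T> out = new ArrayList<>();
        Iterator<T> it = rows.descendingIterator();
        while (it.hasNext() && out.size() < n) {
            out.add(it.next());
        }
        return out;
    }
}
```

Requests for more rows than the buffer holds would still have to go to the database.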
H2 doesn't flush the cache on modification.
Your best bet is to run a profiler over your application to see where the time is going.
I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How can I do multi-threading across different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce style things on a mysql database. It might be what you are looking for: see here. There is a recording on the mysql webpage here (have to register for mysql.com though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to record which rows each node has claimed. The Java program first fetches the last row claimed by any node and then adds an entry to the same table telling the other nodes what range of rows it is going to process (you can decide the range size). In our case, let's assume each node processes 1000 rows at a time. Node 1 reads table B and finds it empty. Node 1 then inserts a row ('Node1', 1000), meaning it is processing rows whose primary key in A is <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along and finds that the first 1000 primary keys are taken by another node, so it inserts a row ('Node2', 2000) to tell the others that it is processing rows 1001 to 2000. Please note that access to table B must be synchronized, i.e. only one node can work on it at a time.
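Here is an in-memory sketch of that claiming logic, just to make the bookkeeping concrete. The class RangeClaims stands in for table B and is hypothetical; a synchronized method plays the role of the database lock. In the real setup the read-then-insert on table B would have to run inside a transaction that locks the table or row.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeClaims {
    /** One row of the hypothetical table B: node name plus the key range it claimed. */
    public static final class Claim {
        public final String node;
        public final long lowerBound; // inclusive
        public final long upperBound; // inclusive

        Claim(String node, long lowerBound, long upperBound) {
            this.node = node;
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
        }
    }

    private final long chunkSize;
    private final long maxKey;
    private long lastClaimed = 0;
    private final List<Claim> claims = new ArrayList<>();

    public RangeClaims(long chunkSize, long maxKey) {
        this.chunkSize = chunkSize;
        this.maxKey = maxKey;
    }

    /**
     * Claims the next unprocessed key range for the given node,
     * or returns null when every key up to maxKey has been claimed.
     * synchronized: only one node may read and update the claims at a time.
     */
    public synchronized Claim claimNext(String node) {
        if (lastClaimed >= maxKey) {
            return null;
        }
        long lower = lastClaimed + 1;
        long upper = Math.min(lastClaimed + chunkSize, maxKey);
        lastClaimed = upper;
        Claim claim = new Claim(node, lower, upper);
        claims.add(claim);
        return claim;
    }
}
```

Each node would loop on claimNext, process rows of A with keys in the returned range, and stop when it gets null.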
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits and reduce the overall workload on the backend, offloading some of the query matching work to the frontends (where you have more resources). It will also reduce the time a row lock is held, which decreases contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL back to the cluster transparently.
You could set up sharding with a mysql on each machine but the set up time, maintenance and the changes to database access layer might be a lot of work compared to a gearman solution. You might also want to look at the experimental spider engine that could allow you to use multiple mysqls in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database no amount of parallelism or clustering on the application side will make much difference.
So your best option would be to do the update in pure SQL if that is at all possible, or to use a stored procedure, so that all processing takes place within the MySQL server and no data movement is required.
If this is not fast enough then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on some application key.
I am writing a program that does a lot of writes to a Postgres database. In a typical scenario I would be writing say 100,000 rows to a table that's well normalized (three foreign integer keys, the combination of which is the primary key and the index of the table). I am using PreparedStatements and executeBatch(), yet I can only manage to push in say 100k rows in about 70 seconds on my laptop, when the embedded database we're replacing (which has the same foreign key constraints and indices) does it in 10.
I am new to JDBC and I don't expect it to beat a custom embedded DB, but I was hoping it would be only 2-3x slower, not 7x. Is there anything obvious that I may be missing? Does the order of the writes matter (e.g. if it's not the order of the index)? Are there things to look at to squeeze out a bit more speed?
This is an issue that I have had to deal with often on my current project. For our application, insert speed is a critical bottleneck. However, we have discovered that for the vast majority of database users select speed is the chief bottleneck, so you will find that there are more resources dealing with that issue.
So here are a few solutions that we have come up with:
First, all solutions involve using the postgres COPY command. Using COPY to import data into postgres is by far the quickest method available. However, the JDBC driver by default does not currently support COPY across the network socket. So, if you want to use it you will need one of two workarounds:
A JDBC driver patched to support COPY, such as this one.
If the data you are inserting and the database are on the same physical machine, you can write the data out to a file on the filesystem and then use the COPY command to import the data in bulk.
Other options for increasing speed are using JNI to hit the postgres API so you can talk over the unix socket, removing indexes, and the pg_bulkload project. However, in the end, if you don't implement COPY you will always find performance disappointing.
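The file-based workaround could be sketched like this: write the rows to a CSV file, then run a server-side COPY that reads it. The class name CopyFile and the three-integer row layout are assumptions for illustration. The file must be readable by the postgres server process, and server-side COPY FROM a file typically needs elevated privileges.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CopyFile {
    /** Writes each row (three integer keys, an assumed layout) as one CSV line in a temp file. */
    public static Path writeCsv(List<int[]> rows) {
        List<String> lines = new ArrayList<>();
        for (int[] r : rows) {
            lines.add(r[0] + "," + r[1] + "," + r[2]);
        }
        try {
            Path file = Files.createTempFile("bulk_load", ".csv");
            Files.write(file, lines, StandardCharsets.UTF_8);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Builds the COPY statement for the file. The table name is supplied
     * by the caller; paths containing a single quote are not handled here.
     */
    public static String copySql(String table, Path file) {
        return "COPY " + table + " FROM '" + file.toAbsolutePath() + "' WITH (FORMAT csv)";
    }
}
```

The resulting statement can then be run through an ordinary JDBC Statement, since the server does the file reading.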
Check if your connection is set to autoCommit. If autoCommit is true, then if you have 100 items in the batch when you call executeBatch(), it will issue 100 individual commits. That can be a lot slower than calling executeBatch() followed by a single explicit commit().
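A sketch of that pattern: autoCommit off, batches flushed every batchSize rows, one commit at the end. The table and column names in the SQL are placeholders, and the class name BatchInsert is made up.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsert {
    // Hypothetical table and column names, for illustration only.
    static final String SQL = "INSERT INTO facts (a_id, b_id, c_id) VALUES (?, ?, ?)";

    /** Inserts all rows in one transaction; returns how many batches were sent. */
    public static int insertAll(Connection conn, List<int[]> rows, int batchSize) {
        int batches = 0;
        try {
            boolean previous = conn.getAutoCommit();
            conn.setAutoCommit(false); // avoid a commit per statement
            try (PreparedStatement ps = conn.prepareStatement(SQL)) {
                int pending = 0;
                for (int[] r : rows) {
                    ps.setInt(1, r[0]);
                    ps.setInt(2, r[1]);
                    ps.setInt(3, r[2]);
                    ps.addBatch();
                    if (++pending == batchSize) {
                        ps.executeBatch();
                        batches++;
                        pending = 0;
                    }
                }
                if (pending > 0) { // flush the final partial batch
                    ps.executeBatch();
                    batches++;
                }
                conn.commit(); // single explicit commit for the whole load
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            } finally {
                conn.setAutoCommit(previous);
            }
        } catch (SQLException e) {
            throw new IllegalStateException("batch insert failed", e);
        }
        return batches;
    }
}
```

The batch size is worth experimenting with; a few hundred to a few thousand rows per batch is a common starting point.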
I would avoid the temptation to drop indexes or foreign keys during the insert. It puts the table in an unusable state while your load is running, since nobody can query the table while the indexes are gone. Plus, it seems harmless enough, but what do you do when you try to re-enable the constraint and it fails because something you didn't expect to happen has happened? An RDBMS has integrity constraints for a reason, and disabling them even "for a little while" is dangerous.
You can obviously try to change the size of your batch to find the best size for your configuration, but I doubt that you will gain a factor 3.
You could also try to tune your database structure. You might get better performance with a single field as the primary key than with a composite PK. Depending on the level of integrity you need, you might save quite some time by deactivating integrity checks on your DB.
You might also change the database you are using. MySQL is supposed to be pretty good for high speed simple inserts ... and I know there is a fork of MySQL around that tries to cut features to get very high performance on highly concurrent access.
Good luck !
Try disabling indexes and re-enabling them after the insert. Also, wrap the whole process in a transaction.