I am considering moving from H2 to MemSQL - and I would greatly appreciate any comments:
My application has to run fast, concurrent queries against large tables of up to 300 million rows. To achieve this I have been using the H2 in-memory database.
H2 lets me create linked tables in the in-memory database that point to a MySQL database, which is very useful for loading data from MySQL into H2.
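For readers who have not used the feature, here is a minimal sketch of the H2 side of this setup: create a linked table over the MySQL table, then copy its rows into a local in-memory table. The class and method names, and any driver class, URL, credentials, or table names passed in, are placeholders for illustration; only the CREATE LINKED TABLE and CREATE TABLE ... AS SELECT statements are H2 syntax. The names are concatenated into SQL, so they are assumed to come from trusted configuration.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class LinkedTableSketch {
    // Builds H2's CREATE LINKED TABLE statement.
    // Every argument is a placeholder supplied by the caller (hypothetical values).
    static String linkedTableSql(String localName, String driver, String url,
                                 String user, String password, String remoteTable) {
        return String.format(
            "CREATE LINKED TABLE %s('%s', '%s', '%s', '%s', '%s')",
            localName, driver, url, user, password, remoteTable);
    }

    // Creates the link on an open H2 connection, then copies the remote rows
    // into a local table so later queries never leave the JVM.
    static void loadIntoMemory(Connection h2, String linkSql,
                               String linkName, String memName) throws SQLException {
        try (Statement st = h2.createStatement()) {
            st.execute(linkSql);
            st.execute("CREATE TABLE " + memName + " AS SELECT * FROM " + linkName);
        }
    }
}
```

The copy step is what makes queries fast afterwards: reads against the linked table itself still go to MySQL over JDBC.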
Can I create linked tables in MemSQL? I see no reference to this in the online MemSQL documentation.
Another challenge is that I will need to run multiple instances of the application across many servers, so having MemSQL running distributed across servers is very attractive rather than having to duplicate the H2 database in every JVM instance of the application across the servers. Running one instance of H2 via TCP to the other servers will be too slow.
The other advantage I see with MemSQL is that there is apparently no locking and the queries are compiled into native C++ which could speed them up.
Has anyone compared MemSQL performance with H2? I've found nothing online from real-world tests.
Mark L here from MemSQL. I wanted to address a few of your questions and offer additional help in getting the info/benchmarks you're asking about.
MemSQL does support linked tables via the JDBC connector, which in practice works just as it would with MySQL, so you'll have no issues getting that to work. Running MemSQL in distributed mode is indeed going to provide a big performance advantage, and you'll see significant improvements across the board in both throughput and latency. I haven't found a direct comparison between H2 and MemSQL; however, you can draw some indirect conclusions by looking at comparisons of MemSQL vs. MySQL, since we have the comparison data for H2 vs. MySQL from the website. From our field experience I would expect you to observe significant performance gains when using MemSQL.
In general a few observations: in the MemSQL distributed version you would have several advantages that you can't get from H2: reads never blocking writes thanks to lock-free indexes, full MVCC (H2 can only do this in single-box), and auto sharding of data being among the highlights. Out of all the features, auto-sharding is likely to be the most substantial for your use case - H2 can't auto-shard the data, and having that ability when distributed is obviously a big advantage even if speed were equal between the two. As I mentioned though it will be much faster with MemSQL distributed, as well as easier to manage vs. multiple instances of H2.
In any case we're more than happy to help you prove this out! Please feel free to reach out to me via email- larosa at memsql dot com.
Related
Application is hosted on multiple Virtual Machines and DB is on single server. All VMs are pointing to single Instance of DB.
In this architecture, I have a table with very few records, but it is accessed and updated very heavily by threads running on the VMs. This causes a performance bottleneck and sometimes record-level exceptions. Database-level locking does not seem to be the best option, as it introduces significant delays in request processing.
Please suggest if there is any other technique to solve this problem.
A few questions first!
Is your application using connection pooling? If not, please use it. Creating a JDBC connection is expensive!
Is your application read heavy/write heavy?
What kind of storage engine are you using for your MySQL tables, InnoDB or MyISAM? If your application is write-heavy, please use InnoDB tables: InnoDB uses row-level locking and will serve concurrent requests better.
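On the connection pooling point, a minimal sketch of the idea: create the expensive resources once, then let threads borrow and return them. SimplePool and its method names are hypothetical, and the pool is generic so the sketch runs without a database; for JDBC the factory would wrap a call such as DriverManager.getConnection(...).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class SimplePool<T> {
    private final BlockingQueue<T> idle;

    // Eagerly creates `size` resources, so the creation cost is paid once at startup.
    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Waits up to timeoutMillis for a free resource; returns null on timeout
    // or if the waiting thread is interrupted.
    T borrow(long timeoutMillis) {
        try {
            return idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }

    // Hands a borrowed resource back so another thread can reuse it.
    void giveBack(T resource) {
        idle.offer(resource);
    }
}
```

This leaves out validation of stale connections, eviction, and leak detection, which is why an established pooling library is usually the safer choice in a real application.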
One special case - if you are implementing queues on top of database tables, find a database that has a built-in queue operation and use that, or use a reliable messaging service. Building queues on top of databases is typically not efficient. See e.g. http://mikehadlow.blogspot.co.uk/2012/04/database-as-queue-anti-pattern.html
In general, running transactions on databases is slow because at the end of each transaction the database needs to be sure that enough has been saved out to disk that if the system died right now the changes made by the transaction would be safely preserved. If you don't need this you might find it faster to write a single non-database application that does what the database does but doesn't write anything out to disk, or still does database IO but does the minimum possible. Then instead of all of the VMs talking to the database directly they would all talk to this application.
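A minimal sketch of that last idea, assuming the hot table is essentially a set of keyed counters (an assumption; the question does not say what the table holds). HotTable and its methods are hypothetical names. The state lives only in memory, and each update is atomic without any database lock.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class HotTable {
    // One entry per row of the small, heavily updated table.
    private final ConcurrentMap<String, Long> rows = new ConcurrentHashMap<>();

    // Atomically adds delta to the row's value, creating the row if absent,
    // and returns the new value.
    long add(String key, long delta) {
        return rows.merge(key, delta, Long::sum);
    }

    // Returns the current value, or 0 if the row does not exist yet.
    long get(String key) {
        return rows.getOrDefault(key, 0L);
    }
}
```

The VMs would reach an object like this through whatever remote interface the application exposes. Nothing here survives a crash, which is exactly the trade-off described above.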
I am very new to MySQL but I have some experience with OOP in Java. When you have a "database" in memory (like when programming in Java), the speed of lookups depends on whether it is stored as a linked list, a hashmap, etc. A MySQL database is stored on the hard disk, as it is a file. What is the running time of a lookup in that case? Is there a way to get constant-time access to an item on the hard disk like you do when data is in memory? Does it depend on whether it's a hard disk or an SSD?
Thanks!
MySQL performance is a large and complicated topic. You can find many books on the topic and no answer here can hope to touch on all the important issues.
From the perspective of database design, the main steps (not necessarily in order) are to design a good schema, define the right kind of indexes to support the queries you expect to process, pick the right engine for each table, and design the queries themselves for best performance. The MySQL Query Analyzer is a tool for addressing these issues, as is the EXPLAIN statement.
In terms of server configuration, there's a huge amount of tuning that goes into improving performance. As the MySQL manual topic Estimating Query Performance describes, the fundamental unit of performance is a disk seek, and theoretically this is log N in the number of rows. However, both the underlying OS and the MySQL server may do quite a bit of caching, which can make the actual performance grow much more slowly in practice (not to mention making it faster in absolute terms). Follow the links from the above page, search for some server configuration guides on the web, and/or read a book or two to get some insight into how to use all the performance-related parameters that MySQL provides.
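To make the log N claim concrete, here is a small sketch of the seek estimate as I recall it from that manual section: log(row_count) / log(index_block_length / 3 * 2 / (index_length + data_pointer_length)) + 1. The class and method names are made up, and the block, key, and pointer sizes below are the manual's example values, not measurements from any real server.

```java
class SeekEstimate {
    // Approximate disk seeks for one B-tree index lookup, ignoring all caching.
    static double seeks(long rowCount, int indexBlockLength,
                        int indexLength, int dataPointerLength) {
        // Keys per index block, assuming blocks are about two-thirds full.
        double keysPerBlock =
            indexBlockLength / 3.0 * 2.0 / (indexLength + dataPointerLength);
        return Math.log(rowCount) / Math.log(keysPerBlock) + 1;
    }

    public static void main(String[] args) {
        // The manual's example: 500,000 rows, 1024-byte blocks,
        // 3-byte keys, 4-byte data pointers.
        System.out.printf("%.2f%n", seeks(500_000L, 1024, 3, 4));
        // The same index parameters at 300 million rows.
        System.out.printf("%.2f%n", seeks(300_000_000L, 1024, 3, 4));
    }
}
```

The first call comes out just under four seeks, which matches the figure the manual quotes for that example. Because the growth is logarithmic, multiplying the row count by several hundred adds only a seek or two, and the upper levels of the index are usually cached anyway.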
I have two test environments. My application performs much worse on the second one. I suspect this is because the first one uses a database that runs on better hardware (more CPU, faster connection). I would like to verify this somehow. Are there any tools that would help me with that? In case it helps, I am using Oracle 11g and my app uses Hibernate to connect to the database.
Mind you, I am not interested in profiling my schema. I would like to compare how fast the same database (meaning schema + data) is on two different machines.
If you are interested in why I suspect the database is the problem: I profiled my application during tests on both environments. In the second test environment, the methods responsible for talking to the database (namely org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeQuery()) use much more of the CPU time.
To answer the question: I believe you'd use JMeter to profile the two environments, and get comprehensive data out of the tests you run. VisualVM will also be helpful, but that depends on the kind of data you need, and how you need to present (or analyze) it.
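If a full JMeter plan is more than you need, a rough alternative is to time the same query directly against each database. The sketch below is only an assumption about how one might do that, not something those tools provide: QueryTimer is a hypothetical helper, and the Runnable would wrap one execution of the same prepared statement against one environment.

```java
import java.util.Arrays;

class QueryTimer {
    // Runs the task `warmup` times untimed, then `runs` times timed (runs >= 1),
    // and returns the middle sample in nanoseconds (upper median for even counts).
    static long medianNanos(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            task.run(); // let caches and the JIT settle before measuring
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }
}
```

Running it from the same client machine against both databases keeps the application out of the picture, and the median is less sensitive than the mean to a few slow outliers.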
But as for the general problem, is the data on the two databases exactly the same? Because if this is not the case, some possibilities are open - your transactions might be depending on data that is locked by another process (therefore, you'd need to look at your transactions and the transaction isolation they use).
I have been experimenting with Hibernate 3.6 and I am wondering about the capabilities of the provided infinispan distributed cache.
I have a requirement to have database replication between my main site and my disaster recovery site.
While it is possible to configure PostgreSQL to replicate, I am thinking that it might cause the same data to be sent twice from the main to the DR site. My application is expected to have a lot of updates, so that's something to keep in mind. Since this would be over a constrained WAN link, it feels like a lot of data would be sent and that just doesn't look like a good idea.
Can infinispan be configured to replicate between the two sites such that the underlying database doesn't need to ever be replicated itself?
If so, how? How bandwidth-intensive would it be?
PostgreSQL >= 9.0 has very good replication; use it. You shouldn't replicate the cache to the DR center if you have a lot of updates. You shouldn't replicate anything other than the data needed for recovery.
A DR center is a backup, not load balancing, etc.
"the Hibernate cache will keep lots of data in memory that won't be flushed"
It depends. You can flush data manually, or you can set the flush mode to auto, commit, etc.
This is a follow-up to a question I asked at "java disc based hashmap" about a disk-based hashmap.
The solution suggested works, but at a high CPU cost. I've tried using a few embedded databases, including HSQLDB and Derby, as well as an SQLite implementation in Java.
They all get the job done, quite slowly for most of the ones I've tried; the three I mentioned performed the best. I ran into one problem with all of them, however.
Starting and maintaining each embedded database required a lot of CPU time; the ones I haven't mentioned used up 100% of the CPU most of the time, according to Task Manager.
My new question, then: is there any simple disk-based storage that won't eat away at my CPU?
For the record, the SQLite solution didn't spike CPU usage; it was just crashing with a range of different errors. Apache Derby had the best performance; its CPU usage fluctuated but averaged about 80%.
I have no experience with embedded Java DBs other than Apache Derby and HSQLDB.
Some links:
Open Source Database Engines in Java
LinkedIn answers
Have you tried a NoSQL DB?
Update
Here is a list of NoSQL databases. I have no experience with them. But MongoDB and CouchDB are quite famous. And also Db4o looks interesting.
Try H2 Database