This is a follow-up to a question I asked at "java disc based hashmap" about a disk-based hashmap.
The suggested solution works, but at a high CPU cost. I've tried a few embedded databases, including HSQLDB and Derby, as well as an SQLite implementation in Java.
They all get the job done, though most of the ones I've tried are quite slow; the three I mentioned performed the best. I ran into one problem with all of them, however.
Starting and maintaining each embedded database required a lot of CPU time; the ones I haven't mentioned used 100% of the CPU most of the time, according to Task Manager.
My new question, then: is there any simple disk-based storage that won't eat away my CPU?
For the record, the SQLite solution didn't spike CPU usage; it just kept crashing with a range of different errors. Apache Derby had the best performance; its CPU usage fluctuated but averaged about 80%.
I have no experience with embedded Java databases other than Apache Derby and HSQLDB.
Some links:
Open Source Database Engines in Java
LinkedIn answers
Have you tried a NoSQL DB?
Update
Here is a list of NoSQL databases. I have no experience with them. But MongoDB and CouchDB are quite famous. And also Db4o looks interesting.
Try H2 Database
Related
I want to create a hashmap kind of thing for fast lookup between IDs and assigned names.
The number of entries will be a few hundred thousand, so I don't want to keep everything in memory. However, since performance matters in this process, I also don't want to run a database query for each ID.
So, what are my chances? How could I get fast lookups on large datasets?
A quick search found these:
Production-ready:
MapDB - I can personally recommend this. It's the successor to JDBM, if you found that one while googling.
Chronicle-Map
Voldemort
Probably not production-ready, but worth looking at:
https://github.com/aloksingh/disk-backed-map
https://github.com/reines/persistenthashmap
Well, a couple of solutions come to mind:
1) Go for Lucene -> store in files
2) Make views in the database -> store in database
So it's up to you which one you go for.
I had a similar requirement a few years back and avoided databases, thinking they would have high lookup times. Like you, I had a large set of values, so I could not use in-memory data structures. I decided to parse the file system sequentially. It was a bit slow, but I could do nothing about it.
Then I explored databases further and used one for my application, just to test. Initially it was slower than the file system. But after indexing the table and optimizing the database, it proved to be at least 10-15 times faster than the file system. I can't remember the exact performance results, but it took just 150-200 ms to read data from a large dataset (around 700 MB on the file system), whereas the same read from the file system took 3.5 seconds.
I used a DB2 database, and this guide for performance tuning of DB2.
Besides, once the DB is set up, you can reuse it for multiple applications over the network.
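If the worry is issuing a database query for every single ID, one common pattern is a small bounded cache in front of the indexed lookup, so only recently used IDs stay in memory. Below is a minimal sketch using only the JDK; the class name `CachedNameLookup`, the `slowLookup` function (standing in for the indexed DB query), and the cache size are all assumptions made for illustration, not something from my original setup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Bounded LRU cache in front of a slower ID -> name lookup (hypothetical names). */
class CachedNameLookup {
    private final Map<Long, String> cache;
    private final Function<Long, String> slowLookup; // e.g. an indexed DB query
    private int misses = 0;

    CachedNameLookup(final int maxEntries, Function<Long, String> slowLookup) {
        this.slowLookup = slowLookup;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<Long, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
                // Evict the least recently used entry once the cache is over capacity.
                return size() > maxEntries;
            }
        };
    }

    synchronized String nameFor(long id) {
        String name = cache.get(id);
        if (name == null) {
            misses++;                       // cache miss: fall through to the slow store
            name = slowLookup.apply(id);
            if (name != null) {
                cache.put(id, name);
            }
        }
        return name;
    }

    /** Number of lookups that had to go to the slow store. */
    synchronized int misses() {
        return misses;
    }
}
```

Usage would look like `new CachedNameLookup(50_000, id -> queryNameFromDb(id))`, where `queryNameFromDb` is a placeholder for whatever indexed query you already have. How much this helps depends entirely on how skewed your ID access pattern is.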
If you're looking for a fast solution, the answer is an in-memory database:
Redis, Memcached, Hazelcast, VoltDB etc.
I am very new to MySQL, but I have some experience with OOP in Java. When you have a "database" in memory (as when programming in Java), the speed of lookups depends on whether it is stored as a linked list, a hashmap, etc. A MySQL database is stored on the hard disk, as it is a file. What is the running time of a lookup in that case? Is there a way to get constant-time access to an item on the hard disk, like you do when the data is in memory? Does it depend on whether it's a hard disk or an SSD?
Thanks!
MySQL performance is a large and complicated topic. You can find many books on the topic and no answer here can hope to touch on all the important issues.
From the perspective of database design, the main steps (not necessarily in order) are to design a good schema, define the right kind of indexes to support the queries you expect to process, pick the right engine for each table, and design the queries themselves for best performance. The MySQL Query Analyzer is a tool for addressing these issues, as is the EXPLAIN statement.
In terms of server configuration, there's a huge amount of tuning that goes into improving performance. As the MySQL manual topic Estimating Query Performance describes, the fundamental unit of performance is a disk seek, and theoretically the number of seeks grows as log N in the number of rows. However, both the underlying OS and the MySQL server may do quite a bit of caching, which can make the actual cost grow much more slowly in practice (not to mention making lookups faster in absolute terms). Follow the links from the above page, search for some server configuration guides on the web, and/or read a book or two to get some insight into how to use all the performance-related parameters that MySQL provides.
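To make the log N claim concrete, here is the estimate from that manual page written out as a small Java function. The formula and the example figures (a 1,024-byte index block, a 4-byte data pointer, a 3-byte MEDIUMINT key, index blocks about two-thirds full) are the manual's; the class and method names are my own, and this is only a back-of-the-envelope sketch for B-tree indexes, not a statement about what your server will actually do.

```java
/** Rough B-tree seek estimate, following the MySQL manual's "Estimating Query Performance". */
class SeekEstimate {

    /**
     * seeks = log(rows) / log(indexBlockLength / 3 * 2 / (keyLength + dataPointerLength)) + 1
     * The "/ 3 * 2" factor assumes index blocks are about two-thirds full.
     */
    static double estimateSeeks(long rows, int indexBlockLength, int keyLength, int dataPointerLength) {
        double keysPerBlock = indexBlockLength / 3.0 * 2.0 / (keyLength + dataPointerLength);
        return Math.log(rows) / Math.log(keysPerBlock) + 1;
    }

    public static void main(String[] args) {
        // The manual's example: 500,000 rows, 1,024-byte blocks, 3-byte key, 4-byte pointer.
        System.out.printf("estimated seeks: %.2f%n", estimateSeeks(500_000, 1024, 3, 4));
    }
}
```

For the manual's example this comes out a little under 4 seeks, and growing the table tenfold adds only about half a seek. So lookups are not constant time, but the growth is slow, and caching of the upper index levels usually hides most of it.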
For reasons that are beside the point, a company has bought an Exadata Eighth Rack. Some of the managers thought that this would improve the performance of current applications. The problem is that hardly any application does intensive database work (yes, this is a good moment for looking at facepalm animated gifs). So, at the moment, the migrations have shown little benefit.
The question is obvious. Most of the applications are written in Java, and some of them make intensive use of Solr and Cassandra. As far as I know, Exadata is intended for storing data, while Exalogic can hold applications too. Anyway, I'm wondering if there is some way of taking advantage of this infrastructure.
Replace Solr with Oracle Text.
Before I get down-voted: normally I would not recommend replacing existing code built with a popular, open-source program with a seldom-used, proprietary product. But if you want to use a lot of space and CPU on your database servers then Oracle Text can definitely help.
As more generic advice, the primary role of a database is not to store data. A file system can do that. Databases are built to join data. If an application is reading a large amount of data and doing ad hoc joins, those are the jobs you want to move to the database.
Exadata -> Oracle Database extreme performance.
Exalogic -> Fusion Middleware extreme performance. (Java goes here)
Your best move will be refactoring the application to put as much workload as possible on the DB (PL/SQL).
Another thing I can think of is a more radical approach. I have never really tried it myself (yes, I work with Exadatas too), but maybe you can give it a shot and let us know here.
What about using all those GBs of RAM on the Exadata to tune your Java application's latency? With that enormous amount of memory you could set a really large heap and avoid garbage-collection-induced latency. Please do let me know what comes out if you actually try this.
Which protocol do the Java applications use to connect to Oracle?
If it's not IPC (inter-process communication, aka BEQUEATH, aka shared memory) but rather TCP, and you have many fast, tiny roundtrips, then this would be your low-hanging fruit: eliminate the network stack.
Edit: I just realized that Exadata cannot run Java applications by default (only ODA does), so it wouldn't be possible to make use of IPC. However, perhaps you're able to test the impact of IPC in one of your applications using the former infrastructure?
Exadata cannot host any customer application. You cannot install anything there. You can only host an Oracle database on Exadata.
This means you can use database features like DBFS (a file system over the Oracle database) and the Java option (storing and executing Java code in the database). But you need to check which options you are licensed for. Also, the internal JVM is used, which cannot be customized or upgraded.
Exadata is a database appliance designed to work with large amounts of differently accessed data in a very effective and manageable way.
I am considering moving from H2 to MemSQL - and I would greatly appreciate any comments:
My application has to run very fast concurrent queries against large tables of up to 300 million rows. To achieve this I have been using the H2 in-memory database.
H2 allows me to create linked tables in the in-memory database that point to a MySQL database. This is very useful for loading data from MySQL into H2.
Can I create linked tables in MemSQL? I see no reference to this in the online MemSQL documentation.
Another challenge is that I will need to run multiple instances of the application across many servers, so having MemSQL running distributed across servers is very attractive rather than having to duplicate the H2 database in every JVM instance of the application across the servers. Running one instance of H2 via TCP to the other servers will be too slow.
The other advantage I see with MemSQL is that there is apparently no locking and the queries are compiled into native C++ which could speed them up.
Has anyone compared MemSQL performance with H2? I've found nothing online from real-world tests.
Mark L here from MemSQL. I wanted to address a few of your questions and offer additional help in getting the info/benchmarks you're asking about.
MemSQL does support linked tables via the JDBC connector, which in practice works just as it would with MySQL, so you'll have no issues getting that to work. Running MemSQL in distributed mode is indeed going to provide a big performance advantage, and you'll see some significant improvements across the board in both throughput and latency. I haven't found a direct comparison between H2 and MemSQL; however, you can draw some indirect conclusions by looking at comparisons of MemSQL vs. MySQL, since we have the comparison data for H2 vs. MySQL from the website. From our field experience I would expect you to observe significant performance gains when using MemSQL.
In general a few observations: in the MemSQL distributed version you would have several advantages that you can't get from H2: reads never blocking writes thanks to lock-free indexes, full MVCC (H2 can only do this in single-box), and auto sharding of data being among the highlights. Out of all the features, auto-sharding is likely to be the most substantial for your use case - H2 can't auto-shard the data, and having that ability when distributed is obviously a big advantage even if speed were equal between the two. As I mentioned though it will be much faster with MemSQL distributed, as well as easier to manage vs. multiple instances of H2.
In any case we're more than happy to help you prove this out! Please feel free to reach out to me via email- larosa at memsql dot com.
I am currently in need of a high performance java storage mechanism.
This means:
1) I have 10,000+ objects with one-to-many relationships.
2) The objects are updated every 5 seconds, and the most recent updates must persist in the case of system failure.
3) The objects need to be queryable in a reasonable time (1-5 seconds). (E.g.: give me all of the objects with this timestamp, or give me all of the objects within these location boundaries.)
4) The objects need to be available across various Glassfish installs.
Currently:
I have been using JMS to distribute the objects, Hibernate as an ORM, and HSQLDB to provide the needed recoverability.
I am not exactly happy with the performance, especially the JMS part of it.
After doing some Stack Overflow research, I am wondering if Terracotta would be a better solution. Keep in mind that I have no experience with what Terracotta gives me.
I would use Terracotta to distribute objects around the system, and something else would need to provide the ability to "query" for attributes of those objects.
Does this sound reasonable? Would it meet these performance constraints? What other solutions should I consider?
I know it's not what you asked, but you may want to start by switching from HSQLDB to H2. H2 is a relatively new, pure Java DB. It is written by the same guy who wrote HSQLDB, and he claims the performance is much better. I've been using it for some time now and I'm very happy with it. It should be a very quick transition (add a JAR, change the connection string, create the database), so it's worth a shot.
In general, I believe in trying to get the most of what I have before rewriting the application in a different architecture. Try profiling it to identify the bottleneck first.
First, Lucene isn't your friend here (it's effectively read-only).
Terracotta is for scaling at the logic layer! Your problem doesn't seem to be related to the processing logic. It's more about storage and communication.
Identify your bottleneck! Benchmark the Storage/Logic/JMS processing time and overhead!
Kill JMS issues with a good JMS provider (e.g. ActiveMQ) and a good, tuned configuration.
Maybe a distributed key=>value store is your friend. Try Project Voldemort!
If you'd like to stay with Hibernate and HSQLDB, check out the Hibernate second-level cache and connection pooling (c3p0, container-driven...)!
Several Terracotta users have built systems like this in the past, so I can tell you by proof of existence that it can be done. :)
Compass does have support for clustering with Terracotta so that might help you. I suspect you might get further faster by just being careful with how you create your clustered data structures.
Regarding your requirements and Terracotta:
1) 10k objects is quite small from a Terracotta perspective
2) 5 sec update rate doesn't seem like an issue. Might depend how many nodes there are and whether there is any natural partitioning you can take advantage of. All updates will be persistent.
3) 1-5 second query time seems quite easy. Building your own well-organized data structures for lookup is the tricky part. Obviously you want to avoid scanning all the data.
4) Terracotta currently supports Glassfish v1 and v2.
If you post on the Terracotta forums, you could probably get more Terracotta eyeballs on the problem.
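As an illustration of point 3, a "well-organized data structure" for the timestamp query could be as simple as a sorted map from timestamp to the objects recorded at that time, so a range query only touches the matching keys instead of scanning all 10k objects. This is a plain-JDK sketch, not a Terracotta API; the class name `TimestampIndex` and its methods are made up for the example, and whether a given structure clusters well under Terracotta is something you would need to verify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Sorted index from timestamp to items, for range queries without a full scan. */
class TimestampIndex<T> {
    private final NavigableMap<Long, List<T>> byTimestamp = new TreeMap<>();

    synchronized void put(long timestamp, T item) {
        // Several items may share a timestamp, so each key holds a small bucket.
        byTimestamp.computeIfAbsent(timestamp, k -> new ArrayList<>()).add(item);
    }

    /** Returns all items with from <= timestamp <= to, in timestamp order (requires from <= to). */
    synchronized List<T> between(long from, long to) {
        List<T> result = new ArrayList<>();
        // subMap is a view over the matching key range only; the rest of the map is not visited.
        for (List<T> bucket : byTimestamp.subMap(from, true, to, true).values()) {
            result.addAll(bucket);
        }
        return result;
    }
}
```

The location-boundary query needs a different index (a grid or some other spatial structure), but the idea is the same: maintain the index on every update so that queries never have to walk all the data.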
I am currently writing the client for a very (very) fast distributed key/value hash DB that provides set + list semantics. The DB is written in C99 and requires GCC, and right now I'm battling with good old Java network IO to break my current barrier of 30,000 gets/sets per second. I hope to be done within the week. Drop me a line through my account and I'll get back to you when it's show time.
With such a high update rate, Lucene is almost definitely not what you're looking for, since there is no way to update a document once it's indexed. You'd have to keep all the object versions in the index and select the one with the latest time stamp, which will kill your performance.
I'm no DB expert, but I think you should look into one of the distributed DB solutions that have been in the news lately (CouchDB, Cassandra).
Maybe you should take a look at Prevayler.
Your objects are always in memory.
The "changes" to your objects are persisted.
From time to time you are able to take a snapshot: every object is persisted.
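To show what that pattern amounts to, here is a rough sketch in plain Java. It is not Prevayler's API, just the same idea reduced to a string map: every change is appended to a journal file before it is applied in memory, a snapshot writes out the whole state, and startup loads the snapshot and replays the journal. The class name, file names, and tab-separated line format are all assumptions for the example, and it assumes keys contain no tabs and neither keys nor values contain newlines.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory map whose changes are journaled to disk and periodically snapshotted. */
class JournaledMap {
    private final Map<String, String> state = new TreeMap<>();
    private final Path snapshot;
    private final Path journal;

    JournaledMap(Path dir) {
        snapshot = dir.resolve("snapshot.txt");
        journal = dir.resolve("journal.txt");
        try {
            // Recovery: load the last snapshot, then replay the changes made after it.
            if (Files.exists(snapshot)) replay(Files.readAllLines(snapshot, StandardCharsets.UTF_8));
            if (Files.exists(journal)) replay(Files.readAllLines(journal, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void replay(List<String> lines) {
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab > 0) state.put(line.substring(0, tab), line.substring(tab + 1));
        }
    }

    synchronized void put(String key, String value) {
        try {
            // Persist the change first, then apply it in memory.
            // Note: no fsync here; a real implementation would force the write to disk.
            Files.write(journal, (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        state.put(key, value);
    }

    synchronized String get(String key) {
        return state.get(key);
    }

    synchronized void takeSnapshot() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : state.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        try {
            // Write to a temp file and move it into place so a crash never leaves half a snapshot.
            Path tmp = snapshot.resolveSibling("snapshot.tmp");
            Files.write(tmp, sb.toString().getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, snapshot, StandardCopyOption.REPLACE_EXISTING);
            // The snapshot now covers everything in the journal, so the journal can be dropped.
            Files.deleteIfExists(journal);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reads never touch the disk, and each write costs one small append, which is why this style copes well with an update every few seconds. The price is that the whole data set has to fit in the heap.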
You don't say what vendor you are using for JMS, but it wouldn't surprise me if you have a bottleneck there. I couldn't get more than 100 messages a second from ActiveMQ, and whatever we tried in terms of configuring acknowledgment, queue size, etc., we were unable to load the CPU beyond a few percent.
The solution was to batch many queries into one JMS message. We had a simple class that either sent a batch of messages when it got to 200 queries or reached a timeout (we used 20ms), which gave us a dramatic increase in message throughput.
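A class along those lines might look roughly like the sketch below. This is a reconstruction of the idea rather than our actual code: the name `Batcher`, the `sender` callback (where the single JMS message would be built and sent), and the injected clock are assumptions for illustration. `flushIfExpired()` is meant to be called periodically, for example from a `ScheduledExecutorService`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Collects items and hands them to the sender in batches, by size or by age. */
class Batcher<T> {
    private final int maxSize;
    private final long maxDelayMillis;
    private final Consumer<List<T>> sender;   // e.g. wraps the batch in one JMS message
    private final LongSupplier clock;         // injected so the timeout can be tested
    private List<T> pending = new ArrayList<>();
    private long oldestMillis;

    Batcher(int maxSize, long maxDelayMillis, Consumer<List<T>> sender, LongSupplier clock) {
        this.maxSize = maxSize;
        this.maxDelayMillis = maxDelayMillis;
        this.sender = sender;
        this.clock = clock;
    }

    synchronized void add(T item) {
        if (pending.isEmpty()) {
            oldestMillis = clock.getAsLong();  // remember when this batch started
        }
        pending.add(item);
        if (pending.size() >= maxSize) {
            flush();                           // size limit reached
        }
    }

    /** Sends a partial batch if its oldest item has waited at least maxDelayMillis. */
    synchronized void flushIfExpired() {
        if (!pending.isEmpty() && clock.getAsLong() - oldestMillis >= maxDelayMillis) {
            flush();
        }
    }

    private void flush() {
        List<T> batch = pending;
        pending = new ArrayList<>();
        sender.accept(batch);
    }
}
```

With the numbers above this would be `new Batcher<>(200, 20, batch -> sendAsOneMessage(batch), System::currentTimeMillis)`, where `sendAsOneMessage` is a placeholder for your JMS producer code. The timeout bounds the extra latency a lone query can pick up while waiting for a batch to fill.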
Guaranteed messaging is going to be much slower than volatile messaging. Given that every object is updated every few seconds, you might consider batching your updates (into, say, 500 changes, or by time, say 1-10 ms' worth), sending them over volatile messaging, and batching your transactions. In this case you are more likely to be limited by bandwidth. When tuning your use case you may find that smaller batch sizes also work efficiently. If bandwidth is critical (say you have a 10 MB connection or slower), you could use compression over JMS.
You can achieve much higher performance with a custom solution (which also might be simpler), e.g. Hazelcast and JGroups are free (you can add one or more nodes which do the database synchronization so your main app doesn't slow down). There are commercial products which handle on the order of half a million durable messages/sec.
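For the compression part, a minimal sketch with the JDK's GZIP streams is shown below. It assumes the batch has already been serialized to a string (the format is up to you) and that the resulting bytes would travel as the body of a bytes-style JMS message; the class and method names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** GZIP helpers for shrinking a serialized batch before it goes on the wire. */
class BatchCompression {

    static byte[] compress(String payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so it must happen before toByteArray().
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(payload.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static String decompress(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Compression pays off on batches because repeated field names and similar values compress well; on single small messages the GZIP header overhead can make the payload larger, so measure with your own data.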
Terracotta + jofti = queryable persistent clustered data structures
Search Google for "terracotta querymap" or visit tusharkhairnar.blogspot.com for the querymap blog post.
You may want to integrate timasync as well to update your database. The database is your system of record; use Terracotta as a caching and database-offloading mechanism. You can even batch the async updates to make it faster, so that the DB still contains fairly recent data.
Tushar
tusharkhairnar.blogspot.com