Terracotta + Compass = Hibernate + HSQLDB + JMS? - java

I am currently in need of a high performance java storage mechanism.
This means:
1) I have 10,000+ objects with 1-to-many relationships.
2) The objects are updated every 5 seconds, with the most recent updates persisted in the case of system failure.
3) The objects need to be queryable in a reasonable time (1-5 seconds). (i.e., give me all of the objects with this timestamp, or give me all of the objects within these location boundaries.)
4) The objects need to be available across various Glassfish installs.
Currently:
I have been using JMS to distribute the objects, Hibernate as an ORM, and HSQLDB to provide the needed recoverability.
I am not exactly happy with the performance. Especially the JMS part of this.
After doing some Stack Overflow research, I am wondering if this would be a better solution. Keep in mind that I have no experience with what Terracotta gives me.
I would use Terracotta to distribute objects around the system, and something else would need to provide the ability to "query" attributes of those objects.
Does this sound reasonable? Would it meet these performance constraints? What other solutions should I consider?

I know it's not what you asked, but you may want to start by switching from HSQLDB to H2. H2 is a relatively new, pure Java DB. It is written by the same guy who wrote HSQLDB and he claims the performance is much better. I've been using it for some time now and I'm very happy with it. It should be a very quick transition (add a jar, change the connection string, create the database), so it's worth a shot.
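If it helps, a minimal sketch of what that transition looks like in JDBC terms (paths and credentials here are made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class H2Switch {
        public static void main(String[] args) throws Exception {
            // Previously, with HSQLDB:
            //   Class.forName("org.hsqldb.jdbcDriver");
            //   DriverManager.getConnection("jdbc:hsqldb:file:data/appdb", "sa", "");

            // With H2: add the h2 jar to the classpath and swap the driver and URL;
            // the embedded database file is created on first connect.
            Class.forName("org.h2.Driver");
            try (Connection c = DriverManager.getConnection("jdbc:h2:./data/appdb", "sa", "")) {
                System.out.println("Connected to " + c.getMetaData().getDatabaseProductName());
            }
        }
    }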
In general, I believe in trying to get the most of what I have before rewriting the application in a different architecture. Try profiling it to identify the bottleneck first.

First off, Lucene isn't your friend here (its index is effectively read-only once documents are in it).
Terracotta is for scaling at the logic layer. Your problem doesn't seem to be in the processing logic; it's more around storage and communication.
Identify your bottleneck! Benchmark the Storage/Logic/JMS processing time and overhead!
Kill JMS issues with a good JMS framework (e.g. ActiveMQ) and a good, tuned configuration.
Maybe a distributed key=>value store is your friend. Try Project Voldemort!
If you want to stay with Hibernate and HSQLDB, check out the Hibernate second-level cache and connection pooling (c3p0, container-managed, ...)!
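For the second-level cache and c3p0 suggestion, a rough sketch of the relevant settings, assuming Hibernate 3.x with Ehcache as the cache provider (the property names are the standard Hibernate/c3p0 ones, but check them against your version; entities still need a cache concurrency strategy in their mappings):

    import org.hibernate.SessionFactory;
    import org.hibernate.cfg.Configuration;

    public class TunedSessionFactory {
        public static SessionFactory build() {
            Configuration cfg = new Configuration().configure(); // reads hibernate.cfg.xml
            // Second-level and query cache (Hibernate 3.2-style Ehcache provider).
            cfg.setProperty("hibernate.cache.use_second_level_cache", "true");
            cfg.setProperty("hibernate.cache.use_query_cache", "true");
            cfg.setProperty("hibernate.cache.provider_class", "org.hibernate.cache.EhCacheProvider");
            // c3p0 connection pool instead of Hibernate's built-in pool.
            cfg.setProperty("hibernate.c3p0.min_size", "5");
            cfg.setProperty("hibernate.c3p0.max_size", "20");
            cfg.setProperty("hibernate.c3p0.timeout", "300");
            cfg.setProperty("hibernate.c3p0.max_statements", "50");
            return cfg.buildSessionFactory();
        }
    }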

Several Terracotta users have built systems like this in the past, so I can tell you, by proof of existence, that it can be done. :)
Compass does have support for clustering with Terracotta so that might help you. I suspect you might get further faster by just being careful with how you create your clustered data structures.
Regarding your requirements and Terracotta:
1) 10k objects is quite small from a Terracotta perspective
2) 5 sec update rate doesn't seem like an issue. Might depend how many nodes there are and whether there is any natural partitioning you can take advantage of. All updates will be persistent.
3) 1-5 second query time seems quite easy. Building your own well-organized data structures for lookup is the tricky part. Obviously you want to avoid scanning all the data.
4) Terracotta currently supports Glassfish v1 and v2.
If you post on the Terracotta forums, you could probably get more Terracotta eyeballs on the problem.

I am currently working on writing the client for a very (very) fast key/value distributed hash DB that provides set + list semantics. The DB is C99 and requires GCC, and right now I'm battling with good old Java network IO to break my current 30,000 gets/sets per second barrier. I hope to be done within the week. Drop me a line through my account and I'll get back to you when it's show time.

With such a high update rate, Lucene is almost definitely not what you're looking for, since there is no way to update a document once it's indexed. You'd have to keep all the object versions in the index and select the one with the latest time stamp, which will kill your performance.
I'm no DB expert, but I think you should look into any one of the distributed DB solutions that's been on the news lately. (CouchDB, Cassandra)

Maybe you should take a look at Prevayler.
Your objects are always in memory.
The "changes" to your objects are persisted.
From time to time you are able to take a snapshot: every object is persisted.
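To make the idea concrete, a minimal sketch against the classic Prevayler 2.x API (the store, transaction, and directory names are made up; every executed transaction is journaled, and takeSnapshot() writes out the whole object graph):

    import java.io.Serializable;
    import java.util.Date;
    import java.util.HashMap;
    import java.util.Map;

    import org.prevayler.Prevayler;
    import org.prevayler.PrevaylerFactory;
    import org.prevayler.Transaction;

    // The prevalent system: all objects live in memory.
    class ObjectStore implements Serializable {
        final Map<Long, String> objects = new HashMap<Long, String>();
    }

    // Each change is a serializable transaction; Prevayler journals it to disk.
    class UpdateObject implements Transaction {
        private final long id;
        private final String state;
        UpdateObject(long id, String state) { this.id = id; this.state = state; }
        public void executeOn(Object prevalentSystem, Date executionTime) {
            ((ObjectStore) prevalentSystem).objects.put(id, state);
        }
    }

    public class PrevaylerExample {
        public static void main(String[] args) throws Exception {
            Prevayler prevayler = PrevaylerFactory.createPrevayler(new ObjectStore(), "prevalence-base");
            prevayler.execute(new UpdateObject(1L, "state-A")); // the change is persisted
            prevayler.takeSnapshot();                           // every object is persisted
        }
    }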

You don't say what vendor you are using for JMS, but it wouldn't surprise me if you have a bottleneck there. I couldn't get more than 100 messages a second from ActiveMQ, and whatever I tried in terms of configuration of acknowledgement, queue size, etc., we were unable to push the CPU beyond a few percent.
The solution was to batch many queries into one JMS message. We had a simple class that either sent a batch of messages when it got to 200 queries or reached a timeout (we used 20ms), which gave us a dramatic increase in message throughput.
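A sketch of what such a batching class can look like, assuming plain javax.jms (the class name and thresholds are illustrative; a background timer to flush a half-full batch that goes idle is omitted):

    import java.io.Serializable;
    import java.util.ArrayList;
    import javax.jms.JMSException;
    import javax.jms.MessageProducer;
    import javax.jms.Session;

    /** Buffers updates and ships them as one ObjectMessage per batch. */
    public class JmsBatcher {
        private static final int MAX_BATCH = 200;   // flush after this many updates...
        private static final long MAX_WAIT_MS = 20; // ...or after this much time
        private final Session session;
        private final MessageProducer producer;
        private final ArrayList<Serializable> buffer = new ArrayList<Serializable>();
        private long firstAddedAt;

        public JmsBatcher(Session session, MessageProducer producer) {
            this.session = session;
            this.producer = producer;
        }

        public synchronized void add(Serializable update) throws JMSException {
            if (buffer.isEmpty()) {
                firstAddedAt = System.currentTimeMillis();
            }
            buffer.add(update);
            if (buffer.size() >= MAX_BATCH
                    || System.currentTimeMillis() - firstAddedAt >= MAX_WAIT_MS) {
                flush();
            }
        }

        public synchronized void flush() throws JMSException {
            if (buffer.isEmpty()) {
                return;
            }
            // One ObjectMessage carrying the whole batch instead of one message per update.
            producer.send(session.createObjectMessage(new ArrayList<Serializable>(buffer)));
            buffer.clear();
        }
    }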

Guaranteed messaging is going to be much slower than volatile messaging. Given every object is updated every few seconds, you might consider batching your updates (into, say, 500 changes, or by time, say 1-10 ms' worth), sending them over volatile messaging, and batching your transactions. In this case you are more likely to be limited by bandwidth. By tuning your use case you may find smaller batch sizes also work efficiently. If bandwidth is critical (say you have a 10 MB connection or slower), you could use compression over JMS.
You can achieve much higher performance with a custom solution (which might also be simpler), e.g. Hazelcast and JGroups are free (you can add a node which does the database synchronization so your main app doesn't slow down). There are commercial products which handle on the order of half a million durable messages per second.

Terracotta + jofti = queryable persistent clustered data structures
Search Google for "terracotta querymap" or visit tusharkhairnar.blogspot.com for the querymap blog post.
You may want to integrate tim-async as well to update your database. The database is your system of record; use Terracotta as a caching and database-offloading mechanism. You can even batch async updates to make it faster, so that the database contains fairly recent data.
Tushar
tusharkhairnar.blogspot.com

Related

How to persist a key-value map for fast lookup?

I want to create a hashmap kind of thing for fast lookup between IDs and assigned names.
The number of entries will be a few hundred thousand, so I don't want to keep everything in memory. However, since performance matters, I don't want to make a database query for each ID either.
So, what are my chances? How could I get fast lookups on large datasets?
A quick search found these:
Production-ready:
MapDB - I can personally recommend this. It's the successor to JDBM, if you found that one while googling. (There's a short usage sketch after this list.)
Chronicle-Map
Voldemort
Probably not production-ready, but worth looking at:
https://github.com/aloksingh/disk-backed-map
https://github.com/reines/persistenthashmap
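For the MapDB option above, a minimal sketch assuming the MapDB 3.x API (the file name and map name are arbitrary):

    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.HTreeMap;
    import org.mapdb.Serializer;

    public class IdNameLookup {
        public static void main(String[] args) {
            // A file-backed map: the data set lives on disk, not in the Java heap.
            DB db = DBMaker.fileDB("id-to-name.db").make();
            HTreeMap<Long, String> idToName = db
                    .hashMap("idToName", Serializer.LONG, Serializer.STRING)
                    .createOrOpen();

            idToName.put(123456L, "some-assigned-name");
            System.out.println(idToName.get(123456L));

            db.close(); // flushes and releases the file
        }
    }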
Well, there are a couple of solutions that come to mind:
1) Go for Lucene -> store in files
2) Make views in the database -> store in the database
So it's up to you which one you go for.
I had a similar requirement a few years back and was avoiding databases, thinking they would have high lookup times. Similar to you, I had a large set of values, so I could not use in-memory data structures, and I decided to sequentially parse the filesystem. It was a bit slow, but I could do nothing about it.
Then I explored DBs further and used one for my application, just to test. Initially it was slower than the filesystem, but after indexing the table and optimizing the database, it proved to be at least 10-15 times faster. I can't remember the exact performance results, but it took just 150-200 ms to read data from a large dataset (around 700 MB of data on the filesystem), whereas the same read from the filesystem took 3.5 seconds.
I used a DB2 database, and this guide for performance tuning of DB2.
Besides, once the DB is set up, you can reuse it for multiple applications over the network.
If you are looking for a fast solution, the answer is an in-memory database:
Redis, Memcached, Hazelcast, VoltDB, etc.

MySQL data lookup?

I am very new to MySQL but I have some experience with OOP in Java. When you have a "database" in memory (like when programming in Java), the speed of lookups depends on whether it is stored in the form of a linked list, hashmap, etc. A MySQL database is stored on the hard disk, as it is a file. What is the running time of a lookup in that case? Is there a way to get constant-time access to an item on the hard disk like you do when data is in memory? Does it depend on whether it's a hard disk or an SSD?
Thanks!
MySQL performance is a large and complicated topic. You can find many books on the topic and no answer here can hope to touch on all the important issues.
From the perspective of data base design, the main steps (not necessarily in order) are to design a good schema, define the right kind of indexes to support the queries you expect to process, pick the right engine for each table, and design the queries themselves for best performance. The MySQL Query Analyzer is a tool for addressing these issues, as is the EXPLAIN statement.
In terms of server configuration, there's a huge amount of tuning that goes into improving performance. As the MySQL manual topic Estimating Query Performance describes, the fundamental unit of performance is a disk seek, and theoretically this is log N in the number of rows. However, both the underlying OS and the MySQL server may do quite a bit of caching, which can make the actual performance grow much more slowly in practice (not to mention making it faster in absolute terms). Follow the links from the above page, search for some server configuration guides on the web, and/or read a book or two to get some insight into how to use all the performance-related parameters that MySQL provides.

Need good design pattern for caching database query result set

I'm part of a team architecting a Java web application wherein users will search for results in a relational database and then view them in tabular fashion in a browser. Users will then also have the option to subsequently view the same result set (or a subset of those results) in a separate browser window, using for example a charting tool. In other words, we need to give the user the ability to visualize the same result set records later (up to a limit of 24 hours).
Since searches on the system will be resource-intensive and just out of good common sense, we would like a clean way to cache each result set so that it can be pulled later from memory (RAM or disk). We are looking for a good approach to doing this caching, we believe others have done this before, and we prefer to use a best-practice or framework rather than building such a thing from scratch. The server will have plenty of RAM but since there could be hundreds of people using the system, we may need an approach that stores to RAM first but then can also cache to hard disk if RAM is getting full.
I believe it makes most sense to persist as Java objects but I'm open to better advice. We would like a vendor-neutral approach, so that if the database team chooses to switch vendors later we aren't stuck with a proprietary solution. Thanks.
I think what you might be looking for is Terracotta Ehcache. This does everything you mentioned and more. It is a free product that can be used to cache things in memory, overflow to disk, specify max cache sizes by either MB or # of items, and expire based on last access time or entry time.
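A rough sketch of the programmatic setup, assuming the classic net.sf.ehcache (Ehcache 2.x) API; the cache name, size limit, and 24-hour TTL are illustrative:

    import java.util.Arrays;
    import java.util.List;

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    public class SearchResultCache {
        public static void main(String[] args) {
            CacheManager manager = CacheManager.create();
            // name, maxElementsInMemory, overflowToDisk, eternal, timeToLiveSeconds, timeToIdleSeconds
            Cache cache = new Cache("searchResults", 10000, true, false, 24 * 60 * 60, 0);
            manager.addCache(cache);

            List<String> results = Arrays.asList("row1", "row2"); // stand-in for a real result set
            cache.put(new Element("user42:query-hash", results));

            Element hit = cache.get("user42:query-hash");
            if (hit != null) {
                System.out.println(hit.getObjectValue());
            }
            manager.shutdown();
        }
    }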
I've seen http://www.jboss.org/infinispan/ used to do exactly that. It can cache to memory, disk and or database. I wouldn't say I love it (the configuration is not super easy and documentation is somewhat lacking) but it most certainly works and is actively maintained.
Being vendor-neutral is all about writing an abstraction layer that is native to your application and then plugging the cache service you would like to use in behind that layer, while keeping the interface your main code sees the same.
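One sketch of what that abstraction layer might look like (the interface and method names are just illustrative); application code depends only on this interface, and an Ehcache- or Infinispan-backed implementation plugs in behind it:

    import java.io.Serializable;

    /** The only cache API the application sees; the backing store is pluggable. */
    public interface ResultCache {
        void put(String key, Serializable results, long ttlSeconds);

        <T extends Serializable> T get(String key, Class<T> type);

        void evict(String key);
    }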
There are plenty of ways to cache. Look into the various NoSQL solutions.
Redis
Memcached
Most of the time you will serialize your object and persist it to your cache layer.

Cache the database content in memory to increase performance

I have a message driven bean running on Glassfish. It receives hundreds of messages per second. After receiving a message, it needs to read the values in the JDBC database and process.
However, the values in the database will only be updated once a day or less, so what the MDB reads is consistent most of the time. Is there a good way to cache the content in memory in order to increase performance?
Update: is it possible to configure an in-memory database JDBC connection pool in Glassfish for the MDB?
You may get inspired by the identity map pattern and implement your own cache mechanism (with your own expiration policy), if you think that using a third party solution (like memcached) would be overkill.
The most obvious answer is to define the table as a MEMORY table type. If the underlying hardware is not prone to crashing and the OS is stable and there's a UPS attached, you might want to think about this. It also depends on the consequences of losing a number of transactions since the last backup whenever this fails. But performance-wise, this is lightning fast. More information can be found here for MySQL. (YMMV)
I've implemented multiple tables that way, and it worked great for me.
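A hedged JDBC sketch of that approach for MySQL (table, column, and connection details are made up; note that a MEMORY table's rows are lost on a server restart, so it has to be repopulated from a durable table at startup):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class MemoryTableSetup {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/appdb", "app", "secret");
                 Statement stmt = conn.createStatement()) {
                stmt.executeUpdate(
                      "CREATE TABLE lookup_values ("
                    + "  id INT PRIMARY KEY,"
                    + "  name VARCHAR(100),"
                    + "  INDEX idx_name (name)"
                    + ") ENGINE=MEMORY");
                // Refill from the on-disk table, since MEMORY rows do not survive restarts.
                stmt.executeUpdate("INSERT INTO lookup_values SELECT id, name FROM lookup_values_disk");
            }
        }
    }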
You can use a simple SoftReference HashMap-based cache. Here is a complete implementation example: SoftReference Cache. In addition, you can clear the complete map periodically to bring in fresh data.
Or, in case you can use a third-party library, you might use the ReferenceMap provided as part of Apache Commons Collections.
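A minimal sketch of such a cache (the class itself is made up; values can be reclaimed by the GC under memory pressure, and clear() can be scheduled to force fresh reads from the database):

    import java.lang.ref.SoftReference;
    import java.util.concurrent.ConcurrentHashMap;

    /** Map-backed cache whose values the GC may reclaim when memory runs low. */
    public class SoftCache<K, V> {
        private final ConcurrentHashMap<K, SoftReference<V>> map =
                new ConcurrentHashMap<K, SoftReference<V>>();

        public void put(K key, V value) {
            map.put(key, new SoftReference<V>(value));
        }

        public V get(K key) {
            SoftReference<V> ref = map.get(key);
            V value = (ref == null) ? null : ref.get();
            if (ref != null && value == null) {
                map.remove(key, ref); // the value was collected; drop the stale entry
            }
            return value;
        }

        /** Call periodically (e.g. once a day) to bring in fresh data. */
        public void clear() {
            map.clear();
        }
    }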

Long-running stats process - thoughts on language choice?

I am on a LAMP stack for a website I am managing. There is a need to roll up usage statistics (a variety of things related to our desktop product).
I initially tackled the problem with PHP (being that I had a bunch of classes to work with the data already). All worked well on my dev box which was using 5.3.
Long story short, 5.1 memory management seems to suck a lot worse, and I've had to do a lot of fooling to get the long-term roll up scripts to run in a fixed memory space. Our server guys are unwilling to upgrade PHP at this time. I've since moved my dev server back to 5.1 so I don't run into this problem again.
For mining of MySQL databases to roll up statistics for different periods and resolutions, potentially running a process that does this all the time in the future (as opposed to on a cron schedule), what language choice do you recommend? I was looking at Python (I know it more or less), Java (don't know it that well), or sticking it out with PHP (know it quite well).
Edit: design clarification for commenter
Resolutions: The way the rollup script works currently, is I have some classes for defining resolutions and buckets. I have year, month, week, day -- given a "bucket number" each class gives a start and end timestamp that defines the time range for that bucket -- this is based on arbitrary epoch date. The system maintains "complete" records, ie it will complete its rolled up dataset for each resolution since the last time it was run, currently.
SQL strategy: The base stats are located in many dissimilar schemas and tables. I do individual queries for each rolled-up stat for the most part, then fill one record for insert. You are suggesting nested subqueries such as:
INSERT into rolled_up_stats (someval, someval, someval, ... ) VALUES (SELECT SUM(somestat) from someschema, SELECT AVG(somestat2) from someschema2)
Those subqueries will generate temporary tables, right? My experience is that had been slow as molasses in the past. Is it a better approach?
Edit 2: Adding some inline responses to the question
Language was a bottleneck in the case of 5.1 php -- I was essentially told I made the wrong language choice (though the scripts worked fine on 5.3). You mention python, which I am checking out for this task. To be clear, what I am doing is providing a management tool for usage statistics of a desktop product (the logs are actually written by an EJB server to mysql tables). I do apache log file analysis, as well as more custom web reporting on the web side, but this project is separate. The approach I've taken so far is aggregate tables. I'm not sure what these message queue products could do for me, I'll take a look.
To go a bit further -- the data is being used to chart activity over time at the service and the customer level, to allow management to understand how the product is being used. You might select a time period (April 1 to April 10) and retrieve a graph of total minutes of usage of a certain feature at different granularities (hours, days, months etc) depending on the time period selected. Its essentially an after-the-fact analysis of usage. The need seems to be tending towards real-time, however (look at the last hour of usage)
There are a lot of different approaches to this problem, some of which are mentioned here, but what you're doing with the data post-rollup is unclear...?
If you want to utilize this data to provide digg-like 'X diggs' buttons on your site, or summary graphs or something like that which needs to be available on some kind of ongoing basis, you can actually utilize memcache for this, and have your code keep the cache key for the particular statistic up to date by incrementing it at the appropriate times.
You could also keep aggregation tables in the database, which can work well for more complex reporting. In this case, depending on how much data you have and what your needs are, you might be able to get away with having an hourly table, and then just creating views based on that base table to represent days, weeks, etc.
If you have tons and tons of data, and you need aggregate tables, you should look into offloading statistics collection (and perhaps the database queries themselves) to a queue like RabbitMQ or ActiveMQ. On the other side of the queue put a consumer daemon that just sits and runs all the time, updating things in the database (and perhaps the cache) as needed.
One thing you might also consider is your web server's logs. I've seen instances where I was able to get a somewhat large portion of the required statistics from the web server logs themselves after just minor tweaks to the log format rules in the config. You can rotate the logs on a regular schedule and then process them offline, recording the results in a reporting database.
I've done all of these things with Python (I released loghetti for dealing with Apache combined format logs, specifically), though I don't think language is a limiting factor or bottleneck here. Ruby, Perl, Java, Scala, or even awk (in some instances) would work.
I have worked on a project to do a similar thing in the past, so I have actual experience with performance. You would be hard pressed to beat the performance of "INSERT ... SELECT" (not "INSERT ... VALUES (SELECT ...)"). Please see http://dev.mysql.com/doc/refman/5.1/en/insert-select.html
The advantage of doing that, especially if you keep the roll-up code in MySQL procedures, is that all you need from the outside is a cron job to poke the DB into performing the right roll-ups at the right times; it's as simple as a shell script with 'mysql <correct DB arguments etc.> "CALL RollupProcedure"'.
This way, you are guaranteeing yourself zero memory allocation bugs, as well as having decent performance when the MySQL DB is on a separate machine (no moving of data across machine boundary...)
EDIT: Hourly resolution is fine -- just run an hourly cron-job...
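If the roll-up is driven from Java rather than a stored procedure, the same principle applies: issue one INSERT ... SELECT and let MySQL do the aggregation in place. A sketch with made-up table and column names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class HourlyRollup {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/stats", "stats", "secret");
                 Statement stmt = conn.createStatement()) {
                // No rows cross the wire into the client; MySQL aggregates and inserts.
                stmt.executeUpdate(
                      "INSERT INTO usage_hourly (bucket_start, feature, total_minutes) "
                    + "SELECT DATE_FORMAT(started_at, '%Y-%m-%d %H:00:00'), feature, "
                    + "       SUM(duration_minutes) "
                    + "FROM usage_log "
                    + "WHERE started_at >= NOW() - INTERVAL 1 HOUR "
                    + "GROUP BY 1, 2");
            }
        }
    }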
If you are running mostly SQL commands, why not just use MySQL etc on the command line? You could create a simple table that lists aggregate data then run a command like mysql -u[user] -p[pass] < commands.sql to pass SQL in from a file.
Or, split the work into smaller chunks and run them sequentially (as PHP files if that's easiest).
If you really need it to be a continual long-running process then a programming language like python or java would be better, since you can create a loop and keep it running indefinitely. PHP is not suited for that kind of thing. It would be pretty easy to convert any PHP classes to Java.
