Why would one want to use an out-of-the-box caching product like Ehcache or Memcached?
Won't a simple HashMap do? I understand this is a naive question, but I would like to see some answers about when a simple HashMap will suffice and a third-party caching solution is overkill.
Some things Ehcache can give you that you would have to manage yourself with a HashMap:
An eviction policy. If your data never grows, there is no need to worry. But if you want to prevent a memory leak from eventually breaking your app, you need an eviction policy. With Ehcache, you can configure the time to live and time to idle of elements in your cache.
Clustered caching with Terracotta. If you have more than one Tomcat instance for failover or scalability, you can link Ehcache up to a Terracotta cluster so that all instances can see the same data if needed.
Transparent disk overflow, either on the Tomcat server or on the Terracotta cluster, for when data doesn't fit into the heap.
Off heap storage. New technologies such as BigMemory mean you have access to a much larger in-memory cache without GC overheads.
Concurrency. Ehcache can use a ConcurrentDistributedMap to give the optimal performance in a clustered configuration.
This is just the tip of the iceberg.
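To make the "manage it yourself" point concrete, here is a minimal sketch of what a hand-rolled eviction policy on top of a plain JDK map might look like. It is not Ehcache's implementation or API. The class name BoundedTtlCache, the Timed wrapper and the explicit nowMillis parameter (passed in so the behaviour is easy to test) are all hypothetical choices for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical hand-rolled cache: LRU eviction plus a time-to-live check on read. */
class BoundedTtlCache<K, V> {

    /** Value plus the instant after which it must no longer be served. */
    private static final class Timed<V> {
        final V value;
        final long expiresAtMillis;

        Timed(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final Map<K, Timed<V>> map;

    BoundedTtlCache(final int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.map = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                // Called after every put: drop the least recently used entry when over capacity.
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(K key, V value, long nowMillis) {
        map.put(key, new Timed<>(value, nowMillis + ttlMillis));
    }

    synchronized V get(K key, long nowMillis) {
        Timed<V> entry = map.get(key);
        if (entry == null) {
            return null;
        }
        if (nowMillis >= entry.expiresAtMillis) {
            // Expired entries are removed lazily, on the read that discovers them.
            map.remove(key);
            return null;
        }
        return entry.value;
    }

    synchronized int size() {
        return map.size();
    }
}
```

Even this small version needs coarse locking and only expires entries lazily. Time to idle, disk overflow and clustering would all be further work on top, which is the argument for using a library once you need them.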
As Tom mentioned, requirements say everything. If all you need is a place to put your data as key-value pairs, a HashMap will do.
But if you need overflow capabilities (writing to disk when the map is "full"), entry expiration (removing an entry when it has not been "touched" in a while), clustered caches, or redundant caches, you fall back on the "don't reinvent the wheel" principle and use a third-party caching solution.
I've been using Ehcache for almost 3 years now. I use just a slice of the total feature set, but the parts I do use work great.
Related
I want to use Guava's caching mechanism to cache request-response pairs of web service calls to improve the performance of a website. But before going ahead with this solution, I want to know how Guava caching stands in terms of performance.
Thanks,
Ashish.
Any in-memory cache is significantly faster (by orders of magnitude) than a round trip to a database, a file, or another service; talking to other computers or the file system is really, REALLY expensive compared to a fetch from memory. Google Guava's cache is basically a Map that automatically triggers some fetching code if the key you're searching for isn't present (along with some automated eviction if you so choose). The Guava wiki page on caches explains it all. If for some reason this cache becomes a bottleneck (based on profiling, not "let me wet my finger and feel which way the wind is blowing"), it's much more likely that the hardware you're running on isn't sufficient for the number of requests you're trying to handle, because a Map data structure is pretty much as low level as it gets in Java.
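The "Map that triggers fetching code on a miss" idea can be sketched with nothing but the JDK. This is not Guava's API, only an illustration of the pattern. LoadingMapCache and its loadCount counter are hypothetical names, and the sketch has no eviction at all.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical load-on-miss cache: a concurrent map plus a loader function. */
class LoadingMapCache<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    LoadingMapCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, invoking the loader only when the key is absent. */
    V get(K key) {
        return map.computeIfAbsent(key, k -> {
            loads.incrementAndGet(); // counts the expensive calls (web service, database, ...)
            return loader.apply(k);
        });
    }

    /** Number of times the loader actually ran. */
    int loadCount() {
        return loads.get();
    }
}
```

With a loader that stands in for the web service call, a second get for the same key is served from memory and the load count stays at 1. That is the whole performance story: a hit costs a map lookup instead of a network round trip.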
I have a SQL table with a disk size of ~50 GB. The table is read-only and thus ideal for caching. For fast, frequent lookups, which would be ideal:
Java 8 Hash Map.
Memcached.
Hibernate EH Cache.
or anything better ?
(provided 200 GB of main memory is available for JVM).
You can start by trying Guava's cache (Google Core Libraries for Java 1.6+).
Generally, the Guava caching utilities are applicable whenever:
You are willing to spend some memory to improve speed.
You expect that keys will sometimes get queried more than once.
Your cache will not need to store more data than what would fit in RAM. (Guava caches are local to a single run of your application. They do not store data in files, or on outside servers. If this does not fit your needs, consider a tool like Memcached.)
Disclaimer: I work for Terracotta on Ehcache
Another option would be to use the upcoming Ehcache 3 with its off-heap tier. This would allow you to cache the whole table in RAM but outside the control of the GC, so it would not be a source of pause times.
I'm part of a team architecting a Java web application wherein users will search for results in a relational database and then view them in tabular fashion in a browser. Users will then also have the option to subsequently view the same result set (or a subset of those results) in a separate browser window, using for example a charting tool. In other words, we need to give the user the ability to visualize the same result set records later (up to a limit of 24 hours).
Since searches on the system will be resource-intensive, and just out of good common sense, we would like a clean way to cache each result set so that it can be pulled later from memory (RAM or disk). We are looking for a good approach to this caching. We believe others have done this before, and we prefer to use a best practice or framework rather than building such a thing from scratch. The server will have plenty of RAM, but since there could be hundreds of people using the system, we may need an approach that stores to RAM first but can also cache to hard disk if RAM is getting full.
I believe it makes most sense to persist as Java objects but I'm open to better advice. We would like a vendor-neutral approach, so that if the database team chooses to switch vendors later we aren't stuck with a proprietary solution. Thanks.
I think what you might be looking for is Terracotta Ehcache. This does everything you mentioned and more. It is a free product that can be used to cache things in memory, overflow to disk, specify max cache sizes by either MB or # of items, and expire based on last access time or entry time.
I've seen Infinispan (http://www.jboss.org/infinispan/) used to do exactly that. It can cache to memory, disk, and/or a database. I wouldn't say I love it (the configuration is not super easy and the documentation is somewhat lacking), but it most certainly works and is actively maintained.
Being vendor-neutral is all about writing an abstraction layer that is native to your application and then plugging the cache service you would like to use in behind it, so that the operations this layer exposes to your main code stay the same.
There are plenty of ways to cache. Look into the various NoSQL solutions:
Redis
Memcached
Most of the time you will serialize your object and persist it to your cache layer.
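A minimal sketch of such an abstraction layer, assuming the application only needs put, get and invalidate. The interface name ResultCache and the in-memory implementation are hypothetical. A Redis-, Memcached- or Ehcache-backed class would implement the same interface, serializing values as needed.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical application-owned cache contract; the main code depends only on this. */
interface ResultCache<K, V> {
    void put(K key, V value);

    Optional<V> get(K key);

    void invalidate(K key);
}

/** Simplest possible backend; swap it for another implementation without touching callers. */
class InMemoryResultCache<K, V> implements ResultCache<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();

    @Override
    public void put(K key, V value) {
        store.put(key, value);
    }

    @Override
    public Optional<V> get(K key) {
        return Optional.ofNullable(store.get(key));
    }

    @Override
    public void invalidate(K key) {
        store.remove(key);
    }
}
```

Callers hold a ResultCache reference and never name the backend, so switching vendors later means writing one new class rather than changing the search code.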
I am evaluating various Java object distribution libraries (Terracotta, JCS, JBoss, Hazelcast ...) for an application server and I'm having trouble understanding their behavior on various axes.
My requirements for distributed objects are not many -- they boil down to one-to-one and one-to-many messaging. There's more, but for the rest we just use JDBC and I assume I can plop a cache in front of this using any of the available libraries.
I would like a system that distributes objects and exhibits locality properties -- in other words, a server that grabs an object tends to hold onto it without excess communication to other nodes. Hazelcast looks simple (and peer-to-peer is nice) but seems to require that objects be distributed evenly across all nodes.
I'd like a way to persist objects, preferably transparently. I plan on using EC2, so I have the option of temporary, free, limited local storage (the disk) and permanent, non-free, unlimited storage (S3). It'd be great not to worry about OutOfMemoryErrors.
I like the simplicity and "magic" of Terracotta but it scares the beejeezus out of me. Also in order to truly scale you have to spend $$$$, otherwise you're communicating with a single hub.
I'm cheap and I want something not only free but mature and with a large userbase.
Thanks for any input.
Terracotta seems like a perfect fit for your situation.
It's simple to set up.
It can be configured to be persistent (use an EBS volume for EC2).
It's closely integrated with Ehcache (Terracotta actually bought Ehcache) for great distributed caching performance.
The free offering scales pretty well with several clients.
Just start playing around with it. I bet you'll love it. To ease your performance fears, simply run a throughput test for message passing. This shouldn't take much more than an afternoon of your time.
I have to admit that I haven't used Terracotta for a year and that I don't know the others you suggested.
Terracotta does fit the bill. I understand your objections, but here are my comments:
1) Terracotta does exhibit locality - and is probably the best system at it compared to those you mentioned. Objects are only brought in to a local JVM where requested. Locking for reads or writes is performed using a leasing mechanism. This means if you exhibit perfect locality in your system then you will incur very little network overhead.
2) Terracotta provides disk persistence out of the box - in the OSS version (you don't have to pay $$$$)
3) Why does it scare you so much? Just use Ehcache as a cache, or the Hibernate 2nd Level Plugin. It's incredibly easy to set up and use.
4) Yes, Terracotta FX requires you to pay (for scale-out servers). However I would suggest that if you have a system that is mostly read and exhibits true locality then I don't think you'll have a problem getting the scale you are looking for. With Terracotta 3.2 the performance of the Hibernate 2nd Level Cache is 100,000 ops/s using 8 application servers and one Terracotta server at 100/0 read/write ratio and 12,000 ops/s using the same config at 95/5 read/write ratio.
(I just did a talk for the Bay Area SDForum on these numbers so I happen to have them handy)
Yes, Hazelcast will distribute your objects across the cluster. However, you can enable the near cache if you want to reduce the communication cost.
http://www.hazelcast.com/documentation.jsp#MapNearCache
Btw, it's not clear what you are looking for (messaging is not the same as clustering/distributed objects).
If you are looking for messaging in Java I recommend you have a look at RabbitMQ (it's Erlang based but that doesn't matter).
I am currently in need of a high-performance Java storage mechanism.
This means:
1) I have 10,000+ objects with one-to-many relationships.
2) The objects are updated every 5 seconds, and the most recent updates must be persisted in case of system failure.
3) The objects need to be queryable in a reasonable time (1-5 seconds), e.g. "give me all of the objects with this timestamp" or "give me all of the objects within these location boundaries".
4) The objects need to be available across various Glassfish installs.
Currently:
I have been using JMS to distribute the objects, Hibernate as an ORM, and HSQLDB to provide the needed recoverability.
I am not exactly happy with the performance. Especially the JMS part of this.
After doing some Stack Overflow research, I am wondering if this would be a better solution. Keep in mind that I have no experience with what Terracotta gives me.
I would use Terracotta to distribute objects around the system, and something else would be needed to provide the ability to "query" for attributes of those objects.
Does this sound reasonable? Would it meet these performance constraints? What other solutions should I consider?
I know it's not what you asked, but you may want to start by switching from HSQLDB to H2. H2 is a relatively new, pure Java DB. It is written by the same guy who wrote HSQLDB, and he claims the performance is much better. I've been using it for some time now and I'm very happy with it. It should be a very quick transition (add a JAR, change the connection string, create the database), so it's worth a shot.
In general, I believe in trying to get the most of what I have before rewriting the application in a different architecture. Try profiling it to identify the bottleneck first.
First, Lucene isn't your friend here (it is effectively read-only).
Terracotta is for scaling at the logic layer. Your problem doesn't seem to be related to the processing logic; it's more about storage and communication.
Identify your bottleneck! Benchmark the Storage/Logic/JMS processing time and overhead!
Kill JMS issues with a good JMS framework (eg. ActiveMQ) and a good/tuned configuration.
Maybe a distributed key=>value store is your friend. Try Project Voldemort!
If you would like to stay with Hibernate and HSQLDB, check out the Hibernate 2nd level cache and connection pooling (c3p0, container-managed...)!
Several Terracotta users have built systems like this in the past, so I can tell you by proof of existence that it can be done. :)
Compass does have support for clustering with Terracotta so that might help you. I suspect you might get further faster by just being careful with how you create your clustered data structures.
Regarding your requirements and Terracotta:
1) 10k objects is quite small from a Terracotta perspective
2) 5 sec update rate doesn't seem like an issue. Might depend how many nodes there are and whether there is any natural partitioning you can take advantage of. All updates will be persistent.
3) 1-5 second query time seems quite easy. Building your own well-organized data structures for lookup is the tricky part. Obviously you want to avoid scanning all the data.
4) Terracotta currently supports Glassfish v1 and v2.
If you post on the Terracotta forums, you could probably get more Terracotta eyeballs on the problem.
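As an illustration of point 3, here is a minimal sketch of a lookup structure that answers "all objects with this timestamp" or "all objects in this time range" without scanning everything. It is plain JDK code and says nothing about how Terracotta would cluster it. TimestampIndex is a hypothetical name, and a location-boundary query would need a different (spatial) structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical secondary index: objects bucketed by timestamp in a sorted map. */
class TimestampIndex<T> {
    private final NavigableMap<Long, List<T>> byTimestamp = new TreeMap<>();

    /** Registers an item under its timestamp. */
    synchronized void add(long timestamp, T item) {
        byTimestamp.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(item);
    }

    /** Returns all items whose timestamp lies in [fromInclusive, toInclusive]. */
    synchronized List<T> between(long fromInclusive, long toInclusive) {
        List<T> result = new ArrayList<>();
        // subMap is a sorted view, so only the matching buckets are visited.
        for (List<T> bucket : byTimestamp.subMap(fromInclusive, true, toInclusive, true).values()) {
            result.addAll(bucket);
        }
        return result;
    }
}
```

An exact-timestamp query is between(t, t). When an object's timestamp changes, it has to be removed from its old bucket and added to the new one, which the sketch leaves out.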
I am currently working on writing the client for a very (very) fast key/value distributed hash DB that provides set + list semantics. The DB is C99 and requires GCC, and right now I'm battling with good old Java network IO to break my current barrier of 30,000 gets/sets per second. Hope to be done within the week. Drop me a line through my account and I'll get back to you when it's show time.
With such a high update rate, Lucene is almost definitely not what you're looking for, since there is no way to update a document once it's indexed. You'd have to keep all the object versions in the index and select the one with the latest time stamp, which will kill your performance.
I'm no DB expert, but I think you should look into one of the distributed DB solutions that have been in the news lately (CouchDB, Cassandra).
Maybe you should take a look at Prevayler:
Your objects are always in memory.
The "changes" to your objects are persisted.
From time to time you are able to take a snapshot: every object is persisted.
You don't say what vendor you are using for JMS, but it wouldn't surprise me if you have a bottleneck there. We couldn't get more than 100 messages a second from ActiveMQ, and whatever we tried in terms of configuration (acknowledgment, queue size, etc.), we were unable to soak the CPU beyond a few percent.
The solution was to batch many queries into one JMS message. We had a simple class that either sent a batch of messages when it got to 200 queries or reached a timeout (we used 20ms), which gave us a dramatic increase in message throughput.
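A rough sketch of that kind of batching class, not the original code. QueryBatcher, the sender callback (which would wrap the actual JMS send) and the explicit nowMillis arguments are assumptions made to keep the example self-contained and testable. In a real system tick would be driven by a timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical batcher: flushes when the batch is full or the oldest item has waited too long. */
class QueryBatcher<T> {
    private final int maxBatchSize;
    private final long maxDelayMillis;
    private final Consumer<List<T>> sender; // e.g. wraps "send one JMS message containing the batch"
    private final List<T> pending = new ArrayList<>();
    private long firstItemAtMillis;

    QueryBatcher(int maxBatchSize, long maxDelayMillis, Consumer<List<T>> sender) {
        this.maxBatchSize = maxBatchSize;
        this.maxDelayMillis = maxDelayMillis;
        this.sender = sender;
    }

    /** Queues an item and sends immediately once the size threshold is reached. */
    synchronized void add(T item, long nowMillis) {
        if (pending.isEmpty()) {
            firstItemAtMillis = nowMillis; // start the timeout clock for this batch
        }
        pending.add(item);
        if (pending.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Call periodically; sends a partial batch once the timeout has elapsed. */
    synchronized void tick(long nowMillis) {
        if (!pending.isEmpty() && nowMillis - firstItemAtMillis >= maxDelayMillis) {
            flush();
        }
    }

    /** Sends whatever is pending as one batch. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        sender.accept(batch);
    }
}
```

With the numbers from the answer this would be constructed as new QueryBatcher<>(200, 20, sender). The timeout bounds the extra latency a query can pick up while waiting for its batch to fill.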
Guaranteed messaging is going to be much slower than volatile messaging. Given that every object is updated every few seconds, you might consider batching your updates (into, say, 500 changes, or by time, say 1-10 ms' worth), sending them over volatile messaging, and batching your transactions. In this case you are more likely to be limited by bandwidth. After tuning for your use case, you may find that smaller batch sizes also work efficiently. If bandwidth is critical (say you have a 10 MB connection or slower), you could use compression over JMS.
You can achieve much higher performance with a custom solution (which also might be simpler) e.g. Hazelcast & JGroups are free (you can add a node(s) which does the database synchronization so your main app doesn't slow down). There are commercial products which handle in the order of half a million durable messages/sec.
Terracotta + jofti = queryable persistent clustered data structures
Search Google for "terracotta querymap" or visit tusharkhairnar.blogspot.com for the querymap blog post.
You may want to integrate timasync as well to update your database. The database is your system of record; use Terracotta as a caching and database-offloading mechanism. You can even batch async updates to make it faster, so that the DB contains fairly recent data.
Tushar
tusharkhairnar.blogspot.com