Caching / mapping strategy for a standalone Java application

I have a SQL table with a disk size of ~50 GB. The table is read-only and thus ideal for caching. For fast, frequent lookups, what would be ideal:
Java 8 HashMap
Memcached
Hibernate EHCache
or anything better?
(Assume 200 GB of main memory is available for the JVM.)

You can start by trying the Guava cache (Google Core Libraries for Java 1.6+).
Generally, the Guava caching utilities are applicable whenever:
You are willing to spend some memory to improve speed.
You expect that keys will sometimes get queried more than once.
Your cache will not need to store more data than what would fit in RAM. (Guava caches are local to a single run of your application.
They do not store data in files, or on outside servers.
If this does not fit your needs, consider a tool like Memcached.)
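As a minimal sketch of what that could look like (the Record value type and the loadFromDatabase lookup are hypothetical stand-ins for your own):

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;

    // Hypothetical Record type and loadFromDatabase(id) lookup.
    LoadingCache<Long, Record> rows = CacheBuilder.newBuilder()
            .maximumSize(10_000_000)               // bound the entry count
            .build(new CacheLoader<Long, Record>() {
                @Override
                public Record load(Long id) {
                    return loadFromDatabase(id);   // fall through to SQL on a miss
                }
            });

    Record r = rows.getUnchecked(42L);             // loads once, served from memory after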

Disclaimer: I work for Terracotta on Ehcache
Another option would be to use the upcoming Ehcache 3 with its off-heap tier. This would allow you to cache the whole table in RAM but outside the control of the GC, so the cache is not a source of pause times.
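A rough sketch with the Ehcache 3 builder API (the Record type and tier sizes are assumptions, not a recommendation):

    import org.ehcache.Cache;
    import org.ehcache.CacheManager;
    import org.ehcache.config.builders.CacheConfigurationBuilder;
    import org.ehcache.config.builders.CacheManagerBuilder;
    import org.ehcache.config.builders.ResourcePoolsBuilder;
    import org.ehcache.config.units.MemoryUnit;

    // Hypothetical Record value type; off-heap values must be serializable,
    // and the JVM needs -XX:MaxDirectMemorySize sized to match the off-heap tier.
    CacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
            .withCache("table", CacheConfigurationBuilder
                    .newCacheConfigurationBuilder(Long.class, Record.class,
                            ResourcePoolsBuilder.newResourcePoolsBuilder()
                                    .heap(1, MemoryUnit.GB)       // small hot tier on heap
                                    .offheap(60, MemoryUnit.GB))) // bulk of the table, no GC cost
            .build(true);                                         // true = initialize now

    Cache<Long, Record> table = cacheManager.getCache("table", Long.class, Record.class);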

Related

JAVA Lightest Thread Framework

I have a project where I have to send emails using the Amazon SES REST API. Amazon allows a number of concurrent connections at the same time based on the account. In my case Amazon allows me to open 50 connections at the same time, which means I can send 50 emails/sec. To achieve this, I am currently using Java Executor threads, where I control the thread rate to be 50/sec. I have also implemented this with the Hibernate framework because I need to execute some SQL queries before sending emails.
This Java program runs continuously in the background (it's a jar file). It takes around 512 MB of RAM, so my question is: can I use some other framework or a better threading approach to make it lighter? The SQL query I execute is only a select query; update/delete/create queries are not used.
I am not good at Java, so maybe this sounds stupid.
I guess the smallest possible framework to use would be plain JDBC.
This would limit your libraries to those in the JRE plus the DB driver and maybe libs for AWS / email. Depending on what else you need, selecting a compact profile might be worth investigating.
Also check your memory settings:
If you set -Xms512m, it's really not surprising your app uses 512 MB, is it?
Edit due to rephrased question
At your level of parallelism, most of your memory is consumed by objects, not by threads (well, threads are objects, but small ones). Threads are fine the way they are in Java. You can run hundreds of them without them consuming 500 MB of heap or more, as you claim.
So the issue with 50 threads consuming 512 MB of your memory is more likely rooted in your code and your objects, not (only) in your threads.
In order to reduce the memory footprint, try the following:
Remove Hibernate. As you say, you only have a simple SQL select, so you don't need the overhead and the additional libraries (a plain-JDBC sketch follows after this list).
Take a memory dump of your running app and analyse it (MAT, the Eclipse Memory Analyser Tool, comes to mind).
Check other objects and how you use them. When you say "sending emails": how large are your emails? Might there be duplicate buffers due to a bad choice of coding? Share your code for how you do it, then we can have a look.
Try running without any memory options and see how the program runs on the defaults.
Enable garbage collector logging and check the output.
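As a hedged sketch of that direction (the JDBC URL, outbox table, and sendViaSes method are hypothetical stand-ins for your own), a single scheduled task firing every 20 ms caps throughput at 50 sends per second without Hibernate:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SesMailer {
        public static void main(String[] args) throws Exception {
            // Plain JDBC instead of the Hibernate stack (hypothetical URL/credentials).
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/mail", "user", "secret");

            // One tick every 20 ms = at most 50 sends per second. A single-thread
            // scheduler also means the shared Connection is never used concurrently.
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT address FROM outbox WHERE sent = 0 LIMIT 1")) {
                    ResultSet rs = ps.executeQuery();
                    if (rs.next()) {
                        sendViaSes(rs.getString("address")); // your existing SES call
                    }
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }, 0, 20, TimeUnit.MILLISECONDS);
        }

        static void sendViaSes(String address) {
            // Amazon SES REST call goes here.
        }
    }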

Need good design pattern for caching database query result set

I'm part of a team architecting a Java web application wherein users will search for results in a relational database and then view them in tabular fashion in a browser. Users will then also have the option to subsequently view the same result set (or a subset of those results) in a separate browser window, using for example a charting tool. In other words, we need to give the user the ability to visualize the same result set records later (up to a limit of 24 hours).
Since searches on the system will be resource-intensive and just out of good common sense, we would like a clean way to cache each result set so that it can be pulled later from memory (RAM or disk). We are looking for a good approach to doing this caching, we believe others have done this before, and we prefer to use a best-practice or framework rather than building such a thing from scratch. The server will have plenty of RAM but since there could be hundreds of people using the system, we may need an approach that stores to RAM first but then can also cache to hard disk if RAM is getting full.
I believe it makes most sense to persist as Java objects but I'm open to better advice. We would like a vendor-neutral approach, so that if the database team chooses to switch vendors later we aren't stuck with a proprietary solution. Thanks.
I think what you might be looking for is Terracotta's Ehcache. It does everything you mentioned and more. It is a free product that can cache things in memory, overflow to disk, cap cache sizes by either MB or number of items, and expire entries based on last access time or entry time.
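A hedged sketch of that with the Ehcache 2.x programmatic API (the cache name, sizes, and the searchKey/resultList placeholders are assumptions, not your schema):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;
    import net.sf.ehcache.config.CacheConfiguration;

    CacheManager manager = CacheManager.create();
    Cache resultSets = new Cache(
            new CacheConfiguration("resultSets", 10_000) // at most 10,000 result sets on heap
                    .timeToLiveSeconds(24 * 60 * 60)     // the 24-hour retention window
                    .overflowToDisk(true));              // spill to disk when heap fills
    manager.addCache(resultSets);

    resultSets.put(new Element(searchKey, resultList));  // cache a search's results
    Element hit = resultSets.get(searchKey);             // later: pull them back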
I've seen http://www.jboss.org/infinispan/ used to do exactly that. It can cache to memory, disk, and/or a database. I wouldn't say I love it (the configuration is not super easy and the documentation is somewhat lacking), but it most certainly works and is actively maintained.
Being vendor-neutral is all about writing an abstraction layer that is native to your application, then plugging the cache service you would like to use in behind this layer, while the layer that exposes these operations to your main code stays the same.
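For example, a minimal sketch of such a layer (all names here are illustrative only):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Vendor-neutral facade the application codes against.
    interface ResultCache<K, V> {
        void put(K key, V value);
        V get(K key);
    }

    // Trivial in-memory binding. Swap this class for an Ehcache- or
    // Infinispan-backed implementation without touching calling code.
    class MapResultCache<K, V> implements ResultCache<K, V> {
        private final Map<K, V> map = new ConcurrentHashMap<>();
        @Override public void put(K key, V value) { map.put(key, value); }
        @Override public V get(K key) { return map.get(key); }
    }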
There are plenty of ways to cache. Look into the various NoSQL solutions:
Redis
Memcached
Most of the time you will serialize your object and persist it to your cache layer.
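For example, a hedged sketch of the serialize step before handing the bytes to Redis or Memcached (the client-library call itself is omitted):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Turn a result object into bytes suitable for any byte-oriented cache.
    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        return bos.toByteArray();
    }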

Caching for file server

I have a Java file server that serves files over HTTP. Each file is uniquely addressable by an ID, like so:
http://fileserver/id/123455555
I am looking to add a caching layer to this so that the most frequently accessed files stay in memory. I would also like to control the total size of the cache. I am thinking of using ehcache or oscache for this, but I have only used them to cache serialized objects before. Would they be a good choice, and are there any additional considerations for building a file cache?
Edit
Thanks for all the answers. Some more details about the file server to simplify (or complicate) the problem:
Once a file is saved, it is never modified.
An MD5 hash is used to avoid duplicating files on save. (I am aware of possible collision and security concerns.)
The file server runs on Linux boxes.
Edit 2
Though the server itself does not put any limitation on the file types it supports, files are mostly images (jpg, gif, png), Word, Excel, and PDF, no bigger than 10 MB.
Guava cache? http://code.google.com/p/guava-libraries/wiki/CachesExplained
nice API
time-based eviction
size-based eviction
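Both eviction styles can be combined. A hedged sketch bounding the cache by total bytes rather than entry count, assuming files are held as byte[] keyed by their ID:

    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.Weigher;

    // Weigh each entry by its file size so the cap is a memory budget.
    Cache<String, byte[]> fileCache = CacheBuilder.newBuilder()
            .maximumWeight(512L * 1024 * 1024)    // ~512 MB ceiling for all entries
            .weigher((Weigher<String, byte[]>) (id, bytes) -> bytes.length)
            .expireAfterAccess(1, TimeUnit.HOURS) // time-based eviction on top
            .build();

    byte[] cached = fileCache.getIfPresent("123455555"); // null on a miss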
Take advantage of the HTTP protocol
Your most effective caching mechanism by far will be to move caching off your own server and as close to the client as possible (data locality ;)). Use the HTTP protocol effectively to allow clients and caching proxies to do the caching whenever they appropriately can:
Set ETags using some function of each file's content (e.g. its MD5 sum); cache this info too, so you don't re-calculate it on each serve!
Set Expires / Last-Modified / Cache-Control headers as appropriate.
edit: You updated to say that the files are never modified, so I would suggest setting the Expires header to a far-future date.
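In a servlet, that could look roughly like this (md5Hex stands in for the hash the server already stores per file):

    import javax.servlet.http.HttpServletResponse;

    // Headers for immutable, MD5-deduplicated files.
    void setCachingHeaders(HttpServletResponse response, String md5Hex) {
        response.setHeader("ETag", '"' + md5Hex + '"');                  // reuse the stored MD5
        response.setHeader("Cache-Control", "public, max-age=31536000"); // ~1 year
        response.setDateHeader("Expires",
                System.currentTimeMillis() + 365L * 24 * 60 * 60 * 1000);
    }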
... Now to answer the question more directly ...
EhCache
My experience with EhCache is that it's a fine choice, and it can satisfy the requirements you've mentioned.
You mentioned "the most frequently accessed files stay in memory", so it seems relevant to mention that, according to some performance testing I did (several years ago now), the LFU (Least Frequently Used) eviction policy is a lot slower than LRU (Least Recently Used) on cache writes: something like 30 times slower, in fact. This is a product of the additional complexity of LFU versus LRU.
It would be a good idea to check the data usage pattern you really see in production to understand which eviction policy works best for you. In most circumstances I would suggest LRU as a starting point, as it approximates to LFU under conditions where the cache is large enough and there are no significant bursts of unusual data access.
OSCache
I have not used OSCache, so cannot say anything there.
Other considerations
In his answer Peter Lawrey suggested using the OS cache. Whilst this means you pay a penalty for the read-through from Java to native, I think the idea has great merit, since it avoids a significant problem of caching in the Java heap: the garbage collector has extra work to do trawling the large heap. (An alternative solution is off-heap caching, for example via BigMemory, but that has its own tradeoffs.)
If the content is compressible, you probably want to consider caching a compressed (gzip'd) version of each file (otherwise you will end up re-compressing it every time it is served!). This is one argument against using the OS disk cache. Of course there are other caveats that go with compression (e.g. the content is large enough to warrant compressing and compresses reasonably well), so it really does depend on what is in those files.
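A small sketch of that compress-once idea, gzipping at cache-fill time so each hit serves pre-compressed bytes:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    // Compress a file's bytes once, then cache the result instead of the raw file.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }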
Ehcache provides the ability to do web caching as well. You may want to try that: http://www.ehcache.org/documentation/user-guide/web-caching
IMHO, you are better off making use of the OS disk cache, as this has several advantages:
It's much simpler, as the OS does all the real work.
The OS can use all the available free memory, which can vary depending on what else the system does.
You don't double up with the disk cache (as it is the disk cache).
The OS will keep the most recently used files in memory anyway.

third-party Caching software- what do they provide?

Why would one want to use an out-of-the-box caching product like Ehcache or Memcached?
Won't a simple HashMap do? I understand this is a naive question, but I would like to see some answers about when a simple HashMap will suffice and a third-party caching solution is overkill.
Some things Ehcache can give you that you would otherwise have to manage yourself with a HashMap (a hand-rolled baseline is sketched after this list):
An eviction policy. If your data never grows, then there is no need to worry. But if you want to prevent a memory leak from eventually breaking your app, you need an eviction policy. With Ehcache, you can configure the time-to-live and time-to-idle of elements in your cache.
Clustered caching with Terracotta. If you have more than one Tomcat for failover / scalability, you can link Ehcache up to a Terracotta cluster, so that all instances can see the same data if needed.
Transparent disk overflow, be this on the Tomcat server or the Terracotta cluster, for when data doesn't fit into the heap.
Off-heap storage. New technologies such as BigMemory mean you have access to a much larger in-memory cache without GC overheads.
Concurrency. Ehcache can use a ConcurrentDistributedMap to give optimal performance in a clustered configuration.
This is just the tip of the iceberg.
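For contrast, here is roughly the do-it-yourself baseline mentioned above: an LRU map via LinkedHashMap, with none of the TTL, overflow, or clustering features in the list (and no thread safety either).

    import java.util.LinkedHashMap;
    import java.util.Map;

    // The hand-rolled baseline: an LRU map capped at a fixed entry count.
    // No TTL/TTI, no disk overflow, no clustering; you would build all of that.
    public class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        public LruCache(int maxEntries) {
            super(16, 0.75f, true);       // accessOrder=true gives LRU ordering
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;   // evict once the cap is exceeded
        }
    }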
As Tom mentioned, requirements say everything. If all you need is a place to put your data as key-value pairs, a HashMap will do.
But if you need overflow capabilities (writing to disk when the map is "full"), entry expiration (removing an entry that has not been "touched" in a while), clustered caches, or redundant caches, you fall back on the don't-reinvent-the-wheel paradigm and use a third-party caching solution.
I've been using ehcache for almost 3 years now. I use just a slice of the total feature set, but the ones I do, work great.

Java object caching, which is faster, reading from a file or from a remote machine?

I am at the point where I need to decide what to do when the caching of objects reaches the configured threshold.
Should I store the objects in an indexed file (like that provided by JCS) and read them from the file (file I/O) when required, or have the objects stored in a distributed cache (network, serialization, deserialization)?
We are using Solaris as OS.
Edit
Adding some more information.
I am asking this question to determine whether I can switch to distributed caching. The remote server holding the cache will have more memory and a better disk, and it will be used only for caching.
One reason we cannot increase the number of locally cached objects is that the cache stores objects in the JVM heap, which has limited memory (we are using a 32-bit JVM).
Edit 2
Thanks; we finally ended up choosing Coherence as our cache product. It provides many cache configuration topologies: in-process vs. remote vs. disk, etc.
It's going to depend on many things, such as disk speed, network latency and the amount of data, so some experimentation might be the best way to get an idea. I recommend you have a look at http://ehcache.org/; it might come in handy.
The only way to really know is to test it, but with good network latency from your cache, it could well be faster than local disk access.
Once you are dealing with a large enough rate of cache requests, serialised random access to the local disk is likely to become a problem.
Do you expect that the distributed nodes will keep your data in memory? I wouldn't.
If you can't be sure that the distributed nodes will keep your data in memory, then holding data on the network will take the time to read data from the disk, plus send the data over the network. Holding data locally will only take the time to read data from the disk.
Local is faster.
You're almost certainly guaranteed to be faster caching the data in a file as opposed to across the network.
The options are not mutually exclusive, there are products out there that combine both. Oracle Coherence for example can provide sophisticated distributed cache services with an option to overflow to disk when thresholds are exceeded.
Check out memcached, a distributed in-memory cache. You'll need to run performance comparisons for your own particular usages, but a distributed memory cache can often outperform a local disk cache.
I don't get the question. Do you need a distributed cache, or not? Just answer this question to find out what you need.
