I have a Java file server that serves files over HTTP. Each file is uniquely addressable by an ID like so:
http://fileserver/id/123455555
I am looking to add a caching layer to this so that the most frequently accessed files stay in memory. I would also like to control the total size of the cache. I am thinking of using Ehcache or OSCache for this, but I have only used them to cache serialized objects before. Would they be a good choice, and are there any additional considerations for building a file cache?
Edit
Thanks for all the answers. Some more details about the file server to simplify (or complicate) the problem:
Once a file is saved, it is never modified.
An MD5 hash is used to avoid duplicating files on save. (I am aware of possible collision and security concerns.)
The file server runs on Linux boxes.
Edit 2
Though the server itself does not put any limitation on the file types it supports, files are mostly images (JPG, GIF, PNG), Word, Excel, and PDF documents no bigger than 10 MB.
guava cache? http://code.google.com/p/guava-libraries/wiki/CachesExplained
nice API
time based eviction
size based eviction
Take advantage of the HTTP protocol
Your most effective caching mechanism by far will be to move caching off your own server and as close to the client as possible (data locality ;)). Use the HTTP protocol effectively to allow clients and caching proxies to do the caching whenever they can appropriately do so:
Set ETags using some function of each file's content (e.g. an MD5 sum), and cache this value too, so you don't recalculate it on each serve!
Set Expires / Last-Modified / Cache-Control headers as appropriate
edit: You updated to say that the files are never modified, so I would suggest setting the Expires header to a far-future date.
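As a rough sketch of the header advice above, assuming the file bytes are already in hand: the class and method names below are hypothetical, and the one-year `max-age` with the `immutable` directive is just one way to express "far future" for content that never changes. The `If-None-Match` check is simplified to a single exact match; real requests may carry a list of tags or `*`.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: response headers for a file that never changes once saved. */
final class ImmutableFileHeaders {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Strong ETag: the quoted hex MD5 of the content. Compute once and cache it with the file. */
    static String etagFor(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2 + 2).append('"');
            for (byte b : digest) {
                sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
            }
            return sb.append('"').toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Headers that let browsers and proxies keep the file for a year without revalidating. */
    static Map<String, String> headersFor(byte[] content) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("ETag", etagFor(content));
        headers.put("Cache-Control", "public, max-age=31536000, immutable");
        return headers;
    }

    /** Simplified check: true when the client's If-None-Match equals our ETag (answer 304). */
    static boolean notModified(String ifNoneMatch, String etag) {
        return ifNoneMatch != null && ifNoneMatch.equals(etag);
    }
}
```

Since the question already stores an MD5 per file for de-duplication, that stored value could be reused as the ETag instead of hashing again.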
... Now to answer the question more directly ...
EhCache
My experience with EhCache is that it's a fine choice, and it can satisfy the requirements you've mentioned.
You mentioned "the most frequently accessed files stay in memory" so it seems relevant to mention that, according to some performance testing I did (several years ago now) the LFU (Least Frequently Used) eviction policy is a lot slower than LRU (Least Recently Used) on cache writes - something like 30 times slower in fact. This is a product of the additional complexity of LFU vs LRU.
It would be a good idea to check the data usage pattern you really see in production to understand which eviction policy works best for you. In most circumstances I would suggest LRU as a starting point, as it approximates to LFU under conditions where the cache is large enough and there are no significant bursts of unusual data access.
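To make the LRU starting point concrete, here is a minimal sketch of an LRU cache bounded by total bytes rather than entry count, built only on `java.util.LinkedHashMap` in access order. It is not how EhCache works internally; the class name and API are made up for illustration, and coarse `synchronized` methods stand in for whatever concurrency strategy a real server would need.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache of file contents, capped by total bytes held. */
final class ByteBoundedLruCache {
    private final long maxBytes;
    private long currentBytes;
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, byte[]> entries = new LinkedHashMap<>(16, 0.75f, true);

    ByteBoundedLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** A hit moves the entry to the most-recently-used end. */
    synchronized byte[] get(String id) {
        return entries.get(id);
    }

    synchronized void put(String id, byte[] content) {
        if (content.length > maxBytes) {
            return; // could never fit, so do not cache it at all
        }
        byte[] old = entries.put(id, content);
        if (old != null) {
            currentBytes -= old.length;
        }
        currentBytes += content.length;
        // Evict from the least-recently-used end until we are back under the cap.
        Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            currentBytes -= eldest.getValue().length;
            it.remove();
        }
    }

    synchronized boolean contains(String id) {
        return entries.containsKey(id);
    }

    synchronized long sizeInBytes() {
        return currentBytes;
    }
}
```

With a 10-byte cap, inserting three 4-byte files evicts whichever of the first two was touched least recently.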
OSCache
I have not used OSCache, so cannot say anything there.
Other considerations
In his answer Peter Lawrey suggested using the OS cache. Whilst this means that you pay a penalty for the read through from java to native I think the idea has great merit since it avoids a significant problem of caching in the Java heap: that the garbage collector has extra work to do trawling the large heap. (An alternative solution to that is to use off-heap caching, for example via BigMemory, but that has its own tradeoffs)
If the content is compressible you probably want to consider caching a compressed (gzip'd) version of the file (otherwise you will end up re-compressing it every time it is served!). This is one argument that goes against using the OS disk cache. Of course there are other caveats that go with compression (e.g. content is large enough to warrant compressing and compresses reasonably well) so it really does depend on what is in those files.
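A small sketch of the "compress once, cache the result" idea, using only `java.util.zip`. The class name and the 10% savings threshold are assumptions for illustration; already-compressed formats such as JPG or PDF will often fail that check, in which case the raw bytes would be cached instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: gzip a file once and cache the compressed bytes. */
final class GzipOnce {
    /** Compresses the content; the result is what would be stored in the cache. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // read after close() so the gzip trailer is included
    }

    /** Inverse of gzip(), needed for clients that do not send Accept-Encoding: gzip. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Assumed rule of thumb: keep the compressed form only if it saves at least 10%. */
    static boolean worthCompressing(byte[] raw, byte[] compressed) {
        return compressed.length < raw.length * 0.9;
    }
}
```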
Ehcache provides the ability to do web caching as well. You may want to try that: http://www.ehcache.org/documentation/user-guide/web-caching
IMHO, you are better off making use of the OS disk cache, as this has several advantages.
It's much simpler, as the OS does all the real work.
The OS can use all the available free memory, which can vary depending on what else the system does.
You don't double up with the disk cache (as it is the disk cache).
The OS will keep the most recently used files in memory anyway.
Related
I have millions of rows to read from a database, and multiple users come in each day to read the same data. I want to create a cache so that I don't have to go to the database again for the same data.
I have seen many options but couldn't figure out which approach to use.
Should I create my own cache (I am thinking of saving the data of a query result by writing it to a file), or
use some third-party in-memory cache, such as
Guava CacheBuilder, LRUMap caching, whirlycache, or cache4j?
You are not the first person to have requirements like this, which is why there are dozens of cache implementations available as open source projects, and even a standard set of Java APIs for caching (JCache). If your needs go beyond those solutions, there are even commercial solutions that handle tens of terabytes of data transparently across RAM, flash, database, etc. If none of those are sufficient, then you should definitely write your own.
It's totally dependent on multiple factors, and I think the answer will be based on environment, size of data, etc. Here are the main points:
You want to keep the cache in RAM as much as possible, because it's faster to access than the file system.
You can also use OS memory-mapped files, which balance access speed against memory utilization. I suggest using any proven solution rather than creating your own.
If you are running low on memory, you might need to ask what is most important to keep, such as caching the most-accessed data, as it is the most likely to be requested by clients.
So there is not a sure or definite answer but you have to decide based on your constraints. Hope this helps
I think you are overengineering the problem. It isn't trivial to write a performant, transparent cache, unless you only need a simple HashMap to hold some values. You should focus on writing code to solve your domain problem and not on writing too much framework code.
Stop reinventing the wheel, use either an in-memory cache (e.g. infinispan or redis) or a database (e.g. postgres). You will have less pain and better performance.
I'm part of a team architecting a Java web application wherein users will search for results in a relational database and then view them in tabular fashion in a browser. Users will then also have the option to subsequently view the same result set (or a subset of those results) in a separate browser window, using for example a charting tool. In other words, we need to give the user the ability to visualize the same result set records later (up to a limit of 24 hours).
Since searches on the system will be resource-intensive and just out of good common sense, we would like a clean way to cache each result set so that it can be pulled later from memory (RAM or disk). We are looking for a good approach to doing this caching, we believe others have done this before, and we prefer to use a best-practice or framework rather than building such a thing from scratch. The server will have plenty of RAM but since there could be hundreds of people using the system, we may need an approach that stores to RAM first but then can also cache to hard disk if RAM is getting full.
I believe it makes most sense to persist as Java objects but I'm open to better advice. We would like a vendor-neutral approach, so that if the database team chooses to switch vendors later we aren't stuck with a proprietary solution. Thanks.
I think what you might be looking for is Terracotta Ehcache. This does everything you mentioned and more. It is a free product that can be used to cache things in memory, overflow to disk, specify max cache sizes by either MB or # of items, and expire based on last access time or entry time.
I've seen http://www.jboss.org/infinispan/ used to do exactly that. It can cache to memory, disk and or database. I wouldn't say I love it (the configuration is not super easy and documentation is somewhat lacking) but it most certainly works and is actively maintained.
Being vendor neutral is all about writing an abstraction layer that is native to your application, then plugging in the cache service you would like to use behind this layer, while keeping your layer that exposes these operations to your main code the same.
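A minimal sketch of such an abstraction layer, under the assumption that the application only needs get and put with a fixed time-to-live (24 hours in the question). The interface and class names are hypothetical; the in-heap implementation is a placeholder that could later be swapped for an Ehcache, Infinispan, Redis, or Memcached adapter without touching callers. The clock is injected so expiry can be tested.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Application-owned interface: the rest of the code depends only on this. */
interface ResultCache<K, V> {
    V get(K key);

    void put(K key, V value);
}

/** One possible backend: an in-heap map whose entries expire after a fixed TTL. */
final class InMemoryResultCache<K, V> implements ResultCache<K, V> {
    private static final class Timed<T> {
        final T value;
        final long expiresAt;

        Timed(T value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Timed<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis in production

    InMemoryResultCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    @Override
    public V get(K key) {
        Timed<V> e = map.get(key);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() >= e.expiresAt) {
            map.remove(key, e); // drop only if it is still the same expired entry
            return null;
        }
        return e.value;
    }

    @Override
    public void put(K key, V value) {
        map.put(key, new Timed<>(value, clock.getAsLong() + ttlMillis));
    }
}
```

Callers would hold a `ResultCache<String, List<Row>>` (with `Row` being whatever the application uses) and never see the vendor class.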
There are plenty of ways to cache. Look into using various NoSql solutions.
Redis
Memcached
Most of the time you will serialize your object and persist it to your cache layer.
This looks like a long question because of all the context. There are 2 questions inside the novel below. Thank you for taking the time to read this and provide assistance.
Situation
I am working on a scalable datastore implementation that can support working with data files from a few KB to a TB or more in size on a 32-bit or 64-bit system.
The datastore utilizes a Copy-on-Write design; always appending new or modified data to the end of the data file and never doing in-place edits to existing data.
The system can host 1 or more databases, each represented by a file on-disk.
The details of the implementation are not important; the only important detail being that I need to constantly append to the file and grow it from KB, to MB, to GB to TB while at the same time randomly skipping around the file for read operations to answer client requests.
First-Thoughts
At first glance I knew I wanted to use memory-mapped files so I could push the burden of efficiently managing the in-memory state of the data onto the host OS and out of my code.
Then all my code needs to worry about is serializing the append-to-file operations on-write, and allowing any number of simultaneous readers to seek in the file to answer requests.
Design
Because the individual data-files can grow beyond the 2GB limit of a MappedByteBuffer, I expect that my design will have to include an abstraction layer that takes a write offset and converts it into an offset inside of a specific 2GB segment.
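The offset translation described above is plain integer arithmetic. The sketch below assumes 1 GiB segments (a power of two, so shifts and masks work, and safely under the `MappedByteBuffer` size limit); the class and method names are illustrative only.

```java
/** Hypothetical helper: maps an absolute file offset to (segment, offset-in-segment). */
final class SegmentAddress {
    // Assumption: 1 GiB segments, a power of two below the ~2 GB MappedByteBuffer limit.
    static final int SEGMENT_SHIFT = 30;
    static final long SEGMENT_SIZE = 1L << SEGMENT_SHIFT;

    /** Which segment the absolute offset falls into. */
    static int segmentIndex(long fileOffset) {
        return (int) (fileOffset >>> SEGMENT_SHIFT);
    }

    /** Position of the offset inside its segment. */
    static int offsetInSegment(long fileOffset) {
        return (int) (fileOffset & (SEGMENT_SIZE - 1));
    }

    /** Records must not straddle a segment boundary if each segment is a separate buffer. */
    static boolean fitsInOneSegment(long fileOffset, int length) {
        return offsetInSegment(fileOffset) + (long) length <= SEGMENT_SIZE;
    }
}
```

An append path would typically check `fitsInOneSegment` first and pad to the next segment when a record would otherwise cross the boundary.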
So far so good...
Problems
This is where I started to get hung up and think that going with a different design (proposed below) might be the better way to do this.
From reading through 20 or so "memory mapped" related questions here on SO, it seems mmap calls are sensitive to wanting contiguous runs of memory when allocated. So, for example, on a 32-bit host OS if I tried to mmap a 2GB file, due to memory fragmentation, my chances are slim that mapping will succeed and instead I should use something like a series of 128MB mappings to pull an entire file in.
When I think of that design, even using, say, 1024MB mmap sizes, for a DBMS hosting a few huge databases each represented by, say, a 1TB file, I now have thousands of memory-mapped regions in memory. In my own testing on Windows 7, trying to create a few hundred mmaps across a multi-GB file, I didn't just run into exceptions: I actually got the JVM to segfault every time I tried to allocate too much, and in one case got the video on my Windows 7 machine to cut out and re-initialize with an OS error popup I've never seen before.
Regardless of the argument of "you'll never likely handle files that large" or "this is a contrived example", the fact that I could code something up like that with those types of side effects put my internal alarm on high alert and made me consider an alternative implementation (below).
BESIDES that issue, my understanding of memory-mapped files is that I have to re-create the mapping every time the file grows, and this file, being append-only by design, is literally constantly growing.
I can combat this to some extent by growing the file in chunks (say 8MB at a time) and only re-create the mapping every 8MB, but the need to constantly be re-creating these mappings has me nervous especially with no explicit unmap feature supported in Java.
Question #1 of 2
Given all of my findings up to this point, I would dismiss memory-mapped files as a good solution for primarily read-heavy solutions or read-only solutions, but not write-heavy solutions given the need to re-create the mapping constantly.
But then I look around at the landscape around me, with solutions like MongoDB embracing memory-mapped files all over the place, and I feel like I am missing some core component here (I do know it allocates in something like 2GB extents at a time, so I imagine they are working around the re-map cost with this logic AND helping to maintain sequential runs on-disk).
At this point I don't know if the problem is Java's lack of an unmap operation that makes this so much more dangerous and unsuitable for my uses or if my understanding is incorrect and someone can point me North.
Alternative Design
An alternative design to the memory-mapped one proposed above that I will go with if my understanding of mmap is correct is as follows:
Define a direct ByteBuffer of a reasonable configurable size (2, 4, 8, 16, 32, 64, 128KB roughly) making it easily compatible with any host platform (don't need to worry about the DBMS itself causing thrashing scenarios) and using the original FileChannel, perform specific-offset reads of the file 1 buffer-capacity-chunk at a time, completely forgoing memory-mapped files at all.
The downside being that now my code has to worry about things like "did I read enough from the file to load the complete record?"
Another down-side is that I don't get to make use of the OS's virtual memory logic, letting it keep more "hot" data in-memory for me automatically; instead I just have to hope the file cache logic employed by the OS is big enough to do something helpful for me here.
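For the "did I read enough?" concern in the alternative design, a positional read loop is usually enough. This sketch uses `FileChannel.read(ByteBuffer, long)`, which reads at an explicit offset without moving the channel's own position, so concurrent readers do not interfere with each other. The class name is hypothetical, and the caller is assumed to size the buffer to the record it wants.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical helper for fixed-offset reads without memory mapping. */
final class PositionalReader {
    /**
     * Fills dst completely from the channel, starting at the given file position.
     * A single read() may return fewer bytes than requested, so loop until full.
     */
    static void readFully(FileChannel channel, long position, ByteBuffer dst) throws IOException {
        while (dst.hasRemaining()) {
            int n = channel.read(dst, position);
            if (n < 0) {
                throw new EOFException("Reached end of file at position " + position);
            }
            position += n;
        }
    }
}
```

A direct buffer (`ByteBuffer.allocateDirect`) can be passed in and reused per reader thread, which matches the small configurable buffer described above.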
Question #2 of 2
I was hoping to get a confirmation of my understanding of all of this.
For example, maybe the file cache is fantastic, that in both cases (memory mapped or direct reads), the host OS will keep as much of my hot data available as possible and the performance difference for large files is negligible.
Or maybe my understanding of the sensitive requirements for memory-mapped files (contiguous memory) is incorrect and I can ignore all that.
You might be interested in https://github.com/peter-lawrey/Java-Chronicle
In this I create multiple memory mappings to the same file (the size is a power of 2 up to 1 GB). The file can be any size (up to the size of your hard drive).
It also creates an index so you can find any record at random and each record can be any size.
It can be shared between processes and used for low latency events between processes.
I make the assumption you are using a 64-bit OS if you want to use large amounts of data. In this case a List of MappedByteBuffer will be all you ever need. It makes sense to use the right tools for the job. ;)
I have found it performs well even with data sizes around 10x your main memory size (I was using a fast SSD drive, so YMMV).
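The "List of MappedByteBuffer" idea can be sketched as follows. This is not the Java-Chronicle implementation, only an illustration under stated assumptions: a power-of-two segment size, segments mapped lazily and never unmapped, the file grown one whole segment at a time by writing a single byte at the new end, no thread safety, and values that never straddle a segment boundary. All names are made up.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical file wrapper that addresses a large file through fixed-size mappings. */
final class SegmentedMappedFile {
    private final FileChannel channel;
    private final int segmentSize;
    private final List<MappedByteBuffer> segments = new ArrayList<>();

    SegmentedMappedFile(FileChannel channel, int segmentSize) {
        if (segmentSize <= 0 || Integer.bitCount(segmentSize) != 1) {
            throw new IllegalArgumentException("segmentSize must be a positive power of two");
        }
        this.channel = channel;
        this.segmentSize = segmentSize;
    }

    /** Returns the mapping covering the offset, creating any missing segments first. */
    private MappedByteBuffer segmentFor(long offset) throws IOException {
        int index = (int) (offset / segmentSize);
        while (segments.size() <= index) {
            long start = (long) segments.size() * segmentSize;
            long end = start + segmentSize;
            if (channel.size() < end) {
                // Grow the file by a whole segment up front, so the mapping is never re-created.
                channel.write(ByteBuffer.wrap(new byte[1]), end - 1);
            }
            segments.add(channel.map(FileChannel.MapMode.READ_WRITE, start, segmentSize));
        }
        return segments.get(index);
    }

    void putLong(long offset, long value) throws IOException {
        segmentFor(offset).putLong((int) (offset % segmentSize), value);
    }

    long getLong(long offset) throws IOException {
        return segmentFor(offset).getLong((int) (offset % segmentSize));
    }
}
```

Because existing mappings stay valid when the file grows, appending only ever adds a new segment to the list rather than remapping old ones.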
I think you shouldn't worry about mmap'ping files up to 2GB in size.
Looking at the sources of MongoDB as an example of a DB making use of memory-mapped files, you'll find it always maps the full data file in MemoryMappedFile::mapWithOptions() (which calls MemoryMappedFile::map()). DB data spans multiple files, each up to 2GB in size. It also preallocates data files, so there's no need to remap as the data grows, and this prevents file fragmentation. Generally, you can take inspiration from the source code of this DB.
I am at a point where I need to take the decision on what to do when caching of objects reaches the configured threshold.
Should I store the objects in an indexed file (like the one provided by JCS) and read them from the file (file IO) when required, or have the objects stored in a distributed cache (network, serialization, deserialization)?
We are using Solaris as OS.
============================
Adding some more information.
I have this question so as to determine if I can switch to distributed caching. The remote server which will have cache will have more memory and better disk and this remote server will only be used for caching.
One of the reasons we cannot increase the number of locally cached objects is that they are stored in the JVM heap, which has limited memory (we use a 32-bit JVM).
========================================================================
Thanks, we finally ended up choosing Coherence as our Cache product. This provides many cache configuration topologies, in process vs remote vs disk ..etc.
It's going to depend on many things such as disk speed, network latency and the amount of data, so some experimentation might be the best way to get an idea. I recommend you have a look at http://ehcache.org/, it might come in handy.
The only way to really know is to test it, but with good network latency from your cache, it could well be faster than local disk access.
Once you are dealing with a large enough rate of cache requests, serialised random access to the local disk is likely to become a problem.
Do you expect that the distributed nodes will keep your data in memory? I wouldn't.
If you can't be sure that the distributed nodes will keep your data in memory, then holding data on the network will take the time to read data from the disk, plus send the data over the network. Holding data locally will only take the time to read data from the disk.
Local is faster.
You're almost certainly going to be faster caching the data in a file as opposed to across the network.
The options are not mutually exclusive, there are products out there that combine both. Oracle Coherence for example can provide sophisticated distributed cache services with an option to overflow to disk when thresholds are exceeded.
Check out memcached, a distributed in-memory cache. You'll need to run performance comparisons for your own particular usages, but a distributed memory cache can often outperform a local disk cache.
I don't get the question. Do you need a distributed cache, or not? Just answer this question to find out what you need.
I support a legacy Java application that uses flat files (plain text) for persistence. Due to the nature of the application, the size of these files can reach hundreds of MB per day, and often the limiting factor in application performance is file IO. Currently, the application uses a plain ol' java.io.FileOutputStream to write data to disk.
Recently, we've had several developers assert that using memory-mapped files, implemented in native code (C/C++) and accessed via JNI, would provide greater performance. However, FileOutputStream already uses native methods for its core methods (i.e. write(byte[])), so it appears a tenuous assumption without hard data or at least anecdotal evidence.
I have several questions on this:
1. Is this assertion really true? Will memory mapped files always provide faster IO compared to Java's FileOutputStream?
2. Does the class MappedByteBuffer accessed from a FileChannel provide the same functionality as a native memory mapped file library accessed via JNI? What is MappedByteBuffer lacking that might lead you to use a JNI solution?
3. What are the risks of using memory-mapped files for disk IO in a production application? That is, applications that have continuous uptime with minimal reboots (once a month, max). Real-life anecdotes from production applications (Java or otherwise) preferred.
Question #3 is important - I could answer this question myself partially by writing a "toy" application that perf tests IO using the various options described above, but by posting to SO I'm hoping for real-world anecdotes / data to chew on.
[EDIT] Clarification - each day of operation, the application creates multiple files that range in size from 100MB to 1 gig. In total, the application might be writing out multiple gigs of data per day.
Memory mapped I/O will not make your disks run faster(!). For linear access it seems a bit pointless.
A NIO mapped buffer is the real thing (usual caveat about any reasonable implementation).
As with other NIO direct allocated buffers, the buffers are not normal memory and won't get GCed as efficiently. If you create many of them you may find that you run out of memory/address space without running out of Java heap. This is obviously a worry with long running processes.
You might be able to speed things up a bit by examining how your data is being buffered during writes. This tends to be application specific as you would need an idea of the expected data writing patterns. If data consistency is important, there will be tradeoffs here.
If you are just writing out new data to disk from your application, memory mapped I/O probably won't help much. I don't see any reason you would want to invest time in some custom coded native solution. It just seems like too much complexity for your application, from what you have provided so far.
If you are sure you really need better I/O performance - or just O performance in your case, I would look into a hardware solution such as a tuned disk array. Throwing more hardware at the problem is often times more cost effective from a business point of view than spending time optimizing software. It is also usually quicker to implement and more reliable.
In general, there are a lot of pitfalls in over optimization of software. You will introduce new types of problems to your application. You might run into memory issues/ GC thrashing which would lead to more maintenance/tuning. The worst part is that many of these issues will be hard to test before going into production.
If it were my app, I would probably stick with the FileOutputStream with some possibly tuned buffering. After that I'd use the time honored solution of throwing more hardware at it.
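As an illustration of "FileOutputStream with some tuned buffering", the sketch below wraps the stream in a `BufferedOutputStream` with an explicit buffer size. The 1 MiB figure is an assumed starting point rather than a recommendation, and the class and method names are hypothetical; the right size depends on the write pattern and should be measured.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper: append text lines to a flat file through one large buffer. */
final class BufferedAppend {
    // Assumption: 1 MiB buffer; tune against the real workload.
    static final int BUFFER_BYTES = 1 << 20;

    static void appendLines(Path file, List<String> lines) throws IOException {
        // The second FileOutputStream argument (true) opens the file in append mode.
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream(file.toFile(), true), BUFFER_BYTES)) {
            for (String line : lines) {
                out.write(line.getBytes(StandardCharsets.UTF_8));
                out.write('\n');
            }
        } // close() flushes whatever is still sitting in the buffer
    }
}
```

The trade-off mentioned earlier applies: anything still in the buffer is lost if the process dies before a flush, so data-consistency requirements limit how large the buffer can be.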
From my experience, memory mapped files perform MUCH better than plain file access in both real time and persistence use cases. I've worked primarily with C++ on Windows, but Linux performance is similar, and you're planning to use JNI anyway, so I think it applies to your problem.
For an example of a persistence engine built on memory mapped file, see Metakit. I've used it in an application where objects were simple views over memory-mapped data, the engine took care of all the mapping stuff behind the curtains. This was both fast and memory efficient (at least compared with traditional approaches like those the previous version used), and we got commit/rollback transactions for free.
In another project I had to write multicast network applications. The data was sent in randomized order to minimize the impact of consecutive packet loss (combined with FEC and blocking schemes). Moreover, the data could well exceed the address space (video files were larger than 2 GB), so memory allocation was out of the question. On the server side, file sections were memory-mapped on demand and the network layer picked the data directly from these views; as a consequence the memory usage was very low. On the receiver side, there was no way to predict the order in which packets were received, so it had to maintain a limited number of active views on the target file, and data was copied directly into these views. When a packet had to be put in an unmapped area, the oldest view was unmapped (and eventually flushed into the file by the system) and replaced by a new view on the destination area. Performance was outstanding, notably because the system did a great job of committing data as a background task, and real-time constraints were easily met.
Since then I'm convinced that even the most finely crafted software scheme cannot beat the system's default I/O policy with memory-mapped files, because the system knows more than user-space applications about when and how data must be written. Also, what is important to know is that memory mapping is a must when dealing with large data, because the data is never allocated (hence consuming memory) but dynamically mapped into the address space, and managed by the system's virtual memory manager, which is always faster than the heap. So the system always uses the memory optimally, and commits data whenever it needs to, behind the application's back without impacting it.
Hope it helps.
As for point 3: if the machine crashes and there are any pages that were not flushed to disk, then they are lost. Another thing is the waste of address space: mapping a file to memory consumes address space (and requires a contiguous area), and on 32-bit machines it's a bit limited. But you mentioned about 100MB, so it should not be a problem. And one more thing: expanding the size of the mmapped file requires some work.
By the way, this SO discussion can also give you some insights.
If you write fewer bytes it will be faster. What if you filtered it through GZIPOutputStream, or what if you wrote your data into ZipFiles or JarFiles?
As mentioned above, use NIO (a.k.a. new IO). There's also a new, new IO coming out.
The proper use of a RAID hard drive solution would help you, but that would be a pain.
I really like the idea of compressing the data. Go for the GZIPOutputStream, dude! That could double your throughput if the CPU can keep up. It is likely that you can take advantage of the now-standard dual-core machines, eh?
-Stosh
I did a study where I compared the write performance of a raw ByteBuffer with the write performance of a MappedByteBuffer. Memory-mapped files are supported by the OS and their write latencies are very good, as you can see in my benchmark numbers. Performing synchronous writes through a FileChannel is approximately 20 times slower, and that's why people do asynchronous logging all the time. In my study I also give an example of how to implement asynchronous logging through a lock-free and garbage-free queue for ultimate performance very close to a raw ByteBuffer.