This looks like a long question because of all the context. There are 2 questions inside the novel below. Thank you for taking the time to read this and provide assistance.
Situation
I am working on a scalable datastore implementation that can support working with data files from a few KB to a TB or more in size on a 32-bit or 64-bit system.
The datastore utilizes a Copy-on-Write design; always appending new or modified data to the end of the data file and never doing in-place edits to existing data.
The system can host one or more databases, each represented by a file on disk.
The details of the implementation are not important; the only important detail being that I need to constantly append to the file and grow it from KB, to MB, to GB to TB while at the same time randomly skipping around the file for read operations to answer client requests.
First-Thoughts
At first glance I knew I wanted to use memory-mapped files so I could push the burden of efficiently managing the in-memory state of the data onto the host OS and out of my code.
Then all my code needs to worry about is serializing the append-to-file operations on-write, and allowing any number of simultaneous readers to seek in the file to answer requests.
Design
Because the individual data-files can grow beyond the 2GB limit of a MappedByteBuffer, I expect that my design will have to include an abstraction layer that takes a write offset and converts it into an offset inside of a specific 2GB segment.
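To make that concrete, here is a minimal sketch of the kind of abstraction layer I have in mind (the 1GB segment size and all class/field names are my own placeholders, not a settled design):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: map a large file as a series of fixed-size segments and
    // translate a global offset into (segment index, offset within that segment).
    // Assumes the file already contains data when opened.
    class SegmentedMappedFile {
        static final long SEGMENT_SIZE = 1L << 30; // 1GB per MappedByteBuffer segment

        private final FileChannel channel;
        private final List<MappedByteBuffer> segments = new ArrayList<>();

        SegmentedMappedFile(String path) throws IOException {
            RandomAccessFile raf = new RandomAccessFile(path, "rw");
            this.channel = raf.getChannel();
            long mapped = 0;
            while (mapped < channel.size()) {
                long len = Math.min(SEGMENT_SIZE, channel.size() - mapped);
                segments.add(channel.map(FileChannel.MapMode.READ_WRITE, mapped, len));
                mapped += len;
            }
        }

        // Read one byte at a global file offset by resolving the segment first.
        byte get(long offset) {
            int segment = (int) (offset / SEGMENT_SIZE);
            int within  = (int) (offset % SEGMENT_SIZE);
            return segments.get(segment).get(within);
        }
    }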
So far so good...
Problems
This is where I started to get hung up and think that going with a different design (proposed below) might be the better way to do this.
From reading through 20 or so "memory mapped" related questions here on SO, it seems mmap calls want contiguous runs of address space when allocated. So, for example, on a 32-bit host OS, if I tried to mmap a 2GB file, the chances that the mapping will succeed are slim due to address-space fragmentation, and instead I should use something like a series of 128MB mappings to pull an entire file in.
When I think of that design, even using, say, 1024MB mmap sizes, a DBMS hosting a few huge databases, each represented by a 1TB file, would leave me with thousands of memory-mapped regions. In my own testing on Windows 7, trying to create a few hundred mmaps across a multi-GB file, I didn't just run into exceptions; I actually got the JVM to segfault every time I tried to allocate too much, and in one case the video on my Windows 7 machine cut out and re-initialized with an OS error popup I had never seen before.
Regardless of the argument of "you'll never likely handle files that large" or "this is a contrived example", the fact that I could code something up like that with those kinds of side effects put my internal alarm on high alert and made me consider an alternative implementation (below).
BESIDES that issue, my understanding of memory-mapped files is that I have to re-create the mapping every time the file is grown, and in the case of this file, which is append-only by design, that means it is literally constantly growing.
I can combat this to some extent by growing the file in chunks (say 8MB at a time) and only re-creating the mapping every 8MB, but the need to constantly re-create these mappings has me nervous, especially with no explicit unmap feature supported in Java.
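For what it's worth, that grow-in-chunks idea might look roughly like this (class and field names are hypothetical, and it assumes individual records are smaller than the chunk):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Hypothetical sketch: pre-extend the file in 8MB chunks so the mapping only
    // has to be re-created when a chunk boundary is crossed, not on every append.
    class GrowableMappedAppender {
        static final long CHUNK = 8L * 1024 * 1024; // grow the file 8MB at a time

        private final RandomAccessFile file;
        private final FileChannel channel;
        private MappedByteBuffer map;   // mapping over the current tail chunk
        private long mappedStart;       // file offset where the current mapping begins
        private long writePos;          // logical end of the data written so far

        GrowableMappedAppender(String path) throws IOException {
            file = new RandomAccessFile(path, "rw");
            channel = file.getChannel();
            writePos = file.length();
            remapTail();
        }

        // Assumes data.length < CHUNK.
        void append(byte[] data) throws IOException {
            // If the record does not fit in the currently mapped chunk,
            // extend the file by another chunk and re-create the mapping.
            if (writePos + data.length > mappedStart + map.capacity()) {
                file.setLength(mappedStart + map.capacity() + CHUNK);
                remapTail();
            }
            map.position((int) (writePos - mappedStart));
            map.put(data);
            writePos += data.length;
        }

        private void remapTail() throws IOException {
            // Map from the start of the chunk containing writePos to the end of the file.
            mappedStart = (writePos / CHUNK) * CHUNK;
            long len = Math.max(CHUNK, channel.size() - mappedStart);
            map = channel.map(FileChannel.MapMode.READ_WRITE, mappedStart, len);
        }
    }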
Question #1 of 2
Given all of my findings up to this point, I would consider memory-mapped files a good fit primarily for read-heavy or read-only solutions, but not for write-heavy solutions, given the need to constantly re-create the mapping.
But then I look around at the landscape and see solutions like MongoDB embracing memory-mapped files all over the place, and I feel like I am missing some core component here (I do know it allocates in something like 2GB extents at a time, so I imagine they are working around the re-map cost with this logic AND helping to maintain sequential runs on-disk).
At this point I don't know if the problem is Java's lack of an unmap operation that makes this so much more dangerous and unsuitable for my uses, or if my understanding is incorrect and someone can point me North.
Alternative Design
An alternative design to the memory-mapped one proposed above that I will go with if my understanding of mmap is correct is as follows:
Define a direct ByteBuffer of a reasonable, configurable size (roughly 2, 4, 8, 16, 32, 64 or 128KB), making it easily compatible with any host platform (and avoiding the DBMS itself causing thrashing scenarios), and use the original FileChannel to perform specific-offset reads of the file one buffer-capacity chunk at a time, forgoing memory-mapped files entirely.
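A rough sketch of what that alternative might look like (the 64KB buffer size and names are illustrative, and the single shared buffer is a simplification; each reader thread would want its own buffer):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Hypothetical sketch: positional reads through a re-usable direct buffer,
    // with no memory mapping involved at all. Not thread-safe as written.
    class DirectReader {
        private final FileChannel channel;
        private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024); // 64KB

        DirectReader(String path) throws IOException {
            channel = new RandomAccessFile(path, "r").getChannel();
        }

        // Read up to buffer.capacity() bytes starting at the given file offset.
        byte[] readChunk(long offset) throws IOException {
            buffer.clear();
            int n = 0;
            // FileChannel.read(ByteBuffer, long) may return fewer bytes than asked for,
            // so loop until the buffer is full or end-of-file is reached.
            while (buffer.hasRemaining()) {
                int read = channel.read(buffer, offset + n);
                if (read < 0) break; // end of file
                n += read;
            }
            buffer.flip();
            byte[] out = new byte[buffer.remaining()];
            buffer.get(out);
            return out;
        }
    }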
The downside is that now my code has to worry about things like "did I read enough from the file to load the complete record?"
Another downside is that I don't get to make use of the OS's virtual-memory logic, letting it keep more "hot" data in memory for me automatically; instead I just have to hope the file-cache logic employed by the OS is big enough to do something helpful for me here.
Question #2 of 2
I was hoping to get a confirmation of my understanding of all of this.
For example, maybe the file cache is fantastic enough that in both cases (memory-mapped or direct reads) the host OS will keep as much of my hot data available as possible, and the performance difference for large files is negligible.
Or maybe my understanding of the sensitive requirements for memory-mapped files (contiguous address space) is incorrect and I can ignore all that.
You might be interested in https://github.com/peter-lawrey/Java-Chronicle
In this I create multiple memory mappings to the same file (the size of each mapping is a power of 2, up to 1GB). The file can be any size (up to the size of your hard drive).
It also creates an index so you can find any record at random and each record can be any size.
It can be shared between processes and used for low latency events between processes.
I make the assumption you are using a 64-bit OS if you want to use large amounts of data. In this case a List of MappedByteBuffer will be all you ever need. It makes sense to use the right tools for the job. ;)
I have found it performs well even with data sizes around 10x your main memory size (I was using a fast SSD drive, so YMMV).
I think you shouldn't worry about mmap'ping files up to 2GB in size.
Looking at the sources of MongoDB as an example of a DB making use of memory-mapped files, you'll find it always maps the full data file in MemoryMappedFile::mapWithOptions() (which calls MemoryMappedFile::map()). The DB's data spans multiple files, each up to 2GB in size. It also preallocates data files, so there's no need to remap as the data grows, and this prevents file fragmentation. Generally you can take inspiration from the source code of this DB.
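For illustration, the preallocate-then-map pattern might be sketched like this in Java (the 1GB extent size is my own choice; each FileChannel.map() call is limited to just under 2GB, and on some filesystems setLength() only creates a sparse file, so true preallocation may need explicit zero-filling):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Hypothetical sketch: preallocate a fixed-size extent up front and map it once,
    // so appends never need to grow the file or re-create the mapping.
    class PreallocatedExtent {
        static final long EXTENT_SIZE = 1L << 30; // 1GB, safely below Integer.MAX_VALUE

        static MappedByteBuffer createExtent(String path) throws IOException {
            RandomAccessFile raf = new RandomAccessFile(path, "rw");
            raf.setLength(EXTENT_SIZE); // reserve the full extent before any writes
            FileChannel channel = raf.getChannel();
            return channel.map(FileChannel.MapMode.READ_WRITE, 0, EXTENT_SIZE);
        }
    }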
Related
I have an issue with fragmentation on my drive. I have a program that generates over 50,000 files in different folders, and each file grows over time. Each file will be about 500MB in size, and I need to read the files fast.
The issue I am facing is that each file will be spread over the drive, and defragmentation would take over 4 weeks.
I heard about a filesystem that will spread the files out on the drive so that the gap between each file is the same size. I searched the internet for that filesystem but I couldn't find anything.
My program is written in Java, maybe there is a way to set the beginning of a file on a specific byte position on the drive.
I would be glad if someone could help me facing this issue.
I heard about a filesystem that will spread the files out on the drive so that the gap between each file will be the same size. I searched the internet for that filesystem but I couldn't find anything.
Most likely you did not because it does not exist...
But we have RAID systems (Redundant Array of Inexpensive Disks) which could ease your pain...
As Timothy said, you can't get to that level by using Java.
I haven't heard of that filesystem either; it doesn't seem to make much sense anyway.
Perhaps, in the case that you are storing text, you can use a NoSQL database (like MongoDB) that stores data in a binary format. You'll probably get good speeds, and the Java connector is easy to use.
Use a Linux filesystem like ext4, where disk fragmentation is very low, but also make sure you have plenty of disk space left, else fragmentation will happen anyway.
I also don't know of a file system that does this. However, I have some info that may help:
If you used an SSD, then fragmentation would be less of a concern for reading performance reasons. SSDs store data in chunks - NAND flash pages, 16 KB for instance. These are always stored in scattered order due to the wear-levelling algorithm used. That is very unlike how hard disks work in practice. Pages on SSDs are accessed in a very parallel fashion as well. As a result, you would have much less impact of fragmentation on reading performance with an SSD. Fragmentation would still have some penalty for writes/deletions.
RAID would also allow for higher performance on reads as Timothy mentions.
I am trying to optimize the logging system of an Android application which causes some unwanted latency. There are multiple files opened which log different parts and should be kept separate.
I am not very familiar with low-level filesystem design and even less with the current flash and/or SSD memory used in mobile phones (as opposed to traditional HDDs). I assume that storage is organized in disk blocks (512B, or 4096B more recently) and that some form of contiguous, linked or indexed allocation is used.
I am using BufferedOutputStreams with a buffer size of 256B, but this value was chosen somewhat arbitrarily (this provides a good answer for buffer size).
Does writing in append mode to multiple open files create additional overhead that can significantly decrease performance (from the allocation strategy, for example)? Is it greatly influenced by the buffered output stream's buffer size (in this particular case of multiple files)?
I am using Android, which tends to ship a variety of filesystems, and that makes it hard to understand how each influences appending to multiple open files. The I/O functions of Java or any other language are probably very similar.
My search for this particular issue turned up empty, or maybe I need some domain-specific terms in my search that I am not familiar with.
I have a java file server that serves file over http. Each file is uniquely addressable by an ID like so:
http://fileserver/id/123455555
I am looking to add a caching layer to this so that the most frequently accessed files stay in memory. I would also like to control the total size of the cache. I am thinking of using ehcache or oscache for this, but I have only used them to cache serialized objects before. Would they be a good choice, and are there any additional considerations for building a file cache?
Edit
Thanks for all the answers. Some more details about the file server to simplify (or complicate) the problem:
Once a file is saved, it is never modified.
An MD5 hash is used to avoid duplicating files on save. (I am aware of possible collision and security concerns.)
The file server runs on Linux boxes.
Edit 2
Though the server itself does not put any limitation on the file types it supports, files are mostly images (jpg, gif, png), Word, Excel and PDF documents no bigger than 10MB.
Guava cache? http://code.google.com/p/guava-libraries/wiki/CachesExplained
nice API
time-based eviction
size-based eviction
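A hedged sketch of how size-based eviction could look with Guava's CacheBuilder (the weight limit, expiry and loader are purely illustrative, and the hypothetical /data/files layout is mine, not the poster's):

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import com.google.common.cache.Weigher;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.TimeUnit;

    // Illustrative only: cache file contents by id, bounded by total bytes held.
    public class FileCacheExample {
        static LoadingCache<String, byte[]> buildCache() {
            return CacheBuilder.newBuilder()
                    .maximumWeight(512L * 1024 * 1024)          // cap the cache at ~512MB
                    .weigher(new Weigher<String, byte[]>() {    // weight = file size in bytes
                        public int weigh(String id, byte[] bytes) {
                            return bytes.length;
                        }
                    })
                    .expireAfterAccess(1, TimeUnit.HOURS)       // time-based eviction
                    .build(new CacheLoader<String, byte[]>() {  // load on cache miss
                        public byte[] load(String id) throws IOException {
                            // hypothetical on-disk layout: one file per id
                            return Files.readAllBytes(Paths.get("/data/files", id));
                        }
                    });
        }
    }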
Take advantage of the HTTP protocol
Your most effective caching mechanism by far will be to move caching off your own server and as close to the client as possible (data locality ;)). Use the HTTP protocol effectively to allow clients and caching proxies to do the caching whenever they can appropriately do so:
Set ETags using some function of each file's content (e.g. an MD5 sum) - cache this info too, so you don't re-calculate it on each serve!
Set Expires / Last-Modified / Cache-Control headers as appropriate
edit: You updated to say that the files are never modified, so I would suggest setting the Expires header to a far-future date.
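In a servlet-based file server this could look roughly like the following (header values are illustrative, and the far-future Expires assumes the files really are immutable):

    import javax.servlet.http.HttpServletResponse;

    // Illustrative only: set validation and expiry headers so clients and proxies
    // can cache immutable files themselves.
    public class CacheHeaders {
        static void setImmutableFileHeaders(HttpServletResponse response, String md5) {
            // Re-use the MD5 already computed on save as the ETag (don't recompute per request).
            response.setHeader("ETag", "\"" + md5 + "\"");
            // Cache publicly for up to a year; the content never changes once saved.
            response.setHeader("Cache-Control", "public, max-age=31536000");
            response.setDateHeader("Expires", System.currentTimeMillis() + 365L * 24 * 60 * 60 * 1000);
        }
    }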
... Now to answer the question more directly ...
EhCache
My experience with EhCache is that it's a fine choice, and it can satisfy the requirements you've mentioned.
You mentioned "the most frequently accessed files stay in memory" so it seems relevant to mention that, according to some performance testing I did (several years ago now) the LFU (Least Frequently Used) eviction policy is a lot slower than LRU (Least Recently Used) on cache writes - something like 30 times slower in fact. This is a product of the additional complexity of LFU vs LRU.
It would be a good idea to check the data usage pattern you really see in production to understand which eviction policy works best for you. In most circumstances I would suggest LRU as a starting point, as it approximates to LFU under conditions where the cache is large enough and there are no significant bursts of unusual data access.
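For reference, a minimal sketch of an Ehcache 2.x-style cache configured with LRU eviction (cache name, entry limit and key are placeholders of mine):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;
    import net.sf.ehcache.config.CacheConfiguration;
    import net.sf.ehcache.store.MemoryStoreEvictionPolicy;

    // Illustrative only: an in-memory file cache with an LRU eviction policy.
    public class EhcacheLruExample {
        public static void main(String[] args) {
            CacheManager manager = CacheManager.create();
            Cache fileCache = new Cache(
                    new CacheConfiguration("files", 10000)  // max entries held on heap
                            .memoryStoreEvictionPolicy(MemoryStoreEvictionPolicy.LRU));
            manager.addCache(fileCache);

            fileCache.put(new Element("123455555", new byte[] { /* file bytes */ }));
            Element hit = fileCache.get("123455555");
            if (hit != null) {
                byte[] bytes = (byte[]) hit.getObjectValue();
            }
        }
    }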
OSCache
I have not used OSCache, so cannot say anything there.
Other considerations
In his answer Peter Lawrey suggested using the OS cache. Whilst this means that you pay a penalty for the read-through from Java to native, I think the idea has great merit since it avoids a significant problem of caching in the Java heap: the garbage collector has extra work to do trawling the large heap. (An alternative solution to that is to use off-heap caching, for example via BigMemory, but that has its own tradeoffs.)
If the content is compressible you probably want to consider caching a compressed (gzip'd) version of the file (otherwise you will end up re-compressing it every time it is served!). This is one argument that goes against using the OS disk cache. Of course there are other caveats that go with compression (e.g. content is large enough to warrant compressing and compresses reasonably well) so it really does depend on what is in those files.
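To illustrate the compress-once idea (purely a sketch; how and where you key the cache is up to you):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.GZIPOutputStream;

    // Illustrative only: gzip the file a single time and cache the compressed bytes,
    // so each subsequent request can be served without re-compressing.
    public class CompressOnce {
        static byte[] gzipFile(String path) throws IOException {
            byte[] raw = Files.readAllBytes(Paths.get(path));
            ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length / 2);
            GZIPOutputStream gzip = new GZIPOutputStream(out);
            gzip.write(raw);
            gzip.close(); // finishes the gzip stream and flushes the trailer
            return out.toByteArray(); // this is what goes into the cache
        }
    }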
Ehcache provides the ability to do web caching as well. You may want to try that: http://www.ehcache.org/documentation/user-guide/web-caching
IMHO, you are better off making use of the OS disk cache, as this has several advantages.
It's much simpler, as the OS does all the real work.
The OS can use all the available free memory, which can vary depending on what else the system does.
You don't double up with the disk cache (as it is the disk cache).
The OS will keep the most recently used files in memory anyway.
I support a legacy Java application that uses flat files (plain text) for persistence. Due to the nature of the application, the size of these files can reach 100s MB per day, and often the limiting factor in application performance is file IO. Currently, the application uses a plain ol' java.io.FileOutputStream to write data to disk.
Recently, we've had several developers assert that using memory-mapped files, implemented in native code (C/C++) and accessed via JNI, would provide greater performance. However, FileOutputStream already uses native methods for its core methods (i.e. write(byte[])), so it appears a tenuous assumption without hard data or at least anecdotal evidence.
I have several questions on this:
1. Is this assertion really true? Will memory-mapped files always provide faster IO compared to Java's FileOutputStream?
2. Does the class MappedByteBuffer accessed from a FileChannel provide the same functionality as a native memory-mapped file library accessed via JNI? What is MappedByteBuffer lacking that might lead you to use a JNI solution?
3. What are the risks of using memory-mapped files for disk IO in a production application? That is, applications that have continuous uptime with minimal reboots (once a month, max). Real-life anecdotes from production applications (Java or otherwise) preferred.
Question #3 is important - I could answer this question myself partially by writing a "toy" application that perf tests IO using the various options described above, but by posting to SO I'm hoping for real-world anecdotes / data to chew on.
[EDIT] Clarification - each day of operation, the application creates multiple files that range in size from 100MB to 1 gig. In total, the application might be writing out multiple gigs of data per day.
Memory mapped I/O will not make your disks run faster(!). For linear access it seems a bit pointless.
A NIO mapped buffer is the real thing (usual caveat about any reasonable implementation).
As with other NIO direct allocated buffers, the buffers are not normal memory and won't get GCed as efficiently. If you create many of them you may find that you run out of memory/address space without running out of Java heap. This is obviously a worry with long running processes.
You might be able to speed things up a bit by examining how your data is being buffered during writes. This tends to be application specific as you would need an idea of the expected data writing patterns. If data consistency is important, there will be tradeoffs here.
If you are just writing out new data to disk from your application, memory mapped I/O probably won't help much. I don't see any reason you would want to invest time in some custom coded native solution. It just seems like too much complexity for your application, from what you have provided so far.
If you are sure you really need better I/O performance - or just O performance in your case, I would look into a hardware solution such as a tuned disk array. Throwing more hardware at the problem is often times more cost effective from a business point of view than spending time optimizing software. It is also usually quicker to implement and more reliable.
In general, there are a lot of pitfalls in over optimization of software. You will introduce new types of problems to your application. You might run into memory issues/ GC thrashing which would lead to more maintenance/tuning. The worst part is that many of these issues will be hard to test before going into production.
If it were my app, I would probably stick with the FileOutputStream with some possibly tuned buffering. After that I'd use the time honored solution of throwing more hardware at it.
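In concrete terms, the "tuned buffering" could be as simple as this (the 1MB buffer size is a guess to be validated against your own write patterns):

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // Illustrative only: plain FileOutputStream with a larger-than-default buffer,
    // so many small writes are coalesced into fewer system calls.
    public class TunedWriter {
        static OutputStream openForAppend(String path) throws IOException {
            return new BufferedOutputStream(new FileOutputStream(path, true), 1024 * 1024);
        }
    }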
From my experience, memory mapped files perform MUCH better than plain file access in both real time and persistence use cases. I've worked primarily with C++ on Windows, but Linux performances are similar, and you're planning to use JNI anyway, so I think it applies to your problem.
For an example of a persistence engine built on memory mapped file, see Metakit. I've used it in an application where objects were simple views over memory-mapped data, the engine took care of all the mapping stuff behind the curtains. This was both fast and memory efficient (at least compared with traditional approaches like those the previous version used), and we got commit/rollback transactions for free.
In another project I had to write multicast network applications. The data was sent in randomized order to minimize the impact of consecutive packet loss (combined with FEC and blocking schemes). Moreover, the data could well exceed the address space (video files were larger than 2GB), so memory allocation was out of the question. On the server side, file sections were memory-mapped on demand and the network layer directly picked the data from these views; as a consequence the memory usage was very low. On the receiver side, there was no way to predict the order in which packets were received, so it had to maintain a limited number of active views on the target file, and data was copied directly into these views. When a packet had to be put in an unmapped area, the oldest view was unmapped (and eventually flushed to the file by the system) and replaced by a new view on the destination area. Performance was outstanding, notably because the system did a great job of committing data as a background task, and real-time constraints were easily met.
Since then I'm convinced that even the best fine-crafted software scheme cannot beat the system's default I/O policy with memory-mapped files, because the system knows more than user-space applications about when and how data must be written. Also, what is important to know is that memory mapping is a must when dealing with large data, because the data is never allocated (hence consuming memory) but dynamically mapped into the address space, and managed by the system's virtual memory manager, which is always faster than the heap. So the system always uses the memory optimally, and commits data whenever it needs to, behind the application's back without impacting it.
Hope it helps.
As for point 3 - if the machine crashes and there are any pages that were not flushed to disk, then they are lost. Another thing is the waste of address space - mapping a file to memory consumes address space (and requires a contiguous area), and well, on 32-bit machines it's a bit limited. But since you've said about 100MB, it should not be a problem. And one more thing - expanding the size of the mmapped file requires some work.
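On the unflushed-pages point, the flip side is that you can force dirty pages out yourself at the durability points you care about; a minimal sketch (file name and mapping size are hypothetical):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Illustrative only: explicitly flushing a mapped buffer so a crash
    // loses at most the writes made since the last force().
    public class MappedFlushExample {
        public static void main(String[] args) throws IOException {
            FileChannel channel = new RandomAccessFile("data.log", "rw").getChannel();
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_WRITE, 0, 64 * 1024 * 1024);

            map.put("some record".getBytes());
            map.force(); // ask the OS to write the dirty pages in this mapping to disk
        }
    }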
By the way, this SO discussion can also give you some insights.
If you write fewer bytes it will be faster. What if you filtered it through a GZIPOutputStream, or what if you wrote your data into ZipFiles or JarFiles?
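A sketch of the filtering idea (buffer size is arbitrary):

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    // Illustrative only: every byte written through this stream is compressed
    // before it reaches the disk, trading CPU for fewer bytes of I/O.
    public class CompressedLogWriter {
        static OutputStream open(String path) throws IOException {
            return new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(path), 64 * 1024));
        }
    }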
As mentioned above, use NIO (a.k.a. new IO). There's also a new, new IO coming out.
The proper use of a RAID hard drive solution would help you, but that would be a pain.
I really like the idea of compressing the data. Go for the GZIPOutputStream, dude! That would double your throughput if the CPU can keep up. It is likely that you can take advantage of the now-standard dual-core machines, eh?
-Stosh
I did a study where I compared the write performance to a raw ByteBuffer versus the write performance to a MappedByteBuffer. Memory-mapped files are supported by the OS and their write latencies are very good, as you can see in my benchmark numbers. Performing synchronous writes through a FileChannel is approximately 20 times slower, and that's why people do asynchronous logging all the time. In my study I also give an example of how to implement asynchronous logging through a lock-free and garbage-free queue, for ultimate performance very close to a raw ByteBuffer.
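A very rough sketch of the two write paths being compared, not the benchmark itself (file names, record size and iteration count are illustrative):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Illustrative only: putting bytes into a mapped buffer versus synchronous
    // FileChannel writes, timed crudely with System.nanoTime().
    public class WritePathComparison {
        public static void main(String[] args) throws IOException {
            byte[] record = new byte[256];

            // Path 1: write through a memory mapping; the OS flushes pages later.
            FileChannel mappedChannel = new RandomAccessFile("mapped.dat", "rw").getChannel();
            MappedByteBuffer map = mappedChannel.map(FileChannel.MapMode.READ_WRITE, 0, 64 * 1024 * 1024);
            long t0 = System.nanoTime();
            for (int i = 0; i < 100000; i++) {
                map.put(record);
            }
            System.out.println("mapped puts: " + (System.nanoTime() - t0) + " ns");

            // Path 2: synchronous writes ("rwd" makes each write reach the device).
            FileChannel plainChannel = new RandomAccessFile("plain.dat", "rwd").getChannel();
            long t1 = System.nanoTime();
            for (int i = 0; i < 100000; i++) {
                plainChannel.write(ByteBuffer.wrap(record));
            }
            System.out.println("channel writes: " + (System.nanoTime() - t1) + " ns");
        }
    }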
I am developing a J2ME application that has a large amount of data to store on the device (in the region of 1MB, but variable). I can't rely on the file system, so I'm stuck with the Record Management System (RMS), which allows multiple record stores, but each has a limited size. My initial target platform, Blackberry, limits each to 64KB.
I'm wondering if anyone else has had to tackle the problem of storing a large amount of data in the RMS and how they managed it? I'm thinking of having to calculate record sizes and split one data set across multiple stores if it's too large, but that adds a lot of complexity to keep it intact.
There are lots of different types of data being stored, but only one set in particular will exceed the 64KB limit.
For anything past a few kilobytes you need to use either JSR 75 or a remote server. RMS records are extremely limited in size and speed, even in some higher end handsets. If you need to juggle 1MB of data in J2ME the only reliable, portable way is to store it on the network. The HttpConnection class and the GET and POST methods are always supported.
On the handsets that support JSR 75 FileConnection it may be a valid alternative, but without code signing it is a user-experience nightmare. Almost every single API call will invoke a security prompt with no blanket permission choice. Companies that deploy apps with JSR 75 usually need half a dozen binaries for every port just to cover a small part of the possible certificates. And this is just for the manufacturer certificates; some handsets only have carrier-locked certificates.
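A minimal sketch of pulling a data set over HttpConnection (URL handling and buffering are illustrative):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import javax.microedition.io.Connector;
    import javax.microedition.io.HttpConnection;

    // Illustrative only: fetch a data set from the server instead of storing it in RMS.
    public class RemoteFetch {
        static byte[] fetch(String url) throws IOException {
            HttpConnection conn = (HttpConnection) Connector.open(url);
            conn.setRequestMethod(HttpConnection.GET);
            InputStream in = conn.openInputStream();
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } finally {
                in.close();
                conn.close();
            }
        }
    }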
RMS performance and implementation varies wildly between devices, so if platform portability is a problem, you may find that your code works well on some devices and not others. RMS is designed to store small amounts of data (High score tables, or whatever) not large amounts.
You might find that some platforms are faster with files stored in multiple record stores. Some are faster with multiple records within one store. Many are ok for storage, but become unusably slow when deleting large amounts of data from the store.
Your best bet is to use JSR-75 instead where available, and create your own file store interface that falls back to RMS if nothing better is supported.
Unfortunately when it comes to JavaME, you are often drawn into writing device-specific variants of your code.
I think the most flexible approach would be to implement your own file system on top of the RMS. You can handle the RMS records in a similar way to blocks on a hard drive and use an inode structure or similar to spread logical files over multiple blocks. I would recommend implementing a byte- or stream-oriented interface on top of the blocks, and then possibly making another API layer on top of that for writing special data structures (or simply make your objects serializable to the data stream).
Tanenbaum's classical book on operating systems covers how to implement a simple file system, but I am sure you can find other resources online if you don't like paper.
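As a very rough illustration of the block idea (all names, the block size and the omitted error handling are mine, not a finished design):

    import javax.microedition.rms.RecordStore;
    import javax.microedition.rms.RecordStoreException;

    // Illustrative only: treat RMS records as fixed-size blocks and keep the
    // record IDs as a tiny "inode" so a large byte[] can span many records.
    public class RmsBlockStore {
        private static final int BLOCK_SIZE = 4096; // well under the 64KB record limit

        // Returns the record IDs ("inode") needed to read the data back later.
        public static int[] write(String storeName, byte[] data) throws RecordStoreException {
            RecordStore store = RecordStore.openRecordStore(storeName, true);
            try {
                int blocks = (data.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
                int[] recordIds = new int[blocks];
                for (int i = 0; i < blocks; i++) {
                    int offset = i * BLOCK_SIZE;
                    int length = Math.min(BLOCK_SIZE, data.length - offset);
                    recordIds[i] = store.addRecord(data, offset, length);
                }
                return recordIds;
            } finally {
                store.closeRecordStore();
            }
        }
    }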
Under Blackberry OS 4.6 the RMS store size limit has been increased to 512KB, but this isn't much help as many devices will likely not have support for 4.6. The other option on Blackberry is the Persistent Store, which has a record size limit of 64KB but no limit on the size of the store (other than the physical limits of the device).
I think Carlos and izb are right.
It is quite simple: use JSR 75 (FileConnection) and remember to sign your MIDlet with a valid (trusted) certificate.
For read-only access I'm arriving at acceptable times (within 10s) by indexing a resource file. I've got two ~800KB CSV price list exports. The program classes and both those files compress to a 300KB JAR.
On searching I display a List and run two Threads in the background to fill it, so the first results come pretty quickly and are viewable immediately. I first implemented a simple linear search, but that was too slow (~2 min).
Then I indexed the file (which is alphabetically sorted) to find the beginnings of each letter. Now, before parsing line by line, I first InputStreamReader.skip() to the desired position, based on the first letter. I suspect the delay comes mostly from decompressing the resource, so splitting the resources would speed it up further. I don't want to do that, so as not to lose the advantage of easy upgrades. The CSVs are exported without any preprocessing.
I'm just starting to code for JavaME, but have experience with old versions of PalmOS, where all data chunks are limited in size, requiring the design of data structures using record indexes and offsets.
Thanks everyone for the useful comments. In the end the simplest solution was to limit the amount of data being stored, implementing code that adjusts the data according to how large the store is, and fetching data from the server on demand if it's not stored locally. It's interesting that the limit is increased in OS 4.6; with any luck my code will simply adjust on its own and store more data :)
Developing a J2ME application for Blackberry without using the .cod compiler limits the use of JSR 75 somewhat, since we can't sign the archive. As pointed out by Carlos, this is a problem on any platform, and I've had similar issues using the PIM part of it. The RMS seems to be incredibly slow on the Blackberry platform, so I'm not sure how useful an inode/b-tree file system on top would be, unless data was cached in memory and written to RMS in a low-priority background thread.