I'm testing data structure performance with very large data.
As a temporary workaround (see here) I want to write memory to disk.
I want to test with very big datasets - how can I make it so that when the java VM runs out of memory it writes some of it to disk?
Since we're talking about temporary fixes here, you could always increase your page file (the swap file on most Linux distros) if you need a little extra space.
Here's a link from Microsoft:
http://windows.microsoft.com/en-us/windows-vista/change-the-size-of-virtual-memory
Linux:
http://www.cyberciti.biz/faq/linux-add-a-swap-file-howto/
Now let me say that this isn't a good long-term fix, but I understand that sometimes developers just need to make it work. If this is something that will ever see a production environment, you may want to look at a tool like Hadoop. It lets you distribute your data processing over multiple JVMs, and it is built for a "big data" application like the one you're describing.
Maybe you can use a stream, ideally a buffered one. I think that would be the best choice for testing such a structure. If you read from disk through a stream, no additional objects are created beyond the necessary ones, so nearly all of the JVM memory stays available for your structure. But maybe you can describe your problem in more detail?
Related
This looks like a long question because of all the context. There are 2 questions inside the novel below. Thank you for taking the time to read this and provide assistance.
Situation
I am working on a scalable datastore implementation that can support working with data files from a few KB to a TB or more in size on a 32-bit or 64-bit system.
The datastore utilizes a Copy-on-Write design; always appending new or modified data to the end of the data file and never doing in-place edits to existing data.
The system can host 1 or more databases, each represented by a file on disk.
The details of the implementation are not important; the only important detail being that I need to constantly append to the file and grow it from KB, to MB, to GB to TB while at the same time randomly skipping around the file for read operations to answer client requests.
First-Thoughts
At first glance I knew I wanted to use memory-mapped files so I could push the burden of efficiently managing the in-memory state of the data onto the host OS and out of my code.
Then all my code needs to worry about is serializing the append-to-file operations on-write, and allowing any number of simultaneous readers to seek in the file to answer requests.
Design
Because the individual data-files can grow beyond the 2GB limit of a MappedByteBuffer, I expect that my design will have to include an abstraction layer that takes a write offset and converts it into an offset inside of a specific 2GB segment.
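A minimal sketch of the offset translation described above, assuming a fixed power-of-two segment size. The class name `SegmentAddress` and the 1 GiB segment size are illustrative choices, not part of the original design.

```java
/** Hypothetical helper: splits an absolute file offset into (segment index, offset in segment). */
final class SegmentAddress {
    // Assumed segment size: 1 GiB, a power of two that stays under the 2GB MappedByteBuffer limit.
    static final int SEGMENT_BITS = 30;
    static final long SEGMENT_SIZE = 1L << SEGMENT_BITS;

    /** Index of the segment that contains the given absolute file offset. */
    static int segmentIndex(long fileOffset) {
        return (int) (fileOffset >>> SEGMENT_BITS);
    }

    /** Position of the given absolute file offset inside its segment. */
    static int offsetInSegment(long fileOffset) {
        return (int) (fileOffset & (SEGMENT_SIZE - 1));
    }

    public static void main(String[] args) {
        long offset = 5L * SEGMENT_SIZE + 42;
        System.out.println(segmentIndex(offset) + " / " + offsetInSegment(offset)); // prints "5 / 42"
    }
}
```

A record that straddles a segment boundary still needs separate handling (for example, by splitting the read across two segments); this sketch only covers the address arithmetic.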
So far so good...
Problems
This is where I started to get hung up and think that going with a different design (proposed below) might be the better way to do this.
From reading through 20 or so "memory mapped" related questions here on SO, it seems mmap calls are sensitive to wanting contiguous runs of memory when allocated. So, for example, on a 32-bit host OS if I tried to mmap a 2GB file, due to memory fragmentation, my chances are slim that mapping will succeed and instead I should use something like a series of 128MB mappings to pull an entire file in.
When I think of that design, even using 1024MB mmap sizes, a DBMS hosting a few huge databases, each represented by, say, a 1TB file, now has thousands of memory-mapped regions in memory. In my own testing on Windows 7, trying to create a few hundred mmaps across a multi-GB file, I didn't just run into exceptions: I actually got the JVM to segfault every time I tried to allocate too much, and in one case the video on my Windows 7 machine cut out and re-initialized with an OS error popup I've never seen before.
Regardless of the argument of "you'll never likely handle files that large" or "this is a contrived example", the fact that I could code something up like that with those types of side effects put my internal alarm on high alert and made me consider an alternative impl (below).
BESIDES that issue, my understanding of memory-mapped files is that I have to re-create the mapping every time the file grows, and since this file is append-only by design, it is literally constantly growing.
I can combat this to some extent by growing the file in chunks (say 8MB at a time) and only re-create the mapping every 8MB, but the need to constantly be re-creating these mappings has me nervous especially with no explicit unmap feature supported in Java.
Question #1 of 2
Given all of my findings up to this point, I would dismiss memory-mapped files as a good solution for primarily read-heavy solutions or read-only solutions, but not write-heavy solutions given the need to re-create the mapping constantly.
But then I look around at the landscape around me, with solutions like MongoDB embracing memory-mapped files all over the place, and I feel like I am missing some core component here (I do know it allocs in something like 2GB extents at a time, so I imagine they are working around the re-map cost with this logic AND helping to maintain sequential runs on-disk).
At this point I don't know if the problem is Java's lack of an unmap operation that makes this so much more dangerous and unsuitable for my uses or if my understanding is incorrect and someone can point me North.
Alternative Design
An alternative design to the memory-mapped one proposed above that I will go with if my understanding of mmap is correct is as follows:
Define a direct ByteBuffer of a reasonable configurable size (2, 4, 8, 16, 32, 64, 128KB roughly) making it easily compatible with any host platform (don't need to worry about the DBMS itself causing thrashing scenarios) and using the original FileChannel, perform specific-offset reads of the file 1 buffer-capacity-chunk at a time, completely forgoing memory-mapped files at all.
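As a rough sketch of this alternative, the positional `FileChannel.read(ByteBuffer, long)` overload reads at an explicit offset without moving the channel's own position, which suits many readers sharing one channel. The class and method names (`PositionalReader`, `readFully`) and the tiny 4-byte buffer are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class PositionalReader {
    /**
     * Fills the buffer starting at the given file position, looping over short reads.
     * Returns the number of bytes read, which is less than the buffer's capacity
     * only if the end of the file was reached first.
     */
    static int readFully(FileChannel channel, ByteBuffer buffer, long position) throws IOException {
        int total = 0;
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, position + total);
            if (n < 0) {
                break; // end of file
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("datastore", ".bin");
        try {
            Files.write(file, "0123456789".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                // A real datastore would use a larger, configurable capacity (e.g. 64KB).
                ByteBuffer buffer = ByteBuffer.allocateDirect(4);
                int n = readFully(channel, buffer, 3);
                buffer.flip();
                byte[] out = new byte[n];
                buffer.get(out);
                System.out.println(new String(out, StandardCharsets.US_ASCII)); // prints "3456"
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The "did I read enough to load the complete record?" question then becomes a check of the returned byte count against the record length, repeating the read with the next chunk when the record is longer than one buffer.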
The downside being that now my code has to worry about things like "did I read enough from the file to load the complete record?"
Another down-side is that I don't get to make use of the OS's virtual memory logic, letting it keep more "hot" data in-memory for me automatically; instead I just have to hope the file cache logic employed by the OS is big enough to do something helpful for me here.
Question #2 of 2
I was hoping to get a confirmation of my understanding of all of this.
For example, maybe the file cache is fantastic, that in both cases (memory mapped or direct reads), the host OS will keep as much of my hot data available as possible and the performance difference for large files is negligible.
Or maybe my understanding of the sensitive requirements for memory-mapped files (contiguous memory) is incorrect and I can ignore all that.
You might be interested in https://github.com/peter-lawrey/Java-Chronicle
In this I create multiple memory mappings to the same file (the size is a power of 2 up to 1 GB). The file can be any size (up to the size of your hard drive).
It also creates an index so you can find any record at random and each record can be any size.
It can be shared between processes and used for low latency events between processes.
I make the assumption you are using a 64-bit OS if you want to use large amounts of data. In this case a List of MappedByteBuffer will be all you ever need. It makes sense to use the right tools for the job. ;)
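To illustrate the "List of MappedByteBuffer" idea, here is a minimal sketch. It is not the Java-Chronicle implementation; the class name `SegmentedMapping`, its methods, and the up-front sizing of the file are assumptions for the example. The segment size must be a power of two no larger than 1 GiB (`segmentBits <= 30`) so each mapping stays under the 2GB limit.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper that exposes one large file as a list of fixed-size mapped segments. */
final class SegmentedMapping {
    private final List<MappedByteBuffer> segments = new ArrayList<>();
    private final int segmentBits;
    private final long segmentMask;

    SegmentedMapping(FileChannel channel, long totalSize, int segmentBits) throws IOException {
        this.segmentBits = segmentBits;
        this.segmentMask = (1L << segmentBits) - 1;
        // Assumption: grow the file to its full size up front so no remapping is needed later.
        if (channel.size() < totalSize) {
            channel.write(ByteBuffer.wrap(new byte[1]), totalSize - 1);
        }
        long segmentSize = 1L << segmentBits;
        for (long start = 0; start < totalSize; start += segmentSize) {
            long length = Math.min(segmentSize, totalSize - start);
            segments.add(channel.map(MapMode.READ_WRITE, start, length));
        }
    }

    int segmentCount() {
        return segments.size();
    }

    byte get(long offset) {
        return segments.get((int) (offset >>> segmentBits)).get((int) (offset & segmentMask));
    }

    void put(long offset, byte value) {
        segments.get((int) (offset >>> segmentBits)).put((int) (offset & segmentMask), value);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segments", ".bin");
        file.toFile().deleteOnExit(); // mapped files may not be deletable immediately on some platforms
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Tiny 4 KiB segments keep the demo small; real use would pick something like 2^30.
            SegmentedMapping mapping = new SegmentedMapping(channel, 10_000, 12);
            mapping.put(9_000, (byte) 7);
            System.out.println(mapping.segmentCount() + " segments, value " + mapping.get(9_000));
        }
    }
}
```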
I have found it performs well even with data sizes around 10x your main memory size (I was using a fast SSD drive, so YMMV).
I think you shouldn't worry about mmap'ping files up to 2GB in size.
Looking at the sources of MongoDB as an example of a DB making use of memory-mapped files, you'll find it always maps the full data file in MemoryMappedFile::mapWithOptions() (which calls MemoryMappedFile::map()). DB data spans multiple files, each up to 2GB in size. It also preallocates data files, so there's no need to remap as the data grows, and this prevents file fragmentation. Generally, you can take inspiration from the source code of this DB.
Using log4j on Unix, which Appender would perform the best to write 1000Meg :
1) Using RollingFileAppender writing 10 files of 100 Meg
or
2) Using a FileAppender and writing a single 1000Meg file
In other words, using java on unix, does the size matter?
Thank you
There is no Java-side performance difference between writing to a small file and writing to a large file. There might be a small difference at the OS level when a file gets big enough that an extra level of index blocks is required (FS dependent), but it is probably not worth worrying about.
There will be a performance cost in implementing the file rolling behavior. The appender has to:
test / remember how big the file is,
close the current one,
rename it,
open a new file.
My gut feeling is that this is not likely to be significant. (However, it would be worth measuring to see if the performance impact should be a concern. Also, you should probably ask yourself if you are not doing too much logging.)
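The rolling steps listed above can be sketched in plain Java as follows. This is not log4j's RollingFileAppender code; the class `RollingWriter`, its numbering scheme for rolled files, and the lack of buffering are all simplifying assumptions, meant only to show how little work a roll involves.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical size-based rolling writer illustrating the four steps of a roll. */
final class RollingWriter implements AutoCloseable {
    private final Path file;
    private final long maxBytes;
    private OutputStream out;
    private long written; // step 1: remember how big the current file is
    private int rolled;

    RollingWriter(Path file, long maxBytes) throws IOException {
        this.file = file;
        this.maxBytes = maxBytes;
        this.written = Files.exists(file) ? Files.size(file) : 0;
        this.out = open();
    }

    private OutputStream open() throws IOException {
        return Files.newOutputStream(file, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void write(String line) throws IOException {
        byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
        if (written > 0 && written + bytes.length > maxBytes) {
            roll();
        }
        out.write(bytes);
        written += bytes.length;
    }

    private void roll() throws IOException {
        out.close();                                           // step 2: close the current file
        rolled++;
        Path target = file.resolveSibling(file.getFileName() + "." + rolled);
        Files.move(file, target, StandardCopyOption.REPLACE_EXISTING); // step 3: rename it
        out = open();                                          // step 4: open a new file
        written = 0;
    }

    int rolledCount() {
        return rolled;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

Used as `try (RollingWriter w = new RollingWriter(path, 100L * 1024 * 1024)) { w.write("event"); }`, the only per-write overhead is one size comparison; the close/rename/open sequence runs once per rolled file.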
You have to compare all of the above against the advantages of file rolling:
Having a bounded size on log files means that your logging won't fill the disk, causing problems for the application and potentially others on the same machine.
Smaller log files can make it easier / quicker to do searches for events at specific times. (Running less on a 1000Mb file can be painful ...)
They'll both easily write 1000MB files. I don't see why they should perform differently.
You do need the RollingFileAppender, though, in order to set the total maximum size that the log file(s) can reach. Otherwise you may run out of hard disk space, assuming that your application has traffic.
I think it's always preferable to use small files than large files because they are more manageable. Also, consider that with large files you may have problems in the case of file-system full because of the risk of having to remove the log file when the process is up and running to free up disk space.
Yesterday I read something about application optimization and how a programmer should find the most used parts of the program, then profile and modify them to get the most benefit (when looking at the time/work invested vs. memory/speed gains). Now, I've run the Eclipse profiler and got VisualVM, but I don't know how to use this data properly. My primary concerns are memory usage (I'm generating an XML and either storing it to disk as a zip or flushing it as a zip to the user for download) and slowdowns from the database (I suspect my indexes aren't there or aren't good, and in any case I don't know much about them, so I can't tell you more :) ), but I don't even know how to start. For the first case, VisualVM shows that the program uses up to 200MB, but when I inspect the Heap Dump and click the most used object (or whatever it's called), the information is overwhelming. For the second case I know even less, other than that Toad has some tools.
What I want to know is how to start doing this, and when I'm satisfied with the local performance, how to do it on the production application.
Edit1: So, for a concrete example of memory usage (i'm generating an XML and either storing it to disk as a zip or flushing it as zip to the user for download). This is what I get when I choose "Heap dump", then choose top 20 objects by retained size and open the details:
and this is what I get when I opened Profiler on the same use case:
The question is, what do these screens tell me? :)
As far as database applications go, I would start by reading Cary Millsap's excellent articles:
http://method-r.com/downloads/cat_view/38-papers-and-articles
Search for "Making friends with the Oracle Database" for example...
I am writing a media transcoding server in which I need to move files within the filesystem, and I am unsure whether Java's renameTo can be replaced by something that would give me better performance. I was considering using exec("mv file1 file2"), but that would be my last resort.
Has anyone had similar experiences, or can anyone help me find a solution?
First of all, renameTo is likely just wrapping a system call.
Secondly, moving a file does not involve copying any data from the file itself (at least, in unix). All that happens is that the link from the old directory is removed, and a link from the new directory is added. I don't think you're going to find any performance improvements here.
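For reference, a minimal sketch of a move using `java.nio.file.Files.move` (available since Java 7) as an alternative to `File.renameTo`. It is not expected to be faster, since both end up in a rename system call when source and target are on the same filesystem, but it reports failures as exceptions instead of a boolean. The class and method names (`FileMover`, `move`) and the fallback policy are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class FileMover {
    /**
     * Tries an atomic rename first. If the filesystem cannot do that (for example,
     * source and target are on different file stores), falls back to a regular move,
     * which may copy the file's data and then delete the source.
     */
    static void move(Path source, Path target) throws IOException {
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("transcode");
        Path source = Files.write(dir.resolve("file1"), new byte[] {1, 2, 3});
        Path target = dir.resolve("file2");
        move(source, target);
        System.out.println(Files.exists(source) + " " + Files.exists(target)); // prints "false true"
    }
}
```

Only the cross-filesystem fallback copies data; within one filesystem the cost is a directory update, which matches the point above that there is little left to optimise.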
I don't think that using the default File methods has a noticeable performance penalty, as most of these JVM-to-OS functions already wrap native calls.
The only case where an exec would be needed is if you wanted to do something with different rights than the program or use a special tool to copy/move the file. (e.g. smart-move when ntfs-junctions are involved)
If rename is a significant performance bottleneck, then you need to improve your hardware, as this is your main constraint. The software is a trivial portion of the time spent and optimising it will make little difference.
What is your disk configuration? How is it optimised for writes?
I need to perform a simple grep and other manipulations on large files in Java. I am not that familiar with the Java NIO utilities, but I am assuming that is what I need to use. What resources or helpful tips do you have for reading/writing large files? Also, I am working on an SWT application and need to display parts of that data within a text area on a GUI.
java.io.RandomAccessFile uses long for file-pointer offset so should be able to cope. However, you should read a chunk at a time otherwise overheads will be high. FileInputStream works similarly.
Java NIO shouldn't be too difficult. You don't need to mess around with Selectors or similar. In fact, prior to JDK7 you can't select with files. However, avoid mapping files. There is no unmap, so if you map a lot of files you'll run out of address space on 32-bit systems, or run into other problems (NIO does attempt to call GC, but it's a bit of a hack).
If all you are doing is reading the entire file a chunk at a time, with no special processing, then nio and java.io.RandomAccessFile are probably overkill. Just read and process the content of the file a block at a time. Ensure that you use a BufferedInputStream or BufferedReader.
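A simple grep along these lines might look like the sketch below. The class and method names (`SimpleGrep`, `grep`) and the UTF-8 charset are assumptions; the reader is buffered, so only one line is held in memory at a time apart from the collected matches.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class SimpleGrep {
    /** Returns every line of the file that contains a match for the pattern. */
    static List<String> grep(Path file, Pattern pattern) throws IOException {
        List<String> matches = new ArrayList<>();
        // Files.newBufferedReader returns a BufferedReader, so the file is read a block at a time.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (pattern.matcher(line).find()) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("grep", ".txt");
        try {
            Files.write(file, Arrays.asList("alpha", "beta error", "gamma", "error: delta"),
                    StandardCharsets.UTF_8);
            System.out.println(grep(file, Pattern.compile("error"))); // prints "[beta error, error: delta]"
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

If the number of matching lines can itself be very large, it would be safer to hand each match to a callback (or append it to the GUI in batches) rather than collect everything in a list.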
If you have to read the entire file to do what you are doing, and you read only one file at a time, then you will gain little benefit from nio.
Maybe a little bit off topic: have a look at VFS by Apache. It was originally meant to be a library for hiding the ftp/http/file/whatever system behind a file system facade from your application's point of view. I am mentioning it here because I have had positive experiences with accessing large files (via ftp) for searching, reading, copying, etc. (large in that context means > 15MB) with this library.