Stated generally: how do you implement a method byte[] get(offset, length) for a memory-mapped file that is bigger than 2 GB in Java?
With context:
I'm trying to efficiently read files that are bigger than 2 GB with random I/O. The obvious idea is to use Java NIO and the memory-mapping API.
The problem is the 2 GB limit on a single mapping (a MappedByteBuffer is indexed by an int). One solution would be to map multiple pages of up to 2 GB each and index into them through the offset.
There's a similar solution here:
Binary search in a sorted (memory-mapped ?) file in Java
The problem with that solution is that it's designed to read a single byte, while my API is supposed to return a byte[] (so it would be something like read(offset, length)).
Would it just work to change that final get() to a get(offset, length)? And what happens when the byte[] I'm reading spans two pages?
No, changing get() to get(offset, length) in my answer to Binary search in a sorted (memory-mapped ?) file in Java would not work, because of the boundary between memory-mapped files, just as you suspect. I can see two possible solutions:
Overlap the memory-mapped files. When you do a read, pick the mapping whose start byte is closest before the read's start byte. This approach won't work for reads larger than 50% of the maximum memory-map size.
Create a byte-array read method that copies from two different memory-mapped files. I'm not keen on this approach, as I think some of the performance gain will be lost because the resulting array is not memory mapped.
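For illustration, here is a minimal sketch of that second option, stitching a read together across mapping boundaries. The class name and the 1 GB chunk size are assumptions for the example, not part of the answer:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ChunkedReader implements AutoCloseable {
    // 1 GB per mapping: comfortably below the 2 GB int limit of a
    // single MappedByteBuffer.
    private static final long CHUNK_SIZE = 1L << 30;
    private final MappedByteBuffer[] chunks;
    private final FileChannel channel;

    public ChunkedReader(String path) throws IOException {
        channel = new RandomAccessFile(path, "r").getChannel();
        long size = channel.size();
        chunks = new MappedByteBuffer[(int) ((size + CHUNK_SIZE - 1) / CHUNK_SIZE)];
        for (int i = 0; i < chunks.length; i++) {
            long start = i * CHUNK_SIZE;
            chunks[i] = channel.map(FileChannel.MapMode.READ_ONLY,
                                    start, Math.min(CHUNK_SIZE, size - start));
        }
    }

    // Copies 'length' bytes starting at 'offset'. The loop stitches the
    // result together when the range spans a mapping boundary.
    public byte[] get(long offset, int length) {
        byte[] result = new byte[length];
        int copied = 0;
        while (copied < length) {
            long pos = offset + copied;
            ByteBuffer chunk = chunks[(int) (pos / CHUNK_SIZE)].duplicate();
            chunk.position((int) (pos % CHUNK_SIZE));
            int n = Math.min(length - copied, chunk.remaining());
            chunk.get(result, copied, n);
            copied += n;
        }
        return result;
    }

    @Override
    public void close() throws IOException {
        // Closing the channel does not unmap the buffers; the mappings
        // are only released when the buffers are garbage collected.
        channel.close();
    }
}
```

Using duplicate() keeps each shared mapping's position untouched, so concurrent readers don't interfere with one another.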
Related
I need to read the contents of a binary file containing floats that represent the positions of a large number of particles, and pass this information to my program so the particles can be rendered to the screen. The file contains the position of every particle at every frame, and thus can be 10 GB or more.
My plan is to fill all available memory with the next frames the program will need. Once a frame has been shown, I need to free that space and load a new frame from the file. If a certain frame is requested (say, frame 12), I need to go to this frame in the file and read it, as well as all other successive frames that fit in the memory.
The question is, how can I read, and then store this information in a way that is efficient?
I have done some research, and read similar questions.
Arrays and Vectors are not an option since they top out at about 2 GB of data, and I could be loading more than that into RAM. Besides, initializing an array of such a size seems to lead to an OutOfMemoryError, and growing an array takes too much time.
Buffer a large file; BufferedInputStream limited to 2gb; Arrays limited to 2^31 bytes
MappedByteBuffer could be an option, but it is also limited to 2 GB, and I do not understand exactly how it works.
I could also use an external library that keeps data outside the heap (but which one?), or a direct ByteBuffer, I guess, but what is the best option in this case?
In short, I need to read the file fast (probably in parts, unless there is another way), keep as much of it in memory as I can, and pass this information to another thread, frame by frame. How do I read this file, and where do I store the data in memory?
Read large files in Java
This source seems to address something similar, but it is about reading text, and it seems he could use a MappedByteBuffer or a BufferedInputStream, since his file is only 1.5 GB.
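As a starting point for the load-frames-on-demand plan, here is a minimal sketch of random access to one frame with a FileChannel. The fixed-size-frame layout of raw big-endian floats, and the readFrame name and parameters, are assumptions for the example:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class FrameReader {
    // Reads frame 'frameIndex', assuming every frame holds exactly
    // 'floatsPerFrame' raw big-endian floats back to back.
    static float[] readFrame(FileChannel ch, long frameIndex, int floatsPerFrame)
            throws IOException {
        long frameBytes = (long) floatsPerFrame * Float.BYTES;
        // A direct buffer lets the OS fill it without an extra on-heap copy.
        ByteBuffer buf = ByteBuffer.allocateDirect((int) frameBytes);
        long base = frameIndex * frameBytes;
        while (buf.hasRemaining()) {
            if (ch.read(buf, base + buf.position()) < 0) break; // hit EOF
        }
        buf.flip();
        float[] frame = new float[buf.remaining() / Float.BYTES];
        buf.asFloatBuffer().get(frame);
        return frame;
    }

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = new RandomAccessFile(args[0], "r").getChannel()) {
            float[] positions = readFrame(ch, 12, 3_000_000); // e.g. frame 12
        }
    }
}
```

A background thread can keep calling readFrame for upcoming frames and hand the arrays to the renderer through a bounded queue, which caps memory use at the queue's capacity.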
I have a huge file with a unique word on each line. The file is around 1.6 GB (and I have to sort other files after this which are around 15 GB). Until now, for smaller files, I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to do this short of writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for this case?
If I have to write a sorting program myself, which one should I use: quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In the case of well-structured data, where you need to sort by one or more specific fields (and system tools might not be helpful), you are probably better off using a datastore that supports sorting. MongoDB comes to mind as a good fit, given that the size doesn't exceed a few hundred GB. Other NoSQL datastores might also fit the bill, although Mongo's simplicity of use and installation and its support for JSON data make it a really good candidate.
If you really want to go the Java route, it gets tricky. This is the kind of question asked at job interviews, where I would never actually expect anybody to implement the code. The general solution, however, is merge sort (using random-access files is a bad idea because it amounts to insertion sort, i.e., a non-optimal running time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it, and then writing it back to a new file on disk. After you have read the whole file, you can start merging the chunk files two at a time by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files, then continue with the second generation, until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort; the academic recursive algorithm is the top-down approach.
Note that intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation may be slightly more complex, but it reduces the number of I/O operations required.
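For instance, a minimal sketch of the multiway merge step, assuming each sorted chunk was written with one record per line (the chunk format and method name are assumptions for the example):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.PriorityQueue;

public class MultiwayMerge {
    static void mergeChunks(List<File> chunks, File out) throws IOException {
        // Each heap entry pairs a chunk's current line with its reader,
        // ordered by the line, so poll() always yields the global minimum.
        PriorityQueue<Object[]> heap = new PriorityQueue<>(
                (a, b) -> ((String) a[0]).compareTo((String) b[0]));
        for (File f : chunks) {
            BufferedReader r = new BufferedReader(new FileReader(f));
            String first = r.readLine();
            if (first != null) heap.add(new Object[] { first, r });
            else r.close(); // empty chunk
        }
        try (BufferedWriter w = new BufferedWriter(new FileWriter(out))) {
            while (!heap.isEmpty()) {
                Object[] top = heap.poll();
                w.write((String) top[0]);
                w.newLine();
                BufferedReader r = (BufferedReader) top[1];
                String next = r.readLine();
                if (next != null) heap.add(new Object[] { next, r });
                else r.close(); // this chunk is drained
            }
        }
    }
}
```

With k chunks, the heap holds only k lines at a time, so the merge runs in near-constant memory no matter how large the input is.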
Please also see these links.
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content into an array (as long as you can't grow your heap).
So you're facing streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content through your program, which should perform the sorting by outputting either to a random-access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with integers you may wish to use around 100,000 elements per read; with Strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort those n elements and write them to a temporary file with a unique name.
Now, since all the files are sorted, the smallest elements are at the start of each. You can then just iterate over the files, repeatedly finding the smallest current element and writing it to the new final output, until you have processed all the elements.
This approach reduces the amount of RAM needed, relying on drive space instead, and lets you handle sorting of files of any size.
Build an array of the record positions inside the file (a kind of index); maybe that would fit into memory even if the records don't. You need an 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file using the index pointers to fetch the records in the needed order.
This also works if the records are not all the same size.
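A rough sketch of the idea, assuming one record per line (readLine() gives us record boundaries; fixed-size records would be even simpler). For brevity this uses a boxed List&lt;Long&gt;; a long[] plus a hand-rolled sort gets you down to the 8 bytes per record mentioned above:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class IndexSort {
    static void indexSort(RandomAccessFile in, RandomAccessFile out)
            throws IOException {
        // 1. One sequential pass: remember where each record starts.
        List<Long> index = new ArrayList<>();
        in.seek(0);
        long pos = 0;
        while (in.readLine() != null) {
            index.add(pos);
            pos = in.getFilePointer();
        }
        // 2. Sort the offsets by the records they point at; records are
        //    loaded only for the comparison and never retained.
        index.sort((a, b) -> {
            try {
                in.seek(a);
                String ra = in.readLine();
                in.seek(b);
                String rb = in.readLine();
                return ra.compareTo(rb);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        // 3. Write the records to the final file in index order.
        for (long off : index) {
            in.seek(off);
            out.writeBytes(in.readLine() + "\n");
        }
    }
}
```

The trade-off is clear: almost no heap is needed, but the comparator performs two seeks per comparison, so this only pays off when memory, not I/O, is the binding constraint.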
I have a large data set in the following format:
In total there are 3,687 object files, each of which contains 2,000,000 records. Each file is 42 MB in size.
Each record contains the following:
An id (Integer value)
Value1 (Integer)
Value2 (Integer)
Value3 (Integer)
The content of each file is not sorted or ordered in any way as they are observed during a data collection process.
Ideally, I want to build an index for this data (indexed by the id), which would mean the following:
1. Divide the set of ids into manageable chunks.
2. Scan the files to get the data related to the current working set of ids.
3. Build the index.
4. Go over the next chunk and repeat steps 1, 2, 3.
To me this sounds fine, but loading 152 GB back and forth is time-consuming, and I wonder about the best possible approach, or even whether Java is actually the right language for such a process.
I have 256 GB of RAM and 32 cores on my machine.
Update:
Let me modify this: putting I/O aside, and assuming the file is in memory in a byte array,
what would be the fastest possible way to decode a 42 MB object file that has 2,000,000 records, where each record contains 4 serialized Integers?
You've made a very poor choice of file format. I would convert the lot from serialized Integers to binary ints, written with DataOutputStream.writeInt() and read with DataInputStream.readInt(), with buffered streams underneath in both cases. You will save masses of disk space, which will therefore save you I/O time as well, and you also save all the serialization overhead. And change your collection software to use this format in future. The conversion will take a while, but it only happens once.
Or else use a database, as suggested, again with native ints rather than serialized objects.
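A hedged sketch of that one-time conversion, assuming the Integers were written one after another with a single ObjectOutputStream (the exact stream layout is an assumption; adjust the reading side to match how the files were actually produced). The 2,000,000 records of 4 fields are from the question:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

public class Convert {
    static final int FIELDS = 2_000_000 * 4; // records x ints per record

    static void convert(File src, File dst)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                     new BufferedInputStream(new FileInputStream(src)));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(dst)))) {
            // Each serialized Integer becomes 4 raw bytes.
            for (int i = 0; i < FIELDS; i++) {
                out.writeInt((Integer) in.readObject());
            }
        }
    }

    // Reading the converted file back needs no object decoding at all.
    static int[] load(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(
                     new BufferedInputStream(new FileInputStream(f)))) {
            int[] data = new int[FIELDS];
            for (int i = 0; i < data.length; i++) {
                data[i] = in.readInt();
            }
            return data;
        }
    }
}
```

After conversion each file is a flat 32,000,000 bytes of ints (8,000,000 values of 4 bytes each), down from 42 MB, and loading one is a tight readInt() loop with no per-object overhead.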
So, what I would do is just load up each file and store the ids in some sort of sorted structure: std::map perhaps, or Java's equivalent. Given that it's probably about 10-20 lines of code to read in the filename, read the contents of the file into a map, close the file and move on to the next one, I'd probably just write the C++ to do that.
I don't really see what else you can or should do, unless you actually want to load it into a DBMS, which I don't think is an unreasonable suggestion at all.
Hmm, it seems the better way of doing this is to use some kind of DBMS. Load all your data into the database, and you can leverage its indexing, storage and querying facilities. Of course this depends on your requirements, and on whether or not a DBMS solution suits them.
Given that your available memory is greater than your dataset and you want very high performance, have you considered Redis? It's well suited to operations on simple data structures, and it is very fast.
Just be a bit careful about letting Java do default serialization when storing values. I've previously run into issues with my primitives getting autoboxed prior to serialization.
I have a Swing application which works on CSV files. It reads the full file line by line, computes some required statistics and shows the output.
The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data. The problem is that the JVM takes 4 times the memory of the file size (while processing an 86 MB file, the heap uses 377 MB of space; memory utilization checked using Java VisualVM).
Note:
I have used LineNumberReader for reading the file (because of a specific requirement; I can change it if that helps memory usage).
Every line is read with readLine(), and then split(",") is called on that String to get the individual fields of the record.
Each record is stored in a Vector for display in the JTable, other statistics are stored in a HashMap and a TreeMap, and summary data in a JavaBean class. One graph is also plotted using JFreeChart.
Please suggest how to reduce memory utilization, as I need to process 2 GB files.
Try giving OpenCSV a shot. It only stores the last read line when you use the readNext() method, which is perfect for large files; a sketch follows the feature list below.
From their website, the following are the features they support:
Arbitrary numbers of values per line
Ignoring commas in quoted elements
Handling quoted entries with embedded carriage returns (i.e. entries that span multiple lines)
Configurable separator and quote characters (or use sensible defaults)
Read all the entries at once, or use an Iterator-style model
Creating CSV files from String[] (i.e. automatic escaping of embedded quote chars)
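A minimal sketch of that streaming style (the file name is a placeholder, and the package name and the exceptions readNext() declares vary between OpenCSV versions):

```java
import com.opencsv.CSVReader;
import java.io.FileReader;

public class StreamingStats {
    public static void main(String[] args) throws Exception {
        long rows = 0;
        try (CSVReader reader = new CSVReader(new FileReader("big.csv"))) {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                // Fold each record into the running statistics here;
                // nothing but the current line is ever retained.
                rows++;
            }
        }
        System.out.println("rows = " + rows);
    }
}
```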
Use best practices to improve your program:
Write multithreaded code where it helps, to get better CPU utilization.
Set minimum and maximum heap sizes (-Xms/-Xmx) to make better use of RAM.
Use proper data structures and design.
Every Java object has a memory overhead, so if your Strings are really short, that could explain why you get 4 times the size of your file. You also have to count the size of the Vector and its internals. Note that a Map cache only helps if there are many duplicate values: only interned Strings (such as compile-time literals) automatically share storage, while Strings produced at runtime by split() are all distinct objects.
I think you should revise your design. Given your requirements
The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data
you don't need to store the whole file in memory. You need to read it entirely to compute your statistics, and that can certainly be done using a very small amount of memory. The JTable part can likewise be handled in a number of ways without requiring 2 GB of heap space for your program; something must be wrong when a whole CSV has to be kept in memory! Take a look at Apache Commons IO's LineIterator, for example (a sketch follows).
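A minimal sketch of that streaming approach with LineIterator, folding the statistics in line by line (the file name and the numeric column are placeholders):

```java
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

public class CsvStats {
    static void computeStats(File csv) throws IOException {
        LineIterator it = FileUtils.lineIterator(csv, "UTF-8");
        try {
            long count = 0;
            double sum = 0;
            while (it.hasNext()) {
                String[] fields = it.nextLine().split(",");
                count++;
                sum += Double.parseDouble(fields[1]); // hypothetical numeric column
            }
            System.out.printf("rows=%d, mean=%.3f%n", count, sum / count);
        } finally {
            LineIterator.closeQuietly(it);
        }
    }
}
```

Only one line is ever held in memory, so heap use no longer scales with file size.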
Increase the JVM heap size (-Xms and -Xmx). If you have the memory, this is the best solution. If you cannot do that, you will need a compromise combining data model and presentation (GUI) changes, usually at the cost of increased code complexity and potential for bugs.
Try modifying your statistics algorithms to do their work as the data is being read, rather than requiring it all to exist in memory. You may find that algorithms which approximate the statistics are sufficient.
If your data contains many duplicate String values, use a HashMap to build a cache of canonical instances (see the sketch after this list). Beware: caches are a notorious source of memory leaks, e.g. when they are not cleared before loading a different file.
Reduce the amount of data being displayed on the graph. It is common for a graph with a lot of data to have many points displayed at or near the same pixel. Consider thinning the data by merging multiple values at or near the same position on the x-axis. If your data set contains 2,000,000 points, for example, most of them will coincide with other nearby points, so your underlying data model does not need to store everything.
Beware of information overload. Would your JTable be meaningful to the user if it contained 2 GB worth of data? Perhaps you should paginate the table, reading only 1,000 entries from the file at a time for display.
I'm hesitant to suggest this, but during the loading process you could convert the CSV data into a file-based database (such as cdb). You could accumulate statistics and store some data for the graph during the conversion, and then use the database to quickly read a page of data at a time for the JTable, as suggested above.
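Regarding the duplicate-String cache suggested above, here is a minimal sketch of a canonicalizing map (a HashMap rather than a HashSet, since you need to get the canonical instance back out):

```java
import java.util.HashMap;
import java.util.Map;

public class StringCache {
    // One map per loaded file; clear() it before loading the next file,
    // or the cache itself becomes the leak warned about above.
    private final Map<String, String> canon = new HashMap<>();

    String canonical(String s) {
        String existing = canon.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    void clear() {
        canon.clear();
    }
}
```

Passing every field from split(",") through canonical() means each distinct value is stored once, however many rows repeat it.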
Dear StackOverflowers,
I am in the process of writing an application that sorts a huge number of integers from a binary file. I need to do it as quickly as possible, and the main performance issue is disk access time; since I make a multitude of reads, they slow down the algorithm quite significantly.
The standard way of doing this would be to fill ~50% of the available memory with a buffered object of some sort (BufferedInputStream etc.), transfer the integers from the buffered object into an array of integers (which takes up the rest of the free space), and sort the integers in the array. Save the sorted block back to disk, repeat the procedure until the whole file is split into sorted blocks, and then merge the blocks together.
This strategy for sorting the blocks uses only 50% of the available memory, since the data is essentially duplicated (50% for the cache and 50% for the array, while they hold the same data).
I am hoping to optimise this phase of the algorithm (sorting the blocks) by writing my own buffered class that caches data straight into an int array, so that the array can take up all of the free space instead of just 50% of it; this would reduce the number of disk accesses in this phase by a factor of 2. The thing is, I am not sure where to start.
EDIT:
Essentially, I would like to find a way to fill up an array of integers by executing only one read of the file. Another constraint is that the array has to use most of the free memory.
If any of the statements I made are wrong, or at least seem to be, please correct me.
Any help appreciated,
Regards
When you say limited, how limited... < 1 MB? < 10 MB? < 64 MB?
It makes a difference, since you won't actually get much benefit, if any, from large BufferedInputStreams; in most cases the default buffer size of 8192 bytes (JDK 1.6) is enough, and increasing it usually doesn't make much difference.
Using a smaller BufferedInputStream should leave you with nearly all of the heap to create and sort each chunk before writing them to disk.
You might want to look into the Java NIO libraries, specifically FileChannel and IntBuffer.
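For example, a hedged sketch that fills an int[] with a single positioned read, assuming the file holds raw big-endian ints (the method name and parameters are illustrative; count * 4 must fit in an int):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;

public class IntChunkReader {
    static int[] readIntChunk(FileChannel ch, long bytePos, int count)
            throws IOException {
        // The direct buffer lives off-heap, so the int[] below can still
        // claim most of the Java heap, which is what the question asks for.
        ByteBuffer bytes = ByteBuffer.allocateDirect(count * 4)
                                     .order(ByteOrder.BIG_ENDIAN);
        while (bytes.hasRemaining()) {
            if (ch.read(bytes, bytePos + bytes.position()) < 0) break; // EOF
        }
        bytes.flip();
        int[] ints = new int[bytes.remaining() / 4];
        bytes.asIntBuffer().get(ints); // one bulk transfer into the array
        return ints;
    }
}
```

Strictly speaking the kernel may split one read() into several calls, hence the loop, but there is still only a single pass over the data and no second on-heap copy.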
You don't give many hints, but two things come to mind. First, if you have many integers but not that many distinct values, bucket sort could be the solution.
Secondly, one word (OK, term) screams in my head when I hear this: external tape sorting. In the early days of computing (i.e. the stone age), data lived on tapes, and it was very hard to sort data spread over multiple tapes. That situation is very similar to yours. Indeed, merge sort was the most often used sort in those days, and as far as I remember, Knuth's TAOCP has a nice chapter about it. There might be some good hints there about the sizes of caches, buffers and similar.
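To illustrate the first point, a counting sort sketch (the simplest form of bucket sort), under the assumption that the values are non-negative ints in a known, modest range:

```java
public class CountingSort {
    // Sorts 'data' in one pass plus a write-back, using O(maxValue)
    // extra memory, regardless of how many elements there are.
    static void countingSort(int[] data, int maxValue) {
        int[] counts = new int[maxValue + 1];
        for (int v : data) {
            counts[v]++; // tally each distinct value
        }
        int i = 0;
        for (int v = 0; v <= maxValue; v++) { // emit values in order
            for (int c = counts[v]; c > 0; c--) {
                data[i++] = v;
            }
        }
    }
}
```

For data on disk the same idea works streamed: tally the counts while reading, then write the sorted output directly from the counts array, never holding the full data set in memory.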