I'm reading in a NetCDF file and I want to read each array as a float array and then write the float array to a new file. I can make it work if I read in the float array and then iterate over each element of the array (using a DataOutputStream), but this is very, very slow; my NetCDF files are over 1 GB.
I tried using an ObjectOutputStream, but this writes extra bytes of information.
So, to recap.
1. Open NetCDF file
2. Read float array x from NetCDF file
3. Write float array x to raw data file in a single step
4. Repeat steps 2 and 3 for the next array (x+1)
OK, you have 1 GB to read and 1 GB to write. Depending on your hard drive, you might get about 100 MB/s read and 60 MB/s write speed. This means it will take about 27 seconds to read and write.
What is the speed of your drive and how much slower than this are you seeing?
If you want to test the speed of your disk without any processing, time how long it takes to copy a file which you haven't accessed recently (i.e. it is not in the disk cache). This will give you an idea of the minimum delay you can expect to read then write most of the data from the file (i.e. with no processing or Java involved).
For the benefit of anyone who would like to know how to do a loop-less copy of data (i.e. one that doesn't just call a method which loops for you):
FloatBuffer src = // readable memory-mapped file
FloatBuffer dest = // writable memory-mapped file
src.position(start);
src.limit(end);
dest.put(src);
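To make the snippet above concrete, here is one way the two mapped buffers might be obtained. This is a minimal sketch, assuming the source and destination files already exist at the right sizes; the file names in.dat and out.dat are hypothetical:

import java.io.RandomAccessFile;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;

// Map the source read-only and the destination read-write,
// then view both as FloatBuffers for the bulk copy.
try (RandomAccessFile in = new RandomAccessFile("in.dat", "r");
     RandomAccessFile out = new RandomAccessFile("out.dat", "rw")) {
    long size = in.length();
    FloatBuffer src = in.getChannel()
            .map(FileChannel.MapMode.READ_ONLY, 0, size)
            .asFloatBuffer();
    FloatBuffer dest = out.getChannel()
            .map(FileChannel.MapMode.READ_WRITE, 0, size)
            .asFloatBuffer();
    dest.put(src); // bulk copy, no explicit loop in your code
}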
If you have mixed types of data you can use a ByteBuffer, which notionally copies a byte at a time but in reality can use a long or wider type to copy 8 or more bytes at a time, i.e. whatever the CPU can do.
For small blocks this will use a loop, but for large blocks it can use page-mapping tricks in the OS. In any case, how it does it is not defined in Java, but it's likely to be the fastest way to copy data.
Most of these tricks only make a difference if you are copying a file already in memory to a cached file. As soon as you read a file from disk, or the file is too large to cache, the IO bandwidth of your physical disk is the only thing which really matters.
This is because a CPU can copy data at 6 GB/s to main memory but only at 60-100 MB/s to a hard drive. If the copy in the CPU/memory is 2x, 10x or 50x slower than it could be, it will still be waiting for the disk. Note: with no buffering at all this is entirely possible, and worse; but provided you have any simple buffering, the CPU will be faster than the disk.
I ran into the same problem and will dump my solution here just for future reference.
It is very slow to iterate over an array of floats and call DataOutputStream.writeFloat for each of them. Instead, transform the floats yourself into a byte array and write that array all at once:
Slow:
DataOutputStream out = ...;
for (int i = 0; i < floatarray.length; ++i)
    out.writeFloat(floatarray[i]);
Much faster:
DataOutputStream out = ...;
byte[] buf = new byte[4 * floatarray.length];
for (int i = 0; i < floatarray.length; ++i) {
    int val = Float.floatToRawIntBits(floatarray[i]);
    buf[4 * i]     = (byte) (val >> 24);
    buf[4 * i + 1] = (byte) (val >> 16);
    buf[4 * i + 2] = (byte) (val >> 8);
    buf[4 * i + 3] = (byte) (val);
}
out.write(buf);
If your array is very large (more than 100k elements), break it up into chunks to avoid exhausting the heap with the buffer array.
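As a side note, the same big-endian conversion can be had from NIO without manual bit-shifting. A sketch, reusing out and floatarray from the snippet above:

import java.nio.ByteBuffer;

// Same byte layout as DataOutputStream.writeFloat, but the
// conversion is done in bulk by a FloatBuffer view of the array.
ByteBuffer bb = ByteBuffer.allocate(4 * floatarray.length);
bb.asFloatBuffer().put(floatarray);
out.write(bb.array());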
1) When writing, use a BufferedOutputStream; you will get a factor-of-100 speedup (see the sketch below).
2) When reading, read at least 10K per read; 100K is probably better.
3) Post your code.
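A minimal sketch of point 1, where the file name is hypothetical and floatarray stands for one of the arrays read from the NetCDF file:

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// The BufferedOutputStream batches the many small writeFloat calls
// into large physical writes instead of one syscall per float.
try (DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream("out.dat"), 64 * 1024))) {
    for (float f : floatarray) {
        out.writeFloat(f);
    }
} catch (IOException e) {
    e.printStackTrace();
}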
If you are using the Unidata NetCDF library, your problem may not be the writing but rather the NetCDF library's caching mechanism.
NetcdfFile file = NetcdfFile.open(filename);
Variable variable = file.findVariable("yourVariableName");
for (...) {
    // ... read data from the variable here ...
    variable.invalidateCache(); // drop the library's cached copy after each read
}
Lateral solution:
If this is a one-off generation (or if you are willing to automate it in an Ant script) and you have access to some kind of Unix environment, you can use ncdump instead of doing it in Java. Something like:
ncdump -v your_variable your_file.nc | [awk] > float_array.txt
You can control the precision of the floats with the -p option if you desire. I just ran it on a 3GB NetCDF file and it worked fine. As much as I love Java, this is probably the quickest way to do what you want.
While working with encryption of big files I've been forced to learn how these files are read from and written to the filesystem at a lower abstraction level than usual.
But there are some things I can't understand at all, the main one being: why does java.io.File.length() return a long when, to hold the file's bytes in memory, I need to create a byte array whose constructor takes an int parameter, e.g.:
File file = // ...
byte[] total = new byte[(int) file.length()];
byte[] buffer = new byte[1024];
...
While working with normal files I wouldn't even think about this, since an int can hold values up to 2^31 - 1 which, if I'm not mistaken, caps the array at a bit less than 2 GB.
I wonder why this is, whether it is actually done this way or I'm misunderstanding Java's API, and which alternatives I have to simplify this task.
PS: I know that with big files I shouldn't aim to read and hold the whole thing in memory, since I'd easily trigger an OutOfMemoryError; this is just a random use case.
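For what it's worth, the standard way around the int limit is to process the file in fixed-size chunks, so no array ever needs to be as large as the file. A minimal sketch with a hypothetical file name:

import java.io.FileInputStream;
import java.io.IOException;

// Read a file of any size with a small, fixed-size buffer;
// the long file length never needs to fit into an int.
try (FileInputStream in = new FileInputStream("big.bin")) {
    byte[] buffer = new byte[64 * 1024];
    int n;
    while ((n = in.read(buffer)) != -1) {
        // process buffer[0..n)
    }
} catch (IOException e) {
    e.printStackTrace();
}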
I have a text file with a sequence of integers per line:
47202 1457 51821 59788
49330 98706 36031 16399 1465
...
The file has 3 million lines of this format. I have to load this file into memory, extract 5-grams out of it, and do some statistics on it. I do have a memory limitation (8 GB RAM). I tried to minimize the number of objects I create (I only have 1 class with 6 float variables, and some methods). And each line of that file basically generates a number of objects of this class, proportional to the size of the line in terms of the number of words. I'm starting to feel that Java is not a good way to do these things when C++ is around.
Edit:
Assume that each line produces (n-1) objects of that class, where n is the number of space-separated tokens in that line (such as 1457). So, at an average of 10 words per line, each line gets mapped to 9 objects on average, and there will be 9 * 3 * 10^6 objects. The memory needed is therefore: 9 * 3 * 10^6 * (8-byte object header + 6 * 4-byte floats), plus a Map<String, Object> and another Map<Integer, ArrayList<Object>>. I need to keep everything in memory, because there will be some mathematical optimization happening afterwards.
Reading/Parsing the file:
The best way to handle large files, in any language, is to try and NOT load them into memory.
In Java, have a look at MappedByteBuffer. It allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
You might also try reading the file line-by-line and discarding each line after you read it - again to avoid holding the entire file in memory at once.
Handling the resulting objects
For dealing with the objects you produce while parsing, there are several options:
Same as with the file itself - if you can perform whatever it is you want to perform without keeping all of them in memory (while "streaming" the file), that is the best solution. You didn't describe the problem you're trying to solve, so I don't know if that's possible.
Compression of some sort - switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and only construct short-lived objects to access it, or find some pattern in your data that allows you to store it more compactly (see the sketch after this list).
Caching/offload - if your data still doesn't fit in memory, "page it out" to disk. This can be as simple as extending Guava to page out to disk, or bringing in a library like Ehcache or the like.
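A minimal sketch of the flyweight idea from the list above: one float[] holds all records, six fields each, instead of millions of small objects. The class name and accessors are illustrative:

// Flyweight storage: one array object instead of one object per record,
// so the per-object header overhead is paid exactly once.
class FloatTable {
    private static final int FIELDS = 6;
    private final float[] data;

    FloatTable(int recordCount) {
        data = new float[recordCount * FIELDS];
    }

    float get(int record, int field) {
        return data[record * FIELDS + field];
    }

    void set(int record, int field, float value) {
        data[record * FIELDS + field] = value;
    }
}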
A note on Java collections, and maps in particular
For small objects, Java collections (and maps in particular) incur a large memory penalty (due mostly to everything being wrapped as Objects and to the Map.Entry inner-class instances). At the cost of a slightly less elegant API, you should probably look at GNU Trove collections if memory consumption is an issue.
The optimal approach would be to hold only the integers and the line ends.
To that end, one way is to convert the file into two files:
one binary file of integers (4 bytes)
one binary file with indexes where the next line would start.
For this one can use a Scanner to read, and a DataOutputStream+BufferedOutputStream to write.
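A sketch of that conversion step. It uses a BufferedReader rather than the Scanner mentioned above so that line boundaries are easy to track; the file names are hypothetical:

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;

// Convert the text file into integers.bin (4 bytes per value) and
// lineEnds.bin (running count of integers at the end of each line).
try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
     DataOutputStream ints = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream("integers.bin")));
     DataOutputStream ends = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream("lineEnds.bin")))) {
    int count = 0;
    String line;
    while ((line = in.readLine()) != null) {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                ints.writeInt(Integer.parseInt(token));
                count++;
            }
        }
        ends.writeInt(count); // index where the next line starts
    }
} catch (IOException e) {
    e.printStackTrace();
}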
Then you can load those two files into arrays of a primitive type:
int[] integers = new int[(int) (integersFile.length() / 4)];
int[] lineEnds = new int[(int) (lineEndsFile.length() / 4)];
Reading can be done with MappedByteBuffer.asIntBuffer(). (You would then not even need the arrays, but it would become a bit COBOL-like in its verbosity.)
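A sketch of that mapped variant, assuming the binary file is under 2 GB:

import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

// View the binary file directly as ints; no copy into a heap array.
try (RandomAccessFile raf = new RandomAccessFile("integers.bin", "r")) {
    FileChannel ch = raf.getChannel();
    IntBuffer integers = ch
            .map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
            .asIntBuffer();
    int first = integers.get(0); // random access by element index
}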
I am creating two CSV files using String buffers and byte arrays.
I use ZipOutputStream to generate the zip files. Each CSV file will have 20K records with 14 columns. Actually the records are fetched from the DB and stored in an ArrayList. I have to iterate the list, build a StringBuffer, and convert the StringBuffer to a byte array to write it to the zip entry.
I want to know the memory required by JVM to do the entire process starting from storing the records in the ArrayList.
I have provided a code snippet below.
StringBuffer responseBuffer = new StringBuffer();
String response = "Hello, sdksad, sfksdfjk, World, Date, ask, askdl, sdkldfkl, skldkl, sdfklklgf, sdlksldklk, dfkjsk, dsfjksj, dsjfkj, sdfjkdsfj\n";
for (int i = 0; i < 20000; i++) {
    responseBuffer.append(response);
}
response = responseBuffer.toString();
byte[] responseArray = response.getBytes();
res.setContentType("application/zip");
ZipOutputStream zout = new ZipOutputStream(res.getOutputStream());
ZipEntry parentEntry = new ZipEntry("parent.csv");
zout.putNextEntry(parentEntry);
zout.write(responseArray);
zout.closeEntry();
ZipEntry childEntry = new ZipEntry("child.csv");
zout.putNextEntry(childEntry);
zout.write(responseArray);
zout.closeEntry();
zout.close();
Please help me with this. Thanks in advance.
I'm guessing you've already tried counting how many bytes will be allocated to the StringBuffer and the byte array. But the problem is that you can't really know how much memory your app will use unless you have upper bounds on the sizes of the CSV records. If you want your software to be stable, robust and scalable, I'm afraid you're asking the wrong question: you should strive to perform the task using a fixed amount of memory, which in your case seems easily possible.
The key is that in your case the processing is entirely FIFO - you read records from the database and then write them (in the same order) into a FIFO stream (an OutputStream in this case). Even zip compression is stream-based and uses a fixed amount of memory internally, so you're totally safe there.
Instead of buffering the entire input in a huge String, then converting it to a huge byte array, then writing it to the output stream, you should read each response element separately from the database (or in chunks of fixed size, say 100 records at a time) and write it to the output stream. Something like:
res.setContentType("application/zip");
ZipOutputStream zout = new ZipOutputStream(res.getOutputStream());
ZipEntry parentEntry = new ZipEntry("parent.csv");
zout.putNextEntry(parentEntry);
while (... fetch entries ...)
zout.write(...data...)
zout.closeEntry();
The advantage of this approach is that because it works with small chunks, you can easily estimate their sizes and allocate enough memory for your JVM so it never crashes. And you know it will still work if your CSV files grow to much more than 20K lines in the future.
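Fleshing out the snippet above: a sketch where fetchNextChunk() is a hypothetical method returning up to 100 records as CSV text, or null when done, and res is the servlet response from the question (exception handling elided):

import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Stream each chunk straight into the zip entry; memory use is
// bounded by the chunk size, not by the total number of records.
res.setContentType("application/zip");
ZipOutputStream zout = new ZipOutputStream(res.getOutputStream());
zout.putNextEntry(new ZipEntry("parent.csv"));
String chunk;
while ((chunk = fetchNextChunk()) != null) {
    zout.write(chunk.getBytes(StandardCharsets.UTF_8));
}
zout.closeEntry();
zout.close();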
To analyze the memory usage you can use a profiler.
JProfiler and YourKit are very good at this.
VisualVM is also good to an extent.
You can measure the memory with the MemoryTestbench.
http://www.javaspecialists.eu/archive/Issue029.html
This article describes what to do. It's simple, accurate to 1 byte, and I use it often.
It can even be run from a JUnit test case, which makes it very useful, whereas a profiler cannot be run from a JUnit test case.
With that approach, you can even measure the memory size of a single Integer object.
But with zip there is one special thing: ZipOutputStream uses a native C library, and in that case the MemoryTestbench may not measure that memory, only the Java part.
You should try both variants: the MemoryTestbench, and a profiler (e.g. JProfiler).
I am really in trouble: I want to read HUGE files of several GB using FileChannels and MappedByteBuffers - all the documentation I found implies it's rather simple to map a file using the FileChannel.map() method.
Of course there is a limit at 2 GB, as all the Buffer methods use int for position, limit and capacity - but what about the system-imposed limits below that?
In reality, I get lots of problems with OutOfMemoryError! And no documentation at all that really defines the limits!
So - how can I map a file that fits into the int-limit safely into one or several MappedByteBuffers without just getting exceptions?
Can I ask the system which portion of a file I can safely map before I try FileChannel.map()? How?
Why is there so little documentation about this feature?
I can offer some working code. Whether this solves your problem or not is difficult to say. This hunts through a file for a pattern recognised by the Hunter.
See the excellent article Java tip: How to read files quickly for the original research (not mine).
// 4k buffer size.
static final int SIZE = 4 * 1024;
static byte[] buffer = new byte[SIZE];

// Fastest because a FileInputStream has an associated channel.
private static void ScanDataFile(Hunter p, FileInputStream f) throws FileNotFoundException, IOException {
    // Use a mapped and buffered stream for best speed.
    // See: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
    FileChannel ch = f.getChannel();
    long red = 0L; // total bytes mapped so far
    do {
        // Map at most Integer.MAX_VALUE bytes at a time.
        long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
        MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
        int nGet;
        while (mb.hasRemaining() && p.ok()) {
            nGet = Math.min(mb.remaining(), SIZE);
            mb.get(buffer, 0, nGet);
            for (int i = 0; i < nGet && p.ok(); i++) {
                p.check(buffer[i]);
            }
        }
        red += read;
    } while (red < ch.size() && p.ok());
    // Finish off.
    p.close();
    ch.close();
    f.close();
}
What I use is a List<ByteBuffer> where each ByteBuffer maps the file in blocks of 16 MB to 1 GB. I use powers of 2 to simplify the logic. I have used this to map files up to 8 TB. (A sketch follows below.)
A key limitation of memory-mapped files is that you are limited by your virtual memory. If you have a 32-bit JVM you won't be able to map very much.
I wouldn't keep creating new memory mappings for a file, because these are never cleaned up. You can create lots of them, but there appears to be a limit of about 32K of them on some systems (no matter how small they are).
The main reason I find memory-mapped files useful is that they don't need to be flushed (if you can assume the OS won't die). This allows you to write data in a low-latency way, without worrying about losing too much data if the application dies, or losing too much performance by having to write() or flush().
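A minimal sketch of that block-mapping scheme; the 256 MB block size and class name are assumptions:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

// Map a file of arbitrary size as a list of fixed-size mappings.
// A power-of-two block size keeps the index arithmetic simple.
class BlockMappedFile {
    static final long BLOCK = 256L * 1024 * 1024; // 256 MB per mapping

    static List<ByteBuffer> map(String name) throws IOException {
        List<ByteBuffer> blocks = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(name, "r")) {
            FileChannel ch = raf.getChannel();
            for (long pos = 0; pos < ch.size(); pos += BLOCK) {
                long size = Math.min(BLOCK, ch.size() - pos);
                blocks.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, size));
            }
        }
        return blocks;
    }

    // Byte at an absolute file offset.
    static byte byteAt(List<ByteBuffer> blocks, long offset) {
        return blocks.get((int) (offset / BLOCK)).get((int) (offset % BLOCK));
    }
}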
You don't use the FileChannel API to write the entire file at once. Instead, you send the file in parts. See example code in Martin Thompson's post comparing performance of Java IO techniques: Java Sequential IO Performance
In addition, there is not much documentation because you are making a platform-dependent call. From the map() JavaDoc:
Many of the details of memory-mapped files are inherently dependent upon the underlying operating system and are therefore unspecified.
The bigger the file, the less you want it all in memory at once. Devise a way to process the file a buffer at a time, a line at a time, etc.
MappedByteBuffers are especially problematic, as there is no defined release of the mapped memory, so using more than one at a time is essentially bound to fail.
I have a large (3 GB) binary file of doubles which I access (more or less) randomly during an iterative algorithm I have written for clustering data. Each iteration does about half a million reads from the file and about 100k writes of new values.
I create the FileChannel like this...
f = new File(_filename);
_ioFile = new RandomAccessFile(f, "rw");
_ioFile.setLength(_extent * BLOCK_SIZE);
_ioChannel = _ioFile.getChannel();
I then use a private ByteBuffer the size of a double to read from it
private ByteBuffer _double_bb = ByteBuffer.allocate(8);
and my reading code looks like this
public double GetValue(long lRow, long lCol)
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long position = idx * BLOCK_SIZE;
    double d = 0;
    try
    {
        _double_bb.position(0);
        _ioChannel.read(_double_bb, position);
        d = _double_bb.getDouble(0);
    }
    ...snip...
    return d;
}
and I write to it like this...
public void SetValue(long lRow, long lCol, double d)
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long offset = idx * BLOCK_SIZE;
    try
    {
        _double_bb.putDouble(0, d);
        _double_bb.position(0);
        _ioChannel.write(_double_bb, offset);
    }
    ...snip...
}
The time taken for an iteration of my code increases roughly linearly with the number of reads. I have added a number of optimisations to the surrounding code to minimise the number of reads, but I am now at the core set of reads that I feel are necessary without fundamentally altering how the algorithm works, which I want to avoid at the moment.
So my question is whether there is anything in the read/write code or JVM configuration I can do to speed up the reads? I realise I can change hardware, but before I do that I want to make sure that I have squeezed every last drop of software juice out of the problem.
Thanks in advance
As long as your file is stored on a regular hard disk, you will get the biggest possible speedup by organizing your data in a way that gives your accesses locality, i.e. causes as many get/set calls in a row as possible to access the same small area of the file.
This is more important than anything else you can do, because accessing random spots on a HD is by far the slowest thing a modern PC does - it takes about 10,000 times longer than anything else.
So if it's possible to work on only a part of the dataset (small enough to fit comfortably into the in-memory HD cache) at a time and then combine the results, do that.
Alternatively, avoid the issue by storing your file on an SSD or (better) in RAM. Even storing it on a simple thumb drive could be a big improvement.
Instead of reading into a ByteBuffer, I would use file mapping; see FileChannel.map().
Also, you don't really explain how your GetValue(row, col) and SetValue(row, col) access the storage. Are row and col more or less random? The idea I have in mind is the following: sometimes, in image processing, when you have to access pixels like (row + 1, col), (row - 1, col), (row, col - 1) and (row, col + 1) to average values, one trick is to organize the data in 8 x 8 or 16 x 16 blocks. Doing so helps keep the different pixels of interest in a contiguous memory area (and hopefully in the cache).
You might transpose this idea to your algorithm (if it applies): map a portion of your file once, so that the different calls to GetValue(row, col) and SetValue(row, col) work on that portion that has just been mapped.
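A sketch of that, using the field names from the question; the window position and size are assumptions, and exception handling is elided:

import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;

// _ioChannel is the FileChannel from the question. Map one window of
// the file and view it as doubles; reads and writes inside the window
// become plain memory accesses instead of one system call per value.
long windowStart = 0;                 // first byte of the mapped portion
long windowSize = 512L * 1024 * 1024; // hypothetical 512 MB window
DoubleBuffer window = _ioChannel
        .map(FileChannel.MapMode.READ_WRITE, windowStart, windowSize)
        .asDoubleBuffer();

// For an element index idx (from TriangularMatrix.CalcIndex) inside the window:
double d = window.get((int) (idx - windowStart / 8)); // read
window.put((int) (idx - windowStart / 8), d);         // write, flushed lazily by the OS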
Presumably if we can reduce the number of reads then things will go more quickly.
3 GB isn't huge for a 64-bit JVM, hence quite a lot of the file would fit in memory.
Suppose that you treat the file as "pages" which you cache. When you read a value, read the page around it and keep it in memory. Then when you do more reads check the cache first.
Or, if you have the capacity, read the whole thing into memory at the start of processing.
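A minimal sketch of such a page cache, built on LinkedHashMap's access-order eviction; the page size and capacity are assumptions:

import java.util.LinkedHashMap;
import java.util.Map;

// Fixed-capacity LRU cache of file pages: on a read, load the whole
// page around the requested value and keep it for later reads.
class PageCache extends LinkedHashMap<Long, double[]> {
    static final int PAGE_DOUBLES = 8192; // 64 KB pages
    static final int MAX_PAGES = 4096;    // 256 MB cached at most

    PageCache() {
        super(16, 0.75f, true); // access-order gives LRU behaviour
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, double[]> eldest) {
        return size() > MAX_PAGES;
    }
}

On GetValue you would compute the page number as idx / PAGE_DOUBLES, load the page with one bulk channel read if it is absent, and then serve the value from memory.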
Byte-by-byte access always produces poor performance (not only in Java). Try to read/write bigger blocks (e.g. rows or columns).
How about switching to a database engine for handling such amounts of data? It would handle all the optimizations for you.
Maybe this article helps you ...
You might want to consider using a library designed for managing large amounts of data and random reads, rather than using raw file-access routines.
The HDF file format may be a good fit. It has a Java API but is not pure Java. It's licensed under an Apache-style license.