We have a binary file which contains a large amount of float data (about 80MB). We need to process it in our Java application. The data is from a medical scanner. One file contains data from one Rotation. One Rotation contains 960 Views. One View contains 16 Rows and one Row contains 1344 Cells. These numbers (and their relationship) are fixed.
We need to read ALL the floats into our application, with a code structure that reflects the Rotation-View-Row-Cell structure above.
What we are doing now is using float[] to hold data for Cells and then using ArrayList for Rotation, View and Row to hold their data.
I have two questions:
How do we populate the Cell data (read the floats into our float[]) quickly?
Do you have a better idea for holding this data?
Use a DataInputStream (and its readFloat() method) wrapping a FileInputStream, possibly with a BufferedInputStream in between (measure whether the buffer helps performance or not).
Your data structure looks fine.
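A minimal sketch of that stream chain (the 1344 cells per row come from the question; DataInputStream assumes big-endian floats, so this presumes the scanner writes them that way — class and method names are illustrative):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

class CellReader {
    // Open the file with a buffer between the raw stream and the float decoding.
    static DataInputStream open(String path) throws IOException {
        return new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)));
    }

    // Populate the cells of one Row from the current stream position.
    static float[] readRow(DataInputStream in) throws IOException {
        float[] cells = new float[1344];
        for (int c = 0; c < cells.length; c++) {
            cells[c] = in.readFloat();
        }
        return cells;
    }
}
```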
Assuming you don't make changes to the data (add more views, etc.), why not put everything in one big array? The point of ArrayLists is that you can grow and shrink them, which you don't need here. You can write access methods to get the right cell for a given view, rotation, etc.
Using arrays of arrays is a better idea; that way the system figures out the offsets for you, and it is just as fast as a single array.
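For comparison, a sketch of the single-big-array idea with an access method (class and method names are illustrative; the dimensions come from the question):

```java
// One flat array for a whole rotation; each cell lives at a computed offset.
class RotationData {
    static final int VIEWS = 960, ROWS = 16, CELLS = 1344;
    private final float[] data = new float[VIEWS * ROWS * CELLS]; // roughly 80 MB of floats

    // Row-major layout: the cells of a row are contiguous, the rows of a view are contiguous.
    float get(int view, int row, int cell) {
        return data[(view * ROWS + row) * CELLS + cell];
    }

    void set(int view, int row, int cell, float value) {
        data[(view * ROWS + row) * CELLS + cell] = value;
    }
}
```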
Michael is right, you need to buffer the input, otherwise you will be doing a file access operation for every byte and your performance will be awful.
If you want to stick with the current approach as much as possible, you can minimize the memory used by your ArrayLists by setting their capacity to the number of elements they hold. Otherwise they keep a number of slots in reserve, expecting you to add more.
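For example (sizes from the question; View here is just a placeholder for whatever row/cell structure you use):

```java
import java.util.ArrayList;

class CapacityExample {
    // Placeholder for the question's View type: 16 rows of 1344 cells each.
    static class View { float[][] rows = new float[16][1344]; }

    static ArrayList<View> allocateViews() {
        ArrayList<View> views = new ArrayList<>(960); // capacity set up front, no reallocation while filling
        for (int v = 0; v < 960; v++) {
            views.add(new View());
        }
        views.trimToSize(); // release any spare slots kept in reserve
        return views;
    }
}
```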
For the data loading:
DataInputStream should work well. But make sure you wrap the underlying FileInputStream in a BufferedInputStream, otherwise you run the risk of doing I/O operations for every float which can kill performance.
Several options for holding the data:
The (very marginally) most memory-efficient way will be to store the entire data set in one large float[], and calculate offsets into it as needed. A bit ugly to use, but might make sense if you are doing a lot of calculations or processing loops over the entire set.
The most "OOP" way would be to have separate objects for Rotation, View, Row and Cell. But having each cell as a separate object is pretty wasteful, might even blow your memory limits.
You could use nested ArrayLists with a float[1344] to represent the lowest level data for the cells in each row. I understand this is what you're currently doing - in fact I think it's a pretty good choice. The overhead of the ArrayLists won't be much compared to the overall data size.
A final option would be to use a float[viewNum][rowNum][cellNum] to represent each rotation. A bit more efficient than ArrayLists, but arrays are usually less nice to manipulate. However, this seems a pretty good option if, as you say, the array sizes will always be fixed. I'd probably choose this option myself.
Are you having any particular performance/usage problems with your current approach?
The only thing I can suggest based on the information that you provide is to try representing a View as float[][] of rows and cells.
I also think that you can put your whole data structure into a float[][][] (the same as Nathan Hughes suggests). You could have a method that reads your file and returns a float[][][], where the first dimension is that of views (960), the second is that of rows (16), and the third is that of cells (1344). If those numbers are fixed, you'd better use this approach: you save memory, and it's faster.
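A sketch of such a method, reusing the DataInputStream/BufferedInputStream chain from above and assuming big-endian floats:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

class RotationFileReader {
    // Returns rotation[view][row][cell], read in file order: view by view, row by row.
    static float[][][] readRotation(String path) throws IOException {
        float[][][] rotation = new float[960][16][1344];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            for (float[][] view : rotation) {
                for (float[] row : view) {
                    for (int c = 0; c < row.length; c++) {
                        row[c] = in.readFloat();
                    }
                }
            }
        }
        return rotation;
    }
}
```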
80 MB shouldn't be so much data that you need to worry so terribly much. I would really suggest:
create Java wrapper objects representing the most logical structure/hierarchy for the data you have;
one way or another, ensure that you're only making an actual "raw" I/O call (so an InputStream.read() or equivalent) every 16K or so of data -- e.g. you could read into a 16K/32K byte array that is wrapped in a ByteBuffer for the purpose of pulling out the floats, or whatever you need for your data (see the sketch after this list);
if you actually have a performance problem with this approach, try to identify, not second-guess, what that performance problem actually is.
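A minimal sketch of the chunked reading mentioned in the second point, assuming big-endian floats and that the file holds at least dest.length of them:

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

class ChunkedFloatReader {
    // Fills dest with floats while issuing raw reads of ~32 KB instead of one per float.
    static void readFloats(String path, float[] dest) throws IOException {
        byte[] chunk = new byte[32 * 1024]; // 8192 floats per bulk read
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            int filled = 0;
            while (filled < dest.length) {
                int floats = Math.min(chunk.length / 4, dest.length - filled);
                in.readFully(chunk, 0, floats * 4);                   // one bulk read per chunk
                ByteBuffer.wrap(chunk, 0, floats * 4)
                          .asFloatBuffer().get(dest, filled, floats); // decode the chunk
                filled += floats;
            }
        }
    }
}
```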
I understand that you are looking for an effective way to store the data you described above. Although the size you mention is not very huge, I would suggest you have a look at Huge Collections.
As the title states, I'm trying to read an unknown # of elements from a file into an array. What's the simplest (professor wants us to avoid using things she hasn't taught) yet effective way of going about this?
I've thought about reading and counting the elements in the file one by one, then create the array after I know what size to make it, then actually store the elements in there. But that seems a little inefficient. Is there a better way?
There are only two ways to do this: the way you suggested (count, then read), and making an array and hoping it's big enough, then resizing if it's not (which is the easier of the two, as ArrayList does that automatically for you).
Which is better depends on whether you're more limited by time or memory (as typically reading the file twice will be slower than reallocating an array even multiple times).
EDIT: There is a third way, which is only available if each record in the file has a fixed width, and the file is not compressed (or encoded in any other way that would mess with the content layout): get the size of the file, divide by record size, and that's exactly how many records you have to allocate for. Unfortunately, life is not always that easy. :)
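A sketch of that third way, assuming fixed-width binary records; the 4-byte record size is just an assumption for illustration:

```java
import java.io.File;

class RecordCounter {
    static final int RECORD_SIZE = 4; // bytes per record -- adjust for your format

    // File length divided by record width tells you exactly how big to make the array.
    static int countRecords(String path) {
        return (int) (new File(path).length() / RECORD_SIZE);
    }
}
```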
You have not mentioned whether your professor has taught ArrayList or not; assuming she has, I would go for ArrayList for sure. The sole purpose of ArrayList is to deal with this kind of situation. Features like dynamic resizing give ArrayList some advantages over a plain array.
You can count first and then create the array accordingly, but this will be slow, and why reinvent the wheel?
So your approach should be to use an ArrayList: read each element and just use ArrayList.add()...
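For instance, assuming a whitespace-separated text file of ints (the file name is illustrative):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

class ReadUnknownCount {
    public static void main(String[] args) throws FileNotFoundException {
        ArrayList<Integer> values = new ArrayList<>(); // grows as needed
        try (Scanner in = new Scanner(new File("numbers.txt"))) {
            while (in.hasNextInt()) {
                values.add(in.nextInt());
            }
        }
        System.out.println("Read " + values.size() + " elements");
    }
}
```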
I have a problem. I work in Java, Eclipse. My program calculates some mathematical physics, and I need to draw an animation (Java SWT package) of the process (some hydrodynamics). The problem is 2D, so each iteration returns a two-dimensional array of numbers. One iteration takes rather a long time, and the time needed varies from one iteration to another, so showing pictures dynamically as the program works seems like a bad idea. My idea instead was to store a three-dimensional array, where the third index represents time, and build the animation when the calculations are over. But as I want accuracy from my program, I need a lot of iterations, so the program easily reaches the maximal array size. So the question is: how do I avoid creating such an enormous array, or how do I avoid the limitations on array size? I thought about creating a special file to store the data and then reading from it, but I'm not sure about this. Do you have any ideas?
When I was working on a procedural architecture generation system at university for my dissertation I created small, extremely easily read and parsed binary files for calculated data. This meant that the data could be read in within an acceptable amount of time, despite being quite a large amount of data...
I would suggest doing the same for your animations... It might be of value storing maybe five seconds of animation per file and then caching each of these as they are about to be required...
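A minimal sketch of such a per-chunk binary format for the frames (wrap the streams in BufferedOutputStream/BufferedInputStream in practice; the format itself is just an assumption for illustration):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class FrameIO {
    // Append one 2D result (a frame) to an open chunk file as raw doubles.
    static void writeFrame(DataOutputStream out, double[][] frame) throws IOException {
        out.writeInt(frame.length);     // rows
        out.writeInt(frame[0].length);  // columns
        for (double[] row : frame) {
            for (double v : row) {
                out.writeDouble(v);
            }
        }
    }

    // Read the next frame back when the animation is being built.
    static double[][] readFrame(DataInputStream in) throws IOException {
        int rows = in.readInt(), cols = in.readInt();
        double[][] frame = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                frame[r][c] = in.readDouble();
            }
        }
        return frame;
    }
}
```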
Also, how large are your arrays? You could increase the amount of memory your JVM is able to allocate, if it's the maximum memory limit you're hitting rather than the maximum array size.
I hope this helps and isn't just my ramblings...
For ultra-fast code it is essential that we maintain locality of reference: keep as much of the data that is used together close together in the CPU cache:
http://en.wikipedia.org/wiki/Locality_of_reference
What techniques are there to achieve this? Could people give examples?
I am interested in Java and C/C++ examples. It would be interesting to know the ways people use to prevent lots of cache swapping.
Greetings
This is probably too generic to have a clear answer. The approaches in C or C++ compared to Java will differ quite a bit (the way the languages lay out objects differs).
The basic rule would be: keep data that will be accessed in tight loops together. If your loop operates on type T, and it has members m1...mN, but only m1...m4 are used in the critical path, consider breaking T into T1, which contains m1...m4, and T2, which contains m5...mN. You might want to add to T1 a pointer that refers to T2. Try to avoid objects that are unaligned with respect to cache boundaries (very platform dependent).
Use contiguous containers (a plain old array in C, vector in C++) and try to manage the iterations to go up or down, not randomly jumping all over the container. Linked lists are killers for locality: two consecutive nodes in a list might be at completely different random locations.
Object containers (and generics) in Java are also a killer: while in a Vector the references are contiguous, the actual objects are not (there is an extra level of indirection). In Java there is also a lot of extra overhead: if you new two objects one right after the other, the objects will probably end up in almost contiguous memory locations, but there will be some extra information (usually two or three pointers) of object-management data in between. The GC will move objects around, but hopefully won't make things much worse than they were before it ran.
If you are focusing on Java, create compact data structures. If you have an object that has a position, and it is to be accessed in a tight loop, consider holding x and y as primitive fields inside your object rather than creating a Point and holding a reference to it. Reference types need to be newed, and that means a separate allocation, an extra indirection and less locality.
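To illustrate (class names are just examples):

```java
// Compact: the position lives inline as primitive fields, so a tight loop over
// an array of these touches one object per element.
class Particle {
    double x, y;
    double vx, vy;
}

// Less cache-friendly: each element drags in a separately allocated Point,
// costing an extra allocation and an extra indirection per access.
class BoxedParticle {
    Point position = new Point();
    double vx, vy;
}

class Point {
    double x, y;
}
```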
Two common techniques include:
Minimalism (of data size and/or code size/paths)
Use cache oblivious techniques
Example of minimalism: in ray tracing (a 3D graphics rendering paradigm), it is a common approach to use 8-byte kd-trees to store static scene data. The traversal algorithm fits in just a few lines of code. Then, the kd-tree is often compiled in a manner that minimizes the number of traversal steps by having large, empty nodes at the top of the tree ("Surface Area Heuristics" by Havran).
Mispredictions typically have a probability of 50%, but are of minor cost, because a great many nodes fit in a cache line (consider that you get 128 nodes per KiB!), and one of the two child nodes is always a direct neighbour in memory.
Example of cache-oblivious techniques: Morton array indexing, also known as Z-order-curve indexing. This kind of indexing might be preferred if you usually access nearby array elements in unpredictable directions. This might be valuable for large image or voxel data where you might have pixels of 32 or even 64 bytes, and then millions of them (a typical compact camera measures in megapixels, right?) or even thousands of billions for scientific simulations.
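A sketch of the index computation in Java (for a square power-of-two grid, a pixel would then live at data[Morton.index(x, y)]):

```java
class Morton {
    // Z-order (Morton) index for a 2D grid: interleaves the lower 16 bits of x and y,
    // so elements that are close in 2D tend to stay close in the 1D array.
    static int index(int x, int y) {
        return spreadBits(x) | (spreadBits(y) << 1);
    }

    // Spread the lower 16 bits of v so that a zero bit sits between each of them.
    private static int spreadBits(int v) {
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }
}
```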
However, both techniques have one thing in common: keep the most frequently accessed stuff nearby; less frequently used things can be further away, spanning the whole range from L1 cache over main memory to hard disk, then other computers in the same room, the next room, the same country, worldwide, other planets.
Some random tricks that come to my mind, some of which I have used recently:
Rethink your algorithm. For example, you have an image with a shape and a processing algorithm that looks for corners of the shape. Instead of operating on the image data directly, you can preprocess it, save all the shape's pixel coordinates in a list and then operate on the list. You avoid randomly jumping around the image.
Shrink data types. Regular int will take 4 bytes, and if you manage to use e.g. uint16_t you will cache 2x more stuff
Sometimes you can use bitmaps; I used one for processing a binary image. I stored one pixel per bit, so I could fit 8*32 pixels in a single cache line. It really boosted the performance (see the sketch after this list).
From Java, you can use JNI (it's not difficult) and implement your critical code in C to control the memory layout.
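The bit-packed image mentioned above might look roughly like this in Java (a sketch; the original was presumably C, but the idea is the same):

```java
// One pixel per bit: a 64-byte cache line then holds 512 pixels instead of,
// say, 64 one-byte pixels.
class BitImage {
    final int width, height;
    final long[] bits;

    BitImage(int width, int height) {
        this.width = width;
        this.height = height;
        this.bits = new long[(width * height + 63) / 64];
    }

    boolean get(int x, int y) {
        int i = y * width + x;
        return (bits[i >>> 6] & (1L << (i & 63))) != 0;
    }

    void set(int x, int y, boolean value) {
        int i = y * width + x;
        if (value) {
            bits[i >>> 6] |= 1L << (i & 63);
        } else {
            bits[i >>> 6] &= ~(1L << (i & 63));
        }
    }
}
```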
In the Java world the JIT is going to be working hard to achieve this, and trying to second guess this is likely to be counterproductive. This SO question addresses Java-specific issues more fully.
Dear StackOverflowers,
I am in the process of writing an application that sorts a huge amount of integers from a binary file. I need to do it as quickly as possible, and the main performance issue is disk access time; since I make a multitude of reads, it slows down the algorithm quite significantly.
The standard way of doing this would be to fill ~50% of the available memory with a buffered object of some sort (BufferedInputStream etc) then transfer the integers from the buffered object into an array of integers (which takes up the rest of free space) and sort the integers in the array. Save the sorted block back to disk, repeat the procedure until the whole file is split into sorted blocks and then merge the blocks together.
The strategy for sorting the blocks utilises only 50% of the memory available since the data is essentially duplicated (50% for the cache and 50% for the array while they store the same data).
I am hoping that I can optimise this phase of the algorithm (sorting the blocks) by writing my own buffered class that allows caching data straight into an int array, so that the array could take up all of the free space, not just 50% of it; this would reduce the number of disk accesses in this phase by a factor of 2. The thing is, I am not sure where to start.
EDIT:
Essentially I would like to find a way to fill up an array of integers by executing only one read on the file. Another constraint is the array has to use most of the free memory.
If any of the statements I made are wrong, or at least seem to be, please correct me;
any help appreciated,
Regards
When you say limited, how limited... <1MB, <10MB, <64MB?
It makes a difference, since you won't actually get much benefit, if any, from having large BufferedInputStreams. In most cases the default buffer size of 8192 (JDK 1.6) is enough, and increasing it doesn't usually make that much difference.
Using a smaller BufferedInputStream should leave you with nearly all of the heap to create and sort each chunk before writing them to disk.
You might want to look into the Java NIO libraries, specifically FileChannel and IntBuffer.
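A sketch of that direction, assuming the file holds raw big-endian 4-byte ints; the mapped region lives outside the Java heap, so nearly all of the heap can go to the int[] being sorted:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

class ChunkLoader {
    // Map a slice of the file and bulk-copy it into an int[] in one go,
    // with no large intermediate byte[] on the heap.
    static int[] loadChunk(String path, long byteOffset, int intCount) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel channel = raf.getChannel()) {
            IntBuffer ints = channel
                    .map(FileChannel.MapMode.READ_ONLY, byteOffset, (long) intCount * 4)
                    .asIntBuffer();
            int[] chunk = new int[intCount];
            ints.get(chunk); // single bulk copy into the array to be sorted
            return chunk;
        }
    }
}
```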
You don't give many hints, but two things come to my mind. First, if you have many integers but not that many distinct values, bucket sort could be the solution.
Secondly, one word (OK, term) screams in my head when I hear that: external tape sorting. In the early computer days (i.e. the stone age) data resided on tapes, and it was very hard to sort data spread over multiple tapes. It is very similar to your situation. And indeed merge sort was the most often used sort in those days; as far as I remember, Knuth's TAOCP had a nice chapter about it. There might be some good hints there about the sizes of caches, buffers and similar.
I'm working with a big matrix (not sparse); it contains about 10^10 doubles.
Of course I cannot keep it in memory, and I need just 1 row at a time.
I thought of splitting it into files, one row per file (which requires a lot of files), and just reading a file every time I need a row. Do you know of a more efficient way?
Why do you want to store it in different files? Can't you use a single file?
You could use the methods of the RandomAccessFile class to perform the reading from that file.
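A sketch of reading one row from such a single file, assuming fixed-length rows of raw big-endian doubles (e.g. as written by DataOutputStream):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

class MatrixFile {
    // One seek plus one bulk read per row.
    static double[] readRow(RandomAccessFile file, long rowIndex, int columns) throws IOException {
        byte[] raw = new byte[columns * 8];        // 8 bytes per double
        file.seek(rowIndex * (long) raw.length);   // jump straight to the row
        file.readFully(raw);
        double[] row = new double[columns];
        ByteBuffer.wrap(raw).asDoubleBuffer().get(row);
        return row;
    }
}
```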
So, 800KB per file, sounds like a good division. Nothing really stops you from using one giant file, of course. A matrix, at least one like yours that isn't sparse, can be considered a file of fixed length records, making random access a trivial matter.
If you do store it one file per row, I might suggest making a directory tree corresponding to decimal digits, so 0/0/0/0 through 9/9/9/9.
Considerations one way or the other...
is it being backed up? Do you have high-capacity backup media or something ordinary?
does this file ever change?
if it does change and it is backed up, does it change all at once or are changes localized?
It depends on the algorithms you want to execute, but I guess that in most cases a representation where each file contains some square or rectangular region would be better.
For example, matrix multiplication can be done recursively by breaking a matrix into submatrices.
If you are going to be saving it in a file, I believe serializing it will save space/time over storing it as text.
Serializing the doubles will store them as 8 bytes each (plus serialization overhead) and means that you will not have to convert these doubles back and forth to and from Strings when saving or loading the file.
I'd suggest to use a disk-persistent cache like Ehcache. Just configure it to keep as many fragments of your matrix in memory as you like and it will take care of the serialization. All you have to do is decide on the way of fragmentation.
Another approach that comes to my mind is using Terracotta (which recently bought Ehcache, by the way). It's great for getting a large network-attached heap that can easily manage your 10^10 double values without you caring about it in code at all.