I'm working with a big matrix (not sparse); it contains about 10^10 doubles.
Of course I cannot keep it in memory, and I need just one row at a time.
I thought of splitting it into files, one row per file (which requires a lot of files), and just reading a file every time I need a row. Do you know of a more efficient way?
Why do you want to store it in different files? Can't you use a single file?
You could use the methods of the RandomAccessFile class to perform the reads from that file.
So, 800KB per file, sounds like a good division. Nothing really stops you from using one giant file, of course. A matrix, at least one like yours that isn't sparse, can be considered a file of fixed length records, making random access a trivial matter.
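For example, here is a minimal sketch of the single-file approach, assuming (purely for illustration) a square 100,000 x 100,000 layout stored row-major, so each row is the 800KB mentioned above; the file name is made up:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;

    // Sketch only: assumes a square matrix of 100,000 x 100,000 doubles stored
    // row-major in one big file; "matrix.bin" is a made-up name.
    public class MatrixRowReader {
        static final int COLS = 100_000;                      // doubles per row (~800 KB)
        static final int BYTES_PER_ROW = COLS * Double.BYTES;

        static double[] readRow(RandomAccessFile file, long rowIndex) throws IOException {
            file.seek(rowIndex * (long) BYTES_PER_ROW);       // fixed-length records: offset is trivial
            byte[] raw = new byte[BYTES_PER_ROW];
            file.readFully(raw);
            double[] row = new double[COLS];
            ByteBuffer.wrap(raw).asDoubleBuffer().get(row);
            return row;
        }

        public static void main(String[] args) throws IOException {
            try (RandomAccessFile file = new RandomAccessFile("matrix.bin", "r")) {
                double[] row42 = readRow(file, 42);
                System.out.println("First value of row 42: " + row42[0]);
            }
        }
    }

Each row read is then a single seek plus one bulk read.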
If you do store it one file per row, I might suggest making a directory tree corresponding to decimal digits, so 0/0/0/0 through 9/9/9/9.
Considerations one way or the other...
Is it being backed up? Do you have high-capacity backup media or something ordinary?
Does this file ever change?
If it does change and it is backed up, does it change all at once or are changes localized?
It depends on the algorithms you want to execute, but I guess that in most cases a representation where each file contains some square or rectangular region would be better.
For example, matrix multiplication can be done recursively by breaking a matrix into submatrices.
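As a small illustration of the tiled layout (the block size and file-naming scheme here are made up), finding the file and in-file offset of element (i, j) is just integer arithmetic:

    // Hypothetical tiled layout: the matrix is cut into square BLOCK x BLOCK tiles,
    // each tile stored row-major in its own file named "block_<blockRow>_<blockCol>.bin".
    public class TiledMatrixLayout {
        static final int BLOCK = 1_000;   // 1000 x 1000 doubles = ~8 MB per tile (illustrative)

        static String blockFileFor(long i, long j) {
            return "block_" + (i / BLOCK) + "_" + (j / BLOCK) + ".bin";
        }

        static long offsetInBlock(long i, long j) {
            long localRow = i % BLOCK;
            long localCol = j % BLOCK;
            return (localRow * BLOCK + localCol) * Double.BYTES;  // byte offset inside the tile
        }

        public static void main(String[] args) {
            System.out.println(blockFileFor(123_456, 7_890) + " at offset " + offsetInBlock(123_456, 7_890));
        }
    }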
If you are going to be saving it in a file, I believe serializing it will save space/time over storing it as text.
Serializing the doubles will store them as 8 bytes each (plus serialization overhead) and means that you will not have to convert these doubles back and forth to and from Strings when saving or loading the file.
I'd suggest using a disk-persistent cache like Ehcache. Just configure it to keep as many fragments of your matrix in memory as you like and it will take care of the serialization. All you have to do is decide on the fragmentation scheme.
Another approach that comes to mind is using Terracotta (which recently bought Ehcache, by the way). It's great for getting a large network-attached heap that can easily manage your 10^10 double values without you having to manage it in code at all.
As the title states, I'm trying to read an unknown number of elements from a file into an array. What's the simplest (the professor wants us to avoid using things she hasn't taught) yet effective way of going about this?
I've thought about reading and counting the elements in the file one by one, then creating the array once I know what size to make it, and then actually storing the elements in it. But that seems a little inefficient. Is there a better way?
There are only two ways to do this: the way you suggested (count, then read), and making an array and hoping it's good enough, then resizing it if it's not (which is the easier of the two, as ArrayList does that automatically for you).
Which is better depends on whether you're more limited by time or memory (as typically reading the file twice will be slower than reallocating an array even multiple times).
EDIT: There is a third way, which is only available if each record in the file has a fixed width, and the file is not compressed (or encoded in any other way that would mess with the content layout): get the size of the file, divide by record size, and that's exactly how many records you have to allocate for. Unfortunately, life is not always that easy. :)
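A sketch of that third way, under the assumption (not from the question) that the file holds raw 8-byte binary doubles with no header:

    import java.io.File;

    // Sketch: fixed-width binary records (8-byte doubles, no header) let you size
    // the array exactly from the file length. "values.bin" is a made-up name.
    public class FixedWidthCount {
        public static void main(String[] args) {
            File data = new File("values.bin");
            long recordSize = Double.BYTES;                  // 8 bytes per record (assumption)
            long count = data.length() / recordSize;         // exact number of records
            double[] values = new double[(int) count];       // assumes the count fits an int
            System.out.println("Allocating for " + count + " records: " + values.length);
        }
    }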
Since you have not mentioned whether your professor has taught ArrayList or not, I would go for ArrayList for sure. The whole purpose of ArrayList is to deal with this kind of situation. Features like dynamic resizing give ArrayList some advantages over a plain array.
You could count first and then create the array accordingly, but that will be slow, and why reinvent the wheel?
So your approach should be to use an ArrayList: read each element and just call ArrayList.add().
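For example, a minimal sketch using Scanner (the file name and the int element type are assumptions):

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    // Sketch: read an unknown number of whitespace-separated ints from a file.
    public class ReadUnknownCount {
        public static void main(String[] args) throws FileNotFoundException {
            List<Integer> elements = new ArrayList<>();
            try (Scanner in = new Scanner(new File("input.txt"))) {  // made-up file name
                while (in.hasNextInt()) {
                    elements.add(in.nextInt());                      // the list grows as needed
                }
            }
            System.out.println("Read " + elements.size() + " elements");
        }
    }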
I have a huge file with a unique word on each line. The file is around 1.6 GB (after this I have to sort other files which are around 15 GB). Until now, for smaller files I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to do this without writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for this?
If I have to write a sorting program, which one should I use: quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In the case of well-structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit for this, given that the size doesn't exceed a few hundred GBs. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and its support for JSON data make it a really great candidate.
If you really want to go with the Java approach, it gets really tricky. This is the kind of question you ask at job interviews and I would never actually expect anybody to implement code for it. However, the general solution is merge sort (using random access files is a bad idea because it effectively means insertion sort, i.e., a non-optimal running time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it, and then writing it back to a new file on disk. After you have read the whole file, you can start merging the chunk files two at a time by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files and then continue with the second one until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort, the academic recursive algorithm being the top-down approach.
Note that having intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex but it reduces the number of I/O operations required.
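To make the idea concrete, here is a rough sketch of such an external sort, assuming one word per line as in the question; the file names and chunk size are illustrative, and the multiway merge uses a PriorityQueue keyed on each chunk's current head line:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Rough sketch of a bottom-up external sort for a file with one word per line.
    // File names and the chunk size are illustrative, not from the original post.
    public class ExternalSort {
        static final int CHUNK_LINES = 1_000_000;   // tune to the available heap

        public static void main(String[] args) throws IOException {
            List<Path> chunks = splitAndSortChunks(Paths.get("words.txt"));
            mergeChunks(chunks, Paths.get("words-sorted.txt"));
        }

        // Phase 1: read chunks that fit in memory, sort each, write them to temp files.
        static List<Path> splitAndSortChunks(Path input) throws IOException {
            List<Path> chunks = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                List<String> buffer = new ArrayList<>(CHUNK_LINES);
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() == CHUNK_LINES) {
                        chunks.add(writeSortedChunk(buffer));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) {
                    chunks.add(writeSortedChunk(buffer));
                }
            }
            return chunks;
        }

        static Path writeSortedChunk(List<String> buffer) throws IOException {
            Collections.sort(buffer);
            Path tmp = Files.createTempFile("sort-chunk-", ".txt");
            Files.write(tmp, buffer, StandardCharsets.UTF_8);
            return tmp;
        }

        // Phase 2: multiway merge keyed on the current head line of every chunk,
        // so no intermediate generations of files are needed.
        static void mergeChunks(List<Path> chunks, Path output) throws IOException {
            PriorityQueue<ChunkHead> heads =
                    new PriorityQueue<>(Comparator.comparing((ChunkHead h) -> h.line));
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                String first = reader.readLine();
                if (first != null) {
                    heads.add(new ChunkHead(first, reader));
                }
            }
            try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                while (!heads.isEmpty()) {
                    ChunkHead smallest = heads.poll();
                    out.write(smallest.line);
                    out.newLine();
                    String next = smallest.reader.readLine();
                    if (next != null) {
                        heads.add(new ChunkHead(next, smallest.reader));
                    } else {
                        smallest.reader.close();
                    }
                }
            }
        }

        static final class ChunkHead {
            final String line;
            final BufferedReader reader;
            ChunkHead(String line, BufferedReader reader) {
                this.line = line;
                this.reader = reader;
            }
        }
    }

Note that keeping one open reader per chunk works fine as long as the number of chunks stays below the OS limit on open file handles.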
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that involves loading the whole file content into an array (as long as you can't grow your heap).
So you're facing streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content through your program, which should perform the sorting by outputting either to a random access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with Integers, you may wish to use around 100,000 elements per read; with Strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort the n elements and write them to a temporary file with a unique name.
Now, since all the files are sorted, the smallest element of each will be at its start. You can then just iterate over the files until you have processed all the elements, finding the smallest remaining element and printing it to the new final output.
This approach will reduce the amount of RAM needed and instead rely on drive space and will allow you to handle sorting of any file size.
Build an array of record positions inside the file (a kind of index); maybe it would fit into memory instead. You need an 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file using the index pointers to fetch the records in the required order.
This will also work if the records are not all the same size.
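A rough sketch of this pointer-sort idea, assuming one record per line and a non-empty input (file names are placeholders); note that every comparison costs a seek and a read, so it trades speed for memory:

    import java.io.*;
    import java.util.*;

    // Sketch of the pointer-sort idea: build an index of record offsets, sort the
    // index by comparing records read on demand, then write the output in index order.
    // Assumes one record per line and a non-empty input; file names are made up.
    public class IndexPointerSort {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile("huge-words.txt", "r")) {
                // 1. Collect the byte offset of every line (8 bytes of index per record).
                List<Long> offsets = new ArrayList<>();
                long pos = 0;
                while (raf.readLine() != null) {
                    offsets.add(pos);
                    pos = raf.getFilePointer();
                }
                // 2. Sort the pointers; each comparison seeks and re-reads two records.
                offsets.sort(Comparator.comparing((Long off) -> readAt(raf, off)));
                // 3. Write the records out in sorted order.
                try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("sorted.txt")))) {
                    for (long off : offsets) {
                        out.println(readAt(raf, off));
                    }
                }
            }
        }

        private static String readAt(RandomAccessFile raf, long offset) {
            try {
                raf.seek(offset);
                return raf.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }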
I have a large data set in the following format:
In total, there are 3687 object files, each of which contains 2,000,000 records. Each file is 42 MB in size.
Each record contains the following:
An id (Integer value)
Value1 (Integer)
Value2 (Integer)
Value3 (Integer)
The content of each file is not sorted or ordered in any way as they are observed during a data collection process.
Ideally, I want to build an index for this data (indexed by the id), which would mean the following:
Dividing the set of ids into manageable chunks.
Scanning the files to get data related to the current working set of ids.
Building the index.
Going over the next chunk and repeating steps 1, 2, 3.
To me this sounds fine, but loading 152 GB back and forth is time-consuming, and I wonder about the best possible approach, or even whether Java is actually the right language to use for such a process.
I have 256 GB of RAM and 32 cores on my machine.
Update:
Let me modify this: putting aside I/O, assume the file is already in memory in a byte array.
What would be the fastest possible way to decode a 42 MB object file that has 2,000,000 records, where each record contains 4 serialized Integers?
You've made a very poor choice of file format. I would convert the lot from serialized Integers to binary ints written with DataOutputStream.writeInt(), and read them with DataInputStream.readInt(). With buffered streams underneath in both cases. You will save masses of disk space, which will therefore save you I/O time as well, and you also save all the serialization overhead time. And change your collection software to use this format in future. The conversion will take a while, but it only happens once.
Or else use a database as suggested, again with native ints rather than serialized objects.
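A rough sketch of the one-time conversion described above; the file names are placeholders and the exact record layout (four serialized Integer objects per record, in id/value order) is an assumption based on the question:

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;

    // One-time conversion sketch: read records of four serialized Integer objects
    // (id, value1, value2, value3 -- layout assumed from the question) and rewrite
    // them as raw ints. File names are placeholders.
    public class ConvertToBinaryInts {
        public static void main(String[] args) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(
                         new BufferedInputStream(new FileInputStream("records.ser")));
                 DataOutputStream out = new DataOutputStream(
                         new BufferedOutputStream(new FileOutputStream("records.bin")))) {
                for (int record = 0; record < 2_000_000; record++) {   // 2,000,000 records per file
                    for (int field = 0; field < 4; field++) {          // id + three values
                        Integer value = (Integer) in.readObject();
                        out.writeInt(value);                           // 4 bytes instead of a serialized object
                    }
                }
            }
        }
    }

Reading the converted file back is then just four DataInputStream.readInt() calls per record.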
So, what I would do is just load up each file and store the ids in some sort of sorted structure, std::map perhaps (or Java's equivalent; but given that it's probably about 10-20 lines of code to read in the filename, read the contents of the file into a map, close the file, and ask for the next file, I'd probably just write the C++ to do that).
I don't really see what else you can or should do, unless you actually want to load it into a DBMS, which I don't think is an unreasonable suggestion at all.
Hmm, it seems the better way of doing this is to use some kind of DBMS. Load all your data into the database, and you can leverage its indexing, storage, and querying facilities. Of course, this depends on what your requirements are, and whether or not a DBMS solution suits them.
Given that your available memory is greater than your dataset and you want very high performance, have you considered Redis? It's well suited to operations on simple data structures and the performance is very fast.
Just be a bit careful about letting Java do default serialization when storing values. I've previously run into issues with my primitives getting autoboxed prior to serialization.
First of all, sorry for my English and for the length of the message.
I'm writing a simple application in Java for visual cryptography for a school project. It takes a schema file and a secret image, then creates n images using the information contained in the schema.
For each pixel in the secret image, the application looks up a matrix in the schema file and writes m pixels to the n shares (one row for each share).
A schema file contains the (n*m) matrices for every color needed for the encoding, and it is structured as follows:
COLLECTION COLOR 1
START MATRIX 1
RGB
GBR
BGR
END
START MATRIX 2
.....
COLLECTION COLOR 2
START MATRIX 1
XXX
XXX
XXX
END
......
//
This file can be a few lines or many thousands of lines, so I can't keep all the matrices in the application's memory; I always need to read them from the file.
To test the performance, I created a parser that simply searches for a matrix line by line, but it is very slow.
I thought I'd save the position of each matrix and then use RandomAccessFile to read it, but I wanted to know if there is a better method for doing this.
Thanks
If you are truly dealing with massive input files that exceed your ability to load the entire thing into RAM, then using a persistent key/value store like MapDB may be an easy way to do this. Parse the file once and build an efficient [Collection+Color]->Matrix map. Store that in a persistent HTree. That'll take care of all of the caching, etc. for you. Make sure to create a good hash function for the Collection+Color tuple, and it should be very performant.
If your data access pattern tends to clump together, it may be faster to store in a B+Tree index - you can play with that and see what works best.
For your schema file, use a FileChannel and call .map() on it. With a little effort, you can calculate the necessary offsets into the mapped representation of the file and use that, or even encapsulate this mapping into a custom structure.
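A minimal sketch of the mapping approach; the file name and the offset/length values are placeholders that would in practice come from an index built during a single parsing pass:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Sketch: memory-map the schema file and jump directly to a matrix whose
    // offset/length were recorded earlier. "schema.txt", 1024 and 64 are placeholders.
    public class MappedSchema {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(Paths.get("schema.txt"),
                                                        StandardOpenOption.READ)) {
                MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                byte[] block = new byte[64];      // length of one matrix block (assumed)
                map.position(1024);               // offset recorded during the initial parse (assumed)
                map.get(block);
                System.out.println(new String(block, StandardCharsets.UTF_8));
            }
        }
    }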
We have a binary file which contains a large amount of float data (about 80 MB). We need to process it in our Java application. The data comes from a medical scanner. One file contains data from one Rotation. One Rotation contains 960 Views. One View contains 16 Rows, and one Row contains 1344 Cells. Those numbers (and their relationships) are fixed.
We need to read ALL the floats into our application with a code structure that reflects the Rotation-View-Row-Cell structure above.
What we are doing now is using float[] to hold data for Cells and then using ArrayList for Rotation, View and Row to hold their data.
I have two questions:
How can we populate the Cell data (read the floats into our float[]) quickly?
Do you have a better idea for holding this data?
Use a DataInputStream (and its readFloat() method) wrapping a FileInputStream, possibly with a BufferedInputStream in between (try whether the buffer helps performance or not).
Your data structure looks fine.
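For illustration, a sketch of the loading loop using the fixed sizes from the question; the file name is a placeholder, and the cell order and big-endian byte order are assumptions about the scanner's format:

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    // Sketch: read one Rotation (960 Views x 16 Rows x 1344 Cells of floats) from a
    // raw binary file. "rotation.dat" and big-endian cell order are assumptions.
    public class RotationLoader {
        static final int VIEWS = 960, ROWS = 16, CELLS = 1344;

        static float[][][] load(String fileName) throws IOException {
            float[][][] rotation = new float[VIEWS][ROWS][CELLS];
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(fileName), 1 << 16))) {
                for (int v = 0; v < VIEWS; v++) {
                    for (int r = 0; r < ROWS; r++) {
                        for (int c = 0; c < CELLS; c++) {
                            rotation[v][r][c] = in.readFloat();   // DataInputStream is big-endian
                        }
                    }
                }
            }
            return rotation;
        }

        public static void main(String[] args) throws IOException {
            float[][][] rotation = load("rotation.dat");
            System.out.println("First cell of first view: " + rotation[0][0][0]);
        }
    }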
Assuming you don't make changes to the data (add more views, etc.) why not put everything in one big array? The point of ArrayLists is you can grow and shrink them, which you don't need here. You can write access methods to get the right cell for a given view, rotation, etc.
Using arrays of arrays is a better idea; that way the system figures out the indexing for you, and it is just as fast as a single array.
Michael is right, you need to buffer the input, otherwise you will be doing a file access operation for every byte and your performance will be awful.
If you want to stick with the current approach as much as possible, you can minimize the memory used by your ArrayLists by setting their capacity to the number of elements they hold. Otherwise they keep a number of slots in reserve, expecting you to add more.
For the data loading:
DataInputStream should work well. But make sure you wrap the underlying FileInputStream in a BufferedInputStream, otherwise you run the risk of doing I/O operations for every float which can kill performance.
Several options for holding the data:
The (very marginally) most memory-efficient way would be to store the entire dataset in one large float[] and calculate offsets into it as needed. A bit ugly to use, but it might make sense if you are doing a lot of calculations or processing loops over the entire set.
The most "OOP" way would be to have separate objects for Rotation, View, Row and Cell. But having each Cell as a separate object is pretty wasteful and might even blow your memory limits.
You could use nested ArrayLists with a float[1344] to represent the lowest level data for the cells in each row. I understand this is what you're currently doing - in fact I think it's a pretty good choice. The overhead of the ArrayLists won't be much compared to the overall data size.
A final option would be to use a float[viewNum][rowNum][cellNum] to represent each rotation. A bit more efficient than ArrayLists, but arrays are usually less pleasant to manipulate. However, this seems a pretty good option if, as you say, the array sizes will always be fixed. I'd probably choose this option myself.
Are you having any particular performance/usage problems with your current approach?
The only thing I can suggest based on the information that you provide is to try representing a View as float[][] of rows and cells.
I also think that you can put your whole data structure into a float[][][] (the same as Nathan Hughes suggests). You could have a method that reads your file and returns a float[][][], where the first dimension is that of views (960), the second is that of rows (16), and the third is that of cells (1344). If those numbers are fixed, you'd better use this approach: you save memory, and it's faster.
80 MB shouldn't be so much data that you need to worry so terribly much. I would really suggest:
create Java wrapper objects representing the most logical structure/hierarchy for the data you have;
one way or another, ensure that you're only making an actual "raw" I/O call (an InputStream.read() or equivalent) every 16K or so of data; e.g. you could read into a 16K/32K byte array that is wrapped in a ByteBuffer for the purpose of pulling out the floats, or whatever you need for your data (see the sketch after this list);
if you actually have a performance problem with this approach, try to identify, not second-guess, what that performance problem actually is.
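Here is a sketch of the buffered-ByteBuffer idea from the second point above; the file name, buffer size and byte order are placeholders to adapt:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Sketch: pull floats out of the file through a reusable 32 KB ByteBuffer so the
    // channel is only hit every few thousand values. The file name, buffer size and
    // byte order are placeholders.
    public class ChunkedFloatReader {
        public static void main(String[] args) throws IOException {
            int total = 960 * 16 * 1344;                 // floats in one rotation
            float[] data = new float[total];
            ByteBuffer buffer = ByteBuffer.allocate(32 * 1024).order(ByteOrder.BIG_ENDIAN);
            int filled = 0;
            try (FileChannel channel = FileChannel.open(Paths.get("rotation.dat"),
                                                        StandardOpenOption.READ)) {
                while (filled < total && channel.read(buffer) != -1) {
                    buffer.flip();                        // switch to draining what was just read
                    while (buffer.remaining() >= Float.BYTES && filled < total) {
                        data[filled++] = buffer.getFloat();
                    }
                    buffer.compact();                     // keep any partial float for the next read
                }
            }
            System.out.println("Read " + filled + " floats");
        }
    }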
I understand that you are looking for an effective way to store the data you described above. Although the size you mention is not very huge, I would suggest you have a look at Huge Collections.