How to Store Large Java Objects/Multidimensional Arrays?

I'm writing a Web Application with a Java back-end running on a Tomcat server and a JavaScript client.
In the back-end I have to process a large int[][][] array which holds the data of a CT scan. Its size is approximately 1024 x 1024 x 200.
I want to load this array into memory only when it's needed to process new data such as image slices, and keep it in some kind of database the rest of the time.
Things I tried so far:
Using JDBM3 to store a HashMap<String, int[][][]>; this runs into an out-of-memory error.
Serializing the object and saving it into a PostgreSQL DB using the bytea[] data type; it stores correctly, but I get a memory error when loading it again.
So my first question is: how can I save such a big array (which DB, which method)? It should load fast, and there should be some kind of multi-user access safety, because multiple users will be able to use the front-end and therefore load the int[][][] into the back-end. The database should have a non-commercial license, e.g. GPL, MIT, Apache...
Second question: I know I could save the serialized array in the file system and keep a link to it in the DB, but is that access safe for multiple users?

If you have enough RAM on the client machines, you could start by simply increasing the size of the JVM heap. This way, you should be able to create larger arrays without running into 'out of memory' errors.
You'll need at least approximately 800 MB to play with (1024 x 1024 x 200 x 4-byte ints) for the array alone, so set -Xmx accordingly.

I think a memory-mapped file was born to handle exactly this kind of thing. It offers you an array-like view of a file on disk, with random access for reads and writes. All you have to do is devise a scheme that lays an int[][][] out over the file's bytes, and that shouldn't be a problem. Done this way, you never have to hold the whole array in memory; you only create the slices you are actually using. Even if you need to iterate over all the slices, you can instantiate a single slice at a time.
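A minimal sketch of that layout using FileChannel.map() from java.nio. The class name, the slice-after-slice file layout and the fixed 1024 x 1024 x 200 dimensions are my assumptions for illustration, not anything from the question:

import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Sketch only: a 1024 x 1024 x 200 volume of ints kept in a single file,
 * laid out one z-slice after another. Only the slice being worked on is mapped.
 */
public class MappedVolume {
    static final int DIM_X = 1024, DIM_Y = 1024, DIM_Z = 200;
    static final long SLICE_BYTES = (long) DIM_X * DIM_Y * Integer.BYTES;

    private final FileChannel channel;

    public MappedVolume(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    /** Maps only the requested z-slice and exposes it as an int view. */
    public IntBuffer slice(int z) throws IOException {
        long offset = z * SLICE_BYTES;
        return channel.map(FileChannel.MapMode.READ_WRITE, offset, SLICE_BYTES)
                      .asIntBuffer();
    }

    /** Reads a single voxel without ever loading the whole volume.
     *  A real implementation would cache the mapped slice instead of remapping each time. */
    public int get(int x, int y, int z) throws IOException {
        return slice(z).get(y * DIM_X + x);
    }
}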

If it's a CAT scan, are the pixels 256-level grayscale? If so, you can save a significant amount of memory by storing the data as a byte array rather than an int array. If it is 64K grayscale, use a short rather than an int.
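A sketch of the packing, assuming every voxel value really does fit into the unsigned 0..255 range (which would cut the roughly 800 MB volume down to about 200 MB):

/** Sketch only: assumes 8-bit grayscale, i.e. every voxel value fits in 0..255. */
public class GrayscalePacking {
    /** Packs one slice of int voxels into bytes, cutting memory use by 4x. */
    public static byte[] pack(int[] slice) {
        byte[] packed = new byte[slice.length];
        for (int i = 0; i < slice.length; i++) {
            packed[i] = (byte) slice[i];
        }
        return packed;
    }

    /** Recovers the unsigned 0..255 value from a stored byte. */
    public static int unpack(byte b) {
        return b & 0xFF;
    }
}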

Related

Storing enormously large arrays

I have a problem. I work in Java (Eclipse). My program calculates some mathematical physics, and I need to draw an animation (using the Java SWT package) of the process (some hydrodynamics). The problem is 2D, so each iteration returns a two-dimensional array of numbers. One iteration takes a rather long time, and the time needed varies from iteration to iteration, so showing the pictures dynamically as the program runs seems like a bad idea. My idea was therefore to store a three-dimensional array, where the third index represents time, and build the animation once the calculations are over. But since I want accuracy from my program, I need a lot of iterations, so the program easily reaches the maximum array size. So the question is: how do I avoid creating such an enormous array, or how do I avoid the limitations on array size? I thought about creating a special file to store the data and then reading from it, but I'm not sure about this. Do you have any ideas?
When I was working on a procedural architecture generation system at university for my dissertation, I wrote the calculated data to small binary files that were extremely easy to read and parse. This meant the data could be read back in an acceptable amount of time, despite being quite large...
I would suggest doing the same for your animation: store maybe five seconds of animation per file and then cache each file as it is about to be required...
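Something like the following sketch; the frame dimensions, file naming and the use of double values are assumptions on my part, not taken from the question:

import java.io.*;

/**
 * Rough sketch of a per-file scheme: each file holds a short block of 2D
 * frames written as raw doubles, so only one block is in memory at a time.
 */
public class FrameStore {
    /** Writes one block of frames (frames[t][row][col]) to its own small binary file. */
    public static void writeBlock(File file, double[][][] frames) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(frames.length);          // number of frames in this block
            out.writeInt(frames[0].length);       // rows per frame
            out.writeInt(frames[0][0].length);    // columns per frame
            for (double[][] frame : frames)
                for (double[] row : frame)
                    for (double v : row)
                        out.writeDouble(v);
        }
    }

    /** Reads a block back when the animation needs it. */
    public static double[][][] readBlock(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int t = in.readInt(), rows = in.readInt(), cols = in.readInt();
            double[][][] frames = new double[t][rows][cols];
            for (int i = 0; i < t; i++)
                for (int r = 0; r < rows; r++)
                    for (int c = 0; c < cols; c++)
                        frames[i][r][c] = in.readDouble();
            return frames;
        }
    }
}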
Also, how large are your arrays? If it's the memory limit rather than the maximum array size you're hitting, you could increase the amount of memory the JVM is allowed to allocate.
I hope this helps and isn't just my ramblings...

How to Handle Large Data in Java?

My application needs to use data in a text file which is up to 5 GB in size. I cannot load all of this data into RAM as it is far too large.
The data is stored like a table: 5 million records (rows) and 40 columns, each containing text that will be converted in memory to either Strings, ints, or doubles.
I've tried caching only 10-100 MB of the data in memory and reloading from the file when I need data outside that window, but it is way too slow: because my calculations can jump randomly to any row within the table, it constantly needs to open the file, read, and close it again.
I need something fast, and I was thinking of using some sort of DB. I know calculations with large data like this may take a while, which is fine. If I do use a DB, it needs to be set up on launch of the desktop application and must not require some sort of server component to be installed beforehand.
Any tips? Thanks
I think you need to clarify some things:
Is this a desktop application (I assume yes)? What is its memory limit?
Do you use your file in read-only mode?
What kind of calculations are you trying to do? (How often are random rows accessed, how often are consecutive rows read, do you need to modify the data?)
Currently I see two ways for further investigation:
Use SQLite. This is a small single-file DB, oriented mainly at desktop applications and single-user use. It doesn't require any server; all you need is the appropriate JDBC library.
Create some kind of index, using, for example, a binary tree. The first time you read your file, index the start position of each row within the file. In conjunction with a permanently open random-access file, this will let you seek to and read the desired row quickly. For a binary tree, your index may be approximately 120 MB (RowsCount * 2 * IndexValueSize). A sketch of the indexing idea follows below.
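For illustration only, here is a sketch of the row-offset index using a flat long[] instead of a tree; the class name and the one-record-per-line assumption are mine:

import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Sketch of a row-offset index: one pass records where each row starts,
 * after which any row can be read with a single seek on a permanently
 * open random access file.
 */
public class RowIndex {
    private final RandomAccessFile file;
    private final long[] offsets;   // offsets[i] = byte position where row i starts

    public RowIndex(String path, int rowCount) throws IOException {
        file = new RandomAccessFile(path, "r");
        offsets = new long[rowCount];
        long pos = 0;
        for (int row = 0; row < rowCount; row++) {
            offsets[row] = pos;
            if (file.readLine() == null) break;   // one-time (slow, unbuffered) indexing pass
            pos = file.getFilePointer();
        }
    }

    /** Jumps straight to the requested row and reads it. */
    public String readRow(int row) throws IOException {
        file.seek(offsets[row]);
        return file.readLine();
    }
}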
You can use an embedded database; you can find a comparison here: Java Embedded Databases Comparison.
Or, depending on your use case, you may even try Lucene, which is a full-text search engine.

Large-scale processing of serialized Integer objects

I have a large data set in the following format:
In total there are 3687 object files, each of which contains 2,000,000 records. Each file is 42 MB in size.
Each record contains the following:
An id (Integer value)
Value1 (Integer)
Value2 (Integer)
Value3 (Integer)
The content of each file is not sorted or ordered in any way as they are observed during a data collection process.
Ideally, I want to build an index for this data (indexed by the id), which would mean the following:
Dividing the set of ids into manageable chunks.
Scanning the files to get data related to the current working set of ids.
Build the index.
Go over the next chunk and repeat 1,2,3.
To me this sounds fine, but shuffling 152 GB back and forth is time-consuming, and I wonder about the best possible approach, or even whether Java is actually the right language to use for such a process.
I have 256 GB of RAM and 32 cores on my machine.
Update:
Let me modify this: put I/O aside and assume the file is already in memory in a byte array.
What would be the fastest possible way to decode a 42 MB object file that holds 2,000,000 records, each containing 4 serialized Integers?
You've made a very poor choice of file format. I would convert the lot from serialized Integers to binary ints written with DataOutputStream.writeInt(), and read them back with DataInputStream.readInt(), with buffered streams underneath in both cases. You will save masses of disk space, which will therefore save you I/O time as well, and you also save all the serialization overhead. And change your collection software to use this format in future. The conversion will take a while, but it only happens once.
Or else use a database as suggested, again with native ints rather than serialized objects.
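A sketch of that one-time conversion, assuming each record's four Integers were written individually with ObjectOutputStream (that layout is an assumption; adapt the reading loop to the real format):

import java.io.*;

/**
 * Sketch: reads serialized Integer records and rewrites them as raw ints.
 * The record layout (id plus three values) follows the question; file
 * names and record counts are placeholders.
 */
public class ConvertToBinary {
    public static void convert(File serializedIn, File binaryOut, int recordCount)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                     new BufferedInputStream(new FileInputStream(serializedIn)));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(binaryOut)))) {
            for (int i = 0; i < recordCount; i++) {
                // Each record: id, value1, value2, value3 as serialized Integers.
                for (int field = 0; field < 4; field++) {
                    Integer boxed = (Integer) in.readObject();
                    out.writeInt(boxed);        // 4 raw bytes instead of a serialized object
                }
            }
        }
    }

    /** Reading the converted file back is just a tight loop over readInt(). */
    public static int[] readRecord(DataInputStream in) throws IOException {
        return new int[] { in.readInt(), in.readInt(), in.readInt(), in.readInt() };
    }
}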
So, what I would do is just load up each file and store the ids into some sort of sorted structure, std::map perhaps (or Java's equivalent, TreeMap; but given that it's probably only about 10-20 lines of code to take a filename, read the contents of the file into a map, close the file and ask for the next one, I'd probably just write the C++ to do that).
I don't really see what else you can or should do, unless you actually want to load it into a DBMS, which I don't think is an unreasonable suggestion at all.
Hmm, it seems the better way of doing it is to use some kind of DBMS. Load all your data into the database and you can leverage its indexing, storage and querying facilities. Of course this depends on what your requirements are, and whether or not a DBMS solution suits them.
Given that your available memory is greater than your dataset and you want very high performance, have you considered Redis? It's well suited to operations on simple data structures and the performance is very fast.
Just be a bit careful about letting Java do default serialization when storing values; I've previously run into issues with primitives getting autoboxed prior to serialization.
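For example, a sketch using the Jedis client (my assumption; any Redis client with byte[] operations would do), packing each record's three values into 12 raw bytes under the id so no Java serialization or autoboxing is involved:

import java.nio.ByteBuffer;
import redis.clients.jedis.Jedis;

/**
 * Sketch only: each record's ints are packed into raw bytes and stored
 * under the id, sidestepping Java default serialization entirely.
 */
public class RedisRecordStore {
    private final Jedis jedis = new Jedis("localhost");

    public void put(int id, int v1, int v2, int v3) {
        byte[] key = ByteBuffer.allocate(4).putInt(id).array();
        byte[] value = ByteBuffer.allocate(12).putInt(v1).putInt(v2).putInt(v3).array();
        jedis.set(key, value);                    // byte[] overload, no serialization
    }

    public int[] get(int id) {
        byte[] key = ByteBuffer.allocate(4).putInt(id).array();
        ByteBuffer value = ByteBuffer.wrap(jedis.get(key));
        return new int[] { value.getInt(), value.getInt(), value.getInt() };
    }
}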

Reading Big File in Java

I have a Swing application which works on a CSV file. It reads the full file line by line, computes some required statistics and shows the output.
The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data. The problem is that the JVM takes four times as much memory as the file size (while processing an 86 MB file, the heap uses 377 MB of space; memory utilization checked using jVisualVM).
Note:
I have used a LineNumberReader for reading the file (because of a specific requirement; I can change it if that helps with memory usage).
Every line is read with readLine(), and then .split(",") is called on that String to get the individual fields of the record.
Each record is stored in a Vector for display in the JTable, whereas other statistics are stored in a HashMap and a TreeMap, and summary data in a JavaBean class. One graph is also plotted using JFreeChart.
Please suggest how to reduce memory utilization, as I need to process a 2 GB file.
Try giving OpenCSV a shot. It only stores the last read line when you use the readNext() method. For large files this is perfect.
From their website, the following are the features they support:
Arbitrary numbers of values per line
Ignoring commas in quoted elements
Handling quoted entries with embedded carriage returns (i.e. entries that span multiple lines)
Configurable separator and quote characters (or use sensible defaults)
Read all the entries at once, or use an Iterator style model
Creating CSV files from String[] (i.e. automatic escaping of embedded quote chars)
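A minimal sketch of the streaming read with OpenCSV, assuming the com.opencsv package (older releases live under au.com.bytecode.opencsv instead); the file name and numeric column are placeholders:

import java.io.FileReader;
import com.opencsv.CSVReader;

/**
 * Streaming sketch: only one CSV line is held in memory at a time while
 * the statistics are accumulated.
 */
public class CsvStats {
    public static void main(String[] args) throws Exception {
        double sum = 0;
        long rows = 0;
        try (CSVReader reader = new CSVReader(new FileReader("records.csv"))) {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                sum += Double.parseDouble(fields[2]);   // hypothetical numeric column
                rows++;
            }
        }
        System.out.println("rows=" + rows + " mean=" + (sum / rows));
    }
}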
Use best practices to improve your program:
Use multithreading to get better CPU utilization.
Set minimum and maximum heap sizes to make better use of RAM.
Use proper data structures and design.
Every Java object has a memory overhead, so if your Strings are really short, that could explain why you get four times the size of your file. You also have to count the size of the Vector and its internals. I don't think a Map would improve memory usage, since Java only shares String instances that are interned (such as compile-time literals).
I think you should revise your design. Given your requirements
the upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data
you don't need to store the whole file in memory. You need to read it entirely to compute your statistics, but that can certainly be done using a very small amount of memory. The JTable part can also be accomplished in a number of ways without requiring 2 GB of heap space for your program! There must be something wrong when someone wants to keep a whole CSV file in memory. Have a look at Apache Commons IO's LineIterator.
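A sketch of a single streaming pass with Commons IO's LineIterator; the file path, the column index and the statistics computed here are placeholders:

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

/**
 * Streaming statistics sketch: only the current line lives on the heap.
 * The JTable would be fed separately, a page at a time.
 */
public class StreamingStats {
    public static void main(String[] args) throws Exception {
        File csv = new File("data.csv");       // placeholder path, assumes no header row
        double sum = 0, max = Double.NEGATIVE_INFINITY;
        long count = 0;
        LineIterator it = FileUtils.lineIterator(csv, "UTF-8");
        try {
            while (it.hasNext()) {
                String[] fields = it.nextLine().split(",");
                double v = Double.parseDouble(fields[1]);   // hypothetical numeric column
                sum += v;
                max = Math.max(max, v);
                count++;
            }
        } finally {
            LineIterator.closeQuietly(it);
        }
        System.out.printf("count=%d mean=%.3f max=%.3f%n", count, sum / count, max);
    }
}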
Increase the JVM heap size (-Xms and -Xmx). If you have the memory, this is the best solution. If you cannot do that, you will need to find a compromise that will be a combination of data model and presentation (GUI) changes, usually resulting in increased code complexity and potential for bugs.
Try modifying your statistics algorithms to do their work as the data is being read, and not require it all exist in memory. You may find algorithms that approximate the statistics to be sufficient.
If your data contains many duplicate String values, use a HashMap or HashSet as a cache so equal values are stored only once (a sketch follows after this list). Beware, caches are notorious for becoming memory leaks (e.g. if you don't clear them before loading a different file).
Reduce the amount of data being displayed on the graph. It is common for a graph with a lot of data to have many points displayed at or near the same pixel. Consider truncating the data by merging multiple values at or near the same position on the x-axis. If your data set contains 2,000,000 points, for example, most of them will coincide with other nearby points, so your underlying data model does not need to store all of them.
Beware of information overload. Will your JTable be meaningful to the user if it contains 2GB worth of data? Perhaps you should paginate the table, and read only 1000 entries from file at a time for display.
I'm hesitant to suggest this, but during the loading process, you could convert the CSV data into a file database (such as cdb). You could accumulate statistics and store some data for the graph during the conversion, and use the database to quickly read a page of data at a time for the JTable as suggested above.
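The de-duplication cache mentioned above could look roughly like this (a HashMap-backed canonicalizer; the class name is illustrative):

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a String de-duplication cache: repeated cell values are stored
 * once and shared. Remember to clear it before loading a different file.
 */
public class StringCache {
    private final Map<String, String> cache = new HashMap<>();

    /** Returns one canonical instance for equal strings, so duplicates share memory. */
    public String canonical(String s) {
        String existing = cache.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public void clear() {
        cache.clear();
    }
}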

Mapping files bigger than 2GB with Java

The question could be stated generally: how do you implement a method byte[] get(offset, length) for a memory-mapped file that is bigger than 2 GB in Java?
With context:
I'm trying to efficiently read files that are bigger than 2 GB with random I/O. Of course the idea is to use the Java NIO memory-mapping API.
The problem is the 2 GB limit on a single memory mapping. One solution would be to map multiple 2 GB pages and index into them through the offset.
There's a similar solution here:
Binary search in a sorted (memory-mapped ?) file in Java
The problem with that solution is that it's designed to read a single byte, while my API is supposed to read a byte[] (so my API would be something like read(offset, length)).
Would it just work to change that final get() to a get(offset, length)? What happens when the byte[] I'm reading lies across two pages?
No, my answer to Binary search in a sorted (memory-mapped ?) file would not work if you simply changed get() to get(offset, length), because of the memory-mapped file boundary, just as you suspect. I can see two possible solutions:
Overlap the memory-mapped files. When you do a read, pick the memory-mapped file whose start byte is immediately before the read's start byte. This approach won't work for reads larger than 50% of the maximum memory-map size.
Create a byte-array creation method that reads from two different memory-mapped files when the range spans a boundary. I'm not keen on this approach, as I think some of the performance gains will be lost because the resulting array will not be memory-mapped.
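A sketch of that second approach: the file is mapped in fixed-size chunks, and get(offset, length) copies from one or two chunks into a fresh byte[] when the requested range crosses a chunk boundary. The chunk size and class name are illustrative, and as noted above, the returned array is a plain copy rather than a mapped view:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch: a read-only file mapped as a series of 1 GB chunks. */
public class ChunkedMappedFile {
    private static final long CHUNK_SIZE = 1L << 30;   // 1 GB per mapping
    private final MappedByteBuffer[] chunks;

    public ChunkedMappedFile(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            int n = (int) ((size + CHUNK_SIZE - 1) / CHUNK_SIZE);
            chunks = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long pos = i * CHUNK_SIZE;
                chunks[i] = ch.map(FileChannel.MapMode.READ_ONLY,
                                   pos, Math.min(CHUNK_SIZE, size - pos));
            }
        }
    }

    /** Copies the requested range into a new array, spanning chunks if needed. */
    public byte[] get(long offset, int length) {
        byte[] result = new byte[length];
        int copied = 0;
        while (copied < length) {
            int chunk = (int) ((offset + copied) / CHUNK_SIZE);
            int within = (int) ((offset + copied) % CHUNK_SIZE);
            int n = Math.min(length - copied, (int) (CHUNK_SIZE - within));
            ByteBuffer view = chunks[chunk].duplicate();   // independent position/limit
            view.position(within);
            view.get(result, copied, n);
            copied += n;
        }
        return result;
    }
}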
