How to optimize memory usage for large file processing - Java

I have a file, and from that file I populate a HashMap<String, ArrayList<Objects>>. The HashMap will definitely have 25 keys, but each list will be huge, say a million records per key.
What I do now is, for each key, retrieve the list of records and process them in parallel using threads. This worked fine until I hit a larger file, and now I am getting "java.lang.OutOfMemoryError: Java heap space".
What is a better approach than populating the HashMap with the full lists of objects? What I am thinking of is to find the 25 offsets in the file and, instead of putting the lines I read into the ArrayLists, store just the file offsets and give each thread an iterator that walks from its start offset to its end offset. I still have to try this idea, but before I do, I would like to know of any better ways to optimize memory usage.

I will populate the HashMap<String, ArrayList<Objects>>
After populating the HashMap, what do you need to do with it? I assume that just populating the map is not your end goal. Whatever the scenario, you don't need to read the whole file into memory.
Increasing the heap size may not be a good solution, as someday you may get a file even bigger than your heap.
Read the file in chunks using a BufferedReader or BufferedInputStream, depending on your needs, and do your work as you read. Both classes keep only a small part of the file in memory at a time.
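For example, a minimal line-by-line streaming sketch (the file name and the processLine method are placeholders for your own input and per-record work):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class StreamingReader {
        public static void main(String[] args) throws IOException {
            // Only the current line (plus a small read buffer) lives in memory at any time.
            try (BufferedReader reader = new BufferedReader(new FileReader("huge-input.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    processLine(line); // do your per-record work here instead of storing the line
                }
            }
        }

        private static void processLine(String line) {
            // placeholder for whatever per-record processing you need
        }
    }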
instead of putting the lines I read from the file into the ArrayList, put the offset of the file and give each thread an iterator to iterate from its start offset to end offset. I still have to try this thought.
Using multiple threads will not prevent java.lang.OutOfMemoryError, because all the threads run in the same JVM. Furthermore, whether you read the file into one list or into multiple lists, all the data from the file is still read into the same heap memory.
If you mention what you actually want to do with the data from the file, this answer can be more specific.

Ditto what ares said: we need more information. What do you plan on doing with the map? Is it an operation that requires the whole file to be loaded into memory, or can it be done in parts?
Also, have you considered splitting the file into parts once its size surpasses a threshold?
Like Pshemo's answer here: How to break a file into pieces using Java?
Also, if you want to process in parallel, you could consider building a map that covers only a part of the file at a time, processing that map in parallel, and storing the results in a queue of some sort, provided the queue only ever holds a subset of the data you are processing (to avoid OutOfMemoryError).
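A rough sketch of that idea, using a bounded BlockingQueue so that only a limited window of records is in memory at any time; the queue capacity, file name and processRecord method are assumptions:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class BoundedParallelProcessing {
        private static final String POISON = "\u0000EOF"; // sentinel that tells workers to stop

        public static void main(String[] args) throws Exception {
            // The bounded queue caps how many records can be in memory at once.
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
            int workers = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(workers);

            for (int i = 0; i < workers; i++) {
                pool.submit(() -> {
                    try {
                        String record;
                        while (!(record = queue.take()).equals(POISON)) {
                            processRecord(record);        // your per-record work
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            try (BufferedReader reader = new BufferedReader(new FileReader("huge-input.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    queue.put(line);                      // blocks when the queue is full
                }
            }
            for (int i = 0; i < workers; i++) {
                queue.put(POISON);                        // one sentinel per worker
            }
            pool.shutdown();
        }

        private static void processRecord(String record) {
            // placeholder
        }
    }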

Related

Sort huge file in Java

I have a huge file with unique words on each line. The file is around 1.6 GB (and I have to sort other files after this which are around 15 GB). Until now, for smaller files, I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to do this short of writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for this?
If I have to write my own sorting program, which should I use, quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data, you can do many different things.
In the case of well-structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit, given that the size doesn't exceed a few hundred GB. Other NoSQL datastores might also fit the bill nicely, although MongoDB's simplicity of use and installation and its support for JSON data make it a really great candidate.
If you really want to go with the Java approach, it gets quite tricky. This is the kind of question you get asked at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random-access files is a bad idea because it effectively means insertion sort, i.e. a non-optimal running time, which can be painful given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so the chunk size depends on how much RAM you have), sorting it, and then writing it to a new file on disk. After you have read the whole file, you can start merging the chunk files two at a time by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files, then continue with the second generation, until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort; the academic recursive algorithm is the top-down approach.
Note that intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex, but it reduces the number of I/O operations required.
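A minimal sketch of such a multiway merge using a PriorityQueue, assuming each chunk file is already sorted with one record per line (file handling and record format are placeholders):

    import java.io.*;
    import java.util.*;

    public class KWayMerge {
        // Pairs a reader with its current head record so the heap can order the readers.
        private static class Head {
            final String record;
            final BufferedReader reader;
            Head(String record, BufferedReader reader) { this.record = record; this.reader = reader; }
        }

        public static void merge(List<File> sortedChunks, File output) throws IOException {
            PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparing((Head h) -> h.record));
            for (File chunk : sortedChunks) {
                BufferedReader r = new BufferedReader(new FileReader(chunk));
                String first = r.readLine();
                if (first != null) heap.add(new Head(first, r));
            }
            try (BufferedWriter out = new BufferedWriter(new FileWriter(output))) {
                while (!heap.isEmpty()) {
                    Head smallest = heap.poll();          // globally smallest remaining record
                    out.write(smallest.record);
                    out.newLine();
                    String next = smallest.reader.readLine();
                    if (next != null) {
                        heap.add(new Head(next, smallest.reader));
                    } else {
                        smallest.reader.close();          // this chunk is exhausted
                    }
                }
            }
        }
    }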
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content into an array (as long as you can't grow your heap).
So you're left with streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content through your program, which should perform the sorting by writing either to a random-access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with Integers, you may wish to use around 100,000 elements per read; with Strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort those n elements and write them to a temporary file with a unique name.
Now, since each of these files is sorted, the smallest elements will be at the start of each. You can then iterate over the files until you have processed all the elements, repeatedly finding the smallest remaining element and printing it to the new final output.
This approach reduces the amount of RAM needed by relying on drive space instead, and it lets you sort a file of any size.
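A sketch of the splitting phase under those assumptions (one element per line; the chunk size and temp-file naming are placeholders), producing the sorted chunk files that the final merge pass described above then combines:

    import java.io.*;
    import java.nio.file.Files;
    import java.util.*;

    public class ChunkSorter {
        // Reads up to chunkSize lines at a time, sorts them in memory, and writes each
        // sorted batch to its own temporary file. Returns the list of temp files to merge.
        public static List<File> splitIntoSortedChunks(File input, int chunkSize) throws IOException {
            List<File> chunks = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(input))) {
                List<String> buffer = new ArrayList<>(chunkSize);
                String line;
                while ((line = reader.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() == chunkSize) {
                        chunks.add(writeSortedChunk(buffer));
                        buffer.clear();                    // release the memory before the next batch
                    }
                }
                if (!buffer.isEmpty()) {
                    chunks.add(writeSortedChunk(buffer));  // the C left-over lines
                }
            }
            return chunks;
        }

        private static File writeSortedChunk(List<String> buffer) throws IOException {
            Collections.sort(buffer);
            File chunk = Files.createTempFile("sort-chunk-", ".txt").toFile();
            try (BufferedWriter out = new BufferedWriter(new FileWriter(chunk))) {
                for (String s : buffer) {
                    out.write(s);
                    out.newLine();
                }
            }
            return chunk;
        }
    }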
Build an array of record positions inside the file (a kind of index); that may fit into memory instead. You need an 8-byte Java long per record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file, using the index pointers to fetch the records in the required order.
This will also work if the records are not all the same size.
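A rough sketch of that index-based approach, assuming one record per line of single-byte-encoded text (RandomAccessFile.readLine does not do full charset decoding); note that the comparator re-reads records from disk on every comparison, which is exactly the speed-for-memory trade described above:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.io.UncheckedIOException;
    import java.util.ArrayList;
    import java.util.List;

    public class OffsetIndexSort {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile in = new RandomAccessFile("records.txt", "r");
                 RandomAccessFile out = new RandomAccessFile("records-sorted.txt", "rw")) {

                // 1. Build the index: one long (file offset) per record.
                List<Long> offsets = new ArrayList<>();
                long pos = 0;
                while (in.readLine() != null) {
                    offsets.add(pos);
                    pos = in.getFilePointer();
                }

                // 2. Sort the offsets by the record they point at, re-reading on each comparison.
                offsets.sort((a, b) -> readRecord(in, a).compareTo(readRecord(in, b)));

                // 3. Write the records out in sorted order using the index.
                for (long offset : offsets) {
                    out.writeBytes(readRecord(in, offset));
                    out.writeBytes(System.lineSeparator());
                }
            }
        }

        // Seeks to the given offset and reads one record (one line).
        private static String readRecord(RandomAccessFile file, long offset) {
            try {
                file.seek(offset);
                return file.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }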

Make unique and index a huge set of Strings in Java

I have a huge dataset of UTF-8 strings to process, and I need to eliminate duplicates in order to end up with a unique set of strings.
I'm using a HashSet to check whether a string is already known, but now that I have reached 100,000,000 strings I do not have enough RAM and the process crashes. Moreover, I have only processed 1% of the dataset, so an in-memory solution is impossible.
What I would like is a hybrid solution, something like an "in-memory index" plus "disk-based storage", so I can use the 10 GB of RAM I have to speed up the process.
=> Do you know of a Java library that already does this? If not, which algorithm should I look into?
Using a Bloom filter in memory to check whether a string is not yet present could be a solution, but I would still have to check the disk sometimes (false positives), and I would like to hear about other solutions.
=> How should I store the strings on disk to get fast read and write access?
- I don't want to use an external service like a NoSQL DB or MySQL; it must be embedded.
- I have already tried file-based lightweight SQL DBs like H2 or HSQLDB, but they are very bad at handling massive datasets.
- I don't consider Trove/Guava collections a solution (unless they offer a disk-based option I'm not aware of); I'm already using an extremely memory-efficient custom hash set, and I don't even store Strings but byte[] in memory. I have already tweaked the -Xmx settings for the JVM.
EDIT: The dataset I'm processing is huge; the raw unsorted dataset doesn't even fit on my hard disk. I'm streaming it byte by byte and processing it as it comes.
What you could do is use an external sorting technique, such as an external merge sort, to sort your data first.
Once that is done, you can iterate through the sorted data while keeping track of the last element you encountered, and compare each current item with that last one. If they are the same, you move on to the next item; if not, you keep the current item and remember it as the new "last" element.
To avoid huge memory consumption, you can dump your list of unique items to the hard drive whenever a particular threshold is reached and keep going.
Long story short, a Java sketch of the above (assuming the sorted data has one string per line; the THRESHOLD value and the dumpToDrive helper are placeholders):

    // Duplicates are adjacent after the external sort, so one pass with a single
    // "last seen" reference is enough to deduplicate the whole dataset.
    try (BufferedReader sortedData = new BufferedReader(new FileReader("sorted_data.txt"))) {
        List<String> uniqueItems = new ArrayList<>();
        String lastData = null;
        String e;
        while ((e = sortedData.readLine()) != null) {
            if (!e.equals(lastData)) {
                lastData = e;
                uniqueItems.add(e);
                if (uniqueItems.size() == THRESHOLD) {
                    dumpToDrive(uniqueItems);   // append this batch to the output file
                    uniqueItems.clear();        // free the memory before continuing
                }
            }
        }
        dumpToDrive(uniqueItems);               // flush whatever is left
    }
What is the total data size you have? If it is not in the terabytes, and supposing you can use, say, 10 machines, I would suggest an external cache like memcached (spymemcached is a good Java client for memcached).
Install memcached on the 10 nodes. The spymemcached client should be initialized with the list of memcached servers, so that together they act as a single virtual cluster for the program.
For each string you read:
    check if it is already in memcache
    if it is in memcache:
        continue with the next string
    else:
        add it to memcache
        add it to the list of strings to be flushed to disk
        if the size of the list of strings to be flushed > a certain threshold:
            flush them to disk
Finally, flush any remaining strings to disk.
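A rough sketch of that loop with spymemcached (server addresses, expiry, threshold and file names are assumptions; also note that memcached keys are limited to 250 bytes, so very long strings would need to be hashed first):

    import java.io.*;
    import java.util.ArrayList;
    import java.util.List;
    import net.spy.memcached.AddrUtil;
    import net.spy.memcached.MemcachedClient;

    public class MemcachedDedup {
        public static void main(String[] args) throws IOException {
            // One client over the whole virtual cluster of memcached nodes.
            MemcachedClient cache = new MemcachedClient(
                    AddrUtil.getAddresses("node1:11211 node2:11211"));
            List<String> toFlush = new ArrayList<>();

            try (BufferedReader in = new BufferedReader(new FileReader("strings.txt"));
                 BufferedWriter out = new BufferedWriter(new FileWriter("unique-strings.txt"))) {
                String s;
                while ((s = in.readLine()) != null) {
                    if (cache.get(s) == null) {          // not seen before on any node
                        cache.set(s, 0, Boolean.TRUE);   // remember it (0 = never expire)
                        toFlush.add(s);
                        if (toFlush.size() >= 100_000) { // assumed flush threshold
                            flush(out, toFlush);
                        }
                    }
                }
                flush(out, toFlush);                     // flush the remainder
            } finally {
                cache.shutdown();
            }
        }

        private static void flush(BufferedWriter out, List<String> batch) throws IOException {
            for (String s : batch) {
                out.write(s);
                out.newLine();
            }
            out.flush();
            batch.clear();
        }
    }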
Another approach is to use a kind of map-reduce :) without Hadoop :)
Deduplicate the first 2 GB of strings and write the de-duplicated data out to an intermediate file.
Repeat the above step with the next 2 GB of strings, and so on.
Then apply the same method to the intermediate de-duplicated files.
When the total size of the intermediate de-duplicated data is small enough, use memcached or an in-memory HashMap to produce the final output.
This approach doesn't involve sorting, and hence may be more efficient.

Reading Big File in Java

I have a Swing application which works on CSV files. It reads the full file line by line, computes some required statistics, and shows the output.
The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data. The problem is that the JVM takes 4 times as much memory as the file size (while processing an 86 MB file, the heap uses 377 MB of space; memory utilization was checked using jVisualVM).
Note:
I have used LineNumberReader for reading the file (because of a specific requirement; I can change it if that helps memory usage).
Every line is read with readLine(), and then split(",") is called on that String to get the individual fields of the record.
Each record is stored in a Vector for display in the JTable, while other statistics are stored in a HashMap and a TreeMap, and summary data in a JavaBean class. One graph is also plotted using JFreeChart.
Please suggest how to reduce memory utilization, as I need to process a 2 GB file.
Try giving OpenCSV a shot. It only holds the last line read when you use the readNext() method, which is perfect for large files (see the sketch after the feature list below).
From their website, the following are the features they support:
Arbitrary numbers of values per line
Ignoring commas in quoted elements
Handling quoted entries with embedded carriage returns (i.e. entries that span multiple lines)
Configurable separator and quote characters (or use sensible defaults)
Read all the entries at once, or use an Iterator style model
Creating csv files from String[] (i.e. automatic escaping of embedded quote chars)
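A minimal readNext()-style sketch, assuming the newer com.opencsv package name (older versions use au.com.bytecode.opencsv); the file name and the updateStatistics method are placeholders:

    import com.opencsv.CSVReader;
    import java.io.FileReader;

    public class CsvStreaming {
        public static void main(String[] args) throws Exception {
            // Only the current record is held in memory at any time.
            try (CSVReader reader = new CSVReader(new FileReader("big-file.csv"))) {
                String[] fields;
                while ((fields = reader.readNext()) != null) {
                    updateStatistics(fields);   // compute statistics incrementally instead of storing rows
                }
            }
        }

        private static void updateStatistics(String[] fields) {
            // placeholder for the per-record statistics update
        }
    }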
Use best practices to improve your program:
Use multiple threads to get better CPU utilization.
Set the minimum and maximum heap sizes to make better use of RAM.
Use appropriate data structures and design.
Every Java object has memory overhead, so if your Strings are really short, that alone could explain why you end up with 4 times the size of your file. You also have to count the size of the Vector and its internals. I don't think a Map would improve memory usage, since identical Java String literals are already interned and share the same instance.
I think you should revise your design. Given your requirements
The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data
you don't need to store the whole file in memory. You do need to read it entirely to compute your statistics, but that can certainly be done using a very small amount of memory. The JTable part can also be accomplished in a number of ways without requiring 2 GB of heap space for your program; something is wrong when someone wants to keep a whole CSV file in memory! A handy tool for the streaming part is Apache Commons IO's LineIterator.
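A minimal LineIterator sketch following the Commons IO documentation pattern (the file name and per-line work are placeholders):

    import java.io.File;
    import java.io.IOException;
    import org.apache.commons.io.FileUtils;
    import org.apache.commons.io.LineIterator;

    public class LineIteratorExample {
        public static void main(String[] args) throws IOException {
            LineIterator it = FileUtils.lineIterator(new File("big-file.csv"), "UTF-8");
            try {
                while (it.hasNext()) {
                    String line = it.nextLine();
                    // update running statistics from 'line' here; nothing else is kept in memory
                }
            } finally {
                LineIterator.closeQuietly(it);   // always release the underlying reader
            }
        }
    }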
Increase the JVM heap size (-Xms and -Xmx). If you have the memory, this is the best solution. If you cannot do that, you will need to find a compromise that will be a combination of data model and presentation (GUI) changes, usually resulting in increased code complexity and potential for bugs.
Try modifying your statistics algorithms to do their work as the data is being read, rather than requiring it all to exist in memory. You may find that algorithms which approximate the statistics are sufficient.
If your data contains many duplicate String values, use a HashSet as a cache so identical values share a single instance. Beware: caches are a notorious source of memory leaks (e.g. not clearing them before loading a different file).
Reduce the amount of data being displayed on the graph. It is common for a graph with a lot of data to have many points displayed at or near the same pixel. Consider truncating the data by merging multiple values at or near the same position on the x-axis. If your data set contains 2,000,000 points, for example, most of them will coincide with other nearby points, so your underlying data model does not need to store everything.
Beware of information overload. Will your JTable be meaningful to the user if it contains 2 GB worth of data? Perhaps you should paginate the table and read only 1000 entries from the file at a time for display.
I'm hesitant to suggest this, but during the loading process, you could convert the CSV data into a file database (such as cdb). You could accumulate statistics and store some data for the graph during the conversion, and use the database to quickly read a page of data at a time for the JTable as suggested above.

Reading large input files (10 GB) through a Java program

I am working with 2 large input files on the order of 5 GB each.
They are the output of a Hadoop MapReduce job, but as I am not able to do the dependency calculations in MapReduce, I am switching to an optimized for loop for the final calculations (see my previous question on the MapReduce design: Recursive calculations using Mapreduce).
I would like suggestions on reading such huge files in Java and doing some basic operations on them; finally, I will be writing out data which will also be on the order of 5 GB.
I appreciate your help
If the files have the properties you described, i.e. 100 integer values per key and 10 GB each, you are talking about a very large number of keys, many more than you can feasibly fit into memory. If you can order the files before processing, for example using the OS sort utility or a MapReduce job with a single reducer, you can read the two files simultaneously, do your processing, and output the result without keeping too much data in memory.
It sounds like there wouldn't be much to a simple implementation. Just open an InputStream/Reader for the file, then, in a loop:
Read in one piece of your data
Process the piece of data
Store the result: in memory if you'll have room for the complete dataset, in a database of some sort if not
If your result set will be too large to keep in memory, a simple way to fix that would be to use an H2 database with local file storage.
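A small sketch of that H2 option, assuming the H2 jar is on the classpath and using an illustrative table layout and JDBC URL:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class H2ResultStore {
        public static void main(String[] args) throws Exception {
            // "jdbc:h2:./results" keeps the database in local files next to the program.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./results", "sa", "")) {
                conn.createStatement().execute(
                        "CREATE TABLE IF NOT EXISTS results(record_key VARCHAR(255), result_value BIGINT)");
                try (PreparedStatement insert =
                             conn.prepareStatement("INSERT INTO results VALUES (?, ?)")) {
                    // Inside your read-process loop, write each result out instead of keeping it:
                    insert.setString(1, "someKey");   // placeholder key
                    insert.setLong(2, 42L);           // placeholder computed value
                    insert.executeUpdate();
                }
            }
        }
    }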
My approach:
I configured the MapReduce program to use 16 reducers, so the final output consisted of 16 files (part-00000 to part-00015) of 300+ MB each, and the keys were sorted in the same order for both input files.
Then, at every stage, I read 2 of the input files (around 600 MB) and did the processing, so at each stage I only had to hold about 600 MB in memory, which the system could manage pretty well.
The program was pretty quick, taking around 20 minutes for the complete processing.
Thanks for all the suggestions! I appreciate your help.

File processing in Java

I have a 2 GB file containing student records. I need to find students based on certain attributes in each record and create a new file with the results. The order of the filtered students should be the same as in the original file. What's the most efficient and fastest way of doing this using the Java I/O API and threads without running into memory issues? The max heap size for the JVM is set to 512 MB.
What kind of file? Text-based, like CSV?
The easiest way would be to do what grep does: read the file line by line, parse the line, check your filter criterion; if it matches, output a result line, then go on to the next line, until the file is done. This is very memory-efficient, as you only ever have the current line (or a slightly larger buffer) loaded at a time, and your process needs to read through the whole file just once.
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to (see the sketch after this list):
open input stream to 2GB file, remembering to buffer (e.g. by wrapping with BufferedInputStream)
open output stream to filtered file you're going to create
read first record from input stream, look at whatever attribute to decide if you "need" it; if you do, write it to output file
repeat for remaining records
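A minimal sketch of those steps, using a BufferedReader/BufferedWriter pair on the assumption that records are text lines; the file names and the keepRecord predicate are placeholders for your own filter:

    import java.io.*;

    public class RecordFilter {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("students.txt"));
                 BufferedWriter out = new BufferedWriter(new FileWriter("filtered-students.txt"))) {
                String record;
                while ((record = in.readLine()) != null) {
                    if (keepRecord(record)) {          // your attribute check
                        out.write(record);             // output order follows input order
                        out.newLine();
                    }
                }
            }
        }

        private static boolean keepRecord(String record) {
            // placeholder: parse the record and test the attributes you care about
            return true;
        }
    }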
On one of my test systems with extremely modest hardware, a BufferedInputStream around a FileInputStream read about 500 MB in 25 seconds out of the box, i.e. probably under 2 minutes to process your 2 GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine that with state-of-the-art hardware the time could well be halved.
Whether you need to go to a lot of effort to shave down those 2-3 minutes, or can just take a short break while you wait for it to run, is a decision you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to that which don't automatically mean a database).
2 GB for a file is huge; you SHOULD go for a DB.
If you really want to use the Java I/O API, then have a look at these: Handling large data files efficiently with Java, and Tuning Java I/O Performance.
I think you should use memory-mapped files. They let you work with a big file through a smaller memory window: the mapping acts like virtual memory, and as far as performance is concerned, mapped files are faster than stream read/write.
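A small sketch of mapping a large file in windows with NIO (the window size and file name are arbitrary choices); a single MappedByteBuffer is limited to Integer.MAX_VALUE bytes, so a 2 GB+ file has to be mapped in several regions:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedFileExample {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile("students.txt", "r");
                 FileChannel channel = file.getChannel()) {
                long windowSize = 64L * 1024 * 1024;               // 64 MB window, an arbitrary choice
                for (long pos = 0; pos < channel.size(); pos += windowSize) {
                    long size = Math.min(windowSize, channel.size() - pos);
                    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, pos, size);
                    while (buffer.hasRemaining()) {
                        buffer.get();                              // process the mapped bytes here
                    }
                }
            }
        }
    }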
