I have a huge file with a unique word on each line. The file is around 1.6 GB (and after this I have to sort other files that are around 15 GB). Until now, for smaller files I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to avoid writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for this case?
If I do have to write a sorting program, which one should I use: quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In the case of well-structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that supports sorting. MongoDB comes to mind as a good fit, given that the size doesn't exceed a few hundred GB. Other NoSQL datastores might also fit the bill nicely, although Mongo's ease of installation and use and its support for JSON data make it a really strong candidate.
If you really want to go with the Java approach, it gets really tricky. This is the kind of question you get asked at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random access files is a bad idea because it effectively means insertion sort, i.e. a non-optimal running time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it, and then writing it back to a new file on disk. After you have read the whole file, you can start merging the chunk files two at a time by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files and then continue with the second one until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort, the academic recursive algorithm being the top-down approach.
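As a rough illustration, here is a minimal sketch of that two-file merge step, assuming one word per line in each already-sorted chunk file (the class and file names are placeholders, not from the original question):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class TwoWayMerge {
    // Merges two sorted chunk files into one sorted output file.
    static void merge(String leftFile, String rightFile, String outFile) throws IOException {
        try (BufferedReader left = new BufferedReader(new FileReader(leftFile));
             BufferedReader right = new BufferedReader(new FileReader(rightFile));
             BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
            String l = left.readLine();
            String r = right.readLine();
            // Repeatedly write the smaller of the two current head records.
            while (l != null && r != null) {
                if (l.compareTo(r) <= 0) {
                    out.write(l); out.newLine(); l = left.readLine();
                } else {
                    out.write(r); out.newLine(); r = right.readLine();
                }
            }
            // Drain whichever file still has lines left.
            for (; l != null; l = left.readLine()) { out.write(l); out.newLine(); }
            for (; r != null; r = right.readLine()) { out.write(r); out.newLine(); }
        }
    }
}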
Note that the intermediate generations of files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation may be slightly more complex, but it reduces the number of I/O operations required.
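For illustration, a hedged sketch of such a multiway merge over the sorted chunk files, using a PriorityQueue keyed on each file's current head line (one word per line and the helper names are assumptions, not part of the original answer):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.PriorityQueue;

public class MultiwayMerge {
    // Holds one open chunk file together with its current head line.
    static class Head {
        final BufferedReader reader;
        String line;
        Head(BufferedReader reader, String line) { this.reader = reader; this.line = line; }
    }

    static void merge(List<String> chunkFiles, String outFile) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        for (String name : chunkFiles) {
            BufferedReader reader = new BufferedReader(new FileReader(name));
            String first = reader.readLine();
            if (first != null) heap.add(new Head(reader, first));
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();               // chunk with the smallest head line
                out.write(smallest.line);
                out.newLine();
                smallest.line = smallest.reader.readLine();
                if (smallest.line != null) heap.add(smallest); // re-insert with its next line
                else smallest.reader.close();
            }
        }
    }
}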
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content into an array (as long as you can't grow your heap).
So you're left with streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content through your program, which should perform the sorting by writing either to a random access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with integers, you may wish to use around 100,000 elements per read; with strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort the n elements and write them to a temporary file with a unique name.
Now, since all the files are sorted, the smallest elements will be at the start of each one. You can then just iterate over the files until you have processed all the elements, each time finding the smallest remaining element and writing it to the new, final output.
This approach reduces the amount of RAM needed, relying on drive space instead, and lets you sort a file of any size.
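A minimal sketch of the chunking phase described above, assuming one element per line; the chunk size n, the file names and the helper method are purely illustrative, and the final pass that picks the smallest head across the chunk files is omitted:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ChunkSplitter {
    // Reads n lines at a time, sorts them, and writes each sorted batch to its own temp file.
    static List<String> split(String inputFile, int n) throws IOException {
        List<String> chunkFiles = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(inputFile))) {
            List<String> batch = new ArrayList<>(n);
            String line;
            int chunkIndex = 0;
            while ((line = in.readLine()) != null) {
                batch.add(line);
                if (batch.size() == n) {
                    chunkFiles.add(writeChunk(batch, chunkIndex++));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) chunkFiles.add(writeChunk(batch, chunkIndex)); // the left-over C lines
        }
        return chunkFiles;
    }

    private static String writeChunk(List<String> batch, int index) throws IOException {
        Collections.sort(batch);
        String name = "chunk-" + index + ".tmp"; // unique temporary file name
        Files.write(Paths.get(name), batch);
        return name;
    }
}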
Build an array of record positions inside the file (a kind of index); maybe it would fit into memory instead. You need an 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new, final file using the index pointers to get the records in the required order.
This will also work if the records are not all the same size.
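A sketch of that index idea for a file with one record per line; the file name is a placeholder, and a boxed List<Long> is used here for brevity, where a primitive long[] would match the 8-bytes-per-record estimate more closely:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class OffsetIndexSort {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("words.txt", "r")) {
            // 1. Build the index: one offset per record (line).
            List<Long> offsets = new ArrayList<>();
            long pos = 0;
            while (raf.readLine() != null) {
                offsets.add(pos);
                pos = raf.getFilePointer();
            }
            // 2. Sort the offsets, loading each record only for the comparison.
            offsets.sort((a, b) -> readRecord(raf, a).compareTo(readRecord(raf, b)));
            // 3. Write the new, final file by following the sorted offsets (omitted here).
        }
    }

    private static String readRecord(RandomAccessFile raf, long offset) {
        try {
            raf.seek(offset);
            return raf.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}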
I am making a crude Java spellchecker that takes an article and a pre-sorted dictionary file. The lengths of the article's words vary, and therefore I tried making a stack that takes in the words given by the file.
This unfortunately didn't work because the stack ran out of space (even with the shortened dictionary file), and due to performance concerns, I decided to read from the text file directly.
The issue is that the file doesn't have words of the same length. Because the lengths of the words vary, I cannot and should not expect the length of a single word to tell me how many words are in the dictionary file from how large that file is.
Because of this I am stuck. I need to execute a binary search on that file in order to make the spellcheck program work, but I can't execute a binary search if there's no clear way to treat the file as an array, especially when the array is just too large to fit into the program's memory.
What should I do?
The Oxford English Dictionary suggests that there are about 250,000 words you need to consider for your dictionary (not counting words used only in highly specific domains). This is an important piece of design information for you.
I see some solutions:
1) Simply using a HashSet<>
In theory, you can use a HashSet<> for this number of elements (this SO post discusses the theoretical limits of HashSets and others in detail).
However, this brings (as you have observed) a couple of problems:
It takes time (on every application start-up) to read this into RAM
It eats up RAM
Of course you can increase the heap size of your JRE, but there is a natural limit to that (@StvnBrkddll linked an SO post in the comments that describes this perfectly)
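A minimal sketch of option (1), assuming one dictionary word per line (the file path is a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class InMemoryDictionary {
    public static void main(String[] args) throws IOException {
        Set<String> dictionary = new HashSet<>(400_000); // room for ~250,000 words
        try (Stream<String> lines = Files.lines(Paths.get("dictionary.txt"))) {
            lines.map(String::toLowerCase).forEach(dictionary::add); // paid once, at start-up
        }
        System.out.println(dictionary.contains("spellchecker")); // O(1) lookups afterwards
    }
}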
2) Using a Database
I'd consider storing the valid words in a (relational) database:
You don't need to load everything on application start up
It does not weigh as heavy on your RAM as option (1)
It gives you more options if you want to change your application to also suggest similar words without typos to the user (e.g. if you use PostgreSQL you could use the pg_trgm extension)
It has some drawbacks though:
You mentioned your application is simple: Having a database system adds complexity
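If you do go the database route, a word lookup could look roughly like this (the JDBC URL, the words table and its schema are assumptions for illustration; in a real application you would keep one connection open rather than opening one per lookup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DictionaryLookup {
    // Assumed schema: CREATE TABLE words (word TEXT PRIMARY KEY);
    static boolean isKnownWord(String jdbcUrl, String candidate) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement("SELECT 1 FROM words WHERE word = ?")) {
            ps.setString(1, candidate.toLowerCase());
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next(); // a row comes back only if the word exists
            }
        }
    }
}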
One is sometimes faced with the task of parsing data stored in files on the local system. A significant dilemma is whether to load and parse all of the file data at the beginning of the program run, or to access the file throughout the run and read data on demand (assuming the file is sorted, so a lookup can be done in logarithmic time with a binary search).
When it comes to small data sets, the first approach seems favorable, but with larger ones the risk of exhausting the heap increases.
What are some general guidelines one can use in such scenarios?
That's the standard tradeoff in programming: memory vs. performance, the space-time tradeoff, and so on. There is no "right" answer to that question. It depends on how much memory you have, the speed you need, the size of the files, how often you query them, etc.
In your specific case, since it seems like a one-time job (if you are able to read it all at the beginning), it probably won't matter that much ;)
That depends entirely on what your program needs to do. The general advice is to keep only as much data in memory as is necessary. For example, consider a simple program that reads each record from a file of transactions, and then reports the total number of transactions and the total dollar amount:
count = 0
dollars = 0
while not end of file
    read record
    parse record
    increment count
    add transaction amount to dollars
end
output count and dollars
Here, you clearly need to have only one transaction record in memory at a time. So you read a record, process it, and discard it. It makes no sense to load all of the records into a list or other data structure, and then iterate over the list to get the count and total dollar amount.
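A minimal Java sketch of that streaming approach, assuming a text file with one transaction amount per line (the file name and record format are made up for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TransactionTotals {
    public static void main(String[] args) throws IOException {
        long count = 0;
        double dollars = 0.0;
        // Read one record at a time; nothing is retained after it is processed.
        try (BufferedReader reader = new BufferedReader(new FileReader("transactions.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                count++;
                dollars += Double.parseDouble(line.trim()); // parse the transaction amount
            }
        }
        System.out.println(count + " transactions, total $" + dollars);
    }
}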
In some cases you do need multiple records, perhaps all of them, in memory. In those cases, all you do is re-structure the program a little bit. You keep the reading loop, but have it add records to a list. Then afterwards you can process the list:
list = []
while not end of file
    read record
    parse record
    add record to list
end
process list
output results
Again, it makes no sense to load the entire file into a list and then scan the list sequentially just to obtain a count and dollar amount. That wastes memory, makes the program more complex, will be slower, and will fail with large data sets. The "memory vs. performance" tradeoff doesn't always apply. Often, as in this case, using more memory makes the program slower.
I generally find it a good practice to structure my solutions so that I keep as little data in memory as is practical. If the solution is simpler with sorted data, for example, I'll make sure that the input is sorted before I run the program.
That's the general advice. Without specific examples from you, it's hard to say what approach would be preferred.
I am looking for some ideas on sorting a large number of strings from an input file and printing the sorted results to a new file in Java. The requirement is that the input file could be extremely large. I need to consider performance in the solution, so any ideas?
The external sorting technique is generally used to sort huge amounts of data. Maybe this is what you need.
externalsortinginjava is a Java library for this.
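A minimal sketch of how that library is typically used, assuming the ExternalSort.sort(File, File) convenience method from the com.google.code.externalsorting package (check the library's documentation for the exact API and version):

import com.google.code.externalsorting.ExternalSort;
import java.io.File;
import java.io.IOException;

public class SortHugeFile {
    public static void main(String[] args) throws IOException {
        // Splits the input into sorted temporary chunks and then merges them into the output.
        ExternalSort.sort(new File("input.txt"), new File("sorted.txt"));
    }
}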
Is an SQL database available? If you inserted all the data into a table, with the sortable column or section indexed, you may (or may not) be able to output the sorted result more efficiently. This solution may also be helpful if the amount of data outweighs the amount of RAM available.
It would be interesting to know how large, and what the purpose is.
Break the file into chunks you can read into memory.
Sort each chunk and write it to a file. (If you can fit everything into memory, you are done.)
Merge sort the resulting files into a single sorted file.
You can also do a form of radix sort to improve CPU efficiency, but the main bottleneck is all the re-writing and re-reading you have to do.
Dear StackOverflowers,
I am in the process of writing an application that sorts a huge number of integers from a binary file. I need to do it as quickly as possible, and the main performance issue is the disk access time; since I make a multitude of reads, it slows down the algorithm quite significantly.
The standard way of doing this would be to fill ~50% of the available memory with a buffered object of some sort (BufferedInputStream etc.), then transfer the integers from the buffered object into an array of integers (which takes up the rest of the free space) and sort the integers in the array. Save the sorted block back to disk, and repeat the procedure until the whole file is split into sorted blocks, then merge the blocks together.
The strategy for sorting the blocks uses only 50% of the available memory, since the data is essentially duplicated (50% for the cache and 50% for the array while they store the same data).
I am hoping that I can optimise this phase of the algorithm (sorting the blocks) by writing my own buffered class that allows caching data straight into an int array, so that the array can take up all of the free space, not just 50% of it. This would reduce the number of disk accesses in this phase by a factor of 2. The thing is, I am not sure where to start.
EDIT:
Essentially I would like to find a way to fill up an array of integers by executing only one read on the file. Another constraint is that the array has to use most of the free memory.
If any of the statements I made are wrong, or at least seem to be, please correct me.
Any help appreciated,
Regards
When you say limited, how limited... < 1 MB, < 10 MB, < 64 MB?
It makes a difference, since you won't actually get much benefit, if any, from having large BufferedInputStreams; in most cases the default value of 8192 bytes (JDK 1.6) is enough, and increasing it doesn't usually make that much difference.
Using a smaller BufferedInputStream should leave you with nearly all of the heap to create and sort each chunk before writing it to disk.
You might want to look into the Java NIO libraries, specifically FileChannel and IntBuffer.
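A rough sketch of that idea: one bulk read of a block of ints from a binary file via a FileChannel, then a bulk copy into an int[] through an IntBuffer view (the file name and block size are placeholders, and raw big-endian ints are assumed):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class NioBlockSort {
    public static void main(String[] args) throws IOException {
        int blockInts = 1_000_000; // placeholder: size one block to fill most of the free heap
        try (FileChannel channel = FileChannel.open(Paths.get("integers.bin"), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(blockInts * Integer.BYTES);
            int bytesRead = channel.read(buffer);    // one bulk read fills the buffer
            buffer.flip();
            int[] block = new int[bytesRead / Integer.BYTES];
            buffer.asIntBuffer().get(block);         // bulk copy into the int[]
            Arrays.sort(block);                      // sort the block in memory
            // ... write the sorted block back out and move on to the next block
        }
    }
}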
You don't give many hints, but two things come to my mind. First, if you have many integers but not that many distinct values, bucket sort could be the solution.
Secondly, one word (OK, term) screams in my head when I hear that: external tape sorting. In the early days of computing (i.e. the stone age) data lived on tapes, and it was very hard to sort data spread over multiple tapes. It is very similar to your situation. Indeed, merge sort was the most commonly used sort in those days, and as far as I remember, Knuth's TAOCP has a nice chapter about it. There might be some good hints there about the sizes of caches, buffers and the like.
I have multiple text files that represent logging entries which I need to parse later on. Each of the files is up to 1 MB in size and I have approximately 10 files.
Each line has the following format:
Timestamp\tData
I have to merge all the files and sort the entries by the timestamp value. There is no guarantee that the entries of a single file are in correct chronological order.
What would be the smartest approach? My Pseudo'd code looks like this:
List<FileEntry> oneBigList = new ArrayList<FileEntry>();
for each file {
    parse each line into an instance of FileEntry;
    add the instance to oneBigList;
}
Collections.sort(oneBigList according to FileEntry.getTimestamp());
If you are not sure that your task will fit into the available memory, you are better off inserting your lines, after parsing, into a database table and letting the database worry about how to order the data (an index on the timestamp column will help :-)
If you are sure memory is no problem, I would use a TreeMap to do the sorting while adding the lines to it.
Make sure your FileEntry class implements hashCode(), equals() and Comparable according to your sort order.
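A minimal sketch of such a FileEntry (the field names are made up; note that a TreeMap keyed on the entry silently drops entries that compare as equal, so compareTo needs a tie-breaker, or you need a list of entries per timestamp):

import java.util.Objects;

public class FileEntry implements Comparable<FileEntry> {
    private final long timestamp; // parsed from the line's Timestamp field
    private final String data;    // the Data part of the line

    public FileEntry(long timestamp, String data) {
        this.timestamp = timestamp;
        this.data = data;
    }

    public long getTimestamp() { return timestamp; }

    @Override
    public int compareTo(FileEntry other) {
        int cmp = Long.compare(timestamp, other.timestamp);
        return cmp != 0 ? cmp : data.compareTo(other.data); // tie-breaker so distinct entries never compare equal
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FileEntry)) return false;
        FileEntry e = (FileEntry) o;
        return timestamp == e.timestamp && data.equals(e.data);
    }

    @Override
    public int hashCode() {
        return Objects.hash(timestamp, data);
    }
}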
Within each file, you can probably assume that the entries are time ordered, as the "next" line was written after the "previous" line.
This means that you should probably implement a merge sort. Preferably merge sort the two smallest files to each other, and then repeat until you have one file.
Note that if these files come from multiple machines, you are still going to have the logs out of order, because unless the machine clocks are synchronized by some reliable means, the clocks will differ. Even if they are synchronized, the clocks will still differ; however, they might differ by a small enough amount not to matter.
Merge sort is not the fastest possible sort; however, it has some very beneficial side effects: it can be run in parallel for each pair of files, it is far faster than sorts which don't assume any existing order, it is friendly to memory consumption, and you can easily checkpoint after each pair of files has been merged. This means you can recover from an interrupted sorting session while losing only part of the effort.