Optimize inverted index Java

I am trying to create an inverted index for Wikipedia pages, but I keep running out of memory. I am not sure what else I can do to ensure it doesn't run out of memory. Bear in mind that we are talking about 3.9 million words.
indexer.java
    public void index() {
        ArrayList<Page> pages = parse(); // Parse XML pages
        HashMap<String, ArrayList<Integer>> postings = getPostings(pages);
    }

    public HashMap<String, ArrayList<Integer>> getPostings(ArrayList<Page> pages) {
        assert pages != null;
        englishStemmer stemmer = new englishStemmer();
        HashSet<String> stopWords = getStopWords();
        HashMap<String, ArrayList<Integer>> postings = new HashMap<>();
        int count = 0;
        int artCount = 0;
        for (Page page : pages) {
            if (!page.isRedirect()) { // Skip pages that are redirects.
                StringBuilder sb = new StringBuilder();
                artCount = count; // All the words until now
                boolean ignore = false;
                for (char c : page.getText().toCharArray()) {
                    if (c == '<') // Ignore words inside <> tags.
                        ignore = true;
                    if (!ignore) {
                        if (c != 39) { // Skip apostrophes (ASCII 39).
                            if (c > 47 && c < 58 || c > 96 && c < 123) // Character c is a number 0-9 or a lower case letter a-z.
                                sb.append(c);
                            else if (c > 64 && c < 91) // Character c is an uppercase letter A-Z.
                                sb.append(Character.toLowerCase(c));
                            else if (sb.length() > 0) { // Check if there is a word up until now.
                                if (sb.length() > 1) { // Ignore single character "words"
                                    if (!stopWords.contains(sb.toString())) { // Check if the word is not a stop word.
                                        stemmer.setCurrent(sb.toString());
                                        stemmer.stem(); // Stem word s
                                        String s = stemmer.getCurrent(); // Retrieve the stemmed word (sb still holds the unstemmed one)
                                        if (!postings.containsKey(s)) // Check if the word already exists in the words map.
                                            postings.put(s, new ArrayList<>()); // If the word is not in the map then create an array list for that word.
                                        postings.get(s).add(page.getId()); // Place the id of the page in the word array list.
                                        count++; // Increase the overall word count for the pages
                                    }
                                }
                                sb = new StringBuilder();
                            }
                        }
                    }
                    if (c == '>')
                        ignore = false;
                }
            }
            page.setCount(count - artCount);
        }
        System.out.println("Word count:" + count);
        return postings;
    }
Advantages
Some advantages for this approach are:
You can get the number of occurrences of a given word simply by getting the size of the associated ArrayList.
Looking up the number of times a given word occurs in a page is relatively easy.
Optimizations
Current optimizations:
Ignoring common words (stop words).
Stemming words to their roots and storing those.
Ignoring common Wikipedia tags that aren't English words (included in stop word list such as: lt, gt, ref .. etc).
Ignoring text within < > tags such as: <pre>, <div>
Limitations
Array lists become incredibly large as the number of occurrences of a word grows. The major disadvantage of this approach shows when an array list has to grow: a new backing array is allocated and the items from the previous array are copied into it. This could be a performance bottleneck. Would a linked list make more sense here, given that we are only appending occurrences and not reading them? Since linked lists do not rely on an array as their underlying data structure, they can grow without bounds and do not need to be replaced when they become too large.
Alternative approaches
I have considered dumping the counts for each word into a database like MongoDB after each page has been processed, and then appending the new occurrences. The documents would look like {word : [occurrences]}, and the GC could then clean up the postings HashMap after each page has been processed.
I've also considered moving the pages loop into the index() method so that the GC can clean up after getPostings() before a new page is processed, and then merging the new postings after each page, but I don't think that will alleviate the memory burden.
As for the hash maps, would a tree map be a better fit for this situation?
Execution
On my machine this program runs on all 4 cores at 90 - 100% and takes about 2-2.5GB of RAM. It runs for over an hour and a half and then fails with: GC Out of memory.
I have also considered increasing the available memory for this program, but it needs to run on my instructor's machine as well, so it needs to work with standard settings, without any "hacks".
I need help making considerable optimizations. I'm not sure what else would help.

TL;DR Most likely your data structure won't fit in memory, no matter what you do.
Side note: you should actually explain what your task is and what your approach is. You don't do that and expect us to read and poke in your code.
What you're basically doing is building a multimap of word -> ids of Wikipedia articles. For this, you parse each non-redirect page, divide it into single words and build a multimap by adding a word -> page id mapping.
Let's roughly estimate how big that structure would be. Your assumption is around 4 million words. There are around 5 million articles in the English Wikipedia. The average word length in English is around 5 characters, so let's assume 10 bytes per word and 4 bytes per article id. That gives around 40 MB for the words (keys in the map) and 20 MB for the article ids (values in the map).
Assuming a multihashmap-like structure you could estimate the hashmap size at around 32*size + 4*capacity.
So far this seems to be manageable, a few dozen MBs.
But there will be around 4 million collections to store the ids of articles, and each will take around 8*size bytes (if you use array lists), where size is the number of articles a word is encountered in. According to http://www.wordfrequency.info/, the top 5000 words are mentioned in COCA (the Corpus of Contemporary American English) over 300 million times, so I'd expect Wikipedia to be in this range.
That would be around 2.5 GB just for article ids just for 5k top words. This is a good hint that your inverted index structure will probably take too much memory to fit on a single machine.
However, I don't think that your immediate problem is the size of the resulting structure. Your code indicates that you load all pages into memory first and process them later on. And that definitely won't work.
You'll most probably need to process pages in a stream-like fashion and use some kind of a database to store results. There's basically a thousand ways to do that, I'd personally go with a Hadoop job on AWS with PostgreSQL as the database, utilizing the UPSERT feature.

ArrayList is a candidate for replacement by a class Index you'll have to write. It should use int[] for storing index values and a reallocation strategy that uses an increment based on the overall growth rate of the word it belongs to. (ArrayList increments by 50% of the old value, and this may not be optimal for rare words.) Also, it should leave room for optimizing the storage of ranges by storing the first index and the negative count of following numbers, e.g.,
..., 100, -3, ... represents the index values 100, 101, 102, 103
This may result in saving entries for frequently occurring words at the cost of a few cycles.
Consider a dump of the postings HashMap after entering a certain number of index values and a continuation with an empty map. If the file is sorted by key, it'll permit a relatively simple merge of two or more files.
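A minimal sketch of what such an Index class could look like. The class name IntPostings, the 25% growth step, and the choice to drop repeated ids for the same page are my assumptions for illustration, not something prescribed above. It also assumes page ids are non-negative and arrive in non-decreasing order, which holds if pages are processed in id order.

```java
import java.util.Arrays;

/** Hypothetical compact postings list backed by an int[] (illustrative names only). */
class IntPostings {
    private int[] data = new int[4]; // encoded entries: an id >= 0, or a negative run length
    private int size = 0;            // number of used slots in data
    private int last = -1;           // last id that was added
    private int count = 0;           // number of distinct ids stored

    /** Adds a page id. Assumes ids are non-negative and arrive in non-decreasing order. */
    void add(int id) {
        if (count > 0 && id == last) {
            return;                          // same page again: already recorded
        }
        if (count > 0 && id == last + 1) {
            if (data[size - 1] < 0) {
                data[size - 1]--;            // extend the current run, e.g. -2 becomes -3
            } else {
                append(-1);                  // start a run: one id follows the previous entry
            }
        } else {
            append(id);
        }
        last = id;
        count++;
    }

    private void append(int value) {
        if (size == data.length) {
            // Assumed growth policy: 25% (at least 4 slots) instead of ArrayList's 50%.
            data = Arrays.copyOf(data, size + Math.max(4, size >> 2));
        }
        data[size++] = value;
    }

    /** Decodes the stored entries back into the plain list of ids. */
    int[] toArray() {
        int[] out = new int[count];
        int n = 0;
        int prev = 0;
        for (int i = 0; i < size; i++) {
            if (data[i] >= 0) {
                prev = data[i];
                out[n++] = prev;
            } else {
                for (int k = data[i]; k < 0; k++) {
                    out[n++] = ++prev;       // expand the run of consecutive ids
                }
            }
        }
        return out;
    }

    /** Number of int slots actually used, to compare against the decoded length. */
    int storedInts() {
        return size;
    }
}
```

Adding 100, 101, 102, 103 stores just the two ints 100 and -3. Note that dropping repeated ids means the list length no longer equals the occurrence count, so a separate counter would be needed if that number matters.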

Related

Fast and efficient computation on arrays

I want to count the number of occurrences of a particular phrase in a document, for example "stackoverflow forums". Suppose D represents the set of documents containing both terms.
Now, suppose I have the following data structure:
A[numTerms][numMatchedDocuments][numOccurInADocument]
where numMatchedDocuments is the size of D and numOccurInADocument is the number of times a particular term occurs in a particular document, for example:
A[stackoverflow][document1][occurance1]=3;
means that the term "stackoverflow" occurs in document "document1" and its first occurrence is at position "3".
Then I pick the term that occurs the least and loop over all its positions to check whether "forums" occurs at the position immediately after the current "stackoverflow" position. In other words, if I find "forums" at position 4, then that is a phrase and I've found a match for it.
The matching is straightforward per document and runs reasonably fast, but when the number of documents exceeds 2,000,000 it gets very slow. I've distributed it over cores and it gets faster of course, but I wonder if there is an algorithmically better way of doing this.
thanks,
Pseudo-code:
    int numOfTerms=2;
    // 0 for "stackoverflow" and 1 for "forums"
    for (int d=0;d<D.size();d++){
        //D is a set containing the matched documents
        int minId=getTheLeastOccuringTerm();
        for (int i=0; i<A[minId][d].length;i++){ // For every position for LeastOccuringTerm
            boolean docPhrase=true; // Reset for every candidate position
            for( int t=0;t<numOfTerms;t++){ // For every term
                int id=BinarySearch(A[t][d], A[minId][d][i] - minId + t);
                if (id<0) docPhrase=false;
            }
            // Here docPhrase is true if the phrase starts at position A[minId][d][i] - minId
        }
    }
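For reference, the per-document check described above can be sketched in Java as follows. PhraseMatcher and countPhrase are hypothetical names, and the sketch assumes each term's positions within the document are given as a sorted int array.

```java
import java.util.Arrays;

/** Hypothetical helper: counts phrase occurrences inside a single document. */
class PhraseMatcher {
    /**
     * positions[t] holds the sorted positions of the t-th phrase term in one document.
     * Returns how many times the terms occur at consecutive positions in phrase order.
     */
    static int countPhrase(int[][] positions) {
        // Pick the term with the fewest occurrences to minimise the number of candidates.
        int rarest = 0;
        for (int t = 1; t < positions.length; t++) {
            if (positions[t].length < positions[rarest].length) {
                rarest = t;
            }
        }
        int matches = 0;
        for (int p : positions[rarest]) {
            boolean isPhrase = true; // reset for every candidate position
            for (int t = 0; t < positions.length && isPhrase; t++) {
                // Term t must sit at offset (t - rarest) relative to the rarest term.
                isPhrase = Arrays.binarySearch(positions[t], p - rarest + t) >= 0;
            }
            if (isPhrase) {
                matches++;
            }
        }
        return matches;
    }
}
```

With "stackoverflow" at positions {3, 10, 20} and "forums" at {4, 21}, the sketch reports two phrase occurrences (starting at positions 3 and 20).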

As I mentioned in comments, Suffix Array can solve this sort of problem. I answered a similar question ( Fastest way to search a list of names in C# ) with a simple c# implementation of a Suffix Array.
The basic idea is that you have an array of index pairs, each pointing to a document index and a position within that document. An index pair represents the string that starts at that point in the document and continues to the end of the document, but the actual documents and their contents exist only once in your original store. The Suffix Array is just an array of these index pairs, with a pair for every position in every document. You then sort the Suffix Array by the text the pairs point to. Once it is sorted, you can very quickly find any phrase in any of the documents by doing a simple binary search on the Suffix Array. Constructing (mainly sorting) the Suffix Array can be time-consuming, but once constructed, it is very fast to search. It's fairly easy on memory since the actual document contents only exist once.
It would be trivial to extend it to returning counts of phrase matches within each document.
This is a little different than the classic description of a Suffix Array where they are usually talking about the Suffix Array operating over one single, very large string. But the changes to make it work for an array of strings/documents is not that large, although it can increase the amount of memory consumed by the Suffix Array depending on the maximum number of documents and the maximum document length, and how you encode the index pairs.
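Since the linked implementation is in C#, here is a rough Java sketch of the same idea. DocSuffixArray and its method names are made up for illustration, it works on characters rather than words, and it uses a naive comparison sort, so treat it as a demonstration of the structure and not as something tuned for millions of documents.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical character-level suffix array over several documents. */
class DocSuffixArray {
    private final String[] docs;
    private final int[][] suffixes; // each entry is {document index, start offset}

    DocSuffixArray(String[] docs) {
        this.docs = docs;
        List<int[]> all = new ArrayList<>();
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length(); i++) {
                all.add(new int[] {d, i});
            }
        }
        // Naive sort by suffix text: simple, but slow and allocation-heavy for large inputs.
        all.sort((a, b) -> docs[a[0]].substring(a[1]).compareTo(docs[b[0]].substring(b[1])));
        suffixes = all.toArray(new int[0][]);
    }

    /** Negative if the suffix sorts before the phrase, 0 if it starts with it, positive otherwise. */
    private int compare(int[] suffix, String phrase) {
        String doc = docs[suffix[0]];
        int n = Math.min(doc.length() - suffix[1], phrase.length());
        for (int k = 0; k < n; k++) {
            int diff = doc.charAt(suffix[1] + k) - phrase.charAt(k);
            if (diff != 0) {
                return diff;
            }
        }
        return n == phrase.length() ? 0 : -1; // a suffix shorter than the phrase sorts first
    }

    /** Binary search for the first suffix after (upper) or not before (lower) the phrase range. */
    private int bound(String phrase, boolean upper) {
        int lo = 0;
        int hi = suffixes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = compare(suffixes[mid], phrase);
            if (c < 0 || (upper && c == 0)) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Number of occurrences of the phrase in each document. */
    int[] countPerDocument(String phrase) {
        int[] counts = new int[docs.length];
        int to = bound(phrase, true);
        for (int i = bound(phrase, false); i < to; i++) {
            counts[suffixes[i][0]]++;
        }
        return counts;
    }
}
```

All suffixes that start with the phrase are adjacent after sorting, so two binary searches give the matching range, and walking that range yields the per-document counts.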

Returning a Subset of Strings from 10000 ascii strings

My college is getting over so I have started preparing for the interviews to get the JOB and I came across this interview question while I was preparing for the interview
You have a set of 10000 ascii strings (loaded from a file)
A string is input from stdin.
Write a pseudocode that returns (to stdout) a subset of strings in (1) that contain the same distinct characters (regardless of order) as
input in (2). Optimize for time.
Assume that this function will need to be invoked repeatedly. Initializing the string array once and storing in memory is okay .
Please avoid solutions that require looping through all 10000 strings.
Can anyone provide me a general pseudocode/algorithm kind of thing how to solve this problem? I am scratching my head thinking about the solution. I am mostly familiar with Java.

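Here is an algorithm whose lookup time is O(1) with respect to the number of strings (it depends only on the length of the input)!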
Initialization:
For each string, sort its characters, removing duplicates - e.g. "trees" becomes "erst"
load the sorted word into a trie using the sorted characters, adding a reference to the original word to the list of words stored at the final node reached
Search:
sort the input string the same way as the source strings were sorted during initialization
follow the trie using the sorted input characters, and at the end node return all words referenced there
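A small sketch of the same lookup in Java. To keep it short it uses a HashMap keyed by the sorted, de-duplicated characters in place of a trie, which is a substitution on my part; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical index from "set of distinct characters" to the strings having exactly that set. */
class SignatureIndex {
    private final Map<String, List<String>> bySignature = new HashMap<>();

    /** Built once at start-up from the loaded strings. */
    SignatureIndex(List<String> words) {
        for (String word : words) {
            bySignature.computeIfAbsent(signature(word), k -> new ArrayList<>()).add(word);
        }
    }

    /** Sorted distinct characters of s, e.g. "trees" becomes "erst". */
    static String signature(String s) {
        TreeSet<Character> chars = new TreeSet<>();
        for (char c : s.toCharArray()) {
            chars.add(c);
        }
        StringBuilder sb = new StringBuilder();
        for (char c : chars) {
            sb.append(c);
        }
        return sb.toString();
    }

    /** Returns all stored strings with the same distinct characters as the input. */
    List<String> lookup(String input) {
        return bySignature.getOrDefault(signature(input), Collections.emptyList());
    }
}
```

Each query only computes the signature of the input and does one map lookup, so it never loops over the 10000 strings.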

They say optimise for time, so I guess we're safe to abuse space as much as we want.
In that case, you could do an initial pass on the 10000 strings and build a mapping from each of the unique characters present in the 10000 strings to the set of indices of the strings containing it. That way you can ask the mapping the question: which strings contain character 'x'? Call this mapping M (order: O(nm), where n is the number of strings and m is their maximum length).
To optimise in time again, you could reduce the stdin input string to unique characters, and put them in a queue, Q. (order O(p), p is the length of the input string)
Start a new disjoint set, say S. Then let S = Q.extractNextItem.
Now you could loop over the rest of the unique characters and find which sets contain all of them.
While (Q is not empty) (loops O(p)) {
S = S intersect Q.extractNextItem (close to O(1) depending on your implementation of disjoint sets)
}
voila, return S.
Total time: O(mn + p + p*1) = O(mn + p)
(Still early in the morning here, I hope that time analysis was right)

As Bohemian says, a trie tree is definitely the way to go!
This sounds like the way an address book lookup would work on a phone. Start punching digits in, and then filter the address book based on the number representation as well as any of the three (or actually more if using international chars) letters that number would represent.

Building an inverted index in Java-logic

I have a collection of around 1500 documents. I parsed each document and extracted tokens. These tokens are stored in a hashmap (as keys), and the total number of times they occur in the collection (i.e. frequency) is stored as the value.
I have to extend this to build an inverted index. That is, term (key) | number of documents it occurs in --> DocNo | frequency in that document. For example,
Term      DocFreq  DocNum  TermFreq
data      3        1       12
                   23      31
                   100     17
customer  2        22      43
                   19      2
Currently, I have the following in Java,
hashmap<string,integer>
for(each document)
{
    extract line
    for(each line)
    {
        extract word
        for(each word)
        {
            perform some operations
            get value for word from hashmap and increment by one
        }
    }
}
I have to build on this code. I can't really think of a good way to implement an inverted index.
So far, I thought of making value a 2D array. So the term would be the key and the value(i.e 2D array) would store the docId and termFreq.
Please let me know if my logic is correct.

I would do it by using a Map<String, TermFrequencies>. This map would maintain a TermFrequencies object for each term found. The TermFrequencies object would have the following methods:
void addOccurrence(String documentId);
int getTotalNumberOfOccurrences();
Set<String> getDocumentIds();
int getNumberOfOccurrencesInDocument(String documentId);
It would use a Map<String, Integer> internally to associate each document the term occurs in with the number of occurrences of the term in the document.
The algorithm would be extremely simple:
for(each document) {
    extract line
    for(each line) {
        extract word
        for(each word) {
            TermFrequencies termFrequencies = map.get(word);
            if (termFrequencies == null) {
                termFrequencies = new TermFrequencies(word);
                map.put(word, termFrequencies); // register the new entry, otherwise it is lost
            }
            termFrequencies.addOccurrence(document);
        }
    }
}
The addOccurrence() method would simply increment a counter for the total number of occurrences, and would insert or update the number of occurrences in the internal map.
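A possible shape for that class, as a sketch only. The four methods are the ones listed above; the TermIndex wrapper and the simple split on non-word characters are assumptions I added to make the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Per-term statistics: total occurrences plus occurrences per document. */
class TermFrequencies {
    private final String term;
    private final Map<String, Integer> perDocument = new HashMap<>();
    private int total = 0;

    TermFrequencies(String term) {
        this.term = term;
    }

    String getTerm() {
        return term;
    }

    void addOccurrence(String documentId) {
        total++;
        perDocument.merge(documentId, 1, Integer::sum); // insert 1 or increment the existing count
    }

    int getTotalNumberOfOccurrences() {
        return total;
    }

    Set<String> getDocumentIds() {
        return perDocument.keySet();
    }

    int getNumberOfOccurrencesInDocument(String documentId) {
        return perDocument.getOrDefault(documentId, 0);
    }
}

/** Hypothetical wrapper around the Map<String, TermFrequencies> described above. */
class TermIndex {
    private final Map<String, TermFrequencies> map = new HashMap<>();

    /** Tokenises on non-word characters (an assumption) and records every word. */
    void addDocument(String documentId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                map.computeIfAbsent(word, TermFrequencies::new).addOccurrence(documentId);
            }
        }
    }

    TermFrequencies get(String term) {
        return map.get(term);
    }
}
```

The document frequency of a term is then getDocumentIds().size(), and the per-document term frequency comes from getNumberOfOccurrencesInDocument().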

I think it is best to have two structures: a Map<docnum, Map<term,termFreq>> and a Map<term, Set<docnum>>. Your docFreqs can be read off as set.size in the values of the second map. This solution involves no custom classes and allows a quick retrieval of everything needed.
The first map contains all the information, and the second one is a derivative that allows quick lookup by term. As you process a document, you fill the first map. You can derive the second map afterwards, but it is also easy to do it in one pass.

I once implemented what you're asking for. The problem with your approach is that it is not abstract enough. You should model Terms, Documents and their relationships using objects. In a first run, create the term index and document objects and iterate over all terms in the documents while populating the term index. Afterwards, you have a representation in memory that you can easily transform into the desired output.
Do not start by thinking about 2d-arrays in an object oriented language. Unless you want to solve a mathematical problem or optimize something it's not the right approach most of the time.

I don't know if this is still a hot question, but I would recommend doing it like this:
You run over all your documents and give them an id in increasing order. For each document you run over all the words.
Now you have a Hashmap that maps Strings (your words) to an array of DocTermObjects. A DocTermObject contains a docId and a TermFrequency.
Now for each word in a document, you look it up in your HashMap, if it doesn't contain an Array of DocTermObjects you create it, else you look at its very LAST element only (this is important due to runtime, think about it). If this element has the docId that you treat at the moment, you increase the TermFrequency. Else or if the Array is empty, you add a new DocTermObject with your actual docId and set the TermFrequency to 1.
Later you can use this datastructure to compute scores for example. The scores you could also save in the DoctermObjects of course.
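Roughly, in code (a sketch with hypothetical class names; it relies on documents being processed one after another with increasing ids, as described above):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One posting: a document id and how often the term occurs in that document. */
class DocTermObject {
    final int docId;
    int termFrequency = 1;

    DocTermObject(int docId) {
        this.docId = docId;
    }
}

/** Hypothetical index that only ever inspects the last posting of a term. */
class LastElementIndex {
    private final Map<String, List<DocTermObject>> postings = new HashMap<>();

    /** Must be called document by document, with docIds in increasing order. */
    void addWord(String word, int docId) {
        List<DocTermObject> list = postings.computeIfAbsent(word, w -> new ArrayList<>());
        DocTermObject last = list.isEmpty() ? null : list.get(list.size() - 1);
        if (last != null && last.docId == docId) {
            last.termFrequency++;               // same document: bump the counter in O(1)
        } else {
            list.add(new DocTermObject(docId)); // first occurrence in this document
        }
    }

    List<DocTermObject> get(String word) {
        return postings.getOrDefault(word, new ArrayList<>());
    }
}
```

Because documents are handled in order, the current document can only ever be the last entry of a list, so there is no need to search the whole list on each word.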
Hope it helped :)

How to delete duplicate/aggregate rows faster in a file using Java (no DB)

I have a 2GB big text file, it has 5 columns delimited by tab.
A row will be called duplicate only if 4 out of 5 columns matches.
Right now, I am doing the deduping by first loading each column into a separate List, then iterating through the lists, deleting the duplicate rows as they are encountered and aggregating.
The problem: it is taking more than 20 hours to process one file.
I have 25 such files to process.
Can anyone please share their experience of how they would go about doing such deduping?
This deduping will be throwaway code, so I am looking for a quick and dirty solution to get the job done as soon as possible.
Here is my pseudo code (roughly)
Iterate over the rows
    i=current_row_no.
    Iterate over the row no. i+1 to last_row
        if(col1 matches //find duplicate
           && col2 matches
           && col3 matches
           && col4 matches)
        {
            col5List.set(i,get col5); //aggregate
        }
Duplicate example
A and B will be duplicate A=(1,1,1,1,1), B=(1,1,1,1,2), C=(2,1,1,1,1) and output would be A=(1,1,1,1,1+2) C=(2,1,1,1,1) [notice that B has been kicked out]

A HashMap will be your best bet. In a single, constant time operation, you can both check for duplication and fetch the appropriate aggregation structure (a Set in my code). This means that you can traverse the entire file in O(n). Here's some example code:
// Requires java.io.BufferedReader, java.io.FileReader, java.util.HashMap, java.util.HashSet, java.util.Map
public void aggregate() throws Exception
{
    BufferedReader bigFile = new BufferedReader(new FileReader("path/to/file.csv"));
    // Notice the parameter for initial capacity. Use something that is large enough to prevent rehashings.
    Map<String, HashSet<String>> map = new HashMap<String, HashSet<String>>(500000);
    String line;
    // readLine() returns null at the end of the file
    while ((line = bigFile.readLine()) != null)
    {
        int lastTab = line.lastIndexOf('\t');
        String firstFourColumns = line.substring(0, lastTab);
        // See if the map already contains an entry for the first 4 columns
        HashSet<String> set = map.get(firstFourColumns);
        // If set is null, then the map hasn't seen these columns before
        if (set==null)
        {
            // Make a new Set (for aggregation), and add it to the map
            set = new HashSet<String>();
            map.put(firstFourColumns, set);
        }
        // At this point we either found set or created it ourselves
        String lastColumn = line.substring(lastTab+1);
        set.add(lastColumn);
    }
    bigFile.close();
    // A demo that shows how to iterate over the map and set structures
    for (Map.Entry<String, HashSet<String>> entry : map.entrySet())
    {
        String firstFourColumns = entry.getKey();
        System.out.print(firstFourColumns + "=");
        HashSet<String> aggregatedLastColumns = entry.getValue();
        for (String column : aggregatedLastColumns)
        {
            System.out.print(column + ",");
        }
        System.out.println("");
    }
}
A few points:
The initialCapacity parameter for the HashMap is important. If the number of entries gets bigger than the capacity multiplied by the load factor (0.75 by default), then the structure is re-hashed, which is slow. The default initial capacity is 16, which will cause many rehashes for you. Pick a value that is comfortably greater than the number of unique sets of the first four columns.
If ordered output in the aggregation is important, you can switch the HashSet for a TreeSet.
This implementation will use a lot of memory. If your text file is 2GB, then you'll probably need a lot of RAM in the JVM. You can add the JVM argument -Xmx4096m to increase the maximum heap size to 4GB. If you don't have at least 4GB this probably won't work for you.
This is also a parallelizable problem, so if you're desperate you could thread it. That would be a lot of effort for throw-away code, though. [Edit: This point is likely not true, as pointed out in the comments]

I would sort the whole list on the first four columns, and then traverse through the list knowing that all the duplicates are together. This would give you O(NlogN) for the sort and O(N) for the traverse, rather than O(N^2) for your nested loops.
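As a sketch of that approach (hypothetical class name; it sorts an in-memory list for brevity, whereas a 2GB file may need an external sort first, and it joins the fifth columns with '+' as in the example from the question):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sort-then-scan aggregation of tab-separated rows. */
class SortedAggregator {
    /** Everything before the last tab, i.e. the first four columns. */
    private static String key(String row) {
        return row.substring(0, row.lastIndexOf('\t'));
    }

    /** The fifth column. */
    private static String lastColumn(String row) {
        return row.substring(row.lastIndexOf('\t') + 1);
    }

    static List<String> aggregate(List<String> rows) {
        List<String> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing(SortedAggregator::key)); // O(N log N), stable
        List<String> out = new ArrayList<>();
        String currentKey = null;
        StringBuilder merged = null;
        for (String row : sorted) {                               // O(N) scan
            String k = key(row);
            if (k.equals(currentKey)) {
                merged.append('+').append(lastColumn(row));       // duplicate: fold in column 5
            } else {
                if (merged != null) {
                    out.add(merged.toString());
                }
                currentKey = k;
                merged = new StringBuilder(row);
            }
        }
        if (merged != null) {
            out.add(merged.toString());
        }
        return out;
    }
}
```

For rows A=(1,1,1,1,1), C=(2,1,1,1,1), B=(1,1,1,1,2) this produces one row ending in 1+2 followed by the untouched C row.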

I would use a HashSet of the records. This can lead to an O(n) timing instead of O(n^2). You can create a class which has each of the fields with one instance per row.
You need to have a decent amount of memory, but 16 to 32 GB is pretty cheap these days.

I would do something similar to Eric's solution, but instead of storing the actual strings in the HashMap, I'd just store line numbers. So for a particular four-column hash, you'd store a list of the line numbers which hash to that value. Then, on a second pass through the data, you can remove the duplicates at those line numbers and add the +x as needed.
This way, your memory requirements will be a LOT smaller.

The solutions already posted are nice if you have enough (free) RAM. As Java tends to "still work" even if it is heavily swapping, make sure you don't have too much swap activity if you presume RAM could have been the limiting factor.
An easy "throwaway" solution in case you really have too little RAM is partitioning the file into multiple files first, depending on data in the first four columns (for example, if the third column values are more or less uniformly distributed, partition by the last two digits of that column). Just go over the file once, and write the records as you read them into 100 different files, depending on the partition value. This will need minimal amount of RAM, and then you can process the remaining files (that are only about 20MB each, if the partitioning values were well distributed) with a lot less required memory, and concatenate the results again.
Just to be clear: If you have enough RAM (don't forget that the OS wants to have some for disk cache and background activity too), this solution will be slower (maybe even by a factor of 2, since twice the amount of data needs to be read and written), but in case you are swapping to death, it might be a lot faster :-)

How to count string num with limit memory?

The task is to count the number of occurrences of each word in an input file.
The input file has 8 chars per line, and there are 10M lines, for example:
aaaaaaaa
bbbbbbbb
aaaaaaaa
abcabcab
bbbbbbbb
...
the output is:
aaaaaaaa 2
abcabcab 1
bbbbbbbb 2
...
It'll take 80MB of memory if I load all of the words into memory, but only 60MB is available on the system for this task. So how can I solve this problem?
My algorithm uses a Map<String,Integer>, but the JVM throws Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. I know I can solve this by setting -Xmx1024m, for example, but I want to use less memory to solve it.

I believe that the most robust solution is to use the disk space.
For example you can sort your file in another file, using an algorithm for sorting large files (that use disk space), and then count the consecutive occurrences of the same word.
I believe that this post can help you. Or search by yourself something about external sorting.
Update 1
Or, as @jordeu suggests, you can use an embedded Java database library such as H2, JavaDB, or similar.
Update 2
I thought about another possible solution, using Prefix Tree. However I still prefer the first one, because I'm not an expert on them.

Read one line at a time
and then have e.g. a HashMap<String,Integer>
where you put your words as key and the count as integer.
If a key exists, increase the count. Otherwise add the key to the map with a count of 1.
There is no need to keep the whole file in memory.

I guess you mean the number of distinct words, don't you?
So the obvious approach is to store (distinctive information about) each different word as a key in a map, where the value is the associated counter. Depending on how many distinct words are expected, storing all of them may even fit into your memory, however not in the worst case scenario when all words are different.
To lessen memory needs, you could calculate a checksum for the words and store that, instead of the words themselves. Storing e.g. a 4-byte checksum instead of an 8-character word (requiring at least 9 bytes to store) requires 40M instead of 90M. Plus you need a counter for each word too. Depending on the expected number of occurrences for a specific word, you may be able to get by with 2 bytes (for max 65535 occurrences), which requires max 60M of memory for 10M distinct words.
Update
Of course, the checksum can be calculated in many different ways, and it can be lossless or not. This also depends a lot on the character set used in the words. E.g. if only lowercase standard ASCII characters are used (as shown in the examples above), we have 26 different characters at each position. Consequently, each character can be losslessly encoded in 5 bits. Thus 8 characters fit into 5 bytes, which is a bit more than the limit, but may be dense enough, depending on the circumstances.
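To illustrate the lossless 5-bit encoding, here is a minimal sketch. WordPacker is a hypothetical name, and it assumes every word consists only of lowercase a-z and has at most 12 characters so that the result fits in a long.

```java
/** Hypothetical lossless packing of lowercase a-z words at 5 bits per character. */
class WordPacker {
    /** Packs the word into a long; 8 characters use 40 bits, i.e. 5 bytes of payload. */
    static long pack(String word) {
        if (word.length() > 12) {
            throw new IllegalArgumentException("at most 12 characters fit into 60 bits");
        }
        long value = 0;
        for (int i = 0; i < word.length(); i++) {
            int code = word.charAt(i) - 'a';
            if (code < 0 || code > 25) {
                throw new IllegalArgumentException("only a-z supported: " + word);
            }
            value = (value << 5) | code;
        }
        return value;
    }

    /** Reverses pack(); the original length must be known (8 in the question). */
    static String unpack(long value, int length) {
        char[] out = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            out[i] = (char) ('a' + (value & 31));
            value >>>= 5;
        }
        return new String(out);
    }
}
```

The packed value is a plain number, so it can be used as a key in a primitive-keyed map or written to disk in 5 bytes per word.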

I suck at explaining theoretical answers but here we go....
I have made an assumption about your question as it is not entirely clear.
The memory used to store all the distinct words is 80MB (the entire file is bigger).
The words could contain non-ascii characters (so we just treat the data as raw bytes).
It is sufficient to read over the file twice storing ~ 40MB of distinct words each time.
// Loop over the file and for each word:
//
// Compute a hash of the word.
// Convert the hash to a number by some means (skip if possible).
// If the number is odd then skip to the next word.
// Use conventional means to store the distinct word.
//
// Do something with all the distinct words.
Then repeat the above a second time using even instead of odd.
Then you have divided the task into 2 and can do each separately.
No words from the first set will appear in the second set.
The hash is necessary because the words could (in theory) all end with the same letter.
The solution can be extended to work with different memory constraints. Rather than saying just odd/even we can divide the words into X groups by using number MOD X.
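The MOD X variant could look something like this sketch. PartitionedCounter is a hypothetical name, the Iterable stands in for re-reading the file once per group, and String.hashCode() is used as the hash, which is my choice for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical multi-pass counter: each pass only keeps the words of one hash group. */
class PartitionedCounter {
    /**
     * words:  the input, iterated once per group (for a file: reopen and re-read it).
     * groups: number of passes; more passes mean less memory per pass.
     * sink:   receives every distinct word with its count exactly once.
     */
    static void count(Iterable<String> words, int groups, BiConsumer<String, Integer> sink) {
        for (int g = 0; g < groups; g++) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : words) {
                // floorMod keeps the group index non-negative for negative hash codes.
                if (Math.floorMod(word.hashCode(), groups) == g) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
            counts.forEach(sink); // emit this group's results, then let the map be collected
        }
    }
}
```

With groups set to 2 this is exactly the odd/even split above: equal words always hash to the same group, so no word is counted in two passes.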

Use the H2 Database Engine. It can work on disk, or in memory if necessary, and it has really good performance.

I'd create a SHA-1 of each word, then store these numbers in a Set. Then, of course, when reading a number, check the Set if it's there [(not totally necessary since Set by definition is unique, so you can just "add" its SHA-1 number also)]

Depending on what kind of characters the words are built from, you can choose this system:
If a word might contain any character of the alphabet in upper and lower case, you will have (26*2)^8 combinations, which is 53,459,728,531,456. This number fits in a long datatype.
So compute the checksum for the strings like this:
public static long checksum(String str)
{
    String tokens = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    long checksum = 0;
    for (int i = 0; i < str.length(); ++i)
    {
        int c = tokens.indexOf(str.charAt(i)); // -1 if the character is not a letter
        checksum *= tokens.length();
        checksum += c;
    }
    return checksum;
}
This will reduce the memory taken per word by more than 8 bytes. A String is backed by an array of char, and each char in Java takes 2 bytes, so 8 chars = 16 bytes. But the String class contains more data than only the char array: it also holds some integers for size and offset, at 4 bytes per int. Don't forget the references to the Strings and char arrays as well. So a rough estimate makes me think that this will save about 28 bytes per word.
So, at 8 bytes per word with 10 000 000 words, that gives 76 MB. This also shows that your first estimate was wrong, because it left out all the overhead I mentioned above. So even this method won't fit in the available memory.

You can convert each 8 byte word into a long and use TLongIntHashMap which is quite a bit more efficient than Map<String, Integer> or Map<Long, Integer>
If you just need the distinct words you can use TLongHashSet

If you can sort your file first (e.g. using the memory-efficient "sort" utility on Unix), then it's easy. You simply read the sorted items, counting the neighboring duplicates as you go, and write the totals to a new file immediately.
If you need to sort using Java, this post might help:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
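The counting pass over an already sorted file can be sketched like this. SortedRunCounter is a hypothetical name; it takes a BufferedReader and a Writer so it works with files as well as in-memory strings, and it writes "word count" lines as in the question's expected output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical single pass that counts runs of equal lines in sorted input. */
class SortedRunCounter {
    static void countRuns(BufferedReader sorted, Writer out) {
        try {
            String current = null; // the word whose run is being counted
            int count = 0;
            String line;
            while ((line = sorted.readLine()) != null) {
                if (line.equals(current)) {
                    count++;
                } else {
                    if (current != null) {
                        out.write(current + " " + count + "\n"); // run finished: emit it
                    }
                    current = line;
                    count = 1;
                }
            }
            if (current != null) {
                out.write(current + " " + count + "\n");         // emit the final run
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only the current word and its counter are held in memory, so the memory use stays constant regardless of the file size.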

You can use constant memory by reading your file multiple times.
Basic idea:
Treat the file as n partitions p_1...p_n, sized so that you can load each of them into ram.
Load p_i into a Map structure, scan through the whole file and keep track of counts of the p_i elements only (see answer of Heiko Rupp)
Remove an element if we encounter the same value in a partition p_j with j < i (it was already counted in an earlier pass)
Output result counts for elements in the Map
Clear Map, repeat for all p_1...p_n

As in any optimization, there are tradeoffs. In your case, you can do the same task with less memory but it comes at the cost of increasing runtime.
Your scarce resource is memory, so you can't store the words in RAM.
You could use a hash instead of the word as other posts mention, but if your file grows in size this is no solution, since at some point you'll run into the same problem again.
Yes, you could use an external web server to crunch the file and do the job for your client app, but reading your question it seems that you want to do the whole thing in one place (your app).
So my proposal is to iterate over the file, and for each word:
If the word was found for first time, write the string to a result file together with the integer value 1.
If the word was processed before (it will appear in the result file), increment the record value.
This solution scales well no matter the number of lines of your input file nor the length of the words*.
You can optimize the way you do the writes in the output file, so that the search is made faster, but the basic version described above is enough to work.
EDIT:
*It scales well until you run out of disk space XD. So the precondition would be to have a disk with at least 2N bytes of free usable space, where N is the input file size in bytes.

possible solutions:
Use file sorting and then just count the consecutive occurrences of each value.
Load the file in a database and use a count statement like this: select value, count(*) from table group by value
