The task is to count the occurrences of each word in an input file.
The input file has 8 characters per line, and there are 10M lines, for example:
aaaaaaaa
bbbbbbbb
aaaaaaaa
abcabcab
bbbbbbbb
...
the output is:
aaaaaaaa 2
abcabcab 1
bbbbbbbb 2
...
It would take 80MB of memory if I loaded all of the words into memory, but there is only 60MB of system memory that I can use for this task. So how can I solve this problem?
My algorithm uses a Map<String,Integer>, but the JVM throws Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. I know I could fix this by setting -Xmx1024m, for example, but I want to solve it using less memory.
I believe that the most robust solution is to use the disk space.
For example, you can sort your file into another file, using an algorithm for sorting large files (one that uses disk space), and then count the consecutive occurrences of the same word.
I believe that this post can help you. Or search for material on external sorting yourself.
Update 1
Or, as #jordeu suggests, you can use an embedded Java database library, like H2, JavaDB, or similar.
Update 2
I thought about another possible solution, using a prefix tree (trie). However, I still prefer the first one, because I'm not an expert on tries.
Read one line at a time and then have e.g. a HashMap<String,Integer> where you put your words as the key and the count as the value.
If a key exists, increase the count. Otherwise add the key to the map with a count of 1.
There is no need to keep the whole file in memory.
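A minimal sketch of this approach (the file name input.txt is a placeholder; a TreeMap is used so the output comes out alphabetically sorted, as in the example above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

public class WordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new TreeMap<>(); // keeps keys sorted for the output
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
            String word;
            while ((word = reader.readLine()) != null) {
                // merge() increments an existing count or inserts 1 for a new word
                counts.merge(word, 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}

Only the distinct words and their counters are held in memory, never the whole file.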
I guess you mean the number of occurrences of each distinct word, do you?
So the obvious approach is to store (distinctive information about) each different word as a key in a map, where the value is the associated counter. Depending on how many distinct words are expected, storing all of them may even fit into your memory; however, not in the worst-case scenario when all words are different.
To lessen memory needs, you could calculate a checksum for the words and store that, instead of the words themselves. Storing e.g. a 4-byte checksum instead of an 8-character word (requiring at least 9 bytes to store) requires 40M instead of 90M. Plus you need a counter for each word too. Depending on the expected number of occurrences for a specific word, you may be able to get by with 2 bytes (for max 65535 occurrences), which requires max 60M of memory for 10M distinct words.
Update
Of course, the checksum can be calculated in many different ways, and it can be lossless or not. This also depends a lot on the character set used in the words. E.g. if only lowercase standard ASCII characters are used (as shown in the examples above), we have 26 different characters at each position. Consequently, each character can be losslessly encoded in 5 bits. Thus 8 characters fit into 5 bytes, which is a bit more than the limit, but may be dense enough, depending on the circumstances.
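For illustration, a small hedged sketch of that packing, assuming exactly 8 lowercase a-z characters per word (5 bits each, 40 bits total, stored in a long):

// Packs an 8-character lowercase word into 40 bits of a long (5 bits per character).
// Assumes only 'a'..'z'; anything else would need a larger code or a different scheme.
public static long pack(String word) {
    long packed = 0;
    for (int i = 0; i < word.length(); i++) {
        packed = (packed << 5) | (word.charAt(i) - 'a'); // 0..25 fits in 5 bits
    }
    return packed;
}

// Reverses pack() for an 8-character word.
public static String unpack(long packed) {
    char[] chars = new char[8];
    for (int i = 7; i >= 0; i--) {
        chars[i] = (char) ('a' + (packed & 0x1F));
        packed >>>= 5;
    }
    return new String(chars);
}

Because the packing is lossless, the original words can be reconstructed for the output.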
I suck at explaining theoretical answers but here we go....
I have made some assumptions about your question as it is not entirely clear.
The memory used to store all the distinct words is 80MB (the entire file is bigger).
The words could contain non-ASCII characters (so we just treat the data as raw bytes).
It is sufficient to read over the file twice, storing ~40MB of distinct words each time.
// Loop over the file and for each word:
//
// Compute a hash of the word.
// Convert the hash to a number by some means (skip if possible).
// If the number is odd then skip to the next word.
// Use conventional means to store the distinct word.
//
// Do something with all the distinct words.
Then repeat the above a second time using even instead of odd.
Then you have divided the task into 2 and can do each separately.
No words from the first set will appear in the second set.
The hash is necessary because the words could (in theory) all end with the same letter.
The solution can be extended to work with different memory constraints. Rather than saying just odd/even we can divide the words into X groups by using number MOD X.
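A hedged sketch of that generalized version, with a placeholder file name; the "do something" step here is simply counting occurrences, as in the original question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class PartitionedWordCount {
    public static void main(String[] args) throws IOException {
        int groups = 2; // increase this to lower the peak memory used per pass
        for (int g = 0; g < groups; g++) {
            Map<String, Integer> counts = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
                String word;
                while ((word = reader.readLine()) != null) {
                    // Math.floorMod avoids negative results for negative hash codes.
                    if (Math.floorMod(word.hashCode(), groups) != g) {
                        continue; // this word belongs to another pass
                    }
                    counts.merge(word, 1, Integer::sum);
                }
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getKey() + " " + e.getValue());
            }
        }
    }
}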
Use the H2 Database Engine; it can work on disk or in memory if necessary, and it has really good performance.
I'd create a SHA-1 hash of each word and then store these values in a Set. Then, of course, when reading a word, check whether its hash is in the Set (this is not strictly necessary, since a Set is unique by definition, so you can simply add its SHA-1 value every time).
Depending on what kind of characters the words are built of, you can choose this scheme:
If a word might contain any letter of the alphabet in upper or lower case, you will have (26*2)^8 combinations, which is 53,459,728,531,456. This number easily fits in a long datatype.
So compute the checksum for the strings like this:
public static long checksum(String str)
{
    String tokens = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    long checksum = 0;
    for (int i = 0; i < str.length(); ++i)
    {
        int c = tokens.indexOf(str.charAt(i));
        checksum *= tokens.length();
        checksum += c;
    }
    return checksum;
}
This will reduce the memory taken per word by much more than 8 bytes. In Java, a String wraps a char array, and each char takes 2 bytes, so 8 chars = 16 bytes. But the String class holds more than just the char array; it also contains some ints for size and offset, at 4 bytes each. Don't forget the references to the String and char array objects either. So a rough estimate suggests this saves about 28 bytes per word.
With 8 bytes per word and 10,000,000 words, that gives about 76 MB. That matches your first estimate, which was wrong because it forgot all the things I noted above. So this means that even this method won't work.
You can convert each 8-byte word into a long and use a TLongIntHashMap, which is quite a bit more efficient than a Map<String, Integer> or Map<Long, Integer>.
If you just need the distinct words, you can use a TLongHashSet.
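A rough sketch of the conversion, assuming exactly 8 single-byte (e.g. ASCII) characters per word; the Trove call shown in the comment is from the GNU Trove library, so treat its exact signature as an assumption:

// Packs exactly 8 single-byte characters into one long.
static long toLong(String word) {
    long value = 0;
    for (int i = 0; i < 8; i++) {
        value = (value << 8) | (word.charAt(i) & 0xFF);
    }
    return value;
}

// Counting with Trove's primitive map (assumed Trove 3.x API):
//   TLongIntHashMap counts = new TLongIntHashMap();
//   counts.adjustOrPutValue(toLong(word), 1, 1); // +1 if present, otherwise put 1

The win is that both keys and values stay primitive, avoiding one boxed object per entry.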
If you can sort your file first (e.g. using the memory-efficient "sort" utility on Unix), then it's easy. You simply read the sorted items, counting the neighboring duplicates as you go, and write the totals to a new file immediately.
If you need to sort using Java, this post might help:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
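Once the file is sorted (for example with sort input.txt > sorted.txt), the counting step is a single streaming pass; a hedged sketch with placeholder file names:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class CountSorted {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("sorted.txt"));
             PrintWriter out = new PrintWriter("counts.txt")) {
            String previous = null;
            int count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(previous)) {
                    count++;
                } else {
                    if (previous != null) {
                        out.println(previous + " " + count);
                    }
                    previous = line;
                    count = 1;
                }
            }
            if (previous != null) {
                out.println(previous + " " + count); // flush the last run
            }
        }
    }
}

Only the previous line and a counter are kept in memory at any time.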
You can use constant memory by reading your file multiple times.
Basic idea:
Treat the file as n partitions p_1...p_n, sized so that you can load each of them into RAM.
Load p_i into a Map structure, scan through the whole file and keep track of the counts of the p_i elements only (see the answer by Heiko Rupp).
Remove an element if you encounter the same value in a partition p_j with j smaller than i (it has already been counted by an earlier pass).
Output the result counts for the elements in the Map.
Clear the Map and repeat for all of p_1...p_n.
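A hedged sketch of these steps, with a placeholder file name and a hard-coded partition size; each word is reported by exactly one pass, the one for the partition that contains its first occurrence:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MultiPassCount {
    public static void main(String[] args) throws IOException {
        String file = "input.txt";      // placeholder name
        int partitions = 4;             // pick n so one partition's words fit in RAM
        long linesPerPart = 2_500_000L; // 10M lines / 4 partitions

        for (int i = 0; i < partitions; i++) {
            Map<String, Integer> counts = new HashMap<>();

            // Sub-pass 1: collect the distinct words of partition p_i.
            try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String word;
                long lineNo = 0;
                while ((word = reader.readLine()) != null) {
                    if (lineNo++ / linesPerPart == i) {
                        counts.putIfAbsent(word, 0);
                    }
                }
            }

            // Sub-pass 2: count over the whole file. A word that also occurs in an
            // earlier partition p_j (j < i) is removed, because pass j has already
            // reported it; those lines come first, so removal precedes counting.
            try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String word;
                long lineNo = 0;
                while ((word = reader.readLine()) != null) {
                    long j = lineNo++ / linesPerPart;
                    if (j < i) {
                        counts.remove(word);
                    } else if (counts.containsKey(word)) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }

            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getKey() + " " + e.getValue());
            }
        }
    }
}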
As in any optimization, there are tradeoffs. In your case, you can do the same task with less memory but it comes at the cost of increasing runtime.
Your scarce resource is memory, so you can't store the words in RAM.
You could use a hash instead of the word, as other posts mention, but if your file grows in size that is not a solution, since at some point you'll run into the same problem again.
Yes, you could use an external web server to crunch the file and do the job for your client app, but reading your question it seems that you want to do the whole thing in one place (your app).
So my proposal is to iterate over the file, and for each word:
If the word was found for the first time, write the string to a result file together with the integer value 1.
If the word was processed before (it will appear in the result file), increment the record value.
This solution scales well regardless of the number of lines in your input file or the length of the words*.
You can optimize the way you do the writes in the output file, so that the search is made faster, but the basic version described above is enough to work.
EDIT:
*It scales well until you run out of disk space XD. So the precondition would be to have a disk with at least 2N bytes of free usable space, where N is the input file size in bytes.
Possible solutions:
Use file sorting and then just count the consecutive occurrences of each value.
Load the file in a database and use a count statement like this: select value, count(*) from table group by value
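A hedged sketch of the database route using plain JDBC (the H2 JDBC URL jdbc:h2:./wordcount, the credentials, and the file name are placeholder assumptions, and the H2 driver jar is assumed to be on the classpath):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbWordCount {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:./wordcount", "sa", "");
             BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {

            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS words(word VARCHAR(8))");
            }

            try (PreparedStatement ps = con.prepareStatement("INSERT INTO words VALUES (?)")) {
                String line;
                int batched = 0;
                while ((line = in.readLine()) != null) {
                    ps.setString(1, line);
                    ps.addBatch();
                    if (++batched % 10_000 == 0) {
                        ps.executeBatch(); // keep the batch small to limit memory
                    }
                }
                ps.executeBatch();
            }

            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getLong(2));
                }
            }
        }
    }
}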
Related
I have a text file containing ~30,000 words in alphabetical order each on a separate line.
I also have a Set<String> set containing ~10 words.
I want to check if any of the words in my set are in the word list (text file).
So far my method has been to:
Open the word list text file
Read a line/word
Check if set contains that word
Repeat to the end of the word list file
This seems badly optimised. For example, if I'm checking a word in my set that begins with the letter b, I see no point in checking words in the text file beginning with a, c, d, etc.
My proposed solution would be to separate the text file into 26 files, one file for words which start with each letter of the alphabet. Is there a more efficient solution than this?
Note: I know 30,000 words isn't that large a word list but I have to do this operation many times on a mobile device so performance is key.
You can extend your approach of using hash sets to the entire word list file. String comparisons are expensive, so it's better to create a HashSet of Integer. You should read the word list (assuming the words will not increase from 30,000 to something like 3 million) once in its entirety and save all the words in an Integer HashSet. When adding into the Integer HashSet use:
wordListHashSet.add(myCurrentWord.hashCode());
You have mentioned that you have a String set of 10 words that must be checked against the word list. Again, instead of a String set, create an Integer HashSet.
Create an iterator of this Integer Hash Set.
Iterator<Integer> it = myTenWordsHashSet.iterator();
Iterate over this in a loop and check for the following condition:
wordListHashSet.contains(it.next());
If this is true, then you have the word in the wordlist.
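Putting those pieces together, a small hedged sketch (the file and word values are placeholders; note that relying on hashCode() alone can, in principle, give a false positive when two different words collide):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class HashCodeLookup {
    public static void main(String[] args) throws IOException {
        Set<Integer> wordListHashSet = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader("wordlist.txt"))) {
            String word;
            while ((word = in.readLine()) != null) {
                wordListHashSet.add(word.hashCode());
            }
        }

        Set<String> myTenWords = Set.of("apple", "banana"); // placeholder words
        for (String w : myTenWords) {
            if (wordListHashSet.contains(w.hashCode())) {
                System.out.println(w + " is in the word list");
            }
        }
    }
}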
Using Integer hash sets is a good idea when performance is what you are looking for. Internally, Java computes the hash of each string and stores it so that repeated lookups are blazing fast: roughly O(1) per call, compared with the O(log n) of a binary search over the word list.
Hope that helps!
It's probably not worth the hassle for 30,000 words, but let's just say you have a lot more, like say 300,000,000 words, and still only 10 words to look for.
In that case, you could do a binary search in the large file for each of the search words, using Random Access Files.
Obviously, each searching step would require you to first find the beginning of a word (or the next word, implementation dependent), which makes it a lot more difficult, and handling all the corner cases exceeds the amount of code one could provide here. But it could still be done and would surely be faster than reading through all of the 300,000,000 words once.
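Still, a rough, hedged sketch of the idea (it assumes a sorted ASCII word list with one word per line and glosses over charset and CRLF handling, exactly the kind of corner cases mentioned above; the file name is a placeholder):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileBinarySearch {

    static boolean contains(RandomAccessFile file, String target) throws IOException {
        long low = 0;                         // always the offset of a line start
        long high = file.length();
        while (low < high) {
            long mid = (low + high) / 2;
            file.seek(mid);
            file.readLine();                  // skip the (possibly partial) line at mid
            String word = file.readLine();    // first complete line after mid
            if (word == null || word.compareTo(target) > 0) {
                high = mid;                   // target can only start at or before mid
            } else if (word.compareTo(target) < 0) {
                low = file.getFilePointer();  // target starts after this line
            } else {
                return true;
            }
        }
        // The loop can converge on the start of the target line without ever reading
        // it (the line at mid is always skipped), so check that position explicitly.
        file.seek(low);
        String word = file.readLine();
        return word != null && word.equals(target);
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile("wordlist.txt", "r")) {
            System.out.println(contains(f, "banana")); // placeholder word
        }
    }
}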
You might consider iterating through your 10-word set (maybe parse it from the file into an array), and for each entry, using a binary search algorithm to see if it's contained in the larger list. Binary search should only take O(log N), so in this case log(30,000), which is significantly faster than 30,000 steps.
Since you'll repeat this step once for every word in your set, it should take 10*log(30k).
You can make some improvements depending on your needs.
If, for example, the file remains unchanged but your 10-word Set changes regularly, then you can load the file into another Set (a HashSet). Now you just need to search for a match in this new Set. This way your search will always be O(1).
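A brief hedged sketch of both variants (placeholder file and words; the word list is assumed to be alphabetically sorted, as stated in the question):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SetLookup {
    public static void main(String[] args) throws IOException {
        List<String> wordList = Files.readAllLines(Paths.get("wordlist.txt"));
        Set<String> tenWords = Set.of("baker", "bread"); // placeholder words

        // Variant 1: binary search on the sorted list, O(log n) per word.
        for (String w : tenWords) {
            if (Collections.binarySearch(wordList, w) >= 0) {
                System.out.println(w + " found via binary search");
            }
        }

        // Variant 2: load the list into a HashSet once, then O(1) per lookup.
        Set<String> wordSet = new HashSet<>(wordList);
        for (String w : tenWords) {
            if (wordSet.contains(w)) {
                System.out.println(w + " found via hash lookup");
            }
        }
    }
}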
I was reading about algorithmic problem and one was the following:
Having a file with millions of lines of data, there are 2 lines which
are identical. The lines are so long that they may not fit in memory. Find
the 2 identical lines.
The solution suggested was to read lines in parts and create hashes for each line.
E.g. you build the hash for line 1 by building the hash of part-1 of line 1 (which can be read in memory) and then hash of part-2 of line 1 up to part-N of line 1.
Store the hashes in a file or a hashtable. For any matching hash values, compare the lines. If the lines are the same, we have solved it.
Although I understand this solution at a high level, I have no idea how it could be implemented. How can we associate a hash with a specific line in the file? Is this a language implementation detail?
E.g. in Java, how would we address this?
The real answer is to buy more memory. The longest string you can have in Java is 2 GB, and that will fit in machines these days. You can buy 32 GB for less than $200.
But to solve the problem, I suggest you
find the offset of each line.
find the lines which are the same length (using the difference of offset)
calculate 64-bit or longer hashes of the lines with the same length.
for the lines with the same hash, do a byte-by-byte comparison.
Note: if you don't have enough memory to cache the entire file this will take a very long time. If you have a 32 GB machine and it has a 64 GB file, each pass will take about 20 minutes, and this has multiple passes.
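A hedged sketch along those lines; to keep it short it folds the length and a 64-bit hash into one key, the file name is a placeholder, and the file is assumed to end with a newline:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FindDuplicateLines {
    public static void main(String[] args) throws IOException {
        String file = "big.txt"; // placeholder

        // Pass 1: for every line, record its offset and length under a key that
        // combines a 64-bit rolling hash of its bytes with its length.
        Map<Long, List<long[]>> candidates = new HashMap<>(); // key -> list of {offset, length}
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(file))) {
            long offset = 0, lineStart = 0, hash = 1125899906842597L, length = 0;
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b == '\n') {
                    long key = hash * 31 + length;
                    candidates.computeIfAbsent(key, k -> new ArrayList<>())
                              .add(new long[]{lineStart, length});
                    lineStart = offset;
                    hash = 1125899906842597L;
                    length = 0;
                } else {
                    hash = 31 * hash + b;
                    length++;
                }
            }
        }

        // Pass 2: byte-by-byte comparison of the lines that share a key.
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            for (List<long[]> group : candidates.values()) {
                for (int i = 0; i < group.size(); i++) {
                    for (int j = i + 1; j < group.size(); j++) {
                        if (sameBytes(raf, group.get(i), group.get(j))) {
                            System.out.println("Duplicate lines at offsets "
                                    + group.get(i)[0] + " and " + group.get(j)[0]);
                            return;
                        }
                    }
                }
            }
        }
    }

    // Slow but simple: compares the two lines one byte at a time via seek().
    static boolean sameBytes(RandomAccessFile raf, long[] a, long[] b) throws IOException {
        if (a[1] != b[1]) return false;
        for (long k = 0; k < a[1]; k++) {
            raf.seek(a[0] + k);
            int x = raf.read();
            raf.seek(b[0] + k);
            if (x != raf.read()) return false;
        }
        return true;
    }
}

Note that only offsets, lengths and hashes are held in memory, never the line contents themselves.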
1) Which API should I use to find the offset?
You count the number of bytes you have read, and that is the offset.
2) "The real answer is to buy more memory": project managers don't agree on this for real products. Do you have different experience?
I point out to them that I could spend a day, which could cost them > $1000 (even if that is not what I get paid), saving $100 worth of reusable memory, if they think that is a good use of resources. I let them decide ;)
My 8-year-old son has 8 GB in a PC he built, as the memory cost me £24. Yet you are right that there are project managers who think 8 GB is too much for a professional who is costing them that much per hour!? I have 16 GB in a PC which I don't use to run anything serious, because I do my work on a machine with 256 GB. You can buy machines with 2 TB these days, which is overkill for most applications. ;)
While I agree that the solution is to utilize modern techniques and leverage how cheap memory is these days, the problem is one meant to exercise the mind and understand how to solve the problem under the given constraints.
The hashing you talked about is rather simple.
The Java solution can leverage a few things under the hood which may obscure what's actually going on, so I will explain the solution first and the Java implementation second.
Generic Solution:
Hash functions such as SHA-1, MD5, etc. generate an integer by compressing the input. Let's say you can only store the first MB of characters of each line.
You would iterate over each line, get the first MB of characters, and pass that into the hashing algorithm (MD5, for example).
You then map the hash as the key, and a list/array of line numbers as the value.
After the first pass, any lines with a matching first MB of characters will end up with the same hash, and thus in the same list in the map.
To prepare for the second pass, you search the map and cull any lists that contain only one line number.
Then you create a list of line numbers by compiling the line numbers from the remaining entries in the map; these lines will be the only ones checked in the second pass.
On the second pass, you pull the second MB of characters from each line in your line list, hash them, and put them in the map in the same fashion as pass one.
Iterate over the entries in the map, culling hash entries that only have one line number.
Repeat the previous step, incrementing the character block (MB) to coincide with the pass number.
When you reach a pass where only one hash has multiple line numbers, and that hash has only two line numbers, those two lines are the ones that are the same.
This is essentially a tree search.
Java Method:
Java has a class called HashMap, which automatically hashes the key. By using a
HashMap<String, ArrayList<Integer>>
for your master map, all you have to do on each step is call
map.get(mbBlock).add(lineNumber);
Of course, you should check whether this is the first time the key has been used, so you don't get a NullPointerException.
After each pass, cull the entries containing only one line number.
Repeat over the remaining lines until you have only two line numbers left.
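A hedged sketch of one such pass (for brevity it hashes every line on each pass rather than only the surviving candidates, assumes each line ends with '\n', and reads character by character so that long lines never have to fit in memory):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class BlockHashPass {
    static final int BLOCK = 1 << 20; // 1 MB of characters per pass

    // One pass: hash the `pass`-th block of every line and group line numbers by hash.
    static Map<String, ArrayList<Integer>> hashPass(String file, int pass)
            throws IOException, NoSuchAlgorithmException {
        Map<String, ArrayList<Integer>> map = new HashMap<>();
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            int lineNumber = 0;
            StringBuilder block = new StringBuilder();
            long posInLine = 0;
            long blockStart = (long) pass * BLOCK;
            int c;
            while ((c = in.read()) != -1) {
                if (c == '\n') {
                    String key = toHex(md5.digest(block.toString().getBytes()));
                    map.computeIfAbsent(key, k -> new ArrayList<>()).add(lineNumber);
                    lineNumber++;
                    block.setLength(0);
                    posInLine = 0;
                } else {
                    if (posInLine >= blockStart && posInLine < blockStart + BLOCK) {
                        block.append((char) c); // keep only the current block in memory
                    }
                    posInLine++;
                }
            }
        }
        // Cull groups with a single line number; only the rest survive to the next pass.
        map.values().removeIf(lines -> lines.size() < 2);
        return map;
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hashPass("big.txt", 0).values()); // pass 0 on a placeholder file
    }
}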
Get the first k characters of each line, where k is configurable. Hash them to find the groups of lines that could contain identical lines.
Based on the result of the first step, which greatly narrows down the search range, run your algorithm on each smaller group for the next k characters.
The search range is narrowed down dramatically after each round, except in the worst case.
The trick of the algorithm is to break a big problem into small ones and make full use of the results of previous steps.
In a Java application, I have a requirement where a user will define a string value and then keep appending further string values to the original value...
There can be multiple different named strings defined by the user.
From HashMap, ArrayList and LinkedList, which one should I use on the basis of the following criteria:
(1) Most memory efficient
(2) Max possible space per single string value
Also what is the max possible size of a single string value in all 3 options (hashmap/array list/linked list) ?
If the user is entering the string, you shouldn't need to worry. The maximum String length is over 2 billion characters.
The fastest typing speed ever recorded is 216 words per minute:
http://en.wikipedia.org/wiki/Words_per_minute
This means even a fast typist will take a minute to write 1K of letters. Writing one String of maximum length would take 1491 days, non-stop (assuming their keyboard, their computer, and the user don't die in the attempt).
It is highly unlikely you need the most efficient data structure, and using the simplest and most obvious choice is a better approach (again, because users cannot type fast enough for it to ever matter).
A Kindle can store thousands of books in a device which costs less than 100 pounds. A user could write all their life and not produce enough text to fill up a small, cheap mobile device.
Save your time and use StringBuilder or StringBuffer (if you need thread safety).
You will need an ArrayList<StringBuffer>.
If you are creating a text editor where the user can jump to anywhere in the string and start changing it, a gap buffer is a fairly good data structure: http://en.wikipedia.org/wiki/Gap_buffer
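A minimal hedged sketch of the mechanics, just to show the idea (not production code):

// A gap buffer: free space (the "gap") is kept at the cursor, so inserting at the
// cursor is O(1) amortised, and moving the cursor only shifts characters across the gap.
public class GapBuffer {
    private char[] buffer = new char[16];
    private int gapStart = 0;              // cursor position
    private int gapEnd = buffer.length;    // first used slot after the gap

    public void insert(char c) {
        if (gapStart == gapEnd) grow();
        buffer[gapStart++] = c;
    }

    public void moveCursor(int position) {
        // Shift characters across the gap so that the gap starts at `position`.
        while (position < gapStart) buffer[--gapEnd] = buffer[--gapStart];
        while (position > gapStart) buffer[gapStart++] = buffer[gapEnd++];
    }

    private void grow() {
        char[] bigger = new char[buffer.length * 2];
        System.arraycopy(buffer, 0, bigger, 0, gapStart);
        int tail = buffer.length - gapEnd;
        System.arraycopy(buffer, gapEnd, bigger, bigger.length - tail, tail);
        gapEnd = bigger.length - tail;
        buffer = bigger;
    }

    @Override
    public String toString() {
        return new String(buffer, 0, gapStart)
             + new String(buffer, gapEnd, buffer.length - gapEnd);
    }

    public static void main(String[] args) {
        GapBuffer gb = new GapBuffer();
        for (char c : "hello world".toCharArray()) gb.insert(c);
        gb.moveCursor(5);            // jump back into the middle of the text
        gb.insert(',');
        System.out.println(gb);      // prints "hello, world"
    }
}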
Let's say I have 500 words:
Martin
Hopa
Dunam
Golap
Hugnog
Foo
... + 494 more words
I have the following text that is about 85KB in total:
Marting went and got him self stuff
from Hopa store and now he is looking
to put it into storage with his best
friend Dunam. They are planing on
using Golap lock that they found in
Hugnog shop in Foo town. >... text continues into several pages
I would like to produce the following text:
------- went and got him self stuff
from ---- store and now he is looking
to put it into storage with his best
friend ----. They are planing on
using ---- lock that they found in
------ shop in --- town. >... text continues into several pages
Currently I'm using the Commons Lang method:
String[] words500 = // all 500 words
String[] masksFor500Words = // generated mask for each word
String filteredText = StringUtils.replaceEach(textToBeFiltered, words500, masksFor500Words);
Is there a another way to do this that could be more efficient when it comes to memory and CPU usage?
What is the best storage for the 500 words? File, List, enum, array ...?
How would you get statistics, such as how many and which words were replaced, and, for each word, how many times it was replaced?
I wouldn't care much about CPU and memory usage. It should be relatively small for such a problem and such a volume of text.
What I would do is
have a Map containing all the strings as keys, with the number of times they have been found in the text (initially 0)
read the text word by word, by using a StringTokenizer, or the String.split() method
for each word, find if the map contains it (O(1) operation, very quick)
if it contains it, add "----" to a StringBuilder, and increment the value stored for the word in the map
else add the word itself (with a space before unless it's the first word of the text)
At the end of the process, the StringBuilder contains the result, and the map contains the number of times each word has been used as a replacement.
Make sure to initialize the StringBuilder with the length of the original text, in order to avoid too many reallocations.
Should be simple and efficient.
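A hedged sketch of those steps with tiny placeholder inputs (the real code would load the 500 words and the 85KB text from wherever they are stored):

import java.util.HashMap;
import java.util.Map;

public class WordMasker {
    public static void main(String[] args) {
        String[] words = {"Martin", "Hopa", "Dunam", "Golap", "Hugnog", "Foo"}; // the 500 words
        String text = "Hopa store sells Golap locks in Foo town";               // the 85KB text

        // Map of word -> number of times it was replaced (initially 0).
        Map<String, Integer> replacements = new HashMap<>();
        for (String w : words) replacements.put(w, 0);

        StringBuilder result = new StringBuilder(text.length());
        boolean first = true;
        for (String token : text.split(" ")) {
            if (!first) result.append(' ');
            if (replacements.containsKey(token)) {
                result.append("----");                 // or a mask as long as the word
                replacements.merge(token, 1, Integer::sum);
            } else {
                result.append(token);
            }
            first = false;
        }

        System.out.println(result);       // the masked text
        System.out.println(replacements); // how many times each word was replaced
    }
}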
I wouldn't care about memory much, but in case you do: trie is your friend. It's memory efficient for large sets and it allows very efficient matching. You may want to implement it in a compressed fashion.
If I understand the problem correctly, you need to read the 85KB of text and parse out every word (use split or a StringTokenizer). For every word, you need to know whether you have it in the set of 500 words, and if so, switch it with the corresponding mask.
If you know you have about 500 words, I'd suggest storing the 500 words and their masks in a HashMap with an initial capacity of about 650 (the JDK docs say hashing is most efficient with a load factor of 0.75). Put the word-mask pairs into the HashMap with a for loop.
The biggest bang for the buck with the HashMap is that the get/put operations (searching for the key) are done in constant time, which is better than O(n) for an array and even O(log(n)) for a binary search on a sorted array.
Armed with the HashMap, you can build up a StringBuffer while filtering those 85KB of text.
Return the StringBuffer's toString() from your method and you are done! Regards, - M.S.
PS If you are building the map at a server and doing the filtering somewhere else (at a client) and need to transport the dictionary, serialization is not an obstacle: both HashMap and Hashtable implement Serializable, so either will do. If everything is on the same machine, HashMap is the more memory-efficient choice. Later, - M.S.
When I try to make a very large boolean array using Java, such as:
boolean[] isPrime1 = new boolean[600851475144];
I get a possible loss of precision error?
Is it too big?
To store 600 billion bits, you need an absolute minimum address space of 75 gigabytes! Good luck with that!
Even worse, the Java spec doesn't specify that a boolean array will use a single bit of memory for each element - it could (and in some cases does) use more.
In any case, I recognise that number from Project Euler #3. If it needs that much memory, you're doing it wrong...
Consider using a BitSet.
Since you're attempting to solve Euler-problem #3 the wrong way, here's a hint: You're supposed to find all the prime factors of a number, not all the prime numbers below a certain limit.
BTW: This particular Euler-problem can be solved using a very small amount of RAM.
An array index is an int, not a long, so your "array" is too big to fit into an array. One of the Java Collection classes might be more suitable. Never mind - Collection.size() returns an int as well, so a Collection can't store more than Integer.MAX_VALUE items either.
Um... that would be about 70GB worth of booleans. Not gonna work. No way.
The problem is that you are using a long value vs. an int value for the size of the array. Java does not support array lengths longer than the maximum value of an int. Java is treating your length as a long because the size you specified exceeds the maximum value of an int but fits within a long. Hence it must convert the length back to an int to create the array. The conversion from long to int is producing the warning you are seeing.
You can use an array of longs, encapsulated in a class that would handle all the operations on the array. Something like your own implementation of BitSet.
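A minimal hedged sketch of that idea (note that a single long[] is itself capped at about 2^31 elements, i.e. roughly 137 billion bits, so it still cannot cover the 600 billion indices from the question; the example uses a much smaller size):

public class LongBitArray {
    private final long[] bits;

    public LongBitArray(long size) {
        // Each long holds 64 bits; this still needs size / 8 bytes of heap overall.
        bits = new long[(int) ((size + 63) / 64)];
    }

    public void set(long index, boolean value) {
        int slot = (int) (index >>> 6);    // index / 64
        long mask = 1L << (index & 63);    // index % 64
        if (value) bits[slot] |= mask; else bits[slot] &= ~mask;
    }

    public boolean get(long index) {
        return (bits[(int) (index >>> 6)] & (1L << (index & 63))) != 0;
    }

    public static void main(String[] args) {
        LongBitArray flags = new LongBitArray(1_000_000);
        flags.set(13, true);
        System.out.println(flags.get(13)); // true
        System.out.println(flags.get(14)); // false
    }
}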
Why not just store the values in a file, and then seek to the position in the file and pull up the right value. Like others have stated, that's 70GB of data. In most cases, you wouldn't even be able to hold that in memory. If you're going to store it to a file, you could even look at individual bits when storing and retrieving the data using bitwise operators to save on storage space.
Also, since the number of primes decreases with the size of the numbers, it's probably better just to store the prime numbers themselves in the file, in order, and then do a binary search for the number to see if it is one of the primes.
What values do you have in the array?
For such a large number I guess it's going to be a sparse array, so maybe it would be best to use a Map/List and only allocate space for the bits whose value is 1 (or for the 0 bits, if most of your values will be 1).
Apache ActiveMQ has a datastructure called BitArrayBin. This is used to find out whether a message is duplicated. A message ID is a combination of producer ID and sequence ID.
Each producer will have a BitArrayBin to track its sequence IDs. Once it finds the BitArrayBin for the given producer, it sets the sequence ID, which is a long value, in the BitArrayBin:
oldValue = bitArrayBin.setBit(sequenceId, true);
if (oldValue) {
    // message is duplicated
}
The method returns the old value.
If y is the long index, it is used to derive a bin index and an offset into that bin:
y = bin index * 64 + offset
BitArrayBin is nothing but a holder for many bins, where the size can be defined during its construction. Each bin contains a long variable to store the bits, so it can store up to 64 boolean values.
Bit masking is used to set the bit and then get its value.
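Roughly, in the spirit of that derivation, a hedged sketch (not the actual ActiveMQ source):

public class BitBins {
    private final long[] bins = new long[1024]; // capacity chosen at construction time

    // Sets bit y to true and returns the previous value, like setBit(y, true).
    public boolean setBit(long y) {
        int binIndex = (int) (y / 64);          // which long holds the bit
        int offset   = (int) (y % 64);          // which bit inside that long
        boolean oldValue = (bins[binIndex] & (1L << offset)) != 0;
        bins[binIndex] |= 1L << offset;
        return oldValue;
    }

    public static void main(String[] args) {
        BitBins bin = new BitBins();
        System.out.println(bin.setBit(42)); // false: first time this sequence ID is seen
        System.out.println(bin.setBit(42)); // true: duplicate
    }
}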
This class doesn't have much documentation. You need to go through its source code to know the internals.