I was reading about algorithmic problems and one was the following:
Given a file with millions of lines of data, there are 2 lines which
are identical. The lines are so long that they may not fit in memory. Find
the 2 identical lines.
The suggested solution was to read each line in parts and build a hash for each line.
E.g. you build the hash for line 1 by hashing part 1 of line 1 (which can be read into memory), then part 2 of line 1, and so on up to part N of line 1.
Store the hashes in a file or hashtable. For any matching hash values, compare the lines. If the lines are the same, we have solved it.
Although I understand this solution at a high level, I have no idea how it could be implemented. How can we associate a hash with a specific line in the file? Is this a language implementation detail?
E.g. in Java, how would we address this?
The real answer is to buy more memory. The longest string you can have in Java is 2 GB, and that will fit in machines these days. You can buy 32 GB for less than $200.
But to solve the problem, I suggest you
find the offset of each line.
find the lines which are the same length (using the difference between consecutive offsets).
calculate 64-bit or longer hashes of the lines with the same length (a chunked-hashing sketch follows the note below).
for the lines with the same hash, do a byte-by-byte comparison.
Note: if you don't have enough memory to cache the entire file, this will take a very long time. If you have a 32 GB machine and a 64 GB file, each pass will take about 20 minutes, and this approach needs multiple passes.
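For the hashing step, one way to hash a line that is too long for memory is to feed it to a digest in fixed-size chunks. A minimal sketch, assuming SHA-256 via java.security.MessageDigest and a RandomAccessFile positioned at the line's offset (the helper name and buffer size are my own, not from the original answer):

import java.io.RandomAccessFile;
import java.security.MessageDigest;

// Hash 'length' bytes starting at 'offset' without loading the whole line.
static byte[] hashLine(RandomAccessFile file, long offset, long length) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    byte[] buffer = new byte[1 << 20]; // 1 MB chunks
    file.seek(offset);
    long remaining = length;
    while (remaining > 0) {
        int read = file.read(buffer, 0, (int) Math.min(buffer.length, remaining));
        if (read < 0) break;                 // unexpected end of file
        digest.update(buffer, 0, read);      // fold this chunk into the running hash
        remaining -= read;
    }
    return digest.digest();
}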
1) Which API can I use to find the offset?
You count the number of bytes you have read, and that is the offset.
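As a rough sketch of that counting (assuming single-byte '\n' line terminators; the logic would need adjusting for \r\n), you can record the starting offset of each line while streaming through the file once; a line's length is then the difference between consecutive offsets:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Record the byte offset at which each line starts; line i's length is
// offsets.get(i + 1) - offsets.get(i) - 1 (minus the '\n').
static List<Long> lineOffsets(String path) throws Exception {
    List<Long> offsets = new ArrayList<>();
    try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
        long position = 0;
        offsets.add(0L);                  // first line starts at offset 0
        int b;
        while ((b = in.read()) != -1) {
            position++;
            if (b == '\n') {
                offsets.add(position);    // next line starts right after the newline
            }
        }
    }
    return offsets;
}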
2) "The real answer is to buy more memory": Project Managers don't agree with this for real products. Do you have different experience?
I point out to them that I could spend a day, which could cost them > $1000 (even if that is not what I get paid), saving $100 worth of reusable memory, if they think that is a good use of resources. I let them decide ;)
My 8-year-old son has 8 GB in a PC he built, as the memory cost me £24. Yet you are right that there are project managers who think 8 GB is too much for a professional who is costing them that much per hour!? I have 16 GB in my PC which I don't use to run anything serious, because I do my work on a machine with 256 GB. You can buy machines with 2 TB these days, which is overkill for most applications. ;)
While I agree the solution is to use modern techniques and leverage how cheap memory is these days, the problem is one meant to exercise the mind and to understand how to solve the problem under the given constraints.
The hashing you talked about is rather simple.
The Java solution can leverage a few things under the hood which may obscure what's actually going on, so I will explain the solution first and the Java implementation second.
Generic Solution:
Hashing algorithms such as SHA-1, MD5, etc. generate an integer by compressing the input. Let's say you can only store the first MB of characters of each line.
You would iterate over each line, get the first MB of characters, and pass that into the hashing algorithm (MD5, for example).
You then map the hash as the key, and a list/array of line numbers as the value.
After the first pass, any lines with a matching first MB of characters will end up with the same hash, and thus in the same list in the map.
To prepare for the second pass, you search the map and cull any lists that contain only one line number.
Then you create a list of line numbers by compiling the line numbers from the remaining entries in the map; these lines will be the only ones checked in the second pass.
Second pass: you pull the second MB of characters from each line in your line list, hash them, and put them in the map in the same fashion as pass one.
Iterate over the entries in the map, culling hash entries that only have one line number.
Repeat step two, incrementing the character block (MB) to coincide with the pass number.
When you reach a pass where you only have one hash with multiple line numbers, and that hash only has two elements, those lines are the two that are the same.
This is essentially a tree search.
Java Method:
Java has a class called HashMap, which automatically hashes the key. By using a
HashMap<String, ArrayList<Integer>>
as your master map, all you have to do on each pass is call
map.get(mbBlock).add(lineNumber); Of course, you should check whether this is the first time the key has been used, so you don't get a NullPointerException.
After each pass, cull the entries containing only one line number.
Reiterate over the remaining lines until you only have two line numbers left.
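A minimal sketch of one such pass, assuming a BlockReader helper (hypothetical, not from the original answer) that returns the pass-th block of a given line:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface BlockReader {
    // Returns the pass-th block (e.g. 1 MB) of the given line, or "" past the end.
    String read(int lineNumber, int pass) throws Exception;
}

// One pass: group candidate line numbers by the content of their pass-th block,
// then keep only the lines whose block matched at least one other line.
static List<Integer> runPass(List<Integer> candidates, int pass, BlockReader reader) throws Exception {
    Map<String, ArrayList<Integer>> map = new HashMap<>();
    for (int lineNumber : candidates) {
        String block = reader.read(lineNumber, pass);
        map.computeIfAbsent(block, k -> new ArrayList<>()).add(lineNumber);
    }
    List<Integer> survivors = new ArrayList<>();
    for (ArrayList<Integer> group : map.values()) {
        if (group.size() > 1) {
            survivors.addAll(group);      // still candidates for being identical
        }
    }
    return survivors;
}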
Get the first k characters of each line, where k is configurable. Do your hash to find several groups of lines that could have identical lines.
Based on the result of the first step, in which you greatly narrow down the search range, run your algorithm on each smaller group for the next k characters.
The search range is narrowed down dramatically after each round, except perhaps in the worst case.
The trick of the algorithm is to break big problems into small ones and make full use of the results of previous steps.
First of all, I want to make clear that the nature of this question is different from other questions that have already been posted, as far as I know. Please let me know if it is not so.
Given
I have a list of ~3000 names.
There are ~2500 files which consist of names, one per line (taken from the name list).
Each file contains up to ~3000 names (and hence up to ~3000 lines, though the average is 400).
Problem
At a given time I will be provided with 2 files. I have to create a list of the names which are common to both files.
Pre Processing
To reduce the time complexity I have done preprocessing and sorted the names in all files.
My Approach
Sorted the names in the given list and indexed them from 0 to 2999
In each file, for each name:
Calculated the group number (name_index / 30)
Calculated the group value (for each name in the same group, compute 2^(name_index % 30) and add the results together)
Created a new file with the same name in the format "groupNumber blankSpace groupValue"
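A minimal sketch of that encoding, assuming the name-to-index map has already been built (names and types here are illustrative, not from the question):

import java.util.Map;
import java.util.TreeMap;

// Encode a file's names as group bitmasks: group = index / 30, bit = index % 30.
static Map<Integer, Integer> encode(Iterable<String> namesInFile, Map<String, Integer> nameToIndex) {
    Map<Integer, Integer> groupToMask = new TreeMap<>();
    for (String name : namesInFile) {
        int index = nameToIndex.get(name);
        int group = index / 30;
        int bit = index % 30;
        groupToMask.merge(group, 1 << bit, (a, b) -> a | b);  // set the bit for this name
    }
    return groupToMask;   // write each entry out as "groupNumber groupValue"
}

Two encoded files can then be intersected by AND-ing the masks of matching group numbers; the set bits of the result identify the common names.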
Result
Instead of having ~3000 names (though the average is 400) in each file, I will now have at most 100 lines in each file. Now I will have to check for common group numbers, and then with the help of bit manipulation I can find the common names.
Expectation
Can anyone please suggest a shorter and better solution to the problem? I can do preprocessing and store new files in my application so that minimal processing is required at the time of finding common names.
Please let me know if I am going in the wrong direction to solve the problem. Thanks in advance.
Points
In my approach the total size of the files is 258 KB (as I have used group numbers and group values), whereas kept as one name per line their size is 573 KB. These files have to be stored on a mobile device, so I need to decrease the size as far as possible. I am also looking into data compression, and I have no idea how to do that. Please care to explain that as well.
Have you tried the following?
Read names one at a time from list1, adding them to a HashSet.
Read names from list2 one at a time, looking them up in the HashSet created from list one. If a name is in the HashSet, it is common to both files.
If you want to preprocess for some extra speed, store the number of names in each list and select the shorter list as list1.
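A minimal sketch of that approach (file names are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

static List<String> commonNames(String file1, String file2) throws Exception {
    Set<String> names1 = new HashSet<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(file1))) {
        String name;
        while ((name = reader.readLine()) != null) {
            names1.add(name.trim());
        }
    }
    List<String> common = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(file2))) {
        String name;
        while ((name = reader.readLine()) != null) {
            if (names1.contains(name.trim())) {   // present in both files
                common.add(name.trim());
            }
        }
    }
    return common;
}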
Aha! Given the very low memory requirement you stated in your edit, there's another thing you could do.
Although I still think you could go for the solution other answers suggest: a HashSet with 3000 String entries won't get too big. My quick approximation with 16-char Strings suggests something below 400 kB of heap memory. Try it, then come back. It's like 25 lines of code for the whole program.
If the solution eats too much memory, then you could do this:
Sort the names in the files. That's always a good thing to have.
Open both files.
Read a line from both files.
If line1 < line2, read a line from file1, repeat.
If line1 > line2, read a line from file2, repeat.
Else they are the same: add to results, read a line from both files, and repeat.
It eats virtually no memory and it's a good place to use a compareTo() method (if you used it to sort the names, that is) and a switch statement, I think.
The size of the files doesn't influence the memory usage at all.
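A minimal sketch of that merge-style intersection over two sorted files (file names are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

// Both files must already be sorted with the same ordering as String.compareTo().
static List<String> sortedIntersection(String file1, String file2) throws Exception {
    List<String> result = new ArrayList<>();
    try (BufferedReader r1 = new BufferedReader(new FileReader(file1));
         BufferedReader r2 = new BufferedReader(new FileReader(file2))) {
        String line1 = r1.readLine();
        String line2 = r2.readLine();
        while (line1 != null && line2 != null) {
            int cmp = line1.compareTo(line2);
            if (cmp < 0) {
                line1 = r1.readLine();        // file1 is behind, advance it
            } else if (cmp > 0) {
                line2 = r2.readLine();        // file2 is behind, advance it
            } else {
                result.add(line1);            // same name in both files
                line1 = r1.readLine();
                line2 = r2.readLine();
            }
        }
    }
    return result;
}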
About the data compression: there are lots of tools and algorithms you could use; try this (look at the related questions, too), or this.
You are attempting to re-implement a Set with a List. Don't do that. Use a Set of names, which will automatically take care of duplicate inserts.
You need to read both files, there is no way around doing that.
// requires java.io.BufferedReader, java.io.FileReader, java.util.HashSet, java.util.Set
Set<String> names1 = new HashSet<String>();
try (BufferedReader reader1 = new BufferedReader(new FileReader(file1))) {
    String name;
    while ((name = reader1.readLine()) != null) {
        names1.add(name.trim());
    }
}
Set<String> names2 = new HashSet<String>();
try (BufferedReader reader2 = new BufferedReader(new FileReader(file2))) {
    String name;
    while ((name = reader2.readLine()) != null) {
        names2.add(name.trim());
    }
}
// with this line, names1 will discard any name not in names2
names1.retainAll(names2);
System.out.println(names1);
Assuming you use HashSet as this example does, you will be comparing hashes of the Strings, which will improve performance dramatically.
If you find the performance is not sufficient, then start looking for faster solutions. Anything else is premature optimization, and if you don't know how fast it must run, then it is optimization without a goal. Finding the "fastest" solution would require enumerating and exhausting every possible solution, since a solution you haven't checked yet might be faster.
I'm not sure whether I understood your requirements and situation.
You have about 2500 files, each of 3000 words (or 400?). There are many duplicate words which occur in multiple files.
Now somebody will ask you which words file-345 and file-765 have in common.
You could create a HashMap where you store every word, and a List of the Files in which that word occurs.
If you get file 345 with its 3000 words (400?), you look each word up in the hashmap and see whether file 765 is mentioned in its list.
However 2 * 3000 isn't that much. If I create 2 lists of Strings in Scala (which runs on the JVM):
val r = new scala.util.Random
val g1 = (1 to 3000).map (x => "" + r.nextInt (10000))
val g2 = (1 to 3000).map (x => "" + r.nextInt (10000))
and build the intersection
g1.intersect (g2)
I get the result (678 elements) in nearly no time on an 8-year-old laptop.
So how many requests will you have to answer? How often does the input of the files change? If rarely, then reading the 2 files might be the critical point.
How many unique words do you have? Maybe it is no problem at all to keep them all in memory.
I was looking at the source code of the sort() method of the java.util.ArrayList on grepcode. They seem to use insertion sort on small arrays (of size < 7) and merge sort on large arrays. I was just wondering if that makes a lot of difference given that they use insertion sort only for arrays of size < 7. The difference in running time will be hardly noticeable on modern machines.
I have read this in Cormen:
Although merge sort runs in O(n log n) worst-case time and insertion sort runs in O(n^2) worst-case time, the constant factors in insertion sort can make it faster in practice for small problem sizes on many machines. Thus, it makes sense to coarsen the leaves of the recursion by using insertion sort within merge sort when subproblems become sufficiently small.
If I were designing a sorting algorithm for some component I needed, I would consider using insertion sort for larger sizes (maybe up to size < 100) before the difference in running time, compared to merge sort, becomes evident.
My question is what is the analysis behind arriving at size < 7?
"The difference in running time will be hardly noticeable on modern machines."
How long it takes to sort small arrays becomes very important when you realize that the overall sorting algorithm is recursive, and the small array sort is effectively the base case of that recursion.
I don't have any inside info on how the number seven got chosen. However, I'd be very surprised if that wasn't done as the result of benchmarking the competing algorithms on small arrays, and choosing the optimal algorithm and threshold based on that.
P.S. It is worth pointing out that Java 7 uses Timsort by default.
I am posting this for people who visit this thread in the future, and to document my own research. I stumbled across this excellent link in my quest to find the answer to the mystery of choosing 7:
Tim Peters’s description of the algorithm
You should read the section titled "Computing minrun".
To give the gist, minrun is the cutoff size of the array below which the algorithm should start using insertion sort. Hence, we will always have sorted runs of size "minrun" on which we will need to run the merge operation to sort the entire array.
In java.util.ArrayList.sort(), "minrun" is chosen to be 7, but as far as my understanding of the above document goes, it busts that myth and shows that it should be near a power of 2, less than 256 and more than 8. Quoting from the document:
At 256 the data-movement cost in binary insertion sort clearly hurt, and at 8 the increase in the number of function calls clearly hurt. Picking some power of 2 is important here, so that the merges end up perfectly balanced (see next section).
The point I am making is that "minrun" can be any power of 2 (or near a power of 2) less than 64, without hindering the performance of TimSort.
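For reference, the minrun computation described in that document takes the six most significant bits of the array length and adds one if any of the remaining bits are set. A rough sketch (my own transcription of the description in listsort.txt, not the JDK's code):

// Keep the top 6 bits of n, and add 1 if any lower bit was shifted off.
// For n >= 64 the result lies in [32, 64]; for smaller n, the whole array is one run.
static int computeMinrun(int n) {
    int r = 0;                // becomes 1 if any bits are shifted off
    while (n >= 64) {
        r |= n & 1;
        n >>= 1;
    }
    return n + r;
}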
http://en.wikipedia.org/wiki/Timsort
"Timsort is a hybrid sorting algorithm, derived from merge sort and insertion sort, designed to perform well on many kinds of real-world data... The algorithm finds subsets of the data that are already ordered, and uses the subsets to sort the data more efficiently. This is done by merging an identified subset, called a run, with existing runs until certain criteria are fulfilled."
About number 7:
"... Also, it is seen that galloping is beneficial only when the initial element is not one of the first seven elements of the other run. This also results in MIN_GALLOP being set to 7. To avoid the drawbacks of galloping mode, the merging functions adjust the value of min-gallop. If the element is from the array currently under consideration (that is, the array which has been returning the elements consecutively for a while), the value of min-gallop is reduced by one. Otherwise, the value is incremented by one, thus discouraging entry back to galloping mode. When this is done, in the case of random data, the value of min-gallop becomes so large, that the entry back to galloping mode never takes place.
In the case where merge-hi is used (that is, merging is done right-to-left), galloping needs to start from the right end of the data, that is, the last element. Galloping from the beginning also gives the required results, but makes more comparisons than required. Thus, the algorithm for galloping includes the use of a variable which gives the index at which galloping should begin. Thus the algorithm can enter galloping mode at any index and continue thereon as mentioned above, as in, it will check at the next index which is offset by 1, 3, 7, ..., (2^k - 1), and so on from the current index. In the case of merge-hi, the offsets to the index will be -1, -3, -7, ...."
The task is to count the number of occurrences of each word in an input file.
The input file has 8 chars per line, and there are 10M lines, for example:
aaaaaaaa
bbbbbbbb
aaaaaaaa
abcabcab
bbbbbbbb
...
the output is:
aaaaaaaa 2
abcabcab 1
bbbbbbbb 2
...
It'll take 80 MB of memory if I load all of the words into memory, but only 60 MB is available on the system for this task. So how can I solve this problem?
My algorithm is to use a Map<String, Integer>, but the JVM throws Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. I know I could solve this by setting -Xmx1024m, for example, but I want to use less memory to solve it.
I believe that the most robust solution is to use the disk space.
For example you can sort your file into another file, using an algorithm for sorting large files (one that uses disk space), and then count the consecutive occurrences of the same word.
I believe that this post can help you. Or search by yourself for something about external sorting.
Update 1
Or, as @jordeu suggests, you can use a Java embedded database library like H2, JavaDB, or similar.
Update 2
I thought about another possible solution, using a Prefix Tree. However, I still prefer the first one, because I'm not an expert on prefix trees.
Read one line at a time
and then have e.g. a HashMap<String,Integer>
where you put your words as keys and the counts as Integer values.
If a key exists, increase the count. Otherwise add the key to the map with a count of 1.
There is no need to keep the whole file in memory.
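A minimal sketch of that counting loop (the input path is a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

static Map<String, Integer> countWords(String path) throws Exception {
    Map<String, Integer> counts = new HashMap<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
        String word;
        while ((word = reader.readLine()) != null) {   // one 8-char word per line
            counts.merge(word, 1, Integer::sum);       // insert with 1, or increment
        }
    }
    return counts;
}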
I guess you mean the number of distinct words, do you?
So the obvious approach is to store (distinctive information about) each different word as a key in a map, where the value is the associated counter. Depending on how many distinct words are expected, storing all of them may even fit into your memory, however not in the worst case scenario when all words are different.
To lessen memory needs, you could calculate a checksum for the words and store that, instead of the words themselves. Storing e.g. a 4-byte checksum instead of an 8-character word (requiring at least 9 bytes to store) requires 40M instead of 90M. Plus you need a counter for each word too. Depending on the expected number of occurrences for a specific word, you may be able to get by with 2 bytes (for max 65535 occurrences), which requires max 60M of memory for 10M distinct words.
Update
Of course, the checksum can be calculated in many different ways, and it can be lossless or not. This also depends a lot on the character set used in the words. E.g. if only lowercase standard ASCII characters are used (as shown in the examples above), we have 26 different characters at each position. Consequently, each character can be losslessly encoded in 5 bits. Thus 8 characters fit into 5 bytes, which is a bit more than the limit, but may be dense enough, depending on the circumstances.
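A minimal sketch of that 5-bit packing, assuming only lowercase 'a'-'z' and exactly 8 characters per word (the long holds 40 bits, so the value also fits in 5 bytes if written out manually):

// Pack an 8-character lowercase word into 40 bits (5 bits per character).
static long pack(String word) {
    long packed = 0;
    for (int i = 0; i < 8; i++) {
        int c = word.charAt(i) - 'a';       // 0..25 fits in 5 bits
        packed = (packed << 5) | c;
    }
    return packed;
}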
I suck at explaining theoretical answers but here we go....
I have made an assumption about your question as it is not entirely clear.
The memory used to store all the distinct words is 80MB (the entire file is bigger).
The words could contain non-ascii characters (so we just treat the data as raw bytes).
It is sufficient to read over the file twice storing ~ 40MB of distinct words each time.
// Loop over the file and for each word:
//
// Compute a hash of the word.
// Convert the hash to a number by some means (skip if possible).
// If the number is odd then skip to the next word.
// Use conventional means to store the distinct word.
//
// Do something with all the distinct words.
Then repeat the above a second time using even instead of odd.
Then you have divided the task into 2 and can do each separately.
No words from the first set will appear in the second set.
The hash is necessary because the words could (in theory) all end with the same letter.
The solution can be extended to work with different memory constraints. Rather than just odd/even, we can divide the words into X groups using hash MOD X.
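A rough sketch of that partitioned counting, assuming the word's hash decides which pass handles it (X passes, hash MOD X; names are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// Pass 'part' (0 <= part < partitions) only counts words whose hash falls in its partition,
// so each pass holds roughly 1/partitions of the distinct words in memory.
static Map<String, Integer> countPartition(String path, int part, int partitions) throws Exception {
    Map<String, Integer> counts = new HashMap<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
        String word;
        while ((word = reader.readLine()) != null) {
            int bucket = Math.floorMod(word.hashCode(), partitions);
            if (bucket == part) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }
    return counts;   // emit these counts, clear, then run the next partition
}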
Use the H2 Database Engine; it can work on disk or in memory as needed, and it has really good performance.
I'd create a SHA-1 of each word, then store these numbers in a Set. Then, of course, when reading a word, check whether its SHA-1 is already in the Set (not strictly necessary, since a Set is unique by definition, so you can just add its SHA-1 number as well).
Depending on what kind of character the words are build of you can chose for this system:
If it might contain any character of the alphabet in upper and lower case, you will have (26*2)^8 combinations, which is 53,459,728,531,456. This number can fit in a long datatype.
So compute the checksum for the strings like this:
public static long checksum(String str)
{
    String tokens = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    long checksum = 0;
    for (int i = 0; i < str.length(); ++i)
    {
        int c = tokens.indexOf(str.charAt(i));
        checksum *= tokens.length();
        checksum += c;
    }
    return checksum;
}
This will reduce the memory taken per word to 8 bytes (a long). A String is backed by an array of chars, and each char in Java is 2 bytes, so 8 chars = 16 bytes. But the String class contains more data than just the char array; it contains some integers for size and offset as well, at 4 bytes per int. Don't forget the memory pointers to the Strings and the char arrays either. So a raw estimate makes me think this saves about 28 bytes per word.
So, 8 bytes per word with 10,000,000 words gives 76 MB, which shows your first estimation was wrong because it forgot all the things I noted above. So this means that even this method won't work.
You can convert each 8-byte word into a long and use a TLongIntHashMap, which is quite a bit more efficient than Map<String, Integer> or Map<Long, Integer>.
If you just need the distinct words, you can use a TLongHashSet.
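A rough sketch of that, assuming GNU Trove is on the classpath and each line is exactly 8 ASCII characters; the import path and the adjustOrPutValue add-or-increment call are assumptions to verify against your Trove version:

import gnu.trove.map.hash.TLongIntHashMap;

import java.io.BufferedReader;
import java.io.FileReader;

// Pack an 8-character ASCII word into one long (8 bytes), then count with a primitive map.
static TLongIntHashMap countPacked(String path) throws Exception {
    TLongIntHashMap counts = new TLongIntHashMap();
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
        String word;
        while ((word = reader.readLine()) != null) {
            long key = 0;
            for (int i = 0; i < 8; i++) {
                key = (key << 8) | (word.charAt(i) & 0xFF);   // one byte per character
            }
            counts.adjustOrPutValue(key, 1, 1);   // increment if present, else insert 1
        }
    }
    return counts;
}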
If you can sort your file first (e.g. using the memory-efficient "sort" utility on Unix), then it's easy. You simply read the sorted items, counting the neighboring duplicates as you go, and write the totals to a new file immediately.
If you need to sort using Java, this post might help:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
You can use constant memory by reading your file multiple times.
Basic idea:
Treat the file as n partitions p_1...p_n, sized so that you can load each of them into RAM.
Load p_i into a Map structure, scan through the whole file, and keep track of the counts of the p_i elements only (see the answer by Heiko Rupp).
Remove an element if we encounter the same value in a partition p_j with j smaller than i.
Output result counts for elements in the Map
Clear Map, repeat for all p_1...p_n
As in any optimization, there are tradeoffs. In your case, you can do the same task with less memory but it comes at the cost of increasing runtime.
Your scarce resource is memory, so you can't store the words in RAM.
You could use a hash instead of the word, as other posts mention, but if your file grows in size this is no solution, since at some point you'll run into the same problem again.
Yes, you could use an external web server to crunch the file and do the job for your client app, but reading your question it seems that you want to do the whole thing in one place (your app).
So my proposal is to iterate over the file, and for each word:
If the word was found for first time, write the string to a result file together with the integer value 1.
If the word was processed before (it will appear in the result file), increment the record value.
This solution scales well no matter the number of lines of your input file nor the length of the words*.
You can optimize the way you do the writes in the output file, so that the search is made faster, but the basic version described above is enough to work.
EDIT:
*It scales well until you run out of disk space XD. So the precondition would be to have a disk with at least 2N bytes of free usable space, where N is the input file size in bytes.
possible solutions:
Use file sorting and then just count the consecutive occurrences of each value.
Load the file in a database and use a count statement like this: select value, count(*) from table group by value
I am confused as to how the Trie implementation saves space and stores data in the most compact form!
If you look at the tree below: when you store a character at any node, you also need to store a reference to it, and thus for each character of the string you need to store its reference.
OK, we saved some space when a common character arrived, but we lost more space in storing a reference to that character node.
So isn't there a lot of structural overhead to maintain this tree itself? Instead, if a TreeMap were used in its place, let's say to implement a dictionary, couldn't that have saved a lot more space, as the strings would be kept in one piece and hence no space would be wasted in storing references?
To save space when using a trie, one can use a compressed trie (also known as a patricia trie or radix tree), for which one node can represent multiple characters:
In computer science, a radix tree (also patricia trie or radix trie)
is a space-optimized trie data structure where each node with only one
child is merged with its child. The result is that every internal node
has at least two children. Unlike in regular tries, edges can be
labeled with sequences of characters as well as single characters.
This makes them much more efficient for small sets (especially if the
strings are long) and for sets of strings that share long prefixes.
Example of a radix tree:
Note that a trie is usually used as an efficient data structure for prefix matching on a set of strings. A trie can also be used as an associative array (like a hash table) where the key is a string.
Space is saved when you have lots of words to be represented by the tree, because many words share the same path in the tree; the more words you have, the more space you save.
But there is a better data structure if you want to save space. A trie doesn't save as much space as a directed acyclic word graph (DAWG) does, because a DAWG shares common nodes throughout the structure, whereas a trie doesn't share its suffix nodes. The wiki entry explains this in much detail, so have a look at it.
Here is the difference (graphically) between Trie and DAWG:
The strings "tap", "taps", "top", and "tops" stored in a Trie (left) and a DAWG (right), EOW stands for End-of-word.
The tree on the left side is the Trie, and the tree on the right is the DAWG. Compare them and see how the DAWG saves space efficiently. The Trie has duplicate nodes that represent the same letter/subword, while the DAWG has exactly one node for each letter/subword.
It's not about cheap space in memory, it's about precious space in a file or on a communications link. With an algorithm that builds that trie, we can send 'ten' in three bits: left-right-right. Compared to the 24 bits 'ten' would take up uncompressed, that's a huge saving of valuable disk space or transfer bandwidth.
You might deduce that it saves space on an ideal machine where every byte is allocated efficiently. However, real machines allocate aligned blocks of memory (8 bytes in Java and 16 bytes in some C++ implementations), so it may not save any space.
Java Strings and collections add a relatively high amount of overhead, so the percentage difference can be very small.
Unless your structure is very large, the value of your time outweighs the memory cost, and using the simplest, most standard, and easiest-to-maintain collection is far more important. E.g. your time can very easily be worth 1000x or more the value of the memory you are trying to save.
E.g. say you have 10,000 names and you can save 16 bytes each by using a trie (assuming this can be proven without taking more time). This equates to 160 KB, which at today's prices is worth about 0.1 cents. If your time costs your company $30 per hour, the cost of writing one line of tested code might be $1.
If you have to think about it for even a blink of an eye longer to save 160 KB, it's unlikely to be worth it on a PC. (Mobile devices are a different story, but the same argument applies IMHO.)
EDIT: You have inspired me to add an update http://vanillajava.blogspot.com/2011/11/ever-decreasing-cost-of-main-memory.html
Guava may indeed store the key at each level, but the point to realize is that the key does not really need to be stored, because the path to the node completely defines the key for that node. All that actually needs to be stored at each node is a single boolean indicating whether this is a leaf node or not.
Tries, like any other structure, excel at storing certain types of data. Specifically, tries are best at storing strings that share a common root. Think of storing full-path directory listings for example.
Lets say I have 500 words:
Martin
Hopa
Dunam
Golap
Hugnog
Foo
... + 494 more words
I have the following text, which is about 85 KB in total:
Marting went and got him self stuff
from Hopa store and now he is looking
to put it into storage with his best
friend Dunam. They are planing on
using Golap lock that they found in
Hugnog shop in Foo town. ...text continues for several pages
I would like to produce following text:
------- went and got him self stuff
from ---- store and now he is looking
to put it into storage with his best
friend ----. They are planing on
using ---- lock that they found in
------ shop in --- town. ...text continues for several pages
Currently I'm using commons method:
String[] words500 = ...; // all 500 words
String[] maskForWords500 = ...; // generated mask for each word
String filteredText = StringUtils.replaceEach(textToBeFiltered, words500, maskForWords500);
Is there another way to do this that could be more efficient in terms of memory and CPU usage?
What is the best storage for the 500 words? File, List, enum, array ...?
How would you get statistics, such as how many and which words were replaced, and, for each word, how many times it was replaced?
I wouldn't care much about CPU and memory usage. It should be relatively small for such a problem and such a volume of text.
What I would do is
have a Map containing all the strings as keys, with the number of times they have been found in the text (initially 0)
read the text word by word, by using a StringTokenizer, or the String.split() method
for each word, find if the map contains it (O(1) operation, very quick)
if it contains it, add "----" to a StringBuilder, and increment the value stored for the word in the map
else add the word itself (with a space before unless it's the first word of the text)
At the end of the process, the StringBuilder contains the result, and the map contains the number of times each word has been used as a replacement.
Make sure to initialize the StringBuilder with the length of the original text, in order to avoid too many reallocations.
Should be simple and efficient.
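A minimal sketch of that approach (splitting on whitespace is a simplification; punctuation handling and exact whitespace preservation are left out):

import java.util.HashMap;
import java.util.Map;

// Replace every occurrence of a known word with "----" and count the replacements.
static String maskWords(String text, Iterable<String> words) {
    Map<String, Integer> counts = new HashMap<>();
    for (String word : words) {
        counts.put(word, 0);
    }
    StringBuilder result = new StringBuilder(text.length());
    String[] tokens = text.split("\\s+");
    for (int i = 0; i < tokens.length; i++) {
        if (i > 0) {
            result.append(' ');
        }
        Integer count = counts.get(tokens[i]);
        if (count != null) {                       // O(1) lookup in the map
            result.append("----");
            counts.put(tokens[i], count + 1);
        } else {
            result.append(tokens[i]);
        }
    }
    // 'counts' now holds how many times each word was replaced.
    return result.toString();
}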
I wouldn't care about memory much, but in case you do: trie is your friend. It's memory efficient for large sets and it allows very efficient matching. You may want to implement it in a compressed fashion.
If I understand the problem correctly, you need to read the 85 KB of text and parse out every word (using split or a StringTokenizer). For every word, you need to know whether it is in the set of 500 words, and if so, switch it with the corresponding mask.
If you know you have about 500 words, I'd suggest storing the 500 words and their masks in a HashMap with an initial capacity of about 650 (the JDK doc says hashing is most efficient with a load factor of 0.75). Put the word-mask pairs into the HashMap with a for loop.
The biggest bang for the buck with the HashMap is that the get/put operations (searching for the key) are done in constant time, which is better than O(n) for an array and even O(log(n)) for binary search on a sorted array.
Armed with the HashMap, you can build up a StringBuffer while filtering those 85 KB of text.
Return the StringBuffer's toString() from your method and you are done! Regards, - M.S.
PS If you are building the map at a server and doing the filtering somewhere else (at a client) and need to transport the Dictionary, HashMap won't do - it cannot be serialized. Use a Hashtable in that case. If on the same machine, HashMap is more memory efficient. Later, - M.S.