Memory conscious string filtering - java

Lets say I have 500 words:
Martin
Hopa
Dunam
Golap
Hugnog
Foo
... + 494 more words
I have following text that is about 85KB in total:
Marting went and got him self stuff
from Hopa store and now he is looking
to put it into storage with his best
friend Dunam. They are planing on
using Golap lock that they found in
Hugnog shop in Foo town. >... text continues into several pages
I would like to produce following text:
------- went and got him self stuff
from ---- store and now he is looking
to put it into storage with his best
friend ----. They are planing on
using ---- lock that they found in
------ shop in --- town. >... text continues into several pages
Currently I'm using commons method:
String[] 500words = //all 500 words
String[] maskFor500words = // generated mask for each word
String filteredText = StringUtils.replaceEach(textToBeFiltered, 500words , maskFor500words);
Is there a another way to do this that could be more efficient when it comes to memory and CPU usage?
What is the best storage for the 500 words? File, List, enum, array ...?
How would you get statistics, such as how many and what words were replaced; and for each word how many times it was replaced.

I wouldn't care much apout CPU and memory usage. It should be relatively small for such a problem and such a volume of text.
What I would do is
have a Map containing all the strings as keys, with the numer of times they have been found in the text (initially 0)
read the text word by word, by using a StringTokenizer, or the String.split() method
for each word, find if the map contains it (O(1) operation, very quick)
if it contains it, add "----" to a StringBuilder, and increment the value stored for the word in the map
else add the word itself (with a space before unless it's the first word of the text)
A the end of the process, the StringBuilder contains the result, and the map contains the numer of times each word has been used as a replacement.
Make sure to initialize the STringBuilder with the length of the original text, in order to avoid too many reallocations.
Should be simple and efficient.

I wouldn't care about memory much, but in case you do: trie is your friend. It's memory efficient for large sets and it allows very efficient matching. You may want to implement it in a compressed fashion.

If I understand the problem correctly, you need to read the 85KB of text and parse out every word (use split or StringTokenizer). For every word, you need to know if you have it in the set of 500words, and if so, switch it with the corresponding mask.
If you know you have about 500 words, I'd suggest store the 500 words and their masks in a HashMap with initial capacity of about 650 (JDK doc says hashing is most efficient with a load factor of 0.75). Put in the word-mask pairs in the HashMap with a for loop.
The biggest bang for the buck (HashMap) you get is that the get/put operations (searching for the key) are done in constant time, which is better than O(n) in array and even O(log(n)) if you do binary search on sorted array.
Armed with the HashMap, you can build up a SringBuffer while filtering those 85KB of text.
Return the String.toString() from your method and you are done! Regards, - M.S.
PS If you are building the map at a server and doing the filtering somewhere else (at a client) and need to transport the Dictionary, HashMap won't do - it cannot be serialized. Use a Hashtable in that case. If on the same machine, HashMap is more memory efficient. Later, - M.S.

Related

Optimising checking for Strings in a word list (Java)

I have a text file containing ~30,000 words in alphabetical order each on a separate line.
I also have a Set<String> set containing ~10 words.
I want to check if any of the words in my set are in the word list (text file).
So far my method has been to:
Open the word list text file
Read a line/word
Check if set contains that word
Repeat to the end of the word list file
This seems badly optimised. For example if I'm checking a word in my set that begins with the letter b I see no point in checking words in the text file beggining with a & c, d, .. etc.
My proposed solution would be to separate the text file into 26 files, one file for words which start with each letter of the alphabet. Is there a more efficient solution than this?
Note: I know 30,000 words isn't that large a word list but I have to do this operation many times on a mobile device so performance is key.
You can further your approach of using Hash Sets onto the entire wordlist file. String comparisons are expensive so its better to create a HashSet of Integer. You should read the wordlist (assuming words will not increase from 30,000 to something like 3 million) once in its entirety and save all the words in an Integer Hashset. When adding into the Integer Hashset use:
wordListHashSet.add(mycurrentword.hashcode());
You have mentioned that you have a string hash of 10 words that must be checked if its in the wordlist. Again instead of String Hash, create an Integer Hash Set.
Create an iterator of this Integer Hash Set.
Iterator it = myTenWordsHashSet.iterator();
Iterate over this in a loop and check for the following condition:
wordListHashSet.contains(it.next());
If this is true, then you have the word in the wordlist.
Using Integer Hash Maps is good idea when performance is what you are looking for. Internally Java processes the hash of each string and stores it in the memory such that repeated access to such strings is blazing fast, faster than binary search with search complexities of O(log n) to almost O(1) for each call to an element in the wordlist.
Hope that helps!
It's probably not worth the hassle for 30,000 words, but let's just say you have a lot more, like say 300,000,000 words, and still only 10 words to look for.
In that case, you could do a binary search in the large file for each of the search words, using Random Access Files.
Obviously, each searching step would require you to first to find the beginning of the word (or the next word, implementation dependend), which makes it a lot more difficult, and cutting out all the corner cases exceeds the limit of code one could provide here. But still it could be done and would surely be faster than reading through all of 300,000,000 words once.
You might consider iterating through your 10 word set (maybe parse it from the file into an array), and for each entry, using a binary search algorithm to see if it's contained in the larger list. Binary search should only take O(logN), so in this case log(30,000) which is significantly faster that 30,000 steps.
Since you'll repeat this step once for every word in your set, it should take 10*log(30k)
You can make some improvements depending on your needs.
If for example the file remains unchanged but your 10-words Set changes regularly, then you can load the file on another Set (HashSet). Now you just need to search for a match on this new Set. This way your search will always be O(1).

How to count string num with limit memory?

The task is to count the num of words from a input file.
the input file is 8 chars per line, and there are 10M lines, for example:
aaaaaaaa
bbbbbbbb
aaaaaaaa
abcabcab
bbbbbbbb
...
the output is:
aaaaaaaa 2
abcabcab 1
bbbbbbbb 2
...
It'll takes 80MB memory if I load all of words into memory, but there are only 60MB in os system, which I can use for this task. So how can I solve this problem?
My algorithm is to use map<String,Integer>, but jvm throw Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. I know I can solve this by setting -Xmx1024m, for example, but I want to use less memory to solve it.
I believe that the most robust solution is to use the disk space.
For example you can sort your file in another file, using an algorithm for sorting large files (that use disk space), and then count the consecutive occurrences of the same word.
I believe that this post can help you. Or search by yourself something about external sorting.
Update 1
Or as #jordeu suggest you can use a Java embedded database library: like H2, JavaDB, or similars.
Update 2
I thought about another possible solution, using Prefix Tree. However I still prefer the first one, because I'm not an expert on them.
Read one line at a time
and then have e.g. a HashMap<String,Integer>
where you put your words as key and the count as integer.
If a key exists, increase the count. Otherwise add the key to the map with a count of 1.
There is no need to keep the whole file in memory.
I guess you mean the number of distinct words do you?
So the obvious approach is to store (distinctive information about) each different word as a key in a map, where the value is the associated counter. Depending on how many distinct words are expected, storing all of them may even fit into your memory, however not in the worst case scenario when all words are different.
To lessen memory needs, you could calculate a checksum for the words and store that, instead of the words themselves. Storing e.g. a 4-byte checksum instead of an 8-character word (requiring at least 9 bytes to store) requires 40M instead of 90M. Plus you need a counter for each word too. Depending on the expected number of occurrences for a specific word, you may be able to get by with 2 bytes (for max 65535 occurrences), which requires max 60M of memory for 10M distinct words.
Update
Of course, the checksum can be calculated in many different ways, and it can be lossless or not. This also depends a lot on the character set used in the words. E.g. if only lowercase standard ASCII characters are used (as shown in the examples above), we have 26 different characters at each position. Consequently, each character can be losslessly encoded in 5 bits. Thus 8 characters fit into 5 bytes, which is a bit more than the limit, but may be dense enough, depending on the circumstances.
I suck at explaining theoretical answers but here we go....
I have made an assumption about your question as it is not entirely clear.
The memory used to store all the distinct words is 80MB (the entire file is bigger).
The words could contain non-ascii characters (so we just treat the data as raw bytes).
It is sufficient to read over the file twice storing ~ 40MB of distinct words each time.
// Loop over the file and for each word:
//
// Compute a hash of the word.
// Convert the hash to a number by some means (skip if possible).
// If the number is odd then skip to the next word.
// Use conventional means to store the distinct word.
//
// Do something with all the distinct words.
Then repeat the above a second time using even instead of odd.
Then you have divided the task into 2 and can do each separately.
No words from the first set will appear in the second set.
The hash is necessary because the words could (in theory) all end with the same letter.
The solution can be extended to work with different memory constraints. Rather than saying just odd/even we can divide the words into X groups by using number MOD X.
Use H2 Database Engine, it can work on disc or on memory if it's necessary. And it have a really good performance.
I'd create a SHA-1 of each word, then store these numbers in a Set. Then, of course, when reading a number, check the Set if it's there [(not totally necessary since Set by definition is unique, so you can just "add" its SHA-1 number also)]
Depending on what kind of character the words are build of you can chose for this system:
If it might contain any character of the alphabet in upper and lower case, you will have (26*2)^8 combinations, which is 281474976710656. This number can fit in a long datatype.
So compute the checksum for the strings like this:
public static long checksum(String str)
{
String tokes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
long checksum = 0;
for (int i = 0; i < str.length(); ++i)
{
int c = tokens.indexOf(str.charAt(i));
checksum *= tokens.length();
checksum += c;
}
return checksum;
}
This will reduce the taken memory per word by more than 8 bytes. A string is an array of char, each char is in Java 2 bytes. So, 8 chars = 16 bytes. But the string class contains more data than only the char array, it contains some integers for size and offset as well, which is 4 bytes per int. Don't forget the memory pointer to the Strings and char arrays as well. So, a raw estimation makes me think that this will reduce 28 bytes per word.
So, 8 bytes per word and you have 10 000 000 words, gives 76 MB. Which is your first wrong estimation, because you forgot all the things I noticed. So this means that even this method won't work.
You can convert each 8 byte word into a long and use TLongIntHashMap which is quite a bit more efficient than Map<String, Integer> or Map<Long, Integer>
If you just need the distinct words you can use TLongHashSet
If you can sort your file first (e.g. using the memory-efficient "sort" utility on Unix), then it's easy. You simply read the the sorted items, counting the neighboring duplicates as you go, and write the totals to a new file immediately.
If you need to sort using Java, this post might help:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
You can use constant memory by reading your file multiple times.
Basic idea:
Treat the file as n partitions p_1...p_n, sized so that you can load each of them into ram.
Load p_i into a Map structure, scan through the whole file and keep track of counts of the p_i elements only (see answer of Heiko Rupp)
Remove element if we encounter the same value in a partition p_j with j smaller i
Output result counts for elements in the Map
Clear Map, repeat for all p_1...p_n
As in any optimization, there are tradeoffs. In your case, you can do the same task with less memory but it comes at the cost of increasing runtime.
Your scarce resource is memory, so you can't store the words in RAM.
You could use a hash instead of the word as other posts mention, but if your file grows in size this is no solution, since at some point you'll run into the same problem again.
Yes, you could use an external web server to crunch the file and do the job for your client app, but reading your question it seems that you want to do all the thing in one (your app).
So my proposal is to iterate over the file, and for each word:
If the word was found for first time, write the string to a result file together with the integer value 1.
If the word was processed before (it will appear in the result file), increment the record value.
This solution scales well no matter the number of lines of your input file nor the length of the words*.
You can optimize the way you do the writes in the output file, so that the search is made faster, but the basic version described above is enough to work.
EDIT:
*It scales well until you run out of disk space XD. So the precondition would be to have a disk with at least 2N bytes of free usable space, where N is the input file size in bytes.
possible solutions:
Use file sorting and then just count the consequent occurences of each value.
Load the file in a database and use a count statement like this: select value, count(*) from table group by value

java- storing string values- which is most efficient- linked list, array list or hashmap

In a java application, I have a requirement where a user will define a string value, and then keep on appending further string values to original value...
There can be multiple different named strings defined by the user..
From hashmap, array list and linked list, which one should I use on the basis of following criteria:
(1) Most memory efficient
(2) Max possible space per single string value
Also what is the max possible size of a single string value in all 3 options (hashmap/array list/linked list) ?
If the user is entering the string, you shouldn't need to worry. The maximum String length is over 2 billion.
The fastest typing speed ever, 216 words per minute,
http://en.wikipedia.org/wiki/Words_per_minute
This means even a fast typist will take a minute to write 1 K of letters. To write one String which is maximum length will take 1491 days, non stop. (Assuming their keyboard, computer, or the user does die in the attempt)
It is highly unlikely you need to most efficient data structure and using the simplest and most obvious choice is a better approach. (Again because users cannot type fast enough for it to ever matter)
A Kindle can store thousands of books in a device which costs less than 100 pounds. A user could write all their live and not write enough to fill up a small, cheap mobile device.
Save your time and use StringBuilder or StringBuffer (if you need thread safety).
You will need ArrayList<Stringbuffer>
If you are creating a text editor where the user can jump to anywhere in the string and start changing it, a gap buffer is a fairly good data structure: http://en.wikipedia.org/wiki/Gap_buffer

How to efficiently search on a String

I have a text with about 300 - 500 words. Also i got about 200k keywords and i want to know if each of the keywords is contained in the text. A String contains ist quite slow, is there some way to preprocess the String?
I thought about using a SuffixTree but im not sure this is the best choice.
Also, are there any good librarys for this task? semanticdiscoverytoolkit for example has a suffixtree implementation but after adding the string i cant figure out how to look up if a string is contained in the tree.
Greetings,
Nico
you can try the rabin-karp string search algorithm. since you are doing mostly hash (integer) comparisons, the performance is much better than string comparisons.
compute the hash of the keyword
compute the rolling hash of the text
compare these 2 hashes. if they match, perform the actual string comparison.
advance the position by 1 character and repeat from step 2 until you reach the end of the text.
as a analogy, the rolling hash is like a "sliding window" that scrolls along the text. the hash comparison is done using the hash of the substring in the "sliding window" against the hash of the keyword.
You can use StringTokenizer to get each of the words then populate a hashmap which you check afterwards. This requires going through each list only once. Lookup times should then be very fast which is important given the amount of keywords you have.
It may be worth it to profile this method against something like Lucene.

Counting repeated words in a file

Goal: to find count of all words in a file. file contains 1000+ words
My approach: use a HashMap<String,Integer>() to store and count the number of times each word appears in the file.
Question:
Would a HashMap() be the best way or would it be better to use a Binary Tree for ensuring faster lookup as there is a large count of words in the file?
Or is there a better way to do this?
HashMap would result in a lot of memory overhead which is not desired.
So you're looking for distinct words?
The most efficient structure I can think of is a Trie
Here's one open source implementation: Google Code patricia-trie
Although I tend to agree with Mitch Wheat -- It sounds like a HashMap should work fine (It's always best to avoid premature optimization... so you should use a HashMap until you've shown that it's a bottleneck)
1000 - 10000 words is very small.
A Hashmap will be fine.
I would recommend doing such a task in Perl/PHP. It's very hard to kill a fly with a machine gun.
A HashMap is perfect. You need to store
a copy of each word encountered
the count for each
A HashMap really won't store much more than that!
Assuming that the strings are not insanely long, a "Trie" approach as Michael suggest would be good. The node in the Trie can store the character and the "count" of the strings that end with that character. This should drastically reduce the storage requirements (again assuming the strings are uniformly distributed and overlapping)
Assuming that the counts are not to be persisted across invocations, while using a HashMap, let the Map be from Integer => Integer - where the "key" is the hashcode of the string and value the count. This should be a efficient solution - with fast lookup and reduced memory foot print.

Categories

Resources