Data structure for soundex algorithm? - java

Can anyone suggest a data structure to use for a Soundex algorithm program? The language to be used is Java. Has anybody worked on this in Java before? The program should have these features:
be able to read about 50,000 words
be able to read a word and return the related words having the same Soundex code
I don't want the program implementation, just some advice on what data structure to use.

TIP: If you use SQL as a data backend, you can let SQL handle it with the two SQL functions SOUNDEX and DIFFERENCE.
Maybe not what you wanted, but many people do not know that MS SQL has those two functions.

Well, Soundex can be implemented in a straightforward pass over a string, so that doesn't require anything special.
After that, the 4-character code can be treated as an integer key.
Then just build a dictionary that stores word sets indexed by that integer key. 50,000 words should easily fit into memory, so nothing fancy is required.
Then walk the dictionary: each bucket is a group of similar-sounding words.
Actually, here is the whole program in Perl:
#!/usr/bin/perl
use Text::Soundex;
use Data::Dumper;

# Build a hash of word lists keyed by Soundex code.
open(DICT, "</usr/share/dict/linux.words") or die "cannot open dictionary: $!";
my %dictionary = ();
while (<DICT>) {
    chomp();
    push @{$dictionary{soundex($_)}}, $_;
}
close(DICT);

# For each word on stdin, print the words sharing its Soundex code.
while (<>) {
    my @words = split / +/;
    foreach (@words) {
        print Dumper $dictionary{soundex($_)};
    }
}

I believe you just need to convert the original strings into Soundex keys and use them as hashtable keys; the value for each entry in the table would be a collection of the original strings mapping to that Soundex key.
The Multimap collection interface (and its implementations) in Google Collections (now Guava) would be useful to you.
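A minimal sketch of that idea, assuming Apache Commons Codec for the Soundex encoding and Guava for the Multimap:

import java.util.Set;
import com.google.common.collect.HashMultimap;
import com.google.common.collect.SetMultimap;
import org.apache.commons.codec.language.Soundex;

class SoundexIndex {
    private final Soundex soundex = new Soundex();
    // Each Soundex code maps to the set of words that share it.
    private final SetMultimap<String, String> index = HashMultimap.create();

    void add(String word) {
        index.put(soundex.soundex(word), word);
    }

    Set<String> similar(String word) {
        return index.get(soundex.soundex(word));
    }
}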

import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SpellChecker {
    interface Hash {
        String hash(String word);
    }

    private final Hash hash;
    private final Map<String, Set<String>> collisions;

    SpellChecker(Hash hash) {
        this.hash = hash;
        collisions = new TreeMap<String, Set<String>>();
    }

    boolean addWord(String word) {
        String key = hash.hash(word);
        Set<String> similar = collisions.get(key);
        if (similar == null)
            collisions.put(key, similar = new TreeSet<String>());
        return similar.add(word);
    }

    Set<String> similar(String word) {
        Set<String> similar = collisions.get(hash.hash(word));
        if (similar == null)
            return Collections.emptySet();
        else
            return Collections.unmodifiableSet(similar);
    }
}
The hash strategy could be Soundex, Metaphone, or what have you. Some strategies might be tunable (how many characters they output, etc.).
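Hypothetical usage, assuming Apache Commons Codec supplies the Soundex strategy:

import org.apache.commons.codec.language.Soundex;

public class SpellCheckerDemo {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        // Hash has a single abstract method, so a method reference fits it.
        SpellChecker checker = new SpellChecker(soundex::soundex);
        checker.addWord("Robert");
        checker.addWord("Rupert");
        System.out.println(checker.similar("Robert"));  // [Robert, Rupert]
    }
}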

Since soundex is a hash, I'd use a hash table, with the soundex as the key.

You want a 4-byte integer.
The Soundex algorithm always returns a 4-character code; if you use ANSI inputs, you'll get 4 bytes back (represented as 4 letters).
So store the codes returned in a hashtable, convert your word to the code, and look it up in the hashtable. It's really that easy.
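A sketch of that packing idea (the class and method names are hypothetical):

class SoundexPacker {
    // Pack a 4-character Soundex code such as "R163" into a single int,
    // one byte per character, so it can be used as a compact integer key.
    static int pack(String code) {
        return (code.charAt(0) << 24) | (code.charAt(1) << 16)
             | (code.charAt(2) << 8) | code.charAt(3);
    }
}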

Related

How to use an associative array as a key in another associative array in PHP (just like using a HashMap as a key in another HashMap in Java)

In Java, we can use a HashMap as a key in another HashMap. I'm using an associative array as a map in PHP. Now there is a need to store an associative array as a key in another associative array.
I asked ChatGPT and it presented a lengthy solution:
Suppose $map is an array that I want to use as a key:
ksort($map);
$key = serialize($map);
if (!isset($main[$key])) {
    $main[$key] = 0;
}
$main[$key]++;
I'm running the above code in a loop where $map is:
on the first iteration: [a=>1, b=>2, c=>3]
on the second iteration: [b=>2, a=>1, c=>3]
After two iterations, $main looks like:
$main["serialized---key"] -> 2
Yes, I need to use ksort because the next $map could contain the same array but with the keys in a different order.
The above solution works fine, but it will slow down drastically on large inputs. I need a better way where I don't have to use ksort and serialization.
I also used spl_object_hash instead of serialize, but it didn't work. Please suggest an optimal approach, just like HashMap in Java.
Also, I used the SplObjectStorage class by type-casting the array into an object, but it gives incorrect results.
Detailed problem:
What am I actually trying to do?
I'm solving the following problem:
Given an array of strings strs, group the anagrams together. You can return the answer in any order.
An Anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once.
Example 1:
Input: strs =
["eat","tea","tan","ate","nat","bat"]
Output:
[["bat"],["nat","tan"],["ate","eat","tea"]]
My working code:
In short, I'm just grouping the strings based on their frequency map.
function groupAnagrams(array $arr) {
    $main = [];
    $ret = [];
    for ($i = 0; $i < count($arr); $i++) {
        $el = $arr[$i];
        // Build a character-frequency map for this word.
        $map = [];
        for ($j = 0; $j < strlen($el); $j++) {
            if (!isset($map[$el[$j]])) {
                $map[$el[$j]] = 0;
            }
            $map[$el[$j]]++;
        }
        // Canonicalize the map so equal multisets serialize identically.
        ksort($map);
        $key = serialize($map);
        if (!isset($main[$key])) {
            $main[$key] = [];
        }
        $main[$key][] = $el;
    }
    //return $main;
    foreach ($main as $key => $val) {
        $ret[] = $val;
    }
    return $ret;
}
Here is the problem link: https://leetcode.com/problems/group-anagrams/
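For comparison, here is a minimal sketch of the Java-style approach the question alludes to. Instead of serializing a sorted frequency map, it uses the word's sorted characters as the key, which avoids ksort and serialize entirely (the class name is illustrative):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupAnagrams {
    // Anagrams share the same sorted character sequence, so the
    // sorted string can act directly as the grouping key.
    static List<List<String>> groupAnagrams(String[] strs) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String s : strs) {
            char[] chars = s.toCharArray();
            Arrays.sort(chars);
            String key = new String(chars);  // e.g. "eat" -> "aet"
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(s);
        }
        return new ArrayList<>(groups.values());
    }
}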

convert a string into something reversible, in Java

I have a lot of URLs that serve as keys in an HBase table. Since they "all" start with http://, HBase puts them in the same node. Thus I end up with one node at 100%+ load and the others idle.
So, I need to map each URL to something hash-like, but reversible. Is there any simple, standard, and fast way to do that in Java 8?
I'm looking for a random (uniform) distribution of prefixes.
Note:
reversing the URL is not interesting, since a lot of URLs end with /, ?, or =, and that risks unbalancing the distribution.
I do not need encryption, but I can accept it.
I do not look for compression, but it is welcome if possible :)
Thanks,
Costin
There's not a single, standard way.
One thing you can do is to prefix the key with its hash. Something like:
a01cc0fe http://...
That's easily reversible (just snip off the hash chars, which you can make a fixed length) and will get you good distribution.
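A minimal sketch of that idea, using a fixed-width 8-character hex prefix as in the example above (method names are illustrative):

class HashPrefixKey {
    // Prepend a fixed-width hex rendering of the hash code so keys spread
    // evenly across nodes, while the original URL stays fully recoverable.
    static String encode(String url) {
        return String.format("%08x", url.hashCode()) + url;
    }

    static String decode(String key) {
        return key.substring(8);  // the prefix width is fixed, so just snip it off
    }
}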
The hash code for a string is stable and consistent across JVMs. The algorithm for computing it is specified in String.hashCode's documentation, so you can consider it part of the contract of how a String works.
Add a prefix of the hash code, encoded as a base-36 number [0-9a-z].
public static String encode(String s) {
    // Mask to 24 bits so the base-36 prefix stays short and non-negative.
    return Integer.toString(s.hashCode() & 0xffffff, 36) + "#" + s;
}

public static String decode(String s) {
    // The prefix never contains '#', so strip everything up to the first one.
    return s.replaceFirst("^[^#]*#", "");
}
sample:
http://google.com/ <-> 5o07l#http://google.com/

Encode String (rfc4122) to Number in Java, decode in PHP

In my use case, a JavaScript tracker generates a unique ID for a visitor whenever he/she visits the site, using the following function:
function generateUUID() {
    return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
        var r = Math.random()*16|0, v = c == 'x' ? r : (r&0x3|0x8);
        return v.toString(16);
    });
}
It generates strings like this (rfc4122):
"3314891e-285e-40a7-ac59-8b232863bead"
Now I need to encode that string as a number (e.g. a BigInteger in Java) that can be read by Mahout, and likewise restore it (in PHP) to display results. Is there any fast, consistent, and reliable way to do that?
Some solutions are:
Mapping each possible char (alphanumeric + '-') to a number [1..M] and summing each char position accordingly
get 2 longs from md5 hash
keep a hash map in memory
Any ideas appreciated!
If Mahout can use a compound ID of two longs, you can use:
UUID uuid = UUID.fromString(string);
long l1 = uuid.getMostSignificantBits();
long l2 = uuid.getLeastSignificantBits();
If you really are stuck with one long, then I'd agree with your idea to use a portion of a hash based on the entire UUID.
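A quick sketch of the round trip on the Java side; since the two longs fully determine the UUID, decoding is exact:

import java.util.UUID;

public class UuidRoundTrip {
    public static void main(String[] args) {
        UUID uuid = UUID.fromString("3314891e-285e-40a7-ac59-8b232863bead");
        long l1 = uuid.getMostSignificantBits();
        long l2 = uuid.getLeastSignificantBits();
        // Reassembling the two halves restores the original string exactly.
        String restored = new UUID(l1, l2).toString();
        System.out.println(restored.equals(uuid.toString()));  // true
    }
}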

Is there an equivalent to Java's String intern function in Go?

Is there an equivalent to Java's String intern function in Go?
I am parsing a lot of text input that has repeating patterns (tags). I would like to be memory efficient about it and store pointers to a single string for each tag, instead of multiple strings for each occurrence of a tag.
No such function exists that I know of. However, you can make your own very easily using maps. The string type itself is a uintptr and a length. So, a string assigned from another string takes up only two words. Therefore, all you need to do is ensure that there are no two strings with redundant content.
Here is an example of what I mean.
type Interner map[string]string

func NewInterner() Interner {
    return Interner(make(map[string]string))
}

func (m Interner) Intern(s string) string {
    if ret, ok := m[s]; ok {
        return ret
    }
    m[s] = s
    return s
}
This code will deduplicate redundant strings whenever you do the following:
str = interner.Intern(str)
EDIT: As jnml mentioned, my answer could pin memory depending on the string it is given. There are two ways to solve this problem. Both of these should be inserted before m[s] = s in my previous example. The first copies the string twice, the second uses unsafe. Neither are ideal.
Double copy:
b := []byte(s)
s = string(b)
Unsafe (use at your own risk. Works with current version of gc compiler):
b := []byte(s)
s = *(*string)(unsafe.Pointer(&b))
I think that, for example, Pool and GoPool may fulfill your needs. That code solves one thing which Stephen's solution ignores: in Go, a string value may be a slice of a bigger string. There are scenarios where that doesn't matter and scenarios where it is a show-stopper. The linked functions attempt to be on the safe side.

Working with huge text files in Java

I was given an English vocabulary assignment by my teacher.
Choose a random letter, say 'a'.
Write a word starting with that letter, say 'apple'.
Take its last letter, 'e'.
Write a word starting with 'e', say 'elephant'. Now from 't', and so on. No repetition allowed.
Make a list of 500 words.
Mail the list to the teacher. :)
So instead of doing it myself, I am working on Java code which will do my homework for me.
The code seems to be simple.
The core of the algorithm:
Pick a random word from a dictionary which satisfies the requirement, using seek() with RandomAccessFile, and try to put it in a Set with ordering (maybe a LinkedHashSet).
But the problem is the huge size of the dictionary, with 300,000+ entries. :|
Brute-force random algorithms won't work.
What could be the best, quickest, and most efficient way out?
UPDATE: Now that I have written the code and it's working, how can I make it efficient so that it chooses common words?
Are there any text files around containing lists of common words?
Either look for a data structure allowing you to keep a compacted dictionary in memory, or simply give your process more memory. Three hundred thousand words is not that much.
Hope this doesn't spoil your fun or something, but if I were you I'd take this approach..
Pseudo java:
abstract class Word {
    String word;
    abstract char last();
    abstract char first();
}

abstract class DynamicDictionary {
    Map<Character, Set<Word>> first_indexed;

    Word removeNext(Word word) {
        Set<Word> candidates = first_indexed.get(word.last());
        return removeRandom(candidates);
    }

    /**
     * Remove a random word from the entire dictionary.
     */
    abstract Word removeRandom();

    /**
     * Remove and return a random word from the set provided.
     */
    abstract Word removeRandom(Set<Word> wordset);
}
and then
Word primer = dynamicDictionary.removeRandom();
List<Word> list = new ArrayList<Word>(500);
list.add(primer);
Word cur = primer;
for (int i = 0; i < 499; i++) {
    cur = dynamicDictionary.removeNext(cur);
    list.add(cur);
}
NOTE: Not intended to be viewed as actual Java code, just a way to roughly explain the approach (no error handling, not a good class structure if it were really used, no encapsulation, etc.).
Should I encounter memory issues, maybe I'll do this:
abstract class Word {
    int lineNumber;
    abstract char last();
    abstract char first();
}
If that is not sufficient, I guess I'll use a binary search on the file, or put it in a DB, etc.
I think one way to do this could be to use a TreeSet: put the whole dictionary in it, then use the subSet method to retrieve all the words beginning with the desired letter, and pick randomly from that subset.
But in my opinion, the best way to do this, given the quantity of data, would be to use a database with SQL queries instead of Java.
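A minimal sketch of the TreeSet idea: because the set is sorted, every word starting with 'e' falls in the half-open range ["e", "f"):

import java.util.SortedSet;
import java.util.TreeSet;

public class SubSetDemo {
    public static void main(String[] args) {
        TreeSet<String> dict = new TreeSet<>();
        dict.add("apple");
        dict.add("eagle");
        dict.add("elephant");
        dict.add("tiger");
        // subSet is a live view of the set; picking randomly from it is cheap.
        SortedSet<String> eWords = dict.subSet("e", "f");
        System.out.println(eWords);  // [eagle, elephant]
    }
}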
If I do this:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

class LoadWords {
    public static void main(String... args) {
        try {
            Scanner s = new Scanner(new File("/usr/share/dict/words"));
            ArrayList<String> ss = new ArrayList<String>();
            while (s.hasNextLine())
                ss.add(s.nextLine());
            System.out.format("Read %d words\n", ss.size());
        } catch (FileNotFoundException e) {
            e.printStackTrace(System.err);
        }
    }
}
I can run it with java -mx16m LoadWords, which limits the Java heap size to 16 Mb, which is not that much memory for Java. My /usr/share/dict/words file has approximately 250,000 words in it, so it may be a bit smaller than yours.
You'll need to use a different data structure than the simple ArrayList<String> that I've used. Perhaps a HashMap of ArrayList<String>, keyed on the starting letter of the word, would be a good starting choice.
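A quick sketch of that structure (names are illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordBuckets {
    // Bucket the word list by first letter, so "give me a word
    // starting with X" becomes a single map lookup.
    static Map<Character, List<String>> byFirstLetter(List<String> words) {
        Map<Character, List<String>> buckets = new HashMap<>();
        for (String word : words) {
            if (word.isEmpty()) continue;
            buckets.computeIfAbsent(word.charAt(0), c -> new ArrayList<>()).add(word);
        }
        return buckets;
    }
}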
Here are some word frequency lists:
http://www.robwaring.org/vocab/wordlists/vocfreq.html
This text file, reachable from the above link, contains the first 2000 words that are used most frequently:
http://www.robwaring.org/vocab/wordlists/1-2000.txt
The goal is to increase your English language vocabulary - not to increase your computer's English language vocabulary.
If you do not share this goal, why are you (or your parents) paying tuition?
