Detecting if a word is valid when it contains a blank - java

I'm working on a phone based word game, and there could potentially be quite a few blanks (representing any letter) that a player could have the option to use.
I store all the possible words in a hashSet, so detecting if a word is valid when it has one blank is simply a matter of looping through the alphabet replacing the blank with a letter and testing the word. I have a recursive call so this will work with any number of blanks. The code is as follows:
public boolean isValidWord(String word) {
if (word.contains(" ")){
for (char i = 'A'; i <= 'Z'; i++) {
if (isValidWord(word.replaceFirst(" ", Character.toString(i))))
return true;
}
return false;
}
else
return wordHashSet.contains(word);
}
As the number of blanks increases, the number of words we have to test increases exponentially. By the time we get to 3 blanks we're having to do 17576 (26^3) lookups before we can reject a word, and this is affecting game play. Once there are 4 blanks the game will just freeze for a while.
What is the most efficient way for me to check words with multiple blanks? Should I just iterate through the HashSet and check if we have a match against each word? If so, then what's the fastest way for me to compare two strings taking the blanks into account? I've tried doing this using a regular expression and String.matches(xx), but it's too slow. A straight String.equals(xx) is fast enough, but that obviously doesn't take blanks into account.

A very fast method, although somewhat challenging to implement, would be to store your words in a trie: http://en.wikipedia.org/wiki/Trie
A trie is a tree structure that contains a char in every node and an array of pointers pointing to next nodes.
Without blank spaces it would be easy - just follow the trie structure, you can check this in linear time. When you have a blank, you will have a loop to search all possible routes.
This can sound complicated and difficult if you are not familiar with tries but if you get stuck I can help you with some code.
EDIT:
OK, here is some C# code for your problem using tries. I think you will have no problems converting it to Java. If you do, leave a comment and I will help.
Trie.cs
public class Trie
{
private char blank = '_'; // the question represents a blank as ' '; change this to match
public Node Root { get; set; }
public void Insert(String key)
{
Root = Insert(Root, key, 0);
}
public bool Contains(String key)
{
Node x = Find(Root, key, 0);
return x != null && x.NullNode;
}
private Node Find(Node x, String key, int d)
{ // Return value associated with key in the subtrie rooted at x.
if (x == null)
return null;
if (d == key.Length)
{
if (x.NullNode)
return x;
else
return null;
}
char c = key[d]; // Use dth key char to identify subtrie.
if (c == blank)
{
foreach (var child in x.Children)
{
var node = Find(child, key, d + 1);
if (node != null)
return node;
}
return null;
}
else
return Find(x.Children[c], key, d + 1);
}
private Node Insert(Node x, String key, int d)
{ // Change value associated with key if in subtrie rooted at x.
if (x == null) x = new Node();
if (d == key.Length)
{
x.NullNode = true;
return x;
}
char c = key[d]; // Use dth key char to identify subtrie.
x.Children[c] = Insert(x.Children[c], key, d + 1);
return x;
}
public IEnumerable<String> GetAllKeys()
{
return GetKeysWithPrefix("");
}
public IEnumerable<String> GetKeysWithPrefix(String pre)
{
Queue<String> q = new Queue<String>();
Collect(Find(Root, pre, 0), pre, q);
return q;
}
private void Collect(Node x, String pre, Queue<String> q)
{
if (x == null) return;
if (x.NullNode) q.Enqueue(pre);
for (int c = 0; c < 256; c++)
Collect(x.Children[c], pre + ((char)c), q);
}
}
Node.cs
public class Node
{
public bool NullNode { get; set; }
public Node[] Children { get; set; }
public Node()
{
NullNode = false;
Children = new Node[256];
}
}
Sample usage:
Trie tr = new Trie();
tr.Insert("telephone");
while (true)
{
string str = Console.ReadLine();
if( tr.Contains( str ) )
Console.WriteLine("contains!");
else
Console.WriteLine("does not contain!");
}
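For reference, here is a rough Java sketch of the same idea. It is not a line-by-line port of the C# above: the class name BlankTrie is made up for illustration, and the 26-letter upper-case alphabet and the space character as the blank are assumptions taken from the code in the question.

```java
// Sketch: a trie over the letters A-Z in which ' ' in a query matches any letter.
class BlankTrie {
    private static final char BLANK = ' '; // assumption: blanks are spaces, as in the question

    private static final class Node {
        final Node[] children = new Node[26];
        boolean isWord; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    /** Adds a word; assumes it contains only the letters A-Z. */
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int idx = word.charAt(i) - 'A';
            if (node.children[idx] == null) {
                node.children[idx] = new Node();
            }
            node = node.children[idx];
        }
        node.isWord = true;
    }

    /** Returns true if some stored word matches, treating blanks as wildcards. */
    boolean contains(String pattern) {
        return matches(root, pattern, 0);
    }

    private boolean matches(Node node, String pattern, int depth) {
        if (node == null) return false;
        if (depth == pattern.length()) return node.isWord;
        char c = pattern.charAt(depth);
        if (c == BLANK) {
            // A blank may stand for any letter: try every existing branch.
            for (Node child : node.children) {
                if (matches(child, pattern, depth + 1)) return true;
            }
            return false;
        }
        if (c < 'A' || c > 'Z') return false; // outside the assumed alphabet
        return matches(node.children[c - 'A'], pattern, depth + 1);
    }

    public static void main(String[] args) {
        BlankTrie trie = new BlankTrie();
        for (String w : new String[] {"APPLE", "AMPLY", "TELEPHONE"}) {
            trie.insert(w);
        }
        System.out.println(trie.contains("A PL ")); // true (APPLE or AMPLY)
        System.out.println(trie.contains("A PL"));  // false (no 4-letter word stored)
    }
}
```

Only branches that actually exist in the trie are explored, so a blank costs far less than 26 full HashSet lookups once the prefix has narrowed the candidates.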

A straight String.equals(xx) is fast enough, but that obviously
doesn't take blanks into account.
So I recommend implementing this simple solution, which is very close to String.equals() and takes blanks into account:
public boolean isValidWord(String word) {
if (wordHashSet.contains(word)) {
return true;
}
for (String fromHashSet: wordHashSet){
if (compareIgnoreBlanks(fromHashSet, word)) {
return true;
}
}
return false;
}
/**
* Inspired by String.compareTo(String). Compares two Strings, ignoring blanks in the String given as
* second argument.
*
* @param s1
* String from the HashSet
* @param s2
* String with potential blanks
* @return true if s1 and s2 match, false otherwise
*/
public static boolean compareIgnoreBlanks(String s1, String s2) {
int len = s1.length();
if (len != s2.length()) {
return false;
}
int k = 0;
while (k < len) {
char c1 = s1.charAt(k);
char c2 = s2.charAt(k);
if (c2 != ' ' && c1 != c2) {
return false;
}
k++;
}
return true;
}

public boolean isValidWord(String word) {
word = word.replaceAll(" ", "[A-Z]"); // the question's word list is upper case (it loops 'A' to 'Z')
Pattern pattern = Pattern.compile(word);
for (String wordFromHashSet: hashSet){
Matcher matcher = pattern.matcher(wordFromHashSet);
if (matcher.matches()) return true;
}
return false;
}

public boolean isValidWord(String word) {
ArrayList<Integer> pos = new ArrayList<Integer>();
for (int i=0; i!=word.length();i++){
if (word.charAt(i) == ' ') pos.add(i);
}
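for (String hashSetWord: hashSet){
if (hashSetWord.length() != word.length()) continue; // skip other lengths (also keeps substring in bounds)
for (Integer i: pos){
hashSetWord = hashSetWord.substring(0,i)+" "+hashSetWord.substring(i+1);
}
if (hashSetWord.equals(word)) return true;
}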
return false;
}

A kind of ugly, but I would guess fairly fast method would be to create a string containing all valid words like this:
WORD1
WORD2
WORD3
etc.
Then use a regex like (^|\n)A[A-Z]PL[A-Z]\n (i.e. replacing all blanks with [A-Z]), and match it on that string.
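A minimal Java sketch of this approach might look like the following. The class name JoinedWordList is hypothetical, and instead of the (^|\n) ... \n form it uses ^ and $ with Pattern.MULTILINE, which should serve the same purpose of anchoring the match to a whole line.

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch: all valid words joined into one newline-separated string,
// searched with a single regex in which every blank becomes [A-Z].
class JoinedWordList {
    private final String joined;

    JoinedWordList(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            sb.append(w).append('\n'); // one word per line
        }
        joined = sb.toString();
    }

    boolean isValidWord(String word) {
        StringBuilder regex = new StringBuilder("^");
        for (char c : word.toCharArray()) {
            // Letters are quoted literally; a blank matches any upper-case letter.
            regex.append(c == ' ' ? "[A-Z]" : Pattern.quote(String.valueOf(c)));
        }
        regex.append('$');
        // MULTILINE makes ^ and $ match at line boundaries inside the big string.
        return Pattern.compile(regex.toString(), Pattern.MULTILINE)
                .matcher(joined)
                .find();
    }
}
```

Whether this beats the other approaches depends on the size of the word list, so it is worth measuring on the target device.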

Related

Radix (Trie) tree implementation for Customer search in Java

I am working on a project and need to search the data of millions of customers. I want to implement a radix (trie) search algorithm. I have read about radix trees and implemented one for a simple collection of strings. But here I have a collection of customers and want to search it by name or by mobile number.
Customer Class:
public class Customer {
String name;
String mobileNumer;
public Customer (String name, String phoneNumer) {
this.name = name;
this.mobileNumer = phoneNumer;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getPhoneNumer() {
return mobileNumer;
}
public void setPhoneNumer(String phoneNumer) {
this.mobileNumer = phoneNumer;
}
}
RadixNode Class:
import java.util.HashMap;
import java.util.Map;
class RadixNode {
private final Map<Character, RadixNode> child = new HashMap<>();
private final Map<Customer, RadixNode> mobileNum = new HashMap<>();
private boolean endOfWord;
Map<Character, RadixNode> getChild() {
return child;
}
Map<Customer, RadixNode> getChildPhoneDir() {
return mobileNum;
}
boolean isEndOfWord() {
return endOfWord;
}
void setEndOfWord(boolean endOfWord) {
this.endOfWord = endOfWord;
}
}
Radix Class:
class Radix {
private RadixNode root;
Radix() {
root = new RadixNode();
}
void insert(String word) {
RadixNode current = root;
for (int i = 0; i < word.length(); i++) {
current = current.getChild().computeIfAbsent(word.charAt(i), c -> new RadixNode());
}
current.setEndOfWord(true);
}
void insert(Customer word) {
RadixNode current = root;
System.out.println("==========================================");
System.out.println(word.mobileNumer.length());
for (int i = 0; i < word.mobileNumer.length(); i++) {
current = current.getChildPhoneDir().computeIfAbsent(word.mobileNumer.charAt(i), c -> new RadixNode());
System.out.println(current);
}
current.setEndOfWord(true);
}
boolean delete(String word) {
return delete(root, word, 0);
}
boolean containsNode(String word) {
RadixNode current = root;
for (int i = 0; i < word.length(); i++) {
char ch = word.charAt(i);
RadixNode node = current.getChild().get(ch);
if (node == null) {
return false;
}
current = node;
}
return current.isEndOfWord();
}
boolean isEmpty() {
return root == null;
}
private boolean delete(RadixNode current, String word, int index) {
if (index == word.length()) {
if (!current.isEndOfWord()) {
return false;
}
current.setEndOfWord(false);
return current.getChild().isEmpty();
}
char ch = word.charAt(index);
RadixNode node = current.getChild().get(ch);
if (node == null) {
return false;
}
boolean shouldDeleteCurrentNode = delete(node, word, index + 1) && !node.isEndOfWord();
if (shouldDeleteCurrentNode) {
current.getChild().remove(ch);
return current.getChild().isEmpty();
}
return false;
}
public void displayContactsUtil(RadixNode curNode, String prefix)
{
// Check if the string 'prefix' ends at this Node
// If yes then display the string found so far
if (curNode.isEndOfWord())
System.out.println(prefix);
// Find all the adjacent Nodes to the current
// Node and then call the function recursively
// This is similar to performing DFS on a graph
for (char i = 'a'; i <= 'z'; i++)
{
RadixNode nextNode = curNode.getChild().get(i);
if (nextNode != null)
{
displayContactsUtil(nextNode, prefix + i);
}
}
}
public boolean displayContacts(String str)
{
RadixNode prevNode = root;
// 'flag' denotes whether the string entered
// so far is present in the Contact List
String prefix = "";
int len = str.length();
// Display the contact List for string formed
// after entering every character
int i;
for (i = 0; i < len; i++)
{
// 'str' stores the string entered so far
prefix += str.charAt(i);
// Get the last character entered
char lastChar = prefix.charAt(i);
// Find the Node corresponding to the last
// character of 'str' which is pointed by
// prevNode of the Trie
RadixNode curNode = prevNode.getChild().get(lastChar);
// If nothing found, then break the loop as
// no more prefixes are going to be present.
if (curNode == null)
{
System.out.println("No Results Found for \"" + prefix + "\"");
i++;
break;
}
// If present in trie then display all
// the contacts with given prefix.
System.out.println("Suggestions based on \"" + prefix + "\" are");
displayContactsUtil(curNode, prefix);
// Change prevNode for next prefix
prevNode = curNode;
}
for ( ; i < len; i++)
{
prefix += str.charAt(i);
System.out.println("No Results Found for \"" + prefix + "\"");
}
return true;
}
public void displayContactsUtil(RadixNode curNode, String prefix, boolean isPhoneNumber)
{
// Check if the string 'prefix' ends at this Node
// If yes then display the string found so far
if (curNode.isEndOfWord())
System.out.println(prefix);
// Find all the adjacent Nodes to the current
// Node and then call the function recursively
// This is similar to performing DFS on a graph
for (char i = '0'; i <= '9'; i++)
{
RadixNode nextNode = curNode.getChildPhoneDir().get(i);
if (nextNode != null)
{
displayContactsUtil(nextNode, prefix + i);
}
}
}
public boolean displayContacts(String str, boolean isPhoneNumber)
{
RadixNode prevNode = root;
// 'flag' denotes whether the string entered
// so far is present in the Contact List
String prefix = "";
int len = str.length();
// Display the contact List for string formed
// after entering every character
int i;
for (i = 0; i < len; i++)
{
// 'str' stores the string entered so far
prefix += str.charAt(i);
// Get the last character entered
char lastChar = prefix.charAt(i);
// Find the Node corresponding to the last
// character of 'str' which is pointed by
// prevNode of the Trie
RadixNode curNode = prevNode.getChildPhoneDir().get(lastChar);
// If nothing found, then break the loop as
// no more prefixes are going to be present.
if (curNode == null)
{
System.out.println("No Results Found for \"" + prefix + "\"");
i++;
break;
}
// If present in trie then display all
// the contacts with given prefix.
System.out.println("Suggestions based on \"" + prefix + "\" are");
displayContactsUtil(curNode, prefix, isPhoneNumber);
// Change prevNode for next prefix
prevNode = curNode;
}
for ( ; i < len; i++)
{
prefix += str.charAt(i);
System.out.println("No Results Found for \"" + prefix + "\"");
}
return true;
}
}
I have tried to search in a collection but got stuck. Any help / suggestion would be appreciated.
I propose two ways of doing it.
First way: with a single trie.
It is possible to store all you need in a single trie. Your customer class is fine, and here is a possible RadixNode implementation.
I assume that there cannot be two customers with the same name, or with the same phone number. If that is not the case (for instance, people with the same name and different phone numbers), tell me in a comment and I'll edit.
The important thing to understand is that if you want two different ways of finding a customer and you use a single trie, each customer will appear twice in your trie: once at the end of the path corresponding to its name, and once at the end of the path corresponding to its phone number.
import java.util.HashMap;
import java.util.Map;
class RadixNode {
private Map<Character, RadixNode> children;
private Customer customer;
public RadixNode(){
this.children = new HashMap<>();
this.customer = null;
}
Map<Character, RadixNode> getChildren() {
return children;
}
boolean hasCustomer() {
return this.customer != null;
}
Customer getCustomer() {
return customer;
}
void setCustomer(Customer customer) {
this.customer = customer;
}
}
As you can see, there is only one map storing the node's children. That is because we can see a phone number as a string of digits, so this trie will store all the customers ... twice. Once per name, once per phone number.
Now let's see an insert function. Your trie will need a root node; let's call it root.
public void insert(RadixNode root, Customer customer){
insert_with_name(root, customer, 0);
insert_with_phone_nb(root, customer, 0);
}
public void insert_with_name(RadixNode node, Customer customer, int idx){
if (idx == customer.getName().length()){
node.setCustomer(customer);
} else {
Character current_char = customer.getName().charAt(idx);
if (! node.getChildren().containsKey(current_char)){
RadixNode new_child = new RadixNode();
node.getChildren().put(current_char, new_child);
}
insert_with_name(node.getChildren().get(current_char), customer, idx+1);
}
}
The insert_with_phone_nb() method is similar. This will work as long as people have unique names and unique phone numbers, and nobody's name can be someone else's phone number.
As you can see, the method is recursive. I advise you to build your trie structure (and, generally, everything based on tree structures) recursively, as it makes for simpler and generally cleaner code.
The search function is almost a copy-paste of the insert function:
public Customer search_by_name(RadixNode node, String name, int idx){
// returns null if there is no customer going by that name
if (idx == name.length()){
return node.getCustomer();
} else {
Character current_char = name.charAt(idx);
if (! node.getChildren().containsKey(current_char)){
return null;
} else {
return search_by_name(node.getChildren().get(current_char), name, idx+1);
}
}
}
Second way: with 2 tries
The principle is the same, all you have to do is reuse the code above, but keep two distinct root nodes, each of them will build a trie (one for names, one for phone numbers).
The only difference will be the insert function (as it will call insert_with_name and insert_with_phone_nb with 2 different roots), and the search function which will have to search in the right trie as well.
public void insert(RadixNode root_name_trie, RadixNode root_phone_trie, Customer customer){
insert_with_name(root_name_trie, customer, 0);
insert_with_phone_nb(root_phone_trie, customer, 0);
}
Edit: After a comment clarifying that there might be customers with the same name, here is an alternative implementation that allows a RadixNode to hold references to several Customer objects.
Replace the Customer customer attribute in RadixNode with, for example, a Vector<Customer>. The methods will have to be modified accordingly, of course, and a search by name will then return a vector of customers (possibly empty), since the search can lead to several results.
In your case, I'd go for a single trie containing vectors of customers. That way you can search by both name and phone number (treat the number as a String), with a single data structure to maintain.
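To make the multi-customer variant concrete, here is a small self-contained sketch. It is only an illustration under a few assumptions: the names CustomerTrie, insertKey and search are invented here, the nested Customer class is a stripped-down stand-in for the one in the question, an ArrayList is used where the text suggests a Vector, and the traversal is written iteratively rather than recursively to keep it short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: one trie indexing customers by name and by phone number,
// where each node may hold several customers.
class CustomerTrie {
    // Minimal stand-in for the Customer class from the question.
    static final class Customer {
        final String name;
        final String mobileNumber;

        Customer(String name, String mobileNumber) {
            this.name = name;
            this.mobileNumber = mobileNumber;
        }
    }

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final List<Customer> customers = new ArrayList<>(); // empty unless a key ends here
    }

    private final Node root = new Node();

    /** Indexes the customer twice: once under the name, once under the number. */
    void insert(Customer customer) {
        insertKey(customer.name, customer);
        insertKey(customer.mobileNumber, customer);
    }

    private void insertKey(String key, Customer customer) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.customers.add(customer);
    }

    /** Returns every customer whose name or number equals the key (possibly none). */
    List<Customer> search(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.get(key.charAt(i));
            if (node == null) {
                return new ArrayList<>();
            }
        }
        return new ArrayList<>(node.customers);
    }
}
```

Returning a copy of the list keeps callers from modifying the trie's internal state by accident.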

Check whether a string can be split into two halves that are equal?

I am working on a project where I need to add the method below to the SampleQueue class:
public static boolean isValid(String s)
The method should do the following: it will take a String as an input
parameter. Consider strings that can be split so that their first half
is the same as their second half (ignoring blanks, punctuation, and
case). For example, the string "treetree" can be split into "tree" and
"tree". Another example is "world, world". After ignoring blanks and
the comma, the two halves of the string are the same. However, the
string "kattan" has unequal halves, as does the string "abcab".
Basically my method should return true when string has the property above and false otherwise. We need to only use methods in SampleQueue class as shown below to implement the method:
public class SampleQueue<T> {
private T[] queue;
private int frontIndex;
private int backIndex;
private static final int DEFAULT_INITIAL_CAPACITY = 200;
public SampleQueue() {
this(DEFAULT_INITIAL_CAPACITY);
}
public SampleQueue(int initialCapacity) {
T[] tempQueue = (T[]) new Object[initialCapacity + 1];
queue = tempQueue;
frontIndex = 0;
backIndex = initialCapacity;
}
public void enqueue(T newEntry) {
ensureCapacity();
backIndex = (backIndex + 1) % queue.length;
queue[backIndex] = newEntry;
}
public T getFront() {
T front = null;
if (!isEmpty())
front = queue[frontIndex];
return front;
}
public T dequeue() {
// some stuff here
}
private void ensureCapacity() {
// some stuff here
}
public boolean isEmpty() {
// some stuff here
}
public void clear() {
// some stuff here
}
public static boolean isValid(String s) {
if (s == null || s.isEmpty()) {
return false;
}
SampleQueue<Character> myQueue = new SampleQueue<>();
for (char ch : s.trim().toLowerCase().toCharArray()) {
if ((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9'))
myQueue.enqueue(ch);
}
// is this the right way to check the length?
if (myQueue.queue.length % 2 == 1) {
return false;
}
// now I am confused about what to do here
}
}
I implemented a few things in the isValid method based on the logic I came up with, but I am confused about what to do when the length is even.
Enqueue all of the string’s characters—excluding blanks and
punctuation—one at a time. Let the length of the queue be n. If n is
odd, return false. If n is even then what should I do?
This seems overly complicated; use a regular expression to remove everything not a letter and then test if the two halves of the String are equal. Like,
public static boolean isValid(String s) {
String t = s.replaceAll("[^A-Za-z]", "");
return t.substring(0, t.length() / 2).equalsIgnoreCase(t.substring(t.length() / 2)); // the task says to ignore case
}
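If the exercise really has to be solved with queue operations only, one possible way to handle the even case is sketched below. Since dequeue() and isEmpty() are elided in the question, this sketch uses java.util.ArrayDeque as a stand-in for SampleQueue (add plays the role of enqueue, remove the role of dequeue). The class name HalvesCheck and the choice to return false when no letters or digits remain are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch: compare the two halves of a string using only queue operations.
class HalvesCheck {
    static boolean isValid(String s) {
        if (s == null) {
            return false;
        }
        Queue<Character> all = new ArrayDeque<>();
        int n = 0; // count entries ourselves instead of looking at the array capacity
        for (char ch : s.toLowerCase().toCharArray()) {
            if ((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9')) {
                all.add(ch);
                n++;
            }
        }
        if (n == 0 || n % 2 == 1) {
            return false;
        }
        // Move the first half into a second queue.
        Queue<Character> firstHalf = new ArrayDeque<>();
        for (int i = 0; i < n / 2; i++) {
            firstHalf.add(all.remove());
        }
        // What is left in 'all' is the second half; compare front to front.
        while (!firstHalf.isEmpty()) {
            if (!firstHalf.remove().equals(all.remove())) {
                return false;
            }
        }
        return true;
    }
}
```

Counting the characters while enqueuing sidesteps the problem in the question, where queue.length is the capacity of the backing array and not the number of stored entries.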

How to properly check if a prefix of a word exists in a trie?

Currently I have the "searchPrefix" function of my Trie class defined as such:
public Boolean searchPrefix(String word) {
TrieNode temp = this.root;
for(int i = 0; i < word.length(); i++){
if(temp.children.get(word.charAt(i)) == null) return false;
else temp = temp.children.get(word.charAt(i));
}
return (temp.children.isEmpty()) ? false : true;
}
This function is supposed to return "true" when the input string is a prefix of a word that exists inside of the trie object. Here is the TrieNode class for reference:
class TrieNode {
Character c;
Boolean isWord = false;
HashMap<Character, TrieNode> children = new HashMap<>();
public TrieNode() {}
public TrieNode(Character c) {
this.c = c;
}
}
According to this online judge, I am incorrectly determining whether a given input string is a prefix. Can anyone shed some light as to why this is an incorrect method? My thinking is that when we get to the node that is the end of the input string, if the node has children it is a prefix of some other word so we return true. However this is apparently incorrect.
I think you are not handling the case where prefix is a terminal word in the trie.
For example, assume there's only one word hello in a trie.
Your implementation will return false for searchPrefix("hello").
To fix it, you need to check the isWord flag too:
public Boolean searchPrefix(String word) {
TrieNode temp = this.root;
for (int i = 0; i < word.length(); i++){
TrieNode next = temp.children.get(word.charAt(i));
if (next == null) {
return false;
}
temp = next;
}
return !temp.children.isEmpty() || temp.isWord;
}
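To see the difference on the "hello" example, here is a tiny self-contained sketch. The PrefixTrie class is a simplified stand-in for the Trie in the question (its insert method is an assumption, since the original insert code is not shown); the final return statement is the one from the fix above.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: minimal trie demonstrating the corrected prefix check.
class PrefixTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    boolean searchPrefix(String prefix) {
        Node node = root;
        for (int i = 0; i < prefix.length(); i++) {
            node = node.children.get(prefix.charAt(i));
            if (node == null) {
                return false;
            }
        }
        // A complete word counts as a prefix of itself.
        return !node.children.isEmpty() || node.isWord;
    }

    public static void main(String[] args) {
        PrefixTrie trie = new PrefixTrie();
        trie.insert("hello");
        System.out.println(trie.searchPrefix("hel"));   // true
        System.out.println(trie.searchPrefix("hello")); // true with the isWord check
        System.out.println(trie.searchPrefix("help"));  // false
    }
}
```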

Fastest way to check if a haystack contains set of needles

I have a haystack string and I would like to check if it contains any of the needle strings. Currently I do it that way:
Set<String> needles = ...;
...
String [] pieces = haystack.split(" ");
for (String piece: pieces) {
if (needles.contains(piece)) {
return true;
}
}
return false;
It works, but it is relatively slow.
Question: Is there a faster way to accomplish the task?
Example.
Haystack: I am a big tasty potato .
Needles: big, tasty
== RUN ==
I am a big tasty potato .
|
[tasty] got a match, we are good!
You should take a look at the Aho-Corasick algorithm. It suits your problem because it builds an automaton from all the words (needles) and traverses the text (haystack) over that automaton to find all matching words. It basically constructs a finite state machine that resembles a trie.
The time complexity is O(n + m + z), where
z is the total number of occurrences of words in the text, n is the length of the text, and m is the total number of characters in all words.
Edit 2
Here is a straightforward implementation which stops traversing after finding the first occurrence of any needle.
import java.util.*;
class AhoCorasick {
static final int ALPHABET_SIZE = 256;
Node[] nodes;
int nodeCount;
public static class Node {
int parent;
char charFromParent;
int suffLink = -1;
int[] children = new int[ALPHABET_SIZE];
int[] transitions = new int[ALPHABET_SIZE];
boolean leaf;
{
Arrays.fill(children, -1);
Arrays.fill(transitions, -1);
}
}
public AhoCorasick(int maxNodes) {
nodes = new Node[maxNodes];
// create root
nodes[0] = new Node();
nodes[0].suffLink = 0;
nodes[0].parent = -1;
nodeCount = 1;
}
public void addString(String s) {
int cur = 0;
for (char ch : s.toCharArray()) {
int c = ch;
if (nodes[cur].children[c] == -1) {
nodes[nodeCount] = new Node();
nodes[nodeCount].parent = cur;
nodes[nodeCount].charFromParent = ch;
nodes[cur].children[c] = nodeCount++;
}
cur = nodes[cur].children[c];
}
nodes[cur].leaf = true;
}
public int suffLink(int nodeIndex) {
Node node = nodes[nodeIndex];
if (node.suffLink == -1)
node.suffLink = node.parent == 0 ? 0 : transition(suffLink(node.parent), node.charFromParent);
return node.suffLink;
}
public int transition(int nodeIndex, char ch) {
int c = ch;
Node node = nodes[nodeIndex];
if (node.transitions[c] == -1)
node.transitions[c] = node.children[c] != -1 ? node.children[c] : (nodeIndex == 0 ? 0 : transition(suffLink(nodeIndex), ch));
return node.transitions[c];
}
// Usage example
public static void main(String[] args) {
AhoCorasick ahoCorasick = new AhoCorasick(1000);
ahoCorasick.addString("big");
ahoCorasick.addString("tasty");
String s = "I am a big tasty potato";
int node = 0;
for (int i = 0; i < s.length(); i++) {
node = ahoCorasick.transition(node, s.charAt(i));
if (ahoCorasick.nodes[node].leaf) {
System.out.println("A match found! Needle ends at: " + i); // A match found! Needle ends at: 9
break;
}
}
}
}
However, this code currently finds only the end position of an occurrence in the text. If you need the starting position and/or the needle, you can trace back from the ending position until you find a space to get the matched word.
This doesn't guarantee speed in the worst case, but should work better in the average and best cases.
You can use Java 8 or later with parallel streams and the anyMatch function:
boolean hi=Arrays.stream(pieces).parallel().anyMatch(i->needles.contains(i));
You should make sure needles is an instance of a HashSet, which makes contains a "fast", constant-time operation. Next, don't process all of haystack if you don't have to... Try this:
int i, j, l = haystack.length();
for(i = 0; i < l; i = j + 1) {
// j is the index of the next space, or the end of the string
j = haystack.indexOf(' ', i);
if(j == -1) {
j = l;
}
String hay = haystack.substring(i, j);
if(hay.length() > 0 && needles.contains(hay)) {
return true;
}
}
return false;
*Note: this is a sketch; edge cases might exist, so test it before relying on it.
Generally most of your slowdown is the split command. You are way better off searching the one string you have than allocating a crap ton of objects. You'd be better off doing regex, and avoiding new object construction. And using Aho would be quite effective. Assuming your lists are big enough to be troublesome.
public class NeedleFinder {
static final int RANGEPERMITTED = 26;
NeedleFinder next[];
public NeedleFinder() {
}
public NeedleFinder(String haystack) {
buildHaystack(haystack);
}
public void buildHaystack(String haystack) {
buildHaystack(this,haystack,0);
}
public void buildHaystack(NeedleFinder node, String haystack, int pos) {
if (pos >= haystack.length()) return;
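char ch = haystack.charAt(pos);
if (ch == ' ') {
// a space ends the current word: start again from the root for the next one
buildHaystack(this,haystack,pos+1);
return;
}
char digit = (char) (ch % RANGEPERMITTED); // lossy bucket, hence the possible false positives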
if (node.next == null) node.next = new NeedleFinder[RANGEPERMITTED];
if (node.next[digit] == null) node.next[digit] = new NeedleFinder();
NeedleFinder nodeNext = node.next[digit];
buildHaystack(nodeNext,haystack,pos+1);
}
public boolean findNeedle(String needle) {
return findNeedle(this, needle,0);
}
private boolean findNeedle(NeedleFinder node, String needle, int pos) {
if (pos >= needle.length()) return true;
char digit = (char) (needle.charAt(pos) % RANGEPERMITTED);
if (node.next == null) return false;
if (node.next[digit] == null) return false;
return findNeedle(node.next[digit],needle,pos+1);
}
}
On success, check the contains to make sure it's not a false positive. But, it's fast. We're talking 1/5th the speed of binary search.
Speaking of, binary search is a great idea. It's in the right time complexity alone. Just sort your silly list of haystack strings, then when you look through the needles do a binary search. In Java these are really basic and live in Collections: both the .sort() and the .binarySearch() methods. And it's going to be orders of magnitude better than brute force.
value = Collections.binarySearch(haystackList, needle, strcomp);
If value is non-negative (>= 0), it was found. The list has to be sorted with the same comparator beforehand:
Collections.sort(haystackList, strcomp);
With the strcomp:
public Comparator<String> strcomp = new Comparator<String>() {
@Override
public int compare(String s, String t1) {
if ((s == null) && (t1 == null)) return 0;
if (s == null) return 1;
if (t1 == null) return -1;
return s.compareTo(t1);
}
};
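Put together, the sort-then-search idea could look roughly like this. It is a sketch under assumptions: the class and method names are invented, the haystack is split on single spaces as in the question, and natural String ordering is used in place of strcomp because no null entries are expected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Sketch: sort the haystack words once, then binary-search each needle.
class BinarySearchNeedles {
    static boolean containsAny(String haystack, Collection<String> needles) {
        List<String> words = new ArrayList<>(Arrays.asList(haystack.split(" ")));
        Collections.sort(words); // binarySearch requires a sorted list
        for (String needle : needles) {
            // A non-negative result is the index of a match.
            if (Collections.binarySearch(words, needle) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> needles = Arrays.asList("big", "tasty");
        System.out.println(containsAny("I am a big tasty potato .", needles)); // true
        System.out.println(containsAny("I am a small potato .", needles));     // false
    }
}
```

Sorting costs O(k log k) for k haystack words, so this mainly pays off when the same haystack is checked against many needles; for a single pass, the HashSet lookup in the question is likely hard to beat.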
If it's really all about speed, and you want to search through a list of items instead of a solid string, you could divide the work into different threads (I'm not sure how many items you're checking with, but if it's not taking minutes, this might not be the way to go)
If you don't need to make the haystack into an array, you could instead iterate through needles, and test haystack via String.contains();

How to use string frequencies list in Trie data structure?

I am working on some performance test on various data structures. In my list I have HashMap and Trie data structure. I am done with HashMap but not sure how to use Trie for below problem -
I have a text file which contains 2 million english words with their frequencies in this format -
hello 100
world 5000
good 2000
bad 9000
...
Now I am reading this file line by line and storing it in a HashMap: the first token of each split line becomes the key and the second becomes the value, so I am able to measure the insertion performance with the code below.
Map<String, String> wordTest = new HashMap<String, String>();
try {
fis = new FileInputStream(FILE_LOCATION);
reader = new BufferedReader(new InputStreamReader(fis));
String line = reader.readLine();
while (line != null) {
String[] splitString = line.split("\\s+");
// now put it in HashMap as key value pair
wordTest.put(splitString[0].toLowerCase().trim(), splitString[1].trim());
line = reader.readLine();
}
}
Now how would I implement a Trie data structure to load the same data as I did for the HashMap, and then do a lookup based on a String? This is my first time with a Trie, so I am a little bit confused.
Update:-
Below is my TrieImpl class
public class TrieImpl {
//root node
private TrieNode r;
public TrieImpl() {
r = new TrieNode();
}
public boolean has(String word) {
return r.has(word);
}
public void insert(String word){
r.insert(word);
}
public String toString() {
return r.toString();
}
public static void main(String[] args) {
TrieImpl t = new TrieImpl();
System.out.println("Testing some strings");
t.insert("HELLO"); // how do I pass string and its count
t.insert("WORLD"); // how do I pass string and its count
}
}
And below is my TrieNode class -
public class TrieNode {
// make child nodes
private TrieNode[] c;
// flag for end of word
private boolean flag = false;
public TrieNode() {
c = new TrieNode[26]; // 1 for each letter in alphabet
}
protected void insert(String word) {
int val = word.charAt(0) - 64;
// if the value of the child node at val is null, make a new node
// there to represent the letter
if (c[val] == null) {
c[val] = new TrieNode();
}
// if word length > 1, then word is not finished being added.
// otherwise, set the flag to true so we know a word ends there.
if (word.length() > 1) {
c[val].insert(word.substring(1));
} else {
c[val].flag = true;
}
}
public boolean has(String word) {
int val = word.charAt(0) - 64;
if (c[val] != null && word.length() > 1) {
c[val].has(word.substring(1));
} else if (c[val].flag == true && word.length() == 1) {
return true;
}
return false;
}
public String toString() {
return "";
}
}
Now how would I extend this to pass a particular string and its count, and then do a lookup based on the String?
You can just add a frequency field to your TrieNode class.
public class TrieNode {
// make child nodes
private TrieNode[] c;
// flag for end of word
private boolean flag = false;
//stores frequency if flag is set
private int frequency;
Now, in the insert method, store the frequency while setting the flag, and change the method signature accordingly:
protected void insert(String word, int frequency) {
int val = word.charAt(0) - 64;
..........
..........
// if the value of the child node at val is null, make a new node
if (word.length() > 1) {
c[val].insert(word.substring(1),frequency);
} else {
c[val].flag = true;
c[val].frequency = frequency;
}
}
Now create a new method to get the frequency. It can be done similarly to the has method: follow the branches to the end and, when you find that the flag is set, return the frequency.
public int getFreq(String word) {
int val = word.charAt(0) - 64;
if (word.length() > 1) {
return c[val].getFreq(word.substring(1));
} else if (c[val].flag == true && word.length() == 1) {
return c[val].frequency;
} else
return -1;
}
-------------------------------EDIT------------------------
Use has method first to check for the string, then use getFreq method
public int getFreq(String word) {
if(has(word))
return getFreqHelper(word);
else
return -1; //this indicates word is not present
}
private int getFreqHelper(String word) {
int val = word.charAt(0) - 64;
if (word.length() > 1) {
return c[val].getFreqHelper(word.substring(1)); // recurse with the helper; has() was already checked once
} else if (c[val].flag == true && word.length() == 1) {
return c[val].frequency;
} else
return -1;
}
Here is a hint:
Define a class FrequencyString like so:
class FrequencyString {
private String string;
private int frequency;
public FrequencyString(String str, int freq) {
this.string = str;
this.frequency = freq;
}
public String getString() {
return string;
}
public int getFrequency() {
return frequency;
}
}
Now modify your Trie implementation methods to accept this new FrequencyString. These will be your new signatures:
TrieImpl:
boolean has(String word);
void insert(String word, int freq);
TrieNode:
boolean has(String word);
void insert(FrequencyString word);
If you want to find the frequency for a given word if it exists, change the has methods' signatures to this:
Integer find(String word);
When implementing find, return null if the word does not exist, or Integer.valueOf(result.getFrequency()) (where result is the found FrequencyString) if it does.
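As a rough end-to-end illustration of the find idea, the sketch below stores the frequency directly in the node where a word ends. It does not follow either answer exactly: the class name FrequencyTrie is made up, children are kept in a HashMap so that any character works (the array version above is limited to a fixed alphabet), and the sample lines stand in for the file described in the question.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a trie that maps each word to its frequency.
class FrequencyTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Integer frequency; // null unless a word ends at this node
    }

    private final Node root = new Node();

    void insert(String word, int frequency) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.frequency = frequency;
    }

    /** Returns the stored frequency, or null if the word is not present. */
    Integer find(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return null;
            }
        }
        return node.frequency;
    }

    public static void main(String[] args) {
        FrequencyTrie trie = new FrequencyTrie();
        // Sample lines in the "word count" format from the question.
        String[] lines = {"hello 100", "world 5000", "good 2000", "bad 9000"};
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            trie.insert(parts[0].toLowerCase(), Integer.parseInt(parts[1]));
        }
        System.out.println(trie.find("world")); // 5000
        System.out.println(trie.find("wor"));   // null
    }
}
```

Returning null for a missing word avoids a separate has call followed by a second traversal.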
