How to use string frequencies list in Trie data structure? - java

I am running performance tests on various data structures, including HashMap and a Trie. I am done with HashMap, but I am not sure how to use a Trie for the problem below.
I have a text file that contains 2 million English words with their frequencies, in this format:
hello 100
world 5000
good 2000
bad 9000
...
I read this file line by line and store it in a HashMap: the first token of each line becomes the key and the second token becomes the value, so I can measure insertion performance with the code below.
Map<String, String> wordTest = new HashMap<String, String>();
try {
fis = new FileInputStream(FILE_LOCATION);
reader = new BufferedReader(new InputStreamReader(fis));
String line = reader.readLine();
while (line != null) {
String[] splitString = line.split("\\s+");
// now put it in HashMap as key value pair
wordTest.put(splitString[0].toLowerCase().trim(), splitString[1].trim());
line = reader.readLine();
}
}
Now how would I implement a Trie to load the same data as I did for the HashMap, and then look up an entry by its String? This is my first time using a Trie, so I am a little confused.
Update:
Below is my TrieImpl class
public class TrieImpl {
//root node
private TrieNode r;
public TrieImpl() {
r = new TrieNode();
}
public boolean has(String word) {
return r.has(word);
}
public void insert(String word){
r.insert(word);
}
public String toString() {
return r.toString();
}
public static void main(String[] args) {
TrieImpl t = new TrieImpl();
System.out.println("Testing some strings");
t.insert("HELLO"); // how do I pass string and its count
t.insert("WORLD"); // how do I pass string and its count
}
}
And below is my TrieNode class -
public class TrieNode {
// make child nodes
private TrieNode[] c;
// flag for end of word
private boolean flag = false;
public TrieNode() {
c = new TrieNode[27]; // index = letter - 64, so 'A'..'Z' map to 1..26 (slot 0 is unused)
}
protected void insert(String word) {
int val = word.charAt(0) - 64;
// if the value of the child node at val is null, make a new node
// there to represent the letter
if (c[val] == null) {
c[val] = new TrieNode();
}
// if word length > 1, then word is not finished being added.
// otherwise, set the flag to true so we know a word ends there.
if (word.length() > 1) {
c[val].insert(word.substring(1));
} else {
c[val].flag = true;
}
}
public boolean has(String word) {
int val = word.charAt(0) - 64;
// no child for this letter: the word cannot be present
if (c[val] == null) {
return false;
}
if (word.length() > 1) {
// return the result of the recursive lookup instead of discarding it
return c[val].has(word.substring(1));
}
return c[val].flag;
}
public String toString() {
return "";
}
}
Now how would I extend this to pass a particular string and its count, and then do a lookup based on the String?

You can simply add a frequency field to your TrieNode class.
public class TrieNode {
// make child nodes
private TrieNode[] c;
// flag for end of word
private boolean flag = false;
//stores frequency if flag is set
private int frequency;
Now, in the insert method, store the frequency when you set the flag, and change the method signature accordingly:
protected void insert(String word, int frequency) {
int val = word.charAt(0) - 64;
// ... as before: if the child node at val is null, make a new node ...
if (word.length() > 1) {
c[val].insert(word.substring(1),frequency);
} else {
c[val].flag = true;
c[val].frequency = frequency;
}
}
Now create a new method to get the frequency. It works like the has method: follow the branches to the end of the word and, if the flag is set there, return the frequency.
public int getFreq(String word) {
int val = word.charAt(0) - 64;
if (c[val] == null) {
return -1; // no such branch, so the word is not present
}
if (word.length() > 1) {
return c[val].getFreq(word.substring(1));
}
return c[val].flag ? c[val].frequency : -1;
}
EDIT:
Use the has method first to check for the string, then call a helper that reads the frequency:
public int getFreq(String word) {
if(has(word))
return getFreqHelper(word);
else
return -1; //this indicates word is not present
}
private int getFreqHelper(String word) {
int val = word.charAt(0) - 64;
if (word.length() > 1) {
return c[val].getFreqHelper(word.substring(1)); // presence was already checked in getFreq
} else if (c[val].flag == true && word.length() == 1) {
return c[val].frequency;
} else
return -1;
}
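Putting the pieces of this answer together, a minimal iterative sketch might look like the following. The names FrequencyTrie, Node, indexOf and fromLines are hypothetical and not from the original code; the sketch assumes words contain only ASCII letters and folds them to upper case, and it returns -1 for a missing word as above.

```java
import java.util.List;

// Hypothetical names throughout; a sketch, not the asker's TrieImpl/TrieNode.
class FrequencyTrie {
    private static final class Node {
        final Node[] children = new Node[26]; // one slot per letter 'A'..'Z'
        boolean isWord;                       // true if a word ends at this node
        int frequency;                        // meaningful only when isWord is true
    }

    private final Node root = new Node();

    // Assumption: only ASCII letters are allowed; anything else is rejected.
    private static int indexOf(char ch) {
        char upper = Character.toUpperCase(ch);
        if (upper < 'A' || upper > 'Z') {
            throw new IllegalArgumentException("Unsupported character: " + ch);
        }
        return upper - 'A';
    }

    void insert(String word, int frequency) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int idx = indexOf(word.charAt(i));
            if (node.children[idx] == null) {
                node.children[idx] = new Node();
            }
            node = node.children[idx];
        }
        node.isWord = true;
        node.frequency = frequency;
    }

    // Returns the stored frequency, or -1 if the word is not present.
    int getFreq(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children[indexOf(word.charAt(i))];
            if (node == null) {
                return -1;
            }
        }
        return node.isWord ? node.frequency : -1;
    }

    // Loads lines of the form "word frequency", as in the question's file.
    static FrequencyTrie fromLines(List<String> lines) {
        FrequencyTrie trie = new FrequencyTrie();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            trie.insert(parts[0], Integer.parseInt(parts[1]));
        }
        return trie;
    }
}
```

Walking the word with a loop instead of recursion avoids creating a substring per character, which may matter when timing 2 million insertions.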

Here is a hint:
Define a class FrequencyString like so:
class FrequencyString {
private String string;
private int frequency;
public FrequencyString(String str, int freq) {
this.string = str;
this.frequency = freq;
}
public String getString() {
return string;
}
public int getFrequency() {
return frequency;
}
}
Now modify your Trie implementation methods to accept this new FrequencyString. These will be your new signatures:
TrieImpl:
boolean has(String word);
void insert(String word, int freq);
TrieNode:
boolean has(String word);
void insert(FrequencyString word);
If you want to find the frequency for a given word if it exists, change the has methods' signatures to this:
Integer find(String word);
When implementing find, return null if the word does not exist, or Integer.valueOf(result.getFrequency()) (where result is the found FrequencyString) if it does.
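A minimal sketch of the find idea, under some simplifying assumptions: the class name WordTrie is hypothetical, children are kept in a HashMap rather than an array, and the node stores the frequency as an Integer directly instead of holding a FrequencyString.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: WordTrie and Node are illustrative names.
class WordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Integer frequency; // null means no word ends at this node
    }

    private final Node root = new Node();

    void insert(String word, int freq) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.computeIfAbsent(ch, k -> new Node());
        }
        node.frequency = freq;
    }

    // Returns the frequency, or null if the word is absent.
    Integer find(String word) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.get(ch);
            if (node == null) {
                return null;
            }
        }
        return node.frequency;
    }
}
```

Returning null rather than -1 keeps "absent" distinct from any legitimate count, at the cost of a null check at the call site.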

Related

Radix(Trie) Tree implementation for Customer search in Java

I am working on a project and need to search through the data of millions of customers. I want to implement a radix (trie) search algorithm. I have read about and implemented a radix tree for a simple collection of strings, but here I have a collection of customers and want to search it by name or by mobile number.
Customer Class:
public class Customer {
String name;
String mobileNumer;
public Customer (String name, String phoneNumer) {
this.name = name;
this.mobileNumer = phoneNumer;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getPhoneNumer() {
return mobileNumer;
}
public void setPhoneNumer(String phoneNumer) {
this.mobileNumer = phoneNumer;
}
}
RadixNode Class:
import java.util.HashMap;
import java.util.Map;
class RadixNode {
private final Map<Character, RadixNode> child = new HashMap<>();
private final Map<Customer, RadixNode> mobileNum = new HashMap<>();
private boolean endOfWord;
Map<Character, RadixNode> getChild() {
return child;
}
Map<Customer, RadixNode> getChildPhoneDir() {
return mobileNum;
}
boolean isEndOfWord() {
return endOfWord;
}
void setEndOfWord(boolean endOfWord) {
this.endOfWord = endOfWord;
}
}
Radix Class:
class Radix {
private RadixNode root;
Radix() {
root = new RadixNode();
}
void insert(String word) {
RadixNode current = root;
for (int i = 0; i < word.length(); i++) {
current = current.getChild().computeIfAbsent(word.charAt(i), c -> new RadixNode());
}
current.setEndOfWord(true);
}
void insert(Customer word) {
RadixNode current = root;
System.out.println("==========================================");
System.out.println(word.mobileNumer.length());
for (int i = 0; i < word.mobileNumer.length(); i++) {
current = current.getChildPhoneDir().computeIfAbsent(word.mobileNumer.charAt(i), c -> new RadixNode());
System.out.println(current);
}
current.setEndOfWord(true);
}
boolean delete(String word) {
return delete(root, word, 0);
}
boolean containsNode(String word) {
RadixNode current = root;
for (int i = 0; i < word.length(); i++) {
char ch = word.charAt(i);
RadixNode node = current.getChild().get(ch);
if (node == null) {
return false;
}
current = node;
}
return current.isEndOfWord();
}
boolean isEmpty() {
return root == null;
}
private boolean delete(RadixNode current, String word, int index) {
if (index == word.length()) {
if (!current.isEndOfWord()) {
return false;
}
current.setEndOfWord(false);
return current.getChild().isEmpty();
}
char ch = word.charAt(index);
RadixNode node = current.getChild().get(ch);
if (node == null) {
return false;
}
boolean shouldDeleteCurrentNode = delete(node, word, index + 1) && !node.isEndOfWord();
if (shouldDeleteCurrentNode) {
current.getChild().remove(ch);
return current.getChild().isEmpty();
}
return false;
}
public void displayContactsUtil(RadixNode curNode, String prefix)
{
// Check if the string 'prefix' ends at this Node
// If yes then display the string found so far
if (curNode.isEndOfWord())
System.out.println(prefix);
// Find all the adjacent Nodes to the current
// Node and then call the function recursively
// This is similar to performing DFS on a graph
for (char i = 'a'; i <= 'z'; i++)
{
RadixNode nextNode = curNode.getChild().get(i);
if (nextNode != null)
{
displayContactsUtil(nextNode, prefix + i);
}
}
}
public boolean displayContacts(String str)
{
RadixNode prevNode = root;
// 'flag' denotes whether the string entered
// so far is present in the Contact List
String prefix = "";
int len = str.length();
// Display the contact List for string formed
// after entering every character
int i;
for (i = 0; i < len; i++)
{
// 'str' stores the string entered so far
prefix += str.charAt(i);
// Get the last character entered
char lastChar = prefix.charAt(i);
// Find the Node corresponding to the last
// character of 'str' which is pointed by
// prevNode of the Trie
RadixNode curNode = prevNode.getChild().get(lastChar);
// If nothing found, then break the loop as
// no more prefixes are going to be present.
if (curNode == null)
{
System.out.println("No Results Found for \"" + prefix + "\"");
i++;
break;
}
// If present in trie then display all
// the contacts with given prefix.
System.out.println("Suggestions based on \"" + prefix + "\" are");
displayContactsUtil(curNode, prefix);
// Change prevNode for next prefix
prevNode = curNode;
}
for ( ; i < len; i++)
{
prefix += str.charAt(i);
System.out.println("No Results Found for \"" + prefix + "\"");
}
return true;
}
public void displayContactsUtil(RadixNode curNode, String prefix, boolean isPhoneNumber)
{
// Check if the string 'prefix' ends at this Node
// If yes then display the string found so far
if (curNode.isEndOfWord())
System.out.println(prefix);
// Find all the adjacent Nodes to the current
// Node and then call the function recursively
// This is similar to performing DFS on a graph
for (char i = '0'; i <= '9'; i++)
{
RadixNode nextNode = curNode.getChildPhoneDir().get(i);
if (nextNode != null)
{
displayContactsUtil(nextNode, prefix + i);
}
}
}
public boolean displayContacts(String str, boolean isPhoneNumber)
{
RadixNode prevNode = root;
// 'flag' denotes whether the string entered
// so far is present in the Contact List
String prefix = "";
int len = str.length();
// Display the contact List for string formed
// after entering every character
int i;
for (i = 0; i < len; i++)
{
// 'str' stores the string entered so far
prefix += str.charAt(i);
// Get the last character entered
char lastChar = prefix.charAt(i);
// Find the Node corresponding to the last
// character of 'str' which is pointed by
// prevNode of the Trie
RadixNode curNode = prevNode.getChildPhoneDir().get(lastChar);
// If nothing found, then break the loop as
// no more prefixes are going to be present.
if (curNode == null)
{
System.out.println("No Results Found for \"" + prefix + "\"");
i++;
break;
}
// If present in trie then display all
// the contacts with given prefix.
System.out.println("Suggestions based on \"" + prefix + "\" are");
displayContactsUtil(curNode, prefix, isPhoneNumber);
// Change prevNode for next prefix
prevNode = curNode;
}
for ( ; i < len; i++)
{
prefix += str.charAt(i);
System.out.println("No Results Found for \"" + prefix + "\"");
}
return true;
}
}
I have tried to search in a collection but got stuck. Any help / suggestion would be appreciated.
I propose two ways of doing it.
First way: with a single trie.
It is possible to store all you need in a single trie. Your customer class is fine, and here is a possible RadixNode implementation.
I assume that no two customers share the same name or the same phone number. If that is not the case (for instance, people with the same name but different phone numbers), tell me in a comment and I'll edit.
The important thing to understand is that if you want two different ways of finding a customer and you use a single trie, each customer will appear twice in your trie: once at the end of the path corresponding to their name, and once at the end of the path corresponding to their phone number.
import java.util.HashMap;
import java.util.Map;
class RadixNode {
private Map<Character, RadixNode> children;
private Customer customer;
public RadixNode() {
this.children = new HashMap<>();
this.customer = null;
}
Map<Character, RadixNode> getChildren() {
return children;
}
boolean hasCustomer() {
return this.customer != null;
}
Customer getCustomer() {
return customer;
}
void setCustomer(Customer customer) {
this.customer = customer;
}
}
As you can see, there is only one map storing the node's children. That is because we can see a phone number as a string of digits, so this trie will store all the customers ... twice. Once per name, once per phone number.
Now let's look at an insert function. Your trie will need a root node; let's call it root.
public void insert(RadixNode root, Customer customer){
insert_with_name(root, customer, 0);
insert_with_phone_nb(root, customer, 0);
}
public void insert_with_name(RadixNode node, Customer customer, int idx){
if (idx == customer.getName().length()){
node.setCustomer(customer);
} else {
Character current_char = customer.getName().charAt(idx);
if (!node.getChildren().containsKey(current_char)) {
RadixNode new_child = new RadixNode();
node.getChildren().put(current_char, new_child);
}
insert_with_name(node.getChildren().get(current_char), customer, idx+1);
}
}
The insert_with_phone_nb() method is similar. This will work as long as people have unique names and unique phone numbers, and no one's name can be someone else's phone number.
As you can see, the method is recursive. I advise you to build your trie (and, generally, anything based on tree structures) recursively, as it makes for simpler and generally cleaner code.
The search function is almost a copy-paste of the insert function:
public Customer search_by_name(RadixNode node, String name, int idx){
// returns null if there is no customer with that name
if (idx == name.length()){
return node.getCustomer();
} else {
Character current_char = name.charAt(idx);
if (!node.getChildren().containsKey(current_char)) {
return null;
} else {
return search_by_name(node.getChildren().get(current_char), name, idx+1);
}
}
}
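The single-trie approach described above can be sketched end to end as follows. This is only an illustration under the same uniqueness assumptions: CustomerTrie, put and search are hypothetical names, the Customer class is a stripped-down stand-in for the one in the question, and the walk is iterative rather than recursive.

```java
import java.util.HashMap;
import java.util.Map;

// Stripped-down stand-in for the question's Customer class.
class Customer {
    final String name;
    final String phone;

    Customer(String name, String phone) {
        this.name = name;
        this.phone = phone;
    }
}

// Hypothetical name; one trie holding both name keys and phone keys.
class CustomerTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Customer customer; // null unless a key ends at this node
    }

    private final Node root = new Node();

    // Each customer is stored twice: once under the name, once under the phone.
    void insert(Customer c) {
        put(c.name, c);
        put(c.phone, c);
    }

    private void put(String key, Customer c) {
        Node node = root;
        for (char ch : key.toCharArray()) {
            node = node.children.computeIfAbsent(ch, k -> new Node());
        }
        node.customer = c;
    }

    // Works for either a full name or a full phone number; null if absent.
    Customer search(String key) {
        Node node = root;
        for (char ch : key.toCharArray()) {
            node = node.children.get(ch);
            if (node == null) {
                return null;
            }
        }
        return node.customer;
    }
}
```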
Second way: with 2 tries
The principle is the same: reuse the code above, but keep two distinct root nodes, each of which builds its own trie (one for names, one for phone numbers).
The only difference will be the insert function (as it will call insert_with_name and insert_with_phone_nb with 2 different roots), and the search function which will have to search in the right trie as well.
public void insert(RadixNode root_name_trie, RadixNode root_phone_trie, Customer customer){
insert_with_name(root_name_trie, customer, 0);
insert_with_phone_nb(root_phone_trie, customer, 0);
}
Edit: After a comment clarifying that there might be customers with the same name, here is an alternative that lets a RadixNode hold references to several Customer objects.
Replace the Customer customer attribute in RadixNode with, for example, a Vector<Customer>. The methods will have to be modified accordingly, of course, and a search by name will then return a vector of customers (possibly empty), since the search can now lead to several results.
In your case, I'd go for a single trie containing vectors of customers. That way you can search by both name and phone (treat the number as a String) and have a single data structure to maintain.
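A minimal sketch of that variant, with several customers allowed per node. The names MultiCustomerTrie and nodeFor are hypothetical, the nested Customer class is a stand-in for the one in the question, and an ArrayList is used here in place of Vector since no synchronization is assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a trie whose nodes can hold several customers.
class MultiCustomerTrie {
    // Stripped-down stand-in for the question's Customer class.
    static final class Customer {
        final String name;
        final String phone;

        Customer(String name, String phone) {
            this.name = name;
            this.phone = phone;
        }
    }

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final List<Customer> customers = new ArrayList<>(); // customers whose key ends here
    }

    private final Node root = new Node();

    void insert(Customer c) {
        nodeFor(c.name).customers.add(c);
        nodeFor(c.phone).customers.add(c);
    }

    // Walks to the node for key, creating missing nodes along the way.
    private Node nodeFor(String key) {
        Node node = root;
        for (char ch : key.toCharArray()) {
            node = node.children.computeIfAbsent(ch, k -> new Node());
        }
        return node;
    }

    // Exact-match search; returns an empty list when nothing matches.
    List<Customer> search(String key) {
        Node node = root;
        for (char ch : key.toCharArray()) {
            node = node.children.get(ch);
            if (node == null) {
                return new ArrayList<>();
            }
        }
        return new ArrayList<>(node.customers);
    }
}
```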

check whether we can split string in two halves and both halves are equal?

I am working on a project where I need to add the method below to the SampleQueue class:
public static boolean isValid(String s)
The method should do the following. It will take a String as an input
parameter. Consider strings that can be split so that their first half
is the same as their second half (ignoring blanks, punctuation, and
case). For example, the string "treetree" can be split into "tree" and
"tree". Another example is "world, world". After ignoring blanks and
the comma, the two halves of the string are the same. However, the
string "kattan" has unequal halves, as does the string "abcab".
Basically my method should return true when string has the property above and false otherwise. We need to only use methods in SampleQueue class as shown below to implement the method:
public class SampleQueue<T> {
private T[] queue;
private int frontIndex;
private int backIndex;
private static final int DEFAULT_INITIAL_CAPACITY = 200;
public SampleQueue() {
this(DEFAULT_INITIAL_CAPACITY);
}
public SampleQueue(int initialCapacity) {
T[] tempQueue = (T[]) new Object[initialCapacity + 1];
queue = tempQueue;
frontIndex = 0;
backIndex = initialCapacity;
}
public void enqueue(T newEntry) {
ensureCapacity();
backIndex = (backIndex + 1) % queue.length;
queue[backIndex] = newEntry;
}
public T getFront() {
T front = null;
if (!isEmpty())
front = queue[frontIndex];
return front;
}
public T dequeue() {
// some stuff here
}
private void ensureCapacity() {
// some stuff here
}
public boolean isEmpty() {
// some stuff here
}
public void clear() {
// some stuff here
}
public static boolean isValid(String s) {
if (s == null || s.isEmpty()) {
return false;
}
SampleQueue<Character> myQueue = new SampleQueue<>();
for (char ch : s.trim().toLowerCase().toCharArray()) {
if ((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9'))
myQueue.enqueue(ch);
}
// is this the right way to check the length?
if (myQueue.queue.length % 2 == 1) {
return false;
}
// now I am confused here?
}
}
I implemented a few things in the isValid method based on the logic below, but I am confused about what to do when the length is even.
Enqueue all of the string’s characters—excluding blanks and
punctuation—one at a time. Let the length of the queue be n. If n is
odd, return false. If n is even then what should I do?
This seems overly complicated. Use a regular expression to remove everything that is not a letter, lowercase the result so that case is ignored, and then test whether the two halves of the String are equal. For example,
public static boolean isValid(String s) {
String t = s.replaceAll("[^A-Za-z]", "").toLowerCase();
return t.substring(0, t.length() / 2).equals(t.substring(t.length() / 2));
}
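If the exercise really does require a queue, the even-length case can be handled by moving the first n / 2 characters into a second queue and then comparing the two queues front to front. The sketch below is only an illustration: it uses java.util.ArrayDeque as a stand-in for SampleQueue (add, remove and isEmpty play the roles of enqueue, dequeue and isEmpty), the class name HalvesCheck is hypothetical, and it assumes that input with no letters or digits should return false.

```java
import java.util.ArrayDeque;
import java.util.Locale;
import java.util.Queue;

// Hypothetical sketch; ArrayDeque stands in for the question's SampleQueue.
class HalvesCheck {
    static boolean isValid(String s) {
        if (s == null) {
            return false;
        }
        Queue<Character> all = new ArrayDeque<>();
        int n = 0; // count characters as they are enqueued
        for (char ch : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if ((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9')) {
                all.add(ch);
                n++;
            }
        }
        // Assumption: an empty filtered string does not count as valid.
        if (n == 0 || n % 2 == 1) {
            return false;
        }
        // Move the first half into its own queue.
        Queue<Character> firstHalf = new ArrayDeque<>();
        for (int i = 0; i < n / 2; i++) {
            firstHalf.add(all.remove());
        }
        // What remains in 'all' is the second half; compare front to front.
        while (!firstHalf.isEmpty()) {
            if (!firstHalf.remove().equals(all.remove())) {
                return false;
            }
        }
        return true;
    }
}
```

Counting while enqueuing sidesteps the problem in the question's code, where queue.length is the capacity of the backing array rather than the number of stored characters.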

Insert a new string to the trie graph

I am trying to implement the insert method of the Patricia Trie data structure, and I am trying to handle this case:
first string: abaxyzalexsky,
second string: abaxyzalex,
third string: abaxyz,
fourth string: aba
I want the trie to end up as aba-xyz-alex-sky after inserting the fourth string, but I don't know how to get it to work.
How can I mark the words in the trie in the case above?
public void insert(String s) {
if (nodeRoot == null) {
nodeRoot = new TrieNode(s);
nodeRoot.isWord = true;
} else {
insert(nodeRoot, s);
}
}
private void insert(TrieNode node, String s) {
int len1 = node.edge.length();
int len2 = s.length();
int len = Math.min(len1, len2);
ArrayList<TrieNode> nextNode = node.getNext();
for (int index = 0; index < len; index++) {
if (s.charAt(index) != node.edge.charAt(index)) {
// In case the both words have common substrings and after the
// common substrings the words are split. For example abad, abac
} else if (index == (s.length() - 1)
|| index == (node.edge.length() - 1)) {
// In case the node just needs one path since one word is
// substring of the other.
// For example (aba and abac)
if (len1 > len2) {
// node edge string is longer than the inserted one. For example (abac
// and aba).
String samesubString = node.edge.substring(0, index + 1);
String different = node.edge.substring(index + 1);
node.edge = samesubString;
if (node.getNext() != null && !node.getNext().isEmpty()) {
for (TrieNode subword : node.getNext()) {
// I am here when I insert the third string. The code below produces the wrong structure.
TrieNode node1 = new TrieNode(different);
node1.isWord = true;
node1.next.add(subword);
node.next.add(node1);
}
} else {
TrieNode leaf = new TrieNode(different);
leaf.isWord = true;
node.next.add(leaf);
for (TrieNode subword : node.getNext()) {
System.out.println(node.getEdge() + "---"
+ subword.getEdge());
}
}
} else {
// new inserted string value is longer. For example (aba
// and abac).
}
} else {
System.out.println("The strings are the same - " + index);
}
}
}
TrieNode class
package patriciaTrie;
import java.util.ArrayList;
public class TrieNode {
ArrayList<TrieNode> next = new ArrayList<TrieNode>();
String edge;
boolean isWord;
TrieNode(String edge){
this.edge = edge;
}
public ArrayList<TrieNode> getNext() {
return next;
}
public void setNext(ArrayList<TrieNode> next) {
this.next = next;
}
public String getEdge() {
return edge;
}
public void setEdge(String edge) {
this.edge = edge;
}
}
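One possible way to get the aba-xyz-alex-sky shape is to split an edge whenever the new string shares only part of it, and to mark the shortened node as a word when the new string ends exactly there. The sketch below is an assumption-laden illustration rather than a fix of the code above: the names PatriciaSketch, commonPrefixLength and path are hypothetical, the root carries an empty edge, and children are kept in a map keyed by the first character of their edge instead of an ArrayList.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of radix-tree insertion with edge splitting.
class PatriciaSketch {
    private static final class Node {
        String edge;
        boolean isWord;
        final Map<Character, Node> next = new TreeMap<>(); // keyed by first char of the child's edge

        Node(String edge, boolean isWord) {
            this.edge = edge;
            this.isWord = isWord;
        }
    }

    private final Node root = new Node("", false);

    void insert(String s) {
        insert(root, s);
    }

    private void insert(Node node, String s) {
        if (s.isEmpty()) {          // the string ends exactly at this node
            node.isWord = true;
            return;
        }
        Node child = node.next.get(s.charAt(0));
        if (child == null) {        // no shared prefix: add a new leaf
            node.next.put(s.charAt(0), new Node(s, true));
            return;
        }
        int k = commonPrefixLength(child.edge, s);
        if (k < child.edge.length()) {
            // Split: the unmatched tail of the edge moves into a new node
            // that inherits the old children and the old word flag.
            Node tail = new Node(child.edge.substring(k), child.isWord);
            tail.next.putAll(child.next);
            child.next.clear();
            child.next.put(tail.edge.charAt(0), tail);
            child.edge = child.edge.substring(0, k);
            child.isWord = false;
        }
        insert(child, s.substring(k));
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Returns the edges walked to reach s, or an empty list if s is not a stored word.
    List<String> path(String s) {
        List<String> edges = new ArrayList<>();
        Node node = root;
        String rest = s;
        while (!rest.isEmpty()) {
            Node child = node.next.get(rest.charAt(0));
            if (child == null || !rest.startsWith(child.edge)) {
                return Collections.emptyList();
            }
            edges.add(child.edge);
            rest = rest.substring(child.edge.length());
            node = child;
        }
        return node.isWord ? edges : Collections.emptyList();
    }
}
```

With the four strings from the question inserted in the stated order, each later insertion splits the existing edge at the end of the shorter string, so the words end at successive nodes along one chain.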

Detecting if a word is valid when it contains a blank

I'm working on a phone based word game, and there could potentially be quite a few blanks (representing any letter) that a player could have the option to use.
I store all the possible words in a hashSet, so detecting if a word is valid when it has one blank is simply a matter of looping through the alphabet replacing the blank with a letter and testing the word. I have a recursive call so this will work with any number of blanks. The code is as follows:
public boolean isValidWord(String word) {
if (word.contains(" ")){
for (char i = 'A'; i <= 'Z'; i++) {
if (isValidWord(word.replaceFirst(" ", Character.toString(i))))
return true;
}
return false;
}
else
return wordHashSet.contains(word);
}
As the number of blanks increases, the number of words we have to test increase exponentially. By the time we get to 3 blanks we're having to do 17576 lookups before we can reject a word, and this is affecting game play. Once there are 4 blanks the game will just freeze for a while.
What is the most efficient way for me to check words with multiple blanks? Should I just iterate through the hashset and check if we have a match against each word? If so, then what's the fastest way for me to compare two strings taking the blanks into account? I've tried doing this using a regular expression and String.matches(xx), but it's too slow. A straight String.equals(xx) is fast enough, but that obviously doesn't take blanks into account.
A very fast method, although somewhat challenging to implement, would be to store your words in a Trie - http://en.wikipedia.org/wiki/Trie
A trie is a tree structure that contains a char in every node and an array of pointers pointing to next nodes.
Without blanks it would be easy: just follow the trie structure, which lets you check a word in time linear in its length. When you hit a blank, you loop over all the child nodes and search every possible route.
This can sound complicated and difficult if you are not familiar with tries, but if you get stuck I can help you with some code.
EDIT:
OK, here is some C# code for your problem using tries. I think you will have no problem converting it to Java. If you do, leave a comment and I will help.
Trie.cs
public class Trie
{
private char blank = '_';
public Node Root { get; set; }
public void Insert(String key)
{
Root = Insert(Root, key, 0);
}
public bool Contains(String key)
{
Node x = Find(Root, key, 0);
return x != null && x.NullNode;
}
private Node Find(Node x, String key, int d)
{ // Return value associated with key in the subtrie rooted at x.
if (x == null)
return null;
if (d == key.Length)
{
if (x.NullNode)
return x;
else
return null;
}
char c = key[d]; // Use dth key char to identify subtrie.
if (c == blank)
{
foreach (var child in x.Children)
{
var node = Find(child, key, d + 1);
if (node != null)
return node;
}
return null;
}
else
return Find(x.Children[c], key, d + 1);
}
private Node Insert(Node x, String key, int d)
{ // Change value associated with key if in subtrie rooted at x.
if (x == null) x = new Node();
if (d == key.Length)
{
x.NullNode = true;
return x;
}
char c = key[d]; // Use dth key char to identify subtrie.
x.Children[c] = Insert(x.Children[c], key, d + 1);
return x;
}
public IEnumerable<String> GetAllKeys()
{
return GetKeysWithPrefix("");
}
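public IEnumerable<String> GetKeysWithPrefix(String pre)
{
Queue<String> q = new Queue<String>();
// Walk to the prefix node directly: Find only returns nodes that end a word,
// so it would miss prefixes that are not themselves words.
Node x = Root;
for (int d = 0; x != null && d < pre.Length; d++)
x = x.Children[pre[d]];
Collect(x, pre, q);
return q;
}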
private void Collect(Node x, String pre, Queue<String> q)
{
if (x == null) return;
if (x.NullNode) q.Enqueue(pre);
for (int c = 0; c < 256; c++)
Collect(x.Children[c], pre + ((char)c), q);
}
}
Node.cs
public class Node
{
public bool NullNode { get; set; }
public Node[] Children { get; set; }
public Node()
{
NullNode = false;
Children = new Node[256];
}
}
Sample usage:
Trie tr = new Trie();
tr.Insert("telephone");
while (true)
{
string str = Console.ReadLine();
if( tr.Contains( str ) )
Console.WriteLine("contains!");
else
Console.WriteLine("does not contain!");
}
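For reference, a rough Java rendering of the same idea could look like this. It is a sketch rather than a line-by-line port: BlankTrie and matches are hypothetical names, children are stored in a HashMap instead of a 256-slot array, and the blank is assumed to be a space as in the question (the C# code uses '_').

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Java sketch of a trie lookup that treats a space as a wildcard.
class BlankTrie {
    private static final char BLANK = ' '; // assumption: blanks are spaces, as in the question

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.computeIfAbsent(ch, k -> new Node());
        }
        node.isWord = true;
    }

    boolean contains(String pattern) {
        return matches(root, pattern, 0);
    }

    private boolean matches(Node node, String pattern, int d) {
        if (d == pattern.length()) {
            return node.isWord;
        }
        char c = pattern.charAt(d);
        if (c == BLANK) {
            // A blank may stand for any letter: try every existing branch.
            for (Node child : node.children.values()) {
                if (matches(child, pattern, d + 1)) {
                    return true;
                }
            }
            return false;
        }
        Node child = node.children.get(c);
        return child != null && matches(child, pattern, d + 1);
    }
}
```

Because a blank only fans out over branches that actually exist in the word list, the search usually visits far fewer candidates than the 26 per blank tried by the original loop.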
A straight String.equals(xx) is fast enough, but that obviously
doesn't take blanks into account.
So I recommend implementing this simple solution, which is very close to String.equals() and takes blanks into account:
public boolean isValidWord(String word) {
if (wordHashSet.contains(word)) {
return true;
}
for (String fromHashSet: wordHashSet){
if (compareIgnoreBlanks(fromHashSet, word)) {
return true;
}
}
return false;
}
/**
* Inspired by String.compareTo(String). Compares two String's, ignoring blanks in the String given as
* second argument.
*
* @param s1
* String from the HashSet
* @param s2
* String with potential blanks
* @return true if s1 and s2 match, false otherwise
*/
public static boolean compareIgnoreBlanks(String s1, String s2) {
int len = s1.length();
if (len != s2.length()) {
return false;
}
int k = 0;
while (k < len) {
char c1 = s1.charAt(k);
char c2 = s2.charAt(k);
if (c2 != ' ' && c1 != c2) {
return false;
}
k++;
}
return true;
}
public boolean isValidWord(String word) {
word = word.replaceAll(" ", "[A-Z]"); // the question's words are uppercase, so a blank matches A-Z
Pattern pattern = Pattern.compile(word);
for (String wordFromHashSet: hashSet){
Matcher matcher = pattern.matcher(wordFromHashSet);
if (matcher.matches()) return true;
}
return false;
}
public boolean isValidWord(String word) {
ArrayList<Integer> pos = new ArrayList<Integer>();
for (int i=0; i!=word.length();i++){
if (word.charAt(i) == ' ') pos.add(i);
}
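for (String hashSetWord: hashSet){
// words of a different length can never match (and would break substring below)
if (hashSetWord.length() != word.length()) continue;
for (Integer i: pos){
hashSetWord = hashSetWord.substring(0,i)+" "+hashSetWord.substring(i+1);
}
if (hashSetWord.equals(word)) return true;
}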
return false;
}
A somewhat ugly, but I would guess fairly fast, method would be to create a single string containing all valid words, like this:
WORD1
WORD2
WORD3
etc.
Then use a regex like (^|\n)A[A-Z]PL[A-Z]\n (i.e. replacing all blanks with [A-Z]), and match it on that string.
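A small sketch of this joined-string approach, assuming uppercase A-Z words and a space for each blank. The class name JoinedWordList is hypothetical; whether this is actually faster than the other answers is not measured here.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: all words live in one newline-terminated string.
class JoinedWordList {
    private final String joined; // every word is followed by '\n'

    JoinedWordList(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            sb.append(w).append('\n');
        }
        joined = sb.toString();
    }

    // Assumption: words are uppercase A-Z and a blank tile is a space.
    boolean isValidWord(String word) {
        StringBuilder regex = new StringBuilder("(^|\\n)");
        for (char ch : word.toCharArray()) {
            regex.append(ch == ' ' ? "[A-Z]" : Pattern.quote(String.valueOf(ch)));
        }
        regex.append("\\n");
        // find() scans the whole joined string for one matching line.
        return Pattern.compile(regex.toString()).matcher(joined).find();
    }
}
```

The leading (^|\n) and trailing \n anchor the match to a whole word, so a pattern cannot match in the middle of a longer entry.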

Where do I find a standard Trie based map implementation in Java? [closed]

I have a Java program that stores a lot of mappings from Strings to various objects.
Right now, my options are either to rely on hashing (via HashMap) or on binary searches (via TreeMap). I am wondering if there is an efficient and standard trie-based map implementation in a popular and quality collections library?
I've written my own in the past, but I'd rather go with something standard, if available.
Quick clarification: While my question is general, in the current project I am dealing with a lot of data that is indexed by fully-qualified class name or method signature. Thus, there are many shared prefixes.
You might want to look at the Trie implementation that Limewire is contributing to Google Guava.
There is no trie data structure in the core Java libraries.
This may be because tries are usually designed to store character strings, while Java data structures are more general, usually holding any Object (defining equality and a hash operation), though they are sometimes limited to Comparable objects (defining an order). There's no common abstraction for "a sequence of symbols," although CharSequence is suitable for character strings, and I suppose you could do something with Iterable for other types of symbols.
Here's another point to consider: when trying to implement a conventional trie in Java, you are quickly confronted with the fact that Java supports Unicode. To have any sort of space efficiency, you have to restrict the strings in your trie to some subset of symbols, or abandon the conventional approach of storing child nodes in an array indexed by symbol. This might be another reason why tries are not considered general-purpose enough for inclusion in the core library, and something to watch out for if you implement your own or use a third-party library.
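Since the core library has no trie, one stopgap for prefix queries over keys with many shared prefixes (such as fully-qualified class names) is a range view on a TreeMap. This is not a trie, only a stand-in built on java.util.TreeMap.subMap; the names PrefixLookup and withPrefix are hypothetical, and the sketch assumes the last character of the prefix is not '\uffff'.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper: prefix lookup over a TreeMap, not a real trie.
class PrefixLookup {
    // Returns a view of all entries whose key starts with prefix.
    static <V> SortedMap<String, V> withPrefix(TreeMap<String, V> map, String prefix) {
        if (prefix.isEmpty()) {
            return map;
        }
        // The smallest string greater than every key starting with prefix
        // is the prefix with its last character incremented by one.
        char last = prefix.charAt(prefix.length() - 1);
        String upper = prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
        return map.subMap(prefix, upper); // prefix inclusive, upper exclusive
    }
}
```

Each lookup costs a couple of O(log n) tree descents with full string comparisons, so it does not share prefix storage the way a trie does, but it needs no extra dependency.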
Apache Commons Collections v4.0 now supports trie structures.
See the org.apache.commons.collections4.trie package info for more information. In particular, check the PatriciaTrie class:
Implementation of a PATRICIA Trie (Practical Algorithm to Retrieve Information Coded in Alphanumeric).
A PATRICIA Trie is a compressed Trie. Instead of storing all data at the edges of the Trie (and having empty internal nodes), PATRICIA stores data in every node. This allows for very efficient traversal, insert, delete, predecessor, successor, prefix, range, and select(Object) operations. All operations are performed at worst in O(K) time, where K is the number of bits in the largest item in the tree. In practice, operations actually take O(A(K)) time, where A(K) is the average number of bits of all items in the tree.
Also check out concurrent-trees. They support both Radix and Suffix trees and are designed for high concurrency environments.
I wrote and published a simple and fast implementation here.
What you need is org.apache.commons.collections.FastTreeMap , I think.
Below is a basic HashMap implementation of a Trie. Some people might find this useful...
import java.util.HashMap;
class Trie {
HashMap<Character, HashMap> root;
public Trie() {
root = new HashMap<Character, HashMap>();
}
public void addWord(String word) {
HashMap<Character, HashMap> node = root;
for (int i = 0; i < word.length(); i++) {
Character currentLetter = word.charAt(i);
if (node.containsKey(currentLetter) == false) {
node.put(currentLetter, new HashMap<Character, HashMap>());
}
node = node.get(currentLetter);
}
}
public boolean containsPrefix(String word) {
HashMap<Character, HashMap> node = root;
for (int i = 0; i < word.length(); i++) {
Character currentLetter = word.charAt(i);
if (node.containsKey(currentLetter)) {
node = node.get(currentLetter);
} else {
return false;
}
}
return true;
}
}
Apache's commons collections:
org.apache.commons.collections4.trie.PatriciaTrie
You can try the Completely Java library, it features a PatriciaTrie implementation. The API is small and easy to get started, and it's available in the Maven central repository.
You might look at this TopCoder one as well (registration required...).
If you require a sorted map, then tries are worthwhile.
If you don't, then a hashmap is better.
Hashmap with string keys can be improved over the standard Java implementation:
Array hash map
If you're not worried about pulling in the Scala library, you can use this space efficient implementation I wrote of a burst trie.
https://github.com/nbauernfeind/scala-burst-trie
Here is my implementation, also available on GitHub as MyTrie.java:
/* usage:
MyTrie trie = new MyTrie();
trie.insert("abcde");
trie.insert("abc");
trie.insert("sadas");
trie.insert("abc");
trie.insert("wqwqd");
System.out.println(trie.contains("abc"));
System.out.println(trie.contains("abcd"));
System.out.println(trie.contains("abcdefg"));
System.out.println(trie.contains("ab"));
System.out.println(trie.getWordCount("abc"));
System.out.println(trie.getAllDistinctWords());
*/
import java.util.*;
public class MyTrie {
private class Node {
public int[] next = new int[26];
public int wordCount;
public Node() {
for(int i=0;i<26;i++) {
next[i] = NULL;
}
wordCount = 0;
}
}
private int curr;
private Node[] nodes;
private List<String> allDistinctWords;
public final static int NULL = -1;
public MyTrie() {
nodes = new Node[100000];
nodes[0] = new Node();
curr = 1;
}
private int getIndex(char c) {
return (int)(c - 'a');
}
private void depthSearchWord(int x, String currWord) {
for(int i=0;i<26;i++) {
int p = nodes[x].next[i];
if(p != NULL) {
String word = currWord + (char)(i + 'a');
if(nodes[p].wordCount > 0) {
allDistinctWords.add(word);
}
depthSearchWord(p, word);
}
}
}
public List<String> getAllDistinctWords() {
allDistinctWords = new ArrayList<String>();
depthSearchWord(0, "");
return allDistinctWords;
}
public int getWordCount(String str) {
int len = str.length();
int p = 0;
for(int i=0;i<len;i++) {
int j = getIndex(str.charAt(i));
if(nodes[p].next[j] == NULL) {
return 0;
}
p = nodes[p].next[j];
}
return nodes[p].wordCount;
}
public boolean contains(String str) {
int len = str.length();
int p = 0;
for(int i=0;i<len;i++) {
int j = getIndex(str.charAt(i));
if(nodes[p].next[j] == NULL) {
return false;
}
p = nodes[p].next[j];
}
return nodes[p].wordCount > 0;
}
public void insert(String str) {
int len = str.length();
int p = 0;
for(int i=0;i<len;i++) {
int j = getIndex(str.charAt(i));
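if(nodes[p].next[j] == NULL) {
if(curr == nodes.length) {
// grow the node pool instead of overflowing the fixed-size array
nodes = Arrays.copyOf(nodes, nodes.length * 2);
}
nodes[curr] = new Node();
nodes[p].next[j] = curr;
curr++;
}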
p = nodes[p].next[j];
}
nodes[p].wordCount++;
}
}
I have just tried my own concurrent trie implementation. It is not based on characters; it is based on the key's hash code. We can still use it by keeping a map of maps keyed on each char's hash code.
You can test it using the code at https://github.com/skanagavelu/TrieHashMap/blob/master/src/TrieMapPerformanceTest.java
and https://github.com/skanagavelu/TrieHashMap/blob/master/src/TrieMapValidationTest.java
import java.util.concurrent.atomic.AtomicReferenceArray;
public class TrieMap {
public static int SIZEOFEDGE = 4;
public static int OSIZE = 5000;
}
abstract class Node {
public Node getLink(String key, int hash, int level){
throw new UnsupportedOperationException();
}
public Node createLink(int hash, int level, String key, String val) {
throw new UnsupportedOperationException();
}
public Node removeLink(String key, int hash, int level){
throw new UnsupportedOperationException();
}
}
class Vertex extends Node {
String key;
volatile String val;
volatile Vertex next;
public Vertex(String key, String val) {
this.key = key;
this.val = val;
}
@Override
public boolean equals(Object obj) {
Vertex v = (Vertex) obj;
return this.key.equals(v.key);
}
@Override
public int hashCode() {
return key.hashCode();
}
@Override
public String toString() {
return key +"#"+key.hashCode();
}
}
class Edge extends Node {
volatile AtomicReferenceArray<Node> array; //This is needed to ensure array elements are volatile
public Edge(int size) {
array = new AtomicReferenceArray<Node>(8);
}
@Override
public Node getLink(String key, int hash, int level){
int index = Base10ToBaseX.getBaseXValueOnAtLevel(Base10ToBaseX.Base.BASE8, hash, level);
Node returnVal = array.get(index);
for(;;) {
if(returnVal == null) {
return null;
}
else if((returnVal instanceof Vertex)) {
Vertex node = (Vertex) returnVal;
for(;node != null; node = node.next) {
if(node.key.equals(key)) {
return node;
}
}
return null;
} else { //instanceof Edge
level = level + 1;
index = Base10ToBaseX.getBaseXValueOnAtLevel(Base10ToBaseX.Base.BASE8, hash, level);
Edge e = (Edge) returnVal;
returnVal = e.array.get(index);
}
}
}
@Override
public Node createLink(int hash, int level, String key, String val) { //Remove size
for(;;) { //Repeat the work on the current node, since some other thread modified this node
int index = Base10ToBaseX.getBaseXValueOnAtLevel(Base10ToBaseX.Base.BASE8, hash, level);
Node nodeAtIndex = array.get(index);
if ( nodeAtIndex == null) {
Vertex newV = new Vertex(key, val);
boolean result = array.compareAndSet(index, null, newV);
if(result == Boolean.TRUE) {
return newV;
}
//continue; since new node is inserted by other thread, hence repeat it.
}
else if(nodeAtIndex instanceof Vertex) {
Vertex vrtexAtIndex = (Vertex) nodeAtIndex;
int newIndex = Base10ToBaseX.getBaseXValueOnAtLevel(Base10ToBaseX.Base.BASE8, vrtexAtIndex.hashCode(), level+1);
int newIndex1 = Base10ToBaseX.getBaseXValueOnAtLevel(Base10ToBaseX.Base.BASE8, hash, level+1);
Edge edge = new Edge(Base10ToBaseX.Base.BASE8.getLevelZeroMask()+1);
if(newIndex != newIndex1) {
Vertex newV = new Vertex(key, val);
edge.array.set(newIndex, vrtexAtIndex);
edge.array.set(newIndex1, newV);
boolean result = array.compareAndSet(index, vrtexAtIndex, edge); //REPLACE vertex to edge
if(result == Boolean.TRUE) {
return newV;
}
//continue; since vrtexAtIndex may be removed or changed to Edge already.
} else if(vrtexAtIndex.key.hashCode() == hash) {//vrtex.hash == hash) { HERE newIndex == newIndex1
synchronized (vrtexAtIndex) {
boolean result = array.compareAndSet(index, vrtexAtIndex, vrtexAtIndex); //Double check this vertex is not removed.
if(result == Boolean.TRUE) {
Vertex prevV = vrtexAtIndex;
for(;vrtexAtIndex != null; vrtexAtIndex = vrtexAtIndex.next) {
prevV = vrtexAtIndex; // prevV is used to handle when vrtexAtIndex reached NULL
if(vrtexAtIndex.key.equals(key)){
vrtexAtIndex.val = val;
return vrtexAtIndex;
}
}
Vertex newV = new Vertex(key, val);
prevV.next = newV; // Within SYNCHRONIZATION since prevV.next may be added with some other.
return newV;
}
//Continue; vrtexAtIndex got changed
}
} else { //HERE newIndex == newIndex1 BUT vrtex.hash != hash
edge.array.set(newIndex, vrtexAtIndex);
boolean result = array.compareAndSet(index, vrtexAtIndex, edge); //REPLACE vertex to edge
if(result == Boolean.TRUE) {
return edge.createLink(hash, (level + 1), key, val);
}
}
}
else { //instanceof Edge
return nodeAtIndex.createLink(hash, (level + 1), key, val);
}
}
}
@Override
public Node removeLink(String key, int hash, int level){
for(;;) {
int index = Base10ToBaseX.getBaseXValueOnAtLevel(Base10ToBaseX.Base.BASE8, hash, level);
Node returnVal = array.get(index);
if(returnVal == null) {
return null;
}
else if((returnVal instanceof Vertex)) {
synchronized (returnVal) {
Vertex node = (Vertex) returnVal;
if(node.next == null) {
if(node.key.equals(key)) {
boolean result = array.compareAndSet(index, node, null);
if(result == Boolean.TRUE) {
return node;
}
continue; //Vertex may be changed to Edge
}
return null; //Nothing found; This is not the same vertex we are looking for. Here hashcode is same but key is different.
} else {
if(node.key.equals(key)) { //Removing the first node in the link
boolean result = array.compareAndSet(index, node, node.next);
if(result == Boolean.TRUE) {
return node;
}
continue; //Vertex(node) may be changed to Edge, so try again.
}
Vertex prevV = node; // prevV is used to handle when vrtexAtIndex is found and to be removed from its previous
node = node.next;
for(;node != null; prevV = node, node = node.next) {
if(node.key.equals(key)) {
prevV.next = node.next; //Removing other than first node in the link
return node;
}
}
return null; //Nothing found in the linked list.
}
}
} else { //instanceof Edge
return returnVal.removeLink(key, hash, (level + 1));
}
}
}
}
class Base10ToBaseX {
public static enum Base {
/**
* A Java int is always 32 bits wide, regardless of the machine.
* Those bits can be split into groups of 1, 2, 3, 4, 8 or 16 bits, one group per level.
*/
BASE2(1,1,32), BASE4(3,2,16), BASE8(7,3,11)/* OCTAL*/, /*BASE10(3,2),*/
BASE16(15, 4, 8){
public String getFormattedValue(int val){
switch(val) {
case 10:
return "A";
case 11:
return "B";
case 12:
return "C";
case 13:
return "D";
case 14:
return "E";
case 15:
return "F";
default:
return "" + val;
}
}
}, /*BASE32(31,5,1),*/ BASE256(255, 8, 4), /*BASE512(511,9),*/ Base65536(65535, 16, 2);
private int LEVEL_0_MASK;
private int LEVEL_1_ROTATION;
private int MAX_ROTATION;
Base(int levelZeroMask, int levelOneRotation, int maxPossibleRotation) {
this.LEVEL_0_MASK = levelZeroMask;
this.LEVEL_1_ROTATION = levelOneRotation;
this.MAX_ROTATION = maxPossibleRotation;
}
int getLevelZeroMask(){
return LEVEL_0_MASK;
}
int getLevelOneRotation(){
return LEVEL_1_ROTATION;
}
int getMaxRotation(){
return MAX_ROTATION;
}
String getFormattedValue(int val){
return "" + val;
}
}
public static int getBaseXValueOnAtLevel(Base base, int on, int level) {
if(level > base.getMaxRotation() || level < 1) {
return 0; //INVALID Input
}
int rotation = base.getLevelOneRotation();
int mask = base.getLevelZeroMask();
if(level > 1) {
rotation = (level-1) * rotation;
mask = mask << rotation;
} else {
rotation = 0;
}
return (on & mask) >>> rotation;
}
}
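The Edge class above picks a child slot by slicing the key's hash code into 3-bit groups, one group per trie level (level 1 is the lowest three bits). Here is a small sketch of that slicing, written independently of Base10ToBaseX. The names HashDigits and octalDigit are hypothetical. It uses shift-then-mask, which should yield the same value as the mask-then-shift form above for BASE8 levels 1 to 11.

```java
// Hypothetical sketch: extract the 3-bit group of a hash for a given trie level.
class HashDigits {
    /** Returns the 3-bit group of hash at the 1-based level (level 1 = lowest bits). */
    public static int octalDigit(int hash, int level) {
        int shift = (level - 1) * 3;
        return (hash >>> shift) & 7; // unsigned shift, then keep the low 3 bits
    }

    public static void main(String[] args) {
        int hash = 0b101_011_110; // three groups: 101 (5), 011 (3), 110 (6)
        System.out.println(octalDigit(hash, 1)); // 6
        System.out.println(octalDigit(hash, 2)); // 3
        System.out.println(octalDigit(hash, 3)); // 5
        System.out.println(octalDigit(hash, 4)); // 0

        // Level 11 only has two bits left in a 32-bit int (bits 30 and 31).
        System.out.println(octalDigit(-1, 11));  // 3
    }
}
```

Two keys share a slot at a given level only while their groups match, which is why createLink descends one level at a time until the groups differ or the full hash codes are equal.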
