Data structure for holding sets of interchangeable strings

I have a set of strings. Out of these, groups of 2 or more may represent the same thing. These groups should be stored in a way that given any member of the group, you can fetch other members of the group with high efficiency.
So given this initial set: ["a","b1","b2","c1","c2","c3"], the resulting structure should be something like ["a",["b1","b2"],["c1","c2","c3"]], and Fetch("b1") should return ["b1","b2"].
Is there a specific data structure and/or algorithm for this purpose?
EDIT: "b1" and "b2" are not actual strings, they're indicating that the 2 belong to the same group. Otherwise a Trie would be a perfect fit.

I may be misinterpreting the initial problem setup, but I believe that there is a simple and elegant solution to this problem using off-the-shelf data structures. The idea is, at a high level, to create a map from strings to sets of strings. Each key in the map will be associated with the set of strings that it's equal to. Assuming that each string in a group is mapped to the same set of strings, this can be done time- and space-efficiently.
The algorithm would probably look like this:
1) Construct a map M from strings to sets of strings.
2) Group all strings together that are equal to one another (this step depends on how the strings and groups are specified).
3) For each cluster:
   a) Create a canonical set of the strings in that cluster.
   b) Add each string to the map as a key whose value is the canonical set.
This algorithm and the resulting data structure are quite efficient. Assuming that you already know the clusters in advance, this process (using a trie as the implementation of the map and a simple list as the data structure for the sets) requires you to visit each character of each input string exactly twice: once when inserting it into the trie and once when adding it to the set of strings equal to it, assuming that you're making a deep copy. This is therefore an O(n) algorithm, where n is the total number of characters across all input strings.
Moreover, lookup is quite fast - to find the set of strings equal to some string, just walk the trie to find the string, look up the associated set of strings, then iterate over it. This takes O(L + k) time, where L is the length of the string and k is the number of matches.
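The approach described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it swaps in a HashMap for the trie, and the names EquivalenceIndex, addCluster and fetch are made up for this example. It assumes the clusters are already known.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps every string to the one shared list holding its whole group.
class EquivalenceIndex {
    private final Map<String, List<String>> groupOf = new HashMap<>();

    // Registers one cluster; every member becomes a key pointing at the same canonical list.
    void addCluster(List<String> cluster) {
        List<String> canonical = Collections.unmodifiableList(new ArrayList<>(cluster));
        for (String member : canonical) {
            groupOf.put(member, canonical);
        }
    }

    // Returns the group containing s (including s itself), or an empty list if s is unknown.
    List<String> fetch(String s) {
        return groupOf.getOrDefault(s, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        EquivalenceIndex index = new EquivalenceIndex();
        index.addCluster(Arrays.asList("a"));
        index.addCluster(Arrays.asList("b1", "b2"));
        index.addCluster(Arrays.asList("c1", "c2", "c3"));
        System.out.println(index.fetch("b1")); // prints "[b1, b2]"
    }
}
```

Because every member of a group points at the same list object, the extra memory per string is a single map entry.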
Hope this helps, and let me know if I've misinterpreted the problem statement!

Since this is Java, I would use a HashMap<String, Set<String>>. This would map each string to its equivalence set (which would contain that string and all others that belong to the same group). How you would construct the equivalence sets from the input depends on how you define "equivalent". If the inputs are in order by group (but not actually grouped), and if you had a predicate implemented to test equivalence, you could do something like this:
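// Requires: import java.util.*;
boolean differentGroups(String a, String b) {
    // equivalence test (must handle a == null); the actual rule depends on your data
    throw new UnsupportedOperationException("equivalence test not defined yet");
}

Map<String, Set<String>> makeMap(ArrayList<String> input) {
    Map<String, Set<String>> map = new HashMap<String, Set<String>>();
    String representative = null;
    // Assigned on the first iteration, because differentGroups(null, x) must return true.
    Set<String> group = null;
    for (String next : input) {
        if (differentGroups(representative, next)) {
            representative = next;
            group = new HashSet<String>();
        }
        group.add(next);
        map.put(next, group);
    }
    return map;
}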
Note that this only works if the groups are contiguous elements in the input. If they aren't you'll need more complex bookkeeping to build the group structure.
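For the non-contiguous case, one possible form of that bookkeeping is sketched below. It rests on an assumption that is not in the original: that equivalence can be expressed by a function mapping each string to a canonical group key. The class name GroupBuilder and the groupKey parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: builds the string -> group map when group members are scattered in the input.
class GroupBuilder {
    static Map<String, Set<String>> makeMap(List<String> input, Function<String, String> groupKey) {
        Map<String, Set<String>> groupByKey = new HashMap<>(); // one set per canonical key
        Map<String, Set<String>> result = new HashMap<>();     // every string -> its shared set
        for (String next : input) {
            Set<String> group = groupByKey.computeIfAbsent(groupKey.apply(next), k -> new HashSet<>());
            group.add(next);
            result.put(next, group);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption for the demo only: strings with the same first character are equivalent.
        Map<String, Set<String>> map =
                makeMap(Arrays.asList("b1", "c1", "a", "b2", "c2"), s -> s.substring(0, 1));
        System.out.println(map.get("b2").contains("b1")); // prints "true"
    }
}
```

If equivalence is only available as a pairwise test, a union-find structure would be the usual alternative.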

Related

which datastructure for this hashmap scenario

I have a scenario where I store values in a HashMap.
Keys are strings like
fruits
fruits_citrus_orange
fruits_citrus_lemon
fruits_fleshly_apple
fruits_fleshly
fruits_dry
and so on.
Values are some objects. Now, for a given input such as fruits_fleshly, I need to retrieve all entries whose key starts with "fruits_fleshly".
In the above case I need to fetch
fruits_fleshly_apple
fruits_fleshly
One way to do this is to call String.indexOf on every key. Is there a more efficient way to do this than iterating over all the keys in the map?

Although these are strings, they look to me like categories and subcategories: fruits, fruits-fleshly, fruits-citrus, and so on.
If that is the case, you can implement a tree data structure instead, which should make this kind of search efficient.
Since a tree has a parent-child structure, there is a root node with child nodes below it. You can have a structure like this:
(0)      (1)        (2)
fruits
|_____citrus
|     |_____lemon
|     |_____orange
|
|_____fleshly
      |_____apple
      |_____
In this structure, if you want to search for citrus fruits, you go to the citrus node and list all its children. You can then construct each full name by concatenating the node names along the path from the root to the leaf.
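A rough sketch of such a tree is shown below. The class CategoryNode and its methods are invented for illustration, and the sketch assumes keys are split on "_", so a lookup matches whole segments only (for example "fruits_fleshly", but not "fruits_fle").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical tree node: one node per key segment, children sorted by name.
class CategoryNode {
    private final Map<String, CategoryNode> children = new TreeMap<>();
    private Object value;    // payload stored for the key ending at this node, if any
    private boolean present; // true if the path to this node was inserted as a full key

    // Inserts a key such as "fruits_citrus_lemon", creating intermediate nodes as needed.
    void put(String key, Object v) {
        CategoryNode node = this;
        for (String part : key.split("_")) {
            node = node.children.computeIfAbsent(part, p -> new CategoryNode());
        }
        node.value = v;
        node.present = true;
    }

    // Lists every inserted key equal to the prefix or lying below it in the tree.
    List<String> keysUnder(String prefix) {
        CategoryNode node = this;
        for (String part : prefix.split("_")) {
            node = node.children.get(part);
            if (node == null) {
                return new ArrayList<>();
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, prefix, out);
        return out;
    }

    // Depth-first walk that rebuilds full names from the path.
    private static void collect(CategoryNode node, String path, List<String> out) {
        if (node.present) {
            out.add(path);
        }
        for (Map.Entry<String, CategoryNode> e : node.children.entrySet()) {
            collect(e.getValue(), path + "_" + e.getKey(), out);
        }
    }
}
```

With the six keys from the question inserted into a root node, keysUnder("fruits_fleshly") visits only the fleshly subtree instead of scanning every key.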

Iterating over the map is a simple and straightforward way of doing this. However, if you don't want to iterate over the keys yourself and are OK with using a third-party library, you can use Guava's Maps#filterKeys.
Here's how it would work:
Map<String, Object> filtered = Maps.filterKeys(
    yourMap,
    Predicates.containsPattern("^fruits_fleshly"));
But that still iterates over the map behind the scenes, so the iteration is still there if efficiency is your concern.

Since HashMap doesn't maintain any order for its keys it's not a very good choice for this problem. A better choice is the TreeMap: it has methods for retrieving a sub map for a range of keys. These methods run in O(log n) time (n number of entries) so it's better than iterating over the keys.
Map subMap = myMap.subMap("fruits_fleshly", true, "fruits_fleshly\uffff", true);
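A small self-contained sketch of the range lookup follows. The helper name withPrefix is made up for this example, and it assumes no key contains the character \uffff immediately after the prefix.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper around NavigableMap.subMap for prefix queries.
class PrefixLookup {
    // Returns a live view of all entries whose key starts with the given prefix.
    static <V> SortedMap<String, V> withPrefix(NavigableMap<String, V> map, String prefix) {
        // Upper bound: the prefix followed by the largest char value, exclusive.
        return map.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> map = new TreeMap<>();
        map.put("fruits", 1);
        map.put("fruits_citrus_orange", 2);
        map.put("fruits_citrus_lemon", 3);
        map.put("fruits_fleshly_apple", 4);
        map.put("fruits_fleshly", 5);
        map.put("fruits_dry", 6);
        System.out.println(withPrefix(map, "fruits_fleshly").keySet());
        // prints "[fruits_fleshly, fruits_fleshly_apple]"
    }
}
```

The returned map is a view backed by the original TreeMap, so no entries are copied.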

The nature of a hashmap means that there's no way to do a "like" comparison on keys - you have to iterate over them all to find where key.startsWith(input).
I suppose you could nest hashmaps and split up your keys. E.g.,
{
"fruits":{
"citrus":{
"orange":(value),
"lemon":(value)
},
"fleshly":{
"apple":(value),
"":(value)
}
}
}
...etc.
The performance overhead is probably horrific on a small scale (though that may not matter in a homework context), but it may not be so bad if you're dealing with a lot of data and only a couple of layers of nesting.
Alternatively, create a Category object with a List of Categories (sub-categories) and a List of entries.

I believe a radix trie is what you are looking for. It is a similar idea to @ay89's solution.
You can just use this open source library (Radix Trie example). It performs better than O(log(N)). With a decent radix trie implementation, you will be able to find the hashmap assigned to a key in average constant time with respect to N (proportional to the number of underscores in your search key string). Given these keys:
fruits
fruits_citrus_orange
fruits_citrus_lemon
fruits_fleshly_apple
fruits_fleshly
fruits_dry
Trie<String, Map> trie = new PatriciaTrie<>();
trie.put("fruits", hashmap1);
trie.put("fruits_citrus_orange", hashmap2);
trie.put("fruits_citrus_lemon", hashmap3);
trie.put("fruits_fleshly_apple", hashmap4);
trie.put("fruits_fleshly", hashmap5);
Map.Entry<String, Map> entry = trie.select("fruits_fleshly");
If you just want one hashmap to be returned by select, you might be able to get slightly better performance if you implement your own radix trie.

Implementing a complex method to compare two sets of Strings when groups of one are associated with a third

I have a rather complex method that I need to implement. So please bear with me as I try to describe it in as simple a way as I can.
I'm given a set A of Strings that represent filenames - let's say "abc", "def", and "ghi". I must derive from these names a set B of "associated" filenames for each - let's say "abc_123", "abc_456", and "abc_789" for "abc"; "def_123", "def_456", and "def_789" for "def"; and "ghi_123", "ghi_456", and "ghi_789" for "ghi". This much I can do. However, these associated filenames may have prefixes or suffixes attached to them which are unpredictable strings of characters - so the associated filenames for "abc" might actually be "HELLOabc_123WORLD", "FOOabc_456BAR", and "999abc_789000". (In terms of regular expressions, it's just a matter of putting a * on both sides of the associated filenames I had written above). In short, the associated filenames will look like this:
*<original filename><other piece that I know>*
where the stars indicate any number of random characters (could be 0).
That's the first piece of the puzzle.
Next, I am given another set C of Strings that I am to compare to the set of associated filenames (set B). (In case you're wondering, I'm trying to check if the associated files are in a certain directory, and I have the list of filenames in that directory, set C). If I find all the associated filenames for a certain file in set C, I can go ahead and check off that file from set A. I must go and check each filename in set A, and if all of its associated filenames from set B are found in set C, I can check off that file from set A.
Finally, I must return the filenames from set A that were not checked off (so I'd return nothing if everything was found).
I've been struggling to come up with a way to implement this method. I thought of creating a Map that would map a filename from set A to a List containing all the filenames associated with that filename like so:
Key Value
abc *abc_123*, *abc_456*, *abc_789*
def *def_123*, *def_456*, *def_789*
ghi *ghi_123*, *ghi_456*, *ghi_789*
I could then traverse the elements of this map, and the values of the elements, comparing them to the Strings in set C. And if all Elements of the value (the List) for a given Key are found in set C, I could mark that Key off my list. Any Keys remaining would be returned.
This seems to me like it should work, but the actual mechanics of putting that into code have been very challenging for me. So if you could give me any small suggestions or pointers that would move my thinking in the right direction, I'd appreciate it very much. My implementation language will be Java in case you'd like to give code. Pseudocode is welcomed as well.

I'm not quite sure if I understand your problem correctly, but this is how I imagine it:
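// Requires: import java.util.List; import java.util.regex.Pattern;
List<String> setCStrings = ...; // your set C strings, I guess
for (String aSetCString : setCStrings) {
    // filename comes from set A; associatedFileName is the known piece, e.g. "123".
    // Pattern.quote keeps regex metacharacters in the names from being interpreted.
    String pattern = ".*" + Pattern.quote(filename + "_" + associatedFileName) + ".*";
    if (aSetCString.matches(pattern)) {
        // do what you want with this string that matches the filename and associated file name
    }
}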

I think you have correctly identified part of the solution to your problem. You will need a Map for sure, and essentially your problem boils down to checking whether all elements of a given list are present in the set of strings C. I am not sure what data structure is holding the strings of set B or C, but I can provide pseudocode for the rest:
Initialize a HashMap<String, ArrayList>
for each string in set B
    if it matches the pattern *abc_*
        if "abc" is already in the Map
            get the value of this key in a temp list and append this string at the end of the list
        else
            add a new entry into the Map
    // follow the same for the other patterns
for each entry in the Map
    traverse the list of values
        check if this value is present in the set C
    if you reach the end of the list,
        remove the entry from the Map
That way, you are left with only the keys in the Map that you need to return.
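One way to turn this idea into Java is sketched below. It makes assumptions that are not in the original posts: the known pieces are passed in as a list of suffixes such as "_123", and a directory entry counts as a match if it merely contains the original filename followed by the suffix (which is what the * on both sides amounts to). The names AssociatedFiles and findUnmatched are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper: returns the names from set A whose associated files are not all in set C.
class AssociatedFiles {
    static List<String> findUnmatched(Collection<String> setA,
                                      List<String> knownSuffixes,
                                      Collection<String> setC) {
        List<String> unmatched = new ArrayList<>();
        for (String name : setA) {
            if (!allAssociatedFound(name, knownSuffixes, setC)) {
                unmatched.add(name);
            }
        }
        return unmatched;
    }

    // True only if every name+suffix combination occurs inside at least one entry of set C.
    private static boolean allAssociatedFound(String name, List<String> suffixes, Collection<String> setC) {
        for (String suffix : suffixes) {
            String needle = name + suffix;
            boolean found = false;
            for (String candidate : setC) {
                if (candidate.contains(needle)) { // same effect as the pattern *needle*
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("abc", "def");
        List<String> suffixes = Arrays.asList("_123", "_456");
        List<String> c = Arrays.asList("HELLOabc_123WORLD", "FOOabc_456BAR", "def_123");
        System.out.println(findUnmatched(a, suffixes, c)); // prints "[def]"
    }
}
```

Using contains avoids building regular expressions at all, which also sidesteps escaping problems when filenames contain characters such as "." or "+".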

Building an inverted index in Java-logic

I have a collection of around 1500 documents. I parsed each document and extracted tokens. These tokens are stored in a HashMap (as keys), and the total number of times they occur in the collection (i.e. frequency) is stored as the value.
I have to extend this to build an inverted index. That is: term (key) | number of documents it occurs in --> DocNo | frequency in that document. For example,
Term      DocFreq  DocNum  TermFreq
data      3        1       12
                   23      31
                   100     17
customer  2        22      43
                   19      2
Currently, I have the following in Java,
hashmap<string,integer>
for(each document)
{
    extract line
    for(each line)
    {
        extract word
        for(each word)
        {
            perform some operations
            get value for word from hashmap and increment by one
        }
    }
}
I have to build on this code. I can't really think of a good way to implement an inverted index.
So far, I thought of making value a 2D array. So the term would be the key and the value(i.e 2D array) would store the docId and termFreq.
Please let me know if my logic is correct.

I would do it by using a Map<String, TermFrequencies>. This map would maintain a TermFrequencies object for each term found. The TermFrequencies object would have the following methods:
void addOccurrence(String documentId);
int getTotalNumberOfOccurrences();
Set<String> getDocumentIds();
int getNumberOfOccurrencesInDocument(String documentId);
It would use a Map<String, Integer> internally to associate each document the term occurs in with the number of occurrences of the term in the document.
The algorithm would be extremely simple:
for (each document) {
    extract line
    for (each line) {
        extract word
        for (each word) {
            TermFrequencies termFrequencies = map.get(word);
            if (termFrequencies == null) {
                termFrequencies = new TermFrequencies(word);
                map.put(word, termFrequencies); // register the new term so later lookups find it
            }
            termFrequencies.addOccurrence(document);
        }
    }
}
The addOccurrence() method would simply increment a counter for the total number of occurrences, and would insert or update the number of occurrences in the internal map.
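A minimal sketch of what such a TermFrequencies class might look like is given below. The method names follow the list above; the field names and the getTerm accessor are illustrative choices, not something prescribed by the answer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the per-term record: total count plus a count per document.
class TermFrequencies {
    private final String term;
    private final Map<String, Integer> occurrencesByDocument = new HashMap<>();
    private int totalOccurrences;

    TermFrequencies(String term) {
        this.term = term;
    }

    String getTerm() {
        return term;
    }

    // Records one more occurrence of the term in the given document.
    void addOccurrence(String documentId) {
        totalOccurrences++;
        occurrencesByDocument.merge(documentId, 1, Integer::sum);
    }

    int getTotalNumberOfOccurrences() {
        return totalOccurrences;
    }

    // The size of this set is the document frequency of the term.
    Set<String> getDocumentIds() {
        return Collections.unmodifiableSet(occurrencesByDocument.keySet());
    }

    int getNumberOfOccurrencesInDocument(String documentId) {
        return occurrencesByDocument.getOrDefault(documentId, 0);
    }
}
```

The DocFreq column of the desired output then corresponds to getDocumentIds().size(), and TermFreq to getNumberOfOccurrencesInDocument(docId).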

I think it is best to have two structures: a Map<docnum, Map<term,termFreq>> and a Map<term, Set<docnum>>. Your docFreqs can be read off as set.size in the values of the second map. This solution involves no custom classes and allows a quick retrieval of everything needed.
The first map contains all the information, and the second one is a derivative that allows quick lookup by term. As you process a document, you fill the first map. You can derive the second map afterwards, but it is also easy to build it in the same pass.

I once implemented what you're asking for. The problem with your approach is that it is not abstract enough. You should model Terms, Documents and their relationships using objects. In a first run, create the term index and document objects and iterate over all terms in the documents while populating the term index. Afterwards, you have a representation in memory that you can easily transform into the desired output.
Do not start by thinking about 2d-arrays in an object oriented language. Unless you want to solve a mathematical problem or optimize something it's not the right approach most of the time.

I don't know if this is still a hot question, but I would recommend doing it like this:
Run over all your documents and give them ids in increasing order. For each document, run over all its words.
Keep a HashMap that maps Strings (your words) to an array of DocTermObjects. A DocTermObject contains a docId and a TermFrequency.
For each word in a document, look it up in your HashMap. If there is no array of DocTermObjects for it yet, create one; otherwise look at its very LAST element only (this is important for the runtime: because docIds are assigned in increasing order, the current document can only be the last entry). If this element has the docId you are processing at the moment, increase its TermFrequency. Otherwise, or if the array is empty, add a new DocTermObject with the current docId and set its TermFrequency to 1.
Later you can use this data structure to compute scores, for example. You could also store the scores in the DocTermObjects, of course.
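The last-element trick can be sketched like this. The class and method names (PostingsIndex, DocTerm, add, postingsFor) are invented for the example, and the sketch assumes documents are processed one after another in increasing docId order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical inverted index: word -> list of (docId, termFrequency), ordered by docId.
class PostingsIndex {
    static final class DocTerm {
        final int docId;
        int termFrequency;

        DocTerm(int docId) {
            this.docId = docId;
            this.termFrequency = 1;
        }
    }

    private final Map<String, List<DocTerm>> postings = new HashMap<>();

    // Must be called with non-decreasing docId values, otherwise the last-element check is not valid.
    void add(int docId, String word) {
        List<DocTerm> list = postings.computeIfAbsent(word, w -> new ArrayList<>());
        DocTerm last = list.isEmpty() ? null : list.get(list.size() - 1);
        if (last != null && last.docId == docId) {
            last.termFrequency++;        // same document as the previous occurrence
        } else {
            list.add(new DocTerm(docId)); // first occurrence of the word in this document
        }
    }

    // The size of the returned list is the document frequency of the word.
    List<DocTerm> postingsFor(String word) {
        return postings.getOrDefault(word, new ArrayList<>());
    }
}
```

Checking only the last element keeps each update at constant cost, instead of searching the whole list for the current document.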
Hope it helped :)

Using Binary Trees to find Anagrams

I am currently trying to create a method that uses a binary tree that finds anagrams of a word inputted by the user.
If the tree does not contain any other anagram for the word (i.e., if the key was not in the tree or the only element in the associated linked list was the word provided by the user), the message "no anagram found " gets printed
For example, if key "opst" appears in the tree with an associated linked list containing the words "spot", "pots", and "tops", and the user gave the word "spot", the program should print "pots" and "tops" (but not spot).
public boolean find(K thisKey, T thisElement){
    return find(root, thisKey, thisElement);
}
public boolean find(Node current, K thisKey, T thisElement){
    if (current == null)
        return false;
    else{
        int comp = current.key.compareTo(thisKey);
        if (comp>0)
            return find(current.left, thisKey, thisElement);
        else if(comp<0)
            return find(current.right, thisKey, thisElement);
        else{
            return current.item.find(thisElement);
        }
    }
}
While I created this method to find if the element provided is in the tree (and the associated key), I was told not to reuse this code for finding anagrams.
K is a generic type that extends Comparable and represents the Key, T is a generic type that represents an item in the list.
If extra methods I've done are required, I can edit this post, but I am absolutely lost. (Just need a pointer in the right direction)

It's a little unclear what exactly is tripping you up (beyond "I've written a nice find method but am not allowed to use it."), so I think the best thing to do is start from the top.
I think you will find that once you get your data structured in just the right way, the actual algorithms will follow relatively easily (many computer science problems share this feature.)
You have three things:
1) Many linked lists, each of which contains the set of anagrams of some set of letters. I am assuming you can generate these lists as you need to.
2) A binary tree, that maps Strings (keys) to lists of anagrams generated from those strings. Again, I'm assuming that you are able to perform basic operations on these trees--adding elements, finding elements by key, etc.
3) A user-inputted String.
Insight: The anagrams of a group of letters form an equivalence class. This means that any member of an anagram list can be used as the key associated with the list. Furthermore, it means that you don't need to store in your tree multiple keys that point to the same list (provided that we are a bit clever about structuring our data; see below).
In concrete terms, there is no need to have both "spot" and "opts" as keys in the tree pointing to the same list, because once you can find the list using any anagram of "spot", you get all the anagrams of "spot".
Structuring your data cleverly: Given our insight, assume that our tree contains exactly one key for each unique set of anagrams. So "opts" maps to {"opts", "pots", "spot", etc.}. What happens if our user gives us a String that we're not using as the key for its set of anagrams? How do we figure out that if the user types "spot", we should find the list that is keyed by "opts"?
The answer is to normalize the data stored in our data structures. This is a computer-science-y way of saying that we enforce arbitrary rules about how we store the data. (Normalizing data is a useful technique that appears repeatedly in many different computer science domains.) The first rule is that we only ever have ONE key in our tree that maps to a given linked list. Second, what if we make sure that each key we actually store is predictable--that is we know that we should search for "opts" even if the user types "spot"?
There are many ways to achieve this predictability--one simple one is to make sure that the letters of every key are in alphabetical order. Then, we know that every set of anagrams will be keyed by the (unique!) string made of its letters in alphabetical order. Note that this key ("opst" for the example above) does not have to be a word in the list itself. Consistently enforcing this rule makes it easy to search the tree--we know that no matter what string the user gives us, the key we want is the string formed from alphabetizing the user's input.
Putting it together: I'll provide the high-level algorithm here to make this a little more concrete.
1) Get a String from the user (hold on to this String, we'll need it later)
2) Turn this string into a search key that follows our normalization scheme
(You can do this in the constructor of your "K" class, which ensures that you will never have a non-normalized key anywhere in your program.)
3) Search the tree for that key, and get the linked list associated with it. This list contains every anagram of the user's input String.
4) Print every item in the list that isn't the user's original string (see why we kept the string handy?)
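The four steps above can be sketched as follows. This is only an illustration: it uses java.util.TreeMap (a balanced binary search tree) as a stand-in for the hand-written tree from the question, and the names AnagramIndex, normalize and anagramsOf are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for the assignment's tree: normalized key -> linked list of words.
class AnagramIndex {
    private final TreeMap<String, LinkedList<String>> tree = new TreeMap<>();

    // Step 2: the normalized key is the word's letters in alphabetical order.
    static String normalize(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    void add(String word) {
        tree.computeIfAbsent(normalize(word), k -> new LinkedList<>()).add(word);
    }

    // Steps 3 and 4: find the list for the key, then drop the user's own word.
    List<String> anagramsOf(String word) {
        List<String> result = new ArrayList<>();
        LinkedList<String> group = tree.get(normalize(word));
        if (group != null) {
            for (String candidate : group) {
                if (!candidate.equals(word)) {
                    result.add(candidate);
                }
            }
        }
        return result; // empty means "no anagram found"
    }

    public static void main(String[] args) {
        AnagramIndex index = new AnagramIndex();
        index.add("spot");
        index.add("pots");
        index.add("tops");
        System.out.println(index.anagramsOf("spot")); // prints "[pots, tops]"
    }
}
```

In the assignment's own tree, normalize would live wherever keys are created (for instance the constructor of the key class), so that every key in the tree is already in normalized form.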
Takeaways:
Frequently, your data will have some special features that allow you to be clever. In this case it is the fact that any member of an anagram list can be the sole key we store for that list.
Normalizing your data gives you predictability and allows you to reason about it effectively. How much more difficult would the "find" algorithm be if each key could be an arbitrary member of its anagram list?
Corollary: Getting your data structures exactly right (What am I storing? How are the pieces connected? How is it represented?) will make it much easier to write your algorithms.

What about sorting the characters in each word, and then comparing the sorted strings?

nth item of hashmap

HashMap selections = new HashMap<Integer, Float>();
How can I get the Integer key of the 3rd smallest Float value in the whole HashMap?
Edit
I'm using the HashMap for this:
for (InflatedRunner runner : prices.getRunners()) {
    for (InflatedMarketPrices.InflatedPrice price : runner.getLayPrices()) {
        if (price.getDepth() == 1) {
            selections.put(new Integer(runner.getSelectionId()), new Float(price.getPrice()));
        }
    }
}
I need the runner with the 3rd smallest price at depth 1.
Maybe I should implement this in another way?

Michael Mrozek nails it with his question of whether you're using HashMap right: this is a highly atypical scenario for a HashMap. That said, you can do something like this:
1) Get the Set<Map.Entry<K,V>> from HashMap<K,V>.entrySet().
2) addAll to a List<Map.Entry<K,V>>.
3) Collections.sort the list with a custom Comparator<Map.Entry<K,V>> that sorts based on V.
If you only need the 3rd Map.Entry<K,V>, then an O(N) selection algorithm may suffice.
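The sort-based steps can be sketched as below. The helper name keyOfNthSmallestValue is hypothetical, and it uses Map.Entry.comparingByValue (Java 8+) as the comparator. When values tie, which of the tied keys is returned is not specified.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: finds the key whose value is the n-th smallest (n is 1-based).
class NthSmallest {
    static <K, V extends Comparable<? super V>> K keyOfNthSmallestValue(Map<K, V> map, int n) {
        if (n < 1 || n > map.size()) {
            throw new IllegalArgumentException("n must be between 1 and " + map.size());
        }
        // Copy the entries into a list and sort them by value: O(N log N).
        List<Map.Entry<K, V>> entries = new ArrayList<>(map.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        return entries.get(n - 1).getKey();
    }

    public static void main(String[] args) {
        Map<Integer, Float> selections = new HashMap<>();
        selections.put(101, 3.5f);
        selections.put(102, 1.2f);
        selections.put(103, 2.8f);
        selections.put(104, 9.0f);
        System.out.println(keyOfNthSmallestValue(selections, 3)); // prints "101"
    }
}
```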
//after edit
It looks like selection should really be a SortedMap<Float, InflatedRunner>. You should look at java.util.TreeMap.
Here's an example of how TreeMap can be used to get the 3rd lowest key:
TreeMap<Integer,String> map = new TreeMap<Integer,String>();
map.put(33, "Three");
map.put(44, "Four");
map.put(11, "One");
map.put(22, "Two");
int thirdKey = map.higherKey(map.higherKey(map.firstKey()));
System.out.println(thirdKey); // prints "33"
Also note how I take advantage of Java's auto-boxing/unboxing feature between int and Integer. I noticed that you used new Integer and new Float in your original code; this is unnecessary.
//another edit
It should be noted that if you have multiple InflatedRunner with the same price, only one will be kept. If this is a problem, and you want to keep all runners, then you can do one of a few things:
If you really need a multi-map (one key can map to multiple values), then you can:
have TreeMap<Float,Set<InflatedRunner>>
Use Multimap from Google Collections (now part of Guava)
If you don't need the map functionality, then just have a List<RunnerPricePair> (sorry, I'm not familiar with the domain to name it appropriately), where RunnerPricePair implements Comparable<RunnerPricePair> that compares on prices. You can just add all the pairs to the list, then either:
Collections.sort the list and get the 3rd pair
Use O(N) selection algorithm

Are you sure you're using hashmaps right? They're used to quickly lookup a value given a key; it's highly unusual to sort the values and then try to find a corresponding key. If anything, you should be mapping the float to the int, so you could at least sort the float keys and get the integer value of the third smallest that way

You have to do it in steps:
Get the Collection<V> of values from the Map
Sort the values
Choose the index of the nth smallest
Think about how you want to handle ties.

You could do it with the google collections BiMap, assuming that the Floats are unique.

If you regularly need to get the key of the nth item, consider:
using a TreeMap, which efficiently keeps keys in sorted order
then using a double map (i.e. one TreeMap mapping integer > float, the other mapping float > integer)
You have to weigh up the inelegance and potential risk of bugs from needing to maintain two maps with the scalability benefit of having a structure that efficiently keeps the keys in order.
You may need to think about two keys mapping to the same float...
P.S. Forgot to mention: if this is an occasional function, and you just need to find the nth largest item of a large number of items, you could consider implementing a selection algorithm (effectively, you do a sort, but don't actually bother sorting subparts of the list that you realise you don't need to sort because their order makes no difference to the position of the item you're looking for).
