Data structure for storing word associations - java

I'm trying to implement prediction by analyzing sentences. Consider the following [rather boring] sentences
Call ABC
Call ABC again
Call DEF
I'd like to have a data structure for the above sentences as follows:
Call: (ABC, 2), (again, 1), (DEF, 1)
ABC: (Call, 2), (again, 1)
again: (Call, 1), (ABC, 1)
DEF: (Call, 1)
In general, Word: (Word_it_appears_with, Frequency), ....
Please note the inherent redundancy in this type of data. Obviously, if the frequency of ABC is 2 under Call, the frequency of Call is 2 under ABC. How do I optimize this?
The idea is to use this data when a new sentence is being typed. For example, if Call has been typed, from the data, it's easy to say ABC is more likely to be present in the sentence, and offer it as the first suggestion, followed by again and DEF.
I realise this is one of a million possible ways of implementing prediction, and I eagerly look forward to suggestions of other ways to do it.
Thanks

Maybe using a bidirectional graph. You can store the words as nodes, with edges as frequencies.

You can use the following data structure too:
Map<String, Map<String, Long>>

I would consider one of two options:
Option 1:
class Freq {
String otherWord;
int freq;
}
Multimap<String, Freq> mymap;
or maybe a Table
Table<String, String, int>
Given the above Freq: you might want to do bi-directional mapping:
class Freq{
String thisWord;
int otherFreq;
Freq otherWord;
}
This would allow for very quick updating of data pairs.

Related

which datastructure for this hashmap scenario

I have a scenario where i store values in a hashmap.
Keys are strings like
fruits
fruits_citrus_orange
fruits_citrus_lemon
fruits_fleshly_apple
fruits_fleshly
fruits_dry
and so on.
Values are some objects. Now for a given input say fruits_fleshly i need to retrieve all cases where it starts with "fruits_fleshly"
In the above case I need to fetch
fruits_fleshly_apple
fruits_fleshly
One way to do this is by doing String.indexOf over all the keys. Is there any other effective way to do this instead of iterating over all the keys in a map
though these are strings, but to me, it looks like these are certain categories & sub categories, like fruit, fruit-freshly, fruit-citrus etc..
If that is a case you can instead implement a Tree data-structure. This would be most effective for search operation.
since Tree has a parent-child structure, there is a root node & child node. You can have a structure like this:
(0) (1) (2)
fruit
|_____citrus
| |_____lemon
| |_____orange
|
|_____freshly
|_____apple
|_____
in this structure, say if you want to search for citrus fruit, you can just go to citrus, and list all its child. And finally you can construct full name by concatenating the name as a path from root to leaves.
Iterating the map seems quite simple and straight-forward way of doing this. However, since you don't want to iterate over keys on your own, you can use Guava's Maps#filterEntries, if you are ok with using 3rd party library.
Here's how it would work:
Map<String, Object> = Maps.filterEntries(
yourMap,
Predicate.containsPattern("^fruits_fleshly"));
But, that would too iterate over the map in the backyard. So, iteration is still there, if you are bothered about efficiency.
Since HashMap doesn't maintain any order for its keys it's not a very good choice for this problem. A better choice is the TreeMap: it has methods for retrieving a sub map for a range of keys. These methods run in O(log n) time (n number of entries) so it's better than iterating over the keys.
Map subMap = myMap.subMap("fruits_fleshly", true, "fruits_fleshly\uffff", true);
The nature of a hashmap means that there's no way to do a "like" comparison on keys - you have to iterate over them all to find where key.startsWith(input).
I suppose you could nest hashmaps and split up your keys. E.g.,
{
"fruits":{
"citrus":{
"orange":(value),
"lemon":(value)
},
"fleshly":{
"apple":(value),
"":(value)
}
}
}
...etc.
The performance implications are probably horrific on a small scale, but that may not matter in a homework context but maybe not so bad if you're dealing with a lot of data and only a couple layers of nesting.
Alternatively, create a Category object with a List of Categories (sub-categories) and a List of entries.
I believe Radix Trie is what you are looking for. It is similar idea as #ay89 solution.
You can just use this open source library Radix Trie example. It perform better than O(log(N)). You will be able to find a hashmap assigned to a key in average constant time (number of underscores in your search key string) with a decent implementation of Radix Trie.fruits
fruits_citrus_orange
fruits_citrus_lemon
fruits_fleshly_apple
fruits_fleshly
fruits_dry
Trie<String, Map> trie = new PatriciaTrie<>;
trie.put("fruits", hashmap1);
trie.put("fruits_citrus_orange", hashmap2);
trie.put("fruits_citrus_lemon", hashmap3);
trie.put("fruits_fleshly_apple", hashmap4);
trie.put("fruits_fleshly", hashmap5);
Map.Entry<String, Map> entry = trie.select("fruits_fleshy");
If you just want one hashmap to be return by select you might be able to get slightly better performance if you implement your own Radix Trie.

Building an inverted index in Java-logic

I have a collection of around 1500 documents. I parsed through each document and extract tokens. These tokens are stored in an hashmap(as key) and the total number of times they occur in the collection (i.e. frequency) is stored as the value.
I have to extend this to build an inverted index. That is, the term(key)| number of documents it occurs it-->DocNo|Frequency in that document. For exmple,
Term DocFreq DocNum TermFreq
data 3 1 12
23 31
100 17
customer 2 22 43
19 2
Currently, I have the following in Java,
hashmap<string,integer>
for(each document)
{
extract line
for(each line)
{
extract word
for(each word)
{
perform some operations
get value for word from hashmap and increment by one
}
}
}
I have to build on this code. I can't really think of a good way to implement an inverted index.
So far, I thought of making value a 2D array. So the term would be the key and the value(i.e 2D array) would store the docId and termFreq.
Please let me know if my logic is correct.
I would do it by using a Map<String, TermFrequencies>. This map would maintain a TermFrequencies object for each term found. The TermFrequencies object would have the following methods:
void addOccurrence(String documentId);
int getTotalNumberOfOccurrences();
Set<String> getDocumentIds();
int getNumberOfOccurrencesInDocument(String documentId);
It would use a Map<String, Integer> internally to associate each document the term occurs in with the number of occurrences of the term in the document.
The algorithm would be extremely simple:
for(each document) {
extract line
for(each line) {
extract word
for(each word) {
TermFrequencies termFrequencies = map.get(word);
if (termFrequencies == null) {
termFrequencies = new TermFrequencies(word);
}
termFrequencies.addOccurrence(document);
}
}
}
The addOccurrence() method would simply increment a counter for the total number of occurrences, and would insert or update the number of occurrences in the internam map.
I think it is best to have two structures: a Map<docnum, Map<term,termFreq>> and a Map<term, Set<docnum>>. Your docFreqs can be read off as set.size in the values of the second map. This solution involves no custom classes and allows a quick retrieval of everything needed.
The first map contains all the informantion and the second one is a derivative that allows quick lookup by term. As you process a document, you fill the first map. You can derive the second map afterwards, but it is also easy to do it in one pass.
I once implemented what you're asking for. The problem with your approach is that it is not abstract enough. You should model Terms, Documents and their relationships using objects. In a first run, create the term index and document objects and iterate over all terms in the documents while populating the term index. Afterwards, you have a representation in memory that you can easily transform into the desired output.
Do not start by thinking about 2d-arrays in an object oriented language. Unless you want to solve a mathematical problem or optimize something it's not the right approach most of the time.
I dont know if this is still a hot question, but I would recommend you to do it like this:
You run over all your documents and give them an id in increasing order. For each document you run over all the words.
Now you have a Hashmap that maps Strings (your words) to an array of DocTermObjects. A DocTermObject contains a docId and a TermFrequency.
Now for each word in a document, you look it up in your HashMap, if it doesn't contain an Array of DocTermObjects you create it, else you look at its very LAST element only (this is important due to runtime, think about it). If this element has the docId that you treat at the moment, you increase the TermFrequency. Else or if the Array is empty, you add a new DocTermObject with your actual docId and set the TermFrequency to 1.
Later you can use this datastructure to compute scores for example. The scores you could also save in the DoctermObjects of course.
Hope it helped :)

Which datastructure to use?

I have a scenario like this,
I need to store the count of the strings and I need to return the top ten strings which has the max count,
for example,
String Count
---------------------------------
String1 10
String2 9
String3 8
.
.
.
String10 1
I was thinking of using the hashtable for storing the string and its count but it would be difficult to retrieve the top ten strings from it as I have to again loop through it to find them.
Any other suggestions here?
Priority Que.
You can make a class to put in it:
public class StringHolder{
private String string;
private int value;
//Compare to and equals methods
}
then it will sort when you insert and it is easy to get the top 10.
Simply use a sorted map like
Map<Integer, List<String>> strings
where the key is the frequency value and the value the list of strings that occur with that frequency.
Then, loop through the map and, with an inner loop through the value lists until you've seen 10 strings. Those are the 10 most frequent one's.
With the additional requirement that the algorithm should support updates of frequencies: add the strings to a map like Map<String, Integer> where the key is the string and the value the actual frequency (increment the value if you see a string again). After that copy the key/value pairs to the map I suggested above.
For any task like "find N top items" priority queues are perfect solution. See Java's PriorityQueue class.
I am not sure, but i guess that the most suitable and elegant class for your needs is guava's
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/TreeMultiset.html
Guava has a HashMultiset which would be very useful for this.
HashMultiset<String> ms = Hashmultiset.create();
ms.add(astring);
ms.add(astring, times);
ImmutableMultiset<String> ims = Multisets.copyHighestCountFirst(ms);
// iterator through the first 10 elements, and they will be your top 10
// from highest to lowest.
You need a Max Heap data structure for this. Put it all into the max heap, and make successive 10 (or whatever n) removals.
And if you intend to keep reusing the data once it's loaded into memory, it might be worth the expense of sorting by value instead of a heap.

Data structure for holding sets of interchangeable strings

I have a set of strings. Out of these, groups of 2 or more may represent the same thing. These groups should be stored in a way that given any member of the group, you can fetch other members of the group with high efficiency.
So given this initial set: ["a","b1","b2","c1","c2","c3"] the result structure should be something like ["a",["b1","b2"],["c1","c2","c3"]] and Fetch("b") should return ["b1","b2"].
Is there a specific data structure and/or algorithm for this purpose?
EDIT: "b1" and "b2" are not actual strings, they're indicating that the 2 belong to the same group. Otherwise a Trie would be a perfect fit.
I may be misinterpreting the initial problem setup, but I believe that there is a simple and elegant solution to this problem using off-the-shelf data structures. The idea is, at a high level, to create a map from strings to sets of strings. Each key in the map will be associated with the set of strings that it's equal to. Assuming that each string in a group is mapped to the same set of strings, this can be done time- and space-efficiently.
The algorithm would probably look like this:
Construct a map M from strings to sets of strings.
Group all strings together that are equal to one another (this step depends on how the strings and groups are specified).
For each cluster:
Create a canonical set of the strings in that cluster.
Add each string to the map as a key whose value is the canonical set.
This algorithm and the resulting data structure is quite efficient. Assuming that you already know the clusters in advance, this process (using a trie as the implementation of the map and a simple list as the data structure for the sets) requires you to visit each character of each input string exactly twice - once when inserting it into the trie and once when adding it to the set of strings equal to it, assuming that you're making a deep copy. This is therefore an O(n) algorithm.
Moreover, lookup is quite fast - to find the set of strings equal to some string, just walk the trie to find the string, look up the associated set of strings, then iterate over it. This takes O(L + k) time, where L is the length of the string and k is the number of matches.
Hope this helps, and let me know if I've misinterpreted the problem statement!
Since this is Java, I would use a HashMap<String, Set<String>>. This would map each string to its equivalence set (which would contain that string and all others that belong to the same group). How you would construct the equivalence sets from the input depends on how you define "equivalent". If the inputs are in order by group (but not actually grouped), and if you had a predicate implemented to test equivalence, you could do something like this:
boolean differentGroups(String a, String b) {
// equivalence test (must handle a == null)
}
Map<String, Set<String>> makeMap(ArrayList<String> input) {
Map<String, Set<String>> map = new HashMap<String, Set<String>>();
String representative = null;
Set<String> group;
for (String next : input) {
if (differentGroups(representative, next)) {
representative = next;
group = new HashSet<String>();
}
group.add(next);
map.put(next, group);
}
return map;
}
Note that this only works if the groups are contiguous elements in the input. If they aren't you'll need more complex bookkeeping to build the group structure.

most efficient Java data structure for searching triples of strings

Suppose I have a large list (around 10,000 entries) of string triples as such:
car noun yes
dog noun no
effect noun yes
effect verb no
Suppose I am presented with a string double - for example, (effect, verb) - and I need to quickly look in the list to see if the pair appears and, if it does, whether its value is yes or no. (For this example the double does appear and the value is "no".)
What is the best data structure in Java to store the list and the most efficient way to perform the search? I am running hundreds of thousands of these searches so speed is of the essence.
Thanks!
You might consider using a HashMap<YourDouble, String>. Searches will be O(1).
You could either create an object, YourDouble which holds the first two values, or else append one to the other -- if values will still be unique -- and use HashMap<String, String>.
I would create a HashMultimap for each type of search you want, e.g. "all three", "each pair", and "each single field". When you build the list, populate all the different maps, then you can fetch from whichever map is appropriate for your query.
(The downside is that you'll need a type for at least each arity, e.g. use just String for the "single field" maps, but a Pair for the two-field maps, and a Triple for the three-field map.)
You could use a HashMap where the key is the concatenation of the first two strings, the ones which you'll use for lookups, and the value is a Boolean, representing the yes and no strings.
Alternatively, it seems the words in the second column would be fewer, since they represent categories. You could have a HashMap<String, HashMap<String, Boolean>> where you first index by e.g. "noun", "verb" etc. and then you index by e.g. "car", "dog", "effect", to get to your boolean. This would probably be more space-efficient.
10k doesn't seem that large to me. Have you tried a DB?
The place to look for information like this is the Semantic Web. A number of projects work on Triple Stores of just this type. There's a list at the bottom of the Triple Store page of implementations.
As far as java is concerned your algorithms are almost certainly going to be language dependent and if you find a good algorithm implemented in C its java port will also be fast.
Also, what's your data set look like? Are there a lot of 2 matches such that subject and verb are often the same? How many matches are you expecting to get? MapReduce will work work well for finding one match in 10k but won't work as well doing a query that returns a 8k of 10k where the query can't be easily partitioned.
There's a query language made just for this problem too: SPARQL. The bigdata blog has some good insights, though again 10k doesn't seem that large.

Categories

Resources