The problem I have is an example of something I've seen often. I have a series of strings (one string per line, let's say) as input, and all I need to do is return how many times each string has appeared. What is the most elegant way to solve this without using a trie or other string-specific structure? The solution I've used in the past is a hashtable-esque collection of custom-made (String, integer) objects that implement Comparable to keep track of how many times each string has appeared, but this method seems clunky for several reasons:
1) This method requires writing a comparison function that is identical to String.compareTo().
2) The impression that I get is that I'm misusing TreeSet, which has been my collection of choice. Updating the counter for a given string requires checking to see if the object is in the set, removing the object, updating the object, and then reinserting it. This seems wrong.
Is there a more clever way to solve this problem? Perhaps there is a better Collections interface I could use to solve this problem?
Thanks.
One possibility is:
public class Counter {
public int count = 1;
}
public Map<String, Counter> count(String[] values) {
Map<String, Counter> stringMap = new HashMap<String, Counter>();
for (String value : values) {
Counter count = stringMap.get(value);
if (count != null) {
count.count++;
} else {
stringMap.put(value, new Counter());
}
}
return stringMap; // hand the counts back to the caller instead of discarding them
}
This way you still need to keep a map, but you don't have to create a new entry every time a string repeats: you look up the Counter object, which is a mutable wrapper around an int, and increment it in place, which saves a second write to the map.
TreeMap is much better for this problem, or better yet, Guava's Multiset.
To use a TreeMap, you'd use something like
Map<String, Integer> map = new TreeMap<>();
for (String word : words) {
Integer count = map.get(word);
if (count == null) {
map.put(word, 1);
} else {
map.put(word, count + 1);
}
}
// print out each word and each count:
for (Map.Entry<String, Integer> entry : map.entrySet()) {
System.out.printf("Word: %s Count: %d%n", entry.getKey(), entry.getValue());
}
Integer theCount = map.get("the");
if (theCount == null) {
theCount = 0;
}
System.out.println(theCount); // number of times "the" appeared (0 if it never appeared)
Multiset would be much simpler than that; you'd just write
Multiset<String> multiset = TreeMultiset.create();
for (String word : words) {
multiset.add(word);
}
for (Multiset.Entry<String> entry : multiset.entrySet()) {
System.out.printf("Word: %s Count: %d%n", entry.getElement(), entry.getCount());
}
System.out.println(multiset.count("the")); // number of times "the" appeared
You can use a hash-map (no need to "create a comparable function"):
Map<String,Integer> count(String[] strings)
{
Map<String,Integer> map = new HashMap<String,Integer>();
for (String key : strings)
{
Integer value = map.get(key);
if (value == null)
map.put(key,1);
else
map.put(key,value+1);
}
return map;
}
Here is how you can use this method in order to print (for example) the string-count of your input:
Map<String,Integer> map = count(input);
for (String key : map.keySet())
System.out.println(key+" "+map.get(key));
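On Java 8 or later, the get-then-put pair in the loop above can be collapsed into a single Map.merge call. The following is a minimal sketch of the same counting idea, not a replacement for the answer above; the class name MergeCounter is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, named only for this example.
final class MergeCounter {
    // Counts how often each string occurs. Map.merge stores 1 for a key that
    // is not present yet, otherwise it combines the old count and 1 with Integer::sum.
    static Map<String, Integer> count(String[] strings) {
        Map<String, Integer> counts = new HashMap<>();
        for (String s : strings) {
            counts.merge(s, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(new String[] {"a", "b", "a"});
        System.out.println(counts.get("a")); // 2
        System.out.println(counts.get("b")); // 1
    }
}
```

Swapping HashMap for TreeMap in the sketch would give sorted iteration at the cost of slower updates.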
You can use a Bag data structure from Apache Commons Collections, such as HashBag.
A Bag does exactly what you need: it keeps track of how often an element has been added to the collection.
HashBag<String> bag = new HashBag<>();
bag.add("foo");
bag.add("foo");
bag.getCount("foo"); // 2
I created a Map<Integer, ArrayList<String>> map and I would like to compare each value in the map with an ArrayList<String> likeList and get the key if they match. I will use the key later.
I tried to run my code like this, but it doesn't work because it returns nothing:
for (int key : map.keySet()) {
if(map.get(key).equals(likeList)){
index = key;
Log.d("IndexN", String.valueOf(index));
}
}
Then, I tried this:
int index = 0;
for (Map.Entry<Integer, ArrayList<String>> entry : map.entrySet()) {
if(entry.getValue().equals(likeList)){
index = entry.getKey();
}
}
Do you have any idea?
Add a list of keys to store all the matches:
List<Integer> indices = new ArrayList<>();
for (int key : map.keySet()) {
if (map.get(key).equals(likeList)) {
indices.add(key);
}
}
It does not return index when I try the code above.
From this comment, I understood that as soon as you find a match in the map, the index should be recorded and further processing should be stopped. In other words, either there is only one match of likeList in the map or you want to find the first match of likeList in the map. If yes, you need to break the loop as soon as the match is found (shown below).
for (int key : map.keySet()) {
if (map.get(key).equals(likeList)) {
index = key; // record the matching key before logging it
Log.d("IndexN", String.valueOf(index));
break;
}
}
Note that this will give you the same value each time you execute it only when the map has a single match of likeList or the map is a LinkedHashMap. If it is a HashMap and it has more than one match of likeList, you may get a different value each time you execute it, because a HashMap does not guarantee the order of its entries.
However, if there can be multiple matches of likeList in the map and you want to log all the matches as well as get the list of the corresponding keys, you can do it as follows:
List<Integer> indexList = new ArrayList<>();
for (int key : map.keySet()) {
if (map.get(key).equals(likeList)) {
Log.d("IndexN", String.valueOf(key));
indexList.add(key);
}
}
// Display the list of corresponding keys
System.out.println(indexList);
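One thing worth checking when a comparison like this never matches: List.equals returns true only when both lists contain equal elements in the same order. The sketch below is self-contained plain Java (no Android Log) and shows the key-collecting loop together with that behaviour; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, named only for this example.
final class KeysByValue {
    // Returns every key whose value equals target, in the map's iteration order.
    static <K, V> List<K> keysMatching(Map<K, V> map, V target) {
        List<K> keys = new ArrayList<>();
        for (Map.Entry<K, V> e : map.entrySet()) {
            if (e.getValue().equals(target)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> map = new LinkedHashMap<>();
        map.put(1, new ArrayList<>(Arrays.asList("a", "b")));
        map.put(2, new ArrayList<>(Arrays.asList("b", "a"))); // same elements, different order
        map.put(3, new ArrayList<>(Arrays.asList("a", "b")));

        List<String> likeList = Arrays.asList("a", "b");
        System.out.println(keysMatching(map, likeList)); // [1, 3]
    }
}
```

If order should not matter, comparing sorted copies or sets of the lists would be one option, depending on whether duplicates are significant.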
My hashmap contains an entry with **key: its-site-of-origin-from-another-site##NOUN** and **value: its##ADJ site-of-origin-from-another-site##NOUN**
I want to get the value of this key using only the key part `"its-site-of-origin-from-another-site"`.
If the hashmap contains a key like 'its-site-of-origin-from-another-site', the lookup should first pick 'its' and then 'site-of-origin-from-another-site' only, not the part after '##'.
No. The key is a String, so the lookup will take whatever comes after "##" into account as well. If you need a value based on a substring, you have to iterate over the map, like this:
String value = map.get("its...");
if (value != null) {
//exact match for value
//use it
} else {//or use map or map which will reduce your search time but increase complexity
for (Map.Entry<String, String> entry : map.entrySet()) {
if (entry.getKey().startsWith("its...")) {
//that's the value i needed.
}
}
}
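If the keys can be kept in a TreeMap instead of a HashMap, the linear scan above can be replaced by a range view, because all keys sharing a prefix are adjacent in sorted order. This is a sketch using only the JDK; it assumes natural String ordering and that no key contains the character U+FFFF directly after the prefix. The class and method names are hypothetical.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper class, named only for this example.
final class PrefixLookup {
    // Returns a view of all entries whose key starts with prefix.
    // The exclusive upper bound is the prefix followed by the largest char value,
    // so every key of the form prefix + "..." sorts below it.
    static <V> SortedMap<String, V> withPrefix(TreeMap<String, V> map, String prefix) {
        return map.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        TreeMap<String, String> map = new TreeMap<>();
        map.put("its-site-of-origin-from-another-site##NOUN",
                "its##ADJ site-of-origin-from-another-site##NOUN");
        map.put("other##NOUN", "other##NOUN");

        // Look up by the part of the key before "##".
        for (String value : withPrefix(map, "its-site-of-origin-from-another-site").values()) {
            System.out.println(value);
        }
    }
}
```

The returned map is a live view backed by the TreeMap, so building it costs no copying.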
You can consider using a Patricia trie. It's a data structure like a TreeMap in which the keys are Strings and the values can be of any type. It is storage-efficient because common prefixes between keys are shared, but the property that is interesting for your use case is that you can search for a specific prefix and get a sorted view of the matching map entries.
Following is an example with the Apache Commons Collections implementation.
import java.util.Random;
import java.util.SortedMap;
import org.apache.commons.collections4.trie.PatriciaTrie;
public class TrieStuff {
public static void main(String[] args) {
// Build a Trie with String values (keys are always strings...)
PatriciaTrie<String> pat = new PatriciaTrie<>();
// put some key/value stuff with common prefixes
Random rnd = new Random();
String[] prefix = {"foo", "bar", "foobar", "fiz", "buz", "fizbuz"};
for (int i = 0; i < 100; i++) {
int r = rnd.nextInt(6);
String key = String.format("%s-%03d##whatever", prefix[r], i);
String value = String.format("%s##ADJ %03d##whatever", prefix[r], i);
pat.put(key, value);
}
// Search for all entries whose keys start with "fiz"
SortedMap<String, String> fiz = pat.prefixMap("fiz");
// print only the keys, in sorted order
fiz.keySet().forEach(System.out::println);
}
}
This prints all keys that start with "fiz", in sorted order (the exact keys vary between runs because the prefixes are chosen at random):
fiz-000##whatever
fiz-002##whatever
fiz-012##whatever
fiz-024##whatever
fiz-027##whatever
fiz-033##whatever
fiz-036##whatever
fiz-037##whatever
fiz-041##whatever
fiz-045##whatever
fiz-046##whatever
fiz-047##whatever
fizbuz-008##whatever
fizbuz-011##whatever
fizbuz-016##whatever
fizbuz-021##whatever
fizbuz-034##whatever
fizbuz-038##whatever
I'm going to count the most used words in a text, and I want to do it this way; I just need a little help with how to fix the TreeMap.
This is how it looks now:
TreeMap<Integer, List<String>> Word = new TreeMap<Integer, List<String>>();
List<String> TheList = new ArrayList<String>();
//While there is still something to read..
while (scanner.hasNext()) {
String NewWord = scanner.next().toLowerCase();
if (Word.containsKey(NewWord)) {
Word.put(HERE I NEED HELP);
} else {
Word.put(HERE I NEED HELP);
}
}
So what I want to do is: if NewWord is already in the map, add one to its Integer (the key), and if not, add the word to the next list.
Your type appears to be completely incorrect
... if you want a frequency count
You want to have your word as the key and the count as the value. There is little value in using a sorted collection here, and it is many times slower, so I would use a HashMap.
Map<String, Integer> frequencyCount = new HashMap<>();
while (scanner.hasNext()) {
String word = scanner.next().toLowerCase();
Integer count = frequencyCount.get(word);
if (count == null)
frequencyCount.put(word, 1);
else
frequencyCount.put(word, 1 + count);
}
... if you want to key by length, I would use a List<Set<String>>. This is because your word length is positive and bounded, and you want to ignore duplicate words, which is something a Set is designed to do.
List<Set<String>> wordsByLength = new ArrayList<Set<String>>();
while (scanner.hasNext()) {
String word = scanner.next().toLowerCase();
// grow the array list as required.
while (wordsByLength.size() <= word.length())
wordsByLength.add(new HashSet<String>());
// add the word ignoring duplicates.
wordsByLength.get(word.length()).add(word);
}
All the examples above correctly store the count in a map; unfortunately, they do not sort by count, which is a requirement you also have.
Do not use a TreeMap, instead use a HashMap to build up the values.
Once you have the complete list of values built you can then drop the entrySet from the HashMap into a new ArrayList and sort that array list by Entry<String,Integer>.getValue().
Or to be neater create a new "Count" object which has both the word and the count in and use that.
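The sort-the-entries step described above could look roughly like this. It is a sketch under the assumption that the counts are already in a Map<String, Integer>; the class name TopWords and the tie-break on the word itself are choices made for the example, not part of the original suggestion.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, named only for this example.
final class TopWords {
    // Copies the entries into a list and sorts it by count, largest first.
    // Words with equal counts are ordered alphabetically so the result is deterministic.
    static List<Map.Entry<String, Integer>> byCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("cat", 1);
        counts.put("a", 3);
        for (Map.Entry<String, Integer> e : byCountDescending(counts)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```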
Don't do:
TreeMap<Integer, List<String>>
Instead do:
TreeMap<String, Integer> // String represents the word... Integer represents the count
because your key (the count) can sometimes be the same for several words, whereas the words will be unique.
Do it the other way around... keep reading the words and check if your map contains that word... If yes, increment the count, else add the word with count = 1.
Try this one
TreeMap<String, Integer> Word = new TreeMap<String,Integer>();
while (scanner.hasNext()) {
String NewWord = scanner.next().toLowerCase();
if (Word.containsKey(NewWord)) {
Word.put(NewWord,Word.get(NewWord)+1);
} else {
Word.put(NewWord,1);
}
}
The way to solve this in a time-efficient manner is to have two maps. One map should be from keys to counts, and the other from counts to keys. You can assemble these in different passes. The first should assemble the map from keys to counts:
Map<String, Integer> wordCount = new HashMap<String,Integer>();
while (scanner.hasNext()) {
String word = scanner.next().toLowerCase();
wordCount.put(word, wordCount.containsKey(word) ? wordCount.get(word) + 1 : 1);
}
The second phase inverts the map so that you can read off the top-most keys:
// Biggest values first!
Map<Integer,List<String>> wordsByFreq = new TreeMap<Integer,List<String>>(new Comparator<Integer>(){
public int compare(Integer a, Integer b) {
return b.compareTo(a); // reversed, so the largest count sorts first
}
});
for (Map.Entry<String,Integer> e : wordCount.entrySet()) {
List<String> current = wordsByFreq.get(e.getValue());
if (current == null)
wordsByFreq.put(e.getValue(), current = new ArrayList<String>());
current.add(e.getKey());
}
Note that the first stage uses a HashMap because we don't need the order at all; just speedy access. The second stage needs a TreeMap and it needs a non-standard comparator so that the first value read out will be the list of most-frequent words (allowing for two or more words to be most-frequent).
Try this out:
TreeMap<String, Integer> map = new TreeMap<String, Integer>();
Scanner scanner = new Scanner(System.in); // assumption: the words are read from standard input
while (scanner.hasNext()) {
String NewWord = scanner.next().toLowerCase();
if (map.containsKey(NewWord)) {
Integer count = map.get(NewWord);
// Add the element back along with incremented count
map.put(NewWord, count + 1); // count++ would store the old value
} else {
map.put(NewWord,1); // Add a new entry
}
}
I have a basic method which reads in ~1000 files with ~10,000 lines each from the hard drive. Also, I have an array of String called userDescription which has all the "description words" of the user. I have created a HashMap whose data structure is HashMap<String, HashMap<String, Integer>> which corresponds to HashMap<eachUserDescriptionWords, HashMap<TweetWord, Tweet_Word_Frequency>>.
The file is organized as:
<User=A>\t<Tweet="tweet...">\n
<User=A>\t<Tweet="tweet2...">\n
<User=B>\t<Tweet="tweet3...">\n
....
My method to do this is:
for (File file : tweetList) {
if (file.getName().endsWith(".txt")) {
System.out.println(file.getName());
BufferedReader in;
try {
in = new BufferedReader(new FileReader(file));
String str;
while ((str = in.readLine()) != null) {
// String split[] = str.split("\t");
String split[] = ptnTab.split(str);
String user = ptnEquals.split(split[1])[1];
String tweet = ptnEquals.split(split[2])[1];
// String user = split[1].split("=")[1];
// String tweet = split[2].split("=")[1];
if (tweet.length() == 0)
continue;
if (!prevUser.equals(user)) {
description = userDescription.get(user);
if (description == null)
continue;
if (prevUser.length() > 0 && wordsCount.size() > 0) {
for (String profileWord : description) {
if (wordsCorr.containsKey(profileWord)) {
HashMap<String, Integer> temp = wordsCorr
.get(profileWord);
wordsCorr.put(profileWord,
addValues(wordsCount, temp));
} else {
wordsCorr.put(profileWord, wordsCount);
}
}
}
// wordsCount = new HashMap<String, Integer>();
wordsCount.clear();
}
setTweetWordCount(wordsCount, tweet);
prevUser = user;
}
} catch (IOException e) {
System.err.println("Something went wrong: "
+ e.getMessage());
}
}
}
Here, the method setTweetWordCount counts the word frequency of all the tweets of a single user. The method is:
private void setTweetWordCount(HashMap<String, Integer> wordsCount,
String tweet) {
ArrayList<String> currTweet = new ArrayList<String>(
Arrays.asList(removeUnwantedStrings(tweet)));
if (currTweet.size() == 0)
return;
for (String word : currTweet) {
try {
if (word.equals("") || word.equals(null))
continue;
} catch (NullPointerException e) {
continue;
}
Integer countWord = wordsCount.get(word);
wordsCount.put(word, (countWord == null) ? 1 : countWord + 1);
}
}
The method addValues checks whether wordsCount has words that are already in the giant HashMap wordsCorr. If it does, it increases the count of the word in the original HashMap wordsCorr.
Now, my problem is that no matter what I do, the program is very, very slow. I ran this version on my server, which has fairly good hardware, but it's been 28 hours and the number of files scanned is just ~450. I tried to see if I was doing anything repeatedly that might be unnecessary, and I corrected a few such things. But the program is still very slow.
Also, I have increased the heap size to 1500m which is the maximum that I can go.
Is there anything I might be doing wrong?
Thank you for your help!
EDIT: Profiling Results
First of all, I really want to thank you guys for the comments. I have changed some things in my program. I now use precompiled regexes instead of direct String.split(), along with other optimizations. However, after profiling, my addValues method is taking the most time. So, here's my code for addValues. Is there something that I should be optimizing here? Oh, and I've also changed my startProcess method a bit.
private HashMap<String, Integer> addValues(
HashMap<String, Integer> wordsCount, HashMap<String, Integer> temp) {
HashMap<String, Integer> merged = new HashMap<String, Integer>();
for (String x : wordsCount.keySet()) {
Integer y = temp.get(x);
if (y == null) {
merged.put(x, wordsCount.get(x));
} else {
merged.put(x, wordsCount.get(x) + y);
}
}
for (String x : temp.keySet()) {
if (merged.get(x) == null) {
merged.put(x, temp.get(x));
}
}
return merged;
}
EDIT2: Even after trying so hard with it, the program didn't run as expected. I did all the optimization of the "slow method" addValues, but it didn't work. So I took a different path: creating a word dictionary and assigning an index to each word first, and then doing the processing. Let's see where it goes. Thank you for your help!
Two things come to mind:
You are using String.split(), which uses a regular expression to do the splitting. That's completely oversized. Use one of the many splitXYZ() methods from Apache StringUtils instead.
You are probably creating really huge hash maps. When having very large hash maps, the hash collisions will make the hashmap functions much slower. This can be improved by using more widely spread hash values. See an example over here: Java HashMap performance optimization / alternative
One suggestion (I don't know how much of an improvement you'll get from it) is based on the observation that currTweet is never modified. There is no need to create a copy, i.e.
ArrayList<String> currTweet = new ArrayList<String>(
Arrays.asList(removeUnwantedStrings(tweet)));
can be replaced with
List<String> currTweet = Arrays.asList(removeUnwantedStrings(tweet));
or you can use the array directly (which will be marginally faster). I.e.
String[] currTweet = removeUnwantedStrings(tweet);
Also,
word.equals(null)
is always false by the definition of the contract of equals. The right way to null-check is:
if (null == word || word.equals(""))
Additionally, you won't need that null-pointer-exception try-catch if you do this. Exception handling is expensive when it happens, so if your word array tends to return lots of nulls, this could be slowing down your code.
More generally though, this is one of those cases where you should profile the code and figure out where the actual bottleneck is (if there is a bottleneck) instead of looking for things to optimize ad-hoc.
You would gain from a few more optimizations:
String.split recompiles the input regex (in string form) to a pattern every time. You should have a single static final Pattern ptnTab = Pattern.compile( "\\t" ), ptnEquals = Pattern.compile( "=" ); and call, e.g., ptnTab.split( str ). The resulting performance should be close to StringTokenizer.
word.equals( "" ) || word.equals( null ). Lots of wasted cycles here. If you are actually seeing null words, then you are catching NPEs, which is very expensive. See the response from @trutheality above.
You should allocate the HashMap with a very large initial capacity to avoid all the resizing that is bound to happen.
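As a rough illustration of two of the points above (compile the patterns once, and size the map up front), consider the sketch below. The pattern names, the helper class, and the expectedDistinctWords parameter are assumptions made for the example; whether either change matters in practice is something only a profiler run on the real data can tell.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
final class PresizedWordCount {
    // Compiled once and reused for every line and every tweet.
    private static final Pattern TAB = Pattern.compile("\t");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Splits one input line into its tab-separated fields.
    static String[] splitFields(String line) {
        return TAB.split(line);
    }

    // Counts words in text. The map is created with enough capacity for
    // expectedDistinctWords entries at the default load factor of 0.75,
    // so it should not need to resize while it fills up to that size.
    static Map<String, Integer> countWords(String text, int expectedDistinctWords) {
        int capacity = (int) (expectedDistinctWords / 0.75f) + 1;
        Map<String, Integer> counts = new HashMap<>(capacity);
        for (String word : WHITESPACE.split(text)) {
            if (word.isEmpty()) {
                continue; // leading whitespace produces one empty token
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] fields = splitFields("<User=A>\t<Tweet=\"hello hello world\">");
        System.out.println(fields.length);
        System.out.println(countWords("hello hello world", 16));
    }
}
```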
split() uses regular expressions, which are not "fast". try using a StringTokenizer or something instead.
Have you thought about using a database instead of Java? With database tools you can load the data into tables using the bulk-load utilities that come with the database, and from there you can do set processing. One challenge that I see is loading the data into a table, as the fields are not delimited with a common separator like "'" or ":".
You could rewrite addValues like this to make it faster - a few notes:
I have not tested the code but I think it is equivalent to yours.
I have not tested that it is quicker (but would be surprised if it wasn't)
I have assumed that wordsCount is larger than temp, if not exchange them in the code
I have also replaced all the HashMaps by Maps which does not make any difference for you but makes the code easier to change later on
private Map<String, Integer> addValues(Map<String, Integer> wordsCount, Map<String, Integer> temp) {
Map<String, Integer> merged = new HashMap<String, Integer>(wordsCount); // copies everything from wordsCount
for (Map.Entry<String, Integer> e : temp.entrySet()) {
Integer countInWords = merged.get(e.getKey()); //the number in wordsCount
Integer countInTemp = e.getValue();
int newCount = countInTemp + (countInWords == null ? 0 : countInWords); //the sum
merged.put(e.getKey(), newCount);
}
return merged;
}
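A further variation, offered as a sketch rather than a tested fix for the program in the question: instead of building a new merged map on every call, add the per-user counts into the map already stored in the outer map. The work per call is then proportional to the size of the smaller per-user map only. It also sidesteps an aliasing problem that the question's loop appears to have, where the same wordsCount instance is stored in wordsCorr and later emptied by wordsCount.clear(). The class and method names here are hypothetical; Map.merge and Map.computeIfAbsent require Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, named only for this example.
final class InPlaceMerge {
    // Adds every count in source to target, modifying target and leaving source untouched.
    static void addInto(Map<String, Integer> target, Map<String, Integer> source) {
        for (Map.Entry<String, Integer> e : source.entrySet()) {
            target.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }

    // Accumulates wordsCount under profileWord. The totals map is owned by
    // wordsCorr, so clearing or reusing wordsCount afterwards does not affect it.
    static void accumulate(Map<String, Map<String, Integer>> wordsCorr,
                           String profileWord, Map<String, Integer> wordsCount) {
        Map<String, Integer> totals =
                wordsCorr.computeIfAbsent(profileWord, k -> new HashMap<>());
        addInto(totals, wordsCount);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> wordsCorr = new HashMap<>();
        Map<String, Integer> wordsCount = new HashMap<>();
        wordsCount.put("coffee", 2);
        accumulate(wordsCorr, "barista", wordsCount);
        wordsCount.clear(); // safe: the totals live in their own map
        System.out.println(wordsCorr.get("barista").get("coffee")); // 2
    }
}
```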
I am working with a TreeMap of Strings, TreeMap<String, String>, and using it to implement a Dictionary of words.
I then have a collection of files, and would like to create a representation of each file in the vector space (space of words) defined by the dictionary.
Each file should have a vector representing it with following properties:
vector should have same size as dictionary
for each word contained in the file the vector should have a 1 in the position corresponding to the word position in dictionary
for each word not contained in the file the vector should have a -1 in the position corresponding to the word position in dictionary
So my idea is to use a Vector<Boolean> to implement these vectors. (This way of representing documents in a collection is called Boolean Model - http://www.site.uottawa.ca/~diana/csi4107/L3.pdf)
The problem I am facing in the procedure to create this vector is that I need a way to find position of a word in the dictionary, something like this:
String key;
int i = get_position_of_key_in_Treemap(key); <--- purely invented method...
1) Is there any method like this I can use on a TreeMap? If not, could you provide some code to help me implement it myself?
2) Is there an iterator on TreeMap (it's alphabetically ordered on keys) from which I can get the position?
3) Should I instead use another class to implement the dictionary (if you think that I can't do what I need with TreeMaps)? If yes, which?
Thanks in advance.
ADDED PART:
Solution proposed by dasblinkenlight looks fine but has the problem of complexity (linear with dimension of dictionary due to copying keys into an array), and the idea of doing it for each file is not acceptable.
Any other ideas for my questions?
Once you have constructed your tree map, copy its sorted keys into an array, and use Arrays.binarySearch to look up the index in O(logN) time. If you need the value, do a lookup on the original map too.
Edit: this is how you copy keys into an array
String[] mapKeys = new String[treeMap.size()];
int pos = 0;
for (String key : treeMap.keySet()) {
mapKeys[pos++] = key;
}
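Putting the copy and the lookup together, a minimal sketch might look like the following. It assumes the TreeMap uses natural String ordering (with a custom comparator, the same comparator would have to be passed to Arrays.binarySearch) and that the map is not modified after the array is built. The class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.TreeMap;

// Hypothetical helper class, named only for this example.
final class KeyIndex {
    // Copies the keys into an array; a TreeMap's keySet iterates in sorted order,
    // so the array comes out already sorted.
    static String[] sortedKeys(TreeMap<String, ?> map) {
        return map.keySet().toArray(new String[0]);
    }

    // Returns the position of key in sortedKeys, or -1 if it is absent.
    // Arrays.binarySearch returns a negative number for missing keys.
    static int indexOf(String[] sortedKeys, String key) {
        int i = Arrays.binarySearch(sortedKeys, key);
        return i >= 0 ? i : -1;
    }

    public static void main(String[] args) {
        TreeMap<String, String> tm = new TreeMap<>();
        tm.put("quick", "one");
        tm.put("brown", "two");
        tm.put("fox", "three");

        String[] keys = sortedKeys(tm);      // built once: O(N)
        System.out.println(indexOf(keys, "brown"));  // 0
        System.out.println(indexOf(keys, "quick"));  // 2
        System.out.println(indexOf(keys, "before")); // -1
    }
}
```

The array is built once for the whole dictionary, not once per file; each later lookup is a binary search.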
An alternative solution would be to use TreeMap's headMap method. If the word exists in the TreeMap, then the size() of its head map is equal to the index of the word in the dictionary. It may be a bit wasteful compared to my other answer, though.
Here is how you code it in Java:
import java.util.*;
class Test {
public static void main(String[] args) {
TreeMap<String,String> tm = new TreeMap<String,String>();
tm.put("quick", "one");
tm.put("brown", "two");
tm.put("fox", "three");
tm.put("jumps", "four");
tm.put("over", "five");
tm.put("the", "six");
tm.put("lazy", "seven");
tm.put("dog", "eight");
for (String s : new String[] {
"quick", "brown", "fox", "jumps", "over",
"the", "lazy", "dog", "before", "way_after"}
) {
if (tm.containsKey(s)) {
// Here is the operation you are looking for.
// It does not work for items not in the dictionary.
int pos = tm.headMap(s).size();
System.out.println("Key '"+s+"' is at the position "+pos);
} else {
System.out.println("Key '"+s+"' is not found");
}
}
}
}
Here is the output produced by the program:
Key 'quick' is at the position 6
Key 'brown' is at the position 0
Key 'fox' is at the position 2
Key 'jumps' is at the position 3
Key 'over' is at the position 5
Key 'the' is at the position 7
Key 'lazy' is at the position 4
Key 'dog' is at the position 1
Key 'before' is not found
Key 'way_after' is not found
https://github.com/geniot/indexed-tree-map
I had the same problem. So I took the source code of java.util.TreeMap and wrote IndexedTreeMap. It implements my own IndexedNavigableMap:
public interface IndexedNavigableMap<K, V> extends NavigableMap<K, V> {
K exactKey(int index);
Entry<K, V> exactEntry(int index);
int keyIndex(K k);
}
The implementation is based on updating node weights in the red-black tree when it is changed. Weight is the number of child nodes beneath a given node, plus one - self. For example when a tree is rotated to the left:
private void rotateLeft(Entry<K, V> p) {
if (p != null) {
Entry<K, V> r = p.right;
int delta = getWeight(r.left) - getWeight(p.right);
p.right = r.left;
p.updateWeight(delta);
if (r.left != null) {
r.left.parent = p;
}
r.parent = p.parent;
if (p.parent == null) {
root = r;
} else if (p.parent.left == p) {
delta = getWeight(r) - getWeight(p.parent.left);
p.parent.left = r;
p.parent.updateWeight(delta);
} else {
delta = getWeight(r) - getWeight(p.parent.right);
p.parent.right = r;
p.parent.updateWeight(delta);
}
delta = getWeight(p) - getWeight(r.left);
r.left = p;
r.updateWeight(delta);
p.parent = r;
}
}
updateWeight simply updates weights up to the root:
void updateWeight(int delta) {
weight += delta;
Entry<K, V> p = parent;
while (p != null) {
p.weight += delta;
p = p.parent;
}
}
And when we need to find the element by index here is the implementation that uses weights:
public K exactKey(int index) {
if (index < 0 || index > size() - 1) {
throw new ArrayIndexOutOfBoundsException();
}
return getExactKey(root, index);
}
private K getExactKey(Entry<K, V> e, int index) {
if (e.left == null && index == 0) {
return e.key;
}
if (e.left == null && e.right == null) {
return e.key;
}
if (e.left != null && e.left.weight > index) {
return getExactKey(e.left, index);
}
if (e.left != null && e.left.weight == index) {
return e.key;
}
return getExactKey(e.right, index - (e.left == null ? 0 : e.left.weight) - 1);
}
Also comes in very handy finding the index of a key:
public int keyIndex(K key) {
if (key == null) {
throw new NullPointerException();
}
Entry<K, V> e = getEntry(key);
if (e == null) {
throw new NullPointerException();
}
if (e == root) {
return getWeight(e) - getWeight(e.right) - 1;//index to return
}
int index = 0;
int cmp;
if (e.left != null) {
index += getWeight(e.left);
}
Entry<K, V> p = e.parent;
// split comparator and comparable paths
Comparator<? super K> cpr = comparator;
if (cpr != null) {
while (p != null) {
cmp = cpr.compare(key, p.key);
if (cmp > 0) {
index += getWeight(p.left) + 1;
}
p = p.parent;
}
} else {
Comparable<? super K> k = (Comparable<? super K>) key;
while (p != null) {
if (k.compareTo(p.key) > 0) {
index += getWeight(p.left) + 1;
}
p = p.parent;
}
}
return index;
}
You can find the result of this work at https://github.com/geniot/indexed-tree-map
There's no such implementation in the JDK itself. Although TreeMap iterates in natural key ordering, its internal data structures are all based on trees and not arrays (remember that Maps do not order keys, by definition, even though ordered iteration is a very common use case).
That said, you have to make a choice, as it is not possible to have O(1) computation time for your comparison criteria both for insertion into the Map and for the indexOf(key) calculation. This is due to the fact that lexicographical order is not stable in a mutable data structure (as opposed to insertion order, for instance). An example: with insertion order, the first key-value pair (entry) inserted into the map always stays at the first position. With lexicographical order, however, that position might change depending on the second key inserted, as the new key may be "greater" or "lower" than the one already in the Map. You can surely implement this by maintaining and updating an indexed list of keys during the insertion operation, but then you'll have O(n log(n)) for your insert operations (as you will need to re-order an array). That might be desirable or not, depending on your data access patterns.
ListOrderedMap and LinkedMap in Apache Commons both come close to what you need but rely on insertion order. You can check out their implementations and develop your own solution to the problem with little to moderate effort, I believe (that should be just a matter of replacing ListOrderedMap's internal backing array with a sorted list, TreeList in Apache Commons, for instance).
You can also calculate the index yourself, by counting the number of elements that are lower than the given key (which should be faster than iterating through the list searching for your element, in the most frequent case, as you're not comparing anything).
I agree with Isolvieira. Perhaps the best approach would be to use a different structure than TreeMap.
However, if you still want to go with computing the index of the keys, a solution would be to count how many keys are lower than the key you are looking for.
Here is a code snippet:
java.util.SortedMap<String, String> treeMap = new java.util.TreeMap<String, String>();
treeMap.put("d", "content 4");
treeMap.put("b", "content 2");
treeMap.put("c", "content 3");
treeMap.put("a", "content 1");
String key = "d"; // key to get the index for
System.out.println( treeMap.keySet() );
final String firstKey = treeMap.firstKey(); // assuming treeMap structure doesn't change in the mean time
System.out.format( "Index of %s is %d %n", key, treeMap.subMap(firstKey, key).size() );
I'd like to thank all of you for the effort you put in answering my question, they all were very useful and taking the best from each of them made me come up to the solution I actually implemented in my project.
What I believe to be the best answers to my individual questions are:
2) There is no Iterator defined on TreeMaps, as @Isoliveira says:
There's no such implementation in the JDK itself.
Although TreeMap iterates in natural key ordering,
its internal data structures are all based on trees and not arrays
(remember that Maps do not order keys, by definition,
in spite of that the very common use case).
and as I found in this SO answer How to iterate over a TreeMap?, the only way to iterate on elements in a Map is to use map.entrySet() and use Iterators defined on Set (or some other class with Iterators).
3) It's possible to use a TreeMap to implement the Dictionary, but this gives a complexity of O(logN) for finding the index of a contained word (the cost of a lookup in a tree data structure).
Using a HashMap with same procedure will instead have complexity O(1).
1) There exists no such method. Only solution is to implement it entirely.
As @Paul stated
Assumes that once getPosition() has been called, the dictionary is not changed.
the assumption of the solution is that once the Dictionary is created it will not be changed afterwards: this way the position of a word will always be the same.
Given this assumption, I found a solution that fills the Dictionary in O(N) time (plus a one-time sort of the keys) and afterwards guarantees that the index of a contained word can be looked up in constant time, O(1).
I defined Dictionary as a HashMap like this:
public HashMap<String, WordStruct> dictionary = new HashMap<String, WordStruct>();
key --> the String representing the word contained in Dictionary
value --> an Object of a created class WordStruct
where WordStruct class is defined like this:
public class WordStruct {
private int DictionaryPosition; // defines the position of word in dictionary once it is alphabetically ordered
public WordStruct(){
}
public void SetWordPosition(int pos){
this.DictionaryPosition = pos;
}
}
and allows me to keep memory of any kind of attribute I like to couple with the word entry of the Dictionary.
Now I fill dictionary iterating over all words contained in all files of my collection:
THE FOLLOWING IS PSEUDOCODE
for(int i = 0; i < number_of_files ; i++){
get_file(i);
while (file_contains_words){
dictionary.put( word(j) , new WordStruct());
}
}
Once the HashMap is filled, in whatever order, I use the procedure indicated by @dasblinkenlight to order it once and for all (the sort itself costs O(N log N), but it is done only once):
Object[] dictionaryArray = dictionary.keySet().toArray();
Arrays.sort(dictionaryArray);
for(int i = 0; i < dictionaryArray.length; i++){
String word = (String) dictionaryArray[i];
dictionary.get(word).SetWordPosition(i);
}
From now on, to get the index of a word in alphabetical order, the only thing needed is to read its DictionaryPosition variable:
since the word is known, you just need to look it up, and this has constant expected cost in a HashMap.
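The whole procedure above can be put together as a minimal sketch. The class name PositionedDictionary and the methods addWords, indexPositions and positionOf are hypothetical names chosen for illustration, and the sketch assumes the dictionary is not modified after indexPositions() has run.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper around the HashMap<String, WordStruct> described above.
final class PositionedDictionary {

    // Per-word record; further attributes could be added next to the position.
    static final class WordStruct {
        private int dictionaryPosition = -1; // stays -1 until indexPositions() has run

        int getDictionaryPosition() {
            return dictionaryPosition;
        }

        void setDictionaryPosition(int pos) {
            this.dictionaryPosition = pos;
        }
    }

    private final Map<String, WordStruct> dictionary = new HashMap<>();

    // Adds every word of one file; a word seen before keeps its existing entry.
    void addWords(List<String> fileWords) {
        for (String word : fileWords) {
            dictionary.putIfAbsent(word, new WordStruct());
        }
    }

    // Sorts the keys once (O(N log N)) and stores each word's alphabetical rank.
    void indexPositions() {
        String[] sorted = dictionary.keySet().toArray(new String[0]);
        Arrays.sort(sorted);
        for (int i = 0; i < sorted.length; i++) {
            dictionary.get(sorted[i]).setDictionaryPosition(i);
        }
    }

    // Expected O(1) lookup; returns -1 for a word that is not in the dictionary.
    int positionOf(String word) {
        WordStruct entry = dictionary.get(word);
        return entry == null ? -1 : entry.getDictionaryPosition();
    }
}
```

Call addWords once per file, then indexPositions() a single time; after that positionOf is a plain HashMap lookup.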
Thanks again and I wish you all a Merry Christmas!!
Have you thought to make the values in your TreeMap contain the position in your dictionary? I am using a BitSet here for my file details.
This doesn't work nearly as well as my other idea below.
Map<String,Integer> dictionary = new TreeMap<String,Integer> ();
private void test () {
// Construct my dictionary.
buildDictionary();
// Make my file data.
String [] file1 = new String[] {
"1", "3", "5"
};
BitSet fileDetails = getFileDetails(file1, dictionary);
printFileDetails("File1", fileDetails);
}
private void printFileDetails(String fileName, BitSet details) {
System.out.println("File: "+fileName);
for ( int i = 0; i < details.length(); i++ ) {
System.out.print ( details.get(i) ? 1: -1 );
if ( i < details.length() - 1 ) {
System.out.print ( "," );
}
}
}
private BitSet getFileDetails(String [] file, Map<String, Integer> dictionary ) {
BitSet details = new BitSet();
for ( String word : file ) {
// The value in the dictionary is the index of the word in the dictionary.
details.set(dictionary.get(word));
}
return details;
}
String [] dictionaryWords = new String[] {
"1", "2", "3", "4", "5"
};
private void buildDictionary () {
for ( String word : dictionaryWords ) {
// Initially make the value 0. We will change that later.
dictionary.put(word, 0);
}
// Make the indexes.
int wordNum = 0;
for ( String word : dictionary.keySet() ) {
dictionary.put(word, wordNum++);
}
}
Here the building of the file details consists of a single lookup in the TreeMap for each word in the file.
If you were planning to use the value in the dictionary TreeMap for something else you could always compose it with an Integer.
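As a rough sketch of that composition, the value can be a small object holding the index plus whatever else you need. The names IndexedTreeDictionary, Entry and fileCount are made up for this example. It also sizes the output by the dictionary rather than by BitSet.length(), because length() only reaches the highest set bit and would drop trailing words that are absent from the file.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical dictionary whose values combine the index with other data.
final class IndexedTreeDictionary {

    // Hypothetical value type: the word's index plus one extra attribute.
    static final class Entry {
        int index;     // position of the word in sorted order
        int fileCount; // example attribute: number of files containing the word
    }

    private final Map<String, Entry> dictionary = new TreeMap<>();

    // Inserts all words, then numbers them in key order (TreeMap iterates sorted).
    void build(String... words) {
        for (String word : words) {
            dictionary.put(word, new Entry());
        }
        int i = 0;
        for (Entry entry : dictionary.values()) {
            entry.index = i++;
        }
    }

    // One TreeMap lookup per file word; words missing from the dictionary are skipped.
    BitSet fileDetails(String... fileWords) {
        BitSet details = new BitSet(dictionary.size());
        for (String word : fileWords) {
            Entry entry = dictionary.get(word);
            if (entry != null && !details.get(entry.index)) {
                details.set(entry.index);
                entry.fileCount++;
            }
        }
        return details;
    }

    // Renders one value per dictionary word: 1 if present in the file, -1 if not.
    String render(BitSet details) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < dictionary.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(details.get(i) ? 1 : -1);
        }
        return sb.toString();
    }
}
```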
Added
Thinking about it further, if the value field of the Map is earmarked for something you could always use special keys that calculate their own position in the Map and act just like Strings for comparison.
private void test () {
// Dictionary
Map<PosKey, String> dictionary = new TreeMap<PosKey, String> ();
// Fill it with words.
String[] dictWords = new String[] {
"0", "1", "2", "3", "4", "5"};
for ( String word : dictWords ) {
dictionary.put( new PosKey( dictionary, word ), word );
}
// File
String[] fileWords = new String[] {
"0", "2", "3", "5"};
int[] file = new int[dictionary.size()];
// Initially all -1.
for ( int i = 0; i < file.length; i++ ) {
file[i] = -1;
}
// Temp file words set.
Set<String> fileSet = new HashSet<String>( Arrays.asList( fileWords ) );
for ( PosKey key : dictionary.keySet() ) {
if ( fileSet.contains( key.getKey() ) ) {
file[key.getPosition()] = 1;
}
}
// Print out.
System.out.println( Arrays.toString( file ) );
// Prints: [1, -1, 1, 1, -1, 1]
}
class PosKey
implements Comparable<PosKey> {
final String key;
// Initially -1
int position = -1;
// The map I am keying on.
Map<PosKey, ?> map;
public PosKey ( Map<PosKey, ?> map, String word ) {
this.key = word;
this.map = map;
}
public int getPosition () {
if ( position == -1 ) {
// First access to the key.
int pos = 0;
// Calculate all positions in one loop.
for ( PosKey k : map.keySet() ) {
k.position = pos++;
}
}
return position;
}
public String getKey () {
return key;
}
// Keys compare exactly like the Strings they wrap.
public int compareTo ( PosKey it ) {
return key.compareTo( it.key );
}
// Keep equals consistent with compareTo and hashCode.
public boolean equals ( Object it ) {
return it instanceof PosKey && key.equals( ( ( PosKey )it ).key );
}
public int hashCode () {
return key.hashCode();
}
}
NB: Assumes that once getPosition() has been called, the dictionary is not changed.
I would suggest that you write a SkipList to store your dictionary, since this will still offer O(log N) lookups, insertion and removal while also being able to provide an index. Tree implementations generally cannot return an index, since the nodes don't know it and there would be a cost to keeping them updated. Unfortunately the Java implementation, ConcurrentSkipListMap, does not provide an index, so you would need to implement your own version.
Getting the index of an item would be O(log N). If you wanted both the index and the value without doing two lookups, you would need to return a wrapper object holding both.
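To give an idea of what such a structure could look like, here is a minimal sketch of an indexable skip list. It is only an illustration under some assumptions: keys are Strings, duplicates are ignored, there is a fixed maximum of 16 levels, removal is left out, and the random generator is seeded only to make runs reproducible. The class name IndexedSkipList and its methods are hypothetical, not part of any library. Each link stores a width (how many bottom-level steps it skips), and summing the widths along the search path yields the index in expected O(log N).

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sorted set of Strings that can report the index of a word.
final class IndexedSkipList {
    private static final int MAX_LEVEL = 16; // assumed cap on tower height

    private static final class Node {
        final String key;
        final Node[] next;  // next[i] is the successor on level i
        final int[] width;  // width[i] is the number of level-0 steps next[i] skips

        Node(String key, int levels) {
            this.key = key;
            this.next = new Node[levels];
            this.width = new int[levels];
        }
    }

    private final Node head = new Node(null, MAX_LEVEL);
    private final Random random = new Random(42); // fixed seed, for reproducibility only
    private int size = 0;

    IndexedSkipList() {
        Arrays.fill(head.width, 1); // every empty level spans one step to the end
    }

    // Inserts the word; returns false if it was already present.
    boolean add(String word) {
        Node[] chain = new Node[MAX_LEVEL];     // last node before the word, per level
        int[] stepsAtLevel = new int[MAX_LEVEL]; // steps walked on each level
        Node node = head;
        for (int level = MAX_LEVEL - 1; level >= 0; level--) {
            while (node.next[level] != null && node.next[level].key.compareTo(word) < 0) {
                stepsAtLevel[level] += node.width[level];
                node = node.next[level];
            }
            chain[level] = node;
        }
        Node candidate = chain[0].next[0];
        if (candidate != null && candidate.key.equals(word)) {
            return false;
        }
        int levels = 1;
        while (levels < MAX_LEVEL && random.nextBoolean()) {
            levels++;
        }
        Node fresh = new Node(word, levels);
        int steps = 0; // level-0 distance from chain[level] to chain[0]
        for (int level = 0; level < levels; level++) {
            Node prev = chain[level];
            fresh.next[level] = prev.next[level];
            prev.next[level] = fresh;
            fresh.width[level] = prev.width[level] - steps;
            prev.width[level] = steps + 1;
            steps += stepsAtLevel[level];
        }
        // Links passing over the new node now skip one more element.
        for (int level = levels; level < MAX_LEVEL; level++) {
            chain[level].width[level]++;
        }
        size++;
        return true;
    }

    // Returns the 0-based index of the word in sorted order, or -1 if absent.
    int indexOf(String word) {
        int position = 0;
        Node node = head;
        for (int level = MAX_LEVEL - 1; level >= 0; level--) {
            while (node.next[level] != null && node.next[level].key.compareTo(word) < 0) {
                position += node.width[level];
                node = node.next[level];
            }
        }
        Node candidate = node.next[0];
        return (candidate != null && candidate.key.equals(word)) ? position : -1;
    }

    int size() {
        return size;
    }
}
```

To return the index together with a value in one lookup, the node could also carry the value and indexOf could return a small wrapper holding both, as suggested above.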