Treemap with <Integer, List> - java

I'm going to count the most used words in a text, and I want to do it this way; I just need a little help with how to fix the TreeMap.
This is how it looks now:
TreeMap<Integer, List<String>> Word = new TreeMap<Integer, List<String>>();
List<String> TheList = new ArrayList<String>();
//While there is still something to read..
while (scanner.hasNext()) {
String NewWord = scanner.next().toLowerCase();
if (Word.containsKey(NewWord)) {
Word.put(HERE I NEED HELP);
} else {
Word.put(HERE I NEED HELP);
}
}
So what I want to do is: if NewWord is already in the map, add one to the Integer (key); if not, add the word to the next list.

Your type appears to be completely incorrect
... if you want a frequency count
You want to have your word as the key and the count as the value. There is little value in using a sorted collection here, and it is many times slower, so I would use a HashMap.
Map<String, Integer> frequencyCount = new HashMap<>();
while (scanner.hasNext()) {
String word = scanner.next().toLowerCase();
Integer count = frequencyCount.get(word);
if (count == null)
frequencyCount.put(word, 1);
else
frequencyCount.put(word, 1 + count);
}
... if you want to key by length, I would use a List<Set<String>>. This is because your word length is positive and bounded, and you want to ignore duplicate words, which is something a Set is designed to do.
List<Set<String>> wordsByLength = new ArrayList<Set<String>>();
while (scanner.hasNext()) {
String word = scanner.next().toLowerCase();
// grow the array list as required.
while (wordsByLength.size() <= word.length())
wordsByLength.add(new HashSet<String>());
// add the word ignoring duplicates.
wordsByLength.get(word.length()).add(word);
}

All the examples above correctly store the count in a map. Unfortunately, they do not sort by count, which is a requirement you also have.
Do not use a TreeMap; instead, use a HashMap to build up the values.
Once you have the complete set of values built, you can copy the entrySet from the HashMap into a new ArrayList and sort that list by Entry<String,Integer>.getValue().
Or, to be neater, create a new "Count" object that holds both the word and the count, and use that.
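The build-then-sort approach described above could look roughly like this. It is only a sketch: the class name WordFrequencySketch and the method sortedByCount are made up for illustration, and it assumes the text is available as a String.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

// Hypothetical helper class, not part of the original question or answers.
final class WordFrequencySketch {
    // Counts lower-cased, whitespace-separated tokens, then returns the
    // entries ordered by count, highest first.
    static List<Map.Entry<String, Integer>> sortedByCount(String text) {
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                // merge() inserts 1 for a new word, otherwise adds 1 to the old count
                counts.merge(scanner.next().toLowerCase(), 1, Integer::sum);
            }
        }
        // Copy the entry set into a list and sort it by value, descending
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : sortedByCount("the cat and the dog and the bird")) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Words with equal counts come out in no particular order, because the HashMap does not define one.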

Don't do
TreeMap<Integer, List<String>>
instead do
TreeMap<String, Integer> // String represents the word... Integer represents the count
because your key (the count) can sometimes be the same for several words, whereas the words will be unique.
Do it the other way around: keep reading the words and check whether your map contains that word. If yes, increment the count; otherwise add the word with count = 1.

Try this one
TreeMap<String, Integer> Word = new TreeMap<String,Integer>();
while (scanner.hasNext()) {
String NewWord = scanner.next().toLowerCase();
if (Word.containsKey(NewWord)) {
Word.put(NewWord,Word.get(NewWord)+1);
} else {
Word.put(NewWord,1);
}
}

The way to solve this in a time-efficient manner is to have two maps. One map should be from keys to counts, and the other from counts to keys. You can assemble these in different passes. The first should assemble the map from keys to counts:
Map<String, Integer> wordCount = new HashMap<String,Integer>();
while (scanner.hasNext()) {
String word = scanner.next().toLowerCase();
wordCount.put(word, wordCount.containsKey(word) ? wordCount.get(word) + 1 : 1);
}
The second phase inverts the map so that you can read off the top-most keys:
// Biggest values first!
Map<Integer,List<String>> wordsByFreq = new TreeMap<Integer,List<String>>(new Comparator<Integer>(){
public int compare(Integer a, Integer b) {
return b.compareTo(a); // descending order, so the largest count comes first
}
});
for (Map.Entry<String,Integer> e : wordCount.entrySet()) {
List<String> current = wordsByFreq.get(e.getValue());
if (current == null)
wordsByFreq.put(e.getValue(), current = new ArrayList<String>());
current.add(e.getKey());
}
Note that the first stage uses a HashMap because we don't need the order at all; just speedy access. The second stage needs a TreeMap and it needs a non-standard comparator so that the first value read out will be the list of most-frequent words (allowing for two or more words to be most-frequent).
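As a compact variant of the second phase, the inversion can be sketched with Comparator.reverseOrder() and computeIfAbsent. The class and method names here are hypothetical, and the word counts are assumed to have been built already.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration only.
final class InvertCountsSketch {
    // Groups words by their count; the reversed comparator makes the
    // largest count the first key in iteration order.
    static TreeMap<Integer, List<String>> invert(Map<String, Integer> wordCount) {
        TreeMap<Integer, List<String>> byFreq = new TreeMap<>(Comparator.reverseOrder());
        for (Map.Entry<String, Integer> e : wordCount.entrySet()) {
            // Create the list for this count on first use, then append the word
            byFreq.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return byFreq;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new HashMap<>();
        wordCount.put("the", 3);
        wordCount.put("and", 2);
        wordCount.put("cat", 1);
        // firstEntry() holds the most frequent word(s)
        System.out.println(invert(wordCount).firstEntry());
    }
}
```

Because the map is ordered largest-first, firstEntry() gives the list of most frequent words directly, including ties.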

Try this out:
TreeMap<String, Integer> map = new TreeMap<String, Integer>();
Scanner scanner = new Scanner(System.in); // assumed input source; a null scanner would throw a NullPointerException
while (scanner.hasNext()) {
String NewWord = scanner.next().toLowerCase();
if (map.containsKey(NewWord)) {
Integer count = map.get(NewWord);
// Add the element back along with incremented count
map.put(NewWord, count + 1); // count++ would store the old value
} else {
map.put(NewWord,1); // Add a new entry
}
}

Related

How to add digits to a created stopwords list in Java?

I have a method which creates a stopword list with the 10% of most frequent words from the lemmas key in my JSON file – which looks like this:
{..
,"lemmas":{
"doc41":"the dynamically expand when there too many collision i e have distinct hash code but fall into same slot modulo size expect average effect"
,"doc40":"retrieval operation include get generally do block so may overlap update operation include put remove retrieval reflect result any non null k new longadder increment"
,"doc42":"a set projection"..
}
}
private static List<String> StopWordsFile(ConcurrentHashMap<String, String> lemmas) {
// ConcurrentHashMap stores each word and its frequency
ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<String, Integer>();
// Array List for all the individual words
ArrayList<String> corpus = new ArrayList<String>();
for (Entry<String, String> entry : lemmas.entrySet()) {
String line = entry.getValue().toLowerCase();
line = line.replaceAll("\\p{Punct}", " ");
line = line.replaceAll("\\d+"," ");
line = line.replaceAll("\\s+", " ");
line = line.trim();
String[] value = line.split(" ");
List<String> words = new ArrayList<String>(Arrays.asList(value));
corpus.addAll(words);
}
// count all the words in the corpus and store the words with each frequency in
// the counts
for (String word : corpus) {
if (counts.keySet().contains(word)) {
counts.put(word, counts.get(word) + 1);
} else {
counts.put(word, 1);
}
}
// Create a list to store all the words with their frequency and sort it by values.
List<Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
list.sort((e2, e1) -> e1.getValue().compareTo(e2.getValue()));
List<Entry<String, Integer>> stopwordslist = new ArrayList<>(list.subList(0, (int) (0.10 * list.size())));
// Create the stopwords list with the 10% most frequent words
List<String> stopwords = new ArrayList<>();
// for (Map.Entry<String, Integer> e : sublist) {
for (ConcurrentHashMap.Entry<String, Integer> e : stopwordslist) {
stopwords.add(e.getKey());
}
System.out.println(stopwords);
return stopwords;
}
It outputs these words:
[the, of, value, v, key, to, given, a, k, map, in, for, this, returns, if, is, super, null, ... that, none]
I want to add single digits to it such as '1,2,3,4,5,6,7,8,9' or/and another stopwords.txt file containing digits.
How can I do that?
Also, how can I output this stopwords list to a CSV file? Can someone point me in the right direction?
I'm new to Java.

Java - Iteration over arrayList & Insertion of 250k words into treeMap is taking huge time

I am implementing a Java-based synonym finder that stores a thesaurus of 250k words in a map. Each Google word from a txt file (1000 words in total) is assigned as a value for a thesaurus word if it is a synonym of that word.
To do that, I iterate over the thesaurus word list, look up each word's synonyms using the WordNet library, and if the Google word list contains one of those synonyms, I assign that value in the thesaurus map. The code block is provided below:
@SuppressWarnings("rawtypes")
public TreeMap fetchMap() throws IOException {
generateThesaurusList();
generateGoogleList();
/** loop through the array of Thesaurus Words..*/
for (int i=0; i<thesaurusList.size(); i++) {
SynonymFinder sf = new SynonymFinder();
// find the
ArrayList synonymList = sf.getSynonym(thesaurusList.get(i).toString().trim());
for (int j=0; j<synonymList.size(); j++) {
if (googleList.contains(synonymList.get(j)));
hm.put(thesaurusList.get(i).toString().trim(), synonymList.get(j).toString().trim());
}
}
return hm;
}
But iterating over the list and inserting into the map takes a very long time. Can someone suggest a way to make it faster?
I have also tried a HashMap for this, but it was slow as well.
Note: I have to use some sort of map for storing the data.
Here is my change after the suggestions, but it did not help:
@SuppressWarnings("rawtypes")
public TreeMap fetchMap() throws IOException {
generateThesaurusList();
generateGoogleList();
Set<String> gWords = new HashSet<>(googleList);
int record =1;
int loopcount=0;
ArrayList thesaurusListing = removeDuplicates(thesaurusList);
Map<String, Set<String>> tWordsWithSynonymsMatchingGoogleWords = new TreeMap<>();
/** loop through the array of Google Words..*/
for (int i=0; i<thesaurusListing.size(); i++) {
SynonymFinder sf = new SynonymFinder();
System.out.println(record);
// find the
ArrayList synonymList = sf.getSynonym(thesaurusListing.get(i).toString().trim());
for (int j=0; j<synonymList.size(); j++) {
if (googleList.contains(synonymList.get(j))) {
/**to avoid duplicate keys*/
tWords.put(thesaurusListing.get(i).toString().trim(), new HashSet<>(synonymList));
}
}
for (String tWord : tWords.keySet()) {
tWords.get(tWord).retainAll(gWords);
tWordsWithSynonymsMatchingGoogleWords.put(tWord, tWords.get(tWord));
}
record++;
}
return (TreeMap) tWordsWithSynonymsMatchingGoogleWords;
}
Your code was missing the creation of entries of the form {key, set}; it used {key, value} instead. Based on what you want to achieve, you need to intersect two sets. Here is an example of how you can approach that:
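import java.util.*;
// The class name is illustrative; the original snippet omitted the class declaration.
public class ThesaurusMatcher {
public static Map<String, Set<String>> getThesaurusWordsWithSynonymsMatchingGoogleWords(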
Map<String, Set<String>> tWordsWithSynonyms, Set<String> gWords) {
Map<String, Set<String>> tWordsWithSynonymsMatchingGoogleWords = new TreeMap<>();
for (String tWord : tWordsWithSynonyms.keySet()) {
tWordsWithSynonyms.get(tWord).retainAll(gWords);
tWordsWithSynonymsMatchingGoogleWords.put(tWord, tWordsWithSynonyms.get(tWord));
}
return tWordsWithSynonymsMatchingGoogleWords;
}
public static void main(String[] args) {
Map<String, Set<String>> tWords = new HashMap<>();
tWords.put("B", new HashSet<>(Arrays.asList("d")));
tWords.put("A", new HashSet<>(Arrays.asList("a", "b", "c")));
tWords.put("C", new HashSet<>(Arrays.asList("e")));
Set<String> gWords = new HashSet<>(Arrays.asList("a", "b", "e"));
System.out.println("Input -> thesaurusWordsWithSynonyms:");
System.out.println(tWords);
System.out.println("Input -> googleWords:");
System.out.println(gWords);
Map<String, Set<String>> result = getThesaurusWordsWithSynonymsMatchingGoogleWords(tWords, gWords);
System.out.println("Output -> thesaurusWordsWithSynonymsMatchingGoogleWords:");
System.out.println(result);
}
}
To make everything work, you should first trim your thesaurus words and find their matching synonyms.

Given a arraylist of words, and user input. How to map them into “Families”? in Java

I am trying to create an evil hangman game using Java and TreeMaps. I'm trying to figure out how to put words into families. I have an ArrayList that is just a list of words and a String that represents a user input / guess. From this I have to create a map of patterns they generate and how many words match each pattern. In order to do this I need to break the word list up into different patterns and words based on the user guess.
For example, suppose I have a list:
{ALLY, BETA, COOL, DEAL, ELSE, FLEW, GOOD, HOPE, IBEX}
and User guesses an E.
Every word falls into one of a few families based on where the E is:
"----", is the pattern for [ALLY, COOL, GOOD]
"-E--", is the pattern for [BETA, DEAL]
"--E-", is the pattern for [FLEW, IBEX]
"E--E", is the pattern for [ELSE]
"---E", is the pattern for [HOPE]
I should also mention that the user also picks the length of the word they are guessing, so in this specific case only four-letter words are considered.
Is there a way to use a TreeMap object to help map out which words belong in which families, e.g. in the form TreeMap<String, ArrayList<String>>?
I am having a lot of trouble figuring this out, so this is very incomplete, but here is my code so far:
public class HangmanManager {
// instance vars
private ArrayList<String> list;
private boolean debugging;
private ArrayList<Character> guess;
private int numGuesses;
private String pattern;
// pre: words != null, words.size() > 0
// if debugOn = true, debuggin output is added
public HangmanManager(List<String> words, boolean debugOn) {
list = new ArrayList<String>();
debugging = debugOn;
for(int i = 0; i < words.size(); i++){
list.add(words.get(i));
}
}
// pre: words != null, words.size() > 0
// debuggin output is not added
public HangmanManager(List<String> words) {
list = new ArrayList<String>();
for(int i = 0; i < words.size(); i++){
list.add(words.get(i));
}
}
public TreeMap<String, Integer> makeGuess(char guess) {
if(alreadyGuessed(guess)){
throw new IllegalStateException("Not valid imput.");
}
TreeMap<String, ArrayList<String>> newList = new TreeMap<String, ArrayList<String>>();
newList.put(str, list);
return null;
}
//helper method to generate an ArrayList that contains the letter that the user guesses
public ArrayList<String> getArrayList(char guess){
String str = guess + "";
ArrayList<String> newList = new ArrayList<String>();
for(int i = 0; i < list.size(); i++){
if(list.get(i).contains(str)){
newList.add(list.get(i));
}
}
return newList;
}
//helper method to break up the current word list into different patterns and words based on the user guess.
public TreeMap<String, ArrayList<String>> breakUp(char guess){
Map<String, ArrayList<String>> newList = new TreeMap<String, ArrayList<String>>();
String str = guess + "";
newList.put(str, list);
return null;
}
}
You made good progress so far, please see the 2 methods below that'll help you fill in the cracks.
This method gets the pattern based on your guess and words like {ALLY, BETA, COOL, DEAL, ELSE, FLEW, GOOD, HOPE, IBEX}.
public String getPatternForWord(char guess, String word) {
//regex to match all non-guess characters (ex: [^E] if guess was 'E')
String replaceRegex = "[^" + guess + "]";
//replace all non-guess characters with '-' (ex: replace all non-'E' with '-')
String pattern = word.replaceAll(replaceRegex, "-");
return pattern;
}
This method returns a map of patterns to their words Map<String, List<String>>.
For example: {----=[ALLY, COOL, GOOD], ---E=[HOPE], --E-=[FLEW, IBEX], -E--=[BETA, DEAL], E--E=[ELSE]}
public Map<String, List<String>> getPatternMapForGuess(char guess) {
Map<String, List<String>> newMap = new TreeMap<String, List<String>>();
for (String word : list) {
String pattern = getPatternForWord(guess, word);
//get the list of words for this pattern from map
List<String> wordList;
if (newMap.containsKey(pattern)) {
wordList = newMap.get(pattern);
} else {
wordList = new ArrayList<String>();
}
//add word to list if it isn't there already
if (!wordList.contains(word)) {
wordList.add(word);
}
//pattern : word list
newMap.put(pattern, wordList);
}
return newMap;
}
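For comparison, the same grouping can be sketched with computeIfAbsent and a plain character loop instead of a regex. This is a standalone variant with hypothetical names (PatternFamiliesSketch, patternFor, families); unlike the method above, it takes the word list as a parameter and does not skip duplicate words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration only.
final class PatternFamiliesSketch {
    // Builds the pattern for one word: the guessed letter stays, every
    // other character becomes '-'.
    static String patternFor(char guess, String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (char c : word.toCharArray()) {
            sb.append(c == guess ? c : '-');
        }
        return sb.toString();
    }

    // Maps each pattern to the words that produce it, sorted by pattern.
    static Map<String, List<String>> families(char guess, List<String> words) {
        Map<String, List<String>> result = new TreeMap<>();
        for (String word : words) {
            result.computeIfAbsent(patternFor(guess, word), k -> new ArrayList<>()).add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "ALLY", "BETA", "COOL", "DEAL", "ELSE", "FLEW", "GOOD", "HOPE", "IBEX");
        System.out.println(families('E', words));
    }
}
```

A map from pattern to family size, as makeGuess is declared to return, could then be derived by replacing each list with its size().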
As an aside, I noticed that you always limit yourself to a specific implementation ArrayList<String> of List<String> instead of just the interface List<String>. It's good OOP practice to program to an interface not a specific implementation since it improves your flexibility.
For example:
List<String> newList = new ArrayList<String>();
instead of
ArrayList<String> newList = new ArrayList<String>();
and
Map<String, List<String>> newList = new TreeMap<String, List<String>>();
instead of
TreeMap<String, ArrayList<String>> newList = new TreeMap<String, ArrayList<String>>();
I won't go into details here.

Elegant solution for string-counting?

The problem I have is an example of something I've seen often. I have a series of strings (one string per line, lets say) as input, and all I need to do is return how many times each string has appeared. What is the most elegant way to solve this, without using a trie or other string-specific structure? The solution I've used in the past has been to use a hashtable-esque collection of custom-made (String, integer) objects that implements Comparable to keep track of how many times each string has appeared, but this method seems clunky for several reasons:
1) This method requires the creation of a comparable function which is identical to the String's.compareTo().
2) The impression that I get is that I'm misusing TreeSet, which has been my collection of choice. Updating the counter for a given string requires checking to see if the object is in the set, removing the object, updating the object, and then reinserting it. This seems wrong.
Is there a more clever way to solve this problem? Perhaps there is a better Collections interface I could use to solve this problem?
Thanks.
One possibility is:
public class Counter {
public int count = 1;
}
public void count(String[] values) {
Map<String, Counter> stringMap = new HashMap<String, Counter>();
for (String value : values) {
Counter count = stringMap.get(value);
if (count != null) {
count.count++;
} else {
stringMap.put(value, new Counter());
}
}
}
This way you still need to keep a map, but at least you don't need to replace the entry every time you see a string again: you fetch the Counter, which is a mutable wrapper around an int, and increase its value by one, which avoids a second write to the map.
TreeMap is much better for this problem, or better yet, Guava's Multiset.
To use a TreeMap, you'd use something like
Map<String, Integer> map = new TreeMap<>();
for (String word : words) {
Integer count = map.get(word);
if (count == null) {
map.put(word, 1);
} else {
map.put(word, count + 1);
}
}
// print out each word and each count:
for (Map.Entry<String, Integer> entry : map.entrySet()) {
System.out.printf("Word: %s Count: %d%n", entry.getKey(), entry.getValue());
}
Integer theCount = map.get("the");
if (theCount == null) {
theCount = 0;
}
System.out.println(theCount); // number of times "the" appeared, or 0 if it never did
Multiset would be much simpler than that; you'd just write
Multiset<String> multiset = TreeMultiset.create();
for (String word : words) {
multiset.add(word);
}
for (Multiset.Entry<String> entry : multiset.entrySet()) {
System.out.printf("Word: %s Count: %d%n", entry.getElement(), entry.getCount());
}
System.out.println(multiset.count("the")); // number of times "the" appeared
You can use a hash-map (no need to "create a comparable function"):
Map<String,Integer> count(String[] strings)
{
Map<String,Integer> map = new HashMap<String,Integer>();
for (String key : strings)
{
Integer value = map.get(key);
if (value == null)
map.put(key,1);
else
map.put(key,value+1);
}
return map;
}
Here is how you can use this method in order to print (for example) the string-count of your input:
Map<String,Integer> map = count(input);
for (String key : map.keySet())
System.out.println(key+" "+map.get(key));
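If Java 8 streams are available (an assumption; the answers here use plain loops), the same count can be sketched with Collectors.groupingBy and Collectors.counting. The class name is hypothetical, and note that the counts come back as Long rather than Integer.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration only.
final class StreamCountSketch {
    // Groups equal strings together and counts each group; TreeMap::new
    // keeps the keys sorted, which is optional.
    static Map<String, Long> count(String[] strings) {
        return Arrays.stream(strings)
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(new String[] {"b", "a", "b"}));
    }
}
```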
You can use a Bag data structure from Apache Commons Collections, such as HashBag.
A Bag does exactly what you need: It keeps track of how often an element got added to the collections.
HashBag<String> bag = new HashBag<>();
bag.add("foo");
bag.add("foo");
bag.getCount("foo"); // 2

Efficient way to delete values from hashmap object

I have a HashMap object that contains a key x-y-z with the corresponding value test-test1-test2.
Map<String,String> map = new HashMap<String,String>();
map.put("x-y-z","test-test1-test2");
map.put("x1-y1-z1","test-test2-test3");
Now I have an input string array that contains some piece of the key:
String[] rem={"x","x1"};
Based on this string array I want to remove HashMap values.
Can anyone give an efficient approach to do this operation?
List remList = Arrays.asList(rem);
for (Iterator it = map.keySet().iterator(); it.hasNext();) {
String key = (String) it.next();
String[] tokens = key.split("-");
for (int i = 0; i < tokens.length; i++) {
String token = tokens[i];
if (remList.contains(token)) {
it.remove();
break;
}
}
}
And an updated version with adding functionality based on your latest comment on this answer:
private static Map getMapWithDeletions(Map map, String[] rem) {
Map pairs = new HashMap();
for (int i = 0; i < rem.length; i++) {
String keyValue = rem[i];
String[] pair = keyValue.split("#", 2);
if (pair.length == 2) {
pairs.put(pair[0], pair[1]);
}
}
Set remList = pairs.keySet();
for (Iterator it = map.keySet().iterator(); it.hasNext();) {
String key = (String) it.next();
String[] tokens = key.split("-");
for (int i = 0; i < tokens.length; i++) {
String token = tokens[i];
if (remList.contains(token)) {
it.remove();
pairs.remove(token);
break;
}
}
}
map.putAll(pairs);
return map;
}
Edited based on edited question.
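Loop through the keySet of the HashMap. When you find a key that starts with the x you are looking for, remove it from the map. Remove it through the iterator, because calling map.remove() inside a for-each loop over the keySet can throw a ConcurrentModificationException.
Something like:
for (Iterator<String> it = map.keySet().iterator(); it.hasNext();) {
String key = it.next();
// x is the piece of the key to match, e.g. "x"
if (key.startsWith(x + "-")) {
it.remove();
}
}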
Assuming I understand you correctly, and you want to remove everything starting with 'x-' and 'x1-' from the map (but not 'x1111-', even though 'x1' is a prefix of 'x1111'), and efficiency is important, you might want to look at one of the implementations of NavigableMap, such as (for example) TreeMap.
NavigableMaps keep their entries in order (by natural key order, by default), and can be iterated over and searched very efficiently.
They also provide methods like subMap, which can produce another Map which contains those keys in a specified range. Importantly, this returned Map is a live view, which means operations on this map affect the original map too.
So:
NavigableMap<String,String> map = new TreeMap<String,String>();
// populate data
for (String prefixToDelete : rem) {
// e.g. prefixToDelete = "x"
String startOfRange = prefixToDelete + "-"; // e.g. x-
String endOfRange = prefixToDelete + "."; // e.g. x. ; '.' is the character immediately after '-', so x1- and x1111- stay outside the range
map.subMap(startOfRange, endOfRange).clear(); // MAGIC!
}
Assuming your map is large, .subMap() should be much faster than iterating over each Map entry (as a TreeMap uses a red-black tree for fast searching).
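A small self-contained sketch of this range deletion is shown below. It uses "." as the exclusive upper bound, because '.' is the character immediately after '-' in sort order, so the range [prefix-, prefix.) covers exactly the keys that start with prefix followed by '-'. The class and method names are hypothetical.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper class for illustration only.
final class PrefixDeleteSketch {
    // Removes every key of the form prefix + "-" + anything, for each prefix.
    static void deleteByPrefix(NavigableMap<String, String> map, String[] prefixes) {
        for (String prefix : prefixes) {
            // subMap returns a live view, so clearing it removes the
            // entries from the underlying map as well.
            map.subMap(prefix + "-", true, prefix + ".", false).clear();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> map = new TreeMap<>();
        map.put("x-y-z", "test-test1-test2");
        map.put("x1-y1-z1", "test-test2-test3");
        map.put("x1111-a", "kept");
        map.put("y-x-z", "kept");
        deleteByPrefix(map, new String[] {"x", "x1"});
        System.out.println(map.keySet());
    }
}
```

Only keys whose first piece is exactly x or x1 are removed; x1111-a and y-x-z stay in the map.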
You can do the following:
Map<String,String> map = new HashMap<String,String>();
map.put("x-y-z","test-test1-test2");
map.put("x1-y1-z1","test-test2-test3");
String[] rem={"x","x1"};
for (String s : rem) {
map.keySet().removeIf(key -> key.contains(s));
}
This piece of code will remove all entries with "x" or "x1" in the map key.
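Note that contains() also matches keys such as "max-a", because "x" appears inside "max". If the intent is to match whole '-'-separated pieces only (an assumption about the requirement), a variant could be sketched like this, with hypothetical class and method names:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration only.
final class TokenRemoveSketch {
    // Removes every entry whose key has at least one '-'-separated piece
    // that is equal to one of the strings in rem.
    static void removeByToken(Map<String, String> map, String[] rem) {
        Set<String> tokens = new HashSet<>(Arrays.asList(rem));
        map.keySet().removeIf(
                key -> Arrays.stream(key.split("-")).anyMatch(tokens::contains));
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("x-y-z", "test-test1-test2");
        map.put("x1-y1-z1", "test-test2-test3");
        map.put("max-a", "kept");
        removeByToken(map, new String[] {"x", "x1"});
        System.out.println(map.keySet());
    }
}
```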
