Java Weka StringToWordVector is not counting word occurrences properly

I'm using the Weka machine learning library's Java API and I have the following code:
String html = "repeat repeat repeat";
Attribute input = new Attribute("html",(FastVector) null);
FastVector inputVec = new FastVector();
inputVec.addElement(input);
Instances htmlInst = new Instances("html",inputVec,1);
htmlInst.add(new Instance(1));
htmlInst.instance(0).setValue(0, html);
StringToWordVector filter = new StringToWordVector();
filter.setUseStoplist(true);
filter.setInputFormat(htmlInst);
Instances dataFiltered = Filter.useFilter(htmlInst, filter);
Instance last = dataFiltered.lastInstance();
System.out.println(last);
Although StringToWordVector is supposed to count the word occurrences within the string, the word 'repeat' is counted only once instead of 3 times.
What am I doing wrong?

By default, the filter only reports the presence or absence of a word as 1/0. You must enable counting explicitly. Add:
filter.setOutputWordCounts(true);
and re-run.
Weka has a dedicated mailing list; posting such questions there might get you faster responses.

Gee... all those lines of code. How about these few lines instead?
public static Map<String, Integer> countWords(String input) {
Map<String, Integer> map = new HashMap<String, Integer>();
Matcher matcher = Pattern.compile("\\b\\w+\\b").matcher(input);
while (matcher.find())
map.put(matcher.group(), map.containsKey(matcher.group()) ? map.get(matcher.group()) + 1 : 1);
return map;
}
Here's the code in action:
public static void main(String[] args) {
System.out.println(countWords("sample, repeat sample, of text"));
}
Output:
{of=1, text=1, repeat=1, sample=2}
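The same plain-Java counter can be written a little more compactly with Map.merge, available since Java 8. This is only a sketch: the class name WordCountSketch and the use of a TreeMap (chosen so the printed order is predictable) are my own assumptions, not part of the answer above.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCountSketch {
    // Compiled once; \w+ matches runs of letters, digits and underscores.
    private static final Pattern WORD = Pattern.compile("\\w+");

    static Map<String, Integer> countWords(String input) {
        // TreeMap keeps the keys sorted, so the printed order is predictable.
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = WORD.matcher(input);
        while (matcher.find()) {
            // merge() stores 1 for a new word, otherwise adds 1 to the old count.
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("repeat repeat repeat")); // prints {repeat=3}
    }
}
```

Unlike the Weka filter, this does no stop-word handling or tokenizer configuration; it only counts raw tokens.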

Related

Java-Stream & Optional - Find a value that matches to a stream-element or provide a Default value

I have a Dictionary object which consists of several entries:
record Dictionary(String key, String value, String other) {};
I would like to replace the words in a given String that are present as a "key" in one of the dictionaries with the corresponding value. I can achieve it like this, but I guess there must be a better way to do it.
An example:
> Input: One <sup>a</sup> Two <sup>b</sup> Three <sup>D</sup> Four
> Output: One [a-value] Two [b-value] Three [D] Four
The code to be improved:
public class ReplaceStringWithDictionaryEntries {
public static void main(String[] args) {
List<Dictionary> dictionary = List.of(new Dictionary("a", "a-value", "a-other"),
new Dictionary("b", "b-value", "b-other"));
String theText = "One <sup>a</sup> Two <sup>b</sup> Three <sup>D</sup> Four";
Matcher matcher = Pattern.compile("<sup>([A-Za-z]+)</sup>").matcher(theText);
StringBuilder sb = new StringBuilder();
int matchLast = 0;
while (matcher.find()) {
sb.append(theText, matchLast, matcher.start());
Optional<Dictionary> dict = dictionary.stream().filter(f -> f.key().equals(matcher.group(1))).findFirst();
if (dict.isPresent()) {
sb.append("[").append(dict.get().value()).append("]");
} else {
sb.append("[").append(matcher.group(1)).append("]");
}
matchLast = matcher.end();
}
if (matchLast != 0) {
sb.append(theText.substring(matchLast));
}
System.out.println("Result: " + sb.toString());
}
}
Output:
Result: One [a-value] Two [b-value] Three [D] Four
Do you have a more elegant way to do this?
Since Java 9, Matcher#replaceAll can accept a callback function to return the replacement for each matched value.
String result = Pattern.compile("<sup>([A-Za-z]+)</sup>").matcher(theText)
.replaceAll(mr -> "[" + dictionary.stream().filter(f -> f.key().equals(mr.group(1)))
.findFirst().map(Dictionary::value)
.orElse(mr.group(1)) + "]");
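For reference, here is a self-contained sketch of the callback approach, assuming Java 9 or later. The class name SupReplaceSketch and the plain Map<String, String> lookup (instead of the Dictionary record) are my own simplifications. It also wraps the result in Matcher.quoteReplacement, which matters only if a dictionary value could contain $ or \.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SupReplaceSketch {
    private static final Pattern SUP = Pattern.compile("<sup>([A-Za-z]+)</sup>");

    static String replace(String text, Map<String, String> valuesByKey) {
        return SUP.matcher(text).replaceAll(mr -> {
            String key = mr.group(1);
            // Fall back to the key itself when the dictionary has no entry.
            String value = valuesByKey.getOrDefault(key, key);
            // quoteReplacement keeps '$' and '\' in the value literal.
            return Matcher.quoteReplacement("[" + value + "]");
        });
    }

    public static void main(String[] args) {
        Map<String, String> valuesByKey = Map.of("a", "a-value", "b", "b-value");
        String text = "One <sup>a</sup> Two <sup>b</sup> Three <sup>D</sup> Four";
        System.out.println(replace(text, valuesByKey));
        // prints: One [a-value] Two [b-value] Three [D] Four
    }
}
```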
Create a map from your list, using key as the map key and value as the map value. Then use Matcher#appendReplacement to replace each match, looking it up with Map.getOrDefault and passing the group(1) value as the default. Use String#join to wrap the replacement in square brackets. Note that the appendReplacement overload that takes a StringBuilder requires Java 9 or later.
public static void main(String[] args) {
List<Dictionary> dictionary = List.of(
new Dictionary("a", "a-value", "a-other"),
new Dictionary("b", "b-value", "b-other"));
Map<String,String> myMap = dictionary.stream()
.collect(Collectors.toMap(Dictionary::key, Dictionary::value));
String theText = "One <sup>a</sup> Two <sup>b</sup> Three <sup>D</sup> Four";
Matcher matcher = Pattern.compile("<sup>([A-Za-z]+)</sup>").matcher(theText);
StringBuilder sb = new StringBuilder();
while (matcher.find()) {
matcher.appendReplacement(sb,
String.join("", "[", myMap.getOrDefault(matcher.group(1), matcher.group(1)), "]"));
}
matcher.appendTail(sb);
System.out.println(sb.toString());
}
record Dictionary( String key, String value, String other) {};
Map vs List
As @Chaosfire has pointed out in the comments, a Map is a more suitable collection for the task than a List, because it eliminates the need to iterate over the collection to access a particular element.
Map<String, Dictionary> dictByKey = Map.of(
"a", new Dictionary("a", "a-value", "a-other"),
"b", new Dictionary("b", "b-value", "b-other")
);
I would also recommend wrapping the Map in a class that provides convenient access to the string values of the dictionary. Otherwise we are forced to check that the dictionary returned from the map is not null before asking it for the required value, which is inconvenient. A utility class can return the target value in a single method call.
To avoid complicating the answer, I will not implement such a utility class. For simplicity I'll go with a Map<String,String>, which does what the utility class is meant to do: it provides the value in a single call.
public static final Map<String, String> dictByKey = Map.of(
"a", "a-value",
"b", "b-value"
);
Pattern.splitAsStream()
We can replace the while-loop with a stream created via splitAsStream().
To separate the values enclosed in tags, such as <sup>text</sup>, from the plain text, we can use the special constructs called lookbehind (?<=</sup>) and lookahead (?=<sup>).
(?<=foo) - matches a position that is immediately preceded by foo, i.e. right after it.
(?=foo) - matches a position that is immediately followed by foo, i.e. right before it.
For more information, have a look at this tutorial
The pattern "(?=<sup>)|(?<=</sup>)" matches the position right before each opening tag and the position immediately after each closing tag. So when we split the string on this pattern with splitAsStream(), it produces a stream containing tagged elements like "<sup>a</sup>" and plain strings like "One ", " Two ", " Three " (including the surrounding spaces).
Note that in order to reuse the pattern without recompiling, it can be declared on a class level:
public static final Pattern pattern = Pattern.compile("(?=<sup>)|(?<=</sup>)");
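To see what the split actually yields, here is a small sketch that collects the pieces into a list. The class name SplitPiecesSketch and the shortened input are my own; the pattern is the one described above.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SplitPiecesSketch {
    // Zero-width boundaries: before "<sup>" and after "</sup>".
    static final Pattern BOUNDARY = Pattern.compile("(?=<sup>)|(?<=</sup>)");

    static List<String> pieces(String text) {
        return BOUNDARY.splitAsStream(text).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Quotes are added so the preserved spaces are visible.
        pieces("One <sup>a</sup> Two").forEach(p -> System.out.println("'" + p + "'"));
        // prints:
        // 'One '
        // '<sup>a</sup>'
        // ' Two'
    }
}
```

Because the boundaries are zero-width, nothing is consumed by the split, so joining the pieces back together reproduces the original text.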
The final solution results in a lean and simple stream:
public static void foo(String text) {
String result = pattern.splitAsStream(text)
.map(str -> getValue(str)) // or MyClass::getValue
.collect(Collectors.joining());
System.out.println(result);
}
Instead of tackling conditional logic inside a lambda, it's often better to extract it into a separate method (sure, you can use a ternary operator and place this logic right inside the map operation in the stream if you wish instead of having this method, but it'll be a bit messy):
public static String getValue(String str) {
if (str.matches("<sup>\\p{Alpha}+</sup>")) {
String key = str.replaceAll("<sup>|</sup>", "");
return "[" + dictByKey.getOrDefault(key, key) + "]";
}
return str;
}
main()
public static void main(String[] args) {
foo("One <sup>a</sup> Two <sup>b</sup> Three <sup>D</sup> Four");
}
Output:
One [a-value] Two [b-value] Three [D] Four

Replace number in word

How can I replace every 1 with one, every 2 with two, every 3 with three, and so on, in an input string?
My Code:
import javax.swing.JOptionPane;
public class Main {
public static void main(String[] args) {
String Input = JOptionPane.showInputDialog("Text:");
String Output;
//replace
Output = Input.replaceAll("1", "one");
Output = Input.replaceAll("2", "two");
//Output
System.out.println(Output);
}
}
It only works for one replacement.
You need to call replaceAll on Output the second time:
Output = Input.replaceAll("1", "one");
Output = Output.replaceAll("2", "two");
or just call replaceAll fluently:
Output = Input.replaceAll("1", "one").replaceAll("2", "two");
Your code sets Output twice, both times using Input as the source string. Therefore, the second call, Output = Input.replaceAll("2", "two");, completely discards the result of the first one.
You could replace that with this instead:
Output = Input.replaceAll("1", "one");
Output = Output.replaceAll("2", "two");
But that would be a bit excessive and become quite cumbersome if you want to define a lot of replacements.
Instead, you could use a HashMap to store the values you want to replace and what to replace them with.
Using HashMap<Character, String> allows you to store the single-character "key," or the value you want to replace, and its replacement string.
Then it is just a matter of reading each character of the input string and determining when the HashMap has defined a replacement for it.
import java.util.HashMap;
public class Main {
private static HashMap<Character, String> replacementMap = new HashMap<>();
public static void main(String[] args) {
// Build the replacement strings
replacementMap.put('1', "one");
replacementMap.put('2', "two");
replacementMap.put('3', "three");
replacementMap.put('4', "four");
replacementMap.put('5', "five");
replacementMap.put('6', "six");
replacementMap.put('7', "seven");
replacementMap.put('8', "eight");
replacementMap.put('9', "nine");
replacementMap.put('0', "zero");
String input = "This is 1 very long string. It has 3 sentences and 121 characters. Exactly 0 people will verify that count.";
StringBuilder output = new StringBuilder();
for (char c : input.toCharArray()) {
// This character has a replacement defined in the map
if (replacementMap.containsKey(c)) {
// Add the replacement string to the output
output.append(replacementMap.get(c));
} else {
// No replacement found, just add this character to the output
output.append(c);
}
}
System.out.println(output.toString());
}
}
Output:
This is one very long string. It has three sentences and onetwoone characters. Exactly zero people will verify that count.
Limitations:
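First of all, whether this implementation is suitable depends on your desired functionality and scope. It replaces digits one at a time, so a multi-digit number such as 121 comes out as "onetwoone" rather than as a number word, and with infinitely many possible numbers a lookup table can never cover them all.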
Also, this looks for a single character to replace. If you wanted to expand this to replace "10" with "ten," for example, you would need to use HashMap<String, String> instead.
Unfortunately, your original question does not provide enough context in order to suggest the best way for you.
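As a rough sketch of the HashMap<String, String> idea from the limitations above: match whole runs of digits and look each run up as a single token. The class name NumberWordsSketch, the contents of the lookup table and the decision to leave unknown numbers untouched are all my assumptions.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberWordsSketch {
    // Hypothetical lookup table; extend it with whatever numbers you need.
    private static final Map<String, String> WORDS = Map.of(
            "0", "zero", "1", "one", "2", "two", "3", "three", "10", "ten");

    // \d+ grabs a whole run of digits, so "10" is looked up as one token.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static String replaceNumbers(String input) {
        Matcher matcher = NUMBER.matcher(input);
        StringBuffer output = new StringBuffer();
        while (matcher.find()) {
            String number = matcher.group();
            // Numbers missing from the table are left unchanged.
            String replacement = WORDS.getOrDefault(number, number);
            matcher.appendReplacement(output, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(output);
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceNumbers("I have 1 dog, 10 cats and 121 fish."));
        // prints: I have one dog, ten cats and 121 fish.
    }
}
```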

Parsing String by pattern of substrings

I need to parse a formula and get all the variables that were used. The list of variables is available. For example, the formula looks like this:
String f = "(Min(trees, round(Apples1+Pears1,1)==1&&universe==big)*number";
I know that possible variables are:
String[] vars = {"trees","rivers","Apples1","Pears1","Apricots2","universe","galaxy","big","number"};
I need to get the following array:
String[] varsInF = {"trees", "Apples1","Pears1", "universe", "big","number"};
I believe the split method would work here, but I can’t figure out the regexp required for this.
No need for any regex pattern - just check which item of the supported vars is contained in the given string:
List<String> varsInf = new ArrayList<>();
for(String var : vars)
if(f.contains(var))
varsInf.add(var);
Using Stream<> you can:
String[] varsInf = Arrays.stream(vars).filter(f::contains).toArray(String[]::new);
Assuming a "variable" is a sequence of one or more alphanumeric characters, you should split on non-alphanumeric characters, i.e. [^\w]+, then collect the result by iteration or with a filter:
Set<String> varSet = new HashSet<>(Arrays.asList(vars));
List<String> result = new ArrayList<>();
for (String s : f.split("[^\\w]+")) {
if (varSet.contains(s)) {
result.add(s);
}
}
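The split-based approach can also be written as a stream. This sketch (the class and method names are mine) matches whole tokens only, so a known variable big is not reported for a formula that merely contains bigger, which a plain contains() check would do. It also drops duplicates while keeping the order of first appearance.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class FormulaVarsSketch {
    static String[] usedVariables(String formula, String[] knownVars) {
        Set<String> known = new HashSet<>(Arrays.asList(knownVars));
        // \W+ is the same as [^\w]+: split on runs of non-word characters.
        return Arrays.stream(formula.split("\\W+"))
                .filter(known::contains)   // keep only tokens that are known variables
                .distinct()                // report each variable once
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String f = "(Min(trees, round(Apples1+Pears1,1)==1&&universe==big)*number";
        String[] vars = {"trees", "rivers", "Apples1", "Pears1", "Apricots2",
                "universe", "galaxy", "big", "number"};
        System.out.println(Arrays.toString(usedVariables(f, vars)));
        // prints: [trees, Apples1, Pears1, universe, big, number]
    }
}
```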

ReplaceAll with java8 lambda functions

Given the following variables
templateText = "Hi ${name}";
variables.put("name", "Joe");
I would like to replace the placeholder ${name} with the value "Joe" using the following code (that does not work)
variables.keySet().forEach(k -> templateText.replaceAll("\\${\\{"+ k +"\\}" variables.get(k)));
However, if I do the "old-style" way, everything works perfectly:
for (Entry<String, String> entry : variables.entrySet()){
String regex = "\\$\\{" + entry.getKey() + "\\}";
templateText = templateText.replaceAll(regex, entry.getValue());
}
Surely I am missing something here :)
Java 8
The proper way to implement this has not changed in Java 8, it is based on appendReplacement()/appendTail():
Pattern variablePattern = Pattern.compile("\\$\\{(.+?)\\}");
Matcher matcher = variablePattern.matcher(templateText);
StringBuffer result = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(result, variables.get(matcher.group(1)));
}
matcher.appendTail(result);
System.out.println(result);
Note that, as mentioned by drrob in the comments, the replacement String of appendReplacement() may contain group references using the $ sign, and escaping using \. If this is not desired, or if your replacement String can potentially contain those characters, you should escape them using Matcher.quoteReplacement().
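To illustrate the note about $ and \, here is a small sketch in which a replacement value contains a dollar sign. The class name, the method name fill and the sample data are my own, and the code assumes every placeholder has an entry in the map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteReplacementSketch {
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{(.+?)\\}");

    static String fill(String template, Map<String, String> variables) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Assumption: the map contains every placeholder name used in the template.
            String value = variables.get(matcher.group(1));
            // Without quoteReplacement, "$5" would be read as a reference to group 5.
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> variables = new HashMap<>();
        variables.put("price", "$5");
        System.out.println(fill("Cost: ${price}", variables)); // prints Cost: $5
    }
}
```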
Being more functional in Java 8
If you want a more Java-8-style version, you can extract the search-and-replace boilerplate code into a generalized method that takes a replacement Function:
private static StringBuffer replaceAll(String templateText, Pattern pattern,
Function<Matcher, String> replacer) {
Matcher matcher = pattern.matcher(templateText);
StringBuffer result = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(result, replacer.apply(matcher));
}
matcher.appendTail(result);
return result;
}
and use it as
Pattern variablePattern = Pattern.compile("\\$\\{(.+?)\\}");
StringBuffer result = replaceAll(templateText, variablePattern,
m -> variables.get(m.group(1)));
Note that having a Pattern as parameter (instead of a String) allows it to be stored as a constant instead of recompiling it every time.
Same remark applies as above concerning $ and \ – you may want to enforce the quoteReplacement() inside the replaceAll() method if you don't want your replacer function to handle it.
Java 9 and above
Java 9 introduced Matcher.replaceAll(Function) which basically implements the same thing as the functional version above. See Jesse Glick's answer for more details.
You can also use Stream.reduce(identity, accumulator, combiner).
identity
The identity is the initial value passed to the reducing function, i.e. the accumulator.
accumulator
The accumulator folds each element into the running result. In a sequential stream, that result becomes the input of the next accumulation step.
combiner
The combiner is never called in a sequential stream. In a parallel stream it merges the partial results computed by different threads.
BinaryOperator<String> combinerNeverBeCalledInSequentiallyStream=(identity,t) -> {
throw new IllegalStateException("Can't be used in parallel stream");
};
String result = variables.entrySet().stream()
.reduce(templateText
, (it, var) -> it.replaceAll(format("\\$\\{%s\\}", var.getKey())
, var.getValue())
, combinerNeverBeCalledInSequentiallyStream);
import java.util.HashMap;
import java.util.Map;
public class Repl {
public static void main(String[] args) {
Map<String, String> variables = new HashMap<>();
String templateText = "Hi, ${name} ${secondname}! My name is ${name} too :)";
variables.put("name", "Joe");
variables.put("secondname", "White");
templateText = variables.keySet().stream().reduce(templateText, (acc, e) -> acc.replaceAll("\\$\\{" + e + "\\}", variables.get(e)));
System.out.println(templateText);
}
}
output:
Hi, Joe White! My name is Joe too :)
However, it's not the best idea to reinvent the wheel; the preferred way to achieve what you want is to use Apache Commons Lang, as stated here. (In recent releases StrSubstitutor is deprecated in favor of StringSubstitutor from Apache Commons Text.)
Map<String, String> valuesMap = new HashMap<String, String>();
valuesMap.put("animal", "quick brown fox");
valuesMap.put("target", "lazy dog");
String templateString = "The ${animal} jumped over the ${target}.";
StrSubstitutor sub = new StrSubstitutor(valuesMap);
String resolvedString = sub.replace(templateString);
Your code should be changed as shown below:
String templateText = "Hi ${name}";
Map<String,String> variables = new HashMap<>();
variables.put("name", "Joe");
templateText = variables.keySet().stream().reduce(templateText, (originalText, key) -> originalText.replaceAll("\\$\\{" + key + "\\}", variables.get(key)));
Performing replaceAll repeatedly, i.e. once for every replaceable variable, can become quite expensive, especially as the number of variables grows. This doesn’t become more efficient when using the Stream API. The regex package contains the necessary building blocks to do this more efficiently:
public static String replaceAll(String template, Map<String,String> variables) {
String pattern = variables.keySet().stream()
.map(Pattern::quote)
.collect(Collectors.joining("|", "\\$\\{(", ")\\}"));
Matcher m = Pattern.compile(pattern).matcher(template);
if(!m.find()) {
return template;
}
StringBuffer sb = new StringBuffer();
do {
m.appendReplacement(sb, Matcher.quoteReplacement(variables.get(m.group(1))));
} while(m.find());
m.appendTail(sb);
return sb.toString();
}
If you are performing the operation with the same Map very often, you may consider keeping the result of Pattern.compile(pattern), as it is immutable and safely shareable.
On the other hand, if you are using this operation with different maps frequently, it might be an option to use a generic pattern instead, combined with handling the possibility that a particular variable is not in the map. This adds the option to report occurrences of the ${…} pattern with an unknown variable:
private static Pattern VARIABLE = Pattern.compile("\\$\\{([^}]*)\\}");
public static String replaceAll(String template, Map<String,String> variables) {
Matcher m = VARIABLE.matcher(template);
if(!m.find())
return template;
StringBuffer sb = new StringBuffer();
do {
m.appendReplacement(sb,
Matcher.quoteReplacement(variables.getOrDefault(m.group(1), m.group(0))));
} while(m.find());
m.appendTail(sb);
return sb.toString();
}
m.group(0) is the actual match, so using this as a fall-back for the replacement string establishes the original behavior of not replacing ${…} occurrences when the key is not in the map. As said, alternative behaviors, like reporting the absent key or using a different fall-back text, are possible.
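As one example of such an alternative behavior, this sketch rejects templates that reference an unknown variable instead of leaving the placeholder in place. The class name, the method name replaceAllStrict and the choice of IllegalArgumentException are my assumptions.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictTemplateSketch {
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]*)\\}");

    static String replaceAllStrict(String template, Map<String, String> variables) {
        Matcher m = VARIABLE.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String value = variables.get(name);
            if (value == null) {
                // Report the absent key rather than silently keeping ${name}.
                throw new IllegalArgumentException("Unknown variable: " + name);
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllStrict("Hi ${name}", Map.of("name", "Joe"))); // prints Hi Joe
    }
}
```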
To update @didier-l’s answer, in Java 9 this is a one-liner!
Pattern.compile("[$][{](.+?)[}]").matcher(templateText).replaceAll(m -> variables.get(m.group(1)))

Map of Map - word pairs in java - stuck

I am using the Windows command prompt to pipe in a file. It's a regular file with words (not like abc, def, ghi, etc.).
I am trying to write a program that counts how many times each word pair appears in a text file. A word pair consists of two consecutive words (i.e. a word and the word that directly follows it). In the first sentence of this paragraph, the words “counts” and “how” are a word pair.
What I want the program to do is take this input:
abc def abc ghi abc def ghi jkl abc xyz abc abc abc ---
Should produce this output:
abc:
abc, 2
def, 2
ghi, 1
xyz, 1
def:
abc, 1
ghi, 1
ghi:
abc, 1
jkl, 1
jkl:
abc, 1
xyz:
abc, 1
My input is not going to be like that though. My input will be more like:
"seattle amazoncom is expected to report"
So would I even need to test for "abc"?
My biggest issue is adding the words to the map.
I think I need to use a map of maps, but I am not sure how to do this:
Map<String, Map<String, Integer>> uniqueWords = new HashMap<String, Map<String, Integer>>();
I think the map would produce this output for me, which is exactly what I want:
Key | Value         | Number of times
----+---------------+----------------
abc | def, ghi, jkl | 3
def | jkl, mno      | 2
If that map is correct, how would I add to it from the file in my situation?
I have tried:
if(words.contain("abc")) // would i even need to test for abc?????
{
uniqueWords.put("abc", words, ?) // not sure what to do about this?
}
this is what i have so far.
import java.util.Scanner;
import java.util.ArrayList;
import java.util.TreeSet;
import java.util.Iterator;
import java.util.HashSet;
public class Project1
{
public static void main(String[] args)
{
Scanner sc = new Scanner(System.in);
String word;
String grab;
int number;
// ArrayList<String> a = new ArrayList<String>();
// TreeSet<String> words = new TreeSet<String>();
Map<String, Map<String, Integer>> uniquWords = new HashMap<String, Map<String, Integer>>();
System.out.println("project 1\n");
while (sc.hasNext())
{
word = sc.next();
word = word.toLowerCase();
if (word.matches("abc")) // would i even need to test for abc?????
{
uniqueWords.put("abc", word); // syntax incorrect i still need an int!
}
if (word.equals("---"))
{
break;
}
}
System.out.println("size");
System.out.println(uniqueWords.size());
System.out.println("unique words");
System.out.println(uniqueWords.size());
System.out.println("\nbye...");
}
}
I hope someone can help me, because I have been banging my head and not learning anything for weeks now. Thank you.
I came up with this solution. I think your idea with the Map may be more elegant, but run this and let's see if we can refine it:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
public class Main {
private static List<String> inputWords = new ArrayList<String>();
private static Map<String, List<String>> result = new HashMap<String, List<String>>();
public static void main(String[] args) {
collectInput();
process();
generateOutput();
}
/*
* Modify this method to collect the input
* however you require it
*/
private static void collectInput(){
// test code
inputWords.add("abc");
inputWords.add("def");
inputWords.add("abc");
inputWords.add("ghi");
inputWords.add("abc");
inputWords.add("def");
inputWords.add("abc");
}
private static void process(){
// Iterate through every word in our input list
for(int i = 0; i < inputWords.size() - 1; i++){
// Create references to this word and next word:
String thisWord = inputWords.get(i);
String nextWord = inputWords.get(i+1);
// If this word is not in the result Map yet,
// then add it and create a new empty list for it.
if(!result.containsKey(thisWord)){
result.put(thisWord, new ArrayList<String>());
}
// Add nextWord to the list of adjacent words to thisWord:
result.get(thisWord).add(nextWord);
}
}
/*
* Rework this method to output results as you need them:
*/
private static void generateOutput(){
for(Entry<String, List<String>> e : result.entrySet()){
System.out.println("Symbol: " + e.getKey());
// Count the number of unique instances in the list:
Map<String, Integer> count = new HashMap<String, Integer>();
List<String> words = e.getValue();
for(String s : words){
if(!count.containsKey(s)){
count.put(s, 1);
}
else{
count.put(s, count.get(s) + 1);
}
}
// Print the occurrences of the following symbols:
for(Entry<String, Integer> f : count.entrySet()){
System.out.println("\t following symbol: " + f.getKey() + " : " + f.getValue());
}
}
System.out.println();
}
}
In your table, you have Key | Value | Number of times. Is the "number of times" specific to each of the second words? This may work.
My suggestion in your last question was to use a map of Lists. Each unique word would have an associated List (empty to begin with). At the end of processing you would count up all identical words in the list to get a total:
Key | List of following words
abc | def def ghi mno ghi
Now, you could count identical items in your list to find out that:
abc --> def = 2
abc --> ghi = 2
abc --> mno = 1
I think either this approach or yours would work well. I'll put some code together and update this post if nobody else responds.
You have initialized uniqueWords as a Map of Maps, not a Map of Strings as you are trying to populate it. For your design to work, you need to put a Map<String, Integer> as the value for the "abc" key.
....
Map<String, Map<String, Integer>> uniquWords = new HashMap<String, Map<String, Integer>>();
System.out.println("project 1\n");
while (sc.hasNext())
{
word = sc.next();
word = word.toLowerCase();
if (word.matches("abc")) // would i even need to test for abc?????
// no, just use the word
{
uniqueWords.put("abc", word); // <-- here you are putting a String value, instead of a Map<String, Integer>
}
if (word.equals("---"))
{
break;
}
}
Instead, you could do something akin to the following brute-force approach:
Map<String, Integer> followingWordsAndCnts = uniqueWords.get(word);
if (followingWordsAndCnts == null) {
followingWordsAndCnts = new HashMap<String,Integer>();
uniqueWords.put(word, followingWordsAndCnts);
}
if (sc.hasNext()) {
word = sc.next().toLowerCase();
Integer cnt = followingWordsAndCnts.get(word);
followingWordsAndCnts.put(word, cnt == null? 1 : cnt + 1);
}
You could make this a recursive method to ensure that each word gets its turn as the following word and the word that is being followed.
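An alternative to recursion is to remember the previous word while looping, so that every word serves once as the "following" word and once as the "followed" word. The sketch below makes a few assumptions of my own: the class name WordPairsSketch, TreeMaps for alphabetical output, and Java 8's computeIfAbsent and merge in place of the explicit null checks.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

class WordPairsSketch {
    static Map<String, Map<String, Integer>> countPairs(Scanner sc) {
        // Outer key: first word of the pair; inner map: following word -> count.
        Map<String, Map<String, Integer>> pairs = new TreeMap<>();
        String previous = null;
        while (sc.hasNext()) {
            String word = sc.next().toLowerCase();
            if (word.equals("---")) {
                break; // end-of-input marker used in the question
            }
            if (previous != null) {
                pairs.computeIfAbsent(previous, k -> new TreeMap<>())
                     .merge(word, 1, Integer::sum);
            }
            previous = word; // this word becomes the first word of the next pair
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A fixed string stands in for System.in so the sketch is reproducible.
        Scanner sc = new Scanner("abc def abc ghi abc def ghi jkl abc xyz abc abc abc ---");
        countPairs(sc).forEach((first, followers) -> {
            System.out.println(first + ":");
            followers.forEach((second, n) -> System.out.println(second + ", " + n));
        });
    }
}
```

For the sample input in the question this counts abc followed by abc twice, def twice, ghi once and xyz once. To read the piped file instead, pass new Scanner(System.in).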
For each key (e.g. "abc") you want to store another string (e.g. "def", "abc") paired with an integer (1, 2).
I would download Google Collections (now part of Guava) and use a Map<String, Multiset<String>>:
Map<String, Multiset<String>> myMap = new HashMap<String, Multiset<String>>();
...
void addPair(String word1, String word2) {
Multiset<String> set = myMap.get(word1);
if(set==null) {
set = HashMultiset.create();
myMap.put(word1,set);
}
set.add(word2);
}
int getOccurs(String word1, String word2) {
if(myMap.containsKey(word1))
return myMap.get(word1).count(word2);
return 0;
}
If you don't want to use a Multiset, you can create the logical equivalents(for your purposes, not general purpose):
Multiset<String> === Map<String,Integer>
Map<String, Multiset<String>> === Map<String, Map<String,Integer>>
To get your output in alphabetical order, simply change every HashMap into a TreeMap. For example, change
new HashMap<String, Map<String, Integer>>();
into
new TreeMap<String, Map<String, Integer>>();
and don't forget to add import java.util.TreeMap;
