Count occurrence of multiple substrings

I want to count the occurrence of multiple substrings in a string.
I am able to do so by using the following code:
int score = 0;
String text = "This is some random text. This is some random text.";
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
for (String word : words) {
if(StringUtils.containsIgnoreCase(text, word)){
score += 1;
}
}
My algorithm increases the score by +1 for each word from my "words" list that occurs in the text.
In my example, the score would be 2 because "this" and "is" occur in the text.
However, my code has to loop through the text for each string in my list.
Is there a faster way to do this?

How about the following:
String text = "This is some random text. This is some random text.";
text = text.toLowerCase();
String[] tokens = text.split("\\PL+");
java.util.Set<String> source = new java.util.HashSet<>();
for (String token : tokens) {
source.add(token);
}
java.util.List<String> words = java.util.Arrays.asList("this", "is", "for", "stackoverflow");
source.retainAll(words);
int score = source.size();
Split the text into words.
Add the words to a Set so that each word appears only once. Hence the Set contains the word this only once even though it appears twice in the text.
After calling the retainAll method, the Set only contains words that are also in the words list. Hence your score is the number of elements in the Set. Note that this matches whole words only, whereas containsIgnoreCase in the question also matches inside longer words (for example "is" inside "This").

The fastest way is to put every word of the text into a hash map.
Then, for each word you are searching for, you only have to look up a key in the hash map.
Given that your text has n words and your list has m words,
this solution takes O(n+m) time instead of O(n*m).
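A minimal sketch of that idea, assuming words are runs of letters and that matching should ignore case. The class and method names (WordIndexScore, indexWords, score) are made up for illustration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class WordIndexScore {
    // One pass over the text: map each lower-cased word to its number of occurrences.
    static Map<String, Integer> indexWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (!token.isEmpty()) { // split can yield an empty first token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // One hash lookup per searched word, so O(n + m) overall.
    static int score(String text, List<String> words) {
        Map<String, Integer> counts = indexWords(text);
        int score = 0;
        for (String word : words) {
            if (counts.containsKey(word.toLowerCase(Locale.ROOT))) {
                score++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String text = "This is some random text. This is some random text.";
        List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
        System.out.println(score(text, words));           // prints 2
        System.out.println(indexWords(text).get("this")); // prints 2
    }
}
```

Because the map also stores the counts, the same index can answer "how often does each word occur" without another pass over the text.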

This is a case where regex is your friend:
public static Map<String, Integer> countTokens(String src, List<String> tokens) {
Map<String, Integer> countMap = new HashMap<>();
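String end = "(?!\\S)"; // lookahead: the next character is whitespace or the end of the string
String start = "(?<!\\S)"; // lookbehind: the previous character is whitespace or the start of the string
// lookarounds do not consume the whitespace, so adjacent tokens such as "is is" are both counted
//iterate through your tokens
for (String token : tokens) {
//create a pattern, note the case insensitive flag; Pattern.quote escapes regex metacharacters in the token
Pattern pattern = Pattern.compile(start + Pattern.quote(token) + end, Pattern.CASE_INSENSITIVE);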
Matcher matcher = pattern.matcher(src);
int cnt = 0;
//count your matches.
while(matcher.find()) {
cnt++;
}
countMap.put(token, cnt);
}
return countMap;
}
public static void main(String[] args) {
String text = "This is some random text. This is some random text.";
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
for (Entry<String, Integer> entry : countTokens(text, words).entrySet()) {
System.out.println(entry);
}
}
If you want to find tokens within token, like "is" within "this", simply remove the start and end regex.

You can use the split method to convert the string into an array of strings, sort it, and then binary search the array for each element of the list, as in the following code.
String[] wordsArr = "This is some random text. This is some random text.".toLowerCase().split(" ");
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
int count = 0;
Arrays.sort(wordsArr);
for(String word: words)
if(Arrays.binarySearch( wordsArr, word )>-1)
count++;
Another good approach is to use a TreeSet; I got the inspiration for this one from @Abra.
String[] wordsArr = "This is some random text. This is some random text.".toLowerCase().split(" ");
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
TreeSet<String> setOfWords = new TreeSet<String>(Arrays.asList(wordsArr));
int count = 0;
for(String word: words)
if(setOfWords.contains(word))
count++;
Both of these methods have a time complexity of O(N log(M)) for the lookups, N being the size of the words list and M being the size of wordsArr or setOfWords (after the O(M log(M)) cost of sorting the array or building the set). However, be careful when using them, because they have one obvious flaw: they do not account for periods. If you search for "text", it won't be found, because the set/array contains "text.". You can get around that by removing all punctuation from your initial text and from the searched words. Alternatively, if you want it to be accurate, you can set the regex string in split() to "[^a-zA-Z]", which splits your String around non-alphabetical characters.
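A small sketch of the difference between the two split patterns. The helper name found is made up, and the trailing + in the second pattern is an addition here so that runs such as ". " act as a single separator:

```java
import java.util.Arrays;
import java.util.Locale;

class SplitPunctuationDemo {
    // Lower-cases the text, splits it with the given regex, sorts the tokens
    // and binary searches for the word.
    static boolean found(String text, String splitRegex, String word) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split(splitRegex);
        Arrays.sort(tokens);
        return Arrays.binarySearch(tokens, word) >= 0;
    }

    public static void main(String[] args) {
        String text = "This is some random text. This is some random text.";
        System.out.println(found(text, " ", "text"));          // false: the tokens are "text."
        System.out.println(found(text, "[^a-zA-Z]+", "text")); // true
    }
}
```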

Related

Find exact match from Array

In Java, I want to iterate over an array to find any of its words in my input string.
If the word is attached to digits in the input string, it should still return true.
String[] arr = {"card","creditcard","debitcard"};
String inputStr = "need to discard pin" --> Return False
String inputStr = "need to 444card pin" --> Return True, since the word is only attached to a number
I tried the code below, but it returns true because it finds "card" inside "discard"; I need an exact match.
Arrays.stream(arr).anyMatch(inputStr::contains)

Try this:
String[] arr = {"card","creditcard","debitcard"}; // array that keeps the words
String inputStr = "need to discard pin"; // String that keeps the 'sentence'
String[] wordsToBeChecked = inputStr.split(" "); // We take the string and split it at each " " (space)
HashSet<String> matchingWords = new HashSet<>(); // This will keep the matching words
for (String s : arr)
{
for (String s1 : wordsToBeChecked)
{
if(s.equalsIgnoreCase(s1)) // If first word matches with the second
{
matchingWords.add(s1); // add it to our container
}
}
}
Or using Java 8 Streams:
List<String> wordList = Arrays.asList(arr);
List<String> sentenceWordList = Arrays.asList(inputStr.split(" "));
List<String> matchedWords = wordList.stream().filter(sentenceWordList::contains)
.collect(Collectors.toList());

The problem with most answers here is that they do not take punctuation into consideration. To solve this, you could use a regular expression like below.
String[] arr = { "card", "creditcard", "debitcard" };
String inputStr = "You need to discard Pin Card.";
Arrays.stream(arr)
.anyMatch(word -> Pattern
.compile("(?<![a-z-])" + Pattern.quote(word) + "(?![a-z-])", Pattern.CASE_INSENSITIVE)
.matcher(inputStr)
.find());
With Pattern.quote(word), we escape any character within each word which is a special character in the context of a regular expression. For instance, the literal string a^b would never match without quoting, because ^ means the start of a string when used in a regular expression.
(?<![a-z-]) and (?![a-z-]) mean that there is no letter or hyphen immediately preceding or following the word. For instance, discard will not match, even though it contains the word card. I have used only lowercase in these character classes because of the next point:
The flag CASE_INSENSITIVE passed to the compile method causes the pattern to be matched in a case-insensitive manner.

You could split the string using a regular expression
String[] arr = {"card","creditcard","debitcard"};
String inputStr = "need to discard pin";
List<String> wordsToBeChecked = Arrays.asList(inputStr.split("[ 0-9]"));
Arrays.stream(arr).anyMatch(wordsToBeChecked::contains);
If your word list and input string are longer, consider collecting the words of your input string into a HashSet. Lookups will then be faster:
Set<String> wordsToBeChecked = new HashSet<>(Arrays.asList(inputStr.split(" ")));

You can create a Set of the words in inputStr and then check the words list against that Set.
Set<String> inputWords = uniqueWords(inputStr);
List<String> matchedWords = Arrays.stream(arr)
.filter(word -> inputWords.contains(word)) // filter keeps the matches; anyMatch would return a boolean
.collect(Collectors.toList());
Building the Set may be non-trivial if you have to account for hyphenation, numbers, punctuation, and so forth. I'll wave my hands and ignore that - here's a naive implementation of uniqueWords(String) that assumes they are separated by spaces.
public Set<String> uniqueWords(String string) {
return Arrays.stream(string.split(" "))
.collect(Collectors.toSet());
}

One way would be
String[] arr = {"card","creditcard","debitcard"};
String inputStr = "need to discard pin";
var contains = Arrays.stream(inputStr.split(" ")).anyMatch(word -> Arrays.asList(arr).contains(word));
You can adjust the split regex to include all kinds of whitespace too.
Also: Consider an appropriate data structure for lookups. Array will be O(n), HashSet will be O(1).
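A sketch that combines both suggestions, assuming whitespace and digits are the only separators you care about. The names ExactWordLookup and containsAnyWord are made up for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ExactWordLookup {
    // Returns true if any token of the input equals one of the wanted words.
    static boolean containsAnyWord(String input, String[] words) {
        Set<String> wanted = new HashSet<>(Arrays.asList(words)); // O(1) lookups
        // Split on runs of whitespace and digits, so "444card" yields "card".
        for (String token : input.split("[\\s\\d]+")) {
            if (wanted.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] arr = {"card", "creditcard", "debitcard"};
        System.out.println(containsAnyWord("need to discard pin", arr)); // false
        System.out.println(containsAnyWord("need to 444card pin", arr)); // true
    }
}
```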

Why doesn't List.contain match substrings?

I have a program that adds 1 to a counter whenever a word matches one of the words I have in a list.
For example, the list is [OK, NICE] and I check each word of a sentence (the sentence is split on spaces).
I don't want to add commas and periods to the split pattern; I want to keep splitting on spaces only, as I do now:
private static int contWords(String line, List<String> list) {
String[] words= line.split(" ");
int cont = 0;
for (int i = 0; i < words.length; i++) {
if (list.contains(words[i].toUpperCase())) {
cont++;
}
}
return cont;
}
Here are example words and whether they currently add 1 to the counter; the ones marked false do not add 1 but should:
OK = true
OKEY= false
NICE. = false
NICE, = false

The problem
Here is the problem you are trying to solve:
Take a list of target words
A sentence
Count number of occurrences of the target words in the words of the sentence
Suppose you are looking for 'OK' and 'NICE' in your sentence, and your sentence is "This is ok, nice work!"; the number of occurrences should be 2.
Options
You have a few options; I am going to show you the way using Streams.
Solution
private static int countWords(String sentence, List<String> targets) {
String[] words = sentence.split(" ");
return (int) Stream.of(words)
.map(String::toUpperCase)
.filter(word -> targets.stream().anyMatch(word::contains))
.count();
}
How does it work?
First, you take in a sentence and split it into an array (you have done this already).
Then, we take the array and use map to convert every word to its uppercase form. This means that every word will now be in all caps.
Then, using filter, we keep only the words that contain at least one of the targets as a substring.
Finally, we return the count.
More in depth?
I can go through what this statement means in more detail:
.filter(word -> targets.stream().anyMatch(word::contains))
word -> ... is a function that takes in a word and outputs a boolean value. This is useful because, for each word, we want to know whether or not it contains one of the targets as a substring.
The function computes targets.stream().anyMatch(word::contains), which goes through the stream of targets and tells us whether our word (the one being filtered) contains any of them as a substring.
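To make the direction of the check explicit, here is the same test with the method reference written out as a lambda. This is only a sketch, and the class and method names are made up:

```java
import java.util.Arrays;
import java.util.List;

class ContainsDirectionDemo {
    // Same check as anyMatch(word::contains): the word is the receiver of
    // contains, and each target is passed in as the argument.
    static boolean containsAnyTarget(String word, List<String> targets) {
        return targets.stream().anyMatch(target -> word.contains(target));
    }

    public static void main(String[] args) {
        List<String> targets = Arrays.asList("OK", "NICE");
        System.out.println(containsAnyTarget("NICE,", targets)); // true
        System.out.println(containsAnyTarget("OKEY", targets));  // true
        System.out.println(containsAnyTarget("O", targets));     // false
    }
}
```

Note that "OKEY" also counts here, because it contains "OK".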
NINJA EDIT:
In your original question, if the sentence was "This is Okey, nice work!" and the target list was ["OK", "OKEY"], it would have returned 2.
If this is the behaviour you want, you can change the method to:
private static int countWords(String sentence, List<String> targets) {
String[] words = sentence.split(" ");
return Stream.of(words)
.map(String::toUpperCase)
.map(word -> targets.stream().filter(word::contains).count())
.reduce(0L, Long::sum)
.intValue();
}
NINJA-IER EDIT:
Based on the other question proposed in the comments, you can replace all matched words with "***" by doing the following:
private static String replaceWordsWithAsterisks(String sentence, List<String> targets) {
String[] words = sentence.split(" ");
List<String> processedWords = Stream.of(words)
.map(word -> targets.stream().anyMatch(word.toUpperCase()::contains) ? "***" : word)
.collect(Collectors.toList());
return String.join(" ", processedWords);
}

How can I eliminate duplicate words from String in Java?

I have an ArrayList of Strings and it contains records such as:
this is a first sentence
hello my name is Chris
what's up man what's up man
today is tuesday
I need to clear this list, so that the output does not contain repeated content. In the case above, the output should be:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
As you can see, the 3rd String has been modified and now contains only one statement what's up man instead of two of them.
In my list, a String is sometimes correct and sometimes doubled as shown above.
I want to get rid of the doubled content, so I thought about iterating through this list:
for (String s: myList) {
but I cannot find a way of eliminating the duplicates, especially since the length of each String is not fixed. By that I mean there might be a long record:
this is a very long sentence this is a very long sentence
or sometimes a short one:
single word single word
Is there a native Java function for that, maybe?

Assuming the String is repeated just twice, with a space in between as in your examples, the following code would remove repetitions:
for (int i=0; i<myList.size(); i++) {
String s = myList.get(i);
String fs = s.substring(0, s.length()/2);
String ls = s.substring(s.length()/2+1, s.length());
if (fs.equals(ls)) {
myList.set(i, fs);
}
}
The code just splits each entry of the list into two substrings (dividing at the halfway point). If both are equal, it replaces the original element with only one half, thus removing the repetition.
I was testing the code and did not see @Brendan Robert's answer. This code follows the same logic as his answer.

I would suggest using regular expressions. I was able to remove duplicates using this pattern: \b([\w\s']+) \1\b
public class Main {
static String [] phrases = {
"this is a first sentence",
"hello my name is Chris",
"what's up man what's up man",
"today is tuesday",
"this is a very long sentence this is a very long sentence",
"single word single word",
"hey hey"
};
public static void main(String[] args) throws Exception {
String duplicatePattern = "\\b([\\w\\s']+) \\1\\b";
Pattern p = Pattern.compile(duplicatePattern);
for (String phrase : phrases) {
Matcher m = p.matcher(phrase);
if (m.matches()) {
System.out.println(m.group(1));
} else {
System.out.println(phrase);
}
}
}
}
Results:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
this is a very long sentence
single word
hey

Assumptions:
Uppercase words are equal to lowercase counterparts.
String fullString = "lol lol";
String[] words = fullString.split("\\W+");
StringBuilder stringBuilder = new StringBuilder();
Set<String> wordsHashSet = new HashSet<>();
for (String word : words) {
// Check for duplicates
if (wordsHashSet.contains(word.toLowerCase())) continue;
wordsHashSet.add(word.toLowerCase());
stringBuilder.append(word).append(" ");
}
String nonDuplicateString = stringBuilder.toString().trim();

Simple logic: split the string into words on the space token, i.e. " ", add the words to a LinkedHashSet (which drops duplicates and keeps insertion order), then join them back together with spaces.
String s = "I want to walk my dog I want to walk my dog";
Set<String> temp = new LinkedHashSet<>();
String[] arr = s.split(" ");
for ( String ss : arr)
temp.add(ss);
// String.join avoids stripping "[", "]" and "," from toString(), which would also remove commas inside the words
String newl = String.join(" ", temp);
System.out.println(newl);
Output: I want to walk my dog

It depends on your situation. Assuming the string is repeated at most twice, and never three or more times, you could find the length of the entire string, find the halfway point, and compare each index after the halfway point with the matching index from the beginning. If the string can be repeated more than twice, you will need a more complicated algorithm that first determines how many times the string is repeated, then finds the starting index of each repeat and truncates everything from the beginning of the first repeat onward. If you can provide some more context on the scenarios you expect to handle, we can start putting together some ideas.
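One possible sketch of the more general case, assuming the copies are identical and separated by exactly one space. The names RepeatCollapser and collapse are made up, and this is just one way to do it:

```java
class RepeatCollapser {
    // Returns the shortest unit u such that s is "u u ... u" (two or more copies
    // joined by single spaces), or s itself when no such unit exists.
    static String collapse(String s) {
        String padded = s + " "; // now every copy, including the last, ends with a space
        int n = padded.length();
        for (int unit = 1; unit <= n / 2; unit++) {
            // The padded length must be a multiple of the unit length,
            // and a copy must end exactly on a space.
            if (n % unit != 0 || padded.charAt(unit - 1) != ' ') {
                continue;
            }
            boolean repeats = true;
            for (int i = unit; i < n && repeats; i += unit) {
                repeats = padded.regionMatches(i, padded, 0, unit);
            }
            if (repeats) {
                return padded.substring(0, unit - 1); // drop the trailing space
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(collapse("what's up man what's up man")); // what's up man
        System.out.println(collapse("hey hey hey"));                 // hey
        System.out.println(collapse("today is tuesday"));            // today is tuesday
    }
}
```

A string that legitimately consists of a repeated word or phrase is collapsed too, so this only fits data where such repetition is always an error.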

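// Doing it in Java 8: drop a word when it equals (ignoring case) the word just before it
String str1 = "I am am am a good Good coder";
String[] arrStr = str1.split(" ");
String[] element = new String[1]; // one-element array so the lambda can update the previous word
String result = Arrays.stream(arrStr).filter(word -> {
if (!word.equalsIgnoreCase(element[0])) {
element[0] = word;
return true;
}
return false;
}).collect(Collectors.joining(" "));
// result is "I am a good coder"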

how to display total no. of substring in a string

How can I get the total number of occurrences of each substring in a string?
This should cover every substring in the string.
Example:
str="This is this my book is This"
The output should look like this:
This=3
Is=2
my=1
book=1

If I understood you correctly this is a solution for your problem:
String str="This is this my book is This";
Map<String, Integer> counts = new HashMap<String, Integer>();
String[] words = str.toLowerCase().split("[\\s.,;!?]+"); // the + treats a run such as ", " as one delimiter, so no empty words are counted
for (String word: words) {
int count = counts.containsKey(word) ? counts.get(word).intValue() : 0;
counts.put(word, Integer.valueOf(count + 1));
}
You just split the string by the delimiters you want to consider and collect the occurrences in a map.

If I'm right, you want to count the occurrences of all words, not of all possible substrings. A very small, easy-to-understand piece of code would be the following:
// Split at space
String[] words = input.split(" ");
HashMap<String, Integer> countingMap = new HashMap<>();
for (String word : words) {
Integer counter = countingMap.get(word);
if (counter == null) {
counter = 0;
}
countingMap.put(word, counter + 1);
}
However, this approach is limited, as it assumes that words are separated by single spaces.
Regex is a more powerful tool: it provides a special character for a word boundary (this also matches at ,.!? and so on). Consider the following pattern:
\b(.+?)\b
You can see an example here: regex101.com/r/hO8kA0/1
How to do this in Java?
Pattern pattern = Pattern.compile("\\b(.+?)\\b");
Matcher matcher = pattern.matcher(input);
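while(matcher.find()) {
String word = matcher.group(1);
// Note: \b(.+?)\b also matches the separators between two words (such as " "),
// so only count matches that consist of word characters
if (word.matches("\\w+")) {
// Here is your word, count the occurrences like above
}
}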
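Putting the two parts together, a compact sketch could look like this. It swaps in the simpler pattern \w+, which matches the words themselves, and it assumes counting should ignore case, as in the question's expected output. The names RegexWordCount and countWords are made up:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexWordCount {
    // \w+ matches runs of word characters, so separators are never returned.
    private static final Pattern WORD = Pattern.compile("\\w+");

    static Map<String, Integer> countWords(String input) {
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps first-seen order
        Matcher matcher = WORD.matcher(input.toLowerCase(Locale.ROOT));
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("This is this my book is This"));
        // prints {this=3, is=2, my=1, book=1}
    }
}
```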

String str = "This is this my book is This";
String[] words = str.split(" ");
Map<String, Integer> unitwords = new HashMap<String, Integer>();
for (String word : words) {
if (unitwords.containsKey(word)) {
// increment the existing count
unitwords.put(word, unitwords.get(word) + 1);
} else {
// first time this word is seen
unitwords.put(word, 1);
}
}
Then print the map unitwords.

Troubleshooting java.lang.ArrayIndexOutOfBoundsException

Hi friends, I'm doing my final-year project on semantic similarity between sentences.
I'm using the WordNet 2.1 database to retrieve the meanings. I split each sentence into words, look up the meaning of each word, and store it in an array. But I only get the meanings for the words of the first sentence.
String[] sentences = result.split("[\\.\\!\\?]");
for (int i=0;i<sentences.length;i++)
{
System.out.println(i);
System.out.println(sentences[i]);
int wcount1 = sentences[i].split("\\s+").length;
System.out.println(wcount1);int wcount1=wordCount(w2);
System.out.println(wcount1);
String[] word1 = sentences[i].split(" ");
for (int j=0;j<wcount1;j++){
System.out.println(j);
System.out.println(word1[j]);
}
}
IndexWordSet set = wordnet.lookupAllIndexWords(word1[j]);
System.out.println(set);
IndexWord[] ws = set.getIndexWordArray();
POS p = ws[0].getPOS(); // line no 103
Set<String> synonyms = new HashSet<String>();
IndexWord indexWord = wordnet.lookupIndexWord(p, word1[j]);
Synset[] synSets = indexWord.getSenses();
for (Synset synset : synSets)
{
Word[] words = synset.getWords();
for (Word word : words)
{
synonyms.add(word.getLemma());
}
}
System.out.println(synonyms);
OUTPUT:
Only the words of the first sentence (sentences[0]) show their meaning; the remaining words are not looped over, and this error is shown:
java.lang.ArrayIndexOutOfBoundsException: 0
at first_JWNL.main(first_JWNL.java:102)

When you declare the variable wcount1, you assign it the value sentences[i].split("\\s+").length. Yet when you assign the variable word1, you use sentences[i].split(" ").
Is it possible that, because you are using two different regular expressions, the second split (the one assigned to word1) does not split the sentence the same way? In that case, accessing the value (System.out.println(word1[j]);) throws the ArrayIndexOutOfBoundsException, since the value of wcount1 may be bigger than the length of word1.
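A small sketch of how the two patterns can disagree. The sample sentence is made up, and whether this is what happens in the question depends on the actual input, which is not shown:

```java
class SplitMismatchDemo {
    // Number of tokens produced by splitting the sentence with the given regex.
    static int countWith(String sentence, String regex) {
        return sentence.split(regex).length;
    }

    public static void main(String[] args) {
        String sentence = "one\ttwo three"; // a tab between "one" and "two"
        System.out.println(countWith(sentence, "\\s+")); // 3: the tab counts as whitespace
        System.out.println(countWith(sentence, " "));    // 2: only the space splits

        // Looping up to the array's own length avoids relying on a separate count.
        String[] words = sentence.split("\\s+");
        for (int j = 0; j < words.length; j++) {
            System.out.println(words[j]);
        }
    }
}
```

Using the same pattern for both the count and the array, or simply looping up to word1.length, rules this mismatch out.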
