Why doesn't List.contains match substrings? - java

I have a program that adds 1 to a counter whenever a word in a sentence matches one of the words in my list.
For example, the list is [OK, NICE] and I check each word of the sentence (the sentence is split on spaces).
I don't want to add commas and periods as split delimiters; I only want to split on a space, as I do now:
private static int contWords(String line, List<String> list) {
String[] words= line.split(" ");
int cont = 0;
for (int i = 0; i < words.length; i++) {
if (list.contains(words[i].toUpperCase())) {
cont++;
}
}
return cont;
}
Here are examples of words that do not add 1 to the counter but should:
OK = true
OKEY= false
NICE. = false
NICE, = false

The problem
Here is the problem you are trying to solve:
Take a list of target words
A sentence
Count number of occurrences of the target words in the words of the sentence
Suppose you are looking for 'OK' and 'NICE' and your sentence is "This is ok, nice work!"; the number of occurrences should be 2.
Options
You have a few options; I am going to show you one way using Streams.
Solution
private static int countWords(String sentence, List<String> targets) {
String[] words = sentence.split(" ");
return (int) Stream.of(words)
.map(String::toUpperCase)
.filter(word -> targets.stream().anyMatch(word::contains))
.count();
}
How does it work?
First, you take in a sentence and split it into an array (you have done this already).
Then we use map to convert every word to its uppercase form, so every word is now in all caps.
Then, using filter, we keep only the words that contain at least one of the target words as a substring.
Finally, we return the count.
More in depth?
I can go through what this statement means in more detail:
.filter(word -> targets.stream().anyMatch(word::contains))
word -> ... is a function that takes in a word and outputs a boolean value. This is useful because, for each word, we want to know whether or not it contains any of the targets as a substring.
The function computes targets.stream().anyMatch(word::contains), which goes through the stream of targets and tells us whether the word we are filtering contains (as a substring) any of them. Here word::contains is shorthand for target -> word.contains(target).
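To make the direction of the match concrete, here is a small runnable sketch of the same logic. The class name and the sample sentences are only illustrative. Note that because it relies on contains, a word such as "OKEY" also counts as a match for the target "OK".

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class CountWordsDemo {
    // A word is counted once if it contains at least one target as a substring.
    static int countWords(String sentence, List<String> targets) {
        return (int) Stream.of(sentence.split(" "))
                .map(String::toUpperCase)
                // word::contains means target -> word.contains(target)
                .filter(word -> targets.stream().anyMatch(word::contains))
                .count();
    }

    public static void main(String[] args) {
        List<String> targets = Arrays.asList("OK", "NICE");
        System.out.println(countWords("This is ok, nice work!", targets)); // 2
        System.out.println(countWords("That was okey", targets));          // 1
    }
}
```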
NINJA EDIT:
If the sentence was "This is Okey, nice work!" and the target list was ["OK", "OKEY"], the method above returns 1, because a word is counted once no matter how many targets it contains.
If you want every matching target to be counted instead (which gives 2 here), you can change the method to:
private static int countWords(String sentence, List<String> targets) {
String[] words = sentence.split(" ");
return Stream.of(words)
.map(String::toUpperCase)
.map(word -> targets.stream().filter(word::contains).count())
.reduce(0L, Long::sum)
.intValue();
}
NINJA-IER EDIT:
Based on the other question proposed in the comments, you can replace all matched words with "***" by doing the following:
private static String replaceWordsWithAsterisks(String sentence, List<String> targets) {
String[] words = sentence.split(" ");
List<String> processedWords = Stream.of(words)
.map(word -> targets.stream().anyMatch(word.toUpperCase()::contains) ? "***" : word)
.collect(Collectors.toList());
return String.join(" ", processedWords);
}

Related

Count occurrence of multiple substrings

I want to count the occurrence of multiple substrings in a string.
I am able to do so by using the following code:
int score = 0;
String text = "This is some random text. This is some random text.";
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
for (String word : words) {
if(StringUtils.containsIgnoreCase(text, word)){
score += 1;
}
}
My algorithm increases the score by +1 for each word from my "words" list that occurs in the text.
In my example, the score would be 2 because "this" and "is" occur in the text.
However, my code has to loop through the text for each string in my list.
Is there a faster way to do this?
How about the following:
String text = "This is some random text. This is some random text.";
text = text.toLowerCase();
String[] tokens = text.split("\\PL+");
java.util.Set<String> source = new java.util.HashSet<>();
for (String token : tokens) {
source.add(token);
}
java.util.List<String> words = java.util.Arrays.asList("this", "is", "for", "stackoverflow");
source.retainAll(words);
int score = source.size();
Split text into words.
Add the words to a Set so that each word appears only once. Hence Set will contain the word this only once despite the fact that the word this appears twice in text.
After calling method retainAll, the Set only contains words that are in the words list. Hence your score is the number of elements in the Set.
The fastest way would be to store each word of the text in a hash map (or hash set).
Then, for each word you are searching for, you just have to look it up among the keys of the hash map.
Given that your text has n words and your list has m words,
the solution would take O(n+m) time instead of O(n*m).
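A minimal sketch of that idea, assuming that whole-word matching is acceptable (unlike containsIgnoreCase, it will not find "is" inside "this"). The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashLookupScore {
    static int score(String text, List<String> words) {
        // O(n): collect the distinct lower-cased words of the text,
        // splitting on runs of non-letter characters.
        Set<String> textWords =
                new HashSet<>(Arrays.asList(text.toLowerCase().split("\\P{L}+")));
        int score = 0;
        // O(m): one hash lookup (constant time on average) per search word.
        for (String word : words) {
            if (textWords.contains(word.toLowerCase())) {
                score++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String text = "This is some random text. This is some random text.";
        List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
        System.out.println(score(text, words)); // 2
    }
}
```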
This is a case where regex is your friend:
public static Map<String, Integer> countTokens(String src, List<String> tokens) {
Map<String, Integer> countMap = new HashMap<>();
String end = "(\\s|$)"; //match whitespace or the end of the string
String start = "(^|\\s)"; //match whitespace or the start of the string
//iterate through your tokens
for (String token : tokens) {
//create a pattern, note the case insensitive flag
Pattern pattern = Pattern.compile(start+token+end, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(src);
int cnt = 0;
//count your matches.
while(matcher.find()) {
cnt++;
}
countMap.put(token, cnt);
}
return countMap;
}
public static void main(String[] args) throws IOException {
String text = "This is some random text. This is some random text.";
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
for (Entry<String, Integer> entry : countTokens(text, words).entrySet()) {
System.out.println(entry);
}
}
If you want to find tokens within token, like "is" within "this", simply remove the start and end regex.
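One caveat with the pattern above: the start and end groups consume the surrounding whitespace, so two adjacent occurrences that share a single space (as in "is is is") may not all be counted, and tokens containing regex metacharacters are not escaped. The following is a hedged variant, not the original author's code, that uses lookarounds and Pattern.quote instead. The class name is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenCounter {
    static Map<String, Integer> countTokens(String src, List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            // (?<!\S) and (?!\S) assert that no non-whitespace character is
            // adjacent to the token, without consuming the whitespace itself.
            Pattern pattern = Pattern.compile(
                    "(?<!\\S)" + Pattern.quote(token) + "(?!\\S)",
                    Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(src);
            int count = 0;
            while (matcher.find()) {
                count++;
            }
            counts.put(token, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "This is some random text. This is some random text.";
        System.out.println(countTokens(text, Arrays.asList("this", "is", "for", "stackoverflow")));
    }
}
```

Like the original, this treats "text." as different from "text", because the period is a non-whitespace neighbour.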
You can use the split method to convert the string into an array of strings, sort it, and then binary search the array for each element of the list. This is implemented in the code below.
String[] wordsArr = "This is some random text. This is some random text.".toLowerCase().split(" ");
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
int count = 0;
Arrays.sort(wordsArr);
for(String word: words)
if(Arrays.binarySearch( wordsArr, word )>-1)
count++;
Another good approach is to use a TreeSet; I got the inspiration for this one from @Abra.
String[] wordsArr = "This is some random text. This is some random text.".toLowerCase().split(" ");
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
TreeSet<String> setOfWords = new TreeSet<String>(Arrays.asList(wordsArr));
int count = 0;
for(String word: words)
if(setOfWords.contains(word))
count++;
Both of these methods have a time complexity of O(N log(M)), N being the size of the words list and M being the size of wordsArr or setOfWords.
However, be careful with these methods, since they have one obvious flaw: they don't account for periods. If you search for "text", it won't be found, because the set/array contains "text.".
You can get around that by removing all punctuation from your initial text and from the searched words. Alternatively, if you want it to be accurate, you can set the regex string in split() to "[^a-zA-Z]", which splits your String around non-alphabetical characters (use "[^a-zA-Z]+" to avoid empty strings between adjacent separators).

Find exact match from Array

In Java, I want to iterate over an array to find any words that match words in my input string.
If the word in the input string is preceded by numbers, it should still return true.
String[] arr = {"card","creditcard","debitcard"};
String inputStr = "need to discard pin"; // --> should return false
String inputStr = "need to 444card pin"; // --> should return true, because the word is only preceded by a number
I tried the code below, but it returns true because it finds "card" inside the "discard" string; I need an exact match.
Arrays.stream(arr).anyMatch(inputStr::contains)
Try this:
String[] arr = {"card","creditcard","debitcard"}; // array that keeps the words
String inputStr = "need to discard pin"; // String that keeps the 'sentence'
String[] wordsToBeChecked = inputStr.split(" "); // We take the string and split it at each " " (space)
HashSet<String> matchingWords = new HashSet<>(); // This will keep the matching words
for (String s : arr)
{
for (String s1 : wordsToBeChecked)
{
if(s.equalsIgnoreCase(s1)) // If first word matches with the second
{
matchingWords.add(s1); // add it to our container
}
}
}
Or using Java 8 Streams:
List<String> wordList = Arrays.asList(arr);
List<String> sentenceWordList = Arrays.asList(inputStr.split(" "));
List<String> matchedWords = wordList.stream().filter(sentenceWordList::contains)
.collect(Collectors.toList());
The problem with most answers here is that they do not take punctuation into consideration. To solve this, you could use a regular expression like below.
String[] arr = { "card", "creditcard", "debitcard" };
String inputStr = "You need to discard Pin Card.";
Arrays.stream(arr)
.anyMatch(word -> Pattern
.compile("(?<![a-z-])" + Pattern.quote(word) + "(?![a-z-])", Pattern.CASE_INSENSITIVE)
.matcher(inputStr)
.find());
With Pattern.quote(word), we escape any character within each word which is a special character in the context of a regular expression. For instance, without quoting, the literal string a^b would never match, because ^ means the start of a string if used in a regular expression.
(?<![a-z-]) and (?![a-z-]) mean that there is no letter or hyphen immediately preceding or following the word. For instance, discard will not match, even though it contains the word card. I have used only lowercase in these character classes because of the next point:
The flag CASE_INSENSITIVE passed to the compile method causes the pattern to be matched in a case-insensitive manner.
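As a quick check of how these lookarounds behave on the inputs from the question, here is the same pattern wrapped in a small helper. The class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class ExactWordMatch {
    static boolean containsAnyWord(String input, String... words) {
        return Arrays.stream(words)
                .anyMatch(word -> Pattern
                        // no letter or hyphen directly before or after the word
                        .compile("(?<![a-z-])" + Pattern.quote(word) + "(?![a-z-])",
                                Pattern.CASE_INSENSITIVE)
                        .matcher(input)
                        .find());
    }

    public static void main(String[] args) {
        String[] arr = { "card", "creditcard", "debitcard" };
        System.out.println(containsAnyWord("need to discard pin", arr));           // false
        System.out.println(containsAnyWord("need to 444card pin", arr));           // true
        System.out.println(containsAnyWord("You need to discard Pin Card.", arr)); // true
    }
}
```

The digits in "444card" are neither letters nor hyphens, so the lookbehind lets the match through, which is the behaviour the question asks for.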
You could split the string using a regular expression
String[] arr = {"card","creditcard","debitcard"};
String inputStr = "need to discard pin";
List<String> wordsToBeChecked = Arrays.asList(inputStr.split("[ 0-9]"));
Arrays.stream(arr).anyMatch(wordsToBeChecked::contains);
If your word list and input string are longer, consider splitting your input string into a HashSet. Lookups will then be faster:
Set<String> wordsToBeChecked = new HashSet<>(Arrays.asList(inputStr.split(" ")));
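Putting the two ideas together (splitting on digits as well as whitespace, and looking words up in a HashSet) might look like the sketch below. The class and method names are assumptions, the comparison is case-sensitive, and trailing digits (as in "card99") are stripped as well, which may or may not be what you want.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DigitAwareMatch {
    static boolean matches(String input, String[] words) {
        // Split on runs of whitespace or digits, so "444card" yields "card".
        Set<String> inputWords =
                new HashSet<>(Arrays.asList(input.split("[\\s\\d]+")));
        // Each lookup in the set is constant time on average.
        return Arrays.stream(words).anyMatch(inputWords::contains);
    }

    public static void main(String[] args) {
        String[] arr = {"card", "creditcard", "debitcard"};
        System.out.println(matches("need to discard pin", arr)); // false
        System.out.println(matches("need to 444card pin", arr)); // true
    }
}
```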
You can create a Set of the words in inputStr and then check the words list against that Set.
Set<String> inputWords = uniqueWords(inputStr);
List<String> matchedWords = Arrays.stream(arr)
.filter(inputWords::contains)
.collect(Collectors.toList());
Building the Set may be non-trivial if you have to account for hyphenation, numbers, punctuation, and so forth. I'll wave my hands and ignore that - here's a naive implementation of uniqueWords(String) that assumes they are separated by spaces.
public Set<String> uniqueWords(String string) {
return Arrays.stream(string.split(" "))
.collect(Collectors.toSet());
}
One way would be
String[] arr = {"card","creditcard","debitcard"};
String inputStr = "need to discard pin";
var contains = Arrays.stream(inputStr.split(" ")).anyMatch(word -> Arrays.asList(arr).contains(word));
You can adjust the split regex to include all kinds of whitespace too.
Also: Consider an appropriate data structure for lookups. Array will be O(n), HashSet will be O(1).

Reassemble split string based on previous split in JAVA?

If I split a string, say like this:
List<String> words = Arrays.asList(input.split("\\s+"));
I then want to modify those words in various ways and reassemble them using the same logic. Assuming no word lengths have changed, is there a way to do that easily? Humor me in that there's a reason I'm doing this.
Note: I need to match all whitespace, not just spaces. Hence the regex.
i.e.:
"Beautiful Country" -> ["Beautiful", "Country"] -> ["BEAUTIFUL", "COUNTRY"] -> "BEAUTIFUL COUNTRY"
If you use String.split, there is no way to be sure that the reassembled strings will be the same as the original ones.
In general (and in your case) there is no way to capture what the actual separators used were. In your example, "\\s+" will match one or more whitespace characters, but you don't know which characters were used, or how many there were.
When you use split, the information about the separators is lost. Period.
(On the other hand, if you don't care that the reassembled string may be a different length or may have different separators to the original, use the Joiner class ...)
Assuming you have a limit on how many words you can expect, you could try writing a regular expression like
(\S+)(\s+)?(\S+)?(\s+)?(\S+)?
(for the case in which you expect up to three words). You could then use the Matcher API methods groupCount(), group(n) to pull the individual words (the odd groups) or whitespace separators (the even groups >0), do what you needed with the words, and re-assemble them once again...
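If there is no fixed limit on the number of words, a related option is to scan the input with Matcher.find() and an alternation of word runs and whitespace runs, so that every separator is kept exactly as it was. This is only a sketch of that idea; the class name, method name, and the UnaryOperator parameter are assumptions for illustration.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreserveWhitespace {
    // Group 1 is a run of non-whitespace (a word), group 2 a run of whitespace.
    private static final Pattern TOKEN = Pattern.compile("(\\S+)|(\\s+)");

    static String transformWords(String input, UnaryOperator<String> wordFn) {
        StringBuilder out = new StringBuilder(input.length());
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                out.append(wordFn.apply(m.group(1))); // modify the word
            } else {
                out.append(m.group(2));               // copy the separator unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transformWords("Beautiful   Country", String::toUpperCase));
    }
}
```

Because the two alternatives together cover every character, the output has the same separators as the input regardless of which whitespace characters were used or how many there were.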
I tried this:
import java.util.*;
import java.util.stream.*;
public class StringSplits {
private static List<String> whitespaceWords = new ArrayList<>();
public static void main(String [] args) {
String input = "What a Wonderful World! ...";
List<String> words = processInput(input);
// First transformation: ["What", "a", "Wonderful", "World!", "..."]
String first = words.stream()
.collect(Collectors.joining("\", \"", "[\"", "\"]"));
System.out.println(first);
// Second transformation: ["WHAT", "A", "WONDERFUL", "WORLD!", "..."]
String second = words.stream()
.map(String::toUpperCase)
.collect(Collectors.joining("\", \"", "[\"", "\"]"));
System.out.println(second);
// Final transformation: WHAT A WONDERFUL WORLD! ...
String last = IntStream.range(0, words.size())
.mapToObj(i -> words.get(i) + whitespaceWords.get(i))
.map(String::toUpperCase)
.collect(Collectors.joining());
System.out.println(last);
}
/*
* Accepts input string of words containing character words and
* whitespace(s) (as defined in the method Character#isWhitespace).
* Processes and returns only the character strings. Stores the
* whitespace 'words' (a single or multiple whitespaces) in a List<String>.
* NOTE: This method uses String concatenation in a loop. For processing
* large inputs consider using a StringBuilder.
*/
private static List<String> processInput(String input) {
List<String> words = new ArrayList<>();
String word = "";
String whitespaceWord = "";
boolean wordFlag = true;
for (char c : input.toCharArray()) {
if (! Character.isWhitespace(c)) {
if (! wordFlag) {
wordFlag = true;
whitespaceWords.add(whitespaceWord);
word = whitespaceWord = "";
}
word = word + String.valueOf(c);
}
else {
if (wordFlag) {
wordFlag = false;
words.add(word);
word = whitespaceWord = "";
}
whitespaceWord = whitespaceWord + String.valueOf(c);
}
} // end-for
whitespaceWords.add(whitespaceWord);
if (! word.isEmpty()) {
words.add(word);
}
return words;
}
}

How can I eliminate duplicate words from String in Java?

I have an ArrayList of Strings and it contains records such as:
this is a first sentence
hello my name is Chris
what's up man what's up man
today is tuesday
I need to clear this list, so that the output does not contain repeated content. In the case above, the output should be:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
As you can see, the 3rd String has been modified and now contains only one statement, what's up man, instead of two of them.
In my list there is a situation that sometimes the String is correct, and sometimes it is doubled as shown above.
I want to get rid of it, so I thought about iterating through this list:
for (String s: myList) {
but I cannot find a way of eliminating duplicates, especially since the length of each string is not fixed. By that I mean there might be a long record such as:
this is a very long sentence this is a very long sentence
or sometimes short ones:
single word single word
Is there some native Java function for that, maybe?
Assuming the String is repeated just twice, with a space in between as in your examples, the following code would remove repetitions:
for (int i=0; i<myList.size(); i++) {
String s = myList.get(i);
String fs = s.substring(0, s.length()/2);
String ls = s.substring(s.length()/2+1, s.length());
if (fs.equals(ls)) {
myList.set(i, fs);
}
}
The code just splits each entry of the list into two substrings (dividing at the halfway point). If both are equal, it substitutes the original element with only one half, thus removing the repetition.
I was testing the code and did not see @Brendan Robert's answer. This code follows the same logic as his answer.
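The halving logic above can also be written as a small helper that first checks the string really has the shape "X X" (odd length, with a space exactly in the middle), which also avoids calling substring with invalid indices on an empty string. This is a sketch under the same "repeated exactly twice" assumption; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HalfDedup {
    static String dedupe(String s) {
        int mid = s.length() / 2;
        // A doubled phrase "X X" has odd length with a space in the middle,
        // and the mid characters before it equal the mid characters after it.
        if (s.length() % 2 == 1
                && s.charAt(mid) == ' '
                && s.regionMatches(0, s, mid + 1, mid)) {
            return s.substring(0, mid);
        }
        return s;
    }

    public static void main(String[] args) {
        List<String> myList = new ArrayList<>(Arrays.asList(
                "hello my name is Chris",
                "what's up man what's up man",
                "today is tuesday"));
        myList.replaceAll(HalfDedup::dedupe);
        myList.forEach(System.out::println);
    }
}
```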
I would suggest using regular expressions. I was able to remove duplicates using this pattern: \b([\w\s']+) \1\b
public class Main {
static String [] phrases = {
"this is a first sentence",
"hello my name is Chris",
"what's up man what's up man",
"today is tuesday",
"this is a very long sentence this is a very long sentence",
"single word single word",
"hey hey"
};
public static void main(String[] args) throws Exception {
String duplicatePattern = "\\b([\\w\\s']+) \\1\\b";
Pattern p = Pattern.compile(duplicatePattern);
for (String phrase : phrases) {
Matcher m = p.matcher(phrase);
if (m.matches()) {
System.out.println(m.group(1));
} else {
System.out.println(phrase);
}
}
}
}
Results:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
this is a very long sentence
single word
hey
Assumptions:
Uppercase words are equal to lowercase counterparts.
String fullString = "lol lol";
String[] words = fullString.split("\\W+");
StringBuilder stringBuilder = new StringBuilder();
Set<String> wordsHashSet = new HashSet<>();
for (String word : words) {
// Check for duplicates
if (wordsHashSet.contains(word.toLowerCase())) continue;
wordsHashSet.add(word.toLowerCase());
stringBuilder.append(word).append(" ");
}
String nonDuplicateString = stringBuilder.toString().trim();
Simple logic: split the string on the space token (" "), add each word to a LinkedHashSet, then retrieve the contents back and strip "[", "]" and ",":
String s = "I want to walk my dog I want to walk my dog";
Set<String> temp = new LinkedHashSet<>();
String[] arr = s.split(" ");
for ( String ss : arr)
temp.add(ss);
String newl = temp.toString()
.replace("[","")
.replace("]","")
.replace(",","");
System.out.println(newl);
Output: I want to walk my dog
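Stripping characters from the toString() output also removes any commas or brackets that belong to the words themselves. A variation that avoids this, sketched here with a made-up class name, joins the set with String.join. Note that this approach (like the one above) removes every repeated word, not only a doubled phrase.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueWords {
    static String uniqueWords(String s) {
        // LinkedHashSet drops duplicates but keeps first-seen order.
        Set<String> seen = new LinkedHashSet<>(Arrays.asList(s.split(" ")));
        return String.join(" ", seen);
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords("I want to walk my dog I want to walk my dog"));
        // I want to walk my dog
    }
}
```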
It depends on your situation. Assuming the string is repeated at most twice (not three or more times), you could find the length of the entire string, find the halfway point, and compare each index after the halfway point with the matching index at the beginning.
If the string can be repeated more than twice, you will need a more complicated algorithm that first determines how many times the string is repeated, then finds the starting index of each repeat and truncates everything from the beginning of the first repeat onward.
If you can provide some more context about the scenarios you expect to handle, we can start putting together some ideas.
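One possible shape for the "repeated any number of times" case, working on words rather than character indices, is to look for the shortest prefix of words that the whole string is a repetition of. This is a sketch that assumes words are separated by single spaces; the class and method names are hypothetical.

```java
import java.util.Arrays;

class RepeatedPhrase {
    static String collapse(String s) {
        String[] words = s.split(" ");
        int n = words.length;
        // Try each candidate period p (counted in words), shortest first.
        for (int p = 1; p <= n / 2; p++) {
            if (n % p != 0) {
                continue; // the phrase must fit a whole number of times
            }
            boolean periodic = true;
            for (int i = p; i < n && periodic; i++) {
                periodic = words[i].equals(words[i - p]);
            }
            if (periodic) {
                // Keep only the first repeat.
                return String.join(" ", Arrays.copyOfRange(words, 0, p));
            }
        }
        return s; // no repetition found
    }

    public static void main(String[] args) {
        System.out.println(collapse("what's up man what's up man")); // what's up man
        System.out.println(collapse("hey hey hey"));                 // hey
        System.out.println(collapse("today is tuesday"));            // today is tuesday
    }
}
```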
//Doing it in Java 8
String str1 = "I am am am a good Good coder";
String[] arrStr = str1.split(" ");
String[] element = new String[1];
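return Arrays.stream(arrStr).filter(word -> {
// keep a word only if it differs (ignoring case) from the last word kept;
// the parameter is named word because a lambda parameter cannot reuse the local name str1
if (!word.equalsIgnoreCase(element[0])) {
element[0] = word;
return true;
}
return false;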
}).collect(Collectors.joining(" "));

Check if string contains any part of substring

I have
String S = "Eng - Computer, Eng - Software.."
User inputs:
String I = "Engineering."
I would like this to return true because S contains "Eng", which is a substring of I.
How can I do this?
S.trim().toLowerCase().contains(....)
This does not work properly because only part of the string matches.
Is S a fixed string, and do you want to check whether a substring entered by the user is part of it?
If so, you can use the "contains" method:
if (I.contains("*/string that you want to check*/"))
return true;
It depends on what you mean. Are you looking if any of the words in S are contained in I? If that's the case, I would recommend splitting S into an array of Strings for each word, and then checking if any of the words are a substring of I. Take a look at this function.
public boolean checkSubstring(String S, String I){
String[] words = S.split(" ");
for(int i=0; i<words.length; i++){
if(I.contains(words[i]))
return true; //We found a word contained in I!
}
return false; //None of the words were contained in I
}
Try the code below:
You need to tokenize S and then ask if I contains any of those tokens.
String S = "Eng - Computer, Eng - Software..";
String I = "Engineering.";
//neither S nor I contains the other
System.out.println(S.trim().toLowerCase().contains(I));
System.out.println(I.trim().toLowerCase().contains(S));
String[] tokens = S.split(" ");
//but parts of S are contained in I...
for(String token : tokens) {
if(I.contains(token)) {
System.out.println("found : " +token);
}
}
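A slightly more defensive variant of the same tokenizing idea is sketched below. It assumes a case-insensitive comparison is wanted and splits S on anything that is not a letter, so punctuation such as "-" or "Computer," does not end up in the tokens. The class and method names are illustrative. Empty tokens are skipped because contains("") is always true, and very short tokens (for example a single letter) would match almost any input.

```java
import java.util.Arrays;

class PartialMatch {
    static boolean sharesToken(String s, String input) {
        String haystack = input.toLowerCase();
        return Arrays.stream(s.toLowerCase().split("[^a-z]+"))
                .filter(token -> !token.isEmpty()) // "".contains-style matches are meaningless
                .anyMatch(haystack::contains);     // is any token of S a substring of the input?
    }

    public static void main(String[] args) {
        String S = "Eng - Computer, Eng - Software..";
        System.out.println(sharesToken(S, "Engineering.")); // true
        System.out.println(sharesToken(S, "Biology"));      // false
    }
}
```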
