Compare two sentences and check if they have a similar word - java

I'm trying to take two sentences and see if they have words in common. Example:
A- "Hello world this is a test"
B- "Test to create things"
The common word here is "test"
I tried using .contains() but it doesn't work because I can only search for one word.
text1.toLowerCase ().contains(sentence1.toLowerCase ())

You can create a HashSet from each sentence after splitting on whitespace, then use Set#retainAll to find the intersection (the common words).
final String a = "Hello world this is a test", b = "Test to create things";
final Set<String> words = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
final Set<String> words2 = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
words.retainAll(words2);
System.out.println(words); //[test]

Split the two sentences on spaces and add each word from the first string to a Set. Then, in a loop, try adding each word from the second string to the set. If the add operation returns false, the word is already in the set, so it is a common word (or a word repeated within the second string).
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
public class Sample {
public static void main(String[] args) {
String str1 = "Hello world this is a test";
String str2 = "Test to create things";
str1 = str1.toLowerCase();
str2 = str2.toLowerCase();
String[] str1words = str1.split(" ");
String[] str2words = str2.split(" ");
boolean flag = true;
Set<String> set = new HashSet<String>(Arrays.asList(str1words));
for(int i = 0;i<str2words.length;i++) {
flag = set.add(str2words[i]);
if(flag == false)
System.out.println(str2words[i]+" is common word");
}
}
}

You can split each sentence on spaces, collect the words, then look up each word of one sentence in the other and collect the common words.
Here is an example using the Java Stream API. The words of the first sentence are collected into a Set so that each lookup is faster (O(1) on average).
String a = "Hello world this is a test";
String b = "Test to create things";
Set<String> aWords = Arrays.stream(a.toLowerCase().split(" "))
.collect(Collectors.toSet());
List<String> commonWords = Arrays.stream(b.toLowerCase().split(" "))
.filter(bw -> aWords.contains(bw))
.collect(Collectors.toList());
System.out.println(commonWords);
Output: [test]

Here's one approach:
// extract the words from the sentences by splitting on white space
String[] sentence1Words = sentence1.toLowerCase().split("\\s+");
String[] sentence2Words = sentence2.toLowerCase().split("\\s+");
// make sets from the two word arrays
Set<String> sentence1WordSet = new HashSet<String>(Arrays.asList(sentence1Words));
Set<String> sentence2WordSet = new HashSet<String>(Arrays.asList(sentence2Words));
// get the intersection of the two word sets
Set<String> commonWords = new HashSet<String>(sentence1WordSet);
commonWords.retainAll(sentence2WordSet);
This will yield a Set containing lower case versions of the common words between the two sentences. If it is empty there is no similarity. If you don't care about some words like prepositions you can filter those out of the final similarity set or, better yet, preprocess your sentences to remove those words first.
Note that a real-world (i.e. useful) implementation of similarity checking is usually far more complex, as you usually want to check for words that are similar but with minor discrepancies. Some useful starting points to look into for this type of string similarity checking are Levenshtein distance and metaphones.
Note there is a redundant copy of the Set in the code above where I create the commonWords set because intersection is performed in-place, so you could improve performance by simply performing the intersection on sentence1WordSet, but I have favoured code clarity over performance.
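As a starting point for the Levenshtein distance mentioned above, here is a minimal sketch of the classic dynamic-programming version. The class name EditDistance, the helper similar and the maxEdits threshold are made up for illustration; a real similarity check would likely need tuning beyond this.

```java
final class EditDistance {
    // Levenshtein distance computed with two rolling rows of the DP table.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning the empty prefix of s into t[0..j) takes j insertions
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning s[0..i) into the empty string takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Hypothetical helper: two words count as "similar" if they are within maxEdits edits.
    static boolean similar(String a, String b, int maxEdits) {
        return levenshtein(a.toLowerCase(), b.toLowerCase()) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
        System.out.println(similar("Test", "tests", 1));      // true
    }
}
```

Instead of testing set membership with equals, you could compare each word of one sentence against each word of the other with similar, at the cost of a quadratic number of comparisons.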

Try this.
static boolean contains(String text1, String text2) {
String text1LowerCase = text1.toLowerCase();
return Arrays.stream(text2.toLowerCase().split("\\s+"))
.anyMatch(word -> text1LowerCase.contains(word));
}
and
String text1 = "Hello world this is a test";
String text2 = "Test to create things";
System.out.println(contains(text1, text2));
output:
true

Related

Find exact match from Array

In java I want to iterate an array to find any matching words from my input string
if the input string is appended to numbers it should return true.
Array arr = {"card","creditcard","debitcard"}
String inputStr = "need to discard pin" --> Return False
String inputStr = "need to 444card pin" --> Return True if its followed by number
I tried the below code, but it returns true as it takes "card" from the "discard" string and compares, but I need to do an exact match
Arrays.stream(arr).anyMatch(inputStr::contains)
Try this:
String[] arr = {"card","creditcard","debitcard"}; // array that keeps the words
String inputStr = "need to discard pin"; // String that keeps the 'sentence'
String[] wordsToBeChecked = inputStr.split(" "); // We take the string and split it at each " " (space)
HashSet<String> matchingWords = new HashSet<>(); // This will keep the matching words
for (String s : arr)
{
for (String s1 : wordsToBeChecked)
{
if(s.equalsIgnoreCase(s1)) // If first word matches with the second
{
matchingWords.add(s1); // add it to our container
}
}
}
Or using Java 8 Streams:
List<String> wordList = Arrays.asList(arr);
List<String> sentenceWordList = Arrays.asList(inputStr.split(" "));
List<String> matchedWords = wordList.stream().filter(sentenceWordList::contains)
.collect(Collectors.toList());
The problem with most answers here is that they do not take punctuation into consideration. To solve this, you could use a regular expression like below.
String[] arr = { "card", "creditcard", "debitcard" };
String inputStr = "You need to discard Pin Card.";
Arrays.stream(arr)
.anyMatch(word -> Pattern
.compile("(?<![a-z-])" + Pattern.quote(word) + "(?![a-z-])", Pattern.CASE_INSENSITIVE)
.matcher(inputStr)
.find());
With Pattern.quote(word), we escape any character within each word which is a special character in the context of a regular expression. For instance, without quoting, the literal string a^b would never match, because ^ means the start of a string if used in a regular expression.
(?<![a-z-]) and (?![a-z-]) mean that there is no letter or hyphen immediately preceding or following the word. For instance, discard will not match, even though it contains the word card. I have used only lowercase in these character classes because of the next point:
The flag CASE_INSENSITIVE passed to the compile method causes the pattern to be matched in a case-insensitive manner.
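To see how this behaves on the inputs from the question, here is a small self-contained sketch that wraps the same pattern in a method. The class name WholeWordMatch and the method name containsAnyWord are made up for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class WholeWordMatch {
    // True if any word occurs in the input with no letter or hyphen directly before or after it.
    static boolean containsAnyWord(String[] words, String input) {
        return Arrays.stream(words)
                .anyMatch(word -> Pattern
                        .compile("(?<![a-z-])" + Pattern.quote(word) + "(?![a-z-])", Pattern.CASE_INSENSITIVE)
                        .matcher(input)
                        .find());
    }

    public static void main(String[] args) {
        String[] arr = { "card", "creditcard", "debitcard" };
        System.out.println(containsAnyWord(arr, "need to discard pin"));           // false
        System.out.println(containsAnyWord(arr, "need to 444card pin"));           // true
        System.out.println(containsAnyWord(arr, "You need to discard Pin Card.")); // true
    }
}
```

Digits are deliberately not part of the character classes, which is why 444card counts as a match while discard does not.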
You could split the string using a regular expression
String[] arr = {"card","creditcard","debitcard"};
String inputStr = "need to discard pin";
List<String> wordsToBeChecked = Arrays.asList(inputStr.split("[ 0-9]"));
Arrays.stream(arr).anyMatch(wordsToBeChecked::contains);
If your word list and input string are long, consider splitting your input string into a HashSet. Lookups will then be faster:
Set<String> wordsToBeChecked = new HashSet<>(Arrays.asList(inputStr.split(" ")));
You can create a Set of the words in inputStr and then check the words list against that Set.
Set<String> inputWords = uniqueWords(inputStr);
List<String> matchedWords = Arrays.stream(arr)
.filter(word -> inputWords.contains(word))
.collect(Collectors.toList());
Building the Set may be non-trivial if you have to account for hyphenation, numbers, punctuation, and so forth. I'll wave my hands and ignore that - here's a naive implementation of uniqueWords(String) that assumes they are separated by spaces.
public Set<String> uniqueWords(String string) {
return Arrays.stream(string.split(" "))
.collect(Collectors.toSet());
}
One way would be
String[] arr = {"card","creditcard","debitcard"};
String inputStr = "need to discard pin";
var contains = Arrays.stream(inputStr.split(" ")).anyMatch(word -> Arrays.asList(arr).contains(word));
You can adjust the split regex to include all kinds of whitespace too.
Also: Consider an appropriate data structure for lookups. Array will be O(n), HashSet will be O(1).
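A minimal sketch of that suggestion, assuming exact whole-word matches are all you need (so 444card would not match here). The class name SetLookup and the method name containsAny are invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SetLookup {
    // Splits the input on any run of whitespace and checks each token against the set.
    static boolean containsAny(Set<String> dictionary, String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .anyMatch(dictionary::contains);
    }

    public static void main(String[] args) {
        // Build the set once; each contains() call is then O(1) on average.
        Set<String> dictionary = new HashSet<>(Arrays.asList("card", "creditcard", "debitcard"));
        System.out.println(containsAny(dictionary, "need to discard pin")); // false
        System.out.println(containsAny(dictionary, "need a card\tnow"));    // true
    }
}
```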

Reassemble split string based on previous split in JAVA?

If I split a string, say like this:
List<String> words = Arrays.asList(input.split("\\s+"));
And I then wanted to modify those words in various ways, then reassemble them using the same logic, assuming no word lengths have changed, is there a way to do that easily? Humor me in that there's a reason I'm doing this.
Note: I need to match all whitespace, not just spaces. Hence the regex.
i.e.:
"Beautiful Country" -> ["Beautiful", "Country"] -> ["BEAUTIFUL", "COUNTRY"] -> "BEAUTIFUL COUNTRY"
If you use String.split, there is no way to be sure that the reassembled strings will be the same as the original ones.
In general (and in your case) there is no way to capture what the actual separators used were. In your example, "\\s+" will match one or more whitespace characters, but you don't know which characters were used, or how many there were.
When you use split, the information about the separators is lost. Period.
(On the other hand, if you don't care that the reassembled string may be a different length or may have different separators to the original, use the Joiner class ...)
Assuming you have a limit on how many words you can expect, you could try writing a regular expression like
(\S+)(\s+)?(\S+)?(\s+)?(\S+)?
(for the case in which you expect up to three words). You could then use the Matcher API methods groupCount(), group(n) to pull the individual words (the odd groups) or whitespace separators (the even groups >0), do what you needed with the words, and re-assemble them once again...
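If the number of words is not known in advance, a variation on the same idea is to call Matcher#find repeatedly with a pattern that matches either a run of non-whitespace or a run of whitespace, instead of one pattern with a fixed number of groups. This is a sketch of that variation, not the approach described above; the class name PreserveWhitespace and the method name upperCaseWords are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PreserveWhitespace {
    // Group 1 captures a word (non-whitespace run), group 2 a separator (whitespace run).
    private static final Pattern TOKEN = Pattern.compile("(\\S+)|(\\s+)");

    // Upper-cases every word and copies every separator through unchanged.
    static String upperCaseWords(String input) {
        StringBuilder out = new StringBuilder(input.length());
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                out.append(m.group(1).toUpperCase());
            } else {
                out.append(m.group(2));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseWords("Beautiful \t Country")); // BEAUTIFUL, then the original space-tab-space, then COUNTRY
    }
}
```

Because every character of the input belongs to exactly one token, the reassembled string has the same separators in the same places as the original.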
I tried this:
import java.util.*;
import java.util.stream.*;
public class StringSplits {
private static List<String> whitespaceWords = new ArrayList<>();
public static void main(String [] args) {
String input = "What a Wonderful World! ...";
List<String> words = processInput(input);
// First transformation: ["What", "a", "Wonderful", "World!", "..."]
String first = words.stream()
.collect(Collectors.joining("\", \"", "[\"", "\"]"));
System.out.println(first);
// Second transformation: ["WHAT", "A", "WONDERFUL", "WORLD!", "..."]
String second = words.stream()
.map(String::toUpperCase)
.collect(Collectors.joining("\", \"", "[\"", "\"]"));
System.out.println(second);
// Final transformation: WHAT A WONDERFUL WORLD! ...
String last = IntStream.range(0, words.size())
.mapToObj(i -> words.get(i) + whitespaceWords.get(i))
.map(String::toUpperCase)
.collect(Collectors.joining());
System.out.println(last);
}
/*
* Accepts input string of words containing character words and
* whitespace(s) (as defined in the method Character#isWhitespace).
* Processes and returns only the character strings. Stores the
* whitespace 'words' (a single or multiple whitespaces) in a List<String>.
* NOTE: This method uses String concatenation in a loop. For processing
* large inputs consider using a StringBuilder.
*/
private static List<String> processInput(String input) {
List<String> words = new ArrayList<>();
String word = "";
String whitespaceWord = "";
boolean wordFlag = true;
for (char c : input.toCharArray()) {
if (! Character.isWhitespace(c)) {
if (! wordFlag) {
wordFlag = true;
whitespaceWords.add(whitespaceWord);
word = whitespaceWord = "";
}
word = word + String.valueOf(c);
}
else {
if (wordFlag) {
wordFlag = false;
words.add(word);
word = whitespaceWord = "";
}
whitespaceWord = whitespaceWord + String.valueOf(c);
}
} // end-for
whitespaceWords.add(whitespaceWord);
if (! word.isEmpty()) {
words.add(word);
}
return words;
}
}

How I can use regex to implement contains functionality?

P.S : If you don't understand anything from the below I describe, please ask me
I have a Dictionary with the list of words.
And I have String of one word with multiple characters.
Eg: Dictionary =>
String[] = {"Manager","age","range", "east".....} // list of words in dictionary
Now I have one string tageranm.
I have to find all the words in the dictionary which can be made using the letters of this string. I have been able to find a solution by generating all strings using permutations and verifying whether each one is present in the dictionary.
But I have another solution in mind, and I don't know how to do it in Java using regex.
Algorithm:
// 1. Sort `tageranm`.
char c[] = "tageranm".toCharArray();
Arrays.sort(c);
letters = String.valueOf(c); // letters = "aaegmnrt"
2.Sort all words in dictionary:
Example: "range" => "aegnr" // After sorting
Now "aaegmnrt".contains("aegnr") will return false, because 'm' comes in between.
Is there a way to use Regex and ignore the character m and get all the words in dictionary using the above approach?
Thanks in advance.
Here is a possible solution, using the type of regex suggested by @MattTimmermans in the comments. It's not very fast though, so there are probably loads of ways to improve it. I'm also pretty sure there are libraries for this kind of search, which will (hopefully) use better-performing algorithms.
java.util.List<String> test(String[] words, String input){
java.util.List<String> result = new java.util.ArrayList<>();
// Sort the characters in the input-String:
byte[] inputArray = input.getBytes();
java.util.Arrays.sort(inputArray);
String sortedInput = new String(inputArray);
for(String word : words){
// Sort the characters of the word:
byte[] wordArray = word.getBytes();
java.util.Arrays.sort(wordArray);
String sortedWord = new String(wordArray);
// Create a regex to match from this word:
String wordRegex = ".*" + sortedWord.replaceAll(".", "$0.*");
// If the input matches this regex:
if(sortedInput.matches(wordRegex))
// Add the word to the result-List:
result.add(word);
}
return result;
}
For your inputs {"Manager","age","range", "east"} and "tageranm" it will return ["age", "range"].
EDIT: It doesn't match Manager because the M is in uppercase. If you want case-insensitive matching, the easiest way is to convert both the input and the words to the same case before checking:
input.getBytes() becomes input.toLowerCase().getBytes()
word.getBytes() becomes word.toLowerCase().getBytes()
With those changes, the result is ["Manager", "age", "range"].
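One possible improvement, sketched here as an alternative to the regex rather than a version of it, is to count letters: a word can be built from the input if it needs no letter more often than the input provides it. The class name LetterCounts and its method names are invented for the example, and matching is case-insensitive by assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LetterCounts {
    // Maps each lower-cased character of s to the number of times it occurs.
    static Map<Character, Integer> countLetters(String s) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : s.toLowerCase().toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    // True if every letter of the word is available at least as often as the word needs it.
    static boolean canBuild(String word, Map<Character, Integer> available) {
        for (Map.Entry<Character, Integer> needed : countLetters(word).entrySet()) {
            if (available.getOrDefault(needed.getKey(), 0) < needed.getValue()) {
                return false;
            }
        }
        return true;
    }

    // Returns the dictionary words that can be built from the letters of input.
    static List<String> buildable(String[] words, String input) {
        Map<Character, Integer> available = countLetters(input);
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (canBuild(word, available)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = { "Manager", "age", "range", "east" };
        System.out.println(buildable(words, "tageranm")); // [Manager, age, range]
    }
}
```

This avoids both the sorting and the regex backtracking; each word is checked in time proportional to its length.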

How can I eliminate duplicate words from String in Java?

I have an ArrayList of Strings and it contains records such as:
this is a first sentence
hello my name is Chris
what's up man what's up man
today is tuesday
I need to clear this list, so that the output does not contain repeated content. In the case above, the output should be:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
as you can see, the 3rd String has been modified and now contains only one statement what's up man instead of two of them.
In my list there is a situation that sometimes the String is correct, and sometimes it is doubled as shown above.
I want to get rid of it, so I thought about iterating through this list:
for (String s: myList) {
but I cannot find a way of eliminating duplicates, especially since the length of each string is not determined, and by that I mean there might be record:
this is a very long sentence this is a very long sentence
or sometimes short ones:
single word single word
is there some native java function for that maybe?
Assuming the String is repeated just twice, with a space in between as in your examples, the following code would remove repetitions:
for (int i=0; i<myList.size(); i++) {
String s = myList.get(i);
String fs = s.substring(0, s.length()/2);
String ls = s.substring(s.length()/2+1, s.length());
if (fs.equals(ls)) {
myList.set(i, fs);
}
}
The code just splits each entry of the list into two substrings (dividing at the halfway point). If both are equal, it replaces the original element with one half, thus removing the repetition.
I was testing the code and did not see @Brendan Robert's answer. This code follows the same logic as his answer.
I would suggest using regular expressions. I was able to remove duplicates using this pattern: \b([\w\s']+) \1\b
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
static String [] phrases = {
"this is a first sentence",
"hello my name is Chris",
"what's up man what's up man",
"today is tuesday",
"this is a very long sentence this is a very long sentence",
"single word single word",
"hey hey"
};
public static void main(String[] args) throws Exception {
String duplicatePattern = "\\b([\\w\\s']+) \\1\\b";
Pattern p = Pattern.compile(duplicatePattern);
for (String phrase : phrases) {
Matcher m = p.matcher(phrase);
if (m.matches()) {
System.out.println(m.group(1));
} else {
System.out.println(phrase);
}
}
}
}
Results:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
this is a very long sentence
single word
hey
Assumptions:
Uppercase words are equal to lowercase counterparts.
String fullString = "lol lol";
String[] words = fullString.split("\\W+");
StringBuilder stringBuilder = new StringBuilder();
Set<String> wordsHashSet = new HashSet<>();
for (String word : words) {
// Check for duplicates
if (wordsHashSet.contains(word.toLowerCase())) continue;
wordsHashSet.add(word.toLowerCase());
stringBuilder.append(word).append(" ");
}
String nonDuplicateString = stringBuilder.toString().trim();
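Simple logic: split the string on spaces (" "), add each word to a LinkedHashSet (which keeps insertion order and drops duplicates), then convert the set back to a string and strip the "[", "]" and "," characters from its toString() output.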
String s = "I want to walk my dog I want to walk my dog";
Set<String> temp = new LinkedHashSet<>();
String[] arr = s.split(" ");
for ( String ss : arr)
temp.add(ss);
String newl = temp.toString()
.replace("[","")
.replace("]","")
.replace(",","");
System.out.println(newl);
Output: I want to walk my dog
It depends on your situation. Assuming the string is repeated at most twice, you could take the length of the entire string, find the halfway point, and compare each index after the halfway point with the matching index from the beginning. If the string can be repeated more than twice, you will need a more complicated algorithm that first determines how many times the string is repeated, then finds the starting index of each repeat and truncates everything from the beginning of the first repeat onward. If you can provide more context about the scenarios you expect to handle, we can start putting together some ideas.
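The general case described above could be sketched at the word level rather than the character level: look for the shortest run of words that, repeated a whole number of times, reproduces the entire string. The class name RepeatedPhrase and the method name collapse are made up, and the sketch assumes repeats are separated by whitespace only.

```java
import java.util.Arrays;

final class RepeatedPhrase {
    // Returns the shortest repeating unit of words, or the input unchanged if nothing repeats.
    static String collapse(String s) {
        String[] words = s.trim().split("\\s+");
        int n = words.length;
        for (int unit = 1; unit <= n / 2; unit++) {
            if (n % unit != 0) {
                continue; // a unit must divide the word count evenly
            }
            boolean repeats = true;
            for (int i = unit; i < n && repeats; i++) {
                repeats = words[i].equals(words[i - unit]);
            }
            if (repeats) {
                return String.join(" ", Arrays.copyOfRange(words, 0, unit));
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(collapse("what's up man what's up man")); // what's up man
        System.out.println(collapse("hey hey hey"));                 // hey
        System.out.println(collapse("today is tuesday"));            // today is tuesday
    }
}
```

Note that this also collapses sentences that are meant to repeat, such as "bye bye", so it only fits data where repetition is always an error.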
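// Java 8: drop a word when it repeats the word immediately before it (ignoring case)
String str1 = "I am am am a good Good coder";
String[] arrStr = str1.split(" ");
String[] previous = new String[1]; // one-element array so the lambda can update it
String result = Arrays.stream(arrStr).filter(word -> {
if (!word.equalsIgnoreCase(previous[0])) {
previous[0] = word;
return true;
}
return false;
}).collect(Collectors.joining(" "));
// result: "I am a good coder"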

How to count a frequency of a particular word in a line?

I would like to know: given a single-line string, how can I count the frequency of a particular word in that string using simple Java code?
Thanks in advance.
What I am looking for is sample Java code that searches for a particular word in a sentence. I am building a spam filter, which needs to read the line and then classify it.
StringUtils from commons-lang has:
StringUtils.countMatches(string, searchedFor);
You can use regular expression. An example of code is:
public int count(String word, String line){
Pattern pattern = Pattern.compile(word);
Matcher matcher = pattern.matcher(line);
int counter = 0;
while (matcher.find())
counter++;
return counter;
}
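Note that Pattern.compile(word) treats the word as a regular expression and also counts it inside longer words. A hedged variant that counts whole words only could quote the word and add \b word boundaries, as in this sketch. The class name WordCount is made up, case-insensitive matching is an assumption, and the \b anchors assume the word starts and ends with a letter or digit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordCount {
    // Counts occurrences of word in line as a whole word, ignoring case.
    static int count(String word, String line) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(line);
        int counter = 0;
        while (matcher.find()) {
            counter++;
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(count("test", "Test testing a test.")); // 2
    }
}
```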
First split by spaces (see String#split)
Then use a map to map the words with frequency.
String [] words = line.split(" ");
Map<String, Integer> frequency = new HashMap<>();
for (String word : words) {
// merge() starts the count at 1 for a new word, otherwise adds 1 to the existing count
frequency.merge(word, 1, Integer::sum);
}
Then you can find out for a particular word with:
frequency.get(word)
Using Guava library:
Multiset (use when the counts of all words are required)
String line="Hello world bye bye world";
Multiset<String> countStr=HashMultiset.create(Splitter.on(' ').split(line));
System.out.println(countStr.count("Hello")); //gives count of the word 'Hello'
Iterables (use when the counts of only a few words are required)
String line="Hello world bye bye world";
Iterable<String> splitStr=Splitter.on(' ').split(line);
System.out.println(Iterables.frequency(splitStr, "Hello"));
After some Googling and a little study I came up with this, which may be helpful:
String str="hello new demo hello";
Map<String,Integer> hmap= new HashMap<String,Integer>();
for(String tempStr : str.split(" "))
{
if(hmap.containsKey(tempStr))
{
Integer i=hmap.get(tempStr);
i+=1;
hmap.put(tempStr,i);
}
else
hmap.put(tempStr,1);
}
System.out.println(hmap);
Several ways:
Use splits
Use tokenizers
Use Regular Expressions
Use good old loops and string manipulation (ie indexOf(), etc)
Options 1 & 2 have the overhead of trying to figure out if your word happens to be the last on the line (and needing to add an additional count)
Option 3 requires you to be able to form regex syntax
Option 4 is archaic
After getting the string array you can try the following code from Java 10 onwards. It uses streams to get the frequency map.
import java.util.Arrays;
import java.util.stream.Collectors;
public class StringFrequencyMap {
public static void main(String... args) {
String[] wordArray = {"One", "two", "three", "one", "two", "two", "three"};
var freqCaseSensitive = Arrays.stream(wordArray)
.collect(Collectors.groupingBy(x -> x, Collectors.counting()));
//If you want case insensitive then use
var freqCaseInSensitive = Arrays.stream(wordArray)
.collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
System.out.println(freqCaseSensitive);
System.out.println(freqCaseInSensitive);
System.out.println("Frequency of \"two\" is: "+freqCaseInSensitive.get("two"));
}
}
Output will be:
{one=1, One=1, three=2, two=3}
{one=2, three=2, two=3}
Frequency of "two" is: 3
