How can I eliminate duplicate words from a String in Java?

I have an ArrayList of Strings and it contains records such as:
this is a first sentence
hello my name is Chris
what's up man what's up man
today is tuesday
I need to clear this list, so that the output does not contain repeated content. In the case above, the output should be:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
As you can see, the third String has been modified and now contains the statement what's up man only once instead of twice.
In my list, a String is sometimes correct and sometimes doubled, as shown above.
I want to get rid of the duplicates, so I thought about iterating through the list:
for (String s: myList) {
but I cannot find a way of eliminating the duplicates, especially since the length of each string is not fixed. For example, there might be a long record:
this is a very long sentence this is a very long sentence
or sometimes short ones:
single word single word
Is there a native Java function for that?

Assuming the String is repeated exactly twice, with a single space in between as in your examples, the following code removes the repetition:
for (int i=0; i<myList.size(); i++) {
String s = myList.get(i);
String fs = s.substring(0, s.length()/2);
String ls = s.substring(s.length()/2+1, s.length());
if (fs.equals(ls)) {
myList.set(i, fs);
}
}
The code splits each entry of the list into two substrings at the halfway point, skipping the separating space. If both halves are equal, it replaces the original element with one half, thus removing the repetition.
I was testing the code and did not see @Brendan Robert's answer. This code follows the same logic as his answer.

I would suggest using regular expressions. I was able to remove duplicates using this pattern: \b([\w\s']+) \1\b
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
static String [] phrases = {
"this is a first sentence",
"hello my name is Chris",
"what's up man what's up man",
"today is tuesday",
"this is a very long sentence this is a very long sentence",
"single word single word",
"hey hey"
};
public static void main(String[] args) throws Exception {
String duplicatePattern = "\\b([\\w\\s']+) \\1\\b";
Pattern p = Pattern.compile(duplicatePattern);
for (String phrase : phrases) {
Matcher m = p.matcher(phrase);
if (m.matches()) {
System.out.println(m.group(1));
} else {
System.out.println(phrase);
}
}
}
}
Results:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
this is a very long sentence
single word
hey

Assumptions:
Uppercase words are equal to lowercase counterparts.
String fullString = "lol lol";
String[] words = fullString.split("\\W+");
StringBuilder stringBuilder = new StringBuilder();
Set<String> wordsHashSet = new HashSet<>();
for (String word : words) {
// Check for duplicates
if (wordsHashSet.contains(word.toLowerCase())) continue;
wordsHashSet.add(word.toLowerCase());
stringBuilder.append(word).append(" ");
}
String nonDuplicateString = stringBuilder.toString().trim();

Simple logic: split the string on the space character (" "), add each word to a LinkedHashSet, convert the set back to a string, and strip the "[", "]" and "," characters that toString() adds.
String s = "I want to walk my dog I want to walk my dog";
Set<String> temp = new LinkedHashSet<>();
String[] arr = s.split(" ");
for ( String ss : arr)
temp.add(ss);
String newl = temp.toString()
.replace("[","")
.replace("]","")
.replace(",","");
System.out.println(newl);
Output: I want to walk my dog
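The same LinkedHashSet idea can be sketched with String.join, which avoids post-processing the output of toString() (that step would also strip commas and brackets that belong to the words themselves). The class and method names here are hypothetical. Note that, like the snippet above, this removes every repeated word, not only a doubled sentence.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class UniqueWords {
    // Keeps the first occurrence of each word, in original order.
    static String uniqueWords(String s) {
        // LinkedHashSet drops duplicates but remembers insertion order.
        Set<String> seen = new LinkedHashSet<>(Arrays.asList(s.trim().split("\\s+")));
        return String.join(" ", seen);
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords("I want to walk my dog I want to walk my dog"));
    }
}
```

A sentence such as "today is what it is" would lose its second "is" with this approach, so it only fits data where a repeated word always signals a duplicated record.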

It depends on your situation. Assuming the string is repeated at most twice, you could take the length of the entire string, find the halfway point, and compare each character after the halfway point with the character at the matching index in the first half. If the string can be repeated more than twice, you will need a more complicated algorithm that first determines how many times the string is repeated, then finds the starting index of each repeat and truncates everything from the beginning of the first repeat onward. If you can provide more context about the scenarios you expect to handle, we can start putting together some ideas.
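One possible sketch of the "repeated any number of times" case, assuming the repeats are separated by whitespace and that comparing whole words is acceptable. The names RepeatCollapser and collapse are hypothetical. It looks for the shortest leading run of words that, repeated, reproduces the entire string.

```java
import java.util.Arrays;

final class RepeatCollapser {
    // Returns the shortest leading run of words whose repetition yields the input.
    // If the input is not a whole-number repetition, it is returned unchanged.
    static String collapse(String s) {
        String[] words = s.trim().split("\\s+");
        int n = words.length;
        for (int period = 1; period <= n / 2; period++) {
            if (n % period != 0) continue; // a repeat unit must divide the word count
            boolean repeats = true;
            for (int i = period; i < n && repeats; i++) {
                repeats = words[i].equals(words[i % period]);
            }
            if (repeats) {
                return String.join(" ", Arrays.copyOfRange(words, 0, period));
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(collapse("what's up man what's up man"));
        System.out.println(collapse("hey hey hey"));
        System.out.println(collapse("today is tuesday"));
    }
}
```

A record that legitimately consists of one repeated unit (for example "hey hey") is collapsed as well, so this only suits data where such records are known to be accidental duplicates.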

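// Doing it in Java 8: drops a word when it equals (ignoring case) the word just before it
String str1 = "I am am am a good Good coder";
String[] arrStr = str1.split(" ");
String[] element = new String[1]; // one-element array so the lambda can update the previous word
String result = Arrays.stream(arrStr).filter(word -> { // the parameter must not reuse the name str1
if (!word.equalsIgnoreCase(element[0])) {
element[0] = word;
return true;
}
return false;
}).collect(Collectors.joining(" ")); // "I am a good coder"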

Related

Count occurrence of multiple substrings

I want to count the occurrence of multiple substrings in a string.
I am able to do so by using the following code:
int score = 0;
String text = "This is some random text. This is some random text.";
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
for (String word : words) {
if(StringUtils.containsIgnoreCase(text, word)){
score += 1;
}
}
My algorithm increases the score by +1 for each word from my "words" list that occurs in the text.
In my example, the score would be 2 because "this" and "is" occur in the text.
However, my code has to loop through the text for each string in my list.
Is there a faster way to do this?
How about the following:
String text = "This is some random text. This is some random text.";
text = text.toLowerCase();
String[] tokens = text.split("\\PL+");
java.util.Set<String> source = new java.util.HashSet<>();
for (String token : tokens) {
source.add(token);
}
java.util.List<String> words = java.util.Arrays.asList("this", "is", "for", "stackoverflow");
source.retainAll(words);
int score = source.size();
Split text into words.
Add the words to a Set so that each word appears only once. Hence Set will contain the word this only once despite the fact that the word this appears twice in text.
After calling method retainAll, the Set only contains words that are in the words list. Hence your score is the number of elements in the Set.
The fastest way would be to put each word of the text into a hash map.
Then, for each word you are searching for, you only have to look up the key in the hash map.
Given that your text has n words and your list has m words,
this solution takes O(n+m) time instead of O(n*m).
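A minimal sketch of that hash-map idea, assuming words are runs of letters and matching is case-insensitive. The names WordIndex, index and score are hypothetical. The map is built in one pass over the text, and each search word then costs a single lookup.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordIndex {
    // Builds a lowercase word -> occurrence count map in one pass over the text.
    static Map<String, Integer> index(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\P{L}+")) {
            if (!token.isEmpty()) { // a leading non-letter yields an empty first token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Counts how many of the search words occur at least once in the text.
    static int score(String text, List<String> words) {
        Map<String, Integer> counts = index(text);
        int score = 0;
        for (String word : words) {
            if (counts.containsKey(word.toLowerCase())) score++;
        }
        return score;
    }

    public static void main(String[] args) {
        String text = "This is some random text. This is some random text.";
        System.out.println(score(text, Arrays.asList("this", "is", "for", "stackoverflow")));
    }
}
```

Keeping the counts rather than a plain set also answers "how often does each word occur" at no extra cost.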
This is a case where regex is your friend:
public static Map<String, Integer> countTokens(String src, List<String> tokens) {
Map<String, Integer> countMap = new HashMap<>();
String end = "(\\s|$)"; //trap whitespace or the end of the string.
String start = "(^|\\s)"; //trap whitespace or the start of the string
//iterate through your tokens
for (String token : tokens) {
//create a pattern, note the case insensitive flag
Pattern pattern = Pattern.compile(start+token+end, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(src);
int cnt = 0;
//count your matches.
while(matcher.find()) {
cnt++;
}
countMap.put(token, cnt);
}
return countMap;
}
public static void main(String[] args) throws IOException {
String text = "This is some random text. This is some random text.";
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
for (Entry<String, Integer> entry : countTokens(text, words).entrySet()) {
System.out.println(entry);
}
}
If you want to find tokens within token, like "is" within "this", simply remove the start and end regex.
You can use the split method to convert the string into an array of strings, sort the array, and then binary-search it for each element of the list. The following code implements this:
String[] wordsArr = "This is some random text. This is some random text.".toLowerCase().split(" ");
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
int count = 0;
Arrays.sort(wordsArr);
for(String word: words)
if(Arrays.binarySearch( wordsArr, word )>-1)
count++;
Another good approach is to use a TreeSet; this one was inspired by @Abra's answer:
String[] wordsArr = "This is some random text. This is some random text.".toLowerCase().split(" ");
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
TreeSet<String> setOfWords = new TreeSet<String>(Arrays.asList(wordsArr));
int count = 0;
for(String word: words)
if(setOfWords.contains(word))
count++;
Both these methods have a lookup time complexity of O(N log M), where N is the size of the words list and M is the size of wordsArr or setOfWords, not counting the O(M log M) cost of sorting the array or building the set. However, be careful: neither method accounts for periods, so a search for "text" finds nothing, because the set or array contains "text." instead. You can get around that by removing all punctuation from the text and the search words. If you want accurate results, you can instead set the regex in split() to "[^a-zA-Z]", which splits your String around non-alphabetical characters.
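The punctuation pitfall described above can be checked with a short sketch. The helper name tokens is hypothetical, and the regex uses "[^a-zA-Z]+" (with a trailing +, a small variation on the suggestion above) so that a period followed by a space counts as one delimiter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SplitComparison {
    // Lowercases the text, splits it with the given regex, and collects the distinct tokens.
    static Set<String> tokens(String text, String regex) {
        Set<String> result = new HashSet<>(Arrays.asList(text.toLowerCase().split(regex)));
        result.remove(""); // a leading delimiter would produce an empty token
        return result;
    }

    public static void main(String[] args) {
        String text = "This is some random text. This is some random text.";
        // Splitting on a space keeps the period attached: the set holds "text.", not "text".
        System.out.println(tokens(text, " ").contains("text"));          // false
        // Splitting on runs of non-letters strips the punctuation.
        System.out.println(tokens(text, "[^a-zA-Z]+").contains("text")); // true
    }
}
```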

Java extracting substring from sentences

There are combination of words like is, is not, does not contain. We have to match these words in a sentence and have to split it.
Input: if name is tom and age is not 45 or name does not contain tom then let me know.
Expected output:
If name is
tom and age is not
45 or name does not contain
tom then let me know
I tried below code to split and extract but the occurrence of "is" is in "is not" as well which my code is not able to find out:
public static void loadOperators(){
operators.add("is");
operators.add("is not");
operators.add("does not contain");
}
public static void main(String[] args) {
loadOperators();
for(String s : operators){
System.out.println(str.split(s).length - 1);
}
}
Since a word can occur several times, and "is" and "is not" are different operators for you, a plain split won't solve your use case. Ideally, you would iterate:
1. Find the index of the operator.
2. Search for the next space or word.
3. Then update your string to the substring from that index to the end.
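The iteration described above could be sketched as a left-to-right scan that, at each position, tries the longest operator first so that "is not" is never mistaken for "is". The names OperatorSplitter and isWholeWordAt are hypothetical, and the sketch assumes operators are delimited by single spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OperatorSplitter {
    // Splits the input after each operator occurrence, preferring longer operators.
    static List<String> split(String input, List<String> operators) {
        List<String> byLength = new ArrayList<>(operators);
        byLength.sort((a, b) -> Integer.compare(b.length(), a.length())); // longest first

        List<String> parts = new ArrayList<>();
        int segmentStart = 0;
        int pos = 0;
        while (pos < input.length()) {
            String matched = null;
            for (String op : byLength) {
                if (isWholeWordAt(input, pos, op)) { matched = op; break; }
            }
            if (matched == null) { pos++; continue; }
            int end = pos + matched.length();
            parts.add(input.substring(segmentStart, end).trim());
            segmentStart = end; // the next segment starts right after the operator
            pos = end;
        }
        String rest = input.substring(segmentStart).trim();
        if (!rest.isEmpty()) parts.add(rest);
        return parts;
    }

    // True if op occurs at pos and is bounded by spaces or the ends of the input.
    private static boolean isWholeWordAt(String input, int pos, String op) {
        if (!input.startsWith(op, pos)) return false;
        int end = pos + op.length();
        boolean startOk = pos == 0 || input.charAt(pos - 1) == ' ';
        boolean endOk = end == input.length() || input.charAt(end) == ' ';
        return startOk && endOk;
    }

    public static void main(String[] args) {
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        for (String part : split(input, Arrays.asList("is", "is not", "does not contain"))) {
            System.out.println(part);
        }
    }
}
```

Because the scan moves past each matched operator before continuing, matches cannot overlap, and the order in which operators appear in the sentence does not matter.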
I am not entirely sure about what you try to achieve, but let's give it a shot.
For your case, a simple "workaround" might work just fine:
Sort the operators by their length, descending. This way the "largest match" gets found first. You can define "largest" either as literally the longest string or, preferably, as the number of words (number of spaces contained), so that "is a" has precedence over "contains".
You'll need to make sure that no matches overlap though, which can be done by comparing all matches' start and end indices and discarding overlaps by some criteria, like first match wins
This code does what you seem to be wanting to do (or what I guessed you are wanting to do):
public static void main(String[] args) {
List<String> operators = new ArrayList<>();
operators.add("is");
operators.add("is not");
operators.add("does not contain");
String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
List<String> output = new ArrayList<>();
int lastFoundOperatorsEndIndex = 0; // First start at the beginning of input
for (String operator : operators){
int indexOfOperator = input.indexOf(operator); // Find current operator's position
if (indexOfOperator > -1) { // If operator was found
int thisOperatorsEndIndex = indexOfOperator + operator.length(); // Get length of operator and add it to the index to include operator
output.add(input.substring(lastFoundOperatorsEndIndex, thisOperatorsEndIndex).trim()); // Add operator to output (and remove trailing space)
lastFoundOperatorsEndIndex = thisOperatorsEndIndex; // Update startindex for next operator
}
}
output.add(input.substring(lastFoundOperatorsEndIndex, input.length()).trim()); // Add rest of input as last entry to output
for (String part : output) { // Output to console
System.out.println(part);
}
}
But it is highly dependent on the order of the sentence and the operators. If we're talking about user input, the task will be much more complicated.
A better method using regular expressions (regExp) would be:
public static void main(String... args) {
// Define inputs
String input1 = "if name is tom and age is not 45 or name does not contain tom then let me know.";
String input2 = "the name is tom and he is 22 years old but the name does not contain jack, but merry is 24 year old.";
// Output split strings
for (String part : split(input1)) {
System.out.println(part.trim());
}
System.out.println();
for (String part : split(input2)) {
System.out.println(part.trim());
}
}
private static String[] split(String input) {
// Define list of operators - 'is not' has to precede 'is'!!
String[] operators = { "\\sis not\\s", "\\sis\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };
// Concatenate operators to regExp-String for search
StringBuilder searchString = new StringBuilder();
for (String operator : operators) {
if (searchString.length() > 0) {
searchString.append("|");
}
searchString.append(operator);
}
// Replace all operators by operator+\n and split resulting string at \n-character
return input.replaceAll("(" + searchString.toString() + ")", "$1\n").split("\n");
}
Notice the order of the operators! 'is' has to come after 'is not' or 'is not' will always be split.
You can prevent this by using a negative lookahead for the operator 'is'.
So "\\sis\\s" would become "\\sis(?! not)\\s" (reading like: "is", not followed by a " not").
A minimalist version (String.join requires Java 8 or later) could look like this:
private static String[] split(String input) {
String[] operators = { "\\sis(?! not)\\s", "\\sis not\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };
return input.replaceAll("(" + String.join("|", operators) + ")", "$1\n").split("\n");
}

How to determine if a wordList contains one of the substrings of a phrase in Java

I have a List stopWord (each string is a single word) and a String phrase (at least two words). I want to check whether my phrase includes one of the elements of stopWord in Java. How can I do that?
if(!stopWord.contains(phrase.toLowerCase()))
String delims = "[ ,-.…•“”‘’:;!()/?\"]+";
I used this code, but I think it doesn't handle a String with two words, which is what my phrase is. Each element of stopWord is a single word. I didn't split my phrase because I'm dealing with a large amount of data. Is there a simpler way?
String[] words = phrase.split(" ");
for (String word : words)
{
if (stopWord.contains(word))
{
// do here whatever you need :)
}
}
If you're using Java 8:
String phrase = "your phrase";
if (stopWord.parallelStream().anyMatch(s -> phrase.contains(s)))
{
// do stuff here
}
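Note that phrase.contains(s) matches substrings, so a stop word such as "is" would also be found inside "island". A word-level sketch, assuming stopWord can be held in a HashSet of lowercase words (the names StopWordCheck and containsStopWord are hypothetical), splits the phrase once and does one constant-time lookup per word:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class StopWordCheck {
    // True if any whole word of the phrase is in the stop-word set (case-insensitive).
    static boolean containsStopWord(String phrase, Set<String> stopWords) {
        // Split on anything that is not a letter, digit, or apostrophe.
        for (String word : phrase.toLowerCase().split("[^\\p{L}\\p{N}']+")) {
            if (stopWords.contains(word)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is"));
        System.out.println(containsStopWord("The island", stopWords));  // true, because of "the"
        System.out.println(containsStopWord("This island", stopWords)); // false, "is" is only a substring
    }
}
```

Splitting a short phrase is cheap compared with scanning the whole stop-word list for every phrase, so this should also scale reasonably to large inputs.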

Java String.contains() to take care of natural numbers

I'm a computer science student learning Java, and as an exercise, we're doing a permutation algorithm.
Now I'm stuck at a point where I need to search for a natural number within a String full of numbers separated by commas:
String myString = "0,1,2,10,14,";
The problem is i'm using...
myString.contains(String.valueOf(anInteger));
...to check for the presence of a specific number. This works for numbers from 0 to 9, but when looking for a number with more than one digit, the program does not recognize it as a single natural number.
In other words, and as an example: "14" is not the integer 14, it's just a string with a "1" and a "4"; so, if I run...
String myString = "0,1,2,10,14,";
if (myString.contains(String.valueOf(4))) { doSomething(); }
...the "if" condition will be true, since "4" is present in the string as part of the natural number "14".
At this point, I've been searching through StackOverflow and other pages for a solution, and learnt I should use Pattern and Matcher.
My question is: what's the best way to use them?
Relevant part of my code:
for (int i = 0; i<r; i++)
{
if (!act.contains(String.valueOf(i)))
{
...
}
...
}
I use this method several times in my code, so an exact substitution would be nice.
Thank you all in advance!
You only need a method call to matches():
if (myString.matches(".*\\b" + anInteger + "\\b.*"))
// string contains the number
This works by creating a regex that has a word boundary (\b) at either end of the target number. The leading and trailing .* are required because matches() must match the whole string to return true.
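Since the question asks for a drop-in substitute that is used in several places, the word-boundary idea could be wrapped in a small helper. This is a sketch for non-negative integers only, and the names NumberList and containsNumber are hypothetical. Using Matcher.find() instead of matches() makes the surrounding .* unnecessary.

```java
import java.util.regex.Pattern;

final class NumberList {
    // True if n appears as a complete number in the comma-separated list.
    // Assumes n is non-negative: a leading '-' would not sit on a word boundary.
    static boolean containsNumber(String list, int n) {
        return Pattern.compile("\\b" + n + "\\b").matcher(list).find();
    }

    public static void main(String[] args) {
        String myString = "0,1,2,10,14,";
        System.out.println(containsNumber(myString, 4));  // false: 4 only occurs inside 14
        System.out.println(containsNumber(myString, 14)); // true
    }
}
```

A call such as act.contains(String.valueOf(i)) would then become NumberList.containsNumber(act, i).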
Look into how to split a String into an array of String. So:
String[] splitStrings = myString.split(",");
ArrayList<Integer> parsedInts = new ArrayList<Integer>();
for (String str : splitStrings) {
parsedInts.add(Integer.parseInt(str));
}
then in your for loop:
if (parsedInts.contains(i)) {
// body
}
Something like this:
String myString = "0,1,2,10,14,";
String[] split = myString.split(",");
for (String string : split) {
int num = Integer.parseInt(string);
if (num == 4) {
System.out.println(num);
// ...
}
}
String myString = "0,1,2,10,14,2323232";
String[] allList = myString.split(",");
for (String string : allList) {
if(string.matches("[0-9]+")) // + rather than *, so an empty token is not reported as a number
{
System.out.println("Its number with value "+string);
}
}
I think you need to pick out all the numbers in the given string and find the permutations.
To do that, tokenize the string using the comma as the separator.
When I write such a program, I keep the parsing logic separate and put the processing logic in another method. Below is the snippet:
String myString = "0,1,2,10,14,";
StringTokenizer st2 = new StringTokenizer(myString , ",");
while (st2.hasMoreElements()) {
doSomething(st2.nextElement());
}

Troubleshooting java.lang.ArrayIndexOutOfBoundsException

Hi friends, I'm doing my final-year project on semantic similarity between sentences.
I'm using the WordNet 2.1 database to retrieve meanings. I split each line into words, look up the meaning of each word, and store it in an array. But I only get the meanings for the first sentence.
String[] sentences = result.split("[\\.\\!\\?]");
for (int i=0;i<sentences.length;i++)
{
System.out.println(i);
System.out.println(sentences[i]);
int wcount1 = sentences[i].split("\\s+").length;
System.out.println(wcount1);int wcount1=wordCount(w2);
System.out.println(wcount1);
String[] word1 = sentences[i].split(" ");
for (int j=0;j<wcount1;j++){
System.out.println(j);
System.out.println(word1[j]);
}
}
IndexWordSet set = wordnet.lookupAllIndexWords(word1[j]);
System.out.println(set);
IndexWord[] ws = set.getIndexWordArray();
**POS p = ws[0].getPOS();///line no 103**
Set<String> synonyms = new HashSet<String>();
IndexWord indexWord = wordnet.lookupIndexWord(p, word1[j]);
Synset[] synSets = indexWord.getSenses();
for (Synset synset : synSets)
{
Word[] words = synset.getWords();
for (Word word : words)
{
synonyms.add(word.getLemma());
}
}
System.out.println(synonyms);
OUTPUT:
Only the words of sentences[0] (the first sentence) show their meanings; the loop does not reach any of the other words.
It shows this error:
**java.lang.ArrayIndexOutOfBoundsException: 0
at first_JWNL.main(first_JWNL.java:102)**
When you declare the variable wcount1, you assign it the length of sentences[i].split("\\s+"). Yet when you assign the variable word1, you use sentences[i].split(" ").
Because you are using two different regular expressions, is it possible that the second split (the one assigned to word1) produces fewer elements than the first? If so, wcount1 may be bigger than the length of word1, and accessing word1[j] in System.out.println(word1[j]); would throw the ArrayIndexOutOfBoundsException. Note also that an index of 0 in the exception message means the array being indexed is empty; the marked line reads ws[0], so it is worth checking ws.length before that access.
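The mismatch between the two split calls can be reproduced with a short sketch. The sample sentence is made up and contains a tab, which "\\s+" treats as a separator but " " does not; the class and method names are hypothetical.

```java
final class SplitMismatch {
    // Number of tokens when splitting on runs of any whitespace.
    static int countByWhitespace(String s) {
        return s.split("\\s+").length;
    }

    // Number of tokens when splitting on the single space character only.
    static int countBySpace(String s) {
        return s.split(" ").length;
    }

    public static void main(String[] args) {
        String sentence = "semantic\tsimilarity between sentences";
        System.out.println(countByWhitespace(sentence)); // 4
        System.out.println(countBySpace(sentence));      // 3

        // Safer: split once and let the array drive the loop, so the count cannot drift.
        for (String word : sentence.split("\\s+")) {
            System.out.println(word);
        }
    }
}
```

Looping up to the first count while indexing the second array would fail at index 3 here, which is why deriving both the count and the words from a single split is the more robust choice.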
