Optimize Finding Words in Paragraph


I'm searching for words in a paragraph, but it takes ages with long paragraphs. To cut down the number of words I have to check, I want to remove each word from the paragraph once I have found it. If there's a better way to make this efficient, do tell!
List<String> list = new ArrayList<>();
for (String word : wordList) {
String regex = ".*\\b" + Pattern.quote(word) + "\\b.*";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(paragraph);
if (m.find()) {
System.out.println("Found: " + word);
list.add(word);
}
}
For example, let's say my wordList has the values "apple", "hungry", "pie"
And my paragraph is "I ate an apple, but I am still hungry, so I will eat pie"
I want to find the words from wordList in the paragraph and remove them, in the hope of making the above code faster.

You may use
String paragraph = "I ate an apple, but I am still hungry, so I will eat pie";
List<String> wordList = Arrays.asList("apple","hungry","pie");
Pattern p = Pattern.compile("\\b(?:" + String.join("|", wordList) + ")\\b");
Matcher m = p.matcher(paragraph);
if (m.find()) { // To find all matches, replace "if" with "while"
System.out.println("Found " + m.group()); // => Found apple
}
The regex will look like \b(?:word1|word2|wordN)\b and will match:
\b - a word boundary
(?:word1|word2|wordN) - any of the alternatives inside the non-capturing group
\b - a word boundary
Since you say the words can only contain uppercase letters, digits, hyphens and slashes, none of the characters need escaping, so Pattern.quote is not needed here. Also, since slashes and hyphens never appear at the start or end of a word, you won't hit the usual problems with the \b word boundary. Otherwise, replace the first "\\b" with "(?<!\\w)" and the last one with "(?!\\w)".
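For word lists that may contain regex metacharacters, the same single-pass idea can be sketched with Pattern.quote and the lookaround boundaries mentioned above. This is only a sketch: the class name WordFinder and the helper findWords are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordFinder {
    // Hypothetical helper: returns every word from wordList that occurs in
    // the paragraph, in order of first appearance, scanning the text once.
    static List<String> findWords(List<String> wordList, String paragraph) {
        if (wordList.isEmpty()) {
            return new ArrayList<>(); // an empty alternation would match empty strings
        }
        // Quote each word so metacharacters are treated literally.
        String alternation = wordList.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Lookarounds instead of \b, so words may start or end with non-word characters.
        Pattern p = Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)");
        Matcher m = p.matcher(paragraph);
        Set<String> found = new LinkedHashSet<>(); // keeps order, drops duplicates
        while (m.find()) {
            found.add(m.group());
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        List<String> wordList = Arrays.asList("apple", "hungry", "pie");
        String paragraph = "I ate an apple, but I am still hungry, so I will eat pie";
        System.out.println(findWords(wordList, paragraph)); // [apple, hungry, pie]
    }
}
```

The pattern is compiled once and the paragraph is scanned once, rather than once per word, which is where the original loop spends its time.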

Related

Checking if there is whitespace between two elements in a String

I am working with Strings where I need to separate two chars/elements if there is whitespace between them. I have seen an earlier post on SO about the same problem, but it has not worked for me as intended. As you would assume, I could just check whether the String contains(" ") and then substring around the space. However, my strings may have any amount of whitespace at the end even when there is none between the characters. Hence my question is: how do I detect whitespace between two chars (including digits)?
//Example with numbers in a String
String test = "2 2";
final Pattern P = Pattern.compile("^(\\d [\\d\\d] )*\\d$");
final Matcher m = P.matcher(test);
if (m.matches()) {
System.out.println("There is between space!");
}

You would use String.strip() to remove any leading or trailing whitespace, followed by String.split(). If there is a whitespace, the array will be of length 2 or greater. If there is not, it will be of length 1.
Example:
String test = " 2 2 ";
test = test.strip(); // Removes whitespace, test is now "2 2"
String[] testSplit = test.split(" "); // Splits the string, testSplit is ["2", "2"]
if (testSplit.length >= 2) {
System.out.println("There is whitespace!");
} else {
System.out.println("There is no whitespace");
}
If you need an array of a specified length, you can also specify a limit to split. For example:
"a b c".split(" ", 2); // Returns ["a", "b c"]
If you want a solution that only uses regex, the following regex matches any two groups of characters separated by a single space, with any amount of leading or trailing whitespace:
\s*(\S+\s\S+)\s*
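A minimal sketch of how that regex-only check could be used with matches(). The class and method names are illustrative assumptions, not from the answer.

```java
import java.util.regex.Pattern;

class SpaceBetweenCheck {
    // Optional outer whitespace, then exactly two non-whitespace groups
    // separated by one whitespace character.
    private static final Pattern TWO_TOKENS = Pattern.compile("\\s*(\\S+\\s\\S+)\\s*");

    // Hypothetical helper: true only if the whole string has that shape.
    static boolean hasSingleInnerSpace(String s) {
        return TWO_TOKENS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasSingleInnerSpace(" 2 2 "));  // true
        System.out.println(hasSingleInnerSpace(" 245 "));  // false
        System.out.println(hasSingleInnerSpace("a b c"));  // false: three groups
    }
}
```

Because matches() must consume the whole input, a string with three or more groups is rejected; switching to find() would accept it.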

Positive lookahead and lookbehind may also work, using the regex (?<=\\w)\\s(?=\\w)
\\w : a word character [a-zA-Z_0-9]
\\s : whitespace
(?<=\\w)\\s : positive lookbehind, matches a whitespace preceded by a \\w
\\s(?=\\w) : positive lookahead, matches a whitespace followed by a \\w
List<String> testList = Arrays.asList("2 2", " 245 ");
Pattern p = Pattern.compile("(?<=\\w)\\s(?=\\w)");
for (String str : testList) {
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(str + "\t: There is a space!");
} else {
System.out.println(str + "\t: There is not a space!");
}
}
Output:
2 2 : There is a space!
245 : There is not a space!

The reason your pattern does not work as expected is that ^(\\d [\\d\\d] )*\\d$, which can be simplified to ^(\\d \\d )*\\d$, starts by repeating what is between the parentheses 0 or more times.
Then it matches a digit at the end of the string. As the repetition is 0 or more times, it is optional, so the pattern also matches just a single digit. Each repetition consumes two digits, so only strings with an odd number of digits such as "2" or "2 2 2" can match, and "2 2" cannot.
If you want to check if there is a single space between 2 non whitespace chars:
\\S \\S
final Pattern P = Pattern.compile("\\S \\S");
final Matcher m = P.matcher(test);
if (m.find()) {
System.out.println("There is between space!");
}
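To see the behaviour of the pattern from the question concretely, here is a small sketch that runs it against a few inputs. The class name is an illustrative assumption.

```java
import java.util.regex.Pattern;

class OriginalPatternDemo {
    // The pattern from the question, unchanged.
    private static final Pattern ORIGINAL = Pattern.compile("^(\\d [\\d\\d] )*\\d$");

    static boolean matchesOriginal(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesOriginal("2"));     // true: group repeated 0 times
        System.out.println(matchesOriginal("2 2"));   // false: two digits cannot fit
        System.out.println(matchesOriginal("2 2 2")); // true: group repeated once
    }
}
```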

Here is the simplest way you can do it:
String testString = " Find if there is a space. ";
testString = testString.trim(); // Strings are immutable, so keep the trimmed result
boolean hasSpace = testString.contains(" "); // Checks if the string still contains a space
You can also do it in one line by chaining the two methods:
String testString = " Find if there is a space. ";
boolean hasSpace = testString.trim().contains(" ");

Use
String text = "2 2";
Matcher m = Pattern.compile("\\S\\s+\\S").matcher(text.trim());
if (m.find()) {
System.out.println("Space detected.");
}
text.trim() will remove leading and trailing whitespaces, \S\s+\S pattern matches a non-whitespace, then one or more whitespace characters, and then a non-whitespace character again.
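One practical difference between this pattern and a plain contains(" ") check is how tabs and other whitespace are treated. The sketch below compares the two; the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class InnerWhitespaceDemo {
    private static final Pattern INNER_WHITESPACE = Pattern.compile("\\S\\s+\\S");

    // Only detects the space character U+0020.
    static boolean viaContains(String s) {
        return s.trim().contains(" ");
    }

    // Detects any whitespace (space, tab, newline, ...) between two non-whitespace chars.
    static boolean viaRegex(String s) {
        return INNER_WHITESPACE.matcher(s.trim()).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"2 2", "2\t2", "  245  "}) {
            System.out.println(viaContains(s) + " / " + viaRegex(s));
        }
        // true / true
        // false / true
        // false / false
    }
}
```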

Find out number of words in a string with a lot of special character

I need to find the number of words in a string. However, this is not a normal string: it has a lot of special characters such as <, /em, /p and many more, so most of the methods suggested on StackOverflow do not work. As a result, I need to define a regular expression myself.
What I intend to do is define what a word is using a regular expression and count the number of times a word appears.
This is how I define a word.
It must start with a letter and end with one of these: : or , or ! or ? or ' or - or ) or . or "
This is how I define my regular expression
pattern = Pattern.compile("^[a-zA-Z](:|,|!|?|'|-|)|.|")$");
matcher = pattern.matcher(line);
while (matcher.find())
wordCount++;
However, there is an error with the first line
pattern = Pattern.compile("^[a-zA-Z](:|,|!|?|'|-|)|.|")$");
How can I fix this problem?

You also want to remove tags, like <em> (HTML emphasis), which would otherwise count as words. If you also consider full tags with attributes, such as <span font="Consolas">, it is easier to remove the tags first:
public static int wordCount(String s) {
return s.replaceAll("<[A-Za-z/][^>]*>", " ") // Tags as space
.replaceAll("[^\\p{L}\\p{M}\\d]+", " ") // Non-letters, -accents, -digits as blank
.trim() // Not before or after (empty words)
.split(" ").length;
}
Chaining replaceAll and trim like this is quite inefficient, because each replaceAll call recompiles its regex. Precompiling with Pattern would be nicer, though probably not worth it.
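A precompiled variant might look like the sketch below. It is an assumption about what "using Pattern" could mean here, not the answerer's code: the class name and constants are made up, and it counts runs of letters, marks and digits directly instead of splitting, which also makes an empty input count as 0 words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAwareWordCounter {
    // Compiled once and reused across calls.
    private static final Pattern TAG = Pattern.compile("<[A-Za-z/][^>]*>");
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{M}\\d]+");

    static int wordCount(String s) {
        // Replace tags with a space so adjacent words do not get glued together.
        String withoutTags = TAG.matcher(s).replaceAll(" ");
        Matcher m = WORD.matcher(withoutTags);
        int count = 0;
        while (m.find()) {
            count++; // each run of letters/marks/digits is one word
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("<p>Hello, <em>world</em>!</p>")); // 2
        System.out.println(wordCount(""));                              // 0
    }
}
```

As with the replaceAll version, an apostrophe splits a word, so "don't" counts as two words under this definition.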

Does this help?
String line = "so.this:is,what)you!wanted?";
int wordCount = 0;
// The hyphen is placed last in the character class so it is a literal, not a range
Pattern pattern = Pattern.compile("[a-zA-Z]++[:',.!?\")-]");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
wordCount++;
}
System.out.println(wordCount); // Prints 6

Extract Arabic phrases from a given text in java

Can you help me find a regex that takes a list of phrases and checks whether one of these phrases exists in the given text?
Example:
If I have the following phrases in the HashSet:
كيف الحال
إلى أين
أين يوجد
هل من أحد هنا
And the given text is: كيف الحال أتمنى أن تكون بخير
I want to get after performing regex: كيف الحال
My initial code:
HashSet<String> QWWords = new HashSet<String>();
QWWords.add("كيف الحال");
QWWords.add("إلى أين");
QWWords.add("أين يوجد");
QWWords.add("هل من أحد هنا");
String s1 = "كيف الحال أتمنى أن تكون بخير";
for (String qp : QWWords) {
Pattern p = Pattern.compile("[\\s" + qp + "\\s]");
Matcher m = p.matcher(s1);
String found = "";
while (m.find()) {
found = m.group();
System.out.println(found);
}
}

[...] is a character class, and a character class matches only one of the characters it lists. For instance, a character class like [abc] can match only a OR b OR c. So if you want to find the whole word abc, don't surround it with [...].
Another problem is that you are using \\s as the word separator, so in the following String
String data = "foo foo foo foo";
the regex \\sfoo\\s will not be able to match the first foo because there is no space before it.
So the first match it finds will be
String data = "foo foo foo foo";
// this one ------^^^^^
Now, since the regex consumed the space after the second foo, it can't reuse it in the next match, so the third foo will also be skipped because there is no space left to match before it.
The fourth foo will not match either, because this time there is no space after it.
To solve this problem you can use \\b, a word boundary, which checks whether the position it represents is between an alphanumeric and a non-alphanumeric character (or at the start/end of the string).
So instead of
Pattern p = Pattern.compile("[\\s" + qp + "\\s]");
use
Pattern p = Pattern.compile("\\b" + qp + "\\b");
or maybe better as Tim mentioned
Pattern p = Pattern.compile("\\b" + qp + "\\b",Pattern.UNICODE_CHARACTER_CLASS);
to make sure that \\b will include Arabic characters in predefined alphanumeric class.
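Put together with the data from the question, the word-boundary approach could be sketched as follows. The class and method names are illustrative assumptions, and the phrase is passed through Pattern.quote so that it is always treated as literal text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PhraseFinder {
    // Hypothetical helper: returns the phrases that occur in the text as whole words.
    static List<String> findPhrases(Iterable<String> phrases, String text) {
        List<String> found = new ArrayList<>();
        for (String phrase : phrases) {
            // UNICODE_CHARACTER_CLASS makes \b treat Arabic letters as word characters.
            Pattern p = Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b",
                    Pattern.UNICODE_CHARACTER_CLASS);
            if (p.matcher(text).find()) {
                found.add(phrase);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("كيف الحال", "إلى أين", "أين يوجد", "هل من أحد هنا");
        String text = "كيف الحال أتمنى أن تكون بخير";
        System.out.println(findPhrases(phrases, text)); // [كيف الحال]
    }
}
```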
UPDATE:
I am not sure if your words can contain regex metacharacters like { [ + * and so on, so just in case you can also add escaping mechanism to change such characters into literals.
So
"\\b" + qp + "\\b"
can become
"\\b" + Pattern.quote(qp) + "\\b"

String Matches, Java

I have a sort of a problem with this code:
String[] paragraph;
if(paragraph[searchKeyword_counter].matches("(.*)(\\b)"+"is"+"(\\b)(.*)")){
If I am not mistaken, to use .matches() to search for particular characters in a string I need .*, but what I want is to search for the characters without matching them inside another word.
For example, if the keyword I am searching for is is, I do not want it to match words that merely contain is, like ship, his, this. So I used \b for a boundary, but the code above is not working for me.
Example:
String[] Content= {"is,","his","fish","ish","its","is"};
String keyword = "is";
for(int i=0;i<Content.length;i++){
if(content[i].matches("(.*)(\\b)"+keyword+"(\\b)(.*)")){
System.out.println("There are "+i+" is.");
}
}
What I want is for it to match only "is," and "is", but not his or fish. In other words, is should match even when the keyword is next to a non-alphanumeric character or a space.
What is the problem with the code above?
Also, if one of the entries has uppercase characters, for example IS, and it is compared with is, it will not match. Correct me if I am wrong. How can I match lowercase against uppercase without changing the content of the source?

String string = "...";
String word = "is";
Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
Matcher m = p.matcher(string);
if (m.find()) {
...
}
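The question also asks about matching IS against is without modifying the source strings. One way, sketched here under the assumption that ASCII case folding is enough, is to add Pattern.CASE_INSENSITIVE. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class KeywordMatcher {
    // Hypothetical helper: true if text contains word as a whole word, ignoring case.
    static boolean containsWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] content = {"is,", "his", "fish", "ish", "its", "is", "IS"};
        for (String entry : content) {
            System.out.println(entry + " -> " + containsWord(entry, "is"));
        }
        // Only "is,", "is" and "IS" print true.
    }
}
```

For non-ASCII keywords, Pattern.UNICODE_CASE would need to be combined with CASE_INSENSITIVE.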

Just add spaces, like this.
Suppose message is your content string and pattern is your keyword:
if ((message).matches(".* " + pattern + " .*")||(message).matches("^" + pattern + " .*")
||(message).matches(".* " + pattern + "$")) {

How to replace any occurrence of a word between quotes

I need to be able to replace all occurrences of the word "and" ONLY when it occurs between single quotes. For example replacing "and" with "XXX" in the string:
This and that 'with you and me and others' and not 'her and him'
Results in:
This and that 'with you XXX me XXX others' and not 'her XXX him'
I have been able to come up with regular expressions which nearly gets every case, but I'm failing with the "and" between the two sets of quoted text.
My code:
String str = "This and that 'with you and me and others' and not 'her and him'";
String patternStr = ".*?\\'.*?(?i:and).*?\\'.*";
Pattern pattern= Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
while(matcher.matches()) {
System.out.println("in matcher");
str = str.replaceAll("(?:\\')(.*?)(?i:and)(.*?)(?:\\')", "'$1XXX$2'");
matcher = pattern.matcher(str);
}
System.out.println(str);

Try this code:
str = "This and that 'with you and me and others' and not 'her and him'";
Matcher matcher = Pattern.compile("('[^']*?')").matcher(str);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(sb, Matcher.quoteReplacement(matcher.group(1).replaceAll("and", "XXX"))); // quoteReplacement keeps any $ or \ in the quoted text literal
}
matcher.appendTail(sb);
System.out.println("Output: " + sb);
OUTPUT
Output: This and that 'with you XXX me XXX others' and not 'her XXX him'
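On Java 9 or later, the same find-then-replace idea can be written with Matcher.replaceAll taking a function. The sketch below is a variation, not the answerer's code: the names are illustrative, and it adds \b around the word so that words such as band are left alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedWordReplacer {
    // A single-quoted segment with no quote inside it.
    private static final Pattern QUOTED = Pattern.compile("'[^']*'");

    // Hypothetical helper: replaces whole-word occurrences of word, but only inside quotes.
    static String replaceInQuotes(String input, String word, String replacement) {
        Pattern wordPattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        String safeReplacement = Matcher.quoteReplacement(replacement);
        // Requires Java 9+: replaceAll(Function<MatchResult, String>).
        return QUOTED.matcher(input).replaceAll(mr ->
                Matcher.quoteReplacement(
                        wordPattern.matcher(mr.group()).replaceAll(safeReplacement)));
    }

    public static void main(String[] args) {
        String str = "This and that 'with you and me and others' and not 'her and him'";
        System.out.println(replaceInQuotes(str, "and", "XXX"));
        // This and that 'with you XXX me XXX others' and not 'her XXX him'
    }
}
```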

String str = "This and that 'with you and me and others' and not 'her and him'";
Pattern p = Pattern.compile("(\\s+)and(\\s+)(?=[^']*'(?:[^']*+'[^']*+')*+[^']*+$)");
System.out.println(p.matcher(str).replaceAll("$1XXX$2"));
The idea is that each time you find the complete word and, you scan from the current match position to the end of the string, looking for an odd number of single quotes. If the lookahead succeeds, the matched word must be between a pair of quotes.
Of course, this assumes quotes always come in matched pairs, and that quotes can't be escaped. Quotes escaped with backslashes can be dealt with, but it makes the regex much longer.
I'm also assuming the target word never appears at the beginning or end of a quoted sequence, which seems reasonable for the word and. If you want to allow for target words that are not surrounded by whitespace, you could use something like "\\band\\b" instead, but be aware of Java's problems in the area of word characters vs word boundaries.
