Extract Arabic phrases from a given text in Java

Can you help me find a regex that takes a list of phrases and checks whether any of them occurs in a given text?
Example:
If the HashSet contains the following phrases:
كيف الحال
إلى أين
أين يوجد
هل من أحد هنا
And the given text is: كيف الحال أتمنى أن تكون بخير
After applying the regex I want to get: كيف الحال
My initial code:
HashSet<String> QWWords = new HashSet<String>();
QWWords.add("كيف الحال");
QWWords.add("إلى أين");
QWWords.add("أين يوجد");
QWWords.add("هل من أحد هنا");
String s1 = "كيف الحال أتمنى أن تكون بخير";
for (String qp : QWWords) {
Pattern p = Pattern.compile("[\\s" + qp + "\\s]");
Matcher m = p.matcher(s1);
String found = "";
while (m.find()) {
found = m.group();
System.out.println(found);
}
}

[...] is a character class, and a character class matches exactly one character out of the set it specifies. For instance, a character class like [abc] can match only a OR b OR c. So if you want to find the whole word abc, don't surround it with [...].
Another problem is that you are using \\s as the word separator, so in the following String
String data = "foo foo foo foo";
the regex \\sfoo\\s will not be able to match the first foo because there is no space before it.
So the first match it finds will be
String data = "foo foo foo foo";
//      this one--^^^^^
Now, since the regex consumed the space after the second foo, it can't reuse it in the next match, so the third foo will also be skipped because there is no space left to match before it.
Nor will it match the fourth foo, because this time there is no space after it.
To solve this problem you can use \\b, a word boundary, which checks whether the position it represents is between alphanumeric and non-alphanumeric characters (or at the start/end of the string).
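The walk-through above can be checked with a small sketch. The class and method names (SeparatorDemo, findSpans) are made up for this illustration; it only lists where \\sfoo\\s matches in the sample string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorDemo {
    // Hypothetical helper: returns each match as "start-end" (end exclusive).
    static List<String> findSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        String data = "foo foo foo foo";
        // Only the second foo has whitespace on both sides that is still
        // available when the matcher reaches it.
        System.out.println(findSpans("\\sfoo\\s", data)); // [3-8]
    }
}
```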
So instead of
Pattern p = Pattern.compile("[\\s" + qp + "\\s]");
use
Pattern p = Pattern.compile("\\b" + qp + "\\b");
or maybe better as Tim mentioned
Pattern p = Pattern.compile("\\b" + qp + "\\b",Pattern.UNICODE_CHARACTER_CLASS);
to make sure that \\b treats Arabic characters as part of the predefined alphanumeric class.
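Putting this together, here is a minimal runnable sketch using the phrases from the question. The class and method names (PhraseFinder, findPhrases) are invented for the example, and Pattern.quote is added only as a precaution so that each phrase is treated as literal text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class PhraseFinder {
    // Returns the phrases that occur in 'text' as whole words, in set order.
    static List<String> findPhrases(Set<String> phrases, String text) {
        List<String> found = new ArrayList<>();
        for (String phrase : phrases) {
            // UNICODE_CHARACTER_CLASS makes \b use Unicode word characters.
            Pattern p = Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b",
                    Pattern.UNICODE_CHARACTER_CLASS);
            if (p.matcher(text).find()) {
                found.add(phrase);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> phrases = new LinkedHashSet<>(Arrays.asList(
                "كيف الحال", "إلى أين", "أين يوجد", "هل من أحد هنا"));
        String text = "كيف الحال أتمنى أن تكون بخير";
        System.out.println(findPhrases(phrases, text)); // [كيف الحال]
    }
}
```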
UPDATE:
I am not sure whether your phrases can contain regex metacharacters like { [ + * and so on, so just in case you can also escape them so that such characters are treated as literals.
So
"\\b" + qp + "\\b"
can become
"\\b" + Pattern.quote(qp) + "\\b"

Related

Optimize Finding Words in Paragraph

I'm searching for words in a paragraph, but it takes ages with long paragraphs. Hence, I want to remove each word from the paragraph after I find it, to reduce the amount of text I have to go through. Or if there's a better way to make this efficient, do tell!
List<String> list = new ArrayList<>();
for (String word : wordList) {
String regex = ".*\\b" + Pattern.quote(word) + "\\b.*";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(paragraph);
if (m.find()) {
System.out.println("Found: " + word);
list.add(word);
}
}
For example, let's say my wordList has the following values: "apple", "hungry", "pie"
and my paragraph is "I ate an apple, but I am still hungry, so I will eat pie".
I want to find the words from wordList in the paragraph and eliminate them, in the hope of making the above code faster.
You may use
String paragraph = "I ate an apple, but I am still hungry, so I will eat pie";
List<String> wordList = Arrays.asList("apple","hungry","pie");
Pattern p = Pattern.compile("\\b(?:" + String.join("|", wordList) + ")\\b");
Matcher m = p.matcher(paragraph);
if (m.find()) { // To find all matches, replace "if" with "while"
System.out.println("Found " + m.group()); // => Found apple
}
See the Java demo.
The regex will look like \b(?:word1|word2|wordN)\b and will match:
\b - a word boundary
(?:word1|word2|wordN) - any of the alternatives inside the non-capturing group
\b - a word boundary
Since you say the characters in the words can only be uppercase letters, digits and hyphens with slashes, none of them need escaping, so Pattern.quote is not important here. Also, since the slashes and hyphens will never appear at the start/end of the string, you won't have issues usually caused by \b word boundary. Otherwise, replace the first "\\b" with "(?<!\\w)" and the last one with "(?!\\w)".
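For word lists that might contain regex metacharacters or non-word characters at the edges, the variant described above could look like the following sketch. The names WordListMatcher, compile and findAll are hypothetical. Each word is quoted, \b is replaced by the lookarounds, and the pattern is compiled once instead of once per word.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordListMatcher {
    // Builds one alternation pattern for all words, quoting each of them.
    static Pattern compile(List<String> words) {
        String alternatives = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?<!\\w)(?:" + alternatives + ")(?!\\w)");
    }

    // Collects every distinct match in a single pass over the text.
    static Set<String> findAll(Pattern p, String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String paragraph = "I ate an apple, but I am still hungry, so I will eat pie";
        Pattern p = compile(Arrays.asList("apple", "hungry", "pie", "banana"));
        System.out.println(findAll(p, paragraph)); // [apple, hungry, pie]
    }
}
```

A single pass like this avoids rescanning the paragraph for every word, which is probably where most of the time goes in the original loop.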

Trying to find a word with separators in a string

I have a full string like this: "Hello all you guys"
and I have a bad word like "all".
I managed to find the second string in the first; that's easy.
But let's say my first string is "Hello a.l.l you guys"
or "Hello a,l,l you guys"
or even "Hello a l l you guys".
Is there a regex way to find it?
What I've got so far is
String wordtocheck =pair.getKey().toString();
String newerstr = "";
for(int i=0;i<wordtocheck.length();i++)
newerstr+=wordtocheck.charAt(i)+"\\.";
Pattern.compile("(?i)\\b(newerstr)(?=\\W)").matcher(currentText.toString());
but it doesn't do the trick.
Thanks to all helpers.
You may build the pattern dynamically by inserting \W* (=zero or more non-word chars, that is, chars that are not letters, digits or underscore) in between the characters of a keyword to search for:
String s = "Hello a l l you guys";
String key = "all";
String pat = "(?i)\\b" + TextUtils.join("\\W*", key.split("")) + "\\b";
System.out.println("Pattern: " + pat);
Matcher m = Pattern.compile(pat).matcher(s);
if (m.find())
{
System.out.println("Found: " + m.group());
}
See the online demo (String.join is used instead of TextUtils.join since this is a Java demo)
If there can be non-word chars in the search words, you need to replace the \b word boundaries with (?<!\\S) (instead of the initial \b) and (?!\\S) (instead of the trailing \b), or remove them altogether.
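A plain-Java sketch of the same idea, without TextUtils, might look like this. The names SeparatedWordFinder, build and firstMatch are made up here, and it assumes the keyword consists of ordinary (BMP) characters; each character is quoted so that a keyword containing metacharacters stays literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatedWordFinder {
    // Inserts \W* between the (quoted) characters of the keyword.
    static Pattern build(String key) {
        StringBuilder sb = new StringBuilder("(?i)\\b");
        for (int i = 0; i < key.length(); i++) {
            if (i > 0) {
                sb.append("\\W*");
            }
            sb.append(Pattern.quote(String.valueOf(key.charAt(i))));
        }
        sb.append("\\b");
        return Pattern.compile(sb.toString());
    }

    // Returns the first match, or null when the keyword is not found.
    static String firstMatch(String key, String text) {
        Matcher m = build(key).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("all", "Hello a.l.l you guys")); // a.l.l
        System.out.println(firstMatch("all", "Hello a l l you guys")); // a l l
        System.out.println(firstMatch("all", "Hello ball you guys"));  // null
    }
}
```

Note that because \W* also matches spaces, this kind of pattern can match across word gaps, so some false positives are to be expected.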
Try this
String str="Hello .a-l l? guys";
str=str.replaceAll("\\W",""); //replaces all non-words chars with empty string.
str is now "Helloallguys"

Regex in Java to match a word that is not part of other word

I have created this regex for this:
(?<!\w)name(?!\w)
Which I expect that will match things like:
name
(name)
But should not match things like:
myname
names
My problem is that if I use this pattern in Java, it doesn't work when symbols other than whitespace are used, like brackets.
I tested the regex in this site (http://gskinner.com/RegExr/, which is a very nice site btw) and it works, so I'm wondering if Java requires a different syntax.
String regex = "((?<!\\w)name(?!\\w))";
"(name".matches(regex); //it returns false
Why not use word boundary?
Pattern pattern = Pattern.compile("\\bname\\b");
String test = "name (name) mynames";
Matcher matcher = pattern.matcher(test);
while (matcher.find()) {
System.out.println(matcher.group() + " found between indexes: " + matcher.start() + " and " + matcher.end());
}
Output:
name found between indexes: 0 and 4
name found between indexes: 6 and 10
Use the "word boundary" regex \b:
if (str.matches(".*\\bname\\b.*"))
// str contains "name" as a separate word
Note that this will not match "foo _name bar" or "foo name1 bar", because underscores and digits are considered word characters. If you want to match a "non-letter" around "name", use this:
if (str.matches(".*(^|[^a-zA-Z])name([^a-zA-Z]|$).*"))
// str contains "name" as a separate word
See Regex Word Boundaries
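It may help to see why the original test returned false: String.matches only succeeds when the whole string matches the pattern, whereas Matcher.find looks for the pattern anywhere in the input. The sketch below uses the lookaround regex from the question; the class and method names (MatchesVsFind, containsName) are invented for the illustration.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    private static final Pattern NAME = Pattern.compile("(?<!\\w)name(?!\\w)");

    // True when "name" occurs somewhere in s as a separate word.
    static boolean containsName(String s) {
        return NAME.matcher(s).find();
    }

    public static void main(String[] args) {
        // matches() requires the ENTIRE input to match, so "(" breaks it.
        System.out.println("(name".matches("((?<!\\w)name(?!\\w))")); // false
        // find() searches for the pattern inside the input.
        System.out.println(containsName("(name"));  // true
        System.out.println(containsName("(name)")); // true
        System.out.println(containsName("myname")); // false
        System.out.println(containsName("names"));  // false
    }
}
```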

Iterating through String with .find() in Java regex

I'm currently trying to solve a problem from codingbat.com with regular expressions.
I'm new to this, so step-by-step explanations would be appreciated. I could solve this with String methods relatively easily, but I am trying to use regular expressions.
Here is the prompt:
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
etc
My code thus far:
String regex = ".?" + word+ ".?";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
String newStr = "";
while(m.find())
newStr += m.group().replace(word, "");
return newStr;
The problem is that when there are multiple instances of word in a row, the program misses the character preceding the word because m.find() progresses beyond it.
For example: wordEnds("abc1xyz1i1j", "1") should return "cxziij", but my method returns "cxzij", not repeating the "i"
I would appreciate a non-messy solution with an explanation I can apply to other general regex problems.
This is a one-liner solution:
String wordEnds = input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
This matches your edge case as a look ahead within a non-capturing group, then matches the usual (consuming) case.
Note that your requirements don't require iteration, only your question title assumes it's necessary, which it isn't.
Note also that to be absolutely safe, you should escape all characters in word in case any of them are special "regex" characters, so if you can't guarantee that, you need to use Pattern.quote(word) instead of word.
Here's a test of the usual case and the edge case, showing it works:
public static String wordEnds(String input, String word) {
word = Pattern.quote(word); // add this line to be 100% safe
return input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
}
public static void main(String[] args) {
System.out.println(wordEnds("abcXY123XYijk", "XY"));
System.out.println(wordEnds("abc1xyz1i1j", "1"));
}
Output:
c13i
cxziij
Use positive lookbehind and positive lookahead, which are zero-width assertions:
(?<=(.)|^)1(?=(.)|$)
    ^     ^   ^-- looks for a character after 1 and captures it in group 2
    |     |-- matches 1; you can replace it with any word
    |
    |-- looks for a character just before 1 and captures it in group 1; this is a zero-width assertion that doesn't move forward to match. It is just a test, and thus allows us to capture the values
$1 and $2 contain your values. Keep finding until the end of the input.
So this should be like
String s1 = "abcXY123XYiXYjk";
String s2 = java.util.regex.Pattern.quote("XY");
String s3 = "";
String r = "(?<=(.)|^)"+s2+"(?=(.)|$)";
Pattern p = Pattern.compile(r);
Matcher m = p.matcher(s1);
while(m.find()) s3 += m.group(1)+m.group(2);
//s3 now contains c13iij
works here
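One thing to watch with this approach: when the word sits at the very start or end of the input, the corresponding group is null, and string concatenation would append the text "null". A hedged sketch of a full method with that guard follows. The class name WordEnds is arbitrary, and Pattern.DOTALL is an addition here so that a line break next to the word also counts as a character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordEnds {
    static String wordEnds(String str, String word) {
        // Both neighbours are captured inside lookarounds, so they are not
        // consumed and can be reused by the next match.
        Pattern p = Pattern.compile(
                "(?<=(.)|^)" + Pattern.quote(word) + "(?=(.)|$)", Pattern.DOTALL);
        Matcher m = p.matcher(str);
        StringBuilder result = new StringBuilder();
        while (m.find()) {
            if (m.group(1) != null) {
                result.append(m.group(1)); // char just before the word
            }
            if (m.group(2) != null) {
                result.append(m.group(2)); // char just after the word
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(wordEnds("abcXY123XYijk", "XY")); // c13i
        System.out.println(wordEnds("XY1XY", "XY"));         // 11
        System.out.println(wordEnds("abc1xyz1i1j", "1"));    // cxziij
    }
}
```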
Use regex as follows:
Matcher m = Pattern.compile("(.|)" + Pattern.quote(b) + "(?=(.?))").matcher(a);
for (int i = 1; m.find(); c += m.group(1) + m.group(2), i++);
Check this demo.

String Matches, Java

I have a sort of a problem with this code:
String[] paragraph;
if(paragraph[searchKeyword_counter].matches("(.*)(\\b)"+"is"+"(\\b)(.*)")){
If I am not mistaken, to use .matches() to search for a particular sequence of characters in a string I need .*, but what I want is to search for it without matching it inside another word.
For example, if is is the keyword I am searching for, I do not want it to match words that merely contain is, like ship, his, this. So I used \b for the boundary, but the code above is not working for me.
Example:
String[] content = {"is,","his","fish","ish","its","is"};
String keyword = "is";
for(int i=0;i<Content.length;i++){
if(content[i].matches("(.*)(\\b)"+keyword+"(\\b)(.*)")){
System.out.println("There are "+i+" is.");
}
}
What I want here is for it to match only is, and not his or fish. So is should match "is," and "is", meaning I want it to match even when the keyword is next to a non-alphanumeric character or a space.
What is the problem with the code above?
What if one of the contents has uppercase characters, for example IS, and it is compared with is? It will not match. Correct me if I am wrong. How can I match lowercase against uppercase characters without changing the content of the source?
String string = "...";
String word = "is";
Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
Matcher m = p.matcher(string);
if (m.find()) {
...
}
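For the upper/lower case part of the question, one option is to compile the same pattern with Pattern.CASE_INSENSITIVE, which leaves the source strings untouched. This is a small sketch with made-up names (WholeWordMatcher, containsWord); by default the flag only covers ASCII letters, and Pattern.UNICODE_CASE would need to be added for other alphabets.

```java
import java.util.regex.Pattern;

class WholeWordMatcher {
    // True when 'word' occurs in 'text' as a whole word, ignoring case.
    static boolean containsWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] content = {"is,", "his", "fish", "ish", "its", "is", "IS"};
        for (String item : content) {
            // Matches "is,", "is" and "IS" only.
            System.out.println(item + " -> " + containsWord(item, "is"));
        }
    }
}
```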
Just add spaces like this.
Suppose message is your content string and pattern is your keyword:
if ((message).matches(".* " + pattern + " .*")||(message).matches("^" + pattern + " .*")
||(message).matches(".* " + pattern + "$")) {
