Android - Java - Regular Expression question - consecutive words not being matched

For my example I am trying to replace ALL cases of "the" and "a" in a string with a space, including cases where these words are next to characters such as quotes and other punctuation.
String oldString = "A test of the exp.";
Pattern p = Pattern.compile("(((\\W|\\A)the(\\W|\\Z))|((\\W|\\A)a(\\W|\\Z)))", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(oldString);
String newString = m.replaceAll(" ");
"A test of the exp." returns "test of exp." - Yeah!
"A test of the a exp." returns "test of a exp." - Boooo!
"The a in this test is a the." returns "a in this test is the." - DoubleBoooo!
Any help would be greatly appreciated.
Thanks!

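The original pattern consumes the non-word character on each side of the word. When two target words are separated by a single space, that space is already used up by the first match, so the second word is skipped. Word boundaries are zero-width and avoid this:
String resultString = subjectString.replaceAll("(?i)\\b(?:a|the)\\b", " ");
\b matches at a word boundary (i.e. at the start or end of a word, where "word" is a sequence of letters, digits or underscores).
(?:...) is a non-capturing group, needed to separate the alternative words (in this case a and the) from the surrounding word boundary anchors.
(?i) makes the match case-insensitive, so "A" and "The" are replaced as well.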
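To see the difference between consuming the delimiters and using zero-width word boundaries, here is a minimal sketch. The class and method names are made up for illustration, and the whitespace clean-up in withBoundaryPattern is an extra assumption, not part of the question.

```java
import java.util.regex.Pattern;

final class StopWordDemo {
    // The question's pattern: each match swallows the neighbouring non-word character.
    private static final Pattern CONSUMING = Pattern.compile(
            "(\\W|\\A)the(\\W|\\Z)|(\\W|\\A)a(\\W|\\Z)", Pattern.CASE_INSENSITIVE);

    // Zero-width word boundaries leave the neighbouring characters available
    // for the next match.
    private static final Pattern BOUNDARY = Pattern.compile(
            "\\b(?:a|the)\\b", Pattern.CASE_INSENSITIVE);

    static String withConsumingPattern(String input) {
        return CONSUMING.matcher(input).replaceAll(" ");
    }

    // Hypothetical extra step: collapse whitespace runs so the result is easy to read.
    static String withBoundaryPattern(String input) {
        return BOUNDARY.matcher(input).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(withConsumingPattern("A test of the a exp.")); // still contains " a "
        System.out.println(withBoundaryPattern("A test of the a exp."));  // prints "test of exp."
    }
}
```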

Or, simplifying @Robokop's solution:
Pattern.compile("\\b(the|a)\\b", Pattern.CASE_INSENSITIVE);
In Java the pattern has to be a double-quoted string literal with each backslash doubled; a single-quoted '\b(the|a)\b' does not compile.

Pattern.compile("(\\bthe\\b)|(\\ba\\b)",Pattern.CASE_INSENSITIVE);

Related

Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 2

I'm trying to remove some words from a line. I don't want to hard-code each replacement, because the list of words to remove may grow, so I put them in an array. But when I try this, I get the exception in the title. I will be glad of any help.
String[] a = new String[]{
"\\bof \\b",
"\\bor \\b "," \\bit \\b "," \\bto \\b "
};
String str = "asdasdas of or asdasd";
str = str.replaceAll(
Arrays.toString(a)
, "");

This is a partial answer..
Problem #1: Arrays.toString(a) returns a string wrapped in [ ], so this is the result:
[\bof \b, \bor \b ,  \bit \b ,  \bto \b ]
The brackets are special in a regex: they delimit a character class. Inside a character class \b does not mean "word boundary"; Java rejects it with "Illegal/unsupported escape sequence".
So the first step is to remove the brackets from the regex string:
String regex = Arrays.toString(a).replace("[","").replace("]", "");
Problem #2: Your regex now looks like:
\bof \b, \bor \b ,  \bit \b ,  \bto \b
This is not likely what you want. Here is how I interpret what you want:
Remove all occurrences of "of" OR "or" OR "it" OR "to" if and only if they are surrounded by word boundaries, and remove the trailing space as well.
(Note also the extra spaces around some of the expressions in the array, which should be removed.)
So assume we start with:
String[] a = new String[] { "\\bof \\b", "\\bor \\b", "\\bit \\b", "\\bto \\b" };
So to implement the "OR" part you need:
\bof \b|\bor \b|\bit \b|\bto \b
And so the replace operation would look like:
str = str.replaceAll("\\bof \\b|\\bor \\b|\\bit \\b|\\bto \\b","");
To continue using your array approach you could then use:
String regex = Arrays.toString(a).replace("[","").replace("]", "");
regex = regex.replaceAll(", +","|");
which yields the regex from your array (same as discussed before):
\bof \b|\bor \b|\bit \b|\bto \b
And so
String str = "asdasdas of or asdasd";
str = str.replaceAll(regex,"");
yields:
asdasdas asdasd
I was a bit loose in the treatment of regex explanation so I would invite others to be more precise in their answers.
Good luck.
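As an alternative to post-processing Arrays.toString, the alternation can be built directly from the array. This is only a sketch under one assumption that differs from the question: the array holds plain words ("of", "or", ...) rather than ready-made regex fragments. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WordRemover {
    // Builds \b(?:w1|w2|...)\b\s* from plain words. Pattern.quote keeps any
    // regex metacharacters inside a word literal.
    static Pattern buildPattern(String... words) {
        String alternation = Arrays.stream(words)
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s*");
    }

    // Removes every listed word together with the whitespace that follows it.
    static String remove(String input, String... words) {
        return buildPattern(words).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(remove("asdasdas of or asdasd", "of", "or", "it", "to"));
        // prints "asdasdas asdasd"
    }
}
```

Adding a word to the list then only means adding one more string to the array.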

Regex to find the first word in a string java without using the string name

I have a string that can contain a sentence with symbols and numbers, and the sentence can have different lengths.
For example:
String myString = " () Huawei manufactures phones";
The next time, myString can have the following words:
String myString = " * Audi has amazing cars &^";
How can I use a regex to get the first word from the string, so that the only word I get from the first myString is "Huawei" and the word I get from the second myString is "Audi"?
Below is what I have tried, but it fails when there are spaces and symbols before the first word:
String regexString = myString.replaceAll("\\s.*","");

You may use this regex with a capture group for matching:
^\W*\b(\w+).*
and replace with: $1
Java Code:
s = s.replaceAll("^\\W*\\b(\\w+).*", "$1");
RegEx Details:
^: Start
\W*: Match 0 or more non-word characters
\b: Word boundary
(\w+): Match 1+ word characters and capture it in group #1
.*: Match anything afterwards
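If a replacement is not required, the same idea can be expressed by searching for the first run of word characters. This is a small sketch rather than the answer's own code; the class name and the choice to return null when nothing is found are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstWord {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns the first run of word characters (letters, digits, underscore),
    // or null when the input contains none.
    static String of(String input) {
        Matcher m = WORD.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(of(" () Huawei manufactures phones")); // prints "Huawei"
        System.out.println(of(" * Audi has amazing cars &^"));    // prints "Audi"
    }
}
```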

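See how you get on with the following, which strips everything before the first letter (the rest of the sentence is left in place):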
s = s.replaceAll("^[^\\p{Alpha}]*", "");

Java regex exact match with question mark and word boundary

In Java, I am trying to determine whether a user-entered string (meaning I do not know what the input will be) is contained exactly within another string, on word boundaries. So an input of "the" should not be matched in the text "there is no match". I am running into issues when there is punctuation in the input string, however, and could use some help.
With no punctuation, this works just fine:
String input = "string contain";
Pattern p = Pattern.compile("\\b" + Pattern.quote(input) + "\\b");
//both should and do match
System.out.println(p.matcher("does this string contain the input").find());
System.out.println(p.matcher("does this string contain? the input").find());
However when the input has a question mark in it, the matching with the word boundary doesn't seem to work:
String input = "string contain?";
Pattern p = Pattern.compile("\\b" + Pattern.quote(input) + "\\b");
//should not match - doesn't
System.out.println(p.matcher("does this string contain the input").find());
//expected match - doesn't
System.out.println(p.matcher("does this string contain? the input").find());
//should not match - doesn't
System.out.println(p.matcher("does this string contain?fail the input").find());
Any help would be appreciated.

There's no word boundary between ? and the following space, because neither of them is a word character; that's why your pattern doesn't match. You can change it to this:
Pattern.compile("(^|\\W)" + Pattern.quote(input) + "($|\\W)");
That matches begin of input or non-word character - pattern - end of input or non-word character. Or, better, you use a negative lookbehind and a negative lookahead:
Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(input) + "(?!\\w)");
This means, before and after your pattern there must not be a word character.
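The lookaround version can be wrapped in a small helper to check it against the cases from the question. This is a sketch; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class WholePhraseMatcher {
    // True when input occurs in text with no word character directly
    // before or after it. Pattern.quote makes the user input literal.
    static boolean containsWhole(String text, String input) {
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(input) + "(?!\\w)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String input = "string contain?";
        System.out.println(containsWhole("does this string contain the input", input));      // false
        System.out.println(containsWhole("does this string contain? the input", input));     // true
        System.out.println(containsWhole("does this string contain?fail the input", input)); // false
    }
}
```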

You can use :
Pattern p = Pattern.compile("(\\s|^)" + Pattern.quote(input) + "(\\s|$)");
//---------------------------^^^^^^^----------------------------^^^^^^^
For these strings you will get:
does this string contain the input -> false
does this string contain? the input -> true
does this fail the input string contain? -> true
does this string contain?fail the input -> false
string contain? the input -> true
The idea is to match strings in which your input is preceded by whitespace or the start of the string, and followed by whitespace or the end of the string.

You are matching using word boundaries: \b.
Java's regex implementation treats the following as word characters:
\w := [a-zA-Z_0-9]
Non-word characters are simply the ones outside that set:
[^\w] := [^a-zA-Z_0-9]
A word boundary is a transition from [a-zA-Z_0-9] to [^a-zA-Z_0-9] or vice versa.
For the input "does this string contain? the input" and the literal pattern \\b\\Qstring contain?\\E\\b, the last \\b would have to match between ? and the white space. That is neither a word to non-word nor a non-word to word transition as defined above, so it is not a word boundary.
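To make the definition concrete, this sketch lists every offset at which \b matches in a short text. The helper name is hypothetical. Note that offset 8, between ? and the space, is missing from the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryProbe {
    // Collects the offsets of all zero-width \b matches in text.
    static List<Integer> boundaryOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Offsets: 0 = before 'c', 7 = between 'n' and '?',
        // 9 = between ' ' and 't', 12 = after the final 'e'.
        System.out.println(boundaryOffsets("contain? the")); // prints [0, 7, 9, 12]
    }
}
```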

String split regex [duplicate]

I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.
Example input:
This is a string that "will be" highlighted when your 'regular expression' matches something.
Desired output:
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.
Note that "will be" and 'regular expression' retain the space between the words.

I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:
[^\s"']+|"([^"]*)"|'([^']*)'
I added the capturing groups because you don't want the quotes in the list.
This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    if (regexMatcher.group(1) != null) {
        // Add double-quoted string without the quotes
        matchList.add(regexMatcher.group(1));
    } else if (regexMatcher.group(2) != null) {
        // Add single-quoted string without the quotes
        matchList.add(regexMatcher.group(2));
    } else {
        // Add unquoted word
        matchList.add(regexMatcher.group());
    }
}
If you don't mind having the quotes in the returned list, you can use much simpler code:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
}

There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:
parsings strings: extracting words and phrases
Best way to parse Space Separated Text
UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?
m/('.*?'|".*?"|\S+)/g
Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).
This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.
Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)

If you want to allow escaped quotes inside the string, you can use something like this:
(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
Quoted strings will be group 2, single unquoted words will be group 3.
You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/

The regex from Jan Goyvaerts is the best solution I have found so far, but it also creates empty (null) matches, which he excludes in his program. These empty matches also show up in regex testers (e.g. rubular.com).
If you turn the search around (first look for the quoted parts and then for the space-separated words), you might do it in one go with:
("[^"]*"|'[^']*'|[\S]+)+

(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s
This will match the spaces not surrounded by double quotes.
I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.

It'll probably be easier to search the string, grabbing each part, vs. split it.
The reason is that you can have it split at the spaces before and after "will be", but I can't think of any way to tell a split to ignore the space inside the quotes.
(not actual Java)
string = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
regex = "\"(\\\"|(?!\\\").)+\"|[^ ]+"; // search for a quoted or non-spaced group
final = new Array();
while (string.length > 0) {
    string = string.trim();
    if (Regex(regex).test(string)) {
        final.push(Regex(regex).match(string)[0]);
        string = string.replace(regex, ""); // progress to next "word"
    }
}
Also, capturing single quotes could lead to issues:
"Foo's Bar 'n Grill"
//=>
"Foo"
"s Bar "
"n"
"Grill"

String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);
for (int i = 0; i < len; i++)
{
    m.region(i, len);
    if (m.lookingAt())
    {
        String s = m.group(1);
        if ((s.startsWith("\"") && s.endsWith("\"")) ||
            (s.startsWith("'") && s.endsWith("'")))
        {
            s = s.substring(1, s.length() - 1);
        }
        System.out.println(i + ": \"" + s + "\"");
        i += (m.group(0).length() - 1);
    }
}
which produces the following output:
0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."

I liked Marcus's approach, however, I modified it so that I could allow text near the quotes, and support both " and ' quote characters. For example, I needed a="some value" to not split it into [a=, "some value"].
"(?<!\\G\\S{0,99999}[\"'].{0,99999})\\s|(?<=\\G\\S{0,99999}\".{0,99999}\"\\S{0,99999})\\s|(?<=\\G\\S{0,99999}'.{0,99999}'\\S{0,99999})\\s"

Jan's approach is great but here's another one for the record.
If you actually wanted to split as mentioned in the title, keeping the quotes in "will be" and 'regular expression', then you could use this method, which is straight out of "Match (or replace) a pattern except in situations s1, s2, s3 etc".
The regex:
'[^']*'|"[^"]*"|( )
The two left alternatives match complete 'quoted strings' and "double-quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expressions on the left. We replace those with SplitHere, then split on SplitHere. Again, this is for a true split case where you want "will be", not will be.
Here is a full working implementation.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Program {
    public static void main(String[] args) {
        String subject = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
        Pattern regex = Pattern.compile("'[^']*'|\"[^\"]*\"|( )");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            // Group 1 is set only for a space outside quotes
            if (m.group(1) != null) m.appendReplacement(b, "SplitHere");
            // Quoted strings are copied back unchanged; quoteReplacement guards against $ and \
            else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
        }
        m.appendTail(b);
        String replaced = b.toString();
        String[] splits = replaced.split("SplitHere");
        for (String split : splits) System.out.println(split);
    } // end main
} // end Program

If you are using c#, you can use
string input = "This is a string that \"will be\" highlighted when your 'regular expression' matches <something random>";
List<string> list1 =
    Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""|'(?<match>[\w\s]*)'|<(?<match>[\w\s]*)>").Cast<Match>().Select(m => m.Groups["match"].Value).ToList();
foreach (var v in list1)
    Console.WriteLine(v);
I have specifically added |<(?<match>[\w\s]*)> to highlight that you can specify any character to group phrases (in this case I am using < > to group).
Output is :
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something random

1st one-liner using String.split()
String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
String[] split = s.split( "(?<!(\"|').{0,255}) | (?!.*\\1.*)" );
[This, is, a, string, that, "will be", highlighted, when, your, 'regular expression', matches, something.]
don't split at the blank, if the blank is surrounded by single or double quotes
split at the blank when the 255 characters to the left and all characters to the right of the blank are neither single nor double quotes
adapted from original post (handles only double quotes)

I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
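A hand-written tokenizer of the kind suggested above might look like the following. It is a minimal sketch with invented names, and it assumes that quotes are never escaped and that every apostrophe opens or closes a quoted section (so input like Foo's would need extra handling).

```java
import java.util.ArrayList;
import java.util.List;

final class QuoteAwareTokenizer {
    // Splits on whitespace, except inside '...' or "..."; the quotes are dropped.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0;          // 0 means "not inside a quoted section"
        boolean inToken = false; // lets an empty quoted string still count as a token
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;             // closing quote
                } else {
                    current.append(c);     // keep everything inside the quotes
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                 // opening quote
                inToken = true;
            } else if (Character.isWhitespace(c)) {
                if (inToken) {             // whitespace outside quotes ends the token
                    tokens.add(current.toString());
                    current.setLength(0);
                    inToken = false;
                }
            } else {
                current.append(c);
                inToken = true;
            }
        }
        if (inToken) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
        for (String token : tokenize(s)) {
            System.out.println(token);
        }
    }
}
```

Because the tokens are collected in a single left-to-right pass, their original order is preserved without any extra bookkeeping.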

A couple hopefully helpful tweaks on Jan's accepted answer:
(['"])((?:\\\1|.)+?)\1|([^\s"']+)
Allows escaped quotes within quoted strings
Avoids repeating the pattern for the single and double quote; this also simplifies adding more quoting symbols if needed (at the expense of one more capturing group)
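For reference, here is how the tweaked pattern could be used from Java. This is a sketch with a hypothetical class name; note that the backslashes of escaped quotes stay in the captured text, so a caller that wants them removed would need an extra step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedQuoteTokens {
    // The pattern from the answer, written as a Java string literal.
    private static final Pattern TOKEN =
            Pattern.compile("(['\"])((?:\\\\\\1|.)+?)\\1|([^\\s\"']+)");

    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 2 is the body of a quoted string (escape backslashes are kept);
            // group 3 is an unquoted word.
            out.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return out;
    }

    public static void main(String[] args) {
        // The input contains an escaped double quote inside a double-quoted string.
        for (String token : tokens("say \"he said \\\"hi\\\"\" now")) {
            System.out.println(token);
        }
    }
}
```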

You can also try this:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something";
String ss[] = str.split("\"|\'");
for (int i = 0; i < ss.length; i++) {
    if ((i % 2) == 0) { // even index: text outside quotes, split on spaces
        String[] part1 = ss[i].split(" ");
        for (String pp1 : part1) {
            System.out.println("" + pp1);
        }
    } else { // odd index: text that was inside quotes, keep whole
        System.out.println("" + ss[i]);
    }
}

The following returns an array of arguments. Arguments are the variable 'command' split on spaces, unless included in single or double quotes. The matches are then modified to remove the single and double quotes.
using System.Text.RegularExpressions;
var args = Regex.Matches(command, "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'").Cast<Match>()
    .Select(iMatch => iMatch.Value.Replace("\"", "").Replace("'", "")).ToArray();

When you come across a pattern like this:
String str = "2022-11-10 08:35:00,470 RAV=REQ YIP=02.8.5.1 CMID=caonaustr CMN=\"Some Value Pyt Ltd\"";
// this helped
String[] str1 = str.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println("Value of split string is " + Arrays.toString(str1));
This results in: [2022-11-10, 08:35:00,470, RAV=REQ, YIP=02.8.5.1, CMID=caonaustr, CMN="Some Value Pyt Ltd"]
This regex matches a space only if it is followed by an even number of double quotes, i.e. only if the space is outside a quoted value.

Replace whole tokens that may contain regular expression

I want to do a startStr.replaceAll(searchStr, replaceStr) and I have two requirements.
The searchStr must be a whole word, meaning it must have a space, beginning of string or end of string character around it.
e.g.
startStr = "ON cONfirmation, put ON your hat"
searchStr = "ON"
replaceStr = ""
expected = " cONfirmation, put your hat"
The searchStr may contain a regex pattern
e.g.
startStr = "remove this * thing"
searchStr = "*"
replaceStr = ""
expected = "remove this thing"
For requirement 1, I've found that this works:
startStr.replaceAll("\\b"+searchStr+"\\b",replaceStr)
For requirement 2, I've found that this works:
startStr.replaceAll(Pattern.quote(searchStr), replaceStr)
But I can't get them to work together:
startStr.replaceAll("\\b"+Pattern.quote(searchStr)+"\\b", replaceStr)
Here is the simple test case that's failing
startStr = "remove this * thing but not this*"
searchStr = "*"
replaceStr = ""
expected = "remove this thing but not this*"
actual = "remove this * thing but not this*"
What am I missing?
Thanks in advance

First off, \b, the word boundary, is not going to work for you with the asterisk. The reason is that \b only detects the boundary between a word character and a non-word character. A regex engine does not treat * as a word character, so a search string that begins or ends with * won't be surrounded by valid word boundaries.
Reference pages:
http://www.regular-expressions.info/wordboundaries.html
http://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
An option you might like is to supply wildcard permutations in a regex:
(?<=\s|^)(ON|\*N|O\*|\*)(?=\s|$)
Here's a Java example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class RegExTest
{
    public static void main(String[] args){
        String sourcestring = "ON cONfirmation, put * your hat";
        // Remove the whole tokens, then collapse the doubled spaces they leave behind
        sourcestring = sourcestring.replaceAll("(?<=\\s|^)(ON|\\*N|O\\*|\\*)(?=\\s|$)","").replaceAll(" {2,}"," ").trim();
        System.out.println("sourcestring=["+sourcestring+"]");
    }
}
You can write a little function to generate the wildcard permutations automatically. I admit I cheated a little with the spaces, but I don't think that was a requirement anyway.
Play with it online here: http://ideone.com/7uGfIS

The pattern "\\b" matches a word boundary, with a word character on one side and a non-word character on the other. * is not a word character, so \\b\\*\\b won't work. Look-behind and look-ahead match but do not consume patterns. You can specify that the beginning of the string or whitespace must come before your pattern and that whitespace or the end of the string must follow:
startStr.replaceAll("(?<=^|\\s)"+Pattern.quote(searchStr)+"(?=\\s|$)", replaceStr)
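Wrapped in a helper, the lookaround approach can be checked against both requirements from the question. This is a sketch; the class and method names are invented, and Matcher.quoteReplacement is an added precaution so that $ or \ in the replacement string is taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeTokenReplacer {
    // Replaces search only where it is delimited by whitespace or the string
    // edges. The delimiters themselves are not consumed.
    static String replaceToken(String source, String search, String replacement) {
        Pattern p = Pattern.compile("(?<=^|\\s)" + Pattern.quote(search) + "(?=\\s|$)");
        return p.matcher(source).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println("[" + replaceToken("ON cONfirmation, put ON your hat", "ON", "") + "]");
        System.out.println("[" + replaceToken("remove this * thing but not this*", "*", "") + "]");
    }
}
```

Since the surrounding whitespace is kept, removing a token from the middle of the string leaves two adjacent spaces; collapse them afterwards if that matters.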

Try this,
For removing "ON"
StringBuilder stringBuilder = new StringBuilder();
String[] splittedValue = startStr.split(" ");
for (String value : splittedValue)
{
    if (!value.equalsIgnoreCase("ON"))
    {
        stringBuilder.append(value);
        stringBuilder.append(" ");
    }
}
System.out.println(stringBuilder.toString().trim());
For removing "*"
String startStr1 = "remove this * thing";
System.out.println(startStr1.replaceAll("\\*[\\s]", ""));

You can use (^| )\*( |$) instead of \\b.
Try this: startStr.replaceAll("(^| )yourSearchString( |$)", replaceStr);

