Iterating through String with .find() in Java regex

I'm currently trying to solve a problem from codingbat.com with regular expressions.
I'm new to this, so step-by-step explanations would be appreciated. I could solve this with String methods relatively easily, but I am trying to use regular expressions.
Here is the prompt:
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
etc
My code thus far:
String regex = ".?" + word+ ".?";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
String newStr = "";
while(m.find())
newStr += m.group().replace(word, "");
return newStr;
The problem is that when there are multiple instances of word in a row, the program misses the character preceding the word because m.find() progresses beyond it.
For example: wordEnds("abc1xyz1i1j", "1") should return "cxziij", but my method returns "cxzij", not repeating the "i"
I would appreciate a non-messy solution with an explanation I can apply to other general regex problems.

This is a one-liner solution:
String wordEnds = input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
This matches your edge case as a look ahead within a non-capturing group, then matches the usual (consuming) case.
Note that your requirements don't call for iteration; only your question title assumes it is necessary.
Note also that to be absolutely safe, you should escape all characters in word in case any of them are special "regex" characters, so if you can't guarantee that, you need to use Pattern.quote(word) instead of word.
Here's a test of the usual case and the edge case, showing it works:
public static String wordEnds(String input, String word) {
word = Pattern.quote(word); // add this line to be 100% safe
return input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
}
public static void main(String[] args) {
System.out.println(wordEnds("abcXY123XYijk", "XY"));
System.out.println(wordEnds("abc1xyz1i1j", "1"));
}
Output:
c13i
cxziij

Use a positive lookbehind and a positive lookahead, which are zero-width assertions:
(?<=(.)|^)1(?=(.)|$)
(?<=(.)|^) looks for a character just before 1 and captures it in group 1. This is a zero-width assertion: it does not move forward through the input. It is just a test, which is what lets us capture the value.
1 matches 1; you can replace it with any word.
(?=(.)|$) looks for a character after 1 and captures it in group 2.
Groups 1 and 2 then contain your values. Keep calling find() until the end of the input.
So this should be like
String s1 = "abcXY123XYiXYjk";
String s2 = java.util.regex.Pattern.quote("XY");
String s3 = "";
String r = "(?<=(.)|^)"+s2+"(?=(.)|$)";
Pattern p = Pattern.compile(r);
Matcher m = p.matcher(s1);
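while (m.find()) {
// a group is null when the word sits at the very start or end of the input
if (m.group(1) != null) s3 += m.group(1);
if (m.group(2) != null) s3 += m.group(2);
}
// s3 now contains c13iij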
works here

Use regex as follows:
Matcher m = Pattern.compile("(.|)" + Pattern.quote(b) + "(?=(.?))").matcher(a);
for (int i = 1; m.find(); c += m.group(1) + m.group(2), i++);
Check this demo.
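The one-line loop above leaves a, b, and c undeclared. A minimal self-contained sketch, assuming a is the input, b is the word, and c is the accumulated result (the class and method names are hypothetical, and it has only been checked against the examples quoted in the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordEndsDemo {
    // Hypothetical wrapper around the pattern from the answer above.
    static String wordEnds(String str, String word) {
        // (.|)     optionally consumes the char before the word (empty at the start of input)
        // (?=(.?)) peeks at the char after the word without consuming it,
        //          so it can still serve as the "before" char of the next match
        Matcher m = Pattern.compile("(.|)" + Pattern.quote(word) + "(?=(.?))").matcher(str);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // both groups always participate (possibly as empty strings), so neither is null
            out.append(m.group(1)).append(m.group(2));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wordEnds("abcXY123XYijk", "XY")); // c13i
        System.out.println(wordEnds("abc1xyz1i1j", "1"));    // cxziij
    }
}
```

Because the character after the word is only looked at, never consumed, the "i" between the two 1s is available again as the leading character of the next match.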

Related

Java regex: Replace all characters with `+` except instances of a given string

I have the following problem which states
Replace all characters in a string with + symbol except instances of the given string in the method
so for example if the string given was abc123efg and they want me to replace every character except every instance of 123 then it would become +++123+++.
I figured a regular expression is probably the best for this and I came up with this.
str.replaceAll("[^str]","+")
where str is a variable, but it's not letting me use the method without putting it in quotation marks. If I just want to replace the variable string str, how can I do that? I ran it with the string typed in manually and it worked, but can I just pass in a variable?
As of right now I believe it's looking for the string "str" and not the variable string.
Here is the output; it's right for many cases, but not for two :(
List of open test cases:
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
plusOut("abXYabcXYZ", "ab") → "ab++ab++++"
plusOut("abXYabcXYZ", "abc") → "++++abc+++"
plusOut("abXYabcXYZ", "XY") → "++XY+++XY+"
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
plusOut("--++ab", "++") → "++++++"
plusOut("aaxxxxbb", "xx") → "++xxxx++"
plusOut("123123", "3") → "++3++3"

Looks like this is the plusOut problem on CodingBat.
I had 3 solutions to this problem, and wrote a new streaming solution just for fun.
Solution 1: Loop and check
Create a StringBuilder out of the input string, and check for the word at every position. Replace the character if it doesn't match, and skip the length of the word if it is found.
public String plusOut(String str, String word) {
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
if (!str.startsWith(word, i))
out.setCharAt(i++, '+');
else
i += word.length();
}
return out.toString();
}
This is probably the expected answer for a beginner programmer, though there is an assumption that the string doesn't contain any astral plane character, which would be represented by 2 char instead of 1.
Solution 2: Replace the word with a marker, replace the rest, then restore the word
public String plusOut(String str, String word) {
return str.replaceAll(java.util.regex.Pattern.quote(word), "#").replaceAll("[^#]", "+").replaceAll("#", word);
}
Not a proper solution, since it assumes that a certain character or sequence of characters doesn't appear in the string.
Note the use of Pattern.quote to prevent the word being interpreted as regex syntax by replaceAll method.
Solution 3: Regex with \G
public String plusOut(String str, String word) {
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
}
Construct regex \G((?:word)*+)., which does more or less what solution 1 is doing:
\G makes sure the match starts from where the previous match leaves off
((?:word)*+) picks out 0 or more instances of word, if any, so that we can keep them in the replacement with $1. The key here is the possessive quantifier *+, which forces the regex to keep any instance of the word it finds. Otherwise, the regex will not work correctly when the word appears at the end of the string, as the regex backtracks to match .
. will not be part of any word, since the previous part already picks out all consecutive appearances of word and disallows backtracking. We will replace this with +
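To see why the possessive quantifier matters, here is a small comparison sketch. The class name and the extra quantifier parameter are hypothetical and exist only so the two variants can be run side by side:

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Same replacement as Solution 3; the quantifier is passed in for comparison.
    static String plusOut(String str, String word, String quantifier) {
        String w = Pattern.quote(word);
        return str.replaceAll("\\G((?:" + w + ")" + quantifier + ").", "$1+");
    }

    public static void main(String[] args) {
        // Possessive *+ refuses to give the trailing word back, so it stays intact.
        System.out.println(plusOut("12xy", "xy", "*+")); // ++xy
        // Greedy * backtracks out of the word so that '.' can match, and the word is lost.
        System.out.println(plusOut("12xy", "xy", "*"));  // ++++
    }
}
```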
Solution 4: Streaming
public String plusOut(String str, String word) {
return String.join(word,
Arrays.stream(str.split(java.util.regex.Pattern.quote(word), -1))
.map((String s) -> s.replaceAll("(?s:.)", "+"))
.collect(Collectors.toList()));
}
The idea is to split the string by word, do the replacement on the rest, and join them back with word using String.join method.
Same as above, we need Pattern.quote to avoid split interpreting the word as regex. Since split by default removes empty string at the end of the array, we need to use -1 in the second parameter to make split leave those empty strings alone.
Then we create a stream out of the array and replace the rest with strings of +. In Java 11, we can use s -> "+".repeat(s.length()) instead.
The rest is just converting the Stream to an Iterable (List in this case) and joining them for the result
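A possible Java 11+ variant of the streaming idea, as a sketch only: it assumes a non-empty word, uses String.repeat (an instance method), and joins with Collectors.joining rather than going through a List. The class name is hypothetical. Note that s.length() counts UTF-16 units, so unlike the (?s:.) replacement it emits two + for a supplementary character.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PlusOutRepeat {
    static String plusOut(String str, String word) {
        return Arrays.stream(str.split(Pattern.quote(word), -1)) // -1 keeps trailing empty pieces
                .map(s -> "+".repeat(s.length()))                // one '+' per char of each piece
                .collect(Collectors.joining(word));              // put the word back between pieces
    }

    public static void main(String[] args) {
        System.out.println(plusOut("12xy34xyabcxy", "xy")); // ++xy++xy+++xy
        System.out.println(plusOut("--++ab", "++"));        // ++++++
    }
}
```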

This is a bit trickier than you might initially think, because you don't just need to match characters but the absence of a specific phrase, and a negated character set is not enough for that. If the string is 123, you would need:
(?<=^|123)(?!123).*?(?=123|$)
https://regex101.com/r/EZWMqM/1/
That is - lookbehind for the start of the string or "123", make sure the current position is not followed by 123, then lazy-repeat any character until lookahead matches "123" or the end of the string. This will match all characters which are not in a "123" substring. Then, you need to replace each character with a +, after which you can use appendReplacement and a StringBuffer to create the result string:
String inputPhrase = "123";
String inputStr = "abc123efg123123hij";
StringBuffer resultString = new StringBuffer();
Pattern regex = Pattern.compile("(?<=^|" + inputPhrase + ")(?!" + inputPhrase + ").*?(?=" + inputPhrase + "|$)");
Matcher m = regex.matcher(inputStr);
while (m.find()) {
String replacement = m.group(0).replaceAll(".", "+");
m.appendReplacement(resultString, replacement);
}
m.appendTail(resultString);
System.out.println(resultString.toString());
Output:
+++123+++123123+++
Note that if the inputPhrase can contain characters with a special meaning in a regular expression, you'll have to escape them first (for example with Pattern.quote) before concatenating into the pattern.

You can do it in one line:
input = input.replaceAll("((?:" + str + ")+)?(?!" + str + ").((?:" + str + ")+)?", "$1+$2");
This optionally captures "123" on either side of each character and puts them back (or a blank if there is no "123").

So instead of coming up with a regular expression that matches the absence of a string, we might as well just match the selected phrase and append one + for each skipped character.
StringBuilder sb = new StringBuilder();
Matcher m = Pattern.compile(Pattern.quote(str)).matcher(input);
while (m.find()) {
for (int i = sb.length(); i < m.start(); i++) sb.append('+'); // pad only the gap since the previous match
sb.append(str);
}
int remaining = input.length() - sb.length();
for (int i = 0; i < remaining; i++) {
sb.append('+');
}

Absolutely just for the fun of it, a solution using CharBuffer (unexpectedly, it took a lot more than I initially hoped for):
private static String plusOutCharBuffer(String input, String match) {
int size = match.length();
CharBuffer cb = CharBuffer.wrap(input.toCharArray());
CharBuffer word = CharBuffer.wrap(match);
int x = 0;
for (; cb.remaining() > 0;) {
if (!cb.subSequence(0, size < cb.remaining() ? size : cb.remaining()).equals(word)) {
cb.put(x, '+');
cb.clear().position(++x);
} else {
cb.clear().position(x = x + size);
}
}
return cb.clear().toString();
}

To make this work you need a beast of a pattern. Let's say you are operating on the following test case as an example:
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
What you need to do is build a series of clauses in your pattern to match a single character at a time:
Any character that is NOT "X", "Y" or "Z" -- [^XYZ]
Any "X" not followed by "YZ" -- X(?!YZ)
Any "Y" not preceded by "X" -- (?<!X)Y
Any "Y" not followed by "Z" -- Y(?!Z)
Any "Z" not preceded by "XY" -- (?<!XY)Z
An example of this replacement can be found here: https://regex101.com/r/jK5wU3/4
Here is an example of how this might work (most certainly not optimized, but it works):
import java.util.regex.Pattern;
public class Test {
public static void plusOut(String text, String exclude) {
StringBuilder pattern = new StringBuilder("");
for (int i=0; i<exclude.length(); i++) {
Character target = exclude.charAt(i);
String prefix = (i > 0) ? exclude.substring(0, i) : "";
String postfix = (i < exclude.length() - 1) ? exclude.substring(i+1) : "";
// add the look-behind (?<!X)Y
if (!prefix.isEmpty()) {
pattern.append("(?<!").append(Pattern.quote(prefix)).append(")")
.append(Pattern.quote(target.toString())).append("|");
}
// add the look-ahead X(?!YZ)
if (!postfix.isEmpty()) {
pattern.append(Pattern.quote(target.toString()))
.append("(?!").append(Pattern.quote(postfix)).append(")|");
}
}
// add in the other character exclusion
pattern.append("[^" + Pattern.quote(exclude) + "]");
System.out.println(text.replaceAll(pattern.toString(), "+"));
}
public static void main(String [] args) {
plusOut("12xy34", "xy");
plusOut("12xy34", "1");
plusOut("12xy34xyabcxy", "xy");
plusOut("abXYabcXYZ", "ab");
plusOut("abXYabcXYZ", "abc");
plusOut("abXYabcXYZ", "XY");
plusOut("abXYxyzXYZ", "XYZ");
plusOut("--++ab", "++");
plusOut("aaxxxxbb", "xx");
plusOut("123123", "3");
}
}
UPDATE: Even this doesn't quite work because it can't deal with exclusions that are just repeated characters, like "xx". Regular expressions are most definitely not the right tool for this, but I thought it might be possible. After poking around, I'm not so sure a pattern even exists that might make this work.

The problem with your solution is that [^str] is a negated character class: str.replaceAll("[^str]","+") keeps individual characters from the set rather than the whole sequence, so it will not solve your problem.
For example, str.replaceAll("[^XYZ]","+") leaves every X, Y, and Z untouched wherever it appears, so for "abXYxyzXYZ" you get "++XY+++XYZ".
What you actually need to exclude is a sequence of characters.
You can do that with a group such as (XYZ) inside a negative lookahead, which matches a string that does not contain the character sequence: ^((?!XYZ).)*$
Check this solution for more info about this problem, but be aware that finding a single regular expression that does the whole replacement directly may be complicated.
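As a quick illustration of what that lookahead pattern does and does not do, here is a hedged sketch (the class and method names are hypothetical). It only answers whether a whole string is free of "XYZ"; it does not perform the + replacement by itself.

```java
class NoXyzDemo {
    // True when the string contains no "XYZ": each character is consumed
    // only if "XYZ" does not start at that position.
    static boolean lacksXyz(String s) {
        return s.matches("^((?!XYZ).)*$");
    }

    public static void main(String[] args) {
        System.out.println(lacksXyz("abXYxyz")); // true
        System.out.println(lacksXyz("abXYZ"));   // false
    }
}
```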
I have found two simple solutions for this problem :
Solution 1:
You can implement a method to replace all characters with '+' except the instance of given string:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
for(int i = 0; i < str.length(); i++){
// exclude any instance string of exWord from replacing process in str
if(str.substring(i, str.length()).indexOf(exWord) + i == i){
i = i + exWord.length()-1;
}
else{
str = str.substring(0,i) + "+" + str.substring(i+1);//replace each character with '+' symbol
}
}
Note: the condition str.substring(i, str.length()).indexOf(exWord) + i == i is true exactly when exWord starts at index i, so that occurrence is skipped instead of being replaced.
Output:
+++++++XYZ
Solution 2:
You can try this approach, which uses the replaceAll method and doesn't need any complex regular expression:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
str = str.replaceAll(exWord,"*"); // replace instance string with * symbol
str = str.replaceAll("[^*]","+"); // replace all characters with + symbol except *
str = str.replaceAll("\\*",exWord); // replace * symbol with instance string
Note: this solution will work only if your input string str doesn't contain any * symbol.
Also, you should escape any regex metacharacters in exWord (for example with Pattern.quote) before passing it to replaceAll; a value such as exWord = "++" is not a valid pattern on its own.

trying to find a word with seperators in string

i have a full string like this - "Hello all you guys"
and i have a bad word like "all"
now i managed to find the second string in the first that's easy,
but let's say my first string is "Hello a.l.l you guys"
or "Hello a,l,l you guys"
or even "Hello a l l you guys"
is there a regex way to find it ?
what i've got so far is
String wordtocheck =pair.getKey().toString();
String newerstr = "";
for(int i=0;i<wordtocheck.length();i++)
newerstr+=wordtocheck.charAt(i)+"\\.";
Pattern.compile("(?i)\\b(newerstr)(?=\\W)").matcher(currentText.toString());
but it doesn't do the trick
thanks to all helpers

You may build the pattern dynamically by inserting \W* (=zero or more non-word chars, that is, chars that are not letters, digits or underscore) in between the characters of a keyword to search for:
String s = "Hello a l l you guys";
String key = "all";
String pat = "(?i)\\b" + TextUtils.join("\\W*", key.split("")) + "\\b";
System.out.println("Pattern: " + pat);
Matcher m = Pattern.compile(pat).matcher(s);
if (m.find())
{
System.out.println("Found: " + m.group());
}
See the online demo (String.join is used instead of TextUtils.join since this is a Java demo)
If there can be non-word chars in the search words, you need to replace \b word boundaries with (?<!\\S) (the initial \b) and (?!\\S) (instead of the trailing \b), or remove altogether.
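For plain Java (no Android TextUtils), the same idea can be sketched as below. The class and method names are hypothetical, and quoting each character with Pattern.quote is an added assumption so that keywords containing regex metacharacters do not break the pattern:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SeparatedWordFinder {
    // Builds (?i)\b a \W* l \W* l \b for the key "all", quoting every character.
    static Pattern buildPattern(String key) {
        String body = key.chars()
                .mapToObj(c -> Pattern.quote(String.valueOf((char) c)))
                .collect(Collectors.joining("\\W*"));
        return Pattern.compile("(?i)\\b" + body + "\\b");
    }

    // Returns the first matching stretch of text, or null if the key is not found.
    static String findFirst(String text, String key) {
        Matcher m = buildPattern(key).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findFirst("Hello a l l you guys", "all")); // a l l
        System.out.println(findFirst("Hello a.l.l you guys", "all")); // a.l.l
    }
}
```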

Try this
String str="Hello .a-l l? guys";
str=str.replaceAll("\\W",""); //replaces all non-words chars with empty string.
str is now "Helloallguys"

match ;ABC12;10;250.3 using regex java

String regex = "^;[A-Z0-9]{5};[\\d]{1,};[\\d]{1,}.[\\d]{1,}";
String str = ";ABC12;10;250.3";
System.out.println(str.matches(regex));
The above regex works fine.
Consider the following strings
str1=";ABC12;10;250.3"
str2=;ABB62;5;2.3
str3=;ABF02;8;25120.3
str4=;AKC12;11;2504.303
Now i have the string as String strToMatch= str1,str2,str3,str4
How do i convert my regex expression above inorder to match the above string.
Note : There can be n number of comma separated values in the above string. And i also need to take care that the string strToMatch doesnot end with comma.

You can capture the regex with round brackets and repeat one or more times:
String regex = "^(;[A-Z0-9]{5};\\d+;\\d+\\.\\d+){1,}";
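The pattern above repeats the entry but does not consume the commas between entries. One way to validate the whole comma-separated string, sketched under the assumption that entries are separated by exactly one comma with no surrounding whitespace (the class, constant, and method names are hypothetical):

```java
import java.util.regex.Pattern;

class RecordListValidator {
    // A single entry such as ;ABC12;10;250.3
    private static final String ENTRY = ";[A-Z0-9]{5};\\d+;\\d+\\.\\d+";
    // One entry, then zero or more ",entry" repetitions.
    private static final Pattern LIST = Pattern.compile(ENTRY + "(?:," + ENTRY + ")*");

    static boolean isValid(String s) {
        // matches() must consume the whole input, so a leading or trailing comma is rejected
        return LIST.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid(";ABC12;10;250.3,;ABB62;5;2.3")); // true
        System.out.println(isValid(";ABC12;10;250.3,"));             // false
    }
}
```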

Try this pattern instead: (;[A-Z0-9]{5};[\\d]{1,};[\\d]{1,}\\.[\\d]{1,},?)+
This has two differences to your pattern: first I use \\. to denote that this has to be a . because a single dot means "any character" in regex.
Then I used the grouping brackets (...) and the + at the end to say: "Look for this once or more". As the , is optional at the end, I added a ?
If you want to get single matches to process using a Matcher later on, a simple modification should do the trick: (;[A-Z0-9]{5};[\\d]{1,};[\\d]{1,}\\.[\\d]{1,}),?
The + is gone and the ,? is outside the grouping brackets, because those are now capturing brackets (as well).
Example:
final Pattern pattern = Pattern.compile("(;[A-Z0-9]{5};[\\d]{1,};[\\d]{1,}\\.[\\d]{1,}),?");
final Matcher matcher = pattern.matcher(";ABC12;10;250.3,;ABB62;5;2.3,;ABF02;8;25120.3,;AKC12;11;2504.303");
while (matcher.find()) {
System.out.println("Whole match: " + matcher.group());
for (int i = 1; i <= matcher.groupCount(); ++i) {
System.out.println("Group #" + i + ": " + matcher.group(i));
}
}

I have found below way of solving the problem.
String strToMatch = ";ABC12;10;250.3,;ABB62;5;2.3,;ABF02;8;25120.3,;AKC12;11;2504.303";
if(strToMatch.endsWith(",") || strToMatch.startsWith(","))
return false;
else{
String[] str = strToMatch.split(",");
String regex = ";[A-Z0-9]{5};\\d+;\\d+\\.\\d+";
for (String s : str){
if(!s.matches(regex)) // reject the whole input as soon as one entry is malformed
return false;
}
return true;
}
Any simpler way than this?

String Matches, Java

I have a sort of a problem with this code:
String[] paragraph;
if(paragraph[searchKeyword_counter].matches("(.*)(\\b)"+"is"+"(\\b)(.*)")){
if i am not mistaken to use .matches() and search a particular character in a string i need a .* but what i want to happen is to search a character without matching it to another word.
For example is the keyword i am going to search I do not want it to match with words that contain is character like ship, his, this. so i used \b for boundary but the code above is not working for me.
Example:
String[] Content= {"is,","his","fish","ish","its","is"};
String keyword = "is";
for(int i=0;i<Content.length;i++){
if(content[i].matches("(.*)(\\b)"+keyword+"(\\b)(.*)")){
System.out.println("There are "+i+" is.");
}
}
What I want is for it to match only "is," and "is", but not "his" or "fish". In other words, is should match even when it sits next to a non-alphanumeric character or a space.
What is the problem with the code above?
Also, what if one of the entries is in uppercase, for example IS, and it is compared with is? It will not match. Correct me if I am wrong. How can I match a lowercase keyword against uppercase text without changing the source content?

String string = "...";
String word = "is";
Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
Matcher m = p.matcher(string);
if (m.find()) {
...
}
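Applied to the array from the question, and with the case-insensitivity part addressed through the Pattern.CASE_INSENSITIVE flag, the approach might look like this sketch (the class and method names are hypothetical, and "IS" is added to the sample data for illustration):

```java
import java.util.regex.Pattern;

class WholeWordCount {
    // Counts the items that contain the word as a whole word, ignoring case.
    static int countContaining(String[] items, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        int count = 0;
        for (String item : items) {
            // find() looks for the word anywhere in the item, so no leading or trailing .* is needed
            if (p.matcher(item).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] content = {"is,", "his", "fish", "ish", "its", "is", "IS"};
        System.out.println(countContaining(content, "is")); // 3 ("is,", "is" and "IS")
    }
}
```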

just add spaces like this:
suppose message equal your content string and pattern is your keyword
if ((message).matches(".* " + pattern + " .*")||(message).matches("^" + pattern + " .*")
||(message).matches(".* " + pattern + "$")) {

Optimizing several RegEx in Java Code

The below mentioned RegEx perform very poorly on a very large string or more than 2000 Lines. Basically the Java String is composed of PL/SQL script.
1- Replace each occurrence of delimiting character, for example ||, != or > sign with a space before and after the characters. This takes infinite time and never ends, so no time can be recorded.
// Delimiting characters for SQLPlus
private static final String[] delimiters = { "\\|\\|", "=>", ":=", "!=", "<>", "<", ">", "\\(", "\\)", "!", ",", "\\+", "-", "=", "\\*", "\\|" };
for (int i = 0; i < delimiters.length; i++) {
script = script.replaceAll(delimiters[i], " " + delimiters[i] + " ");
}
2- The following pattern looks for all occurrences of a forward slash / except the ones that are preceded by a *. That means it should not match a forward slash in block comment syntax. This takes about 103 seconds for a 2000-line string.
Pattern p = Pattern.compile("([^\\*])([\\/])([^\\*])");
Matcher m = p.matcher(script);
while (m.find()) {
script = script.replaceAll(m.group(2), " " + m.group(2) + " ");
}
3- Remove any white spaces from within date or date format
Pattern p = Pattern.compile("(?i)(\\w{1,2}) +/ +(\\w{1,2}) +/ +(\\w{2,4})");
// Create a matcher with an input string
Matcher m = p.matcher(script);
while (m.find()) {
part1 = script.substring(0, m.start());
part2 = script.substring(m.end());
script = part1 + m.group().replaceAll("[ \t]+", "") + part2;
m = p.matcher(script);
}
Is there any way to optimize all the three RegEx so that they take less time?
Thanks
Ali

I'll answer the first question.
You can combine all this into a single regex replace operation:
script = script.replaceAll("\\|\\||=>|[:!]=|<>|[<>()!,+=*|-]", " $0 ");
Explanation:
\|\| # Match ||
| # or
=> # =>
| # or
[:!]= # := or !=
| # or
<> # <>
| # or
[<>()!,+=*|-] # <, >, (, ), !, comma, +, =, *, | or -
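A small runnable sketch of that single-pass replacement (the class and method names are hypothetical). The longer operators are listed before the single-character class so that, for example, := is padded as one unit rather than as : and = separately:

```java
class DelimiterSpacing {
    static String spaceOut(String script) {
        // $0 refers to the whole match, so each operator is re-inserted with a space on both sides
        return script.replaceAll("\\|\\||=>|[:!]=|<>|[<>()!,+=*|-]", " $0 ");
    }

    public static void main(String[] args) {
        System.out.println(spaceOut("a:=b||c")); // a := b || c
        System.out.println(spaceOut("x!=y"));    // x != y
    }
}
```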

Sure. Your second approach is "almost" good. The problem is that you do not use your pattern for the replacement itself. When you use str.replaceAll(), you are actually creating a new Pattern instance every time you call the method. Pattern.compile() is called for you, and it takes 90% of the time.
You should use Matcher.replaceAll() instead.
String script = "dfgafjd;fjfd;jfd;djf;jds\\fdfdf****\\/";
String result = script;
Pattern p = Pattern.compile("[\\*\\/\\\\]"); // write all characters you want to remove here.
Matcher m = p.matcher(script);
if (m.find()) {
result = m.replaceAll("");
}
System.out.println(result);

It isn't the regexes causing your performance problem; it's the fact that you're doing many passes over the text and constantly creating new Pattern objects. And it's not just performance that suffers, as Tim pointed out; it's much too easy to mess up the results of prior passes when you do that.
In fact, I'm guessing that those extra spaces in the dates are just a side effect of your other replacements. If so, here's a way you can do all the replacements in one pass, without adding unwanted characters:
static String doReplace(String input)
{
String regex =
"/\\*[^*]*(?:\\*(?!/)[^*]*)*\\*/|" // a comment
+ "\\b\\d{2}/\\d{2}/\\d{2,4}\\b|" // a date
+ "(/|\\|\\||=>|[:!]=|<>|[<>()!,+=*|-])"; // an operator
Matcher m = Pattern.compile(regex).matcher(input);
StringBuffer sb = new StringBuffer();
while (m.find())
{
// if we found an operator, replace it
if (m.start(1) != -1)
{
m.appendReplacement(sb, " $1 ");
}
}
m.appendTail(sb);
return sb.toString();
}
see the online demo
The trick is, if you don't call appendReplacement(), the append position is not updated, so it's as if the match didn't occur. Because I ignore them, the comments and dates get reinserted along with the rest of the unmatched text, and I don't have to worry about matching the slash characters inside them.
EDIT Make sure the "comment" part of the regex comes before the "operator" part. Otherwise, the leading / of every comment will be treated as an operator.
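As an alternative sketch of the same one-pass idea, assuming Java 9 or later, Matcher.replaceAll also accepts a function from MatchResult to replacement string, which avoids the explicit StringBuffer. The class and constant names are hypothetical; the regex is the one from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OnePassReplace {
    private static final Pattern TOKEN = Pattern.compile(
            "/\\*[^*]*(?:\\*(?!/)[^*]*)*\\*/|"        // a comment
          + "\\b\\d{2}/\\d{2}/\\d{2,4}\\b|"           // a date
          + "(/|\\|\\||=>|[:!]=|<>|[<>()!,+=*|-])");  // an operator (group 1)

    static String doReplace(String input) {
        return TOKEN.matcher(input).replaceAll(mr ->
                mr.group(1) != null
                        ? " " + mr.group(1) + " "                 // pad operators with spaces
                        : Matcher.quoteReplacement(mr.group())); // keep comments and dates unchanged
    }

    public static void main(String[] args) {
        System.out.println(doReplace("a/b /* x/y */ 01/02/2020")); // a / b /* x/y */ 01/02/2020
    }
}
```

Matcher.quoteReplacement is used for the pass-through case because the returned string is still interpreted as a replacement template, so a $ or backslash inside a comment would otherwise be treated specially.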
