Java: String.replaceAll(regex, replacement); - java

I have a string of comma-separated user ids and I want to remove a specific user id from it.
These are the possible strings and the expected result:
int elimiateUserId = 11;
String css1 = "11,22,33,44,55";
String css2 = "22,33,11,44,55";
String css3 = "22,33,44,55,11";
// The expected result in all cases, after replacement, should be:
// "22,33,44,55"
I tried the following:
String result = css#.replaceAll("," + elimiateUserId, ""); // # = 1 or 2 or 3
result = css#.replaceAll(elimiateUserId + "," , "");
This logic fails in the case of css3. Please suggest a proper solution for this issue.
Note: I'm working with Java 7
I checked around the following posts, but could not find any solution:
Java String.replaceAll regex
java String.replaceAll regex question
Java 1.3 String.replaceAll() , replacement

You can use the Stream API in Java 8:
int elimiateUserId = 11;
String css1 = "11,22,33,44,55";
String css1Result = Stream.of(css1.split(","))
.filter(value -> !String.valueOf(elimiateUserId).equals(value))
.collect(Collectors.joining(","));
// css1Result = 22,33,44,55

If you want to use regex, you may use (remember to properly escape as java string literal)
,\b11\b|\b11\b,
This will ensure that 11 won't be matched as part of another number due to the word boundaries and only one comma (if two are present) is matched and removed.
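A minimal sketch of how that pattern might be used from Java, with the backslashes doubled for the string literal. The class and method names (RemoveIdWithBoundaries, removeId) are made up for illustration. Note that a list consisting of only the id itself has no comma to match, so it would come back unchanged.

```java
class RemoveIdWithBoundaries {
    // Removes every occurrence of id from a comma-separated list,
    // together with exactly one adjacent comma.
    static String removeId(String csv, int id) {
        // Regex ,\b11\b|\b11\b, written as a Java string literal
        String pattern = ",\\b" + id + "\\b|\\b" + id + "\\b,";
        return csv.replaceAll(pattern, "");
    }

    public static void main(String[] args) {
        System.out.println(removeId("11,22,33,44,55", 11)); // 22,33,44,55
        System.out.println(removeId("22,33,11,44,55", 11)); // 22,33,44,55
        System.out.println(removeId("22,33,44,55,11", 11)); // 22,33,44,55
        System.out.println(removeId("111,211,11", 11));     // 111,211
    }
}
```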

You may build a regex like
^11,|,11\b
that will match 11, at the start of a string (^11,) or (|) ,11 not followed with any other word char (,11\b).
See the regex demo.
int elimiate_user_id = 11;
String pattern = "^" + elimiate_user_id + ",|," + elimiate_user_id + "\\b";
System.out.println("11,22,33,44,55,111".replaceAll(pattern, "")); // => 22,33,44,55,111
System.out.println("22,33,11,44,55,111".replaceAll(pattern, "")); // => 22,33,44,55,111
System.out.println("22,33,44,55,111,11".replaceAll(pattern, "")); // => 22,33,44,55,111
See the Java demo

Try passing the expression (^(11)(?:,))|((?<=,)(11)(?:,))|(,11$) to replaceAll:
final String regexp = MessageFormat.format("(^({0})(?:,))|((?<=,)({0})(?:,))|(,{0}$)", String.valueOf(elimiateUserId)); // String.valueOf avoids locale digit grouping (e.g. 1,234) for large ids
String result = css#.replaceAll(regexp, ""); // for all cases
Here is an example:
https://regex101.com/r/LwJgRu/3

try this:
String result = css#.replaceAll("," + elimiateUserId, "")
.replaceAll(elimiateUserId + "," , "");

You can chain two replace calls in one statement, like:
int elimiateUserId = 11;
String result = css#.replace("," + elimiateUserId , "").replace(elimiateUserId + ",", "");
If your string contains ,11 the first replace will replace it with an empty string
If your string contains 11, the second replace will replace it with an empty string
result
11,22,33,44,55 -> 22,33,44,55
22,33,11,44,55 -> 22,33,44,55
22,33,44,55,11 -> 22,33,44,55
ideone demo

String result = css#.replaceAll("," + eliminate_user_id + "\\b|\\b" + eliminate_user_id + ",", "");
The regular expression here is:
, A leading comma.
eliminate_user_id I assumed the missing 'n' here was a typo.
\b Word boundary: word/number characters end here.
| OR
\b Word boundary: word/number characters begin here.
eliminate_user_id again.
, A trailing comma.
The word boundary marker, matching the beginning or end of a "word", is the magic here. It means that the 11 will match in these strings:
11,22,33,44,55
22,33,11,44,55
22,33,44,55,11
But not these strings:
111,112,113,114
411,311,211,111
There's a cleaner way, though:
String result = css#.replaceAll("(,?)\\b" + eliminate_user_id + "\\b(?(1)|,)", "");
The regular expression here is:
( A capturing group - what's in here, is in group 1.
,? An optional leading comma.
) End the capturing group.
\b Word boundary: word/number characters begin here.
eliminate_user_id I assumed the missing 'n' here was a typo.
\b Word boundary: word/number characters end here.
(?(1) If there's something in group 1, then require...
| ...nothing, but if there was nothing, then require...
, A trailing comma.
) end the if.
The "if" part here is a little unusual - you can find a little more information on regex conditionals here: http://www.regular-expressions.info/conditional.html
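Java's regex engine (java.util.regex) does not support regex conditionals, as some posts here (Conditional Regular Expression in Java?) confirm, so this second pattern is only usable in engines such as PCRE or .NET. In Java, stick with the first version :(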
Side-note: for performance, if the list is VERY long and there are VERY many removals to be performed, the most obvious option is to just run the above line for each number to be removed:
String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212,...";
String[] removals = {"11", "33", "55", "77", "99", "1212"};
for (int i = 0; i < removals.length; i++) {
css = css.replaceAll("," + removals[i] + "\\b|\\b" + removals[i] + ",", "");
}
(code not tested: don't have access to a Java compiler here)
This will be fast enough (worst case scales with about O(m*n) for m removals from a string of n ids), but we can maybe do better.
One is to build the regex to be \b(11|42|18|13|123|...etc)\b - that is, make the regex search for all ids to be removed at the same time. In theory this scales a little worse, scaling with O(m*n) in every case rather than just the worst case, but in practice should be considerably faster.
String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212,...";
String[] removals = {"11", "33", "55", "77", "99", "1212"};
String removalsStr = String.join("|", removals); // String.join needs Java 8+
// group the alternatives so the comma and \b apply to every id, not just the first/last
css = css.replaceAll(",(?:" + removalsStr + ")\\b|\\b(?:" + removalsStr + "),", "");
But another approach might be to build a hashtable of the ids in the long string, then remove all the ids from the hashtable, then concatenate the remaining hashtable keys back into a string. Since hashtable lookups are effectively O(1) for sparse hashtables, that makes this scale with O(n). The tradeoff here is the extra memory for that hashtable, though.
(I don't think I can do this version without a java compiler handy. I would not recommend this approach unless you have a VAST (many thousands) list of IDs to remove, anyway, as it will be much uglier and more complex code).
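For what it's worth, a rough sketch of the hash-based idea could look like the following. It is a slight variation on what is described above: instead of hashing the ids in the long string, it puts the ids to remove into a HashSet and scans the list once, which also keeps the original order. The class and method names are invented for the example, and it only uses Java 7 features.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class BulkIdRemoval {
    // Keeps every id that is not in the removal set; each lookup is O(1) on average.
    static String removeAll(String csv, Set<String> removals) {
        StringBuilder out = new StringBuilder();
        for (String id : csv.split(",")) {
            if (removals.contains(id)) {
                continue; // skip ids that should be removed
            }
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(id);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> removals = new HashSet<String>(
                Arrays.asList("11", "33", "55", "77", "99", "1212"));
        String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212";
        System.out.println(removeAll(css, removals)); // 22,44,66,88,1010,1111
    }
}
```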

I think it's safer to maintain a whitelist and then use it as a reference to make further changes.
List<String> whitelist = Arrays.asList("22", "33", "44", "55");
String s = "22,33,44,55,11";
String[] sArr = s.split(",");
StringBuilder ids = new StringBuilder();
for (String id : sArr) {
if (whitelist.contains(id)) {
ids.append(id).append(", ");
}
}
String r = ids.substring(0, ids.length() - 2);
System.out.println(r);

If you need a solution with Regex, then the following works perfectly.
int elimiate_user_id = 11;
String css1 = "11,22,33,44,55";
String css2 = "22,33,11,44,55";
String css3 = "22,33,44,55,11";
String resultCss=css1.replaceAll(elimiate_user_id+"[,]*", "").replaceAll(",$", "");
It works with the three inputs above, but note that it would also strip 11 from inside longer ids such as 111 or 211.

This should work
replaceAll("(11,|,11)", "")
At least as long as you can guarantee that there is no 311, or ,113 or the like

Related

How to split a string and save the 2 characters that I split with?

I am trying to split a given string using the Java split method. The string should be divided at two different characters (+ and -), and I want to keep those characters in the array as well, at the same index as the substring they precede.
for example :
input : String s = "4x^2+3x-2"
output :
arr[0] = 4x^2
arr[1] = +3x
arr[2] = -2
I know how to get the + or - characters in a different index between the numbers but it is not helping me,
any suggestions please?
You can approach this problem in many ways. I'm sure there are clever and fancy ways to split this expression. I will show you the simplest problem-solving process that can help you.
State the problem you need to solve, the input and output
Problem: Split a math expression into subexpressions at + and - signs
Input: 4x^2+3x-2
Output: 4x^2,+3x,-2
Create a pseudo code with some logic you might think works
Given an expression string
Create an empty list of expressions
Create a subExpression string
For each character in the expression
Check if the character is + or - then
add the subExpression to the list and create a new empty subExpression
otherwise, append the character to the subExpression
In the end, add the remaining subExpression to the list
Implement the pseudo-code in the programming language of your choice
String expression = "4x^2+3x-2";
List<String> expressions = new ArrayList<>();
StringBuilder subExpression = new StringBuilder();
for (int i = 0; i < expression.length(); i++) {
char character = expression.charAt(i);
if (character == '-' || character == '+') {
expressions.add(subExpression.toString());
subExpression = new StringBuilder(String.valueOf(character));
} else {
subExpression.append(String.valueOf(character));
}
}
expressions.add(subExpression.toString());
System.out.println(expressions);
Output
[4x^2, +3x, -2]
You will end with one algorithm that works for your problem. You can start to improve it.
Try this code:
String s = "4x^2+3x-2";
s = s.replace("+", "#+");
s = s.replace("-", "#-");
String[] ss = s.split("#");
for (int i = 0; i < ss.length; i++) {
Log.e("XOP",ss[i]);
}
This code replaces + and - with #+ and #- respectively and then splits the string with #. That way the + and - operators are not lost in the result.
If you require # as input character then you can use any other Unicode character instead of #.
Try this one:
String s = "4x^2+3x-2";
String[] arr = s.split("[\\+-]");
for(int i=0;i<arr.length;i++){
System.out.println(arr[i]);
}
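The split above drops the + and - characters themselves, which is not what the question asks for. One possible variant is to split on a zero-width lookahead, so the split happens in front of each sign and the sign stays attached to the term that follows it. The class and method names below are hypothetical, and the sketch assumes Java 8 or later.

```java
import java.util.Arrays;

class SplitKeepingSigns {
    // Splits before every '+' or '-' without consuming the sign itself.
    static String[] splitTerms(String expression) {
        return expression.split("(?=[+-])");
    }

    public static void main(String[] args) {
        String[] arr = splitTerms("4x^2+3x-2");
        System.out.println(Arrays.toString(arr)); // [4x^2, +3x, -2]
    }
}
```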
Personally I like it better to have positive matches of patterns, especially if the split pattern itself is empty.
So for instance you could use a Pattern and Matcher like this:
Pattern p = Pattern.compile("(^|[+-])([^+-]*)");
Matcher m = p.matcher("4x^2+3x-2");
while (m.find()) {
System.out.printf("%s or %s %s%n", m.group(), m.group(1), m.group(2));
}
This matches the start of the string or a plus or minus: ^|[+-], followed by any amount of characters that are not a plus or minus: [^+-]*.
Do note that the ^ first matches the start of the string, and is then used to negate a character class when used between brackets. Regular expressions are tricky like that.
Bonus: you can also use the two groups (within the parenthesis in the pattern) to match the operators - if any.
All this is presuming that you want to use/test regular expressions; generally things like this require a parser rather than a regular expression.
A one-liner for persons thinking that this is too complex:
var expressions = Pattern.compile("(?:^|[+-])[^+-]*") // group the alternation so ^ does not match an empty string on its own
.matcher("4x^2+3x-2")
.results()
.map(r -> r.group())
.collect(Collectors.toList());

Java regex: Replace all characters with `+` except instances of a given string

I have the following problem which states
Replace all characters in a string with + symbol except instances of the given string in the method
so for example if the string given was abc123efg and they want me to replace every character except every instance of 123 then it would become +++123+++.
I figured a regular expression is probably the best for this and I came up with this.
str.replaceAll("[^str]","+")
where str is a variable, but it's not letting me use the method without putting it in quotations. If I just want to replace the variable string str how can I do that? I ran it with the string manually typed and it worked on the method, but can I just input a variable?
As of right now I believe it's looking for the string "str" and not the variable string.
Here is the output; it's right for so many cases except for two :(
List of open test cases:
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
plusOut("abXYabcXYZ", "ab") → "ab++ab++++"
plusOut("abXYabcXYZ", "abc") → "++++abc+++"
plusOut("abXYabcXYZ", "XY") → "++XY+++XY+"
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
plusOut("--++ab", "++") → "++++++"
plusOut("aaxxxxbb", "xx") → "++xxxx++"
plusOut("123123", "3") → "++3++3"
Looks like this is the plusOut problem on CodingBat.
I had 3 solutions to this problem, and wrote a new streaming solution just for fun.
Solution 1: Loop and check
Create a StringBuilder out of the input string, and check for the word at every position. Replace the character if doesn't match, and skip the length of the word if found.
public String plusOut(String str, String word) {
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
if (!str.startsWith(word, i))
out.setCharAt(i++, '+');
else
i += word.length();
}
return out.toString();
}
This is probably the expected answer for a beginner programmer, though there is an assumption that the string doesn't contain any astral plane character, which would be represented by 2 char instead of 1.
Solution 2: Replace the word with a marker, replace the rest, then restore the word
public String plusOut(String str, String word) {
return str.replaceAll(java.util.regex.Pattern.quote(word), "#").replaceAll("[^#]", "+").replaceAll("#", word);
}
Not a proper solution since it assumes that a certain character or sequence of character doesn't appear in the string.
Note the use of Pattern.quote to prevent the word being interpreted as regex syntax by replaceAll method.
Solution 3: Regex with \G
public String plusOut(String str, String word) {
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
}
Construct regex \G((?:word)*+)., which does more or less what solution 1 is doing:
\G makes sure the match starts from where the previous match leaves off
((?:word)*+) picks out 0 or more instances of word - if any, so that we can keep them in the replacement with $1. The key here is the possessive quantifier *+, which forces the regex to keep any instance of the word it finds. Otherwise, the regex will not work correctly when the word appears at the end of the string, as the regex backtracks to match .
. will not be part of any word, since the previous part already picks out all consecutive appearances of word and disallows backtracking. We will replace this with +
Solution 4: Streaming
public String plusOut(String str, String word) {
return String.join(word,
Arrays.stream(str.split(java.util.regex.Pattern.quote(word), -1))
.map((String s) -> s.replaceAll("(?s:.)", "+"))
.collect(Collectors.toList()));
}
The idea is to split the string by word, do the replacement on the rest, and join them back with word using String.join method.
Same as above, we need Pattern.quote to avoid split interpreting the word as regex. Since split by default removes empty string at the end of the array, we need to use -1 in the second parameter to make split leave those empty strings alone.
Then we create a stream out of the array and replace the rest as strings of +. In Java 11, we can use s -> "+".repeat(s.length()) instead.
The rest is just converting the Stream to an Iterable (List in this case) and joining them for the result
This is a bit trickier than you might initially think because you don't just need to match characters, but the absence of specific phrase - a negated character set is not enough. If the string is 123, you would need:
(?<=^|123)(?!123).*?(?=123|$)
https://regex101.com/r/EZWMqM/1/
That is - lookbehind for the start of the string or "123", make sure the current position is not followed by 123, then lazy-repeat any character until lookahead matches "123" or the end of the string. This will match all characters which are not in a "123" substring. Then, you need to replace each character with a +, after which you can use appendReplacement and a StringBuffer to create the result string:
String inputPhrase = "123";
String inputStr = "abc123efg123123hij";
StringBuffer resultString = new StringBuffer();
Pattern regex = Pattern.compile("(?<=^|" + inputPhrase + ")(?!" + inputPhrase + ").*?(?=" + inputPhrase + "|$)");
Matcher m = regex.matcher(inputStr);
while (m.find()) {
String replacement = m.group(0).replaceAll(".", "+");
m.appendReplacement(resultString, replacement);
}
m.appendTail(resultString);
System.out.println(resultString.toString());
Output:
+++123+++123123+++
Note that if the inputPhrase can contain character with a special meaning in a regular expression, you'll have to escape them first before concatenating into the pattern.
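As a sketch of that escaping step, Pattern.quote can wrap the phrase before it is concatenated into the pattern. The class and method names here are made up for the example. The pattern is otherwise the same as above, so it inherits its behaviour, including for phrases that can overlap themselves (such as xx inside xxxx), which this sketch does not try to handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPlusOut {
    static String plusOut(String input, String phrase) {
        // Pattern.quote makes characters like '+' or '.' match literally
        String quoted = Pattern.quote(phrase);
        Pattern regex = Pattern.compile(
                "(?<=^|" + quoted + ")(?!" + quoted + ").*?(?=" + quoted + "|$)");
        Matcher m = regex.matcher(input);
        StringBuffer result = new StringBuffer();
        while (m.find()) {
            // turn every character of the non-phrase run into '+'
            String replacement = m.group().replaceAll(".", "+");
            m.appendReplacement(result, replacement);
        }
        m.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abc123efg123123hij", "123")); // +++123+++123123+++
        System.out.println(plusOut("--++ab", "++"));              // ++++++
    }
}
```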
You can do it in one line:
input = input.replaceAll("((?:" + str + ")+)?(?!" + str + ").((?:" + str + ")+)?", "$1+$2");
This optionally captures "123" either side of each character and puts them back (a blank if there's no "123").
So instead of coming up with a regular expression that matches the absence of a string, we might as well just match the selected phrase and append one + for each skipped character.
StringBuilder sb = new StringBuilder();
Matcher m = Pattern.compile(Pattern.quote(str)).matcher(input);
while (m.find()) {
for (int i = sb.length(); i < m.start(); i++) sb.append('+'); // pad only the gap since the previous match
sb.append(str);
}
int remaining = input.length() - sb.length();
for (int i = 0; i < remaining; i++) {
sb.append('+');
}
Absolutely just for the fun of it, a solution using CharBuffer (unexpectedly it took a lot more than I initially hoped for):
private static String plusOutCharBuffer(String input, String match) {
int size = match.length();
CharBuffer cb = CharBuffer.wrap(input.toCharArray());
CharBuffer word = CharBuffer.wrap(match);
int x = 0;
for (; cb.remaining() > 0;) {
if (!cb.subSequence(0, size < cb.remaining() ? size : cb.remaining()).equals(word)) {
cb.put(x, '+');
cb.clear().position(++x);
} else {
cb.clear().position(x = x + size);
}
}
return cb.clear().toString();
}
To make this work you need a beast of a pattern. Let's say you are operating on the following test case as an example:
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
What you need to do is build a series of clauses in your pattern to match a single character at a time:
Any character that is NOT "X", "Y" or "Z" -- [^XYZ]
Any "X" not followed by "YZ" -- X(?!YZ)
Any "Y" not preceded by "X" -- (?<!X)Y
Any "Y" not followed by "Z" -- Y(?!Z)
Any "Z" not preceded by "XY" -- (?<!XY)Z
An example of this replacement can be found here: https://regex101.com/r/jK5wU3/4
Here is an example of how this might work (most certainly not optimized, but it works):
import java.util.regex.Pattern;
public class Test {
public static void plusOut(String text, String exclude) {
StringBuilder pattern = new StringBuilder("");
for (int i=0; i<exclude.length(); i++) {
Character target = exclude.charAt(i);
String prefix = (i > 0) ? exclude.substring(0, i) : "";
String postfix = (i < exclude.length() - 1) ? exclude.substring(i+1) : "";
// add the look-behind (?<!X)Y
if (!prefix.isEmpty()) {
pattern.append("(?<!").append(Pattern.quote(prefix)).append(")")
.append(Pattern.quote(target.toString())).append("|");
}
// add the look-ahead X(?!YZ)
if (!postfix.isEmpty()) {
pattern.append(Pattern.quote(target.toString()))
.append("(?!").append(Pattern.quote(postfix)).append(")|");
}
}
// add in the other character exclusion
pattern.append("[^" + Pattern.quote(exclude) + "]");
System.out.println(text.replaceAll(pattern.toString(), "+"));
}
public static void main(String [] args) {
plusOut("12xy34", "xy");
plusOut("12xy34", "1");
plusOut("12xy34xyabcxy", "xy");
plusOut("abXYabcXYZ", "ab");
plusOut("abXYabcXYZ", "abc");
plusOut("abXYabcXYZ", "XY");
plusOut("abXYxyzXYZ", "XYZ");
plusOut("--++ab", "++");
plusOut("aaxxxxbb", "xx");
plusOut("123123", "3");
}
}
UPDATE: Even this doesn't quite work because it can't deal with exclusions that are just repeated characters, like "xx". Regular expressions are most definitely not the right tool for this, but I thought it might be possible. After poking around, I'm not so sure a pattern even exists that might make this work.
The problem with your attempt, str.replaceAll("[^str]","+"), is that [^...] is a negated character class: it excludes each individual character listed in the brackets, not the sequence as a whole, so it will not solve your problem.
EX: str.replaceAll("[^XYZ]","+") keeps every X, Y and Z wherever it appears, in any combination, so for "abXYxyzXYZ" you will get "++XY+++XYZ".
What you actually need is to exclude a sequence of characters in str.replaceAll.
You can do that by grouping the sequence, as in (XYZ), and using a negative lookahead to match a string that does not contain the sequence: ^((?!XYZ).)*$
Check this solution for more info about this problem, but be aware that finding a regular expression that does this directly may be complicated.
I have found two simple solutions for this problem :
Solution 1:
You can implement a method to replace all characters with '+' except the instance of given string:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
for(int i = 0; i < str.length(); i++){
// exclude any instance string of exWord from replacing process in str
if(str.substring(i, str.length()).indexOf(exWord) + i == i){
i = i + exWord.length()-1;
}
else{
str = str.substring(0,i) + "+" + str.substring(i+1);//replace each character with '+' symbol
}
}
Note : the condition str.substring(i, str.length()).indexOf(exWord) + i == i is true when an instance of exWord starts at position i, so that instance is skipped instead of being replaced.
Output:
+++++++XYZ
Solution 2:
You can try this approach using the replaceAll method, which doesn't need any complex regular expression:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
str = str.replaceAll(exWord,"*"); // replace instance string with * symbol
str = str.replaceAll("[^*]","+"); // replace all characters with + symbol except *
str = str.replaceAll("\\*",exWord); // replace * symbol with instance string
Note : This solution will work only if your input string str doesn't contain any * symbol.
Also, you should escape any character with a special meaning in a regular expression in the instance string exWord, for example when exWord = "++".

String split regex [duplicate]

I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.
Example input:
This is a string that "will be" highlighted when your 'regular expression' matches something.
Desired output:
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.
Note that "will be" and 'regular expression' retain the space between the words.
I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:
[^\s"']+|"([^"]*)"|'([^']*)'
I added the capturing groups because you don't want the quotes in the list.
This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
if (regexMatcher.group(1) != null) {
// Add double-quoted string without the quotes
matchList.add(regexMatcher.group(1));
} else if (regexMatcher.group(2) != null) {
// Add single-quoted string without the quotes
matchList.add(regexMatcher.group(2));
} else {
// Add unquoted word
matchList.add(regexMatcher.group());
}
}
If you don't mind having the quotes in the returned list, you can use much simpler code:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:
parsings strings: extracting words and phrases
Best way to parse Space Separated Text
UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?
m/('.*?'|".*?"|\S+)/g
Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).
This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.
Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)
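For comparison, a rough Java translation of that Perl one-liner might look like the sketch below. The class and method names are illustrative only, and the outer capturing group is dropped because the whole match is used. As in the Perl version, the quote characters stay in the results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokenMatcher {
    // Same alternation as the Perl regex: single-quoted, double-quoted, or a run of non-space
    private static final Pattern TOKEN = Pattern.compile("'.*?'|\".*?\"|\\S+");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "This is a string that \"will be\" highlighted when your "
                + "'regular expression' matches something.";
        for (String token : tokens(s)) {
            System.out.println(token);
        }
    }
}
```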
If you want to allow escaped quotes inside the string, you can use something like this:
(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
Quoted strings will be group 2, single unquoted words will be group 3.
You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/
The regex from Jan Goyvaerts is the best solution I have found so far, but it also creates empty (null) matches, which he excludes in his program. These empty matches also appear in regex testers (e.g. rubular.com).
If you turn the searches around (first look for the quoted parts and then the space-separated words), you might do it in one pass with:
("[^"]*"|'[^']*'|[\S]+)+
(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s
This will match the spaces not surrounded by double quotes.
I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.
It'll probably be easier to search the string, grabbing each part, vs. split it.
Reason being, you can have it split at the spaces before and after "will be". But I can't think of any way to tell a split to ignore the space in between.
(not actual Java)
string = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
regex = "\"(\\\"|(?!\\\").)+\"|[^ ]+"; // search for a quoted or non-spaced group
final = new Array();
while (string.length > 0) {
string = string.trim();
if (Regex(regex).test(string)) {
final.push(Regex(regex).match(string)[0]);
string = string.replace(regex, ""); // progress to next "word"
}
}
Also, capturing single quotes could lead to issues:
"Foo's Bar 'n Grill"
//=>
"Foo"
"s Bar "
"n"
"Grill"
String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);
for (int i = 0; i < len; i++)
{
m.region(i, len);
if (m.lookingAt())
{
String s = m.group(1);
if ((s.startsWith("\"") && s.endsWith("\"")) ||
(s.startsWith("'") && s.endsWith("'")))
{
s = s.substring(1, s.length() - 1);
}
System.out.println(i + ": \"" + s + "\"");
i += (m.group(0).length() - 1);
}
}
which produces the following output:
0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."
I liked Marcus's approach, however, I modified it so that I could allow text near the quotes, and support both " and ' quote characters. For example, I needed a="some value" to not split it into [a=, "some value"].
(?<!\\G\\S{0,99999}[\"'].{0,99999})\\s|(?<=\\G\\S{0,99999}\".{0,99999}\"\\S{0,99999})\\s|(?<=\\G\\S{0,99999}'.{0,99999}'\\S{0,99999})\\s"
Jan's approach is great but here's another one for the record.
If you actually wanted to split as mentioned in the title, keeping the quotes in "will be" and 'regular expression', then you could use this method which is straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc
The regex:
'[^']*'|\"[^\"]*\"|( )
The two left alternations match complete 'quoted strings' and "double-quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expressions on the left. We replace those with SplitHere then split on SplitHere. Again, this is for a true split case where you want "will be", not will be.
Here is a full working implementation (see the results on the online demo).
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
Pattern regex = Pattern.compile("\'[^']*'|\"[^\"]*\"|( )");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "SplitHere");
else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits) System.out.println(split);
} // end main
} // end Program
If you are using C#, you can use
string input= "This is a string that \"will be\" highlighted when your 'regular expression' matches <something random>";
List<string> list1 =
Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""|'(?<match>[\w\s]*)'|<(?<match>[\w\s]*)>").Cast<Match>().Select(m => m.Groups["match"].Value).ToList();
foreach(var v in list1)
Console.WriteLine(v);
I have specifically added "|<(?<match>[\w\s]*)>" to highlight that you can specify any char to group phrases. (In this case I am using < > to group.)
Output is :
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something random
1st one-liner using String.split()
String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
String[] split = s.split( "(?<!(\"|').{0,255}) | (?!.*\\1.*)" );
[This, is, a, string, that, "will be", highlighted, when, your, 'regular expression', matches, something.]
don't split at the blank, if the blank is surrounded by single or double quotes
split at the blank when the 255 characters to the left and all characters to the right of the blank are neither single nor double quotes
adapted from original post (handles only double quotes)
I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
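A minimal sketch of such a hand-written function is below, assuming that a quote character always opens a quoted section (so an apostrophe as in Foo's would start one) and that an unterminated quote simply runs to the end of the input. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareTokenizer {
    // Splits on whitespace, except inside '...' or "..."; the quotes themselves are dropped.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0;          // 0 means we are not inside a quoted section
        boolean inToken = false; // true once the current token has started
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;             // closing quote
                } else {
                    current.append(c);     // spaces inside quotes are kept
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                 // opening quote
                inToken = true;
            } else if (Character.isWhitespace(c)) {
                if (inToken) {
                    tokens.add(current.toString());
                    current.setLength(0);
                    inToken = false;
                }
            } else {
                current.append(c);
                inToken = true;
            }
        }
        if (inToken) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
                "This is a string that \"will be\" highlighted when your "
                + "'regular expression' matches something."));
    }
}
```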
A couple hopefully helpful tweaks on Jan's accepted answer:
(['"])((?:\\\1|.)+?)\1|([^\s"']+)
Allows escaped quotes within quoted strings
Avoids repeating the pattern for the single and double quote; this also simplifies adding more quoting symbols if needed (at the expense of one more capturing group)
You can also try this:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something";
String ss[] = str.split("\"|\'");
for (int i = 0; i < ss.length; i++) {
if ((i % 2) == 0) {//even
String[] part1 = ss[i].split(" ");
for (String pp1 : part1) {
System.out.println("" + pp1);
}
} else {//odd
System.out.println("" + ss[i]);
}
}
The following returns an array of arguments. Arguments are the variable 'command' split on spaces, unless included in single or double quotes. The matches are then modified to remove the single and double quotes.
using System.Linq;
using System.Text.RegularExpressions;
var args = Regex.Matches(command, "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'")
    .Cast<Match>()
    .Select(iMatch => iMatch.Value.Replace("\"", "").Replace("'", ""))
    .ToArray();
When you come across a string like this:
String str = "2022-11-10 08:35:00,470 RAV=REQ YIP=02.8.5.1 CMID=caonaustr CMN=\"Some Value Pyt Ltd\"";
//this helped
String[] str1= str.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println("Value of split string is "+ Arrays.toString(str1));
This results in: [2022-11-10, 08:35:00,470, RAV=REQ, YIP=02.8.5.1, CMID=caonaustr, CMN="Some Value Pyt Ltd"]
This regex matches a whitespace character only if it is followed by an even number of double quotes, i.e. only if it lies outside a quoted value.

Removing a substring between two characters (java)

I have a java string such as this:
String string = "I <strong>really</strong> want to get rid of the strong-tags!";
And I want to remove the tags. I have some other strings where the tags are way longer, so I'd like to find a way to remove everything between "<>" characters, including those characters.
One way would be to use the built-in string method that compares the string to a regEx, but I have no idea how to write those.
Caution is advised when using regex to parse HTML (due to its allowable complexity); however, for "simple" HTML and simple text (text without a literal < or > in it) this will work:
String stripped = html.replaceAll("<.*?>", "");
To avoid Regex:
String toRemove = StringUtils.substringBetween(string, "<", ">");
String result = StringUtils.remove(string, "<" + toRemove + ">");
For multiple instances:
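String[] allToRemove = StringUtils.substringsBetween(string, "<", ">");
String result = string;
if (allToRemove != null) { // substringsBetween returns null when nothing matches
    for (String toRemove : allToRemove) {
        result = StringUtils.remove(result, "<" + toRemove + ">");
    }
}
The Apache Commons Lang StringUtils methods are null-safe and tolerate empty input, but substringsBetween returns null when there is no match, hence the guard above.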
You should use
String stripped = html.replaceAll("<[^>]*>", "");
String stripped = html.replaceAll("<[^<>]*>", "");
where <[^>]*> matches substrings starting with <, then zero or more chars other than > (or the chars other than < and > if you choose the second version) and then a > char.
Note that <.*?>
is less efficient than a negated character class (see Which would be better non-greedy regex or negated character class?)
does not find substrings spanning across multiple lines (see How do I match any character across multiple lines in a regular expression?), but it can be solved with (?s)<.*?>, <(?s:.)*?>, <[\w\W]*?>, and many other not-so-efficient variations.
See the regex demo.
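As a rough illustration of the difference described above, here is a small Java sketch. The class name TagStripper and the sample strings are assumptions made for the example. It compiles the negated-character-class pattern once and shows that it also removes a tag that spans a line break, which plain <.*?> leaves in place.

```java
import java.util.regex.Pattern;

class TagStripper {
    // "<", then any run of characters other than "<" or ">", then ">"
    private static final Pattern TAG = Pattern.compile("<[^<>]*>");

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String oneLine = "I <strong>really</strong> want to get rid of the strong-tags!";
        String multiLine = "A <a\n href=\"x\">link</a> here";

        System.out.println(stripTags(oneLine));
        System.out.println(stripTags(multiLine));
        // For comparison: "." does not match a line break by default,
        // so the opening tag that spans two lines is not removed here.
        System.out.println(multiLine.replaceAll("<.*?>", ""));
    }
}
```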

Optimizing several RegEx in Java Code

The regexes below perform very poorly on a very large string (more than 2000 lines). The Java String contains a PL/SQL script.
1- Replace each occurrence of a delimiter, for example ||, != or >, with the same delimiter padded by a space on each side. This never finishes, so I could not record a time.
// Delimiting characters for SQLPlus
private static final String[] delimiters = { "\\|\\|", "=>", ":=", "!=", "<>", "<", ">", "\\(", "\\)", "!", ",", "\\+", "-", "=", "\\*", "\\|" };
for (int i = 0; i < delimiters.length; i++) {
script = script.replaceAll(delimiters[i], " " + delimiters[i] + " ");
}
2- The following pattern looks for all occurrences of a forward slash / except the ones that are preceded by a *. That means it should not match the forward slashes that belong to block-comment syntax. This takes about 103 seconds for a 2000-line string.
Pattern p = Pattern.compile("([^\\*])([\\/])([^\\*])");
Matcher m = p.matcher(script);
while (m.find()) {
script = script.replaceAll(m.group(2), " " + m.group(2) + " ");
}
3- Remove any white spaces from within date or date format
Pattern p = Pattern.compile("(?i)(\\w{1,2}) +/ +(\\w{1,2}) +/ +(\\w{2,4})");
// Create a matcher with an input string
Matcher m = p.matcher(script);
while (m.find()) {
part1 = script.substring(0, m.start());
part2 = script.substring(m.end());
script = part1 + m.group().replaceAll("[ \t]+", "") + part2;
m = p.matcher(script);
}
Is there any way to optimize all the three RegEx so that they take less time?
Thanks
Ali
I'll answer the first question.
You can combine all this into a single regex replace operation:
script = script.replaceAll("\\|\\||=>|[:!]=|<>|[<>()!,+=*|-]", " $0 ");
Explanation:
\|\| # Match ||
| # or
=> # =>
| # or
[:!]= # := or !=
| # or
<> # <>
| # or
[<>()!,+=*|-] # <, >, (, ), !, comma, +, =, *, | or -
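A short sketch of the single-pass replacement in use, with the pattern compiled once and reused. The class name DelimiterSpacer, the method padDelimiters and the sample inputs are assumptions for illustration only.

```java
import java.util.regex.Pattern;

class DelimiterSpacer {
    // Longer delimiters come first so "||" is not matched as two separate "|".
    private static final Pattern DELIMITER =
            Pattern.compile("\\|\\||=>|[:!]=|<>|[<>()!,+=*|-]");

    static String padDelimiters(String script) {
        // $0 is the whole match, so each delimiter is re-inserted with a space on each side
        return DELIMITER.matcher(script).replaceAll(" $0 ");
    }

    public static void main(String[] args) {
        System.out.println(padDelimiters("v_name:=a||b;")); // v_name := a || b;
        System.out.println(padDelimiters("IF x<>y THEN"));  // IF x <> y THEN
    }
}
```

Because every delimiter is handled in one pass, a later replacement cannot split a delimiter that an earlier one has already padded.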
Sure. Your second approach is "almost" good. The problem is that you do not use your pattern for the replacement itself. Every time you call str.replaceAll() you create a new Pattern instance: Pattern.compile() is called for you, and it takes 90% of the time.
You should use Matcher.replaceAll() instead.
String script = "dfgafjd;fjfd;jfd;djf;jds\\fdfdf****\\/";
String result = script;
Pattern p = Pattern.compile("[\\*\\/\\\\]"); // write all characters you want to remove here.
Matcher m = p.matcher(script);
if (m.find()) {
result = m.replaceAll("");
}
System.out.println(result);
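Applied to the slash problem from the question, the same idea might look like the sketch below. This is an assumption-laden illustration, not the answerer's code: the lookaround pattern (?<!\*)/(?!\*) replaces the question's three capture groups, and the names SlashPadder and padSlashes are made up for the example. It only skips the slashes that open or close a block comment; a slash inside a comment body would still be padded.

```java
import java.util.regex.Pattern;

class SlashPadder {
    // Compiled once: a "/" that is neither preceded nor followed by "*"
    private static final Pattern LONE_SLASH = Pattern.compile("(?<!\\*)/(?!\\*)");

    static String padSlashes(String script) {
        // One pass over the text with the precompiled pattern
        return LONE_SLASH.matcher(script).replaceAll(" / ");
    }

    public static void main(String[] args) {
        System.out.println(padSlashes("a/b /* c */ d")); // a / b /* c */ d
    }
}
```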
It isn't the regexes causing your performance problem, it's the fact that you're doing many passes over the text and constantly creating new Pattern objects. And it's not just performance that suffers, as Tim pointed out; it's much too easy to mess up the results of prior passes when you do that.
In fact, I'm guessing that those extra spaces in the dates are just a side effect of your other replacements. If so, here's a way you can do all the replacements in one pass, without adding unwanted characters:
static String doReplace(String input)
{
String regex =
"/\\*[^*]*(?:\\*(?!/)[^*]*)*\\*/|" // a comment
+ "\\b\\d{2}/\\d{2}/\\d{2,4}\\b|" // a date
+ "(/|\\|\\||=>|[:!]=|<>|[<>()!,+=*|-])"; // an operator
Matcher m = Pattern.compile(regex).matcher(input);
StringBuffer sb = new StringBuffer();
while (m.find())
{
// if we found an operator, replace it
if (m.start(1) != -1)
{
m.appendReplacement(sb, " $1 ");
}
}
m.appendTail(sb);
return sb.toString();
}
see the online demo
The trick is, if you don't call appendReplacement(), the match position is not updated, so it's as if the match didn't occur. Because I ignore them, the comments and dates get reinserted along with the rest of the unmatched text, and I don't have to worry about matching the slash characters inside them.
EDIT Make sure the "comment" part of the regex comes before the "operator" part. Otherwise, the leading / of every comment will be treated as an operator.
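To see why the order of the alternatives matters, here is a reduced sketch with only a comment branch and a small operator branch. The class name AlternationOrderDemo, the helper pad and the shortened operator list are assumptions for the example. It runs the same loop with both orderings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    static final String COMMENT = "/\\*[^*]*(?:\\*(?!/)[^*]*)*\\*/";
    static final String OPERATOR = "(/|[*+=-])"; // the only capturing group, so always group 1

    static String pad(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Only operator matches set group 1; comment matches are left untouched
            if (m.start(1) != -1) {
                m.appendReplacement(sb, " $1 ");
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "a/b /* x/y */";
        // Comment branch first: the comment is consumed as one match and skipped
        System.out.println(pad(COMMENT + "|" + OPERATOR, input)); // a / b /* x/y */
        // Operator branch first: the comment's own "/" and "*" are padded as operators
        System.out.println(pad(OPERATOR + "|" + COMMENT, input));
    }
}
```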
