Regex does not store the element in the first index - java

I have a function which takes a String containing a math expression such as 6+9*8 or 4+9 and it evaluates them from left to right (without normal order of operation rules).
I've been stuck on this problem for the past couple of hours and have finally found the culprit, but I have no idea why it behaves the way it does. I split the string with regex (.split("\\d") and .split("\\D")) into two arrays: an int[] containing the numbers in the expression and a String[] containing the operations.
What I've realized is that when I do the following:
String question = "5+9*8";
String[] mathOperations = question.split("\\d");
for(int i = 0; i < mathOperations.length; i++) {
System.out.println("Math Operation at " + i + " is " + mathOperations[i]);
}
it does not put the first operation sign at index 0; it puts it at index 1 instead. Why is this?
This is the console output:
Math Operation at 0 is
Math Operation at 1 is +
Math Operation at 2 is *

Because on position 0 of mathOperations there's an empty String. In other words
mathOperations = {"", "+", "*"};
According to split documentation
The array returned by this method contains each substring of this
string that is terminated by another substring that matches the given
expression or is terminated by the end of the string. ...
Why isn't there an empty string at the end of the array too?
Trailing empty strings are therefore not included in the resulting
array.
More detailed explanation - your regex matched the String like this:
"(5)+(9)*(8)" -> "" + (5) + "+" + (9) + "*" + (8) + ""
but the trailing empty string is discarded as specified by the documentation.
(hope this silly illustration helps)
Also worth noting: the regex you used, "\\d", would split the string "55+5" into
["", "", "+"]
That's because it matches only a single digit at a time; you should probably use "\\d+" instead.
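As a rough sketch of the point above (the class and method names are made up for illustration and are not from the question), splitting on "\\d+" and "\\D+" keeps multi-digit numbers together, and the empty first element can be dropped explicitly:

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: returns the operators of an expression that starts with a number.
    static String[] operators(String expr) {
        String[] parts = expr.split("\\d+");
        // The first element is "" because the expression begins with a match (a number).
        if (parts.length > 0 && parts[0].isEmpty()) {
            return Arrays.copyOfRange(parts, 1, parts.length);
        }
        return parts;
    }

    // Hypothetical helper: returns the numbers, still as strings.
    static String[] operands(String expr) {
        return expr.split("\\D+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(operators("55+5*8"))); // [+, *]
        System.out.println(Arrays.toString(operands("55+5*8")));  // [55, 5, 8]
    }
}
```

This assumes the expression starts and ends with a number and contains no spaces or negative numbers.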

You may find the following variation on your program helpful, as one split does the jobs of both of yours...
public class zw {
public static void main(String[] args) {
String question = "85+9*8-900+77";
String[] bits = question.split("\\b");
for (int i = 0; i < bits.length; ++i) System.out.println("[" + bits[i] + "]");
}
}
and its output:
[]
[85]
[+]
[9]
[*]
[8]
[-]
[900]
[+]
[77]
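In this program, I used \b as a "zero-width boundary" to do the splitting. No characters were harmed during the split; they all went into the array. (The empty first element appears on Java 7 and earlier; since Java 8, a zero-width match at the beginning of the input no longer produces an empty leading substring.)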
More info here: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
and here: http://www.regular-expressions.info/wordboundaries.html
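If the goal is only the left-to-right evaluation described in the question, one possible alternative that avoids split (and its leading empty string) is to walk the tokens with a Matcher. This is only a sketch: the class name LeftToRight and the method eval are hypothetical, and it assumes non-negative integers, the four operators + - * /, integer division, and no spaces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeftToRight {
    // Evaluates strictly left to right, ignoring normal operator precedence.
    static int eval(String expr) {
        // Each token is either a run of digits or a single operator character.
        Matcher m = Pattern.compile("\\d+|[-+*/]").matcher(expr);
        int result = 0;
        char op = '+'; // pending operator; '+' makes the first number the starting value
        while (m.find()) {
            String token = m.group();
            if (Character.isDigit(token.charAt(0))) {
                int n = Integer.parseInt(token);
                switch (op) {
                    case '+': result += n; break;
                    case '-': result -= n; break;
                    case '*': result *= n; break;
                    case '/': result /= n; break;
                    default: throw new IllegalStateException("unexpected operator " + op);
                }
            } else {
                op = token.charAt(0);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(eval("6+9*8")); // (6+9)*8 = 120
        System.out.println(eval("4+9"));   // 13
    }
}
```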

Related

Why is String#split(...) implemented like this? [duplicate]

This question already has answers here:
Java String split removed empty values
(5 answers)
Closed 3 years ago.
I am working on software that needs to read text files with some features that won't be explained here. While testing my code, I found an anomaly that seems to come from the implementation of str.split("\r\n"), where str is a substring of the file's content.
When my substring ends with a succession of "\r\n" (several line breaks), the method completely ignores that part. For example, if I work with the following string:
"\r\nLine 1\r\n\r\nLine 2\r\n\r\n"
I would like to get the following array:
["", "Line 1", "", "Line 2", "", ""]
but it returns:
["", "Line 1", "", "Line 2"]
The String.split() Javadoc only mentions this without explaining it:
... Trailing empty strings are therefore not included in the resulting array.
I cannot understand this asymmetry: why did they drop empty strings at the end, but not at the beginning?
The Javadocs explain why it works the way it does; you'd have to ask them why they chose this default implementation. Why not just call split(regex, n) as per the docs? Using -1 does what you say you want, just like the docs imply.
class Main {
public static void main(String[] args) {
String s = "\r\nLine 1\r\n\r\nLine 2\r\n\r\n";
String[] r = s.split("\\r\\n", -1);
for (int i = 0; i < r.length; i++) {
System.out.println("i: " + i + " = \"" + r[i] + "\"");
}
}
}
Produces:
i: 0 = ""
i: 1 = "Line 1"
i: 2 = ""
i: 3 = "Line 2"
i: 4 = ""
i: 5 = ""
You missed the part of the doc that explains the therefore, which states:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero.
Looking at the referenced two-arg doc shows
If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
So this is just not the special case you want. Call with a negative integer instead:
str.split("\r\n", -1)
It's unclear why the authors thought 0 would be a more popular use-case than -1, but it doesn't really matter since the option you want exists.
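To make the effect of the limit argument concrete, here is a small sketch comparing the values discussed above on a short sample string. The class name SplitLimitDemo and the helper show are invented for this illustration.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Formats the pieces together with their count so that empty strings stay visible.
    static String show(String[] parts) {
        return Arrays.asList(parts).toString() + " (length " + parts.length + ")";
    }

    public static void main(String[] args) {
        String s = "a,b,,";
        System.out.println(show(s.split(",")));     // [a, b] (length 2)
        System.out.println(show(s.split(",", 0)));  // same as the one-argument form
        System.out.println(show(s.split(",", -1))); // [a, b, , ] (length 4)
        System.out.println(show(s.split(",", 2)));  // [a, b,,] (length 2)
    }
}
```

A positive limit n applies the pattern at most n - 1 times, which is why the last call leaves the rest of the input unsplit in the final element.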

Java regex: Replace all characters with `+` except instances of a given string

I have the following problem which states
Replace all characters in a string with + symbol except instances of the given string in the method
so for example if the string given was abc123efg and they want me to replace every character except every instance of 123 then it would become +++123+++.
I figured a regular expression is probably the best for this and I came up with this.
str.replaceAll("[^str]","+")
where str is a variable, but it's not letting me use the method without putting it in quotation marks. If I just want to use the variable string str, how can I do that? I ran it with the string typed in manually and it worked, but can I just pass in a variable?
As of right now, I believe it's looking for the string "str" and not the variable string.
Here is the output; it's right for so many cases except for two :(
List of open test cases:
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
plusOut("abXYabcXYZ", "ab") → "ab++ab++++"
plusOut("abXYabcXYZ", "abc") → "++++abc+++"
plusOut("abXYabcXYZ", "XY") → "++XY+++XY+"
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
plusOut("--++ab", "++") → "++++++"
plusOut("aaxxxxbb", "xx") → "++xxxx++"
plusOut("123123", "3") → "++3++3"
Looks like this is the plusOut problem on CodingBat.
I had 3 solutions to this problem, and wrote a new streaming solution just for fun.
Solution 1: Loop and check
Create a StringBuilder out of the input string, and check for the word at every position. Replace the character if it doesn't match, and skip the length of the word if it is found.
public String plusOut(String str, String word) {
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
if (!str.startsWith(word, i))
out.setCharAt(i++, '+');
else
i += word.length();
}
return out.toString();
}
This is probably the expected answer for a beginner programmer, though there is an assumption that the string doesn't contain any astral plane character, which would be represented by 2 char instead of 1.
Solution 2: Replace the word with a marker, replace the rest, then restore the word
public String plusOut(String str, String word) {
return str.replaceAll(java.util.regex.Pattern.quote(word), "#").replaceAll("[^#]", "+").replace("#", word);
}
Not a proper solution since it assumes that a certain character or sequence of character doesn't appear in the string.
Note the use of Pattern.quote to prevent the word being interpreted as regex syntax by replaceAll method.
Solution 3: Regex with \G
public String plusOut(String str, String word) {
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
}
Construct the regex \G((?:word)*+)., which does more or less what solution 1 is doing:
\G makes sure the match starts from where the previous match left off
((?:word)*+) picks out 0 or more instances of word, if any, so that we can keep them in the replacement with $1. The key here is the possessive quantifier *+, which forces the regex to keep any instance of the word it finds. Otherwise, the regex will not work correctly when the word appears at the end of the string, as the regex backtracks to match .
. will not be part of any word, since the previous part already picks out all consecutive appearances of word and disallows backtracking. We will replace this with +
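The role of the possessive quantifier can be seen by running the same pattern with *+ and with a plain greedy *. The sketch below is only an illustration: the class name PossessiveDemo and the boolean flag are hypothetical additions.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Builds the \G-based pattern with either a possessive (*+) or a greedy (*) quantifier.
    static String plusOut(String str, String word, boolean possessive) {
        String quoted = Pattern.quote(word);
        String quantifier = possessive ? "*+" : "*";
        return str.replaceAll("\\G((?:" + quoted + ")" + quantifier + ").", "$1+");
    }

    public static void main(String[] args) {
        // Possessive: the trailing XYZ is kept, because the engine may not give it back.
        System.out.println(plusOut("abXYxyzXYZ", "XYZ", true));  // +++++++XYZ
        // Greedy: the engine backtracks out of the trailing XYZ so that . can match X.
        System.out.println(plusOut("abXYxyzXYZ", "XYZ", false)); // ++++++++++
    }
}
```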
Solution 4: Streaming
public String plusOut(String str, String word) {
return String.join(word,
Arrays.stream(str.split(java.util.regex.Pattern.quote(word), -1))
.map((String s) -> s.replaceAll("(?s:.)", "+"))
.collect(Collectors.toList()));
}
The idea is to split the string by word, do the replacement on the rest, and join them back with word using String.join method.
Same as above, we need Pattern.quote to avoid split interpreting the word as regex. Since split by default removes empty string at the end of the array, we need to use -1 in the second parameter to make split leave those empty strings alone.
Then we create a stream out of the array and replace the rest as strings of +. In Java 11, we can use s -> "+".repeat(s.length()) instead.
The rest is just converting the Stream to an Iterable (List in this case) and joining them for the result
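For comparison, the same split-and-join idea can be sketched without streams, using a plain loop. The class name PlusOutSplit is made up here; note that this version counts char values rather than code points, so it differs slightly from the (?s:.) replacement for characters outside the Basic Multilingual Plane.

```java
import java.util.regex.Pattern;

public class PlusOutSplit {
    static String plusOut(String str, String word) {
        // -1 keeps trailing empty strings, so a word at the very end is not lost.
        String[] parts = str.split(Pattern.quote(word), -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.append(word); // put the word back between the pieces
            }
            for (int j = 0; j < parts[i].length(); j++) {
                out.append('+'); // one + per char of the piece
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(plusOut("12xy34xyabcxy", "xy")); // ++xy++xy+++xy
        System.out.println(plusOut("aaxxxxbb", "xx"));      // ++xxxx++
    }
}
```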
This is a bit trickier than you might initially think because you don't just need to match characters, but the absence of specific phrase - a negated character set is not enough. If the string is 123, you would need:
(?<=^|123)(?!123).*?(?=123|$)
https://regex101.com/r/EZWMqM/1/
That is - lookbehind for the start of the string or "123", make sure the current position is not followed by 123, then lazy-repeat any character until lookahead matches "123" or the end of the string. This will match all characters which are not in a "123" substring. Then, you need to replace each character with a +, after which you can use appendReplacement and a StringBuffer to create the result string:
String inputPhrase = "123";
String inputStr = "abc123efg123123hij";
StringBuffer resultString = new StringBuffer();
Pattern regex = Pattern.compile("(?<=^|" + inputPhrase + ")(?!" + inputPhrase + ").*?(?=" + inputPhrase + "|$)");
Matcher m = regex.matcher(inputStr);
while (m.find()) {
String replacement = m.group(0).replaceAll(".", "+");
m.appendReplacement(resultString, replacement);
}
m.appendTail(resultString);
System.out.println(resultString.toString());
Output:
+++123+++123123+++
Note that if the inputPhrase can contain characters with a special meaning in a regular expression, you'll have to escape them (for example with Pattern.quote) before concatenating them into the pattern.
You can do it in one line:
input = input.replaceAll("((?:" + str + ")+)?(?!" + str + ").((?:" + str + ")+)?", "$1+$2");
This optionally captures "123" on either side of each character and puts them back (a blank if there's no "123").
So instead of coming up with a regular expression that matches the absence of a string, we might as well just match the selected phrase and append a + for each skipped character.
StringBuilder sb = new StringBuilder();
Matcher m = Pattern.compile(Pattern.quote(str)).matcher(input);
while (m.find()) {
    // sb.length() is the input position reached so far; pad with + up to the match
    while (sb.length() < m.start()) sb.append('+');
    sb.append(str);
}
// pad whatever follows the last match
int remaining = input.length() - sb.length();
for (int i = 0; i < remaining; i++) {
    sb.append('+');
}
Absolutely just for the fun of it, a solution using CharBuffer (unexpectedly, it took a lot more than I initially hoped for):
private static String plusOutCharBuffer(String input, String match) {
int size = match.length();
CharBuffer cb = CharBuffer.wrap(input.toCharArray());
CharBuffer word = CharBuffer.wrap(match);
int x = 0;
for (; cb.remaining() > 0;) {
if (!cb.subSequence(0, size < cb.remaining() ? size : cb.remaining()).equals(word)) {
cb.put(x, '+');
cb.clear().position(++x);
} else {
cb.clear().position(x = x + size);
}
}
return cb.clear().toString();
}
To make this work you need a beast of a pattern. Let's say you are operating on the following test case as an example:
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
What you need to do is build a series of clauses in your pattern to match a single character at a time:
Any character that is NOT "X", "Y" or "Z" -- [^XYZ]
Any "X" not followed by "YZ" -- X(?!YZ)
Any "Y" not preceded by "X" -- (?<!X)Y
Any "Y" not followed by "Z" -- Y(?!Z)
Any "Z" not preceded by "XY" -- (?<!XY)Z
An example of this replacement can be found here: https://regex101.com/r/jK5wU3/4
Here is an example of how this might work (most certainly not optimized, but it works):
import java.util.regex.Pattern;
public class Test {
public static void plusOut(String text, String exclude) {
StringBuilder pattern = new StringBuilder("");
for (int i=0; i<exclude.length(); i++) {
Character target = exclude.charAt(i);
String prefix = (i > 0) ? exclude.substring(0, i) : "";
String postfix = (i < exclude.length() - 1) ? exclude.substring(i+1) : "";
// add the look-behind (?<!X)Y
if (!prefix.isEmpty()) {
pattern.append("(?<!").append(Pattern.quote(prefix)).append(")")
.append(Pattern.quote(target.toString())).append("|");
}
// add the look-ahead X(?!YZ)
if (!postfix.isEmpty()) {
pattern.append(Pattern.quote(target.toString()))
.append("(?!").append(Pattern.quote(postfix)).append(")|");
}
}
// add in the other character exclusion
pattern.append("[^" + Pattern.quote(exclude) + "]");
System.out.println(text.replaceAll(pattern.toString(), "+"));
}
public static void main(String [] args) {
plusOut("12xy34", "xy");
plusOut("12xy34", "1");
plusOut("12xy34xyabcxy", "xy");
plusOut("abXYabcXYZ", "ab");
plusOut("abXYabcXYZ", "abc");
plusOut("abXYabcXYZ", "XY");
plusOut("abXYxyzXYZ", "XYZ");
plusOut("--++ab", "++");
plusOut("aaxxxxbb", "xx");
plusOut("123123", "3");
}
}
UPDATE: Even this doesn't quite work because it can't deal with exclusions that are just repeated characters, like "xx". Regular expressions are most definitely not the right tool for this, but I thought it might be possible. After poking around, I'm not so sure a pattern even exists that might make this work.
The problem with your solution is that str.replaceAll("[^str]","+") uses a negated character set, which excludes individual characters (here the literal characters s, t and r, not the contents of the variable str), and that will not solve your problem.
For example, str.replaceAll("[^XYZ]","+") leaves every X, Y and Z character untouched in any combination, so you will get "++XY+++XYZ".
What you actually need to exclude in str.replaceAll is a sequence of characters.
You can do that by using a capture group of characters like (XYZ), then using a negative lookahead to match a string which does not contain the character sequence: ^((?!XYZ).)*$
Check this solution for more info about this problem, but you should know that it may be complicated to find a regular expression that does this directly.
I have found two simple solutions for this problem :
Solution 1:
You can implement a method to replace all characters with '+' except the instance of given string:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
for(int i = 0; i < str.length(); i++){
// exclude any instance string of exWord from replacing process in str
if(str.substring(i, str.length()).indexOf(exWord) + i == i){
i = i + exWord.length()-1;
}
else{
str = str.substring(0,i) + "+" + str.substring(i+1);//replace each character with '+' symbol
}
}
Note: the condition str.substring(i, str.length()).indexOf(exWord) + i == i is true when an instance of exWord starts at index i, so that instance is skipped rather than replaced.
Output:
+++++++XYZ
Solution 2:
You can try this approach using the replaceAll method; it doesn't need any complex regular expression:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
str = str.replaceAll(exWord,"*"); // replace instance string with * symbol
str = str.replaceAll("[^*]","+"); // replace all characters with + symbol except *
str = str.replaceAll("\\*",exWord); // replace * symbol with instance string
Note: This solution will work only if your input string str doesn't contain any * symbol.
Also, you should escape any characters with a special meaning in a regular expression in the instance string exWord, for example when exWord = "++".
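One way to do that escaping is Pattern.quote. The following is a minimal sketch of the marker approach with the word quoted. The class name PlusOutMarker is hypothetical, and the code still assumes that the input contains no * character.

```java
import java.util.regex.Pattern;

public class PlusOutMarker {
    static String plusOut(String str, String exWord) {
        return str
                // Pattern.quote makes a word such as "++" match literally instead of as regex syntax.
                .replaceAll(Pattern.quote(exWord), "*")
                // Everything that is not the marker becomes +.
                .replaceAll("[^*]", "+")
                // String.replace is a literal (non-regex) replacement, so exWord needs no escaping here.
                .replace("*", exWord);
    }

    public static void main(String[] args) {
        System.out.println(plusOut("--++ab", "++"));      // ++++++
        System.out.println(plusOut("abXYxyzXYZ", "XYZ")); // +++++++XYZ
    }
}
```

Without the quoting, passing "++" straight to replaceAll fails with a PatternSyntaxException, because + is a quantifier with nothing to repeat.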

Java: String.replaceAll(regex, replacement);

I have a string of comma-separated user-ids and I want to eliminate/remove specific user-id from a string.
I have the following possible strings and the expected result:
int elimiateUserId = 11;
String css1 = "11,22,33,44,55";
String css2 = "22,33,11,44,55";
String css3 = "22,33,44,55,11";
// The expected result in all cases, after replacement, should be:
// "22,33,44,55"
I tried the following:
String result = css#.replaceAll("," + elimiateUserId, ""); // # = 1 or 2 or 3
result = css#.replaceAll(elimiateUserId + "," , "");
This logic fails in the case of css3. Please suggest a proper solution for this issue.
Note: I'm working with Java 7
I checked around the following posts, but could not find any solution:
Java String.replaceAll regex
java String.replaceAll regex question
Java 1.3 String.replaceAll() , replacement
You can use the Stream API in Java 8:
int elimiateUserId = 11;
String css1 = "11,22,33,44,55";
String css1Result = Stream.of(css1.split(","))
.filter(value -> !String.valueOf(elimiateUserId).equals(value))
.collect(Collectors.joining(","));
// css1Result = 22,33,44,55
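Since the question mentions Java 7, where streams are not available, the same filter can be sketched with a plain loop. The class name RemoveId and the method removeId are invented for this example.

```java
public class RemoveId {
    // Rebuilds the comma-separated list, skipping every entry equal to the given id.
    static String removeId(String csv, int id) {
        String target = String.valueOf(id);
        StringBuilder out = new StringBuilder();
        for (String value : csv.split(",")) {
            if (value.equals(target)) {
                continue; // whole-value comparison, so 111 is not affected by id 11
            }
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(value);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeId("11,22,33,44,55", 11)); // 22,33,44,55
        System.out.println(removeId("22,33,11,44,55", 11)); // 22,33,44,55
        System.out.println(removeId("22,33,44,55,11", 11)); // 22,33,44,55
    }
}
```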
If you want to use regex, you may use (remember to properly escape as java string literal)
,\b11\b|\b11\b,
This will ensure that 11 won't be matched as part of another number due to the word boundaries and only one comma (if two are present) is matched and removed.
You may build a regex like
^11,|,11\b
that will match 11, at the start of a string (^11,) or (|) ,11 not followed with any other word char (,11\b).
int elimiate_user_id = 11;
String pattern = "^" + elimiate_user_id + ",|," + elimiate_user_id + "\\b";
System.out.println("11,22,33,44,55,111".replaceAll(pattern, "")); // => 22,33,44,55,111
System.out.println("22,33,11,44,55,111".replaceAll(pattern, "")); // => 22,33,44,55,111
System.out.println("22,33,44,55,111,11".replaceAll(pattern, "")); // => 22,33,44,55,111
Try passing the expression (^(11)(?:,))|((?<=,)(11)(?:,))|(,11$) to replaceAll:
// String.valueOf keeps MessageFormat from applying number formatting to the id
final String regexp = MessageFormat.format("(^({0})(?:,))|((?<=,)({0})(?:,))|(,{0}$)", String.valueOf(elimiateUserId));
String result = css#.replaceAll(regexp, ""); // for all cases
Here is an example:
https://regex101.com/r/LwJgRu/3
try this:
String result = css#.replaceAll("," + elimiateUserId, "")
.replaceAll(elimiateUserId + "," , "");
You can chain two replace calls in one statement, like this:
int elimiateUserId = 11;
String result = css#.replace("," + elimiateUserId , "").replace(elimiateUserId + ",", "");
If your string contains ,11 then the first replace will replace it with an empty string
If your string contains 11, then the second replace will replace it with an empty string
result
11,22,33,44,55 -> 22,33,44,55
22,33,11,44,55 -> 22,33,44,55
22,33,44,55,11 -> 22,33,44,55
String result = css#.replaceAll("," + eliminate_user_id + "\\b|\\b" + eliminate_user_id + ",", "");
The regular expression here is:
, A leading comma.
eliminate_user_id I assumed the missing 'n' here was a typo.
\b Word boundary: word/number characters end here.
| OR
\b Word boundary: word/number characters begin here.
eliminate_user_id again.
, A trailing comma.
The word boundary marker, matching the beginning or end of a "word", is the magic here. It means that the 11 will match in these strings:
11,22,33,44,55
22,33,11,44,55
22,33,44,55,11
But not these strings:
111,112,113,114
411,311,211,111
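The behaviour on both lists can be checked with a short sketch. The class name WordBoundaryRemove and the method remove are assumptions made for this example. Note that in Java source the word boundary has to be written as "\\b", since "\b" in a string literal is a backspace character.

```java
public class WordBoundaryRemove {
    // Removes the id together with one adjacent comma, but only when the id is a whole number.
    static String remove(String csv, int id) {
        return csv.replaceAll("," + id + "\\b|\\b" + id + ",", "");
    }

    public static void main(String[] args) {
        System.out.println(remove("11,22,33,44,55", 11));  // 22,33,44,55
        System.out.println(remove("22,33,11,44,55", 11));  // 22,33,44,55
        System.out.println(remove("22,33,44,55,11", 11));  // 22,33,44,55
        System.out.println(remove("111,112,113,114", 11)); // unchanged
        System.out.println(remove("411,311,211,111", 11)); // unchanged
    }
}
```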
There's a cleaner way, though:
String result = css#.replaceAll("(,?)\\b" + eliminate_user_id + "\\b(?(1)|,)", "");
The regular expression here is:
( A capturing group - what's in here, is in group 1.
,? An optional leading comma.
) End the capturing group.
\b Word boundary: word/number characters begin here.
eliminate_user_id I assumed the missing 'n' here was a typo.
\b Word boundary: word/number characters end here.
(?(1) If there's something in group 1, then require...
| ...nothing, but if there was nothing, then require...
, A trailing comma.
) end the if.
The "if" part here is a little unusual - you can find a little more information on regex conditionals here: http://www.regular-expressions.info/conditional.html
Note that Java's java.util.regex does not support regex conditionals (see also Conditional Regular Expression in Java?), so this second pattern cannot be used in Java as written :(
Side-note: for performance, if the list is VERY long and there are VERY many removals to be performed, the most obvious option is to just run the above line for each number to be removed:
String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212,...";
String[] removals = {"11", "33", "55", "77", "99", "1212"};
for (int i = 0; i < removals.length; i++) {
    css = css.replaceAll("," + removals[i] + "\\b|\\b" + removals[i] + ",", "");
}
(code not tested: don't have access to a Java compiler here)
This will be fast enough (worst case scales with about O(m*n) for m removals from a string of n ids), but we can maybe do better.
One is to build the regex to be \b(11|42|18|13|123|...etc)\b - that is, make the regex search for all ids to be removed at the same time. In theory this scales a little worse, scaling with O(m*n) in every case rather than just the worst case, but in practice should be considerably faster.
String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212,...";
String[] removals = {"11", "33", "55", "77", "99", "1212"};
String removalsStr = String.join("|", removals);
// group the alternatives so that the comma and \b apply to every id
css = css.replaceAll(",(?:" + removalsStr + ")\\b|\\b(?:" + removalsStr + "),", "");
But another approach might be to build a hashtable of the ids in the long string, then remove all the ids from the hashtable, then concatenate the remaining hashtable keys back into a string. Since hashtable lookups are effectively O(1) for sparse hashtables, that makes this scale with O(n). The tradeoff here is the extra memory for that hashtable, though.
(I don't think I can do this version without a java compiler handy. I would not recommend this approach unless you have a VAST (many thousands) list of IDs to remove, anyway, as it will be much uglier and more complex code).
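A hash-based version does not have to be very long. The sketch below is a variation on the idea rather than the exact approach described: it puts the ids to remove (not the ids of the long string) into a HashSet and filters the list in one pass, which keeps the original order and any duplicate ids. The class name BulkRemove and the method removeAll are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BulkRemove {
    static String removeAll(String csv, String... removals) {
        // Set lookups are effectively constant time, so the loop is linear in the number of ids.
        Set<String> toRemove = new HashSet<>(Arrays.asList(removals));
        StringBuilder out = new StringBuilder();
        for (String id : csv.split(",")) {
            if (toRemove.contains(id)) {
                continue;
            }
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(id);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212";
        System.out.println(removeAll(css, "11", "33", "55", "77", "99", "1212"));
        // 22,44,66,88,1010,1111
    }
}
```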
I think it's safer to maintain a whitelist and then use it as a reference to make further changes.
List<String> whitelist = Arrays.asList("22", "33", "44", "55");
String s = "22,33,44,55,11";
String[] sArr = s.split(",");
StringBuilder ids = new StringBuilder();
for (String id : sArr) {
if (whitelist.contains(id)) {
ids.append(id).append(", ");
}
}
String r = ids.substring(0, ids.length() - 2);
System.out.println(r);
If you need a solution with regex, then the following works for the three sample inputs.
int elimiate_user_id = 11;
String css1 = "11,22,33,44,55";
String css2 = "22,33,11,44,55";
String css3 = "22,33,44,55,11";
String resultCss=css1.replaceAll(elimiate_user_id+"[,]*", "").replaceAll(",$", "");
It works for the inputs shown, but note that it will also match 11 inside a longer id such as 111 or 211.
This should work
replaceAll("(11,|,11)", "")
At least as long as you can guarantee that there is no 311, or ,113, or the like.

Using Split() in arithmetic formula

I have been thinking about a problem for a day but still cannot solve it.
I have a formula input like "11+1+1+2", without spaces.
I want to split the formula on the operators.
So I wrote this:
String s = "11+1+1+2";
String splitByOp[] = s.split("[+|-|*|/|%]");
for(int c=0; c < splitByOp.length; c++){
System.out.println(splitByOp[c]);
}
The output is:
11
1
1
2
I want to put the operands (the output) and also the operators (+) into an ArrayList. But how can I keep the operators after splitting?
I tried using one more array to split on the numbers:
String operator[] = s.split("\\d");
But the result is that 11 becomes 1 1. The length of operator[] is 5.
In other words, how can I get the following:
The output:
11
+
1
+
1
+
2
You need to split on a regex that is non consuming. Specifically, on "word boundary":
String[] terms = s.split("\\b");
A "word boundary" is the gap between the word char and a non-word char, but digits are classified as word chars. Importantly, the match is non-consuming, so all of the content of the input is preserved in the split terms.
Here's some test code:
String s = "11+1+1+2";
String[] terms = s.split("\\b");
for (String term : terms)
System.out.println(term);
Output:
11
+
1
+
1
+
2
public static void main(String[] args) {
String s = "11+1+1+2";
String[] terms = s.split("(?=[+])|(?<=[+])");
System.out.println(Arrays.toString(terms));
}
output
[11, +, 1, +, 1, +, 2]
You could combine lookahead/lookbehind assertions
String[] array = s.split("(?=[+])|(?<=[+])");
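Extending the lookaround split to the other operators from the question and collecting the pieces into an ArrayList might look like the sketch below. The class name FormulaTokens and the method tokenize are made up, and the code assumes a formula without spaces, parentheses or negative numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FormulaTokens {
    static List<String> tokenize(String formula) {
        // Split at every position just before or just after an operator; no characters are consumed.
        String[] parts = formula.split("(?<=[-+*/%])|(?=[-+*/%])");
        return new ArrayList<>(Arrays.asList(parts));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("11+1+1+2")); // [11, +, 1, +, 1, +, 2]
        System.out.println(tokenize("3*4-5"));    // [3, *, 4, -, 5]
    }
}
```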

RegEx Split on / Except when Surrounded by []

I am trying to split a string in Java on / but I need to ignore any instances where / is found between []. For example if I have the following string
/foo/bar[donkey=King/Kong]/value
Then I would like to return the following in my output
foo
bar[donkey=King/Kong]
value
I have seen a couple of other similar posts, but I haven't found anything that fits exactly what I'm trying to do. I've tried the String.split() method as follows and have seen weird results:
Code: value.split("/[^/*\\[.*/.*\\]]")
Result: [, oo, ar[donkey=King, ong], alue]
What do I need to do in order to get back the following:
Desired Result: [, foo, bar[donkey=King/Kong], value]
Thanks,
Jeremy
You need to split on a / that is followed by 0 or more balanced pairs of brackets:
String str = "/foo/bar[donkey=King/Kong]/value";
String[] arr = str.split("/(?=([^\\[\\]]*\\[[^\\[\\]]*\\])*[^\\[\\]]*$)");
System.out.println(Arrays.toString(arr));
Output:
[, foo, bar[donkey=King/Kong], value]
More User friendly explanation
String[] arr = str.split("(?x)/" + // Split on `/`
"(?=" + // Followed by
" (" + // Start a capture group
" [^\\[\\]]*" + // 0 or more non-[, ] character
" \\[" + // then a `[`
" [^\\]\\[]*" + // 0 or more non-[, ] character
" \\]" + // then a `]`
" )*" + // 0 or more repetition of previous pattern
" [^\\[\\]]*" + // 0 or more non-[, ] characters
"$)"); // till the end
In the following string, the regex below will match foo and bar, but not fox and baz, because they're followed by a close bracket. Study up on negative lookahead.
fox]foo/bar/baz]
Regex:
\b(\w+)\b(?!])
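As a non-regex alternative to the bracket-aware split asked about in this question, a single pass that tracks bracket depth can be sketched as follows. The class name BracketSplit and the method splitPath are hypothetical, and the code assumes the brackets are balanced.

```java
import java.util.ArrayList;
import java.util.List;

public class BracketSplit {
    static List<String> splitPath(String path) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // how many [ are currently open
        for (char c : path.toCharArray()) {
            if (c == '[') {
                depth++;
            } else if (c == ']' && depth > 0) {
                depth--;
            }
            if (c == '/' && depth == 0) {
                // A slash outside brackets ends the current piece.
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitPath("/foo/bar[donkey=King/Kong]/value"));
        // [, foo, bar[donkey=King/Kong], value]
    }
}
```

Unlike String.split, this keeps trailing empty pieces, so a path ending in / produces a final empty element.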
