RegEx Split on / Except when Surrounded by []

I am trying to split a string in Java on / but I need to ignore any instances where / is found between []. For example if I have the following string
/foo/bar[donkey=King/Kong]/value
Then I would like to return the following in my output
foo
bar[donkey=King/Kong]
value
I have seen a couple of similar posts, but I haven't found anything that fits exactly what I'm trying to do. I've tried the String.split() method as follows and have seen weird results:
Code: value.split("/[^/*\\[.*/.*\\]]")
Result: [, oo, ar[donkey=King, ong], alue]
What do I need to do in order to get back the following:
Desired Result: [, foo, bar[donkey=King/Kong], value]
Thanks,
Jeremy

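You need to split on each / that is followed, up to the end of the string, only by zero or more balanced pairs of brackets. A / inside brackets is followed by an unmatched ] and therefore fails the lookahead:
String str = "/foo/bar[donkey=King/Kong]/value";
String[] arr = str.split("/(?=([^\\[\\]]*\\[[^\\[\\]]*\\])*[^\\[\\]]*$)");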
System.out.println(Arrays.toString(arr));
Output:
[, foo, bar[donkey=King/Kong], value]
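To see how the lookahead behaves on other inputs, here is a minimal runnable sketch that wraps the split in a helper. The class name SlashSplitDemo, the method name split and the two extra inputs are made up for illustration; it assumes brackets are balanced and not nested.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class SlashSplitDemo {
    // A '/' that is followed only by balanced [...] pairs until end of input.
    private static final Pattern SLASH_OUTSIDE_BRACKETS =
            Pattern.compile("/(?=([^\\[\\]]*\\[[^\\[\\]]*\\])*[^\\[\\]]*$)");

    static String[] split(String path) {
        return SLASH_OUTSIDE_BRACKETS.split(path);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("/foo/bar[donkey=King/Kong]/value")));
        // prints [, foo, bar[donkey=King/Kong], value]
        System.out.println(Arrays.toString(split("a/b/c")));
        // prints [a, b, c]
        System.out.println(Arrays.toString(split("x[1/2]/y[3/4]")));
        // prints [x[1/2], y[3/4]]
    }
}
```

Compiling the pattern once is only a convenience when the split runs many times; the result is the same as calling String.split with the same regex.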
A more readable version with comments (the (?x) flag makes the regex ignore the whitespace used for layout):
String[] arr = str.split("(?x)/" + // Split on `/`
"(?=" + // Followed by
" (" + // Start a capture group
" [^\\[\\]]*" + // 0 or more non-[, ] character
" \\[" + // then a `[`
" [^\\]\\[]*" + // 0 or more non-[, ] character
" \\]" + // then a `]`
" )*" + // 0 or more repetition of previous pattern
" [^\\[\\]]*" + // 0 or more non-[, ] characters
"$)"); // till the end

In the following string, the regex below matches foo and bar, but not fox and baz, because those two are followed by a closing bracket. Read up on negative lookahead.
fox]foo/bar/baz]
Regex:
\b(\w+)\b(?!])
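A minimal sketch of how that negative lookahead could be exercised from Java. The class and method names (NegativeLookaheadDemo, findWords) are hypothetical, and the ] is escaped here only for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the lookahead pattern quoted above.
class NegativeLookaheadDemo {
    // A whole word that is NOT immediately followed by ']'.
    private static final Pattern WORD_NOT_BEFORE_BRACKET =
            Pattern.compile("\\b(\\w+)\\b(?!\\])");

    static List<String> findWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD_NOT_BEFORE_BRACKET.matcher(input);
        while (m.find()) {
            words.add(m.group(1)); // group 1 is the word itself
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("fox]foo/bar/baz]")); // prints [foo, bar]
    }
}
```

Note that this extracts words rather than splitting: applied to /foo/bar[donkey=King/Kong]/value it returns foo, bar, donkey, King and value, so it illustrates the lookahead technique rather than producing the split asked for in the question.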

Related

Java Regex Handling

I am looking for a Java regex that, for the elements of my word list:
1 - returns true if a special character is present before or after the element
2 - returns false if alphabetic characters are present before and after the element
My word list: wordsList = {"TIN", "tin"}
Input:
1-CardTIN is 1111
2-Card-TIN:2222
3-CardTINis3333
4-Card#TIN#4444
5-CardTIN#5555
6-TINis9999
Expected Output:
1-True
2-True
3-False
4-True
5-True
6-False
I have tried the following to cover these cases:
Arrays.stream(wordsList).anyMatch(word -> Pattern.compile("([^a-zA-Z0-9])" + Pattern.quote(word) + "([^a-zA-Z0-9])").matcher(string).find())
But the case CardTIN#5555 does not give the expected result.
Kindly help with this case.

You can use a negative lookahead to reject strings in which TIN or tin is surrounded only by alphanumeric characters:
(?i)^(?![a-zA-Z0-9]*(?:TIN|tin)[a-zA-Z0-9]*$).*(?:TIN|tin).*
(?i) Case insensitive match
^ Start of string
(?![a-zA-Z0-9]*(?:TIN|tin)[a-zA-Z0-9]*$) Assert that TIN or tin (as it is case insensitive, it does not matter for this example) does not occur between alpha numeric chars (no special characters so to say)
.*(?:TIN|tin).* Match the word in the line
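You might add word boundaries, \\b(?:TIN|tin)\\b, for a more precise match, but note that \\b does not match between two letters, so inputs such as CardTIN is 1111 would then no longer match.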
Example for a single line:
String s = "CardTIN is 1111";
String[] wordsList = {"TIN","tin"};
String alt = "(?:" + String.join("|", wordsList) + ")";
String regex = "(?i)^(?![a-zA-Z0-9]*" + alt + "[a-zA-Z0-9]*$).*" + alt + ".*";
System.out.println(s.matches(regex));
Output
true
You can also join the list of words on | and then filter the list:
String strings[] = { "CardTIN is 1111", "Card-TIN:2222", "CardTINis3333", "Card#TIN#4444", "CardTIN#5555", "TINis9999", "test", "Card Tin is 1111" };
String[] wordsList = {"TIN","tin"};
String alt = "(?:" + String.join("|", wordsList) + ")";
String regex = "(?i)^(?![a-zA-Z0-9]*" + alt + "[a-zA-Z0-9]*$).*" + alt + ".*";
List<String> result = Arrays.stream(strings)
        .filter(word -> word.matches(regex))
        .collect(Collectors.toList());
for (String res : result)
    System.out.println(res);
Output
CardTIN is 1111
Card-TIN:2222
Card#TIN#4444
CardTIN#5555
Card Tin is 1111
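If the word list may contain regex metacharacters, one possible variation is to quote each word before joining, as the question's own attempt did with Pattern.quote. The sketch below assumes that precaution is wanted; the class and method names (WordListMatcher, buildPattern) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: same regex shape as the answer, with quoted words.
class WordListMatcher {
    static Pattern buildPattern(List<String> words) {
        // Quote each word so it is matched literally, then join with '|'.
        String alt = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "(?:", ")"));
        // The flag replaces the inline (?i) used in the answer.
        return Pattern.compile(
                "^(?![a-zA-Z0-9]*" + alt + "[a-zA-Z0-9]*$).*" + alt + ".*",
                Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = buildPattern(Arrays.asList("TIN", "tin"));
        for (String s : Arrays.asList("CardTIN is 1111", "Card-TIN:2222",
                "CardTINis3333", "Card#TIN#4444", "CardTIN#5555", "TINis9999")) {
            System.out.println(s + " -> " + p.matcher(s).matches());
        }
    }
}
```

For the six inputs from the question this yields true, true, false, true, true, false, which is the expected output listed there.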

I am not sure whether your requirements can be put in regex. If at all, you are running way into specialities that will become hard to maintain.
Therefore you might be better off using a lexer/scanner combination which usually make up a complete parser. Check out ANTLR.

Split string to include null values represented by white space

I am trying to split an input string that contains whitespace, but I do not want the whitespace to be dropped by the split; I want it included in the resulting array. Is there a better regex or method to use in this case?
String data = "1     a1 b1  r5";
String[] splitData = data.split("\\s+");
for (String x : splitData) {
    System.out.print(x + ", ");
}
Expected output: 1, , , , , ,a1, b1, , r5

I'm confused by your methodology here. If this is all you're trying to do, it can be done much more simply:
String input = "1     a1 b1  r5";
String output = input.replace(" ", ", ");
System.out.println(output);
The middle line simply replaces the space character, " ", with a comma followed by a space, ", ". The final output matches your requested output:
1, , , , , a1, b1, , r5
If this is a minimal example and you actually intend to use a more complex regex, please post that regex and we can get to work on it.

If you want to create an array of tokens, just use:
String[] sp = s.split( " " );
However, this will create 4 empty items after the first one, not 5.
There is 1 space between "a1" and "b1", and you expect 0 empty items there.
There are 2 spaces between "b1" and "r5" and you expect 1 empty item there.
There are 5 spaces between "1" and "a1". Why do you expect 5 empty items there instead of 4?
And why does your expected output not have a space after the comma in front of "a1" ?
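The counts above can be checked with a short sketch. It assumes the input has five spaces after "1", one after "a1" and two after "b1", as described in this answer; the class and method names are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo: splitting on a single space keeps the empty tokens
// that sit between consecutive spaces.
class SpaceSplitDemo {
    static String[] splitOnSingleSpace(String s) {
        return s.split(" ");
    }

    public static void main(String[] args) {
        // Assumed input: 5 spaces, then 1 space, then 2 spaces.
        String[] tokens = splitOnSingleSpace("1     a1 b1  r5");
        System.out.println(tokens.length);           // prints 9
        System.out.println(Arrays.toString(tokens)); // prints [1, , , , , a1, b1, , r5]
    }
}
```

Five consecutive spaces produce four empty strings because an empty token only appears between two adjacent delimiters.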

Regex does not store the element in the first index

I have a function which takes a String containing a math expression such as 6+9*8 or 4+9 and it evaluates them from left to right (without normal order of operation rules).
I've been stuck with this problem for the past couple of hours and have finally found the culprit, BUT I have no idea why it behaves this way. When I split the string with a regex (.split("\\d") and .split("\\D")), I put the results into two arrays: an int[] containing the numbers in the expression and a String[] containing the operations.
What I've realized is that when I do the following:
String question = "5+9*8";
String[] mathOperations = question.split("\\d");
for(int i = 0; i < mathOperations.length; i++) {
System.out.println("Math Operation at " + i + " is " + mathOperations[i]);
}
it does not put the first operation sign in index 0, rather it puts it in index 1. Why is this?
This is the output on the console:
Math Operation at 0 is
Math Operation at 1 is +
Math Operation at 2 is *

Because position 0 of mathOperations holds an empty String. In other words:
mathOperations = {"", "+", "*"};
According to the split documentation:
The array returned by this method contains each substring of this
string that is terminated by another substring that matches the given
expression or is terminated by the end of the string. ...
Why isn't there an empty string at the end of the array too?
Trailing empty strings are therefore not included in the resulting
array.
More detailed explanation - your regex matched the String like this:
"(5)+(9)*(8)" -> "" + (5) + "+" + (9) + "*" + (8) + ""
but the trailing empty string is discarded as specified by the documentation.
(hope this silly illustration helps)
It is also worth noting that the regex you used, "\\d", would split the string "55+5" into
["", "", "+"]
That's because it matches only a single digit at a time; you should probably use "\\d+".
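A small sketch of the difference between "\\d" and "\\d+", plus the matching split for the numbers. The class and method names (DigitSplitDemo, operators, numbers) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo of splitting an expression into operators and numbers.
class DigitSplitDemo {
    // Split on runs of digits: what remains are the operators.
    static String[] operators(String expr) {
        return expr.split("\\d+");
    }

    // Split on runs of non-digits: what remains are the numbers.
    static String[] numbers(String expr) {
        return expr.split("\\D+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("55+5".split("\\d"))); // prints [, , +]
        System.out.println(Arrays.toString(operators("55+5")));   // prints [, +]
        System.out.println(Arrays.toString(numbers("55+5")));     // prints [55, 5]
    }
}
```

The leading empty string remains in the operator array because the expression starts with a digit, so code that pairs operators with numbers has to skip index 0.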

You may find the following variation on your program helpful, as one split does the jobs of both of yours...
public class zw {
    public static void main(String[] args) {
        String question = "85+9*8-900+77";
        String[] bits = question.split("\\b");
        for (int i = 0; i < bits.length; ++i) System.out.println("[" + bits[i] + "]");
    }
}
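and its output (the leading empty [] appears on Java 7 and earlier; since Java 8, a zero-width match at the start of the input no longer produces a leading empty string, so that first line is absent):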
[]
[85]
[+]
[9]
[*]
[8]
[-]
[900]
[+]
[77]
In this program, I used \b as a "zero-width boundary" to do the splitting. No characters were harmed during the split, they all went into the array.
More info here: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
and here: http://www.regular-expressions.info/wordboundaries.html
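If the leading empty element is unwanted regardless of Java version, one option is to split with lookarounds that can never match at index 0. This is a sketch of an alternative, not the approach used in the answer above; the class and method names are hypothetical, and it assumes the expression contains only digits and single-character operators.

```java
import java.util.Arrays;

// Hypothetical demo: split at every digit/non-digit boundary.
class TokenSplitDemo {
    static String[] tokens(String expr) {
        // Zero-width match between a digit and a non-digit, in either order.
        // The lookbehind needs a preceding character, so index 0 never matches.
        return expr.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("85+9*8-900+77")));
        // prints [85, +, 9, *, 8, -, 900, +, 77]
    }
}
```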

Java Regex Split \\S --> Strange result for String split method \\S

I am puzzled by the split method with a regex in Java. It is a rather theoretical question that popped up, and I can't figure it out.
I found this answer: Java split by \\S
but the advice to use \\s instead of \\S does not explain what is happening here.
Why does quote.split("\\S") return 2 results in case A and 8 in case B?
case A)
String quote = " x xxxxxx";
String[] words = quote.split("\\S");
System.out.print("\\S >>\t");
for (String word : words) {
System.out.print(":" + word);
}
System.out.println(words.length);
Result:
\\S >> : : 2
case B)
String quote = " x xxxxxx ";
String[] words = quote.split("\\S");
System.out.print("\\S >>\t");
for (String word : words) {
System.out.print(":" + word);
}
System.out.println(words.length);
Result:
\\S >> : : :::::: 8
It would be wonderful to understand what happens here. Thanks in advance.

As Jongware noticed, the documentation for String.split(String) says:
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.
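The effect of the limit argument described in that quote can be checked directly. A minimal sketch, using the two strings from the question; the class and method names are made up for illustration.

```java
// Hypothetical demo: limit 0 (the default) drops trailing empty strings,
// a negative limit keeps them.
class SplitLimitDemo {
    static int countDefault(String s, String regex) {
        return s.split(regex).length;      // same as split(regex, 0)
    }

    static int countKeepTrailing(String s, String regex) {
        return s.split(regex, -1).length;  // negative limit keeps trailing empties
    }

    public static void main(String[] args) {
        System.out.println(countDefault(" x xxxxxx", "\\S"));       // prints 2
        System.out.println(countKeepTrailing(" x xxxxxx", "\\S"));  // prints 8
        System.out.println(countDefault(" x xxxxxx ", "\\S"));      // prints 8
        System.out.println(countKeepTrailing(" x xxxxxx ", "\\S")); // prints 8
    }
}
```

In case B the final space is a non-empty last element, so nothing counts as a trailing empty string and all eight pieces survive.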
So it works somewhat like this:
"a:b:::::".split(":") === removeTrailing([a,b,,,,,]) === [a,b]
"a:b:::::c".split(":") === removeTrailing([a,b,,,,,c]) === [a,b,,,,,c]
And in your example:
" x xxxxxx".split("\\S") === removeTrailing([ , ,,,,,,]) === [ , ]
" x xxxxxx ".split("\\S") === removeTrailing([ , ,,,,,, ]) === [ , ,,,,,, ]
To collapse multiple consecutive delimiters into one, use the \\S+ pattern.
" x xxxxxx".split("\\S+") === removeTrailing([ , ,]) === [ , ]
" x xxxxxx ".split("\\S+") === removeTrailing([ , , ]) === [ , , ]
As suggested in the comments, to keep the trailing empty strings we can use the overloaded version of the split method (String.split(String, int)) with a negative number passed as the limit.
"a:b:::::".split(":", -1) === [a,b,,,,,]

String.replaceAll Strange Behaviour

String s = "hi hello";
s = s.replaceAll("\\s*", " ");
System.out.println(s);
I have the code above, but I can't work out why it produces
h i h e l l o
rather than
hi hello
Many thanks

Use the + quantifier to match one or more whitespace characters instead of *:
s = s.replaceAll("\\s+", " ");
\\s* matches zero or more whitespace characters, so it also matches the empty string before every character, and each of those empty matches is replaced by a space.

The * matches 0 or more spaces, I think you want to change it to + to match 1 or more spaces.
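A short sketch of the difference between the two quantifiers. The class and method names are hypothetical, and the inputs "ab" and "hi     hello" are made up to keep the effect easy to see.

```java
// Hypothetical demo: '*' also matches the empty string, '+' does not.
class WhitespaceCollapseDemo {
    // Collapse every run of whitespace into a single space.
    static String collapse(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // "\\s*" finds an empty match at every position, including the end.
        System.out.println("ab".replaceAll("\\s*", "-")); // prints -a-b-
        System.out.println(collapse("hi     hello"));      // prints hi hello
    }
}
```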
