Java - What is the right regex for my String.split()

I am splitting an equation string into a string array like this:
String[] equation_array = (equation.split("(?<=[-+×÷)(])|(?=[-+×÷)(])"));
Now for test string:
test = "4+(2×5)"
result is fine:
test_array = {"4", "+", "(", "2",...}
but for test string:
test2 = "(2×5)+5"
I got string array:
test2_array = {"", "(", "2",...}.
So the problem is: why does it add an empty string before ( in the array after splitting?

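This is known behavior of String.split() in Java 7 and earlier: a zero-width match at the very start of the input produces an empty leading string. (Since Java 8, split() no longer adds a leading empty string for a zero-width match at the start.)
To avoid the empty result on any version, use this negative-lookahead-based regex: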
String[] equation_array = "(2×5)+5".split("(?!^)((?<=[-+×÷)(])|(?=[-+×÷)(]))");
//=> ["(", "2", "×", "5", ")", "+", "5"]
The (?!^) prefix is a negative lookahead that prevents a split at the start of the string.

You can add a condition not to split when the position before the token is the start of the string, like this:
"(?<=[-+×÷)(])|(?<!^)(?=[-+×÷)(])"
               ^^^^^^

What about looking backwards to make sure we're not at the start of the string, and looking forwards to make sure we're not at the end?
"(?<=[-+×÷)(])(?!$)|(?<!^)(?=[-+×÷)(])"
Here ^ and $ are start and end of string indicators and (?!...) and (?<!...) are negative lookahead and lookbehind.
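A minimal sketch for checking the three guarded patterns above against both test strings from the question. The class name EquationSplitDemo and the OPS and PATTERNS constants are made up for illustration, and × and ÷ are written as the escapes \u00D7 and \u00F7 only to keep the source ASCII-only.

```java
import java.util.Arrays;

final class EquationSplitDemo {
    // Operator/parenthesis class from the question.
    // \u00D7 is the multiplication sign, \u00F7 the division sign.
    static final String OPS = "[-+\u00D7\u00F7)(]";

    // The three guarded patterns suggested in the answers (illustrative constant).
    static final String[] PATTERNS = {
        "(?!^)((?<=" + OPS + ")|(?=" + OPS + "))",
        "(?<=" + OPS + ")|(?<!^)(?=" + OPS + ")",
        "(?<=" + OPS + ")(?!$)|(?<!^)(?=" + OPS + ")"
    };

    static String[] split(String equation, String pattern) {
        return equation.split(pattern);
    }

    public static void main(String[] args) {
        for (String p : PATTERNS) {
            // Each pattern should yield the same tokens, with no empty first element.
            System.out.println(Arrays.toString(split("(2\u00D75)+5", p)));
            System.out.println(Arrays.toString(split("4+(2\u00D75)", p)));
        }
    }
}
```

On Java 8 or later the unguarded pattern from the question also returns no leading empty string, so the guards only change the result on older runtimes.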

problem is why does it add an empty string before ( in array after splitting?
Because for the input (2×5)+5 the regex used for splitting matches right at the start-of-string because of the positive look ahead (?=[-+×÷)(]).
(2×5)+5
↖
It matches right here before the (, resulting in an empty string: "".
My advice would be not to use regular expressions to parse mathematical expressions; there are more suitable algorithms for this.
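The answer does not name a specific algorithm. One common middle ground, shown here purely as an assumption and not as the answerer's suggestion, is to use a regex only to scan tokens with Matcher.find() and leave the actual parsing to a dedicated parser. The class name EquationTokenizer and the TOKEN pattern are hypothetical; the pattern accepts unsigned decimal numbers and the operators from the question, and silently skips anything else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EquationTokenizer {
    // A number (optionally with a fractional part) or a single operator/parenthesis.
    // \u00D7 is the multiplication sign, \u00F7 the division sign.
    private static final Pattern TOKEN =
        Pattern.compile("\\d+(?:\\.\\d+)?|[-+\u00D7\u00F7()]");

    static List<String> tokenize(String equation) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(equation);
        // find() walks the input left to right and returns each token in turn;
        // characters that match neither alternative are skipped.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(2\u00D75)+5"));
        System.out.println(tokenize("12+3.5"));
    }
}
```

Because tokens are matched rather than split, multi-digit numbers stay together and no empty strings can appear in the result.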

Related

Java regex - split string with leading special characters

I am trying to split a string that contains whitespaces and special characters. The string starts with special characters.
When I run the code, the first array element is an empty string.
String s = ",hm ..To?day,.. is not T,uesday.";
String[] sArr = s.split("[^a-zA-Z]+\\s*");
Expected result is ["hm", "To", "day", "is", "not", "T", "uesday"]
Can someone explain how this is happening?
Actual result is ["", "hm", "To", "day", "is", "not", "T", "uesday"]
Split is behaving as expected by splitting off a zero-length string at the start before the first comma.
To fix, first remove all splitting chars from the start:
String[] sArr = s.replaceAll("^([^a-zA-Z]*\\s*)*", "").split("[^a-zA-Z]+\\s*");
Note that I’ve altered the removal regex to trim any sequence of spaces and non-letters from the front.
You don’t need to remove from the tail because split discards empty trailing elements from the result.
I would simplify it by making it a two-step process rather than trying to achieve a pure regex split() operation:
s.replaceAll("[^a-zA-Z]+", " ").trim().split(" ")
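A small sketch of the two-step approach, wrapped in a helper so it can be tried on the string from the question. The class name LeadingDelimiterDemo and the method words are illustrative only, and the empty-input check is an addition that the answer does not mention.

```java
import java.util.Arrays;

final class LeadingDelimiterDemo {
    static String[] words(String s) {
        // Collapse every run of non-letters to one space, then drop the spaces at both ends.
        String cleaned = s.replaceAll("[^a-zA-Z]+", " ").trim();
        // "".split(" ") returns an array holding one empty string, so handle that case explicitly.
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words(",hm ..To?day,.. is not T,uesday.")));
        // prints [hm, To, day, is, not, T, uesday]
    }
}
```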

Splitting string in java produces empty element first

I'm trying to split a string that contains single or multiple occurrences of "O", where all other characters are dots. I'm wondering why this produces an empty string first.
String row = ".....O.O.O";
String[] arr = row.split("\\.+");
This produces:
["", "O", "O", "O"]
You just need to make sure that any trailing or leading dots are removed.
So one solution is:
row.replaceAll("^\\.+|\\.+$", "").split("\\.+");
For this pattern you can use replaceFirst() to drop the leading dots and then split on the remaining dots:
String[] arr = row.replaceFirst("^\\.+", "").split("\\.+");
Output will be
["O","O","O"]
The "+" quantifier makes each run of separators count as a single separator, so what your split is essentially doing is splitting the following string on ".":
.O.O.O
This, of course, means that your first field is empty. Hence the result you get.
To avoid this, strip all leading separators from the string before splitting it. Rather than type some examples on how to do this, here's a thread with a few suggestions.
Java - Trim leading or trailing characters from a string?
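As a hedged illustration of stripping the separators before splitting, the sketch below applies the replaceAll pattern from the earlier answer to the row from the question. The class name StripDotsDemo and the method fields are made up for this example.

```java
import java.util.Arrays;

final class StripDotsDemo {
    static String[] fields(String row) {
        // Remove the run of dots at the very start and at the very end,
        // then split on each remaining run of dots.
        return row.replaceAll("^\\.+|\\.+$", "").split("\\.+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields(".....O.O.O"))); // prints [O, O, O]
    }
}
```

Stripping the trailing dots is not strictly needed, since split() already drops trailing empty strings, but it keeps the intent explicit.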

how to ignore newlines for split function

I am splitting the string using ^ char. The String which I am reading, is coming from some external source. This string contains some \n characters.
The string may look like:
Hi hello^There\nhow are\nyou doing^9987678867abc^popup
When I split it as below, why is the array length 2 instead of 4?
String[] st = msg[0].split("^");
st.length //giving "2" instead of "4"
It looks like split is ignoring everything after \n.
How can I fix it without replacing \n with some other character?
The string parameter for split is interpreted as a regular expression.
So you have to escape the character and use:
msg[0].split("\\^")
see this answer for more details
Escape the ^ character. Use msg[0].split("\\^") instead.
String.split considers its argument as regular expression. And as ^ has a special meaning when it comes to regular expressions, you need to escape it to use its literal representation.
If you want to split by ^ only, then
String[] st = msg[0].split("\\^");
If I read your question correctly, you want to split by both ^ and \n, so this would suffice:
String[] st = msg[0].split("\\^|\\\\n");
This assumes that \n literally exists as two characters (a backslash followed by n) in the string.
The JDK treats "^" as a regular-expression metacharacter.
To avoid this confusion, you need to modify the code as below:
old code = msg[0].split("^")
new code = msg[0].split("\\^")
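A minimal sketch of the fix, assuming the message contains real newline characters. It uses Pattern.quote, which builds a literal pattern and is equivalent here to writing "\\^" by hand. The class name CaretSplitDemo and the method splitOnCaret are illustrative.

```java
import java.util.regex.Pattern;

final class CaretSplitDemo {
    static String[] splitOnCaret(String msg) {
        // Pattern.quote makes the delimiter a literal, so ^ loses its regex meaning.
        return msg.split(Pattern.quote("^"));
    }

    public static void main(String[] args) {
        String msg = "Hi hello^There\nhow are\nyou doing^9987678867abc^popup";
        String[] st = splitOnCaret(msg);
        System.out.println(st.length); // prints 4
    }
}
```

The newlines stay inside the second element; only the literal ^ characters act as delimiters.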

Split string based on regex but keep delimiters

I'm trying to split a string using a variety of characters as delimiters and also keep those delimiters in their own array index. For example say I want to split the string:
if (x>1) return x * fact(x-1);
using '(', '>', ')', '*', '-', ';' and '\s' as delimiters. I want the output to be the following string array: {"if", "(", "x", ">", "1", ")", "return", "x", "*", "fact", "(", "x", "-", "1", ")", ";"}
The regex I'm using so far is
split("(?=(\\w+(?=[\\s\\+\\-\\*/<(<=)>(>=)(==)(!=)=;,\\.\"\\(\\)\\[\\]\\{\\}])))")
which splits at each word character regardless of whether it is followed by one of the delimiters. For example
test + 1
outputs {"t","e","s","t+","1"} instead of {"test+", "1"}
Why does it split at each character even if that character is not followed by one of my delimiters? Also is a regex which does this even possible in Java?
Thank you
Well, you can use lookaround to split at points between characters without consuming the delimiters:
(?<=[()>*;\s-])|(?=[()>*;\s-])
This will create a split point before and after each delimiter character. The hyphen is placed last in the character class so that it is taken literally; written as *-; it would form a range that also covers the digits. You might need to remove superfluous whitespace elements from the resulting array, though.
Quick PowerShell test (| marks the split points):
PS Home:\> 'if (x>1) return x * fact(x-1);' -split '(?<=[()>*;\s-])|(?=[()>*;\s-])' -join '|'
if| |(|x|>|1|)| |return| |x| |*| |fact|(|x|-|1|)|;|
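The same idea can be sketched in Java, since the question asks about Java. This is only an illustration: the class name KeepDelimitersDemo and its constants are hypothetical, the hyphen is written last in the character class so that it is a literal, and whitespace-only tokens are filtered out after the split.

```java
import java.util.Arrays;

final class KeepDelimitersDemo {
    // Delimiters from the question: ( ) > * ; whitespace and a literal hyphen.
    private static final String DELIMS = "[()>*;\\s-]";
    // Zero-width split points directly after or directly before any delimiter.
    private static final String SPLIT_POINTS = "(?<=" + DELIMS + ")|(?=" + DELIMS + ")";

    static String[] tokenize(String source) {
        return Arrays.stream(source.split(SPLIT_POINTS))
                     .filter(t -> !t.trim().isEmpty()) // drop whitespace-only tokens
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("if (x>1) return x * fact(x-1);")));
    }
}
```

For the sample statement this yields the sixteen tokens listed in the question, with each delimiter in its own array slot.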
How about this pattern?
(\w+)|([\p{P}\p{S}])
To answer your question, "Why?", it's because your entire expression is a lookahead assertion. As long as that assertion is true at each character (or maybe I should say "between"), it is able to split.
Also, you cannot group within character classes, e.g. (<=) is not doing what you think it is doing.

Matching strings and storing into an array with regex in java

I'm making a program that takes a file and finds identifiers. So far I removed any words in quotes, any words that start with a number, and all the non-word characters.
Is there a way to find words that don't match words in an array and store those words in another array using regex? I can't figure it out. I was trying to use the split method, but it's not working right when I try to split by spaces. This is what I did to split it:
String[] SplitString = newLine.split("[\\s]");
Use
String[] SplitString = newLine.split("\\s");
if you don't want to combine multiple spaces/tabs, etc., but use
String[] SplitString = newLine.split("\\s+");
if you do. For example, if your string is (with two spaces between a and b):
"a  b c"
the first will give you four tokens: "a", "", "b", and "c", and the second will give you three: "a", "b", and "c".
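A short sketch that shows the difference on a string with two spaces between a and b. The class and method names are made up for illustration.

```java
import java.util.Arrays;

final class WhitespaceSplitDemo {
    // Splits on every single whitespace character.
    static String[] splitEach(String line) {
        return line.split("\\s");
    }

    // Splits on each run of one or more whitespace characters.
    static String[] splitRuns(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        String line = "a  b c"; // two spaces between a and b
        System.out.println(Arrays.toString(splitEach(line))); // prints [a, , b, c]
        System.out.println(Arrays.toString(splitRuns(line))); // prints [a, b, c]
    }
}
```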
You can do it simply in one line by first removing the known words, then splitting:
String[] unknownWords = newLine.replaceAll("\\b(apple|orange|banana)\\b", "").split("\\s+");
Notes:
Your regex [\s] is equivalent to \s, so I simplified it
You should probably split on any number of spaces: \s+
\b means "word boundary" - this means the removal regex won't match applejack
The regex (A|B|C|etc) is the syntax for "OR" logic
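One caveat with the one-liner, shown here as a hedged sketch: if the line starts with a known word, removing it leaves leading whitespace, and split("\\s+") then returns an empty first element. Trimming before the split avoids that. The class name UnknownWordsDemo and the sample line are assumptions; the word list is the one from the answer.

```java
import java.util.Arrays;

final class UnknownWordsDemo {
    static String[] unknownWords(String line) {
        // Remove the known words as whole words, then trim the leftover whitespace.
        String remaining = line.replaceAll("\\b(apple|orange|banana)\\b", "").trim();
        // Avoid returning an array holding one empty string when nothing is left.
        return remaining.isEmpty() ? new String[0] : remaining.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(unknownWords("apple pie banana applejack")));
        // prints [pie, applejack]
    }
}
```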
