Split string based on regex but keep delimiters - java

I'm trying to split a string using a variety of characters as delimiters and also keep those delimiters in their own array index. For example say I want to split the string:
if (x>1) return x * fact(x-1);
using '(', '>', ')', '*', '-', ';' and '\s' as delimiters. I want the output to be the following string array: {"if", "(", "x", ">", "1", ")", "return", "x", "*", "fact", "(", "x", "-", "1", ")", ";"}
The regex I'm using so far is
split("(?=(\\w+(?=[\\s\\+\\-\\*/<(<=)>(>=)(==)(!=)=;,\\.\"\\(\\)\\[\\]\\{\\}])))")
which splits at each word character regardless of whether it is followed by one of the delimiters. For example
test + 1
outputs {"t","e","s","t+","1"} instead of {"test+", "1"}
Why does it split at each character even if that character is not followed by one of my delimiters? Also is a regex which does this even possible in Java?
Thank you

Well, you can use lookaround to split at points between characters without consuming the delimiters:
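(?<=[()>*;\s-])|(?=[()>*;\s-])
This will create a split point before and after each delimiter character. The hyphen goes last in the character class so that it is a literal; written as *-; it would form a range from * to ; that also matches digits and the characters + , . / :. You might need to remove superfluous whitespace elements from the resulting array, though.
Quick PowerShell test (| marks the split points):
PS Home:\> 'if (x>1) return x * fact(x-1);' -split '(?<=[()>*;\s-])|(?=[()>*;\s-])' -join '|'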
if| |(|x|>|1|)| |return| |x| |*| |fact|(|x|-|1|)|;|
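The same idea can be sketched in Java. This is a minimal example, not the answerer's code: the class name KeepDelimiterSplit and the method tokenize are made up for illustration, and the hyphen is placed last in the character class so that it is a literal rather than part of a range. Whitespace-only elements are filtered out after the split.

```java
import java.util.Arrays;

class KeepDelimiterSplit {
    // Zero-width split points before and after each delimiter character.
    // The hyphen is last in the class, so it is a literal hyphen.
    private static final String SPLIT_POINTS = "(?<=[()>*;\\s-])|(?=[()>*;\\s-])";

    // Hypothetical helper: splits around delimiters and drops whitespace-only elements.
    static String[] tokenize(String input) {
        return Arrays.stream(input.split(SPLIT_POINTS))
                .filter(token -> !token.trim().isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // prints [if, (, x, >, 1, ), return, x, *, fact, (, x, -, 1, ), ;]
        System.out.println(Arrays.toString(tokenize("if (x>1) return x * fact(x-1);")));
    }
}
```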

How about this pattern?
(\w+)|([\p{P}\p{S}])
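This pattern describes the tokens rather than the split points, so it would be used with a Matcher instead of split(). A minimal sketch, assuming that is the intent; the class name MatchTokens and the method tokenize are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTokens {
    // A run of word characters, or a single punctuation or symbol character.
    private static final Pattern TOKEN = Pattern.compile("(\\w+)|([\\p{P}\\p{S}])");

    // Hypothetical helper: collects every match in order; whitespace is skipped
    // because it matches neither alternative.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [if, (, x, >, 1, ), return, x, *, fact, (, x, -, 1, ), ;]
        System.out.println(tokenize("if (x>1) return x * fact(x-1);"));
    }
}
```

Note that every punctuation or symbol character becomes its own token here, not only the delimiters listed in the question.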

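To answer your question, "Why?", it's because your entire expression is a lookahead assertion, so split can split at every position where that assertion is true. The \w+ inside it only has to match some run of word characters that ends right before a delimiter, and that holds before every character of such a word, not just before its first one.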
Also, you cannot group within character classes, e.g. (<=) is not doing what you think it is doing.
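A small check of the character-class point, as an illustration only. The class name CharClassVsGroup and both patterns are chosen for this sketch and are not taken from the question:

```java
import java.util.regex.Pattern;

class CharClassVsGroup {
    // Inside [...], the characters ( < = ) are four separate literals, not a group.
    static final Pattern IN_CLASS = Pattern.compile("[(<=)]");
    // Multi-character operators need alternation outside a character class,
    // with the longer alternatives listed first.
    static final Pattern ALTERNATION = Pattern.compile("<=|>=|==|!=|[<>=]");

    public static void main(String[] args) {
        System.out.println(IN_CLASS.matcher("<=").matches());    // false: a class matches exactly one character
        System.out.println(IN_CLASS.matcher("(").matches());     // true: "(" is just a member of the class
        System.out.println(ALTERNATION.matcher("<=").matches()); // true
    }
}
```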

Related

Java regex splitting, but only removing one whitespace

I have this code:
String[] parts = sentence.split("\\s");
and a sentence like: "this is a whitespace   and I want to split it" (note there are 3 whitespaces after "whitespace")
I want to split it in a way, where only the last whitespace will be removed, keeping the original message intact. The output should be
"[this], [is], [a], [whitespace  ], [and], [I], [want], [to], [split], [it]"
(two whitespaces after the word "whitespace")
Can I do this with regex and if not, is there even a way?
I removed the + from \\s+ to only remove one whitespace
You can use
String[] parts = sentence.split("\\s(?=\\S)");
That will split at a whitespace char that is immediately followed by a non-whitespace char.
See the regex demo. Details:
\s - a whitespace char
(?=\S) - a positive lookahead that requires a non-whitespace char to appear immediately to the right of the current location.
To make it fully Unicode-aware in Java, add the (?U) (Pattern.UNICODE_CHARACTER_CLASS option equivalent) embedded flag option: .split("(?U)\\s(?=\\S)").
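A minimal runnable sketch of this split, with a hypothetical class SplitOnLastSpace and helper splitKeepingExtraSpaces. The sample sentence has three spaces after "whitespace", as described in the question:

```java
class SplitOnLastSpace {
    // Hypothetical helper: consumes only a whitespace character that is
    // directly followed by a non-whitespace character.
    static String[] splitKeepingExtraSpaces(String sentence) {
        return sentence.split("\\s(?=\\S)");
    }

    public static void main(String[] args) {
        String sentence = "this is a whitespace   and I want to split it";
        for (String part : splitKeepingExtraSpaces(sentence)) {
            // Brackets make the preserved spaces visible, e.g. [whitespace  ]
            System.out.println("[" + part + "]");
        }
    }
}
```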

Java regex - split string with leading special characters

I am trying to split a string that contains whitespaces and special characters. The string starts with special characters.
When I run the code, the first array element is an empty string.
String s = ",hm ..To?day,.. is not T,uesday.";
String[] sArr = s.split("[^a-zA-Z]+\\s*");
Expected result is ["hm", "To", "day", "is", "not", "T", "uesday"]
Can someone explain how this is happening?
Actual result is ["", "hm", "To", "day", "is", "not", "T", "uesday"]
Split is behaving as expected by splitting off a zero-length string at the start before the first comma.
To fix, first remove all splitting chars from the start:
String[] sArr = s.replaceAll("^([^a-zA-Z]*\\s*)*", "").split("[^a-zA-Z]+\\s*");
Note that I’ve altered the removal regex to trim any sequence of spaces and non-letters from the front.
You don’t need to remove from the tail because split discards empty trailing elements from the result.
I would simplify it by making it a two-step process rather than trying to achieve a pure regex split() operation:
s.replaceAll("[^a-zA-Z]+", " ").trim().split(" ")
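Both answers come down to removing the leading delimiters before splitting. The sketch below is a simplified variant, not either answerer's exact code: it strips leading non-letters with replaceFirst and then splits on runs of non-letters. The class name LeadingDelimiters and the method words are assumptions made for the example.

```java
import java.util.Arrays;

class LeadingDelimiters {
    // Hypothetical helper: strip leading non-letters so split() finds no
    // delimiter match at index 0, then split on runs of non-letters.
    static String[] words(String s) {
        return s.replaceFirst("^[^a-zA-Z]+", "").split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        String s = ",hm ..To?day,.. is not T,uesday.";
        // prints [hm, To, day, is, not, T, uesday]
        System.out.println(Arrays.toString(words(s)));
    }
}
```

If the input contains no letters at all, the result is an array holding one empty string, so callers may want to check for that case.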

Java String Split() Method

I was wondering what the following line would do:
String[] parts = inputLine.split("\\s+");
Would this simply split the string at any spaces in the line? I think this is a regex, but I've never seen them before.
Yes, as the documentation states, split takes a regex as its argument.
In regex, \s is a character class that matches whitespace characters such as:
tab \t,
space " ",
line separators \n \r
more...
+ is a quantifier that can be read as "one or more", so \s+ matches a run of one or more whitespace characters.
We need to write this regex as "\\s+" (with two backslashes) because in a String literal \ is a special character that must be escaped (with another backslash) to produce a literal \.
So split("\\s+") will produce an array of tokens separated by one or more whitespace characters. BTW, trailing empty elements are removed, so "a b c ".split("\\s+") will return the array ["a", "b", "c"], not ["a", "b", "c", ""].
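A short sketch of the edge cases around empty elements. The class name WhitespaceSplit and the helper tokens are hypothetical; the split calls themselves are plain String.split:

```java
import java.util.Arrays;

class WhitespaceSplit {
    // Hypothetical helper: split on runs of whitespace (spaces, tabs, newlines, ...).
    static String[] tokens(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        // Trailing empty strings are dropped: [a, b, c]
        System.out.println(Arrays.toString(tokens("a \t b\nc ")));
        // A leading delimiter still yields an empty first element: [, a, b]
        System.out.println(Arrays.toString(tokens(" a b")));
        // A negative limit keeps trailing empty strings: [a, b, c, ]
        System.out.println(Arrays.toString("a b c ".split("\\s+", -1)));
    }
}
```

Calling trim() on the input before splitting is one way to avoid the empty first element.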
Yes, though actually any number of space meta-characters (including tabs, newlines etc). See the Java documentation on Patterns.
It will split the string on one (or more) consecutive white space characters. The Pattern Javadoc describes the Predefined character classes (of which \s is one) as,
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
Note that the \\ is to escape the back-slash as required to embed it in a String.
Yes, and it splits both tab and space:
String t = "test your function aaa";
for (String s : t.split("\\s+")) {
    System.out.println(s);
}
Output:
test
your
function
aaa

Java - What is the right regex for my String.split()

I am splitting equation string into string array like this:
String[] equation_array = (equation.split("(?<=[-+×÷)(])|(?=[-+×÷)(])"));
Now for test string:
test = "4+(2×5)"
result is fine:
test_array = {"4", "+", "(", "2",...}
but for test string:
test2 = "(2×5)+5"
I got string array:
test2_array = {"", "(", "×",...}.
So, problem is why does it add an empty string before ( in array after splitting?
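This is known behavior of String.split in Java 7 and earlier. Since Java 8, a zero-width match at the beginning of the input no longer produces an empty leading substring.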
To avoid this empty result use this negative lookahead based regex:
String[] equation_array = "(2×5)+5".split("(?!^)((?<=[-+×÷)(])|(?=[-+×÷)(]))");
//=> ["(", "2", "×", "5", ")", "+", "5"]
(?!^) is a negative lookahead that prevents a split at the start of the string.
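A runnable sketch of this split, under a few assumptions: the class name EquationSplit and the method tokens are made up, and the multiplication and division signs are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.Arrays;

class EquationSplit {
    // Operator and parenthesis characters; U+00D7 is the multiplication sign
    // and U+00F7 is the division sign.
    private static final String OPS = "[-+\u00D7\u00F7)(]";
    // Split after or before an operator, but never at the start of the string.
    private static final String SPLIT = "(?!^)((?<=" + OPS + ")|(?=" + OPS + "))";

    // Hypothetical helper wrapping the split call.
    static String[] tokens(String equation) {
        return equation.split(SPLIT);
    }

    public static void main(String[] args) {
        // Seven tokens, with no empty string before the opening parenthesis.
        System.out.println(Arrays.toString(tokens("(2\u00D75)+5")));
        System.out.println(Arrays.toString(tokens("4+(2\u00D75)")));
    }
}
```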
You can add a condition not to split when the position before the token is the start of the string, like
"(?<=[-+×÷)(])|(?<!^)(?=[-+×÷)(])"
               ^^^^^^
What about looking backwards to make sure we're not at the start of the string, and looking forwards to make sure we're not at the end?
"(?<=[-+×÷)(])(?!$)|(?<!^)(?=[-+×÷)(])"
Here ^ and $ are start and end of string indicators and (?!...) and (?<!...) are negative lookahead and lookbehind.
problem is why does it add an empty string before ( in array after splitting?
Because for the input (2×5)+5 the regex used for splitting matches right at the start-of-string because of the positive look ahead (?=[-+×÷)(]).
(2×5)+5
↖
It matches right here before the (, resulting in an empty string: "".
My advice would be not to use regular expressions to parse mathematical expressions; there are more suitable algorithms for this.

Java (Regex) - Get all words in a sentence

I need to split a java string into an array of words. Let's say the string is:
"Hi!! I need to split this string, into a serie's of words?!"
At the moment I've tried using String[] strs = str.split("(?!\\w)"); however, it keeps symbols such as ! in the array, and it also keeps strings like "Hi!" in the array. The string I am splitting will always be lowercase. What I would like is for an array to be produced that looks like:
{"hi", "i", "need", "to", "split", "this", "string", "into", "a", "serie's", "of", "words"} - Note the apostrophe is kept.
How could I change my regex to not include the symbols in the array?
Apologies, I would define a word as a sequence of alphanumeric characters only, but with the ' character included if it is used in the above context such as "it's", not if it is used to quote a word such as "'its'". Also, in this context "hi," or "hi-person" are not words but "hi" and "person" are. I hope that clarifies the question.
You can remove the !, ? and , symbols and then split into words
str = str.replaceAll("[!?,]", "");
String[] words = str.split("\\s+");
Result:
Hi, I, need, to, split, this, string, into, a, serie's, of, words
Should work for what you want.
String line = "Hi!! I need to split this string, into a serie's of words?! but not '' or ''' word";
String regex = "([^a-zA-Z']+)'*\\1*";
String[] split = line.split(regex);
System.out.println(Arrays.asList(split));
Gives
[Hi, I, need, to, split, this, string, into, a, serie's, of, words, but, not, or, word]
If you define a word as a sequence of non-whitespace characters (whitespace character as defined by \s), then you can split along space characters:
str.split("\\s+")
Note that ";.';.##$>?>#4", "very,bad,punctuation", and "'goodbye'" are words under the definition above.
Then the other approach is to define a word as a sequence of characters from a set of allowed characters. If you want to allow a-z, A-Z, and ' as part of a word, you can split along everything else:
str.split("[^a-zA-Z']+")
This will still allow "''''''" to be defined as a word, though.
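A minimal sketch of the allowed-characters approach, lowercasing first so that the result matches the array asked for in the question. The class name WordSplit and the method words are hypothetical:

```java
import java.util.Arrays;
import java.util.Locale;

class WordSplit {
    // Hypothetical helper: lowercase the input, then split on every run of
    // characters that are neither letters nor apostrophes.
    static String[] words(String sentence) {
        return sentence.toLowerCase(Locale.ROOT).split("[^a-z']+");
    }

    public static void main(String[] args) {
        String str = "Hi!! I need to split this string, into a serie's of words?!";
        // prints [hi, i, need, to, split, this, string, into, a, serie's, of, words]
        System.out.println(Arrays.toString(words(str)));
    }
}
```

If the input starts with a non-letter, the first element is an empty string, so leading punctuation may need to be stripped beforehand.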
So what you want is to split on anything that is neither a letter [a-zA-Z] nor a '
This regex will do that: "[^a-zA-Z']+"
There will be a problem if the string contains text that is quoted with '
I usually use this page for testing my regexes:
http://www.regexplanet.com/advanced/java/index.html
I would use str.split("[\\s,?!]+"). You can add whatever character you want to split with inside the brackets [].
You could filter out the characters you deem as "non-word" characters:
String[] strs = str.split("[,!? ]+");
myString.replaceAll("[^a-zA-Z'\\s]","").toLowerCase().split("\\s+");
The replaceAll("[^a-zA-Z'\\s]","") call replaces every character that is not a-z, A-Z, ' or whitespace with nothing (""). The toLowerCase method then makes all the characters returned from replaceAll lower case. Finally, the string is split on whitespace. A more readable version:
myString = myString.replaceAll("[^a-zA-Z'\\s]","");
myString = myString.toLowerCase();
String[] strArr = myString.split("\\s+");
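None of the split-based approaches distinguishes an apostrophe inside a word ("it's") from one used as a quote ("'its'"). One possible alternative, sketched here under the assumption that a word is a run of alphanumerics optionally joined by single apostrophes, is to match words with a Matcher instead of splitting. The class name WordMatcher and the pattern are illustrative, not taken from any of the answers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcher {
    // Alphanumeric runs; an apostrophe is accepted only between two such runs.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z0-9]+(?:'[a-zA-Z0-9]+)*");

    // Hypothetical helper: returns every matched word in lower case.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [hi, its, it's, hi, person, serie's]
        System.out.println(words("Hi!! 'its' it's hi-person, serie's?!"));
    }
}
```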
