Java String Split using Regex with Escape Character

I have a string which needs to be split on a delimiter (:). The delimiter can be escaped by a character (say '?'), and it can be preceded by any number of escape characters. Consider the example string below:
a:b?:c??:d???????:e
Here, after the split, it should give the below list of string:
a
b?:c??
d???????:e
Basically, if the delimiter (:) is preceded by an even number of escape characters (including zero), it should split. If it is preceded by an odd number of escape characters, it should not split. Is there a regex solution for this?
Any help would be greatly appreciated.
A similar question has been asked earlier here, but the answers do not work for this use case.
Update:
The regex (?:\?.|[^:?])* splits the string correctly, but it also returns a few empty strings. If + is used instead of *, the genuinely empty tokens are dropped as well (e.g. a::b gives only a and b).

Scenario 1: No empty matches
You may use
(?:\?.|[^:?])+
Or, following the pattern in the linked answer
(?:\?.|[^:?]++)+
See this regex demo
Details
(?: - start of a non-capturing group
\?. - a ? (the escape char) followed by any char
| - or
[^:?] - any char but the : (your delimiter char) and ? (the escape char)
)+ - 1 or more repetitions.
In Java:
String regex = "(?:\\?.|[^:?]++)+";
In case the input contains line breaks, prepend the pattern with (?s) (like (?s)(?:\\?.|[^:?])+) or compile the pattern with Pattern.DOTALL flag.
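As a quick check of Scenario 1, here is a minimal sketch that collects the matches of the pattern into a list. The class name EscapedSplitDemo and the helper tokens are made up for illustration; only the regex comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedSplitDemo {
    // Regex from the answer: either an escaped char (? followed by any char)
    // or a run of chars that are neither ':' nor '?'. (?s) lets '.' match line breaks.
    private static final Pattern TOKEN = Pattern.compile("(?s)(?:\\?.|[^:?]++)+");

    // Hypothetical helper: returns every non-empty token found in the input.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a:b?:c??:d???????:e")); // [a, b?:c??, d???????:e]
        System.out.println(tokens("a::b"));                // [a, b] (the empty token is skipped)
    }
}
```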
Scenario 2: Empty matches included
You may add (?<=:)(?=:) alternative to the above pattern to match empty strings between : chars, see this regex demo:
String s = "::a:b?:c??::d???????:e::";
Pattern pattern = Pattern.compile("(?>\\?.|[^:?])+|(?<=:)(?=:)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println("'" + matcher.group() + "'");
}
Output of the Java demo:
''
'a'
'b?:c??'
''
'd???????:e'
''
Note that if you want to also match empty strings at the start/end of the string, use (?<![^:])(?![^:]) rather than (?<=:)(?=:).
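The start/end variant from the note can be checked the same way. This is only a sketch: the class name EdgeEmptyTokensDemo and the helper tokens are hypothetical, and only the sample input from the demo is verified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EdgeEmptyTokensDemo {
    // Token pattern from Scenario 2, with the empty-match alternative replaced by
    // (?<![^:])(?![^:]): a position that has no non-':' char before it or after it.
    private static final Pattern TOKEN =
            Pattern.compile("(?>\\?.|[^:?])+|(?<![^:])(?![^:])");

    // Hypothetical helper: returns all tokens, including empty ones.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [, , a, b?:c??, , d???????:e, , ] (eight tokens, two empty at each end)
        System.out.println(tokens("::a:b?:c??::d???????:e::"));
    }
}
```

Inputs in which an escaped : is immediately followed by a real delimiter (for example a?::) may yield an extra empty match, because the lookbehind cannot tell an escaped : from a delimiter, so it is worth testing such cases separately.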

Related

Regex: string can contain spaces, but not only spaces. It cannot contain `*` nor `:` characters either

I need help finding a regex that will allow most strings, except:
if the string only contains whitespaces
if the string contains : or *
I want to reject the following strings:
"hello:world"
"hello*world"
" " (just a whitespace)
But the following strings will pass:
"hello world"
"hello"
So far, I can accomplish what I want... in two patterns.
[^:*]* rejects the 2 special characters
.*\S.* rejects any string with only whitespaces
I'm not sure how to combine these two patterns into one...
I'll be using the regex pattern along with Java.
An example of how you could combine your two patterns for use with the matches method:
"[^:*]*[^:*\\s][^:*]*"
[^\s] is equivalent to \S.
You could use a negative lookahead:
^(?!\s*$)[^:*]+$
^ - start of string anchor
(?!\s*$) negative lookahead rejecting whitespace-only strings
[^:*]+ - one or more of any character except : and *
$ - end of string anchor
Demo
You can use matches to match the whole string; in a Java string literal the backslashes are doubled:
\\s*[^\\s:*][^:*]*
Explanation
\s* Match optional whitespace chars
[^\s:*] Match a non whitespace char other than : and *
[^:*]* Match optional chars other than : and *
See a regex demo.
As \s can also match a newline, use this variant if you don't want the match to span newlines:
\\h*[^\\s:*][^\\r\\n:*]*
Explanation
\h* Match optional horizontal whitespace chars
[^\s:*] Match a non whitespace char other than : and *
[^\r\n:*]* Match optional chars other than :, * and newlines
See another regex demo.
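To compare the suggestions side by side, the following sketch runs three of the patterns above against the sample strings from the question, using Matcher.matches() so the whole string must match. The class name NonBlankNoColonStarDemo and the helper verdicts are invented for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class NonBlankNoColonStarDemo {
    // Three patterns from the answers, written as Java string literals.
    static final Pattern COMBINED = Pattern.compile("[^:*]*[^:*\\s][^:*]*");
    static final Pattern LOOKAHEAD = Pattern.compile("^(?!\\s*$)[^:*]+$");
    static final Pattern LEADING_WS = Pattern.compile("\\s*[^\\s:*][^:*]*");

    // Hypothetical helper: one verdict per pattern, in the order declared above.
    static boolean[] verdicts(String s) {
        return new boolean[] {
            COMBINED.matcher(s).matches(),
            LOOKAHEAD.matcher(s).matches(),
            LEADING_WS.matcher(s).matches()
        };
    }

    public static void main(String[] args) {
        String[] samples = {"hello:world", "hello*world", " ", "hello world", "hello"};
        for (String s : samples) {
            // The first three samples are rejected by every pattern, the last two accepted.
            System.out.println("\"" + s + "\" -> " + Arrays.toString(verdicts(s)));
        }
    }
}
```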

Problem with regex creation if escape character is at the end of the parameter value

I get three parameters in a string. Each parameter is written in the form: Quotes, Name, Quotes, Equals sign, Quotes, Text, Quotes. The parameter separator is a space.
Example 1:
"param1"="Peter" "param2"="Harald" "param3"="Marie"
With java.util.regex.Matcher I can find any name and text by the following regex:
"([^"]*)"\s*=\s*"([^"]*)"
Now, however, there may be a quotation mark in the text. It is escaped with a backslash.
Example 2:
"param1"="Peter" "param2"="Har\"ald" "param3"="Marie"
I have built the following regex:
"([^"]*)"\s*=\s*("([^"]*(\\")*[^"]*)*[^\\]")
This works well for example 2, but is not a universal solution.
If the backslash is at the end of a parameter-value, the solution does not work anymore.
Example 3:
"param1"="Peter" "param2"="Harald\" "param3"="Marie"
If the backslash is at the end of the value, the matcher interprets "Harald\" " as the value of parameter 2 instead of "Harald\".
Do you have a universal solution for this problem? Thanks in advance for your input.
Kind regards
Dominik
You may use this regex in Java:
\"([^\"]*)\"\h*=\h*(\"[^\\\"]*(?:\\(?=\"(?:\h|$))|(?:\\.[^\\\"]*))*\")
RegEx Demo
RegEx Details:
\"([^\"]*)\": Match the quoted parameter name (capture group #1)
\h*=\h*: Match = surrounded by optional horizontal whitespace
(: Start capture group #2
\": Match opening "
[^\\\"]*: Match 0 or more non-quote, non-backslash characters
(?:: Start non-capture group
\\: Match a \
(?=\"(?:\h|$)): Must be followed by a " that has a whitespace or the end of the input after it
|: OR
(?:\\.[^\\\"]*))*: Match an escaped character followed by 0 or more non-quote, non-backslash characters; then close the non-capture group and repeat it 0 or more times
\": Match closing "
): End capture group #2
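For use in Java source, every backslash in the regex has to be doubled inside the string literal. The sketch below shows one way to do that and applies the pattern to Examples 2 and 3; the first capture group holds the name and the second holds the value including its quotes. The class name QuotedParamsDemo and the helper parse are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedParamsDemo {
    // The answer's regex as a Java string literal (each regex backslash is doubled).
    private static final Pattern PARAM = Pattern.compile(
            "\"([^\"]*)\"\\h*=\\h*(\"[^\\\\\"]*(?:\\\\(?=\"(?:\\h|$))|(?:\\\\.[^\\\\\"]*))*\")");

    // Hypothetical helper: maps each parameter name to its value (quotes included).
    static Map<String, String> parse(String input) {
        Map<String, String> params = new LinkedHashMap<>();
        Matcher m = PARAM.matcher(input);
        while (m.find()) {
            params.put(m.group(1), m.group(2));
        }
        return params;
    }

    public static void main(String[] args) {
        // Example 2: escaped quote inside the value
        System.out.println(parse("\"param1\"=\"Peter\" \"param2\"=\"Har\\\"ald\" \"param3\"=\"Marie\""));
        // Example 3: backslash as the last character of the value
        System.out.println(parse("\"param1\"=\"Peter\" \"param2\"=\"Harald\\\" \"param3\"=\"Marie\""));
    }
}
```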

regex pattern accepting comma and colon

I am searching for a regex pattern that matches the strings below. I am currently using this pattern:
^;[A-za-z0-9,:]+
The above regex doesn't match the following strings.
I would like all of the given strings to be matched by the pattern.
:a123,234,444:322 //String started with semicolon and values are separated with comma and colon
;123,A234:123;123,345,456:999,456 // Above case with repeated condition
;;123,345,C555:123 //String started with double semicolon
Can anyone provide a regex pattern that matches the above strings?
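This one
[;:]+[A-Za-z0-9,;:]+
will work for all three strings you want; see it online on regex101.
[;:]+: Starts with one or more ; or : characters.
[A-Za-z0-9,;:]+: Your character class was missing the ; here. Note also A-Za-z: the range A-z in your pattern additionally matches [, \, ], ^, _ and `.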
You can match the above with this regex
^;+[A-Za-z0-9,;:]+
Modifications:
;+ will match 1 or more semicolons
a semicolon ; has been added to the characters you want to match
the range A-z has been written as A-Za-z so that it covers letters only
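A small sketch to try both suggestions against the three sample strings. It uses A-Za-z, which is an assumption about what the question's A-za-z was meant to be, and the class name SemicolonColonListDemo is made up. Note that the first sample, as written in the question, begins with a colon, so only the pattern with the [;:]+ prefix accepts it.

```java
import java.util.regex.Pattern;

class SemicolonColonListDemo {
    // Prefix of one or more ';' or ':' followed by the allowed characters.
    static final Pattern EITHER_PREFIX = Pattern.compile("[;:]+[A-Za-z0-9,;:]+");
    // Prefix restricted to one or more ';'.
    static final Pattern SEMICOLON_PREFIX = Pattern.compile("^;+[A-Za-z0-9,;:]+");

    public static void main(String[] args) {
        String[] samples = {
            ":a123,234,444:322",
            ";123,A234:123;123,345,456:999,456",
            ";;123,345,C555:123"
        };
        for (String s : samples) {
            // matches() requires the whole string to fit the pattern
            System.out.println(s
                    + " -> either prefix: " + EITHER_PREFIX.matcher(s).matches()
                    + ", semicolon prefix: " + SEMICOLON_PREFIX.matcher(s).matches());
        }
    }
}
```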

regex whole word option

I have a problem matching whole words in Java. What I want to do is find the start indices of each word in a given line.
Pattern pattern = Pattern.compile("(" + str + ")\\b");
Matcher matcher = pattern.matcher(line.toLowerCase(Locale.ENGLISH));
if (matcher.find()) {
    // Doing something
}
I have a problem with this given case
line = "Watson has Watson's items.";
str = "watson";
I want to match only the first watson here, not the second one, and I don't want my pattern to rely on checking for spaces. What should I do in this case?
The word boundary \b matches the location between a non-word and a word character (or the start/end before/after a word character). The ', -, +, etc. are non-word characters, so Watson\b will match in Watson's (partial match).
You might want to only match Watson if it is not enclosed with non-whitespace symbols:
Pattern p = Pattern.compile("(?<!\\S)" + str + "(?!\\S)");
To match Watson at the end of the sentence, you will need to allow matching before ., ? and !, use
Pattern p = Pattern.compile("(?<!\\S)" + str + "(?![^\\s.!?])");
See the regex demo
Just FYI: perhaps, it is a good idea to also use Pattern.quote(str) instead of plain str to avoid issues when your str contains special regex metacharacters.
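Putting the pieces together, here is a minimal sketch that returns the start indices for the example line, next to the question's \b version for comparison. The class name WholeWordDemo and both helper methods are hypothetical, and str is assumed to be lower case already, as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordDemo {
    // Hypothetical helper: start indices of every match of p in the lower-cased line.
    private static List<Integer> starts(Pattern p, String line) {
        List<Integer> result = new ArrayList<>();
        Matcher m = p.matcher(line.toLowerCase(Locale.ENGLISH));
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    // Lookaround version: no non-whitespace char before, and only whitespace, '.', '!', '?' or the end after.
    static List<Integer> startIndices(String line, String str) {
        return starts(Pattern.compile("(?<!\\S)" + Pattern.quote(str) + "(?![^\\s.!?])"), line);
    }

    // The question's approach, which also matches the "watson" in "watson's".
    static List<Integer> startIndicesWordBoundary(String line, String str) {
        return starts(Pattern.compile("(" + Pattern.quote(str) + ")\\b"), line);
    }

    public static void main(String[] args) {
        String line = "Watson has Watson's items.";
        System.out.println(startIndices(line, "watson"));             // [0]
        System.out.println(startIndicesWordBoundary(line, "watson")); // [0, 11]
    }
}
```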
Use the find() method of Matcher.
Refer to the Java docs.

Correct existing regular expression / create a new one

I am trying to learn regular expressions and want to replace punctuation in a string with whitespace before feeding it into a tokenizer. The string might contain many punctuation marks. However, I do not want to replace an apostrophe or hyphen that occurs inside a word.
For example,
six-pack => six-pack
He's => He's
This,that => This that
I tried to replace all the punctuations with whitespace initially but that would not work.
I tried to replace only those punctuations by specifying the wordboundaries as in
\B[^\p{L}\p{N}\s]+\B|\b[^\p{L}\p{N}\s]+\B|\B[^\p{L}\p{N}\s]+\b
But, I am not able to exclude the hyphen and apostrophe from them.
My guess is that the above regex is also very cumbersome and there should be a better way. Is there any?
So, all I am trying to do is:
Replace all punctuations with whitespace
Do not do the above if they are hyphen/apostrophe
Do replace if the hyphen/apostrophe does occur at start/end of a word.
Any help is appreciated.
You can probably work out a set of punctuation characters that are ok between words, and another set that isn't, then define your regular expression based on that.
For instance:
String[] input = {
    "six-pack",  // => six-pack
    "He's",      // => He's
    "This,that"  // => This that
};
for (String s : input) {
    // Replace punctuation other than ' and - when it sits between two word characters
    System.out.println(s.replaceAll("(?<=\\w)[\\p{Punct}&&[^'-]](?=\\w)", " "));
}
Output
six-pack
He's
This that
Note
Here the pattern uses a character class intersection: all POSIX punctuation (\p{Punct}) except ' and -. The match must also be preceded and followed by a word character.
You can use this lookahead based regex:
(?!((?!^)['-].))\\p{Punct}
RegEx Demo
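A short sketch of how this lookahead pattern behaves with String.replaceAll, including a hyphen at the very start and an apostrophe at the very end of the input. The class name EdgePunctuationDemo and the helper clean are invented for illustration.

```java
class EdgePunctuationDemo {
    // Replaces any punctuation char with a space, unless it is a ' or - that is
    // not at the start of the input and is followed by another character.
    static String clean(String s) {
        return s.replaceAll("(?!((?!^)['-].))\\p{Punct}", " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("six-pack"));   // six-pack
        System.out.println(clean("He's"));       // He's
        System.out.println(clean("This,that"));  // This that
        // Leading '-' and trailing ''' become spaces; the inner '-' is kept.
        System.out.println("[" + clean("-six-pack'") + "]"); // [ six-pack ]
    }
}
```

Because the lookahead only tests for the start of the input and for a following character, a hyphen or apostrophe followed by a space inside a longer text is kept, so treat this as a per-word check.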
You could use negative lookahead assertion like below,
String s = "six-pack\n"
+ "He's\n"
+ "This,that";
System.out.println(s.replaceAll("(?m)^['-]|['-]$|(?!['-])\\p{Punct}", " "));
Output:
six-pack
He's
This that
Explanation:
(?m) Multiline Mode
^['-] Matches ' or - which are at the start.
| OR
['-]$ Matches ' or - which are at the end of the line.
| OR
(?!['-])\\p{Punct} Matches all the punctuations except these two ' or - . It won't touch the matched [-'] symbols (ie, at the start and end).
RegEx Demo
