Error caused by RegEx? - java

I'm using a system where I need to enter hundreds of RegEx expressions. I've recently changed a few things and am getting the following error:
java.lang.RuntimeException: ?+* follows nothing in expression
I've no idea what this means and would really appreciate any pointers for what I should be looking for to fix it.
Many thanks :)
Katie

The obvious interpretation is that you have a regex that starts with a '?', a '+' or a '*' meta-character. Maybe it should have been escaped. Maybe you've accidentally deleted the preceding thing that is "quantified" by the meta-character.
I do have a '*' at the beginning of some expressions - is that bad?
Yup. If that is supposed to match a literal asterisk character, it must be preceded by a '\' to escape it. (And as Felix Kling pointed out, the '\' will itself need to be escaped if the regex is embedded in a Java string literal.)
Should I be putting '.*' (ie. dot star) instead?
It depends on what you want the regex to match at that point. '.*' means "greedily match zero or more characters". If that's what you mean, that's what you should use.

It means you have a quantifier (+,?, or *) that isn't quantifying anything. My guess is you might have forgotten to escape one of those characters (with \) when trying to match it.
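A minimal sketch of the problem, assuming the expressions are compiled with java.util.regex. That engine reports the same mistake as a PatternSyntaxException ("Dangling meta character"), so the wording differs from the message in the question. The class name, helper method and sample patterns are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class DanglingQuantifierDemo {

    // Returns true if java.util.regex accepts the pattern, false if it rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A quantifier with nothing in front of it is rejected.
        System.out.println(compiles("*foo"));
        System.out.println(compiles("+foo"));
        System.out.println(compiles("?foo"));

        // Escaped, the asterisk is a literal character. The backslash is doubled
        // because the regex is written as a Java string literal.
        System.out.println(compiles("\\*foo"));
        System.out.println(Pattern.matches("\\*foo", "*foo"));

        // ".*" quantifies the dot, so it is valid and matches any prefix.
        System.out.println(Pattern.matches(".*foo", "barfoo"));
    }
}
```

Running a helper like compiles over the whole list of expressions is one way to find the offending ones without reading hundreds of patterns by eye.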

Related

Using a Regular Expression, how to ignore a given character

Good afternoon,
I'm part of a migration project from SQL Server to DataBricks (Apache Spark) and while we're enjoying all the benefits of DB, I must admit I'm missing all of those lovely Microsoft SQL functions.
As part of my migration, I'm trying to write a regular expression to find the first instance of a "-" or a "+" and return all characters after this.
Here's my regular expression so far:
\+(.*)|\-.*
Here's my complex test set:
dlfsdlfkgjbsdfg / sdklfjgsdfgsdfg-sdfgsdfg / sdfgjh-sdfgsdfg / sdfg+sdfgsdfg / sdfgsdgfhf4
The bold text is what I expect to return, but currently I'm seeing the plus and the minus chars returning.
I've tried following examples but it seems that I'm missing a trick because I can either highlight everything after (but including) the chars, or just the char itself.
Thanks in advance!
Your pattern
\+(.*)|\-.*
matches either
a plus followed by anything, capturing that anything,
or
a hyphen followed by anything, capturing nothing.
You should use the character class and then a capturing .*, like
[+-](.*)
or a noncapturing alternation (of one each of + and -) and capturing .* like
(?:\+|-)(.*)
You can extract matches of the following regular expression with the single-line or DOTALL flag set, which causes the dot to match line terminators as well as all other characters.
(?<=[+-]).*
(?<=[+-]) is a positive lookbehind (supported by Java) which asserts that the current position is immediately preceded by a plus or minus character. Because the engine returns the leftmost match, the match starts right after the first plus or minus.
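Both suggestions can be tried side by side. This is a sketch in plain Java rather than Spark, on the assumption that the same java.util.regex syntax applies there. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AfterFirstSign {

    // DOTALL lets '.' match line terminators as well.
    private static final Pattern CAPTURE =
            Pattern.compile("[+-](.*)", Pattern.DOTALL);
    private static final Pattern LOOKBEHIND =
            Pattern.compile("(?<=[+-]).*", Pattern.DOTALL);

    // Text after the first '+' or '-', taken from capture group 1; null if neither occurs.
    static String viaCaptureGroup(String input) {
        Matcher m = CAPTURE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Same result, but the sign is excluded by the lookbehind, so the whole match is used.
    static String viaLookbehind(String input) {
        Matcher m = LOOKBEHIND.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "dlfsdlfkgjbsdfg", "sdklfjgsdfgsdfg-sdfgsdfg", "sdfg+sdfgsdfg", "a-b+c"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + viaCaptureGroup(s) + " / " + viaLookbehind(s));
        }
    }
}
```

With the capture-group form the caller has to ask for group 1. With the lookbehind form the whole match is already the wanted text, which helps when the calling function cannot select a group.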

Why do I need two slashes in Java Regex to find a "+" symbol?

Just something I don't understand the full meaning behind. I understand that I need to escape any special meaning characters if I want to find them using regex. And I also read somewhere that you need to escape the backslash in Java if it's inside a String literal. My question though is if I "escape" the backslash, doesn't it lose its meaning? So then it wouldn't be able to escape the following plus symbol?
Throws an error (but shouldn't it work since that's how you escape those special characters?):
replaceAll("\+\s", ""));
Works:
replaceAll("\\+\\s", ""));
Hopefully that makes sense. I'm just trying to understand the functionality behind why I need those extra slashes when the regex tutorials I've read don't mention them. And things like "\+" should find the plus symbol.
There are two "escapings" going on here. The first backslash is to escape the second backslash for the Java language, to create an actual backslash character. The backslash character is what escapes the + or the s for interpretation by the regular expression engine. That's why you need two backslashes -- one for Java, one for the regular expression engine. With only one backslash, Java reports \s and \+ as illegal escape characters -- not for regular expressions, but for an actual character in the Java language.
The idea behind the extra backslashes is that the first backslash is the escape for the Java string literal and the second backslash is the escape for the regex.
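A small sketch of the two layers, with a made-up class name and sample input. Printing the string shows what the regex engine actually receives once the compiler has processed the literal.

```java
public class DoubleBackslashDemo {

    // Source text "\\+\\s" becomes the four characters \+\s at run time:
    // a literal plus followed by one whitespace character.
    static final String REGEX = "\\+\\s";

    static String stripPlusAndSpace(String input) {
        return input.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(REGEX);            // what the regex engine sees
        System.out.println(REGEX.length());   // number of characters after compilation
        System.out.println(stripPlusAndSpace("1 + 2"));
    }
}
```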

Unclosed Character Class (Regex)

So, I have this semi-complex regex that is searching for all text in between two strings, then replacing it.
My search regex for this is:
(jump *[A-Z].*)(?:[^])*?([A-Z].*:)
This gives an Unclosed Character Class on the final closing bracket, which I have been struggling to solve. The regex seems to work as intended on RegexR (http://regexr.com/?38k63)
Could anyone provide some help or insight?
Thanks in advance.
The error is here:
(jump *[A-Z].*)(?:[^])*?([A-Z].*:)
                   ^
Inside a character class, ^ is still a special character: placed first, it negates the class. To match a literal ^, escape it (written as \\^ in a Java string literal).
Different regex engines treat [^] differently. Some assume that it is the beginning of a negated character class that excludes ] and any characters up to the next ] in the pattern (e.g. [^][] matches anything except ] and [). Other engines treat it as an empty negated character class (which matches anything). This is why the pattern works in some regex engines while others report an error.
If you meant for it to match a literal ^ character, you'll have to escape it like this:
(jump *[A-Z].*)(?:[\^])*?([A-Z].*:)
Or better yet, just remove it from the character class (you'll still have to escape it because ^ has special meaning outside of a character class, too):
(jump *[A-Z].*)(?:\^)*?([A-Z].*:)
Or if you meant for it to match everything up to the next [A-Z].*:, try a character class like this:
(jump *[A-Z].*)(?:[\s\S])*?([A-Z].*:)
And of course, because this is Java, don't forget that you'll need to escape all the \ characters in any string literals.
The problem seems to be the use of [^]:
(jump *[A-Z].*)(?:[^])*?([A-Z].*:)
                   ^
-------------------|
Try this regex instead (written with doubled backslashes, as it would appear in a Java string literal):
(jump *[A-Z].*)[\\s\\S]*?([A-Z].*:)
OR this:
(?s)(jump *[A-Z].*).*?([A-Z].*:)
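The difference can be checked directly in Java. The sketch below compiles the original pattern and the [\s\S] variant. The class name and the sample text are invented for illustration and are not taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class UnclosedClassDemo {

    // Pattern from the question; java.util.regex rejects it.
    static final String BROKEN = "(jump *[A-Z].*)(?:[^])*?([A-Z].*:)";

    // [\s\S] matches any character, including line terminators.
    static final String FIXED = "(jump *[A-Z].*)[\\s\\S]*?([A-Z].*:)";

    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Returns {group 1, group 2} of the first match, or null if there is none.
    static String[] firstMatch(String text) {
        Matcher m = Pattern.compile(FIXED).matcher(text);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        System.out.println(compiles(BROKEN));
        System.out.println(compiles(FIXED));

        // Hypothetical input: a jump line, some text, then a label.
        String[] groups = firstMatch("jump LOOP\nsome text\nEND:");
        System.out.println(groups[0]);
        System.out.println(groups[1]);
    }
}
```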

Matching any character in regex?

I have a state machine which is capable of matching comments, so it can handle:
/* /* */ */
But I'm bogged down on skipping the contents inside the comments. Currently my regex for the words inside a comment looks rather strange:
[0-9A-Za-zA-Z0-9\*\(\*\*\)\.\{\}\_\;\,\-\:" "\#]*
Is there a simple regex (in Java) which matches all characters, letters along with special characters?
Thanks for the help.
Use . (dot) if you want to match any character.
. matches anything once. .* will match 0 or more of anything, while .+ will match one or more, depending on your needs.
. is the character that matches all other characters, with the possible exception of newlines (depending on whether DOTALL is enabled).
If you want to match everything EXCEPT a certain character or two, use the [^...] syntax (such as [^0-9a-fA-F] to match any character that is not a hexadecimal digit).
It is often useful to add a trailing ? to a quantified dot, to match as few characters as possible (such as .*? or .+?). Otherwise, a greedy dot expression may match the rest of the string.
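A short sketch of these points, using a hypothetical helper and made-up sample strings. It compares greedy and lazy matching of a comment body, shows the effect of DOTALL, and uses a negated class. It does not attempt nested comments, which the state machine in the question already handles.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotDemo {

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "/* one */ x /* two */";

        // Greedy: .* runs to the last "*/" in the input.
        System.out.println(firstMatch("/\\*.*\\*/", 0, text));

        // Lazy: .*? stops at the first "*/".
        System.out.println(firstMatch("/\\*.*?\\*/", 0, text));

        // Without DOTALL the dot does not cross a line terminator.
        String multiLine = "/* a\nb */";
        System.out.println(firstMatch("/\\*.*?\\*/", 0, multiLine));
        System.out.println(firstMatch("/\\*.*?\\*/", Pattern.DOTALL, multiLine));

        // Negated class: a run of characters that are not hexadecimal digits.
        System.out.println(firstMatch("[^0-9a-fA-F]+", 0, "beefXYZcafe"));
    }
}
```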

Simple Java regex not working

I have this regex which is supposed to remove sentence delimiters(. and ?):
sentence = sentence.replaceAll("\\.|\\?$","");
It works fine; it converts
"I am Java developer." to "I am Java developer"
"Am I a Java developer?" to "Am I a Java developer"
But after deployment we found that it also removes any other dots in the sentence, for example
"Hi.Am I a Java developer?" becomes "HiAm I a Java developer"
Why is this happening?
The pipe (|) has the lowest precedence of all operators. So your regex:
\\.|\\?$
is being treated as:
(\\.)|(\\?$)
which matches a . anywhere in the string and matches a ? at the end of the string.
To fix this you need to group the . and ? together as:
(?:\\.|\\?)$
You could also use:
[.?]$
Within a character class . and ? are treated literally so you need not escape them.
What you're saying with "\\.|\\?$" is "either a period" or "a question mark as the last character".
I would recommend "[.?]$" instead in order to avoid the confusing escaping (and undesirable result, of course).
Your problem is because of the low precedence of the alternation operator |. Your regular expression means match one of:
. anywhere or
? at the end of a line.
Use a character class instead:
"[.?]$"
You have forgotten to enclose the sentence-ending characters in parentheses:
sentence = sentence.replaceAll("(\\.|\\?)$","");
The better approach is to use [.?]$ as Mark Byers suggested.
sentence = sentence.replaceAll("[.?]$","");
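The two patterns can be compared on the sentences from the question. This is a minimal sketch; the class and method names are made up.

```java
public class TrailingPunctuationDemo {

    // Original pattern: a dot anywhere, OR a question mark at the end.
    static String original(String sentence) {
        return sentence.replaceAll("\\.|\\?$", "");
    }

    // Character class: a dot or a question mark, only at the end.
    static String fixed(String sentence) {
        return sentence.replaceAll("[.?]$", "");
    }

    public static void main(String[] args) {
        String s = "Hi.Am I a Java developer?";
        System.out.println(original(s));
        System.out.println(fixed(s));
        System.out.println(fixed("I am Java developer."));
    }
}
```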
