Regex matching in Java - java

I have a string in Java that I need to split using "<$" and "$>" as delimiters.
But if the delimiters are escaped, as in "\<$something_we_dont_care_what$>", then the string should not be split there and we move on.
I've been trying to write a regex doing this for a while but I keep failing and reading about regular expressions in Java is just making me more and more confused...
Can anyone tell me the right way to do this?
Thank you.

Suppose you have two strings, not literals in your code but read from a file or a JTextField:
s = "\<$foo$>";
p = "[^\\]?<\$[^\$]*\$>";
And you want to match the pattern against the string.
Here is what the pattern contains:
An optional character that is not a backslash: [^\\]?
<$, where the dollar sign, being a regex metacharacter, has to be escaped with a backslash, just like the backslash before it.
A character class [^\$]* matching any number of characters that are not a dollar sign.
A dollar sign followed by a greater-than sign: \$>. Again, the dollar sign is escaped.
A question for your domain is whether the foo part (something_we_dont_care_what) might contain a dollar sign that is not followed by a >. I assumed not.
s.matches (p);
should now return true or false, but the problem is how to get it into your code: not only the regex engine but Java itself treats the backslash as an escape character in string literals. So you have to double each of them:
p = "[^\\\\]?<\\$[^\\$]*\\$>";
If the test string is a literal in your code, the same applies to it:
"\\<$foo$>".matches (p);
Trying patterns out is often easier if you have a tool where you can skip the Java escaping at first: a simple GUI with two JTextFields, or code that reads the pattern from a properties file, which saves you from repeated recompiles.
public class PM
{
    public static void main (String args[])
    {
        // escaped tag: the Java literal "\\<$foo$>" is the text \<$foo$>
        String bad = "\\<$foo$>";
        String good = "<$foo$>";
        String p = "[^\\\\]?<\\$[^\\$]*\\$>";
        System.out.println ("bad:\t" + bad.matches (p));   // false
        System.out.println ("good:\t" + good.matches (p)); // true
    }
}

Never mind.
I've found a solution after a few hours of browsing and experimenting.
The regular expression that does exactly what I wanted is the following:
// the $ character needs to be escaped because it has a special meaning in regular expressions
// <$
String leftDelimiter = "(<\\$)";
// $>
String rightDelimiter = "(\\$>)";
// leftDelimiter | rightDelimiter
// used to split a string, this would split at every occurrence of those two patterns,
// including the case where I don't want a split:
// the "\<$foo$>" case, where they are "escaped" in the string
// to solve it we match our leftDelimiter only if the char \ isn't right before it
// matches every <$ that is not preceded by \
String fixedLeftDelimiter = "(?<!\\\\)"+leftDelimiter;
// whatCanBeInTags is everything that can be in our tags, i.e. anything except the $ sign
// we are using {0,"+(Integer.MAX_VALUE-3)+"}? instead of *? because of a limit
// on the number of characters allowed in a lookbehind assertion
// the trailing ")" closes the lookbehind opened in betterRightDelimiter;
// this variable has to be declared before betterRightDelimiter, which uses it
String whatCanBeInTags = "[^\\$]{0,"+(Integer.MAX_VALUE-3)+"}?)";
// the rightDelimiter is trickier because we need to check
// whether an escaped leftDelimiter came before it
// matches every $> that is not preceded by a <$ starting with \
String betterRightDelimiter = "(?<!\\\\"+leftDelimiter+whatCanBeInTags+rightDelimiter;
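For comparison, here is a sketch of a different technique that avoids the long lookbehind: instead of splitting on the two delimiters separately, it matches each unescaped <$...$> tag as a whole and collects the text around and inside it. It relies on the same assumption as above (no $ inside a tag), and the class and method names are made up for illustration; they are not from the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class TagSplitter {
    // "<$" not preceded by a backslash, then a run of non-$ characters
    // (captured as group 1), then "$>".
    private static final Pattern TAG =
            Pattern.compile("(?<!\\\\)<\\$([^$]*)\\$>");

    // Splits the input at unescaped tags: text before a tag, tag content, text after.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        int last = 0;
        while (m.find()) {
            parts.add(input.substring(last, m.start())); // text before the tag
            parts.add(m.group(1));                       // text inside the tag
            last = m.end();
        }
        parts.add(input.substring(last));                // remaining text
        return parts;
    }

    public static void main(String[] args) {
        // The second tag is escaped with a backslash, so it is left alone.
        System.out.println(split("a<$b$>c\\<$d$>e")); // [a, b, c\<$d$>e]
    }
}
```

Because the whole tag is matched at once, the closing $> of an escaped tag never needs a lookbehind of its own.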

Related

regex to filter out string

I'm filtering strings using the regex below:
^(?!.*(P1 | P2)).*groupName.*$
Here groupName is a placeholder that I replace with a specific string at run time. This regex already works fine.
I have two input strings that are run through this regex. I can't change the ^(?!.*(P1 | P2)) part, because it is a very generic regex used in many places; the only part I can change is groupName. Is there any way to let only string 2 pass through this regex?
1) ADMIN-P3-UI-READ-ONLY
2) ADMIN-P3-READ-ONLY
groupName is just a variable that is replaced at run time with the required string. In this case I want only string 2 to pass, so the groupName part could be replaced with READ-ONLY, but then string 1 passes too.
Can anyone suggest how to make this work?
You could use a negative lookbehind:
(?<!UI-)READ-ONLY
so READ-ONLY matches only when it is not immediately preceded by UI-.
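A minimal sketch of how the lookbehind could be plugged in, assuming groupName is inserted into the pattern verbatim (not quoted) at run time. The GroupFilter class and passes method are hypothetical names used only for this illustration.

```java
// Hypothetical wrapper around the fixed pattern from the question.
final class GroupFilter {
    static boolean passes(String input, String groupName) {
        // Only the groupName part varies; the prefix is kept exactly as in the question.
        String regex = "^(?!.*(P1 | P2)).*" + groupName + ".*$";
        return input.matches(regex);
    }

    public static void main(String[] args) {
        String groupName = "(?<!UI-)READ-ONLY";
        System.out.println(passes("ADMIN-P3-UI-READ-ONLY", groupName)); // false
        System.out.println(passes("ADMIN-P3-READ-ONLY", groupName));    // true
    }
}
```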
You can add another lookahead at the very start of your pattern to further restrict what it matches because your pattern is of the "match-everything-but" type.
So, it may look like
String extraCondition = "^(?!.*UI)";
String regex = "^(?!.*(P1|P2)).*READ-ONLY.*$";
String finalRegex = extraCondition + regex;
The pattern will look like
^(?!.*UI)^(?!.*(P1|P2)).*READ-ONLY.*$
matching
^(?!.*UI) - no UI after any zero or more chars other than line break chars as many as possible from the start of string
^(?!.*(P1|P2)) - no P1 nor P2 after any zero or more chars other than line break chars as many as possible from the start of string
.*READ-ONLY - any zero or more chars other than line break chars as many as possible and then READ-ONLY
.*$ - the rest of the string. Note you may safely remove $ here unless you want to make sure there are no extra lines in the input string.

Negative Look-Ahead assertion for multiline text [duplicate]

I'm looking for a way to check whether a multiline string (from a PDF) contains a certain letter combination that must not start with a specific prefix. Specifically, I'm trying to find strings that contain ARC but don't contain NON-ARC.
I found this great example, "Regular expression for a string that does not start with a sequence", but it does not seem to work for my problem. With my pattern ^(?!NON\\-)ARC.* I get the expected result in a single-line test, but with real multiline input the pattern fails on a string that should match. Here is what I did:
@Test
public void testRegexLookAhead() {
    String strTestSimplePos = "ARC 0.1-1";
    String strTestSimpleNeg = "NON-ARC 3.4-1";
    String strTestRealPos = "HEADLINE\r\n" + "Subheader Author\r\n" + "ARC 0.1-1\r\n" + "20190211";
    String strTestRealNeg = "HEADLINE\r\n" + "Subheader Author\r\n" + "NON-ARC 0.1-1\r\n" + "20190211";
    //based on https://stackoverflow.com/questions/899422/regular-expression-for-a-string-that-does-not-start-with-a-sequence
    String regexNoNON = "^(?!NON\\-)ARC.*";
    Pattern noNONPatter = Pattern.compile(regexNoNON);
    System.out.println(noNONPatter.matcher(strTestSimplePos).find()); //true OK
    System.out.println(noNONPatter.matcher(strTestSimpleNeg).find()); //false OK
    System.out.println(noNONPatter.matcher(strTestRealPos).find()); //false but should be true -> does not work as intended
    System.out.println(noNONPatter.matcher(strTestRealNeg).find()); //false OK
}
Would be glad if anyone can point out what went wrong...
Edit: This was marked as a duplicate of "How to use java regex to match a line"; however, I didn't try to use a regex to match a line at all. I just needed a way to find a specific sequence (with a negative lookahead) in multiline text input. One approach to solving the other question is also the solution to this one (compile the pattern with java.util.regex.Pattern.MULTILINE), but the questions are at best related.
Your input strings have multiple lines and you're using the caret, so you need to add the multiline flag:
Pattern.compile(regexNoNON, java.util.regex.Pattern.MULTILINE);
About MULTILINE:
Enables multiline mode.
In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.
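As a small illustration of the flag, the sketch below compiles the pattern from the question with Pattern.MULTILINE and runs it against the four test strings. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the MULTILINE flag.
final class MultilineDemo {
    // Same pattern as in the question; with MULTILINE, ^ matches at the start
    // of every line instead of only at the start of the whole input.
    private static final Pattern ARC_LINE =
            Pattern.compile("^(?!NON\\-)ARC.*", Pattern.MULTILINE);

    static boolean containsArcLine(String text) {
        return ARC_LINE.matcher(text).find();
    }

    public static void main(String[] args) {
        String realPos = "HEADLINE\r\n" + "Subheader Author\r\n" + "ARC 0.1-1\r\n" + "20190211";
        String realNeg = "HEADLINE\r\n" + "Subheader Author\r\n" + "NON-ARC 0.1-1\r\n" + "20190211";
        System.out.println(containsArcLine("ARC 0.1-1"));     // true
        System.out.println(containsArcLine("NON-ARC 3.4-1")); // false
        System.out.println(containsArcLine(realPos));         // true
        System.out.println(containsArcLine(realNeg));         // false
    }
}
```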
Try this Regex:
HEADLINE(?:(?!HEADLINE)[\s\S])*(?<!NON-)ARC(?:(?!HEADLINE)[\s\S])*
Explanation:
HEADLINE - matches the word HEADLINE
(?:(?!HEADLINE)[\s\S])* - matches 0+ occurrences of any character (including line breaks), as long as the word HEADLINE does not start at that character
(?<!NON-)ARC - matches the word ARC if it is not immediately preceded by NON-
(?:(?!HEADLINE)[\s\S])* - again matches 0+ occurrences of any character, as long as the word HEADLINE does not start at that character

How do i check if string contains char sequence and backslash "\"?

I'm trying to get true in the following test. I have a string with a backslash that for some reason isn't recognized.
String s = "Good news\\ everyone!";
Boolean test = s.matches("(.*)news\\.");
System.out.println(test);
I've tried a lot of variants, but only (.*)news(.*) works. That, however, allows any characters after news; I need it to match only when news is followed by \.
How can I do that?
Group the elements at the end: (.*)news\\(.*)
You can use this instead:
Boolean test = s.matches("(.*)news\\\\(.*)");
Try something like:
Boolean test = s.matches(".*news\\\\.*");
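Here .* means any number of characters, followed by news, followed by a literal backslash (written as four backslashes in the Java string literal), and then any number of characters after that (possibly zero).
With your regex, the Java literal "(.*)news\\." becomes the pattern (.*)news\. which means:
(.*) - any number of characters
news - the literal text news
\. - a literal dot (the backslash that survives the Java string escaping escapes the dot in the regex)
This doesn't match the string in your program, "Good news\ everyone!", because news is followed by a backslash, not a dot.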
You are testing for an escaped occurrence of a literal dot: ".".
Refactor your pattern as follows (inferring the last part as you need it for a full match):
String s = "Good news\\ everyone!";
System.out.println(s.matches("(.*)news\\\\.*"));
Output
true
Explanation
The back-slash is used to escape characters and the back-slash itself in Java Strings
In Java Pattern representations, you need to double-escape your back-slashes for representing a literal back-slash ("\\\\"), as double-back-slashes are already used to represent special constructs (e.g. \\p{Punct}), or escape them (e.g. the literal dot \\.).
String.matches will attempt to match the whole String against your pattern, so you need the terminal part of the pattern I've added
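To make the two levels of escaping concrete, here is a minimal sketch; the class and method names are made up for illustration. The comments trace each literal from Java source to the regex the engine actually sees.

```java
// Hypothetical demo of Java-string escaping versus regex escaping.
final class BackslashDemo {
    static boolean newsThenBackslash(String s) {
        // Java literal ".*news\\\\.*"  ->  regex  .*news\\.*
        // i.e. "news" followed by one literal backslash, then anything.
        return s.matches(".*news\\\\.*");
    }

    public static void main(String[] args) {
        String s = "Good news\\ everyone!"; // the text contains ONE backslash
        System.out.println(s);                         // Good news\ everyone!
        System.out.println(newsThenBackslash(s));      // true
        // Java literal "(.*)news\\."  ->  regex  (.*)news\.  (a literal dot)
        System.out.println(s.matches("(.*)news\\.")); // false
    }
}
```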
You can try this:
String s = "Good news\\ everyone!";
Boolean test = s.matches("(.*)news\\\\(.*)");
System.out.println(test);

Java Regex - Trying to isolate text from a line that starts with a certain string?

EDIT: MAKE SURE YOU CALL Matcher#matches or Matcher#find before trying to use group!
I'm trying to do something very simple - I'm trying to get the text from a line that starts with a word. In this case, the word is Location:. I'm reading from raw HTML so the line of interest actually looks like this:
Location: Main Hall
Obviously, I want Main Hall returned to me so I can read the location for my application.
This is what I've tried:
String t_location = "";
Pattern t_pat = Pattern.compile("^[\\s]+?(?s)Location: (?-s)(.*)$");
Matcher t_match = t_pat.matcher(t_inner_html);
t_location = t_match.group(0);
But I keep getting the error:
java.lang.IllegalStateException: No successful match so far
Breaking down my Regex, this is what (I think) I'm doing:
^ - Read from the beginning of the line
[\\s]+? - With a reluctant quantifier, read the whitespace at the beginning of the line until we hit something else
(?s)Location: (?-s) - The literal string "Location: " is read
(.*)$ - Read characters (except newlines) until the end of the line
That is what I THINK I'm doing. I'm not so good at Regex, but I've tried to follow the documentation to no avail. Can someone please help me?
For example purposes, the String t_inner_html looks like this:
8/28/2014
Alumni Reunion
Location: Main Hall
<span class="extra-info">
Blah blah blah....
</span>
If this were not Java, this regex should work, depending on what your end-of-line (EOL) character sequence is:
(.|\n)*Location:\s*(.*)\n
The string you want is at group index 2, because the leading (.|\n) is itself a capturing group and takes index 1.
Now since this regex is going to be inside a Java String, and since backslashes are escape characters in Java strings, you will actually have to pollute the pure regex with double backslashes:
Pattern t_pat = Pattern.compile("(.|\\n)*Location:\\s*(.*)\\n");
In general, to test regexes, I really like this tool:
http://regexpal.com/
It's an interactive tester that will progressively highlight your sample input as it matches the regex. When you edit the regex or change the sample input, the matching highlighting will update in real time. This does not support the required double backslashes of Java, so test in the tool with the singles, paste them to Java, and then add the extra backslashes.
You may also want to play around with this tool, which is not as real-time but does support Java String regexes:
http://www.regexplanet.com/advanced/java/index.html
To break down what I have:
(.|\n)* - zero or more characters or EOL sequences
Location: - the string "Location:"
\s* - zero or more white space
(.*) - a capturing group consisting of any characters except line terminators, which is what you will capture
\n - EOL sequence
You may need to replace \n with \r\n if you are on Windows, but try \n first and see.
This will match everything in your sample input through "Main Hall", and will ignore everything after it (<span . . .> etc.). "Main Hall" will end up in match group 2, since group 1 is the repeated (.|\n) group.
Please try the following:
String t_location = "";
Pattern t_pat = Pattern.compile("^\\s+Location:\\s+(.*)$", Pattern.MULTILINE);
Matcher t_match = t_pat.matcher(t_inner_html);
if (t_match.find()) {
t_location = t_match.group(1);
}
You need to use Pattern.MULTILINE for the expressions ^ and $ to match each line instead of the whole string.
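The same idea as a self-contained sketch. Two things here are assumptions, not part of the answer above: the leading whitespace is made optional and restricted to spaces and tabs, so that unindented lines also match and the pattern cannot run across line breaks; and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that extracts the text after "Location:".
final class LocationFinder {
    // MULTILINE lets ^ and $ match at the start and end of each line.
    // [ \t]* is an assumption: optional indentation, spaces and tabs only.
    private static final Pattern LOCATION =
            Pattern.compile("^[ \\t]*Location:[ \\t]*(.*)$", Pattern.MULTILINE);

    // Returns the location from the first matching line, or "" if there is none.
    static String findLocation(String html) {
        Matcher m = LOCATION.matcher(html);
        // find() must succeed before group() may be called.
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "8/28/2014\nAlumni Reunion\nLocation: Main Hall\n"
                + "<span class=\"extra-info\">\nBlah blah blah....\n</span>";
        System.out.println(findLocation(html)); // Main Hall
    }
}
```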
First use the String indexOf method to find whether a line contains "Location: ".
Then use str.replace("Location: ",""); on the line that contains it.
.*?Location:(.*?)\n
This should get you what you want.
See demo.
http://regex101.com/r/rJ1oQ3/1

Regular expression not working despite testing

I'm trying to validate an ID whose first two characters are letters and whose next four are digits. Zeros are allowed (e.g. 0333), but the digits can never be all zeros, so something like ID0000 is not allowed. The expression I came up with checks out when I test it online but doesn't work when I use it in the program:
\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\b
and here's the code I'm currently using to implement it:
String pattern = "/\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\b/";
Pattern regEx = Pattern.compile(pattern);
String ingID = ingredID.getText().toString();
Matcher m = regEx.matcher(ingID);
if (m.matches()) {
ingredID.setError("Please enter a valid Ingrediant ID");
}
For some reason it doesn't validate correctly: it accepts IDs like ID0000 when it shouldn't. Any thoughts, folks?
Change your regex pattern to "\\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\\b"
Your problem is essentially that Java isn't all that Regex-friendly; you need to deal with the limitations of Java strings in order to create a string that can be used as a Regex pattern. Since \ is the escape character in Regex and the escape character in Java strings (and since there's no such thing as a raw string literal in Java), you must double-escape anything that must be escaped in the Regex in order to create a literal \ character within the Java string, which, when parsed as a Regex pattern, will be correctly treated as the escape character.
So, for instance, the Regex pattern /\b/ (where /, as mentioned in my comment, delimits the pattern itself) would be represented in Java as the string "\\b".
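A short sketch of the corrected pattern in use, with the / delimiters removed and the backslashes doubled. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical validator using the corrected pattern.
final class IngredientIdValidator {
    // No surrounding slashes, and \b written as \\b in the Java literal.
    private static final Pattern ID =
            Pattern.compile("\\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\\b");

    // True when the whole input is two capital letters plus four digits
    // that are not all zeros.
    static boolean isValid(String id) {
        return ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ID0333")); // true
        System.out.println(isValid("ID0000")); // false
        System.out.println(isValid("ID1234")); // true
    }
}
```

Note that in this sketch a true result means the ID is valid, so an error message would be shown when isValid returns false.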
