How to exclude occurrence of a substring from a string using regex? - java

I have a string input in the following two forms.
1.
<!--XYZdfdjf., 15456, hdfv.4002-->
<!DOCTYPE
2.
<!--XYZdfdjf., 15456, hdfv.4002
<!DOCTYPE
I want to return a match if the form 2 is encountered and no match for the form 1.
Basically, I want a regex that accepts any characters between <!-- and <!DOCTYPE, except when --> occurs in between.
I am using Pattern, Matcher and Java regex.
I am looking specifically for a regex usable with Pattern.compile().
Thanks in advance.

Pattern p = Pattern.compile("(?s)<!--(?:(?!-->).)*<!DOCTYPE");
(?:(?!-->).)* matches one character at a time, after checking that it's not the first character of -->.
(?s) sets DOTALL mode (a.k.a. single-line mode), allowing the . to match newline characters.
If there's a possibility of two or more matches and you want to find them individually, you can replace the * with a non-greedy *?, like so:
"(?s)<!--(?:(?!-->).)*?<!DOCTYPE"
For example, applying that regex to the text of your question will find two matches, while the original regex will find one, longer match.
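As a minimal sketch of how the first regex behaves on the two forms from the question (the class and method names here are illustrative, not from the original post):

```java
import java.util.regex.Pattern;

class CommentBeforeDoctype {
    // Tempered dot: a character is consumed only if "-->" does not start at it.
    static final Pattern OPEN_COMMENT =
            Pattern.compile("(?s)<!--(?:(?!-->).)*<!DOCTYPE");

    // True when a "<!--" reaches "<!DOCTYPE" without a "-->" in between.
    static boolean hasUnclosedComment(String input) {
        return OPEN_COMMENT.matcher(input).find();
    }

    public static void main(String[] args) {
        String form1 = "<!--XYZdfdjf., 15456, hdfv.4002-->\n<!DOCTYPE";
        String form2 = "<!--XYZdfdjf., 15456, hdfv.4002\n<!DOCTYPE";
        System.out.println(hasUnclosedComment(form1)); // false
        System.out.println(hasUnclosedComment(form2)); // true
    }
}
```

find() is used rather than matches() so that the pattern may occur anywhere in a larger document.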

This seems like it is easily solved by using String.contains():
if (yourHtml.contains("-->")) {
    // exclude
} else {
    // extract the content you need
    String content =
        yourHtml.substring("<!--".length(), yourHtml.indexOf("<!DOCTYPE"));
}
I think you are looking too far into it.

\<!--([\s\S](?!--\>))*?(?=\<\!DOCTYPE)
This uses a negative lookahead to exclude the --> and a positive lookahead to find the <!DOCTYPE.
Here's a good reference for atomic assertions (lookahead and behind).

I don't have a testing system handy, so I can't give you the regex, but you should look in the Pattern documentation for something called a negative lookahead assertion. It lets you express rules of the form: match this if it is not followed by that.
It should help you :)

A regular expression might not be the best answer to your problem. Have you tried splitting the first line away from everything else and seeing if it contains the -->?
Specifically, something like:
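String htmlString = "<!--XYZdfdjf., 15456, hdfv.4002\n<!DOCTYPE"; // form 2 from the question
String firstLine = htmlString.split("\r?\n")[0];
if (firstLine.contains("-->")) {
    // no match: the comment is closed on the first line
} else {
    // match
}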

Related

Regex with prefix and optional suffix

This is maybe the 100+1 question regarding regex optional suffixes on SO, but I didn't find any, that could help me :(
I need to extract a part of string from the common pattern:
prefix/s/o/m/e/t/h/i/n/g/suffix
using a regular expression. The prefix is constant and the suffix may not appear at all, so prefix/(.+)/suffix doesn't meet my requirements. Pattern prefix/(.+)(?:/suffix)? returns s/o/m/e/t/h/i/n/g/suffix. The part (?:/suffix)? must be somehow more greedy.
I want to get s/o/m/e/t/h/i/n/g from these input strings:
prefix/s/o/m/e/t/h/i/n/g/suffix
prefix/s/o/m/e/t/h/i/n/g/
prefix/s/o/m/e/t/h/i/n/g
Thanks in advance!
Try
prefix\/(.+?)\/?(?:suffix|$)
The regex needs to know when the match is done, so match either suffix or the end of the line ($), and make the capture non-greedy.
See it here at regex101.
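A small Java sketch of this pattern applied to the three inputs from the question. The class and method names are hypothetical, and the backslashes before / are dropped because Java does not need them:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixSuffixExtractor {
    // Lazy capture, then an optional slash, then either "suffix" or end of input.
    static final Pattern MIDDLE = Pattern.compile("prefix/(.+?)/?(?:suffix|$)");

    // Returns the captured middle part, or null when the input does not match.
    static String middle(String input) {
        Matcher m = MIDDLE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(middle("prefix/s/o/m/e/t/h/i/n/g/suffix")); // s/o/m/e/t/h/i/n/g
        System.out.println(middle("prefix/s/o/m/e/t/h/i/n/g/"));       // s/o/m/e/t/h/i/n/g
        System.out.println(middle("prefix/s/o/m/e/t/h/i/n/g"));        // s/o/m/e/t/h/i/n/g
    }
}
```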
Try prefix(.*?)(?:/?(?:suffix|$)) if characters are allowed before prefix or after suffix.
This requires the match to be as short as possible (reluctant quantifier) and to be followed by one of three things: a single slash right before the end of the input, /suffix, or the end of the input. That would match /s/o/m/e/t/h/i/n/g in the test cases you provided but would match more for input like prefix/s/o/m/e/t/h/i/n/g/suff (which is OK in my opinion, since you can't tell whether /suff is meant to be part of the match or is a typo in the suffix).

Replace string part with regex pattern

I would like to replace the following string.
img/s/430x250/
The problem is there are variations, like:
img/s/265x200/
or:
img/s/110x73/
So I would like to replace this part in whole, but the numbers are changeable, so how could I make a pattern that replaces it from a string?
Is your goal to match all three of those cases?
If so, this should work: img\/s\/\d+x\d+\/
It searches for img/s/[1 or more digits]x[1 or more digits]/
This regular expression will match your examples
img\/s\/\d+?x\d+?\/
the / matches /
the \d matches digits 0-9 and the + means 1 or more. The ? makes it lazy instead of greedy.
the img and s just match that literally
Check out https://regex101.com/ to try out regular expressions. It's much easier than testing them by debugging code. Once you find an expression that works, you can move on to making sure your specific code behaves the same way.
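The question does not say which language it uses, so the following is only a sketch assuming Java. The class name, the file name photo.jpg and the replacement text img/full/ are made up for illustration:

```java
import java.util.regex.Matcher;

class ImageSizeReplacer {
    // Replaces every "img/s/<digits>x<digits>/" segment with the given text.
    static String replaceSize(String path, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return path.replaceAll("img/s/\\d+x\\d+/", Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceSize("img/s/430x250/photo.jpg", "img/full/")); // img/full/photo.jpg
        System.out.println(replaceSize("img/s/110x73/photo.jpg", "img/full/"));  // img/full/photo.jpg
    }
}
```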

regex with lookbehind weird behavior

I have been trying to resolve this for the past 2 days...
Please help me in understanding why this is happening. My intention is to just select the <HDR> that has a <DTL1 val="92">.....</HDR>
This is my regular expression
(?<=<HDR>).*?<DTL1\sval="3".*?</HDR>
And the input string is:
<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR><HDR><DTL1 val="3"><DTL2 val="4"></HDR>
But this regular expression selects
abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR>
Can anyone please help me?
A regex engine will give you always the leftmost match in a string (even if you use a non-greedy quantifier). This is exactly what you obtain.
So, a solution is to forbid the presence of another <HDR> in the parts described by .*? that is too permissive.
There are two techniques for doing that. You can replace the .*? with:
(?>[^<]+|<(?!/HDR))*
or with:
(?:(?!</HDR).)*+
Most of the time the first technique performs better, but if your string contains a high density of <, the second way can give good results too.
The use of a possessive quantifier or an atomic group can reduce the number of steps to obtain a result in particular when the subpattern fails.
Example:
With the first way:
(?<=<HDR>)(?>[^<]+|<(?!/HDR))*<DTL1\sval="3"(?>[^<]+|<(?!/HDR))*</HDR>
or this variant:
(?<=<HDR>)(?:[^<]+|<(?!/HDR|DTL1))*+<DTL1\sval="3"(?:[^<]+|<(?!/HDR))*+</HDR>
With the second way:
(?<=<HDR>)(?:(?!</HDR).)*<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
or this variant:
(?<=<HDR>)(?:(?!</HDR|<DTL1).)*+<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
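To see the difference in Java, here is a rough sketch that runs the question's regex and the first "second way" pattern against the question's input. The class, constant and method names are invented for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HdrMatchDemo {
    static final String INPUT =
            "<HDR>abc<DTL1 val=\"1\"><DTL2 val=\"2\"></HDR>"
            + "<HDR><DTL1 val=\"92\"><DTL2 val=\"55\"></HDR>"
            + "<HDR><DTL1 val=\"3\"><DTL2 val=\"4\"></HDR>";

    // Regex from the question: .*? is free to run across </HDR> boundaries.
    static final String PERMISSIVE = "(?<=<HDR>).*?<DTL1\\sval=\"3\".*?</HDR>";

    // Each character is consumed only if "</HDR" does not start at it.
    static final String TEMPERED =
            "(?<=<HDR>)(?:(?!</HDR).)*<DTL1\\sval=\"3\"(?:(?!</HDR).)*+</HDR>";

    // Returns the first (leftmost) match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(PERMISSIVE, INPUT)); // starts at "abc"
        System.out.println(firstMatch(TEMPERED, INPUT));   // <DTL1 val="3"><DTL2 val="4"></HDR>
    }
}
```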
Casimir et Hippolyte already gave you a couple of good solutions. I want to elaborate on a few things.
First, why your regex fails to do what you want: (?<=<HDR>).*? tells it to match any number of characters starting with the first character preceded by <HDR>, until it encounters what follows the non-greedy quantifier (<DTL1...). Well, the first character that's preceded by <HDR> is the first a, so it matches everything starting from there until the fixed string <DTL1\sval="3" is encountered.
Casimir et Hippolyte's solutions are for the generalized case, where the contents of the <HDR> tags can be anything other than nested <HDR>'s. You could also do it with a negative look-ahead:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
However, if the string is guaranteed to be in the structure shown, where the <HDR> tags only contain one or more <DTL1 val="##"> tags, so you know there won't be any closing tags within, you could do it more efficiently by replacing the first .*? with [^/]*:
(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>
A negated character class is more efficient than a zero-width assertion, and if you're using a negated character class, a greedy quantifier becomes more efficient than a lazy one.
Note also that by using a lookbehind to match the opening <HDR>, you're excluding it from the match, but you're including the closing </HDR>. Are you sure that's what you want? You're matching this...
<DTL1 val="3"><DTL2 val="4"></HDR>
...when presumably you want this...
<HDR><DTL1 val="3"><DTL2 val="4"></HDR>
...or this...
<DTL1 val="3"><DTL2 val="4">
So, in the first case, don't use a lookbehind for the opening tag:
<HDR>(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
<HDR>[^/]*<DTL1\sval="3".*?</HDR>
In the second case, use a look-ahead for the closing tag:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?(?=</HDR>)
(?<=<HDR>)[^/]*<DTL1\sval="3".*?(?=</HDR>)

Java's regex confusion

Basically, I want to match filenames with a .json extension, but not files that start with . and not list.json.
This is what I came up with (without Java string escapes):
(?i)^([^\.][^list].+|list.+)\.json$
I used an online regex tester, Regexplanet, to try my regex:
http://fiddle.re/x9g86
Everything works fine in the regex tester. However, when I tried it in Java, every name containing the letter l, i, s or t was excluded, which is very confusing to me.
Can anyone give me some clues?
Many thanks in advance.
I want to match filename with .json extension but not file that start with . and excluding list.json.
I am not sure you need regular expressions for this. I find the following much easier on the eye:
boolean match = s.endsWith(".json") && !s.startsWith(".") && !s.equals("list.json");
You're using a character exclusion class, [^list], which ignores character order and instead of excluding list, excludes any cases of l, i, s, or t.
Instead, you want to use a negative lookahead:
(?i)(?!^list\.json$)[^\.].*\.json
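A quick Java comparison of the two patterns, as a sketch only. The class name and the sample file names are made up for illustration:

```java
class JsonNameFilter {
    // Pattern from the question: [^list] rejects a single l, i, s or t
    // in the second position, not the word "list".
    static final String CHAR_CLASS = "(?i)^([^\\.][^list].+|list.+)\\.json$";

    // Lookahead version: rejects exactly "list.json" and names starting with ".".
    static final String LOOKAHEAD = "(?i)(?!^list\\.json$)[^\\.].*\\.json";

    public static void main(String[] args) {
        System.out.println("stats.json".matches(CHAR_CLASS));   // false: 't' is second
        System.out.println("stats.json".matches(LOOKAHEAD));    // true
        System.out.println("list.json".matches(LOOKAHEAD));     // false
        System.out.println(".hidden.json".matches(LOOKAHEAD));  // false
    }
}
```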
A negative look-ahead will do it.
(?i)(?!\.|list\.json$).*\.json
(?!\.|list\.json$) is a negative look-ahead that checks that what follows is neither a . nor list.json followed by the end of the string.
Code:
String regex = "(?i)(?!\\.|list\\.json$).*\\.json";
System.out.println("list.json".matches(regex)); // false
System.out.println(".json".matches(regex)); // false
System.out.println("a.Json".matches(regex)); // true
System.out.println("abc.json".matches(regex)); // true
But NPE's more readable solution is probably preferred.

validating input string "RX-EZ12345678912345B" using regex

I need to validate input string which should be in the below format:
<2_upper_case_letters><"-"><2_upper_case_letters><14-digit number><1_uppercase_letter>
Ex: RX-EZ12345678912345B
I tried something like this ^[IN]-?[A-Z]{0,2}?\\d{0,14}[A-Z]{0,1} but it's not giving the expected result.
Any help will be appreciated.
Thanks
Your biggest problem is the [IN] at the beginning, which matches only one letter, and only if it's I or N. If you want to match two of any letters, use [A-Z]{2}.
Once you fix that, your regex will still only match RX-E. That's because [A-Z]{0,2}? starts out trying to consume nothing, thanks to the reluctant quantifier, {0,2}?. Then \d{0,14} matches zero digits, and [A-Z]{0,1} greedily consumes the E.
If you want to match exactly 2 letters and 14 digits, use [A-Z]{2} and \d{14}. And since you're validating the string, you should end the regex with the end anchor, $. Result:
^[A-Z]{2}-[A-Z]{2}\d{14}[A-Z]$
...or, as a Java string literal:
"^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$"
As @nhahtdh observed, you don't really have to use the anchors if you're using Java's matches() method to apply the regex, but I recommend doing so anyway. It communicates your intent better, and it makes the regex portable, in case you have to use it in a different flavor/context.
EDIT: If the first two characters should be exactly IN, it would be
^IN-[A-Z]{2}\d{14}[A-Z]$
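The points above can be checked with a short Java sketch. The class, constant and method names are illustrative, and the padded input with xx and yy around the code is invented to show what the anchors change:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodeValidationDemo {
    // The question's regex with only the [IN] prefix fixed.
    static final Pattern LOOSE =
            Pattern.compile("^[A-Z]{2}-?[A-Z]{0,2}?\\d{0,14}[A-Z]{0,1}");
    static final Pattern UNANCHORED = Pattern.compile("[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]");
    static final Pattern ANCHORED = Pattern.compile("^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$");

    // Returns the text that find() matches with the loose pattern, or null.
    static String looseMatch(String input) {
        Matcher m = LOOSE.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String valid = "RX-EZ12345678912345B";
        String padded = "xx" + valid + "yy";
        System.out.println(looseMatch(valid));                        // RX-E
        System.out.println(UNANCHORED.matcher(padded).find());        // true
        System.out.println(UNANCHORED.matcher(padded).matches());     // false
        System.out.println(ANCHORED.matcher(padded).find());          // false
        System.out.println(ANCHORED.matcher(valid).matches());        // true
    }
}
```

With find(), only the anchored pattern rejects the padded input, which is why the anchors are worth keeping even though matches() makes them redundant.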
Simply translating your requirements into a java regex:
"^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$"
This will allow you to use:
if (!input.matches("^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$")) {
// do something because input is invalid
}
Not sure what you are trying to do at the beginning of your current regex.
"^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$"
The regex above will strictly match the input string as you specified. If you use matches function, ^ and $ may be omitted.
Since you want exact number of repetitions, you should specify it as {<number>} only. {<number>,<number>} is used for variable number of repetitions. And ? specify that the token before may or may not appear - if it must be there, then specifying ? is incorrect.
^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$
This should solve your purpose. You can confirm it from here
This should solve your problem. Check out the validity here
^[A-Z]{2}-[A-Z]{2}[0-9]{14}[A-Z]$
^([A-Z]{2,2}[-]{1,1}[A-Z]{2,2}[0-9]{14,14}[A-Z]{1,1}){1,1}$
