Regex: Match any character a, specific character b and then a again - java

I am trying to implement an algorithm in Java and need a way to match a pattern: any character (let's call it a), then the character 'X', and then the same character a again. My initial thought was regex, but after some time failing to find a way to do that, I am thinking of iterating through all the characters and checking them one by one...
Before that, if anyone could help: I need something so that "AXA", "EXE", "RXR", etc. would match while "AXB", "EXA", "TXX", etc. would not.
I tried something like ".X." but of course it failed, as it matches any character before and after 'X'...
Is there a way to match something like that?

Capture the leading char, and use a back reference:
(.)X\1
See live demo.
Note that in a Java string literal you need two backslashes to produce one literal backslash:
"AXA".matches("(.)X\\1") // true

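As a small sketch of the back-reference answer above: the class and method names here (BackrefDemo, isMirrored, firstMirrored) are made up for illustration. matches() tests the whole string, while find() looks for the pattern anywhere inside a longer text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // (.) captures any one character, X is literal, \1 must repeat the captured character.
    static final Pattern MIRRORED = Pattern.compile("(.)X\\1");

    // True when the whole input is exactly "aXa" for some character a.
    static boolean isMirrored(String s) {
        return MIRRORED.matcher(s).matches();
    }

    // Returns the first "aXa" occurrence inside a longer text, or null if there is none.
    static String firstMirrored(String text) {
        Matcher m = MIRRORED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AXA", "EXE", "RXR", "AXB", "EXA", "TXX"}) {
            System.out.println(s + " -> " + isMirrored(s));
        }
        System.out.println(firstMirrored("fooEXEbar")); // EXE
    }
}
```

Note that "XXX" also matches, since the captured character may itself be 'X'.
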
Related

Why is this regex not matching URLs?

I have the following regex:
^(?=\w+)(-\w+)(?!\.)
Which I'm attempting to match against the following text:
www-test1.examples.com
The regex should match only the -test1 part of the string, and only if it comes before the first . and after the start of the string. www can be any string, but it should not be part of the match.
My pattern is not matching the -test1 part. What am I missing?
Java is one of the only languages that support non-fixed-length look-behinds (which basically means you can use quantifiers), so you can technically use the following:
(?<=^\w+)(-\w+)
This will match -test1 without capturing the preceding text. However, non-fixed-length look-behinds are generally not advisable: they are not perfect, not very efficient, and not portable to other languages. Having said that, this is a simple pattern, so if you don't care about portability, go for it.
The better solution though is to group what you want to capture, and reference the captured group (in this case, group 1):
^\w+(-\w+)
p.s. - \w will not match a dot, so no need to look ahead for it.
p.p.s. - to answer your question about why your original pattern ^(?=\w+)(-\w+)(?!\.) doesn't match, there are 2 reasons:
1) You start with a start-of-string assertion and then use a lookahead to check that one or more word characters follow. But lookaheads are zero-width assertions: no characters are consumed, so the match position does not advance. The lookahead sees "www" and succeeds, but the position is still at the start of the string, so the engine next tries to match (-\w+) there. Your string doesn't start with "-", so the pattern fails.
2) (?!\.) is a negative lookahead, and your example string has a dot immediately after the "-test1" part. So even if #1 didn't fail, this lookahead would reject "-test1"; the engine would backtrack and settle for "-test".
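A quick way to check the behaviour described in this answer is to run the patterns against the sample host name. This is only a sketch; LookaheadDemo and firstGroup are illustrative names, not part of the original posts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Returns capture group 1 of the first match, or null when the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String host = "www-test1.examples.com";
        // Original pattern: the lookahead consumes nothing, so '-' is tried against 'w'.
        System.out.println(firstGroup("^(?=\\w+)(-\\w+)(?!\\.)", host)); // null
        // Consume the prefix with \w+ and capture what follows.
        System.out.println(firstGroup("^\\w+(-\\w+)", host)); // -test1
        // Keeping the negative lookahead makes the engine backtrack one character.
        System.out.println(firstGroup("^\\w+(-\\w+)(?!\\.)", host)); // -test
    }
}
```
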
The problem you're having is the lookahead. It's inappropriate here if you want to capture what lies between the - and the first dot. The pattern you want is something like this:
(-\w+)(?=\.)
In this case, the contents of capture group 1 will contain the text you want.
Demo on Regex101
Try this:
(?<=www)\-\w+(?=\.)
Demo: https://regex101.com/r/xEpno7/1

Extract variable from a simple mathematical equation (Java, String)

I will be handling a bunch of strings that will be of the following format:
"2*salary"
"salary+2"
"2*salary/3"
My goal is to pull out just "salary". I do not, however, want to strip out non-letter characters, because a variable name may mix letters and digits, as in "2*id3" (note: it will never be all digits). I currently use:
Pattern pattern = Pattern.compile("[\\w_]+");
However, for something like "2*salary" this results in "2" and "salary" being found.
You're probably looking for this:
Pattern.compile("[a-zA-Z]\\w+");
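... in other words, match a sequence of characters that begins with a letter. That will match 'salary' but not '2'. Note that when searching with find(), the input '2salary' still yields 'salary', and because \w+ requires at least one more character, a single-letter name such as 'x' is only matched if you use \w* instead.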
If you in fact do need to match 2salary, use this:
Pattern.compile("[0-9]*[A-Za-z]\\w+");
(I have replaced [\w_] with just \w, since \w already includes the underscore.)
That is because "2*salary" contains two matches for your "word" character definition \w, which is [a-zA-Z0-9_]: the first match is 2 and the second is salary.
In your case you need something like "[a-zA-Z]\\w*"
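For illustration, here is a minimal sketch of pulling the variable name out with the letter-first pattern. It assumes each expression holds a single variable; VariableNameDemo and firstVariable are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VariableNameDemo {
    // A letter followed by zero or more word characters (letters, digits, underscore).
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z]\\w*");

    // Returns the first identifier-like token in the expression, or null if there is none.
    static String firstVariable(String expression) {
        Matcher m = IDENTIFIER.matcher(expression);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String e : new String[] {"2*salary", "salary+2", "2*salary/3", "2*id3"}) {
            System.out.println(e + " -> " + firstVariable(e));
        }
    }
}
```

Using \w* rather than \w+ lets a single-letter name such as "x" match as well.
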

Java Regular Expression Pattern Matching

I want to create a regular expression, in Java, that will match the following:
*A*B
where A and B are ANY character except asterisk, and there can be any number of A characters and B characters. A(s) is/are preceded by asterisk, and B(s) is/are preceded by asterisk.
Will the following work? Seems to work when I run it, but I want to be absolutely sure.
Pattern.matches("\\A\\*([^\\*]{1,})\\*([^\\*]{1,})\\Z", someString)
It will work, however you can rewrite it as this (unquoted):
\A\*([^*]+)\*([^*]+)\Z
There is no need to escape the star inside a character class, and {1,} and + are the same quantifier (once or more).
Note 1: you use .matches() which automatically anchors the regex at the beginning and end; you may therefore do without \A and \Z.
Note 2: I have retained the capturing groups -- do you actually need them?
Note 3: it is unclear whether you want the same character repeated between the stars; the example above assumes not. If you want the same, then use this:
\A\*(([^*])\2*)\*(([^*])\4*)\Z
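Both variants from this answer can be tried out with a short sketch like the one below. Since matches() already anchors at both ends, \A and \Z are left out here; StarPatternDemo and its method names are invented for the example.

```java
import java.util.regex.Pattern;

class StarPatternDemo {
    // *A*B where each run is one or more non-asterisk characters.
    static final Pattern ANY_RUN = Pattern.compile("\\*([^*]+)\\*([^*]+)");
    // Same shape, but each run must repeat a single character (groups 2 and 4).
    static final Pattern SAME_CHAR_RUN = Pattern.compile("\\*(([^*])\\2*)\\*(([^*])\\4*)");

    static boolean matchesAnyRun(String s) {
        return ANY_RUN.matcher(s).matches();
    }

    static boolean matchesSameCharRun(String s) {
        return SAME_CHAR_RUN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesAnyRun("*ABC*DEF"));      // true
        System.out.println(matchesAnyRun("*A**B"));         // false
        System.out.println(matchesSameCharRun("*AAA*BBB")); // true
        System.out.println(matchesSameCharRun("*ABC*DEF")); // false
    }
}
```
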
If I understood correctly, it can be as simple as:
^\\*((?!\\*).)+\\*((?!\\*).)+
If you want a match on *AAA*BBB but not on *ABC*DEF use
^\*([a-zA-Z])\1*\*([a-zA-Z])\2*$
This won't match on this either
*A_$-123*B<>+-321

Java RegExp: Capture part after a character but don't replace the character

I am using Java to parse through a JavaScript file. Because the scope is different than expected in the environment in which I am using it, I am trying to replace every instance of, e.g.,
test = value
with
window.test = value
Previously, I had just been using
writer.append(js.getSource().replaceAll("test", "window.test"));
which obviously isn't generalizable, but for a fixed dataset it was working fine.
However, in the new files I'm supposed to work with, an updated version of the old ones, I now have to deal with
window['test'] = value
and
([[test]])
I don't want to match test in either of those cases, and it seems like those are the only two cases where there's a new format. So my plan was to now do a regex to match anything except ' and [ as the first character. That would be ([^'\[])test; however, I don't actually want to replace the first character - just make sure it's not one of the two I don't want to match.
This was a new situation for me because I haven't worked with replacement with RegExps that much, just pattern matching. So I looked around and found what I thought was the solution, something called "non-capturing groups". The explanation on the Oracle page sounded like what I was looking for, but when I re-wrote my Regular Expression to be (?:[^'\\[])test, it just behaved exactly the same as if I hadn't changed anything - replacing the character preceding test. I looked around StackOverflow, but what I discovered just made me more confident that what I was doing should work.
What am I doing wrong that it's not working as expected? Am I misusing the pattern?
If you include an expression for the character in your regex, it will be part of what is matched.
The trick is to use what you match in the replacement String, so you replace that bit by itself.
Try:
replaceAll("([^'\\[])test", "$1window.test")
The $1 in the replacement String is a back reference to what capturing group 1 matched. In this case that is the character preceding test.
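A rough sketch of the $1 approach is shown below, next to a negative lookbehind, which is a different technique from the one in the answer: it checks the preceding character without consuming it, so it also handles test at the very start of the input. ReplaceDemo and its method names are hypothetical.

```java
class ReplaceDemo {
    // Approach from the answer: capture the preceding character and put it back with $1.
    static String withBackReference(String source) {
        return source.replaceAll("([^'\\[])test", "$1window.test");
    }

    // Alternative: a negative lookbehind asserts the previous character is neither ' nor [
    // without making it part of the match.
    static String withLookbehind(String source) {
        return source.replaceAll("(?<!['\\[])test", "window.test");
    }

    public static void main(String[] args) {
        System.out.println(withBackReference("a = 1; test = value")); // a = 1; window.test = value
        System.out.println(withBackReference("window['test'] = value")); // unchanged
        System.out.println(withBackReference("test = value")); // unchanged: no preceding character
        System.out.println(withLookbehind("test = value")); // window.test = value
    }
}
```
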
Why not simply test on "(test)(\s*)=(\s*)([\w\d]+)" ? That way you only match "test", then whitespace, followed by an '=' sign followed by a value (in this case consisting of digits and alphabetical letters and the underscore character). You can then use the groups (between parentheses) to copy the value -and even the whitespace if required - to your new text.

validating input string "RX-EZ12345678912345B" using regex

I need to validate input string which should be in the below format:
<2_upper_case_letters><"-"><2_upper_case_letters><14-digit number><1_uppercase_letter>
Ex: RX-EZ12345678912345B
I tried something like this ^[IN]-?[A-Z]{0,2}?\\d{0,14}[A-Z]{0,1} but it's not giving the expected result.
Any help will be appreciated.
Thanks
Your biggest problem is the [IN] at the beginning, which matches only one letter, and only if it's I or N. If you want to match two of any letters, use [A-Z]{2}.
Once you fix that, your regex will still only match RX-E. That's because [A-Z]{0,2}? starts out trying to consume nothing, thanks to the reluctant quantifier, {0,2}?. Then \d{0,14} matches zero digits, and [A-Z]{0,1} greedily consumes the E.
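The effect of the reluctant quantifier can be seen with a small experiment. This is a sketch only; ReluctantDemo and matchedText are made-up names, and find() is used so that a partial match is visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Returns the text matched by the first find(), or null when nothing matches.
    static String matchedText(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "RX-EZ12345678912345B";
        // Original attempt: [IN] cannot match the leading 'R'.
        System.out.println(matchedText("^[IN]-?[A-Z]{0,2}?\\d{0,14}[A-Z]{0,1}", input)); // null
        // With [A-Z]{2} at the start, the reluctant {0,2}? takes nothing, \d{0,14} takes
        // nothing, and the final [A-Z]{0,1} takes the 'E'.
        System.out.println(matchedText("^[A-Z]{2}-?[A-Z]{0,2}?\\d{0,14}[A-Z]{0,1}", input)); // RX-E
    }
}
```
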
If you want to match exactly 2 letters and 14 digits, use [A-Z]{2} and \d{14}. And since you're validating the string, you should end the regex with the end anchor, $. Result:
^[A-Z]{2}-[A-Z]{2}\d{14}[A-Z]$
...or, as a Java string literal:
"^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$"
As @nhahtdh observed, you don't really have to use the anchors if you're using Java's matches() method to apply the regex, but I recommend doing so anyway. It communicates your intent better, and it makes the regex portable, in case you have to use it in a different flavor/context.
EDIT: If the first two characters should be exactly IN, it would be
^IN-[A-Z]{2}\d{14}[A-Z]$
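Put together, a validation helper along these lines might look as follows. It is a minimal sketch using the anchored pattern above; CodeValidator and isValid are illustrative names, and treating null as invalid is an assumption.

```java
import java.util.regex.Pattern;

class CodeValidator {
    // Two uppercase letters, a hyphen, two uppercase letters, 14 digits, one uppercase letter.
    static final Pattern CODE = Pattern.compile("^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$");

    // Assumption: null input is treated as invalid rather than throwing.
    static boolean isValid(String input) {
        return input != null && CODE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("RX-EZ12345678912345B")); // true
        System.out.println(isValid("RX-EZ1234567891234B"));  // false: only 13 digits
        System.out.println(isValid("rx-EZ12345678912345B")); // false: lowercase prefix
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which String.matches() would do.
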
Simply translating your requirements into a java regex:
"^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$"
This will allow you to use:
if (!input.matches("^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$")) {
    // do something because input is invalid
}
Not sure what you are trying to do at the beginning of your current regex.
"^[A-Z]{2}-[A-Z]{2}\\d{14}[A-Z]$"
The regex above will strictly match the input string as you specified. If you use the matches function, ^ and $ may be omitted.
Since you want an exact number of repetitions, specify it as {<number>} only; {<number>,<number>} is for a variable number of repetitions. And ? specifies that the preceding token may or may not appear, so using ? on a token that must be present is incorrect.
^[A-Z]{2}-[A-Z]{2}\d{14}[A-Z]$
This should solve your purpose.
This should solve your problem:
^[A-Z]{2}-[A-Z]{2}[0-9]{14}[A-Z]$
^([A-Z]{2,2}[-]{1,1}[A-Z]{2,2}[0-9]{14,14}[A-Z]{1,1}){1,1}$
