Match a^xb^x with regex - java

To clarify, I want to match:
ab
aabb
aaabbb
...
This works in Perl:
if ($exp =~ /^(a(?1)?b)$/)
To understand this, look at the string as though it grows from the outside-in, not left-right:
ab
a(ab)b
aa(ab)bb
(?1) is a reference to the outer set of parentheses. We need the ? afterwards for the last case (going from outside in): nothing is left, and ? means 0 or 1 of the preceding expression, so it essentially acts as our base case.
I posted a similar question asking what the equivalent of (?1) is in Java. Today I found out that \\1 refers to the first capturing group. So, I assumed that this would work:
String pattern = "^(a(?:\\1)?b)$";
but it did not. Does anyone know why?
NB: I know there are other, better, ways to do this. This is strictly an educational question. As in I want to know why this particular way does not work and if there is a way to fix it.

The \\1 is a backreference and refers to the value of the group, not to the pattern as the recursion (?1) does in Perl. Unfortunately, Java regexes do not support recursion, but the pattern can be expressed using lookarounds and backrefs.
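As a rough sketch of both points, the class below first runs the asker's pattern and then one commonly cited lookahead-plus-backreference formulation. The class name AnBnDemo, the method isAnBn and the second pattern are illustrative assumptions, not something taken from the answer above. In Java, a backreference to a group that has not captured anything yet simply fails, so the optional (?:\\1)? is skipped and the first pattern only ever matches "ab".

```java
import java.util.regex.Pattern;

class AnBnDemo {
    // The asker's attempt. \1 refers to the TEXT captured by group 1, and while
    // group 1 is still being matched it has captured nothing, so \1 fails and
    // the optional group is skipped. The pattern behaves like ^(ab)$.
    static final Pattern BACKREF_ATTEMPT = Pattern.compile("^(a(?:\\1)?b)$");

    // Assumed alternative: for every 'a' consumed, a lookahead skips the
    // remaining a's and grows group 1 by one 'b' (previous capture plus b).
    // After the last 'a', \1 must match the rest of the input exactly.
    static final Pattern AN_BN = Pattern.compile("^(?:a(?=a*(\\1?+b)))+\\1$");

    static boolean attemptMatches(String s) {
        return BACKREF_ATTEMPT.matcher(s).matches();
    }

    static boolean isAnBn(String s) {
        return AN_BN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ab", "aabb", "aaabbb", "aab", "abb", "ba"}) {
            System.out.println(s + ": attempt=" + attemptMatches(s)
                    + ", lookahead=" + isAnBn(s));
        }
    }
}
```

The possessive \\1?+ matters here: it stops the engine from dropping the previous capture when the following b fails, which would otherwise let unbalanced strings through.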

Related

validate special characters by negating unicode letters with regex pattern?

This regex: \p{L}+ matches these characters "ASKJKSDJKDSJÄÖÅüé" of the example string "ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé" which is great but is the exact opposite of what I want. Which leads me to negating regexes.
Goal:
I want to match any and all characters that are neither a letter nor a number, in multiple languages.
Could a negative regex be a natural direction for this?
I should mention one intended use for the regex I'd like to find is to validate passwords for the rule:
that it needs to contain at least one special character, which I
define as not being a number nor a letter.
It would seem defining ranges of special characters should be avoided if possible, because why limit the possibilities? Thus my definition. I assume there could be some problems with such a wide definition, but it is a first step.
If you have some suggestions for a better solution I'm giving below or just have some thoughts on the subject, I'm sure I'm not the only one that would like to learn about it. Thanks.
Note I'm using double \\ in the Java code. Platform is Java 11.
You can put those \\p{...} classes inside [], which means you can negate them like any other character class. This is all you need:
Pattern p = Pattern.compile("[^\\p{L}]");
Matcher m = p.matcher("ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé");
while (m.find()) System.out.print(m.group(0));
That prints:
_-.;,!”#€%&/()=?`¨’<>
Which is exactly what you're looking for, no?
No need to mess with lookaheads here.
So after having read similar, though not identical, questions and some equally great answers, I came up with this solution: (?=\P{L})(?=\P{N}), meaning match a position where the next character is both not a letter and not a number. Even though I'm asserting numbers separately, I need to negate both to meet the specification of special characters (see question).
This makes use of non-consuming (lookahead) expressions, written with parentheses and ?=: the expression in the first parentheses is checked first, and then the second is checked at the same position. Thanks to @Jason Cohen for this detail in the "Regular Expressions: Is there an AND operator?" discussion.
The upper case P in \P{L} and \P{N} expresses the "not belonging to a category" in Unicode Categories, where the uppercase P means "not", i e the opposite of a lowercase p.
It's not perfect for a real world solution, but works as a starting point at least. Note I'm using double \\ in the Java code. Platform is Java 11.
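For the password rule from the question, a minimal sketch could combine both ideas: a negated character class that excludes letters and numbers at once, searched with find(). The class name SpecialCharCheck and the method containsSpecial are made up for illustration. Note that under this definition whitespace also counts as a special character.

```java
import java.util.regex.Pattern;

class SpecialCharCheck {
    // One character that is neither a Unicode letter (\p{L})
    // nor a Unicode number (\p{N}).
    private static final Pattern SPECIAL = Pattern.compile("[^\\p{L}\\p{N}]");

    // True if the password contains at least one such character anywhere.
    static boolean containsSpecial(String password) {
        return SPECIAL.matcher(password).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSpecial("Passw0rd"));
        System.out.println(containsSpecial("Passw0rd!"));
        // Letters outside ASCII are still letters: \u00C4\u00D6\u00C5 is "ÄÖÅ".
        System.out.println(containsSpecial("\u00C4\u00D6\u00C542"));
    }
}
```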

Java Regexp pattern check

Pattern
^\\d{1}-\\d{10}|\\d{1,9}|^TWC([0-9){12})$
should validate any of these
1-23232445
1-232323
1-009121212
12
12222
TWC12222
TWC1222324
When I test the TWC inputs, the pattern doesn't match. I added "|" for the OR condition and then allowed digits 0-9, limited to 12 digits. What am I missing?
TWC([0-9)
I think this is where it might be not working??
You need
TWC([0-9]{12})
Complete answer...
(\d{1}-\d{1,12})|^TWC(\d{1,12})$
even nicer answer ..
^(\\d-|TWC|)(\\d{1,12})$ // this syntax i believe will match your needs.
tested :)
^([0-9]-|TWC|)([0-9]{1,12})$ // or
^(\d-|TWC|)(\d{1,12})$
breakdown
^
this denotes the start of the string
\d or [0-9]
denotes one character of the numbers 0 through 9 (note \d might not work in some languages or require different syntax!)
|
is essentially an OR
{1,12}
repeats the preceding pattern 1 to 12 times; in my code the pattern would be \d or [0-9]
$
is the end of the line
This essentially checks whether the line starts with a [0-9] followed by a -, with TWC, or with nothing at all (the empty alternative), and then reads up to 12 digits. Should work for all your cases.
NOTE:
You need to look at the syntax of whatever language you are using. In some cases you need to escape the backslash itself for these patterns to work: in Java and C/C++ string literals, \d has to be written as \\d. Please be very wary of such language-specific syntax.
Sorry for all the confusion, and also for lying a whole bunch apparently. The issue you're having is that you are using exact quantifiers in a couple of places you don't mean to, namely the {10} and {12}. This requires exactly ten or twelve digits in those spots. What you presumably want is for those to be {1,10} and {1,12} respectively.
What I would do is something like this, using parentheses and quantifiers to clean everything up and repeating yourself as little as possible, to avoid confusion. You've got three possible prefixes (a digit and a dash, or "TWC", or nothing). I'd put those possibilities all together, and then add the rest. This makes the regex much easier to look at.
^(\\d-|TWC){0,1}\\d{1,12}$
The breakdown:
^ is at the beginning, always.
(\\d-|TWC){0,1} Next comes either a single digit followed by a dash, or the string "TWC". This prefix occurs either zero times (for no prefix) or one time.
\\d{1,12}$ Finally, there is a string of one to twelve digits, followed by the end of the line/input (depending on your MULTILINE setting of course).
Of course you won't be able to simplify it quite this much if the different prefixes can only allow certain numbers of digits, but this is the basic idea.
You've also got what looks like a typo; TWC([0-9){12}) should be TWC([0-9]{12}). I'm guessing this was just a typo when writing out the question though, since what you have right now would blow up at runtime when you tried to use it otherwise, and it sounds like it's working for some of your inputs.
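A quick way to check the simplified pattern against the inputs from the question is a small harness like the following sketch. The class name IdFormatCheck and the method isValid are illustrative only.

```java
import java.util.regex.Pattern;

class IdFormatCheck {
    // Optional prefix (one digit plus a dash, or "TWC"), then 1 to 12 digits.
    private static final Pattern ID = Pattern.compile("^(\\d-|TWC){0,1}\\d{1,12}$");

    static boolean isValid(String s) {
        return ID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "1-23232445", "1-232323", "1-009121212", "12", "12222",
            "TWC12222", "TWC1222324",
            // Inputs that should be rejected:
            "TWC", "ABC123", "1234567890123"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Because matches() already requires the whole input to match, the ^ and $ anchors are redundant here, but they keep the pattern usable with find() as well.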

Regex index 0 how it exactly works

By compiling the following:
System.out.println(Pattern.matches(".?(\\d)$","3"));
It returns true because there is nothing before the 3, and ? means zero or one of the preceding element.
However, 3 is already the first character of the input, which starts at index 0 and ends at index 1. How can the JVM recognize that there is nothing before the 3?
For example the following.
System.out.println(Pattern.matches(".*","hello"));
It returns true as well but only the very last character gets matched with "nothing".
There should not be a "nothing" character at the beginning of a string, only at the end of it right?
This is not really about the JVM. This is about Java regular expressions.
The regular expression ".*" means "match 0 or more characters". An empty string has 0 characters, so it satisfies this too. Whether a quantifier prefers the shortest or the longest possible match is determined by its syntax, not left to the implementation. If you read this excellent writeup (http://docs.oracle.com/javase/tutorial/essential/regex/quant.html) you can see that a plain quantifier like ".*" is "greedy" and takes as much as possible, while adding a ? as in ".*?" makes it "reluctant", preferring to take as little as possible.
Based on the information in that writeup, you can see that ".{0,}" is just another spelling of the greedy ".*". Either way, Pattern.matches requires the whole input to match, so ".*" against "hello" consumes all five characters.
You are not interpreting your regex correctly. There is no such thing as a "nothing character". Rather, your pattern reads: any character followed by a digit at the end of the string, OR just a digit at the end of the string.
And surely, "3" fits the second description very well.
The matches method tries to match the entire input,
so there's no need to use ^ and $.
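The points made in these answers can be tried out with a short sketch. The class name QuantifierDemo and the helper firstMatch are hypothetical. The helper uses find(), which, unlike matches(), does not require the whole input to be consumed and so makes the greedy/reluctant difference visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the text of the first match found by find(), or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // .? is optional: with nothing before the digit it matches zero characters.
        System.out.println(Pattern.matches(".?(\\d)$", "3"));   // true
        System.out.println(Pattern.matches(".?(\\d)$", "x3"));  // true

        // Greedy .* takes as much as it can; reluctant .*? as little as it can.
        System.out.println("[" + firstMatch(".*", "hello") + "]");   // [hello]
        System.out.println("[" + firstMatch(".*?", "hello") + "]");  // []
    }
}
```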

Possible Regular Expression Question

I have a simple program that looks up details of an IP you give it, and I will show you an example of some of my code
int regIndex = src.indexOf("Region:") + 16;
int endIndex = src.indexOf("<", regIndex);
String region = src.substring(regIndex, endIndex);
if(regIndex == 15) region = "None";
int counIndex = src.indexOf("Country:") + 17;
int couneIndex = src.indexOf(" <", counIndex);
String country = src.substring(counIndex, couneIndex);
As you can see, it is definitely not the most efficient way to do this. The website I am using gives the information like this: http://whatismyipaddress.com/ip/1.1.1.1
I have never really used Regular Expressions before, but it seems to me like there might be one that could really make this more efficient and easier to program, but I've been looking around and I'm pretty lost.
So basically my question is, how could I use a Regular Expression for this (Or if there is another more efficient way).
Any help would be great,
Thanks :)
You can do something like this:
String s = "bla Country: Australia <bla";
Pattern pattern = Pattern.compile("Country: (.*) [<]");
Matcher matcher = pattern.matcher(s);
if(matcher.find()) {
System.out.println("Country = " + matcher.group(1));
}
The source would look like this
<tr><th>Country:</th><td>Australia <img src="http://whatismyipaddress.com/images/flags/au.png" alt="au flag"> </td></tr>
To use regular expression means to match a pattern.
The pattern that indicates your wanted data is pretty straightforward: Country:. You also need to match the following tags, like <\/th><td>. The only thing is that the forward slash is escaped here (Java itself does not require this, but it is harmless). Then there is the data you are looking for. I would suggest matching everything that is not a <, so [^<]; this is a character class with a negation at the beginning, meaning any character that is not a <. To repeat it, add a + at the end, meaning one or more of the preceding element.
So, the complete thing should look like this:
Country:<\/th><td>\s*([^<]+)\s*<
I also added the parentheses; they store the matched text in a group, so your result can be found in capturing group 1. I also added \s*, a whitespace character repeated 0 or more times, to match whitespace before or after your data, which I assume you don't need.
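In Java, that expression might be applied to the table row shown earlier roughly as follows. The class name CountryExtractor and the method extractCountry are invented for this sketch, and the "None" fallback simply mirrors the asker's original code. The forward slash is left unescaped since Java does not need it escaped, and trim() removes the whitespace that [^<]+ picks up before the <img> tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountryExtractor {
    // "Country:" header cell, then the data cell; group 1 is the text up to the next tag.
    private static final Pattern COUNTRY =
            Pattern.compile("Country:</th><td>\\s*([^<]+)\\s*<");

    static String extractCountry(String html) {
        Matcher m = COUNTRY.matcher(html);
        // Fall back to "None" when the row is missing, as in the original code.
        return m.find() ? m.group(1).trim() : "None";
    }

    public static void main(String[] args) {
        String row = "<tr><th>Country:</th><td>Australia "
                + "<img src=\"http://whatismyipaddress.com/images/flags/au.png\" "
                + "alt=\"au flag\"> </td></tr>";
        System.out.println("Country = " + extractCountry(row));
    }
}
```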
Firstly, there are some online sites that can help you develop a regular expression. They let you enter some text and a regular expression, and then show you the result of applying the expression to the text. This saves you having to write code as you develop the expression and expand your understanding. A good site I use a lot is FileFormat regex because it allows me to test one expression against multiple test strings. A quick search also brought up regex Planet, RegExr and RegexPal. There are lots of others.
In terms of resources, the Java Pattern class reference is useful for Java development and I quite like regular-expressions.info as well.
For your problem I used fileFormat.info and came up with this regex to match "http://whatismyipaddress.com/ip/1.1.1.1":
.*//([.\w]+)/.*/(\d+(?:\.\d+){3})
or as a java string:
".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})"
A quick breakdown says: anything (.*), followed by two slashes (//), followed by one or more dots or word characters (([.\w]+)), followed by a slash, any number of characters and another slash (/.*/), followed by at least 1 digit (\d+), followed by 3 sets of a dot and at least one digit ((?:\.\d+){3}). The dot is escaped in the IP part because an unescaped . would match any character. The sets of parentheses around the server name part and the IP part are called capturing groups, and you can use methods on the Java Matcher class to return the contents of these sections. The ?: in the second part of the IP address tells it that we are using the parentheses to group the characters but that they are not to be treated as a capturing group.
This regex is not as strict or as flexible as it should be, but it's a starting point.
All of this can be researched on the above links.
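To show how the capturing groups are read back through Matcher, here is a small sketch using a version of that regex with the dots in the IP part escaped. The class name LookupUrlParser and the method parse are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookupUrlParser {
    // Group 1: host name, group 2: dotted IPv4-style address at the end of the URL.
    private static final Pattern URL =
            Pattern.compile(".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})");

    // Returns {host, ip}, or null when the URL does not have the expected shape.
    static String[] parse(String url) {
        Matcher m = URL.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("http://whatismyipaddress.com/ip/1.1.1.1");
        System.out.println("host = " + parts[0] + ", ip = " + parts[1]);
    }
}
```

Like the original, this does not check that each number is in the range 0 to 255; it only checks the overall shape.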

Regular Expression - Exclude list of words for a name

I'm trying to make a regular expression that accepts this:
Only a-z, 0-9, _ chars, with a minimum length of 3
admin, static, my and www are rejected.
For the first part, I already managed to do it with :
^[a-zA-Z0-9\\_]{3,}$
But I don't know how to exclude the words listed previously.
For example, that would mean :
static is not allowed (of course), but
statice is allowed
estatic is allowed
Using this regular expression :
^(?!static|my|admin|www).*$
doesn't work well : it excludes statice (and everything after the unauthorized word).
Do you know which regular expression will fit my need?
Try something like this:
^(?!static$|my$|admin$|www$).*$
This will disallow "static" but allow "statice", "statica", etc. By anchoring each blacklisted word to the end of the string you will only match them if they are standing alone without any trailing characters.
Edit: codaddict has suggested a cleaner way to do basically the same thing:
^(?!(?:static|my|admin|www)$).*$
I'll answer my own question to give the complete answer (a regexp that covers both requirements), but I'll give the accepted answer to Andrew Hare, who led me to the correct way :)
Here's how to :
Allow only a-z, 0-9, _ chars, with a minimum length of 3
Exclude admin, static, my and www
Here is the regexp :
^(?!static$|my$|admin$|www$)[a-z0-9\_]{3,}$
Or, as codaddict mentioned, with a single end anchor:
^(?!(?:static|my|admin|www)$)[a-z0-9\_]{3,}$
Hope this helps in the future!
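For anyone applying this from Java, a minimal sketch might look like the following. The class name UsernameValidator and the method isAllowed are illustrative. The underscore is written without a backslash, since it needs no escaping inside a character class.

```java
import java.util.regex.Pattern;

class UsernameValidator {
    // The negative lookahead rejects the reserved words only when they make up
    // the whole input; the character class then enforces a-z, 0-9, _ and length >= 3.
    private static final Pattern NAME =
            Pattern.compile("^(?!(?:static|my|admin|www)$)[a-z0-9_]{3,}$");

    static boolean isAllowed(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"static", "statice", "estatic", "admin", "www", "my", "user_1"}) {
            System.out.println(s + " -> " + isAllowed(s));
        }
    }
}
```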
