I am trying to write a program using regex. The format for an identifier, as I might have explained in another question of mine, is that it can only begin with a letter (and the rest of it can contain whatever). I have this part worked out for the most part.
However, anything within quotes cannot count as an identifier either.
Currently I am using Pattern pattern = Pattern.compile("[A-Za-z][_A-Za-z0-9]*"); as my pattern, which requires the first character to be a letter. So how can I edit this to check whether the word is surrounded by quotation marks (and EXCLUDE those words)?
Use negative lookaround assertions:
"(?<!\")\\b[A-Za-z][_A-Za-z0-9]*\\b(?!\")"
Example:
Pattern pattern = Pattern.compile("(?<!\")\\b[A-Za-z][_A-Za-z0-9]*\\b(?!\")");
Matcher matcher = pattern.matcher("Foo \"bar\" baz");
while (matcher.find())
{
System.out.println(matcher.group());
}
Output:
Foo
baz
See it working online: ideone.
Use lookarounds.
"(?<![\"A-Za-z])[A-Z...
The (?<![\"A-Za-z]) part means "if the previous character is not a quotation mark or a letter".
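The lookarounds above only inspect the character directly next to a word, so a quoted phrase with three or more words (such as "one two three") can still leak its middle word. A minimal sketch of a different technique, assuming quoted strings never contain escaped quotes: match either a whole quoted string or an identifier, and keep only the identifier group. The class and method names are illustrative and do not come from the original posts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedIdentifiers {
    // Alternative 1 consumes a whole quoted string; alternative 2 captures an identifier.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"]*\"|\\b([A-Za-z][_A-Za-z0-9]*)\\b");

    static List<String> find(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // group(1) is null when the quoted-string alternative matched
            if (m.group(1) != null) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(find("Foo \"one two three\" baz")); // [Foo, baz]
    }
}
```

Because the quoted alternative is tried first at each position, everything between a pair of quotes is consumed before the identifier alternative gets a chance to see it.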
I am trying to match a phrase (Source Ip) where each letter can be lower case or upper case, so I wrote the regex pattern below, but my m.find() returns false even for a correct match.
Is there something wrong with my regex pattern, or with the way I have written the code?
String word = "Source Ip";
String pattern = "[S|s][O|o][U|u][R|r][C|c][E|e]\\s*[I|i][P|p]";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(word);
System.out.println(m.find());
You can simply use
String pattern = "SOURCE\\s*IP";
Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Pattern.CASE_INSENSITIVE will make the matching case insensitive.
You don't need to alternate all letters between upper case and lowercase (note, as mentioned by others, the character class idiom does not require | to alternate - adding it between square brackets will also match the | literal).
You can parametrize your Pattern initialization with the Pattern.CASE_INSENSITIVE flag (alternative inline usage would be (?i) before your pattern representation).
For instance:
Pattern.compile("(?i)source\\s*ip");
... or ...
Pattern.compile("source\\s*ip", Pattern.CASE_INSENSITIVE);
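A small sketch comparing the two spellings of the flag, and showing why a class like [S|s] is looser than intended. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Inline flag: (?i) at the start of the expression
    static boolean inlineFlag(String s) {
        return Pattern.compile("(?i)source\\s*ip").matcher(s).find();
    }

    // Same behaviour via the compile-time flag
    static boolean compileFlag(String s) {
        return Pattern.compile("source\\s*ip", Pattern.CASE_INSENSITIVE).matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(inlineFlag("Source Ip"));   // true
        System.out.println(compileFlag("SOURCE IP"));  // true
        // Inside a character class, | is a literal character, not alternation.
        System.out.println("|".matches("[S|s]"));      // true
    }
}
```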
Note
Flag API here.
This solution has the problem of accepting sourceip as well.
source\s*ip
Therefore the correct answer should be
source\s+ip
to force the presence of at least one whitespace between the two words.
To use this expression in Java you have to escape the backslash and use something like:
Pattern.compile("source\\s+ip", Pattern.CASE_INSENSITIVE);
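To make the difference between \s* and \s+ concrete, here is a short sketch that runs both patterns against a few inputs. The class and constant names are illustrative.

```java
import java.util.regex.Pattern;

class SourceIpSpacing {
    static final Pattern OPTIONAL_SPACE =
            Pattern.compile("source\\s*ip", Pattern.CASE_INSENSITIVE);
    static final Pattern REQUIRED_SPACE =
            Pattern.compile("source\\s+ip", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        // "sourceip" is accepted by \s* but rejected by \s+
        for (String s : new String[] {"Source Ip", "SOURCE   IP", "sourceip"}) {
            System.out.println(s
                    + " -> optional: " + OPTIONAL_SPACE.matcher(s).find()
                    + ", required: " + REQUIRED_SPACE.matcher(s).find());
        }
    }
}
```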
If you really want to use a plain regex for some reason rather than the Pattern flags, then this regex should suit your needs; it works just fine in Java for me:
[s|S][o|O][U|u][r|R][c|C][e|E][ ]*[i|I][p|P]
your case of using
\\s*
won't identify spaces however this will
\s*
you had one slash too many :)
EDIT:
I demonstrated my ignorance: after checking regexpal I see I was wrong, and [sS] is better than [s|S]:
[sS][oO][Uu][rR][cC][eE][ ]*[iI][pP]
Thank you, Pshemo.
And yes, I completely forgot about escaping characters in Java strings. Mena, thank you for pointing that out.
I have a regex like this:
(?:(\\s| |\\A|^))(?:#)[A-Za-z0-9]{2,}
What I am trying to do is find a pattern that starts with an # and has two or more characters after, however it can't start in the middle of a word.
I'm new to regex, but I was under the impression that ?: matches and then excludes the characters; however, my regex seems to match and include them. Ideally I'd like "#test" to return "test" and "test#test" to not match at all.
Can anyone tell me what I've done wrong?
Thanks.
Your understanding is incorrect. The difference between (...) and (?:...) is only that the former also creates a numbered match group which can be referred to with a backreference from within the regex, or as a captured match group from code following the match.
You could change the code to use lookbehinds, but the simple and straightforward fix is to put ([A-Za-z0-9]{2,}) inside regular parentheses, like I have done here, and retrieve the first matched group. (The # doesn't need any parentheses around it in this scenario, but the ones you have are harmless.)
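A minimal sketch of the capturing-group fix described above, assuming the prefix only needs to be whitespace or the start of the input. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashTagCapture {
    // Non-capturing prefix (whitespace or start of input), then '#', then the captured name.
    private static final Pattern TAG = Pattern.compile("(?:\\s|^)#([A-Za-z0-9]{2,})");

    /** Returns the text after '#', or null when nothing matches. */
    static String firstTag(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTag("#test"));      // test
        System.out.println(firstTag("test#test"));  // null
    }
}
```

Here group(0) would still be the whole match including the #, while group(1) holds only the part inside the capturing parentheses.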
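Try this: you can use a non-word boundary (\B) before the # to express your condition. A plain \b would do the opposite, because it only matches before # when a word character precedes it.
public static void main(String[] args) {
String s1 = "#test";
String s2 = "test#test";
// \B before '#' rejects a '#' that directly follows a word character
String pattern = "\\B#\\w{2,}\\b";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(s1);
if (m.find()) {
System.out.println(m.group());
}
}
Output:
#test
In the second case (s2), find() returns false; calling m.group() without checking it first would throw an `IllegalStateException`.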
How about:
\W#[\S]{2}[\S]*
The strings caught by this regular expression need to be trimmed and have their first character removed.
I think you would be better off with the following one:
(?<=(?<!\w)#)\w{2,}
Don't forget to escape the backslashes in Java, since the regex is written in a string literal:
(?<=(?<!\\w)#)\\w{2,}
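A short sketch of the lookbehind version in use. Since the # is only asserted, not consumed, group() already returns the bare name. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashTagLookbehind {
    // Preceded by '#', where that '#' is itself not preceded by a word character.
    private static final Pattern TAG = Pattern.compile("(?<=(?<!\\w)#)\\w{2,}");

    static List<String> tags(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            result.add(m.group()); // the whole match excludes the '#'
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tags("#foo bar#baz #ok1")); // [foo, ok1]
    }
}
```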
I am writing a piece of code to fetch the content of a string between ":" (which may be absent) and "#", in that order. For example, from "url:123#my.com" I want "123", and from "123#my.com" I also want "123". I wrote a regular expression to do this, but it does not work. Here is the first version:
Pattern pattern = Pattern.compile("(?<=:?).*?(?=#)");
Matcher matcher = pattern.matcher("sip:+8610086#dmcw.com");
if (matcher.find()) {
Log.d("regex", matcher.group());
} else {
Log.d("regex", "not match");
}
It does not work because in the first case, "url:123#my.com", the result is "url:123", which is obviously not what I want.
So I wrote a second version:
Pattern pattern = Pattern.compile("(?<=:??).*?(?=#)");
But this gives an error; somebody said Java does not support variable-length lookbehind.
So I tried a third version:
Pattern pattern = Pattern.compile("(?<=:).*?(?=#)|.*?(?=#)");
Its result is the same as the first version's. But shouldn't the first alternative be considered first?
It behaves the same as
Pattern pattern = Pattern.compile(".*?(?=#)|(?<=:).*?(?=#)");
so the alternatives do not seem to be tried left to right! I thought I understood regular expressions, but now I am confused again. Thanks in advance anyway.
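The behaviour of the third version follows from how a backtracking regex engine searches: it looks for the leftmost match, trying the alternatives in order at one start position and only moving to the next position when all of them fail. At position 0 the lookbehind alternative fails, but the plain alternative succeeds, so the search never reaches the position after the colon. A short sketch, with illustrative class and method names, that shows both orderings giving the same result:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeftmostMatchDemo {
    /** Returns the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "sip:+8610086#dmcw.com";
        // Both orderings match at position 0 through the alternative without lookbehind.
        System.out.println(firstMatch("(?<=:).*?(?=#)|.*?(?=#)", input)); // sip:+8610086
        System.out.println(firstMatch(".*?(?=#)|(?<=:).*?(?=#)", input)); // sip:+8610086
        // With only the lookbehind alternative, the match has to start after ':'.
        System.out.println(firstMatch("(?<=:).*?(?=#)", input));          // +8610086
    }
}
```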
Try this (slightly edited, see comments):
String test = "sip:+8610086#dmcw.com";
String test2 = "8610086#dmcw.com";
Pattern pattern = Pattern.compile("(.+?:)?(.+?)(?=#)");
Matcher matcher = pattern.matcher(test);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
matcher = pattern.matcher(test2);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
Output:
+8610086
8610086
Let me know if you need explanations on the pattern.
You really don't need any lookaheads or lookbehinds here. What you need can be accomplished by using a greedy quantifier and some alternation:
.*(?:^|:)([^#]+)
By default, Java regular expression quantifiers (*, +, ?, {n}) are all greedy: they match as many characters as possible while still allowing the overall match to succeed. They can be made lazy by adding a question mark after the quantifier, like so: .*?
You will want to output capture group 1 for this expression; capture group 0 returns the entire match.
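A brief sketch of the expression above, plus a tiny greedy versus lazy comparison. Note that this pattern does not itself require a # to be present. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyPrefixDemo {
    // Greedy .* backtracks until ':' (or the start of input) is followed by non-'#' characters.
    private static final Pattern BEFORE_HASH = Pattern.compile(".*(?:^|:)([^#]+)");

    static String extract(String input) {
        Matcher m = BEFORE_HASH.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("sip:+8610086#dmcw.com")); // +8610086
        System.out.println(extract("8610086#dmcw.com"));      // 8610086
        System.out.println(firstMatch(".*#", "a#b#c"));       // a#b#  (greedy)
        System.out.println(firstMatch(".*?#", "a#b#c"));      // a#    (lazy)
    }
}
```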
As you said, you can't use a variable-length lookbehind in Java.
Instead, you can do something like this; you don't need any lookbehind or lookahead.
Regex: :?([^#:]*)#
Example: in this example (ignore the \n, it is only there because of regex101) the first group contains what you need, and you don't have to do anything special. Sometimes the easiest solution is the best.
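The same expression as a Java sketch, reading the result from the first capture group. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenColonAndHash {
    // Optional ':', then a captured run without '#' or ':', then a literal '#'.
    private static final Pattern P = Pattern.compile(":?([^#:]*)#");

    static String extract(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("url:123#my.com")); // 123
        System.out.println(extract("123#my.com"));     // 123
    }
}
```

Excluding ':' from the captured class is what stops the match from starting before the colon.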
This is the format/example of the string I want to get data:
<span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#B82933;font-size:120%' href='/cartelera/pelicula/18312'>Español </a></span><br><span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#FBEBC4;font-size:120%' href='/cartelera/pelicula/18313'>Subtitulada </a></span><br> </div>
And this is the regular expression I'm using for it:
"pelicula/([0-9]*)'>([\\w\\s]*)</a>"
I tested this regular expression in RegexPlanet, and it turned out OK, it gave me the expected result:
group(1) = 18313
group(2) = Subtitulada
But when I try to implement that regular expression in Java, it won't match anything. Here's the code:
Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
Matcher matcher = pattern.matcher(inputLine);
while(matcher.find()){
version = matcher.group(2);
}
}
What's the problem? The regular expression has already been tested, and in the same code I search for other patterns successfully; I'm only having trouble with two of them (I'm showing just one here). Thank you in advance!
EDIT:
I discovered the problem. If I check the source code of the page, it shows everything, but when I fetch the page from Java, I get different source code. Why? Because the page asks for your city so that it can show information for that city. I don't know whether there's a workaround to access the information I want, but that is the cause.
Your regex is correct but it seems \w does not match ñ.
I changed the regex to
"pelicula/([0-9]*)'>(.*?)</a>"
and it seems to match both the occurrences.
Here I've used the reluctant *? quantifier to prevent .* from matching all characters between the first <a> and the last </a>.
See What is the difference between `Greedy` and `Reluctant` regular expression quantifiers? for explanation.
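A sketch that reproduces the ñ issue on a shortened version of the sample markup. It compares the original class, the reluctant dot, and the original class compiled with Pattern.UNICODE_CHARACTER_CLASS, which makes \w follow Unicode definitions. That third option is not from the posts above, and the class and constant names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonAsciiWordDemo {
    // Shortened sample input; \u00f1 is the letter ñ.
    static final String HTML =
            "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol </a>"
          + "<a href='/cartelera/pelicula/18313'>Subtitulada </a>";

    static final Pattern ASCII_WORD =
            Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
    static final Pattern RELUCTANT_DOT =
            Pattern.compile("pelicula/([0-9]*)'>(.*?)</a>");
    static final Pattern UNICODE_WORD =
            Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>", Pattern.UNICODE_CHARACTER_CLASS);

    /** Collects the trimmed link texts (group 2) found by the given pattern. */
    static List<String> labels(Pattern p) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(HTML);
        while (m.find()) {
            out.add(m.group(2).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(labels(ASCII_WORD));    // [Subtitulada]
        System.out.println(labels(RELUCTANT_DOT)); // [Español, Subtitulada]
        System.out.println(labels(UNICODE_WORD));  // [Español, Subtitulada]
    }
}
```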
@Bohemian is correct in pointing out that you might need to enable the Pattern.DOTALL flag as well if the text in <a> has line breaks.
If your input spans several lines (i.e. it contains newline characters), you'll need to turn on "dot matches newline".
There are two ways to do this:
Use the "dot matches newline" regex switch (?s) in your regex:
Pattern pattern = Pattern.compile("(?s)pelicula/([0-9]*)'>([\\w\\s]*)</a>");
or use the Pattern.DOTALL flag in the call to Pattern.compile():
Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>", Pattern.DOTALL);
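A small sketch of what the flag changes. It uses the (.*?) variant of the pattern, because the flag only affects how . behaves; a class such as [\w\s] already matches line breaks on its own. The class and method names and the shortened input are illustrative.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    static final String REGEX = "pelicula/([0-9]*)'>(.*?)</a>";

    /** Reports whether REGEX, compiled with the given flags, is found in the input. */
    static boolean found(String input, int flags) {
        return Pattern.compile(REGEX, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String multiLine = "pelicula/18313'>Subtitulada\n</a>";
        System.out.println(found(multiLine, 0));               // false: '.' stops at the newline
        System.out.println(found(multiLine, Pattern.DOTALL));  // true
        // The inline switch (?s) is equivalent to the DOTALL flag.
        System.out.println(Pattern.compile("(?s)" + REGEX).matcher(multiLine).find()); // true
    }
}
```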
I'm trying to make a regex all or nothing in the sense that the given word must EXACTLY match the regular expression - if not, a match is not found.
For instance, if my regex is:
^[a-zA-Z][a-zA-Z|0-9|_]*
Then I would want to match:
cat9
cat9_
bob_____
But I would NOT want to match:
cat7-
cat******
rango78&&
I want my regex to be as strict as possible, going for an all or nothing approach. How can I go about doing that?
EDIT: To make my regex absolutely clear, a pattern must start with a letter, followed by any number of numbers, letters, or underscores. Other characters are not permitted. Below is the program in question I am using to test out my regex.
Pattern p = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");
Scanner in = new Scanner(System.in);
String result = "";
while(!result.equals("-1")){
result = in.nextLine();
Matcher m = p.matcher(result);
if(m.find())
{
System.out.println(result);
}
}
I think that if you use String.matches(regex), then you will get the effect you are looking for. The documentation says that matches() will return true only if the entire string matches the pattern.
The regex won't match the second example. It's already strict, since * and & are not in the allowed set of characters.
It may match a prefix, but you can avoid this by adding '$' to the end of the regex, which explicitly matches end of input. So try,
^[a-zA-Z][a-zA-Z|0-9|_]*$
This will ensure the match is against the entire input string, and not just a prefix.
Note that \w is the same as [A-Za-z0-9_]. And you need to anchor to the end of the string like so:
Pattern p = Pattern.compile("^[a-zA-Z]\\w*$");
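A sketch of the all-or-nothing check using Matcher.matches(), which succeeds only when the entire input matches, so no anchors are needed. It also shows that the | inside the original character class is accepted as a literal character. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class WholeInputMatch {
    private static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z]\\w*");

    static boolean isIdentifier(String s) {
        // matches() requires the whole input to match, unlike find()
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"cat9", "cat9_", "bob_____", "cat7-", "cat******", "rango78&&"};
        for (String s : samples) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
        // The original class [a-zA-Z|0-9|_] also lets a literal '|' through.
        System.out.println("a|b".matches("^[a-zA-Z][a-zA-Z|0-9|_]*$")); // true
    }
}
```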