This is an example of the string I want to extract data from:
<span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#B82933;font-size:120%' href='/cartelera/pelicula/18312'>Español </a></span><br><span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#FBEBC4;font-size:120%' href='/cartelera/pelicula/18313'>Subtitulada </a></span><br> </div>
And this is the regular expression I'm using for it:
"pelicula/([0-9]*)'>([\\w\\s]*)</a>"
I tested this regular expression in RegexPlanet, and it turned out OK, it gave me the expected result:
group(1) = 18313
group(2) = Subtitulada
But when I try to implement that regular expression in Java, it won't match anything. Here's the code:
Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
Matcher matcher = pattern.matcher(inputLine);
while(matcher.find()){
version = matcher.group(2);
}
}
What's the problem? The regular expression has already been tested, and the same code searches for several other patterns successfully; I'm only having trouble with two of them (I'm showing just one here). Thank you in advance!
EDIT
I discovered the problem. If I view the page source in a browser it shows everything, but when I fetch the page from Java I get different source code. Why? Because the page asks for your city so that it can show information for it. I don't know whether there's a workaround to actually access the information I want, but that's the cause.
Your regex is correct, but by default \w in Java matches only [a-zA-Z_0-9], so it does not match the ñ in Español.
I changed the regex to
"pelicula/([0-9]*)'>(.*?)</a>"
and it seems to match both the occurrences.
Here I've used the reluctant *? quantifier to prevent .* from matching everything between the first <a> and the last </a>.
See What is the difference between `Greedy` and `Reluctant` regular expression quantifiers? for explanation.
@Bohemian is correct in pointing out that you might also need to enable the Pattern.DOTALL flag if the text inside <a> contains line breaks.
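If you would rather keep the stricter [\w\s]* group than switch to .*?, one option is to make \w Unicode-aware. The following is a minimal sketch, assuming Java 7 or later (where Pattern.UNICODE_CHARACTER_CLASS is available); the class name VersionLinks and the method extract are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionLinks {
    // Same pattern as in the question; the flag makes \w and \s follow
    // Unicode definitions, so letters such as ñ count as word characters.
    private static final Pattern LINK = Pattern.compile(
            "pelicula/([0-9]*)'>([\\w\\s]*)</a>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Returns "id=label" for every link found in the given HTML fragment.
    public static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // \u00f1 is ñ, written as an escape to avoid source-encoding issues.
        String html = "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol </a><br>"
                + "<a href='/cartelera/pelicula/18313'>Subtitulada </a>";
        System.out.println(extract(html));
    }
}
```

The inline form (?U) at the start of the pattern should have the same effect as passing the flag.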
If your input is over several lines (ie it contains newline characters) you'll need to turn on "dot matches newline".
There are two ways to do this (note that the flag only affects ., so it is shown here with the (.*?) version of the pattern):
Use the "dot matches newline" regex switch (?s) in your regex:
Pattern pattern = Pattern.compile("(?s)pelicula/([0-9]*)'>(.*?)</a>");
or use the Pattern.DOTALL flag in the call to Pattern.compile():
Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>(.*?)</a>", Pattern.DOTALL);
Related
Hi, I have the following text in a log file:
projectId:1 name:John
projectId:63232 name:Sam
telno:0232453242323
The regex should return only
1
63232
Currently I've got the regex projectId:\d*, which also includes the 'projectId:' prefix in the match. How do I omit that from the final matches?
Using the suggested solution in Java:
String term = "ProjectId:11414084 Title:Recherche partenariat";
String regex = "(?<=ProjectId:)\\d*";
Pattern r = Pattern.compile(regex);
Matcher m = r.matcher(term);
m.matches();
m.group();
This throws the following exception:
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:536)
at java.util.regex.Matcher.group(Matcher.java:496)
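A likely cause of this exception: Matcher.matches() succeeds only when the entire input matches the pattern, so here it returns false, and the following group() call has no match to report. Matcher.find() looks for the pattern anywhere in the input instead. A minimal sketch of that change, with a made-up class name ProjectIdFinder and method firstId:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProjectIdFinder {
    // Lookbehind keeps "ProjectId:" out of the match; \d+ requires a digit.
    private static final Pattern ID = Pattern.compile("(?<=ProjectId:)\\d+");

    // Returns the first project id in the text, or null if there is none.
    public static String firstId(String term) {
        Matcher m = ID.matcher(term);
        // find() scans for a match anywhere; matches() would demand that
        // the whole string consist of digits preceded by "ProjectId:".
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstId("ProjectId:11414084 Title:Recherche partenariat"));
    }
}
```

Checking the boolean result of find() before calling group() also avoids the IllegalStateException when the input contains no id at all.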
Just use the global flag, which keeps matching after the first match instead of returning.
Depending on the programming language you use, there are different ways to apply the global flag. If you tell us more about the usage, I can give you further information on how to do so.
I see you updated your question.
For only retrieving the number use the positive lookbehind like this:
(?<=projectId:)\d*
Here is a regex101 example.
Use lookbehind:
(?<=projectId:)\d+
Look-aheads and look-behinds let you conditionally match items without becoming part of the match themselves.
Demo.
Rather than excluding anything from a match, you may capture some specific part of it using a pair of unescaped parentheses:
\bprojectId:(\d*)
            ^   ^
or - if there must be at least one digit:
\bprojectId:(\d+)
               ^
Now, the value you need is in Group 1. See the regex demo.
Note that I added \b, which is a word boundary, so nonprojectId:234 won't get matched now.
See more about capturing groups here.
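In Java, the capturing-group approach could look like the sketch below. The class name ProjectIdGroups, the method ids, and the extra nonprojectId line in the sample input are made up for illustration; the first three log lines are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProjectIdGroups {
    // Group 1 holds only the digits; the "projectId:" prefix is matched
    // but not captured.
    private static final Pattern ID = Pattern.compile("\\bprojectId:(\\d+)");

    // Collects every project id found in the log text.
    public static List<String> ids(String log) {
        List<String> result = new ArrayList<>();
        Matcher m = ID.matcher(log);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String log = "projectId:1 name:John\n"
                + "projectId:63232 name:Sam\n"
                + "telno:0232453242323\n"
                + "nonprojectId:234";
        System.out.println(ids(log));
    }
}
```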
I am trying to match a phrase (Source Ip) where each letter can be lowercase or uppercase, so I wrote the regex pattern below, but m.find() returns false even for a correct match.
Is something wrong with my regex pattern, or with the way I am using it?
String word = "Source Ip";
String pattern = "[S|s][O|o][U|u][R|r][C|c][E|e]\\s*[I|i][P|p]";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(word);
System.out.println(m.find());
You can simply use
String pattern = "SOURCE\\s*IP";
Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Pattern.CASE_INSENSITIVE will make the matching case insensitive.
You don't need to alternate all letters between upper case and lowercase (note, as mentioned by others, the character class idiom does not require | to alternate - adding it between square brackets will also match the | literal).
You can parametrize your Pattern initialization with the Pattern.CASE_INSENSITIVE flag (alternative inline usage would be (?i) before your pattern representation).
For instance:
Pattern.compile("(?i)source\\s*ip");
... or ...
Pattern.compile("source\\s*ip", Pattern.CASE_INSENSITIVE);
Note
Flag API here.
This solution has the problem of accepting sourceip as well.
source\s*ip
Debuggex Demo
Therefore the correct answer should be
source\s+ip
Debuggex Demo
to force the presence of at least one whitespace between the two words.
To use this expression in Java you have to escape the backslash and use something like:
Pattern.compile("source\\s+ip", Pattern.CASE_INSENSITIVE);
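To see the difference between \s* and \s+ in practice, here is a small sketch using the stricter pattern. The class name SourceIpCheck and the method mentionsSourceIp are made up for illustration.

```java
import java.util.regex.Pattern;

public class SourceIpCheck {
    // \s+ requires at least one whitespace character between the words.
    private static final Pattern SOURCE_IP =
            Pattern.compile("source\\s+ip", Pattern.CASE_INSENSITIVE);

    // True if the text contains "source ip" in any letter case.
    public static boolean mentionsSourceIp(String text) {
        return SOURCE_IP.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(mentionsSourceIp("Source Ip"));
        System.out.println(mentionsSourceIp("SOURCE   IP"));
        System.out.println(mentionsSourceIp("sourceip"));
    }
}
```

Because find() searches anywhere in the text, this would also accept input such as "resource ip"; adding \b at both ends of the pattern is one way to tighten it if that matters.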
If for some reason you really want to spell out every letter in the regex rather than use the Pattern flags, then this regex should suit your needs; it works just fine in Java for me:
[s|S][o|O][U|u][r|R][c|C][e|E][ ]*[i|I][p|P]
your case of using
\\s*
won't identify spaces however this will
\s*
you had one slash too many :)
EDIT:
I demonstrated my ignorance: after checking regexpal I see I was wrong, and [sS] is better than [s|S]:
[sS][oO][Uu][rR][cC][eE][ ]*[iI][pP]
Thank you, Pshemo.
And yes, I completely forgot that the backslash must be escaped in a Java string literal, so \\s* in the question was correct. Thank you, Mena, for pointing that out.
I am using a regular expression to find the text between two strings.
Code:
Pattern pattern = Pattern.compile("EMAIL_BODY_XML_START_NODE"+"(.*)(\\n+)(.*)"+"EMAIL_BODY_XML_END_NODE");
Matcher matcher = pattern.matcher(part);
if (matcher.find()) {
..........
It works fine for plain text, but it breaks when the text contains special characters such as newlines.
You need to compile the pattern so that . matches line terminators as well. To do this, use the DOTALL flag.
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
edit: Sorry, it's been a while since I've had this problem. You'll also have to change the middle of the regex from (.*)(\\n+)(.*) to (.*?). You need the lazy quantifier (*?) if the input contains multiple EMAIL_BODY_XML_START_NODE elements; otherwise the regex will match from the start of the first element to the end of the last one, rather than producing a separate match for each element. Though I'm guessing this is unlikely to be the case for you.
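Put together, DOTALL plus the lazy quantifier could be sketched as follows. The marker strings <body> and </body>, the class name BetweenMarkers, and the method bodies are placeholders invented for this example, not the values from the question; Pattern.quote is used so that any regex metacharacters in the markers are treated literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenMarkers {
    // Hypothetical markers standing in for the question's start/end nodes.
    static final String START = "<body>";
    static final String END = "</body>";

    // DOTALL lets . match line terminators; (.*?) stops at the first END.
    private static final Pattern BODY = Pattern.compile(
            Pattern.quote(START) + "(.*?)" + Pattern.quote(END),
            Pattern.DOTALL);

    // Returns the text of every START...END section, newlines included.
    public static List<String> bodies(String part) {
        List<String> result = new ArrayList<>();
        Matcher m = BODY.matcher(part);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String part = "<body>line1\nline2</body> x <body>second</body>";
        System.out.println(bodies(part));
    }
}
```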
I'm a total beginner in Java.
In JavaScript I have this regex:
/[^0-9.,\-\ ]/gi
How can I do the same in Java?
Have a look at this: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
There's quite a lot you can do in Java with regex.
If you want to match repeatedly against that regex, you would do:
Pattern p = Pattern.compile("(?i)[^0-9.,\\- ]");
Matcher m = p.matcher(targetString);
Then use the matcher methods in a loop to get the matches you want. The "i" is the case-insensitivity flag (which you don't actually need here, as the pattern specifies no letters). Java has no equivalent of the "g" flag; calling m.find() repeatedly (or using m.replaceAll()) applies the pattern throughout the target string, rather than trying to match the whole string as m.matches() does.
Also, the pattern above matches only one character at a time; you may in fact want [^0-9.,\- ]*, which matches zero or more characters, greedily. I would read the docs on the Pattern class if I were you.
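As an illustration, here is a minimal sketch that assumes the goal of the JavaScript regex was to strip every character other than digits, dots, commas, hyphens, and spaces (the question does not say how the regex is used). The class name NumericFilter and the method strip are made up.

```java
import java.util.regex.Pattern;

public class NumericFilter {
    // Java-string form of the JavaScript character class: the backslash is
    // doubled, and the space needs no escaping.
    private static final Pattern DISALLOWED = Pattern.compile("[^0-9.,\\- ]");

    // replaceAll handles every occurrence, like the "g" flag in JavaScript.
    public static String strip(String input) {
        return DISALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("abc 1,234.50-x") + "]");
    }
}
```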
I am trying to write a program using regex. The format for an identifier, as I might have explained in another question of mine, is that it can only begin with a letter (and the rest of it can contain whatever). I have this part worked out for the most part.
However, anything within quotes cannot count as an identifier either.
Currently I am using Pattern pattern = Pattern.compile("[A-Za-z][_A-Za-z0-9]*"); as my pattern, which requires the first character to be a letter. So how can I edit this to check whether the word is surrounded by quotation marks (and exclude those words)?
Use negative lookaround assertions:
"(?<!\")\\b[A-Za-z][_A-Za-z0-9]*\\b(?!\")"
Example:
Pattern pattern = Pattern.compile("(?<!\")\\b[A-Za-z][_A-Za-z0-9]*\\b(?!\")");
Matcher matcher = pattern.matcher("Foo \"bar\" baz");
while (matcher.find())
{
System.out.println(matcher.group());
}
Output:
Foo
baz
See it working online: ideone.
Use lookarounds.
"(?<![\"A-Za-z])[A-Z...
The (?<![\"A-Za-z]) part means "if the previous character is not a quotation mark or a letter".