Pattern matching in Java using regex

I have a long string that I have to parse for different keywords. For example, I have the string:
"==References== This is a reference ==Further reading== *{{cite book|editor1-last=Lukes|editor1-first=Steven|editor2-last=Carrithers|}} * ==External links=="
And my keywords are
'==References==' '==External links==' '==Further reading=='
I have tried many regex combinations, but I am not able to recover all the strings.
The code I have tried:
Pattern pattern = Pattern.compile("\\=+[A-Za-z]\\=+");
Matcher matcher = pattern.matcher(textBuffer.toString());
while (matcher.find()) {
System.out.println(matcher.group(0));
}

You don't need to escape the = sign. And you should also include a whitespace inside your character class.
Apart from that, you also need a quantifier on your character class to match multiple occurrences. Try with this regex:
Pattern pattern = Pattern.compile("=+[A-Za-z ]+=+");
You can also increase the flexibility to accept any characters in between two =='s, by using .+? (You need reluctant quantifier with . to stop it from matching everything till the last ==) or [^=]+:
Pattern pattern = Pattern.compile("=+[^=]+=+");
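As a quick check, here is a minimal sketch that runs the character-class regex against the sample string from the question. The class name HeadingFinder and the helper findAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingFinder {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample text taken from the question.
        String text = "==References== This is a reference ==Further reading== "
                + "*{{cite book|editor1-last=Lukes|editor1-first=Steven|editor2-last=Carrithers|}} * "
                + "==External links==";
        // Prints [==References==, ==Further reading==, ==External links==]
        System.out.println(findAll("=+[A-Za-z ]+=+", text));
    }
}
```

Note that on this particular sample the broader =+[^=]+=+ variant also picks up text around the single = signs inside the {{cite book ...}} template (for example =Lukes|editor1-first=), so the stricter character class seems the safer choice when the input contains such templates.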
If the number of = signs must be the same on both sides, modify your regex to use a capture group and a backreference:
"(=+)[^=]+\\1"

Related

Java: Regular Expression not matching?

I am trying to extract a special sequence out of a String using the following Regular Expression:
[(].*[)]
My Pattern should only match if the String contains () with text between them.
Somehow, if I create a new Pattern using Pattern#compile(myString) and then match the String using Matcher matcher = myPattern.matcher(); it doesn't find anything, even though I tried it on regexr.com and it worked there.
My Pattern is a static final Pattern object in another class (I directly used Pattern#compile(myString)).
Example String to match:
save (xxx,yyy)
The likely problem here is your quantifier.
Since you're using greedy * with a combination of . for any character, your match will not delimit correctly as . will also match closing ).
Try using reluctant [(].*?[)].
See quantifiers in docs.
You can also escape the parentheses instead of using custom character classes, like so: \\( and \\), but that has nothing to do with your issue.
Also note (thanks esprittn)
The * quantifier will match 0+ characters, so if you want to restrict your matches to non-empty parentheses, use .+? instead - that'll guarantee at least one character inside your parentheses.
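To make the difference concrete, here is a small sketch comparing the greedy and reluctant forms. The input line with two parenthesised groups is an assumption added for illustration (the question only shows one group), and the class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenQuantifiers {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed input with two groups, so the quantifier choice matters.
        String line = "save (xxx,yyy) then load (zzz)";

        // Greedy: one match running from the first '(' to the last ')'.
        System.out.println(findAll("[(].*[)]", line));

        // Reluctant with at least one character: one match per group.
        System.out.println(findAll("[(].+?[)]", line));
    }
}
```

With a single pair of parentheses, as in save (xxx,yyy), both forms find the same text when used with find().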
Hope the code below helps: it extracts the data between '(' and ')', including them.
String pattern = "\\(.*\\)";
String line = "save(xx,yy)";
Pattern TokenPattern = Pattern.compile(pattern);
Matcher m = TokenPattern.matcher(line);
while (m.find()) {
int start = m.start(0);
int end = m.end(0);
System.out.println(line.substring(start, end));
}
To remove the brackets, change 'start' to 'start+1' and 'end' to 'end-1' to shift the bounds of the substring being taken.
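The index adjustment could look roughly like this. It is only a sketch: the class name ParenContents and the method inside are invented for the example, and it keeps the greedy pattern from the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenContents {
    // Returns the text between the first '(' and the last ')' on the line,
    // or null when the line contains no such pair.
    static String inside(String line) {
        Matcher m = Pattern.compile("\\(.*\\)").matcher(line);
        if (!m.find()) {
            return null;
        }
        // start() and end() bound the whole match, brackets included,
        // so shrink the range by one character on each side.
        return line.substring(m.start() + 1, m.end() - 1);
    }

    public static void main(String[] args) {
        System.out.println(inside("save(xx,yy)")); // prints xx,yy
    }
}
```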

Regex to match words after forward slash or in between

I have this code that needs to get words after / or in between this character.
Pattern pattern = Pattern.compile("\\/([a-zA-Z0-9]{0,})"); // Regex: \/([a-zA-Z0-9]{0,})
Matcher matcher = pattern.matcher(path);
if(matcher.matches()){
return matcher.group(0);
}
The regex \/([a-zA-Z0-9]{0,}) works but not in Java, what could be the reason?
You need to get the value of Group 1 and use find to get a partial match:
Pattern pattern = Pattern.compile("/([a-zA-Z0-9]*)");
Matcher matcher = pattern.matcher(path);
if(matcher.find()){
return matcher.group(1); // Here, use Group 1 value
}
Matcher.matches requires a full string match, only use it if your string fully matches the pattern. Else, use Matcher.find.
Since the value you need is captured into Group 1 (([a-zA-Z0-9]*), the subpattern enclosed with parentheses), you need to return that part.
You needn't escape the / in Java regex. Also, {0,} functions the same way as * quantifier (matches zero or more occurrences of the quantified subpattern).
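A short sketch of the matches/find difference, assuming a path such as "/users/42" (the question does not show the actual value of path). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathSegments {
    private static final Pattern SEGMENT = Pattern.compile("/([a-zA-Z0-9]*)");

    // Uses find(), so the pattern may match anywhere in the path;
    // returns the Group 1 value of the first match, or null if none.
    static String firstSegment(String path) {
        Matcher matcher = SEGMENT.matcher(path);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Uses matches(), which succeeds only if the whole path fits the pattern.
    static boolean matchesWhole(String path) {
        return SEGMENT.matcher(path).matches();
    }

    public static void main(String[] args) {
        String path = "/users/42"; // assumed example input
        System.out.println(matchesWhole(path)); // false: "/42" is left over
        System.out.println(firstSegment(path)); // users
    }
}
```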
Also, [a-zA-Z0-9] can be replaced with \p{Alnum} to match the same range of characters (see the Java regex syntax reference). The pattern declaration will look like
"/(\\p{Alnum}*)"

How can I add multiple match conditions in a regex

I have a String like this : String x = "return function ('ABC','DEF')";
I am using this:
Pattern pattern = Pattern.compile("'(.*?)'");
Matcher matcher = pattern.matcher(formula);
while (matcher.find()) {
System.out.println("------> " + matcher.group());
}
to retrieve strings between single quotes.
My question is: how can I adapt this regex so that it will check for strings between single quotes AND strings like " ,'DEF' " (meaning those which start with ,' and end with ')?
You can use this pattern:
'[^']+'|"[^"]+"
To also match an empty quoted string, change '+' to '*'.
See test.
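Written as a Java string literal, the double quotes in that pattern need escaping. A minimal sketch, with an input string made up for the example (it mixes single quotes, double quotes and an empty pair):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStrings {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "f('ABC', \"DEF\", '')"; // assumed example input

        // '+' requires at least one character between the quotes,
        // so the empty pair '' is skipped.
        System.out.println(findAll("'[^']+'|\"[^\"]+\"", input));

        // '*' also accepts the empty pair.
        System.out.println(findAll("'[^']*'|\"[^\"]*\"", input));
    }
}
```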
This pattern should do what you want:
"(?:,\\s*)?'[^']*'"
The ? means the first group will match zero or one times.
I used (?:...) because this is a non-capturing group. It is better to use when you don't need to capture that portion of the match.
Also, I replaced .*? with [^']*, meaning the single-quoted string contains anything that is not a single quote. This is more efficient and less likely to lead to mistakes in your regex than .*?.
(Note: this regex allows there to be space between the comma and the start of the string. At first looking at your example, I thought that was true of your example. But now I see that it is not. Still, that might be useful depending on what your data looks like).
You could use the regex pattern:
Pattern.compile(",?'(.*?)'");
,? means 0 or 1 commas. The ? is greedy, so if there is a comma, it will be included in the match.
So: This will match:
A comma, followed by a string enclosed in single quotes
OR.. only a string enclosed in single quotes
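For what it's worth, here is a sketch of that pattern run against the string from the question. It shows that group() includes the optional comma while group(1) holds only the text between the quotes. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalComma {
    private static final Pattern QUOTED = Pattern.compile(",?'(.*?)'");

    // Whole matches: the optional leading comma plus the quoted string.
    static List<String> wholeMatches(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    // Group 1 only: the text between the single quotes.
    static List<String> contents(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String x = "return function ('ABC','DEF')"; // string from the question
        System.out.println(wholeMatches(x)); // ['ABC', ,'DEF']
        System.out.println(contents(x));     // [ABC, DEF]
    }
}
```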

Matching several URLs in a string using regex

I'm trying to match a URL in a string, using regex from here: Regular expression to match URLs in Java
It works fine with one URL, but when I have two URLs in the string, it only matched the latter.
Here's the code:
Pattern pat = Pattern.compile(".*((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
// now matcher.groupCount() == 2, not 4
Edit: stuff I've tried:
// .* removed, now doesn't match anything // Another edit: actually works, see below
Pattern pat = Pattern.compile("((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
// .* made lazy, still only matches one
Pattern pat = Pattern.compile(".*?((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
Any ideas?
It's because .* is greedy. It will just consume as much as possible (the whole string) and then backtrack. I.e. it will throw away one character at a time until the remaining characters can make up a URL. Hence the first URL will already be matched, but not captured. And unfortunately, matches cannot overlap. The fix should be simple. Remove the .* at the beginning of your pattern. Then you can also remove the outer parentheses from your pattern - there is no need to capture anything any more, because the whole match will be the URL you are looking for.
Pattern pat = Pattern.compile("(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
while (matcher.find()) {
System.out.println(matcher.group());
}
By the way, matcher.groupCount() does not tell you anything, because it gives you the number of groups in your pattern and not the number of captures in your target string. That's why your second approach (using .*?) did not help. You still have two capturing groups in the pattern. Before calling find or anything, matcher does not know how many captures it will find in total.
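To illustrate the groupCount() point, a rough sketch: the group count comes from the pattern alone, while the number of URLs has to be counted by looping over find(). The class UrlFinder and its methods are invented for this example, and the URL character class is written with @# as in the commonly quoted version of this pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlFinder {
    private static final Pattern URL = Pattern.compile(
            "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Collects every URL in the text; the list size is the number of matches.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    // Number of capturing groups in the pattern, independent of any input.
    static int groupsInPattern() {
        return URL.matcher("").groupCount();
    }

    public static void main(String[] args) {
        String text = "asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe";
        System.out.println(groupsInPattern());       // 1, even for empty input
        System.out.println(findUrls(text).size());   // 2
        System.out.println(findUrls(text));
    }
}
```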

Find string in between two strings using regular expression

I am using a regular expression for finding string in between two strings
Code:
Pattern pattern = Pattern.compile("EMAIL_BODY_XML_START_NODE"+"(.*)(\\n+)(.*)"+"EMAIL_BODY_XML_END_NODE");
Matcher matcher = pattern.matcher(part);
if (matcher.find()) {
..........
It works fine for plain text, but it breaks when the text contains special characters like newlines.
You need to compile the pattern such that . matches line terminators as well. To do this you need to use the DOTALL flag.
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
Edit: Sorry, it's been a while since I've had this problem. You'll also have to change the middle regex from (.*)(\\n+)(.*) to (.*?). You need the lazy quantifier (*?) if you have multiple EMAIL_BODY_XML_START_NODE elements. Otherwise the regex will match from the start of the first element to the end of the last element rather than producing separate matches for each element. Though I'm guessing this is unlikely to be the case for you.
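Putting both changes together, a minimal sketch might look like this. The marker values "<body>" and "</body>" are placeholders I made up (the question does not show the real start and end strings), and Pattern.quote is used in case the real markers contain regex metacharacters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenMarkers {
    // Hypothetical marker values; substitute the real start/end strings.
    static final String START = "<body>";
    static final String END = "</body>";

    // DOTALL lets '.' cross line breaks; the lazy (.*?) stops at the
    // nearest END marker, giving one match per START/END pair.
    private static final Pattern BODY = Pattern.compile(
            Pattern.quote(START) + "(.*?)" + Pattern.quote(END), Pattern.DOTALL);

    static List<String> extract(String text) {
        List<String> bodies = new ArrayList<>();
        Matcher matcher = BODY.matcher(text);
        while (matcher.find()) {
            bodies.add(matcher.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String part = "<body>line one\nline two</body> junk <body>second</body>";
        for (String body : extract(part)) {
            System.out.println("[" + body + "]");
        }
    }
}
```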
