java Pattern Matching issue


I am having trouble writing a regex that matches a URL.
String input = "AAAhttp://www.gmail.comBBBBabc#gmail.com";
String regex = "www.*.com"; // intended to match the URL www.gmail.com
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
while (m.find()) {
    // process m.group() here
}
I want to remove the URL www.gmail.com. However, the match runs to the end of the string, so it also swallows the email address, which ends with gmail.com too.
Can someone help me find a regex that matches only the URL?

.* is greedy: it matches as much as possible. Add ? after the * to make it reluctant (non-greedy), and escape the dots that are meant literally:
"www\\..*?\\.com"
Your code would be,
String s = "AAAhttp://www.gmail.comBBBBabc#gmail.com";
Pattern p = Pattern.compile("www\\..*?\\.com");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group(0));
}

String regex = "www\\..*?\\.com";
Use a non-greedy (reluctant) repetition of the wildcard '.', and escape the dot wherever it is meant literally.

A negated character class is faster than .*?
Use this regex:
www\.[^.]+\.com
[^.]+ means one or more characters that are not a dot.
In a Java string literal the backslashes themselves must be escaped:
// for instance
Pattern regex = Pattern.compile("www\\.[^.]+\\.com");
// etc
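To see the difference between the three patterns discussed above, here is a minimal sketch that runs each of them against the input from the question. The class name GreedyVsReluctant and the helper findAll are made up for illustration; the last line shows one way to remove the URL, assuming the negated character class is the pattern you settle on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "AAAhttp://www.gmail.comBBBBabc#gmail.com";

        // Greedy: runs to the last ".com" -> [www.gmail.comBBBBabc#gmail.com]
        System.out.println(findAll("www.*.com", input));
        // Reluctant: stops at the first ".com" -> [www.gmail.com]
        System.out.println(findAll("www\\..*?\\.com", input));
        // Negated class: cannot cross a dot -> [www.gmail.com]
        System.out.println(findAll("www\\.[^.]+\\.com", input));

        // Removing the URL, as the question asks -> AAAhttp://BBBBabc#gmail.com
        System.out.println(input.replaceAll("www\\.[^.]+\\.com", ""));
    }
}
```

Note that the negated class only fits host names with a single label between www. and .com; for something like www.mail.example.com the reluctant version would be the one to try.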

Related

Android Java regexp pattern

I ping a host and read the result from standard output. The regex below does not work correctly. Where is my mistake?
String REGEXP ="time=(\\\\d+)ms";
Pattern pattern = Pattern.compile(REGEXP);
Matcher matcher = pattern.matcher(result);
if (matcher.find()) {
result = matcher.group(1);
}

You only need \\d+ in your regex (two backslashes in the Java string literal, not four).
Matcher looks for the pattern it was created from and finds each occurrence of that pattern in the string being matched.
Use while (matcher.find()) in case of multiple occurrences.
Each pair of parentheses () represents a capturing group.

You have too many backslashes. Assuming you want to get the number from a string like "time=32ms", then you need:
String REGEXP ="time=(\\d+)ms";
Pattern pattern = Pattern.compile(REGEXP);
Matcher matcher = pattern.matcher(result);
if (matcher.find()) {
result = matcher.group(1);
}
Explanation: The search pattern you are looking for is "\d", meaning a decimal digit; the "+" means one or more occurrences.
To get the "\" to the matcher, it needs to be escaped in the Java string literal, and the escape character is also "\".
The parentheses define the capturing group that you want to pick out.
With "\\\\d+", the matcher sees "\\d+", which matches a literal backslash followed by one or more "d"s. The first backslash protects the second, and the third protects the fourth.
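The two escaping levels can be checked with a small sketch. The sample line and the names PingTime and firstGroup are assumptions made for illustration; real ping output may be formatted differently (for example with a space or a decimal point before "ms"), in which case the pattern would need adjusting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PingTime {
    // Hypothetical helper: group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "reply from host: time=32ms"; // made-up sample output

        // Two backslashes in the literal: the regex engine sees \d+ (digits).
        System.out.println(firstGroup("time=(\\d+)ms", line));   // 32

        // Four backslashes: the engine sees \\d+ (a backslash, then d's).
        System.out.println(firstGroup("time=(\\\\d+)ms", line)); // null

        // The four-backslash pattern only matches text with a real backslash.
        System.out.println(firstGroup("time=(\\\\d+)ms", "time=\\ddms")); // \dd
    }
}
```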

Find string after last underscore before dot extension

I need to find 20140809T0000Z in this string:
PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc
I tried the following to keep the string before the .nc:
(?<=_)(.*)(?=.nc)
I have the following to start from the last underscore:
/_[^_]*$/
How can I find string after last underscore before dot extension, using a regex?

RegEx is not always the best solution... :)
String fileName = "PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
int start = fileName.lastIndexOf("_") + 1; // 0 means no underscore was found
int end = fileName.lastIndexOf(".");
if (start != 0 && end != -1 && end > start) {
    System.out.println(fileName.substring(start, end));
}

You just need lookahead for this requirement.
You can use:
[^._]+(?=[^_]*$)
// matches and returns 20140809T0000Z

You could use the below regex,
(?<=_)[^_]*(?=\.nc)
In your pattern just replace .* with [^_]* so that it would match the inner string.
String s = "PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
Pattern regex = Pattern.compile("(?<=_)[^_]*(?=\\.nc)");
Matcher regexMatcher = regex.matcher(s);
if (regexMatcher.find()) {
String ResultString = regexMatcher.group();
System.out.println(ResultString);
} //=> 20140809T0000Z

You could use a simpler pattern with a capturing group
.*_(.*)\.nc
By default the first .* will be "greedy" and consume as many characters as possible before the _, leaving just the desired string inside the (.*).
Demo: http://regex101.com/r/aI2xQ9/1
Java code:
String input = "PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";
Pattern pattern = Pattern.compile(".*_(.*)\\.nc");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
String group = matcher.group(1);
// ...
}

So, you need a sequence of non-underscore characters that immediately precede the period character.
Try [^_.]+(?=\.)
Demo: https://regex101.com/r/sLAnVs/2
Thanks to Cary Swoveland for pointing out that "no need to escape a period in a character class".
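For comparison, here is a sketch that runs the approaches suggested in the answers above against the file name from the question. The class and method names (LastSegment, viaLastIndexOf, viaRegex) are made up for illustration, and only the first match of each regex is taken.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastSegment {
    // Plain string search: the text between the last '_' and the last '.'.
    static String viaLastIndexOf(String name) {
        int start = name.lastIndexOf('_') + 1; // 0 means no underscore
        int end = name.lastIndexOf('.');
        return (start != 0 && end > start) ? name.substring(start, end) : null;
    }

    // First match of the regex; group 0 is the whole match.
    static String viaRegex(String regex, int group, String name) {
        Matcher m = Pattern.compile(regex).matcher(name);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String name = "PREVIMER_F2-MARS3D-MENOR1200_20140809T0000Z.nc";

        // Each line prints 20140809T0000Z for this input.
        System.out.println(viaLastIndexOf(name));
        System.out.println(viaRegex("[^._]+(?=[^_]*$)", 0, name));
        System.out.println(viaRegex("(?<=_)[^_]*(?=\\.nc)", 0, name));
        System.out.println(viaRegex(".*_(.*)\\.nc", 1, name));
        System.out.println(viaRegex("[^_.]+(?=\\.)", 0, name));
    }
}
```

The variants agree on this file name but make different assumptions: two of the regexes hard-code the .nc extension, while the others accept any extension.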

Java regex to match tuples

I need to extract tuples from a string,
e.g. (1,1,A)(2,1,B)(1,1,C)(1,1,D)
and thought some regex like:
String tupleRegex = "(\\(\\d,\\d,\\w\\))*";
would work, but it just gives me one tuple. What would be the proper regex to match all the tuples in the string?

Remove the * from the regex and iterate over the matches using a java.util.regex.Matcher:
String input = "(1,1,A)(2,1,B)(1,1,C)(1,1,D)";
String tupleRegex = "(\\(\\d,\\d,\\w\\))";
Pattern pattern = Pattern.compile(tupleRegex);
Matcher matcher = pattern.matcher(input);
while(matcher.find()) {
System.out.println(matcher.group());
}
The * character is a quantifier that makes the group match zero or more tuples. Hence your original regex matches the entire input string as a single match, and a repeated capturing group retains only its last repetition.
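A short sketch of both behaviours, in case it helps to see them side by side. The class and method names are made up for illustration; the patterns are the ones from the question and the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TupleGroups {
    // No trailing quantifier: each find() returns exactly one tuple.
    static List<String> allTuples(String input) {
        List<String> tuples = new ArrayList<>();
        Matcher m = Pattern.compile("\\(\\d,\\d,\\w\\)").matcher(input);
        while (m.find()) {
            tuples.add(m.group());
        }
        return tuples;
    }

    // With the trailing *, a single match spans every tuple, and
    // group(1) holds only the last repetition of the group.
    static String[] starredMatch(String input) {
        Matcher m = Pattern.compile("(\\(\\d,\\d,\\w\\))*").matcher(input);
        m.find(); // always succeeds, because * also allows an empty match
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String input = "(1,1,A)(2,1,B)(1,1,C)(1,1,D)";
        System.out.println(allTuples(input));       // [(1,1,A), (2,1,B), (1,1,C), (1,1,D)]
        System.out.println(starredMatch(input)[0]); // (1,1,A)(2,1,B)(1,1,C)(1,1,D)
        System.out.println(starredMatch(input)[1]); // (1,1,D)
    }
}
```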

A one-line solution uses String.split() with the pattern (?!^\\()(?=\\():
Arrays.toString("(1,1,A)(2,1,B)(1,1,C)(1,1,D)".split("(?!^\\()(?=\\()"))
output:
[(1,1,A), (2,1,B), (1,1,C), (1,1,D)]
Pattern explanation:
(?!      look ahead to see if there is not:
  ^        the beginning of the string
  \(       '('
)        end of look-ahead
(?=      look ahead to see if there is:
  \(       '('
)        end of look-ahead
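The split can be wrapped in a small sketch like the one below (class and method names are made up). The second method is an assumption worth checking on your own runtime: from Java 8 on, the String.split documentation states that a zero-width match at the beginning never produces an empty leading substring, so the plain lookahead (?=\\() should give the same result there without the (?!^\\() guard.

```java
import java.util.Arrays;

class SplitTuples {
    // Pattern from the answer: split before every '(' except the first one.
    static String[] splitGuarded(String input) {
        return input.split("(?!^\\()(?=\\()");
    }

    // Simpler variant, assuming Java 8 or later (see the note above).
    static String[] splitPlain(String input) {
        return input.split("(?=\\()");
    }

    public static void main(String[] args) {
        String input = "(1,1,A)(2,1,B)(1,1,C)(1,1,D)";
        System.out.println(Arrays.toString(splitGuarded(input)));
        System.out.println(Arrays.toString(splitPlain(input)));
    }
}
```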

Regex to match repeated punctuation in Java

I have some punctuation characters: char[] punctuation = {'.', ',', '!', '?'};. I want to create a regex that matches any string made up only of those characters.
For example, some strings I want to find: "....???", "!!!!!......", "??.....!", and so on.
Thanks for any advice.

Use String.matches() with the POSIX character class for punctuation:
str.matches("\\p{Punct}+");
FYI, according to the Pattern javadoc, \p{Punct} is one of
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Also, ^ and $ aren't needed in the expression, because matches() must match the whole input to return true, so start and end are implied.
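Since \p{Punct} covers more than the four characters in the question, a custom character class may be closer to what was asked. The sketch below compares the two; the class and method names are made up for illustration.

```java
class PunctuationRun {
    // Only the four characters from the question; matches() tests the whole string.
    static boolean isQuestionPunct(String s) {
        return s.matches("[.,!?]+");
    }

    // Any ASCII punctuation, as in the answer above.
    static boolean isAnyPunct(String s) {
        return s.matches("\\p{Punct}+");
    }

    public static void main(String[] args) {
        System.out.println(isQuestionPunct("....???")); // true
        System.out.println(isQuestionPunct("??.....!")); // true
        System.out.println(isQuestionPunct("hi!"));      // false: letters present
        System.out.println(isQuestionPunct("#$%"));      // false: not in the class
        System.out.println(isAnyPunct("#$%"));           // true
    }
}
```

Inside a character class the dot and the question mark are literal, so they need no escaping here.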

Try this, it should match and group all the symbols written between []:
([.,!?]+)
Tested it with
??..,..!fsdgsdfgsdfgsdfg
And output was
??..,..!
Also tested with this:
String s = "??.....!fsdgsdfgsdfgsdfg?.,!0000a";
Pattern p = Pattern.compile("([.,!?]+)");
Matcher m = p.matcher(s);
while(m.find()) {
System.out.println(m.group(1));
}
And output was
??.....!
?.,!

You can try the POSIX punctuation class \p{Punct} (ASCII punctuation by default) with a while loop over the matches, as such:
String test = "!...abcd??...!!efgh....!!??abc!";
Pattern pattern = Pattern.compile("\\p{Punct}{2,}");
Matcher matcher = pattern.matcher(test);
while (matcher.find()) {
System.out.println(matcher.group());
}
Output:
!...
??...!!
....!!??
Note: this matches only punctuation sequences of at least 2 characters (hence, by design, the last "!" is not matched). To change the minimum length of the punctuation sequence, adjust the {2,} part of the Pattern.

Pattern/Matcher group() to obtain substring in Java?

UPDATE: Thanks for all the great responses! I tried many different regex patterns but didn't understand why m.matches() was not doing what I think it should be doing. When I switched to m.find() instead, as well as adjusting the regex pattern, I was able to get somewhere.
I'd like to match a pattern in a Java string and then extract the portion matched using a regex (like Perl's $& operator).
This is my source string "s": DTSTART;TZID=America/Mexico_City:20121125T153000
I want to extract the portion "America/Mexico_City".
I thought I could use Pattern and Matcher and then extract using m.group() but it's not working as I expected. I've tried monkeying with different regex strings and the only thing that seems to hit on m.matches() is ".*TZID.*" which is pointless as it just returns the whole string. Could someone enlighten me?
Pattern p = Pattern.compile ("TZID*:"); // <- change to "TZID=([^:]*):"
Matcher m = p.matcher (s);
if (m.matches ()) // <- change to m.find()
Log.d (TAG, "looking at " + m.group ()); // <- change to m.group(1)

You use m.matches(), which tries to match the whole string. If you use m.find(), it searches for a match anywhere inside the string. I also tweaked your regex to exclude the TZID= prefix using a zero-width lookbehind:
Pattern p = Pattern.compile("(?<=TZID=)[^:]+");
Matcher m = p.matcher ("DTSTART;TZID=America/Mexico_City:20121125T153000");
if (m.find()) {
System.out.println(m.group());
}

This should work nicely:
Pattern p = Pattern.compile("TZID=(.*?):");
Matcher m = p.matcher(s);
if (m.find()) {
String zone = m.group(1); // capturing groups are numbered from 1; group 0 is the whole match
. . .
}
An alternative regex is "TZID=([^:]*)". I'm not sure which is faster.

You are using the wrong pattern, try this:
Pattern p = Pattern.compile(".*?TZID=([^:]+):.*");
Matcher m = p.matcher (s);
if (m.matches ())
Log.d (TAG, "looking at " + m.group(1));
.*? matches anything at the beginning, up to TZID=. Then TZID= matches literally, and the group opens and matches everything up to the colon. The group closes there, the : matches, and .* matches the rest of the string. You can then get what you need from group(1).

In "TZID*:" the asterisk applies only to the D, so your expression matches "TZI" followed by any number of uppercase Ds and a colon. Put a character class (or a dot) before the asterisk instead:
Pattern p = Pattern.compile ("TZID[^:]*:");
You should also add a capturing group unless you want to capture everything, including the "TZID" and the ":"
Pattern p = Pattern.compile ("TZID=([^:]*):");
Finally, you should use the right API to search the string, rather than attempting to match the string in its entirety.
Pattern p = Pattern.compile("TZID=([^:]*):");
Matcher m = p.matcher("DTSTART;TZID=America/Mexico_City:20121125T153000");
if (m.find()) {
System.out.println(m.group(1));
}
This prints
America/Mexico_City
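The difference between matches() and find() is easy to check with a small sketch. The class name TzidExtract and the helper zone are made up for illustration; the patterns are the ones from the answers above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TzidExtract {
    // Returns the time zone ID after "TZID=", or null if there is none.
    static String zone(String s) {
        Matcher m = Pattern.compile("TZID=([^:]*):").matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "DTSTART;TZID=America/Mexico_City:20121125T153000";
        Pattern p = Pattern.compile("TZID=([^:]*):");

        // matches() requires the pattern to cover the whole input.
        System.out.println(p.matcher(s).matches()); // false
        // find() looks for the pattern anywhere in the input.
        System.out.println(p.matcher(s).find());    // true
        // matches() works once the pattern is padded to span the input.
        System.out.println(Pattern.compile(".*TZID=([^:]*):.*").matcher(s).matches()); // true

        System.out.println(zone(s)); // America/Mexico_City
    }
}
```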

Why not simply use split as:
String origStr = "DTSTART;TZID=America/Mexico_City:20121125T153000";
String str = origStr.split(":")[0].split("=")[1];
