Java regular expression to match patterns and extract them

I tried writing a program in Java using regex to match a pattern and extract it. Given a string like "This is a link- #www.google.com# and this is another #google.com#" I should be able to get #www.google.com# and #google.com# strings extracted. Here is what I tried-
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class ParseLinks {
    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        Pattern p = Pattern.compile("#.*#");
        Matcher matcher = p.matcher(message);
        while (matcher.find()) {
            String result = matcher.group();
            System.out.println(result);
        }
    }
}
This results in output- #www.google.com# and this is another #google.com#. But what I wanted is only the strings #www.google.com# and #google.com# extracted. Can I please know the regex for this?

#[^#]+#
Though thinking about it, a hash sign is a bad choice for delimiting URLs, for rather obvious reasons.
The reason why yours does not work is the greediness of the star (from regular-expressions.info):
[The star] repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.

Java regex supports non-greedy (reluctant) quantifiers, so use the pattern .*? instead of the greedy .* so that it will end the capture as soon as possible instead of as late as possible.
In a language that doesn't support them, you can approximate the same behavior by matching anything that's not the ending delimiter, like so:
#[^#]*#

Quantifiers in regular expressions are "greedy" by default, that is, they will match as much text as possible. In your example, the pattern "#.*#" translates to
match a "#"
match as many characters as possible such that you can still ...
... match a "#"
What you want is a "non-greedy" or "reluctant" pattern such as "*?". Try "#.*?#" in your case.
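To see the difference side by side, here is a minimal sketch that runs the greedy pattern from the question next to the reluctant and negated-class patterns suggested in the answers. The class name DelimitedExtractor and the helper findAll are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class DelimitedExtractor {
    // Collects every match of the given regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        // Greedy: a single match running from the first '#' to the last '#'.
        System.out.println(findAll("#.*#", message));
        // Reluctant: each match stops at the nearest closing '#'.
        System.out.println(findAll("#.*?#", message));
        // Negated character class: a match can never cross a '#'.
        System.out.println(findAll("#[^#]+#", message));
    }
}
```

The last two calls both yield #www.google.com# and #google.com# as separate matches. One small difference: #[^#]+# requires at least one character between the delimiters, while #.*?# would also match an empty "##".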

Related

Use a regex to find a pattern somewhere between two words

Given the following string
{"type":"PrimaryParty","name":"Karen","id":"456789-9996"},
{"type":"SecondaryParty","name":"Juliane","id":"345678-9996"},
{"type":"SecondaryParty","name":"Ellen","id":"001234-9996"}
I am looking for strings matching the pattern \d{6}-\d{4}, but only if they are following the string "SecondaryParty". The processor is Java-based
Using https://regex101.com/ I have come up with this, which works fine using the ECMAScript(JavaScript) Flavor.
(?<=SecondaryParty.*?)\d{6}-\d{4}(?=\"})
But as soon as I switch to Java, it says
* A quantifier inside a lookbehind makes it non-fixed width
? The preceding token is not quantifiable
When using it in java.util.regex, the error says
Look-behind group does not have an obvious maximum length near index 20
(?<=SecondaryParty.*?)\d{6}-\d{4}(?="})
                    ^
How do I overcome the "does not have an obvious maximum length" problem in Java?

You can get the value without using lookarounds by matching instead, and use a single capture group for the value that you want to get:
\"SecondaryParty\"[^{}]*\"(\d{6}-\d{4})\"
Explanation
\"SecondaryParty\" Match "SecondaryParty"
[^{}]*\" Match optional chars other than { and }, followed by "
(\d{6}-\d{4}) Capture group 1, match 6 digits - 4 digits
\" Match "
See a regex101 demo and a Java demo.
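A minimal Java sketch of this capture-group approach, using the three records from the question as input. The class name SecondaryPartyIds and the method extract are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class SecondaryPartyIds {
    // "SecondaryParty", then anything except braces, then the quoted ID (group 1).
    private static final Pattern ID_PATTERN =
            Pattern.compile("\"SecondaryParty\"[^{}]*\"(\\d{6}-\\d{4})\"");

    static List<String> extract(String input) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = ID_PATTERN.matcher(input);
        while (matcher.find()) {
            ids.add(matcher.group(1)); // only the captured ID, not the whole match
        }
        return ids;
    }

    public static void main(String[] args) {
        String input =
                "{\"type\":\"PrimaryParty\",\"name\":\"Karen\",\"id\":\"456789-9996\"},\n"
                + "{\"type\":\"SecondaryParty\",\"name\":\"Juliane\",\"id\":\"345678-9996\"},\n"
                + "{\"type\":\"SecondaryParty\",\"name\":\"Ellen\",\"id\":\"001234-9996\"}";
        System.out.println(extract(input)); // [345678-9996, 001234-9996]
    }
}
```

Because [^{}]* cannot cross a brace, the match stays inside the object that contains "SecondaryParty", so an ID from a neighboring PrimaryParty record is never picked up.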

You might use a curly braces quantifier as a workaround:
(?<=SecondaryParty.{0,255})\d{6}-\d{4}(?=\"})
The minimum and maximum inside the curly braces quantifier depend on your actual data.
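Here is a rough sketch of the bounded-lookbehind workaround in Java. The bound of 255 is taken from the pattern above and is only a guess at what suits real data; the class name BoundedLookbehindDemo and the method extract are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class BoundedLookbehindDemo {
    // The {0,255} bound gives the lookbehind a known maximum length.
    private static final Pattern ID_PATTERN =
            Pattern.compile("(?<=SecondaryParty.{0,255})\\d{6}-\\d{4}(?=\"\\})");

    static List<String> extract(String input) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = ID_PATTERN.matcher(input);
        while (matcher.find()) {
            ids.add(matcher.group()); // the whole match is the ID itself
        }
        return ids;
    }

    public static void main(String[] args) {
        String input =
                "{\"type\":\"PrimaryParty\",\"name\":\"Karen\",\"id\":\"456789-9996\"},\n"
                + "{\"type\":\"SecondaryParty\",\"name\":\"Juliane\",\"id\":\"345678-9996\"},\n"
                + "{\"type\":\"SecondaryParty\",\"name\":\"Ellen\",\"id\":\"001234-9996\"}";
        System.out.println(extract(input)); // [345678-9996, 001234-9996]
    }
}
```

One caveat: this works on the sample input because each record sits on its own line and . does not match a line break by default. If all records were on a single line, a PrimaryParty record that follows a SecondaryParty record within 255 characters would be matched as well.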

You could use the regex (?<=SecondaryParty)(.*?)(\d{6}-\d{4})(?=\"}) and take the value of the second group, which matches the pattern \d{6}-\d{4} only when it follows the string "SecondaryParty".
Sample Java code
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class IdRegexMatcher {
    public static void main(String[] args) {
        String input = "{\"type\":\"PrimaryParty\",\"name\":\"Karen\",\"id\":\"456789-9996\"},\n" +
                "{\"type\":\"SecondaryParty\",\"name\":\"Juliane\",\"id\":\"345678-9996\"},\n" +
                "{\"type\":\"SecondaryParty\",\"name\":\"Ellen\",\"id\":\"001234-9996\"}";
        Pattern pattern = Pattern.compile("(?<=SecondaryParty)(.*?)(\\d{6}-\\d{4})(?=\\\"})");
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            String idStr = matcher.group(2);
            System.out.println(idStr);
        }
    }
}
which gives the output
345678-9996
001234-9996
One possible optimization in the above regex could be to use [^0-9]*? instead of .*? under the assumption that the name wouldn't contain numbers.

Why does this regex capture the excluded character?

I have a regex like this:
(?:(\\s| |\\A|^))(?:#)[A-Za-z0-9]{2,}
What I am trying to do is find a pattern that starts with an # and has two or more characters after, however it can't start in the middle of a word.
I'm new to regex, but I was under the impression that ?: matches and then excludes the character. However, my regex seems to match and still include those characters. Ideally I'd like "#test" to return "test" and "test#test" to not match at all.
Can anyone tell me what I've done wrong?
Thanks.

Your understanding is incorrect. The difference between (...) and (?:...) is only that the former also creates a numbered match group which can be referred to with a backreference from within the regex, or as a captured match group from code following the match.
You could change the code to use lookbehinds, but the simple and straightforward fix is to put ([A-Za-z0-9]{2,}) inside regular parentheses, like I have done here, and retrieve the first matched group. (The # doesn't need any parentheses around it in this scenario, but the ones you have are harmless.)
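A minimal sketch of that fix in Java, assuming the goal is to get "test" from "#test" and nothing from "test#test". The class name HashTagCapture and the method firstTag are hypothetical, and the pattern is a slightly simplified form of the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class HashTagCapture {
    // '#' must sit at the start of the input or right after whitespace.
    // The non-capturing group (?:...) is still part of the overall match;
    // only group 1 holds the text after the '#'.
    private static final Pattern TAG =
            Pattern.compile("(?:\\s|\\A)#([A-Za-z0-9]{2,})");

    // Returns the first tag without its '#', or null if there is none.
    static String firstTag(String input) {
        Matcher matcher = TAG.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTag("#test"));     // test
        System.out.println(firstTag("test#test")); // null
    }
}
```

Calling matcher.group() instead of matcher.group(1) would return the whole match, including the '#' and any leading whitespace, which is the behavior described in the question.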

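Try this: you could use a non-word-boundary (\B) in front of the # to specify your condition. (A word boundary \b in front of the # would do the opposite: it only matches when the # directly follows a word character.)
public static void main(String[] args) {
    String s1 = "#test";
    String s2 = "test#test";
    // \B requires that '#' is not directly preceded by a word character
    String pattern = "\\B#\\w{2,}\\b";
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(s1);
    m.find();
    System.out.println(m.group());
}
Output:
#test
In the second case (s2), find() returns false, so calling group() throws an IllegalStateException.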

How about:
\W#[\S]{2}[\S]*
The strings matched by this regular expression need to be trimmed and have their first character removed.

I guess you would be better off with the following one:
(?<=(?<!\w)#)\w{2,}
Debuggex Demo
Don't forget to escape the backslashes in Java, since the pattern goes inside a string literal:
(?<=(?<!\\w)#)\\w{2,}

RegEx to find the word between last Upper Case word and another word

My problem is to find a word between two words. Of these two words, one is an all UPPER CASE word, which can be anything, and the other is "is". I tried a few regexes but none of them helped. Here is my example:
String :
In THE house BIG BLACK cat is very good.
Expected output :
cat
RegEx used :
(?<=[A-Z]*\s)(.*?)(?=\sis)
The above RegEx gives me BIG BLACK cat as output whereas I just need cat.

One solution is to simplify your regular expression a bit,
[A-Z]+\s(\w+)\sis
and use only the matched group (i.e., \1). See it in action here.
Since you came up with something more complex, I assume you understand all the parts of the above expression but for someone who might come along later, here are more details:
[A-Z]+ will match one or more upper-case characters
\s will match a whitespace character
(\w+) will match one or more word characters ([a-zA-Z0-9_]) and store the match in the first match group
\s will match a whitespace character
is will match "is"
My example is very specific and may break down for different input. Your question didn't provide many details about what other inputs you expect, so I'm not confident my solution will work in all cases.

Try this one:
String TestInput = "In THE house BIG BLACK cat is very good.";
Pattern p = Pattern.compile(
        "(?<=\\b\\p{Lu}+\\s) # lookbehind assertion to ensure a uppercase word before\n"
        + "\\p{L}+ # matching at least one letter\n"
        + "(?=\\sis) # lookahead assertion to ensure a whitespace is ahead\n",
        Pattern.COMMENTS);
Matcher m = p.matcher(TestInput);
if (m.find())
    System.out.println(m.group(0));
it matches only "cat".
\p{L} is a Unicode property for a letter in any language.
\p{Lu} is a Unicode property for an uppercase letter in any language.

You want to look for a condition that depends on several pieces of information and then retrieve only a specific part of that information. That is not possible in a regex without grouping. In Java you should do it like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[A-Z]+\\s(\\w+)\\sis");
        Matcher matcher = pattern.matcher("In THE house BIG BLACK cat is very good.");
        if (matcher.find())
            System.out.println(matcher.group(1));
    }
}
group(1) is the part of the pattern with parentheses around it, in this case \w+, and that's your word. The return type of group() is String, so you can use it right away.

The following part has a strange behavior:
(?<=[A-Z]*\s)(.*?)
For some reason [A-Z]* is matching an empty string, and (.*?) is matching BIG BLACK. With a few tweaks, I think the following will work (but it still matches some false positives):
(?<=[A-Z]+\s)(\w+)(?=\sis)
A slightly better regex would be:
(?<=\b[A-Z]+\s)(\w+)(?=\sis)
Hope it helps

String m = "In THE house BIG BLACK cat is very good.";
Pattern p = Pattern.compile("[A-Z]+\\s\\w+\\sis");
Matcher m1 = p.matcher(m);
if (m1.find()) {
    String[] group = m1.group().split("\\s"); // split the match on whitespace
    System.out.println(group[1]); // print the second element
}

How can I make a Java regex all or nothing?

I'm trying to make a regex all or nothing in the sense that the given word must EXACTLY match the regular expression - if not, a match is not found.
For instance, if my regex is:
^[a-zA-Z][a-zA-Z|0-9|_]*
Then I would want to match:
cat9
cat9_
bob_____
But I would NOT want to match:
cat7-
cat******
rango78&&
I want my regex to be as strict as possible, going for an all or nothing approach. How can I go about doing that?
EDIT: To make my regex absolutely clear, a pattern must start with a letter, followed by any number of numbers, letters, or underscores. Other characters are not permitted. Below is the program in question I am using to test out my regex.
Pattern p = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");
Scanner in = new Scanner(System.in);
String result = "";
while (!result.equals("-1")) {
    result = in.nextLine();
    Matcher m = p.matcher(result);
    if (m.find()) {
        System.out.println(result);
    }
}

I think that if you use String.matches(regex), then you will get the effect you are looking for. The documentation says that matches() will return true only if the entire string matches the pattern.
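As a small sketch of that suggestion, the following checks the sample words from the question with String.matches. The class name IdentifierCheck and the method isIdentifier are hypothetical. I have also dropped the | characters from the character class, on the assumption that they were meant as separators and not as allowed characters (inside [...] a | is just a literal character).

```java
// Hypothetical class name, for illustration only.
class IdentifierCheck {
    // A letter followed by any number of letters, digits, or underscores.
    // No ^ or $ needed: String.matches only succeeds on the entire string.
    static final String IDENTIFIER = "[a-zA-Z][a-zA-Z0-9_]*";

    static boolean isIdentifier(String word) {
        return word.matches(IDENTIFIER);
    }

    public static void main(String[] args) {
        String[] samples = {"cat9", "cat9_", "bob_____", "cat7-", "cat******", "rango78&&"};
        for (String sample : samples) {
            // Prints true for the first three samples and false for the last three.
            System.out.println(sample + " -> " + isIdentifier(sample));
        }
    }
}
```

If the same check runs many times, compiling the pattern once with Pattern.compile and calling matcher(word).matches() avoids recompiling the regex on every call.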

The regex won't match the second example. It's already strict, since * and & are not in the allowed set of characters.
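It may match a prefix, but you can avoid this by adding '$' to the end of the regex, which explicitly matches end of input. Note also that | has no special meaning inside a character class, so [a-zA-Z|0-9|_] allows a literal | as well; it is safer to drop it. So try,
^[a-zA-Z][a-zA-Z0-9_]*$
This will ensure the match is against the entire input string, and not just a prefix.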

Note that \w is the same as [A-Za-z0-9_]. And you need to anchor to the end of the string like so:
Pattern p = Pattern.compile("^[a-zA-Z]\\w*$");

Regex to match four repeated letters in a string using a Java pattern

I want to match something like aaaa, aaaad, adjjjjk. Something like ([a-z])\1+ was used to match the repeated characters, but I am not able to figure this out for four letters.

You want to match a single character and then that character repeated three more times:
([a-z])\1{3}
Note: In Java you need to escape the backslashes inside your regular expressions.
Update: The reason why it isn't doing what you want is because you are using the method matches which requires that the string exactly matches the regular expression, not just that it contains the regular expression. To check for containment you should instead use the Matcher class. Here is some example code:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Program
{
    public static void main(String[] args)
    {
        Pattern pattern = Pattern.compile("([a-z])\\1{3}");
        Matcher matcher = pattern.matcher("asdffffffasdf");
        System.out.println(matcher.find());
    }
}
Result:
true

Not knowing about the finite repetition syntax, your own problem solving skill should lead you to this:
([a-z])\1\1\1
Obviously it's not pretty, but:
It works
It exercises your own problem solving skill
It may lead you to deeper understanding of concepts
In this case, knowing the desugared form of the finite repetition syntax
I have a concern:
"ffffffff".matches("([a-z])\\1{3,}") = true
"fffffasdf".matches("([a-z])\\1{3,}") = false
"asdffffffasdf".matches("([a-z])\\1{3,}") = false
What can I do for the bottom two?
The problem is that in Java, matches needs to match the whole string; it is as if the pattern were surrounded by ^ and $.
Unfortunately there is no String.containsPattern(String regex), but you can always use this trick of surrounding the pattern with .*:
"asdfffffffffasf".matches(".*([a-z])\\1{3,}.*") // true!
// note the .* added at both the start and the end of the pattern
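If the .* wrapping feels awkward, the same containment check can be written with Matcher.find(). Here is a minimal sketch; containsPattern is a hypothetical helper named after the missing method mentioned above, not an existing String method.

```java
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class RegexContains {
    // Returns true if the regex matches anywhere inside the input.
    static boolean containsPattern(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String fourOrMore = "([a-z])\\1{3,}";
        System.out.println("fffffasdf".matches(fourOrMore));               // false: whole string must match
        System.out.println(containsPattern("fffffasdf", fourOrMore));      // true
        System.out.println(containsPattern("asdffffffasdf", fourOrMore));  // true
        System.out.println(containsPattern("aaab", fourOrMore));           // false: only three in a row
    }
}
```

Unlike the .* trick, find() is not affected by line breaks in the input, since no . has to match them.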

You can put {n} after something to match it n times, so:
([a-z])\1{3}

The general regex syntax for a fixed number of repetitions is {n}, for example {4}.
Thus here ([a-z])\1{3} should match your 4 chars.
