Why regex group doesn't work - java

I've tried the following regex (in Java string format):
^(.*(iOS\\s+[\\d\\.]+|Android\\s+[\\d\\.]+)?.*)$
The string to match is:
Some Money 2.6.2; iOS 5.1.1
It is supposed to return three groups:
group[0] :Some Money 2.6.2; iOS 5.1.1
group[1] :Some Money 2.6.2; iOS 5.1.1
group[2] :iOS 5.1.1
but it actually returns these:
group[0] :Some Money 2.6.2; iOS 5.1.1
group[1] :Some Money 2.6.2; iOS 5.1.1
group[2] :null
When I change the regex as below, group[2] is captured,
^(.*(iOS\\s+[\\d\\.]+|Android\\s+[\\d\\.]+).*)$
but then it can't match a string like
whatever iS 5.1.1 whatever
What I want is a regex that always returns three groups, whatever the input string looks like. The first and second groups should always be the entire string. The third group should be the substring that matches '(iOS|Android) [\d.]*' if the string contains such a part, and null or empty if it doesn't.

Maybe you can use the ; delimiter as indication that your iOS 5.1.1 part starts?
Then a pattern may look like .+;\\s+(.+).
.+; consumes everything up to the semi-colon
\\s+ consumes the spaces between semi-colon and the start of the version string
(.+) consumes everything up to the end
If you really only want to match iOS or Android, you might want to add a non-capturing group within the (.+) part.
A regexp would then look like this: ".+;\\s+((?:iOS|Android).+)".
Here is an executable example of what a solution may look like. It shows the behaviour of both pattern variants explained above.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is chosen for illustration only.
public class PatternDemo {
    public static void main(String[] args) {
        String input1 = "Some Money 2.6.2; iS 5.1.1 ";
        String input2 = "Some Money 2.6.2; iOS 5.1.1 ";
        String input3 = "Some Money 2.6.2; Android 5.1.1 ";
        String pattern1 = ".+;\\s+(.+)";
        String pattern2 = ".+;\\s+((?:iOS|Android).+)";
        System.out.println(pattern1);
        matchPattern(input1, pattern1);
        matchPattern(input2, pattern1);
        matchPattern(input3, pattern1);
        System.out.println();
        System.out.println(pattern2);
        matchPattern(input1, pattern2);
        matchPattern(input2, pattern2);
        matchPattern(input3, pattern2);
    }

    // Prints the whole match and its groups, but only if the entire input matches.
    private static void matchPattern(String input, String pattern) {
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(input);
        if (m.matches()) {
            System.out.println(m.group(0));
            System.out.println(m.group(1));
            // Both patterns above define a single group, so this branch is skipped.
            if (m.groupCount() > 1) {
                System.out.println(m.group(2));
            }
        }
    }
}
Update: Since the goal of the question became clearer after some edits by the author, I have updated my answer. If the point is to always get three groups, the following might be better than working out all possible notation variants:
// Requires java.util.regex.Matcher and java.util.regex.Pattern.
public static void main(String[] args) {
    String input1 = "Some Money 2.6.2; iS 5.1.1";
    String input2 = "Some Money 2.6.2; iOS 5.1.1";
    String input3 = "Some Money 2.6.2; Android 5.1.1";
    String input4 = "Some Money 2.6.2 iOS 5.1.1";
    String input5 = "Some Money 2.6.2 iOS";
    String input6 = "Some Money 2.6.2";
    // group(2) wraps only the OS name and version; the trailing .* stays outside it.
    String pattern1 = "(.*?(?:((?:iOS|Android)(?:\\s+[0-9\\.]+)?).*)?)";
    System.out.println(pattern1);
    matchPattern(input1, pattern1);
    matchPattern(input2, pattern1);
    matchPattern(input3, pattern1);
    matchPattern(input4, pattern1);
    matchPattern(input5, pattern1);
    matchPattern(input6, pattern1);
}

private static void matchPattern(String input, String pattern) {
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(input);
    if (m.matches()) {
        System.out.println(m.group(0));
        System.out.println(m.group(1));
        // Prints "null" when the input contains no iOS/Android part.
        System.out.println(m.group(2));
        System.out.println();
    }
}
Here the pattern is (.*?(?:((?:iOS|Android)(?:\\s+[0-9\\.]+)?).*)?).
.*? consumes everything before the version string. If there is no version string at all, it matches the whole input. The reluctant quantifier is needed here: it takes the shortest match that still lets the overall pattern succeed, and so prevents .* from consuming the whole input before the version string is tried.
(?:((?:iOS|Android)(?:\\s+[0-9\\.]+)?).*)? consumes the whole version string and everything that follows it.
((?:iOS|Android)(?:\\s+[0-9\\.]+)?) is the group(2) output. It matches just the OS string, iOS or Android, with an optional version suffix consisting of digits and dots.

Please refer to the topic "How a RegEx engine works".
Many engines, including Java's, are based on backtracking. These often compile the pattern into byte-code resembling machine instructions. The engine then executes the code, jumping from instruction to instruction. When an instruction fails, it backtracks to find another way to match the input.
Your regular expression has many ways to match the input, and unfortunately the engine returns one of the other ways, not the match you expected: the leading .* consumes the whole string, and the optional group is then allowed to match nothing.
By removing the "?" quantifier from the second group, that group becomes required.
The returned match then has to satisfy all required groups.
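The backtracking behaviour described above can be observed directly. The following is a minimal sketch using the two regexes from the question; the class name GreedyGroupDemo and the helper secondGroup are made up for illustration. With the optional group, the greedy .* swallows the whole input and the engine succeeds without backtracking, so group 2 never participates. Without the "?", the engine has to give characters back until the group matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answers.
class GreedyGroupDemo {

    // Returns group 2 if the whole input matches; null if the group did not
    // participate in the match or the input did not match at all.
    static String secondGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "Some Money 2.6.2; iOS 5.1.1";
        String optional = "^(.*(iOS\\s+[\\d\\.]+|Android\\s+[\\d\\.]+)?.*)$";
        String required = "^(.*(iOS\\s+[\\d\\.]+|Android\\s+[\\d\\.]+).*)$";

        System.out.println(secondGroup(optional, input)); // prints null
        System.out.println(secondGroup(required, input)); // prints iOS 5.1.1
    }
}
```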

I finally solved the problem with the regex below.
(.*((?:iOS|Android)\\s+[0-9\\.]+).*|.*)
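A possible way to wrap this regex in a helper is sketched below. The class name OsVersionExtractor and the method osVersion are assumptions made for illustration. The first alternative is tried first and requires the OS part; only when it fails does the plain .* alternative match, leaving group 2 null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper around the regex from the post above.
class OsVersionExtractor {

    private static final Pattern OS_VERSION =
            Pattern.compile("(.*((?:iOS|Android)\\s+[0-9\\.]+).*|.*)");

    // Returns e.g. "iOS 5.1.1", or null if the input has no such part.
    // Note: '.' does not match line breaks, so multi-line input yields null.
    static String osVersion(String input) {
        Matcher m = OS_VERSION.matcher(input);
        return m.matches() ? m.group(2) : null;
    }
}
```

Calling osVersion("Some Money 2.6.2; iOS 5.1.1") gives "iOS 5.1.1", while osVersion("whatever iS 5.1.1 whatever") gives null.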

Related

How to parse a string to get array of #tags out of the string?

So I have a string like
"#tag1 #tag2 #tag3 not_tag1 not_tag2 #tag4" (the spacing between tag2 and tag4 is meant to show that there can be many spaces). From this string I want to parse just tag1, tag2 and so on. They are similar to the #tags we see on LinkedIn or other social media. Is there an easy way to do this using a regex or any other function in Java, or should I do it the hard way (i.e. using loops and conditions)?
A tag starts with "#" and ends with a space " ". In between there can be letters or digits, but the first character must be a letter.
Example:
input : "#tag1 #tag2 #tag3 not_tag1 not_tag2 #12tag #tag4"
output : ["tag1", "tag2", "tag3", "tag4"]
Split by the regex "#\w+".
EDIT: this is the correct regex, but split is not the right method.
Same regex as javadev suggested, but use a Matcher with find() instead of split:
String input = "#tag1 #tag2 #tag3 not_tag1 not_tag2 #12tag #tag4";
Matcher matcher = Pattern.compile("#\\w+").matcher(input);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
This prints each tag including the leading #. Note that it also prints #12tag, because \w matches digits as well as letters.
Maybe something like:
public static void main(String[] args) {
    String input = "#tag1 #tag2 #tag3 not_tag1 not_tag2 #12tag #tag4";
    // [A-Za-z] rather than [A-z]: the range A-z also covers [ \ ] ^ _ and the backtick.
    Pattern pattern = Pattern.compile("#([A-Za-z][A-Za-z0-9]*) *");
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}
worked for me :)
Output:
tag1
tag2
tag3
tag4
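If the tags are needed as a list rather than printed, the same idea can be sketched as follows. The class name TagParser and the method parseTags are hypothetical, and the lookarounds encode an assumption that goes slightly beyond the question: a tag must start at the beginning of the input or after whitespace, and must be followed by whitespace or the end of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class TagParser {

    // '#', then a letter, then letters or digits. The lookbehind and lookahead
    // require whitespace (or the input boundary) on both sides of the tag.
    private static final Pattern TAG =
            Pattern.compile("(?<!\\S)#([A-Za-z][A-Za-z0-9]*)(?!\\S)");

    static List<String> parseTags(String input) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            tags.add(m.group(1)); // group 1 excludes the leading '#'
        }
        return tags;
    }
}
```

For the example input this returns tag1, tag2, tag3 and tag4 and skips #12tag, since the character after # must be a letter.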

Regex to remove line break within double quote in CSV

Hi, I have a CSV file with an error in it that I want to correct with a regular expression. Some of the fields contain a line break, as in the example below:
"AHLR150","CDS","-1","MDCPBusinessRelationshipID",,,"Investigating","1600 Amphitheatre Pkwy
California",,"Mountain View",,"United States",,"California",,,"94043-1351","9958"
The two lines above should be a single line:
"AHLR150","CDS","-1","MDCPBusinessRelationshipID",,,"Investigating","1600 Amphitheatre PkwyCalifornia",,"Mountain View",,"United States",,"California",,,"94043-1351","9958"
I tried the regex below, but it didn't help:
%s/\\([^\"]\\)\\n/\\1/
Try this:
public static void main(String[] args) {
    String input = "\"AHLR150\",\"CDS\",\"-1\",\"MDCPBusinessRelationshipID\","
            + ",,\"Investigating\",\"1600 Amphitheatre Pkwy\n"
            + "California\",,\"Mountain View\",,\"United\n"
            + "States\",,\"California\",,,\"94043-1351\",\"9958\"\n";
    // Matches a quoted field that contains at least one line break.
    Matcher matcher = Pattern.compile("\"([^\"]*[\n\r].*?)\"").matcher(input);
    Pattern patternRemoveLineBreak = Pattern.compile("[\n\r]");
    String result = input;
    while (matcher.find()) {
        String quoteWithLineBreak = matcher.group(1);
        String quoteNoLineBreaks = patternRemoveLineBreak.matcher(quoteWithLineBreak).replaceAll(" ");
        // replace() takes both arguments literally; replaceFirst() would
        // interpret the field text as a regex and break on characters like ( or $.
        result = result.replace(quoteWithLineBreak, quoteNoLineBreaks);
    }
    // Output
    System.out.println(result);
}
Output:
"AHLR150","CDS","-1","MDCPBusinessRelationshipID",,,"Investigating","1600 Amphitheatre Pkwy California",,"Mountain View",,"United States",,"California",,,"94043-1351","9958"
Create a regex that surrounds the text you want to keep with parentheses; each pair creates a group of matched characters. Then build the replacement string from the group indexes as you wish.
String test = "\"AHLR150\",\"CDS\",\"-1\",\"MDCPBusinessRelationshipID\","
+ ",,\"Investigating\",\"1600 Amphitheatre Pkwy\n"
+ "California\",,\"Mountain View\",,\"United\n"
+ "States\",,\"California\",,,\"94043-1351\",\"9958\"\n";
System.out.println(test.replaceAll("(\"[^\"]*)\n([^\"]*\")", "$1$2"));
So when we replace the matching string ("United\nStates") with $1$2, we remove the line break, because it does not belong to any group:
$1 => the first group (\"[^\"]*), which matches "United
$2 => the second group ([^\"]*\"), which matches States"
Based on this you can try with:
/\r?\n|\r/
I checked it and it seems to be fine.
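As an alternative to the regex answers above, the quote state can be tracked explicitly while scanning the text. This is a different technique from the ones posted, sketched here under a few assumptions: fields are delimited by plain double quotes, a line break inside a field is replaced by a single space (as in the first answer's output), and the class and method names are made up. Unlike a pattern such as (\"[^\"]*)\n([^\"]*\"), it cannot accidentally join two records when one line ends with a quote and the next begins with one.

```java
// Hypothetical helper; a character scan rather than a regex.
class CsvLineBreakFixer {

    // Replaces every line break that occurs inside a double-quoted field
    // with one space. Line breaks between records are left untouched.
    static String joinQuotedLineBreaks(String csv) {
        StringBuilder out = new StringBuilder(csv.length());
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state is preserved.
                inQuotes = !inQuotes;
                out.append(c);
            } else if (inQuotes && (c == '\n' || c == '\r')) {
                // Treat "\r\n" as a single line break.
                if (c == '\r' && i + 1 < csv.length() && csv.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(' ');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```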

regex to find email address from a String

My intention is to get email addresses from a web page. I have the page source and am reading it line by line. Now I want to get the email address from the line I am currently reading; the line may or may not contain one. I have seen a lot of regexp examples, but most of them are for validating an email address. I want to extract the address from the page source, not validate it. It should work the way http://emailx.discoveryvip.com/ works.
Some examples input lines are :
1)<p>Send details to neeraj@yopmail.com</p>
2)<p>Interested should send details directly to www.abcdef.com/abcdef/. Should you have any questions, please email neeraj@yopmail.com.
3)Note :- Send your queries at neeraj@yopmail.com for more details call Mr. neeraj 012345678901.
I want to get neeraj@yopmail.com from examples 1, 2 and 3.
I am using Java and I am not good at regexps. Please help.
You can validate e-mail address formats according to RFC 2822 with this:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
and here's an explanation from regular-expressions.info:
This regex has two parts: the part before the @, and the part after the @. There are two alternatives for the part before the @: it can either consist of a series of letters, digits and certain symbols, including one or more dots. However, dots may not appear consecutively or at the start or end of the email address. The other alternative requires the part before the @ to be enclosed in double quotes, allowing any string of ASCII characters between the quotes. Whitespace characters, double quotes and backslashes must be escaped with backslashes.
And you can check this out here: Rubular example.
The correct code is
Pattern p = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b",
        Pattern.CASE_INSENSITIVE);
Matcher matcher = p.matcher(input);
Set<String> emails = new HashSet<String>();
while (matcher.find()) {
    emails.add(matcher.group());
}
This will give you the set of mail addresses found in your long text / HTML input.
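For reading a page line by line, the find() loop can be packaged as a small method. The sketch below is not a validator, only an extractor, and it makes two assumptions that are not in the original answer: addresses use the usual @ separator, and the top-level domain is allowed to be longer than four letters ({2,} instead of {2,4}). The class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class EmailExtractor {

    // Assumption: {2,} instead of {2,4}, to accept longer top-level domains.
    private static final Pattern EMAIL = Pattern.compile(
            "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\\b",
            Pattern.CASE_INSENSITIVE);

    // Returns the distinct addresses found in one line, in order of appearance.
    static Set<String> extractEmails(String line) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(line);
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }
}
```

A line without an address simply yields an empty set, so the method can be called on every line of the page source.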
You need something like this regex:
".*(\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b).*"
When it matches, you can extract the first group and that will be your email.
String regex = ".*(\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b).*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("your text here");
if (m.matches()) {
    String email = m.group(1);
    // do something with your email
}
This is a simple way to extract all emails from an input String using Android's Patterns.EMAIL_ADDRESS (android.util.Patterns):
public static List<String> getEmails(@NonNull String input) {
    List<String> emails = new ArrayList<>();
    Matcher matcher = Patterns.EMAIL_ADDRESS.matcher(input);
    while (matcher.find()) {
        int matchStart = matcher.start(0);
        int matchEnd = matcher.end(0);
        emails.add(input.substring(matchStart, matchEnd));
    }
    return emails;
}

Regular Expression Search On String

I am having great difficulty searching a string for particular parameters that my application needs. I assume the only real way to do this is with regular expressions, but they are giving me a huge headache! I don't usually write them myself but take them from other websites; however, what I need isn't simple enough to be covered there.
Here is the string:
10 50 u E2U+pstn:tel "!^(.*)$!tel:\\1;spn=42180;mcc=234;mnc=33!" .
I need to extract the spn, mcc and mnc values from this string. Unfortunately the API I call changes the position of these in the string for some requests, which makes indexing into the string difficult. What I really need is to search for, say, spn= and then read the number that follows, but nothing I have tried works.
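I wouldn't use a regex but simply split the string:
int spn = -1; // stays -1 if no "spn=" token is found
String[] tokens = str.split(";");
for (int i = 0; i < tokens.length; i++) {
    // Note: the last token may carry trailing characters (here !" .)
    // that must be stripped before it can be parsed as a number.
    if (tokens[i].startsWith("spn=")) {
        spn = Integer.parseInt(tokens[i].substring("spn=".length()));
    }
}
Of course you could make this more object-oriented, or use constants for "spn=".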
A solution using Pattern and Matcher:
String s = "10 50 u E2U+pstn:tel \"!^(.*)$!tel:\\\\1;spn=42180;mcc=234;mnc=33!\"";
Pattern p = Pattern.compile("^.*spn=([0-9]+);mcc=([0-9]*);mnc=([0-9]*)!.*$");
Matcher matcher = p.matcher(s);
matcher.matches(); // true
String spn = matcher.group(1); // 42180
String mcc = matcher.group(2); // 234
String mnc = matcher.group(3); // 33
Edit: You can use named capturing groups (Java 7 and later), too:
Pattern p =
Pattern.compile("^.*spn=(?<spn>[0-9]+);mcc=(?<mcc>[0-9]*);mnc=(?<mnc>[0-9]*)!.*$");
Matcher matcher = p.matcher(s);
matcher.matches(); // true
String spn = matcher.group("spn");
String mcc = matcher.group("mcc");
String mnc = matcher.group("mnc");
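Both patterns above expect spn, mcc and mnc in exactly that order, whereas the question says the position can change between requests. One way to avoid depending on the order is to look for each key=value pair separately with find(). This is a sketch under the assumption that the values are always plain digits; the class name ParamExtractor and the method extract are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class ParamExtractor {

    // One of the three keys, '=', then the digits that follow.
    private static final Pattern PARAM =
            Pattern.compile("\\b(spn|mcc|mnc)=([0-9]+)");

    // Returns the parameters that are present, keyed by name,
    // in whatever order they occur in the input.
    static Map<String, String> extract(String s) {
        Map<String, String> values = new LinkedHashMap<>();
        Matcher m = PARAM.matcher(s);
        while (m.find()) {
            values.put(m.group(1), m.group(2));
        }
        return values;
    }
}
```

A missing parameter is simply absent from the map, so callers can check with containsKey before parsing the number.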

android : extract uk postcode

Hello, I am trying to extract a UK postcode from a string, e.g. "the person's house is at SS9 8ID we'll be there at 8pm", so that I can extract the "SS9 8ID" bit. I've tried the following code but it's not working for some reason. Any ideas?
String pc1="^([A-PR-UWYZ](([0-9](([0-9]|[A-HJKSTUW])?)?)|([A-HK-Y][0-9]([0-9]|[ABEHMNPRVWXY])?)) [0-9][ABD-HJLNP-UW-Z]{2})|GIR 0AA$";
String test="the person's house is at SS9 8ID we'll be there at 8pm";
Pattern pattern = Pattern.compile(pc1);
Matcher matcher = pattern.matcher(test.toUpperCase());
if (matcher.matches()) {
//Log.d("pccode:::", matcher.group(1) );
Log.d("pccode:::", matcher.group());
} else { Log.d("NO","NO PCODE"); }
The matches method has to match the whole string; you should use find instead. And don't use ^ and $ in the expression.
Also, SS9 8ID doesn't match the regexp, because [ABD-HJLNP-UW-Z] doesn't include the letter I, which appears in the postcode.
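Put together, the two points could look like the sketch below. It is a plain-Java illustration rather than the asker's Android code: Log.d is left out, the class and method names are invented, and the word boundaries (\\b) are an addition that replaces the removed ^ and $. The capturing groups of the original pattern are simplified to non-capturing ones, and the test postcode "SS9 8JD" is a made-up value chosen only because its letters are inside the allowed ranges.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class PostcodeFinder {

    // Same character classes as in the question, without ^ and $.
    // The outer (?: ... ) makes \b apply to both alternatives.
    private static final Pattern POSTCODE = Pattern.compile(
            "\\b(?:[A-PR-UWYZ](?:[0-9](?:[0-9]|[A-HJKSTUW])?"
            + "|[A-HK-Y][0-9](?:[0-9]|[ABEHMNPRVWXY])?)"
            + " [0-9][ABD-HJLNP-UW-Z]{2}|GIR 0AA)\\b");

    // Returns the first postcode found anywhere in the text, or null.
    static String findPostcode(String text) {
        Matcher m = POSTCODE.matcher(text.toUpperCase(Locale.ROOT));
        return m.find() ? m.group() : null; // find(), not matches()
    }
}
```

With this, findPostcode("the person's house is at SS9 8JD we'll be there at 8pm") returns "SS9 8JD", while the original text with "SS9 8ID" still returns null because of the letter I.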
