regex- Extracting in different strings - java

I have this string:
Date Description Amount Price Charge Shares Owned
04/30/13 INCOME REINVEST 0.0245 $24.66 $12.34 1.998 1,008.369
05/31/13 INCOME REINVEST 0.0228 $22.99 $12.22 1.881 1,010.250
06/28/13 INCOME REINVEST 0.0224 $22.63 $11.97 1.891 1,012.141
I want to extract the dates into a string array, say "matchedDate", and likewise the descriptions, which in this case are "INCOME REINVEST", "INCOME REINVEST", "INCOME REINVEST".
Amount in an array, which here is: "0.0245", "0.0228", "0.0224"
Price in an array: "24.66", "22.99", "22.63"
Charge in an array: "12.34", "12.22", "11.97"
Shares in an array: "1.998", "1.881", "1.891"
I don't need the last column, "Owned", which corresponds to 1,008.369, 1,010.250 and 1,012.141.
So far I am able to extract the dates with this:
String regex = "[0-9]{2}/[0-9]{2}/[0-9]{2}";
Pattern dateMatch = Pattern.compile(regex);
Matcher m = dateMatch.matcher(regString);
while (m.find()) {
    // Each find() yields one date, so this one-element array is rebuilt on every iteration.
    String[] matchedDate = new String[] { m.group() };
    for (int count = 0; count < matchedDate.length; count++) {
        System.out.println(matchedDate[count]);
    }
}
regString is the string I am matching against, i.e. the table shown in the first block.
I don't need the $ signs, so the numbers can be stored in numeric arrays. I think we have to identify some pattern of spaces and dollar signs to do this.
Any help would be appreciated.

This should match the parts you need:
(\d{1,2}/\d{1,2}/\d{1,2}).+?([\d.]+)\s\$(\S+)\s\$(\S+)\s(\S+)
Explained:
(\d{1,2}/\d{1,2}/\d{1,2}) - capture date
.+? - match anything up to next number
([\d.]+)\s - capture Amount but match space following it
\$(\S+)\s - match a literal $, capture Price, then match the space following it
\$(\S+)\s - match a literal $, capture Charge, then match the space following it
(\S+) - capture Shares
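As a rough sketch of how the pattern above could be applied to the whole table from the question, the following collects one list per captured column. The class name StatementColumns, the SAMPLE constant and the column helper are made up for illustration; only the regex itself comes from the answer. Note that this pattern does not capture the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class StatementColumns {
    // The table from the question, one record per line.
    static final String SAMPLE =
            "Date Description Amount Price Charge Shares Owned\n"
            + "04/30/13 INCOME REINVEST 0.0245 $24.66 $12.34 1.998 1,008.369\n"
            + "05/31/13 INCOME REINVEST 0.0228 $22.99 $12.22 1.881 1,010.250\n"
            + "06/28/13 INCOME REINVEST 0.0224 $22.63 $11.97 1.891 1,012.141\n";

    // The answer's regex, escaped for a Java string literal.
    static final Pattern ROW = Pattern.compile(
            "(\\d{1,2}/\\d{1,2}/\\d{1,2}).+?([\\d.]+)\\s\\$(\\S+)\\s\\$(\\S+)\\s(\\S+)");

    // Collects the text of one capture group from every matching row.
    static List<String> column(String table, int group) {
        List<String> values = new ArrayList<>();
        Matcher m = ROW.matcher(table);
        while (m.find()) {
            values.add(m.group(group));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println("dates:   " + column(SAMPLE, 1)); // [04/30/13, 05/31/13, 06/28/13]
        System.out.println("amounts: " + column(SAMPLE, 2)); // [0.0245, 0.0228, 0.0224]
        System.out.println("prices:  " + column(SAMPLE, 3)); // [24.66, 22.99, 22.63]
        System.out.println("charges: " + column(SAMPLE, 4)); // [12.34, 12.22, 11.97]
        System.out.println("shares:  " + column(SAMPLE, 5)); // [1.998, 1.881, 1.891]
    }
}
```

The header line contains no digits, so it is skipped without any special handling, and the trailing "Owned" value is simply never captured.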

String regString = "04/30/13 INCOME REINVEST 0.0245 $24.66 $12.34 1.998 1,008.36";
String regex="([0-9]{2}/[0-9]{2}/[0-9]{2})\\s*([\\w ]+)\\s*(\\d+(\\.\\d+)?)\\s*\\$(\\d+(\\.\\d+)?)\\s*\\$(\\d+(\\.\\d+)?)\\s*(\\d+(\\.\\d+)?)\\s*(\\d+(,\\d{3})*(\\.\\d+)?)";
Pattern match = Pattern.compile(regex);
Matcher m = match.matcher(regString);
while (m.find()) {
System.out.println(m.group(1)); //04/30/13
System.out.println(m.group(2)); //INCOME REINVEST
System.out.println(m.group(3)); //0.0245
System.out.println(m.group(5)); //24.66
System.out.println(m.group(7)); //12.34
System.out.println(m.group(9)); //1.998
System.out.println(m.group(11)); //1,008.36
}
Regex Breakdown:
([0-9]{2}/[0-9]{2}/[0-9]{2}) - Your date regex.
([\\w ]+) - Description - 1+ Word characters and spaces.
(\\d+(\\.\\d+)?) (used 4 times) - Amount, Price, Charge, Shares - 1+ number potentially followed by a . and at least 1 more number.
(\\d+(,\\d{3})*(\\.\\d+)?) - 1+ number, followed potentially by sequences of a , and 3 numbers, followed potentially by a . and at least 1 more number.

String r = "([0-9]{2}/[0-9]{2}/[0-9]{2}).+?\\$((?:(?:\\d+|\\d+,\\d+)\\.\\d+\\s\\$?){3})";
String list = "04/30/13 INCOME REINVEST 0.0245 $24.66 $12.34 1.998 1,008.369";
Matcher m = Pattern.compile(r).matcher(list);
while (m.find())
{
String myData = m.group(1) + " " + m.group(2).replace("$", "");
String[] data = myData.split(" ");
for(String s : data)
System.out.println(s);
}
Outputs:
04/30/13
24.66
12.34
1.998
.+?\\$: non-greedy, so it stops at the first '$'; basically skips everything up to and including that '$'
((?:(?:\\d+|\\d+,\\d+)\\.\\d+\\s\\$?){3}) uses a capturing group to get the three numbers of interest, along with one of the '$' signs, which is removed via .replace(). You could do this without .replace(), but the expression would be fairly long.
(?:\\d+|\\d+,\\d+) says "group, but do not capture" a number or #,#
\\.\\d+\\s\\$? says a '.' followed by a #, followed by whitespace and an optional '$'
Here's a general tutorial on Regular Expressions. Here's the section on capturing groups. Good luck!

This should give you what you need, and it also works for any number of similar records in your input string:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
private static Pattern PATTERN = Pattern.compile("([0-9]{2}/[0-9]{2}/[0-9]{2})\\s+([a-zA-Z]+\\s[a-zA-Z]+)\\s+(\\d{1}\\.\\d{0,4})\\s+\\$(\\d{1,2}\\.\\d{0,2})\\s+\\$(\\d{1,2}\\.\\d{0,2})\\s+(\\d{1,2}\\.\\d{0,3})\\s+");
public static void main(String a[] ) {
String regString = "04/30/13 INCOME REINVEST 0.0245 $24.66 $12.34 1.998 1,008.369 " +
"05/31/13 INCOME REINVEST 0.0228 $22.99 $12.22 1.881 1,010.250 " +
"06/28/13 INCOME REINVEST 0.0224 $22.63 $11.97 1.891 1,012.141 ";
ArrayList<String> date = new ArrayList<String>();
ArrayList<String> desc = new ArrayList<String>();
ArrayList<String> amt = new ArrayList<String>();
ArrayList<String> price = new ArrayList<String>();
ArrayList<String> charge = new ArrayList<String>();
ArrayList<String> share = new ArrayList<String>();
Matcher m = PATTERN.matcher(regString);
while(m.find()) {
date.add(m.group(1));
desc.add(m.group(2));
amt.add(m.group(3));
price.add(m.group(4));
charge.add(m.group(5));
share.add(m.group(6));
}
System.out.println("DATE : " + date.toString());
System.out.println("DESC : " + desc.toString());
System.out.println("AMOUNT : " + amt.toString());
System.out.println("PRICE : " + price.toString());
System.out.println("CHARGE : " + charge.toString());
System.out.println("SHARES : " + share.toString());
}
}
The output of the above program is:
DATE : [04/30/13, 05/31/13, 06/28/13]
DESC : [INCOME REINVEST, INCOME REINVEST, INCOME REINVEST]
AMOUNT : [0.0245, 0.0228, 0.0224]
PRICE : [24.66, 22.99, 22.63]
CHARGE : [12.34, 12.22, 11.97]
SHARES : [1.998, 1.881, 1.891]

Related

Regex to capture the string starting with a specific word or character and ending with either one of the words

I want to capture the string after the last slash and before either the text (; sid=) or a (?) character.
Sample data:
sessionId=30a793b1-ed7e-464a-a630; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;
sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;
sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;
Expecting the below output:
1. itemSummary
2. itemList
3. ''(empty string)
I have built the regex below to capture it, but it is not 100% accurate: it captures some additional text.
Regex:
Url=.*\/(.*)(; sid|\?)
Could you please help me improve the regex to get the desired output?
Thanks in advance!

You may use this regex in Java with a greedy match after Url=:
\bUrl=\S+/([^?;/]+)(?=; sid|\?)
RegEx details:
\b: Word boundary
Url=: Match text Url=
\S+/: Match 1+ non-whitespace characters followed by a /
([^?;/]+): Match 1+ of any character that is not ?, ; or /
(?=; sid|\?): Lookahead to assert that we have ; sid or ? ahead
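One detail worth checking against the expected output: because ([^?;/]+) needs at least one character, the third sample (whose URL ends in a slash) appears to produce no match at all rather than an empty string. A minimal sketch, assuming an empty capture is wanted in that case, swaps + for * in the capture group. The class name LastPathSegment, the lastSegment helper and the shortened sample lines are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class LastPathSegment {
    // Same as the answer's regex except that the capture group uses * instead of +,
    // so a URL ending in "/" yields an empty capture instead of no match.
    static final Pattern SEGMENT =
            Pattern.compile("\\bUrl=\\S+/([^?;/]*)(?=; sid|\\?)");

    // Returns the last path segment of the Url= field, or null if the line does not match.
    static String lastSegment(String line) {
        Matcher m = SEGMENT.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened versions of the three sample lines (session ids and sids abbreviated).
        String[] lines = {
            "sessionId=a1; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4d; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;",
            "sessionId=b2; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&type=new; sid=Q4hW targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=123;",
            "sessionId=c3; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;"
        };
        int count = 0;
        for (String line : lines) {
            System.out.printf("%d. '%s'%n", ++count, lastSegment(line));
        }
    }
}
```

The word boundary \b still keeps targetUrl= from matching, since there is no boundary between the t and the U.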

Alternative solution:
Used regex:
"^Url=.*/(\\w+|)$"
Regex in test bench and context:
public static void main(String[] args) {
String input1 = "sessionId=30a793b1-ed7e-464a-a630; "
+ "Url=https://www.example.com/mybook/order/newbooking/itemSummary; "
+ "sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;";
String input2 = "sessionId=sfdsdfsd-ba57-4e21-a39f-34; "
+ "Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; "
+ "sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=123;";
String input3 = "sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; "
+ "Url=https://www.example.com/mybook/order/newbooking/; "
+ "sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;";
List<String> inputList = Arrays.asList(input1, input2, input3);
// Pre-compiled Patterns should not be in loops - that is why they are placed outside the loops
Pattern replaceWithNewLinePattern = Pattern.compile(";?\\s|\\?");
Pattern extractWordFromUrlPattern = Pattern.compile("^Url=.*/(\\w+|)$", Pattern.MULTILINE);
int count = 0;
for(String input : inputList) {
String inputWithNewLines = replaceWithNewLinePattern.matcher(input).replaceAll("\n");
// System.out.println(inputWithNewLines); // Check the change...
Matcher matcher = extractWordFromUrlPattern.matcher(inputWithNewLines);
while (matcher.find()) {
System.out.printf( "%d. '%s'%n", ++count, matcher.group(1));
}
}
}
Output:
1. 'itemSummary'
2. 'itemList'
3. ''

How to extract data from string value using regex?

Hello I have the following string:
Country number Time Status USA B30111 11:15 ARRIVED PARIS NC0120 14:40 ON TIME DUBAI RA007 14:45 ON TIME
I need to extract following info:
country = USA
number = B30111
time = 11:15
status = ARRIVED
country = PARIS
number = NC0120
time = 14:40
status = ON TIME
How can I use regex to extract the above data from it?

You can try this:
(?: (\w+) ([\w\d]+) (\d+\:\d+) (ARRIVED|ON TIME))
Explanation
Because a status can contain more than one word, it cannot be distinguished from the country that follows it. You therefore have to list every possible status as an alternation (|) in the regex.
Java Source:
final String regex = "(?: (\\w+) ([\\w\\d]+) (\\d+\\:\\d+) (ARRIVED|ON TIME))";
final String string = "Country number Time Status USA B30111 11:15 ARRIVED PARIS NC0120 14:40 ON TIME DUBAI RA007 14:45 ON TIME\n\n\n";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("country =" + matcher.group(1));
System.out.println("number =" + matcher.group(2));
System.out.println("time =" + matcher.group(3));
System.out.println("status =" + matcher.group(4));
System.out.println("");
}
output
country =USA
number =B30111
time =11:15
status =ARRIVED
country =PARIS
number =NC0120
time =14:40
status =ON TIME
country =DUBAI
number =RA007
time =14:45
status =ON TIME

If you split the string with the split function, you get an array that holds each word.
String[] splitted = str.split(" ");
Then, to check the result, try this:
for(String test:splitted){
System.out.println(test);
}
This looks more like a CSV file.

Regular expression for mobile number validation?

I have the following regular expression for validating mobile numbers:
^(([+]|[0]{2})([\\d]{1,3})([\\s-]{0,1}))?([\\d]{10})$
Valid numbers are:
+123-9854875847
00123 9854875847
+123 9854875847
9878757845
The expression above correctly rejects a bare 9- or 11-digit mobile number. However, a 9-digit number prefixed with +123, or an 11-digit number prefixed with +91, still validates, because in ([\\d]{1,3}) the last two digits are optional, so digits can shift between the country code and the number.
So, is there any way to keep this part ([\\s-]{0,1}))?([\\d]{10}) from combining with this part ([\\d]{1,3})?

The question is somewhat unclear, but I presume you want to split the number and the country code.
This is quite easy to do by extracting groups. group(i) is the i-th thing in brackets.
I also applied these simplifications: [\\d] = \\d, {0,1} = ?, [+] = \\+, [0]{2} = 00.
Code:
String regex = "^((\\+|00)(\\d{1,3})[\\s-]?)?(\\d{10})$";
String str = "+123-9854875847";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
if (m.matches())
{
System.out.println("Country = " + m.group(3));
System.out.println("Data = " + m.group(4));
}
Output:
Country = 123
Data = 9854875847
Alternative using non-capturing groups (?:), so you can use group(1) and group(2):
String regex = "^(?:(?:\\+|00)(\\d{1,3})[\\s-]?)?(\\d{10})$";
String str = "+123-9854875847";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
if (m.matches())
{
System.out.println("Country = " + m.group(1));
System.out.println("Data = " + m.group(2));
}

As long as the extension is always separated from the rest of the phone number, your regex will work fine. If there is no such separation, there is no way to correctly validate a phone number.
Also keep in mind that both extensions and phone numbers can vary in length from country to country, so there is no regex that will solve all cases. If you can produce a list of allowed extensions, you can work that into the regex and get better matches, but for many groups of arbitrary length of digits you will get many wrong matches.
I have simplified your regex a bit, so you can see @Dukeling's suggestions in practice. Your regex is on top, mine on the bottom.
^(([+]|[0]{2})([\\d]{1,3})([\\s-]{0,1}))?([\\d]{10})$
^( (\\+|00) \\d{1,3} [\\s-]?)? \\d{10} $
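To make the point about separation concrete, here is a small sketch comparing the simplified regex (separator optional) with a variant that requires a space or hyphen after the country code. The class name MobileNumberCheck, the constant names and the second pattern are assumptions for illustration, not something proposed in the answers above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answers.
class MobileNumberCheck {
    // The simplified regex from above: the separator after the country code is optional.
    static final Pattern OPTIONAL_SEPARATOR =
            Pattern.compile("^((\\+|00)\\d{1,3}[\\s-]?)?\\d{10}$");

    // Assumed variant: a space or hyphen must follow the country code.
    static final Pattern REQUIRED_SEPARATOR =
            Pattern.compile("^((\\+|00)\\d{1,3}[\\s-])?\\d{10}$");

    static boolean isValid(Pattern pattern, String number) {
        return pattern.matcher(number).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "+123-9854875847", "00123 9854875847", "+123 9854875847", "9878757845",
            "987875784",      // 9 digits, no prefix
            "+12398548758"    // 11 digits after the plus sign, no separator
        };
        for (String s : samples) {
            System.out.println(s + " optional=" + isValid(OPTIONAL_SEPARATOR, s)
                    + " required=" + isValid(REQUIRED_SEPARATOR, s));
        }
    }
}
```

With the optional separator, "+12398548758" is accepted because the regex can read it as country code 1 followed by ten digits; with the required separator it is rejected. The trade-off is that unseparated input such as "+919979045000" is then rejected as well.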

try {
String mobile_number="india number +919979045000\n" +
"india number 9979045000\n" +
"china number +86 591 2123654\n" +
"Brazil number +55 79 2012345\n" +
"it is test all string get mobile number all country"+
"Ezipt +20 10 1234567\n" +
"France +33 123456789\n" +
"Hong Kong +852 1234 5456\n" +
"Mexico +52 55 12345678"+
"thanks";
Pattern p = Pattern.compile("\\(?\\+[0-9]{1,3}\\)? ?-?[0-9]{1,3} ?-?[0-9]{3,5} ?-?[0-9]{5}( ?-?[0-9]{3})? ?(\\w{1,10}\\s?\\d{1,6})?");
List<String> numbers = new ArrayList<String>();
//mobile_number= mobile_number.replaceAll("\\-", "");
Matcher m = p.matcher("" + mobile_number);
while (m.find()) {
numbers.add(m.group());
}
p = Pattern.compile("\\(?\\+[0-9]{1,3}\\)? ?-?[0-9]{1,3} ?-?[0-9]{3,5} ?-?[0-9]{4}( ?-?[0-9]{3})? ?(\\w{1,10}\\s?\\d{1,6})?");
m = p.matcher("" + mobile_number);
while (m.find()) {
numbers.add(m.group());
}
p = Pattern.compile("((?:|\\+)([0-9]{5})(?: |\\-)(0\\d|\\([0-9]{5}\\)|[1-9]{0,5}))");
m = p.matcher("" + mobile_number);
while (m.find()) {
numbers.add(m.group());
}
p = Pattern.compile("[0-9]{10}|\\(?\\+[0-9]{1,3}\\)?-?[0-9]{3,5} ?-?[0-9]{4}?");
m = p.matcher("" + mobile_number);
while (m.find()) {
numbers.add(m.group());
}
String numberArray=numbers.toString();
System.out.print(""+numberArray);
// final result
/* [+919979045000, +86 591 2123654, +33 123456789, +52 55 12345678, +919979045000, +86 591 2123654, +55 79 2012345, +20 10 1234567, +33 123456789, +852 1234 5456, +52 55 12345678, +919979045000, 9979045000] */
} catch (Exception e) {
e.printStackTrace();
}

The best way is to take the input in two parts, i.e. country code and mobile number.
In that case you can easily validate both parts (country code and mobile number) with a regex.

Regex to capture a specific group that is repeated a number of times

Compare how you would accomplish the two tasks mentioned below with and without regular expressions. The problem:
The format for an SMS-based food delivery will be:
PABUSOG slash or comma repeated an infinite number of times #
// The quantity can only be numeric. For simplicity, assume that quantity is always an integer
e.g. PABUSOG STRFRY_SMAI/2 HSHBRWN_BRGR/1 COFEEFLT/1 #En311
it will capture the following:
STRFRY_SMAI - 2
HSHBRWN_BRGR - 1
COFEEFLT - 1
this is my sample code: // doing with regex
String message = "PABUSOG ASD_ASD/1 ASD_ASA/2";
Pattern pattern = Pattern.compile("PABUSOG(\\s+([A-Z]+_[A-Z]+)(/|,)([0-9]))+"
,Pattern.CASE_INSENSITIVE);
Matcher m = pattern.matcher(message);
try
{
if (m.matches())
{
String food = m.group(2);
String quantity = m.group(4);
System.out.println(food + " -- " + quantity);
}
}
catch (NullPointerException e)
{
}
It displays only ASD_ASA -- 2; the second item overrides the first one, ASD_ASD/1.
It must display:
ASD_ASD -- 1
ASD_ASA -- 2

You cannot get all the data from the groups of a single match, because a capturing group inside a repetition keeps only its last iteration. There is no great need for a complex regex either. If you still prefer a regex, search for the pattern iteratively:
if (!message.startsWith("PABUSOG")) {
return;
}
Pattern pattern = Pattern.compile("([A-Z_]+)[/,]([0-9]+)", Pattern.CASE_INSENSITIVE);
Matcher m = pattern.matcher(message);
while (m.find()) {
String food = m.group(1);
String quantity = m.group(2);
System.out.println(food + " -- " + quantity);
}
Without complex regex you can do the following by using String API:
// Check for correct header
if (!message.startsWith("PABUSOG")) {
return;
}
// split by whitespaces
String[] items = message.split("\\s+");
// skip header and iterate over remaining items
for (String item : Arrays.asList(items).subList(1, items.length)) {
// split each item by / or ,
String[] foodQuantity = item.split("[/,]");
assert foodQuantity.length == 2;
String food = foodQuantity[0];
String quantity = foodQuantity[1];
System.out.println(food + " -- " + quantity);
}
To skip items that start with #, you can either add
if (item.startsWith("#")) {
break; // or continue if it can be not the last
}
inside the loop, or limit the subList in the following way if you are sure that such an item is always present and terminates the sequence: Arrays.asList(items).subList(1, items.length - 1).
By the way, your pattern [A-Z]+_[A-Z]+ won't match COFEEFLT from your example.
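Putting the iterative approach together, a minimal sketch might look like the following. It assumes the quantity may have more than one digit, so the + sits inside the capture group, and it returns the items as a list instead of printing them. The class name OrderParser and the parse method are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class OrderParser {
    // Food code (letters and underscores), then "/" or ",", then a whole-number quantity.
    static final Pattern ITEM =
            Pattern.compile("([A-Z_]+)[/,]([0-9]+)", Pattern.CASE_INSENSITIVE);

    // Returns one "FOOD -- QUANTITY" entry per item, or an empty list if the header is missing.
    static List<String> parse(String message) {
        List<String> items = new ArrayList<>();
        if (!message.startsWith("PABUSOG")) {
            return items;
        }
        Matcher m = ITEM.matcher(message);
        // find() advances through the message, so each item gets its own match
        // and nothing is overwritten.
        while (m.find()) {
            items.add(m.group(1) + " -- " + m.group(2));
        }
        return items;
    }

    public static void main(String[] args) {
        for (String item : parse("PABUSOG STRFRY_SMAI/2 HSHBRWN_BRGR/1 COFEEFLT/1 #En311")) {
            System.out.println(item);
        }
    }
}
```

The header word and the trailing #En311 token are not followed by "/" or ",", so find() passes over them without extra checks.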

Regex not matching words delimited by whitespace

I have an input string that will follow the pattern /user/<id>?name=<name>, where <id> is alphanumeric but must start with a letter, and <name> is a letter-only string that can have multiple spaces. Some examples of matches would be:
/user/ad?name=a a
/user/one111?name=one ONE oNe
/user/hello?name=world
I came up with the following regex:
String regex = "/user/[a-zA-Z]+\\w*\\?name=[a-zA-Z\\s]+";
All of the above examples match the regex, but it only looks at the first word in <name>. Shouldn't the sequence \s allow me to have white spaces?
The code that I made to test what it is doing is:
String regex = "/user/[a-zA-Z]+\\w*\\?name=[a-zA-Z\\s]+";
// Check to see that input matches pattern
if(Pattern.matches(regex, str) == true){
str = str.replaceFirst("/user/", "");
str = str.replaceFirst("name=", "");
String[] tokens = str.split("\\?");
System.out.println("size = " + tokens.length);
System.out.println("tokens[0] = " + tokens[0]);
System.out.println("tokens[1] = " + tokens[1]);
} else
System.out.println("Didn't match.");
So for example, one test might look like:
/user/myID123?name=firstName LastName
size = 2
tokens[0] = myID123
tokens[1] = firstName
whereas the desired output would be
tokens[1] = firstName LastName
How can I change my regex to do this?

Not sure what you think is the problem in your code. tokens[1] will indeed contain firstName LastName in your example.
Here's an ideone.com demo showing this.
However, have you considered using capturing groups for the id and the name?
If you write it like this:
String regex = "/user/(\\w+)\\?name=([a-zA-Z\\s]+)";
Matcher m = Pattern.compile(regex).matcher(input);
you can get hold of myID123 and firstName LastName through m.group(1) and m.group(2) once m.matches() has returned true.
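A complete version of that idea could look like the sketch below. It assumes the id should keep the "must start with a letter" rule from the question, so the first group is ([a-zA-Z]\\w*) rather than (\\w+). The class name UserPath and the parse method are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class UserPath {
    // Group 1: id starting with a letter; group 2: name made of letters and whitespace.
    static final Pattern USER =
            Pattern.compile("/user/([a-zA-Z]\\w*)\\?name=([a-zA-Z\\s]+)");

    // Returns {id, name}, or null when the input does not match the whole pattern.
    static String[] parse(String input) {
        Matcher m = USER.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("/user/myID123?name=firstName LastName");
        if (parts != null) {
            System.out.println("id = " + parts[0]);
            System.out.println("name = " + parts[1]);
        }
    }
}
```

Reading the groups directly avoids the replaceFirst and split steps from the question.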

I don't find any fault in your code, but you could use capturing groups like this:
String str = "/user/myID123?name=firstName LastName ";
String regex = "/user/([a-zA-Z]+\\w*)\\?name=([a-zA-Z\\s]+)";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
if(m.find()) {
System.out.println(m.group(1) + ", " + m.group(2));
}

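Note that * is greedy by default (it matches as much as it can), and you can make it reluctant by adding a ?. Here \\w cannot match the literal ?, so the greedy and reluctant forms capture the same text; it is the capturing groups that make the id and the name available: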
List<String> str = Arrays.asList("/user/ad?name=a a", "/user/one111?name=one ONE oNe", "/user/hello?name=world");
String regex = "/user/([a-zA-Z]+\\w*?)\\?name=([a-zA-Z\\s]+)";
for (String s : str) {
Matcher matcher = Pattern.compile(regex).matcher(s);
if (matcher.matches()) {
System.out.println("user: " + matcher.group(1));
System.out.println("name: " + matcher.group(2));
}
}
Output:
user: ad
name: a a
user: one111
name: one ONE oNe
user: hello
name: world
