Parsing number information from an ingredient string using regex - java

I am trying to extract the quantity information from an ingredient string where the unit has already been removed.
175 risotto rice
a little hot vegetable stock (optional)
1 coriander
salt pepper
1 0.5 extra virgin olive oil
1 mild onion
300 split red lentils
1.7 well-flavoured vegetable stock
4 carrots
1 head celery
100 stilton cheese
4 snipped chives
salt pepper
225 dried flageolet beans
These are examples of the strings I am parsing, and the results should look like:
175
1
1 0.5
1
300
1.7
4
1
100
4
225
My current thinking is using [0-9]+[ ]*[0-9]*.?[0-9]* as the regex, however this is picking up the first character after the numerical values, for example 175 risotto rice is returning "175 r"

The problem is that the . in .? is not escaped, so it matches any single character rather than a literal dot. Escaping it as \.? should already give you the matching behavior you want, although the match may still include a trailing space.
Note that you can shorten [0-9] into \d:
^\d+\s*\d*\.?\d*
If you want to access each number separately, you need capture groups.
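As a small sketch of the difference, the following compares the original pattern with the escaped one on one of the sample strings. The class and method names are made up for illustration. Note that the escaped pattern still includes the space matched by \s*, so trimming the result is assumed here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String unescaped = "[0-9]+[ ]*[0-9]*.?[0-9]*"; // '.' matches any character
        String escaped = "^\\d+\\s*\\d*\\.?\\d*";      // '\.' matches a literal dot only
        String input = "175 risotto rice";

        System.out.println("[" + firstMatch(unescaped, input) + "]");      // [175 r]
        System.out.println("[" + firstMatch(escaped, input) + "]");        // [175 ]
        System.out.println("[" + firstMatch(escaped, input).trim() + "]"); // [175]
    }
}
```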

In your regex, .? matches an optional character (any character except a newline), which in your data is, for example, the r in risotto or the c in coriander.
You could use an anchor to assert the start of the string and match 1+ digits followed by an optional part that matches a dot and 1+ digits.
After that you could add the same pattern once more as an optional group with 1+ leading spaces or tabs:
^\d+(?:\.\d+)?(?:[ \t]+\d+(?:\.\d+)?)?
In Java
String regex = "^\\d+(?:\\.\\d+)?(?:[ \\t]+\\d+(?:\\.\\d+)?)?";
That will match
^ Start of the string
\d+(?:\.\d+)? Match 1+ digits followed by an optional part ? that matches a dot and 1+ digits
(?: Non-capturing group
[ \t]+\d+(?:\.\d+)? Match 1+ spaces or tabs, then 1+ digits, again followed by an optional part that matches a dot and 1+ digits
)? Close the non-capturing group and make it optional
Note that if you want to match the second pattern 0+ times instead of making it optional you could use * instead of ?
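A minimal sketch of how this could be applied to the sample lines from the question. It uses the pattern with the decimal part of the second number made optional, as the breakdown describes. The class name, the method name and the choice to return an empty string for lines without a quantity are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantityExtractor {
    // Leading number, optionally followed by spaces/tabs and a second number.
    private static final Pattern QUANTITY =
        Pattern.compile("^\\d+(?:\\.\\d+)?(?:[ \\t]+\\d+(?:\\.\\d+)?)?");

    // Returns the leading quantity, or an empty string when the line has none.
    static String extract(String ingredient) {
        Matcher m = QUANTITY.matcher(ingredient);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String[] lines = {
            "175 risotto rice",
            "a little hot vegetable stock (optional)",
            "1 0.5 extra virgin olive oil",
            "1.7 well-flavoured vegetable stock",
            "salt pepper"
        };
        for (String line : lines) {
            System.out.println("[" + extract(line) + "]");
        }
    }
}
```

Because the pattern never consumes trailing whitespace, the result needs no trimming.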

Related

Regex to get amount from sentence not working for one scenario

I am new to regex. I need to extract amount from sentence:
The watches are INR 2,550 Only
Kidswear under INR 399.59 Only
Cricket bat INR590 Only
I have created a regex which extracts the first two amounts and tried for 3rd one but still it is not working. Can someone please help.
My Regex - (?i)\\b(\\d+(?:[.,]\\d+)?)
The word boundary at the beginning prevents the INR590 from matching. But if you omit that word boundary, you would match digits at a lot more places.
For the example string, you could make the pattern a bit more specific instead and add a word boundary at the end of the pattern.
(?i)\bINR\h*(\d+(?:[.,]\d+)?)\b
In Java:
String regex = "(?i)\\bINR\\h*(\\d+(?:[.,]\\d+)?)\\b";
You could also for example assert that what is directly to the left is either a space or another allowed character:
(?<=\h|[A-Z])\d+(?:[.,]\d+)?\b
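As a rough sketch, the first pattern (the one that matches INR explicitly and captures the amount in group 1) could be used like this on the three sentences from the question. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InrAmountFinder {
    // Case-insensitive "INR", optional horizontal whitespace, amount in group 1.
    private static final Pattern INR =
        Pattern.compile("(?i)\\bINR\\h*(\\d+(?:[.,]\\d+)?)\\b");

    // Returns the first amount following "INR", or null if none is found.
    static String amount(String sentence) {
        Matcher m = INR.matcher(sentence);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(amount("The watches are INR 2,550 Only"));  // 2,550
        System.out.println(amount("Kidswear under INR 399.59 Only"));  // 399.59
        System.out.println(amount("Cricket bat INR590 Only"));         // 590
    }
}
```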
/\s*INR\s*(\d+[,.]{0,1}\d+)/g
\s* matches any number of whitespace characters.
INR matches the literal text "INR".
\d+ matches one or more digits.
[,.]{0,1}\d+ matches an optional comma or dot followed by one or more digits.
(\d+[,.]{0,1}\d+) is the capture group; it matches numbers such as "10", "10.1" or "10,1" (at least two digits in total).
Because it is in parentheses, you can read the amount from capture group 1.
For larger numbers written in this style:
1.000.000,00
replace (\d+[,.]{0,1}\d+) with (\d+(\.\d+)*([,.]{0,1}\d+))
The following regular expression matches representations of INR amounts that satisfy the conventional required format:
(?<=\bINR ?)(?:[1-9]\d{1,2}|\d)(?:,\d{3})*(?:\.\d{2})?(?![\d.,])
Java 8 regex demo
The link tests the following strings.
The watches are INR 2,550 Kidswear under INR 2,399.59 Cricket bat INR590
                    ^^^^^                    ^^^^^^^^                ^^^
Cement mixers are INR 25,34,128 B767 are INR 23,401,798,261.35
                                             ^^^^^^^^^^^^^^^^^
Cough drops are INR 3.241 Bubble gum is INR 01.23
The four matches are indicated by the party hats. 25,34,128 was rejected because there are other than three digits between each successive pair of commas, 3.241 was passed over because it has other than two digits to the right of the decimal point and 01.23 failed the cut because of the leading zero.
The regular expression can be broken down as follows.
(?<= # begin positive lookbehind
\bINR ? # match a word boundary, 'INR', then an optional space
) # end positive lookbehind
(?: # begin non-capture group
[1-9] # match a digit other than 0
\d{1,2} # match 1 or 2 digits
| # or
\d # match 1 digit
) # end non-capture group
(?: # begin non-capture group
,\d{3} # match ',' followed by 3 digits
)* # end non-capture group and execute it 0 or more times
(?: # begin non-capture group
\.\d{2} # match '.' then 2 digits
)? # end non-capture group and execute it 0 or 1 times
(?! # begin negative lookahead
[\d.,] # match a digit, '.' or ','
) # end negative lookahead
I don't know Java but was a bit surprised to find that the lookbehind ((?<=\bINR ?)) could contain an optional character (a space). If a version of Java is used that does not support that, the lookbehind could be replaced with the following:
(?:(?<=\bINR )|(?<=\bINR))
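For reference, here is a hedged Java sketch that runs the stricter pattern over the test strings listed above and collects every match. The class and method names are invented for the example. It relies on Java accepting the bounded-length lookbehind (?<=\bINR ?).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictInrMatcher {
    // Amount must follow "INR" (optionally with one space), use groups of
    // three digits between commas, and have either no decimals or exactly two.
    private static final Pattern AMOUNT = Pattern.compile(
        "(?<=\\bINR ?)(?:[1-9]\\d{1,2}|\\d)(?:,\\d{3})*(?:\\.\\d{2})?(?![\\d.,])");

    // Collects all amounts in the text that satisfy the strict format.
    static List<String> findAll(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = AMOUNT.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The watches are INR 2,550 Kidswear under INR 2,399.59 "
            + "Cricket bat INR590 Cement mixers are INR 25,34,128 "
            + "B767 are INR 23,401,798,261.35 Cough drops are INR 3.241 "
            + "Bubble gum is INR 01.23";
        // Prints [2,550, 2,399.59, 590, 23,401,798,261.35]
        System.out.println(findAll(text));
    }
}
```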

Regex help in android

I have two lines in Array list which contains number
line1 1234 5694 7487
line2 10/02/1992 or 1992
I used different regex to get both the line, but the problem is when I use the regex ([0-9]{4}//s?)([0-9]{4}//s?)([0-9]{4}//n) . It gets the first line cool.
But for checking the line2 I used ([0-9]{2}[/-])?([0-9]{2}[/-])?([0-9]{4}).
Instead of returning the last line, this regex returns the first 4 digits of line1.
As stated in the comments below you are using .matches which returns true if the whole string can be matched.
In your pattern ([0-9]{2}[/-])?([0-9]{2}[/-])?([0-9]{4}) it would also match only 4 digits as the first 2 groups ([0-9]{2}[/-])?([0-9]{2}[/-])? are optional due to the question mark ? leaving the 3rd group ([0-9]{4}) able to match 4 digits.
What you might do instead is use an alternation to match either a date-like format (two 2-digit parts, each followed by a delimiter, then 4 digits) or 3 groups of 4 digits.
.*?(?:(?:[0-9]{2}[/-]){2}[0-9]{4}|[0-9]{4}(?:\h[0-9]{4}){2}).*
Explanation
.*? Match any character except a newline non greedy
(?: Non-capturing group
(?:[0-9]{2}[/-]){2} Repeat 2 times matching 2 digits and / or -
[0-9]{4} Match 4 digits
| Or
[0-9]{4} Match 4 digits
(?:\h[0-9]{4}){2} Repeat 2 times matching a horizontal whitespace char and 4 digits
) Close non capturing group
.* Match 0+ times any character except a newline
For example
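import java.util.Arrays;
import java.util.List;

List<String> list = Arrays.asList(
    "10/02/1992 or 1992",
    "10/02/1992",
    "10/1992",
    "02/1992",
    "1992",
    "1234 5694 7487"
);
// Either a full dd/mm/yyyy-style date or three groups of 4 digits
String regex = ".*?(?:(?:[0-9]{2}[/-]){2}[0-9]{4}|[0-9]{4}(?:\\h[0-9]{4}){2}).*";
for (String str : list) {
    // matches() requires the whole string to match the pattern
    if (str.matches(regex)) {
        System.out.println(str);
    }
}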
Result
10/02/1992 or 1992
10/02/1992
1234 5694 7487
Note that in your first pattern I think you mean \\s instead of //s.
The \\s will also match a newline. If you want to match a single space you could just match that or use \\h to match a horizontal whitespace character.

Regex for currency formatting - java

I want to add the filter to my EditText with accepts different currency values like,
US currency format: 123,456.00
Spanish currency format: 123.456,00
Also, I want to keep maximum 10 digits before the decimal point and max 2 digits after the decimal.
My regex for filtering EditText value is (([0-9|(,.)]{0,13})?)?((,.)[0-9]{0,2})?
But this regex accepts values like ,,,,,,, or .......
How to change this regex which strictly accepts both currency format with the same pattern?
Any help is appreciated. Thank you in advance.
Your pattern can match repeated dots or repeated commas because every part is optional due to the question marks. It can also match an empty string.
You could use an alternation in which each thousands group starts with a comma or dot followed by 3 digits, with an optional decimal part of 1-2 digits, to prevent consecutive dots and commas:
^(?:(?![,0-9]{14})\d{1,3}(?:,\d{3})*(?:\.\d{1,2})?|(?![.0-9]{14})\d{1,3}(?:\.\d{3})*(?:\,\d{1,2})?)$
Explanation
^ Start of string
(?: Non capturing group
(?![,0-9]{14}) Negative lookahead, assert not repeating 14 times a comma or digit
\d{1,3}(?:,\d{3})*(?:\.\d{1,2})? Match 1-3 digits, repeat 0+ times matching a comma followed by 3 digits, optionally match a dot and 1-2 digits
| Or
(?![.0-9]{14}) Negative lookahead, assert not repeating 14 times a dot or digit
\d{1,3}(?:\.\d{3})*(?:\,\d{1,2})? Match 1-3 digits, repeat 0+ times matching a dot followed by 3 digits, optionally match a comma and 1-2 digits
) Close non capturing group
$ Assert end of string
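A short sketch of the pattern used as a whole-string validator in Java. The class and method names are hypothetical, and the sample values are only illustrations of the two formats from the question.

```java
import java.util.regex.Pattern;

class CurrencyFormatValidator {
    // First alternative: US style (comma groups, dot decimals).
    // Second alternative: Spanish style (dot groups, comma decimals).
    // The lookaheads reject inputs with more than 10 digits before the decimals.
    private static final Pattern CURRENCY = Pattern.compile(
        "^(?:(?![,0-9]{14})\\d{1,3}(?:,\\d{3})*(?:\\.\\d{1,2})?"
        + "|(?![.0-9]{14})\\d{1,3}(?:\\.\\d{3})*(?:,\\d{1,2})?)$");

    static boolean isValid(String input) {
        return CURRENCY.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123,456.00"));        // true
        System.out.println(isValid("123.456,00"));        // true
        System.out.println(isValid("1,234,567,890.12"));  // true  (10 digits)
        System.out.println(isValid("12,345,678,901.00")); // false (11 digits)
        System.out.println(isValid(",,,,,,,"));           // false
        System.out.println(isValid(""));                  // false
    }
}
```

Since this validates only complete values, an EditText filter that checks partially typed input would likely need a more permissive pattern; that part is outside this sketch.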
NumberFormat's getCurrencyInstance method has a Locale parameter. This is the standard way to handle your problem of formatting currencies.
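As an illustration of that approach, a minimal sketch using java.text.NumberFormat with two locales. The wrapper class and method are made up for the example. The exact output for non-US locales (symbol position, spacing characters) depends on the locale data shipped with the JDK, so only the US result is shown as a comment.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocaleCurrencyDemo {
    // Formats the amount using the currency conventions of the given locale.
    static String format(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    public static void main(String[] args) {
        double amount = 123456.0;
        System.out.println(format(amount, Locale.US)); // $123,456.00
        // Spanish conventions: dot as grouping separator, comma as decimal separator
        System.out.println(format(amount, Locale.forLanguageTag("es-ES")));
    }
}
```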

Regular Expression inner Digits check

I have the following regex but it fails (the inner digits with the points) :
([0-9]{1,3}\.?[0-9]{1,3}\.?[0-9]{1,3})
I want that it covers the following cases:
123 valid
123.4 valid
123.44 valid
123.445 valid
123.33.3 not ok (regex validates it as true)
123.3.3 not ok (regex validates it as true)
123.333.3 valid
123.333.34 valid
123.333.344 valid
Can you please help me?
You have multiple cases, so I would use the | (or) operator like this:
^([0-9]{1,3}|[0-9]{1,3}\.[0-9]{1,3}|[0-9]{1,3}\.[0-9]{3}\.[0-9]{1,3})$
^           ^                      ^                                 ^
you can check the regex demo
details
The regex matches three cases:
case 1
[0-9]{1,3}
this will match one to three digits
case 2
[0-9]{1,3}\.[0-9]{1,3}
this will match one to three digits followed by a dot, then one to three digits
case 3
[0-9]{1,3}\.[0-9]{3}\.[0-9]{1,3}
this will match one to three digits followed by a dot, then exactly three digits, then a dot, then one to three digits
Note that you can replace [0-9] with just \d, so your regex can be:
^(\d{1,3}|\d{1,3}\.\d{1,3}|\d{1,3}\.\d{3}\.\d{1,3})$
How about this one (demo at Regex101). It's pretty short and straightforward Regex:
(^\d{3}\.\d{3}\.\d{1,3}$)|(^\d{3}\.\d{1,3}$)|(^\d{3}$)
This recognizes three valid separate groups.
(^\d{3}\.\d{3}\.\d{1,3}$) as a group which must have 3 digits, a dot, 3 more digits, a dot and 1-3 digits.
(^\d{3}\.\d{1,3}$) as a group which must have 3 digits, a dot and 1-3 digits.
(^\d{3}$) as a group which must have exactly 3 digits.
These groups are separated by the or (|) operator.
However, since you have tagged java, why not let Java take some responsibility and help where regex isn't strong? I would rather match the format ((?:\d{1,3}\.?)+) and check programmatically whether the count of numbers is valid.
Use the following expression with .matches:
s.matches("\\d{1,3}(?:\\.\\d{3})?(?:\\.\\d{1,3})?")
See the regex demo
Details
^ - implicit, not necessary as the pattern is used in .matches that requires a full string match
\d{1,3} - 1 to 3 digits
(?:\.\d{3})? - an optional . and 3 digits
(?:\.\d{1,3})? - an optional sequence of . and 1 to 3 digits
$ - implicit, not necessary since the pattern is used in .matches that requires a full string match
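To check this against the cases listed in the question, a small sketch (the class and method names are hypothetical):

```java
class InnerDigitsCheck {
    // String.matches requires the whole input to match, so no anchors are needed.
    static boolean isValid(String s) {
        return s.matches("\\d{1,3}(?:\\.\\d{3})?(?:\\.\\d{1,3})?");
    }

    public static void main(String[] args) {
        String[] samples = {
            "123", "123.4", "123.44", "123.445",
            "123.33.3", "123.3.3",
            "123.333.3", "123.333.34", "123.333.344"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

With this pattern only 123.33.3 and 123.3.3 are rejected, which is what the question asks for.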

Regex to allow only 10 or 16 digit comma separated number

I want to validate a textfield in a Java based app where I want to allow only comma separated numbers and they should be either 10 or 16 digits. I have a regex that ^[0-9,;]+$ to allow only numbers, but it doesn't work for 10 or 16 digits only.
You can use {n,m} to specify length.
So matching one number with either 10 or 16 digits would be
^(\d{10}|\d{16})$
Meaning: match exactly 10 or 16 digits; the ^ before them is start-of-line and the $ after them is end-of-line.
Now add separator:
^((\d{10}|\d{16})[,;])*(\d{10}|\d{16})$
Zero or more sequences of 10 or 16 digits, each followed by either , or ;, and then one final sequence of 10 or 16 digits before end-of-line.
You need to escape the \ characters in Java.
public static void main(String[] args) {
    String regex = "^((\\d{10}|\\d{16})[,;])*(\\d{10}|\\d{16})$";
    String y = "0123456789,0123456789123456,0123456789";
    System.out.println(y.matches(regex)); // should be true
    String n = "0123456789,01234567891234567,0123456789";
    System.out.println(n.matches(regex)); // should be false
}
I would probably use this regex:
(\d{10}(?:\d{6})?,?)+
Explanation:
( - Begin capture group
\d{10} - Match exactly 10 digits
(?: - Begin non capture group
\d{6} - Match 6 more digits
)? - End group, mark as optional using ?
,? - optionally match a comma
)+ - End outer capture group, require 1 or more repetitions (maybe change to * for 0 or more)
The following inputs match this regex
1234567890123456,1234567890
1234567890123456
1234567890
these inputs do not match
123,1234567890
12355
123456789012
You need to have both anchors and word boundaries:
/^(?:\b(?:\d{10}|\d{16})\b,?)*$/
The anchors are necessary so you don't get false positives for partial matches and the word boundaries are necessary so you don't get false positives for 20, 26, 30, 32 digit numbers.
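To illustrate the point about word boundaries, here is a hedged Java sketch comparing this pattern (written without the /.../ delimiters) with the looser (\d{10}(?:\d{6})?,?)+ pattern from the previous answer. The class, field and method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

class DigitGroupValidator {
    // Anchored pattern with word boundaries around each 10- or 16-digit number.
    static final Pattern STRICT =
        Pattern.compile("^(?:\\b(?:\\d{10}|\\d{16})\\b,?)*$");
    // Looser pattern without word boundaries, for comparison.
    static final Pattern LOOSE =
        Pattern.compile("(\\d{10}(?:\\d{6})?,?)+");

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean loose(String s) {
        return LOOSE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String ok = "0123456789,0123456789123456";
        String twentyDigits = "01234567890123456789";
        System.out.println(strict(ok));           // true
        System.out.println(strict(twentyDigits)); // false
        System.out.println(loose(twentyDigits));  // true (read as 10 + 10 digits)
    }
}
```

Note that, because of the * quantifier and the optional comma, the strict pattern as written also accepts an empty string and a trailing comma; whether that is acceptable depends on the validation rules of the text field.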
Here is my version
(?:\d+,){9}\d+|(?:\d+,){15}\d+
Let's review it. First, there is no direct way to say "10 or 16", so I actually have to write 2 expressions with | between them.
Second, the expression itself. Your version just says that you allow digits and commas. However, this is not what you really want, because a string like ,,, will match your regex.
So the regex should be like (?:\d+,){n}\d+, which means: n groups of digits, each terminated by a comma, followed by one more group of digits, e.g. 123,45,678 for n = 2 (where 123,45, matches the first part and 678 matches the second part).
Finally we get regex that I have written in the beginning of my answer:
(?:\d+,){9}\d+|(?:\d+,){15}\d+
And do not forget that when you write a regex in your Java code you have to double the backslashes, like this:
Pattern.compile("(?:\\d+,){9}\\d+|(?:\\d+,){15}\\d+")
EDIT: I have just added non-capturing group (?: ...... )
