Regular expression to identify all numerics, across all localization formats

I'm scanning a text with a Scanner object, let's say lineScanner. Here are the declarations:
String myText= "200,00/100,00/28/65.36/21/458,696/25.125/4.23/6.3/4,2/659845/4524/456,65/45/23.495.254,3";
Scanner lineScanner = new Scanner(myText);
With that Scanner, I would like to find the first BigDecimal, then the second one, and so on. I declared a BIG_DECIMAL_PATTERN to match any case.
Here are the rules I defined:
The thousands separator is always followed by exactly 3 digits.
There are always exactly 1 or 2 digits after the decimal point.
If the thousands separator is the comma, then the decimal point is the dot, and vice versa.
The thousands separator is optional, as is the decimal part of the number.
String nextBigDecimal = lineScanner.findInLine(BIG_DECIMAL_PATTERN);
Now, here is the BIG_DECIMAL_PATTERN I declared:
private final String BIG_DECIMAL_PATTERN=
"\\d+(\\054\\d{3}+)?(\\056\\d{1,2}+)?|\\d+(\\056\\d{3}+)?(\\054\\d{1,2}+)?";
\\054 is the ASCII octal representation of ","
\\056 is the ASCII octal representation of "."
My problem is that it doesn't work well: once the first alternative matches, the second alternative (after the |) is not tried, so in my example
the first match will be 200 and not 200,00. So I can try this:
private final String BIG_DECIMAL_PATTERN = "\\d+([.,]\\d{3}+)?([,.]\\d{1,2}+)?";
But there is a new problem: comma and dot are not mutually exclusive. If one is the thousands separator, the decimal point should be the other one.
Thanks for helping.

I believe a variant of your 2nd RegEx will work for you. Consider this regex:
^\\d+(?:([.,])\\d{3})*(?:(?!\\1)[.,]\\d{1,2})?$
Live Demo: http://www.rubular.com/r/vHlEdBMhO9
Explanation: The pattern first captures the comma or dot used as the thousands separator in capture group 1. A negative lookahead, (?!\\1), then ensures that the same character does not appear as the decimal point. In other words, if a comma comes first then the decimal point must be a dot, and vice versa.
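A minimal Java sketch of the backreference idea, assuming each candidate has already been isolated (here by splitting the question's sample text on "/"). The class and method names are made up for illustration, and the ^ and $ anchors are dropped because matches() already requires the whole token to match.

```java
import java.util.regex.Pattern;

final class SeparatorCheck {
    // Group 1 remembers the thousands separator, if there is one.
    // (?!\1) then refuses to reuse that same character as the decimal separator.
    private static final Pattern NUMERIC =
            Pattern.compile("\\d+(?:([.,])\\d{3})*(?:(?!\\1)[.,]\\d{1,2})?");

    // Returns true when the whole token is a number under the question's rules.
    static boolean isNumeric(String token) {
        return NUMERIC.matcher(token).matches();
    }

    public static void main(String[] args) {
        String myText = "200,00/100,00/28/65.36/21/458,696/25.125/4.23/6.3/4,2/"
                + "659845/4524/456,65/45/23.495.254,3";
        for (String token : myText.split("/")) {
            System.out.println(token + " -> " + isNumeric(token));
        }
    }
}
```

When no thousands separator was captured, the backreference cannot match in Java, so the negative lookahead succeeds and either character is accepted as the decimal point.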

Could you do an either-or regular expression? E.g. something like:
private final String BIG_DECIMAL_PATTERN
= "\\d+((\\.\\d{3}+)?(,\\d{1,2}+)?|(,\\d{3}+)?(\\.\\d{1,2}+)?)";
Note - I haven't checked whether your regex actually works - and suspect this may not be the best way of achieving what you are trying to do. All I'm doing here to get you up and running is suggesting you could try using (regex1|regex2) where regex1 is dots followed by commas and regex2 is commas followed by dots.
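One caveat with a plain either-or pattern when searching with find(): the first alternative can succeed with both optional parts empty, so "65.36" would yield just "65". The sketch below is one possible way around that, not the answerer's exact pattern. It assumes the numbers are delimited by characters other than digits, dots and commas, adds lookarounds (my addition) so a match must cover the whole number, and uses * instead of ? so that several thousands groups are allowed. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberScanner {
    private static final Pattern BIG_DECIMAL = Pattern.compile(
            "(?<![\\d.,])\\d+(?:"                 // must not start in the middle of a number
            + "(?:\\.\\d{3})*(?:,\\d{1,2})?"      // dots for thousands, comma for decimals
            + "|(?:,\\d{3})*(?:\\.\\d{1,2})?"     // commas for thousands, dot for decimals
            + ")(?![\\d.,])");                    // must not stop in the middle of a number

    // Collects every number found in the text, in order of appearance.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = BIG_DECIMAL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String myText = "200,00/100,00/28/65.36/21/458,696/25.125/4.23/6.3/4,2/"
                + "659845/4524/456,65/45/23.495.254,3";
        findAll(myText).forEach(System.out::println);
    }
}
```

The trailing lookahead is what forces the engine to backtrack into the second alternative when the first one stops short of the end of the number.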

Regex for matching different float formats

I'm looking for a regex in Scala to match several floats:
9,487,346 -> should match
9.487.356,453 -> should match
38,4 -> match
-38,4 -> should match
-38.5
-9,487,346.76
-38 -> should match
So basically it should match a number that:
possibly has thousands separators (either comma or dot)
is possibly a decimal, again with either comma or dot as the separator
Currently I'm stuck with
val pattern="\\d+((\\.\\d{3}+)?(,\\d{1,2}+)?|(,\\d{3}+)?(\\.\\d{1,2}+)?)"
Edit: I'm mostly concerned with European notation.
Example where the current pattern does not match: 1,052,161
I guess it would be close enough to check that the string only contains digits, sign, comma and dot.

If, as your edit suggests, you are willing to accept a string that simply "contains numbers, sign, comma and dot" then the task is trivial.
[+-]?\d[\d.,]*
Update:
After thinking it over, and considering some options, I realize that your original request is possible if you'll allow for 2 different RE patterns, one for US-style numbers (commas before dot) and one for Euro-style numbers (dots before comma).
def isValidNum(num: String): Boolean =
num.matches("[+-]?\\d{1,3}(,\\d{3})*(\\.\\d+)?") ||
num.matches("[+-]?\\d{1,3}(\\.\\d{3})*(,\\d+)?")
Note that the thousand separators are not optional, so a number like "1234" is not evaluated as valid. That can be changed by adding more RE patterns: || num.matches("[+-]?\\d+")

Based on your rules,
It should match a number that:
possibly has thousands separators (either comma or dot)
is possibly a decimal, again with either comma or dot as the separator
Regex:
^[+-]?\d{1,3}(?:[,.]\d{3})*(?:[,.]\d+)?$
[+-]? Allows + or - or nothing at the start
\d{1,3} allows one to 3 digits
(?:[,.]\d{3})* allows . or , as the thousands separator followed by 3 digits (* allows any number of such groups)
(?:[,.]\d+)? allows . or , as decimal separator followed by at least one digit.
This matches all of the OP's example cases.
However one limitation is it allows . or , as thousand separator and as decimal separator and doesn't validate that if , is thousands separator then . should be decimal separator. As a result the below cases incorrectly show up as matches:
201,350,780,88
211.950.266.4
To fix this as well, the previous regex can have 2 alternatives - one to check for a notation that has , as thousands separator and . as decimal, and another one to check vice-versa. Regex:
^[+-]?\d{1,3}(?:(?:(?:\.\d{3})*(?:\,\d+)?)|(?:(?:\,\d{3})*(?:\.\d+)?))$
Hope this helps!
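To see the difference between the two patterns from this answer, here is a small Java sketch that runs both against a few inputs. The class and method names are illustrative only, and the strict pattern drops the backslash before the comma, which should be equivalent.

```java
import java.util.regex.Pattern;

final class SeparatorStrictness {
    // Either character may act as thousands or decimal separator.
    private static final Pattern LAX =
            Pattern.compile("^[+-]?\\d{1,3}(?:[,.]\\d{3})*(?:[,.]\\d+)?$");

    // Dots then an optional comma part, or commas then an optional dot part.
    private static final Pattern STRICT = Pattern.compile(
            "^[+-]?\\d{1,3}(?:(?:\\.\\d{3})*(?:,\\d+)?|(?:,\\d{3})*(?:\\.\\d+)?)$");

    static boolean lax(String s) {
        return LAX.matcher(s).matches();
    }

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"9.487.356,453", "-9,487,346.76", "201,350,780,88", "211.950.266.4"};
        for (String s : samples) {
            System.out.println(s + " lax=" + lax(s) + " strict=" + strict(s));
        }
    }
}
```

Both patterns still require groups of at most three leading digits, so an unseparated number such as 1234 is rejected by either of them.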

Regex + sign followed by numbers

Hi, I want to find strings like "+19" in Java,
i.e. a + sign followed by any number of digits.
How do I do this?
Tried "+[0123456789]"
and "\+[0123456789]"
Thank you :)

This is the regex you want to use:
\\+\\d+
Two kinds of plus are being used here. The first is escaped with two backslashes because it is treated as a literal. The second one is a quantifier meaning "match 1 or more times" (i.e. match any digit one or more times).
Code:
String input = "+19";
if (input.matches("\\+\\d+")) {
System.out.println("input string matches");
}

Yes, to match a plus you need to escape it with two backslashes in the C-style string literal that Java uses. A literal plus needs to be either escaped or put into a character class, [+]. If you just use a plus symbol, it becomes a quantifier that matches the previous symbol or group one or more times.
Also, note that the \d shorthand digit class can match more than just ASCII digits if Pattern.UNICODE_CHARACTER_CLASS flag is passed to Pattern.compile (or embedded (?U) flag is added at the start of the pattern). It is advised to use unambiguous patterns in case the code might be maintained or enhanced/adjusted by different developers later.
Most people prefer patterns without escaping backslashes if possible since that allows to avoid issues like the one you faced.
Here is a version of the regex that does not require any escaping:
"[+][0-9]+"
Also, the plus quantifier does not match an infinite number of digits, only up to an implementation-defined maximum number of repetitions (in OpenJDK this appears to be Integer.MAX_VALUE).
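The points above can be checked with a short Java sketch. The class and helper names are made up for illustration. It shows that an unescaped leading plus is rejected by Pattern.compile, that both escaped forms match "+19", and that \d only matches non-ASCII digits when Pattern.UNICODE_CHARACTER_CLASS is set.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class PlusSignDemo {
    // Returns false when the regex is rejected at compile time.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false; // e.g. a leading + is a quantifier with nothing to repeat
        }
    }

    // Matches a literal plus followed by digits, using the given Pattern flags.
    static boolean plusDigits(String input, int flags) {
        return Pattern.compile("\\+\\d+", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(compiles("+[0123456789]"));     // unescaped plus
        System.out.println("+19".matches("\\+\\d+"));      // escaped plus
        System.out.println("+19".matches("[+][0-9]+"));    // plus in a character class
        String arabicIndic = "+\u0661\u0669";              // "+19" written with Arabic-Indic digits
        System.out.println(plusDigits(arabicIndic, 0));
        System.out.println(plusDigits(arabicIndic, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

The asker's second attempt, "\+[0123456789]", cannot be shown here because \+ is not a valid escape in a Java string literal and the code would not compile.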

Which is the right regular expression to use for Numbers and Strings?

I am trying to create a simple IDE and am coloring my JTextPane based on
Strings (" ")
Comments (// and /* */)
Keywords (public, int ...)
Numbers (integers like 69 and floats like 1.5)
The way I color my source code is by overriding the insertString and removeString methods inside the StyledDocument.
After much testing, I have completed comments and keywords.
Q1: As for my Strings coloring, I color my strings based on this regular expression:
Pattern strings = Pattern.compile("\"[^\"]*\"");
Matcher matcherS = strings.matcher(text);
while (matcherS.find()) {
setCharacterAttributes(matcherS.start(), matcherS.end() - matcherS.start(), red, false);
}
This works 99% of the time, except when a string contains a backslash next to a quote, such as an escaped quote (\"). This messes up my whole color coding.
Can anyone correct my regular expression to fix my error?
Q2: As for Integers and Decimal coloring, numbers are detected based on this regular expression:
Pattern numbers = Pattern.compile("\\d+");
Matcher matcherN = numbers.matcher(text);
while (matcherN.find()) {
setCharacterAttributes(matcherN.start(), matcherN.end() - matcherN.start(), magenta, false);
}
By using the regular expression "\d+", I am only handling integers and not floats. Also, integers that are part of another string are matched, which is not what I want inside an IDE. Which is the correct expression to use for integer color coding?
Thank you for any help in advance!

For the strings, this is probably the fastest regex -
"\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\""
Formatted:
" [^"\\]*
(?: \\ . [^"\\]* )*
"
For integers and decimal numbers, the only foolproof expression I know of is
this -
"(?:\\d+(?:\\.\\d*)?|\\.\\d+)"
Formatted:
(?:
\d+
(?: \. \d* )?
| \. \d+
)
As a side note, If you're doing each independently from the start of
the string you could be possibly overlapping highlights.
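As a rough illustration of the string pattern above, the following Java sketch collects every string literal in a piece of source text, treating a backslash plus the following character as part of the literal. The class and method names are hypothetical, and the sample input is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StringLiteralFinder {
    // Opening quote, then runs of ordinary characters separated by
    // backslash escapes (backslash + any character), then the closing quote.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

    // Returns the text of each literal, including its surrounding quotes.
    static List<String> findLiterals(String source) {
        List<String> literals = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(source);
        while (m.find()) {
            literals.add(m.group());
        }
        return literals;
    }

    public static void main(String[] args) {
        // The source text contains: String s = "a \"quoted\" word"; int n = 42;
        String source = "String s = \"a \\\"quoted\\\" word\"; int n = 42;";
        findLiterals(source).forEach(System.out::println);
    }
}
```

In the highlighter, m.start() and m.end() would take the place of m.group() when calling setCharacterAttributes.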

Try with:
\\b\\d+(\\.\\d+)?\\b for int, float and double,
"(?<=[{(,=\\s+]+)".+?"(?=[,;)+ }]+)" for Strings,

For Integer go with
(?<!(\\^|\\d|\\.))[+-]?(\\d+(\\.\\d+)?)(?!(x|\\d|\\.))

Match a String ignoring the \" situations
".*?(?<!\\)"
The above will start a match once it sees a " and it will continue matching on anything until it gets to the next " which is not preceded by a \. This is achieved using the lookbehind feature explained very well at http://www.regular-expressions.info/lookaround.html
Match all numbers with & without decimal points
(\d+)(\.\d+)? will give you at least one digit, optionally followed by a point and one or more further digits.
The question of matching numbers inside strings can be handled in 2 ways:
a Modifying the above so that a non-word character must appear on either side, \W(\d+)(\.\d+)?\W (use \s instead of \W to require whitespace specifically), which I don't think will be satisfactory in mathematical situations (ie 10+10) or at the end of an expression (ie 10;).
b Making this a matter of precedence. If the String colouring is checked after the numbers then that part of the string will be coloured pink at first but then immediately overwritten with red. String colouring takes precedence.
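A variation on option b, sketched in Java: instead of coloring numbers and then overwriting them, record the string spans first and skip any number that falls inside one. This is my own reading of the precedence idea, not the answerer's code. The word boundaries (\b) around the number pattern are an added assumption so that digits inside identifiers such as x1 are ignored, and all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberHighlighter {
    // String pattern from this answer: a quote, then anything up to the
    // next quote that is not preceded by a backslash.
    private static final Pattern STRING = Pattern.compile("\".*?(?<!\\\\)\"");
    // Integers and decimals delimited by word boundaries (assumption).
    private static final Pattern NUMBER = Pattern.compile("\\b\\d+(?:\\.\\d+)?\\b");

    // Returns the numbers that are not located inside a string literal.
    static List<String> numbersOutsideStrings(String text) {
        List<int[]> stringSpans = new ArrayList<>();
        Matcher s = STRING.matcher(text);
        while (s.find()) {
            stringSpans.add(new int[] {s.start(), s.end()});
        }
        List<String> numbers = new ArrayList<>();
        Matcher n = NUMBER.matcher(text);
        while (n.find()) {
            boolean insideString = false;
            for (int[] span : stringSpans) {
                if (n.start() >= span[0] && n.end() <= span[1]) {
                    insideString = true;
                    break;
                }
            }
            if (!insideString) {
                numbers.add(n.group());
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        String code = "int x1 = 10+10; String s = \"v2.0 has 3 bugs\"; double d = 1.5;";
        System.out.println(numbersOutsideStrings(code));
    }
}
```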

R1: I believe there is no regex-based answer to non-escaped " characters in the middle of an ongoing string. You'd need to actively process the text to eliminate or circumvent the false-positives for characters that are not meant to be matched, based on your specific syntax rules (which you didn't specify).
However:
If you mean to simply ignore escaped ones, \", like Java does, then I believe you can simply include the escape+quote pair in the center as a group, and the greedy * will take care of the rest:
\"((\\\\\")|[^\"])*\"
R2: I believe the following regex would work for finding both integers and fractions:
\\d+(\\.\\d+)?
You can expand it to find other kinds of numerals too. For example, \\d+([./]\\d+)? would additionally match numerals like "1/4".

How can I express such requirement using Java regular expression?

I need to check that a file contains some amounts that match a specific format:
between 1 and 15 characters (numbers or ",")
may contain at most one "," separator for decimals
must at least have one number before the separator
this amount is supposed to be in the middle of a string, bounded by alphabetical characters (but we have to exclude the malformed files).
I currently have this:
\d{1,15}(,\d{1,14})?
But it does not match with the requirement as I might catch up to 30 characters here.
Unfortunately, for some reasons that are too long to explain here, I cannot simply pick a substring or use any other java call. The match has to be in a single, java-compatible, regular expression.

^(?=.{1,15}$)\d+(,\d+)?$
^ start of the string
(?=.{1,15}$) positive lookahead to make sure that the total length of string is between 1 and 15
\d+ one or more digit(s)
(,\d+)? optionally followed by a comma and more digits
$ end of the string (still required when searching with find(): the lookahead only checks the total length, not that the digits and comma run all the way to the end).
You might have to escape backslashes for Java: ^(?=.{1,15}$)\\d+(,\\d+)?$
update: If you're looking for this in the middle of another string, use word boundaries \b instead of string boundaries (^ and $).
\b(?=[\d,]{1,15}\b)\d+(,\d+)?\b
For java:
"\\b(?=[\\d,]{1,15}\\b)\\d+(,\\d+)?\\b"
More readable version:
"\\b(?=[0-9,]{1,15}\\b)[0-9]+(,[0-9]+)?\\b"

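One thing to watch with the \b versions: letters and digits are both word characters, so there is no word boundary between "c" and "1" in "abc123,45def", and an amount directly bounded by letters would not be found. The Java sketch below is one possible alternative under that assumption. It replaces \b with lookarounds on the character class [0-9,]. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AmountFinder {
    private static final Pattern AMOUNT = Pattern.compile(
            "(?<![0-9,])"                      // not preceded by a digit or comma
            + "(?=[0-9,]{1,15}(?![0-9,]))"     // the whole run is 1 to 15 characters long
            + "[0-9]+(,[0-9]+)?"               // digits, optionally a comma and more digits
            + "(?![0-9,])");                   // not followed by a digit or comma

    // Returns every well-formed amount found in the text.
    static List<String> findAmounts(String text) {
        List<String> amounts = new ArrayList<>();
        Matcher m = AMOUNT.matcher(text);
        while (m.find()) {
            amounts.add(m.group());
        }
        return amounts;
    }

    public static void main(String[] args) {
        System.out.println(findAmounts("abc123,45def"));           // well-formed amount
        System.out.println(findAmounts("abc1234567890123456def")); // 16 digits, too long
        System.out.println(findAmounts("abc1,2,3def"));            // two commas
    }
}
```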
Java: how to parse double from regex

I have a string that looks like "A=1.23;B=2.345;C=3.567"
I am only interested in "C=3.567"
what i have so far is:
Matcher m = Pattern.compile("C=\\d+.\\d+").matcher("A=1.23;B=2.345;C=3.567");
while(m.find()){
double d = Double.parseDouble(m.group());
System.out.println(d);
}
The problem is that it shows the 3 as separate from the 567.
output:
3.0
567.0
I am wondering how I can include the decimal so it outputs "3.567".
EDIT: I would also like to match C if it does not have a decimal point,
so I would like to capture 3567 as well as 3.567.
Since the C= is built into the pattern as well, how can I strip it out before parsing the double?

I may be mistaken on this part, but the reason it's separating the two is that group() returns only the most recently matched subsequence, which is whatever was matched by the latest call to find(). Thanks, Mark Byers.
For sure, though, you can solve this by placing the entire part you want inside a "capturing group", which is done by placing it in parentheses. This makes it so that you can group together matched parts of your regular expression into one substring. Your pattern would then look like:
Pattern.compile("C=(\\d+\\.\\d+)")
To parse either 3567 or 3.567, your pattern would be C=(\\d+(\\.\\d+)?) with group 1 representing the whole number. Also, do note that since you specifically want to match a period, you want to escape your . (period) character so that it's not interpreted as the "any-character" token. For this input, though, it doesn't matter.
Then, to get your 3.567, you would call m.group(1) to grab the first (counting from 1) specified group. This would mean that your Double.parseDouble call would essentially become Double.parseDouble("3.567").
As for taking C= out of your pattern, since I'm not that well-versed with RegExp, I might recommend that you split your input string on the semi-colons and then check to see if each of the splits contain the C; then you could apply the pattern (with the capturing groups) to get the 3.567 from your Matcher.
Edit For the more general (and likely more useful!) cases in gawi's comment, please use the following (from http://www.regular-expressions.info/floatingpoint.html)
Pattern.compile("[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?")
This has support for optional sign, either optional integer or optional decimal parts, and optional positive/negative exponents. Insert capturing groups where desired to pick out parts individually. The exponent as a whole is in its own group to make it, as a whole, optional.
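Putting the capturing-group approach together, a minimal Java sketch might look like this. The class and method names are made up, and returning an OptionalDouble is just one way to signal "no C value found".

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CValueParser {
    // "C=" followed by digits and an optional fractional part.
    // Group 1 holds only the number, so "C=" never reaches parseDouble.
    private static final Pattern C_VALUE = Pattern.compile("C=(\\d+(?:\\.\\d+)?)");

    // Returns the value after "C=", or an empty OptionalDouble when absent.
    static OptionalDouble parseC(String input) {
        Matcher m = C_VALUE.matcher(input);
        if (m.find()) {
            return OptionalDouble.of(Double.parseDouble(m.group(1)));
        }
        return OptionalDouble.empty();
    }

    public static void main(String[] args) {
        System.out.println(parseC("A=1.23;B=2.345;C=3.567"));
        System.out.println(parseC("A=1.23;B=2.345;C=3567"));
        System.out.println(parseC("A=1.23;B=2.345"));
    }
}
```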

Your regular expression is only matching numeric characters. To also match the decimal point too you will need:
Pattern.compile("\\d+\\.\\d+")
The . is escaped because this would match any character when unescaped.
Note: this will then only match numbers with a decimal point which is what you have in your example.

To match any sequence of digits and dots you can change the regular expression to this:
"(?<=C=)[.\\d]+"
If you want to be certain that there is only a single dot you might want to try something like this:
"(?<=C=)\\d+(?:\\.\\d+)?"
You should also be aware that this pattern can match the 1.2 in ABC=1.2.3;. You should consider if you need to improve the regular expression to correctly handle this situation.

If you need to validate decimals with dots, commas, positives and negatives:
Object testObject = "-1.5";
boolean isDecimal = Pattern.matches("^[\\+\\-]{0,1}[0-9]+[\\.\\,][0-9]+$", (CharSequence) testObject);
Good luck.

If you want a regex for an input which might be a double or just an integer without any trailing .0, you can use this: Pattern.compile("(-?\\d+\\.?\\d*)")
