Extracting numbers from a String in Java by splitting on a regex - java

I want to extract numbers from Strings like this:
String numbers[] = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34".split(PATTERN);
From such String I'd like to extract these numbers:
0.286
-3.099
-0.44
-2.901
-0.436
123
0.123
.34
That is:
There can be garbage characters like "M", "c", ","
The "-" sign is part of the number, not something to split on
A "number" can be anything that Float.parseFloat can parse, so .34 is valid
What I have so far:
String PATTERN = "([^\\d.-]+)|(?=-)";
Which works to some degree, but obviously far from perfect:
Doesn't skip the starting garbage "M" in the example
Doesn't handle consecutive garbage, like the ,,, in the middle
How to fix PATTERN to make it work?

You could use a regex like this:
([-.]?\d+(?:\.\d+)?)
Working demo
Match Information:
MATCH 1
1. [1-6] `0.286`
MATCH 2
1. [6-12] `-3.099`
MATCH 3
1. [12-17] `-0.44`
MATCH 4
1. [18-24] `-2.901`
MATCH 5
1. [25-31] `-0.436`
MATCH 6
1. [34-37] `123`
MATCH 7
1. [38-43] `0.123`
MATCH 8
1. [44-47] `.34`
Update
Jawee's approach
As Jawee pointed out in his comment, there is a problem with input like .34.34, so you can use his regex, which fixes it. Thanks to Jawee for pointing that out.
(-?(?:\d+)?\.?\d+)
To get a graphical idea of what happens behind this regex, you can check it in Debuggex.
Engine explanation:
1st Capturing group (-?(?:\d+)?\.?\d+)
-? -> matches the character - literally, zero or one time
(?:\d+)? -> \d+ matches a digit [0-9] one or more times (inside an optional non-capturing group)
\.? -> matches the character . literally, zero or one time
\d+ -> matches a digit [0-9] one or more times

Try this one (-?(?:\d+)?\.?\d+)
Example as below:
Demo Here
Thanks a lot to nhahtdh for the comments. That's true; the regex can be updated as below:
[-+]?(?:\d+(?:\.\d*)?|\.\d+)
Updated Demo Here
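For reference, the updated regex above can be applied with a Matcher loop instead of split. This is only a minimal sketch: the class and method names (FloatExtractor, extract) are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FloatExtractor {
    // The updated regex from this answer: an optional sign, then either digits
    // with an optional fraction, or a fraction with no leading digits.
    private static final Pattern NUMBER =
            Pattern.compile("[-+]?(?:\\d+(?:\\.\\d*)?|\\.\\d+)");

    // Hypothetical helper: collects every match in order of appearance.
    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [0.286, -3.099, -0.44, -2.901, -0.436, 123, 0.123, .34]
        System.out.println(extract("M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"));
        // prints [.34, .34]
        System.out.println(extract(".34.34"));
    }
}
```

Every string collected this way is accepted by Float.parseFloat, since the pattern only admits an optional sign, digits, and at most one decimal point.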
Actually, if we consider every String format that Float.parseFloat accepts (e.g. Infinity, -Infinity, 00, 0xffp23d, 88F), it gets a little more complicated. However, it can still be implemented, as in the Java code below:
String sign = "[-+]?";
String hexFloat = "(?>0[xX](((\\p{XDigit}+)\\.?)|((\\p{XDigit}*)\\.(\\p{XDigit}+)))[pP]([-+])?(\\p{Digit}+)[fFdD]?)";
String nan = "(?>NaN)";
String inf = "(?>Infinity)";
String dig = "(?>\\d+(?:\\.\\d*)?|\\.\\d+)";
String exp = "(?:[eE][-+]?\\d+)?";
String suf = "[fFdD]?";
String digFloat = "(?>" + dig + exp + suf + ")";
String wholeFloat = sign + "(?>" + hexFloat + "|" + nan + "|" + inf + "|" + digFloat + ")";
String s = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123d,.34d.34.34M24.NaNNaN,Infinity,-Infinity00,0xffp23d,88F";
Pattern floatPattern = Pattern.compile(wholeFloat);
Matcher matcher = floatPattern.matcher(s);
int i = 0;
while (matcher.find()) {
String f = matcher.group();
System.out.println(i++ + " : " + f + " --- " + Float.parseFloat(f) );
}
Then the output is as below:
0 : 0.286 --- 0.286
1 : -3.099 --- -3.099
2 : -0.44 --- -0.44
3 : -2.901 --- -2.901
4 : -0.436 --- -0.436
5 : 123 --- 123.0
6 : 0.123d --- 0.123
7 : .34d --- 0.34
8 : .34 --- 0.34
9 : .34 --- 0.34
10 : 24. --- 24.0
11 : NaN --- NaN
12 : NaN --- NaN
13 : Infinity --- Infinity
14 : -Infinity --- -Infinity
15 : 00 --- 0.0
16 : 0xffp23d --- 2.13909504E9
17 : 88F --- 88.0

You can do it in one line (but with one less step than aioobe's answer!):
String[] numbers = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"
.replaceAll("^[^.\\d-]+|[^.\\d-]+$", "") // remove junk from start/end
.split("[^.\\d-]+"); // split on anything not part of a number
Although fewer calls are made, aioobe's answer is easier to read and understand, which makes his the better code.
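One caveat with the split above: because "-" is excluded from the delimiter class, a run such as 0.286-3.099-0.44 stays together as a single element. A hedged variant that also splits at the empty position between a digit and a following "-" is sketched below. The lookaround alternative and the names SplitNumbers and splitNumbers are assumptions added here, not part of the original answer, and the sketch assumes a minus sign that starts a new number always directly follows a digit or a junk character.

```java
import java.util.Arrays;

final class SplitNumbers {
    // Hypothetical helper: strips leading/trailing junk, then splits on junk
    // runs or on the zero-width gap between a digit and a following '-'.
    static String[] splitNumbers(String input) {
        return input
                .replaceAll("^[^.\\d-]+|[^.\\d-]+$", "")   // remove junk from start/end
                .split("[^.\\d-]+|(?<=\\d)(?=-)");          // junk, or digit-then-minus boundary
    }

    public static void main(String[] args) {
        // prints [0.286, -3.099, -0.44, -2.901, -0.436, 123, 0.123, .34]
        System.out.println(Arrays.toString(
                splitNumbers("M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34")));
    }
}
```

The lookbehind keeps the zero-width alternative from firing right after a junk character, which would otherwise put empty strings into the result.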

Using the regex you crafted yourself you can solve it as follows:
String[] numbers = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"
.replaceAll(PATTERN, " ")
.trim()
.split(" +");
On the other hand, if I were you, I'd do the loop instead:
Matcher m = Pattern.compile("[.-]?\\d+(\\.\\d+)?").matcher(input);
List<String> matches = new ArrayList<>();
while (m.find())
matches.add(m.group());

I think this is exactly what you want:
String pattern = "[-+]?[0-9]*\\.?[0-9]+";
String line = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
List<String> numbers=new ArrayList<String>();
while(m.find()) {
numbers.add(m.group());
}

It's nice you put a bounty on this.
Unfortunately, as you probably already know, this can't be done using
Java's String split method directly.
If it can't be done directly, there is no reason to kludge it, because it is, well .. a kludge.
The reasons are many, some related, some not.
To start off, you need to define a good regex as a base.
This is the only regex I know of that will validate and extract a proper form:
# "((?=[+-]?\\d*\\.?\\d)[+-]?\\d*\\.?\\d*)"
( # (1 start)
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
) # (1 end)
So, looking at this base regex, it's clear you want the form that it matches.
In the case of split, you don't want the form that this matches, because that's
where you want the breaks to be.
As I look at Java's split, I see that no matter what it matches, it will be excluded
from the resulting array.
So, presuming split usage, the first thing to match (and consume) is all the stuff that is not
this. That part will be something like this:
(?:
(?!
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
.
)+
Since the only thing left is valid decimal numbers, the next break will be somewhere
between valid numbers. This part, added to the first part, will be something like this:
(?:
(?!
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
.
)+
| # or,
(?<=
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
(?=
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
And all of a sudden, we have a problem: a variable-length lookbehind assertion.
So, it's game over for the whole thing.
Lastly and unfortunately, Java does not (as far as I can see) have a provision to include capture
group contents (matched in the regex) as an element in the resulting array.
Perl does, but I can't find that ability in Java.
If Java had that provision, the break sub expressions could be combined to do a seamless split.
Like this:
(?:
(?!
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)
.
)*
(
(?= [+-]? \d* \.? \d )
[+-]? \d* \.? \d*
)

Related

Match all occurrences Regex Java

I'd like to recognize all "word-number-word" sequences in a string with the Java regex API.
For example, if I have "ABC-122-JDHFHG-456-MKJD", I'd like the output: [ABC-122-JDHFHG, JDHFHG-456-MKJD].
String test = "ABC-122-JDHFHG-456-MKJD";
Matcher m = Pattern.compile("(([A-Z]+)-([0-9]+)-([A-Z]+))+")
.matcher(test);
while (m.find()) {
System.out.println(m.group());
}
The code above returns only "ABC-122-JDHFHG".
Any ideas?
The last ([A-Z]+) matches and consumes JDHFHG, so the regex engine only "sees" -456-MKJD after the first match, and the pattern does not match this string remainder.
You want to get "whole word" overlapping matches.
Use
String test = "ABC-122-JDHFHG-456-MKJD";
Matcher m = Pattern.compile("(?=\\b([A-Z]+-[0-9]+-[A-Z]+)\\b)")
.matcher(test);
while (m.find()) {
System.out.println(m.group(1));
} // prints ABC-122-JDHFHG and then JDHFHG-456-MKJD, one per line
See the Java demo
Pattern details
(?= - start of a positive lookahead that matches a position that is immediately followed with
\\b - a word boundary
( - start of a capturing group (to be able to grab the value you need)
[A-Z]+ - 1+ ASCII uppercase letters
- - a hyphen
[0-9]+ - 1+ digits
- - a hyphen
[A-Z]+ - 1+ ASCII uppercase letters
) - end of the capturing group
\\b - a word boundary
) - end of the lookahead construct.
Here you go, overlap the last word.
Make an array out of capture group 1.
Basically, find 3 consume 2. This makes the next match position start
on the next possible known word.
(?=(([A-Z]+-\d+-)[A-Z]+))\2
https://regex101.com/r/Sl5FgT/1
Formatted
(?= # Assert to find
( # (1 start), word,num,word
( # (2 start), word,num
[A-Z]+
-
\d+
-
) # (2 end)
[A-Z]+
) # (1 end)
)
\2 # Consume word,num
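The "find 3, consume 2" regex above could be used from Java roughly as follows. This is a sketch only; the class and method names (OverlappingMatches, find) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverlappingMatches {
    // Lookahead captures word-num-word in group 1 and word-num- in group 2;
    // the backreference \2 then consumes only word-num-, so the next search
    // starts on the trailing word.
    private static final Pattern WORD_NUM_WORD =
            Pattern.compile("(?=(([A-Z]+-\\d+-)[A-Z]+))\\2");

    // Hypothetical helper: builds a list out of capture group 1.
    static List<String> find(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD_NUM_WORD.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [ABC-122-JDHFHG, JDHFHG-456-MKJD]
        System.out.println(find("ABC-122-JDHFHG-456-MKJD"));
    }
}
```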

Split String using Pattern and Matcher until first occurrence of ','

I want to split the below string in three parts
(1) Number
(2) String until first occurrence of ','
(3) Rest of the string
Like if the string is "12345 - electricity, flat no 1106 , Palash H , Pune"
Three parts should be
(1) 12345
(2) electricity
(3) flat no 1106 , Palash H , Pune
I am able to split it into 12345 and the rest of the string using the code below, but I am not able to separate the 2nd and 3rd parts as required.
Map<String, String> strParts= new HashMap<String, String>();
String text = "12345 - electricity, flat no 1106 , Palash 2E , Pune";
Pattern pttrnCrs = Pattern.compile("(.*)\\s\\W\\s(.*)");
Matcher matcher = pttrnCrs.matcher(text);
if (matcher.matches()) {
strParts.put("NUM", matcher.group(1));
strParts.put("REST", matcher.group(2));
}
Can any one help ?
You need to use a regex with 3 capturing groups:
^(\d+)\W*([^,]+)\h*,\h*(.*)$
RegEx Demo
In Java use:
final String regex = "(\\d+)\\W*([^,]+)\\h*,\\h*(.*)";
No need to use anchors in Java if you are using Matcher#matches() method that implicitly anchors the regex.
RegEx Breakup:
^ # start
(\d+) # match and group 1+ digits in group #1
\W* # match 0 or more non-word characters
([^,]+) # Match and group 1+ character that are not comma in group #2
\h*,\h* # Match comma surrounded by optional whitespaces
(.*) # match and group remaining characters in string in group #3
$ # end
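Put together, the three-group regex might be used like this. This is a minimal sketch: the class name, the parse helper and the map keys (NUM, NAME, REST) are illustrative assumptions, and \h requires Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitOnFirstComma {
    // No ^ or $ needed: Matcher#matches() anchors the pattern implicitly.
    private static final Pattern PARTS =
            Pattern.compile("(\\d+)\\W*([^,]+)\\h*,\\h*(.*)");

    // Hypothetical helper: returns an empty map when the text does not match.
    static Map<String, String> parse(String text) {
        Map<String, String> parts = new LinkedHashMap<>();
        Matcher m = PARTS.matcher(text);
        if (m.matches()) {
            parts.put("NUM", m.group(1));   // leading digits
            parts.put("NAME", m.group(2));  // text up to the first comma
            parts.put("REST", m.group(3));  // everything after that comma
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints {NUM=12345, NAME=electricity, REST=flat no 1106 , Palash H , Pune}
        System.out.println(parse("12345 - electricity, flat no 1106 , Palash H , Pune"));
    }
}
```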

Java: Extracting a specific REGEXP pattern out of a string

How is it possible to extract only a time part of the form XX:YY out of a string?
For example - from a string like:
sdhgjhdgsjdf12:34knvxjkvndf, I would like to extract only 12:34.
( The surrounding chars can be spaces too of course )
Of course I can find the colon and take the two chars before it and the two chars after it, but that is ugly.
You can use this look-around based regex for your match:
(?<!\d)\d{2}:\d{2}(?!\d)
RegEx Demo
In Java:
Pattern p = Pattern.compile("(?<!\\d)\\d{2}:\\d{2}(?!\\d)");
RegEx Breakup:
(?<!\d) # negative lookbehind to assert previous char is not a digit
\d{2} # match exact 2 digits
: # match a colon
\d{2} # match exact 2 digits
(?!\d) # negative lookahead to assert next char is not a digit
Full Code:
Pattern p = Pattern.compile("(?<!\\d)\\d{2}:\\d{2}(?!\\d)");
Matcher m = p.matcher(inputString);
if (m.find()) {
System.err.println("Time: " + m.group());
}

Matching a whitespace or empty string using regex in Java

I have this regex in java
String pattern = "(\\s)(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})(\\s)";
It works as intended but I have a new problem to get some valid dates:
1st problem:
If I have this String: It was at 22-febrero-1999 and 10-enero-2009 and 01-diciembre-2000, I should get another String, febrero-enero-diciembre, but I only get febrero-enero.
2nd problem:
If I have a single date in a String, like 12-octubre-1989, I get an empty String.
Why does my pattern require whitespace at the start and end of each date? Because I must catch only valid months: in a String like adsadasd 12-validMonth-2999 asd 11-validMonth-1989 I should get both validMonth values, but I should never get a validMonth from text like asdadsad12-validMonth-1989. In asdadsad12-validMonth-1989 asdadsad 23-validMonth-1989 I should get only the last validMonth.
PS: My Java code is
String resultado = "";
String pattern = "(\\s)(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})(\\s)";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(fecha);
while (m.find()) {
resultado += m.group().split("-")[1] + "-";
}
return (resultado.compareTo("") == 0 ? "" : resultado.substring(0, resultado.length() - 1));
You might want to use a word boundary instead:
\\b(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})\\b
And I believe some of the months can be optimized a little bit (it could reduce readability unfortunately, but should speed things up by a notch):
\\b(\\d{2}-)((?:en|febr)ero|ma(?:rz|y)o|abril|ju[ln]io|agosto|(?:septiem|octu|noviem|diciem)bre)(-\\d{4})\\b
Perhaps try using a \b instead of \s:
String pattern = "\\b(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})\\b";
This will only match strings where the first digit is not preceded by another word character (digit, letter, or underscore), and the last digit is not followed by a word character. I've also removed the capturing groups around the \b, because it would always be a zero-length string, if matched.
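As a rough illustration of the word-boundary approach, the sketch below collects the month names and joins them with "-". The class and method names (MonthExtractor, months) are made up, and only the month is captured here; the extra groups of the original pattern are dropped for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MonthExtractor {
    // \b replaces the leading and trailing (\s), so a date at the very start
    // or end of the input still matches, while asdadsad12-... does not.
    private static final Pattern DATE = Pattern.compile(
            "\\b\\d{2}-(enero|febrero|marzo|abril|mayo|junio|julio|agosto"
            + "|septiembre|octubre|noviembre|diciembre)-\\d{4}\\b");

    // Hypothetical helper: returns the months joined by '-', or "" if none.
    static String months(String fecha) {
        List<String> found = new ArrayList<>();
        Matcher m = DATE.matcher(fecha);
        while (m.find()) {
            found.add(m.group(1));
        }
        return String.join("-", found);
    }

    public static void main(String[] args) {
        // prints febrero-enero-diciembre
        System.out.println(months(
                "It was at 22-febrero-1999 and 10-enero-2009 and 01-diciembre-2000"));
        // prints octubre
        System.out.println(months("12-octubre-1989"));
    }
}
```

Joining a list avoids the trailing "-" that the string concatenation in the question has to trim afterwards.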
I wouldn't use a word boundary as a delimiter.
I'd suggest using either whitespace or NOT digit,
or no delimiter at all, plus a validation range of numbers for day/year.
This way you may catch more embedded dates that are in close
proximity (adjacent) to letters and underscores.
Something like:
# "(?<!\\d)\\d{2}-(?:enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)-\\d{4}(?!\\d)"
(?<! \d ) # Not a digit before us
\d{2} - # Two digits followed by dash
(?: # A month
enero
| febrero
| marzo
| abril
| mayo
| junio
| julio
| agosto
| septiembre
| octubre
| noviembre
| diciembre
)
- \d{4} # Dash followed by four digits
(?! \d ) # Not a digit after us

extracting long/float numbers from a string in java

I have input like this ==>
2 book at 12.99
4 potato chips at 3.99
I want to extract the numeric values from each line and store them in variables
For example, in the line 2 book at 12.99 I want to extract Quantity = 2 and Price = 12.99
from the given string
You can use:
Pattern p = Pattern.compile("(\\d+)\\D+(\\d+(?:\\.\\d+)?)"); // dot escaped so it matches only a literal decimal point
Matcher mr = p.matcher("4 potato chips at 3.99");
if (mr.find()) {
System.out.println( mr.group(1) + " :: " + mr.group(2) );
}
OUTPUT:
4 :: 3.99
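To store the two values in typed variables rather than printing them, something along these lines could work. It is a sketch under assumptions: the LineItemParser and Item names are invented, the dot in the pattern is escaped so it matches only a literal decimal point, and BigDecimal is chosen for the price to avoid floating-point rounding.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineItemParser {
    // Group 1: quantity (leading integer). Group 2: price (integer with an
    // optional fractional part), found after a run of non-digits.
    private static final Pattern LINE =
            Pattern.compile("(\\d+)\\D+(\\d+(?:\\.\\d+)?)");

    // Hypothetical value holder for one parsed line.
    static final class Item {
        final int quantity;
        final BigDecimal price;

        Item(int quantity, BigDecimal price) {
            this.quantity = quantity;
            this.price = price;
        }
    }

    static Item parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("No quantity/price in: " + line);
        }
        return new Item(Integer.parseInt(m.group(1)), new BigDecimal(m.group(2)));
    }

    public static void main(String[] args) {
        Item item = parse("2 book at 12.99");
        // prints Quantity = 2, Price = 12.99
        System.out.println("Quantity = " + item.quantity + ", Price = " + item.price);
    }
}
```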
Regex
(\d+)[^\d]+([+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?)
Debuggex Demo
Description (Example)
/^(\d+)[^\d]+([+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?)$/gm
^ Start of line
1st Capturing group (\d+)
\d 1 to infinite times [greedy] Digit [0-9]
Negated char class [^\d] 1 to infinite times [greedy] matches any character except:
\d Digit [0-9]
2nd Capturing group ([+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?)
Char class [+-] 0 to 1 times [greedy] matches:
+- One of the following characters +-
Char class [0-9] 1 to 3 times [greedy] matches:
0-9 A character range between Literal 0 and Literal 9
(?:,?[0-9]{3}) Non-capturing Group 0 to infinite times [greedy]
, 0 to 1 times [greedy] Literal ,
Char class [0-9] 3 times [greedy] matches:
0-9 A character range between Literal 0 and Literal 9
(?:\.[0-9]{2}) Non-capturing Group 0 to 1 times [greedy]
\. Literal .
Char class [0-9] 2 times [greedy] matches:
0-9 A character range between Literal 0 and Literal 9
$ End of line
g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
Capture Group 1: Contains the Quantity
Capture Group 2: Contains the Amount
Java
try {
Pattern regex = Pattern.compile("(\\d+)[^\\d]+([+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\\.[0-9]{2})?)", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
for (int i = 1; i <= regexMatcher.groupCount(); i++) {
// matched text: regexMatcher.group(i)
// match start: regexMatcher.start(i)
// match end: regexMatcher.end(i)
}
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
Note: This java is just an example, I don't code in Java
You can use MessageFormat class. Below is the working example:
MessageFormat f = new MessageFormat("{0,number,#.##} {2} at {1,number,#.##}");
try {
Object[] result = f.parse("4 potato chips at 3.99");
System.out.print(result[0]+ ":" + (result[1]));
} catch (ParseException ex) {
// handle parse error
}
