Extracting numbers into a string array - java

I have a string of the form
String str = "124333 is the otp of candidate number 9912111242. Please refer txn id 12323335465645 while referring blah blah.";
I need 124333, 9912111242 and 12323335465645 in a string array. I have tried this with
while (Character.isDigit(str.charAt(i)))
I feel that calling this method on every character is inefficient. Is there a way I can get a string array of all the numbers?

Use a regex (see Pattern and Matcher):
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(str); // str is your input string
while (m.find()) {
    // m.group() contains the digits you want
}
You can easily build an ArrayList that contains each match you find.
Or, as others suggested, you can split on non-digit characters (\D):
"blabla 123 blabla 345".split("\\D+")
Note that \ has to be escaped in a Java string literal, hence the \\.
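The find loop above can be wrapped in a small helper that returns a String[]. This is only a minimal sketch: the class and method names (NumberExtractor, extractNumbers) are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberExtractor {
    // Compiled once; "\\d+" in Java source is the regex \d+ (one or more digits).
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Hypothetical helper: collects every run of digits, in order of appearance.
    static String[] extractNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String str = "124333 is the otp of candidate number 9912111242. "
                + "Please refer txn id 12323335465645 while referring blah blah.";
        for (String number : extractNumbers(str)) {
            System.out.println(number);
        }
    }
}
```

Unlike split, this never produces empty strings, because only actual matches are collected.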

You can use String.split():
String[] nbs = str.split("[^0-9]+");
This will split the String on any run of non-digit characters.

And this works perfectly for your input.
String str = "124333 is the otp of candidate number 9912111242. Please refer txn id 12323335465645 while referring blah blah.";
System.out.println(Arrays.toString(str.split("\\D+")));
Output:
[124333, 9912111242, 12323335465645]
\\D+ Matches one or more non-digit characters. Splitting the input according to one or more non-digit characters will give you the desired output.
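One caveat with the split approach: if the text starts with a non-digit, the first array element is an empty string (String.split drops trailing empty strings, but not a leading one). A small sketch that filters it out; the input and the helper name digitsOnly are made up for illustration.

```java
import java.util.Arrays;

class SplitCaveat {
    // Hypothetical helper: split on runs of non-digits, then drop empty strings.
    static String[] digitsOnly(String text) {
        return Arrays.stream(text.split("\\D+"))
                .filter(s -> !s.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String text = "otp 124333 for 9912111242"; // made-up input starting with letters
        // Plain split: the first element is the empty string before "otp ".
        System.out.println(Arrays.toString(text.split("\\D+")));
        // Filtered: only the digit runs remain.
        System.out.println(Arrays.toString(digitsOnly(text)));
    }
}
```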

Java 8 style:
long[] numbers = Pattern.compile("\\D+")
.splitAsStream(str)
.mapToLong(Long::parseLong)
.toArray();
If you only need a String array, you can just use String.split, as the other answers suggest.
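Note that the stream version above passes every token to Long.parseLong, which throws NumberFormatException on an empty string, and a leading empty token appears whenever the text starts with a non-digit. A hedged variant that skips empty tokens first (the class and method names are invented for this sketch):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LongExtractor {
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Hypothetical helper: parses every run of digits as a long.
    static long[] extractLongs(String text) {
        return NON_DIGITS.splitAsStream(text)
                .filter(s -> !s.isEmpty())   // skip the empty leading token, if any
                .mapToLong(Long::parseLong)  // still throws if a value does not fit in a long
                .toArray();
    }

    public static void main(String[] args) {
        String text = "txn id 12323335465645 and otp 124333"; // made-up input
        System.out.println(Arrays.toString(extractLongs(text)));
    }
}
```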

Alternatively, you can try this:
String str = "124333 is the otp of candidate number 9912111242. Please refer txn id 12323335465645 while referring blah blah.";
str = str.replaceAll("\\D+", ",");
System.out.println(Arrays.asList(str.split(",")));
\\D+ matches one or more non-digits.
Output:
[124333, 9912111242, 12323335465645]

The first thing that came to my mind was filter and split, then I realized that it can be done via
String[] result = str.split("\\D+");
\D matches any non-digit character, + says that one or more of these are needed, and the leading \ escapes the other \, since in a Java string literal "\D" on its own would be parsed as the escape sequence \D, which is invalid.

Related

What is the Regex for decimal numbers in Java?

I am not quite sure of what is the correct regex for the period in Java. Here are some of my attempts. Sadly, they all meant any character.
String regex = "[0-9]*[.]?[0-9]*";
String regex = "[0-9]*['.']?[0-9]*";
String regex = "[0-9]*["."]?[0-9]*";
String regex = "[0-9]*[\.]?[0-9]*";
String regex = "[0-9]*[\\.]?[0-9]*";
String regex = "[0-9]*.?[0-9]*";
String regex = "[0-9]*\.?[0-9]*";
String regex = "[0-9]*\\.?[0-9]*";
But what I want is the actual "." character itself. Anyone have an idea?
What I'm trying to do actually is to write out the regex for a non-negative real number (decimals allowed). So the possibilities are: 12.2, 3.7, 2., 0.3, .89, 19
String regex = "[0-9]*['.']?[0-9]*";
Pattern pattern = Pattern.compile(regex);
String x = "5p4";
Matcher matcher = pattern.matcher(x);
System.out.println(matcher.find());
The last line is supposed to print false but prints true anyway. I think my regex is wrong though.
Update
To match a non-negative decimal number that contains a decimal point (a plain integer such as 19 is not matched) you can use this regex:
^\d*\.\d+|\d+\.\d*$
or in Java syntax: "^\\d*\\.\\d+|\\d+\\.\\d*$"
String regex = "^\\d*\\.\\d+|\\d+\\.\\d*$";
String string = "123.43253";
if(string.matches(regex))
System.out.println("true");
else
System.out.println("false");
Explanation for your original regex attempts:
[0-9]*\.?[0-9]*
With Java escaping it becomes:
"[0-9]*\\.?[0-9]*";
If you need to make the dot mandatory, remove the ?:
[0-9]*\.[0-9]*
But this will also accept a lone dot without any digits. So, if you want to require digits, use + (one or more) instead of * (zero or more). In that case it becomes:
[0-9]+\.[0-9]+
If you are on Kotlin, you can write an extension function:
fun String.findDecimalDigits() =
Pattern.compile("^[0-9]*\\.?[0-9]*").matcher(this).run { if (find()) group() else "" }!!
Your initial understanding was probably right, but you were thrown off because matcher.find() finds the first valid match anywhere within the string, and every one of your patterns can match a zero-length string.
I would suggest "^([0-9]+\\.?[0-9]*|\\.[0-9]+)$"
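To see the difference between find() and a full match, here is a small sketch that runs the suggested pattern against the values from the question. The class and method names are made up; only java.util.regex is used.

```java
import java.util.regex.Pattern;

class DecimalCheck {
    // The pattern suggested above. With matches() the ^ and $ are redundant but harmless.
    private static final Pattern DECIMAL =
            Pattern.compile("^([0-9]+\\.?[0-9]*|\\.[0-9]+)$");

    // Hypothetical helper: true only if the whole input is a non-negative decimal.
    static boolean isNonNegativeDecimal(String s) {
        return DECIMAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"12.2", "3.7", "2.", "0.3", ".89", "19", "5p4", ".", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isNonNegativeDecimal(s));
        }
        // The unanchored original pattern with find() accepts "5p4",
        // because find() only needs a match somewhere in the input.
        System.out.println(Pattern.compile("[0-9]*[.]?[0-9]*").matcher("5p4").find());
    }
}
```

The six values from the question are accepted, while "5p4", a lone "." and the empty string are rejected.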
There are actually two ways to match a literal dot. One is backslash-escaping, as in your \\. attempt; the other is to enclose it in a character class (square brackets), like [.]. Most special characters, including the dot, become literal inside square brackets. If all you want is to match a literal dot, \\. shows your intention more clearly than [.]. Use [] when you need to match one of several things: for example, [\\d.] matches a single digit or a literal dot.
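A quick way to convince yourself, using nothing but String.matches (the inputs are made up for illustration):

```java
class LiteralDot {
    public static void main(String[] args) {
        // An unescaped dot matches any single character, so "axb" is accepted.
        System.out.println("axb".matches("a.b"));
        // Both forms of a literal dot reject "axb"...
        System.out.println("axb".matches("a\\.b"));
        System.out.println("axb".matches("a[.]b"));
        // ...and accept "a.b".
        System.out.println("a.b".matches("a\\.b"));
        System.out.println("a.b".matches("a[.]b"));
    }
}
```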
I have tested all the cases.
public static boolean isDecimal(String input) {
return Pattern.matches("^[-+]?\\d*[.]?\\d+|^[-+]?\\d+[.]?\\d*", input);
}

Regex including date string, email, number

I have this regex expression:
String patt = "(\\w+?)(:|<|>)(\\w+?),";
Pattern pattern = Pattern.compile(patt);
Matcher matcher = pattern.matcher(search + ",");
I am able to match a string like
search = "firstName:Giorgio"
But I'm not able to match string like
search = "email:giorgio.rossi#libero.it"
or
search = "dataregistrazione:27/10/2016"
How should I modify the regex in order to match these strings?
You may use
String pat = "(\\w+)[:<>]([^,]+)"; // Add a , at the end if it is necessary
See the regex demo
Details:
(\w+) - Group 1 capturing 1 or more word chars
[:<>] - one of the chars inside the character class, :, <, or >
([^,]+) - Group 2 capturing 1 or more chars other than , (in the demo, I added \n as the demo input text contains newlines).
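Here is a minimal sketch of that pattern applied to the three inputs from the question. The class name and the parse helper are assumptions made for this example, and the pattern is used exactly as given above (so the operator itself is not captured).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SearchCriteriaParser {
    // Group 1: key (word chars); then one of : < >; group 2: everything up to a comma.
    private static final Pattern CRITERION = Pattern.compile("(\\w+)[:<>]([^,]+)");

    // Hypothetical helper: returns {key, value}, or null when nothing matches.
    static String[] parse(String search) {
        Matcher m = CRITERION.matcher(search);
        return m.find() ? new String[]{m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "firstName:Giorgio",
            "email:giorgio.rossi#libero.it",
            "dataregistrazione:27/10/2016"
        };
        for (String input : inputs) {
            String[] kv = parse(input);
            System.out.println(kv[0] + " -> " + kv[1]);
        }
    }
}
```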
You can use regex like this:
public static void main(String[] args) {
String[] arr = new String[]{"firstName:Giorgio", "email:giorgio.rossi#libero.it", "dataregistrazione:27/10/2016"};
String pattern = "(\\w+[:|<|>]\\w+)|(\\w+:\\w+\\.\\w+#\\w+\\.\\w+)|(\\w+:\\d{1,2}/\\d{1,2}/\\d{4})";
for(String str : arr){
if(str.matches(pattern))
System.out.println(str);
}
}
output is:
firstName:Giorgio
email:giorgio.rossi#libero.it
dataregistrazione:27/10/2016
But remember that this regex will work only for your data format. To build a general-purpose regex you should consult the RFC documents and articles about email format.
Hope it helps.
The character class \w matches [A-Za-z0-9_]. So change the regex to (\\w+?)(:|<|>)(.*), to match any character between the operator and the trailing comma.
Or list all the characters that you expect, i.e. (\\w+?)(:|<|>)[#.\\w\\/]*, .

How to remove empty results after splitting with regex in Java?

I want to find all the numbers in a given string (the numbers are mixed with letters, and the groups are separated by spaces). I tried to split the input string, but when I check the result array I find a lot of empty strings. How can I change my split regex to remove these empty strings?
Pattern reg = Pattern.compile("\\D0*");
String[] numbers = reg.split("asd0085 sa223 9349x");
for(String s:numbers){
System.out.println(s);
}
And the result (the empty strings print as blank lines, omitted here):
85
223
9349
I know that I can iterate over the array and remove the empty results. But how can I do it with regex only?
If you are using Java 8, you can do it in one statement like this:
String[] array = Arrays.stream(s1.split("[,]")).filter(str -> !str.isEmpty()).toArray(String[]::new);
Don't use split. Use the find method, which returns each matching substring in turn. You can do it like this:
Pattern reg = Pattern.compile("\\d+");
Matcher m = reg.matcher("asd0085 sa223 9349x");
while (m.find())
System.out.println(m.group());
which will print
0085
223
9349
Based on your regex, it seems that your goal is also to remove leading zeroes, as in 0085. If so, you can use a regex like 0*(\\d+) and take the part matched by group 1 (the one in parentheses), letting the leading zeroes be matched outside of that group.
Pattern reg = Pattern.compile("0*(\\d+)");
Matcher m = reg.matcher("asd0085 sa223 9349x");
while (m.find())
System.out.println(m.group(1));
Output:
85
223
9349
But if you really want to use split, change "\\D0*" to "\\D+0*" so that you split on one or more non-digits (\\D+), not just one non-digit (\\D). With this solution you may still need to ignore the first, empty element of the result array (depending on whether the string starts with something that is split on).
You can try with Pattern and Matcher as well.
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher("asd0085 sa223 9349x");
while (m.find()) {
System.out.println(m.group());
}
One way I can think of to solve this problem is:
String urStr = "asd0085 sa223 9349x";
urStr = urStr.replaceAll("[a-zA-Z]", "");
String[] urStrAry = urStr.split("\\s");
This removes all letters from the string,
then splits it on whitespace (\\s).
Pattern reg = Pattern.compile("\\D+");
// ...
results in:
0085
223
9349
You may try this:
reg.split("asd0085 sa223 9349x").replace("^/", "")
With String.split(), you get an empty string as an array element when the string you are splitting contains back-to-back delimiters.
For example, if you split xyyz on y, the 2nd element will be an empty string. To avoid that, you can add a quantifier to the delimiter, y+, so that the split happens on one or more repetitions.
In your case it happens because you've used \\D0*, which matches each non-digit character individually and splits on it, so you get back-to-back delimiters. You can of course wrap the whole pattern in a quantifier:
Pattern reg = Pattern.compile("(\\D0*)+");
But what you really need there is \\D+0*.
However, if all you want is the numeric sequences from your string, I would use the Matcher#find() method instead, with \\d+ as the regex.
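The points above can be checked with a few lines. This is just a sketch using the inputs already mentioned in this thread; the class name is made up.

```java
import java.util.Arrays;

class EmptyTokens {
    public static void main(String[] args) {
        // Back-to-back delimiters produce an empty element in the middle.
        System.out.println(Arrays.toString("xyyz".split("y")));   // [x, , z]
        // A quantifier on the delimiter merges them into one split point.
        System.out.println(Arrays.toString("xyyz".split("y+")));  // [x, z]

        // With \\D+0* only the leading empty element remains,
        // because the input starts with a delimiter ("asd00").
        String[] parts = "asd0085 sa223 9349x".split("\\D+0*");
        System.out.println(Arrays.toString(parts));               // [, 85, 223, 9349]

        // Dropping empty elements afterwards gives the three numbers.
        String[] numbers = Arrays.stream(parts)
                .filter(s -> !s.isEmpty())
                .toArray(String[]::new);
        System.out.println(Arrays.toString(numbers));             // [85, 223, 9349]
    }
}
```

Be aware that with \\D+0* a token consisting only of zeroes is swallowed entirely by the delimiter, which may or may not be what you want.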

StringTokenizer -How to ignore spaces within a string

I am trying to use a StringTokenizer on a list of words as below
String sentence=""Name":"jon" "location":"3333 abc street" "country":"usa"" etc
When I use StringTokenizer with a space as the delimiter, as below
StringTokenizer tokens = new StringTokenizer(sentence, " ");
I was expecting the output to be separate tokens as below
Name:jon
location:3333 abc street
country:usa
But the tokenizer also splits the value of location, so the result looks like
Name:jon
location:3333
abc
street
country:usa
Please let me know how I can fix this, and if I need a regex, what kind of expression I should specify.
This can be easily handled using a CSV Reader.
String str = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
// prepare String for CSV parsing
CsvReader reader = CsvReader.parse(str.replaceAll("\" *: *\"", ":"));
reader.setDelimiter(' '); // use space as the delimiter
reader.readRecord(); // read CSV record
for (int i=0; i<reader.getColumnCount(); i++) // loop thru columns
System.out.printf("Scol[%d]: [%s]%n", i, reader.get(i));
Update: And here is pure Java SDK solution:
Pattern p = Pattern.compile("(.+?)(\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)|$)");
Matcher m = p.matcher(str);
for (int i=0; m.find(); i++)
System.out.printf("Scol[%d]: [%s]%n", i, m.group(1).replace("\"", ""));
OUTPUT:
Scol[0]: [Name:jon]
Scol[1]: [location:3333 abc street]
Scol[2]: [country:usa]
Live Demo: http://ideone.com/WO0NK6
Explanation: As per OP's comments:
I am using this regex:
(.+?)(\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)|$)
Breaking it down now into smaller chunks.
PS: DQ represents Double quote
(?:[^\"]*\") 0 or more non-DQ characters followed by one DQ (RE1)
(?:[^\"]*\"){2} Exactly a pair of above RE1
(?:(?:[^\"]*\"){2})* 0 or more occurrences of pair of RE1
(?:(?:[^\"]*\"){2})*[^\"]*$ 0 or more occurrences of pair of RE1 followed by 0 or more non-DQ characters followed by end of string (RE2)
(?=(?:(?:[^\"]*\"){2})*[^\"]*$) Positive lookahead of above RE2
.+? Match 1 or more characters (? is for non-greedy matching)
\\s+ Should be followed by one or more spaces
(\\s+(?=RE2)|$) Should be followed by space or end of string
In short: it matches one or more characters of any kind, followed by "whitespace OR end of string". The whitespace must be followed by an EVEN number of DQs. Hence a space outside double quotes is matched, while a space inside double quotes is not (since it is followed by an odd number of DQs).
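The same "even number of quotes ahead" lookahead can also be used directly with String.split. This is a variation on the idea above, not the code from the answer; the class and method names are invented, and it assumes the quotes in the input are balanced.

```java
import java.util.Arrays;

class QuoteAwareSplit {
    // Whitespace that is followed by an even number of double quotes
    // up to the end of the input, i.e. whitespace outside any quoted part.
    private static final String SPACE_OUTSIDE_QUOTES =
            "\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: split on unquoted whitespace, then strip the quotes.
    static String[] tokenize(String sentence) {
        String[] tokens = sentence.split(SPACE_OUTSIDE_QUOTES);
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokens[i].replace("\"", "");
        }
        return tokens;
    }

    public static void main(String[] args) {
        String str = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
        System.out.println(Arrays.toString(tokenize(str)));
    }
}
```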
StringTokenizer is too simple-minded for this job. If you don't need to deal with quote marks inside the values, you can try this regex:
String s = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
Pattern p = Pattern.compile("\"([^\"]*)\"");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group(1));
}
Output:
Name
jon
location
3333 abc street
country
usa
This won't handle internal quote marks within values—where the output should be, e.g.,
Name:Fred ("Freddy") Jones
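If you want the keys and values paired up rather than printed one per line, the same idea extends to matching a whole "key":"value" unit at once. A minimal sketch, assuming every entry has exactly that shape and no embedded quotes; the class name, the parse helper and the pair pattern are my own additions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPairs {
    // "key":"value" with optional whitespace around the colon (an assumed format).
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]*)\"\\s*:\\s*\"([^\"]*)\"");

    // Hypothetical helper: collects the pairs in order of appearance.
    static Map<String, String> parse(String sentence) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps insertion order
        Matcher m = PAIR.matcher(sentence);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "\"Name\":\"jon\" \"location\":\"3333 abc street\" \"country\":\"usa\"";
        parse(s).forEach((key, value) -> System.out.println(key + ":" + value));
    }
}
```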
You can use JSON; it looks like you are using a JSON-like schema.
Search for a JSON library and try parsing the data with it.
String sentence=""Name":"jon" "location":"3333 abc street" "country":"usa"" etc
This would be key/value pairs in JSON: Name is the key and jon is the value, location is the key and 3333 abc street is the value, and so on.
Give it a try.
Here is one link
http://www.mkyong.com/java/json-simple-example-read-and-write-json/
Edit:
This is a somewhat silly answer, but you can try something like this:
sentence = sentence.replaceAll("\" ", "");
StringTokenizer tokens=new StringTokenizer(sentence,"");

split strings with uppercase

I have some strings that I want to split word by word. They come in different formats, like:
THIS-IS-MY-STRING
ThisIsMyString
This_Is_My_String
This is my string
I use:
String[] x = str1.split("(?=[A-Z])|[_]|[-]|[ ]");
But there are some problems:
some elements in x array will be empty
for the first string I want “THIS” but the result of split is “T”, “H”, “I”, “S”
How should I change the split to achieve this? Could you please help me?
You need to include a look-behind as well. Here you go:
String[] x = str1.split("([-_ ]|(?<=[^-_ A-Z])(?=[A-Z]))");
[-_ ] means - or _ or space.
(?<=[^-_ A-Z]) means the previous character isn't a -, _, space, or A-Z.
(?=[A-Z]) means the next character is A-Z.
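Here is a small sketch that runs this split pattern over the four sample strings from the question (the class and method names are made up for illustration):

```java
import java.util.Arrays;

class WordSplitter {
    // Split on -, _ or space, or at a zero-width position between a character
    // that is not a separator or uppercase letter and an uppercase letter.
    private static final String BOUNDARY = "[-_ ]|(?<=[^-_ A-Z])(?=[A-Z])";

    // Hypothetical helper wrapping the split call.
    static String[] words(String s) {
        return s.split(BOUNDARY);
    }

    public static void main(String[] args) {
        String[] samples = {
            "THIS-IS-MY-STRING", "ThisIsMyString", "This_Is_My_String", "This is my string"
        };
        for (String sample : samples) {
            System.out.println(Arrays.toString(words(sample)));
        }
    }
}
```

Each of the four inputs yields four words, with no empty elements and with THIS kept in one piece.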
EDIT:
Unfortunately there is no way (that I know of) to use split on _CITY_ABC without getting either _CITY or an empty string as the first element.
You can, however, skip the first and last elements when they are empty, but this is not ideal.
For this I suggest Matcher:
String str1 = "_CityCITY_";
Pattern p = Pattern.compile("[A-Z][a-z]+(?=[A-Z]|$)|[A-Za-z]+(?=[-_ ]|$)");
Matcher m = p.matcher(str1);
while (m.find())
System.out.println(m.group());
Try Regex.Split(). The first param is the string to split and the second string would be your regular expression. Hope this helps.
