Regex expression is not working correctly - java

I am trying to process a string in which numbers are formatted as "4.97", but numbers smaller than 1 appear in the format .97, .80 etc. I want to find these kinds of substrings in the String and replace them so that they start with a 0.
It's working for the string
String str = "Rate is : .97";
Result : "Rate is : 0.97"
But not for the string:
String str = "Rate is : .97 . XXXXXXXXX do you want . to perform another calculation . ";
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
String pattern = "(.*\\D)(.\\d\\d.*)";
System.out.println(str.matches("(.*\\D)(.\\d\\d.*)"));
str = str.replaceAll(pattern, "$10$2");
Why is this happening?

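In your second example, the .* after the last \\d matches any character except a newline, so it consumes the rest of the string. Because the first .* is greedy as well, the whole string becomes a single match, and the 0 is inserted only once, before the last place where two digits follow.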
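To see what the original pattern actually does, here is a small sketch (the class and method names are made up for illustration, they are not from the question). It applies the unchanged pattern to the second string; the single match is split at the space before 87, so the zero ends up there and .97 is left alone.

```java
class GreedyDemo {
    // The pattern and replacement from the question, unchanged.
    static String applyOriginal(String input) {
        return input.replaceAll("(.*\\D)(.\\d\\d.*)", "$10$2");
    }

    public static void main(String[] args) {
        String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
        // Group 1 greedily runs up to the dot before " 87";
        // the unescaped dot in group 2 then matches the space.
        System.out.println(applyOriginal(str));
    }
}
```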
You can do the replacement without a capturing group by using a negative lookbehind (?<!\S), which asserts that the character on the left is not a non-whitespace character.
(?<!\S)\.[0-9]
In the replacement use a zero followed by the full match.
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
String pattern = "(?<!\\S)\\.[0-9]";
System.out.println(str.replaceAll(pattern, "0$0"));
Output
Rate is : 0.97 . XXXXXXXXX do you want . 87 to perform another calculation .
If there should be a non digit before, you could make use of a positive lookbehind
(?<=\D)\.[0-9]
In Java
String regex = "(?<=\\D)\\.[0-9]";
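The two lookbehind variants above behave differently at the start of the string and after punctuation. The following is only a sketch to compare them; the class name, method names and sample inputs are my own assumptions.

```java
class LookbehindDemo {
    // Variant 1: the dot must be at the start of the input or preceded by whitespace.
    static String fixAfterWhitespace(String s) {
        return s.replaceAll("(?<!\\S)\\.[0-9]", "0$0");
    }

    // Variant 2: the dot must be preceded by some non-digit character.
    static String fixAfterNonDigit(String s) {
        return s.replaceAll("(?<=\\D)\\.[0-9]", "0$0");
    }

    public static void main(String[] args) {
        String[] samples = { "x=.5 and .7", ".5", "4.97" };
        for (String s : samples) {
            System.out.println(s + " -> " + fixAfterWhitespace(s) + " | " + fixAfterNonDigit(s));
        }
    }
}
```

Variant 1 skips ".5" after "=", while variant 2 skips a ".5" at the very start of the input because there is no character for \D to match there.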

It looks like you need to add some lazy matching to your regex.
A ? after a quantifier makes it lazy, meaning it will attempt to match as few times as possible; in this case it picks up only the first number and does not run on to the second.
^(.*?\D)(.\d\d.*?)
You can see this regex work here, with a more complete explanation.
I have also added the ^ start-of-string anchor to make sure only one match is created and the pattern is not applied again to the second number.

First of all, your regex pattern seems to be wrong. I think you can just use:
(\D)(\.\d+)
Find a character that is not a digit, followed by a dot and at least one digit.
Second, for replacing, you could use more low-level features, such as:
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
final Pattern regex = Pattern.compile("(\\D)(\\.\\d+)");
final Matcher m = regex.matcher(str);
if (m.find()) {
str = m.replaceFirst(m.group(1) + "0" + m.group(2));
}
System.out.println(str);
But of course, this works too:
str = str.replaceAll("(\\D)(\\.\\d+)", "$10$2");
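One thing to keep in mind with (\D)(\.\d+): the \D has to consume a character, so a number at the very start of the input is not touched. The sketch below shows that, next to a lookaround-based alternative that is my own suggestion and not part of the answer above; all names are hypothetical.

```java
class LeadingZeroDemo {
    // The replaceAll call from the answer: needs a non-digit character before the dot.
    static String withCapturedNonDigit(String s) {
        return s.replaceAll("(\\D)(\\.\\d+)", "$10$2");
    }

    // Assumed alternative: a dot that is not preceded by a digit and is followed by one.
    // Lookarounds consume nothing, so this also works at index 0.
    static String withLookarounds(String s) {
        return s.replaceAll("(?<!\\d)\\.(?=\\d)", "0.");
    }

    public static void main(String[] args) {
        String s = ".97 and .80";
        System.out.println(withCapturedNonDigit(s));
        System.out.println(withLookarounds(s));
    }
}
```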

You can do a positive lookahead so that way you also catch whitespaces between . and the number.
(.(?=.\d)|(\d+))+
Then in your code you can do whatever operation you wish on group 1 and group 2.

Related

How to parse string using regex

I'm pretty new to java, trying to find a way to do this better. Potentially using a regex.
String text = test.get(i).toString();
// text looks like this in string form:
// EnumOption[enumId=test,id=machine]
String checker = text.replace("[","").replace("]","").split(",")[1].split("=")[1];
// checker becomes machine
My goal is to parse that text string and just return back machine. Which is what I did in the code above.
But that looks ugly. I was wondering what kinda regex can be used here to make this a little better? Or maybe another suggestion?
Use a regex lookbehind:
(?<=\bid=)[^],]*
See Regex101.
(?<= ) // Start matching only after what matches inside
\bid= // Match "\bid=" (= word boundary then "id="),
[^],]* // Match and keep the longest sequence without any ']' or ','
In Java, use it like this:
import java.util.regex.*;
class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(?<=\\bid=)[^],]*");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(0));
}
}
}
This results in
machine
Assuming you’re using the Polarion ALM API, you should use the EnumOption’s getId method instead of deparsing and re-parsing the value via a string:
String id = test.get(i).getId();
Using the replace and split functions doesn't take the structure of the data into account.
If you want to use a regex, you can just use a capturing group without any lookarounds, where enum can be any value except a ] and comma, and id can be any value except ].
The value of id will be in capture group 1.
\bEnumOption\[enumId=[^=,\]]+,id=([^\]]+)\]
Explanation
\bEnumOption Match EnumOption preceded by a word boundary
\[enumId= Match [enumId=
[^=,\]]+, Match 1+ times any char except = , and ]
id= Match literally
( Capture group 1
[^\]]+ Match 1+ times any char except ]
)\] Close group 1 and match ]
Pattern pattern = Pattern.compile("\\bEnumOption\\[enumId=[^=,\\]]+,id=([^\\]]+)\\]");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Output
machine
If there can be more comma separated values, you could also only match id making use of negated character classes [^][]* before and after matching id to stay inside the square bracket boundaries.
\bEnumOption\[[^][]*\bid=([^,\]]+)[^][]*\]
In Java, where an unescaped [ inside a character class opens a nested class, escape both brackets:
String regex = "\\bEnumOption\\[[^\\]\\[]*\\bid=([^,\\]]+)[^\\]\\[]*\\]";
A regex can of course be used, but sometimes is less performant, less readable and more bug-prone.
I would advise you not use any regex that you did not come up with yourself, or at least understand completely.
PS: I think your solution is actually quite readable.
Here's another non-regex version:
String text = "EnumOption[enumId=test,id=machine]";
text = text.substring(text.lastIndexOf('=') + 1);
text = text.substring(0, text.length() - 1);
Not doing you a favor, but the downvote hurt, so here you go:
String input = "EnumOption[enumId=test,id=machine]";
Matcher matcher = Pattern.compile("EnumOption\\[enumId=(.+),id=(.+)\\]").matcher(input);
if(!matcher.matches()) {
throw new RuntimeException("unexpected input: " + input);
}
System.out.println("enumId: " + matcher.group(1));
System.out.println("id: " + matcher.group(2));

What is the Regex for decimal numbers in Java?

I am not quite sure of what is the correct regex for the period in Java. Here are some of my attempts. Sadly, they all meant any character.
String regex = "[0-9]*[.]?[0-9]*";
String regex = "[0-9]*['.']?[0-9]*";
String regex = "[0-9]*["."]?[0-9]*";
String regex = "[0-9]*[\.]?[0-9]*";
String regex = "[0-9]*[\\.]?[0-9]*";
String regex = "[0-9]*.?[0-9]*";
String regex = "[0-9]*\.?[0-9]*";
String regex = "[0-9]*\\.?[0-9]*";
But what I want is the actual "." character itself. Anyone have an idea?
What I'm trying to do actually is to write out the regex for a non-negative real number (decimals allowed). So the possibilities are: 12.2, 3.7, 2., 0.3, .89, 19
String regex = "[0-9]*['.']?[0-9]*";
Pattern pattern = Pattern.compile(regex);
String x = "5p4";
Matcher matcher = pattern.matcher(x);
System.out.println(matcher.find());
The last line is supposed to print false but prints true anyway. I think my regex is wrong though.
Update
To match non negative decimal number you need this regex:
^\d*\.\d+|\d+\.\d*$
or in java syntax : "^\\d*\\.\\d+|\\d+\\.\\d*$"
String regex = "^\\d*\\.\\d+|\\d+\\.\\d*$";
String string = "123.43253";
if(string.matches(regex))
System.out.println("true");
else
System.out.println("false");
Explanation for your original regex attempts:
[0-9]*\.?[0-9]*
with java escape it becomes :
"[0-9]*\\.?[0-9]*";
if you need to make the dot as mandatory you remove the ? mark:
[0-9]*\.[0-9]*
but this will accept just a dot without any number as well... So, if you want the validation to consider number as mandatory you use + ( which means one or more) instead of *(which means zero or more). That case it becomes:
[0-9]+\.[0-9]+
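As a rough check of the three variants discussed above, this sketch runs each of them with String.matches (which requires the whole input to match) over a few sample inputs. The class and constant names are invented for the example.

```java
class DotVariantsDemo {
    static final String OPTIONAL_DOT = "[0-9]*\\.?[0-9]*";    // dot optional, digits optional
    static final String MANDATORY_DOT = "[0-9]*\\.[0-9]*";    // dot required, digits optional
    static final String DIGITS_BOTH_SIDES = "[0-9]+\\.[0-9]+"; // dot and digits required

    public static void main(String[] args) {
        String[] samples = { "12.2", "2.", ".89", "19", ".", "" };
        for (String s : samples) {
            System.out.printf("'%s' optional=%b mandatory=%b both=%b%n",
                    s,
                    s.matches(OPTIONAL_DOT),
                    s.matches(MANDATORY_DOT),
                    s.matches(DIGITS_BOTH_SIDES));
        }
    }
}
```

Note that the strictest variant also rejects 2., .89 and 19, which the question wants to accept, so it is not a drop-in answer to the original requirement.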
If you are on Kotlin, use a KTX-style extension function:
fun String.findDecimalDigits() =
Pattern.compile("^[0-9]*\\.?[0-9]*").matcher(this).run { if (find()) group() else "" }!!
Your initial understanding was probably right, but you were thrown off because matcher.find() looks for the first valid match anywhere within the string, and every one of your patterns can match a zero-length string.
I would suggest "^([0-9]+\\.?[0-9]*|\\.[0-9]+)$"
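Here is a small sketch of the difference between find() and matches(), using the pattern from the question and the anchored pattern suggested above. The class, field and method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

class FindVersusMatches {
    // Pattern from the question: every part is optional, so it can match "".
    static final Pattern LOOSE = Pattern.compile("[0-9]*['.']?[0-9]*");
    // Anchored pattern suggested in the answer.
    static final Pattern STRICT = Pattern.compile("^([0-9]+\\.?[0-9]*|\\.[0-9]+)$");

    static boolean isNonNegativeDecimal(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        // find() succeeds as soon as any substring matches, here the leading "5".
        System.out.println(LOOSE.matcher("5p4").find());
        // matches() requires the entire input to match.
        System.out.println(LOOSE.matcher("5p4").matches());
        for (String s : new String[] { "12.2", "3.7", "2.", "0.3", ".89", "19", "5p4", "." }) {
            System.out.println(s + " -> " + isNonNegativeDecimal(s));
        }
    }
}
```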
There are actually two ways to match a literal dot. One is backslash-escaping, as you do with \\.; the other is to enclose it in a character class (square brackets), as in [.]. Most special characters, including ., become literal inside square brackets. If all you want is to match a literal dot, \\. shows your intention more clearly than [.]. Use [] when you need to match one of several things; for example, the regex [\\d.] matches a single digit or a literal dot.
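To make the difference concrete, this sketch compares an unescaped dot with the escaped and bracketed forms. Pattern.quote is a third option from the standard library that the answer does not mention; the class and constant names are mine.

```java
import java.util.regex.Pattern;

class LiteralDotDemo {
    static final String UNESCAPED = "5.4";    // dot means "any character"
    static final String ESCAPED = "5\\.4";    // regex \. is a literal dot
    static final String BRACKETED = "5[.]4";  // a dot inside a class is literal
    static final String QUOTED = "5" + Pattern.quote(".") + "4"; // quoted literal text

    public static void main(String[] args) {
        for (String regex : new String[] { UNESCAPED, ESCAPED, BRACKETED, QUOTED }) {
            System.out.println(regex + ": 5.4=" + "5.4".matches(regex)
                    + " 5p4=" + "5p4".matches(regex));
        }
    }
}
```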
I have tested all the cases.
public static boolean isDecimal(String input) {
return Pattern.matches("^[-+]?\\d*[.]?\\d+|^[-+]?\\d+[.]?\\d*", input);
}

Regex including date string, email, number

I have this regex expression:
String patt = "(\\w+?)(:|<|>)(\\w+?),";
Pattern pattern = Pattern.compile(patt);
Matcher matcher = pattern.matcher(search + ",");
I am able to match a string like
search = "firstName:Giorgio"
But I'm not able to match string like
search = "email:giorgio.rossi#libero.it"
or
search = "dataregistrazione:27/10/2016"
How I should modify the regex expression in order to match these strings?
You may use
String pat = "(\\w+)[:<>]([^,]+)"; // Add a , at the end if it is necessary
See the regex demo
Details:
(\w+) - Group 1 capturing 1 or more word chars
[:<>] - one of the chars inside the character class, :, <, or >
([^,]+) - Group 2 capturing 1 or more chars other than , (in the demo, I added \n as the demo input text contains newlines).
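A minimal sketch of how the two groups could be read out for the three search strings from the question. The class name, the parse helper and its null-on-no-match convention are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SearchCriteriaDemo {
    static final Pattern CRITERION = Pattern.compile("(\\w+)[:<>]([^,]+)");

    // Returns {key, value} for the first criterion found, or null if nothing matches.
    static String[] parse(String search) {
        Matcher m = CRITERION.matcher(search);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] searches = {
            "firstName:Giorgio",
            "email:giorgio.rossi#libero.it",
            "dataregistrazione:27/10/2016"
        };
        for (String search : searches) {
            String[] kv = parse(search);
            System.out.println(kv[0] + " = " + kv[1]);
        }
    }
}
```

Because group 2 only excludes commas, dots, slashes and # in the value are all accepted without listing them.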
You can use regex like this:
public static void main(String[] args) {
String[] arr = new String[]{"firstName:Giorgio", "email:giorgio.rossi#libero.it", "dataregistrazione:27/10/2016"};
String pattern = "(\\w+[:|<|>]\\w+)|(\\w+:\\w+\\.\\w+#\\w+\\.\\w+)|(\\w+:\\d{1,2}/\\d{1,2}/\\d{4})";
for(String str : arr){
if(str.matches(pattern))
System.out.println(str);
}
}
output is:
firstName:Giorgio
email:giorgio.rossi#libero.it
dataregistrazione:27/10/2016
But remember that this regex works only for your data format. To build a universal regex, consult the RFC documents and articles about email format.
Hope it helps.
The character class \w matches [A-Za-z0-9_]. So change the regex to (\\w+?)(:|<|>)(.*), to match any character between the operator and the trailing comma.
Or list all the characters that you expect, i.e. (\\w+?)(:|<|>)[#.\\w\\/]*, .

How to build a Regex in java to detect a whitespace or end of a string?

I am trying to build a Regex to find and extract the string containing Post office box.
Here are some examples:
str = "some text p.o. box 12456 Floor 105 streetName Street";
str = "po box 1011";
str = "post office Box 12 Floor 105 Tallapoosa Street";
str = "leclair ryan pc p.o. Box 2499 8th floor 951 east byrd street";
str = "box 1 slot 3 building 2 136 harvey road";
Here is my pattern and code:
Pattern p = Pattern.compile("p.*o.*box \\d+(\\z|\\s)");
Matcher m = p.matcher(str);
int count =0;
while(m.find()) {
count++;
System.out.println("Match number "+count);
System.out.println("start(): "+m.start());
System.out.println("end(): "+m.end());
}
It works with the second example but not with the first one!
If I change my pattern to the following:
Pattern p = Pattern.compile("p.*o.*box \\d+ ");
It works just for the first example.
The question is how to group the Regex for end of string "\z" and Regex for whitespace "\s" or " "?
New Pattern:
Pattern p = Pattern.compile("(?i)((p.*o.box\\s\\w\\s*\\d*(\\z|\\s*)|(box\\s*\\w\\s*\\d*(\\z|\\s*)) ))");
You can leverage the following code:
String str = "some text p.o. box 12456 Floor 105 streetName Street";
Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)(?:\\z|\\s)");
Matcher m = p.matcher(str);
int count =0;
while(m.find()) {
count++;
System.out.println("Match: "+m.group(0));
System.out.println("Digits: "+m.group(1));
System.out.println("Match number "+count);
System.out.println("start(): "+m.start());
System.out.println("end(): "+m.end());
}
To make the pattern case insensitive, just add Pattern.CASE_INSENSITIVE flag to the Pattern.compile declaration or pre-pend the inline (?i) modifier to the pattern.
Also, .* matches any characters other than a newline zero or more times; I guess you wanted to match . optionally. So you need just the ? quantifier, and you need to escape the dot so that it matches a literal dot. Note how I used (...) to capture the digits into Group 1 (it is called a capturing group). The group where you match the end of the string or a space is inside a non-capturing group ((?:...)) that is used for grouping only, not for storing its value in the memory buffer. Since you wanted to match a word boundary there, I suggest replacing (?:\\z|\\s) with a mere \\b:
Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b");
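For a quick check, this sketch runs the \\b version of the pattern over the five sample strings from the question. The class name and the boxNumber helper (returning null when nothing is found) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PoBoxDemo {
    static final Pattern PO_BOX =
            Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b");

    // Returns the digits captured in group 1, or null if the input has no PO box.
    static String boxNumber(String address) {
        Matcher m = PO_BOX.matcher(address);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "some text p.o. box 12456 Floor 105 streetName Street",
            "po box 1011",
            "post office Box 12 Floor 105 Tallapoosa Street",
            "leclair ryan pc p.o. Box 2499 8th floor 951 east byrd street",
            "box 1 slot 3 building 2 136 harvey road"
        };
        for (String s : samples) {
            System.out.println(boxNumber(s) + " <- " + s);
        }
    }
}
```

The "post office Box" and bare "box" samples are not matched by this pattern; covering them would need extra alternatives, which the answer does not attempt.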
There are a couple items in your regex that look like they need work. From what I understand you want to extract the P.O. Box number from strings of such format that you've provided. Given that, the following regex will accomplish what you want, with a following explanation. See it in action here: https://regex101.com/r/cQ8lH3/2
Pattern p = Pattern.compile("p\\.?o\\.? box [^ \\r\\n\\t]+");
Firstly, in the regex itself each escape sequence takes only ONE backslash; it is doubled above only because it sits inside a Java string literal. Also, you must escape the dots. If you do not escape the dots, regex will match . as ANY single character. \. will instead match a dot symbol.
Next, you need to change the * quantifier after the \. to a ?. Why? The * symbol will match zero or more of the preceding symbol while the ? quantifier will match only one or none.
Finally, rethink how you're matching the box number. Instead of matching all characters AND THEN whitespace, just match everything that isn't whitespace. [^ \r\n\t]+ will match all characters that are NOT a space ( ), carriage return (\r), newline (\n), or tab (\t). Therefore it will consume the box number and stop as soon as it hits any whitespace or the end of the string.
Some of these changes may not be necessary to get your code to work for the examples you gave, but they are the proper way to build the regex you want.

Regex : Match first 15 characters but the matcher.start() should point to the 16th character

Here I am using a regex to match the first 15 characters. In the substring call I have to use (0, matcher.start()) only, and it should give me exactly those 15 characters. Kindly help me out.
String test = "hello world this is example";
Pattern p = Pattern.compile(".{15}");
//can't change the code below
//can only make changes to pattern
Matcher m=p.matches(test);
matcher.find(){
String string = test.substring(0, m.start());
}
//here it is escaping the first 15 characters but I need them
//the m.start() here is giving 0 but it should give 15
If possible, I agree with @jlordo's comment: Use String string = test.substring(0, 15);
If you are forced to pass through the bottom code snippet which you marked as unchangeable, there is a workaround.
(Well, that depends... If you are stuck with an unchangeable code snippet that does not even compile... you're gonna have a bad time.)
If you really need a regex that always returns 15 for m.start() you can use the regex lookbehind concept.
String test = "hello world this is example";
Pattern p = Pattern.compile("(?<=.{15}).");
Matcher m=p.matcher(test);
m.find();
System.out.println(m.start());
Granted that the test input string is at least 16 characters long, this will always return 15 for m.start().
The regex is supposed to be read as "Any character (the last .) preceded by (the (?<=) lookbehind operator) exactly 15 any characters (the .{15})".
(?<=foo) is a lookbehind operator specifying that whatever regex comes after must be preceded by "foo".
E.g. the regex: (?<=foo)bar
Will match the bar in: foobar
But not the bar in: wunderbar
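The foobar / wunderbar example can be checked with a few lines of Java. This is only a sketch; the class name and the startOfBar helper (returning -1 for no match) are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindBasics {
    static final Pattern BAR_AFTER_FOO = Pattern.compile("(?<=foo)bar");

    // Returns the start index of the first "bar" preceded by "foo", or -1 if there is none.
    static int startOfBar(String s) {
        Matcher m = BAR_AFTER_FOO.matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The match starts at "bar"; "foo" is checked but not part of the match.
        System.out.println(startOfBar("foobar"));
        System.out.println(startOfBar("wunderbar"));
    }
}
```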
You should be using Matcher.end() instead of Matcher.start().
String test = "hello world this is example";
Pattern p = Pattern.compile(".{15}");
Matcher m=p.matcher(test);
if(m.find()){
String string = test.substring(0, m.end());
System.out.println(string);
}
From API:
Matcher.start() ---> Returns the start index of the previous match.
Matcher.end() ---> Returns the offset after the last character matched.
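To compare the two approaches side by side, this sketch prints start() and end() for both patterns on the test string from the question. The class and helper names are assumptions; the helper returns null when there is no match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartEndDemo {
    // Returns {start, end} of the first match, or null if the regex does not match.
    static int[] firstMatchBounds(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        String test = "hello world this is example";
        int[] plain = firstMatchBounds(".{15}", test);            // match covers indices 0..14
        int[] lookbehind = firstMatchBounds("(?<=.{15}).", test); // match is the 16th character
        System.out.println(plain[0] + ", " + plain[1]);
        System.out.println(lookbehind[0] + ", " + lookbehind[1]);
        // Both routes yield the same first 15 characters.
        System.out.println(test.substring(0, plain[1]));
        System.out.println(test.substring(0, lookbehind[0]));
    }
}
```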
