Java regex: Remove (double) negative look ahead and look behind

I have the following regex that matches a string to pattern:
(?i)(?<![^\\s\\p{Punct}]) : Look behind
(?![^\\s\\p{Punct}]) : Look ahead
Below is an example that demonstrates how I am using it:
public static void main(String[] args) {
    String patternStart = "(?i)(?<![^\\s\\p{Punct}])", patternEnd = "(?![^\\s\\p{Punct}])";
    String text = "this is some paragraph";
    System.out.println(Pattern.compile(patternStart + Pattern.quote("some paragraph") + patternEnd).matcher(text).find());
}
It returns true, which is the expected result. However, as the regex uses a double negative (i.e. a negative lookahead/lookbehind plus the negating ^ in the character class), I thought removing both negatives should return the same result. So I tried the following:
String patternStart = "(?i)(?<=[\\s\\p{Punct}])", patternEnd = "(?=[\\s\\p{Punct}])";
However, it doesn't seem to be working as expected. I even tried adding ^ and/or $ in the end (of the square bracket) to match beginning/end of string, still, no luck.
Is it possible to convert these regexes into positive look-ups?

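Yes, it is possible, but it is less efficient than what you have. A negative lookaround also succeeds at the start or end of the string, where there is no character to test, whereas a positive lookaround requires a character to be present, so in the positive lookarounds you need to add the string boundaries with alternation: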
String patternStart = "(?i)(?<=^|[\\s\\p{Punct}])", patternEnd = "(?=[\\s\\p{Punct}]|$)";
The (?<=^|[\\s\\p{Punct}]) lookbehind requires either the start of the string (^) or a whitespace or punctuation character ([\\s\\p{Punct}]) immediately before the match. The positive lookahead (?=[\\s\\p{Punct}]|$) requires either a whitespace or punctuation character, or the end of the string, immediately after it.
If you just add ^ or $ into the character classes like [\\s\\p{Punct}^] and [\\s\\p{Punct}$], they will be parsed as literal caret and dollar symbols.
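As a rough illustration of the equivalence, the sketch below runs the original negative lookarounds and the positive versions with alternation against the same inputs. The class name `LookaroundDemo`, the helper `containsPhrase` and the extra test strings are made up for this example.

```java
import java.util.regex.Pattern;

class LookaroundDemo {
    // Original negative lookarounds: no non-space, non-punctuation character on either side.
    static final String NEG_START = "(?i)(?<![^\\s\\p{Punct}])";
    static final String NEG_END = "(?![^\\s\\p{Punct}])";

    // Positive versions: the string boundaries have to be listed explicitly.
    static final String POS_START = "(?i)(?<=^|[\\s\\p{Punct}])";
    static final String POS_END = "(?=[\\s\\p{Punct}]|$)";

    // Hypothetical helper: true if the phrase occurs in the text as a delimited unit.
    static boolean containsPhrase(String start, String end, String phrase, String text) {
        return Pattern.compile(start + Pattern.quote(phrase) + end).matcher(text).find();
    }

    public static void main(String[] args) {
        String[] texts = {
            "this is some paragraph",    // phrase ends at the end of the string
            "some paragraph, and more",  // phrase starts at the start of the string
            "awesome paragraphs"         // phrase is embedded in longer words
        };
        for (String text : texts) {
            boolean neg = containsPhrase(NEG_START, NEG_END, "some paragraph", text);
            boolean pos = containsPhrase(POS_START, POS_END, "some paragraph", text);
            System.out.println(text + " -> negative: " + neg + ", positive: " + pos);
        }
    }
}
```

Both variants should agree on all three inputs: true for the first two, false for the last one.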

Related

Regex for extracting all heading digits from a string

I am trying to extract all heading digits from a string using Java regex without writing additional code and I could not find something to work:
"12345XYZ6789ABC" should give me "12345".
"X12345XYZ6789ABC" should give me nothing
public final class NumberExtractor {
    private static final Pattern DIGITS = Pattern.compile("what should be my regex here?");

    public static Optional<Long> headNumber(String token) {
        var matcher = DIGITS.matcher(token);
        return matcher.find() ? Optional.of(Long.valueOf(matcher.group())) : Optional.empty();
    }
}
Use a word boundary \b:
\b\d+
See live demo.
If you strictly want to match only digits at the start of the input, and not from each word (same thing when the input contains only one word), use ^:
^\d+
Pattern DIGITS = Pattern.compile("\\b\\d+"); // leading digits of all words
Pattern DIGITS = Pattern.compile("^\\d+"); // leading digits of input
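I'd think something like "^[0-9]+" would work (with * instead of +, find() would also succeed with an empty match when the input does not start with a digit, and Long.valueOf("") would then throw). In Java, \d matches only [0-9] by default; it matches other Unicode digits as well only if you compile the pattern with Pattern.UNICODE_CHARACTER_CLASS.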
Edit: removed errant . from the string.
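For reference, here is a minimal sketch of the class from the question with the `^\d+` pattern filled in. It assumes the leading number fits into a `long`; a longer digit run would make `Long.valueOf` throw a `NumberFormatException`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberExtractor {
    // One or more digits anchored at the start of the input.
    private static final Pattern DIGITS = Pattern.compile("^\\d+");

    static Optional<Long> headNumber(String token) {
        Matcher matcher = DIGITS.matcher(token);
        // find() succeeds only when the input begins with at least one digit.
        return matcher.find() ? Optional.of(Long.valueOf(matcher.group())) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(headNumber("12345XYZ6789ABC"));   // Optional[12345]
        System.out.println(headNumber("X12345XYZ6789ABC"));  // Optional.empty
    }
}
```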

What is the Regex for decimal numbers in Java?

I am not quite sure of what is the correct regex for the period in Java. Here are some of my attempts. Sadly, they all meant any character.
String regex = "[0-9]*[.]?[0-9]*";
String regex = "[0-9]*['.']?[0-9]*";
String regex = "[0-9]*["."]?[0-9]*";
String regex = "[0-9]*[\.]?[0-9]*";
String regex = "[0-9]*[\\.]?[0-9]*";
String regex = "[0-9]*.?[0-9]*";
String regex = "[0-9]*\.?[0-9]*";
String regex = "[0-9]*\\.?[0-9]*";
But what I want is the actual "." character itself. Anyone have an idea?
What I'm trying to do actually is to write out the regex for a non-negative real number (decimals allowed). So the possibilities are: 12.2, 3.7, 2., 0.3, .89, 19
String regex = "[0-9]*['.']?[0-9]*";
Pattern pattern = Pattern.compile(regex);
String x = "5p4";
Matcher matcher = pattern.matcher(x);
System.out.println(matcher.find());
The last line is supposed to print false but prints true anyway. I think my regex is wrong though.
Update
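To match a non-negative decimal number you need this regex:
^(\d*\.\d+|\d+\.\d*)$
or in Java syntax: "^(\\d*\\.\\d+|\\d+\\.\\d*)$". The group makes both anchors apply to both alternatives. Note that this pattern requires a decimal point, so a plain integer such as 19 is rejected.
String regex = "^(\\d*\\.\\d+|\\d+\\.\\d*)$";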
String string = "123.43253";
if(string.matches(regex))
System.out.println("true");
else
System.out.println("false");
Explanation for your original regex attempts:
[0-9]*\.?[0-9]*
with java escape it becomes :
"[0-9]*\\.?[0-9]*";
if you need to make the dot as mandatory you remove the ? mark:
[0-9]*\.[0-9]*
but this will accept just a dot without any number as well... So, if you want the validation to consider number as mandatory you use + ( which means one or more) instead of *(which means zero or more). That case it becomes:
[0-9]+\.[0-9]+
If you are on Kotlin, you can use an extension function:
fun String.findDecimalDigits() =
    Pattern.compile("^[0-9]*\\.?[0-9]*").matcher(this).run { if (find()) group() else "" }!!
Your initial understanding was probably right, but you were thrown off because matcher.find() looks for the first valid match anywhere within the string, and all of your patterns can match a zero-length string.
I would suggest "^([0-9]+\\.?[0-9]*|\\.[0-9]+)$"
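A small sketch of the difference, assuming the inputs listed in the question: the loose pattern passes `find()` on "5p4" because a partial match is enough, while the anchored pattern accepts only complete numbers. The class name `DecimalCheck` and the helper `isNonNegativeDecimal` are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DecimalCheck {
    // The question's pattern: every part is optional, so it can match an empty string.
    static final Pattern LOOSE = Pattern.compile("[0-9]*[.]?[0-9]*");
    // The anchored pattern suggested above: at least one digit is required.
    static final Pattern STRICT = Pattern.compile("^([0-9]+\\.?[0-9]*|\\.[0-9]+)$");

    static boolean isNonNegativeDecimal(String input) {
        return STRICT.matcher(input).matches();
    }

    public static void main(String[] args) {
        Matcher m = LOOSE.matcher("5p4");
        System.out.println(m.find());                   // true: a partial match is enough
        System.out.println(m.group());                  // 5
        System.out.println(LOOSE.matcher("p").find());  // true: zero-length match

        for (String s : new String[] {"12.2", "3.7", "2.", "0.3", ".89", "19", "5p4", "."}) {
            System.out.println(s + " -> " + isNonNegativeDecimal(s));
        }
    }
}
```

The first six inputs in the loop should print true, while "5p4" and a lone "." should print false.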
There are actually two ways to match a literal dot. One is backslash-escaping, as you did with \\.; the other is to enclose it in a character class (square brackets), like [.]. Most special characters, including the dot, become literal inside square brackets. Using \\. shows your intention more clearly than [.] if all you want is to match a literal dot. Use [] when you need to match one of several things; for example, the regex [\\d.] matches a single digit or a literal dot.
I have tested all the cases.
public static boolean isDecimal(String input) {
    return Pattern.matches("^[-+]?\\d*[.]?\\d+|^[-+]?\\d+[.]?\\d*", input);
}

How do I enter a "." 2 spaces before every "," in a Java string

I've got a string in my Java project which looks something like this
9201,92710,94500,920,1002
How can I enter a dot 2 places before the comma? So it looks like this:
920.1,9271.0,9450.0,92.0,100.2
I had an attempt at it but I can't get the last number to get a dot.
numbers = numbers.replaceAll("([0-9],)", "\\.$1");
The result I got is
920.1,9271.0,9450.0,92.0,1002
Note: The length of the string is not always the same. It can be longer / shorter.
Check if string ends with ",". If not, append a "," to the string, run the same replaceAll, remove "," from end of String.
Split string by the "," delimiter, process each piece adding the "." where needed.
Just add a "." at numbers.length-1 to solve the issue with the last number
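The split-based suggestion could look roughly like the sketch below. It assumes the input has no trailing comma (String.split drops trailing empty pieces); the class name `DotInserter` and the method `insertDots` are invented for the example.

```java
import java.util.StringJoiner;

class DotInserter {
    // Hypothetical helper: inserts a dot before the last character of each comma-separated piece.
    static String insertDots(String numbers) {
        StringJoiner joined = new StringJoiner(",");
        for (String piece : numbers.split(",")) {
            if (piece.isEmpty()) {
                joined.add(piece);  // keep empty pieces unchanged
                continue;
            }
            int cut = piece.length() - 1;  // index of the last character
            joined.add(piece.substring(0, cut) + "." + piece.substring(cut));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertDots("9201,92710,94500,920,1002"));
        // 920.1,9271.0,9450.0,92.0,100.2
    }
}
```

This avoids regex entirely and treats the last piece like any other, at the cost of a few more lines than a single replaceAll call.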
As your problem is not only inserting the dot before every comma, but also before end of string, you just must add this additional condition to your capturing group:
numbers = numbers.replaceAll("([0-9](,|$))", "\\.$1");
As suggested by Siguza, you could as well use a non-capturing group which is even more what a "human" would expect to be captured in the capturing group:
numbers = numbers.replaceAll("([0-9](?:,|$))", "\\.$1");
But as a non-capturing group is (although a really nice feature) not part of every regex flavor (POSIX regular expressions lack it, although Java supports it) and the overhead is not that significant here, I would recommend using the first option.
You could use word boundary:
numbers = numbers.replaceAll("(\\d)\\b", ".$1");
Your solution is fine, as long as you put a comma at the end like dan said.
So instead of:
numbers = numbers.replaceAll("([0-9],)", "\\.$1");
write:
numbers = (numbers+",").replaceAll("([0-9],)", "\\.$1");
numbers = numbers.substring(0, numbers.length() - 1);
You may use a positive lookahead to check for the , or end of string right after a digit and a zeroth backreference to the whole match:
String s = "9201,92710,94500,920,1002";
System.out.println(s.replaceAll("\\d(?=,|$)", ".$0"));
// => 920.1,9271.0,9450.0,92.0,100.2
See the Java demo and a regex demo.
Details:
\\d - exactly 1 digit...
(?=,|$) - that must be before a , or end of string ($).
A capturing variation (Java demo):
String s = "9201,92710,94500,920,1002";
System.out.println(s.replaceAll("(\\d)(,|$)", ".$1$2"));
You were right to go for the replaceAll method. But your regex was not matching the end of the string, i.e. the last set of numbers.
Here is my take on your problem:
public static void main(String[] args) {
    String numbers = "9201,92710,94500,920,1002";
    System.out.println(numbers.replaceAll("(\\d,|\\d$)", ".$1"));
}
The regex (\\d,|\\d$) matches a digit followed by a comma (\d,), OR (|) a digit followed by the end of the string (\d$).
I have tested it and found it to work.
As others have suggested, you could add a comma at the end, run the replaceAll and then remove it, but that seems like extra effort.
Example:
public static void main(String[] args) {
    String numbers = "9201,92710,94500,920,1002";
    //add on the comma
    numbers += ",";
    numbers = numbers.replaceAll("(\\d,)", "\\.$1");
    //remove the comma
    numbers = numbers.substring(0, numbers.length()-1);
    System.out.println(numbers);
}

Regex matching word that is in the middle of any character except a letter

I'd like to know how to detect word that is between any characters except a letter from alphabet. I need this, because I'm working on a custom import organizer for Java. This is what I have already tried:
The regex expression:
[^(a-zA-Z)]InitializationEvent[^(a-zA-Z)]
I'm searching for the word "InitializationEvent".
The code snippet I've been testing on:
public void load(InitializationEvent event) {
It looks like adding space before the word helps... is the parenthesis inside of alphabet range?
I tested this in my program and it didn't work. Also I checked it on regexr.com, showing same results - class name not recognized.
Am I doing something wrong? I'm new to regex, so it might be a really basic mistake, or not. Let me know!
Lose the parentheses:
[^a-zA-Z]InitializationEvent[^a-zA-Z]
Inside [], parentheses are taken literally, so by negating the character class (^) you also exclude (, which prevents a match because a ( precedes InitializationEvent in your string.
Note, however, that the above regex will only match if InitializationEvent is neither at the beginning nor at the end of the tested string. To allow that, you can use:
(^|[^a-zA-Z])InitializationEvent([^a-zA-Z]|$)
Or, without creating any capturing groups (which is supposed to be cleaner, and perform better):
(?:^|[^a-zA-Z])InitializationEvent(?:[^a-zA-Z]|$)
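A quick sketch of how this pattern behaves, using the line from the question plus two made-up inputs. The class name `BoundaryAlternationDemo` and the helper `containsWord` are illustrative. Note that the delimiting characters are consumed, so they become part of the matched text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryAlternationDemo {
    static final Pattern WORD =
            Pattern.compile("(?:^|[^a-zA-Z])InitializationEvent(?:[^a-zA-Z]|$)");

    // Hypothetical helper: true if the word occurs with no letter directly around it.
    static boolean containsWord(String line) {
        return WORD.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("public void load(InitializationEvent event) {")); // true
        System.out.println(containsWord("InitializationEvent"));                            // true
        System.out.println(containsWord("BInitializationEventA"));                          // false

        // The surrounding non-letters are part of the match itself.
        Matcher m = WORD.matcher("load(InitializationEvent event)");
        if (m.find()) {
            System.out.println("[" + m.group() + "]"); // [(InitializationEvent ]
        }
    }
}
```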
how to detect word that is between any characters except a letter from alphabet
This is the case where lookarounds come handy. You can use:
(?<![a-zA-Z])InitializationEvent(?![a-zA-Z])
(?<![a-zA-Z]) is a negative lookbehind asserting that there is no letter at the previous position
(?![a-zA-Z]) is a negative lookahead asserting that there is no letter at the next position
The parentheses are causing the problem, just skip them:
"[^a-zA-Z]InitializationEvent[^a-zA-Z]"
or use the predefined non-word character class which is slightly different because it also excludes numbers and the underscore:
"\\WInitializationEvent\\W"
But as it seems you want to match a class name, this might be ok, because the characters it additionally excludes (digits and the underscore) are also allowed in a class name.
I'm not sure about your application but from a regexp perspective you can use negative lookaheads and negative lookbehinds to define what cannot surround the String to specify a match.
I have added the negative lookahead (?![a-zA-Z]) and the negative lookbehind (?<![a-zA-Z]) in place of your [^(a-zA-Z)] originally supplied to create: (?<![a-zA-Z])InitializationEvent(?![a-zA-Z])
Quick Fiddle I created:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HelloWorld {
    public static void main(String[] args) {
        String pattern = "(?<![a-zA-Z])InitializationEvent(?![a-zA-Z])";
        String sourceString = "public void load(InitializationEvent event) {";
        String sourceString2 = "public void load(BInitializationEventA event) {";
        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(sourceString);
        if (m.find()) {
            System.out.println("Found value of pattern in sourceString: " + m.group(0));
        } else {
            System.out.println("NO MATCH in sourceString");
        }
        Matcher m2 = r.matcher(sourceString2);
        if (m2.find()) {
            System.out.println("Found value of pattern in sourceString2: " + m2.group(0));
        } else {
            System.out.println("NO MATCH in sourceString2");
        }
    }
}
output:
sh-4.3$ java -Xmx128M -Xms16M HelloWorld
Found value of pattern in sourceString: InitializationEvent
NO MATCH in sourceString2
You seem really close:
[^(a-zA-Z)]*(InitializationEvent)[^(a-zA-Z)]*
I think this is what you are looking for. The asterisk provides a match for zero or many of the character or group before it.
EDIT/UPDATE
My apologies on the initial response.
[^a-zA-Z]+(InitializationEvent)[^a-zA-Z]+
My regex is a little rusty, but this will match on any non-alphabet character one or many times prior to the InitializationEvent and after.

Java split regex non-greedy match not working

Why is non-greedy match not working for me? Take following example:
public String nonGreedy() {
    String str2 = "abc|s:0:\"gef\";s:2:\"ced\"";
    return str2.split(":.*?ced")[0];
}
In my eyes the result should be: abc|s:0:\"gef\";s:2 but it is: abc|s
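The .*? in your regex matches any character except line terminators (0 or more times, matching the least amount possible). Laziness only affects where a match ends, not where it starts: the engine still starts at the leftmost colon, so the match runs from the first : up to ced.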
You can try the regular expression:
:[^:]*?ced
On another note, you should use a constant Pattern to avoid recompiling the expression every time, something like:
private static final Pattern REGEX_PATTERN =
        Pattern.compile(":[^:]*?ced");

public static void main(String[] args) {
    String input = "abc|s:0:\"gef\";s:2:\"ced\"";
    System.out.println(java.util.Arrays.toString(
            REGEX_PATTERN.split(input)
    )); // prints "[abc|s:0:"gef";s:2, "]"
}
It is behaving as expected. The non-greedy match will match as little as it has to, and with your input, the minimum characters to match is the first colon to the next ced.
You could try limiting the number of characters consumed. For example, to limit the term to up to 3 characters:
:.{0,3}ced
To make it split as close to ced as possible, use a negative look-ahead, with this regex:
:(?!.*:.*ced).*ced
This makes sure there isn't a closer colon to ced.
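To compare the patterns discussed in this thread side by side, here is a small sketch using the input from the question. The class name `SplitDemo` and the helper `firstPart` are invented for the example.

```java
class SplitDemo {
    static final String INPUT = "abc|s:0:\"gef\";s:2:\"ced\"";

    // Hypothetical helper: the text before the first match of the given regex.
    static String firstPart(String regex) {
        return INPUT.split(regex)[0];
    }

    public static void main(String[] args) {
        // Lazy quantifier: the match still starts at the leftmost colon.
        System.out.println(firstPart(":.*?ced"));             // abc|s
        // Excluding colons forces the match to start at the last colon before ced.
        System.out.println(firstPart(":[^:]*?ced"));          // abc|s:0:"gef";s:2
        // Negative lookahead: reject any colon that has another colon between it and ced.
        System.out.println(firstPart(":(?!.*:.*ced).*ced"));  // abc|s:0:"gef";s:2
    }
}
```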
