I'm not understanding something about how Java's regex matching for \s works. In the simple class below, \s seems to match [at least] $ and *, which is worrisome. When I don't include \s, the last char of each word gets chopped. And, neither regex seems to catch the ending " in the string. Would somebody please explain what's going on? Or point me to a useful resource? Thanks.
public class SanitizeText {
public static void main(String[] args)
{
String s = "123. ... This is Evil !##$ Wicked %^&* _ Mean ()+<> and ;:' - Nasty. \\ =\"";
String t = "123. ... This is Evil !##$ Wicked %^&* _ Mean ()+<> and ;:' - Nasty. \\ =\"";
s = s.replaceAll(".[^\\w\\s.]", " "); // Does the \s match non-space chars? Seems like at least $ and * are matched.
s = s.replaceAll(" {2,}", " ");
t = t.replaceAll(".[^\\w.]", " "); // Why does this regex chop the trailing char of each word?
t = t.replaceAll(" {2,}", " ");
System.out.println ("s: " + s);
System.out.println ("t: " + t);
}
}
// produces:
// s: 123. ... This is Evil $ Wicked * _ Mean and Nasty . "
// t: 123 .. Thi i Evi Wicke Mea an Nast "
\\s does not match non-space chars.
The regex .[^\\w\\s.] matches any character followed by one character that is not a word character, whitespace, or a period.
It seems to work exactly like that to me.
Answer to: Why does this regex chop the trailing char of each word?
.[^\\w.] matches any character (the .) followed by a non-word, non-dot character and replaces both with a space. So it matches the last letter of each word together with the whitespace that follows it.
Answer to: Does the \s match non-space chars? Seems like at least $ and * are matched.
No. You are matching a char (.) followed by a non-word, non-whitespace, non-period character. So two characters are matched each time.
.[^\\w\\s.]
will match on
Wicked %^&* _
1. ^^
2. ^^
and the * is not matched, because it is followed by whitespace; therefore it is not replaced.
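A minimal sketch of one way to avoid the two-character match, assuming the goal is to replace each unwanted symbol with a space and then collapse runs of spaces. The class name SanitizeSketch and the method clean are hypothetical; the only change to the question's regex is dropping the leading dot.

```java
import java.util.regex.Pattern;

final class SanitizeSketch {
    // One character that is not a word character, whitespace, or a period.
    static final Pattern UNWANTED = Pattern.compile("[^\\w\\s.]");
    // Two or more consecutive spaces.
    static final Pattern SPACES = Pattern.compile(" {2,}");

    // Hypothetical helper: replace each unwanted symbol, then collapse spaces.
    static String clean(String input) {
        String spaced = UNWANTED.matcher(input).replaceAll(" ");
        return SPACES.matcher(spaced).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Original pattern: the leading '.' also consumes the character before the symbol.
        System.out.println("ab$c".replaceAll(".[^\\w\\s.]", " ")); // a c
        // Without the leading '.', only the symbol itself is replaced.
        System.out.println(clean("ab$c"));                         // ab c
        System.out.println(clean("Evil !##$ Wicked"));             // Evil Wicked
    }
}
```

Because no extra character is consumed, a symbol that is followed by whitespace (such as the * in the walkthrough above) is replaced as well.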
Related
I have a string
string 1(excluding the quotes) -> "my car number is #8746253 which is actually cool"
conditions - The number 8746253, could be of any length and
- the number can also be immediately followed by an end-of-line.
I want to group-out 8746253 which should not be followed by a dot "."
I have tried,
.*#(\d+)[^.].*
This will get me the number for sure, but it matches even if there is a dot: \d+ backtracks by one digit, and [^.] then matches the last digit of the number (for example, the 3 in the case below).
string 2(excluding the quotes) -> "earth is #8746253.Kms away, which is very far"
I want to match only the string 1 type and not the string 2 types.
To match any number of digits after # that are not followed with a dot, use
(?<=#)\d++(?!\.)
The ++ is a possessive quantifier: the regex engine checks the lookahead (?!\.) only after the last matched digit and will not backtrack into the digits if a dot follows. So the whole match fails if there is a dot after the last digit in a digit chunk.
To match the whole line and put the digits into capture group #1:
.*#(\d++)(?!\.).*
Or a version without a lookahead:
^.*#(\d++)(?:[^.\r\n].*)?$
In this last version, the digit chunk can only be followed by an optional sequence ((?:[^.\r\n].*)?) that consists of one char that is not a ., CR, or LF and then any 0+ chars other than line break chars, and after that by the end of the string ($).
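The difference between the possessive and the ordinary greedy quantifier can be sketched in Java as follows. The class name HashNumberSketch and the helper firstNumber are hypothetical; the patterns are the capture-group form of the regex above, with \d+ added only for contrast.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashNumberSketch {
    // Possessive: once the digits are taken, the engine never gives one back.
    static final Pattern POSSESSIVE = Pattern.compile("#(\\d++)(?!\\.)");
    // Greedy: the engine may give back the last digit to satisfy the lookahead.
    static final Pattern GREEDY = Pattern.compile("#(\\d+)(?!\\.)");

    // Hypothetical helper: capture group 1 of the first match, if any.
    static Optional<String> firstNumber(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String ok = "my car number is #8746253 which is actually cool";
        String dotted = "earth is #8746253.Kms away, which is very far";

        System.out.println(firstNumber(POSSESSIVE, ok));      // Optional[8746253]
        System.out.println(firstNumber(POSSESSIVE, dotted));  // Optional.empty
        // Greedy backtracks one digit, so the truncated number slips through.
        System.out.println(firstNumber(GREEDY, dotted));      // Optional[874625]
    }
}
```

Since the lookahead consumes nothing, the possessive pattern also accepts a number at the very end of the input.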
This works like you have described
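import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyRegex {
    public static void main(String[] args) {
        // '#', then digits captured possessively (no backtracking), then one non-dot character.
        // Note: [^\\.] must consume a character, so a number at the very end of the input does not match.
        Pattern pattern = Pattern.compile("#(\\d++)[^\\.]");
        Matcher matcher1 = pattern.matcher("my car number is #8746253 which is actually cool");
        if (matcher1.find()) {
            System.out.println(matcher1.group(1));
        }
        Matcher matcher2 = pattern.matcher("earth is #8746253.Kms away, which is very far");
        if (matcher2.find()) {
            System.out.println(matcher2.group(1));
        } else {
            System.out.println("No match found");
        }
    }
}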
Outputs:
> 8746253
> No match found
I need to strip off all the leading and trailing characters from a string up to the first and last digit respectively.
Example : OBC9187A-1%A
Should return : 9187A-1
How do I achieve this in Java?
I understand regex is the solution, but I am not good at it.
I tried this replaceAll("([^0-9.*0-9])","")
But it returns only digits and strips all the alpha/special characters.
Here is a self-contained example of using regex and Java to solve your problem. I would also suggest working through a regex tutorial of some kind.
// Requires java.util.regex.Pattern and java.util.regex.Matcher.
public static void main(String[] args) {
    String test = "OBC9187A-1%A";
    // A digit, then anything (greedy), then another digit.
    Pattern p = Pattern.compile("\\d.*\\d");
    Matcher m = p.matcher(test);
    while (m.find()) {
        System.out.println("Match: " + m.group());
    }
}
Output:
Match: 9187A-1
\d matches any digit, .* matches any character (except line terminators) zero or more times, and the final \d matches another digit. We write \\d in the Java source because \ is the escape character in string literals and must itself be escaped. So this regex matches a digit, followed by anything, followed by another digit. Because .* is greedy, it takes the longest possible match, which runs from the first digit to the last digit and includes everything in between. The while loop is there because it would iterate over all matches if there were more than one. In this case there can only be one match, so you can keep the while loop or change it to an if like this:
if(m.find())
{
System.out.println("Match: " + m.group());
}
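To see what greediness contributes here, the following sketch compares the pattern above with a reluctant variant (.*?), which is added only for contrast and is not part of the answer. The class name GreedyVsReluctant and the helper firstMatch are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Hypothetical helper: text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String test = "OBC9187A-1%A";
        // Greedy .* stretches to the last digit in the input.
        System.out.println(firstMatch("\\d.*\\d", test));  // 9187A-1
        // Reluctant .*? stops at the nearest following digit.
        System.out.println(firstMatch("\\d.*?\\d", test)); // 91
    }
}
```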
This will strip leading and trailing non-digit characters from string s.
String s = "OBC9187A-1%A";
s = s.replaceAll("^\\D+", "").replaceAll("\\D+$", "");
System.out.println(s);
// prints 9187A-1
Regex explanation
^\D+
^ assert position at start of the string
\D+ match any character that's not a digit [^0-9]
Quantifier: + Between one and unlimited times, as many times as possible
\D+$
\D+ match any character that's not a digit [^0-9]
Quantifier: + Between one and unlimited times, as many times as possible
$ assert position at end of the string
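The two approaches in this thread differ on edge cases, which the sketch below tries to make visible. It assumes single-line input; the class name DigitTrim, the method names, and the pattern \d(?:.*\d)? (a variant that also accepts a lone digit) are illustrative assumptions, not part of the answers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DigitTrim {
    // A digit, optionally followed by anything that ends in another digit.
    static final Pattern DIGIT_SPAN = Pattern.compile("\\d(?:.*\\d)?");

    // Strip leading and trailing non-digits with two replaceAll calls.
    static String viaReplace(String s) {
        return s.replaceAll("^\\D+", "").replaceAll("\\D+$", "");
    }

    // Extract the span from the first digit to the last digit, or "" if none.
    static String viaFind(String s) {
        Matcher m = DIGIT_SPAN.matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(viaReplace("OBC9187A-1%A")); // 9187A-1
        System.out.println(viaFind("OBC9187A-1%A"));    // 9187A-1
        // A single digit: \d.*\d would find nothing, the optional group still does.
        System.out.println(viaFind("A7B"));             // 7
        System.out.println(viaReplace("A7B"));          // 7
    }
}
```

For input without any digit, both helpers return an empty string.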
I have the following Java code:
public static void main(String[] args) {
String var = "ROOT_CONTEXT_MATCHER";
boolean matches = var.matches("/[A-Z][a-zA-Z0-9_]*/");
System.out.println("The value of 'matches' is: " + matches);
}
This prints: The value of 'matches' is: false
Why doesn't my var match the regex? If I am reading my regex correctly, it matches any String:
Beginning with an upper-case char, A-Z; then
Consisting of zero or more:
Lower-case chars a-z; or
Upper-case chars A-Z; or
Digits 0-9; or
An underscore
The String "ROOT_CONTEXT_MATCHER":
Starts with an A-Z char; and
Consists of 19 subsequent characters that are all upper-case A-Z or are an underscore
What's going on here?!?
The issue is with the forward slash characters at the beginning and at the end of the regex. They don't have any special meaning here and are treated as literals. Simply remove them to get it fixed:
boolean matches = var.matches("[A-Z][a-zA-Z0-9_]*");
If you intended to use metacharacters for boundary matching, the correct characters are ^ for the beginning of the line, and $ for the end of the line:
boolean matches = var.matches("^[A-Z][a-zA-Z0-9_]*$");
although these are not needed here because String#matches would match the entire string.
You need to remove the regex delimiters, i.e. /, from the Java regex:
boolean matches = var.matches("[A-Z][a-zA-Z0-9_]*");
That can be further shortened to:
boolean matches = var.matches("[A-Z]\\w*");
since \\w is equivalent to [a-zA-Z0-9_] (a word character).
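A short sketch of the two points made in these answers: the slashes are matched literally, and matches tests the entire string while find looks for a match anywhere. The class name MatchesVsFind and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

final class MatchesVsFind {
    static final Pattern IDENT = Pattern.compile("[A-Z]\\w*");

    // True only if the whole string is an identifier of this shape.
    static boolean wholeString(String s) {
        return IDENT.matcher(s).matches();
    }

    // True if such an identifier occurs anywhere in the string.
    static boolean anywhere(String s) {
        return IDENT.matcher(s).find();
    }

    public static void main(String[] args) {
        String var = "ROOT_CONTEXT_MATCHER";
        // The slashes are literal characters, so the input would need them too.
        System.out.println(var.matches("/[A-Z][a-zA-Z0-9_]*/"));      // false
        System.out.println("/ROOT/".matches("/[A-Z][a-zA-Z0-9_]*/")); // true
        System.out.println(wholeString(var));                         // true
        // matches is implicitly anchored at both ends; find is not.
        System.out.println(wholeString("xROOT"));                     // false
        System.out.println(anywhere("xROOT"));                        // true
    }
}
```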
I need to split a text and get only words, numbers and hyphenated compound words. I also need to get Latin words, so I used \p{L}, which gives me é, ú, ü, ã, and so forth. The example is:
String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! # # $ % ^& * ( ) + - _ #$% \" ' : ; > < / \\ | , here some is wrong… * + () e -";
Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String words[] = pattern.split( myText );
What is wrong with this regex? Why does it keep symbols like "(", "+", "-", "*" and "|" in the results?
Some of the results are:
dresse // OK
sud-est // OK
occident) // WRONG
987 // OK
() // WRONG
(a // WRONG
* // WRONG
- // WRONG
+ // WRONG
( // WRONG
| // WRONG
The regex explanation is:
[^\p{L}+(\-\p{L}+)*\d]+
* Word separator will be:
* [^ ... ] No sequence in:
* \p{L}+ Any latin letter
* (\-\p{L}+)* Optionally hyphenated
* \d or numbers
* [ ... ]+ once or more.
If my understanding of your requirement is correct, this regex will match what you want:
"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"
It will match:
A contiguous sequence of Unicode Latin script characters. I restrict it to Latin script, since \p{L} will match a letter in any script. Change \\p{IsLatin} to \\pL if your version of Java doesn't support the syntax.
Or several such sequences, hyphenated
Or a contiguous sequence of decimal digits (0-9)
To use the regex above, pass it to Pattern.compile, call matcher(String input) to obtain a Matcher object, and loop over the matches with find().
Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);
while (matcher.find()) {
System.out.println(matcher.group());
}
If you want to allow words with apostrophe ':
"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"
I also escape - in the character class ['\\-] just in case you want to add more. Actually - doesn't need escaping if it is the first or last in the character class, but I escape it anyway just to be safe.
If the opening bracket of a character class is followed by a ^, the class matches any character that is not listed inside it. So your regex matches one or more characters that are anything except a Unicode letter, +, (, -, ), * or a digit.
Note that characters like +, (, ) and * don't have any special meaning inside a character class.
pattern.split splits the string wherever the regex matches. Your regex matches whitespace (among other characters), so a split occurs at each run of one or more whitespace characters, while symbols such as ( and + stay in the pieces. That gives the results you are seeing.
For example consider this
Pattern pattern = Pattern.compile("a");
for (String s : pattern.split("sda a f g")) {
System.out.println("==>"+s);
}
Output will be
==>sd
==>
==> f g
A character class written with [] can contain literal characters, classes (\p{...}), ranges (e.g. a-z) and the complement symbol (^). The other metacharacters you are using (+, *, ( and )) lose their special meaning inside the brackets, so you have to place them outside the [ ] block.
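The points above can be checked with a small sketch: the question's separator class treats + as a character to keep, while matching the tokens directly with find drops the stray symbols. The class name WordTokens and the method tokens are hypothetical; the token pattern is the one suggested in the first answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordTokens {
    // The question's separator: inside [...], the characters + ( ) * are literals.
    static final Pattern ORIGINAL_SEPARATOR = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
    // Latin words, optionally hyphenated, or runs of digits.
    static final Pattern TOKEN = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");

    // Hypothetical helper: collect every token match in order.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // '+' is excluded from the separator class, so it survives the split.
        System.out.println(Arrays.asList(ORIGINAL_SEPARATOR.split("a + b"))); // [a, +, b]
        // Matching tokens instead of separators leaves the symbols out.
        System.out.println(tokens("Notre-Dame (1330) + - *"));               // [Notre-Dame, 1330]
    }
}
```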
I need to be able to return signed and unsigned integer constants with no intervening symbols, possibly preceded by + or -. The only allowed digits are 3, 4, and 5.
I can't figure out a way to say that the expression must not contain a period before or after the integer.
This is what I have so far, but if I pass say "34.5 - 43" the string returned will be: "34 5 43".
All that needs to be returned is "43".
public String getInts(String toBeScanned){
String INT = "";
Pattern p = Pattern.compile("\\b[+-]?[3-5]+\\b");
Matcher m = p.matcher(toBeScanned);
if (m.matches() == true){
INT = toBeScanned;
}
else{
m = p.matcher(" " + toBeScanned);
while (m.find()){
INT = INT + m.group() + " ";
}
}
return INT;
}
Any thoughts or pushes in the right direction are appreciated. Is there a way to say it that the first and last character can be [\b and not .]
This is frustrating the heck out of me. Help!
You don't want a word boundary \b here. I think the best approach is to create your own assertion. Try this:
(?<![.\d])[+-]?[3-5]+(?![.\d])
(?<![.\d]) is a negative lookbehind assertion: the character before the match must not be a dot or a digit.
(?![.\d]) is a negative lookahead assertion: the character after the match must not be a dot or a digit.
Improvement
To avoid matching things like "hf34", we can make it stricter:
(?<![.\w])[+-]?[3-5]+(?![.\w])
The word boundary \b
\b matches at a change from a word character to a non-word character, or vice versa. A word character is a letter, a digit, or _. That means you will also get problems with the \b before the [+-], because there is no \b between a space (or the start of the string) and a + or -.
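A minimal sketch of the lookaround approach in Java, using the stricter pattern from this answer. The class name SignedInts and the method find are hypothetical, and returning a list instead of a space-joined string is an assumption made for easier checking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SignedInts {
    // No dot or word character directly before or after the number.
    static final Pattern INT = Pattern.compile("(?<![.\\w])[+-]?[3-5]+(?![.\\w])");
    // The question's pattern, kept only to show where \b falls short.
    static final Pattern WITH_WORD_BOUNDARY = Pattern.compile("\\b[+-]?[3-5]+\\b");

    // Hypothetical helper: all matches of the given pattern, in order.
    static List<String> find(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(find(INT, "34.5 - 43"));          // [43]
        System.out.println(find(INT, "+35 and -4"));         // [+35, -4]
        System.out.println(find(INT, "hf34"));               // []
        // \b sits between '-' and '4', so the sign is lost.
        System.out.println(find(WITH_WORD_BOUNDARY, " -43")); // [43]
    }
}
```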
"\b[+-]?[3-5]+[.][3-5]+\b"
This pattern says that in order to match, there must be at least one number before, and one number after the decimal point.
Is there a way to say it that the first and last character can be [\b and not .]
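[^\.\b]
is meant to allow \b but not '.'. Note, however, that this is a negated character class: it consumes one character that is not '.', and \b has no word-boundary meaning inside [...], so a boundary cannot be expressed this way.
Is that what you are looking for?
[^\.\b][+-]?[3-5]+[^\.\b]
is intended to match '43' but not '34.5'. Because each class must consume a character on either side of the number, a zero-width lookbehind and lookahead are the more reliable way to express this condition.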