JAVA regular expression with space

JAVA regular expression with space - java

I have a problem with the following expression:
String REGEX_Miasto_Dwu_Czlonowe="\D+\s\D+";
Pattern pat_Miasto = Pattern.compile(REGEX_Miasto_Dwu_Czlonowe);
Matcher mat_Miasto_Dwu_Czlonowe = pat_Miasto.matcher(adres);
Because the above pattern matches
"80-227 GDAŃSK DOSTUDZIENKI 666";
"83000 PRUSZCZ GDANSKI UL. TYSIACLECIA 666";
But it only should match this expression : "PRUSZCZ GDANSKI UL. TYSIACLECIA 666";
THX for help.

You got some problems in your regexp.
first you have to change all backslashes to double backslashes but if you have matches it coulde be a copy and paste error
\D matches non-digits. Was this your intention?
\D+\s\D+
Debuggex Demo
Therefore you match some non digits followed by one space followed by some non digits.
I think it is more or less by incident that your expression matches.
This could be a solution for your regexp:
^\d+\D+\d+$
Debuggex Demo
If you want to match your second line as a literal, so that only this line matches you could use:
Matcher.quoteReplacement(String s)
to build this kind of expression from a simple String. All control characters are well escaped.

Your regex matches any number of non-digit \D+ a space \s then any number of non-digit. So it matches the first string:
80-227 GDAŃSK DOSTUDZIENKI 666
// ^__________\D+_________^^____________\D+_________^
// |
// a space
I guess you want:
[^\s\d]+\s[^\s\d]+
Any number of non-space/non digit, a space then any number of non-space/non-digit.

Related

Regular expression to determine if the String consists of more than 4 numbers

I want to extract URL strings from a log which looks like below:
<13>Mar 27 11:22:38 144.0.116.31 AgentDevice=WindowsDNS AgentLogFile=DNS.log PluginVersion=X.X.X.X Date=3/27/2019 Time=11:22:34 AM Thread ID=11BC Context=PACKET Message= Internal packet identifier=0000007A4843E100 UDP/TCP indicator=UDP Send/Receive indicator=Snd Remote IP=X.X.X.X Xid (hex)=9b01 Query/Response=R Opcode=Q Flags (hex)=8081 Flags (char codes)=DR ResponseCode=NOERROR Question Type=A Question Name=outlook.office365.com
I am looking to extract Name text which contains more that 5 digits.
A possible way suggested is (\d.*?){5,} but does not seem to work, kindly suggest another way get the field.
Example of string match:
outlook12.office345.com
outlook.office12345.com

You can look for the following expression:
Name=([^ ]*\d{5,}[^ ]*)
Explanation:
Name= look for anything that starts with "Name=", than capture if:
[^ ]* any number of characters which is not a space
\d{5,} then 5 digits in a row
[^ ]* then again, all digits up to a white space

This regular expression:
(?<=Name=).*\d{5,}.*?(?=\s|$)
would extract strings like outlook.office365666.com (with 5 or more consecutive digits) from your example input.
Demo: https://regex101.com/r/YQ5l2w/1

Try this pattern: (?=\b.*(?:\d[^\d\s]*){5,})\S*
Explanation:
(?=...) - positive lookahead, assures that pattern inside it is matched somewhere ahead :)
\b - word boundary
(?:...) - non-capturing group
\d[^\d\s]* - match digit \d, then match zero or more of any characters other than whitespace \s or digit \d
{5,} - match preceeding pattern 5 or more times
\S* - match zero or more of any characters other than space to match the string if assertion is true, but I think you just need assertion :)
Demo
If you want only consecutive numbers use simplified pattern (?=\b.*\d{5,})\S*.
Another demo
Of course, you have to add positive lookbehind: (?<=Name=) to assert that you have Name= string preceeding

Try this regex
([a-z0-9]{5,}.[a-z0-9]{5,})+.com
https://regex101.com/r/OzsChv/3
It Groups,
outlook.office365.com
outlook12.office345.com
also all url strings

Java regex shortest match

I have the following string, (a.1) (b.2) (c.3) (d.4). I want to change it to (1) (2) (3) (4). I use the following method.
str.replaceAll("\(.*[.](.*)\)","($1)"). And I only get (4). What is the correct method?
Thanks

Couple things here. First, your escapes for the parentheses are incorrect. In Java string literals, backslash itself is an escape character, meaning you need to use \\( to represent \( in regex.
I think your question is how to do non-greedy matches in regex. Use ? to specify non-greedy matching; e.g. *? means "zero or more times, but as few times as possible".
This doesn't negate other answers, but they depend on your test input being as simple as it is in your question. This gives me the correct output without changing the spirit of your original regex (that only the parentheses and dot delimiter are known to be present):
String test = "(a.1) (b.2) (c.3) (d.4)";
String replaced = test.replaceAll("\\(.*?[.](.*?)\\)", "($1)");
System.out.println(replaced); // "(1) (2) (3) (4)"

Root cause
You want to match ()-delimited substrings, but are using .* greedy dot pattern that can match any 0 or more chars (other than line break chars). The \(.*[.](.*)\) pattern will match the first ( in (a.1) (b.2) (c.3) (d.4), then .* will grab the whole string, and backtracking will start trying to accommodate text for the subsequent obligatory subpatterns. [.] will find the last . in the string, the one before the last digit, 4. Then, (.*) will again grab all the rest of the string, but since the ) is required right after, due to backtracking the last (.*) will only capture 4.
Why is lazy / reluctant .*? not a solution?
Even if you use \(.*?[.](.*?)\), if there are (xxx) like substrings inside the string, they will get matched together with expected matches, as . matches any char but line break chars.
Solution
.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)")
See the regex demo. The [^()] will only match any char BUT a ( and ).
Details
\( - a ( char
[^()]* - a negated character class matching 0 or more chars other than ( and )
\. - a dot
([^()]*) - Group 1 (its value is later referred to with $1 from the replacement pattern): any 0+ chars other than ( and )
\) - a ) char.
Java demo:
List<String> strs = Arrays.asList("(a.1) (b.2) (c.3) (d.4)", "(a.1) (xxxx) (b.2) (c.3) (d.4)");
for (String str : strs)
System.out.println("\"" + str.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)") + "\"");
Output:
"(1) (2) (3) (4)"
"(1) (xxxx) (2) (3) (4)"

try this one, it will match any alphabets, . and " and replace them all with empty ""
str.replaceAll("[a-zA-Z\\.\"]", "")
Edit:
You can use also [^\\d)(\\s] to match all characters that are not number, space and )( and replace them all with empty "" string
String str = "(a.1) (b.2) (c.3) (d.4)";
System.out.println(str.replaceAll("[^\\d)(\\s]",""));

Try this
str.replaceAll("[A-Za-z0-9]+\.","");
[A-Za-z0-9] will match the upper case, lower case and digits. If you want to match anything before the dot(.) you can use .+ or .* in the place of [A-Za-z0-9]+

How to replace all non-digit charaters in a string?

I need to replace all non-digit charaters in the string. For instance:
String: 987sdf09870987=-0\\\`42
Replaced: 987**sdf**09870987**=-**0**\\\`**42
That's all non-digit char-sequence wrapped into ** charaters. How can I do that with String::replaceAll()?
(?![0-9]+$).*
the regex doesn't match what I want. How can I do that?

(\\D+)
You can use this and replace by **$1**.See demo.
https://regex101.com/r/fM9lY3/2

You can use a negated character class for a non-digit and use the 0th group back-reference to avoid overhead with capturing groups (it is minimal here, but still is):
String x = "987sdf09870987=-0\\\\\\`42";
x = x.replaceAll("[^0-9]+", "**$0**");
System.out.println(x);
See demo on IDEONE. Output: 987**sdf**09870987**=-**0**\\\`**42.
Also, in Java regex, character classes look neater than multiple escape symbols, that is why I prefer this [^0-9]+ pattern meaning match 1 or more (+) symbols other than (because of ^) digits from 0 to 9 ([0-9]).
A couple of words about your (?![0-9]+$).* regex. It consists of a negative lookahead (?![0-9]+$) that checks if from the current position onward there are no digits only (if there are only digits up to the end of string, the match fails), and .* matching any characters but a newline. You can see example of what it is doing here. I do not think it can help you since you need to actually match non-numbers, not just check if digits are absent.

Can I negate the dot?

The following regular expression matches the character a:
"a"
The following regular expression matches all characters except a:
"[^a]"
The following regular expression matches a ton of characters:
"."
How do I match everything that is not matched by "."? I can't use the same technique as above:
"[^.]"
because inside the brackets, the . changes meaning and only stands for the character . itself :(

The below negative lookahead will work.
(?:(?!.)[\S\s])
Java regex would be,
"(?:(?!.)[\\S\\s])"
DEMO
The idea behind the above regex is, it would match only \r or \n or \t or \f that is the characters which aren't matched by a dot (Multiline mode).

"[^\\.]"
use double backslash for regex used character. for example
\\.\\]\\[\\-\\)\\(\\?

What is the responsibility of (.*) in the Java String?

What is the responsibility of (.*) in the third line and how it works?
String Str = new String("Welcome to Tutorialspoint.com");
System.out.print("Return Value :" );
System.out.println(Str.matches("(.*)Tutorials(.*)"));

.matches() is a call to parse Str using the regex provided.
Regex, or Regular Expressions, are a way of parsing strings into groups. In the example provided, this matches any string which contains the word "Tutorials". (.*) simply means "a group of zero or more of any character".
This page is a good regex reference (for very basic syntax and examples).

Your expression matches any word prefixed and suffixed by any character of word Tutorial. .* means occurrence of any character any number of times including zero times.
The . represents regular expression meta-character which means any character.
The * is a regular expression quantifier, which means 0 or more occurrences of the expression character it was associated with.

matches takes regular expression string as parameter and (.*) means capture any character zero or more times greedily

.* means a group of zero or more of any character

In Regex:
.
Wildcard: Matches any single character except \n
for example pattern a.e matches ave in nave and ate in water
*
Matches the previous element zero or more times
for example pattern \d*\.\d matches .0, 19.9, 219.9

There is no reason to put parentheses around the .*, nor is there a reason to instantiate a String if you've already got a literal String. But worse is the fact that the matches() method is out of place here.
What it does is greedily matching any character from the start to the end of a String. Then it backtracks until it finds "Tutorials", after which it will again match any characters (except newlines).
It's better and more clear to use the find method. The find method simply finds the first "Tutorials" within the String, and you can remove the "(.*)" parts from the pattern.
As a one liner for convenience:
System.out.printf("Return value : %b%n", Pattern.compile("Tutorials").matcher("Welcome to Tutorialspoint.com").find());

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

JAVA regular expression with space - java

Related

Regular expression to determine if the String consists of more than 4 numbers

Java regex shortest match

How to replace all non-digit charaters in a string?

Can I negate the dot?

What is the responsibility of (.*) in the Java String?

Categories

Resources