I'm using the following regex to match city:
[a-zA-Z]+(?:[ '-][a-zA-Z]+)*
The problem is it not only matches the city but also part of the street name.
How can I make it match only the city (such as Brooklyn and Columbia City)?
UPDATE:
The data is represented in 1 line of text (each address will be fed to regex engine separately):
2778 Ray Ridge Pkwy,
Brooklyn NY 1194-5954
1776 99th St,
Brooklyn NY 11994-1264
1776 99th St,
Columbia City OR 11994-1264
I suggest the following approach: match the words from the beginning of the string up to the first occurrence of 2 uppercase letters followed by the ZIP (see the look-ahead (?=\s+[A-Z]{2}\s+\d+-\d+) below):
^[A-Za-z]+(?:[\s'-]+[A-Za-z]+)*(?=\s+[A-Z]{2}\s+\d+-\d+)
See demo
The regex:
^ - start of the string
[A-Za-z]+ - matches a word
(?:[\s'-]+[A-Za-z]+)* - matches 0 or more words that...
(?=\s+[A-Z]{2}\s+\d+-\d+) - are right before a space + 2 uppercase letters, space, 1 or more digits, hyphen and 1 or more digits.
If the ZIP (or whatever the numbers stand for) is optional, you may just rely on the 2 uppercase letters:
^[A-Za-z]+(?:[\s'-]+[A-Za-z]+)*(?=\s+[A-Z]{2}\b)
Note that \b in \s+[A-Z]{2}\b is a word boundary that will force a non-word (space or punctuation or even end of string) to appear after 2 uppercase letters.
Just do not forget to use double backslash in Java to escape regex special metacharacters.
Here is a Java code demo:
String s = "Brooklyn NY 1194-5954";
Pattern pattern = Pattern.compile("^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2}\\b)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(0));
}
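To check the ZIP look-ahead variant against the sample lines from the question, here is a minimal sketch. It assumes each city line is passed to the regex on its own (as in the demo above); the class and method names are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityExtractor {
    // City words anchored at the start of the string; the look-ahead requires
    // whitespace, a 2-letter state, whitespace and a digits-hyphen-digits ZIP.
    private static final Pattern CITY = Pattern.compile(
            "^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2}\\s+\\d+-\\d+)");

    // Returns the city if the line has the expected shape, otherwise empty.
    static Optional<String> extractCity(String line) {
        Matcher m = CITY.matcher(line);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] lines = {
            "Brooklyn NY 1194-5954",
            "Columbia City OR 11994-1264",
            "1776 99th St,"            // street line: starts with a digit, no match
        };
        for (String line : lines) {
            System.out.println(line + " -> " + extractCity(line).orElse("(no match)"));
        }
    }
}
```

The state abbreviation is never part of the match because the look-ahead only checks for it without consuming it.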
OK.. I think I got it after hrs of tweaking and testing. May be helpful for someone else. This did the trick:
(?<=\n)[a-zA-Z]+(?:[ '-][a-z]+)* ?[A-Z]?[a-z]+
If all your data looks like the example in the question, the city is everything between the comma after the street and the run of at least 2 uppercase letters that represents the state.
The following pattern matches that structure and captures the city in group 1:
,\s+([a-zA-Z\s]*)[A-Z]{2,}?\s+
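A small sketch of how this comma-based pattern could be used from Java, assuming the whole address (street, comma, city, state, ZIP) is in one string. Because the capturing group also accepts whitespace, group 1 ends with the space before the state, so the sketch trims it. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaCityExtractor {
    // Comma, whitespace, then letters/whitespace (group 1) up to the state code.
    private static final Pattern AFTER_COMMA =
            Pattern.compile(",\\s+([a-zA-Z\\s]*)[A-Z]{2,}?\\s+");

    static Optional<String> extractCity(String address) {
        Matcher m = AFTER_COMMA.matcher(address);
        // group(1) includes the whitespace before the state, hence trim()
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractCity("2778 Ray Ridge Pkwy, Brooklyn NY 1194-5954").orElse("(no match)"));
        System.out.println(extractCity("1776 99th St,\nColumbia City OR 11994-1264").orElse("(no match)"));
    }
}
```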
I want to get the name that sits between the first number and the second number (the id).
I want to do this in Java Regex.
e.g. Car Care or Car Electronics & Accessories
# Name Id Child nodes
1 Car Care 15718271 Browse
2 Car Electronics & Accessories 2230642011 Browse
3 Exterior Accessories 15857511 Browse
I tried splitting the line with .split(" ")[1] but then it splits the words with spaces. Only gives one word within a phrase e.g. Car
Try this one: ^\d*[a-zA-Z &+-]*(\d*)[a-zA-Z ]*
The group (\d*) will contain the desired number.
If the strings before and after the number contain special characters, add them to the appropriate [] sections.
Explanation: '^' anchors the match at the beginning, '\d*' takes zero or more leading digits, [a-zA-Z &+-]* takes a string made of these characters, (\d*) captures the desired number, and [a-zA-Z ]* takes the string after the number. Use a regex editor to try this out.
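As a rough sketch of this pattern in Java: group 1 holds the id number. Since the question asks for the name, the sketch also includes a variant with the capturing group moved onto the name part; that variant is an assumption for illustration and not part of the answer above. Class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NodeLineParser {
    // Pattern from the answer: group 1 is the id number.
    static final Pattern ID = Pattern.compile("^\\d*[a-zA-Z &+-]*(\\d*)[a-zA-Z ]*");
    // Hypothetical variant: same pieces, but the group wraps the name instead.
    static final Pattern NAME = Pattern.compile("^\\d*([a-zA-Z &+-]*)\\d*[a-zA-Z ]*");

    // Returns group 1 of the given pattern with surrounding spaces removed.
    static String firstGroup(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String line = "2 Car Electronics & Accessories 2230642011 Browse";
        System.out.println(firstGroup(ID, line));   // the id
        System.out.println(firstGroup(NAME, line)); // the name
    }
}
```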
You can use
(?m)^\d+\s+(.*?)\s+\d{6,}\s
See the regex demo. Details:
(?m) - a multiline option
^ - start of a line
\d+ - one or more digits
\s+ - one or more whitespaces
(.*?) - Group 1: zero or more chars other than line break chars as few as possible
\s+ - one or more whitespaces
\d{6,} - six or more digits
\s - a whitespace.
See the Java demo:
String s = "# Name Id Child nodes\n1 Car Care 15718271 Browse \n2 Car Electronics & Accessories 2230642011 Browse\n3 Exterior Accessories 15857511 Browse";
Pattern pattern = Pattern.compile("(?m)^\\d+\\s+(.*?)\\s+\\d{6,}\\s");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(1));
}
Output:
Car Care
Car Electronics & Accessories
Exterior Accessories
I have strings like
patric NY abc other
patric ny
Expected output: patric ny and patric NY.
So patric ny is a variable part that could be an address, and abc MIGHT be present.
I want to retrieve whatever comes before ABC,
and if ABC is not present, the complete string.
I tried
(.+?(?=abc))
It gives me result for patric NY abc other but not for patric ny.
Any help would be gratefully appreciated.
Extracting approach
You may use
^(.*?)(?:\s+abc\b.*)?$
See the regex demo.
Details
^ - start of string
(.*?) - Capturing group 1: any 0+ chars other than line break chars, as few as possible
(?:\s+abc\b.*)? - an optional non-capturing group that matches 1+ whitespaces, abc, a word boundary and any 0+ chars other than line break chars, as many as possible
$ - end of string.
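The extracting approach can be sketched in Java as follows, assuming single-line input. The class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixExtractor {
    // Group 1 is everything before an optional "<whitespace>abc<rest>" tail.
    private static final Pattern BEFORE_ABC =
            Pattern.compile("^(.*?)(?:\\s+abc\\b.*)?$");

    static String beforeAbc(String input) {
        Matcher m = BEFORE_ABC.matcher(input);
        // Fall back to the input itself if the pattern does not apply
        return m.find() ? m.group(1) : input;
    }

    public static void main(String[] args) {
        System.out.println(beforeAbc("patric NY abc other")); // patric NY
        System.out.println(beforeAbc("patric ny"));           // patric ny
    }
}
```

Because of the word boundary, a string such as "patric abcd" is returned unchanged.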
Replacing approach
You may just remove 1+ whitespaces, abc and the rest from your string:
String result = input.replaceFirst("(?s)\\s+abc.*", "");
Or, if abc is a whole word:
String result = input.replaceFirst("(?s)\\s+abc\\b.*", "");
See the regex demo.
The replaceFirst() matches the first occurrence of the pattern and removes it.
Pattern details
(?s) - DOTALL flag making . match any char
\s+ - 1+ whitespaces
abc - an abc substring
\b - a word boundary
.* - the rest of the string
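A minimal runnable sketch of the replacing approach, with hypothetical names. It also shows two properties of the pattern as written: (?s) lets .* swallow following lines, and the match is case-sensitive, so an uppercase ABC is left alone unless a case-insensitive flag such as (?i) is added.

```java
class AbcTrimmer {
    // Removes "<whitespace>abc" (as a whole word) and everything after it.
    static String stripFromAbc(String input) {
        return input.replaceFirst("(?s)\\s+abc\\b.*", "");
    }

    public static void main(String[] args) {
        System.out.println(stripFromAbc("patric NY abc other"));      // patric NY
        System.out.println(stripFromAbc("patric ny"));                // patric ny
        System.out.println(stripFromAbc("patric NY abc\nnext line")); // patric NY
        System.out.println(stripFromAbc("patric NY ABC other"));      // unchanged
    }
}
```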
You can try this:
input.replaceFirst("(patric (?:NY|ny)) ((?:abc|ABC).*)","$1")
I'm trying to write Java code that shows a list of cities matching either the name of the city or its postal code:
I wrote many expressions but they didn't work 100%.
This is my last expression:
([A-Z_]+)(:)([0-9]+)
The expression should match either a city name (for example Lonéy' ed) or its postal code (for example 57000).
Does anyone have an idea how to improve my expression?
Thanx.
Since Java7 you can do the following :
Pattern.compile("([\\p{Alpha} '_-]+):(\\d{5})", Pattern.UNICODE_CHARACTER_CLASS)
Keep adding connecting characters (here [ '_-]) to cater for all your needs; keep the hyphen last so that it is treated as a literal hyphen rather than as a range.
The pattern doesn't make any assumptions about the case of the name of a place as in some non-Latin scripts there are no cases.
EDIT: added 5 digits postal code detection and a SPACE for name detection
I suggest using
"(?U)(\\p{Lu}[\\p{L}\\p{M}\\s'-]*):(\\d{5})\\b"
It means:
(?U) - a Pattern.UNICODE_CHARACTER_CLASS inline flag that makes the \b word boundary and the \d digit character class Unicode aware in the pattern
(\\p{Lu}[\\p{L}\\p{M}\\s'-]*) - Group 1 capturing:
\\p{Lu} - an uppercase Unicode letter
[\\p{L}\\p{M}\\s'-]* - 0 or more characters that are either Unicode letters (\\p{L}), diacritics (\\p{M}), whitespace (\\s), ' or - (NOTE that the hyphen must be at the end of the character class so that it could be treated as a literal hyphen)
: - a literal colon
(\\d{5}) - (Group 2) five digits
\\b - a word boundary so that we only match 5 digits not followed by a word char (not the first 5 digits of a longer digit sequence), can be replaced with "(?!\\d)"
See Java demo:
String s = "Lonéy' ed:57000";
Pattern pattern = Pattern.compile("(?U)(\\p{Lu}[\\p{L}\\p{M}\\s'-]*):(\\d{5})\\b");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
}
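The note about the hyphen position in a character class can be checked with a short sketch (class and method names are illustrative). With the hyphen between ' and _ the class becomes a range from U+0027 to U+005F, which includes digits, the colon and uppercase ASCII letters; with the hyphen last it is a plain literal.

```java
class HyphenInClass {
    // ['-_] is a RANGE from ' (0x27) to _ (0x5F)
    static boolean inRangeClass(String s) {
        return s.matches("['-_]");
    }

    // ['_-] is three literals: apostrophe, underscore, hyphen
    static boolean inLiteralClass(String s) {
        return s.matches("['_-]");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A", "5", ":", "-", "_"}) {
            System.out.println(s + "  range=" + inRangeClass(s)
                    + "  literal=" + inLiteralClass(s));
        }
    }
}
```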
I have a single input where users should enter their name and surname. The problem is that I need to validate it with a regex. Here is the list of requirements:
The name should start with a capital letter (not a space)
There can't be consecutive spaces
It must support names and surnames like these (everyone should be able to enter their own name). Example:
John Smith
and
Armirat Bair Hossan
And the last symbol shouldn't be space.
Please help,
ATM i have regex like
^\\p{L}\\[p{L} ,.'-]+$
but it denies ALL input, which is not good
Thanks for helping me
UPDATE:
CORRECT INPUT:
"John Smith"
"Alberto del Muerto"
INCORRECT
" John Smith "
" John Smith"
You can use
^[\p{Lu}\p{M}][\p{L}\p{M},.'-]+(?: [\p{L}\p{M},.'-]+)*$
or
^\p{Lu}\p{M}*+(?:\p{L}\p{M}*+|[,.'-])++(?: (?:\p{L}\p{M}*+|[,.'-])++)*+$
See the regex demo and demo 2
Java declaration:
if (str.matches("[\\p{Lu}\\p{M}][\\p{L}\\p{M},.'-]+(?: [\\p{L}\\p{M},.'-]+)*")) { ... }
// or if (str.matches("\\p{Lu}\\p{M}*+(?:\\p{L}\\p{M}*+|[,.'-])++(?: (?:\\p{L}\\p{M}*+|[,.'-])++)*+")) { ... }
The first regex breakdown:
^ - start of string (not necessary with matches() method)
[\p{Lu}\p{M}] - 1 uppercase Unicode letter (\p{Lu} matches any uppercase Unicode base letter, incl. precomposed ones) or 1 diacritic mark (\p{M})
[\p{L}\p{M},.'-]+ - matches 1 or more Unicode letters, diacritics, a ,, ., ' or - (if 1 letter names are valid, replace + with * at the end here)
(?: [\p{L}\p{M},.'-]+)* - 0 or more sequences of
- a space
[\p{L}\p{M},.'-]+ - 1 or more characters that are either Unicode letters or commas, or periods, or apostrophes or -.
$ - end of string (not necessary with matches() method)
NOTE: Sometimes, names contain curly apostrophes, you can add them to the character classes ([‘’]).
The 2nd regex is less efficient but is more accurate as it will only match diacritics after base letters. See more about matching Unicode letters at regular-expressions.info:
To match a letter including any diacritics, use \p{L}\p{M}*+.
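To see how the first regex behaves on the inputs listed in the question, here is a minimal sketch. The class and method names are made up; matches() is used, so the ^ and $ anchors are omitted.

```java
import java.util.regex.Pattern;

class NameValidator {
    // First regex from the answer, without anchors because matches() is used.
    private static final Pattern NAME = Pattern.compile(
            "[\\p{Lu}\\p{M}][\\p{L}\\p{M},.'-]+(?: [\\p{L}\\p{M},.'-]+)*");

    static boolean isValid(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "John Smith", "Alberto del Muerto", "Armirat Bair Hossan", // expected valid
            " John Smith ", " John Smith", "John  Smith"               // expected invalid
        };
        for (String s : inputs) {
            System.out.println("\"" + s + "\" -> " + isValid(s));
        }
    }
}
```

Leading or trailing spaces and doubled spaces are rejected because every space in the pattern must be followed by at least one non-space name character.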
Try this one
^[^- '](?=(?![A-Z]?[A-Z]))(?=(?![a-z]+[A-Z]))(?=(?!.*[A-Z][A-Z]))(?=(?!.*[- '][- ']))[A-Za-z- ']{2,}$
There is also an interactive Demo of this pattern available at an external website.
You made a typo: the second \\ should be in front of p.
However even then there is a check missing for a trailing space
"^\\p{L}[\\p{L} ,.'-]+$"
For a .matches the following would suffice
"\\p{L}[\\p{L} ,.'-]*[\\p{L}.]"
Names like "del Rey, Hidalgo" do not require an initial capital.
Also, I would advise simply calling .trim() on the input; imagine a user staring at input that was rejected because of a stray blank.
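A short sketch of the trim-then-match idea, with hypothetical names. Note that this more lenient pattern does not require an initial capital and, as written, does not reject repeated spaces inside the name.

```java
class LenientNameCheck {
    // Trim first, then require a letter at the start and a letter or period at the end.
    static boolean isValid(String raw) {
        return raw.trim().matches("\\p{L}[\\p{L} ,.'-]*[\\p{L}.]");
    }

    public static void main(String[] args) {
        System.out.println(isValid(" John Smith "));     // accepted after trimming
        System.out.println(isValid("del Rey, Hidalgo")); // lowercase start is accepted
        System.out.println(isValid("   "));              // empty after trimming: rejected
    }
}
```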
Try this
^[A-Z][a-z]+(([\s][A-Z])?[a-z]+){1,2}$
but use \\ instead of \ in Java
I'm trying to clean up this very Noisy (due to OCR) dataset of names and email addresses and one problem is multiple names in one entry, for example
"Fenner, Robert: Fishbume, Howard" should be "Fenner, Robert" and "Fishbume, Howard"
or "Fendrich, Karen N., Ricci, Vincent" should be "Fendrich, Karen N." and "Ricci, Vincent"
How could I use regex to find entries where two strings are separated by a comma or colon, that are themselves separated by a comma and then split the string?
other variations of the problem:
"'Emily Phaup ' Ryan, Thomas M" -> "Emily Phaup", "Ryan, Thomas M"
"A Lilly, Alisia Rudd, Andrew McComb, Daniel Lisbon, David Compton"
->"A Lilly", "Alisia Rudd", "Andrew McComb", "Daniel Lisbon", "David Compton"
"Abigail.Perlmangus.pm.com Jay.Poole#us.pm.com" -> "Abigail.Perlmangus.pm.com", "Jay.Poole#us.pm.com"
and a couple more.
I know that it might not be possible to separate all of these occurrences (especially without accidentally separating correct names) but separating some of them would definitely help
EDIT: I guess my question is a bit too broad, so I'll narrow it down a bit:
Is there a way to find Strings with the format "string1,string2, string3,string4" (the strings can contain any kind of chars and whitespaces) and split them into two separate strings: "string1,string2" and "string3,string4"?
And could someone give me some pointers on how to do it, because I'm quite inexperienced with regex.
Well, I would try something like this:
public static void main(String[] args) {
String regex = "(\\w+(,|:|$)\\s*\\w+)(,|:|$)";
Pattern pattern = Pattern.compile(regex);
String [] tests = {
"Fenner, Robert: Fishbume, Howard"
,"string1, string2, string3, string4"
};
for (String test : tests) {
Matcher matcher = pattern.matcher(test);
while(matcher.find()){
System.out.println(matcher.group(1));
}
}
}
Output :
Fenner, Robert
Fishbume, Howard
string1, string2
string3, string4
This won't work for all your cases, but it answers your last edit.
What I've done is search for word characters (\w+) followed by either , or : or the end of the string, then any whitespace and more word characters, followed again by , or : or the end of the string.
Regex detail
(\w+(,|:|$)\s*\w+)(,|:|$)
1st Capturing group (\w+(,|:|$)\s*\w+)
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Capturing group (,|:|$)
1st Alternative: ,
, matches the character , literally
2nd Alternative: :
: matches the character : literally
3rd Alternative: $
$ assert position at end of the string
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
3rd Capturing group (,|:|$)
1st Alternative: ,
, matches the character , literally
2nd Alternative: :
: matches the character : literally
3rd Alternative: $
$ assert position at end of the string
My honest recommendation is to take a representative sample to an online Regex calculator and play with it until you can stomach the output.
As you've noted, the input is not regular enough to really harness Regex. But you may be able to hack it down a little bit at least. There's probably not gonna be a one true perfect answer to that nastiness.