I'm trying to clean up a very noisy (due to OCR) dataset of names and email addresses. One problem is multiple names in a single entry, for example:
"Fenner, Robert: Fishbume, Howard" should be "Fenner, Robert" and "Fishbume, Howard"
or "Fendrich, Karen N., Ricci, Vincent" should be "Fendrich, Karen N." and "Ricci, Vincent"
How could I use regex to find entries where two strings, each of which itself contains a comma, are separated by a comma or colon, and then split the string at that separator?
Other variations of the problem:
"'Emily Phaup ' Ryan, Thomas M" -> "Emily Phaup", "Ryan, Thomas M"
"A Lilly, Alisia Rudd, Andrew McComb, Daniel Lisbon, David Compton"
->"A Lilly", "Alisia Rudd", "Andrew McComb", "Daniel Lisbon", "David Compton"
"Abigail.Perlmangus.pm.com Jay.Poole#us.pm.com" -> "Abigail.Perlmangus.pm.com", "Jay.Poole#us.pm.com"
and a couple more.
I know that it might not be possible to separate all of these occurrences (especially without accidentally separating correct names), but separating some of them would definitely help.
EDIT: I guess my question is a bit too broad, so I'll narrow it down a bit:
Is there a way to find strings with the format "string1,string2, string3,string4" (the strings can contain any kind of characters and whitespace) and split them into two separate strings: "string1,string2" and "string3,string4"?
And could someone give me some pointers on how to do it? I'm quite inexperienced with regex.
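Well, I would try something like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitNames { // class name is arbitrary
    public static void main(String[] args) {
        // Two words joined by ',' or ':', followed by another ',' / ':' or the end of the input
        String regex = "(\\w+(,|:|$)\\s*\\w+)(,|:|$)";
        Pattern pattern = Pattern.compile(regex);
        String[] tests = {
            "Fenner, Robert: Fishbume, Howard",
            "string1, string2, string3, string4"
        };
        for (String test : tests) {
            Matcher matcher = pattern.matcher(test);
            while (matcher.find()) {
                // Group 1 is one "word, word" pair without the trailing separator
                System.out.println(matcher.group(1));
            }
        }
    }
}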
Output :
Fenner, Robert
Fishbume, Howard
string1, string2
string3, string4
This won't work for all your cases, but it answers your last edit.
What I've done is search for word characters (\w+) followed by either , or : or the end of the string, then optional whitespace and more word characters, followed again by , or : or the end of the string.
Regex detail
(\w+(,|:|$)\s*\w+)(,|:|$)
1st Capturing group (\w+(,|:|$)\s*\w+)
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Capturing group (,|:|$)
1st Alternative: ,
, matches the character , literally
2nd Alternative: :
: matches the character : literally
3rd Alternative: $
$ assert position at end of the string
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
3rd Capturing group (,|:|$)
1st Alternative: ,
, matches the character , literally
2nd Alternative: :
: matches the character : literally
3rd Alternative: $
$ assert position at end of the string
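A hedged variation on the pattern above: \w+ stops at spaces and periods, so a name with an initial such as "Fendrich, Karen N." is not captured as a whole. The sketch below swaps \w+ for [^,:]+ (my substitution, not part of the answer above) and assumes every name contains exactly one internal comma. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PairSplitter {
    // Two separator-free chunks joined by a comma, ended by ',' or ':' or the end of the input.
    // Assumption: each name has exactly one internal comma ("Last, First").
    private static final Pattern PAIR = Pattern.compile("([^,:]+,[^,:]+?)\\s*(?:[,:]|$)");

    static List<String> splitPairs(String entry) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(entry);
        while (m.find()) {
            // Group 1 excludes the trailing separator; trim() drops the leading space
            pairs.add(m.group(1).trim());
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String name : splitPairs("Fendrich, Karen N., Ricci, Vincent")) {
            System.out.println(name);
        }
    }
}
```

Entries that do not follow the "Last, First" shape (for example the space-separated email addresses in the question) are out of scope for this sketch and would need their own rule.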
My honest recommendation is to take a representative sample to an online Regex calculator and play with it until you can stomach the output.
As you've noted, the input is not regular enough to really harness Regex. But you may be able to hack it down a little bit at least. There's probably not gonna be a one true perfect answer to that nastiness.
Related
I need regex to check if String has only one word (e.g. "This", "Country", "Boston ", " Programming ").
So far I used an alternative way of doing it which is to check if String contains spaces. However, I am sure that this can be done using regex.
One possible way in my opinion is "^\w{2,}\s". Does this work properly? Are there any other possible answers?
The pattern ^\w{2,}\s matches 2 or more word characters from the start of the string, followed by a mandatory whitespace char (which can also match a newline).
As the pattern is not anchored at the end, it can also match Boston in Boston test.
To match a single word of at least 2 characters surrounded by optional horizontal whitespace, use \h* on both sides and add the anchor $ to assert the end of the string:
^\h*\w{2,}\h*$
In Java
String regex = "^\\h*\\w{2,}\\h*$";
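A minimal sketch of how that pattern might be wrapped and tried on the sample inputs from the question. The class and method names are hypothetical; \h requires Java 8 or later.

```java
import java.util.regex.Pattern;

class SingleWordCheck {
    // \h* allows optional horizontal whitespace; \w{2,} requires at least two word characters
    private static final Pattern SINGLE_WORD = Pattern.compile("^\\h*\\w{2,}\\h*$");

    static boolean isSingleWord(String input) {
        return SINGLE_WORD.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"This", "Boston ", " Programming ", "Boston test"}) {
            System.out.println("\"" + s + "\" -> " + isSingleWord(s));
        }
    }
}
```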
Hi I am trying to do regex in java, I need to capture the last {n} words. (There may be a variable num of whitespaces between words). Requirement is it has to be done in regex.
So e.g. in
The man is very tall.
For n = 2, I need to capture
very tall.
So I tried
(\S*\s*){2}$
But this does not match in java because the initial words have to be consumed first. So I tried
^(.*)(\S*\s*){2}$
But .* consumes everything, and the last 2 words are ignored.
I have also tried
^\S?\s?(\S*\s*){2}$
Anyone know a way around this please?
You had almost got it in your first attempt.
Just change + to *.
The plus sign means at least one character; because there was no whitespace after the last word, the match failed.
The asterisk, on the other hand, means zero or more, so it will work.
See it live: (?:\S*\s*){2}$
Using replaceAll method, you could try this regex: ((?:\\S*\\s*){2}$)|.
Your regex contains, as you already mention, a greedy subpattern that eats up the whole string, and since (\S*\s*){2} can match an empty string, it matches an empty location at the end of the input string.
Lazy dot matching (changing .* to .*?) won't do the whole job since the capturing group is quantified, and the Matcher.group(1) will be set to the last captured non-whitespaces with optional whitespaces. You need to set the capturing group around the quantified group.
Since you most likely are using Matcher#matches, you can use
String str = "The man is very tall.";
Pattern ptrn = Pattern.compile("(.*?)((?:\\S*\\s*){2})"); // no need for `^`/`$` with matches()
Matcher matcher = ptrn.matcher(str);
if (matcher.matches()) { // Group 2 contains the last 2 "words"
System.out.println(matcher.group(2)); // => very tall.
}
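The same idea can be parameterised on n, as the question asks. This is only a sketch: the helper name lastWords and the choice to return an empty string when nothing matches are my assumptions, and n is assumed to be at least 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWords {
    // Hypothetical helper: returns the last n "words" (non-whitespace runs plus any
    // whitespace that follows them), or "" if the input does not match at all.
    static String lastWords(String text, int n) {
        // The lazy (.*?) gives back as much as possible to the quantified group 2
        Pattern p = Pattern.compile("(.*?)((?:\\S*\\s*){" + n + "})");
        Matcher m = p.matcher(text);
        return m.matches() ? m.group(2) : "";
    }

    public static void main(String[] args) {
        System.out.println(lastWords("The man is very tall.", 2));
        System.out.println(lastWords("The man is very tall.", 3));
    }
}
```

Because . does not match line terminators by default, multi-line input would need Pattern.DOTALL; that case is not handled here.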
I'm using the following regex to match city:
[a-zA-Z]+(?:[ '-][a-zA-Z]+)*
The problem is it not only matches the city but also part of the street name.
How can I make it match only the city (such as Brooklyn and Columbia City)?
UPDATE:
The data is represented in 1 line of text (each address will be fed to regex engine separately):
2778 Ray Ridge Pkwy,
Brooklyn NY 1194-5954
1776 99th St,
Brooklyn NY 11994-1264
1776 99th St,
Columbia City OR 11994-1264
I suggest the following approach: match the words from the beginning of the string up to the first occurrence of 2 uppercase letters followed by the ZIP (see the look-ahead (?=\s+[A-Z]{2}\s+\d+-\d+) below):
^[A-Za-z]+(?:[\s'-]+[A-Za-z]+)*(?=\s+[A-Z]{2}\s+\d+-\d+)
The regex:
^ - starts looking from the beginning of the string
[A-Za-z]+ - matches a word
(?:[\s'-]+[A-Za-z]+)* - matches 0 or more words that...
(?=\s+[A-Z]{2}\s+\d+-\d+) - are right before a space + 2 uppercase letters, space, 1 or more digits, hyphen and 1 or more digits.
If the ZIP (or whatever the numbers stand for) is optional, you may just rely on the 2 uppercase letters:
^[A-Za-z]+(?:[\s'-]+[A-Za-z]+)*(?=\s+[A-Z]{2}\b)
Note that \b in \s+[A-Z]{2}\b is a word boundary that will force a non-word (space or punctuation or even end of string) to appear after 2 uppercase letters.
Just do not forget to use double backslash in Java to escape regex special metacharacters.
Here is a Java code demo:
String s = "Brooklyn NY 1194-5954";
Pattern pattern = Pattern.compile("^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2}\\b)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(0));
}
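For the ZIP-anchored variant, a sketch along the same lines might look like this. Two things here are my assumptions rather than part of the answer: the helper name extractCity, and the use of Pattern.MULTILINE so that ^ also matches at the start of the second line when the street line is included in the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityExtractor {
    // City words, then a look-ahead for " ST 12345-6789"-like text (state + digits-digits).
    // MULTILINE is an assumption: it lets ^ match after the line break that follows the street.
    private static final Pattern CITY = Pattern.compile(
            "^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2}\\s+\\d+-\\d+)",
            Pattern.MULTILINE);

    // Hypothetical helper: returns the city, or "" when no city is found
    static String extractCity(String address) {
        Matcher m = CITY.matcher(address);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(extractCity("2778 Ray Ridge Pkwy,\nBrooklyn NY 1194-5954"));
        System.out.println(extractCity("1776 99th St,\nColumbia City OR 11994-1264"));
    }
}
```

The look-ahead keeps the state and ZIP out of the match, so no trimming or group extraction is needed afterwards.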
OK.. I think I got it after hrs of tweaking and testing. May be helpful for someone else. This did the trick:
(?<=\n)[a-zA-Z]+(?:[ '-][a-z]+)* ?[A-Z]?[a-z]+
In case all your data is like in your example in the question, the pattern in your data is everything from the comma after the street to minimum 2 uppercase letters which represents the state.
This pattern matches the pattern as described and selects a group which should represent the city:
,\s+([a-zA-Z\s]*)[A-Z]{2,}?\s+
My problem is to find a word between two words. Out of these two words one is an all UPPER CASE word which can be anything and the other word is "is". I tried out few regexes but none are helping me. Here is my example:
String :
In THE house BIG BLACK cat is very good.
Expected output :
cat
RegEx used :
(?<=[A-Z]*\s)(.*?)(?=\sis)
The above RegEx gives me BIG BLACK cat as output whereas I just need cat.
One solution is to simplify your regular expression a bit,
[A-Z]+\s(\w+)\sis
and use only the matched group (i.e., \1).
Since you came up with something more complex, I assume you understand all the parts of the above expression but for someone who might come along later, here are more details:
[A-Z]+ will match one or more upper-case characters
\s will match a space
(\w+) will match one or more word characters ([a-zA-Z0-9_]) and store the match in the first match group
\s will match a space
is will match "is"
My example is very specific and may break down for different input. Your question didn't provide many details about what other inputs you expect, so I'm not confident my solution will work in all cases.
Try this one:
String TestInput = "In THE house BIG BLACK cat is very good.";
Pattern p = Pattern
.compile(
"(?<=\\b\\p{Lu}+\\s) # lookbehind assertion to ensure a uppercase word before\n"
+ "\\p{L}+ # matching at least one letter\n"
+ "(?=\\sis) # lookahead assertion to ensure a whitespace is ahead\n"
, Pattern.COMMENTS);
Matcher m = p.matcher(TestInput);
if(m.find())
System.out.println(m.group(0));
it matches only "cat".
\p{L} is a Unicode property for a letter in any language.
\p{Lu} is a Unicode property for an uppercase letter in any language.
You want to look for a condition that depends on several pieces of information and then retrieve only a specific part of that information. That is not possible in a regex without grouping. In Java you should do it like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[A-Z]+\\s(\\w+)\\sis");
        Matcher matcher = pattern.matcher("In THE house BIG BLACK cat is very good.");
        if (matcher.find()) {
            System.out.println(matcher.group(1)); // the word captured by (\w+)
        }
    }
}
group(1) is the part of the pattern with brackets around it, in this case \w+. And that's your word. The return type of group() is String, so you can use it right away.
The following part has strange behavior:
(?<=[A-Z]*\s)(.*?)
For some reason [A-Z]* is matching an empty string, and (.*?) is matching BIG BLACK. With a few tweaks, I think the following will work (but it still matches some false positives):
(?<=[A-Z]+\s)(\w+)(?=\sis)
A slightly better regex would be:
(?<=\b[A-Z]+\s)(\w+)(?=\sis)
Hope it helps
String m = "In THE house BIG BLACK cat is very good.";
Pattern p = Pattern.compile("[A-Z]+\\s\\w+\\sis");
Matcher m1 = p.matcher(m);
if(m1.find()){
String group []= m1.group().split("\\s");// split by space
System.out.println(group[1]);// print the second element
}
I'm exploring the power of regular expressions, so I'm just wondering if something like this is possible:
public class StringSplit {
public static void main(String args[]) {
System.out.println(
java.util.Arrays.deepToString(
"12345".split(INSERT_REGEX_HERE)
)
); // prints "[12, 23, 34, 45]"
}
}
If possible, then simply provide the regex (and preemptively some explanation on how it works).
If it's only possible in some regex flavors other than Java, then feel free to provide those as well.
If it's not possible, then please explain why.
BONUS QUESTION
Same question, but with a find() loop instead of split:
Matcher m = Pattern.compile(BONUS_REGEX).matcher("12345");
while (m.find()) {
System.out.println(m.group());
} // prints "12", "23", "34", "45"
Please note that it's not so much that I have a concrete task to accomplish one way or another, but rather I want to understand regular expressions. I don't need code that does what I want; I want regexes, if they exist, that I can use in the above code to accomplish the task (or regexes in other flavors that work with a "direct translation" of the code into another language).
And if they don't exist, I'd like a good solid explanation why.
I don't think this is possible with split(), but with find() it's pretty simple. Just use a lookahead with a capturing group inside:
Matcher m = Pattern.compile("(?=(\\d\\d)).").matcher("12345");
while (m.find())
{
System.out.println(m.group(1));
}
Many people don't realize that text captured inside a lookahead or lookbehind can be referenced after the match just like any other capture. It's especially counter-intuitive in this case, where the capture is a superset of the "whole" match.
As a matter of fact, it works even if the regex as a whole matches nothing. Remove the dot from the regex above ("(?=(\\d\\d))") and you'll get the same result. This is because, whenever a successful match consumes no characters, the regex engine automatically bumps ahead one position before trying to match again, to prevent infinite loops.
There's no split() equivalent for this technique, though, at least not in Java. Although you can split on lookarounds and other zero-width assertions, there's no way to get the same character to appear in more than one of the resulting substrings.
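The capture-inside-lookahead technique generalises from pairs to windows of any width. The following is a sketch only; the class name, the method name ngrams, and the parameter n are my own choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingNGrams {
    // Hypothetical helper: every substring of length n, found via a zero-width lookahead.
    static List<String> ngrams(String s, int n) {
        // The overall match is empty, so find() advances one position at a time,
        // while group 1 captures the n characters seen by the lookahead.
        Matcher m = Pattern.compile("(?=(.{" + n + "}))").matcher(s);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("12345", 2));
        System.out.println(ngrams("12345", 3));
    }
}
```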
This somewhat heavy implementation using Matcher.find instead of split will also work, although by the time you have to code a for loop for such a trivial task you might as well drop the regular expressions altogether and use substrings (for similar coding complexity minus the CPU cycles):
import java.util.*;
import java.util.regex.*;
public class StringSplit {
public static void main(String args[]) {
ArrayList<String> result = new ArrayList<String>();
for (Matcher m = Pattern.compile("..").matcher("12345"); m.find(result.isEmpty() ? 0 : m.start() + 1); result.add(m.group()));
System.out.println( result.toString() ); // prints "[12, 23, 34, 45]"
}
}
EDIT1
match(): the reason why nobody so far has been able to concoct a regular expression like your BONUS_REGEX lies within Matcher, which will resume looking for the next group where the previous group ended (i.e. no overlap), as opposed to after where the previous group started -- that is, short of explicitly respecifying the start search position (above). A good candidate for BONUS_REGEX would have been "(.\\G.|^..)" but, unfortunately, the \G-anchor-in-the-middle trick doesn't work with Java's Matcher (but works just fine in Perl):
perl -e 'while ("12345"=~/(^..|.\G.)/g) { print "$1\n" }'
12
23
34
45
split(): as for INSERT_REGEX_HERE a good candidate would have been (?<=..)(?=..) (split point is the zero-width position where I have two characters to my right and two to my left), but again, because split conceives naught of overlap you end up with [12, 3, 45] (which is close, but no cigar.)
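To see the non-overlapping behaviour of split() concretely, here is a small sketch using that candidate regex. The class and method names are placeholders.

```java
import java.util.Arrays;

class SplitNoOverlap {
    // Splits at every zero-width position that has two characters on each side.
    static String[] splitBetweenPairs(String s) {
        return s.split("(?<=..)(?=..)");
    }

    public static void main(String[] args) {
        // Each character lands in exactly one piece, so "23" and "34" can never both appear
        System.out.println(Arrays.toString(splitBetweenPairs("12345")));
    }
}
```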
EDIT2
For fun, you can trick split() into doing what you want by first doubling non-boundary characters (here you need a reserved character value to split around):
Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1#$1").split("#")
We can be smart and eliminate the need for a reserved character by taking advantage of the fact that zero-width look-ahead assertions (unlike look-behind) can have an unbounded length; we can therefore split around all points which are an even number of characters away from the end of the doubled string (and at least two characters away from its beginning), producing the same result as above:
Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1$1").split("(?<=..)(?=(..)*$)")
Alternatively tricking match() in a similar way (but without the need for a reserved character value):
Matcher m = Pattern.compile("..").matcher(
Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1$1")
);
while (m.find()) {
System.out.println(m.group());
} // prints "12", "23", "34", "45"
Split chops a string into multiple pieces, but that doesn't allow for overlap. You'd need to use a loop to get overlapping pieces.
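A plain loop over substrings, with no regex at all, could look like the sketch below. The names are hypothetical, and it indexes by char, so it assumes the input has no surrogate pairs.

```java
import java.util.ArrayList;
import java.util.List;

class OverlappingPairsLoop {
    // Collects every pair of adjacent characters: s[i] followed by s[i + 1].
    static List<String> pairs(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs("12345"));
    }
}
```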
I don't think you can do this with split() because it throws away the part that matches the regular expression.
In Perl this works:
my $string = '12345';
my @array = ();
while ( $string =~ s/(\d(\d))/$2/ ) {
push(@array, $1);
}
print join(" ", @array);
# prints: 12 23 34 45
The find-and-replace expression says: match the first two adjacent digits and replace them in the string with just the second of the two digits.
Alternative, using plain matching with Perl. Should work anywhere where lookaheads do. And no need for loops here.
$_ = '12345';
@list = /(?=(..))./g;
print "@list";
# Output:
# 12 23 34 45
But this one, as posted earlier, is nicer if the \G trick works:
$_ = '12345';
@list = /^..|.\G./g;
print "@list";
# Output:
# 12 23 34 45
Edit: Sorry, didn't see that all of this was posted already.
Creating overlapping matches with String#split isn't possible, as the other answers have already stated. It is however possible to add a regex-replace before it to prepare the String, and then use the split to create regular pairs:
"12345".replaceAll(".(?=(.).)","$0$1")
.split("(?<=\\G..)")
The .replaceAll(".(?=(.).)","$0$1") will transform "12345" into "12233445". It basically replaces every 123 substring to 1223, then every 234 to 2334 (note that it's overlapping), etc. In other words, it'll duplicate every character, except for the first and last.
.(?=(.).) # Replace-regex:
. # A single character
(?= ) # followed by (using a positive lookahead):
. . # two more characters
( ) # of which the first is saved in capture group 1
$0$1 # Replacement-regex:
$0 # The entire match, which is the character itself since everything
# else was inside a lookahead
$1 # followed by capture group 1
After that, .split("(?<=\\G..)") will split this new String into pairs:
(?<=\G..) # Split-regex:
(?<= ) # A positive lookbehind:
\G # Matching the end of the previous match
# (or the start of the string initially)
.. # followed by two characters
Some more information about .split("(?<=\\G..)") can be found here.
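Putting the two steps together, a minimal sketch might look like this. The class and method names are mine; the two regexes are the ones explained above.

```java
import java.util.Arrays;

class DuplicateThenSplit {
    // Hypothetical wrapper around the replaceAll + split combination described above.
    static String[] overlappingPairs(String s) {
        // Step 1: duplicate every inner character, e.g. "12345" becomes "12233445"
        String doubled = s.replaceAll(".(?=(.).)", "$0$1");
        // Step 2: cut the doubled string into consecutive two-character pieces
        return doubled.split("(?<=\\G..)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(overlappingPairs("12345")));
    }
}
```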