Regex to remove initials from full name - java

I have names like "D John Livingston" and "S. Jennifer Adstan", and I want only the initials to be removed from the names: "D" in the first name and "S." in the second. How can I do it using Java regex?

The following code snippet seems to be working well:
String input = "John O'Connel";
input = input.replaceAll("\\b[A-Z]+(?:\\.|\\s+|$)", "").trim();
System.out.println(input);
John O'Connel
Your question is full of edge cases: an initial could be more than one letter, and it could appear at the start, middle, or end of the name. The snippet above replaces matches of the pattern \b[A-Z]+(?:\.|\s+|$), which at least covers your examples. The call to String#trim() cleans up whitespace left behind when an initial at the very beginning or end is removed.
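As a quick check of the pattern from the snippet above against the two names in the question, here is a minimal sketch. The class and method names (InitialsDemo, stripInitials) are made up for illustration.

```java
final class InitialsDemo {
    // Pattern from the snippet above: a run of capitals starting at a word
    // boundary, followed by a period, whitespace, or the end of the input.
    static String stripInitials(String name) {
        return name.replaceAll("\\b[A-Z]+(?:\\.|\\s+|$)", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripInitials("D John Livingston"));  // John Livingston
        System.out.println(stripInitials("S. Jennifer Adstan")); // Jennifer Adstan
        System.out.println(stripInitials("John O'Connel"));      // John O'Connel
    }
}
```

Because [A-Z]+ allows more than one letter, an all-caps token such as "MC" followed by a space is removed as well, which may or may not be what you want.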

For this I would consider using String replaceAll().
So how do we design the regex?
Basically there are three cases you need to consider:
A. a single letter at the beginning of the name (optional period), followed by one space
B. a single letter at the end of the name (optional period), preceded by one space
C. a single letter in the middle of the name (optional period), surrounded by two spaces
For the first two cases, you need to leave no spaces. So you would match one space and replace it with zero spaces.
For the last case, you need to leave one space. However, rather than handling this case explicitly, you may treat it as either A or B, since those will replace only one of the two spaces, leaving you with the desired number of spaces: 1.
So how do we combine case A and case B? With the alternation operator |.
To prevent grabbing a single letter from a longer run of letters, use the word boundary marker \b on the side that is not delimited by a space character. (Normally for cases A and B, I would have used ^ and $ to explicitly match the beginning and end of the string. However, since we also need to handle case C in the middle of the string, we use the word boundary marker instead.)
And how do we represent the optional period? Since the period is a special character, it must be escaped: \. Then it is marked as optional with a question mark: \.? However, there is still the problem that an A. in the middle of a name might be matched as just A, since the position between a letter and a period also counts as a word boundary. To prevent this, we make the optional period possessive: \.?+
Putting all of this together, our regex is: (\b[A-Z]\.?+ )|( [A-Z]\.?+\b)
In the final Java string literal, each backslash must itself be escaped, so every \ appears as \\
Example code:
String pattern = "(\\b[A-Z]\\.?+ )|( [A-Z]\\.?+\\b)";
String input1 = "MC Hammer I Smash U";
String input2 = "S. Jennifer A. Adstan JR.";
System.out.println(input1.replaceAll(pattern, ""));
System.out.println(input2.replaceAll(pattern, ""));
Output:
MC Hammer Smash
Jennifer Adstan JR.
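To see what the possessive quantifier buys, the sketch below runs the same input through the regex with a plain optional period (\.?) and with the possessive one (\.?+). The class name and constants are hypothetical; only the second pattern comes from the answer.

```java
final class PossessiveDemo {
    // Variant with an ordinary (backtracking) optional period, for comparison only.
    static final String GREEDY = "(\\b[A-Z]\\.? )|( [A-Z]\\.?\\b)";
    // Pattern from the answer, with the possessive optional period.
    static final String POSSESSIVE = "(\\b[A-Z]\\.?+ )|( [A-Z]\\.?+\\b)";

    static String strip(String name, String pattern) {
        return name.replaceAll(pattern, "");
    }

    public static void main(String[] args) {
        String name = "Jennifer A. Adstan";
        // The greedy variant backtracks, matches " A" and leaves the period behind.
        System.out.println(strip(name, GREEDY));     // Jennifer. Adstan
        // The possessive variant refuses to give the period back, so "A. " is removed.
        System.out.println(strip(name, POSSESSIVE)); // Jennifer Adstan
    }
}
```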

Related

How to insert spaces after full stops at the end of sentences, but not in abbreviations or floating point numbers?

I have a JTextArea in which I want to replace all full stops without a space next to them e.g in "This is a sentence.This is another C.O.D sentence.This is yet another C.A.T. sentence." to "This is a sentence. This is another C.O.D sentence. This is yet another C.A.T. sentence.". But I don't want the abbreviations or floating point numbers to gain extra spaces e.g "This is a C.A.T. float 5.5" should not become "This is a C. A. T. float 5. 5"! I am using string.replaceAll(".",". ") for this which is not proving to be sufficient.
Keeping it simple, without negative look-behinds and such:
s = s.replaceAll("([^A-Z0-9.])\\.([^0-9 \t])", "$1. $2");
Replace the period when not:
after a capital itself (U.N.C. or M.Twain)
after a digit (1. - hoping the sentence does not end in a digit)
after a period (...)
before a digit (.5 - hoping the next sentence does not start with a digit)
before a space or tab
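The rules listed above can be tried out on the sentences from the question with a small sketch like this one. The class and method names are invented for the example; the pattern and replacement are the ones from the answer.

```java
final class SentenceSpacingDemo {
    // Insert a space after a period unless it follows a capital, digit, or period,
    // or precedes a digit, space, or tab.
    static String addSpaces(String s) {
        return s.replaceAll("([^A-Z0-9.])\\.([^0-9 \t])", "$1. $2");
    }

    public static void main(String[] args) {
        System.out.println(addSpaces(
            "This is a sentence.This is another C.O.D sentence.This is yet another C.A.T. sentence."));
        // This is a sentence. This is another C.O.D sentence. This is yet another C.A.T. sentence.
        System.out.println(addSpaces("This is a C.A.T. float 5.5"));
        // This is a C.A.T. float 5.5
    }
}
```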
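You can use the regex
([^A-Z])\.(?!\d)
which matches every "." that is not followed by a digit and not preceded by an uppercase letter. Because the preceding character is captured in group 1, the replacement has to put it back, e.g. "$1. ".
See the regex demo and the online compiler.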
(You should edit your question to clearly state your requirement, e.g. handling of abbreviation)
You could replace (?<!\b[A-Z])\.(?!\d) with .<space>
Demonstration: https://regex101.com/r/g1g7Yg/1
Explanation:
(?<! )     negative look-behind group
\b[A-Z]    a word boundary followed by one uppercase character
           (i.e. a single uppercase letter on its own)
\.         a dot
(?!\d)     negative look-ahead group, for a single digit
This basically means: replace a dot if it is NOT preceded by a single uppercase character and NOT followed by a digit.
There are still some flaws: for example, it will not replace the dot in "Hello world.1 apple 1 day". It shouldn't be difficult to change the regex to fix this once you understand the regex above.
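Here is a small sketch of the look-behind approach in Java. The first method uses the pattern exactly as given above. The second is a hypothetical tightening that is not part of the answer: it additionally skips dots that are already followed by whitespace or that end the input, so existing spaces are not doubled. All names are invented for the example.

```java
final class PeriodSpacingDemo {
    // Pattern from the answer: a dot not preceded by a lone capital letter
    // and not followed by a digit.
    static String original(String s) {
        return s.replaceAll("(?<!\\b[A-Z])\\.(?!\\d)", ". ");
    }

    // Assumption, not from the answer: also leave a dot alone when it is
    // followed by whitespace or is the last character of the input.
    static String tightened(String s) {
        return s.replaceAll("(?<!\\b[A-Z])\\.(?!\\d|\\s|$)", ". ");
    }

    public static void main(String[] args) {
        System.out.println(original("One.Two C.O.D three"));       // One. Two C.O.D three
        System.out.println(original("Hello world.1 apple 1 day")); // unchanged, as noted above
        System.out.println(tightened("This is a C.A.T. float 5.5")); // unchanged
    }
}
```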

Replace white spaces only in part of the string

I have a String like
"This is apple tree"
I want to remove the white spaces up to the word apple. After the change it will look like
"Thisisapple tree"
I need to achieve this in single replace command combined with regular expressions.
For now it looks like you may be looking for
String s = "This is apple tree";
System.out.println(s.replaceAll("\\G(\\S+)(?<!(?<!\\S)apple)\\s", "$1"));
Output: Thisisapple tree.
Explanation:
\G represents either end of previous match or start of input (^) if there was no previous match yet (when we are attempting to find first match)
\S+ represents one or more non-whitespace characters (to match words, including non-alphabetic characters like ' or punctuation)
(?<!(?<!\S)apple)\s the negative look-behind prevents accepting a whitespace character that has apple directly before it (the inner negative look-behind before apple makes sure that apple is not preceded by a non-whitespace character, i.e. that it is not part of some other word)
$1 in replacement represents match from group 1 (the one from (\S+)) which represents word. So we are replacing word and spaces with only word (effectively removing spaces)
WARNING: This solution assumes that
sentence doesn't start with space,
words can be separated with only one space.
If we want to get rid of these assumptions, we need something like:
System.out.println(s.replaceAll("^\\s+|\\G(\\S+)(?<!(?<!\\S)apple)\\s+", "$1"));
^\s+ will allow us to match spaces at beginning of string (and replace them with content of group 1 (word) which in this case will be empty, so we will simply remove these whitespaces)
\s+ at the end allows us to match word and one or more spaces after it (to remove them)
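Both variants can be compared side by side with a sketch like the following. The class and method names (strict, lenient) are hypothetical labels for the two replaceAll calls shown above.

```java
final class ApplePrefixDemo {
    // First variant: assumes no leading space and single spaces between words.
    static String strict(String s) {
        return s.replaceAll("\\G(\\S+)(?<!(?<!\\S)apple)\\s", "$1");
    }

    // Second variant: also handles leading whitespace and runs of whitespace.
    static String lenient(String s) {
        return s.replaceAll("^\\s+|\\G(\\S+)(?<!(?<!\\S)apple)\\s+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(strict("This is apple tree"));       // Thisisapple tree
        System.out.println(lenient("  This   is apple tree"));  // Thisisapple tree
        // "pineapple" does not stop the removal, only the whole word "apple" does.
        System.out.println(strict("a pineapple b apple c"));    // apineapplebapple c
    }
}
```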
A single replace() is unlikely to solve your problem. You could do something like this:
// The limit of 2 splits only at the first "apple", so any later "apple" stays in s[1].
String[] s = "This is an apple tree, not an orange tree".split("apple", 2);
System.out.println(new StringBuilder(s[0].replace(" ", "")).append("apple").append(s[1]));
This can be achieved with a lookahead assertion, like this:
String str = "This is an apple tree";
System.out.println(str.replaceAll(" (?=.*apple)", ""));
It means: remove every space that has the word apple somewhere after it.
If you want to use a regular expression you could try:
Matcher matcher = Pattern.compile("^(.*?\\bapple\\b)(.*)$").matcher("This is an apple but this apple is an orange");
System.out.println((!matcher.matches()) ? "No match" : matcher.group(1).replaceAll(" ", "") + matcher.group(2));
This checks that "apple" is an individual word and not just part of another word such as "snapple". It also splits at the first use of "apple".

Find java comments (multi and single line) using regex

I found the following regex online at http://regexlib.com/
(\/\*(\s*|.*?)*\*\/)|(\/\/.*)
It seems to work well for the following matches:
// Compute the exam average score for the midterm exam
/**
* The HelloWorld program implements an application that
*/
BUT it also tends to match
http://regexr.com/foo.html?q=bar
at least starting at the //
I'm new to regex and a total infant, but I read that if you put a caret at the beginning it forces the match to start at the beginning of the line, however this doesn't seem to work on RegExr.
I'm using the following:
^(\/\*(\s*|.*?)*\*\/)|(\/\/.*)$
The regex you are looking for must allow a comment opener (// or /*) to appear anywhere except inside a token that can itself contain those substrings. If you look at the lexical structure of the Java language, you'll see that the only lexical element that can contain // or /* is the string literal. So to avoid matching a "comment" inside a string, you have to match each whole string literal that precedes it; otherwise an earlier quote could open a string literal that contains your supposed comment.
So the text before your comment must not leave a string literal open: it may contain any number of complete string literals, separated by text that does not itself start a string literal. A string literal should be matched by the following:
\"()*\"
and the parentheses must be filled with something that cannot be a \n, a bare ", or a bare \, but can be an escaped \\ or an escaped \". (Unicode escapes such as \u0022 are translated by the compiler before tokenization and are not handled here.) This leads to
\"([^\\\"\n]|\\.)*\"
and this can be repeated any number of times, each occurrence optional and preceded by any character that is not a " (which would begin the next string literal):
([^\\"](\"([^\\\"\n]|\\.)*\")?)*
The text before the comment is matched by this expression. Then comes the comment itself, which can take either of two forms:
\/\/[^\n]*$
or
/\*([^\*]|\*[^\/])*\*\/
(that is, a slash, an asterisk (escaped), and any number of things that are either something other than a * or a * followed by something that is not a /, until the closing */ sequence is finally reached)
These can be grouped in an alternative group, as in:
(\/\/[^\n]*\n|\/\*([^\*]|\*[^\/])*\*\/)
Finally, the complete expression is:
^([^\\"](\"([^\\\"\n]|\\.)*\")?)*(\/\/[^\n]*|\/\*([^\*]|\*[^/])*\*\/)
But be careful: the matched comment does not begin at the start of the match but in the 4th group (at the 4th left parenthesis), and the regexp has to match the string from the beginning. See demo.
Note
Keep in mind that you are matching not only the comment but also the text before it, so the overall match consists of everything that precedes the comment plus the comment itself. Also, if you try this regexp on a line with several comments in sequence, it will match only the last one, because we have not covered the case of a /* ... /* .... */ sequence (a comment opener can itself be embedded in a comment, but handling this case as well will make you hate regexps forever). The correct way to cope with this problem is to write a lex/flex specification for the Java tokens, so that you get only real tokens, but that is out of scope for this explanation. See a probably valid example here.
You can try this pattern:
(?ms)^[^'"\n]*?(?:(?:"(?:\\.|[^"])*"|'\\?.')[^'"\n]*?)*((?:(?://[^\n]*|/\*.*?\*/)[ \t]*)+)
This captures comments in group 1, but only if the comment is not inside a string. Demo.
Breakdown:
(?ms)                      multiline flag, makes ^ match at the start of a line
                           singleline flag makes . match newlines
^                          start of line
[^'"\n]*?                  match anything but " or ' or newline
(?:                        then, any number of strings:
    (?:
        "                  start with a quote...
        (?:                ...followed by any number of...
            \\.            ...a backslash and the escaped character
        |                  or
            [^"]           any character other than "
        )*
        "                  ...and finally the closing quote
    |                      or...
        '\\?.'             a single character in single quotes, possibly escaped
    )
    [^'"\n]*?              and everything up to the next string or newline
)*
(                          finally, capture (any number of) comments:
    (?:
        (?:                either...
            //[^\n]*       a single line comment
        |                  or
            /\*.*?\*/      a multiline comment
        )
        [ \t]*             and any subsequent comments if only separated by whitespace
    )+
)
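The pattern broken down above could be used from Java roughly as follows. This is only a sketch: the class name, the findComments helper, and the sample source text are invented for the example, and the trim() call is an addition that drops trailing whitespace captured by the [ \t]* part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentFinderDemo {
    // Same pattern as above, with backslashes and quotes escaped for a Java string literal.
    private static final Pattern COMMENT = Pattern.compile(
        "(?ms)^[^'\"\\n]*?(?:(?:\"(?:\\\\.|[^\"])*\"|'\\\\?.')[^'\"\\n]*?)*"
        + "((?:(?://[^\\n]*|/\\*.*?\\*/)[ \\t]*)+)");

    // Collects group 1 (the comment text) of every match.
    static List<String> findComments(String source) {
        List<String> comments = new ArrayList<>();
        Matcher m = COMMENT.matcher(source);
        while (m.find()) {
            comments.add(m.group(1).trim());
        }
        return comments;
    }

    public static void main(String[] args) {
        String source =
            "String url = \"http://regexr.com/foo.html?q=bar\"; // a real comment\n"
            + "int x = 1; /* block */\n"
            + "// Compute the average\n";
        // The // inside the URL string is skipped; the three real comments are found.
        findComments(source).forEach(System.out::println);
    }
}
```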

Match consecutive single characters as whole word

While filtering from list of strings, i want to match consecutive single characters as whole word
e.g. below strings
'm g road'
'some a b c d limited'
in first case should match if user types
"mg" or "m g" or "m g road" or "mg road"
in second case should match if user types
"some abcd" or "some a b c d" or "abcd" or "a b c d"
How i can do that, can i achieve this using regex?
Order of whole words i can handle right now using searching words one by one, but not sure how to treat consecutive single chars as single word
e.g. "mg road" or "road mg" i can handle by searching "mg" and "road" one by one
EDIT
For making requirement more clear, below is my test case
@Test
public void testRemoveSpaceFromConsecutiveSingleCharacters() throws Exception {
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("some a b c d limited").equals("some abcd limited"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("m g road").equals("mg road"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("bank a b c").equals("bank abc"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("bank a b c limited n a").equals("bank abc limited na"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("c road").equals("c road"));
}
1.) Strip out spaces within space-surrounded single letters from stringtocheck and userinput.
.replaceAll("(?<=\\b\\w) +(?=\\w\\b)","")
(?<=\b\w) look behind to check if preceded by \b word boundary, \w word character
(?=\w\b) look ahead to check if followed by \w word character, \b word boundary
See demo at regexplanet (click Java)
2.) Check if stringtocheck .contains userinput.
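The two steps can be put together in a short sketch. The method name removeSpaceFromConsecutiveSingleCharacters comes from the test case in the question; the class name and the matches helper are assumptions made for this example.

```java
final class SingleLetterDemo {
    // Step 1: drop spaces that sit between two single-character words.
    static String removeSpaceFromConsecutiveSingleCharacters(String s) {
        return s.replaceAll("(?<=\\b\\w) +(?=\\w\\b)", "");
    }

    // Step 2: normalize both strings, then do a plain substring check.
    static boolean matches(String stringToCheck, String userInput) {
        return removeSpaceFromConsecutiveSingleCharacters(stringToCheck)
            .contains(removeSpaceFromConsecutiveSingleCharacters(userInput));
    }

    public static void main(String[] args) {
        System.out.println(removeSpaceFromConsecutiveSingleCharacters("some a b c d limited")); // some abcd limited
        System.out.println(removeSpaceFromConsecutiveSingleCharacters("c road"));               // c road
        System.out.println(matches("m g road", "mg road"));                                     // true
    }
}
```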
Sounds like you simply want to ignore white space. You can easily do this by stripping out white space from both the target string and the user input before looking for a match.
You're basically wanting each search term to be modified to allow intervening spaces, so
"abcd" becomes regex "\ba ?b ?c ?d\b"
To achieve this, do this to each word before matching:
word = "\\b" + word.replaceAll("(?<=.)(?=.)", " ?") + "\\b";
The word breaks \b are necessary to stop matching "comma bcd" or "abc duck".
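A minimal sketch of this approach follows, assuming each search word consists only of letters or digits (a word containing regex metacharacters would need quoting first). The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class SpacedWordDemo {
    // Turns "abcd" into the regex \ba ?b ?c ?d\b by inserting " ?" between characters.
    static Pattern toPattern(String word) {
        return Pattern.compile("\\b" + word.replaceAll("(?<=.)(?=.)", " ?") + "\\b");
    }

    static boolean found(String word, String text) {
        return toPattern(word).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(toPattern("abcd").pattern());          // \ba ?b ?c ?d\b
        System.out.println(found("abcd", "some a b c d limited")); // true
        System.out.println(found("abcd", "comma bcd"));            // false
        System.out.println(found("abcd", "abc duck"));             // false
    }
}
```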
This regex will match all single characters separated by one or more spaces
(^(\w\s+)+)|(\s+\w)+$|((\s+\w)+\s+)
The following regex (in multiline mode) could help you out:
^(?<first>\w+)(?<chars>(?:.(?!(?:\b\w{2,}\b)))*)
# assure that it is the beginning of the line
# capture as many word characters as possible in the first group "first"
# the construction afterwards consumes everything up to (not including)
# a word which has at least two characters...
# ... and saves it to the group called "chars"
You would only need to replace the whitespaces in the second group (aka "chars").
See a demo on regex101.com
str = str.replaceAll("\\s","");

How to replace strings using java String.replaceAll() excluding some patterns?

I am using String.replaceAll() to replace a forward slash / followed or preceded by a space with a comma followed by a space ", " EXCEPT in some patterns (for example, n/v and n/d should not be affected).
ALL the following inputs
"nausea/vomiting"
"nausea /vomiting"
"nausea/ vomiting"
"nausea / vomiting"
Should be outputted as
nausea, vomiting
HOWEVER ALL the following inputs
"user have n/v but not other/ complications"
"user have n/d but not other / complications"
Should be outputted as follows
"user have n/v but not other, complications"
"user have n/d but not other, complications"
I have tried
String source = "nausea/vomiting";
String regex = "([^n/v])(\\s*/\\s*)";
source = source.replaceAll(regex, ", ");
But it cuts the a before / and gives me nause , vomiting
Does any body know a solution?
Your first capturing group, ([^n/v]), captures any single character that is not the letter n, the letter v, or a slash (/). In this case, it's matching the a at the end of nausea and capturing it to be replaced.
You need to be a bit clearer about what you are and are not replacing here. Do you just want to make sure there's a comma instead when it doesn't end in "vomiting" or "d"? You can use lookahead assertions to indicate this:
(?=asdf) does not capture, but when placed at the end it ensures that right after the match the string will contain asdf; (?!asdf) ensures that it will not. Whichever you use, the text it checks is not consumed, so it will not be returned or replaced when the match is found.
Also, do not forget that in Java source you must always double up any backslashes you put in string literals.
[^n/v] is a character class, and means anything except a n, / or a v.
You are probably looking for something like a negative lookbehind:
String regex= "(?<!\\bn)(\\s*/\\s*)";
This will match any of your slash and space combinations that are not preceded by just an n, and works for all your examples. You can read more on lookaround here.
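The negative lookbehind can be checked against all six inputs from the question with a sketch like this. The class and method names are made up for the example.

```java
final class SlashToCommaDemo {
    // Replace a slash (with optional surrounding whitespace) by ", "
    // unless it directly follows a stand-alone or word-initial "n".
    static String normalize(String s) {
        return s.replaceAll("(?<!\\bn)(\\s*/\\s*)", ", ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("nausea/vomiting"));   // nausea, vomiting
        System.out.println(normalize("nausea / vomiting")); // nausea, vomiting
        System.out.println(normalize("user have n/v but not other/ complications"));
        // user have n/v but not other, complications
    }
}
```

Note that the lookbehind only protects abbreviations that start with n (n/v, n/d, and also n/a); other abbreviations would need their own exclusions.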
