Match consecutive single characters as whole word - java

While filtering a list of strings, I want to match consecutive single characters as a whole word.
For example, given the strings below:
'm g road'
'some a b c d limited'
In the first case it should match if the user types
"mg" or "m g" or "m g road" or "mg road"
In the second case it should match if the user types
"some abcd" or "some a b c d" or "abcd" or "a b c d"
How can I do that? Can I achieve this using a regex?
I can already handle the order of whole words by searching for the words one by one, but I am not sure how to treat consecutive single characters as a single word.
E.g. "mg road" or "road mg" I can handle by searching for "mg" and "road" one by one.
EDIT
To make the requirement clearer, below is my test case:
@Test
public void testRemoveSpaceFromConsecutiveSingleCharacters() throws Exception {
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("some a b c d limited").equals("some abcd limited"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("m g road").equals("mg road"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("bank a b c").equals("bank abc"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("bank a b c limited n a").equals("bank abc limited na"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("c road").equals("c road"));
}

1.) Strip out the spaces between space-separated single letters in both stringtocheck and userinput.
.replaceAll("(?<=\\b\\w) +(?=\\w\\b)","")
(?<=\b\w) is a lookbehind that checks the spaces are preceded by a \b word boundary and a single \w word character
(?=\w\b) is a lookahead that checks the spaces are followed by a single \w word character and a \b word boundary
See demo at regexplanet (click Java)
2.) Check whether stringtocheck .contains userinput.
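The two steps above could be sketched roughly as follows. The class name `SingleCharCollapser` and the helper `matchesInput` are illustrative names, not part of the answer; `removeSpaceFromConsecutiveSingleCharacters` is the method name taken from the question's test case.

```java
class SingleCharCollapser {

    // Step 1: remove the spaces that sit between two single-character words.
    static String removeSpaceFromConsecutiveSingleCharacters(String s) {
        return s.replaceAll("(?<=\\b\\w) +(?=\\w\\b)", "");
    }

    // Step 2: normalize both sides, then do a plain substring check.
    // (Hypothetical helper, named here only for illustration.)
    static boolean matchesInput(String stringToCheck, String userInput) {
        String target = removeSpaceFromConsecutiveSingleCharacters(stringToCheck);
        String query = removeSpaceFromConsecutiveSingleCharacters(userInput);
        return target.contains(query);
    }

    public static void main(String[] args) {
        System.out.println(removeSpaceFromConsecutiveSingleCharacters("bank a b c limited n a")); // bank abc limited na
        System.out.println(matchesInput("m g road", "mg road"));             // true
        System.out.println(matchesInput("some a b c d limited", "a b c d")); // true
    }
}
```

Word order is not handled here; as the question notes, that can be covered separately by searching for each word on its own.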

Sounds like you simply want to ignore whitespace. You can easily do this by stripping out whitespace from both the target string and the user input before looking for a match.

You basically want each search term to be modified to allow intervening spaces, so
"abcd" becomes regex "\ba ?b ?c ?d\b"
To achieve this, do this to each word before matching:
word = "\\b" + word.replaceAll("(?<=.)(?=.)", " ?") + "\\b";
The word boundaries \b are necessary to stop it matching "comma bcd" or "abc duck".
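A small, self-contained sketch of this idea is shown below. The class and method names (`SpacedWordMatcher`, `toSpacedPattern`, `containsWord`) are made up for illustration, and the sketch assumes each search word contains only letters or digits, so it does no regex escaping.

```java
import java.util.regex.Pattern;

class SpacedWordMatcher {

    // Build a regex for one search word that tolerates a single optional
    // space between every pair of adjacent characters.
    // Assumption: the word has no regex metacharacters in it.
    static Pattern toSpacedPattern(String word) {
        String regex = "\\b" + word.replaceAll("(?<=.)(?=.)", " ?") + "\\b";
        return Pattern.compile(regex);
    }

    // True if the target contains the word, with or without spaces between its characters.
    static boolean containsWord(String target, String word) {
        return toSpacedPattern(word).matcher(target).find();
    }

    public static void main(String[] args) {
        System.out.println(toSpacedPattern("abcd").pattern());             // \ba ?b ?c ?d\b
        System.out.println(containsWord("some a b c d limited", "abcd")); // true
        System.out.println(containsWord("comma bcd", "abcd"));            // false
        System.out.println(containsWord("abc duck", "abcd"));             // false
    }
}
```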

This regex will match all single characters separated by one or more spaces
(^(\w\s+)+)|(\s+\w)+$|((\s+\w)+\s+)

The following regex (in multiline mode) could help you out:
^(?<first>\w+)(?<chars>(?:.(?!(?:\b\w{2,}\b)))*)
# assure that it is the beginning of the line
# capture as many word characters as possible in the first group "first"
# the construction afterwards consumes everything up to (not including)
# a word which has at least two characters...
# ... and saves it to the group called "chars"
You would only need to replace the whitespace in the second group (aka "chars").
See a demo on regex101.com

str = str.replaceAll("\\s","");

Related

Regexp to fit all string NOT ending with a LIST of known suffixes (not characters, but words)

I need to be able to build a regexp capturing all possible patterns, except for the strings ending with b or i or f or dt.
My string always starts with words and has an underscore before the closing suffix.
If I didn't have the dt in the blacklist of suffixes, I would probably do something like the following:
\w+_[^f|b|i]+ OR maybe (.*)_[^f|b|i]
But the [^x|y|z] format only captures single characters, and I wasn't able to combine it with a sequence of characters.
Any help would be appreciated,
Thanks.
If what you want to match always starts with word characters and contains an underscore before the closing suffix, you might match one or more word characters \w+, match an underscore and then match one or more word characters \w+.
Then use a negative lookbehind to assert that what is on the left side is not b, f, i or dt, and end with a word boundary \b to make sure the suffix is not part of a larger word.
\w+_\w+(?<![bfi]|dt)\b
Details
\w+_\w+ Match one or more word characters, an _ and again one or more word characters
(?<! Negative lookbehind
[bfi] character class which match b, f or i
| Or
dt Match literally
) Close negative lookbehind
\b Word boundary
Demo Java
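As a rough illustration, the pattern above can be tried with `matches()` like this. The class name `SuffixFilter`, the method `isAllowed` and the sample strings are invented for the example.

```java
import java.util.regex.Pattern;

class SuffixFilter {

    // The pattern from the answer, written as a Java string literal.
    private static final Pattern ALLOWED = Pattern.compile("\\w+_\\w+(?<![bfi]|dt)\\b");

    static boolean isAllowed(String s) {
        return ALLOWED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("aaaa_b"));   // false
        System.out.println(isAllowed("TTTTT_dt")); // false
        System.out.println(isAllowed("name_x"));   // true
        // Also false: the lookbehind only inspects the trailing characters,
        // so a longer suffix that merely ends in b, f, i or dt is rejected too.
        System.out.println(isAllowed("name_xb"));  // false
    }
}
```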
Note that .*_[^f|b|i] with matches() does not mean "match if it does not end with these"; it means "match if it ends with a char other than those in the character class". However, in this case, it seems to make no difference. The real trouble is that | is treated as a literal pipe char inside a character class, and dt would be treated as 2 separate chars if you placed it inside the character class.
You have at least 2 options (there can be more): use a regex that matches any string that does not end with a _ followed by b, i, f or dt, or match these letters/combinations of letters after the underscore at the end of the string and negate the result.
Approach 1:
List<String> strs = Arrays.asList("aaaa_b", "zzzzzz_i", "---------_f", "TTTTT_dt", "..._.");
for (String str : strs) {
    System.out.println("\"" + str + "\": " + str.matches(".*(?<!_[bif]|_dt)"));
}
Output:
"aaaa_b": false
"zzzzzz_i": false
"---------_f": false
"TTTTT_dt": false
"..._.": true
NOTE: To make it case insensitive, you may prepend the pattern with (?i): "(?i).*(?<!_[bif]|_dt)". Also, the . does not match line breaks by default; you may want to let it match them with (?s): "(?si).*(?<!_[bif]|_dt)".
Approach 2:
List<String> strs = Arrays.asList("aaaa_b", "zzzzzz_i", "---------_f", "TTTTT_dt", "..._.");
Pattern p = Pattern.compile("_(?:[bif]|dt)\\z");
for (String str : strs) {
    System.out.println("\"" + str + "\": " + !p.matcher(str).find());
}
Output is the same. Same case insensitivity note applies.

Regex to remove initials from full name

I have names like "D John Livingston" and "S. Jennifer Adstan", and I want only the initials to be removed from the names: "D" in the first name and "S." in the second name. How can I do it using a Java regex?
The following code snippet seems to be working well:
String input = "John O'Connel";
input = input.replaceAll("\\b[A-Z]+(?:\\.|\\s+|$)", "").trim();
System.out.println(input);
John O'Connel
Your question is chock full of edge cases, since an initial could be, for example, more than one letter, and could appear at the start, middle, or end of the name. I replaced using the pattern \b[A-Z]+(?:\.|\s+|$), which seems to at least cover your examples. Also, I make a call to String#trim() to clean up whitespace left behind by initials at the very beginning or end.
Demo
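Applied to the two names from the question, the replacement in the snippet above might be wrapped like this. The class name `InitialStripper` and the method `stripInitials` are illustrative only.

```java
class InitialStripper {

    // Same replaceAll call as in the snippet above, wrapped in a helper.
    static String stripInitials(String name) {
        return name.replaceAll("\\b[A-Z]+(?:\\.|\\s+|$)", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripInitials("D John Livingston"));  // John Livingston
        System.out.println(stripInitials("S. Jennifer Adstan")); // Jennifer Adstan
        System.out.println(stripInitials("John O'Connel"));      // John O'Connel
    }
}
```

Because the pattern uses [A-Z]+, a fully upper-case word such as "MC" in "MC Hammer" is treated as an initial and removed as well; whether that is desirable depends on the data.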
For this I would consider using String replaceAll().
So how do we design the regex?
Basically there are three cases you need to consider:
A. a single letter at the beginning of the name (optional period), followed by one space
B. a single letter at the end of the name (optional period), preceded by one space
C. a single letter in the middle of the name (optional period), surrounded by two spaces
For the first two cases, you need to leave no spaces. So you would match one space and replace it with zero spaces.
For the last case, you need to leave one space. However, rather than handling this case explicitly, you may treat it as either A or B, since those will replace only one of the two spaces, leaving you with the desired number of spaces: 1.
So how do we combine case A and case B together? Using the alternation symbol |.
To prevent grabbing a single letter from a larger chain of letters, you can use the word boundary marker \b on the side which is not demarcated by a space character. (Normally for cases A and B, I would have used ^ and $ to explicitly match the beginning and end of the string for this purpose. However, since we also need to handle case C in the middle of the string, we should use the word boundary marker instead.)
And how do we represent the optional period? Since the period is a special character it must be escaped: \. Then it is marked as optional with a question mark: \.? However, there's still the problem that the A. in the middle of a name might be matched as just A, since there is also a word boundary between the letter and the period. To prevent this, we make the optional period possessive: \.?+
Putting all of this together, our regex would be: (\b[A-Z]\.?+ )|( [A-Z]\.?+\b)
However, in the final Java string, the backslash must be escaped, so in the final Java string, each \ will appear as \\
Example code:
String pattern = "(\\b[A-Z]\\.?+ )|( [A-Z]\\.?+\\b)";
String input1 = "MC Hammer I Smash U";
String input2 = "S. Jennifer A. Adstan JR.";
System.out.println(input1.replaceAll(pattern, ""));
System.out.println(input2.replaceAll(pattern, ""));
Output:
MC Hammer Smash
Jennifer Adstan JR.

Java Regexp to match words only (', -, space)

What is the Java regular expression to match all words containing only:
Letters from a to z and A to Z
The ' - and space characters, but they must not be at the beginning or the end.
Examples
test'test match
test' doesn't match
'test doesn't match
-test doesn't match
test- doesn't match
test-test match
You can use the following pattern: ^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$
Below are the examples:
String s1 = "abc";
String s2 = " abc";
String s3 = "abc ";
System.out.println(s1.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s2.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s3.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
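Note that the pattern above uses [a-zA-Z]* for the whole body, so it accepts letter-only strings and would reject test'test from the question. One possible alternative, offered only as a sketch, is to require letters at both ends and allow a single ', - or space between runs of letters. The class name `WordValidator` and the method `isValid` are illustrative, and disallowing two separators in a row is an assumption that goes beyond what the question states.

```java
import java.util.regex.Pattern;

class WordValidator {

    // Letters, optionally followed by groups of "one separator + letters".
    // The hyphen is listed first in the character class so it is taken literally.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+(?:[-' ][a-zA-Z]+)*");

    static boolean isValid(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("test'test")); // true
        System.out.println(isValid("test-test")); // true
        System.out.println(isValid("test'"));     // false
        System.out.println(isValid("'test"));     // false
        System.out.println(isValid("-test"));     // false
        System.out.println(isValid("test-"));     // false
    }
}
```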
If you mean the space character, add it to the character class: [a-zA-Z ]
So it checks whether your string consists of a-z (lowercase) chars, A-Z (uppercase) chars and space chars. If not, the test will fail.
Here's my solution:
/(\w{2,}(-|'|\s)\w{2,})/g
You can take it for a spin on Regexr.
It is first checking for a word with \w, then any of the three qualifiers with "or" logic using |, and then another word. The brackets {} are making sure the words on either end are at least 2 characters long so contractions like don't aren't captured. You could set that to any value to prevent longer words from being captured or omit them entirely.
Caveat: \w also looks for _ underscores. If you don't want that you could replace it with [a-zA-Z] like so:
/([a-zA-Z]{2,}(-|'|\s)[a-zA-Z]{2,})/g

Replace white spaces only in part of the string

I have a String like
"This is apple tree"
I want to remove the white spaces up to the word apple. After the change it will be like
"Thisisapple tree"
I need to achieve this in single replace command combined with regular expressions.
For now it looks like you may be looking for
String s = "This is apple tree";
System.out.println(s.replaceAll("\\G(\\S+)(?<!(?<!\\S)apple)\\s", "$1"));
Output: Thisisapple tree.
Explanation:
\G represents either end of previous match or start of input (^) if there was no previous match yet (when we are attempting to find first match)
\S+ represents one or more non-whitespace characters (to match words, including non-alphabetic characters like ' or punctuation)
(?<!(?<!\S)apple)\s negative-look-behind will prevent accepting whitespace which has apple before it (I added another negative-look-behind before apple to make sure that it doesn't have any non-whitespace which ensures that this is not part of some other word)
$1 in replacement represents match from group 1 (the one from (\S+)) which represents word. So we are replacing word and spaces with only word (effectively removing spaces)
WARNING: This solution assumes that
sentence doesn't start with space,
words can be separated with only one space.
If we want to get rid of these assumptions we would need something like:
System.out.println(s.replaceAll("^\\s+|\\G(\\S+)(?<!(?<!\\S)apple)\\s+", "$1"));
^\s+ will allow us to match spaces at beginning of string (and replace them with content of group 1 (word) which in this case will be empty, so we will simply remove these whitespaces)
\s+ at the end allows us to match word and one or more spaces after it (to remove them)
A single replace() is unlikely to solve your problem. You could do something like this:
String s[] = "This is an apple tree, not an orange tree".split("apple");
System.out.println(new StringBuilder(s[0].replace(" ","")).append("apple").append(s[1]));
This is achieved via a lookahead assertion, like this:
String str = "This is an apple tree";
System.out.println(str.replaceAll(" (?=.*apple)", ""));
It means: remove every space that has the word apple anywhere after it.
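One consequence worth checking is what happens when apple occurs more than once, or inside a longer word. The sketch below runs the same replaceAll call on a few inputs; the class name `SpacesBeforeApple`, the method name and the extra sample strings are invented for the example.

```java
class SpacesBeforeApple {

    // Removes every space that still has "apple" somewhere after it on the same line.
    static String removeSpacesBeforeApple(String s) {
        return s.replaceAll(" (?=.*apple)", "");
    }

    public static void main(String[] args) {
        System.out.println(removeSpacesBeforeApple("This is an apple tree"));
        // Thisisanapple tree

        // With two occurrences, spaces are removed up to the LAST "apple".
        System.out.println(removeSpacesBeforeApple("This is an apple but this apple is an orange"));
        // Thisisanapplebutthisapple is an orange

        // There is no word boundary in the lookahead, so "pineapple" also counts.
        System.out.println(removeSpacesBeforeApple("a pineapple tree"));
        // apineapple tree
    }
}
```

If only the first standalone apple should act as the cut-off, a pattern with \b around apple and a non-greedy prefix is one way to express that.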
If you want to use a regular expression you could try:
Matcher matcher = Pattern.compile("^(.*?\\bapple\\b)(.*)$").matcher("This is an apple but this apple is an orange");
System.out.println((!matcher.matches()) ? "No match" : matcher.group(1).replaceAll(" ", "") + matcher.group(2));
This checks that "apple" is an individual word and not just part of another word such as "snapple". It also splits at the first use of "apple".

Java literate text word parsing regexp

Firstly I was happy with [A-Za-z]+
Now I need to parse words that end with the letter "s", but I should skip words whose first 2 or more letters are upper-case.
I tried something like [\n\\ ][A-Za-z]{0,1}[a-z]*s[ \\.\\,\\?\\!\\:]+ but the first part of it, [\n\\ ], for some reason doesn't see the beginning of the line.
here is the example
the text is Denis goeS to school every day!
but the only parsed word is goeS
Any Ideas?
What about
\b[A-Z]?[a-z]*s\b
The \b is a word boundary, which I assume is what you wanted. The ? is the shorter form of {0,1}.
Try this:
Pattern p = Pattern.compile("\\b([A-Z]?[a-z]*[sS])\\b");
Matcher m = p.matcher("Denis goeS to school every day!");
while (m.find()) {
    System.out.println(m.group(1));
}
The regex matches every word that starts with at most one upper case character, contains only lower case characters in the middle and ends in either s or S.
In your example this would match Denis and goeS. If you want to match only upper case S, change the expression to "\\b([A-Z]?[a-z]*[S])\\b", which would match goeS and GoeS but not GOeS, gOeS or goES.
