Writing a regex to detect repeat-characters [duplicate] - java

This question already has answers here:
Regex to match the longest repeating substring
(5 answers)
Closed 9 years ago.
I need to write a regex that identifies a word that has a repeating character set at the end. In the following code fragment, the repeating character set is An. I need a regex that spots and displays it.
In the following code, \\w matches any word character (letters, digits, and the underscore), but I only want to identify English letters.
String stringToMatch = "IranAnAn";
// (\\w)\\1+ matches a single word character followed by one or more copies of itself
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find()) {
    System.out.println("Word contains duplicate characters " + m.group(1));
}
UPDATE
Word contains duplicate characters a
Word contains duplicate characters a
Word contains duplicate characters An

You want to catch as many characters in your set as possible, so instead of (\\w) you should use (\\w+) and you want the sequence to be at the end, so you need to add $ (and I have removed the + after \\1 which is not useful to detect repetition: only one repetition is needed):
Pattern p = Pattern.compile("(\\w+)\\1$");
Your program then outputs An as expected.
Finally, if you only want to capture ASCII letters, you can use [a-zA-Z] instead of \\w:
Pattern p = Pattern.compile("([a-zA-Z]+)\\1$");
And if you want the character set to be at least 2 characters:
Pattern p = Pattern.compile("([a-zA-Z]{2,})\\1$");
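To make the last variant easy to try, here is a minimal, self-contained sketch that wraps it in a helper. The class name RepeatedSuffix and the method findRepeatedSuffix are made up for illustration; only the pattern itself comes from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedSuffix {
    // A group of at least two ASCII letters, immediately repeated, anchored at the end of the input.
    private static final Pattern REPEATED_SUFFIX = Pattern.compile("([a-zA-Z]{2,})\\1$");

    /** Returns the repeated unit at the end of the word, or null if there is none (hypothetical helper). */
    public static String findRepeatedSuffix(String word) {
        Matcher m = REPEATED_SUFFIX.matcher(word);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findRepeatedSuffix("IranAnAn")); // An
        System.out.println(findRepeatedSuffix("Iran"));     // null
    }
}
```

Because find() is used rather than matches(), the pattern does not need a leading .* to skip the start of the word.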

If by "only English characters" you mean A-Z and a-z, the following regex will work:
".*([A-Za-z]{2,})\\1$"

Related

What regex can match similar characters? [duplicate]

This question already has answers here:
Converting Symbols, Accent Letters to English Alphabet
(12 answers)
Closed 3 years ago.
What regex could match similar characters, like (ä and a) or in Russian (и and й)?
Here is my code:
String text1 = " Passagiere noch auf ihr fehlendes Gepäck";
String text2 = " Passagiere noch auf ihr fehlendes Gepack";
Pattern p1 = Pattern.compile("\\b" + "Gepack");
Pattern p2 = Pattern.compile("\\b" + "Gepack");
Matcher m1 = p1.matcher(text1); // doesn't find any occurrence
Matcher m2 = p2.matcher(text2); // finds one occurrence
You could build up a character class of all the characters you want to match, so you could replace pattern one with:
Pattern p1 = Pattern.compile("\\b" + "Gep[aä]ck");
But this could get very burdensome very quickly.
There is a mechanism in Unicode called normalisation that lets you reformat your string so it can be compared in different ways.
Normalisation Form Canonical Decomposition (NFD) takes a string containing accented characters and splits each one into multiple code points: the base character first, followed by the code points of the corresponding combining accents, in a well-defined order for each accented character.
Having done this to your input, you can use a regex to remove all the accents from the string, as they all have the Unicode property Mark, sometimes shortened to M.
This gives you a string containing only base characters that your regex will match against.
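The normalise-then-strip approach described above can be sketched with java.text.Normalizer from the standard library. The class name AccentInsensitiveMatch and the helper stripAccents are hypothetical names chosen for this example, and the accented letter is written as a Unicode escape only to avoid source-encoding problems.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class AccentInsensitiveMatch {
    // \p{M} matches any code point with the Unicode "Mark" property (combining accents).
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    /** Decomposes accented characters (NFD) and drops the combining marks (hypothetical helper). */
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // "\u00e4" is the letter a with a diaeresis, as in "Gepäck".
        String text1 = " Passagiere noch auf ihr fehlendes Gep\u00e4ck";
        Pattern p = Pattern.compile("\\bGepack");
        System.out.println(p.matcher(stripAccents(text1)).find()); // true
    }
}
```

This only helps for characters that have a canonical decomposition into a base letter plus combining marks; letters without one (for example the German ß) pass through unchanged.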

Matching a pattern in Java when a few starting characters and a few ending characters are known

I would like to find a pattern which is a string. I know the first few characters of that string. And I also know the set of characters or words the string is ending with. How do I find this pattern? My string is constituted of words and special characters.
My string starts with a special character and ends with a special character followed by any two characters, which are variables.
If you don't know what the special characters are and want to find them out from the input, you can do:
String regex = start + "(.*?)" + end + "(.)(.)";
As @olivier-grégoire points out, this assumes start and end are properly quoted; use Pattern.quote(String) if you are not sure.
The two characters matched will be in groups 2 and 3 when you use a Matcher.
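A minimal sketch of the quoted version follows. The delimiters "$[" and "]*" and the input string are invented for illustration, as are the class name KnownStartAndEnd and the method build; the pattern shape is the one given above, with Pattern.quote applied to both ends.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KnownStartAndEnd {
    /** Builds the pattern with both delimiters quoted so they are matched literally (hypothetical helper). */
    public static Pattern build(String start, String end) {
        return Pattern.compile(Pattern.quote(start) + "(.*?)" + Pattern.quote(end) + "(.)(.)");
    }

    public static void main(String[] args) {
        // Hypothetical delimiters and input, chosen only for illustration.
        Matcher m = build("$[", "]*").matcher("noise $[some words]*xy more");
        if (m.find()) {
            System.out.println(m.group(1));              // some words
            System.out.println(m.group(2) + m.group(3)); // xy
        }
    }
}
```

Without the quoting, "$[" would be read as an end-of-input anchor followed by an unterminated character class, and Pattern.compile would throw a PatternSyntaxException.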

How to replace a specific occurrence of a sub-string in a string and ignoring incomplete matches in java? [duplicate]

This question already has answers here:
Search for a word in a String
(9 answers)
Closed 7 years ago.
In my application I'm giving dictionary word suggestions and replacing the selected word with the suggested word using .replaceAll(). However, that replaces every matching substring in the entire string.
For example, in this String,
String sentence = "od and not odds as a sample sam. but not odinary";
if I suggest odd for the first word, .replaceAll() will replace every occurrence of od with odd, turning the fourth word into oddds and changing the sentence to:
sentence.replaceAll("od", "odd");
// sentence String becomes
sentence = "odd and not oddds as a sample sam. but not oddinary";
Replacing od with odd has affected all the other words that contain the characters od.
Can anyone help me with a better approach?
Use regex. For your example, "\bod\b" will match od only as a whole word. \b is a word boundary, meaning either the start or the end of a word (whether it ends with a dot, whitespace, or anything else).
The replaceAll method can already take in a regex, but if you need more power you can look at the Matcher class.
String REPLACE_WORD = "od";
sentence = sentence.replaceAll("\\b" + REPLACE_WORD + "\\b", "odd");
will give you the correct answer. The \\ tells Java that you want a literal \ in the string rather than the escape sequence \b (Java first parses the string literal, and then parses the resulting string as a regex).
As mentioned, you can use a Matcher from java.util.regex, which has a lot of useful functionality.
String text = "I detect quite an od odour.";
String searchTerm = "\\bod\\b";
Pattern pattern = Pattern.compile(searchTerm);
Matcher matcher = pattern.matcher(text);
text = matcher.replaceAll("odd");
System.out.println(text);
The output would be:
I detect quite an odd odour.
Use the regular expression in the replaceAll() method:
\bod\b
This will filter out occurrences of od inside any other word.
Of course, when you use it in a Java method, you need to escape the \.
So
replaceAll("\\bod\\b", "odd");
should do it.
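Putting the answers above together, a small sketch of a reusable helper might look like this. The class name WholeWordReplace and the method replaceWholeWord are hypothetical. Pattern.quote and Matcher.quoteReplacement are added here as a precaution, on the assumption that the word or the suggestion could contain regex or replacement metacharacters; for plain dictionary words they change nothing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordReplace {
    /** Replaces the word only where it stands alone, not where it is part of a longer word (hypothetical helper). */
    public static String replaceWholeWord(String sentence, String word, String replacement) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        // Strings are immutable, so the result has to be returned (or assigned) rather than discarded.
        return p.matcher(sentence).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String sentence = "od and not odds as a sample sam. but not odinary";
        System.out.println(replaceWholeWord(sentence, "od", "odd"));
        // odd and not odds as a sample sam. but not odinary
    }
}
```

Note that \b is defined in terms of word characters, so it only behaves as a whole-word boundary when the searched word itself starts and ends with a word character.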

finding repeated characters in a row (3 times or more) in a string

Here is the code for finding a repeated character like A in AAbbbc:
String tweet = "abccdef";
Pattern p = Pattern.compile("((\\w)\\2+)+");
Matcher m = p.matcher(tweet);
while (m.find()) {
    System.out.println("Duplicate character " + m.group(0));
}
Now the problem is that I want to find only the characters that are repeated 3 or more times in a row.
When I change 2 to 3 in the above code, it does not work.
Can anyone help?
You shouldn't change 2 to 3, because \\2 is a backreference to capture group 2, not a repetition count. You can use the backreference twice instead:
"((\\w)\\2\\2)+"
However, this regex still does not match a whole string such as your example, since it matches only the repeated characters. For that, you can use the following regex:
"((\\w)\\2+\\2)+.*"
You may use the repetition quantifier.
Pattern p = Pattern.compile("(\\w)\\1{2,}");
Matcher m = p.matcher(tweet);
while (m.find())
{
System.out.println("Duplicate character " + m.group(1));
}
Now the duplicate character is captured in group 1, not group 0, which refers to the whole match. Just change the number inside the repetition quantifier to match a character that repeats n or more times, like "(\\w)\\1{5,}".
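Here is a short, runnable sketch of the quantifier approach that collects whole runs instead of printing them. The class name RepeatedRuns, the method runsOfAtLeast, and its minLength parameter are assumptions made for this example. Since the group already matches one character, the backreference has to repeat at least minLength - 1 times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedRuns {
    /** Returns every run of the same word character that is at least minLength long (hypothetical helper). */
    public static List<String> runsOfAtLeast(String input, int minLength) {
        // One character captured by the group, plus (minLength - 1) or more copies of it.
        Pattern p = Pattern.compile("(\\w)\\1{" + (minLength - 1) + ",}");
        Matcher m = p.matcher(input);
        List<String> runs = new ArrayList<>();
        while (m.find()) {
            runs.add(m.group()); // group() is the whole run; group(1) would be the single character
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(runsOfAtLeast("AAbbbcdddd", 3)); // [bbb, dddd]
    }
}
```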
That original regex is flawed. It only finds "word" characters (alpha, numeric, underscore). The requirement is "find characters that repeat 3 or more times in a row." The dot is the any-character metacharacter.
(?=(.)\1{3})(\1+)
So, that will find a character that occurs 4 or more consecutive times (i.e., meets your requirement of a character that "repeats" three or more times). If you really meant "occurs," change the 3 to 2. Anyway, it does a non-consuming "zero-length assertion" before capturing any data, so should be more efficient. It will only consume and capture data once you've found your minimum requirement (a single character that repeats at least 3 times). You can then consume it with the one-or-more '+' quantifier because you know it's a match you want; further quantification is redundant--your positive lookahead has already assured (asserted) that. Your results are in capture group 2 "(\1+)" and you can refer to it as \2.
Note: I tested that with the Perl command-line utility, so that's the raw regex. You may need to escape certain characters before using it in your programming language.
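For Java, the escaping mentioned in the note amounts to doubling each backslash inside the string literal. The sketch below is one way to try the lookahead regex from Java; the class name LookaheadRuns and the method find are hypothetical, and the sample input is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadRuns {
    // Raw regex (?=(.)\1{3})(\1+) with the backslashes doubled for a Java string literal.
    private static final Pattern FOUR_OR_MORE = Pattern.compile("(?=(.)\\1{3})(\\1+)");

    /** Returns every run of four or more identical characters (hypothetical helper). */
    public static List<String> find(String input) {
        Matcher m = FOUR_OR_MORE.matcher(input);
        List<String> runs = new ArrayList<>();
        while (m.find()) {
            runs.add(m.group(2)); // group 1 is the single character, group 2 the whole run
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(find("aaabbbbcc!!!!!")); // [bbbb, !!!!!]
    }
}
```

Unlike the \\w versions, the dot also matches punctuation, but by default it does not match line terminators.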

How to write a regex that prevents partial matching [duplicate]

This question already has answers here:
Regex whitespace word boundary
(3 answers)
Closed 2 years ago.
How do I build a regex pattern that searches a text T for a search string S?
There are 2 requirements:
S could be made of any character.
S could be anywhere in the string but can't be part of a word.
I know that in order to escape special regex characters I can put the search string between \Q and \E, like this:
\QMySearch_String\E
How do I prevent finding partial matching of S in T?
You can do it like this, if "can't be part of a word" is interpreted as "preceded by start-of-string or whitespace and followed by end-of-string or whitespace":
String s = "3894$75\\/^()";
String text = "fdsfsd3894$75\\/^()dasdasd 22348 3894$75\\/^()";
Matcher m = Pattern.compile("(?<=^|\\s)\\Q" + s + "\\E(?=\\s|$)").matcher(text);
while (m.find()) {
System.out.println("Found match! :'" + m.group() + "'");
}
This prints only one match:
Found match! :'3894$75\/^()'
I think what you're trying to find can be easily solved with lookaheads and lookbehinds. Take a look at this for a good explanation.
Then there's a bit of flip-flopping booleans, but you're looking ahead and behind for NOT Non-Space characters (\S). You don't want to look for space characters only because S might be at the start or end of the string. Like so:
(?<!\S)S(?!\S)
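A minimal sketch of this lookaround pattern in Java, reusing the search string and text from the earlier answer. The class name StandaloneSearch and the method countStandalone are hypothetical, and Pattern.quote is used here in place of a hand-written \Q...\E so that a search string containing \E is still handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StandaloneSearch {
    /** Counts occurrences of s that are not touching any non-whitespace character (hypothetical helper). */
    public static int countStandalone(String text, String s) {
        // (?<!\S) : not preceded by a non-space character (so start-of-string is fine)
        // (?!\S)  : not followed by a non-space character (so end-of-string is fine)
        Pattern p = Pattern.compile("(?<!\\S)" + Pattern.quote(s) + "(?!\\S)");
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "3894$75\\/^()";
        String text = "fdsfsd3894$75\\/^()dasdasd 22348 3894$75\\/^()";
        System.out.println(countStandalone(text, s)); // 1
    }
}
```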
