How can I find repeated characters with a regex in Java?

Can anyone give me a Java regex to identify repeated characters in a string? I am only looking for characters that are repeated immediately and they can be letters or digits.
Example:
abccde <- looking for this (immediately repeating c's)
abcdce <- not this (c's separated by another character)

Try "(\\w)\\1+"
The \\w matches any word character (letter, digit, or underscore) and the \\1+ matches whatever was in the first set of parentheses, one or more times. So you wind up matching any occurrence of a word character, followed immediately by one or more of the same word character again.
(Note that I gave the regex as a Java string, i.e. with the backslashes already doubled for you)

String stringToMatch = "abccdef";
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find())
{
System.out.println("Duplicate character " + m.group(1));
}
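If you want every run rather than just the first one, the same pattern can be applied in a loop. This is only a sketch: the class name RepeatedRuns and the method findRuns are made up for illustration and are not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedRuns {
    // Same pattern as in the answer: one word character followed by one or more copies of itself.
    private static final Pattern RUN = Pattern.compile("(\\w)\\1+");

    // Hypothetical helper: returns every run of immediately repeated characters, in order.
    static List<String> findRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        while (m.find()) {
            runs.add(m.group()); // group() is the whole run, group(1) is the single character
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(findRuns("abccde"));   // [cc]
        System.out.println(findRuns("abcdce"));   // []
        System.out.println(findRuns("aabbbc11")); // [aa, bbb, 11]
    }
}
```

Using find() in a while loop resumes the search after the previous match, so overlapping runs are not reported twice.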

Regular expressions are comparatively expensive for a check this simple. You would probably be better off just storing the last character and checking whether the next one is the same.
Something along the lines of:
String s = "abccde";
char previous = s.charAt(0); // assumes s is not empty
for (int i = 1; i < s.length(); i++) {
    char current = s.charAt(i);
    if (current == previous) {
        System.out.println("Duplicate character " + current);
    }
    previous = current;
}

Related

.split() and [\\W] creates an additional empty string?

I'm creating a small program to split a string into tokens (consecutive English alphabet characters), then output the number of tokens as well as the actual tokens. The problem is that an extra empty string element is created wherever there is a comma followed by a space.
I've researched into regular expressions and understand that \W is anything that is not a word character.
String str = sc.nextLine();
// creating an array of tokens
String tokens[] = str.split("[\\W]");
int len = tokens.length;
System.out.println(len);
for (int i = 0; i < len; i++) {
System.out.println(tokens[i]);
}
Input:
Hello, World.
Expected output:
2
Hello
World
Actual output:
3
Hello
(empty line)
World
Note: this is my first stack overflow post, if I've done anything wrong please let me know, thanks
Try str.split("\\W+")
It means one or more non-word characters.
\W matches only one character, so the string is split at the comma and then split again at the space. That is why you get an extra empty string back.
\W+ matches ", " as a single delimiter, so the string is split only once there and you get back only the tokens. It works for any number of tokens, not just two: "hello, world, again" gives [hello, world, again].
If you use .split("\\W") you will get empty items if:
non-word chars appear at the start of the string
non-word chars appear in succession, one after another: \W matches one non-word char and breaks the string, and then the next non-word char breaks it again, producing empty strings.
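The two cases can be seen side by side in a short sketch. The class name SplitDemo and the helper show are hypothetical and only exist to print the resulting arrays.

```java
import java.util.Arrays;

class SplitDemo {
    // Hypothetical helper: splits the input and renders the resulting array as text.
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        // The comma and the space are two separate delimiters, so an empty string sits between them.
        System.out.println(show("Hello, World.", "\\W"));  // [Hello, , World]
        // With + the comma and space form one delimiter.
        System.out.println(show("Hello, World.", "\\W+")); // [Hello, World]
        // A delimiter at the very start still yields a leading empty string.
        System.out.println(show(", Hello", "\\W+"));       // [, Hello]
    }
}
```

Trailing empty strings do not show up because String.split(regex) discards them, which is why the final "." adds nothing to the result.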
There are two ways out.
Either remove all non-word chars at the start and then split with \W+:
String tokens[] = str.replaceFirst("^\\W+", "").split("\\W+");
Or, match the chunks of word chars with \w+ pattern:
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(" abc=-=123");
List<String> tokens = new ArrayList<>();
while(m.find()) {
tokens.add(m.group());
}
System.out.println(tokens);
Try this
Scanner inputter = new Scanner(System.in);
System.out.print("Please enter your thoughts : ");
final String words = inputter.nextLine();
final String[] tokens = words.split("\\W+");
Arrays.stream(tokens).forEach(System.out::println);

Regular expression for a phrase that contains letters and digits, is not digits only, and has a length within a fixed range

I want a regular expression that accepts only the characters a-z and 0-9, but rejects input that is purely numeric: there must be at least one alphabetic character.
For example:
413123123123131
is not allowed, but if it has even one alphabetic character anywhere in the phrase, it is OK.
I tried to define the correct regex for that and ended up with
[0-9]*[a-z].*
Now I am confused about how to define the {x,y} length of the phrase. I want {9,31}, but I cannot put a length block after the last *. I also tried defining a group, but it did not work.
Tested at https://www.debuggex.com/
How can I add it?
What you seek is
String regex = "(?=.{9,31}$)\\p{Alnum}*\\p{Alpha}\\p{Alnum}*";
Use it with String#matches() / Pattern#matches() method to require a full string match:
if (s.matches(regex)) {
return true;
}
Details
^ - implicit in matches() - matches the start of string
(?=.{9,31}$) - a positive lookahead that requires 9 to 31 any chars other than line break chars from the start to end of the string
\\p{Alnum}* - 0 or more alphanumeric chars
\\p{Alpha} - an ASCII letter
\\p{Alnum}* - 0 or more alphanumeric chars
Java demo:
String lines[] = {"413123123123131", "4131231231231a"};
Pattern p = Pattern.compile("(?=.{9,31}$)\\p{Alnum}*\\p{Alpha}\\p{Alnum}*");
for(String line : lines)
{
Matcher m = p.matcher(line);
if(m.matches()) {
System.out.println(line + ": MATCH");
} else {
System.out.println(line + ": NO MATCH");
}
}
Output:
413123123123131: NO MATCH
4131231231231a: MATCH
This might be what you are looking for.
[0-9a-zA-Z]*[a-zA-Z][0-9a-zA-Z]*
To help explain it, think of the middle term as your one required character and the outer terms as any number of alpha numeric characters.
Edit: to restrict the length of the string as a whole you may have to check that manually after matching, i.e.
if (str.length() >= 9 && str.length() <= 31)
Wiktor does provide a solution that involves more regex, please look at his for a better regex pattern
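Putting the pattern and the manual length check together might look like the sketch below. The class name AlnumWithLetter and the method isValid are invented for illustration, and the 9 to 31 bounds are taken from the question.

```java
class AlnumWithLetter {
    // Hypothetical helper: length check first, then the "one required letter" pattern.
    static boolean isValid(String s) {
        return s.length() >= 9 && s.length() <= 31
                && s.matches("[0-9a-zA-Z]*[a-zA-Z][0-9a-zA-Z]*");
    }

    public static void main(String[] args) {
        System.out.println(isValid("413123123123131")); // false: digits only
        System.out.println(isValid("4131231231231a"));  // true
        System.out.println(isValid("kjdfg34"));         // false: shorter than 9
    }
}
```

String#matches() requires the whole string to match, so no ^ or $ anchors are needed here.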
Try this Regex:
^(?:(?=[a-z])[a-z0-9]{9,31}|(?=\d.*[a-z])[a-z0-9]{9,31})$
OR a bit shorter form:
^(?:(?=[a-z])|(?=\d.*[a-z]))[a-z0-9]{9,31}$
Explanation (for the 1st regex):
^ - position before the start of the string
(?=[a-z])[a-z0-9]{9,31} means: if the string starts with a letter, then match letters and digits, minimum 9 and maximum 31
| - OR
(?=\d.*[a-z])[a-z0-9]{9,31} means: if the string starts with a digit followed by a letter somewhere in the string, then match letters and digits, minimum 9 and maximum 31. This also ensures that if the string starts with a digit and there is no letter anywhere in the string, there won't be any match
$ - position after the last character of the string
OUTPUT:
413123123123131 NO MATCH (no letters)
kjkhsjkf989089054835werewrew65 MATCH
kdfgfd4374985794379857984379857weorjijuiower NO MATCH(length more than 31)
9087erkjfg9080980984590p465467 MATCH
4131231231231a MATCH
kjdfg34 NO MATCH(Length less than 9)
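To try the shorter form from Java, the backslash has to be doubled inside the string literal. The following is a rough sketch; the class name LengthAndLetterCheck and the method isValid are assumptions made for the example.

```java
import java.util.regex.Pattern;

class LengthAndLetterCheck {
    // The shorter regex from the answer, written as a Java string (note \\d).
    private static final Pattern P =
            Pattern.compile("^(?:(?=[a-z])|(?=\\d.*[a-z]))[a-z0-9]{9,31}$");

    // Hypothetical helper: true when the whole input satisfies the pattern.
    static boolean isValid(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "413123123123131",
            "kjkhsjkf989089054835werewrew65",
            "kdfgfd4374985794379857984379857weorjijuiower",
            "4131231231231a",
            "kjdfg34"
        };
        for (String input : inputs) {
            System.out.println(input + " " + (isValid(input) ? "MATCH" : "NO MATCH"));
        }
    }
}
```

Because the character class is [a-z0-9], uppercase letters are rejected; that follows the question's wording and would need adjusting if mixed case is allowed.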
Here's the regex:
[a-zA-Z\d]*[a-zA-Z][a-zA-Z\d]*
The trick here is to have something that is not optional. The leading and trailing [a-zA-Z\d] has a * quantifier, so they are optional. But the [a-zA-Z] in the middle there is not optional. The string must have a character that matches [a-zA-Z] in order to be matched.
However, you need to check the length of the string with length afterwards and not with regex. I can't think of any way how you can do this in regex.
Actually, I think you can do this regexless pretty easily:
private static boolean matches(String input) {
for (int i = 0 ; i < input.length() ; i++) {
if (Character.isLetter(input.charAt(i))) {
return input.length() >= 9 && input.length() <= 31;
}
}
return false;
}

Replacing consecutive repeated characters in java

I am working on Twitter data normalization. Twitter users frequently use terms like "I looooooove it" in order to emphasize the word love. I want to reduce such repeated characters to a proper English word by removing repeated characters until I get a proper meaningful word (I am aware that I cannot differentiate between good and god by this mechanism).
My strategy would be:
1. Identify the existence of such repeated strings. I would look for more than 2 of the same character in a row, as there is probably no English word with more than two consecutive repeated characters.
String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };
String regex = "([a-z])\\1{2,}";
Pattern pattern = Pattern.compile(regex);
for (String string : strings) {
    Matcher matcher = pattern.matcher(string);
    if (matcher.find()) {
        System.out.println(string+" TRUE ");
    }
}
2. Search for such words in a lexicon like WordNet.
3. Replace all but two of the repeated characters and check in the lexicon.
4. If it is not in the lexicon, remove one more repeated character (otherwise treat it as a misspelling).
Due to my poor Java knowledge I am unable to manage 3 and 4. The problem is that I cannot replace all but two repeated consecutive characters.
The following code snippet replaces all but one of the repeated characters:
System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
Help is required to find out
A. How to replace all but 2 consecutive repeat characters
B. How to remove one more consecutive character from the output of A
[I think B can be managed by the following code snippet]
System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
Edit: Solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in python.
Python uses re.sub.
Your regex ([a-z])\\1{2,} matches and captures an ASCII letter into Group 1 and then matches 2 or more occurrences of this value. So, all you need is to replace the match with a backreference, $1, that holds the captured value. If you use one $1, aaaaa will be replaced with a single a, and if you use $1$1, it will be replaced with aa.
String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
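Steps 3 and 4 of the strategy in the question could be combined roughly as follows. This is a sketch under assumptions: the lexicon is a plain Set<String> standing in for a real WordNet lookup, the class name ElongationNormalizer and the method normalize are hypothetical, and leaving the word unchanged when neither candidate is found is just one possible way of treating it as a misspelling.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ElongationNormalizer {
    // A letter followed by two or more copies of itself (three or more in a row).
    private static final String RUN = "([a-zA-Z])\\1{2,}";

    // Hypothetical helper: try the two-letter form first, then the one-letter form.
    static String normalize(String word, Set<String> lexicon) {
        String two = word.replaceAll(RUN, "$1$1");
        if (lexicon.contains(two)) {
            return two;
        }
        String one = word.replaceAll(RUN, "$1");
        // Assumption: if neither candidate is known, keep the word as it was.
        return lexicon.contains(one) ? one : word;
    }

    public static void main(String[] args) {
        // Toy lexicon for the example only.
        Set<String> lexicon = new HashSet<>(Arrays.asList("stop", "love", "good", "clap"));
        System.out.println(normalize("looooooove", lexicon)); // love
        System.out.println(normalize("goood", lexicon));      // good
        System.out.println(normalize("claaap", lexicon));     // clap
        System.out.println(normalize("xyzzzz", lexicon));     // xyzzzz
    }
}
```

Note that this collapses every long run in the word the same way, so a word that needs one run reduced to two letters and another to one would not be found.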
BONUS: In a general case, to replace any amount of any repeated consecutive chars use
text = text.replaceAll("(?s)(.)\\1+", "$1"); // any chars
text = text.replaceAll("(.)\\1+", "$1"); // any chars but line breaks
text = text.replaceAll("(\\p{L})\\1+", "$1"); // any letters
text = text.replaceAll("(\\w)\\1+", "$1"); // any ASCII alnum + _ chars
/* This code checks for a character in a given string repeated consecutively 3 times.
   If you want to check for 4 consecutive times, change count==2 to count==3, or
   if you want to check for 2 consecutive times, change count==2 to count==1. */
public class Test1 {
    static char ch;
    public static void main(String[] args) {
        String str="aabbbbccc";
        char[] charArray = str.toCharArray();
        int count=0;
        for(int i=0;i<charArray.length;i++){
            if(i!=0 ){
                if(charArray[i]==ch)continue;//ddddee
                if(charArray[i]==charArray[i-1]) {
                    count++;
                    if(count==2){
                        System.out.println(charArray[i]);
                        count=0;
                        ch=charArray[i];
                    }
                }
                else{
                    count=0;//aabb
                }
            }
        }
    }
}

Why are my character and word counts off?

Given the following string:
String text = "The woods are\nlovely,\t\tdark and deep.";
I want all whitespace treated as a single character. So for instance, the \n is 1 char. The \t\t should also be 1 char. With that logic, I count 36 characters and 7 words. But when I run this through the following code:
String text = "The woods are\nlovely,\t\tdark and deep.";
int numNewCharacters = 0;
for(int i=0; i < text.length(); i++)
if(!Character.isWhitespace(text.charAt(i)))
numNewCharacters++;
int numNewWords = text.split("\\s").length;
// Prints "30"
System.out.println("Chars:" + numNewCharacters);
// Prints "8"
System.out.println("Words:" + numNewWords);
It's telling me that there are 30 characters and 8 words. Any ideas as to why? Thanks in advance.
You are matching on individual whitespaces. Instead you could match on one or more:
text.split("\\s+")
You are counting only non-whitespace characters in the first loop, so spaces, tabs and newlines are not counted at all. That makes 30 the right answer. As for the second, split is treating consecutive whitespace characters as separate delimiters, so there is an empty "word" between the two tabs.
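A quick way to see the empty token is to print the split result. The sketch below uses the string from the question; the class name WhitespaceSplitDemo and the helper countTokens are made up for the example.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Hypothetical helper: number of tokens produced by splitting on the given regex.
    static int countTokens(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        String text = "The woods are\nlovely,\t\tdark and deep.";
        // Each tab is its own delimiter, so an empty string appears between them.
        System.out.println(Arrays.toString(text.split("\\s")));
        // [The, woods, are, lovely,, , dark, and, deep.]
        System.out.println(countTokens(text, "\\s"));  // 8
        System.out.println(countTokens(text, "\\s+")); // 7
    }
}
```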
Reimueus has already solved your word count problem:
text.split("\\s+")
And your character count is correct. Newlines \n and tabs \t are considered whitespace. If you don't want them to be, you can implement your own isWhitespace function.
Here is the complete solution to counting words and characters:
System.out.println("Characters: " + text.replaceAll("\\s+", " ").length());
Matcher m = Pattern.compile("[^\\s]+", Pattern.MULTILINE).matcher(text);
int wordCount = 0;
while (m.find()) {
wordCount ++;
}
System.out.println("Words: "+ wordCount);
The character count is obtained by replacing each group of whitespace characters with a single space and taking the resulting string's length.
For the word count we create a pattern that matches any group of characters that does not contain whitespace. You could use the \\w+ pattern here, but it will match only alphanumeric characters and underscores. Note that Pattern.MULTILINE only changes how ^ and $ match, so it has no effect on this pattern and can be omitted.

Java RegEx - for an Integer not containing a "."

I need to be able to return signed and unsigned integer constants with no
intervening symbols, possibly preceded by + or -. The only allowed digits are 3, 4, and 5.
I can't figure out a way to say that the expression must not contain a period before or after the integer.
This is what I have so far, but if I pass say "34.5 - 43" the string returned will be: "34 5 43".
All that needs to be returned is "43".
public String getInts(String toBeScanned){
String INT = "";
Pattern p = Pattern.compile("\\b[+-]?[3-5]+\\b");
Matcher m = p.matcher(toBeScanned);
if (m.matches() == true){
INT = toBeScanned;
}
else{
m = p.matcher(" " + toBeScanned);
while (m.find()){
INT = INT + m.group() + " ";
}
}
return INT;
}
Any thoughts or pushes in the right direction are appreciated. Is there a way to say it that the first and last character can be [\b and not .]
This is frustrating the heck out of me. Help!
You don't want a word boundary \b here. I think the best is to create your own assertion, try this
(?<![.\d])[+-]?[3-5]+(?![.\d])
(?<![.\d]) is a negative lookbehind assertion: it says that no dot and no digit is allowed directly before the pattern.
(?![.\d]) is a negative lookahead assertion: it says that no dot and no digit is allowed directly after the pattern.
Improvement
To avoid matching things like "hf34", we can make it stricter:
(?<![.\w])[+-]?[3-5]+(?![.\w])
The word boundary \b
\b matches on a change from a word character to a non-word character, or the other way round. A word character is a letter, a digit or a _. That means you will also get problems with your \b before the [+-], because there is no \b between a space/start of the string and a [+-].
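In Java source the stricter pattern needs its backslashes doubled. The following sketch applies it to the input from the question; the class name IntegerScanner and the method findInts are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntegerScanner {
    // No dot or word character directly before or after the optionally signed number.
    private static final Pattern INT =
            Pattern.compile("(?<![.\\w])[+-]?[3-5]+(?![.\\w])");

    // Hypothetical helper: collects every match in order of appearance.
    static List<String> findInts(String toBeScanned) {
        List<String> result = new ArrayList<>();
        Matcher m = INT.matcher(toBeScanned);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findInts("34.5 - 43"));   // [43]
        System.out.println(findInts("+45 hf34 -3")); // [+45, -3]
    }
}
```

Since lookarounds are zero-width, the characters around the number are checked but never become part of the match, so no trimming is needed afterwards.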
"\b[+-]?[3-5]+[.][3-5]+\b"
This pattern says that in order to match, there must be at least one number before, and one number after the decimal point.
Is there a way to say it that the first and last character can be [\b and not .]
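[^\.\b]
matches a single character that is not '.'. Inside a character class, \b does not mean a word boundary, and the class always consumes one character, so it cannot stand in for a zero-width check.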
Is that what you are looking for?
[^\.\b][+-]?[3-5]+[^\.\b]
Will match '43' but not '34.5'
