I am working on Twitter data normalization. Twitter users frequently use terms like "I looooooove it" in order to emphasize the word love. I want to reduce such repeated characters to a proper English word by removing repeated characters until I get a proper, meaningful word (I am aware that I cannot differentiate between good and god by this mechanism).
My strategy would be:
1. Identify the existence of such repeated strings. I would look for more than two identical consecutive characters, as there is probably no English word with more than two repeated characters.
String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };
String regex = "([a-z])\\1{2,}";
Pattern pattern = Pattern.compile(regex);
for (String string : strings) {
    Matcher matcher = pattern.matcher(string);
    if (matcher.find()) {
        System.out.println(string+" TRUE ");
    }
}
2. Search for such words in a lexicon like WordNet.
3. Replace all but two of the repeated characters and check the lexicon.
4. If the word is not in the lexicon, remove one more repeated character (otherwise treat it as a misspelling).
Due to my poor Java knowledge I am unable to manage steps 3 and 4. The problem is that I cannot replace all but two of the repeated consecutive characters.
The following code snippet replaces all but one of the repeated characters:
System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
Help is required to find out
A. How to replace all but 2 consecutive repeat characters
B. How to remove one more consecutive character from the output of A
[I think B can be managed by the following code snippet]
System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
Edit: The solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in Python.
Python uses re.sub.
Your regex ([a-z])\\1{2,} matches and captures a lowercase ASCII letter into Group 1 and then matches 2 or more further occurrences of this value. So, all you need is to replace the match with a backreference, $1, that holds the captured value. If you use a single $1, aaaaa will be replaced with a single a, and if you use $1$1, it will be replaced with aa.
String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
See the Java demo.
If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
BONUS: In a general case, to replace any amount of any repeated consecutive chars use
text = text.replaceAll("(?s)(.)\\1+", "$1"); // any chars
text = text.replaceAll("(.)\\1+", "$1"); // any chars but line breaks
text = text.replaceAll("(\\p{L})\\1+", "$1"); // any letters
text = text.replaceAll("(\\w)\\1+", "$1"); // any ASCII alnum + _ chars
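Putting the two replacements together with the lexicon check from the question could look roughly like the sketch below. The class name ElongationNormalizer, the method normalize and the Set<String> lexicon are hypothetical stand-ins (a real system would query WordNet or a similar resource), and the sketch only tries two candidates per word, collapsing every elongated run in the same way.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class ElongationNormalizer {
    // Three or more of the same ASCII letter in a row, case-insensitive.
    private static final Pattern RUN = Pattern.compile("(?i)([a-z])\\1{2,}");

    /** Returns the first candidate found in the lexicon, or the input unchanged. */
    static String normalize(String word, Set<String> lexicon) {
        if (!RUN.matcher(word).find()) {
            return word; // nothing elongated, leave the word alone
        }
        // Step 3: keep two characters of every elongated run.
        String two = RUN.matcher(word).replaceAll("$1$1");
        if (lexicon.contains(two.toLowerCase())) {
            return two;
        }
        // Step 4: keep only one character of every elongated run.
        String one = RUN.matcher(word).replaceAll("$1");
        if (lexicon.contains(one.toLowerCase())) {
            return one;
        }
        return word; // assumption: unknown words are left for a spell checker
    }

    public static void main(String[] args) {
        // Assumed toy lexicon, standing in for WordNet.
        Set<String> lexicon = Set.of("love", "stop", "clap", "good", "cool");
        System.out.println(normalize("looooooove", lexicon));
        System.out.println(normalize("coooool", lexicon));
    }
}
```

Words where different runs need different treatment (one run kept double, another collapsed to a single character) are not handled by this sketch; that would need a search over combinations.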
/* This code prints each character that is repeated 3 times consecutively in the given string.
   To check for 4 consecutive times change count==2 to count==3, or
   to check for 2 consecutive times change count==2 to count==1. */
public class Test1 {
    // Last character that was reported; further repeats of it are skipped.
    static char ch;
    public static void main(String[] args) {
        String str="aabbbbccc";
        char[] charArray = str.toCharArray();
        int count=0;
        for(int i=0;i<charArray.length;i++){
            if(i!=0 ){
                if(charArray[i]==ch)continue;//ddddee
                if(charArray[i]==charArray[i-1]) {
                    count++;
                    if(count==2){
                        System.out.println(charArray[i]);
                        count=0;
                        ch=charArray[i];
                    }
                }
                else{
                    count=0;//aabb
                }
            }
        }
    }
}
I want to split a string after a certain length.
Let's say we have a string of "message"
123456789
Split like this :
"12" "34" "567" "89"
I thought of first splitting it into chunks of 2 using the regexp
"(?<=\\G.{2})"
and then joining the last two chunks and splitting again into 3, but is there any way to do it in a single go using a regexp? Please help me out.
Use ^(.{2})(.{2})(.{3})(.{2}).* (see it in action in regex101) to group the String into the specified lengths and grab the groups as separate Strings:
String input = "123456789";
List<String> output = new ArrayList<>();
Pattern pattern = Pattern.compile("^(.{2})(.{2})(.{3})(.{2}).*");
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
output.add(matcher.group(i));
}
}
System.out.println(output);
NOTE: Group capturing starts from 1 as the group 0 matches the whole String
And a magnificent piece of sorcery from @YCF_L in the comments:
String pattern = "^(.{2})(.{2})(.{3})(.{2}).*";
String[] vals = "123456789".replaceAll(pattern, "$1-$2-$3-$4").split("-");
The magic here is that you can refer to the captured groups in the replacement string of the replaceAll() method. Use $n (where n is a digit) to refer to captured subsequences. See this stackoverflow question for a better explanation.
NOTE: Here it is assumed that no input string contains -. If one does, pick any other character that will not occur in any of your input strings and use it as the delimiter.
Test this regex in regex101 with the test string 123456789:
^(\d{2})(\d{2})(\d{3})(\d{2})$
Output:
Match 1
Full match 0-9 `123456789`
Group 1. 0-2 `12`
Group 2. 2-4 `34`
Group 3. 4-7 `567`
Group 4. 7-9 `89`
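If the chunk widths vary from case to case, the same result can be had without a regex at all. The sketch below swaps the capture-group approach for plain substring calls; FixedWidthSplitter and splitByLengths are hypothetical names, and like the .* in the patterns above it silently ignores any characters left over after the last chunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts a string into consecutive chunks of the given widths.
class FixedWidthSplitter {
    static List<String> splitByLengths(String input, int... lengths) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int length : lengths) {
            int end = start + length;
            if (end > input.length()) {
                throw new IllegalArgumentException("input is shorter than the requested widths");
            }
            parts.add(input.substring(start, end));
            start = end;
        }
        return parts; // trailing characters beyond the last width are dropped
    }

    public static void main(String[] args) {
        System.out.println(splitByLengths("123456789", 2, 2, 3, 2)); // prints [12, 34, 567, 89]
    }
}
```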
public static int getWordCount(String sentence) {
return sentence.split("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)", -1).length
+ sentence.replaceAll("([[a-z][A-Z][0-9][\\W][-][_]]*)", "").length() - 1;
}
My intention is to count the number of words in a sentence. The input to this function is a lengthy sentence, which may have 255 words.
A word may contain hyphens or underscores in between.
The function should only count valid words, meaning that runs of special characters should not be counted, e.g. &&&& or #### should not count as a word.
The above regular expression works fine, but when a hyphen or underscore comes in the middle of a word, e.g. co-operation, the count is returned as 2 when it should be 1. Can anyone please help?
Instead of using .split and .replaceAll which are quite expensive operations, please use an approach with constant memory usage.
Based on your specifications, you seem to look for the following regex:
[\w-]+
Next you can use this approach to count the number of matches:
public static int getWordCount(String sentence) {
Pattern pattern = Pattern.compile("[\\w-]+");
Matcher matcher = pattern.matcher(sentence);
int count = 0;
while (matcher.find())
count++;
return count;
}
online jDoodle demo.
This approach works in (more) constant memory: when splitting, the program constructs an array, which is basically useless, since you never inspect the content of the array.
If you don't want words to start or end with hyphens, you can use the following regex:
\w+([-]\w+)*
This part ([-][_])* is wrong. The notation [xyz] means "any single one of the characters inside the brackets" (see http://www.regular-expressions.info/charclass.html). So effectively, you allow exactly the character - followed by exactly the character _, in that order, repeated zero or more times.
Fixing your group makes it work:
[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*
and it can be further simplified using \w to
\w+(-\w+)*
because \w matches 0..9, A..Z, a..z and _ (http://www.regular-expressions.info/shorthand.html) and so you only need to add -.
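As a quick check of the simplified pattern, here is a minimal sketch that counts matches of \w+(-\w+)* with Matcher.find(). The class name WordCounter and the sample sentence are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the simplified word pattern.
class WordCounter {
    // One or more word characters, optionally continued by -word parts.
    private static final Pattern WORD = Pattern.compile("\\w+(-\\w+)*");

    static int countWords(String sentence) {
        Matcher matcher = WORD.matcher(sentence);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "co-operation", "is" and "key" are counted; "&&&&" and "####" are not.
        System.out.println(countWords("co-operation is key &&&& ####")); // prints 3
    }
}
```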
If you can use Java 8:
long wordCount = Arrays.stream(sentence.split(" ")) //split the sentence into words
.filter(s -> s.matches("[\\w-]+")) //filter only matching words
.count();
With Java 9 or later (Matcher.results() was added in Java 9):
public static int getColumnCount(String row) {
return (int) Pattern.compile("[\\w-]+")
.matcher(row)
.results()
.count();
}
I can't figure out how to solve this:
Given two strings, one representing a pattern and the other an arbitrary string, determine whether the second string matches the pattern.
ex:
string1: "aaba"
string2: "catcatdogcat"
thus, string1 and string2 are pattern matched
versus if string2 were "catcatcatcat" this would not be pattern matched.
Do this for any pattern and string.
I know it's recursion but I'm pretty stuck on... how to solve this
OK, I'm going to try to explain a recursion for this. It sounds right, but I don't have a chance to test it (not at home).
Take a vector v['size of alphabet'], where v[i] = how many letters of string2 correspond to letter i of string1.
In your case, at the end it is: v['a'] = 3, v['b'] = 3.
You initialise the vector with 1.
For the rec function:
You take the first letter from string1: a.
The representation of a in string2 is the substring that starts at string2 and ends at string2 + v['a'], which is 'c'.
You check whether this is a valid solution so far, and it is.
Then you go into rec(string1 + 1), letter a again.
Since v['a'] is still 1, you take the second a as 'a'.
You check whether this is a valid solution, and it is not, because you have already defined the first a as 'c'.
You go back in the recursion, increment v['a'], and start from the beginning.
You take the first letter of string1: a.
Its representation in string2 is now 'ca' (now v['a'] = 2).
Check if valid.
rec(string1 + 1);
and so on...
At some point you will reach v['a'] = 3 and v['b'] = 3;
then with the rec function you will find the solution.
I for one find it easier to implement as an iterative function, but you said something about recursion, so yeah.
Take the number of unique letters. Then you want to iterate through all combinations of possible lengths for each letter using the following constraints:
sum(length of letter * occurrences of letter) has to be the length of string2
Each length must be at least 1
That is, for 2 unique letters that each occur once, and a string length of 4, the possible lengths are:
(1, 3), (2, 2) and (3, 1)
From here it's simple. For each unique letter you can find out the string that letter must represent, as you know the length assigned to each letter. Then it's a matter of mapping each letter to the string it must represent, and if at any time a letter corresponds to a string that doesn't match an earlier instance of it, then you have no match.
For your example:
string1: "aaba"
string2: "catcatdogcat"
Here, consider the iteration where the lengths are (3, 3). Since we know a is of length 3, we know the first occurrence of a must be "cat". Then the next a corresponds to "cat" (still a match). Then the next 3 characters have to correspond to b. This is the first b, so it can match any 3 chars. Then match the a at the end to "cat" again, and you're done.
If you want a, b, c to map to distinct strings, as outlined in @MartijnCourteaux's comment (and in your question, now that I read it again), then at the end you can just check your map for common values; if there are any, then you have no match.
If you have a match at ANY iteration, then the string matches the pattern. There is no match only if no iteration produces a match.
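One way to sketch this in code is recursive backtracking, which tries candidate lengths for each letter on the fly instead of enumerating all length combinations up front. The class and method names below are hypothetical, and the sketch assumes that distinct letters must map to distinct, non-empty strings (so "aaba" does not match "catcatcatcat").

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: backtracking match of a letter pattern against a text.
class PatternMatcher {
    static boolean matches(String pattern, String text) {
        return match(pattern, 0, text, 0, new HashMap<>());
    }

    private static boolean match(String pattern, int p, String text, int t,
                                 Map<Character, String> binding) {
        if (p == pattern.length()) {
            return t == text.length(); // pattern and text must be used up together
        }
        char letter = pattern.charAt(p);
        String bound = binding.get(letter);
        if (bound != null) {
            // Letter seen before: the text must continue with the same substring.
            return text.startsWith(bound, t)
                    && match(pattern, p + 1, text, t + bound.length(), binding);
        }
        // First occurrence: try every non-empty substring starting at t.
        for (int end = t + 1; end <= text.length(); end++) {
            String candidate = text.substring(t, end);
            if (binding.containsValue(candidate)) {
                continue; // assumption: distinct letters need distinct strings
            }
            binding.put(letter, candidate);
            if (match(pattern, p + 1, text, end, binding)) {
                return true;
            }
            binding.remove(letter); // backtrack and try a longer candidate
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("aaba", "catcatdogcat")); // prints true
        System.out.println(matches("aaba", "catcatcatcat")); // prints false
    }
}
```

The worst case is exponential in the number of distinct letters, which is usually acceptable for short patterns.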
This is quite easy to achieve:
Regex is the way to go. In regex, there is something called a backreference. A backreference needs to match the very same string that the referenced match group has already matched. For example, the regex ^([ab])\\1$ will match every string like aa or bb. The first group matches either a or b, but the backreference NEEDS to match the same thing that the match group (in this case group 1) matched.
So, all you need to do is convert your string-based pattern to a regex pattern.
Example:
String regex = "^([a-z]+)\\1([a-z]+)\\1$";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher("catcatdogcat");
if (m.matches()){
System.out.println("matches!");
System.out.println(m.group(0));
System.out.println(m.group(1));
System.out.println(m.group(2));
}else{
System.out.println("no matches!");
}
produces:
matches!
catcatdogcat
cat
dog
This will match your given string "catcatdogcat" exactly, with match group 1 being "cat" and match group 2 being "dog".
What you now need to do is:
Write a function that checks your string pattern aaba char by char.
First occurrence of a letter: replace it with ([a-z]+) and note the number of that match group (array, HashMap, ...).
Any further occurrence of the letter: replace it with a backreference to the recorded group number, e.g. \\1 if the recorded number for the letter was 1.
Wrap the result with ^ and $.
Finally, your string aaba will be converted to ^([a-z]+)\\1([a-z]+)\\1$ and serve your needs. The pattern abccba will become the regex ^([a-z]+)([a-z]+)([a-z]+)\\3\\2\\1$.
Then use the matcher to check your given string.
This example assumes only lowercase characters, but you can extend it.
However, it is important to keep the "+", because "*" would allow zero-length matches, which would make your regex match far too often.
Second example mentioned:
import java.util.regex.*;
public class HelloWorld {
public static void main(String[] args) {
String regex = "^([a-z]+)([a-z]+)([a-z]+)\\3\\2\\1$";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher("catdogcowcowdogcat");
if (m.matches()){
System.out.println("matches!");
System.out.println(m.group(0));
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}else{
System.out.println("no matches!");
}
}
}
produces:
matches!
catdogcowcowdogcat
cat
dog
cow
edit: if needed (even if it does not 100% match your requirements - see comments):
public static String convertToRegex(String pattern){
String regex = "";
Map<Character, Integer> refs = new HashMap<Character, Integer>();
Integer i=1;
for (Character c : pattern.toCharArray()){
if (refs.containsKey(c)){
//known.
regex += "\\" + refs.get(c);
}else{
//unknown
regex += "([a-z]+)";
refs.put(c, i++);
}
}
return "^" + regex + "$";
}
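For reference, here is a self-contained variant of the conversion that can be run directly. It is only a sketch: the class name BackrefPattern and the matches helper are hypothetical, and it uses a StringBuilder instead of string concatenation. It also makes the limitation mentioned above visible, since nothing stops two different letters from capturing the same text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical wrapper around the letter-pattern-to-regex conversion.
class BackrefPattern {
    static String convertToRegex(String pattern) {
        StringBuilder regex = new StringBuilder("^");
        Map<Character, Integer> groups = new HashMap<>();
        for (char c : pattern.toCharArray()) {
            Integer group = groups.get(c);
            if (group == null) {
                // First occurrence: open a new capturing group and remember its number.
                groups.put(c, groups.size() + 1);
                regex.append("([a-z]+)");
            } else {
                // Later occurrence: backreference to the recorded group.
                regex.append('\\').append(group);
            }
        }
        return regex.append('$').toString();
    }

    static boolean matches(String pattern, String text) {
        return Pattern.matches(convertToRegex(pattern), text);
    }

    public static void main(String[] args) {
        System.out.println(matches("aaba", "catcatdogcat")); // prints true
        // Also true: group 2 is free to capture "cat" again.
        System.out.println(matches("aaba", "catcatcatcat")); // prints true
    }
}
```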
I am trying to create a String[] which contains only words that consist of certain characters. For example, I have a dictionary containing a number of words like so:
arm
army
art
as
at
attack
attempt
attention
attraction
authority
automatic
awake
baby
back
bad
bag
balance
I want to narrow the list down so that it only contains words with the characters a, b and g. Therefore the list should only contain the word 'bag' in this example.
Currently I am trying to do this using regexes but having never used them before I can't seem to get it to work.
Here is my code:
public class LetterJugglingMain {
public static void main(String[] args) {
String dictFile = "/Users/simonrhillary/Desktop/Dictionary(3).txt";
fileReader fr = new fileReader();
fr.openFile(dictFile);
String[] dictionary = fr.fileToArray();
String regx = "able";
String[] newDict = createListOfValidWords(dictionary, regx);
printArray(newDict);
}
public static String[] createListOfValidWords(String[] d, String regex){
List<String> narrowed = new ArrayList<String>();
for(int i = 0; i<d.length; i++){
if(d[i].matches(regex)){
narrowed.add(d[i]);
System.out.println("added " + d[i]);
}
}
String[] narrowArray = narrowed.toArray(new String[0]);
return narrowArray;
}
However, the array returned is always empty unless the String regex is the exact word! Any ideas? I can post more code if needed... I think I must be initialising the regex wrong.
The narrowed down list must contain ONLY the characters from the regex.
Frankly, I'm not an expert in regexes, but I don't think it's the best tool to do what you want. I would use a method like the following:
public boolean containsAll(String s, Set<Character> chars) {
Set<Character> copy = new HashSet<Character>();
for (int i = 0; i < s.length() && copy.size() < chars.size(); i++) {
char c = s.charAt(i);
if (chars.contains(c)) {
copy.add(c);
}
}
return copy.size() == chars.size();
}
The regex able will match only the string "able". However, if you want a regular expression to match any one of the characters a, b, l or e, the regex you're looking for is [able] (in brackets). If you want words containing several such characters, add a + for repeating the pattern: [able]+.
The OP wants words that contain every character. Not just one of them.
And other characters are not a problem.
If this is the case, I think the simplest way would be to loop through the entire string, character by character, and check whether it contains all of the characters you want. Keep flags to record whether every character has been found.
If this isn't the case.... :
Try using the regex:
^[able]+$
Here's what it does:
^ matches the beginning of the string and $ matches the end of the string. This makes sure that you're not getting a partial match.
[able] matches the characters you want the string to consist of, in this case a, b, l, and e. + Makes sure that there are 1 or more of these characters in the string.
Note: This regex will match any string that consists only of these 4 letters, in any order and with any repetition (not necessarily all four). For example, it will match:
able, albe, aeble, aaaabbblllleeee
and will not match
qable, treatable, and abled.
Here is a sample regex that selects words containing at least one occurrence of every character in a set. This one will match any English word (case-insensitive) that contains at least one occurrence of each of the characters a, b, g:
(?i)(?=.*a)(?=.*b)(?=.*g)[a-z]+
Example of strings that match would be bag, baggy, grab.
Example of strings that don't match would be big, argument, nothing.
The (?i) turns on the case-insensitive flag.
You need to append one (?=.*<character>) for each character in the set.
I assume a word only contains letters of the English alphabet, so I specify [a-z]. Specify more if you need space, hyphen, etc.
I assume you use the matches(String regex) method of the String class, so I omitted the ^ and $.
The performance may be bad, since in the worst case (the characters are found at the end of the words), I think that the regex engine may go through the string for around n times where n is the number of characters in the set. It may not be an actual concern at all, since the words are very short, but if it turns out that this is a bottleneck, you may consider doing simple looping.
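To apply this to a whole dictionary, the lookaheads can be generated from the required letters. The following is a minimal sketch under the same assumptions as the regex above (words consist of English letters only); LetterFilter, containsAll and filter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: keeps the words that contain every required letter.
class LetterFilter {
    /** Builds (?i)(?=.*x)(?=.*y)...[a-z]+ with one lookahead per required letter. */
    static Pattern containsAll(String letters) {
        StringBuilder regex = new StringBuilder("(?i)");
        for (char c : letters.toCharArray()) {
            regex.append("(?=.*").append(Pattern.quote(String.valueOf(c))).append(')');
        }
        return Pattern.compile(regex.append("[a-z]+").toString());
    }

    static List<String> filter(List<String> words, String letters) {
        Pattern pattern = containsAll(letters);
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            // matches() anchors at both ends, so no ^ and $ are needed.
            if (pattern.matcher(word).matches()) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> words = List.of("arm", "baby", "back", "bad", "bag", "balance");
        System.out.println(filter(words, "abg")); // prints [bag]
    }
}
```

Compiling the pattern once and reusing it for every word avoids the recompilation that String.matches would perform on each call.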
Can anyone give me a Java regex to identify repeated characters in a string? I am only looking for characters that are repeated immediately and they can be letters or digits.
Example:
abccde <- looking for this (immediately repeating c's)
abcdce <- not this (c's separated by another character)
Try "(\\w)\\1+"
The \\w matches any word character (letter, digit, or underscore) and the \\1+ matches whatever was in the first set of parentheses, one or more times. So you wind up matching any occurrence of a word character, followed immediately by one or more of the same word character again.
(Note that I gave the regex as a Java string, i.e. with the backslashes already doubled for you)
String stringToMatch = "abccdef";
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find())
{
System.out.println("Duplicate character " + m.group(1));
}
Regular expressions can be expensive. You would probably be better off just storing the last character and checking to see if the next one is the same.
Something along the lines of:
String s = "abccde"; // sample input from the question
char previous = s.charAt(0);
for (int i = 1; i < s.length(); i++) {
    char current = s.charAt(i);
    if (current == previous) {
        System.out.println("Duplicate character " + current);
    }
    previous = current;
}