For a given word, I want to search for all the substrings that appear next to each other at least 3 times, and replace all of them by only one. I know how to do this when the substring is only one character. For instance, the code below returns "Bah" for the input string "Bahhhhhhh":
String term = "Bahhhhhhh";
term = term.replaceAll("(.)\\1{2,}", "$1");
However, I need a more generic pattern that converts "Bahahahaha" into "Baha".
String[] terms = { "Bahhhhhhh", "Bahahahaha" };
for (String term : terms) {
System.out.println(term.replaceAll("(.+?)\\1{2,}", "$1"));
}
Output:
Bah
Baha
This will work for repetitions of 1, 2, or 3 characters long.
String term = "Bahhhhhhh";
term = term.replaceAll("(.{1,3})\\1{2,}", "$1");
You'll want to be careful to avoid huge backtracking performance hits. That's why I limited it to 1-3 characters.
Related
I am working on twitter data normalization. Twitter users frequently uses terms like ts I looooooove it in order to emphasize the word love. I want to such repeated characters to a proper English word by replacing repeat characters till I get a proper meaningful word (I am aware that I can not differentiate between good and god by this mechanism).
My strategy would be
identify existence of such repeated strings. I would look for more than 2 same characters, as probably there is no English word with more than two repeat characters.
String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };
String regex = "([a-z])\\1{2,}";
Pattern pattern = Pattern.compile(regex);
for (String string : strings) {
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(string+" TRUE ");
}
}
Search for such words in a Lexicon like Wordnet
Replace all but two such repeat characters and check in Lexicon
If not there in the Lexicon remove one more repeat character (Otherwise treat it as misspelling).
Due to my poor Java knowledge I am unable to manage 3 and 4. Problem is I can not replace all but two repeated consecutive characters.
Following code snippet replace all but one repeated characters System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
Help is required to find out
A. How to replace all but 2 consecutive repeat characters
B. How to remove one more consecutive character from the output of A
[I think B can be managed by the following code snippet]
System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
Edit: Solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in python.
Python uses re.sub.
Your regex ([a-z])\\1{2,} matches and captures an ASCII letter into Group 1 and then matches 2 or more occurrences of this value. So, all you need to replace with a backreference, $1, that holds the value captured. If you use one $1, the aaaaa will be replaced with a single a and if you use $1$1, it will be replaced with aa.
String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
See the Java demo.
If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
BONUS: In a general case, to replace any amount of any repeated consecutive chars use
text = text.replaceAll("(?s)(.)\\1+", "$1"); // any chars
text = text.replaceAll("(.)\\1+", "$1"); // any chars but line breaks
text = text.replaceAll("(\\p{L})\\1+", "$1"); // any letters
text = text.replaceAll("(\\w)\\1+", "$1"); // any ASCII alnum + _ chars
/*This code checks a character in a given string repeated consecutively 3 times
if you want to check for 4 consecutive times change count==2--->count==3 OR
if you want to check for 2 consecutive times change count==2--->count==1*/
public class Test1 {
static char ch;
public static void main(String[] args) {
String str="aabbbbccc";
char[] charArray = str.toCharArray();
int count=0;
for(int i=0;i<charArray.length;i++){
if(i!=0 ){
if(charArray[i]==ch)continue;//ddddee
if(charArray[i]==charArray[i-1]) {
count++;
if(count==2){
System.out.println(charArray[i]);
count=0;
ch=charArray[i];
}
}
else{
count=0;//aabb
}
}
}
}
}
So if I have 22332, I want to replace that for BEA, as in mobile keypad.I want to see how many times a digit appear so that I can count A--2,B--22,C--222,D--3,E--33,F--333, etc(and a 0 is pause).I want to write a decoder that takes in digit string and replaces digit occurrences with letters.example : 44335557075557777 will be decoded as HELP PLS.
This is the key portion of the code:
public void printMessages() throws Exception {
File msgFile = new File("messages.txt");
Scanner input = new Scanner(msgFile);
while(input.hasNext()) {
String x = input.next();
String y = input.nextLine();
System.out.println(x+":"+y);
}
It takes the input from a file as digit String.Then Scanner prints the digit.I tried to split the string digits and then I don't know how to evaluate the appearance of the mentioned kind in the question.
for(String x : b.split(""))
System.out.print(x);
gives: 44335557075557777(input from the file).
I don't know how can I call each repetitive index and see how they formulate such pattern as in mobile keypad.If I use for loop then I have to cycle through whole string and use lots of if statements. There must be some other way.
Another suggestion of making use of regex in breaking the encoded string.
By making use of look-around + back-reference makes it easy to split the string at positions that preceding and following characters are different.
e.g.
String line = "44335557075557777";
String[] tokens = line.split("(?<=(.))(?!\\1)");
// tokens will contain ["44", "33", "555", "7", "0", "7", "555", "7777"]
Then it should be trivial for you to map each string to its corresponding character, either by a Map or even naively by bunch of if-elses
Edit: Some background on the regex
(?<=(.))(?!\1)
(?<= ) : Look behind group, which means finding
something (a zero-length patternin this example)
preceded by this group of pattern
( ) : capture group #1
. : any char
: zero-length pattern between look behind and look
ahead group
(?! ) : Negative look ahead group, which means finding
a pattern (zero-length in this example) NOT followed
by this group of pattern
\1 : back-reference, whatever matched by
capture group #1
So it means, find any zero-length positions, for which the character before and after such position is different, and use such positions to do splitting.
I need to extract the desired string which attached to the word.
For example
pot-1_Sam
pot-22_Daniel
pot_444_Jack
pot_5434_Bill
I need to get the names from the above strings. i.e Sam, Daniel, Jack and Bill.
Thing is if I use substring the position keeps on changing due to the length of the number. How to achieve them using REGEX.
Update:
Some strings has 2 underscore options like
pot_US-1_Sam
pot_RUS_444_Jack
Assuming you have a standard set of above formats, It seems you need not to have any regex, you can try using lastIndexOf and substring methods.
String result = yourString.substring(yourString.lastIndexOf("_")+1, yourString.length());
Your answer is:
String[] s = new String[4];
s[0] = "pot-1_Sam";
s[1] = "pot-22_Daniel";
s[2] = "pot_444_Jack";
s[3] = "pot_5434_Bill";
ArrayList<String> result = new ArrayList<String>();
for (String value : s) {
String[] splitedArray = value.split("_");
result.add(splitedArray[splitedArray.length-1]);
}
for(String resultingValue : result){
System.out.println(resultingValue);
}
You have 2 options:
Keep using the indexOf method to get the index of the last _ (This assumes that there is no _ in the names you are after). Once that you have the last index of the _ character, you can use the substring method to get the bit you are after.
Use a regular expression. The strings you have shown essentially have the pattern where in you have numbers, followed by an underscore which is in turn followed by the word you are after. You can use a regular expression such as \\d+_ (which will match one or more digits followed by an underscore) in combination with the split method. The string you are after will be in the last array position.
Use a string tokenizer based on '_' and get the last element. No need for REGEX.
Or use the split method on the string object like so :
String[] strArray = strValue.split("_");
String lastToken = strArray[strArray.length -1];
String[] s = {
"pot-1_Sam",
"pot-22_Daniel",
"pot_444_Jack",
"pot_5434_Bill"
};
for (String e : s)
System.out.println(e.replaceAll(".*_", ""));
String s = "10.226.18.158:10.226.17.183:ABCD :AAAA"
My requirement is to split the string at up to 3rd : or up to 2nd :. i.e.
Something like String sa[] = s.split(), but with the regex splitting only up to 3rd or 2nd.
s[0] = "10.226.18.158"
s[1] = "10.226.17.183"
s[2] = "ABCD :AAAA"
According to the String#split() javadoc you can add a number to limit the number of splits.
s.split(":", 3);
Edit: as melwil metions This will return an array of up to the number passed in long.
So in your example of splitting up to 2nd : you would need to pass in 3.
s.split(":",3) returns the output
sa[0] = "10.226.18.158"
sa[1] = "10.226.17.183"
sa[2] = "ABCD :AAAA"
Relevent section quoted from the java doc about how the second argument (limit) works.
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array. If
the limit n is greater than zero then the pattern will be applied at
most n - 1 times, the array's length will be no greater than n, and
the array's last entry will contain all input beyond the last matched
delimiter. If n is non-positive then the pattern will be applied as
many times as possible and the array can have any length. If n is zero
then the pattern will be applied as many times as possible, the array
can have any length, and trailing empty strings will be discarded.
You can split your string basing on one non-whitespece character, \S{1}, followed by a colon, ::
String sa[] = s.split("\\S{1}:");
I have a String that I have to parse for different keywords.
For example, I have the String:
"I will come and meet you at the 123woods"
And my keywords are
'123woods'
'woods'
I should report whenever I have a match and where. Multiple occurrences should also be accounted for.
However, for this one, I should get a match only on '123woods', not on 'woods'. This eliminates using String.contains() method. Also, I should be able to have a list/set of keywords and check at the same time for their occurrence. In this example, if I have '123woods' and 'come', I should get two occurrences. Method execution should be somewhat fast on large texts.
My idea is to use StringTokenizer but I am unsure if it will perform well. Any suggestions?
The example below is based on your comments. It uses a List of keywords, which will be searched in a given String using word boundaries. It uses StringUtils from Apache Commons Lang to build the regular expression and print the matched groups.
String text = "I will come and meet you at the woods 123woods and all the woods";
List<String> tokens = new ArrayList<String>();
tokens.add("123woods");
tokens.add("woods");
String patternString = "\\b(" + StringUtils.join(tokens, "|") + ")\\b";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
If you are looking for more performance, you could have a look at StringSearch: high-performance pattern matching algorithms in Java.
Use regex + word boundaries as others answered.
"I will come and meet you at the 123woods".matches(".*\\b123woods\\b.*");
will be true.
"I will come and meet you at the 123woods".matches(".*\\bwoods\\b.*");
will be false.
Hope this works for you:
String string = "I will come and meet you at the 123woods";
String keyword = "123woods";
Boolean found = Arrays.asList(string.split(" ")).contains(keyword);
if(found){
System.out.println("Keyword matched the string");
}
http://codigounico.blogspot.com/
How about something like Arrays.asList(String.split(" ")).contains("xx")?
See String.split() and How can I test if an array contains a certain value.
Got a way to match Exact word from String in Android:
String full = "Hello World. How are you ?";
String one = "Hell";
String two = "Hello";
String three = "are";
String four = "ar";
boolean is1 = isContainExactWord(full, one);
boolean is2 = isContainExactWord(full, two);
boolean is3 = isContainExactWord(full, three);
boolean is4 = isContainExactWord(full, four);
Log.i("Contains Result", is1+"-"+is2+"-"+is3+"-"+is4);
Result: false-true-true-false
Function for match word:
private boolean isContainExactWord(String fullString, String partWord){
String pattern = "\\b"+partWord+"\\b";
Pattern p=Pattern.compile(pattern);
Matcher m=p.matcher(fullString);
return m.find();
}
Done
public class FindTextInLine {
String match = "123woods";
String text = "I will come and meet you at the 123woods";
public void findText () {
if (text.contains(match)) {
System.out.println("Keyword matched the string" );
}
}
}
Try to match using regular expressions. Match for "\b123wood\b", \b is a word break.
The solution seems to be long accepted, but the solution could be improved, so if someone has a similar problem:
This is a classical application for multi-pattern-search-algorithms.
Java Pattern Search (with Matcher.find) is not qualified for doing that. Searching for exactly one keyword is optimized in java, searching for an or-expression uses the regex non deterministic automaton which is backtracking on mismatches. In worse case each character of the text will be processed l times (where l is the sum of the pattern lengths).
Single pattern search is better, but not qualified, too. One will have to start the whole search for every keyword pattern. In worse case each character of the text will be processed p times where p is the number of patterns.
Multi pattern search will process each character of the text exactly once. Algorithms suitable for such a search would be Aho-Corasick, Wu-Manber, or Set Backwards Oracle Matching. These could be found in libraries like Stringsearchalgorithms or byteseek.
// example with StringSearchAlgorithms
AhoCorasick stringSearch = new AhoCorasick(asList("123woods", "woods"));
CharProvider text = new StringCharProvider("I will come and meet you at the woods 123woods and all the woods", 0);
StringFinder finder = stringSearch.createFinder(text);
List<StringMatch> all = finder.findAll();
A much simpler way to do this is to use split():
String match = "123woods";
String text = "I will come and meet you at the 123woods";
String[] sentence = text.split();
for(String word: sentence)
{
if(word.equals(match))
return true;
}
return false;
This is a simpler, less elegant way to do the same thing without using tokens, etc.
You can use regular expressions.
Use Matcher and Pattern methods to get the desired output
You can also use regex matching with the \b flag (whole word boundary).
To Match "123woods" instead of "woods" , use atomic grouping in regular expresssion.
One thing to be noted is that, in a string to match "123woods" alone , it will match the first "123woods" and exits instead of searching the same string further.
\b(?>123woods|woods)\b
it searches 123woods as primary search, once it got matched it exits the search.
Looking back at the original question, we need to find some given keywords in a given sentence, count the number of occurrences and know something about where. I don't quite understand what "where" means (is it an index in the sentence?), so I'll pass that one... I'm still learning java, one step at a time, so I'll see to that one in due time :-)
It must be noticed that common sentences (as the one in the original question) can have repeated keywords, therefore the search cannot just ask if a given keyword "exists or not" and count it as 1 if it does exist. There can be more then one of the same. For example:
// Base sentence (added punctuation, to make it more interesting):
String sentence = "Say that 123 of us will come by and meet you, "
+ "say, at the woods of 123woods.";
// Split it (punctuation taken in consideration, as well):
java.util.List<String> strings =
java.util.Arrays.asList(sentence.split(" |,|\\."));
// My keywords:
java.util.ArrayList<String> keywords = new java.util.ArrayList<>();
keywords.add("123woods");
keywords.add("come");
keywords.add("you");
keywords.add("say");
By looking at it, the expected result would be 5 for "Say" + "come" + "you" + "say" + "123woods", counting "say" twice if we go lowercase. If we don't, then the count should be 4, "Say" being excluded and "say" included. Fine. My suggestion is:
// Set... ready...?
int counter = 0;
// Go!
for(String s : strings)
{
// Asking if the sentence exists in the keywords, not the other
// around, to find repeated keywords in the sentence.
Boolean found = keywords.contains(s.toLowerCase());
if(found)
{
counter ++;
System.out.println("Found: " + s);
}
}
// Statistics:
if (counter > 0)
{
System.out.println("In sentence: " + sentence + "\n"
+ "Count: " + counter);
}
And the results are:
Found: Say
Found: come
Found: you
Found: say
Found: 123woods
In sentence: Say that 123 of us will come by and meet you, say, at the woods of 123woods.
Count: 5
If you want to identify a whole word in a string and change the content of that word you can do this way. Your final string stays equals, except the word you treated. In this case "not" stays "'not'" in final string.
StringBuilder sb = new StringBuilder();
String[] splited = value.split("\\s+");
if(ArrayUtils.isNotEmpty(splited)) {
for(String valor : splited) {
sb.append(" ");
if("not".equals(valor.toLowerCase())) {
sb.append("'").append(valor).append("'");
} else {
sb.append(valor);
}
}
}
return sb.toString();