codingbat wordEnds using regex - java

I'm trying to solve wordEnds from codingbat.com using regex.
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
wordEnds("XYXY", "XY") → "XY"
This is the simplest as I can make it with my current knowledge of regex:
public String wordEnds(String str, String word) {
return str.replaceAll(
".*?(?=word)(?<=(.|^))word(?=(.|$))|.+"
.replace("word", java.util.regex.Pattern.quote(word)),
"$1$2"
);
}
replace is used to place in the actual word string into the pattern for readability. Pattern.quote isn't necessary to pass their tests, but I think it's required for a proper regex-based solution.
The regex has two major parts:
If after matching as few characters as possible ".*?", word can still be found "(?=word)", then lookbehind to capture any character immediately preceding it "(?<=(.|^))", match "word", and lookforward to capture any character following it "(?=(.|$))".
The initial "if" test ensures that the atomic lookbehind captures only if there's a word
Using lookahead to capture the following character doesn't consume it, so it can be used as part of further matching
Otherwise match what's left "|.+"
Groups 1 and 2 would capture empty strings
I think this works in all cases, but it's obviously quite complex. I'm just wondering if others can suggest a simpler regex to do this.
Note: I'm not looking for a solution using indexOf and a loop. I want a regex-based replaceAll solution. I also need a working regex that passes all codingbat tests.
I managed to reduce the occurrence of word within the pattern to just one.
".+?(?<=(^|.)word)(?=(.?))|.+"
I'm still looking if it's possible to simplify this further, but I also have another question:
With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?

Based on your solution I managed to simplify the code a little bit:
public String wordEnds(String str, String word) {
return str.replaceAll(".*?(?="+word+")(?<=(.|^))"+word+"(?=(.|$))|.+","$1$2");
}
Another way of writing it would be:
public String wordEnds(String str, String word) {
return str.replaceAll(
String.format(".*?(?="+word+")(?<=(.|^))"+word+"(?=(.|$))|.+",word),
"$1$2");
}

With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?
In Oracle's implementation, the behavior of look-behind is as follow:
By "studying" the regex (with study() method in each node), it knows the maximum length and minimum length of the pattern in look-behind group. (The study() method is what allows for obvious look-behind length)
It verifies the look-behind by starting a match at every position from index (current - min_length) to position (current - max_length) and exits early if the condition is satisfied.
Effectively, it will try to verify the look-behind on the shortest string first.
The implementation multiplies the matching complexity by O(k) factor.
This explains why changing ^|. to .? doesn't work: due to the starting position, it effectively checks for word before .word. The quantifier doesn't have a say here, since the ordering is imposed by the match range.
You can check the code of match method in Pattern.Behind and Pattern.NotBehind inner classes to verify what I said above.
In .NET's flavor, look-behind is likely implemented by the reverse matching feature, which means that no extra factor is incurred on the matching complexity.
My suspicion comes from the fact that the capturing group in (?<=(a+))b matches all a's in aaaaaaaaaaaaaab. The quantifier is shown to have free reign in look-behind group.
I have tested that ^|. can be simplified to .? in .NET and the regex works correctly.

I am working in .NET's regex but I was able to change your pattern to:
.+?(?<=(\w?)word)(?=(\w?))|.+
with the positive results. You know its a word (alphanumeric) type character, why not give a valid hint to the parser of that fact; instead of any character its an optional alpha numeric character.
It may answer why you don't need to specify the anchors of ^ and $, for what exactly is $ - is it \r or \n or other? (.NET has issues with $, and maybe you are not exactly capturing a Null of $, but the null of \r or \n which allowed you to change to .? for $)

Another solution to look at...
public String wordEnds(String str, String word) {
if(str.equals(word)) return "";
int i = 0;
String result = "";
int stringLen = str.length();
int wordLen = word.length();
int diffLen = stringLen - wordLen;
while(i<=diffLen){
if(i==0 && str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i+wordLen);
}else if(i==diffLen && str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i-1);
}else if(str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i-1) + str.charAt(i+wordLen) ;
}
i++;
}
if(result.length()==1) result = result + result;
return result;
}

Another possible solution:
public String wordEnds(String str, String word) {
String result = "";
if (str.contains(word)) {
for (int i = 0; i < str.length(); i++) {
if (str.startsWith(word, i)) {
if (i > 0) {
result += str.charAt(i - 1);
}
if ((i + word.length()) < str.length()) {
result += str.charAt(i + word.length());
}
}
}
}
return result;
}

Related

Regex pattern match performance in Java for long string

I have a regex that works great(500 nanoseconds) when a match is found, but takes a lot of time (over 3 secs) when there is no match. I suspect this could be because of backtracking. I tried some options, like converting .* to (.*)? based on some documentation, but it didn't help.
Input: a very long string - 5k chars in some cases.
Regex to match: .*substring1.*substring2.*
I am pre-compiling the pattern and re-using the matcher, what else can I try?
Here's my code snippet - I will be calling this method with millions of different input strings, but just a handful of regex patterns.
private static HashMap<String, Pattern> patternMap = new HashMap<String, Pattern>();
private static HashMap<String, Matcher> matcherMap = new HashMap<String, Matcher>();
Here's my method:
public static Boolean regex_match(String line, String regex) {
if (regex == null || line == null) {
return null;
}
if (!patternMap.containsKey(regex)) {
patternMap.put(regex, Pattern.compile(regex));
matcherMap.put(regex,patternMap.get(regex).matcher(""));
}
return matcherMap.get(regex).reset(line).find(0);
}
Your regex is subject to a problem known as catastrophic backtracking, as you hinted at. Essentially, the first .* will match the entire string, and then backtrack until substring1 matches. This will repeat with substring2. Because substring2 fails, the second .* will need to find another place where substring2 begins to match, and then it will fail again. Each time substring1 matches, we need to check every single place that substring2 might match.
You already are using pattern.find(), so you can omit the starting and ending .*. Then, changing the inner .* to a .*? could improve the performance by turning the greedy matcher into a lazy one.
This produces: substring1.*?substring2
You can verify that the pattern will match if you use indexOf():
int pos1 = str.indexOf("substring1");
int pos2 = str.indexOf("substring2", pos1);
if(pos1 != -1 && pos2 != -1){
// regex
}
When the regex doesn't match, you will get catastrophic backtracking. In fact, your pattern is likely doing a lot of backtracking even when there is a match. The .* will eat up the entire string, and then needs to go backwards, reluctantly giving back characters.
If your string looks like: substring1 substring2........50000 more characters......, then you will get better performance with the lazy .*?. Please note that (.*)? is NOT the same as .*?.
The performance of the regex will vary depending on what the substrings are, and what they're matched against. If your string looks like: substring1........50000 more characters...... substring2, then you will get better performance with the .* that you have.
Using String.indexOf() is much faster than Regex if the case is simple enough you can use it. You could recode your problem as:
public static boolean containsStrings(String source, String string1, String string2) {
long pos1, pos2;
pos1 = source.indexOf(string1);
if(pos1 > -1) {
pos2 = source.indexOf(string2,pos1 + string1.length);
if(pos2 > pos1 && source.indexOf(string1,pos2 + string2.length) < -1) {
return true;
}
}
return false;
}
Note that my solution does not deal with the case where string2 is contained in string1, if that is the case you'll need to add that to the logic.
^((?!substring1).)*substring1((?!substring2).)*substring2.*?\Z
Should do it because a string that contains one substring multiple times but not both in order won't backtrack ad nauseam. You can drop the .*?\Z at the end if you don't need the matcher to end at end of input.

Regular expression for counting words in a sentence

public static int getWordCount(String sentence) {
return sentence.split("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)", -1).length
+ sentence.replaceAll("([[a-z][A-Z][0-9][\\W][-][_]]*)", "").length() - 1;
}
My intention is to count the number of words in a sentence. The input to this function is the the lengthy sentence. It may have 255 words.
The word should take hyphens or underscores in between
Function should only count valid words means special character should not be counted eg. &&&& or #### should not count as a word.
The above regular expression is working fine, but when hyphen or underscore comes in between the word eg: co-operation, the count returning as 2, it should be 1. Can anyone please help?
Instead of using .split and .replaceAll which are quite expensive operations, please use an approach with constant memory usage.
Based on your specifications, you seem to look for the following regex:
[\w-]+
Next you can use this approach to count the number of matches:
public static int getWordCount(String sentence) {
Pattern pattern = Pattern.compile("[\\w-]+");
Matcher matcher = pattern.matcher(sentence);
int count = 0;
while (matcher.find())
count++;
return count;
}
online jDoodle demo.
This approach works in (more) constant memory: when splitting, the program constructs an array, which is basically useless, since you never inspect the content of the array.
If you don't want words to start or end with hyphens, you can use the following regex:
\w+([-]\w+)*
This part ([-][_])* is wrong. The notation [xyz] means "any single one of the characters inside the brackets" (see http://www.regular-expressions.info/charclass.html). So effectively, you allow exactly the character - and exactly the character _, in that order.
Fixing your group makes it work:
[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*
and it can be further simplified using \w to
\w+(-\w+)*
because \w matches 0..9, A..Z, a..z and _ (http://www.regular-expressions.info/shorthand.html) and so you only need to add -.
if you can use java 8:
long wordCount = Arrays.stream(sentence.split(" ")) //split the sentence into words
.filter(s -> s.matches("[\\w-]+")) //filter only matching words
.count();
With java 8
public static int getColumnCount(String row) {
return (int) Pattern.compile("[\\w-]+")
.matcher(row)
.results()
.count();
}

regex is very slow, how to check if a string is only with word chars fast?

I have a function to check if a string(most of the string is only with one CJK char) is only with word chars, and it will be invoked many many times, so the cost is unacceptable, but I don't know how to optimize it, any suggestions?
/*\w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
For more details see Unicode TR-18, and bear in mind that the set of characters
in each class can vary between Unicode releases.*/
private static final Pattern sOnlyWordChars = Pattern.compile("\\w+");
private boolean isOnlyWordChars(String s) {
return sOnlyWordChars.matcher(s).matches();
}
when s is "3g", or "go_url", or "hao123", isOnlyWordChars(s) should return true.
private boolean isOnlyWordChars(String s) {
char[] chars = s.toCharArray();
for (char c : chars) {
if(!Character.isLetter(c)) {
return false;
}
}
return true;
}
A better implementation
public static boolean isAlpha(String str) {
if (str == null) {
return false;
}
int sz = str.length();
for (int i = 0; i < sz; i++) {
if (Character.isLetter(str.charAt(i)) == false) {
return false;
}
}
return true;
}
Or if you are using Apache Commons, StringUtils.isAlpha(). the second implemenation of the answer is actually from the source code if isAlpha.
UPDATE
HI Sorry for the late reply. I wasn't pretty sure about the speed although I read in several places that loop is faster than regex. To be sure I run the following codes in ideoone and here is the result
for 5000000 iteration
with your codes: 4.99 seconds (runtime error after that so for big data it is not working)
with my first code 2.71 seconds
with my second code 2.52 seconds
for 500000 iteration
with your codes: 1.07 seconds
with my first code 0.36 seconds
with my second code 0.33 seconds
Here is the sample code I used.
N.B. There might be small mistakes. You can play with it to test in different scenario.
according to the comment of Jan, I think those are minor things like using private or public. yest condition checking is a nice point.
I think that the chief problem is your pattern.
I was working through an iterative solution, when I noticed that it failed on one of my test strings Supercalifragilisticexpalidociou5. This reason for this: \w+ only cares if there is one or more word characters. It doesn't care if you're not looking at a word character beyond what it's already matched.
To rectify this, use a lookaround:
(?!\W+)(\w+)
The \W+ condition will lock the regex if one or more characters are found to be a non-word character (such as &*()!#!#$).
The only thing i see is to change your pattern to:
^\\w++$
but i am not a java expert
explanations:
I have added anchors (ie ^ $) that increases the performances of the pattern (the regex engine fails at the first non word character until it encounters the end). I have added a possessive quantifier (ie ++), then the regex engine doesn't matter of backtrack positions and is more fast.
more informations here.
If you want to do this using regexes, then the most efficient way do it is to change the logic to a negation; i.e. "every character is a letter" becomes "no character is a non-letter".
private static final Pattern pat = Pattern.compile("\\W");
private boolean isOnlyWordChars(String s) {
return !pat.matcher(s).find();
}
This will test each character at most once ... with no backtracking.

Use regex to replace sequences in a string with modified characters

I am trying to solve a codingbat problem using regular expressions whether it works on the website or not.
So far, I have the following code which does not add a * between the two consecutive equal characters. Instead, it just bulldozes over them and replaces them with a set string.
public String pairStar(String str) {
Pattern pattern = Pattern.compile("([a-z])\\1", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
if(matcher.find())
matcher.replaceAll(str);//this is where I don't know what to do
return str;
}
I want to know how I could keep using regex and replace the whole string. If needed, I think a recursive system could help.
This works:
while(str.matches(".*(.)\\1.*")) {
str = str.replaceAll("(.)\\1", "$1*$1");
}
return str;
Explanation of the regex:
The search regex (.)\\1:
(.) means "any character" (the .) and the brackets create a group - group 1 (the first left bracket)
\\1, which in regex is \1 (a java literal String must escape a backslash with another backslash) means "the first group" - this kind of term is called a "back reference"
So together (.)\1 means "any repeated character"
The replacement regex $1*$1:
The $1 term means "the content captured as group 1"
Recursive solution:
Technically, the solution called for on that site is a recursive solution, so here is recursive implementation:
public String pairStar(String str) {
if (!str.matches(".*(.)\\1.*")) return str;
return pairStar(str.replaceAll("(.)\\1", "$1*$1"));
}
FWIW, here's a non-recursive solution:
public String pairStar(String str) {
int len = str.length();
StringBuilder sb = new StringBuilder(len*2);
char last = '\0';
for (int i=0; i < len; ++i) {
char c = str.charAt(i);
if (c == last) sb.append('*');
sb.append(c);
last = c;
}
return sb.toString();
}
I dont know java, but I believe there is replace function for string in java or with regular expression. Your match string would be
([a-z])\\1
And the replace string would be
$1*$1
After some searching I think you are looking for this,
str.replaceAll("([a-z])\\1", "$1*$1").replaceAll("([a-z])\\1", "$1*$1");
This is my own solutions.
Recursive solution (which is probably more or less the solution that the problem is designed for)
public String pairStar(String str) {
if (str.length() <= 1) return str;
else return str.charAt(0) +
(str.charAt(0) == str.charAt(1) ? "*" : "") +
pairStar(str.substring(1));
}
If you want to complain about substring, then you can write a helper function pairStar(String str, int index) which does the actual recursion work.
Regex one-liner one-function-call solution
public String pairStar(String str) {
return str.replaceAll("(.)(?=\\1)", "$1*");
}
Both solution has the same spirit. They both check whether the current character is the same as the next character or not. If they are the same then insert a * between the 2 identical characters. Then we move on to check the next character. This is to produce the expected output a*a*a*a from input aaaa.
The normal regex solution of "(.)\\1" has a problem: it consumes 2 characters per match. As a result, we failed to compare whether the character after the 2nd character is the same character. The look-ahead is used to resolve this problem - it will do comparison with the next character without consuming it.
This is similar to the recursive solution, where we compare the next character str.charAt(0) == str.charAt(1), while calling the function recursively on the substring with only the current character removed pairStar(str.substring(1).

Regex in java and its performance compared to indexOf

Please can someone tell me how to match "_" and a period "." excatly one time in a string using regex, Also is it more efficient using indexOf() instead of regex expression.
String s= "Hello_Wor.ld" or
s="12323_!£££$.asdfasd"
bascially any no of characters can come before and after _ and . the only requirement is that the entire string should only contain one occurance of _ and .
indexOf will be much quicker than a regex, and will probably also be easier to understand.
Just test if indexOf('_') >= 0, and then if indexOf('_', indexOfFirstUnderScore) < 0. Do the same for the period.
private boolean containsOneAndOnlyOne(String s, char c) {
int firstIndex = s.indexOf(c);
if (firstIndex < 0) {
return false;
}
int secondIndex = s.indexOf(c, firstIndex + 1);
return secondIndex < 0;
}
Matches a string with a single .:
/^[^.]*\.[^.]*$/
Same for _:
/^[^_]*_[^_]*/
The combined regex should be something like:
/^([^._]*\.[^._]*_[^._]*)|([^._]*_[^._]*\.[^._]*)$/
It should by now be obvious that indexOf is the better solution, being simpler (performance is irrelevant until it has been shown to be the bottleneck).
If interested, note how the combined regex has two terms, for "string with a single . before a single _" and vice versa. It would have six for three characters, and n! for n. It would be simpler to run both regexes and AND the result than to use the combined regex.
One must always look for a simpler solution while using regexes.

Categories

Resources