Java: How to implement wildcard matching? - java

I'm researching on how to find k values in the BST that are closest to the target, and came across the following implementation with the rules:
'?' Matches any single character.
'*' Matches any sequence of characters (including the empty sequence).
The matching should cover the entire input string (not partial).
The function prototype should be:
bool isMatch(const char *s, const char *p)
Some examples:
isMatch("aa","a") → false
isMatch("aa","aa") → true
isMatch("aaa","aa") → false
isMatch("aa", "*") → true
isMatch("aa", "a*") → true
isMatch("ab", "?*") → true
isMatch("aab", "cab") → false
Code:
import java.util.*;
public class WildcardMatching {
boolean isMatch(String s, String p) {
int i=0, j=0;
int ii=-1, jj=-1;
while(i<s.length()) {
if(j<p.length() && p.charAt(j)=='*') {
ii=i;
jj=j;
j++;
} else if(j<p.length() &&
(s.charAt(i) == p.charAt(j) ||
p.charAt(j) == '?')) {
i++;
j++;
} else {
if(jj==-1) return false;
j=jj;
i=ii+1;
}
}
while(j<p.length() && p.charAt(j)=='*') j++;
return j==p.length();
}
public static void main(String args[]) {
String s = "aab";
String p = "a*";
WildcardMatching wcm = new WildcardMatching();
System.out.println(wcm.isMatch(s, p));
}
}
And my question is, what's the reason for having two additional indexes, ii and jj, and why do they get initialized with -1? What's the purpose of each? Wouldn't traversing it with i and j be enough?
And what's the purpose of ii=i; and jj=j; in the first if case, and i=ii+1; and j=jj; in the third if case?
Lastly, in what case would you encounter while(j<p.length() && p.charAt(j)=='*') j++;?
Examples would be extremely helpful in understanding.
Thank you in advance and will accept answer/up vote.

It looks like ii and jj are used to handle the wildcard "*", which matches to any sequence. Their initialization to -1 acts as a flag: it tells us if we've hit an unmatched sequence and are not currently evaluating a "*". We can walk through your examples one at a time.
Notice that i is related to the parameter s (the original string) and j is related to the parameter p (the pattern).
isMatch("aa","a"):
this returns false because the j<p.length() statement will fail before we leave the while loop, since the length of p ("a") is only 1 whereas the length of s ("aa") is 2, so we'll jump to the else block. This is where the -1 initialization comes in: since we never saw any wildcards in p, jj is still -1, indicating that there's no way the strings can match, so we return false.
isMatch("aa","aa"):
s and p are exactly the same, so the program repeatedly evaluates the else-if block with no problems and finally breaks out of the while loop once i equals 2 (the length of "aa"). The second while loop never runs, since j is not less than p.length() - in fact, since the else-if increments i and j together, they are both equal to 2, and 2 is not less than the length of "aa". We return j == p.length(), which evaluates to 2 == 2, and get true.
isMatch("aaa","aa"): this one fails for the same reason as the first. Namely, the strings are not the same length and we never hit a wildcard character.
isMatch("aa","*"): this is where it gets interesting. First we'll enter the if block, since we've seen a "*" in p. We set ii and jj to 0 and increment j only. On the second iteration, j<p.length() fails, so we jump to the else block. jj is not -1 anymore (it's 0), so we reset j to 0 and set i to 0+1. This basically allows us to keep evaluating the wildcard, since j just gets reset to jj, which holds the position of the wildcard, and ii tells us where to start from in our original string. This test case also explains the second while loop. In some cases our pattern may be much shorter than the original string, so we need to make sure it's matched up with wildcards. For example, isMatch("aaaaaa","a**") should return true, but the final return statement is checking to see if j == p.length(), asking if we checked the entire pattern. Normally we would stop at the first wildcard, since it matches anything, so we need to finally run through the rest of the pattern and make sure it only contains wildcards.
From here you can figure out the logic behind the other test cases. I hope this helped!

Let's look at this a bit out of order.
First, this is a parallel iteration of the string (s) and the wildcard pattern (p), using variable i to index s and variable j to index p.
The while loop will stop iterating when the end of s is reached. When that happens, hopefully the end of p has been reached too, in which case it'll return true (j==p.length()).
If however p ends with a *, that is also valid (e.g. isMatch("ab", "ab*")), and that's what the while(j<p.length() && p.charAt(j)=='*') j++; loop ensures, i.e. any * in the pattern at this point is skipped, and if that reaches end of p, then it returns true. If end of p is not reached, it returns false.
That was the answer to your last question. Now let's look at the loop. The else if will advance both i and j as long as there is a match, e.g. 'a' == 'a' or 'a' == '?'.
When a * wildcard is found (first if), it saves the current positions in ii and jj, in case backtracking becomes necessary, then skips the wildcard character.
This basically starts by assuming the wildcard matches the empty string (e.g. isMatch("ab", "a*b")). When it continues iterating, the else if will match the rest and method ends up returning true.
Now, if a mismatch is found (the else block), it will try to backtrack. Of course, if it doesn't have a saved wildcard (jj==-1), it can't backtrack, so it just returns false. That's why jj is initialized to -1, so it can detect if a wildcard was saved. ii could be initialized to anything, but is initialized to -1 for consistency.
If a wildcard position was saved in ii and jj, it will restore those values, then forward i by one, i.e. assuming that if the next character is matched against the wildcard, the rest of the matching will succeed and return true.
That's the logic. Now, it could be optimized a tiny bit, because that backtracking is sub-optimal. It currently resets j back to the *, and i back to the next character. When it loops around, it will enter the if and save the same value again in jj and save the i value in ii, and then increment j. Since that is a given (unless the end of s is reached), the backtracking could just do that too, saving a loop iteration, i.e.
} else {
if(jj==-1) return false;
i=++ii;
j=jj+1;
}
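To watch the backtracking happen, here is a minimal, self-contained sketch of the same greedy matcher with the shortcut above folded in. The class name WildcardSketch and the variable names starText and starPattern are made up for illustration; they play the roles of ii and jj in the question's code.

```java
public class WildcardSketch {
    // Greedy matcher: '?' matches one character, '*' matches any sequence.
    public static boolean isMatch(String s, String p) {
        int i = 0, j = 0;
        int starText = -1;    // plays the role of ii: position in s when the last '*' was seen
        int starPattern = -1; // plays the role of jj: position of the last '*' in p, -1 if none yet
        while (i < s.length()) {
            if (j < p.length() && p.charAt(j) == '*') {
                starText = i;
                starPattern = j;
                j++;                   // first attempt: '*' matches the empty sequence
            } else if (j < p.length()
                    && (p.charAt(j) == '?' || p.charAt(j) == s.charAt(i))) {
                i++;
                j++;
            } else if (starPattern != -1) {
                i = ++starText;        // let the last '*' absorb one more character
                j = starPattern + 1;   // resume right after that '*'
            } else {
                return false;          // mismatch and no '*' to backtrack to
            }
        }
        while (j < p.length() && p.charAt(j) == '*') {
            j++;                       // trailing '*' characters match the empty sequence
        }
        return j == p.length();
    }

    public static void main(String[] args) {
        System.out.println(isMatch("aa", "a"));      // false
        System.out.println(isMatch("ab", "?*"));     // true
        System.out.println(isMatch("abcde", "a*e")); // true
        System.out.println(isMatch("aab", "a*"));    // true
    }
}
```

For "abcde" against "a*e", the '*' first matches nothing, the 'e' fails against 'b', and each backtrack lets the '*' swallow one more character until 'e' lines up with the last character of the string.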

The code looks buggy to me. (See below)
The ostensible purpose of ii and jj is to implement a form of backtracking.
For example, when you try to match "abcde" against the pattern "a*e", the algorithm will first match the "a" in the pattern against the "a" in the input string. Then it will eagerly match the "*" against the rest of the string ... and find that it has made a mistake. At that point, it needs to backtrack and try an alternative.
The ii and jj variables record the point to backtrack to, and every use of those variables is either recording a new backtrack point or backtracking.
Or at least, that was probably the author's intent at some point.
The while(j<p.length() && p.charAt(j)=='*') j++; seems to be dealing with an edge case.
However, I don't think this code is correct.
It certainly won't cope with backtracking in the case where there are multiple "*" wildcards in the pattern. That requires a recursive solution.
The part:
if(j<p.length() && p.charAt(j)=='*') {
ii=i;
jj=j;
j++;
doesn't make much sense. I'd have thought it should increment i not j. It might "mesh" with the behavior of the else part, but even if it does this is a convoluted way of coding this.
Advice:
Don't use this code as an example. Even if it works (in a limited sense) it is not a good way to do this task, or an example of clarity or good style.
I would handle this by translating the wildcard pattern into a regex and then using Pattern / Matcher to do the matching.
For example: Wildcard matching in Java

I know you are asking about BST, but to be honest there is also a way of doing that with regex (not for competitive programming, but stable and fast enough to be used in a production environment):
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class WildCardMatcher{
public static void main(String []args){
// Test
String urlPattern = "http://*.my-webdomain.???",
urlToMatch = "http://webmail.my-webdomain.com";
WildCardMatcher wildCardMatcher = new WildCardMatcher(urlPattern);
System.out.printf("\"%s\".matches(\"%s\") -> %s%n", urlToMatch, wildCardMatcher, wildCardMatcher.matches(urlToMatch));
}
private final Pattern p;
public WildCardMatcher(final String urlPattern){
Pattern charsToEscape = Pattern.compile("([^*?]+)([*?]*)");
// here we need to escape all the strings that are not "?" or "*", and replace any "?" and "*" with ".?" and ".*"
Matcher m = charsToEscape.matcher(urlPattern);
StringBuffer sb = new StringBuffer();
String replacement, g1, g2;
while(m.find()){
g1 = m.group(1);
g2 = m.group(2);
// We first have to escape the pattern (the original string can contain characters that are invalid for regex), then escape the '\' characters that have a special meaning in replacement strings
replacement = (g1 == null ? "" : Matcher.quoteReplacement(Pattern.quote(g1))) +
(g2 == null ? "" : g2.replaceAll("([*?])", ".$1")); // simply replacing "*" and "?" with ".*" and ".?"
m.appendReplacement(sb, replacement);
}
m.appendTail(sb);
p = Pattern.compile(sb.toString());
}
@Override
public String toString(){
return p.toString();
}
public boolean matches(final String urlToMatch){
return p.matcher(urlToMatch).matches();
}
}
There is still a list of improvements that you can implement (lowercase / uppercase distinction, setting a maximum length for the string being checked so that attackers cannot make you check against a 4-gigabyte string, ...).
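For comparison, here is a shorter sketch of the same wildcard-to-regex idea that walks the pattern character by character. It is an illustration under assumptions, not a drop-in replacement: the class name WildcardToRegex and its method names are hypothetical, and it maps "?" to "." (exactly one character, as the rules in the question require) rather than to ".?".

```java
import java.util.regex.Pattern;

public class WildcardToRegex {
    // Translate a wildcard pattern into a regular expression:
    // '*' becomes ".*", '?' becomes ".", everything else is quoted literally.
    public static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    // Pattern.quote protects regex metacharacters such as '.' or '('
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        // DOTALL lets '.' match line terminators too
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    public static boolean matches(String wildcard, String text) {
        return compile(wildcard).matcher(text).matches();
    }

    public static void main(String[] args) {
        // prints true
        System.out.println(matches("http://*.my-webdomain.???",
                                   "http://webmail.my-webdomain.com"));
    }
}
```

If the same wildcard is checked against many strings, keeping the compiled Pattern around (as the class above does with its final field) avoids recompiling it on every call.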

Related

How to know if a string could match a regular expression by adding more characters

This is a tricky question, and maybe in the end it has no solution (or not a reasonable one, at least). I'd like to have a Java specific example, but if it can be done, I think I could do it with any example.
My goal is to find a way of knowing whether a string being read from an input stream could still match a given regular expression pattern. Or, in other words, read the stream until we've got a string that definitely will not match such a pattern, no matter how many characters you add to it.
A declaration for a minimalist simple method to achieve this could be something like:
boolean couldMatch(CharSequence charsSoFar, Pattern pattern);
Such a method would return true in case that charsSoFar could still match pattern if new characters are added, or false if it has no chance at all to match it even adding new characters.
To put a more concrete example, say we have a pattern for float numbers like "^([+-]?\\d*\\.?\\d*)$".
With such a pattern, couldMatch would return true for the following example charsSoFar parameter:
"+"
"-"
"123"
".24"
"-1.04"
And so on and so forth, because you can continue adding digits to all of these, plus one dot also in the three first ones.
On the other hand, all these examples derived from the previous one should return false:
"+A"
"-B"
"123z"
".24."
"-1.04+"
It's clear at first sight that these will never comply with the aforementioned pattern, no matter how many characters you add to it.
EDIT:
I add my current non-regex approach right now, so to make things more clear.
First, I declare the following functional interface:
public interface Matcher {
/**
* It will return the matching part of "source" if any.
*
* @param source
* @return
*/
CharSequence match(CharSequence source);
}
Then, the previous function would be redefined as:
boolean couldMatch(CharSequence charsSoFar, Matcher matcher);
And a (drafted) matcher for floats could look like (note this does not support the + sign at the start, just the -):
public class FloatMatcher implements Matcher {
@Override
public CharSequence match(CharSequence source) {
StringBuilder rtn = new StringBuilder();
if (source.length() == 0)
return "";
if ("0123456789-.".indexOf(source.charAt(0)) != -1 ) {
rtn.append(source.charAt(0));
}
boolean gotDot = false;
for (int i = 1; i < source.length(); i++) {
if (gotDot) {
if ("0123456789".indexOf(source.charAt(i)) != -1) {
rtn.append(source.charAt(i));
} else
return rtn.toString();
} else if (".0123456789".indexOf(source.charAt(i)) != -1) {
rtn.append(source.charAt(i));
if (source.charAt(i) == '.')
gotDot = true;
} else {
return rtn.toString();
}
}
return rtn.toString();
}
}
Inside the omitted body for the couldMatch method, it will just call matcher.match() iteratively with a new character added at the end of the source parameter and return true while the returned CharSequence is equal to the source parameter, and false as soon as it's different (meaning that the last char added broke the match).
You can do it as easily as
boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
Matcher m = pattern.matcher(charsSoFar);
return m.matches() || m.hitEnd();
}
If the sequence does not match and the engine did not reach the end of the input, it implies that there is a contradicting character before the end, which won’t go away when adding more characters at the end.
Or, as the documentation says:
Returns true if the end of input was hit by the search engine in the last match operation performed by this matcher.
When this method returns true, then it is possible that more input would have changed the result of the last search.
This is also used by the Scanner class internally, to determine whether it should load more data from the source stream for a matching operation.
Using the method above with your sample data yields
Pattern fpNumber = Pattern.compile("[+-]?\\d*\\.?\\d*");
String[] positive = {"+", "-", "123", ".24", "-1.04" };
String[] negative = { "+A", "-B", "123z", ".24.", "-1.04+" };
for(String p: positive) {
System.out.println("should accept more input: "+p
+", couldMatch: "+couldMatch(p, fpNumber));
}
for(String n: negative) {
System.out.println("can never match at all: "+n
+", couldMatch: "+couldMatch(n, fpNumber));
}
should accept more input: +, couldMatch: true
should accept more input: -, couldMatch: true
should accept more input: 123, couldMatch: true
should accept more input: .24, couldMatch: true
should accept more input: -1.04, couldMatch: true
can never match at all: +A, couldMatch: false
can never match at all: -B, couldMatch: false
can never match at all: 123z, couldMatch: false
can never match at all: .24., couldMatch: false
can never match at all: -1.04+, couldMatch: false
Of course, this doesn’t say anything about the chances of turning a nonmatching content into a match. You could still construct patterns for which no additional character could ever match. However, for ordinary use cases like the floating point number format, it’s reasonable.
I have no specific solution, but you might be able to do this with negations.
If you setup regex patterns in a blacklist that definitely do not match with your pattern (e.g. + followed by char) you could check against these. If a blacklisted regex returns true, you can abort.
Another idea is to use negative lookaheads (https://www.regular-expressions.info/lookaround.html)

checking for specific characters in String using matches()

What I'm trying to do is reject any string that contains characters outside a-z, 0-9 or _
I tried using the match function below, as I'd seen elsewhere, but I can't get it to work correctly. It will either tell me the string is fine when it's not, or it will tell me it's not fine when it is.
public static Boolean checkc(String word) {
String w = word;
for (int i = 0; i < w.length(); i++) {
if (w.substring(i, i).matches("[A-Za-z0-9_]")) {
return true;
}
}
return false;
}
The logic might be wrong now because I've fiddled with it trying to get it working, but to be fair, it wasn't working in the first place. I'm checking a few things in the function that's calling this, so I just need to know if the string is fine given the rules.
The end index argument to substring is exclusive, so substring(i, i) always returns a 0 length string. You could fix this by using substring(i, i+1), but there's no reason to use a loop here. You can just use word.matches("[A-Za-z0-9_]+") and check the entire string at once. The regex quantifier + means "one or more". You could also use the quantifier * which means "zero or more", if the method should return true if the string is empty.
Edit: There's also another problem with your loop logic that I just noticed. Your conditional in the loop returns true the first time the condition is met:
for (...) {
if ( /* condition is met */ )
return true;
}
return false;
That logic only requires that the condition be met at least once, and then it returns true, but you probably meant the following:
for (...) {
if (! /* condition is met */ )
return false;
}
return true;
That requires that the condition be met for every character.
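Putting both points together, a minimal sketch might look like the following. The class name IdentifierCheck and its method names are invented for illustration; both versions treat the empty string as invalid, which is an assumption about what the asker wants.

```java
public class IdentifierCheck {
    // Regex version: the whole string must consist of one or more allowed characters.
    public static boolean isValidRegex(String word) {
        return word.matches("[A-Za-z0-9_]+");
    }

    // Loop version: return false on the first disallowed character,
    // and true only after every character has passed.
    public static boolean isValidLoop(String word) {
        if (word.isEmpty()) {
            return false; // assumption: an empty string is not acceptable
        }
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            boolean allowed = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '_';
            if (!allowed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("abc_123")); // true
        System.out.println(isValidRegex("abc-123")); // false
        System.out.println(isValidLoop("abc_123"));  // true
        System.out.println(isValidLoop("abc-123"));  // false
    }
}
```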
Try this:
public static boolean check(String word) {
return word.matches("[^a-zA-Z0-9_]+");
}
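This method returns true when word is non-empty and every character in it lies outside the set in the square brackets; the ^ regex symbol negates the character class, much like logical ! (for example !true == false). The plus symbol + after the square brackets means that the bracketed class may repeat one or more times. A true result therefore means the string consists entirely of disallowed characters; a string is acceptable exactly when word.matches("[a-zA-Z0-9_]+") is true.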
javadoc link to Pattern class (regex explanations and examples)
Regex101 convenient online regex debug tool
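stringToCheck.matches("[^0-9a-zA-Z_]")
This returns a boolean that is true only when the string is exactly one character long and that character is not a letter, digit or underscore; because matches() tests the whole string, add a quantifier or use Matcher.find() to test longer strings.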

Matching Subset in a String

Let's say I have-
String x = "ab";
String y = "xypa";
If I want to see if any subset of x exists in y, what would be the fastest way? Looping is time consuming. In the example above a subset of x is "a" which is found in y.
The answer really depends on many things.
If you just want to find any subset and you're doing this only once, looping is just fine (and the best you can do without using additional storage) and you can stop when you find a single character that matches.
If you have a fixed x and want to use it for matching several strings y, you can do some pre-processing to store the characters in x in a table and use this table to check if each character of y occurs in x or not.
If you want to find the largest subset, then you're looking at a different problem: the longest common subsequence problem.
Well, I'm not sure it's better than looping, but you could use String#matches:
if (y.matches(".*[" + x + "]+.*")) ...
You'd need to escape characters that are special in a regex [] construct, though (like ], -, \, ...).
The above is just an example, if you're doing it more than once, you'll want to use Pattern, Matcher, and the other stuff from the java.util.regex package.
You have to use a for loop, or use a regex, which is just as expensive as a for loop, because you basically need to convert one of your strings into chars.
boolean isSubset = false;
for(int i = 0; i < x.length(); i++) {
// indexOf accepts a char; String.contains only accepts a CharSequence
if(y.indexOf(x.charAt(i)) >= 0) {
isSubset = true;
break;
}
}
This uses a for loop.
It looks like this could be a case of the longest common substring problem.
You can generate all subsets of x (e.g., in your example, ab, a, b) and then generate a regexp that would do the matching:
Pattern p = Pattern.compile("(ab|a|b)");
Matcher m = p.matcher(y);
if(m.find()) {
System.err.println(m.group());
}
If both Strings will only contain [a-z], then the fastest approach would be to make two bitmaps, each 26 bits long. Mark all the bits for the characters contained in each String. Take the AND of the bitmaps; the resulting bits are present in both Strings, which gives the largest common subset. This would be a simple O(n), with n the length of the longest String.
(If you want to cover the whole lot of UTF, bloom filters might be more appropriate. )
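The bitmap idea can be sketched as follows, assuming both strings really contain only the letters a to z. The class name LetterOverlap and its method names are hypothetical; a single int is wide enough to hold the 26 bits.

```java
public class LetterOverlap {
    // Build a 26-bit mask with bit k set when the letter 'a' + k occurs in s.
    public static int letterMask(String s) {
        int mask = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z is supported: " + c);
            }
            mask |= 1 << (c - 'a');
        }
        return mask;
    }

    // AND the two masks and read the shared letters back out in alphabetical order.
    public static String commonLetters(String x, String y) {
        int common = letterMask(x) & letterMask(y);
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < 26; k++) {
            if ((common & (1 << k)) != 0) {
                sb.append((char) ('a' + k));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(commonLetters("ab", "xypa")); // a
    }
}
```

If only a yes/no answer is needed, checking that the AND of the two masks is non-zero is enough and skips the second loop.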
Looping is time-consuming, but there's no way to do what you want other than going over the target string repeatedly.
What you can do is optimize by checking the smallest strings first, and work your way up. For example, if the target string doesn't contain abc, it can't possibly contain abcdef.
Other optimizations off the top of my head:
Don't continue to check for a match after a non-matching character is hit, though in Java you can let the computer worry about this.
Don't check to see if something is a match if there aren't enough characters left in the target string for a match to be possible.
If you need speed and have lots of space, you might be able to break the target string up into a fancy data structure like a trie for better results, though I don't have an exact algorithm in mind.
Another storage-is-not-a-problem solution: decompose the target into every possible substring and store the results in a HashSet.
What about this:?
package so3935620;
import static org.junit.Assert.*;
import java.util.BitSet;
import org.junit.Test;
public class Main {
public static boolean overlap(String s1, String s2) {
BitSet bs = new BitSet();
for (int i = 0; i < s1.length(); i++) {
bs.set(s1.charAt(i));
}
for (int i = 0; i < s2.length(); i++) {
if (bs.get(s2.charAt(i))) {
return true;
}
}
return false;
}
@Test
public void test() {
assertFalse(overlap("", ""));
assertTrue(overlap("a", "a"));
assertFalse(overlap("abcdefg", "ABCDEFG"));
}
}
And if that version is too slow, you can compute the BitSet depending on s1, save that in some variable and later only loop over s2.

How to validate phone number(US format) in Java?

I just want to know where I am going wrong here:
import java.io.*;
class Tokens{
public static void main(String[] args)
{
//String[] result = "this is a test".split("");
String[] result = "4543 6546 6556".split("");
boolean flag= true;
String num[] = {"0","1","2","3","4","5","6","7","8","9"};
String specialChars[] = {"-","@","#","*"," "};
for (int x=1; x<result.length; x++)
{
for (int y=0; y<num.length; y++)
{
if ((result[x].equals(num[y])))
{
flag = false;
continue;
}
else
{
flag = true;
}
if (flag == true)
break;
}
if (flag == false)
break;
}
System.out.println(flag);
}
}
If this is not homework, is there a reason you're avoiding regular expressions?
Here are some useful ones: http://regexlib.com/DisplayPatterns.aspx?cattabindex=6&categoryId=7
More generally, your code doesn't seem to validate that you have a phone number, it seems to merely validate that your strings consists only of digits. You're also not allowing any special characters right now.
Aside from the regex suggestion (which is a good one), it would seem to make more sense to deal with arrays of characters rather than single-char Strings.
In particular, the split("") call (shudder) could/should be replaced by toCharArray(). This lets you iterate over each individual character, which more clearly indicates your intent, is less prone to bugs as you know you're treating each character at once, and is more efficient*. Likewise your valid character sets should also be characters.
Your logic is pretty strangely expressed; you're not even referencing the specialChars set at all, and the looping logic once you've found a match seems odd. I think this is your bug; the matching seems to be the wrong way round in that if the character matches the first valid char, you set flag to false and continue round the current loop; so it will definitely not match the next valid char and hence you break out of the loop with a true flag. Always.
I would have thought something like this would be more intuitive:
private static final Set<Character> VALID_CHARS = ...;
public boolean isValidPhoneNumber(String number)
{
for (char c : number.toCharArray())
{
if (!VALID_CHARS.contains(c))
{
return false;
}
}
// All characters were valid
return true;
}
This doesn't take sequences into account (e.g. the strings "--------** " and "1" would be valid because all individual characters are valid) but then neither does your original code. A regex is better because it lets you specify the pattern, I supply the above snippet as an example of a clearer way of iterating through the characters.
*Yes, premature optimization is the root of all evil, but when better, cleaner code also happens to be faster that's an extra win for free.
Maybe this is overkill, but with a grammar similar to:
<phone_number> := <area_code><space>*<local_code><space>*<number> |
<area_code><space>*"-"<space>*<local_code><space>*"-"<space>*<number>
<area_code> := <digit><digit><digit> |
"("<digit><digit><digit>")"
<local_code> := <digit><digit><digit>
<number> := <digit><digit><digit><digit>
you can write a recursive descent parser. See this page for an example.
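Because this particular grammar has no recursion, it can also be written as a single regular expression instead of a hand-written parser. The sketch below is one possible translation, not the parser the answer has in mind; the class name PhoneGrammar and the constant name are made up, and <space> is assumed to mean a literal space character.

```java
import java.util.regex.Pattern;

public class PhoneGrammar {
    // <area_code>: three digits, optionally wrapped in parentheses.
    // Then either "<local_code> <number>" separated only by spaces,
    // or the same parts separated by two dashes with optional spaces around them.
    private static final Pattern PHONE = Pattern.compile(
            "(?:\\d{3}|\\(\\d{3}\\)) *"
            + "(?:\\d{3} *\\d{4}"
            + "|- *\\d{3} *- *\\d{4})");

    public static boolean isValid(String input) {
        return PHONE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("555 123 4567"));   // true
        System.out.println(isValid("(555)-123-4567")); // true
        System.out.println(isValid("4543 6546 6556")); // false
    }
}
```

As in the grammar, a number must use either both dashes or none, so a mixed form such as "555-1234567" is rejected.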
You can checkout the Pattern class in Java, really easy to work with regular expression using this class:
https://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html.

codingbat wordEnds using regex

I'm trying to solve wordEnds from codingbat.com using regex.
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
wordEnds("XYXY", "XY") → "XY"
This is the simplest as I can make it with my current knowledge of regex:
public String wordEnds(String str, String word) {
return str.replaceAll(
".*?(?=word)(?<=(.|^))word(?=(.|$))|.+"
.replace("word", java.util.regex.Pattern.quote(word)),
"$1$2"
);
}
replace is used to place in the actual word string into the pattern for readability. Pattern.quote isn't necessary to pass their tests, but I think it's required for a proper regex-based solution.
The regex has two major parts:
If after matching as few characters as possible ".*?", word can still be found "(?=word)", then lookbehind to capture any character immediately preceding it "(?<=(.|^))", match "word", and lookforward to capture any character following it "(?=(.|$))".
The initial "if" test ensures that the atomic lookbehind captures only if there's a word
Using lookahead to capture the following character doesn't consume it, so it can be used as part of further matching
Otherwise match what's left "|.+"
Groups 1 and 2 would capture empty strings
I think this works in all cases, but it's obviously quite complex. I'm just wondering if others can suggest a simpler regex to do this.
Note: I'm not looking for a solution using indexOf and a loop. I want a regex-based replaceAll solution. I also need a working regex that passes all codingbat tests.
I managed to reduce the occurrence of word within the pattern to just one.
".+?(?<=(^|.)word)(?=(.?))|.+"
I'm still looking if it's possible to simplify this further, but I also have another question:
With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?
Based on your solution I managed to simplify the code a little bit:
public String wordEnds(String str, String word) {
return str.replaceAll(".*?(?="+word+")(?<=(.|^))"+word+"(?=(.|$))|.+","$1$2");
}
Another way of writing it would be:
public String wordEnds(String str, String word) {
return str.replaceAll(
String.format(".*?(?=%1$s)(?<=(.|^))%1$s(?=(.|$))|.+", word),
"$1$2");
}
With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?
In Oracle's implementation, the behavior of look-behind is as follows:
By "studying" the regex (with study() method in each node), it knows the maximum length and minimum length of the pattern in look-behind group. (The study() method is what allows for obvious look-behind length)
It verifies the look-behind by starting a match at every position from index (current - min_length) to position (current - max_length) and exits early if the condition is satisfied.
Effectively, it will try to verify the look-behind on the shortest string first.
The implementation multiplies the matching complexity by O(k) factor.
This explains why changing ^|. to .? doesn't work: due to the starting position, it effectively checks for word before .word. The quantifier doesn't have a say here, since the ordering is imposed by the match range.
You can check the code of match method in Pattern.Behind and Pattern.NotBehind inner classes to verify what I said above.
In .NET's flavor, look-behind is likely implemented by the reverse matching feature, which means that no extra factor is incurred on the matching complexity.
My suspicion comes from the fact that the capturing group in (?<=(a+))b matches all a's in aaaaaaaaaaaaaab. The quantifier is shown to have free rein in the look-behind group.
I have tested that ^|. can be simplified to .? in .NET and the regex works correctly.
I am working in .NET's regex but I was able to change your pattern to:
.+?(?<=(\w?)word)(?=(\w?))|.+
with positive results. You know it's a word (alphanumeric) type character, so why not give the parser a valid hint of that fact: instead of any character, it's an optional alphanumeric character.
It may answer why you don't need to specify the anchors of ^ and $, for what exactly is $ - is it \r or \n or other? (.NET has issues with $, and maybe you are not exactly capturing a Null of $, but the null of \r or \n which allowed you to change to .? for $)
Another solution to look at...
public String wordEnds(String str, String word) {
if(str.equals(word)) return "";
int i = 0;
String result = "";
int stringLen = str.length();
int wordLen = word.length();
int diffLen = stringLen - wordLen;
while(i<=diffLen){
if(i==0 && str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i+wordLen);
}else if(i==diffLen && str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i-1);
}else if(str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i-1) + str.charAt(i+wordLen) ;
}
i++;
}
if(result.length()==1) result = result + result;
return result;
}
Another possible solution:
public String wordEnds(String str, String word) {
String result = "";
if (str.contains(word)) {
for (int i = 0; i < str.length(); i++) {
if (str.startsWith(word, i)) {
if (i > 0) {
result += str.charAt(i - 1);
}
if ((i + word.length()) < str.length()) {
result += str.charAt(i + word.length());
}
}
}
}
return result;
}
