Java regex match pattern with priority - java

I am using a system where & followed by a certain letter or number represents a color.
Valid characters that can follow & are [A-Fa-fK-Ok-or0-9]
For example I have the string &aThis is a test &bstring that &ehas plenty &4&lof &7colors.
I want to split at every &x while keeping the &x in the strings.
So I use a positive lookahead in my regex
(?=(&[A-Fa-fK-Ok-or0-9]))
That works completely fine, the output is:
&aThis is a test
&bstring that
&ehas plenty
&4
&lof
&7colors.
The problem is that the spot that has two instances of &x right next to each other should not be split, that line should be &4&lof instead.
Does anyone know what regex I can use so that when there are two of &x next to each other that they are matched together. Two instances of the color code should have priority over a single instance.

Issue Description
The problem is known: you need to tokenize a string that may contain consecutive separators you need to keep as a single item in the resulting string list/array.
Splitting with lookaround(s) cannot help here, because an unanchored lookaround tests each position inside the string. If your pattern matched any char in the string, you could use \G operator, but it is not the case. Even adding a + quantifier - s0.split("(?=(?:&[A-Fa-fK-Ok-or0-9])+)" would still return &4, &lof as separate tokens because of this.
Solution
Use matching rather than splitting, and use building blocks to keep it readable.
String s0 = "This is a text&aThis is a test &bstring that &ehas plenty &4&lof &7colors.";
String colorRx = "&[A-Fa-fK-Ok-or0-9]";
String nonColorRx = "[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*";
Pattern pattern = Pattern.compile("(?:" + colorRx + ")+" + nonColorRx + "|" + nonColorRx);
Matcher m = pattern.matcher(s0);
List<String> res = new ArrayList<>();
while (m.find()){
if (!m.group(0).isEmpty()) res.add(m.group(0)); // Add if non-empty!
}
System.out.println(res);
// => [This is a text, &aThis is a test , &bstring that , &ehas plenty , &4&lof , &7colors.]
The regex is
(?:&[A-Fa-fK-Ok-or0-9])+[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*|[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*
See the regex demo here. It is actually based on your initial pattern: first, we match all the color codes (1 or more sequences), and then we match 0+ characters that are not a starting point for the color sequence (i.e. all strings other than the color codes). The [^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)* subpattern is a synonym of (?s)(?:(?!&[A-Fa-fK-Ok-or0-9]).)* and it is quite handy when you need to match some chunk of text other than the one you specify, but as it is resource consuming (especially in Java), the unrolled version is preferable.
So, the pattern - (?:" + colorRx + ")+" + nonColorRx + "|" + nonColorRx - matches 1+ colorRx subpatterns followed with optional nonColorRx subpatterns, OR (|) zero or more nonColorRx subpatterns. The .group(0).isEmpy() does not allow empty strings in the resulting array.

Something like this will work.
It uses the String#split method and places the valid lines into an ArrayList (e.g. colorLines)
String mainStr = "&aThis is a test &bstring that &ehas plenty &4&lof &7colors";
String [] arr = mainStr.split("&");
List<String> colorLines = new ArrayList<String>();
String lastColor = "";
for (String s : arr)
{
s = s.trim();
if (s.length() > 0)
{
if (s.length() == 1)
{
lastColor += s;
}
else
{
colorLines.add(lastColor.length() > 0 ? lastColor + s : s);
lastColor = "";
}
}
}
for (String s : colorLines)
{
System.out.println(s);
}
Outputs:
aThis is a test
bstring that
ehas plenty
4lof
7colors

I tried:
{
String line = "&aThis is a test &bstring that &ehas plenty &4&lof &7colors.";
String pattern = " &(a-z)*(0-9)*";
String strs[] = line.split(pattern, 0);
for (int i=0; i<strs.length; i++){
if (i!=0){
System.out.println("&"+strs[i]);
} else {
System.out.println(strs[i]);
}
}
}
and the output is :
{
&aThis is a test
&bstring that
&ehas plenty
&4&lof
&7colors.
}
We can add the & at the beginning of all the substrings to get the result you are looking for.

Related

Java regex: Replace all characters with `+` except instances of a given string

I have the following problem which states
Replace all characters in a string with + symbol except instances of the given string in the method
so for example if the string given was abc123efg and they want me to replace every character except every instance of 123 then it would become +++123+++.
I figured a regular expression is probably the best for this and I came up with this.
str.replaceAll("[^str]","+")
where str is a variable, but its not letting me use the method without putting it in quotations. If I just want to replace the variable string str how can I do that? I ran it with the string manually typed and it worked on the method, but can I just input a variable?
as of right now I believe its looking for the string "str" and not the variable string.
Here is the output its right for so many cases except for two :(
List of open test cases:
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
plusOut("abXYabcXYZ", "ab") → "ab++ab++++"
plusOut("abXYabcXYZ", "abc") → "++++abc+++"
plusOut("abXYabcXYZ", "XY") → "++XY+++XY+"
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
plusOut("--++ab", "++") → "++++++"
plusOut("aaxxxxbb", "xx") → "++xxxx++"
plusOut("123123", "3") → "++3++3"
Looks like this is the plusOut problem on CodingBat.
I had 3 solutions to this problem, and wrote a new streaming solution just for fun.
Solution 1: Loop and check
Create a StringBuilder out of the input string, and check for the word at every position. Replace the character if doesn't match, and skip the length of the word if found.
public String plusOut(String str, String word) {
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
if (!str.startsWith(word, i))
out.setCharAt(i++, '+');
else
i += word.length();
}
return out.toString();
}
This is probably the expected answer for a beginner programmer, though there is an assumption that the string doesn't contain any astral plane character, which would be represented by 2 char instead of 1.
Solution 2: Replace the word with a marker, replace the rest, then restore the word
public String plusOut(String str, String word) {
return str.replaceAll(java.util.regex.Pattern.quote(word), "#").replaceAll("[^#]", "+").replaceAll("#", word);
}
Not a proper solution since it assumes that a certain character or sequence of character doesn't appear in the string.
Note the use of Pattern.quote to prevent the word being interpreted as regex syntax by replaceAll method.
Solution 3: Regex with \G
public String plusOut(String str, String word) {
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
}
Construct regex \G((?:word)*+)., which does more or less what solution 1 is doing:
\G makes sure the match starts from where the previous match leaves off
((?:word)*+) picks out 0 or more instance of word - if any, so that we can keep them in the replacement with $1. The key here is the possessive quantifier *+, which forces the regex to keep any instance of the word it finds. Otherwise, the regex will not work correctly when the word appear at the end of the string, as the regex backtracks to match .
. will not be part of any word, since the previous part already picks out all consecutive appearances of word and disallow backtrack. We will replace this with +
Solution 4: Streaming
public String plusOut(String str, String word) {
return String.join(word,
Arrays.stream(str.split(java.util.regex.Pattern.quote(word), -1))
.map((String s) -> s.replaceAll("(?s:.)", "+"))
.collect(Collectors.toList()));
}
The idea is to split the string by word, do the replacement on the rest, and join them back with word using String.join method.
Same as above, we need Pattern.quote to avoid split interpreting the word as regex. Since split by default removes empty string at the end of the array, we need to use -1 in the second parameter to make split leave those empty strings alone.
Then we create a stream out of the array and replace the rest as strings of +. In Java 11, we can use s -> String.repeat(s.length()) instead.
The rest is just converting the Stream to an Iterable (List in this case) and joining them for the result
This is a bit trickier than you might initially think because you don't just need to match characters, but the absence of specific phrase - a negated character set is not enough. If the string is 123, you would need:
(?<=^|123)(?!123).*?(?=123|$)
https://regex101.com/r/EZWMqM/1/
That is - lookbehind for the start of the string or "123", make sure the current position is not followed by 123, then lazy-repeat any character until lookahead matches "123" or the end of the string. This will match all characters which are not in a "123" substring. Then, you need to replace each character with a +, after which you can use appendReplacement and a StringBuffer to create the result string:
String inputPhrase = "123";
String inputStr = "abc123efg123123hij";
StringBuffer resultString = new StringBuffer();
Pattern regex = Pattern.compile("(?<=^|" + inputPhrase + ")(?!" + inputPhrase + ").*?(?=" + inputPhrase + "|$)");
Matcher m = regex.matcher(inputStr);
while (m.find()) {
String replacement = m.group(0).replaceAll(".", "+");
m.appendReplacement(resultString, replacement);
}
m.appendTail(resultString);
System.out.println(resultString.toString());
Output:
+++123+++123123+++
Note that if the inputPhrase can contain character with a special meaning in a regular expression, you'll have to escape them first before concatenating into the pattern.
You can do it in one line:
input = input.replaceAll("((?:" + str + ")+)?(?!" + str + ").((?:" + str + ")+)?", "$1+$2");
This optionally captures "123" either side of each character and puts them back (a blank if there's no "123"):
So instead of coming up with a regular expression that matches the absence of a string. We might as well just match the selected phrase and append + the number of skipped characters.
StringBuilder sb = new StringBuilder();
Matcher m = Pattern.compile(Pattern.quote(str)).matcher(input);
while (m.find()) {
for (int i = 0; i < m.start(); i++) sb.append('+');
sb.append(str);
}
int remaining = input.length() - sb.length();
for (int i = 0; i < remaining; i++) {
sb.append('+');
}
Absolutely just for the fun of it, a solution using CharBuffer (unexpectedly it took a lot more that I initially hoped for):
private static String plusOutCharBuffer(String input, String match) {
int size = match.length();
CharBuffer cb = CharBuffer.wrap(input.toCharArray());
CharBuffer word = CharBuffer.wrap(match);
int x = 0;
for (; cb.remaining() > 0;) {
if (!cb.subSequence(0, size < cb.remaining() ? size : cb.remaining()).equals(word)) {
cb.put(x, '+');
cb.clear().position(++x);
} else {
cb.clear().position(x = x + size);
}
}
return cb.clear().toString();
}
To make this work you need a beast of a pattern. Let's say you you are operating on the following test case as an example:
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
What you need to do is build a series of clauses in your pattern to match a single character at a time:
Any character that is NOT "X", "Y" or "Z" -- [^XYZ]
Any "X" not followed by "YZ" -- X(?!YZ)
Any "Y" not preceded by "X" -- (?<!X)Y
Any "Y" not followed by "Z" -- Y(?!Z)
Any "Z" not preceded by "XY" -- (?<!XY)Z
An example of this replacement can be found here: https://regex101.com/r/jK5wU3/4
Here is an example of how this might work (most certainly not optimized, but it works):
import java.util.regex.Pattern;
public class Test {
public static void plusOut(String text, String exclude) {
StringBuilder pattern = new StringBuilder("");
for (int i=0; i<exclude.length(); i++) {
Character target = exclude.charAt(i);
String prefix = (i > 0) ? exclude.substring(0, i) : "";
String postfix = (i < exclude.length() - 1) ? exclude.substring(i+1) : "";
// add the look-behind (?<!X)Y
if (!prefix.isEmpty()) {
pattern.append("(?<!").append(Pattern.quote(prefix)).append(")")
.append(Pattern.quote(target.toString())).append("|");
}
// add the look-ahead X(?!YZ)
if (!postfix.isEmpty()) {
pattern.append(Pattern.quote(target.toString()))
.append("(?!").append(Pattern.quote(postfix)).append(")|");
}
}
// add in the other character exclusion
pattern.append("[^" + Pattern.quote(exclude) + "]");
System.out.println(text.replaceAll(pattern.toString(), "+"));
}
public static void main(String [] args) {
plusOut("12xy34", "xy");
plusOut("12xy34", "1");
plusOut("12xy34xyabcxy", "xy");
plusOut("abXYabcXYZ", "ab");
plusOut("abXYabcXYZ", "abc");
plusOut("abXYabcXYZ", "XY");
plusOut("abXYxyzXYZ", "XYZ");
plusOut("--++ab", "++");
plusOut("aaxxxxbb", "xx");
plusOut("123123", "3");
}
}
UPDATE: Even this doesn't quite work because it can't deal with exclusions that are just repeated characters, like "xx". Regular expressions are most definitely not the right tool for this, but I thought it might be possible. After poking around, I'm not so sure a pattern even exists that might make this work.
The problem in your solution that you put a set of instance string str.replaceAll("[^str]","+") which it will exclude any character from the variable str and that will not solve your problem
EX: when you try str.replaceAll("[^XYZ]","+") it will exclude any combination of character X , character Y and character Z from your replacing method so you will get "++XY+++XYZ".
Actually you should exclude a sequence of characters instead in str.replaceAll.
You can do it by using capture group of characters like (XYZ) then use a negative lookahead to match a string which does not contain characters sequence : ^((?!XYZ).)*$
Check this solution for more info about this problem but you should know that it may be complicated to find regular expression to do that directly.
I have found two simple solutions for this problem :
Solution 1:
You can implement a method to replace all characters with '+' except the instance of given string:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
for(int i = 0; i < str.length(); i++){
// exclude any instance string of exWord from replacing process in str
if(str.substring(i, str.length()).indexOf(exWord) + i == i){
i = i + exWord.length()-1;
}
else{
str = str.substring(0,i) + "+" + str.substring(i+1);//replace each character with '+' symbol
}
}
Note : str.substring(i, str.length()).indexOf(exWord) + i this if statement will exclude any instance string of exWord from replacing process in str.
Output:
+++++++XYZ
Solution 2:
You can try this Approach using ReplaceAll method and it doesn't need any complex regular expression:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
str = str.replaceAll(exWord,"*"); // replace instance string with * symbol
str = str.replaceAll("[^*]","+"); // replace all characters with + symbol except *
str = str.replaceAll("\\*",exWord); // replace * symbol with instance string
Note : This solution will work only if your input string str doesn't contain any * symbol.
Also you should escape any character with a special meaning in a regular expression in phrase instance string exWord like : exWord = "++".

Efficient Regular Expression for big data, if a String contains a word

I have a code that works but is extremely slow. This code determines whether a string contains a keyword. The requirements I have need to be efficient for hundreds of keywords that I will search for in thousands of documents.
What can I do to make finding the keywords (without falsely returning a word that contains the keyword) efficiently?
For example:
String keyword="ac";
String document"..." //few page long file
If i use :
if(document.contains(keyword) ){
//do something
}
It will also return true if document contains a word like "account";
so I tried to use regular expression as follows:
String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
//do something
}
Summary:
This is the summary: Hopefully it will be useful to some one else:
My regular expression would work but extremely impractical while
working with big data. (it didn't terminate)
#anubhava perfected the regular expression. it was easy to
understand and implement. It managed to terminate which is a big
thing. but it was still a bit slow. (Roughly about 240 seconds)
#Tomalak solution is abit complex to implement and understand but it
was the fastest solution. so hats off mate.(18 seconds)
so #Tomalak solution was ~15 times faster than #anubhava.
Don't think you need to have .* in your regex.
Try this regex:
String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";
Here \\b is used for word boundary. If the keyword can contain special characters, make sure they are not at the start or end of the word, or the word boundaries will fail to match.
Also you must be using Pattern.quote if your keyword contains special regex characters.
EDIT: You might use this regex if your keywords are separated by space.
String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";
The fastest-possible way to find substrings in Java is to use String.indexOf().
To achieve "entire-word-only" matches, you would need to add a little bit of logic to check the characters before and after a possible match to make sure they are non-word characters:
public class IndexOfWordSample {
public static void main(String[] args) {
String input = "There are longer strings than this not very long one.";
String search = "long";
int index = indexOfWord(input, search);
if (index > -1) {
System.out.println("Hit for \"" + search + "\" at position " + index + ".");
} else {
System.out.println("No hit for \"" + search + "\".");
}
}
public static int indexOfWord(String input, String word) {
String nonWord = "^\\W?$", before, after;
int index, before_i, after_i = 0;
while (true) {
index = input.indexOf(word, after_i);
if (index == -1 || word.isEmpty()) break;
before_i = index - 1;
after_i = index + word.length();
before = "" + (before_i > -1 ? input.charAt(before_i) : "");
after = "" + (after_i < input.length() ? input.charAt(after_i) : "");
if (before.matches(nonWord) && after.matches(nonWord)) {
return index;
}
}
return -1;
}
}
This would print:
Hit for "long" at position 44.
This should perform better than a pure regular expressions approach.
Think if ^\W?$ already matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost"-matches.
For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.
I've created a Gist with an alternative implementation that does that.

Iterating through String with .find() in Java regex

I'm currently trying to solve a problem from codingbat.com with regular expressions.
I'm new to this, so step-by-step explanations would be appreciated. I could solve this with String methods relatively easily, but I am trying to use regular expressions.
Here is the prompt:
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
etc
My code thus far:
String regex = ".?" + word+ ".?";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
String newStr = "";
while(m.find())
newStr += m.group().replace(word, "");
return newStr;
The problem is that when there are multiple instances of word in a row, the program misses the character preceding the word because m.find() progresses beyond it.
For example: wordEnds("abc1xyz1i1j", "1") should return "cxziij", but my method returns "cxzij", not repeating the "i"
I would appreciate a non-messy solution with an explanation I can apply to other general regex problems.
This is a one-liner solution:
String wordEnds = input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
This matches your edge case as a look ahead within a non-capturing group, then matches the usual (consuming) case.
Note that your requirements don't require iteration, only your question title assumes it's necessary, which it isn't.
Note also that to be absolutely safe, you should escape all characters in word in case any of them are special "regex" characters, so if you can't guarantee that, you need to use Pattern.quote(word) instead of word.
Here's a test of the usual case and the edge case, showing it works:
public static String wordEnds(String input, String word) {
word = Pattern.quote(word); // add this line to be 100% safe
return input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
}
public static void main(String[] args) {
System.out.println(wordEnds("abcXY123XYijk", "XY"));
System.out.println(wordEnds("abc1xyz1i1j", "1"));
}
Output:
c13i
cxziij
Use positive lookbehind and postive lookahead which are zero-width assertions
(?<=(.)|^)1(?=(.)|$)
^ ^ ^-looks for a character after 1 and captures it in group2
| |->matches 1..you can replace it with any word
|
|->looks for a character just before 1 and captures it in group 1..this is zero width assertion that doesn't move forward to match.it is just a test and thus allow us to capture the values
$1 and $2 contains your value..Go on finding till the end
So this should be like
String s1 = "abcXY123XYiXYjk";
String s2 = java.util.regex.Pattern.quote("XY");
String s3 = "";
String r = "(?<=(.)|^)"+s2+"(?=(.)|$)";
Pattern p = Pattern.compile(r);
Matcher m = p.matcher(s1);
while(m.find()) s3 += m.group(1)+m.group(2);
//s3 now contains c13iij
works here
Use regex as follows:
Matcher m = Pattern.compile("(.|)" + Pattern.quote(b) + "(?=(.?))").matcher(a);
for (int i = 1; m.find(); c += m.group(1) + m.group(2), i++);
Check this demo.

How to find a whole word in a String in Java?

I have a String that I have to parse for different keywords.
For example, I have the String:
"I will come and meet you at the 123woods"
And my keywords are
'123woods'
'woods'
I should report whenever I have a match and where. Multiple occurrences should also be accounted for.
However, for this one, I should get a match only on '123woods', not on 'woods'. This eliminates using String.contains() method. Also, I should be able to have a list/set of keywords and check at the same time for their occurrence. In this example, if I have '123woods' and 'come', I should get two occurrences. Method execution should be somewhat fast on large texts.
My idea is to use StringTokenizer but I am unsure if it will perform well. Any suggestions?
The example below is based on your comments. It uses a List of keywords, which will be searched in a given String using word boundaries. It uses StringUtils from Apache Commons Lang to build the regular expression and print the matched groups.
String text = "I will come and meet you at the woods 123woods and all the woods";
List<String> tokens = new ArrayList<String>();
tokens.add("123woods");
tokens.add("woods");
String patternString = "\\b(" + StringUtils.join(tokens, "|") + ")\\b";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
If you are looking for more performance, you could have a look at StringSearch: high-performance pattern matching algorithms in Java.
Use regex + word boundaries as others answered.
"I will come and meet you at the 123woods".matches(".*\\b123woods\\b.*");
will be true.
"I will come and meet you at the 123woods".matches(".*\\bwoods\\b.*");
will be false.
Hope this works for you:
String string = "I will come and meet you at the 123woods";
String keyword = "123woods";
Boolean found = Arrays.asList(string.split(" ")).contains(keyword);
if(found){
System.out.println("Keyword matched the string");
}
http://codigounico.blogspot.com/
How about something like Arrays.asList(String.split(" ")).contains("xx")?
See String.split() and How can I test if an array contains a certain value.
Got a way to match Exact word from String in Android:
String full = "Hello World. How are you ?";
String one = "Hell";
String two = "Hello";
String three = "are";
String four = "ar";
boolean is1 = isContainExactWord(full, one);
boolean is2 = isContainExactWord(full, two);
boolean is3 = isContainExactWord(full, three);
boolean is4 = isContainExactWord(full, four);
Log.i("Contains Result", is1+"-"+is2+"-"+is3+"-"+is4);
Result: false-true-true-false
Function for match word:
private boolean isContainExactWord(String fullString, String partWord){
String pattern = "\\b"+partWord+"\\b";
Pattern p=Pattern.compile(pattern);
Matcher m=p.matcher(fullString);
return m.find();
}
Done
public class FindTextInLine {
String match = "123woods";
String text = "I will come and meet you at the 123woods";
public void findText () {
if (text.contains(match)) {
System.out.println("Keyword matched the string" );
}
}
}
Try to match using regular expressions. Match for "\b123wood\b", \b is a word break.
The solution seems to be long accepted, but the solution could be improved, so if someone has a similar problem:
This is a classical application for multi-pattern-search-algorithms.
Java Pattern Search (with Matcher.find) is not qualified for doing that. Searching for exactly one keyword is optimized in java, searching for an or-expression uses the regex non deterministic automaton which is backtracking on mismatches. In worse case each character of the text will be processed l times (where l is the sum of the pattern lengths).
Single pattern search is better, but not qualified, too. One will have to start the whole search for every keyword pattern. In worse case each character of the text will be processed p times where p is the number of patterns.
Multi pattern search will process each character of the text exactly once. Algorithms suitable for such a search would be Aho-Corasick, Wu-Manber, or Set Backwards Oracle Matching. These could be found in libraries like Stringsearchalgorithms or byteseek.
// example with StringSearchAlgorithms
AhoCorasick stringSearch = new AhoCorasick(asList("123woods", "woods"));
CharProvider text = new StringCharProvider("I will come and meet you at the woods 123woods and all the woods", 0);
StringFinder finder = stringSearch.createFinder(text);
List<StringMatch> all = finder.findAll();
A much simpler way to do this is to use split():
String match = "123woods";
String text = "I will come and meet you at the 123woods";
String[] sentence = text.split();
for(String word: sentence)
{
if(word.equals(match))
return true;
}
return false;
This is a simpler, less elegant way to do the same thing without using tokens, etc.
You can use regular expressions.
Use Matcher and Pattern methods to get the desired output
You can also use regex matching with the \b flag (whole word boundary).
To Match "123woods" instead of "woods" , use atomic grouping in regular expresssion.
One thing to be noted is that, in a string to match "123woods" alone , it will match the first "123woods" and exits instead of searching the same string further.
\b(?>123woods|woods)\b
it searches 123woods as primary search, once it got matched it exits the search.
Looking back at the original question, we need to find some given keywords in a given sentence, count the number of occurrences and know something about where. I don't quite understand what "where" means (is it an index in the sentence?), so I'll pass that one... I'm still learning java, one step at a time, so I'll see to that one in due time :-)
It must be noticed that common sentences (as the one in the original question) can have repeated keywords, therefore the search cannot just ask if a given keyword "exists or not" and count it as 1 if it does exist. There can be more then one of the same. For example:
// Base sentence (added punctuation, to make it more interesting):
String sentence = "Say that 123 of us will come by and meet you, "
+ "say, at the woods of 123woods.";
// Split it (punctuation taken in consideration, as well):
java.util.List<String> strings =
java.util.Arrays.asList(sentence.split(" |,|\\."));
// My keywords:
java.util.ArrayList<String> keywords = new java.util.ArrayList<>();
keywords.add("123woods");
keywords.add("come");
keywords.add("you");
keywords.add("say");
By looking at it, the expected result would be 5 for "Say" + "come" + "you" + "say" + "123woods", counting "say" twice if we go lowercase. If we don't, then the count should be 4, "Say" being excluded and "say" included. Fine. My suggestion is:
// Set... ready...?
int counter = 0;
// Go!
for(String s : strings)
{
// Asking if the sentence exists in the keywords, not the other
// around, to find repeated keywords in the sentence.
Boolean found = keywords.contains(s.toLowerCase());
if(found)
{
counter ++;
System.out.println("Found: " + s);
}
}
// Statistics:
if (counter > 0)
{
System.out.println("In sentence: " + sentence + "\n"
+ "Count: " + counter);
}
And the results are:
Found: Say
Found: come
Found: you
Found: say
Found: 123woods
In sentence: Say that 123 of us will come by and meet you, say, at the woods of 123woods.
Count: 5
If you want to identify a whole word in a string and change the content of that word you can do this way. Your final string stays equals, except the word you treated. In this case "not" stays "'not'" in final string.
StringBuilder sb = new StringBuilder();
String[] splited = value.split("\\s+");
if(ArrayUtils.isNotEmpty(splited)) {
for(String valor : splited) {
sb.append(" ");
if("not".equals(valor.toLowerCase())) {
sb.append("'").append(valor).append("'");
} else {
sb.append(valor);
}
}
}
return sb.toString();

Java Split not working as expected

I am trying to use a simple split to break up the following string: 00-00000
My expression is: ^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])
And my usage is:
String s = "00-00000";
String pattern = "^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])";
String[] parts = s.split(pattern);
If I play around with the Pattern and Matcher classes I can see that my pattern does match and the matcher tells me my groupCount is 7 which is correct. But when I try and split them I have no luck.
String.split does not use capturing groups as its result. It finds whatever matches and uses that as the delimiter. So the resulting String[] are substrings in between what the regex matches. As it is the regex matches the whole string, and with the whole string as a delimiter there is nothing else left so it returns an empty array.
If you want to use regex capturing groups you will have to use Matcher.group(), String.split() will not do.
for your example, you could simply do this:
String s = "00-00000";
String pattern = "-";
String[] parts = s.split(pattern);
I can not be sure, but I think what you are trying to do is to get each matched group into an array.
Matcher matcher = Pattern.compile(pattern).matcher();
if (matcher.matches()) {
String s[] = new String[matcher.groupCount()) {
for (int i=0;i<matches.groupCount();i++) {
s[i] = matcher.group(i);
}
}
}
From the documentation:
String[] split(String regex) -- Returns: the array of strings computed by splitting this string around matches of the given regular expression
Essentially the regular expression is used to define delimiters in the input string. You can use capturing groups and backreferences in your pattern (e.g. for lookarounds), but ultimately what matters is what and where the pattern matches, because that defines what goes into the returned array.
If you want to split your original string into 7 parts using regular expression, then you can do something like this:
String s = "12-3456";
String[] parts = s.split("(?!^)");
System.out.println(parts.length); // prints "7"
for (String part : parts) {
System.out.println("Part [" + part + "]");
} // prints "[1] [2] [-] [3] [4] [5] [6] "
This splits on zero-length matching assertion (?!^), which is anywhere except before the first character in the string. This prevents the empty string to be the first element in the array, and trailing empty string is already discarded because we use the default limit parameter to split.
Using regular expression to get individual character of a string like this is an overkill, though. If you have only a few characters, then the most concise option is to use foreach on the toCharArray():
for (char ch : "12-3456".toCharArray()) {
System.out.print("[" + ch + "] ");
}
This is not the most efficient option if you have a longer string.
Splitting on -
This may also be what you're looking for:
String s = "12-3456";
String[] parts = s.split("-");
System.out.println(parts.length); // prints "2"
for (String part : parts) {
System.out.print("[" + part + "] ");
} // prints "[12] [3456] "

Categories

Resources