Java regex: Replace all characters with `+` except instances of a given string - java

I have the following problem which states
Replace all characters in a string with + symbol except instances of the given string in the method
so for example if the string given was abc123efg and they want me to replace every character except every instance of 123 then it would become +++123+++.
I figured a regular expression is probably the best for this and I came up with this.
str.replaceAll("[^str]","+")
where str is a variable, but its not letting me use the method without putting it in quotations. If I just want to replace the variable string str how can I do that? I ran it with the string manually typed and it worked on the method, but can I just input a variable?
as of right now I believe its looking for the string "str" and not the variable string.
Here is the output its right for so many cases except for two :(
List of open test cases:
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
plusOut("abXYabcXYZ", "ab") → "ab++ab++++"
plusOut("abXYabcXYZ", "abc") → "++++abc+++"
plusOut("abXYabcXYZ", "XY") → "++XY+++XY+"
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
plusOut("--++ab", "++") → "++++++"
plusOut("aaxxxxbb", "xx") → "++xxxx++"
plusOut("123123", "3") → "++3++3"

Looks like this is the plusOut problem on CodingBat.
I had 3 solutions to this problem, and wrote a new streaming solution just for fun.
Solution 1: Loop and check
Create a StringBuilder out of the input string, and check for the word at every position. Replace the character if doesn't match, and skip the length of the word if found.
public String plusOut(String str, String word) {
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
if (!str.startsWith(word, i))
out.setCharAt(i++, '+');
else
i += word.length();
}
return out.toString();
}
This is probably the expected answer for a beginner programmer, though there is an assumption that the string doesn't contain any astral plane character, which would be represented by 2 char instead of 1.
Solution 2: Replace the word with a marker, replace the rest, then restore the word
public String plusOut(String str, String word) {
return str.replaceAll(java.util.regex.Pattern.quote(word), "#").replaceAll("[^#]", "+").replaceAll("#", word);
}
Not a proper solution since it assumes that a certain character or sequence of character doesn't appear in the string.
Note the use of Pattern.quote to prevent the word being interpreted as regex syntax by replaceAll method.
Solution 3: Regex with \G
public String plusOut(String str, String word) {
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
}
Construct regex \G((?:word)*+)., which does more or less what solution 1 is doing:
\G makes sure the match starts from where the previous match leaves off
((?:word)*+) picks out 0 or more instance of word - if any, so that we can keep them in the replacement with $1. The key here is the possessive quantifier *+, which forces the regex to keep any instance of the word it finds. Otherwise, the regex will not work correctly when the word appear at the end of the string, as the regex backtracks to match .
. will not be part of any word, since the previous part already picks out all consecutive appearances of word and disallow backtrack. We will replace this with +
Solution 4: Streaming
public String plusOut(String str, String word) {
return String.join(word,
Arrays.stream(str.split(java.util.regex.Pattern.quote(word), -1))
.map((String s) -> s.replaceAll("(?s:.)", "+"))
.collect(Collectors.toList()));
}
The idea is to split the string by word, do the replacement on the rest, and join them back with word using String.join method.
Same as above, we need Pattern.quote to avoid split interpreting the word as regex. Since split by default removes empty string at the end of the array, we need to use -1 in the second parameter to make split leave those empty strings alone.
Then we create a stream out of the array and replace the rest as strings of +. In Java 11, we can use s -> String.repeat(s.length()) instead.
The rest is just converting the Stream to an Iterable (List in this case) and joining them for the result

This is a bit trickier than you might initially think because you don't just need to match characters, but the absence of specific phrase - a negated character set is not enough. If the string is 123, you would need:
(?<=^|123)(?!123).*?(?=123|$)
https://regex101.com/r/EZWMqM/1/
That is - lookbehind for the start of the string or "123", make sure the current position is not followed by 123, then lazy-repeat any character until lookahead matches "123" or the end of the string. This will match all characters which are not in a "123" substring. Then, you need to replace each character with a +, after which you can use appendReplacement and a StringBuffer to create the result string:
String inputPhrase = "123";
String inputStr = "abc123efg123123hij";
StringBuffer resultString = new StringBuffer();
Pattern regex = Pattern.compile("(?<=^|" + inputPhrase + ")(?!" + inputPhrase + ").*?(?=" + inputPhrase + "|$)");
Matcher m = regex.matcher(inputStr);
while (m.find()) {
String replacement = m.group(0).replaceAll(".", "+");
m.appendReplacement(resultString, replacement);
}
m.appendTail(resultString);
System.out.println(resultString.toString());
Output:
+++123+++123123+++
Note that if the inputPhrase can contain character with a special meaning in a regular expression, you'll have to escape them first before concatenating into the pattern.

You can do it in one line:
input = input.replaceAll("((?:" + str + ")+)?(?!" + str + ").((?:" + str + ")+)?", "$1+$2");
This optionally captures "123" either side of each character and puts them back (a blank if there's no "123"):

So instead of coming up with a regular expression that matches the absence of a string. We might as well just match the selected phrase and append + the number of skipped characters.
StringBuilder sb = new StringBuilder();
Matcher m = Pattern.compile(Pattern.quote(str)).matcher(input);
while (m.find()) {
for (int i = 0; i < m.start(); i++) sb.append('+');
sb.append(str);
}
int remaining = input.length() - sb.length();
for (int i = 0; i < remaining; i++) {
sb.append('+');
}

Absolutely just for the fun of it, a solution using CharBuffer (unexpectedly it took a lot more that I initially hoped for):
private static String plusOutCharBuffer(String input, String match) {
int size = match.length();
CharBuffer cb = CharBuffer.wrap(input.toCharArray());
CharBuffer word = CharBuffer.wrap(match);
int x = 0;
for (; cb.remaining() > 0;) {
if (!cb.subSequence(0, size < cb.remaining() ? size : cb.remaining()).equals(word)) {
cb.put(x, '+');
cb.clear().position(++x);
} else {
cb.clear().position(x = x + size);
}
}
return cb.clear().toString();
}

To make this work you need a beast of a pattern. Let's say you you are operating on the following test case as an example:
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
What you need to do is build a series of clauses in your pattern to match a single character at a time:
Any character that is NOT "X", "Y" or "Z" -- [^XYZ]
Any "X" not followed by "YZ" -- X(?!YZ)
Any "Y" not preceded by "X" -- (?<!X)Y
Any "Y" not followed by "Z" -- Y(?!Z)
Any "Z" not preceded by "XY" -- (?<!XY)Z
An example of this replacement can be found here: https://regex101.com/r/jK5wU3/4
Here is an example of how this might work (most certainly not optimized, but it works):
import java.util.regex.Pattern;
public class Test {
public static void plusOut(String text, String exclude) {
StringBuilder pattern = new StringBuilder("");
for (int i=0; i<exclude.length(); i++) {
Character target = exclude.charAt(i);
String prefix = (i > 0) ? exclude.substring(0, i) : "";
String postfix = (i < exclude.length() - 1) ? exclude.substring(i+1) : "";
// add the look-behind (?<!X)Y
if (!prefix.isEmpty()) {
pattern.append("(?<!").append(Pattern.quote(prefix)).append(")")
.append(Pattern.quote(target.toString())).append("|");
}
// add the look-ahead X(?!YZ)
if (!postfix.isEmpty()) {
pattern.append(Pattern.quote(target.toString()))
.append("(?!").append(Pattern.quote(postfix)).append(")|");
}
}
// add in the other character exclusion
pattern.append("[^" + Pattern.quote(exclude) + "]");
System.out.println(text.replaceAll(pattern.toString(), "+"));
}
public static void main(String [] args) {
plusOut("12xy34", "xy");
plusOut("12xy34", "1");
plusOut("12xy34xyabcxy", "xy");
plusOut("abXYabcXYZ", "ab");
plusOut("abXYabcXYZ", "abc");
plusOut("abXYabcXYZ", "XY");
plusOut("abXYxyzXYZ", "XYZ");
plusOut("--++ab", "++");
plusOut("aaxxxxbb", "xx");
plusOut("123123", "3");
}
}
UPDATE: Even this doesn't quite work because it can't deal with exclusions that are just repeated characters, like "xx". Regular expressions are most definitely not the right tool for this, but I thought it might be possible. After poking around, I'm not so sure a pattern even exists that might make this work.

The problem in your solution that you put a set of instance string str.replaceAll("[^str]","+") which it will exclude any character from the variable str and that will not solve your problem
EX: when you try str.replaceAll("[^XYZ]","+") it will exclude any combination of character X , character Y and character Z from your replacing method so you will get "++XY+++XYZ".
Actually you should exclude a sequence of characters instead in str.replaceAll.
You can do it by using capture group of characters like (XYZ) then use a negative lookahead to match a string which does not contain characters sequence : ^((?!XYZ).)*$
Check this solution for more info about this problem but you should know that it may be complicated to find regular expression to do that directly.
I have found two simple solutions for this problem :
Solution 1:
You can implement a method to replace all characters with '+' except the instance of given string:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
for(int i = 0; i < str.length(); i++){
// exclude any instance string of exWord from replacing process in str
if(str.substring(i, str.length()).indexOf(exWord) + i == i){
i = i + exWord.length()-1;
}
else{
str = str.substring(0,i) + "+" + str.substring(i+1);//replace each character with '+' symbol
}
}
Note : str.substring(i, str.length()).indexOf(exWord) + i this if statement will exclude any instance string of exWord from replacing process in str.
Output:
+++++++XYZ
Solution 2:
You can try this Approach using ReplaceAll method and it doesn't need any complex regular expression:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
str = str.replaceAll(exWord,"*"); // replace instance string with * symbol
str = str.replaceAll("[^*]","+"); // replace all characters with + symbol except *
str = str.replaceAll("\\*",exWord); // replace * symbol with instance string
Note : This solution will work only if your input string str doesn't contain any * symbol.
Also you should escape any character with a special meaning in a regular expression in phrase instance string exWord like : exWord = "++".

Related

Capitalize first letters in words in the string with different separators using java 8 stream

I need to capitalize first letter in every word in the string, BUT it's not so easy as it seems to be as the word is considered to be any sequence of letters, digits, "_" , "-", "`" while all other chars are considered to be separators, i.e. after them the next letter must be capitalized.
Example what program should do:
For input: "#he&llo wo!r^ld"
Output should be: "#He&Llo Wo!R^Ld"
There are questions that sound similar here, but there solutions really don't help.
This one for example:
String output = Arrays.stream(input.split("[\\s&]+"))
.map(t -> t.substring(0, 1).toUpperCase() + t.substring(1))
.collect(Collectors.joining(" "));
As in my task there can be various separators, this solution doesn't work.
It is possible to split a string and keep the delimiters, so taking into account the requirement for delimiters:
word is considered to be any sequence of letters, digits, "_" , "-", "`" while all other chars are considered to be separators
the pattern which keeps the delimiters in the result array would be: "((?<=[^-`\\w])|(?=[^-`\\w]))":
[^-`\\w]: all characters except -, backtick and word characters \w: [A-Za-z0-9_]
Then, the "words" are capitalized, and delimiters are kept as is:
static String capitalize(String input) {
if (null == input || 0 == input.length()) {
return input;
}
return Arrays.stream(input.split("((?<=[^-`\\w])|(?=[^-`\\w]))"))
.map(s -> s.matches("[-`\\w]+") ? Character.toUpperCase(s.charAt(0)) + s.substring(1) : s)
.collect(Collectors.joining(""));
}
Tests:
System.out.println(capitalize("#he&l_lo-wo!r^ld"));
System.out.println(capitalize("#`he`&l+lo wo!r^ld"));
Output:
#He&l_lo-wo!R^Ld
#`he`&L+Lo Wo!R^Ld
Update
If it is needed to process not only ASCII set of characters but apply to other alphabets or character sets (e.g. Cyrillic, Greek, etc.), POSIX class \\p{IsWord} may be used and matching of Unicode characters needs to be enabled using pattern flag (?U):
static String capitalizeUnicode(String input) {
if (null == input || 0 == input.length()) {
return input;
}
return Arrays.stream(input.split("(?U)((?<=[^-`\\p{IsWord}])|(?=[^-`\\p{IsWord}]))")
.map(s -> s.matches("(?U)[-`\\p{IsWord}]+") ? Character.toUpperCase(s.charAt(0)) + s.substring(1) : s)
.collect(Collectors.joining(""));
}
Test:
System.out.println(capitalizeUnicode("#he&l_lo-wo!r^ld"));
System.out.println(capitalizeUnicode("#привет&`ёж`+дос^βιδ/ως"));
Output:
#He&L_lo-wo!R^Ld
#Привет&`ёж`+Дос^Βιδ/Ως
You can't use split that easily - split will eliminate the separators and give you only the things in between. As you need the separators, no can do.
One real dirty trick is to use something called 'lookahead'. That argument you pass to split is a regular expression. Most 'characters' in a regexp have the property that they consume the matching input. If you do input.split("\\s+") then that doesn't 'just' split on whitespace, it also consumes them: The whitespace is no longer part of the individual entries in your string array.
However, consider ^ and $. or \\b. These still match things but don't consume anything. You don't consume 'end of string'. In fact, ^^^hello$$$ matches the string "hello" just as well. You can do this yourself, using lookahead: It matches when the lookahead is there but does not consume it:
String[] args = "Hello World$Huh Weird".split("(?=[\\s_$-]+)");
for (String arg : args) System.out.println("*" + args[i] + "*");
Unfortunately, this 'works', in that it saves your separators, but isn't getting you all that much closer to a solution:
*Hello*
* World*
*$Huh*
* *
* *
* Weird*
You can go with lookbehind as well, but it's limited; they don't do variable length, for example.
The conclusion should rapidly become: Actually, doing this with split is a mistake.
Then, once split is off the table, you should no longer use streams, either: Streams don't do well once you need to know stuff about the previous element in a stream to do the job: A stream of characters doesn't work, as you need to know if the previous character was a non-letter or not.
In general, "I want to do X, and use Y" is a mistake. Keep an open mind. It's akin to asking: "I want to butter my toast, and use a hammer to do it". Oookaaaaayyyy, you can probably do that, but, eh, why? There are butter knives right there in the drawer, just.. put down the hammer, that's toast. Not a nail.
Same here.
A simple loop can take care of this, no problem:
private static final String BREAK_CHARS = "&-_`";
public String toTitleCase(String input) {
StringBuilder out = new StringBuilder();
boolean atBreak = true;
for (char c : input.toCharArray()) {
out.append(atBreak ? Character.toUpperCase(c) : c);
atBreak = Character.isWhitespace(c) || (BREAK_CHARS.indexOf(c) > -1);
}
return out.toString();
}
Simple. Efficient. Easy to read. Easy to modify. For example, if you want to go with 'any non-letter counts', trivial: atBreak = Character.isLetter(c);.
Contrast to the stream solution which is fragile, weird, far less efficient, and requires a regexp that needs half a page's worth of comment for anybody to understand it.
Can you do this with streams? Yes. You can butter toast with a hammer, too. Doesn't make it a good idea though. Put down the hammer!
You can use a simple FSM as you iterate over the characters in the string, with two states, either in a word, or not in a word. If you are not in a word and the next character is a letter, convert it to upper case, otherwise, if it is not a letter or if you are already in a word, simply copy it unmodified.
boolean isWord(int c) {
return c == '`' || c == '_' || c == '-' || Character.isLetter(c) || Character.isDigit(c);
}
String capitalize(String s) {
StringBuilder sb = new StringBuilder();
boolean inWord = false;
for (int c : s.codePoints().toArray()) {
if (!inWord && Character.isLetter(c)) {
sb.appendCodePoint(Character.toUpperCase(c));
} else {
sb.appendCodePoint(c);
}
inWord = isWord(c);
}
return sb.toString();
}
Note: I have used codePoints(), appendCodePoint(int), and int so that characters outside the basic multilingual plane (with code points greater than 64k) are handled correctly.
I need to capitalize first letter in every word
Here is one way to do it. Admittedly this is a might longer but your requirement to change the first letter to upper case (not first digit or first non-letter) required a helper method. Otherwise it would have been easier. Some others seemed to have missed this point.
Establish word pattern, and test data.
String wordPattern = "[\\w_-`]+";
Pattern p = Pattern.compile(wordPattern);
String[] inputData = { "#he&llo wo!r^ld", "0hel`lo-w0rld" };
Now this simply finds each successive word in the string based on the established regular expression. As each word is found, it changes the first letter in the word to upper case and then puts it in a string buffer in the correct position where the match was found.
for (String input : inputData) {
StringBuilder sb = new StringBuilder(input);
Matcher m = p.matcher(input);
while (m.find()) {
sb.replace(m.start(), m.end(),
upperFirstLetter(m.group()));
}
System.out.println(input + " -> " + sb);
}
prints
#he&llo wo!r^ld -> #He&Llo Wo!R^Ld
0hel`lo-w0rld -> 0Hel`lo-W0rld
Since words may start with digits, and the requirement was to convert the first letter (not character) to upper case. This method finds the first letter, converts it to upper case and
returns the new string. So 01_hello would become 01_Hello
public static String upperFirstLetter(String word) {
char[] chs = word.toCharArray();
for (int i = 0; i < chs.length; i++) {
if (Character.isLetter(chs[i])) {
chs[i] = Character.toUpperCase(chs[i]);
break;
}
}
return String.valueOf(chs);
}

Java regex match pattern with priority

I am using a system where & followed by a certain letter or number represents a color.
Valid characters that can follow & are [A-Fa-fK-Ok-or0-9]
For example I have the string &aThis is a test &bstring that &ehas plenty &4&lof &7colors.
I want to split at every &x while keeping the &x in the strings.
So I use a positive lookahead in my regex
(?=(&[A-Fa-fK-Ok-or0-9]))
That works completely fine, the output is:
&aThis is a test
&bstring that
&ehas plenty
&4
&lof
&7colors.
The problem is that the spot that has two instances of &x right next to each other should not be split, that line should be &4&lof instead.
Does anyone know what regex I can use so that when there are two of &x next to each other that they are matched together. Two instances of the color code should have priority over a single instance.
Issue Description
The problem is known: you need to tokenize a string that may contain consecutive separators you need to keep as a single item in the resulting string list/array.
Splitting with lookaround(s) cannot help here, because an unanchored lookaround tests each position inside the string. If your pattern matched any char in the string, you could use \G operator, but it is not the case. Even adding a + quantifier - s0.split("(?=(?:&[A-Fa-fK-Ok-or0-9])+)" would still return &4, &lof as separate tokens because of this.
Solution
Use matching rather than splitting, and use building blocks to keep it readable.
String s0 = "This is a text&aThis is a test &bstring that &ehas plenty &4&lof &7colors.";
String colorRx = "&[A-Fa-fK-Ok-or0-9]";
String nonColorRx = "[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*";
Pattern pattern = Pattern.compile("(?:" + colorRx + ")+" + nonColorRx + "|" + nonColorRx);
Matcher m = pattern.matcher(s0);
List<String> res = new ArrayList<>();
while (m.find()){
if (!m.group(0).isEmpty()) res.add(m.group(0)); // Add if non-empty!
}
System.out.println(res);
// => [This is a text, &aThis is a test , &bstring that , &ehas plenty , &4&lof , &7colors.]
The regex is
(?:&[A-Fa-fK-Ok-or0-9])+[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*|[^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)*
See the regex demo here. It is actually based on your initial pattern: first, we match all the color codes (1 or more sequences), and then we match 0+ characters that are not a starting point for the color sequence (i.e. all strings other than the color codes). The [^&]*(?:&(?![A-Fa-fK-Ok-or0-9])[^&]*)* subpattern is a synonym of (?s)(?:(?!&[A-Fa-fK-Ok-or0-9]).)* and it is quite handy when you need to match some chunk of text other than the one you specify, but as it is resource consuming (especially in Java), the unrolled version is preferable.
So, the pattern - (?:" + colorRx + ")+" + nonColorRx + "|" + nonColorRx - matches 1+ colorRx subpatterns followed with optional nonColorRx subpatterns, OR (|) zero or more nonColorRx subpatterns. The .group(0).isEmpy() does not allow empty strings in the resulting array.
Something like this will work.
It uses the String#split method and places the valid lines into an ArrayList (e.g. colorLines)
String mainStr = "&aThis is a test &bstring that &ehas plenty &4&lof &7colors";
String [] arr = mainStr.split("&");
List<String> colorLines = new ArrayList<String>();
String lastColor = "";
for (String s : arr)
{
s = s.trim();
if (s.length() > 0)
{
if (s.length() == 1)
{
lastColor += s;
}
else
{
colorLines.add(lastColor.length() > 0 ? lastColor + s : s);
lastColor = "";
}
}
}
for (String s : colorLines)
{
System.out.println(s);
}
Outputs:
aThis is a test
bstring that
ehas plenty
4lof
7colors
I tried:
{
String line = "&aThis is a test &bstring that &ehas plenty &4&lof &7colors.";
String pattern = " &(a-z)*(0-9)*";
String strs[] = line.split(pattern, 0);
for (int i=0; i<strs.length; i++){
if (i!=0){
System.out.println("&"+strs[i]);
} else {
System.out.println(strs[i]);
}
}
}
and the output is :
{
&aThis is a test
&bstring that
&ehas plenty
&4&lof
&7colors.
}
We can add the & at the beginning of all the substrings to get the result you are looking for.

Replace word with special characters from string in Java

I am writing a method which should replace all words which matches with ones from the list with '****'
characters. So far I have code which works but all special characters are ignored.
I have tried with "\\W" in my expression but looks like I didn't use it well so I could use some help.
Here's code I have so far:
for(int i = 0; i < badWords.size(); i++) {
if (StringUtils.containsIgnoreCase(stringToCheck, badWords.get(i))) {
stringToCheck = stringToCheck.replaceAll("(?i)\\b" + badWords.get(i) + "\\b", "****");
}
}
E.g. I have list of words ['bad', '#$$'].
If I have a string: "This is bad string with #$$" I am expecting this method to return "This is **** string with ****"
Note that method should be aware of case sensitive words, e.g. TesT and test should handle same.
I'm not sure why you use the StringUtils you can just directly replace words that match the bad words. This code works for me:
public static void main(String[] args) {
ArrayList<String> badWords = new ArrayList<String>();
badWords.add("test");
badWords.add("BadTest");
badWords.add("\\$\\$");
String test = "This is a TeSt and a $$ with Badtest.";
for(int i = 0; i < badWords.size(); i++) {
test = test.replaceAll("(?i)" + badWords.get(i), "****");
}
test = test.replaceAll("\\w*\\*{4}", "****");
System.out.println(test);
}
Output:
This is a **** and a **** with ****.
The problem is that these special characters e.g. $ are regex control characters and not literal characters. You'll need to escape any occurrence of the following characters in the bad word using two backslashes:
{}()\[].+*?^$|
My guess is that your list of bad words contains special characters that have particular meanings when interpreted in a regular expression (which is what the replaceAll method does). $, for example, typically matches the end of the string/line. So I'd recommend a combination of things:
Don't use containsIgnoreCase to identify whether a replacement needs to be done. Just let the replaceAll run each time - if there is no match against the bad word list, nothing will be done to the string.
The characters like $ that have special meanings in regular expressions should be escaped when they are added into the bad word list. For example, badwords.add("#\\$\\$");
Try something like this:
String stringToCheck = "This is b!d string with #$$";
List<String> badWords = asList("b!d","#$$");
for(int i = 0; i < badWords.size(); i++) {
if (StringUtils.containsIgnoreCase(stringToCheck,badWords.get(i))) {
stringToCheck = stringToCheck.replaceAll("["+badWords.get(i)+"]+","****");
}
}
System.out.println(stringToCheck);
Another solution: bad words matched with word boundaries (and case insensitive).
Pattern badWords = Pattern.compile("\\b(a|b|ĉĉĉ|dddd)\\b",
Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
String text = "adfsa a dfs bb addfdsaf ĉĉĉ adsfs dddd asdfaf a";
Matcher m = badWords.matcher(text);
StringBuffer sb = new StringBuffer(text.length());
while (m.find()) {
m.appendReplacement(sb, stars(m.group(1)));
}
m.appendTail(sb);
String cleanText = sb.toString();
System.out.println(text);
System.out.println(cleanText);
}
private static String stars(String s) {
return s.replaceAll("(?su).", "*");
/*
int cpLength = s.codePointCount(0, s.length());
final String stars = "******************************";
return cpLength >= stars.length() ? stars : stars.substring(0, cpLength);
*/
}
And then (in comment) the stars with the correct count: one star for a Unicode code point giving two surrogate pairs (two UTF-16 chars).

Iterating through String with .find() in Java regex

I'm currently trying to solve a problem from codingbat.com with regular expressions.
I'm new to this, so step-by-step explanations would be appreciated. I could solve this with String methods relatively easily, but I am trying to use regular expressions.
Here is the prompt:
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
etc
My code thus far:
String regex = ".?" + word+ ".?";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
String newStr = "";
while(m.find())
newStr += m.group().replace(word, "");
return newStr;
The problem is that when there are multiple instances of word in a row, the program misses the character preceding the word because m.find() progresses beyond it.
For example: wordEnds("abc1xyz1i1j", "1") should return "cxziij", but my method returns "cxzij", not repeating the "i"
I would appreciate a non-messy solution with an explanation I can apply to other general regex problems.
This is a one-liner solution:
String wordEnds = input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
This matches your edge case as a look ahead within a non-capturing group, then matches the usual (consuming) case.
Note that your requirements don't require iteration, only your question title assumes it's necessary, which it isn't.
Note also that to be absolutely safe, you should escape all characters in word in case any of them are special "regex" characters, so if you can't guarantee that, you need to use Pattern.quote(word) instead of word.
Here's a test of the usual case and the edge case, showing it works:
public static String wordEnds(String input, String word) {
word = Pattern.quote(word); // add this line to be 100% safe
return input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
}
public static void main(String[] args) {
System.out.println(wordEnds("abcXY123XYijk", "XY"));
System.out.println(wordEnds("abc1xyz1i1j", "1"));
}
Output:
c13i
cxziij
Use positive lookbehind and postive lookahead which are zero-width assertions
(?<=(.)|^)1(?=(.)|$)
^ ^ ^-looks for a character after 1 and captures it in group2
| |->matches 1..you can replace it with any word
|
|->looks for a character just before 1 and captures it in group 1..this is zero width assertion that doesn't move forward to match.it is just a test and thus allow us to capture the values
$1 and $2 contains your value..Go on finding till the end
So this should be like
String s1 = "abcXY123XYiXYjk";
String s2 = java.util.regex.Pattern.quote("XY");
String s3 = "";
String r = "(?<=(.)|^)"+s2+"(?=(.)|$)";
Pattern p = Pattern.compile(r);
Matcher m = p.matcher(s1);
while(m.find()) s3 += m.group(1)+m.group(2);
//s3 now contains c13iij
works here
Use regex as follows:
Matcher m = Pattern.compile("(.|)" + Pattern.quote(b) + "(?=(.?))").matcher(a);
for (int i = 1; m.find(); c += m.group(1) + m.group(2), i++);
Check this demo.

How to split a comma separated String while ignoring escaped commas?

I need to write a extended version of the StringUtils.commaDelimitedListToStringArray function which gets an additional parameter: the escape char.
so calling my:
commaDelimitedListToStringArray("test,test\\,test\\,test,test", "\\")
should return:
["test", "test,test,test", "test"]
My current attempt is to use String.split() to split the String using regular expressions:
String[] array = str.split("[^\\\\],");
But the returned array is:
["tes", "test\,test\,tes", "test"]
Any ideas?
The regular expression
[^\\],
means "match a character which is not a backslash followed by a comma" - this is why patterns such as t, are matching, because t is a character which is not a backslash.
I think you need to use some sort of negative lookbehind, to capture a , which is not preceded by a \ without capturing the preceding character, something like
(?<!\\),
(BTW, note that I have purposefully not doubly-escaped the backslashes to make this more readable)
Try:
String array[] = str.split("(?<!\\\\),");
Basically this is saying split on a comma, except where that comma is preceded by two backslashes. This is called a negative lookbehind zero-width assertion.
For future reference, here is the complete method i ended up with:
public static String[] commaDelimitedListToStringArray(String str, String escapeChar) {
// these characters need to be escaped in a regular expression
String regularExpressionSpecialChars = "/.*+?|()[]{}\\";
String escapedEscapeChar = escapeChar;
// if the escape char for our comma separated list needs to be escaped
// for the regular expression, escape it using the \ char
if(regularExpressionSpecialChars.indexOf(escapeChar) != -1)
escapedEscapeChar = "\\" + escapeChar;
// see http://stackoverflow.com/questions/820172/how-to-split-a-comma-separated-string-while-ignoring-escaped-commas
String[] temp = str.split("(?<!" + escapedEscapeChar + "),", -1);
// remove the escapeChar for the end result
String[] result = new String[temp.length];
for(int i=0; i<temp.length; i++) {
result[i] = temp[i].replaceAll(escapedEscapeChar + ",", ",");
}
return result;
}
As matt b said, [^\\], will interpret the character preceding the comma as a part of the delimiter.
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\,test\\,tes" , "test"]
As drvdijk said, (?<!\\), will misinterpret escaped backslashes.
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\,test\\,test" , "test"]
-(unescape commas)->
["test\\\\,test\\,test,test" , "test"]
I would expect being able to escape backslashes as well...
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\" , "test\\,test" , "test"]
-(unescape commas and backslashes)->
["test\\,test\\" , "test,test" , "test"]
drvdijk suggested (?<=(?<!\\\\)(\\\\\\\\){0,100}), which works well for lists with elements ending with up to 100 backslashes. This is far enough... but why a limit? Is there a more efficient way (isn't lookbehind greedy)? What about invalid strings?
I searched for a while for a generic solution, then I wrote the thing myself... The idea is to split following a pattern that matches the list elements (instead of matching the delimiter).
My answer does not take the escape character as a parameter.
public static List<String> commaDelimitedListStringToStringList(String list) {
// Check the validity of the list
// ex: "te\\st" is not valid, backslash should be escaped
if (!list.matches("^(([^\\\\,]|\\\\,|\\\\\\\\)*(,|$))+")) {
// Could also raise an exception
return null;
}
// Matcher for the list elements
Matcher matcher = Pattern
.compile("(?<=(^|,))([^\\\\,]|\\\\,|\\\\\\\\)*(?=(,|$))")
.matcher(list);
ArrayList<String> result = new ArrayList<String>();
while (matcher.find()) {
// Unescape the list element
result.add(matcher.group().replaceAll("\\\\([\\\\,])", "$1"));
}
return result;
}
Description for the pattern (unescaped):
(?<=(^|,)) forward is start of string or a ,
([^\\,]|\\,|\\\\)* the element composed of \,, \\ or characters wich are neither \ nor ,
(?=(,|$)) behind is end of string or a ,
The pattern may be simplified.
Even with the 3 parsings (matches + find + replaceAll), this method seems faster than the one suggested by drvdijk. It can still be optimized by writing a specific parser.
Also, what is the need of having an escape character if only one character is special, it could simply be doubled...
public static List<String> commaDelimitedListStringToStringList2(String list) {
if (!list.matches("^(([^,]|,,)*(,|$))+")) {
return null;
}
Matcher matcher = Pattern.compile("(?<=(^|,))([^,]|,,)*(?=(,|$))")
.matcher(list);
ArrayList<String> result = new ArrayList<String>();
while (matcher.find()) {
result.add(matcher.group().replaceAll(",,", ","));
}
return result;
}
split(/(?<!\\),/g) worked for me, but the accepted answer did not
> var x = "test,test\,test\,test,test"
undefined
> x.split(/(?<!\\),/g)
[ 'test', 'test\\,test\\,test', 'test' ]
> x.split("(?<!\\\\),")
[ 'test,test\\,test\\,test,test' ]
It's probably not "super fancy" solution, but possibly more time-efficient one. Escaping an escape character is also supported and it's working in browsers not supporting 'lookbehinds'.
function splitByDelimiterIfItIsNotEscaped (text, delimiter, escapeCharacter) {
const splittedText = []
let numberOfDelimitersBeforeOtherCharacter = 0
let nextSplittedTextPartIndex = 0
for (let characterIndex = 0, character = text[0]; characterIndex < text.length; characterIndex++, character = text[characterIndex]) {
if (character === escapeCharacter) {
numberOfDelimitersBeforeOtherCharacter++
} else if (character === delimiter && (!numberOfDelimitersBeforeOtherCharacter || !(numberOfDelimitersBeforeOtherCharacter % 2))) {
splittedText.push(text.substring(nextSplittedTextPartIndex, characterIndex))
nextSplittedTextPartIndex = characterIndex + 1
} else {
numberOfDelimitersBeforeOtherCharacter = 0
}
}
if (nextSplittedTextPartIndex <= text.length) {
splittedText.push(text.substring(nextSplittedTextPartIndex, text.length))
}
return splittedText
}
function onChange () {
console.log(splitByDelimiterIfItIsNotEscaped(inputBox.value, ',', '\\'))
}
addEventListener('change', onChange)
onChange()
After making a change unfocus the input box (use tab for example).
<input id="inputBox" value="test,test\,test\,test,test"/>

Categories

Resources