String.replaceAll for multiple characters

String.replaceAll for multiple characters - java

I have a line with ^||^ as my delimiter, I am using
int charCount = line.replaceAll("[^" + fileSeperator + "]", "").length();
if(fileSeperator.length()>1)
{
charCount=charCount/fileSeperator.length();
System.out.println(charCount+"char count between");
}
This does not work if i have a line that has stray | or ^ as it counts these as well. How can i modify the regex or any other suggestions?

If I understand correctly, what you're really trying to do is count the number of times that ^||^ appears in your String.
If that's the case, you can use:
Matcher m = Pattern.compile(Pattern.quote("^||^")).matcher(line);
int count = 0;
while ( m.find() )
count++;
System.out.println(count + "char count between");
But you really don't need the regex engine for this.
int startIndex = 0;
int count = 0;
while ( true ) {
int newIndex = line.indexOf(fileDelimiter, startIndex);
if ( newIndex == -1 ) {
break;
} else {
startIndex = newIndex + 1;
count++;
}
}

Certain characters have special meanings in a regular expression, such as ^ and |. These must be escaped with a backslash in order for them to be treated as normal characters and not as special characters. For example, the following regular expression matches all caret (^) and pipe (|) characters (note the backslashes): [\^\|]
The Pattern.quote() method can be used to escape all of the special characters in a given String.
String quoted = Pattern.quote("^||^"); //returns "\^\|\|\^";
Also note that a character class only matches one character. Thus, the regex [^\^\|\|\^] will match all characters except ^ and |, not all characters except the sequence ^||^. If your intention is to count the number of delimiters (^||^) in a String, then a better approach might be to use the String.indexOf(String, int) method.

Mark Peters's answer seems better. I edited so my answer won't cause any confusion.

You should replace it like this with proper escaping since your delimiter has all special character of regex:
line.replaceAll("\\^\\|\\|\\^", "");
OR else don't use regex at all and call replace method like this:
line.replace("^||^", "");

Lazy solutions.
Depending on the end goal (the println statement is a little confusing):
int numberOfDelimiters = (line.length() - line.replace(fileSeparator,"").length())
/ fileSeparator.length();
int numberOfNonDelimiterChars = line.replace(fileSeparator,"").length();

Related

Java regex: Replace all characters with `+` except instances of a given string

I have the following problem which states
Replace all characters in a string with + symbol except instances of the given string in the method
so for example if the string given was abc123efg and they want me to replace every character except every instance of 123 then it would become +++123+++.
I figured a regular expression is probably the best for this and I came up with this.
str.replaceAll("[^str]","+")
where str is a variable, but its not letting me use the method without putting it in quotations. If I just want to replace the variable string str how can I do that? I ran it with the string manually typed and it worked on the method, but can I just input a variable?
as of right now I believe its looking for the string "str" and not the variable string.
Here is the output its right for so many cases except for two :(
List of open test cases:
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
plusOut("abXYabcXYZ", "ab") → "ab++ab++++"
plusOut("abXYabcXYZ", "abc") → "++++abc+++"
plusOut("abXYabcXYZ", "XY") → "++XY+++XY+"
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
plusOut("--++ab", "++") → "++++++"
plusOut("aaxxxxbb", "xx") → "++xxxx++"
plusOut("123123", "3") → "++3++3"

Looks like this is the plusOut problem on CodingBat.
I had 3 solutions to this problem, and wrote a new streaming solution just for fun.
Solution 1: Loop and check
Create a StringBuilder out of the input string, and check for the word at every position. Replace the character if doesn't match, and skip the length of the word if found.
public String plusOut(String str, String word) {
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
if (!str.startsWith(word, i))
out.setCharAt(i++, '+');
else
i += word.length();
}
return out.toString();
}
This is probably the expected answer for a beginner programmer, though there is an assumption that the string doesn't contain any astral plane character, which would be represented by 2 char instead of 1.
Solution 2: Replace the word with a marker, replace the rest, then restore the word
public String plusOut(String str, String word) {
return str.replaceAll(java.util.regex.Pattern.quote(word), "#").replaceAll("[^#]", "+").replaceAll("#", word);
}
Not a proper solution since it assumes that a certain character or sequence of character doesn't appear in the string.
Note the use of Pattern.quote to prevent the word being interpreted as regex syntax by replaceAll method.
Solution 3: Regex with \G
public String plusOut(String str, String word) {
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
}
Construct regex \G((?:word)*+)., which does more or less what solution 1 is doing:
\G makes sure the match starts from where the previous match leaves off
((?:word)*+) picks out 0 or more instance of word - if any, so that we can keep them in the replacement with $1. The key here is the possessive quantifier *+, which forces the regex to keep any instance of the word it finds. Otherwise, the regex will not work correctly when the word appear at the end of the string, as the regex backtracks to match .
. will not be part of any word, since the previous part already picks out all consecutive appearances of word and disallow backtrack. We will replace this with +
Solution 4: Streaming
public String plusOut(String str, String word) {
return String.join(word,
Arrays.stream(str.split(java.util.regex.Pattern.quote(word), -1))
.map((String s) -> s.replaceAll("(?s:.)", "+"))
.collect(Collectors.toList()));
}
The idea is to split the string by word, do the replacement on the rest, and join them back with word using String.join method.
Same as above, we need Pattern.quote to avoid split interpreting the word as regex. Since split by default removes empty string at the end of the array, we need to use -1 in the second parameter to make split leave those empty strings alone.
Then we create a stream out of the array and replace the rest as strings of +. In Java 11, we can use s -> String.repeat(s.length()) instead.
The rest is just converting the Stream to an Iterable (List in this case) and joining them for the result

This is a bit trickier than you might initially think because you don't just need to match characters, but the absence of specific phrase - a negated character set is not enough. If the string is 123, you would need:
(?<=^|123)(?!123).*?(?=123|$)
https://regex101.com/r/EZWMqM/1/
That is - lookbehind for the start of the string or "123", make sure the current position is not followed by 123, then lazy-repeat any character until lookahead matches "123" or the end of the string. This will match all characters which are not in a "123" substring. Then, you need to replace each character with a +, after which you can use appendReplacement and a StringBuffer to create the result string:
String inputPhrase = "123";
String inputStr = "abc123efg123123hij";
StringBuffer resultString = new StringBuffer();
Pattern regex = Pattern.compile("(?<=^|" + inputPhrase + ")(?!" + inputPhrase + ").*?(?=" + inputPhrase + "|$)");
Matcher m = regex.matcher(inputStr);
while (m.find()) {
String replacement = m.group(0).replaceAll(".", "+");
m.appendReplacement(resultString, replacement);
}
m.appendTail(resultString);
System.out.println(resultString.toString());
Output:
+++123+++123123+++
Note that if the inputPhrase can contain character with a special meaning in a regular expression, you'll have to escape them first before concatenating into the pattern.

You can do it in one line:
input = input.replaceAll("((?:" + str + ")+)?(?!" + str + ").((?:" + str + ")+)?", "$1+$2");
This optionally captures "123" either side of each character and puts them back (a blank if there's no "123"):

So instead of coming up with a regular expression that matches the absence of a string. We might as well just match the selected phrase and append + the number of skipped characters.
StringBuilder sb = new StringBuilder();
Matcher m = Pattern.compile(Pattern.quote(str)).matcher(input);
while (m.find()) {
for (int i = 0; i < m.start(); i++) sb.append('+');
sb.append(str);
}
int remaining = input.length() - sb.length();
for (int i = 0; i < remaining; i++) {
sb.append('+');
}

Absolutely just for the fun of it, a solution using CharBuffer (unexpectedly it took a lot more that I initially hoped for):
private static String plusOutCharBuffer(String input, String match) {
int size = match.length();
CharBuffer cb = CharBuffer.wrap(input.toCharArray());
CharBuffer word = CharBuffer.wrap(match);
int x = 0;
for (; cb.remaining() > 0;) {
if (!cb.subSequence(0, size < cb.remaining() ? size : cb.remaining()).equals(word)) {
cb.put(x, '+');
cb.clear().position(++x);
} else {
cb.clear().position(x = x + size);
}
}
return cb.clear().toString();
}

To make this work you need a beast of a pattern. Let's say you you are operating on the following test case as an example:
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
What you need to do is build a series of clauses in your pattern to match a single character at a time:
Any character that is NOT "X", "Y" or "Z" -- [^XYZ]
Any "X" not followed by "YZ" -- X(?!YZ)
Any "Y" not preceded by "X" -- (?<!X)Y
Any "Y" not followed by "Z" -- Y(?!Z)
Any "Z" not preceded by "XY" -- (?<!XY)Z
An example of this replacement can be found here: https://regex101.com/r/jK5wU3/4
Here is an example of how this might work (most certainly not optimized, but it works):
import java.util.regex.Pattern;
public class Test {
public static void plusOut(String text, String exclude) {
StringBuilder pattern = new StringBuilder("");
for (int i=0; i<exclude.length(); i++) {
Character target = exclude.charAt(i);
String prefix = (i > 0) ? exclude.substring(0, i) : "";
String postfix = (i < exclude.length() - 1) ? exclude.substring(i+1) : "";
// add the look-behind (?<!X)Y
if (!prefix.isEmpty()) {
pattern.append("(?<!").append(Pattern.quote(prefix)).append(")")
.append(Pattern.quote(target.toString())).append("|");
}
// add the look-ahead X(?!YZ)
if (!postfix.isEmpty()) {
pattern.append(Pattern.quote(target.toString()))
.append("(?!").append(Pattern.quote(postfix)).append(")|");
}
}
// add in the other character exclusion
pattern.append("[^" + Pattern.quote(exclude) + "]");
System.out.println(text.replaceAll(pattern.toString(), "+"));
}
public static void main(String [] args) {
plusOut("12xy34", "xy");
plusOut("12xy34", "1");
plusOut("12xy34xyabcxy", "xy");
plusOut("abXYabcXYZ", "ab");
plusOut("abXYabcXYZ", "abc");
plusOut("abXYabcXYZ", "XY");
plusOut("abXYxyzXYZ", "XYZ");
plusOut("--++ab", "++");
plusOut("aaxxxxbb", "xx");
plusOut("123123", "3");
}
}
UPDATE: Even this doesn't quite work because it can't deal with exclusions that are just repeated characters, like "xx". Regular expressions are most definitely not the right tool for this, but I thought it might be possible. After poking around, I'm not so sure a pattern even exists that might make this work.

The problem in your solution that you put a set of instance string str.replaceAll("[^str]","+") which it will exclude any character from the variable str and that will not solve your problem
EX: when you try str.replaceAll("[^XYZ]","+") it will exclude any combination of character X , character Y and character Z from your replacing method so you will get "++XY+++XYZ".
Actually you should exclude a sequence of characters instead in str.replaceAll.
You can do it by using capture group of characters like (XYZ) then use a negative lookahead to match a string which does not contain characters sequence : ^((?!XYZ).)*$
Check this solution for more info about this problem but you should know that it may be complicated to find regular expression to do that directly.
I have found two simple solutions for this problem :
Solution 1:
You can implement a method to replace all characters with '+' except the instance of given string:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
for(int i = 0; i < str.length(); i++){
// exclude any instance string of exWord from replacing process in str
if(str.substring(i, str.length()).indexOf(exWord) + i == i){
i = i + exWord.length()-1;
}
else{
str = str.substring(0,i) + "+" + str.substring(i+1);//replace each character with '+' symbol
}
}
Note : str.substring(i, str.length()).indexOf(exWord) + i this if statement will exclude any instance string of exWord from replacing process in str.
Output:
+++++++XYZ
Solution 2:
You can try this Approach using ReplaceAll method and it doesn't need any complex regular expression:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
str = str.replaceAll(exWord,"*"); // replace instance string with * symbol
str = str.replaceAll("[^*]","+"); // replace all characters with + symbol except *
str = str.replaceAll("\\*",exWord); // replace * symbol with instance string
Note : This solution will work only if your input string str doesn't contain any * symbol.
Also you should escape any character with a special meaning in a regular expression in phrase instance string exWord like : exWord = "++".

Regular expression for phrase contain literals and numbers but is not all phrase as a number only with fixed range length

i want to have regular expression to check input character as a-z and 0-9 but i do not want to allow input as just numeric value at all ( must be have at least one alphabetic character)
for example :
413123123123131
not allowed but if have just only one alphabetic character in any place of phrase it's ok
i trying to define correct Regex for that and at final i raised to
[0-9]*[a-z].*
but in now i confused how to defined {x,y} length of phrase i want to have {9,31} but after last * i can not to have length block too i trying to define group but unlucky and not worked
tested at https://www.debuggex.com/
how can i to add it ??

What you seek is
String regex = "(?=.{9,31}$)\\p{Alnum}*\\p{Alpha}\\p{Alnum}*";
Use it with String#matches() / Pattern#matches() method to require a full string match:
if (s.matches(regex)) {
return true;
}
Details
^ - implicit in matches() - matches the start of string
(?=.{9,31}$) - a positive lookahead that requires 9 to 31 any chars other than line break chars from the start to end of the string
\\p{Alnum}* - 0 or more alphanumeric chars
\\p{Alpha} - an ASCII letter
\\p{Alnum}* - 0 or more alphanumeric chars
Java demo:
String lines[] = {"413123123123131", "4131231231231a"};
Pattern p = Pattern.compile("(?=.{9,31}$)\\p{Alnum}*\\p{Alpha}\\p{Alnum}*");
for(String line : lines)
{
Matcher m = p.matcher(line);
if(m.matches()) {
System.out.println(line + ": MATCH");
} else {
System.out.println(line + ": NO MATCH");
}
}
Output:
413123123123131: NO MATCH
4131231231231a: MATCH

This might be what you are looking for.
[0-9a-zA-Z]*[a-zA-Z][0-9a-zA-Z]*
To help explain it, think of the middle term as your one required character and the outer terms as any number of alpha numeric characters.
Edit: to restrict the length of the string as a whole you may have to check that manually after matching. ie.
if (str.length > 9 && str.length < 31)
Wiktor does provide a solution that involves more regex, please look at his for a better regex pattern

Try this Regex:
^(?:(?=[a-z])[a-z0-9]{9,31}|(?=\d.*[a-z])[a-z0-9]{9,31})$
OR a bit shorter form:
^(?:(?=[a-z])|(?=\d.*[a-z]))[a-z0-9]{9,31}$
Demo
Explanation(for the 1st regex):
^ - position before the start of the string
(?=[a-z])[a-z0-9]{9,31} means If the string starts with a letter, then match Letters and digits. minimum 9 and maximum 31
| - OR
(?=\d.*[a-z])[a-z0-9]{9,31} means If the string starts with a digit followed by a letter somewhere in the string, then match letters and digits. Minimum 9 and Maximum 31. This also ensures that If the string starts with a digit and if there is no letter anywhere in the string, there won't be any match
$ - position after the last literal of the string
OUTPUT:
413123123123131 NO MATCH(no alphabets)
kjkhsjkf989089054835werewrew65 MATCH
kdfgfd4374985794379857984379857weorjijuiower NO MATCH(length more than 31)
9087erkjfg9080980984590p465467 MATCH
4131231231231a MATCH
kjdfg34 NO MATCH(Length less than 9)

Here's the regex:
[a-zA-Z\d]*[a-zA-Z][a-zA-Z\d]*
The trick here is to have something that is not optional. The leading and trailing [a-zA-Z\d] has a * quantifier, so they are optional. But the [a-zA-Z] in the middle there is not optional. The string must have a character that matches [a-zA-Z] in order to be matched.
However, you need to check the length of the string with length afterwards and not with regex. I can't think of any way how you can do this in regex.
Actually, I think you can do this regexless pretty easily:
private static boolean matches(String input) {
for (int i = 0 ; i < input.length() ; i++) {
if (Character.isLetter(input.charAt(i))) {
return input.length() >= 9 && input.length() <= 31;
}
}
return false;
}

indexOf() vs regex for identifying special characters like $ and {

I would like to check if a special character like { or $is present in a string or not. I used regexp but during code review I was asked to use indexOf() instead regex( as its costlier). I would like to understand how indexOf() is used to identify special characters. (I familiar that this can be done to index substring)
String photoRoot = "http://someurl/${TOKEN1}/${TOKEN2}";
Pattern p = Pattern.compile("\\$\\{(.*?)\\}");
Matcher m = p.matcher(photoRoot);
if (m.find()) {
// logic to be performed
}

There are more then one indexOf(...) methods but all of them treat all characters the same, there is no need to escape any characters while using these methods.
Here is how you can get the two tokens by using some of the indexOf(...) methods:
String photoRoot = "http://someurl/${TOKEN1}/${TOKEN2}";
String startDelimiter = "${";
char endDelimiter = '}';
int start = -1, end = -1;
while (true) {
start = photoRoot.indexOf(startDelimiter, end);
end = photoRoot.indexOf(endDelimiter, start + startDelimiter.length());
if (start != -1 && end != -1) {
System.out.println(photoRoot.substring(start + startDelimiter.length(), end));
} else {
break;
}
}

If you're only looking to find a couple of different special characters you'd just use indexOf("$") or indexOf("}"). You will need to specify each special character you want to find separately.
There is no way though to have it find the index of every special character in one statement: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#indexOf(int)

If you just need to check for 2 characters as in your question, the answer will be
var found = photoRoot.indexOf("$") >=0 ||| photoRoot.indexOf("?") >=0;

It's always difficult to guess while there is contradicting information. The code does not look for special characters, it searches for a pattern - and indexOf will not help you there.
Titus' answer is good for avoiding pattern matching if you need to find the pattern ${...} (as opposed to "identifying special characters")
If (as the reviewer appears to think) you just need to look for any of a set of special characters you can apply indexOf( on_special_char ) repeatedly, but you can also do
for( int i = 0; i < photoRoot.length(); ++i ){
if( "${}".indexOf( photoRoot.charAt(i) ) >= 0 ){
// one of the special characters is at pos i
}
}
Not sure where the performance "break even" between multiple indexOf calls on the target string and the (above) iteration on the target with indexOf on the (short) string containing the specials is. But it may be easier to maintain and permits dynamic adaption to the set of specials.
Of course, the simple
photoRoot.matches( ".*" + Pattern.quote( specials ) + ".*" );
is also dynamically adaptable.

codingbat wordEnds using regex

I'm trying to solve wordEnds from codingbat.com using regex.
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
wordEnds("XYXY", "XY") → "XY"
This is the simplest as I can make it with my current knowledge of regex:
public String wordEnds(String str, String word) {
return str.replaceAll(
".*?(?=word)(?<=(.|^))word(?=(.|$))|.+"
.replace("word", java.util.regex.Pattern.quote(word)),
"$1$2"
);
}
replace is used to place in the actual word string into the pattern for readability. Pattern.quote isn't necessary to pass their tests, but I think it's required for a proper regex-based solution.
The regex has two major parts:
If after matching as few characters as possible ".*?", word can still be found "(?=word)", then lookbehind to capture any character immediately preceding it "(?<=(.|^))", match "word", and lookforward to capture any character following it "(?=(.|$))".
The initial "if" test ensures that the atomic lookbehind captures only if there's a word
Using lookahead to capture the following character doesn't consume it, so it can be used as part of further matching
Otherwise match what's left "|.+"
Groups 1 and 2 would capture empty strings
I think this works in all cases, but it's obviously quite complex. I'm just wondering if others can suggest a simpler regex to do this.
Note: I'm not looking for a solution using indexOf and a loop. I want a regex-based replaceAll solution. I also need a working regex that passes all codingbat tests.
I managed to reduce the occurrence of word within the pattern to just one.
".+?(?<=(^|.)word)(?=(.?))|.+"
I'm still looking if it's possible to simplify this further, but I also have another question:
With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?

Based on your solution I managed to simplify the code a little bit:
public String wordEnds(String str, String word) {
return str.replaceAll(".*?(?="+word+")(?<=(.|^))"+word+"(?=(.|$))|.+","$1$2");
}
Another way of writing it would be:
public String wordEnds(String str, String word) {
return str.replaceAll(
String.format(".*?(?="+word+")(?<=(.|^))"+word+"(?=(.|$))|.+",word),
"$1$2");
}

With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?
In Oracle's implementation, the behavior of look-behind is as follow:
By "studying" the regex (with study() method in each node), it knows the maximum length and minimum length of the pattern in look-behind group. (The study() method is what allows for obvious look-behind length)
It verifies the look-behind by starting a match at every position from index (current - min_length) to position (current - max_length) and exits early if the condition is satisfied.
Effectively, it will try to verify the look-behind on the shortest string first.
The implementation multiplies the matching complexity by O(k) factor.
This explains why changing ^|. to .? doesn't work: due to the starting position, it effectively checks for word before .word. The quantifier doesn't have a say here, since the ordering is imposed by the match range.
You can check the code of match method in Pattern.Behind and Pattern.NotBehind inner classes to verify what I said above.
In .NET's flavor, look-behind is likely implemented by the reverse matching feature, which means that no extra factor is incurred on the matching complexity.
My suspicion comes from the fact that the capturing group in (?<=(a+))b matches all a's in aaaaaaaaaaaaaab. The quantifier is shown to have free reign in look-behind group.
I have tested that ^|. can be simplified to .? in .NET and the regex works correctly.

I am working in .NET's regex but I was able to change your pattern to:
.+?(?<=(\w?)word)(?=(\w?))|.+
with the positive results. You know its a word (alphanumeric) type character, why not give a valid hint to the parser of that fact; instead of any character its an optional alpha numeric character.
It may answer why you don't need to specify the anchors of ^ and $, for what exactly is $ - is it \r or \n or other? (.NET has issues with $, and maybe you are not exactly capturing a Null of $, but the null of \r or \n which allowed you to change to .? for $)

Another solution to look at...
public String wordEnds(String str, String word) {
if(str.equals(word)) return "";
int i = 0;
String result = "";
int stringLen = str.length();
int wordLen = word.length();
int diffLen = stringLen - wordLen;
while(i<=diffLen){
if(i==0 && str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i+wordLen);
}else if(i==diffLen && str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i-1);
}else if(str.substring(i,i+wordLen).equals(word)){
result = result + str.charAt(i-1) + str.charAt(i+wordLen) ;
}
i++;
}
if(result.length()==1) result = result + result;
return result;
}

Another possible solution:
public String wordEnds(String str, String word) {
String result = "";
if (str.contains(word)) {
for (int i = 0; i < str.length(); i++) {
if (str.startsWith(word, i)) {
if (i > 0) {
result += str.charAt(i - 1);
}
if ((i + word.length()) < str.length()) {
result += str.charAt(i + word.length());
}
}
}
}
return result;
}

How to split a comma separated String while ignoring escaped commas?

I need to write a extended version of the StringUtils.commaDelimitedListToStringArray function which gets an additional parameter: the escape char.
so calling my:
commaDelimitedListToStringArray("test,test\\,test\\,test,test", "\\")
should return:
["test", "test,test,test", "test"]
My current attempt is to use String.split() to split the String using regular expressions:
String[] array = str.split("[^\\\\],");
But the returned array is:
["tes", "test\,test\,tes", "test"]
Any ideas?

The regular expression
[^\\],
means "match a character which is not a backslash followed by a comma" - this is why patterns such as t, are matching, because t is a character which is not a backslash.
I think you need to use some sort of negative lookbehind, to capture a , which is not preceded by a \ without capturing the preceding character, something like
(?<!\\),
(BTW, note that I have purposefully not doubly-escaped the backslashes to make this more readable)

Try:
String array[] = str.split("(?<!\\\\),");
Basically this is saying split on a comma, except where that comma is preceded by two backslashes. This is called a negative lookbehind zero-width assertion.

For future reference, here is the complete method i ended up with:
public static String[] commaDelimitedListToStringArray(String str, String escapeChar) {
// these characters need to be escaped in a regular expression
String regularExpressionSpecialChars = "/.*+?|()[]{}\\";
String escapedEscapeChar = escapeChar;
// if the escape char for our comma separated list needs to be escaped
// for the regular expression, escape it using the \ char
if(regularExpressionSpecialChars.indexOf(escapeChar) != -1)
escapedEscapeChar = "\\" + escapeChar;
// see http://stackoverflow.com/questions/820172/how-to-split-a-comma-separated-string-while-ignoring-escaped-commas
String[] temp = str.split("(?<!" + escapedEscapeChar + "),", -1);
// remove the escapeChar for the end result
String[] result = new String[temp.length];
for(int i=0; i<temp.length; i++) {
result[i] = temp[i].replaceAll(escapedEscapeChar + ",", ",");
}
return result;
}

As matt b said, [^\\], will interpret the character preceding the comma as a part of the delimiter.
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\,test\\,tes" , "test"]
As drvdijk said, (?<!\\), will misinterpret escaped backslashes.
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\,test\\,test" , "test"]
-(unescape commas)->
["test\\\\,test\\,test,test" , "test"]
I would expect being able to escape backslashes as well...
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\" , "test\\,test" , "test"]
-(unescape commas and backslashes)->
["test\\,test\\" , "test,test" , "test"]
drvdijk suggested (?<=(?<!\\\\)(\\\\\\\\){0,100}), which works well for lists with elements ending with up to 100 backslashes. This is far enough... but why a limit? Is there a more efficient way (isn't lookbehind greedy)? What about invalid strings?
I searched for a while for a generic solution, then I wrote the thing myself... The idea is to split following a pattern that matches the list elements (instead of matching the delimiter).
My answer does not take the escape character as a parameter.
public static List<String> commaDelimitedListStringToStringList(String list) {
// Check the validity of the list
// ex: "te\\st" is not valid, backslash should be escaped
if (!list.matches("^(([^\\\\,]|\\\\,|\\\\\\\\)*(,|$))+")) {
// Could also raise an exception
return null;
}
// Matcher for the list elements
Matcher matcher = Pattern
.compile("(?<=(^|,))([^\\\\,]|\\\\,|\\\\\\\\)*(?=(,|$))")
.matcher(list);
ArrayList<String> result = new ArrayList<String>();
while (matcher.find()) {
// Unescape the list element
result.add(matcher.group().replaceAll("\\\\([\\\\,])", "$1"));
}
return result;
}
Description for the pattern (unescaped):
(?<=(^|,)) forward is start of string or a ,
([^\\,]|\\,|\\\\)* the element composed of \,, \\ or characters wich are neither \ nor ,
(?=(,|$)) behind is end of string or a ,
The pattern may be simplified.
Even with the 3 parsings (matches + find + replaceAll), this method seems faster than the one suggested by drvdijk. It can still be optimized by writing a specific parser.
Also, what is the need of having an escape character if only one character is special, it could simply be doubled...
public static List<String> commaDelimitedListStringToStringList2(String list) {
if (!list.matches("^(([^,]|,,)*(,|$))+")) {
return null;
}
Matcher matcher = Pattern.compile("(?<=(^|,))([^,]|,,)*(?=(,|$))")
.matcher(list);
ArrayList<String> result = new ArrayList<String>();
while (matcher.find()) {
result.add(matcher.group().replaceAll(",,", ","));
}
return result;
}

split(/(?<!\\),/g) worked for me, but the accepted answer did not
> var x = "test,test\,test\,test,test"
undefined
> x.split(/(?<!\\),/g)
[ 'test', 'test\\,test\\,test', 'test' ]
> x.split("(?<!\\\\),")
[ 'test,test\\,test\\,test,test' ]

It's probably not "super fancy" solution, but possibly more time-efficient one. Escaping an escape character is also supported and it's working in browsers not supporting 'lookbehinds'.
function splitByDelimiterIfItIsNotEscaped (text, delimiter, escapeCharacter) {
const splittedText = []
let numberOfDelimitersBeforeOtherCharacter = 0
let nextSplittedTextPartIndex = 0
for (let characterIndex = 0, character = text[0]; characterIndex < text.length; characterIndex++, character = text[characterIndex]) {
if (character === escapeCharacter) {
numberOfDelimitersBeforeOtherCharacter++
} else if (character === delimiter && (!numberOfDelimitersBeforeOtherCharacter || !(numberOfDelimitersBeforeOtherCharacter % 2))) {
splittedText.push(text.substring(nextSplittedTextPartIndex, characterIndex))
nextSplittedTextPartIndex = characterIndex + 1
} else {
numberOfDelimitersBeforeOtherCharacter = 0
}
}
if (nextSplittedTextPartIndex <= text.length) {
splittedText.push(text.substring(nextSplittedTextPartIndex, text.length))
}
return splittedText
}
function onChange () {
console.log(splitByDelimiterIfItIsNotEscaped(inputBox.value, ',', '\\'))
}
addEventListener('change', onChange)
onChange()
After making a change unfocus the input box (use tab for example).
<input id="inputBox" value="test,test\,test\,test,test"/>

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

String.replaceAll for multiple characters - java

Mark Peters's answer seems better. I edited so my answer won't cause any confusion.

You should replace it like this with proper escaping since your delimiter has all special character of regex: line.replaceAll("\\^\\|\\|\\^", ""); OR else don't use regex at all and call replace method like this: line.replace("^||^", "");

Lazy solutions. Depending on the end goal (the println statement is a little confusing): int numberOfDelimiters = (line.length() - line.replace(fileSeparator,"").length()) / fileSeparator.length(); int numberOfNonDelimiterChars = line.replace(fileSeparator,"").length();

Related

Java regex: Replace all characters with `+` except instances of a given string

Regular expression for phrase contain literals and numbers but is not all phrase as a number only with fixed range length

indexOf() vs regex for identifying special characters like $ and {

codingbat wordEnds using regex

How to split a comma separated String while ignoring escaped commas?

Categories

Resources