How to split a comma separated String while ignoring escaped commas?

How to split a comma separated String while ignoring escaped commas? - java

I need to write a extended version of the StringUtils.commaDelimitedListToStringArray function which gets an additional parameter: the escape char.
so calling my:
commaDelimitedListToStringArray("test,test\\,test\\,test,test", "\\")
should return:
["test", "test,test,test", "test"]
My current attempt is to use String.split() to split the String using regular expressions:
String[] array = str.split("[^\\\\],");
But the returned array is:
["tes", "test\,test\,tes", "test"]
Any ideas?

The regular expression
[^\\],
means "match a character which is not a backslash followed by a comma" - this is why patterns such as t, are matching, because t is a character which is not a backslash.
I think you need to use some sort of negative lookbehind, to capture a , which is not preceded by a \ without capturing the preceding character, something like
(?<!\\),
(BTW, note that I have purposefully not doubly-escaped the backslashes to make this more readable)

Try:
String array[] = str.split("(?<!\\\\),");
Basically this is saying split on a comma, except where that comma is preceded by two backslashes. This is called a negative lookbehind zero-width assertion.

For future reference, here is the complete method i ended up with:
public static String[] commaDelimitedListToStringArray(String str, String escapeChar) {
// these characters need to be escaped in a regular expression
String regularExpressionSpecialChars = "/.*+?|()[]{}\\";
String escapedEscapeChar = escapeChar;
// if the escape char for our comma separated list needs to be escaped
// for the regular expression, escape it using the \ char
if(regularExpressionSpecialChars.indexOf(escapeChar) != -1)
escapedEscapeChar = "\\" + escapeChar;
// see http://stackoverflow.com/questions/820172/how-to-split-a-comma-separated-string-while-ignoring-escaped-commas
String[] temp = str.split("(?<!" + escapedEscapeChar + "),", -1);
// remove the escapeChar for the end result
String[] result = new String[temp.length];
for(int i=0; i<temp.length; i++) {
result[i] = temp[i].replaceAll(escapedEscapeChar + ",", ",");
}
return result;
}

As matt b said, [^\\], will interpret the character preceding the comma as a part of the delimiter.
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\,test\\,tes" , "test"]
As drvdijk said, (?<!\\), will misinterpret escaped backslashes.
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\,test\\,test" , "test"]
-(unescape commas)->
["test\\\\,test\\,test,test" , "test"]
I would expect being able to escape backslashes as well...
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\" , "test\\,test" , "test"]
-(unescape commas and backslashes)->
["test\\,test\\" , "test,test" , "test"]
drvdijk suggested (?<=(?<!\\\\)(\\\\\\\\){0,100}), which works well for lists with elements ending with up to 100 backslashes. This is far enough... but why a limit? Is there a more efficient way (isn't lookbehind greedy)? What about invalid strings?
I searched for a while for a generic solution, then I wrote the thing myself... The idea is to split following a pattern that matches the list elements (instead of matching the delimiter).
My answer does not take the escape character as a parameter.
public static List<String> commaDelimitedListStringToStringList(String list) {
// Check the validity of the list
// ex: "te\\st" is not valid, backslash should be escaped
if (!list.matches("^(([^\\\\,]|\\\\,|\\\\\\\\)*(,|$))+")) {
// Could also raise an exception
return null;
}
// Matcher for the list elements
Matcher matcher = Pattern
.compile("(?<=(^|,))([^\\\\,]|\\\\,|\\\\\\\\)*(?=(,|$))")
.matcher(list);
ArrayList<String> result = new ArrayList<String>();
while (matcher.find()) {
// Unescape the list element
result.add(matcher.group().replaceAll("\\\\([\\\\,])", "$1"));
}
return result;
}
Description for the pattern (unescaped):
(?<=(^|,)) forward is start of string or a ,
([^\\,]|\\,|\\\\)* the element composed of \,, \\ or characters wich are neither \ nor ,
(?=(,|$)) behind is end of string or a ,
The pattern may be simplified.
Even with the 3 parsings (matches + find + replaceAll), this method seems faster than the one suggested by drvdijk. It can still be optimized by writing a specific parser.
Also, what is the need of having an escape character if only one character is special, it could simply be doubled...
public static List<String> commaDelimitedListStringToStringList2(String list) {
if (!list.matches("^(([^,]|,,)*(,|$))+")) {
return null;
}
Matcher matcher = Pattern.compile("(?<=(^|,))([^,]|,,)*(?=(,|$))")
.matcher(list);
ArrayList<String> result = new ArrayList<String>();
while (matcher.find()) {
result.add(matcher.group().replaceAll(",,", ","));
}
return result;
}

split(/(?<!\\),/g) worked for me, but the accepted answer did not
> var x = "test,test\,test\,test,test"
undefined
> x.split(/(?<!\\),/g)
[ 'test', 'test\\,test\\,test', 'test' ]
> x.split("(?<!\\\\),")
[ 'test,test\\,test\\,test,test' ]

It's probably not "super fancy" solution, but possibly more time-efficient one. Escaping an escape character is also supported and it's working in browsers not supporting 'lookbehinds'.
function splitByDelimiterIfItIsNotEscaped (text, delimiter, escapeCharacter) {
const splittedText = []
let numberOfDelimitersBeforeOtherCharacter = 0
let nextSplittedTextPartIndex = 0
for (let characterIndex = 0, character = text[0]; characterIndex < text.length; characterIndex++, character = text[characterIndex]) {
if (character === escapeCharacter) {
numberOfDelimitersBeforeOtherCharacter++
} else if (character === delimiter && (!numberOfDelimitersBeforeOtherCharacter || !(numberOfDelimitersBeforeOtherCharacter % 2))) {
splittedText.push(text.substring(nextSplittedTextPartIndex, characterIndex))
nextSplittedTextPartIndex = characterIndex + 1
} else {
numberOfDelimitersBeforeOtherCharacter = 0
}
}
if (nextSplittedTextPartIndex <= text.length) {
splittedText.push(text.substring(nextSplittedTextPartIndex, text.length))
}
return splittedText
}
function onChange () {
console.log(splitByDelimiterIfItIsNotEscaped(inputBox.value, ',', '\\'))
}
addEventListener('change', onChange)
onChange()
After making a change unfocus the input box (use tab for example).
<input id="inputBox" value="test,test\,test\,test,test"/>

Related

How to split a string and save the 2 characters that I split with?

I am trying to split a given string using the java split method while the string should be devided by two different characters (+ and -) and I am willing to save the characters inside the array aswell in the same index the string has been saven.
for example :
input : String s = "4x^2+3x-2"
output :
arr[0] = 4x^2
arr[1] = +3x
arr[2] = -2
I know how to get the + or - characters in a different index between the numbers but it is not helping me,
any suggestions please?

You can face this problem in many ways. I´m sure there are clever and fancy ways to split this expression. I will show you the simplest problem-solving process that can help you.
State the problem you need to solve, the input and output
Problem: Split a math expression into subexpressions at + and - signals
Input: 4x^2+3x-2
Output: 4x^2,+3x,-2
Create a pseudo code with some logic you might think works
Given an expression string
Create an empty list of expressions
Create a subExpression string
For each character in the expression
Check if the character is + ou - then
add the subExpression in the list and create a new empty subexpression
otherwise, append the character in the subExpression
In the end, add the left subexpression in the list
Implement the pseudo-code in the programming language of your choice
String expression = "4x^2+3x-2";
List<String> expressions = new ArrayList();
StringBuilder subExpression = new StringBuilder();
for (int i = 0; i < expression.length(); i++) {
char character = expression.charAt(i);
if (character == '-' || character == '+') {
expressions.add(subExpression.toString());
subExpression = new StringBuilder(String.valueOf(character));
} else {
subExpression.append(String.valueOf(character));
}
}
expressions.add(subExpression.toString());
System.out.println(expressions);
Output
[4x^2, +3x, -2]
You will end with one algorithm that works for your problem. You can start to improve it.

Try this code:
String s = "4x^2+3x-2";
s = s.replace("+", "#+");
s = s.replace("-", "#-");
String[] ss = s.split("#");
for (int i = 0; i < ss.length; i++) {
Log.e("XOP",ss[i]);
}
This code replaces + and - with #+ and #- respectively and then splits the string with #. That way the + and - operators are not lost in the result.
If you require # as input character then you can use any other Unicode character instead of #.

Try this one:
String s = "4x^2+3x-2";
String[] arr = s.split("[\\+-]");
for(int i=0;i<arr.length;i++){
System.out.println(arr[i]);
}

Personally I like it better to have positive matches of patterns, especially if the split pattern itself is empty.
So for instance you could use a Pattern and Matcher like this:
Pattern p = Pattern.compile("(^|[+-])([^+-]*)");
Matcher m = p.matcher("4x^2+3x-2");
while (m.find()) {
System.out.printf("%s or %s %s%n", m.group(), m.group(1), m.group(2));
}
This matches the start of the string or a plus or minus: ^|[+-], followed by any amount of characters that are not a plus or minus: [^+-]*.
Do note that the ^ first matches the start of the string, and is then used to negate a character class when used between brackets. Regular expressions are tricky like that.
Bonus: you can also use the two groups (within the parenthesis in the pattern) to match the operators - if any.
All this is presuming that you want to use/test regular expressions; generally things like this require a parser rather than a regular expression.
A one-liner for persons thinking that this is too complex:
var expressions = Pattern.compile("^|[+-][^+-]*")
.matcher("4x^2+3x-2")
.results()
.map(r -> r.group())
.collect(Collectors.toList());

Java regex: Replace all characters with `+` except instances of a given string

I have the following problem which states
Replace all characters in a string with + symbol except instances of the given string in the method
so for example if the string given was abc123efg and they want me to replace every character except every instance of 123 then it would become +++123+++.
I figured a regular expression is probably the best for this and I came up with this.
str.replaceAll("[^str]","+")
where str is a variable, but its not letting me use the method without putting it in quotations. If I just want to replace the variable string str how can I do that? I ran it with the string manually typed and it worked on the method, but can I just input a variable?
as of right now I believe its looking for the string "str" and not the variable string.
Here is the output its right for so many cases except for two :(
List of open test cases:
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
plusOut("abXYabcXYZ", "ab") → "ab++ab++++"
plusOut("abXYabcXYZ", "abc") → "++++abc+++"
plusOut("abXYabcXYZ", "XY") → "++XY+++XY+"
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
plusOut("--++ab", "++") → "++++++"
plusOut("aaxxxxbb", "xx") → "++xxxx++"
plusOut("123123", "3") → "++3++3"

Looks like this is the plusOut problem on CodingBat.
I had 3 solutions to this problem, and wrote a new streaming solution just for fun.
Solution 1: Loop and check
Create a StringBuilder out of the input string, and check for the word at every position. Replace the character if doesn't match, and skip the length of the word if found.
public String plusOut(String str, String word) {
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
if (!str.startsWith(word, i))
out.setCharAt(i++, '+');
else
i += word.length();
}
return out.toString();
}
This is probably the expected answer for a beginner programmer, though there is an assumption that the string doesn't contain any astral plane character, which would be represented by 2 char instead of 1.
Solution 2: Replace the word with a marker, replace the rest, then restore the word
public String plusOut(String str, String word) {
return str.replaceAll(java.util.regex.Pattern.quote(word), "#").replaceAll("[^#]", "+").replaceAll("#", word);
}
Not a proper solution since it assumes that a certain character or sequence of character doesn't appear in the string.
Note the use of Pattern.quote to prevent the word being interpreted as regex syntax by replaceAll method.
Solution 3: Regex with \G
public String plusOut(String str, String word) {
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
}
Construct regex \G((?:word)*+)., which does more or less what solution 1 is doing:
\G makes sure the match starts from where the previous match leaves off
((?:word)*+) picks out 0 or more instance of word - if any, so that we can keep them in the replacement with $1. The key here is the possessive quantifier *+, which forces the regex to keep any instance of the word it finds. Otherwise, the regex will not work correctly when the word appear at the end of the string, as the regex backtracks to match .
. will not be part of any word, since the previous part already picks out all consecutive appearances of word and disallow backtrack. We will replace this with +
Solution 4: Streaming
public String plusOut(String str, String word) {
return String.join(word,
Arrays.stream(str.split(java.util.regex.Pattern.quote(word), -1))
.map((String s) -> s.replaceAll("(?s:.)", "+"))
.collect(Collectors.toList()));
}
The idea is to split the string by word, do the replacement on the rest, and join them back with word using String.join method.
Same as above, we need Pattern.quote to avoid split interpreting the word as regex. Since split by default removes empty string at the end of the array, we need to use -1 in the second parameter to make split leave those empty strings alone.
Then we create a stream out of the array and replace the rest as strings of +. In Java 11, we can use s -> String.repeat(s.length()) instead.
The rest is just converting the Stream to an Iterable (List in this case) and joining them for the result

This is a bit trickier than you might initially think because you don't just need to match characters, but the absence of specific phrase - a negated character set is not enough. If the string is 123, you would need:
(?<=^|123)(?!123).*?(?=123|$)
https://regex101.com/r/EZWMqM/1/
That is - lookbehind for the start of the string or "123", make sure the current position is not followed by 123, then lazy-repeat any character until lookahead matches "123" or the end of the string. This will match all characters which are not in a "123" substring. Then, you need to replace each character with a +, after which you can use appendReplacement and a StringBuffer to create the result string:
String inputPhrase = "123";
String inputStr = "abc123efg123123hij";
StringBuffer resultString = new StringBuffer();
Pattern regex = Pattern.compile("(?<=^|" + inputPhrase + ")(?!" + inputPhrase + ").*?(?=" + inputPhrase + "|$)");
Matcher m = regex.matcher(inputStr);
while (m.find()) {
String replacement = m.group(0).replaceAll(".", "+");
m.appendReplacement(resultString, replacement);
}
m.appendTail(resultString);
System.out.println(resultString.toString());
Output:
+++123+++123123+++
Note that if the inputPhrase can contain character with a special meaning in a regular expression, you'll have to escape them first before concatenating into the pattern.

You can do it in one line:
input = input.replaceAll("((?:" + str + ")+)?(?!" + str + ").((?:" + str + ")+)?", "$1+$2");
This optionally captures "123" either side of each character and puts them back (a blank if there's no "123"):

So instead of coming up with a regular expression that matches the absence of a string. We might as well just match the selected phrase and append + the number of skipped characters.
StringBuilder sb = new StringBuilder();
Matcher m = Pattern.compile(Pattern.quote(str)).matcher(input);
while (m.find()) {
for (int i = 0; i < m.start(); i++) sb.append('+');
sb.append(str);
}
int remaining = input.length() - sb.length();
for (int i = 0; i < remaining; i++) {
sb.append('+');
}

Absolutely just for the fun of it, a solution using CharBuffer (unexpectedly it took a lot more that I initially hoped for):
private static String plusOutCharBuffer(String input, String match) {
int size = match.length();
CharBuffer cb = CharBuffer.wrap(input.toCharArray());
CharBuffer word = CharBuffer.wrap(match);
int x = 0;
for (; cb.remaining() > 0;) {
if (!cb.subSequence(0, size < cb.remaining() ? size : cb.remaining()).equals(word)) {
cb.put(x, '+');
cb.clear().position(++x);
} else {
cb.clear().position(x = x + size);
}
}
return cb.clear().toString();
}

To make this work you need a beast of a pattern. Let's say you you are operating on the following test case as an example:
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
What you need to do is build a series of clauses in your pattern to match a single character at a time:
Any character that is NOT "X", "Y" or "Z" -- [^XYZ]
Any "X" not followed by "YZ" -- X(?!YZ)
Any "Y" not preceded by "X" -- (?<!X)Y
Any "Y" not followed by "Z" -- Y(?!Z)
Any "Z" not preceded by "XY" -- (?<!XY)Z
An example of this replacement can be found here: https://regex101.com/r/jK5wU3/4
Here is an example of how this might work (most certainly not optimized, but it works):
import java.util.regex.Pattern;
public class Test {
public static void plusOut(String text, String exclude) {
StringBuilder pattern = new StringBuilder("");
for (int i=0; i<exclude.length(); i++) {
Character target = exclude.charAt(i);
String prefix = (i > 0) ? exclude.substring(0, i) : "";
String postfix = (i < exclude.length() - 1) ? exclude.substring(i+1) : "";
// add the look-behind (?<!X)Y
if (!prefix.isEmpty()) {
pattern.append("(?<!").append(Pattern.quote(prefix)).append(")")
.append(Pattern.quote(target.toString())).append("|");
}
// add the look-ahead X(?!YZ)
if (!postfix.isEmpty()) {
pattern.append(Pattern.quote(target.toString()))
.append("(?!").append(Pattern.quote(postfix)).append(")|");
}
}
// add in the other character exclusion
pattern.append("[^" + Pattern.quote(exclude) + "]");
System.out.println(text.replaceAll(pattern.toString(), "+"));
}
public static void main(String [] args) {
plusOut("12xy34", "xy");
plusOut("12xy34", "1");
plusOut("12xy34xyabcxy", "xy");
plusOut("abXYabcXYZ", "ab");
plusOut("abXYabcXYZ", "abc");
plusOut("abXYabcXYZ", "XY");
plusOut("abXYxyzXYZ", "XYZ");
plusOut("--++ab", "++");
plusOut("aaxxxxbb", "xx");
plusOut("123123", "3");
}
}
UPDATE: Even this doesn't quite work because it can't deal with exclusions that are just repeated characters, like "xx". Regular expressions are most definitely not the right tool for this, but I thought it might be possible. After poking around, I'm not so sure a pattern even exists that might make this work.

The problem in your solution that you put a set of instance string str.replaceAll("[^str]","+") which it will exclude any character from the variable str and that will not solve your problem
EX: when you try str.replaceAll("[^XYZ]","+") it will exclude any combination of character X , character Y and character Z from your replacing method so you will get "++XY+++XYZ".
Actually you should exclude a sequence of characters instead in str.replaceAll.
You can do it by using capture group of characters like (XYZ) then use a negative lookahead to match a string which does not contain characters sequence : ^((?!XYZ).)*$
Check this solution for more info about this problem but you should know that it may be complicated to find regular expression to do that directly.
I have found two simple solutions for this problem :
Solution 1:
You can implement a method to replace all characters with '+' except the instance of given string:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
for(int i = 0; i < str.length(); i++){
// exclude any instance string of exWord from replacing process in str
if(str.substring(i, str.length()).indexOf(exWord) + i == i){
i = i + exWord.length()-1;
}
else{
str = str.substring(0,i) + "+" + str.substring(i+1);//replace each character with '+' symbol
}
}
Note : str.substring(i, str.length()).indexOf(exWord) + i this if statement will exclude any instance string of exWord from replacing process in str.
Output:
+++++++XYZ
Solution 2:
You can try this Approach using ReplaceAll method and it doesn't need any complex regular expression:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
str = str.replaceAll(exWord,"*"); // replace instance string with * symbol
str = str.replaceAll("[^*]","+"); // replace all characters with + symbol except *
str = str.replaceAll("\\*",exWord); // replace * symbol with instance string
Note : This solution will work only if your input string str doesn't contain any * symbol.
Also you should escape any character with a special meaning in a regular expression in phrase instance string exWord like : exWord = "++".

Splitting a nested string keeping quotation marks

I am working on a project in Java that requires having nested strings.
For an input string that in plain text looks like this:
This is "a string" and this is "a \"nested\" string"
The result must be the following:
[0] This
[1] is
[2] "a string"
[3] and
[4] this
[5] is
[6] "a \"nested\" string"
Note that I want the \" sequences to be kept.
I have the following method:
public static String[] splitKeepingQuotationMarks(String s);
and I need to create an array of strings out of the given s parameter by the given rules, without using the Java Collection Framework or its derivatives.
I am unsure about how to solve this problem.
Can a regex expression be made that would get this solved?
UPDATE based on questions from comments:
each unescaped " has its closing unescaped " (they are balanced)
each escaping character \ also must be escaped if we want to create literal representing it (to create text representing \ we need to write it as \\).

You can use the following regex:
"[^"\\]*(?:\\.[^"\\]*)*"|\S+
See the regex demo
Java demo:
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
Explanation:
"[^"\\]*(?:\\.[^"\\]*)*" - a double quote that is followed with any 0+ characters other than a " and \ ([^"\\]) followed with 0+ sequences of any escaped sequence (\\.) followed with any 0+ characters other than a " and \
| - or...
\S+ - 1 or more non-whitespace characters
NOTE
#Pshemo's suggestion - "\"(?:\\\\.|[^\"])*\"|\\S+" (or "\"(?:\\\\.|[^\"\\\\])*\"|\\S+" would be more correct) - is the same expression, but much less efficient since it is using an alternation group quantified with *. This construct involves much more backtracking as the regex engine has to test each position, and there are 2 probabilities for each position. My unroll-the-loop based version will match chunks of text at once, and is thus much faster and reliable.
UPDATE
Since String[] type is required as output, you need to do it in 2 steps: count the matches, create the array, and then re-run the matcher again:
int cnt = 0;
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
cnt++;
}
System.out.println(cnt);
String[] result = new String[cnt];
matcher.reset();
int idx = 0;
while (matcher.find()) {
result[idx] = matcher.group(0);
idx++;
}
System.out.println(Arrays.toString(result));
See another IDEONE demo

Another regex approach that works uses a negative lookbehind: "words" (\w+) OR "quote followed by anything up to the next quote that ISN'T preceded by a backslash", and set your match to "global" (don't return on first match)
(\w+|".*?(?<!\\)")
see it here.

An alternative method that does not use a regex:
import java.util.ArrayList;
import java.util.Arrays;
public class SplitKeepingQuotationMarks {
public static void main(String[] args) {
String pattern = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
System.out.println(Arrays.toString(splitKeepingQuotationMarks(pattern)));
}
public static String[] splitKeepingQuotationMarks(String s) {
ArrayList<String> results = new ArrayList<>();
StringBuilder last = new StringBuilder();
boolean inString = false;
boolean wasBackSlash = false;
for (char c : s.toCharArray()) {
if (Character.isSpaceChar(c) && !inString) {
if (last.length() > 0) {
results.add(last.toString());
last.setLength(0); // Clears the s.b.
}
} else if (c == '"') {
last.append(c);
if (!wasBackSlash)
inString = !inString;
} else if (c == '\\') {
wasBackSlash = true;
last.append(c);
} else
last.append(c);
}
results.add(last.toString());
return results.toArray(new String[results.size()]);
}
}
Output:
[This, is, "a string", and, this, is, "a \"nested\" string"]

Splitting on comma outside quotes

My program reads a line from a file. This line contains comma-separated text like:
123,test,444,"don't split, this",more test,1
I would like the result of a split to be this:
123
test
444
"don't split, this"
more test
1
If I use the String.split(","), I would get this:
123
test
444
"don't split
this"
more test
1
In other words: The comma in the substring "don't split, this" is not a separator. How to deal with this?

You can try out this regex:
str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
This splits the string on , that is followed by an even number of double quotes. In other words, it splits on comma outside the double quotes. This will work provided you have balanced quotes in your string.
Explanation:
, // Split on comma
(?= // Followed by
(?: // Start a non-capture group
[^"]* // 0 or more non-quote characters
" // 1 quote
[^"]* // 0 or more non-quote characters
" // 1 quote
)* // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
[^"]* // Finally 0 or more non-quotes
$ // Till the end (This is necessary, else every comma will satisfy the condition)
)
You can even type like this in your code, using (?x) modifier with your regex. The modifier ignores any whitespaces in your regex, so it's becomes more easy to read a regex broken into multiple lines like so:
String[] arr = str.split("(?x) " +
", " + // Split on comma
"(?= " + // Followed by
" (?: " + // Start a non-capture group
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" )* " + // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
" [^\"]* " + // Finally 0 or more non-quotes
" $ " + // Till the end (This is necessary, else every comma will satisfy the condition)
") " // End look-ahead
);

Why Split when you can Match?
Resurrecting this question because for some reason, the easy solution wasn't mentioned. Here is our beautifully compact regex:
"[^"]*"|[^,]+
This will match all the desired fragments (see demo).
Explanation
With "[^"]*", we match complete "double-quoted strings"
or |
we match [^,]+ any characters that are not a comma.
A possible refinement is to improve the string side of the alternation to allow the quoted strings to include escaped quotes.

Building upon #zx81's answer, cause matching idea is really nice, I've added Java 9 results call, which returns a Stream. Since OP wanted to use split, I've collected to String[], as split does.
Caution if you have spaces after your comma-separators (a, b, "c,d"). Then you need to change the pattern.
Jshell demo
$ jshell
-> String so = "123,test,444,\"don't split, this\",more test,1";
| Added variable so of type String with initial value "123,test,444,"don't split, this",more test,1"
-> Pattern.compile("\"[^\"]*\"|[^,]+").matcher(so).results();
| Expression value is: java.util.stream.ReferencePipeline$Head#2038ae61
| assigned to temporary variable $68 of type java.util.stream.Stream<MatchResult>
-> $68.map(MatchResult::group).toArray(String[]::new);
| Expression value is: [Ljava.lang.String;#6b09bb57
| assigned to temporary variable $69 of type String[]
-> Arrays.stream($69).forEach(System.out::println);
123
test
444
"don't split, this"
more test
1
Code
String so = "123,test,444,\"don't split, this\",more test,1";
Pattern.compile("\"[^\"]*\"|[^,]+")
.matcher(so)
.results()
.map(MatchResult::group)
.toArray(String[]::new);
Explanation
Regex [^"] matches: a quote, anything but a quote, a quote.
Regex [^"]* matches: a quote, anything but a quote 0 (or more) times , a quote.
That regex needs to go first to "win", otherwise matching anything but a comma 1 or more times - that is: [^,]+ - would "win".
results() requires Java 9 or higher.
It returns Stream<MatchResult>, which I map using group() call and collect to array of Strings. Parameterless toArray() call would return Object[].

You can do this very easily without complex regular expression:
Split on the character ". You get a list of Strings
Process each string in the list: Split every string that is on an even position in the List (starting indexing with zero) on "," (you get a list inside a list), leave every odd positioned string alone (directly putting it in a list inside the list).
Join the list of lists, so you get only a list.
If you want to handle quoting of '"', you have to adapt the algorithm a little bit (joining some parts, you have incorrectly split of, or changing splitting to simple regexp), but the basic structure stays.
So basically it is something like this:
public class SplitTest {
public static void main(String[] args) {
final String splitMe="123,test,444,\"don't split, this\",more test,1";
final String[] splitByQuote=splitMe.split("\"");
final String[][] splitByComma=new String[splitByQuote.length][];
for(int i=0;i<splitByQuote.length;i++) {
String part=splitByQuote[i];
if (i % 2 == 0){
splitByComma[i]=part.split(",");
}else{
splitByComma[i]=new String[1];
splitByComma[i][0]=part;
}
}
for (String parts[] : splitByComma) {
for (String part : parts) {
System.out.println(part);
}
}
}
}
This will be much cleaner with lambdas, promised!

Please see the below code snippet. This code only considers happy flow. Change the according to your requirement
public static String[] splitWithEscape(final String str, char split,
char escapeCharacter) {
final List<String> list = new LinkedList<String>();
char[] cArr = str.toCharArray();
boolean isEscape = false;
StringBuilder sb = new StringBuilder();
for (char c : cArr) {
if (isEscape && c != escapeCharacter) {
sb.append(c);
} else if (c != split && c != escapeCharacter) {
sb.append(c);
} else if (c == escapeCharacter) {
if (!isEscape) {
isEscape = true;
if (sb.length() > 0) {
list.add(sb.toString());
sb = new StringBuilder();
}
} else {
isEscape = false;
}
} else if (c == split) {
list.add(sb.toString());
sb = new StringBuilder();
}
}
if (sb.length() > 0) {
list.add(sb.toString());
}
String[] strArr = new String[list.size()];
return list.toArray(strArr);
}

Problems with building this regex [1,2,3]

i have a problem to build following regex:
[1,2,3,4]
i found a work-around, but i think its ugly
String stringIds = "[1,2,3,4]";
stringIds = stringIds.replaceAll("\\[", "");
stringIds = stringIds.replaceAll("\\]", "");
String[] ids = stringIds.split("\\,");
Can someone help me please to build one regex, which i can use in the split function
Thanks for help
edit:
i want to get from this string "[1,2,3,4]" to an array with 4 entries. the entries are the 4 numbers in the string, so i need to eliminate "[","]" and ",". the "," isn't the problem.
the first and last number contains [ or ]. so i needed the fix with replaceAll. But i think if i use in split a regex for ",", i also can pass a regex which eliminates "[" "]" too. But i cant figure out, who this regex should look like.

This is almost what you're looking for:
String q = "[1,2,3,4]";
String[] x = q.split("\\[|\\]|,");
The problem is that it produces an extra element at the beginning of the array due to the leading open bracket. You may not be able to do what you want with a single regex sans shenanigans. If you know the string always begins with an open bracket, you can remove it first.
The regex itself means "(split on) any open bracket, OR any closed bracket, OR any comma."
Punctuation characters frequently have additional meanings in regular expressions. The double leading backslashes... ugh, the first backslash tells the Java String parser that the next backslash is not a special character (example: \n is a newline...) so \\ means "I want an honest to God backslash". The next backslash tells the regexp engine that the next character ([ for example) is not a special regexp character. That makes me lol.

Maybe substring [ and ] from beginning and end, then split the rest by ,
String stringIds = "[1,2,3,4]";
String[] ids = stringIds.substring(1,stringIds.length()-1).split(",");

Looks to me like you're trying to make an array (not sure where you got 'regex' from; that means something different). In this case, you want:
String[] ids = {"1","2","3","4"};
If it's specifically an array of integer numbers you want, then instead use:
int[] ids = {1,2,3,4};

Your problem is not amenable to splitting by delimiter. It is much safer and more general to split by matching the integers themselves:
static String[] nums(String in) {
final Matcher m = Pattern.compile("\\d+").matcher(in);
final List<String> l = new ArrayList<String>();
while (m.find()) l.add(m.group());
return l.toArray(new String[l.size()]);
}
public static void main(String args[]) {
System.out.println(Arrays.toString(nums("[1, 2, 3, 4]")));
}

If the first line your code is following:
String stringIds = "[1,2,3,4]";
and you're trying to iterate over all number items, then the follwing code-frag only could work:
try {
Pattern regex = Pattern.compile("\\b(\\d+)\\b", Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
for (int i = 1; i <= regexMatcher.groupCount(); i++) {
// matched text: regexMatcher.group(i)
// match start: regexMatcher.start(i)
// match end: regexMatcher.end(i)
}
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to split a comma separated String while ignoring escaped commas? - java

Try: String array[] = str.split("(?<!\\\\),"); Basically this is saying split on a comma, except where that comma is preceded by two backslashes. This is called a negative lookbehind zero-width assertion.

split(/(?<!\\),/g) worked for me, but the accepted answer did not > var x = "test,test\,test\,test,test" undefined > x.split(/(?<!\\),/g) [ 'test', 'test\\,test\\,test', 'test' ] > x.split("(?<!\\\\),") [ 'test,test\\,test\\,test,test' ]

Related

How to split a string and save the 2 characters that I split with?

Java regex: Replace all characters with `+` except instances of a given string

Splitting a nested string keeping quotation marks

Splitting on comma outside quotes

Problems with building this regex [1,2,3]

Categories

Resources