Am I reinventing the wheel with this token replacing code?

Am I reinventing the wheel with this token replacing code? - java

I have use case where I have a line of text containing nesting tokens (like { and }), and I wish to transform certain substrings nested at specific depths.
Example, capitalize the word moo at depth 1:
moo [moo [moo moo]] moo ->
moo [MOO [moo moo]] moo
Achieved by:
replaceTokens(input, 1, "[", "]", "moo", String::toUpperCase);
Or real world example, supply "--options" not already colored with the color sequence cyan:
#|blue --ignoreLog|# works, but --ignoreOutput silences everything. ->
#|blue --ignoreLog|# works, but #|cyan --ignoreOutput|# silences everything.
Achieved by:
replaceTokens(input, 0, "#|", "|#", "--\\w*", s -> format("#|cyan %s|#", s));
I have implemented this logic and though I feel pretty good about it (except performance probably), I also feel I reinvented the wheel. Here's how I implemented it:
set currentPos to zero
while (input line not fully consumed) {
take the remaining line
if the open token is matched, add to output, increase counter and advance pos accordingly
else if the close token is matched, add to output, decrease counter and advance pos accordingly
else if the counter matches provided depth and given regex matches, invoke replacer function and advance pos accordingly
else just record the next character and advance pos by 1
}
Here's the actual implementation:
public static String replaceNestedTokens(String lineWithTokens, int nestingDepth, String tokenOpen, String tokenClose, String tokenRegexToReplace, Function<String, String> tokenReplacer) {
final Pattern startsWithOpen = compile(quote(tokenOpen));
final Pattern startsWithClose = compile(quote(tokenClose));
final Pattern startsWithTokenToReplace = compile(format("(?<token>%s)", tokenRegexToReplace));
final StringBuilder lineWithTokensReplaced = new StringBuilder();
int countOpenTokens = 0;
int pos = 0;
while (pos < lineWithTokens.length()) {
final String remainingLine = lineWithTokens.substring(pos);
if (startsWithOpen.matcher(remainingLine).lookingAt()) {
countOpenTokens++;
lineWithTokensReplaced.append(tokenOpen);
pos += tokenOpen.length();
} else if (startsWithClose.matcher(remainingLine).lookingAt()) {
countOpenTokens--;
lineWithTokensReplaced.append(tokenClose);
pos += tokenClose.length();
} else if (countOpenTokens == nestingDepth) {
Matcher startsWithTokenMatcher = startsWithTokenToReplace.matcher(remainingLine);
if (startsWithTokenMatcher.lookingAt()) {
String matchedToken = startsWithTokenMatcher.group("token");
lineWithTokensReplaced.append(tokenReplacer.apply(matchedToken));
pos += matchedToken.length();
} else {
lineWithTokensReplaced.append(lineWithTokens.charAt(pos++));
}
} else {
lineWithTokensReplaced.append(lineWithTokens.charAt(pos++));
}
assumeTrue(countOpenTokens >= 0, "Unbalanced token sets: closed token without open token\n\t" + lineWithTokens);
}
assumeTrue(countOpenTokens == 0, "Unbalanced token sets: open token without closed token\n\t" + lineWithTokens);
return lineWithTokensReplaced.toString();
}
I couldn't make it work with a regex like this or this (or Scanner) solution, but I feel I'm reinventing the wheel and could solve this with (vanilla Java) out-of-the-box classes with less code. Also, I'm pretty sure this is a performance nightmare with all the inline patterns/matcher instances and substrings.
Suggestions?

You could be using a parser like ANTLR to create a grammar to describe your language or syntax. Then use a listener or visitor to make an interpreter of tokens.
A sample of the grammar would be like this (what I can infer from your code):
grammar Expr;
prog: (expr NEWLINE)* ;
expr: id '[' expr ']'
| '#|' expr '|#'
| '--ignoreLog' expr
| '--ignoreOutput' expr
| string
;
string: [a-zA-Z0-9];
NEWLINE : [\r\n]+ ;

Related

Split with delimiter and ignore delimiter and quote character which is part of data [duplicate]

I have a string vaguely like this:
foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"
that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? Seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or already part of a commonly-used libraries like Apache Commons.)
the above string should split into:
foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"
note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
EDIT
As mentioned by #MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split(), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
else if (input.charAt(current) == ',' && !inQuotes) {
result.add(input.substring(start, current));
start = current + 1;
}
}
result.add(input.substring(start));
If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
char currentChar = builder.charAt(currentIndex);
if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
if (currentChar == ',' && inQuotes) {
builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
}
}
List<String> result = Arrays.asList(builder.toString().split(","));

http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded
(fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?

I would not advise a regex answer from Bart, I find parsing solution better in this particular case (as Fabian proposed). I've tried regex solution and own parsing implementation I have found that:
Parsing is much faster than splitting with regex with backreferences - ~20 times faster for short strings, ~40 times faster for long strings.
Regex fails to find empty string after last comma. That was not in original question though, it was mine requirement.
My solution and test below.
String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;
start = System.nanoTime();
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
switch (c) {
case ',':
if (inQuotes) {
b.append(c);
} else {
tokensList.add(b.toString());
b = new StringBuilder();
}
break;
case '\"':
inQuotes = !inQuotes;
default:
b.append(c);
break;
}
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;
System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);
Of course you are free to change switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note then lack of break after switch with separator. StringBuilder was chosen instead to StringBuffer by design to increase speed, where thread safety is irrelevant.

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard) , and yet a full-blown parser seems like overkill.
If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):
final static private Pattern splitSearchPattern = Pattern.compile("[\",]");
private List<String> splitByCommasNotInQuotes(String s) {
if (s == null)
return Collections.emptyList();
List<String> list = new ArrayList<String>();
Matcher m = splitSearchPattern.matcher(s);
int pos = 0;
boolean quoteMode = false;
while (m.find())
{
String sep = m.group();
if ("\"".equals(sep))
{
quoteMode = !quoteMode;
}
else if (!quoteMode && ",".equals(sep))
{
int toPos = m.start();
list.add(s.substring(pos, toPos));
pos = m.end();
}
}
if (pos < s.length())
list.add(s.substring(pos));
return list;
}
(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)

Try a lookaround like (?!\"),(?!\"). This should match , that are not surrounded by ".

The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.
The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:
Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");
The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.
Then, with Java 9, we can get an array as
String[] a = p.matcher(input).results()
.map(m -> m.group(m.start(1)<0? 2: 1))
.toArray(String[]::new);
whereas older Java versions need a loop like
for(Matcher m = p.matcher(input); m.find(); ) {
String token = m.group(m.start(1)<0? 2: 1);
System.out.println("found: "+token);
}
Adding the items to a List or an array is left as an excise to the reader.
For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.
For mixed content with embedded strings, like in the question, you can simply use
Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");
But then, the strings are kept in their quoted form.

what about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is spaces, but the code is the same).
Here is my solution in Kotlin (the language from this particular application), based on the one from Fabian Steeg:
fun parseString(input: String): List<String> {
val result = mutableListOf<String>()
var inQuotes = false
var inEscape = false
val current = StringBuilder()
for (i in input.indices) {
// If this character is escaped, add it without looking
if (inEscape) {
inEscape = false
current.append(input[i])
continue
}
when (val c = input[i]) {
'\\' -> inEscape = true // escape the next character, \ isn't added to result
',' -> if (inQuotes) {
current.append(c)
} else {
result += current.toString()
current.clear()
}
'"' -> inQuotes = !inQuotes
else -> current.append(c)
}
}
if (current.isNotEmpty()) {
result += current.toString()
}
return result
}
I think this is not a place to use regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.
After you split on comma, replace all mapped identifiers with the original string values.

I would do something like this:
boolean foundQuote = false;
if(charAtIndex(currentStringIndex) == '"')
{
foundQuote = true;
}
if(foundQuote == true)
{
//do nothing
}
else
{
string[] split = currentString.split(',');
}

How to split emails which are seperated by comma(,) containing special characters in sender name? [duplicate]

I have a string vaguely like this:
foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"
that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? Seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or already part of a commonly-used libraries like Apache Commons.)
the above string should split into:
foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"
note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
EDIT
As mentioned by #MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split(), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
else if (input.charAt(current) == ',' && !inQuotes) {
result.add(input.substring(start, current));
start = current + 1;
}
}
result.add(input.substring(start));
If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
char currentChar = builder.charAt(currentIndex);
if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
if (currentChar == ',' && inQuotes) {
builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
}
}
List<String> result = Arrays.asList(builder.toString().split(","));

http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded
(fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?

I would not advise a regex answer from Bart, I find parsing solution better in this particular case (as Fabian proposed). I've tried regex solution and own parsing implementation I have found that:
Parsing is much faster than splitting with regex with backreferences - ~20 times faster for short strings, ~40 times faster for long strings.
Regex fails to find empty string after last comma. That was not in original question though, it was mine requirement.
My solution and test below.
String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;
start = System.nanoTime();
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
switch (c) {
case ',':
if (inQuotes) {
b.append(c);
} else {
tokensList.add(b.toString());
b = new StringBuilder();
}
break;
case '\"':
inQuotes = !inQuotes;
default:
b.append(c);
break;
}
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;
System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);
Of course you are free to change switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note then lack of break after switch with separator. StringBuilder was chosen instead to StringBuffer by design to increase speed, where thread safety is irrelevant.

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard) , and yet a full-blown parser seems like overkill.
If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):
final static private Pattern splitSearchPattern = Pattern.compile("[\",]");
private List<String> splitByCommasNotInQuotes(String s) {
if (s == null)
return Collections.emptyList();
List<String> list = new ArrayList<String>();
Matcher m = splitSearchPattern.matcher(s);
int pos = 0;
boolean quoteMode = false;
while (m.find())
{
String sep = m.group();
if ("\"".equals(sep))
{
quoteMode = !quoteMode;
}
else if (!quoteMode && ",".equals(sep))
{
int toPos = m.start();
list.add(s.substring(pos, toPos));
pos = m.end();
}
}
if (pos < s.length())
list.add(s.substring(pos));
return list;
}
(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)

Try a lookaround like (?!\"),(?!\"). This should match , that are not surrounded by ".

The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.
The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:
Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");
The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.
Then, with Java 9, we can get an array as
String[] a = p.matcher(input).results()
.map(m -> m.group(m.start(1)<0? 2: 1))
.toArray(String[]::new);
whereas older Java versions need a loop like
for(Matcher m = p.matcher(input); m.find(); ) {
String token = m.group(m.start(1)<0? 2: 1);
System.out.println("found: "+token);
}
Adding the items to a List or an array is left as an excise to the reader.
For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.
For mixed content with embedded strings, like in the question, you can simply use
Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");
But then, the strings are kept in their quoted form.

what about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is spaces, but the code is the same).
Here is my solution in Kotlin (the language from this particular application), based on the one from Fabian Steeg:
fun parseString(input: String): List<String> {
val result = mutableListOf<String>()
var inQuotes = false
var inEscape = false
val current = StringBuilder()
for (i in input.indices) {
// If this character is escaped, add it without looking
if (inEscape) {
inEscape = false
current.append(input[i])
continue
}
when (val c = input[i]) {
'\\' -> inEscape = true // escape the next character, \ isn't added to result
',' -> if (inQuotes) {
current.append(c)
} else {
result += current.toString()
current.clear()
}
'"' -> inQuotes = !inQuotes
else -> current.append(c)
}
}
if (current.isNotEmpty()) {
result += current.toString()
}
return result
}
I think this is not a place to use regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.
After you split on comma, replace all mapped identifiers with the original string values.

I would do something like this:
boolean foundQuote = false;
if(charAtIndex(currentStringIndex) == '"')
{
foundQuote = true;
}
if(foundQuote == true)
{
//do nothing
}
else
{
string[] split = currentString.split(',');
}

Split a string by commas except when the comma is part of the sentence [duplicate]

I have a string vaguely like this:
foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"
that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? Seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or already part of a commonly-used libraries like Apache Commons.)
the above string should split into:
foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"
note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
EDIT
As mentioned by #MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split(), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
else if (input.charAt(current) == ',' && !inQuotes) {
result.add(input.substring(start, current));
start = current + 1;
}
}
result.add(input.substring(start));
If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
char currentChar = builder.charAt(currentIndex);
if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
if (currentChar == ',' && inQuotes) {
builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
}
}
List<String> result = Arrays.asList(builder.toString().split(","));

http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded
(fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?

I would not advise a regex answer from Bart, I find parsing solution better in this particular case (as Fabian proposed). I've tried regex solution and own parsing implementation I have found that:
Parsing is much faster than splitting with regex with backreferences - ~20 times faster for short strings, ~40 times faster for long strings.
Regex fails to find empty string after last comma. That was not in original question though, it was mine requirement.
My solution and test below.
String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;
start = System.nanoTime();
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
switch (c) {
case ',':
if (inQuotes) {
b.append(c);
} else {
tokensList.add(b.toString());
b = new StringBuilder();
}
break;
case '\"':
inQuotes = !inQuotes;
default:
b.append(c);
break;
}
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;
System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);
Of course you are free to change switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note then lack of break after switch with separator. StringBuilder was chosen instead to StringBuffer by design to increase speed, where thread safety is irrelevant.

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard) , and yet a full-blown parser seems like overkill.
If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):
final static private Pattern splitSearchPattern = Pattern.compile("[\",]");
private List<String> splitByCommasNotInQuotes(String s) {
if (s == null)
return Collections.emptyList();
List<String> list = new ArrayList<String>();
Matcher m = splitSearchPattern.matcher(s);
int pos = 0;
boolean quoteMode = false;
while (m.find())
{
String sep = m.group();
if ("\"".equals(sep))
{
quoteMode = !quoteMode;
}
else if (!quoteMode && ",".equals(sep))
{
int toPos = m.start();
list.add(s.substring(pos, toPos));
pos = m.end();
}
}
if (pos < s.length())
list.add(s.substring(pos));
return list;
}
(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)

Try a lookaround like (?!\"),(?!\"). This should match , that are not surrounded by ".

The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.
The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:
Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");
The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.
Then, with Java 9, we can get an array as
String[] a = p.matcher(input).results()
.map(m -> m.group(m.start(1)<0? 2: 1))
.toArray(String[]::new);
whereas older Java versions need a loop like
for(Matcher m = p.matcher(input); m.find(); ) {
String token = m.group(m.start(1)<0? 2: 1);
System.out.println("found: "+token);
}
Adding the items to a List or an array is left as an excise to the reader.
For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.
For mixed content with embedded strings, like in the question, you can simply use
Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");
But then, the strings are kept in their quoted form.

what about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is spaces, but the code is the same).
Here is my solution in Kotlin (the language from this particular application), based on the one from Fabian Steeg:
fun parseString(input: String): List<String> {
val result = mutableListOf<String>()
var inQuotes = false
var inEscape = false
val current = StringBuilder()
for (i in input.indices) {
// If this character is escaped, add it without looking
if (inEscape) {
inEscape = false
current.append(input[i])
continue
}
when (val c = input[i]) {
'\\' -> inEscape = true // escape the next character, \ isn't added to result
',' -> if (inQuotes) {
current.append(c)
} else {
result += current.toString()
current.clear()
}
'"' -> inQuotes = !inQuotes
else -> current.append(c)
}
}
if (current.isNotEmpty()) {
result += current.toString()
}
return result
}
I think this is not a place to use regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.
After you split on comma, replace all mapped identifiers with the original string values.

I would do something like this:
boolean foundQuote = false;
if(charAtIndex(currentStringIndex) == '"')
{
foundQuote = true;
}
if(foundQuote == true)
{
//do nothing
}
else
{
string[] split = currentString.split(',');
}

Split comma separated string with quotes and commas within quotes and escaped quotes within quotes

I searched even on page 3 at google for this problem, but it seems there is no proper solution.
The following string
"zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\'polo'"
should be splitted by comma in Java. The quotes can be double quotes or single. I tried the following regex
,(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)
but because of the escaped quote at 'marc o\'polo' it fails...
Can somebody help me out?
Code for tryout:
String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc \'opolo'";
Pattern COMMA_PATTERN = Pattern.compile(",(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)");
String[] splits = COMMA_PATTERN.split(checkString);
for (String split : splits) {
System.out.println(split);
}

You can do it like this:
List<String> result = new ArrayList<String>();
Pattern p = Pattern.compile("(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+", Pattern.DOTALL);
Matcher m = p.matcher(checkString);
while(m.find()) {
result.add(m.group());
}

Splitting CSV with regex is not the right solution... which is probably why you are struggling to find one with split/csv/regex search terms.
Using a dedicated library with a state machine is typically the best solution. There are a number of them:
This closed question seems relevant: https://stackoverflow.com/questions/12410538/which-is-the-best-csv-parser-in-java
I have used opencsv in the past, and I beleive the apache csv tool is good too. I am sure there are others. I am specifically not linking any library because you should o your own research on what to use.
I have been involved in a number of commercail projects where the csv parser was custom-built, but I see no reason why that should still be done.
What I can say, is that regex and CSV get very, very complicated relatively quickly (as you have discovered), and that for performance reasons alone, a 'raw' parser is better.

If you are parsing CVS (or something very similar) than using one of the stablished frameworks normally is a good idea as they cover most corner-cases and are tested by a wider audience thorough usage in different projects.
If however libraries are no option you could go with e.g. this:
public class Curios {
public static void main(String[] args) {
String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
List<String> result = splitValues(checkString);
System.out.println(result);
System.out.println(splitValues("zhg\\,wi\\'mö,'astor wohnideen','multistore 2002',\"yo\\\"nza\",'asdf, saflk\\\\','marc o\\'polo',"));
}
public static List<String> splitValues(String checkString) {
List<String> result = new ArrayList<String>();
// Used for reporting errors and detecting quotes
int startOfValue = 0;
// Used to mark the next character as being escaped
boolean charEscaped = false;
// Is the current value quoted?
boolean quoted = false;
// Quote-character in use (only valid when quoted == true)
char quote = '\0';
// All characters read from current value
final StringBuilder currentValue = new StringBuilder();
for (int i = 0; i < checkString.length(); i++) {
final char charAt = checkString.charAt(i);
if (i == startOfValue && !quoted) {
// We have not yet decided if this is a quoted value, but we are right at the beginning of the next value
if (charAt == '\'' || charAt == '"') {
// This will be a quoted String
quote = charAt;
quoted = true;
startOfValue++;
continue;
}
}
if (!charEscaped) {
if (charAt == '\\') {
charEscaped = true;
} else if (quoted && charAt == quote) {
if (i + 1 == checkString.length()) {
// So we don't throw an exception
quoted = false;
// Last value will be added to result outside loop
break;
} else if (checkString.charAt(i + 1) == ',') {
// Ensure we don't parse , again
i++;
// Add the value to the result
result.add(currentValue.toString());
// Prepare for next value
currentValue.setLength(0);
startOfValue = i + 1;
quoted = false;
} else {
throw new IllegalStateException(String.format(
"Value was quoted with %s but prematurely terminated at position %d " +
"maybe a \\ is missing before this %s or a , after? " +
"Value up to this point: \"%s\"",
quote, i, quote, checkString.substring(startOfValue, i + 1)));
}
} else if (!quoted && charAt == ',') {
// Add the value to the result
result.add(currentValue.toString());
// Prepare for next value
currentValue.setLength(0);
startOfValue = i + 1;
} else {
// a boring character
currentValue.append(charAt);
}
} else {
// So we don't forget to reset for next char...
charEscaped = false;
// Here we can do interpolations
switch (charAt) {
case 'n':
currentValue.append('\n');
break;
case 'r':
currentValue.append('\r');
break;
case 't':
currentValue.append('\t');
break;
default:
currentValue.append(charAt);
}
}
}
if(charEscaped) {
throw new IllegalStateException("Input ended with a stray \\");
} else if (quoted) {
throw new IllegalStateException("Last value was quoted with "+quote+" but there is no terminating quote.");
}
// Add the last value to the result
result.add(currentValue.toString());
return result;
}
}
Why not simply a regular expression?
Regular expressions don't understand nesting very well. While certainly the regular expression by Casimir does a good job, differences between quoted and unquoted values are easier to model in some form of a state-machine. You see how difficult it was to ensure you don't accidentally match an ecaped or quoted ,. Also while you are allready evaluating every character it is easy to interpret escape-sequences like \n
What to watch out for?
My function was not written for white-space arround values (this can be changed)
My function will interpret the escape-sequences \n, \r, \t, \\ like most C-style language interpreters while reading \x as x (this can easily be changed)
My function accepts quotes and escapes inside unquoted values (this can easily be changed)
I did only a few tests and tried my best to exhibit a good memory-management and timing, but you will need to see if it fits your needs.

Searching for Variable Scope { } in Text

I need to identify {scope} in text, such as source code.
I'm starting with just a single line and will expand to search multiple lines, and exclude comments. I already have working code using Pattern Matcher, but I would like critiquing on how to improve such a search.
String line = "{{outside{inside}{inside2}}};";
String scopeOf = "outside";
findscope(line,scopeOf);
private static void findscope(String line,
String scopeOf) {
int layer = 1;
Pattern p = Pattern.compile(scopeOf);
Matcher m = p.matcher(line);
if (m.find()) {
int scopestart = m.start();
int scopeEnd = Integer.MIN_VALUE;
m.usePattern(Pattern.compile("\\{|\\}"));
while (m.find()) {
String group = m.group();
if (group.equals("{")) {
layer++;
} else if (group.equals("}")) {
layer--;
}
if (layer == 0) {
scopeEnd = m.start();
break;
}
}
System.out.println("Scope of " + scopeOf + " starts at " + scopestart +
" finishes at " + scopeEnd);
}
}

Well, you are using the wrong tool for the job (assuming you are also looking for nested scopes)
Note that regex (in the traditional form of regex) stands for Regular Expression - which is a way to describe a Regular Language.
However, the language L = { all words with legal scopings } is irregular - and thus cannot be identified by regex.
This langauge is actually Conext Free Langauge, and can be represented by a Context Free Grammer.
For parsing:
For relatively simple langauges (scoping is among them) - a deterministic push-down automaton is enough to verify them.
Some languages require non deterministic push down automaton - which is not very efficiently created, but there is a dynamic programming algorithm to parse them as well.
As a side note, there are some tools such as JavaCC that you can use to parse (and generate code/output) - have a look on them, but if you are simply looking for the scoping issue - it is probably an overkill.
Edit - pseudo code:
curr <- 0
count <- 0 //integer imitates the stack for this simple usage
l <- string.length()
while (curr < l):
if string.charAt(curr) == '{':
count++;
else if string.charAt(curr) == '}':
if curr <= 0:
return ERROR;
count--;
curr++;
if count != 0:
return ERROR;
return SUCCESS;
Note that in here we can use an integer to imitate the stack, in here an increase is basically a push() and a decrease is a pop().

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Am I reinventing the wheel with this token replacing code? - java

Related

Split with delimiter and ignore delimiter and quote character which is part of data [duplicate]

How to split emails which are seperated by comma(,) containing special characters in sender name? [duplicate]

Split a string by commas except when the comma is part of the sentence [duplicate]

Split comma separated string with quotes and commas within quotes and escaped quotes within quotes

Searching for Variable Scope { } in Text

Categories

Resources