How to split a string in java using (,) with certain conditions [duplicate] - java

I would like to find a regex that will pick out all commas that fall outside quote sets.
For example:
'foo' => 'bar',
'foofoo' => 'bar,bar'
This would pick out the single comma on line 1, after 'bar',
I don't really care about single vs double quotes.
Has anyone got any thoughts? I feel like this should be possible with lookaheads, but my regex fu is too weak.

This will match any string up to and including the first non-quoted ",". Is that what you are wanting?
/^([^"]|"[^"]*")*?(,)/
If you want all of them (and as a counter-example to the guy who said it wasn't possible) you could write:
/(,)(?=(?:[^"]|"[^"]*")*$)/
which will match all of them. Thus
'test, a "comma,", bob, ",sam,",here'.gsub(/(,)(?=(?:[^"]|"[^"]*")*$)/,';')
replaces all the commas not inside quotes with semicolons, and produces:
'test; a "comma,"; bob; ",sam,";here'
If you need it to work across line breaks just add the m (multiline) flag.
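Since the page title asks about Java, here is a minimal Java sketch of the same replacement. It assumes well-formed input with balanced double quotes; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnquotedCommas {
    // A comma that is followed, up to the end of the input, only by
    // non-quote characters or complete "..." pairs (so it is outside quotes).
    private static final Pattern UNQUOTED_COMMA =
            Pattern.compile(",(?=(?:[^\"]|\"[^\"]*\")*$)");

    // Replaces every comma outside double quotes with the given replacement text.
    static String replaceUnquotedCommas(String input, String replacement) {
        return UNQUOTED_COMMA.matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }
}
```

Calling UnquotedCommas.replaceUnquotedCommas on the sample string above with ";" as the replacement should give the same result as the gsub call.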

The regexes below match all the commas that are outside the double quotes:
,(?=(?:[^"]*"[^"]*")*[^"]*$)
OR (PCRE only)
"[^"]*"(*SKIP)(*F)|,
"[^"]*" matches a whole double-quoted block. For the input buz,"bar,foo" it would match only "bar,foo". The (*SKIP)(*F) that follows then forces this match to fail and stops the engine from retrying inside the skipped text. The engine moves on to the alternative after the | symbol and tries it against the rest of the string, so the , pattern matches only the comma just after buz. The comma inside the double quotes is never matched, because the quoted part has already been skipped.
The regex below matches all the commas that are inside the double quotes:
,(?!(?:[^"]*"[^"]*")*[^"]*$)
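java.util.regex has no (*SKIP)(*F) verbs, so the PCRE pattern above cannot be used in Java as written. A rough way to get a similar effect is to match either a quoted block or a comma, and keep only the matches that are a bare comma. This is a sketch under that assumption, with hypothetical names, not a drop-in equivalent of the PCRE verbs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SkipQuotedCommas {
    // First alternative consumes a whole "..." block; second matches a comma.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|,");

    // Returns the indexes of all commas that are not inside double quotes.
    static List<Integer> unquotedCommaIndexes(String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Quoted blocks are consumed and ignored; only bare commas are kept.
            if (m.group().equals(",")) {
                indexes.add(m.start());
            }
        }
        return indexes;
    }
}
```

Because each quoted block is consumed in one match, this scans the input once instead of running a lookahead to the end of the string for every comma.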

While it's possible to hack it with a regex (and I enjoy abusing regexes as much as the next guy), you'll get in trouble sooner or later trying to handle substrings without a more advanced parser. Possible ways to get in trouble include mixed quotes, and escaped quotes.
This function will split a string on commas, but not those commas that are within a single- or double-quoted string. It can be easily extended with additional characters to use as quotes (though character pairs like « » would need a few more lines of code) and will even tell you if you forgot to close a quote in your data:
function splitNotStrings(str){
  var parse=[], inString=false, escape=0, end=0
  for(var i=0, c; c=str[i]; i++){ // looping over the characters in str
    if(c==='\\'){ escape^=1; continue} // 1 when odd number of consecutive \
    if(c===','){
      if(!inString){
        parse.push(str.slice(end, i))
        end=i+1
      }
    }
    else if(splitNotStrings.quotes.indexOf(c)>-1 && !escape){
      if(c===inString) inString=false
      else if(!inString) inString=c
    }
    escape=0
  }
  // now we finished parsing, strings should be closed
  if(inString) throw SyntaxError('expected matching '+inString)
  if(end<i) parse.push(str.slice(end, i))
  return parse
}
splitNotStrings.quotes="'\"" // add other (symmetrical) quotes here
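For readers working in Java, a rough port of the same idea might look like the sketch below. It is not a line-for-line translation: as an assumption of this sketch, a backslash escapes whatever character follows it (including a comma), and the final piece is always added even when it is empty. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class QuoteAwareSplitter {
    // Splits on commas that are outside single- or double-quoted strings.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        char open = 0;            // the quote character currently open, or 0 if none
        boolean escaped = false;  // true when the previous character was an unescaped backslash
        int start = 0;            // start index of the piece being collected
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (escaped) { escaped = false; continue; }  // skip the escaped character
            if (c == '\\') { escaped = true; continue; }
            if (open != 0) {
                if (c == open) open = 0;                 // closing quote of the same kind
            } else if (c == '"' || c == '\'') {
                open = c;                                // opening quote
            } else if (c == ',') {
                parts.add(s.substring(start, i));
                start = i + 1;
            }
        }
        if (open != 0) {
            throw new IllegalArgumentException("expected matching " + open);
        }
        parts.add(s.substring(start));
        return parts;
    }
}
```

Tracking which quote character is open is what lets an apostrophe sit inside a double-quoted string without ending it.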

Try this regular expression:
(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*=>\s*(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*,
This does also allow strings like “'foo\'bar' => 'bar\\',”.

MarkusQ's answer worked great for me for about a year, until it didn't. I just got a stack overflow error on a line with about 120 commas and 3682 characters total. In Java, like this:
String[] cells = line.split("[\t,](?=(?:[^\"]|\"[^\"]*\")*$)", -1);
Here's my extremely inelegant replacement that doesn't stack overflow:
private String[] extractCellsFromLine(String line) {
    List<String> cellList = new ArrayList<String>();
    while (true) {
        String[] firstCellAndRest;
        if (line.startsWith("\"")) {
            firstCellAndRest = line.split("([\t,])(?=(?:[^\"]|\"[^\"]*\")*$)", 2);
        }
        else {
            firstCellAndRest = line.split("[\t,]", 2);
        }
        cellList.add(firstCellAndRest[0]);
        if (firstCellAndRest.length == 1) {
            break;
        }
        line = firstCellAndRest[1];
    }
    return cellList.toArray(new String[cellList.size()]);
}
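Another option that may be worth trying is the lookahead from the other answer, (?:[^"]*"[^"]*")*[^"]*$, in which the repeated group consumes a whole quoted pair per iteration instead of a single character. In Java's regex engine a repeated alternation group can recurse once per iteration, so fewer iterations should mean a shallower stack, but this is an assumption that has not been checked against the 3682-character line. A minimal sketch with hypothetical names:

```java
import java.util.regex.Pattern;

final class CellSplitter {
    // Tab or comma followed only by complete "..." pairs and non-quote text up to the end.
    private static final Pattern DELIMITER =
            Pattern.compile("[\t,](?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    // Splits a line into cells; the -1 limit keeps trailing empty cells.
    static String[] extractCells(String line) {
        return DELIMITER.split(line, -1);
    }
}
```

The lookahead still scans to the end of the line for every delimiter, so the running time remains quadratic in the line length.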

@SocialCensus, the example you gave in the comment to MarkusQ, where you throw in ' alongside the ", doesn't work with the example MarkusQ gave right above that if we change sam to sam's: (test, a "comma,", bob, ",sam's,",here) has no match against (,)(?=(?:[^"']|["|'][^"']")$). In fact, the problem itself, "I don't really care about single vs double quotes", is ambiguous. You have to be clear what you mean by quoting either with " or with '. For example, is nesting allowed or not? If so, to how many levels? If only 1 nested level, what happens to a comma outside the inner nested quotation but inside the outer nesting quotation? You should also consider that single quotes happen by themselves as apostrophes (i.e., like the counter-example I gave earlier with sam's). Finally, the regex you made doesn't really treat single quotes on par with double quotes since it assumes the last type of quotation mark is necessarily a double quote -- and replacing that last double quote with ['|"] also has a problem if the text doesn't come with correct quoting (or if apostrophes are used), though I suppose we probably could assume all quotes are correctly delineated.
MarkusQ's regexp answers the question: find all commas that have an even number of double quotes after it (ie, are outside double quotes) and disregard all commas that have an odd number of double quotes after it (ie, are inside double quotes). This is generally the same solution as what you probably want, but let's look at a few anomalies. First, if someone leaves off a quotation mark at the end, then this regexp finds all the wrong commas rather than finding the desired ones or failing to match any. Of course, if a double quote is missing, all bets are off since it might not be clear if the missing one belongs at the end or instead belongs at the beginning; however, there is a case that is legitimate and where the regex could conceivably fail (this is the second "anomaly"). If you adjust the regexp to go across text lines, then you should be aware that quoting multiple consecutive paragraphs requires that you place a single double quote at the beginning of each paragraph and leave out the quote at the end of each paragraph except for at the end of the very last paragraph. This means that over the space of those paragraphs, the regex will fail in some places and succeed in others.
Examples and brief discussions of paragraph quoting and of nested quoting can be found here http://en.wikipedia.org/wiki/Quotation_mark .

Related

Parsing BibTeX record with Java RegEx

I have to write simple BibTeX parser using Java regular expressions. Task is a bit simplified: every tag value is between quotation marks "", not brackets {}. The thing is, {} can be inside "".
I'm trying to cut single records from the entire String file, e.g. I want to get @book{...} as a String. The problem is that there may be no comma after the last tag, so it can end like: author = "john"}.
I've tried @\w*\{[\s\S]*?\}, but it stops if I have } in any tag value between "". There is also no guarantee that } will be on a separate line; it can come directly after the last tag value (which may not end with " either, since it can be an integer).
Can you help me with this?
You could try the following expression as a basis: @\w+\{(?>\s*\w+\s*=\s*"[^"]*")*\}
Explanation:
@\w+\{...\} would be the record, e.g. @book{...}
(?>...)* means an atomic group (non-capturing, and not re-entered by backtracking once matched) that can occur multiple times or not at all - this is meant to represent the tags
\s*\w+\s*=\s*"[^"]*" would mean a tag which could be preceded by whitespace (\s*). The tag's value has to be in double quotes and anything between double quotes will be consumed, even curly braces.
Note that there might be some more cases to take into account, but this should be able to handle curly braces in tag values because it will "consume" all content between the double quotes; thus it wouldn't match if the closing curly brace were missing (e.g. it would match @book{ title="the use of { and }" author="John {curly} Johnson"} but not @book{ title="the use of { and }" author="John {curly} Johnson").
I've found a hack, it may help someone with same problem: there must be new line character after } sign. If end of value is only " (} sign doesn't end any value), then [\r\n] at the end of regex will suffice.
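A small Java sketch of the expression suggested above, assuming records begin with @ as in standard BibTeX and that every tag value is double-quoted with no commas between tags (as in the answer's own example). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BibRecords {
    // @type{ followed by any number of  name = "value"  tags, then the closing brace.
    // Braces inside the quoted values are consumed by "[^"]*" and cannot end the record.
    private static final Pattern RECORD =
            Pattern.compile("@\\w+\\{(?>\\s*\\w+\\s*=\\s*\"[^\"]*\")*\\}");

    // Returns every complete record found in the text, in order.
    static List<String> records(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Real BibTeX input with citation keys, commas between tags, or unquoted integer values would need the tag part of the pattern extended.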

Regex to find missing double quote in csv

We are processing csv files which contain lines with non-closed double quoted entries. These blow up the csv parser, so I am trying to put together a regex which will identify these lines so we can delete them from the files before trying to process them.
In the following example, the csv parser gets to line 2 and includes everything up to the first double quote in line 3 before trying to close out the token and then blows up because there are non-whitespace characters after the "closing" double quote before the next comma.
Example Line 1,some data,"good line",processes fine,happy
Example Line 2,some data,"bad line,processes poorly,unhappy
Example Line 3,some data,"good line",dies before here,unhappy
I am trying to do something like:
.*,"[^(",)]*[\r\n]
The idea is finding a single line with anything followed by ," without an instance of ", which follows before the line ends.
The negation of the sequence is not working though. How is something like this done?
NOTE:
Since people keep suggesting essentially checking for an even number of double quotes, it's worth noting that a single double-quoted csv entry could contain a standalone double quote (e.g. ...,"Measurement: 1' 2"",...).
You can use:
int count = str.length() - str.replaceAll("\"", "").length(); // number of double quotes in str
if (count % 2 == 0) {
// do what you want
}
With your current requirements (including your concern about "Measurement: 1' 2""), this will select the bad lines:
^.*(?:^|,)[^",]*"(?:[^",]*(?:"[^",]*")?)+(?:$|,.*)
The ^ anchors at the start of the string
The .*(?:^|,) eats up any characters up to the start of the string or a comma
We match a "...
and, once or more times, [^",]*(?:"[^",]*")? matches characters that are neither a " nor a comma, and, optionally, a balanced set of quotes: "[^",]*"
We either match the end of the string, or a comma and anything that follows
A note about escaped double quotes
You may have, in your input, double-quoted strings that contain an escaped double quote, like this: "abc\"de". If so, we need to replace our expression for double-quoted strings (?:"[^",]*") with something more solid: (?:"(?:\\"|[^"])*")
Hence the whole regex would become:
^.*(?:^|,)[^",]*"(?:[^",]*(?:"(?:\\"|[^"])*")?)+(?:$|,.*)
Something like this should work:
^[^"]*("[^"]*"[^"]*)*[^"]*$
The [^"]* that you see repeated all over the place means "any number of non-quote characters".
The ("[^"]*"[^"]*)* will match paired quotes while the [^"]*s will match the unquoted text before and after the final quotes.
The ^ and $ anchors ensure that we're matching the whole line, not just a portion of it.
Essentially: if there's an even number of quotes it will match. If there is an odd number of quotes, it will fail.
If whatever solution you're working in has the option, there's a much simpler method that doesn't involve regular expressions. Simply count the number of double quotes in the CSV line. If it's odd, the line has a mismatched quote.
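A minimal Java sketch of that counting approach, with a hypothetical class and method name. Note the caveat from the question: a field such as "Measurement: 1' 2"" contains a standalone double quote, so a valid line like that would also be reported as unbalanced by this check.

```java
final class QuoteCheck {
    // True when the line contains an odd number of double-quote characters.
    static boolean hasUnbalancedQuotes(String line) {
        long quotes = line.chars().filter(c -> c == '"').count();
        return quotes % 2 != 0;
    }
}
```

Applied to the three example lines from the question, only line 2 is flagged.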
This was a regex someone else gave me the framework for that ended up working with a few modifications:
This will match anything followed by ," with or without spaces in between, not followed eventually by a ", (also with potential white space) and finally ending in a newline.
.*,[\s]*"(?!.*"[\s]*,).*\n
Regex doesn't really work reliably for that as there are many edge cases. You should try univocity-parsers as it is the only CSV parser I know that handles unescaped quotes properly.
It gives you the following options:
STOP_AT_CLOSING_QUOTE - If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.
STOP_AT_DELIMITER - If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until a delimiter or line ending is found in the input.
SKIP_VALUE - If unescaped quotes are found in the input, the content parsed for that value is skipped until the next delimiter is found, and a null value is produced instead.
RAISE_ERROR - Throws an exception if unescaped quotes are found in the input
Use it like this:
CsvParserSettings settings = new CsvParserSettings();
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER);
CsvParser parser = new CsvParser(settings);
for(String row[] : parser.iterate(input)){
System.out.println(Arrays.toString(row));
}
Hope it helps. By default it runs with the STOP_AT_DELIMITER setting.
Disclaimer: I'm the author of this library. It's open-source and free (Apache 2.0 license)

Regex Expressions in Java

This is my first time using Regex and I'm finding some difficulties in validating a string against a regular expression of the sort (x,y,z)(y,z)(x,w) etc. This is the pattern that I have been trying to match with the string
String expressionmatcher = "[\\([[wxyz][,]]*[wxyz]{1,1}\\)]*";
boolean checker = expression.matches(expressionmatcher);
if (checker == true) {
System.out.println("Expression Valid");
}
else
{
System.out.println("Expression is not valid");
}
Although my pattern is accepted, the matcher also accepts everything that is included in the string regardless of the sequence. For example if I input 'x' or a '(' or a ',', it is accepted as a valid expression.
What should I do to fix this? Thank you
That's because the square brackets you have surrounding the entire thing indicate "one of the contents" -- so that if something matches any one of the inside groups, it'll work.
Easy fix is to replace the outer brackets with parentheses. And the brackets surrounding [wxyz][,] too, because if you replace the outer brackets without also replacing the brackets I just mentioned (,x) will also work. I believe you might also want to put a set of brackets around the outer parentheses followed by a + too -- this way you'll only match if you have at least one ordered something inside.
Few other improvements:
You don't need to have the parentheses in a pair of brackets
You don't need to say {1, 1} -- {1} works just fine
I'd recommend putting \\s* after the comma so you can put spaces (or any form of whitespace, for that matter) after the comma
This likely is not the most efficient regex you can get, as I'm not too experienced with them. It works, though!
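Putting those suggestions together, one possible pattern is sketched below. It is only one reading of the advice above (groups instead of the outer character classes, a + so that at least one tuple is required, and optional whitespace after each comma); the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class TupleValidator {
    // One or more groups of the form (a,b,...) where each item is one of w, x, y, z.
    private static final Pattern TUPLES =
            Pattern.compile("(?:\\((?:[wxyz],\\s*)*[wxyz]\\))+");

    // True only when the whole expression consists of well-formed tuples.
    static boolean isValid(String expression) {
        return TUPLES.matcher(expression).matches();
    }
}
```

With this pattern, (x,y,z)(y,z)(x,w) is accepted, while x, ( and , on their own, and (,x), are rejected.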

Regex double new line

What I want to do is to take the left and right parts of double line.
Example
LEFT_PART\r\n\r\nRIGHT_PART
Left and right part can be anything but they will not contain double new line.
What I'm doing is not working (doesn't match the string I give it). This is what I've done so far.
^(.*)[\r\r|\n\n|\r\n\r\n]{1,1}(.*)$
It can start with anything, followed by exactly one double-new line, followed by anything.
I group the right and left because I need to use them afterwards.
EDIT
I use OR to cover all three types of new-line
Square brackets are used for character class and not grouping. Try using parens:
^(.*)(\r\r|\n\n|\r\n\r\n)(.*)$
And to avoid capturing the double newlines;
^(.*)(?:\r\r|\n\n|\r\n\r\n)(.*)$
The {1,1} is also redundant. I removed it.
It's not working because you have used a character class, which matches just a single character. You should use parentheses. Also, you can simplify your regex by using the {n} quantifier. To match \r\r, use \r{2}:
^(.*)(?:\r|\n|\r\n){2}(.*)$
Apart from that, I would rather get the line separator for my system using:
String lineSeparator = System.getProperty("line.separator");
String regex = "^(.*)(?:" + Pattern.quote(lineSeparator) + "){2}(.*)$"; // group the separator so {2} repeats all of it
Try the next:
^(.*)(?:(\r|\r?\n){2})(.*)$
Try this:
(?m)^(.*)$[\r\n]{1,2}^$[\r\n]{1,2}^(.*)$
The switch (?m) has the effect that caret and dollar match after and before newlines for the remainder of the regular expression
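For completeness, here is a Java sketch of the grouped-alternation approach from the first answer. It assumes the left and right parts may themselves contain single line breaks, so DOTALL is enabled to let . match them; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoubleNewlineSplitter {
    // Left part, one double line break (\r\n\r\n, \r\r or \n\n), right part.
    private static final Pattern TWO_PARTS =
            Pattern.compile("^(.*)(?:\r\n\r\n|\r\r|\n\n)(.*)$", Pattern.DOTALL);

    // Returns {left, right}, or null when the input has no double line break.
    static String[] leftAndRight(String input) {
        Matcher m = TWO_PARTS.matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }
}
```

If the input could contain more than one double line break, the greedy first group would split at the last one.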

Regular expression to select all whitespace that isn't in quotes?

I'm not very good at RegEx, can someone give me a regex (to use in Java) that will select all whitespace that isn't between two quotes? I am trying to remove all such whitespace from a string, so any solution to do so will work.
For example:
(this is a test "sentence for the regex")
should become
(thisisatest"sentence for the regex")
Here's a single regex-replace that works:
\s+(?=([^"]*"[^"]*")*[^"]*$)
which will replace:
(this is a test "sentence for the regex" foo bar)
with:
(thisisatest"sentence for the regex"foobar)
Note that if the quotes can be escaped, the even more verbose regex will do the trick:
\s+(?=((\\[\\"]|[^\\"])*"(\\[\\"]|[^\\"])*")*(\\[\\"]|[^\\"])*$)
which replaces the input:
(this is a test "sentence \"for the regex" foo bar)
with:
(thisisatest"sentence \"for the regex"foobar)
(note that it also works with escaped backslashes: (thisisatest"sentence \\\"for the regex"foobar))
Needless to say (?), this really shouldn't be used to perform such a task: it makes one's eyes bleed, and it performs its task in quadratic time, while a simple linear solution exists.
EDIT
A quick demo:
String text = "(this is a test \"sentence \\\"for the regex\" foo bar)";
String regex = "\\s+(?=((\\\\[\\\\\"]|[^\\\\\"])*\"(\\\\[\\\\\"]|[^\\\\\"])*\")*(\\\\[\\\\\"]|[^\\\\\"])*$)";
System.out.println(text.replaceAll(regex, ""));
// output: (thisisatest"sentence \"for the regex"foobar)
Here is the regex which works for both single & double quotes (assuming that all strings are delimited properly)
\s+(?=(?:[^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)
It won't work with strings that have quotes inside them.
This just isn't something regexes are good at. Search-and-replace functions with regexes are always a bit limited, and any sort of nesting/containment at all becomes difficult and/or impossible.
I'd suggest an alternate approach: Split your string on quote characters. Go through the resulting array of strings, and strip the spaces from every other substring (whether you start with the first or second depends on whether you string started with a quote or not). Then join them back together, using quotes as separators. That should produce the results you're looking for.
Hope that helps!
PS: Note that this won't handle nested strings, but since you can't make nested strings with the ASCII double-quote character, I'm gonna assume you don't need that behaviour.
PPS: Once you're dealing with your substrings, then it's a good time to use regexes to kill those spaces - no containing quotes to worry about. Just remember to use the /.../g modifier to make sure it's a global replacement and not just the first match.
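The split-and-rejoin approach described above could be sketched in Java as follows, assuming plain double quotes with no escaping. The names are made up for illustration.

```java
final class SplitOnQuotes {
    // Removes whitespace from the pieces outside double quotes and rejoins with quotes.
    static String stripOutsideQuotes(String s) {
        // Limit -1 keeps empty leading/trailing pieces, so even indexes are always outside quotes.
        String[] pieces = s.split("\"", -1);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            if (i > 0) {
                result.append('"');  // restore the quote that split() removed
            }
            result.append(i % 2 == 0 ? pieces[i].replaceAll("\\s+", "") : pieces[i]);
        }
        return result.toString();
    }
}
```

In Java, replaceAll already replaces every match, so no global modifier is needed.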
Groups of whitespace outside of quotes are separated by stuff that's a) not whitespace, or b) inside quotes.
Perhaps something like:
(\s+)([^ "]+|"[^"]*")*
The first part matches a sequence of spaces; the second part matches non-spaces (and non-quotes), or some stuff in quotes, either repeated any number of times. The second part is the separator.
This will give you two groups for each item in the result; just ignore the second element. (We need the parentheses for precedence rather than match grouping there.) Or, you could say, concatenate all the second elements -- though you need to match the first non-space word too, or in this example, make the spaces optional:
StringBuffer b = new StringBuffer();
Pattern p = Pattern.compile("(\\s+)?([^ \"]+|\"[^\"]*\")*");
Matcher m = p.matcher("this is \"a test\"");
while (m.find()) {
if (m.group(2) != null)
b.append(m.group(2));
}
System.out.println(b.toString());
(I haven't done much regex in Java so expect bugs.)
Finally, this is how I'd do it if regexes were compulsory. ;-)
As well as Xavier's technique, you could simply do it the way you'd do it in C: just iterate over the input characters, and copy each to the new string if either it's non-space, or you've counted an odd number of quotes up to that point.
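A minimal Java sketch of that character loop, assuming double quotes only and no escape sequences. The class and method names are hypothetical.

```java
final class QuoteAwareWhitespaceStripper {
    // Copies every character except whitespace that occurs outside double quotes.
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean insideQuotes = false;  // true after an odd number of quotes so far
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;
            }
            if (insideQuotes || !Character.isWhitespace(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

This runs in a single pass over the input, which is the linear-time alternative to the lookahead regex.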
If there is only one set of quotes, you can do this:
String s = "(this is a test \"sentence for the regex\") a b c";
Matcher matcher = Pattern.compile("^[^\"]+|[^\"]+$").matcher(s);
while (matcher.find())
{
String group = matcher.group();
s = s.replace(group, group.replaceAll("\\s", ""));
}
System.out.println(s); // (thisisatest"sentence for the regex")abc
This isn't an exact solution, but you can accomplish your goal by doing the following:
STEP 1: Match the two segments
\(([a-zA-Z ]*)"([a-zA-Z ]*)"\)
STEP 2: remove spaces
temp = $1 replace " " with ""
STEP 3: rebuild your string
(temp"$2")
