Related
I would like to find a regex that will pick out all commas that fall outside quote sets.
For example:
'foo' => 'bar',
'foofoo' => 'bar,bar'
This would pick out the single comma on line 1, after 'bar',
I don't really care about single vs double quotes.
Has anyone got any thoughts? I feel like this should be possible with readaheads, but my regex fu is too weak.
This will match any string up to and including the first non-quoted ",". Is that what you are wanting?
/^([^"]|"[^"]*")*?(,)/
If you want all of them (and as a counter-example to the guy who said it wasn't possible) you could write:
/(,)(?=(?:[^"]|"[^"]*")*$)/
which will match all of them. Thus
'test, a "comma,", bob, ",sam,",here'.gsub(/(,)(?=(?:[^"]|"[^"]*")*$)/,';')
replaces all the commas not inside quotes with semicolons, and produces:
'test; a "comma,"; bob; ",sam,";here'
If you need it to work across line breaks just add the m (multiline) flag.
The below regexes would match all the comma's which are present outside the double quotes,
,(?=(?:[^"]*"[^"]*")*[^"]*$)
DEMO
OR(PCRE only)
"[^"]*"(*SKIP)(*F)|,
"[^"]*" matches all the double quoted block. That is, in this buz,"bar,foo" input, this regex would match "bar,foo" only. Now the following (*SKIP)(*F) makes the match to fail. Then it moves on to the pattern which was next to | symbol and tries to match characters from the remaining string. That is, in our output , next to pattern | will match only the comma which was just after to buz . Note that this won't match the comma which was present inside double quotes, because we already make the double quoted part to skip.
DEMO
The below regex would match all the comma's which are present inside the double quotes,
,(?!(?:[^"]*"[^"]*")*[^"]*$)
DEMO
While it's possible to hack it with a regex (and I enjoy abusing regexes as much as the next guy), you'll get in trouble sooner or later trying to handle substrings without a more advanced parser. Possible ways to get in trouble include mixed quotes, and escaped quotes.
This function will split a string on commas, but not those commas that are within a single- or double-quoted string. It can be easily extended with additional characters to use as quotes (though character pairs like « » would need a few more lines of code) and will even tell you if you forgot to close a quote in your data:
function splitNotStrings(str){
var parse=[], inString=false, escape=0, end=0
for(var i=0, c; c=str[i]; i++){ // looping over the characters in str
if(c==='\\'){ escape^=1; continue} // 1 when odd number of consecutive \
if(c===','){
if(!inString){
parse.push(str.slice(end, i))
end=i+1
}
}
else if(splitNotStrings.quotes.indexOf(c)>-1 && !escape){
if(c===inString) inString=false
else if(!inString) inString=c
}
escape=0
}
// now we finished parsing, strings should be closed
if(inString) throw SyntaxError('expected matching '+inString)
if(end<i) parse.push(str.slice(end, i))
return parse
}
splitNotStrings.quotes="'\"" // add other (symmetrical) quotes here
Try this regular expression:
(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*=>\s*(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*,
This does also allow strings like “'foo\'bar' => 'bar\\',”.
MarkusQ's answer worked great for me for about a year, until it didn't. I just got a stack overflow error on a line with about 120 commas and 3682 characters total. In Java, like this:
String[] cells = line.split("[\t,](?=(?:[^\"]|\"[^\"]*\")*$)", -1);
Here's my extremely inelegant replacement that doesn't stack overflow:
private String[] extractCellsFromLine(String line) {
List<String> cellList = new ArrayList<String>();
while (true) {
String[] firstCellAndRest;
if (line.startsWith("\"")) {
firstCellAndRest = line.split("([\t,])(?=(?:[^\"]|\"[^\"]*\")*$)", 2);
}
else {
firstCellAndRest = line.split("[\t,]", 2);
}
cellList.add(firstCellAndRest[0]);
if (firstCellAndRest.length == 1) {
break;
}
line = firstCellAndRest[1];
}
return cellList.toArray(new String[cellList.size()]);
}
#SocialCensus, The example you gave in the comment to MarkusQ, where you throw in ' alongside the ", doesn't work with the example MarkusQ gave right above that if we change sam to sam's: (test, a "comma,", bob, ",sam's,",here) has no match against (,)(?=(?:[^"']|["|'][^"']")$). In fact, the problem itself, "I don't really care about single vs double quotes", is ambiguous. You have to be clear what you mean by quoting either with " or with '. For example, is nesting allowed or not? If so, to how many levels? If only 1 nested level, what happens to a comma outside the inner nested quotation but inside the outer nesting quotation? You should also consider that single quotes happen by themselves as apostrophes (ie, like the counter-example I gave earlier with sam's). Finally, the regex you made doesn't really treat single quotes on par with double quotes since it assumes the last type of quotation mark is necessarily a double quote -- and replacing that last double quote with ['|"] also has a problem if the text doesn't come with correct quoting (or if apostrophes are used), though, I suppose we probably could assume all quotes are correctly delineated.
MarkusQ's regexp answers the question: find all commas that have an even number of double quotes after it (ie, are outside double quotes) and disregard all commas that have an odd number of double quotes after it (ie, are inside double quotes). This is generally the same solution as what you probably want, but let's look at a few anomalies. First, if someone leaves off a quotation mark at the end, then this regexp finds all the wrong commas rather than finding the desired ones or failing to match any. Of course, if a double quote is missing, all bets are off since it might not be clear if the missing one belongs at the end or instead belongs at the beginning; however, there is a case that is legitimate and where the regex could conceivably fail (this is the second "anomaly"). If you adjust the regexp to go across text lines, then you should be aware that quoting multiple consecutive paragraphs requires that you place a single double quote at the beginning of each paragraph and leave out the quote at the end of each paragraph except for at the end of the very last paragraph. This means that over the space of those paragraphs, the regex will fail in some places and succeed in others.
Examples and brief discussions of paragraph quoting and of nested quoting can be found here http://en.wikipedia.org/wiki/Quotation_mark .
We are processing csv files which contain lines with non-closed double quoted entries. These blow up the csv parser, so I am trying to put together a regex which will identify these lines so we can delete them from the files before trying to process them.
In the following example, the csv parser gets to line 2 and includes everything up to the first double quote in line 3 before trying to close out the token and then blows up because there are non-whitespace characters after the "closing" double quote before the next comma.
Example Line 1,some data,"good line",processes fine,happy
Example Line 2,some data,"bad line,processes poorly,unhappy
Example Line 3,some data,"good line",dies before here,unhappy
I am trying to do something like:
.*,"[^(",)]*[\r\n]
The idea is finding a single line with anything followed by ," without an instance of ", which follows before the line ends.
The negation of the sequence is not working though. How is something like this done?
NOTE:
Since people keep suggesting essentially checking for an even number of double quotes, it's worth noting that a single double-quoted csv entry could contain a standalone double quote (e.g. ...,"Measurement: 1' 2"",...).
You can use:
int count = str.length() - str.replaceAll("\\"","").length();
if (count % 2 == 0) {
// do what you want
}
With your current requirements (including your concern about "Measurement: 1' 2"", this will select the bad lines:
^.*(?:^|,)[^",]*"(?:[^",]*(?:"[^",]*")?)+(?:$|,.*)
The ^ anchors at the top of the string
The .*(?:^|,) eats up any characters up to the top of the string or a comma
We match a "...
and, once or more times, [^",]*(?:"[^",]*")? matches characters that are neither a " or a comma, and, optionally, a balanced set of quotes: "[^",]*"
We either match the end of the string, or a comma and anything that follows
A note about escaped double quotes
You may have, in your input, double-quoted strings that contain an escaped double quote, like this: "abc\"de" If so, we need to replace our expression for double-quoted strings (?:"[^",]*") with something more solid: (?:"(?:\\"|[^"])*")
Hence the whole regex would become:
^.*(?:^|,)[^",]*"(?:[^",]*(?:"(?:\\"|[^"])*")?)+(?:$|,.*)
Something like this should work:
^[^"]*("[^"]*"[^"]*)*[^"]*$
The [^"]* that you see repeated all over the place means "any number of non-quote characters".
The ("[^"]*"[^"]*)* will match paired quotes while the [^"]*s will match the unquoted text before and after the final quotes.
The ^ and $ anchors ensure that we're matching the whole line, not just a portion of it.
Essentially: if there's an even number of quotes it will match. If there is an odd number of quotes, it will fail.
Here's an example of the regex in action.
If whatever solution you're working in has the option, there's a much simpler method that doesn't involve regular expressions. Simply count the number of double quotes in the CSV line. If it's odd, the line has a mismatched quote.
This was a regex someone else gave me the framework for that ended up working with a few modifications:
This will match anything followed by ," with or without spaces in between, not followed eventually by a ", (also with potential white space) and finally ending in a newline.
.*,[\s]*"(?!.*"[\s]*,).*\n
Regex doesn't really work reliably for that as there are many edge cases. You should try univocity-parsers as it is the only CSV parser I know that handles unescaped quotes properly.
It gives you the following options:
STOP_AT_CLOSING_QUOTE - If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.
STOP_AT_DELIMITER - If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until a delimiter or line ending is found in the input.
SKIP_VALUE - If unescaped quotes are found in the input, the content parsed for the until the next delimiter is found, everything will, producing a null.
RAISE_ERROR - Throws an exception if unescaped quotes are found in the input
Use it like this:
CsvParserSettings settings = new CsvParserSettings();
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER);
CsvParser parser = new CsvParser(settings);
for(String row[] : parser.iterate(input)){
System.out.println(Arrays.toString(row));
}
Hope it helps. By default it runs with the STOP_AT_DELIMITER setting.
Disclaimer: I'm the author of this library. It's open-source and free (Apache 2.0 license)
I am using String.Replaceall to replace forward slash / followed or preceded by a space with a comma followed by space ", " EXCEPT some patterns (for example n/v, n/d should not be affected)
ALL the following inputs
"nausea/vomiting"
"nausea /vomiting"
"nausea/ vomiting"
"nausea / vomiting"
Should be outputted as
nausea, vomiting
HOWEVER ALL the following inputs
"user have n/v but not other/ complications"
"user have n/d but not other / complications"
Should be outputted as follows
"user have n/v but not other, complications"
"user have n/d but not other, complications"
I have tried
String source= "nausea/vomiting"
String regex= "([^n/v])(\\s*/\\s*)";
source.replaceAll(regex, ", ");
But it cuts the a before / and gives me nause , vomiting
Does any body know a solution?
Your first capturing group, ([^n/v]), captures any single character that is not the letter n, the letter v, or a slash (/). In this case, it's matching the a at the end of nausea and capturing it to be replaced.
You need to be a bit more clear about what you are and are not replacing here. Do you just want to make sure there's a comma instead when it doesn't end in "vomiting" or "d"? You can use non-capturing groups to indicate this:
(?=asdf) does not capture but when placed at the end ensures that right after the match the string will contain asdf; (?!asdf) ensures that it will not. Whichever you use, the question mark after the initial parenthesis ensures that any text it matches will not be returned or replaced when the match is found.
Also, do not forget that in Java source you must always double up any backslashes you put in string literals.
[^n/v] is a character class, and means anything except a n, / or a v.
You are probably looking for something like a negative lookbehind:
String regex= "(?<!\\bn)(\\s*/\\s*)";
This will match any of your slash and space combinations that are not preceded by just an n, and works for all your examples. You can read more on lookaround here.
I have to remove \ from the string.
My String is "SEPIMOCO EUROPE\119"
I tried replace, indexOf, Pattern but I am not able to remove this \ from this string
String strconst="SEPIMOCO EUROPE\119";
System.out.println(strconst.replace("\\", " ")); // Gives SEPIMOCO EUROPE 9
System.out.println(strconst.replace("\\\\", " ")); // Gives SEPIMOCO EUROPE 9
System.out.println(strconst.indexOf("\\",0)); //Gives -1
Any solutions for this ?
Your string doesn't actually contain a backslash. This part: "\11" is treated as an octal escape sequence (so it's really a tab - U+0009). If you really want a backslash, you need:
String strconst="SEPIMOCO EUROPE\\119";
It's not really clear where you're getting your input data from or what you're trying to achieve, but that explains everything you're seeing at the moment.
You have to distinguish between the string literal, i.e. the thing you write in your source code, enclosed with double quotes, and the string value it represents. When turning the former into the latter, escape sequences are interpreted, causing a difference between these two.
Stripping from string literals
\11 in the literal represents the character with octal value 11, i.e. a tab character, in the actual string value. \11 is equivalent to \t.
There is no way to reliably obtain the escaped version of a string literal. In other words, you cannot know whether the source code contained \11 or \t, because that information isn't present in the class file any more. Therefore, if you wanted to “strip backslashes” from the sequence, you wouldn't know whether 11 or t was the correct replacement.
For this reason, you should try to fix the string literals, either to not include the backslashes if you don't want them at all, or to contain proper backslashes, by escaping them in the literal as well. \\ in a string literal gives a single \ in the string it expresses.
Runtime strings
As you comments to other answers indicate that you're actually receiving this string at runtime, I would expect the string to contain a real backslash instead of a tab character. Unless you employ some fancy input method which parses escape sequences, you will still have the raw backslash. In order to simulate that situation in testing code, you should include a real backslash in your string, i.e. a double backslash \\ in your string literal.
When you have a real backslash in your string, strconst.replace("\\", " ") should do what you want it to do:
String strconst="SEPIMOCO EUROPE\\119";
System.out.println(strconst.replace("\\", " ")); // Gives SEPIMOCO EUROPE 119
Where does your String come from? If you declare it like in the example you will want to add another escaping backslash before the one you have there.
Looking to find the appropriate regular expression for the following conditions:
I need to clean certain tags within free flowing text. For example, within the text I have two important tags: <2004:04:12> and <name of person>. Unfortunately some of tags have missing "<" or ">" delimiter.
For example, some are as follows:
1) <2004:04:12 , I need this to be <2004:04:12>
2) 2004:04:12>, I need this to be <2004:04:12>
3) <John Doe , I need this to be <John Doe>
I attempted to use the following for situation 1:
String regex = "<\\d{4}-\\d{2}-\\d{2}\\w*{2}[^>]";
String output = content.replaceAll(regex,"$0>");
This did find all instances of "<2004:04:12" and the result was "<2004:04:12 >".
However, I need to eliminate the space prior to the ending tag.
Not sure this is the best way. Any suggestions.
Thanks
Basically, you are looking for a negative look-ahead, like this:
String regex = "<\\d{4}-\\d{2}-\\d{2}(?!>)";
String output = content.replaceAll(regex,"$0>");
This will help with the numeric "tags", but since no regex can be intelligent enough to match an arbitrary name, you either must define very closely what a name can look like, or deal with the fact that the same approach is impossible for "name" tags.
For fixing the dates, you can match any date, with zero one or two angled brackets:
String regex = "(\\s?\\<?)(\\d{4}:\\d{2}:\\d{2})(\\>?\\s)";
String replace = " <$2> ";
To recognise a name, we assume parts of the name begin with a capital letter and the only separator is a space. We match the angled bracket explicitly at the start or end, and the preceeding/succeeding char before/after the name should be only a space or punctuation.
String regex = "(\\<[A-Z][a-zA-Z]*(\\s[A-Z][a-zA-Z])*)(?=[\\.!?:;\\s])";
String replace = "$1>";
String regex = "(?<=[\\.!?:;\\s])([A-Z][a-zA-Z]*(\\s[A-Z][a-zA-Z]*)*)";
String replace = "<$1";