Parsing CSV input with a RegEx in Java

I know, now I have two problems. But I'm having fun!
I started with this advice not to try and split, but instead to match on what is an acceptable field, and expanded from there to this expression.
final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");
The expression looks like this without the annoying escaped quotes:
"([^"]*)"|(?<=,|^)([^,]*)(?=,|$)
This is working well for me - either it matches on "two quotes and whatever is between them", or "something between the start of the line or a comma and the end of the line or a comma". Iterating through the matches gets me all the fields, even if they are empty. For instance,
the quick,"brown, fox jumps",over,"the",,"lazy dog"
breaks down into
the quick
"brown, fox jumps"
over
"the"
"lazy dog"
Great! Now I want to drop the quotes, so I added the lookahead and lookbehind non-capturing groups like I was doing for the commas.
final Pattern pattern = Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");
Again, the expression is:
(?<=")([^"]*)(?=")|(?<=,|^)([^,]*)(?=,|$)
Instead of the desired result
the quick
brown, fox jumps
over
the
lazy dog
now I get this breakdown:
the quick
"brown
fox jumps"
,over,
"the"
,,
"lazy dog"
What am I missing?
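The reported breakdown can be reproduced with a small harness. The sketch below (class, field, and method names are illustrative, not from the original post) prints each match with its start offset for both patterns on a shortened input. With the second pattern, the scan reaches the opening quote at an offset where the preceding character is a comma, so `(?<=")` cannot succeed there, while the comma alternative can, and it runs up to the next comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTrace {
    // First pattern from the question: the quotes are consumed as part of the match.
    static final Pattern CONSUMING =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");
    // Second pattern: the quotes are only asserted by lookarounds.
    static final Pattern LOOKAROUND =
            Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");

    // Returns every match as "startOffset:text" so the scan order is visible.
    static List<String> trace(Pattern p, String line) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            out.add(m.start() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "a,\"b,c\",d";
        System.out.println(trace(CONSUMING, line));
        System.out.println(trace(LOOKAROUND, line));
    }
}
```

Printing offsets rather than just the text makes it easier to see which alternative claimed each position.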

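Operator precedence. Alternation (|) has the lowest precedence of any regex operator, so the | already splits the pattern into two complete alternatives: the quote lookarounds on the left and the comma lookarounds on the right. Writing the grouping out makes that split explicit, although it does not change what the pattern matches:
(?:(?<=")([^"]*)(?="))|(?<=,|^)([^,]*)(?=,|$)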

(?:^|,)\s*(?:(?:(?=")"([^"].*?)")|(?:(?!")(.*?)))(?=,|$)
This should do what you want.
Explanation:
(?:^|,)\s*
The pattern should start with a , or beginning of string. Also, ignore all whitespace at the beginning.
Lookahead and see if the rest starts with a quote
(?:(?=")"([^"].*?)")
If it does, then match non-greedily till next quote.
(?:(?!")(.*?))
If it does not begin with a quote, then match non-greedily till next comma or end of string.
(?=,|$)
The pattern should end with a comma or end of string.
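As a rough sketch of how this pattern might be used from Java (the class and method names are hypothetical), the field text comes from group 1 for quoted fields and group 2 for unquoted ones. One thing to be aware of: `[^"].*?` needs at least one character between the quotes, so an empty quoted field `""` is not matched by this pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedOrPlainFields {
    // Group 1: content of a quoted field. Group 2: an unquoted field.
    static final Pattern FIELD = Pattern.compile(
            "(?:^|,)\\s*(?:(?:(?=\")\"([^\"].*?)\")|(?:(?!\")(.*?)))(?=,|$)");

    // Returns the fields of one line with surrounding quotes already dropped.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String f : fields(line)) {
            System.out.println("[" + f + "]");
        }
    }
}
```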

When I started to understand what I had done wrong, I also started to understand how convoluted the lookarounds were making this. I finally realized that I didn't want all the matched text, I wanted specific groups inside of it. I ended up using something very similar to my original RegEx except that I didn't do a lookahead on the closing comma, which I think should be a little more efficient. Here is my final code.
package regex.parser;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CSVParser {
    /*
     * This Pattern will match on either quoted text or text between commas, including
     * whitespace, and accounting for beginning and end of line.
     */
    private final Pattern csvPattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");
    private final ArrayList<String> allMatches = new ArrayList<String>();
    public String[] parse(String csvLine) {
        Matcher matcher = csvPattern.matcher(csvLine);
        allMatches.clear();
        while (matcher.find()) {
            // Group 1 holds the text inside quotes; group 2 holds an unquoted field.
            String match = matcher.group(1);
            if (match != null) {
                allMatches.add(match);
            } else {
                allMatches.add(matcher.group(2));
            }
        }
        return allMatches.toArray(new String[allMatches.size()]);
    }
    public static void main(String[] args) {
        String lineinput = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        CSVParser myCSV = new CSVParser();
        System.out.println("Testing CSVParser with: \n " + lineinput);
        for (String s : myCSV.parse(lineinput)) {
            System.out.println(s);
        }
    }
}

I know this isn't what the OP wants, but for other readers, one of the String.replace methods could be used to strip the quotes from each element in the result array of the OP's current regex.
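A minimal sketch of that post-processing step, assuming each element carries at most one leading and one trailing double quote (the helper names are made up for illustration):

```java
class QuoteStripper {
    // Removes one leading and one trailing double quote, if present.
    static String stripQuotes(String field) {
        return field.replaceAll("^\"|\"$", "");
    }

    // Applies stripQuotes to every element and returns a new array.
    static String[] stripAll(String[] fields) {
        String[] out = new String[fields.length];
        for (int i = 0; i < fields.length; i++) {
            out[i] = stripQuotes(fields[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] raw = {"the quick", "\"brown, fox jumps\"", "", "\"lazy dog\""};
        for (String s : stripAll(raw)) {
            System.out.println("[" + s + "]");
        }
    }
}
```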

Related

regex for letters or numbers in brackets

I am using Java to process text using regular expressions. I am using the following regular expression
^[\([0-9a-zA-Z]+\)\s]+
to match one or more letters or numbers in parentheses one or more times. For instance, I like to match
(aaa) (bb) (11) (AA) (iv)
or
(111) (aaaa) (i) (V)
I tested this regular expression on http://java-regex-tester.appspot.com/ and it is working. But when I use it in my code, the code does not compile. Here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tester {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("^[\([0-9a-zA-Z]+\)\s]+");
        String[] words = pattern.split("(a) (1) (c) (xii) (A) (12) (ii)");
        String w = pattern.
        for(String s:words){
            System.out.println(s);
        }
    }
}
I tried to use \\ instead of \, but the regex gave different results than I expected: it matches only one group, like (aaa), not multiple groups like (aaa) (111) (ii).
Two questions:
How can I fix this regex and be able to match multiple groups?
How can I get the individual matches separately (like (aaa) alone and then (111) and so on). I tried pattern.split but did not work for me.
Firstly, you need to escape every backslash inside the quotation marks with another backslash. The regex engine will then see a single backslash. (E.g. write a word character as \\w inside the quotation marks, etc.)
Secondly, you have to finish the line that reads:
String w = pattern.
That line, along with the unescaped backslashes, explains why it doesn't compile.
Here is my final solution to match the individual groups of letters/numbers in brackets that appear at the beginning of a line and ignore the rest
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tester {
    static ArrayList<String> listOfEnums;
    public static void main(String[] args) {
        listOfEnums = new ArrayList<String>();
        // Anchored at the start: one group of letters/digits in parentheses.
        Pattern pattern = Pattern.compile("^\\([0-9a-zA-Z]+\\)");
        String p = "(a) (1) (c) (xii) (A) (12) (ii) and the good news (1)";
        Matcher matcher = pattern.matcher(p);
        // Once a match is found, store it and remove it from the front of the string.
        while (matcher.find()) {
            String s = matcher.group();
            System.out.println(s);
            listOfEnums.add(s);
            // Drop the matched group and any whitespace that follows it.
            p = p.substring(s.length()).trim();
            matcher = pattern.matcher(p);
        }
    }
}
1) Your regex is incorrect. You want to match individual groups of letters / numbers in brackets, and the current regex will match only a single string of one or more such groups. I.e. it will match
(abc) (def) (123)
as a single group rather than three separate groups.
A better regex, one that matches a single group up to its closing bracket, would be
\([0-9a-zA-Z]+\)
2) In a Java string literal, every backslash must be escaped with another backslash.
3) The split() method will not do what you want. It finds all matches in your string, throws them away, and returns an array of what is left over. You want to use matcher() instead:
Pattern pattern = Pattern.compile("\\([0-9a-zA-Z]+\\)");
Matcher matcher = pattern.matcher("(a) (1) (c) (xii) (A) (12) (ii)");
while (matcher.find()) {
    System.out.println(matcher.group());
}
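For readers who want the groups in a list, here is one possible sketch (class and method names are illustrative). It also shows an alternative to the substring loop for the "only at the beginning of the line" case: `\G` anchors each match to the end of the previous one, so scanning stops at the first text that is not a bracketed group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketGroups {
    // One "(letters/digits)" group anywhere in the input.
    static final Pattern ANY = Pattern.compile("\\([0-9a-zA-Z]+\\)");
    // A group that starts where the previous match ended (or at the start of input).
    static final Pattern LEADING = Pattern.compile("\\G\\s*(\\([0-9a-zA-Z]+\\))");

    // Every bracketed group in the string, in order.
    static List<String> all(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = ANY.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Only the unbroken run of bracketed groups at the start of the string.
    static List<String> leading(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = LEADING.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String p = "(a) (1) (c) (xii) (A) (12) (ii) and the good news (1)";
        System.out.println(all(p));
        System.out.println(leading(p));
    }
}
```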

Replace whole tokens that may contain regular expression

I want to do a startStr.replaceAll(searchStr, replaceStr) and I have two requirements.
The searchStr must be a whole word, meaning it must have a space, beginning of string or end of string character around it.
e.g.
startStr = "ON cONfirmation, put ON your hat"
searchStr = "ON"
replaceStr = ""
expected = " cONfirmation, put your hat"
The searchStr may contain a regex pattern
e.g.
startStr = "remove this * thing"
searchStr = "*"
replaceStr = ""
expected = "remove this thing"
For requirement 1, I've found that this works:
startStr.replaceAll("\\b"+searchStr+"\\b",replaceStr)
For requirement 2, I've found that this works:
startStr.replaceAll(Pattern.quote(searchStr), replaceStr)
But I can't get them to work together:
startStr.replaceAll("\\b"+Pattern.quote(searchStr)+"\\b", replaceStr)
Here is the simple test case that's failing
startStr = "remove this * thing but not this*"
searchStr = "*"
replaceStr = ""
expected = "remove this thing but not this*"
actual = "remove this * thing but not this*"
What am I missing?
Thanks in advance
First off, the \b, or word boundary, is not going to work for you with the asterisk. The reason is that \b only detects the boundary between a word character and a non-word character. A regex engine doesn't treat * as a word character, so a search string that begins or ends with an asterisk won't be surrounded by valid word boundaries.
Reference pages:
http://www.regular-expressions.info/wordboundaries.html
http://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
An option you might like is to supply wildcard permutations in a regex:
(?<=\s|^)(ON|\*N|O\*|\*)(?=\s|$)
Here's a Java example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class RegExTest
{
    public static void main(String[] args){
        String sourcestring = "ON cONfirmation, put * your hat";
        // Remove the whole-word tokens, then collapse the doubled spaces left behind.
        sourcestring = sourcestring.replaceAll("(?<=\\s|^)(ON|\\*N|O\\*|\\*)(?=\\s|$)","").replaceAll(" {2,}"," ").trim();
        System.out.println("sourcestring=["+sourcestring+"]");
    }
}
You can write a little function to generate the wildcard permutations automatically. I admit I cheated a little with the spaces, but I don't think that was a requirement anyway.
Play with it online here: http://ideone.com/7uGfIS
The pattern "\\b" matches a word boundary, with a word character on one side and a non-word character on the other. * is not a word character, so \\b\\*\\b won't work. Look-behind and look-ahead match but do not consume patterns. You can specify that the beginning of the string or whitespace must come before your pattern and that whitespace or the end of the string must follow:
startStr.replaceAll("(?<=^|\\s)"+Pattern.quote(searchStr)+"(?=\\s|$)", replaceStr)
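Wrapped in a helper, this might look like the sketch below (the class and method names are hypothetical). `Matcher.quoteReplacement` is added on the assumption that the replacement should also be treated literally. Because only the token is consumed, the whitespace on both sides stays in place, so removing a token leaves two adjacent spaces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeTokenReplace {
    // Replaces token only where it is delimited by whitespace or the string ends.
    static String replaceToken(String input, String token, String replacement) {
        String regex = "(?<=^|\\s)" + Pattern.quote(token) + "(?=\\s|$)";
        return input.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println("[" + replaceToken("remove this * thing but not this*", "*", "") + "]");
        System.out.println("[" + replaceToken("ON cONfirmation, put ON your hat", "ON", "") + "]");
    }
}
```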
Try this,
For removing "ON"
StringBuilder stringBuilder = new StringBuilder();
String[] splittedValue = startStr.split(" ");
for (String value : splittedValue)
{
    if (!value.equalsIgnoreCase("ON"))
    {
        stringBuilder.append(value);
        stringBuilder.append(" ");
    }
}
System.out.println(stringBuilder.toString().trim());
For removing "*"
String startStr1 = "remove this * thing";
System.out.println(startStr1.replaceAll("\\*[\\s]", ""));
You can use (^| )\*( |$) instead of \\b.
Try this: startStr.replaceAll("(^| )yourSearchString( |$)", replaceStr);

How would I do this in Java Regex?

Trying to make a regex that grabs all words like, let's just say, chicken, that are not in brackets. So like
chicken
Would be selected but
[chicken]
Would not. Does anyone know how to do this?
String template = "[chicken]";
String pattern = "\\G(?<!\\[)(\\w+)(?!\\])";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(template);
while (m.find())
{
    System.out.println(m.group());
}
It uses a combination of negative look-behind and negative look-aheads and boundary matchers.
(?<!\\[) //negative look behind
(?!\\]) //negative look ahead
(\\w+) //capture group for the word
\\G //is a boundary matcher for marking the end of the previous match
(please read the following edits for clarification)
EDIT 1:
If one needs to account for situations like:
"chicken [chicken] chicken [chicken]"
We can replace the regex with:
String regex = "(?<!\\[)\\b(\\w+)\\b(?!\\])";
EDIT 2:
If one also needs to account for situations like:
"[chicken"
"chicken]"
As in one still wants the "chicken", then you could use:
String pattern = "(?<!\\[)?\\b(\\w+)\\b(?!\\])|(?<!\\[)\\b(\\w+)\\b(?!\\])?";
Which essentially accounts for the two cases of having only one bracket on either side. It accomplishes this through the | which acts as an or, and by using ? after the look-ahead/behinds, where ? means 0 or 1 of the previous expression.
I guess you want something like:
final Pattern UNBRACKETED_WORD_PAT = Pattern.compile("(?<!\\[)\\b\\w+\\b(?!])");
private List<String> findAllUnbracketedWords(final String s) {
    final List<String> ret = new ArrayList<String>();
    final Matcher m = UNBRACKETED_WORD_PAT.matcher(s);
    while (m.find()) {
        ret.add(m.group());
    }
    return Collections.unmodifiableList(ret);
}
Use this:
/(?<![\[\w])\w+(?![\w\]])/
i.e., consecutive word characters with no square bracket or word character before or after.
This needs to check both left and right for both a square bracket and a word character, else for your input of [chicken] it would simply return
hicke
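In Java the surrounding slashes are dropped and the backslashes doubled. A small sketch under that assumption (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnbracketedWords {
    // A run of word characters with no '[' or word character before it
    // and no word character or ']' after it.
    static final Pattern WORD = Pattern.compile("(?<![\\[\\w])\\w+(?![\\w\\]])");

    // Collects every word that is not touching a square bracket.
    static List<String> find(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(find("chicken [chicken] chicken"));
        System.out.println(find("[chicken]"));
    }
}
```

The word-character checks in both lookarounds are what stop the engine from backing off to a partial word such as hicke.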
Without look around:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MatchingTest
{
    private static String x = "pig [cow] chicken bull] [grain";
    public static void main(String[] args)
    {
        Pattern p = Pattern.compile("(\\[?)(\\w+)(\\]?)");
        Matcher m = p.matcher(x);
        while(m.find())
        {
            String firstBracket = m.group(1);
            String word = m.group(2);
            String lastBracket = m.group(3);
            if ("".equals(firstBracket) && "".equals(lastBracket))
            {
                System.out.println(word);
            }
        }
    }
}
Output:
pig
chicken
A bit more verbose, sure, but I find it more readable and easier to understand. Certainly simpler than a huge regular expression trying to handle all possible combinations of brackets.
Note that this won't filter out input like [fence tree grass]; it will indicate that tree is a match. You cannot skip tree in that without a parser. Hopefully, this is not a case you need to handle.

Need help in Regex to exclude splitting string within "

I need to split a String using a comma as the separator, but if part of the string is enclosed in double quotes ("), that portion must not be split, from the opening " to the closing one, even if it contains commas.
Can anyone please help me solve this using a regex with lookaround?
Resurrecting this question because it had a simple regex solution that wasn't mentioned. This situation sounds very similar to "regex-match a pattern unless...".
\"[^\"]*\"|(,)
The left side of the alternation matches complete double-quoted strings. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
Here is working code:
import java.util.regex.*;
class Program {
    public static void main (String[] args) {
        String subject = "\"Messages,Hello\",World,Hobbies,Java\",Programming\"";
        Pattern regex = Pattern.compile("\"[^\"]*\"|(,)");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            // A comma captured in Group 1 lies outside quotes: mark it as a split point.
            if (m.group(1) != null) m.appendReplacement(b, "SplitHere");
            // Otherwise copy the quoted string back unchanged; quoteReplacement keeps
            // any $ or \ inside it from being read as replacement syntax.
            else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
        }
        m.appendTail(b);
        String replaced = b.toString();
        String[] splits = replaced.split("SplitHere");
        for (String split : splits)
            System.out.println(split);
    } // end main
} // end Program
Reference
How to match pattern except in situations s1, s2, s3
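The same idea can also be sketched without the intermediate SplitHere marker, by cutting the input at each comma captured in Group 1. This is a variation on the code above, not part of the original answer, and the names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareSplit {
    // Left side skips over complete quoted strings; right side captures a real separator.
    static final Pattern P = Pattern.compile("\"[^\"]*\"|(,)");

    // Splits on commas that are not inside a double-quoted string.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = P.matcher(s);
        int last = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                parts.add(s.substring(last, m.start()));
                last = m.end();
            }
        }
        parts.add(s.substring(last)); // text after the final separator
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("\"Messages,Hello\",World,Hobbies,Java\",Programming\"")) {
            System.out.println(part);
        }
    }
}
```

This avoids any risk of the marker text appearing in the data itself.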
Please try this:
(?<!\G\s*"[^"]*),
If you put this regex in your program, it should be:
String regex = "(?<!\\G\\s*\"[^\"]*),";
But 2 things are not clear:
Does the " only start right after a , or can it start in the middle of the content, such as AAA, BB"CC,DD"? The regex above only deals with quotes that start near a comma.
If the content itself contains ", how is it escaped: with "" or \"? The regex above does not handle any escaped " format.

How to identify string pattern within a string but ignore if the match falls inside of identified pattern

I want to search a string for occurrences of a string that matches a specific pattern.
Then I will write that unique list of found strings separated by commas.
The pattern is to look for "$FOR_something" as long as that pattern does not fall inside of "#LOOKING( )" or "/* */" and the _something part does not have any other special characters.
For example, if I have this string,
"Not #LOOKING( $FOR_one $FOR_two) /* $FOR_three */ not $$$FOR_four or $FOR_four_b, but $FOR_five; and $FOR_six and not $FOR-seven or $FOR_five again"
The resulting list of found patterns I'm looking for from the above quoted string would be:
$FOR_five, $FOR_six
I started with this example:
import java.lang.StringBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class testIt {
    public static void main(String args[]) {
        String myWords = "Not #LOOKING( $FOR_one $FOR_two) /* $FOR_three */ not $$$FOR_four or $FOR_four_b, but $FOR_five; and $FOR_six and not $FOR-seven or $FOR_five again";
        StringBuffer sb = new StringBuffer(0);
        if ( myWords.toUpperCase().contains("$FOR") )
        {
            Pattern p = Pattern.compile("\\$FOR[\\_][a-zA-Z_0-9]+[\\s]*", Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(myWords);
            String myFors = "";
            while (m.find())
            {
                myFors = myWords.substring( m.start() , m.end() ).trim();
                if ( sb.length() == 0 ) sb = sb.append(myFors);
                else
                {
                    if ( !(sb.toString().contains(myFors))) sb = sb.append(", " + myFors );
                }
            }
        }
        System.out.println(sb);
    }
}
But it is not giving me what I want. What I want is:
$FOR_five, $FOR_six
Instead, I get all of the $FOR_somethings. I don't know how to ignore the occurrences inside of the /**/ or the #LOOKING().
Any suggestions?
This problem goes beyond regular regex I would say. The $$$ patterns can be fixed with negative lookbehind, the others won't as easily.
What I would recommend is to first use tokenizing or manual string parsing to discard unwanted data, such as /* ... */ or #LOOKING( ... ). This could, however, also be removed by another regex. Since strings are immutable, the result of replaceAll must be assigned back:
myWords = myWords.replaceAll("/\\*[^*/]+\\*/", ""); // removes /* ... */
myWords = myWords.replaceAll("#LOOKING\\([^)]+\\)", ""); // removes #LOOKING( ... )
Once stripped of context-based content you can use, e.g., the following regex:
(?<!\\$)\\$FOR_\\p{Alnum}+(?=[\\s;])
Explanation:
(?<!\\$) // Match iff not prefixed with $
\\$FOR_ // Matches $FOR_
\\p{Alnum}+ // Matches one or more alphanumericals [a-zA-Z0-9]
(?=[\\s;]) // Match iff followed by space or ';'
Note that the employed (?...) are known as lookahead/lookbehind expressions which are not captured in the result itself. They act only as prefix/suffix conditions in the above sample.
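Putting the two steps together, a minimal sketch could look like this. The class and method names are hypothetical, and the `LinkedHashSet` is an assumption about how to get the unique, ordered list the question asks for:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ForTokenFinder {
    // $FOR_ plus alphanumerics, not preceded by '$', followed by whitespace or ';'.
    static final Pattern TOKEN =
            Pattern.compile("(?<!\\$)\\$FOR_\\p{Alnum}+(?=[\\s;])");

    // Returns the unique tokens, in order of first appearance, joined by ", ".
    static String find(String text) {
        // Step 1: drop the regions that must be ignored.
        String stripped = text
                .replaceAll("/\\*[^*/]+\\*/", "")          // removes /* ... */
                .replaceAll("#LOOKING\\([^)]+\\)", "");    // removes #LOOKING( ... )
        // Step 2: collect matches; the set discards repeats but keeps order.
        Set<String> found = new LinkedHashSet<>();
        Matcher m = TOKEN.matcher(stripped);
        while (m.find()) {
            found.add(m.group());
        }
        return String.join(", ", found);
    }

    public static void main(String[] args) {
        String myWords = "Not #LOOKING( $FOR_one $FOR_two) /* $FOR_three */ not $$$FOR_four or $FOR_four_b, but $FOR_five; and $FOR_six and not $FOR-seven or $FOR_five again";
        System.out.println(find(myWords));
    }
}
```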
