Java CSV import regex help needed

I have tried many different regex strings, and all of them do the same thing.
One line of my .csv looks like this:
"999","Location","Alt. fare key","Table ID","Address","Line 2","City","State",19111,,,H,,, etc. (there are 139 columns)
As you can see, some of the entries are enclosed in quotation marks while others are not.
Quotation marks or not, every entry is separated by a comma.
Here are two examples of regex strings that I've used:
String regex = "(?:(?<=\")([^\"]*)(?=\"))|(?<=,|^)([^,]*)(?=,|$)";
Object[] tokens = strLine.split(regex);
model.addRow(tokens);
jTable1.setModel(model);
and
String regex = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";
Object[] tokens = strLine.split(regex);
model.addRow(tokens);
jTable1.setModel(model);
Both of these do the same thing.
Pretending the |(s) below are the lines of my jTable:
"999"|"Location"|"Alt. fare key"|"Table ID"|"Address"|"Line 2"|"City"|"State"|19111| | |H|
I want it to come out like this:
999|Location|Alt. fare key|Table ID|Address|Line 2|City|State|19111| | |H| etc.....
What else does the regex need in order to remove the unwanted quotation marks?
Thanks in advance for help.
JB

But does it handle embedded commas? The OpenCSV library will, and you just do this (copied from the OpenCSV documentation):
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
// nextLine[] is an array of values from the line
System.out.println(nextLine[0] + nextLine[1] + "etc...");
}
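If adding a library is not an option, the second regex from the question can still be used, provided the enclosing quotes are stripped from each token after the split. The following is only a minimal sketch of that idea: the class and method names are made up for illustration, and it does not cope with line breaks inside quoted fields, which a real CSV parser would.

```java
import java.util.Arrays;

class QuotedCsvSplit {
    // A comma followed by an even number of quotes, i.e. a comma outside quotes.
    private static final String COMMA_OUTSIDE_QUOTES = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] splitLine(String line) {
        // Limit -1 keeps trailing empty columns such as ",,,".
        String[] tokens = line.split(COMMA_OUTSIDE_QUOTES, -1);
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
                // Drop the enclosing quotes and unescape doubled quotes.
                tokens[i] = t.substring(1, t.length() - 1).replace("\"\"", "\"");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "\"999\",\"Location\",\"Alt. fare key\",19111,,,H,,";
        System.out.println(Arrays.toString(splitLine(line)));
    }
}
```

The resulting array can be passed to model.addRow(...) exactly as in the question.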

Related

How to split records on white space when different lines have spaces at different positions

I have a file with records as below, and I am trying to split the records in it on white space and convert the separators into commas.
file:
a 3w 12 98 header P6124
e 4t 2 100 header I803
c 12L 11 437 M12
BufferedReader reader = new BufferedReader(new FileReader("/myfile.txt"));
String line = reader.readLine();
while (line != null) {
    System.out.println(line);
    // split the current line before reading the next one, so that line is never null here
    String[] splitLine = line.split("\\s+");
    line = reader.readLine();
}
If the data is separated by multiple white spaces, I usually go for a regex split: split("\\s+") or split(" +").
But in the above case, record c does not have the header field. Hence the regex "\\s+" or " +" just skips the missing field, and I get c,12L,11,437,M12 instead of c,12L,11,437,,M12.
How do I properly split the lines in this case so that I get the data in the format below:
a,3w,12,98,header,P6124
e,4t,2,100,header,I803
c,12L,11,437,,M12
Could anyone let me know how I can achieve this ?

Maybe you can try a more elaborate approach: use a regex that matches exactly six fields on each line and explicitly handles a missing value for the fifth one.
I rewrote your example, adding some console logging to clarify my suggestion:
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
    private static final String Input = "a 3w 12 98 header P6124\n" +
            "e 4t 2 100 header I803\n" +
            "c 12L 11 437 M12";
    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new StringReader(Input));
        String line = null;
        Pattern pattern = Pattern.compile("^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");
        do {
            line = reader.readLine();
            System.out.println(line);
            if (line != null) {
                String[] splitLine = line.split("\\s+");
                System.out.println(splitLine.length);
                System.out.println("Line: " + line);
                Matcher matcher = pattern.matcher(line);
                boolean matches = matcher.matches();
                System.out.println("matches: " + matches);
                System.out.println("groups: " + matcher.groupCount());
                // group(i) throws IllegalStateException unless the match succeeded
                for (int i = 1; matches && i <= matcher.groupCount(); i++) {
                    System.out.printf(" Group %d has value '%s'\n", i, matcher.group(i));
                }
            }
        } while (line != null);
    }
}
The key is that the pattern used to match each line requires a sequence of six fields:
each field value is described as [^ ]+
separators between fields are described as " +" (one or more spaces)
the fifth (nullable) field is made optional by a ? after its group: ([^ ]+)?
each value is captured as a group using parentheses: ( ... )
start (^) and end ($) of each line are marked explicitly
Then, each line is matched against the given pattern, obtaining six groups: you can access each group using matcher.group(index), where index is 1-based because group(0) returns the full match.
This is a more complex approach but I think it can help you to solve your problem.
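To go from the matched groups to the comma-separated output the question asks for, the groups can be joined with a missing fifth group written as an empty string. This is a rough sketch only: the class and method names are invented here, and the sample lines in main assume that the missing field leaves a gap of more than one space, which is an assumption about the input file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalFieldToCsv {
    // Six space-separated fields; the fifth one may be absent.
    private static final Pattern LINE =
            Pattern.compile("^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");

    static String toCsv(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line format: " + line);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= m.groupCount(); i++) {
            if (i > 1) {
                sb.append(',');
            }
            String value = m.group(i); // null when the optional group did not take part
            sb.append(value == null ? "" : value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed spacing: a wide gap where the fifth field is missing.
        System.out.println(toCsv("a 3w 12 98 header P6124"));
        System.out.println(toCsv("c 12L 11 437        M12"));
    }
}
```

Lines that do not fit the six-field shape are rejected with an exception here; logging and skipping them would be an equally reasonable choice.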

Put a limit on the number of whitespace chars that may be used to split the input.
In the case of your example data, a maximum of 5 works:
String[] splitLine = line.split("\\s{1,5}");
See live demo (of this code working as desired).
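The reason this works is that a gap wider than the limit is consumed as two delimiters with an empty string between them. A small sketch of that behaviour, assuming the missing column leaves a gap of six to ten spaces (the exact width is an assumption about the input, and the class name is hypothetical):

```java
class BoundedWhitespaceSplit {
    static String toCsv(String line) {
        // A run of up to five whitespace characters counts as one delimiter,
        // so a longer run is split twice and yields an empty field in between.
        return String.join(",", line.split("\\s{1,5}"));
    }

    public static void main(String[] args) {
        System.out.println(toCsv("a 3w 12 98 header P6124"));
        System.out.println(toCsv("c 12L 11 437      M12"));
    }
}
```

If a gap could be wider than ten spaces, it would be split three times and produce two empty fields, so the limit has to be chosen to fit the actual column widths.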

Are you just trying to switch your delimiters from spaces to commas?
In that case:
cat myFile.txt | sed 's/   */  /g' | sed 's/ /,/g'
*Edit: added a stage that collapses runs of more than two spaces, replacing them with just the two spaces needed to retain the double comma.
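Since the question is about Java, the same two-stage idea can be written with String methods. This is only a sketch with a hypothetical class name, and it assumes that present fields are separated by single spaces while a missing field leaves a longer run of spaces:

```java
class SpaceRunsToCommas {
    static String toCsv(String line) {
        return line
                .replaceAll(" {2,}", "  ") // collapse any run of two or more spaces to exactly two
                .replace(' ', ',');        // then turn every remaining space into a comma
    }

    public static void main(String[] args) {
        System.out.println(toCsv("c 12L 11 437      M12"));
    }
}
```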

Java split by comma in string containing whitespace

I have the string below, which I want to split on ',' only, and I also want to extract the element at index 3 (1005 and 111222) from each line.
1002,USD,04/09/2019,1005,1,cref,,,,,,,,,
1001,USD,11/04/2018,111222,10,reftpt001,SHA,Remittance Code,BCITIT31745,,,RTGS,,,,
I am using the code below:
List<String> elements = new ArrayList<String>();
List<String> elements2 = new ArrayList<String>();
StringTokenizer st = new StringTokenizer((String) object);
while(st.hasMoreTokens()) {
String[] row = st.nextToken().split(",");
if (row.length == 5) {
elements.add(row[3]);
}
if (row.length == 12) {
elements2.add(row[3]);
}
}
In the above string there is a space inside 'Remittance Code', but the string is split at 'Remittance', and after that 'Code' is treated as a new line or string. Please advise how I can keep the white space as it is.

There is no apparent need for StringTokenizer here: without a delimiter argument it splits on whitespace, so the nextToken() call stops at the first space. Instead, I suggest calling split directly on the string, like
String[] row = ((String) object).split("\\s*,\\s*", -1);
Then remove the StringTokenizer. Note that the Javadoc explicitly says: StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
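Because the input in the question holds several records, one way to apply this is to break the text into lines first and split each line on commas only. The sketch below is illustrative: the class and method names are hypothetical, and it assumes records are separated by line breaks.

```java
import java.util.ArrayList;
import java.util.List;

class CommaRows {
    static String[] splitRow(String line) {
        // Limit -1 keeps trailing empty fields; spaces inside a value are untouched.
        return line.split("\\s*,\\s*", -1);
    }

    static List<String> fourthColumn(String input) {
        List<String> result = new ArrayList<>();
        for (String line : input.split("\\R")) { // \R matches any line break (Java 8+)
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] row = splitRow(line);
            if (row.length > 3) {
                result.add(row[3]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "1002,USD,04/09/2019,1005,1,cref,,,\n"
                + "1001,USD,11/04/2018,111222,10,reftpt001,SHA,Remittance Code,BCITIT31745,,";
        System.out.println(fourthColumn(input));
    }
}
```

Note that with the -1 limit the row lengths include trailing empty fields, so length checks such as row.length == 5 from the question would need to be adjusted.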

First you can split on the comma, and then apply trim to each token if needed:

String stringToSplit= "1001,ZAR,11/04/2018,111222,10,reftpt001,SHA,Remittance Code,BCITIT31745,,,RTGS,,,,";
StringTokenizer tokenizer = new StringTokenizer(stringToSplit, ",");
while (tokenizer.hasMoreTokens()) { System.out.println(tokenizer.nextToken()); }
Output :
1001
ZAR
11/04/2018
111222
10
reftpt001
SHA
Remittance Code
BCITIT31745
RTGS

I tried with this code:
1st approach :
String str = "1001,ZAR,11/04/2018,111222,10,reftpt001,SHA,Remittance Code,BCITIT31745";
String[] words = str.split(",");
for(String word : words) {
System.out.println(word);
}
2nd approach :
String str = "1001,ZAR,11/04/2018,111222,10,reftpt001,SHA,Remittance Code,BCITIT31745";
StringTokenizer tokenizer = new StringTokenizer(str, ",");
while(tokenizer.hasMoreTokens())
{
System.out.println(tokenizer.nextToken());
}
Output :
1001
ZAR
11/04/2018
111222
10
reftpt001
SHA
Remittance Code
BCITIT31745
Hope this helps you. :)

How to merge many List<String> elements in one based on double quote delimiter in java

I have a CSV file generated on another platform (Salesforce). By default, Salesforce does not seem to handle line breaks in some large text fields when generating the file, so my CSV file has some rows with line breaks like this, which I need to fix:
"column1","column2","my column with text
here the text continues
more text in the same field
here we finish this","column3","column4"
Same idea using this piece of code:
List<String> listWords = new ArrayList<String>();
listWords.add("\"Hi all");
listWords.add("This is a test");
listWords.add("of how to remove");
listWords.add("");
listWords.add("breaklines and merge all in one\"");
listWords.add("\"This is a new Line with the whole text in one row\"");
In this case I would like to merge the elements. My first approach was to check for lines whose last character is not a double quote ("), concatenate the next line, and continue like that until the last character is a double quote.
This is a non-working sample of what I was trying to achieve, but I hope it gives you an idea:
String[] csvLines = csvContent.split("\n");
Integer iterator = 0;
String mergedRows = "";
for(String row:csvLines){
newCsvfile.add(row);
if(row != null){
if(!row.isEmpty()){
String lastChar = String.valueOf(row.charAt(row.length()-1));
if(!lastChar.contains("\"")){
//row += row+" "+csvLines[iterator+1].replaceAll("\r", "").replaceAll("\n", "").replaceAll("","").replaceAll("\r\n?|\n", "");
mergedRows += row+" "+csvLines[iterator+1].replaceAll("\r", "").replaceAll("\n", "").replaceAll("","").replaceAll("\r\n?|\n", "");
row = mergedRows;
csvLines[iterator+1] = null;
}
}
newCsvfile.add(row);
}
iterator++;
}
My final result should look like (based on the list sample):
"Hi all This is a test of how to remove break lines and merge all in one"
"This is a new Line with the whole text in one row".
What is the best approach to achieve this?

In case you don't want to use a CSV reading library like @RealSkeptic suggested...
Going from your listWords to your expected solution is fairly simple:
List<String> listSentences = new ArrayList<>();
String tmp = "";
for (String s : listWords) {
tmp = tmp.concat(" " + s);
if (s.endsWith("\"")){
listSentences.add(tmp);
tmp = "";
}
}
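The loop above leaves a leading space on each sentence and a double space where the list contains an empty element. A slightly more careful variant, offered as a sketch with hypothetical names, skips empty parts and only inserts a separator between parts. It still assumes that a part ending in a double quote closes the field, which would be wrong for a line that merely ends with an escaped quote.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QuotedLineMerger {
    static List<String> merge(List<String> parts) {
        List<String> merged = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : parts) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines inside a field
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(trimmed);
            if (trimmed.endsWith("\"")) { // assumed end of the quoted field
                merged.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            merged.add(current.toString()); // unterminated remainder, kept as is
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> listWords = Arrays.asList("\"Hi all", "This is a test", "",
                "and merge all in one\"", "\"A complete row\"");
        merge(listWords).forEach(System.out::println);
    }
}
```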

CSV with tab as quote character

I have tried several CSV parsers for Java, but none of them handled the following line properly:
String str = "\tvalue1\t,,\tv1,",',v3\t,value2"
The format is comma separated with TAB as the quote character. Some fields are empty, and some are not quoted.
Any suggestions for a parser that handles this format well?
For example I would expect that the above string will be parsed as:
value1
null
v1,",',v3
value2
But it's producing the following:
value1
null
v1
"
'
v3
value2
Java Example:
import java.lang.String;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
public class StamMain {
public static void main(String[] args){
String str = "\tvalue1\t,,\tv1,',",v3\t,value2";
System.out.println(str);
CsvParserSettings settings = new CsvParserSettings();
settings.getFormat().setQuote('\t');
CsvParser parser = new CsvParser(settings);
String[] fields = parser.parseLine(str);
for (String f : fields)
System.out.println(f);
}
}
The best results are achieved if TAB is replaced by a double quote, but then escaping the existing quotes becomes an interesting task in itself.
Any ideas appreciated.

Apache Commons CSV can handle it just fine.
String str = "\tvalue1\t,,\tv1,\",',v3\t,value2";
CSVFormat csvFormat = CSVFormat.DEFAULT.withQuote('\t');
for (CSVRecord record : CSVParser.parse(str, csvFormat))
for (String value : record)
System.out.println(value);
Output
value1
v1,",',v3
value2
You can even add .withNullString("") to get that null value, if you want.
value1
null
v1,",',v3
value2
Very flexible CSV parser.

Just add this line before parsing to get the result you expect:
settings.trimValues(false);
This is required because by default the parser removes white spaces around delimiters, but your "quote" character happens to be a white space. Regardless, this is something the parser should handle. I opened this bug report to have it fixed in the next version of uniVocity-parsers.

Works with Super CSV
ICsvListReader reader = new CsvListReader(
new FileReader("weird.csv"),
new CsvPreference.Builder('\t', ',', "\r\n").build()
);
List<String> record = reader.read();
for(String value : record)
System.out.println(value);
Output:
value1
null
v1,",',v3
value2

One option is to:
1) Replace all the double quotes in your string with some "good" replacement string that you know won't be in the actual data (e.g. "REPLACE_QUOTES_TEMP")
2) Replace all tabs with double quotes.
3) Run the parser as normal.
4) Replace back the "REPLACE_QUOTES_TEMP" strings (or whatever you chose), in the individual fields, with the actual double quote.
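The four steps can be sketched as follows. To keep the example self-contained, step 3 uses a tiny hand-written splitter (splitQuoted, a hypothetical stand-in) where a real CSV library would normally go, and the placeholder text is assumed never to occur in the data.

```java
import java.util.ArrayList;
import java.util.List;

class TabQuoteWorkaround {
    // Assumed never to appear in the real data.
    static final String PLACEHOLDER = "REPLACE_QUOTES_TEMP";

    static List<String> parse(String line) {
        // Steps 1 and 2: hide the real double quotes, then turn tabs into quotes.
        String prepared = line.replace("\"", PLACEHOLDER).replace('\t', '"');
        List<String> fields = new ArrayList<>();
        // Step 3: parse as ordinary CSV (stand-in parser, see below).
        for (String field : splitQuoted(prepared)) {
            // Step 4: restore the real double quotes in each field.
            fields.add(field.replace(PLACEHOLDER, "\""));
        }
        return fields;
    }

    // Minimal stand-in for a real CSV parser: commas inside "..." are kept.
    static List<String> splitQuoted(String line) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                out.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        parse("\tvalue1\t,,\tv1,\",',v3\t,value2").forEach(System.out::println);
    }
}
```

In this sketch the empty second field comes back as an empty string; whether it is reported as null depends on the parser used in step 3.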

The string "\tvalue1\t,,\tv1,",',v3\t,value2" is not valid Java. To include " as a character you need to write \".
For parsing this code should work:
String st = "\tvalue1\t,,\tv1,\",',v3\t,value2";
String[] arr = st.split("\t");

Split string with alternative comma (,)

I know how to tokenize a String, but the problem is that I want to tokenize it as shown below.
String st = "'test1, test2','test3, test4'";
What I've tried is as below:
st.split(",");
This is giving me output as:
'test1
test2'
'test3
test4'
But I want output as:
'test1, test2'
'test3, test4'
How do I do this?

Since single quotes are not mandatory, split will not work, because Java's regex engine does not allow variable-length lookbehind expressions. Here is a simple solution that uses regex to match the content, not the delimiters:
String st = "'test1, test2','test3, test4',test5,'test6, test7',test8";
Pattern p = Pattern.compile("('[^']*'|[^,]*)(?:,?)");
Matcher m = p.matcher(st);
while (m.find()) {
System.out.println(m.group(1));
}
Demo on ideone.
You can add syntax for escaping single quotes by altering the "content" portion of the quoted substring (currently, it's [^']*, meaning "anything except a single quote, repeated zero or more times").
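As one possible illustration of such an escape syntax, the sketch below assumes that a doubled single quote ('') inside a quoted value stands for a literal quote; that convention is an assumption made here, not something the question specifies. It also walks the input field by field, so no extra empty match is reported at the end of the string. Like the pattern above, it returns the raw tokens with their quotes. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleQuotedFields {
    // Either a quoted value in which '' is allowed, or plain text up to the next comma.
    private static final Pattern FIELD = Pattern.compile("'(?:[^']|'')*'|[^,]*");

    static List<String> parse(String input) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        int pos = 0;
        while (true) {
            m.region(pos, input.length());
            m.lookingAt(); // always succeeds: the second alternative may match nothing
            fields.add(m.group());
            pos = m.end();
            if (pos >= input.length()) {
                break;
            }
            if (input.charAt(pos) != ',') {
                throw new IllegalArgumentException("Expected ',' at index " + pos);
            }
            pos++; // skip the separating comma
        }
        return fields;
    }

    public static void main(String[] args) {
        parse("'test1, test2','it''s, ok',test5").forEach(System.out::println);
    }
}
```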

The easiest and most reliable solution would be to use a CSV parser, for example OpenCSV, which the sample below uses.
It parses the strings according to CSV rules, so even '' can be used within a value without breaking it.
Sample code would look like this:
ByteArrayInputStream baos = new ByteArrayInputStream("'test1, test2','test3, test4'".getBytes());
CSVReader reader = new CSVReader(new InputStreamReader(baos), ',', '\'');
String[] read = reader.readNext();
System.out.println("0: " + read[0]);
System.out.println("1: " + read[1]);
reader.close();
This would print:
0: test1, test2
1: test3, test4
If you use maven you can just import the dependency:
<dependency>
<groupId>net.sf.opencsv</groupId>
<artifactId>opencsv</artifactId>
<version>2.0</version>
</dependency>
And start using it.
