I am trying to parse a CSV file with Java and have the following issue: the second column is a String (which may also contain commas) enclosed in double quotes, except when the string itself contains a double quote, in which case the entire string is enclosed in single quotes.
Lines may look like this:
someStuff,"hello", someStuff
someStuff,"hello, SO", someStuff
someStuff,'say "hello, world"', someStuff
someStuff,'say "hello, world', someStuff
someStuff is a placeholder for other elements, which can also include quotes in the same style.
I'm looking for a generic way to split the lines at commas UNLESS they are enclosed in single OR double quotes, in order to get the second column as a String. By second column I mean the fields:
hello
hello, SO
say "hello, world"
say "hello, world
I tried OpenCSV but failed, as one can only specify one type of quote:
public class CSVDemo {
public static void main(String[] args) throws IOException {
CSVDemo demo = new CSVDemo();
demo.process("input.csv");
}
public void process(String fileName) throws IOException {
String file = this.getClass().getClassLoader().getResource(fileName)
.getFile();
CSVReader reader = new CSVReader(new FileReader(file));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
+ nextLine[2]);
}
}
}
The solution with opencsv fails on the last line where there is only one double quote enclosed in single quotes:
someStuff | hello | someStuff
someStuff | hello, SO | someStuff
someStuff | 'say "hello, world"' | someStuff
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
If you truly cannot use a real CSV parser you could use a regex. This is generally not a good idea as there are always edge cases that you cannot handle but if the formatting is strictly as you describe then this may work.
public void test() {
String[] tests = {"numeStuff,\"hello\", someStuff, someStuff",
"numeStuff,\"hello, SO\", someStuff, someStuff",
"numeStuff,'say \"hello, world\"', someStuff, someStuff"
};
/* Matches a field and a potentially empty separator.
*
* ( - Field Group
* \" - Start with a quote
* [^\"]*? - Non-greedy match on anything that is not a quote
* \" - End with a quote
* | - Or
* ' - Start with a strop
* [^']*? - Non-greedy match on anything that is not a strop
* ' - End with a strop
* | - Or
* [^\"'] - Not starting with a quote or strop
* [^,\r\n]*? - Non-greedy match on anything that is not a comma or a line break
* ) - End field group
* ( - Separator group
* [,\r\n]|$ - Comma separator, line break, or end of input
* ) - End separator group
*/
Pattern p = Pattern.compile("(\"[^\"]*?\"|'[^\']*?\'|[^\"'][^,\r\n]*?)([,\r\n]|$)");
for (String t : tests) {
System.out.println("Matching: " + t);
Matcher m = p.matcher(t);
while (m.find()) {
System.out.println(m.group(1));
}
}
}
It does not appear that opencsv supports this out of the box. You could extend com.opencsv.CSVParser and implement your own algorithm for handling two types of quotes. This is the source of the method you would be changing and here is a stub to get you started.
class MyCSVParser extends CSVParser{
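@Override
// protected in recent opencsv versions; if your version declares it private, it cannot be overridden
protected String[] parseLine(String nextLine, boolean multi) throws IOException {
// Your algorithm here
throw new UnsupportedOperationException("not implemented yet");
}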
}
Basically you only need to track ," and ,' (trimming what's in the middle).
When you encounter one of those, set the appropriate flag (eg. singleQuoteOpen, doubleQuoteOpen) to true to indicate they're open and you are in ignore-commas mode.
When you meet the appropriate closing quote, reset the flag and keep slicing the elements.
To perform the check, stop at every comma (when not in ignore-commas mode) and look at the next char (if any, and trimming).
Note: the regex solution is good and also shorter, but less customizable for edge cases (at least without big headaches).
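The flag-based approach described above could be sketched roughly as follows. This is only a minimal illustration, not opencsv code: the class name TwoQuoteSplitter and its split method are made up for this example, and it assumes that a quote only opens a quoted section when it is the first non-blank character of a field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of opencsv.
class TwoQuoteSplitter {

    // Splits a line at commas unless the comma sits inside '...' or "...".
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char openQuote = 0;           // 0 = not inside quotes, otherwise ' or "
        boolean atFieldStart = true;  // only whitespace seen so far in this field

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (openQuote != 0) {
                if (c == openQuote) {
                    openQuote = 0;    // matching closing quote: leave ignore-commas mode
                } else {
                    current.append(c); // commas and the other quote type are plain text here
                }
            } else if (c == ',') {
                fields.add(current.toString().trim());
                current.setLength(0);
                atFieldStart = true;
            } else if (atFieldStart && (c == '"' || c == '\'')) {
                openQuote = c;        // remember which quote type opened the field
                atFieldStart = false;
            } else {
                if (!Character.isWhitespace(c)) {
                    atFieldStart = false;
                }
                current.append(c);
            }
        }
        // Simplification: trim() also removes blanks that were inside the quotes.
        fields.add(current.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("someStuff,\"hello, SO\", someStuff").get(1));
        System.out.println(split("someStuff,'say \"hello, world', someStuff").get(1));
    }
}
```

With the sample lines from the question, the second element comes out as hello, SO and say "hello, world respectively. Escaped quotes inside a quoted field are not handled in this sketch.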
If the use of single and double quotes is consistent per line, one could choose the corresponding type of quote per line:
public class CSVDemo {
public static void main(String[] args) throws IOException {
CSVDemo demo = new CSVDemo();
demo.process("input.csv");
}
public void process(String fileName) throws IOException {
String file = this.getClass().getClassLoader().getResource(fileName)
.getFile();
CSVParser doubleParser = new CSVParser(',', '"');
CSVParser singleParser = new CSVParser(',', '\'');
String[] nextLine;
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line;
while ((line = br.readLine()) != null) {
if (line.contains(",'") && line.contains("',")) {
nextLine = singleParser.parseLine(line);
} else {
nextLine = doubleParser.parseLine(line);
}
System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
+ nextLine[2]);
}
}
}
}
It doesn't seem that opencsv supports this. However, have a look at this previous question and my answer, as well as the other answers, in case they help you: https://stackoverflow.com/a/15905916/1688441
Below is an example. Please note that notInsideComma actually means "not inside quotes". The following code could be extended to check for both single and double quotes.
public static ArrayList<String> customSplitSpecific(String s)
{
ArrayList<String> words = new ArrayList<String>();
boolean notInsideComma = true;
int start =0, end=0;
for(int i=0; i<s.length(); i++) // include the last character, so a trailing comma still ends a field
{
if(s.charAt(i)==',' && notInsideComma)
{
words.add(s.substring(start,i));
start = i+1;
}
else if(s.charAt(i)=='"')
notInsideComma=!notInsideComma;
}
words.add(s.substring(start));
return words;
}
Related
I have a txt file with the following form:
1 | Argentina |Y|POSTAL_C |CAPITAL|STATES
I would like to split each line on the "|" character, so that every field ends up at its own index in an array, like this:
0 | 1 |2|3 |4 |5
1 | Argentina |Y|POSTAL_C |CAPITAL|STATES
and then work with the values inside that array.
My code
public void ReadFile2() throws IOException {
String referencePath = "C:\\Users\\Admin\\Desktop\\PRUEBA.txt";
BufferedReader br;
br = new BufferedReader(new FileReader(referencePath));
String lines = br.readLine();
lines.split("\n", 6);
this.logger.info("Starting...");
while (lines != null) {
if (!lines.isEmpty() && lines.length() >7) {
String[] values = lines.split("|");
System.out.println(values[0]); // should print 1
}
lines = br.readLine();
}
br.close();
}
grateful for your comments
Because the argument you pass to the split method is treated as a regex, you need to use lines.split("\\|").
In a regex the pipe sign has a special meaning (it works as an OR operator), so you need to escape it with a backslash. In a Java string literal the backslash itself has a special meaning, because it escapes the following character, which is why two backslashes are needed here. The resulting String is in fact just \|.
Try this:
String test = "1 | Argentina |Y|POSTAL_C |CAPITAL|STATES";
String[] values = test.split("\\|");
for (String value : values) {
System.out.println(value);
}
This returns output:
1
Argentina
Y
POSTAL_C
CAPITAL
STATES
You will probably also want to call .trim() on each value to remove the surrounding whitespace.
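As a small sketch of the trimming idea, the whitespace around each pipe can also be absorbed by the split regex itself. The class name PipeSplitDemo and the method splitFields are invented for this example; only String.trim() and String.split() are standard API.

```java
import java.util.Arrays;

// Hypothetical demo class for splitting on a literal pipe.
class PipeSplitDemo {

    // "\\s*\\|\\s*" means: optional whitespace, a literal |, optional whitespace.
    static String[] splitFields(String line) {
        return line.trim().split("\\s*\\|\\s*");
    }

    public static void main(String[] args) {
        String test = "1 | Argentina |Y|POSTAL_C |CAPITAL|STATES";
        // prints [1, Argentina, Y, POSTAL_C, CAPITAL, STATES]
        System.out.println(Arrays.toString(splitFields(test)));
    }
}
```

Here values[0] is "1" with no trailing blank, so no separate trim() call per field is needed.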
I have a file with records as below, and I am trying to split each record on whitespace and replace the whitespace with commas.
file:
a 3w 12 98 header P6124
e 4t 2 100 header I803
c 12L 11 437 M12
BufferedReader reader = new BufferedReader(new FileReader("/myfile.txt"));
String line = reader.readLine();
while (line != null) {
System.out.println(line);
// split before reading the next line, so line is never null here
String[] splitLine = line.split("\\s+");
line = reader.readLine();
}
reader.close();
If the data is separated by multiple whitespace characters, I usually go for a regex split such as split("\\s+") or split(" +").
But in the above case, record c has no value in the header column. The regex "\\s+" or " +" simply swallows that gap, so I get c,12L,11,437,M12 instead of c,12L,11,437,,M12.
How do I properly split the lines based on any delimiter in this case so that I get data in the below format:
a,3w,12,98,header,P6124
e,4t,2,100,header,I803
c,12L,11,437,,M12
Could anyone let me know how I can achieve this ?
Maybe you can try a more complicated approach: use a regex that matches exactly six fields on each line and explicitly handles the case of a missing value for the fifth one.
I rewrote your example adding some console log in order to clarify my suggestion:
public class RegexTest {
private static final String Input = "a 3w 12 98 header P6124\n" +
"e 4t 2 100 header I803\n" +
"c 12L 11 437 M12";
public static void main(String[] args) throws Exception {
BufferedReader reader = new BufferedReader(new StringReader(Input));
String line = null;
Pattern pattern = Pattern.compile("^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");
do {
line = reader.readLine();
System.out.println(line);
if(line != null) {
String[] splitLine = line.split("\\s+");
System.out.println(splitLine.length);
System.out.println("Line: " + line);
Matcher matcher = pattern.matcher(line);
System.out.println("matches: " + matcher.matches());
System.out.println("groups: " + matcher.groupCount());
for(int i = 1; i <= matcher.groupCount(); i++) {
System.out.printf(" Group %d has value '%s'\n", i, matcher.group(i));
}
}
} while (line != null);
}
}
The key is that the pattern used to match each line requires a sequence of six fields:
for each field, the value is described as [^ ]+ (one or more non-space characters)
separators between fields are described as " +" (one or more spaces)
the fifth (nullable) field is made optional by a ? after its group: ([^ ]+)?
each value is captured as a group using parentheses: ( ... )
start (^) and end ($) of each line are marked explicitly
Then, each line is matched against the given pattern, obtaining six groups: you can access each group using matcher.group(index), where index is 1-based because group(0) returns the full match.
This is a more complex approach but I think it can help you to solve your problem.
Put a limit on the number of whitespace chars that may be used to split the input.
In the case of your example data, a maximum of 5 works:
String[] splitLine = line.split("\\s{1,5}");
See live demo (of this code working as desired).
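A minimal sketch of how the limit behaves, under an assumption the question does not confirm: that fields are normally separated by one to five whitespace characters and that the gap left by a missing field is six to ten characters wide. The class name LimitedWhitespaceSplit and the helpers are made up for this example.

```java
import java.util.Arrays;

// Hypothetical demo of splitting on at most five whitespace characters.
class LimitedWhitespaceSplit {

    // Builds a run of n spaces, so the gap width is explicit in the code.
    static String spaces(int n) {
        char[] pad = new char[n];
        Arrays.fill(pad, ' ');
        return new String(pad);
    }

    // A gap of 6 to 10 spaces is consumed by two separator matches,
    // which leaves one empty string between them.
    static String toCsv(String line) {
        return String.join(",", line.split("\\s{1,5}"));
    }

    public static void main(String[] args) {
        String full = "a 3w 12 98 header P6124";
        String gap = "c 12L 11 437" + spaces(10) + "M12"; // assumed gap width
        System.out.println(toCsv(full)); // prints a,3w,12,98,header,P6124
        System.out.println(toCsv(gap));  // prints c,12L,11,437,,M12
    }
}
```

The approach is sensitive to the real column widths: a gap of eleven or more spaces would produce more than one empty field, so the limit has to be chosen to fit the actual data.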
Are you just trying to switch your delimiters from spaces to commas?
In that case:
cat myFile.txt | sed 's/   */  /g' | sed 's/ /,/g'
*edit: added a stage to strip out lists of more than two spaces, replacing them with just the two spaces needed to retain the double comma.
I'm working with a CSV file that in places, has multiple commas and pound signs. My question is about how to remove the multiple commas and the pound signs, while leaving a single comma between fields.
For this part of the task I have to use only Java, with no external libraries, to read the CSV file and sort the rows by price. The method takes a number as its input parameter and should return that many rows, ordered by price.
What I have currently is around 1000 lines of data that looks like this:
18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,,£307018.48,
I need to remove the double commas and the pound sign, but for the life of me haven't been able to get it to work.
This is the line I am using for the regex.
String currentLine = line.replaceAll("[,{2}|£]", "");
This outputs a line which looks like this:
100086 Norway Maple WayMadelleGeorgeotmgeorgeotrr#hao13.com417175.60
A larger chunk of the code looks like this and by no means is it nearly finished:
public String[] getTopProperties(int n){
String[] properties = new String[n];
String file = "data.csv";
String line = "";
String splitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
while ((line = br.readLine()) != null) {
String currentLine = line.replaceAll("[,{2}|£]", "");
System.out.println("Current line is: " + currentLine);
String[] user = currentLine.split(splitBy);
}
} catch (IOException e) {
e.printStackTrace();
}
return properties;
}
Issue is it's now removed all the commas and where the price and double commas used to be, they now connect.
Could use some help finding some regex that keeps a single comma between each field, as well as removing the pound sign.
You could simplify this by parsing the CSV file into a 2D array and ignoring the empty column which results from the double comma. Then parsing the currency column is a snap: just ignore the first character.
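A rough sketch of that idea for a single line, assuming every price field starts with exactly one currency character. The class name PriceColumnDemo and its methods are invented for this example, and the pound sign is written as \u00a3 only to keep the source file encoding-independent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split a line, drop empty columns, parse the price.
class PriceColumnDemo {

    // Splits on commas and skips the empty field produced by ",,".
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        for (String f : line.split(",")) {
            if (!f.isEmpty()) {
                out.add(f);
            }
        }
        return out;
    }

    // Ignores the first character (the currency sign) and parses the rest.
    static double price(String field) {
        return Double.parseDouble(field.substring(1));
    }

    public static void main(String[] args) {
        String line = "18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,,\u00a3307018.48,";
        List<String> f = fields(line);
        System.out.println(f.size());                   // prints 6
        System.out.println(price(f.get(f.size() - 1))); // prints 307018.48
    }
}
```

Once each row carries a numeric price, sorting the rows by that value and returning the first n needs no regex at all.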
In your regex .replaceAll("[,{2}|£]", ""); the square brackets create a character class, so this means "replace any of the characters ,, {, 2, }, |, or £ with nothing".
What you really want is to replace the sequence ,,£ with a single comma, which would be .replaceAll(",,£", ",")
In JavaScript this would be...
var line="18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,,£307018.48,";
console.log(' original line: ' + line);
console.log('replacement line: ' + line.replace(/,,£/, ","));
update
Converting this to Java as a stand-alone test program to demonstrate that this does work, I get the following:
public class so50419207
{
public static void main(String... args)
{
String input = "18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,,£307018.48,";
String replaced = input.replace(",,£", ",");
System.out.println("original string: " + input);
System.out.println("replaced string: " + replaced);
}
}
Running this...
$ javac so50419207.java ; java so50419207
original string: 18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,,£307018.48,
replaced string: 18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,307018.48,
I tried the regex (,,)(£)? and tested it on Ideone.
Please find the code below:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/* Name of the class has to be "Main" only if the class is public. */
class Ideone
{
public static void main (String[] args) throws java.lang.Exception
{
final String regex = "(,,)(£)?";
final String string = "18,,5 Ramsey Lane,,See,Amerighi,,samerighih#trellian.com,,£307018.48,,\n"
+ "18,,5 Ramsey Lane,,See,Amerighi,,samerighih#trellian.com,,£307018.48,,\n"
+ "18,5 Ramsey Lane,,See,Amerighi,,samerighih#trellian.com,,£307018.48,,\n"
+ "18,,5 Ramsey Lane,,See,Amerighi,,samerighih#trellian.com,,£307018.48,,";
final String subst = ",";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
// The substituted value will be contained in the result variable
final String result = matcher.replaceAll(subst);
System.out.println("Substitution result: " + result);
}
}
Output:
Substitution result: 18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,307018.48,
18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,307018.48,
18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,307018.48,
18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,307018.48,
INPUT
The input can take any of the forms shown below and must contain the following mandatory part: TXT{any comma-separated strings in any format}
String loginURL = "http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}";
String loginURL1 = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}";
String loginURL2 = "http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}&username=abcd&location={LOCATION}&password={PASS}";
String loginURL3 = "http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}";
String loginURL4 = "http://ip:port/path?username=abcd&password={PASS}";
Required Output
1. OutputURL corresponding to loginURL.
String outputURL = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}";
String outputURL1 = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}";
String outputURL2 = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}";
String outputURL3 = "http://ip:port/path?";
String outputURL4 = "http://ip:port/path?username=abcd&password={PASS}";
2. Deleted pattern(if any)
String deletedPattern = "TXT{UE-IP,UE-Username,UE-Password}";
My Attempts
String loginURLPattern = TXT+"\\{([\\w-,]*)\\}&*";
System.out.println("1. ");
getListOfTemplates(loginURL, loginURLPattern);
System.out.println();
System.out.println("2. ");
getListOfTemplates(loginURL1, loginURLPattern);
System.out.println();
private static void getListOfTemplates(String inputSequence,String pattern){
System.out.println("Input URL : " + inputSequence);
Matcher templateMatcher = Pattern.compile(pattern).matcher(inputSequence);
if (templateMatcher.find() && templateMatcher.group(1).length() > 0) {
System.out.println(templateMatcher.group(1));
System.out.println("OutputURL : " + templateMatcher.replaceAll(""));
}
}
OUTPUT obtained
1.
Input URL : http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}
UE-IP,UE-Username,UE-Password}&password={PASS
OutputURL : http://ip:port/path?username=abcd&location={LOCATION}&
2.
Input URL : http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}
UE-IP,UE-Username,UE-Password
OutputURL : http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&
DRAWBACK OF ABOVE PATTERN
If I add any String containing characters like # or % between the braces of TXT{}, my code breaks.
How can I achieve this using the java.util.regex library, so that the user can input any comma-separated Strings inside TXT{Any Comma Separated Strings}?
I would recommend using Matcher.appendReplacement:
public static void main(final String[] args) throws Exception {
final String[] loginURLs = {
"http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}",
"http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}",
"http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}&username=abcd&location={LOCATION}&password={PASS}",
"http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}",
"http://ip:port/path?username=abcd&password={PASS}"};
final Pattern patt = Pattern.compile("(\\?)?&?(TXT\\{[^}]++})(&)?");
for (final String loginURL : loginURLs) {
System.out.printf("%1$-10s %2$s%n", "Processing", loginURL);
final StringBuffer sb = new StringBuffer();
final Matcher matcher = patt.matcher(loginURL);
while (matcher.find()) {
final String found = matcher.group(2);
System.out.printf("%1$-10s %2$s%n", "Found", found);
if (matcher.group(1) != null && matcher.group(3) != null) {
matcher.appendReplacement(sb, "$1");
} else {
matcher.appendReplacement(sb, "$3");
}
}
matcher.appendTail(sb);
System.out.printf("%1$-10s %2$s%n%n", "Processed", sb.toString());
}
}
Output:
Processing http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}
Processing http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}
Processing http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}&username=abcd&location={LOCATION}&password={PASS}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}
Processing http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path
Processing http://ip:port/path?username=abcd&password={PASS}
Processed http://ip:port/path?username=abcd&password={PASS}
As you rightly point out, there are 3 possible cases:
"?{TEXT}&" -> "?"
"&{TEXT}&" -> "&"
"?{TEXT}" -> ""
So what we need to do is test for those cases in the regex. Here is the pattern:
(\\?)?&?(TXT\\{[^}]++})(&)?
Explanation:
(\\?)? optionally matches and captures a ?
&? optionally matches an & (without capturing it)
(TXT\\{[^}]++}) matches and captures TXT, followed by {, followed by one or more characters that are not } (possessively), followed by } (closing brackets don't need to be escaped)
(&)? optionally matches and captures a &
We have 3 groups:
potentially a ?
the required text
potentially an &
Now when we find a match we need to replace with the appropriate capture of case 1..3
if (matcher.group(1) != null && matcher.group(3) != null) {
matcher.appendReplacement(sb, "$1");
} else {
matcher.appendReplacement(sb, "$3");
}
If groups 1 and 3 are both present:
We must be in case 1; we must replace with "?" which is in group 1 so $1.
Otherwise we are in case 2 or 3:
In case 2 we need to replace with "&" and in 3 with "".
In case 2 group 3 will hold "&" and in case 3 it will hold "" so we can replace with $3 in both these cases.
Here I only capture the TXT{...} part using a match group. This means that although the leading ? or & is replaced, it is not in the String found. If you only want the bit between {}, then just move the parentheses.
Note that I reuse the Pattern - you can also reuse the Matcher if performance is a concern. You should always reuse the Pattern as it is (very) expensive to create. Store it in a static final if you can - it's threadsafe, matchers are not. The usual way to do it is to store the Pattern in a static final and then reuse the Matcher in the context of a method.
Also, the use of Matcher.appendReplacement is much more efficient than your current approach as it only needs to process the input once. Your approach parses the string twice.
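The reuse advice could look roughly like this. It is a sketch only: the class name TxtStripper and the method stripAll are invented for this example, while Pattern.compile, Matcher.reset(CharSequence), appendReplacement and appendTail are standard java.util.regex calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing a shared Pattern and a reused Matcher.
class TxtStripper {

    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern TXT =
            Pattern.compile("(\\?)?&?(TXT\\{[^}]++})(&)?");

    // Removes every TXT{...} block from each URL, reusing one Matcher.
    static List<String> stripAll(String... urls) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TXT.matcher(""); // not thread-safe, so kept local
        for (String url : urls) {
            matcher.reset(url);            // point the same Matcher at new input
            StringBuffer sb = new StringBuffer();
            while (matcher.find()) {
                boolean betweenQuestionMarkAndAmp =
                        matcher.group(1) != null && matcher.group(3) != null;
                matcher.appendReplacement(sb, betweenQuestionMarkAndAmp ? "$1" : "$3");
            }
            matcher.appendTail(sb);
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stripAll(
                "http://ip:port/path?TXT{a@b,%c,#d}&username=abcd",
                "http://ip:port/path?username=abcd&TXT{UE-IP,UE-Username}"));
    }
}
```

Because [^}] accepts any character except the closing brace, content such as @, % or # inside TXT{...} does not break the match.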
I've never been good with regex and I can't seem to get this...
I am trying to match statements along these lines (these are two lines in a text file I'm reading)
Lname Fname 12.35 1
Jones Bananaman 7.1 3
Currently I am using this for a while statement
reader.hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]")
But it doesn't enter the while statement.
The program reads the text file just fine when I remove the while.
The code segment is this:
private void initializeFileData(){
try {
Scanner reader = new Scanner(openedPath);
while(reader.hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]")){
employeeInfo.add(new EmployeeFile(reader.next(), reader.next(), reader.nextDouble(), reader.nextInt(), new employeeRemove()));
}
for(EmployeeFile element: employeeInfo){
output.add(element);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
Use the \s character class for the spaces between words:
while(reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]"))
Update:
According to the javadoc for the Scanner class, by default it splits its tokens using whitespace, so hasNext(pattern) only ever sees one whitespace-delimited token at a time. You can change the delimiter it uses with the useDelimiter(String pattern) method of Scanner.
private void initializeFileData(){
try {
Scanner reader = new Scanner(openedPath).useDelimiter("\\n");
...
while(reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]")){
...
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html
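A self-contained sketch of the delimiter change, reading from a String instead of a file so it can be run directly. The class name ScannerLineDemo and the method read are invented for this example. Note that once the delimiter is a line break, next() returns the whole line, so the fields have to be split out of it afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo: one token per line, validated with hasNext(pattern).
class ScannerLineDemo {

    static List<String[]> read(String text) {
        List<String[]> rows = new ArrayList<>();
        // Each token is now a full line rather than a single word.
        Scanner reader = new Scanner(text).useDelimiter("\\r?\\n");
        // hasNext(pattern) is true only if the whole next token matches.
        while (reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]")) {
            rows.add(reader.next().split("\\s"));
        }
        reader.close();
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : read("Lname Fname 12.35 1\nJones Bananaman 7.1 3")) {
            System.out.println(row[0] + " / " + row[1] + " / "
                    + Double.parseDouble(row[2]) + " / " + Integer.parseInt(row[3]));
        }
    }
}
```

The loop stops at the first line that does not match the pattern, which mirrors the behaviour of the while condition in the question.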
From what I can see (And correct me if I'm wrong, because regex always seems to trick my brain :p), you're not handling the spaces correctly. You need to use \s, not just the standard ' ' character
EDIT: Sorry, \s. Someone else beat me to it :p
Actually
\w+
is going to catch [Lname, Fname, 12, 35, 1] for Lname Fname 12.35 1. So you can just store reader.nextLine() and then extract all regex matches from it. You can then abstract this a bit, for instance like so:
class EmployeeFile {
.....
public EmployeeFile(String firstName, String lastName,
Double firstDouble, int firstInt,
EmployeeRemove er){
.....
}
public EmployeeFile(String line) {
//TODO : extract all the required info from the string array
// instead of doing it while reading at the same time.
// Keep input parsing separate from input reading.
// Turn this into a string array using the regex pattern
// mentioned above
}
}
I created my own version, without files and the last loop, that goes like that:
private static void initializeFileData() {
String[] testStrings = {"Lname Fname 12.35 1", "Jones Bananaman 7.1 3"};
Pattern myPattern = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(\\d*\\.\\d{1,2})\\s+([0-5])");
for (String s : testStrings) {
Matcher myMatcher = myPattern.matcher(s);
if (myMatcher.matches()) { // groupCount() is always 4 for this pattern; matches() actually runs the match
String lastName = myMatcher.group(1);
String firstName = myMatcher.group(2);
double firstValue = Double.parseDouble(myMatcher.group(3) );
int secondValue = Integer.parseInt(myMatcher.group(4));
//employeeInfo.add(new EmployeeFile(lastName, firstName, firstValue, secondValue, new employeeRemove()));
}
}
}
Notice that I kept the backslash before the dot (you want a literal dot, not any character) and inserted the parentheses in order to create the groups.
I hope it helps.