My CSV is being read and printed to System.out, but I've noticed that any text containing a space gets moved onto the next line (as if a \n had been inserted).
Here's how my csv starts:
first,last,email,address 1, address 2
john,smith,blah#blah.com,123 St. Street,
Jane,Smith,blech#blech.com,4455 Roger Cir,apt 2
After running my app, any cell with a space (address 1), gets thrown onto the next line.
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
public class main {
public static void main(String[] args) {
// -define .csv file in app
String fileNameDefined = "uploadedcsv/employees.csv";
// -File class needed to turn stringName to actual file
File file = new File(fileNameDefined);
try{
// -read from filePooped with Scanner class
Scanner inputStream = new Scanner(file);
// hashNext() loops line-by-line
while(inputStream.hasNext()){
//read single line, put in string
String data = inputStream.next();
System.out.println(data + "***");
}
// after loop, close scanner
inputStream.close();
}catch (FileNotFoundException e){
e.printStackTrace();
}
}
}
So here's the result in the console:
first,last,email,address
1,address
2
john,smith,blah#blah.com,123
St.
Street,
Jane,Smith,blech#blech.com,4455
Roger
Cir,apt
2
Am I using Scanner incorrectly?
Please stop writing faulty CSV parsers!
I've seen hundreds of CSV parsers and so called tutorials for them online.
Nearly every one of them gets it wrong!
This wouldn't be such a bad thing, as it doesn't affect me directly, but people who try to write CSV readers and get them wrong tend to write CSV writers, too, and get those wrong as well. And those are the ones I have to write parsers for.
Please keep in mind that CSV (in order of increasing non-obviousness):
can have quoting characters around values
can have other quoting characters than "
can even have other quoting characters than " and '
can have no quoting characters at all
can even have quoting characters on some values and none on others
can have other separators than , and ;
can have whitespace between separators and (quoted) values
can have other charsets than ascii
should have the same number of values in each row, but doesn't always
can contain empty fields, either quoted: "foo","","bar" or not: "foo",,"bar"
can contain newlines in values
can not contain newlines in values if they are not delimited
can not contain newlines between values
can have the delimiting character within the value if properly escaped
does not use backslash to escape delimiters but...
uses the quoting character itself to escape it, e.g. Frodo's Ring will be 'Frodo''s Ring'
can have the quoting character at beginning or end of value, or even as only character ("foo""", """bar", """")
can even have the quoted character within the not quoted value; this one is not escaped
If you think this is obviously not a problem, then think again. I've seen every single one of these items implemented wrongly, even in major software packages (e.g. office suites, CRM systems).
There are good and correctly working out-of-the-box CSV readers and writers out there:
opencsv
Ostermiller Java Utilities
Apache Commons CSV
If you insist on writing your own at least read the (very short) RFC for CSV.
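The writing side of the quoting rules listed above can be sketched roughly as follows. This is a minimal illustration, assuming the double quote is the quoting character; the class name CsvEscapeSketch and the method escapeField are made up for this example and are not part of any of the libraries mentioned.

```java
public class CsvEscapeSketch {
    // Quote a field only when it contains the separator, a double quote,
    // or a line break; embedded double quotes are doubled, not backslash-escaped.
    public static String escapeField(String field, char separator) {
        boolean needsQuotes = field.indexOf(separator) >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(escapeField("Frodo", ','));
        System.out.println(escapeField("he said \"this\"", ','));
        System.out.println(escapeField("4455 Roger Cir, apt 2", ','));
    }
}
```

A reader then has to undo exactly these steps, which is where most hand-written parsers go wrong.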
scanner.useDelimiter(",");
This should work.
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
public class TestScanner {
public static void main(String[] args) throws FileNotFoundException {
Scanner scanner = new Scanner(new File("/Users/pankaj/abc.csv"));
scanner.useDelimiter(",");
while(scanner.hasNext()){
System.out.print(scanner.next()+"|");
}
scanner.close();
}
}
For CSV File:
a,b,c d,e
1,2,3 4,5
X,Y,Z A,B
Output is:
a|b|c d|e
1|2|3 4|5
X|Y|Z A|B|
Scanner.next() does not read a newline but reads the next token, delimited by whitespace (by default, if useDelimiter() was not used to change the delimiter pattern). To read a line use Scanner.nextLine().
Once you read a single line you can use String.split(",") to separate the line into fields. This enables identification of lines that do not consist of the required number of fields. Using useDelimiter(","); would ignore the line-based structure of the file (each line consists of a list of fields separated by a comma). For example:
while (inputStream.hasNextLine())
{
String line = inputStream.nextLine();
String[] fields = line.split(",");
if (fields.length >= 4) // At least one address specified.
{
for (String field: fields) System.out.print(field + "|");
System.out.println();
}
else
{
System.err.println("Invalid record: " + line);
}
}
As already mentioned, using a CSV library is recommended. For one, this (and useDelimiter(",") solution) will not correctly handle quoted identifiers containing , characters.
I agree with Scheintod that using an existing CSV library is a good idea to have RFC 4180 compliance from the start. Besides the mentioned opencsv and Ostermiller, there are a number of other CSV libraries out there. If you're interested in performance, you can take a look at the uniVocity/csv-parsers-comparison project. It shows that
uniVocity CSV parser
SimpleFlatMapper CSV parser
Jackson CSV parser
are consistently the fastest using either JDK 6, 7, 8, or 9. The study did not find any RFC 4180 compatibility issues in any of those three. Both opencsv and Ostermiller were found to be about twice as slow as those.
I'm not in any way associated with the author(s), but concerning the uniVocity CSV parser, the study might be biased, because its author is also the author of that parser.
To note, the author of SimpleFlatMapper has also published a performance comparison comparing only those three.
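Split the result of nextLine() with this delimiter pattern:
,(?=([^\"]*\"[^\"]*\")*[^\"]*$)
It matches only commas that are followed by an even number of double quotes, i.e. commas that are not inside a quoted value.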
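A quote-aware split along these lines can be sketched as follows. Note the leading comma in the pattern: the lookahead only accepts a comma when an even number of double quotes follows it. The class and method names are made up for this example. This is a sketch with known limits: the surrounding quotes stay in the field, doubled quotes are not unescaped, and a quoted value that spans several lines is not handled.

```java
import java.util.Arrays;

public class QuoteAwareSplitSketch {
    // A comma followed by an even number of double quotes lies outside any quoted value.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static String[] splitLine(String line) {
        // A limit of -1 keeps trailing empty fields, such as an empty "address 2".
        return line.split(COMMA_OUTSIDE_QUOTES, -1);
    }

    public static void main(String[] args) {
        String line = "Jane,Smith,\"4455 Roger Cir, apt 2\",";
        System.out.println(Arrays.toString(splitLine(line)));
    }
}
```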
I have seen many production problems caused by code not handling quotes ("), newline characters within quotes, and quotes within the quotes; e.g.: "he said ""this""" should be parsed into: he said "this"
Like it was mentioned earlier, many CSV parsing examples out there just read a line, and then break up the line by the separator character. This is rather incomplete and problematic.
For me, and probably for others who prefer build versus buy (rather than using somebody else's code and dealing with its dependencies), I got down to classic text-parsing programming, and that worked for me:
/**
 * Parse CSV data into an array of String arrays. It handles double quoted values.
 * @param is input stream
 * @param separator
 * @param trimValues
 * @param skipEmptyLines
 * @return an array of String arrays
 * @throws IOException
 */
public static String[][] parseCsvData(InputStream is, char separator, boolean trimValues, boolean skipEmptyLines)
throws IOException
{
ArrayList<String[]> data = new ArrayList<String[]>();
ArrayList<String> row = new ArrayList<String>();
StringBuffer value = new StringBuffer();
int ch = -1;
int prevCh = -1;
boolean inQuotedValue = false;
boolean quoteAtStart = false;
boolean rowIsEmpty = true;
boolean isEOF = false;
while (true)
{
prevCh = ch;
ch = (isEOF) ? -1 : is.read();
// Handle carriage return line feed
if (prevCh == '\r' && ch == '\n')
{
continue;
}
if (inQuotedValue)
{
if (ch == -1)
{
inQuotedValue = false;
isEOF = true;
}
else
{
value.append((char)ch);
if (ch == '"')
{
inQuotedValue = false;
}
}
}
else if (ch == separator || ch == '\r' || ch == '\n' || ch == -1)
{
// Add the value to the row
String s = value.toString();
if (quoteAtStart && s.endsWith("\""))
{
s = s.substring(1, s.length() - 1);
}
if (trimValues)
{
s = s.trim();
}
rowIsEmpty = (s.length() > 0) ? false : rowIsEmpty;
row.add(s);
value.setLength(0);
if (ch == '\r' || ch == '\n' || ch == -1)
{
// Add the row to the result
if (!skipEmptyLines || !rowIsEmpty)
{
data.add(row.toArray(new String[0]));
}
row.clear();
rowIsEmpty = true;
if (ch == -1)
{
break;
}
}
}
else if (prevCh == '"')
{
inQuotedValue = true;
}
else
{
if (ch == '"')
{
inQuotedValue = true;
quoteAtStart = (value.length() == 0) ? true : false;
}
value.append((char)ch);
}
}
return data.toArray(new String[0][]);
}
Unit Test:
String[][] data = parseCsvData(new ByteArrayInputStream("foo,\"\",,\"bar\",\"\"\"music\"\"\",\"carriage\r\nreturn\",\"new\nline\"\r\nnext,line".getBytes()), ',', true, true);
for (int rowIdx = 0; rowIdx < data.length; rowIdx++)
{
System.out.println(Arrays.asList(data[rowIdx]));
}
generates the output:
[foo, , , bar, "music", carriage
return, new
line]
[next, line]
If you absolutely must use Scanner, then you must set its delimiter via its useDelimiter(...) method. Else it will default to using all white space as its delimiter. Better though as has already been stated -- use a CSV library since this is what they do best.
For example, this delimiter will split on commas with or without surrounding whitespace:
scanner.useDelimiter("\\s*,\\s*");
Please check out the java.util.Scanner API for more on this.
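As a small illustration of that delimiter, here is a sketch that tokenizes a single line held in a String (the class and method names are made up for this example). It assumes one record per call: when the Scanner reads a whole file, line breaks are not part of this delimiter, so the last field of one line and the first field of the next end up in the same token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerDelimiterSketch {
    // Split one line on commas, swallowing any whitespace around each comma.
    public static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(line)) {
            scanner.useDelimiter("\\s*,\\s*");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Jane , Smith,4455 Roger Cir , apt 2"));
    }
}
```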
Well, I do my coding in NetBeans 8.1:
First: Create a new project, select Java application and name your project.
Then modify your code after public class to look like the following:
/**
 * @param args the command line arguments
 * @throws java.io.FileNotFoundException
 */
public static void main(String[] args) throws FileNotFoundException {
try (Scanner scanner = new Scanner(new File("C:\\Users\\YourName\\Folder\\file.csv"))) {
scanner.useDelimiter(",");
while(scanner.hasNext()){
System.out.print(scanner.next()+"|");
}}
}
}
I am trying to run a mapreduce job on hadoop which reads the fifth entry of a tab delimited file (fifth entry are user reviews) and then do some sentiment analysis and word count on them.
However, as you know with user reviews, they usually include line breaks and empty lines. My code iterates through the words of each review to find keywords and check sentiment if keyword is found.
The problem is that as the code iterates through the review, it throws an ArrayIndexOutOfBoundsException because of these line breaks and empty lines within one review.
I have tried using replaceAll("\r", " ") and replaceAll("\n", " ") to no avail.
I have also tried if(tokenizer.countTokens() == 2){
word.set(tokenizer.nextToken());}
else {
}
also to no avail. Below is my code:
public class KWSentiment_Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
ArrayList<String> keywordsList = new ArrayList<String>();
ArrayList<String> posWordsList = new ArrayList<String>();
ArrayList<String> tokensList = new ArrayList<String>();
int e;
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] line = value.toString().split("\t");
String Review = line[4].replaceAll("[\\-\\+\\\\)\\.\\(\"\\{\\$\\^:,]", "").toLowerCase();
StringTokenizer tokenizer = new StringTokenizer(Review);
while (tokenizer.hasMoreTokens()) {
// 1- first read the review line and store the tokens in an arraylist, 2-
// iterate through review to check for KW if found
// 3-check if there's PosWord near (upto +3 and -2)
// 4- setWord & context.write 5- null the review line arraylist
String CompareString = tokenizer.nextToken();
tokensList.add(CompareString);
}
{
for (int i = 0; i < tokensList.size(); i++)
{
for (int j = 0; j < keywordsList.size(); j++) {
boolean flag = false;
if (tokensList.get(i).startsWith(keywordsList.get(j)) == true) {
for (int e = Math.max(0, i - 2); e < Math.min(tokensList.size(), i + 4); e++) {
if (posWordsList.contains(tokensList.get(e))) {
word.set(keywordsList.get(j));
context.write(word, one);
flag = true;
break; // breaks out of e loop }}
}
}
}
if (flag)
break;
}
}
tokensList.clear();
}
}
Expected results are such that:
Take these two cases of reviews where error occurs:
Case 1: "Beautiful and spacious!
I highly recommend this place and great host."
Case 2: "The place in general was really silent but we didn't feel stayed.
Aside from this, the bathroom is big and the shower is really nice but there problem. "
The system should read the whole review as one line and iterate through the words in it. However, it just stops as it finds a line break or an empty line as in case 2.
Case 1 should be read such as: "Beautiful and spacious! I highly recommend this place and great host."
Case 2 should be:"The place in general was really silent but we didn't feel stayed. Aside from this, the bathroom is big and the shower is really nice but there problem. "
I am running out of time and would really appreciate help here.
Thanks!
So, I hope I am understanding what you are trying to do.
If I am reading what you have above correctly, the value of 'value' passed into your map function above contains the delimited value that you would like to parse the user reviews out of. If that is the case, I believe we can make use of the escaping functionality in the opencsv library using tabs as your delimiting character instead of commas to correctly populate the user review field:
http://opencsv.sourceforge.net
In this example we are reading one line from the input that is passed in, parsing it into 'columns' based on the tab character, and placing the results in the 'nextLine' array. This allows us to use the escaping functionality of the CSVReader without reading an actual file, using the value of the text passed into your map function instead.
StringReader reader = new StringReader(value.toString());
CSVReader csvReader = new CSVReader(reader, '\t', '\"', '\\', 0);
String [] nextLine = csvReader.readNext();
if(nextLine != null && nextLine.length >= 5) {
// Do some stuff
}
In the example that you pasted above, I think even the split("\t") will be problematic: a tab inside a user review splits the review into two entries in the result, in addition to new lines being treated as new records. Both of these characters are legal, however, as long as they are inside a quoted value (as they should be in a properly escaped file, and as they are in your example). CSVReader should handle all of these.
Validate each line at the start of the map method, so that you know line[4] exists and isn't null.
if (value == null || value.toString() == null) {
    return;
}
String[] line = value.toString().split("\t");
// arrays expose length as a field, not a method
if (line == null || line.length < 5 || line[4] == null) {
    return;
}
As for line breaks, you'll need to show some sample input. By default MapReduce passes each line into the map method independently, so if you do want to read multiple lines as one message, you'll have to write a custom InputSplit, or pre-format your data so that all data for each review is on the same line.
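The pre-formatting option could look roughly like the sketch below, run as a separate step before the MapReduce job. It relies on an assumption taken from the two sample cases: every review is wrapped in double quotes, so a physical line with an odd number of quotes opens or closes a review. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecordJoinerSketch {
    // Returns true when the line contains an odd number of double quotes.
    private static boolean hasOddQuoteCount(String line) {
        int quotes = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 != 0;
    }

    // Merge physical lines into logical records, replacing inner line breaks with spaces.
    public static List<String> joinRecords(List<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : physicalLines) {
            if (current.length() > 0) {
                current.append(' '); // the line break inside a review becomes a space
            }
            current.append(line);
            if (hasOddQuoteCount(line)) {
                insideQuotes = !insideQuotes;
            }
            if (!insideQuotes && current.length() > 0) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // unterminated quote: keep what was read
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1\ta\tb\tc\t\"Beautiful and spacious!",
                "I highly recommend this place and great host.\"",
                "2\ta\tb\tc\t\"Nice.\"");
        for (String record : joinRecords(lines)) {
            System.out.println(record);
        }
    }
}
```

Blank lines outside a quoted review are dropped, and blank lines inside one simply add an extra space.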
Before writing this I searched for documentation that tells me which characters will corrupt the final CSV generated from Java. I didn't find anything good and complete. I wrote a method to replace a minimal set of problematic characters in a string before creating the CSV:
public static String getPureNote(String dirtyNote) {
    StringBuffer s = new StringBuffer();
    for (int i = 0; i < dirtyNote.length(); i++) {
        char c = dirtyNote.charAt(i);
        if (c == '\n') { // a newline starts a new row in my CSV; I want the text to stay in a single cell
            s.append(" ");
        } else if (c == '\r') {
            s.append(" ");
        } else if (c == '\t') { // a tab shows up as a huge gap
            s.append(" ");
        } else if (c == ';') { // a semicolon makes the text continue in the adjacent cell
            s.append(",");
        } else {
            s.append(c);
        }
    }
    return s.toString();
}
String Example and CSV look like:
ok (using the getPureNote method):
Com Code Desc Struct Note
62 001 first 1 first structure on
63 002 second 2 second structure off
ko (if the note has a \n character after "structure" and the method is not used):
Com Code Desc Struct Note
62 001 first 1 first structure
on
63 002 second 2 second structure
off
This method is OK for now, but I want to know which characters I should always replace or remove from a string before creating a CSV file. I can't test every possible character that might corrupt my CSV file. Final users will open it with a double click rather than importing it into Excel.
Thank you
You are missing quotes (i.e. "). You can probably replace those by single quotes (i.e. ').
However, if your value contains a delimiter already (i.e. the comma: ,) you will have to enclose the entire value within quotes at the end.
Looking at your code, you should do this:
boolean wrapInQuotes = false;
int recordStart = 0;
for (int i = 0; i < dirtyNote.length(); i++) {
char c = dirtyNote.charAt(i);
... // your original code here
} else if (c == ',') { //value contains comma, we need to put it in quotes.
s.append(c);
wrapInQuotes = true;
} else if (c == ';') { //looks like you want to create a new record
if(wrapInQuotes){
s.insert(recordStart, '"'); //puts a quote before the field
s.append('"'); //puts the closing quote after the field
}
s.append(",");
recordStart = s.length();
wrapInQuotes = false; //starts over
} else if (c == '"') {
s.append('\''); //replace double quotes by single quotes.
} else {
s.append(c);
}
} //end of the for loop
if(wrapInQuotes){ //after the loop, wrap the last field once
s.insert(recordStart, '"'); //puts a quote before the field
s.append('"'); //puts the closing quote after the field
}
I didn't actually test this but it should do the trick. As you can see processing CSV is not exactly straightforward. If things get too tricky or slow maybe try using a CSV library such as univocity-parsers to do the job for you (I'm the author of this library by the way).
Hope this helps
I am trying to write a Java program that takes a string from the user as input, removes the special characters in it, and displays each resulting substring on a new line.
Let's say I have the string Abc#xyz,2016!horrible_just?kidding. After reading it, my program should display the following output, with the special characters removed:
Abc
xyz
2016
horrible
just
kidding
Now, I know there are already APIs available in Java to do this, such as Matcher and Pattern. But I don't want to use them: I am a beginner in Java, so I am trying to crack the code bit by bit.
This is what I have tried so far. I take the string from the user, store the special characters in an array, and compare each character against that array until I hit a special character. The other characters are collected in a StringBuilder.
Here is my code
import java.util.*;
class StringTokens{
public void display(String string){
StringBuilder stringToken = new StringBuilder();
stringToken.setLength(0);
char[] str = {' ','!',',','?','.','_','#'};
for(int i=0;i<string.length();i++){
for(int j =0;j<str.length;j++){
if((int)string.charAt(i)!=(int)str[j]){
stringToken.append(str[j]);
}
else {
System.out.println(stringToken.toString());
stringToken.setLength(0);
}
}
}
}
public static void main(String[] args){
if(args.length!=1)
System.out.println("Enter only one line string");
else{
StringTokens st = new StringTokens();
st.display(args[0]);
}
}
}
When I run this code I only get the special characters; I don't get each substring on a new line.
One easy way - use a set to hold all invalid characters:
Set<Character> invalidChars = new HashSet<>(Arrays.asList('$', ...));
Then your check boils down to:
if (invalidChars.contains(string.charAt(i))) {
... invalid char
} else {
... valid char
}
But of course, that still means: you are re-inventing the wheel. And one does only re-invent the wheel, if one has very good reasons to. One valid reason would be: your assignment is to implement your own solution.
But otherwise: just read about replaceAll. That method does what your current code and my solution are trying to do, but in a straightforward way that every good Java programmer will be able to understand!
So, to match your question: yes, you can implement this yourself. But the next step is to figure the "canonical" solution to the problem. When you learn Java, then you also have to focus on learning how to do things "like everybody else", with least amount of code to solve the problem. That is one of the core beauties of Java: for 99% of all problems, there is already a straight-forward, high-performance, everybody-will-understand solution out there; most often directly in the Java standard libraries themselves! And knowing Java means to know and understand those solutions.
Every C coder can put down 150 lines of low-level array iterating code in Java, too. The true virtue is to know the ways of doing the same thing with 5 or 10 lines!
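For comparison, a replaceAll-based version might look like the sketch below. The character class holds the same special characters as the array in the question; the class and method names are made up for this example. Note that input starting or ending with a special character would produce a leading or trailing line break.

```java
public class ReplaceAllSketch {
    // Replace every run of special characters with a single line break.
    public static String onePerLine(String input) {
        return input.replaceAll("[ !,?._#]+", "\n");
    }

    public static void main(String[] args) {
        System.out.println(onePerLine("Abc#xyz,2016!horrible_just?kidding"));
    }
}
```

Inside a character class, `.` and `?` are literal characters, so they need no escaping here.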
I can't comment because I don't have the reputation required. Currently you are appending str[j] which represents special character. Instead you should be appending string.charAt(i). Hope that helps.
stringToken.append(str[j]);
should be
stringToken.append(string.charAt(i));
Here is corrected version of your code, but there are better solutions for this problem.
public class StringTokens {
static String specialChars = new String(new char[]{' ', '!', ',', '?', '.', '_', '#'});
public static void main(String[] args) {
if (args.length != 1) {
System.out.println("Enter only one line string");
} else {
display(args[0]);
}
}
public static void display(String string) {
StringBuilder stringToken = new StringBuilder();
stringToken.setLength(0);
for(char c : string.toCharArray()) {
if(!specialChars.contains(String.valueOf(c))) {
stringToken.append(c);
} else {
stringToken.append('\n');
}
}
System.out.println(stringToken);
}
}
public static void main(String[] args) {
String a=",!?#_."; //Add other special characters too
String test="Abc#xyz,2016!horrible_just?kidding"; //Make this as user input
for(char c : test.toCharArray()){
if(a.contains(c+""))
{
System.out.println(); //to avoid printing the special character and to print newline
}
else{
System.out.print(c);
}
}
}
You can run a simple loop and check the ASCII value of each character. If it is anything other than A-Z, a-z or 0-9, print a newline, skip the character and move on. Time complexity is O(n), and no extra classes are used.
String str = "Abc#xyz,2016!horrible_just?kidding";
char[] charArray = str.toCharArray();
boolean flag = true; // true while the previous character was printed
for (int i = 0; i < charArray.length; i++) {
    char c = charArray[i];
    // digits are kept as well, so that "2016" appears in the output
    boolean keep = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    if (keep) {
        System.out.print(c);
        flag = true;
    } else if (flag) {
        // print a single newline for a run of special characters
        System.out.println();
        flag = false;
    }
}
Hi, I have an assignment that I don't really understand how to pull off.
I've been programming in Java for 2.5 weeks, so I'm really new.
I'm supposed to import a text document into my program and then perform these operations: count the letters, count the sentences, and compute the average word length. I have to perform the counting letter by letter; I'm not allowed to scan the entire document at once. I've managed to import the text and print it out, but my problem is that I can't use my string "line" for any of these operations. I've tried converting it to arrays and strings, and after a lot of failed attempts I'm giving up. So how do I convert my input into something I can use? I always get the error message "line is not a variable" or something like that.
Jesper
UPDATE WITH MY SOLUTION! also some of it is in Swedish, sorry for that.
Somehow the formatting is wrong, so I uploaded the code here instead; I really don't feel like arguing with this right now!
http://txs.io/3eIb
To count letters, check each character. If it's a space or punctuation, ignore it. Otherwise, it's a letter and we should increment the letter count.
Every word should have a space after it unless it is the last word of the sentence. To get the number of words, track the number of spaces + number of sentences. To get number of sentences, find the number of ! ? and .
I would do that by looking at the ascii value of each character.
int numLetters = 0;
int numSentences = 0;
int numWords = 0;
while (line = ...) { // read the next line, as in your existing loop
    for (int i = 0; i < line.length(); i++) {
        int curCharAsc = (int) line.charAt(i); // get ASCII value by casting char to int
        if ((curCharAsc >= 65 && curCharAsc <= 90) || (curCharAsc >= 97 && curCharAsc <= 122)) { // uppercase or lowercase letter
            numLetters++;
        }
        if (curCharAsc == 32) { // ASCII for space
            numWords++;
        } else if (curCharAsc == 33 || curCharAsc == 46 || curCharAsc == 63) { // '!', '.', '?'
            numWords++;
            numSentences++;
        }
    }
}
double avgWordLength = ((double) numLetters) / numWords; // cast to double before dividing to avoid integer division
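A self-contained variant of this character-by-character counting is sketched below. It deliberately swaps in a slightly different technique: instead of counting spaces, it tracks whether the previous character was inside a word, so a space after a full stop is not counted as an extra word. It also uses Character.isLetter rather than ASCII ranges, which should also cover letters such as å, ä and ö. The class and method names are made up for this example, and an apostrophe still splits a word such as "didn't" in two.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class TextStatsSketch {
    // Reads one character at a time and returns {letters, words, sentences}.
    public static int[] count(Reader reader) {
        int letters = 0;
        int words = 0;
        int sentences = 0;
        boolean inWord = false;
        try {
            int ch;
            while ((ch = reader.read()) != -1) {
                if (Character.isLetter(ch)) {
                    letters++;
                    if (!inWord) {
                        words++; // first letter of a new word
                        inWord = true;
                    }
                } else {
                    inWord = false;
                    if (ch == '.' || ch == '!' || ch == '?') {
                        sentences++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new int[] {letters, words, sentences};
    }

    public static void main(String[] args) {
        int[] stats = count(new StringReader("Hello world. Bye now!"));
        double avgWordLength = stats[1] == 0 ? 0.0 : (double) stats[0] / stats[1];
        System.out.println("Letters: " + stats[0]);
        System.out.println("Words: " + stats[1]);
        System.out.println("Sentences: " + stats[2]);
        System.out.println("Average word length: " + avgWordLength);
    }
}
```

For a file, the StringReader can be replaced with a BufferedReader wrapped around a FileReader.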
Your code as presented works fine, it loads a file and prints out the contents line by line. What you probably need to do is capture each of those lines. Java has two useful classes for this StringBuilder or StringBuffer (pick one).
BufferedReader input = new BufferedReader(new FileReader(args[0]));
String line;
StringBuffer buffer = new StringBuffer();
while ((line = input.readLine()) != null) {
System.out.println(line);
buffer.append(line+" ");
}
input.close();
performOperations(buffer.toString());
The only other possibility is (if your own code is not running for you) - possibly you aren't passing the input file name as a parameter when you run this class?
UPDATE
NB - I've modified the line
buffer.append(line+"\n");
to add a space instead of a line break, so that it is compatible with the algorithms in @faraza's answer.
The method performOperations doesn't exist yet. So you should / could add something like this
public static void performOperations(String data){
}
Your method could in turn call separate methods for each operation:
public static void performOperations(String data){
countWords(data);
countLetters(data);
averageWordLength(data);
}
To take it to the next level, and introduce Object Orientation, you could create a class TextStatsCollector.
public class TextStatsCollector {
    private final String data;

    public TextStatsCollector(final String data) {
        this.data = data;
    }

    public int countWords() {
        // word count impl here
        return 0; // placeholder so the skeleton compiles
    }

    public int countLetters() {
        // letter count impl here
        return 0; // placeholder so the skeleton compiles
    }

    public double averageWordLength() {
        // average word length impl here
        return 0.0; // placeholder; a double avoids truncating the average
    }

    public void performOperations() {
        System.out.println("Number of Words is " + countWords());
        System.out.println("Number of Letters is " + countLetters());
        System.out.println("Average word length is " + averageWordLength());
    }
}
Then you could use TextStatsCollector like the following in your main method
new TextStatsCollector(buffer.toString()).performOperations();