Knowing delimiters for CSV file - java

This may be a simple question but I have not been able to find a satisfactory answer. I am writing a class in Java that needs to take in a .csv file filled with doubles in three columns. Obviously a .csv file uses commas as the delimiters, but when I try setting them with my scanner, the scanner finds nothing. Any advice?
Scanner s = null;
try {
    s = new Scanner(source);
    //s.useDelimiter("[\\s,\r\n]+"); // This one works if I am using a .txt file
    //s.useDelimiter(", \n"); // This is what I thought would work for a .csv file
    ...
} catch (FileNotFoundException e) {
    System.err.format("FileNotFoundException: %s%n", e);
} catch (IOException e) {
    System.err.format("IOException: %s%n", e);
}
A sample input would be:
12.3 11.2 27.0
0.5 97.1 18.3
etc.
Thank you for your time!
EDIT: fixed! Found the correct delimiters and realized I was using hasNextInt() instead of hasNextDouble(). /facepalm

Consider the following:
first,second,"the third",fourth,"the,fifth"
That line should split into only five fields - the last comma is inside a quoted block, so it should not act as a delimiter.
Don't reinvent the wheel. There are open source libraries to handle this behavior.
A quick Google search yielded http://opencsv.sourceforge.net/ and I'm sure there are others.
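For example, with opencsv the quoted comma is handled for you. A minimal sketch (assuming a recent opencsv version where CSVReader.readNext() returns one row as a String[]; the file name here is made up):
import com.opencsv.CSVReader;
import java.io.FileReader;

public class QuotedCsvDemo {
    public static void main(String[] args) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader("data.csv"))) {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                // "the,fifth" comes back as a single field, quotes already stripped
                System.out.println(fields.length + " fields");
            }
        }
    }
}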

If you are trying to read each individual item, try:
s.useDelimiter(",");
Then s.next() would return an item from the CSV.

Why have you got a \n in your CSV delimiter? Java doesn't distinguish between CSV and TXT files if they have the same content.
I would think you would want
s.useDelimiter(",");
or
s.useDelimiter("[\\s]+,[\\s\r\n]*");

There are several ways to work around this:
Method 1:
Use conditional statements (if-else / switch) on the file extension. Note that Java compares string contents with equals(), not ==:
if (ext.equals("csv")) {
    s.useDelimiter("[,\\s]+");
} else if (ext.equals("txt")) {
    s.useDelimiter("[\\s,\r\n]+");
}
Method 2:
As other answers suggested, use this:
s.useDelimiter(",");

Related

PrintWriter not creating output with supposed number of rows

I have a Java program that is supposed to output data, take in data again, read it, and then output it with a few extra columns of results (so two outputs in total). To test my program I just tried to read and print out the exact same csv to see if it works. However, my first output returns 786718 rows of data, which is complete and correct, but when it gets read again for the second output, the data is cut off at row 786595, and even that row is missing some column data. The file size is also 74868KB vs 74072KB. Is this because my Java program is running out of memory, or is it a problem with Excel / the .csv file?
PrintWriter writer = null;
try {
    writer = new PrintWriter(saveFileName + " updated.csv", "UTF-8");
    for (Map.Entry<String, ArrayList> entry : readOutputCSV(saveFileName).entrySet()) {
        FindOutput.find(entry.getKey(), entry.getValue(), checkInMRTWriter);
    }
} finally {
    if (writer != null) {
        writer.flush();
        writer.close();
    }
}
The most likely reason is that you are not flushing or closing the PrintWriter. From the Java source:
public PrintWriter(OutputStream out) {
    this(out, false);
}

public PrintWriter(OutputStream out, boolean autoFlush) {
    this(new BufferedWriter(new OutputStreamWriter(out)), autoFlush);
}
You can see that PrintWriter is buffered by default.
The default buffer size is 8 KiB so if you leave this data in the buffer and don't write it out you can lose up to the last 8 KiB of your data.
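One way to guarantee the flush is try-with-resources, which closes (and therefore flushes) the PrintWriter even if an exception is thrown. A sketch based on the question's code (FindOutput and readOutputCSV come from the question; passing writer instead of checkInMRTWriter is an assumption about the intent):
try (PrintWriter writer = new PrintWriter(saveFileName + " updated.csv", "UTF-8")) {
    for (Map.Entry<String, ArrayList> entry : readOutputCSV(saveFileName).entrySet()) {
        FindOutput.find(entry.getKey(), entry.getValue(), writer);
    }
} // writer.close() runs automatically here, flushing the last buffered bytes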
Some things that might influence the result here:
input/output encoding
line separators (you might be reading a file with '\r\n' and writing '\n' back)
CSV escaping - values might be escaped or not depending on how you handle the special cases (values with newlines, commas, or quotes). You might be reading valid CSV with a parser but printing out unescaped (and broken) CSV.
whitespace. Some libraries trim whitespace automatically when parsing.
The best way to verify is to use a CSV parsing library, such as univocity-parsers, and use it to read/write your data with a fixed format configuration. Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Java: easiest way to read from an Excel-style document?

I'm trying to find the best way to read in data from a file similar to an Excel document. It doesn't necessarily need to be an actual Excel document, just any file that allows you to enter data in a grid format.
Something where I would be able to do manipulation similar to this:
String val = file.readString(column,row);
float val2 = file.readFloat(column,row);
I'm sorry, I usually try to do more research before I post a question here, but I was having a hard time finding much info. A lot of what I saw was third-party libraries that read Excel files. I'm really hoping I can avoid downloading libraries and use built-in ones if possible.
So I guess my questions in short are:
What's the most appropriate file format for this?
What's the best way to read data from that file?
The first thing that comes to my mind is CSV. CSV files are just regular text files with the .csv filename extension. Data is stored in this format:
cell,anothercell,athirdcell
anotherrow,anothercellonthenewrow,thirdcellofsecondrow
For more specifics, read the CSV specs here.
Option 1
Store your data in a CSV and read with any kind of reader (e.g. BufferedReader). This might be the easiest and fastest solution, if you want to use Excel/LibreOffice for entering data.
Please check out the answers in these threads for various solutions.
String csvfile = path;
BufferedReader br = null;
String line = "";
String csvSplitBy = ";";
try {
    br = new BufferedReader(new FileReader(csvfile));
    while ((line = br.readLine()) != null) {
        String[] values = line.split(csvSplitBy);
        // do stuff
    }
} catch (IOException e) { // FileNotFoundException is a subclass of IOException
    e.printStackTrace();
} finally {
    if (br != null) {
        try {
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Hope I didn't miss anything important.
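If you want the file.readString(column, row) style of access from the question, one option is to load every row up front and index into the result. A sketch (class and file names are made up; split() here won't handle quoted separators):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class CsvGrid {
    private final List<String[]> rows = new ArrayList<>();

    CsvGrid(String path, String separator) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                rows.add(line.split(separator));
            }
        }
    }

    String readString(int column, int row) {
        return rows.get(row)[column];
    }

    float readFloat(int column, int row) {
        return Float.parseFloat(readString(column, row));
    }
}
Usage would then look like new CsvGrid("data.csv", ";").readString(0, 0).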
Option 2
Use Apache POI, which can read and write actual Excel files (.xls / .xlsx).
Option 3
I've had some decent experience with JXL, but I understand that you don't want to include too many external libs. (I just saw that it hasn't been updated in a while. Consider the other options!)

Scanner's nextLine() only fetching a partial line

So, using something like:
for (int i = 0; i < files.length; i++) {
    if (!files[i].isDirectory() && files[i].canRead()) {
        try {
            Scanner scan = new Scanner(files[i]);
            System.out.println("Generating Categories for " + files[i].toPath());
            while (scan.hasNextLine()) {
                count++;
                String line = scan.nextLine();
                System.out.println(" ->" + line);
                line = line.split("\t", 2)[1];
                System.out.println("!- " + line);
                JsonParser parser = new JsonParser();
                JsonObject object = parser.parse(line).getAsJsonObject();
                Set<Entry<String, JsonElement>> entrySet = object.entrySet();
                exploreSet(entrySet);
            }
            scan.close();
            // System.out.println(keyset);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}
as it goes over a Hadoop output file, one of the JSON objects in the middle breaks... because scan.nextLine() is not fetching the whole line before handing it to split(). I.e., the output is:
->0 {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{ ... "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
!- {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{ ... "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
Most of the above data has been sanitized (not the URL (for the most part) however... )
and the URL continues as:
$(KGrHqZHJCgFBsO4dC3MBQdC2)Y4Tg~~60_1.JPG?set_id=8800005007
in the file....
So it's slightly miffing.
This also is entry #112, and I have had other files parse without errors... but this one is screwing with my mind, mostly because I don't see how scan.nextLine() isn't working...
By the debug output, the JSON error is caused by the string not being split properly.
And almost forgot, it also works JUST FINE if I attempt to put the offending line in its own file and parse just that.
EDIT:
Also blows up if I remove the offending line in about the same place.
Attempted with JVM 1.6 and 1.7
Workaround Solution:
BufferedReader scan = new BufferedReader(new FileReader(files[i]));
instead of Scanner...
Based on your code, the best explanation I can come up with is that the line really does end after the "~~" according to the criteria used by Scanner.nextLine().
The criteria for an end-of-line are:
Something that matches this regex: "\r\n|[\n\r\u2028\u2029\u0085]" or
The end of the input stream
You say that the file continues after the "~~", so let's put EOF aside and look at the regex. That will match any of the following:
The usual line separators:
<CR>
<NL>
<CR><NL>
... and three unusual forms of line separator that Scanner also recognizes.
0x0085 is the <NEL> or "next line" control code in the "ISO C1 Control" group
0x2028 is the Unicode "line separator" character
0x2029 is the Unicode "paragraph separator" character
My theory is that you've got one of the "unusual" forms in your input file, and this is not showing up in .... whatever tool it is that you are using to examine the files.
I suggest that you examine the input file using a tool that can show you the actual bytes of the file; e.g. the od utility on a Linux / Unix system. Also, check that this isn't caused by some kind of character encoding mismatch ... or trying to read or write binary data as text.
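If you don't have a tool like od handy, a small Java sketch along these lines could report the unusual separators (it assumes the file is UTF-8; adjust the charset if not):
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class FindOddSeparators {
    public static void main(String[] args) throws IOException {
        try (Reader in = new InputStreamReader(
                new FileInputStream(args[0]), StandardCharsets.UTF_8)) {
            int c, pos = 0;
            while ((c = in.read()) != -1) {
                // the three "unusual" separators Scanner recognizes: NEL, LS, PS
                if (c == 0x0085 || c == 0x2028 || c == 0x2029) {
                    System.out.printf("Unusual separator U+%04X at char %d%n", c, pos);
                }
                pos++;
            }
        }
    }
}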
If these don't help, then the next step should be to run your application using your IDE's Java debugger, and single-step it through the Scanner.hasNextLine() and nextLine() calls to find out what the code is actually doing.
And almost forgot, it also works JUST FINE if I attempt to put the offending line in its own file and parse just that.
That's interesting. But if the tool you are using to extract the line is the same one that is not showing the (hypothesized) unusual line separator, then this evidence is not reliable. The process of extraction may be altering the "stuff" that is causing the problems.

Java, Turn Each Row of a CSV into a string

Hello, I'm a newbie in Java using BlueJ. I have a csv file that contains a load of data in a table arrangement. I'm trying to find a way to take this information, find out how many comma-separated values there are in the first row, and then, regardless of rows, put each comma-separated value into an array.
Does anyone have any advice on how to do this?
Thanks in advance, Harry.
CSV parsing can be tricky because of the need to support quoted values. I suggest not writing your own CSV parser, but using an existing library such as http://opencsv.sourceforge.net/.
You can use the Scanner class to read each line of the file, something similar to:
// create a File object by giving the filepath
File file = new File("C:\\data.csv");
try {
    // Create a new scanner that will read the file
    Scanner scanner = new Scanner(file);
    // while the file has lines to read
    while (scanner.hasNextLine()) {
        // read the line to a string
        String line = scanner.nextLine();
        // do what you need with that line
    }
    // close the scanner to release the file handle
    scanner.close();
// catch the exception if no file can be found
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
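To get at what the question actually asks - counting the fields in the first row and then collecting every value - you could combine that loop with split(). A sketch (file name made up; a plain split() won't handle quoted commas, as the other answer notes):
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class CsvValues {
    public static void main(String[] args) throws FileNotFoundException {
        List<String> values = new ArrayList<>();
        int firstRowCount = -1;
        try (Scanner scanner = new Scanner(new File("data.csv"))) {
            while (scanner.hasNextLine()) {
                String[] fields = scanner.nextLine().split(",", -1); // -1 keeps trailing empties
                if (firstRowCount < 0) {
                    firstRowCount = fields.length; // number of values in the first row
                }
                for (String field : fields) {
                    values.add(field);
                }
            }
        }
        System.out.println("First row has " + firstRowCount + " values");
        System.out.println("All values: " + values);
    }
}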
Here is another library for handling CSV files: javacsv.
Code example is here

How to preserve correct offset of string which is read from a file

I have a text.txt file which contains the following text:
Kontagent Announces Partnership with Global Latino Social Network Quepasa
Released By Kontagent
I read this text file into a string, documentText.
documentText.substring(0,9) gives Kontagent, which is good.
But documentText.substring(87,96) gives y Kontage on Windows (IntelliJ IDEA) and gives Kontagent in a Unix environment. I am guessing this is happening because of the blank line in the file (after which the offset gets screwed up). But I cannot understand why I get two different results. I need to get the same result in both environments.
To read the file as a string I used all the functions discussed in How do I create a Java string from the contents of a file?. But I still get the same results with any of those functions.
Currently I am using this function to read the file into documentText String:
public static String readFileAsString(String fileName)
{
    File file = new File(fileName);
    StringBuilder fileContents = new StringBuilder((int) file.length());
    Scanner scanner = null;
    try {
        scanner = new Scanner(file);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    String lineSeparator = System.getProperty("line.separator");
    try {
        while (scanner.hasNextLine()) {
            fileContents.append(scanner.nextLine() + lineSeparator);
        }
        return fileContents.toString();
    } finally {
        scanner.close();
    }
}
EDIT: Is there a way to write a general function which will work in both Windows and Unix environments, even if the file is copied in text mode?
Because, unfortunately, I cannot guarantee that everyone who works on this project will always copy files in binary mode.
The Unix file probably uses the native Unix EOL char: \n, whereas the Windows file uses the native Windows EOL sequence: \r\n. Since you have two EOLs in your file, there is a difference of 2 chars. Make sure to use a binary file transfer, and all the bytes will be preserved, and everything will run the same way on both OSes.
EDIT: in fact, you are the one who appends an OS-specific EOL (System.getProperty("line.separator")) at the end of each line. Just read the file as a char array using a Reader, and everything will be fine. Or use Guava's method, which does it for you:
String s = CharStreams.toString(new FileReader(fileName));
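If you'd rather not pull in Guava, a plain-JDK sketch (Java 7+, assuming a UTF-8 file) that reads the whole file and strips the carriage returns might look like:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadNormalized {
    static String readFileAsString(String fileName) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        String text = new String(bytes, StandardCharsets.UTF_8);
        // drop \r so offsets match on Windows and Unix
        return text.replace("\r", "");
    }
}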
On Windows, a newline character \n is preceded by \r, a carriage return character. This does not happen on Linux. Transferring the file from one operating system to the other will not strip/append such characters, but occasionally text editors will auto-format them for you.
Because your file does not include \r characters (it was presumably transferred straight from Linux), System.getProperty("line.separator") on Windows returns \r\n, adding \r characters that do not exist in the original file. This is why your offsets are 2 characters off.
Good luck!
Based on the input you guys provided, I wrote something like this:
documentText = CharStreams.toString(new FileReader("text.txt"));
documentText = documentText.replaceAll("\\r", "");
to strip off the extra \r if the file has any.
Now I am getting the expected result in the Windows environment as well as Unix. Problem solved!!!
It works fine irrespective of the mode the file was copied in.
:) I wish I could choose both of your answers, but Stack Overflow doesn't allow it.
