How to read a CSV starting from the header line (inclusive) in Apache Commons CSV? - java

The following approach reads the file but skips the header:
Iterable<CSVRecord> records = CSVFormat.EXCEL.withHeader().parse(in);
for (CSVRecord record : records) {
    // here the first record is not the header
}
How can I read the CSV starting from the header line, inclusive?
P.S.
The approach
CSVFormat.EXCEL.withHeader().withSkipHeaderRecord(false).parse(in)
doesn't work and has the same behaviour.

For me, the following all seem to have the header record as the first one (using commons-csv 1.5):
Iterable<CSVRecord> records = CSVFormat.EXCEL.parse(in);
Iterable<CSVRecord> records = CSVFormat.EXCEL.withSkipHeaderRecord().parse(in); //???
Iterable<CSVRecord> records = CSVFormat.EXCEL.withSkipHeaderRecord(false).parse(in);
Iterable<CSVRecord> records = CSVFormat.EXCEL.withSkipHeaderRecord(true).parse(in); //???
And, as you have stated, the following does NOT seem to have the header record as the first one:
Iterable<CSVRecord> records = CSVFormat.EXCEL.withHeader().parse(in); //???
It is beyond my understanding why withSkipHeaderRecord() and withSkipHeaderRecord(true) include the header while withHeader() does not; it seems to be the opposite behaviour to what the method names suggest.

The withHeader() method tells the parser that the file has a header. Perhaps the method name is confusing.
The withFirstRecordAsHeader() method may also be useful.
From the CSVFormat (Apache Commons CSV 1.8 API) JavaDoc page:
Referencing columns safely
If your source contains a header record, you can simplify your code and safely reference columns, by using withHeader(String...) with no arguments:
CSVFormat.EXCEL.withHeader();
This causes the parser to read the first record and use its values as column names. Then, call one of the CSVRecord get methods that take a String column name argument:
String value = record.get("Col1");
This makes your code impervious to changes in column order in the CSV file.
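To tie this back to the original question: with withHeader() the header line is consumed as column names, so it is no longer returned as the first CSVRecord, but it is still accessible through the parser itself. A minimal sketch (assuming commons-csv 1.5+ and a hypothetical two-column input):
import java.io.Reader;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class HeaderInclusiveRead {
    public static void main(String[] args) throws Exception {
        // hypothetical input with a header line
        Reader in = new StringReader("Col1,Col2\na,b\nc,d\n");
        CSVParser parser = CSVFormat.EXCEL.withHeader().parse(in);

        // the header line is available here, even though it is not returned as a record
        System.out.println(parser.getHeaderMap().keySet()); // [Col1, Col2]

        for (CSVRecord record : parser) {
            System.out.println(record.get("Col1") + " / " + record.get("Col2"));
        }
    }
}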

Related

Split one column to multiple columns

I have one column containing the data below. I want to split the data into multiple columns using Java code. The problem I am facing is that the string contains double quotes with a comma inside, and that value falls into another column. I need to split the data as follows (target). Can anyone help to fix this?
I/P:
Column:
abc,"test,data",valid
xyz,"sample,data",invalid
Target:
Col1|Col2|Col3
abc|"test,data"|valid
xyz|"sample,data"|invalid
I highly recommend that you use a library to handle this instead of doing it yourself.
I guess your data is in CSV format, so you should take a look at commons-csv.
You can resolve your problem with simple code:
CSVParser records = CSVParser.parse("abc,\"test,data\",valid", CSVFormat.DEFAULT);
for (CSVRecord csvRecord : records) {
    for (String value : csvRecord) {
        System.out.println(value);
    }
}
Output:
abc
test,data
valid
Read more at https://www.baeldung.com/apache-commons-csv
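Going one step further towards the "target" layout, the parsed fields can be re-emitted with a pipe delimiter using CSVPrinter. A rough sketch (the input string is inlined here for illustration, and the exact quoting of fields may differ slightly from the target):
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class SplitToPipes {
    public static void main(String[] args) throws Exception {
        String input = "abc,\"test,data\",valid\nxyz,\"sample,data\",invalid";

        // the quotes keep "test,data" together as a single field
        CSVParser records = CSVParser.parse(input, CSVFormat.DEFAULT);

        // re-emit the fields pipe-delimited, as in the target layout
        CSVPrinter printer = new CSVPrinter(System.out, CSVFormat.DEFAULT.withDelimiter('|'));
        printer.printRecord("Col1", "Col2", "Col3");
        for (CSVRecord record : records) {
            printer.printRecord(record);
        }
        printer.flush();
    }
}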

Validate CSV file columns with Spark

I am trying to read a CSV file (which is supposed to have a header) in Spark and load the data into an existing table (with predefined columns and datatypes). The CSV file can be very large, so it would be great if I could avoid reading it when the column header from the CSV is not "valid".
When I'm currently reading the file, I'm specifying a StructType as the schema, but this does not validate that the header contains the right columns in the right order.
This is what I have so far (I'm building the "schema" StructType in another place):
sqlContext
    .read()
    .format("csv")
    .schema(schema)
    .load("pathToFile");
If I add the .option("header", "true") line it will skip over the first line of the CSV file and use the names I'm passing in the StructType's add method (e.g. if I build the StructType with "id" and "name" and the first row in the csv is "idzzz,name", the resulting dataframe will have columns "id" and "name"). I want to be able to validate that the CSV header has the same column names as the table I'm planning on loading the CSV into.
I tried reading the file with .head(), and doing some checks on that first row, but that downloads the whole file.
Any suggestion is more than welcomed.
From what I understand, you want to validate the schema of the CSV you read. The problem with the schema option is that its goal is to tell Spark that it is the schema of your data, not to check that it is.
There is, however, an option (inferSchema) that infers the schema when reading a CSV, and that could be very useful in your situation. Then you can either compare that schema with the one you expect using equals, or use the small workaround I will introduce to be a little more permissive.
Let's see how it works with the following file:
a,b
1,abcd
2,efgh
Then, let's read the data. I used the scala REPL but you should be able to convert all that in Java very easily.
val df = spark.read
  .option("header", true)      // reading the header
  .option("inferSchema", true) // inferring the schema
  .csv(".../file.csv")

// then let's define the schema you would expect
val schema = StructType(Array(StructField("a", IntegerType),
                              StructField("b", StringType)))

// And we can check that the schema Spark inferred is the same as the one
// we expect:
schema.equals(df.schema)
// res14: Boolean = true
Going further
That's in a perfect world. Indeed, if your schema contains non-nullable columns, for instance, or other small differences, this solution based on strict object equality will not work.
val schema2 = StructType(Array(StructField("a", IntegerType, false),
                               StructField("b", StringType, true)))
// the first column is non nullable; it does not work because all the columns
// are nullable when inferred by spark:
schema2.equals(df.schema)
// res15: Boolean = false
In that case you may need to implement a schema comparison method that would suit you like:
def equalSchemas(s1: StructType, s2: StructType) = {
  s1.indices
    .map(i => s1(i).name.toUpperCase.equals(s2(i).name.toUpperCase) &&
              s1(i).dataType.equals(s2(i).dataType))
    .reduce(_ && _)
}
equalSchemas(schema2, df.schema)
// res23: Boolean = true
I am checking that the names and the types of the columns match and that the order is the same. You may need to implement different logic depending on what you want.
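Since the question itself is in Java, here is a rough Java equivalent of the Scala snippet above (assuming Spark 2.x; the expected columns "a" and "b" are just placeholders, not your real schema):
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class SchemaCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("schema-check").getOrCreate();

        // read the header and let Spark infer the schema
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("pathToFile");

        // the schema you expect (placeholder columns)
        StructType expected = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("a", DataTypes.IntegerType, true),
                DataTypes.createStructField("b", DataTypes.StringType, true)));

        // strict comparison; a name/type-only comparison like equalSchemas above is more permissive
        System.out.println("Schema matches: " + expected.equals(df.schema()));
    }
}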

ResultSet returns blank column in CSV

I'm using JOOQ and Postgres.
In Postgres I have a column gender:
'gender' AS gender,
(the table itself is a view and the gender column is a placeholder for a value that gets calculated in Java)
In Java when I .fetch() the view, I do some calculations on each record:
for (Record r : skillRecords) {
    idNumber = function(r);
    r.set(id, idNumber);
    r.set(gender, getGender(idNumber));
}
All looks good, and if I println the values they're all correct.
However, when I call intoResultSet() on my skillRecords, the gender column has an asterisk next to all the values, e.g. "*Male".
Then I use the ResultSet as input to an OpenCSV CSV writer, and when I open the CSV the gender column comes out as null.
Any suggestions?
UPDATE:
Following the input from Lukas regarding the asterisks, I realise the issue is likely with opencsv.
My code is as follows:
File tempFile = new File("/tmp/file.csv");
BufferedWriter out = new BufferedWriter(new FileWriter(tempFile));
CSVWriter writer = new CSVWriter(out);
// Code for getting records sits here
for (Record r : skillRecords) {
    idNumber = function(r);
    r.set(id, idNumber);
    r.set(gender, getGender(idNumber));
}
writer.writeAll(skillRecords.intoResultSet(), true);
return tempFile;
All the columns in the CSV come back as expected except the gender column, which has the header "gender" but whose values are empty.
I have the necessary try/catches in the code above but I've excluded them for brevity.
The asterisk in *Male
The asterisk that you see in the ResultSet.toString() output (or in Result.toString()) reflects the record's internal Record.changed(Field) flag, i.e. the information on each record saying that the record was modified after it was retrieved from the database (which you did).
That is just visual information which you can safely ignore.
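If the marker bothers you, the flag can also be cleared explicitly. A small sketch, purely cosmetic, reusing the skillRecords loop from the question (the values themselves are unaffected):
// Clear the internal "changed" flags so toString() no longer prefixes values with "*".
for (Record r : skillRecords) {
    r.changed(false);
}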
Solution:
So I found the solution. It turns out with postgres if I have something like:
'gender' AS gender,
The type is unknown, not text. So the solution was to define it as:
'gender'::text AS gender
After doing so OpenCSV was happy.

How to read a CSV file column wise using hadoop?

I am trying to read a CSV file which does not contain comma-separated values; these are columns for NASDAQ stocks. I want to read a particular column, say the 3rd, but I don't know how to get the column items. Is there any method to read column-wise data in Hadoop? Please help.
My CSV File Format is:
exchange stock_symbol date stock_price_open stock_price_high stock_price_low stock_price_close stock_volume stock_price_adj_close
NASDAQ ABXA 12/9/2009 2.55 2.77 2.5 2.67 158500 2.67
NASDAQ ABXA 12/8/2009 2.71 2.74 2.52 2.55 131700 2.55
Edited Here:
Column A : exchange
Column B : stock_symbol
Column C : date
Column D : stock_price_open
Column E : stock_price_high
and similarly.
These are columns and not comma-separated values. I need to read this file column-wise.
In Pig it will look like this:
Q1 = LOAD 'file.csv' USING PigStorage('\t') AS (exchange, stock_symbol, stock_date:chararray, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close);
Q2 = FOREACH Q1 GENERATE stock_date;
DUMP Q2;
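If you prefer plain MapReduce over Pig, a minimal mapper sketch could pull out a single column in much the same way (this assumes whitespace-separated columns as in the sample; the choice of output key/value here is only an illustration):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits only the third column (the date) of each data line, keyed by stock symbol.
public class ThirdColumnMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] columns = value.toString().split("\\s+");
        // skip the header line and any malformed rows
        if (columns.length > 2 && !"exchange".equals(columns[0])) {
            context.write(new Text(columns[1]), new Text(columns[2])); // stock_symbol -> date
        }
    }
}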
You could also format the Excel sheet by concatenating the columns into a single text value, using a formula like:
=CONCATENATE(A2,";",B2,";",C2,";",D2,";",E2,";",F2,";",G2,";",H2,";",I2)
and joining the columns with your required separator; I have used ; here. Use whatever separator you want.

What is the best approach to persist varying data (in multiple formats) into a common db table?

I have data available to me in CSV files. Each CSV is different from the others, i.e. the column names are different. For example, in FileA the unique identifier is called ID but in FileB it is called UID. Similarly, in FileA the amount is called AMT but in FileB it is called CUST_AMT. The meaning is the same but the column names are different.
I want to create a general solution for saving this varying data from CSV files into a DB table. The solution must take into consideration additional formats that may become available in future.
Is there a best approach for such a scenario?
There are many solutions to this problem. But I think the easiest might be to generate a mapping from each input file format to a combined row format. You could create a configuration file that has column name to database field name mappings, and create a program that, given a CSV and a mapping file, can insert all the data into the database.
However, you would still have to alter the table for every new column you want to add.
More design work would require more details on how the data will be used after it enters the database.
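As a rough illustration of that idea, the per-format mapping could live in a small properties file (the file names and the mapping format below are hypothetical):
import java.io.FileReader;
import java.util.Properties;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class MappedCsvLoader {
    public static void main(String[] args) throws Exception {
        // e.g. fileB-mapping.properties contains:  UID=id  and  CUST_AMT=amount
        Properties mapping = new Properties();
        mapping.load(new FileReader("fileB-mapping.properties"));

        Iterable<CSVRecord> records = CSVFormat.DEFAULT
                .withFirstRecordAsHeader()
                .parse(new FileReader("fileB.csv"));

        for (CSVRecord record : records) {
            for (String csvColumn : mapping.stringPropertyNames()) {
                String dbColumn = mapping.getProperty(csvColumn);
                String value = record.get(csvColumn);
                // here you would bind (dbColumn, value) into an INSERT, e.g. via PreparedStatement
                System.out.println(dbColumn + " = " + value);
            }
        }
    }
}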
I can think of the "Chain of Responsibility" pattern at the start of the execution: you read the header and let the chain of responsibility pick the appropriate parser for that file.
Code could look like this:
interface Parser {
    // returns true if this parser recognizes this format.
    boolean accept(String fileHeader);

    // Each parser can convert a line in the file into insert parameters to be
    // used with PreparedStatement
    Object[] getInsertParameters(String row);
}
This allows you to add new file formats by adding a new Parser object to the chain.
You would first initialize the Chain as follows:
List<Parser> parserChain = new ArrayList<Parser>();
parserChain.add(new ParserImplA());
parserChain.add(new ParserImplB());
parserChain.add(new ParserImplC());
....
Then you will use it as follows:
// read the header row from file
Parser getParser(String header) {
    for (Parser parser : parserChain) {
        if (parser.accept(header)) {
            return parser;
        }
    }
    throw new Exception("Unrecognized format!");
}
Then you can create a prepared statement for inserting a row into the table.
Processing each row of the file would then conceptually be:
preparedStatement.execute(parser.getInsertParameters(row));
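To make that last step concrete, here is a hedged sketch of one parser implementation and the JDBC binding it implies (FileAParser, the "ID,AMT" header, and the two-column insert are all hypothetical):
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;

// Hypothetical parser for "FileA"-style files whose header is "ID,AMT".
class FileAParser implements Parser {
    @Override
    public boolean accept(String fileHeader) {
        return "ID,AMT".equalsIgnoreCase(fileHeader.trim());
    }

    @Override
    public Object[] getInsertParameters(String row) {
        // map the file's columns onto the common table layout (id, amount)
        String[] cells = row.split(",", -1);
        return new Object[] { cells[0], new BigDecimal(cells[1]) };
    }
}

class RowInserter {
    // binds the parser's parameters into a prepared INSERT and executes it
    static void insertRow(Connection conn, Parser parser, String row) throws Exception {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO common_table (id, amount) VALUES (?, ?)")) {
            Object[] params = parser.getInsertParameters(row);
            for (int i = 0; i < params.length; i++) {
                ps.setObject(i + 1, params[i]);
            }
            ps.executeUpdate();
        }
    }
}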
