I am trying to read a CSV file (which is supposed to have a header) in Spark and load the data into an existing table (with predefined columns and datatypes). The CSV file can be very large, so it would be great if I could avoid loading it when the column header from the CSV is not "valid".
When I'm currently reading the file, I'm specifying a StructType as the schema, but this does not validate that the header contains the right columns in the right order.
This is what I have so far (I'm building the "schema" StructType in another place):
sqlContext
.read()
.format("csv")
.schema(schema)
.load("pathToFile");
If I add the .option("header", "true") line, it will skip over the first line of the CSV file and use the names I'm passing in the StructType's add method (e.g. if I build the StructType with "id" and "name" and the first row in the CSV is "idzzz,name", the resulting dataframe will have columns "id" and "name"). I want to be able to validate that the CSV header has the same column names as the table I'm planning on loading the CSV into.
I tried reading the file with .head(), and doing some checks on that first row, but that downloads the whole file.
Any suggestion is more than welcome.
From what I understand, you want to validate the schema of the CSV you read. The problem with the schema option is that its goal is to tell Spark that this is the schema of your data, not to check that it actually is.
There is, however, an option (inferSchema) that infers said schema when reading a CSV, and that could be very useful in your situation. Then you can either compare that schema with the one you expect using equals, or use the small workaround I will introduce below to be a little more permissive.
Let's see how it works with the following file:
a,b
1,abcd
2,efgh
Then, let's read the data. I used the Scala REPL, but you should be able to convert all of that to Java very easily.
val df = spark.read
.option("header", true) // reading the header
.option("inferSchema", true) // infering the sschema
.csv(".../file.csv")
// then let's define the schema you would expect
val schema = StructType(Array(StructField("a", IntegerType),
StructField("b", StringType)))
// And we can check that the schema spark inferred is the same as the one
// we expect:
schema.equals(df.schema)
// res14: Boolean = true
Going further
That's in a perfect world. Indeed, if your schema contains non-nullable columns, for instance, or has other small differences, this solution based on strict object equality will not work.
val schema2 = StructType(Array(StructField("a", IntegerType, false),
StructField("b", StringType, true)))
// the first column is non nullable, it does not work because all the columns
// are nullable when inferred by spark:
schema2.equals(df.schema)
// res15: Boolean = false
In that case you may need to implement a schema comparison method that suits you, like:
def equalSchemas(s1: StructType, s2: StructType): Boolean = {
  s1.size == s2.size &&
    s1.indices.forall(i =>
      s1(i).name.equalsIgnoreCase(s2(i).name) &&
      s1(i).dataType == s2(i).dataType)
}
equalSchemas(schema2, df.schema)
// res23: Boolean = true
I am checking that the names and the types of the columns match and that the order is the same. You may need to implement different logic depending on what you want.
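Since your code is Java, here is a rough sketch of the same permissive comparison converted to Java (untested; it only relies on the StructType/StructField accessors of the Spark Java API):
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sketch: names are compared case-insensitively, nullability is ignored,
// order and data types must match.
public static boolean equalSchemas(StructType s1, StructType s2) {
    StructField[] f1 = s1.fields();
    StructField[] f2 = s2.fields();
    if (f1.length != f2.length) {
        return false;
    }
    for (int i = 0; i < f1.length; i++) {
        if (!f1[i].name().equalsIgnoreCase(f2[i].name())
                || !f1[i].dataType().equals(f2[i].dataType())) {
            return false;
        }
    }
    return true;
}
Calling equalSchemas(schema, df.schema()) should then behave like the Scala version above.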
I have a PCollection reading data from AvroIO. I want to apply an aggregation such that, after grouping by a specific key, I count the distinct values of some fields within each group.
With plain Pig or SQL this is just a GROUP BY plus a distinct count, but I am unable to properly understand how to do it in Beam.
So far I have been able to write this:
Schema schema = new Schema.Parser().parse(new File(options.getInputSchema()));
Pipeline pipeline = Pipeline.create(options);
PCollection<GenericRecord> inputData = pipeline.apply(AvroIO.readGenericRecords(schema).from(options.getInput()));
PCollection<Row> filteredData = inputData.apply(Select.fieldNames("user_id", "field1", "field2"));
PCollection<Row> groupedData = filteredData.apply(Group.byFieldNames("user_id")
    .aggregateField("field1", Count.perElement(), "out_field1")
    .aggregateField("field2", Count.perElement(), "out_field2"));
But aggregateField does not accept these arguments.
Can someone help with the correct way to do this?
Thanks!
You can replace Count.perElement() with CountCombineFn(), which is a subclass of the CombineFn class:
filteredData.apply(Group.byFieldNames("user_id")
.aggregateField("field1", CountCombineFn(), "out_field1")
.aggregateField("field2", CountCombineFn(), "out_field2"));
I have imported data from a CSV file in Flink (Java). One of the attributes (result) I had to import as a String because of parsing errors. Now I want to convert that String to a Double, but I don't know how to do this with an object of the TableSource, Table or DataSet class. See my code below.
I've looked into the Flink documentation and tried some solutions with the Map and FlatMap classes, but I did not find a solution.
BatchTableEnvironment tableEnv = BatchTableEnvironment.create(fbEnv);
//Get H data from CSV file.
TableSource csvSource = CsvTableSource.builder()
.path("Path")
.fieldDelimiter(";")
.field("ID", Types.INT())
.field("result", Types.STRING())
.field("unixDateTime", Types.LONG())
.build();
// register the TableSource
tableEnv.registerTableSource("HTable", csvSource);
Table HTable = tableEnv.scan("HTable");
DataSet<Row> result = tableEnv.toDataSet(HTable, Row.class);
I think it should work to use a combination of replace and cast to convert the strings to doubles, as in "SELECT id, CAST(REPLACE(result, ',', '.') AS DOUBLE) AS result, ..." or the equivalent using the table API.
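Continuing from the snippet above, a minimal sketch of that suggestion using sqlQuery on the registered table (column names taken from the CsvTableSource definition; the comma-to-dot REPLACE assumes the source uses a comma as the decimal separator):
// Normalize the decimal separator and cast the String column to DOUBLE;
// the other columns are passed through unchanged.
Table converted = tableEnv.sqlQuery(
    "SELECT ID, CAST(REPLACE(result, ',', '.') AS DOUBLE) AS result, unixDateTime " +
    "FROM HTable");

DataSet<Row> convertedResult = tableEnv.toDataSet(converted, Row.class);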
I am reading a CSV file which has a duplicate column.
I want to preserve the names of the columns in the dataframe.
I tried setting the spark.sql.caseSensitive option to true in my SparkContext conf, but unfortunately it has no effect.
The duplicate column name is NU_CPTE. Spark tries to rename it by appending the column numbers 0 and 7:
NU_CPTE0|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE7
SparkSession spark= SparkSession
.builder()
.master("local[2]")
.appName("Application Test")
.getOrCreate();
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
Dataset<Row> df=spark.read().option("header","true").option("delimiter",";").csv("FILE_201701.csv");
df.show(10);
I want something like this as result:
NU_CPTE|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE
Spark was changed to allow duplicate column names by appending a number to them. Hence you are getting the numbers appended to the duplicate column names. Please see the link below:
https://issues.apache.org/jira/browse/SPARK-16896
The way you're trying to set the caseSensitive property will indeed be ineffective. Try replacing:
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
with:
spark.sql("set spark.sql.caseSensitive=true");
However, this still assumes your original columns have some sort of difference in casing. If they have the same casing, they will still be identical and will be suffixed with the column number.
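Alternatively, the property can be set when the session is built, so it is already in effect for the read. A sketch based on the question's own code (as noted, it only helps if the duplicate names differ in casing):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Set spark.sql.caseSensitive at session creation time instead of afterwards.
SparkSession spark = SparkSession
        .builder()
        .master("local[2]")
        .appName("Application Test")
        .config("spark.sql.caseSensitive", "true")
        .getOrCreate();

Dataset<Row> df = spark.read()
        .option("header", "true")
        .option("delimiter", ";")
        .csv("FILE_201701.csv");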
I have an RDD in memory. I would like to group the RDD using some arbitrary function and then write out each individual group as an individual Parquet file.
For instance, if my RDD is comprised of JSON strings of the form:
{"type":"finish","resolution":"success","csr_id": 214}
{"type":"create","resolution":"failure","csr_id": 321}
{"type":"action","resolution":"success","csr_id": 262}
I would want to group the JSON strings by the "type" property, and write each group of strings with the same "type" to the same Parquet file.
I can see that the DataFrame API enables writing out Parquet files as follows (for instance if the RDD is comprised of JSON Strings):
final JavaRDD<String> rdd = ...
final SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
final DataFrame dataFrame = sqlContext.read().json(rdd);
dataFrame.write().parquet(location);
This would mean that the entire DataFrame is written to the Parquet file though, so the Parquet file would contain records with different values for the "type" property.
The DataFrame API also supplies a groupBy function:
final GroupedData groupedData = dataFrame.groupBy(this::myFunction);
But the GroupedData API does not appear to provide any function for writing each group out to an individual file.
Any ideas?
You cannot write GroupedData but you can partition data on write:
dataFrame.write.partitionBy("type").format("parquet").save("/tmp/foo")
Each type will be written to its own directory with ${column}=${value} format. These can be loaded separately:
sqlContext.read.parquet("/tmp/foo/type=action").show
// +------+----------+
// |csr_id|resolution|
// +------+----------+
// | 262| success|
// +------+----------+
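Since the question's snippets are Java (Spark 1.x DataFrame API), here's a hedged sketch of the same write and read-back, continuing from the question's dataFrame and sqlContext variables; /tmp/foo is just a placeholder path:
// Write one directory per distinct value of "type" under /tmp/foo.
dataFrame.write().partitionBy("type").format("parquet").save("/tmp/foo");

// Read a single group back; the partition column is encoded in the path.
final DataFrame actions = sqlContext.read().parquet("/tmp/foo/type=action");
actions.show();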
I have data available to me in CSV files. Each CSV is different from the others, i.e. the column names are different. For example in FileA the unique identifier is called ID but in FileB it is called UID. Similarly, in FileA the amount is called AMT but in FileB it is called CUST_AMT. The meaning is the same but the column names are different.
I want to create a general solution for saving this varying data from CSV files into a DB table. The solution must take into consideration additional formats that may become available in future.
Is there a best approach for such a scenario?
There are many solutions to this problem. But I think the easiest might be to generate a mapping from each input file format to a combined row format. You could create a configuration file that has column name to database field name mappings, and create a program that, given a CSV and a mapping file, can insert all the data into the database.
However, you would still have to alter the table for every new column you want to add.
More design work would require more details on how the data will be used after it enters the database.
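As a rough illustration of that mapping idea (the file names, keys, and class are all hypothetical): each input format gets a small properties file mapping its CSV header names to the table's column names, and the loader resolves columns through that map before building its INSERT statement.
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical mapping files, e.g. fileA.properties:
//   ID=id
//   AMT=amount
// and fileB.properties:
//   UID=id
//   CUST_AMT=amount
public class ColumnMapping {

    // Loads "csv header name -> database column name" pairs from a properties file.
    static Map<String, String> load(String mappingFile) throws IOException {
        Properties props = new Properties();
        try (FileReader reader = new FileReader(mappingFile)) {
            props.load(reader);
        }
        Map<String, String> mapping = new HashMap<>();
        props.stringPropertyNames()
             .forEach(key -> mapping.put(key, props.getProperty(key)));
        return mapping;
    }
}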
I can think of the "Chain of Responsibility" pattern for the start of the execution: you read the header and let the chain of responsibility pick the appropriate parser for that file.
Code could look like this:
interface Parser {
// returns true if this parser recognizes this format.
boolean accept(String fileHeader);
// Each parser can convert a line in the file into insert parameters to be
// used with PreparedStatement
Object[] getInsertParameters(String row);
}
This allows you to add new file formats by adding a new Parser object to the chain.
You would first initialize the Chain as follows:
List<Parser> parserChain = new ArrayList<Parser>();
parserChain.add(new ParserImplA());
parserChain.add(new ParserImplB());
....
Then you will use it as follows:
// read the header row from file
Parser getParser(String header) {
    for (Parser parser : parserChain) {
        if (parser.accept(header)) {
            return parser;
        }
    }
    throw new IllegalArgumentException("Unrecognized format!");
}
Then you can create a PreparedStatement for inserting a row into the table. Processing each row of the file then comes down to binding the values returned by parser.getInsertParameters(row) and executing the insert, as shown in the sketch below.
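Spelled out with plain JDBC, binding and executing for one row could look like this sketch (preparedStatement is assumed to be an INSERT with one placeholder per returned value):
// Bind the parser's values positionally, then run the insert for this CSV row.
Object[] params = parser.getInsertParameters(row);
for (int i = 0; i < params.length; i++) {
    preparedStatement.setObject(i + 1, params[i]); // JDBC parameter indexes are 1-based
}
preparedStatement.executeUpdate();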