Duplicate column name in spark read csv - java

I am reading a CSV file that has a duplicate column name.
I want to preserve the original column names in the DataFrame.
I tried setting spark.sql.caseSensitive to true in my SparkContext configuration, but unfortunately it has no effect.
The duplicated column name is NU_CPTE. Spark renames both occurrences by appending the column index (0 and 7):
NU_CPTE0|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE7
SparkSession spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Application Test")
    .getOrCreate();
spark.sparkContext().getConf().set("spark.sql.caseSensitive", "true");
Dataset<Row> df = spark.read().option("header", "true").option("delimiter", ";").csv("FILE_201701.csv");
df.show(10);
I want something like this as result:
NU_CPTE|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE

Spark was changed to accept duplicate column names in a CSV header by appending the column index to each duplicate. That is why you are getting numbers appended to the duplicated column names. See the issue linked below:
https://issues.apache.org/jira/browse/SPARK-16896

The way you're trying to set the caseSensitive property will indeed be ineffective. Try replacing:
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
with:
spark.sql("set spark.sql.caseSensitive=true");
However, this still assumes your original columns have some sort of difference in casing. If they have the same casing, they will still be identical and will be suffixed with the column number.
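To make the renaming rule described in the two answers concrete, here is a minimal plain-Java sketch. It is only an illustration of the behaviour described above, not Spark's own code; the class and method names (DuplicateHeaderDemo, renameDuplicates) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustration only: mimics the renaming described above, not Spark's implementation.
class DuplicateHeaderDemo {

    // Appends the column position to every name that occurs more than once.
    // With caseSensitive == false, "a" and "A" count as the same name.
    static List<String> renameDuplicates(List<String> header, boolean caseSensitive) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : header) {
            counts.merge(key(name, caseSensitive), 1, Integer::sum);
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < header.size(); i++) {
            String name = header.get(i);
            boolean duplicated = counts.get(key(name, caseSensitive)) > 1;
            result.add(duplicated ? name + i : name);
        }
        return result;
    }

    private static String key(String name, boolean caseSensitive) {
        return caseSensitive ? name : name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
            "NU_CPTE", "CD_EVT_FINANCIER", "TYP_MVT_ELTR", "DT_OPERN_CLI",
            "LI_MVT_ELTR", "MT_OPERN_FINC", "FLSENS", "NU_CPTE");
        // Both NU_CPTE columns have identical casing, so they are renamed
        // regardless of the caseSensitive flag.
        System.out.println(renameDuplicates(header, true));
    }
}
```

With identical casing, the flag makes no difference: both columns are still suffixed with their position, which matches the NU_CPTE0 / NU_CPTE7 output in the question.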

Related

How to Append Two Spark Dataframes with different columns in Java

I have a DataFrame on which I apply a UDF, and the result of the UDF contains only one column.
How can I append it to the original DataFrame?
Example:
Dataframe 1:
sr_no, name, salary
Dataframe 2: the UDF output ABS(salary), a single column produced by applying the UDF to Dataframe 1
How can I get an output DataFrame that combines Dataframe 1 and Dataframe 2 in Java,
i.e. sr_no, name, salary, ABS(salary)?
It looks like you are looking for the withColumn method:
df1.withColumn("ABS(salary)", yourUdf.apply(col("salary")))
(The snippet requires a static import of col from org.apache.spark.sql.functions.)
Got the answer.
Just do it like this:
df = df.selectExpr("*", "ABS(salary)");
This returns the UDF output together with all the columns of the original DataFrame. Without the "*", it returns only the one column.

Validate CSV file columns with Spark

I am trying to read a CSV file (which is supposed to have a header) in Spark and load the data into an existing table (with predefined columns and datatypes). The csv file can be very large, so it would be great if I could avoid doing it if the columns header from the csv is not "valid".
When I read the file now, I specify a StructType as the schema, but this does not validate that the header contains the right columns in the right order.
This is what I have so far (I'm building the "schema" StructType in another place):
sqlContext
.read()
.format("csv")
.schema(schema)
.load("pathToFile");
If I add the .option("header", "true") line, it will skip over the first line of the csv file and use the names I'm passing in the StructType's add method (e.g. if I build the StructType with "id" and "name" and the first row in the csv is "idzzz,name", the resulting dataframe will have columns "id" and "name"). I want to be able to validate that the csv header has the same column names as the table I'm planning to load the csv into.
I tried reading the file with .head(), and doing some checks on that first row, but that downloads the whole file.
Any suggestions are more than welcome.
From what I understand, you want to validate the schema of the CSV you read. The problem with the schema option is that its goal is to tell spark that it is the schema of your data, and not to check that it is.
There is an option however that infers the said schema when reading a CSV and that could be very useful (inferSchema) in your situation. Then, you can either compare that schema with the one you expect with equals, or do the small workaround that I will introduce to be a little bit more permissive.
Let's see how it works with the following file:
a,b
1,abcd
2,efgh
Then, let's read the data. I used the Scala REPL, but you should be able to convert all of this to Java very easily.
import org.apache.spark.sql.types._
val df = spark.read
  .option("header", true) // reading the header
  .option("inferSchema", true) // inferring the schema
  .csv(".../file.csv")
// then let's define the schema you would expect
val schema = StructType(Array(StructField("a", IntegerType),
StructField("b", StringType)))
// And we can check that the schema spark inferred is the same as the one
// we expect:
schema.equals(df.schema)
// res14: Boolean = true
Going further
That's in a perfect world. If your schema contains non-nullable columns, for instance, or has other small differences, this solution, which relies on strict object equality, will not work.
val schema2 = StructType(Array(StructField("a", IntegerType, false),
StructField("b", StringType, true)))
// the first column is non nullable, it does not work because all the columns
// are nullable when inferred by spark:
schema2.equals(df.schema)
// res15: Boolean = false
In that case you may need to implement a schema comparison method that suits you, such as:
def equalSchemas(s1: StructType, s2: StructType): Boolean = {
  // schemas of different lengths can never match
  s1.length == s2.length &&
    s1.indices.forall { i =>
      s1(i).name.toUpperCase.equals(s2(i).name.toUpperCase) &&
        s1(i).dataType.equals(s2(i).dataType)
    }
}
equalSchemas(schema2, df.schema)
// res23: Boolean = true
Here I check that the names and the types of the columns match and that the order is the same. You may need to implement different logic depending on what you want.
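Since the question asks to avoid reading the whole file, another option may be to check the header line yourself before handing the file to Spark. Below is a minimal plain-Java sketch of that idea. It assumes the file is reachable through java.nio (for HDFS or object storage you would need the matching filesystem API instead), that the header has no quoted fields, and that names are compared case-insensitively as in equalSchemas above. CsvHeaderCheck, readHeaderLine and headerMatches are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class CsvHeaderCheck {

    // Reads only the first line of the file; the remaining rows are never loaded.
    static String readHeaderLine(Path file) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            return line == null ? "" : line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True when the header has exactly the expected names in the expected order.
    // Assumes a simple header without quoted fields or embedded delimiters.
    static boolean headerMatches(String headerLine, String delimiter, List<String> expected) {
        String[] actual = headerLine.split(Pattern.quote(delimiter), -1);
        if (actual.length != expected.size()) {
            return false;
        }
        for (int i = 0; i < actual.length; i++) {
            if (!actual[i].trim().equalsIgnoreCase(expected.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("id", "name");
        System.out.println(headerMatches("id,name", ",", expected));    // prints true
        System.out.println(headerMatches("idzzz,name", ",", expected)); // prints false
    }
}
```

This only validates column names and order, not data types; the expected names could be taken from the fieldNames() of the StructType you already build.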

What should be imported to use the "$" column syntax in a join in my Spark Java code?

I am using spark-sql 2.4.1 with Java 8.
I am trying to join two data sets as below:
computed_df.as('s).join(accumulated_results_df.as('f),$"s.company_id" === $"f.company_id","inner")
This works fine in Databricks notebooks.
But when I try to implement the same thing in my Spark Java code in my IDE, it won't recognize the "$" function/operator, even after including
import static org.apache.spark.sql.functions.*;
So what should be done to use it in my Spark Java code?
Thanks
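The answer is org.apache.spark.sql.Column. The $"columnName" syntax is Scala shorthand and is not available in Java; in Java, use col("columnName") from org.apache.spark.sql.functions instead. From the Column documentation: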
public class Column
...
df("columnName") // On a specific DataFrame.
col("columnName") // A generic column not yet associated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName" // Scala short hand for a named column.
expr("a + 1") // A column that is constructed from a parsed SQL Expression.
lit("abc") // A column that produces a literal (constant) value.

Missing leading zeroes of date in Hive partition using Spark Dataframe

I am adding partition columns to a Spark DataFrame. The new columns contain the year, month and day.
I have a timestamp column in my dataframe.
DataFrame dfPartition = df.withColumn("year", df.col("date").substr(0, 4));
dfPartition = dfPartition.withColumn("month", dfPartition.col("date").substr(6, 2));
dfPartition = dfPartition.withColumn("day", dfPartition.col("date").substr(9, 2));
I can see the correct column values when I output the dataframe, e.g. 2016 01 08.
But when I export this dataframe to a Hive table like this:
dfPartition.write().partitionBy("year", "month","day").mode(SaveMode.Append).saveAsTable("testdb.testtable");
I see that the generated directory structure is missing the leading zeroes.
I tried casting the columns to String, but that did not work.
Is there a way to keep two-digit days/months in the Hive partitions?
Thanks
Per the Spark documentation, partition-column type inference is enabled by default. The OP's string values, being interpretable as ints, were converted to ints. If this is undesirable for the Spark session as a whole, one can disable it by setting the corresponding Spark configuration attribute to false:
SparkSession.builder.config("spark.sql.sources.partitionColumnTypeInference.enabled", value = false)
or by running the corresponding SET key=value command in SQL. Otherwise, one can counteract it for individual columns with the corresponding Spark-native format-string function, as J.Doe suggests.
Refer to "Add leading zeros to Columns in a Spark Data Frame".
That answer shows how to add leading zeros:
val df2 = df
.withColumn("month", format_string("%02d", $"month"))
I tried this on my code using the snippet below and it worked!
.withColumn("year", year(col("my_time")))
.withColumn("month", format_string("%02d",month(col("my_time")))) //pad with leading 0's
.withColumn("day", format_string("%02d",dayofmonth(col("my_time")))) //pad with leading 0's
.withColumn("hour", format_string("%02d",hour(col("my_time")))) //pad with leading 0's
.writeStream
.partitionBy("year", "month", "day", "hour")
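To show what goes wrong and what the %02d pattern does, here is a small plain-Java sketch. It assumes that format_string follows the same printf-style pattern as Java's String.format; PartitionPadding and twoDigits are made-up names for illustration.

```java
import java.util.Locale;

class PartitionPadding {

    // Pads a numeric date part to two digits, e.g. for month/day/hour partition values.
    static String twoDigits(int value) {
        return String.format(Locale.ROOT, "%02d", value);
    }

    public static void main(String[] args) {
        // Interpreting "01" as an integer drops the leading zero...
        int month = Integer.parseInt("01");
        System.out.println(month);            // prints 1
        // ...and formatting with %02d restores it.
        System.out.println(twoDigits(month)); // prints 01
    }
}
```

Values that already have two digits, such as 12, are left unchanged by the pattern.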

Error in importing a tsv to hbase

I created a table in hbase using:
create 'Province','ProvinceINFO'
Now, I want to import my data into it from a tsv file. The tsv file has two columns: ProvinceID (as pk) and ProvinceName.
I am using the below code for import:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,'
-Dimporttsv.columns= HBASE_ROW_KEY, ProvinceINFO:ProvinceName Province /usr/data
/Province.csv
but it gives me this error:
ERROR: No columns specified. Please specify with -Dimporttsv.columns=...
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data. Another special column HBASE_TS_KEY designates that this column should be
used as timestamp for each record. Unlike HBASE_ROW_KEY, HBASE_TS_KEY is optional.
You must specify at most one column as timestamp key for each imported record.
Record with invalid timestamps (blank, non-numeric) will be treated as bad record.
Note: if you use this option, then 'importtsv.timestamp' option will be ignored.
By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: if you do not use this option, then the target table must already exist in HBase
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of
org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
-Dmapred.job.name=jobName - use the specified mapreduce job name for the import
For performance consider the following options:
-Dmapred.map.tasks.speculative.execution=false
-Dmapred.reduce.tasks.speculative.execution=false
Maybe also try wrapping the columns value in quotes, i.e.
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=','
-Dimporttsv.columns="HBASE_ROW_KEY, ProvinceINFO:ProvinceName" Province /usr/data
/Province.csv
You should try something like:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=','
-Dimporttsv.columns= HBASE_ROW_KEY, ProvinceINFO:ProvinceName Province /usr/data
/Province.csv
Try to remove the spaces in -Dimporttsv.columns=a,b,c.
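If the command is assembled from a program rather than typed by hand, building the columns option with a comma join avoids stray spaces altogether. The sketch below is plain Java that only builds the argument list (it does not run HBase); ImportTsvArgs, columnsOption and command are hypothetical names, and the input path assumes the question's line-wrapped /usr/data /Province.csv is a single path.

```java
import java.util.Arrays;
import java.util.List;

class ImportTsvArgs {

    // Joins the column specs with commas and no spaces, so the option stays one argument.
    static String columnsOption(List<String> columns) {
        return "-Dimporttsv.columns=" + String.join(",", columns);
    }

    // Builds the full argument list; each element is passed as exactly one argument.
    static List<String> command(String separator, List<String> columns,
                                String table, String inputPath) {
        return Arrays.asList(
            "bin/hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
            "-Dimporttsv.separator=" + separator,
            columnsOption(columns),
            table, inputPath);
    }

    public static void main(String[] args) {
        List<String> cmd = command(
            ",",
            Arrays.asList("HBASE_ROW_KEY", "ProvinceINFO:ProvinceName"),
            "Province",
            "/usr/data/Province.csv");
        System.out.println(String.join(" ", cmd));
    }
}
```

A list like this could be handed to java.lang.ProcessBuilder, which passes each element through without shell word splitting.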
