Missing leading zeroes of date in Hive partition using Spark Dataframe - java

I am adding partition columns to a Spark DataFrame. The new columns contain the year, month and day.
I have a timestamp column in my dataframe.
DataFrame dfPartition = df.withColumn("year", df.col("date").substr(0, 4));
dfPartition = dfPartition.withColumn("month", dfPartition.col("date").substr(6, 2));
dfPartition = dfPartition.withColumn("day", dfPartition.col("date").substr(9, 2));
I can see the correct values in the columns when I output the dataframe, e.g. 2016 01 08.
But when I export this dataframe to a Hive table like this:
dfPartition.write().partitionBy("year", "month","day").mode(SaveMode.Append).saveAsTable("testdb.testtable");
I see that the generated directory structure is missing the leading zeroes.
I tried casting the columns to String, but that did not work.
Is there a way to keep the two-digit day/month values in the Hive partition?
Thanks

Per the Spark documentation, partition column type inference is enabled by default. The OP's string values, being interpretable as ints, were converted as such. If this is undesirable for the Spark session as a whole, one can disable it by setting the corresponding Spark configuration property to false:
SparkSession.builder.config("spark.sql.sources.partitionColumnTypeInference.enabled", value = false)
or by running the corresponding SET key=value command in SQL. Otherwise, one can counteract it per column with the Spark-native format_string function, as J.Doe suggests.

Refer to "Add leading zeros to Columns in a Spark Data Frame".
The answer there shows how to add leading zeros:
val df2 = df
.withColumn("month", format_string("%02d", $"month"))
I tried this in my code using the snippet below and it worked!
.withColumn("year", year(col("my_time")))
.withColumn("month", format_string("%02d",month(col("my_time")))) //pad with leading 0's
.withColumn("day", format_string("%02d",dayofmonth(col("my_time")))) //pad with leading 0's
.withColumn("hour", format_string("%02d",hour(col("my_time")))) //pad with leading 0's
.writeStream
.partitionBy("year", "month", "day", "hour")
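
To make the effect concrete, here is a minimal plain-Java sketch (no Spark involved) of why the zeros disappear and what the "%02d" format restores. The class and method names are hypothetical and exist only for this illustration; the first method merely imitates what happens when the string parts are interpreted as ints, it is not Spark's inference code.

```java
import java.time.LocalDate;
import java.util.Locale;

class PartitionPathSketch {
    // Imitates the effect of treating the string parts as ints:
    // "01" becomes 1, so the leading zero is lost in the path.
    static String intPath(String year, String month, String day) {
        return "year=" + Integer.parseInt(year)
                + "/month=" + Integer.parseInt(month)
                + "/day=" + Integer.parseInt(day);
    }

    // Builds a Hive-style partition path with month and day padded
    // to two digits, using the same "%02d" format as format_string.
    static String paddedPath(LocalDate date) {
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
                date.getYear(), date.getMonthValue(), date.getDayOfMonth());
    }

    public static void main(String[] args) {
        System.out.println(intPath("2016", "01", "08"));          // year=2016/month=1/day=8
        System.out.println(paddedPath(LocalDate.of(2016, 1, 8))); // year=2016/month=01/day=08
    }
}
```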

Related

How to Append Two Spark Dataframes with different columns in Java

I have a DataFrame on which I apply a UDF, and the output of the UDF is a DataFrame with only one column.
How can I append it to the original DataFrame?
Example:
Dataframe 1:
sr_no, name, salary
Dataframe 2: the UDF output ABS(salary), i.e. a single column produced by applying the UDF to Dataframe 1
How can I get an output DataFrame of Dataframe 1 + Dataframe 2 in Java,
i.e. sr_no, name, salary, ABS(salary)?
It looks like you are looking for the withColumn method:
df1.withColumn("ABS(salary)", yourUdf.apply(col("salary")))
(The snippet requires a static import of col from org.apache.spark.sql.functions.)
Got the answer.
Just do it like this: df = df.selectExpr("*", "ABS(salary)"); This gives you the UDF output together with the entire dataframe. Otherwise you get only the one column.

Duplicate column name in spark read csv

I am reading a CSV file that has a duplicate column name.
I want to preserve the column name in the dataframe.
I tried setting spark.sql.caseSensitive to true in my SparkContext conf, but unfortunately it has no effect.
The duplicate column name is NU_CPTE. Spark renames it by appending the column index (0 and 7):
NU_CPTE0|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE7
SparkSession spark= SparkSession
.builder()
.master("local[2]")
.appName("Application Test")
.getOrCreate();
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
Dataset<Row> df=spark.read().option("header","true").option("delimiter",";").csv("FILE_201701.csv");
df.show(10);
I want something like this as the result:
NU_CPTE|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE
Spark was changed to accept duplicate column names by appending the column number to them, which is why you are getting numbers appended to the duplicate column names. See the link below:
https://issues.apache.org/jira/browse/SPARK-16896
The way you're trying to set the caseSensitive property will indeed be ineffective. Try replacing:
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
with:
spark.sql("set spark.sql.caseSensitive=true");
However, this still assumes your original columns have some sort of difference in casing. If they have the same casing, they will still be identical and will be suffixed with the column number.
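
As a rough illustration of the renaming described above, here is a plain-Java sketch that reproduces the observed result (NU_CPTE0 ... NU_CPTE7). It is only an imitation written for this answer, not Spark's actual code; the class and method names are hypothetical, and the case folding when caseSensitive is false is an assumption based on the behaviour described in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class HeaderDedupSketch {
    // Names that differ only in case count as duplicates unless caseSensitive is true.
    private static String key(String name, boolean caseSensitive) {
        return caseSensitive ? name : name.toLowerCase(Locale.ROOT);
    }

    // Appends the column index to every name that occurs more than once.
    static List<String> dedupe(List<String> header, boolean caseSensitive) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : header) {
            counts.merge(key(name, caseSensitive), 1, Integer::sum);
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < header.size(); i++) {
            String name = header.get(i);
            boolean duplicated = counts.get(key(name, caseSensitive)) > 1;
            result.add(duplicated ? name + i : name);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("NU_CPTE", "CD_EVT_FINANCIER", "TYP_MVT_ELTR",
                "DT_OPERN_CLI", "LI_MVT_ELTR", "MT_OPERN_FINC", "FLSENS", "NU_CPTE");
        // Identical casing: the duplicates are suffixed either way.
        System.out.println(dedupe(header, true));
    }
}
```

With identical casing, as in the question, the flag makes no difference: both occurrences of NU_CPTE still collide and get their index appended.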

How to read time in custom format from csv file?

I am parsing a CSV file with data like:
2016-10-03, 18.00.00, 2, 6
When reading the file, I create the schema as below:
StructType schema = DataTypes.createStructType(Arrays.asList(
DataTypes.createStructField("Date", DataTypes.DateType, false),
DataTypes.createStructField("Time", DataTypes.TimestampType, false),
DataTypes.createStructField("CO(GT)", DataTypes.IntegerType, false),
DataTypes.createStructField("PT08.S1(CO)", DataTypes.IntegerType, false)))
Dataset<Row> df = spark.read().format("csv").schema(schema).load("src/main/resources/AirQualityUCI/sample.csv");
It produces the error below:
Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Unknown Source)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
I think this is caused by the time format. How can I convert the values into a specific format, or what changes should I make to the StructType so that they are interpreted properly?
The format I expect is hh:mm:ss, since that would let me build a timestamp in Spark SQL by concatenating the columns:
2016-10-03, 18:00:00, 2, 6
If you read both Date and Time as strings, you can easily merge them and convert the result to a Timestamp. You do not need to change "." to ":" in the Time column, since the format can be specified when creating the Timestamp. Example of a solution in Scala:
val df = Seq(("2016-10-03", "00.00.17"),("2016-10-04", "00.01.17"))
.toDF("Date", "Time")
val df2 = df.withColumn("DateTime", concat($"Date", lit(" "), $"Time"))
.withColumn("Timestamp", unix_timestamp($"DateTime", "yyyy-MM-dd HH.mm.ss"))
Which will give you:
+----------+--------+-------------------+----------+
| Date| Time| DateTime| Timestamp|
+----------+--------+-------------------+----------+
|2016-10-03|00.00.17|2016-10-03 00.00.17|1475424017|
|2016-10-04|00.01.17|2016-10-04 00.01.17|1475510477|
+----------+--------+-------------------+----------+
Of course, if you want, you can still convert the Time column to use ":" instead of ".". This can be done with regexp_replace:
df.withColumn("Time2", regexp_replace($"Time", "\\.", ":"))
If you do this before converting to a Timestamp, you need to change the specified format above.
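
Since the question is in Java, here is the same idea as a plain-Java sketch using java.time rather than Spark: join the two strings and parse them with a pattern that keeps the dots as literals. The class and method names are hypothetical, and the trim() calls assume the fields may carry the space that follows each comma in the sample row.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class DottedTimeSketch {
    // Same pattern as in the Scala answer: the dots are matched literally.
    private static final DateTimeFormatter DOTTED =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH.mm.ss");

    // Joins the date and time fields and parses them in one go.
    static LocalDateTime parse(String date, String time) {
        return LocalDateTime.parse(date.trim() + " " + time.trim(), DOTTED);
    }

    // Optional: rewrite the time field to use ':' instead of '.'.
    static String withColons(String time) {
        return time.trim().replace('.', ':');
    }

    public static void main(String[] args) {
        System.out.println(parse("2016-10-03", " 18.00.00")); // 2016-10-03T18:00
        System.out.println(withColons(" 18.00.00"));          // 18:00:00
    }
}
```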

Apache Spark - how to get unmatched rows from two RDDs

I have two different RDDs that share some common fields. Based on those fields, I want to get the unmatched records from RDD1 or RDD2, i.e. the records available in RDD1 but not in RDD2, and the records available in RDD2 but not in RDD1.
It seems we could use subtract or subtractByKey.
Sample Input:
**File 1:**
sam,23,cricket
alex,34,football
ann,21,football
**File 2:**
ruby,25,football,usa
alex,34,cricket,usa
ann,21,cricket,usa
**expected output:**
sam,23,cricket
Update:
Currently I am using Spark SQL to get the unmatched records from the RDDs (by writing a query).
What I am looking for is a way to do this with Spark Core itself instead of Spark SQL. I am not looking for code; is there any operation available in Spark Core for this?
Please advise.
Regards,
Shankar.
You could bring both RDDs to the same shape and use subtract to remove the common elements.
Given rdd1 from file1 and rdd2 from file2 as presented above, you could do something like:
val userScore2 = rdd2.map{case (name, score, sport, country) => (name, score, sport)}
val in1andNotin2 = rdd1 subtract userScore2
val in2andNotIn1 = userScore2 subtract rdd1
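
One caveat: subtract compares whole elements, and in the sample data the sport differs between the two files for alex and ann, so to get the expected output (only sam) the records have to be matched on name and age alone, which is what subtractByKey does on keyed RDDs. The plain-Java sketch below (ordinary collections, not Spark) shows that key-based logic; the class and method names are hypothetical, and treating the first two fields as the key is an assumption taken from the sample data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class UnmatchedSketch {
    // Key = the first two comma-separated fields (name, age).
    static String key(String line) {
        String[] fields = line.split(",");
        return fields[0].trim() + "," + fields[1].trim();
    }

    // Keeps the records of `left` whose key does not occur in `right`.
    static List<String> onlyInLeft(List<String> left, List<String> right) {
        Set<String> rightKeys = right.stream()
                .map(UnmatchedSketch::key)
                .collect(Collectors.toSet());
        return left.stream()
                .filter(line -> !rightKeys.contains(key(line)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("sam,23,cricket", "alex,34,football", "ann,21,football");
        List<String> file2 = Arrays.asList("ruby,25,football,usa", "alex,34,cricket,usa", "ann,21,cricket,usa");
        System.out.println(onlyInLeft(file1, file2)); // [sam,23,cricket]
        System.out.println(onlyInLeft(file2, file1)); // [ruby,25,football,usa]
    }
}
```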

How to read a CSV file column wise using hadoop?

I am trying to read a CSV file that does not actually contain comma-separated values. It holds columns of NASDAQ stock data, and I want to read one particular column (say, the 3rd), but I do not know how to get the items of that column. Is there any method to read data column-wise in Hadoop? Please help.
My CSV file format is:
exchange stock_symbol date stock_price_open stock_price_high stock_price_low stock_price_close stock_volume stock_price_adj_close
NASDAQ ABXA 12/9/2009 2.55 2.77 2.5 2.67 158500 2.67
NASDAQ ABXA 12/8/2009 2.71 2.74 2.52 2.55 131700 2.55
Edited Here:
Column A : exchange
Column B : stock_symbol
Column C : date
Column D : stock_price_open
Column E : stock_price_high
and so on.
These are columns, not comma-separated values. I need to read this file column-wise.
In Pig it will look like this:
Q1 = LOAD 'file.csv' USING PigStorage('\t') AS (exchange, stock_symbol, stock_date:chararray, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close);
Q2 = FOREACH Q1 GENERATE stock_date;
DUMP Q2;
You can try reformatting the Excel sheet by joining the columns into a single text value with a formula like:
=CONCATENATE(A2,";",B2,";",C2,";",D2,";",E2,";",F2,";",G2,";",H2,";",I2)
This concatenates the columns with the separator you need. I have used ; here, but use whatever you want.
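
For the Java side, the per-line logic a mapper would need is small: split each line on its separator and pick the field at the wanted position. The sketch below is plain Java without any Hadoop classes; the class and method names are hypothetical, and splitting on runs of whitespace is an assumption about the file (the Pig answer above assumes tabs).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnReaderSketch {
    // Returns the values of one column (0-based index) from whitespace-separated lines.
    static List<String> column(List<String> lines, int index, boolean skipHeader) {
        List<String> values = new ArrayList<>();
        for (int i = skipHeader ? 1 : 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            String[] fields = line.split("\\s+");
            if (index < fields.length) {
                values.add(fields[index]); // rows that are too short are skipped
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "exchange stock_symbol date stock_price_open stock_price_high stock_price_low stock_price_close stock_volume stock_price_adj_close",
                "NASDAQ ABXA 12/9/2009 2.55 2.77 2.5 2.67 158500 2.67",
                "NASDAQ ABXA 12/8/2009 2.71 2.74 2.52 2.55 131700 2.55");
        // Third column (index 2), header row skipped.
        System.out.println(column(lines, 2, true)); // [12/9/2009, 12/8/2009]
    }
}
```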
