How to Append Two Spark DataFrames with Different Columns in Java

I have one DataFrame on which I am applying a UDF, and the UDF produces only a single column.
How can I append that column to the original DataFrame?
Example:
Dataframe 1:
sr_no, name, salary
Dataframe 2: the UDF output, ABS(salary), a single column produced by applying the UDF to Dataframe 1
How can I get an output DataFrame that is Dataframe 1 plus Dataframe 2 in Java,
i.e. sr_no, name, salary, ABS(salary)?

It looks like you are searching for the .withColumn method:
df1.withColumn("ABS(salary)", yourUdf.apply(col("salary")))
(The snippet requires a static import of col from org.apache.spark.sql.functions.)
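For completeness, here is a minimal end-to-end sketch in Java, assuming the salary column is an integer and wrapping a hypothetical absolute-value UDF with functions.udf; adjust the types to match your schema:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

// Hypothetical UDF: absolute value of an integer salary.
UserDefinedFunction absSalary =
    udf((UDF1<Integer, Integer>) salary -> Math.abs(salary), DataTypes.IntegerType);

// Append the UDF output as a new column; df1 keeps sr_no, name and salary.
Dataset<Row> result = df1.withColumn("ABS(salary)", absSalary.apply(col("salary")));
result.show();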

Got the answer.
Just do it like this: df = df.selectExpr("*", "ABS(salary)"); This gives you the UDF output together with the entire DataFrame; otherwise you get only the single column.
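Note that ABS is a built-in SQL function; if the transformation is a custom UDF, it has to be registered by name before selectExpr can call it. A minimal sketch, assuming a hypothetical UDF name myAbs and an integer salary column:
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register the UDF under a SQL-callable name (hypothetical name "myAbs").
spark.udf().register("myAbs",
    (UDF1<Integer, Integer>) salary -> Math.abs(salary),
    DataTypes.IntegerType);

// "*" keeps every existing column; the second expression appends the UDF output.
df = df.selectExpr("*", "myAbs(salary)");
df.show();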

Related

Java Spark withColumn - custom function

Please give solutions in Java (not Scala or Python).
I have a DataFrame with the following data
colA, colB
23,44
24,64
What I want is a DataFrame like this:
colA, colB, colC
23,44, result of myFunction(23,44)
24,64, result of myFunction(24,64)
Basically I would like to add a column to the DataFrame in Java, where the value of the new column is found by putting the values of colA and colB through a complex function which returns a string.
Here is what I've tried, but the parameter passed to complexFunction only seems to be the column name 'colA', rather than the values in colA.
myDataFrame.withColumn("ststs", (complexFunction(myDataFrame.col("colA")))).show();
As suggested in the comments, you should use a user-defined function (UDF).
Let's suppose that you have a myFunction method which does the complex processing:
val myFunction : (Int, Int) => String = (colA, colB) => {...}
Then all you need to do is turn the function into a UDF and apply it to columns A and B:
import org.apache.spark.sql.functions.{udf, col}
val myFunctionUdf = udf(myFunction)
myDataFrame.withColumn("colC", myFunctionUdf(col("colA"), col("colB")))
I hope it helps
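Since the question asks for Java specifically, here is a rough equivalent sketch, assuming a hypothetical complexFunction(Integer, Integer) that returns a String:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

// Wrap the (hypothetical) complexFunction in a two-argument UDF.
UserDefinedFunction myFunctionUdf = udf(
    (UDF2<Integer, Integer, String>) (a, b) -> complexFunction(a, b),
    DataTypes.StringType);

// The UDF receives the column values row by row, not the column names.
Dataset<Row> withColC =
    myDataFrame.withColumn("colC", myFunctionUdf.apply(col("colA"), col("colB")));
withColC.show();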

What should be imported to recognize the "$" operator used in a join in my Spark Java code?

I am using spark-sql-2.4.1v with Java 8.
I am trying to join two datasets as below:
computed_df.as('s).join(accumulated_results_df.as('f),$"s.company_id" === $"f.company_id","inner")
This works fine in a Databricks notebook, but when I try to implement the same in my Spark Java code in my IDE, it won't recognize the "$" function/operator even after including
import static org.apache.spark.sql.functions.*;
So what should be done to use it in my Spark Java code?
thanks
The answer is in org.apache.spark.sql.Column; see its documentation:
public class Column
...
df("columnName") // On a specific DataFrame.
col("columnName") // A generic column no yet associcated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName" // Scala short hand for a named column.
expr("a + 1") // A column that is constructed from a parsed SQL Expression.
lit("abc") // A column that produces a literal (constant) value.

Duplicate column name in spark read csv

I am reading a CSV file which has a duplicate column.
I want to preserve the name of the column in the DataFrame.
I tried adding the spark.sql.caseSensitive option to my SparkContext conf and setting it to true, but unfortunately it has no effect.
The duplicate column name is NU_CPTE. Spark renames the duplicates by appending the column index, 0 and 7:
NU_CPTE0|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE7
SparkSession spark= SparkSession
.builder()
.master("local[2]")
.appName("Application Test")
.getOrCreate();
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
Dataset<Row> df=spark.read().option("header","true").option("delimiter",";").csv("FILE_201701.csv");
df.show(10);
I want something like this as result:
NU_CPTE|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE
Spark was changed to allow duplicate column names by appending the column index, which is why you are getting the numbers appended to the duplicate column names. Please see the link below:
https://issues.apache.org/jira/browse/SPARK-16896
The way you're trying to set the caseSensitive property will indeed be ineffective. Try replacing:
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
with:
spark.sql("set spark.sql.caseSensitive=true");
However, this still assumes your original columns have some sort of difference in casing. If they have the same casing, they will still be identical and will be suffixed with the column number.
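As a sketch, the setting can also be supplied on the builder so it is in place before the CSV is read (again assuming the duplicate headers actually differ in case, e.g. NU_CPTE vs nu_cpte):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .master("local[2]")
    .appName("Application Test")
    .config("spark.sql.caseSensitive", "true")   // set before any read happens
    .getOrCreate();

Dataset<Row> df = spark.read()
    .option("header", "true")
    .option("delimiter", ";")
    .csv("FILE_201701.csv");
df.show(10);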

Apache Spark - how to get unmatched rows from two RDDs

I have two different RDDs; each RDD has some common fields, and based on those fields I want to get the unmatched records from RDD1 or RDD2 (records available in RDD1 but not in RDD2, and records available in RDD2 but not in RDD1).
It seems we could use subtract or subtractByKey.
Sample Input:
File 1:
sam,23,cricket
alex,34,football
ann,21,football
File 2:
ruby,25,football,usa
alex,34,cricket,usa
ann,21,cricket,usa
Expected output:
sam,23,cricket
Update:
Currently I am using Spark SQL to get the unmatched records from the RDDs (writing a query to get the unmatched records).
What I am looking for is whether this can be done with Spark Core itself instead of Spark SQL. I am not looking for code; is there an operation available in Spark Core?
Please advise on this.
Regards,
Shankar.
You could bring both RDDs to the same shape and use subtract to remove the common elements.
Given rdd1 from file1 and rdd2 from file2 as presented above, you could do something like:
val userScore2 = rdd2.map{case (name, score, sport, country) => (name, score, sport)}
val in1andNotin2 = rdd1 subtract userScore2
val in2andNotIn1 = userScore2 subtract rdd1
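A rough Java equivalent, assuming rdd1 and rdd2 are existing JavaRDDs of Tuple3 and Tuple4 respectively (the tuple types and field order are illustrative):
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple3;
import scala.Tuple4;

// Project rdd2 down to the fields shared with rdd1 (drop the country field).
JavaRDD<Tuple3<String, Integer, String>> userScore2 =
    rdd2.map(t -> new Tuple3<String, Integer, String>(t._1(), t._2(), t._3()));

// Records only present on one side.
JavaRDD<Tuple3<String, Integer, String>> in1AndNotIn2 = rdd1.subtract(userScore2);
JavaRDD<Tuple3<String, Integer, String>> in2AndNotIn1 = userScore2.subtract(rdd1);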

How to read a CSV file column-wise using Hadoop?

I am trying to read a CSV file which does not contain comma-separated values; these are columns for NASDAQ stocks. I want to read a particular column, say the 3rd, but I do not know how to get the column items. Is there any method to read the data column-wise in Hadoop? Please help.
My CSV File Format is:
exchange stock_symbol date stock_price_open stock_price_high stock_price_low stock_price_close stock_volume stock_price_adj_close
NASDAQ ABXA 12/9/2009 2.55 2.77 2.5 2.67 158500 2.67
NASDAQ ABXA 12/8/2009 2.71 2.74 2.52 2.55 131700 2.55
Edited Here:
Column A : exchange
Column B : stock_symbol
Column C : date
Column D : stock_price_open
Column E : stock_price_high
and similarly.
These are columns, not comma-separated values. I need to read this file column-wise.
In Pig it will look like this:
Q1 = LOAD 'file.csv' USING PigStorage('\t') AS (exchange, stock_symbol, stock_date:double, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close);
Q2 = FOREACH Q1 GENERATE stock_date;
DUMP Q2;
You can try formatting the Excel sheet so that the columns are joined into a single text value, using a formula like:
=CONCATENATE(A2,";",B2,";",C2,";",D2,";",E2,";",F2,";",G2,";",H2,";",I2)
This concatenates the columns with your required separator; I have used ; here, but use whatever separator you want.
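If you want to stay in plain Hadoop MapReduce, a minimal mapper sketch along these lines could pull out a single column; this assumes the fields are whitespace/tab separated as in the sample above and that the 3rd column (the date) is the one you want:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ThirdColumnMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\\s+");        // split on tabs/spaces
    if (fields.length > 2 && !"exchange".equals(fields[0])) { // skip the header row
      context.write(new Text(fields[2]), NullWritable.get()); // emit the 3rd column
    }
  }
}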
