Java Spark withColumn - custom function - java

Problem: please give any solutions in Java (not Scala or Python).
I have a DataFrame with the following data
colA, colB
23,44
24,64
What I want is a DataFrame like this
colA, colB, colC
23,44, result of myFunction(23,44)
24,64, result of myFunction(24,64)
Basically, I would like to add a column to the DataFrame in Java, where the value of the new column is found by putting the values of colA and colB through a complex function which returns a string.
Here is what I've tried, but the parameter passed to complexFunction only seems to be the column name 'colA', rather than the values in colA.
myDataFrame.withColumn("ststs", (complexFunction(myDataFrame.col("colA")))).show();

As suggested in the comments, you should use a User Defined Function.
Let's suppose that you have a myFunction method which does the complex processing:
val myFunction : (Int, Int) => String = (colA, colB) => {...}
Then all you need to do is turn your function into a UDF and apply it to columns A and B:
import org.apache.spark.sql.functions.{udf, col}
val myFunctionUdf = udf(myFunction)
myDataFrame.withColumn("colC", myFunctionUdf(col("colA"), col("colB")))
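Since the question asks for Java specifically, here is a rough Java sketch of the same idea; it assumes a SparkSession named spark, and the body of the lambda is just a placeholder for your complex processing:
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

// register the complex function as a UDF that takes two ints and returns a String
spark.udf().register("myFunction",
        (UDF2<Integer, Integer, String>) (colA, colB) -> {
            // placeholder: your complex processing goes here
            return colA + "-" + colB;
        },
        DataTypes.StringType);

// apply it to the values of colA and colB, row by row
myDataFrame.withColumn("colC", callUDF("myFunction", col("colA"), col("colB"))).show();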
I hope it helps

Related

Find elements in one RDD but not in the other RDD

I have two JavaRDDs, A and B. I want to keep only the longs that are in A but not in B. How should I do that? Thanks!
I am posting a solution in Scala; it should be very similar in Java.
Do a leftOuterJoin, which gives all the records in the first RDD along with any matching records from the second RDD, like WrappedArray((168,(def,None)), (192,(abc,Some(abc)))). To keep only the records present in the first RDD, we then filter on None.
val data = spark.sparkContext.parallelize(Seq((192, "abc"),(168, "def")))
val data2 = spark.sparkContext.parallelize(Seq((192, "abc")))
val result = data
.leftOuterJoin(data2)
.filter(record => record._2._2 == None)
println(result.collect.toSeq)
Output> WrappedArray((168,(def,None)))
If you use the DataFrame API (the RDD API is older and misses a lot of the Tungsten engine optimisations) you can use an anti join (it may also exist on the RDD API, but let's use the good one ;-) )
val dataA = Seq((192, "abc"), (168, "def")).toDF("MyLong", "MyString")
val dataB = Seq((192, "abc")).toDF("MyLong", "MyString")
dataA.join(dataB, Seq("MyLong"), "leftanti").show(false)
+------+--------+
|MyLong|MyString|
+------+--------+
|168 |def |
+------+--------+
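For completeness, a rough Java equivalent of the anti join; this sketch assumes a SparkSession named spark and builds the two datasets by hand:
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType schema = new StructType()
        .add("MyLong", DataTypes.LongType)
        .add("MyString", DataTypes.StringType);
Dataset<Row> dataA = spark.createDataFrame(
        Arrays.asList(RowFactory.create(192L, "abc"), RowFactory.create(168L, "def")), schema);
Dataset<Row> dataB = spark.createDataFrame(
        Arrays.asList(RowFactory.create(192L, "abc")), schema);

// keep only the rows of dataA whose MyLong has no match in dataB
Dataset<Row> onlyInA = dataA.join(dataB,
        dataA.col("MyLong").equalTo(dataB.col("MyLong")), "left_anti");
onlyInA.show(false);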

How to Append Two Spark Dataframes with different columns in Java

I have one DataFrame on which I am performing a UDF operation, and the UDF gives only one column as output.
How can I append it to the previous DataFrame?
Example:
Dataframe 1:
sr_no , name, salary
Dataframe 2: the UDF gives ABS(salary) as its output - only one column, produced by the UDF applied to Dataframe 1
How can I get an output DataFrame of Dataframe 1 + Dataframe 2 in Java,
i.e. sr_no, name, salary, ABS(salary)?
Looks like you are searching for the .withColumn method:
df1.withColumn("ABS(salary)", yourUdf.apply(col("salary")))
(The snippet requires a static import of the col method from org.apache.spark.sql.functions.)
Got the answer.
Just do it like this: df = df.selectExpr("*", "ABS(salary)"); This gives you the output of the UDF along with your entire DataFrame; otherwise it gives only one column.
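As a side note, ABS is a built-in function, so for this particular example you do not even need a UDF; a minimal sketch of the .withColumn route with the built-in abs (variable names taken from the answer above):
import static org.apache.spark.sql.functions.abs;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// adds the extra column while keeping sr_no, name and salary
Dataset<Row> df2 = df1.withColumn("ABS(salary)", abs(col("salary")));
df2.show();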

Converting String to Double with TableSource, Table or DataSet object in Java

I have imported data from a CSV file into Flink in Java. One of the attributes (Result) I had to import as a string because of parsing errors. Now I want to convert that String to a Double, but I don't know how to do this with an object of the TableSource, Table or DataSet class. See my code below.
I've looked into the Flink documentation and tried some solutions with the Map and FlatMap classes, but I did not find a solution.
BatchTableEnvironment tableEnv = BatchTableEnvironment.create(fbEnv);
//Get H data from CSV file.
TableSource csvSource = CsvTableSource.builder()
        .path("Path")
        .fieldDelimiter(";")
        .field("ID", Types.INT())
        .field("result", Types.STRING())
        .field("unixDateTime", Types.LONG())
        .build();
// register the TableSource
tableEnv.registerTableSource("HTable", csvSource);
Table HTable = tableEnv.scan("HTable");
DataSet<Row> result = tableEnv.toDataSet(HTable, Row.class);
I think it should work to use a combination of REPLACE and CAST to convert the strings to doubles, as in "SELECT id, CAST(REPLACE(result, ',', '.') AS DOUBLE) AS result, ..." or the equivalent using the Table API.
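Wiring that suggestion into the code above, a rough sketch could look like this; it assumes the registered table name "HTable" and the field names from the question, and that your Flink version supports REPLACE in SQL:
// convert the comma-decimal strings to DOUBLE while reading from the registered table
Table converted = tableEnv.sqlQuery(
        "SELECT ID, CAST(REPLACE(result, ',', '.') AS DOUBLE) AS result, unixDateTime FROM HTable");
DataSet<Row> convertedRows = tableEnv.toDataSet(converted, Row.class);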

Implementing a user defined aggregation function to be used in RelationalGroupedDataset.agg() using Java

It seems like you can aggregate multiple columns like this:
Dataset<Row> df = spark.read().textFile(inputFile);
List<Row> result = df.groupBy("id")
.agg(sum(df.col("price")), avg(df.col("weight")))
.collectAsList();
Now, I want to write my own custom aggregation function instead of sum or avg. How can I do that?
The Spark documentation shows how to create a custom aggregation function, but that one is registered and then used in SQL, and I don't think it can be used in the .agg() function, since agg accepts Column instances and the custom aggregation function is not one.
If you have a class GeometricMean which extends UserDefinedAggregateFunction, then you can use it like this (taken from https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html):
// Create an instance of UDAF GeometricMean.
val gm = new GeometricMean
// Show the geometric mean of values of column "id".
df.groupBy("group_id").agg(gm(col("id")).as("GeometricMean")).show()
It should be easy to translate this into Java.
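A rough Java version could look like this, assuming you have already written a GeometricMean class that extends org.apache.spark.sql.expressions.UserDefinedAggregateFunction as in the linked example:
import static org.apache.spark.sql.functions.col;

// create an instance of the UDAF and apply it directly inside .agg()
GeometricMean gm = new GeometricMean();
df.groupBy("group_id")
        .agg(gm.apply(col("id")).as("GeometricMean"))
        .show();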

Mapping a Map type in Dataset to columns

I have a UDF in Spark which returns a Map as output.
Dataset<Row> dataSet = sql.sql("select *, address(col1,col2) as udfoutput from input");
I want to append the values returned in the map to columns.
For example, if the input table has 2 columns and the UDF map returns 2 key-value pairs, then a total of 4 columns should be created in the Dataset.
How about
select
*,
address(col1,col2)['key1'] as key1,
address(col1,col2)['key2'] as key2
from input
Or use a WITH clause to call your UDF only once:
with
raw as (select *, address(col1,col2) address from input)
select
*,
address['key1'],
address['key2']
from raw
That would be the Hive way.
In Spark you can also use imperative transformations (as opposed to declarative SQL) through the Dataset API. In Scala it could look like this; in Java, I believe, it is a little more verbose:
// First your schemas as case classes (POJOs)
case class MyModelClass(col1: String, col2: String)
case class MyModelClassWithAddress(col1: String, col2: String, address: Map[String, String])
// in Spark any function can be used as a UDF
def address(col1: String, col2: String): Map[String, String] = ???
// Now the imperative Spark code
import spark.implicits._
val dataSet: Dataset[Row] = ??? // you can read a table from the Hive Metastore, or use spark.read ...
dataSet
  .as[MyModelClass]
  .map(myModel => MyModelClassWithAddress(myModel.col1, myModel.col2, address(myModel.col1, myModel.col2)))
  .write.save(...) // wherever it needs to be written later
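And a rough Java sketch of pulling the map values out into columns with the Dataset API; it assumes the UDF output column is named udfoutput as in the question, and that "key1" and "key2" are placeholders for the keys your UDF actually returns:
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// extract individual map entries into their own columns
Dataset<Row> withKeyColumns = dataSet
        .withColumn("key1", col("udfoutput").getItem("key1"))
        .withColumn("key2", col("udfoutput").getItem("key2"));
withKeyColumns.show();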
