I have a custom function that depends on the order of the data. I want to apply this function to each group in Spark, with the groups processed in parallel. How can I do this?
For example,
public ArrayList<Integer> my_logic(ArrayList<Integer> glist) {
    boolean b = true;
    ArrayList<Integer> result = new ArrayList<>();
    for (int i = 1; i < glist.size(); i++) { // size is around 30000
        if (b && glist.get(i - 1) > glist.get(i)) {
            // some logic, then set b to false
            b = false;
            result.add(glist.get(i));
        } else {
            // some logic, then set b to true
            b = true;
        }
    }
    return result;
}
My data,
Col1 Col2
a 1
b 2
a 3
c 4
c 3
…. ….
I want something similar to the following:
df.group_by(col("Col1")).apply(my_logic(col("Col2")));
// output
a [1,3,5…]
b [2,5,8…]
…. ….
In Spark, you can use window aggregate functions directly. I will show that here in Scala.
Here is your input data (my preparation):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val schema = StructType(
  StructField("Col1", StringType, false) ::
  StructField("Col2", IntegerType, false) :: Nil
)
val row = Seq(Row("a", 1), Row("b", 8), Row("b", 2), Row("a", 5), Row("b", 5), Row("a", 3))
// Parallelize the rows into an RDD[Row] so that no implicit Seq-to-java.util.List conversion is needed
val df = spark.createDataFrame(spark.sparkContext.parallelize(row), schema)
df.show(false)
//input:
// +----+----+
// |Col1|Col2|
// +----+----+
// |a |1 |
// |b |8 |
// |b |2 |
// |a |5 |
// |b |5 |
// |a |3 |
// +----+----+
Here is the code that produces the desired result:
import org.apache.spark.sql.expressions.Window
df
// NEWCOLUMN: EVALUATE/CREATE LIST OF VALUES FOR EACH RECORD OVER THE WINDOW AS FRAME MOVES
.withColumn(
"collected_list",
collect_list(col("Col2")) over Window
.partitionBy(col("Col1"))
.orderBy(col("Col2"))
)
// NEWCOLUMN: MAX SIZE OF COLLECTED LIST IN EACH WINDOW
.withColumn(
"max_size",
max(size(col("collected_list"))) over Window.partitionBy(col("Col1"))
)
// FILTER TO GET ONLY HIGHEST SIZED ARRAY ROW
.where(col("max_size") - size(col("collected_list")) === 0)
.orderBy(col("Col1"))
.drop("Col2", "max_size")
.show(false)
// output:
// +----+--------------+
// |Col1|collected_list|
// +----+--------------+
// |a |[1, 3, 5] |
// |b |[2, 5, 8] |
// +----+--------------+
Note:
You can use the collect_list() aggregate function with groupBy directly, but the collected list is then not guaranteed to be ordered.
If you want to eliminate duplicates, explore the collect_set() aggregate function (with some changes to the above query).
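To make explicit what the windowed collect_list computes before an order-dependent function such as my_logic runs, here is a minimal plain-Java sketch with no Spark involved. The names OrderedGroups, Rec, and applyPerGroup are made up for illustration, and the identity function stands in for the real logic, which is elided in the question. It groups by key, sorts each group, and processes the groups independently:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.UnaryOperator;

// Plain-Java illustration (no Spark): group, sort within group, apply per group.
final class OrderedGroups {

    // Hypothetical record type standing in for a (Col1, Col2) row.
    static final class Rec {
        final String key;
        final int value;
        Rec(String key, int value) { this.key = key; this.value = value; }
    }

    static Map<String, List<Integer>> applyPerGroup(List<Rec> rows,
                                                    UnaryOperator<List<Integer>> logic) {
        // Step 1: group values by key (the role of partitionBy(Col1)).
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Rec r : rows) {
            groups.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r.value);
        }
        // Step 2: groups are independent, so each one can be handled in parallel.
        Map<String, List<Integer>> out = new ConcurrentSkipListMap<>();
        groups.entrySet().parallelStream().forEach(e -> {
            List<Integer> sorted = new ArrayList<>(e.getValue());
            Collections.sort(sorted);                   // the role of orderBy(Col2)
            out.put(e.getKey(), logic.apply(sorted));   // order-dependent logic goes here
        });
        return out;
    }

    public static void main(String[] args) {
        List<Rec> rows = Arrays.asList(
            new Rec("a", 1), new Rec("b", 8), new Rec("b", 2),
            new Rec("a", 5), new Rec("b", 5), new Rec("a", 3));
        // Identity is a placeholder for the real per-group function.
        System.out.println(applyPerGroup(rows, UnaryOperator.identity()));
    }
}
```

With the sample rows and the identity function this prints {a=[1, 3, 5], b=[2, 5, 8]}, the same lists as the Spark query. In Spark, the per-group step would typically run in a UDF over the collected array; that is an assumption about how you would wire it up, not part of the query shown here.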
EDIT 2 : You can write your custom collect_list() as a UDAF (UserDefinedAggregateFunction) like this in Scala Spark for DataFrames
Online Docs
For Spark2.3.0
For Latest Version
The code below targets Spark 2.3.0. (UserDefinedAggregateFunction is deprecated as of Spark 3.0 in favor of Aggregator.)
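import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import scala.collection.mutable
object Your_Collect_Array extends UserDefinedAggregateFunction {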
override def inputSchema: StructType = StructType(
StructField("yourInputToAggFunction", LongType, false) :: Nil
)
override def dataType: ArrayType = ArrayType(LongType, false)
override def deterministic: Boolean = true
override def bufferSchema: StructType = {
StructType(
StructField("yourCollectedArray", ArrayType(LongType, false), false) :: Nil
)
}
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = new Array[Long](0)
}
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer.update(
0,
buffer.getAs[mutable.WrappedArray[Long]](0) :+ input.getLong(0)
)
}
override def merge(
buffer1: MutableAggregationBuffer,
buffer2: Row
): Unit = {
buffer1.update(
0,
buffer1.getAs[mutable.WrappedArray[Long]](0) ++ buffer2
.getAs[mutable.WrappedArray[Long]](0)
)
}
override def evaluate(buffer: Row): Any =
buffer.getAs[mutable.WrappedArray[Long]](0)
}
//Below is the query with just one line change i.e., calling above written custom udf
df
// NEWCOLUMN : USING OUR CUSTOM UDF
.withColumn(
"your_collected_list",
Your_Collect_Array(col("Col2")) over Window
.partitionBy(col("Col1"))
.orderBy(col("Col2"))
)
// NEWCOLUMN: MAX SIZE OF COLLECTED LIST IN EACH WINDOW
.withColumn(
"max_size",
max(size(col("your_collected_list"))) over Window.partitionBy(col("Col1"))
)
// FILTER TO GET ONLY HIGHEST SIZED ARRAY ROW
.where(col("max_size") - size(col("your_collected_list")) === 0)
.orderBy(col("Col1"))
.drop("Col2", "max_size")
.show(false)
//Output:
// +----+-------------------+
// |Col1|your_collected_list|
// +----+-------------------+
// |a |[1, 3, 5] |
// |b |[2, 5, 8] |
// +----+-------------------+
Note:
UDFs are less efficient than built-in functions in Spark because the optimizer cannot see inside them, so use them only when you absolutely need them. They are mainly intended for data analytics.
I have to pivot the data in a file and then store it in another file, and I am having some difficulty with the pivot.
I have multiple files containing data that looks like the samples below. The number of columns varies between files. I am trying to merge the files first, but for some reason the output is not correct. I haven't tried the pivot method yet, and I am not sure how to use it either.
How can this be achieved?
File 1:
0,26,27,30,120
201008,100,1000,10,400
201009,200,2000,20,500
201010,300,3000,30,600
File 2:
0,26,27,30,120,145
201008,100,1000,10,400,200
201009,200,2000,20,500,100
201010,300,3000,30,600,150
File 3:
0,26,27,120,145
201008,100,10,400,200
201009,200,20,500,100
201010,300,30,600,150
Output:
201008,26,100
201008,27,1000
201008,30,10
201008,120,400
201008,145,200
201009,26,200
201009,27,2000
201009,30,20
201009,120,500
201009,145,100
.....
I am not very familiar with Spark, but I am trying to use flatMap and flatMapValues. I am not sure yet how to use them here and would appreciate some guidance.
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.SparkSession;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class ExecutionTest {
public static void main(String[] args) {
Logger.getLogger("org.apache").setLevel(Level.WARN);
Logger.getLogger("org.spark_project").setLevel(Level.WARN);
Logger.getLogger("io.netty").setLevel(Level.WARN);
log.info("Starting...");
// Step 1: Create a SparkContext.
boolean isRunLocally = Boolean.valueOf(args[0]);
String filePath = args[1];
SparkConf conf = new SparkConf().setAppName("Variable File").set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer");
if (isRunLocally) {
log.info("System is running in local mode");
conf.setMaster("local[*]").set("spark.executor.memory", "2g");
}
SparkSession session = SparkSession.builder().config(conf).getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
jsc.textFile(filePath, 2)
.map(new Function<String, String[]>() {
private static final long serialVersionUID = 1L;
@Override
public String[] call(String v1) throws Exception {
return StringUtils.split(v1, ",");
}
})
.foreach(new VoidFunction<String[]>() {
private static final long serialVersionUID = 1L;
@Override
public void call(String[] t) throws Exception {
for (String string : t) {
log.info(string);
}
}
});
}
}
The solution is in Scala, as I am not a Java person; you should be able to adapt it and add sorting, caching, etc.
The data is as follows: 3 files, with a duplicate entry evident across them. Get rid of that if you do not want it.
0, 5,10, 15, 20
202008, 5,10, 15, 20
202009,10,20,100,200
8 rows generated above.
0,888,999
202008, 5, 10
202009, 10, 20
4 rows generated above.
0, 5
202009,10
1 row, which is a duplicate.
// Bit lazy with columns names, but anyway.
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val inputPath: String = "/FileStore/tables/g*.txt"
val rdd = spark.read.text(inputPath)
.select(input_file_name(), $"value")
.as[(String, String)]
.rdd
val rdd2 = rdd.zipWithIndex
val rdd3 = rdd2.map(x => (x._1._1, x._2, x._1._2.split(",").toList.map(_.trim.toInt))) // trim so padded values such as " 5" still parse
val rdd4 = rdd3.map { case (pfx, pfx2, list) => (pfx,pfx2,list.zipWithIndex) }
val df = rdd4.toDF()
df.show(false)
df.printSchema()
val df2 = df.withColumn("rankF", row_number().over(Window.partitionBy($"_1").orderBy($"_2".asc)))
df2.show(false)
df2.printSchema()
val df3 = df2.withColumn("elements", explode($"_3"))
df3.show(false)
df3.printSchema()
val df4 = df3.select($"_1", $"rankF", $"elements".getField("_1"), $"elements".getField("_2")).toDF("fn", "line_num", "val", "col_pos")
df4.show(false)
df4.printSchema()
df4.createOrReplaceTempView("df4temp")
val df51 = spark.sql("""SELECT hdr.fn, hdr.line_num, hdr.val AS pfx, hdr.col_pos
FROM df4temp hdr
WHERE hdr.line_num <> 1
AND hdr.col_pos = 0
""")
df51.show(100,false)
val df52 = spark.sql("""SELECT t1.fn, t1.val AS val1, t1.col_pos, t2.line_num, t2.val AS val2
FROM df4temp t1, df4temp t2
WHERE t1.col_pos <> 0
AND t1.col_pos = t2.col_pos
AND t1.line_num <> t2.line_num
AND t1.line_num = 1
AND t1.fn = t2.fn
""")
df52.show(100,false)
df51.createOrReplaceTempView("df51temp")
df52.createOrReplaceTempView("df52temp")
val df53 = spark.sql("""SELECT DISTINCT t1.pfx, t2.val1, t2.val2
FROM df51temp t1, df52temp t2
WHERE t1.fn = t2.fn
AND t1.line_num = t2.line_num
""")
df53.show(false)
returns:
+------+----+----+
|pfx |val1|val2|
+------+----+----+
|202008|888 |5 |
|202009|999 |20 |
|202009|20 |200 |
|202008|5 |5 |
|202008|10 |10 |
|202009|888 |10 |
|202008|15 |15 |
|202009|5 |10 |
|202009|10 |20 |
|202009|15 |100 |
|202008|20 |20 |
|202008|999 |10 |
+------+----+----+
What we see is data wrangling: the data has to be massaged into shape for the temp views and then JOINed appropriately with SQL.
The key here is to know how to massage the data to make things easy. Note that there is no groupBy etc. The processing is per file, since the files have varying numbers of columns. I did not attempt the JOIN at the RDD level, as that is too inflexible. The rank gives the line number within each file, so you know which line is the first one (the header line starting with 0).
This is what we call Data Wrangling. This is what we also call hard work for a few points on SO. This is one of my best efforts, and also one of the last of such efforts.
A weakness of the solution is that it takes a lot of work to get the first record of each file; there are alternatives. Preprocessing (see https://www.cyberciti.biz/faq/unix-linux-display-first-line-of-file/) is what I would realistically consider.
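Since the question asked for Java, here is a minimal plain-Java sketch of the core reshaping step for a single file, without Spark. The class name PivotSketch and the method unpivot are hypothetical. It assumes the first line is the header, that column 0 of each data line is the row key, and that every other cell is paired with the header id at the same position:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration (no Spark) of turning one wide file into long rows.
final class PivotSketch {

    // Converts the lines of one file (header first) into "key,columnId,value" rows.
    static List<String> unpivot(List<String> lines) {
        List<String> out = new ArrayList<>();
        if (lines.isEmpty()) {
            return out;
        }
        String[] header = lines.get(0).split(",");
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(",");
            // Column 0 is the row key; pair every other cell with its header id.
            for (int i = 1; i < cells.length && i < header.length; i++) {
                out.add(cells[0].trim() + "," + header[i].trim() + "," + cells[i].trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // First two lines of "File 1" from the question.
        List<String> file1 = Arrays.asList(
            "0,26,27,30,120",
            "201008,100,1000,10,400");
        unpivot(file1).forEach(System.out::println);
    }
}
```

For the two lines of File 1 used in main, this prints 201008,26,100 then 201008,27,1000 then 201008,30,10 then 201008,120,400. Because each file is handled on its own, the varying column counts stop being a problem. In Spark, one possible way to apply it would be to read whole files with JavaSparkContext.wholeTextFiles and call the function inside flatMap, then deduplicate with distinct; treat that wiring as a suggestion rather than tested code.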
I created a dataset in Spark using Java by reading a csv file. Following is my initial dataset:
+---+----------+-----+---+
|_c0| _c1| _c2|_c3|
+---+----------+-----+---+
| 1|9090999999|NANDU| 22|
| 2|9999999999| SANU| 21|
| 3|9999909090| MANU| 22|
| 4|9090909090|VEENA| 23|
+---+----------+-----+---+
I want to create dataframe as follows (one column having null values):
+---+----+-----+
|_c0| _c1|  _c2|
+---+----+-----+
|  1|null|NANDU|
|  2|null| SANU|
|  3|null| MANU|
|  4|null|VEENA|
+---+----+-----+
Following is my existing code:
Dataset<Row> ds = spark.read().format("csv").option("header", "false").load("/home/nandu/Data.txt");
Column [] selectedColumns = new Column[2];
selectedColumns[0]= new Column("_c0");
selectedColumns[1]= new Column("_c2");
Dataset<Row> ds2 = ds.select(selectedColumns);
This creates the following dataset:
+---+-----+
|_c0| _c2|
+---+-----+
| 1|NANDU|
| 2| SANU|
| 3| MANU|
| 4|VEENA|
+---+-----+
To select the two columns you want and add a new one with nulls you can use the following:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import org.apache.spark.sql.types.DataTypes;
Dataset<Row> ds2 = ds.select(col("_c0"), lit(null).cast(DataTypes.StringType).as("_c1"), col("_c2"));
Try the following code:
import org.apache.spark.sql.functions.{ lit => flit}
import org.apache.spark.sql.types._
val ds = spark.range(100).withColumn("c2",$"id")
ds.withColumn("new_col", flit(null).cast(StringType)).selectExpr("id", "new_col", "c2").show(5) // cast so the column is typed as string rather than null
Hope this Helps
Cheers :)
Adding a new column with a null string value may solve the problem. Try the following code; it is written in Scala, but you'll get the idea:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType
val ds2 = ds.withColumn("new_col", lit(null).cast(StringType)).selectExpr("_c0", "new_col as _c1", "_c2")
I'm trying to replicate a single row from a Dataset n times and create a new Dataset from it. While replicating, I need one column's value to change for each replica, since it will end up as the primary key when finally stored.
Below is the Scala code from the SO post "Replicate Spark Row N-times":
import org.apache.spark.sql.functions._
val result = singleRowDF
.withColumn("dummy", explode(array((1 until 100).map(lit): _*)))
.selectExpr(singleRowDF.columns: _*)
How can I create a column from an array of values in Java and pass it to explode function? Suggestions are helpful.
Thanks
This is the Java program to replicate a single row from a Dataset n times.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.lit;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
public class SparkSample{
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local[*]")
.getOrCreate();
//Create Dataset
List<Tuple2<String,Double>> inputList = new ArrayList<Tuple2<String,Double>>();
inputList.add(new Tuple2<String,Double>("A",1.0));
Dataset<Row> df = spark.createDataset(inputList, Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF();
df.show(false);
//Java 8 style of creating Array. You can create by using for loop as well
int[] array = IntStream.range(0, 5).toArray();
//With Dummy Column
Dataset<Row> df1 = df.withColumn("dummy", explode(lit(array)));
df1.show(false);
//Drop Dummy Column
Dataset<Row> df2 = df1.drop(col("dummy"));
df2.show(false);
}
}
Below is the output of this program.
+---+---+
|_1 |_2 |
+---+---+
|A |1.0|
+---+---+
+---+---+-----+
|_1 |_2 |dummy|
+---+---+-----+
|A |1.0|0 |
|A |1.0|1 |
|A |1.0|2 |
|A |1.0|3 |
|A |1.0|4 |
+---+---+-----+
+---+---+
|_1 |_2 |
+---+---+
|A |1.0|
|A |1.0|
|A |1.0|
|A |1.0|
|A |1.0|
+---+---+
I have two fields of type java.sql.Timestamp in my DataFrame, and I want to find the number of days between these two columns.
Below is the format of my data: 2016-12-23 23:56:02.0 (yyyy-MM-dd HH:mm:ss.S)
I have tried lots of methods but did not find any solution. Can anyone help?
org.apache.spark.sql.functions is a treasure trove. For example, there is the datediff method that does exactly what you want: here is the ScalaDoc.
An example:
val spark: SparkSession = ??? // your spark session
val sc: SparkContext = ??? // your spark context
import spark.implicits._ // to better work with spark sql
import java.sql.Timestamp
final case class Data(id: Int, from: Timestamp, to: Timestamp)
val ds =
spark.createDataset(sc.parallelize(Seq(
Data(1, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-11 00:00:00")),
Data(2, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-21 00:00:00")),
Data(3, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-23 00:00:00")),
Data(4, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-07 00:00:00"))
)))
import org.apache.spark.sql.functions._
ds.select($"id", datediff($"from", $"to")).show()
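Running this snippet produces the following output. The values are negative because datediff(end, start) returns the number of days from start to end, and here from is passed as end; call datediff($"to", $"from") instead to get positive day counts: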
+---+------------------+
| id|datediff(from, to)|
+---+------------------+
| 1| -10|
| 2| -20|
| 3| -22|
| 4| -6|
+---+------------------+
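To double-check the sign convention outside Spark, here is a small plain-Java sketch using java.time. The class DayDiffSketch and the method daysBetween are hypothetical names. Like datediff, it compares calendar dates only and ignores the time of day, and it counts days from the first argument to the second:

```java
import java.sql.Timestamp;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Plain-Java illustration (no Spark) of a calendar-day difference between two timestamps.
final class DayDiffSketch {

    // Number of calendar days from 'from' to 'to'; negative if 'to' is earlier.
    static long daysBetween(Timestamp from, Timestamp to) {
        LocalDate start = from.toLocalDateTime().toLocalDate();
        LocalDate end = to.toLocalDateTime().toLocalDate();
        return ChronoUnit.DAYS.between(start, end);
    }

    public static void main(String[] args) {
        Timestamp from = Timestamp.valueOf("2017-01-01 00:00:00");
        Timestamp to = Timestamp.valueOf("2017-01-11 00:00:00");
        System.out.println(daysBetween(from, to)); // positive: 'to' is later
        System.out.println(daysBetween(to, from)); // negative: arguments swapped
    }
}
```

For the first sample row this prints 10 and then -10, which mirrors what happens when the two arguments to datediff are swapped.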