Below is the Spark Scala code that prints a one-column Dataset[Row]:
import org.apache.spark.sql.{Dataset, Row, SparkSession}
val spark: SparkSession = SparkSession.builder()
.appName("Spark DataValidation")
.config("SPARK_MAJOR_VERSION", "2").enableHiveSupport()
.getOrCreate()
val kafkaPath:String="hdfs:///landing/APPLICATION/*"
val targetPath:String="hdfs://datacompare/3"
val pk:String = "APPLICATION_ID"
val pkValues = spark
.read
.json(kafkaPath)
.select("message.data.*")
.select(pk)
.distinct()
pkValues.show()
Output of the above code:
+--------------+
|APPLICATION_ID|
+--------------+
|           388|
|           447|
|           346|
|           861|
|           361|
|           557|
|           482|
|           518|
|           432|
|           422|
|           533|
|           733|
|           472|
|           457|
|           387|
|           394|
|           786|
|           458|
+--------------+
Question:
How can I convert this DataFrame to a comma-separated String variable?
Expected output:
val data:String= "388,447,346,861,361,557,482,518,432,422,533,733,472,457,387,394,786,458"
Please suggest how to convert a DataFrame[Row] or Dataset to a single String.
I don't think that's a good idea, since a DataFrame is a distributed object and can be immense. collect will bring all the data to the driver, so you should perform this kind of operation carefully.
Here is what you can do with a DataFrame (two options):
df.select("APPLICATION_ID").rdd.map(r => r(0)).collect.mkString(",")
df.select("APPLICATION_ID").collect.mkString(",")
Result with a test DataFrame with only 3 rows:
String = 388,447,346
Edit: With a Dataset you can do it directly:
ds.collect.mkString(",")
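Applied to the pkValues from the question, and keeping the caveat above about collecting to the driver in mind, a minimal sketch would be:
val data: String = pkValues
  .rdd
  .map(_.get(0).toString) // Row.get(0) works whatever the column type turns out to be
  .collect()
  .mkString(",")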
Use collect_list:
import org.apache.spark.sql.functions._
import spark.implicits._ // provides the Encoder needed by .as[Array[Long]]
val data = pkValues.select(collect_list(col(pk))) // collect to one row
.as[Array[Long]] // set encoder, so you will have strongly-typed Dataset
.take(1)(0) // get the first row - result will be Array[Long]
.mkString(",") // and join all values
However, it's quite a bad idea to perform collect or take on all the rows. Instead, you may want to save pkValues somewhere with .write, or pass it as an argument to another function, to keep the computation distributed.
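For example, a minimal sketch of the .write route (the output path is only a placeholder):
pkValues
  .write
  .mode("overwrite")
  .csv("hdfs:///landing/APPLICATION_pk_values") // placeholder path; the result stays distributed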
Edit: I just noticed that SCouto posted another answer just after mine. collect will also work; with the collect_list function you have one advantage: you can easily group the values if you want, e.g. split the keys into even and odd ones. It's up to you which solution you prefer: the simpler one with collect, or the one that is a line longer but more powerful.
Related
This is how I do it for 2 specific columns:
dataSet.withColumn("colName", concat(dataSet.col("col1"), lit(","), dataSet.col("col2")));
but dataSet.columns() returns a String array, not a Column array.
How should I create a List<Column>?
Thanks!
Simple way: instead of df.columns, use concat_ws(',', *) inside expr. Check the code below.
df.withColumn("colName",expr("concat_ws(',',*)")).show(false)
+---+--------+---+-------------+
|id |name    |age|colName      |
+---+--------+---+-------------+
|1  |Srinivas|29 |1,Srinivas,29|
|2  |Ravi    |30 |2,Ravi,30    |
+---+--------+---+-------------+
Java has a more verbose syntax.
Try this:
df.withColumn("colName",concat_ws(",", toScalaSeq(Arrays.stream(df.columns()).map(functions::col).collect(Collectors.toList()))));
Use the utility below to convert a Java List to a Scala Seq:
<T> Buffer<T> toScalaSeq(List<T> list) {
    return JavaConversions.asScalaBuffer(list);
}
If someone is looking for a way to concat all the columns of a DataFrame in Scala, this is what worked for me:
val df_new = df.withColumn(new_column_name, concat_ws("-", df.columns.map(col): _*))
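As a quick sanity check, the same one-liner on a small made-up DataFrame (the column names and values are just for illustration, and spark.implicits._ is assumed to be in scope for toDF):
import org.apache.spark.sql.functions.{col, concat_ws}
val demo = Seq((1, "Srinivas", 29), (2, "Ravi", 30)).toDF("id", "name", "age")
demo.withColumn("all_cols", concat_ws("-", demo.columns.map(col): _*)).show(false)
+---+--------+---+-------------+
|id |name    |age|all_cols     |
+---+--------+---+-------------+
|1  |Srinivas|29 |1-Srinivas-29|
|2  |Ravi    |30 |2-Ravi-30    |
+---+--------+---+-------------+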
I am using Apache Spark SQL and Java to read from a parquet file. The file contains a date column (M/d/yyyy) and I want to change it to another format (yyyy-dd-MM), something like the SELECT DATE_FORMAT(date, format) we can do in MySQL. Is there a similar method in Apache Spark?
What you can do is parse the string with to_timestamp using your current pattern, and then format it the way you want with date_format:
val df = Seq("1/1/2015", "02/10/2014", "4/30/2010", "03/7/2015").toDF("d")
df.select('d, date_format(to_timestamp('d, "MM/dd/yyyy"), "yyyy-dd-MM") as "new_d")
.show
+----------+----------+
|         d|     new_d|
+----------+----------+
|  1/1/2015|2015-01-01|
|02/10/2014|2014-10-02|
| 4/30/2010|2010-30-04|
| 03/7/2015|2015-07-03|
+----------+----------+
Note that the parsing is pretty robust and supports single-digit days and months.
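If you prefer to stay in plain SQL (handy from Java via Dataset.selectExpr), the same transformation can be written as a single expression; a sketch using the df from above, assuming Spark 2.2+ for the to_timestamp SQL function:
df.selectExpr("d", "date_format(to_timestamp(d, 'MM/dd/yyyy'), 'yyyy-dd-MM') AS new_d").show()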
I need to check from Java if a certain user has at least one group membership. In Oracle (12, by the way) there is a big table that looks like this:
DocId | Group
------+--------
    1 | Group-A
    1 | Group-E
    1 | Group-Z
    2 | Group-A
    3 | Group-B
    3 | Group-W
In Java I have this information:
docId = 1
listOfUsersGroups = { "Group-G", "Group-A", "Group-Of-Something-Totally-Different" }
I have seen solutions like this, but this is not the approach I want to go for. I would like to do something like this (I know this is incorrect syntax) ...
SELECT * FROM PERMSTABLE WHERE DOCID = 1 AND ('Group-G', 'Group-A', 'Group-Of-Something-Totally-Different' ) HASATLEASTONE OF Group
... and not use any temporary SQL INSERTs. The outcome should be that after executing this query I know that my user has a match, because he is a member of Group-A.
You can do this (using an IN condition):
SELECT * FROM PERMSTABLE WHERE DocId = 1 AND Group IN
('Group-G', 'Group-A', 'Group-Of-Something-Totally-Different')
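On the Java side this then becomes a matter of binding the doc id and the group list as individual parameters. Here is a minimal JDBC sketch (in Scala, like the rest of this page; the connection URL and credentials are placeholders, and the table and column names are taken from the question):
import java.sql.DriverManager

val groups = Seq("Group-G", "Group-A", "Group-Of-Something-Totally-Different")
val placeholders = groups.map(_ => "?").mkString(", ")
// Column names as in the question; note that GROUP is a reserved word in Oracle,
// so the real column may need quoting.
val sql = s"SELECT COUNT(*) FROM PERMSTABLE WHERE DocId = ? AND Group IN ($placeholders)"

val conn = DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/service", "user", "pass") // placeholders
try {
  val stmt = conn.prepareStatement(sql)
  stmt.setLong(1, 1L) // docId = 1
  groups.zipWithIndex.foreach { case (g, i) => stmt.setString(i + 2, g) }
  val rs = stmt.executeQuery()
  rs.next()
  val hasMembership = rs.getLong(1) > 0 // true if the user is in at least one of the groups
  println(hasMembership)
} finally conn.close()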
I use Vlad Mihalcea's library to map SQL arrays (PostgreSQL in my case) to JPA. Now let's imagine I have an Entity, e.g.
@TypeDefs(
    {@TypeDef(name = "string-array", typeClass = StringArrayType.class)}
)
@Entity
public class Entity {

    @Type(type = "string-array")
    @Column(columnDefinition = "text[]")
    private String[] tags;
}
The appropriate SQL is:
CREATE TABLE entity (
tags text[]
);
Using QueryDSL, I'd like to fetch rows whose tags contain all the given ones. The raw SQL could be:
SELECT * FROM entity WHERE tags #> '{"someTag","anotherTag"}'::text[];
(taken from: https://www.postgresql.org/docs/9.1/static/functions-array.html)
Is it possible to do it with QueryDSL? Something like the code below?
predicate.and(entity.tags.eqAll(<whatever>));
The 1st step is to generate the proper SQL: WHERE tags #> '{"someTag","anotherTag"}'::text[];
The 2nd step is described by coladict (thanks a lot!): figure out which functions are called: #> is arraycontains and ::text[] is string_to_array.
The 3rd step is to call them properly. After hours of debugging I figured out that HQL doesn't treat functions as functions unless they are part of a comparison (in my case: ... = true), so the final solution looks like this:
predicate.and(
    Expressions.booleanTemplate("arraycontains({0}, string_to_array({1}, ',')) = true",
        entity.tags,
        tagsStr)
);
where tagsStr is a String with the values separated by commas (e.g. "someTag,anotherTag").
Since you can't use custom operators, you will have to use their functional equivalents. You can look them up in the psql console with \doS+. For \doS+ #> we get several results, but this is the one you want:
List of operators
 Schema     | Name | Left arg type | Right arg type | Result type | Function            | Description
------------+------+---------------+----------------+-------------+---------------------+-------------
 pg_catalog | #>   | anyarray      | anyarray       | boolean     | arraycontains       | contains
It tells us the function used is called arraycontains, so now we look up that function to see its parameters using \df arraycontains:
List of functions
 Schema     | Name          | Result data type | Argument data types | Type
------------+---------------+------------------+---------------------+--------
 pg_catalog | arraycontains | boolean          | anyarray, anyarray  | normal
From here, we transform the target query you're aiming for into:
SELECT * FROM entity WHERE arraycontains(tags, '{"someTag","anotherTag"}'::text[]);
You should then be able to use the builder's function call to create this condition.
ParameterExpression<String[]> tags = cb.parameter(String[].class);
// "root" is the criteria Root<Entity>; the function name is the arraycontains found above
Expression<Boolean> tagcheck = cb.function("arraycontains", Boolean.class, root.get(Entity_.tags), tags);
Though I use a different array solution (might publish soon), I believe it should work, unless there are bugs in the underlying implementation.
An alternative to this method would be to compile the escaped string form of the array and pass it on as the second parameter. It's easier to print if you don't treat the double quotes as optional. In that case, you have to replace String[] with String in the ParameterExpression line above.
For EclipseLink I created a function
CREATE OR REPLACE FUNCTION check_array(array_val text[], string_comma character varying) RETURNS bool AS $$
BEGIN
    RETURN arraycontains(array_val, string_to_array(string_comma, ','));
END;
$$ LANGUAGE plpgsql;
As pointed out by Serhii, you can then use Expressions.booleanTemplate("FUNCTION('check_array', {0}, {1}) = true", entity.tags, tagsStr).
I have a DataFrame and one of its columns contains a JSON string. So far, I've implemented the Function interface required by the JavaRDD.map method: Function<Row,Row>(). Within this function, I'm parsing the JSON and creating a new row whose additional columns come from values in the JSON. For example:
Original row:
+------+----------------------------------+
| id   | json                             |
+------+----------------------------------+
| 1    | {"id":"abcd", "name":"dmux",...} |
+------+----------------------------------+
After applying my function:
+------+----------+-----------+
| id   | json_id  | json_name |
+------+----------+-----------+
| 1    | abcd     | dmux      |
+------+----------+-----------+
I'm running into trouble when trying to create a new DataFrame from the returned JavaRDD. Now that I have these new rows, I need to create a schema. The schema is highly dependent on the structure of the JSON, so I'm trying to figure out a way of passing schema data back from the function along with the Row object. I can't use broadcast variables as the SparkContext doesn't get passed into the function.
Other than looping through each column in a row in the caller of Function, what options do I have?
You can create a StructType. This is Scala, but it would work the same way:
val newSchema = StructType(Array(
StructField("id", LongType, false),
StructField("json_id", StringType, false),
StructField("json_name", StringType, false)
))
val newDf = sqlContext.createDataFrame(rdd, newSchema)
Incidentally, you need to make sure your rdd is of type RDD[Row].
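To make that concrete, here is a minimal Scala sketch of the whole round trip; the JSON parsing inside the map is stubbed out with a comment, and df / sqlContext are assumed to be the original DataFrame and SQL context from the question:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Map each original row to a new Row; real code would parse r.getString(1) as JSON
// and pull json_id / json_name out of it (stubbed here, and the id is assumed to be a long).
val rowRdd = df.rdd.map { r =>
  Row(r.getLong(0), "abcd", "dmux")
}

val newSchema = StructType(Array(
  StructField("id", LongType, false),
  StructField("json_id", StringType, false),
  StructField("json_name", StringType, false)
))

val newDf = sqlContext.createDataFrame(rowRdd, newSchema)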