How to concat all columns in a Spark DataFrame, using Java?

This is how I do it for 2 specific columns:
dataSet.withColumn("colName", concat(dataSet.col("col1"), lit(","),dataSet.col("col2") ));
but dataSet.columns() returns a String array, not a Column array.
How should I create a List<Column>?
Thanks!

Simple way - instead of df.columns, use concat_ws(",", "*"). Check the code below.
df.withColumn("colName",expr("concat_ws(',',*)")).show(false)
+---+--------+---+-------------+
|id |name    |age|colName      |
+---+--------+---+-------------+
|1  |Srinivas|29 |1,Srinivas,29|
|2  |Ravi    |30 |2,Ravi,30    |
+---+--------+---+-------------+
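The expr-based version also works from the Java Dataset API. A minimal sketch, assuming df is a Dataset<Row>:
import static org.apache.spark.sql.functions.expr;

// concat_ws over all columns via a SQL expression, no column list needed
df.withColumn("colName", expr("concat_ws(',', *)")).show(false);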

Java has a more verbose syntax for the column-based variant. Try this:
df.withColumn("colName",concat_ws(",", toScalaSeq(Arrays.stream(df.columns()).map(functions::col).collect(Collectors.toList()))));
Use the utility below to convert a Java List to a Scala Seq:
<T> Buffer<T> toScalaSeq(List<T> list) {
    return JavaConversions.asScalaBuffer(list);
}
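For completeness, here is a hedged, self-contained sketch of the same Java approach; the class and method names (ConcatAllColumns, withAllColumnsConcatenated) are just for illustration:
import static org.apache.spark.sql.functions.concat_ws;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

import scala.collection.JavaConversions;
import scala.collection.mutable.Buffer;

public class ConcatAllColumns {

    // Convert a Java List to a Scala Buffer (a Seq), as in the utility above.
    static <T> Buffer<T> toScalaSeq(List<T> list) {
        return JavaConversions.asScalaBuffer(list);
    }

    // Adds a "colName" column containing all columns joined with commas.
    public static Dataset<Row> withAllColumnsConcatenated(Dataset<Row> df) {
        List<Column> cols = Arrays.stream(df.columns())
                .map(functions::col)
                .collect(Collectors.toList());
        return df.withColumn("colName", concat_ws(",", toScalaSeq(cols)));
    }
}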

If someone is looking for a way to concat all the columns of a DataFrame in Scala, this is what worked for me:
val df_new = df.withColumn(new_column_name, concat_ws("-", df.columns.map(col): _*))

Related

Spark Scala: convert DataFrame or Dataset to a single comma-separated string

Below is the Spark Scala code which prints a one-column Dataset[Row]:
import org.apache.spark.sql.{Dataset, Row, SparkSession}
val spark: SparkSession = SparkSession.builder()
  .appName("Spark DataValidation")
  .config("SPARK_MAJOR_VERSION", "2")
  .enableHiveSupport()
  .getOrCreate()
val kafkaPath:String="hdfs:///landing/APPLICATION/*"
val targetPath:String="hdfs://datacompare/3"
val pk:String = "APPLICATION_ID"
val pkValues = spark
  .read
  .json(kafkaPath)
  .select("message.data.*")
  .select(pk)
  .distinct()
pkValues.show()
Output of the above code:
+--------------+
|APPLICATION_ID|
+--------------+
| 388|
| 447|
| 346|
| 861|
| 361|
| 557|
| 482|
| 518|
| 432|
| 422|
| 533|
| 733|
| 472|
| 457|
| 387|
| 394|
| 786|
| 458|
+--------------+
Question:
How can I convert this DataFrame to a comma-separated String variable?
Expected output:
val data:String= "388,447,346,861,361,557,482,518,432,422,533,733,472,457,387,394,786,458"
Please suggest how to convert a DataFrame[Row] or Dataset to one String.
I don't think that's a good idea, since a DataFrame is a distributed object and can be immense. collect will bring all the data to the driver, so you should perform this kind of operation carefully.
Here is what you can do with a DataFrame (two options):
df.select("APPLICATION_ID").rdd.map(r => r(0)).collect.mkString(",")
df.select("APPLICATION_ID").collect.mkString(",")
Result with a test dataFrame with only 3 rows:
String = 388,447,346
Edit: with a Dataset you can do it directly:
ds.collect.mkString(",")
Use collect_list:
import org.apache.spark.sql.functions._
val data = pkValues.select(collect_list(col(pk))) // collect to one row
  .as[Array[Long]]                                // set encoder, so you will have strongly-typed Dataset
  .take(1)(0)                                     // get the first row - result will be Array[Long]
  .mkString(",")                                  // and join all values
However, it's quite a bad idea to perform collect or take on all rows. Instead, you may want to save pkValues somewhere with .write, or make it an argument to another function, to keep the computation distributed.
Edit: I just noticed that @SCouto posted another answer just after mine. collect will also be correct; with the collect_list function you have one advantage - you can easily add grouping if you want, e.g. group keys into even and odd ones. It's up to you which solution you prefer: the simpler one with collect, or the one that is a line longer but more powerful.
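For readers coming from the Java question above, a hedged Java counterpart of the simpler collect-based option could look like this (pkValues is assumed to be a single-column Dataset<Row>; the method name is just for illustration):
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Collects the single-column Dataset to the driver and joins the values with commas.
// As noted above, only do this when the result is known to be small.
static String toCommaSeparated(Dataset<Row> pkValues) {
    return pkValues.collectAsList().stream()
            .map(r -> String.valueOf(r.get(0)))
            .collect(Collectors.joining(","));
}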

Postgresql Array Functions with QueryDSL

I use Vlad Mihalcea's library to map SQL arrays (PostgreSQL in my case) to JPA. Let's imagine I have an Entity, e.g.:
@TypeDefs({
    @TypeDef(name = "string-array", typeClass = StringArrayType.class)
})
@Entity
public class Entity {

    @Type(type = "string-array")
    @Column(columnDefinition = "text[]")
    private String[] tags;
}
The appropriate SQL is:
CREATE TABLE entity (
    tags text[]
);
Using QueryDSL I'd like to fetch rows which tags contains all the given ones. The raw SQL could be:
SELECT * FROM entity WHERE tags #> '{"someTag","anotherTag"}'::text[];
(taken from: https://www.postgresql.org/docs/9.1/static/functions-array.html)
Is it possible to do it with QueryDSL? Something like the code below?
predicate.and(entity.tags.eqAll(<whatever>));
The 1st step is to generate the proper SQL: WHERE tags #> '{"someTag","anotherTag"}'::text[];
The 2nd step is described by coladict (thanks a lot!): figure out which functions are called: #> is arraycontains and ::text[] is string_to_array.
The 3rd step is to call them properly. After hours of debugging I figured out that HQL doesn't treat functions as functions unless I added an expression sign (in my case: ...= true), so the final solution looks like this:
predicate.and(
    Expressions.booleanTemplate("arraycontains({0}, string_to_array({1}, ',')) = true",
        entity.tags,
        tagsStr)
);
where tagsStr is a String with values separated by commas.
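A fuller hedged sketch of how that predicate might be wired into a QueryDSL query; QEntity (the generated Q-type), queryFactory, and the findByTags method name are assumptions for illustration:
import java.util.List;

import com.querydsl.core.BooleanBuilder;
import com.querydsl.core.types.dsl.Expressions;
import com.querydsl.jpa.impl.JPAQueryFactory;

// Returns all Entity rows whose tags array contains every tag in tagsStr (comma separated).
static List<Entity> findByTags(JPAQueryFactory queryFactory, String tagsStr) {
    QEntity entity = QEntity.entity; // generated Q-type for Entity (assumed)
    BooleanBuilder predicate = new BooleanBuilder();
    predicate.and(
        Expressions.booleanTemplate("arraycontains({0}, string_to_array({1}, ',')) = true",
            entity.tags, tagsStr));
    return queryFactory.selectFrom(entity).where(predicate).fetch();
}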
Since you can't use custom operators, you will have to use their functional equivalents. You can look them up in the psql console with \doS+. For \doS+ #> we get several results, but this is the one you want:
List of operators
   Schema   | Name | Left arg type | Right arg type | Result type |   Function    | Description
------------+------+---------------+----------------+-------------+---------------+-------------
 pg_catalog | #>   | anyarray      | anyarray       | boolean     | arraycontains | contains
It tells us the function used is called arraycontains, so now we look up that function to see its parameters using \df arraycontains:
List of functions
   Schema   |     Name      | Result data type | Argument data types |  Type
------------+---------------+------------------+---------------------+--------
 pg_catalog | arraycontains | boolean          | anyarray, anyarray  | normal
From here, we transform the target query you're aiming for into:
SELECT * FROM entity WHERE arraycontains(tags, '{"someTag","anotherTag"}'::text[]);
You should then be able to use the builder's function call to create this condition (root here is the Root<Entity> of the criteria query):
ParameterExpression<String[]> tags = cb.parameter(String[].class);
Expression<Boolean> tagcheck = cb.function("arraycontains", Boolean.class, root.get(Entity_.tags), tags);
Though I use a different array solution (might publish soon), I believe it should work, unless there are bugs in the underlying implementation.
An alternative to this method would be to compile the escaped string format of the array and pass it on as the second parameter. It's easier to print if you don't treat the double quotes as optional. In that event, you have to replace String[] with String in the ParameterExpression line above.
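A hedged helper for that alternative, building the escaped array literal from a list of tags (the method name is illustrative):
import java.util.List;
import java.util.stream.Collectors;

// Builds a PostgreSQL array literal such as {"someTag","anotherTag"}
// to pass as a String parameter instead of a String[].
static String toPgArrayLiteral(List<String> tags) {
    return tags.stream()
            .map(t -> "\"" + t.replace("\"", "\\\"") + "\"")
            .collect(Collectors.joining(",", "{", "}"));
}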
For EclipseLink I created a function
CREATE OR REPLACE FUNCTION check_array(array_val text[], string_comma character varying) RETURNS bool AS $$
BEGIN
    RETURN arraycontains(array_val, string_to_array(string_comma, ','));
END;
$$ LANGUAGE plpgsql;
As pointed out by Serhii, you can then use Expressions.booleanTemplate("FUNCTION('check_array', {0}, {1}) = true", entity.tags, tagsStr).

Spark structured streaming: converting row to json

I'm trying to convert a Row of a DataFrame into a JSON string using only the Spark API.
From the input Row
+----------------+-----------+
|       someThing|       else|
+----------------+-----------+
|            life|         42|
+----------------+-----------+
with
myDataFrame
    .select(struct("*").as("col"))
    .select(to_json(col("col")))
    .writeStream()
    .foreach(new KafkaWriter())
    .start()
using a KafkaWriter that calls row.toString(), I got:
[{
"someThing":"life",
"else":42
}]
while I would like to get this instead:
{
"someThing":"life",
"else":42
}
(without the [])
Any idea?
Just found the solution. Using Row.mkString instead of Row.toString solved my case.
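For reference, a hedged sketch of what such a writer could look like in Java; the KafkaWriter name comes from the question, but the body (and the plain println standing in for a Kafka producer) is only illustrative:
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;

public class KafkaWriter extends ForeachWriter<Row> {

    @Override
    public boolean open(long partitionId, long version) {
        return true; // a real writer would open its Kafka producer here
    }

    @Override
    public void process(Row row) {
        // The row has a single to_json column, so mkString() (or getString(0))
        // yields the bare JSON object without the surrounding [].
        String json = row.mkString();
        System.out.println(json); // send to Kafka instead in a real writer
    }

    @Override
    public void close(Throwable errorOrNull) {
        // a real writer would close the producer here
    }
}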

Find column index by searching column header of a Dataset in Apache Spark Java

I have a Spark Dataset similar to the example below:
   0         1                2            3
+------+------------+--------------------+---+
|ItemID|Manufacturer|            Category|UPC|
+------+------------+--------------------+---+
|   804|         ael|Brush & Broom Han...|123|
|   805|         ael|Wheel Brush Parts...|124|
+------+------------+--------------------+---+
I need to find the position of a column by searching the column header.
For Example:
int position=getColumnPosition("Category");
This should return 2.
Is there any Spark function supported on the Dataset<Row> datatype to find the column index, or any Java function that can run on a Spark Dataset?
You need to access the schema and read the field index as follows:
int position = df.schema().fieldIndex("Category");
In PySpark, I have used the index method of the columns list:
df.columns.index(column_name)
You can consider this option (Scala implementation):
def getColumnPosition(dataframe: DataFrame, colName: String): Int = {
  dataframe.columns.indexOf(colName)
}
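And a hedged Java counterpart of the same idea, for the Dataset<Row> in the question (the method name is illustrative):
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Returns the zero-based position of colName, or -1 if it is not present.
// df.schema().fieldIndex(colName) is equivalent but throws if the column is missing.
static int getColumnPosition(Dataset<Row> df, String colName) {
    return Arrays.asList(df.columns()).indexOf(colName);
}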

Defining new schema for Spark Rows

I have a DataFrame and one of its columns contains a string of JSON. So far, I've implemented the Function interface as required by the JavaRDD.map method: Function<Row,Row>(). Within this function, I'm parsing the JSON and creating a new row whose additional columns come from values in the JSON. For example:
Original row:
+------+----------------------------------+
| id   | json                             |
+------+----------------------------------+
| 1    | {"id":"abcd", "name":"dmux",...} |
+------+----------------------------------+
After applying my function:
+------+----------+-----------+
| id   | json_id  | json_name |
+------+----------+-----------+
| 1    | abcd     | dmux      |
+------+----------+-----------+
I'm running into trouble when trying to create a new DataFrame from the returned JavaRDD. Now that I have these new rows, I need to create a schema. The schema is highly dependent on the structure of the JSON, so I'm trying to figure out a way of passing schema data back from the function along with the Row object. I can't use broadcast variables as the SparkContext doesn't get passed into the function.
Other than looping through each column in a row in the caller of the Function, what options do I have?
You can create a StructType. This is Scala, but it would work the same way:
val newSchema = StructType(Array(
  StructField("id", LongType, false),
  StructField("json_id", StringType, false),
  StructField("json_name", StringType, false)
))
val newDf = sqlContext.createDataFrame(rdd, newSchema)
Incidentally, you need to make sure your rdd is of type RDD[Row].
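Since the question is about Java, here is a hedged Java sketch of the same schema definition using DataTypes; the applySchema method name is just for illustration, and sqlContext/rdd are assumed from the question's context:
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Builds the schema for the flattened rows and applies it to the JavaRDD<Row>.
static Dataset<Row> applySchema(SQLContext sqlContext, JavaRDD<Row> rdd) {
    StructType newSchema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("id", DataTypes.LongType, false),
            DataTypes.createStructField("json_id", DataTypes.StringType, false),
            DataTypes.createStructField("json_name", DataTypes.StringType, false)));
    return sqlContext.createDataFrame(rdd, newSchema);
}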
