Spark Java: Creating a new Dataset with a given schema

I have this code that works well in Scala:
val schema = StructType(Array(
StructField("field1", StringType, true),
StructField("field2", TimestampType, true),
StructField("field3", DoubleType, true),
StructField("field4", StringType, true),
StructField("field5", StringType, true)
))
val df = spark.read
// some options
.schema(schema)
.load(myEndpoint)
I want to do something similar in Java, so my code is the following:
final StructType schema = new StructType(new StructField[] {
new StructField("field1", new StringType(), true,new Metadata()),
new StructField("field2", new TimestampType(), true,new Metadata()),
new StructField("field3", new StringType(), true,new Metadata()),
new StructField("field4", new StringType(), true,new Metadata()),
new StructField("field5", new StringType(), true,new Metadata())
});
Dataset<Row> df = spark.read()
// some options
.schema(schema)
.load(myEndpoint);
But this gives me the following error:
Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.StringType@37c5b8e8 (of class org.apache.spark.sql.types.StringType)
Nothing seems wrong with my schemas, so I don't really know what the problem is here.
spark.read().load(myEndpoint).printSchema();
root
|-- field5: string (nullable = true)
|-- field2: timestamp (nullable = true)
|-- field1: string (nullable = true)
|-- field4: string (nullable = true)
|-- field3: string (nullable = true)
schema.printTreeString();
root
|-- field1: string (nullable = true)
|-- field2: timestamp (nullable = true)
|-- field3: string (nullable = true)
|-- field4: string (nullable = true)
|-- field5: string (nullable = true)
EDIT :
Here is a data sample:
spark.read().load(myEndpoint).show(false);
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|field5 |field2 |field1 |field4 |field3 |
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 16:54:50|SOME_VALUE |SOME_VALUE |0.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 16:58:50|SOME_VALUE |SOME_VALUE |50.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 17:00:50|SOME_VALUE |SOME_VALUE |20.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 18:04:50|SOME_VALUE |SOME_VALUE |10.0 |
...
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+

Using the static methods and fields from the DataTypes class instead of the constructors worked for me in Spark 2.3.1:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("field1", DataTypes.StringType, true),
DataTypes.createStructField("field2", DataTypes.TimestampType, true),
DataTypes.createStructField("field3", DataTypes.StringType, true),
DataTypes.createStructField("field4", DataTypes.StringType, true),
DataTypes.createStructField("field5", DataTypes.StringType, true)
});
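The scala.MatchError in the question comes from building the types with new: Spark pattern-matches on the singleton type instances, which are exactly what the DataTypes fields expose. With the schema above, the read from the question works unchanged (a minimal sketch; the options and myEndpoint are placeholders from the question):
Dataset<Row> df = spark.read()
    // some options, as in the question
    .schema(schema)
    .load(myEndpoint);
df.printSchema();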

Related

Compare schema of dataframe with schema of other dataframe

I have schemas from two datasets read from an HDFS path, defined as below:
val df = spark.read.parquet("/path")
df.printSchema()
root
|-- name: string (nullable = true)
|-- id: integer (nullable = true)
|-- dept: integer (nullable = true)
Since your schema file seems to be a CSV:
// Read and convert into a MAP
val csvSchemaDf = spark.read.csv("/testschemafile")
val schemaMap = csvSchemaDf.rdd.map(x => (x(0).toString.trim, x(1).toString.trim)).collectAsMap
var isSchemaMatching = true
//Iterate through the schema fields of your df and compare
for( field <- df.schema.toList ){
if( !(schemaMap.contains(field.name) &&
field.dataType.toString.equals(schemaMap.get(field.name).get))){
//Mismatch
isSchemaMatching = false;
}
}
Use isSchemaMatching for further logic.
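For reference, the same check in Java might look like this (a sketch; it assumes a Dataset<Row> named df and a java.util.Map<String, String> named schemaMap built from the schema file, mirroring the Scala code above):
import org.apache.spark.sql.types.StructField;

// Compare each field of df's schema against the (name -> type) map from the schema file.
boolean isSchemaMatching = true;
for (StructField field : df.schema().fields()) {
    String expectedType = schemaMap.get(field.name());
    if (expectedType == null || !field.dataType().toString().equals(expectedType)) {
        isSchemaMatching = false; // mismatch
        break;
    }
}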
You can create an instance of StructType in the following way:
val schema = StructType(
Seq(
StructField("name", StringType(), true),
StructField("id", IntegerType(), true)
))
Just read the file and create the schema based on the data in the file.
Spark schema examples
Scaladoc of spark types
Spark type doc

Spark - read JSON array from column

Using Spark 2.11, I have the following Dataset (read from a Cassandra table):
+------------+----------------------------------------------------------+
|id |attributes |
+------------+----------------------------------------------------------+
|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}] |
+------------+----------------------------------------------------------+
This is the printSchema():
root
|-- id: string (nullable = true)
|-- attributes: string (nullable = true)
The attributes column is an array of JSON objects. I'm trying to explode it into a Dataset but keep failing. I was trying to define the schema as follows:
StructType type = new StructType()
.add("id", new IntegerType(), false)
.add("name", new StringType(), false)
.add("score", new FloatType(), false)
.add("snippets", new IntegerType(), false );
ArrayType schema = new ArrayType(type, false);
And provide it to from_json as follows:
df = df.withColumn("val", functions.from_json(df.col("attributes"), schema));
This fails with MatchError:
Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.IntegerType@43756cb (of class org.apache.spark.sql.types.IntegerType)
What's the correct way to do that?
You can specify the schema this way:
val schema = ArrayType(
StructType(Array(
StructField("id", IntegerType, false),
StructField("name", StringType, false),
StructField("score", FloatType, false),
StructField("snippets", IntegerType, false)
)),
false
)
val df1 = df.withColumn("val", from_json(col("attributes"), schema))
df1.show(false)
//+-----------+------------------------------------------------------+------------------------+
//|id |attributes |val |
//+-----------+------------------------------------------------------+------------------------+
//|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}]|[[1, function, 10.0, 1]]|
//+-----------+------------------------------------------------------+------------------------+
Or for Java:
import static org.apache.spark.sql.types.DataTypes.*;
import java.util.Arrays;
import org.apache.spark.sql.types.ArrayType;
ArrayType schema = createArrayType(createStructType(Arrays.asList(
createStructField("id", IntegerType, false),
createStructField("name", StringType, false),
createStructField("score", FloatType, false),
createStructField("snippets", IntegerType, false)
)), false);
You can define the schema as a literal string instead:
val df2 = df.withColumn(
"val",
from_json(
df.col("attributes"),
lit("array<struct<id: int, name: string, score: float, snippets: int>>")
)
)
df2.show(false)
+-----------+------------------------------------------------------+------------------------+
|id |attributes |val |
+-----------+------------------------------------------------------+------------------------+
|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}]|[[1, function, 10.0, 1]]|
+-----------+------------------------------------------------------+------------------------+
If you prefer to use a schema:
val spark_struct = new StructType()
.add("id", IntegerType, false)
.add("name", StringType, false)
.add("score", FloatType, false)
.add("snippets", IntegerType, false)
val schema = new ArrayType(spark_struct, false)
val df2 = df.withColumn(
"val",
from_json(
df.col("attributes"),
schema
)
)
Two problems with your original code were: (1) you used type as a variable name, which is a reserved keyword in Scala, and (2) you should not create the data types with new inside add; use the existing singleton instances (IntegerType in Scala, DataTypes.IntegerType in Java), since Spark pattern-matches on those singletons, which is what causes the scala.MatchError.
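For completeness, a corrected Java version of the snippet from the question might look like this (a sketch, assuming the same Dataset<Row> named df that holds the attributes column):
import static org.apache.spark.sql.functions.from_json;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Use the singleton types exposed by DataTypes instead of new IntegerType() etc.
StructType elementType = new StructType()
    .add("id", DataTypes.IntegerType, false)
    .add("name", DataTypes.StringType, false)
    .add("score", DataTypes.FloatType, false)
    .add("snippets", DataTypes.IntegerType, false);
ArrayType schema = DataTypes.createArrayType(elementType, false);

df = df.withColumn("val", from_json(df.col("attributes"), schema));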

How to maintain order of key-value in DataFrame same as JSON?

Sample JSON data:
{"name": "dev","salary": 100,"occupation": "engg","address": "noida"}
{"name": "karthik","salary": 200,"occupation": "engg","address": "blore"}
Spark Java code:
DataFrame df = sqlContext.read().json(jsonPath);
df.printSchema();
df.show(false);
Output:
root
|-- address: string (nullable = true)
|-- name: string (nullable = true)
|-- occupation: string (nullable = true)
|-- salary: long (nullable = true)
+-------+-------+----------+------+
|address|name   |occupation|salary|
+-------+-------+----------+------+
|noida  |dev    |engg      |100   |
|blore  |karthik|engg      |200   |
+-------+-------+----------+------+
Columns are arranged in alphabetical order. Is there any way to maintain the natural order?
You can provide a schema while reading the JSON and it will maintain the order.
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("name", DataTypes.StringType, true),
DataTypes.createStructField("salary", DataTypes.IntegerType, true),
DataTypes.createStructField("occupation", DataTypes.StringType, true),
DataTypes.createStructField("address", DataTypes.StringType, true)});
DataFrame df = sqlContext.read().schema(schema).json(jsonPath);
df.printSchema();
df.show(false);
You have two options:
1. Create a schema according to the order of your JSON data and apply it while reading, or
2. Select the fields from the table in the order you want (see the sketch below).
The better option is to use a schema while reading the input.
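The second option can be as simple as re-projecting the columns after the read (a sketch, based on the DataFrame from the question):
// Re-select the columns in the order they appear in the JSON.
DataFrame ordered = df.select("name", "salary", "occupation", "address");
ordered.show(false);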

Access multi-dimensional WrappedArray elements in Java using Spark SQL Row

I have the following schema:
|-- geometry: struct (nullable = true)
|    |-- coordinates: array (nullable = true)
|    |    |-- element: array (containsNull = true)
|    |    |    |-- element: array (containsNull = true)
|    |    |    |    |-- element: double (containsNull = true)
In Java, how can I access the double element with a Spark SQL row?
The furthest I can seem to get is: row.getStruct(0).getList(0).
Thanks!
In Scala this works; I leave it to you to translate it to Java:
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.WrappedArray
object Demo {
case class MyStruct(coordinates:Array[Array[Array[Double]]])
case class MyRow(struct:MyStruct)
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val data = MyRow(MyStruct(Array(Array(Array(1.0)))))
val df= sc.parallelize(Seq(data)).toDF()
// get first entry (row)
val row = df.collect()(0)
val arr = row.getAs[Row](0).getAs[WrappedArray[WrappedArray[WrappedArray[Double]]]](0)
//access an element
val res = arr(0)(0)(0)
println(res) // 1.0
}
}
It is best to avoid accessing the Row directly. You can use:
df.selectExpr("geometry.coordinates[0][0][0]")
or
df.select(col("geometry").getField("coordinates").getItem(0).getItem(0).getItem(0))
and use the result.
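In Java, the column-based route might look like this (a sketch; it assumes a Dataset<Row> named df with the geometry struct shown above):
// Let Spark SQL navigate the nested arrays instead of walking the Row by hand.
Dataset<Row> result = df.selectExpr("geometry.coordinates[0][0][0] AS firstCoordinate");
double value = result.first().getDouble(0);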

Creating a simple 1-row Spark DataFrame with Java API

In Scala, I can create a single-row DataFrame from an in-memory string like so:
val stringAsList = List("buzz")
val df = sqlContext.sparkContext.parallelize(stringAsList).toDF("fizz")
df.show()
When df.show() runs, it outputs:
+-----+
| fizz|
+-----+
| buzz|
+-----+
Now I'm trying to do this from inside a Java class. Apparently JavaRDDs don't have a toDF(String) method. I've tried:
List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
SQLContext sqlContext = new SQLContext(sparkContext);
DataFrame df = sqlContext.createDataFrame(sparkContext
.parallelize(stringAsList), StringType);
df.show();
...but still seem to be coming up short. Now when df.show(); executes, I get:
++
||
++
||
++
(An empty DF.) So I ask: Using the Java API, how do I read an in-memory string into a DataFrame that has only 1 row and 1 column in it, and also specify the name of that column? (So that the df.show() is identical to the Scala one above)?
I have created 2 examples for Spark 2 if you need to upgrade:
Simple Fizz/Buzz (or foe/bar - old generation :) ):
SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
.getOrCreate();
List<String> stringAsList = new ArrayList<>();
stringAsList.add("bar");
JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String row) -> RowFactory.create(row));
// Creates schema
StructType schema = DataTypes.createStructType(
new StructField[] { DataTypes.createStructField("foe", DataTypes.StringType, false) });
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
2x2 data:
SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
.getOrCreate();
List<String[]> stringAsList = new ArrayList<>();
stringAsList.add(new String[] { "bar1.1", "bar2.1" });
stringAsList.add(new String[] { "bar1.2", "bar2.2" });
JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String[] row) -> RowFactory.create(row));
// Creates schema
StructType schema = DataTypes
.createStructType(new StructField[] { DataTypes.createStructField("foe1", DataTypes.StringType, false),
DataTypes.createStructField("foe2", DataTypes.StringType, false) });
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
Code can be downloaded from: https://github.com/jgperrin/net.jgp.labs.spark.
You can achieve this by converting the List to an RDD and then creating a schema that contains the column name. There might be other ways as well; this is just one of them.
List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String row) -> {
return RowFactory.create(row);
});
StructType schema = DataTypes.createStructType(new StructField[] { DataTypes.createStructField("fizz", DataTypes.StringType, false) });
DataFrame df = sqlContext.createDataFrame(rowRDD, schema).toDF();
df.show();
//+----+
//|fizz|
//+----+
//|buzz|
//+----+
Building on what @jgp suggested. If you want to do this for mixed types you can do:
List<Tuple2<Integer, Boolean>> mixedTypes = Arrays.asList(
new Tuple2<>(1, false),
new Tuple2<>(1, false),
new Tuple2<>(1, false));
JavaRDD<Row> rowRDD = sparkContext.parallelize(mixedTypes).map(row -> RowFactory.create(row._1, row._2));
StructType mySchema = new StructType()
.add("id", DataTypes.IntegerType, false)
.add("flag", DataTypes.BooleanType, false);
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, mySchema).toDF();
This might help with @jdk2588's question: this post provides a solution that doesn't go through sparkContext.parallelize(...): https://timepasstechies.com/create-spark-dataframe-java-list/
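A minimal sketch of that parallelize-free approach for Spark 2.x (names here are illustrative; it assumes a SparkSession named spark):
import java.util.Collections;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build the rows as a plain Java list and hand it straight to createDataFrame.
List<Row> rows = Collections.singletonList(RowFactory.create("buzz"));
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("fizz", DataTypes.StringType, false) });
Dataset<Row> df = spark.createDataFrame(rows, schema);
df.show();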
