I have the schema of two datasets read from an HDFS path, and it is defined below:
val df = spark.read.parquet("/path")
df.printSchema()
root
|-- name: string (nullable = true)
|-- id: integer (nullable = true)
|-- dept: integer (nullable = true)
Since your schema file looks like a CSV:
// Read the schema file and convert it into a Map of column name -> type name
val csvSchemaDf = spark.read.csv("/testschemafile")
val schemaMap = csvSchemaDf.rdd.map(x => (x(0).toString.trim, x(1).toString.trim)).collectAsMap
var isSchemaMatching = true
//Iterate through the schema fields of your df and compare
for( field <- df.schema.toList ){
if( !(schemaMap.contains(field.name) &&
field.dataType.toString.equals(schemaMap.get(field.name).get))){
//Mismatch
isSchemaMatching = false;
}
}
Use isSchemaMatching for further logic.
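The comparison above only yields a single boolean. As a rough sketch of the same idea in plain Java, the helper below reports which fields differ. It assumes both schemas have already been reduced to maps of column name to type name (for example "StringType"); the class and method names are hypothetical and not part of Spark.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: compares two name -> type-name maps.
final class SchemaCheck {

    // Returns one message per actual field that is missing from, or typed
    // differently in, the expected map. An empty list means the schemas match.
    static List<String> findMismatches(Map<String, String> actual, Map<String, String> expected) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> field : actual.entrySet()) {
            String wanted = expected.get(field.getKey());
            if (wanted == null) {
                problems.add(field.getKey() + ": not in schema file");
            } else if (!wanted.equals(field.getValue())) {
                problems.add(field.getKey() + ": expected " + wanted + " but found " + field.getValue());
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Assumed type names; in practice they would come from the DataFrame schema.
        Map<String, String> actual = new LinkedHashMap<>();
        actual.put("name", "StringType");
        actual.put("id", "IntegerType");
        actual.put("dept", "IntegerType");

        // Assumed contents of the schema file, with one deliberate mismatch.
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("name", "StringType");
        expected.put("id", "StringType");

        System.out.println(findMismatches(actual, expected));
    }
}
```

Returning the list of mismatches rather than a flag makes it easier to log exactly which column broke the contract.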
You can create an instance of StructType in the following way:
import org.apache.spark.sql.types._
val schema = StructType(
Seq(
StructField("name", StringType, true),
StructField("id", IntegerType, true)
))
Just read the file and create the schema based on the data in the file.
Spark schema examples
Scaladoc of spark types
Spark type doc
Using Spark 2.11, I've the following Dataset (read from Cassandra table):
+------------+----------------------------------------------------------+
|id |attributes |
+------------+----------------------------------------------------------+
|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}] |
+------------+----------------------------------------------------------+
This is the printSchema():
root
|-- id: string (nullable = true)
|-- attributes: string (nullable = true)
The attributes column is an array of JSON objects. I'm trying to explode it into a Dataset but keep failing. I tried to define the schema as follows:
StructType type = new StructType()
.add("id", new IntegerType(), false)
.add("name", new StringType(), false)
.add("score", new FloatType(), false)
.add("snippets", new IntegerType(), false );
ArrayType schema = new ArrayType(type, false);
And provide it to from_json as follows:
df = df.withColumn("val", functions.from_json(df.col("attributes"), schema));
This fails with MatchError:
Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.IntegerType@43756cb (of class org.apache.spark.sql.types.IntegerType)
What's the correct way to do that?
You can specify the schema this way:
val schema = ArrayType(
StructType(Array(
StructField("id", IntegerType, false),
StructField("name", StringType, false),
StructField("score", FloatType, false),
StructField("snippets", IntegerType, false)
)),
false
)
val df1 = df.withColumn("val", from_json(col("attributes"), schema))
df1.show(false)
//+-----------+------------------------------------------------------+------------------------+
//|id |attributes |val |
//+-----------+------------------------------------------------------+------------------------+
//|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}]|[[1, function, 10.0, 1]]|
//+-----------+------------------------------------------------------+------------------------+
Or for Java:
import java.util.Arrays;
import org.apache.spark.sql.types.ArrayType;
import static org.apache.spark.sql.types.DataTypes.*;
// createArrayType returns an ArrayType, not a StructType
ArrayType schema = createArrayType(createStructType(Arrays.asList(
createStructField("id", IntegerType, false),
createStructField("name", StringType, false),
createStructField("score", FloatType, false),
createStructField("snippets", IntegerType, false)
)), false);
You can define the schema as a literal string instead:
val df2 = df.withColumn(
"val",
from_json(
df.col("attributes"),
lit("array<struct<id: int, name: string, score: float, snippets: int>>")
)
)
df2.show(false)
+-----------+------------------------------------------------------+------------------------+
|id |attributes |val |
+-----------+------------------------------------------------------+------------------------+
|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}]|[[1, function, 10.0, 1]]|
+-----------+------------------------------------------------------+------------------------+
If you prefer to use a schema:
val spark_struct = new StructType()
.add("id", IntegerType, false)
.add("name", StringType, false)
.add("score", FloatType, false)
.add("snippets", IntegerType, false)
val schema = new ArrayType(spark_struct, false)
val df2 = df.withColumn(
"val",
from_json(
df.col("attributes"),
schema
)
)
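Two problems with your original code: (1) type is a reserved keyword in Scala, so it cannot be used as a variable name there (in Java it is legal), and (2) the data types must not be created with new. Spark matches on its singleton type objects, so a fresh instance such as new IntegerType() is not recognised, which is what raises the MatchError. Use the existing IntegerType, StringType and FloatType objects instead.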
I'm working with a new Spark project using Java. I have to read some data from the CSV files and these CSVs have an array of floats and I do not know how I can get this array in my dataset.
I'm reading from this CSV:
[CSV data image][1] https://imgur.com/a/PdrMhev
And I'm trying to get the data in this way:
Dataset<Row> typedTrainingData = sparkSession.sql("SELECT CAST(IDp as String) IDp, CAST(Instt as String) Instt, CAST(dataVector as String) dataVector FROM TRAINING_DATA");
And I get this:
root
|-- IDp: string (nullable = true)
|-- Instt: string (nullable = true)
|-- dataVector: string (nullable = true)
+-------+-------------+-----------------+
| IDp| Instt| dataVector|
+-------+-------------+-----------------+
| p01| V11apps|-0.41,-0.04,0.1..|
| p02| V21apps|-1.50,-1.50,-1...|
+-------+-------------+-----------------+
As you can see in the schema, the array is read as a String, but I want to get it as an array. Any recommendations?
I want to use some of the MLlib machine learning algorithms on the loaded data, which is why I need the data as an array.
Thank you guys!!!!!!!!
First define your schema:
StructType customStructType = new StructType();
customStructType = customStructType.add("_c0", DataTypes.StringType, false);
customStructType = customStructType.add("_c1", DataTypes.StringType, false);
// The vector holds decimal values such as -0.41, so use DoubleType rather than LongType
customStructType = customStructType.add("_c2", DataTypes.createArrayType(DataTypes.DoubleType), false);
Then you can map your df to the new schema:
Dataset<Row> newDF = oldDF.map((MapFunction<Row, Row>) row -> {
// The vector is the third column, i.e. index 2
String[] strings = row.getString(2).split(",");
double[] result = new double[strings.length];
for (int i = 0; i < strings.length; i++)
result[i] = Double.parseDouble(strings[i]);
return RowFactory.create(row.getString(0), row.getString(1), result);
}, RowEncoder.apply(customStructType));
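The string-to-array step inside the map function can be isolated and checked on its own. The following is a minimal sketch in plain Java, assuming the vector is a comma-separated list of decimal numbers as in the question's sample; VectorParser and parseVector are hypothetical names, not Spark APIs.

```java
import java.util.Arrays;

// Hypothetical helper for turning a "dataVector" cell into numbers.
final class VectorParser {

    // Splits "a,b,c" into doubles. Null or blank input yields an empty array;
    // a token that is not a number raises NumberFormatException.
    static double[] parseVector(String csv) {
        if (csv == null || csv.trim().isEmpty()) {
            return new double[0];
        }
        String[] tokens = csv.split(",");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // Sample values taken from the start of the first row in the question.
        System.out.println(Arrays.toString(parseVector("-0.41, -0.04, 0.1")));
    }
}
```

Keeping the parsing in a small static method means the row-mapping lambda only has to call it, and bad input can be tested without starting a Spark session.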
I have this code that works well in Scala:
val schema = StructType(Array(
StructField("field1", StringType, true),
StructField("field2", TimestampType, true),
StructField("field3", DoubleType, true),
StructField("field4", StringType, true),
StructField("field5", StringType, true)
))
val df = spark.read
// some options
.schema(schema)
.load(myEndpoint)
I want to do something similar in Java, so my code is the following:
final StructType schema = new StructType(new StructField[] {
new StructField("field1", new StringType(), true,new Metadata()),
new StructField("field2", new TimestampType(), true,new Metadata()),
new StructField("field3", new StringType(), true,new Metadata()),
new StructField("field4", new StringType(), true,new Metadata()),
new StructField("field5", new StringType(), true,new Metadata())
});
Dataset<Row> df = spark.read()
// some options
.schema(schema)
.load(myEndpoint);
But this gives me the following error:
Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.StringType@37c5b8e8 (of class org.apache.spark.sql.types.StringType)
Nothing seems wrong with my schemas, so I don't really know what the problem is here.
spark.read().load(myEndpoint).printSchema();
root
|-- field5: string (nullable = true)
|-- field2: timestamp (nullable = true)
|-- field1: string (nullable = true)
|-- field4: string (nullable = true)
|-- field3: string (nullable = true)
schema.printTreeString();
root
|-- field1: string (nullable = true)
|-- field2: timestamp (nullable = true)
|-- field3: string (nullable = true)
|-- field4: string (nullable = true)
|-- field5: string (nullable = true)
EDIT :
Here is a data sample :
spark.read().load(myEndpoint).show(false);
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|field5 |field2 |field1 |field4 |field3 |
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 16:54:50|SOME_VALUE |SOME_VALUE |0.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 16:58:50|SOME_VALUE |SOME_VALUE |50.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 17:00:50|SOME_VALUE |SOME_VALUE |20.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 18:04:50|SOME_VALUE |SOME_VALUE |10.0 |
...
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
Using the static methods and fields from the DataTypes class instead of the constructors worked for me in Spark 2.3.1:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("field1", DataTypes.StringType, true),
DataTypes.createStructField("field2", DataTypes.TimestampType, true),
DataTypes.createStructField("field3", DataTypes.StringType, true),
DataTypes.createStructField("field4", DataTypes.StringType, true),
DataTypes.createStructField("field5", DataTypes.StringType, true)
});
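A plausible reason the constructors fail is that Spark recognises each data type by its single shared object, so a separately constructed instance is never matched. The sketch below models that behaviour in plain Java. FakeStringType and SingletonMatchDemo are made-up stand-ins for illustration and are not Spark's real classes.

```java
// Simplified stand-in for a Spark data type; NOT Spark's real class.
class FakeStringType {
    // The one shared instance, playing the role of DataTypes.StringType.
    static final FakeStringType INSTANCE = new FakeStringType();

    // Accessible constructor, so callers can (wrongly) create extra instances.
    FakeStringType() {}
}

final class SingletonMatchDemo {

    // Mimics a match on the singleton: only the shared instance is recognised,
    // anything else falls through and raises an error.
    static String describe(Object type) {
        if (type == FakeStringType.INSTANCE) {
            return "string";
        }
        throw new IllegalStateException("no match for " + type);
    }

    public static void main(String[] args) {
        // The shared instance is recognised.
        System.out.println(describe(FakeStringType.INSTANCE));
        try {
            // A fresh instance has the same class but is a different object.
            describe(new FakeStringType());
        } catch (IllegalStateException e) {
            System.out.println("failed: fresh instance is not the singleton");
        }
    }
}
```

Under this assumption, the static fields on DataTypes work because they hand back the shared objects rather than new ones.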
Sample JSON data:
{"name": "dev","salary": 100,"occupation": "engg","address": "noida"}
{"name": "karthik","salary": 200,"occupation": "engg","address": "blore"}
Spark Java code:
DataFrame df = sqlContext.read().json(jsonPath);
df.printSchema();
df.show(false);
Output:
root
|-- address: string (nullable = true)
|-- name: string (nullable = true)
|-- occupation: string (nullable = true)
|-- salary: long (nullable = true)
+-------+-------+----------+------+
|address|name |occupation|salary|
+-------+-------+----------+------+
|noida |dev |engg |10000 |
|blore |karthik|engg |20000 |
+-------+-------+----------+------+
The columns are arranged in alphabetical order. Is there any way to maintain the natural order?
You can provide a schema while reading the JSON, and it will maintain the order.
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("name", DataTypes.StringType, true),
DataTypes.createStructField("salary", DataTypes.IntegerType, true),
DataTypes.createStructField("occupation", DataTypes.StringType, true),
DataTypes.createStructField("address", DataTypes.StringType, true)});
DataFrame df = sqlContext.read().schema(schema).json(jsonPath);
df.printSchema();
df.show(false);
You have two options:
1. Create a schema that follows the field order of your JSON data and apply it while reading.
2. Select the fields from the DataFrame in the order you want.
The better option is to supply the schema while reading the input.
I have the following schema:
geometry: struct (nullable = true)
-- coordinates: array (nullable = true)
-- element: array (containsNull = true)
-- element: array (containsNull = true)
-- element: double (containsNull = true)
In Java, how can I access the double element with a Spark SQL row?
The furthest I can seem to get is: row.getStruct(0).getList(0).
Thanks!
In Scala this works; I leave it to you to translate it to Java:
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.WrappedArray
object Demo {
case class MyStruct(coordinates:Array[Array[Array[Double]]])
case class MyRow(struct:MyStruct)
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val data = MyRow(MyStruct(Array(Array(Array(1.0)))))
val df= sc.parallelize(Seq(data)).toDF()
// get first entry (row)
val row = df.collect()(0)
val arr = row.getAs[Row](0).getAs[WrappedArray[WrappedArray[WrappedArray[Double]]]](0)
//access an element
val res = arr(0)(0)(0)
println(res) // 1.0
}
}
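It is best to avoid accessing the row directly. Since geometry is a struct and coordinates is the nested array inside it, you can use:
df.selectExpr("geometry.coordinates[0][0][0]")
or
df.select(col("geometry").getField("coordinates").getItem(0).getItem(0).getItem(0))
and use the result.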