Spark - read JSON array from column - java

Using Spark 2.11, I have the following Dataset (read from a Cassandra table):
+------------+----------------------------------------------------------+
|id |attributes |
+------------+----------------------------------------------------------+
|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}] |
+------------+----------------------------------------------------------+
This is the printSchema():
root
|-- id: string (nullable = true)
|-- attributes: string (nullable = true)
The attributes column is a string holding a JSON array of objects. I'm trying to explode it into a Dataset but keep failing. I tried defining the schema as follows:
StructType type = new StructType()
.add("id", new IntegerType(), false)
.add("name", new StringType(), false)
.add("score", new FloatType(), false)
.add("snippets", new IntegerType(), false );
ArrayType schema = new ArrayType(type, false);
And provide it to from_json as follows:
df = df.withColumn("val", functions.from_json(df.col("attributes"), schema));
This fails with MatchError:
Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.IntegerType@43756cb (of class org.apache.spark.sql.types.IntegerType)
What's the correct way to do that?

You can specify the schema this way:
val schema = ArrayType(
StructType(Array(
StructField("id", IntegerType, false),
StructField("name", StringType, false),
StructField("score", FloatType, false),
StructField("snippets", IntegerType, false)
)),
false
)
val df1 = df.withColumn("val", from_json(col("attributes"), schema))
df1.show(false)
//+-----------+------------------------------------------------------+------------------------+
//|id |attributes |val |
//+-----------+------------------------------------------------------+------------------------+
//|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}]|[[1, function, 10.0, 1]]|
//+-----------+------------------------------------------------------+------------------------+
Or for Java:
import static org.apache.spark.sql.types.DataTypes.*;
import java.util.Arrays;
import org.apache.spark.sql.types.ArrayType;
// createArrayType returns an ArrayType, not a StructType
ArrayType schema = createArrayType(createStructType(Arrays.asList(
createStructField("id", IntegerType, false),
createStructField("name", StringType, false),
createStructField("score", FloatType, false),
createStructField("snippets", IntegerType, false)
)), false);

You can define the schema as a literal string instead:
val df2 = df.withColumn(
"val",
from_json(
df.col("attributes"),
lit("array<struct<id: int, name: string, score: float, snippets: int>>")
)
)
df2.show(false)
+-----------+------------------------------------------------------+------------------------+
|id |attributes |val |
+-----------+------------------------------------------------------+------------------------+
|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}]|[[1, function, 10.0, 1]]|
+-----------+------------------------------------------------------+------------------------+
If you prefer to use a schema:
val spark_struct = new StructType()
.add("id", IntegerType, false)
.add("name", StringType, false)
.add("score", FloatType, false)
.add("snippets", IntegerType, false)
val schema = new ArrayType(spark_struct, false)
val df2 = df.withColumn(
"val",
from_json(
df.col("attributes"),
schema
)
)
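Two problems with your original code: (1) type is a reserved keyword in Scala (it is a legal identifier in Java), so it cannot be used as a variable name there, and (2) the data types passed to add must not be created with new. Use the singleton instances instead (IntegerType in Scala, DataTypes.IntegerType in Java); an instance created with new does not match the singletons that Spark pattern-matches on, which is what raises the MatchError.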
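To illustrate why a freshly constructed type object can fail to match, here is a plain-Java analogy. It is not Spark code: SingletonMatchSketch, SketchType, INT, STRING and describe are hypothetical names invented for this sketch, and it only assumes that the library dispatches by comparing against one canonical instance per type.

```java
public class SingletonMatchSketch {
    // Hypothetical stand-in for a data type class with one canonical instance per type.
    public static class SketchType {
        public static final SketchType INT = new SketchType();
        public static final SketchType STRING = new SketchType();
    }

    // Dispatches by comparing the argument against the canonical instances.
    // A fresh instance is equal to none of them, so it falls through to the error.
    public static String describe(SketchType t) {
        if (t == SketchType.INT) {
            return "int";
        }
        if (t == SketchType.STRING) {
            return "string";
        }
        throw new IllegalStateException("no match for " + t);
    }

    public static void main(String[] args) {
        // The canonical instance is recognised.
        System.out.println(describe(SketchType.INT));
        try {
            // A new instance of the same class is not.
            describe(new SketchType());
        } catch (IllegalStateException e) {
            System.out.println("fresh instance did not match");
        }
    }
}
```

In the Spark case the equivalent fix is to pass DataTypes.IntegerType rather than new IntegerType().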

Related

Compare schema of dataframe with schema of other dataframe

I have the schemas of two datasets read from an HDFS path, defined below:
val df = spark.read.parquet("/path")
df.printSchema()
root
|-- name: string (nullable = true)
|-- id: integer (nullable = true)
|-- dept: integer (nullable = true)
Since your schema file looks like a CSV:
// Read the schema file and convert it into a Map of field name -> type name
val csvSchemaDf = spark.read.csv("/testschemafile")
val schemaMap = csvSchemaDf.rdd.map(x => (x(0).toString.trim, x(1).toString.trim)).collectAsMap
var isSchemaMatching = true
// Iterate through the schema fields of your df and compare
for (field <- df.schema.toList) {
if (!(schemaMap.contains(field.name) &&
field.dataType.toString.equals(schemaMap.get(field.name).get))) {
// Mismatch
isSchemaMatching = false
}
}
Use isSchemaMatching for further logic.
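The same comparison loop can be sketched in plain Java, independent of Spark. This is only an illustration under stated assumptions: SchemaCompareSketch and matches are hypothetical names, and both schemas are assumed to have already been reduced to maps of field name to type name (for example "StringType").

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaCompareSketch {
    // Returns true only if every field of the actual schema appears in the
    // expected schema with the same type name. Like the loop above, it does
    // not flag extra fields that exist only in the expected schema.
    public static boolean matches(Map<String, String> actual, Map<String, String> expected) {
        for (Map.Entry<String, String> field : actual.entrySet()) {
            String expectedType = expected.get(field.getKey());
            if (expectedType == null || !expectedType.equals(field.getValue())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> actual = new LinkedHashMap<>();
        actual.put("name", "StringType");
        actual.put("id", "IntegerType");

        Map<String, String> expected = new LinkedHashMap<>(actual);
        System.out.println(matches(actual, expected)); // true

        expected.put("id", "LongType");
        System.out.println(matches(actual, expected)); // false
    }
}
```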
You can create an instance of StructType in the following way:
val schema = StructType(
Seq(
StructField("name", StringType, true),
StructField("id", IntegerType, true)
))
Just read the file and create schema based on data in file.
Spark schema examples
Scaladoc of spark types
Spark type doc

Java SimpleXml. How to parse XML, which contains Base64-encoded xml?

I'm writing an Android app which retrieves xml data from the server. For manipulating such data I use Retrofit-SimpleXmlConverter(http://simple.sourceforge.net/home.php, https://github.com/square/retrofit/tree/master/retrofit-converters/simplexml) pipeline. I need to get this list of transactions:
<?xml version="1.0"?>
<Transactions>
<Transaction>
<Transaction_Time>12-01-2018</Transaction_Time>
<Transaction_Bonuses>123.11</Transaction_Bonuses>
<Summ>123.11</Summ>
<Transaction_Type>12</Transaction_Type>
<Dop_Info>BLOB Base64 data with cyrillic symbols</Dop_Info>
</Transaction>
<Transaction>
<Transaction_Time>12-01-2018</Transaction_Time>
<Transaction_Bonuses>123.11</Transaction_Bonuses>
<Summ>123.11</Summ>
<Transaction_Type>12</Transaction_Type>
<Dop_Info>no</Dop_Info>
</Transaction>
My problem is that the XML has a tag containing Base64-encoded XML. From the encoded XML I need the following data:
<CHECK>
<LINE name="Name of dish" quantity="1" summ="123,5"></LINE>
<LINE name="Name of dish" quantity="1" summ="123,5"></LINE>
<LINE name="Name of dish" quantity="1" summ="123,5"></LINE>
</CHECK>
Data classes:
@Root(name = "Transactions")
data class Transactions @JvmOverloads constructor(
    @field:ElementList(inline = true, required = false)
    @param:ElementList(inline = true, required = false)
    val list: List<Transaction>? = null
) {
    @Root(name = "Transaction")
    data class Transaction @JvmOverloads constructor(
        @field:Element(name = "Transaction_Time")
        @param:Element(name = "Transaction_Time")
        var transactionTime: String? = null,
        @field:Element(name = "Transaction_Bonuses")
        @param:Element(name = "Transaction_Bonuses")
        val transactionBonuses: Double? = 0.0,
        @field:Element(name = "Summ")
        @param:Element(name = "Summ")
        val summ: Double? = 0.0,
        @field:Element(name = "Transaction_Type")
        @param:Element(name = "Transaction_Type")
        val transactionType: Int? = 0,
        @field:Element(name = "Dop_Info")
        @param:Element(name = "Dop_Info")
        val dopInfo: ByteArray? = null
    ) {
        @Root(name = "CHECK")
        data class Check @JvmOverloads constructor(
            @field:ElementList(inline = true, required = false)
            @param:ElementList(inline = true, required = false)
            @field:Path("CHECKDATA/CHECKLINES")
            @param:Path("Holders_Cards/Holder_Card/Card")
            val list: List<Line>? = null
        ) {
            @Root(name = "LINE")
            data class Line @JvmOverloads constructor(
                @field:Attribute(name = "name", required = false)
                @param:Attribute(name = "name", required = false)
                val name: String? = null,
                @field:Attribute(name = "quantity", required = false)
                @param:Attribute(name = "quantity", required = false)
                val quantity: Int? = null,
                @field:Attribute(name = "summ", required = false)
                @param:Attribute(name = "summ", required = false)
                val summ: Double? = null
            )
        }
    }
}
At this point I'm confused: how can I deserialize the XML so that I get a Check model instead of the blob data? What are the next steps?
P.S. I tried to use the guide at http://simple.sourceforge.net/download/stream/doc/tutorial/tutorial.php#callback but didn't find anything helpful.
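One possible direction, sketched here with JDK classes only, is a two-pass approach: read Dop_Info as a plain string first, then decode it and parse the inner document separately. This is a minimal sketch, not a SimpleXml solution: DopInfoSketch and lineNames are hypothetical names, the inner XML is assumed to be well-formed and UTF-8 encoded (or to carry its own XML declaration), and java.util.Base64 is assumed to be available (on older Android API levels android.util.Base64 would be needed instead).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DopInfoSketch {
    // Decodes a Base64 string and returns the "name" attribute of every LINE
    // element found in the decoded XML document.
    public static List<String> lineNames(String base64Xml) {
        try {
            byte[] xmlBytes = Base64.getDecoder().decode(base64Xml.trim());
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            NodeList lines = doc.getElementsByTagName("LINE");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < lines.getLength(); i++) {
                names.add(((Element) lines.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers get a single failure type.
            throw new IllegalArgumentException("could not decode or parse Dop_Info", e);
        }
    }

    public static void main(String[] args) {
        // Build a sample payload the same way the server presumably does.
        String inner = "<CHECK><LINE name=\"Name of dish\" quantity=\"1\" summ=\"123,5\"/></CHECK>";
        String encoded = Base64.getEncoder()
                .encodeToString(inner.getBytes(StandardCharsets.UTF_8));
        System.out.println(lineNames(encoded)); // prints [Name of dish]
    }
}
```

Once the decoded XML is available as a string, it could presumably also be handed to the SimpleXml serializer in a second pass to populate the Check model, instead of walking the DOM by hand.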

Spark java : Creating a new Dataset with a given schema

I have this code that works well in Scala:
val schema = StructType(Array(
StructField("field1", StringType, true),
StructField("field2", TimestampType, true),
StructField("field3", DoubleType, true),
StructField("field4", StringType, true),
StructField("field5", StringType, true)
))
val df = spark.read
// some options
.schema(schema)
.load(myEndpoint)
I want to do something similar in Java. So my code is the following :
final StructType schema = new StructType(new StructField[] {
new StructField("field1", new StringType(), true,new Metadata()),
new StructField("field2", new TimestampType(), true,new Metadata()),
new StructField("field3", new StringType(), true,new Metadata()),
new StructField("field4", new StringType(), true,new Metadata()),
new StructField("field5", new StringType(), true,new Metadata())
});
Dataset<Row> df = spark.read()
// some options
.schema(schema)
.load(myEndpoint);
But this gives me the following error:
Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.StringType@37c5b8e8 (of class org.apache.spark.sql.types.StringType)
Nothing seems wrong with my schema, so I don't really know what the problem is here.
spark.read().load(myEndpoint).printSchema();
root
|-- field5: string (nullable = true)
|-- field2: timestamp (nullable = true)
|-- field1: string (nullable = true)
|-- field4: string (nullable = true)
|-- field3: string (nullable = true)
schema.printTreeString();
root
|-- field1: string (nullable = true)
|-- field2: timestamp (nullable = true)
|-- field3: string (nullable = true)
|-- field4: string (nullable = true)
|-- field5: string (nullable = true)
EDIT :
Here is a data sample :
spark.read().load(myEndpoint).show(false);
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|field5 |field2 |field1 |field4 |field3 |
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 16:54:50|SOME_VALUE |SOME_VALUE |0.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 16:58:50|SOME_VALUE |SOME_VALUE |50.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 17:00:50|SOME_VALUE |SOME_VALUE |20.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 18:04:50|SOME_VALUE |SOME_VALUE |10.0 |
...
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
Using the static methods and fields from the DataTypes class instead of the constructors worked for me in Spark 2.3.1:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("field1", DataTypes.StringType, true),
DataTypes.createStructField("field2", DataTypes.TimestampType, true),
DataTypes.createStructField("field3", DataTypes.StringType, true),
DataTypes.createStructField("field4", DataTypes.StringType, true),
DataTypes.createStructField("field5", DataTypes.StringType, true)
});

How to maintain order of key-value in DataFrame same as JSON?

Sample JSON data:
{"name": "dev","salary": 100,"occupation": "engg","address": "noida"}
{"name": "karthik","salary": 200,"occupation": "engg","address": "blore"}
Spark Java code:
DataFrame df = sqlContext.read().json(jsonPath);
df.printSchema();
df.show(false);
Output:
root
|-- address: string (nullable = true)
|-- name: string (nullable = true)
|-- occupation: string (nullable = true)
|-- salary: long (nullable = true)
+-------+-------+----------+------+
|address|name |occupation|salary|
+-------+-------+----------+------+
|noida |dev |engg |10000 |
|blore |karthik|engg |20000 |
+-------+-------+----------+------+
Columns are arranged in alphabetical order. Is there any way to maintain the natural order?
You can provide a schema while reading the JSON, and it will maintain the order.
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("name", DataTypes.StringType, true),
DataTypes.createStructField("salary", DataTypes.IntegerType, true),
DataTypes.createStructField("occupation", DataTypes.StringType, true),
DataTypes.createStructField("address", DataTypes.StringType, true)});
DataFrame df = sqlContext.read().schema(schema).json(jsonPath);
df.printSchema();
df.show(false);
You have two options:
1. Create a schema that follows the order of your JSON data and apply it while reading.
2. Select the fields from the table in the order you want.
The better option is to use a schema while reading the input.

Creating a simple 1-row Spark DataFrame with Java API

In Scala, I can create a single-row DataFrame from an in-memory string like so:
val stringAsList = List("buzz")
val df = sqlContext.sparkContext.parallelize(stringAsList).toDF("fizz")
df.show()
When df.show() runs, it outputs:
+-----+
| fizz|
+-----+
| buzz|
+-----+
Now I'm trying to do this from inside a Java class. Apparently JavaRDDs don't have a toDF(String) method. I've tried:
List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
SQLContext sqlContext = new SQLContext(sparkContext);
DataFrame df = sqlContext.createDataFrame(sparkContext
.parallelize(stringAsList), StringType);
df.show();
...but still seem to be coming up short. Now when df.show(); executes, I get:
++
||
++
||
++
(An empty DF.) So I ask: Using the Java API, how do I read an in-memory string into a DataFrame that has only 1 row and 1 column in it, and also specify the name of that column? (So that the df.show() is identical to the Scala one above)?
I have created 2 examples for Spark 2 if you need to upgrade:
Simple Fizz/Buzz (or foe/bar - old generation :) ):
SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
.getOrCreate();
List<String> stringAsList = new ArrayList<>();
stringAsList.add("bar");
JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String row) -> RowFactory.create(row));
// Creates schema
StructType schema = DataTypes.createStructType(
new StructField[] { DataTypes.createStructField("foe", DataTypes.StringType, false) });
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
2x2 data:
SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
.getOrCreate();
List<String[]> stringAsList = new ArrayList<>();
stringAsList.add(new String[] { "bar1.1", "bar2.1" });
stringAsList.add(new String[] { "bar1.2", "bar2.2" });
JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String[] row) -> RowFactory.create(row));
// Creates schema
StructType schema = DataTypes
.createStructType(new StructField[] { DataTypes.createStructField("foe1", DataTypes.StringType, false),
DataTypes.createStructField("foe2", DataTypes.StringType, false) });
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
Code can be downloaded from: https://github.com/jgperrin/net.jgp.labs.spark.
You can achieve this by converting the List to an RDD and then creating a schema that contains the column name.
There might be other ways as well, it's just one of them.
List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String row) -> {
return RowFactory.create(row);
});
StructType schema = DataTypes.createStructType(new StructField[] { DataTypes.createStructField("fizz", DataTypes.StringType, false) });
DataFrame df = sqlContext.createDataFrame(rowRDD, schema).toDF();
df.show();
//+----+
//|fizz|
//+----+
//|buzz|
//+----+
Building on what @jgp suggested: if you want to do this for mixed types, you can do:
List<Tuple2<Integer, Boolean>> mixedTypes = Arrays.asList(
new Tuple2<>(1, false),
new Tuple2<>(1, false),
new Tuple2<>(1, false));
JavaRDD<Row> rowRDD = sparkContext.parallelize(mixedTypes).map(row -> RowFactory.create(row._1, row._2));
StructType mySchema = new StructType()
.add("id", DataTypes.IntegerType, false)
.add("flag", DataTypes.BooleanType, false);
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, mySchema).toDF();
This might help with @jdk2588's question.
This post here provides a solution that doesn't go through sparkContext.parallelize(...): https://timepasstechies.com/create-spark-dataframe-java-list/
