Cast a Spark DataFrame's schema - Java

I have a dataframe with the following schema:
StructType currentSchema = new StructType(new StructField[]{
new StructField("age", DataTypes.StringType, false, Metadata.empty()),
new StructField("grade", DataTypes.StringType, false, Metadata.empty()),
new StructField("dateOfBirth", DataTypes.StringType, false, Metadata.empty())
});
And I want to convert it at once (without specifying each column) to the following schema:
StructType newSchema = new StructType(new StructField[]{
new StructField("age", DataTypes.IntegerType, false, Metadata.empty()),
new StructField("grade", DataTypes.IntegerType, false, Metadata.empty()),
new StructField("dateOfBirth", DataTypes.DateType, false, Metadata.empty())
});
Is there any way to do such df.convert(newSchema) operation?

Since DataFrames are immutable, you have to create a new DataFrame to change the schema. To do so, use one of the following methods:
I:
Dataset<Row> ndf = df.select(col("age").cast(DataTypes.IntegerType),
col("grade").cast(DataTypes.IntegerType),
col("dateOfBirth").cast(DataTypes.DateType));
ndf.printSchema();
II:
or (shown here just for the age column):
Dataset<Row> ndf = df.withColumn("new_age", df.col("age").cast(DataTypes.IntegerType)).drop("age");
ndf.printSchema();
III:
Last but not least, use a map function to do your transformation and change the types at the same time:
Dataset<Row> df2 = df.map(new MapFunction<Row, Row>() {
@Override
public Row call(Row row) throws Exception {
return RowFactory.create(Integer.parseInt(row.getString(0)),
Integer.parseInt(row.getString(1)),
java.sql.Date.valueOf(row.getString(2)));
}
}, RowEncoder.apply(newSchema));
df2.printSchema();
Note that a plain (int) cast does not compile on a String, so the values are parsed with Integer.parseInt (and java.sql.Date.valueOf for the date column, which assumes a yyyy-MM-dd format).

One way to go about it is to ask Spark to cast all your columns to the new types you expect. I am not sure it would work for every conversion, but it works in many cases:
List<Column> columns = Arrays
.stream(newSchema.fields())
.map(field -> col(field.name()).cast(field.dataType()))
.collect(Collectors.toList());
Dataset<Row> newResult = result.select(columns.toArray(new Column[0]));
Another way is to rely on the way Spark applies schemas to CSV files, but that requires writing your data to disk, so I don't recommend that option:
result.write().csv("somewhere");
Dataset<Row> newResult = spark.read().schema(newSchema).csv("somewhere");

Related

Import CSV file into Spark Dataset with a Column of Arrays (Java)

I have a CSV dataset where one of the columns contains arrays. How do I import it into a Spark Dataset in Java using a schema that contains arrays?
I've tried the following (where the 3rd column is an array):
// Import data
DataType arrayType = DataTypes.createArrayType(DataTypes.StringType);
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("field1", DataTypes.StringType, true),
DataTypes.createStructField("field2", DataTypes.StringType, true),
DataTypes.createStructField("field3", arrayType, false),
});
Dataset<Row> df = spark.read().format("csv")
.option("sep", "\t")
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.option("header", "true")
.schema(schema)
.load(filepath);
An array column stored in the CSV as a string can be parsed to an ArrayType with the from_json function (shown here in Scala):
val csvFileContent = Seq(
"ID\tArrayColumn",
"1\t['a','b']",
"2\t['c','d']"
).toDS()
val csvFileDataFrame = spark.read.option("header", "true").option("delimiter", "\t").csv(csvFileContent.as(Encoders.STRING))
csvFileDataFrame
.withColumn("ArrayColumn", from_json(col("ArrayColumn"), ArrayType(StringType)))
Output:
+---+-----------+
|ID |ArrayColumn|
+---+-----------+
|1 |[a, b] |
|2 |[c, d] |
+---+-----------+
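Since the question is about the Java API, here is a minimal sketch of the same approach in Java, assuming an existing SparkSession named spark, static imports of org.apache.spark.sql.functions.col and from_json, and a hypothetical file path (the column names follow the question):
// Read the TSV with the array column as a plain string first
Dataset<Row> raw = spark.read()
.option("header", "true")
.option("sep", "\t")
.csv("/path/to/file.tsv"); // hypothetical path
// Then parse the string column into an array<string> column
Dataset<Row> parsed = raw.withColumn("field3",
from_json(col("field3"), DataTypes.createArrayType(DataTypes.StringType)));
parsed.printSchema();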

Creating a simple 1-row Spark DataFrame with Java API

In Scala, I can create a single-row DataFrame from an in-memory string like so:
val stringAsList = List("buzz")
val df = sqlContext.sparkContext.parallelize(stringAsList).toDF("fizz")
df.show()
When df.show() runs, it outputs:
+-----+
| fizz|
+-----+
| buzz|
+-----+
Now I'm trying to do this from inside a Java class. Apparently JavaRDDs don't have a toDF(String) method. I've tried:
List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
SQLContext sqlContext = new SQLContext(sparkContext);
DataFrame df = sqlContext.createDataFrame(sparkContext
.parallelize(stringAsList), StringType);
df.show();
...but still seem to be coming up short. Now when df.show(); executes, I get:
++
||
++
||
++
(An empty DF.) So I ask: Using the Java API, how do I read an in-memory string into a DataFrame that has only 1 row and 1 column in it, and also specify the name of that column? (So that the df.show() is identical to the Scala one above)?
I have created 2 examples for Spark 2 if you need to upgrade:
Simple Fizz/Buzz (or foe/bar - old generation :) ):
SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
.getOrCreate();
List<String> stringAsList = new ArrayList<>();
stringAsList.add("bar");
JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String row) -> RowFactory.create(row));
// Creates schema
StructType schema = DataTypes.createStructType(
new StructField[] { DataTypes.createStructField("foe", DataTypes.StringType, false) });
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
2x2 data:
SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
.getOrCreate();
List<String[]> stringAsList = new ArrayList<>();
stringAsList.add(new String[] { "bar1.1", "bar2.1" });
stringAsList.add(new String[] { "bar1.2", "bar2.2" });
JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String[] row) -> RowFactory.create(row));
// Creates schema
StructType schema = DataTypes
.createStructType(new StructField[] { DataTypes.createStructField("foe1", DataTypes.StringType, false),
DataTypes.createStructField("foe2", DataTypes.StringType, false) });
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
Code can be downloaded from: https://github.com/jgperrin/net.jgp.labs.spark.
You can achieve this by converting the List to an RDD and then creating a schema that contains the column name.
There might be other ways as well; this is just one of them.
List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String row) -> {
return RowFactory.create(row);
});
StructType schema = DataTypes.createStructType(new StructField[] { DataTypes.createStructField("fizz", DataTypes.StringType, false) });
DataFrame df = sqlContext.createDataFrame(rowRDD, schema).toDF();
df.show();
//+----+
//|fizz|
//+----+
//|buzz|
//+----+
Building on what @jgp suggested, if you want to do this for mixed types you can do:
List<Tuple2<Integer, Boolean>> mixedTypes = Arrays.asList(
new Tuple2<>(1, false),
new Tuple2<>(1, false),
new Tuple2<>(1, false));
JavaRDD<Row> rowRDD = sparkContext.parallelize(mixedTypes).map(row -> RowFactory.create(row._1, row._2));
StructType mySchema = new StructType()
.add("id", DataTypes.IntegerType, false)
.add("flag", DataTypes.BooleanType, false);
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, mySchema).toDF();
This might help with @jdk2588's question.
This post here provides a solution that doesn't go through sparkContext.parallelize(...): https://timepasstechies.com/create-spark-dataframe-java-list/
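As a rough illustration of that no-parallelize route (assuming a Spark 2 SparkSession named spark), createDataFrame accepts a plain java.util.List<Row> directly:
// Build the rows and the schema from plain Java collections
List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create("buzz"));
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("fizz", DataTypes.StringType, false) });
// No RDD involved: hand the list and schema straight to the session
Dataset<Row> df = spark.createDataFrame(rows, schema);
df.show();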

Converting Java Map to Spark DataFrame (Java API)

I'm trying to use Spark (Java API) to take an in-memory Map (that potentially contains other nested Maps as its values) and convert it into a dataframe. I think I need something along these lines:
Map myMap = getSomehow();
RDD myRDD = sparkContext.makeRDD(myMap); // ???
DataFrame df = sparkContext.read(myRDD); // ???
But I'm having a tough time seeing the forest through the trees here...any ideas? Again this might be a Map<String,String> or a Map<String,Map>, where there could be several nested layers of maps-inside-of-maps-inside-of-maps, etc.
So I tried something; I'm not sure it's the most efficient way to do it, but I don't see any other right now.
SparkConf sf = new SparkConf().setAppName("name").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(sf);
SQLContext sqlCon = new SQLContext(sc);
HashMap<String, String> putMap = new HashMap<String, String>();
putMap.put("1", "test");
Map<String, Map<String, String>> map = new HashMap<String, Map<String, String>>();
map.put("test1", putMap);
List<Tuple2<String, HashMap>> list = new ArrayList<Tuple2<String, HashMap>>();
Set<String> allKeys = map.keySet();
for (String key : allKeys) {
list.add(new Tuple2<String, HashMap>(key, (HashMap) map.get(key)));
};
JavaRDD<Tuple2<String, HashMap>> rdd = sc.parallelize(list);
System.out.println(rdd.first());
List<StructField> fields = new ArrayList<>();
StructField field1 = DataTypes.createStructField("String", DataTypes.StringType, true);
StructField field2 = DataTypes.createStructField("Map",
DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType), true);
fields.add(field1);
fields.add(field2);
StructType struct = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = rdd.map(new Function<Tuple2<String, HashMap>, Row>() {
@Override
public Row call(Tuple2<String, HashMap> arg0) throws Exception {
return RowFactory.create(arg0._1, arg0._2);
}
});
DataFrame df = sqlCon.createDataFrame(rowRDD, struct);
df.show();
In this scenario I assumed that the Map in the DataFrame is of type (String, String). Hope this helps!
Edit: Obviously you can delete all the prints. I did this for visualization purposes!
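For the nested Map<String, Map<String, String>> case mentioned in the question, here is a rough, untested sketch that reuses sqlCon and sc from above; it nests a MapType inside the value type of the schema, on the assumption that Spark's row converter also handles plain java.util.Map values for nested MapType columns:
// Hypothetical nested input: outer key -> (inner key -> value)
Map<String, String> inner = new HashMap<String, String>();
inner.put("1", "test");
Map<String, Map<String, String>> nested = new HashMap<String, Map<String, String>>();
nested.put("test1", inner);
// Schema whose map value type is itself a map
StructType nestedStruct = DataTypes.createStructType(Arrays.asList(
DataTypes.createStructField("key", DataTypes.StringType, true),
DataTypes.createStructField("value",
DataTypes.createMapType(DataTypes.StringType,
DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)), true)));
// One Row per outer entry: (key, innerMap)
List<Row> nestedRows = new ArrayList<Row>();
for (Map.Entry<String, Map<String, String>> e : nested.entrySet()) {
nestedRows.add(RowFactory.create(e.getKey(), e.getValue()));
}
DataFrame nestedDf = sqlCon.createDataFrame(sc.parallelize(nestedRows), nestedStruct);
nestedDf.show();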

Comparison between different methods of executing SQL queries on Cassandra column families using Spark

As part of my project, I have to create a SQL query interface for a very large Cassandra dataset. I have therefore been looking at different methods for executing SQL queries on Cassandra column families using Spark, and I have come up with three different methods:
using Spark SQLContext with a statically defined schema
// statically defined in the application
public static class TableTuple implements Serializable {
private int id;
private String line;
TableTuple (int i, String l) {
id = i;
line = l;
}
// getters and setters
...
}
and I consume the definition as:
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<CassandraRow> rowrdd = javaFunctions(sc).cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
JavaRDD<TableTuple> rdd = rowrdd.map(row -> new TableTuple(row.getInt(0), row.getString(1)));
DataFrame dataFrame = sqlContext.createDataFrame(rdd, TableTuple.class);
dataFrame.registerTempTable("lines");
DataFrame resultsFrame = sqlContext.sql("Select line from lines where id=1");
System.out.println(Arrays.asList(resultsFrame.collect()));
using Spark SQLContext with a dynamically defined schema
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<CassandraRow> cassandraRdd = javaFunctions(sc).cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
JavaRDD<Row> rdd = cassandraRdd.map(row -> RowFactory.create(row.getInt(0), row.getString(1)));
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
fields.add(DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
DataFrame dataFrame = sqlContext.createDataFrame(rdd, schema);
dataFrame.registerTempTable("lines");
DataFrame resultDataFrame = sqlContext.sql("select line from lines where id = 1");
System.out.println(Arrays.asList(resultDataFrame.collect()));
using CassandraSQLContext from the spark-cassandra-connector
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
CassandraSQLContext sqlContext = new CassandraSQLContext(sc);
DataFrame resultsFrame = sqlContext.sql("Select line from " + CASSANDRA_KEYSPACE + "." + CASSANDRA_COLUMN_FAMILY + " where id = 1");
System.out.println(Arrays.asList(resultsFrame.collect()));
I would like to know the advantages/disadvantages of one method over another. Also, for the CassandraSQLContext method, are queries limited to CQL, or is it fully compatible with Spark SQL? I would also like an analysis pertaining to my specific use case: I have a Cassandra column family with ~17.6 million tuples and 62 columns. For querying such a large database, which method is the most suitable?

Convert JavaDStream<String> to JavaRDD<String>

I have a JavaDStream which gets data from an external source. I'm trying to integrate Spark Streaming and Spark SQL. It's known that a JavaDStream is made up of JavaRDDs, and I can only apply the function applySchema() when I have a JavaRDD. Please help me convert it to a JavaRDD. I know there are functions for this in Scala, and it's much easier there, but help me out in Java.
You can't transform a DStream into an RDD. As you mention, a DStream contains RDDs. The way to get access to the RDDs is by applying a function to each RDD of the DStream using foreachRDD. See the docs: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function2)
You first have to access all the RDDs inside the DStream using foreachRDD, for example:
javaDStream.foreachRDD(rdd -> {
rdd.collect().forEach(record -> {
...
});
});
I hope this helps to convert a JavaDStream to a JavaRDD!
JavaDStream<String> lines = stream.map(ConsumerRecord::value);
//Create JavaRDD<Row>
lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> rdd) {
JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
@Override
public Row call(String msg) {
Row row = RowFactory.create(msg);
return row;
}
});
//Create Schema
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("value", DataTypes.StringType, true)});
//Get Spark 2.0 session
SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> msgDataFrame = spark.createDataFrame(rowRDD, schema);
msgDataFrame.show();
}
});
