Apache Spark read array float from CSV using Java [duplicate] - java

This question already has answers here:
Load CSV data in to Dataframe and convert to Array using Apache Spark (Java)
(2 answers)
Closed 4 years ago.
I'm working on a new Spark project using Java. I have to read some data from CSV files, and these CSVs contain an array of floats; I don't know how to get this array into my dataset.
I'm reading from this CSV:
CSV data image: https://imgur.com/a/PdrMhev
And I'm trying to get the data in this way:
Dataset<Row> typedTrainingData = sparkSession.sql("SELECT CAST(IDp as String) IDp, CAST(Instt as String) Instt, CAST(dataVector as String) dataVector FROM TRAINING_DATA");
And I get this:
root
|-- IDp: string (nullable = true)
|-- Instt: string (nullable = true)
|-- dataVector: string (nullable = true)
+-------+-------------+-----------------+
|    IDp|        Instt|       dataVector|
+-------+-------------+-----------------+
|    p01|      V11apps|-0.41,-0.04,0.1..|
|    p02|      V21apps|-1.50,-1.50,-1...|
+-------+-------------+-----------------+
As you can see in the schema, I read the array as a String, but I want to get it as an array. Any recommendations?
I want to apply some MLlib machine learning algorithms to this data, which is why I want it as an array.
Thank you guys!

First define your schema:
StructType customStructType = new StructType();
customStructType = customStructType.add("_c0", DataTypes.StringType, false);
customStructType = customStructType.add("_c1", DataTypes.StringType, false);
customStructType = customStructType.add("_c2", DataTypes.createArrayType(DataTypes.DoubleType), false);
Then you can map your df to the new schema:
Dataset<Row> newDF = oldDF.map((MapFunction<Row, Row>) row -> {
    // dataVector is the third CSV column (_c2), so it sits at index 2;
    // the values are decimals, so parse them as doubles rather than longs
    String[] strings = row.getString(2).split(",");
    double[] result = new double[strings.length];
    for (int i = 0; i < strings.length; i++)
        result[i] = Double.parseDouble(strings[i]);
    return RowFactory.create(row.getString(0), row.getString(1), result);
}, RowEncoder.apply(customStructType));
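If you would rather stay in the DataFrame API, an alternative is to split the comma-separated string and cast it to an array column directly. A minimal sketch, assuming the typedTrainingData dataset from the question and a Spark version whose cast supports array element conversion:
import static org.apache.spark.sql.functions.expr;

// split the string on commas and cast every element to double
// (MLlib generally works with doubles; use array<float> if you really need floats)
Dataset<Row> withArray = typedTrainingData.withColumn("dataVector",
        expr("cast(split(dataVector, ',') as array<double>)"));
withArray.printSchema(); // dataVector: array (element: double)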

Related

Compare schema of dataframe with schema of other dataframe

I have schemas from two datasets read from an HDFS path, as defined below:
val df = spark.read.parquet("/path")
df.printSchema()
root
|-- name: string (nullable = true)
|-- id: integer (nullable = true)
|-- dept: integer (nullable = true)
Since your schema file seems to be a CSV:
// Read the schema file and convert it into a map of (fieldName -> dataType)
val csvSchemaDf = spark.read.csv("/testschemafile")
val schemaMap = csvSchemaDf.rdd.map(x => (x(0).toString.trim, x(1).toString.trim)).collectAsMap
var isSchemaMatching = true
// Iterate through the schema fields of your df and compare
for (field <- df.schema.toList) {
  if (!(schemaMap.contains(field.name) &&
        field.dataType.toString.equals(schemaMap.get(field.name).get))) {
    // Mismatch
    isSchemaMatching = false
  }
}
Use isSchemaMatching for further logic.
You can create an instance of StructType in the following way:
val schema = StructType(
  Seq(
    StructField("name", StringType, true),
    StructField("id", IntegerType, true)
  ))
Just read the file and create the schema based on the data in the file. For reference, see the Spark schema examples, the Scaladoc of the Spark types, and the Spark type docs.

How can I change a non-numeric value in the whole data set using Spark?

I'm using a data set with a lot of columns, and it has ? values throughout. I would like Spark (Java) to change every ? to 0. So far I can only do this for one column, but I would like to do it everywhere:
Dataset<Row> csvData = spark.read()
.option("header", false)
.option("inferSchema", true)
.option("maxColumns", 50000)
.csv("src/main/resources/K9.data");
csvData = csvData.withColumn("_c5409", when(col("_c5409").isNull(),0).otherwise(col("_c5409")) )
.withColumn("_c0", when(col("_c0").equalTo("?"),0).otherwise(col("_c0")) );
Maybe this has an easy solution; I'm new to Java and Spark :)
You can build a list of columns using when, and use it in select if you need to handle more complex if/else cases:
List<org.apache.spark.sql.Column> list = new ArrayList<org.apache.spark.sql.Column>();
for (String col : csvData.columns()) {
    // replace both nulls and the literal "?" with 0, keeping the original column name
    list.add(when(csvData.col(col).isNull().or(csvData.col(col).equalTo("?")), 0)
            .otherwise(csvData.col(col)).alias(col));
}
csvData = csvData.select(list.toArray(new org.apache.spark.sql.Column[0]));
If it is simply nulls you need to replace, this is good enough:
csvData = csvData.na().fill(0, csvData.columns());
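Since the question is about the literal ? rather than nulls, na().replace() is another option. A sketch, assuming the affected columns were read as strings (replace only touches columns whose type matches the replacement values):
// Replace the literal "?" with "0" in every column; numeric columns are left untouched
csvData = csvData.na().replace(csvData.columns(), java.util.Collections.singletonMap("?", "0"));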

Create a Dataset from String Spark Java (Without RDD)

I need to create a Dataset from a String. Key is the String:
Header h = new Header();
h.setName(Key);
SQLContext sqlC = spark.sqlContext();
Dataset<String> ds = sqlC.createDataset(Collections.singletonList(h.getName()), Encoders.STRING());
ds.show();
I need to write it into a txt file (is there one? I am using CSV right now):
ds.write().format("com.databricks.spark.csv").mode("overwrite")
.save(SomeLocation);
From the documentation, use df.write().text():
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html#text-java.lang.String-
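A minimal sketch of that writer, assuming ds is the single-string-column Dataset from above (the text data source requires exactly one string column):
// Write each string as one line of a plain text file
ds.write().mode("overwrite").text(SomeLocation);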

How to maintain the order of keys in a DataFrame the same as in the JSON?

Sample JSON data:
{"name": "dev","salary": 100,"occupation": "engg","address": "noida"}
{"name": "karthik","salary": 200,"occupation": "engg","address": "blore"}
Spark Java code:
DataFrame df = sqlContext.read().json(jsonPath);
df.printSchema();
df.show(false);
Output:
root
|-- address: string (nullable = true)
|-- name: string (nullable = true)
|-- occupation: string (nullable = true)
|-- salary: long (nullable = true)
+-------+-------+----------+------+
|address|name   |occupation|salary|
+-------+-------+----------+------+
|noida  |dev    |engg      |100   |
|blore  |karthik|engg      |200   |
+-------+-------+----------+------+
The columns are arranged in alphabetical order. Is there any way to maintain the original order?
You can provide a schema while reading the JSON and it will maintain the order:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("name", DataTypes.StringType, true),
DataTypes.createStructField("salary", DataTypes.IntegerType, true),
DataTypes.createStructField("occupation", DataTypes.StringType, true),
DataTypes.createStructField("address", DataTypes.StringType, true)});
DataFrame df = sqlContext.read().schema(schema).json(jsonPath);
df.printSchema();
df.show(false);
You have two options:
1. Create a schema matching the order of your JSON data and apply it while reading, or
2. Select the fields from the table in the order you want (see the sketch below).
The better option is to apply the schema while reading the input.
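For the second option, a minimal sketch, assuming the df read above:
// reorder the columns by selecting them explicitly after the read
DataFrame ordered = df.select("name", "salary", "occupation", "address");
ordered.show(false);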

Access multi-dimensional WrappedArray elements in Java using Spark SQL Row

I have the following schema:
|-- geometry: struct (nullable = true)
|    |-- coordinates: array (nullable = true)
|    |    |-- element: array (containsNull = true)
|    |    |    |-- element: array (containsNull = true)
|    |    |    |    |-- element: double (containsNull = true)
In Java, how can I access the double element with a Spark SQL row?
The furthest I can seem to get is: row.getStruct(0).getList(0).
Thanks!
In Scala this works; I leave it to you to translate it to Java (a rough translation is sketched after the code):
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.WrappedArray
object Demo {
  case class MyStruct(coordinates: Array[Array[Array[Double]]])
  case class MyRow(struct: MyStruct)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val data = MyRow(MyStruct(Array(Array(Array(1.0)))))
    val df = sc.parallelize(Seq(data)).toDF()

    // get first entry (row)
    val row = df.collect()(0)
    val arr = row.getAs[Row](0).getAs[WrappedArray[WrappedArray[WrappedArray[Double]]]](0)

    // access an element
    val res = arr(0)(0)(0)
    println(res) // 1.0
  }
}
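A rough Java translation of the same access, as a sketch: it assumes row is the Row from the question, geometry is its first column, and the nesting shown in the schema above.
import java.util.List;
import org.apache.spark.sql.Row;
import scala.collection.mutable.WrappedArray;

// geometry is the struct at position 0 of the row
Row geometry = row.getStruct(0);
// coordinates is the first field of the struct: a list whose elements are nested WrappedArrays
List<WrappedArray<WrappedArray<Double>>> coordinates = geometry.getList(0);
// drill down through the three array levels to reach the double
double value = coordinates.get(0).apply(0).apply(0);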
It is best to avoid accessing the row directly. You can use:
df.selectExpr("geometry.coordinates[0][0][0]")
or
df.select(col("geometry").getField("coordinates").getItem(0).getItem(0).getItem(0))
and use the result.
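To actually pull the value back to the driver, a small usage sketch (assuming the row exists and the value is not null):
double v = df.selectExpr("geometry.coordinates[0][0][0]").first().getDouble(0);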
