I have the following schema:
 |-- geometry: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: array (containsNull = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: double (containsNull = true)
In Java, how can I access the double element with a Spark SQL row?
The furthest I can seem to get is: row.getStruct(0).getList(0).
Thanks!
This works in Scala; I'll leave translating it to Java to you:
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.WrappedArray
object Demo {
case class MyStruct(coordinates:Array[Array[Array[Double]]])
case class MyRow(struct:MyStruct)
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val data = MyRow(MyStruct(Array(Array(Array(1.0)))))
val df= sc.parallelize(Seq(data)).toDF()
// get first entry (row)
val row = df.collect()(0)
val arr = row.getAs[Row](0).getAs[WrappedArray[WrappedArray[WrappedArray[Double]]]](0)
//access an element
val res = arr(0)(0)(0)
println(res) // 1.0
}
}
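It is best to avoid accessing the Row directly. Since geometry is a struct, select its coordinates field by name and then index into the nested arrays:
df.selectExpr("geometry.coordinates[0][0][0]")
or
df.select(col("geometry").getField("coordinates").getItem(0).getItem(0).getItem(0))
and use the result.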
I have two datasets read from an HDFS path, and their schema is shown below:
val df = spark.read.parquet("/path")
df.printSchema()
root
|-- name: string (nullable = true)
|-- id: integer (nullable = true)
|-- dept: integer (nullable = true)
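Since your schema file looks like a CSV:
// Read the schema file and convert it into a Map of field name -> type name
val csvSchemaDf = spark.read.csv("/testschemafile")
val schemaMap = csvSchemaDf.rdd.map(x => (x(0).toString.trim, x(1).toString.trim)).collectAsMap
var isSchemaMatching = true
// Iterate through the schema fields of your df and compare each one with the file
for (field <- df.schema.toList) {
  if (!(schemaMap.contains(field.name) &&
      field.dataType.toString.equals(schemaMap(field.name)))) {
    // Mismatch: the field is missing from the file or has a different type
    isSchemaMatching = false
  }
}
Then use isSchemaMatching for further logic. Note that this compares against field.dataType.toString, so the file has to store type names in that form (for example StringType, IntegerType).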
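The comparison itself does not depend on Spark, so it can be sketched in plain Java. The sketch below assumes both schemas have already been reduced to name -> type-name maps; `SchemaDiff` and `mismatches` are hypothetical names, not part of any Spark API. Unlike a single boolean flag, it reports which fields differ and also catches fields that are listed in the file but absent from the DataFrame.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SchemaDiff {
    private SchemaDiff() {}

    // Compares two name -> type-name maps and lists every difference found.
    static List<String> mismatches(Map<String, String> expected, Map<String, String> actual) {
        List<String> problems = new ArrayList<>();
        // Fields present in the DataFrame: must exist in the file with the same type.
        for (Map.Entry<String, String> field : actual.entrySet()) {
            String want = expected.get(field.getKey());
            if (want == null) {
                problems.add(field.getKey() + ": not in expected schema");
            } else if (!want.equals(field.getValue())) {
                problems.add(field.getKey() + ": expected " + want + " but found " + field.getValue());
            }
        }
        // Fields listed in the file but absent from the DataFrame.
        for (String name : expected.keySet()) {
            if (!actual.containsKey(name)) {
                problems.add(name + ": missing from DataFrame");
            }
        }
        return problems;
    }
}
```

An empty result plays the role of isSchemaMatching == true.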
You can create an instance of StructType in the following way:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val schema = StructType(
  Seq(
    StructField("name", StringType, true),
    StructField("id", IntegerType, true)
  ))
Just read the file and create the schema based on the data in the file.
Spark schema examples
Scaladoc of spark types
Spark type doc
I have the following XML and want to create nested Kotlin objects from it, but not all TimeCycleData elements are mapped:
<TimeCycle>
<Last>
<Date>2001-06-13T01:00:00.000Z</Date>
<TimeCycleData>
<Hours type="F">123</Hours>
</TimeCycleData>
<TimeCycleData>
<Cycles>1234</Cycles>
</TimeCycleData>
<TimeCycleData>
<Land>1234</Land>
</TimeCycleData>
</Last>
</TimeCycle>
and I want to map it to the following Kotlin data classes:
data class TimeCycle(
    @field:XmlElement(name = "Last")
    val last: Last? = null
)
data class Last(
    @field:XmlElement(name = "Date")
    @field:XmlJavaTypeAdapter(value = LocalDateTimeAdapter::class, type = LocalDateTime::class)
    val date: LocalDateTime? = null,
    @field:XmlElement(name = "TimeCycleData")
    val timeCycleData: TimeCycleData? = null
)
data class TimeCycleData(
    @field:XmlElement(name = "Hours")
    val hours: DurationDetails? = null,
    @field:XmlElement(name = "Cycles")
    val cycles: Int? = null,
    @field:XmlElement(name = "Land")
    val land: Int? = null
)
data class DurationDetails(
    @field:XmlValue
    @field:XmlJavaTypeAdapter(value = DurationAdapter::class, type = Duration::class)
    val value: Duration? = null,
    @field:XmlAttribute(name = "type")
    val type: String = ""
)
When I unmarshal the XML, only the first TimeCycleData (the one with Hours) is filled. How can I merge all TimeCycleData elements into one single object?
UPDATE: corrected submitted xml
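I guess
@field:XmlElement(name = "TimeCycleData")
val timeCycleData: TimeCycleData? = null
should be declared as a list, something like this:
@field:XmlElement(name = "TimeCycleData")
@field:XmlElementWrapper(name = "TimeCycleInfo")
val timeCycleInfo: List<TimeCycleData>? = null
Note that @XmlElementWrapper only applies when the repeated elements sit inside an enclosing <TimeCycleInfo> element. For the XML shown above, where they appear directly under <Last>, leave it out.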
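Mapping to a list yields one TimeCycleData per XML element, so the merge into a single object still has to be done after unmarshalling. A minimal sketch of that step in plain Java, assuming a simplified stand-in class (`Data`, with hours kept as a String) rather than the real JAXB-annotated types; it takes the first non-null value it finds for each field:

```java
import java.util.List;

final class TimeCycleDataMerge {
    private TimeCycleDataMerge() {}

    // Simplified, hypothetical stand-in for the TimeCycleData class in the question.
    static final class Data {
        final String hours;
        final Integer cycles;
        final Integer land;

        Data(String hours, Integer cycles, Integer land) {
            this.hours = hours;
            this.cycles = cycles;
            this.land = land;
        }
    }

    // Folds the partial objects into one, keeping the first non-null value per field.
    static Data merge(List<Data> parts) {
        String hours = null;
        Integer cycles = null;
        Integer land = null;
        for (Data p : parts) {
            if (p == null) {
                continue;
            }
            if (hours == null) hours = p.hours;
            if (cycles == null) cycles = p.cycles;
            if (land == null) land = p.land;
        }
        return new Data(hours, cycles, land);
    }
}
```

In Kotlin the same fold could live in a computed property on Last, so callers still see a single merged object.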
I'm used to Python and am using the Scala Spark Streaming libraries to handle real-time Twitter streaming data. Right now I'm able to send each message as a string; however, my streaming service requires JSON. Is there a way I can easily adapt my code to send a JSON dictionary instead of a String?
%scala
import scala.collection.JavaConverters._
import com.microsoft.azure.eventhubs._
import java.util.concurrent._
val namespaceName = "hubnamespace"
val eventHubName = "hubname"
val sasKeyName = "RootManageSharedAccessKey"
val sasKey = "key"
val connStr = new ConnectionStringBuilder()
.setNamespaceName(namespaceName)
.setEventHubName(eventHubName)
.setSasKeyName(sasKeyName)
.setSasKey(sasKey)
val pool = Executors.newFixedThreadPool(1)
val eventHubClient = EventHubClient.create(connStr.toString(), pool)
def sendEvent(message: String) = {
val messageData = EventData.create(message.getBytes("UTF-8"))
// CONVERT IT HERE?
eventHubClient.get().send(messageData)
System.out.println("Sent event: " + message + "\n")
}
import twitter4j._
import twitter4j.TwitterFactory
import twitter4j.Twitter
import twitter4j.conf.ConfigurationBuilder
val twitterConsumerKey = "key"
val twitterConsumerSecret = "key"
val twitterOauthAccessToken = "key"
val twitterOauthTokenSecret = "key"
val cb = new ConfigurationBuilder()
cb.setDebugEnabled(true)
.setOAuthConsumerKey(twitterConsumerKey)
.setOAuthConsumerSecret(twitterConsumerSecret)
.setOAuthAccessToken(twitterOauthAccessToken)
.setOAuthAccessTokenSecret(twitterOauthTokenSecret)
val twitterFactory = new TwitterFactory(cb.build())
val twitter = twitterFactory.getInstance()
val query = new Query(" #happynewyear ")
query.setCount(100)
query.lang("en")
var finished = false
while (!finished) {
val result = twitter.search(query)
val statuses = result.getTweets()
var lowestStatusId = Long.MaxValue
for (status <- statuses.asScala) {
if(!status.isRetweet()){
sendEvent(status.getText())
}
lowestStatusId = Math.min(status.getId(), lowestStatusId)
Thread.sleep(2000)
}
query.setMaxId(lowestStatusId - 1)
}
eventHubClient.get().close()
Scala has no built-in way to convert a string to JSON, so you'll need an external library. I recommend Jackson. If you use Gradle, you can add a dependency like this: compile("com.fasterxml.jackson.module:jackson-module-scala_2.12") (use the artifact that matches your Scala version, and add the Jackson version you want).
Then you can convert your data object to JSON like this:
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val json = mapper.valueToTree[com.fasterxml.jackson.databind.JsonNode](messageData)
If you need the JSON text rather than a tree, mapper.writeValueAsString(value) returns it as a String, for example for a Map("text" -> message).
I'd strongly recommend putting your effort into Jackson; you'll need it a lot if you work with JSON.
I'm working on a new Spark project using Java. I have to read some data from CSV files, and these CSVs contain an array of floats. I do not know how to get this array into my dataset.
I'm reading from this CSV:
[CSV data image][1] https://imgur.com/a/PdrMhev
And I'm trying to get the data in this way:
Dataset<Row> typedTrainingData = sparkSession.sql("SELECT CAST(IDp as String) IDp, CAST(Instt as String) Instt, CAST(dataVector as String) dataVector FROM TRAINING_DATA");
And I get this:
root
|-- IDp: string (nullable = true)
|-- Instt: string (nullable = true)
|-- dataVector: string (nullable = true)
+-------+-------------+-----------------+
| IDp| Instt| dataVector|
+-------+-------------+-----------------+
| p01| V11apps|-0.41,-0.04,0.1..|
| p02| V21apps|-1.50,-1.50,-1...|
+-------+-------------+-----------------+
As you can see in the schema, the array is read as a String, but I want to get it as an array. Any recommendations?
I want to run some MLlib machine learning algorithms on the loaded data, which is why I need the data as an array.
Thank you guys!!!!!!!!
First define your schema. The values in dataVector are decimals, so the array element type is DoubleType:
StructType customStructType = new StructType();
customStructType = customStructType.add("_c0", DataTypes.StringType, false);
customStructType = customStructType.add("_c1", DataTypes.StringType, false);
customStructType = customStructType.add("_c2", DataTypes.createArrayType(DataTypes.DoubleType), false);
Then map your df to the new schema:
Dataset<Row> newDF = oldDF.map((MapFunction<Row, Row>) row -> {
    // The vector is the third column, i.e. index 2
    String[] strings = row.getString(2).split(",");
    double[] result = new double[strings.length];
    for (int i = 0; i < strings.length; i++)
        result[i] = Double.parseDouble(strings[i].trim());
    return RowFactory.create(row.getString(0), row.getString(1), result);
}, RowEncoder.apply(customStructType));
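The string-to-array conversion inside the map function can be pulled out into a small helper and checked without a Spark session. A minimal sketch, assuming the vector is a comma-separated list of decimals as in the sample output; `VectorParser` is a hypothetical name:

```java
final class VectorParser {
    private VectorParser() {}

    // Splits a comma-separated string such as "-0.41,-0.04,0.1" into doubles.
    // A null or blank input yields an empty array.
    static double[] parse(String csv) {
        if (csv == null || csv.trim().isEmpty()) {
            return new double[0];
        }
        String[] parts = csv.split(",");
        double[] out = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Double.parseDouble(parts[i].trim());
        }
        return out;
    }
}
```

A malformed entry makes Double.parseDouble throw NumberFormatException; whether to fail the job or skip the row in that case is a choice this sketch leaves open.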
I am building an application in Spark, and would like to use the SparkContext and/or SQLContext within methods in my classes, mostly to pull/generate data sets from files or SQL queries.
For example, I would like to create a T2P object which contains methods that gather data (and in this case need access to the SparkContext):
class T2P (mid: Int, sc: SparkContext, sqlContext: SQLContext) extends Serializable {
def getImps(): DataFrame = {
val imps = sc.textFile("file.txt").map(line => line.split("\t")).map(d => Data(d(0).toInt, d(1), d(2), d(3))).toDF()
return imps
}
def getX(): DataFrame = {
val x = sqlContext.sql("SELECT a,b,c FROM table")
return x
}
}
//creating the T2P object
class App {
val conf = new SparkConf().setAppName("T2P App").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val t2p = new T2P(0, sc, sqlContext);
}
Passing the SparkContext as an argument to the T2P class doesn't work since the SparkContext is not serializable (getting a task not serializable error when creating T2P objects). What is the best way to use the SparkContext/SQLContext inside my classes? Or perhaps is this the wrong way to design a data pull type process in Spark?
UPDATE
I realized from the comments on this post that the SparkContext itself was not the problem. I was calling a method of the class inside a 'map' function, which caused Spark to try to serialize the entire class. That fails because the class holds a SparkContext, which is not serializable.
def startMetricTo(userData: ((Int, String), List[(Int, String)]), startMetric: String) : T2PUser = {
//do something
}
def buildUserRollup() = {
this.userRollup = this.userSorted.map(line=>startMetricTo(line, this.startMetric))
}
This results in a 'task not serializable' exception.
I fixed this problem (with the help of the commenters and other StackOverflow users) by creating a separate MetricCalc object to store my startMetricTo() method. Then I changed the buildUserRollup() method to use this new startMetricTo(). This allows the entire MetricCalc object to be serialized without issue.
//newly created object
object MetricCalc {
def startMetricTo(userData: ((Int, String), List[(Int, String)]), startMetric: String) : T2PUser = {
//do something
}
}
//using function in T2P
def buildUserRollup(startMetric: String) = {
this.userRollup = this.userSorted.map(line=>MetricCalc.startMetricTo(line, startMetric))
}
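The same capture effect can be reproduced with plain Java serialization, without Spark. In the sketch below, all names are hypothetical: `Holder` stands in for a class like T2P, and a bare `Object` field plays the role of the non-serializable SparkContext. A lambda that calls an instance method captures `this` and drags the whole object along, while a lambda that calls a static helper captures nothing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

final class ClosureCapture {
    private ClosureCapture() {}

    // A function type that Java serialization is allowed to write out.
    interface SerFn extends Function<String, String>, Serializable {}

    // Stand-in for a class that holds a non-serializable resource.
    static final class Holder implements Serializable {
        private final Object context = new Object(); // plays the role of SparkContext

        String tag(String s) { return "tagged:" + s; }

        // Calls an instance method, so the lambda captures `this` (the whole Holder).
        SerFn viaInstanceMethod() { return s -> tag(s); }

        // Calls a static helper, so the lambda captures nothing.
        SerFn viaStaticHelper() { return s -> Helpers.tag(s); }
    }

    // Plays the role of the separate MetricCalc object.
    static final class Helpers {
        static String tag(String s) { return "tagged:" + s; }
    }

    // Returns false if writing the object fails with NotSerializableException.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here canSerialize(holder.viaInstanceMethod()) is false and canSerialize(holder.viaStaticHelper()) is true, which mirrors why moving startMetricTo() into a separate object removes the exception.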
I tried several options; this is what eventually worked for me:
object SomeName extends App {
val conf = new SparkConf()...
val sc = new SparkContext(conf)
implicit val sqlC = SQLContext.getOrCreate(sc)
getDF1(sqlC)
def getDF1(sqlCo: SQLContext): Unit = {
val query1 = SomeQuery here
val df1 = sqlCo.read.format("jdbc").options(Map("url" -> dbUrl,"dbtable" -> query1)).load.cache()
//iterate through df1 and retrieve the 2nd DataFrame based on some values in the Row of the first DataFrame
df1.foreach(x => {
getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue) (sqlCo)
})
}
def getDF2(a: String, b: String, c: Double)(implicit sqlCont: SQLContext) : Unit = {
val query2 = Somequery
val sqlcc = SQLContext.getOrCreate(sc)
//val sqlcc = sqlCont //Did not work for me. Also, omitting (implicit sqlCont: SQLContext) altogether did not work
val df2 = sqlcc.read.format("jdbc").options(Map("url" -> dbURL, "dbtable" -> query2)).load().cache()
.
.
.
}
}
Note: In the above code, if I omitted the (implicit sqlCont: SQLContext) parameter from the getDF2 method signature, it would not work. I tried several other ways of passing the sqlContext from one method to the other, and they always gave me a NullPointerException or a Task not serializable exception. The good thing is that it eventually worked this way, and I could retrieve parameters from a row of the first DataFrame and use those values when loading the second DataFrame.