Spark list all cached RDD names and unpersist - java

I am new to Apache Spark. I created several RDDs and DataFrames, cached them, and now I want to unpersist some of them using the command below:
rddName.unpersist()
but I can't remember their names. I used sc.getPersistentRDDs, but the output does not include the names. I also used the browser to view the cached RDDs, but again there is no name information. Am I missing something?

PySparkers: getPersistentRDDs isn't yet implemented in Python, so unpersist your RDDs by dipping into Java:
for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()

@Dikei's answer is correct, but I believe what you are looking for is sc.getPersistentRDDs:
scala> val rdd1 = sc.makeRDD(1 to 100)
# rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:27
scala> val rdd2 = sc.makeRDD(10 to 1000)
# rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:27
scala> rdd2.cache.setName("rdd_2")
# res0: rdd2.type = rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res1: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27)
scala> rdd1.cache.setName("foo")
# res2: rdd1.type = foo ParallelCollectionRDD[0] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27, 0 -> foo ParallelCollectionRDD[0] at makeRDD at <console>:27)
Now let's name another RDD, rdd3, but without caching it:
scala> rdd3.setName("bar")
# res4: rdd3.type = bar ParallelCollectionRDD[2] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res5: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27, 0 -> foo ParallelCollectionRDD[0] at makeRDD at <console>:27)
Notice that rdd3 isn't actually persisted: setName alone doesn't cache an RDD, so it doesn't show up in getPersistentRDDs.

A generic way of doing this in Scala: loop through the Spark context's persistent RDDs and unpersist each one. I use this at the end of a driver.
for ((id, rdd) <- sparkSession.sparkContext.getPersistentRDDs) {
  log.info("Unexpected cached RDD " + id)
  rdd.unpersist()
}
A generic way of doing this in Java, where jsc is a JavaSparkContext:
if (jsc != null) {
    Map<Integer, JavaRDD<?>> persistentRDDs = jsc.getPersistentRDDs();
    // iterate over Map.entrySet() with a for-each loop
    for (Map.Entry<Integer, JavaRDD<?>> entry : persistentRDDs.entrySet()) {
        LOG.info("Key = " + entry.getKey()
                + ", unpersisting cached RDD = " + entry.getValue().unpersist());
    }
}
Another short form in Java, which unpersists without needing the RDD names:
Map<Integer, JavaRDD<?>> persistentRDDs = jsc.getPersistentRDDs();
persistentRDDs.values().forEach(JavaRDD::unpersist);
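The Scala equivalent of this one-liner, for completeness (standard SparkContext API):
sc.getPersistentRDDs.values.foreach(_.unpersist())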

There's no special meaning to the rddName variable. It is just a reference to an RDD. For example, in the following code
val rddName: RDD[Something]
val name2 = rddName
name2 and rddName are two references that point to the same RDD. Calling name2.unpersist is the same as calling rddName.unpersist.
If you want to unpersist an RDD, you have to manually keep a reference to it.
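For example, a minimal sketch of such bookkeeping (the RddRegistry object and its method names are purely illustrative, not part of any Spark API):
import org.apache.spark.rdd.RDD
import scala.collection.mutable

object RddRegistry {
  // Remember each cached RDD under a name of your choosing so it can be
  // unpersisted by that name later; setName also labels it in the Spark UI
  // and in the output of sc.getPersistentRDDs.
  private val cached = mutable.Map.empty[String, RDD[_]]

  def cacheAs[T](name: String, rdd: RDD[T]): RDD[T] = {
    rdd.setName(name).cache()
    cached(name) = rdd
    rdd
  }

  def unpersist(name: String): Unit =
    cached.remove(name).foreach(_.unpersist())
}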

Related

Convert sql data to Json Array [java spark]

I have a DataFrame and want to convert it into a JSON array. Please find the example below.
DataFrame:
+------------+----------------+----------+----------------+------------------+
|        Name|              id|request_id|create_timestamp|deadline_timestamp|
+------------+----------------+----------+----------------+------------------+
|    Freeform|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|         D23|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|      Stores|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|VacationClub|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
+------------+----------------+----------+----------------+------------------+
I want it in JSON like below:
[
  {
    "testname": "xyz",
    "systemResponse": [
      {
        "name": "FGH",
        "id": "59bbe3ad-f487-44",
        "request_id": 1590791280,
        "create_timestamp": 1590799280
      },
      {
        "name": "FGH",
        "id": "59bbe3ad-f487-44",
        "request_id": 1590791280,
        "create_timestamp": 1590799280
      }
    ]
  }
]
You can define two beans:
Create an array from the first DataFrame as an array of inner beans (RequestDetails).
Define a parent bean (BaseClass) with testname and the Array[RequestDetails].
Please also see the inline comments in the code.
object DataToJsonArray {

  def main(args: Array[String]): Unit = {

    val spark = Constant.getSparkSess

    import spark.implicits._

    // Load your DataFrame
    val requestDetailArray = List(
      ("Freeform", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
      ("D23", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
      ("Stores", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
      ("VacationClub", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556")
    ).toDF
      // Map your DataFrame to the RequestDetails bean
      .map(row => RequestDetails(row.getString(0), row.getString(1), row.getString(2), row.getString(3), row.getString(4)))
      // Collect it as an Array
      .collect()

    // Create another DataFrame from List[BaseClass], setting (testname, Array[RequestDetails])
    List(BaseClass("xyz", requestDetailArray)).toDF()
      .write
      // Output your DataFrame as JSON
      .json("/json/output/path")
  }
}

case class RequestDetails(Name: String, id: String, request_id: String, create_timestamp: String, deadline_timestamp: String)

case class BaseClass(testname: String = "xyz", systemResponse: Array[RequestDetails])
Check the code below.
import org.apache.spark.sql.functions._

df.withColumn("systemResponse",
    array(
      struct("id","request_id","create_timestamp","deadline_timestamp").as("data")
    )
  )
  .select("systemResponse")
  .toJSON
  .select(col("value").as("json_data"))
  .show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|json_data |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
+-----------------------------------------------------------------------------------------------------------------------------------------------+
Updated
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.withColumn("systemResponse",
    array(
      struct("id","request_id","create_timestamp","deadline_timestamp").as("data")
    )
  )
  .withColumn("testname",lit("xyz"))
  .select("testname","systemResponse")
  .toJSON
  .select(col("value").as("json_data"))
  .show(false)
// Exiting paste mode, now interpreting.
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|json_data |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
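If the goal is a single JSON document with all rows inside one systemResponse array (as in the desired output above), a hedged variation is to aggregate with collect_list before serializing. This is a sketch assuming the same df and the question's column names:
import org.apache.spark.sql.functions._

df.withColumn("testname", lit("xyz"))
  .groupBy("testname")
  .agg(collect_list(struct("Name", "id", "request_id", "create_timestamp")).as("systemResponse"))
  .toJSON
  .show(false)
This collapses the four input rows into a single row whose JSON value holds the testname and a four-element systemResponse array.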

Will Spark cache the data twice if we cache a DataSet and then cache the same DataSet as a table

Dataset<Row> dataSet = sqlContext.sql("some query");
dataSet.registerTempTable("temp_table");
dataSet.cache(); // cache 1
sqlContext.cacheTable("temp_table"); // cache 2
So my question is: will Spark cache the Dataset only once, or will there be two copies of the same Dataset, one as a Dataset (cache 1) and the other as a table (cache 2)?
It will not, or at least it won't in any recent version:
scala> val df = spark.range(1)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.cache
res0: df.type = [id: bigint]
scala> df.createOrReplaceTempView("df")
scala> spark.catalog.cacheTable("df")
2018-01-23 12:33:48 WARN CacheManager:66 - Asked to cache already cached data.
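As an optional follow-up check (standard Catalog/Dataset API; the view name df comes from the transcript above), both paths resolve to the same single cache entry:
spark.catalog.isCached("df")     // true after cacheTable
df.storageLevel                  // should report a non-NONE level from the shared cache
spark.catalog.uncacheTable("df") // releasing it once frees the only copy
The Storage tab of the Spark UI should likewise show a single cached entry, not two.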

Scala classNotFound on Any type

I'm trying to make a generic value type in my HashMap like so:
val aMap = ArrayBuffer[HashMap[String, Any]]()
aMap += HashMap()
aMap(0)("aKey") = "aStringVal"
aMap(0)("aKey2") = true // a bool value
aMap(0)("aKey3") = 23 // an int value
This works in my spark-shell but it gives me this ClassNotFoundException on scala.Any in my IntelliJ Project:
org.apache.spark.streaming.scheduler.JobScheduler logError - Error running job streaming job 1521859195000 ms.0
java.lang.ClassNotFoundException: scala.Any
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
I'm using Scala 2.11. Any ideas what could be causing this?
What this ended up being for me was creating a DataFrame with mixed data types using .toDF.
I had:
val baseDataFrame = Seq(
  ("value1", "one"),
  ("value2", 2),
  ("value3", 3)
).toDF("column1", "column2")
and this change fixed the issue:
val baseDataFrame = Seq(
  ("value1", "one"),
  ("value2", "2"),
  ("value3", "3")
).toDF("column1", "column2")
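To see why the first version breaks: mixing String and Int in the second tuple element makes Scala widen that element's type to Any, and Spark cannot derive a schema or encoder for Any. An illustrative REPL check (the name mixed is just for demonstration):
val mixed = Seq(("value1", "one"), ("value2", 2), ("value3", 3))
// inferred type: Seq[(String, Any)] -- Spark cannot build a schema for the Any column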

Scala/Java incompatible Boolean type in spark

While trying this example from https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-joins.html#joinWith
case class Person(id: Long, name: String, cityId: Long)
case class City(id: Long, name: String)
val people = Seq(Person(0, "Agata", 0), Person(1, "Iweta", 0)).toDS
val cities = Seq(City(0, "Warsaw"), City(1, "Washington")).toDS
val joined = people.joinWith(cities, people("cityId") === cities("id"))
joined.show()
I am getting this error
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 21, Column 35: Incompatible expression types "boolean" and "java.lang.Boolean"
Help me on this. Thanks in advance.
I tried your code with Spark version 1.6.0, Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67) and got
scala> val joined = people.joinWith(cities, people("cityId") === cities("id"))
<console>:33: error: org.apache.spark.sql.Dataset[Person] does not take parameters
val joined = people.joinWith(cities, people("cityId") === cities("id"))
^
Instead you can try either
people.as("p").joinWith(cities.as("c"), $"p.cityId" === $"c.id").show
or
people.joinWith(cities, people.toDF()("cityId") === cities.toDF()("id")).show
or
val peopleDF = Seq(Person(0, "Agata", 0), Person(1, "Iweta", 0)).toDF
val citiesDF = Seq(City(0, "Warsaw"), City(1, "Washington")).toDF
peopleDF.join(citiesDF, peopleDF("cityId") === citiesDF("id")).show

Log producer object in Scala

I am new to Scala and Java and I'm trying to run a sample producer. All it does is take some raw products and referrers stored in CSV files and use rnd to generate some random log lines. Following is my code:
import java.io.FileWriter

import scala.util.Random

object LogProducer extends App {

  // WebLog config
  val wlc = Settings.WebLogGen

  val Products = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/products.csv")).getLines().toArray
  val Referrers = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/referrers.csv")).getLines().toArray
  val Visitors = (0 to wlc.visitors).map("Visitors-" + _)
  val Pages = (0 to wlc.pages).map("Pages-" + _)

  val rnd = new Random()
  val filePath = wlc.filePath
  val fw = new FileWriter(filePath, true)

  // adding randomness to time increments for demo
  val incrementTimeEvery = rnd.nextInt(wlc.records - 1) + 1

  var timestamp = System.currentTimeMillis()
  var adjustedTimestamp = timestamp

  for (iteration <- 1 to wlc.records) {
    adjustedTimestamp = adjustedTimestamp + ((System.currentTimeMillis() - timestamp) * wlc.timeMultiplier)
    timestamp = System.currentTimeMillis()

    val action = iteration % (rnd.nextInt(200) + 1) match {
      case 0 => "purchase"
      case 1 => "add_to_cart"
      case _ => "page_view"
    }

    val referrer = Referrers(rnd.nextInt(Referrers.length - 1))
    val prevPage = referrer match {
      case "Internal" => Pages(rnd.nextInt(Pages.length - 1))
      case _ => ""
    }

    val visitor = Visitors(rnd.nextInt(Visitors.length - 1))
    val page = Pages(rnd.nextInt(Pages.length - 1))
    val product = Products(rnd.nextInt(Products.length - 1))

    val line = s"$adjustedTimestamp\t$referrer\t$action\t$prevPage\t$visitor\t$page\t$product\n"
    fw.write(line)

    if (iteration % incrementTimeEvery == 0) {
      //os.flush()
      println(s"Sent $iteration messages!")
      val sleeping = rnd.nextInt(incrementTimeEvery * 60)
      println(s"Sleeping for $sleeping ms")
    }
  }
}
It is pretty straightforward: it basically generates some variables and appends them to a line.
However, I am getting a big exception stack trace which I am not able to understand:
"C:\Program Files\Java\jdk1.8.0_92\bin\java...
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:70)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:59)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:50)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:310)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1417)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:302)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1417)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:289)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1417)
at clickstream.LogProducer$.delayedEndpoint$clickstream$LogProducer$1(logProducer.scala:16)
at clickstream.LogProducer$delayedInit$body.apply(logProducer.scala:12)
at scala.Function0.apply$mcV$sp(Function0.scala:34)
at scala.Function0.apply$mcV$sp$(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App.$anonfun$main$1$adapted(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:389)
at scala.App.main(App.scala:76)
at scala.App.main$(App.scala:74)
at clickstream.LogProducer$.main(logProducer.scala:12)
at clickstream.LogProducer.main(logProducer.scala)
Process finished with exit code 1
Can someone please help me identify what the exception means? Thanks, all.
So it wasn't hard; it was my amateurish knowledge. It was a simple I/O problem: IntelliJ wasn't able to read the values from my CSV file. When I imported it into the resources root directory, it gave me a warning about the wrong encoding.
The error was at this point:
val Products = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/products.csv")).getLines().toArray
Thanks for the efforts though.
It was an encoding issue. For Scala, a quick fix would be to replace:
val Products = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/products.csv")).getLines().toArray
val Referrers = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/referrers.csv")).getLines().toArray
with this:
val Products = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/products.csv"))("UTF-8").getLines().toArray
val Referrers = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/referrers.csv"))("UTF-8").getLines().toArray
For java and more details please check out this link: http://biercoff.com/malformedinputexception-input-length-1-exception-solution-for-scala-and-java/
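An equivalent and arguably more explicit form is to pass a scala.io.Codec instead of relying on the String-to-Codec implicit conversion; a minimal sketch (only products.csv shown):
import scala.io.{Codec, Source}

val Products = Source.fromInputStream(getClass.getResourceAsStream("/products.csv"))(Codec.UTF8)
  .getLines()
  .toArray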
