I am new to accumulators in Spark . I have created an accumulator which gathers the information of sum and count of all columns in a dataframe into a Map.
Which is not functioning as expected , so I have a few doubts.
When I run this class( pasted below) in local mode , I can see that the accumulators getting updated but the final value is still empty.For debug purposes, I added a print Statement in add() .
Q1) Why is the final accumulable not being updated when the accumulator is being added ?
For reference , I studied the CollectionsAccumulator where they have made use of SynchronizedList from Java Collections.
Q2) Does it need to be a synchronized/concurrent collection for an accumulator to update ?
Q3) Which collection will be best suited for such purpose ?
I have attached my execution flow along with spark ui snapshot for analysis.
Thanks.
EXECUTION:
INPUT DATAFRAME -
+-------+-------+
|Column1|Column2|
+-------+-------+
|1 |2 |
|3 |4 |
+-------+-------+
OUTPUT -
Add - Map(Column1 -> Map(sum -> 1, count -> 1), Column2 -> Map(sum -> 2, count -> 1))
Add - Map(Column1 -> Map(sum -> 4, count -> 2), Column2 -> Map(sum -> 6, count -> 2))
TestRowAccumulator(id: 1, name: Some(Test Accumulator for Sum&Count), value: Map())
SPARK UI SNAPSHOT -
CLASS :
class TestRowAccumulator extends AccumulatorV2[Row,Map[String,Map[String,Int]]]{
private var colMetrics: Map[String, Map[String, Int]] = Map[String , Map[String , Int]]()
override def isZero: Boolean = this.colMetrics.isEmpty
override def copy(): AccumulatorV2[Row, Map[String,Map[String,Int]]] = {
val racc = new TestRowAccumulator
racc.colMetrics = colMetrics
racc
}
override def reset(): Unit = {
colMetrics = Map[String,Map[String,Int]]()
}
override def add(v: Row): Unit = {
v.schema.foreach(field => {
val name: String = field.name
val value: Int = v.getAs[Int](name)
if(!colMetrics.contains(name))
{
colMetrics = colMetrics ++ Map(name -> Map("sum" -> value , "count" -> 1 ))
}else
{
val metric = colMetrics(name)
val sum = metric("sum") + value
val count = metric("count") + 1
colMetrics = colMetrics ++ Map(name -> Map("sum" -> sum , "count" -> count))
}
})
}
override def merge(other: AccumulatorV2[Row, Map[String,Map[String,Int]]]): Unit = {
other match {
case t:TestRowAccumulator => {
colMetrics.map(col => {
val map2: Map[String, Int] = t.colMetrics.getOrElse(col._1 , Map())
val map1: Map[String, Int] = col._2
map1 ++ map2.map{ case (k,v) => k -> (v + map1.getOrElse(k,0)) }
} )
}
case _ => throw new UnsupportedOperationException(s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
}
}
override def value: Map[String, Map[String, Int]] = {
colMetrics
}
}
After a bit of debug , I found that merge function is being called .
It had erroneous code so the accumulable value was Map()
EXECUTION FlOW OF ACCUMULATOR (LOCAL MODE) :
ADD
ADD
MERGE
Once I corrected the merge function , accumulator worked as expected
Related
I have an object
I try to get access to field "english"
val englishSentence = dbField::class.declaredMemberProperties.filter { it.name == "english" }[0]
But when I
model.addAttribute("sentence", englishSentence)
I get val com.cyrillihotin.grammartrainer.entity.Sentence.english: kotlin.String
while I expect bla
You can use the call function on a KProperty to get its value from the object.
val dbField = Sentence(1, "bla-eng", "bla-rus")
val value = dbField::class.declaredMemberProperties.find { it.name == "english" }!!.call(dbField)
println(value)
Output: bla-eng
Remember that data type of value is Any here. You need to cast it manually to the desired data type.
If you want to list all the properties with their values, you can do this:
dbField::class.declaredMemberProperties.forEach {
println("${it.name} -> ${it.call(dbField)}")
}
Output:
english -> bla-eng
id -> 1
russian -> bla-rus
Do you mean this?
data class Sentence(val id:Int, val english:String, val russian:String)
val dbField = Sentence(1, "blaEng", "blaRus")
val englishProp = dbField::class.declaredMemberProperties.first { it.name == "english" }as KProperty1<Sentence, String>
println(englishProp.get(dbField))
It prints blaEng
I have an ConfigEntry class defined as
case class ConfigEntry(
key: String,
value: String
)
and a list:
val list: List[ConfigEntry] = List(
ConfigEntry("general.first", "general first value"),
ConfigEntry("general.second", "general second value"),
ConfigEntry("custom.first", "custom first value"),
ConfigEntry("custom.second", "custom second value")
)
Given a list of ConfigEntry, I want a map from property -> map of entries that satisfy that property
As an example, if I have
def getConfig: Map[String, Map[String, String]] = {
def getKey(key: String, index: Int): String = key.split("\\.")(index)
list.map { config =>
getKey(config.key, 0) -> Map(getKey(config.key, 1) -> config.value)
}.toMap
}
I get result
res0: Map[String,Map[String,String]] =
Map(
"general" ->
Map("second" -> "general second value"),
"custom" ->
Map("second" -> "custom second value")
)
and it should be
res0: Map[String,Map[String,String]] =
Map(
"general" ->
Map(
"first" -> "general first value",
"second" -> "general second value"
),
"custom" ->
Map(
"first" -> "custom first value",
"second" -> "custom second value"
)
)
The first record from the list is missing. It's probably through .toMap
How can I do this?
Thank you for any help given
You can do something like this:
final case class ConfigEntry(
key: String,
value: String
)
type Config = Map[String, Map[String, String]]
def getConfig(data: List[ConfigEntry]): Config =
data
.view
.map(e => e.key.split('.').toList -> e.value)
.collect {
case (k1 :: k2 :: Nil, v) => k1 -> (k2 -> v)
}.groupMap(_._1)(_._2)
.view
.mapValues(_.toMap)
.toMap
Or something like this:
def getConfig(data: List[ConfigEntry]): Config = {
#annotation.tailrec
def loop(remaining: List[ConfigEntry], acc: Config): Config =
remaining match {
case ConfigEntry(key, value) :: xs =>
val newAcc = key.split('.').toList match {
case k1 :: k2 :: Nil =>
acc.updatedWith(k1) {
case Some(map) =>
val newMap = map.updatedWith(k2) {
case Some(v) =>
println(s"Overwriting previous value ${v} for the key: ${key}")
// Just overwrite the previous value.
Some(value)
case None =>
Some(value)
}
Some(newMap)
case None =>
Some(Map(k2 -> value))
}
case _ =>
println(s"Bad key: ${key}")
// Just skip this key.
acc
}
loop(remaining = xs, newAcc)
case Nil =>
acc
}
loop(remaining = data, acc = Map.empty)
}
I leave the handling of errors like duplicated keys or bad keys to the reader.
BTW, since this is a config, have you considered using a Config library?
Your map will only produce a 1 to 1 result. To do what you want you will need an accumulator (existing map) to do this.
Working with your existing code, if you're especially tied to how you're parsing your primary and secondary keys via getKey you can apply foldLeft to your list instead, with an empty map as an initial value.
list.foldLeft(Map.empty[String, Map[String, String]]) { (configs, configEntry) =>
val primaryKey = getKey(configEntry.key, 0)
val secondaryKey = getKey(configEntry.key, 1)
configs.get(primaryKey) match {
case None =>
configs.updated(primaryKey, Map(secondaryKey -> configEntry.value))
case Some(configMap) =>
configs.updated(primaryKey, configMap.updated(secondaryKey, configEntry.value))
}
}
Simply:
list.map { ce =>
val Array(l, r) = ce.key.split("\\.")
l -> (r -> ce.value)
} // List[(String, (String, String))]
.groupBy { case (k, _) => k } // Map[String, List[(String, (String, String))]]
.view.mapValues(_.map { case (_, v) => v }.toMap) // MapView[String, List[(String, String)]]
.toMap // Map[String, Map[String, String]]
I have saved vector in session and I want to use random value from the vector but dont know how to extract value in session.
Errors:
'httpRequest-6' failed to execute: Vector(437420, 443940, 443932,
437437, 443981, 443956, 443973, 443915, 437445) named 'termIds' does
not support .random function
And
In 2nd scenario It passes vector in get request like this way, http://someurl/api/thr/Vector(435854)/terms/Vector(437420, 443940,
443932, 437437, 443981, 443956, 443973, 443915, 437445)
instead of using
http://someurl/api/thr/435854/terms/443973
::Here is my script::
class getTerm extends Simulation {
val repeatCount = Integer.getInteger("repeatCount", 1).toInt
val userCount = Integer.getInteger("userCount", 1).toInt
val turl = System.getProperty("turl", "some url")
val httpProtocol = http
.baseURL("http://" + turl)
val headers_10 = Map("Content-Type" -> """application/json""")
var thrIds = ""
var termIds = ""
// Scenario - 1
val getTerms = scenario("Scn 1")
.exec(http("list_of_term")
.get("/api/abc")
.headers(headers_10)
.check(jsonPath("$[*].id")
.findAll.saveAs("thrIds"))
)
.exec(http("get_all_terms")
.get("""/api/thr/${thrIds.random()}/terms""")
.headers(headers_10)
.check(jsonPath("$[*].id")
.findAll.saveAs("termIds"))
)
.exec(session => {
thrIds = session("thrIds").as[Long].toString
termIds = session("termIds").as[Long].toString
println("***************************************")
println("Session ====>>>> " + session)
println("Ths ID ====>>>> " + thrIds)
println("Term ID ====>>>> " + termIds)
println("***************************************")
session}
)
// Scenario - 2
// Want to extract vectors here and pass its value into get call
val getKnownTerms = scenario("Get Known Term")
.exec(_.set("thrIds", thrIds))
.exec(_.set("termIds", termIds))
.repeat (repeatCount){
exec(http("get_k_term")
.get("""/api/thr/${thrIds}/terms/${termIds.random()}""")
.headers(headers_10))
}
val scn = List(getTerms.inject(atOnceUsers(1)), getKnownTerms.inject(nothingFor(20 seconds), atOnceUsers(userCount)))
setUp(scn).protocols(httpProtocol)
}
Here is the solution which may help others.
class getTerm extends Simulation {
val repeatCount = Integer.getInteger("repeatCount", 1).toInt
val userCount = Integer.getInteger("userCount", 1).toInt
val turl = System.getProperty("turl", "some url")
val httpProtocol = http
.baseURL("http://" + turl)
val headers_10 = Map("Content-Type" -> """application/json""")
// Change - 1
var thrIds: Seq[String] = _
var termIds: Seq[String] = _
// Scenario - 1
val getTerms = scenario("Scn 1")
.exec(http("list_of_term")
.get("/api/abc")
.headers(headers_10)
.check(jsonPath("$[*].id")
.findAll
.transform { v => thrIds = v; v }
.saveAs("thrIds"))
)
.exec(http("get_all_trms")
.get("""/api/thr/${thrIds.random()}/terms""")
.headers(headers_10)
.check(jsonPath("$[*].id")
.findAll
.transform { v => termIds = v; v }
.saveAs("termIds"))
)
// Scenario - 2
val getKnownTerms = scenario("Get Known Term")
.exec(_.set("thrIds", thrIds))
.exec(_.set("termIds", termIds))
.repeat (repeatCount){
exec(http("get_k_term")
.get("""/api/thr/${thrIds.random()}/terms/${termIds.random()}""")
.headers(headers_10))
}
val scn = List(getTerms.inject(atOnceUsers(1)), getKnownTerms.inject(nothingFor(20 seconds), atOnceUsers(userCount)))
setUp(scn).protocols(httpProtocol)
}
I need to create a HashMap of directory-to-file in scala while I list all files in the directory. How can I achieve this in scala?
val directoryToFile = awsClient.listFiles(uploadPath).collect {
case path if !path.endsWith("/") => {
path match {
// do some regex matching to get directory & file names
case regex(dir, date) => {
// NEED TO CREATE A HASH MAP OF dir -> date. How???
}
case _ => None
}
}
}
The method listFiles(path: String) returns a Seq[String] of absolute path of all files in the path passed as argument to the function
Try to write more idiomatic Scala. Something like this:
val directoryToFile = (for {
path <- awsClient.listFiles(uploadPath)
if !path.endsWith("/")
regex(dir, date) <- regex.findFirstIn(path)
} yield dir -> date).sortBy(_._2).toMap
You can filter and then foldLeft:
val l = List("""/opt/file1.txt""", """/opt/file2.txt""")
val finalMap = l
.filter(!_.endsWith("/"))
.foldLeft(Map.empty[String, LocalDateTime])((map, s) =>
s match {
case regex(dir, date) => map + (dir -> date)
case _ => map
}
)
You can try something like this:
val regex = """(\d)-(\d)""".r
val paths = List("1-2", "3-4", "555")
for {
// Hint to Scala to produce specific type
_ <- Map("" -> "")
// Not sure why your !path.endsWith("/") is not part of regex
path#regex(a, b) <- paths
if path.startsWith("1")
} yield (a, b)
//> scala.collection.immutable.Map[String,String] = Map(1 -> 2)
Slightly more complicated if you need max:
val regex = """(\d)-(\d)""".r
val paths = List("1-2", "3-4", "555", "1-3")
for {
(_, ps) <-
( for {
path#regex(a, b) <- paths
if path.startsWith("1")
} yield (a, b)
).groupBy(_._1)
} yield ps.maxBy(_._2)
//> scala.collection.immutable.Map[String,String] = Map(1 -> 3)
To ease the development of my map reduce tasks running on Hadoop prior to actually deploying the tasks to Hadoop I test using a simple map reducer I wrote :
object mapreduce {
import scala.collection.JavaConversions._
val intermediate = new java.util.HashMap[String, java.util.List[Int]]
//> intermediate : java.util.HashMap[String,java.util.List[Int]] = {}
val result = new java.util.ArrayList[Int] //> result : java.util.ArrayList[Int] = []
def emitIntermediate(key: String, value: Int) {
if (!intermediate.containsKey(key)) {
intermediate.put(key, new java.util.ArrayList)
}
intermediate.get(key).add(value)
} //> emitIntermediate: (key: String, value: Int)Unit
def emit(value: Int) {
println("value is " + value)
result.add(value)
} //> emit: (value: Int)Unit
def execute(data: java.util.List[String], mapper: String => Unit, reducer: (String, java.util.List[Int]) => Unit) {
for (line <- data) {
mapper(line)
}
for (keyVal <- intermediate) {
reducer(keyVal._1, intermediate.get(keyVal._1))
}
for (item <- result) {
println(item)
}
} //> execute: (data: java.util.List[String], mapper: String => Unit, reducer: (St
//| ring, java.util.List[Int]) => Unit)Unit
def mapper(record: String) {
var jsonAttributes = com.nebhale.jsonpath.JsonPath.read("$", record, classOf[java.util.ArrayList[String]])
println("jsonAttributes are " + jsonAttributes)
var key = jsonAttributes.get(0)
var value = jsonAttributes.get(1)
println("key is " + key)
var delims = "[ ]+";
var words = value.split(delims);
for (w <- words) {
emitIntermediate(w, 1)
}
} //> mapper: (record: String)Unit
def reducer(key: String, listOfValues: java.util.List[Int]) = {
var total = 0
for (value <- listOfValues) {
total += value;
}
emit(total)
} //> reducer: (key: String, listOfValues: java.util.List[Int])Unit
var dataToProcess = new java.util.ArrayList[String]
//> dataToProcess : java.util.ArrayList[String] = []
dataToProcess.add("[\"test1\" , \"test1 here is another test1 test1 \"]")
//> res0: Boolean = true
dataToProcess.add("[\"test2\" , \"test2 here is another test2 test1 \"]")
//> res1: Boolean = true
execute(dataToProcess, mapper, reducer) //> jsonAttributes are [test1, test1 here is another test1 test1 ]
//| key is test1
//| jsonAttributes are [test2, test2 here is another test2 test1 ]
//| key is test2
//| value is 2
//| value is 2
//| value is 4
//| value is 2
//| value is 2
//| 2
//| 2
//| 4
//| 2
//| 2
for (keyValue <- intermediate) {
println(keyValue._1 + "->"+keyValue._2.size)//> another->2
//| is->2
//| test1->4
//| here->2
//| test2->2
}
}
This allows me to run my mapreduce tasks within my Eclipse IDE on Windows before deploying to the actual Hadoop cluster. I would like to perform something similar for Spark or have the ability to write Spark code from within Eclipse to test prior to deploying to Spark cluster. Is this possible with Spark ? Since Spark runs on top of Hadoop does this mean I cannot run Spark without first having Hadoop installed ? So in other words can I run the code using just the Spark libraries ? :
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object SimpleApp {
def main(args: Array[String]) {
val logFile = "$YOUR_SPARK_HOME/README.md" // Should be some file on your system
val sc = new SparkContext("local", "Simple App", "YOUR_SPARK_HOME",
List("target/scala-2.10/simple-project_2.10-1.0.jar"))
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
}
}
taken from https://spark.apache.org/docs/0.9.0/quick-start.html#a-standalone-app-in-scala
If so what are the Spark libraries I need to include within my project ?
Add the following to your build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" and make sure your scalaVersion is set (eg. scalaVersion := "2.10.3")
Also if you're just running the program locally, you can skip the last two arguments to SparkContext as follows val sc = new SparkContext("local", "Simple App")
Finally, Spark can run on Hadoop but can also run in stand alone mode. See: https://spark.apache.org/docs/0.9.1/spark-standalone.html