How to run simple Spark app from Eclipse/Intellij IDE? - java

To ease the development of my map reduce tasks running on Hadoop prior to actually deploying the tasks to Hadoop I test using a simple map reducer I wrote :
object mapreduce {
import scala.collection.JavaConversions._
val intermediate = new java.util.HashMap[String, java.util.List[Int]]
//> intermediate : java.util.HashMap[String,java.util.List[Int]] = {}
val result = new java.util.ArrayList[Int] //> result : java.util.ArrayList[Int] = []
def emitIntermediate(key: String, value: Int) {
if (!intermediate.containsKey(key)) {
intermediate.put(key, new java.util.ArrayList)
}
intermediate.get(key).add(value)
} //> emitIntermediate: (key: String, value: Int)Unit
def emit(value: Int) {
println("value is " + value)
result.add(value)
} //> emit: (value: Int)Unit
def execute(data: java.util.List[String], mapper: String => Unit, reducer: (String, java.util.List[Int]) => Unit) {
for (line <- data) {
mapper(line)
}
for (keyVal <- intermediate) {
reducer(keyVal._1, intermediate.get(keyVal._1))
}
for (item <- result) {
println(item)
}
} //> execute: (data: java.util.List[String], mapper: String => Unit, reducer: (St
//| ring, java.util.List[Int]) => Unit)Unit
def mapper(record: String) {
var jsonAttributes = com.nebhale.jsonpath.JsonPath.read("$", record, classOf[java.util.ArrayList[String]])
println("jsonAttributes are " + jsonAttributes)
var key = jsonAttributes.get(0)
var value = jsonAttributes.get(1)
println("key is " + key)
var delims = "[ ]+";
var words = value.split(delims);
for (w <- words) {
emitIntermediate(w, 1)
}
} //> mapper: (record: String)Unit
def reducer(key: String, listOfValues: java.util.List[Int]) = {
var total = 0
for (value <- listOfValues) {
total += value;
}
emit(total)
} //> reducer: (key: String, listOfValues: java.util.List[Int])Unit
var dataToProcess = new java.util.ArrayList[String]
//> dataToProcess : java.util.ArrayList[String] = []
dataToProcess.add("[\"test1\" , \"test1 here is another test1 test1 \"]")
//> res0: Boolean = true
dataToProcess.add("[\"test2\" , \"test2 here is another test2 test1 \"]")
//> res1: Boolean = true
execute(dataToProcess, mapper, reducer) //> jsonAttributes are [test1, test1 here is another test1 test1 ]
//| key is test1
//| jsonAttributes are [test2, test2 here is another test2 test1 ]
//| key is test2
//| value is 2
//| value is 2
//| value is 4
//| value is 2
//| value is 2
//| 2
//| 2
//| 4
//| 2
//| 2
for (keyValue <- intermediate) {
println(keyValue._1 + "->"+keyValue._2.size)//> another->2
//| is->2
//| test1->4
//| here->2
//| test2->2
}
}
This allows me to run my mapreduce tasks within my Eclipse IDE on Windows before deploying to the actual Hadoop cluster. I would like to perform something similar for Spark or have the ability to write Spark code from within Eclipse to test prior to deploying to Spark cluster. Is this possible with Spark ? Since Spark runs on top of Hadoop does this mean I cannot run Spark without first having Hadoop installed ? So in other words can I run the code using just the Spark libraries ? :
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object SimpleApp {
def main(args: Array[String]) {
val logFile = "$YOUR_SPARK_HOME/README.md" // Should be some file on your system
val sc = new SparkContext("local", "Simple App", "YOUR_SPARK_HOME",
List("target/scala-2.10/simple-project_2.10-1.0.jar"))
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
}
}
taken from https://spark.apache.org/docs/0.9.0/quick-start.html#a-standalone-app-in-scala
If so what are the Spark libraries I need to include within my project ?

Add the following to your build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" and make sure your scalaVersion is set (eg. scalaVersion := "2.10.3")
Also if you're just running the program locally, you can skip the last two arguments to SparkContext as follows val sc = new SparkContext("local", "Simple App")
Finally, Spark can run on Hadoop but can also run in stand alone mode. See: https://spark.apache.org/docs/0.9.1/spark-standalone.html

Related

Spark Accumulator

I am new to accumulators in Spark . I have created an accumulator which gathers the information of sum and count of all columns in a dataframe into a Map.
Which is not functioning as expected , so I have a few doubts.
When I run this class( pasted below) in local mode , I can see that the accumulators getting updated but the final value is still empty.For debug purposes, I added a print Statement in add() .
Q1) Why is the final accumulable not being updated when the accumulator is being added ?
For reference , I studied the CollectionsAccumulator where they have made use of SynchronizedList from Java Collections.
Q2) Does it need to be a synchronized/concurrent collection for an accumulator to update ?
Q3) Which collection will be best suited for such purpose ?
I have attached my execution flow along with spark ui snapshot for analysis.
Thanks.
EXECUTION:
INPUT DATAFRAME -
+-------+-------+
|Column1|Column2|
+-------+-------+
|1 |2 |
|3 |4 |
+-------+-------+
OUTPUT -
Add - Map(Column1 -> Map(sum -> 1, count -> 1), Column2 -> Map(sum -> 2, count -> 1))
Add - Map(Column1 -> Map(sum -> 4, count -> 2), Column2 -> Map(sum -> 6, count -> 2))
TestRowAccumulator(id: 1, name: Some(Test Accumulator for Sum&Count), value: Map())
SPARK UI SNAPSHOT -
CLASS :
class TestRowAccumulator extends AccumulatorV2[Row,Map[String,Map[String,Int]]]{
private var colMetrics: Map[String, Map[String, Int]] = Map[String , Map[String , Int]]()
override def isZero: Boolean = this.colMetrics.isEmpty
override def copy(): AccumulatorV2[Row, Map[String,Map[String,Int]]] = {
val racc = new TestRowAccumulator
racc.colMetrics = colMetrics
racc
}
override def reset(): Unit = {
colMetrics = Map[String,Map[String,Int]]()
}
override def add(v: Row): Unit = {
v.schema.foreach(field => {
val name: String = field.name
val value: Int = v.getAs[Int](name)
if(!colMetrics.contains(name))
{
colMetrics = colMetrics ++ Map(name -> Map("sum" -> value , "count" -> 1 ))
}else
{
val metric = colMetrics(name)
val sum = metric("sum") + value
val count = metric("count") + 1
colMetrics = colMetrics ++ Map(name -> Map("sum" -> sum , "count" -> count))
}
})
}
override def merge(other: AccumulatorV2[Row, Map[String,Map[String,Int]]]): Unit = {
other match {
case t:TestRowAccumulator => {
colMetrics.map(col => {
val map2: Map[String, Int] = t.colMetrics.getOrElse(col._1 , Map())
val map1: Map[String, Int] = col._2
map1 ++ map2.map{ case (k,v) => k -> (v + map1.getOrElse(k,0)) }
} )
}
case _ => throw new UnsupportedOperationException(s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
}
}
override def value: Map[String, Map[String, Int]] = {
colMetrics
}
}
After a bit of debug , I found that merge function is being called .
It had erroneous code so the accumulable value was Map()
EXECUTION FlOW OF ACCUMULATOR (LOCAL MODE) :
ADD
ADD
MERGE
Once I corrected the merge function , accumulator worked as expected

create a map from list in Scala

I need to create a HashMap of directory-to-file in scala while I list all files in the directory. How can I achieve this in scala?
val directoryToFile = awsClient.listFiles(uploadPath).collect {
case path if !path.endsWith("/") => {
path match {
// do some regex matching to get directory & file names
case regex(dir, date) => {
// NEED TO CREATE A HASH MAP OF dir -> date. How???
}
case _ => None
}
}
}
The method listFiles(path: String) returns a Seq[String] of absolute path of all files in the path passed as argument to the function
Try to write more idiomatic Scala. Something like this:
val directoryToFile = (for {
path <- awsClient.listFiles(uploadPath)
if !path.endsWith("/")
regex(dir, date) <- regex.findFirstIn(path)
} yield dir -> date).sortBy(_._2).toMap
You can filter and then foldLeft:
val l = List("""/opt/file1.txt""", """/opt/file2.txt""")
val finalMap = l
.filter(!_.endsWith("/"))
.foldLeft(Map.empty[String, LocalDateTime])((map, s) =>
s match {
case regex(dir, date) => map + (dir -> date)
case _ => map
}
)
You can try something like this:
val regex = """(\d)-(\d)""".r
val paths = List("1-2", "3-4", "555")
for {
// Hint to Scala to produce specific type
_ <- Map("" -> "")
// Not sure why your !path.endsWith("/") is not part of regex
path#regex(a, b) <- paths
if path.startsWith("1")
} yield (a, b)
//> scala.collection.immutable.Map[String,String] = Map(1 -> 2)
Slightly more complicated if you need max:
val regex = """(\d)-(\d)""".r
val paths = List("1-2", "3-4", "555", "1-3")
for {
(_, ps) <-
( for {
path#regex(a, b) <- paths
if path.startsWith("1")
} yield (a, b)
).groupBy(_._1)
} yield ps.maxBy(_._2)
//> scala.collection.immutable.Map[String,String] = Map(1 -> 3)

How to write a custom serializer for Java 8 LocalDateTime

I have a class named Child1 which I want to convert into JSON using Lift Json. Everything is working fine i was using joda date time but now i want to use Java 8 LocalDateTime but i am unable to write custom serializer for this here is my code
import org.joda.time.DateTime
import net.liftweb.json.Serialization.{ read, write }
import net.liftweb.json.DefaultFormats
import net.liftweb.json.Serializer
import net.liftweb.json.JsonAST._
import net.liftweb.json.Formats
import net.liftweb.json.TypeInfo
import net.liftweb.json.MappingException
class Child1Serializer extends Serializer[Child1] {
private val IntervalClass = classOf[Child1]
def deserialize(implicit format: Formats): PartialFunction[(TypeInfo, JValue), Child1] = {
case (TypeInfo(IntervalClass, _), json) => json match {
case JObject(
JField("str", JString(str)) :: JField("Num", JInt(num)) ::
JField("MyList", JArray(mylist)) :: (JField("myDate", JInt(mydate)) ::
JField("number", JInt(number)) ::Nil)
) => {
val c = Child1(
str, num.intValue(), mylist.map(_.values.toString.toInt), new DateTime(mydate.longValue)
)
c.number = number.intValue()
c
}
case x => throw new MappingException("Can't convert " + x + " to Interval")
}
}
def serialize(implicit format: Formats): PartialFunction[Any, JValue] = {
case x: Child1 =>
JObject(
JField("str", JString(x.str)) :: JField("Num", JInt(x.Num)) ::
JField("MyList", JArray(x.MyList.map(JInt(_)))) ::
JField("myDate", JInt(BigInt(x.myDate.getMillis))) ::
JField("number", JInt(x.number)) :: Nil
)
}
}
Object Test extends App {
case class Child1(var str:String, var Num:Int, MyList:List[Int], myDate:DateTime) {
var number: Int=555
}
val c = Child1("Mary", 5, List(1, 2), DateTime.now())
c.number = 1
println("number" + c.number)
implicit val formats = DefaultFormats + new Child1Serializer
val ser = write(c)
println("Child class converted to string" + ser)
var obj = read[Child1](ser)
println("object of Child is "+ obj)
println("str" + obj.str)
println("Num" + obj.Num)
println("MyList" + obj.MyList)
println("myDate" + obj.myDate)
println("number" + obj.number)
}
now i want to use Java 8 LocalDateTime like this
case class Child1(var str: String, var Num: Int, MyList: List[Int], val myDate: LocalDateTime = LocalDateTime.now()) {
var number: Int=555
}
what modification do i need to make in my custom serializer class Child1Serializer i tried to do it but i was unable to do it please help me
In the serializer, serialize the date like this:
def serialize(implicit format: Formats): PartialFunction[Any, JValue] = {
case x: Child1 =>
JObject(
JField("str", JString(x.str)) :: JField("Num", JInt(x.Num)) ::
JField("MyList", JArray(x.MyList.map(JInt(_)))) ::
JField("myDate", JString(x.myDate.toString)) ::
JField("number", JInt(x.number)) :: Nil
)
}
In the deserializer,
def deserialize(implicit format: Formats): PartialFunction[(TypeInfo, JValue), Child1] = {
case (TypeInfo(IntervalClass, _), json) => json match {
case JObject(
JField("str", JString(str)) :: JField("Num", JInt(num)) ::
JField("MyList", JArray(mylist)) :: (JField("myDate", JString(mydate)) ::
JField("number", JInt(number)) ::Nil)
) => {
val c = Child1(
str, num.intValue(), mylist.map(_.values.toString.toInt), LocalDateTime.parse(myDate)
)
c.number = number.intValue()
c
}
case x => throw new MappingException("Can't convert " + x + " to Interval")
}
}
The LocalDateTime object writes to an ISO format using toString and the parse factory method should be able to reconstruct the object from such a string.
You can define the LocalDateTime serializer like this.
class LocalDateTimeSerializer extends Serializer[LocalDateTime] {
private val LocalDateTimeClass = classOf[LocalDateTime]
def deserialize(implicit format: Formats): PartialFunction[(TypeInfo, JValue), LocalDateTime] = {
case (TypeInfo(LocalDateTimeClass, _), json) => json match {
case JString(dt) => LocalDateTime.parse(dt)
case x => throw new MappingException("Can't convert " + x + " to LocalDateTime")
}
}
def serialize(implicit format: Formats): PartialFunction[Any, JValue] = {
case x: LocalDateTime => JString(x.toString)
}
}
Also define your formats like this
implicit val formats = DefaultFormats + new LocalDateTimeSerializer + new FieldSerializer[Child1]
Please note the usage of FieldSerializer to serialize the non-constructor filed, number

Scala/Akka NullPointerException on master actor

I am learning akka framework for parallel processing in scala, and I was trying to migrating a java project to scala so I can learn both akka and scala at the same time. I am get a NullPointerException on master actor when trying to receive mutable object from the worker actor after some computation in the worker. All code is below...
import akka.actor._
import java.math.BigInteger
import akka.routing.ActorRefRoutee
import akka.routing.Router
import akka.routing.RoundRobinRoutingLogic
object Main extends App {
val system = ActorSystem("CalcSystem")
val masterActor = system.actorOf(Props[Master], "master")
masterActor.tell(new Calculate, ActorRef.noSender)
}
class Master extends Actor {
private val messages: Int = 10;
var resultList: Seq[String] = _
//val workerRouter = this.context.actorOf(Props[Worker].withRouter(new RoundRobinRouter(2)), "worker")
var router = {
val routees = Vector.fill(5) {
val r = context.actorOf(Props[Worker])
context watch r
ActorRefRoutee(r)
}
Router(RoundRobinRoutingLogic(), routees)
}
def receive() = {
case msg: Calculate =>
processMessages()
case msg: Result =>
resultList :+ msg.getFactorial().toString
println(msg.getFactorial())
if (resultList.length == messages) {
end
}
}
private def processMessages() {
var i: Int = 0
for (i <- 1 to messages) {
// workerRouter.tell(new Work, self)
router.route(new Work, self)
}
}
private def end() {
println("List = " + resultList)
this.context.system.shutdown()
}
}
import akka.actor._
import java.math.BigInteger
class Worker extends Actor {
private val calculator = new Calculator
def receive() = {
case msg: Work =>
println("Called calculator.calculateFactorial: " + context.self.toString())
val result = new Result(calculator.calculateFactorial)
sender.tell(result, this.context.parent)
case _ =>
println("I don't know what to do with this...")
}
}
import java.math.BigInteger
class Result(bigInt: BigInteger) {
def getFactorial(): BigInteger = bigInt
}
import java.math.BigInteger
class Calculator {
def calculateFactorial(): BigInteger = {
var result: BigInteger = BigInteger.valueOf(1)
var i = 0
for(i <- 1 to 4) {
result = result.multiply(BigInteger.valueOf(i))
}
println("result: " + result)
result
}
}
You initialize the resultList with null and then try to append something.
Does your calculation ever stop? In line
resultList :+ msg.getFactorial().toString
you're creating a copy of sequence with an element appended. But there is no assignment to var resultList
This line will work as you want.
resultList = resultList :+ msg.getFactorial().toString
I recommend you to avoid mutable variables in actor and use context.become
https://github.com/alexandru/scala-best-practices/blob/master/sections/5-actors.md#52-should-mutate-state-in-actors-only-with-contextbecome

Is there a cleaner way to do this Group Query in MongoDB from Groovy?

I'm working on learning MongoDB. Language of choice for the current run at it is Groovy.
Working on Group Queries by trying to answer the question of which pet is the most needy one.
Below is my first attempt and it's awful. Any help cleaning this up (or simply confirming that there isn't a cleaner way to do it) would be much appreciated.
Thanks in advance!
package mongo.pets
import com.gmongo.GMongo
import com.mongodb.BasicDBObject
import com.mongodb.DBObject
class StatsController {
def dbPets = new GMongo().getDB('needsHotel').getCollection('pets')
//FIXME OMG THIS IS AWFUL!!!
def index = {
def petsNeed = 'a walk'
def reduce = 'function(doc, aggregator) { aggregator.needsCount += doc.needs.length }'
def key = new BasicDBObject()
key.put("name", true)
def initial = new BasicDBObject()
initial.put ("needsCount", 0)
def maxNeeds = 0
def needyPets = []
dbPets.group(key, new BasicDBObject(), initial, reduce).each {
if (maxNeeds < it['needsCount']) {
maxNeeds = it['needsCount']
needyPets = []
needyPets += it['name']
} else if (maxNeeds == it['needsCount']) {
needyPets += it['name']
}
}
def needyPet = needyPets
[petsNeedingCount: dbPets.find([needs: petsNeed]).count(), petsNeed: petsNeed, mostNeedyPet: needyPet]
}
}
It should be possible to be change the whole method to this (but I don't have MongoDB to test it)
def index = {
def petsNeed = 'a walk'
def reduce = 'function(doc, aggregator) { aggregator.needsCount += doc.needs.length }'
def key = [ name: true ] as BasicDBObject
def initial = [ needsCount: 0 ] as BasicDBObject
def allPets = dbPets.group( key, new BasicDBObject(), initial, reduce )
def maxNeeds = allPets*.needsCount.collect { it as Integer }.max()
def needyPet = allPets.findAll { maxNeeds == it.needsCount as Integer }.name
[petsNeedingCount: dbPets.find([needs: petsNeed]).count(), petsNeed: petsNeed, mostNeedyPet: needyPet]
}

Categories

Resources