Running Windows 8.1, Java 1.8, Scala 2.10.5, Spark 1.4.1, Scala IDE (Eclipse 4.4), IPython 3.0.0 and Jupyter Scala.
I'm relatively new to Scala and Spark, and I'm seeing an issue where certain RDD actions like collect and first fail with a "Task not serializable" error. What's unusual to me is that I see the error in IPython notebooks with the Scala kernel and in the Scala IDE, but when I run the same code directly in the spark-shell I do not get it.
I would like to set up these two environments for more advanced code evaluation beyond the shell. I have little experience troubleshooting this type of issue or knowing what to look for, so any guidance on how to get started resolving it would be greatly appreciated.
Code:
val logFile = "s3n://[key:[key secret]#mortar-example-data/airline-data"
val sample = sc.parallelize(sc.textFile(logFile).take(100).map(line => line.replace("'","").replace("\"","")).map(line => line.substring(0,line.length()-1)))
val header = sample.first
val data = sample.filter(_!= header)
data.take(1)
data.count
data.collect
Stack Trace
org.apache.spark.SparkException: Task not serializable
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:311)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:310)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
org.apache.spark.rdd.RDD.filter(RDD.scala:310)
cmd49$$user$$anonfun$4.apply(Main.scala:188)
cmd49$$user$$anonfun$4.apply(Main.scala:187)
java.io.NotSerializableException: org.apache.spark.SparkConf
Serialization stack:
- object not serializable (class: org.apache.spark.SparkConf, value: org.apache.spark.SparkConf@5976e363)
- field (class: cmd12$$user, name: conf, type: class org.apache.spark.SparkConf)
- object (class cmd12$$user, cmd12$$user@39a7edac)
- field (class: cmd49, name: $ref$cmd12, type: class cmd12$$user)
- object (class cmd49, cmd49@3c2a0c4f)
- field (class: cmd49$$user, name: $outer, type: class cmd49)
- object (class cmd49$$user, cmd49$$user@774ea026)
- field (class: cmd49$$user$$anonfun$4, name: $outer, type: class cmd49$$user)
- object (class cmd49$$user$$anonfun$4, <function0>)
- field (class: cmd49$$user$$anonfun$4$$anonfun$apply$3, name: $outer, type: class cmd49$$user$$anonfun$4)
- object (class cmd49$$user$$anonfun$4$$anonfun$apply$3, <function1>)
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:311)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:310)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
org.apache.spark.rdd.RDD.filter(RDD.scala:310)
cmd49$$user$$anonfun$4.apply(Main.scala:188)
cmd49$$user$$anonfun$4.apply(Main.scala:187)
@Ashalynd was right that sc.textFile already creates an RDD, so you don't need sc.parallelize in that case (documentation here).
So considering your example, this is what you'll need to do:
// Read your data from S3
val logFile = "s3n://[key:[key secret]#mortar-example-data/airline-data"
val rawRDD = sc.textFile(logFile)
// Fetch the header
val header = rawRDD.first
// Filter out the header, then map to clean each line
val sample = rawRDD.filter(!_.contains(header)).map { line =>
  line.replaceAll("['\"]", "").substring(0, line.length() - 1)
}.takeSample(false, 100, 12L) // takeSample returns a fixed-size sampled subset of this RDD as an Array
It's better to use the takeSample function :
def takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]
withReplacement : whether sampling is done with replacement
num : size of the returned sample
seed : seed for the random number generator
Note 1: the sample is an Array[String], so if you wish to transform it into an RDD, you can use the parallelize function as follows:
val sampleRDD = sc.parallelize(sample.toSeq)
Note 2: if you wish to take a sample RDD directly from your rawRDD.filter(...).map(...), you can use the sample function, which returns an RDD[T]. In that case you specify a fraction of the data you want instead of a specific number of elements.
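For completeness, here is a minimal sketch of that sample-based variant, reusing the rawRDD and header values defined above; the 0.01 fraction and the seed are arbitrary example values, not something taken from the question:
// sample keeps roughly a fraction of the rows and returns an RDD[String],
// unlike takeSample, which returns a local Array[String]
val sampledRDD = rawRDD
  .filter(!_.contains(header))
  .map(line => line.replaceAll("['\"]", "").substring(0, line.length() - 1))
  .sample(withReplacement = false, fraction = 0.01, seed = 12L)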
sc.textFile already creates a distributed dataset (check the documentation). You don't need sc.parallelize in that case, but, as eliasah properly noted, take gives you back a local collection, so you need to turn the result into an RDD again if you want an RDD.
val selection = sc.textFile(logFile). // RDD
  take(100).                          // local collection (Array[String])
  map(_.replaceAll("['\"]", "")).     // use a regex to match both quote characters
  map(_.init)                         // init returns all characters except the last
// turn the resulting collection into an RDD again
val sample = sc.parallelize(selection)
I have a Scala class that wraps an Avro record with getters and setters. I'm using Jython to let users write Python scripts that process the Avro record and ultimately call json.dumps on the newly processed record.
The issue is that if the user wants to grab a value from the record that is an Array, the interpreter complains that the object is not JSON serializable.
import json
json.dumps(<AClass>.getArray('myArray'))
AClass is made available to any given Python script at run time. The Scala AClass:
class AClass {
  def getArray(fieldName: String): Array[Integer] = {
    val value: GenericData.Array[Integer] = [....]
    value
      .asInstanceOf[GenericData.Array[Integer]]
      .asScala
      .toArray
  }
}
I've tried a few other return types: 1) List[Integer], 2) mutable.Buffer[Integer], and 3) the plain Avro generic GenericData.Array[T]. All give the same serialization error, with slightly varying objects:
Runtime exception occurred during Python processing. TypeError: List(1, 2, 3) is not JSON serializable.
... Buffer(1, 2, 3) is not JSON serializable
... [1, 2, 3] is not JSON serializable.
It seems that if we convert the value with list() from within the Python script, it works fine. That gave me some leads, but I need this to happen at the Scala level.
import json
json.dumps(list(<AClass>.getArray('myArray')))
Is there any way to achieve this? What Scala / Java list type would translate directly into a Python list and/or be JSON serializable within the Jython interpreter?
The json module in Jython only accepts standard Jython data types (that's why converting with list() works). See: https://docs.python.org/3/library/json.html#py-to-json-table
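If you want to handle it on the Scala side, one hedged option (a sketch, assuming Jython 2.7 and its org.python.core API on the classpath, not something tested against your setup) is to hand the script a Jython-native PyList instead of a Scala or Java collection:
import org.python.core.{Py, PyList, PyObject}

// Wrap each boxed Integer as a PyObject and build a PyList, which Jython's json
// module treats as a standard list.
def toPyList(values: Array[Integer]): PyList = {
  val pyValues: Array[PyObject] = values.map(v => Py.java2py(v))
  new PyList(pyValues)
}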
Is there any way to pass over a field when parsing a truncated log in Kaitai Struct?
Right now, if it reads a field whose type is specified as an enum but the value is not listed in that enum, it raises a NullPointerException.
So I'd like to ask whether there is any way to achieve something like the default: pass attribute in the Python library Construct.
Here is my ksy file:
meta:
  id: btsnoop
  endian: be
seq:
  - id: header
    type: header
  - id: packets
    type: packet
    repeat: eos
types:
  header:
    seq:
      - id: iden
        size: 8
      - id: version
        type: u4
      - id: datalink_type
        type: u4
        enum: linktype
  packet:
    seq:
      - id: ori_len
        type: u4
      - id: include_len
        type: u4
      - id: pkt_flags
        type: u4
      - id: cumu_drop
        type: u4
      - id: timestamp
        type: s8
      - id: data
        size: include_len
        type: frame
  frame:
    seq:
      - id: pkt_type
        type: u1
        enum: pkttype
      - id: cmd
        type: cmd
        if: pkt_type == pkttype::cmd_pkt
      - id: acl
        type: acl
        if: pkt_type == pkttype::acl_pkt
      - id: evt
        type: evt
        if: pkt_type == pkttype::evt_pkt
  cmd:
    seq:
      - id: opcode
        type: u2le
      - id: params_len
        type: u1
      - id: params
        size: params_len
  acl:
    seq:
      - id: handle
        type: u2le
  evt:
    seq:
      - id: status
        type: u1
        enum: status
      - id: total_length
        type: u1
      - id: params
        size-eos: true
enums: # <-- do I need to list all possible options in every enum?
  linktype:
    0x03E9: unencapsulated_hci
    0x03EA: hci_uart
    0x03EB: hci_bscp
    0x03EC: hci_serial
  pkttype:
    1: cmd_pkt
    2: acl_pkt
    4: evt_pkt
  status:
    0x0D: complete_D
    0x0E: complete_E
    0xFF: vendor_specific
Thanks for any reply :)
You're actually facing two separate questions here :)
Parsing partial / truncated / damaged data
The main problem here is that normally Kaitai Struct compiles a .ksy into code that does the actual parsing in the class constructor. That means that if a problem arises, boom, you've got no object at all. In most use cases this is the desired behavior, as it lets you be sure the object is fully initialized. The problem is usually an EOFException, raised when the format wants to read the next primitive but there's no data left in the stream, or, in more complicated cases, something else.
However, there are some use cases, as you've mentioned, where "best effort" parsing would be helpful, i.e. you're ok with getting a half-filled object. Another popular use case is the visualizer: it's better to show the user a half-parsed result (to aid in locating the error) than no result at all (and leave the user guessing).
There's a simple solution for that in Kaitai Struct: you can compile your class with the --debug option. This way you'll get a class in which object creation and parsing are separated; parsing becomes just another method on the object (void _read()). However, this means you'll have to call the parsing method manually. For example, if your original code was:
Btssnoop b = Btssnoop.fromFile("/path/to/file.bin");
System.out.println(b.packets.size());
after compiling with --debug, you'll have to do an extra step:
Btssnoop b = Btssnoop.fromFile("/path/to/file.bin");
b._read();
System.out.println(b.packets.size());
and then you can wrap it in a try/catch block and continue processing even after getting an IOException:
Btssnoop b = Btssnoop.fromFile("/path/to/file.bin");
try {
    b._read();
} catch (IOException e) {
    System.out.println("warning: truncated packets");
}
System.out.println(b.packets.size());
There are a few catches, though:
--debug was not yet available for the Java target as of release v0.3; actually, it's not even in the public git repository right now, though I hope to push it soon.
--debug also does a few extra things, like recording the position of every attribute, which imposes a fairly harsh performance/memory penalty. Tell me if you need a switch that compiles the "separate constructor/parsing" functionality without the rest of the --debug behavior; I could add a switch to enable just that.
If you need to continuously parse incoming packets as they arrive, it's probably a bad idea to store them all in memory and re-parse them on every update. We're considering an event-based parsing model for that; please tell me if you'd be interested in it.
Missing enum values and NPE
The current Java implementation translates enum reading into something like
this.pet1 = Animal.byId(_io.readU4le());
where Animal.byId is translated into:
private static final Map<Long, Animal> byId = new HashMap<Long, Animal>(3);
static {
    for (Animal e : Animal.values())
        byId.put(e.id(), e);
}
public static Animal byId(long id) { return byId.get(id); }
Java's Map.get returns null by contract when no value is found in the map. You should be able to compare that null with something (e.g. another enum value) and get a proper true or false. Can you show me where exactly you hit the NPE, i.e. your code, the generated code and the stack trace?
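For illustration, a hedged sketch of that comparison from the calling side; the class and accessor names here (Btssnoop, Frame, Pkttype, pktType) are my guesses at what the generated Java code looks like, not something taken from your project:
// An unknown id makes the enum lookup return null instead of throwing, so the
// caller can test for it explicitly before comparing against known values.
def describe(frame: Btssnoop.Frame): Unit = {
  val pktType = frame.pktType() // may be null if the id is not listed in the enum
  if (pktType == null)
    println("warning: unknown pkt_type id in this frame")
  else if (pktType == Btssnoop.Pkttype.CMD_PKT)
    println("command packet")
}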
I'm using Mule ESB (Java based) and I have some Scala components that modify and create data. My data is represented in case classes. I'm trying to convert them to Java; however, just getting them converted to Scala types is a challenge. Here's a simplified example of what I'm trying to do:
package com.echostar.ese.experiment
import scala.collection.JavaConverters
case class Resource(guid: String, filename: String)
case class Blackboard(name: String, guid:String, resource: Resource)
object CCC extends App {
  val res = Resource("4alskckd", "test.file")
  val bb = Blackboard("Test", "123asdfs", res)
  val myMap = getCCParams(bb)
  val result = new java.util.HashMap[String, Object](myMap)
  println("Result:" + result)

  def getCCParams(cc: AnyRef) =
    (Map[String, Any]() /: cc.getClass.getDeclaredFields) { (a, f) =>
      f.setAccessible(true)
      val value = f.get(cc) match {
        // this covers tuples as well as case classes, so there may be a more specific way
        case caseClassInstance: Product => getCCParams(caseClassInstance): Map[String, Any]
        case x => x
      }
      a + (f.getName -> value)
    }
}
Current Error: Recursive method needs return type.
My Scala foo isn't very strong. I grabbed this method from another answer here,
and I basically know what it's doing, but not well enough to change it to use java.util.HashMap and java.util.List.
Expected Output:
Result:{"name"="Test", "guid"="123asdfs", "resource"= {"guid"="4alskckd", "filename"="test.file"}}
UPDATE1:
1. Added getCCParams(caseClassInstance): Map[String, Any] to line 22 above per @cem-catikkas. The IDE syntax error still says "recursive method ... needs result type" and "overloaded method java.util.HashMap cannot be applied to scala.collection.immutable.Map".
2. Changed java.util.HashMap[String, Object]
You should follow what the error tells you. Since getCCParams is a recursive method you need to declare its return type.
def getCCParams(cc: AnyRef): Map[String, Any]
Answering this in case anyone else going through the issue ends up here (as happened to me).
I believe the error you were getting had to do with the fact that the return type was being declared at the method invocation (line 22), whereas the compiler expects it at the method's declaration (in your case, line 17). The following seems to work:
def getCCParams(cc: AnyRef): Map[String, Any] = ...
Regarding the conversion from a Scala Map to a Java HashMap: by adding the ._ wildcard to the JavaConverters import statement, you import all the members of that object as single identifiers, which is a requirement for implicit conversions. This includes the asJava method, which converts the Scala Map to a Java one; the result can then be passed to the java.util.HashMap(Map<? extends K, ? extends V> m) constructor to instantiate a HashMap:
import scala.collection.JavaConverters._
import java.util.{HashMap => JHashMap}
...
val myMap = getCCParams(bb)
val r = myMap.asJava // converting to java.util.Map[String, Any]
val result: JHashMap[String,Any] = new JHashMap(r)
I wonder if you've considered going at it the other way around, by implementing the java.util.Map interface in your case class? Then you wouldn't have to convert back and forth, but any consumers downstream that are using a Map interface will just work (for example if you're using Groovy's field dot-notation).
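If that route appeals, here is a hedged sketch of what it could look like for the Resource case class above (my own illustration, not something tested in your project): extending java.util.AbstractMap means only entrySet needs to be implemented, and the object then reads as a read-only Map from Java or Groovy. Nesting and mutation are deliberately left out to keep it short.
import java.util

case class Resource(guid: String, filename: String)
    extends util.AbstractMap[String, AnyRef] {

  // Expose the case class fields as map entries; get, keySet, values, size, etc.
  // all come for free from AbstractMap.
  override def entrySet(): util.Set[util.Map.Entry[String, AnyRef]] = {
    val entries = new util.LinkedHashSet[util.Map.Entry[String, AnyRef]]()
    entries.add(new util.AbstractMap.SimpleImmutableEntry[String, AnyRef]("guid", guid))
    entries.add(new util.AbstractMap.SimpleImmutableEntry[String, AnyRef]("filename", filename))
    entries
  }
}
A Groovy consumer could then read resource.filename through the Map interface (or resource.get("filename") from Java) without any conversion step.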
I've been stuck on this particular problem for about a week now, and I figure I'm going to write this up as a question on here to clear out my thoughts and get some guidance.
So I have this case class that has a java.sql.Timestamp field:
case class Request(id: Option[Int], requestDate: Timestamp)
and I want to convert this to a JsObject
val q = Query(Requests).list // This is Slick, a database access lib for Scala
printList(q)
Ok(Json.toJson(q)) // and this is where I run into trouble
"No Json deserializer found for type List[models.Request]. Try to implement an implicit Writes or Format for this type." Okay, that makes sense.
So following the Play documentation here, I attempt to write a Format...
implicit val requestFormat = Json.format[Request] // need Timestamp deserializer
implicit val timestampFormat = (
(__ \ "time").format[Long] // error 1
)(Timestamp.apply, unlift(Timestamp.unapply)) // error 2
Error 1
Description Resource Path Location Type overloaded method value format with alternatives:
(w: play.api.libs.json.Writes[Long])(implicit r: play.api.libs.json.Reads[Long])play.api.libs.json.OFormat[Long]
<and>
(r: play.api.libs.json.Reads[Long])(implicit w: play.api.libs.json.Writes[Long])play.api.libs.json.OFormat[Long]
<and>
(implicit f: play.api.libs.json.Format[Long])play.api.libs.json.OFormat[Long]
cannot be applied to (<error>, <error>)
Apparently importing like so (see the documentation "ctrl+F import") is getting me into trouble:
import play.api.libs.json._ // so I change this to import only Format and fine
import play.api.libs.functional.syntax._
import play.api.libs.json.Json
import play.api.libs.json.Json._
Now that the overloading error has gone away, I run into more trouble: not found: value __. I imported .../functional.syntax._ already, just like the documentation says! This guy ran into the same issue, but the import fixed it for him! So why?! I thought this might just be Eclipse's problem and tried play run anyway ... nothing changed. Fine. The compiler is always right.
Imported play.api.libs.json.JsPath, changed __ to JsPath, and voilà:
Error 2
value apply is not a member of object java.sql.Timestamp
value unapply is not a member of object java.sql.Timestamp
I also try changing tack and writing a Writes for this instead of a Format, without the fancy new combinator (__) feature, by following the original blog post the official docs are based on/copy-pasted from:
// I change the imports above to use Writes instead of Format
implicit val timestampFormat = new Writes[Timestamp]( // ERROR 3
def writes(t: Timestamp): JsValue = { // ERROR 4 def is underlined
Json.obj(
/* Returns the number of milliseconds since
January 1, 1970, 00:00:00 GMT represented by this Timestamp object. */
"time" -> t.getTime()
)
}
)
ERROR 3: trait Writes is abstract, cannot be instantiated
ERROR 4: illegal start of simple expression
At this point I'm about at my wits' end, so I'm popping back up my mental stack and reporting from my first piece of code.
My utter gratitude to anybody who can put me out of my coding misery.
It's not necessarily the apply or unapply functions you need. It's (a) a function that constructs the type you need given some parameters, and (b) a function that turns an instance of that type into a tuple of values (usually matching the input parameters).
The apply and unapply functions you get for free with a Scala case class just happen to do this, so it's convenient to use them. But you can always write your own.
Normally you could do this with anonymous functions like so:
import java.sql.Timestamp
import play.api.libs.functional.syntax._
import play.api.libs.json._
implicit val timestampFormat: Format[Timestamp] = (
(__ \ "time").format[Long]
)((long: Long) => new Timestamp(long), (ts: Timestamp) => (ts.getTime))
However! In this case you fall foul of a limitation with the API that prevents you from writing formats like this, with only one value. This limitation is explained here, as per this answer.
For you, a way that works would be this more complex-looking hack:
import java.sql.Timestamp
import play.api.libs.functional.syntax._
import play.api.libs.json._
implicit val rds: Reads[Timestamp] = (__ \ "time").read[Long].map{ long => new Timestamp(long) }
implicit val wrs: Writes[Timestamp] = (__ \ "time").write[Long].contramap{ (a: Timestamp) => a.getTime }
implicit val fmt: Format[Timestamp] = Format(rds, wrs)
// Test it...
val testTime = Json.obj("time" -> 123456789)
assert(testTime.as[Timestamp] == new Timestamp(123456789))
I'm new to Scala and our project mixes Java and Scala code together (using the Play Framework). I'm trying to write a Scala method that can take a nested Java Map such as:
LinkedHashMap<String, LinkedHashMap<String, String>> groupingA = new LinkedHashMap<String, LinkedHashMap<String,String>>();
And have that Java object passed to a Scala function that can loop through it. I have the following Scala type definition to try to support the above nested Java map:
Seq[(String, Seq[(String,String)])]
Both the Java file and the Scala file compile fine individually, but when my Java object tries to create a new instance of my Scala class and pass in the nested map, I get a compiler error with the following details:
[error] ..... overloaded method value apply with alternatives:
[error] (options: java.util.List[String])scala.collection.mutable.Buffer[(String, String)] <and>
[error] (options: scala.collection.immutable.List[String])List[(String, String)] <and>
[error] (options: java.util.Map[String,String])Seq[(String, String)] <and>
[error] (options: scala.collection.immutable.Map[String,String])Seq[(String, String)] <and>
[error] (options: (String, String)*)Seq[(String, String)]
[error] cannot be applied to (java.util.LinkedHashMap[java.lang.String,java.util.LinkedHashMap[java.lang.String,java.lang.String]])
Any ideas on how I can pass a nested Java LinkedHashMap like the one above into a Scala class where I can generically iterate over the nested collection? I'm trying to write this generically enough that it would also work for a nested Scala collection, in case we ever switch to writing our Play Framework controllers in Scala instead of Java.
Seq is a base trait defined in the Scala collections hierarchy. While Java and Scala offer bytecode compatibility, Scala defines a number of its own types, including its own collection library. The rub here is that if you want to write idiomatic Scala, you need to convert your Java data to Scala data. The way I see it, you have a few options:
You can use Richard's solution and convert the Java types to Scala types in your Scala code. I think this is ugly because it assumes your input will always come from Java land.
You can write a beautiful, perfect Scala handler and provide a companion object that offers the ugly Java conversion behavior. This disentangles your Scala implementation from the Java details.
Or you could write an implicit def like the one below, genericizing it to your heart's content.
import java.util.LinkedHashMap
import scala.collection.JavaConversions.mapAsScalaMap

object App {
  implicit def wrapLhm[K, V, G](i: LinkedHashMap[K, LinkedHashMap[G, V]]): LHMWrapper[K, V, G] =
    new LHMWrapper[K, V, G](i)

  def main(args: Array[String]) {
    println("Hello World!")
    val lhm = new LinkedHashMap[String, LinkedHashMap[String, String]]()
    val inner = new LinkedHashMap[String, String]()
    inner.put("one", "one")
    lhm.put("outer", inner)
    val s = lhm.getSeq()
    println(s.toString())
  }

  class LHMWrapper[K, V, G](value: LinkedHashMap[K, LinkedHashMap[G, V]]) {
    def getSeq(): Seq[(K, Seq[(G, V)])] =
      mapAsScalaMap(value).mapValues(mapAsScalaMap(_).toSeq).toSeq
  }
}
Try this:
import scala.collection.JavaConversions.mapAsScalaMap
val lhm: LinkedHashMap[String, LinkedHashMap[String, String]] = getLHM()
val scalaMap = mapAsScalaMap(lhm).mapValues(mapAsScalaMap(_).toSeq).toSeq
I tested this and got a result of type Seq[(String, Seq[(String, String)])].
(The conversions wrap the original Java object rather than creating a Scala object with a copy of the values, so the conversions to Seq aren't strictly necessary; you could leave it as a Map, and the iteration order will be the same.)
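As a small sketch of that, reusing the placeholder getLHM() from the snippet above, you can iterate the wrapped views directly instead of converting to Seq:
import java.util.LinkedHashMap
import scala.collection.JavaConversions.mapAsScalaMap

val lhm: LinkedHashMap[String, LinkedHashMap[String, String]] = getLHM()
// The wrappers preserve the LinkedHashMap insertion order while iterating.
for ((outerKey, inner) <- mapAsScalaMap(lhm); (innerKey, value) <- mapAsScalaMap(inner))
  println(s"$outerKey / $innerKey -> $value")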
Let me guess, are you processing query parameters?