I have a Scala class that wraps an Avro record with getters and setters. I'm using Jython to let users write Python scripts that process the Avro record and ultimately run json.dumps on the processed record.
The issue is, if the user wants to grab a value that is an Array from the record, the interpreter complains that the object is not JSON serializable.
import json
json.dumps(<AClass>.getArray('myArray'))
The AClass is made available to any given python script at run time. Scala AClass:
import org.apache.avro.generic.GenericData
import scala.collection.JavaConverters._

class AClass {
  def getArray(fieldName: String): Array[Integer] = {
    val value = ... // field lookup elided
    value
      .asInstanceOf[GenericData.Array[Integer]]
      .asScala
      .toArray
  }
}
I've tried a few other return types: List[Integer], mutable.Buffer[Integer], and the plain Avro GenericData.Array[T]. All give the same serialization error with slightly varying objects:
Runtime exception occurred during Python processing. TypeError: List(1, 2, 3) is not JSON serializable.
... Buffer(1, 2, 3) is not JSON serializable
... [1, 2, 3] is not JSON serializable.
Now it seems that if we convert it to a list() from within the Python script, it works fine. This gave me some leads, but I need the conversion to happen at the Scala level.
import json
json.dumps(list(<AClass>.getArray('myArray')))
Is there any way to achieve this? What Scala / Java list type would translate directly into the python list type and/or be JSON serializable within the Jython py interpreter?
The json module in Jython only accepts standard Python data types, which is why converting to list() works: a Java or Scala collection reaches the interpreter as a wrapped Java object, not a Python list. See: https://docs.python.org/3/library/json.html#py-to-json-table
I am trying to remotely invoke an MBean via a commandline. Right now, I am able to list attributes and operations. For example, I can list all the attributes and operations for HotspotDiagnostic using this command:
java -jar cmdline-jmxclient-0.10.3.jar admin:P#sSw0rd 10.11.12.13:1111 com.sun.management:type=HotSpotDiagnostic
Which gives me this list of Attributes and Operations
Attributes:
DiagnosticOptions: DiagnosticOptions (type=[Ljavax.management.openmbean.CompositeData;)
ObjectName: ObjectName (type=javax.management.ObjectName)
Operations:
dumpHeap: dumpHeap
Parameters 2, return type=void
name=p0 type=java.lang.String p0
name=p1 type=boolean p1
getVMOption: getVMOption
Parameters 1, return type=javax.management.openmbean.CompositeData
name=p0 type=java.lang.String p0
setVMOption: setVMOption
Parameters 2, return type=void
name=p0 type=java.lang.String p0
name=p1 type=java.lang.String p1
But now let's say I want to invoke the dumpHeap operation, which takes two parameters p0 and p1 of type String and boolean, respectively. How would I pass those arguments in?
I've tried these:
java -jar cmdline-jmxclient-0.10.3.jar admin:P#sSw0rd 10.11.12.13:1111 com.sun.management:type=HotSpotDiagnostic dumpHeap p0=aaa p1=true
java -jar cmdline-jmxclient-0.10.3.jar admin:P#sSw0rd 10.11.12.13:1111 com.sun.management:type=HotSpotDiagnostic dumpHeap aaa true
But I'm not sure what the syntax is, or even what I'm supposed to pass for the string parameter. This isn't for anything specific btw. Merely want to learn and understand more about how to leverage these operations from the command line. Any docs and assistance much appreciated.
EDIT: I'm naive. Oracle docs indicate the string param is an output file per this link. But still uncertain about how to pass the parameters into my command.
According to the cmdline-jmxclient documentation (http://crawler.archive.org/cmdline-jmxclient/), you pass comma-delimited parameters to your operation.
So in your case it would be:
java -jar cmdline-jmxclient-0.10.3.jar admin:P#sSw0rd 10.11.12.13:1111 com.sun.management:type=HotSpotDiagnostic dumpHeap test,true
Take note that there is a known bug in the cmdline jar: it doesn't take Java primitives (int, boolean, byte, etc.) into account and will throw a ClassNotFoundException because it can't load a class by the primitive's name.
If you come across this issue, you can either apply the patch to the jar code documented here: https://webarchive.jira.com/browse/HER-1630, or simply change the type field in the JMX endpoint code from its primitive type to its wrapper object type (int -> Integer).
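For comparison, the same kind of invocation can be sketched programmatically against the local JVM's platform MBean server using only the JDK API. This is a sketch, not a replacement for the cmdline client: it reads a VM option rather than dumping the heap (so it has no side effects), and the option name is just an example. The key point is that invoke() takes the parameter values and their declared JMX signature as parallel arrays; for dumpHeap the signature would be {"java.lang.String", "boolean"}, using the primitive name "boolean" that the client bug described above chokes on.

```java
import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

public class JmxInvokeSketch {
    public static void main(String[] args) throws Exception {
        // Local equivalent of pointing cmdline-jmxclient at host:port
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("com.sun.management:type=HotSpotDiagnostic");

        // Parameter values and their declared JMX signature go in as
        // parallel arrays (positional, like the comma-delimited CLI form)
        CompositeData option = (CompositeData) server.invoke(
                name,
                "getVMOption",
                new Object[] { "HeapDumpOnOutOfMemoryError" },
                new String[] { "java.lang.String" });

        System.out.println(option.get("value"));
    }
}
```

The same parallel-array pattern works over a remote JMXConnector once you have an MBeanServerConnection, which is all the cmdline client is wrapping.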
I am trying Python and Java integration using Jep. I have loaded a random forest model from a pickle file (rf.pkl) as a sklearn.ensemble.forest.RandomForestClassifier object from a Java program using Jep.
I want this loading to happen only once, so I want to execute a Python function defined in the Python script prediction.py (to predict using the rf model) by passing the "rfmodel" argument from Java.
But the argument sent from Java is read as a string in Python. How can I retain the argument's datatype in Python as sklearn.ensemble.forest.RandomForestClassifier?
Jep jep = new Jep();
jep.eval("import pickle");
jep.eval("clf = pickle.load(open('C:/Downloads/DSRFmodel.pkl', 'rb'))");
jep.eval("print(type(clf))");
Object randomForest = jep.getValue("clf");
jep.eval("import integration");
jep.set("arg1", requestId);
jep.set("arg2", randomForest);
jep.eval("result = integration.trainmodel(arg1, arg2)");
------------
python.py
import pickle
def trainmodel(requestid, rf):
    # when rf is printed here, its type is 'str'
When Jep converts a Python object into a Java object, if it does not recognize the Python type it will return the String representation of the Python object; see this bug for discussion of that behavior. If you are running the latest version of Jep (3.8), you can override this behavior by passing a Java class to the getValue function. The PyObject class was created to serve as a generic wrapper around arbitrary Python objects. The following code should do what you want:
Jep jep = new Jep();
jep.eval("import pickle");
jep.eval("clf = pickle.load(open('C:/Downloads/DSRFmodel.pkl', 'rb'))");
jep.eval("print(type(clf))");
Object randomForest = jep.getValue("clf", PyObject.class);
jep.eval("import integration");
jep.set("arg1", requestId);
jep.set("arg2", randomForest);
jep.eval("result = integration.trainmodel(arg1, arg2)");
I am hitting a bug when using Java protocol buffer classes as the object model for RDDs in Spark jobs.
For my application, my .proto file has properties that are repeated string. For example:
message OntologyHumanName
{
repeated string family = 1;
}
From this, the 2.5.0 protoc compiler generates Java code like
private com.google.protobuf.LazyStringList family_ = com.google.protobuf.LazyStringArrayList.EMPTY;
If I run a Scala Spark job that uses the Kryo serializer I get the following error
Caused by: java.lang.NullPointerException
at com.google.protobuf.UnmodifiableLazyStringList.size(UnmodifiableLazyStringList.java:61)
at java.util.AbstractList.add(AbstractList.java:108)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
... 40 more
The same code works fine with spark.serializer=org.apache.spark.serializer.JavaSerializer.
My environment is CDH QuickStart 5.5 with JDK 1.8.0_60
Try registering the lazy list class with:
Kryo kryo = new Kryo();
kryo.register(com.google.protobuf.LazyStringArrayList.class);
Also, for custom protobuf messages, take a look at the solution in this answer for registering custom/nested classes generated by protoc.
I think your RDD's element type contains the class OntologyHumanName, e.g. RDD[(String, OntologyHumanName)], and that RDD happens to go through a shuffle stage. See https://github.com/EsotericSoftware/kryo#kryoserializable: Kryo can't serialize an abstract class.
Read the spark doc: http://spark.apache.org/docs/latest/tuning.html#data-serialization
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
And from the Kryo docs:
public class SomeClass implements KryoSerializable {
// ...
public void write (Kryo kryo, Output output) {
// ...
}
public void read (Kryo kryo, Input input) {
// ...
}
}
But the class OntologyHumanName is generated automatically by protobuf, so I don't think implementing KryoSerializable on it is a good approach.
Try using a case class in place of OntologyHumanName to avoid serializing OntologyHumanName directly. I didn't try this way, so it possibly doesn't work:
case class OntologyHumanNameScalaCaseClass(val humanNames: OntologyHumanName)
An ugly way: I just converted the protobuf class to Scala types before the shuffle. This way can't fail. For example:
import scala.collection.JavaConverters._
val humanNameObj: OntologyHumanName = ...
val families: List[String] = humanNameObj.getFamilyList.asScala.toList // use this in place of humanNameObj
Hope this resolves your problem.
Running windows 8.1, Java 1.8, Scala 2.10.5, Spark 1.4.1, Scala IDE (Eclipse 4.4), Ipython 3.0.0 and Jupyter Scala.
I'm relatively new to Scala and Spark and I'm seeing an issue where certain RDD commands like collect and first return the "Task not serializable" error. What's unusual to me is I see that error in Ipython notebooks with the Scala kernel or the Scala IDE. However when I run the code directly in the spark-shell I do not receive this error.
I would like to set up these two environments for more advanced code evaluation beyond the shell. I have little expertise in troubleshooting this type of issue and determining what to look for; if you can provide guidance on how to get started with resolving this kind of issue, that would be greatly appreciated.
Code:
val logFile = "s3n://[key:[key secret]#mortar-example-data/airline-data"
val sample = sc.parallelize(sc.textFile(logFile).take(100).map(line => line.replace("'","").replace("\"","")).map(line => line.substring(0,line.length()-1)))
val header = sample.first
val data = sample.filter(_!= header)
data.take(1)
data.count
data.collect
Stack Trace
org.apache.spark.SparkException: Task not serializable
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:311)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:310)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
org.apache.spark.rdd.RDD.filter(RDD.scala:310)
cmd49$$user$$anonfun$4.apply(Main.scala:188)
cmd49$$user$$anonfun$4.apply(Main.scala:187)
java.io.NotSerializableException: org.apache.spark.SparkConf
Serialization stack:
- object not serializable (class: org.apache.spark.SparkConf, value: org.apache.spark.SparkConf#5976e363)
- field (class: cmd12$$user, name: conf, type: class org.apache.spark.SparkConf)
- object (class cmd12$$user, cmd12$$user#39a7edac)
- field (class: cmd49, name: $ref$cmd12, type: class cmd12$$user)
- object (class cmd49, cmd49#3c2a0c4f)
- field (class: cmd49$$user, name: $outer, type: class cmd49)
- object (class cmd49$$user, cmd49$$user#774ea026)
- field (class: cmd49$$user$$anonfun$4, name: $outer, type: class cmd49$$user)
- object (class cmd49$$user$$anonfun$4, <function0>)
- field (class: cmd49$$user$$anonfun$4$$anonfun$apply$3, name: $outer, type: class cmd49$$user$$anonfun$4)
- object (class cmd49$$user$$anonfun$4$$anonfun$apply$3, <function1>)
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:311)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:310)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
org.apache.spark.rdd.RDD.filter(RDD.scala:310)
cmd49$$user$$anonfun$4.apply(Main.scala:188)
cmd49$$user$$anonfun$4.apply(Main.scala:187)
@Ashalynd was right about the fact that sc.textFile already creates an RDD. You don't need sc.parallelize in that case. Documentation here.
So considering your example, this is what you'll need to do :
// Read your data from S3
val logFile = "s3n://[key:[key secret]#mortar-example-data/airline-data"
val rawRDD = sc.textFile(logFile)
// Fetch the header
val header = rawRDD.first
// Filter out the header, then map to clean the lines
val sample = rawRDD.filter(!_.contains(header)).map {
line => line.replaceAll("['\"]","").substring(0,line.length()-1)
}.takeSample(false,100,12L) // takeSample returns a fixed-size sampled subset of this RDD in an array
It's better to use the takeSample function:
def takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]
withReplacement : whether sampling is done with replacement
num : size of the returned sample
seed : seed for the random number generator
Note 1: the sample is an Array[String], so if you wish to transform it into an RDD, you can use the parallelize function as follows:
val sampleRDD = sc.parallelize(sample.toSeq)
Note 2: If you wish to take a sample RDD directly from your rawRDD.filter(...).map(...), you can use the sample function, which returns an RDD[T]. Nevertheless, you'll need to specify a fraction of the data you need instead of a specific number.
sc.textFile already creates a distributed dataset (check the documentation). You don't need sc.parallelize in that case, but, as eliasah properly noted, you need to turn the result back into an RDD if you want an RDD.
val selection = sc.textFile(logFile).  // RDD
  take(100).                           // local collection
  map(_.replaceAll("['\"]", "")).      // use one regex to match both quote chars
  map(_.init)                          // init returns all elements except the last
// turn the resulting collection into an RDD again
val sample = sc.parallelize(selection)
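Spark aside, the per-line cleanup both answers rely on (strip the quote characters, drop the trailing character) can be sanity-checked with plain collections. A minimal sketch in Java; the sample lines below are made up, since the real airline data isn't shown:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LineCleanupSketch {
    public static void main(String[] args) {
        // Made-up stand-ins for the airline CSV lines in the question
        List<String> rawLines = Arrays.asList(
                "year,carrier,delay;",
                "'2004',\"AA\",10;",
                "'2005',\"DL\",12;");

        String header = rawLines.get(0);

        // One regex matches both quote characters; substring drops the
        // trailing character, like Scala's init
        List<String> cleaned = rawLines.stream()
                .filter(line -> !line.equals(header))
                .map(line -> line.replaceAll("['\"]", ""))
                .map(line -> line.substring(0, line.length() - 1))
                .collect(Collectors.toList());

        System.out.println(cleaned); // [2004,AA,10, 2005,DL,12]
    }
}
```

The stream pipeline mirrors the RDD one step for step, which makes it a cheap way to debug the string handling before involving Spark at all.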
I'm new to Scala and our project mixes Java and Scala code together (using the Play Framework). I'm trying to write a Scala method that can take a nested Java Map such as:
LinkedHashMap<String, LinkedHashMap<String, String>> groupingA = new LinkedHashMap<String, LinkedHashMap<String,String>>();
And have that java object passed to a Scala function that can loop through it. I have the following scala object definition to try and support the above Java nested map:
Seq[(String, Seq[(String,String)])]
Both the Java file and the Scala file compile fine individually, but when my java object tries to create a new instance of my scala class and pass in the nested map, I get a compiler error with the following details:
[error] ..... overloaded method value apply with alternatives:
[error] (options: java.util.List[String])scala.collection.mutable.Buffer[(String, String)] <and>
[error] (options: scala.collection.immutable.List[String])List[(String, String)] <and>
[error] (options: java.util.Map[String,String])Seq[(String, String)] <and>
[error] (options: scala.collection.immutable.Map[String,String])Seq[(String, String)] <and>
[error] (options: (String, String)*)Seq[(String, String)]
[error] cannot be applied to (java.util.LinkedHashMap[java.lang.String,java.util.LinkedHashMap[java.lang.String,java.lang.String]])
Any ideas here on how I can pass in a nested Java LinkedHashMap such as above into a Scala file where I can generically iterate over a nested collection? I'm trying to write this generic enough that it would also work for a nested Scala collection in case we ever switch to writing our play framework controllers in Scala instead of Java.
Seq is a base trait defined in the Scala collections hierarchy. While Java and Scala offer bytecode compatibility, Scala defines a number of its own types, including its own collection library. The rub here is that if you want to write idiomatic Scala, you need to convert your Java data to Scala data. As I see it, you have a few options:
You can use Richard's solution and convert the Java types to Scala types in your Scala code. I think this is ugly because it assumes your input will always be coming from Java land.
You can write a beautiful, perfect Scala handler and provide a companion object that offers the ugly Java conversion behavior. This disentangles your Scala implementation from the Java details.
Or you could write an implicit def like the one below, genericizing it to your heart's content:
import java.util.LinkedHashMap
import scala.collection.JavaConversions.mapAsScalaMap
object App {
  implicit def wrapLhm[K, V, G](i: LinkedHashMap[K, LinkedHashMap[G, V]]): LHMWrapper[K, V, G] =
    new LHMWrapper[K, V, G](i)

  def main(args: Array[String]) {
    println("Hello World!")
    val lhm = new LinkedHashMap[String, LinkedHashMap[String, String]]()
    val inner = new LinkedHashMap[String, String]()
    inner.put("one", "one")
    lhm.put("outer", inner)
    val s = lhm.getSeq()
    println(s.toString())
  }

  class LHMWrapper[K, V, G](value: LinkedHashMap[K, LinkedHashMap[G, V]]) {
    def getSeq(): Seq[(K, Seq[(G, V)])] =
      mapAsScalaMap(value).mapValues(mapAsScalaMap(_).toSeq).toSeq
  }
}
Try this:
import scala.collection.JavaConversions.mapAsScalaMap
val lhm: LinkedHashMap[String, LinkedHashMap[String, String]] = getLHM()
val scalaMap = mapAsScalaMap(lhm).mapValues(mapAsScalaMap(_).toSeq).toSeq
I tested this and got a result of type Seq[(String, Seq[(String, String)])].
(The conversions will wrap the original Java object, rather than actually creating a Scala object with a copy of the values. So the conversions to Seq aren't necessary, you could leave it as a Map, the iteration order will be the same).
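The target shape those Scala conversions produce, Seq[(String, Seq[(String, String)])], can also be mimicked on the Java side with nothing but entrySet iteration, which is a useful sanity check of what the Scala code will see. A sketch with made-up data (class and variable names are illustrative):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NestedMapFlattenSketch {
    public static void main(String[] args) {
        // The nested map from the question
        LinkedHashMap<String, LinkedHashMap<String, String>> groupingA = new LinkedHashMap<>();
        LinkedHashMap<String, String> inner = new LinkedHashMap<>();
        inner.put("a", "1");
        inner.put("b", "2");
        groupingA.put("group", inner);

        // Flatten to a list of (key, list of (key, value)) pairs, the Java
        // analogue of Seq[(String, Seq[(String, String)])], preserving
        // LinkedHashMap's insertion order
        List<Map.Entry<String, List<Map.Entry<String, String>>>> seq = new ArrayList<>();
        for (Map.Entry<String, LinkedHashMap<String, String>> outer : groupingA.entrySet()) {
            List<Map.Entry<String, String>> innerSeq = new ArrayList<>(outer.getValue().entrySet());
            seq.add(new SimpleEntry<>(outer.getKey(), innerSeq));
        }

        System.out.println(seq); // [group=[a=1, b=2]]
    }
}
```

Since LinkedHashMap iterates in insertion order, the flattened list keeps the same ordering guarantees the wrapped Scala Seq does.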
Let me guess, are you processing query parameters?