How to provide a codec to the SaveAsSequenceFile method in Spark? - java

I am trying to figure out how to pass a codec to the saveAsSequenceFile method in Apache Spark. Below is the code I am trying to run. I am running Scala 2.10.4, Spark 1.0.0, Java 1.7.60, and Apache Hadoop 2.4.0.
val rdd:RDD[(String, String)] = sc.sequenceFile(secPath,
classOf[Text],
classOf[Text]
).map { case (k,v) => (k.toString, v.toString)}
val sortedOutput = rdd.sortByKey(true, 1)
sortedOutput.saveAsSequenceFile(secPathOut)
My issue is that I am new to Spark and Scala. I do not understand what the javadoc means for the codec variable passed to the saveAsSequenceFile method.
def saveAsSequenceFile(path: String, codec: Option[Class[_ <: CompressionCodec]] = None): Unit
What does the <: mean? I get that the codec is optional, because when I run the above code it works. Could someone please show an example of a properly formatted codec call to this method?
Thanks!

The <: indicates that the class you pass in should extend org.apache.hadoop.io.compress.CompressionCodec (read this), spark uses a lot of HDFS features and is pretty heavily integrated with it at this point. This means you can pass the class of any of the following as the codec, BZip2Codec, DefaultCodec, GzipCodec, there are likely also other extensions of CompressionCodec not built into hadoop. Here is an example of calling the method
sc.parallelize(List((1,2))).saveAsSequenceFile("path",Some(classOf[GzipCodec]))
The Option[...] is used in scala in favor of java's null even though null exists in scala. Option can be Some(...) or None

Related

Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist in PySpark

This is the snippet:
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext()
spark = SparkSession(sc)
d = spark.read.format("csv").option("header", True).option("inferSchema", True).load('file.csv')
d.show()
After this runs into the error:
An error occurred while calling o163.showString. Trace:
py4j.Py4JException: Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist
All the other methods work well. Tried researching alot but in vain. Any lead will be highly appreciated
This is an indicator of a Spark version mismatch. Before Spark 2.3 show method took only two arguments:
def show(self, n=20, truncate=True):
since 2.3 it takes three arguments:
def show(self, n=20, truncate=True, vertical=False):
In your case Python client seems to invoke the latter one, while the JVM backend uses the older version.
Since SparkContext initialization undergone significant changes in 2.4, which would cause failure on SparkContext.__init__, you're likely using:
2.3.x Python library.
2.2.x JARs.
You can confirm that by checking versions directly from your session, Python:
sc.version
vs. JVM:
sc._jsc.version()
Problems like this, are usually a result of misconfigured PYTHONPATH (either directly, or by using pip installed PySpark on top per-existing Spark binaries) or SPARK_HOME.
On spark-shell console, enter the variable name and see the data type.
As an alternative, you can tab twice after variable named. and it will show necessary function which could be applied.
Example of a DataFrame object.
res23: org.apache.spark.sql.DataFrame = [order_id: string, book_name: string ... 1 more field]

customer Transformer converting to PMML by SkLearn2PMML-Plugin in java side

I have known the SkLearn2PMML-Plugin project in github(https://github.com/jpmml/sklearn2pmml-plugin/blob/master/README.md). But I have little experience in Java . Can someone help me to write the java plugin of my feature transformer. Below is my feature transformer.
class FeatureSelector(TransformerMixin):
'''A transformer for extracting certain column(s)'''
def __init__(self, cols):
self.cols = cols
def fit(self, X, y=None, **fit_params):
return self
def transform(self, X, **transform_params):
return X[self.cols]
class ModelTransformer(TransformerMixin):
def __init__(self, model):
self.model = model
def fit(self, *args, **kwargs):
self.model.fit(*args, **kwargs)
return self
def transform(self, X, **transform_params):
return pd.DataFrame(self.model.predict(X))
You can achieve FeatureSelector functionality using the sklearn2pmml.preprocessing.ExpressionTransformer transformation:
selector = ExpressionTransformer("X[0]")
The ModelTransformer functionality is a bit more tricky, but certainly doable. Next time, please consider opening a feature request with the SkLearn2PMML project directly (instead of asking SO to write code for you): https://github.com/jpmml/sklearn2pmml/issues/118

Cannot resolve method 'flatMap(<lambdaexpression>)' error

I am new to apache spark and trying to run the wordcount example . But intellij editor gives me the error at line 47 Cannot resolve method 'flatMap()' error.
Edit :
This is the line where I am getting the error
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
It looks like you're using an older version of Spark that expects Iterable rather than Iterator from the flatMap() function. Try this:
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)));
See also Spark 2.0.0 Arrays.asList not working - incompatible types
Stream#flatMap is used for combining multiple streams into one, so the supplier method you provided must return a Stream result.
you can try like this:
lines.stream().flatMap(line -> Stream.of(SPACE.split(line)))
.map(word -> // map to JavaRDD)
flatMap method take a FlatMapFunctionas parameter which is not annotated with #FunctionalInterface. So indeed you can not use it as a lambda.
Just build a real FlatMapFunctionobject as parameter and you will be sure of it.
flatMap() is Java 8 Stream API. I think you should check the IDEA compile java version.
compile java version

Get Java reflection representation of Scala type

This seems like a simple question, but it's very challenging to search for, so I'm asking a new question. My apologies if it's already been asked.
Due to the compiler bug described here Scala 2.11.5 compiler crash with type aliases and manifests (also here https://issues.scala-lang.org/browse/SI-9155), I need to use scala TypeTags and friends for discovery of type parameters to methods. However, I then need to use that type information in a Java library that uses java.lang.Class and java.lang.reflect.Type.
How can I convert a scala.reflect.runtime.universe Type into a java.lang.reflect.Type or java.lang.Class?
Put concretely, how would I fill out the body of this method:
def typeFor[T](implicit tag: TypeTag[T]): java.lang.reflect.Type = ...
or, if that's not possible:
def typeFor[T](implicit tag: TypeTag[T]): java.lang.Class[T] = ...
And note, due to the bug posted above, I cannot use scala.reflect.Manifest.
The short answer is no, but you can try to do something similar to this SO question. However there is an open ticket....
This may have some limitations I'm not aware of, but you could drop down to Java reflection and try something like:
import scala.util.control.Exception._
def typeMe[T](implicit t: TypeTag[T]) = {
catching(classOf[Exception]) opt Class.forName(t.tpe.typeSymbol.asClass.fullName)
}
println(typeMe[String])
println(typeMe[ClassTag[_]])
Results in:
Some(class java.lang.String)
Some(interface scala.reflect.ClassTag)
The way I solved it with manifests, was:
private def typeFromManifest(m: Manifest[_]): Type = {
if (m.typeArguments.isEmpty) { m.runtimeClass }
else new ParameterizedType {
def getRawType = m.runtimeClass
def getActualTypeArguments = m.typeArguments.map(typeFromManifest).toArray
def getOwnerType = null
}
}
Right now I'm trying to solve this using something other than Manifest which should be removed from scala runtime.

Java nested Map to Scala nested sequence

I'm new to Scala and our project mixes Java and Scala code together (using the Play Framework). I'm trying to write a Scala method that can take a nested Java Map such as:
LinkedHashMap<String, LinkedHashMap<String, String>> groupingA = new LinkedHashMap<String, LinkedHashMap<String,String>>();
And have that java object passed to a Scala function that can loop through it. I have the following scala object definition to try and support the above Java nested map:
Seq[(String, Seq[(String,String)])]
Both the Java file and the Scala file compile fine individually, but when my java object tries to create a new instance of my scala class and pass in the nested map, I get a compiler error with the following details:
[error] ..... overloaded method value apply with alternatives:
[error] (options: java.util.List[String])scala.collection.mutable.Buffer[(String, String)] <and>
[error] (options: scala.collection.immutable.List[String])List[(String, String)] <and>
[error] (options: java.util.Map[String,String])Seq[(String, String)] <and>
[error] (options: scala.collection.immutable.Map[String,String])Seq[(String, String)] <and>
[error] (options: (String, String)*)Seq[(String, String)]
[error] cannot be applied to (java.util.LinkedHashMap[java.lang.String,java.util.LinkedHashMap[java.lang.String,java.lang.String]])
Any ideas here on how I can pass in a nested Java LinkedHashMap such as above into a Scala file where I can generically iterate over a nested collection? I'm trying to write this generic enough that it would also work for a nested Scala collection in case we ever switch to writing our play framework controllers in Scala instead of Java.
Seq is a base trait defined in the Scala Collections hierarchy. While java and scala offer byte code compatibility, scala defines a number of its own types including its own collection library. The rub here is if you want to write idiomatic scala you need to convert your java data to scala data. The way I see it you have a few options.
You can use Richard's solution and convert the java types to scala types in your scala code. I think this is ugly because it assumes your input will always be coming from java land.
You can write beautiful, perfect scala handler and provide a companion object that offers the ugly java conversion behavior. This disentangles your scala implementation from the java details.
Or you could write an implicit def like the one below genericizing it to your heart's content.
.
import java.util.LinkedHashMap
import scala.collection.JavaConversions.mapAsScalaMap
object App{
implicit def wrapLhm[K,V,G](i:LinkedHashMap[K,LinkedHashMap[G,V]]):LHMWrapper[K,V,G] = new LHMWrapper[K,V,G](i)
def main(args: Array[String]){
println("Hello World!")
val lhm = new LinkedHashMap[String, LinkedHashMap[String,String]]()
val inner = new LinkedHashMap[String,String]()
inner.put("one", "one")
lhm.put("outer",inner);
val s = lhm.getSeq()
println(s.toString())
}
class LHMWrapper[K,V,G](value: LinkedHashMap[K,LinkedHashMap[G,V]]){
def getSeq():Seq[ (K, Seq[(G,V)])] = mapAsScalaMap(value).mapValues(mapAsScalaMap(_).toSeq).toSeq
}
}
Try this:
import scala.collections.JavaConversions.mapAsScalaMap
val lhm: LinkedHashMap[String, LinkedHashMap[String, String]] = getLHM()
val scalaMap = mapAsScalaMap(lhm).mapValues(mapAsScalaMap(_).toSeq).toSeq
I tested this, and got a result of type Seq[String, Seq[(String, String)]]
(The conversions will wrap the original Java object, rather than actually creating a Scala object with a copy of the values. So the conversions to Seq aren't necessary, you could leave it as a Map, the iteration order will be the same).
Let me guess, are you processing query parameters?

Categories

Resources