Task not Serializable - Spark Java

I'm getting the Task not serializable error in Spark. I've searched and tried to use a static function as suggested in some posts, but it still gives the same error.
The code is as follows:
public class Rating implements Serializable {
    private SparkSession spark;
    private SparkConf sparkConf;
    private JavaSparkContext jsc;
    private static Function<String, Rating> mapFunc;

    public Rating() {
        mapFunc = new Function<String, Rating>() {
            public Rating call(String str) {
                return Rating.parseRating(str);
            }
        };
    }

    public void runProcedure() {
        sparkConf = new SparkConf().setAppName("Filter Example").setMaster("local");
        jsc = new JavaSparkContext(sparkConf);
        SparkSession spark = SparkSession.builder().master("local").appName("Word Count")
                .config("spark.some.config.option", "some-value").getOrCreate();

        JavaRDD<Rating> ratingsRDD = spark.read().textFile("sample_movielens_ratings.txt")
                .javaRDD()
                .map(mapFunc);
    }

    public static void main(String[] args) {
        Rating newRating = new Rating();
        newRating.runProcedure();
    }
}
The error gives:
How do I solve this error?
Thanks in advance.

Clearly, Rating cannot be serialized, because it holds references to Spark structures (e.g. SparkSession, SparkConf, JavaSparkContext) as fields.
The problem here is in
JavaRDD<Rating> ratingsRDD = spark.read().textFile("sample_movielens_ratings.txt")
        .javaRDD()
        .map(mapFunc);
If you look at the definition of mapFunc, you're returning a Rating object.
mapFunc = new Function<String, Rating>() {
    public Rating call(String str) {
        return Rating.parseRating(str);
    }
};
This function is used inside a map (a transformation in Spark terms). Because transformations are executed on the worker nodes rather than on the driver, their code must be serializable. Here the anonymous Function is created inside the Rating constructor, so it carries a reference to the enclosing Rating instance, and it also returns Rating objects; either way Spark has to serialize the Rating class, which is not possible.
Try extracting the fields you need from Rating and placing them in a separate class that does not hold any Spark structures. Then use that class as the return type of your mapFunc function.
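For illustration, a minimal sketch of that idea (the RatingData name, its fields, and the "::" delimiter are assumptions for the example, not taken from your code):

public class RatingData implements java.io.Serializable {
    private final int userId;
    private final int movieId;
    private final double rating;

    public RatingData(int userId, int movieId, double rating) {
        this.userId = userId;
        this.movieId = movieId;
        this.rating = rating;
    }

    // Static parser: calling it via a method reference captures no enclosing instance.
    public static RatingData parse(String line) {
        String[] parts = line.split("::");
        return new RatingData(Integer.parseInt(parts[0]),
                              Integer.parseInt(parts[1]),
                              Double.parseDouble(parts[2]));
    }
}

The map then returns only plain, serializable data and drags no Spark objects along:

JavaRDD<RatingData> ratingsRDD = spark.read().textFile("sample_movielens_ratings.txt")
        .javaRDD()
        .map(RatingData::parse);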

In addition, be sure not to include non-serializable fields such as JavaSparkContext and SparkSession in your class. If you do need them, declare them as transient:
private transient JavaSparkContext sparkCtx;
private transient SparkSession spark;
Good luck.

Related

How to read AvroFile into Tuple Class with Java in Flink

I'm trying to read an Avro file and perform some operations on it. Everything works fine except the aggregation functions; when I use them I get the exception below:
aggregating on field positions is only possible on tuple data types
I then changed my class to extend Tuple4 (as I have 4 fields), but when I try to collect the results I get AvroTypeException: Unknown type: T0.
Here are my data and job classes :
public class Nation {
    public Integer N_NATIONKEY;
    public String N_NAME;
    public Integer N_REGIONKEY;
    public String N_COMMENT;

    public Integer getN_NATIONKEY() {
        return N_NATIONKEY;
    }

    public void setN_NATIONKEY(Integer n_NATIONKEY) {
        N_NATIONKEY = n_NATIONKEY;
    }

    public String getN_NAME() {
        return N_NAME;
    }

    public void setN_NAME(String n_NAME) {
        N_NAME = n_NAME;
    }

    public Integer getN_REGIONKEY() {
        return N_REGIONKEY;
    }

    public void setN_REGIONKEY(Integer n_REGIONKEY) {
        N_REGIONKEY = n_REGIONKEY;
    }

    public String getN_COMMENT() {
        return N_COMMENT;
    }

    public void setN_COMMENT(String n_COMMENT) {
        N_COMMENT = n_COMMENT;
    }

    public Nation() {
    }

    public static void main(String[] args) throws Exception {
        Configuration parameters = new Configuration();
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        Path path2 = new Path("/Users/violet/Desktop/nation.avro");
        AvroInputFormat<Nation> format = new AvroInputFormat<Nation>(path2, Nation.class);
        format.configure(parameters);
        DataSet<Nation> nation = env.createInput(format);
        nation.aggregate(Aggregations.SUM, 0);
        JobExecutionResult res = env.execute();
    }
}
and here's the tuple class and the same code for the job as above:
public class NationTuple extends Tuple4<Integer, String, Integer, String> {
    Integer N_NATIONKEY() { return this.f0; }
    String N_NAME() { return this.f1; }
    Integer N_REGIONKEY() { return this.f2; }
    String N_COMMENT() { return this.f3; }
}
I tried with this class and got the TypeException (I used NationTuple everywhere instead of Nation).
I don't think having your class extend Tuple4 is the right way to go. Instead, you should add a MapFunction to your topology that converts your Nation to a Tuple4.
static Tuple4<Integer, String, Integer, String> toTuple(Nation nation) {
    return Tuple4.of(nation.N_NATIONKEY, ...);
}
And then in your topology call:
inputData.map(p -> toTuple(p)).returns(new TypeHint<Tuple4<Integer, String, Integer, String>>(){});
The only subtle part is that you need to provide a type hint so Flink can figure out what kind of tuple your function returns.
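For completeness, a minimal end-to-end sketch of this first approach (using the Nation getters from your class; imports omitted):

DataSet<Nation> nation = env.createInput(format);

// Convert the POJO into a tuple so that position-based aggregations are allowed.
DataSet<Tuple4<Integer, String, Integer, String>> tuples = nation
        .map(n -> Tuple4.of(n.getN_NATIONKEY(), n.getN_NAME(), n.getN_REGIONKEY(), n.getN_COMMENT()))
        .returns(new TypeHint<Tuple4<Integer, String, Integer, String>>() {});

// Field position 0 is N_NATIONKEY in the tuple.
tuples.aggregate(Aggregations.SUM, 0).print();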
Another solution is to use field names instead of tuple field indices when doing your aggregation. For example:
groupBy("N_NATIONKEY", "N_REGIONKEY")
This is all explained here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/api_concepts.html#specifying-keys

Java Generics: can't call a function with said generics even though the type matches

I have this code where I have defined two classes using generics:
1. Section, which can hold data of a generic type.
2. Config, which uses a kind of builder pattern and stores a list of such sections.
Running this code gives a compilation error, and I cannot understand why, since I have specified the type.
Error: incompatible types: java.util.List<Section<String>> cannot be converted to java.util.List<Section<Object>>
public class Main {
    public static void main(String[] args) {
        Section<String> section = new Section<>("wow");
        List<Section<String>> sections = new ArrayList<>();
        sections.add(section);
        Config<String> config = new Config<>().setSections(sections);
    }

    public static class Section<T> {
        private T data;

        public Section(T data) {
            this.data = data;
        }

        public T getData() {
            return data;
        }
    }

    public static class Config<T> {
        private List<Section<T>> sections;

        public Config() {
        }

        public Config<T> setSections(List<Section<T>> sections) {
            this.sections = sections;
            return this;
        }
    }
}
The problem is in the last line of your main method: you create the new Config and call setSections on it in the same expression.
There are two solutions:
Explicit type:
Config<String> config = new Config<String>().setSections(sections);
Split operations:
Config<String> config = new Config<>();
config.setSections(sections);
It's a quirk of type inference: the diamond in new Config<>() gets no target type from the chained call, so it is inferred as Config<Object>. You'll have to write
Config<String> config = new Config<String>().setSections(sections);
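A small sketch of what the compiler actually infers in each case (the variable names here are only for illustration):

List<Section<String>> sections = new ArrayList<>();

// In the chained form the diamond gets no target type, so it falls back to Config<Object>,
// and setSections then expects a List<Section<Object>> rather than a List<Section<String>>.
Config<Object> whatTheChainSees = new Config<>();

// Giving the type argument explicitly (or splitting the statements) restores the match.
Config<String> fixed = new Config<String>().setSections(sections);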

Why does the static keyword "fix" the issue of Task not serializable?

I ran into the "Task not serializable" issue when running Spark Streaming. The reason can be found in this thread.
After trying a couple of methods I fixed the issue, but I don't understand why the fix works.
public class StreamingNotWorking implements Serializable {
    private SparkConf sparkConf;
    private JavaStreamingContext jssc;

    public StreamingNotWorking(parameter) {
        sparkConf = new SparkConf();
        this.jssc = createContext(parameter);
        JavaDStream<String> messages = functionCreateStream(parameter);
        messages.print();
    }

    public void run() {
        this.jssc.start();
        this.jssc.awaitTermination();
    }
}

public class streamingNotWorkingDriver {
    public static void main(String[] args) {
        Streaming bieventsStreaming = new StreamingNotWorking(parameter);
        bieventsStreaming.run();
    }
}
This gives the same "Task not serializable" error.
However, if I modify the code to:
public class StreamingWorking implements Serializable {
    private static SparkConf sparkConf;
    private static JavaStreamingContext jssc;

    public void createStream(parameter) {
        sparkConf = new SparkConf();
        this.jssc = createContext(parameter);
        JavaDStream<String> messages = functionCreateStream(parameter);
        messages.print();
        run();
    }

    public void run() {
        this.jssc.start();
        this.jssc.awaitTermination();
    }
}

public class streamingWorkingDriver {
    public static void main(String[] args) {
        Streaming bieventsStreaming = new StreamingWorking();
        bieventsStreaming.createStream(parameter);
    }
}
works perfectly fine.
I know one of the reasons is that sparkConf and jssc need to be static. But I don't understand why.
Could anyone explain the difference?
Neither JavaStreamingContext nor SparkConf implements Serializable.
You can't serialize instances of classes that don't implement this interface.
Static members won't be serialized, which is why marking sparkConf and jssc static keeps them out of the serialized object.
More information can be found here:
http://docs.oracle.com/javase/7/docs/api/java/io/Serializable.html
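You can see the same behaviour with plain JDK serialization; in this sketch the class and field names are made up for the demo:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class StaticFieldDemo {

    static class Holder implements Serializable {
        // Belongs to the class, not the instance, so it is never written to the stream.
        private static Object staticField = new Object();
        // Explicitly excluded from serialization.
        private transient Object transientField = new Object();
        // The only field that is actually serialized.
        private String payload = "data";
    }

    public static void main(String[] args) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            // Succeeds even though Object is not Serializable:
            // static and transient fields are simply skipped.
            out.writeObject(new Holder());
        }
    }
}

Marking sparkConf and jssc static (or transient) works the same way: they are skipped when Spark serializes the enclosing object, so the non-serializable types never reach the serializer.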

Print from Apache Storm Bolt

I'm working my way through the example code of some Storm topologies and bolts, but I'm running into something weird. My goal is to set up Kafka with Storm, so that Storm can process the messages available on the Kafka bus. I have the following bolt defined:
public class ReportBolt extends BaseRichBolt {
    private static final long serialVersionUID = 6102304822420418016L;

    private Map<String, Long> counts;
    private OutputCollector collector;

    @Override
    @SuppressWarnings("rawtypes")
    public void prepare(Map stormConf, TopologyContext context, OutputCollector outCollector) {
        collector = outCollector;
        counts = new HashMap<String, Long>();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt = does not emit anything
    }

    @Override
    public void execute(Tuple tuple) {
        System.out.println("HELLO " + tuple);
    }

    @Override
    public void cleanup() {
        System.out.println("HELLO FINAL");
    }
}
In essence, it should just output each Kafka message; and when the cleanup function is called, a different message should appear.
I have looked at the worker logs, and I find the final message (i.e. "HELLO FINAL"), but the Kafka messages with "HELLO" are nowhere to be found. As far as I can tell this should be a simple printer bolt, but I can't see where I'm going wrong. The worker logs indicate I am connected to the Kafka bus (it fetches the offset etc.).
In short, why are my println's not showing up in the worker logs?
EDIT
public class AckedTopology {
    private static final String SPOUT_ID = "monitoring_test_spout";
    private static final String REPORT_BOLT_ID = "acking-report-bolt";
    private static final String TOPOLOGY_NAME = "monitoring-topology";

    public static void main(String[] args) throws Exception {
        int numSpoutExecutors = 1;
        KafkaSpout kspout = buildKafkaSpout();
        ReportBolt reportBolt = new ReportBolt();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout(SPOUT_ID, kspout, numSpoutExecutors);
        builder.setBolt(REPORT_BOLT_ID, reportBolt);

        Config cfg = new Config();
        StormSubmitter.submitTopology(TOPOLOGY_NAME, cfg, builder.createTopology());
    }

    private static KafkaSpout buildKafkaSpout() {
        String zkHostPort = "URL";
        String topic = "TOPIC";
        String zkRoot = "/brokers";
        String zkSpoutId = "monitoring_test_spout_id";
        ZkHosts zkHosts = new ZkHosts(zkHostPort);
        SpoutConfig spoutCfg = new SpoutConfig(zkHosts, topic, zkRoot, zkSpoutId);
        KafkaSpout kafkaSpout = new KafkaSpout(spoutCfg);
        return kafkaSpout;
    }
}
Your bolt is not chained to the spout. You need to use Storm's stream groupings to do that. Use something like this:
builder.setBolt(REPORT_BOLT_ID, reportBolt).shuffleGrouping(SPOUT_ID);
setBolt returns an InputDeclarer object. By specifying shuffleGrouping(SPOUT_ID) you are telling Storm that this bolt wants to consume all the tuples emitted by the component with id SPOUT_ID.
Read more on stream groupings and choose one based on your needs.
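Applied to the main method in your topology class, the wiring would look roughly like this:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(SPOUT_ID, kspout, numSpoutExecutors);
// Subscribe the bolt to the spout's default stream; without a grouping the bolt never receives tuples.
builder.setBolt(REPORT_BOLT_ID, reportBolt).shuffleGrouping(SPOUT_ID);
StormSubmitter.submitTopology(TOPOLOGY_NAME, cfg, builder.createTopology());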

Can I use Google Reflections in a static section of code to find subclasses?

I am trying to set up a system that lets me subclass a class that gets exported to a text file, without having to modify the initial class. To do this, I am building a list of callbacks that can tell whether they handle a particular entry, and then using the matching callback to create an instance of that class from the file. The problem is that I get the error
java.lang.NoClassDefFoundError: com/google/common/base/Predicate
when I try to run anything involving this class. What am I doing wrong?
public abstract class Visibility {
    private static final List<VisibilityCreationCallback> creationCallbacks;

    static {
        creationCallbacks = new ArrayList<VisibilityCreationCallback>();
        Reflections reflections = new Reflections(new ConfigurationBuilder()
                .setUrls(ClasspathHelper.forPackage("com.protocase.viewer.utils.visibility"))
                .setScanners(new ResourcesScanner()));
        // ... cut ...
    }

    public static Visibility importFromFile(Map<String, Object> importMap) {
        for (VisibilityCreationCallback callback : creationCallbacks) {
            if (callback.handles(importMap)) {
                return callback.create(importMap);
            }
        }
        return null;
    }
}

public class CategoryVisibility extends Visibility {
    public static VisibilityCreationCallback makeVisibilityCreationCallback() {
        return new VisibilityCreationCallback() {
            @Override
            public boolean handles(Map<String, Object> importMap) {
                return importMap.containsKey(classTag);
            }

            @Override
            public CategoryVisibility create(Map<String, Object> importMap) {
                return importPD(importMap);
            }
        };
    }
}

/**
 * Test of matches method, of class CategoryVisibility.
 */
@Test
public void testMatches1() {
    Visibility other = new UnrestrictedVisibility();
    CategoryVisibility instance = new CategoryVisibility("A Cat");
    boolean expResult = true;
    boolean result = instance.matches(other);
    assertEquals(expResult, result);
}
You're just missing the Guava library from your classpath; Reflections requires it. That's the short answer.
The better solution is to use a proper build tool (Maven, Gradle, ...) and let it resolve your transitive dependencies without the hassle.
