JavaInputDStream is not working - java

I am getting the following error:
Exception in thread "main" java.lang.AbstractMethodError
at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:99)
at org.apache.spark.streaming.kafka010.KafkaUtils$.initializeLogIfNecessary(KafkaUtils.scala:40)
at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
at org.apache.spark.streaming.kafka010.KafkaUtils$.log(KafkaUtils.scala:40)
at org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66)
at org.apache.spark.streaming.kafka010.KafkaUtils$.logWarning(KafkaUtils.scala:40)
at org.apache.spark.streaming.kafka010.KafkaUtils$.fixKafkaParams(KafkaUtils.scala:157)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.<init>(DirectKafkaInputDStream.scala:65)
at org.apache.spark.streaming.kafka010.KafkaUtils$.createDirectStream(KafkaUtils.scala:126)
at org.apache.spark.streaming.kafka010.KafkaUtils$.createDirectStream(KafkaUtils.scala:149)
at org.apache.spark.streaming.kafka010.KafkaUtils.createDirectStream(KafkaUtils.scala)
at com.spark.kafka.JavaDirectKafkaWordCount.main(JavaDirectKafkaWordCount.java:50)
18/05/29 18:05:43 INFO SparkContext: Invoking stop() from shutdown hook
18/05/29 18:05:43 INFO SparkUI: Stopped Spark web UI at
I am writing a simple Kafka/Spark Streaming application in Eclipse to consume messages from a Kafka broker using Spark Streaming. Below is the code.
Everything works fine until the JavaInputDStream is created; after that I get the error above.
Every import resolves fine. Can someone help with this?
Code snippet:
package com.spark.kafka;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import scala.Tuple2;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.Durations;

public final class JavaDirectKafkaWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        String brokers = "localhost:9092";
        String topics = "sparktestone";
        SparkConf sparkConf = new SparkConf().setAppName("JavaDirectKafkaWordCount").setMaster("local[*]");
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
        Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", brokers);
        JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(jssc,
                LocationStrategies.PreferConsistent(), ConsumerStrategies.Subscribe(topicsSet, kafkaParams));
        JavaDStream<String> lines = messages.map(ConsumerRecord::value);
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
        JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
                .reduceByKey((i1, i2) -> i1 + i2);
        wordCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
In the POM I have added the respective dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.10</artifactId>
    <version>2.0.0</version>
</dependency>

I just had the same issue; in Spark 2.3 that method is abstract.
Conclusion: use Spark 2.2.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>
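Note also that the POM in the question mixes Scala builds and Spark versions: spark-streaming_2.11 at 2.3.0 together with spark-streaming-kafka-0-10_2.10 at 2.0.0. AbstractMethodError is the classic symptom of binary-incompatible Spark artifacts on the classpath. A consistent set following the "use Spark 2.2" suggestion might look like this (a sketch; the exact versions here are an assumption):
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>
<!-- Same Scala build (2.11) and same Spark version as above -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>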

Related

hadoop distcp via java resulting in NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

I am trying to run a distcp command on my Hadoop cluster using the Hadoop Java library to move content from HDFS to a Google Cloud bucket. I am getting the error NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.
Below is my Java code:
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class HadoopHelper {
    private static Logger logger = LoggerFactory.getLogger(HadoopHelper.class);
    private static final String FS_DEFAULT_FS = "fs.defaultFS";
    private final Configuration conf;

    public HadoopHelper(String hadoopUrl) {
        conf = new Configuration();
        conf.set(FS_DEFAULT_FS, "hdfs://" + hadoopUrl);
    }

    public void distCP(JsonArray files, String target) {
        try {
            List<Path> srcPaths = new ArrayList<>();
            for (JsonElement file : files) {
                String srcPath = file.getAsString();
                srcPaths.add(new Path(srcPath));
            }
            DistCpOptions options = new DistCpOptions.Builder(
                    srcPaths,
                    new Path("gs://" + target)
            ).build();
            logger.info("Using distcp to copy {} to gs://{}", files, target);
            this.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            this.conf.set("fs.gs.auth.service.account.email", "my-svc-account@my-gcp-project.iam.gserviceaccount.com");
            this.conf.set("fs.gs.auth.service.account.keyfile", "config/my-svc-account-keyfile.p12");
            this.conf.set("fs.gs.project.id", "my-gcp-project");
            DistCp distCp = new DistCp(this.conf, options);
            Job job = distCp.execute();
            job.waitForCompletion(true);
            logger.info("Distcp operation success. Exiting");
        } catch (Exception e) {
            logger.error("Error while trying to execute distcp", e);
            logger.error("Distcp operation failed. Exiting");
            throw new IllegalArgumentException("Distcp failed");
        }
    }

    public void createDirectory() throws IOException {
        FileSystem fileSystem = FileSystem.get(this.conf);
        fileSystem.mkdirs(new Path("/user/newfolder"));
        logger.info("Done");
    }
}
I have added the below dependencies in the pom.xml:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-distcp</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoop3-2.2.4</version>
</dependency>
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>util</artifactId>
    <version>2.2.4</version>
</dependency>
If I run the distcp command on the cluster itself like so: hadoop distcp /user gs://my_bucket_name/
The distcp operation works and the content is copied onto the Cloud Bucket.
Did you add the jar to Hadoop's classpath?
Add the connector jar to Hadoop's classpath
Placing the connector jar in the HADOOP_COMMON_LIB_JARS_DIR directory should be sufficient to have Hadoop load the jar. Alternatively, to be certain that the jar is loaded, you can add HADOOP_CLASSPATH=$HADOOP_CLASSPATH:</path/to/gcs-connector.jar> to hadoop-env.sh in the Hadoop configuration directory.
This needs to be done on the DistCp conf (this.conf in your code):
this.conf.set("HADOOP_CLASSPATH", "$HADOOP_CLASSPATH:/tmp/gcs-connector-latest-hadoop2.jar");
before this line of code:
DistCp distCp = new DistCp(this.conf, options);
If it helps, there is a troubleshooting section for this.
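Since this code runs as a standalone Java application rather than through the hadoop CLI, it can also help to pin down what the error actually means. A minimal check (my own sketch, not part of the answer above):
// Load the connector class without initializing it. If this throws, the
// gcs-connector jar is missing from the application's runtime classpath.
// If it succeeds but DistCp still fails with "Could not initialize class",
// the jar is present and the failure happens in the class's static
// initializer (often a conflicting transitive dependency).
ClassLoader loader = Thread.currentThread().getContextClassLoader();
try {
    Class.forName("com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem", false, loader);
} catch (ClassNotFoundException e) {
    throw new IllegalStateException("gcs-connector jar is not on the application classpath", e);
}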

How do @Pollers work in Spring Integration?

I am building a Spring Integration implementation with two PollableChannels:
Regular channel
Error channel
Messages are polled from the regular channel and processed. If there is an error during processing (e.g., an external service is unavailable), the message is sent into the error channel. From the error channel it is re-queued onto the regular channel, and the cycle continues until the message is successfully processed.
The idea is to poll the error channel infrequently, to give the processor some time to (hopefully) recover.
I have simulated this workflow in the following test:
package com.stackoverflow.questions.sipoller;

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.junit.jupiter.api.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Import;
import org.springframework.integration.annotation.MessageEndpoint;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.annotation.Router;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.dsl.MessageChannels;
import org.springframework.messaging.Message;
import org.springframework.messaging.PollableChannel;
import org.springframework.messaging.support.MessageBuilder;
import static org.awaitility.Awaitility.await;
import static org.awaitility.Durations.FIVE_MINUTES;
import static org.awaitility.Durations.ONE_HUNDRED_MILLISECONDS;

@SpringBootTest
class SiPollerApplicationTests {
    private final static Logger LOG = LoggerFactory.getLogger(SiPollerApplicationTests.class);
    private final static String QUEUE_CHANNEL_REGULAR = "queueChannelRegular";
    private final static String QUEUE_CHANNEL_ERROR = "queueChannelError";
    private final static String POLLER_PERIOD_REGULAR = "500"; // 0.5 second
    private final static String POLLER_PERIOD_ERROR = "3000"; // 3 seconds
    private final static AtomicInteger NUMBER_OF_ATTEMPTS = new AtomicInteger();
    private final static AtomicInteger NUMBER_OF_SUCCESSES = new AtomicInteger();
    private final static List<Instant> ATTEMPT_INSTANTS = Collections.synchronizedList(new ArrayList<>());

    @Autowired
    @Qualifier(QUEUE_CHANNEL_REGULAR)
    private PollableChannel channelRegular;

    @Test
    void testTimingOfMessageProcessing() {
        channelRegular.send(MessageBuilder.withPayload("Test message").build());
        await()
                .atMost(FIVE_MINUTES)
                .with()
                .pollInterval(ONE_HUNDRED_MILLISECONDS)
                .until(
                        () -> {
                            if (NUMBER_OF_SUCCESSES.intValue() == 1) {
                                reportGaps();
                                return true;
                            }
                            return false;
                        }
                );
    }

    private void reportGaps() {
        List<Long> gaps = IntStream
                .range(1, ATTEMPT_INSTANTS.size())
                .mapToObj(
                        i -> Duration
                                .between(
                                        ATTEMPT_INSTANTS.get(i - 1),
                                        ATTEMPT_INSTANTS.get(i)
                                )
                                .toMillis()
                )
                .collect(Collectors.toList());
        LOG.info("Gaps between attempts (in ms): {}", gaps);
    }

    @Configuration
    @EnableIntegration
    @Import(SiPollerApplicationTestEndpoint.class)
    static class SiPollerApplicationTestConfig {
        @Bean(name = QUEUE_CHANNEL_REGULAR)
        public PollableChannel queueChannelRegular() {
            return MessageChannels.queue(QUEUE_CHANNEL_REGULAR).get();
        }

        @Bean(name = QUEUE_CHANNEL_ERROR)
        public PollableChannel queueChannelError() {
            return MessageChannels.queue(QUEUE_CHANNEL_ERROR).get();
        }

        @Router(
                inputChannel = QUEUE_CHANNEL_ERROR,
                poller = @Poller(fixedRate = POLLER_PERIOD_ERROR)
        )
        public String retryProcessing() {
            return QUEUE_CHANNEL_REGULAR;
        }
    }

    @MessageEndpoint
    static class SiPollerApplicationTestEndpoint {
        @Autowired
        @Qualifier(QUEUE_CHANNEL_ERROR)
        private PollableChannel channelError;

        @ServiceActivator(
                inputChannel = QUEUE_CHANNEL_REGULAR,
                poller = @Poller(fixedRate = POLLER_PERIOD_REGULAR)
        )
        public void handleMessage(Message<String> message) {
            // Count and time attempts
            int numberOfAttempts = NUMBER_OF_ATTEMPTS.getAndIncrement();
            ATTEMPT_INSTANTS.add(Instant.now());
            // First few times - refuse to process message and bounce it into
            // error channel
            if (numberOfAttempts < 5) {
                channelError.send(message);
                return;
            }
            // After that - process message
            NUMBER_OF_SUCCESSES.getAndIncrement();
        }
    }
}
The pom.xml dependencies are:
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.integration</groupId>
        <artifactId>spring-integration-core</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
        <exclusions>
            <exclusion>
                <groupId>org.junit.vintage</groupId>
                <artifactId>junit-vintage-engine</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.awaitility</groupId>
        <artifactId>awaitility</artifactId>
        <version>4.0.2</version>
        <scope>test</scope>
    </dependency>
</dependencies>
Note the configuration for Pollers:
private final static String POLLER_PERIOD_REGULAR = "500"; // 0.5 second
private final static String POLLER_PERIOD_ERROR = "3000"; // 3 seconds
The regular channel is supposed to be polled once in half a second, and the error channel — once in three seconds.
The test simulates outages during message processing: the first five attempts to process the message are rejected. Also, the test records the Instant of every processing attempt. In the end, on my machine, the test outputs:
Gaps between attempts (in ms): [1, 0, 0, 0, 0]
In other words, the message is re-tried almost immediately after each failure.
It seems to me that I fundamentally misunderstand how Pollers work in Spring Integration. So my questions are:
Why is there such a dissonance between the poller configuration and the actual frequency of polling?
Does Spring Integration provide a way to implement the pattern I have described?
There are two settings that can affect this behavior.
QueueChannel pollers drain the queue by default; use setMaxMessagesPerPoll(1) to receive only one message on each poll.
Also, by default, the QueueChannel receive timeout is 1 second (1000 ms), so the first poll may complete sooner than you expect; set it to 0 to exit immediately when there are no messages in the queue.
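A minimal sketch of how both settings could be applied with the Java DSL (this bean is my illustration, not part of the original answer; the 500 ms rate mirrors POLLER_PERIOD_REGULAR from the question):
// Goes in the @Configuration class; uses org.springframework.integration.dsl.Pollers
// and org.springframework.integration.scheduling.PollerMetadata.
@Bean(name = PollerMetadata.DEFAULT_POLLER)
public PollerMetadata defaultPoller() {
    return Pollers.fixedRate(500)
            .maxMessagesPerPoll(1) // receive one message per poll instead of draining the queue
            .receiveTimeout(0)     // return immediately when the queue is empty
            .get();
}
The same maxMessagesPerPoll setting is also available on the annotation itself, e.g. @Poller(fixedRate = POLLER_PERIOD_REGULAR, maxMessagesPerPoll = "1").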

Kafka Spark Streaming Consumer will not receive any messages from Kafka Console Producer?

I'm trying to integrate Spark and Kafka to consume messages from Kafka. I have producer code as well to send messages on the "temp" topic, and I'm also using Kafka's console producer to produce messages on the "temp" topic.
I wrote the code below to consume messages from the same "temp" topic, but it does not receive a single message.
Program:
import java.util.Arrays;
import java.util.Map;
import java.util.HashMap;
import static org.apache.commons.lang3.StringUtils.SPACE;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;
import org.apache.log4j.Logger;
import org.apache.spark.api.java.JavaSparkContext;
import scala.collection.immutable.ListSet;
import scala.collection.immutable.Set;

public class ConsumerDemo {
    public void main() {
        String zkGroup = "localhost:2181";
        String group = "test";
        String[] topics = {"temp"};
        int numThreads = 1;
        SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount").setMaster("local[4]")
                .set("spark.ui.port", "7077").set("spark.executor.memory", "1g");
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
        Map<String, Integer> topicMap = new HashMap<>();
        for (String topic : topics) {
            topicMap.put(topic, numThreads);
        }
        System.out.println("topics : " + Arrays.toString(topics));
        JavaPairReceiverInputDStream<String, String> messages
                = KafkaUtils.createStream(jssc, zkGroup, group, topicMap);
        messages.print();
        JavaDStream<String> lines = messages.map(Tuple2::_2);
        //lines.print();
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
        JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
                .reduceByKey((i1, i2) -> i1 + i2);
        //wordCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }

    public static void main(String[] args) {
        System.out.println("Started...");
        new ConsumerDemo().main();
        System.out.println("Ended...");
    }
}
I added the following dependencies in the pom.xml file:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.9.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.11.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>0.9.0-incubating</version>
    <type>jar</type>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.3</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.6.3</version>
    <type>jar</type>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
<dependency>
    <groupId>org.anarres.lzo</groupId>
    <artifactId>lzo-core</artifactId>
    <version>1.0.5</version>
    <type>jar</type>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.8.2</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.module</groupId>
    <artifactId>jackson-module-scala_2.10</artifactId>
    <version>2.8.2</version>
</dependency>
<dependency>
    <groupId>com.msiops.footing</groupId>
    <artifactId>footing-tuple</artifactId>
    <version>0.2</version>
</dependency>
Am I missing some dependency, or is the issue in the code? Why does this code not receive any messages?
You are not calling the method where you have the code to connect and consume messages from Kafka. Either write that logic in public static void main(), or call the method where you have written it.
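For illustration only (my sketch, not from the answer): giving the instance method a name other than main makes the intent explicit and avoids any confusion with the real entry point:
public static void main(String[] args) {
    System.out.println("Started...");
    // run() would hold the consuming logic currently in the instance main()
    new ConsumerDemo().run();
    System.out.println("Ended...");
}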
When using a Kafka consumer, especially when testing and debugging in a development environment, the producer may not be pushing messages to Kafka continuously.
In this scenario we need to take care of the Kafka consumer parameter auto.offset.reset, which determines whether to read only new messages written to the topic after the consumer starts running, or to read from the beginning of the topic.
Here is the official explanation given in the Kafka documentation:
auto.offset.reset
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server
(e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
A sample code snippet showing how to create a Kafka DStream using kafkaParams:
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("zookeeper.connect", "localhost:2181");
// While you are testing the code in a development system, change this group id
// each time you run the consumer
kafkaParams.put("group.id", "test02");
kafkaParams.put("auto.offset.reset", "earliest");
kafkaParams.put("metadata.broker.list", "localhost:9092");
kafkaParams.put("bootstrap.servers", "localhost:9092");

Map<String, Integer> topics = new HashMap<String, Integer>();
topics.put("temp", 1);

StorageLevel storageLevel = StorageLevel.MEMORY_AND_DISK_SER();

JavaPairDStream<String, String> messages =
        KafkaUtils.createStream(jssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                kafkaParams,
                topics,
                storageLevel);
messages.print();

Cannot resolve symbol JavaSparkSessionSingleton

I am new to Spark Streaming. What I am trying to achieve is to read JSON string data from Kafka, store it in a DStream, and convert it to a Dataset so it can be loaded into Elasticsearch. I am using part of the code from this post.
This is the actual code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.streaming.Duration;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.spark.api.java.function.Function;
import kafka.serializer.StringDecoder;
import scala.Tuple2;

public class SparkConsumer {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("readKafkajson").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));

        // TODO: processing pipeline
        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", "localhost:9092");
        Set<String> topics = Collections.singleton("kafkajson");

        JavaPairInputDStream<String, String> directKafkaStream =
                KafkaUtils.createDirectStream(ssc, String.class, String.class, StringDecoder.class,
                        StringDecoder.class, kafkaParams, topics);

        JavaDStream<String> json = directKafkaStream.map(new Function<Tuple2<String, String>, String>() {
            public String call(Tuple2<String, String> message) throws Exception {
                System.out.println(message._2());
                return message._2();
            }
        });
        System.out.println(" json is 0------ 0" + json);

        json.foreachRDD(rdd -> {
            rdd.foreach(
                    record -> System.out.println(record));
        });

        // Create JavaRDD<Row>
        json.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> rdd) {
                JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
                    @Override
                    public Row call(String msg) {
                        Row row = RowFactory.create(msg);
                        return row;
                    }
                });
                // Create Schema
                StructType schema = DataTypes.createStructType(
                        new StructField[]{DataTypes.createStructField("Message", DataTypes.StringType, true)});
                // Get Spark 2.0 session
                SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
                Dataset<Row> msgDataFrame = spark.createDataFrame(rowRDD, schema);
                msgDataFrame.show();
            }
        });

        ssc.start();
        ssc.awaitTermination();
    }
}
I am getting an error saying cannot resolve symbol JavaSparkSessionSingleton.
I am using Spark 2.0.1 and my Maven dependencies look like this:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.11</artifactId>
    <version>1.6.3</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.1</version>
</dependency>
I am not sure what I am missing. Any help is appreciated.
The official Spark doc leads you to create a singleton class to hold your session; add this at the bottom of your class file:
class JavaSparkSessionSingleton {
    private static transient SparkSession instance = null;

    public static SparkSession getInstance(SparkConf sparkConf) {
        if (instance == null) {
            instance = SparkSession
                    .builder()
                    .config(sparkConf)
                    .getOrCreate();
        }
        return instance;
    }
}
This sample is from the Spark docs; the complete example is here: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaSqlNetworkWordCount.java
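One design note (my addition, not from the answer): the lazy initialization above is unsynchronized, matching the official example. getInstance() is invoked from foreachRDD, which executes on the driver, and static fields are not captured by closure serialization, so the transient static field is safe in that usage.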

Kafka Storm integration with java. kafka.api.OffsetRequest.DefaultClientId()Ljava/lang/String; Error

I am new to Kafka and Storm. I was trying to implement a Java example which integrates Kafka and Storm. I found an example online. I am trying to run the Java program in the Eclipse IDE. I am not using Maven.
I have storm-kafka-0.10.0.jar, kafka-0.6.jar, scala-library-2.10.3.jar and storm-core-0.10.0.jar as external jars.
Here is my Java code.
KafkaStormSample.java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import java.util.UUID;
import backtype.storm.spout.SchemeAsMultiScheme;
import storm.kafka.ZkHosts;
import storm.kafka.BrokerHosts;
import storm.kafka.SpoutConfig;
import storm.kafka.KafkaSpout;
import storm.kafka.StringScheme;

public class KafkaStormSample {
    public static void main(String[] args) throws Exception {
        Config config = new Config();
        config.setDebug(true);
        config.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);

        String zkConnString = "localhost:2181";
        String topic = "my-first-topic";
        BrokerHosts hosts = new ZkHosts(zkConnString);

        SpoutConfig kafkaSpoutConfig = new SpoutConfig(hosts, topic, "/" + topic,
                UUID.randomUUID().toString());
        kafkaSpoutConfig.bufferSizeBytes = 1024 * 1024 * 4;
        kafkaSpoutConfig.fetchSizeBytes = 1024 * 1024 * 4;
        //kafkaSpoutConfig.forceFromStart = true;
        kafkaSpoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(kafkaSpoutConfig));
        //builder.setBolt("word-spitter", new SplitBolt()).shuffleGrouping("kafka-spout");
        builder.setBolt("word-counter", new CountBolt()).shuffleGrouping("word-spitter");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("KafkaStormSample", config, builder.createTopology());
        Thread.sleep(10000);
        cluster.shutdown();
    }
}
CountBolt.java
import java.util.Map;
import java.util.HashMap;
import backtype.storm.tuple.Tuple;
import backtype.storm.task.OutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.IRichBolt;
import backtype.storm.task.TopologyContext;

public class CountBolt implements IRichBolt {
    Map<String, Integer> counters;
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context,
                        OutputCollector collector) {
        this.counters = new HashMap<String, Integer>();
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String str = input.getString(0);
        if (!counters.containsKey(str)) {
            counters.put(str, 1);
        } else {
            Integer c = counters.get(str) + 1;
            counters.put(str, c);
        }
        collector.ack(input);
    }

    @Override
    public void cleanup() {
        for (Map.Entry<String, Integer> entry : counters.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
When I try to run KafkaStormSample.java I keep getting the error below.
Exception in thread "main" java.lang.NoSuchMethodError: kafka.api.OffsetRequest.DefaultClientId()Ljava/lang/String;
at storm.kafka.KafkaConfig.<init>(KafkaConfig.java:43)
at storm.kafka.SpoutConfig.<init>(SpoutConfig.java:40)
at KafkaStormSample.main(KafkaStormSample.java:23)
I made sure I have all the required jars, but I still think I am missing one.
Any help would be appreciated.
Thanks!
I don't know much about those systems, but it looks to me like a library version mismatch.
One of the libraries (Storm in this case) was compiled against a different version of Kafka, where that method is defined. Check your dependencies.
This is one of the reasons dependency management systems are helpful.
Update:
From their documentation, they provide this setup for Maven:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.8.1.1</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.zookeeper</groupId>
            <artifactId>zookeeper</artifactId>
        </exclusion>
        <exclusion>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
        </exclusion>
    </exclusions>
</dependency>
It seems your Kafka version is too old.
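If you stay off Maven, the same conclusion applies to the jar list from the question (my reading, not stated in the answer): kafka-0.6.jar is far older than the kafka_2.10 0.8.1.1 coordinates quoted above, which would explain the NoSuchMethodError on kafka.api.OffsetRequest.DefaultClientId().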
