I am new to Spark and don't have much experience with it yet. I am working on an application in which data flows across several Kafka topics and Spark Streaming reads the data from those topics. It is a Spring Boot project and I have 3 Spark consumer classes in it. The job of these Spark Streaming classes is to consume the data from a Kafka topic and send it to another topic. The code of one Spark Streaming class is below:
@Service
public class EnrichEventSparkConsumer {

    Collection<String> topics = Arrays.asList("eventTopic");

    // Note: dataModelServiceImpl and the policy executor are injected fields whose
    // declarations are omitted here.

    public void startEnrichEventConsumer(JavaStreamingContext javaStreamingContext) {
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "group1");
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", true);

        JavaInputDStream<ConsumerRecord<String, String>> enrichEventRDD =
                KafkaUtils.createDirectStream(javaStreamingContext,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

        JavaDStream<String> enrichEventDStream = enrichEventRDD.map((x) -> x.value());

        JavaDStream<EnrichEventDataModel> enrichDataModelDStream = enrichEventDStream.map(convertIntoEnrichModel);

        enrichDataModelDStream.foreachRDD(rdd1 -> {
            saveDataToElasticSearch(rdd1.collect());
        });

        enrichDataModelDStream.foreachRDD(enrichDataModelRdd -> {
            if (enrichDataModelRdd.count() > 0) {
                if (executor != null) {
                    executor.executePolicy(enrichDataModelRdd.collect());
                }
            }
        });
    }

    static Function<String, EnrichEventDataModel> convertIntoEnrichModel = new Function<String, EnrichEventDataModel>() {
        @Override
        public EnrichEventDataModel call(String record) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            return mapper.readValue(record, EnrichEventDataModel.class);
        }
    };

    private void saveDataToElasticSearch(List<EnrichEventDataModel> baseDataModelList) {
        for (EnrichEventDataModel baseDataModel : baseDataModelList) {
            dataModelServiceImpl.save(baseDataModel);
        }
    }
}
I am calling the startEnrichEventConsumer() method from a CommandLineRunner:
public class EnrichEventSparkConsumerRunner implements CommandLineRunner {

    @Autowired
    JavaStreamingContext javaStreamingContext;

    @Autowired
    EnrichEventSparkConsumer enrichEventSparkConsumer;

    @Override
    public void run(String... args) throws Exception {
        // start Raw Event Spark Consumer.
        JobContextImpl jobContext = new JobContextImpl(javaStreamingContext);

        // start Enrich Event Spark Consumer.
        enrichEventSparkConsumer.startEnrichEventConsumer(jobContext.streamingctx());
    }
}
Now I want to submit these three Spark Streaming classes to the cluster. I read somewhere that I have to create a JAR file first and then use the spark-submit command, but I have some questions in mind:
Should I create a separate project for these 3 Spark Streaming classes?
Right now I am using CommandLineRunner to initiate Spark Streaming; when I submit to the cluster, should I create a main() method in these classes instead?
Please tell me how to do it. Thanks in advance.
No need for a different project.
You should create an entry point (a main() method) that is responsible for creating the JavaStreamingContext.
Create your jar with dependencies (a single uber jar that bundles everything), and don't forget to give all your Spark dependencies the provided scope, since the cluster's libraries will supply them at runtime.
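A minimal sketch of such an entry point might look like this (the class name, batch interval, and the way the three consumers are wired in are assumptions for illustration, not something prescribed by Spark or by your existing code):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.ConfigurableApplicationContext;

@SpringBootApplication
public class SparkStreamingEntryPoint {

    public static void main(String[] args) throws InterruptedException {
        // Bootstrap Spring so the consumer beans and their dependencies exist on the driver.
        ConfigurableApplicationContext springContext =
                SpringApplication.run(SparkStreamingEntryPoint.class, args);

        // The entry point owns the JavaStreamingContext instead of a CommandLineRunner.
        SparkConf sparkConf = new SparkConf().setAppName("EventStreamingJob");
        JavaStreamingContext javaStreamingContext =
                new JavaStreamingContext(sparkConf, Durations.seconds(10));

        // Register each consumer (assumed to be in the same package) against the same context, then start it once.
        springContext.getBean(EnrichEventSparkConsumer.class)
                .startEnrichEventConsumer(javaStreamingContext);
        // ... start the other two consumers the same way ...

        javaStreamingContext.start();
        javaStreamingContext.awaitTermination();
    }
}

This class would then be the one you pass to --class in spark-submit.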
Execute the assembled Spark application with the spark-submit command-line tool, as follows:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
For a local submit:
bin/spark-submit \
--class package.Main \
--master local[2] \
path/to/jar argument1 argument2
I'm using the command below to send records to a secure Kafka cluster:
bin/kafka-console-producer.sh --topic <My Kafka topic name> --bootstrap-server <My custom bootstrap server> --producer.config /Users/DY/SSL/ssl.properties
As you can see I have added the ssl.properties file's path to the --producer.config switch.
The ssl.properties file contains details about how to connect to the secure Kafka cluster; its contents are below:
security.protocol=SSL
ssl.truststore.location=<My custom value>
ssl.truststore.password=<My custom value>
ssl.key.password=<My custom value>
ssl.keystore.location=<My custom value>
ssl.keystore.password=<My custom value>
Now I want to replicate this command with a Java producer.
The code that I've written is as follows:
public class MyProducer {

    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.put("bootstrap.servers", <My bootstrap server>);
        properties.put("key.serializer", StringSerializer.class);
        properties.put("value.serializer", StringSerializer.class);
        properties.put("producer.config", "/Users/DY/SSL/ssl.properties");

        KafkaProducer<String, String> kafkaProducer = new KafkaProducer<String, String>(properties);
        ProducerRecord<String, String> producerRecord = new ProducerRecord<>(
                <My Kafka topic name>, "Hello World from program");

        Future<RecordMetadata> future = kafkaProducer.send(
                producerRecord,
                (metadata, exception) -> {
                    if (exception != null) {
                        System.out.println("Something went wrong");
                        exception.printStackTrace();
                    } else {
                        System.out.println("Successfully transmitted");
                    }
                });
        future.get();
        kafkaProducer.close();
    }
}
However, passing the file via properties.put("producer.config", "/Users/DY/SSL/ssl.properties"); does not seem to work. Could anybody let me know what an appropriate way to do this would be?
Rather than using a file, you can set the properties individually using the static client config constants, as below:
Properties properties = new Properties();
properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
// for SSL Encryption
properties.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
properties.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "<My custom value>");
properties.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "<My custom value>");
// for SSL Authentication
properties.put(SslConfigs.SSL_KEYSTORE_LOCATION_CONFIG, "<My custom value>");
properties.put(SslConfigs.SSL_KEYSTORE_PASSWORD_CONFIG, "<My custom value>");
properties.put(SslConfigs.SSL_KEY_PASSWORD_CONFIG, "<My custom value>");
The required imports are:
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.config.SslConfigs;
You have to set each one as a discrete property in the producer Properties.
You could use Properties.load() with a FileInputStream or FileReader to load them from the file into your Properties object.
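For example, here is a rough sketch of that approach, reusing the file path and placeholders from the question (not a tested drop-in for your cluster):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class SslPropsFromFile {

    public static void main(String[] args) throws IOException {
        Properties properties = new Properties();
        // Loads security.protocol, ssl.truststore.*, ssl.keystore.* and ssl.key.password
        // exactly as they appear in the file used with kafka-console-producer.sh.
        try (InputStream in = new FileInputStream("/Users/DY/SSL/ssl.properties")) {
            properties.load(in);
        }
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "<My bootstrap server>");
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
        // ... send records exactly as in the question ...
        producer.close();
    }
}

Either way, the SSL settings end up as individual keys in the Properties object, which is what the producer expects.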
I want to implement checkpointing in my Spark file-streaming application so that it processes all unprocessed files from Hadoop if my Spark Streaming application ever stops or terminates. I am following the streaming programming guide, but I cannot find JavaStreamingContextFactory. Please tell me what I should do.
My code is:
public class StartAppWithCheckPoint {

    public static void main(String[] args) {
        try {
            String filePath = "hdfs://Master:9000/mmi_traffic/listenerTransaction/2020/*/*/*/";
            String checkpointDirectory = "hdfs://Mongo1:9000/probeAnalysis/checkpoint";
            SparkSession sparkSession = JavaSparkSessionSingleton.getInstance();

            JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
                @Override
                public JavaStreamingContext create() {
                    SparkConf sparkConf = new SparkConf().setAppName("ProbeAnalysis");
                    JavaSparkContext sc = new JavaSparkContext(sparkConf);
                    JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(300));
                    JavaDStream<String> lines = jssc.textFileStream(filePath).cache();
                    jssc.checkpoint(checkpointDirectory);
                    return jssc;
                }
            };

            JavaStreamingContext context = JavaStreamingContext.getOrCreate(checkpointDirectory, contextFactory);
            context.start();
            context.awaitTermination();
            context.close();
            sparkSession.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
You must use checkpointing.
For checkpointing, use stateful transformations such as updateStateByKey or reduceByKeyAndWindow. There are plenty of examples in the spark-examples module shipped with the prebuilt Spark distribution and in the Spark source on GitHub. For your specific case, see JavaStatefulNetworkWordCount.java.
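As an illustration, here is a minimal sketch of a stateful count with updateStateByKey, reusing the paths and batch interval from your code; the word-splitting logic and class name are assumptions, and it is not a substitute for the JavaStatefulNetworkWordCount example:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StatefulWordCountSketch {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StatefulWordCountSketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(300));
        // Checkpointing is mandatory for stateful transformations; state is restored from here on restart.
        jssc.checkpoint("hdfs://Mongo1:9000/probeAnalysis/checkpoint");

        JavaDStream<String> lines =
                jssc.textFileStream("hdfs://Master:9000/mmi_traffic/listenerTransaction/2020/*/*/*/");

        // Running sum per key, carried across batches via the checkpointed state.
        Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
                (newValues, state) -> {
                    int sum = state.orElse(0);
                    for (Integer v : newValues) {
                        sum += v;
                    }
                    return Optional.of(sum);
                };

        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .updateStateByKey(updateFunction);

        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}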
My goal is to use Kafka Testcontainers with the Spring Boot context in tests without @DirtiesContext. The problem is that, without starting a separate container for each test class, I have no idea how to consume only the messages that were produced by a particular test class or method.
So I end up consuming messages that were not even part of the test class that is running.
One solution might be to purge the topic of messages. I have no idea how to do this; I've tried restarting the container, but then the next test was not able to connect to Kafka.
A second solution I had in mind is to have a consumer that is created at the beginning of the test method and somehow records messages from the latest offset while the rest of the test runs. I've been able to do this with embedded Kafka, but I have no idea how to do it using Testcontainers.
The current configuration looks like this:
@TestConfiguration
public class KafkaContainerConfig {

    @Bean(initMethod = "start", destroyMethod = "stop")
    public KafkaContainer kafkaContainer() {
        return new KafkaContainer("5.0.3");
    }

    @Bean
    public KafkaAdmin kafkaAdmin(KafkaProperties kafkaProperties, KafkaContainer kafkaContainer) {
        kafkaProperties.setBootstrapServers(List.of(kafkaContainer.getBootstrapServers()));
        return new KafkaAdmin(kafkaProperties.buildAdminProperties());
    }
}
With an annotation that provides the above configuration:
@Target({ElementType.TYPE})
@Retention(RetentionPolicy.RUNTIME)
@Import(KafkaContainerConfig.class)
@EnableAutoConfiguration(exclude = TestSupportBinderAutoConfiguration.class)
@TestPropertySource("classpath:/application-test.properties")
@DirtiesContext
public @interface IncludeKafkaTestContainer {
}
And in the test class itself, with multiple such configurations, it looks like:
@IncludeKafkaTestContainer
@IncludePostgresTestContainer
@SpringBootTest(webEnvironment = RANDOM_PORT)
class SomeTest {
    ...
}
Currently, the consumer in the test method is created this way:
KafkaConsumer<String, String> kafkaConsumer = createKafkaConsumer("topic_name");
ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(1));
List<ConsumerRecord<String, String>> topicMsgs = Lists.newArrayList(consumerRecords.iterator());
And:
public static KafkaConsumer<String, String> createKafkaConsumer(String topicName) {
    Properties properties = new Properties();
    properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaContainer.getBootstrapServers());
    properties.put(ConsumerConfig.GROUP_ID_CONFIG, "testGroup_" + topicName);
    properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

    KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);
    kafkaConsumer.subscribe(List.of(topicName));
    return kafkaConsumer;
}
I'm new to Kafka and am trying to use the AdminClient API to manage the Kafka server running on my local machine. I have it set up exactly as in the quick start section of the Kafka documentation, the only difference being that I have not created any topics.
I have no issues running any of the shell scripts on this setup, but when I try to run the following Java code:
public class ProducerMain {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (final AdminClient adminClient = KafkaAdminClient.create(props)) {
            try {
                final NewTopic newTopic = new NewTopic("test", 1, (short) 1);
                final CreateTopicsResult createTopicsResult =
                        adminClient.createTopics(Collections.singleton(newTopic));
                createTopicsResult.all().get();
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            }
        }
    }
}
Error: TimeoutException: Timed out waiting for a node assignment
Exception in thread "main" java.lang.RuntimeException: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
    at ProducerMain.main(ProducerMain.java:41)
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
    at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
    at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
    at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
    at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:258)
    at ProducerMain.main(ProducerMain.java:38)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
I have searched online for an indication as to what the problem could be but have found nothing so far. Any suggestions are welcome as I am at the end of my rope.
Sounds like your broker isn't healthy...
This code works fine for me:
public class Main {
static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args) {
Properties properties = new Properties();
properties.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.setProperty(AdminClientConfig.CLIENT_ID_CONFIG, "local-test");
properties.setProperty(AdminClientConfig.RETRIES_CONFIG, "3");
try (AdminClient client = AdminClient.create(properties)) {
final CreateTopicsResult res = client.createTopics(
Collections.singletonList(
new NewTopic("foo", 1, (short) 1)
)
);
res.all().get(5, TimeUnit.SECONDS);
} catch (InterruptedException | ExecutionException | TimeoutException e) {
logger.error("unable to create topic", e);
}
}
}
And I can see in the broker logs that the topic was created.
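If it helps, you can also verify from the client side by listing topics; a small sketch along the same lines (assuming the same healthy broker on localhost:9092):

import java.util.Properties;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class ListTopics {

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient client = AdminClient.create(properties)) {
            // Names of all topics currently visible to this client.
            Set<String> names = client.listTopics().names().get(5, TimeUnit.SECONDS);
            System.out.println("Topics: " + names);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            e.printStackTrace();
        }
    }
}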
I started the Kafka service with bitnami/kafka and got exactly the same error.
Try starting Kafka with this image instead; it works:
https://hub.docker.com/r/wurstmeister/kafka
$ docker run -d --name zookeeper-server --network app-tier \
-e ALLOW_ANONYMOUS_LOGIN=yes -p 2181:2181 zookeeper:3.6.2
$ docker run -d --name kafka-server --network app-tier --publish 9092:9092 \
--env KAFKA_ZOOKEEPER_CONNECT=zookeeper-server:2181 \
--env KAFKA_ADVERTISED_HOST_NAME=30.225.51.235 \
--env KAFKA_ADVERTISED_PORT=9092 \
wurstmeister/kafka
30.225.51.235 is the IP address of the host machine.
I am using KafkaSpout. Please find the test program below.
I am using Storm 0.8.1. The MultiScheme class is there in Storm 0.8.2, and I will be using that. I just want to know how the earlier versions worked just by instantiating the StringScheme() class. Where can I download earlier versions of KafkaSpout? Though I doubt that would be a better alternative than moving to Storm 0.8.2. (I'm confused.)
When I run the code (given below) on the Storm cluster (i.e., when I push my topology), I get the following error (this happens when the scheme part is commented out; otherwise, of course, I get a compiler error since the class is not there in 0.8.1):
java.lang.NoClassDefFoundError: backtype/storm/spout/MultiScheme
at storm.kafka.TestTopology.main(TestTopology.java:37)
Caused by: java.lang.ClassNotFoundException: backtype.storm.spout.MultiScheme
In the code given below you may find the spoutConfig.scheme = new StringScheme(); part commented out. I was getting a compiler error if I didn't comment that line out, which is only natural as there is no matching constructor there. Also, when I instantiate MultiScheme, I get an error because I don't have that class in 0.8.1.
public class TestTopology {

    public static class PrinterBolt extends BaseBasicBolt {

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.toString());
        }
    }

    public static void main(String[] args) throws Exception {
        List<HostPort> hosts = new ArrayList<HostPort>();
        hosts.add(new HostPort("127.0.0.1", 9092));

        LocalCluster cluster = new LocalCluster();
        TopologyBuilder builder = new TopologyBuilder();

        SpoutConfig spoutConfig = new SpoutConfig(new KafkaConfig.StaticHosts(hosts, 1), "test", "/zkRootStorm", "STORM-ID");
        spoutConfig.zkServers = ImmutableList.of("localhost");
        spoutConfig.zkPort = 2181;
        //spoutConfig.scheme = new StringScheme();
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        builder.setSpout("spout", new KafkaSpout(spoutConfig));
        builder.setBolt("printer", new PrinterBolt())
               .shuffleGrouping("spout");

        Config config = new Config();
        cluster.submitTopology("kafka-test", config, builder.createTopology());

        Thread.sleep(600000);
    }
}
I had the same problem. I finally resolved it and put a complete running example up on GitHub.
You are welcome to check it out here:
https://github.com/buildlackey/cep
(click on the storm+kafka directory for a sample program that should get you up and running).
We had a similar issue.
Our solution:
Open pom.xml
Change scope from provided to <scope>compile</scope>
If you want to know more about dependency scopes, check the Maven documentation:
Maven docu - dependency scopes