I am trying to write a simple Java Kafka consumer to read data, using code similar to https://github.com/bkimminich/apache-kafka-book-examples/blob/master/src/test/kafka/consumer/SimpleHLConsumer.java.
It looks like my app is able to connect, but it's not fetching any data. Please suggest.
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
//import scala.util.parsing.json.JSONObject
import scala.util.parsing.json.JSONObject;
public class SimpleHLConsumer {
private final ConsumerConnector consumer;
private final String topic;
public SimpleHLConsumer(String zookeeper, String groupId, String topic) {
Properties props = new Properties();
props.put("zookeeper.connect", zookeeper);
props.put("group.id", groupId);
// props.put("zookeeper.session.timeout.ms", "5000");
// props.put("zookeeper.sync.time.ms", "250");
// props.put("auto.commit.interval.ms", "1000");
consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
this.topic = topic;
}
public void testConsumer() {
Map<String, Integer> topicCount = new HashMap<>();
topicCount.put(topic, 1);
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams = consumer.createMessageStreams(topicCount);
System.out.println(consumerStreams);
List<KafkaStream<byte[], byte[]>> streams = consumerStreams.get(topic);
System.out.println(streams);
System.out.println(consumer);
for (final KafkaStream stream : streams) {
ConsumerIterator<byte[], byte[]> it = stream.iterator();
System.out.println("for loop");
System.out.println(it);
System.out.println("Message from Single Topic: " + new String(it.next().message()));
//System.out.println("Message from Single Topic: " + new String(it.message()));
while (it.hasNext()) {
System.out.println("in While");
System.out.println("Message from Single Topic: " + new String(it.next().message()));
}
}
// if (consumer != null) {
// consumer.shutdown();
// }
}
public static void main(String[] args) {
String topic = "test";
SimpleHLConsumer simpleHLConsumer = new SimpleHLConsumer("localhost:2181", "testgroup", topic);
simpleHLConsumer.testConsumer();
}
}
Here is the output I see in Eclipse. It does seem to connect to my ZooKeeper, but it just hangs there; it does not display any message at all.
log4j:WARN No appenders could be found for logger (kafka.utils.VerifiableProperties).
log4j:WARN Please initialize the log4j system properly.
SLF4J: The requested version 1.6 by your slf4j binding is not compatible with [1.5.5, 1.5.6]
SLF4J: See http://www.slf4j.org/codes.html#version_mismatch for further details.
{test=[testgroup kafka stream]}
[testgroup kafka stream]
kafka.javaapi.consumer.ZookeeperConsumerConnector@6200f9cb
for loop
The consumer iterator's hasNext() is a blocking call. It will block indefinitely if no new message is available for consumption.
To verify this, change your code as follows:
// Comment 2 lines below
// System.out.println(it);
// System.out.println("Message from Single Topic: " + new String(it.next().message()));
// The line below is blocking. Your code will hang until the next message arrives in the topic.
// Add a new message to the topic using a producer and it will appear in the console.
while (it.hasNext()) {
A better way is to execute the code in a separate thread and use consumer.timeout.ms to specify a time in milliseconds after which the consumer will throw a timeout exception:
// keepRunningThread is a flag to control when to exit the consumer loop
while(keepRunningThread)
{
try
{
if(it.hasNext())
{
System.out.println(new String(it.next().message()));
}
}
catch(ConsumerTimeoutException ex)
{
// Timeout exception waiting for kafka message
// Wait for 5 (or t) seconds before checking for message again
Thread.sleep(5000);
}
}
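For completeness, consumer.timeout.ms is set on the consumer properties before the connector is created. A minimal sketch based on the constructor from the question (the 1000 ms value is just an example):
Properties props = new Properties();
props.put("zookeeper.connect", zookeeper);
props.put("group.id", groupId);
// With this set, it.hasNext() throws ConsumerTimeoutException instead of
// blocking forever when no message arrives within 1 second.
props.put("consumer.timeout.ms", "1000");
consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));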
I have a Cloudera Quickstart VM. I have installed the Kafka parcels using Cloudera Manager, and it works fine inside the VM with the console-based consumer and producer.
But when I try to use a Java-based consumer, it does not produce or consume messages. I can list the topics, but I cannot consume messages.
Following is my code:
package kafka_consumer;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
public class mclass {
public static void main(String[] args) {
Properties props = new Properties();
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.0.75.1:9092");
// Just a user-defined string to identify the consumer group
props.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
// Enable auto offset commit
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
props.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
// List of topics to subscribe to
consumer.subscribe(Arrays.asList("second_topic"));
for (String k_topic : consumer.listTopics().keySet()) {
System.out.println(k_topic);
}
while (true) {
try {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
System.out.printf("Offset = %d\n", record.offset());
System.out.printf("Key = %s\n", record.key());
System.out.printf("Value = %s\n", record.value());
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
}
And the following is the output of the code: while the console producer is producing messages, the consumer is not able to receive them.
PS: I can telnet to the IP and port of the Kafka broker. I can even list the topics. The consumer runs constantly without crashing, but no messages are being consumed.
I am trying to create a KStreams application in Eclipse using Java. Right now I am referring to the word count program available on the internet for KStreams and modifying it.
What I want is for the data that I read from the input topic to be written to a file instead of to another output topic.
But when I try to print the KStream/KTable to a local file, I get the following entry in the output file:
org.apache.kafka.streams.kstream.internals.KStreamImpl@4c203ea1
How do I redirect the output of the KStream to a file?
Below is the code:
package KStreamDemo.kafkatest;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.KeyValueMapper;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.ValueMapper;
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
public class TemperatureDemo {
public static void main(String[] args) throws Exception {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "34.73.184.104:9092");
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
System.out.println("#1###################################################################################################################################################################################");
// setting offset reset to earliest so that we can re-run the demo code with the same pre-loaded data
// Note: To re-run the demo, you need to use the offset reset tool:
// https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Application+Reset+Tool
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
StreamsBuilder builder = new StreamsBuilder();
System.out.println("#2###################################################################################################################################################################################");
KStream<String, String> source = builder.stream("iot-temperature");
System.out.println("#5###################################################################################################################################################################################");
KTable<String, Long> counts = source
.flatMapValues(new ValueMapper<String, Iterable<String>>() {
@Override
public Iterable<String> apply(String value) {
return Arrays.asList(value.toLowerCase(Locale.getDefault()).split(" "));
}
})
.groupBy(new KeyValueMapper<String, String, String>() {
@Override
public String apply(String key, String value) {
return value;
}
})
.count();
System.out.println("#3###################################################################################################################################################################################");
System.out.println("OUTPUT:"+ counts);
System.out.println("#4###################################################################################################################################################################################");
// need to override value serde to Long type
counts.toStream().to("iot-temperature-max", Produced.with(Serdes.String(), Serdes.Long()));
final KafkaStreams streams = new KafkaStreams(builder.build(), props);
final CountDownLatch latch = new CountDownLatch(1);
// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-wordcount-shutdown-hook") {
@Override
public void run() {
streams.close();
latch.countDown();
}
});
try {
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
}
}
This is not correct:
System.out.println("OUTPUT:"+ counts);
You would need to use counts.foreach and then print the messages out to a file, as sketched below.
Print Kafka Stream Input out to console? (just update to write to file instead)
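For example, a minimal sketch of the foreach approach (this would go inside main() after counts is built; the output path and the anonymous ForeachAction are just one way to write it):
// Append every count record to a local file instead of printing the KTable reference.
// The path /tmp/word-counts.txt is only a placeholder.
final java.io.PrintWriter writer =
        new java.io.PrintWriter(new java.io.FileWriter("/tmp/word-counts.txt", true), true);
counts.toStream().foreach(new org.apache.kafka.streams.kstream.ForeachAction<String, Long>() {
    @Override
    public void apply(String key, Long value) {
        writer.println(key + " = " + value);
    }
});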
However, it is probably better to write the stream out to a topic and then use Kafka Connect to write that topic to a file. This is a more industry-standard pattern: Kafka Streams is encouraged to only move data between topics within Kafka, not to integrate with external systems (or filesystems).
Edit config/connect-file-sink.properties with the topic information you want, then run
bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-sink.properties
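For reference, the stock FileStreamSink configuration shipped with Kafka looks roughly like this (the topic and file names here are placeholders to adapt):
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=/tmp/iot-temperature-max.txt
topics=iot-temperature-max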
I'm new to this topic. I'm using a JMS listener to listen to an ActiveMQ queue that contains split messages. I need to listen to the queue until the last message and then send them to the UI all together. I'm able to listen to the queue and grab messages, but I do not know how many split messages will be available, so I'm not able to send them all together. Is there any way to make the listener do this? For instance, if there are no more messages available in the queue, will the JMS listener produce a null value? Any idea or help will be really helpful.
I'm using the code below to listen to the queue with a JMS listener:
private static final String ORDER_RESPONSE_QUEUE = "mail-response-queue";
@JmsListener(destination = ORDER_RESPONSE_QUEUE)
public void receiveMessage(final Message<InventoryResponse> message) throws JMSException {
LOG.info("+++++++++++++++++++++++++++++++++++++++++++++++++++++");
MessageHeaders headers = message.getHeaders();
LOG.info("Application : headers received : {}", headers);
InventoryResponse response = message.getPayload();
LOG.info("Application : response received : {}",response);
LOG.info("+++++++++++++++++++++++++++++++++++++++++++++++++++++");
}
Can I get queue information using a JMS listener?
Via JMX you have access to information about destinations; for example, you can find out how many messages are pending in a queue.
Note that this can change if new messages are sent.
long org.apache.activemq.broker.jmx.DestinationViewMBean.getQueueSize()
@MBeanInfo(value = "Number of messages in the destination which are yet to be consumed. Potentially dispatched but unacknowledged.")
Returns the number of messages in this destination which are yet to be consumed.
import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.MBeanServerInvocationHandler;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import org.apache.activemq.broker.jmx.BrokerViewMBean;
import org.apache.activemq.broker.jmx.QueueViewMBean;
public class JMXGetDestinationInfos {
public static void main(String[] args) throws Exception {
JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://host:1099/jmxrmi");
Map<String, String[]> env = new HashMap<>();
String[] creds = {"admin", "activemq"};
env.put(JMXConnector.CREDENTIALS, creds);
JMXConnector jmxc = JMXConnectorFactory.connect(url, env);
MBeanServerConnection conn = jmxc.getMBeanServerConnection();
ObjectName activeMq = new ObjectName("org.apache.activemq:type=Broker,brokerName=localhost");
BrokerViewMBean mbean = MBeanServerInvocationHandler.newProxyInstance(conn, activeMq, BrokerViewMBean.class,
true);
for (ObjectName name : mbean.getQueues()) {
if (("Destination".equals(name.getKeyProperty("destinationName")))) {
QueueViewMBean queueMbean = MBeanServerInvocationHandler.newProxyInstance(conn, name,
QueueViewMBean.class, true);
System.out.println(queueMbean.getQueueSize());
}
}
}
}
Why not consume the messages and, once no more messages are received, display them? The method below returns null after a timeout if no message was received; a sketch follows the quoted Javadoc.
ActiveMQMessageConsumer.receive(long timeout) throws JMSException
Receives the next message that arrives within the specified timeout interval. This call blocks until a message arrives, the timeout expires, or this message consumer is closed. A timeout of zero never expires, and the call blocks indefinitely.
Specified by: receive in interface MessageConsumer
Parameters: timeout - the timeout value (in milliseconds); a timeout of zero never expires.
Returns: the next message produced for this message consumer, or null if the timeout expires or this message consumer is concurrently closed.
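A minimal sketch of that idea, using a plain synchronous consumer instead of @JmsListener (the broker URL, queue name, and 2-second timeout are assumptions to adapt):
import java.util.ArrayList;
import java.util.List;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;
public class DrainQueueExample {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue("mail-response-queue"));
        List<Message> batch = new ArrayList<>();
        Message msg;
        // receive(2000) blocks for up to 2 seconds; null means nothing arrived in that
        // window, which this sketch treats as "no more split messages for now".
        while ((msg = consumer.receive(2000)) != null) {
            batch.add(msg);
        }
        System.out.println("Collected " + batch.size() + " messages; send them to the UI here.");
        consumer.close();
        session.close();
        connection.close();
    }
}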
UPDATE
Maybe like this:
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.MBeanServerInvocationHandler;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import org.apache.activemq.broker.jmx.BrokerViewMBean;
import org.apache.activemq.broker.jmx.QueueViewMBean;
public class JMXGetDestinationInfos {
private QueueViewMBean queueMbean;
{
try {
JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://host:1099/jmxrmi");
Map<String, String[]> env = new HashMap<>();
String[] creds = { "admin", "activemq" };
env.put(JMXConnector.CREDENTIALS, creds);
JMXConnector jmxc = JMXConnectorFactory.connect(url, env);
MBeanServerConnection conn = jmxc.getMBeanServerConnection();
ObjectName activeMq = new ObjectName("org.apache.activemq:type=Broker,brokerName=localhost");
BrokerViewMBean mbean = MBeanServerInvocationHandler.newProxyInstance(conn, activeMq, BrokerViewMBean.class,
true);
for (ObjectName name : mbean.getQueues()) {
if (("Destination".equals(name.getKeyProperty("destinationName")))) {
queueMbean = MBeanServerInvocationHandler.newProxyInstance(conn, name, QueueViewMBean.class, true);
System.out.println(queueMbean.getQueueSize());
break;
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
@JmsListener(destination = ORDER_RESPONSE_QUEUE)
public void receiveMessage(final Message<InventoryResponse> message, javax.jms.Message amqMessage) throws JMSException {
LOG.info("+++++++++++++++++++++++++++++++++++++++++++++++++++++");
MessageHeaders headers = message.getHeaders();
LOG.info("Application : headers received : {}", headers);
InventoryResponse response = message.getPayload();
LOG.info("Application : response received : {}",response);
LOG.info("+++++++++++++++++++++++++++++++++++++++++++++++++++++");
//queueMbean.getQueueSize() is real time, each call return the real size
((org.apache.activemq.command.ActiveMQMessage) amqMessage ).acknowledge();
if(queueMbean != null && queueMbean.getQueueSize() == 0){
//display messages ??
}
}
}
because getQueueSize() returns the number of messages in the destination which are yet to be consumed, potentially dispatched but unacknowledged.
One solution is to set the acknowledge mode to org.apache.activemq.ActiveMQSession.INDIVIDUAL_ACKNOWLEDGE for session creation in your Spring DefaultMessageListenerContainer (via sessionAcknowledgeModeName), acknowledge each message individually, and then check whether the size == 0 (size == 0 means all messages have been dispatched and acknowledged).
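A minimal sketch of that configuration with a Spring @JmsListener container factory (the bean name and the injected ConnectionFactory are assumptions):
import javax.jms.ConnectionFactory;
import org.apache.activemq.ActiveMQSession;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.annotation.EnableJms;
import org.springframework.jms.config.DefaultJmsListenerContainerFactory;
@Configuration
@EnableJms
public class JmsConfig {
    @Bean
    public DefaultJmsListenerContainerFactory jmsListenerContainerFactory(ConnectionFactory connectionFactory) {
        DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
        factory.setConnectionFactory(connectionFactory);
        // INDIVIDUAL_ACKNOWLEDGE lets each message be acknowledged on its own,
        // so queueMbean.getQueueSize() reflects what has actually been processed.
        factory.setSessionAcknowledgeMode(ActiveMQSession.INDIVIDUAL_ACKNOWLEDGE);
        return factory;
    }
}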
I am new to Kafka. I am trying to send a message through a Java app and consume it at the command-line prompt, but the message is not getting displayed on the CLI.
Following is the Java code:
package com.kafka.prj;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
public class KafkaProd {
private static KafkaProducer<String, String> producer;
public void initialize() {
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// ProducerConfig producerConfig = new ProducerConfig(producerProps);
producer = new KafkaProducer<String, String>(props);
}
public void publishMesssage() throws Exception{
producer.send(new ProducerRecord<String, String>("test1", "dummy text msg"));
return;
}
public static void main(String[] args) {
KafkaProd kafkaProducer = new KafkaProd();
// Initialize producer
kafkaProducer.initialize();
// Publish message
try {
kafkaProducer.publishMesssage();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
//Close the producer
producer.close();
}
}
On the CLI, the following is the command I am using to consume the message sent in the code above:
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test1 --from-beginning
The above command displays nothing: no error, no output.
Where am I going wrong?
I was having the same problem, i.e. messages produced by kafka-console-producer.sh were visible on the kafka-console-consumer.sh console, but when using the Java producer, the kafka-console-consumer.sh console did not receive any messages. Also, the logs of the Java producer had this as the last line:
2017-07-10 16:36:42 INFO KafkaProducer:972 - Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.
This means that the Java producer is not able to connect to the Kafka brokers given by the bootstrap.servers config of the Java producer (although I was actually able to telnet to the broker port from the same machine as the Java producer). The solution is to add the property advertised.host.name on all the brokers and set it to the IP/hostname of that broker. This should be in line with whatever you are providing in bootstrap.servers.
My bootstrap.servers had the value 192.168.10.12:9092,192.168.10.13:9092,192.168.10.14:9092, so on each of the brokers I set advertised.host.name=192.168.10.12, advertised.host.name=192.168.10.13, and advertised.host.name=192.168.10.14 respectively.
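For example, on the 192.168.10.12 broker the relevant lines in config/server.properties would look something like this (a sketch; on newer Kafka versions advertised.listeners replaces the deprecated advertised.host.name):
listeners=PLAINTEXT://0.0.0.0:9092
advertised.host.name=192.168.10.12
# On newer brokers the equivalent setting is:
# advertised.listeners=PLAINTEXT://192.168.10.12:9092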
I am using Spark Streaming with a Kafka topic. The topic is created with 5 partitions. All my messages are published to the Kafka topic using the table name as the key.
Given this, I assume all messages for a given table should go to the same partition.
But I notice in the Spark logs that messages for the same table sometimes go to the executor on node-1 and sometimes to the executor on node-2.
I am running the code in yarn-cluster mode using the following command:
spark-submit --name DataProcessor --master yarn-cluster --files /opt/ETL_JAR/executor-log4j-spark.xml,/opt/ETL_JAR/driver-log4j-spark.xml,/opt/ETL_JAR/application.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=driver-log4j-spark.xml" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=executor-log4j-spark.xml" --class com.test.DataProcessor /opt/ETL_JAR/etl-all-1.0.jar
This submission creates one driver, say on node-1, and two executors, on node-1 and node-2.
I don't want the node-1 and node-2 executors to read the same partition, but this is happening.
I also tried the following configuration to specify the consumer group, but it made no difference:
kafkaParams.put("group.id", "app1");
This is how we are creating the stream, using the createDirectStream method (not through ZooKeeper):
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", brokers);
kafkaParams.put("auto.offset.reset", "largest");
kafkaParams.put("group.id", "app1");
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet
);
Complete Code:
import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import kafka.serializer.StringDecoder;
import scala.Tuple2;
public class DataProcessor2 implements Serializable {
private static final long serialVersionUID = 3071125481526170241L;
private static Logger log = LoggerFactory.getLogger("DataProcessor");
public static void main(String[] args) {
final String sparkCheckPointDir = ApplicationProperties.getProperty(Consts.SPARK_CHECKPOINTING_DIR);
DataProcessorContextFactory3 factory = new DataProcessorContextFactory3();
JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(sparkCheckPointDir, factory);
// Start the process
jssc.start();
jssc.awaitTermination();
}
}
class DataProcessorContextFactory3 implements JavaStreamingContextFactory, Serializable {
private static final long serialVersionUID = 6070911284191531450L;
private static Logger logger = LoggerFactory.getLogger(DataProcessorContextFactory.class);
DataProcessorContextFactory3() {
}
@Override
public JavaStreamingContext create() {
logger.debug("creating new context..!");
final String brokers = ApplicationProperties.getProperty(Consts.KAFKA_BROKERS_NAME);
final String topic = ApplicationProperties.getProperty(Consts.KAFKA_TOPIC_NAME);
final String app = "app1";
final String offset = ApplicationProperties.getProperty(Consts.KAFKA_CONSUMER_OFFSET, "largest");
logger.debug("Data processing configuration. brokers={}, topic={}, app={}, offset={}", brokers, topic, app,
offset);
if (StringUtils.isBlank(brokers) || StringUtils.isBlank(topic) || StringUtils.isBlank(app)) {
System.err.println("Usage: DataProcessor <brokers> <topic>\n" + Consts.KAFKA_BROKERS_NAME
+ " is a list of one or more Kafka brokers separated by comma\n" + Consts.KAFKA_TOPIC_NAME
+ " is a kafka topic to consume from \n\n\n");
System.exit(1);
}
final String majorVersion = "1.0";
final String minorVersion = "3";
final String version = majorVersion + "." + minorVersion;
final String applicationName = "DataProcessor-" + topic + "-" + version;
// for dev environment
SparkConf sparkConf = new SparkConf().setMaster("local[*]").setAppName(applicationName);
// for cluster environment
//SparkConf sparkConf = new SparkConf().setAppName(applicationName);
final long sparkBatchDuration = Long
.valueOf(ApplicationProperties.getProperty(Consts.SPARK_BATCH_DURATION, "10"));
final String sparkCheckPointDir = ApplicationProperties.getProperty(Consts.SPARK_CHECKPOINTING_DIR);
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(sparkBatchDuration));
logger.debug("setting checkpoint directory={}", sparkCheckPointDir);
jssc.checkpoint(sparkCheckPointDir);
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topic.split(",")));
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", brokers);
kafkaParams.put("auto.offset.reset", offset);
kafkaParams.put("group.id", "app1");
// @formatter:off
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet
);
// @formatter:on
processRDD(messages, app);
return jssc;
}
private void processRDD(JavaPairInputDStream<String, String> messages, final String app) {
JavaDStream<MsgStruct> rdd = messages.map(new MessageProcessFunction());
rdd.foreachRDD(new Function<JavaRDD<MsgStruct>, Void>() {
private static final long serialVersionUID = 250647626267731218L;
@Override
public Void call(JavaRDD<MsgStruct> currentRdd) throws Exception {
if (!currentRdd.isEmpty()) {
logger.debug("Receive RDD. Create JobDispatcherFunction at HOST={}", FunctionUtil.getHostName());
currentRdd.foreachPartition(new VoidFunction<Iterator<MsgStruct>>() {
@Override
public void call(Iterator<MsgStruct> arg0) throws Exception {
while(arg0.hasNext()){
System.out.println(arg0.next().toString());
}
}
});
} else {
logger.debug("Current RDD is empty.");
}
return null;
}
});
}
public static class MessageProcessFunction implements Function<Tuple2<String, String>, MsgStruct> {
@Override
public MsgStruct call(Tuple2<String, String> data) throws Exception {
String message = data._2();
System.out.println("message:"+message);
return MsgStruct.parse(message);
}
}
public static class MsgStruct implements Serializable{
private String message;
public static MsgStruct parse(String msg){
MsgStruct m = new MsgStruct();
m.message = msg;
return m;
}
public String toString(){
return "content inside="+message;
}
}
}
According to Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher), you can specify an explicit mapping of partitions to hosts.
Assume you have two hosts (h1 and h2), and the Kafka topic topic-name has three partitions. The following critical code shows how to map a specific partition to a host in Java.
Map<TopicPartition, String> partitionMapToHost = new HashMap<>();
// partition 0 -> h1, partition 1 and 2 -> h2
partitionMapToHost.put(new TopicPartition("topic-name", 0), "h1");
partitionMapToHost.put(new TopicPartition("topic-name", 1), "h2");
partitionMapToHost.put(new TopicPartition("topic-name", 2), "h2");
List<String> topicCollection = Arrays.asList("topic-name");
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "10.0.0.2:9092,10.0.0.3:9092");
kafkaParams.put("group.id", "group-id-name");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
JavaInputDStream<ConsumerRecord<String, String>> records = KafkaUtils.createDirectStream(jssc,
LocationStrategies.PreferFixed(partitionMapToHost), // PreferFixed is the key
ConsumerStrategies.Subscribe(topicCollection, kafkaParams));
You can also use LocationStrategies.PreferConsistent(), which distributes partitions evenly across the available executors and ensures that a given partition is only consumed by a single executor, as in the sketch below.
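A minimal sketch of that variant, reusing jssc, topicCollection, and kafkaParams from the snippet above:
JavaInputDStream<ConsumerRecord<String, String>> records = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(), // spread partitions evenly over executors
        ConsumerStrategies.<String, String>Subscribe(topicCollection, kafkaParams));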
With the DirectStream approach, it is a correct assumption that messages sent to a Kafka partition will land in the same Spark partition.
What we cannot assume is that each Spark partition will be processed by the same Spark worker each time. On each batch interval, Spark tasks are created for each OffsetRange of each partition and sent to the cluster for processing, landing on some available worker.
What you are looking for is partition locality. The only partition locality that the direct Kafka consumer supports is the Kafka host containing the offset range being processed, in the case that your Spark and Kafka deployments are colocated; but that's a deployment topology I don't see very often.
If your requirements dictate the need for host locality, you should look into Apache Samza or Kafka Streams.