I am trying to build a pub/sub application and I am exploring the best tools out there. I am currently looking at Kafka and have a little demo app already running. However, I am running into a conceptual issue.
I have a producer (Java code):
String topicName = "MyTopic";
String key = "MyKey";
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092,localhost:9093");
props.put("acks", "all");
props.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
Producer<String, byte[]> producer = new KafkaProducer <String, byte[]>(props);
byte[] data = <FROM ELSEWHERE>;
ProducerRecord<String, byte[]> record = new ProducerRecord<String, byte[]>(topicName, key, data);
try {
RecordMetadata result = producer.send(record).get();
}
catch (Exception e) {
// Nothing for now
}
producer.close();
When I start a consumer via the Kafka command-line tools:
kafka-console-consumer --bootstrap-server localhost:9092 --topic MyTopic
and then I execute the producer code, I see the data message show up on my consumer terminal.
However, if I do not run the consumer prior to executing the producer, the message appears "lost". When I start the consumer (after executing the producer), nothing appears in the consumer terminal.
Does anyone know if it's possible to have the Kafka broker retain messages while there are no consumers connected? If so, how?
Append --from-beginning to the console consumer command to have it start consuming from the earliest offset. This comes down to the offset reset strategy, which is controlled by the auto.offset.reset config. Here is what that config means:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
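To be clear, the broker does retain the messages (subject to the topic's retention settings) whether or not a consumer is connected; the console consumer simply starts at the latest offset by default. For completeness, the same idea in Java for the producer above would be a consumer with auto.offset.reset set to earliest. This is only a minimal sketch; the group id and the poll loop are assumptions, not taken from your code:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MyTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093");
        props.put("group.id", "my-consumer-group"); // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // Only applies when the group has no committed offset yet (or the offset has expired)
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("MyTopic"));
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(100);
                for (ConsumerRecord<String, byte[]> rec : records) {
                    System.out.println(rec.key() + " -> " + rec.value().length + " bytes");
                }
            }
        }
    }
}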
Related
I have manually created topic test with this command:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
and using this command:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
I inserted these records:
This is a message
This is another message
This is a message2
First, I consume messages through the command line like this:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
and all the records are successfully shown. Then, I try to implement a consumer in Java using this code:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaSubscriber {

    public void consume() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        Consumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test"));
        // also tried with this variant
        // consumer.subscribe(Arrays.asList("test"));
        System.out.println("Starting to read data...");
        try {
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    System.out.println("Number of records found: " + records.count());
                    for (ConsumerRecord<String, String> rec : records) {
                        System.out.println(rec.value());
                    }
                }
                catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        } finally {
            consumer.close();
        }
    }
}
But the output is:
Starting to read data...
0
0
0
0
0
....
Which means that it does not find any records in topic test. I also tried to publish some records after the Java consumer had started, but the result was the same. Any ideas what might be going wrong?
EDIT: After adding the following line:
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
the consumer now reads only when I write new records to the topic. It does not read all the records from the beginning.
By default, if no offsets have previously been committed for the group, the consumer starts at the end of the topics.
Hence if you are running it after having produced records, it won't receive them.
Notice in your kafka-console-consumer.sh, you have the --from-beginning flag which forces the consumer to instead start from the beginning of the topic.
One workaround, as suggested in a comment, is to set ConsumerConfig.AUTO_OFFSET_RESET_CONFIG to earliest. However I'd be careful with that setting as your consumer will consume from the start of the topics and this could be a lot of data in a real use case.
The easiest solution: now that you've run your consumer once and it has created a group, simply rerun the producer. When you then run the consumer again, it will pick up from its last position, which is before the new producer's messages.
On the other hand, if you mean to always reconsume all messages then you have 2 options:
explicitly call seekToBeginning() when your consumer starts, to move its position to the start of the topics (see the sketch after this list)
set auto.offset.reset to earliest and disable auto offset commit by setting enable.auto.commit to false
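A minimal sketch of the first option, reusing the topic and group id from the question (treat it as an illustration rather than a drop-in fix):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test"));
        consumer.poll(0);                                // triggers partition assignment
        consumer.seekToBeginning(consumer.assignment()); // rewind all assigned partitions to the start
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            records.forEach(rec -> System.out.println(rec.value()));
        }
    }
}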
I have a single-node Kafka broker and a simple streams application. I created 2 topics (topic1 and topic2).
Produced on topic1 - processed message - write to topic2
Note: For each message produced only one message is written to destination topic
I produced a single message. After it was written to topic2, I stopped the Kafka broker. After some time I restarted the broker and produced another message on topic1. The streams app then processed that message 3 times. Next, without stopping the broker, I produced messages to topic1 and waited for the streams app to write to topic2 before producing again.
The streams app is behaving strangely. Sometimes one produced message results in 2 messages written to the destination topic, and sometimes 3. I don't understand why this is happening. Even the messages produced after the broker restart are being duplicated.
Update 1:
I am using Kafka version 1.0.0 and Kafka-Streams version 1.1.0
Below is the code.
Main.java
String credentials = env.get("CREDENTIALS");

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "activity-collection");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.RECONNECT_BACKOFF_MS_CONFIG, 100000);
props.put(StreamsConfig.RECONNECT_BACKOFF_MAX_MS_CONFIG, 200000);
props.put(StreamsConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
props.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 60000);
props.put(StreamsConfig.producerPrefix(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG), true);
props.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");

final StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> activityStream = builder.stream("activity_contenturl");
KStream<String, String> activityResultStream = AppUtil.hitContentUrls(credentials, activityStream);
activityResultStream.to("o365_user_activity");

// build and start the topology
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
AppUtil.java
public static KStream<String, String> hitContentUrls(String credentials, KStream<String, String> activityStream) {
    KStream<String, String> activityResultStream = activityStream
        .flatMapValues(new ValueMapper<String, Iterable<String>>() {
            @Override
            public Iterable<String> apply(String value) {
                ArrayList<String> log = new ArrayList<String>();
                JSONObject received = new JSONObject(value);
                String url = received.get("url").toString();
                String accessToken = ServiceUtil.getAccessToken(credentials);
                JSONObject activityLog = ServiceUtil.getActivityLogs(url, accessToken);
                log.add(activityLog.toString());
                return log;
            }
        });
    return activityResultStream;
}
Update 2:
In a single-broker, single-partition environment with the above config, I started the Kafka broker and the streams app. I produced 6 messages on the source topic, and when I started a consumer on the destination topic there were 36 messages and counting. They keep coming.
So I ran this to see consumer-groups:
kafka_2.11-1.1.0/bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --list
Output:
streams-collection-app-0
Next I ran this:
kafka_2.11-1.1.0/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group streams-collection-app-0
Output:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
o365_activity_contenturl 0 1 1 0 streams-collection-app-0-244b6f55-b6be-40c4-9160-00ea45bba645-StreamThread-1-consumer-3a2940c2-47ab-49a0-ba72-4e49d341daee /127.0.0.1 streams-collection-app-0-244b6f55-b6be-40c4-9160-00ea45bba645-StreamThread-1-consumer
After a while the output showed this:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
o365_activity_contenturl 0 1 6 5 - - -
And then:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
o365_activity_contenturl 0 1 7 6 - - -
It seems you are facing a known limitation. A Kafka topic by default retains messages for at least 7 days, but committed offsets are only stored for 1 day (the default value of offsets.retention.minutes is 1440). So if no messages were produced to your source topic for more than a day, then after an app restart all messages from the topic will be reprocessed again (actually multiple times, depending on the number of restarts, at most once per day for such a topic with rare incoming messages).
You can find a description of how committed offsets expire in How does an offset expire for consumer group.
In Kafka 2.0 the retention for committed offsets was increased; see KIP-186: Increase offsets retention default to 7 days.
To prevent reprocessing, you could set the consumer property auto.offset.reset to latest (the Kafka Streams default is earliest).
There is a small risk with latest: if nothing was produced to the source topic for longer than a day and you then restarted the app, you could lose some messages (only those that arrived exactly during the restart).
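In the Streams configuration from the question that could look roughly like the lines below. This is only a sketch: it mirrors the producerPrefix(...) calls already in Main.java, and the broker-side alternative assumes you want offsets kept as long as the data.

// Streams app: only takes effect when the group has no committed offset (or it has expired)
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), "latest");

// Broker-side alternative (server.properties): keep committed offsets longer,
// e.g. 7 days, so restarts after quiet periods do not trigger reprocessing
// offsets.retention.minutes=10080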
I hope someone can shed some light on my issue.
I'm making a series of REST calls to a micro-service I have built. When the service receives the calls, it persists some data to a database and then publishes a message to a Kafka topic.
I'm trying to write a test that makes the REST calls and then consumes the messages from the Kafka topic.
When my test tries to consume from Kafka, it doesn't appear to be consuming the latest messages.
I've used the kafka-consumer-groups.sh shell script which ships with Kafka to describe the state of my consumer. Here is what it looks like:
bash-4.3# ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group test
Note: This will only show information about consumers that use the Java consumer API (non-ZooKeeper-based consumers).
Consumer group 'test' has no active members.
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
test_topic 0 42 43 1 - - -
Note the lag of 1. This is what appears to be my issue.
Here is my Kafka consumer code:
public void consumeMessages() {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props)) {
        kafkaConsumer.subscribe(Collections.singletonList("test_topic"));
        ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(5000);
        for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
            System.out.printf("offset = %d, message = %s%n", consumerRecord.offset(), consumerRecord.value());
        }
    }
}
Any help will be greatly received.
Thanks,
Ben :)
I'm completely new to Kafka and I'm having some trouble using the KafkaProducer.
The send method of the producer blocks for exactly one minute and then the application proceeds without an exception. This is obviously some timeout, but no exception is thrown.
I also see nothing meaningful in the logs.
The servers seem to be set up correctly: if I use the bin/kafka-console-consumer and producer applications, I can send and receive messages correctly. Also, the code seems to work to some extent.
If I write to a topic which does not exist yet, I can see the new entry in the /tmp/kafka-logs folder and also in the console output of the KafkaServer.
Here is the code I use:
Properties props = ResourceUtils.loadProperties("kafka.properties");
Producer<String, String> producer = new KafkaProducer<>(props);
for (String line : lines)
{
producer.send(new ProducerRecord<>("topic", Id, line));
producer.flush();
}
producer.close();
The properties in the kafka.properties file:
bootstrap.servers=localhost:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
acks=all
retries=0
batch.size=16384
linger.ms=1
buffer.memory=33554432
So, producer.send blocks for one minute and then continues. In the end nothing is stored in Kafka, but the new topic is created.
Thank you for any help!
Try setting bootstrap.servers to 127.0.0.1:9092. The one-minute block is typically the producer waiting for topic metadata: max.block.ms defaults to 60000 ms, and if the client cannot reach the broker under the address it was given (or the address the broker advertises), send() blocks for that long and then fails. Since the returned Future is never checked here, the failure is silent.
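If that alone doesn't explain it, a quick way to see the real error is to check the result of each send. A rough sketch, reusing props, lines, and Id from the question (the max.block.ms value is just an example for debugging):

props.put("max.block.ms", "10000"); // fail faster than the 60 s default while debugging (example value)

Producer<String, String> producer = new KafkaProducer<>(props);
try {
    for (String line : lines) {
        // get() forces the send to complete and rethrows any failure,
        // e.g. a TimeoutException when topic metadata cannot be fetched
        RecordMetadata meta = producer.send(new ProducerRecord<>("topic", Id, line)).get();
        System.out.println("wrote to " + meta.topic() + "-" + meta.partition() + " @ offset " + meta.offset());
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    producer.close();
}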
I have written a simple program to read data from Kafka and print in flink. Below is the code.
public static void main(String[] args) throws Exception {
Options flinkPipelineOptions = PipelineOptionsFactory.create().as(Options.class);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Class<?> unmodColl = Class.forName("java.util.Collections$UnmodifiableCollection");
env.getConfig().addDefaultKryoSerializer(unmodColl, UnmodifiableCollectionsSerializer.class);
env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);
flinkPipelineOptions.setJobName("MyFlinkTest");
flinkPipelineOptions.setStreaming(true);
flinkPipelineOptions.setCheckpointingInterval(1000L);
flinkPipelineOptions.setNumberOfExecutionRetries(5);
flinkPipelineOptions.setExecutionRetryDelay(3000L);
Properties p = new Properties();
p.setProperty("zookeeper.connect", "localhost:2181");
p.setProperty("bootstrap.servers", "localhost:9092");
p.setProperty("group.id", "test");
FlinkKafkaConsumer09<Notification> kafkaConsumer = new FlinkKafkaConsumer09<>("testFlink",new ProtoDeserializer(),p);
DataStream<Notification> input = env.addSource(kafkaConsumer);
input.rebalance().map(new MapFunction<Notification, String>() {
@Override
public String map(Notification value) throws Exception {
return "Kafka and Flink says: " + value.toString();
}
}).print();
env.execute();
}
I need Flink to process my data from Kafka exactly once, and I have a few questions on how it can be done.
When does FlinkKafkaConsumer09 commit the processed offsets to Kafka?
Say my topic has 10 messages and the consumer processes all 10. When I stop the job and start it again, it starts processing seemingly random messages from the set of previously read messages. I need to ensure none of my messages are processed twice.
Please advise. I appreciate all the help. Thanks.
This page describes the fault tolerance guarantees of the Flink Kafka connector.
You can use Flink's savepoints to re-start a job in an exactly-once (state preserving) manner.
The reason why you are seeing the messages again is because the offsets committed by Flink to the Kafka broker / Zookeeper are not in line with Flink's registered state.
You'll always see messages processed multiple times after restore / failure in Flink, even with exactly-once semantics enabled. The exactly-once guarantees in Flink are with respect to registered state, not for the records sent to the operators.
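Regarding the first question (when offsets are committed): with checkpointing enabled, the connector snapshots its offsets into Flink's checkpointed state and, by default, also commits them back to Kafka when a checkpoint completes; those Kafka-side commits are only used for monitoring and as a fallback start position when no Flink state is available. A rough sketch, assuming a connector version that exposes setCommitOffsetsOnCheckpoints (newer releases of the Kafka connector do):

// Offsets live in Flink's checkpointed state; the Kafka/ZK-side commit is informational/fallback only.
env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);

FlinkKafkaConsumer09<Notification> kafkaConsumer =
        new FlinkKafkaConsumer09<>("testFlink", new ProtoDeserializer(), p);
kafkaConsumer.setCommitOffsetsOnCheckpoints(true); // commit to Kafka when a checkpoint completes
DataStream<Notification> input = env.addSource(kafkaConsumer);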
Slightly off-topic: What are these lines for? They are not passed to Flink anywhere.
Options flinkPipelineOptions = PipelineOptionsFactory.create().as(Options.class);
flinkPipelineOptions.setJobName("MyFlinkTest");
flinkPipelineOptions.setStreaming(true);
flinkPipelineOptions.setCheckpointingInterval(1000L);
flinkPipelineOptions.setNumberOfExecutionRetries(5);
flinkPipelineOptions.setExecutionRetryDelay(3000L);