According to the Kafka javadocs, if I:
Subscribe to a pattern
Create a topic that matches the pattern
then a rebalance should occur, which makes the consumer read from the new topic. But that's not happening.
If I stop and start the consumer, it does pick up the new topic, so I know the new topic matches the pattern. There's a possible duplicate of this question at https://stackoverflow.com/questions/37120537/whitelist-filter-in-kafka-doesnt-pick-up-new-topics but that question got nowhere.
I've checked the Kafka logs and there are no errors; a rebalance is just never triggered. Rebalances happen when consumers join or die, but not when new topics are created (nor when partitions are added to existing topics, but that's another subject).
I'm using Kafka 0.10.0.0 and the official Java client for the "new consumer API", meaning the broker-side GroupCoordinator instead of the fat client + ZooKeeper.
This is the code for the sample consumer:
import com.google.common.io.Resources;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.regex.Pattern;

public class SampleConsumer {
    public static void main(String[] args) throws IOException {
        KafkaConsumer<String, String> consumer;
        try (InputStream props = Resources.getResource("consumer.props").openStream()) {
            Properties properties = new Properties();
            properties.load(props);
            properties.setProperty("group.id", "my-group");
            System.out.println(properties.get("group.id"));
            consumer = new KafkaConsumer<>(properties);
        }
        // Subscribe to every topic whose name matches the pattern
        Pattern pattern = Pattern.compile("mytopic.+");
        consumer.subscribe(pattern, new SampleRebalanceListener());
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s %s\n", record.topic(), record.value());
            }
        }
    }
}
In the producer, I'm sending messages to topics named mytopic1, mytopic2, etc.
Patterns are pretty much useless if the rebalance is not triggered.
Do you know why the rebalance is not happening?
The documentation mentions "The pattern matching will be done periodically against topics existing at the time of check." It turns out "periodically" corresponds to the metadata.max.age.ms property. By setting that property (inside "consumer.props" in my code sample) to e.g. 5000, I can see the consumer detect new topics and partitions every 5 seconds.
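If you prefer setting it in code rather than in the props file, a minimal sketch (the property name is the standard consumer config; the 5000 ms value is just the one from my test):

// Refresh cluster metadata every 5 seconds, so topics created after the
// subscribe() call are matched against the pattern on the next refresh.
properties.setProperty("metadata.max.age.ms", "5000");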
This is as designed, according to this jira ticket https://issues.apache.org/jira/browse/KAFKA-3854:
The final note on the JIRA stating that a later created topic that matches a consumer's subscription pattern would not be assigned to the consumer upon creation seems to be as designed. A repeat subscribe() to the same pattern would be needed to handle that case.
The refresh metadata polling does the "repeat subscribe()" mentioned in the ticket.
This is confusing coming from Kafka 0.8, where there was true triggering based on ZooKeeper watches instead of polling. IMO 0.9 is more of a downgrade for this scenario: instead of "just in time" rebalancing, you get either high-frequency polling with overhead, or low-frequency polling with long delays before it reacts to new topics/partitions.
To trigger a rebalance immediately, you can explicitly call poll after subscribing to the topic:
kafkaConsumer.poll(pollDuration);
Refer to:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-568%3A+Explicit+rebalance+triggering+on+the+Consumer
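Side note: the KIP linked above was only implemented in a much later release (Kafka 2.6, if I remember correctly), where it adds Consumer.enforceRebalance(). On clients that have it, a minimal sketch would be:

// Ask for a rebalance explicitly instead of waiting for the next
// metadata refresh; the rebalance happens on the following poll().
kafkaConsumer.enforceRebalance();
kafkaConsumer.poll(Duration.ofMillis(1000));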
In your consumer code, use the following:
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
and try again.
Related
I want to create a single Kafka consumer for several topics. The consumer's subscribe method accepts a list of topics, like this:
private Consumer<String, byte[]> createConsumer() {
    Properties props = getConsumerProps();
    Consumer<String, byte[]> consumer = new KafkaConsumer<>(props);
    // Prefix each configured topic with the system ID
    ArrayList<String> topicMISL = new ArrayList<>();
    for (String s : Connect2Redshift.kafkaTopics) {
        topicMISL.add(systemID + "." + s);
    }
    consumer.subscribe(topicMISL);
    return consumer;
}
private boolean consumeMessages(Duration duration, Consumer<String, byte[]> consumer) {
    long start = System.currentTimeMillis();
    ConsumerRecords<String, byte[]> consumerRecords = consumer.poll(duration);
    // ... process the records (truncated in the original question) ...
    return !consumerRecords.isEmpty();
}
Afterwards I want to poll records from Kafka into a stream every 3 seconds and process them, but I wonder what happens inside this consumer: how will records from different topics be polled, first one topic and then another, or in parallel? Could it be that one topic with a large amount of messages is processed all the time while another topic with a small amount of messages has to wait?
In general it depends on your topic settings. Kafka scales by using multiple partitions per topic.
If you have 3 partitions on 1 topic, Kafka can read from them in parallel.
The same is true for multiple topics: reading can happen in parallel.
If you have a partition that receives far more messages than the others, you may run into consumer lag on that particular partition. Tweaking the batch size and consumer settings may help, as can compressing messages.
Ideally, distributing the load evenly across partitions avoids this scenario.
Look into this blog article; it gave me a good understanding of the internals: https://www.confluent.io/blog/configure-kafka-to-minimize-latency/
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(3000));
for (TopicPartition partition : records.partitions()) {
    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
    for (ConsumerRecord<String, String> record : partitionRecords) {
        // process each record of this partition, in order
    }
}
You also need to commit offsets: find the last processed offset for each partition and commit it with consumer.commitSync.
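A minimal sketch of that per-partition commit, placed inside the partition loop above, following the pattern from the KafkaConsumer javadocs (the +1 is because the committed offset must be the position of the next message to read):

long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
consumer.commitSync(Collections.singletonMap(
        partition, new OffsetAndMetadata(lastOffset + 1)));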
We have a Kafka Consumer setup like below
@Bean
public ConsumerFactory<String, Object> consumerFactory() {
    final Map<String, Object> props = kafkaProperties.buildConsumerProperties();
    return new DefaultKafkaConsumerFactory<>(props);
}

@Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> batchFactory(
        final ConsumerFactory<String, Object> consumerFactory,
        @Value("${someProp.batch}") final boolean enableBatchListener,
        @Value("${someProp.concurrency}") final int consumerConcurrency,
        @Value("${someProp.error.backoff.ms}") final int errorBackoffInterval
) {
    final SeekToCurrentBatchErrorHandler errorHandler = new SeekToCurrentBatchErrorHandler();
    errorHandler.setBackOff(new FixedBackOff(errorBackoffInterval, UNLIMITED_ATTEMPTS));
    final var containerFactory = new ConcurrentKafkaListenerContainerFactory<String, Object>();
    containerFactory.setConsumerFactory(consumerFactory);
    containerFactory.getContainerProperties().setAckMode(MANUAL_IMMEDIATE);
    containerFactory.getContainerProperties().setMissingTopicsFatal(false);
    containerFactory.setBatchListener(enableBatchListener);
    containerFactory.setConcurrency(consumerConcurrency);
    containerFactory.setBatchErrorHandler(errorHandler);
    return containerFactory;
}
someProp:
  concurrency: 16
  batch: true
  error.backoff.ms: 2000

spring:
  kafka:
    bootstrap-servers: ${KAFKA_BOOTSTRAP_SERVERS}
    consumer:
      groupId: some-grp
      autoOffsetReset: earliest
      keyDeserializer: org.apache.kafka.common.serialization.StringDeserializer
      valueDeserializer: io.confluent.kafka.serializers.KafkaAvroDeserializer
      properties:
        schema.registry.url: ${SCHEMA_REGISTRY_URL}
        specific.avro.reader: true
        security.protocol: SSL
In the batch listener method annotated with @KafkaListener, we call acknowledgment.acknowledge() at the end of processing the list. Assuming that when the service comes up there are already a million messages in the topic ready to be consumed, I have the following questions about this scenario, as I could not find documentation that discusses batch listening in detail:
The listener will read 500 messages into the list: 500 because max.poll.records is not set and hence defaults to 500. Is this understanding correct?
Given the above, where does consumer concurrency come into the picture? Does the stated configuration mean I will have 16 consumers, each of which can read 500 messages in parallel from the same topic?
I understand that in this case I must have at least 16 partitions to make use of all the consumers; otherwise I would be left with consumers that do nothing?
Due to SeekToCurrentBatchErrorHandler, the batch will be replayed if there is any exception during processing inside the listener method. So if, in a particular batch, there is an exception processing the 50th message, the first 49 will be played again (duplicates, which I am fine with), and messages 50 to 500 will be played and processed as usual. Is this understanding correct?
If multiple batches are being read continuously and a particular consumer thread gets stuck in the SeekToCurrentBatchErrorHandler, how is the offset commit handled, given that other consumer threads would still be processing messages successfully and thus moving the offset pointer far past the stuck consumer's offsets?
The doc for MANUAL_IMMEDIATE states:
/**
 * User takes responsibility for acks using an
 * {@link AcknowledgingMessageListener}. The consumer
 * immediately processes the commit.
 */
MANUAL_IMMEDIATE,
Does this mean calling acknowledgment.acknowledge() is not sufficient and AcknowledgingMessageListener has to be used in some way? If yes, what is the preferred approach?
You will get "up to" 500; there is no guarantee you will get exactly 500.
Yes; 16 consumers (assuming you have at least 16 partitions).
Correct.
Correct; but version 2.5 now has the RecoveringBatchErrorHandler, whereby you can throw a special exception to tell it where in the batch the error occurred; it will commit the offsets of the successful records and seek the remaining ones (see the sketch after these answers).
The consumers get unique partitions so a consumer that is "stuck" has no impact on other consumers.
I am not sure what you are asking there; if you are calling ack.acknowledge() you are already using an AcknowledgingMessageListener (@KafkaListener always has that capability; we only populate the ack with a manual ack mode).
However, you really don't need to use manual acks for this use case; the container will commit the offsets automatically when the listener exits normally; no need to unnecessarily complicate your code.
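To illustrate the RecoveringBatchErrorHandler mentioned above, a minimal sketch of a batch listener that reports the failing record's index, assuming spring-kafka 2.5+ (the topic name and the process() method are placeholders):

@KafkaListener(topics = "some-topic", containerFactory = "batchFactory")
public void listen(List<ConsumerRecord<String, Object>> records) {
    for (int i = 0; i < records.size(); i++) {
        try {
            process(records.get(i)); // hypothetical per-record processing
        } catch (Exception e) {
            // Tells the RecoveringBatchErrorHandler which record failed, so it
            // commits offsets of the earlier records and seeks/retries from here.
            throw new BatchListenerFailedException("record processing failed", e, i);
        }
    }
}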
@Configuration
public class KafkaConfiguration {

    @Value("${kafka.boot.server}")
    private String kafkaServer;

    @Bean
    public KafkaTemplate<String, String> kafkaTemplate() {
        return new KafkaTemplate<>(producerConfig());
    }

    @Bean
    public ProducerFactory<String, String> producerConfig() {
        Map<String, Object> config = new HashMap<>();
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaServer);
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);
        return new DefaultKafkaProducerFactory<>(config);
    }
}
What are the prerequisites for Kafka? What do you suggest for publishing messages? What other ways are there?
delivery.timeout.ms:
If your use case involves bursts of events in a short time, this value should be higher, because when the network is busy your client will complain about NetworkException; increasing it means you see fewer NetworkExceptions.
For understanding what delivery.timeout.ms is, see:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-91+Provide+Intuitive+User+Timeouts+in+The+Producer?source=post_page-----fa3910d9aa54----------------------#KIP-91ProvideIntuitiveUserTimeoutsinTheProducer-TestPlan
acks: If you need no data loss, you have to set it to all. The default is 1, meaning
the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case, should the leader fail immediately after acknowledging the record but before the followers have replicated it, the record will be lost.
retries: It depends on your Kafka client version. The default is now Integer.MAX_VALUE,
but on earlier versions you will want to set retries to a higher value so your producer does not stop because of a simple transient exception such as the leader partition being unreachable.
Exactly-once: If your app requires exactly-once semantics, you have to look at enable.idempotence and transactional.id.
Note that the configs mentioned here have corresponding constants in ProducerConfig in the Java client.
Further reference of producer settings:
https://docs.confluent.io/current/installation/configuration/producer-configs.html
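Pulling these settings together, a minimal sketch of a durability-focused producer configuration (the broker addresses are placeholders, and the exact timeout value depends on your workload):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder addresses
props.put("acks", "all");                   // wait for all in-sync replicas
props.put("delivery.timeout.ms", "120000"); // total budget for a send, including retries
props.put("retries", Integer.MAX_VALUE);    // retry transient errors within that budget
props.put("enable.idempotence", "true");    // avoid duplicates introduced by retries
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);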
I'm using Kafka and we have a use case to build a fault tolerant system where not even a single message should be missed. So here's the problem:
If publishing to Kafka fails for any reason (ZooKeeper down, Kafka broker down, etc.), how can we robustly handle those messages and replay them once things are back up? As I said, we cannot afford to lose even a single message.
Another requirement is that at any given point in time we need to know how many messages failed to publish to Kafka for whatever reason, i.e. something like counter functionality, and those messages then need to be re-published.
One solution is to push those messages to some database (like Cassandra, where writes are very fast, but we also need counter functionality, and Cassandra's counter functionality is not that great, so we don't want to use it) that can handle that kind of load and also give us a very accurate counter facility.
This question is more from architecture perspective and then which technology to use to make that happen.
PS: We handle somewhere around 3000 TPS, so when the system starts failing, the failed messages can grow very fast in a very short time. We're using Java-based frameworks.
Thanks for your help!
The reason Kafka was built in a distributed, fault-tolerant way is to handle problems exactly like yours: multiple failures of core components should avoid service interruptions. To avoid a down ZooKeeper, deploy at least 3 instances of ZooKeeper (if this is in AWS, deploy them across availability zones). To avoid broker failures, deploy multiple brokers, and make sure you specify multiple brokers in your producer's bootstrap.servers property. To ensure that the Kafka cluster has written your message in a durable manner, set acks=all in the producer; this acknowledges a client write once all in-sync replicas have received the message (at the expense of throughput). You can also set queuing limits so that if writes to the broker start backing up, you can catch an exception, handle it, and possibly retry (see the sketch below).
Using Cassandra (another well thought out distributed, fault tolerant system) to "stage" your writes doesn't seem like it adds any reliability to your architecture, but does increase the complexity, plus Cassandra wasn't written to be a message queue for a message queue, I would avoid this.
Properly configured, Kafka should be available to handle all your message writes and provide suitable guarantees.
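As a hedged sketch of the queuing-limits idea from above: bound how long send() may block when the client-side buffer fills up, then catch and count the failure (buffer.memory and max.block.ms are standard producer configs; failedCounter and the topic name are hypothetical):

props.put("buffer.memory", 33554432); // client-side buffer for unsent records
props.put("max.block.ms", 5000);      // fail send() fast instead of blocking forever

try {
    producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
} catch (Exception e) {
    // Buffer full, brokers unreachable, timeout, etc.: count the failure and
    // stash the message somewhere durable for later replay.
    failedCounter.increment(); // hypothetical counter
}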
I am super late to the party, but I see something missing in the above answers :)
The strategy of choosing some distributed system like Cassandra is a decent idea: once Kafka is up and healthy again, you can retry all the messages that were written into it.
I would like to answer the part about "knowing how many messages failed to publish at a given time".
From the tags, I see that you are using apache-kafka and kafka-consumer-api. You can write a custom callback for your producer, and this callback can tell you whether the message failed or was published successfully. On failure, log the metadata for the message.
Now, you can use log analyzing tools to analyze your failures. One such decent tool is Splunk.
Below is a small code snippet that better explains the callback I was talking about:
public class ProduceToKafka {

    private ProducerRecord<String, String> message = null;
    // TracerBulletProducer class has producer properties
    private KafkaProducer<String, String> myProducer = TracerBulletProducer.createProducer();

    public void publishMessage(String string) {
        ProducerRecord<String, String> message = new ProducerRecord<>("topicName", string);
        myProducer.send(message, new MyCallback(message.key(), message.value()));
    }

    class MyCallback implements Callback {
        private final String key;
        private final String value;

        public MyCallback(String key, String value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception == null) {
                log.info("--------> All good !!");
            } else {
                // Note: depending on the client version, metadata may be empty
                // (or null) when an exception is passed to the callback.
                log.info("--------> not so good !!");
                log.info(metadata.toString());
                log.info("" + metadata.serializedValueSize());
                log.info(exception.getMessage());
            }
        }
    }
}
If you analyze the number of "--------> not so good !!" logs per time unit, you can get the required insights.
Godspeed!
Chris already explained how to keep the system fault tolerant.
Kafka by default supports at-least-once message delivery semantics: if something goes wrong while sending a message, the producer will try to resend it.
When you build the Kafka producer properties, you can configure this by setting the retries option to a value greater than 0.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:4242");
props.put("acks", "all");
props.put("retries", 3); // greater than 0, so failed sends are retried
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
For more info check this.
I have written a simple program to read data from Kafka and print it in Flink. Below is the code:
public static void main(String[] args) throws Exception {
    Options flinkPipelineOptions = PipelineOptionsFactory.create().as(Options.class);
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Class<?> unmodColl = Class.forName("java.util.Collections$UnmodifiableCollection");
    env.getConfig().addDefaultKryoSerializer(unmodColl, UnmodifiableCollectionsSerializer.class);
    env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);

    flinkPipelineOptions.setJobName("MyFlinkTest");
    flinkPipelineOptions.setStreaming(true);
    flinkPipelineOptions.setCheckpointingInterval(1000L);
    flinkPipelineOptions.setNumberOfExecutionRetries(5);
    flinkPipelineOptions.setExecutionRetryDelay(3000L);

    Properties p = new Properties();
    p.setProperty("zookeeper.connect", "localhost:2181");
    p.setProperty("bootstrap.servers", "localhost:9092");
    p.setProperty("group.id", "test");

    FlinkKafkaConsumer09<Notification> kafkaConsumer =
            new FlinkKafkaConsumer09<>("testFlink", new ProtoDeserializer(), p);
    DataStream<Notification> input = env.addSource(kafkaConsumer);

    input.rebalance().map(new MapFunction<Notification, String>() {
        @Override
        public String map(Notification value) throws Exception {
            return "Kafka and Flink says: " + value.toString();
        }
    }).print();

    env.execute();
}
I need Flink to process my data from Kafka exactly once, and I have a few questions on how that can be done:
When does FlinkKafkaConsumer09 commit the processed offsets to Kafka?
Say my topic has 10 messages and the consumer processes all 10. When I stop the job and start it again, it starts processing random messages from the set of previously read messages. I need to ensure none of my messages are processed twice.
Please advise; I appreciate the help. Thanks.
This page describes the fault tolerance guarantees of the Flink Kafka connector.
You can use Flink's savepoints to re-start a job in an exactly-once (state preserving) manner.
The reason why you are seeing the messages again is because the offsets committed by Flink to the Kafka broker / Zookeeper are not in line with Flink's registered state.
You'll always see messages processed multiple times after a restore / failure in Flink, even with exactly-once semantics enabled. Flink's exactly-once guarantees are with respect to registered state, not to the records sent to the operators.
Slightly off-topic: What are these lines for? They are not passed to Flink anywhere.
Options flinkPipelineOptions = PipelineOptionsFactory.create().as(Options.class);
flinkPipelineOptions.setJobName("MyFlinkTest");
flinkPipelineOptions.setStreaming(true);
flinkPipelineOptions.setCheckpointingInterval(1000L);
flinkPipelineOptions.setNumberOfExecutionRetries(5);
flinkPipelineOptions.setExecutionRetryDelay(3000L);