Adjusting parallism based on number of partitions assigned in Consumer.committablePartitionedSource - java

I am trying to use Consumer.committablePartitionedSource() and creating stream per partition as shown below
public void setup() {
control = Consumer.committablePartitionedSource(consumerSettings,
Subscriptions.topics("chat").withPartitionAssignmentHandler(new PartitionAssignmentListener()))
.mapAsyncUnordered(Integer.MAX_VALUE, pair -> setupSource(pair, committerSettings))
.toMat(Sink.ignore(), Consumer::createDrainingControl)
.run(Materializer.matFromSystem(actorSystem));
}
private CompletionStage<Done> setupSource(Pair<TopicPartition, Source<ConsumerMessage.CommittableMessage<String, String>, NotUsed>> pair, CommitterSettings committerSettings) {
LOGGER.info("SETTING UP PARTITION-{} SOURCE", pair.first().partition());
return pair.second().mapAsync(16, msg -> CompletableFuture.supplyAsync(() -> consumeMessage(msg), actorSystem.dispatcher())
.thenApply(param -> msg.committableOffset()))
.withAttributes(ActorAttributes.supervisionStrategy(ex -> Supervision.restart()))
.runWith(Committer.sink(committerSettings), Materializer.matFromSystem(actorSystem));
}
While setting up the source per partition I am using parallelism which I want to change based on no of partitions assigned to the node. That I can do that in the first assignment of partitions to the node. But as new nodes join the cluster assigned partitions are revoked and assigned. This time stream not emitting already existing partitions(due to kafka cooperative rebalancing protocol) to reconfigure parallelism.
Here I am sharing the same dispatcher across all sources and if I keep the same parallelism on rebalancing I feel the fair chance to each partition message processing is not possible. Am I correct? Please correct me

If I understand you correctly you want to have a fixed parallelism across dynamically changing number of Sources that come and go as Kafka is rebalancing topic partitions.
Have a look at first example in the Alpakka Kafka documentation here. It can be adjusted to your example like this:
Consumer.DrainingControl<Done> control =
Consumer.committablePartitionedSource(consumerSettings, Subscriptions.topics("chat"))
.wireTap(p -> LOGGER.info("SETTING UP PARTITION-{} SOURCE", p.first().partition()))
.flatMapMerge(Integer.MAX_VALUE, Pair::second)
.mapAsync(
16,
msg -> CompletableFuture
.supplyAsync(() -> consumeMessage(msg),
actorSystem.dispatcher())
.thenApply(param -> msg.committableOffset()))
.withAttributes(
ActorAttributes.supervisionStrategy(
ex -> Supervision.restart()))
.toMat(Committer.sink(committerSettings), Consumer::createDrainingControl)
.run(Materializer.matFromSystem(actorSystem));
So basically the Consumer.committablePartitionedSource() will emit a Source anytime Kafka assigns partition to this consumer and will terminate such Source when previously assigned partition is rebalanced and taken away from this consumer.
The flatMapMerge will take those Sources and merge the messages they output.
All those messages will compete in the mapAsync stage to get processed. The fairness of this competing is really down to the flatMapMerge above that should give equal chance for all the Sources to emit their messages. Regardless of how many Sources are outputing messages, they will all share a fixed parallelism here, which I believe is what you're after.
All those messages eventually get to the Commiter.sink that handles offset committing.

Related

Handling commits for errors with #KafkaListener in batch consumers

We have a Kafka Consumer setup like below
#Bean
public ConsumerFactory<String, Object> consumerFactory() {
final Map<String, Object> props = kafkaProperties.buildConsumerProperties();
return new DefaultKafkaConsumerFactory<>(props);
}
#Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> batchFactory(
final ConsumerFactory<String, Object> consumerFactory,
#Value("${someProp.batch}") final boolean enableBatchListener,
#Value("${someProp.concurrency}") final int consumerConcurrency,
#Value("${someProp.error.backoff.ms}") final int errorBackoffInterval
) {
final SeekToCurrentBatchErrorHandler errorHandler = new SeekToCurrentBatchErrorHandler();
errorHandler.setBackOff(new FixedBackOff(errorBackoffInterval, UNLIMITED_ATTEMPTS));
final var containerFactory = new ConcurrentKafkaListenerContainerFactory<String, Object>();
containerFactory.setConsumerFactory(consumerFactory);
containerFactory.getContainerProperties().setAckMode(MANUAL_IMMEDIATE);
containerFactory.getContainerProperties().setMissingTopicsFatal(false);
containerFactory.setBatchListener(enableBatchListener);
containerFactory.setConcurrency(consumerConcurrency);
containerFactory.setBatchErrorHandler(errorHandler);
return containerFactory;
}
someProp:
concurrency: 16
batch: true
error.backoff.ms: 2000
spring:
kafka:
bootstrap-servers: ${KAFKA_BOOTSTRAP_SERVERS}
consumer:
groupId: some-grp
autoOffsetReset: earliest
keyDeserializer: org.apache.kafka.common.serialization.StringDeserializer
valueDeserializer: io.confluent.kafka.serializers.KafkaAvroDeserializer
properties:
schema.registry.url: ${SCHEMA_REGISTRY_URL}
specific.avro.reader: true
security.protocol: SSL
In batch listener method annotated with #KafkaListener, we call acknowledgment.acknowledge() at the end of processing of the list. Assuming that when the service comes up, I already have a million messages in the topic ready to be consumed by the service, I have following questions with respect to this scenario as I could not find documentation which talks in detail regarding the batch listening:
The listener will read 500 messages in the list. 500 because max.poll.records is not set and hence defaults to 500, so the list will have 500 messages. Is this understanding correct?
Given the above, where does the consumer concurrency come into picture? Does the stated configuration mean I will have 16 consumers each of which can read 500 messages in parallel from the same topic?
I understand, in this case I must have at least 16 partitions to make use of all the consumers otherwise I would be left with consumers who do nothing?
Due to SeekToCurrentBatchErrorHandler, the batch will be replayed in case there is any exception in processing inside the listener method. So, if in a particular batch there is an exception processing the 50th message, first 49 will be played again (basically duplicates, which I am fine with), next 50 to 500 messages will be played and tried for processing as usual. Is this understanding correct?
If there are multiple batches being read continuously and a particular consumer thread gets stuck with the SeekToCurrentBatchErrorHandler, how is the offset commit handled, as other consumer threads would still be processing the messages successfully thus moving the offset pointer way forward then the stuck consumers offsets
The doc for MANUAL_IMMEDIATE states
/**
* User takes responsibility for acks using an
* {#link AcknowledgingMessageListener}. The consumer
* immediately processes the commit.
*/
MANUAL_IMMEDIATE,
Does this mean calling acknowledgment.acknowledge() is not sufficient and AcknowledgingMessageListener has to be used in some way? If yes, what is the preferred approach.
You will get "up to" 500; there is no guarantee you will get exactly 500.
Yes; 16 consumers (assuming you have at least 16 partitions).
Correct.
Correct; but version 2.5 now has the RecoveringBatchErrorHandler whereby you can throw a special exception to tell it where in the batch the error occurred; it will commit the offsets of the successful records and seek the remaining ones.
The consumers get unique partitions so a consumer that is "stuck" has no impact on other consumers.
I am not sure what you are asking there; if you are calling ack.acknowledge() you are already using an AcknowledgingMessageListener (#KafkaListener always has that capability; we only populate the ack with a manual ack mode.
However, you really don't need to use manual acks for this use case; the container will commit the offsets automatically when the listener exits normally; no need to unnecessarily complicate your code.

Is there a way to get the last message from Kafka topic?

I have a Kafka topic with multiple partitions and I wonder if there is a way in Java to fetch the last message for the topic. I don't care for the partitions I just want to get the latest message.
I have tried #KafkaListener but it fetches the message only when the topic is updated. If there is nothing published after the application is opened nothing is returned.
Maybe the listener is not the right approach to the problem at all?
This following snippet worked for me. You may try this. Explanation in the comments.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
consumer.subscribe(Collections.singletonList(topic));
consumer.poll(Duration.ofSeconds(10));
consumer.assignment().forEach(System.out::println);
AtomicLong maxTimestamp = new AtomicLong();
AtomicReference<ConsumerRecord<String, String>> latestRecord = new AtomicReference<>();
// get the last offsets for each partition
consumer.endOffsets(consumer.assignment()).forEach((topicPartition, offset) -> {
System.out.println("offset: "+offset);
// seek to the last offset of each partition
consumer.seek(topicPartition, (offset==0) ? offset:offset - 1);
// poll to get the last record in each partition
consumer.poll(Duration.ofSeconds(10)).forEach(record -> {
// the latest record in the 'topic' is the one with the highest timestamp
if (record.timestamp() > maxTimestamp.get()) {
maxTimestamp.set(record.timestamp());
latestRecord.set(record);
}
});
});
System.out.println(latestRecord.get());
You'll have to consume the latest message from each partition and then do a comparison on the client side (using the timestamp on the message, if it contains it). The reason for this is that Kafka does not guarantee inter-partition ordering. Inside a partition, you can be sure that the message with the largest offset is the latest message pushed to it.

Conditional logic on a Reactor Flux

I am a Reactor newbie. I am trying to develop the following application logic:
Read messages from a Kafka topic source.
Transform the massages.
Write a subset of the transformed messages to a new Kafka topic target.
Explicitly acknowledge the reading operation for all the messages originally read from topic source.
The only solution I found is to rewrite the above business logic as it follows.
Read messages from a Kafka topic source.
Transform the massages.
Immediately acknowledge the message not be written to topic target.
Filter all the above messages.
Write the rest of the transformed messages to the new Kafka topic target.
Explicitly acknowledge the reading operation for these messages
The code implementing the second logic is the following:
receiver.receive()
.flatMap(this::processMessage)
.map(this::acknowledgeMessagesNotToWriteInKafka)
.filter(this::isMessageToWriteInKafka)
.as(this::sendToKafka)
.doOnNext(r -> r.correlationMetadata().acknowledge());
Clearly, receiver type is KafkaReceiver, and method sendToKafka uses a KafkaSender. One of the things I don't like is that I am using a map to acknowledge some messages.
Is there any better solution to implement the original logic?
This is not exactly your four business logic steps, but I think it's a little bit closer to what you want.
You could acknowledge the "discarded" messages that won't be written in .doOnDiscard after .filter...
receiver.receive()
.flatMap(this::processMessage)
.filter(this::isMessageToWriteInKafka)
.doOnDiscard(ReceiverRecord.class, record -> record.receiverOffset().acknowledge())
.as(this::sendToKafka)
.doOnNext(r -> r.correlationMetadata().acknowledge());
Note: you'll need to use the proper object type that was discarded. I don't know what type of object the Publisher returned from processMessage emits, but I assume you can get the ReceiverRecord or ReceiverOffset from it in order to acknowledge it.
Alternatively, you could combine filter/doOnDiscard into a single .handle operator...
receiver.receive()
.flatMap(this::processMessage)
.handle((m, sink) -> {
if (isMessageToWriteInKafka(m)) {
sink.next(m);
} else {
m.getReceiverRecord().getReceiverOffset().acknowledge();
}
})
.as(this::sendToKafka)
.doOnNext(r -> r.correlationMetadata().acknowledge());

How to verify the size of partition or Topic in Kafka?

I want to verify the partition size before producing the the record in Kafka.
I have a custom partitioned class which gives me exact partition number in which my message is supposed to drop.
Now my requirement is I want to check the size of partition before sending my record.
List<String> users = userService.findAllUsers();
for (String user : users) {
String msg = "Hello " + user;
//Check size here
producer.send(new ProducerRecord<String, String>(topic, user, msg), new Callback() {
public void onCompletion(RecordMetadata metadata, Exception e) {
if (e != null) {
e.printStackTrace();
}
}
});
Is there any way in kafka I can achieve this ?
Capacity is purely a Kafka-broker level aspect.
Basically, if there is disk space in broker's data directory, you can deliver the message. The message are cleaned up by broker based on time and partition size (that's in broker configuration), so if you configure your broker accordingly, you might always have space - the old messages would just get thrown away. It might not suit your business usecase though.
Also, responding to your comment Can we check size of Topic, you can actually check the current size of partition by using beginningOffsets & endOffsets methods in KafkaConsumer. Beware that these methods might block if the partitions do not exist (at least in 0.10.2). e.g. when you request data for partition 4 when topic actually contains 3 partitions.
Kafka 0.11 introduces administrative capabilities in client, however it is still work in progress.

Kafka pattern subscription. Rebalancing is not being triggered on new topic

According to the documentation on kafka javadocs if I:
Subscribe to a pattern
Create a topic that matches the pattern
A rebalance should occur, which makes the consumer read from that new topic. But that's not happening.
If I stop and start the consumer, it does pick up the new topic. So I know the new topic matches the pattern. There's a possible duplicate of this question in https://stackoverflow.com/questions/37120537/whitelist-filter-in-kafka-doesnt-pick-up-new-topics but that question got nowhere.
I'm seeing the kafka logs and there are no errors, it just doesn't trigger a rebalance. The rebalance is triggered when consumers join or die, but not when new topics are created (not even when partitions are added to existing topics, but that's another subject).
I'm using kafka 0.10.0.0, and the official Java client for the "New Consumer API", meaning broker GroupCoordinator instead of fat client + zookeeper.
This is the code for the sample consumer:
public class SampleConsumer {
public static void main(String[] args) throws IOException {
KafkaConsumer<String, String> consumer;
try (InputStream props = Resources.getResource("consumer.props").openStream()) {
Properties properties = new Properties();
properties.load(props);
properties.setProperty("group.id", "my-group");
System.out.println(properties.get("group.id"));
consumer = new KafkaConsumer<>(properties);
}
Pattern pattern = Pattern.compile("mytopic.+");
consumer.subscribe(pattern, new SampleRebalanceListener());
while (true) {
ConsumerRecords<String, String> records = consumer.poll(1000);
for (ConsumerRecord<String, String> record : records) {
System.out.printf("%s %s\n", record.topic(), record.value());
}
}
}
}
In the producer, I'm sending messages to topics named mytopic1, mytopic2, etc.
Patterns are pretty much useless if the rebalance is not triggered.
Do you know why the rebalance is not happening?
The documentation mentions "The pattern matching will be done periodically against topics existing at the time of check.". It turns out the "periodically" corresponds to the metadata.max.age.ms property. By setting that property (inside "consumer.props" in my code sample) to i.e. 5000 I can see it detects new topics and partitions every 5 seconds.
This is as designed, according to this jira ticket https://issues.apache.org/jira/browse/KAFKA-3854:
The final note on the JIRA stating that a later created topic that matches a consumer's subscription pattern would not be assigned to the consumer upon creation seems to be as designed. A repeat subscribe() to the same pattern would be needed to handle that case.
The refresh metadata polling does the "repeat subscribe()" mentioned in the ticket.
This is confusing coming from Kafka 0.8 where there was true triggering based on zookeper watches, instead of polling. IMO 0.9 is more of a downgrade for this scenario, instead of "just in time" rebalancing, this becomes either high frequency polling with overhead, or low frequency polling with long times before it reacts to new topics/partitions.
to trigger a rebalance immediately, you can explicitly make a poll call after subscribe to the topic:
kafkaConsumer.poll(pollDuration);
refer to:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-568%3A+Explicit+rebalance+triggering+on+the+Consumer
In your consumer code, use the following:
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, EARLIEST)
and try again

Categories

Resources