I have a Beam pipeline that consumes streaming events and processes them through multiple stages (PTransforms). See the following code:
pipeline.apply("Read Data from Stream", StreamReader.read())
        .apply("Decode event and extract relevant fields", ParDo.of(new DecodeExtractFields()))
        .apply("Deduplicate process", ParDo.of(new Deduplication()))
        .apply("Conversion, Mapping and Persisting", ParDo.of(new DataTransformer()))
        .apply("Build Kafka Message", ParDo.of(new PrepareMessage()))
        .apply("Publish", ParDo.of(new PublishMessage()))
        .apply("Commit offset", ParDo.of(new CommitOffset()));
The streaming events are read using KafkaIO, and the StreamReader.read() method implementation looks like this:
public static KafkaIO.Read<String, String> read() {
    return KafkaIO.<String, String>read()
            .withBootstrapServers(Constants.BOOTSTRAP_SERVER)
            .withTopics(Constants.KAFKA_TOPICS)
            .withConsumerConfigUpdates(Constants.CONSUMER_PROPERTIES)
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class);
}
After we read a streamed event/message through KafkaIO, we can commit the offset.
What I need to do is commit the offset manually, inside the last Commit offset PTransform, once all the previous PTransforms have executed.
The reason is that I am doing some conversions, mappings and persisting in the middle of the pipeline, and only when all of that has completed without failure do I want to commit the offset.
That way, if the processing fails in the middle, I can consume the same record/event again and reprocess it.
My question is, how do I commit the offset manually?
I would appreciate it if you could share resources/sample code.
Well, for sure, there is the Read.commitOffsetsInFinalize() method, which is supposed to commit offsets while finalising the checkpoints, and the AUTO_COMMIT consumer config option, which is used to auto-commit read records by the Kafka consumer.
Though, in your case, it won't work: you need to do it manually by grouping the offsets of the same topic/partition/window and creating a new instance of a Kafka client in your CommitOffset DoFn which will commit these offsets. You need to group the offsets by partition, otherwise there may be a race condition when committing offsets of the same partition from different workers.
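As a rough sketch (plain Kafka client calls, not a KafkaIO feature): assume an upstream step has already grouped offsets per topic/partition/window and emits one hypothetical OffsetInfo element holding the topic, partition and maximum processed offset, and that Constants.CONSUMER_PROPERTIES contains the same group.id the reader uses. The CommitOffset DoFn could then look something like this:
public class CommitOffset extends DoFn<OffsetInfo, Void> {

    // Consumer used only for committing offsets, never for polling records.
    private transient KafkaConsumer<String, String> committer;

    @Setup
    public void setup() {
        Map<String, Object> props = new HashMap<>(Constants.CONSUMER_PROPERTIES);
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, Constants.BOOTSTRAP_SERVER);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        committer = new KafkaConsumer<>(props);
    }

    @ProcessElement
    public void processElement(@Element OffsetInfo element) {
        // Commit offset + 1 so the group resumes after the last fully processed record.
        committer.commitSync(Collections.singletonMap(
                new TopicPartition(element.getTopic(), element.getPartition()),
                new OffsetAndMetadata(element.getOffset() + 1)));
    }

    @Teardown
    public void teardown() {
        if (committer != null) {
            committer.close();
        }
    }
}
OffsetInfo and its getters are placeholders; the important points are that offsets are already grouped per partition before they reach this DoFn, and that the commit targets the same consumer group the pipeline reads with.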
Related
Need to trigger consumption of a particular topic on a conditional basis. How to do that in Spring Cloud Stream Kafka?
Detailed scenario:
We are processing messages from Kafka and updating a database. If the DB is down or has any issues, we redirect the messages to a different topic.
Later, when the DB is up again, we need to pause the actual consumption, poll the topic of failed messages, and once that data is updated in the DB, resume the actual consumption.
We need to preserve the order of messages here, hence this approach.
I am currently trying with ConsumerFactory, but it's returning an empty collection.
Consumer<byte[], byte[]> consumer = consumerFactory.createConsumer("0", "consumer-1");
consumer.subscribe(Arrays.asList("some-topic"));
ConsumerRecords<byte[], byte[]> poll = consumer.poll(Duration.ofMillis(10000));
poll.forEach(record -> log.info("record {}", record));
Could someone help here, or suggest any other option to handle this?
I am trying to use Consumer.committablePartitionedSource() and create a stream per partition, as shown below:
public void setup() {
    control = Consumer.committablePartitionedSource(consumerSettings,
            Subscriptions.topics("chat").withPartitionAssignmentHandler(new PartitionAssignmentListener()))
        .mapAsyncUnordered(Integer.MAX_VALUE, pair -> setupSource(pair, committerSettings))
        .toMat(Sink.ignore(), Consumer::createDrainingControl)
        .run(Materializer.matFromSystem(actorSystem));
}

private CompletionStage<Done> setupSource(Pair<TopicPartition, Source<ConsumerMessage.CommittableMessage<String, String>, NotUsed>> pair, CommitterSettings committerSettings) {
    LOGGER.info("SETTING UP PARTITION-{} SOURCE", pair.first().partition());
    return pair.second()
        .mapAsync(16, msg -> CompletableFuture.supplyAsync(() -> consumeMessage(msg), actorSystem.dispatcher())
            .thenApply(param -> msg.committableOffset()))
        .withAttributes(ActorAttributes.supervisionStrategy(ex -> Supervision.restart()))
        .runWith(Committer.sink(committerSettings), Materializer.matFromSystem(actorSystem));
}
While setting up the source per partition, I am using a parallelism value that I want to change based on the number of partitions assigned to the node. I can do that on the first assignment of partitions to the node. But as new nodes join the cluster, assigned partitions are revoked and reassigned; this time the stream does not re-emit Sources for the partitions the node already holds (due to the Kafka cooperative rebalancing protocol), so I cannot reconfigure the parallelism.
Here I am sharing the same dispatcher across all sources, and if I keep the same parallelism on rebalancing, I feel that giving each partition a fair chance at message processing is not possible. Am I correct? Please correct me if not.
If I understand you correctly, you want to have a fixed parallelism across a dynamically changing number of Sources that come and go as Kafka rebalances topic partitions.
Have a look at the first example in the Alpakka Kafka documentation here. It can be adjusted to your example like this:
Consumer.DrainingControl<Done> control =
Consumer.committablePartitionedSource(consumerSettings, Subscriptions.topics("chat"))
.wireTap(p -> LOGGER.info("SETTING UP PARTITION-{} SOURCE", p.first().partition()))
.flatMapMerge(Integer.MAX_VALUE, Pair::second)
.mapAsync(
16,
msg -> CompletableFuture
.supplyAsync(() -> consumeMessage(msg),
actorSystem.dispatcher())
.thenApply(param -> msg.committableOffset()))
.withAttributes(
ActorAttributes.supervisionStrategy(
ex -> Supervision.restart()))
.toMat(Committer.sink(committerSettings), Consumer::createDrainingControl)
.run(Materializer.matFromSystem(actorSystem));
So basically, Consumer.committablePartitionedSource() will emit a Source any time Kafka assigns a partition to this consumer, and it will terminate such a Source when a previously assigned partition is rebalanced away from this consumer.
The flatMapMerge will take those Sources and merge the messages they output.
All those messages will compete in the mapAsync stage to get processed. The fairness of this competition really comes down to the flatMapMerge above, which should give all the Sources an equal chance to emit their messages. Regardless of how many Sources are outputting messages, they will all share a fixed parallelism here, which I believe is what you're after.
All those messages eventually get to the Committer.sink that handles offset committing.
I am a Reactor newbie. I am trying to develop the following application logic:
Read messages from a Kafka topic source.
Transform the messages.
Write a subset of the transformed messages to a new Kafka topic target.
Explicitly acknowledge the reading operation for all the messages originally read from topic source.
The only solution I found is to rewrite the above business logic as follows.
Read messages from a Kafka topic source.
Transform the messages.
Immediately acknowledge the messages that will not be written to topic target.
Filter out all the above messages.
Write the rest of the transformed messages to the new Kafka topic target.
Explicitly acknowledge the reading operation for these messages.
The code implementing the second logic is the following:
receiver.receive()
.flatMap(this::processMessage)
.map(this::acknowledgeMessagesNotToWriteInKafka)
.filter(this::isMessageToWriteInKafka)
.as(this::sendToKafka)
.doOnNext(r -> r.correlationMetadata().acknowledge());
Clearly, the receiver type is KafkaReceiver, and the sendToKafka method uses a KafkaSender. One of the things I don't like is that I am using a map to acknowledge some messages.
Is there any better solution to implement the original logic?
This is not exactly your four business logic steps, but I think it's a little bit closer to what you want.
You could acknowledge the "discarded" messages that won't be written in .doOnDiscard after .filter...
receiver.receive()
.flatMap(this::processMessage)
.filter(this::isMessageToWriteInKafka)
.doOnDiscard(ReceiverRecord.class, record -> record.receiverOffset().acknowledge())
.as(this::sendToKafka)
.doOnNext(r -> r.correlationMetadata().acknowledge());
Note: you'll need to use the proper object type that was discarded. I don't know what type of object the Publisher returned from processMessage emits, but I assume you can get the ReceiverRecord or ReceiverOffset from it in order to acknowledge it.
Alternatively, you could combine filter/doOnDiscard into a single .handle operator...
receiver.receive()
.flatMap(this::processMessage)
.handle((m, sink) -> {
if (isMessageToWriteInKafka(m)) {
sink.next(m);
} else {
m.getReceiverRecord().getReceiverOffset().acknowledge();
}
})
.as(this::sendToKafka)
.doOnNext(r -> r.correlationMetadata().acknowledge());
I have a custom Kafka consumer which I use to send some requests to a REST API.
According to the response from the API, I either commit the offset or skip the message without commit.
Minimal example:
while (true) {
    ConsumerRecords<String, Object> records = consumer.poll(200);
    for (ConsumerRecord<String, Object> record : records) {
        // Sending a POST request and retrieving the answer
        // ...
        if (responseCode.startsWith("2")) {
            try {
                consumer.commitSync();
            } catch (CommitFailedException ex) {
                ex.printStackTrace();
            }
        } else {
            // Do Nothing
        }
    }
}
Now, when a response from the REST API does not start with a 2, the offset is not committed, but the message is also not re-consumed. How can I force the consumer to re-consume messages with uncommitted offsets?
Make sure your processing is idempotent if you are planning to use seek(). Since you are selectively committing offsets, the records left out may sit before committed (successfully processed) records. If you do seek(), which moves your group's position back to the uncommitted offset, and start the replay, you will get those successfully processed messages again as well. It also has the potential of becoming an infinite loop.
Alternatively, you can save the unsuccessful records' metadata in memory or a DB and replay the topic from the beginning (records are only available within retention.ms), so that all records are replayed, but add a filter so that only those records whose metadata matches what you saved earlier go through the API. Do this as a batch process once every hour or every few hours, as sketched below.
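A minimal sketch of that batch replay, assuming loadFailedKeys() and sendPostRequest() are your own helpers (the first returns the keys/metadata saved on failure, the second wraps the POST call), and that this runs as a separate job with its own group.id so it does not disturb the main consumer's committed offsets:
Set<String> failedKeys = loadFailedKeys();                 // metadata saved when the REST call failed

consumer.subscribe(Collections.singletonList("your-topic"));
while (consumer.assignment().isEmpty()) {
    consumer.poll(Duration.ofMillis(100));                 // poll until partitions are assigned
}
consumer.seekToBeginning(consumer.assignment());           // replay everything still within retention.ms

while (!failedKeys.isEmpty()) {
    ConsumerRecords<String, Object> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, Object> record : records) {
        if (failedKeys.contains(record.key())) {           // filter: only re-send previously failed records
            if (sendPostRequest(record).startsWith("2")) {
                failedKeys.remove(record.key());
            }
        }
    }
}
A real job would also bound the number of passes so a record that keeps failing does not keep the loop running forever.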
Committing offsets is just a way to store the current offset, also known as the position, of the consumer. That way, in case it stops, it (or the new consumer instance taking over) can find its previous position and restart consuming from there.
So even if you don't commit, the consumer's position moves on once you receive records. If you want to reconsume some records, you have to change the consumer's current position.
With the Java client, you can set the position using seek().
In your scenario, you probably want to calculate the new position relative to the current position. If so, you can find the current position using position().
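For instance, a minimal sketch on top of your existing loop (sendPostRequest() is an assumed helper wrapping the POST call): on a non-2xx response, seek back to the failed record so the next poll() returns it again.
for (TopicPartition tp : records.partitions()) {
    for (ConsumerRecord<String, Object> record : records.records(tp)) {
        if (sendPostRequest(record).startsWith("2")) {
            // Commit offset + 1 so the group resumes after this record on a restart.
            consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(record.offset() + 1)));
        } else {
            consumer.seek(tp, record.offset());  // rewind the position; the record will be re-polled
            break;                               // skip the rest of this partition's batch, keep the order
        }
    }
}
Note that this only rewinds the in-memory position of the failed partition; records from the other partitions in the same batch keep being processed normally.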
Below are alternate approaches you can take (instead of seek()):
When the REST call fails, move the message to an ad hoc Kafka topic. You can write another program to read the messages from this topic on a schedule (see the sketch after this list).
When the REST call fails, write the request to a flat file. Use a shell script (or any other script) to read each request and resend it on a schedule.
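A sketch of the first option. The topic name "rest-failures", producerProps and sendPostRequest() are assumptions, not part of your code:
KafkaProducer<String, Object> retryProducer = new KafkaProducer<>(producerProps);

while (true) {
    ConsumerRecords<String, Object> records = consumer.poll(Duration.ofMillis(200));
    for (ConsumerRecord<String, Object> record : records) {
        if (!sendPostRequest(record).startsWith("2")) {
            // Park the failed record on an ad hoc topic; a separate scheduled job re-reads it later.
            retryProducer.send(new ProducerRecord<>("rest-failures", record.key(), record.value()));
        }
    }
    consumer.commitSync();  // commit the batch: every record either succeeded or was redirected
}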
The new Kafka version (0.11) supports exactly-once semantics.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
I've got a producer set up with Kafka transactional code in Java like this:
producer.initTransactions();
try {
    producer.beginTransaction();
    for (ProducerRecord<String, String> record : payload) {
        producer.send(record);
    }
    Map<TopicPartition, OffsetAndMetadata> groupCommit = new HashMap<TopicPartition, OffsetAndMetadata>() {
        {
            put(new TopicPartition(TOPIC, 0), new OffsetAndMetadata(42L, null));
        }
    };
    producer.sendOffsetsToTransaction(groupCommit, "groupId");
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    producer.close();
} catch (KafkaException e) {
    producer.abortTransaction();
}
I'm not quite sure how to use sendOffsetsToTransaction and what its intended use case is. AFAIK, consumer groups are a multithreaded read feature on the consumer end.
javadoc says
" Sends a list of consumed offsets to the consumer group coordinator, and also marks those offsets as part of the current transaction. These offsets will be considered consumed only if the transaction is committed successfully. This method should be used when you need to batch consumed and produced messages together, typically in a consume-transform-produce pattern."
How would a producer maintain a list of consumed offsets? What's the point of that?
This is only relevant to workflows in which you are consuming and then producing messages based on what you consumed. This function allows you to commit offsets you consumed only if the downstream producing succeeds. If you consume data, process it somehow, and then produce the result, this enables transactional guarantees across the consumption/production.
Without transactions, you normally use Consumer#commitSync() or Consumer#commitAsync() to commit consumer offsets. But if you use these methods before you've produced with your producer, you will have committed offsets before knowing whether the producer succeeded sending.
So, instead of committing your offsets with the consumer, you can use Producer#sendOffsetsToTransaction() on the producer to commit the offsets instead. This sends the offsets to the transaction coordinator handling the transaction. It will commit the offsets only if the entire transaction, consuming and producing, succeeds.
(Note: when you send the offsets to commit, you should add 1 to the offset last read, so that future reads resume from the offset you haven't read. This is true regardless of whether you commit with the consumer or the producer. See: KafkaProducer sendOffsetsToTransaction need offset+1 to successfully commit current offset).
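To make that concrete, here is a rough consume-transform-produce sketch. The topic names, group id and transform() are placeholders; the consumer is assumed to have enable.auto.commit=false and the producer a transactional.id:
producer.initTransactions();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
    if (records.isEmpty()) {
        continue;
    }
    producer.beginTransaction();
    try {
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> record : records) {
            producer.send(new ProducerRecord<>("output-topic", record.key(), transform(record.value())));
            // +1: store the next offset to read, as noted above.
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset() + 1));
        }
        producer.sendOffsetsToTransaction(offsets, "my-consumer-group");
        producer.commitTransaction();
    } catch (ProducerFencedException e) {
        producer.close();                // fatal: this producer instance cannot continue
        break;
    } catch (KafkaException e) {
        producer.abortTransaction();     // output records and offsets are discarded together
    }
}
Aborting the transaction discards both the produced records and the offsets, so the next poll after a restart re-reads the uncommitted input, which is exactly the consume-transform-produce guarantee the javadoc describes.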