Need to trigger consumption of a particular topic on a conditional basis. How can I do that in Spring Cloud Stream Kafka?
Detailed Scenario:
We are processing messages from Kafka and updating a database. If the DB is down or has any issues, we redirect the messages to a different topic.
Later, when the DB is up again, we need to pause the normal consumption, poll the topic of failed messages, and once that data has been written to the DB, resume the normal consumption.
We need to preserve the order of messages here, hence this approach.
I am currently trying this with ConsumerFactory, but it returns an empty collection.
Consumer<byte[], byte[]> consumer = consumerFactory.createConsumer("0", "consumer-1");
consumer.subscribe(Arrays.asList("some-topic"));
ConsumerRecords<byte[], byte[]> poll = consumer.poll(Duration.ofMillis(10000));
poll.forEach(record -> log.info("record {}", record));
Could someone help here, or suggest any other option to handle this?
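For reference, a minimal variation of the poll above that assigns the partitions of the failed-messages topic explicitly and seeks to the beginning, so the result does not depend on group management or committed offsets (topic name and consumer-factory call copied from the snippet above; whether this resolves the empty collection depends on the offset settings):

import java.time.Duration;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

Consumer<byte[], byte[]> consumer = consumerFactory.createConsumer("0", "consumer-1");
// Build the full partition list for the topic and take manual control of the positions.
List<TopicPartition> partitions = consumer.partitionsFor("some-topic").stream()
        .map(info -> new TopicPartition(info.topic(), info.partition()))
        .collect(Collectors.toList());
consumer.assign(partitions);
consumer.seekToBeginning(partitions);
ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(10000));
records.forEach(record -> log.info("record {}", record));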
I'm using Apache KafkaConsumer. I want to check if the consumer has any messages to return without polling. If I poll the consumer and there aren't any messages, then I get the message "Attempt to heartbeat failed since the group is rebalancing" in an infinite loop until the timeout expires, even though I have a records.isEmpty() clause. This is a snippet of my code:
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
if (records.isEmpty()) {
    log.info("No More Records");
    consumer.close();
} else {
    records.iterator().forEachRemaining(record -> log.info("RECORD: " + record));
}
This works fine until there are no more records. Once the result is empty, it logs "Attempt to heartbeat failed since the group is rebalancing" many times, logs "No More Records" once, and then continues to log the heartbeat error. What can I do to combat this, and how can I elegantly check (without any heartbeat messages) that there are no more records to poll?
Edit: I asked another question and the full code and context is on this link: How to get messages from Kafka Consumer one by one in java?
Thanks in advance!
From a comment: "Since I have a UI and want to receive a message one by one by clicking the "receive" button, there might be a case when there are no more messages to be polled."
In that case you need to create a new KafkaConsumer every time someone clicks on the "receive" button and then close it afterwards.
If you want to use the same KafkaConsumer for the lifetime of your client, you need to let the broker know that it is still alive (by sending a heartbeat, which is implicitly done through calling the poll method). Otherwise, as you have already experienced, the broker thinks your KafkaConsumer is dead and will initiate a rebalancing. As there is no other active Consumer available this rebalancing will not stop.
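For illustration, a minimal sketch of the per-click variant, assuming standard consumer properties and a single topic (the class and method names are hypothetical):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OneShotReceiver {

    // Called from the "receive" button handler; the consumer lives only for this call,
    // so no idle group member is left behind to trigger rebalancing.
    public ConsumerRecords<String, String> receiveOnce(Properties props, String topic) {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(topic));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
            consumer.commitSync(); // remember the position so the next click continues from here
            return records;
        }
    }
}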
A streaming application was rolled out to production, and after about 10 days we are observing errors/warnings in the CustomProductionExceptionHandler for expired transactions that belong to an older day window.
FLOW:
INPUT TOPIC --> STREAMING APPLICATION (produces stats and emits them after the day window closes) --> OUTPUT TOPIC
The producer keeps trying to publish records to the OUTPUT topic that belong to an already-expired older window, and logs an error in the CustomProductionExceptionHandler.
I have reduced the batch size, keeping the other settings at their defaults, but this change has not yet been promoted to production.
CustomProductionExceptionHandler implementation: the goal is to keep the streams application from dying due to NetworkException or TimeoutException.
With this implementation the producer does not retry, and in case of any exception it returns CONTINUE. On the other hand, upon returning FAIL, the stream thread dies and does not restart automatically. I need suggestions.
public class CustomProductionExceptionHandler implements ProductionExceptionHandler {

    @Override
    public ProductionExceptionHandlerResponse handle(final ProducerRecord<byte[], byte[]> record,
                                                     final Exception exception) {
        String recordKey = new String(record.key());
        String recordVal = new String(record.value());
        String recordTopic = record.topic();
        logger.error("Kafka message marked as processed although it failed. Message: [{}:{}], destination topic: [{}]",
                recordKey, recordVal, recordTopic, exception);
        return ProductionExceptionHandlerResponse.CONTINUE;
    }
}
Exception:
2019-12-20 16:31:37.576 ERROR com.jpmc.gpg.exception.CustomProductionExceptionHandler.handle(CustomProductionExceptionHandler.java:19) kafka-producer-network-thread | profile-day-summary-generator-291e69b1-5a3d-4d49-8797-252c2ae05607-StreamThread-19-producerid - Kafka message marked as processed although it failed. Message: [{"statistics":{}], destination topic: [OUTPUT-TOPIC]
org.apache.kafka.common.errors.TimeoutException: Expiring * record(s) for TOPIC:1086149 ms has passed since batch creation
I am trying to get answers to the questions below.
1) Why is the producer trying to publish older transactions, whose day window has already closed, to the OUTPUT topic?
Example: the producer is trying to send a 12/09 day-window transaction while the currently open window is 12/20.
2) The stream threads would have died without the CustomProductionExceptionHandler returning ProductionExceptionHandlerResponse.CONTINUE.
Is there any way for the producer to retry in case of a NetworkException or TimeoutException and then continue, instead of the stream thread dying?
The problem with returning ProductionExceptionHandlerResponse.CONTINUE in the CustomProductionExceptionHandler is that, in case of any exception, it skips publishing that record to the output topic and proceeds with the next records. There is no resiliency.
1) It's not really possible to answer this question without knowing what your program does. Note that, in general, Kafka Streams works on event time and handles out-of-order data.
2) You can configure all internally used clients of a Kafka Streams application (i.e., consumer, producer, admin client, and restore consumer) by specifying the corresponding client configuration in the Properties you pass into KafkaStreams. If you want different configs for different clients, you can prefix them accordingly, i.e., producer.retries instead of retries. Check out the docs for more details: https://docs.confluent.io/current/streams/developer-guide/config-streams.html#ak-consumers-producer-and-admin-client-configuration-parameters
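For illustration, a minimal sketch of such prefixed producer settings for a Kafka Streams application (the application id is taken from the log above; the broker address, the retry values, and the topology variable are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "profile-day-summary-generator"); // from the log above
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");             // placeholder broker address
// The "producer." prefix scopes these settings to the internally used producer only.
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), "10");
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRY_BACKOFF_MS_CONFIG), "1000");
// Give batches more time before they expire with a TimeoutException.
props.put(StreamsConfig.producerPrefix(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG), "300000");
KafkaStreams streams = new KafkaStreams(topology, props); // topology built elsewhere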
I have the following RabbitMQ consumer:
Consumer consumer = new DefaultConsumer(channel) {
    @Override
    public void handleDelivery(String consumerTag, Envelope envelope, AMQP.BasicProperties properties, byte[] body) throws IOException {
        String message = new String(body, "UTF-8");
        sendNotificationIntoTopic(message);
        saveIntoDatabase(message);
    }
};
The following situation can occur:
The message was sent to the topic successfully.
The connection to the database was lost, so the database insert failed.
As a result we have data inconsistency.
The expected result is that either both actions are executed successfully or neither is executed at all.
Are there any solutions for how I can achieve this?
P.S.
Currently I have the following idea (please comment on it):
We can assume that the broker doesn't lose any messages.
We have to be subscribed to the topic we want to send to.
1. Save an entry into the database with a status field set to 'pending'.
2. Attempt to send the data to the topic. If the send was successful, update the status field to 'success'.
3. Have a scheduled job that checks rows with 'pending' status. At that moment two cases are possible:
3.1 The notification wasn't sent at all.
3.2 The notification was sent, but saving into the database failed (the probability is very low, but it is possible).
So we have to distinguish those two cases somehow: we can store messages from the topic in a collection, and the job can check whether a message was accepted or not. If the job finds a message that corresponds to the database row, we update the status to 'success'; otherwise we remove the entry from the database.
I think my idea has some weaknesses (for example, if we have a multi-node application we have to store the messages in Hazelcast or a similar store, but that is an additional point of potential failure).
Here is an example of the Try Cancel Confirm pattern (https://servicecomb.apache.org/docs/distributed_saga_3/) that should be capable of dealing with your problem. You have to tolerate some chance of double submission of the data via the queue. Here is an example:
1. Define an abstraction Operation and assign an ID plus a timestamp to the operation.
2. Write status 'pending' to the database (you can do this in the same step as 1).
3. Write a listener that polls the database for all operations with status 'pending' that are older than a timeout.
4. For each pending operation, send the data via the queue with the assigned ID.
5. The recipient side should be aware of the ID, and if the ID has already been processed, nothing should happen (see the sketch below).
6A. If you need to be 100% sure that the operation has completed, you need a second queue where the recipient side posts a message 'ID - DONE'. If such consistency is not necessary, skip this step. Alternatively it can post 'ID - FAILED' with the reason for the failure.
6B. The submitting side either waits for a message from 6A or completes the operation by writing status 'DONE' to the database.
7. Once a certain timeout or retry limit has passed, write status 'FAIL' to the operation.
8. You can potentially send a rollback message with the ID to the recipient side.
Notice that none of these steps involves a technical transaction; you can do this with a non-transactional database.
What I have written is a variation of the Try Cancel Confirm pattern, where each recipient of a message should know how to manage its own data.
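A minimal sketch of the recipient-side check from step 5, assuming a simple processed_operations table with a unique constraint on the operation ID (table, column, and method names are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentRecipient {

    private final Connection connection;

    public IdempotentRecipient(Connection connection) {
        this.connection = connection;
    }

    // Returns true if this operation ID was seen for the first time and was processed,
    // false if it had already been processed (duplicate delivery) and was skipped.
    public boolean processOnce(String operationId, String payload) throws SQLException {
        // The unique constraint on operation_id makes a duplicate insert fail atomically.
        try (PreparedStatement insert = connection.prepareStatement(
                "INSERT INTO processed_operations (operation_id) VALUES (?)")) {
            insert.setString(1, operationId);
            insert.executeUpdate();
        } catch (SQLException duplicate) {
            return false; // already processed: do nothing, per step 5
        }
        applyBusinessLogic(payload); // hypothetical: the recipient's own data update
        return true;
    }

    private void applyBusinessLogic(String payload) {
        // ...
    }
}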
1. In the listener, save the database row with the field status='pending'.
2. Another job (a separate thread) will obtain all pending rows from the DB and, for each row:
2.1 send the data to the topic
2.2 save the result into the database
If we fail at step 1, everything is fine: the data is in a consistent state because the job won't know anything about it.
If we fail at step 2.1, no problem: the next job invocation will attempt to handle it.
If we fail at step 2.2, it means the next job invocation will handle the same data again. At first glance you might think that is a problem, but your consumer has to be idempotent: it has to recognize that the message was already processed and skip the processing. This requirement is a consequence of the fact that all message brokers guarantee at-least-once delivery, so our consumers have to be ready for duplicated messages anyway. No problem again.
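A minimal sketch of such a pending-rows job for the RabbitMQ case above, assuming a JDBC connection, a Channel, and a hypothetical outbox table with id, payload, and status columns (exchange and routing key are placeholders):

import com.rabbitmq.client.Channel;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PendingRowsJob {

    private final Connection db;
    private final Channel channel;

    public PendingRowsJob(Connection db, Channel channel) {
        this.db = db;
        this.channel = channel;
    }

    // Invoked periodically by a scheduler (e.g. a ScheduledExecutorService).
    public void run() throws Exception {
        channel.confirmSelect(); // publisher confirms: a send only counts once the broker has it
        try (PreparedStatement select = db.prepareStatement(
                "SELECT id, payload FROM outbox WHERE status = 'pending'");
             ResultSet rows = select.executeQuery()) {
            while (rows.next()) {
                long id = rows.getLong("id");
                String payload = rows.getString("payload");
                // 2.1: publish to the exchange and wait for the broker confirm
                channel.basicPublish("notifications", "routing-key", null, payload.getBytes("UTF-8"));
                channel.waitForConfirmsOrDie();
                // 2.2: mark the row as sent; if this fails, the next run resends,
                //      which is why the consumer has to be idempotent
                try (PreparedStatement update = db.prepareStatement(
                        "UPDATE outbox SET status = 'sent' WHERE id = ?")) {
                    update.setLong(1, id);
                    update.executeUpdate();
                }
            }
        }
    }
}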
Here's the pseudocode for how I'd do it (assuming the DAO layer has transactional capability and your messaging layer doesn't):
// Start a transaction
try {
    String message = new String(body, "UTF-8");
    // Ordering is important here, as I'm assuming the database has commit and rollback capabilities but the messaging system doesn't.
    saveIntoDatabase(message);
    sendNotificationIntoTopic(message);
} catch (MessageDeliveryException e) {
    // Roll back the transaction
    // Throw a domain-specific exception
}
// Commit the transaction
Scenarios:
1. If the database fails, the message won't be sent, as the exception will break the code flow.
2. If the database call succeeds and the messaging system fails to deliver, catch the exception and roll back the database changes.
All the actions necessary for logging and replaying the failures can live outside this method.
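A minimal concrete version of that pseudocode, assuming a plain JDBC Connection drives the transaction; saveIntoDatabase and sendNotificationIntoTopic are the helpers from the question, with an extra Connection parameter added here purely for illustration:

import java.sql.Connection;
import javax.sql.DataSource;

public void processMessage(byte[] body, DataSource dataSource) throws Exception {
    String message = new String(body, "UTF-8");
    try (Connection connection = dataSource.getConnection()) {
        connection.setAutoCommit(false); // start the transaction
        try {
            saveIntoDatabase(connection, message);  // DB write first: it can still be rolled back
            sendNotificationIntoTopic(message);     // messaging send second: it cannot be rolled back
            connection.commit();
        } catch (Exception e) {
            connection.rollback();                  // undo the DB write if the send (or anything else) failed
            throw e;                                // or wrap in a domain-specific exception
        }
    }
}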
If there is enough time to modify the design, it is recommended to use JTA-like APIs to manage a two-phase commit. Even WebLogic and WebSphere support XA resources for two-phase commit.
If the timeline is short, the steps below are suggested to reduce the failure window:
1. Send the data to the topic (no commit); in case the topic is down, retry at an interval.
2. Write the data into the DB.
3. Commit the DB.
4. Commit the topic.
Here a failure causes trouble only when step 4 fails. It results in the same message being sent again, so the receiving system will receive a duplicate message. Each message has a unique messageID and correlationID in the JMS 2.0 structure, so finding duplicates is fairly straightforward (but this has to be handled at the receiving system). A sketch of this ordering with a transacted JMS session follows below.
Both cases will work in a clustered environment as well.
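A minimal sketch of that ordering using a transacted JMS session plus a JDBC connection (the DB helper and payload are placeholders; with a transacted session the send in step 1 is not visible to consumers until session.commit() in step 4):

import java.sql.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;

public void sendAndStore(Session session, MessageProducer producer,
                         Connection db, String payload) throws Exception {
    // The session must have been created as transacted:
    // Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
    TextMessage message = session.createTextMessage(payload);
    producer.send(message);       // step 1: send, but it stays uncommitted in the session
    db.setAutoCommit(false);
    writeDataIntoDb(db, payload); // step 2: hypothetical DB write helper
    db.commit();                  // step 3: commit the DB
    session.commit();             // step 4: commit the topic; a crash here means a duplicate send on retry
}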
Sticking strictly to your case, the steps below might help overcome your issue.
Subscribe a listener, listener-1, to your topic.
Process-1
1. Add a DB entry with status 'to be sent' for message msg-1.
2. Send message msg-1 to the topic. Retry the send in case of any topic failure.
3. If step 2 fails after a certain number of retries, process-1 has to resend msg-1 before sending any new messages, or step 1 has to be rolled back.
Listener-1
Using the subscribed listener, read the reference (messageID/correlationID) from the topic, update the DB status to SENT, and read/remove the message from the topic. In case the reference read succeeds and the DB update fails, the topic still has the message, so the next read will update the DB. In case the DB update succeeds and the message removal fails, the listener will read it again and try to update a message that is already done, so it can be ignored after validation.
In case the listener itself is down, the topic keeps the messages until the listener reads them. Until then, sent messages will remain in status 'to be sent'.
I have a simple kafka setup. A producer is producing messages to a single partition with a single topic at a high rate. A single consumer is consuming messages from this partition. During this process, the consumer may pause processing messages several times. The pause can last a couple of minutes. After the producer stops producing messages, all messages queued up will be processed by the consumer. It appears that messages produced by the producer are not being seen immediately by the consumer. I am using kafka 0.10.1.0. What can be happening here? Here is the section of code that consumes the message:
while (true)
{
    try
    {
        ConsumerRecords<String, byte[]> records = consumer.poll(100);
        for (final ConsumerRecord<String, byte[]> record : records)
        {
            serviceThread.submit(() ->
            {
                externalConsumer.accept(record);
            });
        }
        consumer.commitAsync();
    } catch (org.apache.kafka.common.errors.WakeupException e)
    {
    }
}
where consumer is a KafkaConsumer with auto-commit disabled, max.poll.records of 100, and a session timeout of 30000. serviceThread is an ExecutorService.
The producer just calls KafkaProducer.send to send a ProducerRecord.
All configurations on the broker are left at the Kafka defaults.
I am also using kafka-consumer-groups.sh to check what is happening when the consumer is not consuming messages. But when this happens, kafka-consumer-groups.sh also hangs and cannot get information back. Sometimes it triggers a consumer rebalance, but not always.
For those who may find this helpful: I've encountered this problem (when Kafka silently, supposedly, stops consuming) often enough, and every single time it wasn't actually a problem with Kafka.
Usually it is some long-running or hung silent process that keeps Kafka from committing the offset, for example a DB client trying to connect to the DB. If you wait long enough (e.g. 15 minutes for SQLAlchemy and Postgres), you will see an exception printed to STDOUT saying something like "connection timed out".
I'm using Kafka, and we have a use case to build a fault-tolerant system where not even a single message should be missed. So here's the problem:
If publishing to Kafka fails for any reason (ZooKeeper down, Kafka broker down, etc.), how can we robustly handle those messages and replay them once things are back up again? Again, as I said, we cannot afford even a single message failure.
Another use case is that we also need to know, at any given point in time, how many messages failed to publish to Kafka for any reason, i.e. something like counter functionality, and those messages then need to be re-published.
One solution is to push those messages to some database that can handle that kind of load and also provide us with a very accurate counter facility (like Cassandra, where writes are very fast; but we also need counter functionality, and I gather Cassandra's counter functionality is not that great, so we don't want to use it).
This question is more from an architecture perspective, and then which technology to use to make that happen.
P.S.: We handle somewhere around 3000 TPS, so when the system starts failing, those failed messages can grow very quickly in a very short time. We're using Java-based frameworks.
Thanks for your help!
The reason Kafka was built in a distributed, fault-tolerant way is to handle problems exactly like yours: multiple failures of core components should avoid service interruptions. To avoid a down ZooKeeper, deploy at least 3 ZooKeeper instances (if this is in AWS, deploy them across availability zones). To avoid broker failures, deploy multiple brokers, and ensure you're specifying multiple brokers in your producer's bootstrap.servers property. To ensure that the Kafka cluster has written your message in a durable manner, set acks=all in the producer. This will acknowledge a client write when all in-sync replicas acknowledge reception of the message (at the expense of throughput). You can also set queuing limits to ensure that, if writes to the broker start backing up, you can catch an exception, handle it, and possibly retry.
Using Cassandra (another well-thought-out, distributed, fault-tolerant system) to "stage" your writes doesn't seem like it adds any reliability to your architecture, but it does increase the complexity. Plus, Cassandra wasn't written to be a message queue for a message queue; I would avoid this.
Properly configured, Kafka should be available to handle all your message writes and provide suitable guarantees.
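For illustration, a sketch of producer settings along those lines (broker addresses, topic, and values are placeholders; max.block.ms bounds how long send() blocks when the local buffer is full, so the failure can be caught and counted):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092"); // multiple brokers
props.put(ProducerConfig.ACKS_CONFIG, "all");          // wait for all in-sync replicas
props.put(ProducerConfig.RETRIES_CONFIG, "10");        // retry transient failures
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "5000"); // fail fast instead of blocking forever when the buffer is full
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
try {
    producer.send(new ProducerRecord<>("my-topic", "key", "value")).get(); // blocks until acked or fails
} catch (Exception e) {
    // The write failed even after retries: count it and stash the message for later replay.
}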
I am super late to the party, but I see something missing in the above answers :)
The strategy of choosing some distributed system like Cassandra is a decent idea. Once Kafka is up and healthy again, you can retry all the messages that were written into it.
I would like to answer the part about "knowing how many messages failed to publish at a given time".
From the tags, I see that you are using apache-kafka and kafka-consumer-api. You can write a custom callback for your producer, and this callback can tell you whether the message failed or was successfully published. On failure, log the metadata for the message.
Now you can use log-analyzing tools to analyze your failures. One such decent tool is Splunk.
Below is a small code snippet that explains the callback I was talking about:
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ProduceToKafka {

    private ProducerRecord<String, String> message = null;

    // TracerBulletProducer class has producer properties
    private KafkaProducer<String, String> myProducer = TracerBulletProducer.createProducer();

    public void publishMessage(String string) {
        ProducerRecord<String, String> message = new ProducerRecord<>("topicName", string);
        myProducer.send(message, new MyCallback(message.key(), message.value()));
    }

    class MyCallback implements Callback {

        private final String key;
        private final String value;

        public MyCallback(String key, String value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception == null) {
                log.info("--------> All good !!");
            } else {
                log.info("--------> not so good !!");
                log.info(metadata.toString());
                log.info("" + metadata.serializedValueSize());
                log.info(exception.getMessage());
            }
        }
    }
}
If you analyze the number of "--------> not so good !!" logs per time unit, you can get the required insights.
God speed !
Chris has already explained how to keep the system fault tolerant.
Kafka by default supports at-least-once message delivery semantics; it means that if something goes wrong while it tries to send a message, it will try to resend it.
When you create the Kafka producer properties, you can configure this by setting the retries option to more than 0.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:4242");
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
For more info check this.