I am very new to Kafka and I am working on a project to learn and understand Kafka.
I am running Kafka on my laptop, so I have one producer and one consumer, and I'm working with Java (Spring Boot) to listen to those streams and consume the messages.
Let's say I have 2 different groups created, called "automatic" and "manual".
For the "automatic" one, I do not want the messages to be processed right away. I want to aggregate them for 1 minute, and when the minute passes, fire off some custom event.
But for the "manual" one, I want to consume the message and fire off the event right away.
When I send a message from the producer, however, it goes to the common topic either way, and a property in the message says whether it is a "manual" or "automatic" type.
Here is my Kafka topic declaration in my application.properties file.
spring.cloud.stream.kafka.bindings.automatic.consumer.configuration.client.id=automatic-consumption-event
spring.cloud.stream.bindings.automatic.destination=main.event
spring.cloud.stream.bindings.automatic.binder=test-stream-app
spring.cloud.stream.bindings.automatic.group=consumer-automatic-group
spring.cloud.stream.bindings.automatic.consumer.concurrency=1
spring.cloud.stream.kafka.bindings.manual.consumer.configuration.client.id=manual-consumption-event
spring.cloud.stream.bindings.manual.destination=main.event
spring.cloud.stream.bindings.manual.binder=test-app
spring.cloud.stream.bindings.manual.group=consumer-manual-group
spring.cloud.stream.bindings.manual.consumer.concurrency=1
I have created 2 separate methods to consume these and perform the different actions, like this.
private SessionWindows windows;
@PostConstruct
private void init() {
    this.windows = SessionWindows.with(Duration.ofSeconds(5)).grace(Duration.ZERO);
}
public void automatic(KStream<String, CustomObjectType> eventStream) {
    eventStream.filter((x, y) -> y != null && !y.isManual(), Named.as("automatic_event"))
            .groupByKey(Grouped.with("participant_id", Serdes.String(), Serdes.Long()))
            .windowedBy(windows)
            .reduce(Long::sum, Named.as("participant_id_sum"))
            .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
            .toStream(Named.as("participant_id_stream"))
            .foreach(this::fireCustomEvent);
}
@StreamListener("manual-event")
public void manual(@Payload String payload) {
    var parsedObject = this.parseJSON(payload);
    if (!parsedObject.isManual()) {
        return;
    }
    this.fireCustomEvent();
}
private CustomObjectType parseJSON(String json) {
    return JSONObject.parseObject(json.substring(json.indexOf("{")), CustomObjectType.class);
}
private void fireCustomEvent() {
    // Should do something.
}
I ran the producer with this command on my system.
bin/kafka-console-producer.sh --topic main.event --property "parse.key=true" --property "key.separator=:" --bootstrap-server localhost:62341
And I ran the consumer with this command:
bin/kafka-console-consumer.sh --topic main.event --from-beginning --bootstrap-server localhost:62341
These are the events I'm passing via the producer:
123: {"eventHeader": null, "data": {"headline": "You are winner", "id": "42", "isManual": true}}
987: {"eventHeader": null, "data": {"headline": "You will win", "id": "43", "isManual": false}}
Whenever an event is passed by the producer, I can see manual() triggering with the message. It does the expected thing of taking the message and firing the event right away. But it is consuming both types of messages, and the problem is that the "automatic" messages are no longer aggregated, because they have already been taken by this consumer.
Every time I restart my Spring Boot application, the automatic() method triggers, but it does not find any messages to filter because, as per my understanding, they were already consumed.
Can someone help me figure out where the confusion is?
I'm not sure I understand the question. Spring will start both functions "automatically". (You also had a typo, Ktream instead of KStream, in the automatic() parameters.)
"consuming both types of messages"
Right... because both types exist in the same topic. Perhaps you want to use the branch/split operator in Kafka Streams to populate a separate topic with all manual events, which your "manual" method reads instead?
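A minimal sketch of that idea inside the automatic topology (the two target topic names are hypothetical, appropriate default serdes for CustomObjectType are assumed, and KStream.branch() is shown, which newer Kafka Streams versions replace with split()):
KStream<String, CustomObjectType>[] branches = eventStream.branch(
    (key, value) -> value != null && value.isManual(),   // manual events
    (key, value) -> value != null                        // remaining automatic events
);
branches[0].to("main.event.manual");      // read by the "manual" listener
branches[1].to("main.event.automatic");   // read by the aggregating topology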
"because they were consumed already"
That doesn't matter. What matters is that offsets were committed. You can reconsume a topic as many times as you want, as long as the data is retained in the topic.
To force reconsumption, you can:
- use KafkaConsumer.seek,
- run kafka-consumer-groups --reset-offsets after you stop the app, or
- give the app a new application.id/group.id along with the consumer config auto.offset.reset=earliest.
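For example, the KafkaConsumer.seek route could look like this sketch (topic name taken from the question; a real version would handle the case where the first poll returns before an assignment arrives):
consumer.subscribe(Collections.singletonList("main.event"));
consumer.poll(Duration.ofMillis(100));               // join the group and receive an assignment
consumer.seekToBeginning(consumer.assignment());     // rewind every assigned partition
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));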
I'm using Apache KafkaConsumer. I want to check if the consumer has any messages to return without polling. If I poll the consumer and there aren't any messages, then I get the message "Attempt to heartbeat failed since the group is rebalancing" in an infinite loop until the timeout expires, even though I have a records.isEmpty() clause. This is a snippet of my code:
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
if (records.isEmpty()) {
    log.info("No More Records");
    consumer.close();
} else {
    records.iterator().forEachRemaining(record -> log.info("RECORD: " + record));
}
This works fine until records are empty. Once they are, it logs "Attempt to heartbeat failed since the group is rebalancing" many times, logs "No More Records" once, and then continues to log the heartbeat error. What can I do to combat this, and how can I elegantly check (without any heartbeat messages) that there are no more records to poll?
Edit: I asked another question and the full code and context is on this link: How to get messages from Kafka Consumer one by one in java?
Thanks in advance!
From a comment: "Since I have a UI and want to receive a message one by one by clicking the 'receive' button, there might be a case when there are no more messages to be polled."
In that case you need to create a new KafkaConsumer every time someone clicks the "receive" button, and close it afterwards.
If you want to use the same KafkaConsumer for the lifetime of your client, you need to let the broker know that it is still alive (by sending a heartbeat, which happens implicitly when you call the poll method). Otherwise, as you have already experienced, the broker considers your KafkaConsumer dead and initiates a rebalance. And since no other active consumer is available, this rebalancing will never stop.
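A sketch of the first option, a short-lived consumer per click (topic, group id, and logger are placeholders; closing the consumer leaves the group cleanly, so no heartbeat is expected between clicks):
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "ui-receive");                  // reuse the group so committed offsets carry over
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);                     // at most one record per click
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("my-topic"));
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
    records.forEach(record -> log.info("RECORD: " + record));
    consumer.commitSync();                                                // remember progress for the next click
}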
Hello, I have an issue that I'm trying to solve. I have a Kafka Streams topology that reads JSON messages from a Kafka topic; each message gets deserialized into a POJO. The topology then checks that message for a certain boolean flag. If the flag is true, it does some transformation and writes the message back to the topic. If the flag is false, however, I want it to write nothing, but I'm not sure how to go about that. With MP Reactive Messaging I could use an RxJava 2 Flowable stream and return something like Flowable.empty(), but it seems I can't use that method here.
JsonbSerde<FinancialMessage> financialMessageSerde = new JsonbSerde<>(FinancialMessage.class);
StreamsBuilder builder = new StreamsBuilder();
builder.stream(TOPIC_NAME, Consumed.with(Serdes.Integer(), financialMessageSerde))
    .mapValues(message -> checkCondition(message))
    .to(TOPIC_NAME, Produced.with(Serdes.Integer(), financialMessageSerde));
Below is the logic of the function it calls.
public FinancialMessage checkCondition(FinancialMessage rawMessage) {
    FinancialMessage receivedMessage = rawMessage;
    if (receivedMessage.compliance_services) {
        receivedMessage.compliance_services = false;
        return receivedMessage;
    }
    return null;
}
If the boolean is false, it just writes a JSON body of "null" to the topic.
I've tried changing the return type of the checkCondition function to a wrapped one, like
public Flowable<FinancialMessage> checkCondition(FinancialMessage rawMessage)
and then having the if return Flowable.just(receivedMessage) or Flowable.empty(), but I can't seem to serialize the Flowable object. This might be a silly question, but is there a better way to go about this?
Note that Kafka messages are immutable and are not deleted after being read, and that if you read from and write to the same topic with a single application, a message will be processed infinitely often (or, to be more precise, different copies of it will) unless you have a condition that "breaks" the cycle.
Also, if for example 5 services read from the same topic, all 5 services get a copy of every event. And if one service writes back, the other 4 services and the writing service itself will read the message again. Thus, you get quite some data amplification.
If you have different services that react to the original input message consecutively, you could put one topic between each pair of consecutive services to really build a pipeline.
Last, you say that if the boolean flag is true you want to transform the message and emit it (I assume for the next service to consume), and for false you want to do nothing. I further assume that for a given message only a single flag will be true, and that a successful transformation also switches the flag (to enable processing by the next service). In this case, it's best if you can ensure that each original input message has the same initial boolean flag set, so you can build your pipeline: only the corresponding service will read messages with its boolean flag set (you don't even need to check the flag, as the upstream write ensures that it's set; you could keep just a sanity check).
If you don't know which boolean flag is set initially and all services read from the same input topic, just filtering out the message is correct. If all services read all messages, four services will filter a given message while one service processes it and emits a new message with a different flag. For this architecture a single topic might work: once a message has been processed by all services and all boolean flags are false, writing it back to the input topic means all services will correctly drop that last copy. However, using a single topic implies a lot of redundant reading and writing.
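For the topology in the question, a minimal sketch of that filtering approach could look like this (OUTPUT_TOPIC is a hypothetical separate target topic, per the pipeline advice above):
builder.stream(TOPIC_NAME, Consumed.with(Serdes.Integer(), financialMessageSerde))
    .filter((key, message) -> message.compliance_services)   // messages with a false flag are dropped; nothing is written for them
    .mapValues(message -> checkCondition(message))
    .to(OUTPUT_TOPIC, Produced.with(Serdes.Integer(), financialMessageSerde));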
Maybe the best architecture is to have your original input topic plus one additional input topic for each service, and an additional "dispatcher" service that reads from the original input topic and branches the KStream into the service input topics according to the boolean flags. This way, each service reads only messages with the right flag set to true. Furthermore, after its transformation each service uses branch() again to write the message to the input topic of the correct next service. Last, you would want an output topic that each service can write to once a message is fully processed.
I'm writing an application with Spring Boot, so to write to Kafka I do:
@Autowired
private KafkaTemplate<String, String> kafkaTemplate;
and then inside my method:
kafkaTemplate.send(topic, data)
But I feel like I'm just relying on this to work. How can I know whether it has worked? If it's asynchronous, is it good practice to return a 200 code and hope it worked? I'm confused. If Kafka isn't available, won't this fail? Shouldn't I be prompted to catch an exception?
Along with what @mjuarez has mentioned, you can try playing with two Kafka producer properties. One is ProducerConfig.ACKS_CONFIG, which lets you set the level of acknowledgement that you think is safe for your use case. This knob has three possible values. From the Kafka docs:
acks=0: Producer doesn't care about acknowledgement from server, and considers it as sent.
acks=1: This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers.
acks=all: This means the leader will wait for the full set of in-sync replicas to acknowledge the record.
The other property is ProducerConfig.RETRIES_CONFIG. Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error.
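A sketch of both properties on a plain producer (bootstrap server and serializers are placeholder choices):
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.ACKS_CONFIG, "all");   // wait for all in-sync replicas to acknowledge
props.put(ProducerConfig.RETRIES_CONFIG, 3);    // resend records that fail with potentially transient errors
KafkaProducer<String, String> producer = new KafkaProducer<>(props);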
Yes, if Kafka is not available, that .send() call will fail, but if you send it asynchronously, no one will be notified. You can specify a callback to be executed when the future finally finishes. The full interface spec is here: https://kafka.apache.org/20/javadoc/org/apache/kafka/clients/producer/Callback.html
From the official Kafka javadoc here: https://kafka.apache.org/20/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
Fully non-blocking usage can make use of the Callback parameter to
provide a callback that will be invoked when the request is complete.
ProducerRecord<byte[], byte[]> record = new ProducerRecord<byte[], byte[]>("the-topic", key, value);
producer.send(record,
    new Callback() {
        public void onCompletion(RecordMetadata metadata, Exception e) {
            if (e != null) {
                e.printStackTrace();
            } else {
                System.out.println("The offset of the record we just sent is: " + metadata.offset());
            }
        }
    });
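Since the question uses Spring's KafkaTemplate, the same idea there is to attach a callback to the future returned by send(); a sketch against the spring-kafka 2.x ListenableFuture API (3.x returns a CompletableFuture instead, so you would use whenComplete):
kafkaTemplate.send(topic, data).addCallback(
    result -> log.info("Sent, offset=" + result.getRecordMetadata().offset()),
    ex -> log.error("Send failed", ex)
);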
You can run the command below while sending messages to Kafka:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topic-name
While the above command is running, run your code; if the messages are sent successfully, they must be printed on the console.
Furthermore, as with any connection to any resource, if the connection cannot be established, then any operation will raise an exception.
I am building a system that will receive messages via a message broker (currently JMS) from different systems. All the messages from all the sender systems have a deviceId, and there is no order in the reception of the messages.
For instance, system A can send a message with deviceId=1 and system B can send a message with deviceId=2.
My goal is to not start processing messages concerning a given deviceId until I have received a message from every sender with that same deviceId.
For example, if I have 3 systems A, B and C sending messages to my system :
System A sends messageA1 with deviceId=1
System B sends messageB1 with deviceId=1
System C sends messageC1 with deviceId=3
System C sends messageC2 with deviceId=1 <--- here I should start processing messageA1, messageB1 and messageC2, because they all have deviceId 1.
Should this problem be resolved by some sync mechanism in my system, by the message broker, or by an integration framework like Spring Integration / Apache Camel?
A solution similar to the Aggregator that @Artem Bilan mentioned can also be implemented in Camel, with a custom AggregationStrategy and by controlling the aggregator's completion through the Exchange.AGGREGATION_COMPLETE_CURRENT_GROUP property.
The following might be a good starting point. (You can find the sample project with tests here)
Route:
from("direct:start")
    .log(LoggingLevel.INFO, "Received ${headers.system}${headers.deviceId}")
    .aggregate(header("deviceId"), new SignalAggregationStrategy(3))
    .log(LoggingLevel.INFO, "Signaled body: ${body}")
    .to("direct:result");
SignalAggregationStrategy.java
public class SignalAggregationStrategy extends GroupedExchangeAggregationStrategy implements Predicate {
    private int numberOfSystems;
    public SignalAggregationStrategy(int numberOfSystems) {
        this.numberOfSystems = numberOfSystems;
    }
    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        Exchange exchange = super.aggregate(oldExchange, newExchange);
        List<Exchange> aggregatedExchanges = exchange.getProperty("CamelGroupedExchange", List.class);
        // Complete aggregation if we have "numberOfSystems" (currently 3) different messages (where "system" headers are different)
        // https://github.com/apache/camel/blob/master/camel-core/src/main/docs/eips/aggregate-eip.adoc#completing-current-group-decided-from-the-aggregationstrategy
        if (numberOfSystems == aggregatedExchanges.stream().map(e -> e.getIn().getHeader("system", String.class)).distinct().count()) {
            exchange.setProperty(Exchange.AGGREGATION_COMPLETE_CURRENT_GROUP, true);
        }
        return exchange;
    }
    @Override
    public boolean matches(Exchange exchange) {
        // Make the aggregation never complete on its own
        // (4th bullet point at https://github.com/apache/camel/blob/master/camel-core/src/main/docs/eips/aggregate-eip.adoc#about-completion)
        return false;
    }
}
Hope it helps!
You can do this in Apache Camel using a caching component; I think there is an EHCache component.
Essentially:
You receive a message with a given deviceId say deviceId1.
You look up in your cache to see which messages have been received for deviceId1.
As long as you have not received all three you add the current system/message to the cache.
Once all messages are there you process and clear the cache.
You could then, of course, route each incoming message to a deviceId-specific queue for temporary storage. This can be JMS, ActiveMQ or something similar.
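A rough sketch of those steps with a plain in-memory map standing in for the cache (Message, getDeviceId(), and process() are hypothetical placeholders; a real setup would use EHCache and handle concurrency and timeouts):
private final Map<String, List<Message>> pending = new ConcurrentHashMap<>();
public void onMessage(Message message) {
    List<Message> group = pending.computeIfAbsent(message.getDeviceId(), id -> new ArrayList<>());
    group.add(message);
    if (group.size() == 3) {                  // assuming each of the 3 systems sends exactly once per deviceId
        pending.remove(message.getDeviceId());
        process(group);                       // all senders have reported: process the complete group
    }
}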
Spring Integration provides a component for exactly this kind of task: do not emit until the whole group has been collected. Its name is the Aggregator. Your deviceId is definitely the correlationKey. The releaseStrategy really may be based on the number of systems: how many deviceId=1 messages you wait for before proceeding to the next step.
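A minimal sketch with the Spring Integration Java DSL (channel names are placeholders, and the release strategy assumes exactly three sender systems):
@Bean
public IntegrationFlow aggregateByDevice() {
    return IntegrationFlows.from("devicesIn")
            .aggregate(a -> a
                    .correlationExpression("headers['deviceId']")
                    .releaseStrategy(group -> group.size() == 3))   // one message from each of the 3 systems
            .channel("devicesOut")
            .get();
}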
I have a simple kafka setup. A producer is producing messages to a single partition with a single topic at a high rate. A single consumer is consuming messages from this partition. During this process, the consumer may pause processing messages several times. The pause can last a couple of minutes. After the producer stops producing messages, all messages queued up will be processed by the consumer. It appears that messages produced by the producer are not being seen immediately by the consumer. I am using kafka 0.10.1.0. What can be happening here? Here is the section of code that consumes the message:
while (true) {
    try {
        ConsumerRecords<String, byte[]> records = consumer.poll(100);
        for (final ConsumerRecord<String, byte[]> record : records) {
            serviceThread.submit(() -> externalConsumer.accept(record));
        }
        consumer.commitAsync();
    } catch (org.apache.kafka.common.errors.WakeupException e) {
        // WakeupException is deliberately ignored here
    }
}
where consumer is a KafkaConsumer with auto-commit disabled, max poll records of 100, and a session timeout of 30000 ms; serviceThread is an ExecutorService.
The producer just calls KafkaProducer.send to send a ProducerRecord.
All configurations on the broker are left as kafka defaults.
I am also using kafka-consumer-groups.sh to check what is happening when the consumer is not consuming messages. But when this happens, kafka-consumer-groups.sh also hangs, unable to get information back. Sometimes it triggers a consumer rebalance, but not always.
For those who may find this helpful: I've encountered this problem (Kafka silently, supposedly, stopping consumption) often enough, and every single time it wasn't actually a problem with Kafka.
Usually it is some long-running or silently hung process that keeps Kafka from committing the offset, for example a DB client trying to connect to the DB. If you wait long enough (e.g. 15 minutes for SQLAlchemy and Postgres), you will see an exception printed to STDOUT saying something like "connection timed out".