I am using Kafka Streams and I have a question.
My code is:
final KStream<String, Entity> inStream = builder.stream(TOPIC);
inStream.map((key, entity) -> {
    ....
    return new KeyValue<>(key, entity);
}).to(NEW_TOPIC);
The value of NEW_TOPIC is present in the entity object. My problem is how to extract the value of NEW_TOPIC from the entity when multiple tasks are running.
It boils down to this: if there are multiple tasks, does Kafka Streams process an incoming message all the way to the end (calling the to() method to push it to the new Kafka topic) before pulling the next message from the input topic? If that is the behavior, I can store this value in a local/final variable and use it later. If it is not, I need some other way.
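For illustration, here is a sketch of routing each record to the topic named inside the entity itself, using the TopicNameExtractor overload of to(), so that no shared or task-local variable is needed; the getTargetTopic() accessor and the entitySerde are placeholders, not part of the original code:

// Sketch: derive the destination topic per record instead of storing it in a variable.
final KStream<String, Entity> inStream = builder.stream(TOPIC);
inStream
    .map((key, entity) -> new KeyValue<>(key, entity))
    .to((key, entity, recordContext) -> entity.getTargetTopic(),   // hypothetical accessor on Entity
        Produced.with(Serdes.String(), entitySerde));              // entitySerde is a placeholder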
I need to send data from a database to Kafka. No data should be lost, and the message order must be kept strictly as the records are fetched from the database. After the messages are sent, I need to remove them from the database. Once completed, this task is repeated again and again (it is scheduled via @Scheduled).
I have come to the conclusion that guaranteeing that no message is lost and that their order is kept requires the following: before sending a new message, I need to make sure that the previous one was successfully delivered to the Kafka broker (acks=all, min.insync.replicas=2). If a message is not delivered to the broker, there is no point in sending the next one. As a result, the solution turns out to be synchronous. Here is my code example:
public List<String> sendMessages(String topicName, List<Object> data) {
    List<String> successIds = new ArrayList<>();
    for (Object value : data) {
        ListenableFuture<SendResult<String, Object>> listenableFuture =
                kafkaTemplate.send(topicName, value.getSiebelId(), value);
        try {
            // block until the broker acknowledges this record (or 3 seconds pass)
            listenableFuture.get(3, TimeUnit.SECONDS);
        } catch (Exception e) {
            log.warn("todo");
            break; // stop sending so later messages cannot overtake this one
        }
        successIds.add(value.getId());
    }
    return successIds;
}
successIds contains the ids of the messages that were successfully delivered to the broker. Next, I use them to delete the corresponding rows in the database. If, while sending the messages from List<Object> data, some message was not delivered to the broker for some reason, the iteration ends early and I delete exactly what made it into successIds. In the next iteration, we start with the messages that were not included in successIds, because they have not been removed from the database.
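For context, here is a minimal sketch of the surrounding scheduled send-then-delete loop described above; the repository methods, the topic constant and the delay are simplified placeholders, not the real code:

// Sketch of the scheduled loop: fetch in order, send, then delete only what was acknowledged.
@Scheduled(fixedDelay = 10_000)
public void exportToKafka() {
    List<Object> batch = repository.fetchNextBatch();          // hypothetical, ordered as fetched from the DB
    List<String> successIds = sendMessages(TOPIC_NAME, batch); // only acknowledged messages are returned
    repository.deleteByIds(successIds);                        // never delete what was not delivered
}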
This solution requires giving up asynchrony, which certainly reduces performance; I have already tested it and it works very slowly. I am new to Kafka, so I would like an expert opinion: is this solution optimal?
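As a point of comparison, here is a sketch of the producer settings usually combined for a "no loss, keep per-partition order" requirement; enable.idempotence prevents retries from reordering or duplicating records, so sends do not have to be strictly one at a time. The values are illustrative assumptions (they would go into the ProducerFactory configuration), and min.insync.replicas remains a topic/broker-side setting:

// Sketch of durability/ordering-related producer settings (values are illustrative).
Map<String, Object> props = new HashMap<>();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumption
props.put(ProducerConfig.ACKS_CONFIG, "all");                          // wait for all in-sync replicas
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);             // retries cannot reorder or duplicate
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);    // must be <= 5 for idempotence
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);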
The solution below gives me much higher throughput (~100 times on my test data) compared with the listenableFuture.get() approach posted in the question. Here I add a callback to the listenableFuture; in its onSuccess method I put the ids of the successfully sent messages into the list. After iterating over the list, I call flush() on the kafkaTemplate.
@Override
public List<String> sendMessages(String topicName, List<T> data) {
    List<String> successIds = new ArrayList<>();
    data.forEach(value ->
            kafkaTemplate.send(topicName, value.getSiebelId(), value)
                    .addCallback(new ListenableFutureCallback<>() {
                        @Override
                        public void onSuccess(SendResult<String, Object> result) {
                            successIds.add(value.getId());
                        }

                        @Override
                        public void onFailure(Throwable exception) {
                            log.warn("todo");
                        }
                    }));
    // flush() blocks until every record handed to the producer above has completed
    kafkaTemplate.flush();
    return successIds;
}
The successIds output of this solution, however, might differ from the one in the question. Say I have 5 messages to send. If the 3rd message is not delivered (say, due to a network problem), the 4th and 5th will still be sent to the broker and might be delivered (if the network problem is fixed). So successIds={1,2,4,5}. Later on, the 3rd message will be sent in a new list iteration and therefore might be delivered after the 5th message. So this delivers faster and guarantees that no message is lost, but it does not give a 100% guarantee of keeping the order. This is not ideal, but maybe I will use it as a compromise.
In the same situation, the solution with listenableFuture.get() will not even send the 4th and 5th messages and gets successIds={1,2}. The undelivered messages 3, 4 and 5 will then be sent in the proper order in a new list iteration.
I could not properly explain why I gain so much throughput with the presented solution. I guess kafkaTemplate.flush() somehow does the work asynchronously, while listenableFuture.get() queues the requests one by one and waits for the corresponding responses.
P.S. It is interesting to note that if I use the same code but remove the kafkaTemplate.flush() line and instead initialize the kafkaTemplate bean with autoFlush=true, it becomes slow again.
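One plausible explanation (an assumption, not something measured in detail): with listenableFuture.get() every send waits for its acknowledgement before the next record is even handed to the producer, so records cannot be batched into shared requests; with callbacks, the producer can batch many records per request, and flush() only blocks once at the end until everything in the buffer is acknowledged. The batching knobs involved are the standard producer properties; the values below are illustrative only:

// Illustrative batching-related producer settings (values are assumptions, not from the answer).
Map<String, Object> producerProps = new HashMap<>();
producerProps.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024); // max bytes per partition batch
producerProps.put(ProducerConfig.LINGER_MS_CONFIG, 20);         // wait up to 20 ms to fill a batch
producerProps.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");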
I am implementing a Spring Boot application in Java, using Spring Cloud Stream with the Kafka Streams binder.
I need to implement a blocking operation inside a KStream map operation, like so:
public Consumer<KStream<?, ?>> sink() {
    return input -> input
        .mapValues(value -> methodReturningCompletableFuture(value).get())
        .foreach((key, value) -> otherMethod(key, value));
}
completableFuture.get() throws checked exceptions (InterruptedException, ExecutionException).
How can I handle these exceptions so that the chained method doesn't get executed and the Kafka message is not acknowledged? I cannot afford message loss; sending it to a dead letter topic is not an option.
Is there a better way of blocking inside map()?
You can try the branching feature in Kafka Streams to control the execution of the chained methods. For example, here is some pseudo-code that you can try.
You can possibly use this as a starting point and adapt it to your particular use case.
final Map<String, ? extends KStream<?, String>> branches =
    input.split(Named.as("split-"))
        .branch((k, v) -> {
                try {
                    methodReturningCompletableFuture(v).get();
                    return true;
                } catch (Exception e) {
                    return false;
                }
            },
            Branched.as("good-records"))
        .defaultBranch();
// branch names in the resulting map are prefixed with the name given to split()
final KStream<?, String> kStream = branches.get("split-good-records");
kStream.foreach((key, value) -> otherMethod(key, value));
The idea here is that only the records that did not throw an exception are sent to the named branch good-records; everything else goes into a default branch, which we simply ignore in this pseudo-code. Then you invoke the additional chained methods (such as the foreach call above) only for those "good" records.
This does not solve the problem of not acknowledging the message after an exception is thrown, which seems to be a bit more challenging. However, I am curious about that use case: when an exception happens and you handle it, why don't you want to ack the message? The requirements seem a bit rigid without using a DLT. The ideal solution is to introduce some retries and, once the retries are exhausted, send the record to a DLT, which lets the Kafka Streams consumer acknowledge the message and move on to the next offset.
The call methodReturningCompletableFuture(value).get() simply blocks until the future completes (or, if you use the timed get(timeout, unit) variant, until the timeout is reached), assuming that methodReturningCompletableFuture() returns a Future. Therefore, that is already a reasonable way to wait inside the KStream map operation; I don't think anything else is necessary to make it wait further.
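If you go the retry route suggested above, here is a minimal sketch of wrapping the blocking call in a bounded retry before giving up. The retry count, the timeout and the backoff are arbitrary assumptions, methodReturningCompletableFuture() is the placeholder from the question, and the "give up" branch is where a dead-letter-topic send could go:

// Sketch only: bounded retries around the blocking call used in the branch predicate.
private boolean callWithRetries(Object value) {
    final int maxAttempts = 3; // arbitrary
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            methodReturningCompletableFuture(value).get(5, TimeUnit.SECONDS);
            return true;                        // success: record goes to the "good" branch
        } catch (Exception e) {
            if (attempt == maxAttempts) {
                // retries exhausted: this is where a DLT send would go
                return false;
            }
            try {
                Thread.sleep(1_000L * attempt); // naive linear backoff
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
    return false;
}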
When subscribing to change streams using the blocking Spring Data Mongo implementation one can call await to wait for a subscription to become active:
Subscription subscription = startBlockingMongoChangeStream();
subscription.await(Duration.of(2, SECONDS));
Document someDocument = ..
writeDocumentToMongoDb(someDocument);
The startBlockingMongoChangeStream is implemented along these lines:
public Subscription startBlockingMongoChangeStream() {
    MessageListenerContainer container = new DefaultMessageListenerContainer(template);
    container.start();
    MessageListener<ChangeStreamDocument<Document>, Document> listener = System.out::println;
    ChangeStreamRequestOptions options = new ChangeStreamRequestOptions("user", ChangeStreamOptions.empty());
    return container.register(new ChangeStreamRequest<>(listener, options), Document.class);
}
If await is not used in the example above, there is a chance (virtually a 100% chance if the JVM is warm) that someDocument is written before the subscription is active, and thus someDocument is missed. Adding await mitigates this issue.
I'm looking for a way to achieve the same thing when using the reactive implementation. The code now looks something like this:
Disposable disposable = startReactiveMongoChangeStream().subscribe(); // (1)
Document someDocument = ..
writeDocumentToMongoDb(someDocument).subscribe(); // (2)
The problem here is, again, that someDocument is written before the subscription returned by startReactiveMongoChangeStream has started and thus the document is missed.
Also note that this is a somewhat contrived example, since in my actual application writeDocumentToMongoDb (2) is not aware of the startReactiveMongoChangeStream subscription (1), so I cannot simply flatMap (1) and call (2). The startReactiveMongoChangeStream method is implemented along these lines:
public Flux<ChangeStreamEvent<String>> startReactiveMongoChangeStream() {
    return reactiveTemplate.changeStream(String.class)
        .watchCollection("user")
        .listen();
}
How can I "simulate" the await functionality available in the blocking implementation in the reactive implementation?
TL;DR
There are no means for synchronization in the reactive API
Explanation
First, let's look at both implementations to understand why this is.
The blocking implementation uses MongoDB's cursor API to obtain a cursor, and obtaining a cursor involves a round trip to the server. Once MessageListenerContainer has obtained the cursor, it switches the subscription task to active, which means that await returns once the first cursor has been fetched.
The reactive implementation operates on a ChangeStreamPublisher. With the reactive streams protocol, one gets notified when an element is emitted, when the stream completes, or when it fails. There is no notification for when the server-side activity starts or completes. Therefore, you cannot wait until the reactive API receives the first cursor. Since cursors may be empty, the first cursor might not emit any value at all.
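To make the difference concrete, here is a small sketch of the only hooks the reactive API exposes on the change-stream Flux; note that doOnSubscribe fires when the local subscription starts, not when the server-side cursor is open, which is exactly the signal that is missing:

// Sketch: the signals you can observe on the reactive change stream.
// None of them tells you that the server-side cursor has been created.
startReactiveMongoChangeStream()
    .doOnSubscribe(sub -> System.out.println("subscribed locally; server-side cursor not necessarily open yet"))
    .doOnNext(event -> System.out.println("change event: " + event))
    .doOnError(Throwable::printStackTrace)
    .doOnComplete(() -> System.out.println("stream completed"))
    .subscribe();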
I think the MongoDB driver could provide a callback-style API to get notified that the stream is active. That's however something to report in the MongoDB issue tracker.
Hello, I have an issue that I'm trying to solve. Basically, I have a Kafka Streams topology that reads JSON messages from a Kafka topic; each message gets deserialized into a POJO. The topology then checks a certain boolean flag in that message. If the flag is true, it does some transformation and writes the message back to the topic. However, if the flag is false, I want it to not write anything at all, and I'm not sure how to go about that. With MicroProfile Reactive Messaging I can just use an RxJava 2 Flowable stream and return something like Flowable.empty(), but it seems I can't use that approach here.
JsonbSerde<FinancialMessage> financialMessageSerde = new JsonbSerde<>(FinancialMessage.class);
StreamsBuilder builder = new StreamsBuilder();

builder.stream(
        TOPIC_NAME,
        Consumed.with(Serdes.Integer(), financialMessageSerde))
    .mapValues(message -> checkCondition(message))
    .to(
        TOPIC_NAME,
        Produced.with(Serdes.Integer(), financialMessageSerde));
Below is the logic of the function being called:
public FinancialMessage checkCondition(FinancialMessage rawMessage) {
    FinancialMessage receivedMessage = rawMessage;
    if (receivedMessage.compliance_services) {
        receivedMessage.compliance_services = false;
        return receivedMessage;
    }
    return null;
}
If the boolean is false, it just writes a JSON body of "null" to the topic.
I've tried changing the return type of the checkCondition function to a wrapped type like
public Flowable<FinancialMessage> checkCondition(FinancialMessage rawMessage)
and then returning Flowable.just(receivedMessage) or Flowable.empty() from it, but I can't seem to serialize the Flowable object. This might be a silly question, but is there a better way to go about this?
Note that Kafka messages are immutable and are not deleted after being read, so if you read from and write to the same topic with a single application, a message would be processed infinitely often (or, to be more precise, different copies of it would) unless you have a condition that "breaks" the cycle.
Also, if for example 5 services read from the same topic, all 5 services get a copy of every event. And if one service writes back, the other 4 services and the writing service itself will read that message again. Thus, you get quite some data amplification.
If you have different services that react to the original input message consecutively, you could place one topic between each pair of consecutive services to really build a pipeline.
Last, you say that if the boolean flag is true you want to transform the message and emit it (I assume for the next service to consume), and if it is false you want to do nothing. I further assume that for a given message only a single flag is true and that a successful transformation also switches the flag (to enable processing by the next service). For this case, it's best if you can ensure that each original input message arrives with the same initial boolean flag set, so you can build your pipeline: only the corresponding service will read messages with its boolean flag set (you don't even need to check the flag, because the upstream write ensures that it is set; you could keep just a sanity check).
If you don't know which boolean flag is set initially and all services read from the same input topic, just filtering out the message is correct. If all services read all messages, 4 services will filter a given message out while one service will process it and emit a new message with a different flag. For this architecture, a single topic might work: if a message has been processed by all services, all boolean flags are false, and you write it back to the input topic, then all services will correctly drop that last copy. However, using a single topic implies a lot of redundant reading and writing.
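To make "filtering out the message" concrete for the topology in the question, here is a sketch that drops unwanted records with filter() instead of returning null from mapValues(). The names are taken from the question; as noted above, writing back to the same input topic is still problematic:

// Sketch: keep only flagged messages, transform them, and emit; everything else is dropped.
builder.stream(TOPIC_NAME, Consumed.with(Serdes.Integer(), financialMessageSerde))
    .filter((key, message) -> message.compliance_services)  // drop records whose flag is false
    .mapValues(message -> {
        message.compliance_services = false;                 // the transformation from checkCondition()
        return message;
    })
    .to(TOPIC_NAME, Produced.with(Serdes.Integer(), financialMessageSerde));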
Maybe the best architecture is to have your original input topic plus one additional input topic per service. You also add a "dispatcher" service that reads from the original input topic and branches the KStream into the per-service input topics according to the boolean flags. This way, each service reads only messages with the right flag set to true. Furthermore, after its transformation, each service uses branching again to write the message to the input topic of the correct next service. Last, you would want an output topic that each service can write into once a message is fully processed.
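A rough sketch of that dispatcher idea follows; the topic names, the second flag field (fraud_services) and the number of services are assumptions made up for illustration:

// Sketch of the "dispatcher": route each message to the input topic of the service whose flag is set.
builder.stream("original-input-topic", Consumed.with(Serdes.Integer(), financialMessageSerde))
    .split(Named.as("dispatch-"))
    .branch((key, msg) -> msg.compliance_services,
            Branched.withConsumer(ks -> ks.to("compliance-service-input",
                    Produced.with(Serdes.Integer(), financialMessageSerde))))
    .branch((key, msg) -> msg.fraud_services,   // hypothetical second flag
            Branched.withConsumer(ks -> ks.to("fraud-service-input",
                    Produced.with(Serdes.Integer(), financialMessageSerde))))
    .defaultBranch(Branched.withConsumer(ks -> ks.to("fully-processed-output",
            Produced.with(Serdes.Integer(), financialMessageSerde))));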
I've defined a class extending Processor and I'm using a KeyValueStore to temporarily store some messages before sending them to the sink topic. In particular, I receive a set of fragmented messages on a source topic and, once all of them have arrived, I have to assemble them and send the concatenated message to the sink topic. In the Processor's process() method, once I've sent the message with forward(), I want to remove it from the state store by means of the delete(K key) method.
In the processor I've created the state store with:
StoreBuilder<KeyValueStore<byte[], String>> storeBuilder =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("message-store"),
        Serdes.ByteArray(),
        Serdes.String());
The problem is that the removal does not happen: when I send another message with the same key, I still get the previous messages in the value.
Code:
kvStore = (KeyValueStore<byte[], String>) this.context.getStateStore("message-store");
kvStore.delete(key); // key is the byte[] key of the message that was just forwarded
The put and the get of the state store work properly.
Is there anything wrong with this approach?
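For reference, here is a minimal sketch of the flow described above (the fragment handling and the completeness check are simplified placeholders; the store name matches the one registered with the StoreBuilder):

// Sketch of the described Processor: accumulate fragments, forward the assembled message, then delete.
public class AssemblerProcessor implements Processor<byte[], String> {

    private ProcessorContext context;
    private KeyValueStore<byte[], String> kvStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.kvStore = (KeyValueStore<byte[], String>) context.getStateStore("message-store");
    }

    @Override
    public void process(byte[] key, String fragment) {
        String previous = kvStore.get(key);
        String assembled = previous == null ? fragment : previous + fragment;
        if (isComplete(assembled)) {             // placeholder completeness check
            context.forward(key, assembled);     // emit the concatenated message
            kvStore.delete(key);                 // then remove it from the state store
        } else {
            kvStore.put(key, assembled);         // keep accumulating fragments
        }
    }

    @Override
    public void close() {
    }

    private boolean isComplete(String assembled) {
        return false; // placeholder: a real check for "all fragments received" goes here
    }
}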