In my pipeline's setup I cannot see side outputs for the Session Window. I'm using Flink 1.9.1.
Version 1:
What I have is this:
messageStream
    .keyBy(tradeKeySelector)
    .window(ProcessingTimeSessionWindows.withDynamicGap(new TradeAggregationGapExtractor()))
    .sideOutputLateData(lateTradeMessages)
    .process(new CumulativeTransactionOperator())
    .name("Aggregate Transaction Builder");
lateTradeMessages is the OutputTag for late data; TradeAggregationGapExtractor implements SessionWindowTimeGapExtractor and returns 5 seconds.
Further I have this:
messageStream.getSideOutput(lateTradeMessages)
    .keyBy(tradeKeySelector)
    .process(new KeyedProcessFunction<Long, EnrichedMessage, Transaction>() {
        @Override
        public void processElement(EnrichedMessage value, Context ctx, Collector<Transaction> out) throws Exception {
            System.out.println("Process Late messages For Aggregation");
            out.collect(new Transaction());
        }
    })
    .name("Process Late messages For Aggregation");
The problem is that I never see "Process Late messages For Aggregation" when I send messages with the same key that should miss the window.
When the Session Window closes and I "immediately" send a new message for the same key, it triggers a new Session Window without going into the late side output.
Not sure what I'm doing wrong here.
What I would like to achieve here is to catch "late events" and try to reprocess them.
I would appreciate any help.
Version 2, after @Dominik Wosiński's comment:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1000, 1000));
env.setParallelism(1);
env.disableOperatorChaining();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.getConfig().setAutoWatermarkInterval(1000);

DataStream<RawMessage> rawBusinessTransaction = env
    .addSource(new FlinkKafkaConsumer<>("business",
        new JSONKeyValueDeserializationSchema(false), properties))
    .map(new KafkaTransactionObjectMapOperator())
    .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<RawMessage>() {

        @Nullable
        @Override
        public Watermark getCurrentWatermark() {
            return new Watermark(System.currentTimeMillis());
        }

        @Override
        public long extractTimestamp(RawMessage element, long previousElementTimestamp) {
            return element.messageCreationTime;
        }
    })
    .name("Kafka Transaction Raw Data Source.");

messageStream
    .keyBy(tradeKeySelector)
    .window(EventTimeSessionWindows.withDynamicGap(new TradeAggregationGapExtractor()))
    .sideOutputLateData(lateTradeMessages)
    .process(new CumulativeTransactionOperator())
    .name("Aggregate Transaction Builder");
Watermarks are progressing, I've checked in Flink's metrics. The window operator is executing, but there are still no late outputs.
BTW, the Kafka topic can be idle, so I have to emit new watermarks periodically.
The watermark approach looks very suspicious to me. Usually, you would output the latest event timestamp at this point.
Just some background information, so that it's easier to understand.
Late events are events that arrive after the watermark has already advanced past their timestamp. Consider the following example:
event1 @ time 1
event2 @ time 2
watermark1 @ time 3
event3 @ time 1 <-- late event
event4 @ time 4
Your watermark approach would pretty much render all past events late (with a bit of tolerance because of the 1 s watermark interval). It would also make reprocessing and catch-ups impossible.
However, you are actually not seeing any late events, which is even more surprising to me. Can you double-check your watermark approach, describe your use case, and provide example data? Oftentimes the implementation is not ideal for the actual use case and it should be solved in a different way.
You are using ProcessingTime in your case, which means that system time is used to measure the flow of time in the DataStream.
For each event, the timestamp assigned to it is the moment the data is received by your Flink pipeline. This means there is no way to have out-of-order events with Flink processing time. Because of that, you will never have late elements for your windows.
If you switch to EventTime, then for proper input data you should be able to see the late elements being passed to the side output.
You should probably take a look at the documentation, where the various notions of time in Flink are explained.
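For example, a bounded-out-of-orderness assigner for the Flink 1.9 API could look roughly like this (sketch only: RawMessage and messageCreationTime come from your snippet, the 10-second bound is an arbitrary choice):
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// Watermarks trail the highest event timestamp seen so far by 10 seconds, so records
// arriving more than 10 seconds out of order become late and reach the side output.
public class RawMessageTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<RawMessage> {

    public RawMessageTimestampExtractor() {
        super(Time.seconds(10)); // assumed out-of-orderness bound
    }

    @Override
    public long extractTimestamp(RawMessage element) {
        return element.messageCreationTime;
    }
}
You would then use .assignTimestampsAndWatermarks(new RawMessageTimestampExtractor()) in place of the anonymous assigner. Note that with a genuinely idle topic the watermark will not advance on its own, so idleness still has to be handled separately.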
I need to send data from a database to Kafka. No data should be lost, and the message order must strictly match the order in which the rows are fetched from the database. After the messages are sent, I need to remove them from the database. Once completed, this task is repeated again and again (it is scheduled via @Scheduler).
I have come to the conclusion that guaranteeing no message loss while keeping the order requires the following: before sending a new message, I need to make sure the previous one was successfully delivered to the Kafka broker (acks=all, min.insync.replicas=2). If a message is not delivered to the broker, there is no point in sending the next one. As a result, the solution turns out to be synchronous. Here is my code example:
public List<String> sendMessages(String topicName, List<Object> data) {
    List<String> successIds = new ArrayList<>();
    for (Object value : data) {
        ListenableFuture<SendResult<String, Object>> listenableFuture =
                kafkaTemplate.send(topicName, value.getSiebelId(), value);
        try {
            listenableFuture.get(3, TimeUnit.SECONDS);
        } catch (Exception e) {
            log.warn("todo");
            break;
        }
        successIds.add(value.getId());
    }
    return successIds;
}
successIds contains the ids of the messages that were successfully delivered to the broker. Next, I use them to delete the corresponding rows in the database. If, while sending messages from List<Object> data, some message is not delivered to the broker for some reason, we end the iteration early and delete exactly what made it into successIds. In the next iteration we start with the messages that were not included in successIds, because they have not been removed from the database.
This solution, however, requires giving up asynchrony, which will certainly decrease performance. I have already tested it and it works very slowly. I am new to Kafka, so I would like an expert opinion: is this solution optimal?
The following solution gives me much better throughput (~100 times faster on my test data) compared with the listenableFuture.get approach posted in the question. Here I add a callback to the listenableFuture; in its onSuccess method I put the successfully sent ids into the list. After iterating over the list, I call flush() on the kafkaTemplate.
@Override
public List<String> sendMessages(String topicName, List<T> data) {
    List<String> successIds = new ArrayList<>();
    data.forEach(value ->
        kafkaTemplate.send(topicName, value.getSiebelId(), value)
            .addCallback(new ListenableFutureCallback<>() {
                @Override
                public void onSuccess(SendResult<String, Object> result) {
                    successIds.add(value.getId());
                }

                @Override
                public void onFailure(Throwable exception) {
                    log.warn("todo");
                }
            }));
    kafkaTemplate.flush();
    return successIds;
}
The successIds output of this solution, however, might differ from the one in the question. Say I have 5 messages to send. If the 3rd message is not delivered (say, due to a network problem), the 4th and 5th will still be sent to the broker and might be delivered (if the network problem is fixed). So successIds={1,2,4,5}. Later, the 3rd message will be sent in a new list iteration and therefore might be delivered after the 5th message. So it delivers faster and guarantees that no message is lost, but it does not give a 100% guarantee of keeping the order. This is not ideal, but maybe I will use it as a compromise.
In the same situation, the solution with listenableFuture.get will not even send the 4th and 5th messages and gets successIds={1,2}. The undelivered messages 3, 4, and 5 will be sent in the proper order in a new list iteration.
I could not properly explain why I gain so much throughput with the presented solution. I guess kafkaTemplate.flush() lets the producer keep batching and sending asynchronously, while listenableFuture.get forces each request to wait for its response before the next one is sent.
P.S. It is interesting to note that if I use the same code but remove the kafkaTemplate.flush() line and instead initialize the kafkaTemplate bean with autoFlush=true, it becomes slow again.
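A possible mitigation for the remaining ordering gap (just a sketch of standard producer settings, not something I have verified for this exact case) is to make the producer idempotent so that retries cannot reorder in-flight batches:
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerConfig;

// Producer settings that keep asynchronous sends ordered per partition, even across retries.
Map<String, Object> producerProps = new HashMap<>();
producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);          // no duplicates, no reordering on retry
producerProps.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5); // must stay <= 5 with idempotence
This only protects the ordering of a single asynchronous pass, though; a record that fails and is only re-sent in a later scheduled iteration can still arrive after newer ones, as described above.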
I am implementing a Spring Boot application in Java, using Spring Cloud Stream with the Kafka Streams binder.
I need to implement a blocking operation inside a KStream map operation, like so:
public Consumer<KStream<?, ?>> sink() {
    return input -> input
        .mapValues(value -> methodReturningCompletableFuture(value).get())
        .foreach((key, value) -> otherMethod(key, value));
}
completableFuture.get() throws checked exceptions (InterruptedException, ExecutionException).
How can I handle these exceptions so that the chained method doesn't get executed and the Kafka message is not acknowledged? I cannot afford message loss, and sending it to a dead letter topic is not an option.
Is there a better way of blocking inside map()?
You can try the branching feature in Kafka Streams to control the execution of the chained methods. For example, here is some pseudo-code that you can try.
You can possibly use this as a starting point and adapt it to your particular use case.
final Map<String, ? extends KStream<?, String>> branches =
    input.split(Named.as("split-"))
        .branch((k, v) -> {
            try {
                methodReturningCompletableFuture(v).get();
                return true;
            } catch (Exception e) {
                return false;
            }
        }, Branched.as("good-records"))
        .defaultBranch();

// split() prefixes branch names, so the key in the resulting map is "split-" + "good-records"
final KStream<?, String> kStream = branches.get("split-good-records");
kStream.foreach((key, value) -> otherMethod(key, value));
The idea here is that you only send the records that didn't throw an exception to the named branch good-records; everything else goes into a default branch, which we simply ignore in this pseudo-code. You then invoke additional chained methods (as the foreach call shows) only for those "good" records.
This does not solve the problem of not acknowledging the message after an exception is thrown. That seems to be a bit challenging. However, I am curious about that use case. When an exception happens and you handle it, why don't you want to ack the message? The requirements seem to be a bit rigid without using a DLT. The ideal solution here is to introduce some retries and, once the retries are exhausted, send the record to a DLT, which lets the Kafka Streams consumer acknowledge the message. Then the application moves on to the next offset.
The call methodReturningCompletableFuture(value).get() simply waits until the future completes (or until a timeout, if you pass one to get()), assuming that methodReturningCompletableFuture() returns a Future. Therefore, that is already a reasonable way to wait inside the KStream map operation; I don't think anything else is necessary to make it wait further.
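To make the retry-then-DLT idea concrete, something along these lines could be used inside the branch predicate (untested sketch: the retry count, the bounded wait, and the sendToDeadLetterTopic helper are placeholders, the last of which only applies if a DLT ever becomes acceptable):
// Hypothetical wrapper around the blocking call; requires java.util.concurrent.TimeUnit.
private boolean callWithRetries(Object value) {
    final int maxAttempts = 3;  // assumed retry budget
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            methodReturningCompletableFuture(value).get(30, TimeUnit.SECONDS); // bounded wait per attempt
            return true;
        } catch (Exception e) {
            if (attempt == maxAttempts) {
                sendToDeadLetterTopic(value, e); // hypothetical helper, only if a DLT becomes an option
            }
        }
    }
    return false;
}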
I have a Spring Boot web application with the functionality to update an entity called StudioLinking. This entity describes a temporary, mutable, descriptive logical link between two IoT devices for which my web app is their cloud service. The links between these devices are ephemeral in nature, but the StudioLinking entity persists in the database for reporting purposes. StudioLinking is stored in the SQL-based datastore in the conventional way using Spring Data/Hibernate. From time to time this StudioLinking entity will be updated with new information from a REST API. When that link is updated the devices need to respond (change colors, volume, etc.). Right now this is handled by polling every 5 seconds, but that creates lag between when a human user enters an update into the system and when the IoT devices actually update. It could be as little as a millisecond or up to 5 seconds! Clearly, increasing the polling frequency is unsustainable, and the vast majority of the time there are no updates at all!
So, I am trying to develop another REST API on this same application with HTTP long polling, which will return when a given StudioLinking entity is updated or after a timeout. The listeners do not support WebSocket or similar, leaving me with long polling. Long polling has a race condition you have to account for: with consecutive messages, one message may be "lost" as it comes in between HTTP requests (while the connection is closing and reopening, a new update might come in and not be noticed if I used plain Pub/Sub).
It is important to note that this "subscribe to updates" API should only ever return the LATEST and CURRENT version of the StudioLinking, but should only do so when there is an actual update or if an update happened since the last check-in. The "subscribe to updates" client will initially POST an API request to set up a new listening session and pass that along so the server knows who they are, because it is possible that multiple devices will need to monitor updates to the same StudioLinking entity. I believe I can accomplish this by using separately named consumers in the Redis XREAD. (Keep this in mind for later in the question.)
After hours of research I believe the way to accomplish this is with Redis Streams.
I have found these two links regarding Redis Streams in Spring Data Redis:
https://www.vinsguru.com/redis-reactive-stream-real-time-producing-consuming-streams-with-spring-boot/
https://medium.com/@amitptl.in/redis-stream-in-action-using-java-and-spring-data-redis-a73257f9a281
I have also read this link about long polling; these examples just have a sleep timer during the long poll for demonstration purposes, but obviously I want to do something useful:
https://www.baeldung.com/spring-deferred-result
All of these links were very helpful. Right now I have no problem figuring out how to publish the updates to the Redis Stream (this is untested "pseudo-code", but I don't anticipate any issues implementing it):
// In my StudioLinking entity
@PostUpdate
public void postToRedis() {
    StudioLinking link = this;
    ObjectRecord<String, StudioLinking> record = StreamRecords.newRecord()
            .ofObject(link)
            .withStreamKey(streamKey); // I am creating a stream for each individual linking, probably?
    this.redisTemplate
            .opsForStream()
            .add(record)
            .subscribe(System.out::println);
    atomicInteger.incrementAndGet();
}
But I fall flat when it comes to subscribing to said stream. So basically, here is what I want to do; please excuse the butchered pseudocode, it is for idea purposes only. I am well aware that the code is in no way indicative of how the language and framework actually behave :)
// Parameter studioLinkingID refers to the StudioLinking that the requester wants to monitor
// updateList is a unique token to track individual consumers in Redis
@GetMapping("/subscribe-to-updates/{linkId}/{updatesId}")
public DeferredResult<ResponseEntity<?>> subscribeToUpdates(@PathVariable("linkId") Integer linkId, @PathVariable("updatesId") Integer updatesId) {
    LOG.info("Received async-deferredresult request");
    DeferredResult<ResponseEntity<?>> output = new DeferredResult<>(5000L);

    output.onTimeout(() ->
        output.setErrorResult(
            ResponseEntity.status(HttpStatus.REQUEST_TIMEOUT)
                .body("IT WAS NOT UPDATED!")));

    ForkJoinPool.commonPool().submit(() -> {
        //----------------------------------------------
        // Made up stuff... here is where I want to subscribe to a stream and block!
        //----------------------------------------------
        LOG.info("Processing in separate thread");
        try {
            // Subscribe to the Redis Stream, get any updates that happened between long polls,
            // then block until/if a new message comes over the stream
            var subscription = listenerContainer.receiveAutoAck(
                Consumer.from(studioLinkingID, updateList),
                StreamOffset.create(studioLinkingID, ReadOffset.lastConsumed()),
                streamListener);
            listenerContainer.start();
        } catch (InterruptedException e) {
        }
        output.setResult("IT WAS UPDATED!");
    });

    LOG.info("servlet thread freed");
    return output;
}
So is there a good explanation of how I would go about this? I think the answer lies within https://docs.spring.io/spring-data/redis/docs/current/api/org/springframework/data/redis/core/ReactiveRedisTemplate.html but I am not a big enough Spring power user to really understand the terminology in the Javadocs (the Spring documentation is really good, but the Javadocs are written in dense technical language which I appreciate but don't quite understand yet).
There are two more hurdles to my implementation:
My understanding of Spring is not at 100% yet. I haven't yet reached that a-ha moment where I really fully understand why all these beans are floating around. I think this is the key to why I am not getting things here... the configuration for Redis is floating around in the Spring ether and I am not grasping how to just call it. I really need to keep investigating this (it is a huge hurdle to Spring for me).
These StudioLinking entities are short-lived, so I need to do some cleanup too. I will implement this later once I get the whole thing off the ground; I do know it will be needed.
Why don't you use a blocking polling mechanism? There is no need for the fancy stuff in spring-data-redis. Just use a simple blocking read of 5 seconds, so this call might take around 6 seconds or so. You can decrease or increase the blocking timeout.
class LinkStatus {
    private final boolean updated;

    LinkStatus(boolean updated) {
        this.updated = updated;
    }
}

// Parameter studioLinkingID refers to the StudioLinking that the requester wants to monitor
// updateList is a unique token to track individual consumers in Redis
@GetMapping("/subscribe-to-updates/{linkId}/{updatesId}")
public LinkStatus subscribeToUpdates(
        @PathVariable("linkId") Integer linkId, @PathVariable("updatesId") Integer updatesId) {
    StreamOperations<String, String, String> op = redisTemplate.opsForStream();

    Consumer consumer = Consumer.from("test-group", "test-consumer");
    // auto-ack blocking stream read of size 1 with a timeout of 5 seconds
    StreamReadOptions readOptions = StreamReadOptions.empty().autoAcknowledge().block(Duration.ofSeconds(5)).count(1);
    List<MapRecord<String, String, String>> records =
            op.read(consumer, readOptions, StreamOffset.latest("test-stream"));
    return new LinkStatus(!CollectionUtils.isEmpty(records));
}
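Note that reading with Consumer.from(...) requires the consumer group to exist on the stream; you would typically create it once, for example at startup (sketch, reusing the stream and group names from above):
try {
    redisTemplate.opsForStream().createGroup("test-stream", "test-group");
} catch (DataAccessException e) {
    // most likely a BUSYGROUP reply because the group already exists - safe to ignore
}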
When subscribing to change streams using the blocking Spring Data Mongo implementation one can call await to wait for a subscription to become active:
Subscription subscription = startBlockingMongoChangeStream();
subscription.await(Duration.of(2, SECONDS));
Document someDocument = ..
writeDocumentToMongoDb(someDocument);
The startBlockingMongoChangeStream is implemented along these lines:
public Subscription startBlockingMongoChangeStream() {
    MessageListenerContainer container = new DefaultMessageListenerContainer(template);
    container.start();
    MessageListener<ChangeStreamDocument<Document>, Document> listener = System.out::println;
    ChangeStreamRequestOptions options = new ChangeStreamRequestOptions("user", ChangeStreamOptions.empty());
    return container.register(new ChangeStreamRequest<>(listener, options), Document.class);
}
If await is not used in the example above, there's a chance (virtually a 100% chance if the JVM is hot) that someDocument is written before the subscription is active and thus someDocument is missed. So adding await mitigates this issue.
I'm looking for a way to achieve the same thing when using the reactive implementation. The code now looks something like this:
Disposable disposable = startReactiveMongoChangeStream().subscribe(); // (1)
Document someDocument = ..
writeDocumentToMongoDb(someDocument).subscribe(); // (2)
The problem here is, again, that someDocument is written before the subscription returned by startReactiveMongoChangeStream has started and thus the document is missed.
Also note that this is a somewhat contrived example, since in my actual application writeDocumentToMongoDb (2) is not aware of the startReactiveMongoChangeStream subscription (1), so I cannot simply flatMap (1) and call (2). The startReactiveMongoChangeStream method is implemented along these lines:
public Flux<ChangeStreamEvent<String>> startReactiveMongoChangeStream() {
    return reactiveTemplate.changeStream(String.class)
            .watchCollection("user")
            .listen();
}
How can I "simulate" the await functionality available in the blocking implementation in the reactive implementation?
TL;DR
There are no means for synchronization in the reactive API
Explanation
First, let's look at both implementations to understand why this is.
The blocking implementation uses MongoDB's cursor API to obtain a cursor. Obtaining a cursor includes a conversation with the server. After MessageListenerContainer has obtained the cursors, it switches the subscription task to active which means that you have awaited the stage where the first cursor was fetched.
The reactive implementation operates on a ChangeStreamPublisher. From the reactive streams protocol, one can get notified when an element is emitted, when the stream completes or fails. There's no notification available when the server-side activity starts or completes. Therefore, you cannot wait until the reactive API receives the first cursor. Since cursors may be empty, the first cursor might not emit any value at all.
I think the MongoDB driver could provide a callback-style API to get notified that the stream is active. That is, however, something to report in the MongoDB issue tracker.
We have a scenario in our system where user details are published to Kafka topic XYZ by some other producing application A (a different system), and my application B consumes from that topic.
The requirement is that application B needs to consume that event 45 minutes (or any configurable time) after it is put into Kafka topic XYZ by A. The reason for this delay is that a REST API of some system C needs to be triggered based on this user-details event to confirm whether a flag is set for that user, and that flag can be set at any point within that 45-minute window. This could have been avoided if C had the capability to publish to Kafka or notify us in any way, but it does not.
Our application B is written in Spring.
The solution I tried was to take the event from Kafka and check the timestamp of the first event in the queue: if the event is already 45 minutes old, process it; if it is less than 45 minutes old, pause the Kafka listener container for the remaining time (until it reaches 45 minutes) using the MessageListenerContainer pause() method.
Something like below:
@KafkaListener(id = "delayed_listener", topics = "test_topic", groupId = "test_group")
public void delayedConsumer(@Payload String message,
                            Acknowledgment acknowledgment) {

    UserDataEvent userDataEvent = null;
    try {
        userDataEvent = this.mapper.readValue(message, UserDataEvent.class);
    } catch (JsonProcessingException e) {
        logger.error("error while parsing message");
    }

    MessageListenerContainer delayedContainer = this.kafkaListenerEndpointRegistry.getListenerContainer("delayed_listener");

    if (userDataEvent.getPublishTime() > 45 minutes) // this will be some configured value
    {
        long sleepTimeForPolling = userDataEvent.getPublishTime() - System.currentTimeMillis();
        // give negative ack to put already polled messages back onto the kafka topic
        acknowledgment.nack(1000);
        // pause container, and later resume it
        delayedContainer.pause();
        ScheduledExecutorService scheduledExecutorService = Executors.newScheduledThreadPool(1);
        scheduledExecutorService.schedule(() -> {
            delayedContainer.resume();
        }, sleepTimeForPolling, TimeUnit.MILLISECONDS);
        return;
    }

    // if the message was already 45 minutes old then process it
    this.service.processMessage(userDataEvent);
    acknowledgment.acknowledge();
}
Though it works for a single partition, I am not sure if this is the right approach; any comments on that? Also, I can see that multiple partitions will cause problems, as the pause() call above pauses the whole container, and an old message in one partition will not be consumed if the container was paused because of a new message in some other partition.
Can I use this pause logic at the partition level somehow?
Is there a better/recommended solution for achieving this delayed processing after a configurable amount of time, which I could adopt in this scenario rather than doing what I did above?
Kafka is not really designed for such scenarios.
One way I could see that technique working would be to set the container concurrency to the same as the number of partitions in the topic so that each partition is processed by a different consumer on a different thread; then pause/resume the individual Consumer<?, ?>s instead of the whole container.
To do that, add the Consumer<?, ?> as an additional parameter; to resume the consumer, set the idleEventInterval and check the timer in an event listener (ListenerContainerIdleEvent). The Consumer<?, ?> is a property of the event, so you can call resume() there.
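Something along these lines (untested sketch; the concurrency value, the notOldEnoughYet/enoughTimeHasPassed helpers, and the idle interval are placeholders):
@KafkaListener(id = "delayed_listener", topics = "test_topic", groupId = "test_group",
               concurrency = "3") // assumed to match the topic's partition count
public void delayedConsumer(@Payload String message, Consumer<?, ?> consumer) {
    // ... deserialize and check the event age as before ...
    if (notOldEnoughYet) {                      // hypothetical age check
        consumer.pause(consumer.assignment());  // pause only this consumer's partitions
        return;
    }
    process(message);                           // hypothetical processing
}

// Requires the container factory to publish idle events, e.g.
// factory.getContainerProperties().setIdleEventInterval(60_000L);
@EventListener
public void onIdle(ListenerContainerIdleEvent event) {
    if (enoughTimeHasPassed(event)) {           // hypothetical elapsed-time check
        event.getConsumer().resume(event.getConsumer().paused());
    }
}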