I have a Flink process that listens to Kafka. The consumed messages are to be held in a concurrent hash map for a period of time and then sunk to Cassandra.
The operator chain goes something like this:
DataStream<Message> datastream = KafkaSource.createSource();
DataStream<Message> decodedMessage = datastream.flatMap(new DecodeMessage());
decodedMessage
    .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Message>() {
        @Override
        public long extractAscendingTimestamp(Message m) {
            return m.getTimestamp();
        }
    })
    .keyBy((KeySelector<Message, Integer>) x -> x.getID())
    .process(new TimerFunction())
    .addSink(new MySink());

class TimerFunction extends KeyedProcessFunction<Integer, Message, Message> {
    private ValueState<Message> x;

    @Override
    public void processElement(Message msg, Context context, Collector<Message> out) throws Exception {
        // some logic to compute a timer timestamp at the next minute boundary
        context.timerService().registerEventTimeTimer(msg.getTimestamp());
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Message> out) throws Exception {
        // output values on trigger
    }
}
I have some doubts while working with event time.
Each Message has a unique ID, a timestamp, and some other attributes. There could be a million unique keys in a minute. Will the keyBy operation affect performance?
I need to cover a scenario like the following:
Message X with ID 1 arrives at 8h 1min 1s.
Message Y with ID 2 arrives at 8h 1min 4s.
Since I'm using the ID as the key, both of these messages should have a timer set to trigger at 8h 2min 0s.
Per the Flink documentation, if the timestamps of two timers are the same, the timer is triggered just once.
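For concreteness, a minimal sketch of the rounding logic this scenario implies, inside processElement (the arithmetic is illustrative, not from the question):

// Round the event timestamp up to the next full minute, so 8:01:01 and 8:01:04
// both target 8:02:00. Timers are deduplicated per key and timestamp, so the two
// messages (different IDs) each still get their own timer even though the targets coincide.
long ts = msg.getTimestamp();
long nextMinute = ts - (ts % 60_000L) + 60_000L;
context.timerService().registerEventTimeTimer(nextMinute);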
I'm also facing a problem where, if the source becomes idle for a few minutes, the timer keeps waiting for the next watermark and never triggers. How do I deal with an idle source?
Is using processing time a better option in this case?
Also, I'm restricted to Flink v1.8, so I'd need information with respect to that version.
Thanks in advance.
I don't fully understand your question; there's too much context missing. But I can offer a few points:
(1) keyBy is expensive: it forces serialization/deserialization along with a network shuffle.
(2) Timers are deduplicated if and only if they are for the same timestamp and the same key.
(3) As for the idle source problem, the event time timers will eventually fire when events begin to flow again, as that will advance the watermark(s). If you can't wait, you can use something like https://github.com/aljoscha/flink/blob/6e4419e550caa0e5b162bc0d2ccc43f6b0b3860f/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/timestamps/ProcessingTimeTrailingBoundedOutOfOrdernessTimestampExtractor.java, or switch to processing time.
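If you do switch, the change is confined to the process function. A minimal sketch, assuming the same TimerFunction skeleton as in the question:

// Processing-time timers fire from the wall clock, so an idle source cannot stall them.
@Override
public void processElement(Message msg, Context ctx, Collector<Message> out) throws Exception {
    long now = ctx.timerService().currentProcessingTime();
    long nextMinute = now - (now % 60_000L) + 60_000L;
    ctx.timerService().registerProcessingTimeTimer(nextMinute);
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Message> out) throws Exception {
    // fires even when no new events arrive for this key
}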
Related
I have a Flink pipeline configured with a Kafka connector.
I have set the watermark generation frequency to 2 seconds using:
env.getConfig().setAutoWatermarkInterval(2000);
My tumbling window is 60 seconds for the stream, where we do some aggregations, and we have event-time-based processing based on the timestamp of one of our data fields.
I have not configured allowedLateness for my watermark strategy or for my stream.
final ConnectorConfig topicConfig = config.forTopic("mytopic");
final FlinkKafkaConsumer<MyPojo> myEvents = new FlinkKafkaConsumer<>(
topicConfig.name(),
AvroDeserializationSchema.forSpecific(MyPojo.class),
topicConfig.forConsumer()
);
myEvents.setStartFromLatest();
myEvents.assignTimestampsAndWatermarks(
WatermarkStrategy
.<MyPojo>forBoundedOutOfOrderness(
Duration.ofSeconds(30))
.withIdleness(Duration.ofSeconds(120))
.withTimestampAssigner((evt, timestamp) -> evt.event_timestamp_field));
Q.1 From what I am reading, the window for time 0-60 will be computed after 90 seconds, and 30-90 at 120 seconds, and so on. However, since we are doing a tumbling window, i.e. no overlaps, my guess is there is no 30-90 window; the next window after 0-60 is 60-120, which gets triggered at the 150-second mark. Am I right?
Q.2 Without allowedLateness, all late data will be discarded. E.g. an event with a timestamp of 45 that arrives after 90 seconds is considered out of order and will be outside the first window, i.e. 0-60. For window 60-120, the event timestamp does not match, so it will be discarded and not included in the window fired at the 150-second mark. Am I right?
Q.3 For the source idleness duration, I chose 120, meaning that if any Kafka partition for the topic has no data, it is marked idle after 2 minutes, and watermarks are then emitted based on the other, active partitions. My question is about the selection of this number, i.e. 2 minutes, and whether it has anything to do with the window duration (60 seconds) or the out-of-orderness (30 seconds). If so, what should I keep in mind for an apt selection, so that I won't have data stranded late due to watermarks that don't advance because of idle partitions?
Or is 120 too long a wait, such that I could potentially miss data, and should I instead set this to something much less than the out-of-orderness duration to ensure zero data loss?
EDIT: Added some more code
Q1: Yes, that's correct.
Q2: Yes, that's also correct.
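To spell out the arithmetic behind Q1: with forBoundedOutOfOrderness of 30 seconds, the watermark trails the maximum event timestamp seen so far by 30 seconds, so the window [0, 60) fires once an event with timestamp >= 90 has arrived, and [60, 120) fires once an event with timestamp >= 150 has arrived.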
Q3: The details here depend on whether you are having the Kafka source apply the WatermarkStrategy, in which case it will do per-partition watermarking, or whether the WatermarkStrategy is deployed as a separate operator somewhere after (typically chained immediately after) the source operator.
In the first case (with per-partition watermarking done by the FlinkKafkaConsumer) you will do something like this:
FlinkKafkaConsumer<MyType> kafkaSource = new FlinkKafkaConsumer<>(...);
kafkaSource.assignTimestampsAndWatermarks(WatermarkStrategy ...);
DataStream<MyType> stream = env.addSource(kafkaSource);
whereas doing the watermarking separately, after the source, looks like this:
DataStream<MyType> events = env.addSource(...);
DataStream<MyType> timestampedEvents = events
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<MyType>forBoundedOutOfOrderness(Duration ...)
.withTimestampAssigner((event, timestamp) -> event.timestamp));
When the watermarking is done on a per-partition basis, then a single idle partition will hold back the watermark for the consumer/source instance handling that partition -- until the idleness timeout kicks in (120 seconds in your example). By contrast, if the watermarking is done in a separate operator chained to the source, then only if all of the partitions assigned to that source instance (the one with an idle partition) are idle will the watermarks be held up (again, for 120 seconds).
But regardless of those details, there will hopefully be no data loss. There will be a period of time during which windows are not triggered (because the watermarks are not advancing), but events will continue to be processed and assigned to their appropriate windows. Once watermarks resume, those windows will close and deliver their results.
Data loss can occur if the partition was idle because some failure upstream caused a disruption that eventually produces a bunch of late events. Once the idleness timeout expires, the watermark advances, and when those delayed events finally arrive they will be late (unless your bounded-out-of-orderness delay is large enough to accommodate them). If you choose to ignore late events, then those events will be lost.
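If dropping those events is not acceptable, one option is allowed lateness plus a side output. A hedged sketch against the timestampedEvents stream from the second snippet above, where MyResult, MyAggregate, and getKey() are placeholders:

// Late firings within allowedLateness update the window result; anything later
// than that is routed to the side output instead of being silently dropped.
final OutputTag<MyPojo> lateTag = new OutputTag<MyPojo>("late-events") {};

SingleOutputStreamOperator<MyResult> windowed = timestampedEvents
    .keyBy(evt -> evt.getKey())
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    .allowedLateness(Time.seconds(30))
    .sideOutputLateData(lateTag)
    .aggregate(new MyAggregate());

DataStream<MyPojo> lateEvents = windowed.getSideOutput(lateTag);
// lateEvents can then be logged or written to a dead-letter topic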
I'm trying to use Kafka Streams (i.e. not a simple Kafka Consumer) to read from a retry topic with events that have previously failed to process. I wish to consume from the retry topic, and if processing still fails (for example, if an external system is down), I wish to put the event back on the retry topic. Thus I don't want to keep consuming immediately, but instead wait a while before consuming, in order to not flood the systems with messages that are temporarily unprocessable.
Simplified, the code currently does this, and I wish to add a delay to it.
fun createTopology(topic: String): Topology {
val streamsBuilder = StreamsBuilder()
streamsBuilder.stream<String, ArchivalData>(topic, Consumed.with(Serdes.String(), ArchivalDataSerde()))
.peek { key, msg -> logger.info("Received event for key $key : $msg") }
.map { key, msg -> enrich(msg) }
.foreach { key, enrichedMsg -> archive(enrichedMsg) }
return streamsBuilder.build()
}
I have tried to use Window Delay to set this up, but have not managed to get it to work. I could of course do a sleep inside a peek, but that would leave a thread hanging and does not sound like a very clean solution.
The exact details of how the delay would work is not terribly important to my use case. For example, all of these would work fine:
All events on the topic from the past x seconds are consumed at once; after it begins/finishes consuming, the stream waits x seconds before consuming again
Every event is processed x seconds after being put on the topic
The stream consumes messages with a delay of x seconds between every event
I would be very grateful if someone could provide a few lines of Kotlin or Java code that would accomplish any of the above.
You cannot really pause reading from the input topic using Kafka Streams—the only way to "delay" would be to call a "sleep", but as you mentioned, that blocks the whole thread and is not a good solution.
However, what you can do is use a stateful processor, e.g., process() (with an attached state store) instead of foreach(). If the retry fails, you don't put the record back into the input topic; you put it into the store and also register a punctuation with the desired retry delay. When the punctuation fires, you retry; if the retry succeeds, you delete the entry from the store and cancel the punctuation, otherwise you wait until the punctuation fires again.
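A hedged sketch of that approach in Java, using one periodic wall-clock punctuation over the whole store rather than one punctuation per record, for simplicity. The store name "retry-store", the 30-second interval, and the enrich/archive stubs (mirroring the question) are assumptions:

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class RetryProcessor implements Processor<String, ArchivalData> {

    private KeyValueStore<String, ArchivalData> retryStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        retryStore = (KeyValueStore<String, ArchivalData>) context.getStateStore("retry-store");
        // Wall-clock punctuation: every 30 seconds, retry everything parked in the store.
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, ArchivalData> it = retryStore.all()) {
                while (it.hasNext()) {
                    KeyValue<String, ArchivalData> entry = it.next();
                    if (tryArchive(entry.value)) {
                        retryStore.delete(entry.key); // success: stop retrying this record
                    }
                }
            }
        });
    }

    @Override
    public void process(String key, ArchivalData msg) {
        if (!tryArchive(msg)) {
            retryStore.put(key, msg); // park it; the punctuator retries it later
        }
    }

    private boolean tryArchive(ArchivalData msg) {
        try {
            archive(enrich(msg)); // enrich/archive as in the question
            return true;
        } catch (RuntimeException e) {
            return false; // external system still down
        }
    }

    private ArchivalData enrich(ArchivalData msg) { return msg; } // stub, as in the question
    private void archive(ArchivalData msg) { /* external call that may fail */ }

    @Override
    public void close() {}
}

The store would be registered with streamsBuilder.addStateStore(...) and the processor attached with stream.process(RetryProcessor::new, "retry-store") in place of the foreach.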
All the code examples I've seen work roughly like this:
subscribe to a pull subscription
get back a subscriptionID and watermark
now loop through getEvents() until done, updating the watermark
possibly unsubscribe.
In short, they assume you are doing the pulling in a single thread/process and
will not need to pull again using the same watermark/subscription ID.
The API itself doesn't have a resumePullSubscription(subscriptionID, watermark). It just
has beginSubscribe(folders, events, watermark). It's unclear to me whether I can
use that watermark again later with another beginSubscribe, since the subscriptionID
cannot be supplied.
I want to subscribe and get a watermark at time T0.
At another time T1, within the timeout interval, I want to getEvents again. This is a separate thread, so I need to reconnect to the existing subscription/watermark.
It seems like I have two choices for time T1:
unsubscribe at time T0 and then resubscribe at time T1 with the watermark, but won't the watermark be lost because of the unsubscribe?
resubscribe passing just the watermark, but will EWS be smart enough to hook up to the same subscription? Or will the watermark be ignored? Or will the subscription budget grow?
At any rate, it's not actually very clear what happens when the subscription expires. I would assume the watermark would go with it, but I see info claiming the watermark will survive for 30 days. So then, what's the point of the subscription ID?
The PullSubscription class in the EWS Managed API doesn't have a constructor that allows you to instantiate it by itself (I guess this was a border case in their design). So if you want to do this, you would need to use either some proxy code, e.g. http://msdn.microsoft.com/en-us/library/office/exchangewebservices.geteventstype(v=exchg.150).aspx, or raw SOAP and an HTTP class to issue the GetEvents request and parse the result.
Basically, while the subscription is valid (i.e. within the timeout period) you should be able to use GetEvents with the SubscriptionId and a valid watermark (the watermark should be good for 30 days). If you have unsubscribed, the watermark won't be valid, because it will have been removed from the events table.
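For the raw-SOAP route, a hedged sketch in Java of what the GetEvents call could look like; the endpoint URL, authentication, and the ID/watermark values are placeholders:

// Post a GetEvents request directly to the EWS endpoint and read the response.
String subscriptionId = "...";  // returned by the original Subscribe call at T0
String watermark = "...";       // the last watermark you stored
String soap =
    "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\"\n" +
    "               xmlns:m=\"http://schemas.microsoft.com/exchange/services/2006/messages\">\n" +
    "  <soap:Body>\n" +
    "    <m:GetEvents>\n" +
    "      <m:SubscriptionId>" + subscriptionId + "</m:SubscriptionId>\n" +
    "      <m:Watermark>" + watermark + "</m:Watermark>\n" +
    "    </m:GetEvents>\n" +
    "  </soap:Body>\n" +
    "</soap:Envelope>";

HttpURLConnection conn = (HttpURLConnection)
    new URL("https://mail.example.com/EWS/Exchange.asmx").openConnection();
conn.setRequestMethod("POST");
conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
conn.setDoOutput(true);
conn.getOutputStream().write(soap.getBytes("UTF-8"));
// parse the GetEventsResponse from conn.getInputStream() to pick up the new
// events and the next watermark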
Cheers
Glen
Scenario:
I want to test a communication between 2 devices. They communicate by frames.
I start up the application (on device 1) and send a number of frames (each frame contains a unique (int) ID). Device 2 receives each frame and sends an acknowledgement (echoing the ID back), or it doesn't (when the frame got lost).
When device 1 receives the ACK, I want to measure the time it took to send the frame and receive the ACK back.
From looking around SO
How do I measure time elapsed in Java?
System.nanoTime() is probably the best way to monitor the elapsed time. However, this is all happening in different threads, following the classic producer-consumer pattern, where one thread (on device 1) is always reading and another is managing the process (and also writing the frames). Thank you for bearing with me; my question is:
Question: I need to convey the unique ID from the ACK frame from the reading thread to the managing thread. I've done some research, and this seems to be a good candidate for a wait/notify system, or is it not? Or perhaps I just need a shared array that contains data for each frame sent? But then how does the managing thread know it happened?
Context: I want to compare these times because I want to research what factors can hamper communication.
Why don't you just populate a shared map with <unique id, timestamp> pairs? You can expire old entries by periodically removing those older than a certain age.
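A minimal sketch of that idea, assuming int frame IDs; the class and method names are illustrative:

import java.util.concurrent.ConcurrentHashMap;

class RoundTripTimer {
    private final ConcurrentHashMap<Integer, Long> sendTimes = new ConcurrentHashMap<>();

    // Managing/writing thread: record the send time when a frame goes out.
    void onFrameSent(int frameId) {
        sendTimes.put(frameId, System.nanoTime());
    }

    // Reading thread: compute the round trip when the ACK echoes the ID back.
    void onAckReceived(int frameId) {
        Long sentAt = sendTimes.remove(frameId); // remove doubles as cleanup
        if (sentAt != null) {
            long rttNanos = System.nanoTime() - sentAt;
            System.out.printf("Frame %d round trip: %.3f ms%n", frameId, rttNanos / 1e6);
        }
    }
}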
I suggest you reformulate your problem with tasks (Callable). Create a task for the writer role and one for the reader role. Submit these in pairs to an ExecutorService and let the Java concurrency framework handle the concurrency for you. You only have to think about what the result of a task should be and how you want to use it.
// Pseudo code
ExecutorService exec = Executors.newCachedThreadPool();
Future<List<Timestamp>> readerFuture = exec.submit(new ReaderRole(sentFrameNum));
Future<List<Timestamp>> writerFuture = exec.submit(new WriterRole(sentFrameNum));
List<Timestamp> writeResult = writerFuture.get(); // wait for the completion of writing
List<Timestamp> readResult = readerFuture.get();  // wait for the completion of reading
This is pretty complex stuff, but it's much cleaner and more stable than a custom-developed synchronization solution.
Here is a pretty good tutorial for the Java concurrency framework: http://www.vogella.com/articles/JavaConcurrency/article.html#threadpools
I have a requirement to send alerts when a record in the db is not updated/changed within a specified interval. For example, if a received purchase order isn't processed within one hour, a reminder should be sent to the delivery manager.
The reminder/alert should be sent exactly at the interval (including seconds). If the last modified time is 13:55:45, the alert should be triggered at 14:55:45. There could be a million rows that need to be tracked.
The simple approach could be implementing a custom scheduler with which all the records are registered. But it would have to poll the database every second to look for changes, which would lead to performance problems.
UPDATE:
Another basic approach would be creating a thread for each record and putting it to sleep for 1 hour, or using some queuing concept that has a timeout. But these still have performance problems.
Any thoughts on a better approach to implement this?
Probably using an internal JMS queue would be a better solution. For example, you may want to use the scheduled message feature of HornetQ: http://docs.jboss.org/hornetq/2.2.2.Final/user-manual/en/html/examples.html#examples.scheduled-message
You can ask the broker to publish the alert message after exactly 1 hour. On the other hand, while processing the trading activity you can manually delete this message, meaning the trade activity has been processed without errors.
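A hedged sketch of the scheduled-message idea; if I recall the HornetQ API correctly, the scheduled-delivery header is the string "_HQ_SCHED_DELIVERY" (org.hornetq.api.core.Message.HDR_SCHEDULED_DELIVERY_TIME). The queue, session, and orderId wiring are assumptions:

import javax.jms.*;

class AlertScheduler {
    // The broker holds the message and delivers it at the given wall-clock time,
    // one hour after the record's last modification.
    void scheduleAlert(Session session, Queue alertQueue, String orderId) throws JMSException {
        MessageProducer producer = session.createProducer(alertQueue);
        TextMessage alert = session.createTextMessage("Order " + orderId + " not processed in time");
        alert.setLongProperty("_HQ_SCHED_DELIVERY", System.currentTimeMillis() + 60 * 60 * 1000L);
        producer.send(alert);
    }
}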
Use a Timer for each reminder, i.e. if the last modified time is 17:49:45, the alert should be triggered at 18:49:45. Simply create a dynamic timer schedule for each task; it will fire exactly one hour later.
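A minimal sketch of that, assuming a hypothetical sendAlertIfStillUnprocessed helper that re-checks the row before alerting; note that a single java.util.Timer can host many TimerTasks on one background thread, so this does not need a thread per record:

import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;

class ReminderTimers {
    private final Timer timer = new Timer("alert-timer", true);

    void scheduleReminder(final String orderId, long lastModifiedMillis) {
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                sendAlertIfStillUnprocessed(orderId);
            }
        }, new Date(lastModifiedMillis + 60 * 60 * 1000L)); // exactly one hour later
    }

    private void sendAlertIfStillUnprocessed(String orderId) { /* DB check + alert */ }
}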
It is not possible in Java if you really insist on the "real-timeness". In Java you may encounter the garbage collector's stop-the-world phases, and you can never guarantee the exact time.
If approximate time is also permissible, then use some kind of scheduled queue as proposed in the other answers; if not, use real-time Java or some native call.
If we can assume that the orders are entered with increasing time, then (see the sketch after these steps):
You can use a Queue with elements that have the properties time-of-order and order-id.
Each new entry that is added to the DB is also enqueued to this Queue.
You check the element at the head of the Queue each minute.
When checking the element at the head of the Queue, if an hour has passed since the time-of-order, then search for the entry with that order-id in the DB.
If it is found and was not updated, send a notification; else dequeue it from the Queue.
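A minimal sketch of those steps, assuming orders arrive in time order; OrderEntry, isUpdated(), and sendNotification() are placeholders for illustration:

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class OrderWatcher {
    static class OrderEntry {
        final long timeOfOrder;   // millis when the order was entered
        final String orderId;
        OrderEntry(long timeOfOrder, String orderId) {
            this.timeOfOrder = timeOfOrder;
            this.orderId = orderId;
        }
    }

    private final Queue<OrderEntry> pending = new ArrayDeque<>();

    // Called whenever a new row is inserted into the DB.
    void onNewOrder(OrderEntry entry) {
        synchronized (pending) { pending.add(entry); }
    }

    void start() {
        ScheduledExecutorService checker = Executors.newSingleThreadScheduledExecutor();
        checker.scheduleAtFixedRate(this::checkHead, 1, 1, TimeUnit.MINUTES);
    }

    private void checkHead() {
        long cutoff = System.currentTimeMillis() - 60 * 60 * 1000L; // one hour ago
        synchronized (pending) {
            OrderEntry head;
            // Entries are in time order, so only entries at the head can be due.
            while ((head = pending.peek()) != null && head.timeOfOrder <= cutoff) {
                pending.poll();
                if (!isUpdated(head.orderId)) {     // hypothetical DB lookup
                    sendNotification(head.orderId); // hypothetical alert
                }
            }
        }
    }

    private boolean isUpdated(String orderId) { return false; } // placeholder
    private void sendNotification(String orderId) {}            // placeholder
}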