In a Flink (1.14.0) application I have two streams to consume (all sources are Kafka topics): one with raw events and a second with configurations. Configurations are stored in ListState<>. When there is no configuration under a specific key, the raw data events are ignored. There is a small time window during which configurations are still being processed in the second stream while the first stream is already consuming events, so I'm losing potentially valid data. Configurations are stored in a compacted Kafka topic and are loaded when the job starts.
I tried a custom operator implementation using InputSelectable. After some time N, a timer switches from the second stream to all streams, but no data is pushed through the first one. I used the approach from this example, but it looks like this 'switch' needs to be performed while there is still data in the second stream.
Is there a way to handle this scenario without implementing custom window aggregators to play the role of a time buffer for the first stream?
I have a use case where I need to "batch process" event data for customers.
Every piece of event data would have a customerId.
In my application layer (java), I will need to batch up all the events per customer id and then apply my business logic. My business logic needs all the events per customer to be available. Basically, I'm grouping by customerId before I can do anything with it.
Approach:
Ingest all the events into a Kafka topic with "customerId" as the partition key. That way, the events belonging to a specific customer always go to the same consumer. In the consumer, I can gather the events in memory (perhaps using a simple expiry map or so) and do a batch process. In this approach, my entire batch is transient and stored in application memory.
Caveats:
When a Kafka partition rebalance happens (for whatever reason) and partitions are re-assigned to different consumers, the data becomes inconsistent. I'm not sure if there's any way to overcome that.
I'm wondering what a practical approach for such "batch" use cases is. Is Kafka Streams the right candidate for this? Note that this is not an infinite stream: the batch data set clearly has a start and an end, and the end event is used as a trigger to perform the business logic.
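Here is a minimal sketch of the in-memory batching approach described above; the topic name, the String payloads and the "END" marker are assumptions, and a plain HashMap stands in for the expiry map:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.*;

public class CustomerBatcher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "customer-batcher");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Events for one customerId always land in the same partition,
        // so a single consumer sees the whole batch for that customer.
        Map<String, List<String>> buffer = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("customer-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records) {
                    String customerId = rec.key();
                    if ("END".equals(rec.value())) {
                        // End marker: run the business logic on the full batch, then drop it.
                        List<String> batch = buffer.remove(customerId);
                        if (batch != null) {
                            processBatch(customerId, batch);
                        }
                    } else {
                        buffer.computeIfAbsent(customerId, k -> new ArrayList<>()).add(rec.value());
                    }
                }
            }
        }
    }

    private static void processBatch(String customerId, List<String> events) {
        System.out.printf("customer %s: %d events%n", customerId, events.size());
    }
}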
The events will be ordered per customerId, but without a StickyAssignor in the consumer instances they will not necessarily be consumed by the same consumer, especially during rebalances (e.g. when instances are replaced) in a distributed environment.
If you have some data in a compact topic that acts as your raw events, and consuming them all into some cache will build up your materialized view, then yes, that's what Kafka Streams does with changelog topics. You can also build this logic on your own with a plain consumer, like the Confluent Schema Registry does with its _schemas topic and multiple internal HashMaps.
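As a rough illustration of that "plain consumer plus in-memory map" pattern (the topic name and String types are made up), you can read the compact topic from the beginning and keep only the latest value per key:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompactTopicView {

    /** Latest value per key; this is the in-memory materialized view. */
    private final Map<String, String> view = new HashMap<>();

    public void bootstrap(KafkaConsumer<String, String> consumer, String topic) {
        // Assign all partitions explicitly and rewind to the beginning of the compacted topic.
        List<TopicPartition> partitions = new ArrayList<>();
        consumer.partitionsFor(topic).forEach(
                p -> partitions.add(new TopicPartition(p.topic(), p.partition())));
        consumer.assign(partitions);
        consumer.seekToBeginning(partitions);

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) {
                break; // naive "caught up" check, good enough for a sketch
            }
            for (ConsumerRecord<String, String> rec : records) {
                if (rec.value() == null) {
                    view.remove(rec.key());           // tombstone: the key was deleted
                } else {
                    view.put(rec.key(), rec.value()); // the latest value per key wins
                }
            }
        }
    }

    public String lookup(String key) {
        return view.get(key);
    }
}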
I am trying to join two unbounded PCollections that I am getting from two different Kafka topics on the basis of a key.
As per the docs and other blogs, a join is only possible if we do windowing: the window collects the messages from both streams in a particular interval and joins them. This is not what I need.
In my case, one stream produces messages at a very low frequency while the other produces them at a high frequency. I want the join for a key to be deferred until the value for that key has arrived on both streams, and performed only once it has.
Is it possible using the current beam paradigm ?
In short, the best solution is to use a stateful DoFn in Beam. You can have per-key state (and per-window state, which is the global window in your case). You can save one stream's events in state and, once events from the other stream appear with the same key, join them with the events in state. Here is a reference [1].
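A rough sketch of what such a stateful DoFn could look like in the Beam Java SDK is below; the "L:"/"R:" tagging scheme, the String payloads and the one-event-per-key-per-stream assumption are mine, and it deliberately ignores the caveats discussed next:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/**
 * Assumes both streams have been tagged ("L:"/"R:" prefix) and flattened into one
 * keyed PCollection<KV<String, String>>, and that each key appears once per stream.
 */
public class StatefulJoinFn extends DoFn<KV<String, String>, KV<String, KV<String, String>>> {

  @StateId("left")
  private final StateSpec<BagState<String>> leftSpec = StateSpecs.bag(StringUtf8Coder.of());

  @StateId("right")
  private final StateSpec<BagState<String>> rightSpec = StateSpecs.bag(StringUtf8Coder.of());

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("left") BagState<String> left,
      @StateId("right") BagState<String> right) {

    String key = c.element().getKey();
    String tagged = c.element().getValue();
    boolean isLeft = tagged.startsWith("L:");
    String payload = tagged.substring(2);

    BagState<String> mine = isLeft ? left : right;
    BagState<String> other = isLeft ? right : left;

    // If the partner element is already buffered, emit the joined pair;
    // otherwise buffer this element and wait for the other stream.
    boolean matched = false;
    for (String partner : other.read()) {
      matched = true;
      c.output(KV.of(key, isLeft ? KV.of(payload, partner) : KV.of(partner, payload)));
    }
    if (!matched) {
      mine.add(payload);
    }
  }
}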
However, the short answer does not utilize the true power of the Beam model. The Beam model provides ways to balance latency, cost and accuracy, and it offers a simple API that hides the complexity of stream processing.
Why am I saying that? Let's go back to the short answer's solution: the stateful DoFn. With the stateful DoFn approach, you lack ways to address the following questions:
What if you have buffered 1M events for one key and still no event has appeared from the other stream? Do you need to empty the state? What if the event appears right after you emptied the state?
If eventually a single event appears to complete a JOIN, is the cost of buffering 1M events acceptable for joining a single event from the other stream?
How do you handle late data on both streams? Say you have joined <1, a> from the left stream with <1, b> from the right stream. Later another <1, c> arrives on the left stream; how do you know that you only need to emit <1, <c, b>> (assuming incremental output mode)? If you start to buffer the already-joined events to compute the delta, it really becomes too complicated for a programmer.
Beam's windowing, triggers, refinement of output data, watermarks and lateness SLA control are designed to hide this complexity from you:
watermark: tells you when a window is complete, such that events will no longer come (and any further events are treated as late data).
lateness SLA control: controls how long you cache data for the join.
refinement of output data: updates the output correctly when allowed late events arrive.
Although the Beam model is well designed, its implementations are missing critical features to support the join you described:
windowing is not flexible enough to support your case where the streams have hugely different frequencies (so fixed and sliding windows do not fit), and you don't know the arrival rate of the streams either (so session windows do not really fit, as you have to give a gap between sessions).
retraction is missing, so you cannot refine your output once late events arrive.
To conclude, the Beam model is designed to handle the complexity of stream processing, which perfectly fits your need, but the implementation is not yet complete enough for you to use it for this join use case today.
[1] https://beam.apache.org/blog/2017/02/13/stateful-processing.html
This isn't something that is well supported by the Beam model today, but there are a few ways you can do it. These examples assume each key appears exactly once on each stream; if that isn't the case, you'll need to adjust them.
One option is to use the Global Window and Stateful DoFn instead of a Join. The Global Window effectively turns windowing off. A stateful DoFn lets you store data about the key you are processing in a "state cell" for later use. When you receive a record, you would check the state cell for a value. If you find one, do the join, emit the value, and clear the state. If there isn't anything, store the current value.
Another option is to use Session Windows and Join. The session window "GapDuration" is effectively a timeout on a given key. This works as long as you have a time bound within which you will see the key on both streams. You'll also want to set up an element-count trigger, "AfterPane.elementCountAtLeast(2)", so you don't have to wait for the full timeout after seeing the second piece of data.
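A hedged sketch of that windowing setup in the Beam Java SDK, assuming keyed PCollection<KV<String, String>> inputs and a placeholder 30-minute gap:

import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class JoinWindowing {

  /** Applies the session-window-plus-element-count-trigger setup to one keyed input. */
  static PCollection<KV<String, String>> windowForJoin(PCollection<KV<String, String>> input) {
    return input.apply(
        Window.<KV<String, String>>into(
                Sessions.withGapDuration(Duration.standardMinutes(30)))             // per-key "timeout"
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(2)))        // fire once both sides are in
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes());
  }
}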
Problem Statement:
I have the following stream: 1, 2, 4, 6, 3, 5, ... I expect the events to reach the subscribers as 1, 2, 3, 4, 5, 6, ...
For Simplicity:
There is no possibility of a missing element, i.e. the sequence of event IDs has no gaps.
Messages that have been sent can be deleted from the in-memory data structure, freeing space for others.
This is an infinite stream; the buffered data may sometimes grow large, and storing everything in memory may in the worst case lead to an out-of-memory exception, which is fine.
You can pile up the events using window(n) or other methods, but then the events are expected to be published in sequence.
With respect to my code,
I have a Flowable that receives inbound events. These events are not in order, but they are expected to reach subscribers in order (ascending order of event ID).
Please let me know:
How can I achieve this, either using RxJava or without Rx?
What could be the optimal design for this without any event loss?
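A minimal non-Rx sketch of the kind of in-memory reorder buffer described above, assuming event IDs start at 1 and have no gaps; emitted entries are removed immediately to free memory:

import java.util.TreeMap;
import java.util.function.LongConsumer;

/** Buffers out-of-order event IDs and emits them strictly in ascending order. */
public class ReorderBuffer {
    // In a real application the map value would be the full event, not just its ID.
    private final TreeMap<Long, Long> pending = new TreeMap<>();
    private final LongConsumer subscriber;
    private long nextExpected = 1;

    public ReorderBuffer(LongConsumer subscriber) {
        this.subscriber = subscriber;
    }

    /** Called for every inbound event, in arrival order. */
    public synchronized void onEvent(long eventId) {
        pending.put(eventId, eventId);
        // Drain the buffer as long as the next expected ID is present,
        // deleting emitted entries to give space back.
        while (!pending.isEmpty() && pending.firstKey() == nextExpected) {
            subscriber.accept(pending.pollFirstEntry().getValue());
            nextExpected++;
        }
    }

    public static void main(String[] args) {
        ReorderBuffer buffer = new ReorderBuffer(id -> System.out.print(id)); // prints 123456
        for (long id : new long[] {1, 2, 4, 6, 3, 5}) {
            buffer.onEvent(id);
        }
    }
}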
I have been studying Apache Kafka for a month now; however, I am stuck at one point. My use case is this: I have two or more consumer processes running on different machines. I ran a few tests in which I published 10,000 messages to the Kafka server, and then, while processing these messages, I killed one of the consumer processes and restarted it. The consumers were writing processed messages to a file. After consumption finished, the file showed more than 10k messages, so some messages were duplicated.
In the consumer processes I have disabled auto commit; the consumers manually commit offsets batch-wise. So, for example, when 100 messages are written to the file, the consumer commits the offsets. When a single consumer process is running and it crashes and recovers, duplication is avoided in this manner. But when more than one consumer is running and one of them crashes and recovers, it writes duplicate messages to the file.
Is there any effective strategy to avoid these duplicate messages?
The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even to attempt to prevent duplicates you would need to use the simple consumer. The approach works like this: for each consumer, when a message is consumed from some partition, write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, read the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
There are some cases, though, where exactly-once processing is more attainable, but only for certain use cases. This simply requires that your offset be stored in the same location as your application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
Usually your time will be better spent, and your application will be much more reliable, if you simply design it to be idempotent.
This is what the Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
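As a rough illustration of the FAQ's "store the offset with the data" alternative, here is a hedged JDBC sketch; the events table, its columns and the unique (topic, partition, offset) key are made-up names:

import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransactionalSink {

    /**
     * Writes the record and its source offset in the same database transaction,
     * so a replay after a crash either finds both or neither.
     */
    public void write(Connection conn, ConsumerRecord<String, String> rec) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement stmt = conn.prepareStatement(
                // hypothetical table; the unique (topic, part, off) key doubles as a dedup guard
                "INSERT INTO events (topic, part, off, payload) VALUES (?, ?, ?, ?)")) {
            stmt.setString(1, rec.topic());
            stmt.setInt(2, rec.partition());
            stmt.setLong(3, rec.offset());
            stmt.setString(4, rec.value());
            stmt.executeUpdate();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e; // a duplicate-key error here means the record was already processed
        }
    }
}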
I agree with RaGe about deduplicating on the consumer side. We use Redis to deduplicate Kafka messages.
Assume the Message class has a member called 'uniqId', which is filled in on the producer side and is guaranteed to be unique. We use a 12-character random string (regexp: '^[A-Za-z0-9]{12}$').
The consumer side uses Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
Message msg = ... // eg. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // eg. JedisPool.getResource();
String key = "SPOUT:" + msg.uniqId; // prefix name at will
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val);
if (rsps <= 0) {
log.warn("kafka dup: {}", msg.toJson()); // and other logic
} else {
jedis.expire(key, 7200); // 2 hours is ok for production environment;
}
The above code did detect duplicate messages several times when Kafka (version 0.8.x) misbehaved. With our input/output balance audit log, no message was lost or duplicated.
There's a relatively new 'Transactional API' now in Kafka that can allow you to achieve exactly once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the remainder of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once
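A minimal sketch of the consume-transform-produce loop with the transactional producer follows; the topic names, group id and transactional id are placeholders, and error handling (abortTransaction) is omitted:

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.*;

public class ExactlyOnceRelay {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "relay");
        cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "relay-tx-1"); // enables idempotence + transactions
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            producer.initTransactions();
            consumer.subscribe(Collections.singletonList("input"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                for (ConsumerRecord<String, String> rec : records) {
                    producer.send(new ProducerRecord<>("output", rec.key(), rec.value().toUpperCase()));
                }
                // Commit the consumed offsets in the same transaction as the produced records.
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (TopicPartition tp : records.partitions()) {
                    List<ConsumerRecord<String, String>> part = records.records(tp);
                    offsets.put(tp, new OffsetAndMetadata(part.get(part.size() - 1).offset() + 1));
                }
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            }
        }
    }
}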
Whatever is done on the producer side, we still believe the best way to deliver exactly once from Kafka is to handle it on the consumer side:
Produce the message with a uuid as the Kafka message key into topic T1.
On the consumer side, read the message from T1 and write it to HBase with the uuid as the row key.
Read it back from HBase with the same row key and write it to another topic, T2.
Have your end consumers actually consume from topic T2.
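A rough sketch of steps 2-4, with made-up table, column-family and topic names:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HBaseDedupRelay {
    private static final byte[] CF = Bytes.toBytes("d");
    private static final byte[] COL = Bytes.toBytes("payload");

    /** Consumes one record from T1, writes it to HBase keyed by the message uuid,
     *  reads it back and forwards the stored value to T2. */
    public void relay(Connection hbase, KafkaProducer<String, String> producer,
                      ConsumerRecord<String, String> rec) throws Exception {
        try (Table table = hbase.getTable(TableName.valueOf("events"))) {
            byte[] rowkey = Bytes.toBytes(rec.key()); // uuid produced into T1 as the message key

            Put put = new Put(rowkey);
            put.addColumn(CF, COL, Bytes.toBytes(rec.value()));
            table.put(put); // duplicate deliveries just overwrite the same row

            Result stored = table.get(new Get(rowkey)); // read back by the same row key
            String value = Bytes.toString(stored.getValue(CF, COL));
            producer.send(new ProducerRecord<>("T2", rec.key(), value));
        }
    }
}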
What are the similarities and differences between them? It looks like Java parallel Streams have some of the elements available in RxJava; is that right?
Rx is an API for creating and processing observable sequences. The Streams API is for processing iterable sequences. Rx sequences are push-based; you are notified when an element is available. A Stream is pull-based; it "asks" for items to process. They may appear similar because they both support similar operators/transforms, but the mechanics are essentially opposites of each other.
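A tiny illustration of that difference (RxJava 2 syntax; both pipelines apply the same transform):

import io.reactivex.Observable;

import java.util.stream.Stream;

public class PushVsPull {
    public static void main(String[] args) {
        // Pull: forEach drives the pipeline by asking the stream for the next element.
        Stream.of(1, 2, 3)
              .map(i -> i * 10)
              .forEach(System.out::println);

        // Push: the Observable calls the subscriber back whenever an element is emitted.
        Observable.just(1, 2, 3)
                  .map(i -> i * 10)
                  .subscribe(System.out::println);
    }
}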
Stream is pull-based. Personally, I feel it is Oracle's answer to C#'s IEnumerable<>, LINQ and their related extension methods.
RxJava is push-based; I'm not sure whether .NET's Reactive Extensions were released first or the Rx project went live first.
Conceptually they are totally different and their applications are also different.
If you are implementing a text-search program over a text file that is too large to load into memory, you would probably want to use a Stream, since you can easily check whether more lines are available by keeping track of your iterator and scanning line by line.
Another application of Stream is parallel computation over a collection of data. Nowadays every machine has multiple cores, but you won't easily know exactly how many cores are available on your client machine, so it would be hard to pre-configure the number of threads. Instead we use a parallel stream and let the JVM determine that for us (which is supposed to be closer to optimal).
On the other hand, if you are implementing a program that takes a user input string and searches for available videos on the web, you would use Rx, since you won't even know when the program will start getting results (or whether it will receive a network-timeout error). To keep your program responsive you let it "subscribe" to network updates and completion signals.
Another common application of Rx is in GUIs, to detect that the user has finished typing without requiring them to click a button to confirm. For example, you want a text field where, whenever the user stops typing, you start searching without waiting for a "Search" button click. In this case you use Rx to create an observable of key events and "throttle" (debounce) it, e.g. at 500 ms, so that whenever the user has stopped typing for 500 ms you receive an onNext() telling you to start searching.
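A hedged sketch of that pattern using a plain PublishSubject to stand in for the key events (the search method here is hypothetical):

import io.reactivex.Observable;
import io.reactivex.subjects.PublishSubject;

import java.util.concurrent.TimeUnit;

public class TypeAheadSearch {
    private final PublishSubject<String> queries = PublishSubject.create();

    public TypeAheadSearch() {
        queries
            .debounce(500, TimeUnit.MILLISECONDS)   // wait until the user stops typing for 500 ms
            .distinctUntilChanged()                 // ignore repeats of the same query
            .switchMap(this::search)                // cancel the previous search when a new query arrives
            .subscribe(System.out::println);
    }

    /** Called from the text field's key listener with the current text. */
    public void onTextChanged(String text) {
        queries.onNext(text);
    }

    // Hypothetical search returning results as an Observable.
    private Observable<String> search(String query) {
        return Observable.just("results for " + query);
    }
}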
There is also a difference in threading.
Stream#parallel splits the sequence into parts, and each part is processed in a separate thread.
Observable#subscribeOn and Observable#observeOn both 'move' execution to another thread, but they don't split the sequence.
In other words, for any particular processing stage:
parallel Stream may process different elements on different threads
Observable will use one thread for the stage
E.g., suppose we have an Observable/Stream of many elements and two processing stages:
Observable.create(...)
    .observeOn(Schedulers.io())
    .map(x -> stage1(x))
    .observeOn(Schedulers.io())
    .map(y -> stage2(y))
    .forEach(...);

Stream.generate(...)
    .parallel()
    .map(x -> stage1(x))
    .map(y -> stage2(y))
    .forEach(...);
The Observable will use no more than 2 additional threads (one per stage), so no two x's or y's are processed on different threads. The Stream, on the contrary, may spread each stage across several threads.