The input stream consists of data in the form of a JSON array of objects.
Each object has one field/key named State by which we need to separate the input stream; see the example below:
Object1 -> "State":"Active"
Object2 -> "State":"Idle"
Object3 -> "State":"Blocked"
Object4 -> "State":"Active"
We have to start a processing thread as soon as we receive a particular state. As long as incoming objects carry the same state as the previous one, the existing thread should keep handling them; when a new state arrives, a new thread should be started for it. Also, each thread is required to run for a finite time, and all the threads should run in parallel.
Please suggest how I can do this in Apache Flink. Pseudocode and links would be helpful.
This can be done with Flink's DataStream API. Each JSON object can be treated as a tuple and processed with any of the Flink operators.
                  /----- * ----> Active
 ---- (keyBy) ---+------ * ----> Idle
                  \----- * ----> Blocked
Now, you can split the single data stream into multiple streams using the keyBy operator. This operator groups all the tuples with a particular key (State, in your case) into a KeyedStream, which is processed in parallel. Internally, this is implemented with hash partitioning.
Any new keys (states) are handled dynamically, as new keyed streams are created for them.
Explore the documentation for the implementation details.
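For illustration, here is a minimal sketch of that, assuming the JSON objects have already been parsed into a hypothetical StateEvent POJO with a getState() accessor (MyJsonSource and MyStateProcessor are placeholders):

// Hypothetical sketch: group all events with the same "State" into one keyed partition,
// which Flink then processes in parallel across partitions.
DataStream<StateEvent> events = env.addSource(new MyJsonSource()); // source + parsing assumed

events
    .keyBy(StateEvent::getState)       // Active / Idle / Blocked ...
    .process(new MyStateProcessor());  // your per-state processing logic (assumed)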
From your description, I believe you'd first need an operator with a parallelism of 1 that "chunks" events by state and adds a "chunk id" to the output record. Whenever you get an event with a new state, you'd increment the chunk id.
Then key by the chunk id, which will parallelize downstream processing. Add a custom function which is keyed by the chunk id, and has a window duration of 10 minutes. This is where the bulk of your data processing will occur.
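A rough sketch of that pipeline, assuming a hypothetical Event POJO with a state field (MyChunkProcessor is also a placeholder, and the chunk counter is not checkpointed here):

// Hypothetical sketch: a parallelism-1 map assigns an incrementing chunk id whenever
// the state changes, then the stream is re-keyed by chunk id for parallel windowing.
DataStream<Tuple2<Long, Event>> chunked = events
    .map(new RichMapFunction<Event, Tuple2<Long, Event>>() {
        private long chunkId = 0;
        private String lastState = null;

        @Override
        public Tuple2<Long, Event> map(Event e) {
            if (lastState == null || !lastState.equals(e.getState())) {
                chunkId++;               // new state seen: start a new chunk
                lastState = e.getState();
            }
            return Tuple2.of(chunkId, e);
        }
    })
    .setParallelism(1);                  // must be 1 so the chunk ids are consistent

chunked
    .keyBy(value -> value.f0)                                     // key by chunk id
    .window(TumblingProcessingTimeWindows.of(Time.minutes(10)))   // 10-minute windows
    .process(new MyChunkProcessor());                             // bulk of the processing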
And as @narush noted above, you should read through the documentation he linked to, so you understand how windows work in Flink.
I have a Flink application that processes a stream of data and writes some results to a database. The stream is keyed by id. A database operation can take quite a long time (e.g. 3 minutes), and there may be only one operation in flight for a given id key, to protect against locks. At the moment this sink operation cannot be processed in parallel and has to be set to a parallelism of 1.
process
    .keyBy(new ProductKeySelector())
    .addSink(new ProductSink())
    .setParallelism(1);
I want to lock the stream on the id currently being processed and take another event, out of order, then wait until the in-flight operation for that id finishes before processing the next event with the same id. It would work like a blocking queue.
Update:
example:
kafkaKeyedStream
    .map(new MapToProductType())
    .keyBy(new ProductKeySelector())
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
    .reduce(new ProductAggregateFunction())
    .addSink(new ProductSink());
From the Kafka source I received data where the records are grouped by the window function (the first value in each record is the key) and the results are processed by the sink function. For this example, let's say processing takes 20 seconds per batch of data. With one thread that is not a problem, because the next batch waits for the current one to finish; but if I set parallelism = 2, the first batch is still being processed by one thread when, after 10 seconds, another thread starts processing the next batch with the same key as the first. This creates a lock on the database.
I would like that, in a situation where one thread is already processing data under a specific key, a second thread does not take data for the same key, but instead takes a different key, or does nothing if there is nothing else to take.
If your DB operation could take 3 minutes, you don't want to use a regular JDBC sink. Instead, look at Flink's Async IO support. You'd want to keyBy(id), and then inside of your custom RichAsyncFunction operator you can keep track of whether you've got an active DB request for a given id.
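A rough sketch of that shape, with hypothetical names (ProductEvent, WriteResult, writeToDb) standing in for your own types, and the per-id request tracking omitted:

// Hypothetical sketch: keyBy(id) routes all events for an id to the same subtask, and the
// slow database call is issued through Flink's Async I/O instead of a blocking sink.
DataStream<ProductEvent> keyed = products.keyBy(ProductEvent::getId);

DataStream<WriteResult> results = AsyncDataStream.unorderedWait(
    keyed,
    new RichAsyncFunction<ProductEvent, WriteResult>() {
        @Override
        public void asyncInvoke(ProductEvent event, ResultFuture<WriteResult> resultFuture) {
            // run the (possibly minutes-long) DB operation off the main task thread
            CompletableFuture
                .supplyAsync(() -> writeToDb(event))               // writeToDb is assumed
                .thenAccept(r -> resultFuture.complete(Collections.singleton(r)));
        }
    },
    5, TimeUnit.MINUTES,   // per-request timeout
    100);                  // max in-flight async requests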
I am trying to join two unbounded PCollections that I am getting from two different Kafka topics on the basis of a key.
As per the docs and other blogs, a join is only possible if we do windowing: the window collects the messages from both streams within a particular window and joins them, which is not what I need.
The situation is that in one stream the messages arrive at a very low frequency, while from the other stream we get messages at a high frequency. I want the join for a key to be deferred until a value for that key has arrived on both streams, and performed as soon as it has.
Is this possible using the current Beam paradigm?
In short, the best solution is to use a stateful DoFn in Beam. You can have per-key state (and per-window, which is the global window in your case). You can save one stream's events in state, and once events from the other stream appear with the same key, join them with the events in state. Here is a reference [1].
However, the short answer does not utilize the true power of the Beam model. The Beam model provides ways to balance latency, cost and accuracy, and it provides a simple API that hides the complexity of streaming processing.
Why am I saying that? Let's go back to the short answer's solution: a stateful DoFn. With the stateful DoFn approach, you lack ways to address the following questions:
What if you have buffered 1M events for one key and still no event has appeared from the other stream? Do you need to empty the state? What if the event appears right after you emptied the state?
If eventually one event does appear to complete a join, is the cost of buffering 1M events acceptable for joining a single event from the other stream?
How do you handle late data on both streams? Say you have joined <1, a> from the left stream with <1, b> from the right stream. Later another <1, c> arrives from the left stream; how do you know that you only need to emit <1, <c, b>> (assuming an incremental mode of outputting results)? If you start buffering the already-joined events to compute the delta, that really becomes too complicated for a programmer.
Beam's windowing, triggers, refinement of output data, watermarks and lateness SLA control are designed to hide this complexity from you:
watermark: tells you when a window is complete, meaning no more events are expected for it (and any further events are treated as late data)
lateness SLA control: controls how long you cache data for the join.
refinement of output data: updates the output correctly when allowed late events arrive.
Although the Beam model is well designed, the implementations of it are missing critical features needed to support the join you described:
windowing is not flexible enough to support your case, where the streams have hugely different frequencies (so fixed and sliding windows do not fit); and you also don't know the arrival rate of the streams (so session windows do not really fit either, as you have to give a gap between sessions).
retraction is missing, so you cannot refine your output once late events arrive.
To conclude, the Beam model is designed to handle the complexity of streaming processing and fits your need well, but the implementations are not yet complete enough to let you finish your join use case with it today.
[1] https://beam.apache.org/blog/2017/02/13/stateful-processing.html
This isn't something that is well supported by the Beam model today, but there are a few ways you can do it. These examples assume each key appears exactly once on each stream; if that isn't the case, you'll need to adjust them.
One option is to use the Global Window and Stateful DoFn instead of a Join. The Global Window effectively turns windowing off. A stateful DoFn lets you store data about the key you are processing in a "state cell" for later use. When you receive a record, you would check the state cell for a value. If you find one, do the join, emit the value, and clear the state. If there isn't anything, store the current value.
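A minimal sketch of that approach, assuming both streams have already been flattened into KV<String, String> elements keyed by the join key (the function name and state id are made up):

// Hypothetical sketch of the Global Window + stateful DoFn join. Whichever side arrives
// first is buffered in state; the second arrival emits the pair and clears the state.
static class JoinFn extends DoFn<KV<String, String>, KV<String, KV<String, String>>> {

    @StateId("buffered")
    private final StateSpec<ValueState<String>> bufferedSpec =
        StateSpecs.value(StringUtf8Coder.of());

    @ProcessElement
    public void process(ProcessContext c,
                        @StateId("buffered") ValueState<String> buffered) {
        String other = buffered.read();
        if (other == null) {
            buffered.write(c.element().getValue());    // first side seen: store it
        } else {
            c.output(KV.of(c.element().getKey(),       // second side seen: emit the join
                           KV.of(other, c.element().getValue())));
            buffered.clear();                          // and clear the state
        }
    }
}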
Another option is to use Session Windows and Join. The session window "GapDuration" is effectively a timeout on a given key. This works as long as you have a time bound within which you will see the key on both streams. You'll also want to set up an element count trigger, "AfterPane.elementCountAtLeast(2)", so you don't have to wait for the full timeout after seeing the second piece of data.
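A sketch of that windowing setup, with a made-up 30-minute gap; the same windowing would be applied to both input PCollections before the join (e.g. via CoGroupByKey):

// Hypothetical sketch: the session gap acts as a per-key timeout, and the element-count
// trigger fires as soon as the second element for a key arrives, without waiting it out.
PCollection<KV<String, String>> windowed = input.apply(
    Window.<KV<String, String>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(2)))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());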
I have read that streams are evaluated lazily in a pipeline: nothing is executed until a terminal operation is invoked on the expression. But what are the benefits of this lazy evaluation of the expressions? Can someone please explain with some examples?
Lazy evaluation of stream pipelines in Java has nothing to do with lambdas.
All stages but the last in a stream pipeline are evaluated only when the last stage asks for more data. When the last stage pulls from the one before it, that one pulls from the one before it, and so forth, until the request for new data finally makes it all the way to the supplier. The supplier then provides the next value, which gets propagated through the pipeline to the last stage (a collector or a forEach), and the cycle repeats.
This approach does not require the pipeline implementation to over-produce and/or buffer anything, which is great for CPU and memory footprint; both are usually significant concerns for Java applications.
Example:
// requires: import java.util.Random;
//           import static java.util.stream.Collectors.toList;
Stream
    .generate(new Random(42)::nextLong) // supplies the next random number each time the next stage pulls it
    .limit(5)                           // pulls from the generator when it is pulled itself;
                                        // terminates the pipeline after 5 pulls
    .collect(toList());                 // pulls data through the pipeline into the list
Say I have a Flink SourceFunction<String> called RequestsSource.
On each request coming in from that source, I would like to subscribe to an external data source (for the purposes of an example, it could start a separate thread and start producing data on that thread).
The output data could then be combined into a single DataStream. For example:
Input Requests: A, B
Data produced:
A1
B1
A2
A3
B2
...
... and so on, with new elements being added to the DataStream forever.
How do I write a Flink Operator that can do this? Can I use e.g. FlatMapFunction?
You'd typically want to use an AsyncFunction, which can (asynchronously) take one input element, call some external service, and emit a collection of results.
See also Apache Flink Training - Async IO.
-- Ken
It sounds like you are asking about an operator that can emit one or more boundless streams of data based on a connection to an external service, after receiving subscription events. The only clean way I can see to do this is to do all the work in the SourceFunction, or in a custom operator.
I don't believe async I/O can emit an unbounded stream of results from a single input event. A ProcessFunction can do that, but only via its onTimer method.
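To illustrate the onTimer idea, here is a rough sketch; Request and pollExternalSource are placeholders, and real code would also manage the subscription lifecycle and checkpointed state:

// Hypothetical sketch: each incoming request registers a processing-time timer for its key,
// and every onTimer firing emits one more element and re-registers, yielding an unbounded stream.
public class SubscriptionFunction extends KeyedProcessFunction<String, Request, String> {

    @Override
    public void processElement(Request req, Context ctx, Collector<String> out) {
        // start the "subscription" for this key by scheduling the first timer
        ctx.timerService().registerProcessingTimeTimer(
            ctx.timerService().currentProcessingTime() + 1000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        out.collect(pollExternalSource(ctx.getCurrentKey()));   // emit the next element
        // re-register so this key keeps producing elements indefinitely
        ctx.timerService().registerProcessingTimeTimer(timestamp + 1000);
    }

    private String pollExternalSource(String key) {
        return key + "-" + System.currentTimeMillis();          // placeholder for the real call
    }
}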
Similar to, but slightly different from, this question (KStream batch process windows): I want to batch messages from a KStream before pushing them down to consumers.
However, this push-down should not be scheduled on a fixed time-window, but on a fixed message count threshold per key.
For starters, two questions come to mind:
1) Is a custom AbstractProcessor the way this should be handled? Something along the lines of:
@Override
public void punctuate(long streamTime) {
    KeyValueIterator<String, Message[]> it = messageStore.all();
    while (it.hasNext()) {
        KeyValue<String, Message[]> entry = it.next();
        if (entry.value.length > 10) {
            context().forward(entry.key, entry.value);
            messageStore.put(entry.key, new Message[0]);   // reset the batch for this key
        }
    }
}
2) Since the StateStore will potentially explode (in case an entry's value never reaches the threshold and is therefore never forwarded), what is the best way to 'garbage-collect' it? I could do a time-based schedule and remove keys that are too old... but that looks very DIY and error-prone.
I guess this would work. Applying a time-based 'garbage collection' sounds reasonable, too. And yes, using the Processor API instead of the DSL has some flavor of DIY -- that's the purpose of the PAPI in the first place (to empower the user to do whatever is needed).
A few comments though:
You will need a more complex data structure: because punctuate() is called based on stream-time progress, it can happen that you have more than 10 records for one key between two calls. Thus, you would need something like KeyValueIterator<String, List<Message[]>> it = messageStore.all(); to be able to store multiple batches per key.
I would assume that you will need to fine-tune the schedule for punctuate(), which will be tricky -- if your schedule is too tight, many batches might not be completed yet and you waste CPU; if your schedule is too loose, you will need a lot of memory and your downstream operators will get a lot of data at once, as you emit a lot of stuff in one go. Sending bursts of data downstream could become a problem.
Scanning the whole store is expensive -- it seems to be a good idea to try to "sort" your key-value pairs according to their batch size. This should enable you to only touch keys which do have completed batches instead of all keys. Maybe you can keep an in-memory list of keys that have completed batches and only do a lookup for those (on failure, you would need to do a single pass over all keys in the store to recreate this in-memory list).
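To make the last point concrete, here is a hedged sketch of that idea against the same (older) Processor API used in the question; Message, the store name and the batch size are assumptions:

// Hypothetical sketch: process() appends to the per-key batch and remembers which keys have a
// completed batch, so punctuate() only touches those keys instead of scanning the whole store.
public class BatchingProcessor extends AbstractProcessor<String, Message> {

    private static final int BATCH_SIZE = 10;                     // assumed threshold
    private final Set<String> completedKeys = new HashSet<>();    // rebuild from the store on restart
    private KeyValueStore<String, List<Message>> messageStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        messageStore = (KeyValueStore<String, List<Message>>) context.getStateStore("message-store");
        context.schedule(1000);                                    // stream-time punctuation (old API)
    }

    @Override
    public void process(String key, Message value) {
        List<Message> batch = messageStore.get(key);
        if (batch == null) {
            batch = new ArrayList<>();
        }
        batch.add(value);
        messageStore.put(key, batch);
        if (batch.size() >= BATCH_SIZE) {
            completedKeys.add(key);                                // only these keys get forwarded
        }
    }

    @Override
    public void punctuate(long streamTime) {
        for (String key : completedKeys) {
            context().forward(key, messageStore.get(key));
            messageStore.put(key, new ArrayList<>());              // reset the batch
        }
        completedKeys.clear();
    }
}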