Join two unbounded PCollections on key - java

I am trying to join two unbounded PCollections, read from two different Kafka topics, on the basis of a key.
According to the docs and other blogs, a join is only possible with windowing: a window collects the messages from both streams over a particular interval and joins them. That is not what I need.
In my case, messages arrive on one stream at a very low frequency, and on the other stream at a high frequency. I want the join for a key to wait until its value has arrived on both streams, and to happen only once it has.
Is this possible using the current Beam paradigm?

In short, the best solution is to use a stateful DoFn in Beam. You can have per-key state (and per-window; in your case, the global window). You can save one stream's events in state, and once an event with the same key appears on the other stream, join it with the events in state. Here is a reference [1].
However, the short answer does not utilize the true power of the Beam model. The Beam model provides ways to balance latency, cost, and accuracy, and a simple API that hides the complexity of stream processing.
Why do I say that? Let's go back to the short answer's solution: the stateful DoFn. With the stateful-DoFn approach, you have no way to address the following questions:
What if you have buffered 1M events for one key and still no event has appeared on the other stream? Do you need to empty the state? What if the event appears right after you empty the state?
If eventually a single event does arrive to complete a join, is the cost of buffering 1M events acceptable for joining against a single event from the other stream?
How do you handle late data on both streams? Say you have joined <1, a> from the left stream with <1, b> from the right stream. Later another <1, c> arrives on the left stream. How do you know that you only need to emit <1, <c, b>> (assuming incremental output of results)? If you start buffering already-joined events to compute deltas, that really becomes too complicated for a programmer.
Beam's windowing, triggers, refinement of output data, watermarks, and lateness SLA control are designed to hide this complexity from you:
watermark: tells you when a window is complete, such that further events will no longer arrive (and any that do are treated as late data)
lateness SLA control: controls how long you cache data for the join.
refinement of output data: updates the output correctly when allowed new events arrive.
Although the Beam model is well designed, its implementations are missing critical features to support the join you described:
windowing is not flexible enough for your case, where the streams have hugely different frequencies (so fixed and sliding windows do not fit), and you don't know the arrival rate of the streams (so session windows do not really fit either, since you have to give a gap between sessions).
retraction is missing, so you cannot refine your output once late events arrive.
To conclude, the Beam model is designed to handle the complexity of stream processing, which perfectly fits your need, but its implementation is not yet good enough to let you finish your join use case.
[1] https://beam.apache.org/blog/2017/02/13/stateful-processing.html

This isn't something that is well supported by the Beam model today, but there are a few ways you can do it. These examples assume each key appears exactly once on each stream; if that isn't the case, you'll need to adjust them.
One option is to use the Global Window and Stateful DoFn instead of a Join. The Global Window effectively turns windowing off. A stateful DoFn lets you store data about the key you are processing in a "state cell" for later use. When you receive a record, you would check the state cell for a value. If you find one, do the join, emit the value, and clear the state. If there isn't anything, store the current value.
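For concreteness, here is a minimal sketch of that approach, assuming the two keyed streams have already been flattened into one PCollection<KV<String, String>> (e.g. via Flatten.pCollections()) and each key appears exactly once per stream; all names are illustrative:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class PairOnKeyFn extends DoFn<KV<String, String>, KV<String, KV<String, String>>> {

    // "State cell" holding the first value seen for a key until its partner arrives.
    @StateId("partner")
    private final StateSpec<ValueState<String>> partnerSpec =
            StateSpecs.value(StringUtf8Coder.of());

    @ProcessElement
    public void process(ProcessContext c, @StateId("partner") ValueState<String> partner) {
        String buffered = partner.read();
        if (buffered == null) {
            // First side for this key: store it and wait for the other stream.
            partner.write(c.element().getValue());
        } else {
            // Second side arrived: emit the joined pair and clear the state.
            c.output(KV.of(c.element().getKey(), KV.of(buffered, c.element().getValue())));
            partner.clear();
        }
    }
}

Note that a plain flatten loses track of which stream an element came from; if the order within the output pair matters, tag each element with its source before flattening.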
Another option is to use Session Windows and a Join. The session window's GapDuration is effectively a timeout on a given key, so this works as long as there is a time bound within which you will see the key on both streams. You'll also want to set up an element-count trigger, AfterPane.elementCountAtLeast(2), so you don't have to wait for the full timeout after seeing the second piece of data.
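A hedged sketch of that windowing configuration, applied to each keyed input before the join; "input" and the one-hour gap are placeholders for your streams and whatever per-key timeout fits them:

import org.joda.time.Duration;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Session windows act as the per-key timeout; the element-count trigger fires
// the pane as soon as both sides of the join are present, instead of waiting
// for the gap to elapse.
PCollection<KV<String, String>> windowed = input.apply(
        Window.<KV<String, String>>into(Sessions.withGapDuration(Duration.standardHours(1)))
                .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(2)))
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes());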

Related

Applying window based rules in Apache Flink Broadcast stream

I have a set of rules in my BroadcastStream in Apache Flink.
I am able to apply new rules as they arrive to my stream of events.
But I cannot figure out how to implement rules like:
rule 1> alert when the count of event a is greater than 5 in a window of 5 minutes
rule 2> alert when the count of event a is greater than 4 in a window of 15 minutes
I am a newbie to Flink and cannot figure this out.
An application based on flink-sql or flink-cep won't be able to do this, because those libraries can only handle rules that are defined at the time the job is compiled. You would need to start a new job for each new rule, which may not meet your requirements.
If you want to have a single job that can handle a dynamic set of rules that are supplied while the job is running, you'll have to build this yourself. You can use a KeyedBroadcastProcessFunction to do this (which it sounds like you have already begun to experiment with).
Here's a sketch of a possible implementation:
You can use keyed state in the KeyedBroadcastProcessFunction to keep track of the current count in each window. If the rules can be characterized by a time interval and a counting threshold, then you could use MapState, where the keys are the rule IDs, and the values in the map are the current count for that rule. You can have a timer for each rule that fires when each window ends.
As events arrive, you iterate over the broadcast rules, incrementing the map's counter for every relevant rule. When a timer fires, you find the relevant rules, compare the counters to the thresholds, take appropriate action, and clear those counters.
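Here is a compressed, hedged version of that sketch in code; the Event and Rule types and the window bookkeeping are simplified assumptions, not a production implementation:

import java.util.Map;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Illustrative event and rule types; a real job would use proper POJOs.
class Event { String key; String type; long timestamp; }
class Rule  { String id; String eventType; long windowMillis; long threshold; }

public class DynamicCountingFunction
        extends KeyedBroadcastProcessFunction<String, Event, Rule, String> {

    // Broadcast state: ruleId -> Rule, updated in processBroadcastElement
    static final MapStateDescriptor<String, Rule> RULES =
            new MapStateDescriptor<>("rules", String.class, Rule.class);

    // Keyed state: ruleId -> running count for this key's current window
    private transient MapState<String, Long> counts;

    @Override
    public void open(Configuration conf) {
        counts = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("counts", String.class, Long.class));
    }

    @Override
    public void processBroadcastElement(Rule rule, Context ctx, Collector<String> out)
            throws Exception {
        ctx.getBroadcastState(RULES).put(rule.id, rule);  // dynamic rule update
    }

    @Override
    public void processElement(Event event, ReadOnlyContext ctx, Collector<String> out)
            throws Exception {
        for (Map.Entry<String, Rule> entry : ctx.getBroadcastState(RULES).immutableEntries()) {
            Rule rule = entry.getValue();
            if (rule.eventType.equals(event.type)) {
                Long current = counts.get(rule.id);
                counts.put(rule.id, current == null ? 1L : current + 1);
                // Register a timer for the end of this rule's window; timers are
                // not allowed on the broadcast side, so they live here.
                long windowEnd = event.timestamp - (event.timestamp % rule.windowMillis)
                        + rule.windowMillis;
                ctx.timerService().registerEventTimeTimer(windowEnd);
            }
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
            throws Exception {
        // Flink keeps one timer per key and timestamp, so several rules may share
        // this firing; check each rule whose window could end now.
        for (Map.Entry<String, Rule> entry : ctx.getBroadcastState(RULES).immutableEntries()) {
            Rule rule = entry.getValue();
            if (timestamp % rule.windowMillis == 0) {
                Long count = counts.get(rule.id);
                if (count != null && count > rule.threshold) {
                    out.collect("ALERT: rule " + rule.id + " fired with count " + count);
                }
                counts.remove(rule.id);
            }
        }
    }
}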
Some potential complications to keep in mind:
This implementation requires that you partition your stream with a keyBy, so that you can use MapState and timers.
The broadcast stream can't have timers associated with it, so the timers will have to be managed by the processElement method that's handling the keyed stream.
Flink only allows one timer for a given key and given timestamp. So take care if you must handle the case where two rules would need to be triggered at the same time.
If events can arrive out of order, then you will need to either first sort the stream by timestamp, or allow for having multiple windows open concurrently.

How to sort infinite event streams using reactive programming?

Problem Statement:
I have the following stream: 1, 2, 4, 6, 3, 5, .... I expect the events to reach the subscribers as 1, 2, 3, 4, 5, 6, ....
For Simplicity:
There is no possibility of an element going missing, i.e., you will not get 1, 2, 4, 5, 6, ... with 3 missing forever.
Sent messages can be deleted from the in-memory data structure, freeing space for the others.
This is an infinite stream, and it may sometimes be too large to store everything in memory (at worst this may lead to an out-of-memory exception, which is fine).
You can pile up the events using window(n) or other methods, but the events are still expected to be published in sequence.
With respect to my code: I have a Flowable that receives inbound events. These events are not in order, but they must reach the subscribers in order (ascending order of event ID).
Please let me know:
how can I achieve this, either using RxJava or without Rx?
what would be an optimal design for this without any event loss?
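For reference, a minimal non-Rx sketch of the resequencing behavior described above (class and method names are illustrative): buffer each event by ID, and whenever the next expected ID is present, drain the buffer contiguously so subscribers see a strictly ascending sequence. Delivered entries are removed, which bounds memory to the size of the out-of-order gap.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class Resequencer<T> {
    private final Map<Long, T> buffer = new HashMap<>();
    private final Consumer<T> downstream;
    private long nextId;

    public Resequencer(long firstId, Consumer<T> downstream) {
        this.nextId = firstId;
        this.downstream = downstream;
    }

    public synchronized void onEvent(long id, T value) {
        buffer.put(id, value);
        // Drain every contiguous event starting at the next expected ID;
        // delivered entries are removed, freeing memory for new arrivals.
        T next;
        while ((next = buffer.remove(nextId)) != null) {
            downstream.accept(next);
            nextId++;
        }
    }
}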

RxJava vs Java 8 Parallelism Stream

What are the similarities and differences between them? It looks like the Java parallel stream has some of the elements available in RxJava. Is that right?
Rx is an API for creating and processing observable sequences. The Streams API is for processing iterable sequences. Rx sequences are push-based; you are notified when an element is available. A Stream is pull-based; it "asks" for items to process. They may appear similar because they both support similar operators/transforms, but the mechanics are essentially opposites of each other.
Stream is pull-based. Personally, I feel it is Oracle's answer to C#'s IEnumerable<>, LINQ, and their related extension methods.
RxJava is push-based; it is Java's counterpart to .NET's Reactive Extensions (I am not sure which of the two went live first).
Conceptually they are totally different and their applications are also different.
If you are implementing a text-search program over a text file so large that you can't load all of it into memory, you would probably want to use Stream, since you can easily determine whether a next line is available by keeping track of your iterator, and scan line by line.
Another application of Stream is parallel computation on a collection of data. Nowadays every machine has multiple cores, but you won't easily know exactly how many cores are available on your client's machine, so it is hard to pre-configure the number of threads. Instead we use a parallel stream and let the JVM determine that for us (supposed to be more optimal).
On the other hand, if you are implementing a program that takes a user's input string and searches the web for available videos, you would use Rx, since you won't even know when the program will start getting results (or receive a network-timeout error). To make your program responsive, you have to let it "subscribe" to network updates and completion signals.
Another common application of Rx is in GUIs, to detect that the user has finished typing without requiring a confirming button click. For example, you may want a text field that starts searching whenever the user stops typing, without waiting for a "Search" button click. In this case you use Rx to create an observable of key events and "throttle" them (e.g. at 500 ms), so that whenever the user has stopped typing for 500 ms you receive an onNext() to start searching.
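A small sketch of that pattern with RxJava 3's debounce operator (the operator that matches the "throttle" behavior described above); the PublishSubject stands in for the text field's key events:

import java.util.concurrent.TimeUnit;
import io.reactivex.rxjava3.subjects.PublishSubject;

public class DebounceSearch {
    public static void main(String[] args) throws InterruptedException {
        // Each onNext below stands in for one keystroke from the text field.
        PublishSubject<String> keystrokes = PublishSubject.create();

        keystrokes
            .debounce(500, TimeUnit.MILLISECONDS)  // emit only after 500 ms of silence
            .distinctUntilChanged()                // skip if the text did not change
            .subscribe(query -> System.out.println("start searching: " + query));

        keystrokes.onNext("b");
        keystrokes.onNext("be");
        keystrokes.onNext("bea");
        keystrokes.onNext("beam");  // the user pauses here...
        Thread.sleep(1000);         // ...and only "beam" triggers a search
    }
}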
There is also a difference in threading.
Stream#parallel splits the sequence into parts, and each part is processed in a separate thread.
Observable#subscribeOn and Observable#observeOn both 'move' execution to another thread, but they don't split the sequence.
In other words, for any particular processing stage:
parallel Stream may process different elements on different threads
Observable will use one thread for the stage
E.g., suppose we have an Observable/Stream of many elements and two processing stages:
Observable.create(...)
    .observeOn(Schedulers.io())  // move downstream work onto an io() thread
    .map(x -> stage1(x))         // stage 1 runs on one worker at a time
    .observeOn(Schedulers.io())  // hand off to another io() thread for stage 2
    .map(y -> stage2(y))
    .forEach(...);

Stream.generate(...)
    .parallel()                  // split the sequence across the common fork-join pool
    .map(x -> stage1(x))         // both stages may run on many threads at once
    .map(y -> stage2(y))
    .forEach(...);
The Observable will use no more than 2 additional threads (one per stage), so no two x's or y's are accessed by different threads. The Stream, on the contrary, may spread each stage across several threads.

Is there an API that allows ordering event in clustered application?

Given the following facts, is there an existing open-source Java API (possibly as part of some larger product) that implements an algorithm enabling the reproducible ordering of events in a cluster environment?
1) There are N sources of events, each with a unique ID.
2) Each event produced has an ID/timestamp, which, together with its source ID, makes it uniquely identifiable.
3) The IDs can be used to sort the events.
4) There are M application servers receiving those events. M is normally 3.
5) The events can arrive at any one or more of the application servers, in no specific order.
6) The events are processed in batches.
7) The servers have to agree for each batch on the list of events to process.
8) The events each have an earliest and a latest batch ID in which they must be processed.
9) They must not be processed earlier, and are "failed" if they cannot be processed before the deadline.
10) The batches are based on real clock time. For example, one batch per second.
11) The events of a batch are processed when 2 of the 3 servers agree on the list of events to process for that batch (quorum).
12) The "third" server then has to wait until it possesses all the required events before it can process that batch too.
13) Once an event is processed or failed, the source has to be informed.
14) [EDIT] Events from one source must be processed (or failed) in the order of their ID/timestamp, but there is no causality between different sources.
Less formally, I have these servers that receive events. They start from the same initial state and should keep in sync by agreeing on which events to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so I have some time to get the servers to agree before the deadline. But I'm not sure whether that actually makes any real difference to the algorithms. And if all servers agree on all batches, then they will always be in sync, and will therefore present a consistent view when queried.
While I would be most happy with a Java API, I would accept something else if I can call it from Java. And if there is no open-source API, but a clear algorithm, I would also take that as an answer and try to implement it myself.
Looking at the question and your follow-up, there probably "wasn't" an API to satisfy your requirements. Today, you could take a look at Kafka (from LinkedIn):
Apache Kafka
And at the general concept of a "log" entity, in what folks like to call "big data":
The Log: What every software engineer should know about real-time data's unifying abstraction
Actually, for your question, I'd begin with the blog post about "the log". In my terms, the way it works (and Kafka isn't the only package doing log handling) is as follows:
Instead of queue-based message-passing / publish-subscribe,
Kafka uses a "log" of messages.
Subscribers (or endpoints) can consume the log.
The log guarantees to be "in order"; it handles giga-data and is fast.
Double-check that guarantee; there is usually a trade-off for reliability.
You just read the log; reads are not destructive by default: consumers simply advance their own offsets, and entries expire according to the retention policy.
With a subscriber group, everyone can "read" a log entry before it expires.
The basic handling (compute) process for the log is a map-reduce-filter model: you read everything really fast, keep only the stuff you want, process it (reduce), and produce outcome(s).
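As a point of reference, here is a minimal sketch of a consumer reading such a log with the standard Kafka Java client; the topic name, server address, and group ID are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "event-orderers");  // each group member reads its share
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                // Records within a partition arrive in log order; reading is
                // non-destructive, the consumer just advances its offset.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            r.offset(), r.key(), r.value());
                }
            }
        }
    }
}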
The downside seems to be that you need clusters and such to make it really shine. Since different servers or sites were mentioned, I think we are still on track. I found it a bit finicky to get up and running with the Apache downloads, because they tend to assume non-Windows environments (ho hum).
The other 'fast' option would be
Apache Apollo
which would need you to do the plumbing for connecting the different servers yourself. Since the requirements include ...
servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline
I suggest looking at a "Getting Started" example or tutorial for Kafka, and then looking at similar ZooKeeper-organised message/log software. Good luck and enjoy!
So far I haven't got a clear answer, but I think it would be useful for anyone interested to see what I found.
Here are some theoretical discussions related to the subject:
Dynamic Vector Clocks for Consistent Ordering of Events
Conflict-free Replicated Data Types
One way of making multiple concurrent processes wait for each other, which I could use to synchronize the "batches", is a distributed barrier. One Java implementation seems to be available on top of Hazelcast, and another uses ZooKeeper.
One simpler alternative I found is to use a DB. Every process inserts all the events it receives into the DB. Depending on the DB design, this can be fully concurrent and lock-free, as in VoltDB, for example. Then, at a regular interval of one second, a "cron job" runs that selects and marks the events to be processed in the next batch. The job can run on every server; the first to run it for a given batch fixes the set of events, so the others just use the list that the first one defined. That way we have a guarantee that all batches contain the same set of events on all servers. And if we can use a total order over the whole batch, which the cron job could itself specify, then the state of the servers will be kept in sync. A sketch of the batch-marking step follows.
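A hedged sketch of that batch-marking step over JDBC; the schema and names are invented for illustration, and the single transaction is what lets the first server to run fix the batch for everyone:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import javax.sql.DataSource;

public class BatchMarker {
    private final DataSource dataSource;

    public BatchMarker(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Atomically stamp every not-yet-batched event up to the cutoff with the
    // given batch ID; a later run for the same cutoff finds no NULL rows left.
    public void markBatch(long batchId, Timestamp cutoff) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE events SET batch_id = ? "
                  + "WHERE batch_id IS NULL AND created_at <= ?")) {
                ps.setLong(1, batchId);
                ps.setTimestamp(2, cutoff);
                ps.executeUpdate();
            }
            conn.commit();
        }
    }
}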

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?

GAE Documentation says that FIFO is one of the factors that affect task execution order, but the same documentation says that “the system's scheduling may 'jump' new tasks to the head of the queue” and I have confirmed this behavior with a test. The effect: my events are being processed out of order.
The docs say:
https://developers.google.com/appengine/docs/java/taskqueue/overview-push
The order in which tasks are executed depends on several factors:
The position of the task in the queue. App Engine attempts to process tasks based on FIFO (first in, first out) order. In general, tasks are inserted into the end of a queue, and executed from the head of the queue.
The backlog of tasks in the queue. The system attempts to deliver the lowest latency possible for any given task via specially optimized notifications to the scheduler. Thus, in the case that a queue has a large backlog of tasks, the system's scheduling may "jump" new tasks to the head of the queue.
The value of the task's etaMillis property. This property specifies the earliest time that a task can execute. App Engine always waits until after the specified ETA to process push tasks.
The value of the task's countdownMillis property. This property specifies the minimum number of seconds to wait before executing a task. Countdown and eta are mutually exclusive; if you specify one, do not specify the other.
What do I need to do? In my use case, I'll process 1-2 million events/day coming from vehicles. These events can be sent at any interval (1 second, 1 minute, or 1 hour). The order of event processing has to be assured: I need to process by timestamp order, where the timestamp is generated on an embedded device inside the vehicle.
What I have now:
A REST servlet that is called by the consumer and creates a Task (event data is in the payload).
After this, a worker servlet gets the Task and:
Deserializes the event data;
Puts the event in the Datastore;
Updates the vehicle in the Datastore.
So, again, is there any way to assure FIFO behavior? Or how can I improve this solution to achieve it?
You need to approach this with three separate steps:
1) Implement a sharded counter to generate a monotonically increasing ID. As much as I would like to use the timestamp from Google's servers to indicate task ordering, it appears that timestamps between GAE servers may vary more than your requirement allows.
2) Add your tasks to a pull queue instead of a push queue. When constructing your TaskOptions, add the ID obtained from step #1 as a tag. After adding the task, store the ID somewhere in your datastore.
3) Have your worker servlet lease tasks by tag from the pull queue. Query the datastore for the earliest ID that you need to fetch, and use that ID as the lease tag. In this way, you can simulate FIFO behavior for your task queue.
After you finish your processing, delete the ID from your datastore, and don't forget to delete the task from your pull queue too. Also, I would recommend running your task consumption on a backend. A sketch of steps #2 and #3 follows.
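Here is a hedged sketch of steps #2 and #3 with the GAE pull-queue API; the queue name, the source of the IDs (the sharded counter and datastore query), and process() are assumptions standing in for application code:

import java.util.List;
import java.util.concurrent.TimeUnit;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskHandle;
import com.google.appengine.api.taskqueue.TaskOptions;

public class FifoQueueSketch {
    private final Queue queue = QueueFactory.getQueue("vehicle-events");

    // Step #2: enqueue the event, tagged with its monotonically increasing ID.
    public void enqueue(long nextId, byte[] eventBytes) {
        queue.add(TaskOptions.Builder.withMethod(TaskOptions.Method.PULL)
                .tag(Long.toString(nextId))
                .payload(eventBytes));
    }

    // Step #3: lease only the tasks carrying the earliest pending ID
    // (obtained from the datastore query described above).
    public void drainEarliest(long earliestPendingId) {
        String tag = Long.toString(earliestPendingId);
        List<TaskHandle> tasks = queue.leaseTasksByTag(60, TimeUnit.SECONDS, 100, tag);
        for (TaskHandle task : tasks) {
            process(task.getPayload());  // application logic (assumed)
            queue.deleteTask(task);      // finished: remove from the pull queue
        }
    }

    private void process(byte[] payload) { /* ... */ }
}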
UPDATE:
As noted by Nick Johnson and mjaggard, sharding in step #1 doesn't seem to be viable for generating monotonically increasing IDs, so another source of IDs would be needed. I seem to recall you were using timestamps generated by your vehicles; would it be possible to use those in lieu of a monotonically increasing ID?
Regardless of how the IDs are generated, the basic idea is to use the datastore's query mechanism to produce a FIFO ordering of tasks, and to use the task tag to pull specific tasks from the task queue.
There is a caveat, though. Due to the eventually consistent read policy of the High Replication Datastore, if you choose HRD as your datastore (and you should; M/S is deprecated as of April 4th, 2012), the query in step #3 might return some stale data.
I think the simple answer is "no". However, partly in order to improve the situation, I am using a pull queue, pulling 1000 tasks at a time and then sorting them. If timing isn't important, you could sort them, put them into the datastore, and then complete a batch at a time. You still have to work out what to do with the tasks at the beginning and end of each batch, because they might be out of order relative to interleaving tasks in other batches.
OK. This is how I've done it.
1) A REST servlet that is called by the consumer:

   If the event sequence doesn't match the vehicle sequence (from the datastore)
       create a task on a "wait" queue to call me again
   else
       validate state
       create a task on the "regular" queue (event data is in the payload)

2) A worker servlet gets the task from the "regular" queue, and so on (same pseudocode).
This way I can pause the "regular" queue in order to do data maintenance without losing events.
Thank you for your answers. My solution is a mix of them.
You can put the work to be done in a row in the datastore with a creation timestamp, and then fetch work tasks by that timestamp, but if your tasks are created too quickly you will run into latency issues.
I don't know the answer myself, but it may be that tasks enqueued using a deferred function execute in the order submitted. You will likely need an engineer from G. to get a definitive answer. Pull queues, as suggested, seem a good alternative, and would also allow you to consider batching your put()s.
One note about sharded counters: they increase the probability of monotonically increasing IDs, but do not guarantee them.
The best way to handle this, the distributed way or "App Engine way", is probably to modify your algorithm and data collection to work with just a timestamp, allowing arbitrary ordering of tasks.
Assuming this is not possible or too difficult, you could modify your algorithm as follows:
when creating the task, don't put the data in the payload but in the datastore, in a kind sorted by timestamp and stored as a child entity of whatever entity you're trying to update (Vehicle?). The timestamps should come from the client, not the server, to guarantee consistent ordering.
run a generic task that fetches the data for the first timestamp, processes it, and then deletes it, inside a transaction. A sketch of this follows.
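A hedged sketch of that recipe with the low-level Datastore API: an ancestor query for the oldest pending "Event" child of the vehicle entity, processed and deleted in one transaction. The kind, property name, and process() are illustrative assumptions.

import java.util.List;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.SortDirection;
import com.google.appengine.api.datastore.Transaction;

public class OldestEventWorker {
    private final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    public void processOldest(Key vehicleKey) {
        Transaction txn = ds.beginTransaction();
        try {
            // Ancestor query: child "Event" entities of the vehicle, oldest first.
            Query q = new Query("Event", vehicleKey)
                    .addSort("timestamp", SortDirection.ASCENDING);
            List<Entity> oldest = ds.prepare(txn, q)
                    .asList(FetchOptions.Builder.withLimit(1));
            if (!oldest.isEmpty()) {
                process(oldest.get(0));                 // application logic (assumed)
                ds.delete(txn, oldest.get(0).getKey()); // done: remove the event
            }
            txn.commit();
        } finally {
            if (txn.isActive()) txn.rollback();
        }
    }

    private void process(Entity event) { /* ... */ }
}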
Following this thread, I am unclear whether the strict FIFO requirement is for all transactions received or on a per-vehicle basis. The latter has more options than the former.
