Problem Statement:
I have the following stream, 1,2,4,6,3,5.... I expect the events reached to the subscribers as 123456...
For Simplicity:
There cannot be possibility where one element is missing i.e you will not have 1,2,3,4,5,6...
The sent messages can be deleted from in-memory data structure giving space for the others.
This is a infinite stream, sometime may be large enough to be stored everything in memory(May be at the worst case lead to memory-exception which is fine.
You can pile up the events using window(n) or other methods, but then the events are expected to be published in sequence.
With respect to my code,
I have a Flowable that gets inbound data with events. These events are not in order. These messages are expected to reach subscribers in order(ascending order of event ID).
Please let me know
how can I achieve this either using rxJava or without rx?
What could be the optimal design for this without any event loss?
Related
I want to process multiple events in order of their timestamps coming into the system via multiple source systems like MQ, S3 ,KAFKA .
What approach should be taken to solve the problem?
As soon as an event comes in, the program can't know if another source will send events that should be processed before this one but have not arrived yet. Therefore, you need some waiting period, e.g. 5 minutes, in which events won't be processed so that late events have a chance to cut in front.
There is a trade-off here, making the waiting window larger will give late events a higher chance to be processed in the right order, but will also delay event processing.
For implementation, one way is to use a priority-queue that sorts by min-timestamp. All event sources write to this queue and events are consumed only from the top and only if they are at least x seconds old.
One possible optimisation for the processing lag: As long as all data sources provide at least one event that is ready for consumption, you can safely process events until one source is empty again. This only works if sources provide their own events in-order. You could implement this by having a counter for each data source of how many events exist in the priority-queue.
Another aspect is what happens to the priority-queue when a node crashes. Using acknowledgements should help here, so that on crash the queue can be rebuilt from unacknowledged events.
I am trying to join two unbounded PCollection that I am getting from 2 different kafka topics on the basis of a key.
As per the docs and other blogs a join can only be possible if we do windowing. Window collects the messages from both the streams in a particular window and joins it. Which is not what I need.
The result expected is in one stream the messages are coming at a very low frequency, and from other stream we are getting messages at a high frequency. I want that if the value of the key has not arrived on both the streams we won't do a join till then and after it arrives do the join.
Is it possible using the current beam paradigm ?
In short, the best solution is to use stateful DoFn in Beam. You can have a per key state (and per window, which is global window in your case).You can save one stream events in state and once events from another stream appear with the same key, join it with events in state. Here is a reference[1].
However, the short answer does not utilize true power of Beam model. The Beam model provides ways to balance among latency, cost and accuracy. It provides simple API to hide complex of streaming processing.
Why I am saying that? Let's go back to the short answer's solution: stateful DoFn. In stateful DoFn approach, you are lack of ways to address following questions:
What if you have buffered 1M events for one key and there is still no event appear from another stream? Do you need to empty the state? What if the event appear right after you emptied the state?
If eventually there is one event that appear to finish a JOIN, is the cost of buffering 1M events acceptable for JOIN a single event from another stream?
How to handle late date on both streams? Say You have joined <1, a> from left stream on <1, b> from right stream. Later there is another <1, c> from left stream, how do you know that you only need to emit <1, <c, b>>, assume this is incremental mode to output result. If you start to buffer those already joined events to get delta, that really becomes too complicated for a programmer.
Beam's windowing, trigger, refinement on output data, watermark and lateness SLA control are designed to hide these complex from you:
watermark: tells when windows are complete such that events will not long come (and further events are treated as late data)
Lateness SLA control: control the time you cache data for join.
refinement on output data: update output correctly if allowed new events arrive.
Although Beam model is well designed. The implementation of Beam model are missing critical features to support the join you described:
windowing is not flexible enough to support your case where streams have huge different frequencies (so fixed and sliding window does not fit). And you also don't know the arrival rate of streams (so session window does not really fit as you have to give a gap between session windows).
retraction is missing such that you cannot refine your output once late events arrive.
To conclude, Beam model is designed to handle complex in streaming processing, which perfectly fits your need. But the implementation is not good enough to let you use it now to finish your join use case.
[1] https://beam.apache.org/blog/2017/02/13/stateful-processing.html
This isn't something that is well supported by the Beam model today, but there are a few ways you can do it. These examples assume each key appears exactly once on each stream, if that isn't the case you'll need to adjust them.
One option is to use the Global Window and Stateful DoFn instead of a Join. The Global Window effectively turns windowing off. A stateful DoFn lets you store data about the key you are processing in a "state cell" for later use. When you receive a record, you would check the state cell for a value. If you find one, do the join, emit the value, and clear the state. If there isn't anything, store the current value.
Another option is to use Session Windows and Join. The session window "GapDuration" is effectively a timeout on a given key. This works as long as you have a time bound in which you will see the Key on both streams. You'll also want to setup an element count trigger "AfterPane.elementCountAtLeast(2)" so you don't have to wait for the full timeout after seeing the second piece of data.
I'm trying to achieve some kind of event processing in Kafka. I've got some producers which post events to Kafka queue. I've also consumers which get the event, process it, and save processed data in DB. However, I need to be sure that EVERY event had been processed and finished. What if something crash unexpectedly during processing of event after taking it from a queue? How can I inform Kafka that this particular event is still not processed? Are there any known patterns?
Kafka streams Version 0.10.* by design has "At least once" semantics. Once you are using DB if every event has its own key you will also get "Exactly once semantic " since there is no duplications if you write to the same key.
If you want to make sure that this is correct.
Start kafka,
Generate Data,
Start DB,
Start your stream,
Make sure data is getting there,
Now stop your DB,
Kill stream while it gets some errors,
Start DB again,
And you will see that Kafka reproduces the data into your DB again.
For further reading you can go here
I have a JMS Queue that is populated at a very high rate ( > 100,000/sec ).
It can happen that there can be multiple messages pertaining to the same entity every second as well. ( several updates to entity , with each update as a different message. )
On the other end, I have one consumer that processes this message and sends it to other applications.
Now, the whole set up is slowing down since the consumer is not able to cope up the rate of incoming messages.
Since, there is an SLA on the rate at which consumer processes messages, I have been toying with the idea of having multiple consumers acting in parallel to speed up the process.
So, what Im thinking to do is
Multiple consumers acting independently on the queue.
Each consumer is free to grab any message.
After grabbing a message, make sure its the latest version of the entity. For this, part, I can check with the application that processes this entity.
if its not latest, bump the version up and try again.
I have been looking up the Integration patterns, JMS docs so far without success.
I would welcome ideas to tackle this problem in a more elegant way along with any known APIs, patterns in Java world.
ActiveMQ solves this problem with a concept called "Message Groups". While it's not part of the JMS standard, several JMS-related products work similarly. The basic idea is that you assign each message to a "group" which indicates messages that are related and have to be processed in order. Then you set it up so that each group is delivered only to one consumer. Thus you get load balancing between groups but guarantee in-order delivery within a group.
Most EIP frameworks and ESB's have customizable resequencers. If the amount of entities is not too large you can have a queue per entity and resequence at the beginning.
For those ones interested in a way to solve this:
Use Recipient List EAI pattern
As the question is about JMS, we can take a look into an example from Apache Camel website.
This approach is different from other patterns like CBR and Selective Consumer because the consumer is not aware of what message it should process.
Let me put this on a real world example:
We have an Order Management System (OMS) which sends off Orders to be processed by the ERP. The Order then goes through 6 steps, and each of those steps publishes an event on the Order_queue, informing the new Order's status. Nothing special here.
The OMS consumes the events from that queue, but MUST process the events of each Order in the very same sequence they were published. The rate of messages published per minute is much greater than the consumer's throughput, hence the delay increases over time.
The solution requirements:
Consume in parallel, including as many consumers as needed to keep queue size in a reasonable amount.
Guarantee that events for each Order are processed in the same publish order.
The implementation:
On the OMS side
The OMS process responsible for sending Orders to the ERP, determines the consumer that will process all events of a certain Order and sends the Recipient name along with the Order.
How this process know what should be the Recipient? Well, you can use different approaches, but we used a very simple one: Round Robin.
On ERP
As it keeps the Recipient's name for each Order, it simply setup the message to be delivered to the desired Recipient.
On OMS Consumer
We've deployed 4 instances, each one using a different Recipient name and concurrently processing messages.
One could say that we created another bottleneck: the database. But it is not true, since there is no concurrency on the order line.
One drawback is that the OMS process which sends the Orders to the ERP must keep knowledge about how many Recipients are working.
I am trying to write an Application that uses the JMS publish subscribe model. However I have run into a setback, I want to be able to have the publisher delete messages from the topic. The usecase is that I have durable subscribers, the active ones will get the messages (since it's more or less instantly) , but if there are inactive ones and the publisher decides the message is wrong, I want to have him able to delete the message so that the subscribers won't receive it anymore once they become active.
Problem is, I don't know how/if this can be done.
For a provider I settled on glassfish's implementation, but if other alternatives offer this functionality, I can switch.
Thank you.
JMS is a form of asynchronous messaging and as such the publishers and subscribers are decoupled by design. This means that there is no mechanism to do what you are asking. For subscribers who are active at time of publication, they will consume the message with no chance of receiving the delete message in time to act on it. If a subscriber is offline then they will but async messages are supposed to be atomic. If you proceed with design of other respondent's answer (create a delete message and require reconnecting consumers to read the entire queue looking for delete messages), then you will create a situation in which the behavior of the system differs based on whether or not a subscriber was online or not at the time a specific message/delete combination was was published. There is also a race condition in which the subscriber completes reading of the retained messages just before the publisher sends out the delete message. This means you must put significant logic into subscribers to reconcile these conditions and even more to reconcile the race condition.
The accepted method of doing this is what are called "compensating transactions." In any system where the producer and consumer do not share a single unit of work or share common state (such as using the same DB to store state) then backing out or correcting a previous transaction requires a second transaction that reverses the first. The consumer must of course be able to apply the compensating transaction correctly. When this pattern is used the result is that all subscribers exhibit the same behavior regardless of whether the messages are consumed in real time or in a batch after the consumer has restarted.
Note that a compensating transaction differs from a "delete message." The delete message as proposed in the other respondent's answer is a form of command and control that affects the message stream itself. On the other hand, compensating transactions affect the state of the system through transactional updates of the system state.
As a general rule, you never want to manage state of the system by manipulating the message stream with command and control functions. This is fragile, susceptible to attack and very hard to audit or debug. Instead, design the system to deliver every message subject to its quality of service constraints and to process all messages. Handle state changes (including reversing a prior action) entirely in the application.
As an example, in banking where transactions trigger secondary effects such as overdraft fees, a common procedure is to "memo post" the transactions during the day, then sort and apply them in a batch after the bank has closed. This allows a mistake to be reconciled before it causes overdraft fees. More recently, the transactions are applied in real time but the triggers are withheld until the day's books close and this achieves the same result.
JMS API does not allow removing messages from any destination (either queue or topic). Although I believe that specific JMX providers provide their own proprietary tools to manage their state for example using JMX. Try to check it out for your JMS provider but be careful: even if you find solution it will not be portable between different JMS providers.
One legal way to "remove" message is using its time-to-live:
publish(Topic topic, Message message, int deliveryMode, int priority, long timeToLive). Probably it is good enough for you.
If it is not applicable for your application, solve the problem on application level. For example attach unique ID to each message and publish special "delete" message with higher priority that will be a kind of command to delete "real" message with the same ID.
You have have the producer send a delete message and the consumer needs to read all messages before starting to process them.