Goal
I'm trying to create a dependency relationship between handlers that's somewhat circular, and I can't quite figure out how to get it right. What I want to achieve is a variation of producer -> [handlers 1-3] -> handler 4.
So, disruptor.handleEventsWith(h1, h2, h3).then(h4);. But I have the additional requirements that
While handlers 1-3 do process messages in parallel, none of them begins to process the next message until they have all finished the previous message.
After the first message, handlers 1-3 wait for handler 4 to have finished the most recent message before processing the next message.
The equivalent execution logic using a single event handler could be:
disruptor.handleEventsWith((event, sequence, endOfBatch) -> {
Arrays.asList(h1, h2, h3).parallelStream()
.forEach(h -> h.onEvent(event, sequence, endOfBatch));
h4.onEvent(event, sequence, endOfBatch);
});
Context
The design context is that handlers 1-3 each update their own state according to the message and that after a message is processed by each of the three they are in a consistent state. Handler 4 then runs some logic based on the state updated by handlers 1-3. So handler 4 should only see consistent states for the data structures maintained by 1-3, which means that handlers 1-3 should not process the next message until handler 4 has finished.
(Though the goal is definitely to use the Disruptor to manage the concurrency, rather than java.util.Stream.)
Not sure if it matters, but it's also the case that handler 4's logic can be broken into two parts, one requiring that none of handlers 1-3 are being updated and the next requiring only that the first part of handler 4 has finished. So handlers 1-3 can be processing a message while the second part of handler 4 is still executing.
Is there a way to accomplish this? Or maybe my design is flawed? I feel like there should be a way to do this via SequenceBarrier but I don't quite understand how to implement this custom barrier. For handlers 1-3, I think I'd like to make a barrier with the logic handlers[1:3].lastProcessedSequence() == handlers[4].lastProcessedSequence(), but I'm not sure where to put that logic.
Thanks!
I would consider having the handlers be stateless, and using the messages processed by them to contain the state of your system. That way you won't need to synchronize your handlers at all.
Related
I want to process multiple events in order of their timestamps coming into the system via multiple source systems like MQ, S3 ,KAFKA .
What approach should be taken to solve the problem?
As soon as an event comes in, the program can't know if another source will send events that should be processed before this one but have not arrived yet. Therefore, you need some waiting period, e.g. 5 minutes, in which events won't be processed so that late events have a chance to cut in front.
There is a trade-off here, making the waiting window larger will give late events a higher chance to be processed in the right order, but will also delay event processing.
For implementation, one way is to use a priority-queue that sorts by min-timestamp. All event sources write to this queue and events are consumed only from the top and only if they are at least x seconds old.
One possible optimisation for the processing lag: As long as all data sources provide at least one event that is ready for consumption, you can safely process events until one source is empty again. This only works if sources provide their own events in-order. You could implement this by having a counter for each data source of how many events exist in the priority-queue.
Another aspect is what happens to the priority-queue when a node crashes. Using acknowledgements should help here, so that on crash the queue can be rebuilt from unacknowledged events.
Hi I am working with akka streams along with akka-stream-kafka. I am setting up a Stream with the below setup:
Source (Kafka) --> | Akka Actor Flow | --> Sink (MongoDB)
Actor Flow basically by Actors that will process data, below is the hierarchy:
System
|
Master Actor
/ \
URLTypeHandler SerializedTypeHandler
/ \ |
Type1Handler Type2Handler SomeOtherHandler
So Kafka has the message, I write up the consumer and run it in atMostOnceSource configuration and use
Consumer.Control control =
Consumer.atMostOnceSource(consumerSettings, Subscriptions.topics(TOPIC))
.mapAsyncUnordered(10, record -> processAccessLog(rootHandler, record.value()))
.to(Sink.foreach(it -> System.out.println("FinalReturnedString--> " + it)))
.run(materializer);
I've used a print as a sink initially, just to get the flow running.
and the processAccessLog is defined as:
private static CompletionStage<String> processAccessLog(ActorRef handler, byte[] value) {
handler.tell(value, ActorRef.noSender());
return CompletableFuture.completedFuture("");
}
Now, from the definition ask must be used when an actor is expecting a response, makes sense in this case since I want to return values to be written in the sink.
But everyone (including docs), mention to avoid ask and rather use tell and forward, an amazing blog is written on it Don't Ask, Tell.
In the blog he mentions, in case of nested actors, use tell for the first message and then use forward for the message to reach the destination and then after processing directly send the message back to the root actor.
Now here is the problem,
How do I send the message from D back to A, such that I can still use the sink.
Is it good practice to have open ended streams? e.g. Streams where Sink doesn't matter because the actors have already done the job. (I don't think it is recommend to do so, seems flawed).
ask is Still the Right Pattern
From the linked blog article, one "drawback" of ask is:
blocking an actor itself, which cannot pick any new messages until the
response arrives and processing finishes.
However, in akka-stream this is the exact feature we are looking for, a.k.a. "back-pressure". If the Flow or Sink are taking a long time to process data then we want the Source to slow down.
As a side note, I think the claim in the blog post that the additional listener Actor results in an implementation that is "dozens times heavier" is an exaggeration. Obviously an intermediate Actor adds some latency overhead but not 12x more.
Elimination of Back-Pressure
Any implementation of what you are looking for would effectively eliminate back-pressure. An intermediate Flow that only used tell would continuously propagate demand back to the Source regardless of whether or not your processing logic, within the handler Actors, was completing its calculations at the same speed that the Source is generating data.
Consider an extreme example: what if your Source could produce 1 million messages per second but the Actor receiving those messages via tell could only process 1 message per second. What would happen to that Actor's mailbox?
By using the ask pattern in an intermediate Flow you are purposefully linking the speed of the handlers and the speed with which your Source produces data.
If you are willing to remove back-pressure signaling, from the Sink to the Source, then you might as well not use akka-stream in the first place. You can have either back-pressure or non-blocking messaging, but not both.
Ramon J Romero y Vigil is right but I will try to extend the response.
1) I think that the "Don't ask, tell" dogma is mostly for Actor systems architecture. Here you need to return a Future so the stream can resolve the processed result, you have two options:
Use ask
Create an actor per event and pass them Promise so a Future will be complete when this actor receives the data (you can use the getSender method so D can send the response to A). There is no way to send a Promise or Future in a message (The are not Serialisable) so the creation of this short living actors can not be avoided.
At the end you are doing mostly the same...
2) It's perfectly fine to use an empty Sink to finalise the stream (indeed akka provides the Sink.ignore() method to do so).
Seems like you are missing the reason why you are using streams, they are cool abstraction to provide composability, concurrency and back pressure. In the other hand, actors can not be compose and is hard to handle back pressure. If you don't need this features and your actors can have the work done easily you shouldn't use akka-streams in first place.
I've implemented the producer/consumer paradigm with a message broker in Spring and my producers use WebSocket to extract and publish data into the queue.
A producer is therefore something like:
AcmeProducer.java
handler/AcmeWebSocketHandler.java
and the handler has been a pain to deal with.
Firstly, the handler has two events:
onOpen()
onMessage()
onOpen has to send message to the web socket to subscribe to specific channels
onMessage receives messages from WebSocket and adds them into the queue
onMessage has some dependencies from AcmeProducer.java, such as it needs to know currency pairs to subscribe to, it needs the message broker service, and an ObjectMapper (json deserializer) and a benchmark service.
When messages are consumed from the queue they are transformed into a format for OrderBook.java. Every producer has its own format of OrderBook and therefore its own AcmeOrderBook.java.
I feel the flow is difficult to follow, even though I've put classes for one producer in the same package. Every time I want to fix something in a producer I have to change between classes and search where it is.
Are there some techniques for reducing the complicated data flow into something easy to follow?
As you know, event handlers like AcmeHandler.java hold callbacks that get called from elsewhere in the system (from the websocket implementation) and hence they can be tricky to organize. The data flow with events is also more convoluted because when handlers are in separate files they cannot use variables defined in the main file.
If this code would not use the event driven paradigm the data flow would be easy to follow.
Finally, is there any best practice for structuring code when using web sockets with onOpen and onMessage? Producer/Consumer is the best I could come up with, but I don't want to scatter classes of Acme in different packages. For example, the AcmeOrderbook should be in a consumer class, but as it depends on the AcmeProducer.java and AcmeHandler.java they are often edited at the same time, hence I've put them together.
As the dependencies inside every WebSocket handler are the same (only different implementations of those same interfaces) I think there should be only one thing injected, and that will be some Context variable. Is that the best practice?
I've solved it using Message Dispatcher and Message Handlers.
The dispatcher only checks if the message is a data message (snapshot or update) and passes control to the message handler class which itself checks the type of the message (either snapshot or update) and handles that type properly. If the message is not a data message but something else the dispatcher will dispatch the message depending on what it is (some event).
I've also added callbacks using anonymous functions and they are much shorter now, the callbacks are finally transparent.
For example, inside an anonymous callback function there is only this:
messageDispatcher.dispatchMessage(context);
Another key approach here is the use of context.MessageDispatcher is a separate class (autowired).
I've separated orderbook into its directory inside every producer.
Well, everything requires knowledge of everything to solve this elegantly.
Last pattern for solving this: Java EE uses annotations for their endpoints and control functions such as onOpen, onMessage, etc. That is also a nice approach because with it the callback becomes invisible and onOpen / onMessage are automatically called by the container (using the annotation).
Even after reading http://krondo.com/?p=1209 or Does an asynchronous call always create/call a new thread? I am still confused about how to provide asynchronous calls on an inherently single-threaded system. I will explain my understanding so far and point out my doubts.
One of the examples I read was describing a TCP server providing asynch processing of requests - a user would call a method e.g. get(Callback c) and the callback would be invoked some time later. Now, my first issue here - we have already two systems, one server and one client. This is not what I mean, cause in fact we have two threads at least - one in the server and one on the client side.
The other example I read was JavaScript, as this is the most prominent example of single-threaded asynch system with Node.js. What I cannot get through my head, maybe thinking in Java terms, is this:If I execute the code below (apologies for incorrect, probably atrocious syntax):
function foo(){
read_file(FIle location, Callback c) //asynchronous call, does not block
//do many things more here, potentially for hours
}
the call to read file executes (sth) and returns, allowing the rest of my function to execute. Since there is only one thread i.e. the one that is executing my function, how on earth the same thread (the one and only one which is executing my stuff) will ever get to read in the bytes from disk?
Basically, it seems to me I am missing some underlying mechanism that is acting like round-robin scheduler of some sort, which is inherently single-threaded and might split the tasks to smaller ones or call into a multiothraded components that would spawn a thread and read the file in.
Thanks in advance for all comments and pointing out my mistakes on the way.
Update: Thanks for all responses. Further good sources that helped me out with this are here:
http://www.html5rocks.com/en/tutorials/async/deferred/
http://lostechies.com/johnteague/2012/11/30/node-js-must-know-concepts-asynchrounous/
http://www.interact-sw.co.uk/iangblog/2004/09/23/threadless (.NET)
http://ejohn.org/blog/how-javascript-timers-work/ (intrinsics of timers)
http://www.mobl-lang.org/283/reducing-the-pain-synchronous-asynchronous-programming/
The real answer is that it depends on what you mean by "single thread".
There are two approaches to multitasking: cooperative and interrupt-driven. Cooperative, which is what the other StackOverflow item you cited describes, requires that routines explicitly relinquish ownership of the processor so it can do other things. Event-driven systems are often designed this way. The advantage is that it's a lot easier to administer and avoids most of the risks of conflicting access to data since only one chunk of your code is ever executing at any one time. The disadvantage is that, because only one thing is being done at a time, everything has to either be designed to execute fairly quickly or be broken up into chunks that to so (via explicit pauses like a yield() call), or the system will appear to freeze until that event has been fully processed.
The other approach -- threads or processes -- actively takes the processor away from running chunks of code, pausing them while something else is done. This is much more complicated to implement, and requires more care in coding since you now have the risk of simultaneous access to shared data structures, but is much more powerful and -- done right -- much more robust and responsive.
Yes, there is indeed a scheduler involved in either case. In the former version the scheduler is just spinning until an event arrives (delivered from the operating system and/or runtime environment, which is implicitly another thread or process) and dispatches that event before handling the next to arrive.
The way I think of it in JavaScript is that there is a Queue which holds events. In the old Java producer/consumer parlance, there is a single consumer thread pulling stuff off this queue and executing every function registered to receive the current event. Events such as asynchronous calls (AJAX requests completing), timeouts or mouse events get pushed on to the Queue as soon as they happen. The single "consumer" thread pulls them off the queue and locates any interested functions and then executes them, it cannot get to the next Event until it has finished invoking all the functions registered on the current one. Thus if you have a handler that never completes, the Queue just fills up - it is said to be "blocked".
The system has more than one thread (it has at least one producer and a consumer) since something generates the events to go on the queue, but as the author of the event handlers you need to be aware that events are processed in a single thread, if you go into a tight loop, you will lock up the only consumer thread and make the system unresponsive.
So in your example :
function foo(){
read_file(location, function(fileContents) {
// called with the fileContents when file is read
}
//do many things more here, potentially for hours
}
If you do as your comments says and execute potentially for hours - the callback which handles fileContents will not fire for hours even though the file has been read. As soon as you hit the last } of foo() the consumer thread is done with this event and can process the next one where it will execute the registered callback with the file contents.
HTH
In netty, events that flow through a channel pipeline occur in order as each channel is effectively only assigned to one thread and each handler calls each other in turn. This makes sense and alleviates many syncronisation issues.
However if you use an IdleStateHandler, from my reading of the source, it appears that the channelIdle event will be 'processed' in the context of the Timers thread (the thread the HashedWheelTime is using for example).
Is this case or have I missed something? If it is the case, does this mean that it is possible for an idlestate event and an IO event (example a messageRecieved event) to be executing on the same channel at the same time?
Also, as I could save the ChannelHandler ctx and use it in a different thread to 'write' to a channel for example, thus also have an event going downstream in one thread and upstream in another thread at the same time on the same channel?
Finally which thread do ChannelFutures execute in?
All of these use cases are perfectly acceptable and not a criticism of Netty at all, I am actually very fond of the library and make use of it everywhere. Its just that as I try to do more and more complex things with it, I want to understand more about how it works under the hood so that I can ensure I am using the proper and just the right amount (no more, no less) of syncronisation (locking).
Yes the channelIdle event and downstream / upstream event could be fired at the same time. So if your handler does implement at least two of the three you need to add proper synchronization. I guess we should make it more clear in the javadocs..
Now the other questions..
You can call Channel.write(...) from every thread you want too.. So its possible to just store it somewhere and call write whenever you want. This also gives the "limitation" that you need have proper synchronization for "downstream" handlers.
ChannelFutures are executed from within the worker thread.
However if you use an IdleStateHandler, from my reading of the source, it appears that the channelIdle event will be 'processed' in the context of the Timers thread (the thread the HashedWheelTime is using for example).
Is this case or have I missed something? If it is the case, does this mean that it is possible for an idlestate event and an IO event (example a messageRecieved event) to be executing on the same channel at the same time?
Yes, you have missed the concept of ordered event execution available in the Netty, If you are not haveing ExecutionHandler with OrderedMemoryAwareThreadPoolExecutor in the pipeline, you can not have the ordered event execution, specially channel state events from IdleStateHandler.
Also, as I could save the ChannelHandler ctx and use it in a different thread to 'write' to a channel for example, thus also have an event going downstream in one thread and upstream in another thread at the same time on the same channel?
Yes its correct.
More over, If you want to have ordered event execution in the downstream, you have have an downstream execution handler impl with OrderedMemoryAwareThreadPoolExecutor.
Finally which thread do ChannelFutures execute in?
The are executed by Oio/Nio Worker threads,
But if your down stream handler is consuming some type of event and firing another type of event to below the downstream, then your downstream handler can optionally handle the future execution. this can be done by get the future form downstream event and calling
future.setSuccess()