I am trying to write a custom receiver for Structured Streaming that will consume messages from RabbitMQ.
Spark recently released the DataSource V2 API, which seems very promising. Since it abstracts away many details, I want to use this API for the sake of both simplicity and performance. However, since it's quite new, there are not many sources available. I need some clarification from experienced Spark guys, since they will grasp the key points more easily. Here we go:
My starting point is the blog post series, with the first part here. It shows how to implement a data source, without streaming capability. To make a streaming source, I slightly changed them, since I need to implement MicroBatchReadSupport instead of (or in addition to) DataSourceV2.
To be efficient, it's wise to have multiple Spark executors consuming RabbitMQ concurrently, i.e. from the same queue. If I'm not confused, every partition of the input (in Spark's terminology) corresponds to a consumer of the queue (in RabbitMQ terminology). Thus, we need multiple partitions for the input stream, right?
Similarly to part 4 of the series, I implemented MicroBatchReader as follows:
@Override
public List<DataReaderFactory<Row>> createDataReaderFactories() {
    int partition = options.getInt(RMQ.PARTITION, 5);
    List<DataReaderFactory<Row>> factories = new LinkedList<>();
    for (int i = 0; i < partition; i++) {
        factories.add(new RMQDataReaderFactory(options));
    }
    return factories;
}
I am returning a list of factories, hoping that every instance in the list will be used to create a reader, which will also be a consumer. Is that approach correct?
I want my receiver to be reliable, i.e. after every message is processed (or at least written to the checkpoint directory for further processing), I need to ack it back to RabbitMQ. The problem starts here: these factories are created at the driver, and the actual reading takes place at the executors through DataReaders. However, the commit method is part of MicroBatchReader, not DataReader. Since I have many DataReaders per MicroBatchReader, how should I ack these messages back to RabbitMQ? Or should I ack when the next method is called on a DataReader? Is that safe? If so, what is the purpose of the commit function then?
~~CLARIFICATION:~~ OBFUSCATION: The link provided in the answer about the renaming of some classes/functions (in addition to the explanations there) made everything ~~much more clear~~ worse than ever. Quoting from there:
> Renames:
>
> - DataReaderFactory to InputPartition
> - DataReader to InputPartitionReader
> - ...
>
> InputPartition's purpose is to manage the lifecycle of the associated reader, which is now called InputPartitionReader, with an explicit create operation to mirror the close operation. This was no longer clear from the API because DataReaderFactory appeared to be more generic than it is and it isn't clear why a set of them is produced for a read.
EDIT: However, the docs clearly say that "the reader factory will be serialized and sent to executors, then the data reader will be created on executors and do the actual reading."
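Based on that, a minimal sketch of how I understand the factory should look (RMQDataReader is my hypothetical executor-side reader; only serializable state crosses the wire):

import java.util.Map;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.reader.DataReader;
import org.apache.spark.sql.sources.v2.reader.DataReaderFactory;

class RMQDataReaderFactory implements DataReaderFactory<Row> {
    // Only serializable state may be captured here: the factory itself is
    // shipped from the driver to an executor.
    private final Map<String, String> options;

    RMQDataReaderFactory(DataSourceOptions options) {
        this.options = options.asMap();
    }

    @Override
    public DataReader<Row> createDataReader() {
        // Runs on the executor: this is where the RabbitMQ connection/channel
        // would be opened and the queue subscribed to.
        return new RMQDataReader(options); // RMQDataReader: hypothetical reader
    }
}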
To make the consumer reliable, I have to ACK a particular message only after it is committed on the Spark side. Note that messages have to be ACKed on the same connection through which they were delivered, but the commit function is called at the driver node. How can I commit at the worker/executor node?
> I am returning a list of factories, hoping that every instance in the list will be used to create a reader, which will also be a consumer. Is that approach correct?
The [socket][1] source implementation has one thread pushing messages into an internal ListBuffer. In other words, there is one consumer (the thread) filling up the internal ListBuffer, which is **then** divided up into partitions by `planInputPartitions` (`createDataReaderFactories` got [renamed][2] to `planInputPartitions`).
Also, according to the Javadoc of [MicroBatchReadSupport][3]
> The execution engine will create a micro-batch reader at the start of a streaming query, alternate calls to setOffsetRange and createDataReaderFactories for each batch to process, and then call stop() when the execution is complete. Note that a single query may have multiple executions due to restart or failure recovery.
In other words, `createDataReaderFactories` should be called **multiple** times, which to my understanding suggests that each `DataReader` is responsible for a static input partition, which implies that the `DataReader` shouldn't be a consumer.
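As a rough Java transliteration of that slicing idea (illustrative only; the socket source itself is Scala, and `slice` below is a made-up helper for an already-filled buffer):

import java.util.ArrayList;
import java.util.List;

// Round-robin split of a buffered list into N partitions, in the spirit of
// what the socket source's planInputPartitions does with its ListBuffer.
static <T> List<List<T>> slice(List<T> buffer, int numPartitions) {
    List<List<T>> slices = new ArrayList<>(numPartitions);
    for (int i = 0; i < numPartitions; i++) {
        slices.add(new ArrayList<>());
    }
    for (int i = 0; i < buffer.size(); i++) {
        slices.get(i % numPartitions).add(buffer.get(i));
    }
    return slices;
}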
----------
> However, the commit method is part of MicroBatchReader, not DataReader ... If so, what is the purpose of the commit function then?
Perhaps part of the rationale for the commit function is to prevent the internal buffer of the MicroBatchReader from getting too big. By committing an offset, you effectively remove elements less than that offset from the buffer, as you are making a commitment not to process them anymore. You can see this happening in the socket source code with `batches.trimStart(offsetDiff)`.
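Transposed to Java, the idea might look roughly like this (a sketch only; `batches` and `lastCommittedOffset` are hypothetical fields of the MicroBatchReader, and LongOffset mirrors what the socket source uses):

import org.apache.spark.sql.execution.streaming.LongOffset;
import org.apache.spark.sql.sources.v2.reader.streaming.Offset;

// Inside the MicroBatchReader: drop everything the engine has promised
// never to re-read.
@Override
public void commit(Offset end) {
    long committed = ((LongOffset) end).offset();
    int offsetDiff = (int) (committed - lastCommittedOffset);
    batches.subList(0, offsetDiff).clear();  // Java analogue of batches.trimStart(offsetDiff)
    lastCommittedOffset = committed;
}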
I'm unsure about implementing a reliable receiver, so I hope a more experienced Spark guy comes around and grabs your question as I'm interested too!
Hope this helps!
EDIT
I had only studied the socket and wiki-edit sources. These sources are not production ready, which is something the question was not looking for. Instead, the kafka source is the better starting point, which, unlike the aforementioned sources, has multiple consumers like the author was looking for.
However, perhaps if you're looking for unreliable sources, the socket and wikiedit sources above provide a less complicated solution.
Related
Hi, I am working with Akka Streams along with akka-stream-kafka. I am setting up a stream with the below setup:
Source (Kafka) --> | Akka Actor Flow | --> Sink (MongoDB)
The Actor Flow is basically backed by Actors that will process data; below is the hierarchy:
                      System
                         |
                   Master Actor
                   /          \
      URLTypeHandler       SerializedTypeHandler
        /         \                 |
Type1Handler   Type2Handler   SomeOtherHandler
So Kafka has the messages; I write up the consumer, run it in the atMostOnceSource configuration, and use:
Consumer.Control control =
    Consumer.atMostOnceSource(consumerSettings, Subscriptions.topics(TOPIC))
        .mapAsyncUnordered(10, record -> processAccessLog(rootHandler, record.value()))
        .to(Sink.foreach(it -> System.out.println("FinalReturnedString--> " + it)))
        .run(materializer);
I've used a print as a sink initially, just to get the flow running.
and the processAccessLog is defined as:
private static CompletionStage<String> processAccessLog(ActorRef handler, byte[] value) {
    handler.tell(value, ActorRef.noSender());
    return CompletableFuture.completedFuture("");
}
Now, by definition, ask must be used when an actor is expecting a response, which makes sense in this case since I want to return values to be written to the sink.
But everyone (including the docs) says to avoid ask and rather use tell and forward; an amazing blog post has been written on it: Don't Ask, Tell.
In the blog he mentions that, in the case of nested actors, you should use tell for the first message, then use forward for the message to reach the destination, and then, after processing, send the message directly back to the root actor.
Now here is the problem: how do I send the message from D back to A, such that I can still use the sink?
Is it good practice to have open-ended streams? E.g. streams where the Sink doesn't matter because the actors have already done the job. (I don't think it is recommended to do so; it seems flawed.)
ask is Still the Right Pattern
From the linked blog article, one "drawback" of ask is:
> blocking an actor itself, which cannot pick any new messages until the response arrives and processing finishes.
However, in akka-stream this is the exact feature we are looking for, a.k.a. "back-pressure". If the Flow or Sink are taking a long time to process data then we want the Source to slow down.
As a side note, I think the claim in the blog post that the additional listener Actor results in an implementation that is "dozens times heavier" is an exaggeration. Obviously an intermediate Actor adds some latency overhead but not 12x more.
Elimination of Back-Pressure
Any implementation of what you are looking for would effectively eliminate back-pressure. An intermediate Flow that only used tell would continuously propagate demand back to the Source regardless of whether or not your processing logic, within the handler Actors, was completing its calculations at the same speed that the Source is generating data.
Consider an extreme example: what if your Source could produce 1 million messages per second but the Actor receiving those messages via tell could only process 1 message per second. What would happen to that Actor's mailbox?
By using the ask pattern in an intermediate Flow you are purposefully linking the speed of the handlers and the speed with which your Source produces data.
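As a sketch, the question's processAccessLog could be rewritten with ask like this (assuming a recent Akka where akka.pattern.Patterns.ask accepts a java.time.Duration and returns a CompletionStage; on older versions PatternsCS.ask plays the same role, and the 5-second timeout is an arbitrary choice):

import akka.actor.ActorRef;
import akka.pattern.Patterns;
import java.time.Duration;
import java.util.concurrent.CompletionStage;

// The stream now only signals demand upstream once the handler has actually
// replied, which is exactly the back-pressure linkage described above.
private static CompletionStage<String> processAccessLog(ActorRef handler, byte[] value) {
    return Patterns.ask(handler, value, Duration.ofSeconds(5))
                   .thenApply(reply -> (String) reply);
}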
If you are willing to remove back-pressure signaling, from the Sink to the Source, then you might as well not use akka-stream in the first place. You can have either back-pressure or non-blocking messaging, but not both.
Ramon J Romero y Vigil is right but I will try to extend the response.
1) I think that the "Don't ask, tell" dogma is mostly about Actor-system architecture. Here you need to return a Future so the stream can resolve the processed result; you have two options:
- Use ask.
- Create an actor per event and pass it a Promise, so a Future will be completed when this actor receives the data (you can use the getSender method so D can send the response to A). There is no way to send a Promise or Future in a message (they are not serializable), so the creation of these short-lived actors cannot be avoided.
In the end you are doing mostly the same thing...
2) It's perfectly fine to use an empty Sink to finalise the stream (indeed Akka provides the Sink.ignore() method to do exactly that).
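For example, a sketch reusing the names from the question:

import akka.Done;
import akka.stream.javadsl.Sink;
import java.util.concurrent.CompletionStage;

// The materialized CompletionStage<Done> still tells you when the stream
// finishes, even though every element is discarded.
CompletionStage<Done> done =
    Consumer.atMostOnceSource(consumerSettings, Subscriptions.topics(TOPIC))
        .mapAsyncUnordered(10, record -> processAccessLog(rootHandler, record.value()))
        .runWith(Sink.ignore(), materializer);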
It seems like you are missing the reason why you are using streams: they are a cool abstraction that provides composability, concurrency, and back-pressure. On the other hand, actors cannot be composed and it is hard to handle back-pressure with them. If you don't need these features and your actors can get the work done easily, you shouldn't use akka-streams in the first place.
I'm writing an app for Android that processes real-time data.
My app reads binary data from a data bus (CAN), parses it, and displays it on the screen.
The app reads data in a background thread. I need to rapidly transfer data from one thread to another, and the displayed data should always be the most recent.
I've found a nice Java queue that almost implements the required behavior: LinkedBlockingQueue. I plan to set a hard limit for this queue (about 100 messages).
The consumer thread reads data from the queue with the take() method. But the producer thread can't wait for the consumer, so it can't use the standard put() method (because it blocks).
So, I plan to put messages into my queue using the following construction:
while (!messageQueue.offer(message)) {
    messageQueue.poll();
}
That is, the oldest message should be removed from the queue to make room for the new, most recent data.
Is this good practice? Or have I missed some important detail?
I can't see anything wrong with it. You know what you are doing (losing the head record). This isn't really a matter of established practice; it's your call to use the API the way you want. I personally prefer ArrayBlockingQueue though (fewer temporary objects).
This should be what you're looking for: Size-limited queue that holds last N elements in Java
The top answer refers to an Apache library queue which will drop elements.
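For example, a sketch with Apache Commons Collections' CircularFifoQueue (Message is a placeholder type; note the queue is not thread-safe, so you'd have to synchronize around it in a producer/consumer setup like yours):

import org.apache.commons.collections4.queue.CircularFifoQueue;

// A fixed-capacity queue that evicts its oldest element on overflow.
CircularFifoQueue<Message> queue = new CircularFifoQueue<>(100);

queue.add(message);             // once full, silently evicts the oldest element
Message oldest = queue.poll();  // consumer side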
I am a newbie to Storm and have created a program to read incrementing numbers for a certain time. I have used a counter in the Spout, and in the nextTuple() method the counter is emitted and incremented:
_collector.emit(new Values(new Integer(currentNumber++)));
/* how this method is being continuously called...*/
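For context, here is roughly what the full spout looks like (a trimmed sketch of my class; package names assume Storm 1.x, and the rest of my spout code is omitted):

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class NumberSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private int currentNumber;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        currentNumber = 1;
    }

    @Override
    public void nextTuple() {
        // Storm's executor thread calls this in a loop for as long as the topology runs.
        _collector.emit(new Values(currentNumber++));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("number"));
    }
}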
In the Bolt's execute() method (which receives the Tuple), I have:
public void execute(Tuple input) {
    int number = input.getInteger(0);
    logger.info("This number is (" + number + ")");
    _outputCollector.ack(input);
}
/* this part I am clear on, as the Bolt receives the input from the Spout */
In my main class, I have the following code:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("NumberSpout", new NumberSpout());
builder.setBolt("NumberBolt", new PrimeNumberBolt())
.shuffleGrouping("NumberSpout");
Config config = new Config();
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("NumberTest", config, builder.createTopology());
Utils.sleep(10000);
localCluster.killTopology("NumberTest");
localCluster.shutdown();
The program works perfectly fine. What I am currently looking at here is how the Storm framework internally calls the nextTuple() method continuously. I am sure that my understanding is missing something, and due to this gap I am unable to connect to the internal logic of this framework.
If any of you guys can help me understand this portion clearly, it would be a great help, as I will have to implement this concept in my project. If I am conceptually clear here, I can make significant progress. I'd appreciate it if anyone can quickly assist me. Awaiting responses...
> how does the Storm framework internally call the nextTuple() method continuously?
I believe this actually involves a very detailed discussion about the entire life cycle of a Storm topology, as well as clear concepts of different entities like workers, executors, tasks, etc. The actual processing of a topology is carried out by the StormSubmitter class with its submitTopology method.
The very first thing it does is upload the jar using Nimbus's Thrift interface, and then it calls submitTopology, which eventually submits the topology to Nimbus. Nimbus then starts by normalizing the topology (from the doc: the main purpose of normalization is to ensure that every single task will have the same serialization registrations, which is critical for getting serialization working correctly), followed by serialization, ZooKeeper handshaking, supervisor and worker process startup, and so on. It's too broad to discuss here, but if you really want to dig deeper you can go through the life cycle of a storm topology, which explains nicely the step-by-step actions performed during the entire time. (A quick note from the documentation:)
> First a couple of important notes about topologies:
>
> The actual topology that runs is different than the topology the user specifies. The actual topology has implicit streams and an implicit "acker" bolt added to manage the acking framework (used to guarantee data processing).
>
> The implicit topology is created via the system-topology! function. system-topology! is used in two places:
>
> - when Nimbus is creating tasks for the topology (code)
> - in the worker so it knows where it needs to route messages to (code)
Now here are a few clues I can try to share...
Spouts and Bolts are actually the components which do the real processing (the logic). In Storm terminology, they execute as many tasks across the cluster.
From the doc page: Each task corresponds to one thread of execution.
Now, among many other things, one typical responsibility of a worker process (read here) in Storm is to monitor whether a topology is active or not, and to store that particular state in a variable named storm-active-atom. This variable is used by the tasks to determine whether or not to call the nextTuple method. So as long as your topology is live (you haven't posted your spout code, but I'm assuming so), until your timer expires (as you said, for a certain time), it will keep calling the nextTuple method. You can dig even further into Storm's acking framework implementation to understand how it acknowledges a tuple once it is successfully processed, and Guarantee-message-processing
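In Java-flavored pseudocode, the loop shape is roughly this (Storm's real executor is written in Clojure; `active` stands in for the storm-active-atom):

import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.storm.spout.ISpout;

// Only a sketch of the idea, not Storm's actual code.
void spoutExecutorLoop(ISpout spout, AtomicBoolean active) {
    while (active.get()) {   // mirrors the storm-active-atom check
        spout.nextTuple();   // tight loop: may emit zero or more tuples per call
    }
}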
> I am sure that my understanding is missing something, and due to this gap I am unable to connect to the internal logic of this framework
Having said this, I think it's more important to get a clear understanding of how to work with Storm rather than how Storm works internally at this early stage. E.g. instead of learning the internal mechanism of Storm, it's important to realize that if we set a spout to read a file line by line, then it keeps emitting each line using the _collector.emit method until it reaches EOF, and the bolt connected to it receives the same in its execute(tuple input) method.
Hope this helps; share more with us in the future!
Ordinary Spouts
There is a loop in Storm's executor daemon that repeatedly calls nextTuple (as well as ack and fail when appropriate) on the corresponding spout instance.
There is no waiting for tuples to be processed. The spout simply receives fail for tuples that did not manage to be processed within the given timeout.
This can be easily simulated with a topology of a fast spout and a slow processing bolt: the spout will receive a lot of fail calls.
See also the ISpout javadoc:
> nextTuple, ack, and fail are all called in a tight loop in a single thread in the spout task. When there are no tuples to emit, it is courteous to have nextTuple sleep for a short amount of time (like a single millisecond) so as not to waste too much CPU.
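A sketch of what that "courteous" nextTuple might look like (`pending` is a hypothetical in-memory queue feeding the spout, and `_collector` the usual SpoutOutputCollector):

import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

@Override
public void nextTuple() {
    String msg = pending.poll();   // `pending`: hypothetical in-memory queue
    if (msg == null) {
        Utils.sleep(1);            // nothing to emit: yield ~1 ms, don't busy-spin
        return;
    }
    _collector.emit(new Values(msg));
}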
Trident Spouts
The situation is completely different for Trident-spouts:
> By default, Trident processes a single batch at a time, waiting for the batch to succeed or fail before trying another batch. You can get significantly higher throughput – and lower latency of processing of each batch – by pipelining the batches. You configure the maximum amount of batches to be processed simultaneously with the topology.max.spout.pending property.
Even while processing multiple batches simultaneously, Trident will order any state updates taking place in the topology among batches.
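If you want to try the pipelining, a minimal sketch of setting that property (3 is just an arbitrary example value):

import org.apache.storm.Config;

Config config = new Config();
config.setMaxSpoutPending(3);  // sets topology.max.spout.pending: up to 3 batches in flight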
My application takes a lot of measurements of its internal processes. For example, I time certain methods, I time external web-service calls, and I also have variables whose values change, and processes which have a 'state' (e.g. PAUSED, WAITING, etc.).
The application uses 100 to 200 threads, and each bit of data would be associated with a particular thread.
I am looking for some software that I can channel all this information into, that would produce useful metrics and graphs of the data (ideally in real time or close to real time), let me set thresholds to trigger warnings, allow me to filter the data by thread or thread group, etc. etc.
The application performs time-critical tasks, so the software/API would need to be very fast and never block.
The application is written in Java, and ideally the software/API would be in Java as well. I think what I'm looking for is called Event Stream Processing, but I'm really not sure what language to use to describe it.
All I've found so far are Esper and ERMA. Can anyone give me a recommendation? I'm the only one working on this project so I'm hoping for something that is pretty easy to set up and use, and has a workable front end.
In the end I found Graphite, which was pretty close to being exactly what I wanted. It was not the simplest to set up and configure, but I got it working in the end.
http://graphite.wikidot.com/
In my case I send data directly from my application to StatsD (via UDP), which collects the data and does some pre-processing before it ends up in the Whisper back end. There is a simple example of a Java interface here: https://github.com/etsy/statsd/commit/2253223f3c19d2149d65ec5bc802198ff93da4cb
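A minimal sketch of such a fire-and-forget UDP send (assuming StatsD listens on localhost:8125; the payload follows StatsD's plain "name:value|type" text format):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public final class StatsdClient {
    public static void count(String metric, int value) {
        try (DatagramSocket socket = new DatagramSocket()) {
            byte[] payload = (metric + ":" + value + "|c").getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("localhost"), 8125));
        } catch (Exception e) {
            // Deliberately swallowed: metrics must never block or crash the app.
        }
    }
}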
Alternatively, you could send your data directly to Graphite; there's an example here: http://neopatel.blogspot.co.uk/2011/04/logging-to-graphite-monitoring-tool.html
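For the direct route, a sketch of Carbon's plaintext protocol (one "<metric.path> <value> <unix-timestamp>" line per data point, default port 2003; the host and metric name below are placeholders):

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Opens a TCP connection per send for simplicity; a real client would pool it.
public static void send(String metric, double value) throws Exception {
    try (Socket socket = new Socket("graphite-host", 2003);
         Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
        out.write(metric + " " + value + " " + (System.currentTimeMillis() / 1000) + "\n");
    }
}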
Given the following facts, is there an existing open-source Java API (possibly as part of some greater product) that implements an algorithm enabling the reproducible ordering of events in a cluster environment?
1) There are N sources of events, each with a unique ID.
2) Each event produced has an ID/timestamp, which, together with its source ID, makes it uniquely identifiable.
3) The IDs can be used to sort the events.
4) There are M application servers receiving those events. M is normally 3.
5) The events can arrive at any one or more of the application servers, in no specific order.
6) The events are processed in batches.
7) The servers have to agree for each batch on the list of events to process.
8) The events each have an earliest and latest batch ID in which they must be processed.
9) They must not be processed earlier, and are "failed" if they cannot be processed before the deadline.
10) The batches are based on the real clock time. For example, one batch per second.
11) The events of a batch are processed when 2 of the 3 servers agree on the list of events to process for that batch (quorum).
12) The "third" server then has to wait until it possesses all the required events before it can process that batch too.
13) Once an event is processed or failed, the source has to be informed.
14) [EDIT] Events from one source must be processed (or failed) in the order of their ID/timestamp, but there is no causality between different sources.
Less formally, I have those servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline. But I'm not sure if that actually makes any real difference to the algorithms. And if all servers agree on all batches, then they will always be in sync, therefore presenting a consistent view when queried.
While I would be most happy with a Java API, I would accept something else if I can call it from Java. And if there is no open-source API, but a clear algorithm, I would also take that as an answer and try to implement it myself.
Looking at the question and your follow-up, there probably "wasn't" an API to satisfy your requirements. Today you could take a look at Kafka (from LinkedIn):
Apache Kafka
And the general concept of "a log" entity, in what folks like to call 'big data':
The Log: What every software engineer should know about real-time data's unifying abstraction
Actually, for your question, I'd begin with the blog post about "the log". In my terms, the way it works (and Kafka isn't the only package out there doing log handling) is as follows:
- Instead of queue-based message-passing / publish-subscribe, Kafka uses a "log" of messages
- Subscribers (or end-points) can consume the log
- The log guarantees to be "in-order"; it handles giga-data and is fast
- Double check on the guarantee: there's usually a trade-off for reliability
- You just read the log; I think reads are destructive as the default
- If there's a subscriber group, everyone can 'read' before the log-entry dies
The basic handling (compute) process for the log is a map-reduce-filter model: you read everything really fast, keep only the stuff you want, process it (reduce), and produce outcome(s).
The downside seems to be that you need clusters and stuff to make it really shine. Since different servers or sites were mentioned, I think we are still on track. I found it a bit finicky to get up and running with the Apache downloads because they tend to assume non-Windows environments (ho hum).
The other 'fast' option would be
Apache Apollo
which would need you to do the plumbing for connecting different servers. Since the requirements include ...
> servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline
I suggest looking at a "Getting Started" example or tutorial with Kafka and then looking at similar ZooKeeper organised message/log software(s). Good luck and Enjoy!
So far I haven't got a clear answer, but I think it would be useful for anyone interested to see what I found.
Here are some theoretical discussions related to the subject:
Dynamic Vector Clocks for Consistent Ordering of Events
Conflict-free Replicated Data Types
One way of making multiple concurrent processes wait for each other, which I could use to synchronize the "batches", is a distributed barrier. One Java implementation seems to be available on top of Hazelcast and another uses ZooKeeper.
One simpler alternative I found is to use a DB. Every process inserts all events it receives into the DB. Depending on the DB design, this can be fully concurrent and lock-free, as in VoltDB, for example. Then, at a regular interval of one second, some "cron job" runs that selects and marks the events to be processed in the next batch. The job can run on every server. The first to run the job for one batch fixes the set of events, so that the others just get to use the list that the first one defined. That way we have a guarantee that all batches contain the same set of events on all servers. And if we use a complete order over the whole batch, which the cron job could specify itself, then the state of the servers will be kept in sync.
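As a sketch of that "first run fixes the batch" idea in plain JDBC (table and column names like events, batch_id, and deadline are hypothetical; the batch_id IS NULL guard makes later runs for the same batch no-ops):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.Instant;

// Whichever server runs first claims the unassigned events for this batch;
// subsequent runs find batch_id already set and change nothing.
void claimBatch(Connection con, long nextBatchId) throws Exception {
    try (PreparedStatement claim = con.prepareStatement(
            "UPDATE events SET batch_id = ? WHERE batch_id IS NULL AND deadline >= ?")) {
        claim.setLong(1, nextBatchId);
        claim.setTimestamp(2, Timestamp.from(Instant.now()));
        claim.executeUpdate();
    }
    // Every server then reads the same fixed set, in the same order:
    //   SELECT * FROM events WHERE batch_id = ? ORDER BY source_id, event_id
}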