How does Storm handle nextTuple in the Bolt - java

I am new to Storm and have created a program that reads incrementing numbers for a certain amount of time. I use a counter in the Spout, and in the nextTuple() method the counter is emitted and incremented:
_collector.emit(new Values(new Integer(currentNumber++)));
/* how is this method being called continuously? */
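For context, a minimal spout along these lines could look as follows (a trimmed-down sketch: apart from the emit shown above, the names and details are illustrative, and the package names assume a recent org.apache.storm release):
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class NumberSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private int currentNumber = 1;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void nextTuple() {
        // Storm calls this method over and over; each call emits the next number
        _collector.emit(new Values(currentNumber++));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("number"));
    }
}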
and the execute() method of my Bolt class has:
public void execute(Tuple input) {
    int number = input.getInteger(0);
    logger.info("This number is (" + number + ")");
    _outputCollector.ack(input);
}
/* this part I am clear about, as the Bolt receives its input from the Spout */
In my main class I have the following code:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("NumberSpout", new NumberSpout());
builder.setBolt("NumberBolt", new PrimeNumberBolt())
.shuffleGrouping("NumberSpout");
Config config = new Config();
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("NumberTest", config, builder.createTopology());
Utils.sleep(10000);
localCluster.killTopology("NumberTest");
localCluster.shutdown();
The program works perfectly fine. What I am trying to understand is how the Storm framework internally calls the nextTuple() method continuously. I am sure my understanding is missing something, and because of this gap I am unable to connect to the internal logic of this framework.
Can anyone help me understand this part clearly? It would be a great help, as I will have to implement this concept in my project, and being conceptually clear here would let me make significant progress.

how does the Storm framework internally call the nextTuple() method continuously?
I believe this actually involves a fairly detailed discussion of the entire life cycle of a Storm topology, as well as a clear concept of the different entities involved, like workers, executors, and tasks. The submission and processing of a topology is carried out by the StormSubmitter class with its submitTopology method.
The very first thing it does is upload the jar using Nimbus's Thrift interface, and then it calls submitTopology, which eventually submits the topology to Nimbus. Nimbus then starts by normalizing the topology (from the docs: the main purpose of normalization is to ensure that every single task will have the same serialization registrations, which is critical for getting serialization working correctly), followed by serialization, ZooKeeper handshaking, supervisor and worker process startup, and so on. It is too broad to discuss in full here, but if you really want to dig deeper you can go through the life cycle of a Storm topology, which explains nicely, step by step, the actions performed during the entire time. (A quick note from the documentation:)
First a couple of important notes about topologies:
The actual topology that runs is different than the topology the user
specifies. The actual topology has implicit streams and an implicit
"acker" bolt added to manage the acking framework (used to guarantee
data processing).
The implicit topology is created via the system-topology! function. system-topology! is used in two places:
- when Nimbus is creating tasks for the topology
- in the worker, so it knows where it needs to route messages
Now here are a few clues I can try to share...
Spouts and Bolts are the components that do the actual processing (the logic). In Storm terminology they execute as a number of tasks across the cluster.
From the doc page: Each task corresponds to one thread of execution.
Now, among many others, one typical responsibility of a worker process (read here) in Storm is to monitor whether a topology is active or not and to store that particular state in a variable named storm-active-atom. This variable is used by the tasks to determine whether or not to call the nextTuple method. So as long as your topology is live (you haven't posted your spout code, but I am assuming) and your timer is active (as you said, for a certain time), it will keep calling the nextTuple method. You can dig even further into Storm's acking framework implementation to understand how it recognizes and acknowledges a tuple once it has been successfully processed, and into guaranteed message processing.
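To make the idea concrete, here is a very rough, purely illustrative sketch of what the spout-side loop boils down to (the real loop lives inside Storm's executor code and also handles batching, metrics, and ack/fail dispatch):
void spoutLoop(ISpout spout, AtomicBoolean stormActive) {
    while (!Thread.currentThread().isInterrupted()) {
        if (stormActive.get()) {     // roughly the "storm-active-atom" check
            spout.nextTuple();       // may emit zero or more tuples via the collector
        } else {
            Utils.sleep(100);        // topology deactivated: do not call nextTuple
        }
    }
}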
I am sure that my understanding is missing something here and due to this gap I am unable to connect to the internal logic of this framework
Having said this, I think in the early stage it is more important to get a clear understanding of how to work with Storm than of how Storm works internally. For example, instead of learning the internal mechanism of Storm, it is important to realize that if we set up a spout to read a file line by line, it keeps emitting each line using the _collector.emit method until it reaches EOF, and the bolt connected to it receives each line in its execute(Tuple input) method.
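A minimal sketch of such a file-reading nextTuple (here reader stands for a BufferedReader opened in open(); error handling is kept deliberately simple):
@Override
public void nextTuple() {
    try {
        String line = reader.readLine();       // BufferedReader field, opened in open()
        if (line != null) {
            _collector.emit(new Values(line)); // one tuple per line
        } else {
            Utils.sleep(1);                    // EOF reached: nothing left to emit
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}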
Hope this helps; share more with us in the future.

Ordinary Spouts
There is a loop in Storm's executor daemon that repeatedly calls nextTuple (as well as ack and fail when appropriate) on the corresponding spout instance.
There is no waiting for tuples to be processed. The spout simply receives a fail call for tuples that did not manage to be processed within the given timeout.
This can be easily simulated with a topology of a fast spout and a slow processing bolt: the spout will receive a lot of fail calls.
See also the ISpout javadoc:
nextTuple, ack, and fail are all called in a tight loop in a single thread in the spout task. When there are no tuples to emit, it is courteous to have nextTuple sleep for a short amount of time (like a single millisecond) so as not to waste too much CPU.
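A minimal sketch of that courtesy sleep (pending here stands for some in-memory queue filled elsewhere, e.g. by a background consumer thread; the names are illustrative):
@Override
public void nextTuple() {
    Integer value = pending.poll();             // e.g. a ConcurrentLinkedQueue<Integer>
    if (value == null) {
        Utils.sleep(1);                         // nothing to emit: yield briefly
        return;
    }
    _collector.emit(new Values(value), value);  // anchored with a message id
}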
Trident Spouts
The situation is completely different for Trident-spouts:
By default, Trident processes a single batch at a time, waiting for
the batch to succeed or fail before trying another batch. You can get
significantly higher throughput – and lower latency of processing of
each batch – by pipelining the batches. You configure the maximum
amount of batches to be processed simultaneously with the
topology.max.spout.pending property.
Even while processing multiple batches simultaneously, Trident will order any state updates taking place in the topology among batches.
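If you want to experiment with pipelining, the property can be set on the topology config; the value below is just an example:
Config conf = new Config();
conf.setMaxSpoutPending(3);   // allow up to 3 batches in flight at the same time
// equivalent to: conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 3);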

Related

Spark Structured Streaming with RabbitMQ source

I am trying to write a custom receiver for Structured Streaming that will consume messages from RabbitMQ.
Spark recently released the DataSource V2 API, which seems very promising. Since it abstracts away many details, I want to use this API for the sake of both simplicity and performance. However, since it's quite new, there are not many sources available. I need some clarification from experienced Spark folks, since they will grasp the key points more easily. Here we go:
My starting point is the blog post series, with the first part here. It shows how to implement a data source, without streaming capability. To make a streaming source, I slightly changed them, since I need to implement MicroBatchReadSupport instead of (or in addition to) DataSourceV2.
To be efficient, it's wise to have multiple Spark executors consuming from RabbitMQ concurrently, i.e. from the same queue. If I'm not mistaken, every partition of the input (in Spark's terminology) corresponds to a consumer of the queue (in RabbitMQ's terminology). Thus, we need to have multiple partitions for the input stream, right?
Similar with part 4 of the series, I implemented MicroBatchReader as follows:
@Override
public List<DataReaderFactory<Row>> createDataReaderFactories() {
    int partition = options.getInt(RMQ.PARTITION, 5);
    List<DataReaderFactory<Row>> factories = new LinkedList<>();
    for (int i = 0; i < partition; i++) {
        factories.add(new RMQDataReaderFactory(options));
    }
    return factories;
}
I am returning a list of factories, and hope that every instance in the list will be used to create a reader, which will be also a consumer. Is that approach correct?
I want my receiver to be reliable, i.e. after every processed message (or at least after it is written to the checkpoint directory for further processing), I need to ack it back to RabbitMQ. The problem starts here: these factories are created at the driver, and the actual reading takes place at the executors through DataReaders. However, the commit method is a part of MicroBatchReader, not DataReader. Since I have many DataReaders per MicroBatchReader, how should I ack these messages back to RabbitMQ? Or should I ack when the next method is called on a DataReader? Is that safe? If so, what is the purpose of the commit function then?
CLARIFICATION: The link provided in the answer about the renaming of some classes/functions (in addition to the explanations there) made everything much clearer. Quoting from there:
Renames:
DataReaderFactory to InputPartition
DataReader to InputPartitionReader
...
InputPartition's purpose is to manage the lifecycle of the
associated reader, which is now called InputPartitionReader, with an
explicit create operation to mirror the close operation. This was no
longer clear from the API because DataReaderFactory appeared to be more
generic than it is and it isn't clear why a set of them is produced for
a read.
EDIT: However, the docs clearly say that "the reader factory will be serialized and sent to executors, then the data reader will be created on executors and do the actual reading."
To make the consumer reliable, I have to ACK a particular message only after it has been committed on the Spark side. Note that messages have to be ACKed on the same connection through which they were delivered, but the commit function is called at the driver node. How can I commit at the worker/executor node?
> I am returning a list of factories, and hope that every instance in the list will be used to create a reader, which will be also a consumer. Is that approach correct?
The socket source implementation has one thread pushing messages into an internal ListBuffer. In other words, there is one consumer (the thread) filling up the internal ListBuffer, which is **then** divided up into partitions by `planInputPartitions` (`createDataReaderFactories` got renamed to `planInputPartitions`).
Also, according to the Javadoc of MicroBatchReadSupport:
> The execution engine will create a micro-batch reader at the start of a streaming query, alternate calls to setOffsetRange and createDataReaderFactories for each batch to process, and then call stop() when the execution is complete. Note that a single query may have multiple executions due to restart or failure recovery.
In other words, the `createDataReaderFactories` should be called **multiple** times, which to my understanding suggests that each `DataReader` is responsible for a static input partition, which implies that the DataReader shouldn't be a consumer.
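To illustrate that idea, here is a purely hypothetical sketch in the pre-rename (Spark 2.3) naming, where the factory captures a fixed slice of data on the driver and the reader created on the executor merely iterates over it, rather than consuming anything itself:
import java.util.Iterator;
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.sources.v2.reader.DataReader;
import org.apache.spark.sql.sources.v2.reader.DataReaderFactory;

class StaticSliceReaderFactory implements DataReaderFactory<Row> {
    private final List<String> slice;   // serialized with the factory and shipped to the executor

    StaticSliceReaderFactory(List<String> slice) {
        this.slice = slice;
    }

    @Override
    public DataReader<Row> createDataReader() {
        Iterator<String> it = slice.iterator();
        return new DataReader<Row>() {
            private String current;

            @Override
            public boolean next() {
                if (!it.hasNext()) {
                    return false;
                }
                current = it.next();
                return true;
            }

            @Override
            public Row get() {
                return RowFactory.create(current);
            }

            @Override
            public void close() {
                // nothing to release for a static, in-memory slice
            }
        };
    }
}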
----------
> However, the commit method is a part of MicroBatchReader, not DataReader ... If so, what is the purpose of commit function then?
Perhaps part of the rationale for the commit function is to prevent the internal buffer of the MicroBatchReader from getting too big. By committing an Offset, you can effectively remove elements less than that Offset from the buffer, as you are making a commitment not to process them anymore. You can see this happening in the socket source code with `batches.trimStart(offsetDiff)`.
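A hypothetical sketch of such a commit, trimming an internal buffer kept on the driver (RMQOffset, the buffered list, and the sequence() accessor are illustrative names, not part of any real API):
@Override
public void commit(Offset end) {
    long committedSeq = ((RMQOffset) end).sequence();   // hypothetical Offset implementation
    synchronized (buffered) {                           // e.g. a LinkedList of buffered messages
        // drop everything up to and including the committed sequence number;
        // those messages will never be requested again
        buffered.removeIf(msg -> msg.sequence() <= committedSeq);
    }
}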
I'm unsure about implementing a reliable receiver, so I hope a more experienced Spark guy comes around and grabs your question as I'm interested too!
Hope this helps!
EDIT
I had only studied the socket and wiki-edit sources. These sources are not production ready, which is something the question was not looking for. Instead, the Kafka source is the better starting point, which, unlike the aforementioned sources, has multiple consumers, as the author was looking for.
However, if you're looking for an unreliable source, the socket and wiki-edit sources above provide a less complicated solution.

Why is Spark Standalone only creating one Executor when I have two Worker nodes?

First of all, have I fundamentally misunderstood Spark Standalone mode? The official documentation says
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications. However, to allow multiple concurrent
users, you can control the maximum number of resources each
application will use.
I thought that this implied multiple users could have applications running in parallel, submitting jobs to the same Spark Standalone cluster. However, now I am wondering if this was meant to mean that restricting resources would allow multiple users to each run separate Spark Standalone clusters without starving all other users (or just run other programs on the cluster without Spark starving them of resources). Is this the case?
I have Spark set up in Standalone mode on three VMs running Ubuntu. They can all see each other across a NAT network. One of the machines (192.168.56.101) is the master, while the others are slaves (192.168.56.102 and 192.168.56.103).
The Spark version is 2.1.7.
I have a Java app which creates JavaRDD objects in several threads, each calling .collect() in its own thread. I would have thought that this counts as the kind of "job" which can run in parallel for a single Spark Context object (according to https://spark.apache.org/docs/1.2.0/job-scheduling.html).
Each thread gets a JavaRDD object from a synchronized method of a class co-ordinating access to the (single) JavaSparkContext object. The JavaSparkContext is set up without much tweaking. Essentially it is
public synchronized JavaRDD<String> getRdd(List<String> fooList) {
    if (this.javaSparkContext == null) {
        SparkConf sparkConf = new SparkConf();
        sparkConf.set("spark.executor.memory", "500m");
        // There might be a few more settings here such as host name and port, but nothing directly to do with an executor pool or anything, as far as I remember. I don't have the code in front of me while not at work.
        this.javaSparkContext = JavaSparkContext.fromSparkContext(new SparkContext(sparkConf));
    }
    if ("fooPool".equals(this.jobPool)) {   // string comparison with equals(), not ==
        this.jobPool = "barPool";
    } else {
        this.jobPool = "fooPool";
    }
    this.javaSparkContext.setLocalProperty("spark.scheduler.pool", this.jobPool);
    this.javaSparkContext.requestExecutors(1);
    return this.javaSparkContext.parallelize(fooList);
}
The Spark Context object has set up two job pools (as I set it up to), as far as I can tell from the console log:
... INFO scheduler.FairSchedulableBuilder: Created pool fooPool, schedulingMode: FAIR, minShare: 1, weight: 1
... INFO scheduler.FairSchedulableBuilder: Created pool barPool, schedulingMode: FAIR, minShare: 1, weight: 1
... INFO scheduler.FairSchedulableBuilder: Created pool default, schedulingMode: FIFO, minShare: 1, weight: 1
I started many threads, each submitting one .collect() job, alternating between the two FAIR pools. As far as I can tell, these are being allocated to the two pools:
... INFO: scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
... INFO scheduler.FairSchedulableBuilder: Added task set TaskSet_0.0 tasks to pool fooPool
and so on, alternating between the two pools.
(The .collect() call is something like
List<String> consoleOutput = getRdd(fooList).cache().pipe("python ./dummy.py").collect();
but again I don't have the code in front of me. It certainly works in the sense that an Executor correctly executes the command.)
However, client.StandaloneAppClient$ClientEndpoint only ever creates one Executor, which then proceeds to execute all the tasks in barPool and then all the tasks in fooPool serially (but not FIFO). The Worker node VM has 1 core, although I set SPARK_EXECUTOR_INSTANCES, SPARK_EXECUTOR_CORES, SPARK_WORKER_INSTANCES, and SPARK_WORKER_CORES to 4, hoping that would help somehow.
The Master node also has SPARK_EXECUTOR_INSTANCES, SPARK_EXECUTOR_CORES, SPARK_WORKER_INSTANCES, and SPARK_WORKER_CORES set to 4.
Only one of the Worker nodes ever responds, and it only ever provides one Executor. Both Worker nodes can communicate with the Master: I can turn one off, and the other will take up the next set of jobs that I submit.
The jobs are trivial: each delivers a Python script that sleeps for a few seconds and prints some output, and each job takes a single-element RDD. This is a proof of concept for a good business reason, as essentially multiple unrelated RDDs would need to be processed in parallel by unrelated Python scripts.
Is there some setting which I have missed? I know that I am misusing Spark in that I am specifically preventing it from parallelizing according to an RDD, but this is set in stone. I am baffled though that only one Worker responds, given that there are many task sets lined up, in multiple job pools. I even call .requestExecutors(1) with every submission, with the console showing
... INFO cluster.StandaloneSchedulerBackend: Requesting 1 additional executor(s) from the cluster manager
but this seems to be totally ignored.
Any advice will be greatly appreciated!
Edit: added the Spark version and the Java code for the method that sets up the context, and reverted some unhelpful wording edits made to the question by others.
As far as I can tell from a lot of research on the Internet and experimenting with my own code, the answer is "Spark does not work that way".
Specifically:
1) There can only be 1 Spark Context per Java Virtual Machine.
2) Per Spark Context, tasks are only ever executed sequentially.
The approach used by popular Spark serving layers such as Mesos or Mist is to prepare several Spark Contexts, each in its own JVM, and to divide tasks among these Spark Contexts.
I managed to engage a second worker by using a second JVM (in my case, by running the same code simultaneously in the Eclipse debugger and in the IntelliJ debugger), but this just confirms the kind of set-up described above.

How to measure latency and throughput in a Storm topology

I'm learning Storm with the example ExclamationTopology. I want to measure the latency (the time it takes to add !!! to a word) of a bolt, and the throughput (say, how many words pass through a bolt per second).
From here, I can count the number of words and how many times a bolt is executed:
_countMetric = new CountMetric();
_wordCountMetric = new MultiCountMetric();
context.registerMetric("execute_count", _countMetric, 5);
context.registerMetric("word_count", _wordCountMetric, 60);
I know that the Storm UI gives Process Latency and Execute Latency and this post gives a good explanation of what they are.
However, I want to log the latency of every execution of each bolt, and use this information along with the word_count to calculate the throughput.
How can I use Storm Metrics to accomplish this?
While your question is straightforward and will surely be of interest to many people, its answer is not as trivial as it should be. First of all, we need to clarify what exactly we really want to measure. Throughput and latency are terms that are easily understood, but things get more complicated in Storm's distributed environment.
As depicted in this excellent blog post, each Storm supervisor runs at least 3 threads that fulfill different tasks. The Worker Receive Thread waits for incoming data tuples and aggregates them into batches before they are sent to the Worker Executor Thread. That thread contains the user logic (in your case the ExclamationBolt) and a sender that takes care of the outgoing messages. Finally, on every supervisor node there is a Worker Send Thread that collects messages coming from all executors, aggregates them, and sends them out to the network.
Each of those threads has its own latency and throughput. For the send and receive threads, these depend largely on the buffer sizes, which you can adjust. In your case you just want to measure the latency and throughput of one (execution) bolt; this is possible, but keep in mind that those other threads have their effects on this bolt.
My approach:
To obtain latency and throughput, I used the old Storm built-in metrics. Because I found the documentation not very clear, I want to draw a line here: we are not using the new Storm Metrics API v2, and we are not using cluster metrics.
Activate the Storm metrics logging by placing the following in your storm.yaml:
topology.metrics.consumer.register:
  - class: "org.apache.storm.metric.LoggingMetricsConsumer"
    parallelism.hint: 1
You can set the reporting interval with: topology.builtin.metrics.bucket.size.secs: 10
Run your query. All metrics are logged every 10 seconds to a specific metrics log file. It is not trivial to find this log file: Storm creates a LoggingMetricsConsumer bolt and places it somewhere in the cluster, so on the node running that bolt you should find the corresponding metrics file in the Storm logs.
This metrics file contains, for each executor, the metrics you are looking for, such as complete-latency, execute-latency, and so on. For throughput, I would use the queue metrics, which contain, for example, arrival_rate_secs as an estimate of how many tuples are inserted per second. Be mindful of the multiple threads that run on every supervisor.
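If you additionally want to record latency from inside the bolt itself, alongside the logged built-in metrics, a sketch using the same old metrics API as in your question could look like this (the metric names and the 10-second bucket are illustrative):
private transient ReducedMetric _latencyMetric;    // mean execute time per bucket
private transient CountMetric _throughputMetric;   // executions per bucket

@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    _latencyMetric = new ReducedMetric(new MeanReducer());
    _throughputMetric = new CountMetric();
    context.registerMetric("execute_latency_ms", _latencyMetric, 10);
    context.registerMetric("execute_count", _throughputMetric, 10);
}

@Override
public void execute(Tuple input) {
    long start = System.nanoTime();
    // ... actual bolt logic goes here ...
    _latencyMetric.update((System.nanoTime() - start) / 1_000_000.0);
    _throughputMetric.incr();
}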
Good luck!

Storm Emit Execute Latency

I have a Storm topology running in a distributed environment across 4 Unix nodes.
I have a JMSSpout that receives a message and then forwards it on to a ParseBolt that parses the raw message and creates an object.
To help measure latency, my JMSSpout emits the current time as a value; when the ParseBolt receives it, it gets the current time again and takes the difference as the latency.
Using this approach I am seeing 200+ ms which doesn't sound right at all. Does anyone have an idea with regards to why this might be?
It's probably a threading issue. Storm uses the same thread for all spout nextTuple() calls and tuples emitted aren't processed until the nextTuple() call ends. There's also a very tight loop that repeatedly calls the nextTuple() method and it can consume a lot of cycles if you don't put at least a short sleep in the nextTuple() implementation.
Try adding a sleep(10) and emitting only one tuple per nextTuple().
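For example, a sketch of what that could look like in the JMSSpout's nextTuple (consumer stands for a hypothetical javax.jms.MessageConsumer field; the emitted timestamp mirrors your measurement approach):
@Override
public void nextTuple() {
    Utils.sleep(10);                                   // throttle the tight spout loop
    try {
        Message message = consumer.receiveNoWait();    // hypothetical JMS consumer field
        if (message instanceof TextMessage) {
            // emit one tuple per call, carrying the current time for latency measurement
            _collector.emit(new Values(((TextMessage) message).getText(),
                                       System.currentTimeMillis()));
        }
    } catch (JMSException e) {
        throw new RuntimeException(e);
    }
}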

Why does Storm replay tuples from the spout instead of retrying on the crashed component?

I am using Storm to process problems online, but I can't understand why Storm replays tuples from the spout. Retrying at the component that crashed might be more efficient than replaying from the root, right?
Can anyone help me? Thanks.
A typical spout implementation will replay only the FAILED tuples. As explained here, a tuple emitted from the spout can trigger thousands of other tuples, and Storm creates a tree of tuples based on that. A tuple is considered "fully processed" when every message in the tree has been processed. While emitting, the spout adds a message id which is used to identify the tuple in a later phase. This is called anchoring, and it can be done in the following way:
_collector.emit(new Values("field1", "field2", 3) , msgId);
Now, the link posted above says:
A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout. This timeout can be configured on a topology-specific basis using the Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS configuration and defaults to 30 seconds.
If the tuple times out, Storm will call the fail method on the spout; likewise, in case of success, the ack method will be called.
So at this point Storm will let you know which tuples it failed to process, but if you look into the source code you will see that the implementation of the fail method is empty in the BaseRichSpout class, so you need to override BaseRichSpout's fail method in order to have replay capability in your application.
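A minimal sketch of such an override (the pending map and nextValues() are illustrative; in a real spout you would also bound the map and decide when to give up on a tuple):
private final Map<Object, Values> pending = new HashMap<>();

@Override
public void nextTuple() {
    Values values = nextValues();                 // hypothetical source of the next data
    if (values == null) {
        Utils.sleep(1);
        return;
    }
    Object msgId = UUID.randomUUID().toString();
    pending.put(msgId, values);
    _collector.emit(values, msgId);               // anchored emit
}

@Override
public void ack(Object msgId) {
    pending.remove(msgId);                        // fully processed, forget it
}

@Override
public void fail(Object msgId) {
    _collector.emit(pending.get(msgId), msgId);   // replay the failed tuple
}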
Such replays of failed tuples should represent only a tiny proportion of the overall tuple traffic, so the efficiency of this simple replay-from-start policy is usually not a concern.
Supporting a "replay-from-error-step" would bring lots of complexity, since the location of an error is sometimes hard to determine and there would be a need to support "replay-elsewhere" in case the cluster node where the error happened is currently (or permanently) down. It would also slow down the execution of the whole traffic, which would probably not be compensated by the efficiency gained on error handling (which, again, is assumed to be triggered rarely).
If you think this replay-from-start strategy would impact negatively your topology, try to break it down into several smaller ones separated by some persistent queuing system like Kafka.
