Storm Emit Execute Latency - java

I have a Storm topology running in a distributed environment across 4 Unix nodes.
I have a JMSSpout that receives a message and then forwards it on to a ParseBolt that parses the raw message and creates an object.
To help measure latency, my JMSSpout emits the current time as a value, and when the ParseBolt receives it, it gets the current time again and takes the difference as the latency.
Using this approach I am seeing 200+ ms, which doesn't sound right at all. Does anyone have an idea as to why this might be?

It's probably a threading issue. Storm uses the same thread for all spout nextTuple() calls, and tuples emitted aren't processed until the nextTuple() call ends. There's also a very tight loop that repeatedly calls nextTuple(), and it can consume a lot of cycles if you don't put at least a short sleep in your nextTuple() implementation.
Try adding a sleep(10) and emitting only one tuple per nextTuple() call.
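A minimal sketch of that change, assuming a hypothetical non-blocking tryReceiveJmsMessage() helper on the spout (the 10 ms sleep is just a starting point to tune):

@Override
public void nextTuple() {
    Message msg = tryReceiveJmsMessage(); // hypothetical non-blocking fetch from the JMS consumer
    if (msg != null) {
        // emit exactly one tuple per call, stamped with the send time
        _collector.emit(new Values(msg.toString(), System.currentTimeMillis()));
    } else {
        Utils.sleep(10); // back off so the tight calling loop doesn't burn CPU
    }
}

With this shape, an emitted tuple reaches the bolt shortly after nextTuple() returns instead of waiting behind a long batch of emits.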

Related

How to measure latency and throughput in a Storm topology

I'm learning Storm with the example ExclamationTopology. I want to measure the latency (the time it takes to add !!! to a word) of a bolt, and the throughput (say, how many words pass through a bolt per second).
From here, I can count the number of words and how many times a bolt is executed:
_countMetric = new CountMetric();
_wordCountMetric = new MultiCountMetric();
context.registerMetric("execute_count", _countMetric, 5);
context.registerMetric("word_count", _wordCountMetric, 60);
I know that the Storm UI gives Process Latency and Execute Latency and this post gives a good explanation of what they are.
However, I want to log the latency of every execution of each bolt, and use this information along with the word_count to calculate the throughput.
How can I use Storm Metrics to accomplish this?
While your question is straightforward and will surely be of interest to many people, its answer is not as trivial as it should be. First of all, we need to clarify what exactly we want to measure. Throughput and latency are terms that are easy to understand, but things get more complicated in Storm's distributed environment.
As depicted in this excellent blog post, each Storm supervisor has at least three threads which fulfill different tasks. The Worker Receive Thread waits for incoming data tuples and aggregates them into a bulk before they are sent to the Worker Executor Thread. This thread contains the user logic (in your case the ExclamationBolt) and a sender that takes care of outgoing messages. Finally, on every supervisor node there is a Worker Send Thread that collects the messages coming from all executors and sends them to the network.
Of course, each of those threads has its own latency and throughput. For the send and receive threads, these largely depend on the buffer sizes, which you can adjust. In your case, you just want to measure the latency and throughput of one (execution) bolt - this is possible, but keep in mind that those other threads have their effect on this bolt.
My approach:
To obtain latency and throughput, I used the old Storm builtin metrics. Because I found the documentation not very clear, I want to draw a line here: we are not using the new Storm Metric API v2, and we are not using Cluster Metrics.
Activate Storm's metrics logging by placing the following in your storm.yaml:
topology.metrics.consumer.register:
  - class: "org.apache.storm.metric.LoggingMetricsConsumer"
    parallelism.hint: 1
You can set the reporting interval with: topology.builtin.metrics.bucket.size.secs: 10
Run your query. All metrics are logged every 10 seconds to a specific metrics logfile. It is not trivial to find this logfile: Storm creates a LoggingMetricsConsumer bolt and distributes it among the cluster, and on the node running it, you should find the corresponding metrics file in the Storm logs.
This metrics file contains, for each executor, the metrics you are looking for, such as complete-latency, execute-latency, and so on. For throughput, I would use the queue metrics, which contain e.g. arrival_rate_secs as an estimate of how many tuples are inserted per second. Bear in mind the multiple threads that are executed on every supervisor.
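To address the per-execution-latency part concretely, one option is to time execute() yourself and feed the result into a ReducedMetric with a MeanReducer from the old builtin metrics API; the metric name and the 10-second interval below are my choices, not mandated by Storm:

private transient ReducedMetric _latencyMs; // mean execute latency per reporting window

@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    _collector = collector;
    // reported every 10 s to the registered LoggingMetricsConsumer
    _latencyMs = context.registerMetric("execute_latency_ms",
            new ReducedMetric(new MeanReducer()), 10);
}

@Override
public void execute(Tuple tuple) {
    long start = System.nanoTime();
    _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
    _collector.ack(tuple);
    _latencyMs.update((System.nanoTime() - start) / 1_000_000.0); // milliseconds
}

Dividing the word_count from above by the reporting interval then gives a throughput estimate for the same window.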
Good luck!

Long delay between Akka actors

I'm consistently seeing very long delays (60+ seconds) between two actors, measured from the time at which the first actor sends a message to the second until the second actor's onReceive method is actually called with the message. What kinds of things can I look for to debug this problem?
Details
Each instance of ActorA sends one message to ActorB with ActorRef.tell(Object, ActorRef). I collect a millisecond timestamp (with System.currentTimeMillis()) right after calling the tell method in ActorA, and another one at the start of ActorB's onReceive(Object). The interval between these timestamps is consistently 60 seconds or more. Specifically, when plotted over time, this interval follows a rough sawtooth pattern that ranges from just over 60 seconds to almost 120 seconds, as shown in the graph below.
These actors are early in the data flow of the system; several other actors follow after ActorB. The large gap only occurs between these two specific actors; the gaps between other pairs of adjacent actors are typically less than a millisecond, occasionally a few tens of milliseconds. Additionally, the actual time spent inside any given actor is never more than a second.
Generally, each actor in the system passes only a single message to another actor. One of the actors (subsequent to ActorB) sends a single message to each of a few different actors, and a small percentage (less than 0.1%) of the time, certain actors will send multiple messages to the same subsequent actor (i.e., multiple instances of the subsequent actor are demanded). When this occurs, the number of messages is typically on the order of a dozen or less.
Can this be explained (explicitly) by the normal reactive nature of Akka? Does it indicate a problem with the way work is distributed or the way the actors are configured? Is there something that can explicitly block a particular actor from spinning up? What other information should I collect or look at to understand the source of this, or to understand whether or not it is actually a problem?
You have a limited thread pool. If your Actors block, they still take up space in the thread pool. New threads will not be created if your thread pool is saturated.
You may want to configure the following settings (a configuration sketch follows below):
core-pool-size-factor,
core-pool-size-min, and
core-pool-size-max.
If you expect certain actions to block, you can instead wrap them in Future { blocking { ... } } and register a callback. But it's better to use asynchronous, non-blocking calls.
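For reference, those settings live under the dispatcher's thread-pool-executor section. A sketch of applying them from Java - the executor choice and pool numbers are placeholders to tune, not values from the original post:

import akka.actor.ActorSystem;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class TunedSystem {
    public static void main(String[] args) {
        Config tuning = ConfigFactory.parseString(
            "akka.actor.default-dispatcher {\n"
          + "  executor = \"thread-pool-executor\"\n"
          + "  thread-pool-executor {\n"
          + "    core-pool-size-factor = 3.0\n"
          + "    core-pool-size-min = 8\n"
          + "    core-pool-size-max = 64\n"
          + "  }\n"
          + "}").withFallback(ConfigFactory.load());
        // actors created under this system share the tuned default dispatcher
        ActorSystem system = ActorSystem.create("mySystem", tuning);
    }
}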

Why does Storm replay tuples from the spout instead of retrying on the crashed component?

I am using Storm to process online problems, but I can't understand why Storm replays tuples from the spout. Retrying on the component that crashed might be more effective than replaying from the root, right?
Can anyone help me? Thx
A typical spout implementation will replay only the FAILED tuples. As explained here, a tuple emitted from the spout can trigger thousands of other tuples, and Storm creates a tree of tuples based on that. A tuple is called "fully processed" when every message in the tree has been processed. While emitting, the spout adds a message ID which is used to identify the tuple in a later phase. This is called anchoring and can be done in the following way:
_collector.emit(new Values("field1", "field2", 3), msgId);
Now, the link posted above says:
A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout. This timeout can be configured on a topology-specific basis using the Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS configuration and defaults to 30 seconds.
If the tuple times out, Storm will call the fail method on the spout; likewise, in case of success, the ack method will be called.
So at this point Storm lets you know which tuples it failed to process, but if you look into the source code you will see that the implementation of the fail method is empty in the BaseRichSpout class, so you need to override BaseRichSpout's fail method in order to have replay capability in your application, as in the sketch below.
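A minimal sketch of such an override, keeping an in-memory map of pending tuples - the fetchNext() helper and field names are illustrative, not part of Storm's API:

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class ReplayingSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private final Map<Object, Values> _pending = new ConcurrentHashMap<>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void nextTuple() {
        Values values = fetchNext(); // hypothetical source of new data
        if (values != null) {
            Object msgId = UUID.randomUUID().toString();
            _pending.put(msgId, values);    // remember until acked
            _collector.emit(values, msgId); // anchored emit, as above
        }
    }

    @Override
    public void ack(Object msgId) {
        _pending.remove(msgId); // fully processed, forget it
    }

    @Override
    public void fail(Object msgId) {
        Values failed = _pending.get(msgId);
        if (failed != null) {
            _collector.emit(failed, msgId); // replay the failed tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("field1", "field2", "field3"));
    }

    private Values fetchNext() {
        return null; // stand-in for reading from a queue, file, etc.
    }
}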
Such replays of failed tuples should represent only a tiny proportion of the overall tuple traffic, so the efficiency of this simple replay-from-start policy is usually not a concern.
Supporting a "replay-from-error-step" would bring lots of complexity, since the location of an error is sometimes hard to determine, and there would be a need to support "replay-elsewhere" in case the cluster node where the error happened is currently (or permanently) down. It would also slow down the execution of the whole traffic, which would probably not be compensated by the efficiency gained on error handling (which, again, is assumed to be triggered rarely).
If you think this replay-from-start strategy would impact negatively your topology, try to break it down into several smaller ones separated by some persistent queuing system like Kafka.

How does Storm handle nextTuple in the bolt?

I am a newbie to Storm and have created a program to read incremented numbers for a certain time. I have used a counter in the spout, and in the nextTuple() method the counter is emitted and incremented:
_collector.emit(new Values(new Integer(currentNumber++)));
/* how is this method being continuously called? */
and the execute() method of the bolt has
public void execute(Tuple input) {
    int number = input.getInteger(0);
    logger.info("This number is (" + number + ")");
    _outputCollector.ack(input);
}
/* this part I am clear on, as the bolt receives its input from the spout */
In my Main class execution I have the following code
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("NumberSpout", new NumberSpout());
builder.setBolt("NumberBolt", new PrimeNumberBolt())
.shuffleGrouping("NumberSpout");
Config config = new Config();
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("NumberTest", config, builder.createTopology());
Utils.sleep(10000);
localCluster.killTopology("NumberTest");
localCluster.shutdown();
The program works perfectly fine. What I am currently looking at is how the Storm framework internally calls the nextTuple() method continuously. I am sure that my understanding is missing something here, and due to this gap I am unable to connect to the internal logic of this framework.
Can any of you help me understand this portion clearly? It would be a great help, as I will have to implement this concept in my project; if I am conceptually clear here, I can make significant progress. I'd appreciate it if anyone could quickly assist me. Awaiting responses...
How does the Storm framework internally call the nextTuple() method continuously?
I believe this actually involves a very detailed discussion of the entire life cycle of a Storm topology, as well as a clear concept of different entities like workers, executors, tasks, etc. The submission of a topology is carried out by the StormSubmitter class with its submitTopology method.
The very first thing it does is upload the jar using Nimbus's Thrift interface; it then calls submitTopology, which eventually submits the topology to Nimbus. Nimbus then starts by normalizing the topology (from the doc: the main purpose of normalization is to ensure that every single task will have the same serialization registrations, which is critical for getting serialization working correctly), followed by serialization, ZooKeeper handshaking, supervisor and worker process startup, and so on. It's too broad to discuss here, but if you really want to dig deeper you can go through the life cycle of a Storm topology, which nicely explains the step-by-step actions performed during the entire time. (A quick note from the documentation:)
First a couple of important notes about topologies:
The actual topology that runs is different than the topology the user specifies. The actual topology has implicit streams and an implicit "acker" bolt added to manage the acking framework (used to guarantee data processing).
The implicit topology is created via the system-topology! function. system-topology! is used in two places:
- when Nimbus is creating tasks for the topology
- in the worker, so it knows where it needs to route messages
Now here are a few clues I can try to share...
Spouts and bolts are actually the components which do the real processing (the logic). In Storm terminology, they execute as tasks across the cluster.
From the doc page : Each task corresponds to one thread of execution
Now, among many others, one typical responsibility of a worker process (read here) in Storm is to monitor whether a topology is active or not, and to store that particular state in a variable named storm-active-atom. This variable is used by the tasks to determine whether or not to call the nextTuple method. So as long as your topology is live (you haven't posted your spout code, but I'm assuming), until your timer runs out (as you said, for a certain time), it will keep calling the nextTuple method. You can dig even further into the implementation of Storm's acking framework to understand how it tracks and acknowledges a tuple once it is successfully processed; see Guaranteeing Message Processing.
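Schematically - this is pseudocode, not Storm's actual source, which is written in Clojure - the spout task's calling loop behaves like this:

while (stormActiveAtom.get()) { // the storm-active-atom state mentioned above
    spout.nextTuple();          // invoked back-to-back on a single thread
    // pending ack()/fail() callbacks for this spout run on the same thread
}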
I am sure that my understanding is missing something here and due to this gap I am unable to connect to the internal logic of this framework
Having said this, I think it's more important in the early stage to get a clear understanding of how to work with Storm rather than how Storm works internally. E.g., instead of learning the internal mechanism of Storm, it is important to realize that if we set a spout to read a file line by line, then it keeps emitting each line using the _collector.emit method until it reaches EOF, and the bolt connected to it receives the same in its execute(Tuple input) method.
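For instance, such a file-reading spout might look like the following sketch; the path, field name, and sleep are my assumptions:

import java.io.*;
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class FileLineSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private BufferedReader _reader;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        try {
            _reader = new BufferedReader(new FileReader("/tmp/input.txt")); // hypothetical path
        } catch (FileNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            String line = _reader.readLine();
            if (line != null) {
                _collector.emit(new Values(line)); // one line per call
            } else {
                Utils.sleep(1); // EOF: nothing to emit, be courteous to the calling loop
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}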
Hope this helps. Share more with us in the future.
Ordinary Spouts
There is a loop in Storm's executor daemon that repeatedly calls nextTuple (as well as ack and fail when appropriate) on the corresponding spout instance.
There is no waiting for tuples to be processed: the spout simply receives a fail for tuples that did not manage to be processed within the given timeout.
This can easily be simulated with a topology of a fast spout and a slow processing bolt: the spout will receive a lot of fail calls.
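For example, a deliberately slow bolt like the one below - the sleep merely has to exceed the default 30-second tuple timeout - will make the spout see a steady stream of fail calls:

public void execute(Tuple input) {
    Utils.sleep(60_000);   // longer than topology.message.timeout.secs (default 30 s)
    _collector.ack(input); // arrives too late: Storm has already failed the tuple
}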
See also the ISpout javadoc:
nextTuple, ack, and fail are all called in a tight loop in a single thread in the spout task. When there are no tuples to emit, it is courteous to have nextTuple sleep for a short amount of time (like a single millisecond) so as not to waste too much CPU.
Trident Spouts
The situation is completely different for Trident-spouts:
By default, Trident processes a single batch at a time, waiting for the batch to succeed or fail before trying another batch. You can get significantly higher throughput – and lower latency of processing of each batch – by pipelining the batches. You configure the maximum amount of batches to be processed simultaneously with the topology.max.spout.pending property.
Even while processing multiple batches simultaneously, Trident will order any state updates taking place in the topology among batches.
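In Java, that property can be set on the topology config before submission; a minimal sketch where the value 5, the topology name, and the tridentTopology variable are arbitrary:

Config conf = new Config();
conf.setMaxSpoutPending(5); // pipeline up to 5 Trident batches at once
StormSubmitter.submitTopology("trident-example", conf, tridentTopology.build());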

Performance issue designing threaded consumer of queue

I'm really new to programming and am having performance problems with my software. Basically, I get some data and run a 100-iteration loop on it (i=0; i<100; i++), and during that loop my program makes one of three decisions: keep the data it's working on, discard it, or send a version of it back to the queue to process. The individual work each thread does is very small, but there's a lot of it (which is why I'm using a queue server to scale horizontally).
My problem is that it never comes close to using my entire CPU; my program runs at around 40% per core. After profiling, it seems the majority of the time is spent sending/receiving data from the queue (approx. 64% in a part called com.rabbitmq.client.impl.Frame.readFrom(DataInputStream) and com.rabbitmq.client.impl.SocketFrameHandler.readFrame(); approx. 17% is getting it into the format for the queue, which I brought down from 40% before; and the rest is spent on my program's logic). Obviously, I want my work to be done faster and don't want it to spend so much time in the queue, so I'm wondering if there's a better design I can use.
My code is actually quite large, but here's an overview of what it does:
I create a connection to the queue server (RabbitMQ, Java).
I fork as many threads as I have CPU cores (using the same connection).
The data flow of each thread is:
- Each thread creates its own channel to the queue server using the shared connection.
- There's a while loop that polls the server and gets X number of messages without acknowledgments.
- Once I get a message, I use a thread executor to send an acknowledgment while my job is running.
- I parse the message and run my loop.
- If data is sent back to the queue, I send it to a thread executor that sends it back so my program can proceed with the next data set.
One weird thing I did: although I use thread executors for acknowledgments and for sending to the queue, my main worker threads are just forked threads (using public void run()). Because my program is dedicated to this single process, I did that to make sure there were always X number of threads ready to work (with no shutting down/respawning of them). The rest is in executor threads because I figured it could wait/be queued while my main program runs. A condensed sketch of this setup is shown below.
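For concreteness, here is that sketch - the queue name, prefetch count, and handler logic are assumptions, and it uses a push consumer rather than the polling loop described above:

import com.rabbitmq.client.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Worker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumption
        Connection connection = factory.newConnection(); // one shared connection

        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService ackPool = Executors.newCachedThreadPool();

        for (int i = 0; i < cores; i++) { // one long-lived worker thread per core
            new Thread(() -> {
                try {
                    Channel channel = connection.createChannel(); // channel per thread
                    channel.basicQos(50); // up to X messages in flight without acks (assumption)
                    channel.basicConsume("work-queue", false, new DefaultConsumer(channel) {
                        @Override
                        public void handleDelivery(String tag, Envelope env,
                                AMQP.BasicProperties props, byte[] body) {
                            long deliveryTag = env.getDeliveryTag();
                            ackPool.submit(() -> { // acknowledge while the job runs
                                try {
                                    channel.basicAck(deliveryTag, false);
                                } catch (Exception e) {
                                    e.printStackTrace();
                                }
                            });
                            process(body); // the 100-iteration loop lives here
                        }
                    });
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }).start();
        }
    }

    static void process(byte[] body) {
        // keep, discard, or re-enqueue a version of the data
    }
}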
I'm not sure how to design it better so it spends less time gathering/sending data. Are there any designs, RabbitMQ features, or Java techniques I can use to help?
If it's not IO wait, then I suspect that it's down to some locking going on inside those methods.
It looks to me like your threads are spending a significant amount of time waiting for those calls to return. Somewhat counter-intuitively, you might well be able to increase your performance by cutting down on the number of threads, since they'll spend less time tripping over each other and more time actively doing something.
Give it a try and see what effect it has on the profile.
