storm - finding source(s) of latency

storm - finding source(s) of latency - java

I have a three part topology that's having some serious latency issues but I'm having trouble figuring out where.
kafka -> db lookup -> write to cassandra
The numbers from the storm UI look like this:
(I see that the bolts are running at > 1.0 capacity)
If the process latency for the two bolts is ~65ms why is the 'complete latency' > 400 sec? The 'failed' tuples are coming from timeouts I suspect as the latency value is steadily increasing.
The tuples are connected via shuffleGrouping.
Cassandra lives on AWS so there are likely network limitations en route.
The storm cluster has 3 machines. There are 3 workers in the topology.

Your topology has several problems:
look at the capacity of the decode_bytes_1 and save_to_cassandra spouts. Both are over 1 (the sum of all spouts capacity should be under 1), which means you are using more resources than what do you have available. This is, the topology can't handle the load.
The TOPOLOGY_MAX_SPOUT_PENDING will solve your problem if the throughput of tuples varies during the day. This is, if you have peek hours, and you will be catch up during the off-peek hours.
You need to increase the number of worker machines or optimize the code in the bottle neck spouts (or maybe both). Otherwise you will not be able to process all the tuples.
You probably can improve the cassandra persister by inserting in batches instread of insert tuples one by one...
I seriously recommend you to always set the TOPOLOGY_MAX_SPOUT_PENDING for a conservative value. The max spout pending, means the maximum number of un-acked tuples inside the topology, remember this value is multiplied by the number of spots and the tuples will timeout (fail) if they are not acknowledged 30 seconds after being emitted.
And yes, your problem is having tuples timing out, this is exactly what is happening.
(EDIT) if you are running the dev environment (or just after deploy the topology) you might experience a spike in the traffic generated by messages that were not yet consumed by the spout; it's important you prevent this case to negatively affect your topology -- you never know when you need to restart the production topology, or perform some maintenance --, if this is the case you can handle it as a temporary spike in the traffic --the spout needs to consume all the messages produced while the topology was off-line -- and after a some (or many minutes) the frequency of incoming tuples stabilizes; you can handle this with max pout pending parameter (read item 2 again).
Considering you have 3 nodes in your cluster, and cpu usage of 0,1 you can add more executers to the bolts.

FWIW - it appears that the default value for TOPOLOGY_MAX_SPOUT_PENDING is unlimited. I added a call to stormConfig.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 500); and it appears (so far) that the problem has been alleviated. Possible 'thundering herd' issue?
After setting the TOPOLOGY_MAX_SPOUT_PENDING to 500:

Related

Elasticsearch 5 stuck reading from disk

I have a cluster of 6 nodes with ES 5.4 with 4B small documents yet indexed.
Documents are organized in ~9K indexes, for a total of 2TB. The indexes' occupancy varies from few KB to hundreds of GB and they are sharded in order to keep each shard under 20GB.
Cluster health query responds with:
{
cluster_name: "##########",
status: "green",
timed_out: false,
number_of_nodes: 6,
number_of_data_nodes: 6,
active_primary_shards: 9014,
active_shards: 9034,
relocating_shards: 0,
initializing_shards: 0,
unassigned_shards: 0,
delayed_unassigned_shards: 0,
number_of_pending_tasks: 0,
number_of_in_flight_fetch: 0,
task_max_waiting_in_queue_millis: 0,
active_shards_percent_as_number: 100
}
Before sending any query to the cluster, it is stable and it gets a bulk index query every second with 10 or some thousand of documents with no problem.
Everything is fine until I redirect some traffic to this cluster.
As soon as it starts to respond the majority of the servers start reading from disk at 250 MB/s making the cluster unresponsive:
What it is strange is that I cloned this ES configuration on AWS (same hardware, same Linux kernel, but different Linux version) and there I have no problem:
NB: note that 40MB/s of disk read is what I always had on servers that are serving traffic.
Relevant Elasticsearch 5 configurations are:
Xms12g -Xmx12g in jvm.options
I also tested it with the following configurations, but without succeeded:
bootstrap.memory_lock:true
MAX_OPEN_FILES=1000000
Each server has 16CPU and 32GB of RAM; some have Linux Jessie 8.7, other Jessie 8.6; all have kernel 3.16.0-4-amd64.
I checked that cache on each node with localhost:9200/_nodes/stats/indices/query_cache?pretty&human and all the servers have similar statistics: cache size, cache hit, miss and eviction.
It doesn't seem a warm up operation, since on AWS cloned cluster I never see this behavior and also because it never ends.
I can't find useful information under /var/log/elasticsearch/*.
Am I doing anything wrong?
What should I change in order to solve this problem?
Thanks!

You probably need to reduce the number of threads for searching.
Try going with 2x the number of processors. In the elasticsearch.yaml:
threadpool.search.size:<size>
Also, that sounds like too many shards for a 6 node cluster. If possible, I would try reducing that.

The max content of an HTTP request. Defaults to 100mb
servers start reading from disk at 250 MB/s making the cluster unresponsive - The max content of an HTTP request. Defaults to 100mb. . If set to greater than Integer.MAX_VALUE, it will be reset to 100mb.
This will become unresponsive and you might see the logs related this. Check with the max read size of the indices.
Check with Elasticsearch HTTP

a few things;
5.x has been EOL for years now, please upgrade as a matter of urgency
you are heavily oversharded
for point 2 - you either need to;
upgrade to handle that amount of shards, the memory management in 7.X is far superior
reduce your shard count by reindexing
add more nodes to deal with the load

Benchmarking Kafka - mediocre performance

I'm benchmarking Kafka 0.8.1.1 by streaming 1k size messages on EC2 servers.
I installed zookeeper on two m3.xlarge servers and have the following configuration:
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.server1=zoo1:2888:3888
server.server2=zoo2:2888:3888
Second I installed Single Kafka Server on i2.2xlarge machine with 32Gb RAM and additional 6 SSD Drives where each disk partitioned as /mnt/a , mnt/b, etc..... On server I have one broker, single topic on port 9092 and 8 partitions with replication factor 1:
broker.id=1
port=9092
num.network.threads=4
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/mnt/a/dfs-data/kafka-logs,/mnt/b/dfs-data/kafka-logs,/mnt/c/dfs-data/kafka-logs,/mnt/d/dfs-data/kafka-logs,/mnt/e/dfs-data/kafka-logs,/mnt/f/dfs-data/kafka-logs
num.partitions=8
log.retention.hours=168
log.segment.bytes=536870912
log.cleanup.interval.mins=1
zookeeper.connect=172.31.26.252:2181,172.31.26.253:2181
zookeeper.connection.timeout.ms=1000000
kafka.metrics.polling.interval.secs=5
kafka.metrics.reporters=kafka.metrics.KafkaCSVMetricsReporter
kafka.csv.metrics.dir=/tmp/kafka_metrics
kafka.csv.metrics.reporter.enabled=false
replica.lag.max.messages=10000000
All my tests are done from another instance and latency between instances is less than 1 ms.
I wrote producer/consumer java client using one thread producer and 8 threads consumer, when partition key is a random number from 0 till 7.
I serialized each message using Json by providing custom encoder.
My consumer producer properties are the following:
metadata.broker.list = 172.31.47.136:9092
topic = mytopic
group.id = mytestgroup
zookeeper.connect = 172.31.26.252:2181,172.31.26.253:2181
serializer.class = com.vanilla.kafka.JsonEncoder
key.serializer.class = kafka.serializer.StringEncoder
producer.type=async
queue.enqueue.timeout.ms = -1
batch.num.messages=200
compression.codec=0
zookeeper.session.timeout.ms=400
zookeeper.sync.time.ms=200
auto.commit.interval.ms=1000
number.messages = 100000
Now when I'm sending 100k messages, I'm getting 10k messages per second capacity and about 1 ms latency.
that means that I have 10 Megabyte per second which equals to 80Mb/s, this is not bad, but I would expect better performance from those instances located in the same zone.
Am I missing something in configuration?

I suggest you break down the problem. How fast is it without JSon encoding. How fast is one node, without replication vs with replication. Build a picture of how fast each component should be.
I also suggest you test bare metal machines to see how they compare as they can be significantly faster (unless CPU bound in which case they can be much the same)
According to this benchmark you should be able to get 50 MB/s from one node http://kafka.apache.org/07/performance.html
I would expect you should be able to get close to saturating your 1 Gb links (I assume thats what you have)
Disclaimer: I work on Chronicle Queue which is quite a bit faster, http://java.dzone.com/articles/kafra-benchmark-chronicle

If it makes sense for your application, you could get better performance by streaming byte arrays instead of JSON objects, and convert the byte arrays to JSON objects on the last step of your pipeline.
You might also get better performance if each consumer thread consistently reads from the same topic partition. I think Kafka only allows one consumer to read from a partition at a time, so depending on how you're randomly selecting partitions, its possible that a consumer would be briefly blocked if it's trying to read from the same partition as another consumer thread.
It's also possible you might be able to get better performance using fewer consumer threads or different kafka batch sizes. I use parameterized JUnit tests to help find the best settings for things like number of threads per consumer and batch sizes. Here are some examples I wrote which illustrate that concept:
http://www.bigendiandata.com/2016-10-02-Junit-Examples-for-Kafka/
https://github.com/iandow/kafka_junit_tests
I hope that helps.

What's a good way to enforce a single rate limit on multiple machines?

I have a web service with a load balancer that maps requests to multiple machines. Each of these requests end up sending a http call to an external API, and for that reason I would like to rate limit the number of requests I send to the external API.
My current design:
Service has a queue in memory that stores all received requests
I rate limit how often we can grab a request from the queue and process it.
This doesn't work when I'm using multiple machines, because each machine has its own queue and rate limiter. For example: when I set my rate limiter to 10,000 requests/day, and I use 10 machines, I will end up processing 100,000 requests/day at full load because each machine processes 10,000 requests/day. I would like to rate limit so that only 10,000 requests get processed/day, while still load balancing those 10,000 requests.
I'm using Java and MYSQL.

use memcached or redis keep api request counter per client. check every request if out rate limit.
if you think checking at every request is too expensive,you can try storm to process request log, and async calculate request counter.

The two things you stated were:
1)"I would like to rate limit so that only 10,000 requests get processed/day"
2)"while still load balancing those 10,000 requests."
First off, it seems like you are using a divide and conquer approach where each request from your end user gets mapped to one of the n machines. So, for ensuring that only the 10,000 requests get processed within the given time span, there are two options:
1) Implement a combiner which will route the results from all n machines to
another endpoint which the external API is then able to access. This endpoint is able
to keep a count of the amount of jobs being processed, and if it's over your threshold,
then reject the job.
2) Another approach is to store the amount of jobs you've processed for the day as a variable
inside of your database. Then, it's common practice to check if your threshold value
has been reached by the value in your database upon the initial request of the job
(before you even pass it off to one of your machines). If the threshold value has been
reached, then reject the job at the beginning. This, coupled with an appropriate message, has an advantage as having a better experience for the end user.
In order to ensure that all these 10,000 requests are still being load balanced so that no 1 CPU is processing more jobs than any other cpu, you should use a simple round robin approach to distribute your jobs over the m CPU's. With round robin, as apposed to a bin/categorization approach, you'll ensure that the job request is being distributed as uniformly as possible over your n CPU's. A downside to round robin, is that depending on the type of job you're processing you might be replicating a lot data as you start to scale up. If this is a concern for you, you should think about implementing a form of locality-sensitive hash (LSH) function. While a good hash function distributes the data as uniformly as possible, LSH exposes you to having a CPU process more jobs than other CPU's if a skew in the attribute you choose to hash against has a high probability of occurring. As always, there's tradeoffs associated with both, so you'll know best for your use cases.

Why not implement a simple counter in your database and make the API client implement the throttling?
User Agent -> LB -> Your Service -> Q -> Your Q consumers(s) -> API Client -> External API
API client checks the number (for today) and you can implement whatever rate limiting algorithm you like. eg if the number is > 10k the client could simply blow up, have the exception put the message back on the queue and continue processing until today is now tomorrow and all the queued up requests can get processed.
Alternatively you could implement a tiered throttling system, eg flat out til 8k, then 1 message every 5 seconds per node up til you hit the limit at which point you can send 503 errors back to the User Agent.
Otherwise you could go the complex route and implement a distributed queue (eg AMQP server) however this may not solve the issue entirely since your only control mechanism would be throttling such that you never process any faster than the something less than the max limit per day. eg your max limit is 10k so you never go any faster than 1 message every 8 seconds.

If you're not adverse to using a library/service https://github.com/jdwyah/ratelimit-java is an easy way to get distributed rate limits.
If performance is of utmost concern you can acquire more than 1 token in a request so that you don't need to make an API request to the limiter 100k times. See https://www.ratelim.it/documentation/batches for details of that

Distribute database records evenly across multiple processes

I have a database table with 3 million records. A java thread reads 10,000 records from table and processes it. After processing it jumps to next 10,000 and so on. In order to speed up, i have 25 threads doing the same task (reading + processing), and then I have 4 physical servers running the same java program. So effectively i have 100 thread doing the same work (reading + processing).
I strategy i have used is to have a sql procedure which does the work of grabbing next 10,000 records and marking them as being processed by a particular thread. However, i have noticed that the threads seems to be waiting for a some time trying to invoke the procedure and getting a response back. What other strategy i can use to speed up this process of data selection.
My database server is mysql and programming language is java

The idiomatic way of handling such scenario is producer-consumer design pattern. And in idiomatic way of implementing it in Java land is by using jms.
Essentially you need one master server reading records and pushing them to JMS queue. Then you'll have arbitrary number of consumers reading from that queue and competing with each other. It is up to you how you want to implement this in detail: do you want to send a message with whole record or only ID? All 10000 records in one message or record per message?
Another approach is map-reduce, check out hadoop. But the learning curve is a bit steeper.

Sounds like a job for Hadoop to me.

I would suspect that you are majorly database IO bound with this scheme. If you are trying to increase performance of your system, I would suggest partitioning your data across multiple database servers if you can do so. MySQL has some partitioning modes that I have no experience with. If you do partition yourself, it can add a lot of complexity to a database schema and you'd have to add some sort of routing layer using a hash mechanism to divide up your records across the multiple partitions somehow. But I suspect you'd get a significant speed increase and your threads would not be waiting nearly as much.
If you cannot partition your data, then moving your database to a SSD memory drive would be a huge win I suspect -- anything to increase the IO rates on those partitions. Stay away from RAID5 because of the inherent performance issues. If you need a reliable file system then mirroring or RAID10 would have much better performance with RAID50 also being an option for a large partition.
Lastly, you might find that your application performs better with less threads if you are thrashing your database IO bus. This depends on a number of factors including concurrent queries, database layout, etc.. You might try dialing down the per-client thread count to see if that makes a different. The effect may be minimal however.

Quartz Performance

It seems there is a limit on the number of jobs that Quartz scheduler can run per second. In our scenario we are having about 20 jobs per second firing up for 24x7 and quartz worked well upto 10 jobs per second (with 100 quartz threads and 100 database connection pool size for a JDBC backed JobStore), however, when we increased it to 20 jobs per second, quartz became very very slow and its triggered jobs are very late compared to their actual scheduled time causing many many Misfires and eventually slowing down the overall performance of the system significantly. One interesting fact is that JobExecutionContext.getScheduledFireTime().getTime() for such delayed triggers comes to be 10-20 and even more minutes after their schedule time.
How many jobs the quartz scheduler can run per second without affecting the scheduled time of the jobs and what should be the optimum number of quartz threads for such load?
Or am I missing something here?
Details about what we want to achieve:
We have almost 10k items (categorized among 2 or more categories, in current case we have 2 categories) on which we need to some processing at given frequency e.g. 15,30,60... minutes and these items should be processed within that frequency with a given throttle per minute. e.g. lets say for 60 minutes frequency 5k items for each category should be processed with a throttle of 500 items per minute. So, ideally these items should be processed within first 10 (5000/500) minutes of each hour of the day with each minute having 500 items to be processed which are distributed evenly across the each second of the minute so we would have around 8-9 items per second for one category.
Now for to achieve this we have used Quartz as scheduler which triggers jobs for processing these items. However, we don't process each item with in the Job.execute method because it would take 5-50 seconds (averaging to 30 seconds) per item processing which involves webservice call. We rather push a message for each item processing on JMS queue and separate server machines process those jobs. I have noticed the time being taken by the Job.execute method not to be more than 30 milliseconds.
Server Details:
Solaris Sparc 64 Bit server with 8/16 cores/threads cpu for scheduler with 16GB RAM and we have two such machines in the scheduler cluster.

In a previous project, I was confronted with the same problem. In our case, Quartz performed good up a granularity of a second. Sub-second scheduling was a stretch and as you are observing, misfires happened often and the system became unreliable.
Solved this issue by creating 2 levels of scheduling: Quartz would schedule a job 'set' of n consecutive jobs. With a clustered Quartz, this means that a given server in the system would get this job 'set' to execute. The n tasks in the set are then taken in by a "micro-scheduler": basically a timing facility that used the native JDK API to further time the jobs up to the 10ms granularity.
To handle the individual jobs, we used a master-worker design, where the master was taking care of the scheduled delivery (throttling) of the jobs to a multi-threaded pool of workers.
If I had to do this again today, I'd rely on a ScheduledThreadPoolExecutor to manage the 'micro-scheduling'. For your case, it would look something like this:
ScheduledThreadPoolExecutor scheduledExecutor;
...
scheduledExecutor = new ScheduledThreadPoolExecutor(THREAD_POOL_SIZE);
...
// Evenly spread the execution of a set of tasks over a period of time
public void schedule(Set<Task> taskSet, long timePeriod, TimeUnit timeUnit) {
if (taskSet.isEmpty()) return; // or indicate some failure ...
long period = TimeUnit.MILLISECOND.convert(timePeriod, timeUnit);
long delay = period/taskSet.size();
long accumulativeDelay = 0;
for (Task task:taskSet) {
scheduledExecutor.schedule(task, accumulativeDelay, TimeUnit.MILLISECOND);
accumulativeDelay += delay;
}
}
This gives you a general idea on how use the JDK facility to micro-schedule tasks. (Disclaimer: You need to make this robust for a prod environment, like check failing tasks, manage retries (if supported), etc...).
With some testing + tuning, we found an optimal balance between the Quartz jobs and the amount of jobs in one scheduled set.
We experienced a 100X throughput improvement in this way. Network bandwidth was our actual limit.

First of all check How do I improve the performance of JDBC-JobStore? in Quartz documentation.
As you can probably guess there is in absolute value and definite metric. It all depends on your setup. However here are few hints:
20 jobs per second means around 100 database queries per second, including updates and locking. That's quite a lot!
Consider distributing your Quartz setup to cluster. However if database is a bottleneck, it won't help you. Maybe TerracottaJobStore will come to the rescue?
Having K cores in the system everything less than K will underutilize your system. If your jobs are CPU intensive, K is fine. If they are calling external web services, blocking or sleeping, consider much bigger values. However more than 100-200 threads will significantly slow down your system due to context switching.
Have you tried profiling? What is your machine doing most of the time? Can you post thread dump? I suspect poor database performance rather than CPU, but it depends on your use case.

You should limit your number of threads to somewhere between n and n*3 where n is the number of processors available. Spinning up more threads is going to cause a lot of context switching, since most of them will be blocked most of the time.
As far as jobs per second, it really depends on how long the jobs run and how often they're blocked for operations like network and disk io.
Also, something to consider is that perhaps quartz isn't the tool you need. If you're sending off 1-2 million jobs a day, you might want to look into a custom solution. What are you even doing with 2 million jobs a day?!
Another option, which is a really bad way to approach the problem, but sometimes works... what is the server it's running on? Is it an older server? It might be bumping up the ram or other specs on it will give you some extra 'umph'. Not the best solution, for sure, because that delays the problem, not addresses, but if you're in a crunch it might help.

In situations with high amount of jobs per second make sure your sql server uses row lock and not table lock. In mysql this is done by using InnoDB storage engine, and not the default MyISAM storage engine which only supplies table lock.

Fundamentally the approach of doing 1 item at a time is doomed and inefficient when you're dealing with such a large number of things to do within such a short time. You need to group things - the suggested approach of using a job set that then micro-schedules each individual job is a first step, but that still means doing a whole lot of almost nothing per job. Better would be to improve your webservice so you can tell it to process N items at a time, and then invoke it with sets of items to process. And even better is to avoid doing this sort of thing via webservices and process them all inside a database, as sets, which is what databases are good for. Any sort of job that processes one item at a time is fundamentally an unscalable design.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.