Benchmarking Kafka - mediocre performance - java

I'm benchmarking Kafka 0.8.1.1 by streaming 1 KB messages between EC2 servers.
I installed ZooKeeper on two m3.xlarge servers with the following configuration:
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.server1=zoo1:2888:3888
server.server2=zoo2:2888:3888
Next, I installed a single Kafka server on an i2.2xlarge machine with 32 GB RAM and 6 additional SSD drives, each mounted as /mnt/a, /mnt/b, etc. The server runs one broker with a single topic on port 9092, with 8 partitions and replication factor 1:
broker.id=1
port=9092
num.network.threads=4
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/mnt/a/dfs-data/kafka-logs,/mnt/b/dfs-data/kafka-logs,/mnt/c/dfs-data/kafka-logs,/mnt/d/dfs-data/kafka-logs,/mnt/e/dfs-data/kafka-logs,/mnt/f/dfs-data/kafka-logs
num.partitions=8
log.retention.hours=168
log.segment.bytes=536870912
log.cleanup.interval.mins=1
zookeeper.connect=172.31.26.252:2181,172.31.26.253:2181
zookeeper.connection.timeout.ms=1000000
kafka.metrics.polling.interval.secs=5
kafka.metrics.reporters=kafka.metrics.KafkaCSVMetricsReporter
kafka.csv.metrics.dir=/tmp/kafka_metrics
kafka.csv.metrics.reporter.enabled=false
replica.lag.max.messages=10000000
All my tests are done from another instance, and the latency between instances is less than 1 ms.
I wrote a Java producer/consumer client with one producer thread and 8 consumer threads, where the partition key is a random number from 0 to 7.
I serialize each message as JSON using a custom encoder (sketched below).
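The encoder is roughly equivalent to this sketch for the Kafka 0.8 producer API (the Jackson-based body is illustrative; the real implementation may differ):
import com.fasterxml.jackson.databind.ObjectMapper;
import kafka.serializer.Encoder;
import kafka.utils.VerifiableProperties;

public class JsonEncoder implements Encoder<Object> {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Kafka 0.8 instantiates encoders reflectively through this constructor.
    public JsonEncoder(VerifiableProperties props) {
    }

    @Override
    public byte[] toBytes(Object message) {
        try {
            return MAPPER.writeValueAsBytes(message);
        } catch (Exception e) {
            throw new RuntimeException("JSON serialization failed", e);
        }
    }
}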
My producer/consumer properties are the following:
metadata.broker.list = 172.31.47.136:9092
topic = mytopic
group.id = mytestgroup
zookeeper.connect = 172.31.26.252:2181,172.31.26.253:2181
serializer.class = com.vanilla.kafka.JsonEncoder
key.serializer.class = kafka.serializer.StringEncoder
producer.type=async
queue.enqueue.timeout.ms = -1
batch.num.messages=200
compression.codec=0
zookeeper.session.timeout.ms=400
zookeeper.sync.time.ms=200
auto.commit.interval.ms=1000
number.messages = 100000
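The producer loop is essentially the following sketch (the payload type and the timing harness are illustrative, not the actual test code):
import java.util.Collections;
import java.util.Properties;
import java.util.Random;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class BenchmarkProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "172.31.47.136:9092");
        props.put("serializer.class", "com.vanilla.kafka.JsonEncoder");
        props.put("key.serializer.class", "kafka.serializer.StringEncoder");
        props.put("producer.type", "async");
        props.put("batch.num.messages", "200");

        Producer<String, Object> producer = new Producer<>(new ProducerConfig(props));
        Random random = new Random();

        long start = System.currentTimeMillis();
        for (int i = 0; i < 100_000; i++) {
            // A random key in [0, 7] spreads messages across the 8 partitions.
            String key = Integer.toString(random.nextInt(8));
            producer.send(new KeyedMessage<>("mytopic", key, payload(i)));
        }
        producer.close();
        System.out.printf("Sent 100k messages in %d ms%n", System.currentTimeMillis() - start);
    }

    // Placeholder for the JSON-serializable payload (illustrative).
    private static Object payload(int i) {
        return Collections.singletonMap("id", i);
    }
}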
When I send 100k messages, I get about 10k messages per second of throughput and roughly 1 ms latency.
That means 10 MB/s, or 80 Mbit/s, which is not bad, but I would expect better performance from instances located in the same availability zone.
Am I missing something in configuration?

I suggest you break down the problem. How fast is it without JSON encoding? How fast is one node without replication vs. with replication? Build a picture of how fast each component should be.
I also suggest you test bare-metal machines to see how they compare, as they can be significantly faster (unless you are CPU bound, in which case they can be much the same).
According to this benchmark, you should be able to get 50 MB/s from one node: http://kafka.apache.org/07/performance.html
I would expect you to be able to get close to saturating your 1 Gbit links (I assume that's what you have).
Disclaimer: I work on Chronicle Queue, which is quite a bit faster: http://java.dzone.com/articles/kafra-benchmark-chronicle

If it makes sense for your application, you could get better performance by streaming byte arrays instead of JSON objects and converting the byte arrays to JSON objects in the last step of your pipeline, as sketched below.
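As a hedged illustration (the class and names here are hypothetical): configure the producer with serializer.class=kafka.serializer.DefaultEncoder so byte[] passes through untouched, and let only the final stage pay the parsing cost:
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical last-stage decoder: upstream stages move raw byte[] and only
// this final step converts the bytes into a typed object.
public class DeferredJsonDecoder {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static <T> T decode(byte[] raw, Class<T> type) {
        try {
            return MAPPER.readValue(raw, type);
        } catch (Exception e) {
            throw new RuntimeException("Bad JSON payload", e);
        }
    }
}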
You might also get better performance if each consumer thread consistently reads from the same topic partition. Kafka only allows one consumer in a group to read from a partition at a time, so depending on how you're randomly selecting partitions, it's possible that a consumer would be briefly blocked if it's trying to read from the same partition as another consumer thread.
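With the old high-level consumer, you can get that pinning by asking Kafka for one stream per partition; each KafkaStream is then bound to a fixed subset of partitions. A minimal sketch, using the topic, group, and ZooKeeper addresses from the question:
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class EightThreadConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "172.31.26.252:2181,172.31.26.253:2181");
        props.put("group.id", "mytestgroup");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        // Ask for 8 streams on an 8-partition topic: each stream is pinned to
        // one partition, so consumer threads never contend for a partition.
        List<KafkaStream<byte[], byte[]>> streams = connector
                .createMessageStreams(Collections.singletonMap("mytopic", 8))
                .get("mytopic");

        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (KafkaStream<byte[], byte[]> stream : streams) {
            pool.submit(() -> {
                stream.forEach(msg -> { /* process msg.message() */ });
            });
        }
    }
}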
It's also possible you might get better performance using fewer consumer threads or different Kafka batch sizes. I use parameterized JUnit tests to help find the best settings for things like the number of threads per consumer and batch sizes. Here are some examples I wrote that illustrate the concept:
http://www.bigendiandata.com/2016-10-02-Junit-Examples-for-Kafka/
https://github.com/iandow/kafka_junit_tests
I hope that helps.

Related

Most efficient number of threads in Kafka Streams

I am using Kafka Streams with one topic (which has 3 partitions).
I want to know the most efficient number of threads for the Kafka Streams num.stream.threads option.
1 thread with 3 tasks vs. 3 threads with 1 task each: which one is better?
P.S. The server has a 3-core CPU.
The answer is: it depends! Typically, it is more efficient to have as many threads as partitions/tasks, as this gives you better parallelism. But having too many threads can also be disastrous due to context switching if you don't have enough CPU.
You must also consider the throughput of the data to be processed, as well as the cost of the operations performed on each record. If your streams application is not really data-intensive, there is little point in allocating a large number of threads, as they will be idle most of the time.
It is therefore best to start with a single thread and run load tests to measure the performance of your application. For this, you can use the command-line tool shipped with the Apache Kafka (or Confluent) distribution, bin/kafka-producer-perf-test.sh, and monitor the metrics exposed by Kafka Streams via JMX (see: Monitoring Kafka Streams - Confluent Documentation).
Moreover, note that the maximum number of threads you can usefully allocate to your application is not necessarily equal to the number of partitions of the input topic declared in your topology. You should also count the topics of all the sub-topologies generated by your application.
For example, let's say you are consuming a stream topic with 3 partitions, but your application performs a repartition operation. You will end up with two sub-topologies, each consuming one topic with 3 partitions. That gives a total of 6 tasks, which means you can configure up to 6 threads.
Note: It is usually recommended to deploy a KafkaStreams instance with a single thread and to scale horizontally by adding more instances. This simplifies the scaling model, especially when using Kubernetes (i.e., 1 pod = 1 KafkaStreams instance = 1 thread). A minimal sketch of that single-thread starting point follows.
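Here it is (application id, servers, topic names, and serdes are placeholders):
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class SingleThreadApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Start at 1; raise toward the total task count only after load tests.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}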

Performance Issue: Latency Spike happens sometimes in Kafka Streams

I am doing performance testing of Kafka Streams. I created a simple Streams application with a Transformer.
// Stream data from input topic
builder.stream(Serdes.String(), Serdes.String(), inTopic)
// convert csv data to avro
.transformValues(new TransformSupplier())
// post converted data to output topic
.to(Serdes.String(), Serdes.ByteArray(), outTopic);
inTopic has 10 partitions and outTopic has 1 partition. Latency is normally good, around ~4-6 ms, but I sometimes see sudden spikes up to ~60-1000 ms. After a few seconds the latency gradually drops back down to ~4-6 ms. This brings the average latency for my whole experiment up to ~67 ms.
What could be the reason for the sudden spikes? Please suggest some performance-tuning parameters, if any.
Note: I am using the default StreamsConfig only.
After a certain number of messages have been produced, the data is flushed to disk.
This may cause the phenomenon you observed.
See the log.flush.interval.messages broker setting in the Kafka configuration documentation.
In production, I do not recommend changing this property. Instead, change your system configuration:
/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_ratio
to improve the efficiency of the page-cache flushes that write your messages to disk.
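For example, in /etc/sysctl.conf (illustrative values only, not tuned recommendations; lower ratios make the kernel start background writeback earlier, which spreads flushes out instead of stalling in large bursts):
vm.dirty_background_ratio = 5
vm.dirty_ratio = 60
Apply the change with sysctl -p.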

Flink - structuring job to maximize throughput

I have 4 types of Kafka topics, with 65 topics of each type. The goal is to do some simple windowed aggregation on the data and write it to a DB.
The topologies will look something like:
kafka -> window -> reduce -> db write
Somewhere in this mix I want/need to do a union, or possibly several (depending on how many topics are combined each time).
The data flow in the topics ranges from 10K to >200K messages/min.
I have a four-node Flink cluster with 30 cores per node. How do I build these topologies to spread the load out?
I am writing this answer assuming that each of the 65 topics of the same type contains the same type of data.
The simplest solution to this problem would be to change the Kafka setup so that you have 4 topics with 65 partitions each. Then you have 4 data sources in the program, each with high parallelism (65), and this distributes across the cluster naturally; a sketch follows.
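For illustration, one of those four sources under that layout (topic name, consumer properties, and the Flink 1.x Kafka connector API are assumptions; import paths vary by Flink version):
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class TypeATopology {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        kafkaProps.setProperty("group.id", "type-a-agg");          // placeholder

        // Parallelism 65 matches the partition count, so Flink assigns one
        // partition to each source subtask.
        env.addSource(new FlinkKafkaConsumer<>("type-a", new SimpleStringSchema(), kafkaProps))
           .setParallelism(65)
           // window -> reduce -> db write would follow here
           .print();

        env.execute("type-a aggregation");
    }
}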
If it is not possible to change the setup, I see two things you can do:
One possible solution is to create a modified version of the FlinkKafkaConsumer where one source may consume multiple topics (rather than only multiple partitions of one topic). With that change, it would work pretty much as if you were using many partitions rather than many topics. If you want to go with this solution, I would ping the mailing list to get some support for doing this. It would be a valuable addition to the Flink code anyway.
You can give each source a separate resource group, which will give it a dedicated slot, via "env.addSource(new FlinkKafkaConsumer(...)).startNewResourceGroup();". But here, the observation is that you would be executing 260 distinct sources on a cluster with 120 cores (and thus probably 120 task slots). You would need to increase the number of slots to hold all the tasks.
I think the first option is preferable.

storm - finding source(s) of latency

I have a three-part topology with serious latency issues, but I'm having trouble figuring out where they come from.
kafka -> db lookup -> write to cassandra
The numbers from the Storm UI (screenshot omitted) show the bolts running at > 1.0 capacity.
If the process latency for the two bolts is ~65 ms, why is the 'complete latency' > 400 sec? I suspect the 'failed' tuples come from timeouts, since the latency value is steadily increasing.
The bolts are connected via shuffleGrouping.
Cassandra lives on AWS, so there are likely network limitations en route.
The Storm cluster has 3 machines, and there are 3 workers in the topology.
Your topology has several problems:
Look at the capacity of the decode_bytes_1 and save_to_cassandra bolts. Both are over 1.0 (capacity should stay under 1.0), which means you are using more resources than you have available; that is, the topology can't handle the load.
Setting TOPOLOGY_MAX_SPOUT_PENDING will solve your problem if the tuple throughput varies during the day, i.e., if you have peak hours and can catch up during the off-peak hours.
Otherwise, you need to increase the number of worker machines or optimize the code in the bottleneck bolts (or maybe both); if you don't, you will not be able to process all the tuples.
You can probably improve the Cassandra persister by inserting in batches instead of inserting tuples one by one.
I seriously recommend always setting TOPOLOGY_MAX_SPOUT_PENDING to a conservative value. Max spout pending is the maximum number of un-acked tuples inside the topology; remember this value is multiplied by the number of spout tasks, and tuples will time out (fail) if they are not acknowledged within 30 seconds of being emitted.
And yes, your problem is tuples timing out; this is exactly what is happening. A hedged example of such a setting follows.
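Here, 500 is a starting point, not a tuned value; on pre-1.0 Storm the Config class lives in backtype.storm instead:
import org.apache.storm.Config;

public class TopologyTuning {
    public static Config conservativeConfig() {
        Config conf = new Config();
        conf.setMaxSpoutPending(500);    // cap on un-acked tuples per spout task
        conf.setMessageTimeoutSecs(30);  // 30 s is the default tuple timeout, made explicit
        return conf;
    }
}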
(EDIT) If you are running a dev environment (or have just deployed the topology), you might experience a spike in traffic from messages that were not yet consumed by the spout. It's important to prevent this case from negatively affecting your topology; you never know when you will need to restart the production topology or perform some maintenance. If this happens, you can handle it as a temporary spike in traffic: the spout needs to consume all the messages produced while the topology was offline, and after some seconds (or minutes) the frequency of incoming tuples stabilizes. You can handle this with the max spout pending parameter (read item 2 again).
Considering you have 3 nodes in your cluster and a CPU usage of 0.1, you can add more executors to the bolts.
FWIW - it appears that the default value for TOPOLOGY_MAX_SPOUT_PENDING is unlimited. I added a call to stormConfig.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 500); and it appears (so far) that the problem has been alleviated. Possible 'thundering herd' issue?
After setting the TOPOLOGY_MAX_SPOUT_PENDING to 500, the Storm UI numbers improved (updated screenshot omitted).

Distribute database records evenly across multiple processes

I have a database table with 3 million records. A Java thread reads 10,000 records from the table and processes them; after processing, it jumps to the next 10,000, and so on. To speed this up, I have 25 threads doing the same task (reading + processing), and 4 physical servers running the same Java program, so effectively I have 100 threads doing the same work (reading + processing).
The strategy I have used is a SQL procedure that grabs the next 10,000 records and marks them as being processed by a particular thread. However, I have noticed that the threads seem to spend a lot of time invoking the procedure and waiting for a response. What other strategy can I use to speed up this data-selection process?
My database server is MySQL and the programming language is Java.
The idiomatic way of handling such a scenario is the producer-consumer design pattern, and the idiomatic way of implementing it in Java land is JMS.
Essentially you need one master server reading records and pushing them to a JMS queue, and then an arbitrary number of consumers reading from that queue and competing with each other. The details are up to you: do you want to send a message with the whole record or only its ID? All 10,000 records in one message or one record per message? A sketch of the dispatcher side follows.
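A minimal sketch of the dispatcher under the ID-range-per-message variant (queue name and range size are illustrative):
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MapMessage;
import javax.jms.MessageProducer;
import javax.jms.Session;

public class WorkDispatcher {
    public static void dispatch(ConnectionFactory factory) throws JMSException {
        Connection conn = factory.createConnection();
        try {
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer =
                    session.createProducer(session.createQueue("records.todo"));
            // One message per 10,000-record batch: ship only the ID range,
            // not the rows, so messages stay small and consumers compete fairly.
            for (long start = 0; start < 3_000_000L; start += 10_000) {
                MapMessage msg = session.createMapMessage();
                msg.setLong("fromId", start);
                msg.setLong("toId", start + 10_000);
                producer.send(msg);
            }
        } finally {
            conn.close();
        }
    }
}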
Another approach is MapReduce; check out Hadoop. But the learning curve is a bit steeper.
Sounds like a job for Hadoop to me.
I suspect you are heavily database-I/O-bound with this scheme. If you are trying to increase the performance of your system, I would suggest partitioning your data across multiple database servers if you can. MySQL has some partitioning modes that I have no experience with. If you partition yourself, it can add a lot of complexity to the database schema, and you'd have to add some sort of routing layer using a hash mechanism to divide your records across the partitions. But I suspect you'd get a significant speed increase and your threads would not be waiting nearly as much.
If you cannot partition your data, then moving your database to an SSD would be a huge win, I suspect; anything to increase the I/O rates on those partitions. Stay away from RAID 5 because of its inherent performance issues. If you need a reliable file system, then mirroring or RAID 10 would have much better performance, with RAID 50 also being an option for a large partition.
Lastly, you might find that your application performs better with fewer threads if you are thrashing your database I/O bus. This depends on a number of factors, including concurrent queries, database layout, etc. You might try dialing down the per-client thread count to see if that makes a difference. The effect may be minimal, however.
