Most efficient number of threads in Kafka streams

Most efficient number of threads in Kafka streams - java

I am using Kafka Streams with one topic(has 3 partitions).
I want to know most efficient number of thread numbers in Kafka Streams num.stream.threads option.
1 Thread and 3 tasks VS 3 Threads and 1 task(in each thread) Which one is better?
P.S. Server has 3 Core CPU.

The answer is, it depends! Typically, it will be more efficient to have as many threads as partitions/tasks as this will give you a better paralellism. But having too many threads can also be disastrous due to context switch if you don't have enought CPU.
You must also consider the throughput of the data to be processed, as well as the cost of the operation to perform on each record. If your stream application is not really data intensive you may not have interest to allocate a huge number of thread as they will be most of time idle.
It is therefore best to start with a single thread and perform load tests to measure the performance of your applications. For doing this, you can use the command-line tool available in the Apache kafka (or Confluent) distribution, i.e., bin/kafka-producer-perf-test.sh and monitor the metrics exposed by Kafka Streams using JMX (see : Monitoring Kafka Streams - Confluent Documentation).
Moreover, you should note that the maximum number of threads you can allocate to your application is not exactly equals to the number of partitions of the input topic you have declared in your topology. Actually, you should also consider all the topics from all the sub-topologies that have been generated by your application.
For example, let's say your are consuming a stream topic with 3 partitions, but your application perfom a repartition operation. Then, you will end up with two sub-topologies each consuming one topic with 3 partitions. So you will have a total of 6 tasks which means you can configure up to 6 threads.
Note: Usually, it is recommended to deploy a KafkaStreams instance with a single thread and to scale horizontally by adding more instance. This simplify the scaling model especially when using Kubernetes (i.e. 1 pod = 1 KafkaStreams instance = 1 Thread).

Related

Configuring akka dispatcher for large amount of concurrent graphs

My current system has around 100 thousand running graphs, Each is built like that:
Amqp Source ~> Processing Stage ~> Sink
Each amqp source receives messages at a rate of 1 per second. Only around 10 thousand graphs receive messages at once, So I've figured there is no need for more than 10 thousand threads running concurrently.
These are currently the settings i'm using:
my-dispatcher {
type = Dispatcher
executor = "fork-join-executor"
fork-join-executor {
parallelism-min = 16
parallelism-factor = 2.0
parallelism-max = 32
}
throughput = 20
}
Obviously these settings are not defining enough resources for the wanted performances, So I wonder:
Am I correct to assume that 10 thousand threads are enough?
Is it possible to configure the dispatcher (by editing application.conf) for that amount of threads? How would the configuration look like? Should I pick "fork-join-executor" or "thread-pool-executor" as the executor?
Thanks.

Akka and Akka Streams is based on async, an actor or stream only uses a thread for a chunk of processing and then hands the thread back to the threadpool, this is nice because you can size the threadpool according the number of cores you have to actually execute the threads rather than the things you want to execute. Having many threads will have an overhead, both in scheduling/switching and in that the JVM allocates a stack of somewhere around 0.5-1Mb per thread.
So, 10 thousand actors or running streams, can still execute fine on a small thread pool. Increasing the number of threads may rather slow the processing down than make anything faster as more time is spent on switching between threads. Even the default settings may be fine and you should always benchmark when tuning to see if the changes had the effect you expected.
Generally the fork join pool gives good performance for actors and streams. The thread-pool based one is good for use cases where you cannot avoid blocking (see this section of the docs: https://doc.akka.io/docs/akka/current/dispatchers.html#blocking-needs-careful-management)

Schedulers.elastic() not using early generated threads

I've written web-server on spring-boot-2.0 which uses netty under the hood.
This application uses Schedulers.elastic() threads.
When it started, about 100 elastic threads were created. These threads were rarely used and we've had few loading. But after a working day, the number of threads in elastic pool has increased to 1300. And now execution is on the elastic-1XXX, elastic-12XX threads, (name's numbers are above 100 and even 900).
Elastic, as I understand it, uses cachedThreadPool under the hood.
Why have new elastic threads been created and why has task switched to new threads?
What is the criteria for adding new trends?
And why haven't old threads (elastic-XX, elastic-1xx) been shutdown?

Without more information about the type of workload, the maximum concurrency, burst and average task duration, it’s really hard to tell if there’s a problem here.
By definition the elastic scheduler is creating an unbounded number of threads, as long as new tasks are added to the queue.
If the workload is bursty, with night concurrency at regular times, then it’s n unexpected to find a large number of threads. You could leverage the newElastic variants to reduce the TTL (default is 60s).
Again, without more information it’s hard to tell but your workload might not fit this scheduler. If your workload is CPU bound, the parallel scheduler is a better fit. The elastic one is tailored for IO/latency bound tasks.

The problem was: i was using Schedulers.elastic() for non-blocking operations, while there were no such operations . When i had removed elastic(), my service started to working correctly (without elastic's threads).

Efficient Approach to Process Large set of Data in java

I have a billion records which are unsorted, unrelated to each other and I have to call a function processRecord on each record using Java.
The easy way to do so is using a for loop but it is taking a lot of time.
The other way I could think of is using multithreading but the question is how to divide the dataset of records efficiently and among how many threads?
Is there an efficient way to process this large dataset?

Measure
Before figuring out which implementation path to choose you should measure how long it takes to process single item. Based on that you could choose size of work chunk submitted to thread pool, queue, cluster. Very small work chunks would increase coordination overhead. Too big work chunk will take long time to be processed so you will have less gradual progress info.
Single machine processing is simpler to implement, troubleshoot maintain and reason about.
Processing on single machine
Use java.util.concurrent.ExecutorService
Submit each work piece using submit(Callable<T> task) method https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#submit-java.util.concurrent.Callable-
Create instance of ExecutorService using java.util.concurrent.Executors.newFixedThreadPool(int nThreads). Choose reasonable value for nThreads Nnumber of CPU cores is reasonable startup value. You may add use more threads if there are some blocking IO calls (database, HTTP) in processing.
Processing on multiple machines - cluster
Submit processing jobs to cluster processing technologies like Spark, Hadoop, Google BigQuery.
Processing on multiple machines - queue
You can submit your records to any queue system (Kafka, RabbitMQ, ActiveMQ, etc). Then have multiple machines that consume items from the queue. You will be able to add/remove consumers at any time. This approach is fine if you do not need to have single place with processing result.

Parallel stream could be used here to perform parallel processing of your data. By default parallel stream uses pool by one thread less than processors count.
Wide and useful information about that could be found here https://stackoverflow.com/a/21172732/8184084

Benchmarking Kafka - mediocre performance

I'm benchmarking Kafka 0.8.1.1 by streaming 1k size messages on EC2 servers.
I installed zookeeper on two m3.xlarge servers and have the following configuration:
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.server1=zoo1:2888:3888
server.server2=zoo2:2888:3888
Second I installed Single Kafka Server on i2.2xlarge machine with 32Gb RAM and additional 6 SSD Drives where each disk partitioned as /mnt/a , mnt/b, etc..... On server I have one broker, single topic on port 9092 and 8 partitions with replication factor 1:
broker.id=1
port=9092
num.network.threads=4
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/mnt/a/dfs-data/kafka-logs,/mnt/b/dfs-data/kafka-logs,/mnt/c/dfs-data/kafka-logs,/mnt/d/dfs-data/kafka-logs,/mnt/e/dfs-data/kafka-logs,/mnt/f/dfs-data/kafka-logs
num.partitions=8
log.retention.hours=168
log.segment.bytes=536870912
log.cleanup.interval.mins=1
zookeeper.connect=172.31.26.252:2181,172.31.26.253:2181
zookeeper.connection.timeout.ms=1000000
kafka.metrics.polling.interval.secs=5
kafka.metrics.reporters=kafka.metrics.KafkaCSVMetricsReporter
kafka.csv.metrics.dir=/tmp/kafka_metrics
kafka.csv.metrics.reporter.enabled=false
replica.lag.max.messages=10000000
All my tests are done from another instance and latency between instances is less than 1 ms.
I wrote producer/consumer java client using one thread producer and 8 threads consumer, when partition key is a random number from 0 till 7.
I serialized each message using Json by providing custom encoder.
My consumer producer properties are the following:
metadata.broker.list = 172.31.47.136:9092
topic = mytopic
group.id = mytestgroup
zookeeper.connect = 172.31.26.252:2181,172.31.26.253:2181
serializer.class = com.vanilla.kafka.JsonEncoder
key.serializer.class = kafka.serializer.StringEncoder
producer.type=async
queue.enqueue.timeout.ms = -1
batch.num.messages=200
compression.codec=0
zookeeper.session.timeout.ms=400
zookeeper.sync.time.ms=200
auto.commit.interval.ms=1000
number.messages = 100000
Now when I'm sending 100k messages, I'm getting 10k messages per second capacity and about 1 ms latency.
that means that I have 10 Megabyte per second which equals to 80Mb/s, this is not bad, but I would expect better performance from those instances located in the same zone.
Am I missing something in configuration?

I suggest you break down the problem. How fast is it without JSon encoding. How fast is one node, without replication vs with replication. Build a picture of how fast each component should be.
I also suggest you test bare metal machines to see how they compare as they can be significantly faster (unless CPU bound in which case they can be much the same)
According to this benchmark you should be able to get 50 MB/s from one node http://kafka.apache.org/07/performance.html
I would expect you should be able to get close to saturating your 1 Gb links (I assume thats what you have)
Disclaimer: I work on Chronicle Queue which is quite a bit faster, http://java.dzone.com/articles/kafra-benchmark-chronicle

If it makes sense for your application, you could get better performance by streaming byte arrays instead of JSON objects, and convert the byte arrays to JSON objects on the last step of your pipeline.
You might also get better performance if each consumer thread consistently reads from the same topic partition. I think Kafka only allows one consumer to read from a partition at a time, so depending on how you're randomly selecting partitions, its possible that a consumer would be briefly blocked if it's trying to read from the same partition as another consumer thread.
It's also possible you might be able to get better performance using fewer consumer threads or different kafka batch sizes. I use parameterized JUnit tests to help find the best settings for things like number of threads per consumer and batch sizes. Here are some examples I wrote which illustrate that concept:
http://www.bigendiandata.com/2016-10-02-Junit-Examples-for-Kafka/
https://github.com/iandow/kafka_junit_tests
I hope that helps.

Is a single Java thread better than multiple threading in my scenario?

Our company is running a Java application (on a single CPU Windows server) to read data from a TCP/IP socket and check for specific criteria (using regular expressions) and if a match is found, then store the data in a MySQL database. The data is huge and is read at a rate of 800 records/second and about 70% of the records will be matching records, so there is a lot of database writes involved. The program is using a LinkedBlockingQueue to handle the data. The producer class just reads the record and puts it into the queue, and a consumer class removes from the queue and does the processing.
So the question is: will it help if I use multiple consumer threads instead of a single thread? Is threading really helpful in the above scenario (since I am using single CPU)? I am looking for suggestions on how to speed up (without changing hardware).
Any suggestions would be really appreciated. Thanks

Simple: Try it and see.
This is one of those questions where you argue several points on either side of the argument. But it sounds like you already have most of the infastructure set up. Just create another consumer thread and see if the helps.
But the first question you need to ask yourself:
What is better?
How do you measure better?
Answer those two questions then try it.

Can the single thread keep up with the incoming data? Can the database keep up with the outgoing data?
In other words, where is the bottleneck? If you need to go multithreaded then look into the Executor concept in the concurrent utilities (There are plenty to choose from in the Executors helper class), as this will handle all the tedious details with threading that you are not particularly interested in doing yourself.
My personal gut feeling is that the bottleneck is the database. Here indexing, and RAM helps a lot, but that is a different question.

It is very likely multi-threading will help, but it is easy to test. Make it a configurable parameter. Find out how many you can do per second with 1 thread, 2 threads, 4 threads, 8 threads, etc.

First of all:
It is wise to create your application using the java 5 concurrent api
If your application is created around the ExecutorService it is fairly easy to change the number of threads used. For example: you could create a threadpool where the number of threads is specified by configuration. So if ever you want to change the number of threads, you only have to change some properties.
About your question:
- About the reading of your socket: as far as i know, it is not usefull (if possible at all) to have two threads read data from one socket. Just use one thread that reads the socket, but make the actions in that thread as few as possible (for example read socket - put data in queue -read socket - etc).
- About the consuming of the queue: It is wise to construct this part as pointed out above, that way it is easy to change number of consuming threads.
- Note: you cannot really predict what is better, there might be another part that is the bottleneck, etcetera. Only monitor / profiling gives you a real view of your situation. But if your application is constructed as above, it is really easy to test with different number of threads.
So in short:
- Producer part: one thread that only reads from socket and puts in queue
- Consumer part: created around the ExecutorService so it is easy to adapt the number of consuming threads
Then use profiling do define the bottlenecks, and use A-B testing to define the optimal numbers of consuming threads for your system

As an update on my earlier question:
We did run some comparison tests between single consumer thread and multiple threads (adding 5, 10, 15 and so on) and monitoring the que size of yet-to-be processed records. The difference was minimal and what more.. the que size was getting slightly bigger after the number of threads was crossing 25 (as compared to running 5 threads). Leads me to the conclusion that the overhead of maintaining the threads was more than the processing benefits got. Maybe this could be particular to our scenario but just mentioning my observations.
And of course (as pointed out by others) the bottleneck is the database. That was handled by using the multiple-insert statement in mySQL instead of single inserts. If we did not have that to start with, we could not have handled this load.
End result: I am still not convinced on how multi-threading will give benefit on processing time. Maybe it has other benefits... but I am looking only from a processing-time factor. If any of you have experience to the contrary, do let us hear about it.
And again thanks for all your input.

In your scenario where a) the processing is minimal b) there is only one CPU c) data goes straight into the database, it is not very likely that adding more threads will help. In other words, the front and the backend threads are I/O bound, with minimal processing int the middle. That's why you don't see much improvement.
What you can do is to try to have three stages: 1st is a single thread pulling data from the socket. 2nd is the thread pool that does processing. 3rd is a single threads that serves the DB output. This may produce better CPU utilization if the input rate varies, at the expense of temporarily growth of the output queue. If not, the throughput will be limited by how fast you can write to the database, no matter how many threads you have, and then you can get away with just a single read-process-write thread.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.