Flink - structuring job to maximize throughput - java

I have 4 types of Kafka topics and 65 topics of each type. The goal is to do some simple windowed aggregation on the data and write it to a DB.
The topologies will look something like:
kafka -> window -> reduce -> db write
Somewhere in this mix I want / need to do a union - or possibly several (depending on how many topics are combined each time).
The data flow in the topics ranges from 10K to >200K messages / min.
I have a four-node Flink cluster with 30 cores per node. How do I build these topologies to spread the load out?

I am writing this answer assuming that each of the 65 topics of the same type contains the same type of data.
The simplest solution to this problem would be to change the Kafka setup such that you have 4 topics with 65 partitions each. Then you have 4 data sources in the program, with high parallelism (65) and this distributes across the cluster naturally.
If it is not possible to change the setup, I see two things you can do:
1. Create a modified version of the FlinkKafkaConsumer where one source may consume multiple topics (rather than only multiple partitions of one topic). With that change, it would work pretty much as if you were using many partitions rather than many topics. If you want to go with this solution, I would ping the mailing list to get some support for doing this. It would be a valuable addition to the Flink code anyway.
2. Give each source a separate resource group, which will give it a dedicated slot. You can do this via "env.addSource(new FlinkKafkaConsumer(...)).startNewResourceGroup();". Note, however, that you would then be trying to execute 260 distinct sources on a cluster with 120 cores (and thus probably 120 task slots). You would need to increase the number of slots to hold all the tasks.
I think the first option is the preferable option.
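To make the slot problem concrete, here is a small back-of-the-envelope sketch using only the numbers from the question. It assumes one task slot per core by default, which is only a guess at the cluster's setup; in Flink the slot count per TaskManager is configured with taskmanager.numberOfTaskSlots. The class and method names are made up for illustration.

```java
// Back-of-the-envelope slot arithmetic for the numbers given in the question.
final class SlotMath {

    /** Number of source tasks when every topic gets its own source. */
    static int sourceCount(int topicTypes, int topicsPerType) {
        return topicTypes * topicsPerType;
    }

    /** Slots each node must offer so that every source can get a dedicated slot. */
    static int slotsPerNodeNeeded(int sources, int nodes) {
        return (sources + nodes - 1) / nodes; // ceiling division
    }

    public static void main(String[] args) {
        int sources = sourceCount(4, 65);   // 4 topic types, 65 topics each
        int assumedSlots = 4 * 30;          // assumption: one slot per core on 4 nodes
        System.out.println("sources = " + sources + ", assumed slots = " + assumedSlots);
        System.out.println("slots needed per node = " + slotsPerNodeNeeded(sources, 4));
    }
}
```

With one source per topic, each of the four nodes would have to offer 65 slots, more than twice its core count, which is one reason consolidating the topics looks more attractive.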

Related

Uneven partitioner in kafka / no key

I have a topic with 3 partitions and only 1 consumer, and I am using the default partitioner, which in this case is "Sticky". Everything else is left at its defaults.
The data sent from the producer does not have a key and I don't want it to have one, I simply want each data to go to a random partition and for these to be evenly distributed.
However I have a result similar to this, where one partition is way above the others
As a result of this I have 2 questions.
Why did this happen?
How can I make the partitions even again?
I have tried to create a custom partitioner that looks at the size of each partition and assigns the data to the one with the least data. Is this possible?
Kafka documentation explains it:
The DefaultPartitioner now uses a sticky partitioning strategy. This
means that records for specific topic with null keys and no assigned
partition will be sent to the same partition until the batch is ready
to be sent. When a new batch is created, a new partition is chosen.
This decreases latency to produce, but it may result in uneven
distribution of records across partitions in edge cases. Generally
users will not be impacted, but this difference may be noticeable in
tests and other situations producing records for a very short amount
of time.
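The effect described in that passage can be illustrated with a toy simulation. This is not Kafka's implementation, only a simplified model written for this answer: "sticky" here means staying on one partition for a fixed number of records and then picking a new one at random, and all names and numbers are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Toy model of round-robin versus sticky partition selection (not Kafka code).
final class PartitionerSim {

    /** Round robin: record i goes to partition i mod n. */
    static int[] roundRobin(int records, int partitions) {
        int[] counts = new int[partitions];
        for (int i = 0; i < records; i++) {
            counts[i % partitions]++;
        }
        return counts;
    }

    /** Simplified sticky: keep one partition per batch, then pick a random one again. */
    static int[] sticky(int records, int partitions, int recordsPerBatch, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[partitions];
        int current = random.nextInt(partitions);
        for (int i = 0; i < records; i++) {
            if (i > 0 && i % recordsPerBatch == 0) {
                current = random.nextInt(partitions); // a new batch starts here
            }
            counts[current]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("round robin: " + Arrays.toString(roundRobin(1000, 3)));
        System.out.println("sticky:      " + Arrays.toString(sticky(1000, 3, 200, 42L)));
    }
}
```

With only five batches spread over three partitions, the sticky counts cannot come out even, while round robin differs by at most one record per partition. Over many batches the sticky counts tend to level out.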
Switching to the RoundRobinPartitioner (instead of the DefaultPartitioner) is probably what you are looking for. See https://kafka.apache.org/documentation/#producerconfigs_partitioner.class I don't know how constant your message rate is, but under normal (production) circumstances the default partitioner is pretty fair.
Also ensure that linger.ms is 0 and reduce batch.size as much as you can.
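As a sketch, those settings could be collected in a plain java.util.Properties object that is later passed to the producer. The bootstrap address and the batch.size value below are placeholders, not recommendations, and the serializer settings a real producer also needs are left out.

```java
import java.util.Properties;

// Collects the producer settings suggested above; values marked as placeholders are assumptions.
final class RoundRobinProducerConfig {

    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Spread records evenly instead of sticking to one partition per batch.
        props.setProperty("partitioner.class",
                "org.apache.kafka.clients.producer.RoundRobinPartitioner");
        // Send immediately rather than waiting to fill a batch.
        props.setProperty("linger.ms", "0");
        // Placeholder value: smaller than the default, tune for your workload.
        props.setProperty("batch.size", "1024");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a placeholder broker address.
        build("localhost:9092").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```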
Implementing a custom Partitioner is rather easy, but knowing which partition is the smallest is harder, as it will change very often. You may end up spending more time refreshing partition sizes and finding the smallest one than sending the message.

Java Batch Multithreading

I have some JSR-352 batch jobs that run quite well - despite their runtime.
Now I am thinking of distributing the work across several threads.
How is this pattern supported by JSR-352?
Edit: Now that I know the keywords to search for I can make out more resources on this problem:
Batch job definition: How to run a dynamically-calculated number of partitions?
JSR 352 batch application example
https://github.com/javaee-samples/jakartaee-samples/tree/main/ee7/batch
https://www.baeldung.com/java-ee-7-batch-processing#partitionof-job
https://www.ibm.com/support/pages/system/files/inline-files/WP102706_WLB_JSR352.002.pdf page 24+
I want to create a partitioned batchlet, and the partitions must be calculated at runtime. The idea is to split the processing of all records into a predefined maximum number of partitions, or a number of partitions with maximum size.
Here is another example of using partitioned chunk processing in JBeret. Partitioned batchlet should be similar in terms of job configuration. But partitioned batchlet is not common compared to partitioned chunk steps, because batchlet is mainly a task-oriented step, which is not easy to split into multiple partitions. So you may want to check if batchlet step is really suitable.
JBeret project test-apps module contains some examples with partition configuration. For example, cdiScopes tests use batchlet with partitions. Throttle tests use chunk with partitions.
JBeret test-apps/chunkPartition test module contains a job xml that uses mapper for dynamic partition.
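The runtime calculation itself is independent of the batch API. Below is a minimal sketch of both variants from the question (a maximum number of partitions, or a maximum partition size), producing index ranges. In a JSR-352 job, a PartitionMapper would typically turn such ranges into the per-partition properties of the PartitionPlan it returns; the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Computes record index ranges for partitions; plain Java, no batch API involved.
final class PartitionRanges {

    /** Half-open range [start, end) of record indexes handled by one partition. */
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /** Splits totalRecords into at most maxPartitions ranges of near-equal size. */
    static List<Range> byMaxPartitions(int totalRecords, int maxPartitions) {
        int partitions = Math.max(1, Math.min(maxPartitions, totalRecords));
        int base = totalRecords / partitions;
        int remainder = totalRecords % partitions;
        List<Range> ranges = new ArrayList<>();
        int start = 0;
        for (int p = 0; p < partitions; p++) {
            int size = base + (p < remainder ? 1 : 0); // first ranges take the leftovers
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }

    /** Splits totalRecords into ranges of at most maxSize records (maxSize must be positive). */
    static List<Range> byMaxSize(int totalRecords, int maxSize) {
        int partitions = (totalRecords + maxSize - 1) / maxSize; // ceiling division
        return byMaxPartitions(totalRecords, partitions);
    }

    public static void main(String[] args) {
        System.out.println(byMaxPartitions(10, 4));
        System.out.println(byMaxSize(10, 3));
    }
}
```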

What is partition in Spark?

I'm trying to understand what a partition is in Spark.
My understanding is that when we read from a source into a Dataset, that Dataset can be split into multiple sub-Datasets, which are called partitions, and it is up to the Spark framework where and how it distributes them in the cluster. Is that correct?
I started to doubt this when I read some online articles, which say:
Under the hood, these RDDs or Datasets are stored in partitions on
different cluster nodes. Partition basically is a logical chunk of a
large distributed data set
This statement breaks my understanding. According to it, RDDs or Datasets sit inside partitions, but I was thinking of an RDD itself as a partition (after splitting).
Can anyone help me clear up this doubt?
Here is my code snippet, where I am reading from JSON.
Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
.json(JsonPath);
So while reading, how can I split this into multiple partitions? Or is there any other way?
What is Partition?
As per spark documentation, A partition in spark is an atomic chunk of
data (a logical division of data) stored on a node in the cluster.
Partitions are the basic units of parallelism in Apache Spark. An RDD, DataFrame or Dataset in
Apache Spark is a collection of partitions.
So, When you do
Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
.json(JsonPath);
Spark reads your source JSON data, creates a logical division of the data (the partitions), and then processes those partitions in parallel across the cluster.
For example, in layman's terms:
Suppose you have to move a 1-ton load of wheat from one place to another and you have only one worker (similar to a single thread) to do it. Several things can go wrong:
1) Your worker might not be able to move such a huge weight at once (similar to not having enough CPU or RAM).
2) If the worker is capable of it (similar to a high-spec machine), it might still take a very long time and leave the worker exhausted.
3) Your worker can't do anything else while the load transfer is in progress, and so on.
What if you divide the 1-ton load of wheat into 1 kg blocks (similar to logical partitions of the data), hire more workers, and ask them to move the blocks?
Now the job is a lot easier for them, and you can add a few more workers (similar to scaling out the cluster) and finish the actual task easily and quickly.
In the same way, Spark logically divides the data so that you can process it in parallel, use your cluster resources optimally, and finish your task much faster.
Note: RDD, Dataset and DataFrame are just abstractions over logical partitions of data.
There are other concepts in RDDs and DataFrames that I didn't cover in the example (e.g. resilience and immutability).
How can I split this into multiple partitions ?
You can use the repartition API to split the data into more partitions:
spark.read().schema(Jsonreadystructure.SCHEMA)
.json(JsonPath).repartition(number)
and you can use the coalesce() API to reduce the number of partitions.
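The idea that a dataset is a collection of partitions, each processed independently, can be mimicked in plain Java. The sketch below is only an analogy and uses no Spark API: it splits a list into chunks, sums each chunk on its own (possibly on different threads), and combines the partial results. All names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Plain-Java analogy for partitioned processing; not Spark code.
final class PartitionAnalogy {

    /** Splits the data into numPartitions contiguous chunks of near-equal size. */
    static <T> List<List<T>> split(List<T> data, int numPartitions) {
        List<List<T>> partitions = new ArrayList<>();
        int size = data.size();
        for (int p = 0; p < numPartitions; p++) {
            int from = (int) ((long) size * p / numPartitions);
            int to = (int) ((long) size * (p + 1) / numPartitions);
            partitions.add(data.subList(from, to));
        }
        return partitions;
    }

    /** Sums every chunk independently, then adds up the partial sums. */
    static long parallelSum(List<Integer> data, int numPartitions) {
        return split(data, numPartitions).parallelStream()
                .mapToLong(part -> part.stream().mapToLong(Integer::longValue).sum())
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());
        System.out.println("partitions: " + split(data, 8).size());
        System.out.println("sum: " + parallelSum(data, 8));
    }
}
```

The dataset here is the whole list and the partitions are its chunks, which matches the quoted definition: the data lives in partitions, and the RDD or Dataset is the collection of them.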

Benchmarking Kafka - mediocre performance

I'm benchmarking Kafka 0.8.1.1 by streaming 1k size messages on EC2 servers.
I installed zookeeper on two m3.xlarge servers and have the following configuration:
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.server1=zoo1:2888:3888
server.server2=zoo2:2888:3888
Second, I installed a single Kafka server on an i2.2xlarge machine with 32 GB RAM and 6 additional SSD drives, where each disk is partitioned as /mnt/a, /mnt/b, etc. The server runs one broker on port 9092 with a single topic that has 8 partitions and replication factor 1:
broker.id=1
port=9092
num.network.threads=4
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/mnt/a/dfs-data/kafka-logs,/mnt/b/dfs-data/kafka-logs,/mnt/c/dfs-data/kafka-logs,/mnt/d/dfs-data/kafka-logs,/mnt/e/dfs-data/kafka-logs,/mnt/f/dfs-data/kafka-logs
num.partitions=8
log.retention.hours=168
log.segment.bytes=536870912
log.cleanup.interval.mins=1
zookeeper.connect=172.31.26.252:2181,172.31.26.253:2181
zookeeper.connection.timeout.ms=1000000
kafka.metrics.polling.interval.secs=5
kafka.metrics.reporters=kafka.metrics.KafkaCSVMetricsReporter
kafka.csv.metrics.dir=/tmp/kafka_metrics
kafka.csv.metrics.reporter.enabled=false
replica.lag.max.messages=10000000
All my tests are done from another instance and latency between instances is less than 1 ms.
I wrote a producer/consumer Java client using a single producer thread and 8 consumer threads, where the partition key is a random number from 0 to 7.
I serialized each message as JSON by providing a custom encoder.
My consumer and producer properties are the following:
metadata.broker.list = 172.31.47.136:9092
topic = mytopic
group.id = mytestgroup
zookeeper.connect = 172.31.26.252:2181,172.31.26.253:2181
serializer.class = com.vanilla.kafka.JsonEncoder
key.serializer.class = kafka.serializer.StringEncoder
producer.type=async
queue.enqueue.timeout.ms = -1
batch.num.messages=200
compression.codec=0
zookeeper.session.timeout.ms=400
zookeeper.sync.time.ms=200
auto.commit.interval.ms=1000
number.messages = 100000
Now when I send 100k messages, I get a throughput of about 10k messages per second and about 1 ms latency.
That means roughly 10 MB/s, which equals about 80 Mbit/s. This is not bad, but I would expect better performance from instances located in the same zone.
Am I missing something in configuration?
I suggest you break down the problem. How fast is it without JSON encoding? How fast is one node without replication versus with replication? Build a picture of how fast each component should be.
I also suggest you test bare metal machines to see how they compare, as they can be significantly faster (unless you are CPU bound, in which case they can be much the same).
According to this benchmark you should be able to get 50 MB/s from one node http://kafka.apache.org/07/performance.html
I would expect you to be able to get close to saturating your 1 Gb links (I assume that's what you have).
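To put the reported numbers next to that expectation, here is a quick calculation. It assumes a "1k" message is 1024 bytes and that the link carries 1,000,000,000 bits per second; both are assumptions, and protocol overhead is ignored.

```java
// Converts the reported message rate into bandwidth figures.
final class ThroughputMath {

    static long bytesPerSecond(long messagesPerSecond, long messageBytes) {
        return messagesPerSecond * messageBytes;
    }

    static double megabitsPerSecond(long bytesPerSecond) {
        return bytesPerSecond * 8 / 1_000_000.0;
    }

    /** Fraction of the link used, between 0.0 and 1.0 when the link is not exceeded. */
    static double linkUtilization(long bytesPerSecond, long linkBitsPerSecond) {
        return bytesPerSecond * 8.0 / linkBitsPerSecond;
    }

    public static void main(String[] args) {
        long bytes = bytesPerSecond(10_000, 1024);          // 10k messages/s of 1 KiB each
        System.out.println("bytes/s  = " + bytes);
        System.out.println("Mbit/s   = " + megabitsPerSecond(bytes));
        System.out.println("link use = " + linkUtilization(bytes, 1_000_000_000L));
    }
}
```

Under those assumptions the test uses roughly 8% of a 1 Gbit/s link, so the network itself is unlikely to be the limiting factor.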
Disclaimer: I work on Chronicle Queue which is quite a bit faster, http://java.dzone.com/articles/kafra-benchmark-chronicle
If it makes sense for your application, you could get better performance by streaming byte arrays instead of JSON objects, and convert the byte arrays to JSON objects on the last step of your pipeline.
You might also get better performance if each consumer thread consistently reads from the same topic partition. I think Kafka only allows one consumer to read from a partition at a time, so depending on how you're randomly selecting partitions, it's possible that a consumer would be briefly blocked if it's trying to read from the same partition as another consumer thread.
It's also possible you might be able to get better performance using fewer consumer threads or different kafka batch sizes. I use parameterized JUnit tests to help find the best settings for things like number of threads per consumer and batch sizes. Here are some examples I wrote which illustrate that concept:
http://www.bigendiandata.com/2016-10-02-Junit-Examples-for-Kafka/
https://github.com/iandow/kafka_junit_tests
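The same idea can be sketched without JUnit: run one workload for every combination of thread count and batch size and record how long each run takes. The harness below is a hypothetical illustration, not code from the linked examples; the workload in main is a placeholder where a real test would run the producer and consumers.

```java
import java.util.ArrayList;
import java.util.List;

// Times a workload for every (threads, batchSize) combination.
final class ParameterSweep {

    /** The code under test; a real benchmark would produce and consume messages here. */
    interface Workload {
        void run(int threads, int batchSize);
    }

    static final class Result {
        final int threads;
        final int batchSize;
        final long elapsedNanos;

        Result(int threads, int batchSize, long elapsedNanos) {
            this.threads = threads;
            this.batchSize = batchSize;
            this.elapsedNanos = elapsedNanos;
        }
    }

    static List<Result> sweep(int[] threadCounts, int[] batchSizes, Workload workload) {
        List<Result> results = new ArrayList<>();
        for (int threads : threadCounts) {
            for (int batchSize : batchSizes) {
                long start = System.nanoTime();
                workload.run(threads, batchSize);
                results.add(new Result(threads, batchSize, System.nanoTime() - start));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Placeholder workload: does nothing useful, only shows how the harness is called.
        Workload placeholder = (threads, batchSize) -> { };
        for (Result r : sweep(new int[] {1, 2, 4, 8}, new int[] {100, 200, 1000}, placeholder)) {
            System.out.println(r.threads + " threads, batch " + r.batchSize
                    + ": " + r.elapsedNanos + " ns");
        }
    }
}
```

A single timed run per combination is noisy, so a real comparison would repeat each combination several times and discard warm-up runs.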
I hope that helps.

Any workaround for distributing(sharding) key of collections like Lists, Sets etc

We are using Redis 2.8.17 as JobQueues.
We are using RPUSH and BRPOPLPUSH for making a reliable queue.
As per our current design, multiple app servers push (RPUSH) jobs to a single job queue. Since the BRPOPLPUSH operation is atomic in Redis, the jobs will later be popped (BRPOPLPUSH) and processed by any of the servers' consumers.
Since the app servers are capable of scaling out, I am a bit concerned that Redis might become a bottleneck in the future.
I learnt the following from documentation on redis partitioning:
"it is not possible to shard a dataset with a single huge key like a very big sorted set"
I wonder whether pre-sharding the queues for app servers is the only option to scale out.
Is there anything that cluster can do in the above design?
The main thing you need to consider is whether you'll need to shard at all. The entire StackExchange network (not just StackOverflow -- all the network) runs off of 2 Redis servers (one of which I'm pretty sure only exists for redundancy), which it uses very aggressively. Take a look at http://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-overflow/
Redis is absurdly fast (and remarkably space-efficient), with only one caveat: deleting an entire list/set/sorted set/hash is O(n), where n is the number of elements it contains. As long as you don't do that (operations like RPOP, BRPOP, BRPOPLPUSH, etc. don't count -- those are constant time), you should be golden.
TLDR: Unless you're planning to go bigger than StackOverflow, you won't need to shard, especially for a simple job queue.
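If a single list ever did become the bottleneck, the pre-sharding mentioned in the question amounts to choosing one of several queue keys per job. A minimal sketch of that key selection, with a made-up "jobs:N" naming scheme and no Redis client involved:

```java
// Maps a job id to one of shardCount queue keys; the "jobs:N" naming is hypothetical.
final class QueueSharding {

    static String queueKeyFor(String jobId, int shardCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        int shard = Math.floorMod(jobId.hashCode(), shardCount);
        return "jobs:" + shard;
    }

    public static void main(String[] args) {
        for (String jobId : new String[] {"job-1", "job-2", "job-3"}) {
            System.out.println(jobId + " -> " + queueKeyFor(jobId, 4));
        }
    }
}
```

Producers would push to the key returned for each job, and every consumer would need to watch all of the queue keys (or a fixed subset), which is extra complexity that the single-queue design avoids.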
