Uneven partition distribution in Kafka with no key - Java

I have a topic with 3 partitions and only 1 consumer, and I am using the default partitioner, which in this case is "sticky". Everything else is left at its defaults.
The data sent from the producer does not have a key and I don't want it to have one; I simply want each record to go to a random partition and for the records to be evenly distributed.
However, I get a result similar to this, where one partition is way above the others.
As a result, I have 2 questions.
Why did this happen?
How can I make the partition distribution even again?
I have tried to create a custom partitioner that looks at the size of each partition and assigns records to the partition with the least data. Is this possible?

Kafka documentation explains it:
The DefaultPartitioner now uses a sticky partitioning strategy. This
means that records for specific topic with null keys and no assigned
partition will be sent to the same partition until the batch is ready
to be sent. When a new batch is created, a new partition is chosen.
This decreases latency to produce, but it may result in uneven
distribution of records across partitions in edge cases. Generally
users will not be impacted, but this difference may be noticeable in
tests and other situations producing records for a very short amount
of time.
Switching to the RoundRobinPartitioner (instead of the DefaultPartitioner) is probably what you are looking for. See https://kafka.apache.org/documentation/#producerconfigs_partitioner.class
I don't know how steady your message rate is, but under normal circumstances (in production) the default partitioner is pretty fair.
Also ensure that linger.ms is 0 and reduce batch.size as much as you can.
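For reference, a minimal sketch of switching the producer to the RoundRobinPartitioner (the broker address and topic name are placeholders):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RoundRobinPartitioner;
import org.apache.kafka.common.serialization.StringSerializer;

public class RoundRobinProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Spread keyless records across partitions one by one instead of per batch.
        props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, RoundRobinPartitioner.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 9; i++) {
                // No key: the partitioner alone decides the target partition.
                producer.send(new ProducerRecord<>("my-topic", null, "message-" + i));
            }
        }
    }
}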
Implementing a custom Partitioner is rather easy, but knowing which partition is the smallest is harder, as it changes very often. You may end up spending more time refreshing partition sizes and finding the smallest one than actually sending messages.
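If you still want to experiment with it, the skeleton of a custom Partitioner is roughly as follows; this sketch simply picks a random partition, because the producer client has no cheap way of knowing per-partition sizes:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

public class RandomPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) {
        // Nothing to configure in this sketch.
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        // Replace this with your own logic if you really want to track
        // partition sizes from some external source.
        return ThreadLocalRandom.current().nextInt(partitions.size());
    }

    @Override
    public void close() {
        // Nothing to clean up.
    }
}

It would then be registered through the same partitioner.class property shown above.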

Related

Is It Possible for Cassandra to Return an Inconsistent Value?

I am very new to Cassandra and I am wondering, is it possible for Cassandra to return an inconsistent value?
For example, say we have a six-node cluster.
LOCAL_QUORUM = (replication_factor/2) + 1
This would give us a Local Quorum of 4. So for a simple write, four of six nodes have to respond, which means that four nodes would have the most recent value.
From my understanding, the two nodes that were not updated eventually get updated through Gossip Protocol.
If so what happens if a client reads from one of the two nodes that were not
updated before the protocol occurs? Are they at risk of getting a stale value?
How does read repair play into all this?
Also, a quick side note: not that you would ever do this, but if you set the replication factor equal to the consistency level, does that essentially operate the same as 2PC (two-phase commit) on the back end?
Welcome to the Cassandra world!
is it possible for Cassandra to return an inconsistent value?
Yes. Cassandra by nature takes an "eventually consistent" approach, so if you set the consistency level for a read to ANY or ONE, the risk of an inconsistent value being returned increases. You can raise this setting to ALL to ensure that the information is consistent, but you'll sacrifice performance and resiliency. The levels used in the application will depend on your use case.
For example, say we have six node cluster.
LOCAL_QUORUM = (replication_factor/2) + 1
The replication factor is independent of the number of nodes in the cluster; the rule of thumb is that the replication factor should not be greater than the number of nodes.
Assuming that you are using a replication factor of 6 in the 6-node cluster:
This would give us a Local Quorum of 4. So for a simple write, four of
six nodes have to respond, which means that four nodes would have the
most recent value.
From my understanding, the two nodes that were not updated eventually
get updated through Gossip Protocol.
The mechanism that ensures the replication factor is fulfilled is hinted handoff; the gossip protocol is used by the nodes to report node state (their own and that of other nodes), with states such as "up", "down", "healthy", "joining", "leaving", etc.
If so what happens if a client reads from one of the two nodes that
were not updated before the protocol occurs? Are they at risk of
getting a stale value?
You will want to read about Cassandra's read path; as a TL;DR, this will depend on the replication factor as well as the consistency level of the read operation. You can also decrease the risk of stale data by sacrificing resiliency and performance.
is it possible for Cassandra to return an inconsistent value?
The answer is: yes, it is.
It depends on how you set the read/write consistency levels.
If so what happens if a client reads from one of the two nodes that were not updated before the protocol occurs? Are they at risk of getting a stale value?
If you set the consistency level of read operations to ONE or TWO, then there is still a possibility/risk of getting a stale value. Why? Because Cassandra returns a value to the client as soon as it gets responses from the specified number of nodes.
Cassandra is very flexible; you can configure it as your application needs. To maintain a strong level of consistency you can always follow this rule:
Reliability of read and write operations depends on the consistency
used to verify the operation. Strong consistency can be guaranteed
when the following condition is true:
R + W > N
where R is the consistency level of read operations
W is the consistency level of write operations
N is the number of replicas
To understand more, check out: consistent read and write operations
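To make that concrete, here is a minimal sketch with the DataStax Java driver 4.x (keyspace, table and values are made up; connection settings come from the driver's configuration). It sets both the write and the read to QUORUM, so with a replication factor of 3 you get W + R = 2 + 2 > 3:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ConsistencyExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) { // contact points from config/defaults
            // Write at QUORUM ...
            SimpleStatement write = SimpleStatement.builder(
                    "INSERT INTO demo.users (id, name) VALUES (?, ?)")
                    .addPositionalValues(1, "alice")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
                    .build();
            session.execute(write);

            // ... and read at QUORUM: at least one replica in the read quorum
            // is guaranteed to have seen the latest write.
            SimpleStatement read = SimpleStatement.builder(
                    "SELECT name FROM demo.users WHERE id = ?")
                    .addPositionalValues(1)
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
                    .build();
            System.out.println(session.execute(read).one().getString("name"));
        }
    }
}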
How does read repair play into all this?
In read repair, Cassandra sends a digest request to each replica not directly
involved in the read. Cassandra compares all replicas and writes the most recent
version to any replica node that does not have it. If the query's consistency
level is above ONE, Cassandra performs this process on all replica nodes in the
foreground before the data is returned to the client. Read repair repairs any node
queried by the read. This means that for a consistency level of ONE, no data is
repaired because no comparison takes place. For QUORUM, only the nodes that the
query touches are repaired, not all nodes.
Check this link for more details:
Read Repair: repair during read path
Repairing nodes

Flink - structuring job to maximize throughput

I have 4 types of kafka topics and 65 topics of each type. The goal is to do some simple windowed aggregation on the data and write it to a DB.
The topologies will look something like:
kafka -> window -> reduce -> db write
Somewhere in this mix I want / need to do a union - or possibly several (depending on how many topics are combined each time).
The data flow in the topics ranges from 10K to >200K messages / min.
I have a four node flink cluster with 30 cores / node. How do I build these topologies to spread the load out?
I am writing this answer assuming that each of the 65 topics of the same type contains the same type of data.
The simplest solution to this problem would be to change the Kafka setup such that you have 4 topics with 65 partitions each. Then you have 4 data sources in the program, with high parallelism (65) and this distributes across the cluster naturally.
If it is not possible to change the setup, I see two things you can do:
One possible solution is to create a modified version of the FlinkKafkaConsumer where one source may consume multiple topics (rather than only multiple partitions of one topic). With that change, it would work pretty much as if you were using many partitions, rather than many topics. If you want to go with this solution, I would ping the mailing list to get some support for doing this. It would be a valuable addition to the Flink code anyways.
You can give each source a separate resource group, which will give it a dedicated slot. You can do this via "env.addSource(new FlinkKafkaConsumer(...)).startNewResourceGroup();". But here, the observation is that you try to execute 260 distinct sources on a cluster with 120 cores (and thus probably 120 task slots). You would need to increase the number of slots to hold all the tasks.
I think the first option is the preferable option.
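As an illustration: with more recent versions of the Kafka connector a single FlinkKafkaConsumer can already subscribe to a list of topics, so the first option no longer needs a modified source. A rough sketch of the kafka -> window -> reduce -> db write pipeline under that assumption (topic names, group id, window size and the aggregation are placeholders):

import java.util.Arrays;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class TopicTypeAggregation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        props.setProperty("group.id", "type-a-aggregator");    // placeholder

        // One source per topic *type*, subscribed to all 65 topics of that type.
        FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<>(
                Arrays.asList("typeA-1", "typeA-2" /* ... up to typeA-65 */),
                new SimpleStringSchema(), props);

        DataStream<String> stream = env.addSource(source).setParallelism(65);

        stream.keyBy(value -> value) // placeholder key extractor
              .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
              .reduce((a, b) -> a)   // placeholder aggregation
              .print();              // replace with the actual DB sink (e.g. a JDBC sink)

        env.execute("typeA aggregation");
    }
}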

storm - finding source(s) of latency

I have a three-part topology that's having some serious latency issues, but I'm having trouble figuring out where they come from.
kafka -> db lookup -> write to cassandra
The numbers from the storm UI look like this:
(I see that the bolts are running at > 1.0 capacity)
If the process latency for the two bolts is ~65ms why is the 'complete latency' > 400 sec? The 'failed' tuples are coming from timeouts I suspect as the latency value is steadily increasing.
The tuples are connected via shuffleGrouping.
Cassandra lives on AWS so there are likely network limitations en route.
The storm cluster has 3 machines. There are 3 workers in the topology.
Your topology has several problems:
1. Look at the capacity of the decode_bytes_1 and save_to_cassandra bolts. Both are over 1 (capacity should stay under 1), which means you are using more resources than you have available; in other words, the topology can't handle the load.
2. TOPOLOGY_MAX_SPOUT_PENDING will solve your problem if the throughput of tuples varies during the day, that is, if you have peak hours and can catch up during the off-peak hours.
3. You need to increase the number of worker machines or optimize the code in the bottleneck bolts (or maybe both). Otherwise you will not be able to process all the tuples.
4. You can probably improve the Cassandra persister by inserting in batches instead of inserting tuples one by one...
I seriously recommend that you always set TOPOLOGY_MAX_SPOUT_PENDING to a conservative value. Max spout pending means the maximum number of un-acked tuples inside the topology; remember this value is multiplied by the number of spout tasks, and tuples will time out (fail) if they are not acknowledged within 30 seconds (the default) of being emitted.
And yes, your problem is tuples timing out; this is exactly what is happening.
(EDIT) If you are running a dev environment (or have just deployed the topology) you might experience a spike in traffic generated by messages that were not yet consumed by the spout. It's important to prevent this from negatively affecting your topology; you never know when you will need to restart the production topology or perform some maintenance. If this is the case, you can handle it as a temporary spike in traffic: the spout needs to consume all the messages produced while the topology was offline, and after some minutes the frequency of incoming tuples stabilizes. You can handle this with the max spout pending parameter (read item 2 again).
Considering you have 3 nodes in your cluster and a CPU usage of 0.1, you can add more executors to the bolts.
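For reference, a rough sketch of how those settings can be applied when submitting the topology, assuming a recent Storm release where the classes live under org.apache.storm (older releases used backtype.storm); the spout/bolt wiring is elided and the numbers are only illustrative:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitWithMaxSpoutPending {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Wire up your actual spout and bolts here before submitting, e.g.:
        // builder.setSpout("kafka_spout", ...);
        // builder.setBolt("decode_bytes_1", ..., 6).shuffleGrouping("kafka_spout");
        // builder.setBolt("save_to_cassandra", ..., 6).shuffleGrouping("decode_bytes_1");

        Config conf = new Config();
        conf.setNumWorkers(3);        // one worker per machine, as in the topology above
        conf.setMaxSpoutPending(500); // cap un-acked tuples per spout task (default is unlimited)

        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}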
FWIW - it appears that the default value for TOPOLOGY_MAX_SPOUT_PENDING is unlimited. I added a call to stormConfig.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 500); and it appears (so far) that the problem has been alleviated. Possible 'thundering herd' issue?
After setting the TOPOLOGY_MAX_SPOUT_PENDING to 500:

Benchmarking Kafka - mediocre performance

I'm benchmarking Kafka 0.8.1.1 by streaming 1 KB messages on EC2 servers.
I installed ZooKeeper on two m3.xlarge servers with the following configuration:
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.server1=zoo1:2888:3888
server.server2=zoo2:2888:3888
Second, I installed a single Kafka server on an i2.2xlarge machine with 32 GB RAM and 6 additional SSD drives, each mounted as /mnt/a, /mnt/b, etc. On the server I have one broker and a single topic on port 9092, with 8 partitions and replication factor 1:
broker.id=1
port=9092
num.network.threads=4
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/mnt/a/dfs-data/kafka-logs,/mnt/b/dfs-data/kafka-logs,/mnt/c/dfs-data/kafka-logs,/mnt/d/dfs-data/kafka-logs,/mnt/e/dfs-data/kafka-logs,/mnt/f/dfs-data/kafka-logs
num.partitions=8
log.retention.hours=168
log.segment.bytes=536870912
log.cleanup.interval.mins=1
zookeeper.connect=172.31.26.252:2181,172.31.26.253:2181
zookeeper.connection.timeout.ms=1000000
kafka.metrics.polling.interval.secs=5
kafka.metrics.reporters=kafka.metrics.KafkaCSVMetricsReporter
kafka.csv.metrics.dir=/tmp/kafka_metrics
kafka.csv.metrics.reporter.enabled=false
replica.lag.max.messages=10000000
All my tests are done from another instance and latency between instances is less than 1 ms.
I wrote a producer/consumer Java client with one producer thread and 8 consumer threads, where the partition key is a random number from 0 to 7.
I serialized each message as JSON by providing a custom encoder.
My consumer and producer properties are the following:
metadata.broker.list = 172.31.47.136:9092
topic = mytopic
group.id = mytestgroup
zookeeper.connect = 172.31.26.252:2181,172.31.26.253:2181
serializer.class = com.vanilla.kafka.JsonEncoder
key.serializer.class = kafka.serializer.StringEncoder
producer.type=async
queue.enqueue.timeout.ms = -1
batch.num.messages=200
compression.codec=0
zookeeper.session.timeout.ms=400
zookeeper.sync.time.ms=200
auto.commit.interval.ms=1000
number.messages = 100000
Now when I'm sending 100k messages, I'm getting a throughput of 10k messages per second and about 1 ms latency.
That means I have 10 megabytes per second, which equals 80 Mbit/s. This is not bad, but I would expect better performance from instances located in the same zone.
Am I missing something in configuration?
I suggest you break down the problem. How fast is it without JSON encoding? How fast is one node without replication vs. with replication? Build a picture of how fast each component should be.
I also suggest you test bare metal machines to see how they compare as they can be significantly faster (unless CPU bound in which case they can be much the same)
According to this benchmark you should be able to get 50 MB/s from one node http://kafka.apache.org/07/performance.html
I would expect you to be able to get close to saturating your 1 Gb links (I assume that's what you have).
Disclaimer: I work on Chronicle Queue which is quite a bit faster, http://java.dzone.com/articles/kafra-benchmark-chronicle
If it makes sense for your application, you could get better performance by streaming byte arrays instead of JSON objects, and convert the byte arrays to JSON objects on the last step of your pipeline.
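For illustration, passing raw byte arrays through the producer looks roughly like this (shown with the newer Java client API for brevity; with the 0.8 producer the analogous change would be swapping the JSON encoder for a pass-through encoder):

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class RawBytesProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "172.31.47.136:9092");
        // Ship opaque byte arrays; JSON parsing happens only at the end of the pipeline.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = "{\"field\":\"value\"}".getBytes(StandardCharsets.UTF_8);
            producer.send(new ProducerRecord<>("mytopic", payload));
        }
    }
}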
You might also get better performance if each consumer thread consistently reads from the same topic partition. I think Kafka only allows one consumer to read from a partition at a time, so depending on how you're randomly selecting partitions, it's possible that a consumer would be briefly blocked if it's trying to read from the same partition as another consumer thread.
It's also possible you might be able to get better performance using fewer consumer threads or different kafka batch sizes. I use parameterized JUnit tests to help find the best settings for things like number of threads per consumer and batch sizes. Here are some examples I wrote which illustrate that concept:
http://www.bigendiandata.com/2016-10-02-Junit-Examples-for-Kafka/
https://github.com/iandow/kafka_junit_tests
I hope that helps.

Any workaround for distributing (sharding) keys of collections like Lists, Sets, etc.?

We are using Redis 2.8.17 as JobQueues.
We are using RPUSH and BRPOPLPUSH for making a reliable queue.
As per our current design, multiple app servers push (RPUSH) jobs to a single job queue. Since the BRPOPLPUSH operation is atomic in Redis, the jobs are later popped (BRPOPLPUSH) and processed by any of the servers' consumers.
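The pattern looks roughly like this with the Jedis client (queue names, host and payload are made up; this is only a sketch of the RPUSH/BRPOPLPUSH idea):

import redis.clients.jedis.Jedis;

public class ReliableQueueSketch {
    private static final String PENDING = "jobs:pending";       // producers RPUSH here
    private static final String PROCESSING = "jobs:processing"; // consumers park in-flight jobs here

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // App server side: enqueue a job.
            jedis.rpush(PENDING, "{\"jobId\":42}");

            // Consumer side: atomically move the next job into the processing list.
            // A timeout of 0 blocks until a job is available.
            String job = jedis.brpoplpush(PENDING, PROCESSING, 0);

            process(job);

            // Acknowledge: remove the job from the processing list once done.
            jedis.lrem(PROCESSING, 1, job);
        }
    }

    private static void process(String job) {
        System.out.println("processing " + job);
    }
}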
Since the app servers are capable of scaling out, I am a bit concerned that Redis might become a bottleneck in the future.
I learnt the following from documentation on redis partitioning:
"it is not possible to shard a dataset with a single huge key like a very big sorted set"
I wonder whether pre-sharding the queues for app servers is the only option to scale out.
Is there anything that cluster can do in the above design?
The main thing you need to consider is whether you'll need to shard at all. The entire StackExchange network (not just StackOverflow -- all the network) runs off of 2 Redis servers (one of which I'm pretty sure only exists for redundancy), which it uses very aggressively. Take a look at http://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-overflow/
Redis is absurdly fast (and remarkably space-efficient), with only one caveat: deleting an entire list/set/sorted set/hash is O(n), where n is the number of elements it contains. As long as you don't do that (operations like RPOP, BRPOP, BRPOPLPUSH, etc. don't count -- those are constant time), you should be golden.
TLDR: Unless you're planning to go bigger than StackOverflow, you won't need to shard, especially for a simple job queue.
