Producer - consumer using MySQL DB - java

My requirement is as follows
Maintain a pool of records in a table (MySQL DB).
A job acts as a producer and fills up this pool if the number of entries goes below a certain threshold. The job runs every 15 mins.
There can be multiple consumers with each consumer picking up just one record each. Two consumers coming in at the same time should get two different records.
The producer should not block the consumer. So while the producer job is running consumers should be able to pick up any available rows.
The producer / consumer is part of the application code, which in turn is a JBoss application.
In order to ensure that each consumer picks a distinct record (in case of concurrency) we do the following
We use an integer column as an index.
A consumer will first update the record with the lowest index value with its own name.
It will then select and pick up that record and proceed with that.
This approach ensures that two consumers do not end up with the same record.
One problem we are seeing is that when the producer is filling up the pool, consumers get blocked. Since the producer can take some time to complete, all consumers in that period are blocked as the update by the consumer waits for the insert by the producer to complete.
Is there any way to resolve this scenario? Any other approach to design this is also welcome.

Is it a hard requirement that you use a relational database as a queue? This seems like a bad approach to the problem, especially since the problem has already been addressed by message queues. You could use MySQL to persist the state of your queue, but it won't make a good queue itself.
Take a look at ActiveMQ or JBoss Messaging (given that you are using JBoss)
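To illustrate the hand-off semantics a queue gives you, here is a minimal in-process sketch built on java.util.concurrent.LinkedBlockingQueue. It is not ActiveMQ or JBoss Messaging code; RecordPool, refillIfLow and take are hypothetical names, and the records are plain strings for brevity. The point it shows is that each poll removes exactly one element, and a refill does not hold a lock across the whole batch.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical in-process stand-in for a message queue; not ActiveMQ/JBoss Messaging code.
final class RecordPool {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final int threshold;

    RecordPool(int threshold) {
        this.threshold = threshold;
    }

    // Producer side: top up only when the pool has dropped below the threshold.
    // Returns the number of records actually added.
    int refillIfLow(List<String> freshRecords) {
        if (queue.size() >= threshold) {
            return 0;
        }
        int added = 0;
        for (String record : freshRecords) {
            queue.add(record); // each add is independent, so consumers can poll between adds
            added++;
        }
        return added;
    }

    // Consumer side: each call removes one record, so two callers never get the same one.
    // Returns null when the pool is currently empty.
    String take() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }
}
```

With a real broker the queue would live outside the JVM and survive restarts, but the consumer-facing contract (one message goes to one consumer) is the same idea.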

Java Kafka Consumer store state in memory?

I have a use case where I need to "batch process" event data for customers.
Every piece of event data would have a customerId.
In my application layer (java), I will need to batch up all the events per customer id and then apply my business logic. My business logic needs all the events per customer to be available. Basically, I'm grouping by customerId before I can do anything with it.
Approach:
Ingest all the events to a Kafka topic with "customerId" as the partition key. Therefore the events belonging to a specific customer always go to the same consumer. In the consumer, I can gather the events in memory (perhaps using a simple expiry map or so) and do a batch process. In this approach, my entire batch is transient and stored in the application memory.
Caveats:
When a Kafka partition rebalance happens (for whatever reason) and partitions are re-assigned to different consumers, the data becomes inconsistent. Not sure if there's any way to overcome that.
I'm wondering what is a practical approach for such "batch" use cases? Is Kafka-Streams the right candidate for this? But this is not an infinite stream. The batch data set clearly has a start and end. End event is used as a trigger to perform the business logic.
The events will be ordered per customerId, but without a StickyAssignor in the consumer instances there is no guarantee that they will be consumed by the same consumer, especially in the event of rebalances in a distributed environment.
If you have some data in a compacted topic that acts as your raw events, and consuming them all into some cache will build up your materialized view, then that's what Kafka Streams does with changelog topics, yes. You can also build this logic on your own with a plain consumer, like the Confluent Schema Registry does with its _schemas topic and multiple internal HashMaps.
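As a rough illustration of the "plain consumer plus in-memory map" option, the sketch below buffers events per customerId and releases the batch when the end event arrives. CustomerBatcher and its method names are hypothetical, events are plain strings, and the isEnd flag stands in for however the end event is actually recognised. The state lives only in memory, so it has the same rebalance problem described in the question unless it is rebuilt by re-reading the topic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: buffers events per customerId until that customer's end event arrives.
final class CustomerBatcher {
    private final Map<String, List<String>> pending = new HashMap<>();

    // Adds one event; returns the completed batch when isEnd is true, otherwise empty.
    Optional<List<String>> add(String customerId, String event, boolean isEnd) {
        pending.computeIfAbsent(customerId, id -> new ArrayList<>()).add(event);
        if (!isEnd) {
            return Optional.empty();
        }
        // End event seen: hand the whole batch to the caller and drop it from memory.
        return Optional.of(pending.remove(customerId));
    }

    // Number of customers whose batch is still waiting for its end event.
    int openBatches() {
        return pending.size();
    }
}
```

The caller would invoke add for every record returned by the consumer and run the business logic whenever a non-empty Optional comes back.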

How does Kafka Consumer Consume from Multiple assigned Partition

tl;dr; I am trying to understand how a single consumer that is assigned multiple partitions handles consuming records for each partition.
For example:
Completely processes a single partition before moving to the next.
Process a chunk of available records from each partition every time.
Process a batch of N records from first available partitions
Process a batch of N records from partitions in round-robin rotation
I found the partition.assignment.strategy configuration for the Range or RoundRobin assignors, but this only determines how consumers are assigned partitions, not how a consumer consumes from the partitions it is assigned.
I started digging into the KafkaConsumer source and
#poll() led me to #pollForFetches()
#pollForFetches() then led me to fetcher#fetchedRecords() and fetcher#sendFetches()
This just led me to try to follow along the entire Fetcher class, and maybe it is just late or maybe I just didn't dig in far enough, but I am having trouble untangling exactly how a consumer will process multiple assigned partitions.
Background
Working on a data pipeline backed by Kafka Streams.
At several stages in this pipeline, as records are processed by different Kafka Streams applications, the stream is joined to compacted topics fed by external data sources that provide the data used to augment the records before they continue to the next stage in processing.
Along the way there are several dead letter topics for records that could not be matched to the external data sources that would have augmented them. This could be because the data is just not available yet (Event or Campaign is not Live yet) or because it is bad data and will never match.
The goal is to republish records from the dead letter topic whenever new augmenting data is published, so that we can match previously unmatched records from the dead letter topic, update them, and send them downstream for additional processing.
Records have potentially failed to match on several attempts and could have multiple copies in the dead letter topic so we only want to reprocess existing records (before latest offset at the time the application starts) as well as records that were sent to the dead letter topic since the last time the application ran (after the previously saved consumer group offsets).
It works well as my consumer filters out any records arriving after the application has started, and my producer is managing my consumer group offsets by committing the offsets as part of the publishing transaction.
But I want to make sure that I will eventually consume from all partitions, as I have run into an odd edge case where unmatched records get reprocessed and land in the same partition as before in the dead letter topic, only to get filtered out by the consumer. And though the consumer is not getting new batches of records to process, there are partitions that have not been reprocessed yet either.
Any help understanding how a single consumer processes multiple assigned partitions would be greatly appreciated.
You were on the right track looking at Fetcher, as most of the logic is there.
First as the Consumer Javadoc mentions:
If a consumer is assigned multiple partitions to fetch data from, it
will try to consume from all of them at the same time, effectively
giving these partitions the same priority for consumption.
As you can imagine, in practice, there are a few things to take into account.
Each time the consumer is trying to fetch new records, it will exclude partitions for which it already has records awaiting (from a previous fetch). Partitions that already have a fetch request in-flight are also excluded.
When fetching records, the consumer specifies fetch.max.bytes and max.partition.fetch.bytes in the fetch request. These are used by the brokers to respectively determine how much data to return in total and per partition. This is equally applied to all partitions.
Using these 2 approaches, by default, the Consumer tries to consume from all partitions fairly. If that's not the case, changing fetch.max.bytes or max.partition.fetch.bytes usually helps.
If you want to prioritize some partitions over others, you need to use pause() and resume() to manually control the consumption flow.
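As a small sketch of the tuning mentioned above, the two fetch limits can be set on the consumer's configuration like this. FetchTuning and fetchLimits are hypothetical names, and the byte values in the usage note are illustrative assumptions, not recommendations. Only java.util.Properties is used here; the resulting object would be merged with the rest of the consumer settings before constructing the consumer.

```java
import java.util.Properties;

// Hypothetical helper that builds only the two fetch-size settings discussed above.
final class FetchTuning {
    // totalBytes caps a whole fetch response; perPartitionBytes caps each partition's share.
    static Properties fetchLimits(int totalBytes, int perPartitionBytes) {
        Properties props = new Properties();
        props.setProperty("fetch.max.bytes", Integer.toString(totalBytes));
        props.setProperty("max.partition.fetch.bytes", Integer.toString(perPartitionBytes));
        return props;
    }
}
```

For example, FetchTuning.fetchLimits(4 * 1024 * 1024, 512 * 1024) keeps the per-partition cap at one eighth of the total, so a single response has room for data from about eight partitions.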

Kafka: Single consumer group in multiple instances

I am working on implementing a Kafka based solution to our application.
As per the Kafka documentation, what I understand is that one consumer in a consumer group (which is a thread) is internally mapped to one partition in the subscribed topic.
Let's say I have a topic with 40 partitions and I have a high level consumer running in 4 instances. I do not want one instance to consume the same messages consumed by another instance. But if one instance goes down, the other three instances should be able to process all the messages.
Should I go for the same consumer group with 10 threads per instance?
- Stackoverflow says same consumer group between the instances act as traditional synchronous queue mechanism
In Apache Kafka why can't there be more consumer instances than partitions?
Or should I go for a different consumer group per instance?
Using simple consumer or low level consumer gives control over the partition but then if one instance goes down, the other three instances would not process the messages from the partitions consumed in first instance
First to explain the concept of Consumers & Consumer Groups,
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group.
The records will be effectively load balanced over the consumer instances in a consumer group. If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
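The "fair share" idea can be pictured with a toy model. The sketch below simply deals partitions out round-robin to the current group members; ToyAssignor is a hypothetical name, and this is not the algorithm of Kafka's built-in assignors, only an illustration of what happens to the shares when membership changes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: deals partitions out round-robin. Kafka's real assignors are more involved.
final class ToyAssignor {
    static Map<String, List<Integer>> assign(int partitionCount, List<String> members) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String member : members) {
            assignment.put(member, new ArrayList<>());
        }
        if (members.isEmpty()) {
            return assignment; // nobody to assign to
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = members.get(partition % members.size());
            assignment.get(owner).add(partition); // each partition has exactly one owner
        }
        return assignment;
    }
}
```

In this model, 40 partitions over 4 members gives 10 partitions each. Re-running it with 3 members (one instance died) gives 14, 13 and 13, so every partition still has an owner.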
Now to answer your questions,
1. I do not want one instance to consume the same messages consumed by another instance. But if one instance goes down, the other three instances should be able to process all the messages.
This is possible by default in Kafka architecture. You just have to label all the 4 instances with the same consumer group name.
2. Should I go for the same consumer group with 10 threads per instance?
Doing this will assign each thread a Kafka partition from which it will consume data, which is optimal. Reducing the number of threads will load balance the record distribution among the consumer instances and MAY overload some of the consumer instances.
3. In Apache Kafka why can't there be more consumer instances than partitions?
In Kafka, within a consumer group a partition can be assigned to only one consumer instance. Thus, creating more consumer instances than partitions will lead to idle consumers that do not consume any records from Kafka.
4. Should I go for a different consumer group per instance?
No. This will lead to duplication of the records, as every record will be sent to all the instances, as they are from different consumer groups.
Hope this clarifies your doubts.
There are a few things to note when designing your Kafka ecosystem:
A consumer is essentially a thread, and you do not want multiple threads trying to change your offset mark. That's why the consumer system should be designed as one consumer, one thread.
Offset commits: there is a delicate balance in how frequently you perform offset commits. If the frequency is too high, it will have an adverse effect on the performance of your system (ZooKeeper will be the bottleneck). If the frequency is too low, you risk duplicate messages.
In Kafka you have both ways to do competing-consumers and publish-subscribe patterns:
competing consumers : this is possible by putting consumers inside the same consumer group, so that each partition is accessible by only one consumer (of course a consumer can read more than one partition). It means that you can't usefully have more consumers than partitions in a consumer group, because the extra consumers will be idle without being assigned any partition. Of course, if one consumer in the consumer group goes down, one of the idle consumers will take over its partition.
publish subscribe : if you have different consumer groups, all consumers in different consumer groups will receive same messages. Inside the consumer group then, the above pattern will be applied.

Two Kafka consumers causing odd behavior with one another

I have two consumers with different client ID's and group ID's. Aside from retention hour and max partitions, my Kafka installation contains default configuration. I've looked around to see if anyone else has had the same issue but can't pull up any results.
So the scenario goes like this:
Consumer A:
Connects to Kafka, consumes about 3 million messages that need to be consumed, and then sits idle waiting for more messages.
Consumer B:
Different client / group ID, connects to the same Kafka topic, and this causes consumer A to get a repeat of the 3 million messages while consumer B consumes them as well.
The two consumers are two completely different Java applications with different client and group ID's running on the same computer. The Kafka server is on another computer.
Is this a normal behavior in Kafka? I am at a complete loss.
Here is my consumer config:
bootstrap.servers=192.168.110.109:9092
acks=all
max.block.ms=2000
retries=0
batch.size=16384
auto.commit.interval.ms=1000
linger.ms=0
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
block.on.buffer.full=true
enable.auto.commit=false
auto.offset.reset=none
session.timeout.ms=30000
zookeeper.session.timeout=100000
rebalance.backoff.ms=8000
group.id=consumerGroupA
zookeeper.connect=192.168.110.109:2181
poll.interval=100
And the obvious difference in my consumer B is the group.id=consumerGroupB
This is correct behavior, because based on your configs your consumers don't commit the offsets of the records they have read!
When a consumer reads a record, it must commit that it has read it. You can ensure that consumers commit offsets automatically by setting enable.auto.commit=true, or you can commit each record manually. In this case I think auto commit is fine for you.
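To make the effect of committing concrete, here is a toy model of one partition and one group's committed offset. It is not Kafka client code; ToyGroupReader and runOnce are hypothetical names. Starting at offset 0 when nothing has been committed is an assumption of the model; with a real consumer the starting point in that case depends on auto.offset.reset.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one partition and one consumer group's committed offset; not Kafka client code.
final class ToyGroupReader {
    private final List<String> log;
    private int committed = 0; // next offset to read after a (re)start

    ToyGroupReader(List<String> log) {
        this.log = log;
    }

    // Simulates a (re)start: reads from the committed offset to the end of the log.
    // If commit is false the position is never saved, so the next start repeats the work.
    List<String> runOnce(boolean commit) {
        List<String> consumed = new ArrayList<>(log.subList(committed, log.size()));
        if (commit) {
            committed = log.size();
        }
        return consumed;
    }
}
```

Calling runOnce(false) twice returns the full log both times, while after a runOnce(true) the next call returns nothing until new records are appended.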

Effective strategy to avoid duplicate messages in apache kafka consumer

I have been studying apache kafka for a month now. I am however, stuck at a point now. My use case is, I have two or more consumer processes running on different machines. I ran a few tests in which I published 10,000 messages in kafka server. Then while processing these messages I killed one of the consumer processes and restarted it. Consumers were writing processed messages in a file. So after consumption finished, file was showing more than 10k messages. So some messages were duplicated.
In consumer process I have disabled auto commit. Consumers manually commit offsets batch wise. So for e.g if 100 messages are written to file, consumer commits offsets. When single consumer process is running and it crashes and recovers duplication is avoided in this manner. But when more than one consumers are running and one of them crashes and recovers, it writes duplicate messages to file.
Is there any effective strategy to avoid these duplicate messages?
The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even in order to attempt to prevent duplicates you would need to use the simple consumer. How this approach works is for each consumer, when a message is consumed from some partition, write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, read the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
There are some cases, though, where exactly-once processing is more attainable, but only for certain use cases. This simply requires that your offset be stored in the same location as your application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
Usually your time will be better spent and your application will be much more reliable if you simply design it to be idempotent.
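A minimal sketch of the counting example, made idempotent by keeping the last applied offset next to the count. OffsetTrackingCounter is a hypothetical name, and everything here is in memory; in a real system the count and the offsets would have to be persisted together in the same store for the guarantee to survive a crash.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical message counter that keeps the last applied offset next to the count.
final class OffsetTrackingCounter {
    private long count = 0;
    private final Map<Integer, Long> lastOffset = new HashMap<>(); // partition -> last applied offset

    // Applies one record; returns false (and changes nothing) if it was already applied.
    synchronized boolean apply(int partition, long offset) {
        long last = lastOffset.getOrDefault(partition, -1L);
        if (offset <= last) {
            return false; // redelivery after a restart or rebalance: ignore it
        }
        count++;
        lastOffset.put(partition, offset); // state and offset change together
        return true;
    }

    synchronized long count() {
        return count;
    }
}
```

Because a redelivered record is detected by its partition and offset, replaying a batch after a failure leaves the count unchanged.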
This is what Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
I agree with RaGe's suggestion to deduplicate on the consumer side. We use Redis to deduplicate Kafka messages.
Assume the Message class has a member called 'uniqId', which is filled in by the producer side and is guaranteed to be unique. We use a 12-character random string. (regexp is '^[A-Za-z0-9]{12}$')
The consumer side uses Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
Message msg = ... // eg. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // eg. JedisPool.getResource();
String key = "SPOUT:" + msg.uniqId; // prefix name at will
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val);
if (rsps <= 0) {
    log.warn("kafka dup: {}", msg.toJson()); // and other logic
} else {
    jedis.expire(key, 7200); // 2 hours is ok for production environment;
}
The above code did detect duplicate messages several times when Kafka (version 0.8.x) had incidents. According to our input/output balance audit log, no messages were lost or duplicated.
There's a relatively new 'Transactional API' now in Kafka that can allow you to achieve exactly once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the remainder of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once
Whatever is done on the producer side, we believe the best way to get exactly-once delivery from Kafka is still to handle it on the consumer side:
Produce the message into topic T1 with a UUID as the Kafka message key.
On the consumer side, read the message from T1 and write it to HBase with the UUID as the row key.
Read it back from HBase with the same row key and write it to another topic, T2.
Have your end consumers actually consume from topic T2.
