One consumer multiple topics causes potential thread issue? - java

I have a REST service, let's call it MDD, which has one Kafka consumer. When I FIRST start the REST service and another service tells MDD's consumer to subscribe to a specific topic, everything seems to go fine.
Then the service tells MDD's consumer to subscribe to another topic. The way I am doing it right now is via the consumer.assign() method: if a new topic is introduced that the consumer is not yet assigned to, I assign the new topic to the consumer, so the one consumer is now assigned to 2 different topics.
This consumer polls the messages and deposits them into HDFS.
Now what I have noticed is that when the subscription for the 2nd topic comes in, I sometimes get an error about failing to append to a file in HDFS, and when I looked at the logs, it was trying to append data that should not have been appended until later on.
For example, data arrives in Kafka in the order A, B, C. When MDD is done appending A to HDFS, it tries to append C (rather than B) and simultaneously tries to append B as well. Another note: no data is coming from the first topic at this point; only data from the second topic is streaming in. So currently, only one Kafka topic has data streaming in at any given time.
Does anyone have any idea what could be going on? Is there potential for thread issues when I assign ONE consumer to multiple topics? Everything seems to go fine when the consumer is assigned to ONE topic, but as soon as it's assigned to more than ONE topic, I get a failure to append to the file in HDFS because some other writer already owns the lease. This error does not happen frequently, just very randomly.
Also, would a recommended fix be to create a new Kafka consumer every time a new topic is created?

It's definitely valid and doable to have one consumer read messages from multiple topics. The problem you ran into is that Kafka does not currently support mixing manual partition assignment (with KafkaConsumer#assign) and group management (with KafkaConsumer#subscribe) on the same consumer.
To pick up newly-created topics, you could instead invoke KafkaConsumer#subscribe with a regular expression that matches all the topics you care about, including ones created later.
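For example (a minimal sketch, assuming a recent kafka-clients version, a local broker, and an illustrative topic prefix "mdd-"; the group id is made up):

import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
props.put(ConsumerConfig.GROUP_ID_CONFIG, "mdd-consumer");            // subscribe() requires a group id
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
// Pattern subscription: topics matching "mdd-.*" (including ones created later)
// are picked up automatically on the next metadata refresh, no re-assign() needed.
consumer.subscribe(Pattern.compile("mdd-.*"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    records.forEach(r -> System.out.printf("%s-%d@%d%n", r.topic(), r.partition(), r.offset()));
}

Note that pattern subscription uses group management, so the same consumer must not also call assign().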

Related

Can I send a message to different topics in Kafka at runtime depending on the load?

I have 5 defined topics. My specific question: is there any way in the code to know whether a Kafka topic is free or still loaded, so that I can balance the load between topics?
In the producer I call .send(Topic1, object), but if the topic I am sending to is busy or already has load, how can I detect that so I can switch to .send(Topic2, object) with a conditional?
I do not know whether this can be done, or whether there is some other way to find this out. Currently I plan to use ListenableFuture with future.addCallback to know when the send has completed and then reassign the topic, but I do not see that as viable.
Topics don't have load. Brokers do.
A broker can be the leader for multiple partitions of different topics, which clients cannot control.
Therefore, you cannot guarantee that sending data to a new topic (rather, set of partitions) will have less system/network load than another.
Besides that, if you start sending data to other topics, you lose ordering guarantees in both the produced data and the consumer groups of any downstream systems.

How to solve the queue multi-consuming concurrency problem?

Our program is using Queue.
Multiple consumers are processing messages.
Consumers do the following:
Receive an on/off status message from the Queue.
Get the latest status from the repository.
Compare the state of the repository and the state received from the message.
If the on/off status is different, update the data. (At this time, other related data are also updated.)
Assuming that this process is handled by multiple consumers, the following problems are expected.
Producer sends messages 1: on, 2: off, and 3: on.
Consumer A receives message #1 and stores message #1 in the storage because there is no latest data.
Consumer A receives message #2.
At this time, consumer B receives message #3 at the same time.
Consumers A and B read the latest data from the storage at the same time (message 1).
Consumer B finishes processing first. It does not update the repository because the on/off state is unchanged (1: on, 3: on).
Then consumer A finishes processing. The on/off state has changed, so it processes and saves the update (1: on, 2: off).
In normal case, the latest data remaining in the DB should be on.
(This is because the message was sent in the order of on -> off -> on.)
However, according to the above scenario, off remains the latest data.
Is there any good way to solve this problem?
For reference, the queue we use is AWS Amazon MQ, the storage is AWS DynamoDB, and we are using Spring Boot.
The fundamental problem here is that you need to consume these "status" messages in order, but you're using concurrent consumers which leads to race-conditions and out-of-order message processing. In short, your basic architecture using concurrent consumers is causing this problem.
You could possibly work up some kind of solution in the database with timestamps as suggested in the comments, but that would be extra work for the clients and extra data stored in the database that isn't strictly necessary.
The simplest way to solve the problem is to just consume the messages serially rather than concurrently. There are a handful of different ways to do this, e.g.:
Define just 1 consumer for the queue with the "status" messages.
Use ActiveMQ's "exclusive consumer" feature to ensure that only one consumer receives messages.
Use message groups to group all the "status" messages together to ensure they are processed serially (i.e. in order).
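For the message-group option, a minimal sketch with Spring's JmsTemplate (the queue name, class name, and grouping key are illustrative assumptions; ActiveMQ delivers all messages sharing a JMSXGroupID value to a single consumer, in order):

import org.springframework.jms.core.JmsTemplate;

public class StatusPublisher {
    private final JmsTemplate jmsTemplate;

    public StatusPublisher(JmsTemplate jmsTemplate) {
        this.jmsTemplate = jmsTemplate;
    }

    // Any stable key (e.g. a device id) works as the group id; messages with the
    // same JMSXGroupID are pinned to one consumer and therefore processed serially.
    public void publishStatus(String groupKey, String statusPayload) {
        jmsTemplate.convertAndSend("status-queue", statusPayload, message -> {
            message.setStringProperty("JMSXGroupID", groupKey);
            return message;
        });
    }
}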

Find number of records in a kafka topic without consuming those records

I am working on writing a test where, before asserting on some things, I need to wait until there are 2 records in the topic. I then want to get these 2 records and do assertions on them. I can't seem to find a way to find the number of records in a topic without actually consuming them.
Here is a brief of what I am trying to do. I am listening for an event published to topic "x". After consuming this event I publish a new event "abc" to the same topic "x". Now I want to do some assertions based on the published event "abc". In the test I have a separate consumer subscribed to "x", so I should wait to assert things until I know there are 2 events for that consumer to consume.
Messages are not consumed per se. They are just delivered to every subscribed consumer group once but stay in the topic for future use. So you can "consume" them with a dedicated consumer group and perform your tests without worrying about other consumers.
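A rough sketch of that idea in a test, assuming a local broker and String serdes; the group id "test-assertions" is made up so it does not collide with the application's own consumer group:

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-assertions");          // dedicated test group
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("x"));
    List<ConsumerRecord<String, String>> received = new ArrayList<>();
    long deadline = System.currentTimeMillis() + 10_000;   // give up after 10 s
    while (received.size() < 2 && System.currentTimeMillis() < deadline) {
        consumer.poll(Duration.ofMillis(200)).forEach(received::add);
    }
    // assert on received.get(0) and received.get(1) here
}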

Kafka consumer returns empty on second call

I insert a record with the Kafka producer and then call the consumer, which returns the inserted element (and previously inserted elements). Then I call the consumer again (without inserting a new record with the producer) and the consumer does not return any records.
As far as I know, the record should remain in the topic. I have no idea how to set acknowledge to false in the properties. Is this issue related to acknowledgment?
The consumer will only get "new" messages. If you have previously read until the end of the topic and there is nothing new, you won't get anything.
If you want to read from the beginning of the topic again, you have to "rewind" the consumer (or create a new one).
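A minimal sketch of the rewind, assuming an already-subscribed KafkaConsumer named consumer:

// Make sure partitions have actually been assigned before seeking.
while (consumer.assignment().isEmpty()) {
    consumer.poll(Duration.ofMillis(100));   // triggers the group rebalance / assignment
}
consumer.seekToBeginning(consumer.assignment());          // rewind every assigned partition
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1)); // re-reads from the start

Alternatively, a fresh consumer in a new consumer group with auto.offset.reset=earliest will also start reading from the beginning.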
I have no idea how to set acknowledge to false in properties (is this issue related to acknowledgment?)
If you use consumer groups, the offsets for the consumer (how far it has read in each topic partition) are stored within Kafka (or ZooKeeper for older versions). You can control this by acknowledging (or not) the receipt of messages. However, this only has an effect when the consumer is restarted, not for an already running instance.
If you don't use consumer groups, this offset tracking is purely done within the consumer instance itself.

Effective strategy to avoid duplicate messages in apache kafka consumer

I have been studying Apache Kafka for a month now. However, I am stuck at a point. My use case is: I have two or more consumer processes running on different machines. I ran a few tests in which I published 10,000 messages to the Kafka server. Then, while processing these messages, I killed one of the consumer processes and restarted it. The consumers were writing processed messages to a file. So after consumption finished, the file showed more than 10k messages; some messages were duplicated.
In the consumer process I have disabled auto commit. The consumers manually commit offsets batch-wise, e.g. after 100 messages are written to the file, the consumer commits the offsets. When a single consumer process is running and it crashes and recovers, duplication is avoided in this manner. But when more than one consumer is running and one of them crashes and recovers, it writes duplicate messages to the file.
Is there any effective strategy to avoid these duplicate messages?
The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even in order to attempt to prevent duplicates you would need to use the simple consumer. How this approach works is for each consumer, when a message is consumed from some partition, write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, read the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
There are some cases, though, where exactly-once processing is more attainable, but only for certain use cases. This simply requires that your offset be stored in the same location as your application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
Usually your time will be better spent and your application will be much more reliable if you simply design it to be idempotent.
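As a rough illustration of the "store the offset next to the output" idea for the counting example above, here is a hedged JDBC sketch; the message_counts table, its columns, and the DataSource wiring are invented for illustration:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;
import org.apache.kafka.clients.consumer.ConsumerRecord;

class CountingSink {
    private final DataSource dataSource;   // assumed to point at the counting database

    CountingSink(DataSource dataSource) { this.dataSource = dataSource; }

    // Hypothetical table: message_counts(topic, kafka_partition, last_offset, cnt).
    // The "last_offset < ?" guard makes the update idempotent: replaying a record
    // that was already counted changes nothing.
    void count(ConsumerRecord<String, String> record) throws SQLException {
        String sql = "UPDATE message_counts SET cnt = cnt + 1, last_offset = ? "
                   + "WHERE topic = ? AND kafka_partition = ? AND last_offset < ?";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, record.offset());
            ps.setString(2, record.topic());
            ps.setInt(3, record.partition());
            ps.setLong(4, record.offset());
            ps.executeUpdate();   // 0 rows updated means this offset was already counted
        }
    }
}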
This is what Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
I agree with RaGe about deduplicating on the consumer side. We use Redis to deduplicate Kafka messages.
Assume the Message class has a member called 'uniqId', which is filled in on the producer side and is guaranteed to be unique. We use a 12-character random string (regexp '^[A-Za-z0-9]{12}$').
The consumer side uses Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
Message msg = ... // eg. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // eg. JedisPool.getResource();
String key = "SPOUT:" + msg.uniqId; // prefix name at will
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val);
if (rsps <= 0) {
    log.warn("kafka dup: {}", msg.toJson()); // and other logic
} else {
    jedis.expire(key, 7200); // 2 hours is ok for production environment;
}
The above code did detect duplicate messages several times when Kafka (version 0.8.x) had problems. With our input/output balance audit log, no messages were lost or duplicated.
There's a relatively new 'Transactional API' now in Kafka that can allow you to achieve exactly once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the remainder of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once
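A rough consume-transform-produce sketch with that API (kafka-clients 2.5+ for groupMetadata()); the topic names, the uppercase "processing" step, and the pre-built producer/consumer are placeholders, with the producer assumed to have transactional.id set and the consumer assumed to use isolation.level=read_committed and enable.auto.commit=false:

producer.initTransactions();
consumer.subscribe(Collections.singletonList("input"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    if (records.isEmpty()) continue;

    producer.beginTransaction();
    try {
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> record : records) {
            producer.send(new ProducerRecord<>("output", record.key(), record.value().toUpperCase()));
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
        }
        // The consumed offsets are committed in the same transaction as the produced
        // records, so they either both become visible or neither does.
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();
    } catch (KafkaException e) {
        producer.abortTransaction();   // records and offsets from this batch are discarded together
    }
}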
Whatever is done on the producer side, we still believe the best way to deliver exactly once from Kafka is to handle it on the consumer side (a rough sketch follows the list):
Produce the message with a UUID as the Kafka message key into topic T1.
On the consumer side, read the message from T1 and write it to HBase with the UUID as the row key.
Read it back from HBase with the same row key and write it to another topic T2.
Have your end consumers actually consume from topic T2.
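A minimal sketch of steps 2 and 3 with the HBase client API; the Table handle, column family "d", qualifier "payload", and the producer are assumptions for illustration:

// Step 2: persist the message keyed by its uuid (duplicates land on the same row).
byte[] rowKey = Bytes.toBytes(uuid);
Put put = new Put(rowKey);
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(messageJson));
table.put(put);

// Step 3: read back what actually landed in HBase and forward it to T2.
Result result = table.get(new Get(rowKey));
byte[] stored = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));
producer.send(new ProducerRecord<>("T2", uuid, Bytes.toString(stored)));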
