I have the Kafka consumer code below, where 3 threads are reading from a Kafka topic that has 3 partitions.
Is there any way to make new messages be read from the topic only after the messages currently being processed by the threads have finished?
For example, say there are 100 messages in the topic: is there a way to read and process only 3 messages at a time, and only read the next 3 once those 3 have been processed, and so on?
public void run(int a_numThreads) {
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(a_numThreads));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
// now launch all the threads
//
executor = Executors.newFixedThreadPool(3);
// now create an object to consume the messages
//
int threadNumber = 0;
for (final KafkaStream stream : streams) {
executor.submit(new ConsumerTest(stream, threadNumber));
threadNumber++;
}
}
If the iterator inside ConsumerTest processes each message synchronously, then only 3 messages will be consumed at a time. enable.auto.commit is true by default; make sure you do not set it to false, or you will need to add your own logic for committing offsets.
Example:
ConsumerIterator<byte[], byte[]> streamIterator = stream.iterator();
while (streamIterator.hasNext()) {
    String kafkaMsg = new String(streamIterator.next().message());
    // process kafkaMsg here, before hasNext() is called again
}
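For reference, a minimal sketch of what such a ConsumerTest could look like (the class and constructor shape are assumed from the question's executor.submit(new ConsumerTest(stream, threadNumber)) call, and process() is a placeholder):
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;

public class ConsumerTest implements Runnable {

    private final KafkaStream<byte[], byte[]> stream;
    private final int threadNumber;

    public ConsumerTest(KafkaStream<byte[], byte[]> stream, int threadNumber) {
        this.stream = stream;
        this.threadNumber = threadNumber;
    }

    @Override
    public void run() {
        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        while (it.hasNext()) {
            String kafkaMsg = new String(it.next().message());
            // Because processing happens inline, this thread does not ask the
            // stream for the next message until the current one is fully handled.
            process(kafkaMsg);
        }
    }

    private void process(String msg) {
        // placeholder for the real business logic
        System.out.println("Thread " + threadNumber + " processed: " + msg);
    }
}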
Well, the consumers do not know about each other by default, so they cannot "sync" their work. What you could do is either wrap your three messages into one (thus guaranteeing they will all be handled together) or maybe introduce more ("sub") topics.
Another possibility (if you really need to guarantee that your three messages will be consumed by individual consumers) might be to have all your consumers sync their work, or to notify a controller which tracks the work. A sketch of that idea is below.
But tbh it feels like you "are doing it wrong": messages in a queue are meant to be stateless, and only their order in a topic determines the order in which they should be processed. WHEN the messages are processed should not matter.
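If you really do want the three threads to move in lockstep (take 3 messages, finish all 3, then take the next 3), one way to "sync their work" is a shared barrier from java.util.concurrent. This is only an illustrative sketch on top of the question's streams/executor variables; process() is a placeholder:
// Illustrative only: three stream threads that each handle one message and then
// wait for the other two before any of them pulls the next message.
// Caveat: if one stream stops receiving messages, the other threads block at the
// barrier, so this is a sketch, not production code.
final CyclicBarrier barrier = new CyclicBarrier(3);
for (final KafkaStream stream : streams) {
    executor.submit(new Runnable() {
        public void run() {
            ConsumerIterator<byte[], byte[]> it = stream.iterator();
            while (it.hasNext()) {
                process(new String(it.next().message())); // placeholder processing
                try {
                    barrier.await(); // wait until all 3 threads finished their message
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    });
}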
I have a map (HashMap<String, Map<String, String>> mapTest) over which I have a loop that does several operations.
HashMap<String, Map<String, String>> mapTest = new HashMap<>();
ArrayList<Object> testFinals = new ArrayList<>();
for (Map.Entry<String, Map<String, String>> entry : mapTest.entrySet()) {
// in here I do a lot of things, like another for loops, if's, etc.
//the final juice to get from here is that in each time the loop is executed I have this:
List<Object> resultExp = methodXYZ(str1, str2, str3); // methodXYZ takes three String arguments
testFinals.addAll(resultExp);
}
- At this point I have to wait before I proceed, since I need testFinals to be completely filled before advancing.
Now, what I need to do is:
1 - This mapTest can have around 400 entries to iterate over.
2 - I want to schedule 4 threads and assign roughly 100 entries of that for loop to thread 1, the next 100 entries of mapTest to thread 2, and so on.
I already tried a few solutions, like this one:
ExecutorService taskExecutor = Executors.newFixedThreadPool(4);
while(...) {
taskExecutor.execute(new MyTask());
}
taskExecutor.shutdown();
try {
taskExecutor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
} catch (InterruptedException e) {
...
}
but I can't adapt this, or a similar working solution, to what I have now with that map iteration.
HashMap is not a thread-safe data structure.
When using it concurrently, consider that the threads must obtain, hold and relinquish locks on the variable.
This is done at field level, not per entry.
In short: the HashMap gets locked for access by a specific thread, not just some random entry.
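One rough sketch for the original problem, assuming methodXYZ is thread safe and mapTest is not modified while the tasks run (java.util and java.util.concurrent imports assumed, argument names are placeholders): split the entry set into chunks, let each task build its own partial list, and merge the partial lists on the calling thread once every task has finished.
// Sketch: split the entries into 4 chunks, process each chunk in its own task,
// then merge the partial results; invokeAll blocks until all tasks are done.
static List<Object> processInParallel(Map<String, Map<String, String>> mapTest)
        throws InterruptedException, ExecutionException {
    List<Map.Entry<String, Map<String, String>>> entries = new ArrayList<>(mapTest.entrySet());
    int chunkSize = Math.max(1, (entries.size() + 3) / 4);

    List<Callable<List<Object>>> tasks = new ArrayList<>();
    for (int from = 0; from < entries.size(); from += chunkSize) {
        final List<Map.Entry<String, Map<String, String>>> chunk =
                entries.subList(from, Math.min(from + chunkSize, entries.size()));
        tasks.add(() -> {
            List<Object> partial = new ArrayList<>();
            for (Map.Entry<String, Map<String, String>> entry : chunk) {
                // placeholder call: whatever methodXYZ really takes goes here
                partial.addAll(methodXYZ(entry.getKey(), "argB", "argC"));
            }
            return partial;
        });
    }

    ExecutorService taskExecutor = Executors.newFixedThreadPool(4);
    List<Object> testFinals = new ArrayList<>();
    try {
        for (Future<List<Object>> future : taskExecutor.invokeAll(tasks)) {
            testFinals.addAll(future.get()); // only the calling thread touches testFinals
        }
    } finally {
        taskExecutor.shutdown();
    }
    return testFinals;
}
Because each task writes only to its own partial list and the merge happens on one thread after invokeAll returns, no shared mutable collection needs locking.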
I query the database many times; even though I have cached some of the results, it still takes a long time.
List<Map<Long, Node>> aNodeMapList = new ArrayList<>();
Map<String, List<Map<String, Object>>> cacheRingMap = new ConcurrentHashMap<>();
for (Ring startRing : startRings) {
for (Ring endRing : endRings) {
Map<String, Object> nodeMapResult = getNodeMapResult(startRing, endRing, cacheRingMap);
Map<Long, Node> nodeMap = (Map<Long, Node>) nodeMapResult.get("nodeMap");
if (nodeMap.size() > 0) {
aNodeMapList.add(nodeMap);
}
}
}
getNodeMapResult is a function that queries the database according to startRing and endRing, and caches the result in cacheRingMap, so the next time it may not need to query the database if the result already exists in cacheRingMap.
My leader tells me that multithreading can be used here, so I changed it to use ExecutorCompletionService. But now I have a question: is this thread safe when I use ConcurrentHashMap to cache results inside the ExecutorCompletionService tasks?
Will it run faster after this change?
int totalThreadCount = startRings.size() * endRings.size();
ExecutorService threadPool2 = Executors.newFixedThreadPool(totalThreadCount > 4 ? 4 : 2);
CompletionService<Map<String, Object>> completionService = new ExecutorCompletionService<Map<String, Object>>(threadPool2);
for (Ring startRing : startRings) {
for (Ring endRing : endRings) {
completionService.submit(new Callable<Map<String, Object>>() {
@Override
public Map<String, Object> call() throws Exception {
return getNodeMapResult(startRing, endRing, cacheRingMap);
}
});
}
}
for (int i = 0; i < totalThreadCount; i++) {
Map<String, Object> nodeMapResult = completionService.take().get();
Map<Long, Node> nodeMap = (Map<Long, Node>) nodeMapResult.get("nodeMap");
if (nodeMap.size() > 0) {
aNodeMapList.add(nodeMap);
}
}
Is this thread safe when I use ConcurrentHashMap to cache results in ExecutorCompletionService?
The ConcurrentHashMap itself is thread safe, as its name suggests ("Concurrent"). However, that doesn't mean that the code that uses it is thread safe.
For instance, if your code does the following:
SomeObject object = cacheRingMap.get(someKey); //get from cache
if (object == null){ //oh-oh, cache miss
object = getObjectFromDb(someKey); //get from the db
cacheRingMap.put(someKey, object); //put in cache for next time
}
Since the get and put aren't performed atomically in this example, two threads executing this code could end up both looking for the same key first in the cache, and then in the db. It's still thread-safe, but we performed two db lookups instead of just one. But this is just a simple example, more complex caching logic (say one that includes cache invalidation and removals from the cache map) can end up being not just wasteful, but actually incorrect. It all depends on how the map is used and what guarantees you need from it. I suggest you read the ConcurrentHashMap javadoc. See what it can guarantee, and what it cannot.
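For the simple check-then-act pattern above, ConcurrentHashMap.computeIfAbsent makes the lookup-and-populate step atomic per key, so concurrent threads asking for the same missing key trigger only one database call. This reuses the placeholder names from the example above:
// Sketch: atomic per-key cache population. The mapping function runs at most once
// per key for concurrent callers; keep it short and do not touch the map inside it.
SomeObject object = cacheRingMap.computeIfAbsent(someKey, key -> getObjectFromDb(key));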
Will it run fast after I change?
That depends on too many parameters to know in advance. How would the database handle the concurrent queries? How many queries are there? How fast is a single query? Etc. The best way of knowing is to actually try it out.
As a side note, if you're looking for ways to improve performance, you might want to try using a batch query. The flow would then be to search the cache for all the keys you need, gather the keys you still need to look up, and then send them all together in a single query to the database. In many cases, a single large query will run faster than a bunch of smaller ones.
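A rough sketch of that batch flow, where allKeysNeeded and findNodesByKeys are hypothetical names for "the keys you need" and "one big database query":
// Sketch of the batch flow: collect cache misses, fetch them in one query, cache them.
Set<String> missingKeys = new HashSet<>();
for (String key : allKeysNeeded) {
    if (!cacheRingMap.containsKey(key)) {
        missingKeys.add(key);
    }
}
if (!missingKeys.isEmpty()) {
    // hypothetical batch lookup, e.g. a single SELECT ... WHERE key IN (...)
    Map<String, List<Map<String, Object>>> fetched = findNodesByKeys(missingKeys);
    cacheRingMap.putAll(fetched);
}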
Also, you should check whether concurrent lookups in the map are faster than single threaded ones in your case. Perhaps parallelizing only the query itself, and not the cache lookup could yield better results in your case.
I tried to do something like the following:
public List<SourceRecord> poll() throws InterruptedException {
List<SourceRecord> records = new ArrayList<>();
JSONArray jsonRecords = getRecords(0, 3);
for (Object jsonRecord: jsonRecords) {
JSONObject j = new JSONObject(jsonRecord.toString());
Map<String, String> sourceOffset = Collections.singletonMap("block", j.get("block").toString());
Object value = j.get("data").toString();
records.add(new SourceRecord(
Collections.singletonMap("samesourcepartition", "samesourcepartition"), // sourcePartition
sourceOffset, // sourceOffset
"mytopic", // topic
Schema.STRING_SCHEMA, // keySchema
j.get("block").toString(), // key: "0", "1", "2", "3"
Schema.STRING_SCHEMA, // valueSchema
value // value
));
log.info("added record for block: " + j.get("block"));
}
log.info("Returning {} records", records.size());
return records;
}
I am confused about how to use the sourceOffset (https://docs.confluent.io/current/connect/devguide.html#task-example-source-task).
An example of block could be "3". I would expect that if Kafka has already read this sourceOffset, it should not read it again. But it seems to completely ignore this: the offsets continue to grow well past 3 and the same 0-3 data keeps repeating in an infinite loop. For example, if I look at the Confluent dashboard > Topics > Inspect, I would expect the highest offset and key recorded to be "3", but it's over 100+ with duplicated keys and values.
Does my poll() need to increment the 0->3 itself so it knows when to "stop"? The current behavior keeps repeating 0->3, 0->3, ... adding new SourceRecords, but I imagined that with the sourceOffset and the unique key this would be idempotent.
I am sure I am misunderstanding something. I also tried turning on log compaction, but I still get duplicates even with the same key. Can someone show the proper usage so that there is one message per sourceOffset/key?
I have a use case where I want to assign a consumer to a partition and have no other consumer consume that partition. I know this sounds easy using a consumer group and the subscribe method, but what I eventually want is that no rebalance happens if one of the consumers goes down. The other challenge is that these consumers will be running on multiple machines.
Let me give the complete scenario:
I have a topic T and T has 4 partitions. There are two machines, M1 and M2, each running two threads that consume from the 4 partitions; I can assign each partition using the assign function in Kafka.
Challenges
When the consumers are brought up, I don't want to hard-code the partition assignment. Instead I use KafkaAdminClient to find information about the topic and the consumer group these consumers belong to, and programmatically find out which partitions are assigned and which are not. This is very hard in a multi-machine setup, because the applications running on both machines query the same consumer group and get the same information. Say machine M1 assigns partition 0 to its thread0, and at the same time machine M2 also assigns partition 0 to its thread0: now I have a problem, both threads are consuming data from the same partition. This is the scenario that I fail to address. Below is some KafkaAdminClient code that I use to gather information about the consumer group and topic. Any ideas are highly appreciated.
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.KafkaFuture;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.*;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;
public class KafkaConsumerAdmin {
private final AdminClient adminClient;
public KafkaConsumerAdmin(Properties adminProperties) {
this.adminClient = AdminClient.create(adminProperties);
}
public List<TopicPartitionInfo> getTopicPartitionInfo(String topicName)
throws ExecutionException, InterruptedException {
DescribeTopicsResult describeTopicsResult = this.adminClient.describeTopics(
Collections.singletonList(topicName)
);
KafkaFuture<Map<String, TopicDescription>> futureOfDescribeTopicResult = describeTopicsResult.all();
Map<String, TopicDescription> topicDescriptionMap = futureOfDescribeTopicResult.get();
TopicDescription topicDescription = topicDescriptionMap.get(topicName);
return topicDescription.partitions();
}
public ConsumerGroupDescription getConsumerGroupInfo(String consumerGroupID)
throws ExecutionException, InterruptedException {
DescribeConsumerGroupsResult describeConsumerGroupsResult =
this.adminClient.describeConsumerGroups(Collections.singletonList(consumerGroupID));
KafkaFuture<Map<String, ConsumerGroupDescription>> futureOfDescribeConsumerGroupsResult =
describeConsumerGroupsResult.all();
Map<String, ConsumerGroupDescription> consumerGroupDescriptionMap = futureOfDescribeConsumerGroupsResult.get();
return consumerGroupDescriptionMap.get(consumerGroupID);
}
public Map<String, Set<TopicPartition>> getConsumerGroupMemberInfo(ConsumerGroupDescription consumerGroupDescription) {
Map<String, Set<TopicPartition>> memberToTopicPartitionMap = new HashMap<>();
for (MemberDescription memberDescription : consumerGroupDescription.members()) {
MemberAssignment memberAssignment = memberDescription.assignment();
Set<TopicPartition> topicPartitions = memberAssignment.topicPartitions();
memberToTopicPartitionMap.put(memberDescription.consumerId(), topicPartitions);
}
return memberToTopicPartitionMap;
}
public List<Integer> getAvailablePartitions(List<TopicPartitionInfo> topicPartitionInfoList,
Map<String, Set<TopicPartition>> memberToTopicPartitionMap) {
Map<Integer, List<String>> partitionToMemberMap = new HashMap<>();
topicPartitionInfoList.forEach(x -> partitionToMemberMap.put(x.partition(), new LinkedList<>()));
for (Map.Entry<String, Set<TopicPartition>> entry : memberToTopicPartitionMap.entrySet()) {
String memberID = entry.getKey();
for (TopicPartition topicPartition : entry.getValue()) {
partitionToMemberMap.get(topicPartition.partition()).add(memberID);
}
}
return partitionToMemberMap.entrySet().stream().
filter(x -> x.getValue().isEmpty()).map(Map.Entry::getKey).collect(Collectors.toList());
}
Map<TopicPartition, Long> getConsumerGroupOffsetInfo(String consumerGroupID)
throws ExecutionException, InterruptedException {
ListConsumerGroupOffsetsResult listConsumerGroupOffsetsResult =
this.adminClient.listConsumerGroupOffsets(consumerGroupID);
KafkaFuture<Map<TopicPartition, OffsetAndMetadata>> futureOfConsumerGroupOffsetResult =
listConsumerGroupOffsetsResult.partitionsToOffsetAndMetadata();
Map<TopicPartition, OffsetAndMetadata> consumerGroupOffsetInfo = futureOfConsumerGroupOffsetResult.get();
return consumerGroupOffsetInfo.entrySet().stream().
collect(Collectors.toMap(Map.Entry::getKey, x -> x.getValue().offset()));
}
}
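For completeness, once a free partition has been picked, the manual assignment itself is just KafkaConsumer.assign. The snippet below is a minimal sketch (standard org.apache.kafka.clients.consumer imports assumed; the topic name, partition number, running flag and process() are placeholders). Note that assign() bypasses the group rebalance protocol, so the cross-machine coordination the question describes still has to happen somewhere else, e.g. a shared lock or a database.
// Sketch: manually assigning a single partition to this consumer instead of subscribing.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProperties);
TopicPartition partition = new TopicPartition("T", 0); // placeholder topic/partition
consumer.assign(Collections.singletonList(partition));
while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // placeholder processing
    }
}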
I am writing a Kafka consumer for a high-volume, high-velocity distributed application. I have only one topic, but the rate of incoming messages is very high. Having multiple partitions that serve more consumers would be appropriate for this use case, and the best way to consume is to have multiple stream readers. As per the documentation and available samples, the number of KafkaStreams the ConsumerConnector gives out is based on the number of topics. I'm wondering how to get more than one KafkaStream reader (based on the partitions), so that I can spawn one thread per stream, or whether reading from the same KafkaStream in multiple threads would give concurrent reads from multiple partitions.
Any insights are much appreciated.
Would like to share what I found from mailing list:
The number that you pass in the topic map controls how many streams a topic is divided into. In your case, if you pass in 1, all 10 partitions' data will be fed into 1 stream. If you pass in 2, each of the 2 streams will get data from 5 partitions. If you pass in 11, 10 of them will each get data from 1 partition and 1 stream will get nothing.
Typically, you need to iterate each stream in its own thread. This is because each stream can block forever if there is no new event.
Sample snippet:
topicCount.put(msgTopic, new Integer(partitionCount));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams = connector.createMessageStreams(topicCount);
List<KafkaStream<byte[], byte[]>> streams = consumerStreams.get(msgTopic);
for (final KafkaStream stream : streams) {
ReadTask task = new ReadTask(stream, msgTopic);
task.addObserver(this.msgObserver);
tasks.add(task); executor.submit(task);
}
Reference: http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201201.mbox/%3CCA+sHyy_Z903dOmnjp7_yYR_aE2sRW-x7XpAnqkmWaP66GOqf6w#mail.gmail.com%3E
The recommended way to do this is to use a thread pool so Java handles the organisation for you, and to consume each of the streams that the createMessageStreamsByFilter method gives you in its own Runnable. For example:
int NUMBER_OF_PARTITIONS = 6;
Properties consumerConfig = new Properties();
consumerConfig.put("zk.connect", "zookeeper.mydomain.com:2181" );
consumerConfig.put("backoff.increment.ms", "100");
consumerConfig.put("autooffset.reset", "largest");
consumerConfig.put("groupid", "java-consumer-example");
consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerConfig));
TopicFilter sourceTopicFilter = new Whitelist("mytopic|myothertopic");
List<KafkaStream<Message>> streams = consumer.createMessageStreamsByFilter(sourceTopicFilter, NUMBER_OF_PARTITIONS);
ExecutorService executor = Executors.newFixedThreadPool(streams.size());
for(final KafkaStream<Message> stream: streams){
executor.submit(new Runnable() {
public void run() {
for (MessageAndMetadata<Message> msgAndMetadata: stream) {
ByteBuffer buffer = msgAndMetadata.message().payload();
byte [] bytes = new byte[buffer.remaining()];
buffer.get(bytes);
//Do something with the bytes you just got off Kafka.
}
}
});
}
In this example I asked for 6 threads basically because I know that I have 3 partitions for each topic and I listed two topics in my whitelist. Once we have the handles of the incoming streams, we can iterate over their content, which are MessageAndMetadata objects. The metadata is really just the topic name and offset. As you discovered, you can do it in a single thread if you ask for 1 stream instead of, in my example, 6, but if you require parallel processing the nice way is to launch an executor with one thread for each returned stream.
/**
 * @param source : source KStream to sink to output-topic
 */
private static void pipe(KStream<String, String> source) {
source.to(Serdes.String(), Serdes.String(), new StreamPartitioner<String, String>() {
@Override
public Integer partition(String arg0, String arg1, int arg2) {
return 0;
}
}, "output-topic");
}
The above code will write every record to partition 0 (the first partition) of the topic "output-topic", since the custom StreamPartitioner always returns 0.
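For context, a hedged sketch of how pipe(...) could be wired up, assuming the same pre-1.0 Streams API (KStreamBuilder) that the snippet's to(Serde, Serde, StreamPartitioner, String) overload comes from; the application id, bootstrap servers and input topic are placeholders:
// Sketch: build a topology that reads "input-topic" and pipes everything into
// partition 0 of "output-topic" via the StreamPartitioner shown above.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pipe-to-one-partition");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> source = builder.stream(Serdes.String(), Serdes.String(), "input-topic");
pipe(source);

KafkaStreams streams = new KafkaStreams(builder, new StreamsConfig(props));
streams.start();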