I have written code to push data to a Kafka topic on a daily basis, but there are a few issues I am not sure this code can handle. My responsibility is to push the complete data from a live table which holds one day of data (refreshed every morning).
My code runs "select * from mytable" and pushes the rows one by one to the Kafka topic, because I need to validate/alter each row before pushing it.
Below is my producer send code.
Properties configProperties = new Properties();
configProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, sBOOTSTRAP_SERVERS_CONFIG);
configProperties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
configProperties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
configProperties.put("acks", "all");
configProperties.put("retries", 0);
configProperties.put("batch.size", 15000);
configProperties.put("linger.ms", 1);
configProperties.put("buffer.memory", 30000000);

@SuppressWarnings("resource")
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(configProperties);
System.out.println("Starting Kafka producer job " + new Date());

producer.send(new ProducerRecord<String, String>(eventName, jsonRec.toString()), new Callback() {
    public void onCompletion(RecordMetadata metadata, Exception e) {
        if (e != null) {
            e.printStackTrace();
        }
    }
});
Now I am not sure how to push data back into the topic in case of failure, since I selected all the records from the table, a few of them failed to be pushed, and I do not know which ones.
Below is what I want to address:
How can I process only those records which were not pushed, so that I avoid pushing duplicate records (avoid redundancy)?
How can I validate that the records pushed are exactly the same as in the table, i.e. data integrity - for example the size of the data and the count of records pushed?
You can use configProperties.put("enable.idempotence", true); - the producer will retry failed sends but make sure there is just one copy of each record saved in Kafka. Note that it requires retries > 0, acks=all and max.in.flight.requests.per.connection <= 5. For details check https://kafka.apache.org/documentation/.
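A minimal sketch of such a configuration, on top of the bootstrap and serializer settings you already have (note that it replaces your current retries=0; the concrete values below are just examples):
// idempotence requires acks=all, retries > 0 and max.in.flight.requests.per.connection <= 5
configProperties.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
configProperties.put(ProducerConfig.ACKS_CONFIG, "all");
configProperties.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
configProperties.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);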
For the 2nd question - if you mean that you need to save all records or none, then you have to use Kafka transactions, which brings up a lot more questions; I would recommend reading https://www.confluent.io/blog/transactions-apache-kafka/
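If you do go down the transactional route, the producer side roughly takes the shape sketched below (the transactional.id value and the recordsFromTable collection are placeholders, and error handling is simplified):
configProperties.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "daily-table-load");
KafkaProducer<String, String> producer = new KafkaProducer<>(configProperties);
producer.initTransactions();
try {
    producer.beginTransaction();
    for (String jsonRec : recordsFromTable) {   // placeholder for your validated rows
        producer.send(new ProducerRecord<>(eventName, jsonRec));
    }
    producer.commitTransaction();   // all records become visible atomically
} catch (KafkaException e) {
    producer.abortTransaction();    // none of the records are exposed to read_committed consumers
    throw e;
}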
Related
I want to create a single Kafka consumer for several topics. The consumer API allows me to pass a list of topics to the subscription, like this:
private Consumer<String, byte[]> createConsumer() {
    Properties props = getConsumerProps();
    Consumer<String, byte[]> consumer = new KafkaConsumer<>(props);
    ArrayList<String> topicMISL = new ArrayList<>();
    for (String s : Connect2Redshift.kafkaTopics) {
        topicMISL.add(systemID + "." + s);
    }
    consumer.subscribe(topicMISL);
    return consumer;
}
private boolean consumeMessages(Duration duration, Consumer<String, byte[]> consumer) {
    try {
        long start = System.currentTimeMillis();
        ConsumerRecords<String, byte[]> consumerRecords = consumer.poll(duration);
        // ... process consumerRecords ...
        return !consumerRecords.isEmpty();
    } catch (KafkaException e) {
        return false;
    }
}
Afterwards I want to poll records from Kafka into a stream every 3 seconds and process them, but I wonder what happens inside this consumer: how will records from different topics be polled - first one topic, then another, or in parallel? Could it happen that one topic with a large amount of messages is processed all the time while another topic with a small amount of messages has to wait?
In general it depends on your topic settings. Kafka scales by using multiple partitions per topic.
If you have 3 partitions on 1 topic, Kafka can read from them in parallel.
The same is true for multiple topics: reading can happen in parallel.
If you have a partition that receives a lot more messages than the others, you may run into consumer lag for that particular partition. Tweaking the batch size and consumer settings may help here, as may compressing messages.
Ideally, making sure the load is distributed evenly avoids this scenario.
Look into this blog article, it gave me a good understanding of the internals: https://www.confluent.io/blog/configure-kafka-to-minimize-latency/
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(3000));
for (TopicPartition partition : records.partitions()) {
    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
    for (ConsumerRecord<String, String> record : partitionRecords) {
        // process each record of this partition
    }
}
You also need to commit the offsets: track the offset of the last processed record and commit it using consumer.commitSync.
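For example, inside the outer loop above, after processing a partition's records, you could commit per partition; a minimal sketch reusing partition and partitionRecords from that loop:
long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
// the committed offset should point to the next record to be read for this partition
consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));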
I have a Kafka topic with multiple partitions, and I wonder if there is a way in Java to fetch the last message of the topic. I don't care about the partitions, I just want to get the latest message.
I have tried @KafkaListener, but it only fetches messages when the topic is updated. If nothing is published after the application starts, nothing is returned.
Maybe the listener is not the right approach to the problem at all?
The following snippet worked for me; you may try it. Explanation is in the comments.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
consumer.subscribe(Collections.singletonList(topic));
consumer.poll(Duration.ofSeconds(10));
consumer.assignment().forEach(System.out::println);

AtomicLong maxTimestamp = new AtomicLong();
AtomicReference<ConsumerRecord<String, String>> latestRecord = new AtomicReference<>();

// get the last offsets for each partition
consumer.endOffsets(consumer.assignment()).forEach((topicPartition, offset) -> {
    System.out.println("offset: " + offset);

    // seek to the last offset of each partition
    consumer.seek(topicPartition, (offset == 0) ? offset : offset - 1);

    // poll to get the last record in each partition
    consumer.poll(Duration.ofSeconds(10)).forEach(record -> {
        // the latest record in the 'topic' is the one with the highest timestamp
        if (record.timestamp() > maxTimestamp.get()) {
            maxTimestamp.set(record.timestamp());
            latestRecord.set(record);
        }
    });
});

System.out.println(latestRecord.get());
You'll have to consume the latest message from each partition and then do a comparison on the client side (using the timestamp on the message, if it contains it). The reason for this is that Kafka does not guarantee inter-partition ordering. Inside a partition, you can be sure that the message with the largest offset is the latest message pushed to it.
I want to verify the partition size before producing the record in Kafka.
I have a custom partitioner class which gives me the exact partition number into which my message is supposed to go.
Now my requirement is that I want to check the size of that partition before sending my record.
List<String> users = userService.findAllUsers();
for (String user : users) {
    String msg = "Hello " + user;
    // Check size here
    producer.send(new ProducerRecord<String, String>(topic, user, msg), new Callback() {
        public void onCompletion(RecordMetadata metadata, Exception e) {
            if (e != null) {
                e.printStackTrace();
            }
        }
    });
}
Is there any way in Kafka I can achieve this?
Capacity is purely a Kafka-broker level aspect.
Basically, if there is disk space in the broker's data directory, you can deliver the message. Messages are cleaned up by the broker based on time and partition size (that's broker configuration), so if you configure your broker accordingly you might always have space - the old messages simply get thrown away. That might not suit your business use case, though.
Also, responding to your comment "Can we check size of Topic": you can check the current size of a partition (in number of records) by using the beginningOffsets and endOffsets methods of KafkaConsumer. Beware that these methods might block if the partitions do not exist (at least in 0.10.2), e.g. when you request data for partition 4 while the topic actually contains only 3 partitions.
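A minimal sketch of that offsets-based check, assuming you already have a KafkaConsumer and get the target partition number from your custom partitioner:
TopicPartition tp = new TopicPartition(topic, targetPartition);   // targetPartition comes from your partitioner
long begin = consumer.beginningOffsets(Collections.singleton(tp)).get(tp);
long end = consumer.endOffsets(Collections.singleton(tp)).get(tp);
long recordsInPartition = end - begin;   // number of records currently retained in that partition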
Kafka 0.11 introduces administrative capabilities in the client (the AdminClient), however at the time of writing it is still a work in progress.
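On more recent client versions the AdminClient can answer at least part of this; a sketch of looking up a topic's partition count (the topic name and adminProps are placeholders):
try (AdminClient admin = AdminClient.create(adminProps)) {
    TopicDescription desc = admin.describeTopics(Collections.singleton("mytopic"))
            .all().get().get("mytopic");   // blocks until the metadata arrives
    System.out.println("partitions: " + desc.partitions().size());
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
}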
According to the documentation on kafka javadocs if I:
Subscribe to a pattern
Create a topic that matches the pattern
A rebalance should occur, which makes the consumer read from that new topic. But that's not happening.
If I stop and start the consumer, it does pick up the new topic. So I know the new topic matches the pattern. There's a possible duplicate of this question in https://stackoverflow.com/questions/37120537/whitelist-filter-in-kafka-doesnt-pick-up-new-topics but that question got nowhere.
I'm seeing the kafka logs and there are no errors, it just doesn't trigger a rebalance. The rebalance is triggered when consumers join or die, but not when new topics are created (not even when partitions are added to existing topics, but that's another subject).
I'm using kafka 0.10.0.0, and the official Java client for the "New Consumer API", meaning broker GroupCoordinator instead of fat client + zookeeper.
This is the code for the sample consumer:
public class SampleConsumer {
    public static void main(String[] args) throws IOException {
        KafkaConsumer<String, String> consumer;
        try (InputStream props = Resources.getResource("consumer.props").openStream()) {
            Properties properties = new Properties();
            properties.load(props);
            properties.setProperty("group.id", "my-group");
            System.out.println(properties.get("group.id"));
            consumer = new KafkaConsumer<>(properties);
        }
        Pattern pattern = Pattern.compile("mytopic.+");
        consumer.subscribe(pattern, new SampleRebalanceListener());
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s %s\n", record.topic(), record.value());
            }
        }
    }
}
In the producer, I'm sending messages to topics named mytopic1, mytopic2, etc.
Patterns are pretty much useless if the rebalance is not triggered.
Do you know why the rebalance is not happening?
The documentation mentions "The pattern matching will be done periodically against topics existing at the time of check." It turns out that "periodically" corresponds to the metadata.max.age.ms property. By setting that property (inside "consumer.props" in my code sample) to e.g. 5000, I can see that it detects new topics and partitions every 5 seconds.
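For reference, the same setting expressed in code instead of the props file (5000 ms is just the example value from above):
properties.setProperty(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "5000");   // refresh metadata (and re-check the pattern) every 5 seconds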
This is as designed, according to this jira ticket https://issues.apache.org/jira/browse/KAFKA-3854:
The final note on the JIRA stating that a later created topic that matches a consumer's subscription pattern would not be assigned to the consumer upon creation seems to be as designed. A repeat subscribe() to the same pattern would be needed to handle that case.
The refresh metadata polling does the "repeat subscribe()" mentioned in the ticket.
This is confusing coming from Kafka 0.8, where there was true triggering based on ZooKeeper watches instead of polling. IMO 0.9 is more of a downgrade for this scenario: instead of "just in time" rebalancing, you get either high-frequency polling with overhead, or low-frequency polling with long delays before it reacts to new topics/partitions.
To trigger a rebalance immediately, you can explicitly make a poll call after subscribing to the topic:
kafkaConsumer.poll(pollDuration);
refer to:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-568%3A+Explicit+rebalance+triggering+on+the+Consumer
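Note that KIP-568 itself adds an explicit API for this; on client versions that include it (2.4+), a minimal sketch would be:
kafkaConsumer.enforceRebalance();   // asks the consumer to rejoin the group on the next poll
kafkaConsumer.poll(pollDuration);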
In your consumer code, use the following:
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
and try again.
I have written a simple program to read data from Kafka and print it in Flink. Below is the code.
public static void main(String[] args) throws Exception {
    Options flinkPipelineOptions = PipelineOptionsFactory.create().as(Options.class);
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    Class<?> unmodColl = Class.forName("java.util.Collections$UnmodifiableCollection");
    env.getConfig().addDefaultKryoSerializer(unmodColl, UnmodifiableCollectionsSerializer.class);
    env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);

    flinkPipelineOptions.setJobName("MyFlinkTest");
    flinkPipelineOptions.setStreaming(true);
    flinkPipelineOptions.setCheckpointingInterval(1000L);
    flinkPipelineOptions.setNumberOfExecutionRetries(5);
    flinkPipelineOptions.setExecutionRetryDelay(3000L);

    Properties p = new Properties();
    p.setProperty("zookeeper.connect", "localhost:2181");
    p.setProperty("bootstrap.servers", "localhost:9092");
    p.setProperty("group.id", "test");

    FlinkKafkaConsumer09<Notification> kafkaConsumer = new FlinkKafkaConsumer09<>("testFlink", new ProtoDeserializer(), p);
    DataStream<Notification> input = env.addSource(kafkaConsumer);

    input.rebalance().map(new MapFunction<Notification, String>() {
        @Override
        public String map(Notification value) throws Exception {
            return "Kafka and Flink says: " + value.toString();
        }
    }).print();

    env.execute();
}
I need Flink to process my data from Kafka exactly once, and I have a few questions on how that can be done.
When does FlinkKafkaConsumer09 commit the processed offsets to Kafka?
Say my topic has 10 messages and the consumer processes all 10 of them. When I stop the job and start it again, it starts reprocessing random messages from the set of previously read messages. I need to ensure none of my messages are processed twice.
Please advise. I appreciate all the help. Thanks.
This page describes the fault tolerance guarantees of the Flink Kafka connector.
You can use Flink's savepoints to re-start a job in an exactly-once (state preserving) manner.
The reason why you are seeing the messages again is because the offsets committed by Flink to the Kafka broker / Zookeeper are not in line with Flink's registered state.
You'll always see messages processed multiple times after a restore/failure in Flink, even with exactly-once semantics enabled. The exactly-once guarantees in Flink are with respect to registered state, not the records sent to the operators.
Slightly off-topic: What are these lines for? They are not passed to Flink anywhere.
Options flinkPipelineOptions = PipelineOptionsFactory.create().as(Options.class);
flinkPipelineOptions.setJobName("MyFlinkTest");
flinkPipelineOptions.setStreaming(true);
flinkPipelineOptions.setCheckpointingInterval(1000L);
flinkPipelineOptions.setNumberOfExecutionRetries(5);
flinkPipelineOptions.setExecutionRetryDelay(3000L);