Specify number of partitions on Kafka producer - java

I have the following code from the web page https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example
What seems to be missing is how to configure the number of partitions. I want to specify 4 partitions but always end up with the default of 2 partitions. How do I change the code to have 4 partitions (without changing the default)?
Properties props = new Properties();
props.put("metadata.broker.list", "localhost:9092,broker2:9092");
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("partitioner.class", "com.gnip.kafka.SimplePartitioner");
props.put("request.required.acks", "1");
props.put("num.partitions", 4);
ProducerConfig config = new ProducerConfig(props);
Producer<String, String> producer = new Producer<String, String>(config);
Random rnd = new Random();
for (long nEvents = 0; nEvents < 1000; nEvents++) {
    long runtime = new Date().getTime();
    String ip = "192.168.2." + rnd.nextInt(255);
    String msg = runtime + ",www.example.com," + ip;
    KeyedMessage<String, String> data = new KeyedMessage<String, String>("page_visits2", ip, msg);
    producer.send(data);
}
producer.close();

The Kafka producer API does not let you set the number of partitions. If you produce to a topic that does not exist, the broker creates the topic first, provided auto.create.topics.enable is set to true in the broker configuration, and then starts accepting data for it. The number of partitions for a topic created this way is taken from the num.partitions parameter in the broker configuration files (by default it is set to one).
The partition count of an existing topic can be increased, but existing data will not be moved into the new partitions.
To create a topic with a different number of partitions, create the topic before producing to it, using the console script that ships with the Kafka distribution. The following command creates a topic with 2 partitions (as specified by the --partition flag):
bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 3 --partition 2 --topic my-custom-topic
Unfortunately, as far as I understand, there is currently no direct way to do this from the producer.

The number of partitions is a broker property; setting it in the producer configuration has no effect.
As the producer example page shows, you can use a custom partitioner to route messages as you prefer, but no new partitions will be created beyond those defined in the broker properties.
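To make the point about partitioners concrete, here is a minimal sketch modeled on the idea behind the wiki example's SimplePartitioner (last octet of the IP key, modulo the partition count). It deliberately avoids the Kafka client library; the class and method names are hypothetical. The partition count is an input that comes from the broker, so the partitioner can only choose among partitions that already exist.

```java
// Sketch only: imitates the routing idea of a custom partitioner
// without depending on the Kafka client library.
final class PartitionRoutingSketch {

    // Picks a partition from the last octet of an IP-address key.
    // numPartitions is whatever the topic already has on the broker;
    // the partitioner cannot make it larger.
    static int partitionFor(String ipKey, int numPartitions) {
        int lastDot = ipKey.lastIndexOf('.');
        if (lastDot < 0) {
            return 0; // no octet to route on; fall back to partition 0
        }
        int lastOctet = Integer.parseInt(ipKey.substring(lastDot + 1));
        return lastOctet % numPartitions;
    }

    public static void main(String[] args) {
        String key = "192.168.2.7";
        System.out.println(partitionFor(key, 2)); // topic with 2 partitions
        System.out.println(partitionFor(key, 4)); // same key, topic with 4 partitions
    }
}
```

The same key can land on a different partition once the topic has more partitions, which is one reason to settle the partition count when the topic is created.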

Related

Kafka consumer group, set offset to 0 when consumer group is created

I'm creating a new Kafka consumer like this in Java (some code omitted for brevity)
final Properties props = new Properties();
props.put(ConsumerConfig.GROUP_ID_CONFIG, "group2");
final Consumer<Long, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("topicname"));
This also creates the consumer group automatically if it does not yet exist. The problem is that the offset of this consumer group is not at the beginning of the topic, but at the end.
How can I ensure that the offset is at 0 when the group is created (but not otherwise)? I don't want to manually track the offset, just set it to 0 when creating the consumer if the consumer group does not yet exist.
If you don't specify a value for auto.offset.reset in the consumer config, it defaults to "latest".
Set it to "earliest" if you want to consume from offset 0. The setting only applies when the group has no committed offset, so an existing group continues from where it left off:
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

Why does kafka streams reprocess the messages produced after broker restart

I have a single node kafka broker and simple streams application. I created 2 topics (topic1 and topic2).
Produced on topic1 - processed message - write to topic2
Note: For each message produced only one message is written to destination topic
I produced a single message. After it was written to topic2, I stopped the Kafka broker. After some time I restarted the broker and produced another message on topic1. Now the streams app processed that message 3 times. Then, without stopping the broker, I produced messages to topic1 and waited for the streams app to write to topic2 before producing again.
The streams app is behaving strangely. Sometimes for one produced message there are 2 messages written to the destination topic and sometimes 3. I don't understand why this is happening. Even the messages produced after the broker restart are being duplicated.
Update 1:
I am using Kafka version 1.0.0 and Kafka-Streams version 1.1.0
Below is the code.
Main.java
String credentials = env.get("CREDENTIALS");
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "activity-collection");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.RECONNECT_BACKOFF_MS_CONFIG, 100000);
props.put(StreamsConfig.RECONNECT_BACKOFF_MAX_MS_CONFIG, 200000);
props.put(StreamsConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
props.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 60000);
props.put(StreamsConfig.producerPrefix(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG), true);
props.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");
final StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> activityStream = builder.stream("activity_contenturl");
KStream<String, String> activityResultStream = AppUtil.hitContentUrls(credentials , activityStream);
activityResultStream.to("o365_user_activity");
AppUtil.java
public static KStream<String, String> hitContentUrls(String credentials, KStream<String, String> activityStream) {
    KStream<String, String> activityResultStream = activityStream
        .flatMapValues(new ValueMapper<String, Iterable<String>>() {
            @Override
            public Iterable<String> apply(String value) {
                ArrayList<String> log = new ArrayList<String>();
                JSONObject received = new JSONObject(value);
                String url = received.get("url").toString();
                String accessToken = ServiceUtil.getAccessToken(credentials);
                JSONObject activityLog = ServiceUtil.getActivityLogs(url, accessToken);
                log.add(activityLog.toString());
                return log;
            }
        });
    return activityResultStream;
}
Update 2:
In a single broker and single partition environment with the above config, I started the Kafka broker and streams app. Produced 6 messages on source topic and when I started a consumer on destination topic there are 36 messages and counting. They keep on coming.
So I ran this to see consumer-groups:
kafka_2.11-1.1.0/bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --list
Output:
streams-collection-app-0
Next I ran this:
kafka_2.11-1.1.0/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group streams-collection-app-0
Output:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
o365_activity_contenturl 0 1 1 0 streams-collection-app-0-244b6f55-b6be-40c4-9160-00ea45bba645-StreamThread-1-consumer-3a2940c2-47ab-49a0-ba72-4e49d341daee /127.0.0.1 streams-collection-app-0-244b6f55-b6be-40c4-9160-00ea45bba645-StreamThread-1-consumer
After a while the output showed this:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
o365_activity_contenturl 0 1 6 5 - - -
And then:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
o365_activity_contenturl 0 1 7 6 - - -
It seems you are hitting a known limitation. By default a Kafka topic retains messages for at least 7 days, but committed offsets are retained for only 1 day (default config value offsets.retention.minutes = 1440). So if no messages are produced to your source topic for more than 1 day, all messages in the topic will be reprocessed after an app restart (possibly multiple times, depending on the number of restarts; at most once per day for such a topic with rare incoming messages).
A description of how committed offsets expire can be found in "How does an offset expire for consumer group".
In Kafka version 2.0 the retention for committed offsets was increased, see KIP-186: Increase offsets retention default to 7 days.
To prevent reprocessing, you could set the consumer property auto.offset.reset to latest (Kafka Streams defaults it to earliest).
There is a small risk with latest: if nobody produced a message to the source topic for longer than a day and you then restart the app, you could lose some messages (only the messages that arrive during the restart).
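The retention arithmetic in this answer can be sketched as follows. This is a rough model under stated assumptions, not the broker's actual expiration logic: it only compares the time since the last commit with the offsets retention window, and the class name, method name, and the two-day quiet period are hypothetical.

```java
import java.time.Duration;

// Rough model only: the broker's real offset expiration has more cases.
final class OffsetRetentionSketch {

    // Broker default before Kafka 2.0: offsets.retention.minutes = 1440.
    static final Duration RETENTION_BEFORE_2_0 = Duration.ofMinutes(1440);
    // Default after KIP-186 (Kafka 2.0): 7 days.
    static final Duration RETENTION_SINCE_2_0 = Duration.ofDays(7);

    // True when the last commit is older than the retention window, in which
    // case a restarted group would fall back to auto.offset.reset.
    static boolean committedOffsetExpired(Duration sinceLastCommit, Duration offsetsRetention) {
        return sinceLastCommit.compareTo(offsetsRetention) > 0;
    }

    public static void main(String[] args) {
        Duration quietPeriod = Duration.ofDays(2); // hypothetical idle time of the source topic
        System.out.println(committedOffsetExpired(quietPeriod, RETENTION_BEFORE_2_0));
        System.out.println(committedOffsetExpired(quietPeriod, RETENTION_SINCE_2_0));
    }
}
```

With the one-day window the two-day-old commit counts as expired, while with the seven-day window it does not, which is the difference KIP-186 makes for rarely written topics.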

Why is there a lag when consuming from Apache Kafka using Java KafkaConsumer

I hope someone can shed some light on my issue.
I'm making a series of REST calls to a micro-service I have built. When the service receives the calls, it persists some data to a database and then publishes a message to a Kafka topic.
I'm trying to write a test that makes the REST calls and then consumes the messages from the Kafka topic.
When my test tries to consume from Kafka, it doesn't appear to be consuming the latest messages.
I've used the kafka-consumer-groups.sh shell script which ships with Kafka to describe the state of my consumer. Here is what it looks like:
bash-4.3# ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group test
Note: This will only show information about consumers that use the Java consumer API (non-ZooKeeper-based consumers).
Consumer group 'test' has no active members.
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
test_topic 0 42 43 1 - - -
Note the lag of 1. This is what appears to be my issue.
Here is my Kafka consumer code:
public void consumeMessages() {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    try (KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props)) {
        kafkaConsumer.subscribe(Collections.singletonList("test_topic"));
        ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(5000);
        for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
            System.out.printf("offset = %d, message = %s%n", consumerRecord.offset(), consumerRecord.value());
        }
    }
}
Any help will be greatly received.
Thanks,
Ben :)
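For reference, the LAG column in the describe output is the log end offset minus the group's committed offset (43 - 42 = 1 in the output above). A minimal sketch of that arithmetic, assuming a data row in the whitespace-separated column order shown in the question; the class and method names are hypothetical and this is not part of any Kafka tool:

```java
// Sketch only: recomputes LAG from one data row of the describe output.
final class ConsumerLagSketch {

    // Expected column order:
    // TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
    static long lagFromDescribeRow(String row) {
        String[] cols = row.trim().split("\\s+");
        long currentOffset = Long.parseLong(cols[2]); // last committed offset of the group
        long logEndOffset = Long.parseLong(cols[3]);  // offset the next produced record will get
        return logEndOffset - currentOffset;
    }

    public static void main(String[] args) {
        System.out.println(lagFromDescribeRow("test_topic 0 42 43 1 - - -"));
    }
}
```

A lag of 1 therefore means one record sits in the partition beyond the offset the group last committed.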

Kafka one consumer one partition

I have a use case with a single topic that has 100 partitions. Messages are routed to each partition by some logic, and 100 consumers read these messages. I want to map a specific partition to a specific consumer. How can I achieve that?
Check out the Javadoc for KafkaConsumer, specifically the section "Manual Partition Assignment".
TL;DR
You can manually assign specific partitions to a consumer as follows:
String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));
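For the 100-consumer case, each consumer needs to know which partition numbers to pass to assign. One possible scheme, sketched here under the assumption that every consumer instance is given a stable index between 0 and consumerCount - 1 by the deployment (that index, the class, and the method name are hypothetical, not something Kafka provides):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a static partition plan computed outside of Kafka.
final class PartitionPlanSketch {

    // Returns the partition numbers owned by the consumer with the given index.
    // With as many consumers as partitions, each consumer owns exactly one.
    static List<Integer> partitionsFor(int consumerIndex, int consumerCount, int partitionCount) {
        if (consumerIndex < 0 || consumerIndex >= consumerCount) {
            throw new IllegalArgumentException("consumerIndex out of range: " + consumerIndex);
        }
        List<Integer> owned = new ArrayList<>();
        for (int p = consumerIndex; p < partitionCount; p += consumerCount) {
            owned.add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        System.out.println(partitionsFor(42, 100, 100)); // one partition per consumer
        System.out.println(partitionsFor(1, 2, 4));      // fewer consumers than partitions
    }
}
```

Each returned number would then be wrapped in a TopicPartition and handed to consumer.assign, as in the snippet above. With manual assignment there is no automatic rebalancing, so a partition whose consumer is down is not picked up by another one.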

Can a Kafka broker retain messages while there are no consumers connected?

I am trying to build a pub/sub application and I am exploring the best tools out there. I am currently looking at Kafka and have a little demo app already running. However, I am running into a conceptual issue.
I have a producer (Java code):
String topicName = "MyTopic";
String key = "MyKey";
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092,localhost:9093");
props.put("acks", "all");
props.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
Producer<String, byte[]> producer = new KafkaProducer <String, byte[]>(props);
byte[] data = <FROM ELSEWHERE>;
ProducerRecord<String, byte[]> record = new ProducerRecord<String, byte[]>(topicName, key, data);
try {
    RecordMetadata result = producer.send(record).get();
}
catch (Exception e) {
    // Nothing for now
}
producer.close();
When I start a consumer via the Kafka command line tools:
kafka-console-consumer --bootstrap-server localhost:9092 --topic MyTopic
and then I execute the producer code, I see the data message show up on my consumer terminal.
However, if I do not run the consumer prior to executing the producer, the message appears to be "lost". When I start the consumer (after executing the producer), nothing appears in the consumer terminal.
Does anyone know if it's possible to have the Kafka broker retain messages while there are no consumers connected? If so, how?
Append --from-beginning to the console consumer command to have it start consuming from the earliest offset. This is actually about the offset reset strategy which is controlled by config auto.offset.reset. Here is what this config means:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
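The policy described above can be sketched as a small decision function. This is a simplified model, not the consumer's actual implementation: it ignores the case where a committed offset has fallen out of range, and the class and method names are hypothetical.

```java
import java.util.OptionalLong;

// Simplified model of where a consumer group starts reading a partition.
final class OffsetResetSketch {

    // committed: the group's committed offset, if any.
    // policy: the value of auto.offset.reset.
    // beginningOffset / endOffset: current bounds of the partition's log.
    static long startingOffset(OptionalLong committed, String policy,
                               long beginningOffset, long endOffset) {
        if (committed.isPresent()) {
            return committed.getAsLong(); // the policy is not consulted
        }
        switch (policy) {
            case "earliest":
                return beginningOffset;
            case "latest":
                return endOffset;
            default: // "none" or anything else
                throw new IllegalStateException(
                        "no committed offset and auto.offset.reset is '" + policy + "'");
        }
    }

    public static void main(String[] args) {
        // One record was produced before the consumer first started:
        // the log spans offsets 0 (beginning) to 1 (end).
        System.out.println(startingOffset(OptionalLong.empty(), "latest", 0L, 1L));
        System.out.println(startingOffset(OptionalLong.empty(), "earliest", 0L, 1L));
    }
}
```

With latest, a new group starts after the record that was produced earlier, which is why the message in the question looks lost even though the broker still holds it.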
