I have a use case where I write to a Kafka topic in batches using a Spark job (no streaming). Suppose I initially push 10 records to the Kafka topic and run the Spark job, which does some processing and finally writes to another Kafka topic.
The next time, when I push another 5 records and run the Spark job, my requirement is to process only these 5 records, not everything from the starting offset. I need to maintain the committed offset so that the Spark job resumes from the next offset position and does the processing from there.
Here is the code on the Kafka side to fetch the offsets:
private static List<TopicPartition> getPartitions(KafkaConsumer consumer, String topic) {
    List<PartitionInfo> partitionInfoList = consumer.partitionsFor(topic);
    return partitionInfoList.stream().map(x -> new TopicPartition(topic, x.partition())).collect(Collectors.toList());
}

public static void getOffSet(KafkaConsumer consumer) {
    List<TopicPartition> topicPartitions = getPartitions(consumer, topic);
    consumer.assign(topicPartitions);
    consumer.seekToBeginning(topicPartitions);
    topicPartitions.forEach(x -> {
        System.out.println("Partition-> " + x + " startingOffSet-> " + consumer.position(x));
    });
    consumer.assign(topicPartitions);
    consumer.seekToEnd(topicPartitions);
    topicPartitions.forEach(x -> {
        System.out.println("Partition-> " + x + " endingOffSet-> " + consumer.position(x));
    });
    topicPartitions.forEach(x -> {
        consumer.poll(1000);
        OffsetAndMetadata offsetAndMetadata = consumer.committed(x);
        long position = consumer.position(x);
        System.out.printf("Committed: %s, current position %s%n",
                offsetAndMetadata == null ? null : offsetAndMetadata.offset(), position);
    });
}
Below is the Spark code that loads the messages from the topic, which is not working as expected:
Dataset<Row> kafkaDataset = session.read().format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", topic)
        .option("group.id", "test-consumer-group")
        .option("startingOffsets", "{\"Topic1\":{\"0\":2}}")
        .option("endingOffsets", "{\"Topic1\":{\"0\":3}}")
        .option("enable.auto.commit", "true")
        .load();
After the above code executes, I again try to get the offsets by calling getOffSet(consumer) on the topic; it always reads from offset 0, and the committed offset that was fetched initially keeps on increasing. I am new to Kafka and still figuring out how to handle such a scenario. Please help here.
Initially I had 10 records in my topic; I then published another 2 records, and here is the output:
Output after getOffSet executes:
Partition-> Topic00-0 startingOffSet-> 0
Partition-> Topic00-0 endingOffSet-> 12
Committed: 12, current position 12
Output after the Spark code executes to load the messages:
Partition-> Topic00-0 startingOffSet-> 0
Partition-> Topic00-0 endingOffSet-> 12
Committed: 12, current position 12
I see no difference. Please take a look and suggest a resolution for this scenario.
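For anyone comparing notes on this setup: as far as I can tell, the batch (spark.read) Kafka source does not use group.id for offset tracking and never commits offsets back to Kafka, so enable.auto.commit has no effect there; the resume position has to be tracked by the application itself. Below is a minimal sketch of that idea, reusing a plain KafkaConsumer to read and advance the tracked offset. It assumes a single partition 0 on Topic1, and the variable names (consumer, session) are only illustrative:

// Sketch only: resume the batch read from the offset this job recorded last time.
TopicPartition tp = new TopicPartition("Topic1", 0);
OffsetAndMetadata committed = consumer.committed(tp);
long start = (committed == null) ? 0L : committed.offset();   // nothing recorded yet -> start from 0

consumer.assign(Collections.singletonList(tp));
consumer.seekToEnd(Collections.singletonList(tp));
long end = consumer.position(tp);                              // read everything produced so far

Dataset<Row> batch = session.read().format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "Topic1")
        .option("startingOffsets", "{\"Topic1\":{\"0\":" + start + "}}")
        .option("endingOffsets", "{\"Topic1\":{\"0\":" + end + "}}")
        .load();

// ... process the batch and write to the output topic ...

// Record progress for the next run; the Spark batch source will not do this for you.
consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(end)));

Storing the offsets in Kafka via commitSync is just one option; any external store works, as long as the read and the commit use the same numbers.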
Related
To limit the batch size when using Spark Streaming, I referenced this answer.
There are about 50 million records piling up (about to be consumed) in Kafka.
The topic has 3 partitions.
TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG       CONSUMER-ID  HOST  CLIENT-ID
zhihu_comment  0          10906153        28668062        17761909  -            -     -
zhihu_comment  1          10972464        30271728        19299264  -            -     -
zhihu_comment  2          10906395        28662007        17755612  -            -     -
My consumer app:
public final class SparkConsumer {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        String brokers = "device1:9092,device2:9092,device3:9092";
        String groupId = "spark";
        String topics = "zhihu_comment";

        // Create context with a certain seconds batch interval
        SparkConf sparkConf = new SparkConf().setAppName("TestKafkaStreaming");
        sparkConf.set("spark.streaming.backpressure.enabled", "true");
        sparkConf.set("spark.streaming.backpressure.initialRate", "10000");
        sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "10000");
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(10));

        Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        kafkaParams.put("enable.auto.commit", true);
        kafkaParams.put("max.poll.records", "500");

        // Create direct kafka stream with brokers and topics
        JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(topicsSet, kafkaParams));

        // Get the lines, split them into words, count the words and print
        JavaDStream<String> lines = messages.map(ConsumerRecord::value);
        lines.count().print();

        jssc.start();
        jssc.awaitTermination();
    }
}
I have limited Spark Streaming's consumption rate by setting maxRatePerPartition to 10000, which with 3 partitions and a 10-second batch interval means 10000 x 3 x 10 = 300,000 records per batch.
The problem is that although Spark Streaming processes records at this limited rate, the current offset shown by Kafka is not the offset Spark Streaming is actually working on: the consumer group's current offset immediately jumps to the latest offset (so the lag collapses):
TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                      HOST            CLIENT-ID
zhihu_comment  0          28700537        28700676        139  consumer-1-ddcb0abd-e206-470d-925a-63ca4dc1d62a  /192.168.0.102  consumer-1
zhihu_comment  1          30305102        30305224        122  consumer-1-ddcb0abd-e206-470d-925a-63ca4dc1d62a  /192.168.0.102  consumer-1
zhihu_comment  2          28695033        28695146        113  consumer-1-ddcb0abd-e206-470d-925a-63ca4dc1d62a  /192.168.0.102  consumer-1
It appears that Spark Streaming does not commit the offsets for each batch; it commits the latest offset up front, when it starts consuming!
Is there any way to make Spark Streaming commit with each batch?
Spark Streaming log, showing the number of records it consumed in each batch:
20/05/04 22:28:13 INFO scheduler.DAGScheduler: Job 15 finished: print at SparkConsumer.java:65, took 0.012606 s
-------------------------------------------
Time: 1588602490000 ms
-------------------------------------------
300000
20/05/04 22:28:13 INFO scheduler.JobScheduler: Finished job streaming job 1588602490000 ms.0 from job set of time 1588602490000 ms
You need to disable auto-commit:
kafkaParams.put("enable.auto.commit", false);
and instead commit the offsets yourself once each batch has been processed:
messages.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // do your transformations and actions on the rdd here, typically something like:
    rdd.foreachPartition(it -> {
        it.forEachRemaining(record -> {
            // ... handle each ConsumerRecord ...
        });
    });
    // commit the offsets of this batch once it has been processed
    ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
});
as described in the Spark + Kafka Integration Guide.
You could also use commitSync for synchronous commits.
I set up a Kafka consumer with this configuration:
kafkaconfig:
acks: 1
autoCommit: true
bootstrapServers: example.com:9092
topic: item
groupId: EWok-group
keyDeserializer: org.apache.kafka.common.serialization.StringDeserializer
valueDeserializer: org.apache.kafka.common.serialization.StringDeserializer
maxPollRecords: 1
pollMillisTime: 15
retries: 5
heartBeatInterval: 300
sessionTimeout: 100000
maxPollInterval: 30000
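For reference, here is a rough sketch of how the consumer-related keys above would map onto the standard Kafka ConsumerConfig properties (the YAML binding is an assumption on my part; acks and retries are producer settings, so they are left out):

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "example.com:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "EWok-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");     // autoCommit
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1");          // maxPollRecords
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "300");   // heartBeatInterval
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "100000");   // sessionTimeout
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "30000");  // maxPollInterval
KafkaConsumer<String, String> eWokIntegrationConsumer = new KafkaConsumer<>(props);
eWokIntegrationConsumer.subscribe(Collections.singletonList("item"));

Note that enable.auto.commit=true together with the manual commitSync() in the loop below means offsets get committed twice; that is harmless, but usually you pick one or the other.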
Code:

while (true) {
    try {
        ConsumerRecords<String, String> consumerRecords =
                eWokIntegrationConsumer.poll(Duration.of(kafkaCommConfig.getPollMillisTime(), ChronoUnit.SECONDS));
        if (!consumerRecords.isEmpty()) {
            LOG.info("Consumed Record Count: {}", consumerRecords.count());
            consumerRecords.forEach(record -> {
                System.out.printf("offset = %d, key = %s, value = %s\n", record.offset(), record.key(), record.value());
                eWokMessageProcessor.onMessage(record.value());
                eWokIntegrationConsumer.commitSync();
            });
        } else {
            LOG.info("Polling returned without any records.");
        }
    } catch (Exception exception) {
        LOG.error("Consumer was interrupted. But still continue to poll. Exception:", exception);
        eWokIntegrationConsumer.close();
    }
}
Processing the data we receive from the Kafka consumer takes about 10000 ms, but I am getting an exception saying:
java.lang.IllegalStateException: This consumer has already been closed.
Exception Logs
java.lang.IllegalStateException: This consumer has already been closed.
at org.apache.kafka.clients.consumer.KafkaConsumer.acquireAndEnsureOpen(KafkaConsumer.java:2202)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1332)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1298)
Kafka version : kafka-clients-2.0.1
Could anyone please suggest what the Kafka consumer configuration should look like?
I had put System.exit(0) in another place in the source code. That is why the consumer had left the group and was marked as closed.
I have removed System.exit(0) from the source code. Now it's working fine.
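Independently of the System.exit(0) issue, note that the catch block in the loop above closes the consumer and then goes straight back to poll(), which would raise the same IllegalStateException on the next iteration anyway. Here is a sketch of a more defensive shape (same names as in the question, only the structure is rearranged, so treat it as an illustration rather than a drop-in fix):

try {
    while (true) {
        ConsumerRecords<String, String> consumerRecords =
                eWokIntegrationConsumer.poll(Duration.of(kafkaCommConfig.getPollMillisTime(), ChronoUnit.SECONDS));
        if (consumerRecords.isEmpty()) {
            LOG.info("Polling returned without any records.");
            continue;
        }
        LOG.info("Consumed Record Count: {}", consumerRecords.count());
        consumerRecords.forEach(record -> eWokMessageProcessor.onMessage(record.value()));
        // commit once per poll instead of once per record
        eWokIntegrationConsumer.commitSync();
    }
} catch (Exception exception) {
    LOG.error("Consumer failed, stopping the poll loop.", exception);
} finally {
    // close exactly once, when the loop is really done
    eWokIntegrationConsumer.close();
}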
I'm using the Kafka JDK client ver 0.10.2.1. I am able to produce simple messages to Kafka for a "heartbeat" test, but I cannot consume a message from that same topic using the SDK. I am able to consume that message when I go into the Kafka CLI, so I have confirmed the message is there. Here's the function I'm using to consume from my Kafka server, with the props; I pass the message I produced to the topic only after I have confirmed the produce() was successful, and I can post that function later if requested:
private def consumeFromKafka(topic: String, expectedMessage: String): Boolean = {
  val props: Properties = initProps("consumer")
  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(List(topic).asJava)
  var readExpectedRecord = false
  try {
    val records = {
      val firstPollRecs = consumer.poll(MAX_POLLTIME_MS)
      // increase timeout and try again if nothing comes back the first time in case system is busy
      if (firstPollRecs.count() == 0) firstPollRecs else {
        logger.info("KafkaHeartBeat: First poll had 0 records- trying again - doubling timeout to "
          + (MAX_POLLTIME_MS * 2) / 1000 + " sec.")
        consumer.poll(MAX_POLLTIME_MS * 2)
      }
    }
    records.forEach(rec => {
      if (rec.value() == expectedMessage) readExpectedRecord = true
    })
  } catch {
    case e: Throwable => // log error
  } finally {
    consumer.close()
  }
  readExpectedRecord
}

private def initProps(propsType: String): Properties = {
  val prop = new Properties()
  prop.put("bootstrap.servers", kafkaServer + ":" + kafkaPort)
  propsType match {
    case "producer" => {
      prop.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      prop.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      prop.put("acks", "1")
      prop.put("producer.type", "sync")
      prop.put("retries", "3")
      prop.put("linger.ms", "5")
    }
    case "consumer" => {
      prop.put("group.id", groupId)
      prop.put("enable.auto.commit", "false")
      prop.put("auto.commit.interval.ms", "1000")
      prop.put("session.timeout.ms", "30000")
      prop.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      prop.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      prop.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
      // poll just once, should only be one record for the heartbeat
      prop.put("max.poll.records", "1")
    }
  }
  prop
}
Now when I run the code, here's what it outputs in the console:
13:04:21 - Discovered coordinator serverName:9092 (id: 2147483647 rack: null) for group 0b8947e1-eb68-4af3-ac7b-be3f7c02e76e.
13:04:23 INFO o.a.k.c.c.i.ConsumerCoordinator - Revoking previously assigned partitions [] for group 0b8947e1-eb68-4af3-ac7b-be3f7c02e76e
13:04:24 INFO o.a.k.c.c.i.AbstractCoordinator - (Re-)joining group 0b8947e1-eb68-4af3-ac7b-be3f7c02e76e
13:04:25 INFO o.a.k.c.c.i.AbstractCoordinator - Successfully joined group 0b8947e1-eb68-4af3-ac7b-be3f7c02e76e with generation 1
13:04:26 INFO o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [HeartBeat_Topic.Service_5.2018-08-03.13_04_10.377-0] for group 0b8947e1-eb68-4af3-ac7b-be3f7c02e76e
13:04:27 INFO c.p.p.l.util.KafkaHeartBeatUtil - KafkaHeartBeat: First poll had 0 records- trying again - doubling timeout to 60 sec.
And then nothing else, no errors thrown, so no records are polled. Does anyone have any idea what's preventing the 'consume' from happening? The subscribe seems to be successful, as I'm able to successfully call listTopics and list partitions no problem.
Your code has a bug. It seems your line:
if (firstPollRecs.count() == 0)
should instead say:
if (firstPollRecs.count() > 0)
Otherwise, when the first poll returns nothing, you keep the empty firstPollRecs and then iterate over that, which obviously returns nothing.
I'm running a Spark application (Spark 1.6.3 cluster), which does some calculations on 2 small data sets, and writes the result into an S3 Parquet file.
Here is my code:
public void doWork(JavaSparkContext sc, Date writeStartDate, Date writeEndDate, String[] extraArgs) throws Exception {
    SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    S3Client s3Client = new S3Client(ConfigTestingUtils.getBasicAWSCredentials());

    boolean clearOutputBeforeSaving = false;
    if (extraArgs != null && extraArgs.length > 0) {
        if (extraArgs[0].equals("clearOutput")) {
            clearOutputBeforeSaving = true;
        } else {
            logger.warn("Unknown param " + extraArgs[0]);
        }
    }

    Date currRunDate = new Date(writeStartDate.getTime());
    while (currRunDate.getTime() < writeEndDate.getTime()) {
        try {
            SparkReader<FirstData> sparkReader = new SparkReader<>(sc);
            JavaRDD<FirstData> data1 = sparkReader.readDataPoints(
                    inputDir,
                    currRunDate,
                    getMinOfEndDateAndNextDay(currRunDate, writeEndDate));
            // Normalize to 1 hours & 0.25 degrees
            JavaRDD<FirstData> distinctData1 = data1.distinct();
            // Floor all (distinct) values to 6 hour windows
            JavaRDD<FirstData> basicData1BySixHours = distinctData1.map(d1 -> new FirstData(
                    d1.getId(),
                    TimeUtils.floorTimePerSixHourWindow(d1.getTimeStamp()),
                    d1.getLatitude(),
                    d1.getLongitude()));
            // Convert Data1 to Dataframes
            DataFrame data1DF = sqlContext.createDataFrame(basicData1BySixHours, FirstData.class);
            data1DF.registerTempTable("data1");
            // Read Data2 DataFrame
            String currDateString = TimeUtils.getSimpleDailyStringFromDate(currRunDate);
            String inputS3Path = basedirInput + "/dt=" + currDateString;
            DataFrame data2DF = sqlContext.read().parquet(inputS3Path);
            data2DF.registerTempTable("data2");
            // Join data1 and data2
            DataFrame mergedDataDF = sqlContext.sql("SELECT D1.Id,D2.beaufort,COUNT(1) AS hours " +
                    "FROM data1 as D1,data2 as D2 " +
                    "WHERE D1.latitude=D2.latitude AND D1.longitude=D2.longitude AND D1.timeStamp=D2.dataTimestamp " +
                    "GROUP BY D1.Id,D1.timeStamp,D1.longitude,D1.latitude,D2.beaufort");
            // Create histogram per ID
            JavaPairRDD<String, Iterable<Row>> mergedDataRows = mergedDataDF.toJavaRDD().groupBy(md -> md.getAs("Id"));
            JavaRDD<MergedHistogram> mergedHistogram = mergedDataRows.map(new MergedHistogramCreator());
            logger.info("Number of data1 results: " + data1DF.select("lId").distinct().count());
            logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
            logger.info("Number of results with beaufort histograms: " + mergedDataDF.select("Id").distinct().count());
            // Save to parquet
            String outputS3Path = basedirOutput + "/dt=" + TimeUtils.getSimpleDailyStringFromDate(currRunDate);
            if (clearOutputBeforeSaving) {
                writeWithCleanup(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext, s3Client);
            } else {
                write(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext);
            }
        } finally {
            TimeUtils.progressToNextDay(currRunDate);
        }
    }
}

public void write(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass, SQLContext sqlContext) {
    // Apply a schema to an RDD of JavaBeans and save it as Parquet.
    DataFrame fullDataDF = sqlContext.createDataFrame(outputRDD, outputClass);
    fullDataDF.write().parquet(outputS3Path);
}

public void writeWithCleanup(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass,
                             SQLContext sqlContext, S3Client s3Client) {
    String fileKey = S3Utils.getS3Key(outputS3Path);
    String bucket = S3Utils.getS3Bucket(outputS3Path);
    logger.info("Deleting existing dir: " + outputS3Path);
    s3Client.deleteAll(bucket, fileKey);
    write(outputS3Path, outputRDD, outputClass, sqlContext);
}

public Date getMinOfEndDateAndNextDay(Date startTime, Date proposedEndTime) {
    long endOfDay = startTime.getTime() - startTime.getTime() % MILLIS_PER_DAY + MILLIS_PER_DAY;
    if (endOfDay < proposedEndTime.getTime()) {
        return new Date(endOfDay);
    }
    return proposedEndTime;
}
The size of data1 is around 150,000 and data2 is around 500,000.
What my code basically does is some data manipulation: it merges the 2 data sets, does a bit more manipulation, prints some statistics, and saves to Parquet.
Spark has 25 GB of memory per server, and the code runs fine.
Each iteration takes about 2-3 minutes.
The problem starts when I run it on a large set of dates.
After a while, I get an OutOfMemory:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.List.$colon$colon$colon(List.scala:127)
at org.json4s.JsonDSL$JsonListAssoc.$tilde(JsonDSL.scala:98)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:139)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:72)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:164)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:38)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:87)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:71)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:70)
Last time it ran, it crashed after 233 iterations.
The line it crashed on was this:
logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
Can anyone please tell me what can be the reason for the eventual crashes?
I'm not sure that everyone will find this solution viable, but upgrading the Spark cluster to 2.2.0 seems to have resolved the issue.
I have run my application for several days now and had no crashes yet.
This error occurs when GC takes up over 98% of the total execution time of the process. You can monitor the GC time in your Spark Web UI by going to the Stages tab at http://master:4040.
Try increasing the driver/executor memory (whichever is generating this error) using spark.{driver/executor}.memory, passed via --conf when submitting the Spark application.
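For example (the memory values are placeholders, tune them to your cluster):

spark-submit \
  --conf spark.driver.memory=8g \
  --conf spark.executor.memory=20g \
  <your usual class/jar arguments>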
Another thing to try is to change the garbage collector that the JVM is using. Read this article for that: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html. It explains very clearly why the GC overhead error occurs and which garbage collector is best for your application.
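For example, switching the driver and executors to G1, as that article discusses, can also be set at submit time (treat the flags as a starting point, not a tuned configuration):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  <your usual class/jar arguments>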
I have a Java application with the properties below:
kafkaProperties = new Properties();
kafkaProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBrokersList);
kafkaProperties.put(ConsumerConfig.GROUP_ID_CONFIG, consumerGroupName);
kafkaProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
kafkaProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
kafkaProperties.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, consumerSessionTimeoutMs);
kafkaProperties.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, maxPartitionFetchBytes);
kafkaProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
I've created 15 consumer threads and let them process the runnable below. I don't have any other consumer with this consumer group name consuming.
@Override
public void run() {
    try {
        logger.info("Starting ConsumerWorker, consumerId={}", consumerId);
        consumer.subscribe(Arrays.asList(kafkaTopic), offsetLoggingCallback);
        while (true) {
            boolean isPollFirstRecord = true;
            logger.debug("consumerId={}; about to call consumer.poll() ...", consumerId);
            ConsumerRecords<String, String> records = consumer.poll(pollIntervalMs);
            Map<Integer, Long> partitionOffsetMap = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                if (isPollFirstRecord) {
                    isPollFirstRecord = false;
                    logger.info("Start offset for partition {} in this poll : {}", record.partition(), record.offset());
                }
                messageProcessor.processMessage(record.value(), record.offset());
                partitionOffsetMap.put(record.partition(), record.offset());
            }
            if (!records.isEmpty()) {
                logger.info("Invoking commit for partition/offset : {}", partitionOffsetMap);
                consumer.commitAsync(offsetLoggingCallback);
            }
        }
    } catch (WakeupException e) {
        logger.warn("ConsumerWorker [consumerId={}] got WakeupException - exiting ... Exception: {}",
                consumerId, e.getMessage());
    } catch (Exception e) {
        logger.error("ConsumerWorker [consumerId={}] got Exception - exiting ... Exception: {}",
                consumerId, e.getMessage());
    } finally {
        logger.warn("ConsumerWorker [consumerId={}] is shutting down ...", consumerId);
        consumer.close();
    }
}
I also have an OffsetCommitCallbackImpl like the one below. It basically maintains the partitions and their committed offsets in a map, and logs whenever an offset is committed.
@Override
public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
    if (exception == null) {
        offsets.forEach((topicPartition, offsetAndMetadata) -> {
            partitionOffsetMap.put(topicPartition, offsetAndMetadata);
            logger.info("Offset position during the commit for consumerId : {}, partition : {}, offset : {}",
                    Thread.currentThread().getName(), topicPartition.partition(), offsetAndMetadata.offset());
        });
    } else {
        offsets.forEach((topicPartition, offsetAndMetadata) ->
                logger.error("Offset commit error, and partition offset info : {}, partition : {}, offset : {}",
                        exception.getMessage(), topicPartition.partition(), offsetAndMetadata.offset()));
    }
}
Problem/Issue:
I noticed that I miss events/messages whenever I restart the application (bring it down and bring it back up). When I looked closely at the logging and compared the offsets committed before shutdown (using the OffsetCommitCallback logging) with the offsets picked up for processing after restart, I saw that for certain partitions we did not pick up at the offset where we left off before shutdown. Sometimes the start offsets for certain partitions are about 1000 higher than the committed offsets.
NOTE: This happens to about 8 out of 40 partitions.
If you look closely at the logging in the run method, there is one log statement where I print the offsets right before invoking the async commit. For example, if the last log before shutdown shows 10 for partition 1, then after restart the first offset we process for partition 1 is something like 100, and I validated that we are missing exactly 90 messages.
Can anyone think of a reason why this would be happening?
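Not a definite diagnosis, but two things are worth checking here. First, commitAsync requests can still be queued or in flight when a worker is woken up and closed, so the very last offsets you log may never reach the broker; a final commitSync in the shutdown path closes that window. Second, the properties leave auto.offset.reset at its default of latest, so any partition whose committed offset is missing (or has expired) silently jumps to the log end, which looks exactly like starting ahead of where you left off; setting ConsumerConfig.AUTO_OFFSET_RESET_CONFIG to "earliest" makes that failure mode visible as reprocessing instead of loss. A sketch of the worker with a final synchronous commit (logging trimmed; the structure otherwise mirrors your run method):

@Override
public void run() {
    try {
        consumer.subscribe(Arrays.asList(kafkaTopic), offsetLoggingCallback);
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(pollIntervalMs);
            for (ConsumerRecord<String, String> record : records) {
                messageProcessor.processMessage(record.value(), record.offset());
            }
            if (!records.isEmpty()) {
                // fast, non-blocking commit while we are running normally
                consumer.commitAsync(offsetLoggingCallback);
            }
        }
    } catch (WakeupException e) {
        logger.warn("ConsumerWorker [consumerId={}] got WakeupException - exiting ...", consumerId);
    } finally {
        try {
            // one blocking commit on the way out, so offsets that only went through
            // commitAsync are not lost when the consumer is closed
            consumer.commitSync();
        } catch (Exception e) {
            logger.error("Final commitSync failed for consumerId={}", consumerId, e);
        }
        consumer.close();
    }
}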