How to extract timestamp embedded in messages in Kafka Streams - java

I want to extract the timestamps embedded in each message and send them as a JSON payload to my database.
I want to get the following three timestamps:
Event-time: The point in time when an event or data record occurred, i.e. was originally created “by the source”.
Processing-time: The point in time when the event or data record happens to be processed by the stream processing application, i.e. when the record is being consumed.
Ingestion-time: The point in time when an event or data record is stored in a topic partition by a Kafka broker.
This is my streams application code:
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, BROKER_URL + ":9092"); // BROKER_URL is passed in from the environment, e.g. localhost
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
final StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source_o365_user_activity = builder.stream("o365_user_activity");
source_o365_user_activity.flatMapValues(new ValueMapper<String, Iterable<String>>() {
    @Override
    public Iterable<String> apply(String value) {
        System.out.println("========> o365_user_activity_by_date Log: " + value);
        ArrayList<String> keywords = new ArrayList<String>();
        try {
            JSONObject send = new JSONObject();
            JSONObject received = new JSONObject(value);
            send.put("current_date", getCurrentDate().toString()); // UTC TIME
            send.put("activity_time", received.get("CreationTime")); // CONSTANTS FINAL STATIC(Topic Names, Cassandra keys)
            send.put("user_id", received.get("UserId"));
            send.put("operation", received.get("Operation"));
            send.put("workload", received.get("Workload"));
            keywords.add(send.toString());
        } catch (Exception e) {
            // TODO: handle exception
            System.err.println("Unable to convert to json");
            e.printStackTrace();
        }
        return keywords;
    }
}).to("o365_user_activity_by_date");
In the code I simply take each record, do some stream processing on it, and send it to a different topic.
Now, with each record, I want to send the event-time, processing-time and ingestion-time embedded in the payload.
I have looked at FailOnInvalidTimestamp and WallclockTimestampExtractor, but I am confused about how to use them.
Kindly guide me on how I can achieve this.

A timestamp extractor can only give you one timestamp, and this timestamp is used for time-based operations like windowed aggregations or joins. It seems that you don't do any time-based computation, though, so from a computation point of view it does not matter.
Note that a record has only one metadata timestamp field. This field can be used to store an event timestamp that is set by the producer. As an alternative, you can let the broker overwrite the producer-provided timestamp with the broker ingestion time. This is a topic configuration (message.timestamp.type=LogAppendTime).
To access the record metadata timestamp (regardless of whether it holds event-time or ingestion-time), the default timestamp extractor will give you this timestamp. If you want to access it in your application, you need to use the Processor API, i.e., in your case a .transform() instead of a .flatMapValues() operator. Your Transformer will be initialized with a context object that allows you to access the extracted timestamp.
Because a record can only store one metadata timestamp and because you want to use this for broker ingestion time, the upstream producer must put the event-timestamp into the payload directly.
For processing-time, just do a system call as indicated in your code snippet already.
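To make this concrete, here is a minimal sketch in plain Java (no Kafka dependency) of how the three timestamps could be combined into one JSON payload. The class TimestampPayload, its build method, and the field names event_time, ingestion_time and processing_time are hypothetical. The sketch assumes the topic is configured with message.timestamp.type=LogAppendTime, so that the record timestamp (which a Transformer would read through context.timestamp()) holds the broker ingestion time, and that the producer wrote the event time into the payload as CreationTime.

```java
import java.time.Instant;

// Hypothetical helper, not part of Kafka Streams. In a real Transformer,
// recordTimestampMs would come from ProcessorContext#timestamp() and
// eventTimeIso from the "CreationTime" field of the incoming payload.
class TimestampPayload {

    /** Builds a flat JSON object carrying the three timestamps. */
    static String build(String eventTimeIso, long recordTimestampMs, long processingTimeMs) {
        // Simplification: values are not JSON-escaped; use a JSON library for real payloads.
        return "{"
            + "\"event_time\":\"" + eventTimeIso + "\","
            + "\"ingestion_time\":\"" + Instant.ofEpochMilli(recordTimestampMs) + "\","
            + "\"processing_time\":\"" + Instant.ofEpochMilli(processingTimeMs) + "\""
            + "}";
    }

    public static void main(String[] args) {
        // Processing time is simply the wall clock at the moment the record is handled.
        System.out.println(build("2019-01-01T00:00:00", 1546300801000L, System.currentTimeMillis()));
    }
}
```

Inside a Transformer, the call might look like TimestampPayload.build(received.getString("CreationTime"), context.timestamp(), System.currentTimeMillis()), where received is the parsed JSONObject from the question's code.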

Related

Run kafka consumer without specifying partition

I started learning Kafka recently, and my consumers can't consume any records unless I specify the --partition 0 parameter. In other words, I can NOT consume records like this:
kafka-console-consumer --bootstrap-server 127.0.0.10:9092 --topic first-topic
but works like:
kafka-console-consumer --bootstrap-server 127.0.0.10:9092 --topic first-topic --partition 0
The main problem is that when I moved to Java code, my KafkaConsumer can't fetch records, and I need to know how to specify the partition number in the Java KafkaConsumer.
My current Java code is:
public class ConsumerDemo {
public static void main(String[] args) {
Logger logger = LoggerFactory.getLogger((ConsumerDemo.class.getName()));
String bootstrapServer = "127.0.0.10:9092";
String groupId = "my-kafka-java-app";
String topic = "first-topic";
// create consumer configs
Properties properties = new Properties();
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServer);
//properties.setProperty(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, partition);
properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, groupId);
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
// create consumer
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(properties);
// subscribe consumer to our topic
consumer.subscribe(Collections.singleton(topic)); // subscribe to a single topic
// poll for new data
while(true){
//consumer.poll(100); old way
ConsumerRecords<String, String> records =
consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records){
logger.info("Key: " + record.key() + ", Value: "+ record.value() );
logger.info("Partition: " + record.partition() + ", Offset: "+ record.offset());
}
}
}
}
After a lot of investigation, my solution turned out to be using consumer.assign and consumer.seek instead of consumer.subscribe, without specifying the groupId. But I feel there should be a better solution.
The Java code is as follows:
// create consumer
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(properties);
// subscribe consumer to our topic
//consumer.subscribe(Collections.singleton(topic)); //means subscription to one topic
// using assign and Seek, are mostly used to replay data or fetch a specific msg
TopicPartition partitionToReadFrom = new TopicPartition(topic, 0);
long offsetToReadFrom = 15L;
// assign
consumer.assign(Arrays.asList(partitionToReadFrom));
// seek: for a specific offset to read from
consumer.seek(partitionToReadFrom, offsetToReadFrom);
The way you are doing it is correct. You don't need to specify the partition when subscribing to a topic. Maybe your consumer group has already consumed all messages in the topic and has committed the latest offsets.
Make sure new messages are being produced while your application runs, or create a new consumer group to consume from the beginning (if you keep ConsumerConfig.AUTO_OFFSET_RESET_CONFIG set to "earliest").
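The interaction between committed offsets and auto.offset.reset described above can be modelled with a few lines of plain Java. This is a simplified model, not Kafka client code: StartOffsetModel and startingOffset are hypothetical names, and it ignores the case where a committed offset is out of range.

```java
import java.util.OptionalLong;

// Simplified model (not the Kafka client): where does a group member start reading a partition?
class StartOffsetModel {

    static long startingOffset(OptionalLong committed, String autoOffsetReset,
                               long beginningOffset, long endOffset) {
        // A committed offset for the group always wins; auto.offset.reset is not consulted.
        if (committed.isPresent()) {
            return committed.getAsLong();
        }
        // No committed offset yet (e.g. a new group id): fall back to the reset policy.
        return "earliest".equals(autoOffsetReset) ? beginningOffset : endOffset;
    }

    public static void main(String[] args) {
        // Existing group that already committed offset 42 on a partition holding offsets 0..41.
        System.out.println(startingOffset(OptionalLong.of(42), "earliest", 0, 42));
        // Fresh group with "earliest": starts from the beginning of the partition.
        System.out.println(startingOffset(OptionalLong.empty(), "earliest", 0, 42));
    }
}
```

In the first case the consumer starts at the end of the partition and sees nothing until new messages arrive, which matches the symptom in the question.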
As the name implies, the ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG property configures a partition assignment strategy; it does not set a fixed partition the way the command-line flag does.
The default strategy used is the RangeAssignor, which can be changed, for example to a StickyAssignor, as follows:
properties.setProperty(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, StickyAssignor.class.getName());
You can read more in the Kafka Client-side Assignment Proposal.

Kafka KStream-KStream leftjoin windowed with custom TimestampExtractor cause Skipping record for expired segment

I am trying to left join two KStreams (stream1, stream2) with a custom TimestampExtractor.
Even if my two events are timestamped close enough to each other, I get:
[my-app-client-StreamThread-1] WARN org.apache.kafka.streams.state.internals.AbstractRocksDBSegmentedBytesStore - Skipping record for expired segment.
For testing, I tried not using a custom TimestampExtractor; it works if my producer sends events fast enough and respects my window duration config.
Any ideas?
I've checked the documentation and I don't see any limitation regarding a custom TimestampExtractor when joining two KStreams.
Here are more details about the issue.
My TimestampExtractor extracts the timestamp from the event payload:
public class EventTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(final ConsumerRecord<Object, Object> record, final long previousTimestamp) {
        final Event event = (Event) record.value();
        // Use the timestamp carried in the event payload as the record's event time.
        final long timestamp = event.myTimestamp;
        return timestamp;
    }
}
Here is my application:
final Properties props = new Properties();
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, EventTimestampExtractor.class);
...
final StreamsBuilder builder = new StreamsBuilder();
final KStream<String, Event> stream1 =
builder.stream("topic-left", Consumed.with(Serdes.String(),
EventSerde.serde()));
final KStream<String, Event> stream2 =
builder.stream("topic-right", Consumed.with(Serdes.String(),
EventSerde.serde()));
stream1.leftJoin(stream2,
(eventLeft, eventRight) -> {
... processing ...
Data data = merge(eventLeft.data, eventRight.data);
return data;
},
JoinWindows.of(Duration.ofMillis(1000)).grace(Duration.ofMillis(60000)),
StreamJoined.with(Serdes.String(), EventSerde.serde(), EventSerde.serde())
)
.peek((key, data) -> {
LOG.debug(key + data);
});
...
final KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
After that, using kafka-console-producer, I inject eventLeft into "topic-left" and eventRight into "topic-right", where:
eventLeft.myTimestamp = T (in ms)
eventRight.myTimestamp = T+200 (in ms)
My issue is that I don't get my log from peek(); I get this instead:
[my-app-client-StreamThread-1] WARN org.apache.kafka.streams.state.internals.AbstractRocksDBSegmentedBytesStore - Skipping record for expired segment.
When I display the timestamp values in the EventTimestampExtractor, everything seems OK.
The issue seems to be a duplicate of this one.
When, instead of using kafka-console-producer and hardcoded timestamps, I use an application that produces my data without hardcoded timestamps, it works.
I don't understand why it doesn't work with kafka-console-producer; there must be some tricky behavior around the records' internal and extracted timestamps.

How to delete data which already been consumed by consumer? Kafka

I am doing data replication in Kafka, but the size of the Kafka log files increases very quickly: it reaches 5 GB in a day. As a solution to this problem, I want to delete processed data immediately. I am using the deleteRecords method in AdminClient to delete records up to an offset. But when I look at the log file, the data corresponding to that offset is not deleted.
RecordsToDelete recordsToDelete = RecordsToDelete.beforeOffset(offset);
TopicPartition topicPartition = new TopicPartition(topicName,partition);
Map<TopicPartition,RecordsToDelete> deleteConf = new HashMap<>();
deleteConf.put(topicPartition,recordsToDelete);
adminClient.deleteRecords(deleteConf);
I don't want suggestions like log.retention.hours, log.retention.bytes, log.segment.bytes or log.cleanup.policy=delete,
because I only want to delete data that has been consumed by the consumer. With those settings, data that has not been consumed yet would be deleted as well.
What are your suggestions?
You didn't do anything wrong. The code you provided works and I've tested it. Just in case I've overlooked something in your code, mine is:
public void deleteMessages(String topicName, int partitionIndex, int beforeIndex) {
TopicPartition topicPartition = new TopicPartition(topicName, partitionIndex);
Map<TopicPartition, RecordsToDelete> deleteMap = new HashMap<>();
deleteMap.put(topicPartition, RecordsToDelete.beforeOffset(beforeIndex));
kafkaAdminClient.deleteRecords(deleteMap);
}
I've used group: 'org.apache.kafka', name: 'kafka-clients', version: '2.0.0'
So check that you are targeting the right partition (0 for the first one).
Check your broker version: https://kafka.apache.org/20/javadoc/index.html?org/apache/kafka/clients/admin/AdminClient.html says:
This operation is supported by brokers with version 0.11.0.0
Produce the messages from the same application, to be sure you're connected properly.
There is one more option you can consider: using cleanup.policy=compact. If your message keys repeat, you could benefit from it, not just because older messages for a key will be deleted automatically, but also because a message with a null payload deletes all the messages for that key. Just don't forget to set delete.retention.ms and min.compaction.lag.ms to values small enough. In that case you can consume a message and then produce a null payload for the same key (but be cautious with this approach, since this way you can delete messages with that key that you didn't consume).
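The effect of compaction with null payloads (tombstones) can be illustrated with a small in-memory model. This is a plain-Java sketch of the end state only, with hypothetical names (CompactionModel, compact). A real broker compacts asynchronously and keeps tombstones around for delete.retention.ms before dropping them.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory model of what a compacted topic eventually retains (not broker code).
class CompactionModel {

    /** Each record is a {key, value} pair; a null value acts as a tombstone. */
    static Map<String, String> compact(List<String[]> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : log) {
            String key = record[0];
            String value = record[1];
            if (value == null) {
                latest.remove(key);      // tombstone: every record for this key goes away
            } else {
                latest.put(key, value);  // newer value replaces the older one
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> log = Arrays.asList(
            new String[] {"k1", "v1"},
            new String[] {"k2", "v2"},
            new String[] {"k1", "v3"},
            new String[] {"k2", null});
        // Only the latest value of k1 survives; k2 was removed by its tombstone.
        System.out.println(compact(log));
    }
}
```

The model also shows the caveat from the answer: a tombstone removes every record for its key, including ones that were produced but not yet consumed.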
Try this
DeleteRecordsResult result = adminClient.deleteRecords(recordsToDelete);
Map<TopicPartition, KafkaFuture<DeletedRecords>> lowWatermarks = result.lowWatermarks();
try {
for (Map.Entry<TopicPartition, KafkaFuture<DeletedRecords>> entry : lowWatermarks.entrySet()) {
System.out.println(entry.getKey().topic() + " " + entry.getKey().partition() + " " + entry.getValue().get().lowWatermark());
}
} catch (InterruptedException | ExecutionException e) {
e.printStackTrace();
}
adminClient.close();
In this code, you need to call entry.getValue().get().lowWatermark() because adminClient.deleteRecords(recordsToDelete) is asynchronous: its result exposes a map of futures, and you need to wait for each future to complete by calling get().
This code will only work if the cleanup policy is "delete" or "compact,delete"; otherwise it fails with a PolicyViolationException.
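Since the goal in the question is to delete only what has already been consumed, the offset passed to RecordsToDelete.beforeOffset has to be chosen with care. Below is a minimal plain-Java sketch under the assumption that you already know the committed offset of every consumer group reading the partition. The class SafeDeleteOffset and its method are hypothetical names, and looking up the committed offsets is left out.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper: picks a deletion boundary that no consumer group still needs.
class SafeDeleteOffset {

    /**
     * A committed offset is the next offset a group will read, so every record
     * below the smallest committed offset has been consumed by all groups.
     */
    static long beforeOffset(Collection<Long> committedOffsets) {
        if (committedOffsets.isEmpty()) {
            return 0L; // nothing committed yet: deleting before offset 0 deletes nothing
        }
        long min = Long.MAX_VALUE;
        for (long offset : committedOffsets) {
            min = Math.min(min, offset);
        }
        return min;
    }

    public static void main(String[] args) {
        // Three groups at different positions on the same partition.
        System.out.println(beforeOffset(Arrays.asList(120L, 95L, 300L)));
    }
}
```

The returned value could then be passed to RecordsToDelete.beforeOffset(...) for that partition.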

AWS - DynamoDB - how to get object that have only one field from DataBase

I'm using DynamoDB. I have a table called "cache" with only one field, a String named "apiToken". How can I get this String from the DB when I have only that one field? Is this even possible?
private String getAuthToken() {
// TODO: Replace with cache fetched from DB instead of refreshApiToken method
Cache cache = new Cache();
cache.setApiToken(this.refreshApiToken());
return cache.getApiToken();
}
If you only store one item in DynamoDB, I suggest dropping DynamoDB altogether and using AWS Systems Manager Parameter Store instead.
If you want to stick with DynamoDB, you can make a ScanRequest to get the first item.
ScanRequest scanRequest = new ScanRequest()
.withTableName("cache")
.withLimit(1);
ScanResult result = client.scan(scanRequest);
// handle result.getItems() ...

Apache Kafka System Error Handling

We are trying to implement Kafka as our message broker solution. We are deploying our Spring Boot microservices in IBM Bluemix, whose internal message broker implementation is Kafka version 0.10. Since my experience is more on the JMS/ActiveMQ side, I was wondering what the ideal way is to handle system-level errors in the Java consumers.
Here is how we have implemented it currently:
Consumer properties
enable.auto.commit=false
auto.offset.reset=latest
We are using the default properties for
max.partition.fetch.bytes
session.timeout.ms
Kafka Consumer
We are spinning up 3 threads per topic, all having the same groupId, i.e. one KafkaConsumer instance per thread. We have only one partition as of now. The consumer code in the constructor of the thread class looks like this:
kafkaConsumer = new KafkaConsumer<String, String>(properties);
final List<String> topicList = new ArrayList<String>();
topicList.add(properties.getTopic());
kafkaConsumer.subscribe(topicList, new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(final Collection<TopicPartition> partitions) {
    }
    @Override
    public void onPartitionsAssigned(final Collection<TopicPartition> partitions) {
        try {
            logger.info("Partitions assigned, consumer seeking to end.");
            for (final TopicPartition partition : partitions) {
                final long position = kafkaConsumer.position(partition);
                logger.info("current Position: " + position);
                logger.info("Seeking to end...");
                kafkaConsumer.seekToEnd(Arrays.asList(partition));
                logger.info("Seek from the current position: " + kafkaConsumer.position(partition));
                kafkaConsumer.seek(partition, position);
            }
            logger.info("Consumer can now begin consuming messages.");
        } catch (final Exception e) {
            // Log the failure itself rather than repeating the success message.
            logger.error("Failed to reposition consumer after partition assignment.", e);
        }
    }
});
The actual reading happens in the run method of the thread
try {
// Poll on the Kafka consumer every second.
final ConsumerRecords<String, String> records = kafkaConsumer.poll(1000);
// Iterate through all the messages received and print their
// content.
for (final TopicPartition partition : records.partitions()) {
final List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
logger.info("consumer is alive and is processing "+ partitionRecords.size() +" records");
for (final ConsumerRecord<String, String> record : partitionRecords) {
logger.info("processing topic "+ record.topic()+" for key "+record.key()+" on offset "+ record.offset());
final Class<? extends Event> resourceClass = eventProcessors.getResourceClass();
final Object obj = converter.convertToObject(record.value(), resourceClass);
if (obj != null) {
logger.info("Event: " + obj + " acquired by " + Thread.currentThread().getName());
final CommsEvent event = resourceClass.cast(converter.convertToObject(record.value(), resourceClass));
final MessageResults results = eventProcessors.processEvent(event
);
if ("Success".equals(results.getStatus())) {
// commit the processed message which changes
// the offset
kafkaConsumer.commitSync();
logger.info("Message processed sucessfully");
} else {
kafkaConsumer.seek(new TopicPartition(record.topic(), record.partition()), record.offset());
logger.error("Error processing message : {} with error : {},resetting offset to {} ", obj,results.getError().getMessage(),record.offset());
break;
}
}
}
}
// TODO add return
} catch (final Exception e) {
logger.error("Consumer has failed with exception: " + e, e);
shutdown();
}
You will notice the EventProcessor, a service class that processes each record and in most cases commits the record to the database. If the processor throws an error (SystemException or ValidationException), we do not commit but programmatically seek to that offset, so that the subsequent poll will return records from that offset for that group id.
The question now is whether this is the right approach. If we get an error and we reset the offset, then until that is fixed no other message is processed. This might work for system errors like not being able to connect to the DB, but if the problem is only with that event and not others, to process this one record we won't be able to process any other record. We thought of the concept of an ErrorTopic: when we get an error, the consumer publishes that event to the ErrorTopic and in the meantime keeps on processing subsequent events. But it looks like we are trying to bring the design concepts of JMS (due to my previous experience) into Kafka, and there may be a better way to handle errors in Kafka. Also, reprocessing from the error topic may change the sequence of messages, which we don't want in some scenarios.
Please let me know how anyone has handled this scenario in their projects following Kafka conventions.
-Tatha
if the problem is only with that event and not others, to process this one record we won't be able to process any other record
That's correct, and your suggestion to use an error topic seems like a viable one.
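As a rough illustration of the error-topic idea, here is a plain-Java sketch in which lists stand in for the Kafka topics. ErrorTopicModel, consume and the errorTopic list are hypothetical names. In a real consumer the failed record would be sent with a producer to a separate topic, and the offset would still be committed so that polling moves on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Model of the error-topic pattern: a failing record is parked, the rest keep flowing.
class ErrorTopicModel {

    /** Returns the records processed successfully; failures are appended to errorTopic. */
    static List<String> consume(List<String> records, Predicate<String> processor,
                                List<String> errorTopic) {
        List<String> processed = new ArrayList<>();
        for (String record : records) {
            boolean ok;
            try {
                ok = processor.test(record);
            } catch (RuntimeException e) {
                ok = false; // treat an exception like a failed result
            }
            if (ok) {
                processed.add(record);
            } else {
                errorTopic.add(record); // park the record instead of seeking back to it
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        List<String> done = consume(Arrays.asList("a", "bad", "c"), r -> !r.equals("bad"), errors);
        System.out.println(done + " / parked: " + errors);
    }
}
```

As noted in the question, records replayed later from the error topic are handled out of their original order, so this only fits events that do not depend on strict sequencing.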
I also noticed that with your handling of onPartitionsAssigned you essentially do not use the consumer's committed offset, as it seems you always seek to the end.
If you want to restart from the last successfully committed offset, you should not perform a seek.
Finally, I'd like to point out, though it looks like you know this already, that having 3 consumers in the same group subscribed to a topic with a single partition means that 2 out of 3 will be idle.
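The idle-consumer point follows from the rule that, within one group, each partition is assigned to exactly one consumer. A toy round-robin model in plain Java shows the effect. AssignmentModel and assign are hypothetical names, and the real assignors (RangeAssignor, StickyAssignor and so on) are more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: spread partitions 0..n-1 over the members of one consumer group.
class AssignmentModel {

    /** Returns, for each consumer, the list of partitions it owns. */
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            result.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            result.get(p % consumers).add(p); // each partition has exactly one owner
        }
        return result;
    }

    public static void main(String[] args) {
        // One partition, three consumers: two of them end up with nothing to read.
        System.out.println(assign(1, 3));
    }
}
```

With three partitions instead of one, each of the three consumers would own one partition.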
HTH
Edo
