Flink Dynamic Update Stream job - java

I am receiving a set of events in Avro format on different topics. I want to consume these and write them to S3 in Parquet format.
I have written the job below, which creates a separate stream for each event and fetches its schema from the Confluent Schema Registry in order to create a Parquet sink for that event.
This works fine, but the only problem I am facing is that whenever a new event starts coming in, I have to change the YAML config and restart the job. Is there any way to avoid restarting the job so that it starts consuming a new set of events on its own?
YamlReader reader = new YamlReader(topologyConfig);
EventTopologyConfig eventTopologyConfig = reader.read(EventTopologyConfig.class);

long checkPointInterval = eventTopologyConfig.getCheckPointInterval();
topics = eventTopologyConfig.getTopics();
List<EventConfig> eventTypesList = eventTopologyConfig.getEventsType();

CachedSchemaRegistryClient registryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 1000);

FlinkKafkaConsumer flinkKafkaConsumer = new FlinkKafkaConsumer(topics,
        new KafkaGenericAvroDeserializationSchema(schemaRegistryUrl),
        properties);

DataStream<GenericRecord> dataStream = streamExecutionEnvironment.addSource(flinkKafkaConsumer).name("source");

try {
    for (EventConfig eventConfig : eventTypesList) {
        LOG.info("creating a stream for {}", eventConfig.getEvent_name());

        final StreamingFileSink<GenericRecord> sink = StreamingFileSink.forBulkFormat(
                path,
                ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema(eventConfig.getSchema_subject(), registryClient)))
                .withBucketAssigner(new EventTimeBucketAssigner())
                .build();

        DataStream<GenericRecord> outStream = dataStream.filter((FilterFunction<GenericRecord>) genericRecord ->
                genericRecord != null && genericRecord.get(EVENT_NAME).toString().equals(eventConfig.getEvent_name()));

        outStream.addSink(sink).name(eventConfig.getSink_id()).setParallelism(parallelism);
    }
} catch (Exception e) {
    e.printStackTrace();
}
YAML file:
!com.bounce.config.EventTopologyConfig
eventsType:
  - !com.bounce.config.EventConfig
    event_name: "search_list_keyless"
    schema_subject: "search_list_keyless-com.bounce.events.keyless.bookingflow.search_list_keyless"
    topic: "search_list_keyless"

  - !com.bounce.config.EventConfig
    event_name: "bike_search_details"
    schema_subject: "bike_search_details-com.bounce.events.keyless.bookingflow.bike_search_details"
    topic: "bike_search_details"

  - !com.bounce.config.EventConfig
    event_name: "keyless_bike_lock"
    schema_subject: "analytics-keyless-com.bounce.events.keyless.bookingflow.keyless_bike_lock"
    topic: "analytics-keyless"

  - !com.bounce.config.EventConfig
    event_name: "keyless_bike_unlock"
    schema_subject: "analytics-keyless-com.bounce.events.keyless.bookingflow.keyless_bike_unlock"
    topic: "analytics-keyless"

checkPointInterval: 1200000
topics: ["search_list_keyless","bike_search_details","analytics-keyless"]
Thanks.

I think you want to use a custom BucketAssigner that uses the genericRecord.get(EVENT_NAME).toString() value as the bucket ID, along with whatever event-time bucketing the EventTimeBucketAssigner is already doing.
Then you don't need to create multiple streams, and it is dynamic: whenever a new event name shows up in a record being written, you get a new output bucket.
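A minimal sketch of what that assigner could look like, assuming the records expose the event name under the field the question calls EVENT_NAME (here "event_name") and bucketing by hour; note a single StreamingFileSink still needs one writer/schema that can handle all event types:

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EventNameTimeBucketAssigner implements BucketAssigner<GenericRecord, String> {

    // produces bucket IDs such as "search_list_keyless/2020-06-13--12"
    private static final DateTimeFormatter HOUR_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd--HH").withZone(ZoneOffset.UTC);

    @Override
    public String getBucketId(GenericRecord record, Context context) {
        // "event_name" is an assumption for the field behind the EVENT_NAME constant
        String eventName = String.valueOf(record.get("event_name"));
        // prefer the record/event timestamp, fall back to processing time
        long ts = context.timestamp() != null ? context.timestamp() : context.currentProcessingTime();
        return eventName + "/" + HOUR_FORMAT.format(Instant.ofEpochMilli(ts));
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

You would then build a single sink with .withBucketAssigner(new EventNameTimeBucketAssigner()) and attach it to dataStream directly, without the per-event loop.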

Related

How to verify that a Kafka topic is not empty, i.e. has at least one message?

I am writing a little Kafka metrics exporter (yes, there are loads available, like Prometheus, but I want a lightweight custom one; kindly excuse me on this).
As part of this I would like to know as soon as the first message is received in a Kafka topic (or whether a topic has messages at all). I am using Spring Boot and Kafka.
I have the code below, which gives the name of the topic and the number of partitions. How can I find out whether the topic has messages? Any lead is much appreciated!
@ReadOperation
public List<TopicManifest> kafkaTopic() throws ExecutionException, InterruptedException {
    ListTopicsOptions listTopicsOptions = new ListTopicsOptions();
    listTopicsOptions.listInternal(true);
    ListTopicsResult listTopicsResult = adminClient.listTopics(listTopicsOptions);
    Set<String> topics = listTopicsResult.names().get().stream()
            .filter(topic -> !topic.startsWith("_"))
            .collect(Collectors.toSet());
    System.out.println(topics);

    DescribeTopicsResult describeTopicsResult = adminClient.describeTopics(topics);
    Map<String, KafkaFuture<TopicDescription>> topicNameValues = describeTopicsResult.topicNameValues();
    List<TopicManifest> topicManifests = topicNameValues.entrySet().stream().map(entry -> {
        try {
            TopicDescription topicDescription = entry.getValue().get();
            return TopicManifest.builder().name(entry.getKey())
                    .noOfPartitions(topicDescription.partitions().size())
                    .build();
        } catch (InterruptedException | ExecutionException e) {
            e.printStackTrace();
        }
        return null;
    }).collect(Collectors.toList());
    return topicManifests;
}
Create a KafkaConsumer and call endOffsets (the consumer does not need to be subscribed to the topic(s)).
@Bean
ApplicationRunner runner1(ConsumerFactory cf) {
    return args -> {
        try (Consumer consumer = cf.createConsumer()) {
            System.out.println(consumer.endOffsets(List.of(
                    new TopicPartition("ktest29", 0),
                    new TopicPartition("ktest29", 1),
                    new TopicPartition("ktest29", 2))));
        }
    };
}
Offsets stored on the topic never reduce. Getting the end offset doesn't guarantee you have a non-empty topic (start and end offsets for the topic partitions could be the same).
Instead, you will still create a consumer but set
auto.offset.reset=earliest
group.id=UUID.randomUUID()
Then subscribe and run
ConsumerRecords records = consumer.poll(Duration.ofSeconds(2));
boolean empty = records.count() == 0;
By setting auto.offset.reset=earliest with a random group id, you are guaranteed to start at the earliest available offset of each partition, at which point you can poll for any number of records and see whether any are returned within the specified timeout.
This should work for regular and compacted topics without needing to check committed offsets.
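Putting those two settings together, a rough sketch with a plain KafkaConsumer might look like this (the topic name, deserializers, and 2-second timeout are placeholders):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.UUID;

public class TopicEmptinessCheck {

    static boolean isEmpty(String bootstrapServers, String topic) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString()); // throwaway group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");          // start from the beginning
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of(topic));
            // if at least one record comes back within the timeout, the topic is not empty
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(2));
            return records.isEmpty();
        }
    }
}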

Apache Flink Dynamic Pipeline

I'm working on creating a framework to allow customers to create their own plugins for my software built on Apache Flink. I've outlined in the snippet below what I'm trying to get working (just as a proof of concept); however, I'm getting an org.apache.flink.client.program.ProgramInvocationException: The main method caused an error. when trying to upload it.
I want to be able to branch the input stream into any number of different pipelines and then combine those into a single output. What I have below is just the simplified version I'm starting with.
public class ContentBase {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");

        // Set up the execution environment and get the stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));

        // Create a new List of Streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }

        // It seemed as though I needed a seed DataStream to union all others on
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode seedObject = (ObjectNode) mapper.readTree("{\"start\":\"true\"}");
        DataStream<ObjectNode> alerts = see.fromElements(seedObject);

        // Union the output of each "rule" above with the seed object to then output
        for (DataStream<ObjectNode> output : outputs) {
            alerts.union(output);
        }

        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
                .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));

        see.execute();
    }
}
I can't figure out how to get the actual error out of the Flink web interface to add that information here.
There were a few errors I found.
1) A Stream Execution Environment can only have one input (apparently? I could be wrong), so adding the .fromElements input was not good.
2) I forgot that all DataStreams are immutable, so the .union operation creates a new DataStream output.
The final result ended up being much simpler:
public class ContentBase {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");

        // Set up the execution environment and get the stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));

        // Create a new List of Streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }

        Optional<DataStream<ObjectNode>> alerts = outputs.stream().reduce(DataStream::union);

        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
                .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));

        see.execute();
    }
}
The code you posted cannot compile because of the last part (i.e., converting to string): you mixed up Java's Optional#map with Flink's DataStream#map. Changing it to
alerts.get().map(ObjectNode::toString);
fixes it.
Good luck.

Kafka Streams to sort messages based on a timestamp key in JSON messages

I am publishing JSON messages to Kafka, e.g.:
{"UserID":111,"UpdateTime":"06-13-2018 12:13:43.200Z","Comments":2,"Like":10}
{"UserID":111,"UpdateTime":"06-13-2018 12:13:40.200Z","Comments":0,"Like":6}
{"UserID":222,"UpdateTime":"06-13-2018 12:13:43.200Z","Comments":1,"Like":10}
{"UserID":111,"UpdateTime":"06-13-2018 12:13:44.600Z","Comments":3,"Like":12}
I would like to sort messages based on UpdateTime in 10 second time window using Kafka Streams and push back sorted messages in another Kafka topic.
I have created a stream that reads data from the input topic, and then I create a TimeWindowedKStream after groupByKey(), where UserID is the key in the message (although it's not necessary to groupByKey and then sort, I could not get windowedBy working directly). But I am not able to go further and sort the messages in the 10-second window based on UpdateTime. My source code is:
public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sorting");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker");
    props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> source = builder.stream("UnsortedMessages");
    TimeWindowedKStream<String, String> countss = source.groupByKey()
            .windowedBy(TimeWindows.of(10000L).until(10000L));

    /*
      SORTING CODE
    */

    outputMessage.toStream().to("SortedMessages", Produced.with(Serdes.String(), Serdes.Long()));

    final KafkaStreams streams = new KafkaStreams(builder.build(), props);
    final CountDownLatch latch = new CountDownLatch(1);

    // attach shutdown handler to catch control-c
    Runtime.getRuntime().addShutdownHook(new Thread("streams-sorting-shutdown-hook") {
        @Override
        public void run() {
            streams.close();
            latch.countDown();
        }
    });

    try {
        streams.start();
        latch.await();
    } catch (Throwable e) {
        System.exit(1);
    }
    System.exit(0);
}
Many thanks in advance.
If you want to sort messages ignoring the key, it only makes sense to do this per partition, and only if the input topic has the same number of partitions as the output topic. In that case, you should extract the partition number and use it as the message key (cf. https://docs.confluent.io/current/streams/faq.html#accessing-record-metadata-such-as-topic-partition-and-offset-information).
For sorting, it's more tricky. Note that Kafka Streams follows a "continuous output" model and emits updates for each input record when using the DSL. Thus, it might be better to use the Processor API. You would use a Processor with an attached store and put records into the store, keeping a sorted structure of records. As time advances, you can emit "finished" windows and delete the corresponding records from the store.
I don't think you can build this using the DSL.
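A rough sketch of that Processor-with-store approach, assuming a recent Kafka Streams version, a state store named "sort-buffer" registered on the topology, and a helper that extracts UpdateTime from the JSON value (left as a placeholder here):

import java.time.Duration;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class SortProcessor extends AbstractProcessor<String, String> {

    private static final long WINDOW_MS = 10_000L;

    private KeyValueStore<String, String> buffer;
    private long seq = 0;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        // "sort-buffer" must be added to the topology and attached to this processor
        buffer = (KeyValueStore<String, String>) context.getStateStore("sort-buffer");
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, this::flushExpired);
    }

    @Override
    public void process(String key, String value) {
        // zero-padded timestamp plus a sequence number keeps store keys unique and ordered
        long eventTime = extractUpdateTimeMillis(value);
        buffer.put(String.format("%020d-%010d", eventTime, seq++), value);
    }

    private void flushExpired(long now) {
        // forward everything older than the window, in timestamp order, then drop it
        String cutoff = String.format("%020d", now - WINDOW_MS);
        try (KeyValueIterator<String, String> it = buffer.range(String.format("%020d", 0L), cutoff)) {
            while (it.hasNext()) {
                KeyValue<String, String> entry = it.next();
                context().forward(null, entry.value);
                buffer.delete(entry.key);
            }
        }
    }

    private long extractUpdateTimeMillis(String json) {
        // placeholder: parse the "UpdateTime" field out of the JSON message (e.g. with Jackson)
        return System.currentTimeMillis();
    }
}

The zero-padded store key keeps entries in timestamp order, so range() returns them already sorted, and the punctuator only forwards records that have fallen out of the 10-second window.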

Using kafka-streams to conditionally sort a JSON input stream

I am new to developing Kafka Streams applications. My stream processor is meant to sort JSON messages based on the value of a user key in the input JSON message.
Message 1: {"UserID": "1", "Score":"123", "meta":"qwert"}
Message 2: {"UserID": "5", "Score":"780", "meta":"mnbvs"}
Message 3: {"UserID": "2", "Score":"0", "meta":"fghjk"}
I have read here Dynamically connecting a Kafka input stream to multiple output streams that there is no dynamic solution.
In my use case I know the user keys and output topics up front, so I am writing a separate processor application specific to each user, where each processor application matches a different UserID.
All the different stream processor applications read from the same JSON input topic in Kafka, but each one only writes a message to the output topic for its specific user if the preset user condition is met.
public class SwitchStream extends AbstractProcessor<String, String> {
    @Override
    public void process(String key, String value) {
        HashMap<String, String> message = new HashMap<>();
        ObjectMapper mapper = new ObjectMapper();
        try {
            message = mapper.readValue(value, HashMap.class);
        } catch (IOException e) {}

        // User condition UserID = 1
        if (message.get("UserID").equals("1")) {
            context().forward(key, value);
            context().commit();
        }
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sort-stream-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        TopologyBuilder builder = new TopologyBuilder();
        builder.addSource("Source", "INPUT_TOPIC");
        builder.addProcessor("Process", SwitchStream::new, "Source");
        builder.addSink("Sink", "OUTPUT_TOPIC", "Process");

        KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();
    }
}
Question 1:
Is it possible to achieve the same functionality easily using the high-level Streams DSL instead of the low-level Processor API? (I admit I found it harder to understand and follow the online examples of the high-level Streams DSL.)
Question 2:
The input JSON topic is receiving data at a high rate, 20K-25K EPS. My processor applications don't seem to be able to keep pace with this input stream. I have tried deploying multiple instances of each processor, but the results are nowhere close to where I want them to be. Ideally each processor instance should be able to handle 3-5K EPS.
Is there a way to improve my processor logic, or to write the same processor logic using the high-level Streams DSL? Would that make a difference?
You can do this in the high-level DSL via filter() (you effectively implemented a filter, as you only forward a message if its UserID equals "1"). You can generalize this filter pattern by using KStream#branch() (see the docs for further details: http://docs.confluent.io/current/streams/developer-guide.html#stateless-transformations). Also read the JavaDocs: http://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/streams
KStreamBuilder builder = new KStreamBuilder();

builder.stream("INPUT_TOPIC")
       .filter(new Predicate<String, String>() {
           @Override
           public boolean test(String key, String value) {
               // put your processor logic here: parse the JSON value
               // (as in your Processor) and check the UserID field
               try {
                   HashMap<String, String> message =
                           new ObjectMapper().readValue(value, HashMap.class);
                   return message.get("UserID").equals("1");
               } catch (IOException e) {
                   return false;
               }
           }
       })
       .to("OUTPUT_TOPIC");
About performance: a single instance should be able to process 10K+ records per second. It's hard to tell what the problem might be without further information. I would recommend asking on the Kafka user list (see http://kafka.apache.org/contact).

Deserialize Avro messages into specific datum using KafkaAvroDecoder

I'm reading from a Kafka topic which contains Avro messages serialized using the KafkaAvroEncoder (which automatically registers the schemas with the topics). I'm using the avro-maven-plugin to generate plain Java classes, which I'd like to use when reading.
The KafkaAvroDecoder only supports deserializing into GenericData.Record types, which (in my opinion) misses the whole point of having a statically typed language. My deserialization code currently looks like this:
SpecificDatumReader<event> reader = new SpecificDatumReader<>(
        event.getClassSchema() // event is my class generated from the schema
);

byte[] in = ...; // my input bytes
ByteBuffer stuff = ByteBuffer.wrap(in);

// the KafkaAvroEncoder puts a magic byte and the ID of the schema (as stored
// in the schema-registry) before the serialized message
if (stuff.get() != 0x0) {
    return;
}
int id = stuff.getInt();

// lets just ignore those special bytes
int length = stuff.limit() - 4 - 1;
int start = stuff.position() + stuff.arrayOffset();
Decoder decoder = DecoderFactory.get().binaryDecoder(
        stuff.array(), start, length, null
);

try {
    event ev = reader.read(null, decoder);
} catch (IOException e) {
    e.printStackTrace();
}
I find my solution cumbersome, so I'd like to know if there is a simpler way to do this.

Thanks to the comment I was able to find the answer. The secret was to instantiate the KafkaAvroDecoder with a Properties object specifying the use of the specific Avro reader, that is:
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "...");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "...");
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);

VerifiableProperties vProps = new VerifiableProperties(props);
KafkaAvroDecoder decoder = new KafkaAvroDecoder(vProps);
MyLittleData data = (MyLittleData) decoder.fromBytes(input);
The same configuration applies when using the KafkaConsumer<K, V> class directly. (I'm consuming from Kafka in Storm using the KafkaSpout from the storm-kafka project, which uses the SimpleConsumer, so I have to deserialize the messages manually. For the courageous, there is the storm-kafka-client project, which does this automatically by using the new-style consumer.)
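For reference, a sketch of the plain-consumer variant with the same specific-reader setting (the group id and topic name are placeholders, and the key is assumed to be a plain string):

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "...");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        io.confluent.kafka.serializers.KafkaAvroDeserializer.class.getName());
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "...");
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);

try (KafkaConsumer<String, MyLittleData> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("my-topic"));
    ConsumerRecords<String, MyLittleData> records = consumer.poll(Duration.ofSeconds(1));
    for (ConsumerRecord<String, MyLittleData> record : records) {
        MyLittleData data = record.value(); // already a specific record, no manual decoding needed
    }
}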
