I'm currently working on an Apache Beam pipeline that consumes data from three different Kafka topics and, after some processing, builds three types of objects from that data. Finally, those three objects need to be published to three different Kafka topics.
It is possible to read from multiple topics using the withTopics method of KafkaIO.read, but I did not find a KafkaIO feature for writing to multiple topics.
I would like some advice on the best way to do this, and I would appreciate it if anyone can provide some code examples.
You can do that with 3 different sinks on a PCollection, for example:
// TestPipeline must be exposed as a JUnit @Rule so the test runner can manage it.
@Rule
public final transient TestPipeline pipeline = TestPipeline.create();

@Test
public void kafkaIOSinksTest() {
    PCollection<String> inputCollection =
        pipeline.apply(Create.of(Arrays.asList("Object 1", "Object 2")));

    inputCollection.apply(KafkaIO.<Void, String>write()
        .withBootstrapServers("broker_1:9092,broker_2:9092")
        .withTopic("topic1")
        .withValueSerializer(StringSerializer.class)
        .values());

    inputCollection.apply(KafkaIO.<Void, String>write()
        .withBootstrapServers("broker_1:9092,broker_2:9092")
        .withTopic("topic2")
        .withValueSerializer(StringSerializer.class)
        .values());

    inputCollection.apply(KafkaIO.<Void, String>write()
        .withBootstrapServers("broker_1:9092,broker_2:9092")
        .withTopic("topic3")
        .withValueSerializer(StringSerializer.class)
        .values());

    pipeline.run().waitUntilFinish();
}
In this example, the same PCollection is written to 3 different topics via multiple sinks.
You can use KafkaIO.<K, V>writeRecords() for that. It takes a PCollection<ProducerRecord<K, V>> as input, so you just need to set the desired output topic in the ProducerRecord for every element, or rely on a default one.
Please take a look at this test as an example.
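For illustration, a minimal sketch of that approach, assuming a hypothetical MyEvent element type with chooseTopic(), getKey() and toJson() helpers (broker addresses and topic names are also placeholders):
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.ProducerRecordCoder;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

PCollection<MyEvent> events = ...; // produced earlier in the pipeline

events
    .apply(MapElements.via(new SimpleFunction<MyEvent, ProducerRecord<String, String>>() {
      @Override
      public ProducerRecord<String, String> apply(MyEvent event) {
        // chooseTopic() is an assumed helper returning e.g. "topic1", "topic2" or "topic3"
        return new ProducerRecord<>(chooseTopic(event), event.getKey(), event.toJson());
      }
    }))
    // Depending on your Beam version, the coder may need to be set explicitly:
    .setCoder(ProducerRecordCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
    .apply(KafkaIO.<String, String>writeRecords()
        .withBootstrapServers("broker_1:9092,broker_2:9092")
        .withTopic("default-topic") // fallback for records without an explicit topic
        .withKeySerializer(StringSerializer.class)
        .withValueSerializer(StringSerializer.class));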
Some Beam sinks (like BigQueryIO) have support for "dynamic destinations", but this isn't the case for KafkaIO. You'll need to set up 3 different sinks for the different topics, and you'll need to split up your messages (perhaps using a Partition transform) into separate collections to feed into those sinks.
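If you go the Partition route, a rough sketch (MyEvent, topicIndexFor(), toJson() and the topic/broker names are assumptions, not from the question):
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringSerializer;

// Split the single collection into 3 collections; topicIndexFor() maps an element to 0, 1 or 2.
PCollectionList<MyEvent> parts =
    events.apply(Partition.of(3, (MyEvent event, int numPartitions) -> topicIndexFor(event)));

String[] topics = {"topic1", "topic2", "topic3"};
for (int i = 0; i < topics.length; i++) {
  parts.get(i)
      .apply(MapElements.into(TypeDescriptors.strings()).via(MyEvent::toJson))
      .apply(KafkaIO.<Void, String>write()
          .withBootstrapServers("broker_1:9092,broker_2:9092")
          .withTopic(topics[i])
          .withValueSerializer(StringSerializer.class)
          .values());
}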
Related
I am attempting to implement a solution where I need to write JSON messages from Pub/Sub into GCS using Dataflow. My question is very similar to this one.
I need to write either based on windowing or element count.
Here is the code sample for the writes from the above question:
windowedValues.apply(FileIO.<String, String>writeDynamic()
    .by(Event::getKey)
    .via(TextIO.sink())
    .to("gs://data_pipeline_events_test/events/")
    .withDestinationCoder(StringUtf8Coder.of())
    .withNumShards(1)
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".json")));
The solution suggests using FileIO.writeDynamic(). But I am not able to understand what .by(Event::getKey) does and where it comes from.
Any help on this is greatly appreciated.
It's partitioning elements into groups according to events' keys.
From my understanding, the events come from a PCollection using the KV class since it has the getKey method.
Note that :: is an operator introduced in Java 8 that is used to refer to a method of a class.
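To make that concrete, here is a small sketch with a hypothetical Event class (the actual Event type lives in the original question's codebase and is not shown here):
// Event::getKey is shorthand for the lambda below; FileIO's .by(...) calls it on each
// element to decide which destination (group of output files) the element belongs to.
SerializableFunction<Event, String> byMethodReference = Event::getKey;
SerializableFunction<Event, String> byLambda = event -> event.getKey();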
I have a requirement where I have to create the topic name based on different values coming in for a field in the <Value object>, so that all the records <K,V> with the same field value go into Topic_<Field>. How can I do this using KStream?
In Kafka 1.1.0, you can use branch() to split a stream into substreams and then write the different substreams into different topics by adding a separate sink operator (i.e., to()) to each substream.
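A rough sketch of that pattern, assuming a value type with a getField() accessor (the type, field values and topic names are placeholders):
KStream<String, MyValue> stream = builder.stream("input-topic");

// branch() evaluates the predicates in order; each record goes to the first matching substream.
KStream<String, MyValue>[] branches = stream.branch(
    (key, value) -> "A".equals(value.getField()),
    (key, value) -> "B".equals(value.getField()),
    (key, value) -> true); // catch-all for everything else

branches[0].to("Topic_A");
branches[1].to("Topic_B");
branches[2].to("Topic_other");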
Kafka 2.0 (to be released in June) adds a new "dynamic routing" feature that simplifies this scenario. Compare: https://cwiki.apache.org/confluence/display/KAFKA/KIP-303%3A+Add+Dynamic+Routing+in+Streams+Sink
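With that feature, the routing collapses to a single dynamic sink; the topic name is computed per record via a TopicNameExtractor (again, getField() is an assumed accessor, and the target topics still have to exist):
// KStream#to(TopicNameExtractor) picks the sink topic per record (Kafka 2.0+).
stream.to((key, value, recordContext) -> "Topic_" + value.getField());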
Note that Kafka Streams requires sink topics to be created manually -- Kafka Streams does not create any sink topics for you. As mentioned by @Hemant, you could turn on auto topic creation. However, it's not recommended in general (one reason is that you might want different configs for different topics, but via auto creation all of them would be created with the same default config).
Also note that a rogue client could DDoS your Kafka cluster if auto topic creation is enabled, by sending "bad data" into the application and thus creating hundreds or thousands of topics (by specifying a different topic name for each message). Thus, enabling auto topic creation is risky and not recommended; create the topics manually instead.
I'd like to join two different topics using Kafka Streams. The two topics have their data in different formats, so I'd like to use different timestamp extractors. I saw that there was a merged pull request for this feature (KAFKA-4144), but I only found it for the Processor API.
Does this feature exist for the Stream API?
StreamsBuilder#stream(...) has an overload taking a Consumed parameter that allows you to specify all optional properties like timestamp extractor.
https://docs.confluent.io/current/streams/javadocs/org/apache/kafka/streams/Consumed.html
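For example, a minimal sketch with two source topics, each with its own extractor (topic names, serdes and the extractor classes are placeholders):
StreamsBuilder builder = new StreamsBuilder();

KStream<String, String> left = builder.stream("topic-a",
    Consumed.with(Serdes.String(), Serdes.String())
            .withTimestampExtractor(new TopicATimestampExtractor()));

KStream<String, String> right = builder.stream("topic-b",
    Consumed.with(Serdes.String(), Serdes.String())
            .withTimestampExtractor(new TopicBTimestampExtractor()));

// The two streams can then be joined as usual, e.g. left.join(right, ...) with a JoinWindows.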
In general, you can find API changes described in the upgrade guides:
https://docs.confluent.io/current/streams/upgrade-guide.html#api-changes
https://kafka.apache.org/10/documentation/streams/upgrade-guide#streams_api_changes_100
I am using Kafka's Streams API with topology builder.
I would like to know how to have a processor that converts one data type to another, so that the next processor in the pipeline can use it.
As a simple use case :
[topic]--(string)-->[processor: parse json]--(object)-->[processor 2]--(object)-->[sink]
Any idea?
I assume you want to convert the message values in a Kafka topic from String to JSON.
You only need two parts:
1. The code to convert a String to JSON (or a POJO). Pick whatever you need here; there are a couple of Java libraries available that make this easy.
2. In Kafka Streams, a value serde for String for reading from your source topic, and a corresponding value serde for writing the JSON data (or POJO) to the destination topic. Serdes are required to materialize your data when/where needed (e.g., writing your POJOs to Kafka requires materialization).
See the code under https://github.com/apache/kafka/tree/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/pageview for an example of how to use JSON with Apache Kafka's Streams API.
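As a rough sketch (using the DSL rather than the low-level topology builder, and Jackson as an arbitrary JSON library; MyPojo, MyPojoSerde and the topic names are assumptions):
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();
ObjectMapper mapper = new ObjectMapper();

// Processor 1: read raw Strings and parse them into a POJO.
KStream<String, String> raw =
    builder.stream("source-topic", Consumed.with(Serdes.String(), Serdes.String()));

KStream<String, MyPojo> parsed = raw.mapValues(value -> {
  try {
    return mapper.readValue(value, MyPojo.class);
  } catch (java.io.IOException e) {
    throw new RuntimeException("Invalid JSON: " + value, e);
  }
});

// Processor 2 (and any further steps) can now work on MyPojo; when writing to the
// sink topic, a matching value serde materializes the POJO back to bytes.
parsed.to("sink-topic", Produced.with(Serdes.String(), new MyPojoSerde()));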
Is there functionality built into Kafka Streams that allows for dynamically connecting a single input stream into multiple output streams? KStream.branch allows branching based on true/false predicates, but this isn't quite what I want. I'd like each incoming log to determine the topic it will be streamed to at runtime, e.g., a log {"date": "2017-01-01"} will be streamed to the topic topic-2017-01-01 and a log {"date": "2017-01-02"} will be streamed to the topic topic-2017-01-02.
I could call forEach on the stream, then write to a Kafka producer, but that doesn't seem very elegant. Is there a better way to do this within the Streams framework?
If you want to create topics dynamically based on your data, you do not get any support within Kafka's Streams API at the moment (v0.10.2 and earlier). You will need to create a KafkaProducer and implement the dynamic "routing" yourself (for example using KStream#foreach() or KStream#process()). Note that you need to do synchronous writes to avoid data loss (which, unfortunately, is not very performant). There are plans to extend the Streams API with dynamic topic routing, but there is no concrete timeline for this feature right now.
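A hedged sketch of that do-it-yourself routing, where logs is the incoming KStream<String, String> (the extractDate() helper, topic naming scheme and broker address are assumptions for illustration):
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);

logs.foreach((key, value) -> {
  // extractDate() is an assumed helper that pulls the "date" field out of the JSON value.
  String topic = "topic-" + extractDate(value); // e.g. "topic-2017-01-01"
  try {
    // send(...).get() blocks until the write is acknowledged, i.e. a synchronous write.
    producer.send(new ProducerRecord<>(topic, key, value)).get();
  } catch (InterruptedException | ExecutionException e) {
    throw new RuntimeException("Failed to write to " + topic, e);
  }
});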
There is one more consideration you should take into account. If you do not know your destination topic(s) ahead of time and just rely on the so-called "topic auto creation" feature, you should make sure that those topics are being created with the desired configuration settings (e.g., number of partitions or replication factor).
As an alternative to "topic auto creation", you can also use the Admin Client (available since v0.10.1) to create topics with the correct configuration. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations
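For example, a sketch using the Java AdminClient from a recent clients version (topic name, partition count and replication factor are example values):
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

try (AdminClient admin = AdminClient.create(adminProps)) {
  NewTopic topic = new NewTopic("topic-2017-01-01", 3, (short) 2); // name, partitions, replication
  admin.createTopics(Collections.singleton(topic)).all().get();    // blocks until created
} catch (InterruptedException | ExecutionException e) {
  throw new RuntimeException("Topic creation failed", e);
}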