I have a small topology. It has a Kafka spout and a bolt that reads from the spout (Bolt A).
Bolt A emits to two bolts (Bolt B and Bolt C). I have used fields grouping.
Bolt A emits two different types of data: one is intended for Bolt B and the other for Bolt C.
My question is: can I configure Storm in such a way that data intended for Bolt B always goes to instances of Bolt B, and the same for Bolt C?
Currently I am checking the data received in each bolt and skipping the unwanted data.
Thanks.
With standard Storm, the easiest way to do this is to use "streams." You define a stream in declareOutputFields with the declareStream method on the OutputFieldsDeclarer, and emit using one of the overloaded versions of emit that lets you specify a stream ID. You also need to use the version of shuffleGrouping that makes the subscribing bolt subscribe to a specific stream.
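For illustration, here is a minimal sketch of that wiring, assuming the org.apache.storm (2.x) API; the stream names, field names, and the isForB() routing check are made up for this example:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class BoltA extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        Object data = input.getValue(0);
        if (isForB(data)) {                                    // hypothetical routing check
            collector.emit("stream-b", input, new Values(data));
        } else {
            collector.emit("stream-c", input, new Values(data));
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // One named stream per consumer bolt.
        declarer.declareStream("stream-b", new Fields("data"));
        declarer.declareStream("stream-c", new Fields("data"));
    }

    private boolean isForB(Object data) {
        return true; // placeholder for whatever distinguishes the two data types
    }
}

// When building the topology, each bolt subscribes only to its own stream:
builder.setBolt("bolt-b", new BoltB()).shuffleGrouping("bolt-a", "stream-b");
builder.setBolt("bolt-c", new BoltC()).shuffleGrouping("bolt-a", "stream-c");

Since the question uses fields grouping: fieldsGrouping also has an overload that takes a stream ID, so the same idea applies there.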
Currently, I'm working on an Apache Beam pipeline that consumes data from three different Kafka topics and, after some processing, creates three types of objects from that data. Finally, those three objects need to be published to three different Kafka topics.
It is possible to read from multiple topics using the withTopics method in KafkaIO.read, but I did not find a KafkaIO feature for writing to multiple topics.
I would like some advice on the most idiomatic way to do this; code examples would be appreciated.
You can do that with three different sinks on a PCollection, for example:
@Rule
public final transient TestPipeline pipeline = TestPipeline.create();

@Test
public void kafkaIOSinksTest() {
    PCollection<String> inputCollection =
        pipeline.apply(Create.of(Arrays.asList("Object 1", "Object 2")));

    // The same PCollection is written by three independent KafkaIO sinks.
    inputCollection.apply(KafkaIO.<Void, String>write()
        .withBootstrapServers("broker_1:9092,broker_2:9092")
        .withTopic("topic1")
        .withValueSerializer(StringSerializer.class)
        .values());

    inputCollection.apply(KafkaIO.<Void, String>write()
        .withBootstrapServers("broker_1:9092,broker_2:9092")
        .withTopic("topic2")
        .withValueSerializer(StringSerializer.class)
        .values());

    inputCollection.apply(KafkaIO.<Void, String>write()
        .withBootstrapServers("broker_1:9092,broker_2:9092")
        .withTopic("topic3")
        .withValueSerializer(StringSerializer.class)
        .values());

    pipeline.run();
}
In this example, the same PCollection is written to three different topics via multiple sinks.
You can use KafkaIO.<K, V>writeRecords() for that. It takes a PCollection<ProducerRecord<K, V>> as input, so you just need to specify the required output topic in the ProducerRecord for every element, or use a default one.
Please take a look at this test as an example.
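For illustration, a minimal sketch of that approach, assuming a PCollection<String> inputCollection like the one above; the routing rule, topic names, and brokers are placeholders, and depending on your Beam version you may need to set the ProducerRecord coder explicitly as shown:

PCollection<ProducerRecord<String, String>> records = inputCollection
    .apply("RouteToTopic", MapElements.via(
        new SimpleFunction<String, ProducerRecord<String, String>>() {
            @Override
            public ProducerRecord<String, String> apply(String value) {
                // Hypothetical per-element routing rule.
                String topic = value.startsWith("A") ? "topic-a" : "topic-b";
                return new ProducerRecord<>(topic, value, value);
            }
        }))
    .setCoder(ProducerRecordCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()));

records.apply(KafkaIO.<String, String>writeRecords()
    .withBootstrapServers("broker_1:9092,broker_2:9092")
    .withKeySerializer(StringSerializer.class)
    .withValueSerializer(StringSerializer.class));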
Some Beam sinks (like BigQueryIO) support "dynamic destinations," but this isn't the case for KafkaIO. You'll need to set up three different sinks for the different topics, and you'll need to split up your messages (perhaps using a Partition transform) into separate collections to feed into those sinks, as sketched below.
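A rough sketch of the Partition approach; the three-way predicate is a placeholder for whatever actually distinguishes your three object types:

PCollectionList<String> parts = inputCollection.apply(
    Partition.of(3, new Partition.PartitionFn<String>() {
        @Override
        public int partitionFor(String msg, int numPartitions) {
            // Hypothetical routing: 0, 1, or 2 depending on the message content.
            if (msg.startsWith("A")) return 0;
            if (msg.startsWith("B")) return 1;
            return 2;
        }
    }));

parts.get(0).apply(KafkaIO.<Void, String>write()
    .withBootstrapServers("broker_1:9092,broker_2:9092")
    .withTopic("topic1")
    .withValueSerializer(StringSerializer.class)
    .values());
// ...and likewise parts.get(1) to "topic2" and parts.get(2) to "topic3".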
So I am running Kafka with two topics, and they will both be serving files:
Topic 1 - I read the string value from it and process it.
Topic 2 - I need the binary values for processing.
So I am guessing I would need two consumers?
The reasoning behind this is that in the properties for the data from the first topic I need a StringSerializer/Deserializer, and for the second one I need a ByteArraySerializer/Deserializer.
So is it feasible to have two consumer implementations?
I know I could get the string value and convert it to binary, or vice versa, but having two consumers seems like the cleaner way of doing it.
Thanks in advance.
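For reference, a minimal sketch of what the two-consumer setup described above could look like; the brokers, topic names, and group ids are placeholders, and the classes are the standard org.apache.kafka.clients / org.apache.kafka.common.serialization ones:

Properties stringProps = new Properties();
stringProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
stringProps.put(ConsumerConfig.GROUP_ID_CONFIG, "string-consumer");
stringProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
stringProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
KafkaConsumer<String, String> stringConsumer = new KafkaConsumer<>(stringProps);
stringConsumer.subscribe(Collections.singletonList("topic1"));

Properties binaryProps = new Properties();
binaryProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
binaryProps.put(ConsumerConfig.GROUP_ID_CONFIG, "binary-consumer");
binaryProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
binaryProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
KafkaConsumer<String, byte[]> binaryConsumer = new KafkaConsumer<>(binaryProps);
binaryConsumer.subscribe(Collections.singletonList("topic2"));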
I have a requirement where I have to create the topic name based on different values coming in for a field in the value object, so that all records <K,V> with the same field value go to Topic_<Field>. How can I do this using KStream?
In Kafka 1.1.0, you can use branch() to split a stream into substreams and then write the different substreams into different topics by adding a different sink operator (i.e., to()) to each substream.
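A rough sketch of that approach; the predicates inspect the record value directly here, whereas your actual predicates would look at the field in your value object:

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream = builder.stream("input-topic");

// branch() evaluates the predicates in order; the last one catches everything else.
KStream<String, String>[] branches = stream.branch(
    (key, value) -> value.startsWith("A"),
    (key, value) -> value.startsWith("B"),
    (key, value) -> true);

branches[0].to("Topic_A");
branches[1].to("Topic_B");
branches[2].to("Topic_other");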
Kafka 2.0 (to be released in June) adds a new "dynamic routing" feature that simplifies this scenario. Compare: https://cwiki.apache.org/confluence/display/KAFKA/KIP-303%3A+Add+Dynamic+Routing+in+Streams+Sink
Note that Kafka Streams requires sink topics to be created manually -- Kafka Streams does not create any sink topics for you. As mentioned by @Hemant, you could turn on auto topic creation. However, it's not recommended in general (one reason is that you might want different configs for different topics, but via auto creation all topics would be created with the same default config).
Also note that a rogue client could DDoS your Kafka cluster if auto topic creation is enabled, by sending "bad data" into the application and thus creating hundreds or thousands of topics (by specifying a different topic name for each message). Thus, enabling auto topic creation is risky and not recommended; create topics manually instead.
Is there functionality built into Kafka Streams that allows for dynamically routing a single input stream to multiple output streams? KStream.branch allows branching based on true/false predicates, but this isn't quite what I want. I'd like each incoming log to determine the topic it will be streamed to at runtime, e.g., a log {"date": "2017-01-01"} will be streamed to the topic topic-2017-01-01 and a log {"date": "2017-01-02"} will be streamed to the topic topic-2017-01-02.
I could call forEach on the stream and then write to a Kafka producer, but that doesn't seem very elegant. Is there a better way to do this within the Streams framework?
If you want to create topics dynamically based on your data, you do not get any support within Kafka's Streams API at the moment (v0.10.2 and earlier). You will need to create a KafkaProducer and implement the dynamic "routing" yourself (for example using KStream#foreach() or KStream#process()). Note that you need to do synchronous writes to avoid data loss (which, unfortunately, are not very performant). There are plans to extend the Streams API with dynamic topic routing, but there is no concrete timeline for this feature right now.
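A rough sketch of that manual approach, mirroring the date-based topic names from the question; extractDate() is a hypothetical helper and the producer config is a placeholder:

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);

stream.foreach((key, value) -> {
    String topic = "topic-" + extractDate(value);   // e.g. "topic-2017-01-01"
    try {
        // Block on the future, i.e. a synchronous write, to avoid silent data loss.
        producer.send(new ProducerRecord<>(topic, key, value)).get();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
});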
There is one more consideration you should take into account. If you do not know your destination topic(s) ahead of time and just rely on the so-called "topic auto creation" feature, you should make sure that those topics are created with the desired configuration settings (e.g., number of partitions or replication factor).
As an alternative to topic auto creation, you can also use the Admin Client (available since v0.10.1) to create topics with the correct configuration. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations
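For example, a minimal sketch of creating a topic with explicit settings via the Java AdminClient (assuming a client version that ships it); the topic name, partition count, and replication factor are placeholders for whatever your setup needs:

Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
try (AdminClient admin = AdminClient.create(adminProps)) {
    NewTopic topic = new NewTopic("topic-2017-01-01", 3, (short) 2); // 3 partitions, replication factor 2
    admin.createTopics(Collections.singleton(topic)).all().get();     // block until the topic exists
} catch (Exception e) {
    throw new RuntimeException("Topic creation failed", e);
}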
My project streams object data through Storm to a graphics application. The appearance of these objects depends on variables assigned by a bolt in the Storm topology.
My question is whether it is possible to update the bolt process by sending it a message that changes the variables it attaches to object data -- for example, sending a message to the bolt declaring that I want any object with parameter x above a certain number to appear as red rather than blue.
The bolt process would then append a red RGB variable to the object data rather than a blue one.
I was thinking this would be possible by having a displayConfig class that the bolt uses to apply appearance and whose contents can be edited by messages with a certain header.
Is this possible?
It is possible, but you need to do it manually and prepare your topology accordingly before you start it.
There are two ways to do this:
1. Use a local config file for the bolt that you put on the worker machines (maybe via NFS). The bolts regularly check the file for updates and read the updated configuration whenever you change the file.
2. Use one more spout that produces a configuration stream. All bolts that should receive configuration updates at runtime need to consume from this configuration spout via "allGrouping". When processing an input tuple, you check whether it is a regular data tuple or a configuration tuple (and update your config accordingly), as sketched below.
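A rough sketch of the second option; the component names, field names, and the ConfigSpout/ColorBolt/applyAppearance pieces are made up for illustration:

// Topology wiring: every ColorBolt instance receives every configuration tuple.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("data-spout", new KafkaDataSpout());
builder.setSpout("config-spout", new ConfigSpout());
builder.setBolt("color-bolt", new ColorBolt(), 4)
       .shuffleGrouping("data-spout")
       .allGrouping("config-spout");

// Inside ColorBolt, distinguish the two tuple sources in execute():
@Override
public void execute(Tuple input) {
    if ("config-spout".equals(input.getSourceComponent())) {
        // Configuration tuple: update the threshold/color used for appearance.
        this.threshold = input.getDoubleByField("threshold");
        this.highColor = input.getStringByField("color");
    } else {
        // Regular data tuple: append the rgb variable based on the current config and emit.
        collector.emit(input, new Values(applyAppearance(input)));
    }
    collector.ack(input);
}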