I have two Kafka Spouts whose values I want to send to the same bolt.
Is it possible?
Yes it is possible:
TopologyBuilder b = new TopologyBuilder();
b.setSpout("topic_1", new KafkaSpout(...));
b.setSpout("topic_2", new KafkaSpout(...));
b.setBolt("bolt", new MyBolt(...)).shuffleGrouping("topic_1").shuffleGrouping("topic_2");
You can use any other grouping, too.
Update:
In order to distinguish the tuples (i.e., whether they came from topic_1 or topic_2) in the consumer bolt, there are two possibilities:
1) You can use operator IDs (as suggested by @user-4870385):
if(input.getSourceComponent().equalsIgnoreCase("topic_1")) {
//do something
} else {
//do something
}
2) You can use stream names (as suggested by @zenbeni). For this, both spouts need to declare named streams and the bolt needs to connect to the spouts by stream name:
public class MyKafkaSpout extends KafkaSpout {
    private final String streamName;
    private final SpoutConfig spoutConfig;

    public MyKafkaSpout(SpoutConfig spoutConfig, String stream) {
        super(spoutConfig);
        this.spoutConfig = spoutConfig;
        this.streamName = stream;
    }

    // other stuff omitted

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // compare KafkaSpout.declareOutputFields(...): declare a *named* stream
        declarer.declareStream(streamName, spoutConfig.scheme.getOutputFields());
    }
}
When building the topology, the stream names must now be used:
TopologyBuilder b = new TopologyBuilder();
b.setSpout("topic_1", new MyKafkaSpout("stream_t1"));
b.setSpout("topic_2", new MyKafkaSpout("stream_t2"));
b.setBolt("bolt", new MyBolt(...)).shuffleGrouping("topic_1", "stream_t1").shuffleGrouping("topic_2", "stream_t2");
In MyBolt the stream name can now be used to distinguish input tuples:
// in MyBolt.execute():
if(input.getSourceStreamId().equals("stream_t1")) {
    // do something with tuples from topic_1
} else {
    // do something with tuples from topic_2
}
Discussion:
While the second approach using stream names is more natural (according to @zenbeni), the first is more flexible (IMHO). Stream names are declared by the spout/bolt directly (i.e., at the time the spout/bolt code is written); in contrast, operator IDs are assigned when the topology is put together (i.e., at the time the spout/bolt is used).
Let's assume we get three bolts as class files (no source code). The first two should be used as producers, and both declare output streams with the same name. If the third (consumer) bolt distinguishes input tuples by stream name, this will not work. Even if the two producer bolts declare different output stream names, the expected input stream names might be hard coded in the consumer bolt and might not match, so it does not work either. However, if the consumer bolt uses component names (even hard-coded ones) to distinguish incoming tuples, the expected component IDs can simply be assigned when the topology is put together.
Of course, it would be possible to inherit from the given classes (if they are not declared final) and override declareOutputFields(...) in order to assign your own stream names. However, this is additional work.
Yes, it's possible. You can have any number of spouts feeding the same bolt.
Refer to the "Streams" section of https://storm.apache.org/documentation/Tutorial.html.
Related
My scenario: I want to trigger one stream based on the input of another stream, and the two streams have different types. The following is my sample code; I want to trigger one stream whenever a message is received from the Kafka stream.
At application start-up I can read data from the DB. Then I want to read from the DB again based on a Kafka message: whenever I receive a Kafka message in the stream, I want to fetch data from the DB again. This is my actual use case.
How can I achieve this? Is it possible?
public class DataStreamCassandraExample implements Serializable{
private static final long serialVersionUID = 1L;
static Logger LOG = LoggerFactory.getLogger(DataStreamCassandraExample.class);
private transient static StreamExecutionEnvironment env;
static DataStream<Tuple4<UUID,String,String,String>> inputRecords;
public static void main(String[] args) throws Exception {
env = StreamExecutionEnvironment.getExecutionEnvironment();
ParameterTool argParameters = ParameterTool.fromArgs(args);
env.getConfig().setGlobalJobParameters(argParameters);
Properties kafkaProps = new Properties();
kafkaProps.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092");
kafkaProps.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "group1");
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("testtopic", new SimpleStringSchema(), kafkaProps);
ClusterBuilder cb = new ClusterBuilder() {
private static final long serialVersionUID = 1L;
@Override
public Cluster buildCluster(Cluster.Builder builder) {
return builder.addContactPoint("127.0.0.1")
.withPort(9042)
.withoutJMXReporting()
.build();
}
};
CassandraInputFormat<Tuple4<UUID,String,String,String>> cassandraInputFormat =
new CassandraInputFormat<> ("select * from employee_details", cb);
// At application start-up, read data from the table and emit it as a stream
inputRecords = getDBData(env,cassandraInputFormat);
// Whenever data arrives from Kafka, I want to read from the table again.
// How do I trigger the getDBData() method from inside this stream?
// The code below does not work.
DataStream<String> inputRecords1= env.addSource(kafkaConsumer)
.map(new MapFunction<String,String>() {
private static final long serialVersionUID = 1L;
@Override
public String map(String value) throws Exception {
inputRecords = getDBData(env,cassandraInputFormat);
return "OK";
}
});
// This is not printed when I call the getDBData() stream from inside the Kafka stream.
inputRecords1.print();
DataStream<Employee> empDataStream = inputRecords.map(new MapFunction<Tuple4<UUID,String,String,String>, Tuple2<String,Employee>>() {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<String, Employee> map(Tuple4<UUID,String,String,String> value) throws Exception {
Employee emp = new Employee();
try{
emp.setEmpid(value.f0);
emp.setFirstname(value.f1);
emp.setLastname(value.f2);
emp.setAddress(value.f3);
}
catch(Exception e){
}
return new Tuple2<>(emp.getEmpid().toString(), emp);
}
}).keyBy(0).map(new MapFunction<Tuple2<String,Employee>,Employee>() {
private static final long serialVersionUID = 1L;
@Override
public Employee map(Tuple2<String, Employee> value)
throws Exception {
return value.f1;
}
});
empDataStream.print();
env.execute();
}
private static DataStream<Tuple4<UUID,String,String,String>> getDBData(StreamExecutionEnvironment env,
CassandraInputFormat<Tuple4<UUID,String,String,String>> cassandraInputFormat){
DataStream<Tuple4<UUID,String,String,String>> inputRecords = env
.createInput
(cassandraInputFormat
,TupleTypeInfo.of(new TypeHint<Tuple4<UUID,String,String,String>>() {}));
return inputRecords;
}
}
This is going to be a very verbose answer.
To use Flink correctly as a developer, you need an understanding of its basic concepts. I suggest you start with the architecture overview (https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html); it contains all you need to know to get into the world of Flink when you come from programming.
Now, looking at your code, it will not do what you expect because of how Flink reads it. You need to understand that Flink has at least two big steps when it executes your code: first it builds an execution graph, which only describes what it needs to do. This happens at the job manager level. The second big step is to ask one or more workers to execute the graph. These two steps are sequential, and anything you do regarding the graph description has to be done at the job manager level, not inside your operations.
In your case, the graph has:
A Kafka source.
A map that will call getDBData() at the worker level (not good, because getDBData() alters the graph by adding a new input each time it is called).
The line inputRecords = getDBData(env,cassandraInputFormat); creates an orphan branch of the graph, and the line DataStream<Employee> empDataStream = inputRecords.map... appends a map->keyBy->map branch to that orphan branch. This part of the graph reads all the employee records from Cassandra and applies the map->keyBy->map transformations, but it is not linked to the Kafka source in any way.
Now let's get back to your need. I understand you need to fetch data for an employee when his/her id comes from Kafka and do some operations.
The cleanest way to handle this is called Side Inputs. This is a data input that you declare when you build your graph; the job manager handles reading the data and transmitting it to the workers. The bad news is that Side Inputs do not yet work for streaming jobs in Flink (https://issues.apache.org/jira/browse/FLINK-2491 - this bug causes streaming jobs to not create checkpoints, because side inputs finish quickly and this puts the job in a bizarre state).
With that said, you still have three more options. The right one depends on the size of your employee Cassandra table.
The second option is to load all employees into a static final variable employees and use it inside your map functions. The downside of this approach is that the job manager will send a serialized copy of this variable to all workers, which may congest your network and may also overload the RAM. If the table is small and will not grow in the future, this may be an acceptable workaround until Side Inputs work for streaming jobs. If the table is big or is expected to grow, consider the third option.
The third option is an improvement of the second one. It uses Flink's broadcast variables (see https://flink.apache.org/2019/06/26/broadcast-state.html and https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html). Short story: it is the same as before, but with better transfer management. Flink finds the best way to store and send the variable to the workers. This approach is, however, a little more complicated to implement correctly.
The last option is not advisable under normal circumstances. It simply consists of making a call to Cassandra inside your map operation. This is not good practice because it adds repeated latency to every map execution (there will be as many calls as items passing through Kafka). Each call means creating a connection, sending the actual query, waiting for Cassandra to reply, and freeing the connection. That is a lot of work for one step in your graph. It is a solution to consider only when you really cannot find an alternative.
For your case, I would advise the third option. I guess the employee table is not very big, so using broadcast variables is a good choice.
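To make the third option concrete, here is a rough sketch of the broadcast-state pattern (my own illustration, untested). It assumes employees is a DataStream<Employee> built once at start-up (e.g. via env.createInput(cassandraInputFormat, ...), as in your getDBData()) and kafkaIds is the DataStream<String> of employee ids read from Kafka, matching the interpretation above; the descriptor name "employees" and the method name enrich are invented.
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class EmployeeEnrichmentSketch {
    // kafkaIds: employee ids read from Kafka; employees: the rows read from Cassandra.
    static DataStream<Employee> enrich(DataStream<String> kafkaIds, DataStream<Employee> employees) {
        // Replicated map (employee id -> Employee), maintained on every worker.
        final MapStateDescriptor<String, Employee> employeeState =
                new MapStateDescriptor<>("employees", String.class, Employee.class);

        BroadcastStream<Employee> employeeBroadcast = employees.broadcast(employeeState);

        return kafkaIds
                .connect(employeeBroadcast)
                .process(new BroadcastProcessFunction<String, Employee, Employee>() {
                    @Override
                    public void processElement(String empId, ReadOnlyContext ctx,
                                               Collector<Employee> out) throws Exception {
                        // For each Kafka message, look the employee up in the broadcast copy.
                        Employee emp = ctx.getBroadcastState(employeeState).get(empId);
                        if (emp != null) {
                            out.collect(emp);
                        }
                    }

                    @Override
                    public void processBroadcastElement(Employee emp, Context ctx,
                                                        Collector<Employee> out) throws Exception {
                        // Every broadcast record updates the replicated map on this worker.
                        ctx.getBroadcastState(employeeState).put(emp.getEmpid().toString(), emp);
                    }
                });
    }
}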
I have a multi-schema Kafka Streams application that enriches a record via a join to a KTable, and then passes the enriched record along.
The input topic naming format is currently well defined but I'm changing this to a wildcard. I want to determine the input topic of each record, derive the output topic via regex replacement, and send it on.
E.g. While listening to event.raw.* a record comes in on event.raw.foo and I wish to pass it out on event.foo.
I realise I can get the input topics via the Processor API:
public class EnrichmentProcessor extends AbstractProcessor<String, GenericRecord> {
@Override
public void process(String key, GenericRecord value) {
//Do Join...
//Determine output topic and forward
String outputTopic = context().topic().replaceFirst(".raw.", ".");
context().forward(key, value, To.child(outputTopic));
context().commit();
}
}
But this doesn't help me when I'm trying to define my Topology because I have no way of knowing up front what my output topic is going to be.
InternalTopologyBuilder topologyBuilder = new InternalTopologyBuilder();
topologyBuilder.addSource("SOURCE", stringDeserializer, genericRecordDeserializer, "event.raw.*")
.addProcessor("ENRICHER", EnrichmentProcessor::new, "SOURCE")
.addSink("OUTPUT", outputTopic, stringSerializer, genericRecordSerializer, "ENRICHER"); // How can I register all possible output topics here?
Has anyone solved a situation like this before?
I know that if I had a list of possible output-topic names up front I could define multiple sinks on the topology, but I'm not going to have one.
Is there a way to define the topology with dynamically allocated output topic names when I don't have a hard-coded list of possible output topic names up front?
This should be possible: you can use Topology#addSink(..., new TopicNameExtractor(){...}, ...) to set the output topic name dynamically. TopicNameExtractor has access to the RecordContext, which allows you to get the input topic name via context.topic(). Hence, you should be able to compute the output topic name based on the input topic name.
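For example, a minimal sketch (my own, not tested): a TopicNameExtractor that rewrites event.raw.foo to event.foo from the record's source topic. The class name RawTopicRewriter is invented; the node names and serde variables are taken from your snippet.
import java.util.regex.Pattern;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.RecordContext;
import org.apache.kafka.streams.processor.TopicNameExtractor;

public class RawTopicRewriter implements TopicNameExtractor<String, GenericRecord> {
    @Override
    public String extract(String key, GenericRecord value, RecordContext recordContext) {
        // event.raw.foo -> event.foo, derived from the topic the record came from
        return recordContext.topic().replaceFirst("\\.raw\\.", ".");
    }
}
Wired into a Topology (stringDeserializer, genericRecordDeserializer, stringSerializer and genericRecordSerializer are the serde instances from your snippet):
Topology topology = new Topology();
topology.addSource("SOURCE", stringDeserializer, genericRecordDeserializer, Pattern.compile("event\\.raw\\..*"));
topology.addProcessor("ENRICHER", EnrichmentProcessor::new, "SOURCE");
topology.addSink("OUTPUT", new RawTopicRewriter(), stringSerializer, genericRecordSerializer, "ENRICHER");
Inside EnrichmentProcessor.process() you then forward with a plain context().forward(key, value) and let the sink's extractor choose the concrete output topic.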
I am trying to generate a class and methods in it, using Byte Buddy, based on some configuration that is available at runtime. The class is trying to create a Hazelcast Jet pipeline to join multiple IMaps.
Based on the provided configuration, the no. of IMaps to join can vary. In the sample below, I am trying to join three IMaps.
private Pipeline getPipeline(IMap<String, Object1> object1Map, IMap<String, Object2> object2Map,
IMap<String, Object3> object3Map) {
Pipeline p = Pipeline.create();
BatchStage<Entry<String, Object1>> obj1 = p.drawFrom(Sources.map(object1Map));
BatchStage<Entry<String, Object2>> obj2 = p.drawFrom(Sources.map(object2Map));
BatchStage<Entry<String, Object3>> obj3 = p.drawFrom(Sources.map(object3Map));
DistributedFunction<Tuple2<Object1, Object2>, String> obj1Obj2JoinFunc = entry -> entry.f1().getField31();
DistributedBiFunction<Tuple2<Object1, Object2>, Object3, Tuple2<Tuple2<Object1, Object2>, Object3>> output = (
in1, in2) -> (Tuple2.tuple2(in1, in2));
BatchStage<Tuple2<Object1, Object2>> obj1_obj2 = obj1.map(entry -> entry.getValue())
.hashJoin(obj2.map(entry -> entry.getValue()),
JoinClause.onKeys(Object1::getField11, Object2::getField21), Tuple2::tuple2).filter(entry -> entry.getValue() != null);
BatchStage<Tuple2<Tuple2<Object1, Object2>, Object3>> obj1_obj2_obj3 = obj1_obj2.hashJoin(
obj3.map(entry -> entry.getValue()),
JoinClause.onKeys(obj1Obj2JoinFunc, Object3::getField31), output)
.filter(entry -> entry.getValue() != null);
// the transformResult method will get the required fields from above operation and create object of AllObjectJoinClass
BatchStage<Entry<String, AllObjectJoinClass>> result = transformResult(obj1_obj2_obj3);
result.drainTo(Sinks.map("obj1_obj2_obj3"));
return p;
}
The problem here is that the number of arguments to my method depends on the runtime configuration, and that determines the method body as well.
I am able to generate the method signature using TypeDescription.Generic.Builder.parameterizedType.
But I am having trouble generating the method body. I tried using MethodDelegation.to so that the method resides in a separate class. The trouble with this approach is that the method in the separate class needs to be very generic so that it can take an arbitrary number of arguments of different types, and it also needs to know about the fields of each of the objects in the IMaps.
I wonder if there is an alternate approach to achieving this, maybe with templates of some kind, so that a separate class can be generated for each pipeline with this body. I didn't find any documentation on generating a method with a defined body (maybe I missed something).
-- Anoop
It very much depends on what you are trying to do:
With Advice, you can write a template as byte code that is inlined into your method.
With StackManipulations you can compose individual byte code instructions.
It seems to me that option (2) is what you are aiming for. For individually composed code, this is often the easiest option.
Writing individual byte code is of course not the most convenient option, but if you can easily compose the handling of each input, you might be able to compose multiple Advice classes to avoid using byte code instructions directly.
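As a feel for option (2), here is a minimal, self-contained sketch (my own; the class and method names are invented and it has nothing to do with the Jet pipeline itself): it generates a class whose method body is composed from two StackManipulations, pushing a string constant and returning it. A generated pipeline method would compose many more manipulations, but the mechanics are the same.
import java.lang.reflect.Method;

import net.bytebuddy.ByteBuddy;
import net.bytebuddy.description.modifier.Visibility;
import net.bytebuddy.implementation.Implementation;
import net.bytebuddy.implementation.bytecode.constant.TextConstant;
import net.bytebuddy.implementation.bytecode.member.MethodReturn;

public class StackManipulationSketch {
    public static void main(String[] args) throws Exception {
        // Method body: push a string constant onto the stack, then return it.
        Implementation body = new Implementation.Simple(
                new TextConstant("generated pipeline stub"),
                MethodReturn.REFERENCE);

        Class<?> generated = new ByteBuddy()
                .subclass(Object.class)
                .defineMethod("describe", String.class, Visibility.PUBLIC)
                .intercept(body)
                .make()
                .load(StackManipulationSketch.class.getClassLoader())
                .getLoaded();

        // Invoke the generated method via reflection to show it works.
        Method describe = generated.getMethod("describe");
        System.out.println(describe.invoke(generated.getDeclaredConstructor().newInstance()));
    }
}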
I'm trying to use wholeTextFiles to read all the file names in a folder and process each file separately (for example, I'm trying to get the SVD vector of each data set, and there are 100 sets in total). The data are saved in .txt files, with values separated by spaces and arranged in different lines (like a matrix).
The problem I came across is that after I use wholeTextFiles("path with all the text files"), it's really difficult to read and parse the data, and I can't use the method I used when reading only one file. That method works fine when I read just one file and gives me the correct output. Could someone please let me know how to fix this? Thanks!
public static void main (String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("whole text files").setMaster("local[2]").set("spark.executor.memory","1g");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaPairRDD<String, String> fileNameContentsRDD = jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/");
JavaRDD<String[]> lineCounts = fileNameContentsRDD.map(new Function<Tuple2<String, String>, String[]>() {
@Override
public String[] call(Tuple2<String, String> fileNameContent) throws Exception {
String content = fileNameContent._2();
String[] sarray = content.split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i< sarray.length; i++){
values[i] = Double.parseDouble(sarray[i]);
}
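// Note: 'pd' (presumably the JavaRDD<Vector> of matrix rows from the single-file
// version) is never defined in this snippet, which is where the approach breaks down.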
pd.cache();
RowMatrix mat = new RowMatrix(pd.rdd());
SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
Vector s = svd.s();
}});
Quoting the scaladoc of SparkContext.wholeTextFiles:
wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
In other words, wholeTextFiles might not simply be what you want.
Since by design "Small files are preferred" (see the scaladoc), you could mapPartitions or collect (with filter) to grab a subset of the files to apply the parsing to.
Once you have the files per partition in your hands, you could use Scala's Parallel Collection API and schedule Spark jobs to execute in parallel:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
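Putting these pieces together, here is a rough sketch (my own illustration, not tested on your data): collect the (path, content) pairs to the driver, parse each file's content line by line, and submit one small Spark job per file from separate threads via parallelStream(). The directory path and the rank 84 are taken from your code; the class and variable names are invented.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.SingularValueDecomposition;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

import scala.Tuple2;

public class PerFileSvd {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("per-file SVD").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // One (path, content) pair per file; collect them to the driver
        // ("Small files are preferred", so this should fit in memory).
        List<Tuple2<String, String>> files =
                jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/").collect();

        // Submit one Spark job per file from separate threads.
        files.parallelStream().forEach(fileNameContent -> {
            List<Vector> rows = new ArrayList<>();
            for (String line : fileNameContent._2().split("\\R")) {      // split content into lines
                if (line.trim().isEmpty()) {
                    continue;
                }
                String[] tokens = line.trim().split("\\s+");             // split each line into values
                double[] values = new double[tokens.length];
                for (int i = 0; i < tokens.length; i++) {
                    values[i] = Double.parseDouble(tokens[i]);
                }
                rows.add(Vectors.dense(values));
            }
            RowMatrix mat = new RowMatrix(jsc.parallelize(rows).rdd());
            SingularValueDecomposition<RowMatrix, Matrix> svd =
                    mat.computeSVD(84, true, 1.0E-9d);
            System.out.println(fileNameContent._1() + " -> singular values: " + svd.s());
        });

        jsc.close();
    }
}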
My code is the following:
StreamExecutionEnvironment env= StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<MyObject> input = env.addSource(new MyCustomSource());
Pattern<MyObject, ?> pattern = Pattern.<MyObject>begin("start");
PatternStream<MyObject> patternStream = CEP.pattern(input, pattern);
... define my pattern
DataStream<MyObject> resultStream = patternStream.select(new MyCustomPatternSelectFunction());
resultStream.addSink(new MyCustomSinkFunction(subscriptionCriteria));
try
{
env.execute();
}
catch (Exception exception)
{
log.debug("Error while ", exception);
}
This code works and does what I want: I get a result stream that follows the pattern I set.
What I want to know is whether it is possible to later apply new patterns to the source I added to the environment, and thus get different result streams matching different patterns, without calling env.execute() another time. When I do call it again, in addition to my new result stream I get redundant old result streams (i.e., the old patterns get executed multiple times).
At the moment Flink's CEP library does not support dynamic pattern changes out of the box. Thus, once you've defined your pattern and started your job, it will only process this defined pattern.
However, you can write your own operator implementing the TwoInputStreamOperator interface, which receives pattern definitions on one input and the stream records on the other (similar to a CoFlatMap function). For every new pattern you would then compile a new NFA on the operator and feed any new incoming stream elements to this NFA as well. That way, you could achieve the intended behaviour.
In the future, we will most likely add this feature to Flink's CEP library.
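To give a feel for the shape of such an operator, here is a rough, untested skeleton (the class name, type parameters and wiring are mine; the NFA compilation and evaluation are only indicated in comments, because Flink's NFACompiler is an internal, version-dependent API):
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.TwoInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

// Input 1: pattern definitions, input 2: the events; output: matched events.
public class DynamicPatternOperator<PATTERN_DEF, EVENT>
        extends AbstractStreamOperator<EVENT>
        implements TwoInputStreamOperator<PATTERN_DEF, EVENT, EVENT> {

    // The most recently received pattern definition.
    private transient PATTERN_DEF currentPattern;

    @Override
    public void processElement1(StreamRecord<PATTERN_DEF> patternRecord) throws Exception {
        currentPattern = patternRecord.getValue();
        // Here you would compile a new NFA from the pattern definition and
        // replace the previously active one.
    }

    @Override
    public void processElement2(StreamRecord<EVENT> eventRecord) throws Exception {
        if (currentPattern == null) {
            return; // no pattern registered yet, drop or buffer the event
        }
        // Here you would feed the event into the compiled NFA and, on a match,
        // emit the matched result downstream, e.g.:
        // output.collect(new StreamRecord<>(matchedEvent, eventRecord.getTimestamp()));
    }
}
You would then attach it to your streams with something like patternDefinitions.connect(input).transform("dynamic-cep", outputTypeInfo, new DynamicPatternOperator<>()).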