I have a streaming job configured as
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Record> stream = env.addSource(new FlinkKafkaConsumer(
SystemsCpu.TOPIC,
ConfluentRegistryAvroDeserializationSchema.forGeneric(SystemsCpu.SCHEMA, registry),
config)
.setStartFromLatest());
DataStream<Anomaly> anomalies = stream
.keyBy(x -> x.get("host").toString())
.window(SlidingEventTimeWindows.of(Time.minutes(20), Time.seconds(20))) // produces output with TumblingEventTimeWindows
.process(new AnomalyDetector())
.name("anomaly-detector");
public class AnomalyDetector extends ProcessWindowFunction<Record, Anomaly, String, TimeWindow> {
@Override
public void process(String key, Context context, Iterable<Record> input, Collector<Anomaly> out) {
var anomaly = new Anomaly();
anomaly.setValue(1.0);
out.collect(anomaly);
}
}
However, for some reason SlidingEventTimeWindows does not produce any output for the AnomalyDetector to process (i.e., process() is not triggered at all). If I use, for example, TumblingEventTimeWindows, it works as expected.
Any ideas what might be causing this? Am I using SlidingEventTimeWindows incorrectly?
When doing any sort of event time windowing it is necessary to provide a WatermarkStrategy. Watermarks mark a spot in the stream, and signal that the stream is complete up through some specific point in time. Event time windows can only be triggered by the arrival of a sufficiently large watermark.
See the docs for details, but it could look something like this:
DataStream<MyType> timestampedEvents = events
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<MyType>forBoundedOutOfOrderness(Duration.ofSeconds(10))
.withTimestampAssigner((event, timestamp) -> event.timestamp));
However, since you are using Kafka, it's usually better to have the Flink Kafka consumer do the watermarking:
FlinkKafkaConsumer<MyType> kafkaSource = new FlinkKafkaConsumer<>("myTopic", schema, props);
kafkaSource.assignTimestampsAndWatermarks(WatermarkStrategy...);
DataStream<MyType> stream = env.addSource(kafkaSource);
Note that if you use this latter approach, and if your events are in temporal order within each Kafka partition, you can take advantage of the per-partition watermarking that the Flink Kafka source provides, and use WatermarkStrategy.forMonotonousTimestamps() rather than the bounded-out-of-orderness strategy. This has a number of advantages.
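For example, applied to the consumer from the question, that could look roughly like the following sketch (the "timestamp" field name is an assumption; use whichever field carries the event time in your records):
FlinkKafkaConsumer<Record> kafkaSource = new FlinkKafkaConsumer(
        SystemsCpu.TOPIC,
        ConfluentRegistryAvroDeserializationSchema.forGeneric(SystemsCpu.SCHEMA, registry),
        config);
kafkaSource.setStartFromLatest();

// Per-partition watermarking: valid as long as timestamps are ascending within each partition.
kafkaSource.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<Record>forMonotonousTimestamps()
                .withTimestampAssigner((record, previousTimestamp) -> (Long) record.get("timestamp")));

DataStream<Record> stream = env.addSource(kafkaSource);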
By the way (this is unrelated to your question), be aware that by specifying SlidingEventTimeWindows.of(Time.minutes(20), Time.seconds(20)), every event will be copied into each of 60 overlapping windows.
You are using SlidingEventTimeWindows, but your stream execution environment is configured for processing time by default. Either use SlidingProcessingTimeWindows or configure your environment for event time, like so:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
Event time also requires a timestamp assigner; you can find more info here:
https://www.ververica.com/blog/stream-processing-introduction-event-time-apache-flink?hs_amp=true
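A minimal sketch of that legacy-API setup, assuming the records carry an epoch-millis field named "timestamp" (the field name is an assumption):
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Record> withTimestamps = stream.assignTimestampsAndWatermarks(
        // allow events to be up to 10 seconds out of order; tune this to your data
        new BoundedOutOfOrdernessTimestampExtractor<Record>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(Record record) {
                return (Long) record.get("timestamp");
            }
        });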
Related
My scenario is that I want to call one stream based on the input of another stream, and the two streams have different types. The following is my sample code. I want to trigger one stream when some message is received from the Kafka stream.
At application startup I can read data from the DB. Then I want to read data from the DB again based on some Kafka message: whenever I receive a Kafka message in the stream, I want to fetch data from the DB again. This is my actual use case.
How can I achieve this? Is it possible?
public class DataStreamCassandraExample implements Serializable{
private static final long serialVersionUID = 1L;
static Logger LOG = LoggerFactory.getLogger(DataStreamCassandraExample.class);
private transient static StreamExecutionEnvironment env;
static DataStream<Tuple4<UUID,String,String,String>> inputRecords;
public static void main(String[] args) throws Exception {
env = StreamExecutionEnvironment.getExecutionEnvironment();
ParameterTool argParameters = ParameterTool.fromArgs(args);
env.getConfig().setGlobalJobParameters(argParameters);
Properties kafkaProps = new Properties();
kafkaProps.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092");
kafkaProps.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "group1");
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("testtopic", new SimpleStringSchema(), kafkaProps);
ClusterBuilder cb = new ClusterBuilder() {
private static final long serialVersionUID = 1L;
@Override
public Cluster buildCluster(Cluster.Builder builder) {
return builder.addContactPoint("127.0.0.1")
.withPort(9042)
.withoutJMXReporting()
.build();
}
};
CassandraInputFormat<Tuple4<UUID,String,String,String>> cassandraInputFormat =
new CassandraInputFormat<> ("select * from employee_details", cb);
//While the application is starting up, read data from the table and send it as a stream
inputRecords = getDBData(env,cassandraInputFormat);
// If any data arrives from Kafka, I want to read data from the table again.
// How do I trigger the getDBData() method from inside this stream?
// The code below is not working
DataStream<String> inputRecords1= env.addSource(kafkaConsumer)
.map(new MapFunction<String,String>() {
private static final long serialVersionUID = 1L;
@Override
public String map(String value) throws Exception {
inputRecords = getDBData(env,cassandraInputFormat);
return "OK";
}
});
//This is not printed when I call the getDBData() stream from inside the Kafka stream.
inputRecords1.print();
DataStream<Employee> empDataStream = inputRecords.map(new MapFunction<Tuple4<UUID,String,String,String>, Tuple2<String,Employee>>() {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<String, Employee> map(Tuple4<UUID,String,String,String> value) throws Exception {
Employee emp = new Employee();
try{
emp.setEmpid(value.f0);
emp.setFirstname(value.f1);
emp.setLastname(value.f2);
emp.setAddress(value.f3);
}
catch(Exception e){
}
return new Tuple2<>(emp.getEmpid().toString(), emp);
}
}).keyBy(0).map(new MapFunction<Tuple2<String,Employee>,Employee>() {
private static final long serialVersionUID = 1L;
@Override
public Employee map(Tuple2<String, Employee> value)
throws Exception {
return value.f1;
}
});
empDataStream.print();
env.execute();
}
private static DataStream<Tuple4<UUID,String,String,String>> getDBData(StreamExecutionEnvironment env,
CassandraInputFormat<Tuple4<UUID,String,String,String>> cassandraInputFormat){
DataStream<Tuple4<UUID,String,String,String>> inputRecords = env.createInput(
        cassandraInputFormat,
        TupleTypeInfo.of(new TypeHint<Tuple4<UUID,String,String,String>>() {}));
return inputRecords;
}
}
This is going to be a fairly verbose answer.
To use Flink correctly as a developer, you need an understanding of its basic concepts. I suggest you start with the architecture overview (https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html); it contains all you need to know to get into the world of Flink when you come from programming.
Now, looking at your code, it cannot do what you expect because of how Flink will read it. You need to understand that Flink has at least two big steps when it executes your code: first it builds an execution graph, which only describes what it needs to do. This happens at the job manager level. The second big step is to ask one or more workers to execute the graph. These two steps are sequential, and anything regarding the description of the graph has to be done at the job manager level, not inside your operators.
In your case, the graph has:
A Kafka source.
A map that will call getDBData() at the worker level (not good, because getDBData() alters the graph by adding a new input each time it is called).
The line inputRecords = getDBData(env,cassandraInputFormat); creates an orphan branch of the graph. And the line DataStream<Employee> empDataStream = inputRecords.map... appends a map->keyBy->map branch to that orphan branch. This builds a part of the graph that reads all the employee records from Cassandra and applies the map->keyBy->map transformations, but it is not linked to the Kafka source in any way.
Now let's get back to your need. I understand you need to fetch data for an employee when his/her id arrives from Kafka and then do some operations.
The cleanest way to handle this is called Side Inputs. This is a data input that you declare when you build your graph, and the job manager handles reading the data and transmitting it to the workers. The bad news is that side inputs do not yet work for streaming jobs in Flink (https://issues.apache.org/jira/browse/FLINK-2491 - this bug causes streaming jobs not to create checkpoints, because side inputs finish quickly and that puts the job in a bizarre state).
With that said, you still have three more options. The right option depends on the size of your employee Cassandra table.
The second option is to load all employees into a static final variable employees and use it inside your map functions. The downside of this approach is that the job manager will send a serialized copy of this variable to all workers, which may congest your network and may also overload the RAM. If the table is small and is not expected to grow, this may be an acceptable workaround until side inputs work for streaming jobs. If the table is big or expected to grow, consider the third option.
The third option is an improvement of the second one. It uses Flink's broadcast state (see https://flink.apache.org/2019/06/26/broadcast-state.html and https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html). Short story: it is the same as before, but with better transfer management. Flink will find the best way to store and send the data to the workers. This approach is, though, a little more complicated to implement correctly.
The last option is not advisable in normal circumstances. It simply consists of making a call to Cassandra inside your map operation. This is not good practice because it adds repeated latency to every map execution (there will be as many calls as there are items passing through Kafka). Each call means creating a connection, sending the actual query, waiting for Cassandra to reply, and freeing the connection. That can be a lot of work for one step in your graph. It is a solution to consider only when you really cannot find any alternative.
For your case, I would advise the third option. I guess the employee table should not be very big, and using broadcast state is a good choice.
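A rough sketch of what that could look like here, using the question's Employee type (the stream variable names and the use of the employee id as the Kafka message payload are assumptions, not a drop-in implementation):
// Descriptor for the broadcast state that holds the employee table on every worker.
MapStateDescriptor<String, Employee> employeesDescriptor =
        new MapStateDescriptor<>("employees", String.class, Employee.class);

// empDataStream: the stream of Employee objects read from Cassandra.
BroadcastStream<Employee> broadcastEmployees = empDataStream.broadcast(employeesDescriptor);

// kafkaStream: the stream of employee ids arriving from Kafka.
DataStream<Employee> matched = kafkaStream
        .connect(broadcastEmployees)
        .process(new BroadcastProcessFunction<String, Employee, Employee>() {

            @Override
            public void processElement(String employeeId, ReadOnlyContext ctx, Collector<Employee> out) throws Exception {
                // Look up the broadcast copy of the table for the id that arrived from Kafka.
                Employee employee = ctx.getBroadcastState(employeesDescriptor).get(employeeId);
                if (employee != null) {
                    out.collect(employee);
                }
            }

            @Override
            public void processBroadcastElement(Employee employee, Context ctx, Collector<Employee> out) throws Exception {
                // Every employee record is stored in the broadcast state on all parallel instances.
                ctx.getBroadcastState(employeesDescriptor).put(employee.getEmpid().toString(), employee);
            }
        });
With this pattern the employee table is streamed in once and kept in the operator's broadcast state, instead of re-querying Cassandra on every Kafka message.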
Let's say I want to make some transformation 'A' configurable. This transformation manages some state using a state store and also requires repartitioning, which means the repartitioning will be done only if the transformation is configured. Now suppose I run the application 3 times (possibly as a rolling upgrade as well) in the following way (or any other combination):
Transformation 'A' is disabled
Transformation 'A' is enabled
Transformation 'A' is disabled
Given that all 3 runs use the same cluster of Kafka brokers:
If EOS is enabled, will the EOS guarantee hold across all 3 runs?
If EOS is not enabled, is there a case that could cause message loss (i.e., failing to provide even at-least-once)?
The topology code, to give a better idea of what I am trying to do:
KStream<String, Cab> kStream = getStreamsBuilder()
.stream("topic_a", Consumed.with(keySerde, valueSerde))
.transformValues(() -> transformer1)
.transformValues(() -> transformer2, "stateStore_a")
.flatMapValues(events -> events);
mayBeEnrichAgain(kStream, keySerde, valueSerde)
.selectKey((ignored, event) -> event.getAnotherId())
.through(INTERMEDIATE_TOPIC_2, Produced.with(keySerde, valueSerde)) //this repartitioning will always be there
.transformValues(() -> transformer3, "stateStore_b")
.to(txStreamsConfig.getAlertTopic(), Produced.with(keySerde, valueSerde));
private <E extends Cab> KStream<String, E> mayBeEnrichAgain(final KStream<String, E> kStream,
final Serde<String> keySerde,
final Serde<E> valueSerde) {
if(enrichmentEnabled){ //repartitioning is configurable
return kStream.selectKey((ignored, event) -> event.id())
.through(INTERMEDIATE_TOPIC_1, Produced.with(keySerde, valueSerde))
.transformValues(enricher1)
.transformValues(enricher2);
}
else{
return kStream;
}
}
You cannot simply change the topology without potentially breaking it.
It is hard to say in general whether inserting the through() topic will break the application in the first place.
If it does not break, you might "lose" data when you remove the topic, as some unprocessed data might still be in it, and after the topic is removed the topology would no longer read that data.
In general, you should reset an application cleanly or use a new application.id if you upgrade your app to a newer version that changes the structure of the topology.
I am pretty new to Kafka and Kafka Streams so please bear with me. I would like to know if I am on the right track here.
I am writing to a Kafka topic at the moment and want to access the data through a REST service. The raw data needs to be transformed before it can be accessed.
What I have so far is a producer that writes the raw data into a topic.
1.) Now I want a Streams app (it should be a jar running in a container) that just transforms the data into my desired shape, following the materialized-view paradigm.
An oversimplified version of 1.):
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> source =
builder.stream("my-raw-data-topic");
KafkaStreams streams = new KafkaStreams(builder,props);
KTable<String, Long> t = source.groupByKey().count("My-Table");
streams.start();
2.) And another Streams app (it should be a jar running in a container) that just holds the KTable as some sort of repository, which can be accessed via a wrapping REST service.
Here I am kind of stuck on the proper way to work with the API.
What is the bare minimum to access and query a KTable? Do I need to assign the transformation topology to the builder again?
KStreamBuilder builder = new KStreamBuilder();
KTable table = builder.table("My-Table"); //Casting?
KafkaStreams streams = new KafkaStreams(builder, props);
RestService service = new RestService(table);
// Use the table as a repository, wrapped by a REST service and updated reactively
Right now this is pseudocode.
Am I on the right path here? Does it make sense to separate 1.) and 2.)? Is this the intended way to work with Streams to materialize views? For me, it would have the benefit of scaling the writes and the reads independently via containers, wherever I see more traffic.
How is repopulating the KTable handled if either 1.) or 2.) crashes? Is this done via replication by the Streams API, or is it something I would need to address in code, like resetting the cursor and replaying the events?
A couple of comments:
In your code snippet (1) you modify your topology after you have handed the builder to the KafkaStreams constructor:
KafkaStreams streams = new KafkaStreams(builder,props);
// don't modify builder anymore!
You should not do this; first fully specify your topology, and only afterwards create the KafkaStreams instance.
About splitting your application in two: it can make sense to scale both parts independently, but it's hard to say in general. However, if you do split them, the first app needs to write the transformed data into an output topic, and the second one should read that output topic as a table (builder.table("output-topic-of-transformation")) to serve the REST requests.
For accessing the store of the KTable, you need to get a query handle via the provided store name:
ReadOnlyKeyValueStore keyValueStore =
streams.store("My-Table", QueryableStoreTypes.keyValueStore());
See the docs for further details:
http://docs.confluent.io/current/streams/developer-guide.html#interactive-queries
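Putting the first and last points together for a single application, the bare minimum could look roughly like this (a sketch assembled from the snippets above, with props being the usual Streams configuration; in the split setup, the second app would instead build its table from the transformation's output topic):
// First fully specify the topology...
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> source = builder.stream("my-raw-data-topic");
KTable<String, Long> counts = source.groupByKey().count("My-Table");

// ...then create and start the KafkaStreams instance.
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();

// The REST layer queries the local state store by the name given to count().
// In practice you may need to retry until the store has finished initializing.
ReadOnlyKeyValueStore<String, Long> keyValueStore =
        streams.store("My-Table", QueryableStoreTypes.keyValueStore());
Long count = keyValueStore.get("some-key");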
I'm trying to use wholeTextFiles to read all the file names in a folder and process them one by one separately (for example, I'm trying to get the SVD vector of each data set, and there are 100 sets in total). The data are saved in .txt files, split by spaces and arranged in different lines (like a matrix).
The problem I came across is that after I use wholeTextFiles("path with all the text files"), it's really difficult to read and parse the data, and I just can't use the approach I used when reading only one file. That approach works fine when I just read one file and gives me the correct output. Could someone please let me know how to fix it here? Thanks!
public static void main (String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("whole text files").setMaster("local[2]").set("spark.executor.memory","1g");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaPairRDD<String, String> fileNameContentsRDD = jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/");
JavaRDD<String[]> lineCounts = fileNameContentsRDD.map(new Function<Tuple2<String, String>, String[]>() {
@Override
public String[] call(Tuple2<String, String> fileNameContent) throws Exception {
String content = fileNameContent._2();
String[] sarray = content.split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i< sarray.length; i++){
values[i] = Double.parseDouble(sarray[i]);
}
pd.cache(); // 'pd' comes from my single-file version and is not defined here; this is where I get stuck
RowMatrix mat = new RowMatrix(pd.rdd());
SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
Vector s = svd.s();
}});
Quoting the scaladoc of SparkContext.wholeTextFiles:
wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file.
In other words, wholeTextFiles might not simply be what you want.
Since by design "Small files are preferred" (see the scaladoc), you could use mapPartitions or collect (with filter) to grab a subset of the files and apply the parsing to them.
Once you have the files per partition in your hands, you could use Scala's Parallel Collection API and schedule Spark jobs to execute in parallel:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
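Since the question is written in Java, here is a rough sketch of that idea using Java parallel streams in place of Scala's parallel collections. It assumes each file is small enough to be parsed on the driver and is laid out as whitespace-separated values with one matrix row per line, as described in the question:
// Collect the (path, content) pairs on the driver; "small files are preferred" by design.
List<Tuple2<String, String>> files =
        jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/").collect();

// Each file gets its own small RDD, RowMatrix and SVD. The per-file jobs are submitted
// from separate threads, which Spark's scheduler supports.
files.parallelStream().forEach(file -> {
    List<Vector> rows = Arrays.stream(file._2().split("\n"))
            .filter(line -> !line.trim().isEmpty())
            .map(line -> Vectors.dense(
                    Arrays.stream(line.trim().split(" "))
                            .mapToDouble(Double::parseDouble)
                            .toArray()))
            .collect(Collectors.toList());

    RowMatrix mat = new RowMatrix(jsc.parallelize(rows).rdd());
    SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
    Vector s = svd.s(); // singular values for this one file
    // ... store or further process the per-file result here
});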
I'm trying to find a way to re-order messages within a topic partition and send ordered messages to a new topic.
I have a Kafka publisher that sends String messages of the following format:
{system_timestamp}-{event_name}?{parameters}
for example:
1494002667893-client.message?chatName=1c&messageBody=hello
1494002656558-chat.started?chatName=1c&chatPatricipants=3
Also, we add a message key to each message, to send it to the corresponding partition.
What I want to do is reorder events based on the {system-timestamp} part of the message, within a 1-minute window, because our publishers don't guarantee that messages will be sent in accordance with the {system-timestamp} value.
For example, a message with a bigger {system-timestamp} value can be delivered to the topic first.
I've investigated the Kafka Streams API and found some examples regarding message windowing and aggregation:
Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-sorter");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
streamsConfiguration.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
streamsConfiguration.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> stream = builder.stream("events");
KGroupedStream<String, String> groupedStream = stream.groupByKey(); // grouped events within a partition
/* commented since I think that I don't need any aggregation, but I guess without aggregation I can't use time windowing.
KTable<Windowed<String>, String> windowedEvents = stream.groupByKey().aggregate(
() -> "", // initial value
(aggKey, value, aggregate) -> aggregate + "", // aggregating value
TimeWindows.of(1000), // intervals in milliseconds
Serdes.String(), // serde for aggregated value
"test-store"
);*/
But what should I do next with this grouped stream? I don't see any sort((e1,e2) -> e1.compareTo(e2)) method available, and windows can only be applied to methods like aggregate(), reduce(), and count(), but I think I don't need any manipulation of the message data.
How can I re-order messages within a 1-minute window and send them to another topic?
Here's an outline:
Create a Processor implementation that:
in the process() method, for each message:
reads the timestamp from the message value
inserts into a KeyValueStore using the (timestamp, message-key) pair as the key and the message value as the value. NB this also provides de-duplication. You'll need to provide a custom Serde that serializes the key so that the timestamp comes first, byte-wise, so that ranged queries are ordered by timestamp first.
in the punctuate() method:
reads the store using a ranged fetch from 0 to timestamp - 60'000 (=1 minute)
sends the fetched messages in order using context.forward() and deletes them from the store
The problem with this approach is that punctuate() is not triggered if no new messages arrive to advance the "stream time". If this is a risk in your case, you can create an external scheduler that sends periodic "tick" messages to each(!) partition of your topic; your processor should just ignore them, but they will cause punctuate() to trigger in the absence of "real" messages.
KIP-138 will address this limitation by adding explicit support for system-time punctuation:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-138%3A+Change+punctuate+semantics
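To make the outline above concrete, a sketch of such a processor could look roughly like this. It uses the pre-KIP-138 Processor API that this answer refers to, sidesteps the custom Serde by zero-padding the timestamp into the String key, and assumes a persistent key-value store named "sort-store" with String serdes has been added to the topology:
public class SortProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("sort-store");
        context.schedule(10_000L); // request punctuate() roughly every 10 seconds of stream time
    }

    @Override
    public void process(String key, String value) {
        // value looks like "{system_timestamp}-{event_name}?{parameters}"
        long timestamp = Long.parseLong(value.substring(0, value.indexOf('-')));
        // Zero-padded timestamp first so that range() scans in time order;
        // appending the message key also de-duplicates identical (timestamp, key) pairs.
        store.put(String.format("%020d-%s", timestamp, key), value);
    }

    @Override
    public void punctuate(long streamTime) {
        String upperBound = String.format("%020d", streamTime - 60_000L); // 1-minute window
        List<String> flushed = new ArrayList<>();
        try (KeyValueIterator<String, String> iterator =
                     store.range(String.format("%020d", 0L), upperBound)) {
            while (iterator.hasNext()) {
                KeyValue<String, String> entry = iterator.next();
                // the original message key follows the 20-digit prefix and the '-' separator
                context.forward(entry.key.substring(21), entry.value);
                flushed.add(entry.key);
            }
        }
        flushed.forEach(store::delete);
        context.commit();
    }

    @Override
    public void close() {
    }
}
The processor and its store would then be wired up with addSource/addStateStore/addProcessor/addSink on a TopologyBuilder (or attached to a KStream via process()).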
Here is how I ordered streams in my project.
I created a topology with a source, a processor, and a sink.
In the processor:
process(key, value) -> add each record to a List (instance variable).
init() -> schedule(WINDOW_BUFFER_TIME, WALL_CLOCK_TIME) -> in punctuate(timestamp), sort the items collected during the window-buffer time in the List (instance variable), then iterate over them and forward them. Clear the List (instance variable).
This logic is working fine for me.