I'm trying to find a way to re-order messages within a topic partition and send ordered messages to a new topic.
I have a Kafka publisher that sends String messages in the following format:
{system_timestamp}-{event_name}?{parameters}
for example:
1494002667893-client.message?chatName=1c&messageBody=hello
1494002656558-chat.started?chatName=1c&chatPatricipants=3
We also add a message key to each message so that it is sent to the corresponding partition.
What I want to do is reorder events based on the {system_timestamp} part of the message, within a 1-minute window, because our publishers don't guarantee that messages will be sent in accordance with the {system_timestamp} value.
For example, a message with a bigger {system_timestamp} value can be delivered to the topic first.
I've investigated the Kafka Streams API and found some examples regarding message windowing and aggregation:
Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-sorter");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
streamsConfiguration.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
streamsConfiguration.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> stream = builder.stream("events");
KGroupedStream<String, String> groupedStream = stream.groupByKey(); // grouped events within partition
/* Commented out since I think I don't need any aggregation, but I guess that without aggregation I can't use time windowing.
KTable<Windowed<String>, String> windowedEvents = stream.groupByKey().aggregate(
() -> "", // initial value
(aggKey, value, aggregate) -> aggregate + "", // aggregating value
TimeWindows.of(1000), // intervals in milliseconds
Serdes.String(), // serde for aggregated value
"test-store"
);*/
But what should I do next with this grouped stream? I don't see any sort((e1, e2) -> e1.compareTo(e2)) method available, and windows can only be applied to methods like aggregate(), reduce() and count(), but I don't think I need any manipulation of the message data.
How can I re-order messages within the 1-minute window and send them to another topic?
Here's an outline (a rough sketch follows below):
Create a Processor implementation that:
in the process() method, for each message:
reads the timestamp from the message value
inserts it into a KeyValueStore using the (timestamp, message-key) pair as the key and the message value as the value. NB: this also provides de-duplication. You'll need to provide a custom Serde that serializes the key with the timestamp first, byte-wise, so that ranged queries are ordered by timestamp.
in the punctuate() method:
reads the store using a ranged fetch from 0 to timestamp - 60'000 (=1 minute)
sends the fetched messages in order using context.forward() and deletes them from the store
The problem with this approach is that punctuate() is not triggered if no new messages arrive to advance the "stream time". If this is a risk in your case, you can create an external scheduler that sends periodic "tick" messages to each(!) partition of your topic; your processor should just ignore them, but they'll cause punctuate() to trigger in the absence of "real" messages.
KIP-138 will address this limitation by adding explicit support for system-time punctuation:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-138%3A+Change+punctuate+semantics
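For illustration, here's a rough sketch of such a Processor, using the pre-KIP-138 Processor API matching the question's Kafka Streams version (so punctuate() is driven by stream time). The store name "sort-store", the 10-second schedule interval, and the zero-padded String key (a shortcut instead of a custom composite (timestamp, message-key) Serde) are illustrative assumptions:

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class SortProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("sort-store");
        context.schedule(10_000L); // punctuate roughly every 10 seconds of stream time
    }

    @Override
    public void process(String messageKey, String value) {
        // value format: {system_timestamp}-{event_name}?{parameters}
        long timestamp = Long.parseLong(value.substring(0, value.indexOf('-')));
        // zero-pad the timestamp so byte-wise (lexicographic) order equals numeric order;
        // appending the message key keeps entries unique and gives the de-duplication mentioned above
        store.put(String.format("%019d-%s", timestamp, messageKey), value);
    }

    @Override
    public void punctuate(long streamTime) {
        String from = String.format("%019d", 0L);
        String to = String.format("%019d", streamTime - 60_000L); // 1-minute window
        List<KeyValue<String, String>> ready = new ArrayList<>();
        try (KeyValueIterator<String, String> iter = store.range(from, to)) {
            while (iter.hasNext()) {
                ready.add(iter.next());
            }
        }
        for (KeyValue<String, String> entry : ready) {
            context.forward(entry.key.substring(20), entry.value); // strip the timestamp prefix
            store.delete(entry.key);
        }
        context.commit();
    }

    @Override
    public void close() { }
}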
Here is how I ordered streams in my project.
Created a topology with a source, a processor, and a sink.
In the Processor:
process(key, value) -> add each record to a List (instance variable).
init() -> schedule(WINDOW_BUFFER_TIME, WALL_CLOCK_TIME) -> punctuate(timestamp) -> sort the items buffered in the List during the window buffer time, iterate over them and forward each one, then clear the List (see the sketch below).
This logic is working fine for me.
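For reference, a compact sketch of that processor could look like the following, assuming a Kafka Streams version (2.1+) where wall-clock punctuation is available; WINDOW_BUFFER_TIME and the parsing of the timestamp prefix from the message value are assumptions based on the format in the question:

import java.time.Duration;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

public class WindowBufferProcessor extends AbstractProcessor<String, String> {

    private static final Duration WINDOW_BUFFER_TIME = Duration.ofMinutes(1); // assumed buffer size

    private final List<KeyValue<String, String>> buffer = new ArrayList<>();

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        context.schedule(WINDOW_BUFFER_TIME, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            // sort the buffered records by the timestamp prefix of the value, forward them in order
            buffer.sort(Comparator.comparingLong(
                    kv -> Long.parseLong(kv.value.substring(0, kv.value.indexOf('-')))));
            for (KeyValue<String, String> record : buffer) {
                context.forward(record.key, record.value);
            }
            buffer.clear();
            context.commit();
        });
    }

    @Override
    public void process(String key, String value) {
        buffer.add(KeyValue.pair(key, value)); // just buffer until the next punctuation
    }
}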
Related
I have a stream execution environment configured as
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Record> stream = env.addSource(new FlinkKafkaConsumer(
SystemsCpu.TOPIC,
ConfluentRegistryAvroDeserializationSchema.forGeneric(SystemsCpu.SCHEMA, registry),
config)
.setStartFromLatest());
DataStream<Anomaly> anomalies = stream
.keyBy(x -> x.get("host").toString())
.window(SlidingEventTimeWindows.of(Time.minutes(20), Time.seconds(20))) // produces output with TumblingEventTimeWindows
.process(new AnomalyDetector())
.name("anomaly-detector");
public class AnomalyDetector extends ProcessWindowFunction<Record, Anomaly, String, TimeWindow> {
@Override
public void process(String key, Context context, Iterable<Record> input, Collector<Anomaly> out) {
var anomaly = new Anomaly();
anomaly.setValue(1.0);
out.collect(anomaly);
}
}
However, for some reason, SlidingEventTimeWindows does not produce any output to be processed by the AnomalyDetector (i.e. process() is not triggered at all). If I use, for example, TumblingEventTimeWindows, it works as expected.
Any ideas what might be causing this? Am I using SlidingEventTimeWindows incorrectly?
When doing any sort of event time windowing it is necessary to provide a WatermarkStrategy. Watermarks mark a spot in the stream, and signal that the stream is complete up through some specific point in time. Event time windows can only be triggered by the arrival of a sufficiently large watermark.
See the docs for details, but it could look something like this:
DataStream<MyType> timestampedEvents = events
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<MyType>forBoundedOutOfOrderness(Duration.ofSeconds(10))
.withTimestampAssigner((event, timestamp) -> event.timestamp));
However, since you are using Kafka, it's usually better to have the Flink Kafka consumer do the watermarking:
FlinkKafkaConsumer<MyType> kafkaSource = new FlinkKafkaConsumer<>("myTopic", schema, props);
kafkaSource.assignTimestampsAndWatermarks(WatermarkStrategy...);
DataStream<MyType> stream = env.addSource(kafkaSource);
Note that if you use this latter approach, and if your events are in temporal order within each Kafka partition, you can take advantage of the per-partition watermarking that the Flink Kafka source provides, and use WatermarkStrategy.forMonotonousTimestamps() rather than the bounded-out-of-orderness strategy. This has a number of advantages.
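Applied to the pipeline from the question, that could look roughly like this (Flink 1.11+; it assumes Record is the Avro GenericRecord type produced by forGeneric and that the records carry an epoch-millis field, here called "timestamp"):

FlinkKafkaConsumer<Record> consumer = new FlinkKafkaConsumer<>(
        SystemsCpu.TOPIC,
        ConfluentRegistryAvroDeserializationSchema.forGeneric(SystemsCpu.SCHEMA, registry),
        config);
consumer.setStartFromLatest();
consumer.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<Record>forMonotonousTimestamps() // per-partition; or forBoundedOutOfOrderness(...)
                .withTimestampAssigner((record, ts) -> (long) record.get("timestamp"))); // assumed field

DataStream<Record> stream = env.addSource(consumer);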
By the way, this is unrelated to your question, but you should be aware that by specifying SlidingEventTimeWindows.of(Time.minutes(20), Time.seconds(20)), every event will be copied into each of 60 overlapping windows.
You are using SlidingEventTimeWindows, but your stream execution environment is configured for processing time by default. Either use SlidingProcessingTimeWindows or configure your environment for event time, like so:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
Event time also requires a timestamp assigner; you can find more info here:
https://www.ververica.com/blog/stream-processing-introduction-event-time-apache-flink?hs_amp=true
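For reference, with those older (pre-1.12) Flink APIs a minimal version could look like the sketch below; the epoch-millis field name "timestamp" is again an assumption about the Avro records:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Record> withTimestamps = stream.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Record>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(Record record) {
                return (long) record.get("timestamp"); // assumed epoch-millis field
            }
        });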
In my Java application, I have three DataStreams. One stream's data is consumed from Kafka, and another stream's data is consumed from Apache NiFi. These two streams have different object types: for example, Stream-1's object type is Person and Stream-2's object type is Address.
The third one is a broadcast stream (its data is consumed from Kafka).
Now I want to combine Stream-1 and Stream-2 in a Job class and split them in the task's process element. How can I implement this?
Note:
Stream-1 is the main stream and Stream-2 is a side input. The main stream continuously fetches data from Kafka. For the side input, all table data is initially loaded from the DB while the application starts up, and new data is then read whenever the table data is updated (not frequently).
Sample structure:
DataStream<Person> stream1 = env.addSource(...); // read data from Kafka
DataStream<Address> stream2 = env.addSource(...); // read data from NiFi
BroadcastStream<String> broadcastStream = stream3.broadcast(...); // stream3 is consumed from Kafka
I referred to the following links:
FLIP-17 Side Inputs for DataStream API
jira/browse/FLINK-6131
My use case is:
Join stream with slowly evolving data: the side input that we use for enriching is evolving over time (data is read from a DB). This can be done by waiting for some initial data to be available before processing the main input, and then continuously ingesting new data into the internal side-input structure as it arrives.
Based on the latest response, the recommendation by @Arvid was in fact what was needed here.
Core of the answer:
You can easily join stream1 and stream2 even if they have different types. Then you can add the broadcast to the result.
Links to the doc and example, and a relevant snippet from the doc (the example is too long to include here):
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
.apply(new JoinFunction<Integer, Integer, String>() {
@Override
public String join(Integer first, Integer second) {
return first + "," + second;
}
});
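To illustrate how that could be wired up for the question's types, here is a rough sketch; Person, Address, their key getters, the EnrichedPerson result type, and the broadcast-state descriptor are all assumptions, and the event-time window requires timestamps/watermarks on both inputs:

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// descriptor for the broadcast state (name and types are assumptions)
MapStateDescriptor<String, String> ruleDescriptor =
        new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);
BroadcastStream<String> broadcast = stream3.broadcast(ruleDescriptor);

// join Person and Address on an assumed shared key, producing an assumed EnrichedPerson POJO
DataStream<EnrichedPerson> joined = stream1
        .join(stream2)
        .where(person -> person.getId())            // assumed key extractor
        .equalTo(address -> address.getPersonId())  // assumed key extractor
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .apply(new JoinFunction<Person, Address, EnrichedPerson>() {
            @Override
            public EnrichedPerson join(Person person, Address address) {
                return new EnrichedPerson(person, address);
            }
        });

// then "add the broadcast to the result"
DataStream<EnrichedPerson> enriched = joined
        .connect(broadcast)
        .process(new BroadcastProcessFunction<EnrichedPerson, String, EnrichedPerson>() {
            @Override
            public void processElement(EnrichedPerson value, ReadOnlyContext ctx,
                                       Collector<EnrichedPerson> out) throws Exception {
                // look up broadcast state via ctx.getBroadcastState(ruleDescriptor) if needed
                out.collect(value);
            }

            @Override
            public void processBroadcastElement(String rule, Context ctx,
                                                Collector<EnrichedPerson> out) throws Exception {
                ctx.getBroadcastState(ruleDescriptor).put(rule, rule);
            }
        });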
I have a materialized in-memory state store in my code. I have another, separate stream that is supposed to look up and delete records based on some criteria.
I need to allow my stream to access and delete records in the previously built statestore.
I have the following code below
@Bean
public StreamsBuilder myStreamCodeBean(StreamsBuilder streamsBuilder) {
//create store supplier
KeyValueBytesStoreSupplier myStoreSupplier = Stores.inMemoryKeyValueStore("MyStateStore");
//materialize the state store and enable caching
Materialized materializedStore = Materialized.<String, MyObject>as(myStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(myObjectSerde)
.withCachingEnabled();
//other code here that creates KTable, and another stream to consume records into ktable from another topic
//........
//another stream that consumes another topic and deletes records in store with some logic
streamsBuilder
.stream("someTopicName", someConsumerObject)
.filter((key, value) -> {
KeyValueStore<Bytes, byte[]> kvStore = myStoreSupplier.get();
kvStore.delete(key); // the state store is never "open" and this throws a NullPointerException (even though the key is NOT null)
return true;
})
.to("some topic name here", producerObject);
return streamsBuilder;
}
The error that is thrown is really generic; it just says that Kafka Streams is not running.
While debugging, I found that my state store isn't "open" when the delete is performed.
What am I doing wrong here? I can read records using a ReadOnlyKeyValueStore, but I need to delete, so I can't use that.
Any help appreciated.
The state store must be accessed through the processor's context, not through the supplier object.
After creating the store, you need to ensure that it is accessible from the processor in which you are trying to use it.
If your store is a local store, then you need to specify which processors are going to access the store.
If your store is a global store, then it is accessible by all the processors in the topology.
You are creating a stream using streamsBuilder.stream(), and at least from the code you have posted, you don't seem to give your processor access to the state store.
Ensure that you have called addStateStore() on the StreamsBuilder.
To get the state store in the processor, we need to use context.getStateStore(storeName).
You can refer to the following example.
(I don't think we can access a state store in filter() because it is a stateless operation.) Instead, you can use a Processor or Transformer and pass in the state store names (MyStateStore in your case).
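For illustration, here is a rough sketch of that pattern: register the store with the builder, connect a Transformer to it by name, and access it through the ProcessorContext. The types and names (MyObject, myObjectSerde, someConsumerObject, producerObject, MyStateStore) are taken from the question; wiring this together with the KTable materialization from the original snippet is left out:

StoreBuilder<KeyValueStore<String, MyObject>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore("MyStateStore"),
                Serdes.String(),
                myObjectSerde)
        .withCachingEnabled();
streamsBuilder.addStateStore(storeBuilder);

streamsBuilder
        .stream("someTopicName", someConsumerObject)
        .transform(() -> new Transformer<String, MyObject, KeyValue<String, MyObject>>() {
            private KeyValueStore<String, MyObject> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(ProcessorContext context) {
                // the store is fetched from the processor's context, not from the supplier
                store = (KeyValueStore<String, MyObject>) context.getStateStore("MyStateStore");
            }

            @Override
            public KeyValue<String, MyObject> transform(String key, MyObject value) {
                store.delete(key); // your delete logic goes here
                return KeyValue.pair(key, value);
            }

            @Override
            public void close() { }
        }, "MyStateStore") // connect the transformer to the store by name
        .to("some topic name here", producerObject);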
I have the following code
//Kafka Config setup
Properties props = ...; //setup
List<String> topicList = Arrays.asList("A", "B", "C");
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream(topicList);
source
.map((k, v) -> { /* busy code for mapping data */ })
.transformValues(new MyGenericTransformer())
.to((k, v, r) -> { /* busy code for topic routing */ });
new KafkaStreams(builder.build(), props).start();
My problem: when I add more than one topic to subscribe to (i.e. A, B, C above), the KStream code stops receiving records.
References : https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/StreamsBuilder.html
Relevant Documentation
public <K,V> KStream<K,V> stream(java.util.Collection<java.lang.String> topics)
"If multiple topics are specified there is no ordering guarantee for records from different topics."
What I'm trying to achieve: have one KStream (i.e. 'source' from above) consume/process from multiple topics.
Do the topics share the same key?
Note that the specified input topics must be partitioned by key. If this is not the case, it is the user's responsibility to repartition the data before any key based operation (like aggregation or join) is applied to the returned KStream.
This may be your blocker.
Another possible issue may be the consumer group being used.
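If co-partitioning turns out to be the blocker, one option (a sketch with a hypothetical extractBusinessKey() helper) is to re-key the merged stream before any key-based operation, which makes Kafka Streams insert a repartition topic automatically:

KStream<String, String> source = builder.stream(Arrays.asList("A", "B", "C"));
source
        .selectKey((key, value) -> extractBusinessKey(value)) // hypothetical re-keying logic
        .groupByKey() // the key changed, so Kafka Streams repartitions before the aggregation
        .count();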
I am pretty new to Kafka and Kafka Streams so please bear with me. I would like to know if I am on the right track here.
I am writing to a Kafka topic at the moment and trying to access the data through a REST service. The raw data needs to be transformed before it can be accessed.
What I have so far is a producer that writes the raw data into a topic.
1.) Now I want a Streams app (it should be a jar running in a container) that just transforms the data into my desired shape, following the materialized view paradigm.
Oversimplified version of 1.):
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> source =
builder.stream("my-raw-data-topic");
KafkaStreams streams = new KafkaStreams(builder,props);
KTable<String, Long> t = source.groupByKey().count("My-Table");
streams.start();
2.) And another Streams app (it should be a jar running in a container) that just holds the KTable as some sort of repository which can be accessed via a wrapping REST service.
Here I am kind of stuck on the proper way to work with the API.
What is the bare minimum needed to access and query a KTable? Do I need to assign the transformation topology to the builder again?
KStreamBuilder builder = new KStreamBuilder();
KTable table = builder.table("My-Table"); //Casting?
KafkaStreams streams = new KafkaStreams(builder, props);
RestService service = new RestService(table);
// Use the table as a repository which is wrapped by a REST service and gets updated reactively
Right now this is pseudocode.
Am I on the right path here? Does it make sense to separate 1.) and 2.)? Is this the intended way to work with Streams to materialize views? For me, it would have the benefit of scaling the writes and the reads independently via containers wherever I see more traffic.
How is repopulating the KTable handled on a crash of either 1.) or 2.)? Is this done via replication by the Streams API, or is it something I would need to address in code, like resetting the cursor and replaying the events?
A couple of comments:
In your code snippet (1) you modify your topology after you handed the builder into the KafkaStreams constructor:
KafkaStreams streams = new KafkaStreams(builder,props);
// don't modify builder anymore!
You should not do this; first specify your topology and only afterwards create the KafkaStreams instance.
About splitting your application into two: this can make sense in order to scale both parts independently, but it's hard to say in general. However, if you do split them, the first one needs to write the transformed data into an output topic, and the second one should read this output topic as a table (builder.table("output-topic-of-transformation")) to serve the REST requests.
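A rough sketch of that split, sticking to the older API used in the question (the topic and store names are just illustrative, and the exact overloads differ slightly across Kafka Streams versions):

// Application 1: transform the raw data and write the result to an output topic
KTable<String, Long> counts = source.groupByKey().count("My-Table");
counts.toStream().to(Serdes.String(), Serdes.Long(), "output-topic-of-transformation");

// Application 2: read that topic back as a KTable and expose it through the REST layer
KStreamBuilder builder = new KStreamBuilder();
KTable<String, Long> table = builder.table(Serdes.String(), Serdes.Long(),
        "output-topic-of-transformation", "transformation-store");
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();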
For accessing the store of the KTable, you need to get a query handle via the provided store name:
ReadOnlyKeyValueStore keyValueStore =
streams.store("My-Table", QueryableStoreTypes.keyValueStore());
See the docs for further details:
http://docs.confluent.io/current/streams/developer-guide.html#interactive-queries
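Putting the pieces together with that same older API, a minimal flow could look like the sketch below; "My-Table" is the store name from the question, everything else is illustrative:

KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> source = builder.stream("my-raw-data-topic");
KTable<String, Long> counts = source.groupByKey().count("My-Table");

// create and start KafkaStreams only after the topology is fully defined
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();

// once the instance is running, obtain a read-only handle for interactive queries
ReadOnlyKeyValueStore<String, Long> store =
        streams.store("My-Table", QueryableStoreTypes.<String, Long>keyValueStore());
Long count = store.get("some-key");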