I'm new to Stack Overflow, so forgive me if the question is badly asked. Any help/inspiration is much appreciated!
I'm using Kafka Streams to filter incoming data to my database. The incoming messages look like {"ID":"X","time":"HH:MM"} plus a few other parameters that are irrelevant in this case. I managed to get a Java application running that reads from a topic and prints out the incoming messages. Now what I want to do is use KTables(?) to group incoming messages with the same ID and then use a session window to group the table into time slots. I want a time window of X minutes continuously running on the time axis.
The first thing is of course to get a KTable running to count incoming messages with the same ID. What I would like to do should result in something like this:
ID Count
X 1
Y 3
Z 1
that keeps getting updated continuously, so messages with an outdated timestamp are removed from the table.
I'm not a hundred percent sure, but I think what I want is KTables and not KStreams, am I right? And how do I achieve the sliding window, if this is the proper way of achieving my desired results?
This is the code I use right now. It only reads from a topic and prints the incoming messages.
private static List<String> printEvent(String o) {
System.out.println(o);
return Arrays.asList(o);
}
final StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream(srcTopic)
.flatMapValues(value -> printEvent(value));
I would like to know what I have to add to achieve my desired output stated above, and where I put it in my code.
Thanks in advance for the help!
Yes, you need a KTable and a sliding window. I also recommend you look at the watermark feature, to handle late-delivered messages.
Example
KTable<Windowed<Key>, Value> oneMinuteWindowed = yourKStream
    .groupByKey()
    .reduce(/*your adder*/, TimeWindows.of(60 * 1000L), "store1m");
//where your adder can be as simple as (val, agg) -> agg + val
//for primitive types or as complex as you need
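On newer Kafka Streams releases the same idea is usually written with groupByKey(), windowedBy() and count(). Here is a rough sketch only, not tested against your setup: it assumes the record key is already the ID (otherwise use groupBy() with a key extracted from the JSON first), and the store name and the 5-minute window size are placeholders:

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

final StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream(srcTopic)                                  // key = ID, value = JSON message
    .groupByKey()                                                         // group messages with the same ID
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))     // 5-minute windows on the time axis
    .count(Materialized.as("id-counts"))                                  // KTable<Windowed<String>, Long>: ID -> count per window
    .toStream()
    .foreach((windowedId, count) ->                                       // print every update to the per-window count
        System.out.println(windowedId.key() + " " + count));

Windows that no longer receive records simply stop being updated, so outdated counts age out of the picture on their own.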
We need to group Pub/Sub messages by one of the fields from the messages. We used a fixed window of 15 minutes to group these messages.
When run on Dataflow, the GroupByKey used for message grouping introduces too many duplicate elements, and another GroupByKey at the far end of the pipeline fails with 'KeyCommitTooLargeException: Commit request for stage P27 and key abc#123 has size 225337153 which is more than the limit of..'
I have gone through the link below and found that the suggestion was to use Reshuffle, but Reshuffle uses GroupByKey internally.
Why is GroupByKey in beam pipeline duplicating elements (when run on Google Dataflow)?
My pipeline code:
PCollection<String> messages = getReadPubSubSubscription(options, pipeline);
PCollection<String> windowedMessages = messages
.apply(
Window
.<String>into(
FixedWindows.of(Duration.standardMinutes(15)))
.discardingFiredPanes());
PCollectionTuple objectsTuple = windowedMessages
.apply(
"UnmarshalStrings",
ParDo
.of(new StringUnmarshallFn())
.withOutputTags(
StringUnmarshallFn.mainOutputTag,
TupleTagList.of(StringUnmarshallFn.deadLetterTag)));
PCollection<KV<String, Iterable<ABCObject>>> groupedObjects =
objectsTuple.get(StringUnmarshallFn.mainOutputTag)
.apply(
"GroupByObjects",
GroupByKey.<String, ABCObject>create());
PCollection results = groupedObjects
.apply(
"FetchForEachKey",
ParDo.of(SomeFn())).get(SomeFn.tag)
.apply(
"Reshuffle",
Reshuffle.viaRandomKey());
results.apply(...)
...
Pub/Sub is definitely not duplicating messages, and there are no additional failures; GroupByKey is creating these duplicates. Is something wrong with the windowing I am using?
One observation is that GroupByKey is producing the same number of elements as the next step produces. I am attaching two screenshots, one for GroupByKey and the other for the Fetch function.
GroupByKey step
Fetch step
UPDATE after additional analysis:
Stage P27 is actually the first GroupByKey, which is outputting many more elements than expected. I can't see them as duplicates of the actual output element, as all these million elements are not processed by the next Fetch step. I am not sure whether these are some dummy elements introduced by Dataflow or a wrong metric from Dataflow.
I am still analyzing why this KeyCommitTooLargeException is thrown, as I only have one input element and grouping should produce only one element iterable. I have opened a ticket with Google as well.
GroupByKey groups by key and window. Without a trigger, it outputs just one element per key and window, which is also at most 1 element per input element.
If you are seeing any other behavior it may be a bug and you can report it. You will probably need to provide more steps to reproduce the issue, including example data and the entire runnable pipeline.
Since in the UPDATE you clarified that there are no duplicates, and that instead dummy records are somehow being added (which is really strange), this old thread reports a similar issue, and its answer is interesting since it points to a protobuf serialization issue caused by grouping a very large amount of data in a single window.
I recommend using the available troubleshooting steps (e.g. 1 or 2) to identify in which part of the code the issue is starting. For example, I still think that new StringUnmarshallFn() could be performing tasks that contribute to generating the dummy records. You might want to implement counters in your steps, as sketched below, to try to identify how many records each step generates.
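For example, a counter can go in a small pass-through DoFn that you apply after each step you want to measure. This is only an illustrative sketch; the metric namespace and step names are made up:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Pass-through DoFn that only counts the elements flowing through it
public class CountingFn<T> extends DoFn<T, T> {
    private final Counter elements;

    public CountingFn(String stepName) {
        this.elements = Metrics.counter("pipeline-debug", stepName);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        elements.inc();           // shows up under custom counters in the Dataflow UI
        c.output(c.element());
    }
}

Applied e.g. as windowedMessages.apply("CountAfterWindow", ParDo.of(new CountingFn<>("after-window"))), and again after the unmarshalling and GroupByKey steps, it should show where the element count explodes.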
If you don't find a solution, the remaining option is to contact GCP Support; maybe they can figure it out.
I am trying to wrap my head around Kafka Streams and have some fundamental questions that I can't seem to figure out on my own. I understand the concept of a KTable and Kafka State Stores but am having trouble deciding how to approach this. I am also using Spring Cloud Stream, which adds another level of complexity on top of this.
My use case:
I have a rule engine that reads in a Kafka event, processes the event, returns a list of rules that matched and writes it into another topic. This is what I have so far:
#Bean
public Function<KStream<String, ProcessNode>, KStream<String, List<IndicatorEvaluation>>> process() {
return input -> input.mapValues(this::analyze).filter((host, evaluation) -> evaluation != null);
}
public List<IndicatorEvaluation> analyze(final String host, final ProcessNode process) {
// Does stuff
}
Some of the stateful rules look like:
[some condition] REPEATS 5 TIMES WITHIN 1 MINUTE
[some condition] FOLLOWEDBY [some condition] WITHIN 1 MINUTE
[rule A exists and rule B exists]
My current implementation is storing all this information in memory to be able to perform the analysis. For obvious reasons, it is not easily scalable. So I figured I would persist this into a Kafka State Store.
I am unsure of the best way to go about it. I know there is a way to create custom state stores that allow for a higher level of flexibility. I'm not sure if the Kafka DSL will support this.
Still new to Kafka Streams and wouldn't mind hearing a variety of suggestions.
From the description you have given, I believe this use case can still be implemented using the DSL in Kafka Streams. The code you have shown above does not track any state. In your topology, you need to add state by tracking the counts of the rules and storing them in a state store. Then you only need to send the output rules when that count hits a threshold. Here is the general idea behind this as pseudo-code. Obviously, you have to tweak this to satisfy the particular specifications of your use case.
#Bean
public Function<KStream<String, ProcessNode>, KStream<String, List<IndicatorEvaluation>>> process() {
return input -> input
.mapValues(this::analyze)
.filter((host, evaluation) -> evaluation != null)
...
.groupByKey(...)
.windowedBy(TimeWindows.of(Duration.ofHours(1)))
.count(Materialized.as("rules"))
.filter((key, value) -> value > 4)
.toStream()
....
}
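If you also need to read those windowed counts from outside the topology (for example for the "rule A exists and rule B exists" style rules), the store materialized above can be queried interactively. A rough sketch, assuming kafkaStreams is the running KafkaStreams instance (with Spring Cloud Stream you would typically obtain the store through the binder's InteractiveQueryService instead):

import java.time.Duration;
import java.time.Instant;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

// Look up the windowed "rules" store created by count(Materialized.as("rules"))
ReadOnlyWindowStore<String, Long> ruleCounts = kafkaStreams.store(
        StoreQueryParameters.fromNameAndType("rules", QueryableStoreTypes.windowStore()));

// Fetch the counts for one key over the last hour of windows ("some-host" is a placeholder key)
Instant now = Instant.now();
try (WindowStoreIterator<Long> iterator = ruleCounts.fetch("some-host", now.minus(Duration.ofHours(1)), now)) {
    iterator.forEachRemaining(entry ->
            System.out.println("window starting at " + entry.key + " -> " + entry.value));
}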
We are attempting to use fixed windows on an Apache Beam pipeline (using DirectRunner). Our flow is as follows:
Pull data from pub/sub
Deserialize JSON into Java object
Window events w/ fixed windows of 5 seconds
Using a custom CombineFn, combine each window of Events into a List<Event>
For the sake of testing, simply output the resulting List<Event>
Pipeline code:
pipeline
// Read from pubsub topic to create unbounded PCollection
.apply(PubsubIO
.<String>read()
.topic(options.getTopic())
.withCoder(StringUtf8Coder.of())
)
// Deserialize JSON into Event object
.apply("ParseEvent", ParDo
.of(new ParseEventFn())
)
// Window events with a fixed window size of 5 seconds
.apply("Window", Window
.<Event>into(FixedWindows
.of(Duration.standardSeconds(5))
)
)
// Group events by window
.apply("CombineEvents", Combine
.globally(new CombineEventsFn())
.withoutDefaults()
)
// Log grouped events
.apply("LogEvent", ParDo
.of(new LogEventFn())
);
The result we are seeing is that the final step is never run, as we don't get any logging.
Also, we have added System.out.println("***") in each method of our custom CombineFn class, in order to track when these are run, and it seems they don't run either.
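For reference, the CombineFn is roughly of the following shape (a simplified sketch rather than our exact class; Event is our own POJO):

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.Combine;

// Collects all Events of a window into a single List<Event>
public class CombineEventsFn extends Combine.CombineFn<Event, List<Event>, List<Event>> {
    @Override
    public List<Event> createAccumulator() {
        return new ArrayList<>();
    }

    @Override
    public List<Event> addInput(List<Event> accumulator, Event input) {
        accumulator.add(input);
        return accumulator;
    }

    @Override
    public List<Event> mergeAccumulators(Iterable<List<Event>> accumulators) {
        List<Event> merged = new ArrayList<>();
        for (List<Event> accumulator : accumulators) {
            merged.addAll(accumulator);
        }
        return merged;
    }

    @Override
    public List<Event> extractOutput(List<Event> accumulator) {
        return accumulator;
    }
}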
Is windowing set up incorrectly here? We followed an example found at https://beam.apache.org/documentation/programming-guide/#windowing and it seems fairly straightforward, but clearly there is something fundamental missing.
Any insight is appreciated - thanks in advance!
Looks like the main issue was indeed a missing trigger - the window was opening and there was nothing telling it when to emit results. We wanted to simply window based on processing time (not event time) and so did the following:
.apply("Window", Window
.<Event>into(new GlobalWindows())
.triggering(Repeatedly
.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5))
)
)
.withAllowedLateness(Duration.ZERO).discardingFiredPanes()
)
Essentially this creates a global window with a trigger that emits results 5 seconds after the first element of a pane is processed. Each time the trigger fires, a new pane starts once the next element arrives. Beam complained when we didn't have the withAllowedLateness piece; as far as I know this just tells it to ignore any late data.
My understanding may be a bit off the mark here, but the above snippet has solved our problem!
(Working in RxKotlin and RxJava, but using metacode for simplicity)
Many Reactive Extensions guides begin by creating an Observable from already-available data. In The introduction to Reactive Programming you've been missing, it's created from a single string:
var soureStream= Rx.Observable.just('https://api.github.com/users');
Similarly, the front page of RxKotlin creates one from a populated list:
val list = listOf(1,2,3,4,5)
list.toObservable()
Now consider a simple filter that yields an outStream:
var outStream = sourceStream.filter({x > 3})
In both guides the source events are declared a priori, which means the timeline of events has the form:
source: ----1,2,3,4,5-------
out: --------------4,5---
How can I modify sourceStream to become more of a pipeline? In other words, no input data is available when sourceStream is created; when a source event becomes available, it is immediately processed by out:
source: ---1--2--3-4---5-------
out: ------------4---5-------
I expected to find an Observable.add() for dynamic updates:
var sourceStream = Observable.empty()
var outStream = sourceStream.filter({x>3})
//print each element as its added
sourceStream.subscribe({println(it)})
outStream.subscribe({println(it)})
for i in range(5):
sourceStream.add(i)
Is this possible?
I'm new, but how could I solve my problem without a subject? If I'm testing an application, and I want it to "pop" an update every 5 seconds, how else can I do it other than this Publish subscribe business? Can someone post an answer to this question that doesn't involve a Subscriber?
If you want to pop an update every five seconds, then create an Observable with the interval operator; don't use a Subject. There are a few dozen different operators for constructing Observables, so you rarely need a Subject.
That said, sometimes you do need one, and they come in very handy when testing code. I use them extensively in unit tests.
To Use Subject Or Not To Use Subject? is an excellent article on the subject of Subjects.
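For illustration, here is roughly what the two options look like in RxJava 3 (the RxKotlin version is analogous); this is only a sketch:

import java.util.concurrent.TimeUnit;
import io.reactivex.rxjava3.core.Observable;
import io.reactivex.rxjava3.subjects.PublishSubject;

public class DynamicSourceDemo {
    public static void main(String[] args) throws InterruptedException {
        // Option 1: a timed source that emits 0, 1, 2, ... every 5 seconds
        Observable<Long> ticks = Observable.interval(5, TimeUnit.SECONDS);
        ticks.subscribe(t -> System.out.println("tick " + t));

        // Option 2: a Subject, which plays the role of the hypothetical sourceStream.add()
        PublishSubject<Integer> sourceStream = PublishSubject.create();
        Observable<Integer> outStream = sourceStream.filter(x -> x > 3);

        sourceStream.subscribe(x -> System.out.println("source: " + x));
        outStream.subscribe(x -> System.out.println("out:    " + x));

        // Push elements in as they "arrive"; each one flows through the pipeline immediately
        for (int i = 1; i <= 5; i++) {
            sourceStream.onNext(i);
        }

        Thread.sleep(11_000); // keep the JVM alive long enough to see a couple of interval ticks
    }
}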
I use org.apache.kafka:kafka-streams:0.10.0.1
I'm attempting to work with a time-series-based stream that doesn't seem to trigger KStream.process()'s punctuate() method. (See here for reference.)
In a KafkaStreams config I'm passing in this param (among others):
config.put(
StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
EventTimeExtractor.class.getName());
Here, EventTimeExtractor is a custom timestamp extractor (that implements org.apache.kafka.streams.processor.TimestampExtractor) to extract the timestamp information from JSON data.
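For context, the extractor is along these lines (a simplified sketch rather than the real class; the "timestamp" field name and the Jackson parsing are just illustrative, and it uses the single-argument extract() from the 0.10.0 API):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class EventTimeExtractor implements TimestampExtractor {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public long extract(ConsumerRecord<Object, Object> record) {
        try {
            // Pull the event time (epoch millis) out of the JSON payload
            JsonNode json = MAPPER.readTree(record.value().toString());
            return json.get("timestamp").asLong();
        } catch (Exception e) {
            // Fall back to the record's own timestamp if the payload cannot be parsed
            return record.timestamp();
        }
    }
}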
I would expect this to call my object (derived from TimestampExtractor) when each new record is pulled in. The stream in question is 2 * 10^6 records/minute. I have punctuate() set to 60 seconds and it never fires. I know the data passes this span very frequently, since it's pulling old values to catch up.
In fact it never gets called at all.
Is this the wrong approach to setting timestamps on KStream records?
Is this the wrong way to declare this configuration?
Update Nov 2017: Kafka Streams in Kafka 1.0 now supports punctuate() with both stream-time and with processing-time (wall clock time) behavior. So you can pick whichever behavior you prefer.
Your setup seems correct to me.
What you need to be aware of: As of Kafka 0.10.0, the punctuate() method operates on stream-time (by default, i.e. with the default timestamp extractor, stream-time means event-time). And stream-time is advanced only when new data records are coming in; how much the stream-time advances is determined by the associated timestamps of these new records.
For example:
Let's assume you have set punctuate() to be called every 1 minute = 60 * 1000 (note: 1 minute of stream-time). Now, if it happens that no data is being received for the next 5 minutes, punctuate() will not be called at all -- even though you might expect it to be called 5 times. Why? Again, because punctuate() depends on stream-time, and the stream-time is only advanced based on newly received data records.
Might this be causing the behavior you are seeing?
Looking ahead: There's already an ongoing discussion in the Kafka project on how to make punctuate() more flexible, e.g. to trigger it not only based on stream-time (which defaults to event-time) but also based on processing-time.
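To illustrate the Kafka 1.0+ behavior mentioned in the update above, a Processor can register both punctuation types. A rough sketch (written against a newer Kafka Streams release, using the Duration-based schedule() overload):

import java.time.Duration;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

public class PeriodicProcessor extends AbstractProcessor<String, String> {

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        // Fires every 60 seconds of wall-clock time, even when no records arrive
        context.schedule(Duration.ofSeconds(60), PunctuationType.WALL_CLOCK_TIME,
                timestamp -> System.out.println("wall-clock punctuate at " + timestamp));
        // Fires every 60 seconds of stream-time, which only advances as new records (and their timestamps) arrive
        context.schedule(Duration.ofSeconds(60), PunctuationType.STREAM_TIME,
                timestamp -> System.out.println("stream-time punctuate at " + timestamp));
    }

    @Override
    public void process(String key, String value) {
        context().forward(key, value);  // pass records through unchanged
    }
}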
Your approach seems to be correct. Compare the paragraph "Timestamp Extractor (timestamp.extractor):" in http://docs.confluent.io/3.0.1/streams/developer-guide.html#optional-configuration-parameters
I am not sure why your custom timestamp extractor is not used. Have a look at org.apache.kafka.streams.processor.internals.StreamTask. In the constructor there should be something like
TimestampExtractor timestampExtractor1 = (TimestampExtractor)config.getConfiguredInstance("timestamp.extractor", TimestampExtractor.class);
Check if your custom extractor is picked up there or not...
I think this is another case of issues at the broker level. I went and rebuilt the cluster using instances with more CPU and RAM. Now I'm getting the results I expected.
Note to distant observer(s): if your KStream app is behaving strangely, take a look at your brokers and make sure they aren't stuck in GC and have plenty of 'headroom' for file handles, RAM, etc.