Windowing with Apache Beam - Fixed Windows Don't Seem to be Closing? - java

We are attempting to use fixed windows on an Apache Beam pipeline (using DirectRunner). Our flow is as follows:
1. Pull data from Pub/Sub
2. Deserialize the JSON into a Java object
3. Window the events with fixed windows of 5 seconds
4. Using a custom CombineFn, combine each window of Events into a List<Event>
5. For the sake of testing, simply output the resulting List<Event>
Pipeline code:
pipeline
    // Read from pubsub topic to create unbounded PCollection
    .apply(PubsubIO
        .<String>read()
        .topic(options.getTopic())
        .withCoder(StringUtf8Coder.of())
    )
    // Deserialize JSON into Event object
    .apply("ParseEvent", ParDo
        .of(new ParseEventFn())
    )
    // Window events with a fixed window size of 5 seconds
    .apply("Window", Window
        .<Event>into(FixedWindows
            .of(Duration.standardSeconds(5))
        )
    )
    // Group events by window
    .apply("CombineEvents", Combine
        .globally(new CombineEventsFn())
        .withoutDefaults()
    )
    // Log grouped events
    .apply("LogEvent", ParDo
        .of(new LogEventFn())
    );
The result we are seeing is that the final step is never run, as we don't get any logging.
Also, we have added System.out.println("***") in each method of our custom CombineFn class, in order to track when these are run, and it seems they don't run either.
Is windowing set up incorrectly here? We followed an example found at https://beam.apache.org/documentation/programming-guide/#windowing and it seems fairly straightforward, but clearly something fundamental is missing.
Any insight is appreciated - thanks in advance!

Looks like the main issue was indeed a missing trigger - the window was opening and there was nothing telling it when to emit results. We wanted to simply window based on processing time (not event time) and so did the following:
.apply("Window", Window
.<Event>into(new GlobalWindows())
.triggering(Repeatedly
.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5))
)
)
.withAllowedLateness(Duration.ZERO).discardingFiredPanes()
)
Essentially this creates a single global window whose trigger fires 5 seconds after the first element of a pane is processed; once a pane fires, a new one starts as soon as the next element arrives. Beam complained when we didn't have the withAllowedLateness piece - as far as I know this just tells it to discard any late data.
My understanding may be a bit off the mark here, but the above snippet has solved our problem!
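For anyone who would rather keep the original 5-second fixed windows instead of switching to a global window, a trigger can be attached to the fixed windows as well. The snippet below is an untested sketch based on the standard Beam triggering API (AfterWatermark with processing-time early firings), reusing the Event type from the question:

.apply("Window", Window
    .<Event>into(FixedWindows.of(Duration.standardSeconds(5)))
    // Fire when the watermark passes the end of each 5-second window, with
    // speculative early firings driven by processing time in case the
    // watermark lags behind.
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(5))))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes()
)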

Related

Apache Beam GroupByKey outputting duplicate elements with PubSubIO

We need to group Pub/Sub messages by one of the fields in the message. We used a fixed window of 15 minutes to group these messages.
When run on Dataflow, the GroupByKey used to group the messages introduces too many duplicate elements, and another GroupByKey at the far end of the pipeline fails with 'KeyCommitTooLargeException: Commit request for stage P27 and key abc#123 has size 225337153 which is more than the limit of..'
I have gone through the link below and found the suggestion was to use Reshuffle, but Reshuffle uses GroupByKey internally.
Why is GroupByKey in beam pipeline duplicating elements (when run on Google Dataflow)?
My pipeline code:
PCollection<String> messages = getReadPubSubSubscription(options, pipeline);

PCollection<String> windowedMessages = messages
    .apply(
        Window
            .<String>into(
                FixedWindows.of(Duration.standardMinutes(15)))
            .discardingFiredPanes());

PCollectionTuple objectsTuple = windowedMessages
    .apply(
        "UnmarshalStrings",
        ParDo
            .of(new StringUnmarshallFn())
            .withOutputTags(
                StringUnmarshallFn.mainOutputTag,
                TupleTagList.of(StringUnmarshallFn.deadLetterTag)));

PCollection<KV<String, Iterable<ABCObject>>> groupedObjects =
    objectsTuple.get(StringUnmarshallFn.mainOutputTag)
        .apply(
            "GroupByObjects",
            GroupByKey.<String, ABCObject>create());

PCollection results = groupedObjects
    .apply(
        "FetchForEachKey",
        ParDo
            .of(new SomeFn())
            .withOutputTags(SomeFn.tag, TupleTagList.empty()))
    .get(SomeFn.tag)
    .apply(
        "Reshuffle",
        Reshuffle.viaRandomKey());

results.apply(...)
...
Pub/Sub is definitely not duplicating messages and there are no additional failures; GroupByKey is creating these duplicates. Is something wrong with the windowing I am using?
One observation is that GroupByKey produces the same number of elements as the next step. I am attaching two screenshots, one for the GroupByKey step and the other for the Fetch function.
[Screenshot: GroupByKey step]
[Screenshot: Fetch step]
UPDATE after additional analysis:
Stage P27 is actually the first GroupByKey, which outputs many more elements than expected. I can't tell whether they are duplicates of the actual output element, since these millions of elements are not processed by the next Fetch step. I am not sure whether these are dummy elements introduced by Dataflow or simply a wrong metric from Dataflow.
I am still analyzing why this KeyCommitTooLargeException is thrown, since I only have one input element and grouping should produce only a single iterable. I have also opened a ticket with Google.
GroupByKey groups by key and window. Without a trigger, it outputs just one element per key and window, which is also at most 1 element per input element.
If you are seeing any other behavior it may be a bug and you can report it. You will probably need to provide more steps to reproduce the issue, including example data and the entire runnable pipeline.
Since in the UPDATE you clarified that these are not duplicates but that dummy records are somehow being added (which is really strange), this old thread reports a similar issue, and its answer is interesting since it points to a protobuf serialization issue caused by grouping a very large amount of data in a single window.
I recommend using the available troubleshooting steps (e.g. 1 or 2) to identify in which part of the code the issue starts. For example, I still think that new StringUnmarshallFn() could be performing work that contributes to generating the dummy records. You might want to implement counters in your steps to identify how many records each step generates.
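In case it helps, here is a minimal sketch of such a counter using the Beam Metrics API; the CountingFn class and the "debug" namespace are made up for illustration:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Pass-through DoFn that counts every element flowing through a step.
public class CountingFn<T> extends DoFn<T, T> {
    private final Counter seen;

    public CountingFn(String stepName) {
        this.seen = Metrics.counter("debug", stepName + "-elements");
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        seen.inc();               // one increment per element
        c.output(c.element());    // emit the element unchanged
    }
}

Dropping a ParDo.of(new CountingFn<>("after-groupbykey")) between steps makes it possible to compare the counts the Dataflow UI reports against what the pipeline actually processes.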
If you don't find a solution, the remaining option is to contact GCP Support; maybe they can figure it out.

Kafka streams and windowing to keep count over a time window

I'm new to Stack Overflow, so forgive me if the question is badly asked. Any help/inspiration is much appreciated!
I'm using Kafka Streams to filter incoming data going into my database. The incoming messages look like {"ID":"X","time":"HH:MM"} plus a few other parameters that are irrelevant in this case. I managed to get a Java application running that reads from a topic and prints out the incoming messages. Now what I want to do is use KTables(?) to group incoming messages with the same ID and then use a session window to group the table into time slots. I want a time window of X minutes continuously running on the time axis.
The first thing is of course to get a KTable running to count incoming messages with the same ID. What I would like to do should result in something like this:
ID Count
X 1
Y 3
Z 1
that keeps getting updated continuously, so messages with an outdated timestamp are removed from the table.
I'm not a hundred percent sure, but I think what I want is KTables and not KStreams, am I right? And how do I achieve the sliding window, if this is the proper way of achieving my desired result?
This is the code I use right now. It only reads from a topic and prints the incoming messages.
private static List<String> printEvent(String o) {
    System.out.println(o);
    return Arrays.asList(o);
}

final StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream(srcTopic)
    .flatMapValues(value -> printEvent(value));
I would like to know what I have to add to achieve my desired output stated above, and where I put it in my code.
Thanks in advance for the help!
Yes, you need a KTable and a sliding window; I also recommend you look at the watermark feature to handle late-delivered messages.
Example
KTable<Windowed<Key>, Value> oneMinuteWindowed = yourKStream
    .groupByKey()
    .reduce(/* your adder */, TimeWindows.of(60 * 1000, 60 * 1000), "store1m");
// where your adder can be as simple as (val, agg) -> agg + val
// for primitive types, or as complex as you need
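For reference, on more recent Kafka Streams releases (2.x and later) the same idea is usually written with windowedBy() and count(). The snippet below is a sketch under that assumption, using srcTopic from the question and assuming the ID is the record key:

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

// Count messages per ID over tumbling 1-minute windows.
StreamsBuilder builder = new StreamsBuilder();
KTable<Windowed<String>, Long> countsPerIdPerWindow = builder
    .<String, String>stream(srcTopic)                    // key = ID, value = JSON payload
    .groupByKey()                                        // group records by ID
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))   // tumbling 1-minute windows
    .count();                                            // counts per (ID, window)

// For debugging: print each update as it arrives.
countsPerIdPerWindow.toStream()
    .foreach((windowedId, count) ->
        System.out.println(windowedId.key() + " @ " + windowedId.window().start() + " -> " + count));

If the ID only lives inside the JSON value rather than in the record key, a groupBy((key, value) -> extractedId) step would be needed instead of groupByKey().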

Spark streaming maintain state over window

For Spark Streaming, are there ways we can maintain state only for the current window? I understand updateStateByKey works, but that maintains state forever unless we purge it. Is it possible to store and reset the state per window?
To give more context. I'm trying to convert one type of object into another within a windowed stream. However, the conversion is the following:
Object 1 is either an invocation or a response.
Object 2 is not considered complete until we see both a invocation and a response.
However, since the response for an object could arrive in a separate batch, I need to maintain state across batches.
But I only wish to maintain the state for the current window. Are there any ways I could achieve this through Spark?
thank you!
You can use the mapWithState transformation instead of updateStateByKey, and you can set a timeout on the StateSpec equal to your batch interval. That way you only keep state from the last batch each time. However, this only works if the invocation and response depend only on the last batch; otherwise, when you try to update a key whose state has already been removed, it will throw an exception.
mapWithState also performs better than updateStateByKey.
You can find a sample code snippet below.
import org.apache.spark.streaming._

val stateSpec =
  StateSpec
    .function(updateUserEvents _)
    .timeout(Minutes(5))
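Since the rest of this page is Java, here is a rough Java equivalent of the same idea, purely as an illustrative sketch: the Event and CompletedCall types and the matching logic are made-up stand-ins for the invocation/response pairing described in the question.

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;

// Pair an invocation with its response; unmatched halves are dropped when the
// state times out, so nothing is kept around forever.
Function3<String, Optional<Event>, State<Event>, Optional<CompletedCall>> matchFn =
    (id, newEvent, state) -> {
        if (state.isTimingOut()) {
            return Optional.empty();                 // timeout: discard the unmatched half
        }
        if (state.exists()) {
            CompletedCall call = new CompletedCall(state.get(), newEvent.get());
            state.remove();                          // matched: clear the stored half
            return Optional.of(call);
        }
        state.update(newEvent.get());                // first half seen: remember it
        return Optional.empty();
    };

StateSpec<String, Event, Event, Optional<CompletedCall>> spec =
    StateSpec.function(matchFn).timeout(Durations.minutes(5));

// Applied to a JavaPairDStream<String, Event>:
// pairStream.mapWithState(spec);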

Kafka - problems with TimestampExtractor

I use org.apache.kafka:kafka-streams:0.10.0.1
I'm attempting to work with a time-series-based stream that doesn't seem to be triggering KStream.process() (i.e. "punctuate" never fires). (See here for reference.)
In a KafkaStreams config I'm passing in this param (among others):
config.put(
StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
EventTimeExtractor.class.getName());
Here, EventTimeExtractor is a custom timestamp extractor (that implements org.apache.kafka.streams.processor.TimestampExtractor) to extract the timestamp information from JSON data.
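For context, such an extractor typically looks something like the sketch below. This is not the asker's actual class; it assumes the single-argument extract() signature of the 0.10.0.x API (later versions add a previousTimestamp parameter), and parseTimestampMillis() stands in for the JSON parsing:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class EventTimeExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record) {
        long eventTime = parseTimestampMillis(record.value()); // event time from the JSON body
        // Fall back to wall-clock time if the payload has no usable timestamp.
        return eventTime >= 0 ? eventTime : System.currentTimeMillis();
    }

    private long parseTimestampMillis(Object value) {
        // JSON parsing elided; return -1 when the field is missing or malformed.
        return -1L;
    }
}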
I would expect this to call my object (derived from TimestampExtractor) when each new record is pulled in. The stream in question is 2 * 10^6 records / minute. I have punctuate() set to 60 seconds and it never fires. I know the data passes this span very frequently since it's pulling old values to catch up.
In fact it never gets called at all.
Is this the wrong approach to setting timestamps on KStream records?
Is this the wrong way to declare this configuration?
Update Nov 2017: Kafka Streams in Kafka 1.0 now supports punctuate() with both stream-time and processing-time (wall-clock time) behavior, so you can pick whichever behavior you prefer.
Your setup seems correct to me.
What you need to be aware of: As of Kafka 0.10.0, the punctuate() method operates on stream-time (by default, i.e. based on the default timestamp extractor, stream-time will mean event-time). And the stream-time is only advanced when new data records are coming in, and how much the stream-time is advanced is determined by the associated timestamps of these new records.
For example:
Let's assume you have set punctuate() to be called every 1 minute = 60 * 1000 (note: 1 minute of stream-time). Now, if it happens that no data is being received for the next 5 minutes, punctuate() will not be called at all -- even though you might expect it to be called 5 times. Why? Again, because punctuate() depends on stream-time, and the stream-time is only advanced based on newly received data records.
Might this be causing the behavior you are seeing?
Looking ahead: There's already an ongoing discussion in the Kafka project on how to make punctuate() more flexible, e.g. to trigger it not only based on stream-time (which defaults to event-time) but also based on processing-time.
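(Following up on the Nov 2017 update above: on Kafka 1.0+ the choice between the two behaviors is made when scheduling the punctuation, roughly as in the hypothetical sketch below; flushCounts() is a made-up callback.)

import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

// Inside a Processor's or Transformer's init() method:
@Override
public void init(ProcessorContext context) {
    // Fires every 60 seconds of wall-clock time, even if no new records arrive.
    context.schedule(60_000L, PunctuationType.WALL_CLOCK_TIME,
        timestamp -> flushCounts(timestamp));

    // Alternatively, fire every 60 seconds of stream-time, which only advances
    // with the timestamps of incoming records:
    // context.schedule(60_000L, PunctuationType.STREAM_TIME, timestamp -> flushCounts(timestamp));
}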
Your approach seems to be correct. Compare the paragraph "Timestamp Extractor (timestamp.extractor):" in http://docs.confluent.io/3.0.1/streams/developer-guide.html#optional-configuration-parameters
Not sure why your custom timestamp extractor is not used. Have a look at org.apache.kafka.streams.processor.internals.StreamTask. In the constructor there should be something like
TimestampExtractor timestampExtractor1 = (TimestampExtractor)config.getConfiguredInstance("timestamp.extractor", TimestampExtractor.class);
Check if your custom extractor is picked up there or not...
I think this is another case of issues at the broker level. I went and rebuilt the cluster using instances with more CPU and RAM. Now I'm getting the results I expected.
Note to distant observer(s): if your KStream app is behaving strangely, take a look at your brokers and make sure they aren't stuck in GC and have plenty of 'headroom' for file handles, RAM, etc.
See also

Barcode scan into java swing app intermittently fails on embedded tab control characters

I have a java swing application which has a form that I populate by scanning a barcode containing tab (or $I) delimited data as keyboard input through a USB connection. Intermittently, the form's text fields are incorrectly populated such that it appears the tab is processed too late. For example, if the data set in the barcode is something like 'abc$Idef', the expected output would be 'abc' in the 1st text field and 'def' in the 2nd text field. What we see sometimes instead is 'abcde' in the 1st text field and 'f' in the 2nd or even all data in the 1st text field and nothing in the 2nd.
I have seen this issue manifest at different frequencies on different days. Today could be good, with the issue appearing only 1 out of every 150 attempts; yesterday could have been poor, with 1 out of 10 attempts failing. The scanner is at or near default factory settings, with the exception of toggling the parameter that switches between the tab and $I delimiter. I have also attempted slowing down the transmission speed, and while that does appear to decrease the frequency of events, it does not eliminate them, and the slower speed is not ideal for the user workflow, so I reset it to full speed.
I am doubtful that the issue lies within the scanner, however. In the application, I've attempted to disable all text field validations and data backups to remove essentially any custom code that might cause a delay, but the intermittent issue still exists. Currently the application runs on Windows XP SP3 using JRE 1.5.0_18. The scanner is a Symbol model DS6707. I could use some guidance in investigating this issue further to determine where the problem may lie.
Consider reading the stream on a separate thread and posting completed units on the EventQueue. This will ensure that events arrive "Sequentially…In the same order as they are enqueued." SwingWorker is convenient for this, as the process() method executes "asynchronously on the Event Dispatch Thread."
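A minimal sketch of that approach, assuming the scan data can be read as a character stream (e.g. from a serial/COM port rather than keyboard emulation); ScanWorker, scannerReader, and populateFields() are illustrative names, not part of the original answer:

import java.io.BufferedReader;
import java.util.List;
import javax.swing.SwingWorker;

// Reads complete scans on a background thread and hands them to the EDT
// via SwingWorker's publish()/process().
class ScanWorker extends SwingWorker<Void, String> {
    private final BufferedReader scannerReader;

    ScanWorker(BufferedReader scannerReader) {
        this.scannerReader = scannerReader;
    }

    @Override
    protected Void doInBackground() throws Exception {
        String line;
        while ((line = scannerReader.readLine()) != null) {
            publish(line);                        // one complete scan unit per line
        }
        return null;
    }

    @Override
    protected void process(List<String> scans) { // runs on the Event Dispatch Thread
        for (String scan : scans) {
            populateFields(scan.split("\t"));     // split on the tab delimiter
        }
    }

    private void populateFields(String[] values) {
        // ... assign values to the form's text fields
    }
}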
