Apache Beam GroupByKey outputting duplicate elements with PubSubIO

Apache Beam GroupByKey outputting duplicate elements with PubSubIO - java

We need to group PubSub messages by one of the fields from messages. We used fixed window of 15mins to group these messages.
When run on data flow, the GroupByKey used for messages grouping is introducing too many duplicate elements, another GroupByKey at far end of pipeline is failing with 'KeyCommitTooLargeException: Commit request for stage P27 and key abc#123 has size 225337153 which is more than the limit of..'
I have gone through the below link and found the suggestion was to use Reshuffle but Reshuffle has GroupByKey internally.
Why is GroupByKey in beam pipeline duplicating elements (when run on Google Dataflow)?
My pipeline code:
PCollection<String> messages = getReadPubSubSubscription(options, pipeline);
PCollection<String> windowedMessages = messages
.apply(
Window
.<String>into(
FixedWindows.of(Duration.standardMinutes(15)))
.discardingFiredPanes());
PCollectionTuple objectsTuple = windowedMessages
.apply(
"UnmarshalStrings",
ParDo
.of(new StringUnmarshallFn())
.withOutputTags(
StringUnmarshallFn.mainOutputTag,
TupleTagList.of(StringUnmarshallFn.deadLetterTag)));
PCollection<KV<String, Iterable<ABCObject>>> groupedObjects =
objectsTuple.get(StringUnmarshallFn.mainOutputTag)
.apply(
"GroupByObjects",
GroupByKey.<String, ABCObject>create());
PCollection results = groupedObjects
.apply(
"FetchForEachKey",
ParDo.of(SomeFn())).get(SomeFn.tag)
.apply(
"Reshuffle",
Reshuffle.viaRandomKey());
results.apply(...)
...
PubSub is not duplicating messages for sure and there are no additional failures, GroupByKey is creating these duplicates, is something wrong with the Windowing I am using?
One observation is GroupBy is producing same no of elements as the next step produce. I am attaching two screenshots one for GroupByKey and Other For Fetch Function.
GroupByKey step
Fetch step
UPDATE After additional analysis
Stage P27 is actually the first GroupByKey which is outputting many elements than expected. I can't see them as duplicates of actual output element as all these million elements are not processed by next Fetch step. I am not sure if these are some dummy elements introduced by dataflow or wrong metric from dataflow.
I am still analyzing further on why this KeyCommitTooLargeException is thrown as I only have one input element and grouping should only produce one element iterable. I have opened a ticket with Google as well.

GroupByKey groups by key and window. Without a trigger, it outputs just one element per key and window, which is also at most 1 element per input element.
If you are seeing any other behavior it may be a bug and you can report it. You will probably need to provide more steps to reproduce the issue, including example data and the entire runnable pipeline.

Since in the UPDATE you clarified that there are not duplicates, instead somehow dummy records are being added (what is really strange), this old thread reports similar issue and the answer is interesting since points out to a protobuf serialization issue caused by grouping a very large amount of data in a single window.
I recommend using the available troubleshooting steps (e.g. 1 or 2) to identify in which part of the code the issue is starting. For example, I'm still think that new StringUnmarshallFn() could be performing tasks that contribute to generate the dummy records. You might want to implement counters in your steps to try to identify how many records each step generates.
If you don't find a solution, the outstanding option is contact GCP Support and maybe they can figure it out.

Related

Apache Beam GCP Dataflow Suggestion Implementation

I am running apache beam pipeline on GCP dataflow.
The dataflow pipeline suggests following item
A fusion break can be inserted after the following transforms to increase parallelism: ReadFromGCS/Match All/Match filepatterns/ParMultiDo(Match). The transforms had the following output-to-input element count ratio, respectively: 1006307.
my pipeline looks something like this
PCollection<String> records = p.apply("ReadFromGCS", TextIO.read().from(options.getInput())
.withHintMatchesManyFiles());
PCollection<Document> documents = records.apply("ConvertToDocument", ParDo.of(new ProcessJSON(options.getBatch())));
// Write to MongoDB using ParDo transform sink
documents.apply("WriteToMongoDB", MongoDbIO.write()
.withUri("mongodb+srv://"+options.getMongo())
.withDatabase(options.getDatabase())
.withCollection(options.getCollection())
.withBatchSize(options.getBatchSize()));
my input is a gcs bucket of pattern 'gs://test-bucket/test/*.json' which contains million json files
I want to understand what does the suggestion mean and how do I increase parallelism as suggested my dataflow in my case.
I tried this documentation but could not figure out how to solve this
https://cloud.google.com/dataflow/docs/guides/using-dataflow-insights?&_ga=2.162468924.-1096227812.1671933466#high-fan-out
attaching screenshot
image

Please look at Fusion Optimization for some background information on how to enforce/prevent a fusion.
A very common way is to have GroupByKey if there is some natural way to group things, or use operations such as Reshuffle.ViaRandomKey if you want to just spread evenly.

Spring Integration aggregating messages that were split twice

I have a use case where my message are being split twice and i want to aggregate all these messages. How can this best be achieved, should i aggregate the messages twice by introducing different sequence headers, or is there a way to aggregate the messages in single aggregating step by overriding the method how messages are grouped?

That's called a "nested splitting" and there is built-in algorithm to push sequence detail headers to the stack for a new splitting context. This would allow in the end to have an ascendant aggregation: the first one aggregate for the closest nested splitting, pops sequence detail headers and allows the next aggregator to deal with its own sequence context.
So, in two words: it is better to have as many aggregator as you have splitting if you want to send a single message in the start and receive a single message in the end.
Of course you can have a custom splitting algorithm with an applySequence = false. As many as you need. And have only a single aggregator in the end, but with a custom correlation logic already.
We have some explanation in the docs: https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#aggregatingmessagehandler
Starting with version 5.3, after processing message group, an AbstractCorrelatingMessageHandler performs a MessageBuilder.popSequenceDetails() message headers modification for the proper splitter-aggregator scenario with several nested levels.
We don't have a sample on the matter, but here is a configuration for test case: https://github.com/spring-projects/spring-integration/blob/main/spring-integration-core/src/test/java/org/springframework/integration/aggregator/scenarios/NestedAggregationTests-context.xml

Writing to GCS with Dataflow using element count

This is in reference to Apache Beam SDK Version 2.2.0.
I'm attempting to use AfterPane.elementCountAtLeast(...) but not having any success so far. What I want looks a lot like Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn, but needs to be adapted to 2.2.0. Ultimately I just need a simple OR where a file is written after X elements OR Y time has passed. I intend to set the time very high so that the write happens on the number of elements in the majority of cases, and only writes based on duration during times of very low message volume.
Using GCP Dataflow 2.0 PubSub to GCS as a reference here's what I've tried:
String bucketPath =
String.format("gs://%s/%s",
options.getBucketName(),
options.getDestinationDirName());
PCollection<String> windowedValues = stringMessages
.apply("Create windows",
Window.<String>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(250)))
.discardingFiredPanes());
windowedValues
.apply("Write to GCS",
TextIO
.write()
.to(bucketPath)
.withNumShards(options.getNumShards())
.withWindowedWrites());
Where stringMessages is a PCollection that is reading from an Avro-encoded pubsub subscription. There is some unpacking happening upstream to get the events converted to strings, but no merging/partitioning/grouping, just transforms.
Element count is hard coded at 250 just for PoC. Once it is proven, it will likely be cranked up to the 10s or 100s of thousands range.
The Problem
This implementation has resulted in text files of various lengths. The files lengths start very high (1000s of elements) when the job first starts up (presumably processing backlogged data, and then stabilize at some point. I've tried altering the 'numShards' to 1 and 10. At 1, the element count of the written files stabilizes at 600, and with 10, it stabilizes at 300.
What am I missing here?
As a side note, this is only step 1. Once I figure out writing using
element count, I still need to figure out writing these files as
compressed json (.json.gz) as opposed to plain-text files.

Posting what I learned for reference by others.
What was not clear to me when I wrote this is the following from the Apache Beam Documentation:
Transforms that aggregate multiple elements, such as GroupByKey and
Combine, work implicitly on a per-window basis
With this knowledge, I rethought my pipeline a bit. From the FileIO documentation under Writing files -> How many shards are generated per pane:
Note that setting a fixed number of shards can hurt performance: it adds an additional GroupByKey to the pipeline. However, it is required to set it when writing an unbounded PCollection due to BEAM-1438 and similar behavior in other runners.
So I decided to use FileIO's writeDynamic to perform the writes and specify withNumShards in order to get the implicit GroupByKey. The final result looks like this:
PCollection<String> windowedValues = validMessageStream.apply(Window
.<String>configure()
.triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(2000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(
Duration.standardSeconds(windowDurationSeconds)))))
.discardingFiredPanes());
windowedValues.apply(FileIO.<String, String>writeDynamic()
.by(Event::getKey)
.via(TextIO.sink())
.to("gs://data_pipeline_events_test/events/")
.withDestinationCoder(StringUtf8Coder.of())
.withNumShards(1)
.withNaming(key -> FileIO.Write.defaultNaming(key, ".json")));

Windowing with Apache Beam - Fixed Windows Don't Seem to be Closing?

We are attempting to use fixed windows on an Apache Beam pipeline (using DirectRunner). Our flow is as follows:
Pull data from pub/sub
Deserialize JSON into Java object
Window events w/ fixed windows of 5 seconds
Using a custom CombineFn, combine each window of Events into a List<Event>
For the sake of testing, simply output the resulting List<Event>
Pipeline code:
pipeline
// Read from pubsub topic to create unbounded PCollection
.apply(PubsubIO
.<String>read()
.topic(options.getTopic())
.withCoder(StringUtf8Coder.of())
)
// Deserialize JSON into Event object
.apply("ParseEvent", ParDo
.of(new ParseEventFn())
)
// Window events with a fixed window size of 5 seconds
.apply("Window", Window
.<Event>into(FixedWindows
.of(Duration.standardSeconds(5))
)
)
// Group events by window
.apply("CombineEvents", Combine
.globally(new CombineEventsFn())
.withoutDefaults()
)
// Log grouped events
.apply("LogEvent", ParDo
.of(new LogEventFn())
);
The result we are seeing is that the final step is never run, as we don't get any logging.
Also, we have added System.out.println("***") in each method of our custom CombineFn class, in order to track when these are run, and it seems they don't run either.
Is windowing set up incorrectly here? We followed an example found at https://beam.apache.org/documentation/programming-guide/#windowing and it seems fairly straightforward, but clearly there is something fundamental missing.
Any insight is appreciated - thanks in advance!

Looks like the main issue was indeed a missing trigger - the window was opening and there was nothing telling it when to emit results. We wanted to simply window based on processing time (not event time) and so did the following:
.apply("Window", Window
.<Event>into(new GlobalWindows())
.triggering(Repeatedly
.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5))
)
)
.withAllowedLateness(Duration.ZERO).discardingFiredPanes()
)
Essentially this creates a global window, which is triggered to emit events 5 seconds after the first element is processed. Every time the window is closed, another is opened once it receives an element. Beam complained when we didn't have the withAllowedLateness piece - as far as I know this just tells it to ignore any late data.
My understanding may be a bit off the mark here, but the above snippet has solved our problem!

A single output request from multiple elements of a JavaRDD in Apache Spark Streaming

Summary
My question is about how Apache Spark Streaming can handle an output operation that takes a long time by either improving parallelization or by combining many writes into a single, larger write. In this case, the write is a cypher request to Neo4J, but it could apply to other data storage.
Environment
I have an Apache Spark Streaming application in Java that writes to 2 datastores: Elasticsearch and Neo4j. Here are the versions:
Java 8
Apache Spark 2.11
Neo4J 3.1.1
Neo4J Java Bolt Driver 1.1.2
The Elasticsearch output was easy enough as I used the Elasticsearch-Hadoop for Apache Spark library.
Our Stream
Our input is a stream from Kafka received on a particular topic, and I deserialize the elements of the stream through a map function to create a JavaDStream<[OurMessage]> dataStream. I then do transforms on this message to create a cypher query String cypherRequest (using an OurMessage to String transformation) that is sent to a singleton that manages the Bolt Driver connection to Neo4j (I know I should use a connection pool, but maybe that's another question). The cypher query produces a number of nodes and/or edges based on the contents of OurMessage.
The code looks something like the following.
dataStream.foreachRDD( rdd -> {
rdd.foreach( cypherQuery -> {
BoltDriverSingleton.getInstance().update(cypherQuery);
});
});
Possibilities for Optimization
I have two thoughts about how to improve throughput:
I am not sure if Spark Streaming parallelization goes down to the RDD element level. Meaning, the output of RDDs can be parallelized (within `stream.foreachRDD()`, but can each element of the RDD be parallelized (within `rdd.foreach()`). If the latter were the case, would a `reduce` transformation on our `dataStream` increase the ability for Spark to output this data in parallel (each JavaRDD would contain exactly one cypher query)?
Even with improved parallelization, our performance would further increase if I could implement some sort of Builder that takes each element of the RDD to create a single cypher query that adds the nodes/edges from all elements, instead of one cypher query for each RDD. But, how would I be able to do this without using another kafka instance, which may be overkill?
Am I over thinking this? I've tried to research so much that I might be in too deep.
Aside: I apologize in advance if any of this is completely wrong. You don't know what you don't know, and I've just started working with Apache Spark and Java 8 w/ lambdas. As Spark users must know by now, either Spark has a steep learning curve due to it's very different paradigm, or I'm an idiot :).
Thanks to anyone who might be able to help; this is my first StackOverflow question in a long time, so please leave feedback and I will be responsive and correct this question as needed.

I think all we need is a simple Map/Reduce. The following should allow us to parse each message in the RDD and then write it to the Graph DB all at once.
dataStream.map( message -> {
return (ParseResult) Neo4JMessageParser.parse(message);
}).foreachRDD( rdd -> {
List<ParseResult> parseResults = rdd.collect();
String cypherQuery = Neo4JMessageParser.buildQuery(parseResults);
Neo4JRepository.update(cypherQuery);
// commit offsets
});
By doing this, we should be able to reduce the overhead associated with having to do a write for each incoming message.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.