Apache Beam GCP Dataflow Suggestion Implementation

Apache Beam GCP Dataflow Suggestion Implementation - java

I am running apache beam pipeline on GCP dataflow.
The dataflow pipeline suggests following item
A fusion break can be inserted after the following transforms to increase parallelism: ReadFromGCS/Match All/Match filepatterns/ParMultiDo(Match). The transforms had the following output-to-input element count ratio, respectively: 1006307.
my pipeline looks something like this
PCollection<String> records = p.apply("ReadFromGCS", TextIO.read().from(options.getInput())
.withHintMatchesManyFiles());
PCollection<Document> documents = records.apply("ConvertToDocument", ParDo.of(new ProcessJSON(options.getBatch())));
// Write to MongoDB using ParDo transform sink
documents.apply("WriteToMongoDB", MongoDbIO.write()
.withUri("mongodb+srv://"+options.getMongo())
.withDatabase(options.getDatabase())
.withCollection(options.getCollection())
.withBatchSize(options.getBatchSize()));
my input is a gcs bucket of pattern 'gs://test-bucket/test/*.json' which contains million json files
I want to understand what does the suggestion mean and how do I increase parallelism as suggested my dataflow in my case.
I tried this documentation but could not figure out how to solve this
https://cloud.google.com/dataflow/docs/guides/using-dataflow-insights?&_ga=2.162468924.-1096227812.1671933466#high-fan-out
attaching screenshot
image

Please look at Fusion Optimization for some background information on how to enforce/prevent a fusion.
A very common way is to have GroupByKey if there is some natural way to group things, or use operations such as Reshuffle.ViaRandomKey if you want to just spread evenly.

Related

Apache Beam GroupByKey outputting duplicate elements with PubSubIO

We need to group PubSub messages by one of the fields from messages. We used fixed window of 15mins to group these messages.
When run on data flow, the GroupByKey used for messages grouping is introducing too many duplicate elements, another GroupByKey at far end of pipeline is failing with 'KeyCommitTooLargeException: Commit request for stage P27 and key abc#123 has size 225337153 which is more than the limit of..'
I have gone through the below link and found the suggestion was to use Reshuffle but Reshuffle has GroupByKey internally.
Why is GroupByKey in beam pipeline duplicating elements (when run on Google Dataflow)?
My pipeline code:
PCollection<String> messages = getReadPubSubSubscription(options, pipeline);
PCollection<String> windowedMessages = messages
.apply(
Window
.<String>into(
FixedWindows.of(Duration.standardMinutes(15)))
.discardingFiredPanes());
PCollectionTuple objectsTuple = windowedMessages
.apply(
"UnmarshalStrings",
ParDo
.of(new StringUnmarshallFn())
.withOutputTags(
StringUnmarshallFn.mainOutputTag,
TupleTagList.of(StringUnmarshallFn.deadLetterTag)));
PCollection<KV<String, Iterable<ABCObject>>> groupedObjects =
objectsTuple.get(StringUnmarshallFn.mainOutputTag)
.apply(
"GroupByObjects",
GroupByKey.<String, ABCObject>create());
PCollection results = groupedObjects
.apply(
"FetchForEachKey",
ParDo.of(SomeFn())).get(SomeFn.tag)
.apply(
"Reshuffle",
Reshuffle.viaRandomKey());
results.apply(...)
...
PubSub is not duplicating messages for sure and there are no additional failures, GroupByKey is creating these duplicates, is something wrong with the Windowing I am using?
One observation is GroupBy is producing same no of elements as the next step produce. I am attaching two screenshots one for GroupByKey and Other For Fetch Function.
GroupByKey step
Fetch step
UPDATE After additional analysis
Stage P27 is actually the first GroupByKey which is outputting many elements than expected. I can't see them as duplicates of actual output element as all these million elements are not processed by next Fetch step. I am not sure if these are some dummy elements introduced by dataflow or wrong metric from dataflow.
I am still analyzing further on why this KeyCommitTooLargeException is thrown as I only have one input element and grouping should only produce one element iterable. I have opened a ticket with Google as well.

GroupByKey groups by key and window. Without a trigger, it outputs just one element per key and window, which is also at most 1 element per input element.
If you are seeing any other behavior it may be a bug and you can report it. You will probably need to provide more steps to reproduce the issue, including example data and the entire runnable pipeline.

Since in the UPDATE you clarified that there are not duplicates, instead somehow dummy records are being added (what is really strange), this old thread reports similar issue and the answer is interesting since points out to a protobuf serialization issue caused by grouping a very large amount of data in a single window.
I recommend using the available troubleshooting steps (e.g. 1 or 2) to identify in which part of the code the issue is starting. For example, I'm still think that new StringUnmarshallFn() could be performing tasks that contribute to generate the dummy records. You might want to implement counters in your steps to try to identify how many records each step generates.
If you don't find a solution, the outstanding option is contact GCP Support and maybe they can figure it out.

Looking for performance tips on Spark DStream to Parquet files

I want to store Elasticsearch indices to HDFS files not using ES-Hadoop Connector.
A proposed solution is using Streaming Custom Receivers to read and save as parquet files and the code is like,
JavaDStream<String> jsonDocs = ssc.union(dsList.get(0), dsList.subList(1, dsList.size())); // I have a couple receivers
jsonDocs.foreachRDD( rdd -> {
Dataset<Row> ds = spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
ds.write().mode(SaveMode.Append).option("compression","gzip").parquet(path);
With this, I get some okay performance number, however, for I am new to Spark, I wonder if there is any room to improve.
For example, I see that json() and parquet() jobs take most of time, and is json() jobs taking long time necessary or can it be avoided?
(I have omitted some other jobs, e.g. count(), from the code snippet for simplicity.)
Using Structured Streaming looks a good but haven’t found a simple solution with Custom Receivers Streaming.
Thanks in advance,

spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
Looking above, reading json() might not best for performance sensitive work. Spark uses JacksonParser in it's data source api for reading json. If your json structure is simple try to parse it by yourself using map() function to create Row.

Writing to GCS from dataflow based on windowing and element count

I am attempting to implement a solution where I need to write data (json) messages from pubsub into GCS using dataflow. My question is exactly similar to this one
I need to write either based on windowing or element count.
Here is the code sample for writes from the the above question:
windowedValues.apply(FileIO.<String, String>writeDynamic()
.by(Event::getKey)
.via(TextIO.sink())
.to("gs://data_pipeline_events_test/events/")
.withDestinationCoder(StringUtf8Coder.of())
.withNumShards(1)
.withNaming(key -> FileIO.Write.defaultNaming(key, ".json")));
The solution suggests using FileIO.WriteDynamic function. But i am not able to understand what .by(Event::getKey) does and where it comes from.
Any help on this is greatly appreciated.

It's partitioning elements into groups according to events' keys.
From my understanding, the events come from a PCollection using the KV class since it has the getKey method.
Note that :: is a new operator included in Java 8 that is used to refer a method of a class.

Writing to GCS with Dataflow using element count

This is in reference to Apache Beam SDK Version 2.2.0.
I'm attempting to use AfterPane.elementCountAtLeast(...) but not having any success so far. What I want looks a lot like Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn, but needs to be adapted to 2.2.0. Ultimately I just need a simple OR where a file is written after X elements OR Y time has passed. I intend to set the time very high so that the write happens on the number of elements in the majority of cases, and only writes based on duration during times of very low message volume.
Using GCP Dataflow 2.0 PubSub to GCS as a reference here's what I've tried:
String bucketPath =
String.format("gs://%s/%s",
options.getBucketName(),
options.getDestinationDirName());
PCollection<String> windowedValues = stringMessages
.apply("Create windows",
Window.<String>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(250)))
.discardingFiredPanes());
windowedValues
.apply("Write to GCS",
TextIO
.write()
.to(bucketPath)
.withNumShards(options.getNumShards())
.withWindowedWrites());
Where stringMessages is a PCollection that is reading from an Avro-encoded pubsub subscription. There is some unpacking happening upstream to get the events converted to strings, but no merging/partitioning/grouping, just transforms.
Element count is hard coded at 250 just for PoC. Once it is proven, it will likely be cranked up to the 10s or 100s of thousands range.
The Problem
This implementation has resulted in text files of various lengths. The files lengths start very high (1000s of elements) when the job first starts up (presumably processing backlogged data, and then stabilize at some point. I've tried altering the 'numShards' to 1 and 10. At 1, the element count of the written files stabilizes at 600, and with 10, it stabilizes at 300.
What am I missing here?
As a side note, this is only step 1. Once I figure out writing using
element count, I still need to figure out writing these files as
compressed json (.json.gz) as opposed to plain-text files.

Posting what I learned for reference by others.
What was not clear to me when I wrote this is the following from the Apache Beam Documentation:
Transforms that aggregate multiple elements, such as GroupByKey and
Combine, work implicitly on a per-window basis
With this knowledge, I rethought my pipeline a bit. From the FileIO documentation under Writing files -> How many shards are generated per pane:
Note that setting a fixed number of shards can hurt performance: it adds an additional GroupByKey to the pipeline. However, it is required to set it when writing an unbounded PCollection due to BEAM-1438 and similar behavior in other runners.
So I decided to use FileIO's writeDynamic to perform the writes and specify withNumShards in order to get the implicit GroupByKey. The final result looks like this:
PCollection<String> windowedValues = validMessageStream.apply(Window
.<String>configure()
.triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(2000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(
Duration.standardSeconds(windowDurationSeconds)))))
.discardingFiredPanes());
windowedValues.apply(FileIO.<String, String>writeDynamic()
.by(Event::getKey)
.via(TextIO.sink())
.to("gs://data_pipeline_events_test/events/")
.withDestinationCoder(StringUtf8Coder.of())
.withNumShards(1)
.withNaming(key -> FileIO.Write.defaultNaming(key, ".json")));

A single output request from multiple elements of a JavaRDD in Apache Spark Streaming

Summary
My question is about how Apache Spark Streaming can handle an output operation that takes a long time by either improving parallelization or by combining many writes into a single, larger write. In this case, the write is a cypher request to Neo4J, but it could apply to other data storage.
Environment
I have an Apache Spark Streaming application in Java that writes to 2 datastores: Elasticsearch and Neo4j. Here are the versions:
Java 8
Apache Spark 2.11
Neo4J 3.1.1
Neo4J Java Bolt Driver 1.1.2
The Elasticsearch output was easy enough as I used the Elasticsearch-Hadoop for Apache Spark library.
Our Stream
Our input is a stream from Kafka received on a particular topic, and I deserialize the elements of the stream through a map function to create a JavaDStream<[OurMessage]> dataStream. I then do transforms on this message to create a cypher query String cypherRequest (using an OurMessage to String transformation) that is sent to a singleton that manages the Bolt Driver connection to Neo4j (I know I should use a connection pool, but maybe that's another question). The cypher query produces a number of nodes and/or edges based on the contents of OurMessage.
The code looks something like the following.
dataStream.foreachRDD( rdd -> {
rdd.foreach( cypherQuery -> {
BoltDriverSingleton.getInstance().update(cypherQuery);
});
});
Possibilities for Optimization
I have two thoughts about how to improve throughput:
I am not sure if Spark Streaming parallelization goes down to the RDD element level. Meaning, the output of RDDs can be parallelized (within `stream.foreachRDD()`, but can each element of the RDD be parallelized (within `rdd.foreach()`). If the latter were the case, would a `reduce` transformation on our `dataStream` increase the ability for Spark to output this data in parallel (each JavaRDD would contain exactly one cypher query)?
Even with improved parallelization, our performance would further increase if I could implement some sort of Builder that takes each element of the RDD to create a single cypher query that adds the nodes/edges from all elements, instead of one cypher query for each RDD. But, how would I be able to do this without using another kafka instance, which may be overkill?
Am I over thinking this? I've tried to research so much that I might be in too deep.
Aside: I apologize in advance if any of this is completely wrong. You don't know what you don't know, and I've just started working with Apache Spark and Java 8 w/ lambdas. As Spark users must know by now, either Spark has a steep learning curve due to it's very different paradigm, or I'm an idiot :).
Thanks to anyone who might be able to help; this is my first StackOverflow question in a long time, so please leave feedback and I will be responsive and correct this question as needed.

I think all we need is a simple Map/Reduce. The following should allow us to parse each message in the RDD and then write it to the Graph DB all at once.
dataStream.map( message -> {
return (ParseResult) Neo4JMessageParser.parse(message);
}).foreachRDD( rdd -> {
List<ParseResult> parseResults = rdd.collect();
String cypherQuery = Neo4JMessageParser.buildQuery(parseResults);
Neo4JRepository.update(cypherQuery);
// commit offsets
});
By doing this, we should be able to reduce the overhead associated with having to do a write for each incoming message.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.