This is in reference to Apache Beam SDK Version 2.2.0.
I'm attempting to use AfterPane.elementCountAtLeast(...) but not having any success so far. What I want looks a lot like Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn, but it needs to be adapted to 2.2.0. Ultimately I just need a simple OR where a file is written after X elements OR after Y time has passed. I intend to set the time very high so that the write happens based on the number of elements in the majority of cases, and only based on duration during periods of very low message volume.
Using GCP Dataflow 2.0 PubSub to GCS as a reference, here's what I've tried:
String bucketPath =
    String.format("gs://%s/%s",
        options.getBucketName(),
        options.getDestinationDirName());

PCollection<String> windowedValues = stringMessages
    .apply("Create windows",
        Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(250)))
            .discardingFiredPanes());

windowedValues
    .apply("Write to GCS",
        TextIO
            .write()
            .to(bucketPath)
            .withNumShards(options.getNumShards())
            .withWindowedWrites());
Here, stringMessages is a PCollection<String> read from an Avro-encoded Pub/Sub subscription. There is some unpacking happening upstream to convert the events to strings, but no merging/partitioning/grouping, just transforms.
The element count is hard-coded at 250 just for the PoC. Once it is proven, it will likely be cranked up to the tens or hundreds of thousands.
The Problem
This implementation has resulted in text files of various lengths. The file lengths start very high (1000s of elements) when the job first starts up (presumably processing backlogged data), and then stabilize at some point. I've tried altering 'numShards' to 1 and 10. At 1, the element count of the written files stabilizes at 600, and at 10, it stabilizes at 300.
What am I missing here?
As a side note, this is only step 1. Once I figure out writing using element count, I still need to figure out writing these files as compressed JSON (.json.gz) as opposed to plain-text files.
Posting what I learned for reference by others.
What was not clear to me when I wrote this is the following from the Apache Beam Documentation:
Transforms that aggregate multiple elements, such as GroupByKey and
Combine, work implicitly on a per-window basis
With this knowledge, I rethought my pipeline a bit. From the FileIO documentation under Writing files -> How many shards are generated per pane:
Note that setting a fixed number of shards can hurt performance: it adds an additional GroupByKey to the pipeline. However, it is required to set it when writing an unbounded PCollection due to BEAM-1438 and similar behavior in other runners.
So I decided to use FileIO's writeDynamic to perform the writes and specify withNumShards in order to get the implicit GroupByKey. The final result looks like this:
PCollection<String> windowedValues = validMessageStream.apply(Window
    .<String>configure()
    .triggering(Repeatedly.forever(AfterFirst.of(
        AfterPane.elementCountAtLeast(2000),
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(
            Duration.standardSeconds(windowDurationSeconds)))))
    .discardingFiredPanes());

windowedValues.apply(FileIO.<String, String>writeDynamic()
    .by(Event::getKey)                  // key that determines each element's destination
    .via(TextIO.sink())
    .to("gs://data_pipeline_events_test/events/")
    .withDestinationCoder(StringUtf8Coder.of())
    .withNumShards(1)                   // triggers the implicit GroupByKey mentioned above
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".json")));
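Regarding the side note in the question about writing compressed JSON: a minimal sketch of one way to get .json.gz output, assuming a Beam SDK recent enough to have FileIO.Write#withCompression and Compression.GZIP (2.4+, so newer than the 2.2.0 the question originally targeted):

windowedValues.apply(FileIO.<String, String>writeDynamic()
    .by(Event::getKey)
    .via(TextIO.sink())
    .to("gs://data_pipeline_events_test/events/")
    .withDestinationCoder(StringUtf8Coder.of())
    .withNumShards(1)
    // Gzip each written file and reflect that in the file suffix.
    .withCompression(Compression.GZIP)
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".json.gz")));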
Related
I am running an Apache Beam pipeline on GCP Dataflow.
Dataflow suggests the following:
A fusion break can be inserted after the following transforms to increase parallelism: ReadFromGCS/Match All/Match filepatterns/ParMultiDo(Match). The transforms had the following output-to-input element count ratio, respectively: 1006307.
My pipeline looks something like this:
PCollection<String> records = p.apply("ReadFromGCS",
    TextIO.read().from(options.getInput()).withHintMatchesManyFiles());

PCollection<Document> documents = records.apply("ConvertToDocument",
    ParDo.of(new ProcessJSON(options.getBatch())));

// Write to MongoDB using the MongoDbIO sink
documents.apply("WriteToMongoDB", MongoDbIO.write()
    .withUri("mongodb+srv://" + options.getMongo())
    .withDatabase(options.getDatabase())
    .withCollection(options.getCollection())
    .withBatchSize(options.getBatchSize()));
My input is a GCS bucket with the pattern 'gs://test-bucket/test/*.json', which contains millions of JSON files.
I want to understand what the suggestion means and how to increase parallelism in my case, as Dataflow suggests.
I tried this documentation but could not figure out how to solve it:
https://cloud.google.com/dataflow/docs/guides/using-dataflow-insights#high-fan-out
I'm attaching a screenshot of the suggestion.
Please look at Fusion Optimization for some background on how to enforce or prevent fusion.
A very common way is to use a GroupByKey if there is some natural way to group things, or to use operations such as Reshuffle.viaRandomKey() if you just want to spread elements evenly.
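For example, a minimal sketch of breaking fusion right after the read in the pipeline above (the transform names and options come from the question; the only addition is the Reshuffle step):

PCollection<String> records = p
    .apply("ReadFromGCS",
        TextIO.read().from(options.getInput()).withHintMatchesManyFiles())
    // Break fusion here so the downstream work is redistributed across workers
    // instead of being fused to the single file-matching step.
    .apply("BreakFusion", Reshuffle.viaRandomKey());

PCollection<Document> documents = records.apply("ConvertToDocument",
    ParDo.of(new ProcessJSON(options.getBatch())));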
We need to group Pub/Sub messages by one of the fields in the message. We used a fixed window of 15 minutes to group these messages.
When run on Dataflow, the GroupByKey used for grouping the messages is introducing too many duplicate elements, and another GroupByKey at the far end of the pipeline is failing with 'KeyCommitTooLargeException: Commit request for stage P27 and key abc#123 has size 225337153 which is more than the limit of..'
I have gone through the link below and found that the suggestion was to use Reshuffle, but Reshuffle has a GroupByKey internally.
Why is GroupByKey in beam pipeline duplicating elements (when run on Google Dataflow)?
My pipeline code:
PCollection<String> messages = getReadPubSubSubscription(options, pipeline);

PCollection<String> windowedMessages = messages
    .apply(
        Window
            .<String>into(FixedWindows.of(Duration.standardMinutes(15)))
            .discardingFiredPanes());

PCollectionTuple objectsTuple = windowedMessages
    .apply(
        "UnmarshalStrings",
        ParDo
            .of(new StringUnmarshallFn())
            .withOutputTags(
                StringUnmarshallFn.mainOutputTag,
                TupleTagList.of(StringUnmarshallFn.deadLetterTag)));

PCollection<KV<String, Iterable<ABCObject>>> groupedObjects =
    objectsTuple.get(StringUnmarshallFn.mainOutputTag)
        .apply(
            "GroupByObjects",
            GroupByKey.<String, ABCObject>create());

PCollection results = groupedObjects
    .apply(
        "FetchForEachKey",
        ParDo.of(new SomeFn()).withOutputTags(SomeFn.tag, TupleTagList.empty()))
    .get(SomeFn.tag)
    .apply(
        "Reshuffle",
        Reshuffle.viaRandomKey());
results.apply(...)
...
Pub/Sub is definitely not duplicating messages, and there are no additional failures; GroupByKey is creating these duplicates. Is something wrong with the windowing I am using?
One observation is that GroupByKey produces the same number of elements as the next step. I am attaching two screenshots, one for GroupByKey and the other for the Fetch function.
GroupByKey step
Fetch step
UPDATE: After additional analysis
Stage P27 is actually the first GroupByKey, which is outputting many more elements than expected. I can't see these as duplicates of the actual output elements, because all of these millions of elements are not processed by the next Fetch step. I am not sure whether these are dummy elements introduced by Dataflow or simply a wrong metric from Dataflow.
I am still analyzing why this KeyCommitTooLargeException is thrown, since I only have one input element and grouping should produce only a single iterable per key. I have opened a ticket with Google as well.
GroupByKey groups by key and window. Without a trigger, it outputs just one element per key and window, which is also at most 1 element per input element.
If you are seeing any other behavior it may be a bug and you can report it. You will probably need to provide more steps to reproduce the issue, including example data and the entire runnable pipeline.
Since in the UPDATE you clarified that these are not duplicates, but rather that dummy records are somehow being added (which is really strange), this old thread reports a similar issue, and its answer is interesting since it points to a protobuf serialization issue caused by grouping a very large amount of data in a single window.
I recommend using the available troubleshooting steps (e.g. 1 or 2) to identify in which part of the code the issue starts. For example, I still think that new StringUnmarshallFn() could be performing work that contributes to generating the dummy records. You might want to add counters to your steps to identify how many records each step generates.
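As a rough sketch of the counter idea (the DoFn body is just a placeholder; only the use of the standard Beam Metrics API is the point):

public class StringUnmarshallFn extends DoFn<String, KV<String, ABCObject>> {
    // Shows up in the Dataflow UI / job metrics; create one counter per step you want to compare.
    private final Counter emitted = Metrics.counter(StringUnmarshallFn.class, "emitted-elements");

    @ProcessElement
    public void processElement(ProcessContext c) {
        // ... unmarshal c.element() and emit to the main/dead-letter tags as before ...
        emitted.inc();    // increment once per element this step outputs
    }
}

Comparing the counter values of consecutive steps should tell you where the extra elements appear.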
If you don't find a solution, the remaining option is to contact GCP Support; maybe they can figure it out.
I have a Java application that extracts and compresses quite a few objects on S3 through streaming. To make this more efficient, the application does not download the objects to local disk and upload them again; it streams the files in 5MB chunks and compresses them on the fly. The challenge I am facing is that, in order to report progress on this operation, I need the total size of all the objects, plus a counter of how much of that total has been handled, as the basis for calculating the progress.
The challenge has been that, in order to get the size of the objects, I need to iterate through all of them first, fetch the size one by one, and compute the total before starting the process. However, this is going to be too slow, since there might be millions of objects, which means millions of API calls. If I calculate the sizes before starting the compression, the calculation will take longer than the actual compression and defeat the whole purpose. Therefore, I was wondering whether there is any way I can pass a list of objects in a single API call and receive the total size. I know I can list all objects that match a given prefix, but since the objects may be stored under different prefixes, this approach will not work.
The following code snippet is how I can get the object size one by one:
public Long getObjectSize(AmazonS3Client amazonS3Client, String bucket, String key)
        throws IOException {
    return amazonS3Client.getObjectMetadata(bucket, key).getContentLength();
}
NOTE: If I relied on the number of objects to calculate the progress, that wouldn't be accurate at all. Some objects are 2-3KB and some of them are quite big (1-2GB).
You could use the Java 8 Stream API to iterate over the objects and sum their sizes, or
you could use the AmazonCloudWatch API to get the BucketSizeBytes metric.
With CloudWatch, you call listMetrics and then request BucketSizeBytes via GetMetricData.
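For the first option, here is a rough sketch (an assumption on my part, not the only way) that sums the sizes from the bucket listing itself, so you make one API call per 1,000 objects instead of one per object; it uses the same AWS SDK v1 client family as the question:

public long getTotalSize(AmazonS3 s3, String bucket) {
    long total = 0;
    ListObjectsV2Request request = new ListObjectsV2Request().withBucketName(bucket);
    ListObjectsV2Result result;
    do {
        // Each call returns up to 1,000 object summaries, and each summary
        // already carries the object size, so no per-object metadata calls.
        result = s3.listObjectsV2(request);
        total += result.getObjectSummaries().stream()
                .mapToLong(S3ObjectSummary::getSize)
                .sum();
        request.setContinuationToken(result.getNextContinuationToken());
    } while (result.isTruncated());
    return total;
}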
Here are the links to the CloudWatch documentation:
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/cloudwatch/AmazonCloudWatch.html#listMetrics-com.amazonaws.services.cloudwatch.model.ListMetricsRequest-
https://docs.aws.amazon.com/AmazonS3/latest/dev/cloudwatch-monitoring.html
Here are some examples of using AmazonCloudWatch:
https://www.javatips.net/api/com.amazonaws.services.cloudwatch.model.metric
https://www.programcreek.com/java-api-examples/?api=com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
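And here is a rough Java sketch of the CloudWatch option (it uses GetMetricStatistics, which is simpler than GetMetricData for a single metric; the bucket name is the same placeholder as in the CLI example below, and note that S3 reports this metric only once per day, so it is an approximation rather than a live total):

AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

GetMetricStatisticsRequest request = new GetMetricStatisticsRequest()
        .withNamespace("AWS/S3")
        .withMetricName("BucketSizeBytes")
        .withDimensions(
                new Dimension().withName("BucketName").withValue("ExampleBucket"),
                new Dimension().withName("StorageType").withValue("StandardStorage"))
        .withStartTime(Date.from(Instant.now().minus(Duration.ofDays(2))))
        .withEndTime(Date.from(Instant.now()))
        .withPeriod(86400)                 // one datapoint per day
        .withStatistics("Average")
        .withUnit("Bytes");

GetMetricStatisticsResult result = cloudWatch.getMetricStatistics(request);
result.getDatapoints().forEach(dp ->
        System.out.println(dp.getTimestamp() + " -> " + dp.getAverage() + " bytes"));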
UPDATE:
As I mentioned in one of the comments, you could also use the command-line interface.
In this case you still use CloudWatch, but through the AWS CLI, and you receive the response in JSON format.
One of the links I posted has an example, which follows here:
aws cloudwatch get-metric-statistics --metric-name BucketSizeBytes \
    --namespace AWS/S3 --start-time 2016-10-19T00:00:00Z --end-time 2016-10-20T00:00:00Z \
    --statistics Average --unit Bytes --region us-west-2 \
    --dimensions Name=BucketName,Value=ExampleBucket Name=StorageType,Value=StandardStorage \
    --period 86400 --output json
This other link has more explanations:
http://cloudsqale.com/2018/10/08/s3-monitoring-step-1-bucket-size-and-number-of-objects/
In summary, using CloudWatch seems to be the easiest way to avoid making many calls in a loop.
Summary
My question is about how Apache Spark Streaming can handle an output operation that takes a long time by either improving parallelization or by combining many writes into a single, larger write. In this case, the write is a cypher request to Neo4J, but it could apply to other data storage.
Environment
I have an Apache Spark Streaming application in Java that writes to 2 datastores: Elasticsearch and Neo4j. Here are the versions:
Java 8
Apache Spark 2.11
Neo4J 3.1.1
Neo4J Java Bolt Driver 1.1.2
The Elasticsearch output was easy enough as I used the Elasticsearch-Hadoop for Apache Spark library.
Our Stream
Our input is a stream from Kafka received on a particular topic, and I deserialize the elements of the stream through a map function to create a JavaDStream<OurMessage> dataStream. I then do transforms on this message to create a cypher query String cypherRequest (using an OurMessage-to-String transformation) that is sent to a singleton that manages the Bolt driver connection to Neo4j (I know I should use a connection pool, but maybe that's another question). The cypher query produces a number of nodes and/or edges based on the contents of OurMessage.
The code looks something like the following.
dataStream.foreachRDD(rdd -> {
    rdd.foreach(cypherQuery -> {
        BoltDriverSingleton.getInstance().update(cypherQuery);
    });
});
Possibilities for Optimization
I have two thoughts about how to improve throughput:
I am not sure if Spark Streaming parallelization goes down to the RDD element level. Meaning, the output of RDDs can be parallelized (within `stream.foreachRDD()`), but can each element of the RDD be parallelized (within `rdd.foreach()`)? If the latter were the case, would a `reduce` transformation on our `dataStream` increase Spark's ability to output this data in parallel (each JavaRDD would contain exactly one cypher query)?
Even with improved parallelization, our performance would further increase if I could implement some sort of builder that takes each element of the RDD and creates a single cypher query that adds the nodes/edges from all elements, instead of one cypher query per element. But how would I be able to do this without using another Kafka instance, which may be overkill?
Am I over thinking this? I've tried to research so much that I might be in too deep.
Aside: I apologize in advance if any of this is completely wrong. You don't know what you don't know, and I've just started working with Apache Spark and Java 8 with lambdas. As Spark users must know by now, either Spark has a steep learning curve due to its very different paradigm, or I'm an idiot :).
Thanks to anyone who might be able to help; this is my first StackOverflow question in a long time, so please leave feedback and I will be responsive and correct this question as needed.
I think all we need is a simple Map/Reduce. The following should allow us to parse each message in the RDD and then write it to the Graph DB all at once.
dataStream.map(message -> {
    // Parse each incoming message into an intermediate result.
    return (ParseResult) Neo4JMessageParser.parse(message);
}).foreachRDD(rdd -> {
    // Collect the parsed results for this micro-batch and build a single
    // cypher query that covers all of them, then write once.
    List<ParseResult> parseResults = rdd.collect();
    String cypherQuery = Neo4JMessageParser.buildQuery(parseResults);
    Neo4JRepository.update(cypherQuery);
    // commit offsets
});
By doing this, we should be able to reduce the overhead associated with having to do a write for each incoming message.
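If collecting every micro-batch to the driver ever becomes the bottleneck, a hedged alternative to the snippet above (just a sketch) is to batch per partition instead, so each executor builds and sends its own combined query; Neo4JMessageParser and Neo4JRepository are the same helpers as above:

dataStream.map(message -> (ParseResult) Neo4JMessageParser.parse(message))
    .foreachRDD(rdd -> {
        rdd.foreachPartition(partition -> {
            // Build one combined query per partition, on the executor,
            // instead of collecting the whole RDD to the driver.
            List<ParseResult> batch = new ArrayList<>();
            partition.forEachRemaining(batch::add);
            if (!batch.isEmpty()) {
                Neo4JRepository.update(Neo4JMessageParser.buildQuery(batch));
            }
        });
    });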
We are running Spark Java in local mode on a single AWS EC2 instance using
"local[*]"
However, profiling with New Relic tools and a simple 'top' shows that only one CPU core of our 16-core machine is ever in use for the three different Java Spark jobs we've written (we've also tried different AWS instances, but only one core is ever used).
Runtime.getRuntime().availableProcessors() reports 16 processors and
sparkContext.defaultParallelism() reports 16 as well.
I've looked at various Stack Overflow local-mode issues, but none seem to have resolved this.
Any advice much appreciated.
Thanks
EDIT: Process
1) Use sqlContext to read gzipped CSV file 1 using com.databricks.spark.csv from disk (S3) into DataFrame DF1.
2) Use sqlContext to read gzipped CSV file 2 using com.databricks.spark.csv from disk (S3) into DataFrame DF2.
3) Use DF1.toJavaRDD().mapToPair(new mapping function that returns a Tuple2<...>) to get RDD1.
4) Use DF2.toJavaRDD().mapToPair(new mapping function that returns a Tuple2<...>) to get RDD2.
5) Call union on the RDDs.
6) Call reduceByKey() on the unioned RDDs to "merge by key", so we have a Tuple2<...> with only one instance of a particular key (as the same key appears in both RDD1 and RDD2).
7) Call .values().map(new mapping function which iterates over all items in the provided List and merges them as required, returning a List of the same or smaller length).
8) Call .flatMap() to get an RDD.
9) Use sqlContext to create a DataFrame of type DomainClass from the flat-mapped RDD.
10) Use DF.coalesce(1).write() to write the DF as gzipped CSV to S3.
I think your problem is that your CSV files are gzipped. When Spark reads files, it loads them in parallel, but it can only do this if the file codec is splittable*. Plain (non-gzipped) text and Parquet are splittable, as is the bgzip codec used in genomics (my field). Your files are each ending up in a single partition.
Try decompressing the csv.gz files and running this again. I think you'll see much better results!
*Splittable formats mean that if you are given an arbitrary file offset at which to start reading, you can find the beginning of the next record in your block and interpret it. Gzipped files are not splittable.
Edit: I replicated this behavior on my machine. Using sc.textFile on a 3G gzipped text file produced 1 partition.
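If decompressing up front isn't convenient, a rough workaround (a sketch on top of the question's setup, assuming the Spark 1.x-style sqlContext / spark-csv usage from the steps above; the path and option values are placeholders) is to repartition immediately after the read. The initial read of each gzipped file is still single-threaded, but everything downstream of the repartition can use all 16 cores:

// Each gzipped CSV arrives as a single partition; spread the rows out
// before the expensive mapToPair/reduceByKey work.
DataFrame df1 = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("s3://bucket/path/file1.csv.gz")
        .repartition(16);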