Apache Beam in Dataflow Large Side Input - java

This is most similar to this question.
I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a very large dataset that comes from Google BigQuery and have all the relevant values attached to it (based on a key) before being written to a database.
The trouble is that the mapping dataset from BigQuery is very large - any attempt to use it as a side input fails with the Dataflow runners throwing the error "java.lang.IllegalArgumentException: ByteString would be too long". I have attempted the following strategies:
1) Side input
As stated, the mapping data is (apparently) too large to do this. If I'm wrong here or there is a workaround, please let me know, because this would be the simplest solution.
2) Key-Value pair mapping
In this strategy, I read the BigQuery data and Pubsub message data in the first part of the pipeline, then run each through ParDo transformations that change every value in the PCollections to KeyValue pairs. Then, I run a Merge.Flatten transform and a GroupByKey transform to attach the relevant mapping data to each message.
The trouble here is that streaming data requires windowing to be merged with other data, so I have to apply windowing to the large, bounded BigQuery data as well. It also requires that the windowing strategies are the same on both datasets. But no windowing strategy for the bounded data makes sense, and the few windowing attempts I've made simply send all the BQ data in a single window and then never send it again. It needs to be joined with every incoming pubsub message.
3) Calling BQ directly in a ParDo (DoFn)
This seemed like a good idea - have each worker declare a static instance of the map data. If it's not there, then call BigQuery directly to get it. Unfortunately this throws internal errors from BigQuery every time (as in the entire message just says "Internal error"). Filing a support ticket with Google resulted in them telling me that, essentially, "you can't do that".
It seems this task doesn't really fit the "embarrassingly parallelizable" model, so am I barking up the wrong tree here?
EDIT:
Even when using a high-memory machine in Dataflow and attempting to load the side input as a map view, I get the error java.lang.IllegalArgumentException: ByteString would be too long.
Here is an example (pseudocode) of the code I'm using:
Pipeline pipeline = Pipeline.create(options);

PCollectionView<Map<String, TableRow>> mapData = pipeline
    .apply("ReadMapData", BigQueryIO.read().fromQuery("SELECT whatever FROM ...").usingStandardSql())
    .apply("BQToKeyValPairs", ParDo.of(new BQToKeyValueDoFn()))
    .apply(View.asMap());

PCollection<PubsubMessage> messages = pipeline.apply(PubsubIO.readMessages()
    .fromSubscription(String.format("projects/%1$s/subscriptions/%2$s", projectId, pubsubSubscription)));

messages.apply(ParDo.of(new DoFn<PubsubMessage, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        JSONObject data = new JSONObject(new String(c.element().getPayload()));
        String key = getKeyFromData(data);
        TableRow sideInputData = c.sideInput(mapData).get(key);
        if (sideInputData != null) {
            LOG.info("holyWowItWOrked");
            c.output(new TableRow());
        } else {
            LOG.info("noSideInputDataHere");
        }
    }
}).withSideInputs(mapData));
The pipeline throws the exception and fails before logging anything from within the ParDo.
Stack trace:
java.lang.IllegalArgumentException: ByteString would be too long: 644959474+1551393497
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.concat(ByteString.java:524)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:576)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.copyFrom(ByteString.java:559)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString$Output.toByteString(ByteString.java:1006)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillBag.persistDirectly(WindmillStateInternals.java:575)
com.google.cloud.dataflow.worker.WindmillStateInternals$SimpleWindmillState.persist(WindmillStateInternals.java:320)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillCombiningState.persist(WindmillStateInternals.java:951)
com.google.cloud.dataflow.worker.WindmillStateInternals.persist(WindmillStateInternals.java:216)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.flushState(StreamingModeExecutionContext.java:513)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:363)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1000)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$800(StreamingDataflowWorker.java:133)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:771)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

Check out the section called "Pattern: Streaming mode large lookup tables" in Guide to common Cloud Dataflow use-case patterns, Part 2. It might be the only viable solution since your side input doesn't fit into memory.
Description:
A large (in GBs) lookup table must be accurate, and changes often or does not fit in memory.
Example:
You have point of sale information from a retailer and need to associate the name of the product item with the data record which contains the productID. There are hundreds of thousands of items stored in an external database that can change constantly. Also, all elements must be processed using the correct value.
Solution:
Use the "Calling external services for data enrichment" pattern but rather than calling a micro service, call a read-optimized NoSQL database (such as Cloud Datastore or Cloud Bigtable) directly. For each value to be looked up, create a Key Value pair using the KV utility class. Do a GroupByKey to create batches of the same key type to make the call against the database. In the DoFn, make a call out to the database for that key and then apply the value to all values by walking through the iterable. Follow best practices with client instantiation as described in "Calling external services for data enrichment".
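To make the quoted pattern concrete, here is a minimal sketch in Beam Java. It assumes a hypothetical LookupClient for the read-optimized store plus hypothetical extractKey/enrich helpers (none of these names come from the guide), and it reuses the messages PCollection from the question's code:
PCollection<TableRow> enriched = messages
    .apply("KeyByLookupId", ParDo.of(new DoFn<PubsubMessage, KV<String, PubsubMessage>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // extractKey is a hypothetical helper that pulls the join key out of the payload.
            c.output(KV.of(extractKey(c.element()), c.element()));
        }
    }))
    .apply(Window.<KV<String, PubsubMessage>>into(FixedWindows.of(Duration.standardSeconds(10))))
    .apply(GroupByKey.<String, PubsubMessage>create())
    .apply("LookupAndEnrich", ParDo.of(new DoFn<KV<String, Iterable<PubsubMessage>>, TableRow>() {
        private transient LookupClient client;   // hypothetical client for Datastore/Bigtable

        @Setup
        public void setup() {
            client = LookupClient.create();      // create once per DoFn instance, per the best practices above
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            TableRow mapping = client.lookup(c.element().getKey());   // one external call per key batch
            for (PubsubMessage msg : c.element().getValue()) {
                c.output(enrich(msg, mapping));  // hypothetical merge of message + mapping row
            }
        }
    }));
Grouping by key first keeps the number of external lookups proportional to the distinct keys per window rather than to the raw message volume.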
Other relevant patterns are described in Guide to common Cloud Dataflow use-case patterns, Part 1:
Pattern: Slowly-changing lookup cache
Pattern: Calling external services for data enrichment

Related

Google Cloud DLP Api InspectResult

Good day!
I'm using the Cloud DLP API to inspect BigQuery views by converting chunks of the data into a ContentItem and passing it to the inspect request. However, I am having trouble converting the findings and saving them to a BigQuery table. Before, I used an Airflow DLP operator for this and it was done automatically by passing an output storage config in an InspectConfig. However, that approach isn't applicable anymore since I'm calling the DLP API per chunk of data using Apache Beam in Java.
I saw that the Finding object has a writeTo() method but I'm not sure how to use it, or how to save the findings with the correct types into a BigQuery table. Can you help me with this? I'm currently stuck. Thank you!
What I want to do is something like this:
for (Finding res : result.getFindingsList()) {
    TableRow bqRow = new TableRow();
    Object data = res.getLocation();
    bqRow.set("field", data);
    context.output(bqRow);
}
But this approach wouldn't save it in BigQuery with the correct types, especially for getLocation, as it returns something like a protobuf message type.
I was trying to see if I can use the writeTo() method but I'm not sure how to use it. Thank you in advance for the help!
for (Finding res : result.getFindingsList()) {
    res.writeTo(...)
    ...
    context.output(...);
}
If you use HybridInspect, we'll store the findings to BigQuery for you:
https://cloud.google.com/dlp/docs/how-to-hybrid-jobs
If you do it yourself, you will need to convert the findings to a native BQ format like JSON.
Load protobuf data to bigquery
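To make the JSON route concrete, here is a minimal, hedged sketch using JsonFormat from protobuf-java-util inside the DoFn; the finding_json column name is an assumption, not something from the question or the answer:
import com.google.api.services.bigquery.model.TableRow;
import com.google.privacy.dlp.v2.Finding;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.util.JsonFormat;
...
for (Finding res : result.getFindingsList()) {
    TableRow bqRow = new TableRow();
    try {
        // JsonFormat serializes the full protobuf, so nested fields such as location survive intact.
        bqRow.set("finding_json", JsonFormat.printer().print(res));
    } catch (InvalidProtocolBufferException e) {
        throw new RuntimeException("Could not serialize Finding to JSON", e);
    }
    context.output(bqRow);
}
The target BigQuery column would then be a STRING (or JSON) field holding the serialized finding.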

Apache Flink : Add side inputs for DataStream API

In my Java application, I have three DataStreams. For example, for one stream data is consumed from Kafka, and for another stream data is consumed from Apache NiFi. These two streams have different object types: Stream-1's object type is Person and Stream-2's is Address.
The third one is a broadcast stream (its data is consumed from Kafka).
Now I want to combine Stream-1 and Stream-2 in a Job class and split them in the task's process element. How can I implement this?
Note:
Stream-1 is the main stream and Stream-2 is the side input. The main stream continuously fetches data from Kafka. For the side input, all table data is initially loaded from the DB while the application starts up, and new data is read whenever the table data is updated (which is not frequent).
Sample structure:
DataStream<Person> stream1 = env.addSource(...);    // read data from Kafka
DataStream<Address> stream2 = env.addSource(...);   // read data from NiFi
BroadcastStream<String> broadcastStream = stream3.broadcast(...);   // stream-3 reads from Kafka
I was referred to the following links:
FLIP-17 Side Inputs for DataStream API
jira/browse/FLINK-6131
My use case is:
Join stream with slowly evolving data: the side input that we use for enriching is evolving over time (data is read from a DB). This can be done by waiting for some initial data to be available before processing the main input, and then continuously ingesting new data into the internal side-input structure as it arrives.
Based on the latest response, the recommendation by @Arvid was in fact what was needed here.
Core of the answer:
You can easily join stream1 and stream2 even if they have different types. Then you can add the broadcast to the result.
Links to the doc and example, and a relevant snippet from the doc (the example is too long to include here):
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
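To sketch the second half of the recommendation ("then you can add the broadcast to the result"), here is a minimal, hedged example of connecting the joined stream to the broadcast stream; joinedStream, stream3, the descriptor, and the "latest" key are illustrative assumptions, not part of the answer:
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;
...
// Descriptor for the broadcast state; names and types are illustrative.
MapStateDescriptor<String, String> ruleDescriptor =
    new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

BroadcastStream<String> broadcast = stream3.broadcast(ruleDescriptor);

DataStream<String> enriched = joinedStream   // the DataStream<String> produced by the join above
    .connect(broadcast)
    .process(new BroadcastProcessFunction<String, String, String>() {
        @Override
        public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
            // Enrich each joined record with whatever is currently in broadcast state.
            String rule = ctx.getBroadcastState(ruleDescriptor).get("latest");
            out.collect(rule == null ? value : value + "|" + rule);
        }

        @Override
        public void processBroadcastElement(String value, Context ctx, Collector<String> out) throws Exception {
            // Update broadcast state whenever new data arrives on stream-3.
            ctx.getBroadcastState(ruleDescriptor).put("latest", value);
        }
    });
Broadcast state is replicated to every parallel instance, which matches the slowly-evolving lookup data described in the use case.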

Hazelcast Jet Pipelines API: processing data from more than one parent node

This question is about the Pipeline API in Hazelcast Jet 0.5.1
The pipeline I am trying to create has two infinite sources: one is a ticker (a custom source which sends one event every minute), the other is a Kafka Topic.
It looks like that:
Pipeline pipeline = Pipeline.create();
ComputeStage<Object> tickerSource = pipeline.drawFrom(Sources.fromProcessor("ticker", TickerSource.getSupplier()));
ComputeStage<Object> kafkaSource = pipeline.drawFrom(KafkaSources.kafka(sourceProperties, KAFKA_TOPIC));
When either of those sources emit an event, I want that event to go through the same steps and drain to the same sinks. I want a "UNION", if we translate my problem to SQL terms. Something that would look like this:
[image: target pipeline]
All of the examples and documentation I've found about having two nodes go into one would be the equivalent of a SQL "JOIN" operation, not a "UNION".
The only way I've found to bypass my issue is to do something like this, but I feel like this is something the framework should already have despite the fact that I can't seem to find it.
Arrays.asList(tickerSource, kafkaSource).forEach(source -> {
    ComputeStage<Object> result = source.map(MyCustomProcessor::process);
    result.drainTo(Sinks.fromProcessor("first-sink", MyFirstSink.getSupplier()));
    result.drainTo(Sinks.fromProcessor("second-sink", MySecondSink.getSupplier()));
});
The result looks like this:
[image: resulting pipeline]

Kafka Streams app: separate reads from writes

I am pretty new to Kafka and Kafka Streams so please bear with me. I would like to know if I am on the right track here.
I am writing to a Kafka topic at the moment and trying to access the data through a REST service. The raw data needs to be transformed before it is accessed.
What I have so far is a producer that writes the raw data into a topic.
1.) Now I want a Streams app (it should be a jar running in a container) that just transforms the data into my desired shape, following the materialized-view paradigm here.
An oversimplified version of 1.):
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> source = builder.stream("my-raw-data-topic");
KafkaStreams streams = new KafkaStreams(builder, props);
KTable<String, Long> t = source.groupByKey().count("My-Table");
streams.start();
2.) And another Streams app (it should be a jar running in a container) that just holds the KTable as some sort of repository, which can be accessed via a wrapping REST service.
Here I am kind of stuck on the proper way to work with the API.
What is the bare minimum to access and query a KTable? Do I need to assign the transformation topology to the builder again?
KStreamBuilder builder = new KStreamBuilder();
KTable table = builder.table("My-Table"); // Casting?
KafkaStreams streams = new KafkaStreams(builder, props);
RestService service = new RestService(table);
// Use the table as a repository which is wrapped by a REST service and gets updated reactively
Right now this is pseudo code.
Am I on the right path here? Does it make sense to separate 1.) and 2.)? Is this the intended way to work with Streams to materialize views? For me, it would have the benefit of scaling the writes and the reads independently via containers, wherever I see more traffic.
How is the repopulating of the KTable handled if either 1.) or 2.) crashes? Is this done via replication through the Streams API, or is this something I would need to address in code, like resetting the cursor and replaying the events?
Couple of comments:
In your code snippet (1) you modify your topology after you handed the builder into the KafkaStreams constructor:
KafkaStreams streams = new KafkaStreams(builder,props);
// don't modify builder anymore!
You should not do this; first specify your topology and only afterwards create the KafkaStreams instance.
About splitting your application into two: this can make sense to scale both parts independently, but it's hard to say in general. However, if you do split them, the first one needs to write the transformed data into an output topic, and the second one should read this output topic as a table (builder.table("output-topic-of-transformation")) to serve the REST requests.
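A minimal sketch of that split, hedged and using the same older KStreamBuilder API as the question; the output topic, the "rest-store" name, and the props1/props2 objects are assumptions rather than anything prescribed by the answer:
// App 1: transform the raw topic and publish the result to an output topic.
KStreamBuilder builder1 = new KStreamBuilder();
KStream<String, String> source = builder1.stream("my-raw-data-topic");
KTable<String, Long> counts = source.groupByKey().count("My-Table");
counts.toStream().to(Serdes.String(), Serdes.Long(), "output-topic-of-transformation");
KafkaStreams app1 = new KafkaStreams(builder1, props1);   // props1 needs its own application.id
app1.start();

// App 2: materialize the output topic as a queryable table for the REST layer.
KStreamBuilder builder2 = new KStreamBuilder();
KTable<String, Long> table = builder2.table(Serdes.String(), Serdes.Long(),
        "output-topic-of-transformation", "rest-store");
KafkaStreams app2 = new KafkaStreams(builder2, props2);   // props2 needs a different application.id
app2.start();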
For accessing the store of the KTable, you need to get a query handle via the provided store name:
ReadOnlyKeyValueStore keyValueStore =
streams.store("My-Table", QueryableStoreTypes.keyValueStore());
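For example, with the handle above (the key is an assumption; the raw store type returns values as Object):
Object count = keyValueStore.get("some-key");   // null if the key is absent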
See the docs for further details:
http://docs.confluent.io/current/streams/developer-guide.html#interactive-queries

Using TextIO.Write to create new bucket if isn't exist in Google Cloud Dataflow

I'm trying to check whether a file created in a ParDo is different from one stored in GCS.
To do this I try to read the stored file and compare the differences.
Pipeline p = Pipeline.create(c.getPipelineOptions());
try {
    PCollection<String> lines = p.apply(
            TextIO.Read
                .named("Read Section on GS")
                .from("gs://failbucket/foo/boo/ret.txt"))
        .apply(ParDo
            .of(new Util.viewDifferences2(c.element))
            .named("only different"));
    lines.apply(
        TextIO.Write.named("Write Document Different")
            .to(pathGS)
            .withSuffix(".json"));
    p.run();
} catch (Exception e) {
    p = Pipeline.create(c.getPipelineOptions());
    PCollection<String> lines = p.apply(Create.of(sectionContent));
    lines.apply(TextIO.Write.named("Write new Document")
        .to("gs://failbucket/foo/boo/ret").withSuffix(".txt"));
    p.run();
}
Initially the file does not exist, so an exception is thrown; but when I then try to create it, I get the message "Output path does not exist or is not writeable".
Do you know how I can create an entirely new path?
Thank you
You can just use the withoutValidation option, which skips the validation but will still create the corresponding paths in GCS. However, it will throw an exception if the parent bucket does not exist.
In your case, if the "failbucket" bucket does not exist, it will throw the error below:
Caused by: java.io.IOException: Failed to write to GCS path gs://failbucket/foo/boo/ret/xxx.txt
But if the "failbucket" bucket exists in your project, then the foo/boo/ path (object prefixes, not buckets) will be created if it does not already exist.
In your case the following should work fine, provided the "failbucket" bucket exists in GCS:
lines.apply(TextIO.Write.named("Write new Document")
    .to("gs://failbucket/foo/boo/ret")
    .withoutValidation()
    .withSuffix(".txt"));
It seems like your exception handling code is submitting a Dataflow pipeline whose only purpose is to create an empty file in Google Cloud Storage.
This is not particularly efficient. Instead, you can use Google Cloud Storage API directly to interact with your GCS buckets. This API is much more efficient and comprehensive for this purpose. For example, you can use this API before starting your main Dataflow pipeline.
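For example, a minimal, hedged sketch using the google-cloud-storage client library (the library choice and the code are an illustration, not part of the original answer) to make sure the bucket exists before submitting the pipeline:
import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.BucketInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
...
// Ensure the target bucket exists before the Dataflow job is submitted.
Storage storage = StorageOptions.getDefaultInstance().getService();
if (storage.get("failbucket") == null) {   // bucket name taken from the question
    Bucket bucket = storage.create(BucketInfo.of("failbucket"));
}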
Another approach is to try out the gsutil tool. This command-line tool has similar capabilities of interacting with your GCS buckets. You can also invoke it from your Java program, or separately, before starting the Java program.
Disabling validation on TextIO in Dataflow is generally discouraged: the validation provides the benefit of catching errors fast and early, before your pipeline starts executing on the Cloud Platform. That said, validation may need to be disabled in the rare cases where prerequisites cannot be verified at job submission time.
