Best practice to pass large pipeline option in apache beam - java

We have a use case where we want to pass a few hundred lines of JSON spec to our Apache Beam pipeline. One straightforward way is to create a custom pipeline option as shown below. Is there any other way, where we can pass the input as a file?
public interface CustomPipelineOptions extends PipelineOptions {
    @Description("The Json spec")
    String getJsonSpec();

    void setJsonSpec(String jsonSpec);
}
I want to deploy the pipeline on the Google Dataflow engine. Even if I pass the spec as a file path and read the file contents inside the Beam code before starting the pipeline, how do I bundle the spec file as part of the pipeline?
P.S. Note that I don't want to commit the spec file (in the resources folder) as part of the source code where my Beam code lives. It needs to be configurable, i.e. I want to pass a different spec file for each Beam pipeline job.

You can pass the options as a POJO.
public class JsonSpec {
    public String stringArg;
    public int intArg;
}
Then reference it in your options:
public interface CustomPipelineOptions extends PipelineOptions {
    @Description("The Json spec")
    JsonSpec getJsonSpec();

    void setJsonSpec(JsonSpec jsonSpec);
}
Options will be parsed into the class; I believe this is done by Jackson, though I'm not sure.
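For example, launching the pipeline could then look roughly like this (a sketch; the values are made up, and it assumes Beam's usual PipelineOptionsFactory handling of complex option types as JSON):

// Hypothetical launch code: the whole spec is passed as a single --jsonSpec argument.
String[] args = new String[] {
    "--jsonSpec={\"stringArg\": \"value\", \"intArg\": 42}"
};
CustomPipelineOptions options = PipelineOptionsFactory
    .fromArgs(args)
    .withValidation()
    .as(CustomPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);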
I am wondering why you want to pass in "hundreds of lines of JSON" as a pipeline option? This doesn't seem like a very "Beam" way of doing things. Pipeline options should pass configuration; do you really need hundreds of lines of configuration per pipeline run? If you intend to pass data to create a PCollection, you are better off using TextIO and then processing the lines as JSON, as sketched below.
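A minimal sketch of that alternative (the bucket path is a placeholder I made up, and it assumes newline-delimited JSON):

PCollection<String> jsonLines =
    pipeline.apply("ReadJsonSpec", TextIO.read().from("gs://my-bucket/spec.jsonl"));
// Each element is one line of JSON; parse it in a downstream ParDo.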

Beam PipelineOptions, as the name implies, are intended to provide small configuration parameters to configure a pipeline. PipelineOptions are usually read at job submission. So even if you get your JSON spec to the job submission program using a PipelineOption, you have to write your program so that your DoFns have access to this file at runtime. For this:
1. You have to save your file in a distributed storage system that the Dataflow VMs have access to (for example, GCS).
2. You have to pass the input file path to the transform that is reading the file.
There are multiple ways to do (2). For example,
Directly pass the file path to the constructor of your DoFn (see the sketch after this list).
Pass the file path in as a side input to your transform (which allows you to configure it at runtime).
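A rough sketch of the first option (the class name, paths, and output format are assumptions for illustration; it assumes imports such as org.apache.beam.sdk.io.FileSystems, java.nio.channels.Channels, java.nio.channels.ReadableByteChannel, java.io.BufferedReader, and java.nio.charset.StandardCharsets):

static class ApplySpecFn extends DoFn<String, String> {

    private final String specPath;        // e.g. gs://my-bucket/spec.json, passed via a pipeline option
    private transient String jsonSpec;    // loaded on the worker, not serialized with the DoFn

    ApplySpecFn(String specPath) {
        this.specPath = specPath;
    }

    @Setup
    public void setup() throws IOException {
        // Beam's FileSystems API resolves gs:// paths on Dataflow workers.
        ReadableByteChannel channel =
            FileSystems.open(FileSystems.matchNewResource(specPath, /* isDirectory= */ false));
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(Channels.newReader(channel, StandardCharsets.UTF_8.name()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        jsonSpec = sb.toString();
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // Use jsonSpec to drive the per-element logic; here we just tag the element.
        c.output(c.element() + " (spec length: " + jsonSpec.length() + ")");
    }
}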

Related

Apache Beam How to use TestStream with files

I have a simple pipeline that just copies files from source to destination. I'm trying to write tests for the windowing I have set up.
Is there a way to use the TestStream class for files?
For example:
@Test
public void elementsAreInCorrectWindows() {
    TestStream<FileIO.ReadableFile> testStream = TestStream.create(ReadableFileCoder.of())
        .advanceWatermarkTo(start)
        .addElements(readableFile1)
        .advanceWatermarkTo(end)
        .addElements(readableFile2)
        .advanceWatermarkToInfinity();
}
However, the constructor for ReadableFile is package-private, so I wouldn't be able to create those objects.
I think it would be a reasonable feature/pull request to make this Coder public. In the meantime, you could have a TestStream that produces elements of another type that you then transform with a DoFn into ReadableFiles.
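One possible shape for that workaround (a sketch, untested; start, end, and pipeline come from your test, the paths are made up, and FileIO.matchAll()/readMatches() stand in for the hand-written DoFn):

TestStream<String> paths = TestStream.create(StringUtf8Coder.of())
    .advanceWatermarkTo(start)
    .addElements("/tmp/input/file1.txt")
    .advanceWatermarkTo(end)
    .addElements("/tmp/input/file2.txt")
    .advanceWatermarkToInfinity();

PCollection<FileIO.ReadableFile> files = pipeline
    .apply(paths)
    .apply(FileIO.matchAll())        // String file patterns -> MatchResult.Metadata
    .apply(FileIO.readMatches());    // Metadata -> FileIO.ReadableFile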

Is there any way auto generate graphql schema from protobuf?

I am developing a Spring Boot application with GraphQL. Since the data structure is already declared in Protobuf, I tried to reuse it. This is an example of my code.
@Service
public class Query implements GraphQLQueryResolver {
    public MyProto getMyProto() {
        /**/
    }
}
I want to make code like the structure above. To do this, I split the job into two parts.
First, since a .proto file can be converted into a Java class, I will use that class as the return type.
The second part is the main issue: a schema is also required. At first I tried to write the schema by hand, but the real proto is about 1000 lines. So I want to know: is there any way to convert a .proto file into a .graphqls file?
There is a way. I am using a protoc plugin for that purpose: go-proto-gql
It is fairly simple to use, for example:
protoc --gql_out=paths=source_relative:. -I=. ./*.proto
Hope this works for you as well.

Use placeholders in feature files

I would like to use placeholders in a feature file, like this:
Feature: Talk to two servers
Scenario: Forward data from Server A to Server B
Given MongoDB collection "${db1}/foo" contains the following record:
"""
{"key": "value"}
"""
When I send GET "${server1}/data"
When I forward the response to PUT "${server2}/data"
Then MongoDB collection "${db2}/bar" MUST contain the following record:
"""
{"key": "value"}
"""
The values of ${server1} etc. would depend on the environment in which the test is to be executed (dev, uat, stage, or prod). Therefore, Scenario Outlines are not applicable in this situation.
Is there any standard way of doing this? Ideally there would be something that maintains a Map<String, String>, can be filled in a @Before hook or similar, and runs automatically between Cucumber and the step definition so that no extra code is needed inside the step definitions.
Given the following step definitions
public class MyStepdefs {
    @When("^I send GET \"(.*)\"$")
    public void performGET(final String url) {
        // …
    }
}
And an appropriate setup, when performGET() is called, the placeholder ${server1} in the String url should already have been replaced with a value looked up in a Map.
Is there a standard way or feature of Cucumber-Java of doing this? I do not mind if this involves dependency injection. If dependency injection is involved, I would prefer Spring, as Spring is already in use for other reasons in my use case.
The simple answer is that you can't.
The solution to your problem is to remove the incidental details from your scenario altogether and access the specific server information in the step definitions.
The server and database obviously belong together, so let's describe them as a single entity, a service.
The details about the REST calls don't really help to convey what you're actually doing. Features don't describe implementation details; they describe behavior.
Testing whether records have been inserted into the database is another bad practice and again doesn't describe behavior. You should be able to replace that with another API call that fetches the data, or some other process that proves the other server has received the information. If there are no such means to extract the data, you should create them. If they can't be created, you may wonder whether the information even needs to be stored (your service would then appear to have the same properties as a black hole :) ).
I would resolve this all by rewriting the story such that:
Feature: Talk to two services
Scenario: Forward foobar data from Service A to Service B
Given "Service A" has key-value information
When I forward the foobar data from "Service A" to "Service B"
Then "Service B" has received the key-value information
Now that we have two entities Service A and Service B you can create a ServiceInformationService to look up information about Service A and B. You can inject this ServiceInformationService into your step definitions.
So whenever you need some information about Service A, you do
Service a = serviceInformationService.lookup("A");
String apiHost = a.getApiHost();
String dbHost = a.getDatabaseHost();
In the implementation of the Service you look up the property for that service, e.g. System.getProperty(serviceName + "_" + apiHostKey), and you make sure that your CI sets A_APIHOST, A_DBHOST, B_APIHOST, B_DBHOST, etc.
You can put the names of the collections in a property file that you look up in a similar way to the system properties. Though I would avoid direct interaction with the DB if possible.
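A minimal sketch of such a ServiceInformationService (class and method names are illustrative, not from any framework):

public class ServiceInformationService {

    public Service lookup(String serviceName) {
        // Resolved from system properties the CI sets, e.g. A_APIHOST and A_DBHOST.
        String apiHost = System.getProperty(serviceName + "_APIHOST");
        String dbHost = System.getProperty(serviceName + "_DBHOST");
        return new Service(apiHost, dbHost);
    }

    public static class Service {
        private final String apiHost;
        private final String databaseHost;

        public Service(String apiHost, String databaseHost) {
            this.apiHost = apiHost;
            this.databaseHost = databaseHost;
        }

        public String getApiHost() {
            return apiHost;
        }

        public String getDatabaseHost() {
            return databaseHost;
        }
    }
}

Since Spring is already in use, this class can simply be registered as a bean and injected into the step definitions.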
The feature you are looking for is supported in Gherkin with QAF. It supports using properties defined in a properties file via ${prop.key}. In addition, it offers strong resource configuration features for working with different environments. It also supports web services.

Running external library with Cloud Dataflow

I'm trying to run some external shared library functions with Cloud Dataflow, similar to what is described here: Running external libraries with Cloud Dataflow for grid-computing workloads.
I have a couple of questions about the approach.
There is the following passage in the article mentioned earlier:
In the case of making a call to an external library, you need to do this step manually for that library. The approach is to:
Store the code (along with versioning information) in Cloud Storage; this removes any concerns about throughput if running 10,000s of cores in the flow.
In the @beginBundle [sic] method, create a synchronized block to check if the file is available on the local resource. If not, use the Cloud Storage client library to pull the file across.
However, with my Java package, I simply put the library .so file into the src/main/resource/linux-x86-64 directory and call the library functions the following way (stripped to a bare minimum for brevity):
import com.sun.jna.Library;
import com.sun.jna.Native;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class HostLookupPipeline {

    public interface LookupLibrary extends Library {
        String Lookup(String domain);
    }

    static class LookupFn extends DoFn<String, KV<String, String>> {

        private static LookupLibrary lookup;

        @StartBundle
        public void startBundle() {
            // src/main/resource/linux-x86-64/liblookup.so
            lookup = Native.loadLibrary("lookup", LookupLibrary.class);
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            String domain = c.element();
            String results = lookup.Lookup(domain);
            if (results != null) {
                c.output(KV.of(domain, results));
            }
        }
    }
}
Is such an approach considered acceptable, or does extracting the .so file from the JAR perform poorly compared to downloading it from GCS? If it isn't acceptable, where should I put the file after downloading it so that it is accessible to the Cloud Dataflow worker?
I've noticed that the transformation calling the external library function works rather slowly, about 90 elements/s, utilizing 15 Cloud Dataflow workers (autoscaling, default max workers). If my rough calculations are correct, it should be twice as fast. I suppose that's because I call the external library function for every element.
Are there any best practices for improving the performance of external libraries when running them from Java?
The guidance in that blog post is slightly incorrect: a much better place to put the initialization code is the @Setup method, not @StartBundle.
@Setup is called to initialize an instance of your DoFn in every thread on every worker that will be executing it. It is the intended place for heavy setup code. Its counterpart is @Teardown.
@StartBundle and @FinishBundle are much finer granularity: per bundle, which is a quite low-level concept, and I believe the only common legitimate use for them is writing batches of elements to an external service: typically in @StartBundle you would initialize the next batch and in @FinishBundle flush it.
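Applied to the code in the question, that means loading the native library in @Setup, roughly like this (a sketch based on the question's LookupFn):

static class LookupFn extends DoFn<String, KV<String, String>> {

    private transient LookupLibrary lookup;

    @Setup
    public void setup() {
        // Heavy, once-per-DoFn-instance initialization belongs here, not in @StartBundle.
        lookup = Native.loadLibrary("lookup", LookupLibrary.class);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        String domain = c.element();
        String results = lookup.Lookup(domain);
        if (results != null) {
            c.output(KV.of(domain, results));
        }
    }
}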
Generally, to debug the performance, try adding logging to your DoFn's methods and see how many milliseconds the calls take and how that compares against your expectations. If you get stuck, include a Dataflow job ID in the question and an engineer will take a look at it.

How to implement dynamic referenced configuration in Java?

I have a configuration file (config.properties) with something like:
app.rootDir=
app.reportDir=${app.rootDir}/report
The app.rootDir is not a fixed parameter and must be initialized by an external module. I need ${app.reportDir} to keep a dynamic reference to ${app.rootDir}.
Use pseudo code to illustrate the problem:
// Init the root dir as '/usr/app'
config.setValue('app.rootDir','/usr/app');
// I need the reportDir to be '/usr/app/report'
String reportDir = config.getValue('app.reportDir');
I can write some code to implement this feature, but I'd like to know whether an existing library already does this.
I can use properties, YAML, or JSON as the configuration file type, depending on library availability.
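For what it's worth, some configuration libraries (Apache Commons Configuration, for example) support ${...} interpolation out of the box. A minimal hand-rolled version over java.util.Properties, purely for illustration, might look like this:

import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Config {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)}");
    private final Properties props = new Properties();

    public void setValue(String key, String value) {
        props.setProperty(key, value);
    }

    // Resolves ${...} references against other keys at read time,
    // so app.rootDir can be set after the properties are loaded.
    public String getValue(String key) {
        return resolve(props.getProperty(key, ""));
    }

    private String resolve(String value) {
        Matcher m = PLACEHOLDER.matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(props.getProperty(m.group(1), ""));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}

Usage mirroring the pseudo code above:

Config config = new Config();
config.setValue("app.rootDir", "/usr/app");
config.setValue("app.reportDir", "${app.rootDir}/report");
String reportDir = config.getValue("app.reportDir");  // "/usr/app/report"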
