I have created a test pipeline like this:
Pipeline pipeline;
PipelineOptions pipelineOptions = TestPipeline.testingPipelineOptions();
pipeline = Pipeline.create(pipelineOptions);
FlattenLight flattenLight = new FlattenLight();
DataflowMessage dataflowMessage = getTestDataflowMessage();
PCollection<TableRow> flattened = pipeline
.apply("Create Input", Create.of(dataflowMessage))
.apply(ParDo.of(flattenLight));
I want to test the FlattenLight class; it is a DoFn subclass with a processElement(ProcessContext c) method.
The problem is that the test data generated with getTestDataflowMessage() does not go through the pipeline. The FlattenLight object receives an object with null values in its fields.
getTestDataflowMessage() creates the fields as expected. You can see that a lot of different test values are present:
debugger step at test data creation
But the FlattenLight class receives an Object that is mostly empty:
debugger step entering the FlattenLight object
As you can see, there is no step between the data creation and the FlattenLight processing. Why does this happen? How can I fix it?
I had the same issue. The solution was to add implements Serializable to all models within the model hierarchy. Take a closer look at your DataflowMessage; maybe you missed it somewhere.
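For illustration, a minimal sketch of what that means (the class and field names here are hypothetical, not taken from your code); every class reachable from the top-level message needs the marker interface:
import java.io.Serializable;

// Hypothetical model classes: the top-level message AND every nested model it
// references must implement Serializable.
public class DataflowMessage implements Serializable {
    private String id;
    private Payload payload;   // nested model, must also be Serializable
    // getters/setters omitted
}

class Payload implements Serializable {
    private String value;
    // getters/setters omitted
}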
We have a use case where we want to pass hundreds of lines of JSON spec to our Apache Beam pipeline. One straightforward way is to create a custom pipeline option as shown below. Is there any other way we can pass the input as a file?
public interface CustomPipelineOptions extends PipelineOptions {
    @Description("The Json spec")
    String getJsonSpec();
    void setJsonSpec(String jsonSpec);
}
I want to deploy the pipeline on the Google Dataflow engine. Even if I pass the spec as a file path and read the file contents in the Beam code before starting the pipeline, how do I bundle the spec file as part of the pipeline?
P.S. Note that I don't want to commit the spec file (in the resources folder) as part of the source code where my Beam code lives. It needs to be configurable, i.e. I want to pass a different spec file for each Beam pipeline job.
You can pass the options as a POJO.
public class JsonSpec {
public String stringArg;
public int intArg;
}
Then reference it in your options:
public interface CustomPipelineOptions extends PipelineOptions {
    @Description("The Json spec")
    JsonSpec getJsonSpec();
    void setJsonSpec(JsonSpec jsonSpec);
}
The option will be parsed into the class; I believe by Jackson, though I'm not sure.
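If the parsing works as described above, a hedged sketch of how the spec could be supplied and read (the argument value is just an illustration; exact quoting depends on how you launch the job):
// Beam's PipelineOptionsFactory parses non-String option types from their JSON
// string form, so the spec can be supplied as a single --jsonSpec argument.
String[] args = new String[] {
    "--jsonSpec={\"stringArg\": \"example\", \"intArg\": 42}"
};
CustomPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(CustomPipelineOptions.class);
System.out.println(options.getJsonSpec().stringArg);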
I am wondering why you want to pass in "hundreds of lines of JSON" as a pipeline option? This doesn't seem like a very "Beam" way of doing things. Pipeline options should pass configuration; do you really need hundreds of lines of configuration per pipeline run? If you intend to pass data to create a PCollection, you are better off using TextIO and then processing the lines as JSON.
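For example, a minimal sketch of that alternative, assuming the spec is stored as newline-delimited JSON (the path and the ParseSpecLineFn DoFn are illustrative placeholders):
PCollection<String> specLines = pipeline.apply("ReadSpec",
    TextIO.read().from("gs://my-bucket/spec.jsonl"));
// Parse each line as JSON in a downstream DoFn (ParseSpecLineFn is hypothetical).
specLines.apply("ParseSpec", ParDo.of(new ParseSpecLineFn()));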
Beam PipelineOptions, as the name implies, are intended to provide small configuration parameters to a pipeline. PipelineOptions are usually read at job submission. So even if you get your JSON spec to the job submission program using a PipelineOption, you have to write your program so that your DoFns have access to this file at runtime. For this:
(1) You have to save your file in a distributed storage system that Dataflow VMs have access to (for example, GCS).
(2) You have to pass your input file to the transform that reads the file.
There are multiple ways to do (2). For example:
Directly pass the file path to the constructor of your DoFn (a sketch follows this list).
Pass the file path as a side input to your transform (which allows you to configure it at runtime).
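A minimal sketch of the constructor approach, assuming the spec was staged on GCS; the path, class name and use of Beam's FileSystems API here are illustrative, not part of the original answer:
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical DoFn that gets the spec file path through its constructor and
// loads the file contents once per instance on the worker.
public class ApplySpecFn extends DoFn<String, String> {
    private final String specPath;       // e.g. supplied from a PipelineOption
    private transient String specJson;   // loaded at runtime, not serialized

    public ApplySpecFn(String specPath) {
        this.specPath = specPath;
    }

    @Setup
    public void setup() throws IOException {
        // Read the whole spec file once per DoFn instance (Java 9+ readAllBytes).
        try (ReadableByteChannel channel =
                 FileSystems.open(FileSystems.matchNewResource(specPath, false))) {
            specJson = new String(
                Channels.newInputStream(channel).readAllBytes(),
                StandardCharsets.UTF_8);
        }
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // Use specJson to drive the per-element logic (details depend on the spec).
        c.output(c.element());
    }
}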
We're using Micronaut/Kafka Streams. With this framework, to create a streams application you build something like this:
@Factory
public class FooTopologyConfig {

    @Singleton
    @Named
    public KStream<String, FooPojo> configureTopology(ConfiguredStreamBuilder builder) {
        KStream<String, FooPojo> stream = builder.stream("foo-topic-in");
        stream.peek((k, v) -> System.out.println(String.format("key: %s, value: %s", k, v)))
              .to("foo-topic-out");
        return stream;
    }
}
This:
Receives a ConfiguredStreamBuilder (a very light wrapper around StreamsBuilder).
Builds and returns the stream (we're not actually sure how important returning the stream is, but that's a different question).
ConfiguredStreamBuilder::build() (which invokes the same on StreamsBuilder) is called later by the framework and the returned Topology is not made available for injection by Micronaut.
We want the Topology bean in order to log a description of the topology (via Topology::describe).
Is it safe to do the following?
Call ConfiguredStreamBuilder::build (and therefore StreamsBuilder::build) and use the returned instance of Topology to print a human readable description.
Allow the framework to call ConfiguredStreamBuilder::build for a second time later, and use the second instance of the returned topology to build the application.
There should be no problem calling build() multiple times. This is common in the internal code of Streams as well as in the tests.
To answer your other question: you only need the stream returned from builder.stream() if you want to expand on that branch of the topology later.
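As a rough sketch of how that could look in the factory from the question (reusing its names; the extra build() call here is only used for logging, and ConfiguredStreamBuilder inherits build() from StreamsBuilder):
@Singleton
@Named
public KStream<String, FooPojo> configureTopology(ConfiguredStreamBuilder builder) {
    KStream<String, FooPojo> stream = builder.stream("foo-topic-in");
    stream.peek((k, v) -> System.out.println(String.format("key: %s, value: %s", k, v)))
          .to("foo-topic-out");

    // First build() call: only used to log a human-readable topology description.
    Topology preview = builder.build();
    System.out.println(preview.describe());

    // The framework calls builder.build() again later to start the application;
    // per the answer above, building the topology more than once is safe.
    return stream;
}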
I am running a lambda function which at some point in the code receives the following object from DynamoDb:
ItemCollection<ScanOutcome>
I then run the following function, passing in this object, and I want to write a unit test for it:
public String deleteItems(ItemCollection<ScanOutcome> items) {
...
Iterator<Item> iterator = items.iterator();
while (iterator.hasNext()) {
Item item = iterator.next();
Object id = item.get("Id");
DeleteItemSpec itemSpec = new DeleteItemSpec().withPrimaryKey("Id", id);
someDynamoTable.deleteItem(itemSpec);
...
}
...
}
The problem is that it is hard for me to recreate a test version of ItemCollection at runtime. I am wondering if, instead, I can somehow save one by serializing it and storing it locally, and then re-instantiate it at runtime during my unit tests?
Similar to how, in JavaScript, I could just store a JSON representation of an object in a text/JSON file and reuse it later.
Is this a recommended/wise approach? What would be the ideal way to resolve a problem like this?
An alternative solution is to run a DynamoDB instance locally, as seen here: Easier DynamoDB local testing
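If you go the local-instance route, a hedged sketch of pointing the SDK at it in a test (AWS SDK for Java v1 document API; the endpoint, region and table name are assumptions):
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.ScanOutcome;
import com.amazonaws.services.dynamodbv2.document.Table;

public class LocalDynamoDbTestSupport {

    // Scans a table on a locally running DynamoDB instance, so the test can pass a
    // real ItemCollection<ScanOutcome> (seeded with known items) into deleteItems().
    public static ItemCollection<ScanOutcome> scanLocalTable(String tableName) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
            .withEndpointConfiguration(
                new AwsClientBuilder.EndpointConfiguration("http://localhost:8000", "us-east-1"))
            .build();
        Table table = new DynamoDB(client).getTable(tableName);
        return table.scan();
    }
}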
I have a simple BPMN process in which I am using 2 service tasks. I am executing my process using
processEngine.getRuntimeService().startProcessInstanceByKey("Process_1", variables);
where variables is defined as follows:
Map<String, Object> variables = new HashMap<>();
variables.put("a", 2);
variables.put("b", 5);
Service task 1 implements an Addition java class and service task 2 implements a Multiplication class.
Now I want to have 3 variables (constants): c = 5, d = 10, e = 2, so that I can use c in service task 1 (the Addition class), use d in my Multiplication class, and have e be global so that I can use it in both classes.
Can anyone guide me on this?
As a quick fix you could include a setup service task as the first task of the process, which prefills your process variables.
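A minimal sketch of such a setup task as a JavaDelegate (the class name and values are illustrative):
import org.camunda.bpm.engine.delegate.DelegateExecution;
import org.camunda.bpm.engine.delegate.JavaDelegate;

// Attached to the first service task of the process; prefills the constants.
public class SetupConstantsDelegate implements JavaDelegate {

    @Override
    public void execute(DelegateExecution execution) {
        execution.setVariable("c", 5);   // used by the Addition service task
        execution.setVariable("d", 10);  // used by the Multiplication service task
        execution.setVariable("e", 2);   // available to both tasks
    }
}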
Depending on how you start a process you could either:
Set the Variables via the java-object-api
https://docs.camunda.org/manual/7.5/user-guide/process-engine/variables/#java-object-api
or, if you use a REST call, you can provide these fixed values within the request body:
https://docs.camunda.org/manual/7.5/reference/rest/process-definition/post-start-process-instance/
Another simple solution would be a class with static values or an enum holding the needed values.
--edit--
If you want to use the inputOutput extension, add something like this to your BPMN file:
<bpmn:process id="Process_1" isExecutable="false">
  <bpmn:extensionElements>
    <camunda:inputOutput>
      <camunda:inputParameter name="c">5</camunda:inputParameter>
      <camunda:inputParameter name="d">10</camunda:inputParameter>
      <camunda:inputParameter name="e">2</camunda:inputParameter>
    </camunda:inputOutput>
  </bpmn:extensionElements>
</bpmn:process>
This can't be done in the diagram view of the Camunda Modeler; just switch to the XML representation of the process and add the extensionElements.
The documentation shows two different ways to store the value:
Java Object API
Typed Value API
I think using the Java Object API requires the Java object to implement the Serializable interface? The following code would break if the Order object does not implement the Serializable interface:
com.example.Order order = new com.example.Order();
runtimeService.setVariable(execution.getId(), "order", order);
com.example.Order retrievedOrder = (com.example.Order) runtimeService.getVariable(execution.getId(), "order");
==
I would use the following format for a Java object:
ObjectValue customerDataValue = Variables.objectValue(customerData)
.serializationDataFormat(Variables.SerializationDataFormats.JAVA)
.create();
execution.setVariable("someVariable", customerDataValue);
customerData refers to any Java object. However, if its member variables contain references to other objects, those references need to be serializable as well. To avoid this, you will have to declare those references as transient.
Furthermore, use the setVariableLocal method if you don't want the data to be persisted in the DB.
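For completeness, a hedged sketch of reading the value back with the typed-value API (continuing the snippet above; CustomerData is an illustrative type):
// Retrieve the typed wrapper and unwrap the deserialized Java object.
ObjectValue retrievedTyped = execution.getVariableTyped("someVariable");
CustomerData retrievedCustomerData = (CustomerData) retrievedTyped.getValue();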
To create a variable as global: org.camunda.bpm.engine.variable.Variables.putValue("keyName", value);
To get a global variable: VariableType value = (VariableType) delegateExecution.getVariable("keyName");
Note: Your DTO has to be serializable, otherwise Camunda will throw a serialization error.
I have a Spring Batch job which takes parameters, and the parameters are usually the same every time the job is run. By default, Spring Batch doesn't let you re-use the same parameters like that... so I created a simple incrementer and added it to my job like this:
http://numberformat.wordpress.com/2010/02/07/multiple-batch-runs-with-spring-batch/
When using the standard CommandLineJobRunner to run my job, I have to pass the -next parameter in order for my incrementer to be used.
However, when I run an end-to-end job test from within a JUnit class, using JobLauncherTestUtils.launchJob( JobParameters )... I can't find a way to declare that my incrementer should be used. The job is just quietly skipped, presumably because it has already been run with those parameters (see note below).
The JobParameters class is meant to hold a collection of name-value pairs... but the -next parameter is different. It starts with a dash, and has no corresponding value. I tried various experiments, but trying to add something to the JobParameters collection doesn't seem to be the ticket.
Does anyone know the JUnit equivalent to passing -next to CommandLineJobRunner?
NOTE: I presume that the issue is my incrementer being ignored, because:
The job works the first time, and it works if I wipe out the job repository database. It only fails on retries.
The job works fine, retries and all, when I hardcode the variables and remove the parameters altogether.
The JobLauncherTestUtils class contains a getUniqueJobParameters method which serves exactly this need.
/**
 * @return a new JobParameters object containing only a parameter for the
 * current timestamp, to ensure that the job instance will be unique.
 */
public JobParameters getUniqueJobParameters() {
    Map<String, JobParameter> parameters = new HashMap<String, JobParameter>();
    parameters.put("random", new JobParameter((long) (Math.random() * JOB_PARAMETER_MAXIMUM)));
    return new JobParameters(parameters);
}
Sample usage would be:
JobParameters params = new JobParametersBuilder(jobLauncherTestUtils.getUniqueJobParameters()).toJobParameters();
//extra parameters to be added
JobExecution jobExecution = jobLauncherTestUtils.launchJob(params);