I am trying to use runtime parameters with BigtableIO in Apache Beam to write to Bigtable.
I have created a pipeline to read from BigQuery and write to Bigtable.
The pipeline works fine when I provide static parameters (using ConfigBigtableIO and ConfigBigtableConfiguration, following the example here - https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/blob/master/java/dataflow-connector-examples/src/main/java/com/google/cloud/bigtable/dataflow/example/HelloWorldWrite.java), but I get a compile error when trying to set up the pipeline with runtime parameters.
The options are set up with all parameters as runtime ValueProviders.
p.apply(BigQueryIO.readTableRows().fromQuery(options.getBqQuery())
        .usingStandardSql())
 .apply(ParDo.of(new TransFormFn(options.getColumnFamily(), options.getRowKey(),
        options.getColumnKey(), options.getRowKeySuffix())))
 .apply(BigtableIO.write()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId()));
The compiler expects the argument passed to .apply(...) to be an org.apache.beam.sdk.transforms.PTransform<..., OutputT>, while BigtableIO.write() returns a Write object.
Can you help with providing the correct syntax to fix this? Thanks.
Runtime parameters are meant to be used in Dataflow templates.
Are you trying to create a template and run the pipeline using the template? If yes, you would need the following steps:
Create an Options interface that declares the runtime parameters you need, i.e.
ValueProvider<String> tableId.
Pass these runtime parameters to the config object, i.e. withTableId(ValueProvider<String> tableId) =>
withTableId(options.getTableId())
Construct your template
Execute your pipeline using the template.
The advantage of using a template is that it allows the pipeline to be constructed once and executed multiple times later with different runtime parameters.
For more information on how to use Dataflow templates: https://cloud.google.com/dataflow/docs/templates/overview
When not using a Dataflow template, you don't have to use runtime parameters, i.e. withTableId(ValueProvider<String> tableId). Instead, use withTableId(String tableId).
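A minimal sketch of the template-style wiring, assuming the Beam 2.x BigtableIO connector; the option interface, getter name, and project/instance IDs below are illustrative, not taken from your code:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

public class TemplateSketch {

  // Runtime parameters are declared as ValueProvider<T> so the template can be
  // constructed once and the concrete values supplied when the template is executed.
  public interface MyOptions extends PipelineOptions {
    @Description("Bigtable table to write to")
    ValueProvider<String> getBigtableTableId();
    void setBigtableTableId(ValueProvider<String> value);
  }

  public static void main(String[] args) {
    MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
    Pipeline p = Pipeline.create(options);

    // BigtableIO.Write has ValueProvider overloads, so the ValueProvider is handed
    // over as-is; .get() is never called at graph-construction time.
    BigtableIO.Write write = BigtableIO.write()
        .withProjectId("my-project")      // illustrative; could also be a ValueProvider
        .withInstanceId("my-instance")
        .withTableId(options.getBigtableTableId());

    // ... apply the rest of the pipeline (the input to `write` must be a
    // PCollection<KV<ByteString, Iterable<Mutation>>>), then:
    p.run();
  }
}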
Hope this helps!
Is it possible to get the Stream ARN of a DynamoDB table using the AWS CDK?
I tried the code below, but when I access the stream ARN using getTableStreamArn(), it returns null.
ITable table = Table.fromTableArn(this, "existingTable", <<existingTableArn>>);
System.out.println("ITable Stream Arn : " + table.getTableStreamArn());
I tried using fromTableAttributes as well, but the stream ARN is still empty.
ITable table =
    Table.fromTableAttributes(
        this, "existingTable", TableAttributes.builder().tableArn(<<existingTableArn>>).build());
This is not possible with the fromTableArn method. Please see the documentation here:
https://docs.aws.amazon.com/cdk/api/latest/docs/aws-dynamodb-readme.html#importing-existing-tables
If you intend to use the tableStreamArn (including indirectly, for example by creating an @aws-cdk/aws-lambda-event-sources.DynamoEventSource on the imported table), you must use the Table.fromTableAttributes method and the tableStreamArn property must be populated.
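In other words, when importing the table you have to supply the stream ARN yourself, roughly like this (a sketch against the CDK v1 Java bindings; the <<...>> values are placeholders you provide):
import software.amazon.awscdk.services.dynamodb.ITable;
import software.amazon.awscdk.services.dynamodb.Table;
import software.amazon.awscdk.services.dynamodb.TableAttributes;

// Sketch: the stream ARN of an imported table cannot be looked up by the CDK,
// so it has to be passed in explicitly (e.g. from configuration or an SSM parameter).
ITable table = Table.fromTableAttributes(this, "existingTable",
    TableAttributes.builder()
        .tableArn(<<existingTableArn>>)
        .tableStreamArn(<<existingTableStreamArn>>)
        .build());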
That value is most likely not available when your Java code is running.
With the CDK there is a multi-step process to get your code to execute:
Your Java code is executed and triggers the underlying JSII layer
JSII executes the underlying JavaScript/TypeScript implementation of the CDK
The TypeScript layer produces the CloudFormation template
The CloudFormation template (and other assets) is sent to the AWS API
CloudFormation executes the template and provisions the resources
Some attributes are only available during step 5 and before that only contain internal references that are eventually put into the CloudFormation template. If I recall correctly, the Table Stream ARN is one of them.
That means if you want that value, you have to create a CloudFormation output that exposes it, which will be populated during the deployment.
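For a table that is defined (with a stream) in the CDK stack itself, a sketch of such an output with the v1 Java bindings could look like this; the construct IDs and key name are illustrative:
import software.amazon.awscdk.core.CfnOutput;
import software.amazon.awscdk.services.dynamodb.Attribute;
import software.amazon.awscdk.services.dynamodb.AttributeType;
import software.amazon.awscdk.services.dynamodb.StreamViewType;
import software.amazon.awscdk.services.dynamodb.Table;

// A table created in this stack with a stream enabled; getTableStreamArn() only
// returns a token here, and the real ARN exists once CloudFormation has deployed it.
Table table = Table.Builder.create(this, "MyTable")
    .partitionKey(Attribute.builder().name("pk").type(AttributeType.STRING).build())
    .stream(StreamViewType.NEW_AND_OLD_IMAGES)
    .build();

// Export the resolved stream ARN as a stack output, visible after deployment.
CfnOutput.Builder.create(this, "TableStreamArn")
    .value(table.getTableStreamArn())
    .build();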
Is there a way to create an AEM package via Java code?
We need to package some content every night via a service run by a cron job.
I checked online and it seems to be possible using a curl command. But either way, I'd need this done via a daily service running Java code.
Please refer to some of the links given below:
1) https://helpx.adobe.com/experience-manager/using/dynamic_aem_packages.html
2) http://cq5experiences.blogspot.in/2014/01/creating-packages-using-java-code-in-cq5.html
The main code goes something like this:
final JcrPackage jcrPackage = getPackageHelper().createPackageFromPathFilterSets(packageResources,
        request.getResourceResolver().adaptTo(Session.class),
        properties.get(PACKAGE_GROUP_NAME, getDefaultPackageGroupName()),
        properties.get(PACKAGE_NAME, getDefaultPackageName()),
        properties.get(PACKAGE_VERSION, DEFAULT_PACKAGE_VERSION),
        PackageHelper.ConflictResolution.valueOf(properties.get(CONFLICT_RESOLUTION,
                PackageHelper.ConflictResolution.IncrementVersion.toString())),
        packageDefinitionProperties
);
So first of all, you can create a scheduler, and in the scheduler's run method write the logic to package the required filter paths (a rough sketch follows below).
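A rough sketch of what such a nightly service could look like, using the JCR packaging API (org.apache.jackrabbit.vault) directly instead of the PackageHelper above; the package group/name, content path, cron expression, and service-user mapping are placeholders:
import java.util.Collections;
import javax.jcr.Session;

import org.apache.jackrabbit.vault.fs.api.PathFilterSet;
import org.apache.jackrabbit.vault.fs.config.DefaultWorkspaceFilter;
import org.apache.jackrabbit.vault.packaging.JcrPackage;
import org.apache.jackrabbit.vault.packaging.JcrPackageDefinition;
import org.apache.jackrabbit.vault.packaging.JcrPackageManager;
import org.apache.jackrabbit.vault.packaging.Packaging;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Runs every night at 1 AM (placeholder cron expression) via the Sling scheduler whiteboard.
@Component(service = Runnable.class, property = {"scheduler.expression=0 0 1 * * ?"})
public class NightlyPackageScheduler implements Runnable {

    private static final Logger LOG = LoggerFactory.getLogger(NightlyPackageScheduler.class);

    @Reference
    private Packaging packaging;

    @Reference
    private ResourceResolverFactory resolverFactory;

    @Override
    public void run() {
        // Assumes a service-user mapping named "package-service" exists for this bundle.
        try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(
                Collections.singletonMap(ResourceResolverFactory.SUBSERVICE, (Object) "package-service"))) {
            Session session = resolver.adaptTo(Session.class);
            JcrPackageManager manager = packaging.getPackageManager(session);

            // Create the package and define which paths it should contain.
            JcrPackage pkg = manager.create("my-group", "nightly-content", "1.0");
            JcrPackageDefinition def = pkg.getDefinition();
            DefaultWorkspaceFilter filter = new DefaultWorkspaceFilter();
            filter.add(new PathFilterSet("/content/my-site"));
            def.setFilter(filter, true);

            // Build (assemble) the package so it can be downloaded or replicated.
            manager.assemble(pkg, null);
        } catch (Exception e) {
            LOG.error("Nightly package build failed", e);
        }
    }
}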
Hoping this is helpful for you.
I'm trying to define a Pentaho Kettle (ktr) transformation via code. I would like to add to the transformation a Text File Input Step: http://wiki.pentaho.com/display/EAI/Text+File+Input.
I don't know how to do this (note that I want to achieve the result in a custom Java application, not using the standard Spoon GUI). I think I should use the TextFileInputMeta class, but when I try to define the filename the transformation doesn't work anymore (it appears empty in Spoon).
This is the code I'm using. I think the third line has something wrong:
PluginRegistry registry = PluginRegistry.getInstance();
TextFileInputMeta fileInMeta = new TextFileInputMeta();
fileInMeta.setFileName(new String[] {myFileName});
String fileInPluginId = registry.getPluginId(StepPluginType.class, fileInMeta);
StepMeta fileInStepMeta = new StepMeta(fileInPluginId, myStepName, fileInMeta);
fileInStepMeta.setDraw(true);
fileInStepMeta.setLocation(100, 200);
transAWMMeta.addStep(fileInStepMeta);
To run a transformation programmatically, you should do the following (see the sketch after this list):
Initialise Kettle
Prepare a TransMeta object
Prepare your steps
Don't forget about Meta and Data objects!
Add them to TransMeta
Create Trans and run it
By default, each transformation spawns a thread per step, so use trans.waitUntilFinished() to force your thread to wait until execution completes
Pick up the execution's results if necessary
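A rough sketch of these steps, assuming the classic (pre-8.0) TextFileInput step and a Dummy step as the receiving step; the file path and step names are placeholders, and depending on your Kettle version you may also need to size the companion arrays (file masks, required flags) to match the file name array:
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.plugins.PluginRegistry;
import org.pentaho.di.core.plugins.StepPluginType;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransHopMeta;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.StepMeta;
import org.pentaho.di.trans.steps.dummytrans.DummyTransMeta;
import org.pentaho.di.trans.steps.textfileinput.TextFileInputMeta;

public class BuildAndRunTrans {

  public static void main(String[] args) throws Exception {
    // 1. Initialise Kettle (loads the step plugins, among other things).
    KettleEnvironment.init();

    // 2. Prepare a TransMeta object.
    TransMeta transMeta = new TransMeta();
    transMeta.setName("generated-transformation");
    PluginRegistry registry = PluginRegistry.getInstance();

    // 3. Prepare the Text File Input step: start from the defaults, then set the file.
    TextFileInputMeta fileInMeta = new TextFileInputMeta();
    fileInMeta.setDefault();
    fileInMeta.setFileName(new String[] {"/tmp/input.txt"}); // placeholder path
    String fileInPluginId = registry.getPluginId(StepPluginType.class, fileInMeta);
    StepMeta fileInStep = new StepMeta(fileInPluginId, "Text file input", fileInMeta);
    fileInStep.setDraw(true);
    fileInStep.setLocation(100, 200);
    transMeta.addStep(fileInStep);

    // A second step to receive the rows (here just a Dummy step).
    DummyTransMeta dummyMeta = new DummyTransMeta();
    String dummyPluginId = registry.getPluginId(StepPluginType.class, dummyMeta);
    StepMeta dummyStep = new StepMeta(dummyPluginId, "Dummy", dummyMeta);
    dummyStep.setDraw(true);
    dummyStep.setLocation(300, 200);
    transMeta.addStep(dummyStep);

    // 4./5. Connect the steps with a hop so the transformation is not "empty".
    transMeta.addTransHop(new TransHopMeta(fileInStep, dummyStep));

    // 6. Create the Trans, run it and wait for the step threads to finish.
    Trans trans = new Trans(transMeta);
    trans.execute(null);
    trans.waitUntilFinished();

    if (trans.getErrors() > 0) {
      System.err.println("Transformation finished with errors");
    }
  }
}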
Use this test as example: https://github.com/pentaho/pentaho-kettle/blob/master/test/org/pentaho/di/trans/steps/textfileinput/TextFileInputTests.java
Also, I would recommend creating the transformation manually and loading it from a file, if that is acceptable in your circumstances. This helps avoid lots of boilerplate code. It is quite easy to run transformations in this case; see an example here: https://github.com/pentaho/pentaho-kettle/blob/master/test/org/pentaho/di/TestUtilities.java#L346
I want to use Spark Java to read an RCFile. I found the functions by searching on Google; they told me to use the hadoopFile function. I wrote this:
JavaSparkContext ctx = new JavaSparkContext("local", "Accumulate",System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(spark.wendy.RCFileAccumulate.class));
JavaPairRDD<String,Array> idListData = ctx.hadoopFile("/rcfinancetest",RCFileInputFormat.class,String.class, String.class);
idListData.saveAsTextFile("/financeout");
Then I opened the result file; the result is as follows:
(0,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable#4442459f)
(1,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable#4442459f)
(2,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable#4442459f)
(3,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable#4442459f)
(4,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable#4442459f)
The second element of every record is the same. How can I deal with this value? How do I get at the column data in the RCFile? By the way, I created the RCFile via Hive; I think that has nothing to do with the Spark reading program. Could you help me?
I am developing a web app using Spring MVC. Simply put, a user uploads a file which can be of different types (.csv, .xls, .txt, .xml) and the application parses this file and extracts data for further processing. The problem is that the format of the file can change frequently, so there must be some way for quick and easy customization. Being a bit familiar with Talend, I decided to give it a shot and use it as the ETL tool for my app. This short tutorial shows how to run a Talend job from within a Java app - http://www.talendforge.org/forum/viewtopic.php?id=2901
However, jobs created using Talend can read from/write to physical files, directories or databases. Is it possible to modify a Talend job so that it can be given a Java object as a parameter and then return a Java object, just like ordinary Java methods?
For example something like:
String[] param = new String[]{"John Doe"};
String talendJobOutput = teaPot.myjob_0_1.myJob.main(param);
where teaPot.myjob_0_1.myJob is the Talend job integrated into my app.
I did something similar, I guess. I created a mapping in Talend using tMap and exported it as a Talend job (a Java SE program). If you include the libraries of that job, you can run the Talend job as described by others.
To pass arbitrary Java objects you can use the following methods, which are present in every Talend job:
public Object getValueObject() {
    return this.valueObject;
}

public void setValueObject(Object valueObject) {
    this.valueObject = valueObject;
}
In your job you have to cast this object. For example, you can pass in a List of HashMaps and use Java reflection to populate rows. Use tJavaFlex or a custom component for that.
Using this method I can adjust the mapping of my data visually in Talend, but still use the generated code as a library in my Java application.
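A minimal sketch of the calling side, reusing the teaPot.myjob_0_1.myJob class name from the question and the runJob / value-object methods shown above; the row structure is illustrative:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: teaPot.myjob_0_1.myJob stands in for whatever class Talend generated for your job.
teaPot.myjob_0_1.myJob job = new teaPot.myjob_0_1.myJob();

// Hand the job an arbitrary Java object (here a list of rows represented as maps).
List<Map<String, Object>> rows = new ArrayList<>();
Map<String, Object> row = new HashMap<>();
row.put("name", "John Doe");
rows.add(row);
job.setValueObject(rows);

// Run the job; inside the job a tJavaFlex (or custom component) casts
// getValueObject() back to List<Map<String, Object>> and emits the rows.
job.runJob(new String[0]);

// After the run, the job can have written a result back the same way.
Object result = job.getValueObject();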
Now that I better understand what you want, I think this is NOT possible, because Talend's architecture is designed as a standalone app, with a "main" entry point much like the Java main() method:
public String[][] runJob(String[] args) {
    int exitCode = runJobInTOS(args);
    String[][] bufferValue = new String[][] { { Integer.toString(exitCode) } };
    return bufferValue;
}
That is to say: the Talend execution entry point only accepts a String array as input and doesn't return anything as output (except a system return code).
So, you won't be able to link to the Talend (generated) code as a library, only use it as an isolated tool that you can parameterize (using context vars, see my other response) before launching.
You can see that in the Talend help center and forum the only integration described is an "external" job execution:
Talend knowledge base "Calling a Talend Job from an external Java application" article
Talend Community Forum "Java Object to Talend" topic
Maybe you have to rethink the architecture of your application if you want to use Talend as the ETL tool for your purpose.
Now, from the Talend ETL point of view: if you want to parameterize the execution environment of your jobs (for example the physical directory of the uploaded files), you should use context variables that can be loaded at execution time from a configuration file, as mentioned here:
https://help.talend.com/display/TalendOpenStudioforDataIntegrationUserGuide53EN/2.6.6+Context+settings
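For example, context parameters are typically passed to an exported job through the args array using Talend's --context_param syntax; a sketch follows (the job class, context variable name, and directory are illustrative, and the exact argument format should be checked against the argument parsing in your generated job):
// Sketch: pass a context parameter to an exported Talend job from the host application.
// "uploadDir" must be declared as a context variable in the job for this to have any effect.
String[] args = new String[] {
    "--context=Default",
    "--context_param uploadDir=/var/uploads/incoming"
};
teaPot.myjob_0_1.myJob job = new teaPot.myjob_0_1.myJob();
String[][] result = job.runJob(args);
System.out.println("Job exit code: " + result[0][0]);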