Generating Oozie Workflows using Java Code


Looking through the Oozie examples and documentation, it looks like you need a workflow file in order to run an Oozie job from Java code. Is there any way to submit a job directly from Java code, without needing a workflow file? Is there any pre-existing way to dynamically generate these files through Java code? Are there any pre-existing tools that make generating them easier? Or will I have to write all of the code to generate the file myself?
Current Situation
OozieClient wc = new OozieClient("http://bar:8080/oozie");
Properties conf = wc.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "workflow file path");
// set other properties
...
// submit and start the workflow job
wc.run(conf);
Ideal situation is something vaguely like this.
OozieAction action = new OozieAction("actionName");
action.setOkDestination("nextAction");
action.setErrorDestination("errorDestination");
//Rest of config for action
OozieWorkflow workflow = new OozieWorkflow();
workflow.setStartAction(action);
workflow.addAction(otherAction);
//rest of conf
OozieClient wc = new OozieClient("http://bar:8080/oozie");
wc.runWorkflow(workflow);
Another situation that would be desirable, if the former is impossible, is:
OozieAction action = new OozieAction("actionName");
action.setOkDestination("nextAction");
action.setErrorDestination("errorDestination");
//Rest of config for action
OozieWorkflow workflow = new OozieWorkflow();
workflow.setStartAction(action);
workflow.addAction(otherAction);
//rest of conf
workflow.writeToFile("some localFile");
//load file to HDFS
//This would also work
// workflow.writeToHDFS("someHdfsLocation");
OozieClient wc = new OozieClient("http://bar:8080/oozie");
//run with created workflow

I have been in a similar situation.
What I would suggest is to take the Oozie schema definition (XSD) and generate the equivalent Java classes with xjc. Given these classes you can probably build the workflow in code (not trivial, though).
There are also Scala-based DSLs you can use.
https://github.com/klout/scoozie does something similar with Scala-to-Oozie generation.
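As a lighter-weight alternative to xjc-generated classes, the workflow file can also be assembled with the DOM API that ships with the JDK. The following is only a minimal sketch of that idea, not the JAXB approach described above: the class name `WorkflowXmlSketch`, the single `fs`/`mkdir` action, and the schema version in the namespace are all assumptions chosen for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper that builds a one-action Oozie workflow definition as an XML string. */
final class WorkflowXmlSketch {

    // Workflow schema namespace; the version is an assumption, match it to your Oozie release.
    static final String NS = "uri:oozie:workflow:0.5";

    static String build(String workflowName, String actionName, String dirToCreate) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            Element app = doc.createElementNS(NS, "workflow-app");
            app.setAttribute("name", workflowName);
            doc.appendChild(app);

            // <start to="..."/> names the first action to run.
            child(doc, app, "start").setAttribute("to", actionName);

            // A single fs action that creates a directory, with its ok/error transitions.
            Element action = child(doc, app, "action");
            action.setAttribute("name", actionName);
            child(doc, child(doc, action, "fs"), "mkdir").setAttribute("path", dirToCreate);
            child(doc, action, "ok").setAttribute("to", "end");
            child(doc, action, "error").setAttribute("to", "fail");

            // Kill node reached on error, followed by the end node.
            Element kill = child(doc, app, "kill");
            kill.setAttribute("name", "fail");
            child(doc, kill, "message").setTextContent("Action failed");
            child(doc, app, "end").setAttribute("name", "end");

            // Serialize the DOM tree to an indented string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build workflow XML", e);
        }
    }

    /** Creates a child element in the workflow namespace and appends it to 'parent'. */
    private static Element child(Document doc, Element parent, String name) {
        Element element = doc.createElementNS(NS, name);
        parent.appendChild(element);
        return element;
    }
}
```

The returned string could be written to a local workflow.xml, uploaded to HDFS, and then referenced through OozieClient.APP_PATH as in the question's first snippet.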

Oozie 5.1.0 added support for the Fluent Job API, which makes it possible to write Java code instead of workflow XML files (under the hood, Oozie generates the XML file for you).
A simple example of Java code that creates a workflow similar to Oozie's shell action demo:
public class MyFirstWorkflowFactory implements WorkflowFactory {
@Override
public Workflow create() {
final ShellAction shellAction = ShellActionBuilder.create()
.withName("shell-action")
.withResourceManager("${resourceManager}")
.withNameNode("${nameNode}")
.withConfigProperty("mapred.job.queue.name", "${queueName}")
.withExecutable("echo")
.withArgument("my_output=Hello Oozie")
.withCaptureOutput(true)
.build();
final Workflow shellWorkflow = new WorkflowBuilder()
.withName("shell-workflow")
.withDagContainingNode(shellAction).build();
return shellWorkflow;
}
}
More detailed documentation can be found here: https://oozie.apache.org/docs/5.1.0/DG_FluentJobAPI.html

There's a graphical tool for generating Oozie workflows, available as an Eclipse plugin. Find it on the Eclipse Marketplace: https://marketplace.eclipse.org/content/oozie-eclipse-plugin

Keep a static Oozie workflow in HDFS that takes just two parameters and writes the content of parameter 1 (say, content which the user enters) to parameter 2 (say, a location in HDFS). Then call the Oozie CLI and specify app.path as the location created by workflow1.

Related

Create AEM packages via code

Is there a way to create an AEM package via Java code?
We need to package some content every night via a service run by a cron job.
I checked online and it seems to be possible using a curl command. Either way, I need this done by a daily service running Java code.

Please refer to the links given below:
1) https://helpx.adobe.com/experience-manager/using/dynamic_aem_packages.html
2) http://cq5experiences.blogspot.in/2014/01/creating-packages-using-java-code-in-cq5.html
The main code goes something like this:
final JcrPackage jcrPackage = getPackageHelper().createPackageFromPathFilterSets(packageResources,
request.getResourceResolver().adaptTo(Session.class),
properties.get(PACKAGE_GROUP_NAME, getDefaultPackageGroupName()),
properties.get(PACKAGE_NAME, getDefaultPackageName()),
properties.get(PACKAGE_VERSION, DEFAULT_PACKAGE_VERSION),
PackageHelper.ConflictResolution.valueOf(properties.get(CONFLICT_RESOLUTION,
PackageHelper.ConflictResolution.IncrementVersion.toString())),
packageDefinitionProperties
);
So first of all you can create a scheduler, and in the scheduler's run method you can write the logic that packages the required filter paths.
Hoping this is helpful for you.
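The scheduling half of this advice can be sketched with plain JDK classes. This is only an illustration under stated assumptions: inside AEM you would more likely register the job with the platform's own scheduler, and `NightlyPackagingSketch`, `scheduleNightly`, and the `packagingTask` placeholder are hypothetical names, not part of any AEM API. The task passed in is where the packaging call shown above would go.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a nightly scheduler; not an AEM or Sling API. */
final class NightlyPackagingSketch {

    /** Time remaining from 'now' until the next occurrence of 'runAt'. */
    static Duration delayUntilNextRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            // Today's slot has already passed (or is exactly now), so use tomorrow's.
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    /** Runs 'packagingTask' every 24 hours, starting at the next occurrence of 'runAt'. */
    static ScheduledExecutorService scheduleNightly(Runnable packagingTask, LocalTime runAt) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        long initialDelayMillis = delayUntilNextRun(LocalDateTime.now(), runAt).toMillis();
        executor.scheduleAtFixedRate(
                packagingTask, initialDelayMillis, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

A fixed 24-hour period ignores daylight-saving shifts, which is usually acceptable for a nightly job but worth knowing about.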

Interacting with a large java program as a service?

How can I do the following?
What I want to do is load Stanford NLP ONCE, then interact with it via an HTTP or other endpoint. The reason is that it takes a long time to load, and loading for every string to analyze is out of the question.
For example, here is Stanford NLP loading in a simple C# program that loads the jars... I'm looking to do what I did below, but in java:
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [9.3 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.all.3class.distsim.crf.ser.gz ... done [12.8 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.muc.7class.distsim.crf.ser.gz ... done [5.9 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.conll.4class.distsim.crf.ser.gz ... done [4.1 sec].
done [8.8 sec].
Sentence #1 ...
This is over 30 seconds. If these all have to load each time, yikes. To show what I want to do in java, I wrote a working example in C#, and this complete example may help someone some day:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using java.io;
using java.util;
using edu.stanford.nlp;
using edu.stanford.nlp.pipeline;
using Console = System.Console;
namespace NLPConsoleApplication
{
class Program
{
static void Main(string[] args)
{
// Path to the folder with models extracted from `stanford-corenlp-3.6.0-models.jar`
var jarRoot = @"..\..\..\..\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models";
// Text for initial run processing
var text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";
// Annotation pipeline configuration
var props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
props.setProperty("ner.useSUTime", "0");
// We should change current directory, so StanfordCoreNLP could find all the model files automatically
var curDir = Environment.CurrentDirectory;
Directory.SetCurrentDirectory(jarRoot);
var pipeline = new StanfordCoreNLP(props);
Directory.SetCurrentDirectory(curDir);
// loop
while (text != "quit")
{
// Annotation
var annotation = new Annotation(text);
pipeline.annotate(annotation);
// Result - Pretty Print
using (var stream = new ByteArrayOutputStream())
{
pipeline.prettyPrint(annotation, new PrintWriter(stream));
Console.WriteLine(stream.toString());
stream.close();
}
edu.stanford.nlp.trees.TreePrint tprint = new edu.stanford.nlp.trees.TreePrint("words");
Console.WriteLine();
Console.WriteLine("Enter a sentence to evaluate, and hit ENTER (enter \"quit\" to quit)");
text = Console.ReadLine();
} // end while
}
}
}
So it takes the 30 seconds to load, but each time you give it a string on the console, it takes the smallest fraction of a second to parse & tokenize that string.
You can see that I loaded the jar files prior to the while loop.
This may end up being a socket service, HTML, or something else that will entertain requests (in the form of strings), and spit back the parsing.
My ultimate goal is to use a mechanism in Nifi, via a processor that can send strings to be parsed, and have them returned in less than a second, versus 30+ seconds if a traditional web server threaded example (for instance) is used. Every request would load the whole thing for 30 seconds, THEN get down to business. I hope I made this clear!
How to do this?

Any of the mechanisms you list are perfectly reasonable routes forward for leveraging that service with Apache NiFi. Depending on your needs, some of the processors and extensions that are bundled with the standard release of NiFi may be sufficient to interact with your proposed web service or similar offering.
If you are striving for performing all of this within NiFi itself, a custom Controller Service might be a great path to provide this resource to NiFi that falls within the lifecycle of the application itself.
NiFi can be extended with items like controller services and custom processors and we have some documentation to get you started down that path.
Additional details could certainly help to provide some more information. Feel free to follow up on here with additional comments and/or reach out to the community via our mailing lists.
One item I did want to call out, in case it was unclear, is that NiFi is JVM-driven, so the work would be done in Java or a JVM-friendly language.

You should look at the new CoreNLP Server which Stanford NLP introduced in version 3.6.0. It seems like it does just what you want? Some other people such as ETS have done similar things.
Fine point: If using this heavily, you might (at present) want to grab the latest CoreNLP code from github HEAD, since it contains a few fixes to the server which will be in the next release.
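If you would rather roll your own endpoint, the load-once pattern from the question can be sketched with the HTTP server bundled with the JDK. This is a minimal sketch, not the CoreNLP Server: `LoadOnceServerSketch`, the `/analyze` path, and the `analyzer` function are assumptions. In a real setup the function would wrap a pipeline that was constructed once before the server started, so the expensive loading happens a single time.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical load-once service: the analyzer is built once and reused for every request. */
final class LoadOnceServerSketch {

    /** Starts a local HTTP server that applies 'analyzer' to the body of each POST to /analyze. */
    static HttpServer start(int port, Function<String, String> analyzer) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
            server.createContext("/analyze", exchange -> {
                String text = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
                byte[] reply = analyzer.apply(text).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, reply.length);
                try (OutputStream body = exchange.getResponseBody()) {
                    body.write(reply);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Client side: sends 'text' to the local service and returns the response body. */
    static String analyzeRemotely(int port, String text) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://127.0.0.1:" + port + "/analyze"))
                    .POST(HttpRequest.BodyPublishers.ofString(text))
                    .build();
            return HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Any HTTP-capable client, including a NiFi processor that can issue HTTP requests, could then call the endpoint without paying the 30-second load again.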

Pentaho SDK, how to define a text file input

I'm trying to define a Pentaho Kettle (ktr) transformation via code. I would like to add to the transformation a Text File Input Step: http://wiki.pentaho.com/display/EAI/Text+File+Input.
I don't know how to do this (note that I want to achieve the result in a custom Java application, not using the standard Spoon GUI). I think I should use the TextFileInputMeta class, but when I try to define the filename the transformation doesn't work anymore (it seems empty in Spoon).
This is the code I'm using. I think the third line has something wrong:
PluginRegistry registry = PluginRegistry.getInstance();
TextFileInputMeta fileInMeta = new TextFileInputMeta();
fileInMeta.setFileName(new String[] {myFileName});
String fileInPluginId = registry.getPluginId(StepPluginType.class, fileInMeta);
StepMeta fileInStepMeta = new StepMeta(fileInPluginId, myStepName, fileInMeta);
fileInStepMeta.setDraw(true);
fileInStepMeta.setLocation(100, 200);
transAWMMeta.addStep(fileInStepMeta);

To run a transformation programmatically, you should do the following:
1) Initialise Kettle
2) Prepare a TransMeta object
3) Prepare your steps. Don't forget about the Meta and Data objects!
4) Add them to the TransMeta
5) Create a Trans and run it. By default, each transformation spawns a thread per step, so use trans.waitUntilFinished() to make your thread wait until execution completes
6) Pick up the execution's results if necessary
Use this test as an example: https://github.com/pentaho/pentaho-kettle/blob/master/test/org/pentaho/di/trans/steps/textfileinput/TextFileInputTests.java
Also, I would recommend creating the transformation manually and loading it from a file, if that is acceptable in your circumstances. This avoids a lot of boilerplate code. It is quite easy to run transformations in this case; see an example here: https://github.com/pentaho/pentaho-kettle/blob/master/test/org/pentaho/di/TestUtilities.java#L346

How to run a .m (matlab) file through java and matlab control?

I have 2 .m files. One is the function and the other one (read.m) reads the function and exports the results into an Excel file. I have a Java program that makes some changes to the .m files. After the changes I want to automate the execution of the .m files. I have downloaded matlabcontrol.jar and I am looking for a way to use it to invoke and run the read.m file, which then reads the function.
Can anyone help me with the code? Thanks
I have tried this code but it does not work.
public static void tomatlab() throws MatlabConnectionException, MatlabInvocationException {
MatlabProxyFactoryOptions options =
new MatlabProxyFactoryOptions.Builder()
.setUsePreviouslyControlledSession(true)
.build();
MatlabProxyFactory factory = new MatlabProxyFactory(options);
MatlabProxy proxy = factory.getProxy();
proxy.eval("addpath('C:\\path_to_read.m')");
proxy.feval("read");
proxy.eval("rmpath('C:\\path_to_read.m')");
// close connection
proxy.disconnect();
}

Based on the official tutorial in the Wiki of the project, it seems quite straightforward to start with this API.
The path manipulation might be a bit tricky, but I would try loading the whole script into a string and passing it to eval (please note I have no prior experience with this specific MATLAB library). That could be done quite easily (for example by joining the result of Files.readAllLines()).
Hope that helps.
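The "load the script into a string" step could look roughly like this. It is a sketch with a hypothetical helper name (`ScriptLoaderSketch.loadScript`); the returned string would then be handed to `proxy.eval(...)` from the question's code. Note that this only suits script files: a file that defines a function cannot be passed to eval as text and still has to be on the MATLAB path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that turns a .m script file into one string suitable for eval. */
final class ScriptLoaderSketch {

    /** Reads all lines of 'script' and joins them with newlines. */
    static String loadScript(Path script) {
        try {
            return String.join("\n", Files.readAllLines(script, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```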

Running Talend Job from within Java application

I am developing a web app using Spring MVC. Simply put, a user uploads a file which can be of different types (.csv, .xls, .txt, .xml) and the application parses this file and extracts data for further processing. The problem is that the format of the file can change frequently, so there must be some way to customize the parsing quickly and easily. Being a bit familiar with Talend, I decided to give it a shot and use it as the ETL tool for my app. This short tutorial shows how to run a Talend job from within a Java app - http://www.talendforge.org/forum/viewtopic.php?id=2901
However, jobs created using Talend read from and write to physical files, directories or databases. Is it possible to modify a Talend job so that it can be given a Java object as a parameter and then return a Java object, just like an ordinary Java method?
For example something like:
String[] param = new String[]{"John Doe"};
String talendJobOutput = teaPot.myjob_0_1.myJob.main(param);
where teaPot.myjob_0_1.myJob is the talend job integrated into my app

I did something similar, I guess. I created a mapping in Talend using tMap and exported it as a Talend job (a Java SE program). If you include the libraries of that job, you can run the Talend job as described by others.
To pass arbitrary Java objects you can use the following methods, which are present in every Talend job:
public Object getValueObject() {
return this.valueObject;
}
public void setValueObject(Object valueObject) {
this.valueObject = valueObject;
}
In your job you have to cast this object. For example, you can pass in a List of HashMaps and use Java reflection to populate rows. Use tJavaFlex or a custom component for that.
Using this method I can adjust the mapping of my data visually in Talend, but still use the generated code as a library in my Java application.

Now that I better understand what you want, I think this is NOT possible, because a Talend job is built like a standalone app, with a "main" entry point much like the Java main() method:
public String[][] runJob(String[] args) {
int exitCode = runJobInTOS(args);
String[][] bufferValue = new String[][] { { Integer.toString(exitCode) } };
return bufferValue;
}
That is to say: the Talend execution entry point only accepts a String array as input and doesn't return anything as output (except a system return code).
So you won't be able to link to Talend-generated code as a library, only to use it as an isolated tool that you can parameterize (using context vars, see my other response) before launching.
You can see that in the Talend help center and forum the only integration described is "external" job execution:
Talend knowledge base "Calling a Talend Job from an external Java application" article
Talend Community Forum "Java Object to Talend" topic
Maybe you have to rethink the architecture of your application if you want to use Talend as the ETL tool for your purpose.

Now from the Talend ETL point of view: if you want to parameterize the execution environment of your jobs (for example, the physical directory of the uploaded files), you should use context variables, which can be loaded at execution time from a configuration file as described here:
https://help.talend.com/display/TalendOpenStudioforDataIntegrationUserGuide53EN/2.6.6+Context+settings
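Besides a configuration file, context values can also be handed to an exported job through its String array of arguments. The sketch below assumes the `--context_param name=value` argument form understood by Talend-generated jobs; treat that form as an assumption and check it against the code your Talend version generates. `ContextArgsSketch` and `toJobArgs` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that turns context values into arguments for a Talend job's runJob(String[]). */
final class ContextArgsSketch {

    /** Produces one "--context_param name=value" argument per map entry, in iteration order. */
    static String[] toJobArgs(Map<String, String> contextValues) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : contextValues.entrySet()) {
            // Assumed argument format; verify against the generated job's argument parsing.
            args.add("--context_param " + entry.getKey() + "=" + entry.getValue());
        }
        return args.toArray(new String[0]);
    }
}
```

The resulting array would be passed to the job's runJob method shown in the previous answer, for example with an entry such as uploadDir mapped to the directory of the uploaded file.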
