I am developing a web app using Spring MVC. Simply put, a user uploads a file which can be of different types (.csv, .xls, .txt, .xml) and the application parses this file and extracts data for further processing. The problem is that the format of the file can change frequently, so there must be some way for quick and easy customization. Being a bit familiar with Talend, I decided to give it a shot and use it as the ETL tool for my app. This short tutorial shows how to run a Talend job from within a Java app - http://www.talendforge.org/forum/viewtopic.php?id=2901
However, jobs created using Talend can read from/write to physical files, directories or databases. Is it possible to modify a Talend job so that it can be given a Java object as a parameter and return a Java object, just like a regular Java method?
For example something like:
String[] param = new String[]{"John Doe"};
String talendJobOutput = teaPot.myjob_0_1.myJob.main(param);
where teaPot.myjob_0_1.myJob is the Talend job integrated into my app.
I did something similar, I guess. I created a mapping in Talend using tMap and exported it as a Talend job (a Java SE program). If you include the libraries of that job, you can run the Talend job as described by others.
To pass arbitrary Java objects you can use the following methods, which are present in every Talend job:
public Object getValueObject() {
return this.valueObject;
}
public void setValueObject(Object valueObject) {
this.valueObject = valueObject;
}
Inside the job you have to cast this object; e.g. you can pass in a List of HashMaps and use Java reflection to populate rows. Use a tJavaFlex or a custom component for that.
Using this method I can adjust the mapping of my data visually in Talend, but still use the generated code as a library in my Java application.
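A rough sketch of how that can look from the calling side, assuming the generated job class is teaPot.myjob_0_1.myJob as in the question; apart from getValueObject/setValueObject and runJobInTOS, the names and row layout here are purely illustrative:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import teaPot.myjob_0_1.myJob; // generated Talend job class from the question

public class TalendObjectBridge {
    public static void main(String[] args) {
        // Build the input rows as a List of HashMaps, as described above
        List<HashMap<String, Object>> rows = new ArrayList<>();
        HashMap<String, Object> row = new HashMap<>();
        row.put("name", "John Doe");
        rows.add(row);

        myJob job = new myJob();
        job.setValueObject(rows);             // hand the Java object to the job
        job.runJobInTOS(new String[0]);       // run the generated job

        // Inside the job, a tJavaFlex or custom component casts it back, e.g.:
        //   List<HashMap<String, Object>> input = (List<HashMap<String, Object>>) getValueObject();
        Object result = job.getValueObject(); // whatever the job stored as its output
        System.out.println(result);
    }
}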
Now that I better understand what you want, I think this is NOT possible, because Talend's architecture is that of a standalone app, with a "main" entry point much like the Java main() method:
public String[][] runJob(String[] args) {
int exitCode = runJobInTOS(args);
String[][] bufferValue = new String[][] { { Integer.toString(exitCode) } };
return bufferValue;
}
That is to say: the Talend execution entry point only accepts a String array as input and doesn't return anything as output (except a system return code).
So, you won't be able to link to the Talend (generated) code as a library, only use it as an isolated tool that you can parameterize (using context vars, see my other response) before launching.
You can see that in the Talend help center or forum, the only integration described is an "external" job execution:
Talend knowledge base "Calling a Talend Job from an external Java application" article
Talend Community Forum "Java Object to Talend" topic
Maybe you have to rethink the architecture of your application if you want to use Talend as the ETL tool for your purpose.
Now from the Talend ETL point of view: if you want to parameterize the execution environment of your jobs (for example the physical directory of the uploaded files), you should use context variables, which can be loaded at execution time from a configuration file as mentioned here:
https://help.talend.com/display/TalendOpenStudioforDataIntegrationUserGuide53EN/2.6.6+Context+settings
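For example, a minimal sketch of launching the generated job with context parameters from Java; the job class is the one from the question, and the context variable names are made up:

import teaPot.myjob_0_1.myJob; // generated Talend job class from the question

public class RunTalendJobWithContext {
    public static void main(String[] args) {
        myJob job = new myJob();
        // Context variables are passed as "--context_param name=value" arguments
        String[] jobArgs = new String[] {
                "--context_param", "uploadDir=/tmp/uploads",
                "--context_param", "outputFormat=csv"
        };
        int exitCode = job.runJobInTOS(jobArgs);
        System.out.println("Talend job finished with exit code " + exitCode);
    }
}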
First of all, I want to clarify that my experience working with Wikidata is very limited, so feel free to correct me if any of my terminology is wrong.
I've been playing with Wikidata Toolkit, more specifically its wdtk-wikibaseapi module. This allows you to get entity information and its different properties, like so:
WikibaseDataFetcher wbdf = WikibaseDataFetcher.getWikidataDataFetcher();
EntityDocument q42 = wbdf.getEntityDocument("Q42");
List<StatementGroup> groups = ((ItemDocument) q42).getStatementGroups();
for(StatementGroup g : groups) {
List<Statement> statements = g.getStatements();
for(Statement s : statements) {
System.out.println(s.getMainSnak().getPropertyId().getId());
System.out.println(s.getValue());
}
}
The above would get me the entity Douglas Adams and all the properties on his page: https://www.wikidata.org/wiki/Q42
Now, Wikidata Toolkit also has the ability to load and process dump files, meaning you can download a dump to your local machine and process it using the DumpProcessingController class in the wdtk-dumpfiles library. I'm just not sure what is meant by processing.
Can anyone explain to me what processing means in this context?
Can you do something similar to what was done using wdtk-wikibaseapi in the example above, but using a local dump file and wdtk-dumpfiles, i.e. get an entity and its respective properties? I don't want to get the info from an online source, only from the dump (offline).
If this is not possible using Wikidata Toolkit, could you point me to something that can get me started on getting entities and their properties from a Wikidata dump file, please? I am using Java.
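For reference, the dump-processing examples that ship with Wikidata Toolkit follow roughly this callback-based pattern; the method names below are taken from memory of those examples and may differ between toolkit versions, so treat this as a sketch rather than a verified answer:

import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;
import org.wikidata.wdtk.dumpfiles.DumpProcessingController;

public class OfflineDumpExample {
    public static void main(String[] args) {
        DumpProcessingController controller = new DumpProcessingController("wikidatawiki");
        controller.setOfflineMode(true); // only use dumps already downloaded locally

        // "Processing" means registering a processor that is called once per entity in the dump
        EntityDocumentProcessor processor = new EntityDocumentProcessor() {
            @Override
            public void processItemDocument(ItemDocument itemDocument) {
                if ("Q42".equals(itemDocument.getEntityId().getId())) {
                    System.out.println(itemDocument.getStatementGroups());
                }
            }

            @Override
            public void processPropertyDocument(PropertyDocument propertyDocument) {
                // ignore property documents in this sketch
            }
        };
        controller.registerEntityDocumentProcessor(processor, null, true);
        controller.processMostRecentJsonDump();
    }
}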
How can I do the following?
What I want to do is load Stanford NLP ONCE, then interact with it via an HTTP or other endpoint. The reason is that it takes a long time to load, and loading for every string to analyze is out of the question.
For example, here is Stanford NLP loading in a simple C# program that loads the jars... I'm looking to do what I did below, but in Java:
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [9.3 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.all.3class.distsim.crf.ser.gz ... done [12.8 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.muc.7class.distsim.crf.ser.gz ... done [5.9 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.conll.4class.distsim.crf.ser.gz ... done [4.1 sec].
done [8.8 sec].
Sentence #1 ...
This is over 30 seconds. If these all have to load each time, yikes. To show what I want to do in Java, I wrote a working example in C#, and this complete example may help someone some day:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using java.io;
using java.util;
using edu.stanford.nlp;
using edu.stanford.nlp.pipeline;
using Console = System.Console;
namespace NLPConsoleApplication
{
class Program
{
static void Main(string[] args)
{
// Path to the folder with models extracted from `stanford-corenlp-3.6.0-models.jar`
var jarRoot = @"..\..\..\..\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models";
// Text for initial run processing
var text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";
// Annotation pipeline configuration
var props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
props.setProperty("ner.useSUTime", "0");
// We should change current directory, so StanfordCoreNLP could find all the model files automatically
var curDir = Environment.CurrentDirectory;
Directory.SetCurrentDirectory(jarRoot);
var pipeline = new StanfordCoreNLP(props);
Directory.SetCurrentDirectory(curDir);
// loop
while (text != "quit")
{
// Annotation
var annotation = new Annotation(text);
pipeline.annotate(annotation);
// Result - Pretty Print
using (var stream = new ByteArrayOutputStream())
{
pipeline.prettyPrint(annotation, new PrintWriter(stream));
Console.WriteLine(stream.toString());
stream.close();
}
edu.stanford.nlp.trees.TreePrint tprint = new edu.stanford.nlp.trees.TreePrint("words");
Console.WriteLine();
Console.WriteLine("Enter a sentence to evaluate, and hit ENTER (enter \"quit\" to quit)");
text = Console.ReadLine();
} // end while
}
}
}
So it takes 30 seconds to load, but each time you give it a string on the console, it takes only a tiny fraction of a second to parse and tokenize that string.
You can see that I loaded the jar files prior to the while loop.
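For reference, the same load-once-then-loop idea in plain Java would look roughly like this (a sketch only; the pipeline and Annotation classes come from the CoreNLP jars, the rest is illustrative):

import java.io.PrintWriter;
import java.util.Properties;
import java.util.Scanner;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NlpConsole {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
        // The expensive part: the models are loaded once, here
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Scanner in = new Scanner(System.in);
        String text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";
        while (!"quit".equals(text)) {
            Annotation annotation = new Annotation(text);
            pipeline.annotate(annotation); // fast: models are already in memory
            pipeline.prettyPrint(annotation, new PrintWriter(System.out, true));
            System.out.println("Enter a sentence to evaluate (\"quit\" to quit):");
            text = in.nextLine();
        }
    }
}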
This may end up being a socket service, HTML, or something else that will entertain requests (in the form of strings), and spit back the parsing.
My ultimate goal is to use a mechanism in NiFi, via a processor that can send strings to be parsed and have them returned in less than a second, versus the 30+ seconds it would take if, for instance, a traditional threaded web server were used, where every request would load the whole thing for 30 seconds and THEN get down to business. I hope I made this clear!
How to do this?
Any of the mechanisms you list are perfectly reasonable routes forward for leveraging that service with Apache NiFi. Depending on your needs, some of the processors and extensions that are bundled with the standard release of NiFi may be sufficient to interact with your proposed web service or similar offering.
If you are striving to perform all of this within NiFi itself, a custom Controller Service might be a great path: it provides this resource to NiFi and falls within the lifecycle of the application itself.
NiFi can be extended with items like controller services and custom processors, and we have some documentation to get you started down that path.
Additional details from you would certainly help us provide more information. Feel free to follow up here with additional comments and/or reach out to the community via our mailing lists.
One item I did want to call out, in case it was unclear, is that NiFi is JVM driven, and the work would be done in Java or JVM-friendly languages.
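If it helps, here is a very rough sketch of the custom-processor route; this is not a drop-in implementation, and the NLP-specific parts (annotator list, reading the FlowFile as UTF-8 text) are assumptions layered on top of the standard processor API:

import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Properties;
import java.util.Scanner;
import java.util.Set;

import org.apache.nifi.annotation.lifecycle.OnScheduled;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CoreNlpProcessor extends AbstractProcessor {

    public static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Annotated text").build();

    private volatile StanfordCoreNLP pipeline;

    @OnScheduled
    public void loadPipeline(final ProcessContext context) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        // The expensive model loading happens once, when the processor is scheduled
        pipeline = new StanfordCoreNLP(props);
    }

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Read the incoming text, annotate it, and replace the content with the pretty-printed result
        flowFile = session.write(flowFile, (in, out) -> {
            Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A");
            String text = scanner.hasNext() ? scanner.next() : "";
            Annotation annotation = new Annotation(text);
            pipeline.annotate(annotation);
            PrintWriter writer = new PrintWriter(out, true);
            pipeline.prettyPrint(annotation, writer);
            writer.flush();
        });
        session.transfer(flowFile, REL_SUCCESS);
    }
}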
You should look at the new CoreNLP Server which Stanford NLP introduced in version 3.6.0. It seems like it does just what you want? Some other people such as ETS have done similar things.
Fine point: If using this heavily, you might (at present) want to grab the latest CoreNLP code from github HEAD, since it contains a few fixes to the server which will be in the next release.
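For example, once the server is running (it is typically started with something like java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000), a plain Java client can POST text to it; the annotator list and port below are just assumptions for the sketch:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class CoreNlpServerClient {
    public static void main(String[] args) throws Exception {
        // Annotation properties are passed as a URL parameter; the text goes in the POST body
        String props = URLEncoder.encode(
                "{\"annotators\":\"tokenize,ssplit,pos,ner\",\"outputFormat\":\"json\"}", "UTF-8");
        URL url = new URL("http://localhost:9000/?properties=" + props);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write("Kosgi Santosh sent an email to Stanford University."
                    .getBytes(StandardCharsets.UTF_8));
        }

        try (Scanner sc = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
            sc.useDelimiter("\\A");
            System.out.println(sc.hasNext() ? sc.next() : ""); // JSON annotation returned by the server
        }
    }
}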
I am trying to access the list of all jobs/projects in Jenkins and their project files from Java, rather than using Groovy or parsing XML files.
I suggest you use other ways to do this rather than Java. Consider using Ruby or Python API wrappers, Groovy, the CLI API, the Script Console, etc. Also refer to the Remote Access API for more information.
But if you still need Java, well, there is no Java API, but there is a REST API, and you may use any Java HTTP client to communicate with it. Here are the required steps:
1. Get a list of jobs.
This can be done by requesting http://jenkins_url:port/api/json?tree=jobs[name,url].
Response example:
{
  "jobs" : [
    {
      "name" : "JOB_NAME1",
      "url" : "http://jenkins_url:port/job/JOB_NAME1/"
    },
    {
      "name" : "JOB_NAME2",
      "url" : "http://jenkins_url:port/job/JOB_NAME2/"
    },
    ...
  ]
}
From there you can retrieve job names and URLs (a small Java sketch is shown after these steps).
2. Get build artifacts.
Given the job URL, download job_url/lastSuccessfulBuild/artifact/*zip*/archive.zip
3. Or get workspace files.
Given the job URL, download job_url/ws/*zip*/workspace.zip
Beware: some of these operations require proper Jenkins credentials or anonymous access to be enabled; otherwise, the request will fail.
More detailed information about the REST API is available on your Jenkins at http://jenkins_url:port/api/
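A minimal sketch of step 1 with plain HttpURLConnection; the host, port and credentials below are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class JenkinsJobList {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/api/json?tree=jobs[name,url]");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Basic auth with an API token, in case anonymous read access is disabled
        String token = Base64.getEncoder()
                .encodeToString("user:api_token".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + token);

        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        // Parse the JSON with any library (Jackson, Gson, ...) to pull out names and urls
        System.out.println(json);
    }
}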
Like @Vitalii said, it's better to do this in Groovy or some other scripting language, or to parse the api/xml file to get the workspace job list.
For your case, you can get it by making your class extend Trigger and using the job object of the Trigger class.
Note: include all the other default classes the Jenkins plugin requires, and make sure that the plugin runs every minute for this code to execute properly.
public class xyz extends Trigger<BuildableItem>
{
@Override
public void run()
{
LOGGER.info("Project Name"+job.getName());
}
}
Looking through the Oozie examples and documentation, it looks like you need a workflow file in order to run an Oozie job from Java code. Is there any way to submit a job directly from Java code, without needing a workflow file? Is there any pre-existing way to dynamically generate these files through Java code? Are there any pre-existing tools that will make generating them easier? Or will I have to write the entirety of the code to generate the file?
Current Situation
OozieClient wc = new OozieClient("http://bar:8080/oozie");
Properties conf = wc.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "workflow file path");
// set other properties
...
// submit and start the workflow job
wc.run(conf);
The ideal situation would be something vaguely like this:
OozieAction action = new OozieAction("actionName");
action.setOkDestination("nextAction");
action.setErrorDestination("errorDestination");
//Rest of config for action
OozieWorkflow workflow = new OozieWorkflow();
workflow.setStartAction(action);
workflow.addAction(otherAction);
//rest of conf
OozieClient wc = new OozieClient("http://bar:8080/oozie");
wc.runWorkflow(workflow);
Another situation that would be desirable, if the former is impossible, is:
OozieAction action = new OozieAction("actionName");
action.setOkDestination("nextAction");
action.setErrorDestination("errorDestination");
//Rest of config for action
OozieWorkflow workflow = new OozieWorkflow();
workflow.setStartAction(action);
workflow.addAction(otherAction);
//rest of conf
workflow.writeToFile("some localFile")
//load file to HDFS
//This would also work
// workflow.writeToHDFS("someHdfsLocation");
OozieClient wc = new OozieClient("http://bar:8080/oozie");
//run with created workflow
I have been in a similar situation.
What I would suggest is to use the Oozie schema definition (XSD) and generate the equivalent Java objects through xjc. Given these objects, you can probably create the workflow (not trivial, though).
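A sketch of that idea, assuming xjc has already been run against an Oozie workflow XSD (for example oozie-workflow-0.5.xsd) and produced classes along the lines of WORKFLOWAPP and ObjectFactory; the exact generated names depend on the schema version, so everything below is illustrative:

import java.io.StringWriter;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;

public class WorkflowXmlBuilder {
    public static void main(String[] args) throws Exception {
        // ObjectFactory and WORKFLOWAPP are whatever xjc generated from the schema
        ObjectFactory factory = new ObjectFactory();
        WORKFLOWAPP app = factory.createWORKFLOWAPP();
        app.setName("generated-workflow");
        // ... populate start node, actions, kill node and end node here ...

        Marshaller marshaller = JAXBContext.newInstance(WORKFLOWAPP.class).createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);

        StringWriter xml = new StringWriter();
        // Wrap in the JAXBElement that ObjectFactory provides for the root workflow-app element
        marshaller.marshal(factory.createWorkflowApp(app), xml);

        // Upload the resulting XML to HDFS and point OozieClient.APP_PATH at it, as in the question
        System.out.println(xml);
    }
}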
There are Scala-based DSLs you can use:
https://github.com/klout/scoozie does something similar with Scala -> Oozie generation.
Oozie 5.1.0 added support for the Fluent Job API, which makes it possible to write Java code instead of workflow XML files (under the hood, Oozie generates the XML file for you).
A simple example of the Java code, which creates a workflow similar to Oozie's shell action demo:
public class MyFirstWorkflowFactory implements WorkflowFactory {
@Override
public Workflow create() {
final ShellAction shellAction = ShellActionBuilder.create()
.withName("shell-action")
.withResourceManager("${resourceManager}")
.withNameNode("${nameNode}")
.withConfigProperty("mapred.job.queue.name", "${queueName}")
.withExecutable("echo")
.withArgument("my_output=Hello Oozie")
.withCaptureOutput(true)
.build();
final Workflow shellWorkflow = new WorkflowBuilder()
.withName("shell-workflow")
.withDagContainingNode(shellAction).build();
return shellWorkflow;
}
}
More detailed documentation can be found here: https://oozie.apache.org/docs/5.1.0/DG_FluentJobAPI.html
There's a graphical tool to generate Oozie workflows via an Eclipse plugin. Find it on the Eclipse Marketplace: https://marketplace.eclipse.org/content/oozie-eclipse-plugin
Have a static Oozie workflow in your HDFS which just takes 2 parameters and writes the content of parameter1 (say, content which the user enters) to parameter2 (say, a location in HDFS). Now call the Oozie CLI and specify app.path as the location of that workflow.
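The same submission can also be done from Java with the OozieClient snippet from the question, setting the two parameters in the configuration; the paths and parameter names below are made up:

import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class SubmitStaticWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient wc = new OozieClient("http://bar:8080/oozie");
        Properties conf = wc.createConfiguration();

        // Point at the static, parameterized workflow that already sits in HDFS
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/me/static-workflow");

        // The two parameters the workflow expects
        conf.setProperty("parameter1", "content the user entered");
        conf.setProperty("parameter2", "hdfs://namenode:8020/user/me/output");

        String jobId = wc.run(conf); // submit and start the workflow job
        System.out.println("Submitted workflow " + jobId);
    }
}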
I have 2 .m files. One is the function and the other one (read.m) reads the function and exports the results into an Excel file. I have a Java program that makes some changes to the .m files. After the changes, I want to automate the execution of the .m files. I have downloaded matlabcontrol.jar and I am looking for a way to use it to invoke and run the read.m file, which then reads the function.
Can anyone help me with the code? Thanks
I have tried this code but it does not work.
public static void tomatlab() throws MatlabConnectionException, MatlabInvocationException {
MatlabProxyFactoryOptions options =
new MatlabProxyFactoryOptions.Builder()
.setUsePreviouslyControlledSession(true)
.build();
MatlabProxyFactory factory = new MatlabProxyFactory(options);
MatlabProxy proxy = factory.getProxy();
proxy.eval("addpath('C:\\path_to_read.m')");
proxy.feval("read");
proxy.eval("rmpath('C:\\path_to_read.m')");
// close connection
proxy.disconnect();
}
Based on the official tutorial in the Wiki of the project, it seems quite straightforward to start with this API.
The path manipulation might be a bit tricky, but I would try loading the whole script into a string and passing it to eval (please note I have no prior experience with this specific MATLAB library). That could be done quite easily (by joining the result of Files.readAllLines(), for example).
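A rough sketch of that idea, reusing the matlabcontrol calls from the question; the script path is hypothetical and, as said, this is untested:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import matlabcontrol.MatlabProxy;
import matlabcontrol.MatlabProxyFactory;
import matlabcontrol.MatlabProxyFactoryOptions;

public class RunReadScript {
    public static void main(String[] args) throws Exception {
        MatlabProxyFactoryOptions options = new MatlabProxyFactoryOptions.Builder()
                .setUsePreviouslyControlledSession(true)
                .build();
        MatlabProxy proxy = new MatlabProxyFactory(options).getProxy();

        // Load the whole script into one string and hand it to eval,
        // instead of juggling addpath/rmpath
        String script = String.join("\n",
                Files.readAllLines(Paths.get("C:\\scripts\\read.m"), StandardCharsets.UTF_8));
        proxy.eval(script);

        proxy.disconnect();
    }
}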
Hope that helps.