How can I do the following?
What I want to do is load Stanford NLP ONCE, then interact with it via an HTTP or other endpoint. The reason is that it takes a long time to load, and loading for every string to analyze is out of the question.
For example, here is Stanford NLP loading in a simple C# program that loads the jars... I'm looking to do what I did below, but in java:
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [9.3 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.all.3class.distsim.crf.ser.gz ... done [12.8 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.muc.7class.distsim.crf.ser.gz ... done [5.9 sec].
Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.conll.4class.distsim.crf.ser.gz ... done [4.1 sec].
done [8.8 sec].
Sentence #1 ...
This is over 30 seconds. If these all have to load each time, yikes. To show what I want to do in java, I wrote a working example in C#, and this complete example may help someone some day:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using java.io;
using java.util;
using edu.stanford.nlp;
using edu.stanford.nlp.pipeline;
using Console = System.Console;
namespace NLPConsoleApplication
{
class Program
{
static void Main(string[] args)
{
// Path to the folder with models extracted from `stanford-corenlp-3.6.0-models.jar`
var jarRoot = #"..\..\..\..\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models";
// Text for intial run processing
var text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";
// Annotation pipeline configuration
var props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
props.setProperty("ner.useSUTime", "0");
// We should change current directory, so StanfordCoreNLP could find all the model files automatically
var curDir = Environment.CurrentDirectory;
Directory.SetCurrentDirectory(jarRoot);
var pipeline = new StanfordCoreNLP(props);
Directory.SetCurrentDirectory(curDir);
// loop
while (text != "quit")
{
// Annotation
var annotation = new Annotation(text);
pipeline.annotate(annotation);
// Result - Pretty Print
using (var stream = new ByteArrayOutputStream())
{
pipeline.prettyPrint(annotation, new PrintWriter(stream));
Console.WriteLine(stream.toString());
stream.close();
}
edu.stanford.nlp.trees.TreePrint tprint = new edu.stanford.nlp.trees.TreePrint("words");
Console.WriteLine();
Console.WriteLine("Enter a sentence to evaluate, and hit ENTER (enter \"quit\" to quit)");
text = Console.ReadLine();
} // end while
}
}
}
So it takes the 30 seconds to load, but each time you give it a string on the console, it takes the smallest fraction of a second to parse & tokenize that string.
You can see that I loaded the jar files prior to the while loop.
This may end up being a socket service, HTML, or something else that will entertain requests (in the form of strings), and spit back the parsing.
My ultimate goal is to use a mechanism in Nifi, via a processor that can send strings to be parsed, and have them returned in less than a second, versus 30+ seconds if a traditional web server threaded example (for instance) is used. Every request would load the whole thing for 30 seconds, THEN get down to business. I hope I made this clear!
How to do this?
Any of the mechanisms you list are perfectly reasonable routes forward for leveraging that service with Apache NiFi. Depending on your needs, some of the processors and extensions that are bundled with the standard release of NiFi may be sufficient to interact with your proposed web service or similar offering.
If you are striving for performing all of this within NiFi itself, a custom Controller Service might be a great path to provide this resource to NiFi that falls within the lifecycle of the application itself.
NiFi can be extended with items like controller services and custom processors and we have some documentation to get you started down that path.
Additional details could certainly help to provide some more information. Feel free to follow up on here with additional comments and/or reach out to the community via our mailing lists.
One item I did want to call out if it was unclear that NiFi is JVM driven and work would be done in Java or JVM friendly languages.
You should look at the new CoreNLP Server which Stanford NLP introduced in version 3.6.0. It seems like it does just what you want? Some other people such as ETS have done similar things.
Fine point: If using this heavily, you might (at present) want to grab the latest CoreNLP code from github HEAD, since it contains a few fixes to the server which will be in the next release.
Related
First of all, I want to clarify that my experience working with wikidata is very limited, so feel free to correct if any of my terminology is wrong.
I've been playing with wikidata toolkit, more specifically their wdtk-wikibaseapi. This allows you to get entity information and their different properties as such:
WikibaseDataFetcher wbdf = WikibaseDataFetcher.getWikidataDataFetcher();
EntityDocument q42 = wbdf.getEntityDocument("Q42");
List<StatementGroup> groups = ((ItemDocument) q42).getStatementGroups();
for(StatementGroup g : groups) {
List<Statement> statements = g.getStatements();
for(Statement s : statements) {
System.out.println(s.getMainSnak().getPropertyId().getId());
System.out.println(s.getValue());
}
}
The above would get me the entity Douglas Adams and all the properties under his site: https://www.wikidata.org/wiki/Q42
Now wikidata toolkit has the ability to load and process dump files, meaning you can download a dump to your local and process it using their DumpProcessingController class under the wdtk-dumpfiles library. I'm just not sure what is meant by processing.
Can anyone explain me what does processing mean in this context?
Can you do something similar to what was done using wdtk-wikibaseapi in the example above but using a local dump file and wdtk-dumpfiles i.e. get an entity and it's respective properties? I don't want to get the info from online source, only from the dump (offline).
If this is not possible using wikidata-toolkit, could you point me to somewhere that can get me started on getting entities and their properties from a dump file for wikidata please? I am using Java.
I am just writing an Interface between a java application and an AS400.
For this purpose I use jt400. I managed to get information about the systemstatus like CPU usage, as well I managed to receive the current status about subsystems and jobs.
Now I am searching for an option to have a look at the different job queues inside the AS400.
For example: I would like to know, how many jobs are in which queue.
Is there a solution via jt400 or a different approach to access those information via java?
The corresponding command inside AS400 is WRKJOBQ
Best
LStrike
[Edit]
The following code is my filter for JobList. But how do I configure QSYSObjectPathName that it is matching WRKJOBQ?
QSYSObjectPathName path = new QSYSObjectPathName(.....);
JobList jList = new JobList(as400);
jList.addJobSelectionCriteria(JobList.SELECTION_PRIMARY_JOB_STATUS_JOBQ, true);
jList.addJobSelectionCriteria(JobList.SELECTION_JOB_QUEUE, path.getPath());
Job[] jobs = jList.getJobs(-1, 1);
System.out.println("Jobs Size: " + jobs.length);
You can use a JobList object for that, using SELECTION_JOB_QUEUE to filter jobs.
Once your selection suits your need, JobList#getLength() will give you the number of jobs.
See also this question
I'm trying to define a Pentaho Kettle (ktr) transformation via code. I would like to add to the transformation a Text File Input Step: http://wiki.pentaho.com/display/EAI/Text+File+Input.
I don't know how to do this (note that I want to achieve the result in a custom Java application, not using the standard Spoon GUI). I think I should use the TextFileInputMeta class, but when I try to define the filename the trasformation doesn't work anymore (it seems empty in Spoon).
This is the code I'm using. I think the third line has something wrong:
PluginRegistry registry = PluginRegistry.getInstance();
TextFileInputMeta fileInMeta = new TextFileInputMeta();
fileInMeta.setFileName(new String[] {myFileName});
String fileInPluginId = registry.getPluginId(StepPluginType.class, fileInMeta);
StepMeta fileInStepMeta = new StepMeta(fileInPluginId, myStepName, fileInMeta);
fileInStepMeta.setDraw(true);
fileInStepMeta.setLocation(100, 200);
transAWMMeta.addStep(fileInStepMeta);
To run a transformation programmatically, you should do the following:
Initialise Kettle
Prepare a TransMeta object
Prepare your steps
Don't forget about Meta and Data objects!
Add them to TransMeta
Create Trans and run it
By default, each transformation germinates a thread per step, so use trans.waitUntilFinished() to force your thread to wait until execution completes
Pick execution's results if necessary
Use this test as example: https://github.com/pentaho/pentaho-kettle/blob/master/test/org/pentaho/di/trans/steps/textfileinput/TextFileInputTests.java
Also, I would recommend you create the transformation manually and to load it from file, if it is acceptable for your circumstances. This will help to avoid lots of boilerplate code. It is quite easy to run transformations in this case, see an example here: https://github.com/pentaho/pentaho-kettle/blob/master/test/org/pentaho/di/TestUtilities.java#L346
I'm currently working on a project to forward (and later transcode) a RTP-Stream from a IP-Webcam to a SIP-User in a videocall.
I came up with the following gstreamer pipeline:
gst-launch -v rtspsrc location="rtsp://user:pw#ip:554/axis-media/media.amp?videocodec=h264" ! rtph264depay ! rtph264pay ! udpsink sync=false host=xxx.xxx.xx.xx port=xxxx
It works very fine. Now I want to create this pipeline using java. This is my code for creating the pipe:
Pipeline pipe = new Pipeline("IPCamStream");
// Source
Element source = ElementFactory.make("rtspsrc", "source");
source.set("location", ipcam);
//Elements
Element rtpdepay = ElementFactory.make("rtph264depay", "rtpdepay");
Element rtppay = ElementFactory.make("rtph264pay", "rtppay");
//Sink
Element udpsink = ElementFactory.make("udpsink", "udpsink");
udpsink.set("sync", "false");
udpsink.set("host", sinkurl);
udpsink.set("port", sinkport);
//Connect
pipe.addMany(source, rtpdepay, rtppay, udpsink);
Element.linkMany(source, rtpdepay, rtppay, udpsink);
return pipe;
When I start/set up the pipeline, I'm able to see the input of the camera using wireshark, but unfortunately there is no sending to the UDP-Sink. I have checked the code for mistakes a couple of times, I even set a pipeline for streaming from a file (filesrc) to the same udpsink, and it also works fine.
But why is the "forwarding" of the IP-Cam to the UDP-Sink not working with this Java-Pipeline?
I haven't used the Java version of GStreamer, but something you need to be aware of when linking is that sometimes the source pad of an element is not immediately available.
If you do gst-inspect rtspsrc, and look at the pads, you'll see this:
Pad Templates:
SRC template: 'stream_%u'
Availability: Sometimes
Capabilities:
application/x-rtp
application/x-rdt
That "Availability: Sometimes" means your initial link will fail. The source pad you want will only appear after some number of RTP packets have arrived.
For this case you need to either link the elements manually by waiting for the pad-added event, or what I like to do in C is use the gst_parse_bin_from_description function. There is probably something similar in Java. It automatically adds listeners for the pad-added events and links up the pipeline.
gst-launch uses these same parse_bin functions, I believe. That's why it always links things up fine.
I am developing a web app using Spring MVC. Simply put, a user uploads a file which can be of different types (.csv, .xls, .txt, .xml) and the application parses this file and extracts data for further processing. The problem is that I format of the file can change frequently. So there must be some way for quick and easy customization. Being a bit familiar with Talend, I decided to give it a shot and use it as ETL tool for my app. This short tutorial shows how to run Talend job from within Java app - http://www.talendforge.org/forum/viewtopic.php?id=2901
However, jobs created using Talend can read from/write to physical files, directories or databases. Is it possible to modify Talend job so that it can be given some Java object as a parameter and then return Java object just as usual Java methods?
For example something like:
String[] param = new String[]{"John Doe"};
String talendJobOutput = teaPot.myjob_0_1.myJob.main(param);
where teaPot.myjob_0_1.myJob is the talend job integrated into my app
I did something similar I guess. I created a mapping in tallend using tMap and exported this as talend job (java se programm). If you include the libraries of that job, you can run the talend job as described by others.
To pass arbitrary java objects you can use the following methods which are present in every talend job:
public Object getValueObject() {
return this.valueObject;
}
public void setValueObject(Object valueObject) {
this.valueObject = valueObject;
}
In your job you have to cast this object. e.g. you can put in a List of HashMaps and use Java reflection to populate rows. Use tJavaFlex or a custom component for that.
Using this method I can adjust the mapping of my data visually in Talend, but still use the generated code as library in my java application.
Now I better understand your willing, I think this is NOT possible because Talend's architecture is made like a standalone app, with a "main" entry point merely as does the Java main() method :
public String[][] runJob(String[] args) {
int exitCode = runJobInTOS(args);
String[][] bufferValue = new String[][] { { Integer.toString(exitCode) } };
return bufferValue;
}
That is to say : the Talend execution entry point only accepts a String array as input and doesn't returns anything as output (except as a system return code).
So, you won't be able link to Talend (generated) code as a library but as an isolated tool that you can only parameterize (using context vars, see my other response) before launching.
You can see that in Talend help center or forum the only integration described is as an "external" job execution ... :
Talend knowledge base "Calling a Talend Job from an external Java application" article
Talend Community Forum "Java Object to Talend" topic
May be you have to rethink the architecture of your application if you want to use Talend as the ETL tool for your purpose.
Now from Talend ETL point of view : if you want to parameter the execution environment of your Jobs (for exemple the physical directory of the uploaded files), you should use context variables that can be loaded at execution time from a configuration file as mentioned here :
https://help.talend.com/display/TalendOpenStudioforDataIntegrationUserGuide53EN/2.6.6+Context+settings