Cascading tutorial word count example error - java

I am learning Cascading now. I am currently looking at the second tutorial on its official website, which is about the Word Count example. I copied the code from it and tried to run it, but it always gives me the following errors:
Exception in thread "main" cascading.flow.planner.PlannerException: could not build flow from assembly: [[token][com.starscriber.cascadingtest.Main.main(Main.java:44)]
unable to resolve argument selector: [{1}:'text'], with incoming: [{1}:'doc01 A rain shadow is a dry area on the lee back side of a mountainous area.']] at cascading.flow.planner.FlowPlanner.handleExceptionDuringPlanning(FlowPlanner.java:576)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:263)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
at com.starscriber.cascadingtest.Main.main(Main.java:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: cascading.pipe.OperatorException: [token][com.starscriber.cascadingtest.Main.main(Main.java:44)]
unable to resolve argument selector: [{1}:'text'], with incoming: [{1}:'doc01 A rain shadow is a dry area on the lee back side of a mountainous area.']
at cascading.pipe.Operator.resolveArgumentSelector(Operator.java:345)
at cascading.pipe.Each.outgoingScopeFor(Each.java:368)
at cascading.flow.planner.ElementGraph.resolveFields(ElementGraph.java:628)
at cascading.flow.planner.ElementGraph.resolveFields(ElementGraph.java:610)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:248)
... 8 more
Caused by: cascading.tuple.FieldsResolverException:
could not select fields: [{1}:'text'], from: [{1}:'doc01 A rain shadow is a dry area on the lee back side of a mountainous area.']
at cascading.tuple.Fields.indexOf(Fields.java:1008)
at cascading.tuple.Fields.select(Fields.java:1064)
at cascading.pipe.Operator.resolveArgumentSelector(Operator.java:341)
... 12 more
How come? I copied exactly the same code from its official GitHub and didn't change anything...
String docPath = args[0];
String wcPath = args[1];
Properties properties = new Properties();
AppProps.setApplicationJarClass(properties, Main.class);
HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);
// create source and sink taps
Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);
// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
// determine the word counts
Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
.setName("wc")
.addSource(docPipe, docTap)
.addTailSink(wcPipe, wcTap);
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.writeDOT("dot/wc.dot");
wcFlow.complete();
Where is the problem?
And this is the input file:
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]

Check once whether there is a tab between the two fields, docId and text, in the input file. The program expects two tab-separated fields, but in your case it is reading the whole line into one field.
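For reference, a minimal sketch of the layout TextDelimited(true, "\t") expects: a header row naming the columns, then a literal tab (written here as \t) between the doc_id and text values on every line.
doc_id\ttext
doc01\tA rain shadow is a dry area on the lee back side of a mountainous area.
doc02\tThis sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.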

As other people have already mentioned, you need to have the same headers the example expects. Instead of copying the code, try cloning the repository so that you won't have any errors related to file formatting.
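If fixing the data file is not an option, a hedged alternative is to name the fields in code and tell TextDelimited there is no header row; this assumes the TextDelimited(Fields, boolean, String) constructor available in Cascading 2.x.
// Declare the columns explicitly instead of reading them from a header line.
Fields docFields = new Fields("doc_id", "text");
Tap docTap = new Hfs(new TextDelimited(docFields, false, "\t"), docPath);
The lines still have to be tab-separated for the two fields to be split correctly.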

Related

How do I iterate over an entire document in OpenOffice/LibreOffice with UNO

I am writing Java code to access a document open in LibreOffice.
I now need to write some code which iterates over the entire document, hopefully in the same order it is shown in the editor.
I can use this code to iterate over all the normal text:
XComponent writerComponent=xComponentLoader.loadComponentFromURL(loadUrl, "_blank", 0, loadProps);
XTextDocument mxDoc=UnoRuntime.queryInterface(XTextDocument.class, writerComponent);
XText mxDocText=mxDoc.getText();
XEnumerationAccess xParaAccess = (XEnumerationAccess) UnoRuntime.queryInterface(XEnumerationAccess.class, mxDocText);
XEnumeration xParaEnum = xParaAccess.createEnumeration();
while (xParaEnum.hasMoreElements()) {
Object element = xParaEnum.nextElement(); // advance to the next paragraph-level element on each iteration
XEnumerationAccess inlineAccess = (XEnumerationAccess) UnoRuntime.queryInterface(XEnumerationAccess.class, element);
XEnumeration inline = inlineAccess.createEnumeration();
// And I can then iterate over this inline element and get all the text and formatting.
}
But the problem is that this does not include any chart objects.
I can then use
XDrawPagesSupplier drawSupplier=UnoRuntime.queryInterface(XDrawPagesSupplier.class, writerComponent);
XDrawPages pages=drawSupplier.getDrawPages();
XDrawPage drawPage=UnoRuntime.queryInterface(XDrawPage.class, pages.getByIndex(0)); // a Writer document exposes a single draw page
for(int j=0;j!=drawPage.getCount();j++) {
Object sub=drawPage.getByIndex(j);
XShape subShape=UnoRuntime.queryInterface(XShape.class,sub);
// Now I got my subShape, but how do I know its position, relative to the text.
}
And this gives me all charts (and other figures, I guess), but the problem is: how do I find out where these charts are positioned in relation to the text in the model? And how do I get a cursor which represents each chart?
Update:
I am now looking for an anchor for my XShape, but XShape doesn't have a getAnchor() method.
But if I use
XPropertySet prop=UnoRuntime.queryInterface(XPropertySet.class,shape);
I get the prop object.
And I call prop.getPropertyValue("AnchorType"), which gives me an anchor type of TextContentAnchorType.AS_CHARACTER,
but I just can't get the anchor itself. There is no anchor or text range property.
By the way: I tried looking into installing "MRI" for LibreOffice, but the only version I could find has LibreOffice 3.3 as the supported version, and it would not install on version 7.1.
----- Update 2 -----
I managed to make it work. It turns out that my XShape also implements XTextContent (thank you, MRI), so all I had to do was:
XTextContent subContent=UnoRuntime.queryInterface(XTextContent.class,subShape);
XTextRange anchor=subContent.getAnchor();
XTextCursor cursor = anchor.getText().createTextCursorByRange(anchor.getStart());
cursor.goRight((short)50,true);
System.out.println("String=" + cursor.getString());
This gives me a cursor which points to the paragraph, which I can then move forward/backward to find out where the shape is. So this println call will print the 50 characters following the XShape.
How do I find out where these charts are positioned in relation to the text in the model? And how do I get a cursor which represents each chart?
Abridged comments
Anchors pin objects to a specific location. Does the shape have a method getAnchor() or property AnchorType? I would use an introspection tool such as MRI to determine this. Download MRI 1.3.4 from https://github.com/hanya/MRI/releases.
As for a cursor, maybe it is similar to tables:
oText = oTable.getAnchor().getText()
oCurs = oText.createTextCursor()
Code solution given by OP
XTextContent subContent=UnoRuntime.queryInterface(XTextContent.class,subShape);
XTextRange anchor=subContent.getAnchor();
XTextCursor cursor = anchor.getText().createTextCursorByRange(anchor.getStart());
cursor.goRight((short)50,true);
System.out.println("String=" + cursor.getString());

Getting Stanford NLP to recognise named entities with multiple words

First off let me say that I am a complete newbie with NLP. Although, as you read on, that is probably going to become strikingly apparent.
I'm parsing Wikipedia pages to find all mentions of the page title. I do this by going through the CorefChainAnnotations to find "proper" mentions - I then assume that the most common ones are talking about the page title. I do it by running this:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String content = "Abraham Lincoln was an American politician and lawyer who served as the 16th President of the United States from March 1861 until his assassination in April 1865. Lincoln led the United States through its Civil War—its bloodiest war and perhaps its greatest moral, constitutional, and political crisis.";
Annotation document = new Annotation(content);
pipeline.annotate(document);
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
List<CorefChain.CorefMention> corefMentions = cc.getMentionsInTextualOrder();
for (CorefChain.CorefMention cm : corefMentions) {
if (cm.mentionType == Dictionaries.MentionType.PROPER) {
log("Proper ref using " + cm.mentionSpan + ", " + cm.mentionType);
}
}
}
This returns:
Proper ref using the United States
Proper ref using the United States
Proper ref using Abraham Lincoln
Proper ref using Lincoln
I already know that "Abraham Lincoln" is definitely what I am looking for, and I can surmise that because "Lincoln" also appears a lot, it must be another way of referring to the main subject. (I realise that right now the most common named entity is "the United States", but once I've fed it the whole page it works fine.)
This works great until I have a page like "Gone with the Wind". If I change my code to use that:
String content = "Gone with the Wind has been criticized as historical revisionism glorifying slavery, but nevertheless, it has been credited for triggering changes to the way African-Americans are depicted cinematically.";
then I get no Proper mentions back at all. I suspect this is because none of the words in the title are recognised as named entities.
Is there any way I can get Stanford NLP to recognise "Gone with the Wind" as an already-known named entity? From looking around on the internet it seems to involve training a model, but I want this to be a known named entity just for this single run, and I don't want the model to remember this training later.
I can just imagine NLP experts rolling their eyes at the awfulness of this approach, but it gets better! I came up with the great idea of changing any occurrences of the page title to "Thingamijig" before passing the text to Stanford NLP, which works great for "Gone with the Wind" but then fails for "Abraham Lincoln" because (I think) the NER no longer associates "Lincoln" with "Thingamijig" in the corefMentions.
In my dream world I would do something like:
pipeline.addKnownNamedEntity("Gone with the Wind");
But that doesn't seem to be something I can do and I'm not exactly sure how to go about it.
You can submit a dictionary with any phrases you want and have them recognized as named entities.
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.regexner.mapping additional.rules -file example.txt -outputFormat text
additional.rules
Gone With The Wind MOVIE MISC 1
Note that the columns above should be tab-delimited. You can have as many lines as you'd like in the additional.rules file.
One warning: EVERY TIME that token pattern occurs, it will be tagged.
More details here: https://stanfordnlp.github.io/CoreNLP/ner.html
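Since the question builds the pipeline in Java rather than on the command line, the same mapping can also be supplied through the pipeline properties; a hedged sketch, assuming additional.rules is readable from the working directory.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
// Point the NER annotator at the extra tab-delimited rules file.
props.setProperty("ner.additional.regexner.mapping", "additional.rules");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);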

CNN for Sentiment Analysis using TFLearn model for Android to classify user input

I have a CNN model for text classification which uses a pre-trained GloVe embedding. I have frozen that graph, optimized it for inference, and am using it in Android Studio. The problem is when I try to pass the weights into the model for inference. I have a JSON file with the key-value pairs between the words and the embedding, which I use to create an input of embeddings from the text that the user types in. I can already get the embeddings from the JSON file, but when I try to feed it into the graph for inference, it gives me the following error:
java.lang.IllegalArgumentException: indices[0,3891] = -2 is not in [0,
7459)
[[Node: EmbeddingLayer/embedding_lookup = Gather[Tindices=DT_INT32,
Tparams=DT_FLOAT, _class=["loc:#EmbeddingLayer/W"],
validate_indices=false,
_device="/job:localhost/replica:0/task:0/device:CPU:0"]
(EmbeddingLayer/W/read, EmbeddingLayer/Cast)]]
The Android code is in my GitHub
https://github.com/sushiboo/testNN1
The main code that gives me problems is the classify method:
private void classify(float[] input){
TFInference = new TensorFlowInferenceInterface(getAssets(), MODEL_FILE);
TFInference.feed(INPUT_NODE, input, 1, input.length);
TFInference.run(OUTPUT_NODES);
float[] resu = new float[2];
TFInference.fetch(OUTPUT_NODE, resu);
tvResult.setText("Programmer: " + Float.toString(resu[0]) + "\n Construction" + Float.toString(resu[1]));
Log.e("Result: ", Float.toString(resu[0]));
}
The problem is in the
TFInference.run(OUTPUT_NODES);
In the error message, the number '7459' represents the input dimension of the embedding layer.
I am really confused as to what is happening here, but I know that indices[0,3891] = -2 plays some part in this.
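Reading the error literally, indices[0,3891] = -2 means the embedding lookup received the index -2 for one word, which is outside the valid range [0, 7459). A hedged sketch of building the index array so every value stays in range; vocab, vocabSize, and unkIndex are hypothetical names, and reserving an <UNK> slot is an assumption about the vocabulary.
private float[] toIndices(String[] words, Map<String, Integer> vocab, int vocabSize, int unkIndex) {
    float[] indices = new float[words.length];
    for (int i = 0; i < words.length; i++) {
        Integer id = vocab.get(words[i]);
        // Map unknown or out-of-range words to the reserved <UNK> index instead of a negative sentinel.
        indices[i] = (id == null || id < 0 || id >= vocabSize) ? unkIndex : id;
    }
    return indices;
}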
The problem was with the model, guys. I have fixed this one and now I am stuck on another error.

Stanford Core NLP: Entity type non-deterministic

I have built a Java parser using Stanford CoreNLP. I am finding an issue in getting consistent results with the CoreNLP object: I am getting different entity types for the same input text. It seems like a bug in CoreNLP to me. I am wondering if any StanfordNLP users have encountered this issue and found a workaround for it. This is my service class, which I am instantiating and reusing.
class StanfordNLPService {
//private static final Logger logger = LogConfiguration.getInstance().getLogger(StanfordNLPServer.class.getName());
private StanfordCoreNLP nerPipeline;
/*
Initialize the nlp instances for ner and sentiments.
*/
public void init() {
Properties nerAnnotators = new Properties();
nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
nerPipeline = new StanfordCoreNLP(nerAnnotators);
}
/**
* @param text Text from which entities are to be extracted.
*/
public void printEntities(String text) {
// boolean tracking = PerformanceMonitor.start("StanfordNLPServer.getEntities");
try {
// Properties nerAnnotators = new Properties();
// nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
// nerPipeline = new StanfordCoreNLP(nerAnnotators);
Annotation document = nerPipeline.process(text);
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
// Get the entity type and offset information needed.
String currEntityType = token.get(CoreAnnotations.NamedEntityTagAnnotation.class); // Ner type
int currStart = token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class); // token offset_start
int currEnd = token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class); // token offset_end.
String currPos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class); // POS type
System.out.println("(Type:value:offset)\t" + currEntityType + ":\t"+ text.substring(currStart,currEnd)+"\t" + currStart);
}
}
}catch(Exception e){
e.printStackTrace();
}
}
}
Discrepancy in results: the type changed from MISC to O after the initial use.
Iteration 1:
(Type:value:offset) MISC: Appropriate 100
(Type:value:offset) MISC: Time 112
Iteration 2:
(Type:value:offset) O: Appropriate 100
(Type:value:offset) O: Time 112
Here is the answer from the NER FAQ:
http://nlp.stanford.edu/software/crf-faq.shtml
Is the NER deterministic? Why do the results change for the same data?
Yes, the underlying CRF is deterministic. If you apply the NER to the same sentence more than once, though, it is possible to get different answers the second time. The reason for this is the NER remembers whether it has seen a word in lowercase form before.
The exact way this is used as a feature is in the word shape feature, which treats words such as "Brown" differently if it has or has not seen "brown" as a lowercase word before. If it has, the word shape will be "Initial upper, have seen all lowercase", and if it has not, the word shape will be "Initial upper, have not seen all lowercase".
This feature can be turned off in recent versions with the flag -useKnownLCWords false
I've looked over the code some, and here is a possible way to resolve this:
What you could do to solve this is load each of the 3 serialized CRFs with useKnownLCWords set to false, and serialize them again. Then supply the new serialized CRFs to your StanfordCoreNLP.
Here is a command for loading a serialized CRF with useKnownLCWords set to false, and then dumping it again:
java -mx600m -cp "*:." edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -useKnownLCWords false -serializeTo classifiers/new.english.all.3class.distsim.crf.ser.gz
Put whatever names you want, obviously! This command assumes you are in stanford-corenlp-full-2015-04-20/ and have a directory classifiers with the serialized CRFs. Change as appropriate for your setup.
This command should load the serialized CRF, override it with useKnownLCWords set to false, and then re-dump the CRF to new.english.all.3class.distsim.crf.ser.gz.
Then in your original code:
nerAnnotators.put("ner.model","comma-separated-list-of-paths-to-new-serialized-crfs");
Please let me know if this works or if it's not working, and I can look more deeply into this!
After doing some research, I found the issue is in the ClassifierCombiner.classify() method. One of the baseClassifiers loaded by default, edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz, returns a different type on some occasions. I am trying to load only the first model to resolve this issue.
The problem is the following area of the code
CRFClassifier.classifyMaxEnt()
int[] bestSequence = tagInference.bestSequence(model); Line 1249
ExactBestSequenceFinder.bestSequence() is returning a different sequence for the above model for the same input when called multiple times.
Not sure whether this needs a code fix or some configuration changes to the model. Any additional insight is appreciated.
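A hedged sketch of the workaround mentioned above (loading only the first model) via the ner.model property from the earlier answer; this assumes the 3-class classifier is the first of the defaults.
Properties nerAnnotators = new Properties();
nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
// Load only the 3-class CRF instead of the default three-classifier combination.
nerAnnotators.put("ner.model", "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz");
StanfordCoreNLP nerPipeline = new StanfordCoreNLP(nerAnnotators);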

How does Cascading TextDelimited handle the log file

I am following the Cascading guide on its website. I have the following TSV-format input:
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
I use the following code to process it:
Tap docTap = new Hfs(new TextDelimited(true, "\t"), inPath);
...
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
It looks like it just splits the second part of each line (ignoring the doc_id part). How does Cascading ignore the first doc_id part and process only the second part? Is that because of TextDelimited?
If you look at the pipe statement
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
the second argument is the only field you are sending to the splitter function. Here you are sending the 'text' field, so only the text is sent to the splitter, which returns the tokens.
The Javadoc below explains the Each constructor clearly.
Each
@ConstructorProperties(value={"name","argumentSelector","function","outputSelector"})
public Each(String name,
Fields argumentSelector,
Function function,
Fields outputSelector)
Only pass argumentFields to the given function, only return fields selected by the outputSelector.
Parameters:
name - name for this branch of Pipes
argumentSelector - field selector that selects Function arguments from the input Tuple
function - Function to be applied to each input Tuple
outputSelector - field selector that selects the output Tuple from the input and Function results Tuples
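To make the two selectors concrete, here is a hedged variant that is not part of the tutorial: with Fields.ALL as the output selector, the incoming fields are kept alongside the function results, so each output tuple would carry doc_id, text, and token, whereas Fields.RESULTS keeps only the splitter's token field.
// Hypothetical variant for illustration: keep every incoming field plus the generated tokens.
Pipe docPipeAll = new Each("token", text, splitter, Fields.ALL);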
The answer is in these two lines:
1. The way the Tap was created, the program was told that the first line contains a header ("true"):
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
2. And second, in this line the column name was provided as "text". If you look closely at your input file, "text" is the column name for the data you are trying to base your word count on:
Fields text = new Fields( "text" );
