For a project I am currently working on, I need to annotate sentences with FrameNet annotations. The SEMAFOR semantic parser (https://github.com/Noahs-ARK/semafor) handles this well. I installed and configured the tool as described in the Git repository. However, when I run the runSemafor.sh script from the Cygwin terminal, it throws an IllegalArgumentException indicating that the generated pos.tagged file cannot be parsed.
Here is the complete console output in Cygwin (running on Windows):
$ ./runSemafor.sh D:/XFrame/Libs/Semafor/semafor/temp/sample.txt D:/XFrame/Libs/Semafor/semafor/temp/output 2
**********************************************************************
Tokenizing file: D:/XFrame/Libs/Semafor/semafor/temp/neu.txt
real 0m0.140s
user 0m0.015s
sys 0m0.108s
Finished tokenization.
**********************************************************************
**********************************************************************
Part-of-speech tagging tokenized data....
/cygdrive/d/XFrame/Libs/Semafor/semafor/scripts/jmx/cygdrive/d/XFrame/Libs/Semafor/semafor/bin
Read 11692 items from tagger.project/word.voc
Read 45 items from tagger.project/tag.voc
Read 42680 items from tagger.project/tagfeatures.contexts
Read 42680 contexts, 117558 numFeatures from tagger.project/tagfeatures.fmap
Read model tagger.project/model : numPredictions=45, numParams=117558
Read tagdict from tagger.project/tagdict
*This is MXPOST (Version 1.0)*
*Copyright (c) 1997 Adwait Ratnaparkhi*
Sentence: 0 Length: 9 Elapsed Time: 0.007 seconds.
real 0m0.762s
user 0m0.046s
sys 0m0.171s
/cygdrive/d/XFrame/Libs/Semafor/semafor/bin
Finished part-of-speech tagging tokenized data.
**********************************************************************
**********************************************************************
Converting postagged input to conll.
Exception in thread "main" java.lang.IllegalArgumentException:
at edu.cmu.cs.lti.ark.fn.data.prep.formats.SentenceCodec.decode(SentenceCodec.java:83)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.SentenceCodec$SentenceIterator.computeNext(SentenceCodec.java:115)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.SentenceCodec$SentenceIterator.computeNext(SentenceCodec.java:100)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.ConvertFormat.convertStream(ConvertFormat.java:94)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.ConvertFormat.main(ConvertFormat.java:76)
Caused by: java.lang.IllegalArgumentException: PosToken must have 2 "_"-separated fields
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.Token.fromPosTagged(Token.java:248)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.SentenceCodec$2.decodeToken(SentenceCodec.java:28)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.SentenceCodec.decode(SentenceCodec.java:79)
... 6 more
As input for the annotation I use the sample file from the repository:
This is a test for SEMAFOR, a frame-semantic parser.
This is just a dummy line.
There's a Santa Claus!
The generated pos.tagged file, however, looks fine:
This_DT is_VBZ a_DT test_NN for_IN SEMAFOR_NNP ,_, a_DT frame-semantic_JJ parser_NN ._.
This_DT is_VBZ just_RB a_DT dummy_JJ line_NN ._.
There_EX 's_VBZ a_DT Santa_NNP Claus_NNP !_.
Why does this exception occur?
I came across exactly the same issue and solved it moments ago myself. The parser only accepts correctly formatted input files with one sentence per line.
What you need to do is: when writing each sentence to the file, add the following lines to your code to remove line breaks and tabs. Then you should be good to go!
line = line.replace('\n', '')
line = line.replace('\t', '')
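A minimal sketch of that preprocessing step (the function name and sample strings are just placeholders; the point is one clean sentence per output line):

```python
def normalize_sentences(sentences):
    """Normalize raw text into one-sentence-per-line input:
    no newlines or tabs inside a sentence, no repeated whitespace."""
    cleaned = []
    for line in sentences:
        line = line.replace('\n', ' ').replace('\t', ' ')
        line = ' '.join(line.split())  # collapse runs of whitespace
        if line:                       # drop empty lines
            cleaned.append(line)
    return cleaned

sentences = ["This is a test for SEMAFOR,\na frame-semantic parser.",
             "This is just\ta dummy line."]
for s in normalize_sentences(sentences):
    print(s)
```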
I am testing whether ANTLR 4.7.1 is working properly by using a sample provided by my professor, trying to match these results for the same printed set of tokens:
% java -jar ./antlr-4.7.1-complete.jar HelloExample.g4
% javac -cp antlr-4.7.1-complete.jar HelloExample*.java
% java -cp .:antlr-4.7.1-complete.jar org.antlr.v4.gui.TestRig HelloExample greeting helloworld.greeting -tokens
[#0,0:4='Hello',<1>,1:0]
[#1,6:10='World',<3>,1:6]
[#2,12:12='!',<2>,1:12]
[#3,14:13='<EOF>',<-1>,2:0]
(greeting Hello World !)
However, after getting to the 3rd command, my output was instead:
[#0,0:4='Hello',<'Hello'>,1:0]
[#1,6:10='World',<Name>,1:6]
[#2,12:12='!',<'!'>,1:12]
[#3,13:12='<EOF>',<EOF>,1:13]
In my output, there are no numbers inside < >, which I believe should come from the HelloExample.tokens file, which contains:
Hello=1
Bang=2
Name=3
WS=4
'Hello'=1
'!'=2
I get no error information, and ANTLR seems to have generated all the files I need, so I don't know where to look to resolve this. Please help. I'm not sure if it's of use, but my working directory started with helloworld.greeting and HelloExample.g4, and it now contains:
helloworld.greeting
HelloExample.g4
HelloExample.interp
HelloExample.tokens
HelloExampleBaseListener.class
HelloExampleBaseListener.java
HelloExampleLexer.class
HelloExampleLexer.interp
HelloExampleLexer.java
HelloExampleLexer.tokens
HelloExampleListener.class
HelloExampleListener.java
HelloExampleParser$GreetingContext.class
HelloExampleParser.class
HelloExampleParser.java
As rici already pointed out in the comments, getting the symbolic token names instead of their numeric types in the token output is a feature and shouldn't worry you.
To get the (greeting Hello World !) output at the end, add the -tree flag after -tokens:
java -cp .:antlr-4.7.1-complete.jar org.antlr.v4.gui.TestRig HelloExample greeting helloworld.greeting -tokens -tree
I am trying to understand the result generated by the cTAKES parser, and there are certain points I can't make sense of.
The cTAKES parser is invoked via Tika-app, and we get the following result:
ctakes:AnatomicalSiteMention: liver:77:82:C1278929,C0023884
ctakes:ProcedureMention: CT scan:24:31:C0040405,C0040405,C0040405,C0040405
ctakes:ProcedureMention: CT:24:26:C0009244,C0009244,C0040405,C0040405,C0009244,C0009244,C0040405,C0009244,C0009244,C0009244,C0040405
ctakes:ProcedureMention: scan:27:31:C0034606,C0034606,C0034606,C0034606,C0441633,C0034606,C0034606,C0034606,C0034606,C0034606,C0034606
ctakes:RomanNumeralAnnotation: did:47:50:
ctakes:SignSymptomMention: lesions:62:69:C0221198,C0221198
ctakes:schema: coveredText:start:end:ontologyConceptArr
resourceName: sample
and the parsed document contains:
The patient underwent a CT scan in April which did not reveal lesions in his liver
I have the following questions:
Why is the UMLS ID repeated, as in ctakes:ProcedureMention: scan:27:31:C0009244,C0009244,C0040405,C0040405,C0009244,C0009244,C0040405,C0009244,C0009244,C0009244,C0040405? (The cTAKES configuration properties file has annotationProps=BEGIN,END,ONTOLOGY_CONCEPT_ARR.)
What does RomanNumeralAnnotation indicate?
In a concept unique identifier like C0040405, do the 7 digits have any meaning? How are they generated?
System information:
Apache tika 1.10
Apache ctakes 3.2.2
I have 2 JAR files coming from an SDK that I have to use.
Generation problem
I succeeded in generating the first .pas file, but Java2OP fails to generate the second .pas I need, with the message:
Access violation at address 0042AF4A in the 'Java2OP.exe' module. Read of address 09D00000
Could this be a common issue? There are no other hints about what causes the problem in the SDK .jar.
I'm using the Java2OP located in C:\Program Files (x86)\Embarcadero\Studio\18.0\bin\converters\java2op, but I first had to apply the solutions in Delphi 10.1 Berlin - Java2OP: class or interface expected before generating the 1st file.
Compilation problem
Anyway, I then tried to generate a .hpp file from the generated .pas.
I don't know much about Delphi. Does the problem come from the SDK itself, or from the generation of the .pas file?
1st issue solved
Java2OP included Androidapi.JNI.Java.Util and not Androidapi.JNI.JavaUtil. I had to import Androidapi.JNI.JavaUtil myself, even though it is present in the /Program Files (x86)/Embarcadero/... folders.
2nd issue
The same 4 compilation errors occur multiple times across the .pas file, in the parts using the word this.
Do I have to replace every use of this with Self?
Errors
E2023 Function needs result type: Line 4
E2147 Property 'this' does not exist in base class: Line 5
E2029 ',' or ':' expected but identifier 'read' found: Line 5
E2029 ',' or ':' expected but number found: Line 5
[JavaSignature('com/hsm/barcode/DecoderConfigValues$SymbologyFlags')]
JDecoderConfigValues_SymbologyFlags = interface(JObject)
['{BCF30FD2-B650-433C-8A4E-8B638A508487}']
function _Getthis$0: JDecoderConfigValues; cdecl;
property this$0: JDecoderConfigValues read _Getthis$0;
end;
[JavaSignature('com/hsm/barcode/ExposureValues$ExposureSettingsMinMax')]
JExposureValues_ExposureSettingsMinMax = interface(JObject)
['{A576F85F-A021-475C-9741-06D92DBC205F}']
function _Getthis$0: JExposureValues; cdecl;
property this$0: JExposureValues read _Getthis$0;
end;
I request your kind help in solving the "Java command failed!" error, which is thrown whenever I try to tag an Arabic corpus of 2 megabytes. I have searched the web and the Stanford POS tagger mailing list but did not find a solution. I read some posts on problems similar to this suggesting that memory had run out, but I am not sure that is the case: I still have 19GB of free memory. I have tried every solution offered, but the same error keeps showing.
I have average command of Python and good command of Linux. I am using Linux Mint 17 KDE 64-bit, Python 3.4, NLTK 3.0 alpha and the Stanford POS tagger model for Arabic. This is my code:
import nltk
from nltk.tag.stanford import POSTagger
arabic_postagger = POSTagger("/home/mohammed/postagger/models/arabic.tagger", "/home/mohammed/postagger/stanford-postagger.jar", encoding='utf-8')
print("Executing tag_corpus.py...\n")
# Import corpus file
print("Importing data...\n")
file = open("test.txt", 'r', encoding='utf-8').read()
text = file.strip()
print("Tagging the corpus. Please wait...\n")
tagged_corpus = arabic_postagger.tag(nltk.word_tokenize(text))
If the corpus size is less than 1MB (= 100,000 words), there is no error. But when I try to tag the 2MB corpus, the following error message is shown:
Traceback (most recent call last):
File "/home/mohammed/experiments/current/tag_corpus2.py", line 17, in <module>
tagged_lst = arabic_postagger.tag(nltk.word_tokenize(text))
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 59, in tag
return self.batch_tag([tokens])[0]
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 81, in batch_tag
stdout=PIPE, stderr=PIPE)
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/internals.py", line 171, in java
raise OSError('Java command failed!')
OSError: Java command failed!
I intend to tag 300 million words for my Ph.D. research project. If I keep tagging 100 thousand words at a time, I will have to repeat the task 3000 times. It will kill me!
I really appreciate your kind help.
After your import lines, add this line:
nltk.internals.config_java(options='-Xmx2G')
This increases the maximum heap size that Java allows the Stanford POS tagger to use. '-Xmx2G' (the flag is case-sensitive) raises the maximum allowable RAM from the default 512MB to 2GB.
See What are the Xms and Xmx parameters when starting JVMs? for more information
If you're interested in how to debug your code, read on.
So we see that the command fails when handling a huge amount of data, so the first thing to look at is how Java is initialized in NLTK before the Stanford tagger is called, from https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L19 :
from nltk.internals import find_file, find_jar, config_java, java, _java_options
We see that the nltk.internals package is handling the different Java configurations and parameters.
Then we take a look at https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L65 and we see that no value is set for Java's memory allocation.
In version 3.9.2, the StanfordTagger class constructor accepts a parameter called java_options which can be used to set the memory for the POSTagger and also the NERTagger.
E.g. pos_tagger = StanfordPOSTagger('models/english-bidirectional-distsim.tagger', path_to_jar='stanford-postagger-3.9.2.jar', java_options='-mx3g')
I found the answer by @alvas not to work for me, because StanfordTagger was overriding my memory setting with its built-in default of 1000m. Using nltk.internals.config_java after initializing StanfordPOSTagger might work, but I haven't tried that.
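If memory remains a constraint even with a larger heap, another option is to tag the corpus in fixed-size batches instead of one giant call. A minimal sketch (the function names and the tag_batch callback are placeholders; in practice tag_batch would wrap StanfordPOSTagger.tag):

```python
def chunk_tokens(tokens, size):
    """Yield successive fixed-size batches from a token list."""
    for i in range(0, len(tokens), size):
        yield tokens[i:i + size]

def tag_in_batches(tokens, tag_batch, size=100000):
    """Tag a large token list one manageable batch at a time."""
    tagged = []
    for batch in chunk_tokens(tokens, size):
        tagged.extend(tag_batch(batch))
    return tagged

# Usage with a dummy tagger standing in for StanfordPOSTagger.tag:
tokens = ["w%d" % i for i in range(250)]
result = tag_in_batches(tokens, lambda b: [(t, "NN") for t in b], size=100)
print(len(result))  # 250
```

This keeps each Java invocation small enough to fit in memory while still producing one combined list of tagged tokens.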
I'm trying to follow this tutorial to analyze Apache access log files using Pig:
http://venkatarun-n.blogspot.com/2013/01/analyzing-apache-logs-with-pig.html
And i'm stuck with this Pig script:
grpd = GROUP logs BY DayExtractor(dt) as day;
When I execute that in the grunt terminal, I get the following error:
ERROR 1200: mismatched input 'as' expecting
SEMI_COLON Failed to parse: mismatched input 'as'
expecting SEMI_COLON
Function DayExtractor is defined from piggybank.jar in this manner:
DEFINE DayExtractor
org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
Ideas anyone?
I've been searching about this for a while. Any help would be greatly appreciated.
I am not sure how the author of the blog post got it to work, but as far as I know, you cannot use as in GROUP BY in Pig. Also, I don't think you can use UDFs in GROUP BY. Maybe the author had a different version of Pig that supported such operations. To get the same effect, you can split it into two steps:
logs_day = FOREACH logs GENERATE ....., DayExtractor(dt) as day;
grpd = GROUP logs_day BY day;