Cannot identify text in Spanish with LingPipe - java

A few days ago I started developing a Java server that stores a bunch of data and identifies its language, and I decided to use LingPipe for that task. But I am facing an issue: after training the classifier and evaluating it with two languages (English and Spanish), I found that I cannot identify Spanish text, although I got a successful result with English and French.
The tutorial I followed to complete this task is:
http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html
And these are the steps I took to complete the task:
Steps followed to train a Language Classifier
~1. First, place and unpack the English and Spanish metadata inside a folder named leipzig, as follows (note: the metadata and sentences come from http://wortschatz.uni-leipzig.de/en/download):
leipzig //Main folder
1M sentences //Folder with data of the last trial
eng_news_2015_1M
eng_news_2015_1M.tar.gz
spa-hn_web_2015_1M
spa-hn_web_2015_1M.tar.gz
ClassifyLang.java //Custom program to try the trained code
dist //Folder
eng_news_2015_300K.tar.gz //unpackaged english sentences
spa-hn_web_2015_300K.tar.gz //unpackaged spanish sentences
EvalLanguageId.java
langid-leipzig.classifier //trained code
lingpipe-4.1.2.jar
munged //Folder
eng //folder containing the sentences.txt for english
sentences.txt
spa //folder containing the sentences.txt for spanish
sentences.txt
Munge.java
TrainLanguageId.java
unpacked //Folder
eng_news_2015_300K //Folder with the english metadata
eng_news_2015_300K-co_n.txt
eng_news_2015_300K-co_s.txt
eng_news_2015_300K-import.sql
eng_news_2015_300K-inv_so.txt
eng_news_2015_300K-inv_w.txt
eng_news_2015_300K-sources.txt
eng_news_2015_300K-words.txt
sentences.txt
spa-hn_web_2015_300K //Folder with the spanish metadata
sentences.txt
spa-hn_web_2015_300K-co_n.txt
spa-hn_web_2015_300K-co_s.txt
spa-hn_web_2015_300K-import.sql
spa-hn_web_2015_300K-inv_so.txt
spa-hn_web_2015_300K-inv_w.txt
spa-hn_web_2015_300K-sources.txt
spa-hn_web_2015_300K-words.txt
~2. Second, unpack the compressed language metadata into the unpacked folder:
unpacked //Folder
eng_news_2015_300K //Folder with the english metadata
eng_news_2015_300K-co_n.txt
eng_news_2015_300K-co_s.txt
eng_news_2015_300K-import.sql
eng_news_2015_300K-inv_so.txt
eng_news_2015_300K-inv_w.txt
eng_news_2015_300K-sources.txt
eng_news_2015_300K-words.txt
sentences.txt
spa-hn_web_2015_300K //Folder with the spanish metadata
sentences.txt
spa-hn_web_2015_300K-co_n.txt
spa-hn_web_2015_300K-co_s.txt
spa-hn_web_2015_300K-import.sql
spa-hn_web_2015_300K-inv_so.txt
spa-hn_web_2015_300K-inv_w.txt
spa-hn_web_2015_300K-sources.txt
spa-hn_web_2015_300K-words.txt
~3. Then munge the sentences of each corpus to remove the line numbers and tabs and to replace line breaks with single space characters. The output is written uniformly in the UTF-8 encoding (note: Munge.java from the LingPipe site).
/-----------------Command line----------------------------------------------/
javac -cp lingpipe-4.1.2.jar: Munge.java
java -cp lingpipe-4.1.2.jar: Munge /home/samuel/leipzig/unpacked /home/samuel/leipzig/munged
----------------------------------------Results-----------------------------
spa
reading from=/home/samuel/leipzig/unpacked/spa-hn_web_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/spa/spa.txt charset=utf-8
total length=43267166
eng
reading from=/home/samuel/leipzig/unpacked/eng_news_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/eng/eng.txt charset=utf-8
total length=35847257
/---------------------------------------------------------------/
<---------------------------------Folder------------------------------------->
munged //Folder
eng //folder containing the sentences.txt for english
sentences.txt
spa //folder containing the sentences.txt for spanish
sentences.txt
<-------------------------------------------------------------------------->
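The munge transformation in step 3 can be sketched in plain Java. This is a minimal sketch of the same idea, not the tutorial's Munge.java; the exact stripping rules (a leading number plus tab per line) are assumptions based on the description above:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.stream.Collectors;

public class MungeSketch {

    // Strip the leading line number and tab from each Leipzig sentence line,
    // then join the sentences with single spaces.
    public static String munge(String raw) {
        return Arrays.stream(raw.split("\\R"))
                .map(line -> line.replaceFirst("^\\d+\\t", ""))
                .map(String::trim)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) throws Exception {
        Path in = Paths.get(args[0]);   // e.g. unpacked/.../sentences.txt
        Path out = Paths.get(args[1]);  // e.g. munged/spa/sentences.txt
        // Read with the Leipzig source encoding, write uniformly as UTF-8.
        String raw = new String(Files.readAllBytes(in), "ISO-8859-1");
        Files.write(out, munge(raw).getBytes(StandardCharsets.UTF_8));
    }
}
```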
~4. Next, we start training the language model (note: TrainLanguageId.java from the LingPipe LanguageId tutorial).
/---------------Command line--------------------------------------------/
javac -cp lingpipe-4.1.2.jar: TrainLanguageId.java
java -cp lingpipe-4.1.2.jar: TrainLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 5
-----------------------------------Results-----------------------------------
nGram=100000 numChars=5
Training category=eng
Training category=spa
Compiling model to file=/home/samuel/leipzig/langid-leipzig.classifier
/----------------------------------------------------------------------------/
~5. We evaluated the trained model, with the following results showing some issues in the confusion matrix (note: EvalLanguageId.java from the LingPipe LanguageId tutorial).
/------------------------Command line---------------------------------/
javac -cp lingpipe-4.1.2.jar: EvalLanguageId.java
java -cp lingpipe-4.1.2.jar: EvalLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 50 1000
-------------------------------Results-------------------------------------
Reading classifier from file=/home/samuel/leipzig/langid-leipzig.classifier
Evaluating category=eng
Evaluating category=spa
TEST RESULTS
BASE CLASSIFIER EVALUATION
Categories=[eng, spa]
Total Count=2000
Total Correct=1000
Total Accuracy=0.5
95% Confidence Interval=0.5 +/- 0.02191346617949794
Confusion Matrix
reference \ response
,eng,spa
eng,1000,0 <---------- not diagonal sampling
spa,1000,0
Macro-averaged Precision=NaN
Macro-averaged Recall=0.5
Macro-averaged F=NaN
Micro-averaged Results
the following symmetries are expected:
TP=TN, FN=FP
PosRef=PosResp=NegRef=NegResp
Acc=Prec=Rec=F
Total=4000
True Positive=1000
False Negative=1000
False Positive=1000
True Negative=1000
Positive Reference=2000
Positive Response=2000
Negative Reference=2000
Negative Response=2000
Accuracy=0.5
Recall=0.5
Precision=0.5
Rejection Recall=0.5
Rejection Precision=0.5
F(1)=0.5
Fowlkes-Mallows=2000.0
Jaccard Coefficient=0.3333333333333333
Yule's Q=0.0
Yule's Y=0.0
Reference Likelihood=0.5
Response Likelihood=0.5
Random Accuracy=0.5
Random Accuracy Unbiased=0.5
kappa=0.0
kappa Unbiased=0.0
kappa No Prevalence=0.0
chi Squared=0.0
phi Squared=0.0
Accuracy Deviation=0.007905694150420948
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence =0.0
Reference Entropy=1.0
Response Entropy=NaN
Cross Entropy=Infinity
Joint Entropy=1.0
Conditional Entropy=0.0
Mutual Information=0.0
Kullback-Liebler Divergence=Infinity
chi Squared=NaN
chi-Squared Degrees of Freedom=1
phi Squared=NaN
Cramer's V=NaN
lambda A=0.0
lambda B=NaN
ONE VERSUS ALL EVALUATIONS BY CATEGORY
CATEGORY[0]=eng VERSUS ALL
First-Best Precision/Recall Evaluation
Total=2000
True Positive=1000
False Negative=0
False Positive=1000
True Negative=0
Positive Reference=1000
Positive Response=2000
Negative Reference=1000
Negative Response=0
Accuracy=0.5
Recall=1.0
Precision=0.5
Rejection Recall=0.0
Rejection Precision=NaN
F(1)=0.6666666666666666
Fowlkes-Mallows=1414.2135623730949
Jaccard Coefficient=0.5
Yule's Q=NaN
Yule's Y=NaN
Reference Likelihood=0.5
Response Likelihood=1.0
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence=0.0
chi Squared=NaN
phi Squared=NaN
Accuracy Deviation=0.011180339887498949
CATEGORY[1]=spa VERSUS ALL
First-Best Precision/Recall Evaluation
Total=2000
True Positive=0
False Negative=1000
False Positive=0
True Negative=1000
Positive Reference=1000
Positive Response=0
Negative Reference=1000
Negative Response=2000
Accuracy=0.5
Recall=0.0
Precision=NaN
Rejection Recall=1.0
Rejection Precision=0.5
F(1)=NaN
Fowlkes-Mallows=NaN
Jaccard Coefficient=0.0
Yule's Q=NaN
Yule's Y=NaN
Reference Likelihood=0.5
Response Likelihood=0.0
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence=0.0
chi Squared=NaN
phi Squared=NaN
Accuracy Deviation=0.011180339887498949
/-----------------------------------------------------------------------/
~6. Then we tried a real evaluation with Spanish text:
/-------------------Command line----------------------------------/
javac -cp lingpipe-4.1.2.jar: ClassifyLang.java
java -cp lingpipe-4.1.2.jar: ClassifyLang
/-------------------------------------------------------------------------/
<---------------------------------Result------------------------------------>
Text: Yo soy una persona increíble y muy inteligente, me admiro a mi mismo lo que me hace sentir ansiedad de lo que viene, por que es algo grandioso lleno de cosas buenas y de ahora en adelante estaré enfocado y optimista aunque tengo que aclarar que no lo haré por querer algo, sino por que es mi pasión.
Best Language: eng <------------- Wrong Result
<----------------------------------------------------------------------->
Code for ClassifyLang.java:
import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.ConfusionMatrix;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.classify.JointClassification;
import com.aliasi.classify.JointClassifier;
import com.aliasi.classify.JointClassifierEvaluator;
import com.aliasi.classify.LMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.AbstractExternalizable;
import java.io.File;
import java.io.IOException;
import com.aliasi.util.Files;
public class ClassifyLang {

    public static String text = "Yo soy una persona increíble y muy inteligente, me admiro a mi mismo"
            + " estoy ansioso de lo que viene, por que es algo grandioso lleno de cosas buenas"
            + " y de ahora en adelante estaré enfocado y optimista"
            + " aunque tengo que aclarar que no lo haré por querer algo, sino por que no es difícil serlo. ";

    private static final File MODEL_DIR
            = new File("/home/samuel/leipzig/langid-leipzig.classifier");

    public static void main(String[] args)
            throws ClassNotFoundException, IOException {
        System.out.println("Text: " + text);
        LMClassifier classifier = null;
        try {
            classifier = (LMClassifier) AbstractExternalizable.readObject(MODEL_DIR);
        } catch (IOException | ClassNotFoundException ex) {
            System.out.println("Problem with the Model");
            return; // without this, classifier would be null below
        }
        Classification classification = classifier.classify(text);
        String bestCategory = classification.bestCategory();
        System.out.println("Best Language: " + bestCategory);
    }
}
~7. I tried with the 1M metadata files, but got the same result; changing the n-gram number also gave the same results.
I will be very thankful for your help.

Well, after days working on natural language processing, I found a way to determine the language of a text using OpenNLP.
Here is the Sample Code:
https://github.com/samuelchapas/languagePredictionOpenNLP/tree/master/TrainingLanguageDecOpenNLP
and here is the training corpus for the model created to make language predictions.
I decided to use OpenNLP for the issue described in this question; this library really has a complete stack of functionality.
Here is the sample for model training:
https://mega.nz/#F!HHYHGJ4Q!PY2qfbZr-e0w8tg3cUgAXg
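For reference, once a model has been trained with the code in the repository above, prediction with OpenNLP's language detector looks roughly like this. This is a sketch assuming OpenNLP 1.8.3 or later; the model file name langdetect.bin is a placeholder for whatever the training step produced:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class DetectLanguage {
    public static void main(String[] args) throws Exception {
        // "langdetect.bin" is a placeholder for the trained model file.
        try (InputStream in = new FileInputStream("langdetect.bin")) {
            LanguageDetectorModel model = new LanguageDetectorModel(in);
            LanguageDetectorME detector = new LanguageDetectorME(model);
            // predictLanguage returns the best-scoring language with a confidence.
            Language best = detector.predictLanguage(
                    "Yo soy una persona increíble y muy inteligente.");
            System.out.println(best.getLang() + " " + best.getConfidence());
        }
    }
}
```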

Related

Supported Locales - ga_IE

While setting the locale for the Google Sheets API, it throws the following error:
Invalid requests[0].updateSpreadsheetProperties: Unsupported locale: ga_IE", "status" : "INVALID_ARGUMENT"
Reviewing the API docs, it seems that not all locales are supported.
The locale of the spreadsheet in one of the following formats:
an ISO 639-1 language code such as en
an ISO 639-2 language code such as fil, if no 639-1 code exists
a combination of the ISO language code and country code, such as en_US
Note: when updating this field, not all locales/languages are supported.
Where can I find the list of supported locale?
As the passage you quote from Spreadsheet Properties says, ISO 639-1 codes are preferred in the first instance, ISO 639-2 codes are used when no ISO 639-1 code exists, and, if no code exists for a given language in those ISOs, the language_COUNTRY combination is used. This latter case varies depending on the context. I assume that your code lies in one of the ISO 639-1/2 lists, so here you have the full lists:
ISO 639-1
Language 639-1 code
Abkhazian ab
Afar aa
Afrikaans af
Akan ak
Albanian sq
Amharic am
Arabic ar
Aragonese an
Armenian hy
Assamese as
Avaric av
Avestan ae
Aymara ay
Azerbaijani az
Bambara bm
Bashkir ba
Basque eu
Belarusian be
Bengali bn
Bihari languages bh
Bislama bi
Bosnian bs
Breton br
Bulgarian bg
Burmese my
Catalan, Valencian ca
Chamorro ch
Chechen ce
Chichewa, Chewa, Nyanja ny
Chinese zh
Chuvash cv
Cornish kw
Corsican co
Cree cr
Croatian hr
Czech cs
Danish da
Divehi, Dhivehi, Maldivian dv
Dutch, Flemish nl
Dzongkha dz
English en
Esperanto eo
Estonian et
Ewe ee
Faroese fo
Fijian fj
Finnish fi
French fr
Fulah ff
Galician gl
Georgian ka
German de
Greek, Modern (1453-) el
Guarani gn
Gujarati gu
Haitian, Haitian Creole ht
Hausa ha
Hebrew he
Herero hz
Hindi hi
Hiri Motu ho
Hungarian hu
Interlingua(International Auxiliary Language Association) ia
Indonesian id
Interlingue, Occidental ie
Irish ga
Igbo ig
Inupiaq ik
Ido io
Icelandic is
Italian it
Inuktitut iu
Japanese ja
Javanese jv
Kalaallisut, Greenlandic kl
Kannada kn
Kanuri kr
Kashmiri ks
Kazakh kk
Central Khmer km
Kikuyu, Gikuyu ki
Kinyarwanda rw
Kirghiz, Kyrgyz ky
Komi kv
Kongo kg
Korean ko
Kurdish ku
Kuanyama, Kwanyama kj
Latin la
Luxembourgish, Letzeburgesch lb
Ganda lg
Limburgan, Limburger, Limburgish li
Lingala ln
Lao lo
Lithuanian lt
Luba-Katanga lu
Latvian lv
Manx gv
Macedonian mk
Malagasy mg
Malay ms
Malayalam ml
Maltese mt
Maori mi
Marathi mr
Marshallese mh
Mongolian mn
Nauru na
Navajo, Navaho nv
North Ndebele nd
Nepali ne
Ndonga ng
Norwegian Bokmål nb
Norwegian Nynorsk nn
Norwegian no
Sichuan Yi, Nuosu ii
South Ndebele nr
Occitan oc
Ojibwa oj
Church Slavic, Old Slavonic, Church Slavonic, Old Bulgarian,Old Church Slavonic cu
Oromo om
Oriya or
Ossetian, Ossetic os
Punjabi, Panjabi pa
Pali pi
Persian fa
Polish pl
Pashto, Pushto ps
Portuguese pt
Quechua qu
Romansh rm
Rundi rn
Romanian, Moldavian, Moldovan ro
Russian ru
Sanskrit sa
Sardinian sc
Sindhi sd
Northern Sami se
Samoan sm
Sango sg
Serbian sr
Gaelic, Scottish Gaelic gd
Shona sn
Sinhala, Sinhalese si
Slovak sk
Slovenian sl
Somali so
Southern Sotho st
Spanish, Castilian es
Sundanese su
Swahili sw
Swati ss
Swedish sv
Tamil ta
Telugu te
Tajik tg
Thai th
Tigrinya ti
Tibetan bo
Turkmen tk
Tagalog tl
Tswana tn
Tonga(Tonga Islands) to
Turkish tr
Tsonga ts
Tatar tt
Twi tw
Tahitian ty
Uighur, Uyghur ug
Ukrainian uk
Urdu ur
Uzbek uz
Venda ve
Vietnamese vi
Volapük vo
Walloon wa
Welsh cy
Wolof wo
Western Frisian fy
Xhosa xh
Yiddish yi
Yoruba yo
Zhuang, Chuang za
Zulu zu
ISO 639-2 not covered by ISO 639-1
Language ISO 639-2
Achinese ace
Acoli ach
Adangme ada
Adyghe; Adygei ady
Afro-Asiatic languages afa
Afrihili afh
Ainu ain
Akkadian akk
Aleut ale
Algonquian languages alg
Southern Altai alt
English, Old(ca.450–1100) ang
Angika anp
Apache languages apa
Official Aramaic(700–300 BCE);Imperial Aramaic(700–300 BCE) arc
Mapudungun;Mapuche arn
Arapaho arp
Artificial languages art
Arawak arw
Asturian;Bable;Leonese;Asturleonese ast
Athapascan languages ath
Australian languages aus
Awadhi awa
Banda languages bad
Bamileke languages bai
Baluchi bal
Balinese ban
Basa bas
Baltic languages bat
Beja;Bedawiyet bej
Bemba bem
Berber languages ber
Bhojpuri bho
Bikol bik
Bini;Edo bin
Siksika bla
Bantu (Other) bnt
Braj bra
Batak languages btk
Buriat bua
Buginese bug
Blin; Bilin byn
Caddo cad
Central American Indian languages cai
Galibi Carib car
Caucasian languages cau
Cebuano ceb
Celtic languages cel
Chibcha chb
Chagatai chg
Chuukese chk
Mari chm
Chinook jargon chn
Choctaw cho
Chipewyan;Dene Suline chp
Cherokee chr
Cheyenne chy
Chamic languages cmc
Montenegrin cnr
Coptic cop
Creoles and pidgins, English-based cpe
Creoles and pidgins, French-based cpf
Creoles and pidgins, Portuguese-based cpp
Crimean Tatar; Crimean Turkish crh
Creoles and pidgins crp
Kashubian csb
Cushitic languages cus
Dakota dak
Dargwa dar
Land Dayak languages day
Delaware del
Slave (Athapascan) den
Dogrib dgr
Dinka din
Dogri doi
Dravidian languages dra
Lower Sorbian dsb
Duala dua
Dutch, Middle(ca. 1050–1350) dum
Dyula dyu
Efik efi
Egyptian (Ancient) egy
Ekajuk eka
Elamite elx
English, Middle(1100–1500) enm
Ewondo ewo
Fang fan
Fanti fat
Filipino;Pilipino fil
Finno-Ugrian languages fiu
Fon fon
French, Middle(ca. 1400–1600) frm
French, Old(842–ca. 1400) fro
Northern Frisian frr
Eastern Frisian frs
Friulian fur
Ga gaa
Gayo gay
Gbaya gba
Germanic languages gem
Geez gez
Gilbertese gil
German, Middle High(ca. 1050–1500) gmh
German, Old High(ca. 750–1050) goh
Gondi gon
Gorontalo gor
Gothic got
Grebo grb
Greek, Ancient(to 1453) grc
Swiss German;Alemannic;Alsatian gsw
Gwich'in gwi
Haida hai
Hawaiian haw
Hiligaynon hil
Himachali languages; Pahari languages him
Hittite hit
Hmong;Mong hmn
Upper Sorbian hsb
Hupa hup
Iban iba
Ijo languages ijo
Iloko ilo
Indic languages inc
Indo-European languages ine
Ingush inh
Iranian languages ira
Iroquoian languages iro
Lojban jbo
Judeo-Persian jpr
Judeo-Arabic jrb
Kara-Kalpak kaa
Kabyle kab
Kachin;Jingpho kac
Kamba kam
Karen languages kar
Kawi kaw
Kabardian kbd
Khasi kha
Khoisan languages khi
Khotanese;Sakan kho
Kimbundu kmb
Konkani kok
Kosraean kos
Kpelle kpe
Karachay-Balkar krc
Karelian krl
Kru languages kro
Kurukh kru
Kumyk kum
Kutenai kut
Ladino lad
Lahnda lah
Lamba lam
Lezghian lez
Mongo lol
Lozi loz
Luba-Lulua lua
Luiseno lui
Lunda lun
Luo (Kenya and Tanzania) luo
Lushai lus
Madurese mad
Magahi mag
Maithili mai
Makasar mak
Mandingo man
Austronesian languages map
Masai mas
Moksha mdf
Mandar mdr
Mende men
Irish, Middle(900–1200) mga
Mi'kmaq;Micmac mic
Minangkabau min
Uncoded languages mis
Mon-Khmer languages mkh
Manchu mnc
Manipuri mni
Manobo languages mno
Mohawk moh
Mossi mos
Multiple languages mul
Munda languages mun
Creek mus
Mirandese mwl
Marwari mwr
Mayan languages myn
Erzya myv
Nahuatl languages nah
North American Indian languages nai
Neapolitan nap
Low German; Low Saxon; German, Low; Saxon, Low nds
Nepal Bhasa;Newari new
Nias nia
Niger-Kordofanian languages nic
Niuean niu
Nogai nog
Norse, Old non
N'Ko nqo
Pedi;Sepedi;Northern Sotho nso
Nubian languages nub
Classical Newari;Old Newari;Classical Nepal Bhasa nwc
Nyamwezi nym
Nyankole nyn
Nyoro nyo
Nzima nzi
Osage osa
Turkish, Ottoman(1500–1928) ota
Otomian languages oto
Papuan languages paa
Pangasinan pag
Pahlavi pal
Pampanga;Kapampangan pam
Papiamento pap
Palauan pau
Persian, Old(ca. 600–400 B.C.) peo
Philippine languages phi
Phoenician phn
Pohnpeian pon
Prakrit languages pra
Provençal, Old(to 1500);Old Occitan (to 1500) pro
Reserved for local use qaa-qtz
Rajasthani raj
Rapanui rap
Rarotongan;Cook Islands Maori rar
Romance languages roa
Romany rom
Aromanian;Arumanian;Macedo-Romanian rup
Sandawe sad
Yakut sah
South American Indian (Other) sai
Salishan languages sal
Samaritan Aramaic sam
Sasak sas
Santali sat
Sicilian scn
Scots sco
Selkup sel
Semitic languages sem
Irish, Old(to 900) sga
Sign Languages sgn
Shan shn
Sidamo sid
Siouan languages sio
Sino-Tibetan languages sit
Slavic languages sla
Southern Sami sma
Sami languages smi
Lule Sami smj
Inari Sami smn
Skolt Sami sms
Soninke snk
Sogdian sog
Songhai languages son
Sranan Tongo srn
Serer srr
Nilo-Saharan languages ssa
Sukuma suk
Susu sus
Sumerian sux
Classical Syriac syc
Syriac syr
Tai languages tai
Timne tem
Tereno ter
Tetum tet
Tigre tig
Tiv tiv
Tokelau tkl
Klingon;tlhIngan-Hol tlh
Tlingit tli
Tamashek tmh
Tonga (Nyasa) tog
Tok Pisin tpi
Tsimshian tsi
Tumbuka tum
Tupi languages tup
Altaic languages tut
Tuvalu tvl
Tuvinian tyv
Udmurt udm
Ugaritic uga
Umbundu umb
Undetermined und
Vai vai
Votic vot
Wakashan languages wak
Walamo wal
Waray war
Washo was
Sorbian languages wen
Kalmyk;Oirat xal
Yao yao
Yapese yap
Yupik languages ypk
Zapotec zap
Blissymbols;Blissymbolics;Bliss zbl
Zenaga zen
Standard Moroccan Tamazight zgh
Zande languages znd
Zuni zun
No linguistic content; Not applicable zxx
Zaza;Dimili;Dimli;Kirdki;Kirmanjki;Zazaki zza
Just for clarification: you can see the full list of ISO 639 codes at the Library of Congress. As I said before, I assume your language is one of the former. If that is not the case, please ask for further help so I can assist you better.
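Since ga is in the ISO 639-1 list, you can also sanity-check a code programmatically before sending it to the API; the JDK ships the ISO 639-1 table. A minimal sketch; note it only verifies that the code exists in the standard, not that Sheets accepts it:

```java
import java.util.Arrays;
import java.util.Locale;

public class IsoCodeCheck {

    // True if the code is a two-letter ISO 639-1 language code known to the JDK.
    public static boolean isIso6391(String code) {
        return Arrays.asList(Locale.getISOLanguages()).contains(code);
    }

    public static void main(String[] args) {
        System.out.println(isIso6391("ga")); // Irish has an ISO 639-1 code
        System.out.println(new Locale("ga", "IE").getDisplayName(Locale.ENGLISH));
    }
}
```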

How to train Chunker in Opennlp?

I need to train the chunker in OpenNLP to classify training data as noun phrases. How do I proceed? The online documentation does not explain how to do it from within a program, without the command line. It says to use en-chunker.train, but how do you make that file?
EDIT: @Alaye
After running the code you gave in your answer, I get the following error that I cannot fix:
Indexing events using cutoff of 5
Computing event counts... done. 3 events
Dropped event B-NP:[w_2=bos, w_1=bos, w0=He, w1=reckons, w2=., w_1=bosw0=He, w0=Hew1=reckons, t_2=bos, t_1=bos, t0=PRP, t1=VBZ, t2=., t_2=bost_1=bos, t_1=bost0=PRP, t0=PRPt1=VBZ, t1=VBZt2=., t_2=bost_1=bost0=PRP, t_1=bost0=PRPt1=VBZ, t0=PRPt1=VBZt2=., p_2=bos, p_1=bos, p_2=bosp_1=bos, p_1=bost_2=bos, p_1=bost_1=bos, p_1=bost0=PRP, p_1=bost1=VBZ, p_1=bost2=., p_1=bost_2=bost_1=bos, p_1=bost_1=bost0=PRP, p_1=bost0=PRPt1=VBZ, p_1=bost1=VBZt2=., p_1=bost_2=bost_1=bost0=PRP, p_1=bost_1=bost0=PRPt1=VBZ, p_1=bost0=PRPt1=VBZt2=., p_1=bosw_2=bos, p_1=bosw_1=bos, p_1=bosw0=He, p_1=bosw1=reckons, p_1=bosw2=., p_1=bosw_1=bosw0=He, p_1=bosw0=Hew1=reckons]
Dropped event B-VP:[w_2=bos, w_1=He, w0=reckons, w1=., w2=eos, w_1=Hew0=reckons, w0=reckonsw1=., t_2=bos, t_1=PRP, t0=VBZ, t1=., t2=eos, t_2=bost_1=PRP, t_1=PRPt0=VBZ, t0=VBZt1=., t1=.t2=eos, t_2=bost_1=PRPt0=VBZ, t_1=PRPt0=VBZt1=., t0=VBZt1=.t2=eos, p_2=bos, p_1=B-NP, p_2=bosp_1=B-NP, p_1=B-NPt_2=bos, p_1=B-NPt_1=PRP, p_1=B-NPt0=VBZ, p_1=B-NPt1=., p_1=B-NPt2=eos, p_1=B-NPt_2=bost_1=PRP, p_1=B-NPt_1=PRPt0=VBZ, p_1=B-NPt0=VBZt1=., p_1=B-NPt1=.t2=eos, p_1=B-NPt_2=bost_1=PRPt0=VBZ, p_1=B-NPt_1=PRPt0=VBZt1=., p_1=B-NPt0=VBZt1=.t2=eos, p_1=B-NPw_2=bos, p_1=B-NPw_1=He, p_1=B-NPw0=reckons, p_1=B-NPw1=., p_1=B-NPw2=eos, p_1=B-NPw_1=Hew0=reckons, p_1=B-NPw0=reckonsw1=.]
Dropped event O:[w_2=He, w_1=reckons, w0=., w1=eos, w2=eos, w_1=reckonsw0=., w0=.w1=eos, t_2=PRP, t_1=VBZ, t0=., t1=eos, t2=eos, t_2=PRPt_1=VBZ, t_1=VBZt0=., t0=.t1=eos, t1=eost2=eos, t_2=PRPt_1=VBZt0=., t_1=VBZt0=.t1=eos, t0=.t1=eost2=eos, p_2B-NP, p_1=B-VP, p_2B-NPp_1=B-VP, p_1=B-VPt_2=PRP, p_1=B-VPt_1=VBZ, p_1=B-VPt0=., p_1=B-VPt1=eos, p_1=B-VPt2=eos, p_1=B-VPt_2=PRPt_1=VBZ, p_1=B-VPt_1=VBZt0=., p_1=B-VPt0=.t1=eos, p_1=B-VPt1=eost2=eos, p_1=B-VPt_2=PRPt_1=VBZt0=., p_1=B-VPt_1=VBZt0=.t1=eos, p_1=B-VPt0=.t1=eost2=eos, p_1=B-VPw_2=He, p_1=B-VPw_1=reckons, p_1=B-VPw0=., p_1=B-VPw1=eos, p_1=B-VPw2=eos, p_1=B-VPw_1=reckonsw0=., p_1=B-VPw0=.w1=eos]
Indexing... done.
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at opennlp.tools.ml.model.AbstractDataIndexer.sortAndMerge(AbstractDataIndexer.java:89)
at opennlp.tools.ml.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:105)
at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:74)
at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:91)
at opennlp.tools.ml.model.TrainUtil.train(TrainUtil.java:53)
at opennlp.tools.chunker.ChunkerME.train(ChunkerME.java:253)
at com.oracle.crm.nlp.CustomChunker2.main(CustomChunker2.java:91)
Sorting and merging events... Process exited with exit code 1.
(My en-chunker.train had only the first 2 and last line of your sample data set.)
Could you please tell me why this is happening and how to fix it?
EDIT2: I got the chunker to work; however, it gives an error when I change the sentence in the training set to any sentence other than the one in your answer. Can you tell me why that could be happening?
As described in the OpenNLP documentation, here is a sample sentence from the training data:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
This is how you make your en-chunker.train file, and you can create the corresponding .bin file using the CLI:
$ opennlp ChunkerTrainerME -model en-chunker.bin -lang en -data en-chunker.train -encoding
or using the API:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Objects;

import opennlp.tools.chunker.ChunkSample;
import opennlp.tools.chunker.ChunkSampleStream;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.chunker.DefaultChunkerContextGenerator;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentenceTrainer {

    public static void trainModel(String inputFile, String modelFile)
            throws IOException {
        Objects.requireNonNull(inputFile);
        Objects.requireNonNull(modelFile);

        // Read the training data from inputFile (not a hard-coded file name).
        MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(
                new File(inputFile));
        Charset charset = Charset.forName("UTF-8");
        ObjectStream<String> lineStream = new PlainTextByLineStream(factory, charset);
        ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(lineStream);

        ChunkerModel model;
        try {
            model = ChunkerME.train("en", sampleStream,
                    new DefaultChunkerContextGenerator(), TrainingParameters.defaultParams());
        } finally {
            sampleStream.close();
        }

        OutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null) {
                modelOut.close();
            }
        }
    }
}
and the main method will be:
public class Main {
public static void main(String args[]) throws IOException {
String inputFile = "//path//to//data.train";
String modelFile = "//path//to//.bin";
SentenceTrainer.trainModel(inputFile, modelFile);
}
}
Reference: this blog.
Hope this helps!
PS: collect/write the data as above in a .txt file and rename it with a .train extension (even trainingdata.txt will work); that is how you make a .train file.
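Once the .bin file exists, applying the trained chunker is short. A sketch using the standard ChunkerME API; the model file name en-chunker.bin matches the training command above, and the example tokens and tags are taken from the sample training data:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;

public class ChunkIt {
    public static void main(String[] args) throws Exception {
        // Load the model produced by the training step above.
        try (InputStream in = new FileInputStream("en-chunker.bin")) {
            ChunkerModel model = new ChunkerModel(in);
            ChunkerME chunker = new ChunkerME(model);
            // chunk() takes the tokens and their POS tags as parallel arrays
            // and returns one chunk tag (B-NP, I-NP, B-VP, ...) per token.
            String[] tokens = {"He", "reckons", "the", "current", "account", "deficit"};
            String[] tags   = {"PRP", "VBZ", "DT", "JJ", "NN", "NN"};
            String[] chunks = chunker.chunk(tokens, tags);
            for (int i = 0; i < chunks.length; i++) {
                System.out.println(tokens[i] + " " + tags[i] + " " + chunks[i]);
            }
        }
    }
}
```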

finding features from a large data set by stanford corenlp

I am new to Stanford NLP and cannot find any good, complete documentation or tutorials. My task is sentiment analysis. I have a very large dataset of product reviews, which I have already split into positive and negative according to the "stars" given by the users. Now I need to find the most frequent positive and negative adjectives to use as features for my algorithm. I understand how to do tokenization, lemmatization and POS tagging from here. I got files like this.
The review was
Don't waste your money. This is a short DVD and the host is boring and offers information that is common sense to any idiot. Pass on this and buy something else. Very generic
and the output was:
Sentence #1 (6 tokens):
Don't waste your money.
[Text=Do CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=VBP Lemma=do]
[Text=n't CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=RB Lemma=not]
[Text=waste CharacterOffsetBegin=6 CharacterOffsetEnd=11 PartOfSpeech=VB Lemma=waste]
[Text=your CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=PRP$ Lemma=you]
[Text=money CharacterOffsetBegin=17 CharacterOffsetEnd=22 PartOfSpeech=NN Lemma=money]
[Text=. CharacterOffsetBegin=22 CharacterOffsetEnd=23 PartOfSpeech=. Lemma=.]
Sentence #2 (21 tokens):
This is a short DVD and the host is boring and offers information that is common sense to any idiot.
[Text=This CharacterOffsetBegin=24 CharacterOffsetEnd=28 PartOfSpeech=DT Lemma=this]
[Text=is CharacterOffsetBegin=29 CharacterOffsetEnd=31 PartOfSpeech=VBZ Lemma=be]
[Text=a CharacterOffsetBegin=32 CharacterOffsetEnd=33 PartOfSpeech=DT Lemma=a]
[Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]
[Text=DVD CharacterOffsetBegin=40 CharacterOffsetEnd=43 PartOfSpeech=NN Lemma=dvd]
[Text=and CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=CC Lemma=and]
[Text=the CharacterOffsetBegin=48 CharacterOffsetEnd=51 PartOfSpeech=DT Lemma=the]
[Text=host CharacterOffsetBegin=52 CharacterOffsetEnd=56 PartOfSpeech=NN Lemma=host]
[Text=is CharacterOffsetBegin=57 CharacterOffsetEnd=59 PartOfSpeech=VBZ Lemma=be]
[Text=boring CharacterOffsetBegin=60 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=boring]
[Text=and CharacterOffsetBegin=67 CharacterOffsetEnd=70 PartOfSpeech=CC Lemma=and]
[Text=offers CharacterOffsetBegin=71 CharacterOffsetEnd=77 PartOfSpeech=VBZ Lemma=offer]
[Text=information CharacterOffsetBegin=78 CharacterOffsetEnd=89 PartOfSpeech=NN Lemma=information]
[Text=that CharacterOffsetBegin=90 CharacterOffsetEnd=94 PartOfSpeech=WDT Lemma=that]
[Text=is CharacterOffsetBegin=95 CharacterOffsetEnd=97 PartOfSpeech=VBZ Lemma=be]
[Text=common CharacterOffsetBegin=98 CharacterOffsetEnd=104 PartOfSpeech=JJ Lemma=common]
[Text=sense CharacterOffsetBegin=105 CharacterOffsetEnd=110 PartOfSpeech=NN Lemma=sense]
[Text=to CharacterOffsetBegin=111 CharacterOffsetEnd=113 PartOfSpeech=TO Lemma=to]
[Text=any CharacterOffsetBegin=114 CharacterOffsetEnd=117 PartOfSpeech=DT Lemma=any]
[Text=idiot CharacterOffsetBegin=118 CharacterOffsetEnd=123 PartOfSpeech=NN Lemma=idiot]
[Text=. CharacterOffsetBegin=123 CharacterOffsetEnd=124 PartOfSpeech=. Lemma=.]
Sentence #3 (8 tokens):
Pass on this and buy something else.
[Text=Pass CharacterOffsetBegin=125 CharacterOffsetEnd=129 PartOfSpeech=VB Lemma=pass]
[Text=on CharacterOffsetBegin=130 CharacterOffsetEnd=132 PartOfSpeech=IN Lemma=on]
[Text=this CharacterOffsetBegin=133 CharacterOffsetEnd=137 PartOfSpeech=DT Lemma=this]
[Text=and CharacterOffsetBegin=138 CharacterOffsetEnd=141 PartOfSpeech=CC Lemma=and]
[Text=buy CharacterOffsetBegin=142 CharacterOffsetEnd=145 PartOfSpeech=VB Lemma=buy]
[Text=something CharacterOffsetBegin=146 CharacterOffsetEnd=155 PartOfSpeech=NN Lemma=something]
[Text=else CharacterOffsetBegin=156 CharacterOffsetEnd=160 PartOfSpeech=RB Lemma=else]
[Text=. CharacterOffsetBegin=160 CharacterOffsetEnd=161 PartOfSpeech=. Lemma=.]
Sentence #4 (2 tokens):
Very generic
[Text=Very CharacterOffsetBegin=162 CharacterOffsetEnd=166 PartOfSpeech=RB Lemma=very]
[Text=generic CharacterOffsetBegin=167 CharacterOffsetEnd=174 PartOfSpeech=JJ Lemma=generic]
I have already processed 10000 positive and 10000 negative files like this. Now how can I easily find the most frequent positive and negative features (adjectives)? Do I need to read all the output (processed) files and build a frequency count of the adjectives like this, or is there an easier way with Stanford CoreNLP?
Here is an example of processing an annotated review and storing the adjectives in a Counter.
In the example the movie review "The movie was great! It was a great film." has a sentiment of "positive".
I would suggest altering my code to load in each file and build an Annotation with the file's text and recording the sentiment for that file.
Then you can go through each file and build up a Counter with positive and negative counts for each adjective.
The final Counter has the adjective "great" with a count of 2.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class AdjectiveSentimentExample {

    public static void main(String[] args) throws Exception {
        Counter<String> adjectivePositiveCounts = new ClassicCounter<String>();
        Counter<String> adjectiveNegativeCounts = new ClassicCounter<String>();
        // one annotated review with a known sentiment label
        Annotation review = new Annotation("The movie was great! It was a great film.");
        String sentiment = "positive";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(review);
        for (CoreMap sentence : review.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel cl : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // "JJ" matches base-form adjectives only; check "JJR"/"JJS" too
                // if you also want comparatives and superlatives
                if (cl.get(CoreAnnotations.PartOfSpeechAnnotation.class).equals("JJ")) {
                    if (sentiment.equals("positive")) {
                        adjectivePositiveCounts.incrementCount(cl.word());
                    } else if (sentiment.equals("negative")) {
                        adjectiveNegativeCounts.incrementCount(cl.word());
                    }
                }
            }
        }
        System.out.println("---");
        System.out.println("positive adjective counts");
        System.out.println(adjectivePositiveCounts);
    }
}
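Once the per-file counts are collected, pulling out the most frequent adjectives is just a sort over the entries. A plain Map<String, Integer> works as well as a Counter for this; the sketch below uses only the standard library, and the class and method names (TopAdjectives, topN) are my own, not part of CoreNLP:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopAdjectives {

    // Return the n most frequent keys, highest count first.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Integer> positive = new HashMap<>();
        positive.put("great", 2);
        positive.put("good", 1);
        System.out.println(topN(positive, 1)); // prints [great]
    }
}
```

The same call works for the negative counts; with two maps you get the top positive and top negative features separately.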

Tuprolog and defining infix operators

So I have some prolog...
cobrakai$more operator.pl
be(a,c).
:-op(35,xfx,be).
+=(a,c).
:-op(35,xfx,+=).
cobrakai$
Which defines some infix operators. I run it using SWI prolog and get the following (perfectly expected) results
?- halt.
cobrakai$swipl -s operator.pl
% library(swi_hooks) compiled into pce_swi_hooks 0.00 sec, 3,992 bytes
% /Users/josephreddington/Documents/workspace/com.plancomps.prolog.helloworld/operator.pl compiled 0.00 sec, 992 bytes
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 5.10.5)
Copyright (c) 1990-2011 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.
For help, use ?- help(Topic). or ?- apropos(Word).
?- be(a,c).
true.
?- a be c.
true.
?- +=(a,c).
ERROR: toplevel: Undefined procedure: (+=)/2 (DWIM could not correct goal)
?- halt.
cobrakai$swipl -s operator.pl
% library(swi_hooks) compiled into pce_swi_hooks 0.00 sec, 3,992 bytes
% /Users/josephreddington/Documents/workspace/com.plancomps.prolog.helloworld/operator.pl compiled 0.00 sec, 1,280 bytes
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 5.10.5)
Copyright (c) 1990-2011 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.
For help, use ?- help(Topic). or ?- apropos(Word).
?- be(a,c).
true.
?- a be c.
true.
?- +=(a,c).
true.
?- a += c.
true.
?- halt.
However, when I use Tuprolog to process the same file from Java (using the following code)
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import alice.tuprolog.Prolog;
import alice.tuprolog.SolveInfo;
import alice.tuprolog.Theory;

public class Testinfixoperatorconstruction {

    public static void main(String[] args) throws Exception {
        Prolog engine = new Prolog();
        engine.loadLibrary("alice.tuprolog.lib.DCGLibrary");
        engine.addTheory(new Theory(readFile(
                "/Users/josephreddington/Documents/workspace/com.plancomps.prolog.helloworld/operator.pl")));
        SolveInfo info = engine.solve("be(a,c).");
        System.out.println(info.getSolution());
        info = engine.solve("a be c.");
        System.out.println(info.getSolution());
    }

    private static String readFile(String file) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(file));
        try {
            String line;
            StringBuilder stringBuilder = new StringBuilder();
            String ls = System.getProperty("line.separator");
            while ((line = reader.readLine()) != null) {
                stringBuilder.append(line);
                stringBuilder.append(ls);
            }
            return stringBuilder.toString();
        } finally {
            reader.close(); // avoid leaking the file handle
        }
    }
}
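As an aside, the readFile helper can be collapsed to a one-liner with java.nio (Java 7+); a minimal sketch, with the class name ReadFileNio and the sample path being my own for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadFileNio {

    // Read a whole file into a String, decoding as UTF-8.
    static String readFile(String file) throws IOException {
        return new String(Files.readAllBytes(Paths.get(file)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readFile("operator.pl")); // hypothetical path
    }
}
```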
The prolog file does not parse - failing on the '+=' token.
Exception in thread "main" alice.tuprolog.InvalidTheoryException: Unexpected token '+='
at alice.tuprolog.TheoryManager.consult(TheoryManager.java:193)
at alice.tuprolog.Prolog.addTheory(Prolog.java:242)
at Testinfixoperatorconstruction.main(Testinfixoperatorconstruction.java:14)
We can try a slightly different approach, adding the operator directly in the java code with...
public static void main(String[] args) throws Exception {
    Prolog engine = new Prolog();
    engine.loadLibrary("alice.tuprolog.lib.DCGLibrary");
    engine.getOperatorManager().opNew("be", "xfx", 35);
    engine.getOperatorManager().opNew("+=", "xfx", 35);
    engine.addTheory(new Theory(readFile(
            "/Users/josephreddington/Documents/workspace/com.plancomps.prolog.helloworld/operator2.pl")));
    SolveInfo info = engine.solve("be(a,c).");
    System.out.println(info.getSolution());
    info = engine.solve("a be c.");
    System.out.println(info.getSolution());
}
but we get the same error... :(
Can anyone tell me why this is happening? (and solutions would also be welcome).
SWI-Prolog may be too permissive when parsing directives. Try enclosing the operator in parentheses:
:- op(35, xfx, (+=)).
Edit: I tried using 2p.jar, which allowed me to spot the problem. You need to quote the operator's atom:
:- op(35, xfx, '+=').
X += Y.
p :- a += b.
The interactive 2p console accepts this syntax. Note that 2p.jar loads the tuProlog libraries by default.

Precision recall in lucene java

I want to use Lucene to calculate Precision and Recall.
I did these steps:
Made some index files. To do this I used the indexer code to index the .txt files in C:/inn (there are 4 text files in this folder) and wrote the index to the "outt" folder by setting the index path to C:/outt in the Indexer code.
Created a package called lia.benchmark and a class inside it called "PrecisionRecall", and added the external jars (right-click --> Java build path --> add external jars) lucene-benchmark-3.2.0.jar and lucene-core-3.3.0.jar
Set the topics file path in the code to C:/lia2e/src/lia/benchmark/topics.txt, the qrels file to C:/lia2e/src/lia/benchmark/qrels.txt, and dir to "C:/outt".
Here is code:
package lia.benchmark;

import java.io.File;
import java.io.PrintWriter;
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.benchmark.quality.*;
import org.apache.lucene.benchmark.quality.utils.*;
import org.apache.lucene.benchmark.quality.trec.*;

public class PrecisionRecall {

    public static void main(String[] args) throws Throwable {
        File topicsFile = new File("C:/lia2e/src/lia/benchmark/topics.txt");
        File qrelsFile = new File("C:/lia2e/src/lia/benchmark/qrels.txt");
        Directory dir = FSDirectory.open(new File("C:/outt"));
        IndexSearcher searcher = new IndexSearcher(dir, true);
        String docNameField = "filename";
        PrintWriter logger = new PrintWriter(System.out, true);
        TrecTopicsReader qReader = new TrecTopicsReader();
        QualityQuery qqs[] = qReader.readQueries(
                new BufferedReader(new FileReader(topicsFile)));
        Judge judge = new TrecJudge(new BufferedReader(
                new FileReader(qrelsFile)));
        judge.validateData(qqs, logger);
        QualityQueryParser qqParser = new SimpleQQParser("title", "contents");
        QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, docNameField);
        SubmissionReport submitLog = null;
        QualityStats stats[] = qrun.execute(judge, submitLog, logger);
        QualityStats avg = QualityStats.average(stats);
        avg.log("SUMMARY", 2, logger, "  ");
        dir.close();
    }
}
Initialized qrels and topics. In the documents folder (C:\inn) I have 4 txt files, 2 of which are relevant to my query (the query is "apple"), so I filled in qrels and topics.
The qrels file looks like this:
<top>
<num> Number: 0
<title> apple
<desc> Description:
<narr> Narrative:
</top>
and the topics file like this:
0 0 789.txt 1
0 0 101.txt 1
I also tried the path format, for example "C:\inn\789.txt" instead of "789.txt",
but the results are zero:
0 - contents:apple
0 Stats:
Search Seconds: 0.016
DocName Seconds: 0.000
Num Points: 2.000
Num Good Points: 0.000
Max Good Points: 2.000
Average Precision: 0.000
MRR: 0.000
Recall: 0.000
Precision At 1: 0.000
SUMMARY
Search Seconds: 0.016
DocName Seconds: 0.000
Num Points: 2.000
Num Good Points: 0.000
Max Good Points: 2.000
Average Precision: 0.000
MRR: 0.000
Recall: 0.000
Precision At 1: 0.000
Can you tell me what is wrong?
I really need to know why the results are zero.
I'm afraid that the qrels.txt format is wrong: the javadoc suggests the following:
Expected input format:
qnum 0 doc-name is-relevant
Two sample lines:
19 0 doc303 1
19 0 doc7295 0
(I know it's 2.3.0 javadoc, but the format wasn't changed in 3.0)
So it seems that you've swapped the files: TrecTopicsReader expects what you have in qrels.txt; TrecJudge expects what you have in topics.txt.
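To make that concrete with the files from the question, swapping the two contents back should give something like the following (using the same query number, titles, and judgments the question already shows):

```
topics.txt:
<top>
<num> Number: 0
<title> apple
<desc> Description:
<narr> Narrative:
</top>

qrels.txt:
0 0 789.txt 1
0 0 101.txt 1
```

TrecTopicsReader then reads the <top> block from topics.txt, and TrecJudge reads the per-document judgments from qrels.txt.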
