Classification using the Bernoulli classifier in LingPipe - java

I want to classify my data with the Bernoulli classifier in LingPipe.
If someone has a working method for it, please share.

http://java2s.com/Open-Source/Java/Natural-Language-Processing/LingPipe/com/aliasi/test/unit/classify/BernoulliClassifierTest.java.htm
I.e.:

FeatureExtractor FEATURE_EXTRACTOR =
    new TokenFeatureExtractor(IndoEuropeanTokenizerFactory.INSTANCE);
...
BernoulliClassifier classifier =
    new BernoulliClassifier(FEATURE_EXTRACTOR);
Then use handle() to add training data, and classify() to get answers out.
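Putting that together, here's a minimal end-to-end sketch (assuming LingPipe 4's handle(Classified) signature; the example sentences and category names are made up):

import com.aliasi.classify.BernoulliClassifier;
import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenFeatureExtractor;
import com.aliasi.util.FeatureExtractor;

public class BernoulliDemo {
    public static void main(String[] args) {
        FeatureExtractor<CharSequence> featureExtractor =
            new TokenFeatureExtractor(IndoEuropeanTokenizerFactory.INSTANCE);
        BernoulliClassifier<CharSequence> classifier =
            new BernoulliClassifier<CharSequence>(featureExtractor);

        // handle() takes a training example wrapped together with its category.
        classifier.handle(new Classified<CharSequence>(
            "the pitcher threw a strike", new Classification("sports")));
        classifier.handle(new Classified<CharSequence>(
            "parliament debated the budget", new Classification("politics")));

        // classify() scores a new input against the trained categories.
        System.out.println(
            classifier.classify("he hit a home run").bestCategory());
    }
}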
To find this I googled "Bernoulli classifier in Lingpipe" (without the quotes). I found the API docs, saw there was no example usage, and that the docs are poor quality. So I guessed there might be a unit test, as Java programmers are quite fastidious about testing, and then googled "bernoulliClassifier lingpipe test" (again, without the quotes).
(By "poor quality" docs, I mean the function descriptions just repeat the function names and add no information.)

Related

How to create a simple Italian Model for a Named Entity Extraction of Persons using OpenNLP?

I have to do a project with OpenNLP, strictly in the Italian language. Since it's almost impossible to find existing resources for this language, my idea is to create a simple model myself. From reading some posts on this platform, my plan is to try to do this using the model-builder addon.
First of all, is it possible to achieve my goal with this addon?
If so, referring to this other post, what kind of file is meant by "modelOutFile"? In my case I don't have an existing model.
N.B.: the addon uses some deprecated functions (such as nameFinderME.train()).
Naively, I tried to pass a simple empty file "model.bin" as the "modelOutFile", but of course I ran into an error:
Cannot invoke "java.util.Properties.getProperty(String)" because "manifest" is null
Furthermore, I used only a few names and sentences for the test (I only wanted to know whether it worked), not the large amount recommended (at least 15,000 sentences).
I'm open to suggestions other than the model-builder addon.
Hope someone can help me.
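In case it helps, one alternative to the addon is to call OpenNLP's current training API directly. The following is only a sketch, assuming OpenNLP 1.8+ and a hypothetical training file it-persons.train in the standard <START:person> ... <END> per-sentence format:

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainItalianPersonModel {
    public static void main(String[] args) throws IOException {
        // Each line is one sentence with entities marked inline, e.g.:
        // Oggi <START:person> Mario Rossi <END> è andato a Roma .
        ObjectStream<String> lines = new PlainTextByLineStream(
            new MarkableFileInputStreamFactory(new File("it-persons.train")),
            StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        TokenNameFinderModel model = NameFinderME.train(
            "it", "person", samples,
            TrainingParameters.defaultParams(), new TokenNameFinderFactory());

        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("it-ner-person.bin"))) {
            model.serialize(out);
        }
    }
}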

Exact Dictionary based Named Entity Recognition with Stanford

I have a dictionary of named entities extracted from Wikipedia. I want to use it as the dictionary for an NER. I wanted to know how I can use Stanford NER with this data of mine.
I have also downloaded LingPipe, although I have no idea how to use it. I would appreciate any kind of information.
Thanks for your help.
You can use dictionary (or regular expression-based) named entity recognition with Stanford CoreNLP. See the RegexNER annotator. For some applications, we run this with quite large dictionaries of entities. Nevertheless, for us this is typically a secondary tool to using statistical (CRF-based) NER.
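As a sketch of how that fits together (the mapping file name below is hypothetical; a TokensRegexNER mapping file has one phrase, a tab, and an entity type per line):

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class DictionaryNer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner");
        // wiki-entities.tab holds entries like: Kuala Lumpur<TAB>LOCATION
        props.setProperty("regexner.mapping", "wiki-entities.tab");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("We had dinner in Kuala Lumpur.");
        pipeline.annotate(doc);
        // Print each token with the entity tag assigned to it.
        for (CoreLabel token : doc.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word() + "\t"
                + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
        }
    }
}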
Stanford NER is based on CRFs, which is a statistical model. I'm afraid it doesn't support an extra dictionary or lexicon. However, you can train a new model for your own task.
You can use MER (http://labs.fc.ul.pt/mer/), a minimal entity recognizer developed in bash (https://github.com/lasigeBioTM/MER) that only requires a lexicon (text file) as input.

How to find word synonyms using GATE ANNIE?

I have been working on information extraction and was able to run StandAloneAnnie.java:
http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/StandAloneAnnie.java
My question is: how can I use GATE ANNIE to get similar words? E.g. if I input (dine), can I get a result like (food, eat, dinner, restaurant)?
More Information:
I am doing a project where I was assigned to develop a simple webpage that takes user input and passes it to GATE components, which tokenize the query and return a semantic grouping for each phrase in order to make some recommendations.
For example, a user would enter "I want to have dinner in Kuala Lumpur" and the system would break it down to (Search for: dinner - Required: restaurant, dinner, eat, food - Location: Kuala Lumpur).
ANNIE by default has about 15 annotation types; see the demo:
http://services.gate.ac.uk/annie/
I have already implemented everything as in the demo, but my question is: can I do that using GATE ANNIE? I mean, is it possible to find word synonyms or group words based on their type (nouns, verbs)?
Plain vanilla ANNIE doesn't support this kind of thing but there are third party plugins such as Phil Gooch's WordNet Suggester that might help. Or if your domain is fairly restricted you might get better results with less effort by simply creating your own gazetteer lists of related terms and a few simple JAPE rules. You may find the training materials available on the GATE Wiki useful if you haven't done much of this before.

Are there APIs for text analysis/mining in Java? [closed]

I want to know if there is an API to do text analysis in Java. Something that can extract all words in a text, separate words, expressions, etc. Something that can tell whether an extracted word is a number, date, year, name, currency, etc.
I'm starting text analysis now, so I only need an API to kick off. I made a web crawler; now I need something to analyze the downloaded data. I need methods to count the number of words in a page, find similar words, detect data types, and other resources related to the text.
Are there APIs for text analysis in Java?
EDIT: Text mining; I want to mine the text. An API for Java that provides this.
It looks like you're looking for a Named Entity Recogniser.
You have a couple of choices.
CRFClassifier from the Stanford Natural Language Processing Group is a Java implementation of a Named Entity Recogniser.
GATE (General Architecture for Text Engineering), an open source suite for language processing. Take a look at the screenshots at the page for developers: http://gate.ac.uk/family/developer.html. It should give you a brief idea what this can do. The video tutorial gives you a better overview of what this software has to offer.
You may need to customise one of them to fit your needs.
You also have other options:
simple text extraction via Web services: e.g. Tagthe.net and Yahoo's Term Extractor.
part-of-speech (POS) tagging: extracting part-of-speech (e.g. verbs, nouns) from the text. Here is a post on SO: What is a good Java library for Parts-Of-Speech tagging?.
In terms of training for CRFClassifier, you can find a brief explanation in their FAQ:
...the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and has the NER class, and existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions...
You can also find a code snippet in the javadoc of CRFClassifier.
Typical command-line usage:
For running a trained model with a provided serialized classifier on a text file:
java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a simple NER model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
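If you'd rather call the classifier from Java code than from the command line, here is a minimal sketch (assuming the pre-trained 3-class English model that ships with Stanford NER; treat the model file name as an assumption):

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NerDemo {
    public static void main(String[] args) throws Exception {
        // Load a serialized, pre-trained model from disk.
        CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
            "english.all.3class.distsim.crf.ser.gz");
        String text = "Jim bought 300 shares of Acme Corp. in 2006.";
        // Print one token per line with its predicted entity tag.
        System.out.println(classifier.classifyToString(text, "tsv", false));
    }
}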
For example, you might use some classes from the standard library java.text, or use StreamTokenizer (which you can customize to your requirements). But as you know, text data from internet sources usually has many orthographical mistakes, and for better performance you need something like a fuzzy tokenizer; java.text and the other standard utilities have too-limited capabilities in that context.
So I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
P.S.
Depending on your needs, you could create a state-machine parser for recognizing templated parts in raw texts, and you can construct a more advanced parser that recognizes much more complex templates in text.
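As a toy illustration of that approach (the pattern and the type names are invented for the example), here is a small regex tokenizer that tags each token as a DATE, NUMBER, or WORD:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
    public static void main(String[] args) {
        // One alternative per token type, using named capture groups.
        Pattern token = Pattern.compile(
            "(?<DATE>\\d{4}-\\d{2}-\\d{2})"
            + "|(?<NUMBER>\\d+(?:\\.\\d+)?)"
            + "|(?<WORD>\\p{L}+)");
        Matcher m = token.matcher("On 2011-07-23 we crawled 42 pages");
        while (m.find()) {
            String type = m.group("DATE") != null ? "DATE"
                        : m.group("NUMBER") != null ? "NUMBER" : "WORD";
            System.out.println(type + "\t" + m.group());
        }
    }
}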
If you're dealing with large amounts of data, maybe Apache's Lucene will help with what you need.
Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you can control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
I recommend looking at LingPipe too. If you are OK with web services, then this article has a good summary of different APIs.
I'd rather adapt Lucene's Analysis and Stemmer classes than reinvent the wheel; they cover the vast majority of cases. See also the additional and contrib classes.

How to webscrape scholar.google.com in Java?

I want to write a Java function grabTopResults(String f) such that grabTopResults("automata theory") returns a list of the top 100 cited papers on scholar.google.com for "automata theory".
Does anyone have suggestions for which libraries will make my life easy?
Thanks!
As I'm sure Google can afford the bandwidth, I'll ignore the question of whether this is immoral/illegal/prohibited by Google's T&C.
The first thing you need to do is figure out what HTTP request (or requests) you need to issue in order to obtain the page with the data you need. Once you've figured this out, use HttpClient to issue the same request from Java code. The previous link shows example code that explains how to do this.
Once you've downloaded the content of the relevant page, you'll need to use an HTML parser to extract the data you're interested in. The Jericho parser suggested by peperg is a good choice.
If the Google police come knocking, you've never heard of me, OK?
I use Jericho: http://jericho.htmlparser.net/docs/index.html. Google Scholar doesn't have an API (http://code.google.com/p/google-ajax-apis/issues/detail?id=109). Of course, it is not allowed by Google (read the terms of use; automated requests are forbidden).
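A minimal sketch of that combination (the query URL and the h3 selector are illustrative, Scholar's markup changes over time, and as noted, automated requests are against Google's terms):

import java.net.URL;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

public class ScholarTitles {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the results page directly from the URL.
        Source source = new Source(new URL(
            "http://scholar.google.com/scholar?q=automata+theory"));
        // Result titles were rendered as h3 elements at the time of writing.
        for (Element h3 : source.getAllElements("h3")) {
            System.out.println(h3.getTextExtractor().toString());
        }
    }
}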
Below is a bit of example code which gets the titles on the first page using the open-source product TestPlan. It is a standalone product, but if you really need it I could help you integrate it into your Java code (it is written in Java itself).
GotoURL http://scholar.google.com/
SubmitForm with
    %Params:q% automate theory
end
set %Items% as response //div[@class='gs_r']
foreach %Item% in %Items%
    set %Title% as selectIn %Item% h3
    Notice %Title%
end
This produces output like the below (my IP is in Germany, thus a German response). Obviously you could format it however you like, or write it to a file; this is just a rough test.
00000000-00 GOTOURL http://scholar.google.com/
00000001-00 SUBMITFORM default
00000002-00 NOTICE [ZITATION] Stochastic complexity in statistical inquiry theory
00000003-00 NOTICE AUTOMATED THEORY FORMATION IN MATHEMATICS1
00000004-00 NOTICE Constraint generation via automated theory formation
00000005-00 NOTICE [BUCH] Automated theorem proving: after 25 years
00000006-00 NOTICE [BUCH] Introduction to the Theory of Computation
00000007-00 NOTICE [ZITATION] Computer-controlled systems: theory and design
00000008-00 NOTICE [BUCH] … , randomness & incompleteness: papers on algorithmic information theory
00000009-00 NOTICE [BUCH] Automatic control systems
00000010-00 NOTICE [BUCH] VLSI physical design automation: theory and practice
00000011-00 NOTICE Singular Control Systems.
