I have been working on information extraction and was able to run StandAloneAnnie.java:
http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/StandAloneAnnie.java
My question is: how can I use GATE ANNIE to get similar words? For example, if I input "dine", can I get results like "food", "eat", "dinner", "restaurant"?
More Information:
I am doing a project where I was assigned to develop a simple web page that takes user input and passes it to GATE components, which tokenize the query and return a semantic grouping for each phrase in order to make recommendations.
For example, a user would enter "I want to have dinner in Kuala Lumpur" and the system would break it down into (Search for: dinner - Required: restaurant, dinner, eat, food - Location: Kuala Lumpur).
ANNIE has about 15 annotation types by default; see the demo:
http://services.gate.ac.uk/annie/
I have already implemented everything shown in the demo, but my question is: can I do this using GATE ANNIE? That is, is it possible to find synonyms of words or group words by their part of speech (nouns, verbs)?
Plain vanilla ANNIE doesn't support this kind of thing, but there are third-party plugins such as Phil Gooch's WordNet Suggester that might help. Or, if your domain is fairly restricted, you might get better results with less effort by simply creating your own gazetteer lists of related terms and a few simple JAPE rules. You may find the training materials available on the GATE wiki useful if you haven't done much of this before.
Related
I am currently working on a university project under the theme of "search engine".
For this purpose we were given access to a database of scientific publications
(http://dblp.uni-trier.de)
It is a 2GB XML file which looks something like this:
<article key="GottlobSR96">
  <author>Georg Gottlob</author>
  <author>Michael Schrefl</author>
  <author>Brigitte Röck</author>
  <title>Extending Object-Oriented Systems with Roles.</title>
  <pages>268-296</pages>
  <year>1996</year>
  <volume>14</volume>
  <journal>TOIS</journal>
  <number>3</number>
  <url>db/journals/tois/tois14.html#GottlobSR96</url>
</article>
As you can see, the "article" tag contains various pieces of information such as the authors, the title of the paper, and the year of publication. My job now is to implement a Java solution that takes search terms of different categories (author, university, title) as input and provides the user with additional information.
For example, if you enter the name of a professor, it should return data like their date of birth, the university they work at, the number of publications, and so on.
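Since the dump is 2 GB, loading it with a DOM parser is impractical; a streaming parser such as StAX can read it one <article> element at a time. A rough sketch (the file name and the printed fields are placeholders, and the real dump also needs its dblp.dtd available for the character entities it declares):

import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class DblpReader {

    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("dblp.xml"));

        List<String> authors = new ArrayList<>();
        String title = null;
        String year = null;
        StringBuilder text = new StringBuilder();

        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    text.setLength(0);                       // start collecting element text
                    if ("article".equals(reader.getLocalName())) {
                        authors.clear();
                        title = null;
                        year = null;
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    text.append(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    String name = reader.getLocalName();
                    if ("author".equals(name)) {
                        authors.add(text.toString().trim());
                    } else if ("title".equals(name)) {
                        title = text.toString().trim();
                    } else if ("year".equals(name)) {
                        year = text.toString().trim();
                    } else if ("article".equals(name)) {
                        // One complete record is now in memory; index or store it here.
                        System.out.println(year + " | " + title + " | " + authors);
                    }
                    break;
            }
        }
        reader.close();
    }
}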
I suppose this could work by using the Google API to find a person's entry on their university's homepage and then somehow parsing that page to find the needed information. For universities there should be a Wikipedia page.
I already tried the MediaWiki API but couldn't figure out how to get only the specific information I want (I could only retrieve the intro paragraph).
I've never worked on a project of this scale, so I don't really know how to integrate external APIs and libraries into my own code.
So I guess my question is:
How do I get specific information based on a Google search, whether through Wikipedia or otherwise?
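One hedged sketch, assuming Wikidata is acceptable as the structured source instead of scraping pages: resolve the name to a Wikidata entity with wbsearchentities, then fetch individual properties with wbgetclaims (P569 is Wikidata's "date of birth" property; the entity id below is a placeholder, and a JSON library such as Gson or Jackson would replace the raw printing):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class WikidataLookup {

    // Performs a GET request and returns the response body as a string.
    private static String get(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("User-Agent", "dblp-search-demo/0.1");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        String name = "Georg Gottlob";

        // Step 1: find the Wikidata entity id (Q-number) for the name.
        String search = get("https://www.wikidata.org/w/api.php"
                + "?action=wbsearchentities&language=en&format=json"
                + "&search=" + URLEncoder.encode(name, "UTF-8"));
        System.out.println(search); // parse the "id" field out of this with a JSON library

        // Step 2: fetch a specific claim of that entity, e.g. P569 (date of birth).
        String entityId = "Q123456"; // placeholder: use the id parsed in step 1
        String claims = get("https://www.wikidata.org/w/api.php"
                + "?action=wbgetclaims&format=json"
                + "&entity=" + entityId + "&property=P569");
        System.out.println(claims);
    }
}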
I want to identify all the names written in any text; currently I am working with IMDb movie reviews.
I am using the Stanford POS tagger and analysing all the proper nouns (since proper nouns are the names of people, things, and places), but this is slow.
First I tag all the input lines, then I check for all the words tagged NNP, which is a slow process.
Is there a more efficient way to achieve this? Any library (preferably in Java)?
Thanks.
Do you know the input language? If so, you could match each word against a dictionary and flag it as a proper noun if it is not in the dictionary. This would require a complete dictionary with all the inflected forms of each word of the language, and you would have to pay attention to numbers and other special cases.
EDIT: See also this answer in the official FAQ: have you tried to change the model used?
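A minimal sketch of the dictionary-lookup idea above (the word-list file name is a placeholder; a real list needs the inflected forms mentioned, and sentence-initial capitalized words will still be ambiguous):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NameCandidateFinder {

    public static void main(String[] args) throws IOException {
        // Assumed: a plain word list with one lower-cased dictionary word per line.
        Set<String> dictionary = new HashSet<>(
                Files.readAllLines(Paths.get("english-words.txt"), StandardCharsets.UTF_8));

        String review = "Christopher Nolan really outdid himself with this film.";

        List<String> candidates = new ArrayList<>();
        for (String token : review.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            boolean capitalized = Character.isUpperCase(token.charAt(0));
            boolean inDictionary = dictionary.contains(token.toLowerCase());
            // Flag capitalized tokens that are not ordinary dictionary words.
            if (capitalized && !inDictionary) {
                candidates.add(token);
            }
        }
        System.out.println(candidates); // prints the flagged name candidates
    }
}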
A (paid) web service called GlobalNLP can do it in multiple languages: https://nlp.linguasys.com/docs/services/54131f001c78d802f0f2b28f/operations/5429f9591c78d80a3cd66926
I just started learning NLP and chose the Stanford API for all my required tasks. I can do the POS and NER tasks, but I am stuck with coreference resolution. I can even get the 'corefChaingraph' and print all the representative mentions and their corresponding mentions to the console, but I would really like to know how to get the final text after resolving the coreferences. Can someone help me with this?
example:
Input sentence:
John Smith talks about the EU. He likes the family of nations.
Expected output:
John Smith talks about the EU. John Smith likes the family of nations.
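For illustration, a rough sketch of one way to do the substitution with the coref chains from the dcoref annotator (this assumes CoreNLP 3.x class names; the token re-joining is naive, so spacing around punctuation stays rough):

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CorefRewrite {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation(
                "John Smith talks about the EU. He likes the family of nations.");
        pipeline.annotate(doc);

        Map<Integer, CorefChain> chains =
                doc.get(CorefCoreAnnotations.CorefChainAnnotation.class);
        List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);

        // replacement[s][t] holds the representative text for the first token of a mention;
        // drop[s][t] marks the remaining tokens of that mention so they are swallowed.
        String[][] replacement = new String[sentences.size()][];
        boolean[][] drop = new boolean[sentences.size()][];
        for (int s = 0; s < sentences.size(); s++) {
            int n = sentences.get(s).get(CoreAnnotations.TokensAnnotation.class).size();
            replacement[s] = new String[n];
            drop[s] = new boolean[n];
        }

        for (CorefChain chain : chains.values()) {
            CorefMention rep = chain.getRepresentativeMention();
            for (CorefMention m : chain.getMentionsInTextualOrder()) {
                if (m.mentionID == rep.mentionID) {
                    continue;                          // keep the representative as-is
                }
                int s = m.sentNum - 1;                 // sentNum is 1-based
                replacement[s][m.startIndex - 1] = rep.mentionSpan;
                for (int t = m.startIndex; t < m.endIndex - 1; t++) {
                    drop[s][t] = true;                 // swallow the rest of the mention
                }
            }
        }

        StringBuilder out = new StringBuilder();
        for (int s = 0; s < sentences.size(); s++) {
            List<CoreLabel> tokens = sentences.get(s).get(CoreAnnotations.TokensAnnotation.class);
            for (int t = 0; t < tokens.size(); t++) {
                if (drop[s][t]) {
                    continue;
                }
                out.append(replacement[s][t] != null ? replacement[s][t] : tokens.get(t).word());
                out.append(' ');
            }
        }
        System.out.println(out.toString().trim());
    }
}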
It depends a lot on what approach you take. I personally would try to solve this by looking at what role a word plays in a sentence and what context is carried forward. Based on the POS tags, try to map the words onto a subject-verb-object model. Once you have the subjects and objects identified, you can build a simple context carry-forward rule system to achieve what you want.
e.g.
Based on the tags below:
[('John', 'NNP'), ('Smith', 'NNP'), ('talks', 'VBZ'), ('about', 'IN'), ('the', 'DT'), ('EU.', 'NNP'), ('He', 'NNP'), ('likes', 'VBZ'), ('the', 'DT'), ('family', 'NN'), ('of', 'IN'), ('nations', 'NNS'), ('.', '.')]
You can create chunks:
[['noun_type', 'John', 'Smith'], ['verb_type', 'talks'], ['in_type', 'about'], ['noun_type', 'the', 'EU']]
[['noun_type', 'He'], ['verb_type', 'likes'], ['noun_type', 'the', 'family'], ['in_type', 'of'], ['noun_type', 'nations']]
Once you have these chunks, parse them left to right putting them in Subject-Verb-Object form.
Now, based on this, you know what context is carried forward.
For example, "He" means the subject is carried forward and "It" means the object is (this is a very basic example; you can build a robust rule-based system for such patterns). I have tried many approaches in the past and this one gave me the best results.
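A toy sketch of that carry-forward rule, assuming the chunks have already been reduced to subject-verb-object triples (the triples below are hard-coded for the example sentences):

public class CarryForward {

    public static void main(String[] args) {
        String lastSubject = null;
        String lastObject = null;

        // Chunked sentences already in subject -> verb -> object form.
        String[][] svo = {
                {"John Smith", "talks about", "the EU"},
                {"He", "likes", "the family of nations"}
        };

        for (String[] sentence : svo) {
            String subject = sentence[0];
            String object = sentence[2];

            if (subject.equalsIgnoreCase("he") || subject.equalsIgnoreCase("she")) {
                subject = lastSubject;   // pronoun subject: carry the last subject forward
            }
            if (object.equalsIgnoreCase("it")) {
                object = lastObject;     // pronoun object: carry the last object forward
            }

            System.out.println(subject + " " + sentence[1] + " " + object);
            lastSubject = subject;
            lastObject = object;
        }
    }
}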
I hope I helped.
In my experience, the problem you are trying to solve is not completely solved yet, but many people are working on it. I tried the "karaka" approach, not just to get subject-verb-object but also the other references in a sentence.
Here is how I approached the problem:
Step 1: Detect the voice of the sentence.
Step 2: For active voice, parse the POS-tagged sentence from left to right to get subject-verb-object (it will always be in that order for active voice). For passive voice, look for "by" and take the next noun as the subject.
Looking at your example:
In both sentences you have a Noun-Verb-IN-Noun structure, which you can easily parse: the first noun is the subject, then the verb, then IN ("about" points to the object), and then a noun again. From those rules: "John Smith" is the subject, "talks" is the action, and "EU" is the object.
Karaka theory in linguistics will also help you with other roles.
E.g.: John Smith talks about the EU in Paris.
Here, when you encounter "in" (IN tag) followed by "Paris" (NNP tag), you can have a rule telling you that "in/on/around/inside/outside" are locative references.
Similarly, "with/without" are instrumental, and "for" is dative.
I basically trust this kind of deep parsing and rule system when I have to deal with a single word and the role it plays in a sentence.
I get a good amount of accuracy with this approach.
Just wanted to know how you would do it.
I have a web service that lets me complete a user's address while they are typing it.
When the suggestions are shown, I'd like the part of the suggestion label that matches the user input to be surrounded with bold tags.
I want the "matching" to be clever, and not just a simple search/replace, since the web service we use is clever too (but I don't have its code).
For example:
Input: 3 OxFôr sTrE
Ws result: 3 Oxford Street
Formatted: <b>3 Oxford Stre</b>et
Formatted: [bold]3 Oxford Stre[/bold]et
I can do it in JS or Java.
I'd rather do it in JS, but if it's Java, perhaps Lucene can help?
Do you see how it can be handled?
Index your text as n-grams, using a search engine or a custom data structure. I am implementing auto-recommendation by indexing around 1 billion query words as n-grams, and at display time I sort the suggestions by the frequency of each typed query. Lucene/Solr can help you here: highlighted matches (as you asked) are enclosed in tags by default, and you can also exploit the n-gram indexing features that Lucene/Solr provide.
LinkedIn Engineering recently open-sourced Cleo (the open-source technology behind LinkedIn's typeahead search): Link.
Great stuff by LinkedIn; check it out for the clever matching and highlighting you're after.
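If pulling in Lucene/Solr is more than you need, a plain-Java sketch of the accent- and case-insensitive prefix matching described in the question might already do: fold case and diacritics with java.text.Normalizer, match each typed word as a prefix of the corresponding suggestion word, and bold everything up to the end of the last matched prefix (this assumes folding does not change character counts, which holds for plain accents but not for ligatures):

import java.text.Normalizer;

public class SuggestionHighlighter {

    // Lower-cases and strips diacritics so "OxFôr" becomes "oxfor".
    private static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "")
                .toLowerCase();
    }

    // Wraps the part of the suggestion already "covered" by the user input in <b> tags.
    public static String highlight(String userInput, String suggestion) {
        String[] typed = fold(userInput).trim().split("\\s+");
        String[] words = suggestion.split("\\s+");
        if (typed.length == 0 || typed.length > words.length) {
            return suggestion;
        }

        int end = 0;    // index in `suggestion` just after the matched part
        int cursor = 0; // where to look for the next suggestion word
        for (int i = 0; i < typed.length; i++) {
            int wordStart = suggestion.indexOf(words[i], cursor);
            if (wordStart < 0 || !fold(words[i]).startsWith(typed[i])) {
                return suggestion; // no clean match, leave the label untouched
            }
            end = wordStart + typed[i].length();
            cursor = wordStart + words[i].length();
        }
        return "<b>" + suggestion.substring(0, end) + "</b>" + suggestion.substring(end);
    }

    public static void main(String[] args) {
        System.out.println(highlight("3 OxFôr sTrE", "3 Oxford Street"));
        // prints: <b>3 Oxford Stre</b>et
    }
}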
I am looking for a stock symbol look-up API. I am able to query Yahoo Finance with a symbol and retrieve the stock price and other details.
What I need is an auto-complete stock look-up API: if I query for "Go*", how can I get all stock symbols starting with "GO" (GOOG, etc.)? Is there any API for wildcard stock symbol searches?
Any help would be great.
Thanks
There's a simple problem here and a more complex problem.
If your list of symbols is static, then you could use any typical autocomplete API against a file that you maintain locally.
However, the list of symbols is rarely static. Symbols are constantly being added, removed, and changed due to a variety of financial market events (IPOs, acquisitions, mergers, renames, bankruptcies, etc.). Many symbols are traded only on specific exchanges, and some symbols are cross-listed. There are also financial instruments that are not simply stocks, such as indices, commodities, etc. The general term for this is "instrument definitions", and a complete list of such definitions is a service provided by companies such as Reuters or Bloomberg.
I am not aware of any open instrument lists that you could get for free, and you need to make sure that you comply with the licenses of the services that let you obtain a current list.
If you can tolerate a delay of one business day, you might be able to scrape the list from a variety of sources that provide close-of-business listings for all stocks in the US. The WSJ has a printed list (likely also an electronic one). Eoddata provides such a list, etc. But make sure that you are complying with their terms.
Yahoo has an API for this:
http://d.yimg.com/autoc.finance.yahoo.com/autoc?query=GOO&callback=YAHOO.Finance.SymbolSuggest.ssCallback
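A rough Java sketch of calling that endpoint; it assumes the response is still the JSONP-wrapped ResultSet (with "symbol" and "name" fields) that it returned at the time, and a proper JSON parser would be cleaner than the regex used here:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SymbolSuggest {

    public static void main(String[] args) throws Exception {
        String query = "GOO";
        String url = "http://d.yimg.com/autoc.finance.yahoo.com/autoc"
                + "?query=" + URLEncoder.encode(query, "UTF-8")
                + "&callback=YAHOO.Finance.SymbolSuggest.ssCallback";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }

        // The response is JSONP: YAHOO.Finance.SymbolSuggest.ssCallback({...});
        // strip the callback wrapper, then pull out the "symbol" fields.
        String json = body.substring(body.indexOf("(") + 1, body.lastIndexOf(")"));
        Matcher m = Pattern.compile("\"symbol\":\"([^\"]+)\"").matcher(json);
        while (m.find()) {
            System.out.println(m.group(1)); // prints the matching ticker symbols
        }
    }
}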
Any autocomplete API or toolkit should work if you put the tickers in as a source. You will probably have to host it yourself as I don't know of any public ones.
You can try using the "Company Search" operation in the Company Fundamentals API here: http://www.mergent.com/servius/