Exact dictionary-based Named Entity Recognition with Stanford (Java)

I have a dictionary of named entities extracted from Wikipedia, and I want to use it as the dictionary of an NER system. How can I use Stanford NER with this data?
I have also downloaded LingPipe, although I have no idea how to use it. I would appreciate any kind of information.
Thanks for your help.

You can use dictionary-based (or regular-expression-based) named entity recognition with Stanford CoreNLP: see the RegexNER annotator. For some applications, we run it with quite large dictionaries of entities. Nevertheless, for us it is typically a secondary tool alongside statistical (CRF-based) NER.
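For example, here is a minimal pipeline sketch; the mapping file name wikipedia-entities.tab and its entries are placeholders, and the mapping format is one tab-separated pattern/type pair per line:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class DictionaryNer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // regexner tags tokens that match entries in the mapping file
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,regexner");
        // placeholder mapping file; each line is "pattern<TAB>type",
        // e.g. "Barack Obama<TAB>PERSON"
        props.setProperty("regexner.mapping", "wikipedia-entities.tab");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Barack Obama visited Paris.");
        pipeline.annotate(doc);
        for (CoreLabel token : doc.get(CoreAnnotations.TokensAnnotation.class)) {
            // tokens not covered by the dictionary will have a null tag here,
            // since the statistical ner annotator is not in the pipeline
            System.out.println(token.word() + "\t"
                    + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
        }
    }
}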

Stanford NER is based on CRFs, which is a statistical model, so I'm afraid it doesn't support an extra dictionary or lexicon out of the box. However, you can train a new model for your own task.

You can use MER (http://labs.fc.ul.pt/mer/), a minimal entity recognizer developed in bash that only requires a lexicon (a plain text file) as input: https://github.com/lasigeBioTM/MER

Related

How to create a simple Italian model for Named Entity Extraction of persons using OpenNLP?

I have to do a project with OpenNLP, strictly in the Italian language. Since it is almost impossible to find existing resources for this language, my idea is to create a simple model myself. Reading some posts on this platform, my plan is to try the model-builder addon.
First of all, is it possible to achieve my goal with this addon?
If so, referring to this other post, what kind of file is meant by "modelOutFile"? In my case I don't have an existing model.
N.B.: the addon uses some deprecated functions (such as nameFinderME.train()).
Naively, I tried to pass a simple empty file "model.bin" as the "modelOutFile", but of course I ran into an error:
Cannot invoke "java.util.Properties.getProperty(String)" because "manifest" is null
Furthermore, I only used a few names and sentences for the test (I just wanted to see whether it worked), not the large amount required (at least 15,000 sentences).
I'm open to other suggestions instead of using the model-builder addon.
Hope someone can help me.
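One way around the addon is to call the current OpenNLP training API directly. A rough sketch, assuming training data in the standard <START:type> ... <END> format in a file train-it.txt (the file names and the lowered cutoff are illustrative; the cutoff matters because the default of 5 makes training fail on a tiny test corpus):

import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainItalianNer {
    public static void main(String[] args) throws Exception {
        // train-it.txt: one sentence per line, entities marked like
        //   <START:person> Mario Rossi <END> abita a Roma .
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("train-it.txt")),
                StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.CUTOFF_PARAM, "1"); // let tiny corpora train at all

        TokenNameFinderModel model = NameFinderME.train(
                "it", "person", samples, params, new TokenNameFinderFactory());

        try (FileOutputStream out = new FileOutputStream("it-ner-person.bin")) {
            model.serialize(out);
        }
    }
}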

Apache OpenNLP Part of Speech Tagger: Trained on which data set?

I am using the Apache OpenNLP Part-of-Speech Tagger for word class recognition in a collection of text.
I am trying to evaluate the tagger's performance, and I wondered which data it might have been trained on.
The names of the English models give no hint about the training data used.
The Apache OpenNLP documentation mentions several corpora which might potentially have been used for training the POS tagger, too:
http://opennlp.apache.org/documentation/manual/opennlp.html#tools.corpora
Does anyone know how to find out on which training data the English POS-Models have been trained?
Yes, you are right that several corpora are used in OpenNLP.
But if you look at the OpenNLP models page, it states for each model which dataset was used to train it.

Training data set for Named Entity Recognition using Stanford NLP

I have used OpenNLP to create a model, but now I am looking into Stanford NLP, which uses Conditional Random Fields. I want to know how to prepare training data for NER with Stanford NLP.
For OpenNLP we use START and END tags, but I do not know how to train with Stanford NLP. Please give me an example.
Look at http://nlp.stanford.edu/software/crf-faq.shtml#b
That should explain what you need to know to get started. There are then many options, mainly documented only in the code.
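In brief: Stanford NER does not use <START>/<END> markup; it trains from a tab-separated file with one token and its label per line (O for non-entity tokens), for example "John<TAB>PERSON" followed by "lives<TAB>O". A rough sketch of driving the trainer from Java, assuming a properties file ner.props along the lines of the FAQ (the file names are placeholders):

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;
import java.util.Properties;

public class TrainStanfordNer {
    public static void main(String[] args) throws Exception {
        // ner.props is assumed to contain at least:
        //   trainFile = train.tsv      (one "token<TAB>label" pair per line)
        //   map = word=0,answer=1
        Properties props = StringUtils.propFileToProperties("ner.props");
        CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
        crf.train(); // reads trainFile and fits the CRF
        crf.serializeClassifier("my-ner-model.ser.gz");
    }
}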

Unsupervised Named Entity Recognition (NER) with a custom controlled vocabulary for crosslink suggestions in Java

I'm looking for a Java library that can do Named Entity Recognition (NER) with a custom controlled vocabulary, without needing labeled training data first. I searched around on SE, but most questions seem rather unspecific.
Consider the following use-case:
An editor is inputting articles into a CMS (about 500 words each).
The text may contain plain-text references to entities of a specific domain, e.g. names of points of interest such as bars and restaurants, as well as neighborhoods, etc.
A controlled vocabulary of these entities exists (about 5,000 entities). I imagine an entity to be a tuple in the vocabulary.
After finishing the text, the user should be able to save the document.
Saving triggers a workflow that scans the text against the vocabulary by comparing it with the entity names. A 100% match is not required: 97% on Jaro-Winkler or whatever algorithm is used may be enough (I'm not familiar with which algorithms NER implementations use), and I need this threshold to be configurable.
Hits are returned to the controller server-side, which in turn returns JSON to the client containing the entities, which are presented as suggested crosslinks to the editor.
Ideally, I'm looking for a project to piggyback on that uses NER to suggest crosslinks within a CMS environment. (I'm sure plugins for WordPress exist, for example; I'm not so sure whether something similar exists in Java.)
All other, more general pointers to NER libraries that work with controlled custom vocabularies are welcome as well.
For people looking this up in the future: what you want is called "Approximate Dictionary-Based Chunking".
See the LingPipe named-entity tutorial: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
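A rough sketch of what that looks like with LingPipe's ApproxDictionaryChunker; the dictionary entries, edit-distance weights, and the 2.0 distance threshold are illustrative (the threshold is the configurable fuzziness asked about above):

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.dict.ApproxDictionaryChunker;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.TrieDictionary;
import com.aliasi.spell.FixedWeightEditDistance;
import com.aliasi.spell.WeightedEditDistance;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

public class CrosslinkSuggester {
    public static void main(String[] args) {
        // Load the controlled vocabulary (~5,000 entries) into a trie
        TrieDictionary<String> dict = new TrieDictionary<String>();
        dict.addEntry(new DictionaryEntry<String>("Cafe Central", "POI"));
        dict.addEntry(new DictionaryEntry<String>("Schanzenviertel", "NEIGHBORHOOD"));

        // Exact matches cost 0; each insert/delete/substitution costs -1
        WeightedEditDistance editDistance =
                new FixedWeightEditDistance(0, -1, -1, -1, Double.NaN);

        // 2.0 allows up to two character edits between text and entry
        ApproxDictionaryChunker chunker = new ApproxDictionaryChunker(
                dict, IndoEuropeanTokenizerFactory.INSTANCE, editDistance, 2.0);

        Chunking chunking = chunker.chunk("We met at the Cafe Centrall yesterday.");
        for (Chunk chunk : chunking.chunkSet()) {
            String match = chunking.charSequence()
                    .subSequence(chunk.start(), chunk.end()).toString();
            System.out.println(match + " -> " + chunk.type()
                    + " (distance " + chunk.score() + ")");
        }
    }
}
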
Unsure if these might be helpful:
http://www-nlp.stanford.edu/software/CRF-NER.shtml
http://cogcomp.cs.illinois.edu/page/software

Writing data to files with Java

I am writing a server in Java that allows clients to play a game similar to 20 Questions. The game itself is basically a binary tree, with inner nodes that are questions about an object and leaves that are guesses at the object's identity. When the game guesses wrong, it needs to get the right answer from the player and add it to the tree. This data is then saved to a random access file.
The question is: how do you represent a tree within a file so that the data can be read back in as a tree at a later time?
If you know where I can find information on keeping data structures like trees organized when writing/reading to files, please link it. Thanks a lot.
Thanks for the quick answers everyone. This is a school project so it has some odd requirements like using random access files and telnet.
This data is then saved to a random access file.
That's the hard way to solve your problem (the "random access" bit, I mean).
The problem you are really trying to solve is how to persist a "complicated" data structure. In fact, there are a number of ways that this can be done. Here are some of them ...
Use Java persistence. This is simple to implement: make sure that your data structure is serializable, and then it's just a few lines of code to serialize and a few more to deserialize (see the sketch after this list). The downsides are:
Serialized objects can be fragile in the face of code changes.
Serialization is not incremental. You write/read the whole graph each time.
If you have multiple separate serialized graphs, you need some scheme to name and manage them.
Use XML. This is more work to implement than Java persistence, but it has the advantage of being less fragile. And if something does go wrong, there's a chance you can fix it with XSLT or a text editor. (There are XML "binding" libraries that eliminate a lot of the glue coding.)
Use an SQL database. This addresses all of the downsides of Java persistence, but involves more coding ... and using a different computational model to access the persistent data (query versus graph navigation).
Use a database and an Object-Relational Mapping technology, e.g. a JPA or JDO implementation (Hibernate is a popular choice). These bridge between the database and in-memory views of the data in a more or less transparent fashion, and avoid a lot of the glue code that you need to write in the SQL database and XML cases.
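For illustration, here is a minimal sketch of the first option (Java serialization), assuming a hypothetical Node class for the question tree; serializing the root writes out every node reachable from it:

import java.io.*;

// Hypothetical node type: a question at inner nodes, a guess at leaves.
class Node implements Serializable {
    private static final long serialVersionUID = 1L;
    String text;
    Node yes, no; // both null at a leaf
}

public class GamePersistence {
    // Write the whole tree by serializing its root.
    static void save(Node root, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(root);
        }
    }

    // Read the tree back; the object graph is restored automatically.
    static Node load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Node) in.readObject();
        }
    }
}
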
I think you're looking for serialization. Try this:
http://java.sun.com/developer/technicalArticles/Programming/serialization/
As mentioned, serialization is what you are looking for. It allows you to write an object to a file, and read it back later with minimal effort. The file will automatically be read back in as your object type. This makes things much easier than trying to store the object yourself using XML.
Java serialization has some pitfalls (like when you update your class), so I would serialize to a text format instead. JSON is my first choice here, but XML and YAML would work as well.
This way you have a file that doesn't depend on the binary layout of your class.
There are several java libraries: http://www.json.org
Some examples:
http://code.google.com/p/json-simple/wiki/DecodingExamples
http://code.google.com/p/json-simple/wiki/EncodingExamples
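For example, a minimal encode/decode round trip with json-simple (the key and value are illustrative):

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class JsonRoundTrip {
    public static void main(String[] args) throws Exception {
        // Encode: build an object and render it as a JSON string
        JSONObject node = new JSONObject();
        node.put("question", "Is it an animal?");
        String json = node.toJSONString();

        // Decode: parse the string back into a JSONObject
        JSONParser parser = new JSONParser();
        JSONObject parsed = (JSONObject) parser.parse(json);
        System.out.println(parsed.get("question"));
    }
}
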
And to save to and read from the file you can use Commons IO:
import org.apache.commons.io.FileUtils;
import java.io.File;
...
File dataFile = new File("yourfile.json");
// read the whole file into a string
String data = FileUtils.readFileToString(dataFile, "UTF-8");
// ... and write a string back out to the file
FileUtils.writeStringToFile(dataFile, data, "UTF-8");
