I may need to create a custom training set for OpenNLP, and this will require me to manually annotate a lot of entries.
To make things easier, a GUI solution is probably the best idea (manually writing annotation tags is no fun), and I've just discovered BRAT, which looks like what I need.
BRAT can export an annotated file (.ann), but I can't find any reference to this file type in OpenNLP's manual and I'm not sure it will work.
What I'd like to do is export this annotated file from BRAT and use it to train an OpenNLP model; I don't really care whether it's done with code or the CLI.
Can someone point me in the right direction?
OpenNLP has native support for the BRAT format for training and evaluation of the Name Finder. Other components are not supported currently. Adding support for other components would probably not be difficult, and if you are interested you should ask for it on the opennlp-dev list.
The CLI can be used to train a model with BRAT data; running the following command will show you the usage:
bin/opennlp TokenNameFinderTrainer.brat
The following arguments are mandatory to train a model (a sample invocation follows the list):
bratDataDir: points to a folder containing your .ann and .txt files
annotationConfig: points to the config file BRAT uses for the annotation project
lang: the language of your text documents (e.g. en)
model: the name of the created model file
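For example, a training invocation might look like this (the directory layout and model name here are placeholders for your own project):
bin/opennlp TokenNameFinderTrainer.brat -bratDataDir brat-project/data -annotationConfig brat-project/annotation.conf -lang en -model my-model.bin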
The Name Finder needs its input split into sentences and tokens. By default it assumes one sentence per line and applies whitespace tokenization. This behavior can be adjusted with the ruleBasedTokenizer or tokenizerModel arguments. Additionally, it is possible to use a custom sentence detector model via the sentenceDetectorModel argument.
To evaluate your model, the cross-validation and evaluation tools can be used in a similar way by attaching .brat to their names:
bin/opennlp TokenNameFinderCrossValidator.brat
bin/opennlp TokenNameFinderEvaluator.brat
To speed up your annotation project you can use the opennlp-brat-annotator. It can load a Name Finder model and integrates with BRAT to automatically annotate your documents, which reduces the manual annotation effort. You can find that component in the OpenNLP sandbox.
I have to do a project with OpenNLP, strictly in the Italian language. Since it's almost impossible to find existing resources in this language, my idea is to create a simple model myself. Reading some posts on this platform, my plan is to try to do this using the model-builder addon.
First of all, is it possible to achieve my goal with this addon?
If so, referring to this other post, what kind of file is meant by "modelOutFile"? In my case I don't have an existing model.
N.B.: the addon uses some deprecated functions (such as nameFinderME.train()).
Naively, I tried to pass a simple empty file "model.bin" as the "modelOutFile", but of course I ran into an error:
Cannot invoke "java.util.Properties.getProperty(String)" because "manifest" is null
Furthermore, I used only a few names and sentences for the test (I just wanted to see if it worked), not the large amount recommended (at least 15,000 sentences).
I'm open to other suggestions besides the modelbuilder addon.
Hope someone can help me.
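For reference, instead of going through the addon, a model can be trained directly against the current OpenNLP API. Below is a minimal sketch, assuming one training sentence per line in a hypothetical it-ner-train.txt file annotated with <START:...> ... <END> tags; the file names are placeholders:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainItalianNameFinder {
    public static void main(String[] args) throws Exception {
        // One sentence per line, with entities marked like:
        // <START:person> Mario Rossi <END> abita a Roma .
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("it-ner-train.txt")),
                StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // The entity type comes from the <START:...> tags, so null is fine here
        TokenNameFinderModel model = NameFinderME.train("it", null, samples,
                TrainingParameters.defaultParams(), new TokenNameFinderFactory());

        // This produces the model.bin you would otherwise get from the addon
        try (OutputStream out = new FileOutputStream("model.bin")) {
            model.serialize(out);
        }
    }
}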
I have a dictionary of named entities, extracted from Wikipedia. I want to use it as the dictionary of an NER. I wanted to know how I can use Stanford-NER with this data of mine.
I have also downloaded LingPipe, although I have no idea how to use it. I would appreciate all kinds of information.
Thanks for your help.
You can use dictionary (or regular expression-based) named entity recognition with Stanford CoreNLP. See the RegexNER annotator. For some applications, we run this with quite large dictionaries of entities. Nevertheless, for us this is typically a secondary tool to using statistical (CRF-based) NER.
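As a minimal sketch of how that can be wired up (the mapping file name is hypothetical; each line of a RegexNER mapping file is a phrase, a tab, and the NER type to assign):

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class DictionaryNerExample {
    public static void main(String[] args) {
        // Each line of the mapping file is: phrase<TAB>TYPE, e.g.
        // University of Bologna	ORGANIZATION
        Properties props = new Properties();
        // regexner runs after the statistical NER and overlays dictionary matches
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner");
        props.setProperty("regexner.mapping", "wikipedia-entities.tab");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("He studied at the University of Bologna.");
        pipeline.annotate(doc);
    }
}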
Stanford-NER is based on CRFs, which is a statistical model. I'm afraid it doesn't support an extra dictionary or lexicon. However, you can train a new model for your own task.
You can use MER (http://labs.fc.ul.pt/mer/), a minimal entity recognizer developed in Bash (https://github.com/lasigeBioTM/MER) that only requires a lexicon (text file) as input.
I'm looking for a Java library that can do Named entity recognition (NER) with a custom controlled vocabulary, without needing labeled training data first. I searched some on SE, but most questions are rather unspecific.
Consider the following use-case:
an editor is inputting articles in a CMS (about 500 words).
the text may contain references (in plain text) to entities of a specific domain, e.g.:
names of points of interest, like bars, restaurants, as well as neighborhoods, etc.
a controlled vocabulary of these entities exists (about 5,000 entities).
I imagine an entity to be a tuple in the vocabulary
after finishing the text, the user should be able to save the document.
This triggers a workflow that scans the text against the vocabulary by comparing against the names of the entities. A 100% match is not required: 97% on Jaro-Winkler or whatever (I'm not familiar with which algorithms NER uses) may be enough, and I need this to be configurable.
Hits are returned to the controller server-side, which in turn returns JSON to the client containing the entities, represented as suggested crosslinks to the editor.
Ideally, I'm looking for a project that uses NER to suggest crosslinks within a CMS environment that I can piggyback on (I'm sure plugins for WordPress exist, for example), but I'm not sure whether something similar exists in Java.
All other, more general pointers to NER libraries that work with controlled custom vocabularies are welcome as well.
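To illustrate the kind of fuzzy matching I have in mind, here is a minimal sketch using Apache Commons Text's JaroWinklerSimilarity (the class name and threshold are just an assumption of how I'd wire it up, not a requirement):

import java.util.List;
import java.util.stream.Collectors;
import org.apache.commons.text.similarity.JaroWinklerSimilarity;

public class VocabularyMatcher {
    private final JaroWinklerSimilarity similarity = new JaroWinklerSimilarity();
    private final double threshold; // e.g. 0.97, configurable

    public VocabularyMatcher(double threshold) {
        this.threshold = threshold;
    }

    // Returns the vocabulary names scoring above the threshold for a candidate phrase
    public List<String> match(String candidate, List<String> vocabularyNames) {
        return vocabularyNames.stream()
                .filter(name -> similarity.apply(
                        candidate.toLowerCase(), name.toLowerCase()) >= threshold)
                .collect(Collectors.toList());
    }
}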
For people looking this up in the future:
"Approximate Dictionary-Based Chunking"
see: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
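Following that tutorial, a minimal sketch with LingPipe's ApproxDictionaryChunker looks roughly like this (the dictionary entries and distance threshold are placeholders):

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.dict.ApproxDictionaryChunker;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.TrieDictionary;
import com.aliasi.spell.FixedWeightEditDistance;
import com.aliasi.spell.WeightedEditDistance;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

public class ApproxDictionaryDemo {
    public static void main(String[] args) {
        TrieDictionary<String> dict = new TrieDictionary<>();
        dict.addEntry(new DictionaryEntry<>("Cafe Central", "POI"));

        // match=0, delete/insert/substitute=-1, no transposition
        WeightedEditDistance editDistance =
                new FixedWeightEditDistance(0, -1, -1, -1, Double.NaN);

        // maxDistance (here 2.0) controls how fuzzy a match may be
        Chunker chunker = new ApproxDictionaryChunker(
                dict, IndoEuropeanTokenizerFactory.INSTANCE, editDistance, 2.0);

        Chunking chunking = chunker.chunk("We met at Cafe Centrale last night.");
        for (Chunk chunk : chunking.chunkSet()) {
            System.out.println(chunk.type() + ": " + chunk.start() + "-" + chunk.end());
        }
    }
}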
Unsure if these might be helpful:
http://www-nlp.stanford.edu/software/CRF-NER.shtml
http://cogcomp.cs.illinois.edu/page/software
I have an application that reads an input XML file and builds an EMF/Ecore model (which can be stored as an XMI file).
The input file format is "locked", meaning that no new tags, attributes, etc. not already defined in the format can appear, but the number of existing tags and the values of attributes can change.
Now I would like to support the following scenario:
1) User imports xml_01 and an EMF model is built.
2) User modifies the model and stores it to disk.
3) User imports xml_02 which is almost identical to xml_01 but with some additional nodes.
4) During the second import the existing model should be updated based on the additional content from xml_02 and possible conflicts reported to the user.
Now I have an idea on how to get started with this - basically writing the updater from scratch.
But are there any tools/libraries that can help with writing this kind of updater - especially when it comes to modifying an EMF model?
I do not know of any third-party libraries that can do this directly. But from what I understand, you can use a SAX parser to parse the XML files and implement your own handler for the required functionality, for example like the sketch below.
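A minimal sketch of that approach with the JDK's built-in SAX support (the handler logic and file name are placeholders for your model-update code):

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ModelUpdateHandler extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Look up the corresponding EMF model element here; if it is missing,
        // create it; if it exists with different values, record a conflict.
        System.out.println("element: " + qName);
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("xml_02.xml"), new ModelUpdateHandler());
    }
}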
I am looking for a tool, Java code, or a class library/API that can generate an XSD from XML files (something like the xsd.exe utility in the .NET Framework SDK).
These tools can provide a good starting point, but they aren't a substitute for thinking through what the actual schema constraints ought to be. You get the opportunity for two kinds of errors: (1) allowing XML that shouldn't be allowed and (2) disallowing XML that should be ok.
As an example, pretend that you want to infer an XSD from a few thousand patient records that include a 'gender' tag (I used to work on medical records software). The tool would likely encounter 'M' and 'F' as values and might deduce that the element is an enumeration. However, other valid (although rare) values are B (both), U (unknown), or N (none). So if you used your derived schema as an input validator, it would perform well until a patient with multiple sex organs was admitted to the hospital.
Conversely, to avoid this error, an XSD generator might not add enumerated type restrictions (I can't remember what these are called in schemas), and your application would work well until it encountered an errant record with gender=X.
So, beware. It's best to use these tools only as a starting point. Also, they tend to produce verbose and redundant schemas because they can't figure out patterns as well as humans.
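To make the validator scenario concrete, here is a minimal sketch using the JDK's javax.xml.validation API (the file names are placeholders); a record with gender=B would fail here if the generator inferred an {M, F} enumeration:

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ValidateAgainstInferredXsd {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("my.xsd"));
        Validator validator = schema.newValidator();
        // Throws SAXException if the record violates the inferred constraints
        validator.validate(new StreamSource(new File("record.xml")));
        System.out.println("record.xml is valid against my.xsd");
    }
}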
Check Castor; I think it has the functionality you are looking for. They also provide an Ant task that creates XSD schemas from XML files.
P.S. I suggest you add more specific tags in the future: for instance, using xml, xsd and java will increase your chances of getting answers.
You can use the xsd-gen-0.2.0-jar-with-dependencies.jar file to convert XML to XSD. The command for it is:
java -jar xsd-gen-VERSION-jar-with-dependencies.jar /path/to/xml.xml > /path/to/my.xsd
Try the xsd-gen project from Google.
https://code.google.com/p/xsd-gen/