I've been doing a little comparison of these two packages and am not sure which direction to go in. What I am looking for briefly is:
Named Entity Recognition (people, places, organizations and such).
Gender identification.
A decent training API.
From what I can tell, OpenNLP and Stanford CoreNLP expose pretty similar capabilities. However, Stanford CoreNLP looks like it has a lot more activity whereas OpenNLP has only had a few commits in the last six months.
Based on what I saw, OpenNLP appears to make it easier to train new models and might be more attractive for that reason alone. However, my question is: what would others start with as the basis for adding NLP features to a Java app? I'm mostly worried about whether OpenNLP is "just mature" versus semi-abandoned.
In full disclosure, I'm a contributor to CoreNLP, so this is a biased answer. But, in my view on your three criteria:
Named Entity Recognition: I think CoreNLP clearly wins here, both on accuracy and ease of use. For one, OpenNLP has a model per NER tag, whereas CoreNLP detects all tags with a single Annotator. Furthermore, temporal resolution with SUTime is a nice perk in CoreNLP. Accuracy-wise, my anecdotal experience is that CoreNLP does better on general-purpose text (a minimal usage sketch follows below).
Gender identification: I think both tools are kind of poorly documented on this front. OpenNLP seems to have a GenderModel class; CoreNLP has a gender Annotator.
Training API: I suspect the OpenNLP training API is easier to use for non-off-the-shelf training. But if all you want to do is, e.g., train a model from a CoNLL file, both should be straightforward. Training speed tends to be faster with CoreNLP than other tools I've tried, but I haven't benchmarked it formally, so take that with a grain of salt.
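To illustrate the single-Annotator point for NER, here is a minimal sketch using the CoreDocument API (this assumes a reasonably recent CoreNLP release, roughly 3.9 or later, with the English models on the classpath):

```java
import java.util.Properties;

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreEntityMention;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NerSketch {
    public static void main(String[] args) {
        // One "ner" annotator covers PERSON, LOCATION, ORGANIZATION, DATE, etc.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("Angela Merkel visited Microsoft in Seattle last Tuesday.");
        pipeline.annotate(doc);

        for (CoreEntityMention mention : doc.entityMentions()) {
            System.out.println(mention.text() + " -> " + mention.entityType());
        }
    }
}
```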
A bit late here, but I was recently looking at OpenNLP based just on the fact that Stanford CoreNLP is GPL-licensed. If that's OK for your project, then Stanford is often referred to as the benchmark/state of the art for NLP.
That said, the performance of the pre-trained models will depend on your target text, since they are very domain-specific. If your target text is similar to the data the models were trained on, you should get decent results; if not, you will have to train the models yourself, and the outcome will depend on your training data.
A strength of OpenNLP is that it is very extensible, is written for easy use with other libraries, and has a good API for integration. Training is very simple with OpenNLP once you have your training data (I wrote about it here; with a pretty lousy generated data set I was able to get OK results identifying foods), and it is very configurable: you can set all the parameters around training very easily, and there is a range of algorithms you can use (perceptron, maximum entropy, and in the snapshot version they have added Naive Bayes).
If you find that you do need to train the models yourself, I would consider trying out OpenNLP and seeing how it performs, just for comparison, as with fine-tuning you can get pretty decent results. A rough sketch of the training call is below.
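A minimal sketch of what that training looks like with the OpenNLP name-finder API (the file names, the "food" entity type, and the parameter values are placeholders; the calls are the standard NameFinderME training API from OpenNLP 1.6+):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainFoodFinder {
    public static void main(String[] args) throws Exception {
        // Training data uses OpenNLP's inline annotation format, e.g.
        //   I had a <START:food> cheese sandwich <END> for lunch .
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("food-train.txt")),
                StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "1");

        TokenNameFinderModel model = NameFinderME.train(
                "en", "food", samples, params, new TokenNameFinderFactory());
        try (FileOutputStream out = new FileOutputStream("en-ner-food.bin")) {
            model.serialize(out);
        }
    }
}
```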
That depends on your purpose and needs. What I know about these two is that they differ in licensing: OpenNLP is under the permissive Apache license, while Stanford CoreNLP is GPL-licensed, which may matter for your project.
But if you look at accuracy, Stanford CoreNLP detects things more accurately than OpenNLP. I recently did a comparison of part-of-speech (POS) tagging for both, which is one of the most important parts of any NLP task, and in my analysis the winner was CoreNLP.
For NER as well, CoreNLP has more accurate results than OpenNLP.
So if you are just starting out, you can take up OpenNLP; later, if needed, you can migrate to Stanford CoreNLP.
Related
What algorithm does Named Entity Recognition (NER) use? I mean how does it match and tag all the entities?
Basically, there are a lot of algorithms for Named Entity Recognition, but the most common are:
The Conditional Random Field (CRF) algorithm. Popular NER toolkits include:
Stanford CoreNLP for Java
spaCy
NLTK
Another very effective approach is deep learning with recurrent neural networks (e.g. LSTMs).
http://nlp.town/blog/ner-and-the-road-to-deep-learning/
https://towardsdatascience.com/deep-learning-for-ner-1-public-datasets-and-annotation-methods-8b1ad5e98caf
There are a lot of other articles and papers out there.
Hope this will help :)
NER can be performed by different algorithms, from simple string matching using grep: http://labs.fc.ul.pt/mer/ to advanced machine learning techniques: https://nlp.stanford.edu/software/CRF-NER.shtml
Basically, it depends on the approach that you are following.
For Named Entity Recognition there are any number of approaches available. Most typically, the Conditional Random Field (CRF) algorithm is used. Another very effective one is deep learning with recurrent neural networks (LSTMs). There might be other algorithms as well.
Hope this will help :)
If you want to play around with a nice, pre-built bidirectional LSTM-CRF (currently one of the better-performing models) in Keras, I would recommend the Python package anago as a very quick way to experiment with an NER system.
Although this initial question is tagged with Java, it is important to point out that you are going to find all the latest and greatest algorithms for this type of machine learning implemented in Python.
spaCy (a Python package) is also a very good way to get started, although I found it took a little more time to set something up quickly. Also, I'm not sure how easy it would be to alter its pre-built NER algorithm; I believe it uses CNNs.
A good overview of where things are at with NER:
A Survey on Recent Advances in Named Entity Recognition from Deep Learning models
I'm trying to train a custom part-of-speech tagger with the CoreNLP library (using the edu.stanford.nlp.tagger.maxent.MaxentTagger class, to be specific), and am struggling with what the options mean (I am not a linguist) and what the most effective combinations are. I've tried some default options that came with the out-of-the-box download of the CoreNLP library, and also tweaked them with some changes such as bidirectional, etc., but don't see visible improvements in the accuracy of the tags. I've read through the ExtractorFrames JavaDoc page, but it seems to use shorthand that I don't quite understand. So:
What do the different option groups really mean?
Are there combinations that make sense from practice? I'd like to avoid spending a lot of time trying random combinations if certain ones don't make sense.
Chris Manning explains some of the most commonly used features for POS taggers in more detail in this Coursera video.
Regarding sensible feature sets: This heavily depends on the language. You can check out the configurations for the various models that we ship with the tagger on GitHub and if there is one for the language that you build a tagger for, then I'd use that configuration as a starting point for running your experiments.
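To make the configuration side a bit more concrete, here is a hypothetical minimal training properties file; the file names and the particular feature mix are placeholders, but the extractor names in the arch line (words, order, suffix, prefix) are ones documented in ExtractorFrames, so you can swap them in and out while experimenting:

```
# Hypothetical MaxentTagger training properties -- adjust paths and features to your data
model = my-tagger.model
trainFile = format=TSV,my-training-data.tsv
tagSeparator = _
encoding = UTF-8
# arch selects the feature templates (see ExtractorFrames):
#   words(-1,1)  the previous, current and next word
#   order(2)     the previous two predicted tags
#   suffix(4)    word suffixes up to length 4
#   prefix(3)    word prefixes up to length 3
arch = words(-1,1),order(2),suffix(4),prefix(3)
```

Training is then run with something like java -cp "stanford-corenlp.jar" edu.stanford.nlp.tagger.maxent.MaxentTagger -props my-tagger.props, and you can compare accuracy as you vary the arch line.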
I am doing a project and I need to find a machine learning library written in Java that is specialized for document classification. Can anyone please give me some examples?
Here are two well-known Java libraries:
Stanford core NLP - http://nlp.stanford.edu/software/classifier.shtml
GATE - http://osdir.com/ml/ai.gate.general/2007-05/msg00003.html, https://gate.ac.uk/sale/tao/splitch19.html#chap:ml
Depends on the kind of ML you are looking for.
There is the linguistic part of the problem (parsing documents, extracting entities, etc.) which can significantly improve the result, and the ML algorithms part.
For the latter, look at Apache Mahout, for example; it also comes with examples of document classification, and it is a good fit if you plan to deal with a lot of data. The Stanford classifier is also a good choice to start with.
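If you try the Stanford classifier, the bundled ColumnDataClassifier is probably the quickest way in. A rough sketch, modeled on that class's Javadoc example (the .prop file and the tab-separated train/test file names here are placeholders):

```java
import edu.stanford.nlp.classify.Classifier;
import edu.stanford.nlp.classify.ColumnDataClassifier;
import edu.stanford.nlp.ling.Datum;
import edu.stanford.nlp.objectbank.ObjectBank;

public class DocClassifierSketch {
    public static void main(String[] args) {
        // The .prop file declares which tab-separated column is the label and
        // which columns should be turned into which kinds of features.
        ColumnDataClassifier cdc = new ColumnDataClassifier("my-classifier.prop");
        Classifier<String, String> classifier =
                cdc.makeClassifier(cdc.readTrainingExamples("train.tsv"));

        for (String line : ObjectBank.getLineIterator("test.tsv", "utf-8")) {
            Datum<String, String> datum = cdc.makeDatumFromLine(line);
            System.out.println(line + " => " + classifier.classOf(datum));
        }
    }
}
```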
Both machine learning frameworks MALLET (http://mallet.cs.umass.edu/classification.php) and Weka (http://www.cs.waikato.ac.nz/ml/weka/) can do document classification. They are both easy to get started with compared to say Mahout or Spark.
I have a parsing problem that would be solved really well by a MEMM, but I have spent far too much time trying to find a good implementation of the algorithm (ideally in Java). Has anyone done this before? Alternatively, I could implement it myself if someone has some readable documentation.
Thanks!
(I have already tried Mallet and the trainer in the jar was unimplemented)
Have you looked into the Stanford NLP Group's CMMClassifier, found in the Stanford CoreNLP suite of NLP tools?
I'm afraid I cannot speak to the quality of the underlying MEMM implementation, but it is in Java, and I've used several other parts of Stanford NLP with relative success.
I find that sometimes the drawback of CoreNLP is its extensive object model and the very many dependencies that most modules have. When one wishes to focus on a single tool/class the distraction and learning curve associated with these dependencies can be annoying. On the other hand, this object model effectively corresponds to actual lower and mid-level processes which are common to many NLP tasks and hence can be quite useful.
What is your reason for thinking MEMMs are particularly good for your problem? Usually it is very hard to find theoretical justifications why something would work better than something else and the question is resolved empirically.
If you have Mallet already, try using the Conditional Random Field implementation. Recent research, starting with Lafferty, McCallum and Pereira's Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data shows that CRF is often superior to MEMM for sequence tagging.
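For what it's worth, the quickest way to try Mallet's CRF is its SimpleTagger front end. Something along these lines (paths and file names are placeholders, and the flags are the ones from Mallet's sequence-tagging tutorial, so double-check against your version); the input format is one token per line, each line being "feature1 feature2 ... label", with blank lines separating sequences:

```
# train a CRF and write the model file
java -cp "mallet.jar:mallet-deps.jar" cc.mallet.fst.SimpleTagger \
     --train true --model-file my-crf.model train.txt

# tag new data with the trained model
java -cp "mallet.jar:mallet-deps.jar" cc.mallet.fst.SimpleTagger \
     --model-file my-crf.model test.txt
```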
I'm planning to develop a program in Java which will provide a diagnosis. The data set is divided into two parts: one for training and the other for testing. My program should learn to classify from the training data (which, by the way, contains answers to 30 questions, each in its own column, with each record on a new line; the last column is the diagnosis, 0 or 1; in the testing part of the data the diagnosis column is empty; the data set contains about 1000 records) and then make predictions on the testing part of the data :/
I've never done anything similar, so I'd appreciate any advice or information about solutions to similar problems.
I was thinking about the Java Machine Learning Library or the Java Data Mining Package, but I'm not sure if that's the right direction... and I'm still not sure how to tackle this challenge...
Please advise.
All the best!
I strongly recommend you use Weka for your task.
It's a collection of machine learning algorithms with a user-friendly front end that facilitates a lot of different kinds of feature and model selection strategies.
You can do a lot of really complicated stuff with it without having to do any coding or math.
The makers have also published a pretty good textbook that explains the practical aspects of data mining.
Once you get the hang of it, you can use its API to integrate any of its classifiers into your own Java programs.
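When you get to that point, a minimal sketch of training and evaluating a classifier on an ARFF export of the 30-question data might look like this (the file name and the choice of NaiveBayes are just examples; any Weka classifier can be dropped in):

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DiagnosisSketch {
    public static void main(String[] args) throws Exception {
        // Load the training data; the last attribute is the 0/1 diagnosis.
        Instances data = new DataSource("diagnosis-train.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation to estimate accuracy before touching the test set.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Train on all of the training data for making predictions later.
        Classifier classifier = new NaiveBayes();
        classifier.buildClassifier(data);
        System.out.println("Prediction for first instance: " + classifier.classifyInstance(data.instance(0)));
    }
}
```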
As Gann Bierner said, this is a classification problem. The best classification algorithm for your needs that I know of is Ross Quinlan's decision tree algorithm. It's conceptually very easy to understand.
For off-the-shelf implementations of classification algorithms, the best bet is Weka: http://www.cs.waikato.ac.nz/ml/weka/. I have studied Weka but not used it, as I discovered it a little too late.
I used a much simpler implementation called jaDTi. It works pretty well for smaller data sets such as yours. I have used it quite a bit, so I can say that with confidence. jaDTi can be found at:
http://www.run.montefiore.ulg.ac.be/~francois/software/jaDTi/
Having said all that, your challenge will be building a usable interface over the web. For that, the dataset alone will be of limited use: it basically works on the premise that you already have the training set, you feed in the new test dataset in one step, and you get the answer(s) immediately.
But my application, and probably yours too, was a step-by-step user discovery, with features to go back and forth over the decision tree nodes.
To build such an application, I created a PMML document from my training set and built a Java engine that traverses each node of the tree, asks the user for an input (text/radio/list), and uses the values as inputs to the next possible node predicate.
The PMML standard can be found here: http://www.dmg.org/. You need the TreeModel only. The NetBeans XML plugin is a good schema-aware editor for PMML authoring. Altova XML can do a better job, but costs $$.
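For orientation, a PMML TreeModel is essentially nested nodes with predicates. A heavily trimmed, hypothetical fragment for a single yes/no question might look roughly like this (field names and values are placeholders; check the DMG schema for the full required structure):

```xml
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header/>
  <DataDictionary numberOfFields="2">
    <DataField name="question1" optype="categorical" dataType="string"/>
    <DataField name="diagnosis" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="question1"/>
      <MiningField name="diagnosis" usageType="target"/>
    </MiningSchema>
    <Node score="0">
      <True/>
      <Node score="1">
        <SimplePredicate field="question1" operator="equal" value="yes"/>
      </Node>
      <Node score="0">
        <SimplePredicate field="question1" operator="equal" value="no"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
```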
It is also possible to use an RDBMS to store your dataset and create the PMML automagically! I have not tried that.
Good luck with your project, please feel free to let me know if you need further inputs.
There are various algorithms that fall into the category of "machine learning", and which is right for your situation depends on the type of data you're dealing with.
If your data essentially consists of mappings of a set of questions, each of which can be answered yes/no, to a diagnosis, then I think methods that could potentially work include neural networks and methods for automatically building a decision tree from the training data.
I'd have a look at some of the standard texts such as Russell & Norvig ("Artificial Intelligence: A Modern Approach") and other introductions to AI/machine learning and see if you can easily adapt the algorithms they mention to your particular data. See also O'Reilly, "Programming Collective Intelligence" for some sample Python code of one or two algorithms that might be adaptable to your case.
If you can read Spanish, the Mexican publishing house Alfaomega has also published various good AI-related introductions in recent years.
This is a classification problem, not really data mining. The general approach is to extract features from each data instance and let the classification algorithm learn a model from the features and the outcome (which for you is 0 or 1). Presumably each of your 30 questions would be its own feature.
There are many classification techniques you can use. Support vector machines are popular, as is maximum entropy. I haven't used the Java Machine Learning Library, but at a glance I don't see either of these. The OpenNLP project has a maximum entropy implementation. LibSVM has a support vector machine implementation. You'll almost certainly have to convert your data into something the library can understand.
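To make that conversion concrete for LibSVM: it expects one record per line in a sparse "label index:value" format, with 1-based feature indices and zero-valued features optionally omitted. Two hypothetical records of the 30-answer data (label first, i.e. the 0/1 diagnosis) could look like:

```
1 1:1 4:1 7:1 29:1
0 2:1 3:1 30:1
```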
Good luck!
Update: I agree with the other commenter that Russell and Norvig is a great AI book which discusses some of this. Bishop's "Pattern Recognition and Machine Learning" discusses classification issues in depth if you're interested in the down-and-dirty details.
Your task is a classic one for neural networks, which are intended first of all to solve exactly this kind of classification task. A neural network has a rather simple implementation in any language, and it is the "mainstream" of machine learning, closer to AI than anything else.
You just implement (or get an existing implementation of) a standard neural network, for example a multilayer network trained by error backpropagation, and feed it the learning examples in a loop. After some time of such training it will work on real examples.
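To show how little code that can mean, here is a toy single-hidden-layer network with sigmoid units and plain backpropagation for a 0/1 label. The sizes, learning rate and epoch count are arbitrary placeholders (the demo learns XOR of two binary inputs, not the asker's 30-question data), so treat it as a sketch rather than something tuned:

```java
import java.util.Arrays;
import java.util.Random;

/** Toy one-hidden-layer network trained with stochastic backpropagation. */
public class TinyNet {
    private final int in, hid;
    private final double[][] w1;   // [hid][in + 1]; last column holds the hidden bias
    private final double[] w2;     // [hid + 1]; last entry holds the output bias

    TinyNet(int inputs, int hidden, long seed) {
        in = inputs;
        hid = hidden;
        Random rnd = new Random(seed);
        w1 = new double[hid][in + 1];
        w2 = new double[hid + 1];
        for (double[] row : w1)
            for (int j = 0; j <= in; j++) row[j] = rnd.nextGaussian() * 0.5;
        for (int h = 0; h <= hid; h++) w2[h] = rnd.nextGaussian() * 0.5;
    }

    private static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    /** Forward pass; fills the hidden activations and returns the network output. */
    double predict(double[] x, double[] hidden) {
        for (int h = 0; h < hid; h++) {
            double z = w1[h][in];                          // hidden bias
            for (int j = 0; j < in; j++) z += w1[h][j] * x[j];
            hidden[h] = sigmoid(z);
        }
        double z = w2[hid];                                // output bias
        for (int h = 0; h < hid; h++) z += w2[h] * hidden[h];
        return sigmoid(z);
    }

    /** One stochastic-gradient step on a single (x, label) example, squared-error loss. */
    void train(double[] x, double label, double lr) {
        double[] hidden = new double[hid];
        double out = predict(x, hidden);
        double deltaOut = (out - label) * out * (1 - out); // dLoss/dz at the output
        for (int h = 0; h < hid; h++) {
            double deltaHid = deltaOut * w2[h] * hidden[h] * (1 - hidden[h]);
            w2[h] -= lr * deltaOut * hidden[h];
            for (int j = 0; j < in; j++) w1[h][j] -= lr * deltaHid * x[j];
            w1[h][in] -= lr * deltaHid;                    // hidden bias update
        }
        w2[hid] -= lr * deltaOut;                          // output bias update
    }

    public static void main(String[] args) {
        // Tiny stand-in for yes/no answers: learn XOR of two binary inputs.
        double[][] xs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] ys = {0, 1, 1, 0};
        TinyNet net = new TinyNet(2, 4, 42);
        for (int epoch = 0; epoch < 20000; epoch++)
            for (int i = 0; i < xs.length; i++) net.train(xs[i], ys[i], 0.5);
        for (double[] x : xs)
            System.out.printf("%s -> %.2f%n", Arrays.toString(x), net.predict(x, new double[4]));
    }
}
```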
You can read more about neural networks starting from here:
http://en.wikipedia.org/wiki/Neural_network
http://en.wikipedia.org/wiki/Artificial_neural_network
You can also find links to many ready-made implementations here:
http://en.wikipedia.org/wiki/Neural_network_software