Machine Learning Library Specialized for documents - java

I am working on a project and need a machine learning library written in Java that specializes in document classification. Can anyone suggest some examples?

Here are two well-known Java libraries:
Stanford CoreNLP - http://nlp.stanford.edu/software/classifier.shtml
GATE - http://osdir.com/ml/ai.gate.general/2007-05/msg00003.html, https://gate.ac.uk/sale/tao/splitch19.html#chap:ml

It depends on the kind of ML you are looking for.
There are two parts to the problem: the linguistic part (parsing documents, extracting entities, etc.), which can significantly improve the results, and the ML algorithms part.
For the latter, look at Apache Mahout, for example, especially if you plan to deal with a lot of data; it ships with document classification examples. The Stanford Classifier is also a good choice to start with.

The machine learning frameworks MALLET (http://mallet.cs.umass.edu/classification.php) and Weka (http://www.cs.waikato.ac.nz/ml/weka/) can both do document classification. Both are easy to get started with compared to, say, Mahout or Spark.
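To give a feel for what these libraries do under the hood, here is a minimal sketch of a multinomial Naive Bayes document classifier in plain Java. It is only an illustration: the class name TinyNaiveBayes and its methods are made up for this example and are not the API of MALLET, Weka, Mahout, or the Stanford Classifier, all of which offer far better tokenization, feature handling, and evaluation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a tiny multinomial Naive Bayes text classifier.
// All names here are hypothetical, not taken from any library mentioned above.
class TinyNaiveBayes {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercase the text and split on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{Nd}]+");
    }

    // Add one labeled document to the model.
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Return the label with the highest log-probability (add-one smoothing).
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            double denom = totalWords.getOrDefault(label, 0) + vocabulary.size();
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                score += Math.log((counts.getOrDefault(token, 0) + 1) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("sports", "the team won the match");
        nb.train("sports", "a great goal in the final game");
        nb.train("tech", "the new phone has a fast processor");
        nb.train("tech", "software update fixes the bug");
        System.out.println(nb.classify("the team scored a goal")); // prints "sports"
        System.out.println(nb.classify("phone software bug"));     // prints "tech"
    }
}
```

For real work you would hand the same kind of labeled documents to one of the libraries above instead of rolling your own.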

Related

NLP - Named Entity Recognition

Which algorithm does Named Entity Recognition (NER) use? I mean, how does it match and tag all the entities?
There are many algorithms for Named Entity Recognition. The most widely used is the Conditional Random Field (CRF). Libraries that provide NER include:
Stanford CoreNLP for Java
spaCy
NLTK
Another very effective approach is deep learning with a recurrent neural network such as an LSTM.
http://nlp.town/blog/ner-and-the-road-to-deep-learning/
https://towardsdatascience.com/deep-learning-for-ner-1-public-datasets-and-annotation-methods-8b1ad5e98caf
There are many other articles and papers out there.
Hope this will help :)
NER can be performed by different algorithms, from simple string matching using grep: http://labs.fc.ul.pt/mer/ to advanced machine learning techniques: https://nlp.stanford.edu/software/CRF-NER.shtml
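As a rough illustration of the simple end of that spectrum, here is a sketch of dictionary (gazetteer) matching in plain Java. It is not how the linked tools are implemented; the class DictionaryNer and its methods are invented for this example, and it assumes whitespace-separated tokens and a hand-built list of known entities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: NER by longest-match lookup in a hand-built dictionary.
// The class and method names are hypothetical.
class DictionaryNer {
    // Maps a normalized phrase to a display string such as "PERSON: Barack Obama".
    private final Map<String, String> gazetteer = new HashMap<>();
    private int maxTokens = 1;

    // Lowercase, drop punctuation, and collapse runs of whitespace.
    static String normalize(String s) {
        return s.toLowerCase()
                .replaceAll("[^\\p{L}\\p{Nd}\\s]", "")
                .trim()
                .replaceAll("\\s+", " ");
    }

    void add(String phrase, String type) {
        String key = normalize(phrase);
        gazetteer.put(key, type + ": " + phrase);
        maxTokens = Math.max(maxTokens, key.split(" ").length);
    }

    // Scan left to right, always preferring the longest dictionary match.
    List<String> tag(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            for (int len = Math.min(maxTokens, tokens.length - i); len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                String hit = gazetteer.get(normalize(candidate));
                if (hit != null) {
                    found.add(hit);
                    matched = len;
                    break;
                }
            }
            i += Math.max(matched, 1);
        }
        return found;
    }

    public static void main(String[] args) {
        DictionaryNer ner = new DictionaryNer();
        ner.add("Barack Obama", "PERSON");
        ner.add("New York", "LOCATION");
        // prints [PERSON: Barack Obama, LOCATION: New York]
        System.out.println(ner.tag("Barack Obama visited New York."));
    }
}
```

This only finds entities that are already in the dictionary, which is exactly the limitation that statistical models such as CRFs are meant to overcome.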
Basically, it depends on the approach you are following.
There are many approaches to Named Entity Recognition. Most typically, the Conditional Random Field (CRF) algorithm is used. Another very effective approach is deep learning with a recurrent neural network such as an LSTM. There may be other algorithms as well.
Hope this will help :)
If you want to play around with a nice, pre-built bidirectional LSTM-CRF (currently one of the better-performing models) in Keras, I would recommend the Python package anaGo as a very quick way to try out an NER system.
Although the original question is tagged with Java, it is important to point out that you are going to find all the latest and greatest algorithms for this type of machine learning implemented in Python.
spaCy (a Python package) is also a very good way to get started, although I found it took a little more time to set something up quickly. Also, I'm not sure how easy it would be to alter its pre-built NER algorithm. I believe it uses CNNs.
A good overview of where things are at with NER:
A Survey on Recent Advances in Named Entity Recognition from Deep Learning models

OpenNLP vs Stanford CoreNLP

I've been doing a little comparison of these two packages and am not sure which direction to go in. What I am looking for briefly is:
Named Entity Recognition (people, places, organizations and such).
Gender identification.
A decent training API.
From what I can tell, OpenNLP and Stanford CoreNLP expose pretty similar capabilities. However, Stanford CoreNLP looks like it has a lot more activity whereas OpenNLP has only had a few commits in the last six months.
Based on what I saw, OpenNLP appears to be easier to train new models and might be more attractive for that reason alone. However, my question is what would others start with as the basis for adding NLP features to a Java app? I'm mostly worried as to whether OpenNLP is "just mature" versus semi-abandoned.
In full disclosure, I'm a contributor to CoreNLP, so this is a biased answer. Still, here is my view on your three criteria:
Named Entity Recognition: I think CoreNLP clearly wins here, both on accuracy and ease-of-use. For one, OpenNLP has a model per NER tag, whereas CoreNLP detects all tags with a single Annotator. Furthermore, temporal resolution with SUTime is a nice perk in CoreNLP. Accuracy-wise, my anecdotal experience is that CoreNLP does better on general-purpose text.
Gender identification. I think both tools are kind of poorly documented on this front. OpenNLP seems to have a GenderModel class; CoreNLP has a gender Annotator.
Training API. I suspect the OpenNLP training API is easier to use for training that is not off-the-shelf. But if all you want to do is, e.g., train a model from a CoNLL file, both should be straightforward. Training speed tends to be faster with CoreNLP than with other tools I've tried, but I haven't benchmarked it formally, so take that with a grain of salt.
A bit late here, but I was recently looking at OpenNLP based just on the fact that Stanford is GPL-licensed. If that's OK for your project, then Stanford is often referred to as the benchmark/state of the art for NLP.
That said, the performance for the pre-trained models will depend on your target text as it is very domain specific. If your target text is similar to the data that the models were trained against then you should get decent results, but if not then you will have to train the models yourself and it will depend on the training data.
A strength of OpenNLP is that it is very extensible, is written for easy use with other libraries, and has a good API for integration. Training is very simple with OpenNLP once you have your training data (I wrote about it here - with a pretty lousy generated data set I was able to get OK results identifying foods). It is also very configurable: you can set all the training parameters very easily, and there is a range of algorithms to choose from (perceptron, maximum entropy, and, in the snapshot version, Naive Bayes).
If you find that you do need to train the models yourself, I would consider trying out OpenNLP and seeing how it performs just for comparison, as with fine-tuning you can get pretty decent results.
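On the training data itself: as far as I recall from the OpenNLP manual, the name finder expects one whitespace-tokenized sentence per line, with entities wrapped in <START:type> ... <END> markers. The sketch below is plain Java with no OpenNLP dependency; NameSampleFormatter and Span are hypothetical helpers made up for this example, and it assumes the spans do not overlap. Check the manual for your OpenNLP version before relying on the exact format.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: turns tokens plus entity spans into one line of
// name finder training data. Helper names are hypothetical, not OpenNLP API.
class NameSampleFormatter {
    // One entity covering tokens [start, end), e.g. type "person".
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Assumes spans do not overlap; tokens are joined with single spaces.
    static String format(String[] tokens, List<Span> spans) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (Span s : spans) {
                if (s.start == i) parts.add("<START:" + s.type + ">");
            }
            parts.add(tokens[i]);
            for (Span s : spans) {
                if (s.end == i + 1) parts.add("<END>");
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String[] tokens = {"Pierre", "Vinken", ",", "61", "years", "old"};
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(0, 2, "person"));
        // prints: <START:person> Pierre Vinken <END> , 61 years old
        System.out.println(format(tokens, spans));
    }
}
```

A file of such lines is what you would then feed to the OpenNLP training tools.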
That depends on your purpose and needs. As far as I know, OpenNLP is released under the permissive Apache license, whereas CoreNLP is GPL-licensed.
But if you look at accuracy, Stanford CoreNLP detects more accurately than OpenNLP. I recently compared the two on part-of-speech (POS) tagging, which is the most important part of any NLP task, and in my analysis the winner was CoreNLP.
For NER as well, CoreNLP gives more accurate results than OpenNLP.
So if you are just starting, you can begin with OpenNLP and migrate to Stanford CoreNLP later if needed.

Semantic search system in Java

I want to implement a semantic search system in Java. Sesame will be embedded in my system to store and manipulate RDF data directly, and I want to use Tomcat, JSP, and servlets. But I also need to do natural language processing, which I know Python is really good at. So is there any way to integrate Python code into my Java web code? Or are there any good tools for NLP in Java?
I think I'm a little confused since I know little about the NLP area.
Thanks in advance!
Apache Lucene is the best: http://lucene.apache.org/core/ - also read about its extensions, such as Solr.
NLP can be done in Java too. Here is the link:
http://opennlp.apache.org/
And if you want to use Python for NLP, you can use NLTK.

library for text classification in java

I have a set of categorized text files. I want to categorize another large set of text files to use in my research. Is there a good way to compare them?
I think SVM based methods are useful but is there a simple and documented library for using such algorithms?
I don't know much about SVM, but LingPipe might be really helpful for you. The link is a tutorial specifically about categorization of documents (automatic or guided).
Also, look into the inter-related search products Lucene (a search library), Solr (search server app), and Carrot2 (for 'clustering' search results). There should be some interesting work in that space for you.
MALLET is another awesome library to look into. It has good command-line tools to help you get started and a Java API for when you start integrating it with the rest of your system.

Perl or Java Sentiment Analysis

I was wondering if anybody knew of any good Perl modules and/or Java classes for sentiment analysis. I have read about LingPipe, but the program would eventually be used commercially, so something open-source would be better. I also looked into GATE, but their documentation on sentiment analysis is sparse at best.
Have a look at Rate_Sentiment in the WebService::GoogleHack module at CPAN. There's more information about the project at SourceForge.
I just added a sentiment analysis library to my Social Media Analytics Research Toolkit. The blog post / announcement is here. It's in R, not in Java, but there's a good interface between R and Java in the toolkit, so you can write your "glue code" in Java to call the R library. There's also an R - Python interface in the toolkit.
There's supposed to be an R / Perl interface too, but I haven't been able to contact the maintainer about bugs, so I took it out of the build.
You might want to take a look at LingPipe (Java) based sentiment analysis at:
http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html
and GATE (http://gate.ac.uk/sentiment/)
For more generalized NLP parsers see The Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml), NLTK (Python) (http://www.nltk.org/), etc.
I'm not aware of any similar open source tools for Perl, although there are some good basic references out there to get you started, e.g.:
Bilisoly, R. (2008) Practical Text Mining with Perl. Wiley. ISBN 978-0-470-17643-6.
