Sentence Classification (Categorization) - java

I have been reading about text classification and found several Java tools that are available for classification, but I am still wondering: is text classification the same as sentence classification?
Is there any tool which focuses on sentence classification?

There's no formal difference between 'text classification' and 'sentence classification'. After all, a sentence is a type of text. But generally, when people talk about text classification, IMHO they mean larger units of text such as an essay, review or speech. Classifying a politician's speech as Democrat or Republican is a lot easier than classifying a tweet. When you have a lot of text per instance, you don't need to squeeze each training instance for all the information it can give you, and you can get pretty good performance out of a bag-of-words naive Bayes model.
Basically, you might not get the required performance numbers if you throw off-the-shelf Weka classifiers at a corpus of sentences. You might have to augment the data in the sentence with POS tags, parse trees, word ordering, n-grams, etc. Also gather any related metadata, such as creation time, creation location, attributes of the sentence's author, etc. Obviously, all of this depends on what exactly you are trying to classify: the features that will work for you need to be intuitively meaningful to the problem at hand.
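For reference, a minimal sketch of the off-the-shelf baseline mentioned above, using Weka's StringToWordVector filter followed by a plain naive Bayes classifier. The class names, labels and example sentences are placeholders for illustration, not anything prescribed by the answer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SentenceBaseline {
    public static void main(String[] args) throws Exception {
        // One string attribute for the raw sentence, one nominal class attribute
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("text", (List<String>) null));
        atts.add(new Attribute("label", new ArrayList<>(Arrays.asList("positive", "negative"))));
        Instances data = new Instances("sentences", atts, 0);
        data.setClassIndex(1);

        addSentence(data, "the plot was wonderful and the acting superb", "positive");
        addSentence(data, "a dull, predictable film", "negative");

        // Bag-of-words representation, then a naive Bayes model on top of it
        StringToWordVector bow = new StringToWordVector();
        bow.setInputFormat(data);
        Instances vectorized = Filter.useFilter(data, bow);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(vectorized);
    }

    private static void addSentence(Instances data, String sentence, String label) {
        double[] vals = new double[data.numAttributes()];
        vals[0] = data.attribute(0).addStringValue(sentence);
        vals[1] = data.attribute(1).indexOfValue(label);
        data.add(new DenseInstance(1.0, vals));
    }
}
```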

Related

Pre trained vectors, nlp, word2vec, word embedding for particular topic?

Is there any pretrained vector set for a particular topic only? For example "java": I want vectors related to Java in a file. I mean, if I give the input 'inheritance', then cosine similarity should show me 'polymorphism' and other related terms only.
I am using GoogleNews-vectors-negative300.bin and GloVe vectors as the corpus, but I am still not getting related words.
Not sure if I understand your question/problem statement, but if you want to work with a corpus of Java source code you can use code2vec, which provides pre-trained word-embedding models. Check it out: https://code2vec.org/
Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so its vector for 'Java' may be dominated by stories about the Indonesian island of Java as much as the programming language. And many other vector-sets are trained on Wikipedia text, which will be dominated by usages in that particular reference-style of writing. But there could be other sets that better emphasize the word-senses you need.
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
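As a rough illustration of the "train your own vectors" route in Java, here is a minimal sketch using deeplearning4j's Word2Vec implementation; the library choice, corpus file name, and hyperparameters are assumptions for the example, not requirements:

```java
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

import java.io.File;
import java.util.Collection;

public class DomainWord2Vec {
    public static void main(String[] args) throws Exception {
        // One sentence per line, drawn from your own Java-programming corpus
        SentenceIterator iter = new BasicLineIterator(new File("java-corpus.txt"));
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor()); // lowercase, strip punctuation

        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // ignore words with too few usage examples
                .layerSize(300)        // dimensionality of the learned vectors
                .windowSize(5)
                .iterate(iter)
                .tokenizerFactory(tokenizer)
                .build();
        vec.fit();

        // Nearest neighbours should now reflect the programming sense of the word
        Collection<String> similar = vec.wordsNearest("inheritance", 10);
        System.out.println(similar);
    }
}
```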
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' - which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word-phrases that should be combined; (2) statistical analysis of word-pairs that seem to occur so often together, they should be combined; (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
(The multiword phrases in the GoogleNews set were created via a statistical algorithm which is also available in the gensim Python library as the Phrases class. But the exact parameters Google used have not, as far as I know, been revealed. And good results from this algorithm can require a lot of data and tuning, and still produce some combinations that a person would consider nonsense, while missing others that a person would consider natural.)

Text Classification, How to convert text strings to vector representation

I am working on a text classification program. My training data is 700+ text categories, and each category contains 1-350 text phrases: 16k+ phrases total. The data that needs to be classified consists of text phrases. I want to classify the data so it gives me the 5 most similar categories. The training data shares a lot of common words.
My first attempt used Naive Bayes, via this library on GitHub, because the library was very easy to use and allowed me to load my training data as strings. But other users reported issues, and when I tried to classify my data, my input was either classified wrong or not classified at all.
https://github.com/ptnplanet/Java-Naive-Bayes-Classifier
So I think the library was the issue, and I'm going to try different libraries and look into k-means clustering, since my data has high variance.
When I looked at other libraries, they all require input and training data as a vector matrix. I looked at word2vec and tf-idf to convert text to vectors. I understand tf-idf, and that I can get the weight of a word compared to the rest of the documents. But how can I use it to classify my input data into categories? Would each category be a document? Or would all categories be a single document?
Edit: data sample
SEE_BILL-see bill
SEE_BILL-bill balance
SEE_BILL-wheres my bill
SEE_BILL-cant find bill
PAY_BILL-pay bill
PAY_BILL-make payment
PAY_BILL-lower balance
PAY_BILL-remove balance
PAST_BILL-last bill
PAST_BILL-previous bill
PAST_BILL-historical bill
PAST_BILL-bill last year
First of all, the end of your question doesn't make much sense, because you didn't say what classes you want to classify the text phrases into. Still, I can help you with the vectorization of the text phrases.
Tf-idf is pretty good, but you have to do good preprocessing for it, and you would have to create the vectors yourself. The problem is that the vectors will have the length of all the distinct words in your dataset, including every different form in which a word occurs. So if you have the word go in your dataset, it's likely there will be several forms of that word, including going, Go, gone, went and so on. That's why you need good preprocessing to reduce all of those forms of go to its root form. Also, you have to lowercase the whole dataset, because go and Go are otherwise not the same word. But even if you do all of that and build a perfect preprocessing pipeline, you will end up with vectors of length 20k+. You would then have to manually select the features (words) you want to keep in the vector and delete the others. That means, if you want vectors of size 300, you would have to delete 19,700 words from the vector, keeping of course the 300 most distinctive ones. If you want to dive into it deeper and see exactly how it works, you can check it out here
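To make the tf-idf idea concrete, here is a rough, self-contained Java sketch of computing tf-idf weights over a handful of phrases; the example phrases and the simple whitespace tokenization are placeholders, not a full preprocessing pipeline:

```java
import java.util.*;

public class TfIdfSketch {
    public static void main(String[] args) {
        List<String> docs = Arrays.asList("see bill", "pay bill", "bill balance");

        // Document frequency: in how many phrases does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            for (String term : new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")))) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // tf-idf weight for each term of each phrase: tf * log(N / df)
        int n = docs.size();
        for (String doc : docs) {
            String[] terms = doc.toLowerCase().split("\\s+");
            Map<String, Double> weights = new HashMap<>();
            for (String term : terms) {
                double tf = 0;
                for (String t : terms) if (t.equals(term)) tf++;
                weights.put(term, tf * Math.log((double) n / df.get(term)));
            }
            System.out.println(doc + " -> " + weights);
        }
    }
}
```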
On the other hand, word2vec maps any word to a 300-dimensional vector. Of course, you would have to do some preprocessing, similar to tf-idf, but this method is much less sensitive to it. You can find how word2vec works here
In conclusion, I would recommend you go with word2vec because it's much easier to start with. There is a pretrained model from Google which you can download here
The two most popular approaches would be to:
represent each phrase/sentence as a bag of words, where you basically one-hot encode each word of the phrase and the dimension of the encoding is the size of your vocabulary (total number of words)
use embeddings from popular models, like word2vec, which place each word in an X-dimensional vector space (e.g. 300-dimensional), so each of your phrases/sentences becomes a sequence of X-dimensional vectors
An even more extreme approach would be to embed whole sentences using models like the universal-sentence-encoder. In short: it's similar to word2vec, but instead of words it converts whole sentences to a (512-dimensional) vector space. Then it's easier to find "similar" sentences.
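One common way to turn a word-level representation into a category prediction is to average the word vectors of a phrase and compare it, by cosine similarity, to the averaged vector (centroid) of each category's training phrases. A rough sketch, assuming you already have a word-to-vector lookup; the embeddings and categoryCentroids maps below are hypothetical inputs:

```java
import java.util.*;
import java.util.stream.Collectors;

public class NearestCategory {

    // Average the vectors of the words in a phrase (skipping out-of-vocabulary words)
    static double[] phraseVector(String phrase, Map<String, double[]> embeddings, int dim) {
        double[] sum = new double[dim];
        int count = 0;
        for (String word : phrase.toLowerCase().split("\\s+")) {
            double[] v = embeddings.get(word);
            if (v == null) continue;
            for (int i = 0; i < dim; i++) sum[i] += v[i];
            count++;
        }
        if (count > 0) for (int i = 0; i < dim; i++) sum[i] /= count;
        return sum;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Return the k categories whose centroid is most similar to the input phrase.
    // Each centroid is built the same way: average phraseVector over the category's phrases.
    static List<String> topCategories(String input, Map<String, double[]> categoryCentroids,
                                      Map<String, double[]> embeddings, int dim, int k) {
        double[] query = phraseVector(input, embeddings, dim);
        return categoryCentroids.entrySet().stream()
                .sorted((a, b) -> Double.compare(cosine(query, b.getValue()), cosine(query, a.getValue())))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```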

Text clustering program in java

I am developing a project in core Java in which I take some 100 lines of text from the user. Now, I want to break the whole text into clusters, where each cluster relates to a keyword. For example, suppose I have text like:
"Java is an object oriented language. It uses classes for modularisation. bla bla bla...
C++ is also an object oriented language. bla bla bla...
Something about OOPS concepts here..."
Now, if I give this whole text as input to the program, I want the program to create directories named after the keywords, and it should also choose the keywords on its own. I would expect the keywords in this text to be Java, Modularisation, C++, OOPS. In the later stages of this program, I will be dealing with different texts, so I have to make this program intelligent enough to understand which words are keywords and which are not, so that it can work with any piece of text.
I have looked in many places, asked many people, and watched many tutorials, only to find that they mostly cluster numerical data; rarely does anyone deal with text clustering. I am looking for an algorithm or an approach that can do this.
Thanks
The reason you are only finding such tutorials is that machine-learning algorithms need numerical data. So you have to convert your data into a numerical format.
There are a number of algorithms for creating a numerical representation of text, for example the Levenshtein distance.
With such a distance measure you have a numerical representation, and the clustering algorithms become applicable.
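For reference, a minimal Java sketch of the Levenshtein (edit) distance mentioned above, using the standard dynamic-programming formulation:

```java
public class Levenshtein {
    // Number of single-character insertions, deletions and substitutions
    // needed to turn one string into the other
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("modularisation", "modularization")); // 1
    }
}
```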
For example, you can use the k-means algorithm, or any other clustering algorithm, to cluster your text data.
You should also google a bit about text mining; there are many good examples on the web. This link could be a good resource
There are a variety of approaches you can use to pre-process your text and then cluster the processed data. An example would be to generate the bag-of-words representation of the text and then apply clustering methods.
However, I would personally choose LDA topic modeling. This algorithm by itself does not 'cluster' your text, but it can be used as a pre-processing step for text clustering. It is another unsupervised approach that gives you a list of 'topics' associated with a set of documents or sentences. Each topic is actually a set of words that are deemed relevant to each other based on how they appear in the underlying text. For instance, the following are three topics extracted from a set of tweets:
food, wine, beer, lunch, delicious, dining
home, real estate, house, tips, mortgage, real estate
stats, followers, unfollowers, checked, automatically
Then you can calculate the probability of a sentence belonging to each of these topics by counting the number of times these words appear in the sentence and dividing by the total word count. Finally, these probability values can be used for text clustering. I should also note that the words generated by LDA are weighted, so you can use the one with the largest weight as your main keyword. For instance, 'food', 'home', and 'stats' have the largest weights in the above lists, respectively.
For an LDA implementation, check out the Mallet library, developed in Java.
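As a starting point, here is a rough sketch of fitting an LDA model with Mallet. The pipeline and parameter values follow Mallet's usual topic-modeling examples, and the input strings are placeholders:

```java
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.StringArrayIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.regex.Pattern;

public class LdaSketch {
    public static void main(String[] args) throws Exception {
        String[] docs = {
            "Java is an object oriented language. It uses classes for modularisation.",
            "C++ is also an object oriented language."
        };

        // Standard Mallet preprocessing: lowercase, tokenize, map tokens to feature ids
        ArrayList<Pipe> pipes = new ArrayList<>();
        pipes.add(new CharSequenceLowercase());
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipes.add(new TokenSequence2FeatureSequence());

        InstanceList instances = new InstanceList(new SerialPipes(pipes));
        instances.addThruPipe(new StringArrayIterator(docs));

        // 4 topics; 1.0 and 0.01 are common starting values for alphaSum and beta
        ParallelTopicModel model = new ParallelTopicModel(4, 1.0, 0.01);
        model.addInstances(instances);
        model.setNumIterations(1000);
        model.estimate();

        // Per-document topic proportions, usable as features for clustering
        double[] topicDistribution = model.getTopicProbabilities(0);
        System.out.println(Arrays.toString(topicDistribution));
    }
}
```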

How to train chunker in OpenNLP to Predict sequence of Words

In my project I need to predict word sequences in a sentence. I used OpenNLP's sentence detection and tokenization with their trained models. But I need to classify a sequence of words in a sentence as one token for my related group, and their chunker is not predicting those patterns.
For example, if my group is Food items, then the chunker should predict
chicken pizza as one token.
Can anybody explain how to train their model for our domain?
OpenNLP is open source; a quick poke through the source code shows me that they're using a naive Bayes classifier [source here]. Somewhere in there will be the code they used to train it. That will tell you both how to train it and what type of corpus you need.
Re-training it will not be an afternoon project, though; these things tend to be time-sinks. So, depending on what you're doing, it may be a better use of your time to use their classifier as-is, even if it is not exactly what you're looking for. I'm not sure exactly what you're trying to do, but it may be possible to use some hack, like co-occurrence scores between your sequences of words (i.e. how often "chicken" and "pizza" appear together), as an approximation of what you're hoping to do with a re-trained classifier.
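To illustrate the co-occurrence hack, here is a tiny Java sketch that counts how often two words appear next to each other in a corpus relative to how often they appear at all; a high ratio suggests they behave like a single unit. The sentence list and the crude scoring formula are just placeholders:

```java
import java.util.*;

public class AdjacencyScore {
    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
            "I ordered a chicken pizza for lunch",
            "the chicken pizza was cold",
            "she cooked chicken and baked a pizza");

        Map<String, Integer> wordCount = new HashMap<>();
        Map<String, Integer> pairCount = new HashMap<>();
        for (String sentence : sentences) {
            String[] tokens = sentence.toLowerCase().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                wordCount.merge(tokens[i], 1, Integer::sum);
                if (i + 1 < tokens.length) {
                    pairCount.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
                }
            }
        }

        // Crude "phrase-ness" score: adjacent count relative to the rarer word's count
        String first = "chicken", second = "pizza";
        int together = pairCount.getOrDefault(first + " " + second, 0);
        int rarer = Math.min(wordCount.getOrDefault(first, 1), wordCount.getOrDefault(second, 1));
        System.out.println("score(" + first + " " + second + ") = " + (double) together / rarer);
    }
}
```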

How to validate a chapter heading for text using fuzzy logic in Java

I need a solution for identifying incorrect chapter headings in a book.
We are developing an ingestion system for books that does all sorts of validation, like spell-checking and offensive-language-filtering. Now we'd like to flag chapter headings that seem inaccurate given the chapter body. For example, if the heading was "The Function of the Spleen", I would not expect the chapter to be about the liver.
I am familiar with fuzzy string matching algorithms, but this seems more like an NLP or classification problem. If I could match (or closely match) the phrase "function of the spleen", that's great: high confidence. Otherwise, a high occurrence of both "function" and "spleen" in the text also yields confidence. And of course, the closer together they are, the better.
This needs to be done in-memory, on the fly, and in Java.
My current naive approach is to simply tokenize all the words, remove noise words (like prepositions), stem what's left, and then count the number of matches. At a minimum I'd expect each word in the heading to appear at least once in the text.
Is there a different approach, ideally one that would take into account things like proximity and ordering?
I think it is a classification problem; as such, take a look at WEKA
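For comparison, the naive baseline described in the question (tokenize, drop stop words, stem, count how many heading words occur in the body) could look roughly like this in plain Java. The stop-word list and the crude suffix-stripping "stemmer" are placeholders, not a real stemmer such as Porter:

```java
import java.util.*;

public class HeadingCheck {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "of", "a", "an", "in", "and"));

    // Very crude stand-in for a real stemmer
    static String stem(String word) {
        return word.replaceAll("(ing|ed|es|s)$", "");
    }

    static Set<String> contentStems(String text) {
        Set<String> stems = new HashSet<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            stems.add(stem(token));
        }
        return stems;
    }

    // Fraction of heading words (after stop-word removal and stemming) found in the body
    static double headingCoverage(String heading, String body) {
        Set<String> headingStems = contentStems(heading);
        Set<String> bodyStems = contentStems(body);
        if (headingStems.isEmpty()) return 0;
        long found = headingStems.stream().filter(bodyStems::contains).count();
        return (double) found / headingStems.size();
    }

    public static void main(String[] args) {
        System.out.println(headingCoverage("The Function of the Spleen",
                "This chapter describes how the spleen functions as a blood filter."));
    }
}
```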
