I am working on a text classification program. My training data consists of 700+ text categories, and each category contains 1-350 text phrases, 16k+ phrases in total. The data that needs to be classified are text phrases. I want to classify the data so that it gives me the 5 most similar categories. The training data shares a lot of common words.
My first attempt used a Naive Bayes classifier via this library on GitHub, because it was very easy to use and let me load my training data as strings. But other users have reported issues, and when I tried to classify my data, my input was either classified wrong or not classified at all.
https://github.com/ptnplanet/Java-Naive-Bayes-Classifier
So I think the library was the issue, and I'm going to try different libraries and look into k-means clustering, since my data has high variance.
When I looked at other libraries, they all require the input and training data as vectors/matrices. I looked at word2vec and tf-idf to convert the text into vectors. I understand tf-idf, and that I can get the weight of a word compared to the rest of the documents. But how can I use it to classify my input data into categories? Would each category be a document, or would all categories be a single document?
Edit: data sample
SEE_BILL-see bill
SEE_BILL-bill balance
SEE_BILL-wheres my bill
SEE_BILL-cant find bill
PAY_BILL-pay bill
PAY_BILL-make payment
PAY_BILL-lower balance
PAY_BILL-remove balance
PAST_BILL-last bill
PAST_BILL-previous bill
PAST_BILL-historical bill
PAST_BILL-bill last year
First of all, the end of your question doesn't quite make sense, because you didn't say what classes you want to classify the text phrases into. But I can help you with the vectorization of the text phrases.
Tf-idf is pretty good, but you need good preprocessing to use it, and you have to create the vectors yourself. The problem is that the vector length will be the number of distinct words in your dataset, counting every form in which a word occurs. So if you have the word go in your dataset, there will likely be several forms of it, including going, Go, gone, went and so on. That's why you need good preprocessing to reduce all of those forms of go to its root form. You also have to lowercase the whole dataset, because go and Go would otherwise be treated as different words. Even if you do all of that and build a perfect preprocessing pipeline, you will still end up with a vector of length 20k+. You would then have to manually select the features (words) you want to keep in the vector and delete the others; that means if you want a vector of size 300, you would have to delete 19,700 words and keep the 300 most distinctive ones. If you want to dive deeper and see exactly how it works, you can check it out here
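If it helps, here is a minimal sketch of that kind of preprocessing in Java. The suffix-stripping "stemmer" is deliberately crude and only there to illustrate mapping word forms to a root; in practice you would use a real stemmer (Porter/Snowball or similar):

import java.util.*;

public class Preprocess {

    // crude placeholder stemmer, only for illustration
    static String stem(String w) {
        if (w.endsWith("ing")) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("Wheres my bill", "paying bills", "Going to pay");
        Set<String> vocabulary = new TreeSet<>();
        for (String phrase : phrases)
            for (String w : phrase.toLowerCase().split("\\W+"))   // lowercase + tokenize
                if (!w.isEmpty()) vocabulary.add(stem(w));
        System.out.println(vocabulary);   // [bill, go, my, pay, to, where]
        // each vocabulary entry becomes one dimension of the tf-idf vector
    }
}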
On the other hand, word2vec maps any word to a 300-dimensional vector. Of course, you would have to do some preprocessing, similar to tf-idf, but this method is much less sensitive to it. You can read about how word2vec works here
In conclusion, I would recommend going with word2vec because it's much easier to start with. There is a pretrained model from Google which you can download here
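If you are staying in Java, deeplearning4j can load that Google model; a rough sketch, assuming the file has been downloaded locally (the path is a placeholder, and method names may differ slightly between DL4J versions):

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;

import java.io.File;
import java.util.Collection;

public class LoadGoogleVectors {
    public static void main(String[] args) {
        // placeholder path: wherever you saved the downloaded model
        Word2Vec vec = WordVectorSerializer.readWord2VecModel(
                new File("GoogleNews-vectors-negative300.bin.gz"));

        double[] billVector = vec.getWordVector("bill");       // 300-dimensional vector
        Collection<String> nearest = vec.wordsNearest("bill", 5);
        double sim = vec.similarity("bill", "payment");

        System.out.println(billVector.length);   // 300
        System.out.println(nearest);
        System.out.println(sim);
    }
}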
The two most popular approaches would be to:
represent each phrase/sentence as a bag of words where you basically one-hot encode each word of the phrase and the dimension of the encoding is the dimension of your vocabulary (total number of words)
use embeddings based on popular models like word2vec, which puts each word in an X-dimensional vector space (e.g. 300-dimensional), so each of your phrases/sentences would be a sequence of X-dimensional vectors
An even more extreme approach would be to embed whole sentences using models like the universal-sentence-encoder. In short: it's similar to word2vec, but instead of words it converts whole sentences into a (512-dimensional) vector space. Then it's easier to find "similar" sentences.
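Whichever representation you pick, "similar" is usually measured with cosine similarity between the resulting vectors; a minimal plain-Java sketch, assuming you already have sentence vectors (e.g. averaged word vectors):

import java.util.List;

class SentenceSimilarity {

    // cosine similarity between two vectors of equal length
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // one crude sentence embedding: the average of the word vectors
    static double[] average(List<double[]> wordVectors) {
        double[] avg = new double[wordVectors.get(0).length];
        for (double[] v : wordVectors)
            for (int i = 0; i < avg.length; i++)
                avg[i] += v[i] / wordVectors.size();
        return avg;
    }
}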
I successfully followed the deeplearning4j.org tutorial on Word2Vec, so I am able to load an already trained model or train a new one on some raw text (more specifically, I am using the GoogleNews-vectors-negative300 and Emoji2Vec pre-trained models).
However, I would like to combine these two models for the following reason: given a sentence (for example, a comment from Instagram or Twitter that contains emoji), I want to identify the emoji in the sentence and then map each one to the word it is related to. In order to do that, I was planning to iterate over all the words in the sentence and calculate the closeness (how near the emoji and the word are located in the vector space).
I found code showing how to continue training (uptrain) an already existing model. However, it is mentioned that new words are not added in this case; only the weights of existing words are updated based on the new text corpus.
I would appreciate any help or ideas on the problem I have. Thanks in advance!
Combining two models trained from different corpuses is not a simple, supported operation in the word2vec libraries with which I'm most familiar.
In particular, even if the same word appears in both corpuses, and even in similar contexts, the randomization used by this algorithm during initialization and training, plus extra randomization injected by multithreaded training, mean that the same word may end up in wildly different places in the two models. It's only the relative distances/orientations with respect to other words that should be roughly similar – not the specific coordinates/rotations.
So merging two models requires translating one's coordinates into the other's. That in itself will typically involve learning a projection from one space to the other, then moving the unique words from the source space into the surviving space. I don't know if DL4J has a built-in routine for this; the Python gensim library has a TranslationMatrix example class in recent versions which can do this, motivated by the use of word-vectors for language-to-language translation.
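To make the idea of "learning a projection" concrete, here is a toy plain-Java sketch that fits a linear map with stochastic gradient descent. The random vectors below are stand-ins for the vectors of words shared by both models; in practice you would feed in those real pairs and probably use a proper least-squares solver (as gensim's TranslationMatrix does):

import java.util.Random;

public class ProjectionSketch {
    public static void main(String[] args) {
        int dim = 10, pairs = 200;
        Random rnd = new Random(42);

        // toy data: tgt = trueMap * src, so we know what should be recovered
        double[][] trueMap = randomMatrix(dim, rnd);
        double[][] src = new double[pairs][dim];
        double[][] tgt = new double[pairs][dim];
        for (int p = 0; p < pairs; p++) {
            for (int i = 0; i < dim; i++) src[p][i] = rnd.nextGaussian();
            tgt[p] = multiply(trueMap, src[p]);
        }

        // learn W so that W * src ≈ tgt, by SGD on the squared error
        double[][] w = randomMatrix(dim, rnd);
        double lr = 0.01;
        for (int epoch = 0; epoch < 200; epoch++) {
            for (int p = 0; p < pairs; p++) {
                double[] pred = multiply(w, src[p]);
                for (int i = 0; i < dim; i++) {
                    double err = pred[i] - tgt[p][i];
                    for (int j = 0; j < dim; j++)
                        w[i][j] -= lr * err * src[p][j];   // gradient step
                }
            }
        }
        // w now approximates trueMap; applied to real data it would map
        // source-model vectors into the target model's coordinate system
        System.out.println("learned vs true, first entry: " + w[0][0] + " vs " + trueMap[0][0]);
    }

    static double[][] randomMatrix(int dim, Random rnd) {
        double[][] m = new double[dim][dim];
        for (int i = 0; i < dim; i++)
            for (int j = 0; j < dim; j++)
                m[i][j] = rnd.nextGaussian() * 0.1;
        return m;
    }

    static double[] multiply(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++)
                out[i] += m[i][j] * v[j];
        return out;
    }
}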
I have to develop a project in core Java in which I take around 100 lines of text from the user. Now, I want to break the whole text into clusters, where each cluster relates to a keyword. For example, suppose I have text like:
"Java is an object oriented language. It uses classes for modularisation. bla bla bla...
C++ is also an object oriented language. bla bla bla...
Something about OOPS concepts here..."
Now, if I give this whole text as input to the program, I want it to create directories named after the keywords, and it should choose the keywords on its own. I expect the keywords in this text to be Java, Modularisation, C++, and OOPS. In later stages of this program I will be dealing with different texts, so I have to make the program intelligent enough to understand which words are keywords and which are not, so that it can work with any piece of text.
So, I have looked in many places, asked many people, and watched many tutorials, only to find that they mostly cluster numerical data; rarely does anyone deal with text clustering. I am looking for an algorithm or an approach that can do this.
Thanks
The reason you are only finding such tutorials is that machine-learning algorithms need numerical data. So you have to convert your data into a numerical format.
There are a number of algorithms for creating a numerical representation of text, for example the Levenshtein distance.
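For illustration, the classic dynamic-programming Levenshtein distance looks like this in Java:

// edit distance between two strings: number of insertions, deletions
// and substitutions needed to turn one into the other
static int levenshtein(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                        d[i][j - 1] + 1),   // insertion
                               d[i - 1][j - 1] + cost);     // substitution
        }
    }
    return d[a.length()][b.length()];
}

For example, levenshtein("java", "lava") returns 1.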
With such a distance measure you have a numerical representation, and clustering algorithms become applicable.
For example, you can use the k-means algorithm (or any other) to cluster your text data.
You should also google a bit about text mining; there are many good examples on the web. This link could be a good resource
There are a variety of approaches that you can use to pre-process your text and then cluster the processed data. An example would be to generate the bag-of-words representation of the text and then apply clustering methods.
However, I would personally choose LDA topic modeling. This algorithm by itself does not 'cluster' your text, but it can be used as a pre-processing step for text clustering. It is another unsupervised approach that gives you a list of 'topics' associated with a set of documents or sentences. These topics are actually sets of words that are deemed relevant to each other based on how they appear in the underlying text. For instance, the following are three topics extracted from a set of tweets:
food, wine, beer, lunch, delicious, dining
home, real estate, house, tips, mortgage, real estate
stats, followers, unfollowers, checked, automatically
Then you can calculate the probability of a sentence belonging to each of these topics by counting the number of times these words appear in the sentence and dividing by the total word count. Finally, these probability values can be used for text clustering. I should also note that the words generated by LDA are weighted, so you can use the one with the largest weight as your main keyword. For instance, 'food', 'home', and 'stats' have the largest weights in the above lists, respectively.
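To make that concrete, a tiny sketch of the counting step. The topic word lists are taken from the example above (splitting the "real estate" bigram into single words for simplicity), and the sentence is made up:

import java.util.*;

public class TopicScore {
    public static void main(String[] args) {
        Map<String, Set<String>> topics = new LinkedHashMap<>();
        topics.put("food", new HashSet<>(Arrays.asList(
                "food", "wine", "beer", "lunch", "delicious", "dining")));
        topics.put("home", new HashSet<>(Arrays.asList(
                "home", "real", "estate", "house", "tips", "mortgage")));
        topics.put("stats", new HashSet<>(Arrays.asList(
                "stats", "followers", "unfollowers", "checked", "automatically")));

        String[] words = "had a delicious lunch and some wine".toLowerCase().split("\\s+");

        for (Map.Entry<String, Set<String>> topic : topics.entrySet()) {
            int hits = 0;
            for (String w : words)
                if (topic.getValue().contains(w)) hits++;
            // topic-word occurrences divided by total word count
            System.out.println(topic.getKey() + ": " + hits / (double) words.length);
        }
    }
}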
For LDA implementation, check out Mallet library developed in Java.
I'm creating a mini search engine in Java which basically grabs all of the RSS feeds that a user specifies and then allows him or her to choose a single word to search for. Since the RSS feed documents are fairly limited in number, I'm thinking about processing the documents first before the user enters his or her search term. I want to process them by creating hashmaps linking certain keywords to a collection of records which contain the articles themselves and the number of times the word appears in the article. But, how would I determine the keywords? How can I tell which words are meaningless and which aren't?
The concept of "what words should I ignore?" is generally named stopwords. The best search engines do not use stopwords. If I am a fan of the band "The The", I would be bummed if your search engine couldn't find them. Also, searching for exact phrases can be screwed up by a naive stopwords implementation.
By the way, the hashmap you're talking about is called an inverted index. I recommend reading this (free, online) book to get an introduction to how search engines are built: http://nlp.stanford.edu/IR-book/information-retrieval-book.html
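To make the structure concrete, a minimal sketch of such an inverted index in plain Java (no particular library assumed):

import java.util.*;

public class InvertedIndex {
    // word -> (article id -> number of occurrences in that article)
    static Map<String, Map<String, Integer>> build(Map<String, String> articles) {
        Map<String, Map<String, Integer>> index = new HashMap<>();
        for (Map.Entry<String, String> article : articles.entrySet())
            for (String word : article.getValue().toLowerCase().split("\\W+"))
                if (!word.isEmpty())
                    index.computeIfAbsent(word, w -> new HashMap<>())
                         .merge(article.getKey(), 1, Integer::sum);
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> articles = new HashMap<>();
        articles.put("feed-item-1", "The bill arrived. Pay the bill today.");
        articles.put("feed-item-2", "Pay later, see the bill online.");
        Map<String, Map<String, Integer>> index = build(articles);
        System.out.println(index.get("bill"));   // per-article counts for "bill"
        System.out.println(index.get("pay"));    // per-article counts for "pay"
    }
}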
In Solr, I believe these are called 'stopwords'.
I believe they just use a text file to define all the words that they will not search on.
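If you follow that approach, loading the file into a set and filtering tokens against it only takes a few lines; a sketch, where stopwords.txt is just a placeholder name:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import java.util.stream.Collectors;

public class StopwordFilter {
    public static void main(String[] args) throws IOException {
        // one stopword per line in the file
        Set<String> stopwords = Files.readAllLines(Paths.get("stopwords.txt")).stream()
                .map(String::trim)
                .map(String::toLowerCase)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());

        List<String> kept = Arrays.stream("where is my last bill".split("\\s+"))
                .filter(w -> !stopwords.contains(w))
                .collect(Collectors.toList());
        System.out.println(kept);   // tokens that survive the stopword filter
    }
}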
A small extract about stopwords from the NLTK book, Ch. 2:
There is also a corpus of stopwords, that is, high-frequency words like the, to and also that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',
'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow',
'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', ...]
Stopwords are one thing you should use. Lots of stopword lists are available on the web.
However, I'm writing an answer because the previous ones didn't mention TF-IDF, which is a metric for how important a word is in the context of your corpus of documents.
A word is more likely to be a keyword for a document if it appears a lot in it (term frequency) and doesn't appear frequently in other documents (inverse document frequency). This way words like a, the, where are naturally ignored, because they appear in every document.
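A minimal sketch of that computation in Java (toy documents, natural-log IDF, no smoothing):

import java.util.*;

public class TfIdfSketch {
    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("see", "my", "bill"),
                Arrays.asList("pay", "my", "bill"),
                Arrays.asList("previous", "bill"));

        // document frequency: in how many documents each word appears
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String w : new HashSet<>(doc))
                df.merge(w, 1, Integer::sum);

        // tf-idf scores for the first document
        List<String> doc = docs.get(0);
        for (String w : new HashSet<>(doc)) {
            double tf = Collections.frequency(doc, w) / (double) doc.size();
            double idf = Math.log(docs.size() / (double) df.get(w));
            System.out.println(w + " = " + tf * idf);
        }
        // "bill" scores 0 (it appears in every document), "see" scores highest,
        // so "see" would be picked as the keyword for this document
    }
}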
P.S. On a related topic, you'll probably be interested in other lists too, e.g. swearwords :)
P.P.S. Hashmaps are a good thing, but you should also check suffix trees for your task.
I'm building a text classifier in java with Weka library.
First I remove stopwords, then I apply a stemmer (e.g. converting cars to car).
Right now I have 6 predefined categories. I train the classifier on 5 documents for every category. The lengths of the documents are similar. The results are OK when the text to be classified is short, but when the text is longer than 100 words the results get stranger and stranger.
I return the probabilities for each category as follows:
Probability:
[0.0015560238056109177, 0.1808919321002592, 0.6657404531908249, 0.004793498469427115, 0.13253647895234325, 0.014481613481534815]
which is a pretty reliable classification.
But when I use texts longer than around 100 words I get results like:
Probability: [1.2863123678314889E-5, 4.3728547754744305E-5, 0.9964710903856974, 5.539960514402068E-5, 0.002993481218084141, 4.234371196414616E-4]
Which is too good.
Right now I'm using Naive Bayes Multinomial for classifying the documents. I have read about it and found out that it can act strangely on longer texts. Might that be my problem?
Does anyone have a good idea why this is happening?
There can be multiple factors behind this behavior. If your training and test texts are not from the same domain, this can happen. Also, I believe adding more documents for every category should do some good; 5 documents per category seems very few. If you do not have more training documents, or it is difficult to get more, then you can synthetically add positive and negative instances to your training set (see the SMOTE algorithm for details). Keep us posted on the update.
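For reference, a rough sketch of the kind of Weka pipeline described in the question, which yields per-class probability distributions like the ones quoted above (the ARFF file names are placeholders):

import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class WekaTextClassifier {
    public static void main(String[] args) throws Exception {
        // train.arff: one string attribute for the text, one nominal class attribute
        Instances train = new DataSource("train.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        FilteredClassifier classifier = new FilteredClassifier();
        classifier.setFilter(new StringToWordVector());        // text -> word-count vector
        classifier.setClassifier(new NaiveBayesMultinomial());
        classifier.buildClassifier(train);

        Instances test = new DataSource("test.arff").getDataSet();
        test.setClassIndex(test.numAttributes() - 1);
        double[] probabilities = classifier.distributionForInstance(test.instance(0));
        for (double p : probabilities) System.out.println(p);   // one value per category
    }
}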
I am working with text files and I want to implement a search algorithm in Java. I have a text file I need to search.
If I want to find one word, I can do it by just putting all the text into a hashmap and storing each word's occurrence count. But is there an algorithm for when I want to search for two strings (or maybe more)? Should I hash the strings in pairs of two?
It depends a lot on the size of the text file. There are usually several cases you should consider:
Lots of queries on very short documents (web pages, texts of essay length, etc.), with a text distribution like normal language. A simple O(n^2) algorithm is fine: for a query of length n, just take a window of length n and slide it over, comparing and moving the window until you find a match. This algorithm does not care about words, so you just treat the whole search as one big string (including spaces). This is probably what most browsers do. KMP or Boyer-Moore is not worth the effort, since the O(n^2) worst case is very rare.
Lots of queries on one large document. Preprocess your document and store the preprocessed form. Common storage options are suffix trees and inverted lists. If you have multiple documents, you can build one document by concatenating them and storing the document boundaries separately. This is the way to go for document databases where the collection is almost constant.
If you have several documents with high redundancy and your collection changes often, use KMP or Boyer-Moore. For example, if you want to find certain sequences in DNA data, and you often get new sequences to search for as well as new DNA from experiments, the O(n^2) part of the naive algorithm would kill your running time.
There are probably lots more scenarios that need different algorithms and data structures, so you should figure out which one is best in your case.
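Since KMP comes up in the last case, here is a compact Java sketch of it for reference (standard failure-function formulation); it returns the index of the first occurrence of pattern in text, or -1:

static int kmpSearch(String text, String pattern) {
    if (pattern.isEmpty()) return 0;

    // failure table: length of the longest proper prefix of the pattern
    // that is also a suffix of pattern[0..i]
    int[] fail = new int[pattern.length()];
    for (int i = 1, k = 0; i < pattern.length(); i++) {
        while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
        if (pattern.charAt(i) == pattern.charAt(k)) k++;
        fail[i] = k;
    }

    // scan the text, never moving backwards in it
    for (int i = 0, k = 0; i < text.length(); i++) {
        while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
        if (text.charAt(i) == pattern.charAt(k)) k++;
        if (k == pattern.length()) return i - k + 1;
    }
    return -1;
}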
Some more detail is required before suggesting an approach:
Are you searching for whole words only or any substring?
Are you going to search for many different words in the same unchanged file?
Do you know the words you want to search for all at once?
There are many efficient (linear) search algorithms for strings. If possible I'd suggest using one that's already been written for you.
http://en.wikipedia.org/wiki/String_searching_algorithm
One simple idea is to use a sliding window hash with the window the same size as the search string. Then in a single pass you can quickly check to see where the window hash matches the hash of your search string. Where it matches you double check to see if you've got a real match.
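That idea is essentially the Rabin-Karp algorithm; a small Java sketch of it, with an arbitrary base and modulus:

// rolling-hash (Rabin-Karp) search: returns the index of the first match or -1
static int rabinKarp(String text, String pattern) {
    int n = text.length(), m = pattern.length();
    if (m == 0) return 0;
    if (m > n) return -1;

    long base = 256, mod = 1_000_000_007L, high = 1;
    for (int i = 0; i < m - 1; i++) high = (high * base) % mod;   // base^(m-1) mod mod

    long patternHash = 0, windowHash = 0;
    for (int i = 0; i < m; i++) {
        patternHash = (patternHash * base + pattern.charAt(i)) % mod;
        windowHash  = (windowHash  * base + text.charAt(i)) % mod;
    }

    for (int i = 0; i + m <= n; i++) {
        // hashes match: double check with a real character comparison
        if (patternHash == windowHash && text.regionMatches(i, pattern, 0, m)) return i;
        if (i + m < n) {   // roll the window one character forward
            windowHash = (windowHash - text.charAt(i) * high % mod + mod) % mod;
            windowHash = (windowHash * base + text.charAt(i + m)) % mod;
        }
    }
    return -1;
}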