Text classification with Weka - Java

I'm building a text classifier in Java with the Weka library.
First I remove stopwords, then I apply a stemmer (e.g. converting "cars" to "car").
Right now I have 6 predefined categories. I train the classifier on
5 documents for every category. The lengths of the documents are similar.
The results are OK when the text to be classified is short, but when the text is longer
than 100 words the results get stranger and stranger.
I return the probabilities for each category as follows:
Probability:
[0.0015560238056109177, 0.1808919321002592, 0.6657404531908249, 0.004793498469427115, 0.13253647895234325, 0.014481613481534815]
which is a pretty reliable classification.
But when I use texts longer than around 100 words I get results like:
Probability: [1.2863123678314889E-5, 4.3728547754744305E-5, 0.9964710903856974, 5.539960514402068E-5, 0.002993481218084141, 4.234371196414616E-4]
which is too good to be true.
Right now I'm using Naive Bayes Multinomial to classify the documents. I have read
about it and found that it can act strangely on longer texts. Might that be my problem right now?
Does anyone have a good idea why this is happening?

There can be multiple factors behind this behavior. If your training and test texts are not from the same domain, this can happen. Also, I believe adding more documents for every category should do some good; 5 documents per category seems very few. If you do not have more training documents, or it is difficult to obtain more, then you can synthetically add positive and negative instances to your training set (see the SMOTE algorithm in detail). Keep us posted.
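To check whether the small training set is the issue, one option is to cross-validate the whole pipeline in Weka. The sketch below is a minimal example, not the asker's actual setup: it assumes the 30 training documents live in a hypothetical training.arff with one string attribute (the text) and the class as the last attribute. It builds a bag-of-words representation, reports 10-fold cross-validation results, and prints the per-class probability distribution for one instance.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class TrainingCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF file: one string attribute holding the document text,
            // plus a nominal class attribute (the 6 categories) as the last attribute.
            Instances raw = DataSource.read("training.arff");
            raw.setClassIndex(raw.numAttributes() - 1);

            // Turn the raw strings into bag-of-words features with word counts,
            // which is what Naive Bayes Multinomial expects.
            StringToWordVector bow = new StringToWordVector();
            bow.setLowerCaseTokens(true);
            bow.setOutputWordCounts(true);
            bow.setInputFormat(raw);
            Instances data = Filter.useFilter(raw, bow);

            // 10-fold cross-validation gives a rough idea of how stable the model
            // is with only 5 documents per class.
            NaiveBayesMultinomial nb = new NaiveBayesMultinomial();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(nb, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());

            // Train on all data and inspect the per-class probabilities for one document.
            nb.buildClassifier(data);
            double[] dist = nb.distributionForInstance(data.instance(0));
            for (double p : dist) {
                System.out.printf("%.6f%n", p);
            }
        }
    }

If cross-validation accuracy swings wildly between folds, that points to the training set being too small rather than to the classifier itself.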

Related

Text classification: how to convert text strings to a vector representation

I am working on a text classification program. My training data is 700+ text categories, and each category contains 1-350 text phrases, 16k+ phrases in total. The data that needs to be classified consists of text phrases. I want to classify the data so it gives me the 5 most similar categories. The training data shares a lot of common words.
My first attempt was Naive Bayes, using this library on GitHub because it was very easy to use and allowed me to load my training data as strings. But other users reported issues, and when I tried to classify my data, my input was either classified wrong or not classified at all.
https://github.com/ptnplanet/Java-Naive-Bayes-Classifier
So I think the library was the issue, and I'm going to try different libraries and look into k-means clustering, since my data has high variance.
When I looked at other libraries, they all require the input and training data as a vector matrix. I looked at word2vec and tf-idf to convert text to vectors. I understand tf-idf, and that I can get the weight of a word compared to the rest of the documents. But how can I use it to classify my input data into categories? Would each category be a document? Or would all categories be a single document?
Edit: data sample
SEE_BILL-see bill
SEE_BILL-bill balance
SEE_BILL-wheres my bill
SEE_BILL-cant find bill
PAY_BILL-pay bill
PAY_BILL-make payment
PAY_BILL-lower balance
PAY_BILL-remove balance
PAST_BILL-last bill
PAST_BILL-previous bill
PAST_BILL-historical bill
PAST_BILL-bill last year
First of all, the end of your question doesn't quite make sense, because you didn't say what classes you want to classify the text phrases into. But I can help you with the vectorization of the text phrases.
Tf-idf is pretty good, but you need good preprocessing to use it, and you have to create the vectors yourself. The problem is that the vector's length is the number of distinct words in your dataset, including all the different forms in which a word occurs. So if you have the word go in your dataset, there will likely be several forms of it, including going, Go, gone, went and so on. That's why you need good preprocessing to reduce all of those forms of go to its root form. You also have to lowercase the whole dataset, because go and Go are otherwise treated as different words. But even if you do all of that and build a perfect preprocessing pipeline, you will still end up with a vector of length 20k+. You would then have to manually select the features (words) you want to keep and delete the others. That means, if you want a vector of size 300, you would have to delete about 19,700 words from the vector; of course, you would keep the 300 most distinctive ones. If you want to dive deeper and see exactly how it works, you can check it out here
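To make the mechanics concrete, here is a minimal hand-rolled sketch, not tied to any particular library; the tiny corpus is taken from the phrases in the question, and the helper names are made up for illustration.

    import java.util.*;

    public class TfIdfSketch {
        public static void main(String[] args) {
            List<String[]> docs = Arrays.asList(
                    "see bill".split(" "),
                    "pay bill".split(" "),
                    "previous bill".split(" "));

            // Document frequency: in how many documents does each term occur?
            Map<String, Integer> df = new HashMap<>();
            for (String[] doc : docs) {
                for (String term : new HashSet<>(Arrays.asList(doc))) {
                    df.merge(term, 1, Integer::sum);
                }
            }

            // tf-idf for each document: term frequency * log(N / df).
            // Note: "bill" appears in every document, so its idf (and tf-idf) is 0.
            int n = docs.size();
            for (String[] doc : docs) {
                Map<String, Double> vector = new HashMap<>();
                for (String term : doc) {
                    vector.merge(term, 1.0, Double::sum);
                }
                for (Map.Entry<String, Double> e : vector.entrySet()) {
                    double tf = e.getValue() / doc.length;
                    double idf = Math.log((double) n / df.get(e.getKey()));
                    e.setValue(tf * idf);
                }
                System.out.println(vector);
            }
        }
    }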
On the other hand, word2vec maps any word to a 300-dimensional vector. You would still have to do some preprocessing, similar to tf-idf, but this method is much less sensitive to it. You can find out how word2vec works here
In conclusion, I would recommend going with word2vec because it's much easier to start with. There is a pretrained model from Google which you can download here
The two most popular approaches would be to:
represent each phrase/sentence as a bag of words where you basically one-hot encode each word of the phrase and the dimension of the encoding is the dimension of your vocabulary (total number of words)
use embeddings based on popular models, like word2vec, which put each word in an X-dimensional vector space (e.g. 300-dimensional), so each of your phrases/sentences would be a sequence of X-dimensional vectors
An even more extreme approach would be to embed whole sentences using models like the universal-sentence-encoder. In short: it's similar to word2vec, but instead of words it converts whole sentences into a (512-dimensional) vector space. Then it's easier to find "similar" sentences.
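To make the first (bag-of-words) approach concrete for the goal of returning the 5 most similar categories, here is a hedged sketch: each category is represented by the pooled word counts of its training phrases (taken from the sample data above), an input phrase is vectorized the same way, and categories are ranked by cosine similarity. The class and variable names are made up for illustration.

    import java.util.*;
    import java.util.stream.Collectors;

    public class CategoryRanker {
        // Word-count vector for a piece of text.
        static Map<String, Double> vectorize(String text) {
            Map<String, Double> v = new HashMap<>();
            for (String token : text.toLowerCase().split("\\s+")) {
                v.merge(token, 1.0, Double::sum);
            }
            return v;
        }

        // Cosine similarity between two sparse vectors.
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
                na += e.getValue() * e.getValue();
            }
            for (double x : b.values()) {
                nb += x * x;
            }
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            // Each category is the concatenation of its training phrases.
            Map<String, String> categories = new HashMap<>();
            categories.put("SEE_BILL", "see bill bill balance wheres my bill cant find bill");
            categories.put("PAY_BILL", "pay bill make payment lower balance remove balance");
            categories.put("PAST_BILL", "last bill previous bill historical bill bill last year");

            Map<String, Double> query = vectorize("where can i see my bill");

            // Rank categories by similarity to the query and keep the top 5.
            List<String> top = categories.entrySet().stream()
                    .sorted(Comparator.comparingDouble(
                            (Map.Entry<String, String> e) -> -cosine(query, vectorize(e.getValue()))))
                    .limit(5)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
            System.out.println(top);
        }
    }

With tf-idf instead of raw counts, the common word "bill" would stop dominating the similarity scores.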

Text clustering program in java

I have to develop a project in core Java in which I take some 100 lines of text from the user. I want to break the whole text into clusters, where each cluster relates to a keyword. For example, suppose I have text like:
"Java is an object oriented language. It uses classes for modularisation. bla bla bla...
C++ is also an object oriented language. bla bla bla...
Something about OOPS concepts here..."
Now, if I give this whole text as input to the program, I want the program to create directories named after the keywords, and it should choose the keywords on its own. I expect the keywords in this text to be Java, Modularisation, C++, OOPS. In later stages of this program I will be dealing with different texts, so I have to make the program intelligent enough to understand which words are keywords and which are not, so that it can work with any piece of text.
I have looked in many places, asked many people, and watched many tutorials, only to find that they mostly cluster numerical data; rarely does anyone deal with text clustering. I am looking for an algorithm or approach that can do this.
Thanks
The reason you are only finding such tutorials is that machine-learning algorithms need numerical data, so you have to convert your data into a numerical format.
There are a number of algorithms for creating a numerical representation of text, for example the Levenshtein distance.
With such a distance measure you have a numerical representation, and clustering algorithms become applicable.
For example, you can use the k-means algorithm, or any other, to cluster your text data.
You should also google a bit about text mining; there are many good examples on the web. This link could be a good resource
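For reference, the Levenshtein distance mentioned above is easy to implement by hand; this is the standard dynamic-programming version (the example strings are arbitrary).

    public class Levenshtein {
        // Classic dynamic-programming edit distance between two strings.
        static int distance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        public static void main(String[] args) {
            System.out.println(distance("modularisation", "modularization")); // prints 1
        }
    }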
There are a variety of approaches you can use to pre-process your text and then cluster the processed data. An example would be to generate the bag-of-words representation of the text and then apply clustering methods.
However, I would personally choose LDA topic modeling. This algorithm by itself does not 'cluster' your text, but it can be used as a pre-processing step for text clustering. It is another unsupervised approach that gives you a list of 'topics' associated with a set of documents or sentences. These topics are actually sets of words that are deemed relevant to each other based on how they appear in the underlying text. For instance, the following are three topics extracted from a set of tweets:
food, wine, beer, lunch, delicious, dining
home, real estate, house, tips, mortgage, real estate
stats, followers, unfollowers, checked, automatically
Then you can calculate the probability of a sentence belonging to each of these topics by counting the number of times these words appear in the sentence relative to the total word count. Finally, these probability values can be used for text clustering. I should also note that the words generated by LDA are weighted, so you can use the one with the largest weight as your main keyword. For instance, 'food', 'home', and 'stats' have the largest weights in the above lists, respectively.
For an LDA implementation, check out the Mallet library, developed in Java.
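A rough sketch of that Mallet usage, adapted from the library's standard topic-modeling example; the stop-word list path and the three toy documents (echoing the tweet topics above) are placeholders, and exact class names may vary between Mallet versions.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.regex.Pattern;

    import cc.mallet.pipe.CharSequence2TokenSequence;
    import cc.mallet.pipe.CharSequenceLowercase;
    import cc.mallet.pipe.Pipe;
    import cc.mallet.pipe.SerialPipes;
    import cc.mallet.pipe.TokenSequence2FeatureSequence;
    import cc.mallet.pipe.TokenSequenceRemoveStopwords;
    import cc.mallet.pipe.iterator.ArrayIterator;
    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.types.InstanceList;

    public class LdaSketch {
        public static void main(String[] args) throws Exception {
            // Tokenize, lowercase, remove stop words, and map tokens to feature indices.
            ArrayList<Pipe> pipes = new ArrayList<>();
            pipes.add(new CharSequenceLowercase());
            pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
            pipes.add(new TokenSequenceRemoveStopwords(new File("stoplists/en.txt"), "UTF-8",
                    false, false, false)); // placeholder stop-word list path
            pipes.add(new TokenSequence2FeatureSequence());

            String[] docs = {
                "wine and delicious food for lunch",
                "mortgage tips for buying a house",
                "check your followers and unfollowers automatically"
            };
            InstanceList instances = new InstanceList(new SerialPipes(pipes));
            instances.addThruPipe(new ArrayIterator(docs));

            // Train LDA with 3 topics.
            ParallelTopicModel lda = new ParallelTopicModel(3, 1.0, 0.01);
            lda.addInstances(instances);
            lda.setNumIterations(1000);
            lda.estimate();

            // Per-document topic probabilities, usable as features for clustering.
            double[] topicProbs = lda.getTopicProbabilities(0);
            for (double p : topicProbs) {
                System.out.printf("%.3f ", p);
            }
        }
    }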

Understand if two different PDFs are the same research paper

I'm thinking of writing a simple research paper manager.
The idea is to have a repository containing, for each paper, its metadata:
paper_id -> [title, authors, journal, comments...]
Since it would be nice to be able to import a friend's paper dump,
I'm thinking about how to generate the paper_id of a paper: IMHO it should be derived
from the text of the PDF, to guarantee that two different collections have the same ids only for the same papers.
At the moment, I extract the text of the first page using the iText library (removing any annotations), and I compute a simhash footprint from the text.
The main problem is that sometimes the text is slightly different (yes, it happens! for example this and this), so I would like to be tolerant.
With simhash I can compute how similar the original documents are, so in case the footprint is not in the repo, I'll have to iterate over the collection looking for
'near' footprints.
I'm not convinced by this method; could you suggest a better way to produce a signature
(short, numerical or alphanumerical) for this kind of document?
UPDATE: I had this idea: divide the first page into 8 (more or less) non-overlapping squares covering the whole page, then consider the text in each square
and generate a simhash signature for it. At the end I'll have an 8x64 = 512-bit signature, and I can consider
two papers the same if the sum of the differences between their simhash signatures is under a certain threshold.
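Assuming each page yields eight 64-bit simhash values as described, the comparison itself is just a sum of per-square Hamming distances; a minimal sketch, where the signature values and the threshold are placeholders:

    public class SignatureCompare {
        // Sum of Hamming distances between two 8 x 64-bit simhash signatures.
        static int distance(long[] a, long[] b) {
            int total = 0;
            for (int i = 0; i < a.length; i++) {
                total += Long.bitCount(a[i] ^ b[i]); // differing bits in this square
            }
            return total;
        }

        static boolean samePaper(long[] a, long[] b, int threshold) {
            return distance(a, b) <= threshold;
        }

        public static void main(String[] args) {
            long[] sigA = {0x0123456789ABCDEFL, 0L, -1L, 42L, 7L, 7L, 7L, 7L};
            long[] sigB = {0x0123456789ABCDEEL, 0L, -1L, 42L, 7L, 7L, 7L, 7L};
            System.out.println(samePaper(sigA, sigB, 24)); // threshold is a placeholder
        }
    }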
If you actually have a function that takes two texts and returns a measure of their similarity, you do not have to iterate over the entire repository.
Given an article that is not in the repository, you can iterate only over articles that have approximately the same length. For example, given an article of 1000 characters, you would compare it to articles having between 950 and 1050 characters. For this you will need a data structure that maps length ranges to articles, and you will have to fine-tune the size of the range: too large a range means too many items per range; too small a range means a higher chance of a miss.
Of course this will fail in some edge cases. For example, if you have two documents where the second is simply the first copy-pasted twice, you would probably want them to be considered equal, but you will not even compare them since they are too far apart in length. There are methods to deal with that too, but you probably 'ain't gonna need it'.
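A hedged sketch of that length-based pre-filtering, using a TreeMap keyed by character count so a lookup only touches candidates of roughly the same length; the Article type and the 5% window are placeholders.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class LengthIndex {
        // Hypothetical article record: an id plus its extracted first-page text.
        static class Article {
            final String id;
            final String text;
            Article(String id, String text) { this.id = id; this.text = text; }
        }

        private final TreeMap<Integer, List<Article>> byLength = new TreeMap<>();

        void add(Article a) {
            byLength.computeIfAbsent(a.text.length(), k -> new ArrayList<>()).add(a);
        }

        // Return only articles whose length is within +/- 5% of the query's length.
        List<Article> candidates(String text) {
            int len = text.length();
            int lo = (int) (len * 0.95), hi = (int) (len * 1.05);
            List<Article> result = new ArrayList<>();
            for (List<Article> bucket : byLength.subMap(lo, true, hi, true).values()) {
                result.addAll(bucket);
            }
            return result;
        }
    }

Only the returned candidates then need the more expensive simhash comparison.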

Mahout - Clustering - "naming" the cluster elements

I'm doing some research and I'm playing with Apache Mahout 0.6
My purpose is to build a system that will name different categories of documents based on user input. The documents are not known in advance, and I also don't know which categories I have while collecting these documents. But I do know that all the documents in the model should belong to one of the predefined categories.
For example:
Let's say I've collected N documents that belong to 3 different groups:
Politics
Madonna (pop-star)
Science fiction
I don't know which document belongs to which category, but I know that each of my N documents belongs to one of those categories (e.g. there are no documents about, say, basketball among these N docs).
So, I came up with the following idea:
Apply Mahout clustering (for example k-means with k=3) to these documents.
This should divide the N documents into 3 groups. This will be my model to learn from. I still don't know which document really belongs to which group, but at least the documents are now clustered by group.
Ask the user to find any document on the web that should be about 'Madonna' (I can't show the user any of my N documents; that's a restriction). Then I want to measure the 'similarity' between this document and each of the 3 groups.
I expect the similarity between the user_doc and the documents in the Madonna group of the model to be higher than the similarity between the user_doc and the documents about politics.
I've managed to produce the cluster of documents using 'Mahout in Action' book.
But I don't understand how I should use Mahout to measure the similarity between a 'ready' cluster of documents and one given document.
I thought about rerunning the clustering with k=3 on N+1 documents with the same centroids (in terms of k-means clustering) and seeing where the new document falls, but maybe there is another way to do that?
Is this possible with Mahout, or is my idea conceptually wrong? (An example in terms of the Mahout API would be really good.)
Thanks a lot, and sorry for the long question (I couldn't describe it better).
Any help is highly appreciated
P.S. This is not a home-work project :)
This might be possible, but a much easier solution (IMHO) would be to hand-label a few documents from each category and then use those to bootstrap k-means, i.e. compute the centroids of the hand-labeled politics/Madonna/sci-fi documents and let k-means take it from there.
(In fancy terms, you would be doing semi-supervised nearest-centroid classification.)
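A sketch of the nearest-centroid idea in plain Java, assuming the documents have already been reduced to term-count vectors; the seed documents, the vectorization step and all names are placeholders. With Mahout, the hand-labeled centroids would instead be fed to k-means as the initial cluster centers.

    import java.util.*;

    public class NearestCentroid {
        // Term-frequency vector for a document (placeholder for a real tf-idf pipeline).
        static Map<String, Double> vectorize(String text) {
            Map<String, Double> v = new HashMap<>();
            for (String t : text.toLowerCase().split("\\W+")) {
                if (!t.isEmpty()) v.merge(t, 1.0, Double::sum);
            }
            return v;
        }

        // Average of a set of vectors: the class centroid.
        static Map<String, Double> centroid(List<Map<String, Double>> vectors) {
            Map<String, Double> c = new HashMap<>();
            for (Map<String, Double> v : vectors) {
                v.forEach((term, w) -> c.merge(term, w / vectors.size(), Double::sum));
            }
            return c;
        }

        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
                na += e.getValue() * e.getValue();
            }
            for (double x : b.values()) nb += x * x;
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            // Hand-labeled seed documents per category (contents are placeholders).
            Map<String, List<String>> seeds = Map.of(
                "politics", List.of("election vote parliament", "government policy debate"),
                "madonna", List.of("madonna pop concert tour", "madonna new album single"),
                "scifi", List.of("spaceship alien galaxy", "robot future dystopia"));

            Map<String, Map<String, Double>> centroids = new HashMap<>();
            seeds.forEach((label, docs) -> {
                List<Map<String, Double>> vecs = new ArrayList<>();
                docs.forEach(d -> vecs.add(vectorize(d)));
                centroids.put(label, centroid(vecs));
            });

            // Assign a user-supplied document to the closest centroid.
            Map<String, Double> userDoc = vectorize("madonna announces world tour and album");
            String best = centroids.entrySet().stream()
                    .max(Comparator.comparingDouble(
                            (Map.Entry<String, Map<String, Double>> e) -> cosine(userDoc, e.getValue())))
                    .get().getKey();
            System.out.println(best);
        }
    }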

How can I build an algorithm to classify an HTML page based on keywords?

I'm trying to create an algorithm that assigns a relevance score to a webpage based on the keywords it finds on the page.
I'm doing this at the moment:
I set some words and a value for each of them: "movie" (10), "cinema" (6), "actor" (5) and "hollywood" (4), and I search some parts of the page, giving a weight to each part and multiplying it by the word's weight.
Example: the word "movie" was found in the URL (1.5) * 10 and in the title (2.5) * 10 = 40.
This is trash! It's my first attempt, and it returns some relevant results, but I don't think that relevance expressed by values like 244, 66, 30, 15 is useful.
I want something that stays inside a range, from 0 to 1 or from 1 to 100.
What kind of word weighting can I use?
Besides that, are there ready-made algorithms to score the relevance of an HTML page based on things like URL, keywords, title, etc., excluding the main content?
EDIT 1: All of this can be rebuilt; the current weights are arbitrary. I want to use weights that are principled, not random numbers like 10, 5 and 3.
Something like: low importance = 1, medium importance = 2, high importance = 4, deterministic importance = 8.
Title > Link Part of URL > Domain > Keywords
movie > cinema> actor > hollywood
EDIT 2: At the moment, I want to analyze the page's relevance for words while excluding the body content of the page. I will include in the analysis the domain, the link part of the URL, the title, the keywords (and other meta information I judge useful).
The reason for this is that the HTML content is dirty. I can find many words like 'movie' in menus and advertisements, while the main content of the page contains nothing relevant to the theme.
Another reason is that some pages have meta information indicating that the page contains info about a movie, but the main content does not. Example: a page that contains the plot of the film, describing the story, the characters, etc., but nothing in that text indicates that it is about a movie, only the page's meta information.
Later, after running a relevance analysis on the HTML page, I will do a relevance analysis on the (filtered) content separately.
Are you able to index these documents in a search engine? If so, maybe you should consider using this latent semantic library.
You can get the actual project from here: https://github.com/algoriffic/lsa4solr
What you are trying to do is determine the meaning of a text corpus and classify it based on that meaning. However, words are not individually unique, nor can they be considered in the abstract, away from the overall article.
For example, suppose you have an article which talks a lot about "Windows". The word is used 7 times in a 300-word article, so you know it is important. However, what you don't know is whether it is talking about the operating system Windows or the things you look through.
Suppose you also see words such as "installation"; that doesn't help you either, because people install windows into their houses much like they install the Windows operating system. However, if the very same article talks about defragmentation, operating systems, the command line and Windows 7, then you can guess that the document is actually about the Windows operating system.
However, how can you determine this?
This is where Latent Semantic Indexing comes in. What you want to do is extract the entire document's text and then apply some clever analysis to it.
The matrices that you build (see here) are way above my head, and although I have looked at some libraries and used them, I have never been able to fully understand the complex math that goes into building the space-aware matrix used by Latent Semantic Analysis... so my advice is to just use an existing library to do this for you.
Happy to remove this answer if you aren't looking for external libraries and want to do this yourself
A simple way to convert anything into a 0-100 range (for any positive value X):
(1-1/(1+X))*100
A higher X gives you a value closer to 100.
But this won't guarantee a fair or correct distribution. That's up to the algorithm that decides the actual X value.
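As a quick sanity check, plugging the example scores from the question (244, 66, 30, 15) into that formula shows how large raw scores all end up compressed near 100:

    public class Normalize {
        // Map any positive score X into the 0-100 range: (1 - 1/(1+X)) * 100.
        static double normalize(double x) {
            return (1 - 1 / (1 + x)) * 100;
        }

        public static void main(String[] args) {
            double[] rawScores = {244, 66, 30, 15}; // values from the question
            for (double x : rawScores) {
                System.out.printf("%.0f -> %.1f%n", x, normalize(x));
            }
            // 244 -> 99.6, 66 -> 98.5, 30 -> 96.8, 15 -> 93.8
        }
    }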
your_sum / (max_score_per_word * num_words) * 100
Should work, but you'll get very small scores most of the time, since few of the words will match those that have a non-zero score. Nonetheless, I don't see an alternative. And it's not a bad thing that you get small scores: you will be comparing scores between webpages. Try many different webpages and you can figure out what a "high score" is in your system.
Check out this blog post on classifying webpages by topic; it talks about how to implement something that relates closely to your requirements. How do you define relevance in your scenario? No matter what weights you apply to the different inputs, you will still be choosing a somewhat arbitrary value; once you've cleaned the raw data, you would be better served by applying machine learning to generate a classifier for you. This is difficult if relevance is a scalar value, but it's trivial if it's a boolean (i.e. a page is or isn't relevant to a particular movie, for example).
