I am doing a project on question classification using SVM. Given a question, the system must be able to assign a class to it. For example, for the question "Where is the Taj Mahal located?" the task of question classification is to assign the label "Location", since the answer to this question is a named entity of type "Location". So I know that I first have to provide a training dataset from which the model will learn, and then a testing dataset.
My training dataset contains a class, an index, and a value (the question itself), all categorical.
class   index    value (question)
DESC    manner   How did serfdom develop in and then leave Russia ?
Likewise, I have 6 classes, 50 indexes, and 1000 questions.
An SVM takes numerical values as input.
To implement the SVM I have downloaded LIBSVM, which is a library for SVMs:
www.csie.ntu.edu.tw/~cjlin/libsvm/
I don't know how I should convert this data to LIBSVM format. Please help.
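For reference, here is a minimal sketch (plain Java, not using LIBSVM itself) of how such categorical questions could be encoded as LIBSVM's sparse `label index:value` lines with a simple bag-of-words mapping. The class-to-integer and word-to-index mappings below are just one possible convention, and the two rows of data are taken from the example above.

```java
import java.util.*;

// Minimal sketch: encode questions as bag-of-words lines in LIBSVM's sparse
// format "<label> <index>:<value> ...", with 1-based, increasing feature indices.
public class LibsvmFormatSketch {
    public static void main(String[] args) {
        Map<String, Integer> classIds = new HashMap<>();    // e.g. DESC -> 1, LOC -> 2, ...
        Map<String, Integer> vocab = new LinkedHashMap<>(); // word -> 1-based feature index

        String[][] data = {
            { "DESC", "How did serfdom develop in and then leave Russia ?" },
            { "LOC",  "Where is the Taj Mahal located ?" }
        };

        for (String[] row : data) {
            int label = classIds.computeIfAbsent(row[0], k -> classIds.size() + 1);

            // count word occurrences in the question (TreeMap keeps indices sorted)
            Map<Integer, Integer> counts = new TreeMap<>();
            for (String word : row[1].toLowerCase().split("\\s+")) {
                int idx = vocab.computeIfAbsent(word, k -> vocab.size() + 1);
                counts.merge(idx, 1, Integer::sum);
            }

            // emit one LIBSVM line: label followed by sorted index:value pairs
            StringBuilder line = new StringBuilder().append(label);
            counts.forEach((idx, cnt) -> line.append(' ').append(idx).append(':').append(cnt));
            System.out.println(line);
        }
    }
}
```

Each distinct class name becomes an integer label and each distinct word a feature index, so the resulting lines can be written to a file in the format LIBSVM's svm-train expects.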
Related
I have a list of documents and I am indexing them in Apache Solr based on the user's query. I want to extract some news articles using keywords from the relevant indexed documents and display them along with the indexed documents to the user. Is there any algorithm or procedure by which we can extract the relevant keywords from the documents and use them to retrieve the news?
You should research TF-IDF keyword extraction. I did something similar about two years ago using the English Wikipedia and a simple Python script. You need to answer a few questions before proceeding, though (a rough sketch of the scoring itself follows the questions below). You can find a neat little write-up on using TF-IDF keyword extraction here.
Do you only care about single keywords, or will you evaluate phrases as well, and up to what length?
Will you do any natural language processing on incoming data such as tagging and stemming?
Will you restrict the keywords to certain article types? Certain categories of article can have their own TF-IDF scores so you might want to experiment with what you need.
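To make the idea concrete, here is a rough sketch of plain TF-IDF scoring for single-word keywords, assuming documents are already tokenized into lowercase word lists; phrase handling, tagging, and stemming (the questions above) are left out, and the toy documents are made up.

```java
import java.util.*;

// Rough sketch of TF-IDF keyword scoring: rank the words of one document by
// term frequency times inverse document frequency across the collection.
public class TfIdfSketch {

    // term frequency of a word within one document
    static double tf(List<String> doc, String term) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // inverse document frequency of a word across the whole collection (+1 smoothing)
    static double idf(List<List<String>> docs, String term) {
        long containing = docs.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) docs.size() / (1 + containing));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("solar", "panels", "cut", "energy", "bills"),
            Arrays.asList("energy", "prices", "rise", "again"),
            Arrays.asList("new", "solar", "farm", "opens")
        );
        List<String> target = docs.get(0);

        // rank every distinct word of the first document; the top words are keyword candidates
        target.stream()
              .distinct()
              .sorted(Comparator.comparingDouble((String w) -> tf(target, w) * idf(docs, w)).reversed())
              .forEach(w -> System.out.printf("%-8s %.4f%n", w, tf(target, w) * idf(docs, w)));
    }
}
```

The highest-scoring words from an indexed document would then be the keywords you feed into the query for news articles.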
I have seen questions related to this post, but they are a little bit confusing.
I have gone through the Kaggle site.
Here we need to distinguish between dogs and cats (looking at training only).
I want to apply my Java SVM implementation to this "train" data.
How do I do that? My normal SVM takes only numeric values with 1/-1 classification.
Here the data is images.
So I need to first convert the images into numerical data?
What would the flow be: convert to numeric data, then run the SVM, and then what will the final result be?
Where can I find a large file (1 GB) for training an SVM? Only numeric values, not images.
I don't want to use libraries like LIBSVM for that. I only need to do training.
Any suggestions?
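For the image part of the question, here is a minimal sketch (standard Java only, no SVM library) of one common way to turn each image into a fixed-length numeric vector: resize it to a small fixed size and use the grayscale pixel intensities as features. The file name and target size are just assumptions, and raw pixel intensities are only the simplest possible feature choice.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

// Sketch: image -> fixed-length numeric feature vector for an SVM.
public class ImageToVectorSketch {

    static double[] toFeatureVector(File imageFile, int size) throws Exception {
        BufferedImage original = ImageIO.read(imageFile);

        // resize to size x size grayscale so every image yields a vector of the same length
        BufferedImage gray = new BufferedImage(size, size, BufferedImage.TYPE_BYTE_GRAY);
        gray.getGraphics().drawImage(original, 0, 0, size, size, null);

        double[] features = new double[size * size];
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                // low byte of the packed RGB value is the gray intensity; scale to [0, 1]
                features[y * size + x] = (gray.getRGB(x, y) & 0xFF) / 255.0;
            }
        }
        return features;
    }

    public static void main(String[] args) throws Exception {
        double[] v = toFeatureVector(new File("train/cat.0.jpg"), 32); // 32 x 32 -> 1024 features
        System.out.println("feature vector length: " + v.length);
        // label this vector +1 (cat) or -1 (dog) and feed it to the SVM trainer
    }
}
```

The flow would then be: convert every training image to such a vector, attach the 1/-1 label, train the SVM on those pairs, and at prediction time convert a new image the same way and ask the trained model for its label.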
I have some high-dimensional (30000 dimensions) vectors of integers. I have 2 classes: [YES, NO]. I have 6000 samples of the YES class and 50000 samples of the NO class. I would like to train a classifier that will automatically classify new samples into one of these classes in the future.
I know how to use the Weka Java API, but I am not sure which algorithms in which order to use. Can anyone give me advice on the following questions:
Are the vectors too high dimensional or do I have too many samples to do this efficiently in Weka?
Should I reduce the dimensionality before I start? What algorithm can I use to identify significant elements of my feature vector?
What classifier would be best for classifying this kind of data? I think a decision tree should work fine, but maybe naive Bayes would be faster to train?
Since every element must have a name in weka, how can I assign a name to each of my 30000 features?
Any advice is appreciated. Thanks.
The number of dimensions in this problem is certainly quite large, but I believe that Weka should be able to handle it. The number of samples should not be a problem either, but there are a lot more NO-class samples than YES-class samples, so balancing the two might help the classifier handle the minority (YES) class better.
If you believe that some dimensions are redundant or contain noise, then reducing the dimensionality before you start would certainly help.
A decision tree shouldn't be too much of a problem. There are a number of algorithms available in Weka, but I wouldn't recommend Neural Networks given the dimensionality of the problem.
If you save the data in a CSV file, you can assign the attribute names in the first row of the data. Given the number of dimensions, you would likely call these a1 to a30000, plus output for the output class.
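For example, a minimal sketch of that CSV route with the Weka Java API might look like the following; the file name is a placeholder, and if the output column is read in as numeric you would first convert it to nominal (e.g. with the NumericToNominal filter) before training.

```java
import java.io.File;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

// Sketch: load a CSV whose header row is "a1,a2,...,a30000,output",
// use the last column as the class, and train a decision tree on it.
public class CsvToWekaSketch {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("vectors.csv"));
        Instances data = loader.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // "output" is the class attribute

        J48 tree = new J48();                           // the decision tree suggested above
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}
```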
Hope this Helps!
This is a rather newbie question, so please take it with a grain of salt.
I'm new to the field of data mining and trying to get my head around this topic. Right now I'm trying to polish my existing model so that it classifies instances better. The problem is that my model has around 480 attributes. I know for sure that not all of them are relevant, but it's hard for me to point out which ones are indeed important.
The question is: given valid training and test sets, can one use some sort of data mining algorithm which would throw away attributes that seem to have no impact on the quality of classification?
I'm using Weka.
You should test using some of the Classifier algorithms that Weka has.
The basic idea is to use the Cross-validation option, so you can see which algorithm gives you the best Correctly Classified Instances value.
I can give you an example from one of my training sets, using the Cross-validation option with Folds set to 10.
As you can see, using the J48 classifier I get:
Correctly Classified Instances 4310 83.2207 %
Incorrectly Classified Instances 869 16.7793 %
and if I use, for example, the NaiveBayes algorithm, I get:
Correctly Classified Instances 1996 38.5403 %
Incorrectly Classified Instances 3183 61.4597 %
and so on, the values differ depending on the algorithm.
So test as many algorithms as possible and see which one gives you the best trade-off between Correctly Classified Instances and time consumed.
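If you would rather run this comparison from the Weka Java API than from the Explorer, a minimal sketch could look like this; the ARFF file name is a placeholder.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: 10-fold cross-validation for several classifiers, printing the
// Correctly Classified Instances percentage for each.
public class CompareClassifiersSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] candidates = { new J48(), new NaiveBayes() };
        for (Classifier c : candidates) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));   // Folds = 10
            System.out.printf("%-12s correctly classified: %.4f %%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```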
Comment converted to answer as OP suggested:
If you use Weka 3.6.6: select the Explorer module, then go to the "Select attributes" tab and choose an "Attribute evaluator" and a "Search method". You can also choose between using the full data set or cross-validation sets. For more details see e.g. http://forums.pentaho.com/showthread.php?68687-Selecting-Attributes-with-Weka or http://weka.wikispaces.com/Performing+attribute+selection
Read up on the topic of clustering algorithms (only on your training set though!)
Look into the InfoGainAttributeEval class.
The buildEvaluator() and the evaluateAttribute(int index) functions should help.
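A minimal sketch of using those two calls directly might look like this; the ARFF file name is a placeholder, and attributes with very low information gain would be the candidates to throw away.

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: score every attribute by information gain with respect to the class.
public class InfoGainSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        InfoGainAttributeEval eval = new InfoGainAttributeEval();
        eval.buildEvaluator(data);                       // buildEvaluator(Instances)

        for (int i = 0; i < data.numAttributes(); i++) {
            if (i == data.classIndex()) continue;        // skip the class attribute itself
            System.out.printf("%-20s %.5f%n",
                    data.attribute(i).name(), eval.evaluateAttribute(i));
        }
    }
}
```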
I'm building a text classifier in Java with the Weka library.
First I remove stopwords, then I use a stemmer (e.g. converting cars to car).
Right now I have 6 predefined categories. I train the classifier on 5 documents for every category. The lengths of the documents are similar.
The results are OK when the text to be classified is short, but when the text is longer than 100 words the results get stranger and stranger.
I return the probabilities for each category as follows:
Probability:
[0.0015560238056109177, 0.1808919321002592, 0.6657404531908249, 0.004793498469427115, 0.13253647895234325, 0.014481613481534815]
which is a pretty reliable classification.
But when I use texts longer than around 100 words I get results like:
Probability: [1.2863123678314889E-5, 4.3728547754744305E-5, 0.9964710903856974, 5.539960514402068E-5, 0.002993481218084141, 4.234371196414616E-4]
which is too good: the classifier becomes almost completely certain of a single category.
Right now I'm using Naive Bayes Multinomial to classify the documents. I have read about it and found out that it can act strangely on longer texts. Might that be my problem right now?
Does anyone have a good idea why this is happening?
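For reference, a rough sketch of how this pipeline might look with the Weka Java API, assuming a recent Weka (3.8-style StringToWordVector options) and the raw documents in an ARFF file with one string attribute plus the nominal class; the file name, stopword list, stemmer, and other option choices are placeholders, not necessarily what the original code uses.

```java
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stemmers.LovinsStemmer;
import weka.core.stopwords.Rainbow;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Sketch: stopword removal + stemming via StringToWordVector, then
// NaiveBayesMultinomial, then per-category probabilities.
public class TextClassifierSketch {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("documents.arff");
        raw.setClassIndex(raw.numAttributes() - 1);

        StringToWordVector bow = new StringToWordVector();
        bow.setStopwordsHandler(new Rainbow());     // built-in stopword list
        bow.setStemmer(new LovinsStemmer());        // e.g. "cars" -> "car"
        bow.setTFTransform(true);                   // dampen raw word counts
        bow.setInputFormat(raw);
        Instances data = Filter.useFilter(raw, bow);

        NaiveBayesMultinomial nb = new NaiveBayesMultinomial();
        nb.buildClassifier(data);

        // probabilities for each of the six categories, here for the first training document
        double[] probs = nb.distributionForInstance(data.instance(0));
        System.out.println(java.util.Arrays.toString(probs));
    }
}
```

One likely reason the probabilities look so extreme on long texts is that naive Bayes multiplies one per-word likelihood per token, so the more tokens a document has, the further the winning class pulls ahead of the others.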
There can be multiple factors behind this behavior. If your training and test texts are not from the same domain, this can happen. Also, I believe adding more documents for every category should do some good; 5 documents per category seems very few. If you do not have more training documents, or it is difficult to get more, then you can synthetically add positive and negative instances to your training set (see the SMOTE algorithm in detail). Keep us posted on the update.