I have a numerical dataset of the format class, unigram count, bigram count, sentiment. I went through some of the Apache Mahout documentation and it was all about text data. I am aware that I need to perform 3 steps to classify: Convert to sequence files, Vectorize sequence files, Pass it to train the Naive Bayes Classifier. But I am having a hard time to understand the difference between classifying a text dataset vs classifying a numerical dataset in Mahout. What do I need to do differently in my case? I would appreciate any help.
As you might know, mahout can not use text data to train a model. If you start from a numerical dataset, the classification will be even easier because the vectors that mahout handle are numerical data vectors.
I used mahout on a text dataset and I know that in that case, I had to use dictionnary to convert text data to numerical data. Some algorithms handle it better than others ( for example Naive Bayes strongly prefers text-like data).
So in your case, try to use other classifiers like random forrest or online logistic regression to obtain more efficient result. In my experience, using random forrest, you can just define the type of features that you have (in your case all your features are numerical) so the classification could be done pretty easily. If you want to stick with Naive Bayes, I am sure it is still possible to classify your numerical dataset but I never used it so I can not give more help.
Related
I am trying to classify legal case documents which are in text format, in different folders like Civil, Land, Criminal, e.t.c, I intended using Naive Bayes as Vectoriser to get the vectors from the text documents,feed it in to SVM to classify the documents using javaml, I have implemented the preprocessing like stemming, I used the formulars of Naive Bayes as seen in http://eprints.nottingham.ac.uk/2995/1/Isa_Text.pdf to calculate prior probability, likelihood, evidence and posterior probability, I am assuming the posterior probability is the vector to be fed into SVM, but I cannot format the output to feed into the SVM library.
I need all the help I can get in this, I hope I am doing things right.
I have other legal cases as test set that I want to classify to the right categories.
I have seen questions related to this post but it is little bit confusing.
I have gone through kaggle site
Here we need to distingush between Dog and cat (Looking into training only).
I want to apply my svm java implementation in these "train" data.
How to do that. My Normal svm takes only numeric values with 1/-1 classification.
Here it is images.
So i need to first convert the images into numerical data?
How will be the flow : convert to numeric data then do svm then what will be final result
Where can i find a large file (1GB) for training svm.Only numeric value not images.
I dont want to use libraries for that like libsvm. I need to do only training.
Any suggestion.
I have some high dimensional (30000 dimensions) vectors of integer numbers. I have 2 classes: [YES, NO]. I have 6000 samples of the YES-class and 50000 samples of the NO-class. I would like to train a classifier, to classify new samples in future automatically to one of these classes.
I know how to use the Weka Java API, but I am not sure which algorithms in which order to use. Can anyone give me advice on the following questions:
Are the vectors too high dimensional or do I have too many samples to do this efficiently in Weka?
Should I reduce the dimensionality before I start? What algorithm can I use to identify significant elements of my feature vector?
What classifier would be best to classify this kind of data? I think a decision tree should work fine, but maybe a naive bayes is faster to train, is it?
Since every element must have a name in weka, how can I assign a name to each of my 30000 features?
Any advice is appreciated. Thanks.
The number of dimensions to this problem most certainly are quite large, but I believe that Weka should be able to handle a large number of dimensions. The number of samples should not be a problem, but there are a lot more NO class samples than there are YES Class, so balancing the two might assist in classifying the NO Class cases better.
If you believe that there are redundant dimensions or some of the dimensions may contain noise, then it would certainly help.
A decision tree shouldn't be too much of a problem. There are a number of algorithms available in Weka, but I wouldn't recommend Neural Networks given the dimensionality of the problem.
If you have saved the data in a CSV File, you could assign attribute names in the first row of the data. This way, you can assign attribute names. Given the number of dimensions, you would likely call these a1 to a30000 and output for the output class.
Hope this Helps!
good day everyone. i try learn some about parsing files, and i decide start with DOM, xPath or SAX,
for example how can i getting numeric values form two 2x2 matrix.
and afterwards perform computation in JAVA of these two matrix to output result.
what is best to use for parsing,and then obtain result for computation matrixes with some algorithm?
You might look at the JScience library, which uses javolution for XML marshaling.
I am sorry, if my question sounds stupid :)
Can you please recommend me any pseudo code or good algo for LSI implementation in java?
I am not math expert. I tried to read some articles on wikipedia and other websites about
LSI ( latent semantic indexing ) they were full of math.
I know LSI is full of math. But if i see some source code or algo. I understand things more
easily. That's why i asked here, because so many GURU are here !
Thanks in advance
An idea of LSA is based on one assumption: the more two words occur in same documents, the more similar they are. Indeed, we can expect that words "programming" and "algorithm" will occur in same documents much more often then, say, "programming" and "dog-breeding".
Same for documents: the more common/similar words two documents have, the more similar themselves they are. So, you can express similarity of documents by frequencies of words and vice versa.
Knowing this, we can construct a co-occurrence matrix, where column names represent documents, row names - words and each cells[i][j] represents frequency of word words[i] in document documents[j]. Frequency may be computed in many ways, IIRC, original LSA uses tf-idf index.
Having such matrix, you can find similarity of two documents by comparing corresponding columns. How to compare them? Again, there are several ways. The most popular is a cosine distance. You must remember from school maths, that matrix may be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called "Vector Space Model". More on VSM and cosine distance here.
But we have one problem with such matrix: it is big. Very very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses SVD technique to keep the most "important" vectors. After reduction matrix is ready to use.
So, algorithm for LSA will look something like this:
Collect all documents and all unique words from them.
Extract frequency information and build co-occurrence matrix.
Reduce matrix with SVD.
If you're going to write LSA library by yourself, the good point to start is Lucene search engine, which will make much easier steps 1 and 2, and some implementation of high-dimensional matrices with SVD capability like Parallel Colt or UJMP.
Also pay attention to other techinques, which grown up from LSA, like Random Indexing. RI uses same idea and shows approximately same results, but doesn't use full matrix stage and is completely incremental, which makes it much more computationally efficient.
This maybe a bit late but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site if you are interested.
The process is way less complicated than it is often written up as. And really all you need is a library that can do single value decomposition of a matrix.
If you are interested I can explain in a couple of the short take away bits:
1) you create a matrix/dataset/etc with word counts of various documents - the different documents will be your columns and the rows the distinct words.
2) Once you've created the matrix you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the single value decomposition. All this does is take your original matrix and break it up in to three different parts/matrix that essentially represent your documents, your words, and kind of a multiplier (sigma) these are called the vectors.
3) Once you have you word, document, sigma vectors you shrink them equally (k) by just copying smaller parts of the vector/matrix and then multiply them back together. By shrinking them it kind of normalizes your data and this is LSI.
here are some fairly clear resources:
http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf
Hope this help you out a bit.
Eric