I am trying to classify legal case documents, which are plain text files organised into folders such as Civil, Land, Criminal, etc. My plan was to use Naive Bayes as a vectoriser to obtain vectors from the text documents and then feed those into an SVM (via javaml) to classify the documents. I have implemented preprocessing such as stemming, and I used the Naive Bayes formulas from http://eprints.nottingham.ac.uk/2995/1/Isa_Text.pdf to calculate the prior probability, likelihood, evidence and posterior probability. I am assuming the posterior probabilities form the vector to be fed into the SVM, but I cannot format the output to feed into the SVM library.
I need all the help I can get with this, and I hope I am doing things right.
I also have other legal cases as a test set that I want to classify into the right categories.
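For what it's worth, here is a minimal sketch (not the asker's code) of how per-class posterior probabilities could be packaged as feature vectors for javaml (Java-ML). It assumes you already have a method that returns the posteriors for a document (posteriorsFor below is a hypothetical helper), and that your Java-ML build ships the bundled LibSVM wrapper; adjust names to your setup.

```java
import libsvm.LibSVM;                       // LibSVM wrapper bundled with Java-ML (assumption: present in your build)
import net.sf.javaml.classification.Classifier;
import net.sf.javaml.core.Dataset;
import net.sf.javaml.core.DefaultDataset;
import net.sf.javaml.core.DenseInstance;
import net.sf.javaml.core.Instance;

public class LegalCaseSvmSketch {

    // Hypothetical helper: returns one posterior probability per category,
    // e.g. [P(Civil|doc), P(Land|doc), P(Criminal|doc), ...].
    static double[] posteriorsFor(String documentText) {
        throw new UnsupportedOperationException("plug in your Naive Bayes code here");
    }

    public static void main(String[] args) {
        Dataset training = new DefaultDataset();

        // Each training document becomes one DenseInstance:
        // the feature values are the posteriors, the second argument is the known label.
        training.add(new DenseInstance(posteriorsFor("...civil case text..."), "Civil"));
        training.add(new DenseInstance(posteriorsFor("...land case text..."), "Land"));
        training.add(new DenseInstance(posteriorsFor("...criminal case text..."), "Criminal"));

        Classifier svm = new LibSVM();
        svm.buildClassifier(training);

        // Unlabelled test document: same vector format, no class value.
        Instance query = new DenseInstance(posteriorsFor("...unseen case text..."));
        Object predicted = svm.classify(query);
        System.out.println("Predicted category: " + predicted);
    }
}
```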
I am currently working on a Java project in NLP/IR, and I am fairly new to this.
The project is based on a collection of around 1000 documents, each with about 100 words, structured as a bag of words with term frequencies. I want to find documents similar to a given document from the collection.
My idea is to use TF-IDF: calculate tf-idf for the query (a given document) and for every other document in the collection, then compare these values as vectors with cosine similarity. Could this give some insight into their similarity? Or would it not be reasonable, given how big the query (a whole document) is?
Are there any other similarity measures that could work better?
Thanks for the help
TF-IDF-based similarity, typically using cosine similarity to compare a vector representing the query terms to a set of vectors representing the TF-IDF values of the documents, is a common approach to calculating "similarity".
Mind that "similarity" is a very generic term. In the IR domain, you typically speak of "relevance" instead. Texts can be similar on many levels: written in the same language, using the same characters, using the same words, talking about the same people, using a similarly complex grammatical structure, and much more; consequently, there are many, many measures. Search the web for text similarity to find numerous publications as well as open-source frameworks and libraries that implement different measures.
Today, "semantic similarity" is attracting more interest than the traditional keyword-based IR models. If this is your area of interest, you might look into the results of the SemEval shared tasks from 2012-2015.
If all you want is to compare two documents using TF-IDF, you can do that. Since you mention that each document contains about 100 words, in the worst case there might be 1000 * 100 unique words, so I'm assuming your vectors are built over all unique words (since all documents should be represented in the same dimension). If the number of unique words is too high, you could try a dimensionality-reduction technique such as PCA to reduce the number of dimensions. But what you are trying to do is right: you can always compare documents like this to measure their similarity.
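As a rough illustration of that comparison step (not from the original post), here is a minimal sketch of cosine similarity over two TF-IDF vectors that are assumed to share the same dimension and term ordering:

```java
public class CosineSimilarity {

    /**
     * Cosine similarity between two TF-IDF vectors of equal length,
     * where index i refers to the same term in both vectors.
     */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // at least one document has no terms from the vocabulary
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy example: 5-term vocabulary, query document vs. one collection document.
        double[] query = {0.0, 1.2, 0.0, 0.7, 0.3};
        double[] doc   = {0.1, 0.9, 0.0, 0.6, 0.0};
        System.out.println("similarity = " + cosine(query, doc));
    }
}
```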
If you want similarity more in the sense of semantics, you should look at techniques like LDA (topic modelling).
I have a numerical dataset of the format class, unigram count, bigram count, sentiment. I went through some of the Apache Mahout documentation and it was all about text data. I am aware that I need to perform three steps to classify: convert to sequence files, vectorize the sequence files, and pass them in to train the Naive Bayes classifier. But I am having a hard time understanding the difference between classifying a text dataset and classifying a numerical dataset in Mahout. What do I need to do differently in my case? I would appreciate any help.
As you might know, Mahout cannot use text data directly to train a model. If you start from a numerical dataset, the classification is actually easier, because the vectors that Mahout handles are numerical data vectors.
I used Mahout on a text dataset, and in that case I had to use a dictionary to convert the text data to numerical data. Some algorithms handle this better than others (for example, Naive Bayes strongly prefers text-like data).
So in your case, try other classifiers such as random forest or online logistic regression to obtain better results. In my experience, with random forest you can just declare the type of each feature (in your case all your features are numerical), so the classification can be done pretty easily. If you want to stick with Naive Bayes, I am sure it is still possible to classify your numerical dataset, but I have never used it that way, so I cannot give more help.
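For the "vectorize" step with purely numerical features, a minimal sketch (my own, not from the Mahout docs) could wrap each row in a DenseVector and write it to a sequence file keyed by its label; the field values and the "/label/id" key convention are assumptions you should check against the trainer you end up using:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class NumericVectorWriter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path("vectors/part-r-00000");

        // One row of the dataset: class label plus numerical features
        // (unigram count, bigram count, sentiment) -- made-up values for illustration.
        String label = "positive";
        double[] features = {42.0, 17.0, 0.8};

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, output, Text.class, VectorWritable.class);
        try {
            // Key convention "/label/docId" is what Mahout's classifiers commonly expect;
            // verify this against the specific trainer you use.
            writer.append(new Text("/" + label + "/doc-0001"),
                          new VectorWritable(new DenseVector(features)));
        } finally {
            writer.close();
        }
    }
}
```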
Can anybody help me implement alternative missing-value handling in the J48 algorithm using the Weka API in Java?
I am sure that using pre-imputation approaches before training the J48 is easy.
But what about using a surrogate split attribute when partitioning the training data (like Breiman does in CART), instead of the J48 standard approach (Quinlan in C4.5) of splitting the cases across a probability distribution derived from observed cases with a known value?
Can anybody give me some information, tips, or help on where in the Weka API and source code I have to make modifications to replace the standard approach with surrogate splits?
Look at the Weka source code in weka.classifiers.trees.j48.C45ModelSelection, starting from line 152 (Find "best" attribute to split on). It uses the information gain ratio as the splitting criterion.
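As context for where such a change would plug in, here is a minimal sketch (mine, not the answerer's) of training a stock J48 through the public Weka API; the data file name is made up, and the surrogate-split logic itself would have to live inside your modified copy of C45ModelSelection:

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Baseline {

    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file with missing values; replace with your own data.
        Instances data = DataSource.read("cases-with-missing-values.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Stock J48: internally it builds a C45ModelSelection, which is where
        // split selection (and hence any surrogate-split replacement) happens.
        J48 tree = new J48();
        tree.buildClassifier(data);

        System.out.println(tree);
    }
}
```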
I have seen questions related to this post, but it is a little bit confusing.
I have gone through the Kaggle site.
Here we need to distinguish between dog and cat (looking at the training set only).
I want to apply my own SVM Java implementation to this "train" data.
How do I do that? My normal SVM takes only numeric values with 1/-1 classification.
Here the data consists of images.
So do I need to first convert the images into numerical data?
What will the flow be: convert to numeric data, then run the SVM, and then what will the final result be?
Where can I find a large file (around 1 GB) for training an SVM? Only numeric values, not images.
I don't want to use libraries like libsvm for this. I only need to do the training.
Any suggestions?
I'm currently developing a percussion tutorial program. The program requires that I can determine which drum is being played; to do this, I was going to analyse the frequency content of the drum recording and see whether the frequencies fall within a given range.
I have been using the Apache Commons Math implementation of the FFT so far (http://commons.apache.org/math/), but my question is: once I perform the FFT, how do I use the array of results to calculate the frequencies contained in the signal?
Note: I have also tried experimenting with autocorrelation, but it didn't seem to work too well with samples from a drum kit.
Any help, or alternative suggestions for how to determine which drum is being hit, would be greatly appreciated.
Edit: Since writing this I've found a great online lesson on implementing the FFT in Java for time/frequency transformations: Spectrum Analysis in Java.
In the area of music information retrieval, people often use a related metric known as the mel-frequency cepstral coefficients (MFCCs).
For any N-sample segment of your signal, take the FFT. Those resulting N samples are transformed into a set of MFCCs containing, say, 12 elements (i.e., coefficients). This 12-element vector is used to classify the instrument, including which drum is used.
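To make the FFT-to-MFCC step more concrete, here is a rough, simplified sketch of the usual pipeline (mel filterbank over the magnitude spectrum, log, then a DCT). Real implementations add windowing, pre-emphasis and other details, and the filter and coefficient counts vary, so treat this only as an outline of the idea:

```java
public class MfccSketch {

    // Standard mel-scale conversions.
    static double hzToMel(double hz)  { return 2595.0 * Math.log10(1.0 + hz / 700.0); }
    static double melToHz(double mel) { return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0); }

    /**
     * Turn the magnitude spectrum of one frame (bins 0..N/2 of the FFT)
     * into MFCCs: mel filterbank -> log -> DCT-II, keeping numCoeffs values.
     */
    static double[] mfcc(double[] magnitude, double sampleRate, int numFilters, int numCoeffs) {
        int numBins = magnitude.length;              // = N/2 + 1
        double maxMel = hzToMel(sampleRate / 2.0);

        // Edges/centres of the triangular filters, evenly spaced on the mel scale.
        double[] centresHz = new double[numFilters + 2];
        for (int m = 0; m < centresHz.length; m++) {
            centresHz[m] = melToHz(maxMel * m / (numFilters + 1));
        }

        // Log energy in each triangular mel filter.
        double[] logEnergy = new double[numFilters];
        for (int f = 0; f < numFilters; f++) {
            double lo = centresHz[f], mid = centresHz[f + 1], hi = centresHz[f + 2];
            double energy = 0.0;
            for (int bin = 0; bin < numBins; bin++) {
                double freq = bin * (sampleRate / 2.0) / (numBins - 1);
                double weight = 0.0;
                if (freq > lo && freq <= mid)      weight = (freq - lo) / (mid - lo);
                else if (freq > mid && freq < hi)  weight = (hi - freq) / (hi - mid);
                energy += weight * magnitude[bin] * magnitude[bin];
            }
            logEnergy[f] = Math.log(energy + 1e-10);  // small offset avoids log(0)
        }

        // DCT-II of the log filterbank energies gives the cepstral coefficients.
        double[] coeffs = new double[numCoeffs];
        for (int k = 0; k < numCoeffs; k++) {
            double sum = 0.0;
            for (int f = 0; f < numFilters; f++) {
                sum += logEnergy[f] * Math.cos(Math.PI * k * (f + 0.5) / numFilters);
            }
            coeffs[k] = sum;
        }
        return coeffs;
    }
}
```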
To do supervised classification, you can use something like a support vector machine (SVM). LIBSVM is a commonly used library that has Java compatibility (and many other languages). You train the SVM with these MFCCs and their corresponding instrument labels. Then, you test it by feeding a query MFCC vector, and it will tell you which instrument it is.
So the basic procedure, in summary:
Get FFT.
Get MFCCs from FFT.
Train SVM with MFCCs and instrument labels.
Query the SVM with MFCCs of the query signal.
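Putting steps 3 and 4 into code, a minimal sketch using LIBSVM's Java API might look like the following; the MFCC vectors and labels are placeholders, and the parameter values are common defaults rather than tuned settings:

```java
import libsvm.svm;
import libsvm.svm_model;
import libsvm.svm_node;
import libsvm.svm_parameter;
import libsvm.svm_problem;

public class DrumSvmSketch {

    // Convert one MFCC vector into LIBSVM's sparse node format.
    static svm_node[] toNodes(double[] mfcc) {
        svm_node[] nodes = new svm_node[mfcc.length];
        for (int i = 0; i < mfcc.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1;   // LIBSVM feature indices start at 1
            nodes[i].value = mfcc[i];
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Placeholder training data: MFCC vectors with numeric labels
        // (e.g. 0 = kick, 1 = snare). Real vectors would have ~12 coefficients.
        double[][] trainMfccs = { {1.2, -0.3, 0.8}, {0.1, 0.9, -1.1} };
        double[] labels = { 0.0, 1.0 };

        svm_problem problem = new svm_problem();
        problem.l = trainMfccs.length;
        problem.y = labels;
        problem.x = new svm_node[trainMfccs.length][];
        for (int i = 0; i < trainMfccs.length; i++) {
            problem.x[i] = toNodes(trainMfccs[i]);
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.C = 1.0;
        param.gamma = 0.5;
        param.cache_size = 100;
        param.eps = 1e-3;

        svm_model model = svm.svm_train(problem, param);

        // Query: MFCCs of a new drum hit.
        double[] queryMfcc = {1.0, -0.2, 0.7};
        double predictedLabel = svm.svm_predict(model, toNodes(queryMfcc));
        System.out.println("Predicted drum label: " + predictedLabel);
    }
}
```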
Check for Java packages that do these things. (They must exist; I just don't know them off-hand.) Relatively speaking, drum transcription is easier than transcription of most other instrument groups, so I am optimistic that this would work.
For further reading, there are a whole bunch of articles on drum transcription.
When I made a program using a DFT, I had it create an array of frequencies and the amplitude at each frequency. I could then find the largest amplitudes and compare those to musical notes, getting a good grasp of what was played. If you know the approximate frequency of the drum, you should be able to do the same.
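In case it helps, here is a minimal sketch of that idea using Apache Commons Math 3's FFT (a newer API than the link in the question, so adjust if you are on an older version): output bin i corresponds to frequency i * sampleRate / N, and the bin with the largest magnitude gives the dominant frequency.

```java
import org.apache.commons.math3.complex.Complex;
import org.apache.commons.math3.transform.DftNormalization;
import org.apache.commons.math3.transform.FastFourierTransformer;
import org.apache.commons.math3.transform.TransformType;

public class DominantFrequency {

    /**
     * Returns the frequency (in Hz) of the FFT bin with the largest magnitude.
     * The input length must be a power of two for this transformer.
     */
    static double dominantFrequency(double[] samples, double sampleRate) {
        FastFourierTransformer fft = new FastFourierTransformer(DftNormalization.STANDARD);
        Complex[] spectrum = fft.transform(samples, TransformType.FORWARD);

        int n = samples.length;
        int bestBin = 0;
        double bestMagnitude = 0.0;
        // Only the first half of the bins is meaningful for a real-valued signal;
        // bin 0 (the DC component) is skipped.
        for (int bin = 1; bin < n / 2; bin++) {
            double magnitude = spectrum[bin].abs();
            if (magnitude > bestMagnitude) {
                bestMagnitude = magnitude;
                bestBin = bin;
            }
        }
        return bestBin * sampleRate / n;   // bin index -> frequency in Hz
    }

    public static void main(String[] args) {
        // Synthetic test tone at 180 Hz sampled at 8 kHz.
        double sampleRate = 8000.0;
        int n = 1024;
        double[] samples = new double[n];
        for (int i = 0; i < n; i++) {
            samples[i] = Math.sin(2 * Math.PI * 180.0 * i / sampleRate);
        }
        System.out.println("Dominant frequency ~ " + dominantFrequency(samples, sampleRate) + " Hz");
    }
}
```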