I wrote WEKA Java code to train 4 classifiers. I saved the classifier models and want to use them to predict new unseen instances (think of it as someone who wants to test whether a tweet is positive or negative).
I used the StringToWordVector filter on the training data. To avoid the "Src and Dest differ in # of attributes" error, I used the following code to initialize the filter with the training data before applying it to the new instance, so I can predict whether that instance is positive or negative. And I just can't get it right.
Classifier cls = (Classifier) weka.core.SerializationHelper.read("models/myModel.model"); //reading one of the trained classifiers
BufferedReader datafile = readDataFile("Tweets/tone1.ARFF"); //read training data
Instances data = new Instances(datafile);
data.setClassIndex(data.numAttributes() - 1);
Filter filter = new StringToWordVector(50);//keep 50 words
filter.setInputFormat(data);
Instances filteredData = Filter.useFilter(data, filter);
// rebuild classifier
cls.buildClassifier(filteredData);
String testInstance= "Text that I want to use as an unseen instance and predict whether it's positive or negative";
System.out.println(">create test instance");
FastVector attributes = new FastVector(2);
attributes.addElement(new Attribute("text", (FastVector) null));
// Add class attribute.
FastVector classValues = new FastVector(2);
classValues.addElement("Negative");
classValues.addElement("Positive");
attributes.addElement(new Attribute("Tone", classValues));
// Create dataset with initial capacity of 100, and set index of class.
Instances tests = new Instances("test istance", attributes, 100);
tests.setClassIndex(tests.numAttributes() - 1);
Instance test = new Instance(2);
// Set value for message attribute
Attribute messageAtt = tests.attribute("text");
test.setValue(messageAtt, messageAtt.addStringValue(testInstance));
test.setDataset(tests);
Filter filter2 = new StringToWordVector(50);
filter2.setInputFormat(tests);
Instances filteredTests = Filter.useFilter(tests, filter2);
System.out.println(">train Test filter using training data");
Standardize sfilter = new Standardize(); //Match the number of attributes between src and dest.
sfilter.setInputFormat(filteredData); // initializing the filter with training set
filteredTests = Filter.useFilter(filteredData, sfilter); // create new test set
ArffSaver saver = new ArffSaver(); //save test data to ARFF file
saver.setInstances(filteredTests);
File unseenFile = new File ("Tweets/unseen.ARFF");
saver.setFile(unseenFile);
saver.writeBatch();
When I try to standardize the input data using the filtered training data, I get a new ARFF file (unseen.ARFF), but with 2000 instances (the same number as the training data) where most of the values are negative. I don't understand why, or how to remove those instances.
System.out.println(">Evaluation"); //without the following 2 lines I get ArrayIndexOutOfBoundException.
filteredData.setClassIndex(filteredData.numAttributes() - 1);
filteredTests.setClassIndex(filteredTests.numAttributes() - 1);
Evaluation eval = new Evaluation(filteredData);
eval.evaluateModel(cls, filteredTests);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
Printing the evaluation results, I want to see, for example, a percentage of how positive or negative this instance is, but instead I get the following. I also want to see 1 instance instead of 2000. Any help on how to do this would be great.
Results
======
Correlation coefficient 0.0285
Mean absolute error 0.8765
Root mean squared error 1.2185
Relative absolute error 409.4123 %
Root relative squared error 121.8754 %
Total Number of Instances 2000
Thanks
Use eval.predictions(). It returns a java.util.ArrayList<Prediction>. Then you can use the Prediction.weight() method to get how positive or negative your test variable is.
cls.distributionForInstance(newInst) returns the probability distribution for an instance. Check the docs.
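For example, a minimal sketch (assuming cls is your deserialized classifier and newInst is a single instance already in the same filtered format as the training data):
double[] dist = cls.distributionForInstance(newInst);
// indices follow the order of the class attribute's values, here {Negative, Positive}
System.out.println("P(Negative) = " + dist[0]);
System.out.println("P(Positive) = " + dist[1]);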
I have reached a good solution, and here I share my code with you. It trains a classifier using the WEKA Java API, then uses it to predict new unseen instances. Some parts, like paths, are hardcoded, but you can easily modify the method to take parameters.
/**
 * This method performs classification of unseen instances.
 * It starts by training a model using a selection of classifiers, then classifies new unlabeled instances.
 */
public static void predict() throws Exception {
    // Start by providing the paths for your training and testing ARFF files.
    // Make sure both files have the same structure and the exact same classes in the header.
    // initialize classifier
    Classifier classifier = null;
    System.out.println("read training arff");
    Instances train = new Instances(new BufferedReader(new FileReader("Train.arff")));
    train.setClassIndex(0); // in my case the class was the first attribute, thus zero; otherwise it's numAttributes() - 1
    System.out.println("read testing arff");
    Instances unlabeled = new Instances(new BufferedReader(new FileReader("Test.arff")));
    unlabeled.setClassIndex(0);
    // training using a collection of classifiers (NaiveBayes, SMO (a.k.a. SVM), KNN and decision trees)
    String[] algorithms = {"nb", "smo", "knn", "j48"};
    for (int w = 0; w < algorithms.length; w++) {
        if (algorithms[w].equals("nb"))
            classifier = new NaiveBayes();
        if (algorithms[w].equals("smo"))
            classifier = new SMO();
        if (algorithms[w].equals("knn"))
            classifier = new IBk();
        if (algorithms[w].equals("j48"))
            classifier = new J48();
        System.out.println("==========================================================================");
        System.out.println("training using " + algorithms[w] + " classifier");
        Evaluation eval = new Evaluation(train);
        // perform 10-fold cross-validation
        eval.crossValidateModel(classifier, train, 10, new Random(1));
        String output = eval.toSummaryString();
        System.out.println(output);
        String classDetails = eval.toClassDetailsString();
        System.out.println(classDetails);
        classifier.buildClassifier(train);
    }
    // note: after the loop, `classifier` holds the last model trained (J48); that is what labels the unseen data below
    Instances labeled = new Instances(unlabeled);
    // label instances (use the trained classifier to classify new unseen instances)
    for (int i = 0; i < unlabeled.numInstances(); i++) {
        double clsLabel = classifier.classifyInstance(unlabeled.instance(i));
        labeled.instance(i).setClassValue(clsLabel);
        System.out.println(clsLabel + " -> " + unlabeled.classAttribute().value((int) clsLabel));
    }
    // save the model for future use
    ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("myModel.dat"));
    out.writeObject(classifier);
    out.close();
    System.out.println("===== Saved model =====");
}
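To reuse the saved model later without retraining, a minimal sketch (assuming the same myModel.dat file as above):
ObjectInputStream in = new ObjectInputStream(new FileInputStream("myModel.dat"));
Classifier loaded = (Classifier) in.readObject();
in.close();
// or, equivalently, with WEKA's helper:
// Classifier loaded = (Classifier) weka.core.SerializationHelper.read("myModel.dat");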
I am using Weka to do text classification. I created a NaiveBayes model using the Weka GUI, saved it, and am now trying to use it to classify instances of a test set. This is my code:
Classifier clsClassifier = (Classifier) weka.core.SerializationHelper.read("Source/test/80percentModel.model");
StringToWordVector filter = new StringToWordVector();
BufferedReader reader = new BufferedReader(
new FileReader("Source/test/clt.train.arff"));
Instances trainingData = new Instances(reader);
reader.close();
trainingData.setClassIndex(trainingData.numAttributes() - 1);
filter.setInputFormat(trainingData);
BufferedReader reader2 = new BufferedReader(
new FileReader("Source/test/clt.test.arff"));
Instances testingData = new Instances(reader2);
reader2.close();
testingData.setClassIndex(testingData.numAttributes() - 1);
testingData = Filter.useFilter(testingData, filter);
System.out.println(testingData.numInstances());
for (int j = 0; j < testingData.numInstances(); j++) {
double res = clsClassifier.classifyInstance(testingData.get(j));
System.out.println(testingData.classAttribute().value((int)res));
}
I am getting the following error:
java.lang.IllegalArgumentException: Src and Dest differ in # of attributes: 1 != 1781
at weka.core.RelationalLocator.copyRelationalValues(RelationalLocator.java:87)
at weka.filters.Filter.copyValues(Filter.java:405)
at weka.filters.Filter.push(Filter.java:326)
at weka.filters.unsupervised.attribute.StringToWordVector.input(StringToWordVector.java:655)
at weka.classifiers.meta.FilteredClassifier.filterInstance(FilteredClassifier.java:672)
at weka.classifiers.meta.FilteredClassifier.distributionForInstance(FilteredClassifier.java:699)
at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:173)
at test.WekaClassification.main(WekaClassification.java:66)
I don't quite get what I am doing wrong here. Why is there a mismatch in the number of attributes? And is this the correct way to apply a trained model to a test data set?
There are several possibilities, but your error shows that the number of attributes differs between your training and test datasets. They must have exactly the same format, types, and values. The test file should also have values for the label attribute. Check whether you get the same error in the Weka GUI. StringToWordVector may not be filtering the way you expect; inspect its output to verify.
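For instance, a minimal sketch of keeping train and test compatible with a single filter (assuming trainingData and testingData are already loaded; note that your stack trace mentions FilteredClassifier, so if your saved model is one, it expects raw, unfiltered test instances instead):
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(trainingData);                              // dictionary is learned from the training set
Instances filteredTrain = Filter.useFilter(trainingData, filter);
Instances filteredTest = Filter.useFilter(testingData, filter);   // same filter instance -> same attributes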
I'm using the model builder addon for OpenNLP to create a better NER model.
According to this post, I have used the code posted by markg:
public class ModelBuilderAddonUse {

    private static List<String> getSentencesFromSomewhere() throws Exception {
        List<String> list = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader("D:\\Work\\workspaces\\default\\UpdateModel\\documentrequirements.docx"));
        String line;
        while ((line = reader.readLine()) != null) {
            list.add(line);
        }
        reader.close();
        return list;
    }

    public static void main(String[] args) throws Exception {
        /**
         * establish a file to put sentences in
         */
        File sentences = new File("D:\\Work\\workspaces\\default\\UpdateModel\\sentences.text");
        /**
         * establish a file to put your NER hits in (the ones you want to keep based
         * on prob)
         */
        File knownEntities = new File("D:\\Work\\workspaces\\default\\UpdateModel\\knownentities.txt");
        /**
         * establish a BLACKLIST file to put your bad NER hits in (also can be based
         * on prob)
         */
        File blacklistedentities = new File("D:\\Work\\workspaces\\default\\UpdateModel\\blentities.txt");
        /**
         * establish a file to write your annotated sentences to
         */
        File annotatedSentences = new File("D:\\Work\\workspaces\\default\\UpdateModel\\annotatedSentences.txt");
        /**
         * establish a file to write your model to
         */
        File theModel = new File("D:\\Work\\workspaces\\default\\UpdateModel\\nl-ner-person.bin");

        // ------------ create a bunch of file writers to write your results and sentences to a file
        FileWriter sentenceWriter = new FileWriter(sentences, true);
        FileWriter blacklistWriter = new FileWriter(blacklistedentities, true);
        FileWriter knownEntityWriter = new FileWriter(knownEntities, true);

        // set some thresholds to decide where to write hits, you don't have to use these at all...
        double keeperThresh = .95;
        double blacklistThresh = .7;

        /**
         * Load your model as normal
         */
        TokenNameFinderModel personModel = new TokenNameFinderModel(new File("D:\\Work\\workspaces\\default\\UpdateModel\\nl-ner-person.bin"));
        NameFinderME personFinder = new NameFinderME(personModel);

        /**
         * do your normal NER on the sentences you have
         */
        for (String s : getSentencesFromSomewhere()) {
            sentenceWriter.write(s.trim() + "\n");
            sentenceWriter.flush();
            String[] tokens = s.split(" "); // better to use a tokenizer really
            Span[] find = personFinder.find(tokens);
            double[] probs = personFinder.probs();
            String[] names = Span.spansToStrings(find, tokens);
            for (int i = 0; i < names.length; i++) {
                // YOU PROBABLY HAVE BETTER HEURISTICS THAN THIS TO MAKE SURE YOU GET GOOD HITS OUT OF THE DEFAULT MODEL
                if (probs[i] > keeperThresh) {
                    knownEntityWriter.write(names[i].trim() + "\n");
                }
                if (probs[i] < blacklistThresh) {
                    blacklistWriter.write(names[i].trim() + "\n");
                }
            }
            personFinder.clearAdaptiveData();
            blacklistWriter.flush();
            knownEntityWriter.flush();
        }

        // flush and close all the writers
        knownEntityWriter.flush();
        knownEntityWriter.close();
        sentenceWriter.flush();
        sentenceWriter.close();
        blacklistWriter.flush();
        blacklistWriter.close();

        /**
         * THIS IS WHERE THE ADDON IS GOING TO USE THE FILES (AS IS) TO CREATE A NEW MODEL. YOU SHOULD NOT HAVE TO RUN THE FIRST PART AGAIN AFTER THIS RUNS, JUST NOW PLAY WITH THE
         * KNOWN ENTITIES AND BLACKLIST FILES AND RUN THE METHOD BELOW AGAIN UNTIL YOU GET SOME DECENT RESULTS (A DECENT MODEL OUT OF IT).
         */
        DefaultModelBuilderUtil.generateModel(sentences, knownEntities, blacklistedentities, theModel, annotatedSentences, "person", 3);
    }
}
It runs, but my output stops at:
annotated sentences: 1862
knowns: 58
Building Model using 1862 annotations
reading training data...
But in the example in the post it should go further, like this:
Indexing events using cutoff of 5
Computing event counts... done. 561755 events
Indexing... done.
Sorting and merging events... done. Reduced 561755 events to 127362.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 127362
Number of Outcomes: 3
Number of Predicates: 106490
...done.
Can anyone help me fix this problem, so I can generate a model?
I have searched a lot but can't find any good documentation about it.
Would really appreciate it, thanks.
Correct the path to your training data file like this:
File sentences = new File("D:/Work/workspaces/default/UpdateModel/sentences.text");
instead of
File sentences = new File("D:\\Work\\workspaces\\default\\UpdateModel\\sentences.text");
Update
This is how it is used, by adding the files to the project folder. Try it like this:
File sentences = new File("src/training/resources/CreateModel/sentences.txt");
Check my repository on GitHub for reference.
This should help.
I am using the Weka Java API to classify a couple of my instances. The file that I feed Weka with is as follows:
0.3,0.1,1
0.0,0.04,0
0.0,0.03,1
All of the above instances have a unique id assigned to them; for example, the first row has an id of 1098...
I wrote the following code, which uses the Weka Java API to classify the instances and return those that are classified incorrectly:
public static void SVM(ArrayList<String[]> testData) throws FileNotFoundException, IOException, Exception {
    BufferedReader breader = null;
    breader = new BufferedReader(new FileReader("weka/train.txt"));
    Instances train = new Instances(breader);
    train.setClassIndex(train.numAttributes() - 1);
    Instances unlabeled = new Instances(new BufferedReader(new FileReader("weka/test.txt")));
    breader.close();
    // set class attribute
    unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
    // create copy
    Instances labeled = new Instances(unlabeled);
    LibSVM svm = new LibSVM();
    svm.buildClassifier(train);
    Evaluation eval = new Evaluation(train);
    BufferedWriter writer = new BufferedWriter(new FileWriter("weka/labeledSVM.txt"));
    for (int i = 0; i < unlabeled.numInstances(); i++) {
        double clsLabel = svm.classifyInstance(unlabeled.instance(i));
        if (unlabeled.instance(i).value(5) != clsLabel) {
            writer.write("the unique id is: " + testData.get(i)[0] + " real label of the text is : " + unlabeled.instance(i).toString() + ", According to Algorithm result label is: " + clsLabel);
            writer.newLine();
        }
    }
    // flush and close once, after the loop (closing inside the loop would fail on the next write)
    writer.flush();
    writer.close();
}
But a big problem is that the mapping between the unique id and the instance labeled by the algorithm is incorrect, so I am wondering: is there any way to include the unique id of each text inside the instances, but tell the Weka classifier to ignore it?
For example, something like this:
1980,0.3,0.1,1
1981,0.0,0.04,0
1982,0.0,0.03,0
Any other suggestion is appreciated.
The only way I found to do this was to create my own subclass of Instance.
Use "AddID" filter which will assign a uniqueID to every instance, then use FilteredClassifier i.e. weka.classifiers.meta.FilteredClassifier.
I am using the StringToWordVector filter to convert my ARFF to vector format, but it throws an exception:
weka.core.WekaException: weka.classifiers.bayes.NaiveBayesMultinomialUpdateable: Not enough training instances with class labels (required: 1, provided: 0)!
I tried the same in the Weka Explorer and it worked fine.
This is my code:
ArffLoader loader = new ArffLoader();
loader.setFile(new File("valid file"));
Instances structure = loader.getStructure();
structure.setClassIndex(0);
// train NaiveBayes
NaiveBayesMultinomialUpdateable n = new NaiveBayesMultinomialUpdateable();
FilteredClassifier f = new FilteredClassifier();
StringToWordVector s = new StringToWordVector();
f.setFilter(s);
f.setClassifier(n);
f.buildClassifier(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null)
    n.updateClassifier(current);
// output generated model
System.out.println(n);
I have tried another example, but it still does not work:
ArffLoader loader = new ArffLoader();
loader.setFile(new File("valid file"));
Instances structure = loader.getStructure();
// train NaiveBayes
NaiveBayesMultinomialUpdateable n = new NaiveBayesMultinomialUpdateable();
FilteredClassifier f = new FilteredClassifier();
StringToWordVector s = new StringToWordVector();
s.setInputFormat(structure);
Instances struct = Filter.useFilter(structure, s);
struct.setClassIndex(0);
System.out.println(struct.numAttributes()); // only gives 2 or 1 attributes
n.buildClassifier(struct);
Instance current;
while ((current = loader.getNextInstance(struct)) != null)
    n.updateClassifier(current);
// output generated model
System.out.println(n);
The number of attributes printed is always 2 or 1, so it seems the StringToWordVector isn't working as expected.
Original folder : https://www.dropbox.com/sh/cma4hbe2r96ul1c/GL2wNdeVUz
Converted to arff: https://www.dropbox.com/s/efle6ci4lb5riq7/test1.arff
According to your ARFF, the class seems to be the second of the two attributes, so the problem may be here:
struct.setClassIndex(0);
Try:
struct.setClassIndex(1);
UPDATE: I made this change to the first example, and it gives no exception, and prints out:
The independent probability of a class
--------------------------------------
oil spill 40.0
police 989.0
The probability of a word given the class
-----------------------------------------
oil spill police
class Infinity Infinity
I used Weka Explorer:
Loaded the arff file
Applied StringToWordVector filter
Selected IBk as the best classifier
Generated/Saved my_model.model binary
In my Java code I deserialize the model:
URL curl = ClassUtility.findClasspathResource( "models/my_model.model" );
final Classifier cls = (Classifier) weka.core.SerializationHelper.read( curl.openConnection().getInputStream() );
Now I have the classifier, BUT I also need the information from the filter. Where I am stuck is: how do I prepare an instance to be classified by my deserialized model, i.e. how do I apply the filter before classification? (The raw instance that I have to classify has a text field with tokens in it; the filter was supposed to transform that into a list of new attributes.)
I even tried to use a FilteredClassifier, where I set the classifier to the deserialized one and the filter to a manually created instance of StringToWordVector:
final StringToWordVector filter = new StringToWordVector();
filter.setOptions(new String[]{"-C", "-P x_", "-L"});
FilteredClassifier fcls = new FilteredClassifier();
fcls.setFilter(filter);
fcls.setClassifier(cls);
The above does not work either. It throws the exception:
Exception in thread "main" java.lang.NullPointerException: No output instance format defined
What I am trying to avoid is doing the training in the Java code. It can be very slow, and the prospect is that I might have multiple classifiers to train (with different algorithms as well), and I want my app to start fast.
Your problem is that your model doesn't know anything about what the filter did to the data. The StringToWordVector filter transforms the data, but the transformation depends on the input (training) data. A model trained on this transformed data set will only work on data that underwent the exact same transformation. To guarantee this, the filter needs to be part of your model.
Using a FilteredClassifier is the correct idea, but you have to use it from the beginning:
Load the ARFF file
Select FilteredClassifier as classifier
Select StringToWordVector as filter for it
Select IBk as classifier for the FilteredClassifier
Generate/Save the model to my_model.binary
The trained and serialized model will then also contain the initialized filter, including the information on how to transform data.
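As a rough sketch of the same setup in Java (assuming a training ARFF whose last attribute is the class):
Instances train = new Instances(new BufferedReader(new FileReader("train.arff")));
train.setClassIndex(train.numAttributes() - 1);
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(new StringToWordVector());  // the filter is trained together with, and stored inside, the model
fc.setClassifier(new IBk());
fc.buildClassifier(train);
weka.core.SerializationHelper.write("my_model.model", fc);
// later, deserialize and classify raw (unfiltered) instances directly:
Classifier cls = (Classifier) weka.core.SerializationHelper.read("my_model.model");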
Another way to do this is to apply the same filter to your testing data as the one used on the training data. I describe the procedure step by step; in your case you just need to follow the steps after loading your serialized classifier.
Create your training file (e.g training.arff)
Create Instances from training file. Instances trainingData = ..
Use StringToWordVector to transform your string attributes into a numeric representation:
sample code:
StringToWordVector filter = new StringToWordVector();
filter.setWordsToKeep(1000000);
if (useIdf) {
    filter.setIDFTransform(true);
}
filter.setTFTransform(true);
filter.setLowerCaseTokens(true);
filter.setOutputWordCounts(true);
filter.setMinTermFreq(minTermFreq);
filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER));
NGramTokenizer t = new NGramTokenizer();
t.setNGramMaxSize(maxGrams);
t.setNGramMinSize(minGrams);
filter.setTokenizer(t);
WordsFromFile stopwords = new WordsFromFile();
stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
filter.setStopwordsHandler(stopwords);
if (useStemmer) {
    Stemmer s = new /*Iterated*/LovinsStemmer();
    filter.setStemmer(s);
}
filter.setInputFormat(trainingData);
Apply the filter to trainingData: trainingData = Filter.useFilter(trainingData, filter);
Select a classifier to create your model
sample code for LibLinear classifier
Classifier cls = null;
LibLINEAR liblinear = new LibLINEAR();
liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
liblinear.setProbabilityEstimates(true);
// liblinear.setBias(1); // default value
cls = liblinear;
cls.buildClassifier(trainingData);
Save model
sample code
System.out.println("Saving the model...");
ObjectOutputStream oos;
oos = new ObjectOutputStream(new FileOutputStream(path+"mymodel.model"));
oos.writeObject(cls);
oos.flush();
oos.close();
Create a testing file (e.g. testing.arff)
Create Instances from the testing file: Instances testingData = ...
Load classifier
sample code
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path+"mymodel.model");
Use the same StringToWordVector filter as above, or create a new one for testingData, but remember to use the trainingData for this command: filter.setInputFormat(trainingData); This will keep the format of the training set and will not add words that are not in the training set.
Apply the filter to testingData: testingData = Filter.useFilter(testingData, filter);
Classify!
sample code
for (int j = 0; j < testingData.numInstances(); j++) {
double res = myCls.classifyInstance(testingData.get(j));
}
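If you also want the predicted label or the class probabilities instead of just the numeric index, a small extension of the loop above (assuming the classifier provides distributions):
for (int j = 0; j < testingData.numInstances(); j++) {
    double res = myCls.classifyInstance(testingData.get(j));
    String label = testingData.classAttribute().value((int) res);      // predicted class name
    double[] dist = myCls.distributionForInstance(testingData.get(j)); // per-class probabilities
    System.out.println(label + " " + java.util.Arrays.toString(dist));
}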