Once a 10-fold cross-validation is done with a classifier, how can I print out the predicted class of every instance and the class distribution for each of these instances?
J48 j48 = new J48();
Evaluation eval = new Evaluation(newData);
eval.crossValidateModel(j48, newData, 10, new Random(1));
When I tried something like the code below, it said that the classifier is not built.
for (int i=0; i<newData.numInstances(); i++){
System.out.println(java.util.Arrays.toString(j48.distributionForInstance(newData.instance(i))));
}
What I'm trying to do is replicate a function of the WEKA GUI: once a classifier is trained, I can click "Visualize classifier errors" > Save, and the saved file contains the predicted class. Now I need the same thing to work in my own Java code.
I have tried something like the following:
J48 j48 = new J48();
Evaluation eval = new Evaluation(newData);
StringBuffer forPredictionsPrinting = new StringBuffer();
weka.core.Range attsToOutput = null;
Boolean outputDistribution = new Boolean(true);
eval.crossValidateModel(j48, newData, 10, new Random(1), forPredictionsPrinting, attsToOutput, outputDistribution);
Yet it throws the following error:
Exception in thread "main" java.lang.ClassCastException: java.lang.StringBuffer cannot be cast to weka.classifiers.evaluation.output.prediction.AbstractOutput
The crossValidateModel() method takes a varargs parameter, forPredictionsPrinting, whose first element must be a weka.classifiers.evaluation.output.prediction.AbstractOutput instance (for example PlainText).
The important part of that object is the StringBuffer it wraps, which holds a string representation of all the predictions. The following code is untested JRuby, but you should be able to convert it for your needs.
j48 = J48.new
eval = Evaluation.new(newData)
buffer = java.lang.StringBuffer.new
predictions = PlainText.new # weka.classifiers.evaluation.output.prediction.PlainText
predictions.setBuffer(buffer)
eval.crossValidateModel(j48, newData, 10, java.util.Random.new(1), predictions, Java::WekaCore::Range.new('1'), true)
# buffer now holds a string of all the individual predictions
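The ClassCastException in the question is what you would expect from an Object... parameter: the compiler accepts any argument, and the type is only checked when the method casts it at run time. Here is a minimal plain-Java sketch of that behaviour. OutputSink and describe are hypothetical stand-ins made up for illustration; they are not Weka classes or methods.

```java
public class VarargsCastDemo {
    // Hypothetical stand-in for an output abstraction; NOT a Weka class.
    static class OutputSink {
        final StringBuffer buffer = new StringBuffer();
    }

    // Mimics a method that accepts Object... but expects a specific type first.
    static String describe(Object... extras) {
        if (extras.length == 0) {
            return "no output requested";
        }
        try {
            OutputSink sink = (OutputSink) extras[0]; // checked only at run time
            sink.buffer.append("prediction rows would go here");
            return "ok";
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        // Compiles fine, but the cast inside describe() fails.
        System.out.println(describe(new StringBuffer()));
        // Passing the expected wrapper type works.
        System.out.println(describe(new OutputSink()));
    }
}
```

So the fix is to wrap the StringBuffer in the output object the method expects, rather than passing the buffer itself.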
I was stuck on this a few days ago. I wanted to evaluate a Weka classifier in MATLAB using a matrix instead of loading from an ARFF file. I used http://www.mathworks.com/matlabcentral/fileexchange/21204-matlab-weka-interface and the following source code. I hope this helps someone else.
import weka.classifiers.*;
import java.util.*
wekaClassifier = javaObject('weka.classifiers.trees.J48');
wekaClassifier.buildClassifier(processed);%Loaded from loadARFF
e = javaObject('weka.classifiers.Evaluation',processed);%Loaded from loadARFF
myrand = Random(1);
plainText = javaObject('weka.classifiers.evaluation.output.prediction.PlainText');
buffer = javaObject('java.lang.StringBuffer');
plainText.setBuffer(buffer)
bool = javaObject('java.lang.Boolean',true);
range = javaObject('weka.core.Range','1');
array = javaArray('java.lang.Object',3);
array(1) = plainText;
array(2) = range;
array(3) = bool;
e.crossValidateModel(wekaClassifier,testing,10,myrand,array)
e.toClassDetailsString
Asdrúbal López-Chau
clc
clear
%Load from disk
fileDataset = 'cm1.arff';
myPath = 'C:\Users\Asdrubal\Google Drive\Respaldo\DoctoradoALCPC\Doctorado ALC PC\AlcMobile\AvTh\MyPapers\Papers2014\UnderOverSampling\data\Skewed\datasetsKeel\';
javaaddpath('C:\Users\Asdrubal\Google Drive\Respaldo\DoctoradoALCPC\Doctorado ALC PC\AlcMobile\JarsForExperiments\weka.jar');
wekaOBJ = loadARFF([myPath fileDataset]);
%Transform from data into Matlab
[data, featureNames, targetNDX, stringVals, relationName] = ...
weka2matlab(wekaOBJ,'[]');
%Create testing and training sets in matlab format (this can be improved)
[tam, dim] = size(data);
idx = randperm(tam);
nTest = round(tam*0.3); %Round so the split point is a valid integer index
testIdx = idx(1 : nTest);
trainIdx = idx(nTest + 1 : end);
trainSet = data(trainIdx,:);
testSet = data(testIdx,:);
%Transform the training and the testing sets into the Weka format
testingWeka = matlab2weka('testing', featureNames, testSet);
trainingWeka = matlab2weka('training', featureNames, trainSet);
%Now evaluate classifier
import weka.classifiers.*;
import java.util.*
wekaClassifier = javaObject('weka.classifiers.trees.J48');
wekaClassifier.buildClassifier(trainingWeka);
e = javaObject('weka.classifiers.Evaluation',trainingWeka);
myrand = Random(1);
plainText = javaObject('weka.classifiers.evaluation.output.prediction.PlainText');
buffer = javaObject('java.lang.StringBuffer');
plainText.setBuffer(buffer)
bool = javaObject('java.lang.Boolean',true);
range = javaObject('weka.core.Range','1');
array = javaArray('java.lang.Object',3);
array(1) = plainText;
array(2) = range;
array(3) = bool;
e.crossValidateModel(wekaClassifier,testingWeka,10,myrand,array)
e.toClassDetailsString
I am working on classifying the ImageNet dataset with the AlexNet architecture, on distributed systems for data streams, using the DeepLearning4j library. I have a problem loading the ImageNet data from a path on our HPC. My current, non-distributed loading method is:
FileSplit fileSplit= new FileSplit(new File("/scratch/imagenet/ILSVRC2012/train"), NativeImageLoader.ALLOWED_FORMATS);
int imageHeightWidth = 224; //224x224 pixel input
int imageChannels = 3; //RGB
PathLabelGenerator labelMaker = new ParentPathLabelGenerator();
ImageRecordReader rr = new ImageRecordReader(imageHeightWidth, imageHeightWidth, imageChannels, labelMaker);
System.out.println("initialization");
rr.initialize(fileSplit);
System.out.println("iterator");
DataSetIterator iter = new RecordReaderDataSetIterator.Builder(rr, minibatch)
.classification(1, 1000)
.preProcessor(new ImagePreProcessingScaler()) //For normalization of image values 0-255 to 0-1
.build();
System.out.println("data list creator");
List<DataSet> dataList = new ArrayList<>();
while (iter.hasNext()){
dataList.add(iter.next());
}
And this is my attempt to load the dataset via Spark. labelsList contains all the labels of the ImageNet dataset, but I didn't copy them all here:
JavaSparkContext sc = SparkContext.initSparkContext(useSparkLocal);
//load data just one time
System.out.println("load data");
List<String> labelsList = Arrays.asList("kit fox, Vulpes macrotis " , "English setter " , "Australian terrier ");
String folder= "/scratch/imagenet/ILSVRC2012/train/*";
File f = new File(folder);
String path = f.getPath();
path=folder+"/*";
JavaPairRDD<String, PortableDataStream> origData = sc.binaryFiles(path);
int imageHeightWidth = 224; //224x224 pixel input
int imageChannels = 3; //RGB
PathLabelGenerator labelMaker = new ParentPathLabelGenerator();
ImageRecordReader rr = new ImageRecordReader(imageHeightWidth, imageHeightWidth, imageChannels, labelMaker);
System.out.println("initialization");
rr.setLabels(labelsList);
RecordReaderFunction rrf = new org.datavec.spark.functions.RecordReaderFunction(rr);
JavaRDD<List<Writable>> rdd = origData.map(rrf);
JavaRDD<DataSet> data = rdd.map(new DataVecDataSetFunction(1, 1000, false));
List<DataSet> collected = data.collect();
By the way, the train directory contains 1000 folders (n01440764, n01755581, n02012849, n02097658 ...), and the images are inside those folders.
I need this parallelization because loading the data alone took around 26 hours, which is not efficient. Could you help me correct my attempt?
For Spark, I would recommend pre-vectorizing all of the data and loading the ndarrays themselves directly. We cover this approach in our examples: https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-distributed-training-examples/
After that, load the pre-created datasets with a map call, ideally setting up the batches relative to the number of workers you have available. DataSet has save(..) and load(..) methods you can use.
In order to implement this consider using:
SparkDataUtils.createFileBatchesSpark(JavaRDD<String> filePaths, final String rootOutputDir, final int batchSize, @NonNull final org.apache.hadoop.conf.Configuration hadoopConfig)
This takes in filepaths, an output directory on HDFS, a pre configured batch size and a hadoop configuration for accessing your cluster.
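To make the batching idea concrete without a cluster, here is a plain-Java sketch that splits a list of file paths into fixed-size batches. It only illustrates the concept; it is not how SparkDataUtils is implemented, and the class name, method name and file names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchPlanner {
    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Hypothetical file names, for illustration only.
        List<String> paths = Arrays.asList("a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg");
        for (List<String> batch : partition(paths, 2)) {
            System.out.println(batch);
        }
    }
}
```

With five paths and a batch size of 2 this yields three batches, the last one holding a single path. In practice you would pick the batch size so that the number of batches is a comfortable multiple of your worker count.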
Here is a snippet from the relevant Javadoc to get you started on some of the concepts:
JavaSparkContext sc = ...
SparkDl4jMultiLayer net = ...
String baseFileBatchDir = ...
JavaRDD<String> paths = org.deeplearning4j.spark.util.SparkUtils.listPaths(sc, baseFileBatchDir);

//Image record reader:
PathLabelGenerator labelMaker = new ParentPathLabelGenerator();
ImageRecordReader rr = new ImageRecordReader(32, 32, 1, labelMaker);
rr.setLabels(<labels here>);

//Create DataSetLoader:
int batchSize = 32;
int numClasses = 1000;
DataSetLoader loader = new RecordReaderFileBatchLoader(rr, batchSize, 1, numClasses);

//Fit the network
net.fitPaths(paths, loader);
I have a project where I want to load in a given shapefile, and pick out polygons above a certain size before writing the results to a new shapefile. Maybe not the most efficient, but I've got code that successfully does all of that, right up to the point where it is supposed to write the shapefile. I get no errors, but the resulting shapefile has no usable data in it. I've followed as many tutorials as possible, but still I'm coming up blank.
The first bit of code is where I read in a shapefile, pick out the polygons I want, and put them into a feature collection. This part seems to work fine as far as I can tell.
public class ShapefileTest {
public static void main(String[] args) throws MalformedURLException, IOException, FactoryException, MismatchedDimensionException, TransformException, SchemaException {
File oldShp = new File("Old.shp");
File newShp = new File("New.shp");
//Get data from the original ShapeFile
Map<String, Object> map = new HashMap<String, Object>();
map.put("url", oldShp.toURI().toURL());
//Connect to the dataStore
DataStore dataStore = DataStoreFinder.getDataStore(map);
//Get the typeName from the dataStore
String typeName = dataStore.getTypeNames()[0];
//Get the FeatureSource from the dataStore
FeatureSource<SimpleFeatureType, SimpleFeature> source = dataStore.getFeatureSource(typeName);
SimpleFeatureCollection collection = (SimpleFeatureCollection) source.getFeatures(); //Get all of the features - no filter
//Start creating the new Shapefile
final SimpleFeatureType TYPE = createFeatureType(); //Calls a method that builds the feature type - tested and works.
DefaultFeatureCollection newCollection = new DefaultFeatureCollection(); //To hold my new collection
try (FeatureIterator<SimpleFeature> features = collection.features()) {
while (features.hasNext()) {
SimpleFeature feature = features.next(); //Get next feature
SimpleFeatureBuilder fb = new SimpleFeatureBuilder(TYPE); //Create a new SimpleFeature based on the original
Integer level = (Integer) feature.getAttribute(1); //Get the level for this feature
MultiPolygon multiPoly = (MultiPolygon) feature.getDefaultGeometry(); //Get the geometry collection
//First count how many new polygons we will have
int numNewPoly = 0;
for (int i = 0; i < multiPoly.getNumGeometries(); i++) {
double area = getArea(multiPoly.getGeometryN(i));
if (area > 20200) {
numNewPoly++;
}
}
//Now build an array of the larger polygons
Polygon[] polys = new Polygon[numNewPoly]; //Array of new geometries
int iPoly = 0;
for (int i = 0; i < multiPoly.getNumGeometries(); i++) {
double area = getArea(multiPoly.getGeometryN(i));
if (area > 20200) { //Write the new data
polys[iPoly] = (Polygon) multiPoly.getGeometryN(i);
iPoly++;
}
}
GeometryFactory gf = new GeometryFactory(); //Create a geometry factory
MultiPolygon mp = new MultiPolygon(polys, gf); //Create the MultiPolygon
fb.add(mp); //Add the geometry collection to the feature builder
fb.add(level);
fb.add("dBA");
SimpleFeature newFeature = SimpleFeatureBuilder.build( TYPE, new Object[]{mp, level,"dBA"}, null );
newCollection.add(newFeature); //Add it to the collection
}
At this point I have a collection that looks right: it has the correct bounds and everything. The next bit of code is where I put it into a new shapefile.
//Time to put together the new Shapefile
Map<String, Serializable> newMap = new HashMap<String, Serializable>();
newMap.put("url", newShp.toURI().toURL());
newMap.put("create spatial index", Boolean.TRUE);
DataStore newDataStore = DataStoreFinder.getDataStore(newMap);
newDataStore.createSchema(TYPE);
String newTypeName = newDataStore.getTypeNames()[0];
SimpleFeatureStore fs = (SimpleFeatureStore) newDataStore.getFeatureSource(newTypeName);
Transaction t = new DefaultTransaction("add");
fs.setTransaction(t);
fs.addFeatures(newCollection);
t.commit();
ReferencedEnvelope env = fs.getBounds();
}
}
I added the very last line to check the bounds of the FeatureStore fs, and it comes back null. Unsurprisingly, when I load the newly created shapefile (which DOES get created and is about the right size), nothing shows up.
The solution actually had nothing to do with the code I posted; it had everything to do with my FeatureType definition. I did not include "the_geom" in my polygon feature type, so nothing was getting written to the file.
I believe you are missing the step to finalize/close the file. Try adding this after the t.commit() line.
fs.close();
As an expedient alternative, you might try out the Shapefile dumper utility mentioned in the Shapefile DataStores docs. Using that may simplify your second code block into two or three lines.
I train a J48 model using the WEKA Java API.
Then I use classifyInstance() to classify my instances,
but the result is wrong.
My code is as follows:
Instances train = reader.getDataSet();
Instances test = reader_test.getDataSet();
train.setClassIndex(train.numAttributes() - 1);
Classifier cls = new J48();
cls.buildClassifier(train);
test.setClassIndex(test.numAttributes() - 1);
for(int i = 0; i < test.numInstances(); i++){
Instance inst = test.instance(i);
double result = cls.classifyInstance(inst);
System.out.println(train.classAttribute().value((int) result));
}
The result is always 0.0.
Finally, I called test.insertAttributeAt() before test.setClassIndex(),
as follows:
test.insertAttributeAt(train.attribute(train.numAttributes() - 1), test.numAttributes());
The result became correct, which surprised me.
However, most documentation does not call insertAttributeAt().
I want to understand why the result suddenly became correct.
This should help you:
BufferedReader datafile = readDataFile(TrainingFile);
Instances train = new Instances(datafile);
train.setClassIndex(train.numAttributes() - 1);
Classifier cls = new J48();
cls.buildClassifier(train);
DataSource testDataset = new DataSource(Test);
Instances test = testDataset.getDataSet();
test.setClassIndex(test.numAttributes() - 1);
for(int i = 0; i < test.numInstances(); i++){
Instance inst = test.instance(i);
double actualClassValue = inst.classValue();
//label of the actual class value
String actual=test.classAttribute().value((int)actualClassValue);
double result = cls.classifyInstance(inst);
//label of the predicted class value
String prediction=test.classAttribute().value((int)result );
System.out.println(actual + " -> " + prediction);
}
You don't need to use insertAttributeAt now, provided the test file has the same attributes as the training file, including the class attribute.
File Conversion Code
// load CSV
CSVLoader loader = new CSVLoader();
String InputFilename = "TrainingFileName";
loader.setSource(new File(InputFilename));
Instances data = loader.getDataSet();
// save ARFF
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
String FileT = Filename+".arff";
saver.setFile(new File(Path+Directory+"\\"+FileT));
saver.writeBatch();
Change accordingly.
Thanks
I wrote WEKA Java code to train 4 classifiers. I saved the classifier models and want to use them to predict new unseen instances (think of someone who wants to test whether a tweet is positive or negative).
I used the StringToWordVector filter on the training data. To avoid the "Src and Dest differ in # of attributes" error, I used the following code to train the filter on the training data before applying it to the new instance, so that I can predict whether the new instance is positive or negative. I just can't get it right.
Classifier cls = (Classifier) weka.core.SerializationHelper.read("models/myModel.model"); //reading one of the trained classifiers
BufferedReader datafile = readDataFile("Tweets/tone1.ARFF"); //read training data
Instances data = new Instances(datafile);
data.setClassIndex(data.numAttributes() - 1);
Filter filter = new StringToWordVector(50);//keep 50 words
filter.setInputFormat(data);
Instances filteredData = Filter.useFilter(data, filter);
// rebuild classifier
cls.buildClassifier(filteredData);
String testInstance= "Text that I want to use as an unseen instance and predict whether it's positive or negative";
System.out.println(">create test instance");
FastVector attributes = new FastVector(2);
attributes.addElement(new Attribute("text", (FastVector) null));
// Add class attribute.
FastVector classValues = new FastVector(2);
classValues.addElement("Negative");
classValues.addElement("Positive");
attributes.addElement(new Attribute("Tone", classValues));
// Create dataset with initial capacity of 100, and set index of class.
Instances tests = new Instances("test istance", attributes, 100);
tests.setClassIndex(tests.numAttributes() - 1);
Instance test = new Instance(2);
// Set value for message attribute
Attribute messageAtt = tests.attribute("text");
test.setValue(messageAtt, messageAtt.addStringValue(testInstance));
test.setDataset(tests);
Filter filter2 = new StringToWordVector(50);
filter2.setInputFormat(tests);
Instances filteredTests = Filter.useFilter(tests, filter2);
System.out.println(">train Test filter using training data");
Standardize sfilter = new Standardize(); //Match the number of attributes between src and dest.
sfilter.setInputFormat(filteredData); // initializing the filter with training set
filteredTests = Filter.useFilter(filteredData, sfilter); // create new test set
ArffSaver saver = new ArffSaver(); //save test data to ARFF file
saver.setInstances(filteredTests);
File unseenFile = new File ("Tweets/unseen.ARFF");
saver.setFile(unseenFile);
saver.writeBatch();
When I try to Standardize the Input data using the filtered training data I get a new ARFF file (unseen.ARFF) but with 2000 (same number of training data) instances where most of the values are negative. I don't understand why or how to remove those instances.
System.out.println(">Evaluation"); //without the following 2 lines I get ArrayIndexOutOfBoundException.
filteredData.setClassIndex(filteredData.numAttributes() - 1);
filteredTests.setClassIndex(filteredTests.numAttributes() - 1);
Evaluation eval = new Evaluation(filteredData);
eval.evaluateModel(cls, filteredTests);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
When printing the evaluation results, I want to see, for example, a percentage of how positive or negative this instance is, but instead I get the following. I also want to see 1 instance instead of 2000. Any help on how to do this would be great.
> Results
======
Correlation coefficient 0.0285
Mean absolute error 0.8765
Root mean squared error 1.2185
Relative absolute error 409.4123 %
Root relative squared error 121.8754 %
Total Number of Instances 2000
Thanks
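A note on the 2000 mostly negative values: the code above calls Filter.useFilter(filteredData, sfilter), so Standardize is applied to the training data and the output has as many rows as the training set. Standardizing also subtracts each column's mean, so in a sparse word-count column the many zeros end up below zero. Here is a minimal plain-Java sketch of that effect. The class name and sample numbers are made up for illustration, and using the sample standard deviation is an assumption about the exact formula.

```java
import java.util.Arrays;

public class StandardizeDemo {
    // Rescales values to zero mean and unit sample standard deviation (z-scores).
    static double[] standardize(double[] values) {
        int n = values.length;
        double[] z = new double[n];
        if (n < 2) {
            return z; // not enough data to estimate a spread; return zeros
        }
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= n;
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - mean) * (v - mean);
        }
        double std = Math.sqrt(sumSq / (n - 1)); // sample standard deviation (assumption)
        for (int i = 0; i < n; i++) {
            // A constant column has std 0; leave it at 0 instead of dividing by zero.
            z[i] = std == 0.0 ? 0.0 : (values[i] - mean) / std;
        }
        return z;
    }

    public static void main(String[] args) {
        double[] wordCounts = {0, 0, 0, 4}; // a sparse column: the word occurs in one document only
        System.out.println(Arrays.toString(standardize(wordCounts)));
    }
}
```

For this column the mean is 1 and the sample standard deviation is 2, so the three zeros become -0.5 and the single 4 becomes 1.5. The same thing happens to every sparse column of a word vector.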
Use eval.predictions(). It is a java.util.ArrayList<Prediction>. For a nominal class each entry is a NominalPrediction, and its distribution() method gives the class probabilities, i.e. how positive or negative your test instance is. (Prediction.weight() only returns the weight of the instance.)
cls.distributionForInstance(newInst) returns the probability distribution for an instance. Check the docs.
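Once you have such a distribution, turning it into "label plus percentage" is a small step. Here is a plain-Java sketch, assuming a double[] in the same order as the class labels, like the one distributionForInstance returns. The arrays are hard-coded, hypothetical values and the class and method names are made up for the example.

```java
import java.util.Locale;

public class DistributionDemo {
    // Index of the largest probability in the distribution.
    static int argMax(double[] dist) {
        int best = 0;
        for (int i = 1; i < dist.length; i++) {
            if (dist[i] > dist[best]) {
                best = i;
            }
        }
        return best;
    }

    // Formats the most likely label with its probability as a percentage.
    static String describe(String[] labels, double[] dist) {
        int best = argMax(dist);
        return String.format(Locale.ROOT, "%s (%.1f%%)", labels[best], 100.0 * dist[best]);
    }

    public static void main(String[] args) {
        String[] labels = {"Negative", "Positive"};
        double[] dist = {0.25, 0.75}; // hypothetical values, as if returned by distributionForInstance
        System.out.println(describe(labels, dist)); // prints "Positive (75.0%)"
    }
}
```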
I have reached a good solution, and here I share my code with you. It trains a classifier using WEKA Java code and then uses it to predict new unseen instances. Some parts, like paths, are hardcoded, but you can easily modify the method to take parameters.
/**
* This method performs classification of unseen instance.
* It starts by training a model using a selection of classifiers, then classifies new unlabeled instances.
*/
public static void predict() throws Exception {
//start by providing the paths for your training and testing ARFF files make sure both files have the same structure and the exact classes in the header
//initialise classifier
Classifier classifier = null;
System.out.println("read training arff");
Instances train = new Instances(new BufferedReader(new FileReader("Train.arff")));
train.setClassIndex(0);//in my case the class was the first attribute thus zero otherwise it's the number of attributes -1
System.out.println("read testing arff");
Instances unlabeled = new Instances(new BufferedReader(new FileReader("Test.arff")));
unlabeled.setClassIndex(0);
// training using a collection of classifiers (NaiveBayes, SMO (AKA SVM), KNN and Decision trees.)
String[] algorithms = {"nb","smo","knn","j48"};
for(int w=0; w<algorithms.length;w++){
if(algorithms[w].equals("nb"))
classifier = new NaiveBayes();
if(algorithms[w].equals("smo"))
classifier = new SMO();
if(algorithms[w].equals("knn"))
classifier = new IBk();
if(algorithms[w].equals("j48"))
classifier = new J48();
System.out.println("==========================================================================");
System.out.println("training using " + algorithms[w] + " classifier");
Evaluation eval = new Evaluation(train);
//perform 10 fold cross validation
eval.crossValidateModel(classifier, train, 10, new Random(1));
String output = eval.toSummaryString();
System.out.println(output);
String classDetails = eval.toClassDetailsString();
System.out.println(classDetails);
classifier.buildClassifier(train);
}
Instances labeled = new Instances(unlabeled);
// label instances (use the trained classifier to classify new unseen instances)
for (int i = 0; i < unlabeled.numInstances(); i++) {
double clsLabel = classifier.classifyInstance(unlabeled.instance(i));
labeled.instance(i).setClassValue(clsLabel);
System.out.println(clsLabel + " -> " + unlabeled.classAttribute().value((int) clsLabel));
}
//save the model for future use
ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("myModel.dat"));
out.writeObject(classifier);
out.close();
System.out.println("===== Saved model =====");
}
I am using a string to vector filter to convert my arff to vector format.
But it throws an exception
weka.core.WekaException: weka.classifiers.bayes.NaiveBayesMultinomialUpdateable: Not enough training instances with class labels (required: 1, provided: 0)!
I tried to use the same on weka explorer and it worked fine.
This is my code
ArffLoader loader = new ArffLoader();
loader.setFile(new File("valid file"));
Instances structure = loader.getStructure();
structure.setClassIndex(0);
// train NaiveBayes
NaiveBayesMultinomialUpdateable n = new NaiveBayesMultinomialUpdateable();
FilteredClassifier f = new FilteredClassifier();
StringToWordVector s = new StringToWordVector();
f.setFilter(s);
f.setClassifier(n);
f.buildClassifier(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null)
n.updateClassifier(current);
// output generated model
System.out.println(n);
I have tried another example but it still does not work
ArffLoader loader = new ArffLoader();
loader.setFile(new File("valid file"));
Instances structure = loader.getStructure();
// train NaiveBayes
NaiveBayesMultinomialUpdateable n = new NaiveBayesMultinomialUpdateable();
FilteredClassifier f = new FilteredClassifier();
StringToWordVector s = new StringToWordVector();
s.setInputFormat(structure);
Instances struct = Filter.useFilter(structure, s);
struct.setClassIndex(0);
System.out.println(struct.numAttributes()); // only gives 2 or 1 attributes
n.buildClassifier(struct);
Instance current;
while ((current = loader.getNextInstance(struct)) != null)
n.updateClassifier(current);
// output generated model
System.out.println(n);
The number of attributes printed is always 2 or 1.
It seems the string to word vector isn't working as expected
Original folder : https://www.dropbox.com/sh/cma4hbe2r96ul1c/GL2wNdeVUz
Converted to arff: https://www.dropbox.com/s/efle6ci4lb5riq7/test1.arff
According to your ARFF file, the class seems to be the second of the two attributes, so the problem may be here:
struct.setClassIndex(0);
Try:
struct.setClassIndex(1);
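If you are unsure which index to pass, you can check it against the ARFF header. Here is a simplified plain-Java sketch that scans the @attribute lines and reports the zero-based position of a named attribute. It handles unquoted attribute names only, and the header text and attribute names below are hypothetical, not taken from the linked file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ArffHeaderIndex {
    // Collects attribute names from the @attribute lines of an ARFF header
    // (simplified: assumes unquoted names without spaces).
    static List<String> attributeNames(String header) {
        List<String> names = new ArrayList<>();
        for (String line : header.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith("@attribute")) {
                String[] parts = trimmed.split("\\s+");
                if (parts.length >= 2) {
                    names.add(parts[1]);
                }
            }
        }
        return names;
    }

    // Zero-based index of the named attribute, or -1 if it is not declared.
    static int indexOf(String header, String name) {
        return attributeNames(header).indexOf(name);
    }

    public static void main(String[] args) {
        // Hypothetical header, for illustration only.
        String header = "@relation demo\n"
                + "@attribute text string\n"
                + "@attribute category {police,'oil spill'}\n"
                + "@data\n";
        System.out.println(indexOf(header, "category")); // prints 1
    }
}
```

The value printed is the one to pass to setClassIndex for that attribute.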
UPDATE: I made this change to the first example, and it gives no exception, and prints out:
The independent probability of a class
--------------------------------------
oil spill 40.0
police 989.0
The probability of a word given the class
-----------------------------------------
oil spill police
class Infinity Infinity