how to calculate statistical significance using WEKA Java API? - java

I'm attempting to calculate the statistical significance of classifiers using WEKA Java API. I was reading the documentation and see that I need to use calculateStatistics from PairedCorrectedTTester I'm not sure how to use it.
Any ideas?
public static void main(String[] args) throws Exception {
ZeroR zr = new ZeroR();
Bagging bg = new Bagging();
Experiment exp = new Experiment();
exp.setPropertyArray(new Classifier[0]);
exp.setUsePropertyIterator(true);
SplitEvaluator se = null;
Classifier sec = null;
se = new ClassifierSplitEvaluator();
sec = ((ClassifierSplitEvaluator) se).getClassifier();
CrossValidationResultProducer cvrp = new CrossValidationResultProducer();
cvrp.setNumFolds(10);
cvrp.setSplitEvaluator(se);
PropertyNode[] propertyPath = new PropertyNode[2];
propertyPath[0] = new PropertyNode(
se,
new PropertyDescriptor("splitEvaluator", CrossValidationResultProducer.class), CrossValidationResultProducer.class
);
propertyPath[1] = new PropertyNode(
sec,
new PropertyDescriptor("classifier",
se.getClass()),
se.getClass()
);
exp.setResultProducer(cvrp);
exp.setPropertyPath(propertyPath);
// set classifiers here
exp.setPropertyArray(new Classifier[]{zr, bg});
DefaultListModel model = new DefaultListModel();
File file = new File("dataset arff file");
model.addElement(file);
exp.setDatasets(model);
InstancesResultListener irl = new InstancesResultListener();
irl.setOutputFile(new File("output.csv"));
exp.setResultListener(irl);
exp.initialize();
exp.runExperiment();
exp.postProcess();
PairedCorrectedTTester tester = new PairedCorrectedTTester();
Instances result = new Instances(new BufferedReader(new FileReader(irl.getOutputFile())));
tester.setInstances(result);
tester.setSortColumn(-1);
tester.setRunColumn(result.attribute("Key_Run").index());
tester.setFoldColumn(result.attribute("Key_Fold").index());
tester.setResultsetKeyColumns(
new Range(
""
+ (result.attribute("Key_Dataset").index() + 1)));
tester.setDatasetKeyColumns(
new Range(
""
+ (result.attribute("Key_Scheme").index() + 1)
+ ","
+ (result.attribute("Key_Scheme_options").index() + 1)
+ ","
+ (result.attribute("Key_Scheme_version_ID").index() + 1)));
tester.setResultMatrix(new ResultMatrixPlainText());
tester.setDisplayedResultsets(null);
tester.setSignificanceLevel(0.05);
tester.setShowStdDevs(true);
tester.multiResultsetFull(0, result.attribute("Percent_correct").index());
System.out.println("\nResult:");
ResultMatrix matrix = tester.getResultMatrix();
System.out.println(matrix.toStringMatrix());
}
Results from code above:
results
What I want is similar to the WEKA GUI (circled in red):
Statistical Significance using WEKA GUI
Resources Used:
https://waikato.github.io/weka-wiki/experimenter/using_the_experiment_api/
http://sce.carleton.ca/~mehrfard/repository/Case_Studies_(No_instrumentation)/Weka/doc/weka/experiment/PairedCorrectedTTester.html

You have to swap the key columns for dataset and resultset if you want to statistically evaluate classifiers on datasets (rather than datasets on classifiers):
tester.setDatasetKeyColumns(
new Range(
""
+ (result.attribute("Key_Dataset").index() + 1)));
tester.setResultsetKeyColumns(
new Range(
""
+ (result.attribute("Key_Scheme").index() + 1)
+ ","
+ (result.attribute("Key_Scheme_options").index() + 1)
+ ","
+ (result.attribute("Key_Scheme_version_ID").index() + 1)));
That will give you something like this when using the UCI dataset anneal:
Result:
Dataset (1) rules.ZeroR '' | (2) meta.Baggin
--------------------------------------------------------------
anneal (100) 76.17(0.55) | 98.73(1.12) v
--------------------------------------------------------------
(v/ /*) | (1/0/0)

Related

DoNotForward Graph Api Event With Java

I have an application I am working on where a user will make an reservation for an office. Once the user makes it they are sent a calendar event via graph api so their Calendar reflects their reservation. I am trying to use the DoNotForward Boolean type to send in the singleValueExtendedProperties setting but nothing has worked.
Please Only answer using Java as the language. Thank you
**Note that I am sending the event and receiving it without any trouble until I try to add the DoNotForward. Below is a portion of the code that represents the Graph API Event.
Questions:
Has anyone configured an Event with the DoNoForward Option?
Can you provide me some guidance on what I am doing wrong?
Event event = new Event();
ItemBody eventBody = new ItemBody();
eventBody.contentType = BodyType.HTML;
eventBody.content = doc.html();
event.body = eventBody;
event.subject = "MyWorkSpot Reservation (" + environmentName + ") : " + buildingName + " Floor: " + floorName + " Seat: " + seatCode + " Time: " + reservationTime;
event.showAs = FreeBusyStatus.FREE;
event.isReminderOn = false;
event.reminderMinutesBeforeStart = 0;
event.isAllDay = true;
DateTimeTimeZone start = new DateTimeTimeZone();
start.dateTime = formattedResAllDayStartDate + "T00:00:00";
start.timeZone = "Eastern Standard Time";
event.start = start;
DateTimeTimeZone end = new DateTimeTimeZone();
end.dateTime = nextDay + "T00:00:00";
end.timeZone = "Eastern Standard Time";
event.end = end;
LinkedList<Attendee> attendeesList = new LinkedList<Attendee>();
Attendee attendees = new Attendee();
EmailAddress emailAddressEvent = new EmailAddress();
emailAddressEvent.address = recipientEmailAddress;
attendees.emailAddress = emailAddressEvent;
attendees.type = AttendeeType.REQUIRED;
attendeesList.add(attendees);
event.attendees = attendeesList;
SingleValueLegacyExtendedPropertyCollectionResponse singleValColRes = new SingleValueLegacyExtendedPropertyCollectionResponse();
SingleValueLegacyExtendedProperty singleValProp = new SingleValueLegacyExtendedProperty();
singleValProp.id = "Boolean {00020329-0000-0000-C000-000000000046} Name DoNotForward";
singleValProp.value = "true";
singleValProp.oDataType = "Boolean";
singleValColRes.value.add(0, singleValProp);
event.singleValueExtendedProperties = new SingleValueLegacyExtendedPropertyCollectionPage(singleValColRes, null);

How to extract the best set of parameters from a TrainValidationSplitModel in Java?

I am using a ParamGridBuilder to construct a grid of parameters to search over and TrainValidationSplit to determine the best model (RandomForestClassifier), in Java. Now, I want to know what are the parameters (maxDepth, numTrees) from ParamGridBuilder that produces the best model.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{
new VectorAssembler()
.setInputCols(new String[]{"a", "b"}).setOutputCol("features"),
new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")});
ParamMap[] paramGrid = new ParamGridBuilder()
.addGrid(rf.maxDepth(), new int[]{10, 15})
.addGrid(rf.numTrees(), new int[]{5, 10})
.build();
BinaryClassificationEvaluator evaluator = new BinaryClassificationEvaluator().setLabelCol("label");
TrainValidationSplit trainValidationSplit = new TrainValidationSplit()
.setEstimator(pipeline)
.setEstimatorParamMaps(paramGrid)
.setEvaluator(evaluator)
.setTrainRatio(0.85);
TrainValidationSplitModel model = trainValidationSplit.fit(dataLog);
System.out.println("paramMap size: " + model.bestModel().paramMap().size());
System.out.println("defaultParamMap size: " + model.bestModel().defaultParamMap().size());
System.out.println("extractParamMap: " + model.bestModel().extractParamMap());
System.out.println("explainParams: " + model.bestModel().explainParams());
System.out.println("numTrees: " + model.bestModel().getParam("numTrees"))//NoSuchElementException: Param numTrees does not exist.
Those tries do not help...
paramMap size: 0
defaultParamMap size: 0
extractParamMap: {
}
explainParams:
I found a way:
Pipeline bestModelPipeline = (Pipeline) model.bestModel().parent();
RandomForestClassifier bestRf = (RandomForestClassifier) bestModelPipeline.getStages()[1];
System.out.println("maxDepth : " + bestRf.getMaxDepth());
System.out.println("numTrees : " + bestRf.getNumTrees());
System.out.println("maxBins : " + bestRf.getMaxBins());

OpenNLP classifier output

At the moment I'm using the following code to train a classifier model :
final String iterations = "1000";
final String cutoff = "0";
InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("src/main/resources/trainingSets/classifierA.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, iterations);
params.put(TrainingParameters.CUTOFF_PARAM, cutoff);
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
DoccatModel model = DocumentCategorizerME.train("NL", sampleStream, params, new DoccatFactory());
OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("src/main/resources/models/model.bin"));
model.serialize(modelOut);
return model;
This goes well and after every run I get the following output :
Indexing events with TwoPass using cutoff of 0
Computing event counts... done. 1474 events
Indexing... done.
Collecting events... Done indexing in 0,03 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 1474
Number of Outcomes: 2
Number of Predicates: 4149
Computing model parameters...
Stats: (998/1474) 0.6770691994572592
...done.
Could someone explain what this output means? And if it tells something about the accuracy?
Looking at the source, we can tell this output is done by NaiveBayesTrainer::trainModel method:
public AbstractModel trainModel(DataIndexer di) {
// ...
display("done.\n");
display("\tNumber of Event Tokens: " + numUniqueEvents + "\n");
display("\t Number of Outcomes: " + numOutcomes + "\n");
display("\t Number of Predicates: " + numPreds + "\n");
display("Computing model parameters...\n");
MutableContext[] finalParameters = findParameters();
display("...done.\n");
// ...
}
If you take a look at findParameters() code, you'll notice that it calls the trainingStats() method, which contains the code snippet that calculates the accuracy:
private double trainingStats(EvalParameters evalParams) {
// ...
double trainingAccuracy = (double) numCorrect / numEvents;
display("Stats: (" + numCorrect + "/" + numEvents + ") " + trainingAccuracy + "\n");
return trainingAccuracy;
}
TL;DR the Stats: (998/1474) 0.6770691994572592 part of the output is the accuracy you're looking for.

Display labels of data [deep4j]

I would like to print the labels of traindata / testdata used in classification. Here is the definition of both inputs (using deep4j).
InputSplit[] inputSplit = fileSplit.sample(pathFilter, splitTrainTest, 1 - splitTrainTest);
InputSplit trainData = inputSplit[0];
InputSplit testData = inputSplit[1];
that are then transformed in DataSetIterator like this :
ImageRecordReader recordReader = new ImageRecordReader(height, width, channels, labelMaker);
recordReader.initialize(trainData, null);
trainIter = new RecordReaderDataSetIterator(recordReader, batchSize, 1, numLabels);
Then I want to print how many examples per labels where found in each iterator in this function :
public void print(DataSetIterator iter){
HashMap<String, Integer> hash = new HashMap<String, Integer>();
while(iter.hasNext()){
DataSet example = iter.next();
for(int i = 0 ; i<numLabels ; i++){
if(example.getLabels().getDouble(i)==1.){
String label = example.getLabelName(i);
if(hash.containsKey(label))
hash.put(label, hash.get(label)+1);
else
hash.put(label, 1);
}
}
}
for (String label: hash.keySet()){
System.out.println(" label : " + label.toString() + ", " + hash.get(label) + " examples");
}
}
The issue is that it displays only one example per label, whereas there should much more... And when I don't split my dataset using fileSplit.sample() the function displays the right number of examples.
Any suggestion ?
If you use a dataset you can use the toString() of the dataset.getFeatureMatrix() and dataset.getLabels()
If you want to print just the label counts, you can use dataset.labelCounts() I would look more at the dl4j javadoc:
http://deeplearning4j.org/doc

File diff against the last commit with JGit

I am trying to use JGit to get the differences of a file from the last commit to the most recent uncommitted changes. How can I do this with JGit? (using the command line would be the output of git diff HEAD)
Following several discussions (link1, link2) I come with a piece of code that is able to find the files that are uncommited, but it I cannot get the difference of the files
Repository db = new FileRepository("/path/to/git");
Git git = new Git(db);
AbstractTreeIterator oldTreeParser = this.prepareTreeParser(db, Constants.HEAD);
List<DiffEntry> diff = git.diff().setOldTree(oldTreeParser).call();
for (DiffEntry entry : diff) {
System.out.println("Entry: " + entry + ", from: " + entry.getOldId() + ", to: " + entry.getNewId());
DiffFormatter formatter = new DiffFormatter(System.out);
formatter.setRepository(db);
formatter.format(entry);
}
UPDATE
This issue was a long time ago. My existing for does display the uncommitted code. The current code that I am using for prepareTreeParser, in the context of displaying the difference, is:
public void gitDiff() throws Exception {
Repository db = new FileRepository("/path/to/git" + DEFAULT_GIT);
Git git = new Git(db);
ByteArrayOutputStream out = new ByteArrayOutputStream();
DiffFormatter formatter = new DiffFormatter( out );
formatter.setRepository(git.getRepository());
AbstractTreeIterator commitTreeIterator = prepareTreeParser(git.getRepository(), Constants.HEAD);
FileTreeIterator workTreeIterator = new FileTreeIterator( git.getRepository() );
List<DiffEntry> diffEntries = formatter.scan( commitTreeIterator, workTreeIterator );
for( DiffEntry entry : diffEntries ) {
System.out.println("DIFF Entry: " + entry + ", from: " + entry.getOldId() + ", to: " + entry.getNewId());
formatter.format(entry);
String diffText = out.toString("UTF-8");
System.out.println(diffText);
out.reset();
}
git.close();
db.close();
// This code is untested. It is slighting different for the code I am using in production,
// but it should be very easy to adapt it for your needs
}
private static AbstractTreeIterator prepareTreeParser(Repository repository, String ref) throws Exception {
Ref head = repository.getRef(ref);
RevWalk walk = new RevWalk(repository);
RevCommit commit = walk.parseCommit(head.getObjectId());
RevTree tree = walk.parseTree(commit.getTree().getId());
CanonicalTreeParser oldTreeParser = new CanonicalTreeParser();
ObjectReader oldReader = repository.newObjectReader();
try {
oldTreeParser.reset(oldReader, tree.getId());
} finally {
oldReader.release();
}
return oldTreeParser;
}
The following setup works for me:
DiffFormatter formatter = new DiffFormatter( System.out );
formatter.setRepository( git.getRepository() );
AbstractTreeIterator commitTreeIterator = prepareTreeParser( git.getRepository(), Constants.HEAD );
FileTreeIterator workTreeIterator = new FileTreeIterator( git.getRepository() );
List<DiffEntry> diffEntries = formatter.scan( commitTreeIterator, workTreeIterator );
for( DiffEntry entry : diffEntries ) {
System.out.println( "Entry: " + entry + ", from: " + entry.getOldId() + ", to: " + entry.getNewId() );
formatter.format( entry );
}
The uncommitted changes are made accessible trough the FileTreeIterator. Using formatter.scan() instead of the DiffCommand has the advantage that the formatter is set up properly to handle the FileTreeIterator. Otherwise you will get MissingObjectExceptions as the formatter tries to locate changes from the work tree in the repository.

Categories

Resources