OpenNLP classifier output - java

At the moment I'm using the following code to train a classifier model:
final String iterations = "1000";
final String cutoff = "0";
InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("src/main/resources/trainingSets/classifierA.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, iterations);
params.put(TrainingParameters.CUTOFF_PARAM, cutoff);
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
DoccatModel model = DocumentCategorizerME.train("NL", sampleStream, params, new DoccatFactory());
OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("src/main/resources/models/model.bin"));
model.serialize(modelOut);
return model;
This goes well, and after every run I get the following output:
Indexing events with TwoPass using cutoff of 0
Computing event counts... done. 1474 events
Indexing... done.
Collecting events... Done indexing in 0,03 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 1474
Number of Outcomes: 2
Number of Predicates: 4149
Computing model parameters...
Stats: (998/1474) 0.6770691994572592
...done.
Could someone explain what this output means, and whether it tells something about the accuracy?

Looking at the source, we can tell this output is produced by the NaiveBayesTrainer::trainModel method:
public AbstractModel trainModel(DataIndexer di) {
// ...
display("done.\n");
display("\tNumber of Event Tokens: " + numUniqueEvents + "\n");
display("\t Number of Outcomes: " + numOutcomes + "\n");
display("\t Number of Predicates: " + numPreds + "\n");
display("Computing model parameters...\n");
MutableContext[] finalParameters = findParameters();
display("...done.\n");
// ...
}
If you take a look at the findParameters() code, you'll notice that it calls the trainingStats() method, which contains the snippet that calculates the accuracy:
private double trainingStats(EvalParameters evalParams) {
// ...
double trainingAccuracy = (double) numCorrect / numEvents;
display("Stats: (" + numCorrect + "/" + numEvents + ") " + trainingAccuracy + "\n");
return trainingAccuracy;
}
TL;DR the Stats: (998/1474) 0.6770691994572592 part of the output is the accuracy you're looking for: 998 of the 1474 training events were classified correctly, i.e. it is the model's accuracy on its own training data.
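Note that this figure is computed on the same data the model was trained on, so it tends to be optimistic. If you want a more realistic estimate, you could score a held-out file with OpenNLP's DocumentCategorizerEvaluator. A minimal sketch, assuming a separate test file (the path below is made up) in the same DocumentSample format as the training set:
// Sketch: evaluate the trained model on held-out data (hypothetical test file path).
InputStreamFactory testIn = new MarkableFileInputStreamFactory(
        new File("src/main/resources/trainingSets/classifierA_test.txt"));
ObjectStream<String> testLines = new PlainTextByLineStream(testIn, "UTF-8");
ObjectStream<DocumentSample> testSamples = new DocumentSampleStream(testLines);
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(categorizer);
evaluator.evaluate(testSamples);
System.out.println("Held-out accuracy: " + evaluator.getAccuracy());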

Related

how to calculate statistical significance using WEKA Java API?

I'm attempting to calculate the statistical significance of classifiers using the WEKA Java API. I was reading the documentation and see that I need to use calculateStatistics from PairedCorrectedTTester, but I'm not sure how to use it.
Any ideas?
public static void main(String[] args) throws Exception {
ZeroR zr = new ZeroR();
Bagging bg = new Bagging();
Experiment exp = new Experiment();
exp.setPropertyArray(new Classifier[0]);
exp.setUsePropertyIterator(true);
SplitEvaluator se = null;
Classifier sec = null;
se = new ClassifierSplitEvaluator();
sec = ((ClassifierSplitEvaluator) se).getClassifier();
CrossValidationResultProducer cvrp = new CrossValidationResultProducer();
cvrp.setNumFolds(10);
cvrp.setSplitEvaluator(se);
PropertyNode[] propertyPath = new PropertyNode[2];
propertyPath[0] = new PropertyNode(
se,
new PropertyDescriptor("splitEvaluator", CrossValidationResultProducer.class), CrossValidationResultProducer.class
);
propertyPath[1] = new PropertyNode(
sec,
new PropertyDescriptor("classifier",
se.getClass()),
se.getClass()
);
exp.setResultProducer(cvrp);
exp.setPropertyPath(propertyPath);
// set classifiers here
exp.setPropertyArray(new Classifier[]{zr, bg});
DefaultListModel model = new DefaultListModel();
File file = new File("dataset arff file");
model.addElement(file);
exp.setDatasets(model);
InstancesResultListener irl = new InstancesResultListener();
irl.setOutputFile(new File("output.csv"));
exp.setResultListener(irl);
exp.initialize();
exp.runExperiment();
exp.postProcess();
PairedCorrectedTTester tester = new PairedCorrectedTTester();
Instances result = new Instances(new BufferedReader(new FileReader(irl.getOutputFile())));
tester.setInstances(result);
tester.setSortColumn(-1);
tester.setRunColumn(result.attribute("Key_Run").index());
tester.setFoldColumn(result.attribute("Key_Fold").index());
tester.setResultsetKeyColumns(
new Range(
""
+ (result.attribute("Key_Dataset").index() + 1)));
tester.setDatasetKeyColumns(
new Range(
""
+ (result.attribute("Key_Scheme").index() + 1)
+ ","
+ (result.attribute("Key_Scheme_options").index() + 1)
+ ","
+ (result.attribute("Key_Scheme_version_ID").index() + 1)));
tester.setResultMatrix(new ResultMatrixPlainText());
tester.setDisplayedResultsets(null);
tester.setSignificanceLevel(0.05);
tester.setShowStdDevs(true);
tester.multiResultsetFull(0, result.attribute("Percent_correct").index());
System.out.println("\nResult:");
ResultMatrix matrix = tester.getResultMatrix();
System.out.println(matrix.toStringMatrix());
}
Results from the code above:
[screenshot of the result matrix omitted]
What I want is similar to the WEKA GUI output (circled in red):
[screenshot: statistical significance column in the WEKA GUI]
Resources Used:
https://waikato.github.io/weka-wiki/experimenter/using_the_experiment_api/
http://sce.carleton.ca/~mehrfard/repository/Case_Studies_(No_instrumentation)/Weka/doc/weka/experiment/PairedCorrectedTTester.html
You have to swap the key columns for dataset and resultset if you want to statistically evaluate classifiers on datasets (rather than datasets on classifiers):
tester.setDatasetKeyColumns(
new Range(
""
+ (result.attribute("Key_Dataset").index() + 1)));
tester.setResultsetKeyColumns(
new Range(
""
+ (result.attribute("Key_Scheme").index() + 1)
+ ","
+ (result.attribute("Key_Scheme_options").index() + 1)
+ ","
+ (result.attribute("Key_Scheme_version_ID").index() + 1)));
That will give you something like this when using the UCI dataset anneal (the trailing v means the result is significantly better than the baseline in the first column at the 0.05 level; * would mean significantly worse):
Result:
Dataset (1) rules.ZeroR '' | (2) meta.Baggin
--------------------------------------------------------------
anneal (100) 76.17(0.55) | 98.73(1.12) v
--------------------------------------------------------------
(v/ /*) | (1/0/0)

How to extract the best set of parameters from a TrainValidationSplitModel in Java?

I am using a ParamGridBuilder to construct a grid of parameters to search over and a TrainValidationSplit to determine the best model (RandomForestClassifier), in Java. Now I want to know which parameters (maxDepth, numTrees) from the ParamGridBuilder produced the best model.
RandomForestClassifier rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features");
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{
new VectorAssembler()
.setInputCols(new String[]{"a", "b"}).setOutputCol("features"),
rf});
ParamMap[] paramGrid = new ParamGridBuilder()
.addGrid(rf.maxDepth(), new int[]{10, 15})
.addGrid(rf.numTrees(), new int[]{5, 10})
.build();
BinaryClassificationEvaluator evaluator = new BinaryClassificationEvaluator().setLabelCol("label");
TrainValidationSplit trainValidationSplit = new TrainValidationSplit()
.setEstimator(pipeline)
.setEstimatorParamMaps(paramGrid)
.setEvaluator(evaluator)
.setTrainRatio(0.85);
TrainValidationSplitModel model = trainValidationSplit.fit(dataLog);
System.out.println("paramMap size: " + model.bestModel().paramMap().size());
System.out.println("defaultParamMap size: " + model.bestModel().defaultParamMap().size());
System.out.println("extractParamMap: " + model.bestModel().extractParamMap());
System.out.println("explainParams: " + model.bestModel().explainParams());
System.out.println("numTrees: " + model.bestModel().getParam("numTrees"))//NoSuchElementException: Param numTrees does not exist.
Those attempts do not help:
paramMap size: 0
defaultParamMap size: 0
extractParamMap: {
}
explainParams:
I found a way:
Pipeline bestModelPipeline = (Pipeline) model.bestModel().parent();
RandomForestClassifier bestRf = (RandomForestClassifier) bestModelPipeline.getStages()[1];
System.out.println("maxDepth : " + bestRf.getMaxDepth());
System.out.println("numTrees : " + bestRf.getNumTrees());
System.out.println("maxBins : " + bestRf.getMaxBins());

Fetching all the document URI's in MarkLogic Using Java Client API

I am trying to fetch all the documents from a database without knowing their exact URIs. I found one query:
DocumentPage documents =docMgr.read();
while (documents.hasNext()) {
DocumentRecord document = documents.next();
System.out.println(document.getUri());
}
But I do not have specific URIs; I want all the documents.
The first step is to enable your uris lexicon on the database.
You could eval some XQuery and run cts:uris() (or server-side JS and run cts.uris()):
ServerEvaluationCall call = client.newServerEval()
.xquery("cts:uris()");
for ( EvalResult result : call.eval() ) {
String uri = result.getString();
System.out.println(uri);
}
Two drawbacks are: (1) you'd need a user with privileges and (2) there is no pagination.
If you have a small number of documents, you don't need pagination. But for a large number of documents pagination is recommended. Here's some code using the search API and pagination:
// do the next eight lines just once
String options =
"<options xmlns='http://marklogic.com/appservices/search'>" +
" <values name='uris'>" +
" <uri/>" +
" </values>" +
"</options>";
QueryOptionsManager optionsMgr = client.newServerConfigManager().newQueryOptionsManager();
optionsMgr.writeOptions("uriOptions", new StringHandle(options));
// run the following each time you need to list all uris
QueryManager queryMgr = client.newQueryManager();
long pageLength = 10000;
queryMgr.setPageLength(pageLength);
ValuesDefinition query = queryMgr.newValuesDefinition("uris", "uriOptions");
// the following "and" query just matches all documents
query.setQueryDefinition(new StructuredQueryBuilder().and());
int start = 1;
boolean hasMore = true;
Transaction transaction = client.openTransaction();
try {
while ( hasMore ) {
CountedDistinctValue[] uriValues =
queryMgr.values(query, new ValuesHandle(), start, transaction).getValues();
for (CountedDistinctValue uriValue : uriValues) {
String uri = uriValue.get("string", String.class);
//System.out.println(uri);
}
start += uriValues.length;
// this is the last page if uriValues is smaller than pageLength
hasMore = uriValues.length == pageLength;
}
} finally {
transaction.commit();
}
The transaction is only necessary if you need a guaranteed "snapshot" list isolated from adds/deletes happening concurrently with this process. Since it adds some overhead, feel free to remove it if you don't need such exactness.
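If you are on the newer Java Client API (4.x or later), another option is the Data Movement SDK, whose QueryBatcher pages through all matching URIs for you. A sketch, assuming that version is available:
// Sketch using the Data Movement SDK: the batcher fetches batches of URIs for
// every document matching the query and hands each batch to the listener.
DataMovementManager dmm = client.newDataMovementManager();
QueryBatcher batcher = dmm.newQueryBatcher(new StructuredQueryBuilder().and()); // matches all documents
batcher.onUrisReady(batch -> {
    for (String uri : batch.getItems()) {
        System.out.println(uri);
    }
}).onQueryFailure(failure -> failure.printStackTrace());
dmm.startJob(batcher);
batcher.awaitCompletion(); // block until every batch has been processed
dmm.stopJob(batcher);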
Find out the page length, and in the queryMgr you can specify the starting point to access. Keep increasing the starting point and loop through all the URIs. I was able to fetch all the URIs this way. It may not be the best approach, but it works.
List<String> uriList = new ArrayList<>();
QueryManager queryMgr = client.newQueryManager();
StructuredQueryBuilder qb = new StructuredQueryBuilder();
StructuredQueryDefinition querydef = qb.and(qb.collection("xxxx"), qb.collection("whatever"), qb.collection("whatever"));//outputs 241152
SearchHandle results = queryMgr.search(querydef, new SearchHandle(), 10);
long pageLength = results.getPageLength();
long totalResults = results.getTotalResults();
System.out.println("Total Reuslts: " + totalResults);
long timesToLoop = totalResults / pageLength;
for (int i = 0; i < timesToLoop; i = (int) (i + pageLength)) {
System.out.println("Printing Results from: " + (i) + " to: " + (i + pageLength));
results = queryMgr.search(querydef, new SearchHandle(), i);
MatchDocumentSummary[] summaries = results.getMatchResults();//10 results because page length is 10
for (MatchDocumentSummary summary : summaries) {
// System.out.println("Extracted friom URI-> " + summary.getUri());
uriList.add(summary.getUri());
}
if (i >= 1000) { // number of URIs to store/retrieve, plus 10
break;
}
}
uriList= uriList.stream().distinct().collect(Collectors.toList());
return uriList;

Create Custom InputFormat of ColumnFamilyInputFormat for cassandra

I am working on a project using Cassandra 1.2 and Hadoop 1.2.
I have created my normal Cassandra mapper and reducer, but I want to create my own InputFormat class that will read the records from Cassandra so that I can get the desired column's value by splitting and indexing that value.
So I planned to create a custom InputFormat class, but I'm confused and don't know how to go about it. Which classes do I have to extend and implement, and how will I be able to fetch the row key, column name, column values, etc.?
I have my Mapper class as follows:
public class MyMapper extends
Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {
private Text word = new Text();
MyJDBC db = new MyJDBC();
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
Context context) throws IOException, InterruptedException {
long std_id = Long.parseLong(ByteBufferUtil.string(key));
long newSavePoint = 0;
if (columns.values().isEmpty()) {
System.out.println("EMPTY ITERATOR");
sb.append("column_N/A" + ":" + "N/A" + " , ");
} else {
for (IColumn cell : columns.values()) {
name = ByteBufferUtil.string(cell.name());
String value = null;
if (name.contains("int")) {
value = String.valueOf(ByteBufferUtil.toInt(cell.value()));
} else {
value = ByteBufferUtil.string(cell.value());
}
String[] data = value.toString().split(",");
// if (data[0].equalsIgnoreCase("login")) {
Long[] dif = getDateDiffe(d1, d2);
// logic I want to perform inside my custom input class rather than here; I just want a simple mapper class
if (condition1 && condition2) {
myhits++;
sb.append(":\t " + data[0] + " " + data[2] + " "+ data[1] /* + " " + data[3] */+ "\n");
newSavePoint = d2;
}
}
sb.append("~" + like + "~" + newSavePoint + "~");
word.set(sb.toString().replace("\t", ""));
}
db.setInterval(Long.parseLong(ByteBufferUtil.string(key)), newSavePoint);
db.setHits(Long.parseLong(ByteBufferUtil.string(key)), like + "");
context.write(new Text(ByteBufferUtil.string(key)), word);
}
I want to reduce the logic in my Mapper class and perform the same calculations in my custom input class.
Please help, I'm hoping for a positive response...
You can do the intended task by moving the Mapper logic to your custom input class (as you have already indicated).
I found this nice post, which explains a problem statement similar to yours; I think it might solve your problem.
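For the "which classes do I extend" part, here is a rough, untested sketch, assuming Cassandra 1.2's ColumnFamilyInputFormat and the new Hadoop mapreduce API: it delegates splitting and reading to ColumnFamilyInputFormat and turns each row's columns into a ready-made "name=value,..." Text, so the Mapper only has to split and index that string. The class name and the value-building logic are placeholders; adapt them to your own rules.
// Rough sketch (untested): a wrapper InputFormat around ColumnFamilyInputFormat.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;

public class MyColumnFamilyInputFormat extends InputFormat<ByteBuffer, Text> {

    private final ColumnFamilyInputFormat delegate = new ColumnFamilyInputFormat();

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
        return delegate.getSplits(context); // reuse Cassandra's split logic
    }

    @Override
    public RecordReader<ByteBuffer, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        final RecordReader<ByteBuffer, SortedMap<ByteBuffer, IColumn>> reader =
                delegate.createRecordReader(split, context);

        return new RecordReader<ByteBuffer, Text>() {
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c) throws IOException, InterruptedException {
                reader.initialize(s, c);
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                if (!reader.nextKeyValue()) return false;
                // Flatten the columns into "name=value,..." -- put your splitting/indexing logic here.
                StringBuilder sb = new StringBuilder();
                for (IColumn cell : reader.getCurrentValue().values()) {
                    String name = ByteBufferUtil.string(cell.name());
                    String val = name.contains("int")
                            ? String.valueOf(ByteBufferUtil.toInt(cell.value()))
                            : ByteBufferUtil.string(cell.value());
                    sb.append(name).append('=').append(val).append(',');
                }
                value.set(sb.toString());
                return true;
            }

            @Override
            public ByteBuffer getCurrentKey() throws IOException, InterruptedException {
                return reader.getCurrentKey(); // row key, unchanged
            }

            @Override
            public Text getCurrentValue() {
                return value; // pre-processed columns
            }

            @Override
            public float getProgress() throws IOException, InterruptedException {
                return reader.getProgress();
            }

            @Override
            public void close() throws IOException {
                reader.close();
            }
        };
    }
}
You would then register it with job.setInputFormatClass(MyColumnFamilyInputFormat.class) and change the Mapper's value type from SortedMap<ByteBuffer, IColumn> to Text; the usual ConfigHelper setup for the keyspace and column family stays the same, since the delegate still reads it.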

error in dataSearch using Berkeley DB

I've created a database using Berkeley DB that stores N records, where a record is a key/value pair. I originally populated it with only 20 records. With 20 records I managed to do a key search and a data search (where I search through the database record by record for a data value that matches the string data input by the user).
public String dataSearch (String dataInput) {
String foundKey = null;
String foundData = null;
Cursor cursor = null;
try {
cursor = myDb.openCursor(null, null);
DatabaseEntry theKey = new DatabaseEntry();
DatabaseEntry theData = new DatabaseEntry();
while (cursor.getNext(theKey, theData, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
foundKey = new String(theKey.getData(), "UTF-8");
foundData = new String(theData.getData(), "UTF-8");
// this is to see each key - data - inputdata as I was having an issue
System.out.println("KEY: " + foundKey +
"\nDATA: " + foundData +
"\nINPUT_DATA: " + dataInput + "\n\n");
if (foundData.equals(dataInput)) {
System.out.println("-----------------------------------\n\n");
System.out.println("Found record: " + foundKey +
"\nwith data: " + foundData);
System.out.println("\n\n-----------------------------------");
}
}
/* I then close the cursor and catch exceptions and such */
This works fine when I have 20 records or fewer... but when I use a bigger number I get some funny behaviour. I set the number of records to 1000; the last key/data values to be inserted into the database are:
KEY: zghxnbujnsztazmnrmrlhjsjfeexohxqotjafliiktlptsquncuejcrebaohblfsqazznheurdqbqbxjmyqr
DATA: jzpqaymwwnoqzvxykowdhxvfbuhrsfojivugrmvmybbvurxmdvmrclalzfscmeknyzkqmrcflzdooyupwznvxikermrbicapynwspbbritjyeltywmmslpeuzsmh
I had it print out the last values to be inserted into the database, then did a key search on the above key to make sure that the data above was in fact the data associated with that key in the database. However, when I do a data search on the data listed above I get no matching record (whereas the same process found a record when there were 20 records). I looked into it a bit more, had the data search print each key/data pair it returned, and found the following result:
KEY: zghxnbujnsztazmnrmrlhjsjfeexohxqotjafliiktlptsquncuejcrebaohblfsqazznheurdqbqbxjmyqrpzlyvnmdlvgyvzhbceeftcqssbeckxkuepxyphsgdzd
DATA: jzpqaymwwnoqzvxykowdhxvfbuhrsfojivugrmvmybbvurxmdvmrclalzfscmeknyzkqmrcflzdooyupwznvxikermrbicapynwspbbritjyeltywmmslpeuzsmhozy
INPUT DATA: jzpqaymwwnoqzvxykowdhxvfbuhrsfojivugrmvmybbvurxmdvmrclalzfscmeknyzkqmrcflzdooyupwznvxikermrbicapynwspbbritjyeltywmmslpeuzsmh
As you can see, it seems to have randomly appended some extra bytes to the data value. However, if I do a key search these extra bytes don't show up, so I think the problem is in the dataSearch function. The same results occur whether I use B+tree or hash.
Any Ideas?
Thanks
After a long time looking at this I realized my error was that I was not reinitializing the theKey and theData variables.
The fix is in the while loop:
while (cursor.getNext(theKey, theData, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
foundKey = new String(theKey.getData(), "UTF-8");
foundData = new String(theData.getData(), "UTF-8");
// this is to see each key - data - inputdata as I was having an issue
System.out.println("KEY: " + foundKey +
"\nDATA: " + foundData +
"\nINPUT_DATA: " + dataInput + "\n\n");
if (foundData.equals(dataInput)) {
System.out.println("-----------------------------------\n\n");
System.out.println("Found record: " + foundKey +
"\nwith data: " + foundData);
System.out.println("\n\n-----------------------------------");
}
// THIS IS THE FIX
theKey = new DatabaseEntry();
theData = new DatabaseEntry();
// ----------------------------
}
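That fix works. The likely reason for the extra bytes is that DatabaseEntry.getData() returns the entry's whole backing byte array, which may be longer than the current record (only the getSize() bytes starting at getOffset() are valid), so building the String from the full array can pick up leftover bytes from an earlier, longer record. An alternative to reallocating the entries is to build the strings from only the valid bytes, e.g.:
// Alternative sketch: use only the bytes that belong to the current record.
foundKey = new String(theKey.getData(), theKey.getOffset(), theKey.getSize(), "UTF-8");
foundData = new String(theData.getData(), theData.getOffset(), theData.getSize(), "UTF-8");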
