I would like to get the classification rate for a tree constructed using J48.
DataSource source = new DataSource(Path);
Instances data = source.getDataSet();
J48 tree = tree.buildClassifier(data);
I know it has something to do with
public double getMeasure(java.lang.String additionalMeasureName)
But I can't find the correct String (additionalMeasureName) to use.
I just found an answer to my question using the Evaluation class. The code should be:
//Learning
DataSource source = new DataSource(Path);
Instances data = source.getDataSet();
J48 tree = new J48();
tree.buildClassifier(data);
//Evaluation
Evaluation eval = new Evaluation(data);
eval.evaluateModel(tree, data);
System.out.println((eval.correct()/data.numInstances())*100);
This tests the decision tree on the training data and displays the percentage of correctly classified instances.
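For reference, the Evaluation class can also report this percentage directly via pctCorrect(). A minimal sketch reusing the objects from above; the cross-validation variant is shown as a less optimistic alternative to testing on the training data:
// Same as (eval.correct() / data.numInstances()) * 100
System.out.println(eval.pctCorrect());
// Less optimistic estimate: 10-fold cross-validation on a fresh Evaluation object
Evaluation cvEval = new Evaluation(data);
cvEval.crossValidateModel(new J48(), data, 10, new java.util.Random(1));
System.out.println(cvEval.pctCorrect());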
Related
Following is the piece of code with which I'm trying to save my model, but I'm unable to find the saveModel() API functionality to store the model.
// Create classification trainer.
DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(10, 0.1);
// Train decision tree model.
Model mdl = trainer.fit(
    ignite,
    dataCache,
    featureExtractor,
    labelExtractor
);
Exporter<DecisionTreeNode, String> exporter = new FileExporter<>();
((DecisionTreeNode)mdl).saveModel(exporter, filePath);
Every classification algorithm (KNN, ANN, KMeans, ...) implements the exportable model interface except the decision tree, so in this case we can save it with ModelsComposition (which is what applies in the decision tree scenario):
Exporter exporter = new FileExporter<>();
((ModelsComposition) mdl).saveModel(exporter, filePath);
In Weka (using Java), I would like to successively fit classifiers to different subsets of attributes of the same dataset.
Is there a way to build the Instances object only once and then remove the non-selected features only temporarily, so that they can be efficiently restored and used later to build another classifier, without having to create a totally new Instances object from scratch every time?
I am aware of method deleteAttributeAt() which says that
A deep copy of the attribute information is performed before the
attribute is deleted
and also of class Remove but I'm not sure this is what I need.
Create new Instances objects at each stage and use them appropriately.
For example, the code below uses an Instances object with the class attribute removed and the data normalized to build a clusterer.
Use rawData to get back the original instances. Hope this helps.
final SimpleKMeans kmeans = new SimpleKMeans();
final String[] options = weka.core.Utils
.splitOptions("-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 10 -A \"weka.core.EuclideanDistance -R first-last\" -I 500 -num-slots 1 -S 50");
kmeans.setOptions(options);
kmeans.setSeed(1000);
kmeans.setPreserveInstancesOrder(true);
kmeans.setNumClusters(5);
kmeans.setMaxIterations(1000);
final BufferedReader datafile = readDataFile("/Users/data.arff");
final Instances rawData = new Instances(datafile);
rawData.setClassIndex(classIndex);
//remove class column[0] from cluster
final Remove removeFilter = new Remove();
removeFilter.setAttributeIndices("" + (rawData.classIndex() + 1));
removeFilter.setInputFormat(rawData);
final Instances dataNoClass = Filter.useFilter(rawData, removeFilter);
//normalize
final Normalize normalizeFilter = new Normalize();
normalizeFilter.setIgnoreClass(true);
normalizeFilter.setInputFormat(dataNoClass);
final Instances data = Filter.useFilter(dataNoClass, normalizeFilter);
kmeans.buildClusterer(data);
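The same pattern should also answer the original question about classifiers: keep rawData untouched and build each classifier on a freshly filtered copy. A minimal sketch under that assumption, with J48 as an example classifier and arbitrary attribute indices (it assumes the class attribute is not among the removed ones, so the class index carries over to the filtered copy):
// remove attributes 2 and 3 (1-based indices) for this particular subset;
// setInvertSelection(true) would instead keep only the listed attributes
final Remove subsetFilter = new Remove();
subsetFilter.setAttributeIndices("2,3");
subsetFilter.setInputFormat(rawData);
final Instances subsetData = Filter.useFilter(rawData, subsetFilter);
final J48 tree = new J48();
tree.buildClassifier(subsetData);
// rawData is unchanged, so it can be filtered differently for the next classifier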
My original tree was much bigger, but since I've been stuck on this issue for quite some time I decided to try to simplify my tree. I ended up with something like this:
As you can see, I only have a single attribute called "LarguraBandaRede" with 3 possible nominal values "Congestionado", "Livre" and "Merda".
After that I exported the j48.model from Weka to use in my Java code.
With this piece of code I import the model to use as a classifier:
ObjectInputStream objectInputStream = new ObjectInputStream(in);
classifier = (J48) objectInputStream.readObject();
After that I started to create an ArrayList of my attributes and an Instances object.
for (int i = 0; i < features.length; i++) {
    String feature = features[i];
    Attribute attribute;
    if (feature.equals("TamanhoDados(Kb)")) {
        attribute = new Attribute(feature);
    } else {
        String[] strings = null;
        if (i == 0) strings = populateAttributes(7);
        if (i == 1) strings = populateAttributes(10);
        ArrayList<String> attValues = new ArrayList<String>(Arrays.asList(strings));
        attribute = new Attribute(feature, attValues);
    }
    atts.add(attribute);
}
where populateAttributes gives the possible values for each attribute, in this case "Livre, Merda, Congestionado;" for LarguraBandaRede and "Sim,Nao" for Resultado, my class attribute.
Instances instances = new Instances("header",atts,atts.size());
instances.setClassIndex(instances.numAttributes()-1);
After creating my Instances object, it's time to create the Instance objects, that is, the instances that I'm trying to classify:
Instance instanceLivre = new DenseInstance(features.length);
Instance instanceMediano = new DenseInstance(features.length);
Instance instanceCongestionado = new DenseInstance(features.length);
instanceLivre.setDataset(instances);
instanceMediano.setDataset(instances);
instanceCongestionado.setDataset(instances);
Then I set each of these instances with one of the 3 possible values for "LarguraBandaRede": 'instanceLivre' with "Livre", 'instanceMediano' with "Merda" and 'instanceCongestionado' with "Congestionado".
After that I simply classify these 3 instances using the classifyInstance method:
System.out.println(instance.toString());
double resp = classifier.classifyInstance(instance);
System.out.println("valor: "+resp);
and this is my result:
As you can see, the instance that has Merda as "LarguraBandaRede" was classified into the same class as Congestionado, the class 'Nao'. But that doesn't make any sense, since the tree above clearly shows that when "LarguraBandaRede" is "Merda" or "Livre" the class should be the same.
So that's my question. How did this happen and how do I fix it?
Thanks in advance.
EDIT
I didn't know that the order in which the possible values of a nominal attribute are declared made any difference in the way the model works. But we have to follow the same order as the training data when feeding a nominal attribute its possible values.
Have you checked whether the order of the nominal values in the Weka attribute (i.e. their internal indices) matches the order produced by your populateAttributes method?
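To illustrate, here is a minimal sketch; the attribute and value names are taken from the question, and the crucial assumption is that the values are declared in exactly the same order as in the training data header:
// declare the nominal values in the same order as the training ARFF header
ArrayList<String> larguraValues = new ArrayList<String>(Arrays.asList("Livre", "Merda", "Congestionado"));
Attribute largura = new Attribute("LarguraBandaRede", larguraValues);
ArrayList<String> resultadoValues = new ArrayList<String>(Arrays.asList("Sim", "Nao"));
Attribute resultado = new Attribute("Resultado", resultadoValues);
ArrayList<Attribute> atts = new ArrayList<Attribute>(Arrays.asList(largura, resultado));
Instances header = new Instances("header", atts, 0);
header.setClassIndex(header.numAttributes() - 1);
Instance inst = new DenseInstance(header.numAttributes());
inst.setDataset(header);
inst.setValue(largura, "Merda"); // mapped internally to that value's index in the list above
double resp = classifier.classifyInstance(inst);
System.out.println("classe: " + header.classAttribute().value((int) resp)); // numeric prediction back to label
An even safer option is to serialize the training Instances header together with the model and reuse it at prediction time, so the value order cannot drift.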
I am new to Lucene, so apologies for any unclear wording. I am working on an author search engine. The search query is the author name. The default search results are good - they return the names that match the most. However, we want to rank the results by author popularity as well, a blend of both the default similarity and a numeric value representing the circulations their titles have. The problem with the default results is it returns authors nobody is interested in, and while I can rank by circulation alone, the top result is generally not a great match in terms of name. I have been looking for days for a solution for this.
This is how I am building my index:
IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(INDEX_LOCATION)),
        new IndexWriterConfig(new StandardAnalyzer()));
writer.deleteAll();
for (Contributor contributor : contributors) {
    Document doc = new Document();
    doc.add(new TextField("name", contributor.getName(), Field.Store.YES));
    doc.add(new StoredField("contribId", contributor.getContribId()));
    doc.add(new NumericDocValuesField("sum", sum));
    writer.addDocument(doc);
}
writer.close();
The name is the field we want to search on, and the sum is the field we want to weight our search results with (but still taking into account the best match for the author name). I'm not sure if adding the sum to the document is the correct thing to do in this situation. I know that there will need to be some experimentation to figure out how to best blend the weighting of the two factors, but my problem is I don't know how to do it in the first place.
Any examples I've been able to find are either pre-Lucene 4 or don't seem to work. I thought this was what I was looking for, but it doesn't seem to work. Help appreciated!
As demonstrated in the blog post you linked, you could use a CustomScoreQuery; this would give you a lot of flexibility and influence over the scoring process, but it is also a bit overkill. Another possibility is to use a FunctionScoreQuery; since they behave differently, I will explain both.
Using a FunctionScoreQuery
A FunctionScoreQuery can modify a score based on a field.
Let's say you are usually performing a search like this:
Query q = .... // pass the user input to the QueryParser or similar
TopDocs hits = searcher.search(q, 10); // Get 10 results
Then you can modify the query in between like this:
Query q = .....
// Note that a Float field would work better.
DoubleValuesSource boostByField = DoubleValuesSource.fromLongField("sum");
// Create a query, based on the old query and the boost
FunctionScoreQuery modifiedQuery = new FunctionScoreQuery(q, boostByField);
// Search as usual
TopDocs hits = searcher.search(modifiedQuery, 10);
This will modify the score based on the value of the "sum" field. Sadly, however, there isn't a possibility to control the influence of the DoubleValuesSource (besides scaling the values during indexing) - at least none that I know of.
To have more control, consider using the CustomScoreQuery.
Using a CustomScoreQuery
Using this kind of query will allow you to modify the score of each result any way you like. In this context we will use it to alter the score based on a field in the index. First, you will have to store your value during indexing:
doc.add(new StoredField("sum", sum));
Then we will have to create our very own query class:
private static class MyScoreQuery extends CustomScoreQuery {

    public MyScoreQuery(Query subQuery) {
        super(subQuery);
    }

    // The CustomScoreProvider is what actually alters the score
    private class MyScoreProvider extends CustomScoreProvider {

        private LeafReader reader;
        private Set<String> fieldsToLoad;

        public MyScoreProvider(LeafReaderContext context) {
            super(context);
            reader = context.reader();
            // We create a HashSet which contains the name of the field
            // which we need. This allows us to retrieve the document
            // with only this field loaded, which is a lot faster.
            fieldsToLoad = new HashSet<>();
            fieldsToLoad.add("sum");
        }

        @Override
        public float customScore(int doc_id, float currentScore, float valSrcScore) throws IOException {
            // Get the result document from the index
            Document doc = reader.document(doc_id, fieldsToLoad);
            // Get boost value from index
            IndexableField field = doc.getField("sum");
            Number number = field.numericValue();
            // This is just an example on how to alter the current score
            // based on the value of "sum". You will have to experiment
            // here.
            float influence = 0.01f;
            float boost = number.floatValue() * influence;
            // Return the new score for this result, based on the
            // original lucene score.
            return currentScore + boost;
        }
    }

    // Make sure that our CustomScoreProvider is being used.
    @Override
    public CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) {
        return new MyScoreProvider(context);
    }
}
Now you can use your new Query class to modify an existing query, similar to the FunctionScoreQuery:
Query q = .....
// Create a query, based on the old query and the boost
MyScoreQuery modifiedQuery = new MyScoreQuery(q);
// Search as usual
TopDocs hits = searcher.search(modifiedQuery, 10);
Final remarks
Using a CustomScoreQuery, you can influence the scoring process in all kinds of ways. Remember however that the method customScore is called for each search result - so don't perform any expensive computations there, as this would severely slow down the search process.
I've created a small gist with a full working example of the CustomScoreQuery here: https://gist.github.com/philippludwig/14e0d9b527a6522511ae79823adef73a
I have a classifier that I trained using Python's scikit-learn. How can I use the classifier from a Java program? Can I use Jython? Is there some way to save the classifier in Python and load it in Java? Is there some other way to use it?
You cannot use Jython, as scikit-learn heavily relies on numpy and scipy, which have many compiled C and Fortran extensions and hence cannot work in Jython.
The easiest ways to use scikit-learn in a Java environment would be to:
- expose the classifier as an HTTP / JSON service, for instance using a micro-framework such as flask, bottle or cornice, and call it from Java using an HTTP client library;
- write a command-line wrapper application in Python that reads data on stdin and outputs predictions on stdout using some format such as CSV or JSON (or some lower-level binary representation), and call the Python program from Java, for instance using Apache Commons Exec (a sketch of the Java side of this option is shown below);
- make the Python program output the raw numerical parameters learnt at fit time (typically as an array of floating-point values) and reimplement the predict function in Java (this is typically easy for predictive linear models, where the prediction is often just a thresholded dot product).
The last approach will be a lot more work if you need to re-implement feature extraction in Java as well.
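For the second option, the Java side could look roughly like the sketch below. This is only an illustration: the script name predict.py and the one-CSV-row-in / one-prediction-out protocol are assumptions, and plain ProcessBuilder is used here instead of Apache Commons Exec:
// start the Python wrapper; it is assumed to read CSV feature rows from stdin
// and to print one prediction per row on stdout
ProcessBuilder pb = new ProcessBuilder("python", "predict.py");
pb.redirectErrorStream(true);
Process p = pb.start();
try (BufferedWriter toPython = new BufferedWriter(new OutputStreamWriter(p.getOutputStream()));
     BufferedReader fromPython = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
    toPython.write("5.1,3.5,1.4,0.2\n"); // hypothetical feature row
    toPython.flush();
    System.out.println("prediction: " + fromPython.readLine());
}
p.waitFor();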
Finally, you can use a Java library such as Weka or Mahout that implements the algorithms you need instead of trying to use scikit-learn from Java.
There is the JPMML project for this purpose.
First, you can serialize the scikit-learn model to PMML (which is XML internally) using the sklearn2pmml library directly from Python, or dump it in Python first and convert it using jpmml-sklearn in Java or from the command line provided by this library. Next, you can load the PMML file, deserialize it, and execute the loaded model using jpmml-evaluator in your Java code.
This does not work with all scikit-learn models, but it does with many of them.
As some commenters correctly pointed out, it's important to note that the JPMML project is licensed under the GNU AGPL. AGPL is a strong copyleft license, which may limit your ability to use the project; one example would be if you develop a publicly accessible service and want to keep the sources closed.
You can also use a porter; I have tested sklearn-porter (https://github.com/nok/sklearn-porter), and it works well for Java.
My code is the following:
import pandas as pd
from sklearn import tree
from sklearn_porter import Porter
train_dataset = pd.read_csv('./result2.csv').as_matrix()
X_train = train_dataset[:90, :8]
Y_train = train_dataset[:90, 8:]
X_test = train_dataset[90:, :8]
Y_test = train_dataset[90:, 8:]
print X_train.shape
print Y_train.shape
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)
porter = Porter(clf, language='java')
output = porter.export(embed_data=True)
print(output)
In my case, I'm using a DecisionTreeClassifier, and the output of
print(output)
is the following code as text in the console:
class DecisionTreeClassifier {

    private static int findMax(int[] nums) {
        int index = 0;
        for (int i = 0; i < nums.length; i++) {
            index = nums[i] > nums[index] ? i : index;
        }
        return index;
    }

    public static int predict(double[] features) {
        int[] classes = new int[2];
        if (features[5] <= 51.5) {
            if (features[6] <= 21.0) {
                // HUGE amount of ifs..........
            }
        }
        return findMax(classes);
    }

    public static void main(String[] args) {
        if (args.length == 8) {
            // Features:
            double[] features = new double[args.length];
            for (int i = 0, l = args.length; i < l; i++) {
                features[i] = Double.parseDouble(args[i]);
            }
            // Prediction:
            int prediction = DecisionTreeClassifier.predict(features);
            System.out.println(prediction);
        }
    }
}
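The generated class is self-contained: compile it and run it with the eight feature values as command-line arguments (the main method above checks args.length == 8), and it prints the predicted class index to stdout.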
Here is some code for the JPMML solution:
--PYTHON PART--
# helper function to determine the string columns which have to be one-hot-encoded in order to apply an estimator.
def determine_categorical_columns(df):
    categorical_columns = []
    x = 0
    for col in df.dtypes:
        if col == 'object':
            val = df[df.columns[x]].iloc[0]
            if not isinstance(val, Decimal):
                categorical_columns.append(df.columns[x])
        x += 1
    return categorical_columns
categorical_columns = determine_categorical_columns(df)
other_columns = list(set(df.columns).difference(categorical_columns))
#construction of transformators for our example
labelBinarizers = [(d, LabelBinarizer()) for d in categorical_columns]
nones = [(d, None) for d in other_columns]
transformators = labelBinarizers+nones
mapper = DataFrameMapper(transformators,df_out=True)
gbc = GradientBoostingClassifier()
#construction of the pipeline
lm = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", gbc)
])
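(The pipeline is then presumably fitted and exported, e.g. with sklearn2pmml as described above, to produce the ScikitLearnNew.pmml file loaded in the Java part below; that step is not shown here.)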
--JAVA PART --
//Initialisation.
String pmmlFile = "ScikitLearnNew.pmml";
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);

//Determine which features are required as input
HashMap<String, Field> inputFieldMap = new HashMap<String, Field>();
for (int i = 0; i < evaluator.getInputFields().size(); i++) {
    InputField curInputField = evaluator.getInputFields().get(i);
    String fieldName = curInputField.getName().getValue();
    inputFieldMap.put(fieldName.toLowerCase(), curInputField.getField());
}

//prediction
HashMap<String, String> argsMap = new HashMap<String, String>();
//... fill argsMap with input

Map<FieldName, ?> res;
// here we keep only features that are required by the model
Map<FieldName, String> args = new HashMap<FieldName, String>();
Iterator<String> iter = argsMap.keySet().iterator();
while (iter.hasNext()) {
    String key = iter.next();
    Field f = inputFieldMap.get(key);
    if (f != null) {
        FieldName name = f.getName();
        String value = argsMap.get(key);
        args.put(name, value);
    }
}

//the model is applied to the input, a probability distribution is obtained
res = evaluator.evaluate(args);
SegmentResult segmentResult = (SegmentResult) res;
Object targetValue = segmentResult.getTargetValue();
ProbabilityDistribution probabilityDistribution = (ProbabilityDistribution) targetValue;
I found myself in a similar situation.
I'd recommend carving out a classifier microservice: a service that runs in Python and exposes calls over a RESTful API, with JSON/XML as the data-interchange format. I think this is a cleaner approach.
Alternatively, you can just generate code (Java is among the supported targets) from the trained Python model. Here is a tool that can help you with that: https://github.com/BayesWitnesses/m2cgen