I have already checked WEKA's "Making predictions" documentation, and it contains explicit instructions for command-line and GUI predictions.
What I want to know is how to get, in my own Java code, prediction output like the one below, which I obtained from the GUI using the Agrawal dataset (weka.datagenerators.classifiers.classification.Agrawal):
inst#, actual, predicted, error, prediction
1, 1:0, 2:1, +, 0.941
2, 1:0, 1:0, , 1
3, 1:0, 1:0, , 1
4, 1:0, 1:0, , 1
5, 1:0, 1:0, , 1
6, 1:0, 1:0, , 1
7, 1:0, 2:1, +, 0.941
8, 2:1, 2:1, , 0.941
9, 2:1, 2:1, , 0.941
10, 2:1, 2:1, , 0.941
1, 1:0, 1:0, , 1
2, 1:0, 1:0, , 1
3, 1:0, 1:0, , 1
I can't replicate this result, even though the documentation says:
Java
If you want to perform the classification within your own code, see the classifying instances section of this article, explaining the Weka API in general.
I went to the link and it said:
Classifying instances
In case you have an unlabeled dataset that you want to classify with your newly trained classifier, you can use the following code snippet. It loads the file /some/where/unlabeled.arff, uses the previously built classifier tree to label the instances, and saves the labeled data as /some/where/labeled.arff.
This is not the case I want, because I just want the k-fold cross-validation predictions on the dataset I am currently modeling.
Update
predictions
public FastVector predictions()
Returns the predictions that have been collected.
Returns:
a reference to the FastVector containing the predictions that have been collected. This should be null if no predictions have been collected.
I found the predictions() method for objects of type Evaluation and by using the code:
Object[] preds = evaluation.predictions().toArray();
for(Object pred : preds) {
System.out.println(pred);
}
It resulted in:
...
NOM: 0.0 0.0 1.0 0.9466666666666667 0.05333333333333334
NOM: 0.0 0.0 1.0 0.8947368421052632 0.10526315789473684
NOM: 0.0 0.0 1.0 0.9934883720930232 0.0065116279069767444
NOM: 0.0 0.0 1.0 0.9466666666666667 0.05333333333333334
NOM: 0.0 0.0 1.0 0.9912575655682583 0.008742434431741762
NOM: 0.0 0.0 1.0 0.9934883720930232 0.0065116279069767444
...
Is this the same thing as the one above?
After deep Google searches (and because the documentation provides minimal help) I finally found the answer.
I hope this explicit answer helps others in the future.
For sample code, I looked at the question "How to print out the predicted class after cross-validation in WEKA", and I'm glad I was able to decode its incomplete (and at times hard to understand) answer.
Here is my code, which produces output similar to the GUI's:
// Relevant imports:
// import weka.classifiers.Evaluation;
// import weka.classifiers.evaluation.output.prediction.PlainText;
// import weka.core.Range;
StringBuffer predictionSB = new StringBuffer();
Range attributesToShow = null;              // no additional attributes in the output
Boolean outputDistributions = Boolean.TRUE;

// Wrap the StringBuffer in a PlainText (an AbstractOutput) object
PlainText predictionOutput = new PlainText();
predictionOutput.setBuffer(predictionSB);
predictionOutput.setOutputDistribution(true);

// j48Model is the classifier, data the Instances, numberOfFolds an int,
// and randomNumber a java.util.Random instance
Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(j48Model, data, numberOfFolds,
        randomNumber, predictionOutput, attributesToShow,
        outputDistributions);
To clarify: the StringBuffer has to be wrapped in an AbstractOutput object so that crossValidateModel can recognize it.
Passing the StringBuffer on its own causes a java.lang.ClassCastException similar to the one in the question, while using a PlainText without a StringBuffer throws a java.lang.IllegalStateException.
I would like to thank ManChon U (Kevin) and their question "How to identify the cross-evaluation result to its corresponding instance in the input data set?" for giving me a clue on what this meant:
... you just need a single additional argument that is a concrete subclass of weka.classifiers.evaluation.output.prediction.AbstractOutput. weka.classifiers.evaluation.output.prediction.PlainText is probably the
one you want to use. Source
and
... Try creating a PlainText object, which extends AbstractOutput (called output for example) instance and calling output.setBuffer(forPredictionsPrinting) and passing that in instead of the buffer. Source
These simply mean: create a PlainText object, set a StringBuffer on it, and use its setter methods (such as setOutputDistribution(boolean)) to tweak the output.
Finally, to get our desired predictions, just use:
System.out.println(predictionOutput.getBuffer());
where predictionOutput is an object from the AbstractOutput family (PlainText, CSV, XML, etc.).
Additionally, the result of evaluation.predictions() is different from the output shown in the WEKA GUI. Fortunately, Mark Hall explained this in the question "Print out the predict class after cross-validation":
Evaluation.predictions() returns a FastVector containing either NominalPrediction or NumericPrediction objects from the weka.classifiers.evaluation package. Calling
Evaluation.crossValidateModel() with the additional AbstractOutput object results in the evaluation object printing the prediction/distribution information from Nominal/NumericPrediction objects to the StringBuffer in the format that you see in the Explorer or from the command line.
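In code, inspecting those prediction objects directly looks roughly like this. This is only a minimal sketch: evaluation is the Evaluation object from the code above, and the accessors (Prediction.actual(), Prediction.predicted(), NominalPrediction.distribution()) come from the weka.classifiers.evaluation package.
// import weka.classifiers.evaluation.NominalPrediction;
// import weka.classifiers.evaluation.Prediction;

// Walk the collected predictions and print the actual vs. predicted class
// indices plus the class distribution (the numbers in the "NOM: ..." lines above)
for (Object o : evaluation.predictions().toArray()) {
    Prediction p = (Prediction) o;
    System.out.print("actual=" + p.actual() + " predicted=" + p.predicted());
    if (p instanceof NominalPrediction) {
        double[] dist = ((NominalPrediction) p).distribution();
        System.out.print(" distribution=" + java.util.Arrays.toString(dist));
    }
    System.out.println();
}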
References:
"Print out the predict class after cross-validation"
"How to identify the cross-evaluation result to its corresponding instance in the input data set?"
"How to print out the predicted class after cross-validation in WEKA"
Related
I have a TensorFlow program running in Python, and for convenience reasons I want to run the same program in Java, so I have to save my model and load it in my Java application.
My problem is that I don't know how to save a Tensor object. Here is my code:
class Main:
    def __init__(self, checkpoint):
        ...
        self.g = tf.Graph()
        self.sess = tf.Session()
        self.img_placeholder = tf.placeholder(tf.float32,
                                              shape=(1, 679, 1024, 3), name='img_placeholder')
        # self.preds is an instance of Tensor
        self.preds = transform(self.img_placeholder)
        self.saver = tf.train.Saver()
        self.saver.restore(self.sess, checkpoint)

    def ffwd(...):
        ...
        _preds = self.sess.run(self.preds, feed_dict=
            {self.img_placeholder: self.X})
        ...
So, since I can't create my Tensor myself (the transform function creates the NN behind the scenes...), I am obliged to save it and reload it in Java. I have found ways of saving the session, but not Tensor instances.
Could someone give me some insights on how to achieve this?
Python Tensor objects are symbolic references to a specific output of an operation in the graph.
An operation in a graph can be uniquely identified by its string name. A specific output of that operation is identified by an integer index into the list of outputs of that operation. That index is typically zero since a vast majority of operations produce a single output.
To obtain the name of an Operation and the output index referred to by a Tensor object in Python you could do something like:
print(preds.op.name)
print(preds.value_index) # Most likely will be 0
And then in Java, you can feed/fetch nodes by name.
Let's say preds.op.name returned the string foo, and preds.value_index returned the integer 1, then in Java, you'd do the following:
session.runner().feed("img_placeholder").fetch("foo", 1)
(See javadoc for org.tensorflow.Session.Runner for details).
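Putting it together, here is a minimal sketch of the Java side, assuming the graph was serialized from Python as a frozen GraphDef; the file name frozen_model.pb, the all-zeros input image and the output name "foo"/index 1 are placeholders for illustration, not part of your program.
import java.nio.file.Files;
import java.nio.file.Paths;
import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.Tensor;

public class Predictor {
    public static void main(String[] args) throws Exception {
        // Load the graph that was serialized from Python
        byte[] graphDef = Files.readAllBytes(Paths.get("frozen_model.pb"));
        try (Graph g = new Graph()) {
            g.importGraphDef(graphDef);
            // Same shape as the Python placeholder: (1, 679, 1024, 3)
            float[][][][] image = new float[1][679][1024][3];
            try (Session session = new Session(g);
                 Tensor<?> input = Tensor.create(image)) {
                // Feed the placeholder by name, fetch output 1 of operation "foo"
                Tensor<?> preds = session.runner()
                        .feed("img_placeholder", input)
                        .fetch("foo", 1)
                        .run()
                        .get(0);
                System.out.println("prediction shape: "
                        + java.util.Arrays.toString(preds.shape()));
                preds.close();
            }
        }
    }
}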
You may find the slides linked to in https://github.com/tensorflow/models/tree/master/samples/languages/java along with the speaker notes in those slides useful.
Hope that helps.
I am looking through the deeplearning4j example for classifying movie reviews according to their sentiment:
ReviewExample
At lines 124-142 the N-dimensional arrays are created, and I am somewhat unsure about what is happening in these lines.
Line 132:
features.put(new INDArrayIndex[]{NDArrayIndex.point(i),
NDArrayIndex.all(), NDArrayIndex.point(j)}, vector);
I can imagine that .point(i) and .point(j) address the cell in the array, but what exactly does the NDArrayIndex.all() call do here?
While it is more or less clear to me what happens when the feature array is built, I get totally confused by the label mask and this lastIdx variable.
Lines 138-142:
int idx = (positive[i] ? 0 : 1);
int lastIdx = Math.min(tokens.size(),maxLength);
labels.putScalar(new int[]{i,idx,lastIdx-1},1.0); //Set label: [0,1] for negative, [1,0] for positive
labelsMask.putScalar(new int[]{i,lastIdx-1},1.0); //Specify that an output exists at the final time step for this example
The label array itself is addressed by i and idx, i.e. the row/column that is set to 1.0, but I don't really get how the time-step information fits in. Is it a convention that the last index has to mark the last time step?
Then why does labelsMask use only i and not i, idx?
Thanks for any explanations or pointers that help to clarify these questions.
It's one index per dimension. all() is an indicator meaning "use this whole dimension". See the nd4j user guide:
http://nd4j.org/userguide
As for the 1.0: that 1 is meant to be the class for the label there. It's a text classification problem: take the window of text and word vectors and have the class predicted from that.
As for the label mask: the prediction of the neural net happens at the end of a sequence. See:
http://deeplearning4j.org/usingrnns
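To make the label/mask indexing concrete, here is a small Java sketch (the shapes, 2 examples, 2 classes and 3 time steps, are made up for illustration) that mirrors lines 138-142 of the example. The mask is indexed only by [example, timeStep] because it just marks which time steps produce an output, regardless of the class.
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class LabelMaskDemo {
    public static void main(String[] args) {
        int miniBatch = 2, numClasses = 2, maxLength = 3;

        INDArray labels = Nd4j.zeros(new int[]{miniBatch, numClasses, maxLength});
        INDArray labelsMask = Nd4j.zeros(new int[]{miniBatch, maxLength});

        // Example 0 is positive (class index 0) and its sequence ends at
        // time step lastIdx - 1 = 1
        int i = 0, idx = 0, lastIdx = 2;
        labels.putScalar(new int[]{i, idx, lastIdx - 1}, 1.0);  // class label at the final time step
        labelsMask.putScalar(new int[]{i, lastIdx - 1}, 1.0);   // "an output exists at this time step"

        System.out.println(labels);
        System.out.println(labelsMask);
    }
}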
Write a test and you will see it:
val features = Nd4j.zeros(2, 2, 3)
val toPut = Nd4j.ones(2)
features.put(Array[INDArrayIndex](NDArrayIndex.point(0), NDArrayIndex.all, NDArrayIndex.point(1)), toPut)
the result is
[[[0.00, 1.00, 0.00],
[0.00, 1.00, 0.00]],
[[0.00, 0.00, 0.00],
[0.00, 0.00, 0.00]]]
It puts the 'toPut' vector into features at the given indices.
While I already know that there is not much documentation on using the GLPK-Java library, I'm going to ask anyway... (and no, I can't use another solver).
I have a basic problem that involves scheduling: students are assigned to courses per semester with some basic constraints.
The example problem is:
We consider {s1, s2} a set of two students. They need to take two
courses {c1, c2} during two semesters {t1, t2}.
We assume the courses are the same. We also assume that the students
cannot take more than one course per semester, and we'd like to determine the
minimum capacity X that must be offered by the classroom, assuming they
all offer the same capacity.
The example we were given in CPLEX format looks like this:
minimize X
subject to
y111 + y112 = 1
y121 + y122 = 1
y211 + y212 = 1
y221 + y222 = 1
y111 + y112 <= 1
y121 + y122 <= 1
y211 + y212 <= 1
y221 + y222 <= 1
y111 + y112 -X <= 0
y121 + y122 -X <= 0
y211 + y212 -X <= 0
y221 + y222 -X <= 0
end
I can run this through the solver via the glpsol command and get it to solve, but I need to write this using the API. I've never really worked with linear programming and the documentation leaves something to be desired. While this is simplistic at best, the real problem involves 600 students over 12 semesters who have to take 12 courses out of 18, with certain classes only available in certain semesters and some courses having prerequisites.
What I need help with is translating the simplistic problem into a coding example using the API. I'm assuming that once I can see how the very simple problem maps to the API calls, I can then figure out how to build the application for the more complex case.
From the examples in the library I can see that you set up columns, which in this case would be the semesters, and rows, which would be the students:
// Define Columns
GLPK.glp_add_cols(lp, 2); // adds the number of columns
GLPK.glp_set_col_name(lp, 1, "Sem_1");
GLPK.glp_set_col_kind(lp, 1, GLPKConstants.GLP_IV);
GLPK.glp_set_col_bnds(lp, 1, GLPKConstants.GLP_LO, 0, 0);
GLPK.glp_set_col_name(lp, 2, "Sem_2");
GLPK.glp_set_col_kind(lp, 2, GLPKConstants.GLP_IV);
GLPK.glp_set_col_bnds(lp, 2, GLPKConstants.GLP_LO, 0, 0);
At this point I would assume you need to set up the row constraints but I'm at a loss. Any direction would be greatly appreciated.
When using the API, the optimization problem is basically represented as a matrix, where the columns are your variables and the rows are your constraints.
For your problem you have to define 9 columns, representing y111, y112, ... and X.
Then you can go on with the constraints (rows) by setting the coefficients of the variables (columns) used in each row:
// Create the rows and the (1-based) index/value arrays first
GLPK.glp_add_rows(lp, 12);                        // 12 constraints in total
SWIGTYPE_p_int ind = GLPK.new_intArray(3);        // column indices
SWIGTYPE_p_double val = GLPK.new_doubleArray(3);  // coefficients

GLPK.glp_set_row_name(lp, 1, "constraint1");
GLPK.glp_set_row_bnds(lp, 1, GLPKConstants.GLP_FX, 1.0, 1.0); // fixed: equal to 1.0
GLPK.intArray_setitem(ind, 1, 1); // first entry: column 1 (y111)
GLPK.intArray_setitem(ind, 2, 2); // second entry: column 2 (y112)
// set the coefficients of the variables (all 1. here; X gets -1. in the capacity rows)
GLPK.doubleArray_setitem(val, 1, 1.);
GLPK.doubleArray_setitem(val, 2, 1.);
GLPK.glp_set_mat_row(lp, 1, 2, ind, val); // row 1 has 2 non-zero entries
This will represent the y111 + y112 = 1 constraint - 11 to go.
The GLPK for Java package should also ship with the GLPK documentation, which describes the GLPK functions (also available in GLPK for Java) quite well. Also have a look at the lp.java example.
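To round this off, here is a minimal sketch of the remaining steps (column definitions, objective and solving), under the assumption that the nine columns are laid out as y111..y222 in columns 1-8 and X in column 9; the class and variable names are placeholders, not a complete model.
import org.gnu.glpk.GLPK;
import org.gnu.glpk.GLPKConstants;
import org.gnu.glpk.glp_iocp;
import org.gnu.glpk.glp_prob;

public class ScheduleSketch {
    public static void main(String[] args) {
        glp_prob lp = GLPK.glp_create_prob();
        GLPK.glp_add_cols(lp, 9);
        for (int j = 1; j <= 8; j++) {
            GLPK.glp_set_col_name(lp, j, "y" + j);              // placeholder names for y111..y222
            GLPK.glp_set_col_kind(lp, j, GLPKConstants.GLP_BV); // the y's are binary
        }
        GLPK.glp_set_col_name(lp, 9, "X");
        GLPK.glp_set_col_kind(lp, 9, GLPKConstants.GLP_IV);        // X is an integer
        GLPK.glp_set_col_bnds(lp, 9, GLPKConstants.GLP_LO, 0, 0);  // X >= 0

        // ... add the 12 constraint rows as shown above ...

        // Objective: minimize X (only column 9 has a non-zero objective coefficient)
        GLPK.glp_set_obj_dir(lp, GLPKConstants.GLP_MIN);
        GLPK.glp_set_obj_coef(lp, 9, 1.0);

        // Solve as a MIP and read the result
        glp_iocp iocp = new glp_iocp();
        GLPK.glp_init_iocp(iocp);
        iocp.setPresolve(GLPKConstants.GLP_ON);
        GLPK.glp_intopt(lp, iocp);

        System.out.println("X = " + GLPK.glp_mip_col_val(lp, 9));
        GLPK.glp_delete_prob(lp);
    }
}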
I have run into a problem.
Below is the content I indexed:
was written for a formula, it's written as this, indeed.[latxt]$$ \left( {a + b} \right)\left( {{1 \over a} + {1 \over b}} \right) \ge \left( {a \cdot {1 \over a} + b \cdot {1 \over b}} \right)^2 = 4 $$[/latxt] was written for a formula, it's written as this, indeed.
When I search for 1 \over b, I use the highlighter's SimpleFragmenter to control the length of the highlighted content. The result is just part of the LaTeX code:
{1 \over a} + b \cdot {<em>1 \over b</em>}} \right)^2
But what I really want is the whole content between the [latxt] marks, so that I can compile it to a picture.
The methods I am considering are these:
If there are [latxt] marks, skip the highlighter, compile the content to a picture, and then use the term offsets to extract some surrounding text. But this method is not accurate enough.
Implement a Fragmenter myself that treats the content between [latxt] marks as a whole; but since I still can't master Fragmenter, this method is probably not the one to choose.
So I honestly hope you can show me some other ways that are more convenient and easier to accomplish.
You will need to use TermVectors with position and offsets. This post explains how.
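For reference, here is a small sketch of how a field can be indexed with term vectors, positions and offsets using the Lucene 4.x+ API; the field name "content" and the indexedText variable are just placeholders.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

// Store term vectors with positions and offsets so a matched term can be
// mapped back to its exact character range, which lets you recover the
// surrounding [latxt]...[/latxt] block from the stored content.
FieldType type = new FieldType(TextField.TYPE_STORED);
type.setStoreTermVectors(true);
type.setStoreTermVectorPositions(true);
type.setStoreTermVectorOffsets(true);

Document doc = new Document();
doc.add(new Field("content", indexedText, type));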
Consider the following simple Python function by way of example:
from math import floor

def quantize(data, nlevels, quantizer=lambda x, d: int(floor(x/d))):
    llim = min(data)
    delta = (max(data) - llim)/(nlevels - 1)  # last level x == max(data) only
    y = type(data)
    if delta == 0:
        return y([0] * len(data))
    else:
        return y([quantizer(x - llim, delta) for x in data])
And here it is in action:
>>> from random import random
>>> data = [10*random() for _ in range(10)]
>>> data
[6.6181668777075018, 9.0511321773967737, 1.8967672216187881, 7.3396890304913951,
4.0566699095012835, 2.3589022034131069, 0.76888247730320769, 8.994874996737197,
7.1717500363578246, 2.887112256757157]
>>> quantize(data, nlevels=5)
[2, 4, 0, 3, 1, 0, 0, 3, 3, 1]
>>> quantize(tuple(data), nlevels=5)
(2, 4, 0, 3, 1, 0, 0, 3, 3, 1)
>>> from math import floor
>>> quantize(data, nlevels=5, quantizer=lambda x, d: (floor(x/d) + 0.5))
[2.5, 4.5, 0.5, 3.5, 1.5, 0.5, 0.5, 3.5, 3.5, 1.5]
This function certainly has flaws --for one thing, it does not validate arguments, and it should be smarter about how it sets the type of the returned value--, but it has the virtue that it will work whether the elements in data are integers or floats or some other numeric type. Also, by default it returns a list of ints, though, by passing a suitable function as the optional quantizer argument, this type can be changed to something else. Furthermore, if the data parameter is a list, the returned value will be a list; if data is a tuple, the returned value will be a tuple. (This last feature is certainly the weakest one, but it is also the one that I'm least interested in replicating in Java, so I did not bother to make it more robust.)
I would like to write an efficient Java equivalent of this function, which means figuring out how to get around Java's typing. Since I learned Java (aeons ago), generics were introduced into the language. I've tried learning about Java generics, but I find them pretty incomprehensible. I don't know if this is due to early-onset senility, or to the sheer growth in Java's complexity since I last programmed in it (ca. 2001), but every page I find on this topic is more confusing than the previous one. I'd really appreciate it if someone could show me how to do this in Java.
Thanks!
One solution to the input/output type question might be to use the Number class and its subclasses along with wildcards. If you wanted to accept any type of numerical argument, you could either specify the input type to be Number OR ? extends Number. If the input is a list, the latter form has an advantage as it will ensure that each element of the list is of the same type (which must be a subclass of Number). The ? is known as a Wildcard, and when it is expressed as ? extends Number it is a "Bounded Wildcard", and may only refer to a subtype of the bounding type.
Example:
public List<Number> func(List<? extends Number> data, Number nlevels)
This would take a List of a specific subclass of Number, a Number for the nlevels parameter, and return a List of Numbers
As for the function input parameter, it would be possible to input a Method, though the type checking at this point gets difficult, as you will be passing data of a bounded unknown parameter to a Method object. I am not exactly sure how that would work.
As for the return type, it would be possible to specify another parameter, a class object (likely a ? extends Number again) that the list elements would be cast (or converted) to.
public List<? extends Number> quantize(List<? extends Number> data,
Number nlevels,
Method quantizer,
Class<? extends Number> returnType)
That is an attempt at a possible declaration for your function in Java. The implementation, however, is somewhat more complex.
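To give a feeling for what that implementation could look like, here is a minimal sketch. It is not a definitive translation: it substitutes a small hand-rolled callback interface (here called QuantizeFunction) for the java.lang.reflect.Method parameter suggested above, and it always returns a List rather than preserving the input container type.
import java.util.ArrayList;
import java.util.List;

public class Quantizer {

    // Hypothetical callback interface standing in for Python's `quantizer` lambda
    public interface QuantizeFunction {
        Number apply(double x, double delta);
    }

    // Default quantizer: int(floor(x / delta)), mirroring the Python default
    public static final QuantizeFunction FLOOR = new QuantizeFunction() {
        public Number apply(double x, double delta) {
            return (int) Math.floor(x / delta);
        }
    };

    public static List<Number> quantize(List<? extends Number> data, int nlevels,
                                        QuantizeFunction quantizer) {
        double llim = Double.POSITIVE_INFINITY;   // min(data)
        double max = Double.NEGATIVE_INFINITY;    // max(data)
        for (Number n : data) {
            double v = n.doubleValue();
            if (v < llim) llim = v;
            if (v > max) max = v;
        }
        double delta = (max - llim) / (nlevels - 1);
        List<Number> result = new ArrayList<Number>(data.size());
        for (Number n : data) {
            result.add(delta == 0 ? Integer.valueOf(0)
                                  : quantizer.apply(n.doubleValue() - llim, delta));
        }
        return result;
    }
}
Calling quantize(data, 5, Quantizer.FLOOR) on a List<Double> then yields a List<Number> analogous to the Python example.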
It's not quite what you asked, but may I suggest trying out Jython? You'd be able to take your Python code and compile directly to Java bytecode. As you haven't used Java since 2001 and you seem to be using Python these days, you may find Jython much easier to work with than having to bring yourself up to speed on all the changes in Java beforehand.