I'm trying to create an "automated training" using Weka's Java API, but I guess I'm doing something wrong. Whenever I test my ARFF file via Weka's interface using MultilayerPerceptron with 10-fold cross-validation or a 66% percentage split, I get satisfactory results (around 90%). But when I test the same file via Weka's API, every test returns basically a 0% match (every row returns false).
Here's the output from Weka's GUI:
=== Evaluation on test split ===
=== Summary ===

Correctly Classified Instances          78               91.7647 %
Incorrectly Classified Instances         7                8.2353 %
Kappa statistic                          0.8081
Mean absolute error                      0.0817
Root mean squared error                  0.24
Relative absolute error                 17.742  %
Root relative squared error             51.0603 %
Total Number of Instances               85

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.885     0.068      0.852     0.885     0.868      0.958    1
                 0.932     0.115      0.948     0.932     0.94       0.958    0
Weighted Avg.    0.918     0.101      0.919     0.918     0.918      0.958

=== Confusion Matrix ===

  a  b   <-- classified as
 23  3 |  a = 1
  4 55 |  b = 0
And here's the code I'm using in Java (actually it's on .NET using IKVM):
var cl = new weka.classifiers.functions.MultilayerPerceptron();
cl.setOptions(weka.core.Utils.splitOptions("-L 0.7 -M 0.3 -N 75 -V 0 -S 0 -E 20 -H a")); //these are the same options (the default options) as when the test is run under the Weka GUI
string trainingFile = Properties.Settings.Default.WekaTrainingFile; //the path to the same file I use to test on weka explorer
weka.core.Instances data = null;
data = new weka.core.Instances(new java.io.BufferedReader(new java.io.FileReader(trainingFile))); //loads the file
data.setClassIndex(data.numAttributes() - 1); //set the last column as the class attribute
cl.buildClassifier(data);
var tmp = System.IO.Path.GetTempFileName(); //creates a temp file to create an arff file with a single row with the instance I want to test taken from the arff file loaded previously
using (var f = System.IO.File.CreateText(tmp))
{
//long code to read data from db and regenerate the line, simulating data coming from the source I really want to test
}
var dataToTest = new weka.core.Instances(new java.io.BufferedReader(new java.io.FileReader(tmp)));
dataToTest.setClassIndex(dataToTest.numAttributes() - 1);
double prediction = 0;
for (int i = 0; i < dataToTest.numInstances(); i++)
{
weka.core.Instance curr = dataToTest.instance(i);
weka.core.Instance inst = new weka.core.Instance(data.numAttributes());
inst.setDataset(data);
for (int n = 0; n < data.numAttributes(); n++)
{
weka.core.Attribute att = dataToTest.attribute(data.attribute(n).name());
if (att != null)
{
if (att.isNominal())
{
if ((data.attribute(n).numValues() > 0) && (att.numValues() > 0))
{
String label = curr.stringValue(att);
int index = data.attribute(n).indexOfValue(label);
if (index != -1)
inst.setValue(n, index);
}
}
else if (att.isNumeric())
{
inst.setValue(n, curr.value(att));
}
else
{
throw new InvalidOperationException("Unhandled attribute type!");
}
}
}
prediction += cl.classifyInstance(inst);
}
//prediction is always 0 here, my ARFF file has two classes: 0 and 1, 92 zeroes and 159 ones
It's funny, because if I change the classifier to, say, NaiveBayes, the results match the test made via Weka's GUI.
You are using a deprecated way of reading in ARFF files. See this documentation. Try this instead:
import weka.core.converters.ConverterUtils.DataSource;
...
DataSource source = new DataSource("/some/where/data.arff");
Instances data = source.getDataSet();
Note that that documentation also shows how to connect to a database directly, and bypass the creation of temporary ARFF files. You could, additionally, read from the database and manually create instances to populate the Instances object with.
Finally, if simply changing the classifier type at the top of the code to NaiveBayes solved the problem, then check the options in your weka gui for MultilayerPerceptron, to see if they are different from the defaults (different settings can cause the same classifier type to produce different results).
Update: it looks like you're using different test data in your code than in your weka GUI (from a database vs a fold of the original training file); it might also be the case that the particular data in your database actually does look like class 0 to the MLP classifier. To verify whether this is the case, you can use the weka interface to split your training arff into train/test sets, and then repeat the original experiment in your code. If the results are the same as the gui, there's a problem with your data. If the results are different, then we need to look more closely at the code. The function you would call is this (from the Doc):
public Instances trainCV(int numFolds, int numFold)
I had the same problem.
Weka gave me different results in the Explorer compared to a cross-validation in Java.
Something that helped:
Instances dataSet = ...;
dataSet.stratify(numOfFolds); // use this
//before splitting the dataset into train and test set!
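To show why stratifying first matters, here is a minimal plain-Java sketch with no Weka dependency. It is a simplified illustration of the idea, not Weka's actual implementation: it groups row indices by class label and deals them round-robin into folds, so every fold keeps roughly the overall class ratio. The class name StratifiedFoldsSketch and the method assignFolds are hypothetical, and the 92/159 class counts are taken from the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StratifiedFoldsSketch {

    // Returns fold[i] = fold number (0..numFolds-1) assigned to row i.
    static int[] assignFolds(int[] labels, int numFolds) {
        // Group row indices by class label, keeping first-seen class order.
        Map<Integer, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
        }
        // Deal the rows of each class round-robin into the folds.
        int[] fold = new int[labels.length];
        int next = 0;
        for (List<Integer> rows : byClass.values()) {
            for (int row : rows) {
                fold[row] = next % numFolds;
                next++;
            }
        }
        return fold;
    }

    public static void main(String[] args) {
        // 92 rows of class 0 followed by 159 rows of class 1 (counts from the question).
        int[] labels = new int[251];
        for (int i = 92; i < labels.length; i++) {
            labels[i] = 1;
        }
        int[] fold = assignFolds(labels, 10);
        int[][] counts = new int[10][2];
        for (int i = 0; i < labels.length; i++) {
            counts[fold[i]][labels[i]]++;
        }
        for (int f = 0; f < 10; f++) {
            System.out.println("fold " + f + ": class0=" + counts[f][0] + " class1=" + counts[f][1]);
        }
    }
}
```

Without this step, an ARFF file sorted by class could produce folds that contain almost only one class, which is one plausible way to end up with results that differ from the Explorer.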
I'm trying to copy the exercise about halfway down the page at this link:
https://d2l.ai/chapter_recurrent-neural-networks/sequence.html
The exercise uses a sine function to create 1000 data points between -1 and 1 and uses a recurrent network to approximate the function.
Below is the code I used. I'm going back to study why this isn't working, as it doesn't make much sense to me given that I was easily able to use a feed-forward network to approximate this function.
//get data
ArrayList<DataSet> list = new ArrayList();
DataSet dss = DataSetFetch.getDataSet(Constants.DataTypes.math, "sine", 20, 500, 0, 0);
DataSet dsMain = dss.copy();
if (!dss.isEmpty()){
list.add(dss);
}
if (list.isEmpty()){
return;
}
//format dataset
list = DataSetFormatter.formatReccurnent(list, 0);
//get network
int history = 10;
ArrayList<LayerDescription> ldlist = new ArrayList<>();
LayerDescription l = new LayerDescription(1,history, Activation.RELU);
ldlist.add(l);
LayerDescription ll = new LayerDescription(history, 1, Activation.IDENTITY, LossFunctions.LossFunction.MSE);
ldlist.add(ll);
ListenerDescription ld = new ListenerDescription(20, true, false);
MultiLayerNetwork network = Reccurent.getLstm(ldlist, 123, WeightInit.XAVIER, new RmsProp(), ld);
//train network
final List<DataSet> lister = list.get(0).asList();
DataSetIterator iter = new ListDataSetIterator<>(lister, 50);
network.fit(iter, 50);
network.rnnClearPreviousState();
//test network
ArrayList<DataSet> resList = new ArrayList<>();
DataSet result = new DataSet();
INDArray arr = Nd4j.zeros(lister.size()+1);
INDArray holder;
if (list.size() > 1){
//test on training data
System.err.println("oops");
}else{
//test on original or scaled data
for (int i = 0; i < lister.size(); i++) {
holder = network.rnnTimeStep(lister.get(i).getFeatures());
arr.putScalar(i,holder.getFloat(0));
}
}
//add originaldata
resList.add(dsMain);
//result
result.setFeatures(dsMain.getFeatures());
result.setLabels(arr);
resList.add(result);
//display
DisplayData.plot2DScatterGraph(resList);
Can you explain the code I would need for a 1-input, 10-hidden, 1-output LSTM network to approximate a sine function?
I'm not using any normalization, as the function is already in the range -1 to 1, and I'm using the Y input as the feature and the following Y input as the label to train the network.
You'll notice I am building a class that allows for easier construction of nets. I have tried throwing many changes at the problem, but I am sick of guessing.
Here are some examples of my results. Blue is the data, red is the result.
This is one of those times where you go from wondering why something was not working to wondering how in the hell your original results were as good as they were.
My failing was not understanding the documentation clearly, and also not understanding BPTT.
With feed-forward networks, each example is stored as a row and each input as a column, for example [dataset.size, networkinputs.size].
With recurrent input, however, it's reversed: each row is an input and each column is a step in time, which is needed to activate the state of the LSTM chain of events. At minimum my input needed to be [0, networkinputs.size, dataset.size], but it could also be [dataset.size, networkinputs.size, statelength.size].
In my previous example I was training the network with data in the format [dataset.size, networkinputs.size, 1]. So, from my low-resolution understanding, the LSTM network should never have worked at all, yet it somehow produced at least something.
There may also have been some issue with converting the dataset to a list, as I also changed how I feed the network, but I think the bulk of the issue was the data structure.
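To make the three-dimensional layout concrete, here is a minimal plain-Java sketch (no DL4J or ND4J calls) that cuts a sine series into sliding windows arranged as [examples][inputs][timeSteps], with the value that follows each window as its label. The class and method names are hypothetical, and the window length of 10 is only an assumption borrowed from the `history` variable in the question.

```java
final class SineWindowsSketch {

    // features[e][0][t] = series[e + t]: one input feature, timeSteps columns.
    static double[][][] toWindows(double[] series, int timeSteps) {
        int examples = series.length - timeSteps;
        double[][][] features = new double[examples][1][timeSteps];
        for (int e = 0; e < examples; e++) {
            for (int t = 0; t < timeSteps; t++) {
                features[e][0][t] = series[e + t];
            }
        }
        return features;
    }

    // labels[e] = the value immediately after window e.
    static double[] toLabels(double[] series, int timeSteps) {
        int examples = series.length - timeSteps;
        double[] labels = new double[examples];
        for (int e = 0; e < examples; e++) {
            labels[e] = series[e + timeSteps];
        }
        return labels;
    }

    public static void main(String[] args) {
        double[] series = new double[500];
        for (int i = 0; i < series.length; i++) {
            series[i] = Math.sin(0.05 * i);
        }
        double[][][] x = toWindows(series, 10);
        double[] y = toLabels(series, 10);
        System.out.println("features: " + x.length + " x " + x[0].length + " x " + x[0][0].length);
        System.out.println("labels:   " + y.length);
    }
}
```

Arrays shaped like this would still need to be wrapped in whatever array type the network library expects; that step is left out here on purpose.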
Below are my new results
Hard to tell what is going on without seeing the full code. For a start, I don't see an RnnOutputLayer specified. You could take a look at this, which shows you how to build an RNN in DL4J.
If your RNN setup is correct, this could be a tuning issue. You can find more on tuning here. Adam is probably a better choice for an updater than RMSProp. And tanh is probably a good choice for the activation of your output layer, since its range is (-1, 1). Other things to check or tweak: learning rate, number of epochs, and the setup of your data (for example, are you trying to predict too far out?).
I'm using my own bag-of-words model instead of Weka's StringToWordVector (which turned out to be a mistake, but as it's only a school project, I'd like to finish it with my approach), so I cannot use its CrossFoldEvaluation, as my BoW dictionary would contain the words of the training data too.
for (int n = 0; n < folds; n++) {
List<String> allData = getAllReviews(); // 2000 reviews
List<String> trainingData = getTrainingReviews(n, folds); // random 1800 reviews
List<String> testData = getTestReviews(n, folds); // random 200 reviews
bagOfWordsModel.train(trainingData); // builds a vocabulary of 1800 training reviews
Instances inst = bagOfWordsModel.vectorize(allData); // returns 1800 instances with the class attribute set to positive or negative, and 200 without
// todo: evaluate
Classifier cModel = (Classifier) new NaiveBayes();
cModel.buildClassifier(inst);
Evaluation eTest = new Evaluation(inst);
eTest.evaluateModel(cModel, inst);
// print results
String strSummary = eTest.toSummaryString();
System.out.println(strSummary);
}
How can I now evaluate this? I thought, weka will automatically try to determine the class attribute of the instances that have no value for the class attribute. But instead, it tells me weka.filters.supervised.attribute.Discretize: Cannot handle missing class values!
As you have both a training set and a testing set, you should train the classifier on the training data, which should be labelled, and then use the trained model to classify the unlabeled test data.
Classifier cModel = new NaiveBayes();
cModel.buildClassifier(trainingData);
And then, with the use of the following line you should be able to classify an unknown instance and get a prediction:
double clsLabel = cModel.classifyInstance(testData.instance(0));
Or you could use the Evaluation class to make predictions on the entire test set.
Evaluation evaluation = new Evaluation(trainingData); // the constructor needs the training Instances for the header and class priors
evaluation.evaluateModel(cModel, testData);
You have pointed out that you are attempting to implement your own cross-validation by taking a random subset of the data. There is a method in the Evaluation class that does k-fold cross-validation for you (crossValidateModel).
Evaluation evaluation = new Evaluation(trainingData);
evaluation.crossValidateModel(cModel, trainingData, 10, new Random(1));
Note: Cross-validation is used when you don't have a separate test set: it holds a subset of the training data out of training and uses that subset to evaluate performance.
K-fold cross-validation splits the training data into K subsets. It puts one of the subsets aside and uses the remaining to train the classifier, returning to the subset set aside to evaluate the model. It then repeats this process until it has used each subset as the test set.
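If you keep your own bag-of-words model and therefore have to do the folds by hand, the splitting itself can be sketched in plain Java as below. This is only an illustration of the k-fold procedure just described, not Weka's implementation, and the names KFoldSketch, folds and trainingIndices are hypothetical. In each round you would build the vocabulary and train on the training indices only, then evaluate on the held-out fold.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class KFoldSketch {

    // Shuffles the indices 0..n-1 with a fixed seed and deals them into k folds.
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            result.get(i % k).add(indices.get(i));
        }
        return result;
    }

    // Training indices for one round: everything that is not in the held-out fold.
    static List<Integer> trainingIndices(List<List<Integer>> folds, int heldOut) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != heldOut) {
                train.addAll(folds.get(f));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        // 2000 reviews and 10 folds, as in the question.
        List<List<Integer>> folds = folds(2000, 10, 1L);
        for (int f = 0; f < folds.size(); f++) {
            int testSize = folds.get(f).size();
            int trainSize = trainingIndices(folds, f).size();
            System.out.println("round " + f + ": train=" + trainSize + " test=" + testSize);
        }
    }
}
```

Every review ends up in the test fold exactly once, so the per-round results can be averaged at the end.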
When training, only input the instances that have a class set.
In this line:
cModel.buildClassifier(inst);
you are training a naive Bayes classifier. Input only the training examples(!). Evaluate against all data (with labels!). Evaluation checks the predicted label against the actual label, if I remember correctly.
The 200 data points without a class label seem useless. What are they for?
I have to monitor a log file in which is written the history of utilization of an app. This log file is formatted in this way:
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
... about 800000 rows
AppId is always the same, because the log refers to only one app. date is expressed in the format dd/mm/yyyy hh/mm, and cpuUsage and memoryUsage are expressed in %. For example:
<3ghffh3t482age20304,230720142245,0.2,3,5>
To be specific, I have to check the percentage of CPU and memory used by the monitored application, using Spark and the MapReduce algorithm.
My output should print an alert when CPU or memory usage reaches 100%.
How can I start?
The idea is to declare a class and map each line to a Scala object.
Let's declare the case class as follows:
case class App(name: String, date: String, cpuUsage: Double, memoryusage: Double)
Then initialize the SparkContext and create an RDD from the text file where the data is present:
val sc = new SparkContext(sparkConf)
val inFile = sc.textFile("log.txt")
Then parse each line and map it to an App object, so that the range check is straightforward:
val mappedLines = inFile.map(x => (x.split(",")(0), parse(x)))
where the parse(x) method is defined as follows,
def parse(x: String):App = {
val splitArr = x.split(",");
val app = new App(splitArr(0),
splitArr(1),
splitArr(2).toDouble,
splitArr(3).toDouble)
return app
}
Note that I have assumed the input looks as follows (this is just to give you the idea, not the entire program):
ffh3t482age20304,230720142245,0.2,100.5
Then apply a filter transformation, where you can perform the check and report the anomalous conditions:
val anamolyLines = mappedLines.filter(doCheckCPUAndMemoryUtilization)
anamolyLines.count()
where doCheckCPUAndMemoryUtilization function is defined as follows,
def doCheckCPUAndMemoryUtilization(x:(String, App)):Boolean = {
if(x._2.cpuUsage >= 100.0 ||
x._2.memoryusage >= 100.0) {
System.out.println("App name -> "+x._2.name +" exceed the limit")
return true
}
return false
}
Note: This is only batch processing, not real-time processing.
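The parse step above assumes lines without the surrounding angle brackets shown in the question. As a hedged sketch of how the line handling could be made a bit more defensive, here is the same idea in plain Java with no Spark dependency. The names LogLineSketch, Usage and isAlert are hypothetical, and it assumes exactly four comma-separated fields with a decimal point in the numbers.

```java
final class LogLineSketch {

    // One parsed log row.
    static final class Usage {
        final String appId;
        final String date;
        final double cpu;
        final double memory;

        Usage(String appId, String date, double cpu, double memory) {
            this.appId = appId;
            this.date = date;
            this.cpu = cpu;
            this.memory = memory;
        }

        // Alert condition from the question: CPU or memory at 100% or more.
        boolean isAlert() {
            return cpu >= 100.0 || memory >= 100.0;
        }
    }

    // Strips an optional leading '<' and trailing '>' and splits on commas.
    static Usage parse(String line) {
        String s = line.trim();
        if (s.startsWith("<")) {
            s = s.substring(1);
        }
        if (s.endsWith(">")) {
            s = s.substring(0, s.length() - 1);
        }
        String[] parts = s.split(",");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields but got " + parts.length + ": " + line);
        }
        return new Usage(parts[0].trim(), parts[1].trim(),
                Double.parseDouble(parts[2].trim()), Double.parseDouble(parts[3].trim()));
    }

    public static void main(String[] args) {
        Usage u = parse("<ffh3t482age20304,230720142245,0.2,100.5>");
        if (u.isAlert()) {
            System.out.println("App " + u.appId + " exceeded the limit at " + u.date);
        }
    }
}
```

Rejecting rows with an unexpected field count makes malformed lines (such as the five-field example in the question) visible instead of silently misreading them.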
I am creating a large number of output files, for example 500. I am getting an AlreadyBeingCreatedException, as shown below. The program recovers by itself when the number of output files is small. For example, with 50 files, although this exception occurs, the program runs successfully after printing the exception several times.
But, for many files, it eventually fails with an IOException.
I have pasted the error and then the code below:
12/10/29 15:47:27 INFO mapred.JobClient: Task Id : attempt_201210231820_0235_r_000004_3, Status : FAILED
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /home/users/mlakshm/preopa406/data-r-00004 for DFSClient_attempt_201210231820_0235_r_000004_3 on client 10.0.1.100, because this file is already being created by DFSClient_attempt_201210231820_0235_r_000004_2 on 10.0.1.130
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1406)
I have pasted the code :
In the reduce method, I have the logic below to generate outputs:
int data_hash = (int)data_str.hashCode();
int data_int1 = 0;
int k = 500;
int check1 = 0;
for (int l = 10; l>0; l++)
{
if((data_hash%l==0)&&(check1 == 0))
{
check1 = 1;
int range = (int) k/10;
String check = "true";
while(range > 0 && check.equals("true"))
{
if(data_hash % range-1 == 0)
{
check = "false";
data_int1 = range*10;
}
}
}
}
mos.getCollector("/home/users/mlakshm/preopa407/cdata"+data_int1, reporter).collect(new Text(t+" "+alsort.get(0)+" "+alsort.get(1)), new Text(intersection));
Please help!
The problem is that all the reducers are trying to write files with the same naming scheme.
The reason is that this call:
mos.getCollector("/home/users/mlakshm/preopa407/cdata"+data_int1, reporter).collect(new Text(t+" "+alsort.get(0)+" "+alsort.get(1)), new Text(intersection));
sets the file name based on a characteristic of the data, not the identity of the reducer.
You have a couple of choices:
Rework your map job so that the key it emits matches the hash you're calculating in this job. That would make sure that each reducer gets a span of values.
Include in the file name an identifier that is unique to each reducer. This would leave you with a set of part files for each reducer.
Could you perhaps explain why you're using multiple outputs here? I don't think you need to.
I wanted to use the Apache Commons Math FFT implementation (the FastFourierTransformer class) to process some dummy data in which 8 samples make up one complete sinusoidal wave, with a maximum amplitude of 230. The code snippet that I tried is below:
private double[] transform()
{
double [] input = new double[8];
input[0] = 0.0;
input[1] = 162.6345596729059;
input[2] = 230.0;
input[3] = 162.63455967290594;
input[4] = 2.8166876380389125E-14;
input[5] = -162.6345596729059;
input[6] = -230.0;
input[7] = -162.63455967290597;
double[] tempConversion = new double[input.length];
FastFourierTransformer transformer = new FastFourierTransformer();
try {
Complex[] complx = transformer.transform(input);
for (int i = 0; i < complx.length; i++) {
double rr = (complx[i].getReal());
double ri = (complx[i].getImaginary());
tempConversion[i] = Math.sqrt((rr * rr) + (ri * ri));
}
} catch (IllegalArgumentException e) {
System.out.println(e);
}
return tempConversion;
}
1) The data returned by the transform method is an array of complex numbers. Does that array contain the frequency component information about the input data, or does the tempConversion array that I created contain the frequency information? The values in the tempConversion array are:
2.5483305001488234E-16
920.0
4.0014578493024757E-14
2.2914314707516465E-13
5.658858581079313E-14
2.2914314707516465E-13
4.0014578493024757E-14
920.0
2) I searched a lot, but in most places there is no clear documentation on what data format the algorithm expects (sample code would help me understand better). How do I use the array of results to calculate the frequencies contained in the signal?
Your output data looks correct. You've calculated the magnitude of the complex FFT output at each frequency bin which corresponds to the energy in the input signal at the corresponding frequency for that bin. Since your input is purely real, the output is complex conjugate symmetric, and the last 3 output values are redundant.
So you have:
Bin  Freq        Magnitude
0    0 (DC)      2.5483305001488234E-16
1    Fs/8        920.0
2    Fs/4        4.0014578493024757E-14
3    3Fs/8       2.2914314707516465E-13
4    Fs/2 (Nyq)  5.658858581079313E-14
5    3Fs/8       2.2914314707516465E-13  # redundant - mirror image of bin 3
6    Fs/4        4.0014578493024757E-14  # redundant - mirror image of bin 2
7    Fs/8        920.0                   # redundant - mirror image of bin 1
All the values are effectively 0 apart from bin 1 (and its mirror image, bin 7), which corresponds to a frequency of Fs/8, as expected.
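To turn a bin index into an actual frequency you need the sample rate: bin k of an N-point transform sits at k * Fs / N. Here is a small sketch of that mapping. It uses a naive O(N²) DFT written in plain Java rather than the Commons Math FFT, purely so that it runs without any dependency, and the sample rate of 8000 Hz is an assumed value for illustration. The class and method names are hypothetical.

```java
final class DftBinsSketch {

    // Magnitude of each (unnormalized) DFT bin, computed straight from the definition.
    static double[] magnitudes(double[] x) {
        int n = x.length;
        double[] mag = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0.0;
            double im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im += x[t] * Math.sin(angle);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    // Centre frequency of bin k for sample rate fs and transform length n.
    static double binFrequency(int k, double fs, int n) {
        return k * fs / n;
    }

    public static void main(String[] args) {
        int n = 8;
        double fs = 8000.0; // assumed sample rate in Hz
        double[] input = new double[n];
        for (int t = 0; t < n; t++) {
            input[t] = 230.0 * Math.sin(2.0 * Math.PI * t / n); // one full cycle in 8 samples
        }
        double[] mag = magnitudes(input);
        // Only bins 0..N/2 carry independent information for real input.
        for (int k = 0; k <= n / 2; k++) {
            double amplitude = (k == 0 || k == n / 2) ? mag[k] / n : 2.0 * mag[k] / n;
            System.out.println("bin " + k + " -> " + binFrequency(k, fs, n) + " Hz, amplitude ~ " + amplitude);
        }
    }
}
```

Scaling the magnitude by 2/N for the bins strictly between DC and Nyquist recovers the sine amplitude, which is how 920 maps back to 230 here.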