I have a CSV file containing 24231 rows. I would like to apply LOOCV based on the project name instead of on the individual observations of the whole dataset.
So if my dataset contains information for 15 projects, I would like the training set to be built from 14 projects and the test set from the remaining project.
I am using Weka's API. Is there anything that automates this process?
For non-numeric attributes, Weka allows you to retrieve the unique values via Attribute.numValues() (how many there are) and Attribute.value(int) (the i-th value).
package weka;
import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils;
public class LOOByValue {
/**
* 1st arg: ARFF file to load
* 2nd arg: 0-based index in ARFF to use for class
* 3rd arg: 0-based index in ARFF to use for LOO
*
* @param args the command-line arguments
* @throws Exception if loading/processing of data fails
*/
public static void main(String[] args) throws Exception {
// load data
Instances full = ConverterUtils.DataSource.read(args[0]);
full.setClassIndex(Integer.parseInt(args[1]));
int looCol = Integer.parseInt(args[2]);
Attribute looAtt = full.attribute(looCol);
if (looAtt.isNumeric())
throw new IllegalStateException("Attribute cannot be numeric!");
// iterate unique values to create train/test splits
for (int i = 0; i < looAtt.numValues(); i++) {
String value = looAtt.value(i);
System.out.println("\n" + (i+1) + "/" + full.attribute(looCol).numValues() + ": " + value);
Instances train = new Instances(full, full.numInstances());
Instances test = new Instances(full, full.numInstances());
for (int n = 0; n < full.numInstances(); n++) {
Instance inst = full.instance(n);
if (inst.stringValue(looCol).equals(value))
test.add((Instance) inst.copy());
else
train.add((Instance) inst.copy());
}
train.compactify();
test.compactify();
// TODO do something with the data
System.out.println("train size: " + train.numInstances());
System.out.println("test size: " + test.numInstances());
}
}
}
With Weka's anneal UCI dataset and the surface-quality attribute as the leave-one-out column, this produces output like the following:
1/5: ?
train size: 654
test size: 244
2/5: D
train size: 843
test size: 55
3/5: E
train size: 588
test size: 310
4/5: F
train size: 838
test size: 60
5/5: G
train size: 669
test size: 229
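The splitting logic itself does not depend on Weka. As a minimal sketch of the same leave-one-group-out idea using only the Java standard library, the following builds one split per distinct label. The class name LeaveOneGroupOut, the Split helper, and the project labels are hypothetical and only serve as illustration; rows are represented by their indices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LeaveOneGroupOut {

    /** One train/test split: rows whose label equals heldOut go to test. */
    public static final class Split {
        public final String heldOut;
        public final List<Integer> train = new ArrayList<>();
        public final List<Integer> test = new ArrayList<>();

        Split(String heldOut) {
            this.heldOut = heldOut;
        }
    }

    /** Builds one split per distinct label, in order of first appearance. */
    public static List<Split> splits(List<String> groups) {
        Set<String> distinct = new LinkedHashSet<>(groups);
        List<Split> result = new ArrayList<>();
        for (String group : distinct) {
            Split split = new Split(group);
            for (int row = 0; row < groups.size(); row++) {
                if (groups.get(row).equals(group)) {
                    split.test.add(row);   // held-out project
                } else {
                    split.train.add(row);  // all other projects
                }
            }
            result.add(split);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical project labels, one per row of the dataset.
        List<String> projects = Arrays.asList("A", "A", "B", "C", "C", "C");
        for (Split s : splits(projects)) {
            System.out.println(s.heldOut + ": train=" + s.train + " test=" + s.test);
        }
    }
}
```

Each split holds out every row of exactly one project, so with 15 projects you would get 15 train/test pairs.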
There seems to be no problem when training my network: it converges and the error falls below 0.01. However, when I load the trained network and feed it the evaluation set, it outputs the same result for every row (this is the actual prediction, not the training phase). I trained the network with resilient propagation, using 9 inputs, 1 hidden layer with 7 hidden neurons, and 1 output neuron. UPDATE: My data is normalized using min-max. I am trying to predict electric load data.
Here is the sample data; the first 9 columns are the inputs and the 10th is the ideal value:
0.5386671932975533, 1100000.0, 0.0, 1.0, 40.0, 1.0, 30.0, 9.0, 2014.0, 0.5260616667545941
0.5260616667545941, 1100000.0, 0.0, 1.0, 40.0, 2.0, 30.0, 9.0, 2014.0, 0.5196499668339777
0.5196499668339777, 1100000.0, 0.0, 1.0, 40.0, 3.0, 30.0, 9.0, 2014.0, 0.5083828048375548
0.5083828048375548, 1100000.0, 0.0, 1.0, 40.0, 4.0, 30.0, 9.0, 2014.0, 0.49985462144799725
0.49985462144799725, 1100000.0, 0.0, 1.0, 40.0, 5.0, 30.0, 9.0, 2014.0, 0.49085956670499675
0.49085956670499675, 1100000.0, 0.0, 1.0, 40.0, 6.0, 30.0, 9.0, 2014.0, 0.485008112408512
Here's the full code:
public class ANN
{
//training
//public final static String SQL = "SELECT load_input, day_of_week, weekend_day, type_of_day, week_num, time, day_date, month, year, ideal_value FROM sample WHERE (year,month,day_date,time) between (2012,4,1,1) and (2014,9,29, 96) ORDER BY ID";
//testing
public final static String SQL = "SELECT load_input, day_of_week, weekend_day, type_of_day, week_num, time, day_date, month, year, ideal_value FROM sample WHERE (year,month,day_date,time) between (2014,9,30,1) and (2014,9,30, 92) ORDER BY ID";
//validation
//public final static String SQL = "SELECT load_input, day_of_week, weekend_day, type_of_day, week_num, time, day_date, month, year, ideal_value FROM sample WHERE (year,month,day_date,time) between (2014,9,30,93) and (2014,9,30, 96) ORDER BY ID";
public final static int INPUT_SIZE = 9;
public final static int IDEAL_SIZE = 1;
public final static String SQL_DRIVER = "org.postgresql.Driver";
public final static String SQL_URL = "jdbc:postgresql://localhost/ANN";
public final static String SQL_UID = "postgres";
public final static String SQL_PWD = "";
public static void main(String args[])
{
Mynetwork();
//train network. will add customizable params later.
//train(trainingData());
//evaluate network
evaluate(trainingData());
Encog.getInstance().shutdown();
}
public static void evaluate(MLDataSet testSet)
{
BasicNetwork network = (BasicNetwork)EncogDirectoryPersistence.loadObject(new File("directory"));
// test the neural network
System.out.println("Neural Network Results:");
for(MLDataPair pair: testSet ) {
final MLData output = network.compute(pair.getInput());
System.out.println(pair.getInput().getData(0) + "," + pair.getInput().getData(1) + "," + pair.getInput().getData(2) + "," + pair.getInput().getData(3) + "," + pair.getInput().getData(4) + "," + pair.getInput().getData(5) + "," + pair.getInput().getData(6) + "," + pair.getInput().getData(7) + "," + pair.getInput().getData(8) + "," + "Predicted=" + output.getData(0) + ", Actual=" + pair.getIdeal().getData(0));
}
}
public static BasicNetwork Mynetwork()
{
//basic neural network template. The input layer shouldn't have an activation function
//because an activation acts on data coming from the previous layer, and there is no layer before the input.
BasicNetwork network = new BasicNetwork();
//input layer with 9 neurons.
//The 'true' parameter means that it should have a bias neuron. Bias neuron affects the next layer.
network.addLayer(new BasicLayer(null , true, 9));
//hidden layer with 5 neurons
network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 5));
//output layer with 1 neuron
network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 1));
network.getStructure().finalizeStructure() ;
network.reset();
return network;
}
public static void train(MLDataSet trainingSet)
{
//Backpropagation(network, dataset, learning rate, momentum)
//final Backpropagation train = new Backpropagation(Mynetwork(), trainingSet, 0.1, 0.9);
final ResilientPropagation train = new ResilientPropagation(Mynetwork(), trainingSet);
//final QuickPropagation train = new QuickPropagation(Mynetwork(), trainingSet, 0.9);
int epoch = 1;
do {
train.iteration();
System.out.println("Epoch #" + epoch + " Error:" + train.getError());
epoch++;
} while((train.getError() > 0.01));
System.out.println("Saving network");
System.out.println("Saving Done");
EncogDirectoryPersistence.saveObject(new File("directory"), Mynetwork());
}
public static MLDataSet trainingData()
{
MLDataSet trainingSet = new SQLNeuralDataSet(
ANN.SQL,
ANN.INPUT_SIZE,
ANN.IDEAL_SIZE,
ANN.SQL_DRIVER,
ANN.SQL_URL,
ANN.SQL_UID,
ANN.SQL_PWD);
return trainingSet;
}
}
Here is my result:
Predicted=0.4451817588640455, Actual=0.5260616667545941
Predicted=0.4451817588640455, Actual=0.5196499668339777
Predicted=0.4451817588640455, Actual=0.5083828048375548
Predicted=0.4451817588640455, Actual=0.49985462144799725
Predicted=0.4451817588640455, Actual=0.49085956670499675
Predicted=0.4451817588640455, Actual=0.485008112408512
Predicted=0.4451817588640455, Actual=0.47800504210686795
Predicted=0.4451817588640455, Actual=0.4693212349328293
(...and so on with the same "predicted")
Results I'm expecting (I replaced the "predicted" values with random numbers for demonstration purposes, to indicate that the network is actually predicting):
Predicted=0.4451817588640455, Actual=0.5260616667545941
Predicted=0.5123312331212122, Actual=0.5196499668339777
Predicted=0.435234234234254365, Actual=0.5083828048375548
Predicted=0.673424556563455, Actual=0.49985462144799725
Predicted=0.2344673345345544235, Actual=0.49085956670499675
Predicted=0.123346457544324, Actual=0.485008112408512
Predicted=0.5673452342342342, Actual=0.47800504210686795
Predicted=0.678435234423423423, Actual=0.4693212349328293
The first thing to check when you get weird results from a neural network is normalization. Your data must be normalized; otherwise the training can produce a skewed network that gives the same output every time. This is a common symptom.
Always normalize your data before feeding it into a neural network. This matters because the sigmoid activation function is essentially flat for inputs of large magnitude (positive or negative), which makes the network behave like a constant. Try normalizing like this: input = (input - median(input)) / std(input)
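To make the saturation point concrete, here is a minimal sketch in plain Java, independent of Encog. It evaluates the sigmoid at raw magnitudes like those in the sample rows (2014.0 and 1100000.0) and then at standardized values. Note the assumptions: the class name StandardizeDemo and its methods are hypothetical, and the sketch centers on the mean (the common z-score variant) rather than the median used in the formula above.

```java
public class StandardizeDemo {

    /** Logistic sigmoid, the activation used in the question's network. */
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /**
     * Returns (x - mean) / std for each value of one input column.
     * Assumption: mean-centered z-score; a constant column maps to all zeros.
     */
    public static double[] standardize(double[] column) {
        double mean = 0.0;
        for (double v : column) {
            mean += v;
        }
        mean /= column.length;

        double variance = 0.0;
        for (double v : column) {
            variance += (v - mean) * (v - mean);
        }
        double std = Math.sqrt(variance / column.length);

        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = std == 0.0 ? 0.0 : (column[i] - mean) / std;
        }
        return out;
    }

    public static void main(String[] args) {
        // Raw magnitudes like these push the sigmoid onto its flat part.
        System.out.println(sigmoid(2014.0));
        System.out.println(sigmoid(1100000.0));

        // After standardizing, the same activation distinguishes the values.
        double[] time = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
        for (double z : standardize(time)) {
            System.out.println(sigmoid(z));
        }
    }
}
```

In a real network the sigmoid sees a weighted sum rather than a single raw input, but a column with values in the thousands or millions tends to dominate that sum in the same way.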
I'm studying Java and I just wrote a program that converts CME quotes for soybeans, wheat or corn to "price per 60kg bag" in BRL (Brazilian currency). I used JSmooth to wrap it and it runs perfectly on my machine. I tried sending it to my wife's PC in a Zip file and I also tried to execute it from the command line by directly calling the Main class. In both cases, the program runs until the part where it asks for the USD/BRL quote, and after one enters it and presses "Enter", the program shows an apparent runtime error:
Options: soybean / wheat / corn
Type name of commodity(lowercase):
soybean
Type current exchange rate(USD/BRL):
3.11
Exception in thread "main" java.util.InputMismatchException
at java.util.Scanner.throwFor(Unknown Source)
at java.util.Scanner.next(Unknown Source)
at java.util.Scanner.nextDouble(Unknown Source)
at PriceCME.getExchange_Rate(PriceCME.java:26)
at PriceCMEExecute.main(PriceCMEExecute.java:8)
I thought that maybe the version of Java on her PC was outdated, so I removed all previous versions, installed the JDK 111 and tried to run it again.
The same issue happened.
I then tried to recompile the .java files on her PC and there was no compile time error. When I tried to execute it, again the same issue.
The code is the following:
import java.util.Scanner;
public class PriceCME {
private String nameOfCommodity;
private final double BUSHEL_SOY_WHEAT = 27.2155;
private final double BUSHEL_CORN = 25.4012;
private final int KGPERSACA = 60;
private double quote;
private double exchRate;
private double pricePerSaca;
public void getNameOfComodity()
{
System.out.println("Options: soybean / wheat / corn");
System.out.print("Type name of commodity(lowercase): ");
System.out.println();
Scanner commodity = new Scanner(System.in);
nameOfCommodity = commodity.next();
}
public void getExchange_Rate()
{
System.out.print("Type current exchange rate(USD/BRL): ");
System.out.println();
Scanner exchangeRate = new Scanner(System.in);
exchRate = exchangeRate.nextDouble();
}
public void getQuote()
{
System.out.println("Source for quotes: http://www.cmegroup.com/trading/agricultural/");
System.out.print("Type quote of commodity on CME: ");
System.out.println();
Scanner getQuote = new Scanner(System.in);
quote = getQuote.nextDouble();
}
public void CalculatePricePerSaca()
{
switch(nameOfCommodity)
{
case "soybean":
pricePerSaca = (((quote * KGPERSACA) / 100) /
BUSHEL_SOY_WHEAT) * exchRate;
break;
case "wheat":
pricePerSaca = (((quote * KGPERSACA) / 100) /
BUSHEL_SOY_WHEAT) * exchRate;
break;
case "corn":
pricePerSaca = (((quote * KGPERSACA) / 100) /
BUSHEL_CORN) * exchRate;
break;
}
}
public void getPricePerSaca()
{
System.out.printf("The price of %s \"Por Saca\" is:\nR$ %.2f ",
nameOfCommodity, pricePerSaca);
System.out.println();
System.out.println("Type \"false\" to exit.");
Scanner end = new Scanner(System.in);
boolean theEnd = end.hasNext();
}
}
public class PriceCMEExecute {
public static void main(String[] args)
{
PriceCME myPriceCME = new PriceCME();
myPriceCME.getNameOfComodity();
myPriceCME.getExchange_Rate();
myPriceCME.getQuote();
myPriceCME.CalculatePricePerSaca();
myPriceCME.getPricePerSaca();
}
}
How can this be happening? Could it be a problem with the JVM on her machine?
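One possibility worth checking, which the post itself does not confirm: Scanner.nextDouble() parses numbers according to the scanner's locale, which defaults to the machine's default locale. On a PC configured for Brazilian Portuguese the decimal separator is a comma, so "3.11" would be rejected with an InputMismatchException while "3,11" would be accepted. The sketch below assumes that this is the difference between the two machines; the class name ScannerLocaleDemo and the method parseRate are hypothetical.

```java
import java.util.InputMismatchException;
import java.util.Locale;
import java.util.Scanner;

public class ScannerLocaleDemo {

    /** Parses the first token of text as a double using the given locale's number format. */
    public static double parseRate(String text, Locale locale) {
        Scanner scanner = new Scanner(text).useLocale(locale);
        return scanner.nextDouble();
    }

    public static void main(String[] args) {
        Locale brazil = Locale.forLanguageTag("pt-BR");

        // A dot is the decimal separator under Locale.US.
        System.out.println(parseRate("3.11", Locale.US));

        // A comma is the decimal separator under pt-BR.
        System.out.println(parseRate("3,11", brazil));

        // The same text that works under Locale.US is rejected under pt-BR.
        try {
            parseRate("3.11", brazil);
        } catch (InputMismatchException e) {
            System.out.println("3.11 rejected under " + brazil);
        }
    }
}
```

If this is the cause, calling useLocale(Locale.US) on the scanners in PriceCME would make the program accept "3.11" regardless of the machine's regional settings.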