Apply LOOCV in Java, splitting with a specific condition - java

I have a CSV file containing 24231 rows. I would like to apply LOOCV based on the project name instead of on the individual observations of the whole dataset.
So if my dataset contains information for 15 projects, I would like the training set to be based on 14 projects and the test set on the remaining project.
I was relying on Weka's API; is there anything that automates this process?

For non-numeric attributes, Weka allows you to retrieve the unique values via Attribute.numValues() (how many there are) and Attribute.value(int) (the i-th value).
package weka;

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils;

public class LOOByValue {

  /**
   * 1st arg: ARFF file to load
   * 2nd arg: 0-based index in ARFF to use for class
   * 3rd arg: 0-based index in ARFF to use for LOO
   *
   * @param args the command-line arguments
   * @throws Exception if loading/processing of data fails
   */
  public static void main(String[] args) throws Exception {
    // load data
    Instances full = ConverterUtils.DataSource.read(args[0]);
    full.setClassIndex(Integer.parseInt(args[1]));
    int looCol = Integer.parseInt(args[2]);
    Attribute looAtt = full.attribute(looCol);
    if (looAtt.isNumeric())
      throw new IllegalStateException("Attribute cannot be numeric!");

    // iterate unique values to create train/test splits
    for (int i = 0; i < looAtt.numValues(); i++) {
      String value = looAtt.value(i);
      System.out.println("\n" + (i + 1) + "/" + looAtt.numValues() + ": " + value);
      Instances train = new Instances(full, full.numInstances());
      Instances test = new Instances(full, full.numInstances());
      for (int n = 0; n < full.numInstances(); n++) {
        Instance inst = full.instance(n);
        if (inst.stringValue(looCol).equals(value))
          test.add((Instance) inst.copy());
        else
          train.add((Instance) inst.copy());
      }
      train.compactify();
      test.compactify();

      // TODO do something with the data
      System.out.println("train size: " + train.numInstances());
      System.out.println("test size: " + test.numInstances());
    }
  }
}
With Weka's anneal UCI dataset and the surface-quality attribute for leave-one-out, you can generate something like this:
1/5: ?
train size: 654
test size: 244
2/5: D
train size: 843
test size: 55
3/5: E
train size: 588
test size: 310
4/5: F
train size: 838
test size: 60
5/5: G
train size: 669
test size: 229
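In place of the TODO you could, for example, train and evaluate a classifier on each leave-one-project-out split. A minimal sketch of what might go there (assuming J48; any other Weka classifier would work the same way):

      // hypothetical per-split evaluation; J48 is just an example classifier
      weka.classifiers.Classifier cls = new weka.classifiers.trees.J48();
      cls.buildClassifier(train);                        // train on the remaining projects
      weka.classifiers.Evaluation eval = new weka.classifiers.Evaluation(train);
      eval.evaluateModel(cls, test);                     // test on the held-out project
      System.out.println(eval.toSummaryString("=== Held out: " + value + " ===", false));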

Related

R wrapper for a Java method in a jar using rJava

I am trying to access a Java program, MELTING 5, from R using the rJava package.
I can do it with the system() function, as follows, using the batch file:
path <- "path/to/melting.bat"
sequence = "GTCGTATCCAGTGCAGGGTCCGAGGTATTCGCACTGGATACGACTTCCAC"
hybridisation.type = "dnadna"
OligomerConc = 5e-8
Sodium = 0.05
command=paste("-S", sequence,
"-H", hybridisation.type,
"-P", OligomerConc,
"-E", paste("Na=", Sodium, sep = ""))
system(paste("melting.bat", command))
I am trying to do the same using a wrapper, following the steps in helloJavaWorld, without any success.
.jaddClassPath('path/to/melting5.jar')
main <- .jnew("melting/Main")
out <- .jcall(obj = main, returnSig = "V", method = "main", .jarray(list(), "java/lang/String"),
argument = command)
The java code in melting/Main.java in the melting5.jar that I am trying to access is as follows.
package melting;
import java.text.NumberFormat;
import melting.configuration.OptionManagement;
import melting.configuration.RegisterMethods;
import melting.methodInterfaces.MeltingComputationMethod;
import melting.nearestNeighborModel.NearestNeighborMode;
/**
* The Melting main class which contains the public static void main(String[] args) method.
*/
public class Main {
// private static methods
/**
* Compute the entropy, enthalpy and the melting temperature and display the results.
* @param args contains the options entered by the user.
* @param optionManager the OptionManagement which allows the program to manage
* the different options entered by the user.
*/
private static ThermoResult runMelting(String [] args, OptionManagement optionManager){
try {
ThermoResult results =
getMeltingResults(args, optionManager);
displaysMeltingResults(results);
return results;
} catch (Exception e) {
OptionManagement.logError(e.getMessage());
return null;
}
}
/**
* Compute the entropy, enthalpy and melting temperature, and return
* these results.
* @param args options (entered by the user) that determine the
* sequence, hybridization type and other features of the
* environment.
* @param optionManager the {@link
* melting.configuration.OptionManagement
* <code>OptionManagement</code>} which
* allows the program to manage the different
* options entered by the user.
* @return The results of the Melting computation.
*/
public static ThermoResult getMeltingResults(String[] args,
OptionManagement optionManager)
{
NumberFormat format = NumberFormat.getInstance();
format.setMaximumFractionDigits(2);
// Set up the environment from the supplied arguments and get the
// results.
Environment environment = optionManager.createEnvironment(args);
RegisterMethods register = new RegisterMethods();
MeltingComputationMethod calculMethod =
register.getMeltingComputationMethod(environment.getOptions());
ThermoResult results = calculMethod.computesThermodynamics();
results.setCalculMethod(calculMethod);
environment.setResult(results);
// Apply corrections to the results.
results = calculMethod.getRegister().
computeOtherMeltingCorrections(environment);
environment.setResult(results);
return environment.getResult();
}
/**
* displays the results of Melting : the computed enthalpy and entropy (in cal/mol and J/mol), and the computed
* melting temperature (in degrees).
* @param results the ThermoResult containing the computed enthalpy, entropy and
* melting temperature
* @param calculMethod the melting computation method (approximative or nearest-neighbor computation)
*/
private static void displaysMeltingResults(ThermoResult results)
{
NumberFormat format = NumberFormat.getInstance();
format.setMaximumFractionDigits(2);
MeltingComputationMethod calculMethod =
results.getCalculMethod();
double enthalpy = results.getEnthalpy();
double entropy = results.getEntropy();
OptionManagement.logInfo("\n The MELTING results are : ");
if (calculMethod instanceof NearestNeighborMode){
OptionManagement.logInfo("Enthalpy : " + format.format(enthalpy) + " cal/mol ( " + format.format(results.getEnergyValueInJ(enthalpy)) + " J /mol)");
OptionManagement.logInfo("Entropy : " + format.format(entropy) + " cal/mol-K ( " + format.format(results.getEnergyValueInJ(entropy)) + " J /mol-K)");
}
OptionManagement.logInfo("Melting temperature : " + format.format(results.getTm()) + " degrees C.\n");
}
// public static main method
/**
* @param args contains the options entered by the user.
*/
public static void main(String[] args) {
OptionManagement optionManager = new OptionManagement();
if (args.length == 0){
optionManager.initialiseLogger();
optionManager.readMeltingHelp();
}
else if (optionManager.isMeltingInformationOption(args)){
try {
optionManager.readOptions(args);
} catch (Exception e) {
OptionManagement.logError(e.getMessage());
}
}
else {
runMelting(args, optionManager);
}
}
}
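For reference, the entry point the answer below drives from R is the public static getMeltingResults(String[], OptionManagement) method shown above; calling it directly from Java would look roughly like this sketch (option values copied from the question):

String[] opts = {
    "-S", "GTCGTATCCAGTGCAGGGTCCGAGGTATTCGCACTGGATACGACTTCCAC",
    "-H", "dnadna",
    "-P", "5e-8",
    "-E", "Na=0.05"
};
OptionManagement optionManager = new OptionManagement();
ThermoResult results = Main.getMeltingResults(opts, optionManager);
System.out.println("Tm: " + results.getTm() + " degrees C");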
How can I pass the arguments in command to public static void main in the Java jar?
Over at https://github.com/hrbrmstr/melting5jars I made a pkg wrapper for the MELTING 5 jar (melting5.jar) and also put the Data/ directory in it so you don't have to deal with jar-file management. It can be installed via devtools::install_github("hrbrmstr/melting5jars").
BEFORE you load that library, you need to set the NN_PATH since the Data/ dir is not where the jar expects it to be by default and you may run into issues setting it afterwards (YMMV).
NOTE: I don't work with this Java library and am not in your field, so please double check the results with the command-line you're used to running!
So, the first things to do to try to get this to work are:
Sys.setenv("NN_PATH"=system.file("extdata", "Data", package="melting5jars"))
library(melting5jars) # devtools::install_github("hrbrmstr/melting5jars")
Now, one of the cooler parts of rJava is that you get to work in R code if you want to, rather than in Java. We can recreate the core parts of that Main class right in R.
First, get a new melting.Main object and a new OptionManagement object just like the Java code does:
melting <- new(J("melting.Main"))
optionManager <- new(J("melting.configuration.OptionManagement"))
Next, we set up your options. I left Sodium the way it is just to ensure I didn't mess anything up.
Sodium <- 0.05
opts <- c(
"-S", "GTCGTATCCAGTGCAGGGTCCGAGGTATTCGCACTGGATACGACTTCCAC",
"-H", "dnadna",
"-P", 5e-8,
"-E", paste("Na=", Sodium, sep = "")
)
Now, we can call getMeltingResults() from that Main class directly:
results <- melting$getMeltingResults(opts, optionManager)
and then perform the same calls on those results:
calculMethod <- results$getCalculMethod()
enthalpy <- results$getEnthalpy()
entropy <- results$getEntropy()
if (.jinstanceof(calculMethod, J("melting.nearestNeighborModel.NearestNeighborMode"))) {
enthalpy <- results$getEnergyValueInJ(enthalpy)
entropy <- results$getEnergyValueInJ(entropy)
}
melting_temperature <- results$getTm()
enthalpy
## [1] -1705440
entropy
## [1] -4566.232
melting_temperature
## [1] 72.04301
We can wrap all that up into a function that will make it easier to call in the future:
get_melting_results <- function(opts = c()) {
stopifnot(length(opts) > 2) # a sanity check that could be improved
Sys.setenv("NN_PATH"=system.file("extdata", "Data", package="melting5jars"))
require(melting5jars)
melting <- new(J("melting.Main"))
optionManager <- new(J("melting.configuration.OptionManagement"))
results <- melting$getMeltingResults(opts, optionManager)
calculMethod <- results$getCalculMethod()
enthalpy_cal <- results$getEnthalpy()
entropy_cal <- results$getEntropy()
enthalpy_J <- entropy_J <- NULL
if (.jinstanceof(calculMethod, J("melting.nearestNeighborModel.NearestNeighborMode"))) {
enthalpy_J <- results$getEnergyValueInJ(enthalpy_cal)
entropy_J <- results$getEnergyValueInJ(entropy_cal)
}
melting_temp_C <- results$getTm()
list(
enthalpy_cal = enthalpy_cal,
entropy_cal = entropy_cal,
enthalpy_J = enthalpy_J,
entropy_J = entropy_J,
melting_temp_C = melting_temp_C
) -> out
class(out) <- c("melting_res")
out
}
That also has separate values for enthalpy and entropy depending on the method result.
We can also make a print helper function since we classed the list() we're returning:
print.melting_res <- function(x, ...) {
cat(
"The MELTING results are:\n\n",
" - Enthalpy: ", prettyNum(x$enthalpy_cal), " cal/mol",
{if (!is.null(x$enthalpy_J)) paste0(" (", prettyNum(x$enthalpy_J), " J /mol)", collapse="") else ""}, "\n",
" - Entropy: ", prettyNum(x$entropy_cal), " cal/mol-K",
{if (!is.null(x$entropy_J)) paste0(" (", prettyNum(x$entropy_J), " J /mol-K)", collapse="") else ""}, "\n",
" - Meltng temperature: ", prettyNum(x$melting_temp_C), " degress C\n",
sep=""
)
}
(I made an assumption you're used to seeing the MELTING 5 command line output)
And, finally, re-run the computation:
Sodium <- 0.05
opts <- c(
"-S", "GTCGTATCCAGTGCAGGGTCCGAGGTATTCGCACTGGATACGACTTCCAC",
"-H", "dnadna",
"-P", 5e-8,
"-E", paste("Na=", Sodium, sep = "")
)
res <- get_melting_results(opts)
res
## The MELTING results are:
##
## - Enthalpy: -408000 cal/mol (-1705440 J /mol)
## - Entropy: -1092.4 cal/mol-K (-4566.232 J /mol-K)
## - Melting temperature: 72.04301 degrees C
str(res)
## List of 5
## $ enthalpy_cal : num -408000
## $ entropy_cal : num -1092
## $ enthalpy_J : num -1705440
## $ entropy_J : num -4566
## $ melting_temp_C: num 72
## - attr(*, "class")= chr "melting_res"
You should be able to use the above methodology to wrap other components (if any) in the MELTING library.

Spark DataFrame java.lang.OutOfMemoryError: GC overhead limit exceeded on long loop run

I'm running a Spark application (Spark 1.6.3 cluster), which does some calculations on 2 small data sets, and writes the result into an S3 Parquet file.
Here is my code:
public void doWork(JavaSparkContext sc, Date writeStartDate, Date writeEndDate, String[] extraArgs) throws Exception {
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
S3Client s3Client = new S3Client(ConfigTestingUtils.getBasicAWSCredentials());
boolean clearOutputBeforeSaving = false;
if (extraArgs != null && extraArgs.length > 0) {
if (extraArgs[0].equals("clearOutput")) {
clearOutputBeforeSaving = true;
} else {
logger.warn("Unknown param " + extraArgs[0]);
}
}
Date currRunDate = new Date(writeStartDate.getTime());
while (currRunDate.getTime() < writeEndDate.getTime()) {
try {
SparkReader<FirstData> sparkReader = new SparkReader<>(sc);
JavaRDD<FirstData> data1 = sparkReader.readDataPoints(
inputDir,
currRunDate,
getMinOfEndDateAndNextDay(currRunDate, writeEndDate));
// Normalize to 1 hours & 0.25 degrees
JavaRDD<FirstData> distinctData1 = data1.distinct();
// Floor all (distinct) values to 6 hour windows
JavaRDD<FirstData> basicData1BySixHours = distinctData1.map(d1 -> new FirstData(
d1.getId(),
TimeUtils.floorTimePerSixHourWindow(d1.getTimeStamp()),
d1.getLatitude(),
d1.getLongitude()));
// Convert Data1 to Dataframes
DataFrame data1DF = sqlContext.createDataFrame(basicData1BySixHours, FirstData.class);
data1DF.registerTempTable("data1");
// Read Data2 DataFrame
String currDateString = TimeUtils.getSimpleDailyStringFromDate(currRunDate);
String inputS3Path = basedirInput + "/dt=" + currDateString;
DataFrame data2DF = sqlContext.read().parquet(inputS3Path);
data2DF.registerTempTable("data2");
// Join data1 and data2
DataFrame mergedDataDF = sqlContext.sql("SELECT D1.Id,D2.beaufort,COUNT(1) AS hours " +
"FROM data1 as D1,data2 as D2 " +
"WHERE D1.latitude=D2.latitude AND D1.longitude=D2.longitude AND D1.timeStamp=D2.dataTimestamp " +
"GROUP BY D1.Id,D1.timeStamp,D1.longitude,D1.latitude,D2.beaufort");
// Create histogram per ID
JavaPairRDD<String, Iterable<Row>> mergedDataRows = mergedDataDF.toJavaRDD().groupBy(md -> md.getAs("Id"));
JavaRDD<MergedHistogram> mergedHistogram = mergedDataRows.map(new MergedHistogramCreator());
logger.info("Number of data1 results: " + data1DF.select("lId").distinct().count());
logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
logger.info("Number of results with beaufort histograms: " + mergedDataDF.select("Id").distinct().count());
// Save to parquet
String outputS3Path = basedirOutput + "/dt=" + TimeUtils.getSimpleDailyStringFromDate(currRunDate);
if (clearOutputBeforeSaving) {
writeWithCleanup(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext, s3Client);
} else {
write(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext);
}
} finally {
TimeUtils.progressToNextDay(currRunDate);
}
}
}
public void write(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass, SQLContext sqlContext) {
// Apply a schema to an RDD of JavaBeans and save it as Parquet.
DataFrame fullDataDF = sqlContext.createDataFrame(outputRDD, outputClass);
fullDataDF.write().parquet(outputS3Path);
}
public void writeWithCleanup(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass,
SQLContext sqlContext, S3Client s3Client) {
String fileKey = S3Utils.getS3Key(outputS3Path);
String bucket = S3Utils.getS3Bucket(outputS3Path);
logger.info("Deleting existing dir: " + outputS3Path);
s3Client.deleteAll(bucket, fileKey);
write(outputS3Path, outputRDD, outputClass, sqlContext);
}
public Date getMinOfEndDateAndNextDay(Date startTime, Date proposedEndTime) {
long endOfDay = startTime.getTime() - startTime.getTime() % MILLIS_PER_DAY + MILLIS_PER_DAY ;
if (endOfDay < proposedEndTime.getTime()) {
return new Date(endOfDay);
}
return proposedEndTime;
}
The size of data1 is around 150,000 and data2 is around 500,000.
What my code basically does is some data manipulation: it merges the 2 data sets, does a bit more manipulation, prints some statistics and saves the result to Parquet.
The Spark cluster has 25GB of memory per server, and the code runs fine.
Each iteration takes about 2-3 minutes.
The problem starts when I run it on a large set of dates.
After a while, I get an OutOfMemoryError:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.List.$colon$colon$colon(List.scala:127)
at org.json4s.JsonDSL$JsonListAssoc.$tilde(JsonDSL.scala:98)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:139)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:72)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:164)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:38)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:87)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:71)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:70)
Last time it ran, it crashed after 233 iterations.
The line it crashed on was this:
logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
Can anyone tell me what the reason for these eventual crashes might be?
I'm not sure everyone will find this solution viable, but upgrading the Spark cluster to 2.2.0 seems to have resolved the issue.
I have run my application for several days now and have had no crashes yet.
This error occurs when GC takes up over 98% of the total execution time of the process. You can monitor the GC time in the Spark Web UI by going to the Stages tab at http://master:4040.
Try increasing the driver/executor memory (whichever is generating this error) using spark.{driver/executor}.memory via --conf when submitting the Spark application.
Another thing to try is to change the garbage collector that the JVM is using. Read this article for that: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html. It explains very clearly why the GC overhead error occurs and which garbage collector is best for your application.
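For example, a sketch of setting this programmatically (the values are placeholders; in client mode the driver memory normally has to be passed at submit time via --conf or --driver-memory rather than from code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class JobLauncher {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("histogram-job")                              // placeholder name
                .set("spark.executor.memory", "8g")                       // same effect as --conf spark.executor.memory=8g
                .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC");  // try the G1 collector, per the article above
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run the job ...
        sc.stop();
    }
}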

Weka output predictions

I've used the Weka GUI for training and testing a file (making predictions), but I can't do the same with the API. The error I'm getting says there's a different number of attributes in the train and test files. In the GUI, this can be solved by checking "Output predictions".
How can I do something similar using the API? Do you know of any samples out there?
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;
import weka.filters.unsupervised.attribute.Remove;
public class WekaTutorial
{
public static void main(String[] args) throws Exception
{
DataSource trainSource = new DataSource("/tmp/classes - edited.arff"); // training
Instances trainData = trainSource.getDataSet();
DataSource testSource = new DataSource("/tmp/classes_testing.arff");
Instances testData = testSource.getDataSet();
if (trainData.classIndex() == -1)
{
trainData.setClassIndex(trainData.numAttributes() - 1);
}
if (testData.classIndex() == -1)
{
testData.setClassIndex(testData.numAttributes() - 1);
}
String[] options = weka.core.Utils.splitOptions("weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 "
+ "-tokenizer \"weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"");
Remove remove = new Remove();
remove.setOptions(options);
remove.setInputFormat(trainData);
NominalToBinary filter = new NominalToBinary();
NaiveBayes nb = new NaiveBayes();
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(filter);
fc.setClassifier(nb);
// train and make predictions
fc.buildClassifier(trainData);
for (int i = 0; i < testData.numInstances(); i++)
{
double pred = fc.classifyInstance(testData.instance(i));
System.out.print("ID: " + testData.instance(i).value(0));
System.out.print(", actual: " + testData.classAttribute().value((int) testData.instance(i).classValue()));
System.out.println(", predicted: " + testData.classAttribute().value((int) pred));
}
}
}
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Src and Dest differ in # of attributes: 2 != 17152
This was not an issue for the GUI.
You need to ensure that the attributes (categories) in the train and test sets are compatible. Try to:
1. combine the train and test sets
2. preprocess them
3. save them as ARFF
4. open two empty files
5. copy the header (everything down to the "@data" line) into each
6. copy the training set into the first file and the test set into the second file
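Equivalently, you can make the two sets compatible in code with Weka's batch filtering: initialize a single filter on the training data and push both sets through that same instance, so train and test end up with an identical attribute structure. A minimal sketch, assuming a StringToWordVector filter and the trainData/testData variables from the question:

// build the word dictionary from the training data only ...
weka.filters.unsupervised.attribute.StringToWordVector stwv =
    new weka.filters.unsupervised.attribute.StringToWordVector();
stwv.setInputFormat(trainData);
// ... then run both sets through the same, already-initialized filter
weka.core.Instances newTrain = weka.filters.Filter.useFilter(trainData, stwv);
weka.core.Instances newTest = weka.filters.Filter.useFilter(testData, stwv);
// newTrain and newTest now share the same attributes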

Java exceptions on a Weka k-fold cross-validation program

I would like to perform a 10-fold cross-validation on my data, and I used the Weka Java program below. However, I ran into exceptions.
Here is the exception:
---Registering Weka Editors---
Trying to add database driver (JDBC): jdbc.idbDriver - Error, not in CLASSPATH?
Exception in thread "main" java.lang.IllegalArgumentException: No suitable converter found for ''!
at weka.core.converters.ConverterUtils$DataSource.<init>(ConverterUtils.java:137)
at weka.core.converters.ConverterUtils$DataSource.read(ConverterUtils.java:441)
at crossvalidationmultipleruns.CrossValidationMultipleRuns.main(CrossValidationMultipleRuns.java:45)
C:\Users\TomXavier\AppData\Local\NetBeans\Cache\8.1\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 1 second)
Here is the program I used:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.Utils;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import java.util.Random;
/**
* Performs a single run of cross-validation.
*
* Command-line parameters:
* <ul>
* <li>-t filename - the dataset to use</li>
* <li>-x int - the number of folds to use</li>
* <li>-s int - the seed for the random number generator</li>
* <li>-c int - the class index, "first" and "last" are accepted as well;
* "last" is used by default</li>
* <li>-W classifier - classname and options, enclosed by double quotes;
* the classifier to cross-validate</li>
* </ul>
*
* Example command-line:
* <pre>
* java CrossValidationSingleRun -t anneal.arff -c last -x 10 -s 1 -W "weka.classifiers.trees.J48 -C 0.25"
* </pre>
*
* @author FracPete (fracpete at waikato dot ac dot nz)
*/
public class CrossValidationSingleRun {
/**
* Performs the cross-validation. See Javadoc of class for information
* on command-line parameters.
*
* @param args the command-line parameters
* @throws Exception if something goes wrong
*/
public static void main(String[] args) throws Exception {
// loads data and set class index
Instances data = DataSource.read(Utils.getOption("C:/Users/TomXavier/Documents/MATLAB/total_data.arff", args));
String clsIndex = Utils.getOption("first", args);
if (clsIndex.length() == 0)
clsIndex = "last";
if (clsIndex.equals("first"))
data.setClassIndex(0);
else if (clsIndex.equals("last"))
data.setClassIndex(data.numAttributes() - 1);
else
data.setClassIndex(Integer.parseInt(clsIndex) - 1);
// classifier
String[] tmpOptions;
String classname;
tmpOptions = Utils.splitOptions(Utils.getOption("weka.classifiers.trees.J48", args));
classname = tmpOptions[0];
tmpOptions[0] = "";
Classifier cls = (Classifier) Utils.forName(Classifier.class, classname, tmpOptions);
// other options
int seed = Integer.parseInt(Utils.getOption("1", args));
int folds = Integer.parseInt(Utils.getOption("10", args));
// randomize data
Random rand = new Random(seed);
Instances randData = new Instances(data);
randData.randomize(rand);
if (randData.classAttribute().isNominal())
randData.stratify(folds);
// perform cross-validation
Evaluation eval = new Evaluation(randData);
for (int n = 0; n < folds; n++) {
Instances train = randData.trainCV(folds, n);
Instances test = randData.testCV(folds, n);
// the above code is used by the StratifiedRemoveFolds filter, the
// code below by the Explorer/Experimenter:
// Instances train = randData.trainCV(folds, n, rand);
// build and evaluate classifier
Classifier clsCopy = Classifier.makeCopy(cls);
clsCopy.buildClassifier(train);
eval.evaluateModel(clsCopy, test);
}
// output evaluation
System.out.println();
System.out.println("=== Setup ===");
System.out.println("Classifier: " + cls.getClass().getName() + " " + Utils.joinOptions(cls.getOptions()));
System.out.println("Dataset: " + data.relationName());
System.out.println("Folds: " + folds);
System.out.println("Seed: " + seed);
System.out.println();
System.out.println(eval.toSummaryString("=== " + folds + "-fold Cross-validation ===", false));
}
}
Is there any solution for this problem?
Many thanks!
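One thing worth checking: Utils.getOption expects an option flag name (such as "t" or "x") and looks its value up in args; passing a literal path or value makes it return an empty string, which would match the "No suitable converter found for ''" message. A sketch of how the options documented in the Javadoc above are normally read (not tested against your setup):

// read -t <filename>, -x <folds> and -s <seed> from the command line, with fallbacks
String dataFile = Utils.getOption("t", args);   // e.g. -t C:/Users/TomXavier/Documents/MATLAB/total_data.arff
Instances data = DataSource.read(dataFile);

String foldsStr = Utils.getOption("x", args);
int folds = (foldsStr.length() == 0) ? 10 : Integer.parseInt(foldsStr);

String seedStr = Utils.getOption("s", args);
int seed = (seedStr.length() == 0) ? 1 : Integer.parseInt(seedStr);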

Java: SimpleDateFormat timestamp not updating

Evening,
I'm trying to create a timestamp for when an entity is added to my PriorityQueue, using the following SimpleDateFormat pattern: [yyyy/MM/dd - hh:mm:ss a] (samples of the results below)
Nano-second precision NOT 100% necessary
1: 2012/03/09 - 09:58:36 PM
Do you know how I can maintain an 'elapsed time' timestamp that shows when customers have been added to the PriorityQueue?
In the Stack Overflow threads I've come across, most say to use System.nanoTime(), but I can't find resources online on how to work that into a SimpleDateFormat. I have also consulted with colleagues.
Also, I apologize for not using syntax highlighting (if S.O supports it)
Code excerpt [unused methods omitted]:
package grocerystoresimulation;
/*****************************************************************************
* @import
*/
import java.util.PriorityQueue;
import java.util.Random;
import java.util.ArrayList;
import java.util.Date;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
/************************************************************************************/
public class GroceryStoreSimulation {
/************************************************************************************
* @fields
*/
private PriorityQueue<Integer> pq = new PriorityQueue<Integer>();
private Random rand = new Random(); //instantiate new Random object
private Date date = new Date();
private DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd - hh:mm:ss a");
private ArrayList<String> timeStamp = new ArrayList<String>(); //store timestamps
private int customersServed; //# of customers served during simulation
/************************************************************************************
* @constructor
*/
public GroceryStoreSimulation(){
System.out.println("Instantiated new GroceryStoreSimulation # ["
+ dateFormat.format(date) + "]\n" + insertDivider());
//Program body
while(true){
try{
Thread.sleep(generateWaitTime());
newCustomer(customersServed);
} catch(InterruptedException e){/*Catch 'em all*/}
}
}
/************************************************************************************
* @param ID the customer ID
*/
private void newCustomer(int ID){
System.out.println("Customer # " + customersServed + " added to queue. . .");
pq.offer(ID); //insert element into PriorityQueue
customersServed++;
assignArrivalTime(ID); //call assignArrivalTime() method
} //newCustomer()
/************************************************************************************
* @param ID the customer ID
*/
private void assignArrivalTime(int ID){
timeStamp.add(ID + ": " + dateFormat.format(date));
System.out.println(timeStamp.get(customersServed-1));
} //assignArrivalTime()
/************************************************************************************
* @return int
*/
private int generateWaitTime(){
//Local variables
int Low = 1000; //1000ms
int High = 4000; //4000ms
int waitTime = rand.nextInt(High-Low) + Low;
System.out.println("Delaying for: " + waitTime);
return waitTime;
}
//***********************************************************************************
private static String insertDivider(){
return ("******************************************************************");
}
//***********************************************************************************
} //GroceryStoreSimulation
Problem:
Timestamp does not update, only represents initial runtime (see below)
Delaying by 1-4 seconds w/Thread.sleep(xxx) (pseudo-randomly generated)
Problem may be in the assignArrivalTime() method
Output:
run:
Instantiated new GroceryStoreSimulation @ [2012/03/09 - 09:58:36 PM]
******************************************************************
Delaying for: 1697
Customer # 0 added to queue. . .
0: 2012/03/09 - 09:58:36 PM
Delaying for: 3550
Customer # 1 added to queue. . .
1: 2012/03/09 - 09:58:36 PM
Delaying for: 2009
Customer # 2 added to queue. . .
2: 2012/03/09 - 09:58:36 PM
Delaying for: 1925
BUILD STOPPED (total time: 8 seconds)
Thank you for your assistance. I hope my question is clear enough and that I've followed your formatting guidelines sufficiently.
You have to use a new instance of Date every time to get the most recent timestamp.
private void assignArrivalTime(int ID){
timeStamp.add(ID + ": " + dateFormat.format(date));
------------------------------------------------^^^^
Try replacing date with new Date() in the line above.
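With that change, the method would look something like this (a sketch based on the code above):

private void assignArrivalTime(int ID){
    // format a fresh Date so each entry records the moment it was added,
    // not the moment the simulation object was constructed
    timeStamp.add(ID + ": " + dateFormat.format(new Date()));
    System.out.println(timeStamp.get(customersServed - 1));
} //assignArrivalTime()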
