"Convert" command line to java code for Weka - java

I am using this command line
java -cp weka.jar weka.classifiers.trees.RandomForest -T tdata.arff -l rndforrest.model -p 0 > data.out
But I want to do this in Java without using files; everything should happen on the fly. The model can be loaded once at the beginning, and instead of tdata.arff there should be a single data row for which I need the prediction (classification?).
Like this:
weka.classifiers.Classifier rndForrest = (weka.classifiers.Classifier)weka.core.SerializationHelper.read("rndforrest.model");
var dataInst = new weka.core.Instance(1, new double[] { 0, 9, -96, 62, 1, 200, 35, 1 });
double pred = rndForrest.classifyInstance(dataInst);
I get this error:
Instance doesn't have access to a dataset!
Thank you for your help.
edit: my code
Stopwatch sw = new Stopwatch();
var values = new double[] { 0, 9, -96, 62, 1, 200, 35, 0 };
weka.classifiers.Classifier rndForrest = (weka.classifiers.Classifier)weka.core.SerializationHelper.read("rndforrest.model");
var dataInst = new weka.core.Instance(1, values);
FastVector atts = new FastVector();
for (int i = 0; i < values.Length; i++) {
    atts.addElement(new weka.core.Attribute("att" + i));
}
weka.core.Instances data = new Instances("MyRelation", atts, 0);
data.setClassIndex(data.numAttributes() - 1);
double pred = rndForrest.classifyInstance(data.firstInstance());
Console.WriteLine("prediction is " + pred);

Well, the error says it, doesn't it?
Instance doesn't have access to a dataset!
The Javadoc for the constructor you use says:
public Instance(double weight, double[] attValues)
Constructor that initializes instance variable with given values. Reference to the dataset is set to null. (i.e. the instance doesn't have access to information about the attribute types)
Every Instance has to belong to a data set (Instances), because in Weka each value of an instance is stored as a double value. Additional information is needed to determine how to interpret that double value (e.g. as double, string, nominal, ...) and this information is provided through the data set.
You need to do something like:
FastVector atts = new FastVector();
// assuming all your eight attributes are numeric
for (int i = 1; i <= 8; i++) {
    atts.addElement(new Attribute("att" + i)); // numeric attribute
}
Instances data = new Instances("MyRelation", atts, 0);
(Also see Creating an ARFF file for additional examples on how to create attributes of a certain type)
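To make the point about stored doubles concrete, here is a minimal plain-Java sketch. It does not use Weka at all: `ToyAttribute` and `describe` are hypothetical names that only mimic the idea that a header decides how a stored double is read.

```java
import java.util.Arrays;
import java.util.List;

/** Toy stand-in for an attribute definition; this is NOT a Weka class. */
final class ToyAttribute {
    private final String name;
    private final List<String> labels; // null means "numeric"

    private ToyAttribute(String name, List<String> labels) {
        this.name = name;
        this.labels = labels;
    }

    static ToyAttribute numeric(String name) {
        return new ToyAttribute(name, null);
    }

    static ToyAttribute nominal(String name, String... labels) {
        return new ToyAttribute(name, Arrays.asList(labels));
    }

    /** Interprets a stored double according to this attribute's type. */
    String describe(double stored) {
        if (labels == null) {
            return name + "=" + stored;               // numeric: the double is the value itself
        }
        return name + "=" + labels.get((int) stored); // nominal: the double is an index into the labels
    }
}
```

The same stored 1.0 reads as the number 1.0 under a numeric attribute and as the second label under a nominal one, which is why an instance with no dataset attached cannot be interpreted.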


Deeplearning4j - how to iterate multiple DataSets for large data?

I'm studying Deeplearning4j (ver. 1.0.0-M1.1) for building neural networks.
I use the IrisClassifier example from Deeplearning4j, and it works fine:
//First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
RecordReader recordReader = new CSVRecordReader(numLinesToSkip,delimiter);
recordReader.initialize(new FileSplit(new File(DownloaderUtility.IRISDATA.Download(),"iris.txt")));
//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int labelIndex = 4; //5 values in each row of the iris.txt CSV: 4 input features followed by an integer label (class) index. Labels are the 5th value (index 4) in each row
int numClasses = 3; //3 classes (types of iris flowers) in the iris data set. Classes have integer values 0, 1 or 2
int batchSize = 150; //Iris data set: 150 examples total. We are loading all of them into one DataSet (not recommended for large data sets)
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
DataSet allData = iterator.next();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65); //Use 65% of data for training
DataSet trainingData = testAndTrain.getTrain();
DataSet testData = testAndTrain.getTest();
//We need to normalize our data. We'll use NormalizeStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(trainingData); //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(trainingData); //Apply normalization to the training data
normalizer.transform(testData); //Apply normalization to the test data. This is using statistics calculated from the *training* set
final int numInputs = 4;
int outputNum = 3;
long seed = 6;
log.info("Build model....");
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .activation(Activation.TANH)
    .updater(new Sgd(0.1))
    .list()
    .layer(new DenseLayer.Builder().nIn(numInputs).nOut(3)
        .build())
    .layer(new DenseLayer.Builder().nIn(3).nOut(3)
        .build())
    .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .activation(Activation.SOFTMAX) //Override the global TANH activation with softmax for this layer
        .nIn(3).nOut(outputNum).build())
    .build();
//run the model
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
//record score once every 100 iterations
model.setListeners(new ScoreIterationListener(100));
for (int i = 0; i < 1000; i++) {
    model.fit(trainingData);
}
//evaluate the model on the test set
Evaluation eval = new Evaluation(3);
INDArray output = model.output(testData.getFeatures());
eval.eval(testData.getLabels(), output);
For my project, I have about 30,000 input records (the iris example has 150).
Each record is a vector of about 7,000 values (the iris example has 4).
Obviously, I can't load all of the data into one DataSet; it would cause an OutOfMemoryError in the JVM.
How can I process the data in multiple DataSets?
I assume it should be something like this (store the DataSets in a List and iterate):
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
List<DataSet> trainingData = new ArrayList<>();
List<DataSet> testData = new ArrayList<>();
while (iterator.hasNext()) {
    DataSet allData = iterator.next();
    SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65); //Use 65% of data for training
    trainingData.add(testAndTrain.getTrain());
    testData.add(testAndTrain.getTest());
}
//We need to normalize our data. We'll use NormalizeStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
for (DataSet dataSetTraining : trainingData) {
    normalizer.fit(dataSetTraining); //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
    normalizer.transform(dataSetTraining); //Apply normalization to the training data
}
for (DataSet dataSetTest : testData) {
    normalizer.transform(dataSetTest); //Apply normalization to the test data. This is using statistics calculated from the *training* set
}
for (int i = 0; i < 1000; i++) {
    for (DataSet dataSetTraining : trainingData) {
        model.fit(dataSetTraining);
    }
}
But when I start the evaluation, I get this error:
Exception in thread "main" java.lang.NullPointerException: Cannot read field "javaShapeInformation" because "this.jvmShapeInfo" is null
at org.nd4j.linalg.api.ndarray.BaseNDArray.dataType(BaseNDArray.java:5507)
at org.nd4j.linalg.api.ndarray.BaseNDArray.validateNumericalArray(BaseNDArray.java:5575)
at org.nd4j.linalg.api.ndarray.BaseNDArray.add(BaseNDArray.java:3087)
at com.aarcapital.aarmlclassifier.classification.FAClassifierLearning.main(FAClassifierLearning.java:117)
The evaluation code:
Evaluation eval = new Evaluation(26);
INDArray output = new NDArray();
for (DataSet dataSetTest : testData) {
    output.add(model.output(dataSetTest.getFeatures())); // ERROR HERE
}
System.out.println("--- Output ---");
INDArray labels = new NDArray();
for (DataSet dataSetTest : testData) {
    labels.add(dataSetTest.getLabels());
}
System.out.println("--- Labels ---");
eval.eval(labels, output);
What is the correct way to iterate over multiple DataSets when training a network?
First, always use Nd4j.create(..) to create ndarrays.
Never instantiate the implementation class directly. Nd4j.create(..) lets you safely create ndarrays that work whether you run on CPUs or GPUs.
Second, always use the RecordReaderDataSetIterator's builder rather than its constructor. The constructor is very long and error-prone.
That is why we made the builder in the first place.
Your NullPointerException actually isn't coming from where you think it is. It's due to how you're creating the ndarray: there is no data type or any other metadata, so it can't know what to expect. Nd4j.create(..) will set up the ndarray properly for you.
Beyond that, you are doing things the right way. The record reader handles the batching for you.
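One consequence of letting the iterator do the batching is that predictions never have to be glued into a single array before scoring; a metric can be accumulated one batch at a time. The sketch below shows that pattern in plain Java, with no DL4J types. `BatchAccuracy`, `addBatch`, and `argMax` are hypothetical names for illustration; as far as I know, DL4J's own Evaluation object can be fed batch by batch in a similar way, but treat that as an assumption to check against the documentation.

```java
/** Accumulates classification accuracy across mini-batches (illustrative, not a DL4J class). */
final class BatchAccuracy {
    private long correct;
    private long total;

    /** Adds one batch: labels[i] is the true class index, scores[i] holds the per-class scores. */
    void addBatch(int[] labels, double[][] scores) {
        if (labels.length != scores.length) {
            throw new IllegalArgumentException("labels and scores must have the same number of rows");
        }
        for (int i = 0; i < labels.length; i++) {
            if (argMax(scores[i]) == labels[i]) {
                correct++;
            }
            total++;
        }
    }

    /** Accuracy over every batch added so far; NaN if nothing was added. */
    double accuracy() {
        return total == 0 ? Double.NaN : (double) correct / total;
    }

    /** Index of the largest score in a row (first one wins on ties). */
    static int argMax(double[] row) {
        int best = 0;
        for (int j = 1; j < row.length; j++) {
            if (row[j] > row[best]) {
                best = j;
            }
        }
        return best;
    }
}
```

Only the running counts stay in memory, so the size of the test set no longer matters.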

protobuf-net serialize/deserialize DateTime & Guid types

I have an issue reading values of the DateTime and Guid types that were serialized by protobuf-net.
I have a .NET client and a .NET server that communicate via protobuf-net. Now I have to add a Java client that talks to the same .NET server. I cannot change the server's existing communication logic, so I have to use the existing protobuf messages, and the issue is the following:
protobuf-net understands two .NET types, DateTime and Guid, but I'm not able to parse them with Google's protobuf:
.Net Server class example:
public class SomeClass
{
    public DateTime? CurrentDateTime { get; set; }
    public Guid CurrentGiud { get; set; }
}
Google's protobuf knows nothing about the DateTime and Guid types, so I can only get a byte[] from these fields. Example .proto:
message SomeClass
{
    bytes CurrentDateTime = 1;
    bytes CurrentGiud = 2;
}
So after deserializing a stream I get a byte[] from each of those fields, and now I need to convert it to the appropriate value. I need something like this:
var customDateTime = ConvertByteArrayToCustomDateTime(byteArray);
byte[] byteArray = ConvertCustomDateTimeToByteArray(customDateTime);
var customGuid = ConvertByteArrayToCustomGuid(byteArray);
byte[] byteArray = ConvertCustomGuidToByteArray(customGuid);
or this:
string strDateTime = ConvertByteArrayToStringDateTime(byteArray); //e.g. "13.08.2019 17:42:31"
byte[] byteArray = ConvertStringDateTimeToByteArray(strDateTime);
string strGuid = ConvertByteArrayToStringGuid(byteArray); // e.g. "{7bb7cdac-ebad-4acf-90ff-a5525be3caac}"
byte[] byteArray = ConvertStringGuidToByteArray(strGuid);
DateTime real example:
Example N1:
DateTime = 13.08.2019 17:42:31
after serialization/deserialization
byte[] = { 8, 142, 218, 151, 213, 11, 16, 3 }
Example N2:
DateTime = 25.06.2019 20:15:10
after serialization/deserialization
byte[] = { 8, 156, 131, 148, 209, 11, 16, 3 }
Guid real example:
Example N1:
Guid = {7bb7cdac-ebad-4acf-90ff-a5525be3caac}
after serialization/deserialization
byte[] = { 9, 172, 205, 183, 123, 173, 235, 207, 74, 17, 144, 255, 165, 82, 91, 227, 202, 172 }
Example N2:
Guid = {900246bb-3a7b-44d4-9b2f-1da035ca51f4}
after serialization/deserialization
byte[] = { 9, 187, 70, 2, 144, 123, 58, 212, 68, 17, 155, 47, 29, 160, 53, 202, 81, 244 }
Add the following messages to your .proto:
message CustomDateTime
{
    sint64 value = 1; // the offset (in units of the selected scale) from 1970/01/01
    CustomTimeSpanScale scale = 2; // the scale of the timespan [default = DAYS]
    CustomDateTimeKind kind = 3; // the kind of date/time being represented [default = UNSPECIFIED]
    enum CustomTimeSpanScale
    {
        DAYS = 0;
        HOURS = 1;
        MINUTES = 2;
        SECONDS = 3;
        MILLISECONDS = 4;
        TICKS = 5;
        MINMAX = 15; // dubious
    }
    enum CustomDateTimeKind
    {
        // The time represented is not specified as either local time or Coordinated Universal Time (UTC).
        UNSPECIFIED = 0;
        // The time represented is UTC.
        UTC = 1;
        // The time represented is local time.
        LOCAL = 2;
    }
}
message CustomGuid
{
    fixed64 lo = 1; // the first 8 bytes of the guid (note:crazy-endian)
    fixed64 hi = 2; // the second 8 bytes of the guid (note:crazy-endian)
}
Now your .proto class should look like this:
message SomeClass
{
    CustomDateTime CurrentDateTime = 1;
    CustomGuid CurrentGiud = 2;
}
Simple DateTime parser:
public static string ConvertCustomDateTimeToString(CustomDateTime customDateTime)
{
    var dateTime = new DateTime(1970, 1, 1); // culture-independent epoch
    if (customDateTime.Scale != CustomDateTime.Types.CustomTimeSpanScale.Seconds)
        throw new Exception("CustomDateTime supports only seconds");
    dateTime = dateTime.AddSeconds(customDateTime.Value);
    return dateTime.ToString();
}
public static CustomDateTime ConvertStringToCustomDateTime(string strDateTime)
{
    var defaultTime = new DateTime(1970, 1, 1); // culture-independent epoch
    var dateTime = DateTime.Parse(strDateTime);
    var customDateTime = new CustomDateTime
    {
        Kind = CustomDateTime.Types.CustomDateTimeKind.Unspecified,
        Scale = CustomDateTime.Types.CustomTimeSpanScale.Seconds,
        Value = (long) (dateTime - defaultTime).TotalSeconds
    };
    return customDateTime;
}
Simple Guid parser:
public static string ConvertCustomGuidToString(CustomGuid customGuid)
{
    var str = string.Empty;
    var array = BitConverter.GetBytes(customGuid.Lo);
    var newArray = new byte[8];
    newArray[0] = array[3];
    newArray[1] = array[2];
    newArray[2] = array[1];
    newArray[3] = array[0];
    newArray[4] = array[5];
    newArray[5] = array[4];
    newArray[6] = array[7];
    newArray[7] = array[6];
    str += BitConverter.ToString(newArray).Replace("-", "");
    str += BitConverter.ToString(BitConverter.GetBytes(customGuid.Hi)).Replace("-", "");
    return str;
}
public static CustomGuid ConvertStringToCustomGuid(string strGuid)
{
    strGuid = strGuid.Replace(" ", "");
    strGuid = strGuid.Replace("-", "");
    strGuid = strGuid.Replace("{", "");
    strGuid = strGuid.Replace("}", "");
    if (strGuid.Length != 32)
        throw new Exception("Wrong Guid format");
    byte[] array = new byte[16];
    for (int i = 0; i < 32; i += 2)
        array[i / 2] = Convert.ToByte(strGuid.Substring(i, 2), 16);
    var newArrayLo = new byte[8];
    newArrayLo[0] = array[3];
    newArrayLo[1] = array[2];
    newArrayLo[2] = array[1];
    newArrayLo[3] = array[0];
    newArrayLo[4] = array[5];
    newArrayLo[5] = array[4];
    newArrayLo[6] = array[7];
    newArrayLo[7] = array[6];
    var newArrayHi = new byte[8];
    newArrayHi[0] = array[8];
    newArrayHi[1] = array[9];
    newArrayHi[2] = array[10];
    newArrayHi[3] = array[11];
    newArrayHi[4] = array[12];
    newArrayHi[5] = array[13];
    newArrayHi[6] = array[14];
    newArrayHi[7] = array[15];
    var customGuid = new CustomGuid
    {
        Lo = BitConverter.ToUInt64(newArrayLo, 0),
        Hi = BitConverter.ToUInt64(newArrayHi, 0)
    };
    return customGuid;
}
We need to talk about the two types separately. There's back-story to each!
DateTime / TimeSpan - so: way back in history, .NET folks kept wanting protobuf-net to round-trip DateTime / TimeSpan. There was nothing defined by Google for this kind of purpose, so protobuf-net made something up. The details are in bcl.proto, but I don't recommend worrying about that. The short version would be: "they're kinda awkward to work with if you're not protobuf-net".
Roll forward 5+ years, and Google finally defined the well-known Duration and Timestamp types. Unfortunately, they're not 1:1 matches for how protobuf-net decided to implement them, and I can't change the default layout without breaking existing consumers. But! For new code, or for cross-platform purposes, protobuf-net knows how to talk Duration / Timestamp, and if it is even remotely possible, I strongly recommend changing your layout. The good news is: this is really simple:
[ProtoMember(1, DataFormat = DataFormat.WellKnown)]
public DateTime? CurrentDateTime { get; set; }
This will now use .google.protobuf.Timestamp instead of .bcl.DateTime; this also works with TimeSpan / .google.protobuf.Duration.
The key point here: there is a simple option that you can switch to that will make this "just work"; the default is for compatibility with something that protobuf-net had to invent before Google had decided on a layout.
Note that changing to DataFormat.WellKnown is a data breaking change; the layout is different. If there was a way to automatically detect and compensate, it already would; there isn't.
Guid - this should have been much simpler; the sensible idea here would have been to just serialize it as bytes in .proto terms, but ... and I regret this, I did a stupid and tried to do something clever. It backfired and I regret it. What it does is ... kinda silly, although it does make sense internally. It accesses the Guid as two consecutive fixed64 fields (again, look in bcl.proto) where these are the low/high bytes in Microsoft's crazy-endian layout. By crazy-endian, I mean where the guid 00112233-4455-6677-8899-AABBCCDDEEFF is represented by the bytes 33-22-11-00-55-44-77-66-88-99-AA-BB-CC-DD-EE-FF (emphasis: this bit isn't me; this is what Microsoft and .NET do internally with guids). So, to take your N1 example, the two half fragments you're seeing there are:
0x09 = "field 1, type fixed64"
0x AC-CD-B7-7B-AD-EB-CF-4A - first half in crazy-endian
0x11 (decimal 17) = "field 2, type fixed64"
0x 90-FF-A5-52-5B-E3-CA-AC - second half in crazy-endian
Frankly, for cross-platform work, I would suggest for Guid, expose it as string or byte[] instead, and accept my sincere apologies for the inconvenience!
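For a Java client that cannot change the layout, the two fixed64 values can be turned into a java.util.UUID by undoing the byte order described above. This is a minimal sketch, assuming the fields arrive as two long values (which is how a fixed64 is normally surfaced in Java); `BclGuid`, `toUuid`, and `readFixed64` are hypothetical names introduced here, and `readFixed64` exists only so the sample bytes from the question can be checked.

```java
import java.util.UUID;

/** Illustrative helper for the two-fixed64 Guid layout; not part of any library. */
final class BclGuid {

    /** Builds a UUID from the lo/hi fixed64 values of the Guid message. */
    static UUID toUuid(long lo, long hi) {
        // lo carries Guid bytes 0..7, which .NET stores as little-endian int32, int16, int16.
        long a = lo & 0xFFFFFFFFL;
        long b = (lo >>> 32) & 0xFFFFL;
        long c = (lo >>> 48) & 0xFFFFL;
        long mostSignificant = (a << 32) | (b << 16) | c;
        // hi carries Guid bytes 8..15 in plain order; the little-endian read reversed them.
        long leastSignificant = Long.reverseBytes(hi);
        return new UUID(mostSignificant, leastSignificant);
    }

    /** Reads eight little-endian bytes starting at offset, the way a fixed64 is decoded. */
    static long readFixed64(byte[] buf, int offset) {
        long value = 0;
        for (int i = 7; i >= 0; i--) {
            value = (value << 8) | (buf[offset + i] & 0xFFL);
        }
        return value;
    }
}
```

With the N1 bytes from the question (skipping the tag bytes 9 and 17), this yields 7bb7cdac-ebad-4acf-90ff-a5525be3caac.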
In both cases: if you can't change the layout, look in bcl.proto for what is actually happening. If you use Serializer.GetProto<T>(), it should generate a schema that imports bcl.proto automatically.
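If the layout has to stay as it is, the DateTime payload can also be decoded by hand on the Java side. The sketch below rests on some assumptions: the scale codes 0 to 5 follow protobuf-net's bcl.proto as I understand it (check them against your copy), the kind field is ignored, and `BclDateTime` with its two methods is a hypothetical helper. The `parse` method handles only the simple case where every field is a single-byte tag followed by a varint, which is what the sample bytes in the question look like.

```java
import java.time.Duration;
import java.time.LocalDateTime;

/** Illustrative decoder for the value/scale date-time layout; not part of any library. */
final class BclDateTime {
    private static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0, 0);

    /** Converts a value/scale pair into a date-time; the kind field is ignored here. */
    static LocalDateTime toLocalDateTime(long value, int scale) {
        switch (scale) {
            case 0: return EPOCH.plusDays(value);
            case 1: return EPOCH.plusHours(value);
            case 2: return EPOCH.plusMinutes(value);
            case 3: return EPOCH.plusSeconds(value);
            case 4: return EPOCH.plus(Duration.ofMillis(value));
            case 5: // .NET ticks are 100 ns each
                return EPOCH.plus(Duration.ofSeconds(value / 10_000_000L, (value % 10_000_000L) * 100L));
            default:
                throw new IllegalArgumentException("Unsupported scale: " + scale);
        }
    }

    /** Parses raw message bytes, assuming one-byte tags and varint-encoded fields only. */
    static LocalDateTime parse(byte[] buf) {
        long value = 0;
        int scale = 0; // DAYS is the default when the field is absent
        int pos = 0;
        while (pos < buf.length) {
            int field = (buf[pos++] & 0xFF) >>> 3;
            long raw = 0;
            int shift = 0;
            int b;
            do { // base-128 varint, least significant group first
                b = buf[pos++] & 0xFF;
                raw |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            if (field == 1) {
                value = (raw >>> 1) ^ -(raw & 1); // zigzag decoding for sint64
            } else if (field == 2) {
                scale = (int) raw;
            }
        }
        return toLocalDateTime(value, scale);
    }
}
```

Fed the N1 bytes from the question, `parse` returns 2019-08-13T17:42:31, which matches the original DateTime. With generated classes for the message, only `toLocalDateTime` would be needed, since the protobuf runtime already performs the varint and zigzag steps.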

RandomForest with Weka in Java

I am working on a project and need an example of how to use RandomForest in Java with Weka. I did it with IBk and it worked, but when I do the same with RandomForest, it does not work.
Does anyone have a simple example of how to use RandomForest and how to get the probability for each class? With IBk I used the classifier.distributionForInstance(instance) function, which returned the probability for each class. How can I do this for RandomForest? Will I need to get the probabilities from every tree and combine them?
ConverterUtils.DataSource source = new ConverterUtils.DataSource("..../edit.arff");
Instances dataset = source.getDataSet();
dataset.setClassIndex(dataset.numAttributes() - 1);
IBk classifier = new IBk(5);
classifier.buildClassifier(dataset);
Instance instance = new SparseInstance(2);
instance.setValue(0, 65); //example data
instance.setValue(1, 120); //example data
double[] prediction = classifier.distributionForInstance(instance);
//now I get the probability for the first class
System.out.println("Prediction for the first class is: "+prediction[0]);
You can calculate the information gain while building the model with RandomForest. It is much slower and requires a lot of memory while building the model. I am not so sure about the documentation. You can add options or set values while building the model.
//numFolds is the number of cross-validation folds, usually between 2 and 10
//br is your BufferedReader
Instances trainData = new Instances(br);
trainData.setClassIndex(trainData.numAttributes() - 1);
RandomForest rf = new RandomForest();
//You can set the options here
String[] options = new String[2];
options[0] = "-R";
weka.filters.supervised.attribute.AttributeSelection as = new weka.filters.supervised.attribute.AttributeSelection();
Ranker ranker = new Ranker();
InfoGainAttributeEval infoGainAttrEval = new InfoGainAttributeEval();
as.setEvaluator(infoGainAttrEval);
as.setSearch(ranker);
as.setInputFormat(trainData);
trainData = Filter.useFilter(trainData, as);
Evaluation evaluation = new Evaluation(trainData);
evaluation.crossValidateModel(rf, trainData, numFolds, new Random(1));
// Using HashMap to store the infogain values of the attributes
int count = 0;
Map<String, Double> infogainscores = new HashMap<String, Double>();
for (int i = 0; i < trainData.numAttributes(); i++) {
    String t_attr = trainData.attribute(i).name();
    double infogain = infoGainAttrEval.evaluateAttribute(i);
    if (infogain != 0) {
        //System.out.println(t_attr + "= "+ infogain);
        infogainscores.put(t_attr, infogain);
        count = count + 1;
    }
}
//iterating over the hashmap
Iterator it = infogainscores.entrySet().iterator();
while (it.hasNext()) {
    Map.Entry pair = (Map.Entry) it.next();
    System.out.println(pair.getKey() + " = " + pair.getValue());
    it.remove(); // avoids a ConcurrentModificationException
}
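For readers unsure what the info-gain score measures, it is the class entropy minus the class entropy that remains once the attribute value is known: H(Class) - H(Class | Attribute). The plain-Java sketch below computes that quantity for nominal values encoded as integers. It is an illustration of the formula only, not Weka's implementation, and `InfoGain`, `entropy`, and `infoGain` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative information-gain calculation for integer-coded nominal data. */
final class InfoGain {

    /** Shannon entropy, in bits, of a list of class labels. */
    static double entropy(List<Integer> labels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** H(class) minus the attribute-weighted entropy of the class within each attribute value. */
    static double infoGain(int[] attr, int[] cls) {
        List<Integer> all = new ArrayList<>();
        Map<Integer, List<Integer>> byValue = new HashMap<>();
        for (int i = 0; i < attr.length; i++) {
            all.add(cls[i]);
            byValue.computeIfAbsent(attr[i], k -> new ArrayList<>()).add(cls[i]);
        }
        double conditional = 0.0;
        for (List<Integer> subset : byValue.values()) {
            conditional += (double) subset.size() / attr.length * entropy(subset);
        }
        return entropy(all) - conditional;
    }
}
```

An attribute that splits a balanced two-class sample perfectly scores 1 bit, while an attribute that is independent of the class scores 0, which is why the loop above skips attributes whose score is zero.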

Likelihood Ratio Java

I'm searching for a library or an example of how to implement a likelihood ratio test in Java, like the one in MATLAB.
I have two different vectors of double values and want to obtain a scalar value.
Every value corresponds to a feature for my machine learning algorithm, so the first vector is the training pattern and the second one is a test pattern.
Could you please help me?
In MATLAB I just divide the two matrices, like LR = test_matrix/training_matrix
I've tried Apache Mahout, but I'm not sure I'm using it correctly.
Here is the code:
FastByIDMap<FastByIDMap<Long>> timestamps = new FastByIDMap<>();
Collection<Preference> prefs = new ArrayList<>(2);
FastByIDMap<Collection<Preference>> data = new FastByIDMap<>(); // Preferences for user 0
Preference newPrefs = new GenericPreference(0, 0, (float) -0.5);
Preference pref = new GenericPreference(0, 1, 50);
Preference pref2 = new GenericPreference(0, 2, 51);
prefs.add(newPrefs);
prefs.add(pref);
prefs.add(pref2);
data.put(0, prefs);
Collection<Preference> prefs_1 = new ArrayList<>(2); // Preferences for user 1
newPrefs = new GenericPreference(1, 0, (float) -0.5);
pref = new GenericPreference(1, 1, 50);
pref2 = new GenericPreference(1, 2, 51);
prefs_1.add(newPrefs);
prefs_1.add(pref);
prefs_1.add(pref2);
data.put(1, prefs_1);
GenericDataModel model = new GenericDataModel(GenericDataModel.toDataMap(data, true), timestamps);
FastByIDMap<PreferenceArray> us = model.getRawUserData();
System.out.println("us:"+ us.toString());
LogLikelihoodSimilarity l = new LogLikelihoodSimilarity(model);
System.out.println(l.userSimilarity(0, 1));
In this case, the user similarity always returns 0.
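A possible reason for the constant 0, offered as a sketch under assumptions rather than a confirmed diagnosis: as far as I recall, Mahout's log-likelihood similarity looks only at which items two users have in common (not at the preference values) and maps Dunning's log-likelihood ratio over a 2x2 table of counts to a similarity with 1 - 1/(1 + LLR). The plain-Java code below implements that ratio; `Llr`, `logLikelihoodRatio`, and `similarity` are hypothetical names, and none of this is Mahout's actual source.

```java
/** Illustrative log-likelihood ratio over a 2x2 table of counts. */
final class Llr {

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy: xLogX of the total minus the sum of xLogX of the parts. */
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    /**
     * k11: items both users have, k12: items only the second has,
     * k21: items only the first has, k22: items neither has.
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    /** Maps the ratio into the range [0, 1). */
    static double similarity(long k11, long k12, long k21, long k22) {
        return 1.0 - 1.0 / (1.0 + logLikelihoodRatio(k11, k12, k21, k22));
    }
}
```

Two users who have both rated all three items of a three-item data set give the counts (3, 0, 0, 0). The ratio for that table is 0, so the similarity is 0 regardless of the preference values, which would match the behavior described above if the assumption about Mahout holds.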

Predicting data created on-the-fly in WEKA using a premade model file

I want to write a Java program with WEKA that builds new data in memory and feeds it to a model that was previously built with the GUI version.
Here is the program:
import java.util.ArrayList;
import weka.classifiers.Classifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.Utils;
public class UseModelWithData {
public static void main(String[] args) throws Exception {
// load model
String rootPath = "G:/";
Classifier classifier = (Classifier) weka.core.SerializationHelper.read(rootPath+"j48.model");
// create instances
Attribute attr1 = new Attribute("age");
Attribute attr2 = new Attribute("menopause");
Attribute attr3 = new Attribute("tumor-size");
Attribute attr4 = new Attribute("inv-nodes");
Attribute attr5 = new Attribute("node-caps");
Attribute attr6 = new Attribute("deg-malig");
Attribute attr7 = new Attribute("breast");
Attribute attr8 = new Attribute("breast-quad");
Attribute attr9 = new Attribute("irradiat");
Attribute attr10 = new Attribute("Class");
ArrayList<Attribute> attributes = new ArrayList<Attribute>();
// predict instance class values
Instances testing = new Instances("Test dataset", attributes, 0);
// add data
double[] values = new double[testing.numAttributes()];
values[0] = testing.attribute(0).addStringValue("60-69");
values[1] = testing.attribute(1).addStringValue("ge40");
values[2] = testing.attribute(2).addStringValue("10-14");
values[3] = testing.attribute(3).addStringValue("15-17");
values[4] = testing.attribute(4).addStringValue("yes");
values[5] = testing.attribute(5).addStringValue("2");
values[6] = testing.attribute(6).addStringValue("right");
values[7] = testing.attribute(7).addStringValue("right_up");
values[8] = testing.attribute(0).addStringValue("yes");
values[9] = Utils.missingValue();
// add data to instance
testing.add(new DenseInstance(1.0, values));
// instance row to predict
int index = 10;
// perform prediction
double myValue = classifier.classifyInstance(testing.instance(10));
// get the name of class value
String prediction = testing.classAttribute().value((int) myValue);
System.out.println("The predicted value of the instance ["
        + Integer.toString(index) + "]: " + prediction);
    }
}
My references include:
Using a premade WEKA model in Java
the WEKA Manual provided in the 3.7.10 version - 17.3 Creating datasets in memory
Creating a single instance for classification in WEKA
So far the part where I create a new Instance inside the script causes the following error:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 10, Size: 1
in the line
double myValue = classifier.classifyInstance(testing.instance(10));
I just want to apply a premade WEKA model to the latest row of instance values. How do I solve this?
Program file
Arff file
You have the error because you are trying to access the 11th instance and have only created one.
If you always want to access the last instance you might try the following:
double myValue = classifier.classifyInstance(testing.lastInstance());
Additionally, I don't believe that you are creating the instances you hope for. After looking at your provided ".arff" file, which I believe you are trying to mimic, I think you should proceed making instances as follows:
FastVector atts;
FastVector attAge;
Instances testing;
double[] vals;
// 1. set up attributes
atts = new FastVector();
attAge = new FastVector();
attAge.addElement("10-19"); // declare every nominal value from the ARFF header the same way
atts.addElement(new Attribute("age", attAge));
// 2. create Instances object
testing = new Instances("breast-cancer", atts, 0);
// 3. fill with data
vals = new double[testing.numAttributes()];
vals[0] = attAge.indexOf("10-19");
testing.add(new DenseInstance(1.0, vals));
// 4. output data
System.out.println(testing);
Of course I did not create the whole dataset, but the technique would be the same.
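One detail of the indexOf step is worth guarding: if a label was never declared for the attribute, indexOf returns -1 and a meaningless value ends up in the row. The plain-Java sketch below shows a small check for that. `NominalEncoder` and `encode` are hypothetical names, a java.util.List stands in for the FastVector of declared values, and the labels in the usage are illustrative only.

```java
import java.util.List;

/** Illustrative helper that turns a nominal label into its stored index; not a Weka class. */
final class NominalEncoder {

    /** Returns the position of label among the declared values, or throws if it is not declared. */
    static double encode(List<String> declaredValues, String label) {
        int index = declaredValues.indexOf(label);
        if (index < 0) {
            throw new IllegalArgumentException("Label not declared for this attribute: " + label);
        }
        return index;
    }
}
```

For declared values "10-19", "20-29", "30-39", the label "20-29" encodes to 1.0, while an undeclared label such as "teen" fails immediately instead of silently storing -1.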

