I am working on a project and I need some examples of how to implement RandomForest in Java with Weka. I did it with IBk() and it worked. If I do the same thing with RandomForest, it does not work.
Does anyone have a simple example of how to implement RandomForest and how to get the probability for each class? (With IBk I used the classifier.distributionForInstance(instance) method and it returned the probabilities for each class.) How can I do that for RandomForest? Will I need to get the probability from every tree and combine them?
//example
ConverterUtils.DataSource source = new ConverterUtils.DataSource("..../edit.arff");
Instances dataset = source.getDataSet();
dataset.setClassIndex(dataset.numAttributes() - 1);
IBk classifier = new IBk(5);
classifier.buildClassifier(dataset);
Instance instance = new SparseInstance(2);
instance.setValue(0, 65); //example data
instance.setValue(1, 120); //example data
double[] prediction = classifier.distributionForInstance(instance);
//now I get the probability for the first class
System.out.println("Prediction for the first class is: "+prediction[0]);
You can calculate the information gain while building the model with RandomForest. It is much slower and requires a lot of memory while building the model. I am not so sure about the documentation, but you can add options or call the setters while building the model.
//numFolds is the number of cross-validation folds, usually between 2 and 10
//br is your BufferedReader
Instances trainData = new Instances(br);
trainData.setClassIndex(trainData.numAttributes() - 1);
RandomForest rf = new RandomForest();
rf.setNumTrees(50);
//You can set the options here
String[] options = new String[1];
options[0] = "-R";
rf.setOptions(options);
rf.buildClassifier(trainData);
weka.filters.supervised.attribute.AttributeSelection as = new weka.filters.supervised.attribute.AttributeSelection();
Ranker ranker = new Ranker();
InfoGainAttributeEval infoGainAttrEval = new InfoGainAttributeEval();
as.setEvaluator(infoGainAttrEval);
as.setSearch(ranker);
as.setInputFormat(trainData);
trainData = Filter.useFilter(trainData, as);
Evaluation evaluation = new Evaluation(trainData);
evaluation.crossValidateModel(rf, trainData, numFolds, new Random(1));
// Using HashMap to store the infogain values of the attributes
int count = 0;
Map<String, Double> infogainscores = new HashMap<String, Double>();
for (int i = 0; i < trainData.numAttributes(); i++) {
String t_attr = trainData.attribute(i).name();
//System.out.println(i+trainData.attribute(i).name());
double infogain = infoGainAttrEval.evaluateAttribute(i);
if(infogain != 0){
//System.out.println(t_attr + "= "+ infogain);
infogainscores.put(t_attr, infogain);
count = count+1;
}
}
//iterating over the hashmap
Iterator<Map.Entry<String, Double>> it = infogainscores.entrySet().iterator();
while (it.hasNext()) {
Map.Entry<String, Double> pair = it.next();
System.out.println(pair.getKey() + " = " + pair.getValue());
it.remove(); // avoids a ConcurrentModificationException
}
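As for the probability question itself: RandomForest is a weka.classifiers.Classifier just like IBk, so you do not need to combine the trees yourself; distributionForInstance(instance) already averages the class distributions over all trees. A minimal sketch, assuming rf was built as above and instance is an Instance compatible with trainData:
double[] distribution = rf.distributionForInstance(instance);
for (int i = 0; i < distribution.length; i++) {
System.out.println("Probability of class " + trainData.classAttribute().value(i) + ": " + distribution[i]);
}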
I'm studying Deeplearning4j (ver. 1.0.0-M1.1) for building neural networks.
I use the IrisClassifier from Deeplearning4j as an example, and it works fine:
//First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
RecordReader recordReader = new CSVRecordReader(numLinesToSkip,delimiter);
recordReader.initialize(new FileSplit(new File(DownloaderUtility.IRISDATA.Download(),"iris.txt")));
//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int labelIndex = 4; //5 values in each row of the iris.txt CSV: 4 input features followed by an integer label (class) index. Labels are the 5th value (index 4) in each row
int numClasses = 3; //3 classes (types of iris flowers) in the iris data set. Classes have integer values 0, 1 or 2
int batchSize = 150; //Iris data set: 150 examples total. We are loading all of them into one DataSet (not recommended for large data sets)
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
DataSet allData = iterator.next();
allData.shuffle();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65); //Use 65% of data for training
DataSet trainingData = testAndTrain.getTrain();
DataSet testData = testAndTrain.getTest();
//We need to normalize our data. We'll use NormalizerStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(trainingData); //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(trainingData); //Apply normalization to the training data
normalizer.transform(testData); //Apply normalization to the test data. This is using statistics calculated from the *training* set
final int numInputs = 4;
int outputNum = 3;
long seed = 6;
log.info("Build model....");
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(seed)
.activation(Activation.TANH)
.weightInit(WeightInit.XAVIER)
.updater(new Sgd(0.1))
.l2(1e-4)
.list()
.layer(new DenseLayer.Builder().nIn(numInputs).nOut(3)
.build())
.layer(new DenseLayer.Builder().nIn(3).nOut(3)
.build())
.layer( new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
.activation(Activation.SOFTMAX) //Override the global TANH activation with softmax for this layer
.nIn(3).nOut(outputNum).build())
.build();
//run the model
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
//record score once every 100 iterations
model.setListeners(new ScoreIterationListener(100));
for(int i=0; i<1000; i++ ) {
model.fit(trainingData);
}
//evaluate the model on the test set
Evaluation eval = new Evaluation(3);
INDArray output = model.output(testData.getFeatures());
eval.eval(testData.getLabels(), output);
log.info(eval.stats());
For my project, I have ~30000 input records (the iris example has 150).
Each record is a vector of size ~7000 (in the iris example: 4).
Obviously, I can't process the whole data in one DataSet - it will produce an OOM in the JVM.
How can I process the data in multiple DataSets?
I assume it should be something like this (store the DataSets in a List and iterate):
...
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
List<DataSet> trainingData = new ArrayList<>();
List<DataSet> testData = new ArrayList<>();
while (iterator.hasNext()) {
DataSet allData = iterator.next();
allData.shuffle();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65); //Use 65% of data for training
trainingData.add(testAndTrain.getTrain());
testData.add(testAndTrain.getTest());
}
//We need to normalize our data. We'll use NormalizerStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
for (DataSet dataSetTraining : trainingData) {
normalizer.fit(dataSetTraining); //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(dataSetTraining); //Apply normalization to the training data
}
for (DataSet dataSetTest : testData) {
normalizer.transform(dataSetTest); //Apply normalization to the test data. This is using statistics calculated from the *training* set
}
...
for(int i=0; i<1000; i++ ) {
for (DataSet dataSetTraining : trainingData) {
model.fit(dataSetTraining);
}
}
But when I start the evaluation, I get this error:
Exception in thread "main" java.lang.NullPointerException: Cannot read field "javaShapeInformation" because "this.jvmShapeInfo" is null
at org.nd4j.linalg.api.ndarray.BaseNDArray.dataType(BaseNDArray.java:5507)
at org.nd4j.linalg.api.ndarray.BaseNDArray.validateNumericalArray(BaseNDArray.java:5575)
at org.nd4j.linalg.api.ndarray.BaseNDArray.add(BaseNDArray.java:3087)
at com.aarcapital.aarmlclassifier.classification.FAClassifierLearning.main(FAClassifierLearning.java:117)
...
Evaluation eval = new Evaluation(26);
INDArray output = new NDArray();
for (DataSet dataSetTest : testData) {
output.add(model.output(dataSetTest.getFeatures())); // ERROR HERE
}
System.out.println("--- Output ---");
System.out.println(output);
INDArray labels = new NDArray();
for (DataSet dataSetTest : testData) {
labels.add(dataSetTest.getLabels());
}
System.out.println("--- Labels ---");
System.out.println(labels);
eval.eval(labels, output);
log.info(eval.stats());
What is the correct way to iterate over multiple DataSets when training the network?
Thanks!
Firstly, always use Nd4j.create(..) for ndarrays.
Never instantiate an implementation like NDArray directly. Nd4j.create(..) lets you safely create ndarrays that will work whether you use CPUs or GPUs.
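For example, a sketch (numExamples and numClasses here are placeholders for whatever shape your outputs need):
INDArray output = Nd4j.create(DataType.FLOAT, numExamples, numClasses);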
Second: always use the RecordReaderDataSetIterator's builder rather than the constructor. The constructor is very long and error prone;
that is why we made the builder in the first place.
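A sketch with the variables from your code:
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize)
.classification(labelIndex, numClasses)
.build();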
Your NullPointerException actually isn't coming from where you think it is. It's due to how you're creating the ndarray: there's no data type or anything, so it can't know what to expect. Nd4j.create(..) will properly set up the ndarray for you.
Beyond that, you are doing things the right way. The record reader handles the batching for you.
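For the evaluation itself, here is a sketch of how I would iterate the test DataSets, reusing model, testData, and the label count from your code. Evaluation accumulates statistics across calls to eval(..), so there is no need to concatenate the outputs into one big INDArray:
Evaluation eval = new Evaluation(26);
for (DataSet ds : testData) {
INDArray out = model.output(ds.getFeatures());
eval.eval(ds.getLabels(), out);
}
log.info(eval.stats());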
I am trying to do something like the below. I don't like the design, as I am using four for loops to achieve this. Can I enhance the design further? The steps are:
Create a map with dates as keys.
Sort the list values inside the map by date (the dates include hours and minutes here).
Give an incremental id to each DTO.
int serialNumber = 1;
if (hList != null && !hList.isEmpty()) {
// create a Map with dates as keys
HashMap<String, ArrayList<BookDTO>> mapObj = new HashMap<>();
for (int count = 0; count < hList.size(); count++) {
BookDTO bookDTO = (BookDTO) hList.get(count);
ArrayList<BookDTO> list = new ArrayList<>();
list.add(bookDTO);
Calendar depDate = bookDTO.getDepartureDate();
SimpleDateFormat format = new SimpleDateFormat("dd-MM-yyyy");
if (depDate != null) {
String formattedDate = format.format(depDate.getTime());
if (mapObj.containsKey(formattedDate)) {
mapObj.get(formattedDate).add(bookDTO);
} else {
mapObj.put(formattedDate, list);
}
}
}
// Sort the values inside the map based on dates
for (Entry<String, ArrayList<BookDTO>> entry : mapObj.entrySet()) {
Collections.sort(entry.getValue(), new BookDTOComparator(DATES));
}
for (Entry<String, ArrayList<BookDTO>> entry : mapObj.entrySet()) {
serialNumber = setItinerarySerialNumber(entry.getValue(), serialNumber);
}
}
I believe you can merge the last two loops, so we would see only two loops. (Right now I can see three loops.)
You can also try a parallel sort (e.g. Arrays.parallelSort on an array copy of entry.getValue()) if the lists are very large and it is applicable.
Also, if it is applicable, see the code below:
int serialNumber = 1;
SimpleDateFormat format = new SimpleDateFormat("dd-MM-yyyy");
ArrayList<BookDTO> hListCopy = new ArrayList<>(hList);
Collections.sort(hListCopy, new NewBookDTOComparator()); // 2. sorting
HashMap<String, ArrayList<BookDTO>> mapObj = new HashMap<>();
for (BookDTO bookDTO : hListCopy) {
serialNumber = setItinerarySerialNumber(bookDTO, serialNumber); // 3. serialNumber
Calendar depDate = bookDTO.getDepartureDate();
if (depDate != null) {
String formattedDate = format.format(depDate.getTime());
if (mapObj.containsKey(formattedDate)) {
mapObj.get(formattedDate).add(bookDTO);
} else {
ArrayList<BookDTO> list = new ArrayList<>();
list.add(bookDTO);
mapObj.put(formattedDate, list);
}
}
}
So, only one loop (and one sorting algorithm).
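On Java 8+, the containsKey/put branch inside the loop can also be collapsed into a single call (a small optional tweak):
mapObj.computeIfAbsent(formattedDate, k -> new ArrayList<>()).add(bookDTO);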
The list copy constructor uses System.arraycopy internally; you can look up its performance characteristics.
You can sort hList in place instead of creating a new hListCopy, if that is acceptable.
Beware of NewBookDTOComparator: it should sort not only by hours and minutes, but also by the full DepartureDate.
SimpleDateFormat format should be a class field rather than being created in a loop (but note that SimpleDateFormat is not thread-safe, so do not share one instance across threads).
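If you are on Java 8+, java.time.format.DateTimeFormatter is immutable and thread-safe, so it can safely be a static field (a sketch; the Calendar-to-java.time conversion shown is one possible way):
private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("dd-MM-yyyy");
// usage: Calendar -> ZonedDateTime -> formatted string
String formattedDate = FMT.format(bookDTO.getDepartureDate().toInstant().atZone(ZoneId.systemDefault()));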
As above, a parallel sort of hListCopy may help if the lists are very large and it is applicable.
I have a dataset that is used for training a KNN model. Later I'd like to update the model with new training data. What I'm seeing is that the updated model only uses the new training data, ignoring what was previously trained.
Vectorizer vec = new DummyVectorizer<Integer>(1, 2).labeled(0);
DatasetTrainer<KNNClassificationModel, Double> trainer = new KNNClassificationTrainer();
KNNClassificationModel model;
KNNClassificationModel modelUpdated;
Map<Integer, Vector> trainingData = new HashMap<Integer, Vector>();
Map<Integer, Vector> trainingDataNew = new HashMap<Integer, Vector>();
Double[][] data1 = new Double[][] {
{0.136,0.644,0.154},
{0.302,0.634,0.779},
{0.806,0.254,0.211},
{0.241,0.951,0.744},
{0.542,0.893,0.612},
{0.334,0.277,0.486},
{0.616,0.259,0.121},
{0.738,0.585,0.017},
{0.124,0.567,0.358},
{0.934,0.346,0.863}};
Double[][] data2 = new Double[][] {
{0.300,0.236,0.193}};
Double[] observationData = new Double[] { 0.8, 0.7 };
// fill dataset (in cache)
for (int i = 0; i < data1.length; i++)
trainingData.put(i, new DenseVector(data1[i]));
// first training / prediction
model = trainer.fit(trainingData, 1, vec);
System.out.println("First prediction : " + model.predict(new DenseVector(observationData)));
// new training data
for (int i = 0; i < data2.length; i++)
trainingDataNew.put(data1.length + i, new DenseVector(data2[i]));
// second training / prediction
modelUpdated = trainer.update(model, trainingDataNew, 1, vec);
System.out.println("Second prediction: " + modelUpdated.predict(new DenseVector(observationData)));
As an output I get this:
First prediction : 0.124
Second prediction: 0.3
It looks like the second prediction only used data2, which must lead to 0.3 as the prediction.
How does model update work? If I have to add data2 to data1 and then train on data1 again, what would be the difference compared to a completely new training on all the combined data?
How does model update work?
For KNN specifically:
Add data2 to data1 and call update on the combined data, as sketched after the link below.
see this test as an example: https://github.com/apache/ignite/blob/635dafb7742673494efa6e8e91e236820156d38f/modules/ml/src/test/java/org/apache/ignite/ml/knn/KNNClassificationTest.java#L167
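Using the variable names from the question, that looks roughly like this:
Map<Integer, Vector> combined = new HashMap<>(trainingData);
combined.putAll(trainingDataNew);
modelUpdated = trainer.update(model, combined, 1, vec);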
Follow the instructions in that test:
set up your trainer:
KNNClassificationTrainer trainer = new KNNClassificationTrainer()
.withK(3)
.withDistanceMeasure(new EuclideanDistance())
.withWeighted(false);
Then set up your vectorizer (note how the labeled coordinate is created):
model = trainer.fit(
trainingData,
parts,
new DoubleArrayVectorizer<Integer>().labeled(Vectorizer.LabelCoordinate.LAST)
);
Then call update(..) as needed:
KNNClassificationModel updatedOnData = trainer.update(
originalMdlOnEmptyDataset,
newData,
parts,
new DoubleArrayVectorizer<Integer>().labeled(Vectorizer.LabelCoordinate.LAST)
);
docs for KNN classification: https://ignite.apache.org/docs/latest/machine-learning/binary-classification/knn-classification
KNN Classification example: https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/knn/KNNClassificationExample.java
I have a custom metric in AWS CloudWatch and I am putting data into it through the AWS Java API.
for(int i =0;i<collection.size();i++){
String[] cell = collection.get(i).split("\\|\\|");
List<Dimension> dimensions = new ArrayList<>();
dimensions.add(new Dimension().withName(dimension[0]).withValue(cell[0]));
dimensions.add(new Dimension().withName(dimension[1]).withValue(cell[1]));
MetricDatum datum = new MetricDatum().withMetricName(metricName)
.withUnit(StandardUnit.None)
.withValue(Double.valueOf(cell[2]))
.withDimensions(dimensions);
PutMetricDataRequest request = new PutMetricDataRequest().withNamespace(namespace+"_"+cell[3]).withMetricData(datum);
String response = String.valueOf(cw.putMetricData(request));
GetMetricDataRequest res = new GetMetricDataRequest().withMetricDataQueries();
//cw.getMetricData();
com.amazonaws.services.cloudwatch.model.Metric m = new com.amazonaws.services.cloudwatch.model.Metric();
m.setMetricName(metricName);
m.setDimensions(dimensions);
m.setNamespace(namespace);
MetricStat ms = new MetricStat().withMetric(m);
MetricDataQuery metricDataQuery = new MetricDataQuery();
metricDataQuery.withMetricStat(ms);
metricDataQuery.withId("m1");
List<MetricDataQuery> mqList = new ArrayList<MetricDataQuery>();
mqList.add(metricDataQuery);
res.withMetricDataQueries(mqList);
GetMetricDataResult result1= cw.getMetricData(res);
}
Now I want to be able to fetch the latest data entered for a particular namespace, metric name, and dimension combination through the Java API. I am not able to find appropriate documentation from AWS on this. Can anyone please help me?
I got the results from CloudWatch with the code below.
GetMetricDataRequest getMetricDataRequest = new GetMetricDataRequest().withMetricDataQueries();
int period = 300; // seconds
Iterator<Map.Entry<String, String>> entries = dimensions.entrySet().iterator();
List<Dimension> dList = new ArrayList<Dimension>();
while (entries.hasNext()) {
Map.Entry<String, String> entry = entries.next();
dList.add(new Dimension().withName(entry.getKey()).withValue(entry.getValue()));
}
com.amazonaws.services.cloudwatch.model.Metric metric = new com.amazonaws.services.cloudwatch.model.Metric();
metric.setNamespace(namespace);
metric.setMetricName(metricName);
metric.setDimensions(dList);
MetricStat ms = new MetricStat().withMetric(metric)
.withPeriod(period)
.withUnit(StandardUnit.None)
.withStat("Average");
MetricDataQuery metricDataQuery = new MetricDataQuery().withMetricStat(ms)
.withId("m1");
List<MetricDataQuery> mqList = new ArrayList<>();
mqList.add(metricDataQuery);
getMetricDataRequest.withMetricDataQueries(mqList);
long timestamp = 1536962700000L;
long timestampEnd = 1536963000000L;
Date d = new Date(timestamp);
Date dEnd = new Date(timestampEnd);
getMetricDataRequest.withStartTime(d);
getMetricDataRequest.withEndTime(dEnd);
GetMetricDataResult result1= cw.getMetricData(getMetricDataRequest);
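To fetch the most recent datapoints rather than a fixed window, one option (a sketch, not a tested recipe) is to derive the window from the current time; the 300-second lookback matches the period used above:
Date end = new Date();
Date start = new Date(end.getTime() - 300 * 1000L); // last 5 minutes
getMetricDataRequest.withStartTime(start).withEndTime(end);
GetMetricDataResult latest = cw.getMetricData(getMetricDataRequest);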
I want to create a WEKA Java program that reads a group of newly created data records and feeds them to a premade model from the GUI version.
Here is the program:
import java.util.ArrayList;
import weka.classifiers.Classifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.Utils;
public class UseModelWithData {
public static void main(String[] args) throws Exception {
// load model
String rootPath = "G:/";
Classifier classifier = (Classifier) weka.core.SerializationHelper.read(rootPath+"j48.model");
// create instances
Attribute attr1 = new Attribute("age");
Attribute attr2 = new Attribute("menopause");
Attribute attr3 = new Attribute("tumor-size");
Attribute attr4 = new Attribute("inv-nodes");
Attribute attr5 = new Attribute("node-caps");
Attribute attr6 = new Attribute("deg-malig");
Attribute attr7 = new Attribute("breast");
Attribute attr8 = new Attribute("breast-quad");
Attribute attr9 = new Attribute("irradiat");
Attribute attr10 = new Attribute("Class");
ArrayList<Attribute> attributes = new ArrayList<Attribute>();
attributes.add(attr1);
attributes.add(attr2);
attributes.add(attr3);
attributes.add(attr4);
attributes.add(attr5);
attributes.add(attr6);
attributes.add(attr7);
attributes.add(attr8);
attributes.add(attr9);
attributes.add(attr10);
// predict instance class values
Instances testing = new Instances("Test dataset", attributes, 0);
// add data
double[] values = new double[testing.numAttributes()];
values[0] = testing.attribute(0).addStringValue("60-69");
values[1] = testing.attribute(1).addStringValue("ge40");
values[2] = testing.attribute(2).addStringValue("10-14");
values[3] = testing.attribute(3).addStringValue("15-17");
values[4] = testing.attribute(4).addStringValue("yes");
values[5] = testing.attribute(5).addStringValue("2");
values[6] = testing.attribute(6).addStringValue("right");
values[7] = testing.attribute(7).addStringValue("right_up");
values[8] = testing.attribute(0).addStringValue("yes");
values[9] = Utils.missingValue();
// add data to instance
testing.add(new DenseInstance(1.0, values));
// instance row to predict
int index = 10;
// perform prediction
double myValue = classifier.classifyInstance(testing.instance(10));
// get the name of class value
String prediction = testing.classAttribute().value((int) myValue);
System.out.println("The predicted value of the instance ["
+ Integer.toString(index) + "]: " + prediction);
}
}
My references include:
Using a premade WEKA model in Java
the WEKA Manual provided in the 3.7.10 version - 17.3 Creating datasets in memory
Creating a single instance for classification in WEKA
So far, the part where I create a new Instance inside the program causes the following error:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 10, Size: 1
in the line
double myValue = classifier.classifyInstance(testing.instance(10));
I just want to feed the latest row of instance values to a premade WEKA model. How do I solve this?
Resources
Program file
Arff file
j48.model
You have the error because you are trying to access the 11th instance when you have only created one.
If you always want to access the last instance, you might try the following:
double myValue = classifier.classifyInstance(testing.lastInstance());
(Also make sure setClassIndex(..) is called on testing before classifying; classifyInstance needs to know which attribute is the class.)
Additionally, I don't believe that you are creating the instances you hope for. Looking at your provided ".arff" file, which I believe you are trying to mimic, attributes like age are nominal, so they should be declared with their lists of possible values rather than as numeric attributes. I think you should proceed making instances as follows:
ArrayList<Attribute> atts;
ArrayList<String> attAge;
Instances testing;
double[] vals;
// 1. set up attributes
atts = new ArrayList<Attribute>();
//age (a nominal attribute, declared with its possible values)
attAge = new ArrayList<String>();
attAge.add("10-19");
attAge.add("20-29");
attAge.add("30-39");
attAge.add("40-49");
attAge.add("50-59");
attAge.add("60-69");
attAge.add("70-79");
attAge.add("80-89");
attAge.add("90-99");
atts.add(new Attribute("age", attAge));
// 2. create Instances object
testing = new Instances("breast-cancer", atts, 0);
// 3. fill with data
vals = new double[testing.numAttributes()];
vals[0] = attAge.indexOf("10-19");
testing.add(new DenseInstance(1.0, vals));
// 4. output data
System.out.println(testing);
Of course I did not create the whole dataset, but the technique would be the same.
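Once the dataset mirrors the .arff header, the prediction part becomes (a sketch, assuming the class attribute is last, as in your file):
testing.setClassIndex(testing.numAttributes() - 1);
double myValue = classifier.classifyInstance(testing.lastInstance());
String prediction = testing.classAttribute().value((int) myValue);
System.out.println("Predicted class: " + prediction);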