Print actual and predicted class labels using Random Forest in Java
I have a large dataset with 10,000 records: 5,000 belong to class 1 and the remaining 5,000 to class -1. I used Random Forest and obtained good accuracy, over 90%.
Now if I have an arff file
@relation cds_orf
@attribute start numeric
@attribute end numeric
@attribute score numeric
@attribute orf_coverage numeric
@attribute class {1,-1}
@data
(suppose this contains 5 records)
my output should be something like this
No Actual_class Predicted class
1 1 1
2 1 1
3 -1 -1
4 1 -1
5 1 1
I want the Java code to print this output. Thanks.
(Note: I have used classifier.classifyInstance() but it gives NullPointerException)
Well, I found the answer myself after a lot of research. The following code does this and writes the output to another file, orf_out.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
/**
*
* @author samy
*/
public class WekaTest {
/**
* @throws java.lang.Exception
*/
public static void rfnew() throws Exception {
BufferedReader br;
int numFolds = 10;
br = new BufferedReader(new FileReader("orf_arff"));
Instances trainData = new Instances(br);
trainData.setClassIndex(trainData.numAttributes() - 1);
br.close();
RandomForest rf = new RandomForest();
rf.setNumTrees(100);
Evaluation evaluation = new Evaluation(trainData);
evaluation.crossValidateModel(rf, trainData, numFolds, new Random(1));
rf.buildClassifier(trainData);
PrintWriter out = new PrintWriter("orf_out");
out.println("No.\tTrue\tPredicted");
for (int i = 0; i < trainData.numInstances(); i++)
{
String trueClassLabel;
trueClassLabel = trainData.instance(i).toString(trainData.classIndex());
// Discrete prediction: the index of the predicted class value
double predictionIndex =
rf.classifyInstance(trainData.instance(i));
// Get the predicted class label from the predictionIndex.
String predictedClassLabel;
predictedClassLabel = trainData.classAttribute().value((int) predictionIndex);
out.println((i+1)+"\t"+trueClassLabel+"\t"+predictedClassLabel);
}
out.println(evaluation.toSummaryString("\nResults\n======\n", true));
out.println(evaluation.toClassDetailsString());
out.println("Results for class " + trainData.classAttribute().value(0));
out.println("Precision= " + evaluation.precision(0));
out.println("Recall= " + evaluation.recall(0));
out.println("F-measure= " + evaluation.fMeasure(0));
out.println("Results for class " + trainData.classAttribute().value(1));
out.println("Precision= " + evaluation.precision(1));
out.println("Recall= " + evaluation.recall(1));
out.println("F-measure= " + evaluation.fMeasure(1));
out.close();
}
}
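I needed to call buildClassifier before classifyInstance. crossValidateModel trains its own copies of the classifier, so rf itself stays untrained until buildClassifier is called on it.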
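The cast from predictionIndex to a label works because Weka represents a nominal value by its position in the attribute declaration, so for {1,-1} index 0 is "1" and index 1 is "-1". Here is a minimal plain-Java sketch of that lookup with no Weka dependency. The class LabelLookup, its methods, and the hard-coded index arrays (which mirror the five example rows in the question) are hypothetical and only illustrate the mapping.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Weka: mimics how a nominal class index maps to its label.
class LabelLookup {

    // Returns the label stored at the given class index (the double a classifier would return).
    static String labelFor(List<String> classValues, double predictionIndex) {
        return classValues.get((int) predictionIndex);
    }

    // Formats one tab-separated output row: number, actual label, predicted label.
    static String row(int number, String actual, String predicted) {
        return number + "\t" + actual + "\t" + predicted;
    }

    public static void main(String[] args) {
        // Declaration order of "@attribute class {1,-1}"
        List<String> classValues = Arrays.asList("1", "-1");
        // Assumed sample indexes, chosen to match the example table in the question
        double[] actualIdx = {0, 0, 1, 0, 0};
        double[] predictedIdx = {0, 0, 1, 1, 0};

        System.out.println("No.\tTrue\tPredicted");
        for (int i = 0; i < actualIdx.length; i++) {
            System.out.println(row(i + 1,
                    labelFor(classValues, actualIdx[i]),
                    labelFor(classValues, predictedIdx[i])));
        }
    }
}
```

Note that the answer's code predicts on the same instances the forest was trained on, so the per-row predictions will look better than the cross-validated summary printed afterwards.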
Related
Sorting strings via stream
I am doing a coding exercise where I take the raw data from a CSV file and print it in order from lowest to highest literacy rate. For example:
Adult literacy rate, population 15+ years, female (%),United Republic of Tanzania,2015,76.08978
Adult literacy rate, population 15+ years, female (%),Zimbabwe,2015,85.28513
Adult literacy rate, population 15+ years, male (%),Honduras,2014,87.39595
Adult literacy rate, population 15+ years, male (%),Honduras,2015,88.32135
Adult literacy rate, population 15+ years, male (%),Angola,2014,82.15105
Turns into:
Niger (2015), female, 11.01572
Mali (2015), female, 22.19578
Guinea (2015), female, 22.87104
Afghanistan (2015), female, 23.87385
Central African Republic (2015), female, 24.35549
My code:
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class LiteracyComparison {
    public static void main(String[] args) throws IOException {
        List<String> literacy = new ArrayList<>();
        try (Scanner scanner = new Scanner(Paths.get("literacy.csv"))) {
            while(scanner.hasNextLine()){
                String row = scanner.nextLine();
                String[] line = row.split(",");
                line[2] = line[2].trim().substring(0, line[2].length() - 5);
                line[3] = line[3].trim();
                line[4] = line[4].trim();
                line[5] = line[5].trim();
                String l = line[3] + " (" + line[4] + "), " + line[2] + ", " + line[5];
                literacy.add(l);
            }
        }
        // right about where I get lost
        literacy.stream().sorted();
    }
}
Now that I have converted the raw data into the correct format, I am lost on how to sort it. I am also wondering if there is a more efficient way to do this with streams. Please and thank you.
I took a few liberties while refactoring your code, but the idea is the same. This could be further improved but it is not intended to be a perfect solution, just something to answer your question and put you on the right track.
The main idea here is to create a nested class called LiteracyData, which stores the summary you had before as a String. However, we also want to store the literacy rate so we have something to sort by. Then you can use a Java Comparator to define your own method for comparing custom classes, in this case LiteracyData. Finally, tie it all together by calling the sort function on your List, while passing in the custom Comparator as an argument. That will sort your list. You can then print it to view the results.
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.Comparator;

public class LiteracyComparison {
    // Define a class that stores your data
    public class LiteracyData {
        private String summary;
        private float rate;
        public LiteracyData(String summary, float rate) {
            super();
            this.summary = summary;
            this.rate = rate;
        }
    }
    // This is a custom Comparator we defined for sorting LiteracyData
    public class LiteracySorter implements Comparator<LiteracyData> {
        @Override
        public int compare(LiteracyData d1, LiteracyData d2) {
            return Float.compare(d1.rate, d2.rate);
        }
    }
    public void run() {
        List<LiteracyData> literacy = new ArrayList<>();
        try (Scanner scanner = new Scanner(Paths.get("literacy.csv"))) {
            while(scanner.hasNextLine()){
                String row = scanner.nextLine();
                String[] line = row.split(",");
                line[2] = line[2].trim().substring(0, line[2].length() - 5);
                line[3] = line[3].trim();
                line[4] = line[4].trim();
                line[5] = line[5].trim();
                String l = line[3] + " (" + line[4] + "), " + line[2] + ", " + line[5];
                LiteracyData data = new LiteracyData(l, Float.parseFloat(line[5]));
                literacy.add(data);
            }
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
        // Sort the list using your custom LiteracyData comparator
        literacy.sort(new LiteracySorter());
        // Iterate through the list and print each item to ensure it is sorted
        for(LiteracyData data : literacy) {
            System.out.println(data.summary);
        }
    }
    public static void main(String[] args) throws IOException {
        LiteracyComparison comparison = new LiteracyComparison();
        comparison.run();
    }
}
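Since the question also asks about streams, here is a minimal sketch of the same idea using Stream.sorted with Comparator.comparingDouble. The class LiteracyStreamSort, its Entry holder, and the hard-coded sample rows (copied from the question's input) are assumptions for illustration. A real program would read the rows from literacy.csv instead, and rows whose country name contains a comma would need a proper CSV parser.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: sorts CSV rows by literacy rate using streams.
class LiteracyStreamSort {

    // Holds the formatted summary plus the numeric rate used as the sort key.
    static final class Entry {
        final String summary;
        final double rate;

        Entry(String summary, double rate) {
            this.summary = summary;
            this.rate = rate;
        }
    }

    // Parses one row such as
    // "Adult literacy rate, population 15+ years, female (%),Zimbabwe,2015,85.28513".
    static Entry parse(String row) {
        String[] parts = row.split(",");
        String gender = parts[2].replace("(%)", "").trim();
        String country = parts[3].trim();
        String year = parts[4].trim();
        String rate = parts[5].trim();
        String summary = country + " (" + year + "), " + gender + ", " + rate;
        return new Entry(summary, Double.parseDouble(rate));
    }

    // Returns the summaries ordered from lowest to highest rate.
    static List<String> sortedSummaries(List<String> rows) {
        return rows.stream()
                .map(LiteracyStreamSort::parse)
                .sorted(Comparator.comparingDouble(e -> e.rate))
                .map(e -> e.summary)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "Adult literacy rate, population 15+ years, female (%),Zimbabwe,2015,85.28513",
                "Adult literacy rate, population 15+ years, male (%),Honduras,2014,87.39595",
                "Adult literacy rate, population 15+ years, female (%),United Republic of Tanzania,2015,76.08978");
        sortedSummaries(rows).forEach(System.out::println);
    }
}
```

The numeric rate is kept next to the summary because sorting the formatted strings directly would order them alphabetically by country, not by rate.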
validate ArrayList contents against specific set of data
I want to verify that every element of an ArrayList matches the value of a String variable. If any element does not match, its index should be printed with an error message such as "value at index 2 didn't match the value of the expectedName variable". When I run the code below, it prints the error message for all three indexes instead of only index 1. Note that I read the data from a CSV file, put it into ArrayLists, and then validate it against the expected values held in String variables.
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;

public class ValidateVideoDuration {
    private static final String CSV_FILE_PATH = "C:\\Users\\videologs.csv";
    public static void main(String[] args) throws IOException {
        String expectedVideo1Duration = "00:00:30";
        String expectedVideo2Duration = "00:00:10";
        String expectedVideo3Duration = "00:00:16";
        String actualVideo1Duration = "";
        String actualVideo2Duration = "";
        String actualVideo3Duration = "";
        ArrayList<String> actualVideo1DurationList = new ArrayList<String>();
        ArrayList<String> actualVideo2DurationList = new ArrayList<String>();
        ArrayList<String> actualVideo3DurationList = new ArrayList<String>();
        try (Reader reader = Files.newBufferedReader(Paths.get(CSV_FILE_PATH));
             CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim());) {
            for (CSVRecord csvRecord : csvParser) {
                // Accessing values by Header names
                actualVideo1Duration = csvRecord.get("Video 1 Duration");
                actualVideo1DurationList.add(actualVideo1Duration);
                actualVideo2Duration = csvRecord.get("Video 2 Duration");
                actualVideo2DurationList.add(actualVideo2Duration);
                actualVideo3Duration = csvRecord.get("Video 3 Duration");
                actualVideo3DurationList.add(actualVideo3Duration);
            }
        }
        for (int i = 0; i < actualVideo2DurationList.size(); i++) {
            if (actualVideo2DurationList.get(i) != expectedVideo2Duration) {
                System.out.println("Duration of Video 1 at index number " + Integer.toString(i) + " didn't match the expected duration");
            }
        }
The data inside my CSV file looks like the following:
video 1 duration, video 2 duration, video 3 duration
00:00:30, 00:00:10, 00:00:16
00:00:30, 00:00:15, 00:00:15
00:00:25, 00:00:10, 00:00:16
Don't use == or != to compare strings. == checks whether two references point to the same String object, not whether the values are equal. Use the .equals() method instead. Change your if condition to:
if (!actualVideo2DurationList.get(i).equals(expectedVideo2Duration))
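The fix can be sketched as a small helper that collects the mismatching indexes using equals(). The class DurationCheck and the method mismatchedIndexes are hypothetical names, and the list literal stands in for the "Video 2 Duration" column of the sample CSV.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: collects the indexes whose value differs from the expected string.
class DurationCheck {

    // Compares by value with equals(); expected goes first so null list entries do not throw.
    static List<Integer> mismatchedIndexes(List<String> actual, String expected) {
        List<Integer> mismatches = new ArrayList<>();
        for (int i = 0; i < actual.size(); i++) {
            if (!expected.equals(actual.get(i))) {
                mismatches.add(i);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Video 2 column from the sample CSV in the question
        List<String> video2 = Arrays.asList("00:00:10", "00:00:15", "00:00:10");
        for (int i : mismatchedIndexes(video2, "00:00:10")) {
            System.out.println("Duration of Video 2 at index " + i
                    + " didn't match the expected duration");
        }
    }
}
```

Strings read from a file are separate objects from the literals in the source code, which is why the original != check reported every row as a mismatch.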
number of times the combinations of strings ( length>3) occurred in the given ArrayList
I want to find the number of times the combinations of strings (whose length is more than 3) occurred in the given input.
input:
scientists found way to reduce global warming
scientists, found way to minimize water pollution
scientists said that they are successful
Rony said that they are successful
johnny said that he failed
desired output:
scientists found-2
said that-3
"scientists found" is in the 1st and 2nd statements, and "said that" is in the 3rd, 4th and 5th statements. "they are successful" is not included because the length of "are" is not more than 3.
I have divided my program into blocks and added comments describing what each block does. How can I get the desired output? Is there a more efficient solution for this?
package project1;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class combo{
    //----------Block 1 starts---------------------------------------
    public static void main(String args[]) {
        ArrayList<String> exampleList = new ArrayList<>();
        exampleList.add("scientists found way to reduce global warming".toLowerCase());
        exampleList.add("scientists, found way to minimize water pollution".toLowerCase());
        exampleList.add("scientists, said that they are successful".toLowerCase());
        exampleList.add("Rony, said that they are successful".toLowerCase());
        exampleList.add("johnny, said that he failed".toLowerCase());
        Map<String, Integer> keywordList = new HashMap<String, Integer>();
        ArrayList<String> strmatch=new ArrayList<>();
        for(int i=0;i<exampleList.size();i++){
            String[] tokens = exampleList.get(i).split("[ ,-;()//:']");
            for (String token : tokens) {
                if(token.length()>3){
                    if(!keywordList.containsKey(token))
                        keywordList.put(token,1);
                    else{
                        keywordList.put(token,keywordList.get(token)+1);
                    }
                }
            }
            for (int j=0;j<tokens.length;j++)//content of tokens array
            {
                System.out.println(tokens[j]); //to check content of tokens.
            }
        }
        //------------Block 1 ends---------------------------------------
        //content of keywordList
        /*for (String name: keywordList.keySet()){
            String key =name.toString();
            String value = keywordList.get(name).toString();
            System.out.println(key + " " + value); //to check keywordList content.
        } */
        //------------Block 2 starts-------------------------------------
        System.out.println(keywordList.size());
        Iterator it = keywordList.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry pair = (Map.Entry)it.next();
            if((int)pair.getValue()<2)
                it.remove();
            System.out.println(pair.getKey() + " = " + pair.getValue());
            /*to get content of keywordList which are repeated more than once.*/
        }
        //-----------Block 2 ends--------------------------------------
        //-----------Block 3 starts------------------------------------
        it = keywordList.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry pair = (Map.Entry)it.next();
            System.out.println(pair.getKey() + " ::" + pair.getValue());
            strmatch.add((String)pair.getKey());
        }
        //-----------Block 3 ends----------------------------------------
        //-----------Block 4 starts--------------------------------------
        System.out.println(strmatch);//content of strmatch
        String[] str= new String[strmatch.size()];
        //int[][] variable2=new int[keywordList.size()][keywordList.size()];
        for(int i=0;i<exampleList.size();i++){
            for(int j=0;j<strmatch.size();j++)
                for (int k=0;k<strmatch.size();k++){
                    if(j==k)
                        continue;
                    if(exampleList.get(i).contains(strmatch.get(j))&&exampleList.get(i).contains(strmatch.get(k)))
                        str[i]=strmatch.get(j)+" "+strmatch.get(k);
                }
        }
        //-----------Block 4 ends----------------------------------------
        for(int p=0;p<strmatch.size();p++)//contents of str array
        {
            System.out.println(str[p]); //to get desired output
        }
    }
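One possible reading of the requirement is: count each pair of adjacent words, both longer than 3 characters, once per sentence, and keep the pairs found in at least two sentences. The sketch below implements only that reading. The class PairCounter and the method repeatedPairs are hypothetical names, and the rule itself is an assumption. On the sample sentences this reading also reports "that they" (2), which the desired output omits, so the intended rule may be stricter.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: counts in how many sentences each adjacent word pair occurs.
class PairCounter {

    // Returns pairs of adjacent words, both longer than minLength characters,
    // that occur in at least two sentences, mapped to the number of sentences.
    static Map<String, Integer> repeatedPairs(List<String> sentences, int minLength) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            // Split on anything that is not a letter, after lower-casing.
            String[] words = sentence.toLowerCase().trim().split("[^a-z]+");
            // A set, so that a pair is counted at most once per sentence.
            Set<String> seen = new LinkedHashSet<>();
            for (int i = 0; i + 1 < words.length; i++) {
                if (words[i].length() > minLength && words[i + 1].length() > minLength) {
                    seen.add(words[i] + " " + words[i + 1]);
                }
            }
            for (String pair : seen) {
                counts.merge(pair, 1, Integer::sum);
            }
        }
        // Drop pairs that occur in only one sentence.
        counts.values().removeIf(n -> n < 2);
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "scientists found way to reduce global warming",
                "scientists, found way to minimize water pollution",
                "scientists, said that they are successful",
                "Rony, said that they are successful",
                "johnny, said that he failed");
        for (Map.Entry<String, Integer> e : repeatedPairs(sentences, 3).entrySet()) {
            System.out.println(e.getKey() + "-" + e.getValue());
        }
    }
}
```

This makes a single pass over the sentences, compared with the nested loops over every keyword pair in Block 4.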
Retrieving nth qualifier in hbase using java
This question is a bit unusual, but I need an answer. In a list (collection), we can retrieve the nth element with list.get(i). Similarly, is there any method in HBase, using the Java API, to get the nth qualifier given the row id and column family name? NOTE: I have a million qualifiers in a single row in a single column family.
Sorry for being unresponsive. Busy with something important. Try this for now:
package org.myorg.hbasedemo;
import java.io.IOException;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetNthColunm {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "TEST");
        // Fetch the whole row with row key "4"
        Get g = new Get(Bytes.toBytes("4"));
        Result r = table.get(g);
        System.out.println("Enter column index :");
        Scanner reader = new Scanner(System.in);
        int index = reader.nextInt();
        System.out.println("index : " + index);
        int count = 0;
        // Walk the cells in order; the index entered is 1-based
        for (KeyValue kv : r.raw()) {
            if(++count!=index)
                continue;
            System.out.println("Qualifier : " + Bytes.toString(kv.getQualifier()));
            System.out.println("Value : " + Bytes.toString(kv.getValue()));
            // Stop once the requested cell has been printed
            break;
        }
        table.close();
        System.out.println("Done.");
    }
}
Will let you know if I get a better way to do this.
Classify and Predict Multiple Attributes with Weka
I need to input 6 attributes and classify/predict 3 attributes from that input using Java/Weka programmatically. I've figured out how to predict 1 (the last) attribute, but how can I change this to train and predict the last 3 at the same time? The numbers in the .arff files correspond to movie objects in a database.
Here is my Java code:
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.DecisionStump;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.Remove;

public class WekaTrial {
    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // Create training data instance
        Instances training_data = new Instances(
                new BufferedReader(
                        new FileReader(
                                "C:/Users/Me/Desktop/File_Project/src/movie_training.arff")));
        training_data.setClassIndex(training_data.numAttributes() - 1);
        // Create testing data instance
        Instances testing_data = new Instances(
                new BufferedReader(
                        new FileReader(
                                "C:/Users/Me/Desktop/FileProject/src/movie_testing.arff")));
        testing_data.setClassIndex(training_data.numAttributes() - 1);
        // Print initial data summary
        String summary = training_data.toSummaryString();
        int number_samples = training_data.numInstances();
        int number_attributes_per_sample = training_data.numAttributes();
        System.out.println("Number of attributes in model = " + number_attributes_per_sample);
        System.out.println("Number of samples = " + number_samples);
        System.out.println("Summary: " + summary);
        System.out.println();
        // a classifier for decision trees:
        J48 j48 = new J48();
        // filter for removing samples:
        Remove rm = new Remove();
        rm.setAttributeIndices("1"); // remove 1st attribute
        // filtered classifier
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(rm);
        fc.setClassifier(j48);
        // Create counters and print values
        float correct = 0;
        float incorrect = 0;
        // train using stock_training_data.arff:
        fc.buildClassifier(training_data);
        // test using stock_testing_data.arff:
        for (int i = 0; i < testing_data.numInstances(); i++) {
            double pred = fc.classifyInstance(testing_data.instance(i));
            System.out.print("Expected values: " + testing_data.classAttribute().value(
                    (int) testing_data.instance(i).classValue()));
            System.out.println(", Predicted values: " + testing_data.classAttribute().value((int) pred));
            // Increment correct/incorrect values
            if (testing_data.classAttribute().value(
                    (int) testing_data.instance(i).classValue()) == testing_data.classAttribute().value((int) pred)) {
                correct += 1;
            } else {
                incorrect += 1;
            }
        }
        // Print correct/incorrect
        float percent_correct = correct/(correct+incorrect)*100;
        System.out.println("Number correct: " + correct + "\nNumber incorrect: " + incorrect + "\nPercent correct: " + percent_correct + "%");
    }
}
This is my .arff training file (with excess rows removed):
@relation movie_data
@attribute movie1_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie1_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie1_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@data
18,18,18,18,18,18,18,18,18
28,18,36,18,53,10769,18,53,10769
37,37,37,28,12,14,28,12,14
27,53,27,18,10749,10769,27,53,27
12,12,12,35,10751,35,12,12,12
35,18,10749,18,18,18,35,18,10749
28,12,878,53,53,53,53,53,53
18,18,18,28,37,10769,18,18,18
18,53,18,28,12,35,18,53,18
28,80,53,80,18,10749,28,80,53
18,10749,18,18,10756,18,18,10756,18
18,10749,10769,28,12,878,18,10749,10769
18,10756,18,16,35,10751,16,35,10751
35,18,10751,35,18,10752,35,18,10751
And the .arff testing file:
@relation movie_data
@attribute movie1_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie1_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie1_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@data
18,27,53,18,53,10756,18,27,53
35,18,10749,18,10769,18,18,10769,18
16,878,53,16,18,16,16,18,16
35,10749,10757,18,18,18,18,18,18
80,18,10748,18,10749,18,18,10749,18
28,18,36,35,18,10751,28,18,36
18,10749,10769,35,18,10402,35,18,10402
28,12,878,18,10749,10769,18,10749,10769
35,10749,35,14,10402,10751,14,10402,10751
If I understood you correctly, you have a "Multi-Class" or "Multi-Target" problem. You have several simple options to solve the problem:
Create a new target class which incorporates all 3 (concatenation of decision_one, decision_two and decision_three).
Train each target separately.
I think the simplest approach would be, as Bella said, to train three separate models, one for each class, possibly removing the rest of the class attribs (depending on whether or not you want the other two classes to influence your classification).
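The one-model-per-target idea can be sketched, without Weka, as the bookkeeping of which columns each model should ignore. The strings produced below use 1-based indices, the same format the question already passes to rm.setAttributeIndices("1"). TargetSplitter and removalIndices are hypothetical names. Wiring each string into a Remove filter and setting the class index to the chosen target is the intended use, but that part is an assumption and is not shown or tested here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: for each target column, lists the other target columns to drop.
class TargetSplitter {

    // numAttributes: total number of columns; numTargets: how many trailing columns are targets.
    // Returns a map from the 1-based index of a target to a comma-separated, 1-based
    // list of the remaining target columns.
    static Map<Integer, String> removalIndices(int numAttributes, int numTargets) {
        Map<Integer, String> result = new LinkedHashMap<>();
        int firstTarget = numAttributes - numTargets + 1;
        for (int target = firstTarget; target <= numAttributes; target++) {
            List<String> others = new ArrayList<>();
            for (int other = firstTarget; other <= numAttributes; other++) {
                if (other != target) {
                    others.add(String.valueOf(other));
                }
            }
            result.put(target, String.join(",", others));
        }
        return result;
    }

    public static void main(String[] args) {
        // The movie data has 9 attributes, the last 3 of which are decision_one..decision_three.
        for (Map.Entry<Integer, String> e : removalIndices(9, 3).entrySet()) {
            System.out.println("target column " + e.getKey() + ": remove " + e.getValue());
        }
    }
}
```

Skipping the removal step keeps the other two decision columns as inputs, which corresponds to the case where the other classes are allowed to influence the classification.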