Print out prediction with WEKA in Java
I am trying to make a prediction with Weka in Java, using the Naive Bayes Classifier, with the following code:
JAVA
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils;

public class Run {
    public static void main(String[] args) throws Exception {
        ConverterUtils.DataSource source1 = new ConverterUtils.DataSource("./data/train.arff");
        Instances train = source1.getDataSet();
        // setting class attribute if the data format does not provide this information
        // For example, the XRFF format saves the class attribute information as well
        if (train.classIndex() == -1)
            train.setClassIndex(train.numAttributes() - 1);

        ConverterUtils.DataSource source2 = new ConverterUtils.DataSource("./data/test.arff");
        Instances test = source2.getDataSet();
        // setting class attribute if the data format does not provide this information
        // For example, the XRFF format saves the class attribute information as well
        if (test.classIndex() == -1)
            test.setClassIndex(test.numAttributes() - 1);

        // model
        NaiveBayes naiveBayes = new NaiveBayes();
        naiveBayes.buildClassifier(train);
        Evaluation evaluation = new Evaluation(train);
        evaluation.evaluateModel(naiveBayes, test);
    }
}
TRAIN
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
...
PREDICT
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,?
In the GUI the predicted output is
=== Predictions on test split ===
inst#, actual, predicted, error, probability distribution
1 ? 2:no + 0.145 *0.855
How can I get this output with Java? Which method do I need to use to get this?
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils;

public class Run {
    public static void main(String[] args) throws Exception {
        ConverterUtils.DataSource source1 = new ConverterUtils.DataSource("./data/train.arff");
        Instances train = source1.getDataSet();
        // setting class attribute if the data format does not provide this information
        // For example, the XRFF format saves the class attribute information as well
        if (train.classIndex() == -1)
            train.setClassIndex(train.numAttributes() - 1);

        ConverterUtils.DataSource source2 = new ConverterUtils.DataSource("./data/test.arff");
        Instances test = source2.getDataSet();
        // setting class attribute if the data format does not provide this information
        // For example, the XRFF format saves the class attribute information as well
        if (test.classIndex() == -1)
            test.setClassIndex(test.numAttributes() - 1);

        // model
        NaiveBayes naiveBayes = new NaiveBayes();
        naiveBayes.buildClassifier(train);

        // this does the trick: classifyInstance returns the predicted class index
        double label = naiveBayes.classifyInstance(test.instance(0));
        test.instance(0).setClassValue(label);
        // prints the predicted label of the class attribute (index 4)
        System.out.println(test.instance(0).stringValue(4));
    }
}
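To also reproduce the probability distribution from the GUI output (the "0.145 *0.855" part), Classifier exposes distributionForInstance() alongside classifyInstance(). A minimal sketch, assuming the same naiveBayes and test variables as above; the output formatting here is illustrative, not Weka's exact GUI format:

// Print the predicted label plus the class probability distribution
// for each test instance.
for (int i = 0; i < test.numInstances(); i++) {
    double label = naiveBayes.classifyInstance(test.instance(i));
    double[] dist = naiveBayes.distributionForInstance(test.instance(i));
    System.out.println((i + 1) + "  predicted: "
            + test.classAttribute().value((int) label)
            + "  distribution: " + java.util.Arrays.toString(dist));
}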
Related
Apache Beam how to filter data based on date value
I am trying to read records from a CSV file and filter the records based on the date. I have implemented this in the following way, but is this the correct way? The steps are:

Creating the pipeline
Reading the data from a file
Performing the necessary filtering
Creating a MapElements object and converting the OrderRequest to a String
Mapping the OrderRequest entity to a String
Writing the output to a file

Code:

// Creating pipeline
Pipeline pipeline = Pipeline.create();

// For transformations: reading from a file
PCollection<String> orderRequest = pipeline
        .apply(TextIO.read().from("src/main/resources/ST/STCheck/OrderRequest.csv"));

PCollection<OrderRequest> pCollectionTransformation = orderRequest
        .apply(ParDo.of(new DoFn<String, OrderRequest>() {
            private static final long serialVersionUID = 1L;

            @ProcessElement
            public void processElement(ProcessContext c) {
                String rowString = c.element();
                if (!rowString.contains("order_id")) {
                    String[] strArr = rowString.split(",");
                    OrderRequest orderRequest = new OrderRequest();
                    orderRequest.setOrder_id(strArr[0]);
                    // Condition to check the date
                    String source1 = strArr[1];
                    DateTimeFormatter fmt1 = DateTimeFormat.forPattern("MM/dd/yyyy");
                    DateTime d1 = fmt1.parseDateTime(source1);
                    System.out.println(d1);
                    String source2 = "4/24/2017";
                    DateTimeFormatter fmt2 = DateTimeFormat.forPattern("MM/dd/yyyy");
                    DateTime d2 = fmt2.parseDateTime(source2);
                    System.out.println(d2);
                    orderRequest.setOrder_date(strArr[1]);
                    System.out.println(strArr[1]);
                    orderRequest.setAmount(Double.valueOf(strArr[2]));
                    orderRequest.setCounter_id(strArr[3]);
                    if (DateTimeComparator.getInstance().compare(d1, d2) > -1) {
                        c.output(orderRequest);
                    }
                }
            }
        }));

// Create a MapElements object and convert the OrderRequest to String
MapElements<OrderRequest, String> mapElements = MapElements.into(TypeDescriptors.strings())
        .via((OrderRequest orderRequestType) -> orderRequestType.getOrder_id() + " "
                + orderRequestType.getOrder_date() + " "
                + orderRequestType.getAmount() + " "
                + orderRequestType.getCounter_id());

// Mapping the OrderRequest entity to String
PCollection<String> pStringList = pCollectionTransformation.apply(mapElements);

// Now writing the elements to a file
pStringList.apply(TextIO.write().to("src/main/resources/ST/STCheck/OrderRequestOut.csv")
        .withNumShards(1).withSuffix(".csv"));

// To run the pipeline
pipeline.run();
System.out.println("We are done!!");

POJO class:

public class OrderRequest implements Serializable {
    String order_id;
    String order_date;
    double amount;
    String counter_id;
}

Though I am getting the correct result, is this the correct way? My two main questions are: 1) How do I access individual columns, so that I can specify conditions based on a column's value? 2) Can we specify headers when reading the data?
Yes, you can process CSV files like this using TextIO.read(), provided they do not contain fields with embedded newlines and you can recognize/skip the header lines. Your pipeline looks good, though as a minor style issue I would probably have the first ParDo do only the parsing, followed by a Filter that looks at the date to filter things out. If you want to automatically infer the header line, you could read the first line in your main program (using standard Java libraries, or Beam's FileSystems class), extract it manually, and pass it into your parsing DoFn. I agree a more columnar approach would be more natural. We have this in Python as our DataFrames API, which is now available for general use. You would write something like:

with beam.Pipeline() as p:
    df = p | beam.dataframe.io.read_csv("src/main/resources/ST/STCheck/OrderRequest.csv")
    filtered = df[df.order_date > limit]
    filtered.write_csv("src/main/resources/ST/STCheck/OrderRequestOut.csv")
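On the Java side, a rough sketch of the parse-then-filter split suggested above, reusing the OrderRequest POJO and Joda-Time calls from the question; Filter.by is Beam's built-in org.apache.beam.sdk.transforms.Filter, and the cut-off date is the illustrative one from the question:

PCollection<OrderRequest> parsed = orderRequest
        .apply("ParseCsv", ParDo.of(new DoFn<String, OrderRequest>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String row = c.element();
                if (row.contains("order_id")) return; // skip the header line
                String[] cols = row.split(",");
                OrderRequest req = new OrderRequest();
                req.setOrder_id(cols[0]);
                req.setOrder_date(cols[1]);
                req.setAmount(Double.valueOf(cols[2]));
                req.setCounter_id(cols[3]);
                c.output(req);
            }
        }));

PCollection<OrderRequest> filtered = parsed.apply("FilterByDate",
        Filter.by((OrderRequest r) -> {
            DateTimeFormatter fmt = DateTimeFormat.forPattern("MM/dd/yyyy");
            DateTime orderDate = fmt.parseDateTime(r.getOrder_date());
            DateTime cutOff = fmt.parseDateTime("4/24/2017");
            return DateTimeComparator.getInstance().compare(orderDate, cutOff) > -1;
        }));

This keeps the parsing reusable and makes the date condition a single, testable predicate.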
How to generate a large CSV with random data using Java
How can I generate a million records in CSV format using Java, with some unique data?
Check out this tutorial. The code can be quite simple:

MockNeat m = MockNeat.threadLocal();
final Path path = Paths.get("./test.csv");

m.fmt("#{id},#{first},#{last},#{email},#{salary}")
 .param("id", m.longSeq())
 .param("first", m.names().first())
 .param("last", m.names().last())
 .param("email", m.emails())
 .param("salary", m.money().locale(GERMANY).range(2000, 5000))
 .list(1000)
 .consume(list -> {
     try {
         Files.write(path, list, CREATE, WRITE);
     } catch (IOException e) {
         e.printStackTrace();
     }
 });

And the possible result is:

0,Ailene,Greener,auldsoutache@gmx.com,4.995,59 €
1,Yung,Skovira,sereglady@mail.com,2.850,23 €
2,Shanelle,Hevia,topslawton@mac.com,2.980,19 €
3,Venice,Lepe,sagelyshroud@mail.com,4.611,83 €
4,Mi,Repko,nonedings@email.com,3.811,38 €
5,Leonie,Slomski,plumpcreola@aol.com,4.584,28 €
6,Elisabeth,Blasl,swartjeni@mail.com,2.839,69 €
7,Ernestine,Syphard,prestoshod@aol.com,3.471,93 €
8,Honey,Winfrey,pseudpatria@email.com,4.276,56 €
9,Dian,Holecek,primbra@att.net,3.643,66 €
10,Mitchell,Lawer,lessjoellen@yahoo.com,3.260,92 €
11,Kayla,Labbee,hobnailmastella@mail.com,2.504,99 €
12,Jann,Grafenstein,douremile@verizon.net,4.535,70 €
13,Shaunna,Uknown,taughtclifton@gmx.com,3.028,81 €
...
This can give you an idea of how to build a generator. The random data can be generated using the Random class, adapted to the data you need to generate.

public interface ICsvRandomGenerator {
    /* Adds the field definition to an array list that describes the CSV. */
    void addFieldDefinition(FieldDefinition fieldDefinition);

    /* Runs a loop for the number of records needed. For each record it goes
       through the FieldDefinition ArrayList, generates the random data based
       on the field definition, and adds it to the current record. The last
       field ends the record and starts a new one. */
    void generateFile(String fileName);
}

public class FieldDefinition {
    String fieldName;
    String fieldType; // Alphabetic, Number, Date, etc.
    int length;
    // getters and setters
}

public abstract class CsvRandomGenerator implements ICsvRandomGenerator {
    ArrayList<FieldDefinition> fields = new ArrayList<>();

    // @Override the interface methods to implement them.

    private String generateRandomAlpha() { ... }
    private String generateRandomDate() { ... }
    private String generateRandomNumber() { ... }
}
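As a concrete starting point for that idea, here is a minimal self-contained sketch using only the JDK (java.util.Random plus a BufferedWriter); the column layout and value ranges are illustrative assumptions, and the sequential id column keeps each row unique, matching the question's "some unique data" requirement:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

public class CsvRandomDataDemo {
    public static void main(String[] args) throws IOException {
        Random rnd = new Random();
        String[] names = {"Alice", "Bob", "Carol", "Dave"};
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("test.csv"))) {
            out.write("id,name,salary\n");
            for (long id = 0; id < 1_000_000; id++) { // one million rows
                String name = names[rnd.nextInt(names.length)];
                double salary = 2000 + rnd.nextDouble() * 3000; // 2000-5000 range
                out.write(id + "," + name + "," + String.format("%.2f", salary) + "\n");
            }
        }
    }
}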
Classify and Predict Multiple Attributes with Weka
I need to input 6 attributes and classify/predict 3 attributes from that input using Java/Weka programmatically. I've figured out how to predict 1 (the last) attribute, but how can I change this to train and predict the last 3 at the same time? The numbers in the .arff files correspond to movie objects in a database. Here is my Java code:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.DecisionStump;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.Remove;

public class WekaTrial {

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // Create training data instance
        Instances training_data = new Instances(new BufferedReader(new FileReader(
                "C:/Users/Me/Desktop/File_Project/src/movie_training.arff")));
        training_data.setClassIndex(training_data.numAttributes() - 1);

        // Create testing data instance
        Instances testing_data = new Instances(new BufferedReader(new FileReader(
                "C:/Users/Me/Desktop/FileProject/src/movie_testing.arff")));
        testing_data.setClassIndex(training_data.numAttributes() - 1);

        // Print initial data summary
        String summary = training_data.toSummaryString();
        int number_samples = training_data.numInstances();
        int number_attributes_per_sample = training_data.numAttributes();
        System.out.println("Number of attributes in model = " + number_attributes_per_sample);
        System.out.println("Number of samples = " + number_samples);
        System.out.println("Summary: " + summary);
        System.out.println();

        // a classifier for decision trees:
        J48 j48 = new J48();

        // filter for removing samples:
        Remove rm = new Remove();
        rm.setAttributeIndices("1"); // remove 1st attribute

        // filtered classifier
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(rm);
        fc.setClassifier(j48);

        // Create counters and print values
        float correct = 0;
        float incorrect = 0;

        // train using movie_training.arff:
        fc.buildClassifier(training_data);

        // test using movie_testing.arff:
        for (int i = 0; i < testing_data.numInstances(); i++) {
            double pred = fc.classifyInstance(testing_data.instance(i));
            System.out.print("Expected values: " + testing_data.classAttribute().value(
                    (int) testing_data.instance(i).classValue()));
            System.out.println(", Predicted values: " + testing_data.classAttribute().value((int) pred));
            // Increment correct/incorrect values
            if (testing_data.classAttribute().value((int) testing_data.instance(i).classValue())
                    .equals(testing_data.classAttribute().value((int) pred))) {
                correct += 1;
            } else {
                incorrect += 1;
            }
        }

        // Print correct/incorrect
        float percent_correct = correct / (correct + incorrect) * 100;
        System.out.println("Number correct: " + correct + "\nNumber incorrect: " + incorrect
                + "\nPercent correct: " + percent_correct + "%");
    }
}

This is my .arff training file (with excess rows removed):

@relation movie_data
@attribute movie1_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie1_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie1_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@data
18,18,18,18,18,18,18,18,18
28,18,36,18,53,10769,18,53,10769
37,37,37,28,12,14,28,12,14
27,53,27,18,10749,10769,27,53,27
12,12,12,35,10751,35,12,12,12
35,18,10749,18,18,18,35,18,10749
28,12,878,53,53,53,53,53,53
18,18,18,28,37,10769,18,18,18
18,53,18,28,12,35,18,53,18
28,80,53,80,18,10749,28,80,53
18,10749,18,18,10756,18,18,10756,18
18,10749,10769,28,12,878,18,10749,10769
18,10756,18,16,35,10751,16,35,10751
35,18,10751,35,18,10752,35,18,10751

And the .arff testing file:

@relation movie_data
@attribute movie1_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie1_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie1_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute movie2_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_one {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_two {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@attribute decision_three {28,12,16,35,80,105,99,18,82,2916,10751,10750,14,10753,10769,36,10595,27,10756,10402,22,9648,10754,1115,10749,878,10755,9805,10758,10757,10748,10770,53,10752,37}
@data
18,27,53,18,53,10756,18,27,53
35,18,10749,18,10769,18,18,10769,18
16,878,53,16,18,16,16,18,16
35,10749,10757,18,18,18,18,18,18
80,18,10748,18,10749,18,18,10749,18
28,18,36,35,18,10751,28,18,36
18,10749,10769,35,18,10402,35,18,10402
28,12,878,18,10749,10769,18,10749,10769
35,10749,35,14,10402,10751,14,10402,10751
If I understood you correctly, you have a "multi-target" (also called "multi-output") problem. You have a couple of simple options to solve it: create a new target class which incorporates all 3 (a concatenation of decision_one, decision_two and decision_three), or train a separate model for each target.
I think the simplest approach would be, as Bella said, to train three separate models, one for each class, possibly removing the other class attributes (depending on whether or not you want the other two classes to influence your classification); see the sketch below.
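A rough sketch of that per-target approach, reusing the asker's J48/FilteredClassifier/Remove setup; the 0-based indices 6-8 for decision_one..decision_three are an assumption read off the ARFF header above:

int[] targets = {6, 7, 8}; // 0-based indices of decision_one..decision_three
FilteredClassifier[] models = new FilteredClassifier[targets.length];
for (int t = 0; t < targets.length; t++) {
    // build a 1-based list of the OTHER two target attributes to remove
    StringBuilder toRemove = new StringBuilder();
    for (int u = 0; u < targets.length; u++) {
        if (u != t) {
            if (toRemove.length() > 0) toRemove.append(",");
            toRemove.append(targets[u] + 1); // Remove uses 1-based indices
        }
    }
    Remove rm = new Remove();
    rm.setAttributeIndices(toRemove.toString());

    // copy the data so each model can use its own class index
    Instances trainCopy = new Instances(training_data);
    trainCopy.setClassIndex(targets[t]);

    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(rm);
    fc.setClassifier(new J48());
    fc.buildClassifier(trainCopy);
    models[t] = fc;
}
// To predict, set the same class index on a copy of the test data and call
// models[t].classifyInstance(...) once per target.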
Predicting data created on-the-fly in WEKA using a premade model file
I want to create a WEKA Java program that reads a group of newly created data that will be fed to a premade model from the GUI version. Here is the program:

import java.util.ArrayList;
import weka.classifiers.Classifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.Utils;

public class UseModelWithData {

    public static void main(String[] args) throws Exception {
        // load model
        String rootPath = "G:/";
        Classifier classifier = (Classifier) weka.core.SerializationHelper.read(rootPath + "j48.model");

        // create instances
        Attribute attr1 = new Attribute("age");
        Attribute attr2 = new Attribute("menopause");
        Attribute attr3 = new Attribute("tumor-size");
        Attribute attr4 = new Attribute("inv-nodes");
        Attribute attr5 = new Attribute("node-caps");
        Attribute attr6 = new Attribute("deg-malig");
        Attribute attr7 = new Attribute("breast");
        Attribute attr8 = new Attribute("breast-quad");
        Attribute attr9 = new Attribute("irradiat");
        Attribute attr10 = new Attribute("Class");

        ArrayList<Attribute> attributes = new ArrayList<Attribute>();
        attributes.add(attr1);
        attributes.add(attr2);
        attributes.add(attr3);
        attributes.add(attr4);
        attributes.add(attr5);
        attributes.add(attr6);
        attributes.add(attr7);
        attributes.add(attr8);
        attributes.add(attr9);
        attributes.add(attr10);

        // predict instance class values
        Instances testing = new Instances("Test dataset", attributes, 0);

        // add data
        double[] values = new double[testing.numAttributes()];
        values[0] = testing.attribute(0).addStringValue("60-69");
        values[1] = testing.attribute(1).addStringValue("ge40");
        values[2] = testing.attribute(2).addStringValue("10-14");
        values[3] = testing.attribute(3).addStringValue("15-17");
        values[4] = testing.attribute(4).addStringValue("yes");
        values[5] = testing.attribute(5).addStringValue("2");
        values[6] = testing.attribute(6).addStringValue("right");
        values[7] = testing.attribute(7).addStringValue("right_up");
        values[8] = testing.attribute(0).addStringValue("yes");
        values[9] = Utils.missingValue();

        // add data to instance
        testing.add(new DenseInstance(1.0, values));

        // instance row to predict
        int index = 10;

        // perform prediction
        double myValue = classifier.classifyInstance(testing.instance(10));

        // get the name of class value
        String prediction = testing.classAttribute().value((int) myValue);

        System.out.println("The predicted value of the instance ["
                + Integer.toString(index) + "]: " + prediction);
    }
}

My references include:

Using a premade WEKA model in Java
the WEKA Manual provided with version 3.7.10 - 17.3 Creating datasets in memory
Creating a single instance for classification in WEKA

So far, the part where I create a new Instance inside the script causes the following error:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 10, Size: 1

in the line

double myValue = classifier.classifyInstance(testing.instance(10));

I just want to use the latest row of instance values with a premade WEKA model. How do I solve this?

Resources: Program file, Arff file, j48.model
You have the error because you are trying to access the 11th instance when you have only created one. If you always want to access the last instance, you might try the following:

double myValue = classifier.classifyInstance(testing.lastInstance());

Additionally, I don't believe that you are creating the instances you hope for. After looking at your provided ".arff" file, which I believe you are trying to mimic, I think you should proceed making instances as follows:

FastVector atts;
FastVector attAge;
Instances testing;
double[] vals;

// 1. set up attributes
atts = new FastVector();

// age
attAge = new FastVector();
attAge.addElement("10-19");
attAge.addElement("20-29");
attAge.addElement("30-39");
attAge.addElement("40-49");
attAge.addElement("50-59");
attAge.addElement("60-69");
attAge.addElement("70-79");
attAge.addElement("80-89");
attAge.addElement("90-99");
atts.addElement(new Attribute("age", attAge));

// 2. create Instances object
testing = new Instances("breast-cancer", atts, 0);

// 3. fill with data
vals = new double[testing.numAttributes()];
vals[0] = attAge.indexOf("10-19");
testing.add(new DenseInstance(1.0, vals));

// 4. output data
System.out.println(testing);

Of course I did not create the whole dataset, but the technique would be the same.
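For what it's worth, FastVector is deprecated in Weka 3.7+; under the newer API the same nominal-attribute setup would look roughly like this (a sketch with only the "age" attribute spelled out, as in the answer above):

// build the nominal "age" attribute from its list of allowed values
ArrayList<String> ageValues = new ArrayList<String>(Arrays.asList(
        "10-19", "20-29", "30-39", "40-49", "50-59",
        "60-69", "70-79", "80-89", "90-99"));
ArrayList<Attribute> atts = new ArrayList<Attribute>();
atts.add(new Attribute("age", ageValues)); // nominal attribute, not numeric

Instances testing = new Instances("breast-cancer", atts, 0);
double[] vals = new double[testing.numAttributes()];
vals[0] = ageValues.indexOf("60-69"); // store the index of the nominal value
testing.add(new DenseInstance(1.0, vals));
System.out.println(testing);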
Weka: add a new instance to a dataset
I have a Weka dataset:

@attribute uid numeric
@attribute itemid numeric
@attribute rating numeric
@attribute timestamp numeric

@data
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
196 51 5 881250949
244 51 2 880606923

If I want to add a new instance like this:

244 59 2 880606923

how can I do it? Something like this?

Instances newData = arffLoader.getDataSet();
for (int i = 0; i < newData.numInstances(); i++) {
    Instance one = newData.instance(i);
    one.setDataset(data);
    data.add(one);
}
Try the following code. What you need to do is create a double array for your new values, then use the DenseInstance class to add them to your Instances object.

public static void main(String[] args) {
    String dataSetFileName = "stackoverflowQuestion.arff";
    Instances data = MyUtilsForWekaInstanceHelper.getInstanceFromFile(dataSetFileName);

    System.out.println("Before adding");
    System.out.println(data);

    double[] instanceValue1 = new double[data.numAttributes()];
    instanceValue1[0] = 244;
    instanceValue1[1] = 59;
    instanceValue1[2] = 2;
    instanceValue1[3] = 880606923;

    DenseInstance denseInstance1 = new DenseInstance(1.0, instanceValue1);
    data.add(denseInstance1);

    System.out.println("-----------------------------------------------------------");
    System.out.println("After adding");
    System.out.println(data);
}

public class MyUtilsForWekaInstanceHelper {
    public static Instances getInstanceFromFile(String pFileName) {
        Instances data = null;
        try {
            BufferedReader reader = new BufferedReader(new FileReader(pFileName));
            data = new Instances(reader);
            reader.close();
            // setting class attribute
            data.setClassIndex(data.numAttributes() - 1);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return data;
    }
}

The output is the following:

Before adding
@relation stackoverflowQuestion
@attribute uid numeric
@attribute itemid numeric
@attribute rating numeric
@attribute timestamp numeric

@data
196,242,3,881250949
186,302,3,891717742
22,377,1,878887116
196,51,5,881250949
244,51,2,880606923
-----------------------------------------------------------
After adding
@relation stackoverflowQuestion
@attribute uid numeric
@attribute itemid numeric
@attribute rating numeric
@attribute timestamp numeric

@data
196,242,3,881250949
186,302,3,891717742
22,377,1,878887116
196,51,5,881250949
244,51,2,880606923
244,59,2,880606923
You can simply append the new line to your ARFF file like:

String filename = "MyDataset.arff";
FileWriter fwriter = new FileWriter(filename, true); // true will append the new instance
fwriter.write("244 59 2 880606923\n"); // appends the string to the file
fwriter.close();
New instances can be easily added to any existing dataset as follows:

// assuming we already have the ARFF loaded in a variable called dataset
DenseInstance newInstance = new DenseInstance(dataset.numAttributes());
newInstance.setDataset(dataset);
for (int i = 0; i < dataset.numAttributes(); i++) {
    newInstance.setValue(i, value); // i is the index of the attribute,
                                    // value is the value that you want to set
}
// add the new instance to the main dataset at the last position
dataset.add(newInstance);
// repeat as necessary