WEKA: Classify instances with a deserialized model - java

I used Weka Explorer:
Loaded the arff file
Applied StringToWordVector filter
Selected IBk as the best classifier
Generated/Saved my_model.model binary
In my Java code I deserialize the model:
URL curl = ClassUtility.findClasspathResource( "models/my_model.model" );
final Classifier cls = (Classifier) weka.core.SerializationHelper.read( curl.openConnection().getInputStream() );
Now I have the classifier, BUT I also need the information about the filter. Where I'm stuck is: how do I prepare an instance to be classified by my deserialized model, i.e. how do I apply the filter before classification? (The raw instance that I have to classify has a text field with tokens in it. The filter was supposed to transform that into a list of new attributes.)
I even tried to use a FilteredClassifier where I set the classifier to the deserialized one and the filter to a manually created instance of StringToWordVector:
final StringToWordVector filter = new StringToWordVector();
filter.setOptions(new String[]{"-C", "-P x_", "-L"});
FilteredClassifier fcls = new FilteredClassifier();
fcls.setFilter(filter);
fcls.setClassifier(cls);
The above does not work either. It throws the exception:
Exception in thread "main" java.lang.NullPointerException: No output instance format defined
What I am trying to avoid is doing the training in the Java code: it can be very slow, I might end up with multiple classifiers to train (different algorithms as well), and I want my app to start fast.

Your problem is that your model doesn't know anything about what the filter did to the data. The StringToWordVector filter changes the data, and how it changes them depends on the input (training) data. A model trained on this transformed data set will only work on data that has undergone the exact same transformation. To guarantee this, the filter needs to be part of your model.
Using a FilteredClassifier is the correct idea, but you have to use it from the beginning:
Load the ARFF file
Select FilteredClassifier as classifier
Select StringToWordVector as filter for it
Select IBk as classifier for the FilteredClassifier
Generate/Save the model to my_model.binary
The trained and serialized model will then also contain the initialized filter, including the information on how to transform data.
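A minimal sketch of that workflow in code (paths are illustrative, and the plain default StringToWordVector stands in for whatever options you configured in the Explorer):

import weka.classifiers.Classifier;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Train: the filter becomes part of the model.
Instances data = DataSource.read("training.arff"); // illustrative path
data.setClassIndex(data.numAttributes() - 1);
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(new StringToWordVector()); // configure the same options as in the Explorer
fc.setClassifier(new IBk());
fc.buildClassifier(data); // initializes the filter on the training data
SerializationHelper.write("models/my_model.model", fc);

// Predict: deserialize and feed *raw* (string) instances; the embedded
// filter transforms them automatically.
Classifier cls = (Classifier) SerializationHelper.read("models/my_model.model");
double label = cls.classifyInstance(data.instance(0));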

Another way to do this is to apply the same filter to your testing data as the one used on the training data. I'll describe the procedure step by step; in your case you just need to follow the steps after loading your serialized classifier.
Create your training file (e.g. training.arff)
Create Instances from the training file: Instances trainingData = ...
Use StringToWordVector to transform your string attributes into a numeric representation:
sample code:
// useIdf, minTermFreq, maxGrams, minGrams and useStemmer are configuration parameters
StringToWordVector filter = new StringToWordVector();
filter.setWordsToKeep(1000000);
if (useIdf) {
    filter.setIDFTransform(true);
}
filter.setTFTransform(true);
filter.setLowerCaseTokens(true);
filter.setOutputWordCounts(true);
filter.setMinTermFreq(minTermFreq);
filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER));
NGramTokenizer t = new NGramTokenizer();
t.setNGramMaxSize(maxGrams);
t.setNGramMinSize(minGrams);
filter.setTokenizer(t);
WordsFromFile stopwords = new WordsFromFile(); // weka.core.stopwords.WordsFromFile
stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
filter.setStopwordsHandler(stopwords);
if (useStemmer) {
    Stemmer s = new /*Iterated*/LovinsStemmer();
    filter.setStemmer(s);
}
filter.setInputFormat(trainingData);
Apply the filter to trainingData: trainingData = Filter.useFilter(trainingData, filter);
Select a classifier to create your model
sample code for the LibLINEAR classifier:
Classifier cls = null;
LibLINEAR liblinear = new LibLINEAR();
liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
liblinear.setProbabilityEstimates(true);
// liblinear.setBias(1); // default value
cls = liblinear;
cls.buildClassifier(trainingData);
Save model
sample code
System.out.println("Saving the model...");
ObjectOutputStream oos;
oos = new ObjectOutputStream(new FileOutputStream(path+"mymodel.model"));
oos.writeObject(cls);
oos.flush();
oos.close();
Create a testing file (e.g. testing.arff)
Create Instances from the testing file: Instances testingData = ...
Load classifier
sample code
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path+"mymodel.model");
Use the same StringToWordVector filter as above, or create a new one for testingData, but remember to use the trainingData for this command: filter.setInputFormat(trainingData); This keeps the format of the training set and will not add words that are not in the training set.
Apply the filter to testingData: testingData = Filter.useFilter(testingData, filter);
Classify!
sample code
for (int j = 0; j < testingData.numInstances(); j++) {
    double res = myCls.classifyInstance(testingData.get(j));
}
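If you go this route, one way to avoid re-reading the training data at prediction time is to serialize the initialized filter alongside the classifier, for example with SerializationHelper.writeAll/readAll (a sketch; the file name is illustrative):

// After training (the filter was initialized via filter.setInputFormat(trainingData)):
weka.core.SerializationHelper.writeAll(path + "mymodel.model", new Object[]{cls, filter});

// At prediction time, restore both and filter incoming data with the very same filter instance:
Object[] parts = weka.core.SerializationHelper.readAll(path + "mymodel.model");
Classifier myCls = (Classifier) parts[0];
Filter myFilter = (Filter) parts[1];
testingData = Filter.useFilter(testingData, myFilter);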

Related

Not able to apply trained model to classify testdata using Weka in Java

I am using Weka to do text classification. I have created a NaiveBayes model using the Weka GUI, saved that model, and was then trying to use it to classify instances of a test set. This is my code:
Classifier clsClassifier = (Classifier) weka.core.SerializationHelper.read("Source/test/80percentModel.model");
StringToWordVector filter = new StringToWordVector();
BufferedReader reader = new BufferedReader(
new FileReader("Source/test/clt.train.arff"));
Instances trainingData = new Instances(reader);
reader.close();
trainingData.setClassIndex(trainingData.numAttributes() - 1);
filter.setInputFormat(trainingData);
BufferedReader reader2 = new BufferedReader(
new FileReader("Source/test/clt.test.arff"));
Instances testingData = new Instances(reader2);
reader2.close();
testingData.setClassIndex(testingData.numAttributes() - 1);
testingData = Filter.useFilter(testingData, filter);
System.out.println(testingData.numInstances());
for (int j = 0; j < testingData.numInstances(); j++) {
    double res = clsClassifier.classifyInstance(testingData.get(j));
    System.out.println(testingData.classAttribute().value((int) res));
}
I am getting the following error:
java.lang.IllegalArgumentException: Src and Dest differ in # of attributes: 1 != 1781
at weka.core.RelationalLocator.copyRelationalValues(RelationalLocator.java:87)
at weka.filters.Filter.copyValues(Filter.java:405)
at weka.filters.Filter.push(Filter.java:326)
at weka.filters.unsupervised.attribute.StringToWordVector.input(StringToWordVector.java:655)
at weka.classifiers.meta.FilteredClassifier.filterInstance(FilteredClassifier.java:672)
at weka.classifiers.meta.FilteredClassifier.distributionForInstance(FilteredClassifier.java:699)
at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:173)
at test.WekaClassification.main(WekaClassification.java:66)
I don't quite get what I am doing wrong here. Why is there a mismatch in the number of attributes? And is this the correct way to apply a trained model to a test data set?
There are several possibilities, but your error shows that the number of attributes differs between your trained model and your dataset. They must be in exactly the same format: same attributes, types, and values. The test file should also have values for the label attribute. Check whether you get the same error in the Weka GUI. StringToWordVector may not be filtering the way you expect; check its output by inspecting the contents.
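Note also that the stack trace goes through weka.classifiers.meta.FilteredClassifier, which suggests the saved model already embeds a StringToWordVector filter. If that is the case, the raw (unfiltered) test instances should be passed to the classifier directly; filtering them again is what produces the mismatched attribute counts. A sketch under that assumption:

Classifier clsClassifier = (Classifier) weka.core.SerializationHelper.read("Source/test/80percentModel.model");
Instances testingData = new Instances(new BufferedReader(new FileReader("Source/test/clt.test.arff")));
testingData.setClassIndex(testingData.numAttributes() - 1);
for (int j = 0; j < testingData.numInstances(); j++) {
    // no Filter.useFilter here - the FilteredClassifier filters internally
    double res = clsClassifier.classifyInstance(testingData.get(j));
    System.out.println(testingData.classAttribute().value((int) res));
}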

Weka how to predict new unseen Instance using Java Code?

I wrote WEKA Java code to train 4 classifiers. I saved the classifier models and want to use them to predict new unseen instances (think of it as someone who wants to test whether a tweet is positive or negative).
I used the StringToWordVector filter on the training data. To avoid the "Src and Dest differ in # of attributes" error, I used the following code to initialize the filter on the training data before applying the filter to the new instance, trying to predict whether the new instance is positive or negative. And I just can't get it right.
Classifier cls = (Classifier) weka.core.SerializationHelper.read("models/myModel.model"); //reading one of the trained classifiers
BufferedReader datafile = readDataFile("Tweets/tone1.ARFF"); //read training data
Instances data = new Instances(datafile);
data.setClassIndex(data.numAttributes() - 1);
Filter filter = new StringToWordVector(50);//keep 50 words
filter.setInputFormat(data);
Instances filteredData = Filter.useFilter(data, filter);
// rebuild classifier
cls.buildClassifier(filteredData);
String testInstance= "Text that I want to use as an unseen instance and predict whether it's positive or negative";
System.out.println(">create test instance");
FastVector attributes = new FastVector(2);
attributes.addElement(new Attribute("text", (FastVector) null));
// Add class attribute.
FastVector classValues = new FastVector(2);
classValues.addElement("Negative");
classValues.addElement("Positive");
attributes.addElement(new Attribute("Tone", classValues));
// Create dataset with initial capacity of 100, and set index of class.
Instances tests = new Instances("test instance", attributes, 100);
tests.setClassIndex(tests.numAttributes() - 1);
Instance test = new Instance(2);
// Set value for message attribute
Attribute messageAtt = tests.attribute("text");
test.setValue(messageAtt, messageAtt.addStringValue(testInstance));
test.setDataset(tests);
Filter filter2 = new StringToWordVector(50);
filter2.setInputFormat(tests);
Instances filteredTests = Filter.useFilter(tests, filter2);
System.out.println(">train Test filter using training data");
Standardize sfilter = new Standardize(); //Match the number of attributes between src and dest.
sfilter.setInputFormat(filteredData); // initializing the filter with training set
filteredTests = Filter.useFilter(filteredData, sfilter); // create new test set
ArffSaver saver = new ArffSaver(); //save test data to ARFF file
saver.setInstances(filteredTests);
File unseenFile = new File ("Tweets/unseen.ARFF");
saver.setFile(unseenFile);
saver.writeBatch();
When I try to standardize the input data using the filtered training data, I get a new ARFF file (unseen.ARFF), but with 2000 instances (the same number as the training data), where most of the values are negative. I don't understand why, or how to remove those instances.
System.out.println(">Evaluation"); //without the following 2 lines I get ArrayIndexOutOfBoundException.
filteredData.setClassIndex(filteredData.numAttributes() - 1);
filteredTests.setClassIndex(filteredTests.numAttributes() - 1);
Evaluation eval = new Evaluation(filteredData);
eval.evaluateModel(cls, filteredTests);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
Printing the evaluation results, I want to see, for example, a percentage of how positive or negative this instance is, but instead I get the following. I also want to see 1 instance instead of 2000. Any help on how to do this would be great.
> Results
======
Correlation coefficient 0.0285
Mean absolute error 0.8765
Root mean squared error 1.2185
Relative absolute error 409.4123 %
Root relative squared error 121.8754 %
Total Number of Instances 2000
Thanks
Use eval.predictions(). It is a java.util.ArrayList<Prediction>. Then you can use the Prediction.weight() method to get how positive or negative your test variable is.
cls.distributionForInstance(newInst) returns the probability distribution for an instance. Check the docs.
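For example, a short sketch (the indices into the returned array follow the order of the class values in the header, so 0/1 here assume Negative/Positive in that order):

double[] dist = cls.distributionForInstance(filteredTests.instance(0));
System.out.println("P(Negative) = " + dist[0]);
System.out.println("P(Positive) = " + dist[1]);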
I have reached a good solution, and here I share my code with you. It trains a classifier using WEKA Java code, then uses it to predict new unseen instances. Some parts, like paths, are hardcoded, but you can easily modify the method to take parameters.
/**
 * This method performs classification of unseen instances.
 * It starts by training a model using a selection of classifiers, then classifies new unlabeled instances.
 */
public static void predict() throws Exception {
    // start by providing the paths for your training and testing ARFF files;
    // make sure both files have the same structure and the exact classes in the header
    // initialise classifier
    Classifier classifier = null;
    System.out.println("read training arff");
    Instances train = new Instances(new BufferedReader(new FileReader("Train.arff")));
    train.setClassIndex(0); // in my case the class was the first attribute, thus zero; otherwise it's the number of attributes - 1
    System.out.println("read testing arff");
    Instances unlabeled = new Instances(new BufferedReader(new FileReader("Test.arff")));
    unlabeled.setClassIndex(0);
    // training using a collection of classifiers (NaiveBayes, SMO (AKA SVM), KNN and Decision trees.)
    String[] algorithms = {"nb", "smo", "knn", "j48"};
    for (int w = 0; w < algorithms.length; w++) {
        if (algorithms[w].equals("nb"))
            classifier = new NaiveBayes();
        if (algorithms[w].equals("smo"))
            classifier = new SMO();
        if (algorithms[w].equals("knn"))
            classifier = new IBk();
        if (algorithms[w].equals("j48"))
            classifier = new J48();
        System.out.println("==========================================================================");
        System.out.println("training using " + algorithms[w] + " classifier");
        Evaluation eval = new Evaluation(train);
        // perform 10-fold cross-validation
        eval.crossValidateModel(classifier, train, 10, new Random(1));
        String output = eval.toSummaryString();
        System.out.println(output);
        String classDetails = eval.toClassDetailsString();
        System.out.println(classDetails);
        classifier.buildClassifier(train);
    }
    Instances labeled = new Instances(unlabeled);
    // label instances (use the trained classifier to classify new unseen instances)
    for (int i = 0; i < unlabeled.numInstances(); i++) {
        double clsLabel = classifier.classifyInstance(unlabeled.instance(i));
        labeled.instance(i).setClassValue(clsLabel);
        System.out.println(clsLabel + " -> " + unlabeled.classAttribute().value((int) clsLabel));
    }
    // save the model for future use
    ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("myModel.dat"));
    out.writeObject(classifier);
    out.close();
    System.out.println("===== Saved model =====");
}

java initialize object from file

I am currently writing a program that deals with graphs created by the jgrapht library. I have multiple graphs of the form:
UndirectedGraph<Integer, DefaultEdge> g_x = new SimpleGraph<Integer, DefaultEdge>(DefaultEdge.class);
g_x.addVertex(1);
g_x.addVertex(2);
g_x.addVertex(3);
g_x.addEdge(1, 2);
g_x.addEdge(2, 4);
...
which are constant graphs associated with street maps that I am given as files. Right now I have all of my graphs declared in my main method and just reference the graph I want when a map is loaded. What I would like to do is have another file paired with each map (i.e map1.map and map1.graph) so that when I load the map from a file I can also load the graph like:
map = loadMap(mapName);
g_x = loadGraph(mapName);
where mapName is the file name prefix, so that I don't have to store the graphs in my source code. Is it possible to do this in Java, and if so, how would I create the files and load them? Would it also be possible to do this with a generic Object?
One option is to serialize your objects to XML or JSON (you could change the .xml to .map if you really wanted). Then you can open the XML in your code for each object you wish to load.
Serializing:
File file = new File(**filename**);
FileOutputStream out = new FileOutputStream(file);
XStream xmlStream = new XStream(new DomDriver());
out.write(xmlStream.toXML(**ObjectToSave**).getBytes());
out.close();
Deserializing:
try {
    XStream xmlStream = new XStream(new DomDriver());
    state = (**ClassNameYouWishToSave**) xmlStream.fromXML(new FileInputStream(**filename**));
} catch (IOException e) {
    e.printStackTrace();
}
You will need these imports:
import com.thoughtworks.xstream.XStream;
import com.thoughtworks.xstream.io.xml.DomDriver;
It is a simplistic way to do it, but it works. Hope it helps.
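Wrapped into small generic helpers, this gives roughly the loadMap/loadGraph shape described in the question (a sketch built on the snippets above; the method names and the .graph extension are illustrative):

public static void save(Object obj, String filename) throws IOException {
    XStream xmlStream = new XStream(new DomDriver());
    try (FileOutputStream out = new FileOutputStream(new File(filename))) {
        out.write(xmlStream.toXML(obj).getBytes());
    }
}

@SuppressWarnings("unchecked")
public static <T> T load(String filename) throws IOException {
    XStream xmlStream = new XStream(new DomDriver());
    return (T) xmlStream.fromXML(new FileInputStream(filename));
}

// usage with the file-name prefix idea from the question:
// UndirectedGraph<Integer, DefaultEdge> g_x = load(mapName + ".graph");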

How to write union when creating Avro file in Java

I'm trying to create an Avro file in Java (just testing code at the moment). Everything works fine; the code looks about like this:
GenericRecord record = new GenericData.Record(schema);
File file = new File("test.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
dataFileWriter.append(record);
dataFileWriter.close();
The problem I'm facing now is: what kind of Java object do I instantiate when I want to write a union? Not necessarily at the top level; it could be a union attached to a record being written. There are objects prepared for a few complex types, like GenericData.Record, GenericData.Array, etc. For those that are not, the right object is usually a standard Java object (classes implementing java.util.Map for the "map" Avro type, etc.).
But I cannot figure out what the right object to instantiate for writing a union is.
This question refers to writing Avro file WITHOUT code generation. Any help is very much appreciated.
Here's what I did:
Suppose the schema is defined like this:
record MyStructure {
    ...
    record MySubtype {
        int p1;
    }
    union {null, MySubtype} myField = null;
    ...
}
And this is the Java code:
Schema schema; // the schema of the main structure
// ....
GenericRecord rec = new GenericData.Record(schema);
int i = schema.getField("myField").schema().getIndexNamed("MySubtype");
GenericRecord myField = new GenericData.Record(schema.getField("myField").schema().getTypes().get(i));
myField.put("p1", 100);
rec.put("myField", myField);

Convert embedded pictures in database

I have a 'small' problem. In a database, documents contain a rich text field. The rich text field contains a profile picture of a certain contact. The problem is that this content is not saved as MIME, and therefore I cannot calculate the URL of the image.
I'm using a POJO to retrieve data from the person profile and use this in my XPage control to display its contents. I need to build a conversion agent which takes the content of the rich text item and converts it to MIME, to be able to calculate a URL, something like:
http://host/database.nsf/($users)/D40FE4181F2B86CCC12579AB0047BD22/Photo/M2?OpenElement
Could someone help me with converting the contents of the rich text item to MIME? When I check for embedded objects in the RT field, there are none. When I get the content of the field as a stream and save it to a new rich text field using the following code, the new field is somehow not created.
System.out.println("check if document contains a field with name "+fieldName);
if(!doc.hasItem(fieldName)){
throw new PictureConvertException("Could not locate richtextitem with name"+fieldName);
}
RichTextItem pictureField = (RichTextItem) doc.getFirstItem(fieldName);
System.out.println("Its a richtextfield..");
System.out.println("Copy field to backup field");
if(doc.hasItem("old_"+fieldName)){
doc.removeItem("old_"+fieldName);
}
pictureField.copyItemToDocument(doc, "old_"+fieldName);
// Vector embeddedPictures = pictureField.getEmbeddedObjects();
// System.out.println(doc.hasEmbedded());
// System.out.println("Retrieved embedded objects");
// if(embeddedPictures.isEmpty()){
// throw new PictureConvertException("No embedded objects could be found.");
// }
//
// EmbeddedObject photo = (EmbeddedObject) embeddedPictures.get(0);
System.out.println("Create inputstream");
//s.setConvertMime(false);
InputStream iStream = pictureField.getInputStream();
System.out.println("Create notesstream");
Stream nStream = s.createStream();
nStream.setContents(iStream);
System.out.println("Create mime entity");
MIMEEntity mEntity = doc.createMIMEEntity("PictureTest");
MIMEHeader cdheader = mEntity.createHeader("Content-Disposition");
System.out.println("Set header withfilename picture.gif");
cdheader.setHeaderVal("attachment;filename=picture.gif");
System.out.println("Setcontent type header");
MIMEHeader cidheader = mEntity.createHeader("Content-ID");
cidheader.setHeaderVal("picture.gif");
System.out.println("Set content from stream");
mEntity.setContentFromBytes(nStream, "application/gif", mEntity.ENC_IDENTITY_BINARY);
System.out.println("Save document..");
doc.save();
//s.setConvertMime(true);
System.out.println("Done");
// Clean up if we are done..
//doc.removeItem(fieldName);
It's been a little while now, and I didn't go down the route of converting existing data to MIME. I could not get it to work, and after some more research it seemed unnecessary. Because the issue is about displaying images bound to a rich text field, I did some research on how to compute the URL for an image, and I came up with the following lines of code:
function getImageURL(doc:NotesDocument, strRTItem, strFileType) {
    if (doc != null && !"".equals(strRTItem)) {
        var rtItem = doc.getFirstItem(strRTItem);
        if (rtItem != null) {
            var personelDB = doc.getParentDatabase();
            var dbURL = getDBUrl(personelDB);
            var imageURL:java.lang.StringBuffer = new java.lang.StringBuffer(dbURL);
            if ("file".equals(strFileType)) {
                var embeddedObjects:java.util.Vector = rtItem.getEmbeddedObjects();
                if (!embeddedObjects.isEmpty()) {
                    var file:NotesEmbeddedObject = embeddedObjects.get(0);
                    imageURL.append("(lookupView)\\");
                    imageURL.append(doc.getUniversalID());
                    imageURL.append("\\$File\\");
                    imageURL.append(file.getName());
                    imageURL.append("?Open");
                }
            } else {
                imageURL.append(doc.getUniversalID());
                imageURL.append("/" + strRTItem + "/");
                if (rtItem instanceof lotus.domino.local.RichTextItem) {
                    imageURL.append("0.C4?OpenElement");
                } else {
                    imageURL.append("M2?OpenElement");
                }
            }
            return imageURL.toString();
        }
    }
}
It will check whether a given RT field is present. If this is the case, it assumes a few things:
If there are files in the RT field, the first file is the picture to display;
else it will create the appropriate URL: if the item is of RT type it uses one URL form, otherwise it assumes the item is a MIME entity and generates another URL.
Not sure if this is an answer, but I can't seem to add comments yet. Have you verified that there is something in your stream?
if (stream.getBytes() != 0) {
The issue cannot be resolved "ideally" in Java.
1) If you convert to MIME, you screw up the original Notes rich text. MIME allows only a sad approximation of the original content; this might or might not matter.
If it matters, it's possible to convert a copy of the original field to MIME, used only for display purposes, or to scrape it out using DXL and store it separately; however, this approach again means a synchronization issue every time somebody changes the image in the original RT item.
2) Computing the URL as in the OP's code in the accepted self-answer is not possible in general, as the constant 0.C4 in that example relates to the offset of the image within the binary data of the RT item. Any other design of the rich text field, manually entered images, or images created by a different version of Notes will all influence the offset.
3) The URL can be computed correctly only by using the C API, which allows one to inspect the binary data in the rich text item. This cannot be done from Java, IMO (without building JNI bridges, etc.).
