How to use StringToWordVector (Weka) in Java?

This is my ARFF file:
@relation hamspam
@attribute text string
@attribute class {ham,spam}
@data
'good',ham
'very good',ham
'bad',spam
'very bad',spam
'very bad, very bad',spam
What I want to do is classify this data with a Weka classifier in my Java program, but I don't know how to apply StringToWordVector and then classify the result.
This is my code:
Classifier j48tree = new J48();
Instances train = new Instances(new BufferedReader(new FileReader("data.arff")));
StringToWordVector filter = new StringToWordVector();
What next? I don't know what to do.

// import required classes
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stemmers.LovinsStemmer;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ClassifierWithFilter {
    public static void main(String[] args) throws Exception {
        // load dataset
        DataSource source = new DataSource("/Users/amaryadav/Desktop/spamham.arff");
        Instances dataset = source.getDataSet();
        // set class index to the last attribute
        dataset.setClassIndex(dataset.numAttributes() - 1);
        // the base classifier
        J48 tree = new J48();
        // the filter; no setInputFormat() call is needed here because
        // FilteredClassifier initializes the filter when it is built
        StringToWordVector filter = new StringToWordVector();
        filter.setIDFTransform(true);
        // setUseStoplist is the Weka 3.6 API; later releases use setStopwordsHandler instead
        filter.setUseStoplist(true);
        filter.setStemmer(new LovinsStemmer());
        filter.setLowerCaseTokens(true);
        // create the FilteredClassifier object
        FilteredClassifier fc = new FilteredClassifier();
        // specify filter
        fc.setFilter(filter);
        // specify base classifier
        fc.setClassifier(tree);
        // build the meta-classifier (filters the data, then trains the tree)
        fc.buildClassifier(dataset);
        System.out.println(tree.graph());
        System.out.println(tree);
    }
}
This code wraps a J48 decision tree in a FilteredClassifier, so StringToWordVector converts the text attribute into word attributes before the tree is trained on spamham.arff. Hope that helps.
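To get a feel for what the filter hands to the tree, here is a minimal plain-Java sketch of the bag-of-words idea, run on the five training strings from the question. It is not Weka code: the class BagOfWordsSketch and its methods are made up for illustration, and it leaves out stemming, the stoplist and the IDF transform. It also records word counts, whereas StringToWordVector by default only records whether a word is present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Weka: a stripped-down picture of the
// word-vector idea (no stemming, no stoplist, no IDF).
class BagOfWordsSketch {

    // Lower-case the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Give each distinct word a column index, in order of first appearance.
    static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String token : tokenize(doc)) {
                vocab.putIfAbsent(token, vocab.size());
            }
        }
        return vocab;
    }

    // Turn one document into a vector of word counts over the vocabulary.
    static int[] vectorize(String document, Map<String, Integer> vocab) {
        int[] counts = new int[vocab.size()];
        for (String token : tokenize(document)) {
            Integer column = vocab.get(token);
            if (column != null) {
                counts[column]++; // words outside the vocabulary are ignored
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "good", "very good", "bad", "very bad", "very bad, very bad");
        Map<String, Integer> vocab = buildVocabulary(docs);
        System.out.println("columns: " + vocab.keySet());
        for (String doc : docs) {
            System.out.println(doc + " -> " + Arrays.toString(vectorize(doc, vocab)));
        }
    }
}
```

Each text becomes a row of numbers with one column per word, which is the kind of input J48 can split on.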

Related

How to create an attribute in Weka

I'm working on a data mining project using Weka in Java, and the instructions say that I have to create an Attribute object for each attribute in the dataset and add them to a FastVector. I tried looking at the API, but I don't think I'm doing it right. Can someone show me the right way to do it? I'm using the iris.arff file.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instances;
import weka.core.converters.ArffSaver;

public class StartWeka {
    public static void main(String[] args) throws Exception {
        Instances dataset = new Instances(new BufferedReader(new FileReader("C:/Users/Student/workspace/Data Mining/src/iris.arff.txt")));
        Instances train = new Instances(dataset);
        train.setClassIndex(train.numAttributes() - 1);
        System.out.println(dataset.toSummaryString());
        Attribute a1 = new Attribute("sepallength", 0);
        Attribute a2 = new Attribute("sepalwidth", 1);
        Attribute a3 = new Attribute("petalwidth", 2);
        FastVector attrs = new FastVector();
        attrs.addElement(a1);
    }
}
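FastVector is deprecated in recent Weka versions; you can use an ArrayList<Attribute> instead.
If you load an ARFF file, however, you don't have to do any of that, because the attributes are read from the file's header. You can just do the following:
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

ArffLoader loader = new ArffLoader();
loader.setFile(new File("iris.arff"));
// getStructure() returns only the header (attributes, no instances)
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);
From here, you can set up a classifier based on this structure. Note that getStructure() does not read the data rows; call loader.getDataSet() when you need the instances themselves.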
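To see why no hand-written Attribute objects are needed, it may help to look at where the attribute information lives in the file. The following is a rough plain-Java sketch, not Weka's parser: the class ArffHeaderSketch is hypothetical, and it only handles simple unquoted attribute names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not Weka's ARFF parser: collects the attribute names
// declared in the header section of an ARFF document.
class ArffHeaderSketch {

    static List<String> attributeNames(String arffText) {
        List<String> names = new ArrayList<>();
        for (String line : arffText.split("\r?\n")) {
            String trimmed = line.trim();
            // ARFF keywords are case-insensitive; Locale.ROOT avoids surprises
            // such as the Turkish dotless i.
            String lower = trimmed.toLowerCase(Locale.ROOT);
            if (lower.startsWith("@data")) {
                break; // header ends here, data rows follow
            }
            if (lower.startsWith("@attribute")) {
                // "@attribute <name> <type>": the name is the second token
                String[] parts = trimmed.split("\\s+", 3);
                if (parts.length >= 2) {
                    names.add(parts[1]);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String arff = "@relation iris\n"
                + "@attribute sepallength numeric\n"
                + "@attribute sepalwidth numeric\n"
                + "@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}\n"
                + "@data\n"
                + "5.1,3.5,Iris-setosa\n";
        System.out.println(attributeNames(arff));
    }
}
```

The loader does this work (and much more) for you, which is why the Instances object it returns already knows every attribute.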

Jena Fuseki API: add new data to an existing dataset [java]

I am trying to upload an RDF/OWL file to my SPARQL endpoint (provided by Fuseki). Right now I am able to upload a single file, but if I repeat the action, the new data overwrites the old dataset. I'm looking for a way to merge the data already in the dataset with the contents of the newly uploaded RDF file. Can anyone help me? Thanks.
Below is the code used to upload to and query the endpoint (I'm not the author):
// Written in 2015 by Thilo Planz
// To the extent possible under law, I have dedicated all copyright and related and neighboring rights
// to this software to the public domain worldwide. This software is distributed without any warranty.
// http://creativecommons.org/publicdomain/zero/1.0/
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ByteArrayOutputStream;
import org.apache.jena.query.DatasetAccessor;
import org.apache.jena.query.DatasetAccessorFactory;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.RDFNode;
class FusekiExample {
    public static void uploadRDF(File rdf, String serviceURI)
            throws IOException {
        // parse the file
        Model m = ModelFactory.createDefaultModel();
        try (FileInputStream in = new FileInputStream(rdf)) {
            m.read(in, null, "RDF/XML");
        }
        // upload the resulting model
        DatasetAccessor accessor = DatasetAccessorFactory.createHTTP(serviceURI);
        accessor.putModel(m);
    }

    public static void execSelectAndPrint(String serviceURI, String query) {
        QueryExecution q = QueryExecutionFactory.sparqlService(serviceURI,
                query);
        ResultSet results = q.execSelect();
        // write to a ByteArrayOutputStream
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        // convert to JSON format
        ResultSetFormatter.outputAsJSON(outputStream, results);
        // turn JSON into a string
        String json = new String(outputStream.toByteArray());
        // print JSON string
        System.out.println(json);
    }

    public static void execSelectAndProcess(String serviceURI, String query) {
        QueryExecution q = QueryExecutionFactory.sparqlService(serviceURI,
                query);
        ResultSet results = q.execSelect();
        while (results.hasNext()) {
            QuerySolution soln = results.nextSolution();
            // assumes that you have an "?x" in your query
            RDFNode x = soln.get("x");
            System.out.println(x);
        }
    }

    public static void main(String[] argv) throws IOException {
        // uploadRDF(new File("test.rdf"), );
        uploadRDF(new File("test.rdf"), "http://localhost:3030/MyEndpoint/data");
    }
}
Use accessor.add(m) instead of accessor.putModel(m). As you can see in the Javadoc for DatasetAccessor, putModel replaces the existing data in the default graph, whereas add merges the new statements into it.
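The difference between the two calls is the usual replace versus merge distinction. Here is a small plain-Java sketch of that behaviour, assuming a graph can be pictured as a set of triples. It is not Jena code: GraphStoreSketch and its methods are hypothetical stand-ins, and each triple is just a string.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical in-memory stand-in for a graph store; not Jena code.
// Each triple is modelled as a plain string for brevity.
class GraphStoreSketch {
    private Set<String> triples = new LinkedHashSet<>();

    // Replace semantics (the behaviour of putModel): the old graph is
    // discarded and only the uploaded triples remain.
    void put(Set<String> uploaded) {
        triples = new LinkedHashSet<>(uploaded);
    }

    // Merge semantics (the behaviour of add): uploaded triples are added
    // to whatever is already stored; duplicates collapse.
    void add(Set<String> uploaded) {
        triples.addAll(uploaded);
    }

    Set<String> contents() {
        return triples;
    }

    public static void main(String[] args) {
        Set<String> first = new LinkedHashSet<>(Arrays.asList(":a :knows :b"));
        Set<String> second = new LinkedHashSet<>(Arrays.asList(":b :knows :c"));

        GraphStoreSketch replaced = new GraphStoreSketch();
        replaced.put(first);
        replaced.put(second);   // first upload is lost
        System.out.println("put twice: " + replaced.contents());

        GraphStoreSketch merged = new GraphStoreSketch();
        merged.add(first);
        merged.add(second);     // both uploads are kept
        System.out.println("add twice: " + merged.contents());
    }
}
```

With put, the second upload is all that survives; with add, the store ends up holding the union of both uploads, which is what the question asks for.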

Classify tweets in Java using Weka

I have some tweets on which I want to do sentiment analysis. I fetched the tweets using Twitter4J and then decided to use the Weka libraries for methods such as KMeans, Naive Bayes, SVM, etc.
First, I copied the tweets into a text file by hand and labeled their classes myself. This is my training data. In my code I read this file and try to train and test my model, but I get the error
Exception in thread "main" weka.core.UnsupportedAttributeTypeException: Cannot handle string attributes!
To fix it I used the StringToWordVector filter, but that didn't work either. Here is my code:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;
public class Driver {
    public static BufferedReader readDataFile(String filename) {
        BufferedReader inputReader = null;
        try {
            inputReader = new BufferedReader(new FileReader(filename));
        } catch (FileNotFoundException ex) {
            System.err.println("File not found: " + filename);
        }
        return inputReader;
    }

    public static void main(String[] args) throws Exception {
        BufferedReader datafile = readDataFile("file.txt");
        Instances data = new Instances(datafile);
        data.setClassIndex(data.numAttributes() - 1);
        FilteredClassifier fc = new FilteredClassifier();
        Classifier cModel = (Classifier) new IBk();
        cModel.buildClassifier(data);
        StringToWordVector swv = new StringToWordVector();
        fc.setFilter(swv);
        fc.setClassifier(cModel);
        // Test the model
        Evaluation eTest = new Evaluation(data);
        eTest.evaluateModel(cModel, data);
        // Print the result à la Weka explorer:
        String strSummary = eTest.toSummaryString();
        System.out.println(strSummary);
        // Get the confusion matrix
        double[][] cmMatrix = eTest.confusionMatrix();
        for (int row_i = 0; row_i < cmMatrix.length; row_i++) {
            for (int col_i = 0; col_i < cmMatrix.length; col_i++) {
                System.out.print(cmMatrix[row_i][col_i]);
                System.out.print("|");
            }
            System.out.println();
        }
    }
}
This is my file.txt:
@relation twitter
@attribute tweetMsg string
@attribute class{positive,negative,neutral}
@data
"bugün hava çok güzel",positive
"hiç iyi hissetmiyorum",negative
"hayat çok normal",neutral
"Diriliş Ertuğrul izlerken her türlü kumpasın döndüğünü görmek ama günün birinde Osmanlı Beyliği' nin kurulacağını bilmenin huzuru ?",positive
"Diriliş Ertuğrul dizisi ile tarihe merakim arttı ??",positive
"Kanka moralim bozuk diyorum boşver kanka gel diriliş ertuğrul izleyelim diyor yemin ederim kanka gibi kanka .",positive
"Diriliş Ertuğrul beni son zamanlarda futbol dışında TVde tutan tek yapım kurgusu, görseli süper",positive
"#kösemsultan Osmanlının gerçek yüzünü çıkardıkları için mi hoşunuza gitmiyor Diriliş Ertuğrul saçmalığın alası hadi onuda şikayet edin!!!",negative
"Benim için LeylaileMecnun neyse abim için Diriliş Ertuğrul da o.",neutral
"#MutlulukNeDiyeSorsalar diriliş Ertuğrul izlemek derim",positive
"beyler muhteşem yüz yıl kösemi izliyorum da diriliş ertuğrul bu diziye 10 takar. saray saray değil kadınlar hamamı sanki.",positive
"Diriliş Ertuğrul diziside ne boktan bir senaryo arkadaş. Herif 4 bölümde bir hain ilan edilip sonra obaya geri geliyor sonra yine hain :):)",negative
"Diriliş Ertuğrul izlemekten babama beyim dedim amk",neutral
"Diriliş ertuğrul haric bütün Türk dizileri saçmalik broo",positive
Note that these tweets are in Turkish. Do you think I am going about this the right way, or should I do something more involved, such as stemming the words first?
Any help with my questions will be appreciated.
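Read the error message:
Cannot handle string attributes!
It refers to this line:
@attribute tweetMsg string
The classifier IBk does not support string attributes. In the code above, cModel (the IBk instance) is built and evaluated directly on the unfiltered data, while the FilteredClassifier fc that wraps StringToWordVector is configured but never trained or evaluated.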
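The point is that the text has to be converted before a nearest-neighbour learner can compare anything. As a rough plain-Java illustration (not Weka code; FilterThenClassifySketch, its distance measure and the English sample texts are all made up for this sketch), here is a 1-nearest-neighbour classifier that always runs its own text-to-words step first, for training texts and query alike, in the spirit of a classifier wrapped together with its filter:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not Weka code: 1-nearest-neighbour over word sets.
// The conversion from text to words happens inside the classifier, so raw
// strings never reach the distance computation.
class FilterThenClassifySketch {
    private final List<String> texts;
    private final List<String> labels;

    FilterThenClassifySketch(List<String> texts, List<String> labels) {
        if (texts.isEmpty() || texts.size() != labels.size()) {
            throw new IllegalArgumentException("need one label per training text");
        }
        this.texts = texts;
        this.labels = labels;
    }

    // The "filter" step: text -> set of lower-cased words.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Distance = number of words that occur in exactly one of the two texts.
    static int distance(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return union.size() - common.size();
    }

    // The "classifier" step: return the label of the closest training text.
    String classify(String text) {
        Set<String> query = words(text);
        int best = 0;
        int bestDistance = Integer.MAX_VALUE;
        for (int i = 0; i < texts.size(); i++) {
            int d = distance(query, words(texts.get(i)));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return labels.get(best);
    }

    public static void main(String[] args) {
        FilterThenClassifySketch model = new FilterThenClassifySketch(
                Arrays.asList("the weather is lovely today", "i feel terrible", "life is pretty normal"),
                Arrays.asList("positive", "negative", "neutral"));
        System.out.println(model.classify("lovely weather today"));
    }
}
```

In the Weka code from the question, the analogous move would be to train and evaluate fc rather than cModel, so that the filter runs before IBk ever sees the data.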

Use Weka with Java for prediction on a test set

I am trying to get the predictions on a test set using the evaluateModel function; however, evaluation.evaluateModel(classifier, newTest, output) throws an exception:
Exception in thread "main" weka.core.WekaException: No dataset structure provided!
import weka.classifiers.Evaluation;
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.BestFirst;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.classifiers.evaluation.output.prediction.CSV;

public void evaluateTest() throws Exception
{
    DataSource train = new DataSource(trainingData.toString());
    Instances traininstances = train.getDataSet();
    Attribute attr = traininstances.attribute("regressionLabel");
    int trainindex = attr.index();
    traininstances.setClassIndex(trainindex);
    DataSource test = new DataSource(testData.toString());
    Instances testinstances = test.getDataSet();
    Attribute testattr = testinstances.attribute(regressionLabel);
    int testindex = testattr.index();
    testinstances.setClassIndex(testindex);
    AttributeSelection filter = new AttributeSelection();
    weka.classifiers.AbstractClassifier classifier;
    filter.setSearch(this.search);
    filter.setEvaluator(this.eval);
    filter.setInputFormat(traininstances); // initializing the filter once with training set
    Instances newTrain = AttributeSelection.useFilter(traininstances, filter); // configures the filter based on train instances and returns filtered instances
    Instances newTest = AttributeSelection.useFilter(testinstances, filter);
    classifier = new LinearRegression();
    classifier.buildClassifier(newTrain);
    StringBuffer buffer = new StringBuffer();
    CSV output = new CSV();
    output.setBuffer(buffer);
    output.setOutputFile(predictFile);
    Evaluation evaluation = new Evaluation(newTrain);
    evaluation.evaluateModel(classifier, newTest, output);
}
The same thing works with evaluation.crossValidateModel.

How to specify the base classifier in stacking method when using Weka API?

I am trying to use the stacking method of the Weka API in Java and found a tutorial for a single classifier. I tried implementing stacking using the method described in the tutorial, but the classification is done with Weka's default ZeroR classifier. I was able to set the meta classifier using setMetaClassifier, but I am not able to change the base classifier. What is the proper way to set the base classifier in stacking?
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class startweka {
    public static void main(String[] args) throws Exception {
        BufferedReader breader = new BufferedReader(new FileReader("C:/newtrain.arff"));
        Instances train = new Instances(breader);
        train.setClassIndex(train.numAttributes() - 1);
        breader.close();
        String[] stackoptions = new String[1];
        {
            stackoptions[0] = "-w weka.classifiers.functions.SMO";
        }
        Stacking nb = new Stacking();
        J48 j48 = new J48();
        SMO jj = new SMO();
        nb.setMetaClassifier(j48);
        nb.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(nb, train, 10, new Random(1));
        System.out.println(eval.toSummaryString("results", true));
    }
}
OK, I found the answer on another forum (the Weka list on Nabble). The code for setting the base classifier is:
Stacking nb = new Stacking();
SMO smo = new SMO();
Classifier[] stackoptions = new Classifier[1];
stackoptions[0] = smo;
nb.setClassifiers(stackoptions);
or, equivalently:
Stacking nb = new Stacking();
SMO smo = new SMO();
Classifier[] stackoptions = new Classifier[] {smo};
nb.setClassifiers(stackoptions);
