In Weka (using Java), I would like to successively fit classifiers to different subsets of attributes of the same dataset.
Is there a way to build the Instances object only once and then remove the non-selected attributes only temporarily, so that they can be efficiently restored later when they are needed to build another classifier, without creating a completely new Instances object from scratch each time?
I am aware of the method deleteAttributeAt(), whose documentation says that
"A deep copy of the attribute information is performed before the attribute is deleted"
and also of the Remove filter class, but I'm not sure either of these is what I need.
Create new Instances objects at each stage and use them appropriately.
For example, the code below builds a clusterer from an Instances object that has the class attribute removed and the values normalized.
Use rawData whenever you need the original instances. Hope this helps.
final SimpleKMeans kmeans = new SimpleKMeans();
final String[] options = weka.core.Utils
.splitOptions("-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 10 -A \"weka.core.EuclideanDistance -R first-last\" -I 500 -num-slots 1 -S 50");
kmeans.setOptions(options);
kmeans.setSeed(1000);
kmeans.setPreserveInstancesOrder(true);
kmeans.setNumClusters(5);
kmeans.setMaxIterations(1000);
final BufferedReader datafile = readDataFile("/Users/data.arff");
final Instances rawData = new Instances(datafile);
rawData.setClassIndex(classIndex); // classIndex: 0-based position of the class attribute
//remove class column[0] from cluster
final Remove removeFilter = new Remove();
removeFilter.setAttributeIndices("" + (rawData.classIndex() + 1));
removeFilter.setInputFormat(rawData);
final Instances dataNoClass = Filter.useFilter(rawData, removeFilter);
//normalize
final Normalize normalizeFilter = new Normalize();
normalizeFilter.setIgnoreClass(true);
normalizeFilter.setInputFormat(dataNoClass);
final Instances data = Filter.useFilter(dataNoClass, normalizeFilter);
kmeans.buildClusterer(data);
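Applied to the original question, the same pattern works for classifiers: keep rawData intact and derive a temporary filtered copy for each attribute subset with the Remove filter, so nothing ever has to be rebuilt from the source file. A rough sketch (the index strings and the use of J48 are just examples):
// rawData keeps all attributes; each pass builds a classifier on a subset.
// "3,5" and "2-4" are placeholder 1-based attribute ranges to drop.
for (String indicesToDrop : new String[] { "3,5", "2-4" }) {
    Remove subsetFilter = new Remove();
    subsetFilter.setAttributeIndices(indicesToDrop);  // attributes to remove (1-based)
    subsetFilter.setInputFormat(rawData);             // rawData itself is never modified
    Instances subset = Filter.useFilter(rawData, subsetFilter);
    subset.setClassIndex(subset.numAttributes() - 1); // assumes the class is the last attribute

    J48 classifier = new J48();                       // any classifier would do
    classifier.buildClassifier(subset);
    // rawData is still complete for the next subset
}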
I am counting the number of objects in an AWS S3 bucket using Scala like this:
val reqAws:ListObjectsV2Request = new ListObjectsV2Request().withBucketName(awsBucketName).withPrefix(prefixForAws);
var resultAws:ListObjectsV2Result = null;
var totalFilesInAws:Int = 0;
do {
resultAws = awsS3Client.listObjectsV2(reqAws);
val summariesForAws:java.util.List[S3ObjectSummary] = resultAws.getObjectSummaries()
totalFilesInAws = totalFilesInAws + summariesForAws.size().toInt
val token:String = resultAws.getNextContinuationToken();
reqAws.setContinuationToken(token);
} while (resultAws.isTruncated());
However, it also counts prefixes that do not contain any objects.
For example, if my prefix is a/b/c and my S3 has following structure:
bucketName/a/b/c/d/obj1
bucketName/a/b/c/e/obj2
bucketName/a/b/c/f/
Here a/b/c/f does not contain any object, while a/b/c/d and a/b/c/e do, so the count should be 2, but my code gives 3.
How do I modify my code to get the correct count?
Amazon S3 does not actually have folders/directories.
For example, you could run this command:
aws s3 cp foo.txt s3://my-bucket/a/b/c/foo.txt
This works even though the path a/b/c does not exist.
If that object is later deleted, the path disappears again.
This is because the filename ('Key') of each object is the full path. Amazon S3 makes it 'look like' there are directories, but there really are none.
So, what happens when you create a folder? The answer is that the system creates a zero-length object with the same name as the path.
In your case, there is a zero-length object called a/b/c/f/. This makes the directory appear (even though there is no such thing as a directory).
While a/b/c/f/ might not contain an object, there is an object called a/b/c/f/.
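If you want to reproduce this, such a zero-length 'folder' marker is exactly what gets created when you click "Create folder" in the console, or when you run something like (my-bucket being a placeholder):
aws s3api put-object --bucket my-bucket --key a/b/c/f/
No body is uploaded; only the zero-byte key a/b/c/f/ is created.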
How to solve this? Here are some options:
Do not create directories. Let them automatically 'appear' through the creation of objects in a given path. This way, there will be no zero-length files of the name of the directory.
Change your code to ignore zero-length objects.
I made the following code changes and now I get the correct count:
import scala.collection.JavaConverters._ // for .asScala below (scala.jdk.CollectionConverters._ on Scala 2.13+)
val reqAws:ListObjectsV2Request = new ListObjectsV2Request().withBucketName(awsBucketName).withPrefix(prefixForAws);
var resultAws:ListObjectsV2Result = null;
var totalFilesInAws:Int = 0;
do {
resultAws = awsS3Client.listObjectsV2(reqAws);
val summariesForAws:java.util.List[S3ObjectSummary] = resultAws.getObjectSummaries()
for(k <- summariesForAws.asScala) {
if(!(k.getKey.toString().endsWith("/"))) {
totalFilesInAws+= 1;
}
}
val token:String = resultAws.getNextContinuationToken();
reqAws.setContinuationToken(token);
} while (resultAws.isTruncated());
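(An equivalent fix, following the suggestion above, would be to skip summaries whose getSize() is 0, since the folder markers are zero-length objects.)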
My original tree was much bigger, but since I have been stuck on this issue for quite some time, I decided to try to simplify my tree. I ended up with something like this:
As you can see, I only have a single attribute called "LarguraBandaRede" with 3 possible nominal values "Congestionado", "Livre" and "Merda".
After that I exported the j48.model from weka to use on my java code.
With this piece of code I import the model to use as a classifier:
ObjectInputStream objectInputStream = new ObjectInputStream(in);
classifier = (J48) objectInputStream.readObject();
After that I started creating an ArrayList of my attributes and an Instances header:
for (int i = 0; i <features.length; i++) {
String feature = features[i];
Attribute attribute;
if (feature.equals("TamanhoDados(Kb)")) {
attribute = new Attribute(feature);
} else {
String[] strings = null;
if(i==0) strings = populateAttributes(7);
if(i==1) strings = populateAttributes(10);
ArrayList<String> attValues = new ArrayList<String>(Arrays.asList(strings));
attribute = new Attribute(feature,attValues);
}
atts.add(attribute);
}
where populateAttributes gives the possible values for each attribute, in this case "Livre, Merda, Congestionado" for LarguraBandaRede and "Sim, Nao" for Resultado, my class attribute.
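populateAttributes itself is not shown; presumably it is something along these lines (a hypothetical sketch using the values just described):
// Hypothetical helper: returns the possible nominal values for an attribute.
private String[] populateAttributes(int attributeId) {
    if (attributeId == 7) return new String[] { "Livre", "Merda", "Congestionado" }; // LarguraBandaRede
    if (attributeId == 10) return new String[] { "Sim", "Nao" };                     // Resultado (class)
    return null;
}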
Instances instances = new Instances("header",atts,atts.size());
instances.setClassIndex(instances.numAttributes()-1);
After creating my Instances header, it is time to create the Instance objects themselves, that is, the instances that I'm trying to classify:
Instance instanceLivre = new DenseInstance(features.length);
Instance instanceMediano = new DenseInstance(features.length);
Instance instanceCongestionado = new DenseInstance(features.length);
instanceLivre.setDataset(instances);
instanceMediano.setDataset(instances);
instanceCongestionado.setDataset(instances);
Then I set each of these instances to one of the 3 possible values of "LarguraBandaRede": 'instanceLivre' to "Livre", 'instanceMediano' to "Merda" and 'instanceCongestionado' to "Congestionado".
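In code, that step presumably looks something like this (using Instance.setValue with the nominal value as a String):
instanceLivre.setValue(instances.attribute("LarguraBandaRede"), "Livre");
instanceMediano.setValue(instances.attribute("LarguraBandaRede"), "Merda");
instanceCongestionado.setValue(instances.attribute("LarguraBandaRede"), "Congestionado");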
After that I simply classify these 3 instances using the classifyInstance method:
System.out.println(instance.toString());
double resp = classifier.classifyInstance(instance);
System.out.println("valor: "+resp);
and this is my result:
As you can see, the instance that has "Merda" as "LarguraBandaRede" was classified into the same class as the "Congestionado" one, the class 'Nao'. But that doesn't make any sense, since the tree above clearly shows that when "LarguraBandaRede" is "Merda" or "Livre" the class should be the same.
So that's my question: how did this happen, and how do I fix it?
Thanks in advance.
EDIT
I didn't know that the order of the nominal values made any difference in the way the model works, but we do have to follow the same order as the training data when feeding a nominal attribute its possible values.
Have you checked whether the order of the nominal values in the Weka model matches the order returned by your populateAttributes method?
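If the orders differ, that would explain the result: Weka stores a nominal value as its index in the attribute's value list, so the header used for classification must declare the values in exactly the same order as the training data. A hedged sketch (the order shown is only an example; use whatever order the model was trained with):
// The value order must match the training data, because nominal values
// are encoded by index, not by their text.
ArrayList<String> larguraValues = new ArrayList<String>(
        Arrays.asList("Congestionado", "Livre", "Merda")); // hypothetical training order
Attribute largura = new Attribute("LarguraBandaRede", larguraValues);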
I am new to Lucene, so apologies for any unclear wording. I am working on an author search engine. The search query is the author name. The default search results are good - they return the names that match the most. However, we want to rank the results by author popularity as well, a blend of both the default similarity and a numeric value representing the circulations their titles have. The problem with the default results is it returns authors nobody is interested in, and while I can rank by circulation alone, the top result is generally not a great match in terms of name. I have been looking for days for a solution for this.
This is how I am building my index:
IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(INDEX_LOCATION)),
new IndexWriterConfig(new StandardAnalyzer()));
writer.deleteAll();
for (Contributor contributor : contributors) {
Document doc = new Document();
doc.add(new TextField("name", contributor.getName(), Field.Store.YES));
doc.add(new StoredField("contribId", contributor.getContribId()));
doc.add(new NumericDocValuesField("sum", sum));
writer.addDocument(doc);
}
writer.close();
The name is the field we want to search on, and the sum is the field we want to weight our search results with (but still taking into account the best match for the author name). I'm not sure if adding the sum to the document is the correct thing to do in this situation. I know that there will need to be some experimentation to figure out how to best blend the weighting of the two factors, but my problem is I don't know how to do it in the first place.
Any examples I've been able to find are either pre-Lucene 4 or don't seem to work. I thought this was what I was looking for, but it doesn't seem to work. Help appreciated!
As demonstrated in the blog post you linked, you could use a CustomScoreQuery; this would give you a lot of flexibility and influence over the scoring process, but it is also a bit overkill. Another possibility is to use a FunctionScoreQuery; since they behave differently, I will explain both.
Using a FunctionScoreQuery
A FunctionScoreQuery can modify a score based on a field.
Let's say you are usually performing a search like this:
Query q = .... // pass the user input to the QueryParser or similar
TopDocs hits = searcher.search(q, 10); // Get 10 results
Then you can modify the query in between like this:
Query q = .....
// Note that a Float field would work better.
DoubleValuesSource boostByField = DoubleValuesSource.fromLongField("sum");
// Create a query, based on the old query and the boost
FunctionScoreQuery modifiedQuery = new FunctionScoreQuery(q, boostByField);
// Search as usual
TopDocs hits = searcher.search(modifiedQuery, 10);
This will modify the score based on the value of the field. Sadly, however, there isn't a possibility to control the influence of the DoubleValuesSource (besides scaling the values during indexing) - at least none that I know of.
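One crude way to get some control is to do the scaling at indexing time, i.e. store an already-dampened boost and multiply the relevance score by it via the boostByValue factory (the field name and factor below are only examples):
// At indexing time: store a pre-scaled boost instead of the raw sum.
doc.add(new FloatDocValuesField("sumBoost", 1.0f + sum * 0.01f));

// At query time: multiply the text-relevance score by the stored boost.
DoubleValuesSource boost = DoubleValuesSource.fromFloatField("sumBoost");
Query boosted = FunctionScoreQuery.boostByValue(q, boost);
TopDocs hits = searcher.search(boosted, 10);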
To have more control, consider using the CustomScoreQuery.
Using a CustomScoreQuery
Using this kind of query will allow you to modify a score of each result any way you like. In this context we will use it to alter the score based on a field in the index. First, you will have to store your value during indexing:
doc.add(new StoredField("sum", sum));
Then we will have to create our very own query class:
private static class MyScoreQuery extends CustomScoreQuery {
public MyScoreQuery(Query subQuery) {
super(subQuery);
}
// The CustomScoreProvider is what actually alters the score
private class MyScoreProvider extends CustomScoreProvider {
private LeafReader reader;
private Set<String> fieldsToLoad;
public MyScoreProvider(LeafReaderContext context) {
super(context);
reader = context.reader();
// We create a HashSet which contains the name of the field
// which we need. This allows us to retrieve the document
// with only this field loaded, which is a lot faster.
fieldsToLoad = new HashSet<>();
fieldsToLoad.add("sum");
}
@Override
public float customScore(int doc_id, float currentScore, float valSrcScore) throws IOException {
// Get the result document from the index
Document doc = reader.document(doc_id, fieldsToLoad);
// Get boost value from index
IndexableField field = doc.getField("sum");
Number number = field.numericValue();
// This is just an example on how to alter the current score
// based on the value of "sum". You will have to experiment
// here.
float influence = 0.01f;
float boost = number.floatValue() * influence;
// Return the new score for this result, based on the
// original lucene score.
return currentScore + boost;
}
}
// Make sure that our CustomScoreProvider is being used.
@Override
public CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) {
return new MyScoreProvider(context);
}
}
Now you can use your new Query class to modify an existing query, similar to the FunctionScoreQuery:
Query q = .....
// Create a query, based on the old query and the boost
MyScoreQuery modifiedQuery = new MyScoreQuery(q);
// Search as usual
TopDocs hits = searcher.search(modifiedQuery, 10);
Final remarks
Using a CustomScoreQuery, you can influence the scoring process in all kinds of ways. Remember however that the method customScore is called for each search result - so don't perform any expensive computations there, as this would severely slow down the search process.
I've created a small gist with a full working example of the CustomScoreQuery here: https://gist.github.com/philippludwig/14e0d9b527a6522511ae79823adef73a
I would like to get the classification rate for a tree constructed using J48.
DataSource source = new DataSource(Path);
Instances data = source.getDataSet();
J48 tree = new J48();
tree.buildClassifier(data);
I know it has something to do with
public double getMeasure(java.lang.String additionalMeasureName)
But I can't find the correct String (additionalMeasureName) to use.
I just found an answer to my question using the Evaluation class. The code should be:
//Learning
DataSource source = new DataSource(Path);
Instances data = source.getDataSet();
J48 tree = new J48();
tree.buildClassifier(data);
//Evaluation
Evaluation eval = new Evaluation(data);
eval.evaluateModel(tree, data);
System.out.println((eval.correct()/data.numInstances())*100);
This tests the decision tree on the training data and displays the percentage of correctly classified instances.
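Note that Evaluation can also give you this percentage directly:
System.out.println(eval.pctCorrect());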
I want to update a StoredMap value and I don't care about the old value. I cannot find a way to prevent the previous value from being loaded.
StoredMap<Integer, SyntaxDocument> tsCol = new StoredMap<Integer, SyntaxDocument>(tsdb, new IntegerBinding(), new PassageBinding(), true);
tsCol.put(1, doc); // insert value => ok
tsCol.put(1, doc); // <- loads the previous value, but I don't care about it. I want to avoid the "heavy" PassageBinding process.
tsCol.putAll(Collections.singletonMap(1, doc)); // Even this one loads the old value
Is there a way to optimize my code and update an existing value without loading it (or at least prevent the binding from processing the old DatabaseEntry bytes)?
NOTE: calling remove followed by put is slower.
The solution is to use the low-level Database API:
Database tsdb = environment.openDatabase(null, "tsdb", dbconfig);
PassageBinding binding = new PassageBinding();
DatabaseEntry idDbEntry = new DatabaseEntry();
IntegerBinding.intToEntry(id, idDbEntry);
DatabaseEntry dbEntry = new DatabaseEntry();
binding.objectToEntry(data, dbEntry); // serialize the new value with the PassageBinding
tsdb.put(null, idDbEntry, dbEntry); // <-- replace existing value without loading it.
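The read happens in the first place because StoredMap.put follows the java.util.Map contract and returns the previous value, which forces the old record to be fetched and run through the binding; the low-level Database.put above simply overwrites the record without deserializing anything.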