Following is the piece of code I'm using to try to save my model, but I'm unable to find a saveModel() API to store the model.
// Create classification trainer.
DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(10, 0.1);
// Train decision tree model.
Model mdl = trainer.fit(
ignite,
dataCache,
featureExtractor,
labelExtractor
);
Exporter<DecisionTreeNode, String> exporter = new FileExporter<>();
((DecisionTreeNode)mdl).saveModel(exporter, filePath);
Every classification algorithm (KNN, ANN, KMeans, ...) implements the exportable model interface except the decision tree, so in that case you can save it via ModelsComposition (which applies to the decision tree scenario):
Exporter exporter = new FileExporter<>();
((ModelsComposition) mdl).saveModel(exporter, filePath);
In Weka (using Java), I would like to successively fit classifiers to different subsets of attributes of the same dataset.
Is there a way to build the Instances object only once, then remove the non-selected attributes only temporarily, so they can be efficiently restored later when they are needed to build another classifier, without having to create a completely new Instances object from scratch each time?
I am aware of the method deleteAttributeAt(), whose documentation says that
A deep copy of the attribute information is performed before the
attribute is deleted
and also of class Remove but I'm not sure this is what I need.
Create new Instances objects at each stage and use them appropriately.
For example, the code below builds a clusterer from an Instances object with the class attribute removed and the values normalized.
Use rawData to get original instances. Hope this helps.
final SimpleKMeans kmeans = new SimpleKMeans();
final String[] options = weka.core.Utils
.splitOptions("-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 10 -A \"weka.core.EuclideanDistance -R first-last\" -I 500 -num-slots 1 -S 50");
kmeans.setOptions(options);
kmeans.setSeed(1000);
kmeans.setPreserveInstancesOrder(true);
kmeans.setNumClusters(5);
kmeans.setMaxIterations(1000);
final BufferedReader datafile = readDataFile("/Users/data.arff");
final Instances rawData = new Instances(datafile);
rawData.setClassIndex(classIndex);
//remove class column[0] from cluster
final Remove removeFilter = new Remove();
removeFilter.setAttributeIndices("" + (rawData.classIndex() + 1));
removeFilter.setInputFormat(rawData);
final Instances dataNoClass = Filter.useFilter(rawData, removeFilter);
//normalize
final Normalize normalizeFilter = new Normalize();
normalizeFilter.setIgnoreClass(true);
normalizeFilter.setInputFormat(dataNoClass);
final Instances data = Filter.useFilter(dataNoClass, normalizeFilter);
kmeans.buildClusterer(data);
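Since Filter.useFilter returns a new Instances copy and leaves its input untouched, the same rawData can be re-filtered later for a different attribute subset. A minimal continuation sketch of the code above; the index list "1,3,5" is only an illustrative choice:
// rawData still holds every attribute, so another subset can be derived from it later.
// Hypothetical subset: keep only attributes 1, 3 and 5 (1-based indices).
final Remove keepSubset = new Remove();
keepSubset.setAttributeIndices("1,3,5");
keepSubset.setInvertSelection(true); // keep the listed attributes, drop the rest
keepSubset.setInputFormat(rawData);
final Instances subsetData = Filter.useFilter(rawData, keepSubset);
// subsetData is an independent copy; rawData remains available for further subsets.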
My original tree was much bigger, but since I've been stuck on this issue for quite some time, I decided to try to simplify my tree. I ended up with something like this:
As you can see, I only have a single attribute called "LarguraBandaRede" with 3 possible nominal values "Congestionado", "Livre" and "Merda".
After that, I exported the j48.model from Weka to use in my Java code.
With this piece of code I import the model to use as a classifier:
ObjectInputStream objectInputStream = new ObjectInputStream(in);
classifier = (J48) objectInputStream.readObject();
After that I started to create an ArrayList of my attributes and an Instances object:
for (int i = 0; i <features.length; i++) {
String feature = features[i];
Attribute attribute;
if (feature.equals("TamanhoDados(Kb)")) {
attribute = new Attribute(feature);
} else {
String[] strings = null;
if(i==0) strings = populateAttributes(7);
if(i==1) strings = populateAttributes(10);
ArrayList<String> attValues = new ArrayList<String>(Arrays.asList(strings));
attribute = new Attribute(feature,attValues);
}
atts.add(attribute);
}
where populateAttributes gives the possible values for each attribute, in this case "Livre, Merda, Congestionado;" for LarguraBandaRede and "Sim,Nao" for Resultado, my class attribute.
Instances instances = new Instances("header",atts,atts.size());
instances.setClassIndex(instances.numAttributes()-1);
After creating my Instances object, it is time to create the Instance objects, that is, the instances that I'm trying to classify:
Instance instanceLivre = new DenseInstance(features.length);
Instance instanceMediano = new DenseInstance(features.length);
Instance instanceCongestionado = new DenseInstance(features.length);
instanceLivre.setDataset(instances);
instanceMediano.setDataset(instances);
instanceCongestionado.setDataset(instances);
Then I set each of these instances with the 3 possible values for "LarguraBandaRede": 'instanceLivre' with "Livre", 'instanceMediano' with "Merda" and 'instanceCongestionado' with "Congestionado".
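The value assignment itself would presumably look something like the sketch below (Instance.setValue accepts the nominal value as a String once the Attribute is known):
// look up the attribute by name from the dataset the instances are attached to
Attribute larguraBandaRede = instances.attribute("LarguraBandaRede");
instanceLivre.setValue(larguraBandaRede, "Livre");
instanceMediano.setValue(larguraBandaRede, "Merda");
instanceCongestionado.setValue(larguraBandaRede, "Congestionado");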
After that I simply classify these 3 instances using the classifyInstance method:
System.out.println(instance.toString());
double resp = classifier.classifyInstance(instance);
System.out.println("valor: "+resp);
and this is my result:
As you can see, the instance that has "Merda" as "LarguraBandaRede" was classified into the same class as Congestionado, the class 'Nao'. But that doesn't make any sense, since the tree above clearly shows that when "LarguraBandaRede" is "Merda" or "Livre" the class should be the same.
So that's my question: how did this happen and how can I fix it?
Thanks in advance.
EDIT
I didn't know that the order in which the nominal values are declared made any difference in the way the model works, but we do have to follow that order when feeding a nominal attribute its possible values.
Have you checked whether the order of the Weka nominal attribute values matches the order produced by your populateAttributes method?
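For reference, a minimal sketch of declaring the values in a fixed order; the order shown here is only an assumption and must be taken from the header/ARFF the model was actually trained on:
// the declaration order must match the training data, e.g.
// @attribute LarguraBandaRede {Livre,Merda,Congestionado} in the ARFF
ArrayList<String> larguraValues = new ArrayList<String>(Arrays.asList("Livre", "Merda", "Congestionado"));
Attribute largura = new Attribute("LarguraBandaRede", larguraValues);
ArrayList<String> resultadoValues = new ArrayList<String>(Arrays.asList("Sim", "Nao"));
Attribute resultado = new Attribute("Resultado", resultadoValues);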
I would like to get the classification rate for a tree constructed using J48.
DataSource source = new DataSource(Path);
Instances data = source.getDataSet();
J48 tree = new J48();
tree.buildClassifier(data);
I know it has something to do with
public double getMeasure(java.lang.String additionalMeasureName)
But I can't find the correct String (additionalMeasureName) to use.
I just found an answer to my question using the Evaluation class. The code should be:
//Learning
DataSource source = new DataSource(Path);
Instances data = source.getDataSet();
// getDataSet() does not set the class index; assume the class is the last attribute
if (data.classIndex() == -1)
    data.setClassIndex(data.numAttributes() - 1);
J48 tree = new J48();
tree.buildClassifier(data);
//Evaluation
Evaluation eval = new Evaluation(data);
eval.evaluateModel(tree, data);
System.out.println((eval.correct()/data.numInstances())*100);
This tests the decision tree on the learning data and displays the percentage of correctly classified instances.
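As a side note, Evaluation also exposes the percentage directly via pctCorrect(), and a cross-validation estimate is usually less optimistic than evaluating on the training data. A minimal continuation sketch of the code above:
// same percentage without the manual division
System.out.println(eval.pctCorrect());
// 10-fold cross-validation estimate instead of training-set accuracy
Evaluation cvEval = new Evaluation(data);
cvEval.crossValidateModel(new J48(), data, 10, new java.util.Random(1));
System.out.println(cvEval.pctCorrect());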
Could someone let me know whether it is possible to open a Velocity Excel template and stream a larger amount of data to it piecemeal, on the fly?
Let's say I read ArrayLists of data from an external resource in a loop, one list with 10,000 items per iteration. In each iteration I would like to push the list to the Velocity Excel template and forget about it before jumping to the next iteration. At the end of the data processing I would do a final merge of the Velocity context and template with all the data.
So far, I've seen ways of generating Excel reports by Velocity engine in a few simple steps:
create Velocity template
create Velocity context
put data to context
merge context and template
but I need to repeat the 3rd step several times.
Here is my tested solution, which works well in my case. I'm able to generate an Excel sheet with a huge amount of data loaded directly from the database.
Connection conn = getDs().getConnection();
// such a configuration of the prepared statement is needed for large amounts of data
PreparedStatement ps = conn.prepareStatement(MY_QUERY, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY, ResultSet.CLOSE_CURSORS_AT_COMMIT);
// load the huge amount of data in batches of 50,000 rows
ps.setFetchSize(50000);
ResultSet rs = ps.executeQuery();
// the count of returned rows was calculated earlier
final CountWrapper countWrapper = new CountWrapper(countCalculatedEarlier);
CustomResultSetIterator<MyObject> rowsIter = new CustomResultSetIterator<MyObject>(rs, new org.apache.commons.dbutils.BasicRowProcessor(), MyObject.class) {
    private Long iterCount = 0L;

    @Override
    public boolean hasNext() {
        // isLast() cannot be called on a forward-only result set cursor,
        // hence this homegrown check of whether the last item has been reached
        return iterCount < countWrapper.getCount().longValue();
    }

    @Override
    public MyObject next() {
        iterCount++;
        return super.next();
    }
};
VelocityContext context = new VelocityContext();
// place an iterator here instead of a collection object
context.put("rows", rowsIter);
Template t = ve.getTemplate(template);
File file = new File(parthToFile);
FileWriter fw = new FileWriter(file);
// generate on the fly, in my case 1,000,000 rows in Excel; note that such an Excel file may be around 1 GB in size
t.merge(context, fw);
// The CustomResultSetIterator is a:
public class CustomResultSetIterator<E> implements Iterator<E> {
//...implement all interface's methods
}
The only option I see with default Velocity would be to implement something like a "streaming Collection" which never holds the complete data in memory, but rather provides the data in the iterator one by one.
You would then put this Collection into the Velocity context and use it in the Velocity Template to iterate over the items. Internally the Collection would retrieve the items in the hasNext()/next() call one after the other from your external source.
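A minimal sketch of such a wrapper, assuming the items come from a line-based reader; the class name and the reader-backed source are purely illustrative:
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Iterator;

// Illustrative lazy iterator (not an existing Velocity class): items are
// produced one by one on demand, so the whole dataset never sits in memory.
public class StreamingRows implements Iterator<String> {

    private final BufferedReader reader; // stands in for any external source
    private String next;

    public StreamingRows(BufferedReader reader) throws IOException {
        this.reader = reader;
        this.next = reader.readLine(); // pre-fetch the first item
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        String current = next;
        try {
            next = reader.readLine(); // lazily fetch the following item
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return current;
    }
}
You would then call context.put("rows", new StreamingRows(reader)) and loop over $rows with #foreach in the template; note that a plain Iterator can only be traversed once per merge.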
I have this problem where I need to design a Java package which is used for:
Getting data from different data sources. For example, Class A will retrieve customer data from an Oracle database, while Class B will retrieve the same information from a web service data source (via SOAP).
The results need to be combined. The combination rule is quite complex, so ideally I should hide it from the users (other developers) of this package.
When one data source fails, I still need to return the results from the other data sources. However, I also need to let the caller know that one of the data sources failed to respond.
Right now I'm doing this by having a boolean value inside Class A and Class B indicating whether there was an error, and another object for storing the actual error message. The caller has to check this boolean value after making a call to see whether an error occurred.
What is a good design model for this?
The answer would be very broad, so I would suggest you use:
The Data Access Object (DAO) design pattern to abstract the source of the data (database or web service)
The Strategy pattern to abstract the algorithm by which the data is merged (one algorithm for when both sources are available, another for when only one is)
And finally the State design pattern to change the way your application works depending on which source is available.
All this wrapped (why not) in a nice facade.
This pseudocode uses a syntax similar to UML and Python:
// The data implements one interface
Data {interface}
// And you implement it with DatabaseData
DbData -> Data
...
// Or WebServiceData
WsData -> Data
...
// -- DAO part
Dao {interface}
+ fetch(): Data[]
// From database
DatabaseDao -> Dao
- data: Data[0..*]
// Query database and create dbData from rows...
+ fetch(): Data[]
self.status = "Not ok"
self.status = connectToDb()
if( self.status == ok ,
performQuery()
forEach( row in resultSet,
data.add( DbData.new( resultSet.next() ) )
)
disconnect()
)
...
// From web service
WebServiceDao -> Dao
- data: Data[0..*]
// Execute remote method and create wsData from some strange object
+ fetch(): Data[]
remoteObject: SoapObject = SoapObject()
remoteObject.connect()
if (remoteObject.connected?(),
differentData: StrangeObject = remoteObject.getRemoteData()
forEach( object in differentData ,
self.data.add( WsData.new( object ) )
)
).else(
self.status = "Disconnected"
)
....
// -- State part
// Abstract the way the data is going to be retrieved
// either from two sources or from a single one.
FetcheState { abstract }
- context: Service
- dao: Dao // Used for a single source
+ doFetch(): Data[] { abstract }
+ setContext( context: Service )
self.context = context
+ setSingleSource( dao: Dao)
self.dao = dao
// Fetches only from one DAO, and it doesn't quite merge anything
// because there is only one source after all.
OneSourceState -> FetcheState
// Use the single DAO and fetch
+ doFetch(): Data[]
data: Data[] = self.dao.doFetch()
// It doesn't hurt to call "context's" merger anyway.
context.merger.merge( data, null )
// Two sources, are more complex, fetches both DAOs, and validates error.
// If one source had an error, it changes the "state" of the application (context),
// so it can fetch from single source next time.
TwoSourcesState -> FetcheState
- db: Dao = DatabaseDao.new()
- ws: Dao = WebServiceDao.new()
+ doFetch(): Data[]
dbData: Data[] = db.doFetch()
wsData: Data[] = ws.doFetch()
if( ws.hadError() or db.hadError(),
// Changes the context's state
context.fetcher = OneSourceState.new()
context.merger = OneKindMergeStrategy.new()
context.fetcher.setContext( self.context )
// Find out which one was broken
if( ws.hadError(),
context.fetcher.setSingleSource( db )
)
if( db.hadError(),
context.fetcher.setSingleSource( ws )
)
)
// Since we have the data already let's
// merge it with the "context's" merger.
return context.merger.merge( dbData, wsData)
// -- Strategy part --
// Encapsulate the algorithm to merge data
Strategy{ interface }
+ merge( a: Data[], with : Data[] )
// One kind doesn't merge too much, just "cast" one array
// because there is only one source after all.
OneKindMergeStrategy -> Strategy
+ merge( a: Data[], b: Data[] )
mergedData: Data[]
forEach( item, in( a ),
mergedData.add( Data.new( item ) ) // Take values from wsData or dbData
)
return mergedData
// Two kinds merge, encapsulate the complex algorithm to
// merge data from two sources.
TwoKindsMergeStrategy -> Strategy
+ merge( a: Data[], with: Data[] ): Data[]
forEach( item, in( a ),
mergedData: Data[]
forEach( other, in(with ),
WsData wsData = WsData.cast( item )
DbData dbData = DbData.cast( other )
// Add strange and complex logic here.
newItem = Data.new()
if( wsData.name == dbData.column.name and etc. etc ,
newItem.name = wsData + dbData ... etc. etc.
...
mergedData.add( newItem )
)
)
)
return mergedData
// Finally, the service where the actual fetch is being performed.
Service { facade }
- merger: Strategy
- fetcher: FetcheState
// Initialise the object with the default "strategy" and the default "state".
+ init()
self.fetcher = TwoSourcesState()
self.merger = TwoKindsMergeStrategy()
fetcher.setContext( self )
// Nahh, just let the state do its work.
+ doFetch(): Data[]
// Fetch using the current application state
return fetcher.doFetch()
Client usage:
service: Service = Service.new()
service.init()
data: Data[] = service.doFetch()
Unfortunately, it looks a bit complex.
OOP is based a lot on polymorphism.
So in Dao, you let the subclass fetch data from whatever place and you just call it dao.fetch().
In Strategy the same, the subclass performs one algorithm or the other (to avoid having a lot of strange if's, else's, switch's, etc.).
With State the same thing happens. Instead of going like:
if isBroken and itDoesntWork() and if ImAlive()
etc., etc., you just say, "Hey, this will be the code when there are two connections, and this is the code when there is only one."
Finally, the facade says to the client, "Don't worry, I'll handle this."
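In plain Java, the DAO part of that idea boils down to a sketch like this (all names illustrative):
import java.util.Collections;
import java.util.List;

interface Data { }

// The caller only holds a Dao reference; the concrete class decides whether
// the data comes from the database or from the web service.
interface Dao {
    List<Data> fetch();
    boolean hadError();
}

class DatabaseDao implements Dao {
    public List<Data> fetch() {
        // query the database and map each row to a Data implementation here
        return Collections.emptyList();
    }
    public boolean hadError() { return false; }
}

class WebServiceDao implements Dao {
    public List<Data> fetch() {
        // call the SOAP service and map the response objects to Data here
        return Collections.emptyList();
    }
    public boolean hadError() { return false; }
}

// Client code never branches on the source: List<Data> result = dao.fetch();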
Do you need to write a solution, or do you just need a solution? There's plenty of free Java software that does these things - why re-invent the wheel? See:
Pentaho Data Integration (Kettle)
CloverETL
Jitterbit
I would suggest a Facade that represents the object as a whole (the customer data) and a factory that creates that object by retrieving data from each source and passing it to the Facade (in the constructor or via a builder, depending on how many sources there are). Each class for a specific data source would have a method (on a common interface or base class) to indicate whether there was an error retrieving the data. The Facade (or a delegate) would be responsible for combining the data.
Then the Facade would have a method that would return a collection of some sort indicating which data sources the object represented, or which ones failed - depending on what the client needs to know.
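A rough Java sketch of that shape (all names hypothetical): the factory pulls from each source, and the facade exposes the combined data plus the set of sources that failed:
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

enum Source { DATABASE, WEB_SERVICE }

// Common interface for the per-source retrievers (Class A / Class B).
interface CustomerSource {
    Source id();
    List<String> fetch() throws Exception; // simplified record type
}

// The facade the callers see: the combined data plus the sources that failed.
class CustomerData {
    private final List<String> records;
    private final Set<Source> failedSources;

    CustomerData(List<String> records, Set<Source> failedSources) {
        this.records = records;
        this.failedSources = failedSources;
    }

    List<String> getRecords() { return records; }
    Set<Source> getFailedSources() { return failedSources; }
}

class CustomerDataFactory {
    // Pull from every source, merge what succeeded, record what failed.
    CustomerData create(List<CustomerSource> sources) {
        List<String> merged = new ArrayList<String>();
        Set<Source> failed = EnumSet.noneOf(Source.class);
        for (CustomerSource s : sources) {
            try {
                merged.addAll(s.fetch()); // the real (complex) merge rule would live here
            } catch (Exception e) {
                failed.add(s.id());
            }
        }
        return new CustomerData(merged, failed);
    }
}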