I'm looking at the MALLET source code, and it seems that most of the classifier implementations (e.g. naive Bayes) don't really take feature selection into account, even though the InstanceList class has a setFeatureSelection method.
Now I want to run some quick experiments on my datasets with feature selection involved. As a technical shortcut, I am thinking of taking the lowest-ranking features and setting their values to 0 in the instance vectors. Is this equivalent, in machine learning terms, to feature selection during classifier training, where those features are not considered at all (assuming no smoothing, e.g. Laplace estimation, is involved)?
thank you
Yes, setting the feature value to zero will have the same effect as removing it from the feature vector, since MALLET has no notion of "missing features," only zero and nonzero feature values.
Using the FeatureSelection class isn't too painful, though. MALLET comes with several built-in classes that apply a "mask" under the hood based on RankedFeatureVector subclasses. For example, to use information gain feature selection, you should just be able to do this:
FeatureSelection fs = new FeatureSelection(new InfoGain(ilist), numFeatures);
ilist.setFeatureSelection(fs);
You can also implement your own RankedFeatureVector subclass (the API is here) for something more customized. To manually select features some other way, you can still do so by creating a feature mask as a BitSet that contains all the feature ids (from the Alphabet) that you want to use, e.g.:
java.util.BitSet featureMask = /* some code to pick your features */;
FeatureSelection fs = new FeatureSelection(ilist.getAlphabet(), featureMask);
ilist.setFeatureSelection(fs);
In general, I recommend using FeatureSelection objects instead of destructively changing the instance data.
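One way to fill in the "pick your features" step is to keep the k highest-scoring feature ids. The following is only a minimal sketch in plain Java: the scores array and the FeatureMasks class are hypothetical, and it assumes that index i in the array corresponds to feature id i in the Alphabet.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

class FeatureMasks {
    /** Returns a mask with the bits of the k highest-scoring feature indices set. */
    static BitSet topK(double[] scores, int k) {
        // Box the indices so they can be sorted with a comparator.
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort feature indices by descending score.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> scores[i]).reversed());

        BitSet mask = new BitSet(scores.length);
        for (int rank = 0; rank < Math.min(k, order.length); rank++) {
            mask.set(order[rank]);
        }
        return mask;
    }
}
```

The resulting BitSet could then be passed as featureMask in the snippet above.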
This seems most related to: How to get the probability per instance in classifications models in spark.mllib
I'm doing a classification task with spark ml, building a MultilayerPerceptronClassifier. Once I build a model, I can get a predicted class given an input vector, but I can't get the probability for each output class. The above listing indicates that NaiveBayesModel supports this functionality as of Spark 1.5.0 (using a predictProbabilities method). I would like to get at this functionality for the MLPC. Is there a way I can hack at it to get my probabilities? Will it be included in 1.6.2?
If you take a look at this line in the MLPC source code, you can see that the MLPC is working from an underlying TopologyModel which provides the .predict method I'm looking for. The MLPC decodes the resulting Vector into a single label.
I'm able to use the trained MLPC model to create a new TopologyModel using its weights:
MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()...;
MultilayerPerceptronClassificationModel model = trainer.fit(trainingData);
TopologyModel topoModel = FeedForwardTopology.multiLayerPerceptron(model.layers(), true).getInstance(model.weights());
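To make the decoding step concrete, here is a minimal plain-Java sketch. It assumes the vector returned by the topology model's predict call holds one score per class and that the label is the index of the largest score; the double[] below is a hypothetical stand-in for that Vector, and LabelDecoding is not a Spark class.

```java
class LabelDecoding {
    /** Returns the index of the largest entry, taken here as the predicted class label. */
    static int argmax(double[] output) {
        int best = 0;
        for (int i = 1; i < output.length; i++) {
            if (output[i] > output[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

Under that assumption, the per-class values I am after are simply the entries of the vector before this step is applied.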
I think the short answer is No.
The MultilayerPerceptronClassifier is not probabilistic. When the weights (and any biases) are set after training, the classification for a given input will always be the same.
What you're really asking, I think, is "if I were to tweak the weights by certain random disturbances of a given magnitude, how likely would the classification be the same as without the tweaks?"
You could do an ad hoc probability calculation by re-training the perceptron (with different, randomly chosen starting conditions) and get some idea of the probability of various classifications.
But I don't think this is really part of the expected behavior of a MLPC.
This is rather a newbie question, so please take it with a grain of salt.
I'm new to the field of data mining and trying to get my head around this topic. Right now I'm trying to polish my existing model so that it classifies instances better. The problem is that my model has around 480 attributes. I know for sure that not all of them are relevant, but it's hard for me to point out which ones are actually important.
The question is: given valid training and test sets, can one use some sort of data mining algorithm that would throw away attributes that don't seem to have any impact on the quality of classification?
I'm using Weka.
You should test using some of the Classifier algorithms that Weka has.
The basic idea is to use the Cross-validation option, so you can see which algorithm gives you the best Correctly Classified Instances value.
I can give you an example from one of my training sets, using the Cross-validation option with Folds set to 10.
As you can see, using the J48 classifier I get:
Correctly Classified Instances 4310 83.2207 %
Incorrectly Classified Instances 869 16.7793 %
and if I use, for example, the NaiveBayes algorithm I get:
Correctly Classified Instances 1996 38.5403 %
Incorrectly Classified Instances 3183 61.4597 %
and so on, the values differ depending on the algorithm.
So, test as many algorithms as possible and see which one gives you the best Correctly Classified Instances / Time consumed.
Comment converted to answer as OP suggested:
If you use Weka 3.6.6, open the Explorer module, then go to the "Select attributes" tab and choose an "Attribute evaluator" and a "Search method". You can also choose between using the full data set or cross-validation sets. For more details see e.g. http://forums.pentaho.com/showthread.php?68687-Selecting-Attributes-with-Weka or http://weka.wikispaces.com/Performing+attribute+selection
Read up on the topic of clustering algorithms (only on your training set though!)
Look into the InfoGainAttributeEval class.
The buildEvaluator() and the evaluateAttribute(int index) functions should help.
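For intuition about what an information-gain evaluator scores, here is a minimal plain-Java sketch for a nominal attribute. It is not Weka's implementation; the InfoGainSketch class and the integer encoding of attribute values and class labels are assumptions made for the example.

```java
class InfoGainSketch {
    /** Shannon entropy (in bits) of a distribution given as raw counts. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) {
                continue; // 0 * log(0) is treated as 0
            }
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /**
     * Information gain of an attribute with respect to the class:
     * H(class) minus the weighted entropy of the class within each attribute value.
     * attr[i] must lie in [0, numValues) and cls[i] in [0, numClasses).
     */
    static double infoGain(int[] attr, int[] cls, int numValues, int numClasses) {
        int[] classCounts = new int[numClasses];
        int[] valueCounts = new int[numValues];
        int[][] joint = new int[numValues][numClasses];
        for (int i = 0; i < attr.length; i++) {
            classCounts[cls[i]]++;
            valueCounts[attr[i]]++;
            joint[attr[i]][cls[i]]++;
        }
        double gain = entropy(classCounts);
        for (int v = 0; v < numValues; v++) {
            if (valueCounts[v] == 0) {
                continue; // value never observed
            }
            gain -= ((double) valueCounts[v] / attr.length) * entropy(joint[v]);
        }
        return gain;
    }
}
```

Attributes whose gain is close to zero tell you almost nothing about the class and are the natural candidates to drop.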
I'm currently looking for a java library (or native library with a java API) for formula parsing and evaluation.
Using recommendations from here, I took a look at many libraries:
JFormula
JEval
Symja
JEP
But none of them fulfils my needs, which are:
Multiple formula evaluation with dependencies between them (a formula is always an assignment to a variable using other variables or numerical values)
Possibility to change only one formula out of maybe 50, with good performance if only one formula changes
No need to handle variable dependencies by hand
Automatically update other dependent variables if a formula changes
Possibility to listen for which variables changed
No need for a specific format for the variables (the user will directly enter a name and doesn't want a complex notation)
Maybe an example will be better. Let's say we have, entered in the system in this order:
a = b + c
c = 2 * d
b = 3
d = 2
I would like to be able to enter those 4 lines in this order, and ask for the result of "a" (or "b", whatever).
Then if in the user interface (basically a table variable <> formula) "b" is changed to "2 * d", the library will automatically update the values of "b" and "a", and return me (or launch an event, or call a function) a list of changes.
The best library would be one just like JEP, but with the out-of-order variables capability and the ability to auto-evaluate dependent variables.
I know that compilers and spreadsheet software use such mechanisms, but I didn't find any Java or Java-compatible libraries that are directly usable.
Does someone know one?
EDIT: Clarification: the question is really about a library, or possibly a set of libraries to link together. The question is for a project in a company and the idea is to spend the minimum amount of time. The "do it yourself" solution has already been estimated and is not in the scope of the question.
For a project in which I also needed a simple formula parser, I used the code from the article Lexical analysis, Part 2: Build an application on javaworld.com. It's simple and small (7 classes), and you can adapt it to your needs.
You can download the source from here (search for the 'Lexical Analysis Part II' entry).
Don't know of any libraries.
Assuming what you have is a set of equations with a single variable on at least one side of the equation (A+B=C-D is disallowed) and no cycles (e.g., A=B+1; B=A-2), what you technically need to do is build a data flow graph showing how each operator depends on its operands. For side-effect-free equations (e.g., pure math) this is pretty easy; you end up with a directed acyclic graph (a forest with shared subtrees representing shared subexpressions). Then if the value of a variable is changed, or a new formula is introduced, you revise the dag and re-evaluate the changed parts, propagating changes up the dag to the dag roots. So, you need to build trees for the expressions, and then share them (often by hashing on subtrees to find potentially equivalent candidates). So, lots of structure manipulation to keep the dag (and its root values) up to date.
But if it's only 50 variables of the complexity you show, you could simply reevaluate them all. If you store the expressions as trees (or better yet, reverse Polish) you can evaluate each tree quite fast, and you don't pay any overhead to keep all those data structures up to date.
If you have hundreds of equations, the dag scheme is likely a lot better.
If you have constraint equations (e.g., you aren't restricted as to what can be on both sides), you're outside the spreadsheet paradigm and into constraint solvers, which is a far more complex technology.
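The "simply reevaluate them all" approach can be sketched in a few lines. This is only an illustration under stated assumptions: FormulaSheet and its methods are hypothetical names, formulas are given as Java lambdas rather than parsed text, and every formula is recomputed on each pass instead of maintaining a dag.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FormulaSheet {
    /** A formula reads other variables through the sheet it is evaluated in. */
    interface Formula {
        double eval(FormulaSheet sheet);
    }

    private final Map<String, Formula> formulas = new HashMap<>();
    private final Set<String> inProgress = new HashSet<>();
    private Map<String, Double> lastValues = new HashMap<>();
    private Map<String, Double> current = new HashMap<>();

    /** Stores or replaces a formula; nothing is evaluated yet, so entry order does not matter. */
    void define(String name, Formula formula) {
        formulas.put(name, formula);
    }

    /** Value of a variable in the current pass, computed on demand and memoized. */
    double get(String name) {
        Double cached = current.get(name);
        if (cached != null) {
            return cached;
        }
        Formula formula = formulas.get(name);
        if (formula == null) {
            throw new IllegalStateException("undefined variable: " + name);
        }
        if (!inProgress.add(name)) {
            throw new IllegalStateException("cycle through: " + name);
        }
        double value = formula.eval(this);
        inProgress.remove(name);
        current.put(name, value);
        return value;
    }

    /** Re-evaluates every formula and returns the names whose value differs from the previous pass. */
    Set<String> reevaluate() {
        current = new HashMap<>();
        inProgress.clear();
        Set<String> changed = new TreeSet<>();
        for (String name : formulas.keySet()) {
            double value = get(name);
            Double old = lastValues.get(name);
            if (old == null || old != value) {
                changed.add(name);
            }
        }
        lastValues = current;
        return changed;
    }
}
```

With the four formulas from the question defined in the order given, a first call to reevaluate() reports all four names and get("a") yields 7; after redefining b as 2 * d, the next reevaluate() reports a and b, and a becomes 8.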
Why wouldn't you just write your own? Your assessment of the complexity of this task might be wrong. It is much easier than you might think: chances are, learning how to deal with any 3rd party library would require much more effort than implementing such a trivial thing from scratch. It should not take more than a couple of hours in the worst case.
It does not make any sense to look for 3rd party libraries for doing simple things (I know, it is a part of the Java ethos, but still...)
I'd recommend taking a look at the Cells library for inspiration. It is in Common Lisp, but the ideas are basic enough to be transferred anywhere else.
you can check these links too...
MathPiper (a Java fork of the Java Yacas version) (has its own editor based on jEdit) (GPL) http://code.google.com/p/mathpiper/
Symja/Matheclipse (my own project, uses JAS and Commons Math libraries) (LGPL) http://krum.rz.uni-mannheim.de/jas/
Java Algebra System (JAS) (LGPL) http://krum.rz.uni-mannheim.de/jas/
I would embed Groovy, see the Tutorial about embedding here. Freeplane (a Java Mindmapper) also uses Groovy for formulas.
Whenever a variable changes, you have to put the new value into the binding.
All the cell code should be given to the Groovy Shell as a single piece of code. You can register for changes via BindPath.
Anyway, I assume you have to implement a thin layer to fulfill your requirements:
No need to handle variable dependencies by hand
Possibility to listen which variable changed
Is it possible to measure how many distinct inputs were passed to the methods of a class under test by existing test cases?
I'd like to measure something like code coverage, but for inputs instead.
I don't know of any COTS tools that compute input coverage, so I'd expect you to have to build a tool that did what you wanted.
My technical paper Branch Coverage for Arbitrary Languages Made Easy describes an approach for building test coverage tools for arbitrary languages using a Program transformation system to insert arbitrary probes into source code.
The paper is naturally focused on building code coverage, but the probe insertion technique is general and you can decide where to place probes and what they do. In your case, you want to place probes only at method entry, and you want the probes to track the input argument instances. The paper shows how to place probes anywhere by using a source code pattern to indicate the point of insertion; method entry is easy to describe as a pattern.
Capturing the input instances is more awkward but doable. You'll have to decide what an "input" is; is it just the argument values, or some kind of deep copy of the arguments? Likely what you need to do is create (per instrumented method) an object type whose data members correspond to the parameters, instantiate such an object with a copy (to appropriate depth) of the arguments, and store that object in a per-method hash table. (The transformation rules can insert all this as a code idiom once you know what you want to do.) With all that, at execution, your hash table builds up the argument set, which is the key to what you want.
You can (continuously) count unique argument-set instances by controlling what happens when you insert duplicates into the hash table; that count (per method) can be managed in a global array that is exported at program completion. The paper discusses such a global array, and the various ways to export/display it in general.
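As a rough illustration of what an inserted probe might call, here is a minimal plain-Java sketch. The InputCoverage class and its method names are hypothetical; it captures only a shallow copy of the argument array and relies on the arguments' own equals/hashCode, so it is not the deep-copy variant discussed above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class InputCoverage {
    // One set of distinct argument tuples per instrumented method.
    private static final Map<String, Set<List<Object>>> SEEN = new ConcurrentHashMap<>();

    /** Called at method entry by the inserted probe; duplicates are silently ignored by the set. */
    static void probe(String method, Object... args) {
        SEEN.computeIfAbsent(method, key -> ConcurrentHashMap.newKeySet())
            .add(Arrays.asList(args.clone()));
    }

    /** Number of distinct argument tuples recorded so far for the given method. */
    static int distinctInputs(String method) {
        Set<List<Object>> tuples = SEEN.get(method);
        return tuples == null ? 0 : tuples.size();
    }
}
```

An instrumented method would start with something like InputCoverage.probe("Calc.add", a, b), and the per-method counts could be dumped at program exit.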
Our line of test coverage tools and profilers is built using the techniques in the paper. The profilers keep counts/times in such global arrays (essentially what you need) and export them to a display engine that draws heat histograms, showing where the hot spots are. Those display engines are agnostic to the language and the probe data source, and they ship off-the-shelf with any of our (profiler) tools, including the Java profiler, so you could press one of them into service for the display task.
We have a system which performs a 'coarse search' by invoking an interface on another system which returns a set of Java objects. Once we have received the search results I need to be able to further filter the resulting Java objects based on certain criteria describing the state of the attributes (e.g. from the initial objects return all objects where x.y > z && a.b == c).
The criteria used to filter the set of objects each time is partially user configurable, by this I mean that users will be able to select the values and ranges to match on but the attributes they can pick from will be a fixed set.
The data sets are likely to contain <= 10,000 objects for each search. The search will be executed manually by the application user base probably no more than 2000 times a day (approx). It's probably worth mentioning that all the objects in the result set are known domain object classes which have Hibernate and JPA annotations describing their structure and relationship.
Possible Solutions
Off the top of my head I can think of 3 ways of doing this:
For each search persist the initial result set objects in our database, then use Hibernate to re-query them using the finer grained criteria.
Use an in-memory Database (such as hsqldb?) to query and refine the initial result set.
Write some custom code which iterates the initial result set and pulls out the desired records.
Option 1
Option 1 seems to involve a lot of toing and froing across a network to a physical Database (Oracle 10g) which might result in a lot of network and disk activity. It would also require the results from each search to be isolated from other result sets to ensure that different searches don't interfere with each other.
Option 2
Option 2 seems like a good idea in principle as it would allow me to do the finer query in memory and would not require the persistence of result data which would only be discarded after the search was complete. Gut feeling is that this could be pretty performant too but might result in larger memory overheads (which is fine as we can be pretty flexible on the amount of memory our JVM gets).
Option 3
Option 3 could be very performant but is something I would like to avoid, as any code we write would require such careful testing that the time taken to achieve something flexible and robust enough would probably be prohibitive.
I don't have time to prototype all 3 ideas so I am looking for comments people may have on the 3 options above, plus any further ideas I have not considered, to help me decide which idea might be most suitable. I'm currently leaning toward option 2 (in memory database) so would be keen to hear from people with experience of querying POJOs in memory too.
Hopefully I have described the situation in enough detail but don't hesitate to ask if any further information is required to better understand the scenario.
Cheers,
Edd
Options 1 and 2 are quite compatible: by implementing one you can replace it with the other with simple reconfiguration of persistence.xml (given that in-memory database is JPA compatible, e.g. JavaDB, Derby, etc.).
Option 3 is re-implementing both third-party software (database) and your own code (existing JPA entities). You also listed its advantages as concerns. It's clearly a less feasible option in your case. I can't think of anything else to promote Option 3 either.
It seems that in-memory database is more suitable given use cases and their time span. If requirements evolve into less transient ones then you can switch to Oracle.
If your expressions are not too complex, you can use an expression language for evaluating string queries on your Java objects (POJOs). I can recommend MVEL http://mvel.codehaus.org .
The idea is that you put your objects into MVEL context. Then you provide string query written according to MVEL simple notation, and finally evaluate expression.
Example taken from MVEL site:
Map vars = new HashMap();
vars.put("x", new Integer(5));
vars.put("y", new Integer(10));
Integer result = (Integer) MVEL.eval("x * y", vars);
assert result.intValue() == 50; // Mind the JDK 1.4 compatible code :)
Usually expression languages support traversing your object graph (collections) and accessing members in JSP EL style (dot notation).
Also, I can suggest looking at OGNL (google it, I can't add more than one link)
How complex are the refining criteria? If the majority are quite simple, I'd be tempted to go for option (3) to start with, but make sure it's encapsulated behind a suitable interface so that if you come across something that is too complex or inefficient to code up yourself you can switch to the in-memory DB at that point (either wholesale for all queries, or just for the complex ones if there's an overhead in setting up the temporary tables).
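A minimal sketch of that encapsulation, assuming Java 8 predicates are available. The Item class, the Refiner interface and the predicate factories are all hypothetical names standing in for your domain objects and the fixed set of attributes users can filter on.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class ResultRefining {
    /** Hypothetical stand-in for one of the domain objects in a search result. */
    static final class Item {
        final int size;
        final String status;

        Item(int size, String status) {
            this.size = size;
            this.status = status;
        }
    }

    /** The seam callers depend on; an in-memory-DB implementation could replace the one below. */
    interface Refiner<T> {
        List<T> refine(List<T> coarse, Predicate<T> criteria);
    }

    /** Option 3: a plain in-memory pass over the coarse result set. */
    static final class IteratingRefiner<T> implements Refiner<T> {
        @Override
        public List<T> refine(List<T> coarse, Predicate<T> criteria) {
            return coarse.stream().filter(criteria).collect(Collectors.toList());
        }
    }

    // The attributes are fixed in code; users only supply the values and ranges.
    static Predicate<Item> sizeGreaterThan(int min) {
        return item -> item.size > min;
    }

    static Predicate<Item> statusIs(String wanted) {
        return item -> wanted.equals(item.status);
    }
}
```

A user-configured criterion such as "size > 10 and status is OPEN" then becomes sizeGreaterThan(10).and(statusIs("OPEN")), and the calling code never needs to know whether the Refiner iterates a list or queries an in-memory database.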
Option 2 seems to be good - since you can toggle between 1 & 2 as per need. 3 is restricted in terms of future data sizing issue as well. Querying objects would imply greater dependency on the code structure for storage and querying.
It would probably be a good idea to add some caching mechanism (ehcache/memcache) alongside Option 2 and then profile to check the performance difference.