I have huge distributed datasets, each of which is used to train a classifier. All the datasets have identical attributes, and the training is done with a single algorithm, J48.
The problem I am facing is how to combine these classifiers into a single classifier that can be used for testing and for predicting on new data.
I am using the Weka tool for the code. I have converted the Weka JAR to a DLL and am using C#.
Any help in C# or Java would be greatly appreciated.
If any additional information is needed, feel free to ask.
Thanks
It is perfectly possible to do what you are asking for. You could build N different classifiers from N different but compatible datasets and combine their outputs to form a new, higher-order dataset. It's a hierarchical way of combining classifiers, and there is a great variety of ways to do it. It's called 'ensembling' or a 'classifier ensemble', and there are many technical articles detailing how to do it.
One approach would be the following (a rough Weka sketch follows the list):
1. Train/get N different classifiers.
2. Build a new dataset from their probability outputs for a known set of instances: one instance per row, one column per output probability, plus the right/known class.
3. Throw away the original attributes and retain only the calculated output probabilities and the known class.
4. Train a new model/classifier on this higher-order dataset (you don't need to use all the data; a moderate subsample is enough).
5. For every new instance, get the lower-level probabilities from the N classifiers, as before, and apply the higher-level classifier to the newly constructed instance.
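A rough Java sketch of steps 2-4 with the Weka 3.7+ API (the method name buildMetaDataset, the baseClassifiers array and the labelled holdout set are illustrative only; it assumes the base classifiers were already trained elsewhere on their respective datasets):

import java.util.ArrayList;
import weka.classifiers.Classifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

public class StackingSketch {

    // Steps 2-3: one numeric attribute per (classifier, class) probability, plus the known class.
    public static Instances buildMetaDataset(Classifier[] baseClassifiers,
                                             Instances holdout) throws Exception {
        ArrayList<Attribute> attrs = new ArrayList<>();
        for (int c = 0; c < baseClassifiers.length; c++) {
            for (int k = 0; k < holdout.numClasses(); k++) {
                attrs.add(new Attribute("p_c" + c + "_class" + k));
            }
        }
        ArrayList<String> classValues = new ArrayList<>();
        for (int k = 0; k < holdout.numClasses(); k++) {
            classValues.add(holdout.classAttribute().value(k));
        }
        attrs.add(new Attribute("trueClass", classValues));

        Instances meta = new Instances("meta", attrs, holdout.numInstances());
        meta.setClassIndex(meta.numAttributes() - 1);

        for (int n = 0; n < holdout.numInstances(); n++) {
            Instance inst = holdout.instance(n);
            double[] row = new double[meta.numAttributes()];
            int i = 0;
            for (Classifier c : baseClassifiers) {
                for (double p : c.distributionForInstance(inst)) {
                    row[i++] = p;                       // lower-level probability outputs
                }
            }
            row[i] = inst.classValue();                 // the right/known class
            meta.add(new DenseInstance(1.0, row));
        }
        return meta;                                    // step 4: train any classifier on this
    }
}

Weka also ships a ready-made weka.classifiers.meta.Stacking ensemble that follows the same idea, if you prefer not to build the meta-dataset by hand.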
Hope this helps.
I don't think it is possible to create N classifiers on N training sets and then combine them into a single classifier, because the data are different and therefore the models will be different. Instead, if I were happy with the N results, I would combine all N datasets and build a single model from the combined data to test and predict on unseen data.
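If you go that route, merging compatible Weka datasets is straightforward; here is a minimal sketch (the merge method and the parts array are just for illustration, and it assumes all datasets share exactly the same attribute definitions):

import weka.core.Instances;

public class MergeDatasets {
    public static Instances merge(Instances[] parts) {
        Instances combined = new Instances(parts[0]);        // copies header and data of the first part
        for (int p = 1; p < parts.length; p++) {
            for (int i = 0; i < parts[p].numInstances(); i++) {
                combined.add(parts[p].instance(i));          // append every instance from the other parts
            }
        }
        return combined;                                     // train a single J48 on this
    }
}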
Related
This seems most related to: How to get the probability per instance in classifications models in spark.mllib
I'm doing a classification task with Spark ML, building a MultilayerPerceptronClassifier. Once I build a model, I can get the predicted class for a given input vector, but I can't get the probability for each output class. The question linked above indicates that NaiveBayesModel supports this functionality as of Spark 1.5.0 (via a predictProbabilities method). I would like to get at this functionality for the MLPC. Is there a way I can hack at it to get my probabilities? Will it be included in 1.6.2?
If you take a look at this line in the MLPC source code, you can see that the MLPC works from an underlying TopologyModel, which provides the .predict method I'm looking for. The MLPC then decodes the resulting Vector into a single label.
I'm able to use the trained MLPC model to create a new TopologyModel using its weights:
// Train the MLPC as usual (configuration elided).
MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()...;
MultilayerPerceptronClassificationModel model = trainer.fit(trainingData);

// Rebuild the underlying network from the trained model's layer sizes and weights;
// 'true' puts a softmax on the top layer, matching the classifier's setup.
TopologyModel topoModel = FeedForwardTopology.multiLayerPerceptron(model.layers(), true).getInstance(model.weights());
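If TopologyModel.predict is reachable from your code (these ann classes are package-private in some Spark versions, so the snippet may need to live inside an org.apache.spark.ml subpackage), a hypothetical follow-up would be to read the raw output of the top layer directly:

// org.apache.spark.mllib.linalg.Vectors in the Spark 1.x API.
// The input vector must have as many entries as the network's input layer.
Vector features = Vectors.dense(0.1, 0.2, 0.3);
Vector rawOutput = topoModel.predict(features);   // one entry per output class; with softmax on top these should behave like probabilities
System.out.println(rawOutput);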
I think the short answer is No.
The MultilayerPerceptronClassifier is not probabilistic. When the weights (and any biases) are set after training, the classification for a given input will always be the same.
What you're really asking, I think, is "if I were to tweak the weights by certain random disturbances of a given magnitude, how likely would the classification be the same as without the tweaks?"
You could do an ad hoc probability calculation by re-training the perceptron (with different, randomly chosen starting conditions) and get some idea of the probability of various classifications.
But I don't think this is really part of the expected behavior of an MLPC.
I have been thinking about an approach to this problem, but I have not found any solution that convinces me. I am programming a crawler, and I have a download task for every URL from a list of URLs. In addition, the different HTML documents are parsed in different ways depending on the site URL and the information I want to extract. So my problem is how to link every task with its appropriate parser.
The ideas are:
Creating a huge 'if' that checks the download type and associates a parser with it.
(Avoided, because the 'if' grows with every new site added to the crawler.)
Using polymorphism: creating a different download task for each site, tied to the type of information I want to get, and then using a post-action that links it to its parser.
(This again increases complexity with every new parser.)
So I am looking for some kind of software pattern or idea that says:
'Hey, I am a download task with this information.'
'Really? Then you need this parser to extract it. Here is the parser you need.'
Additional information:
The architecture is very simple: a list of URLs which are the seeds for the crawler, a producer which downloads the pages, another list with the downloaded HTML documents, and a consumer which should apply the right parser to each page.
Depending on the downloaded page, we sometimes need parser A, sometimes parser B, etc.
EDIT
An example:
We have three websites: site1.com, site2.com and site3.com.
There are three URL types we want to parse: site1.com/A, site1.com/B, site1.com/C, site2.com/A, site2.com/B, site2.com/C, ... site3.com/C
Every URL is parsed differently, and usually the same kind of information is shared between site1.com/A - site2.com/A - site3.com/A; ...; site1.com/C - site2.com/C - site3.com/C.
It looks like a genetic-algorithm-based solution fits your description of the problem; what you need to find first are the basic (atomic) solutions.
Here's a short description from Wikipedia:
In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties (its chromosomes or genotype) which can be mutated and altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible.[2]
The evolution usually starts from a population of randomly generated individuals, and is an iterative process, with the population in each iteration called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved. The more fit individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.
A typical genetic algorithm requires:
a genetic representation of the solution domain,
a fitness function to evaluate the solution domain.
A standard representation of each candidate solution is as an array of bits.[2] Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming and graph-form representations are explored in evolutionary programming; a mix of both linear chromosomes and trees is explored in gene expression programming.
Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion and selection operators.
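To make the loop from that description concrete, here is a minimal, self-contained GA skeleton over bit strings with the classic "count the ones" toy fitness (all names and constants are illustrative; your real fitness function would instead score candidate parser assignments or whatever you choose to encode):

import java.util.Arrays;
import java.util.Random;

public class SimpleGa {
    static final int POP = 50, GENES = 32, GENERATIONS = 200;
    static final double MUTATION_RATE = 0.01;
    static final Random RNG = new Random();

    static int fitness(boolean[] genome) {
        int ones = 0;
        for (boolean g : genome) if (g) ones++;
        return ones;                                   // objective: maximise the number of 1s
    }

    static boolean[] tournament(boolean[][] pop) {
        boolean[] a = pop[RNG.nextInt(POP)], b = pop[RNG.nextInt(POP)];
        return fitness(a) >= fitness(b) ? a : b;       // fitter of two random individuals
    }

    public static void main(String[] args) {
        // Random initial population.
        boolean[][] pop = new boolean[POP][GENES];
        for (boolean[] g : pop)
            for (int i = 0; i < GENES; i++) g[i] = RNG.nextBoolean();

        for (int gen = 0; gen < GENERATIONS; gen++) {
            boolean[][] next = new boolean[POP][GENES];
            for (int n = 0; n < POP; n++) {
                boolean[] p1 = tournament(pop), p2 = tournament(pop);
                int cut = RNG.nextInt(GENES);          // one-point crossover
                for (int i = 0; i < GENES; i++) {
                    next[n][i] = i < cut ? p1[i] : p2[i];
                    if (RNG.nextDouble() < MUTATION_RATE) next[n][i] = !next[n][i];   // mutation
                }
            }
            pop = next;                                // next generation replaces the current one
        }
        System.out.println("best fitness: " +
                Arrays.stream(pop).mapToInt(SimpleGa::fitness).max().getAsInt());
    }
}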
I would externalize the parsing patterns/structure in some form (like XML) and use them dynamically.
For example, say I have to download site1.com and site2.com, which have two different layouts. I will create two XML files which hold the layout patterns,
and one master XML file which records which URL should use which XML.
At startup, load this master XML and use it as a dictionary. When you have to download, download the page, look up the matching XML in the dictionary, and pass the XML and the stream to a single generic parser, which reads the stream based on the XML's structure and information.
This way, we can create common patterns in XML and use them to read similar sites. Use regular expressions in the XML patterns to cover most sites with a single XML file.
If a layout is completely different, just create a new XML file and update the master XML; that's it.
The success of this design depends on how you create such generic XML files, and that depends purely on what you need and what you do after parsing.
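A minimal sketch of the master-XML lookup (the master.xml format with <rule urlPattern="..." layout="..."/> entries is hypothetical, just one way to encode the mapping):

import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class LayoutRegistry {
    private final Map<Pattern, String> rules = new LinkedHashMap<>();

    // Reads the hypothetical master.xml and builds a url-pattern -> layout-file dictionary.
    public LayoutRegistry(String masterXmlPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(Paths.get(masterXmlPath).toFile());
        NodeList nodes = doc.getElementsByTagName("rule");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element rule = (Element) nodes.item(i);
            rules.put(Pattern.compile(rule.getAttribute("urlPattern")),
                      rule.getAttribute("layout"));
        }
    }

    // Returns the layout XML file for the URL, or null if no rule matches.
    public String layoutFor(String url) {
        for (Map.Entry<Pattern, String> e : rules.entrySet()) {
            if (e.getKey().matcher(url).matches()) {
                return e.getValue();
            }
        }
        return null;
    }
}

The producer then downloads the page, calls layoutFor(url) to pick the layout XML, and hands both the layout and the stream to the single generic parser.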
This seems to be a connectivity problem. I'd suggest considering the quick find algorithm.
See here for more details.
http://jaysonlagare.blogspot.com.au/2011/01/union-find-algorithms.html
and here's a simple java sample,
https://gist.github.com/gtkesh/3604922
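In case those links go stale, here is a minimal quick-find sketch (each element stores the id of its component, so connected is O(1) and union is O(n)):

public class QuickFind {
    private final int[] id;

    public QuickFind(int n) {
        id = new int[n];
        for (int i = 0; i < n; i++) {
            id[i] = i;              // every element starts in its own component
        }
    }

    public boolean connected(int p, int q) {
        return id[p] == id[q];
    }

    public void union(int p, int q) {
        int pid = id[p];
        int qid = id[q];
        if (pid == qid) return;
        for (int i = 0; i < id.length; i++) {
            if (id[i] == pid) {
                id[i] = qid;        // relabel p's whole component to q's id
            }
        }
    }
}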
I have some high-dimensional (30,000 dimensions) vectors of integers. I have 2 classes: [YES, NO]. I have 6,000 samples of the YES class and 50,000 samples of the NO class. I would like to train a classifier to automatically assign new samples to one of these classes in the future.
I know how to use the Weka Java API, but I am not sure which algorithms in which order to use. Can anyone give me advice on the following questions:
Are the vectors too high-dimensional, or do I have too many samples, to do this efficiently in Weka?
Should I reduce the dimensionality before I start? What algorithm can I use to identify significant elements of my feature vector?
What classifier would be best for this kind of data? I think a decision tree should work fine, but maybe Naive Bayes is faster to train, is it?
Since every attribute must have a name in Weka, how can I assign a name to each of my 30,000 features?
Any advice is appreciated. Thanks.
The number of dimensions in this problem is certainly quite large, but I believe Weka should be able to handle it. The number of samples should not be a problem either, but there are far more NO-class samples than YES-class samples, so balancing the two classes might help the classifier handle the underrepresented class better.
If you believe that some dimensions are redundant or contain noise, then reducing the dimensionality beforehand would certainly help.
A decision tree shouldn't be too much of a problem. There are a number of algorithms available in Weka, but I wouldn't recommend neural networks given the dimensionality of the problem.
If you have saved the data in a CSV file, you can assign attribute names in the first row (the header). Given the number of dimensions, you would likely call these a1 to a30000, plus something like 'output' for the class attribute.
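A minimal sketch of that setup with the Weka Java API (data.csv is a hypothetical file with a header row a1,...,a30000,output and the class in the last column):

import java.io.File;

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class TrainFromCsv {
    public static void main(String[] args) throws Exception {
        // Load the CSV; the header row supplies the attribute names (a1..a30000, output).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("data.csv"));          // hypothetical file name
        Instances data = loader.getDataSet();

        // Assume the class attribute ("output") is the last column.
        data.setClassIndex(data.numAttributes() - 1);

        // Build a J48 decision tree on the full dataset and print it.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}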
Hope this helps!
That's a rather newbie question, so please take it with a grain of salt.
I'm new to the field of data mining and trying to get my head wrapped around this topic. Right now I'm trying to polish my existing model so that it classifies instances better. The problem is that my model has around 480 attributes. I know for sure that not all of them are relevant, but it's hard for me to point out which ones are actually important.
The question is: given valid training and test sets, is there some data mining algorithm that would throw away attributes that seem to have no impact on the quality of classification?
I'm using Weka.
You should test using some of the Classifier algorithms that Weka has.
The basic idea is to use the Cross-validation option, so you can see which algorithm gives you the best Correctly Classified Instances value.
I can give you an example from one of my training sets, using the cross-validation option with 10 folds.
As you can see, using the J48 classifier I will have:
Correctly Classified Instances 4310 83.2207 %
Incorrectly Classified Instances 869 16.7793 %
and if I use, for example, the NaiveBayes algorithm, I will have:
Correctly Classified Instances 1996 38.5403 %
Incorrectly Classified Instances 3183 61.4597 %
and so on, the values differ depending on the algorithm.
So, test as many algorithms as possible and see which one gives you the best trade-off between correctly classified instances and time consumed.
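A rough sketch of automating that comparison with the Weka Java API (training.arff is a hypothetical file; the class is assumed to be the last attribute):

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        // Load the training set (hypothetical file name) and set the class attribute.
        Instances data = DataSource.read("training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] candidates = { new J48(), new NaiveBayes() };
        for (Classifier c : candidates) {
            // 10-fold cross-validation, as in the Explorer's default setting.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%s: %.4f %% correctly classified%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}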
Comment converted to answer as OP suggested:
If you use Weka 3.6.6: open the Explorer module, go to the 'Select attributes' tab and choose an 'Attribute evaluator' and a 'Search method'; you can also choose between using the full data set or CV sets. For more details see e.g. http://forums.pentaho.com/showthread.php?68687-Selecting-Attributes-with-Weka or http://weka.wikispaces.com/Performing+attribute+selection
Read up on the topic of clustering algorithms (only on your training set though!)
Look into the InfoGainAttributeEval class.
The buildEvaluator() and the evaluateAttribute(int index) functions should help.
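One common way to use it is through the AttributeSelection wrapper rather than calling buildEvaluator()/evaluateAttribute() yourself; a minimal sketch (training.arff and the choice of keeping 50 attributes are hypothetical):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        // Load the training set (hypothetical file name) and set the class attribute.
        Instances data = DataSource.read("training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Rank all attributes by information gain with respect to the class.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(50);            // keep the 50 best attributes, for example
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // 0-based indices of the chosen attributes.
        System.out.println(java.util.Arrays.toString(selector.selectedAttributes()));
    }
}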
I'm thinking of writing a simple research paper manager.
The idea is to have a repository containing, for each paper, its metadata:
paper_id -> [title, authors, journal, comments...]
Since it would be nice to be able to import a friend's paper dump,
I'm thinking about how to generate the paper_id of a paper: IMHO it should be derived
from the text of the PDF, to guarantee that two different collections have the same IDs only for the same papers.
At the moment, I extract the text of the first page using the iText library (removing any annotations), and I compute a simhash footprint from the text.
The main problem is that sometimes the text is slightly different (yes, it happens! for example this and this), so I would like to be tolerant of small differences.
With simhash I can compute how similar two documents are, so if a footprint is not in the repo, I'll have to iterate over the collection looking for
'near' footprints.
I'm not convinced by this method. Could you suggest a better way to produce a signature
(short, numerical or alphanumerical) for this kind of document?
UPDATE: I had this idea: divide the first page into 8 (more or less) non-overlapping squares covering the whole page, then take the text in each square
and generate a simhash signature for it. In the end I'll have an 8x64 = 512-bit signature, and I can consider
two papers the same if the sum of the differences between their simhash signatures is under a certain threshold.
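A minimal sketch of that comparison (the threshold value is hypothetical and would need tuning on real data):

public class PageSignature {
    static final int BLOCKS = 8;
    static final int THRESHOLD = 24;             // hypothetical value, to be tuned

    // Sum of the Hamming distances between corresponding 64-bit simhash blocks.
    static int distance(long[] sigA, long[] sigB) {
        int d = 0;
        for (int i = 0; i < BLOCKS; i++) {
            d += Long.bitCount(sigA[i] ^ sigB[i]);   // differing bits in block i
        }
        return d;
    }

    static boolean samePaper(long[] sigA, long[] sigB) {
        return distance(sigA, sigB) <= THRESHOLD;
    }
}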
If you actually have a function that takes two texts and returns a measure of their similarity, you do not have to iterate over the entire repository.
Given an article that is not in the repository, you can iterate over only those articles that have approximately the same length. For example, given an article that has 1000 characters, you would compare it to articles having between 950 and 1050 characters. For this you will need a data structure that maps length ranges to articles, and you will have to fine-tune the size of the range: too large, and you get too many items in each range; too small, and you have a higher chance of a miss.
Of course, this will fail in some edge cases. For example, if the second of two documents is simply the first copy-pasted twice, you would probably want them to be considered equal, but you will not even compare them since they are too far apart in length. There are methods to deal with that too, but you probably ain't gonna need it.
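A minimal sketch of the length-bucketing idea using a TreeMap keyed by character count (the ±5% window is a hypothetical choice):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class LengthIndex {
    private final TreeMap<Integer, List<String>> byLength = new TreeMap<>();

    public void add(String articleText) {
        byLength.computeIfAbsent(articleText.length(), k -> new ArrayList<>())
                .add(articleText);
    }

    // Candidate articles whose length is within ±5% of the query's length.
    public List<String> candidates(String articleText) {
        int len = articleText.length();
        int lo = (int) (len * 0.95), hi = (int) (len * 1.05);
        List<String> result = new ArrayList<>();
        for (List<String> bucket : byLength.subMap(lo, true, hi, true).values()) {
            result.addAll(bucket);
        }
        return result;
    }
}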