Weka: How can I implement a Surrogate Split in J48 Decision Tree? - java

Can anybody help me implement alternative missing-value handling in the J48 algorithm using the Weka API in Java?
I am sure that using pre-imputation approaches before training J48 is easy.
But what about using a surrogate split attribute when partitioning the training data (as Breiman does in CART), instead of J48's standard approach (Quinlan's C4.5), which splits such cases across a probability distribution estimated from the observed cases with known values?
Can anybody give me some information or tips on where in the Weka API and source code I have to make modifications to replace the standard handling with surrogate splits?

Look at the Weka source code of weka.classifiers.trees.j48.C45ModelSelection, starting at line 152 ("Find 'best' attribute to split on"). It uses the information gain ratio as its splitting criterion.
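To make concrete what a CART-style surrogate split would have to compute, here is a minimal sketch (not Weka's own code, and the class name is made up): given the primary numeric split chosen in C45ModelSelection, it scores a candidate surrogate attribute by how often a threshold split on it sends a case to the same branch as the primary split, counting only cases where both attributes are observed. The best-scoring surrogate would then route instances whose primary attribute is missing, in place of J48's fractional-weight distribution.

    import weka.core.Instance;
    import weka.core.Instances;

    // Hypothetical helper, NOT part of Weka: sketches the CART surrogate idea
    // of ranking candidate attributes by agreement with the primary split.
    public class SurrogateScorer {

        // Fraction of instances (with both attributes observed) that a threshold
        // split on surrogateAtt sends to the same branch as the primary split.
        public static double agreement(Instances data,
                                       int primaryAtt, double primaryThreshold,
                                       int surrogateAtt, double surrogateThreshold) {
            int agree = 0, total = 0;
            for (int i = 0; i < data.numInstances(); i++) {
                Instance inst = data.instance(i);
                // Surrogates are ranked on cases where the primary attribute is
                // observed; they are used later for cases where it is missing.
                if (inst.isMissing(primaryAtt) || inst.isMissing(surrogateAtt)) {
                    continue;
                }
                boolean primaryLeft = inst.value(primaryAtt) <= primaryThreshold;
                boolean surrogateLeft = inst.value(surrogateAtt) <= surrogateThreshold;
                if (primaryLeft == surrogateLeft) {
                    agree++;
                }
                total++;
            }
            return total == 0 ? 0.0 : (double) agree / total;
        }
    }

Roughly speaking, wiring this in would mean modifying the split model that J48 builds (C45Split and the related distribution code), since the Quinlan-style missing-value handling is hard-coded there.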

Related

Naive Bayes and SVM java implementation for document classification

I am trying to classify legal case documents, which are in text format and stored in different folders such as Civil, Land, Criminal, etc. I intend to use Naive Bayes as a vectorizer to get vectors from the text documents and feed them into an SVM to classify the documents using Java-ML. I have implemented preprocessing such as stemming, and I used the Naive Bayes formulas from http://eprints.nottingham.ac.uk/2995/1/Isa_Text.pdf to calculate the prior probability, likelihood, evidence and posterior probability. I am assuming the posterior probability is the vector to be fed into the SVM, but I cannot format the output to feed into the SVM library.
I need all the help I can get in this, I hope I am doing things right.
I have other legal cases as test set that I want to classify to the right categories.
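One hedged sketch of the formatting step, under assumptions: treat the per-class posterior probabilities from the Naive Bayes stage as the feature vector of each document and build a dataset from them for the SVM. The example below uses Weka's SMO rather than the Java-ML LibSVM wrapper from the question (only because this page is Weka-centric), and the posterior values and labels are invented.

    import java.util.ArrayList;

    import weka.classifiers.functions.SMO;
    import weka.core.Attribute;
    import weka.core.DenseInstance;
    import weka.core.Instances;

    public class PosteriorToSvm {

        public static void main(String[] args) throws Exception {
            // Hypothetical output of the Naive Bayes step: one row per training
            // document, one column per category, plus the known label of that document.
            double[][] posteriors = {
                {0.80, 0.15, 0.05},
                {0.10, 0.70, 0.20},
                {0.05, 0.10, 0.85}
            };
            String[] labels = {"Civil", "Land", "Criminal"};

            // One numeric attribute per class posterior.
            ArrayList<Attribute> attrs = new ArrayList<>();
            for (String c : labels) {
                attrs.add(new Attribute("p_" + c));
            }
            // Nominal class attribute with the category names.
            ArrayList<String> classValues = new ArrayList<>();
            for (String c : labels) {
                classValues.add(c);
            }
            attrs.add(new Attribute("category", classValues));

            Instances train = new Instances("legal-cases", attrs, posteriors.length);
            train.setClassIndex(train.numAttributes() - 1);

            for (int i = 0; i < posteriors.length; i++) {
                DenseInstance inst = new DenseInstance(train.numAttributes());
                inst.setDataset(train);
                for (int j = 0; j < posteriors[i].length; j++) {
                    inst.setValue(j, posteriors[i][j]);
                }
                inst.setClassValue(labels[i]); // known label of training document i
                train.add(inst);
            }

            SMO svm = new SMO();               // Weka's SVM implementation
            svm.buildClassifier(train);
            System.out.println(svm);
        }
    }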

Software pattern for matching objects with handlers

I have been thinking about an approach to this problem, but I have not found any solution that convinces me. I am programming a crawler, and I have a download task for every URL from a URL list. In addition, the different HTML documents are parsed in different ways depending on the site URL and the information I want to extract. So my problem is how to link every task with its appropriate parser.
The ideas are:
Creating a huge 'if' that checks the download type and associates a parser with it.
(Avoided, because the 'if' grows with every new site added to the crawler.)
Using polymorphism to create a different download task for every site, tied to the type of information I want to get, and then using a post-action that links it to its parser.
(Again, the complexity increases with every new parser.)
So I am looking for some kind of software pattern or idea that lets me say:
"Hey, I am a download task with this information."
"Really? Then you need this parser to extract it. Here is the parser you need."
Additional information:
The architecture is very simple. A list of URLs that are seeds for the crawler. A producer that downloads the pages. Another list with the downloaded HTML documents. And a consumer that should apply the right parser to each page.
Depending on the downloaded page, sometimes we need to use parser A, sometimes parser B, etc.
EDIT
An example:
We have three websites: site1.com, site2.com and site3.com.
There are three URL types that we want to parse (A, B and C on each site): site1.com/A, site1.com/B, site1.com/C, site2.com/A, site2.com/B, site2.com/C, ... site3.com/C.
Every URL is parsed differently, and usually the same kind of information is shared between site1.com/A, site2.com/A and site3.com/A; ...; between site1.com/C, site2.com/C and site3.com/C.
It looks like a genetic algorithm approach fits your description of the problem; what you need to find first are the basic (atomic) solutions.
Here's a tiny description from wikipedia:
In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties (its chromosomes or genotype) which can be mutated and altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible.[2]
The evolution usually starts from a population of randomly generated individuals, and is an iterative process, with the population in each iteration called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved. The more fit individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.
A typical genetic algorithm requires:
a genetic representation of the solution domain,
a fitness function to evaluate the solution domain.
A standard representation of each candidate solution is as an array of bits.[2] Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming and graph-form representations are explored in evolutionary programming; a mix of both linear chromosomes and trees is explored in gene expression programming.
Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion and selection operators.
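To show what the quoted loop looks like in code, here is a tiny, self-contained genetic algorithm in Java using the classic toy fitness function "count the 1 bits"; it only illustrates selection, crossover and mutation, and is not tailored to the crawler problem.

    import java.util.Random;

    // Tiny illustrative GA: evolve bit strings toward all ones.
    public class TinyGeneticAlgorithm {

        static final int POP = 30, LEN = 20, GENERATIONS = 100;
        static final double MUTATION_RATE = 0.01;
        static final Random RND = new Random(42);

        // Fitness = number of 1 bits (the toy objective function).
        static int fitness(boolean[] genome) {
            int f = 0;
            for (boolean b : genome) if (b) f++;
            return f;
        }

        // Tournament selection: pick the fitter of two random individuals.
        static boolean[] select(boolean[][] pop) {
            boolean[] a = pop[RND.nextInt(POP)], b = pop[RND.nextInt(POP)];
            return fitness(a) >= fitness(b) ? a : b;
        }

        public static void main(String[] args) {
            // Random initial population.
            boolean[][] pop = new boolean[POP][LEN];
            for (boolean[] g : pop)
                for (int i = 0; i < LEN; i++) g[i] = RND.nextBoolean();

            for (int gen = 0; gen < GENERATIONS; gen++) {
                boolean[][] next = new boolean[POP][LEN];
                for (int k = 0; k < POP; k++) {
                    boolean[] p1 = select(pop), p2 = select(pop);
                    int cut = RND.nextInt(LEN);               // single-point crossover
                    for (int i = 0; i < LEN; i++) {
                        next[k][i] = (i < cut ? p1[i] : p2[i]);
                        if (RND.nextDouble() < MUTATION_RATE) // random mutation
                            next[k][i] = !next[k][i];
                    }
                }
                pop = next;
            }

            boolean[] best = pop[0];
            for (boolean[] g : pop) if (fitness(g) > fitness(best)) best = g;
            System.out.println("best fitness = " + fitness(best) + " / " + LEN);
        }
    }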
I would externalize the parsing pattern/structure in some form (like XML) and use it dynamically.
For example, say I have to download site1.com and site2.com, and the two have different layouts. I would create two XML files that hold the layout patterns,
and one master XML that records which URL should use which XML.
At startup, load this master XML and use it as a dictionary. When you have to download a page, download it, find the right XML in the dictionary, and pass that XML and the stream to the parser (a single generic parser), which reads the stream based on the information in the XML.
In this way, we can create common patterns in XML and reuse them to read similar sites. Use regular expressions in the XML patterns to cover most sites with a single XML.
If a layout is completely different, just create another XML and update the master XML; that's it.
The secret to making this design succeed is how you create such generic XMLs, and that depends purely on what you need and what you are doing after parsing.
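A minimal sketch of the master-lookup part of this design, under assumptions: the master XML has already been loaded into a map from URL regex to layout-config file (the patterns, file names and the GenericParser mentioned in a comment are all hypothetical), and the dispatcher hands the matching layout plus the page stream to one generic parser.

    import java.io.InputStream;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    // Sketch of dispatching downloaded pages to layout configs via a master lookup.
    public class LayoutDispatcher {

        // In a real setup this map would be loaded from the master XML at startup;
        // the patterns and file names below are made up.
        private final Map<Pattern, String> masterLookup = new LinkedHashMap<>();

        public LayoutDispatcher() {
            masterLookup.put(Pattern.compile("https?://(www\\.)?site1\\.com/.*"), "layout-site1.xml");
            masterLookup.put(Pattern.compile("https?://(www\\.)?site2\\.com/.*"), "layout-site2.xml");
        }

        // Returns the layout config for a URL, or null if none matches.
        public String layoutFor(String url) {
            for (Map.Entry<Pattern, String> e : masterLookup.entrySet()) {
                if (e.getKey().matcher(url).matches()) {
                    return e.getValue();
                }
            }
            return null;
        }

        // The single generic parser reads the page guided by the layout config.
        public void parse(String url, InputStream page) {
            String layout = layoutFor(url);
            if (layout == null) {
                throw new IllegalArgumentException("No layout configured for " + url);
            }
            // GenericParser is hypothetical: it would walk the stream using the
            // rules described in the layout XML.
            // new GenericParser(layout).parse(page);
            System.out.println(url + " -> " + layout);
        }
    }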
This seems to be a connectivity problem. I'd suggest considering the quick find algorithm.
See here for more details.
http://jaysonlagare.blogspot.com.au/2011/01/union-find-algorithms.html
and here's a simple java sample,
https://gist.github.com/gtkesh/3604922
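For reference, a minimal quick-find (union-find) sketch along the lines of those links, assuming items are numbered 0..n-1:

    // Minimal quick-find union-find over items 0..n-1.
    public class QuickFind {
        private final int[] id;

        public QuickFind(int n) {
            id = new int[n];
            for (int i = 0; i < n; i++) id[i] = i; // each item starts in its own set
        }

        // Two items are connected iff they carry the same set id.
        public boolean connected(int p, int q) {
            return id[p] == id[q];
        }

        // Union: relabel every member of p's set with q's id (O(n) per union).
        public void union(int p, int q) {
            int pid = id[p], qid = id[q];
            if (pid == qid) return;
            for (int i = 0; i < id.length; i++) {
                if (id[i] == pid) id[i] = qid;
            }
        }
    }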

500,000 street names - what data structure to use to implement a fast search?

So we have many street names. They come in a file. I'd probably cache them when booting the server up in production. The search should be autocomplete-like, e.g. you type 'lang' and you would get maybe 8 hits: langstr, langestr, etc.
What you are looking for is some sort of compressed trie representation. You might want to look into succinct tries or DAWGs as a starting point, as they give excellent efficiency and very good space usage.
Hope this helps!
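To make that concrete, here is a minimal, uncompressed prefix-trie sketch in Java; a succinct trie or DAWG would keep the same lookup interface while using far less space. The street names and the limit of 8 results mirror the question.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Minimal prefix trie for autocomplete; not compressed/succinct.
    public class StreetTrie {

        private static final class Node {
            final TreeMap<Character, Node> children = new TreeMap<>();
            boolean isWord;
        }

        private final Node root = new Node();

        public void add(String name) {
            Node n = root;
            for (char c : name.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.isWord = true;
        }

        // Up to 'limit' names starting with 'prefix', in lexicographic order.
        public List<String> complete(String prefix, int limit) {
            List<String> out = new ArrayList<>();
            Node n = root;
            for (char c : prefix.toCharArray()) {
                n = n.children.get(c);
                if (n == null) return out;        // no name has this prefix
            }
            collect(n, new StringBuilder(prefix), out, limit);
            return out;
        }

        private void collect(Node n, StringBuilder sb, List<String> out, int limit) {
            if (out.size() >= limit) return;
            if (n.isWord) out.add(sb.toString());
            for (Map.Entry<Character, Node> e : n.children.entrySet()) {
                sb.append(e.getKey());
                collect(e.getValue(), sb, out, limit);
                sb.setLength(sb.length() - 1);
                if (out.size() >= limit) return;
            }
        }

        public static void main(String[] args) {
            StreetTrie t = new StreetTrie();
            t.add("langstr");
            t.add("langestr");
            t.add("hauptstr");
            System.out.println(t.complete("lang", 8)); // [langestr, langstr]
        }
    }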
Autocomplete is usually implemented using one of the following:
Trees. By indexing the searchable text in a tree structure (prefix tree, suffix tree, dawg, etc..) one can execute very fast searches at the expense of memory storage. The tree traversal can be adapted for approximate matching.
Pattern Partitioning. By partitioning the text into tokens (ngrams) one can execute searches for pattern occurrences using a simple hashing scheme.
Filtering. Find a set of potential matches and then apply a sequential algorithm to check each candidate.
Take a look at completely, a Java autocomplete library that implements some of the latter concepts.

Binary classification for web pages

We are interested in doing binary classification of web pages present across the web e.g. Ecommerce vs Non-Ecommerce.
Currently, we are using Mahout library with Naive Bayes algorithm. We are creating training data from existing classified URLs and feature set from the same.
What is the best possible way in terms of accuracy to perform this task?
I need help in terms of algorithm, libraries(usable with JAVA) or any better ideas that help in such types of classification.
Thanks in advance.
The question is quite general so I can add only general information.
The ways to improve the quality of your classification are (in order of importance):
use Lemmatisation and/or Stemming to use only base word forms
implement word filter to remove useless words
train separate classifiers for different languages
You may try to use some existing, well-tuned program, for example:
CRM114 is designed to be a spam filter, but it is generic enough to do what you want. People use it to sort resumes and similar things. It has lots of engines (HMM, SVM, CLUMP, Bayes, etc.). Give it a try.
This is a very good demonstration of the algorithm behind the NB classifier.
Discarding the most common words leads to better predictions; IDF can be a good tool for filtering out those words. Also see Wikipedia.
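Putting the suggestions above together in Weka terms (the question mentions Mahout, but the rest of this page is Weka-centric, so treat this as a transferable sketch rather than Mahout code): lower-case the tokens, prune the vocabulary, and apply TF-IDF weighting with StringToWordVector before training a classifier. The ARFF file name is hypothetical, and the option methods can differ slightly between Weka versions.

    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class PageClassifierSketch {

        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF with a string attribute for the page text and a
            // nominal class {ecommerce, non-ecommerce} as the last attribute.
            Instances raw = DataSource.read("pages.arff");
            raw.setClassIndex(raw.numAttributes() - 1);

            StringToWordVector bag = new StringToWordVector();
            bag.setLowerCaseTokens(true);   // crude normalization
            bag.setTFTransform(true);       // log term frequency
            bag.setIDFTransform(true);      // down-weight very common words
            bag.setWordsToKeep(2000);       // prune the vocabulary
            bag.setInputFormat(raw);

            Instances vectors = Filter.useFilter(raw, bag);

            NaiveBayesMultinomial nb = new NaiveBayesMultinomial();
            nb.buildClassifier(vectors);
            System.out.println(nb);
        }
    }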

most effective distance function for collaborative filtering in weka Java API

So I'm building this collaborative filtering system using Weka's machine learning library Java API.
I basically use the StringToWordVector filter to convert string objects into their word-occurrence decomposition.
Now I'm using the kNN algorithm to find the nearest neighbors to a target object.
My question is: what distance function should I use to compute the distance between two objects that have been filtered by the StringToWordVector filter? Which one would be most effective for this scenario?
The available options in Weka are:
AbstractStringDistanceFunction, ChebyshevDistance, EditDistance, EuclideanDistance, ManhattanDistance, NormalizableDistance
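For reference, a hedged sketch of how one of the listed distance functions is plugged into Weka's kNN learner (IBk) through its neighbour-search object; the dataset file is hypothetical and the method names are from recent Weka 3.x APIs.

    import weka.classifiers.lazy.IBk;
    import weka.core.Instances;
    import weka.core.ManhattanDistance;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.core.neighboursearch.LinearNNSearch;

    public class KnnWithDistance {

        public static void main(String[] args) throws Exception {
            // Hypothetical dataset that has already been through StringToWordVector.
            Instances data = DataSource.read("vectors.arff");
            data.setClassIndex(data.numAttributes() - 1);

            LinearNNSearch search = new LinearNNSearch();
            search.setDistanceFunction(new ManhattanDistance()); // or EuclideanDistance, ...

            IBk knn = new IBk(5);                                // 5 nearest neighbours
            knn.setNearestNeighbourSearchAlgorithm(search);
            knn.buildClassifier(data);
            System.out.println(knn);
        }
    }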
Similarity metrics are a broad topic. The short answer is that you should try them all and optimize with respect to RMSE, MAE, breadth of the return set, etc.
There seems to be a distinction between EditDistance and the rest of these metrics, as I would expect an edit-distance algorithm to work on the strings themselves.
How does your StringToWordVector work? Answer this question first, and then use that answer to fuel thoughts like: what do I want a similarity between two words to mean in my application (does semantic meaning outweigh word length, for instance)?
And as long as you're using StringToWordVector, it would seem you're free to consider more mainstream similarity metrics like log-likelihood, Pearson correlation, and cosine similarity. I think this is worth doing, as none of the similarity metrics you've listed are widely used or studied seriously in the literature, to my knowledge.
May the similarity be with you!
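Since cosine similarity is not among the DistanceFunction implementations listed in the question, here is a minimal sketch of computing it directly between two instances produced by StringToWordVector (skipping the class attribute); it is an illustration, not an existing Weka class, and it assumes the instances belong to a dataset.

    import weka.core.Instance;

    // Cosine similarity between two word-vector instances (class attribute skipped).
    public final class CosineSimilarity {

        public static double similarity(Instance a, Instance b) {
            double[] x = a.toDoubleArray();
            double[] y = b.toDoubleArray();
            int classIndex = a.classIndex();   // class attribute of the dataset, or -1 if none set

            double dot = 0.0, normX = 0.0, normY = 0.0;
            for (int i = 0; i < x.length; i++) {
                if (i == classIndex) continue; // ignore the class attribute
                dot   += x[i] * y[i];
                normX += x[i] * x[i];
                normY += y[i] * y[i];
            }
            if (normX == 0.0 || normY == 0.0) return 0.0;
            return dot / (Math.sqrt(normX) * Math.sqrt(normY));
        }
    }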
