I want to implement an OCR system. I need my program to not make any mistakes on the letters it does choose to recognize. It doesn't matter if it cannot recognize a lot of them (i.e., high precision even with low recall is okay).
Can someone help me choose a suitable ML algorithm for this? I've been looking around and have found some confusing things. For example, I found contradictory statements about SVM: in the scikit-learn docs, it was mentioned that we cannot get probability estimates for SVM, whereas I found another post that says it is possible to do this in WEKA.
Anyway, I am looking for a machine learning algorithm that best suits this purpose. It would be great if you could suggest a library for the algorithm as well. I prefer Python-based solutions, but I am OK with working in Java as well.
It is possible to get probability estimates from SVMs in scikit-learn by simply setting probability=True when constructing the SVC object. The docs only warn that the probability estimates might not be very good.
The quintessential probabilistic classifier is logistic regression, so you might give that a try. Note that LR is a linear model though, unlike SVMs which can learn complicated non-linear decision boundaries by using kernels.
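For concreteness, here is a minimal scikit-learn sketch of both options; the bundled digits dataset and the 0.99 cutoff are only stand-ins for your own OCR features and tuning:

    # Both SVC (with probability=True) and LogisticRegression expose
    # predict_proba; thresholding it lets you refuse uncertain letters
    # and trade recall for precision.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    svm = SVC(probability=True).fit(X_train, y_train)  # Platt-scaled estimates
    lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Keep only the predictions the model is very sure about; refuse the rest.
    proba = lr.predict_proba(X_test)
    confident = proba.max(axis=1) >= 0.99
    accepted = lr.predict(X_test)[confident]
    print("answered:", confident.mean())
    print("accuracy on answered:", (accepted == y_test[confident]).mean())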
I've seen people using neural networks with good results, but that was already a few years ago. I asked an expert colleague and he said that nowadays people use things like nearest-neighbor classifiers.
I don't know scikit or WEKA, but any half-decent classification package should have at least k-nearest neighbors implemented. Or you can implement it yourself; it's ridiculously easy. Give that one a try: it will probably have lower precision than you want. However, you can make a slight modification: instead of taking a simple majority vote (i.e., the most frequent class among the neighbors wins), require a larger consensus among the neighbors to assign a class (for example, at least 50% of the neighbors must be of the same class). The larger the consensus you require, the higher your precision will be, at the expense of recall.
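A sketch of that modification with scikit-learn, using the digits dataset as a stand-in for your OCR data: for KNeighborsClassifier with uniform weights, predict_proba is exactly the fraction of the k neighbors voting for each class, so "at least 80% of the neighbors agree" is a straight threshold.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)

    # Fraction of the 10 neighbors that voted for the winning class.
    proba = knn.predict_proba(X_test)
    consensus = proba.max(axis=1) >= 0.8      # require 8 of 10 neighbors
    pred = knn.classes_[proba.argmax(axis=1)]

    print("answered (recall side):", consensus.mean())
    print("precision on answered:",
          (pred[consensus] == y_test[consensus]).mean())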
Let's say I want to build a function that would properly schedule three bus drivers to drive in a week with the following constraints:
Each driver must not drive more than five times per week
There must be two drivers driving everyday
They will rest one day each week (will not clash with other drivers' rest day)
What kind of algorithm would be used to solve a problem like this?
I looked through several sites and I found these:
1) Backtracking algorithm (brute force)
2) Genetic algorithm
3) Constraint programming
Frankly, these are all a "culture shock" for me, as I have never learnt any kind of linear programming in the past. There are a few things I want to know:
1) Which algorithm will best suit the case scenario above?
2) What would be the simplest algorithm to solve this problem?
3) Please suggest any other algorithms I can look into to solve the above problem.
1) I agree brute force is bad.
2) Your problem is an integer problem. It can still be solved with linear programming techniques, though.
3) You can distinguish two different approaches: heuristics and exact approaches.
Heuristics provide good solutions in reasonable computation time. They are used when there are strict requirements on the computation time or when the problem is too hard to solve to optimality. Genetic algorithms are one such heuristic.
As your problem is comparatively simple, you would probably go with an exact approach.
4) The standard way to solve this exactly is to embed a linear program in a branch & bound search tree. There is lots of literature on it. The procedure can be outlined as follows:
Solve the linear program with the simplex algorithm
Find a fractional variable to branch on, e.g. x = 1.5
Create two new nodes and add the constraints x<=1 and x>=2 respectively
Go into one node (selected by some strategy)
Go to point 1
Additionally, at every node in the tree, after point 1, the algorithm checks whether the node can be pruned. That means to stop searching 'deeper' from this node on, because
a) the problem has become infeasible,
b) a better solution already exists, or
c) an integer solution has been found. The objective value of this solution is then used as the bound for check b).
The procedure finishes when all nodes are pruned.
Luckily, as Nicolas stated, there are free implementations that do just this. All you have to do is create your model: code its objective and constraints in some modeling tool and let it solve.
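As an illustration (my own sketch, not verified production code), the bus-driver model takes only a few lines in the free Python library PuLP, which hands the model to the CBC branch & bound solver. One reading of the rest-day constraint is that it is satisfied by construction here: with exactly two of the three drivers driving each day, exactly one driver rests each day, so rest days never clash.

    import pulp

    drivers, days = range(3), range(7)
    prob = pulp.LpProblem("bus_schedule", pulp.LpMinimize)

    # x[i][d] = 1 if driver i drives on day d
    x = pulp.LpVariable.dicts("x", (drivers, days), cat="Binary")

    prob += 0  # pure feasibility problem: nothing to optimize

    for d in days:                          # two drivers driving every day
        prob += pulp.lpSum(x[i][d] for i in drivers) == 2
    for i in drivers:                       # at most five shifts per driver
        prob += pulp.lpSum(x[i][d] for d in days) <= 5

    prob.solve()
    for i in drivers:
        print(f"driver {i}:", [int(pulp.value(x[i][d])) for d in days])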
First of all, this is a discrete optimization problem, so linear programming is probably not a good idea (since it is meant for continuous optimization). You can still solve this using linear programming (it will become an integer or mixed-integer program), but that is exponentially hard (if your input size is small, then it is OK).
Now back to the comparison:
Brute force: worst.
Genetic: cannot guarantee optimality; the algorithm may not even manage to solve the problem.
Constraint programming: definitely the best in this case (and in many discrete optimization problems). There is a very efficient implementation of it in the IBM ILOG CPLEX solver (it is not free, though it is free for academia and for testing).
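For a free alternative to CPLEX, here is what the same model looks like in a constraint programming tool, sketched with Google's OR-Tools CP-SAT solver (my own illustration; the variable names are made up to match the question):

    from ortools.sat.python import cp_model

    model = cp_model.CpModel()
    drivers, days = range(3), range(7)
    x = {(i, d): model.NewBoolVar(f"x_{i}_{d}") for i in drivers for d in days}

    for d in days:
        model.Add(sum(x[i, d] for i in drivers) == 2)  # two drivers per day
    for i in drivers:
        model.Add(sum(x[i, d] for d in days) <= 5)     # at most five shifts

    solver = cp_model.CpSolver()
    if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        for i in drivers:
            print(f"driver {i}:", [solver.Value(x[i, d]) for d in days])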
I thought of writing a piece of software which does alpha compositing. I didn't want ready-made code off the internet, so I tried to find research papers and other sources to understand the mathematical algorithms, and started to implement them.
But I got lost very quickly. So my question is:
How should I approach these papers to extract the necessary details in order to write an algorithm based on them? Is there any specific set of steps which works well?
Desired answer :
Read ...
Extract ...
Understand ...
Implement ...
Note: This question is not limited to alpha compositing, so a more generalised approach will be helpful. I have tagged Java and C++ because those are my desired languages for implementing the image processing.
What have I done so far?
This is not a homework question, but it is of course better to say what I know. I have read the wiki on alpha compositing and a few closely related image compositing research papers. But I am stuck at the next step to take in order to go from understanding to implementation.
Wikipedia
Technical Memo, Image compositing
I'd recommend reading articles with complex formulas with a pencil and paper. Work through the math involved until you have a good grasp on it. Then, you'll be ready to code.
Start with identifying the steps needed to perform your algorithm on some image data. Include all of the steps from loading the image itself into memory all the way through the complex calculations that you may need to perform. Then structure that list into pseudocode. Once you have that, it should be rather easy to code up.
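As a concrete target for that pseudocode step: for alpha compositing specifically, the classic "over" operator is only a couple of lines once worked out on paper. A minimal numpy sketch, assuming float RGBA images in [0, 1] with premultiplied alpha:

    import numpy as np

    def over(fg, bg):
        """Composite premultiplied-alpha image fg over bg (H x W x 4 arrays)."""
        alpha_fg = fg[..., 3:4]            # keep the axis for broadcasting
        return fg + bg * (1.0 - alpha_fg)  # C_out = C_f + C_b * (1 - alpha_f)

    # usage: result = over(foreground, background)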
Write pseudocode. Ideally, the authors of the research papers would have done this, but often they don't. Write pseudocode in some simple language like Matlab or possibly Python, and hack away at a working implementation based on the pseudocode.
If you understand some parts of the algorithm but not others, then turn your pseudocode into real code for the parts you understand, and leave comments for the places you don't.
The section from The Pragmatic Programmer on "Tracer Bullets" basically describes this idea. You want to quickly hack together something that takes your data into some form of an output, and then iterate on the body of the code to get it to slowly resemble the algorithm you're trying to produce.
My answer is necessarily somewhat vague. There's no magic bullet for something like this.
Have you implemented any image processing algorithms? Maybe start with something a little simpler, like desaturation/color intensification, reversal (side to side and upside down), rotating, scaling, and compositing images through a mask.
Once you have those figured out, you will be in a very good position to do an alpha composite.
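If it helps, several of those warm-up exercises are one-liners in numpy; a sketch assuming img is an H x W x 3 uint8 array (e.g. loaded via Pillow):

    import numpy as np

    def desaturate(img):
        # Standard luma weights; the exact coefficients are a design choice.
        return (img @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

    flip_horizontal = lambda img: img[:, ::-1]   # side to side
    flip_vertical = lambda img: img[::-1, :]     # upside down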
I agree that academic papers seem to go out of their way to make implementation details muddy and uncertain. I find that a large amount of simplification of what is written is needed to begin a practical implementation. In their haste to be general, writers excessively parameterize every aspect. To build useful, reliable software, it is necessary to start with something simple that actually works, so that it can be a framework to add features to. To do that, it is necessary to throw away 80–90 percent of the academic generality. Often much can be done with a raft of symbolic constants, and abandoning generality (say, for four- and five-dimensional images) doesn't really lose anything in practice.
My suggestion is to first write the algorithm in Matlab to make sure you understand all the steps, and then try to implement it in C++ or Java.
To add to the good suggestions above: try to write your pseudocode in simple modules (object-oriented style) so as to have a deep understanding of each part of your code while not losing the big picture. Writing everything in a procedural way is fine at the beginning, but as the code grows it might become hard to keep up with everything you are trying to do.
This example cites one of the seminal works on the topic: Compositing Digital Images by Porter & Duff. The class java.awt.AlphaComposite implements the same rules.
I would like to code Bayesian networks in Java to understand them better, and I have found some code from Artificial Intelligence: A Modern Approach (3rd Edition), "AIMA".
Do you recommend I read the code there and adapt it to a particular problem, or how should I start?
Could you please orient me on how to use the code?
I found Google has it here and here.
I would say there is no need to look at existing code if you want to learn. You will probably learn more by doing it yourself.
A good start would be to write code that does the following:
Compute conditional probabilities from a joint probability table;
for example, from P(A,B,C) compute P(A|B).
Compute a joint probability table from a complete set of conditional probabilities;
for example, from P(A|B,C)*P(B)*P(C) compute P(A,B,C).
Given a DAG, compute whether A is d-separated from B.
Do all of the above naively first, and then go back and try to make it efficient.
It should give you a good understanding of what Bayesian Networks are (conditional probability tables) and what they are used for (reasoning about probability).
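A sketch of the first two exercises in numpy, assuming binary variables and a joint table indexed as joint[a, b, c] = P(A=a, B=b, C=c):

    import numpy as np

    joint = np.random.dirichlet(np.ones(8)).reshape(2, 2, 2)  # toy P(A,B,C)

    # P(A|B): marginalize out C, then normalize over A for each value of B.
    p_ab = joint.sum(axis=2)                        # P(A,B)
    p_a_given_b = p_ab / p_ab.sum(axis=0, keepdims=True)

    # Joint from conditionals: P(A,B,C) = P(A|B,C) * P(B) * P(C).
    # (That factorization assumes B and C are independent in the DAG, so the
    # result matches the original joint only when they really are.)
    p_b = joint.sum(axis=(0, 2))
    p_c = joint.sum(axis=(0, 1))
    p_a_given_bc = joint / joint.sum(axis=0, keepdims=True)
    reconstructed = p_a_given_bc * p_b[None, :, None] * p_c[None, None, :]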
I have an application in Java/JSF and I need to do some optimization calculations, like the Excel Solver add-in does. One option is certainly to write my own solver implementation, but I'm kind of short on time, so I'm looking into libraries that already exist that can help me with this.
Can you recommend any libraries?
EDITED
I don't have the algorithm yet, but I know that I will have to do similar things as in Excel Solver: defining the parameters, the goal, and the restrictions, and calculating the MAX/MIN revenue.
Not a complete solution, but this may get you on the right track (what you are looking for is a non-linear parametric optimizer/solver):
http://jfuzzylogic.sourceforge.net/html/index.html
I did some Googling, and I was surprised that I wasn't able to find something right away...
Here is info about Excel's specific algorithm: http://support.microsoft.com/kb/82890 (again, not a solution, but certainly interesting information for anyone who does this sort of thing).
And here's the company that actually wrote the Excel solver: http://www.solver.com/sdkplatform2.htm
Not sure what your budget is, but if time is of the essence, it may make sense to license it (I'm not sure whether they have a Java version of their SDK or not).
And a related question at SO: Solving nonlinear equations numerically
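To show the shape of the "parameters, goal, restrictions" setup described in the question (the Java solver libraries follow the same pattern), here is a sketch with scipy; the revenue function and all the numbers are made up for illustration:

    from scipy.optimize import minimize

    def neg_revenue(x):  # maximize revenue = minimize its negative
        return -(40 * x[0] + 30 * x[1] - 0.1 * (x[0] ** 2 + x[1] ** 2))

    # restriction: x0 + x1 <= 100, written as fun(x) >= 0
    constraints = [{"type": "ineq", "fun": lambda x: 100 - (x[0] + x[1])}]
    bounds = [(0, None), (0, None)]      # parameters are non-negative

    result = minimize(neg_revenue, x0=[1.0, 1.0],
                      bounds=bounds, constraints=constraints)
    print("best parameters:", result.x, "max revenue:", -result.fun)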
Just wondering if you have any suggestions here. I need a lot of sample maps/graphs to test my shortest-path search solution (I was told I should have >100 of them). My code is supposed to work in a simulator, which uses OpenStreetMap maps of urban settings, limiting the total number of junctions to a few thousand. The problem is, there are only two or three maps provided with the simulator. The way I see it, I have a few choices here:
Write my own random graph generator. Possibly lots of work (do you think? I've never done it before) and reinventing the wheel.
Use an off-the-shelf solution. I'm not aware of any that would generate map-like graphs for me (well, at least I didn't find one in JUNG :-) ).
Grab them from OSM in some automated way. I don't really intend to go and pick out 100+ urban maps that satisfy the <15000 nodes requirement myself. I don't think that would be easy to automate either, though.
I would assume that 3 would be tough to do. Any advice on an off-the-shelf solution? Or comments about writing my own? I'm not an experienced programmer by any measure, but I could probably manage it given a few days.
The first thought:
You have a known problem and you need to test its solution. Generate lots of test data, find the correct solutions with a verified algorithm, then run your algorithm against the generated data set and compare the results. (Or just download a verified Dijkstra implementation; I believe implementing the algorithm yourself is the task.)
The second thought:
A randomly generated data set is not the best way to test algorithms. You need to think about the cases where your algorithm can fail and create corresponding tests: for example, a graph with 1 node, a graph with cycles, a linear graph (N1---N2---N3-...-Nn), and a complete graph with the maximum number of nodes. I think if you create these 4 tests plus 2-3 small random tests, it'll be enough to be sure that your algorithm is implemented correctly.
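To tie both thoughts together, a sketch with the Python library networkx (my own illustration: grid graphs with random edges removed are a crude stand-in for urban street networks, the sizes and probabilities are arbitrary, and my_shortest_path is a placeholder for your own code):

    import random
    import networkx as nx

    def random_city(width=30, height=30, removal=0.1, seed=None):
        rng = random.Random(seed)
        g = nx.grid_2d_graph(width, height)    # Manhattan-style street grid
        g.remove_edges_from([e for e in list(g.edges) if rng.random() < removal])
        for u, v in g.edges:                   # random street lengths
            g[u][v]["weight"] = rng.uniform(1.0, 5.0)
        return g

    for i in range(100):                       # the >100 test maps
        g = random_city(seed=i)
        src, dst = (0, 0), (29, 29)
        if nx.has_path(g, src, dst):
            expected = nx.dijkstra_path_length(g, src, dst, weight="weight")
            # assert my_shortest_path(g, src, dst) == expected  # your code here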