First of all, I know what the Euclidean distance is and what it calculates between two vectors.
But my question is about how to calculate the distance between two class objects, for example in Java or any other OOP language. I have read quite a lot about machine learning and have already written a classifier using libraries, but I want to know how the Euclidean distance is calculated when I have, for example, this object:
class Object{
String name;
Color color;
int price;
int anotherProperty;
double something;
List<AnotherObject> another;
}
What I already know (if I am not wrong!) is that I have to convert this object to a vector/array representing its properties, or 'features' (as they are called in machine learning?).
But how can I do this? This is the one piece of the puzzle I need in order to understand more.
Do I have to collect all possible values of a property in order to convert it to a number and write it into the array/vector?
Example:
I guess the above object would be represented by a 6-dimensional array, or a smaller one based on the 'features' that are necessary for the calculation.
Let's say color, name and price are the necessary features. The array/vector based on the following data:
color: green (let's say an enum with 5 possible values where green is the third one)
name: "foo" (I would not know how to convert this one; maybe by adding up the ASCII codes?)
price: 14 (just take the integer?)
would look like this?
[3,324,14]
And if I do this with every object of the same class, I am able to calculate the Euclidean distance. Am I right, did I misunderstand something, or is it completely wrong?
For each data type you need to choose an appropriate method of determining the distance. In many cases each data type may itself have to be treated as a vector.
For colour, for example, you could express the colour as an RGB value and then take the Euclidean distance (take the 3 differences, square them, sum them and then take the square root). You might want to choose a different colour space than RGB (e.g., HSI). See here: Colour Difference.
Comparing two strings is easier: a common method is the Levenshtein distance. There is a method for it in the Apache Commons StringUtils class.
Numbers - just take the difference.
Every type will require some consideration of the best way of either generating a distance directly or calculating a numeric value that can then be subtracted to give a "distance".
Once you have a vector of all of the "values" of all of the fields for each object, you can calculate the Euclidean distance (square the differences, sum them and take the square root of the sum).
In your case, if you have:
object 1: [3,324,14]
object 2: [5,123,10]
The Euclidean distance is:
sqrt( (3-5)^2 + (324-123)^2 + (14-10)^2 )
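The formula above can be sketched in Java as follows. This is only a minimal illustration; the class and method names (`VectorDistance`, `euclidean`) are made up for this example and do not come from any library.

```java
final class VectorDistance {
    /** Euclidean distance between two feature vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i]; // difference in one feature
            sum += d * d;           // accumulate squared differences
        }
        return Math.sqrt(sum);
    }
}
```

For the two vectors above the sum of squares is 4 + 40401 + 16 = 40421, so the distance is roughly 201.05. Note that the name component dominates the result almost entirely, which is one reason features are usually rescaled before being combined.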
But in the case of comparing strings, the Levenshtein algorithm gives you the distance directly without intermediate numbers for the fields.
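As a rough sketch of the per-field approach, the following Java code computes one distance per field (RGB colour, name, price) and then combines them. The Levenshtein routine is a plain hand-written version so that the example stays self-contained; it is not the Apache Commons implementation. All names (`FieldDistances`, `colorDistance`, `combined`) are hypothetical.

```java
final class FieldDistances {
    /** Euclidean distance between two colours given as {r, g, b} arrays. */
    static double colorDistance(int[] rgb1, int[] rgb2) {
        double sum = 0.0;
        for (int i = 0; i < 3; i++) {
            double d = rgb1[i] - rgb2[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Levenshtein edit distance using two rows of the dynamic-programming table. */
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of t takes j insertions
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning the first i chars of s into "" takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Treats the per-field distances as the components of one Euclidean distance. */
    static double combined(double colorDist, double nameDist, double priceDist) {
        return Math.sqrt(colorDist * colorDist
                       + nameDist * nameDist
                       + priceDist * priceDist);
    }
}
```

The fields are on very different scales here (a colour distance can reach about 441, a name distance only a handful of edits), so in practice each component would probably need to be normalised or weighted before combining; how to do that is a modelling choice this sketch leaves open.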
Think about this as a statistics problem. Classify all the attributes into nominal, ordinal, and scale variables. Once you have done that, it is just a multi-dimensional vector distance problem.
The main issue which needs to be solved is:
Let's say I have an array with 8 numbers, e.g. [2,4,8,3,5,4,9,2], and I use them as values for my x axis in a coordinate system to draw a line. But I can only display 3 of these points.
What I need to do now is reduce the number of points (8) to 3 without changing the line too much, so using an average should be an option.
I am NOT looking for the average of the array as a whole - I still need 3 points out of the 8 in total.
For an array like [2,4,2,4,2,4,2,4] and 4 numbers out of that array, I could simply use the average "3" of each pair - but that does not work when the counts do not divide evenly.
But how would I do that? Do you know what this process is called mathematically?
To give you some more realistic details about this issue: I have an x axis which is 720px long, and let's say I get 1000 points. Now I have to reduce these 1000 points (2 arrays, one for x and one for y values) to a maximum of 720 points.
I thought about interpolation and the like, but I'm still not sure whether this is what I am looking for.
Interpolation is a good idea. You input your points and get a polynomial function as output, which you can then use to draw your line. See more here: Interpolation over an array (or two)
I would recommend that you fit all the points you have in some fashion and then evaluate at the particular points you need for the display.
There are a myriad of choices for fitting:
Least squares
Piecewise using polynomials or splines
You should consult a text or find a library to help you - something like Apache Commons Math.
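One of the simplest instances of "fit, then evaluate where you need it" is piecewise linear interpolation. The sketch below assumes the input values are evenly spaced along the x axis and resamples them to `m` evenly spaced positions; the class and method names are hypothetical, and a library fit (least squares, splines) may well give a smoother result.

```java
final class Resampler {
    /**
     * Resamples evenly spaced values y to m evenly spaced positions
     * by linear interpolation between the two nearest original samples.
     */
    static double[] resample(double[] y, int m) {
        if (y.length == 0 || m <= 0) {
            throw new IllegalArgumentException("need at least one input and one output point");
        }
        double[] out = new double[m];
        // Distance between output positions, measured in input indices.
        double step = (m == 1) ? 0.0 : (double) (y.length - 1) / (m - 1);
        for (int i = 0; i < m; i++) {
            double pos = i * step;
            int lo = Math.min((int) Math.floor(pos), y.length - 1);
            int hi = Math.min(lo + 1, y.length - 1);
            double frac = pos - lo; // how far pos lies between lo and hi
            out[i] = y[lo] + frac * (y[hi] - y[lo]);
        }
        return out;
    }
}
```

For the array [2,4,8,3,5,4,9,2] and `m = 3` the sampled positions are 0, 3.5 and 7, which gives [2.0, 4.0, 2.0]. Because this only looks at the neighbours of each sampled position, peaks that fall between positions (such as the 9) are lost; averaging each bucket instead would preserve more of the overall shape.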
It sounds like you are looking for a more advanced mathematical function than a simple average.
I would suggest trying to identify potential algorithms via Mathematica Stack Exchange and then trying to find a Java library that implements any of the potential choices (maybe a new question here).
Since it's for an x axis, why not use MIN, MAX and (MIN+MAX)/2 for your three points?
I am doing two-face comparison work using the OpenCV FaceRecognizer of LBP type. My question is how to express the prediction confidence as a percentage. Given the following code (javacv):
int n[] = new int[1];
double p[] = new double[1];
personRecognizer.predict(mat, n, p);
double confidence = p[0];
but the confidence is a double value; how should I convert it into a percentage probability?
Is there an existing formula?
Sorry if I didn't state my question in a clear way. Ok, here is the scenario:
I want to compare two face images and get the likeness of the two faces. For example, I input John's picture and his classmate Tom's picture, and let's say the likeness is 30%; then I input John's picture and his brother Jack's picture, and the likeness comes out as 80%.
These two likeness values show that Jack looks more like his brother John than Tom does. So a likeness value in percentage format is what I want, where a higher value means the two input faces are more alike.
Currently I do this by computing the confidence value of the input using the OpenCV function FaceRecognizer.predict, but the confidence value actually stands for the distance between the inputs in their feature vector space. So how can I scale the distance (confidence) into a likeness percentage?
You are digging quite deep with your question. Well, according to the OpenCV documentation:
predict()
Predicts a label and associated confidence (e.g. distance) for a given
input image
I am not sure what you are looking for here, but the question is not easy to answer. Intra-person face variation (variation between images of the same person) is vast, while inter-person variation (between faces of different people) can be more compact (e.g. when both faces are frontal while the second image of the same person is a profile), so this is a whole topic in itself.
You should probably have a ground truth (i.e. some faces with already-known labels) and deduce from this set the percentage you want by associating the distances with the labels. This is often inaccurate too, though, as distance will not coincide with your perception of similarity (as mentioned before, inter-person faces can vary a lot).
Edit:
First of all, there is no universal human perception of face similarity. On the other hand, most people would recognize a face that belongs to the same person in various poses and postures. The word "most" is important here. As you push the limits, human perception will start to diverge, e.g. when asked to recognize a face over the years as the time span becomes quite large (child vs. adolescent vs. old person).
Are you asking to compute the similarity of noses, eyes, etc.? If so, I think the best way is to find a set of noses/eyes belonging to the same persons, train on it, and then check your performance on a different set from different persons.
The usual approach, as far as I know, is to train and test using pairs of images comprising positive and negative samples. A positive sample is a pair of images belonging to the same person, while a negative one is a pair of images belonging to two different people.
I am not sure what you are asking exactly so maybe you can check out this link.
Hope it helped.
Edit 2:
Well, since you want to convert the distance you are getting into a similarity expressed as a percentage, you can somehow invert the distance to get the similarity. There are some problems here, though:
There is a value for an absolute match, dis = 0 (equivalently, sim = 100%), but there is no explicit value for a total mismatch: dis = infinity, so sim = 0%. The similarity scale, on the other hand, has explicit bounds, 0% to 100%.
Since the extreme values include 0 and infinity, there must be a smarter conversion than simple inversion.
You can easily assign 1.0 (or 100% similarity) to the absolute match, but what to take as a total mismatch is not clear. You can treat an arbitrary high distance as 0.0 (I guess there is no big difference between, e.g., a distance of 10000 and one of 11000) and consider all distances higher than this to be 0.0 as well.
To find what that value should be, I would suggest comparing two quite distinct images and using the distance between them as the one that maps to 0.0.
Let's suppose that this value is disMax = 250.0; and simMax = 100.0;
then a simple approach could be:
double sim = simMax - simMax/disMax*dis;
which gives a 100.0 similarity for 0 distance and 0.0 for 250 distance. Values larger than 250 would give negative similarity values which should be considered 0.0.
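The linear mapping with the clamping described above could be wrapped up like this. It is only a sketch: `disMax` is whatever cut-off you chose empirically (250.0 is just the example value used here), and the class and method names are made up.

```java
final class Similarity {
    /**
     * Maps a distance to a similarity in [0, 100]:
     * distance 0 gives 100, distance disMax or larger gives 0.
     */
    static double toSimilarity(double dis, double disMax) {
        double simMax = 100.0;
        double sim = simMax - simMax / disMax * dis;
        // Clamp so that distances beyond disMax do not yield negative values.
        return Math.max(0.0, Math.min(simMax, sim));
    }
}
```

With `disMax = 250.0`, a distance of 125 maps to 50% and a distance of 300 is clamped to 0%. Keep in mind that this number is a rescaled distance, not a probability.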
I would like to find a way, given a set of arbitrary points on a 2-dimensional plane (or in 3 dimensions if possible), to connect as many of these points as possible with an equation, preferably in the form of X^n + BX^n and so on, X being a variable and B and n being any numbers.
This would hopefully work in such a way that, given say 50 random points, I would be able to use the equation to draw a line that passes through as many of these points as possible.
I plan on using this in a compression format where data is converted to X,Y coordinate pairs; the goal is then to create equations that can reproduce these points. The equation would then be stored, and the data would be replaced with a pointer to the equation as well as the number to enter into the equation to get the data back.
Any feedback is nice, this is just an idea I thought of during class and wanted to see if it would be possible to implement in a usable format.
To connect n points (with distinct x values) you need a polynomial of degree at most n-1. You can use polynomial regression to form your line.
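To illustrate the degree n-1 claim, here is a small sketch that evaluates the unique polynomial through n points using the Lagrange form. Note that this is exact interpolation rather than regression, and the names (`Lagrange`, `evaluate`) are hypothetical. It assumes all x values are distinct.

```java
final class Lagrange {
    /**
     * Evaluates at x the polynomial of degree at most n-1 that passes
     * through the n points (xs[i], ys[i]). All xs must be distinct.
     */
    static double evaluate(double[] xs, double[] ys, double x) {
        double result = 0.0;
        for (int i = 0; i < xs.length; i++) {
            double term = ys[i];
            for (int j = 0; j < xs.length; j++) {
                if (j != i) {
                    // Basis factor: equals 1 at xs[i] and 0 at xs[j].
                    term *= (x - xs[j]) / (xs[i] - xs[j]);
                }
            }
            result += term;
        }
        return result;
    }
}
```

For the points (0,1), (1,3), (2,7) this reproduces x^2 + x + 1, so evaluating at x = 3 gives 13. For something like 50 random points, though, the polynomial needs about as many coefficients as there are points and tends to oscillate strongly between them, so it is unlikely to save space as a compression scheme.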
Is there any library or open source function that approximates the area under a line that is described by some of its values taken at irregular intervals?
ActionScript would be preferred, but Java might work fine as well.
You could use the as3mathlib math library. Here's the relevant class:
http://code.google.com/p/as3mathlib/source/browse/trunk/src/com/vizsage/as3mathlib/math/calc/Integral.as
It includes the most common integral approximation methods.
Edit for more explanation (based on comments below):
Use timestamp values for each date; only convert to anything else if you need to display it to the user, and do so at the very end.
Hopefully there's a standard greatest common divisor (GCD) among the various differences between each set of adjacent timestamps. (If not, you'll need to calculate that first.) In other words, hopefully each timestamp differs by a number of whole days. If so, the GCD is 1 day. If it's not like this, you'll have to calculate what that GCD equals on the fly.
Then, use the GCD value in combination with the delta between the first and last timestamps to determine n, the number of partitions. Then, in f (your function to be integrated), determine whether the passed x corresponds to a defined timestamp. If so, return the numeric_value associated with that timestamp. If not, interpolate between the numeric_values of the nearest two defined timestamps, and return that.
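If you would rather avoid the GCD step, the trapezoidal rule can be applied directly to the irregularly spaced samples; this is equivalent to integrating the linear interpolation between neighbouring timestamps. The following is a plain Java sketch, not part of as3mathlib, and its names are made up. It assumes `x` is sorted in ascending order.

```java
final class Trapezoid {
    /**
     * Approximates the area under the line through the samples (x[i], y[i])
     * using one trapezoid per pair of adjacent samples.
     */
    static double area(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        double total = 0.0;
        for (int i = 0; i + 1 < x.length; i++) {
            double width = x[i + 1] - x[i];            // interval may differ each step
            total += width * (y[i] + y[i + 1]) / 2.0;  // area of one trapezoid
        }
        return total;
    }
}
```

For samples at x = 0, 1, 3 with values 0, 2, 2 the result is 1 + 4 = 5. With timestamps as x values, the result is in units of value times timestamp unit (e.g. value-milliseconds), so divide accordingly if you want it per day.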
Currently I am using a framework that has a function called distance2D, with this description:
Calculate the Euclidean distance
between two points (considering a
point as a vector object). Disregards
the Z component of the vectors and is
thus a little faster.
and this is how I use it:
if(g.getCenterPointGlobal().distance2D(target.getCenterPointGlobal()) > 1)
System.out.println("Near");
I have no idea what a Euclidean distance is. I am thinking that it can be used to calculate how far apart 2 points are? I am trying to compare the distance between 2 objects, and if they are within a certain range of each other I want to do something. How would I be able to use this?
Euclidean distance is the distance between 2 points as if you were measuring it with a ruler. I don't know how many dimensions your Euclidean space has, but be careful, because the function you are using only takes the first two dimensions (x,y) into consideration. Thus if you have a space with 3 dimensions (x,y,z), it will only use the first two (x and y) to calculate the distance. This may give a wrong result.
From what I understood, if you want to trigger some action when two points are within some range of each other, you should write:
if(g.getCenterPointGlobal().distance2D(target.getCenterPointGlobal()) < RANGE)
System.out.println("Near");
The Euclidean distance is calculated by tracing a straight line between the two points and measuring it as the hypotenuse of an imaginary right triangle whose other two sides are the differences along each axis. The result is a scalar, so it's a good metric for your calculations.
Euclidean geometry is the geometry of flat, not curved, space. You don't need to care about non-Euclidean geometry unless, for example, you're dealing with coordinates mapped onto a sphere, such as when calculating the shortest travel distance between two places on Earth.
I imagine this function basically uses Pythagoras' theorem to calculate the distance between the two objects. However, as the description says, it disregards the Z component. In other words, it will only give a correct answer if both points have the same Z value (aka "depth").
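A guess at what such a function does, next to its 3D counterpart, might look like this. This is not the framework's actual code; the class and the plain-coordinate signatures are assumptions made for illustration.

```java
final class Points {
    /** Distance in the x/y plane only; any z difference is ignored. */
    static double distance2D(double x1, double y1, double x2, double y2) {
        double dx = x1 - x2;
        double dy = y1 - y2;
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** Full 3D distance, for comparison. */
    static double distance3D(double x1, double y1, double z1,
                             double x2, double y2, double z2) {
        double dx = x1 - x2;
        double dy = y1 - y2;
        double dz = z1 - z2;
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }
}
```

For the points (0,0,0) and (3,4,12) the 2D distance is 5 while the true 3D distance is 13, which shows how far off the 2D version can be when the Z values differ.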
If you wish to compare distances and save time, use not the distance itself but its square: (x1-x2)^2 + (y1-y2)^2. Don't take the square root. Comparisons will then give exactly the same results as with Euclidean distances, only faster.
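A range check based on squared distances could be sketched like this. The names are hypothetical, and the range is assumed to be non-negative, since squaring would otherwise hide its sign.

```java
final class RangeCheck {
    /** True if the two points are closer than range, without taking a square root. */
    static boolean isNear(double x1, double y1, double x2, double y2, double range) {
        double dx = x1 - x2;
        double dy = y1 - y2;
        // Compare squared distance with squared range; valid because both are non-negative.
        return dx * dx + dy * dy < range * range;
    }
}
```

The points (0,0) and (3,4) are exactly 5 apart, so `isNear(0, 0, 3, 4, 6)` is true while `isNear(0, 0, 3, 4, 5)` is false because the comparison is strict.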