I've created two clustering algorithms: k-means and divisive; maybe later I'll add agglomerative as well. I have to analyze how well they perform on high-dimensional data, and for that I have to calculate the average/sum of distances to the cluster's center. In the case of k-means it's easy, since I have the centroid, but how do I find the center in the divisive/agglomerative algorithm?
While I'm here: I've currently implemented the Euclidean, Manhattan, and Pearson distances. Are there any more distance measures I could use?
Thanks in advance!
You may want to get this book:
Encyclopedia of Distances, by Michel Deza and Elena Deza, 590 pages,
which covers many of the alternate distance functions you could use.
Probably a few hundred different distances ...
However, you will also need to look into your evaluation method -- if it is centroid based, it will be biased towards k-means. So the comparison is likely unfair.
Furthermore, if you use artificial data, make sure you do not unfairly favor one method over another because the method correlates with the way you generate your data (e.g. if you generate Gaussian clusters, it favors methods such as k-means).
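As concrete examples beyond the three you already have, here is a minimal sketch (in Java) of two further measures that are common choices for high-dimensional data, Chebyshev and cosine distance; both are standard textbook definitions:

    // Chebyshev distance: the maximum coordinate-wise difference.
    static double chebyshev(double[] a, double[] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    // Cosine distance: 1 minus the cosine of the angle between the vectors,
    // useful when only the direction of the vectors matters, not their length.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }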
The goal of my work is to analyze these algorithms when they have to create clusters from high-dimensional data. It is hard to evaluate them, and it's very unlikely that the result will be completely fair, so I'm going to use the average accumulated distance between records in one cluster and the minimal distance between two records from different clusters.
Regarding how to find the center of a cluster in hierarchical clustering algorithms: use the same formula as in k-means, the one used to recalculate the centroid after each iteration, i.e., the component-wise mean of all points in the cluster.
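As a minimal sketch in Java, assuming each cluster is a list of equal-length double[] points:

    import java.util.List;

    // Center of any cluster (k-means or hierarchical): the component-wise
    // mean of its member points.
    static double[] centroid(List<double[]> cluster) {
        int dim = cluster.get(0).length;
        double[] center = new double[dim];
        for (double[] point : cluster) {
            for (int d = 0; d < dim; d++) {
                center[d] += point[d];
            }
        }
        for (int d = 0; d < dim; d++) {
            center[d] /= cluster.size();
        }
        return center;
    }

You can then compute your average/sum distance measure against this center for any clustering, which keeps the evaluation consistent across the algorithms.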
Is user inside volume? (OpenGL ES, Java, Android)
I have an OpenGL renderer that shows airspaces.
I need to calculate whether my location, already converted to a float[3], is inside any of many volumes.
I also want to calculate the distance to the nearest volume.
The volumes are arbitrary shapes extruded along the z axis.
What is the most efficient algorithm to do that?
I don't want to use an external library.
What you have here is a nearest neighbor search problem. Since your meshes are constant and won't change, you should probably use a space partitioning algorithm. It's a big topic, but in short, you generally need to use a tree structure and sort all the objects into the various tree nodes. You'll need to pre-calculate the tree itself. There are plenty of books and tutorials on the net about space partitioning, and you could also look at the source code of, for example, id Software products like Doom, Quake, etc. to see how these algorithms (BSP, at least) are used. The efficiency of each algorithm depends on what you have and what you need. Using BSP trees, for example, you'll have the objects sorted from nearest to farthest, so you can quickly get the one you need.
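Whichever partitioning you choose, the per-volume containment test can exploit the extrusion: check the z range first, then run a 2D point-in-polygon (ray casting) test against the base shape. A sketch, where baseX/baseY and the z bounds are hypothetical names for whatever your mesh representation provides:

    // Point-in-extruded-volume test: z-range check, then 2D ray casting
    // against the base polygon (baseX/baseY are its vertices in order).
    static boolean contains(float[] p, float[] baseX, float[] baseY,
                            float zMin, float zMax) {
        if (p[2] < zMin || p[2] > zMax) return false; // outside extrusion range
        boolean inside = false;
        int n = baseX.length;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            boolean crosses = (baseY[i] > p[1]) != (baseY[j] > p[1]);
            if (crosses && p[0] < (baseX[j] - baseX[i]) * (p[1] - baseY[i])
                                   / (baseY[j] - baseY[i]) + baseX[i]) {
                inside = !inside; // each edge crossing toggles in/out
            }
        }
        return inside;
    }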
I have a large list of regions with 2D coordinates. None of the regions overlap. The regions are not immediately adjacent to one another and do not follow a placement pattern.
Is there an efficient lookup algorithm that can be used to let me know what region a specific point will fall into? This seems like it would be the exact inverse of what a QuadTree is.
The data structure you need is called an R-tree. Most R-tree implementations permit a "within" or "intersection" query, which will return any geographic area containing or overlapping a given region; see, e.g., Wikipedia.
There is no reason you cannot build your own R-tree; it's just a variant of a balanced B-tree which can hold extended structures and allows some overlap. This implementation is lightweight, and you could use it here by wrapping your regions in rectangles. Each query might return more than one result, but you could then check the underlying region. It's probably an easier solution than trying to build a polyline-supporting R-tree variant.
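A minimal sketch of the rectangle-wrapping idea; the Region type and its exact containment test are placeholders for your own geometry, and the linear scan over candidates stands in for the R-tree query:

    import java.util.List;

    // Hypothetical region with a precomputed bounding rectangle.
    class Region {
        double minX, minY, maxX, maxY;
        boolean containsPoint(double x, double y) {
            return true; // exact point-in-region test for your geometry
        }
    }

    // Cheap rectangle filter first; the exact (expensive) test runs only
    // on the survivors.
    static Region locate(List<Region> regions, double x, double y) {
        for (Region r : regions) {
            if (x >= r.minX && x <= r.maxX && y >= r.minY && y <= r.maxY
                    && r.containsPoint(x, y)) {
                return r;
            }
        }
        return null; // the point falls in no region
    }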
What you need, if I understand correctly, is a point location data structure that is, as you put it, somehow the opposite of quad or R-tree. In a point location data structure you have a set of regions stored, and the queries are of the form: given point p give me the region in which it is contained.
Several point location data structures exist. The most famous, and the one that achieves the best performance, is Kirkpatrick's, also known as triangulation refinement; it achieves O(n) space and O(log n) query time, but it is also famously hard to implement. On the other hand, there are several simpler data structures that achieve O(n) or O(n log n) space but O(log² n) query time, which is not that bad and much easier to implement, and for some of them it is possible to reduce the query time to O(log n) using a method called fractional cascading.
I recommend taking a look at chapter 6 of de Berg, Overmars, et al., Computational Geometry: Algorithms and Applications, which explains the subject in a way that is very easy to grasp, though it doesn't include Kirkpatrick's method, which you can find in Preparata's book or read directly from Kirkpatrick's paper.
BTW, several of these structures assume that your regions are not overlapping but are adjacent (regions share edges), that the edges form a connected graph, and sometimes that the regions are triangular. In all cases you can extend your set of regions by adding new edges, but don't worry about that: the extra space needed will still be linear, since the final set of regions will induce a planar graph. So you can extend your set of regions without worrying about excessive growth in space.
I'm trying to compare multiple algorithms that are used to smooth GPS data. I'm wondering what should be the standard way to compare the results to see which one provides better smoothing.
I was thinking of a machine learning approach: to create a car model based on a classifier and check which tracks provide better behaviour.
For those with more experience in this area: is this a good approach? Are there other ways to do this?
Generally, there is no universally valid way for comparing two datasets, since it completely depends on the applied/required quality criterion.
For your approach,
"I was thinking of a machine learning approach: to create a car model based on a classifier and check which tracks provide better behaviour,"
this means that you will need to define your term "better behaviour" mathematically.
One possible quality criterion for your application is as follows (it consists of two parts that express opposing quality aspects):
First part (deviation from raw data): compute the RMSE (root mean squared error) between the smoothed data and the raw data. This gives you a measure of the deviation of your smoothed track from the given raw coordinates. This means that the error (RMSE) increases if you smooth more, and decreases if you smooth less.
Second part (track smoothness): compute the mean absolute lateral acceleration that the car would experience along the track (the second derivative of position). This will decrease if you smooth more and increase if you smooth less, i.e., it behaves contrary to the RMSE.
Result evaluation:
(1) Find a sequence of your data where you know that the underlying GPS track is a straight line or where the tracked object is not moving. Note that for those tracks, the (lateral) acceleration is zero by definition(!).
For these, compute RMSE and mean absolute lateral acceleration.
For approaches that have (almost) zero acceleration, the RMSE results purely from measurement inaccuracies!
(2) Plot the results in a coordinate system with the RMSE on the x axis and the mean acceleration on the y axis.
(3) Pick all approaches that have an RMSE similar to what you found in step (1).
(4) From those approaches, pick the one(s) with the smallest acceleration. Those give you the smoothest track with an error explained through measurement inaccuracies!
(5) You're done :)
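A minimal sketch of the two measures in Java, assuming equally spaced samples (spacing dt seconds), equal-length tracks, and coordinates already projected to meters; note it uses the total acceleration magnitude from the second finite difference as a stand-in for the strictly lateral component:

    // Part 1: RMSE between the raw and the smoothed track.
    static double rmse(double[][] raw, double[][] smooth) {
        double sum = 0.0;
        for (int i = 0; i < raw.length; i++) {
            double dx = raw[i][0] - smooth[i][0];
            double dy = raw[i][1] - smooth[i][1];
            sum += dx * dx + dy * dy;
        }
        return Math.sqrt(sum / raw.length);
    }

    // Part 2: mean absolute acceleration via the second finite difference
    // of position (smaller means smoother).
    static double meanAbsAcceleration(double[][] p, double dt) {
        double sum = 0.0;
        for (int i = 1; i < p.length - 1; i++) {
            double ax = (p[i + 1][0] - 2 * p[i][0] + p[i - 1][0]) / (dt * dt);
            double ay = (p[i + 1][1] - 2 * p[i][1] + p[i - 1][1]) / (dt * dt);
            sum += Math.hypot(ax, ay);
        }
        return sum / (p.length - 2);
    }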
I have no experience on this topic, but I have a few things in mind that may help you.
You know it is a car. You know that the data is generated by a car, so you can define a set of properties of a car. For example, if a car is moving at a speed above 50 km/h, then the angle of a corner should be at least 110 degrees. I am absolutely guessing at the values, but if you do a little research I am sure you will be able to define such properties. The next thing you can do is test how well each approximation fits the car properties and choose the best one.
Raw data. I assume you are testing all methods on a part of a given road. You can generate a "raw GPS track": a track that best fits the movement of a car. Google Maps may help you generate such a track, or some GPS device with higher accuracy. Then you measure the distance between each approximation and your generated track; the one with the minimum distance wins.
I think you can easily match the coordinates after the address conversion, because an address has a street, area, and city, so you can easily match at different radii.
Try this link.
Take a look at this paper that discusses comparing machine learning algorithms:
"Choosing between two learning algorithms
based on calibrated tests" available at:
http://www.cs.waikato.ac.nz/ml/publications/2003/bouckaert-calibrated-tests.pdf
Also check out this paper:
"Bayesian Comparison of Machine Learning Algorithms on Single and
Multiple Datasets" available at:
http://www.jmlr.org/proceedings/papers/v22/lacoste12/lacoste12.pdf
Note: from the question, you are looking for the best way to compare the results of machine learning algorithms, not for additional machine learning algorithms that may implement this feature.
Machine learning is not a well-suited approach for this task; you would have to define what good smoothing is...
In principle, your task cannot be solved by an algorithm that gives a general answer, because every smoothing destroys the original data by some amount and adds invented positions, and different systems/humans that use the smoothed data react differently to that changed data.
The question is: What do you want to achieve with smoothing?
Why do you need smoothing? (Have you forgotten to implement or enable a stand-still filter that eliminates movement while the vehicle is standing still? GPS introduces jumping locations during stand-still.)
The GPS chip already has built-in (best possible?) real-time smoothing using a Kalman filter; on the one hand it has more information than a post-processing smoothing algorithm, on the other hand it has less.
So next you have to ask yourself: are you comparing post-processing smoothing algorithms or real-time algorithms? (Probably post-processing.) Comparing a real-time smoothing algorithm with a post-processing smoothing algorithm is not fair.
Again: what do you expect from smoothed data? That it looks somewhat fine, but unrealistic, like photoshopped models for TV advertisements?
What is good smoothing? Close to the real vehicle position, which nobody ever knows, or a curve with low acceleration?
I would prefer a smoothing algorithm that produces the curve closest to the real (usually unknown) vehicle trajectory.
Or you might just think it should somehow look beautiful: in that case, overlay the curves in different colors, display them on a satellite image map, and let a team of humans (experts, at least owning and driving their own cars) decide what looks good and realistic.
We humans have the best multi purpose pattern matching algorithm built in.
Again, why smooth? For display on a map, to please the humans that look at that map?
or to use the smoothed tracks to feed other algorithms that have problems with the original data?
To please humans I have given an answer above.
To please other algorithms:
What do they need? Nearer positions? Or better course values / directions between points?
What attributes do you want to smooth: only the latitude, longitude coordinates, or also the speed value, and course value?
I have much professional experience with GPS tracks, and I recommend just removing every location under 7 km/h and keeping the rest as it is. In most cases there is no need for further smoothing.
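A minimal sketch of that recommendation in Java; the Fix type is a hypothetical record holding the speed reported by the receiver:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical GPS fix: position plus the speed reported by the receiver.
    class Fix {
        double lat, lon;
        double speedKmh;
    }

    // Drop every location under 7 km/h, keep the rest as it is.
    static List<Fix> dropStandStill(List<Fix> track) {
        List<Fix> kept = new ArrayList<>();
        for (Fix f : track) {
            if (f.speedKmh >= 7.0) kept.add(f);
        }
        return kept;
    }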
Otherwise it gets expensive:
A possible solution:
1) You procure a €2000 reference GPS receiver delivered with a magnetic vehicle roof antenna (e.g., a €2000 GPS receiver from the company Hemisphere) and use that as the reference.
2) You use a consumer GPS of the kind usually used for your task (smartphone, etc.).
Both are mounted inside the car: drive some test tracks, in good conditions (highways) but more tracks in very bad ones: strong curves combined with big houses left and right, and through tunnels, a straight and a curved one, if you have them.
3) Apply the smoothing algorithms to the consumer GPS tracks.
4) Compare the smoothed tracks to the reference track by matching positions pairwise, and finally calculate the RMSE (root mean squared error).
Difficulties
Matching two positions: hopefully the timestamps can be matched exactly, which is usually not the case (a 0.5 s offset is possible).
Think about what you do when there is a GPS outage.
Consider first displaying a raw track and identifying what kind of unsmoothed data is not suitable/nice looking. (Probably later posting the pics here.)
What about using the good old Kalman filter?
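For illustration, a minimal scalar Kalman filter sketch in Java (run one instance per coordinate axis), assuming a simple constant-position model; q and r are process/measurement noise values you would have to tune for your data:

    // Minimal 1D Kalman filter with a constant-position model.
    class Kalman1D {
        double x = 0.0;   // state estimate
        double p = 1.0;   // estimate variance
        final double q;   // process noise: how fast the true value may drift
        final double r;   // measurement noise: how noisy each GPS fix is

        Kalman1D(double q, double r) { this.q = q; this.r = r; }

        double update(double measurement) {
            p += q;                       // predict: uncertainty grows
            double k = p / (p + r);       // Kalman gain
            x += k * (measurement - x);   // correct toward the measurement
            p *= (1.0 - k);               // uncertainty shrinks
            return x;
        }
    }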
I'd like to find an implementation of an approximate algorithm for the Minimum Feedback Arc Set in Java but I did not find anything so far. Does anyone have something in mind?
It appears that the simplest approximate algorithm one can implement (though with no minimality guarantees) is the one from this paper:
A fast and effective heuristic for the feedback arc set problem, by P. Eades, X. Lin, W.F. Smyth.
It is very easy to implement and works quite fast for large graphs (I tried it on a graph of 2.5 million edges and around 100 thousand nodes and broke all cycles in less than a minute).
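Since no ready-made Java implementation seems to be around, here is a rough sketch of that heuristic (often called GR), assuming a simple digraph without self-loops or parallel arcs: repeatedly peel off sinks and sources, otherwise greedily remove the vertex maximizing outdegree minus indegree; arcs pointing backwards in the resulting vertex order form the feedback set:

    import java.util.*;

    // Sketch of the Eades/Lin/Smyth heuristic: order the vertices, then
    // every arc pointing backwards in that order is a feedback arc.
    static List<int[]> feedbackArcs(int n, List<int[]> edges) {
        List<Set<Integer>> out = new ArrayList<>(), in = new ArrayList<>();
        for (int v = 0; v < n; v++) { out.add(new HashSet<>()); in.add(new HashSet<>()); }
        for (int[] e : edges) { out.get(e[0]).add(e[1]); in.get(e[1]).add(e[0]); }

        Deque<Integer> head = new ArrayDeque<>(); // sources and greedy picks
        Deque<Integer> tail = new ArrayDeque<>(); // sinks, prepended
        Set<Integer> left = new HashSet<>();
        for (int v = 0; v < n; v++) left.add(v);

        while (!left.isEmpty()) {
            boolean changed = true;
            while (changed) {                      // peel off sinks and sources
                changed = false;
                for (Iterator<Integer> it = left.iterator(); it.hasNext(); ) {
                    int v = it.next();
                    if (out.get(v).isEmpty()) {
                        tail.addFirst(v); detach(v, out, in); it.remove(); changed = true;
                    } else if (in.get(v).isEmpty()) {
                        head.addLast(v); detach(v, out, in); it.remove(); changed = true;
                    }
                }
            }
            if (!left.isEmpty()) {                 // greedy: max outdeg - indeg
                int best = -1, bestDelta = Integer.MIN_VALUE;
                for (int v : left) {
                    int delta = out.get(v).size() - in.get(v).size();
                    if (delta > bestDelta) { bestDelta = delta; best = v; }
                }
                head.addLast(best); detach(best, out, in); left.remove(best);
            }
        }

        int[] pos = new int[n]; int i = 0;         // final order: head then tail
        for (int v : head) pos[v] = i++;
        for (int v : tail) pos[v] = i++;
        List<int[]> feedback = new ArrayList<>();
        for (int[] e : edges) if (pos[e[0]] > pos[e[1]]) feedback.add(e);
        return feedback;                           // removing these breaks all cycles
    }

    static void detach(int v, List<Set<Integer>> out, List<Set<Integer>> in) {
        for (int w : out.get(v)) in.get(w).remove(v);
        for (int w : in.get(v)) out.get(w).remove(v);
        out.get(v).clear(); in.get(v).clear();
    }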
I am writing a program to simulate an n-body gravity system, whose precision is arbitrarily good depending on how small a step of "time" I take between each step. Right now it runs very quickly for up to 500 bodies, but after that it gets very slow, since for every iteration it has to run through an algorithm determining the force applied between each pair of bodies. That is n(n-1)/2 pairs, i.e., O(n^2), so it's not surprising that it gets very bad very quickly. I guess the most costly operation is that I determine the distance between each pair by taking a square root. So, in pseudocode, this is how my algorithm currently runs:
for (i = 1 to number of bodies - 1) {
    for (j = i + 1 to number of bodies) {
        (determine the force between the two bodies i and j,
         whose most costly operation is a square root)
    }
}
So, is there any way I can optimize this? Any fancy algorithms to reuse the distances used in past iterations with fast modification? Are there any lossy ways to reduce this problem? Perhaps by ignoring the relationships between objects whose x or y coordinates (it's in 2 dimensions) exceed a certain amount, as determined by the product of their masses? Sorry if it sounds like I'm rambling, but is there anything I could do to make this faster? I would prefer to keep it arbitrarily precise, but if there are solutions that can reduce the complexity of this problem at the cost of a bit of precision, I'd be interested to hear it.
Thanks.
Take a look at this question. You can divide your objects into a grid and use the fact that many faraway objects can be treated as a single object, for a good approximation. The mass of a cell is equal to the sum of the masses of the objects it contains. The centre of mass of a cell can be treated as the centre of the cell itself, or more accurately as the barycenter of the objects it contains. In the average case, I think this gives you O(n log n) performance rather than O(n²), because you still need to calculate the force of gravity on each of n objects, but each object only interacts individually with those nearby.
Assuming you’re calculating the distance with r² = x² + y², and then calculating the force with F = G·m₁·m₂ / r², you don’t need to perform a square root at all. If you do need the actual distance, you can use a fast inverse square root. You could also use fixed-point arithmetic.
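A sketch of that observation in code: the force magnitude needs only the squared distance. (To split the force into x/y components you still need one division by r, which is exactly where a fast inverse square root could be substituted.)

    // Force magnitude from the squared distance only: F = G*m1*m2 / r^2,
    // no square root required.
    static double forceMagnitude(double g, double m1, double m2,
                                 double dx, double dy) {
        double r2 = dx * dx + dy * dy; // squared distance
        return g * m1 * m2 / r2;
    }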
One good lossy approach would be to run a clustering algorithm to cluster the bodies together.
There are some clustering algorithms that are fairly fast, and the trick will be to not run the clustering algorithm every tick. Instead run it every C ticks (C>1).
Then, for each cluster, calculate the forces between all bodies in the cluster, and then calculate the forces between the clusters themselves.
This will be lossy but I think it is a good approach.
You'll have to fiddle with:
which clustering algorithm to use: Some are faster, some are more accurate. Some are deterministic, some are not.
how often to run the clustering algorithm: running it less will be faster, running it more will be more accurate.
how small/large to make the clusters: most clustering algorithms allow you some input on the size of the clusters. The larger you allow the clusters to be, the faster but less accurate the output will be.
So it's going to be a game of speed vs. accuracy, but at least this way you will be able to sacrifice a bit of accuracy for some speed gains; with your current approach there's nothing you can really tweak at all.
You may want to try a less precise version of the square root. You probably don't need full double precision. In particular, if the magnitudes in your coordinate system are normally similar, you can use a truncated Taylor series to estimate the square root quite quickly without giving up too much accuracy.
There is a very good approximation to the n-body problem that is much faster (O(n log n) vs. O(n²) for the naive algorithm) called Barnes–Hut. Space is subdivided into a hierarchical grid, and when computing the force contribution of distant masses, several masses can be considered as one. There is an accuracy parameter that can be tweaked depending on how much accuracy you're willing to sacrifice for computation speed.
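The core of Barnes–Hut is the opening test at each tree node: if the cell's size s divided by its distance d from the body is below the accuracy parameter θ (around 0.5 is a common starting value), the whole cell is treated as a single point mass at its barycenter. A sketch, with a hypothetical quadtree Node type:

    // Hypothetical quadtree cell: side length plus aggregated mass data.
    class Node {
        double size;                          // side length of the square cell
        double mass;                          // total mass of contained bodies
        double centerOfMassX, centerOfMassY;  // their barycenter
    }

    // Opening test: far-away cells are approximated as one point mass.
    static boolean canApproximate(Node cell, double bodyX, double bodyY,
                                  double theta) {
        double dx = cell.centerOfMassX - bodyX;
        double dy = cell.centerOfMassY - bodyY;
        double d = Math.sqrt(dx * dx + dy * dy);
        return cell.size / d < theta; // s/d below theta: treat cell as one mass
    }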