Algorithm for clustering Tweets based on geo radius - java

I want to cluster tweets based on a specified geo-radius, e.g. 10 meters: if I specify a 10-meter radius, then all tweets that are within 10 meters of each other should be in one cluster.
A simple algorithm would be to calculate the distance between each tweet and every other tweet, but that would be very computationally expensive. Are there better algorithms for this?

You can organize your tweets in a quadtree. This makes it quite easy to find nearby tweets without looking at all tweets and their locations.
The quadtree does not directly deliver the distance (because it is based on a Manhattan-style distance), but it gives you the nearby tweets, for which you can calculate the precise distance afterwards.
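For illustration, here is a minimal sketch of a point quadtree with a radius query, assuming the tweet coordinates have already been projected to a flat x/y plane in metres (the class and method names are made up for this example):

import java.util.ArrayList;
import java.util.List;

// Minimal point quadtree (hypothetical sketch, not a library class). Each node
// covers a square and splits into four children once it holds too many points.
class QuadTree {
    private static final int CAPACITY = 8;
    private final double cx, cy, half;                       // centre and half-width of this node
    private final List<double[]> pts = new ArrayList<>();    // stored {x, y} pairs
    private QuadTree nw, ne, sw, se;

    QuadTree(double cx, double cy, double half) { this.cx = cx; this.cy = cy; this.half = half; }

    // Inserts a point; returns false if it lies outside this node's square.
    boolean insert(double x, double y) {
        if (Math.abs(x - cx) > half || Math.abs(y - cy) > half) return false;
        if (nw == null && pts.size() < CAPACITY) { pts.add(new double[]{x, y}); return true; }
        if (nw == null) subdivide();
        return nw.insert(x, y) || ne.insert(x, y) || sw.insert(x, y) || se.insert(x, y);
    }

    private void subdivide() {
        double h = half / 2;
        nw = new QuadTree(cx - h, cy + h, h); ne = new QuadTree(cx + h, cy + h, h);
        sw = new QuadTree(cx - h, cy - h, h); se = new QuadTree(cx + h, cy - h, h);
        for (double[] p : pts)                                // push existing points down
            if (!nw.insert(p[0], p[1]) && !ne.insert(p[0], p[1]) && !sw.insert(p[0], p[1]))
                se.insert(p[0], p[1]);
        pts.clear();
    }

    // Collects every stored point within 'radius' of (qx, qy) into 'out'.
    void query(double qx, double qy, double radius, List<double[]> out) {
        if (Math.abs(qx - cx) > half + radius || Math.abs(qy - cy) > half + radius) return; // prune
        double r2 = radius * radius;
        for (double[] p : pts) {
            double dx = p[0] - qx, dy = p[1] - qy;
            if (dx * dx + dy * dy <= r2) out.add(p);          // exact check with squared distance
        }
        if (nw != null) {
            nw.query(qx, qy, radius, out); ne.query(qx, qy, radius, out);
            sw.query(qx, qy, radius, out); se.query(qx, qy, radius, out);
        }
    }

    public static void main(String[] args) {
        QuadTree tree = new QuadTree(0, 0, 1000);             // covers a 2 km x 2 km square
        tree.insert(3, 4);                                    // 5 m from the origin
        tree.insert(300, 400);                                // 500 m from the origin
        List<double[]> near = new ArrayList<>();
        tree.query(0, 0, 10, near);                           // everything within 10 m of (0, 0)
        System.out.println(near.size());                      // prints 1
    }
}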

If your problem is only the computation of distances:
Remember: you should never compute actual distances if you need them for comparison only. Use their squares instead.
Do not compare:
sqrt((x1-x2)^2+(y1-y2)^2) against 10
compare instead
(x1-x2)^2+(y1-y2)^2 against 100
It takes far less time.
Another improvement can be gained by simply comparing coordinates before comparing squared distances. If abs(x1-x2) > 10, you don't need that pair anymore. (This is the Manhattan distance MrSmith is speaking about.)
I don't know how you work with your points, but if their set is stable, you could make two arrays of them and sort each by one of the coordinates. After that you only need to check the points that are close to the source point in both arrays.
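As a tiny illustration (variable names invented), both shortcuts combined look like this:

// Sketch: reject a pair on a single coordinate first, then compare squared
// distances against the squared radius; Math.sqrt is never called.
public class RadiusCheck {
    static boolean withinRadius(double x1, double y1, double x2, double y2, double radius) {
        double dx = Math.abs(x1 - x2);
        if (dx > radius) return false;                  // cheap per-coordinate pre-filter
        double dy = Math.abs(y1 - y2);
        if (dy > radius) return false;
        return dx * dx + dy * dy <= radius * radius;    // squared comparison, no sqrt
    }

    public static void main(String[] args) {
        System.out.println(withinRadius(0, 0, 6, 8, 10));   // true, distance is exactly 10
        System.out.println(withinRadius(0, 0, 11, 0, 10));  // false, rejected by the pre-filter
    }
}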

Related

Actual performance benefits of distance squared vs distance

When calculating the distance between two 3D points in Java, I can compute the distance, or the distance squared between them, avoiding a call to Math.sqrt.
I've read that, natively, sqrt runs at only about a quarter of the speed of a multiplication, which makes the inconvenience of using the distance squared not worthwhile.
In Java, what is the absolute performance difference between multiplication and calculating a square root?
I initially wanted to add this as a comment, but it started to get too big, so here goes:
Try it yourself. Make a loop with 10,000 iterations where you simply calculate a*a + b*b, and another separate loop where you calculate Math.sqrt(a*a + b*b). Time them and you'll know. Calculating a square root is an iterative process in its own right, in which the computed square root converges towards the real square root of the given number until it is sufficiently close (as soon as the difference between iterations is less than some really small value). There are multiple algorithms out there besides the one the Math library uses, and their speed depends on the input and on how the algorithm is designed. Stick with Math.sqrt(...) in my opinion; you can't go wrong and it has been tested by a LOT of people.
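A rough version of that experiment might look like the following hypothetical harness (JIT warm-up and dead-code elimination make such naive timings only indicative; a proper benchmark would use something like JMH):

// Naive timing loop, as described above. The 'sink' variable and the varying
// inputs keep the JIT from optimising the work away entirely.
public class SqrtTiming {
    public static void main(String[] args) {
        int n = 10_000_000;
        double sink = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            double a = i * 0.001, b = i * 0.002;
            sink += a * a + b * b;                       // squared distance only
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            double a = i * 0.001, b = i * 0.002;
            sink += Math.sqrt(a * a + b * b);            // same work plus a square root
        }
        long t2 = System.nanoTime();

        System.out.println("squared only: " + (t1 - t0) / 1e6 + " ms");
        System.out.println("with sqrt:    " + (t2 - t1) / 1e6 + " ms");
        System.out.println(sink);                        // print so the loops cannot be eliminated
    }
}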
Although this can be done very fast for one square root, there's a definite observable time difference.
On a side note: I cannot think of a reason to calculate the square root more than once, usually at the end. If you want to know the distance between points, just use the squared value of that distance as a default and make comparisons/summations/subtractions or whatever you want based on that default.
PS: Provide more code if you want a more "practical" answer

Android algorithm to find all GeoPoints within a given distance

In my naive beginning Android mind I thought the way to do this would be to loop through each of the objects checking if proximity falls within X range and if so, include the object. This is being done with Google Maps and GeoPoints.
That said, I know this is probably the slowest way possible. I did a search for Android proximity algorithms and did not get much, really. What I am looking for are the best options for doing this more efficiently.
Are there any libraries I have not been able to find?
If not, should I load these Location objects into SQL then go from there or keep them in a JSONArray?
Once I establish my best data structure, what is the best method to find all Locations within X miles of the user?
I am not asking for cut and paste code, rather the best method to do this efficiently. Then, I can stumble through the code :)
My first gut feeling is to group the Locations by regions but I'm not exactly sure how to do this.
I could potentially have tens of thousands of datapoints.
Any help in simply heading in the right direction is greatly appreciated.
As a side note, I reached this juncture after discovering that a remote API I had been using was.. well.. just PLAIN WRONG and omitting datapoints from my proximity search. I also realized that if I just placed the datapoints on the phone, the user could run the app without an internet connection, using only GPS, and that would be a HUGE plus. So, with all setbacks come opportunities!
The answer depends on the representation of the GeoPoints. If they are not sorted, you need to scan all of them (this is done in linear time; sorting by distance or clustering would be more expensive). Use Location.distanceTo(Location) or Location.distanceBetween(double, double, double, double, float[]) to calculate the distances.
If the GeoPoints were sorted by distance to your position, this task could be done much more efficiently, but since the supplier does not know your position, I assume that cannot be done.
If the GeoPoints are clustered, i.e. if you have a set of clusters with some center and a radius select each cluster where the distance from your position to the cluster's center is within the limit plus the radius. For these clusters you need to check each GeoPoint contained in the cluster (some of them are possibly farther away from your position than the limit allows). Alternatively you might accept the error and include all points of the cluster (if the radius is relatively small I would recommend this).
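For the unsorted case, the linear scan could be as simple as the following sketch. Location.distanceBetween is the Android API mentioned above, while the Place class and field names are invented for the example:

import android.location.Location;
import java.util.ArrayList;
import java.util.List;

// Hypothetical container for one stored datapoint.
class Place {
    final double lat, lng;
    Place(double lat, double lng) { this.lat = lat; this.lng = lng; }
}

public class ProximityFilter {
    // Returns every place within radiusMetres of the user's position: a plain O(n) scan.
    public static List<Place> near(List<Place> all, double userLat, double userLng, float radiusMetres) {
        List<Place> hits = new ArrayList<>();
        float[] result = new float[1];                   // distanceBetween writes the distance here
        for (Place p : all) {
            Location.distanceBetween(userLat, userLng, p.lat, p.lng, result);
            if (result[0] <= radiusMetres) hits.add(p);
        }
        return hits;
    }
}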

Interpolating between Functions [duplicate]

Possible Duplicate:
Interpolation over an array (or two)
I have a set of CSV files that contain points of a 2D function... in other words I have four CSV files, each is the result of evaluating a function f(x, y) at different y values. I need to interpolate between these data such that I can calculate an arbitrary f for a certain x and y. The CSV files have varying lengths and x-values. Does anyone know of a library or algorithm in Java for this task? Linear interpolation is OK, as is spline interpolation.
Thanks,
taktoa
Ok, first of all I assume the "CSV" bit is irrelevant; let's assume you have read those into memory and merged them together (they're values of the same function, right?). Now you have a single set of f(x,y) values for different (x,y) pairs and would like to interpolate between those. Fine so far?
If you stick to linear interpolation, there's still the question of how many points to take into account, which will depend on the level of noise in the measurements. In the simplest case one would use just the three nearest points to identify the plane they lie in and use that to find the value for the point in question. This option requires neither libraries nor algorithms, apart from vector addition, subtraction, cross product and dot product.
More sophisticated solutions would generally require some sort of fitting, e.g. (weighted) least squares.
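To make the three-nearest-points idea concrete, here is an illustrative sketch (the sample data and method name are invented; it assumes the three chosen points are not collinear):

import java.util.Arrays;
import java.util.Comparator;

// Sketch of linear interpolation over scattered f(x, y) samples using the
// plane through the three nearest points (hypothetical, not library code).
public class PlaneInterpolation {

    // samples[i] = {x, y, f(x, y)}
    public static double interpolate(double[][] samples, double x, double y) {
        // Pick the three samples closest to (x, y); compare squared distances.
        double[][] nearest = Arrays.stream(samples)
                .sorted(Comparator.comparingDouble((double[] p) ->
                        (p[0] - x) * (p[0] - x) + (p[1] - y) * (p[1] - y)))
                .limit(3)
                .toArray(double[][]::new);

        double[] p1 = nearest[0], p2 = nearest[1], p3 = nearest[2];
        // Normal of the plane through p1, p2, p3 via the cross product of two edge vectors.
        double ux = p2[0] - p1[0], uy = p2[1] - p1[1], uz = p2[2] - p1[2];
        double vx = p3[0] - p1[0], vy = p3[1] - p1[1], vz = p3[2] - p1[2];
        double nx = uy * vz - uz * vy;
        double ny = uz * vx - ux * vz;
        double nz = ux * vy - uy * vx;                  // nz == 0 means the points are collinear
        // Plane equation nx*(X - x1) + ny*(Y - y1) + nz*(Z - z1) = 0, solved for Z.
        return p1[2] - (nx * (x - p1[0]) + ny * (y - p1[1])) / nz;
    }

    public static void main(String[] args) {
        double[][] samples = { {0, 0, 0}, {1, 0, 1}, {0, 1, 2}, {1, 1, 3} }; // f(x, y) = x + 2y
        System.out.println(interpolate(samples, 0.25, 0.5));                 // prints 1.25
    }
}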
The simplest approach is to find the closest points and use linear interpolation, e.g. choose two or three of the closest points and interpolate between them.
Or you can take a weighted average based on distance. Or you can pick a close point and then find points on the "other side" of it to improve the interpolation.
Lagrange interpolation would be simple and accurate.

Algorithm to find peaks in 2D array

Let's say I have a 2D accumulator array in java int[][] array. The array could look like this:
(Two example surface plots were attached: the x and z axes are the array indexes, the y axis the values; they show an int[56][56] with values from 0 to ~4500.)
What I need to do is find peaks in the array - there are 2 peaks in the first one and 8 peaks in the second array. These peaks are always 'obvious' (there's always a gap between peaks), but they don't have to be similar like on these images, they can be more or less random - these images are not based on the real data, just samples. The real array can have size like 5000x5000 with peaks from thousands to several hundred thousands... The algorithm has to be universal, I don't know how big the array or peaks can be, I also don't know how many peaks there are. But I do know some sort of threshold - that the peaks can't be smaller than a given value.
The problem is that one peak can consist of several smaller peaks nearby (first image), the height can be quite random, and the size can differ significantly within one array (by size I mean the number of units a peak occupies in the array: one peak can consist of 6 units and another of 90). It also has to be fast (all done in one pass), since the array can be really big.
Any help is appreciated - I don't expect code from you, just the right idea :) Thanks!
edit: You asked about the domain, but it's quite complicated and imho it can't help with the problem. It's actually an array of ArrayLists of 3D points, like ArrayList<Point3D>[][], and the value in question is the size of the ArrayList. Each peak contains points that belong to one cluster (a plane, in this case); this array is the result of an algorithm that segments a point cloud. I need to find the highest value in the peak so I can fit the points from the 'biggest' ArrayList to a plane, compute some parameters from it, and then properly cluster most of the points from the peak.
He's not interested in estimating the global maximum using some sort of optimization heuristic - he just wants to find the maximum values within each of a number of separate clusters.
These peaks are always 'obvious' (there's always a gap between peaks)
Based on your images, I assume you mean there's always some 0-values separating clusters? If that's the case, you can use a simple flood-fill to identify the clusters. You can also keep track of each cluster's maximum while doing the flood-fill, so you both identify the clusters and find their maximum simultaneously.
This is also as fast as you can get, without relying on heuristics (which could return the wrong answer), since the maximum of each cluster could potentially be any value in the cluster, so you have to check them all at least once.
Note that this will iterate through every item in the array. This is also necessary, since (from the information you've given us) it's potentially possible for any single item in the array to be its own cluster (which would also make it a peak). With around 25 million items in the array, this should only take a few seconds on a modern computer.
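A sketch of that combined flood-fill pass (iterative, with an explicit stack so a huge cluster cannot overflow the call stack; 4-connectivity and a value threshold are assumed):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch: find one peak (maximum value) per connected cluster of cells above a threshold.
public class PeakFinder {

    public static List<int[]> findPeaks(int[][] a, int threshold) {
        int rows = a.length, cols = a[0].length;
        boolean[][] seen = new boolean[rows][cols];
        List<int[]> peaks = new ArrayList<>();          // each entry: {row, col, value}

        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (seen[r][c] || a[r][c] < threshold) continue;
                // Flood-fill this cluster with an explicit stack, tracking its maximum.
                int[] best = {r, c, a[r][c]};
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[]{r, c});
                seen[r][c] = true;
                while (!stack.isEmpty()) {
                    int[] cur = stack.pop();
                    int cr = cur[0], cc = cur[1];
                    if (a[cr][cc] > best[2]) best = new int[]{cr, cc, a[cr][cc]};
                    int[][] neigh = {{cr + 1, cc}, {cr - 1, cc}, {cr, cc + 1}, {cr, cc - 1}};
                    for (int[] nb : neigh) {
                        int nr = nb[0], nc = nb[1];
                        if (nr >= 0 && nr < rows && nc >= 0 && nc < cols
                                && !seen[nr][nc] && a[nr][nc] >= threshold) {
                            seen[nr][nc] = true;
                            stack.push(new int[]{nr, nc});
                        }
                    }
                }
                peaks.add(best);
            }
        }
        return peaks;
    }

    public static void main(String[] args) {
        int[][] a = {
            {0, 0, 0, 0, 0},
            {0, 5, 7, 0, 0},
            {0, 6, 0, 0, 9},
            {0, 0, 0, 0, 8},
        };
        for (int[] p : findPeaks(a, 1))
            System.out.println("peak " + p[2] + " at (" + p[0] + ", " + p[1] + ")");
        // prints: peak 7 at (1, 2) and peak 9 at (2, 4)
    }
}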
This might not be an optimal solution, but since the problem sounds somewhat fluid too, I'll write it down.
Construct a list of all the values (and coordinates) that are over your minimum threshold.
Sort it in descending order of height.
The first element will be the biggest peak, add it to the peak list.
Then descend down the list, if the current element is further than the minimum distance from all the existing peaks, add it to the peak list.
This is a linear description but all the steps (except 3) can be trivially parallelised. In step 4 you can also use a coverage map: a 2D array of booleans that show which coordinates have been "covered" by a nearby peak.
(Caveat emptor: once you refine the criteria, this solution might become completely unfeasible, but in general it works.)
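An illustrative version of those four steps (the threshold and the minimum peak distance are parameters you would have to choose for your data):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the sort-and-suppress idea above: highest cells first, and a
// candidate becomes a peak only if no already accepted peak is within minDist.
public class SortedPeakPicker {

    public static List<int[]> pickPeaks(int[][] a, int threshold, double minDist) {
        List<int[]> candidates = new ArrayList<>();     // {row, col, value}
        for (int r = 0; r < a.length; r++)
            for (int c = 0; c < a[r].length; c++)
                if (a[r][c] >= threshold) candidates.add(new int[]{r, c, a[r][c]});

        candidates.sort(Comparator.comparingInt((int[] p) -> p[2]).reversed());

        double minDist2 = minDist * minDist;            // compare squared distances
        List<int[]> peaks = new ArrayList<>();
        for (int[] cand : candidates) {
            boolean isolated = true;
            for (int[] peak : peaks) {
                double dr = cand[0] - peak[0], dc = cand[1] - peak[1];
                if (dr * dr + dc * dc < minDist2) { isolated = false; break; }
            }
            if (isolated) peaks.add(cand);
        }
        return peaks;
    }

    public static void main(String[] args) {
        int[][] a = { {0, 0, 0}, {0, 9, 8}, {0, 0, 7} };
        for (int[] p : pickPeaks(a, 5, 2.0))
            System.out.println("peak " + p[2] + " at (" + p[0] + ", " + p[1] + ")");
        // 9 is kept; 8 and 7 are suppressed because they lie within distance 2 of it
    }
}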
Simulated annealing, or hill climbing are what immediately comes to mind. These algorithms though will not guarantee that all peaks are found.
However, if your "peaks" are separated by values of 0 as the gap, maybe a connected-components analysis would help. You would label a region as "connected" if its values are greater than 0 (or, if you have a certain threshold, label regions as connected that are over that threshold); then your number of components would be your number of peaks. You could then do another pass over the array to find the max of each component.
I should note that connected components can be done in linear time, and finding the peak values can also be done in linear time.

Fast multi-body gravity algorithm?

I am writing a program to simulate an n-body gravity system, whose precision is arbitrarily good depending on how small a step of "time" I take between steps. Right now, it runs very quickly for up to 500 bodies, but after that it gets very slow, since for every iteration it has to run through an algorithm determining the force between each pair of bodies. That is n(n-1)/2 pairs, i.e. O(n^2), so it's not surprising that it gets very bad very quickly. I guess the most costly operation is that I determine the distance between each pair by taking a square root. So, in pseudo code, this is how my algorithm currently runs:
for (i = 1 to number of bodies - 1) {
    for (j = i + 1 to number of bodies) {
        (determine the force between bodies i and j;
         the most costly operation is a square root)
    }
}
So, is there any way I can optimize this? Any fancy algorithms to reuse the distances used in past iterations with fast modification? Are there any lossy ways to reduce this problem? Perhaps by ignoring the relationships between objects whose x or y coordinates (it's in 2 dimensions) exceed a certain amount, as determined by the product of their masses? Sorry if it sounds like I'm rambling, but is there anything I could do to make this faster? I would prefer to keep it arbitrarily precise, but if there are solutions that can reduce the complexity of this problem at the cost of a bit of precision, I'd be interested to hear it.
Thanks.
Take a look at this question. You can divide your objects into a grid, and use the fact that many faraway objects can be treated as a single object for a good approximation. The mass of a cell is equal to the sum of the masses of the objects it contains. The centre of mass of a cell can be treated as the centre of the cell itself, or more accurately the barycenter of the objects it contains. In the average case, I think this gives you O(n log n) performance, rather than O(n^2), because you still need to calculate the force of gravity on each of n objects, but each object only interacts individually with those nearby.
Assuming you're calculating the distance with r^2 = x^2 + y^2, and then calculating the force with F = G*m1*m2 / r^2, you don't need to perform a square root at all. If you do need the actual distance, you can use a fast inverse square root. You could also use fixed-point arithmetic.
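A small sketch of that squared-distance shortcut for the force magnitude (illustrative only; if you also need the force direction you still need r or 1/r for the unit vector, which is where a fast inverse square root could be substituted):

// Sketch: gravitational force magnitude straight from the squared distance, no sqrt.
public class GravitySketch {
    static final double G = 6.674e-11;              // gravitational constant, SI units

    static double forceMagnitude(double m1, double m2,
                                 double x1, double y1, double x2, double y2) {
        double dx = x2 - x1, dy = y2 - y1;
        double r2 = dx * dx + dy * dy;              // F = G * m1 * m2 / r^2 only ever uses r^2
        return G * m1 * m2 / r2;
    }

    public static void main(String[] args) {
        // Roughly Earth and Moon (masses in kg, separation in m along x).
        System.out.println(forceMagnitude(5.97e24, 7.35e22, 0, 0, 3.84e8, 0)); // about 2e20 N
    }
}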
One good lossy approach would be to run a clustering algorithm to cluster the bodies together.
There are some clustering algorithms that are fairly fast, and the trick will be to not run the clustering algorithm every tick. Instead run it every C ticks (C>1).
Then for each cluster, calculate the forces between all bodies in the cluster, and then for each cluster calculate the forces between the clusters.
This will be lossy but I think it is a good approach.
You'll have to fiddle with:
which clustering algorithm to use: Some are faster, some are more accurate. Some are deterministic, some are not.
how often to run the clustering algorithm: running it less will be faster, running it more will be more accurate.
how small/large to make the clusters: most clustering algorithms allow you some input on the size of the clusters. The larger you allow the clusters to be, the faster but less accurate the output will be.
So it's going to be a game of speed vs accuracy, but at least this way you will be able to sacrifice a bit of accuracy for some speed gains - with your current approach there's nothing you can really tweak at all.
You may want to try a less precise version of square root. You probably don't need full double precision. Especially if the order of magnitude of your coordinate system is normally the same, you can use a truncated Taylor series to estimate the square root pretty quickly without giving up too much accuracy.
There is a very good approximation to the n-body problem that is much faster (O(n log n) vs O(n²) for the naive algorithm) called Barnes-Hut. Space is subdivided into a hierarchical grid, and when computing the force contribution of distant masses, several masses can be treated as one. There is an accuracy parameter that can be tweaked depending on how much accuracy you are willing to sacrifice for computation speed.
