R-Tree Implementation in Java

I have been searching for the last few days for a stable R-Tree implementation with support for an arbitrary number of dimensions (20 or so would be enough). The only one I found is http://sourceforge.net/projects/jsi/, but it only supports 2 dimensions.
Another option would be a multidimensional implementation of an interval tree.
Maybe I'm completely wrong about using an R-Tree or interval tree for my problem, so let me state the problem briefly so you can share your thoughts on it.
The problem I need to solve is a kind of nearest-neighbour search. I have a set of antennas and rooms, and for each antenna an interval of integers, e.g. antenna 1: min -92, max -85. The data could be represented as room -> set of antennas -> interval per antenna.
The idea was that each room spans a box in the R-Tree, with one dimension per antenna and the extent in each dimension given by that antenna's interval.
Given a query with values for N antennas, I could then represent that information as a query point and retrieve the rooms "nearest" to that point.
I hope this gives you an idea of both the problem and my approach.

Be aware that R-Trees can degrade badly when you have discrete data. The first thing you really need to find out is an appropriate data representation, then test whether your queries work on a subset of the data.
An R-Tree will only make your queries faster; if the queries don't work in the first place, it won't help. You should test your approach without an R-Tree first. Unless you hit a large amount of data (say, 100,000 objects), an in-memory linear scan can easily outperform an R-Tree, in particular when you need some adapter layer because the tree is not well integrated with your code.
The obvious approach here is to just use bounding rectangles and scan over them linearly, as sketched below. If that works, you can then store the MBRs (minimum bounding rectangles) in an R-Tree to get some performance improvement. But if it doesn't work with a linear scan, it won't work with an R-Tree either (it will not work faster).
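For illustration, such a linear scan might look like the following minimal Java sketch; the Room type, its fields, and the point-to-box distance helper are hypothetical, not from any library:

    import java.util.List;

    // Minimal sketch: rooms as axis-aligned boxes (one dimension per antenna),
    // nearest room found by a plain linear scan over all boxes.
    class Room {
        final String name;
        final double[] min; // per-antenna interval minimum
        final double[] max; // per-antenna interval maximum

        Room(String name, double[] min, double[] max) {
            this.name = name;
            this.min = min;
            this.max = max;
        }

        // Squared distance from a query point to this box (0 if the point is inside).
        double distanceSq(double[] query) {
            double sum = 0;
            for (int d = 0; d < query.length; d++) {
                double diff = 0;
                if (query[d] < min[d]) diff = min[d] - query[d];
                else if (query[d] > max[d]) diff = query[d] - max[d];
                sum += diff * diff;
            }
            return sum;
        }

        static Room nearest(List<Room> rooms, double[] query) {
            Room best = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (Room r : rooms) { // the linear scan
                double d = r.distanceSq(query);
                if (d < bestDist) {
                    bestDist = d;
                    best = r;
                }
            }
            return best;
        }
    }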

I'm not entirely clear on what your exact problem is, but an R-Tree or interval tree would not work well in 20 dimensions. That's not a huge number of dimensions, but it is large enough for the curse of dimensionality to begin showing up.
To see what I mean, consider just trying to look at all of the neighbors of a box, including the ones off of its corners and edges. With 20 dimensions you'll have 3^20 - 1 = 3,486,784,400 neighboring boxes. (You get that by noting that along each axis a neighbor's offset can be -1, 0, or +1 unit, giving 3^20 combinations, minus one because the all-zero offset is the original box itself, not a neighbor.)
I'm sorry, but you either need to accept brute force searching, or else analyze your problem better and come up with a cleverer solution.

I have found this R*-Tree implementation in Java, which seems to offer many features:
https://github.com/davidmoten/rtree
You might want to check it out!
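For reference, basic usage looks roughly like this, adapted from the project's README (the tree is immutable and queries return an rx.Observable; the exact API may have changed since). Note that its geometries are 2D, so it does not directly cover the 20-dimensional case asked about above.

    import com.github.davidmoten.rtree.RTree;
    import com.github.davidmoten.rtree.geometry.Geometries;
    import com.github.davidmoten.rtree.geometry.Point;

    // The tree is immutable: add(...) returns a new tree rather than mutating.
    RTree<String, Point> tree = RTree.create();
    tree = tree.add("room A", Geometries.point(10, 20))
               .add("room B", Geometries.point(12, 25));

    // Up to 2 nearest entries within distance 100 of (11, 22).
    tree.nearest(Geometries.point(11, 22), 100, 2)
        .forEach(entry -> System.out.println(entry.value()));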

Another good implementation in Java is ELKI: https://elki-project.github.io/.

You can use PostgreSQL’s Generalized Search Tree (GiST) indexing facility; see the GiST documentation and its quick demo.

Related

Android algorithm to find all GeoPoints within a given distance

In my naive beginner's Android mind, I thought the way to do this would be to loop through each of the objects, checking whether its proximity falls within X range and, if so, include the object. This is being done with Google Maps and GeoPoints.
That said, I know this is probably the slowest way possible. I searched for Android proximity algorithms and did not find much. What I am looking for are the best options for doing this more efficiently.
Are there any libraries I have not been able to find?
If not, should I load these Location objects into SQL and go from there, or keep them in a JSONArray?
Once I establish my best data structure, what is the best method to find all Locations within X miles of the user?
I am not asking for cut and paste code, rather the best method to this efficiently. Then, I can stumble through the code :)
My first gut feeling is to group the Locations by regions but I'm not exactly sure how to do this.
I could potentially have tens of thousands of datapoints.
Any help in simply heading in the right direction is greatly appreciated.
As a side note, I reached this juncture after discovering that a remote API I had been using was.. well.. just PLAIN WRONG and omitting datapoints from my proximity search. I also realized that if I just placed the datapoints on the phone, the user could run the app without an internet connection, using only GPS, and that would be a HUGE plus. So, with all setbacks come opportunities!
The answer depends on the representation of the GeoPoints. If they are not sorted, you need to scan all of them; this runs in linear time, whereas sorting by distance or clustering would be more expensive. Use Location.distanceTo(Location) or the static Location.distanceBetween(double, double, double, double, float[]) to calculate the distances, as in the sketch below.
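A minimal sketch of that linear scan; MyPlace is a hypothetical holder for your datapoints, while distanceBetween is the Android API mentioned above:

    import android.location.Location;
    import java.util.ArrayList;
    import java.util.List;

    // Brute-force proximity scan over a list of datapoints.
    class ProximityScan {
        static class MyPlace {
            double lat, lon;
        }

        // All places within radiusMeters of the user's position: one pass, O(n).
        static List<MyPlace> placesWithin(List<MyPlace> places, Location user,
                                          float radiusMeters) {
            List<MyPlace> hits = new ArrayList<>();
            float[] result = new float[1]; // distanceBetween writes meters here
            for (MyPlace p : places) {
                Location.distanceBetween(user.getLatitude(), user.getLongitude(),
                                         p.lat, p.lon, result);
                if (result[0] <= radiusMeters) hits.add(p);
            }
            return hits;
        }
    }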
If the GeoPoints were sorted with respect to distance from your position, this task could be done much more efficiently, but since the supplier does not know your position, I assume that this cannot be done.
If the GeoPoints are clustered, i.e. you have a set of clusters, each with a center and a radius, select each cluster where the distance from your position to the cluster's center is within the limit plus the radius. For those clusters you then need to check each contained GeoPoint (some of them may be farther from your position than the limit allows). Alternatively, you might accept the error and include all points of such a cluster (if the radius is relatively small, I would recommend this); a sketch follows.
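A rough sketch of that cluster prefilter; Cluster is hypothetical, and MyPlace and placesWithin are reused from the previous sketch:

    import android.location.Location;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical cluster: a center plus a radius covering all of its members.
    class Cluster {
        double centerLat, centerLon;
        float radiusMeters;
        List<ProximityScan.MyPlace> members;

        static List<ProximityScan.MyPlace> withinClustered(List<Cluster> clusters,
                                                           Location user,
                                                           float limitMeters) {
            List<ProximityScan.MyPlace> hits = new ArrayList<>();
            float[] d = new float[1];
            for (Cluster c : clusters) {
                Location.distanceBetween(user.getLatitude(), user.getLongitude(),
                                         c.centerLat, c.centerLon, d);
                // A cluster can only contain matches if its center lies
                // within limit + radius of the user's position.
                if (d[0] <= limitMeters + c.radiusMeters) {
                    // Exact per-member check; or, if the radius is small,
                    // simply accept all members instead.
                    hits.addAll(ProximityScan.placesWithin(c.members, user,
                                                           limitMeters));
                }
            }
            return hits;
        }
    }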

Efficient 2D Radial Gravity Layout (Java)

I'm after an efficient 2D mapping algorithm, and I've tried a number of implementations, but they all seem lacking. I'm hoping the Stack Overflow world can help out with some pointers to existing, tried-and-tested algorithms I could learn from.
My goal is to display articles based on the genre of writing; for the prototype, I am using Philosophy, Programming, Politics and Poetry, since those are the only four styles of writing I have.
Each article is weighted based on each category, and the home view will have each category as a header in each corner. The articles are then laid out in word-cloud-like format, with "artificial gravity" placing each item as near as possible to its main category (or between its main categories), without overlapping.
Currently, I am using an inefficient algorithm which stores arrays of rectangles to perform hit-test-and-search every time an article is added to the view, (with A* search patterns to find empty space to fill). By approximating a single destination for all articles of the same weight, and by using a round-robin queue to pick off articles from each pool, I can achieve fresh results (arrays are sorted by weight, then timestamp), with positioning-by-relevance ("artificial gravity").
However, using A* to search blindly seems really wasteful, even with heuristics that make each article check the points closest to its target marks first. I need a more efficient way to iterate over a 2D space.
I'm wondering if a Linked-List approach might work better; rather than go searching blindly in all directions for empty space, I can just iterate through connected nodes to ask each one if it has either a) nearby free space, or b) other connected nodes to ask (and always ask the closest node first).
If there are any better algorithms available, or critiques on my methods, any and all help would surely be appreciated.
I am using GWT Elemental + Java for this GUI, but any 2D mapping algorithm in any language will surely help.
[EDIT (request for more details)]: The main problem here is the amount of work each new addition performs; it produces noticeable glitches in the UI thread, especially when there is almost no space left, as I am searching many points in a given radius for enough free space to fit the article.
If I cut the algorithm off too soon, I get blank spots that could have been filled. If I let it run too long, the UI glitches pretty badly, and I'm sure users will hate it.
What is the fastest / most efficient way to store and modify collections of 2D space?
You haven't provided enough information to say what would make an algorithm "better." Faster? Produces layouts that are "nicer" by some metric of quality? Able to handle bigger data sources?
There is certainly nothing wrong with arrays, nor with A*. If they are giving acceptable results at the size of problem you are trying to solve, how can they be "wasteful"? Linked data structures are worthwhile only if they reduce the cost of frequently needed operations.
If you sharpen the problem, you're more likely to get a useful answer.
At any rate, there is an enormous literature on "graph layout" and "graph drawing." Try searching on these terms. If you can represent your desired layout as a collection of nodes and edges, these might apply. Many are based on simulated spring systems, which seems akin to what you are doing.
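For a flavour of the spring-system idea, here is one relaxation step of a toy force-directed layout; the Node type, the corner array, and the constants are illustrative, not taken from any particular library:

    import java.util.List;

    // Toy force-directed ("artificial gravity") layout: one relaxation step.
    class SpringLayout {
        static class Node {
            double x, y;
            double[] weights; // affinity of this article to each category corner
        }

        static void relax(List<Node> nodes, double[][] corners,
                          double repulsion, double step) {
            for (Node n : nodes) {
                double fx = 0, fy = 0;
                // Attraction toward each category corner, scaled by its weight.
                for (int c = 0; c < corners.length; c++) {
                    fx += n.weights[c] * (corners[c][0] - n.x);
                    fy += n.weights[c] * (corners[c][1] - n.y);
                }
                // Pairwise repulsion keeps nodes from piling onto one another.
                for (Node m : nodes) {
                    if (m == n) continue;
                    double dx = n.x - m.x, dy = n.y - m.y;
                    double distSq = dx * dx + dy * dy + 1e-9; // avoid divide-by-zero
                    fx += repulsion * dx / distSq;
                    fy += repulsion * dy / distSq;
                }
                n.x += step * fx;
                n.y += step * fy;
            }
        }
    }

Repeating relax(...) until movement falls below a threshold replaces the per-insertion A* search with a global, incremental adjustment, which can also be spread across frames to avoid UI stalls.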

Scattered data set in statistical data analysis

I have some statistical data. A few of the points lie far away from the majority of the data set, as shown below. What I want to do is minimize the effect of these highly scattered points; specifically, I want to compute a mean of the data set that is not dominated by them.
My data set is: 10.02, 11, 9.12, 7.89, 10.5, 11.3, 10.9, 12, 8.99, 89.23, 328.42.
I need a mean value that is not 46.3 but closer to where most of the data lies. In other words, I want to minimize the effect of 89.23 and 328.42 on the mean.
Thanks in advance
You may find that you really don't want the mean. The problem here is that the distribution you have assumed for the data differs from the actual data: if you try to fit a normal distribution to this data, you will get bad results. You could instead fit a heavy-tailed distribution such as the Cauchy. If you do want to use a normal distribution, you need to filter out the non-normal samples: if you have an idea of what the standard deviation should be, you could remove everything more than, say, 3 standard deviations away from the mean (the factor 3 should depend on the sample size). This process can be applied recursively to remove outlying samples until you are happy with the size of the largest remaining outlier in terms of the standard deviation; a sketch follows.
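That recursive filtering (often called sigma clipping) is straightforward to implement; a minimal sketch, with the clipping factor k as a parameter:

    import java.util.Arrays;

    // Iteratively drop samples more than k standard deviations from the mean.
    class SigmaClip {
        static double[] clip(double[] data, double k) {
            double[] current = data.clone();
            while (true) {
                double sum = 0;
                for (double v : current) sum += v;
                final double mean = sum / current.length;

                double sq = 0;
                for (double v : current) sq += (v - mean) * (v - mean);
                final double sd = Math.sqrt(sq / current.length);

                final double limit = k * sd;
                double[] kept = Arrays.stream(current)
                        .filter(v -> Math.abs(v - mean) <= limit)
                        .toArray();
                if (kept.length == current.length) return current; // converged
                current = kept;
            }
        }
    }

On the data above, clip(data, 2.0) removes 328.42 on the first pass and 89.23 on the second, and the mean of what remains is roughly 10.2; with k = 3, the point 89.23 only just survives, which illustrates why the factor must be tuned to the sample.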
Unfortunately, the mean of a set of data is just that: the mean value. Are you sure that the point is actually an outlier? Your data contains what appears to be a single outlier with regard to the clustering, but if you plot the data, it does seem to follow a rough linear trend, so is it truly an outlier?
If this reading is really causing you problems, you could remove it entirely. Beyond that, the only thing I can suggest is to calculate some kind of weighted mean rather than the plain mean (http://en.wikipedia.org/wiki/Weighted_mean). That way you can assign a lower weight to the suspect point when calculating your mean (though how you choose the weight is another matter). This is similar to weighted regression, where particular data points carry less weight in the regression fit, possibly due to the unreliability of certain points (http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#Weighted_linear_least_squares); a minimal helper is sketched below.
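The weighted mean itself is a one-liner once the weights are chosen; how you assign them (e.g. down-weighting suspected outliers) is the real modelling decision:

    // Weighted mean: sum(w[i] * x[i]) / sum(w[i]).
    static double weightedMean(double[] x, double[] w) {
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += w[i] * x[i];
            den += w[i];
        }
        return num / den;
    }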
Hope this helps a little, or at least gives you some pointers to other avenues that you can try pursuing.

How do I determine a best-fit distribution in Java?

I have a bunch of data sets (each between 50 and 500 points, where each point takes a positive integral value) and need to determine which distribution best describes each of them. I have done this manually for several of them, but need to automate it going forward.
Some of the sets are completely modal (every datum has the value of 15), some are strongly modal or bimodal, some are bell curves (often skewed and with differing degrees of kurtosis/pointiness), some are roughly flat, and there are any number of other possible distributions (Poisson, power-law, etc.). I need a way to determine which distribution best describes the data and (ideally) also provides a fitness metric so that I know how confident I am in the analysis.
Existing open-source libraries would be ideal, followed by well documented algorithms that I can implement myself.
Looking for a distribution that fits is unlikely to give you good results in the absence of some a priori knowledge. You may find a distribution that coincidentally is a good fit but is unlikely to be the underlying distribution.
Do you have any metadata available that would hint at what the data means? E.g., "this is open-ended data sampled from a natural population, so it's some sort of normal distribution", vs. "this data is inherently bounded at 0 and discrete, so check for the best-fitting Poisson".
I don't know of any distribution solvers for Java off the top of my head, and I don't know of any that will guess which distribution to use. You could examine some statistical properties (skew, etc.) and make some guesses, but you're more likely to end up with an accidentally good fit that does not adequately represent the underlying distribution. Real data is noisy, and there are just too many degrees of freedom if you don't even know what distribution it is.
This may be above and beyond what you want to do, but it seems the most complete approach (and it allows access to the wealth of statistical knowledge available inside R):
use JRI to communicate with the R statistical language
use R, internally, as indicated in this thread
Look at Apache commons-math.
What you're looking for comes under the general heading of "goodness of fit." You could search on "goodness of fit test."
Donald Knuth describes a couple of popular goodness-of-fit tests in Seminumerical Algorithms: the chi-squared test and the Kolmogorov-Smirnov test. But you first need some idea of which distribution you want to test; for example, if you have bell-curve data, you might try the normal or Cauchy distributions.
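For instance, tying in with the commons-math suggestion above, Apache commons-math (math3) ships a Kolmogorov-Smirnov test that can score a candidate distribution against a sample; a sketch using the math3 class names:

    import org.apache.commons.math3.distribution.NormalDistribution;
    import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;
    import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest;

    double[] data = {9.1, 10.4, 11.2, 9.8, 10.9, 10.1, 9.5, 10.7};

    // Fit a candidate normal distribution from the sample moments.
    DescriptiveStatistics stats = new DescriptiveStatistics(data);
    NormalDistribution candidate =
            new NormalDistribution(stats.getMean(), stats.getStandardDeviation());

    // K-S p-value: a high p-value means no evidence against this fit.
    double p = new KolmogorovSmirnovTest().kolmogorovSmirnovTest(candidate, data);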
If all you really need the distribution for is to model the data you have sampled, you can make your own distribution based on the data you have:
1. Create a histogram of your sample: One method for selecting the bin size is here. There are other methods for selecting bin size, which you may prefer.
2. Derive the sample CDF: Think of the histogram as your PDF, and just compute the integral. It's probably best to scale the height of the bins so that the CDF has the right characteristics ... namely that the value of the CDF at +Infinity is 1.0.
To use the distribution for modeling purposes:
3. Draw X from your distribution: Make a draw Y from U(0,1). Use a reverse lookup on your CDF of the value Y to determine the X such that CDF(X) = Y. Since the CDF is invertible, X is unique.
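Putting steps 1-3 together, a bare-bones version of this empirical-distribution approach could look like the sketch below; the bin-size rule and the reverse lookup are deliberately simplified (fixed bin count, bin midpoints instead of interpolation):

    import java.util.Arrays;
    import java.util.Random;

    // Empirical distribution from a sample: histogram -> CDF -> reverse lookup.
    class EmpiricalDist {
        final double min, binWidth;
        final double[] cdf; // cdf[i] = P(X <= right edge of bin i); last entry 1.0

        EmpiricalDist(double[] sample, int bins) {
            double lo = Arrays.stream(sample).min().getAsDouble();
            double hi = Arrays.stream(sample).max().getAsDouble();
            min = lo;
            binWidth = (hi - lo) / bins;
            int[] counts = new int[bins];
            for (double v : sample) {
                int b = Math.min((int) ((v - lo) / binWidth), bins - 1);
                counts[b]++;
            }
            cdf = new double[bins];
            double running = 0;
            for (int i = 0; i < bins; i++) {
                running += counts[i] / (double) sample.length; // scales CDF to 1.0
                cdf[i] = running;
            }
        }

        // Draw X: take Y ~ U(0,1), reverse-lookup the first bin with CDF >= Y.
        double draw(Random rng) {
            double y = rng.nextDouble();
            int i = 0;
            while (i < cdf.length - 1 && cdf[i] < y) i++;
            return min + (i + 0.5) * binWidth; // midpoint of the located bin
        }
    }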
I've heard of a package called Eureqa that might fill the bill nicely. I've only downloaded it; I haven't tried it myself yet.
You can proceed with a three-step approach, using the SSJ library:
1. Fit each distribution separately using maximum likelihood estimation (MLE). With SSJ, this can be done with the static method getInstanceFromMLE(double[] x, int n) available on each distribution.
2. For each fitted distribution, compute its goodness of fit against the real data, for example using Kolmogorov-Smirnov: static void kolmogorovSmirnov(double[] data, ContinuousDistribution dist, double[] sval, double[] pval). Note that you do not need to sort the data before calling this function.
3. Pick the distribution with the highest p-value as your best-fit distribution.
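A sketch of that three-step loop, using only the two methods quoted above. The package and class names (umontreal.ssj.probdist.*, GofStat) and the assumption that pval[2] holds the two-sided K-S p-value are from memory of SSJ's layout and should be checked against the documentation for your version:

    import umontreal.ssj.gof.GofStat;
    import umontreal.ssj.probdist.ContinuousDistribution;
    import umontreal.ssj.probdist.ExponentialDist;
    import umontreal.ssj.probdist.GammaDist;
    import umontreal.ssj.probdist.NormalDist;

    double[] x = {9.1, 10.4, 11.2, 9.8, 10.9, 10.1, 9.5, 10.7};
    int n = x.length;

    // Step 1: fit each candidate distribution by MLE.
    ContinuousDistribution[] candidates = {
            NormalDist.getInstanceFromMLE(x, n),
            ExponentialDist.getInstanceFromMLE(x, n),
            GammaDist.getInstanceFromMLE(x, n)
    };

    // Steps 2 and 3: K-S test each fit, keep the highest p-value.
    ContinuousDistribution best = null;
    double bestP = -1;
    for (ContinuousDistribution dist : candidates) {
        double[] sval = new double[3], pval = new double[3];
        GofStat.kolmogorovSmirnov(x, dist, sval, pval); // data need not be sorted
        if (pval[2] > bestP) {
            bestP = pval[2];
            best = dist;
        }
    }
    System.out.println("Best fit: " + best + " (p = " + bestP + ")");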
