I have a statistical data set in which a few values are scattered far from the majority of the points, as shown below. I want to minimize the effect of these highly scattered values when computing the mean of the data set.
My data set looks like this:
10.02, 11, 9.12, 7.89, 10.5, 11.3, 10.9, 12, 8.99, 89.23, 328.42.
As shown in the figure below, the mean is 46.3, but I need a value closer to the rest of the data distribution. In other words, I want to minimize the effect of 89.23 and 328.42 in the mean calculation.
Thanks in advance
You might notice that you really don't want the mean. The problem here is that the distribution you've assumed for the data differs from the actual data: if you try to fit a normal distribution to this data, you'll get bad results. You could instead fit a heavy-tailed distribution such as the Cauchy. If you want to use a normal distribution, you need to filter out the non-normal samples. If you have a sense of what the standard deviation should be, you could remove everything more than, say, 3 standard deviations from the mean (the exact multiple should depend on the sample size). This process can be applied recursively, removing non-normal samples until you are happy with how large the remaining outliers are in terms of the standard deviation.
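For example, a recursive sigma-clipping filter might look like the sketch below (all names are mine). Note that with only 11 points a cutoff of 3 standard deviations still leaves 89.23 in the sample, so I've used k = 2.5 here, which illustrates the point above that the multiple has to depend on the sample size:

```java
import java.util.ArrayList;
import java.util.List;

public class SigmaClip {
    // Repeatedly drop points more than k standard deviations from the mean,
    // until nothing is removed, then return the mean of the survivors.
    static double clippedMean(double[] data, double k) {
        List<Double> kept = new ArrayList<>();
        for (double d : data) kept.add(d);
        while (true) {
            double mean = 0;
            for (double d : kept) mean += d;
            mean /= kept.size();
            double var = 0;
            for (double d : kept) var += (d - mean) * (d - mean);
            double sd = Math.sqrt(var / kept.size()); // population standard deviation
            List<Double> next = new ArrayList<>();
            for (double d : kept) if (Math.abs(d - mean) <= k * sd) next.add(d);
            if (next.size() == kept.size()) return mean; // nothing removed: done
            kept = next;
        }
    }

    public static void main(String[] args) {
        double[] data = {10.02, 11, 9.12, 7.89, 10.5, 11.3, 10.9, 12, 8.99, 89.23, 328.42};
        // First pass drops 328.42, second pass drops 89.23; prints ~10.19.
        System.out.println(clippedMean(data, 2.5));
    }
}
```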
Unfortunately, the mean of a data set is just that: the mean value. Are you sure the point is actually an outlier? Your data contains what appears to be an outlier with regard to the clustering, but if you take a look at your plot, the data does seem to have a roughly linear relationship, so is it truly an outlier?
If this reading is really causing you problems, you could remove it entirely. Other than that, the only thing I can suggest is to calculate some kind of weighted mean rather than the true mean: http://en.wikipedia.org/wiki/Weighted_mean . This way you can assign a lower weight to the point when calculating your mean (although how you choose a value for the weight is another matter). This is similar to weighted regression, where particular data points carry less weight in the regression fit (possibly because certain points are known to be unreliable): http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#Weighted_linear_least_squares .
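For instance, here is one hypothetical weighting scheme that down-weights points far from the median; the weight formula is entirely arbitrary and only illustrates the mechanics:

```java
import java.util.Arrays;

public class WeightedMean {
    public static void main(String[] args) {
        double[] x = {10.02, 11, 9.12, 7.89, 10.5, 11.3, 10.9, 12, 8.99, 89.23, 328.42};
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2]; // 10.9 for this data
        double num = 0, den = 0;
        for (double xi : x) {
            // Arbitrary rule: the further a point is from the median, the smaller its weight.
            double w = 1.0 / (1.0 + Math.abs(xi - median));
            num += w * xi;
            den += w;
        }
        System.out.println(num / den); // weighted mean, pulled toward the bulk of the data
    }
}
```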
Hope this helps a little, or at least gives you some pointers to other avenues that you can try pursuing.
Related
A server receives monitoring data for some process at a fixed rate (12 points per minute) from an external source (web services, etc.). A process may run for a minute or less, or for an hour or a day, so at the end of a process I may have 5, 720, or 17,280 data points. This data is gathered for more than 40 parameters and stored in a database for future display via the web. Imagine the amount of data generated when more than 1,000 processes are running. I have to stick to an RDBMS (MySQL specifically), so I want to process the data and reduce its volume by selecting only statistically significant points before storing it in the database. The ultimate objective is to plot these data points on a graph where the Y-axis is time and the X-axis is some parameter (part of the data point).
I do not want to miss any significant fluctuation or feature of the data, but at the same time I cannot plot all of the data points when their number is huge (> 100).
Please note that I am aware of basic statistical terms like mean, standard deviation, etc.
If this is a constant process, you could plot the mean (which should be a flat line) and any points that exceed a certain threshold. Three standard deviations might be a good threshold to start with; then see whether it gives you the information you need.
If it's not a constant process, you need to figure out how it should be varying with time and do a similar thing: plot the points that substantially vary from your expectation at that point in time.
That should give you a pretty clean graph while still communicating the important information.
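A minimal sketch of the constant-process case (all names are mine; start with k = 3 as suggested and tune from there):

```java
import java.util.ArrayList;
import java.util.List;

public class SignificantPoints {
    // Keep only the points that deviate from the mean by more than k standard
    // deviations; everything else is summarized by the mean line on the plot.
    static List<double[]> filter(double[] t, double[] v, double k) {
        double mean = 0;
        for (double x : v) mean += x;
        mean /= v.length;
        double var = 0;
        for (double x : v) var += (x - mean) * (x - mean);
        double sd = Math.sqrt(var / v.length);
        List<double[]> kept = new ArrayList<>();
        for (int i = 0; i < v.length; i++)
            if (Math.abs(v[i] - mean) > k * sd)
                kept.add(new double[]{t[i], v[i]}); // {time, value} pairs to store and plot
        return kept;
    }
}
```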
If you expect your process to be noisy, then smoothing with a spline can help you reduce noise and compress your data, since to draw a spline you only need a few knot points, where "few" is arbitrarily picked by you depending on how much detail you want to discard.
However, if your process is not noisy, then outliers are very important, since they may represent errors or exceptional conditions. In that case, you are better off getting rid of the points that are close to the average (say, less than 1 standard deviation away) and keeping those that are far.
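For the noisy case, here is a rough sketch of spline-based compression using Apache commons-math's SplineInterpolator (assuming commons-math is an option for you; the keep-every-step-th-point knot selection is my own simplification):

```java
import org.apache.commons.math3.analysis.interpolation.SplineInterpolator;
import org.apache.commons.math3.analysis.polynomials.PolynomialSplineFunction;

public class SplineCompress {
    // Keep every step-th sample as a knot and fit a cubic spline through the
    // knots; you then store only the knots and reconstruct the curve on demand.
    // Needs at least 3 knots; the last sample may be dropped if it does not
    // fall on a step boundary.
    static PolynomialSplineFunction compress(double[] t, double[] v, int step) {
        int n = (t.length - 1) / step + 1;
        double[] kt = new double[n], kv = new double[n];
        for (int i = 0; i < n; i++) {
            kt[i] = t[i * step];
            kv[i] = v[i * step];
        }
        return new SplineInterpolator().interpolate(kt, kv);
    }
}
```

Calling `compress(t, v, 10).value(someTime)` then gives you the smoothed value at any time between the first and last knot.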
A little note: the term "statistically significant" describes a level of certainty high enough to reject the null hypothesis; I don't think it applies to your problem.
For the last few days I have been searching for a stable R-Tree implementation that supports an arbitrary number of dimensions (20 or so would be enough). I only found http://sourceforge.net/projects/jsi/ but it supports only 2 dimensions.
Another option would be a multidimensional implementation of an interval tree.
Maybe I'm completely wrong about the idea of using an R-Tree or interval tree for my problem, so let me state the problem briefly so that you can send me your thoughts about it.
The problem I need to solve is a kind of nearest-neighbour search. I have a set of antennas and rooms, and for each antenna an interval of integers, e.g. antenna 1: min -92, max -85. In effect the data can be represented as room -> set of antennas -> interval per antenna.
The idea was that each room spans a box in the R-Tree, with one dimension per antenna, extending along each dimension over that antenna's interval.
When I get a query with N antennas and a value for each antenna, I can represent that information as a query point and retrieve the rooms "nearest" to that point.
I hope this gives you an idea of the problem and my approach.
Be aware that R-Trees can degrade badly when you have discrete data. The first thing you really need to work out is an appropriate data representation; then test whether your queries work on a subset of the data.
R-Trees will only make your queries faster; if the queries don't work in the first place, an R-Tree will not help. You should therefore test your approach without R-Trees first. Unless you hit a large amount of data (say, 100,000 objects), an in-memory linear scan can easily outperform an R-Tree, in particular when you need some adapter layer because the tree is not well integrated with your code.
The obvious approach here is to just use bounding rectangles and scan over them linearly. If that works, you can then store the MBRs (minimum bounding rectangles) in an R-Tree to get some performance improvement. But if it doesn't work with a linear scan, it won't work with an R-Tree either (the R-Tree only makes it faster, not more correct).
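To make that concrete, here is a hypothetical linear scan over the rooms' interval boxes for the antenna problem above; the "distance" is simply how far the query point falls outside each room's box (all names are mine):

```java
import java.util.Map;

public class RoomScan {
    // rooms: room name -> per-antenna interval {min, max}, antennas indexed 0..d-1
    // query: one reading per antenna
    static String nearestRoom(Map<String, double[][]> rooms, double[] query) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[][]> e : rooms.entrySet()) {
            double[][] box = e.getValue();
            double dist = 0;
            for (int a = 0; a < query.length; a++) {
                // Distance from the reading to the interval; 0 if it lies inside.
                if (query[a] < box[a][0]) dist += box[a][0] - query[a];
                else if (query[a] > box[a][1]) dist += query[a] - box[a][1];
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = e.getKey();
            }
        }
        return best;
    }
}
```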
I'm not entirely clear on what your exact problem is, but an R-Tree or interval tree would not work well in 20 dimensions. That's not a huge number of dimensions, but it is large enough for the curse of dimensionality to begin showing up.
To see what I mean, consider just trying to look at all of the neighbors of a box, including those off corners and edges. With 20 dimensions you'll have 3^20 - 1 = 3,486,784,400 neighboring boxes. (You get that by noting that along each axis a neighbor's offset can be -1, 0, or +1 unit, and the all-zero offset is excluded because it is the original box itself.)
I'm sorry, but you either need to accept brute force searching, or else analyze your problem better and come up with a cleverer solution.
I have found this R*-Tree implementation in Java which seems to offer many features:
https://github.com/davidmoten/rtree
You might want to check it out!
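Basic usage looks roughly like this (based on my reading of the project's README; note that this library's geometries are 2-dimensional, so it would not directly cover the 20-antenna case):

```java
import com.github.davidmoten.rtree.RTree;
import com.github.davidmoten.rtree.geometry.Geometries;
import com.github.davidmoten.rtree.geometry.Point;

public class RTreeDemo {
    public static void main(String[] args) {
        RTree<String, Point> tree = RTree.create();           // the tree is immutable:
        tree = tree.add("room-1", Geometries.point(10, 20));  // add() returns a new tree
        tree = tree.add("room-2", Geometries.point(12, 25));
        tree.nearest(Geometries.point(11, 21), 10, 1)         // nearest entry within distance 10
            .forEach(entry -> System.out.println(entry.value()));
    }
}
```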
Another good implementation in Java is ELKI: https://elki-project.github.io/.
You can use PostgreSQL's Generalized Search Tree (GiST) indexing facility; see the GiST documentation and this quick demo.
I have a number of data sets (each with between 50 and 500 points, where each point takes a positive integral value) and need to determine which distribution best describes each of them. I have done this manually for several of them, but I need to automate it going forward.
Some of the sets are completely modal (every datum has the value 15), some are strongly modal or bimodal, some are bell curves (often skewed, and with differing degrees of kurtosis/pointiness), some are roughly flat, and there are any number of other possible distributions (Poisson, power law, etc.). I need a way to determine which distribution best describes the data, and ideally one that also provides a fitness metric so that I know how confident I can be in the result.
Existing open-source libraries would be ideal, followed by well-documented algorithms that I can implement myself.
Looking for a distribution that fits is unlikely to give you good results in the absence of some a priori knowledge. You may find a distribution that coincidentally is a good fit but is unlikely to be the underlying distribution.
Do you have any metadata available that would hint at what the data means? E.g., "this is open-ended data sampled from a natural population, so it's some sort of normal distribution", vs. "this data is inherently bounded at 0 and discrete, so check for the best-fitting Poisson".
I don't know of any distribution solvers for Java off the top of my head, and I don't know of any that will guess which distribution to use. You could examine some statistical properties (skew, etc.) and make guesses from there, but you're more likely to end up with an accidentally good fit that does not adequately represent the underlying distribution. Real data is noisy, and there are just too many degrees of freedom if you don't even know what distribution it is.
This may be above and beyond what you want to do, but it seems the most complete approach (and it allows access to the wealth of statistical knowledge available inside R):
use JRI to communicate with the R statistical language
use R, internally, as indicated in this thread
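A rough sketch of the JRI route (assuming R, the MASS package, and the JRI native library are installed and configured; the choice of fitdistr and the Poisson model is just an example):

```java
import org.rosuda.JRI.REXP;
import org.rosuda.JRI.Rengine;

public class RFitDemo {
    public static void main(String[] args) {
        // Start an embedded R session (R_HOME and the JRI native library must be set up).
        Rengine re = new Rengine(new String[]{"--vanilla"}, false, null);
        double[] sample = {12, 15, 15, 14, 16, 15, 13, 15};
        re.assign("x", sample);
        // Use R's fitdistr from MASS to fit, e.g., a Poisson by maximum likelihood.
        REXP lambda = re.eval("library(MASS); fitdistr(x, 'Poisson')$estimate");
        System.out.println("lambda = " + lambda.asDouble());
        re.end();
    }
}
```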
Look at Apache commons-math.
What you're looking for comes under the general heading of "goodness of fit"; you could search on "goodness-of-fit test".
Donald Knuth describes a couple of popular goodness-of-fit tests in Seminumerical Algorithms: the chi-squared test and the Kolmogorov-Smirnov test. But you've got to have some idea first of which distribution you want to test; for example, if you have bell-curve data, you might try the normal or Cauchy distributions.
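Using the Apache commons-math library mentioned above, a Kolmogorov-Smirnov check could look like this (a minimal sketch with made-up data; be aware that estimating the parameters from the same sample you then test against biases the p-value upward):

```java
import org.apache.commons.math3.distribution.NormalDistribution;
import org.apache.commons.math3.distribution.RealDistribution;
import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest;

public class GoodnessOfFit {
    public static void main(String[] args) {
        double[] data = {14.2, 15.1, 15.0, 13.8, 16.2, 15.5, 14.9, 15.3, 14.7, 15.8};
        // Estimate the candidate's parameters from the sample...
        double mean = 0;
        for (double d : data) mean += d;
        mean /= data.length;
        double var = 0;
        for (double d : data) var += (d - mean) * (d - mean);
        RealDistribution candidate =
                new NormalDistribution(mean, Math.sqrt(var / (data.length - 1)));
        // ...then test how well the sample matches it.
        double p = new KolmogorovSmirnovTest().kolmogorovSmirnovTest(candidate, data);
        System.out.println("KS p-value: " + p); // a small p-value means: reject this candidate
    }
}
```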
If all you really need the distribution for is to model the data you have sampled, you can make your own distribution based on the data you have:
1. Create a histogram of your sample: one method for selecting the bin size is here; there are other methods for selecting the bin size that you may prefer.
2. Derive the sample CDF: think of the histogram as your PDF and compute its integral. It's probably best to scale the bin heights so that the CDF has the right characteristics, namely that the value of the CDF at +infinity is 1.0.
To use the distribution for modeling purposes:
3. Draw X from your distribution: make a draw Y from U(0,1), then do a reverse lookup on your CDF to determine the X such that CDF(X) = Y. As long as the CDF is invertible (i.e., every bin is non-empty), this X is unique.
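Here is what those three steps might look like in code (a minimal sketch with equal-width bins; returning the bin midpoint in the last step is my own simplification):

```java
import java.util.Arrays;
import java.util.Random;

public class EmpiricalDraw {
    // Steps 1-3 above: histogram -> CDF -> reverse lookup.
    static double[] cdf;      // cumulative probability at the right edge of each bin
    static double lo, width;  // bin geometry

    static void build(double[] sample, int bins) {
        lo = Arrays.stream(sample).min().getAsDouble();
        double hi = Arrays.stream(sample).max().getAsDouble();
        width = (hi - lo) / bins; // assumes hi > lo
        int[] counts = new int[bins];
        for (double x : sample)
            counts[Math.min(bins - 1, (int) ((x - lo) / width))]++;
        cdf = new double[bins];
        double run = 0;
        for (int i = 0; i < bins; i++) {
            run += counts[i] / (double) sample.length; // scale so the CDF ends at 1.0
            cdf[i] = run;
        }
    }

    static double draw(Random rng) {
        double y = rng.nextDouble();                       // Y ~ U(0,1)
        int i = 0;
        while (i < cdf.length - 1 && cdf[i] < y) i++;      // reverse lookup: first bin with CDF >= Y
        return lo + (i + 0.5) * width;                     // bin midpoint as the drawn X
    }
}
```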
I've heard of a package called Eureqa that might fill the bill nicely. I've only downloaded it; I haven't tried it myself yet.
You can proceed with a three-step approach, using the SSJ library:
Fit each distribution separately using maximum likelihood estimation (MLE). Using SSJ, this can be done with the static method getInstanceFromMLE(double[] x, int n) available on each distribution.
For each fitted distribution, compute its goodness of fit against the real data, for example with the Kolmogorov-Smirnov test: static void kolmogorovSmirnov(double[] data, ContinuousDistribution dist, double[] sval, double[] pval). Note that you don't need to sort the data before calling this function.
Pick the distribution with the highest p-value as your best-fit distribution.
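A sketch of those three steps (the class names are taken from SSJ's probdist and gof packages, but the candidate list and the sval/pval array layout are my assumptions, so check them against the SSJ documentation):

```java
import umontreal.ssj.gof.GofStat;
import umontreal.ssj.probdist.ContinuousDistribution;
import umontreal.ssj.probdist.ExponentialDist;
import umontreal.ssj.probdist.GammaDist;
import umontreal.ssj.probdist.NormalDist;

public class BestFit {
    public static void main(String[] args) {
        double[] x = {14.2, 15.1, 15.0, 13.8, 16.2, 15.5, 14.9, 15.3}; // your sample
        int n = x.length;
        // Step 1: fit each candidate by MLE.
        ContinuousDistribution[] candidates = {
                NormalDist.getInstanceFromMLE(x, n),
                ExponentialDist.getInstanceFromMLE(x, n),
                GammaDist.getInstanceFromMLE(x, n)
        };
        // Steps 2 and 3: KS-test each fit and keep the highest p-value.
        ContinuousDistribution best = null;
        double bestP = -1;
        for (ContinuousDistribution d : candidates) {
            double[] sval = new double[3], pval = new double[3]; // KS+, KS-, KS (assumed layout)
            GofStat.kolmogorovSmirnov(x, d, sval, pval);
            double p = pval[2]; // assumed: p-value of the two-sided KS statistic
            if (p > bestP) { bestP = p; best = d; }
        }
        System.out.println("Best fit: " + best + " (p = " + bestP + ")");
    }
}
```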
I have data that I would like to plot and, more importantly, run a least-squares regression on using cosines (instead of polynomials):
Any recommendations? Thanks.
The following page probably addresses the regression part of your problem:
http://www.teneighty.org/software/index.html?f=fft&c=e98b8
You might find the demo "Least Squares & Data Fitting" helpful, since it addresses a few parts of your problem.
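If you would rather code the fit yourself, ordinary least squares with a cosine design matrix is all you need. Here is a rough sketch using Apache commons-math (the data and the number of basis terms are made up):

```java
import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

public class CosineFit {
    public static void main(String[] args) {
        // Hypothetical data; replace with your own.
        double[] x = {0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0};
        double[] y = {2.1, 1.7, 0.9, 0.1, -0.4, -0.1, 0.8, 1.6, 2.0};
        int nBasis = 3; // number of cosine terms (must stay well below the number of points)
        double[][] design = new double[x.length][nBasis];
        for (int i = 0; i < x.length; i++)
            for (int k = 1; k <= nBasis; k++)
                design[i][k - 1] = Math.cos(k * x[i]); // basis functions: cos(kx)
        OLSMultipleLinearRegression ols = new OLSMultipleLinearRegression();
        ols.newSampleData(y, design);                  // an intercept term is added automatically
        double[] beta = ols.estimateRegressionParameters();
        // beta[0] is the constant term, beta[k] the coefficient of cos(kx).
        for (int k = 0; k < beta.length; k++)
            System.out.println("beta[" + k + "] = " + beta[k]);
    }
}
```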
Just a bit of cautionary advice: using a Fourier series makes sense if you think your underlying function has a cosine series as a natural basis; however, if you are using it as a basis for an arbitrary function of unknown shape, you may do better guessing at a more specific underlying function type (polynomial, exponential, etc.).
I once did some constrained optimization on such a series, and the fitted function wiggled around so much that it was hard to say whether my fit was meaningful; the fit function had a great number of local maxima.
MathGL can plot, fit (with the help of GSL), and display the fitting result; see this sample.