How do I determine a best-fit distribution in Java?

I have a bunch of sets of data (between 50 and 500 points, each of which can take a positive integral value) and need to determine which distribution best describes them. I have done this manually for several of them, but need to automate this going forward.
Some of the sets are completely modal (every datum has the value of 15), some are strongly modal or bimodal, some are bell curves (often skewed and with differing degrees of kurtosis/pointiness), some are roughly flat, and there are any number of other possible distributions (Poisson, power-law, etc.). I need a way to determine which distribution best describes the data and (ideally) also provides a fitness metric so that I know how confident I can be in the analysis.
Existing open-source libraries would be ideal, followed by well documented algorithms that I can implement myself.

Looking for a distribution that fits is unlikely to give you good results in the absence of some a priori knowledge. You may find a distribution that coincidentally is a good fit but is unlikely to be the underlying distribution.
Do you have any metadata available that would hint at what the data means? E.g., "this is open-ended data sampled from a natural population, so it's some sort of normal distribution", vs. "this data is inherently bounded at 0 and discrete, so check for the best-fitting Poisson".
I don't know of any distribution solvers for Java off the top of my head, and I don't know of any that will guess which distribution to use. You could examine some statistical properties (skew/etc.) and make some guesses here--but you're more likely to end up with an accidentally good fit which does not adequately represent the underlying distribution. Real data is noisy and there are just too many degrees of freedom if you don't even know what distribution it is.

This may be above and beyond what you want to do, but it seems the most complete approach (and it allows access to the wealth of statistical knowledge available inside R):
use JRI to communicate with the R statistical language
use R, internally, as indicated in this thread

Look at Apache commons-math.

What you're looking for comes under the general heading of "goodness of fit." You could search on "goodness of fit test."
Donald Knuth describes a couple of popular goodness of fit tests in Seminumerical Algorithms: the chi-squared test and the Kolmogorov-Smirnov test. But you've got to have some idea first what distribution you want to test. For example, if you have bell curve data, you might try normal or Cauchy distributions.
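As a rough illustration of the Kolmogorov-Smirnov idea, here is a minimal sketch that computes only the KS statistic D (the largest gap between the empirical CDF and a candidate CDF). The class name KsStatistic, the method name, and the choice of an exponential candidate in main are assumptions made for this example; turning D into a p-value needs the KS distribution, which is not covered here. Note also that the KS test is designed for continuous distributions, so for integer data it should be treated as approximate.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

// Hypothetical helper: computes the one-sample Kolmogorov-Smirnov statistic.
class KsStatistic {

    // D = max over sorted x_i of max(i/n - F(x_i), F(x_i) - (i-1)/n), with i = 1..n.
    static double ksStatistic(double[] data, DoubleUnaryOperator cdf) {
        if (data.length == 0) {
            throw new IllegalArgumentException("empty sample");
        }
        double[] x = data.clone();
        Arrays.sort(x);
        int n = x.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(x[i]);
            double above = (i + 1.0) / n - f;   // empirical CDF just after x[i]
            double below = f - (double) i / n;  // empirical CDF just before x[i]
            d = Math.max(d, Math.max(above, below));
        }
        return d;
    }

    public static void main(String[] args) {
        double[] sample = {0.3, 1.2, 0.7, 2.5, 0.1, 1.9, 0.4};
        double mean = Arrays.stream(sample).average().orElseThrow();
        // Candidate: exponential distribution with rate 1/mean (an assumption for the demo).
        double rate = 1.0 / mean;
        double d = ksStatistic(sample, v -> v < 0 ? 0.0 : 1.0 - Math.exp(-rate * v));
        System.out.println("KS statistic D = " + d);
    }
}
```

A smaller D means the candidate CDF tracks the data more closely, so D can be used to rank several candidate distributions fitted to the same sample.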

If all you really need the distribution for is to model the data you have sampled, you can make your own distribution based on the data you have:
1. Create a histogram of your sample: One method for selecting the bin size is here. There are other methods for selecting bin size, which you may prefer.
2. Derive the sample CDF: Think of the histogram as your PDF, and just compute the integral. It's probably best to scale the height of the bins so that the CDF has the right characteristics ... namely that the value of the CDF at +Infinity is 1.0.
To use the distribution for modeling purposes:
3. Draw X from your distribution: Make a draw Y from U(0,1). Use a reverse lookup on your CDF of the value Y to determine the X such that CDF(X) = Y. Where the CDF is strictly increasing, X is unique; for a step-shaped CDF (as with integer data), take the smallest X with CDF(X) >= Y.
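The three steps above can be sketched as follows for integer data. This is only an illustration: the class name EmpiricalDistribution and its methods are made up for the example, and it uses one bin per distinct value instead of a chosen bin width, which is an assumption that suits small integer samples.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: empirical distribution of an integer sample,
// sampled by inverting the step-shaped empirical CDF.
class EmpiricalDistribution {
    private final int[] values;  // distinct sample values, ascending
    private final double[] cdf;  // cdf[i] = fraction of the sample <= values[i]

    EmpiricalDistribution(int[] sample) {
        if (sample.length == 0) {
            throw new IllegalArgumentException("empty sample");
        }
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        values = Arrays.stream(sorted).distinct().toArray();
        cdf = new double[values.length];
        int j = 0;
        for (int i = 0; i < values.length; i++) {
            // Advance past every occurrence of values[i]; j is then the cumulative count.
            while (j < sorted.length && sorted[j] == values[i]) {
                j++;
            }
            cdf[i] = (double) j / sorted.length; // last entry is exactly 1.0
        }
    }

    // Smallest value x with CDF(x) >= u, found by binary search.
    int quantile(double u) {
        int lo = 0;
        int hi = values.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] >= u) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return values[lo];
    }

    // Inverse-transform sampling: draw Y from U(0,1), then look up X.
    int draw(Random rng) {
        return quantile(rng.nextDouble());
    }

    public static void main(String[] args) {
        EmpiricalDistribution dist = new EmpiricalDistribution(new int[]{1, 2, 2, 3, 3, 3, 7});
        Random rng = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(dist.draw(rng));
        }
    }
}
```

Draws from this object can only reproduce values that occurred in the sample, which is the trade-off of modelling the data directly instead of fitting a named distribution.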

I've heard of a package called Eureqa that might fill the bill nicely. I've only downloaded it; I haven't tried it myself yet.

You can proceed with a three-step approach, using the SSJ library:
Fit each distribution separately using maximum likelihood estimation (MLE). Using SSJ, this can be done with the static method getInstanceFromMLE(double[] x, int n) available on each distribution.
For each distribution you have obtained, compute its goodness of fit with the real data, for example using Kolmogorov-Smirnov: static void kolmogorovSmirnov(double[] data, ContinuousDistribution dist, double[] sval, double[] pval). Note that you don't need to sort the data before calling this function.
Pick the distribution having the highest p-value as your best-fit distribution.

Related

Finding poisson distribution from data set in Java

I have a large data set in Excel. I want to find out in Java whether the numbers follow a Poisson distribution or a binomial distribution. Is there any open-source library that would help me get this done? I'm looking at Apache Commons Math.
Any pointers would help.
It sounds like you have a (relatively simple) model fitting problem, and you are trying to choose between two distributions. The way that you would usually do this is as follows.
Estimate parameters p_poisson for the Poisson distribution on your data.
Estimate parameters p_binomial for the binomial distribution on your data.
Compute p(data | p_poisson) and p(data | p_binomial) (the likelihood function) and choose the one that has higher probability.
For more generality, I would recommend looking at AIC, BIC, and general information on model selection. In this case, if you don't have a ton of data, the binomial distribution should be penalized slightly for the possibility of overfitting, because it has more parameters than the Poisson.
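The comparison above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not a library API: the class PoissonVsBinomial and its methods are hypothetical, the number of binomial trials n is assumed to be known and passed in by the caller, the sample mean is assumed to lie strictly between 0 and n, and the parameter counts given to aic in main (1 for Poisson, 2 for binomial, as if n had been estimated too) are a choice made for the demo.

```java
// Hypothetical sketch: compare Poisson and binomial fits by log-likelihood and AIC.
class PoissonVsBinomial {

    static double mean(int[] data) {
        double sum = 0.0;
        for (int k : data) {
            sum += k;
        }
        return sum / data.length;
    }

    // log(k!) by direct summation; adequate for small counts.
    static double logFactorial(int k) {
        double s = 0.0;
        for (int i = 2; i <= k; i++) {
            s += Math.log(i);
        }
        return s;
    }

    // Log-likelihood under Poisson with the MLE lambda = sample mean (assumes mean > 0).
    static double poissonLogLik(int[] data) {
        double lambda = mean(data);
        double ll = 0.0;
        for (int k : data) {
            ll += k * Math.log(lambda) - lambda - logFactorial(k);
        }
        return ll;
    }

    // Log-likelihood under Binomial(n, p) with the MLE p = mean / n (assumes 0 < p < 1).
    static double binomialLogLik(int[] data, int n) {
        double p = mean(data) / n;
        double ll = 0.0;
        for (int k : data) {
            if (k > n) {
                return Double.NEGATIVE_INFINITY; // impossible under this model
            }
            double logChoose = logFactorial(n) - logFactorial(k) - logFactorial(n - k);
            ll += logChoose + k * Math.log(p) + (n - k) * Math.log(1.0 - p);
        }
        return ll;
    }

    // AIC = 2 * (number of fitted parameters) - 2 * log-likelihood; lower is better.
    static double aic(double logLik, int numParams) {
        return 2.0 * numParams - 2.0 * logLik;
    }

    public static void main(String[] args) {
        int[] data = {5, 4, 6, 5, 5, 4, 6, 5, 5, 5, 6, 4, 5, 5, 3, 7, 5, 5, 4, 6};
        int n = 10; // assumed number of trials per observation
        System.out.println("Poisson AIC:  " + aic(poissonLogLik(data), 1));
        System.out.println("Binomial AIC: " + aic(binomialLogLik(data, n), 2));
    }
}
```

Working with log-likelihoods instead of raw likelihoods avoids underflow when many small probabilities are multiplied together.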

How to calculate max value of function in range?

I have some function (for example, double function(double value)) and some range (for example, from A to B). I need to calculate the max value of the function in this range. Are there existing libraries for this? Please give me some advice.
If the function needs to handle floating-point values, you're going to have to use something like golden section search. Note that this specific method has significant limitations regarding the functions it can handle (specifically, the function must be unimodal on the range). There are some adjustments you can make to the algorithm which extend it to more functions; these modifications allow it to work for continuous functions.
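For reference, a minimal sketch of golden section search for a maximum, assuming the function is unimodal on [a, b]. The class name GoldenSection, the method name argMax, and the tolerance parameter are choices made for this example.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch: golden section search for the maximizer of a unimodal function.
class GoldenSection {

    // Returns an x in [a, b] within about tol of the maximizer, assuming f is unimodal there.
    static double argMax(DoubleUnaryOperator f, double a, double b, double tol) {
        final double invPhi = (Math.sqrt(5.0) - 1.0) / 2.0; // 1 / golden ratio
        double c = b - invPhi * (b - a);
        double d = a + invPhi * (b - a);
        double fc = f.applyAsDouble(c);
        double fd = f.applyAsDouble(d);
        while (b - a > tol) {
            if (fc > fd) {
                // The maximum lies in [a, d]; the old c becomes the new d.
                b = d;
                d = c;
                fd = fc;
                c = b - invPhi * (b - a);
                fc = f.applyAsDouble(c);
            } else {
                // The maximum lies in [c, b]; the old d becomes the new c.
                a = c;
                c = d;
                fc = fd;
                d = a + invPhi * (b - a);
                fd = f.applyAsDouble(d);
            }
        }
        return (a + b) / 2.0;
    }

    public static void main(String[] args) {
        DoubleUnaryOperator f = x -> -(x - 2.0) * (x - 2.0) + 3.0;
        double x = argMax(f, 0.0, 5.0, 1e-6);
        System.out.println("argmax ~ " + x + ", max ~ " + f.applyAsDouble(x));
    }
}
```

Each iteration reuses one of the two previous function evaluations, so the interval shrinks by a constant factor at the cost of a single new evaluation.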
Is this a continuous function, or a set of discrete values? If discrete values, then you can either iterate over all values, and set max/min flags as 808sound suggests, or you can load all values into an array.
If it's a continuous function, then you can either populate an array with the function's value at discrete inputs and find the max as above, or, if it's differentiable, you can use basic calculus to find the points at which df(x)/dx is 0. The latter case is a little more abstract, and probably more complicated than you want, though?
A quick google search led me to this:
http://code.google.com/p/javacalculus/
But I've never used it myself, so I don't know if that implements the required functionality. It does differential equations, though, so I assume they'd have "baby stuff" like basic differentiation.
I do not know if there are any libraries in Java for your problem.
But I know you can easily do that with MATLAB (or Octave for the open-source equivalent).
If you do not have any indication of what the function's inner workings are (i.e. the function is a black box that accepts an input and produces an output), there is no "easy" way to find the global maximum.
There are an infinite number of points to choose for your input (technically) so "iterating over all possible inputs" is not feasible mathematically.
There are various algorithms that will give you estimated maximum values in a function like this:
The hill climbing algorithm, and the firefly algorithm are two, but there are many more. This is a fairly well documented/studied computer science problem and there is a lot of material online for you to look at. I suggest starting with the hill climbing algorithm, and maybe expanding out to other global optimization algorithms.
Note: These algorithms do not guarantee that the result is the maximum, but provide an estimate of its value.
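As a starting point, here is a minimal sketch of one-dimensional hill climbing with random restarts. The class HillClimb, its method names, the initial step of one tenth of the range, and the minimum step of 1e-6 are all assumptions for this example. As noted above, the result is only an estimate: a restart can stall on a local maximum, and more restarts only make that less likely.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch: hill climbing with random restarts on the interval [a, b].
class HillClimb {

    // Climbs from 'start', halving the step whenever neither neighbour improves.
    static double climb(DoubleUnaryOperator f, double a, double b,
                        double start, double step, double minStep) {
        double x = start;
        double fx = f.applyAsDouble(x);
        while (step > minStep) {
            double left = Math.max(a, x - step);
            double right = Math.min(b, x + step);
            double fl = f.applyAsDouble(left);
            double fr = f.applyAsDouble(right);
            if (fl > fx && fl >= fr) {
                x = left;
                fx = fl;
            } else if (fr > fx) {
                x = right;
                fx = fr;
            } else {
                step /= 2.0; // no improvement at this scale, so refine
            }
        }
        return x;
    }

    // Runs several climbs from random starting points and keeps the best result.
    static double argMax(DoubleUnaryOperator f, double a, double b,
                         int restarts, Random rng) {
        double best = a;
        double fBest = f.applyAsDouble(a);
        for (int i = 0; i < restarts; i++) {
            double start = a + rng.nextDouble() * (b - a);
            double x = climb(f, a, b, start, (b - a) / 10.0, 1e-6);
            double fx = f.applyAsDouble(x);
            if (fx > fBest) {
                best = x;
                fBest = fx;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // x * sin(x) has more than one local maximum on [0, 10].
        DoubleUnaryOperator f = x -> x * Math.sin(x);
        double x = argMax(f, 0.0, 10.0, 20, new Random());
        System.out.println("estimated argmax = " + x + ", value = " + f.applyAsDouble(x));
    }
}
```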

Scattered data set in statistical data analysis

I have some statistical data. Some of the values are scattered far from the majority of the data set, as shown below. I want to minimize the effect of these highly scattered values; specifically, I want to compute a mean of the data set in which their effect is minimized.
My data set looks like this:
10.02, 11, 9.12, 7.89, 10.5, 11.3, 10.9, 12, 8.99, 89.23, 328.42.
As shown in figure below:
I need the mean value which is not 46.3 but closer to other data distribution.
Actually, I want to minimize the effect of 89.23 & 328.42 in mean calculation.
Thanks in advance
You might notice that you really don't want the mean. The problem here is that the distribution you've assumed for the data is different from the actual data. If you are trying to fit a normal distribution to this data, you'll get bad results. You could try to fit a heavy-tailed distribution like the Cauchy to this data. If you want to use a normal distribution, then you need to filter out the non-normal samples. If you feel like you know what the standard deviation should be, you could remove everything from the sample that is more than, say, 3 standard deviations away from the mean (the number 3 would have to depend on the sample size). This process can be repeated to remove non-normal samples until you are happy with the size of the outliers in terms of the standard deviation.
Unfortunately the mean of a set of data is just that - the mean value. Are you sure that the point is actually an outlier? Your data contains what appears to be a single outlier with regard to the clustering, but if you take a look at your plot, you will see that this data does seem to have a linear relationship, so is it truly an outlier?
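The repeated filtering described above can be sketched like this. The class name SigmaClip, the method clippedMean, and the use of the sample standard deviation estimated from the data itself (rather than a known value) are assumptions for the example. The demo uses a cutoff of 2 standard deviations because, with only 11 points, the outliers inflate the standard deviation so much that a stricter 3-sigma cutoff can fail to remove anything.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: iterative "sigma clipping" to get a mean that ignores far outliers.
class SigmaClip {

    static double mean(List<Double> xs) {
        double sum = 0.0;
        for (double v : xs) {
            sum += v;
        }
        return sum / xs.size();
    }

    // Sample standard deviation (divides by n - 1); callers ensure at least 2 points.
    static double sampleStd(List<Double> xs, double mean) {
        double ss = 0.0;
        for (double v : xs) {
            ss += (v - mean) * (v - mean);
        }
        return Math.sqrt(ss / (xs.size() - 1));
    }

    // Repeatedly drops points more than k standard deviations from the current mean.
    static double clippedMean(double[] data, double k) {
        List<Double> kept = new ArrayList<>();
        for (double v : data) {
            kept.add(v);
        }
        while (kept.size() >= 3) {
            double m = mean(kept);
            double sd = sampleStd(kept, m);
            List<Double> next = new ArrayList<>();
            for (double v : kept) {
                if (Math.abs(v - m) <= k * sd) {
                    next.add(v);
                }
            }
            if (next.isEmpty() || next.size() == kept.size()) {
                break; // nothing (sensible) left to remove
            }
            kept = next;
        }
        return mean(kept);
    }

    public static void main(String[] args) {
        double[] data = {10.02, 11, 9.12, 7.89, 10.5, 11.3, 10.9, 12, 8.99, 89.23, 328.42};
        System.out.println("clipped mean (k = 2): " + clippedMean(data, 2.0));
    }
}
```

On the data set from the question this drops 328.42 in the first pass and 89.23 in the second, leaving a mean of roughly 10.19. The median is a simpler alternative if no distributional assumption is wanted.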
If this reading is really causing you problems, you could remove it entirely. Other than that the only thing that I could suggest to you is to calculate some kind of weighted mean rather than the true mean http://en.wikipedia.org/wiki/Weighted_mean . This way you can assign a lower weighting to the point when calculating your mean (although how you choose a value for the weight is another matter). This is similar to weighted regression, where particular data points have less weight associated to the regression fitting (possibly due to unreliability of certain points for example) http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#Weighted_linear_least_squares .
Hope this helps a little, or at least gives you some pointers to other avenues that you can try pursuing.

I need a class to perform hypothesis testing on a normal population

In particular, I want to generate a tolerance interval, for which I would need the values of Zx, for some value x, on the standard normal.
Does the Java standard library have anything like this, or should I roll my own?
EDIT: Specifically, I'm looking to do something akin to linear regression on a set of images. I have two images, and I want to see what the degree of correlation is between their pixels. I suppose this might fall under computer vision as well.
Simply calculate the Pearson correlation coefficient between those two images.
You will have 3 coefficients because the R, G and B channels need to be analyzed separately.
Or you can calculate 1 coefficient just for the intensity levels of the images, or you could calculate the correlation between the hue values of the images after converting to the HSV or HSL color space.
Do whatever you see fit :-)
EDIT: Correlation coefficient may be maximized only after scaling and/or rotating some image. This may be a problem or not - depends on your needs.
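A minimal sketch of the Pearson correlation coefficient for one channel, assuming each image has already been flattened into a double[] of equal length with pixels in the same order. The class name PixelCorrelation and that flattened layout are assumptions for the example; it would be called once per channel (R, G, B) or once on intensity values.

```java
// Hypothetical sketch: Pearson correlation between two equally sized pixel arrays.
class PixelCorrelation {

    // Returns a value in [-1, 1]; the result is NaN if either input is constant.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of products of deviations
        double sxx = 0.0; // sum of squared deviations of x
        double syy = 0.0; // sum of squared deviations of y
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] redA = {12, 40, 200, 90, 33, 250};
        double[] redB = {15, 38, 190, 95, 30, 255};
        System.out.println("r = " + pearson(redA, redB));
    }
}
```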
You can use the complete statistical power of R using rJava/JRI. This includes correlations between pixels and so on.
Another option is to look at ImageJ, which contains libraries for many image manipulations, mathematics and statistics. It's an application, all right, but the library is usable in development as well. It comes with an extensive developer's manual. On a side note, ImageJ can be combined with R as well.
ImageJ lets you use proper methods for finding image similarity measures, based on Fourier transforms or other methods. More info can be found in Digital Image Processing with Java and ImageJ. See also this paper.
Another one is the Commons-Math. This one also contains the basic statistical tools.
See also the answers on this question and this question.
It seems you want to compare two images to see how similar they are. In this case, the first two things to try are SSD (sum of squared differences) and normalized correlation (this is closely related to what 0x69 suggests, Pearson correlation) between the two images.
You can also try normalized correlation over small (corresponding) windows in the two images and add up the results over several (all) small windows in the image.
These two are very simple methods which you can write in a few minutes.
I'm not sure, however, what this has to do with hypothesis testing or linear regression; you might want to edit your question to clarify this part.

Generate RTT values

I'm writing a Java applet where I should be able to simulate a connection between two hosts. Hence I have to generate packet round-trip times at random.
These RTTs can go from ~0 to infinity, but are typically oscillating around some average value (i.e. an extremely large or small value is very improbable but possible). I was wondering if anyone had an idea of how I could do this?
Thanks in advance
You're going to have to select a reasonable distribution from which to draw (pseudo) random values. A gamma distribution might make some sense as it seems to satisfy your requirements. You could also use a (left) truncated normal distribution.
The Apache Commons Math library for Java has code for the gamma and normal (aka Gaussian) distributions. When using a normal distribution RNG for picking values from a truncated normal distribution, simply reject undesired draws (i.e. when you pick x <= 0, pick again).
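The rejection approach for the truncated normal can be sketched with only java.util.Random. The class name RttGenerator and the example parameters (mean 50 ms, standard deviation 30 ms) are assumptions for illustration; the loop is only efficient when the mean is not far below zero relative to the standard deviation.

```java
import java.util.Random;

// Hypothetical sketch: RTT values from a normal distribution truncated at 0 (rejection sampling).
class RttGenerator {
    private final Random rng;
    private final double mean;
    private final double stdDev;

    RttGenerator(double mean, double stdDev, Random rng) {
        if (mean <= 0 || stdDev <= 0) {
            throw new IllegalArgumentException("mean and stdDev must be positive");
        }
        this.mean = mean;
        this.stdDev = stdDev;
        this.rng = rng;
    }

    // Draws from N(mean, stdDev^2) and redraws whenever the value is not positive.
    double nextRtt() {
        double x;
        do {
            x = mean + stdDev * rng.nextGaussian();
        } while (x <= 0.0);
        return x;
    }

    public static void main(String[] args) {
        RttGenerator gen = new RttGenerator(50.0, 30.0, new Random());
        for (int i = 0; i < 5; i++) {
            System.out.printf("RTT: %.2f ms%n", gen.nextRtt());
        }
    }
}
```

Because the rejected draws are simply discarded, the mean of the generated values ends up somewhat above the nominal mean parameter.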
