Statistical analysis of distributed data values in Java


I am writing a program in Java that outputs a List<Double> of distances that roughly follow a bell curve distribution. From this data, I need to generate two values A and B that sit at a particular number of standard deviations from the mean X, one above the mean and one below it. The distribution may not be symmetrical, but I am content to assume that it is for my purposes. These values A and B would be better than my current method of taking the min and max of the dataset, which is very vulnerable to being skewed by random outliers, and so is not always representative of a specific probability from the distribution. How would I generate these values, A and B? Should I be asking this in the Stats stack exchange? Any help is greatly appreciated!

Should I be asking this in the Stats stack exchange?
Nah, we can do it here!
The Statistics
First off, we need to establish what we want to do. A and B are the values on opposite sides of the mean, each a chosen number of standard deviations away from it.
Recall that the standard deviation is simply the square root of the variance.
The variance is calculated as sum((x[i] - mean)^2) / x.length.
Thus, we also need the mean, which is sum(x[i]) / x.length.
With the standard deviation calculated, multiplying it by 1 gives the distance from the mean to B, so B is that value plus the mean. Use a negative multiplier for A (the value below the mean).
The code
So, we have established that the data type for the statistical data is a List, so I will adapt it to the use of Lists.
First we need to loop over the list of data; let's call that List x, and I'm assuming it is already populated with data.
We also need some variables. Let's define the mean: double mean, the standard deviation: double stdev, and two helper variables to keep the sums, both starting at zero: double sqr_sum = 0 and double data_sum = 0.
Now, we will compute the mean first:
for (int i = 0; i < x.size(); i++) {
    data_sum += x.get(i); // a List is read with get(i), not x[i]
}
mean = data_sum / x.size();
Finally, we have everything we need to calculate the sum of squares, and from it the variance! I will also define another variable for the variance (double data_var) here to make it easier.
for (int i = 0; i < x.size(); i++) {
    sqr_sum += Math.pow(x.get(i) - mean, 2);
}
data_var = sqr_sum / x.size(); // Note: divide by x.size() for a whole population, but by x.size() - 1 for sample data.
stdev = Math.sqrt(data_var);
... and there you have it! The standard deviation of the x data.
If you want to get B (or A), you could simply use:
double dev_A = -1; // How far from the mean we want A to be.
double dev_B = 1; // How far from the mean we want B to be.
double a = dev_A * stdev + mean;
double b = dev_B * stdev + mean;
Hope this helps!
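The steps above can be collected into one small class. This is only a sketch: the names SpreadBounds, bounds and the multiplier k are made up for illustration, and the sample flag switches between the population (n) and sample (n - 1) divisor mentioned in the note above.

```java
import java.util.Arrays;
import java.util.List;

class SpreadBounds {
    /** Arithmetic mean of the values. */
    static double mean(List<Double> x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.size();
    }

    /** Standard deviation; pass sample = true to divide by n - 1 instead of n. */
    static double stdev(List<Double> x, boolean sample) {
        double m = mean(x);
        double sqrSum = 0.0;
        for (double v : x) {
            sqrSum += (v - m) * (v - m);
        }
        int divisor = sample ? x.size() - 1 : x.size();
        return Math.sqrt(sqrSum / divisor);
    }

    /** Returns {A, B}: the values k standard deviations below and above the mean. */
    static double[] bounds(List<Double> x, double k, boolean sample) {
        double m = mean(x);
        double s = stdev(x, sample);
        return new double[] { m - k * s, m + k * s };
    }

    public static void main(String[] args) {
        // Hypothetical data: mean 5, population standard deviation 2.
        List<Double> distances = Arrays.asList(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0);
        double[] ab = bounds(distances, 1.0, false);
        System.out.println("A = " + ab[0] + ", B = " + ab[1]); // A = 3.0, B = 7.0
    }
}
```

Passing k = 2 instead of 1 would give the bounds two standard deviations out, which for a roughly normal dataset covers about 95% of the values.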

Related

What could be the error in this Java program to compute sine?

I have written this code to compute the sine of an angle. This works fine for smaller angles, say up to +-360. But with larger angles it starts giving faulty results. (When I say larger, I mean something like within the range +-720 or +-1080.)
In order to get more accurate results I increased the number of times my loop runs. That gave me better results, but that too had its limitations.
So I was wondering if there is any fault in my logic, or do I need to fiddle with the conditional part of my loop? How can I overcome this shortcoming of my code? The built-in Java sine function gives correct results for all the angles I have tested, so where am I going wrong?
Also, can anyone give me an idea of how to modify the condition of my loop so that it runs until I get a desired decimal precision?
import java.util.Scanner;
class SineFunctionManual
{
public static void main(String a[])
{
System.out.print("Enter the angle for which you want to compute sine : ");
Scanner input = new Scanner(System.in);
int degreeAngle = input.nextInt(); //Angle in degree.
input.close();
double radianAngle = Math.toRadians(degreeAngle); //Sine computation is done in terms of radian angle
System.out.println(radianAngle);
double sineOfAngle = radianAngle,prevVal = radianAngle; //SineofAngle contains actual result, prevVal contains the next term to be added
//double fractionalPart = 0.1; // This variable is used to check the answer to a certain number of decimal places, as seen in the for loop
for(int i=3;i<=20;i+=2)
{
prevVal = (-prevVal)*((radianAngle*radianAngle)/(i*(i-1))); //x^3/3! can be written as ((x^2)/(3*2))*((x^1)/1!), similarly x^5/5! can be written as ((x^2)/(5*4))*((x^3)/3!) and so on. The negative sign is added because each successive term has alternate sign.
sineOfAngle+=prevVal;
//int iPart = (int)sineOfAngle;
//fractionalPart = sineOfAngle - iPart; //Extracting the fractional part to check the number of decimal places.
}
System.out.println("The value of sin of "+degreeAngle+" is : "+sineOfAngle);
}
}

The polynomial approximation for sine diverges widely for large positive and large negative values. Remember, sine varies from -1 to 1 over all real numbers. A truncated polynomial, on the other hand, particularly one of high order, grows without bound as the angle gets larger, so it can't stay in that range.
I would recommend using the periodicity of sine to your advantage.
int degreeAngle = input.nextInt() % 360;
This will give accurate answers, even for very, very large angles, without requiring an absurd number of terms.

The further you get from x=0, the more terms of the Taylor expansion for sin x you need to get within a particular accuracy of the correct answer. Your loop stops at the x^19 term (i runs from 3 to 19), which is fine for small angles. If you want better accuracy for large angles, you'll just need to add more terms.
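Both suggestions can be combined with a loop that stops on precision rather than on a fixed count. The following is a rough sketch, not a drop-in replacement: the class name SineSeries, the method sinDegrees and the tolerance parameter are invented for this example, and the tolerance is assumed to be a small positive number.

```java
class SineSeries {
    /**
     * Approximates the sine of an angle given in degrees by summing Taylor
     * terms until the latest term is no larger than the tolerance.
     * Assumes tolerance > 0.
     */
    static double sinDegrees(double degrees, double tolerance) {
        // Sine is periodic, so reduce the angle to (-360, 360) first.
        double x = Math.toRadians(degrees % 360.0);
        double term = x; // current term, starting with x^1 / 1!
        double sum = x;
        for (int i = 3; Math.abs(term) > tolerance; i += 2) {
            // Each term is the previous one times -x^2 / (i * (i - 1)).
            term = -term * (x * x) / (i * (i - 1));
            sum += term;
        }
        return sum;
    }

    public static void main(String[] args) {
        double mine = sinDegrees(1080 + 30, 1e-12);
        double builtIn = Math.sin(Math.toRadians(1080 + 30));
        System.out.println(mine + " vs " + builtIn); // both close to 0.5
    }
}
```

Because the reduced angle is always below 2π in magnitude, the terms shrink quickly and the loop ends after a few dozen iterations at most.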

Selecting a value proportionally based on its double key

I have a list of values, keyed with doubles between 0 and 1 that represent how likely I think it is for a thing to be useful to me. For example, for getting an answer to a question:
0.5 call your mom
0.25 go to the library
0.6 StackOverflow
0.9 just Google it
So, we think that Googling it is (about) twice as likely to be helpful as asking your mom. When I attempt to figure out the next thing to do, I'd like "just Google it" to be returned about twice as often as "call your mom".
I've been searching for solutions with little success. Most of the things that I've found rely on having integer keys (like How to randomly select a key based on its Integer value in a Map with respect to the other values in O(n) time?), which I don't have and which I can't easily generate.
I feel like there should be some Java datatype that can do this for me. Any suggestions?

You can build a solution on the Java interface NavigableMap; if you use the TreeMap implementation, each lookup is O(log n).
You can use one of the following:
lowerEntry
ceilingEntry
floorEntry
higherEntry
Now you just need to extract random numbers with the right probability. For that I would refer to this post:
How to generate a random number from specified discrete distribution?
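One possible shape for the NavigableMap idea is to key each value by its running (cumulative) weight and look up a uniform random number with higherEntry. This is a sketch under assumptions: WeightedPicker, add and next are hypothetical names, and at least one entry with a positive weight must be added before next is called.

```java
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

class WeightedPicker<T> {
    // Maps the cumulative weight reached after each entry to that entry's value.
    private final NavigableMap<Double, T> cumulative = new TreeMap<>();
    private final Random random;
    private double total = 0.0;

    WeightedPicker(Random random) {
        this.random = random;
    }

    /** Registers a value with a double weight; non-positive weights are ignored. */
    void add(double weight, T value) {
        if (weight <= 0) {
            return;
        }
        total += weight;
        cumulative.put(total, value);
    }

    /** Picks a value with probability weight / total, in O(log n). */
    T next() {
        double r = random.nextDouble() * total; // uniform in [0, total)
        return cumulative.higherEntry(r).getValue();
    }

    public static void main(String[] args) {
        WeightedPicker<String> picker = new WeightedPicker<>(new Random());
        picker.add(0.5, "call your mom");
        picker.add(0.25, "go to the library");
        picker.add(0.6, "StackOverflow");
        picker.add(0.9, "just Google it");
        System.out.println(picker.next());
    }
}
```

With the weights from the question, "just Google it" owns 0.9 of the total 2.25, so it should come back about 40% of the time, and the double keys never need converting to integers.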

If I understood correctly, what you're looking for is a weighted random.
You should sum all your weights, and maybe scale them to integer values, so you will be able to use rand.nextInt as suggested in the comments.
Scaling can be done by multiplying by 100, for example, so your weights are now:
50, 25, 60, 90. The sum is 225.
You should define ranges:
0 - 49 is for "call your mom"
50 - 74 is for "go to the library"
75 - 134 is for "StackOverflow"
135 - 224 is for "just Google it"
Now call this.rand.nextInt(sum) to get a value in 0 - 224,
and map this value to one of the defined ranges.

If you keep track of the total of the probabilities, you can do something like this:
double interval = 100;
double counter = 0;
double totalProbabilities = 2.25;
int randInt = new Random().nextInt((int)interval);
for (Element e: list) {
counter += (interval * e.probability() / totalProbabilities);
if (randInt < counter) {
return e.activity();
}
}

Wondering why would one calculate the median this way?

I was wondering what may be the reason to use this median function, instead of just calculating the min + (max - min) / 2:
// used by the random number generator
private static final double M_E12 = 162754.79141900392083592475;
/**
* Return an estimate of median of n values distributed in [min,max)
* @param min the minimum value
* @param max the maximum value
* @param n
* @return an estimate of median of n values distributed in [min,max)
**/
private static double median(double min, double max, int n)
{
// get random value in [0.0, 1.0)
double t = (new Random()).nextDouble();
double retval;
if (t > 0.5) {
retval = java.lang.Math.log(1.0-(2.0*(M_E12-1)*(t-0.5)/M_E12))/12.0;
} else {
retval = -java.lang.Math.log(1.0-(2.0*(M_E12-1)*t/M_E12))/12.0;
}
// We now have something distributed on (-1.0,1.0)
retval = (retval+1.0) * (max-min)/2.0;
retval = retval + min;
return retval;
}
The only downside of my approach would maybe be its deterministic nature, I'd say?
The whole code can be found here, http://www.koders.com/java/fid42BB059926626852A0D146D54F7D66D7D2D5A28D.aspx?s=cdef%3atree#L8, btw.
Thanks

[trying to cover a range here because it's not clear to me what you're not understanding]
first, the median is the middle value. the median of [0,0,1,99,99] is 1.
and so we can see that the code given is not calculating the median (it's not finding a middle value). instead, it's estimating it from some theoretical distribution. as the comment says.
the formula you give is for the mid-point. if many values are uniformly distributed between min and max then yes, that is a good estimation of the median. in this case (presumably) the values are not distributed in that way and so some other method is necessary.
you can see why this could be necessary by calculating the mid point of the numbers above - your formula would give 49.5.
the reason for using an estimate is probably that it is much faster than finding the median. the reason for making that estimate random is likely to avoid a bad worst case on multiple calls.
and finally, sorry but i don't know what the distribution is in this case. you probably need to search for the data structure and/or author name to see if you can find a paper or book reference (i thought it might be assuming a power law, but see edit below - it seems to be adding a very small correction) (i'm not sure if that is what you are asking, or if you are more generally confused).
[edit] looking some more, i think the log(...) is giving a central bias to the uniformly random t. so it's basically doing what you suggest, but with some spread around the 0.5. here's a plot of one case which shows that retval is actually a pretty small adjustment.
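To make the difference between the median and the mid-point concrete, here is a small sketch using the five numbers from the answer above. The class and method names are invented for the example; it sorts a copy of the data, which is exactly the work an estimate like the one in the question avoids.

```java
import java.util.Arrays;

class MedianVsMidpoint {
    /** True median: the middle value of the sorted data (mean of the two middle values if n is even). */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Mid-point of the range: min + (max - min) / 2. */
    static double midpoint(double[] values) {
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return min + (max - min) / 2.0;
    }

    public static void main(String[] args) {
        double[] data = { 0, 0, 1, 99, 99 };
        System.out.println("median   = " + median(data));   // median   = 1.0
        System.out.println("midpoint = " + midpoint(data)); // midpoint = 49.5
    }
}
```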

I can't tell you what this code is attempting to achieve; for a start it doesn't even use n!
But from the looks of it, it's simply generating some sort of exponentially-distributed random value in the range [min,max]. See http://en.wikipedia.org/wiki/Exponential_distribution#Generating_exponential_variates.
Interestingly, Googling for that magic number brings up lots of relevant hits, none of which are illuminating: http://www.google.co.uk/search?q=162754.79141900392083592475.

Simulating Poisson Waiting Times

I need to simulate Poisson wait times. I've found many examples of simulating the number of arrivals, but I need to simulate the wait time for one arrival, given an average wait time.
I keep finding code like this:
public int getPoisson(double lambda)
{
double L = Math.exp(-lambda);
double p = 1.0;
int k = 0;
do
{
k++;
p *= rand.nextDouble(); // one uniform draw per iteration; rand is a java.util.Random
} while (p > L);
return k - 1;
}
but that is for number of arrivals, not arrival times.
Efficiency is preferred to accuracy, more because of power consumption than time. The language I am working in is Java, and it would be best if the algorithm only used methods available in the Random class, but this is not required.

Time between arrivals follows an exponential distribution, and you can generate a random variable X~exp(lambda) with the formula:
-ln(U)/lambda (where U~Uniform[0,1]).
Here lambda is the arrival rate, i.e. 1 divided by your average wait time.
More info on generating exponential variable.
Note that time between arrival also matches time until first arrival, because exponential distribution is memoryless.
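As a minimal sketch of that formula using only java.util.Random: the class name PoissonWait and the method nextWait are hypothetical, and the method takes the average wait time directly (so lambda = 1 / meanWait). It uses 1 - U in place of U so that the logarithm never receives zero.

```java
import java.util.Random;

class PoissonWait {
    /**
     * Samples the wait until the next arrival of a Poisson process,
     * given the average wait time (mean = 1 / lambda).
     */
    static double nextWait(Random rand, double meanWait) {
        // nextDouble() is in [0, 1), so 1 - nextDouble() is in (0, 1] and log is finite.
        return -meanWait * Math.log(1.0 - rand.nextDouble());
    }

    public static void main(String[] args) {
        Random rand = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(nextWait(rand, 5.0)); // waits averaging about 5 time units
        }
    }
}
```

Each call costs one random draw and one logarithm, which should be cheaper than the counting loop in the question.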

If you want to simulate earthquakes, or lightning or critters appearing on a screen, the usual method is to assume a Poisson Distribution with an average arrival rate λ.
The easier thing to do is to simulate inter-arrivals:
In a Poisson process, the probability that the next arrival has already happened grows as time passes: it is the cumulative distribution function of the waiting time, 1 - e^(-λt). (The number of arrivals per unit time is Poisson-distributed, with expected value and variance both equal to λ.)
The simplest way is to 'sample' this distribution by inversion: the probability of still waiting at time t has the exponential form e^(-λt), and setting it equal to U gives t = -ln(U)/λ. You choose a uniform random number U and plug it into the formula to get the time that should pass before the next event.
Unfortunately, because U usually belongs to [0,1[, it can be exactly 0, where the log is undefined, so it's easier to avoid that by using t = -ln(1-U)/λ.
Sample code can be found at the link below.
https://stackoverflow.com/a/5615564/1650437

Need help in translating code from C to Java

From this article. Here's the code:
float InvSqrt(float x){ // line 0
float xhalf = 0.5f * x;
int i = *(int*)&x; // store floating-point bits in integer
i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
x = *(float*)&i; // convert new bits into float
x = x*(1.5f - xhalf*x*x); // One round of Newton's method
return x;
}
...I can't even tell if that's C or C++. [okay apparently it's C, thanks] Could someone translate it to Java for me, please? It's (only, I hope) lines 2 and 4 that are confusing me.

You want to use these methods:
Float.floatToIntBits
Float.intBitsToFloat
And there may be issues with strictfp, etc.
It's roughly something like this: (CAUTION: this is untested!)
float InvSqrt(float x){ // line 0
float xhalf = 0.5f * x;
int i = Float.floatToIntBits(x); // store floating-point bits in integer
i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
x = Float.intBitsToFloat(i); // convert new bits into float
x = x*(1.5f - xhalf*x*x); // One round of Newton's method
return x;
}

Those lines are used to convert between float and int as bit patterns. Java has static methods in java.lang.Float for that - everything else is identical.
static float InvSqrt(float x) { // line 0
float xhalf = 0.5f * x;
int i = Float.floatToIntBits(x); // store floating-point bits in integer
i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
x = Float.intBitsToFloat(i); // convert new bits into float
x = x * (1.5f - xhalf * x * x); // One round of Newton's method
return x;
}
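Since the translation is flagged as untested, a quick sanity check is to compare it with 1 / Math.sqrt(x). The sketch below repeats the method under the hypothetical name invSqrt inside a made-up class InvSqrtCheck and prints the relative error for a few arbitrary positive inputs; it measures accuracy only, not speed.

```java
class InvSqrtCheck {
    /** Same steps as the translated method: bit-level initial guess plus one Newton step. */
    static float invSqrt(float x) {
        float xhalf = 0.5f * x;
        int i = Float.floatToIntBits(x);   // reinterpret the float's bits as an int
        i = 0x5f3759d5 - (i >> 1);         // initial guess for Newton's method
        x = Float.intBitsToFloat(i);       // reinterpret the bits as a float again
        return x * (1.5f - xhalf * x * x); // one round of Newton's method
    }

    public static void main(String[] args) {
        float[] samples = { 0.25f, 1.0f, 2.0f, 10.0f, 12345.678f };
        for (float x : samples) {
            double exact = 1.0 / Math.sqrt(x);
            double relErr = Math.abs(invSqrt(x) - exact) / exact;
            System.out.println(x + ": relative error " + relErr);
        }
    }
}
```

For positive, normal inputs the error should come out at a fraction of a percent; zero, negative values, NaN and infinities are not handled.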

The code you quote is C, although the comments are C++-style.
What the code is doing involves knowledge of the way floating-point values are stored, at the bit level. The "magic number" 0x5f3759d5 has something to do with a particular value.
The floating point x value's bits are accessed when i is initialized, because the address of x is dereferenced. So, i is loaded with the first 32 bits of the floating point value. On the next line, x is written with the contents of i, updating the working approximation value.
I have read that this code became popular when John Carmack released it with Id's open source Quake engine. The purpose of the code is to quickly calculate 1/Sqrt(x), which is used in lighting calculations of graphic engines.
I would not have been able to translate this code directly to Java because it uses "type punning" as described above -- when it accesses the float in memory as if it were an int. Java prevents that sort of activity, but as others have pointed out, the Float class provides static methods to get around it.
The purpose of using this strange implementation in C was for it to be very fast. At the time it was written, I imagine a large improvement came from this method. I wonder if the difference is worth it today, when floating point operations have gotten faster.
Using the Java methods to convert float to integer bits and back may be slower than simply calculating the inverse square root directly using Java math function for square root.

Ok I'm going out on a limb here, because I know C but I don't know Java.
Literally rewriting this C code in Java is begging for trouble.
Even in C, the code is unportable.
Among other things it relies on:
The size of floating point numbers.
The size of integers.
The internal representation of floating point numbers.
The byte alignment of both floating point numbers and integers.
Right shift (i.e. i>>1) being implemented using logical right shift as opposed to arithmetic right shift (which would shift in a 1 on integers with a high order 1 bit and thus no longer equate to divide by 2).
I understand Java compiles to bytecode rather than directly to machine code. Implementers of bytecode interpreters tune using assumptions based on the spec for the bytecode and an understanding of what is output by the compiler from sensible input source code.
Hacks like this don't fall under the umbrella "sensible input source".
There is no reason to expect the interpreter will perform faster with your C hack; in fact there is a good chance it will be slower.
My advice is: IGNORE the C code.
Look for a gain in efficiency that is Java centric.
The concept of the C hack is:
Approximate 1/sqrt(x) by leveraging knowledge that the internal representation of floating point numbers already has the exponent broken out of the number; exponent(x)/2 is faster to compute than root(x) if you already have exponent(x).
The hack then performs one iteration of Newton's method to reduce the error in the approximation. I presume one iteration reduced the error to something tolerable.
Perhaps the concept warrants investigation in Java, but the details will depend on intimate knowledge of how Java is implemented, not how C is implemented.

The lines you care about are pretty simple. Line 2 takes the bytes in float x, which are in some floating-point representation like IEEE 754, and stores them in an integer, exactly the way they are. This will result in a totally different number, since integers and floats are represented differently in byte form. Line 4 does the opposite, and transfers the bytes in that int to the float again.
