Generate RTT values - java

I'm writing a Java applet in which I should be able to simulate a connection between two hosts, so I have to generate packet round-trip times at random.
These RTTs can range from ~0 to infinity, but typically oscillate around some average value (i.e. an extremely large or small value is very improbable but possible). I was wondering if anyone had an idea of how I could do this?
Thanks in advance

You're going to have to select a reasonable distribution from which to draw (pseudo-)random values. A gamma distribution might make sense, as it seems to satisfy your requirements. You could also use a (left-)truncated normal distribution.
The Apache Commons-Math library for Java has code for the gamma and normal (aka Gaussian) distributions. When using a normal distribution RNG for picking values from a truncated normal distribution, simply reject undesired draws (i.e. when you pick x <= 0, pick again).
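For illustration, here is a minimal sketch of both approaches using Apache Commons Math 3 (the org.apache.commons.math3 packages); the class name and the parameters (a mean RTT around 100 ms, the particular shape/scale and standard deviation values) are placeholders you would tune for your simulation.

```java
import org.apache.commons.math3.distribution.GammaDistribution;
import org.apache.commons.math3.distribution.NormalDistribution;

public class RttGenerator {
    // Gamma with shape 4 and scale 25: mean = shape * scale = 100 ms, long right tail.
    private final GammaDistribution gamma = new GammaDistribution(4.0, 25.0);
    // Normal with mean 100 ms and standard deviation 30 ms, truncated below at 0.
    private final NormalDistribution normal = new NormalDistribution(100.0, 30.0);

    /** RTT drawn from the gamma distribution (always positive). */
    public double gammaRtt() {
        return gamma.sample();
    }

    /** RTT drawn from a left-truncated normal: reject non-positive draws and retry. */
    public double truncatedNormalRtt() {
        double rtt;
        do {
            rtt = normal.sample();
        } while (rtt <= 0);
        return rtt;
    }
}
```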

Related

Could any function in C or Java (Android) mess up srand()?

So I have some C code which calculates some results based on the numbers generated after seeding with srand(). If I use the same seed, the result is always the same.
Now I have an Android app that loads this C code via JNI. However, the results are different even though the same seed is used. I have double-checked the seed to make sure it is the same, but since both the Android program and the native code are fairly complicated, I am having a hard time figuring out what is causing this problem.
What I am sure of is that we did not use any functions in the Java program to generate random numbers, so presumably srand() is not being called with a different seed each time. Can other functions in Java or C change the random numbers generated after srand()?
Thanks!
Update:
I guess my question was a little confusing. To clarify, the results I am comparing are from the same platform, but different runs. The C code uses rand() to get a number and calculates a result based on it. So if the seed passed to srand() is always the same, the numbers returned by rand() should be the same, and hence the results should be the same. But somehow, even though I use the same seed for srand(), rand() gives me different numbers... Any thoughts on that?
There are many different types of random number generators, and they are not all guaranteed to be the same from platform to platform. If a cross-platform, 100% predictable solution is necessary for your project, you'll probably have to write your own.
It's really not as bad as it may sound...
I'd recommend looking up random number generation algorithms such as the Mersenne Twister (which is what I use in my projects) and writing a small block of code that you can share amongst all your projects. This also gives you the benefit of being able to have multiple generators with varying seeds, which comes in really useful for something like a puzzle game, where you might want a predictably random sequence based on a specific seed to generate your puzzle, but another clock-seeded generator for randomizing special FX or other game elements.
The pseudo-random algorithm implemented by rand() is determined by the C library, and there is no standard algorithm for it. You are absolutely not guaranteed to get the same sequence of numbers from one implementation to the next, and it sounds like the Android implementation differs from your development environment. If you need a predictable sequence across platforms, you should implement your own random number generator.
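Both answers come down to implementing one fixed algorithm yourself. As a rough illustration (not the asker's code), here is a tiny deterministic generator, a 64-bit linear congruential generator using Knuth's MMIX constants, that could be ported line for line to the C side so both sides produce identical sequences for the same seed; any fixed algorithm such as the Mersenne Twister would serve the same purpose.

```java
/**
 * Minimal portable PRNG: a 64-bit linear congruential generator
 * (multiplier and increment are Knuth's MMIX constants). Implementing
 * the same recurrence in C with uint64_t gives an identical sequence,
 * unlike the platform-dependent rand().
 */
public class PortableRandom {
    private long state;

    public PortableRandom(long seed) {
        this.state = seed;
    }

    /** Next 32-bit value; Java's long overflow matches unsigned 64-bit wraparound in C. */
    public int nextInt() {
        state = state * 6364136223846793005L + 1442695040888963407L;
        return (int) (state >>> 32);   // the high bits are the better-quality ones
    }

    /** Roughly uniform value in [0, bound); the modulo adds a tiny bias for large bounds. */
    public int nextInt(int bound) {
        return (int) ((nextInt() & 0xFFFFFFFFL) % bound);
    }
}
```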

Scattered data set in statistical data analysis

I have a set of statistical data. Some of the values lie far away from the majority of the data set, as shown below. What I want to do is minimize the effect of these highly scattered values; in other words, I want to compute a mean of the data set in which the effect of the scattered values is minimized.
My data set is as like this:
10.02, 11, 9.12, 7.89, 10.5, 11.3, 10.9, 12, 8.99, 89.23, 328.42.
As shown in the figure below: [plot of the data set not reproduced here]
I need a mean value which is not 46.3 but closer to the rest of the data distribution.
Actually, I want to minimize the effect of 89.23 and 328.42 on the mean calculation.
Thanks in advance
You might notice that you really don't want the mean. The problem here is that the distribution you've assumed for the data is different from the actual data: if you try to fit a normal distribution to this data, you'll get bad results. You could instead fit a heavy-tailed distribution such as the Cauchy to this data. If you want to use a normal distribution, then you need to filter out the non-normal samples. If you feel you know what the standard deviation should be, you could remove everything in the sample more than, say, 3 standard deviations away from the mean (the factor 3 would have to depend on the sample size). This process can be applied recursively to remove non-normal samples until you are happy with the size of the remaining outliers in terms of the standard deviation, as sketched below.
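A minimal sketch of that recursive trimming in plain Java (the class name, the threshold factor k, and the values in the comments are illustrative, not part of the original answer):

```java
import java.util.ArrayList;
import java.util.List;

public class RobustMean {
    /** Mean after repeatedly discarding points more than k standard deviations from the mean. */
    static double trimmedMean(double[] data, double k) {
        List<Double> kept = new ArrayList<>();
        for (double d : data) kept.add(d);

        boolean removed = true;
        while (removed) {
            double mean = kept.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double var = kept.stream().mapToDouble(d -> (d - mean) * (d - mean)).average().orElse(0);
            double sd = Math.sqrt(var);
            removed = kept.removeIf(d -> Math.abs(d - mean) > k * sd);
        }
        return kept.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] data = {10.02, 11, 9.12, 7.89, 10.5, 11.3, 10.9, 12, 8.99, 89.23, 328.42};
        // With k = 2.5 both 328.42 and 89.23 are dropped and the result is about 10.19;
        // with k = 3 the point 89.23 barely survives, which is why the factor matters.
        System.out.println(trimmedMean(data, 2.5));
    }
}
```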
Unfortunately the mean of a set of data is just that - the mean value. Are you sure that the point is actually an outlier? Your data contains what appears to be a single outlier with regard to the clustering, but if you take a look at your plot, you will see that this data does seem to have a linear relationship, so is it truly an outlier?
If this reading is really causing you problems, you could remove it entirely. Other than that, the only thing I can suggest is to calculate some kind of weighted mean rather than the true mean (http://en.wikipedia.org/wiki/Weighted_mean). This way you can assign a lower weighting to the point when calculating your mean (although how you choose a value for the weight is another matter). This is similar to weighted regression, where particular data points have less weight in the regression fit (possibly because certain points are unreliable, for example): http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#Weighted_linear_least_squares .
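For concreteness, a weighted mean is just sum(w_i * x_i) / sum(w_i); here is a tiny sketch with made-up weights that down-weight the two suspect points (how to pick the weights is still up to you):

```java
public class WeightedMeanExample {
    static double weightedMean(double[] x, double[] w) {
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += w[i] * x[i];
            den += w[i];
        }
        return num / den;
    }

    public static void main(String[] args) {
        double[] x = {10.02, 11, 9.12, 7.89, 10.5, 11.3, 10.9, 12, 8.99, 89.23, 328.42};
        // Weight 1.0 for the clustered points, 0.05 for the two suspect ones (arbitrary choice).
        double[] w = {1, 1, 1, 1, 1, 1, 1, 1, 1, 0.05, 0.05};
        System.out.println(weightedMean(x, w));   // about 12.4 instead of 46.3
    }
}
```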
Hope this helps a little, or at least gives you some pointers to other avenues that you can try pursuing.

R-Tree implementation in Java [duplicate]

I was searching the last few days for a stable implementation of an R-Tree with support for an unlimited number of dimensions (20 or so would be enough). I only found this http://sourceforge.net/projects/jsi/ but it only supports 2 dimensions.
Another option would be a multidimensional implementation of an interval tree.
Maybe I'm completely wrong with the idea of using an R-Tree or interval tree for my problem, so I'll state the problem briefly so that you can send me your thoughts about it.
The problem I need to solve is some kind of nearest-neighbour search. I have a set of antennas and rooms, and for each antenna an interval of integers, e.g. antenna 1: min -92, max -85. In fact it could be represented as room -> set of antennas -> interval for antenna.
The idea was that each room spans a box in the R-Tree, with one dimension per antenna and the extent in each dimension given by that antenna's interval.
If I get a query with N antennas and a value for each antenna, I could then represent that information as a query point in the same space and retrieve the rooms "nearest" to the point.
I hope that gives you an idea of the problem and of my approach.
Be aware that R-Trees can degrade badly when you have discrete data. The first thing you really need to find out is an appropriate data representation, then test if your queries work on a subset of the data.
R-Trees will only make your queries faster. If the queries don't work in the first place, an R-Tree will not help. You should test your approach without using R-Trees first. Unless you hit a large amount of data (say, 100,000 objects), a linear scan in memory can easily outperform an R-Tree, in particular when you need some adapter layer because it is not well integrated with your code.
The obvious approach here is to just use bounding rectangles and linearly scan over them. If that works, you can then store the MBRs in an R-Tree to get some performance improvement. But if it doesn't work with a linear scan, it won't work with an R-Tree either (it will not be faster).
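As a rough illustration of that linear-scan baseline applied to the antenna/room setup described above (the Interval and RoomMatcher types and the distance measure are hypothetical choices, not something prescribed by the answer):

```java
import java.util.HashMap;
import java.util.Map;

/** Linear-scan baseline: each room is a box in "antenna space" (one dimension per antenna). */
public class RoomMatcher {
    /** Per-antenna signal interval, e.g. antenna 1: [-92, -85]. */
    static class Interval {
        final double min, max;
        Interval(double min, double max) { this.min = min; this.max = max; }
        /** Distance from a value to this interval (0 if the value lies inside it). */
        double distance(double v) {
            if (v < min) return min - v;
            if (v > max) return v - max;
            return 0;
        }
    }

    /** room name -> (antenna id -> interval); this is the room's bounding box. */
    private final Map<String, Map<Integer, Interval>> rooms = new HashMap<>();

    void addRoom(String room, Map<Integer, Interval> antennaIntervals) {
        rooms.put(room, antennaIntervals);
    }

    /** Scan all rooms and return the one whose box is closest to the measured values. */
    String nearestRoom(Map<Integer, Double> query) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Map<Integer, Interval>> room : rooms.entrySet()) {
            double sumSq = 0;
            for (Map.Entry<Integer, Double> q : query.entrySet()) {
                Interval iv = room.getValue().get(q.getKey());
                if (iv == null) continue;               // room has no data for this antenna
                double d = iv.distance(q.getValue());
                sumSq += d * d;
            }
            if (sumSq < bestDist) {
                bestDist = sumSq;
                best = room.getKey();
            }
        }
        return best;
    }
}
```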
I'm not entirely clear on what your exact problem is, but an R-Tree or interval tree would not work well in 20 dimensions. That's not a huge number of dimensions, but it is large enough for the curse of dimensionality to begin showing up.
To see what I mean, consider just trying to look at all of the neighbors of a box, including ones off of corners and edges. With 20 dimensions, you'll have 3^20 - 1, or 3,486,784,400, neighboring boxes. (You get that by realizing that along each axis a neighbor's offset can be -1 unit, 0 units, or +1 unit, but the all-zeros offset is not a neighbor because it represents the original box.)
I'm sorry, but you either need to accept brute force searching, or else analyze your problem better and come up with a cleverer solution.
I have found this R*-Tree implementation in Java which seems to offer many features:
https://github.com/davidmoten/rtree
You might want to check it out!
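For what it's worth, usage looks roughly like the snippet below (based on my reading of the project's README, so treat the exact API as an assumption); note that its geometries appear to be 2D points and rectangles, so it would not directly cover a 20-dimensional case:

```java
import com.github.davidmoten.rtree.RTree;
import com.github.davidmoten.rtree.geometry.Geometries;
import com.github.davidmoten.rtree.geometry.Point;

public class RTreeDemo {
    public static void main(String[] args) {
        // The tree is immutable: add() returns a new tree.
        RTree<String, Point> tree = RTree.star().create();
        tree = tree.add("roomA", Geometries.point(10, 20))
                   .add("roomB", Geometries.point(12, 25));

        // Search a rectangle; results come back as an RxJava Observable of entries.
        tree.search(Geometries.rectangle(8, 15, 30, 35))
            .subscribe(entry -> System.out.println(entry.value()));
    }
}
```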
Another good implementation in Java is ELKI: https://elki-project.github.io/.
You can use PostgreSQL's Generalized Search Tree (GiST) indexing facility.

How do I determine a best-fit distribution in java?

I have a bunch of sets of data (between 50 and 500 points, each of which can take a positive integral value) and need to determine which distribution best describes them. I have done this manually for several of them, but need to automate this going forward.
Some of the sets are completely modal (every datum has the value of 15), some are strongly modal or bimodal, some are bell curves (often skewed and with differing degrees of kurtosis/pointiness), some are roughly flat, and there are any number of other possible distributions (Poisson, power-law, etc.). I need a way to determine which distribution best describes the data and (ideally) also provides me with a fitness metric so that I know how confident I am in the analysis.
Existing open-source libraries would be ideal, followed by well documented algorithms that I can implement myself.
Looking for a distribution that fits is unlikely to give you good results in the absence of some a priori knowledge. You may find a distribution that coincidentally is a good fit but is unlikely to be the underlying distribution.
Do you have any metadata available that would hint at what the data means? E.g., "this is open-ended data sampled from a natural population, so it's some sort of normal distribution", vs. "this data is inherently bounded at 0 and discrete, so check for the best-fitting Poisson".
I don't know of any distribution solvers for Java off the top of my head, and I don't know of any that will guess which distribution to use. You could examine some statistical properties (skew/etc.) and make some guesses here--but you're more likely to end up with an accidentally good fit which does not adequately represent the underlying distribution. Real data is noisy and there are just too many degrees of freedom if you don't even know what distribution it is.
This may be above and beyond what you want to do, but it seems the most complete approach (and it allows access to the wealth of statistical knowledge available inside R):
use JRI to communicate with the R statistical language
use R, internally, as indicated in this thread
Look at Apache commons-math.
What you're looking for comes under the general heading of "goodness of fit." You could search on "goodness of fit test."
Donald Knuth describes a couple of popular goodness-of-fit tests in Seminumerical Algorithms: the chi-squared test and the Kolmogorov-Smirnov test. But you first have to have some idea of which distribution you want to test. For example, if you have bell-curve data, you might try the normal or Cauchy distributions.
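As a small illustration of the chi-squared variant, here is a sketch using Apache Commons Math 3 (mentioned in another answer); the bin counts, bin edges, and the hypothesized standard normal are made-up placeholders:

```java
import org.apache.commons.math3.distribution.NormalDistribution;
import org.apache.commons.math3.stat.inference.ChiSquareTest;

public class ChiSquareGof {
    public static void main(String[] args) {
        // Observed counts per bin (e.g. a histogram of your data).
        long[] observed = {12, 45, 130, 210, 220, 140, 50, 13};
        long n = 0;
        for (long c : observed) n += c;

        // Expected counts per bin under a hypothesized standard normal distribution.
        double[] edges = {-4, -3, -2, -1, 0, 1, 2, 3, 4};
        NormalDistribution normal = new NormalDistribution(0, 1);
        double[] expected = new double[observed.length];
        for (int i = 0; i < observed.length; i++) {
            expected[i] = n * (normal.cumulativeProbability(edges[i + 1])
                             - normal.cumulativeProbability(edges[i]));
        }

        // p-value of the chi-squared goodness-of-fit test; a small p-value means a poor fit.
        double p = new ChiSquareTest().chiSquareTest(expected, observed);
        System.out.println("p-value = " + p);
    }
}
```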
If all you really need the distribution for is to model the data you have sampled, you can make your own distribution based on the data you have:
1. Create a histogram of your sample: One method for selecting the bin size is here. There are other methods for selecting bin size, which you may prefer.
2. Derive the sample CDF: Think of the histogram as your PDF, and just compute the integral. It's probably best to scale the height of the bins so that the CDF has the right characteristics ... namely that the value of the CDF at +Infinity is 1.0.
To use the distribution for modeling purposes:
3. Draw X from your distribution: Make a draw Y from U(0,1). Use a reverse lookup on your CDF of the value Y to determine the X such that CDF(X) = Y. Since the CDF is invertible, X is unique.
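A minimal sketch of the sampling step above (step 3), using the sorted sample itself as the empirical CDF rather than a binned histogram, which keeps the inverse lookup trivial; the class and method names are illustrative:

```java
import java.util.Arrays;
import java.util.Random;

/** Draws values from the empirical distribution of an observed sample (inverse-CDF lookup). */
public class EmpiricalSampler {
    private final double[] sorted;
    private final Random rng = new Random();

    public EmpiricalSampler(double[] sample) {
        this.sorted = sample.clone();
        Arrays.sort(this.sorted);   // sorted data forms the empirical CDF's support
    }

    /** Draw Y ~ U(0,1) and return the X whose empirical CDF value is closest to Y. */
    public double draw() {
        double y = rng.nextDouble();
        int index = (int) Math.floor(y * sorted.length);
        return sorted[Math.min(index, sorted.length - 1)];
    }
}
```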
I've heard of a package called Eureqa that might fill the bill nicely. I've only downloaded it; I haven't tried it myself yet.
You can proceed with a three-step approach, using the SSJ library:
1. Fit each distribution separately using maximum likelihood estimation (MLE). Using SSJ, this can be done with the static method getInstanceFromMLE(double[] x, int n) available on each distribution.
2. For each distribution you have obtained, compute its goodness of fit with the real data, for example using Kolmogorov-Smirnov: static void kolmogorovSmirnov(double[] data, ContinuousDistribution dist, double[] sval, double[] pval). Note that you don't need to sort the data before calling this function.
3. Pick the distribution having the highest p-value as your best-fit distribution.
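I can't vouch for SSJ's exact package layout, so here is the same three-step idea sketched with Apache Commons Math (3.3+) instead; the candidate list, the placeholder data, and the use of closed-form parameter estimates for the normal and exponential are all assumptions made for the example:

```java
import org.apache.commons.math3.distribution.ExponentialDistribution;
import org.apache.commons.math3.distribution.NormalDistribution;
import org.apache.commons.math3.distribution.RealDistribution;
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;
import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest;

public class BestFitPicker {
    public static void main(String[] args) {
        double[] data = {11.2, 9.8, 10.4, 12.1, 10.0, 9.5, 10.9, 11.7};   // your sample here
        DescriptiveStatistics stats = new DescriptiveStatistics(data);

        // Step 1: fit each candidate (closed-form parameter estimates stand in for full MLE).
        RealDistribution[] candidates = {
            new NormalDistribution(stats.getMean(), stats.getStandardDeviation()),
            new ExponentialDistribution(stats.getMean())
        };

        // Steps 2 and 3: Kolmogorov-Smirnov test each candidate and keep the highest p-value.
        KolmogorovSmirnovTest ks = new KolmogorovSmirnovTest();
        RealDistribution best = null;
        double bestP = -1;
        for (RealDistribution d : candidates) {
            double p = ks.kolmogorovSmirnovTest(d, data);
            System.out.println(d.getClass().getSimpleName() + ": p = " + p);
            if (p > bestP) {
                bestP = p;
                best = d;
            }
        }
        System.out.println("Best fit: " + best.getClass().getSimpleName());
    }
}
```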
