I have a sample of some kind that can create somewhat noisy output. The sample is the result of some image processing from a camera, which indicates the heading of a blob of a certain color. It is an angle from around -45° to +45°, or a NaN, which means that the blob is not actually in view.
In order to combat the noisy data, I felt that exponential smoothing would do the trick. However, I'm not sure how to handle the NaN values.
On the one hand, involving them in the math would turn the average into NaN, preventing any meaningful result. On the other hand, ignoring NaN values completely would mean that a "no-detection" scenario would never be reported. And just to complicate things, the data is also noisy in that it can produce false NaN values, which ideally should be smoothed out like any other random noise.
Any ideas about how I could implement such an exponential smoother?
How about keeping two distributions? The first one can be your smoothed blob heading as usual, except if you get a NaN you instead just enter whatever the last seen non-NaN value was (or some other default); the other is a "NaN-distribution", which simply gets a 0 for every non-NaN value and 1 for every NaN (or something like that).
This way, even if it gets obscured, your primary distribution will keep predicting based on "last known heading", without getting garbage data or messing up the smoothing, but you'll also get a simultaneous spike on the NaN-distribution letting you know that something's up.
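A minimal sketch of that idea in Java (class and field names are mine, and alpha is the usual smoothing factor between 0 and 1):

    // Two parallel exponential smoothers: one for the heading, one for
    // how "NaN-ish" the stream currently is.
    class HeadingSmoother {
        private final double alpha;     // smoothing factor, 0 < alpha <= 1
        private double heading = 0.0;   // smoothed heading, degrees
        private double nanRatio = 0.0;  // smoothed NaN indicator, 0..1
        private double lastSeen = 0.0;  // last non-NaN measurement

        HeadingSmoother(double alpha) { this.alpha = alpha; }

        void update(double measurement) {
            boolean isNaN = Double.isNaN(measurement);
            if (!isNaN) lastSeen = measurement;
            // NaNs are replaced by the last known heading in the primary track...
            heading = alpha * (isNaN ? lastSeen : measurement) + (1 - alpha) * heading;
            // ...and recorded as a 1 in the secondary track.
            nanRatio = alpha * (isNaN ? 1.0 : 0.0) + (1 - alpha) * nanRatio;
        }

        double smoothedHeading() { return heading; }
        boolean blobLost(double threshold) { return nanRatio > threshold; } // e.g. 0.8
    }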
Well, it really depends on what you are doing with the smoothed data. One thing you might try is to keep an exponentially weighted smoothing of the blob's velocity in addition to its position, where NaNs contribute a velocity of zero. When you encounter a NaN, you can then replace it with the position projected from the previous position and the smoothed velocity. Because the velocity is smoothed, a whole sequence of NaNs cannot produce an absurdly large or small value. This can push values outside [-45, 45], which conveniently captures both that the blob is out of view and on which side it left. You will have to verify that this actually gives good results in your computer vision algorithm. If not, you might also try replacing NaNs with the previous value, or with zero, or simply ignoring them, and see what works best.
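A rough sketch of that scheme in Java (all names are mine, and the exact update rules are one plausible reading of the idea):

    // Smooths both position and velocity; NaNs contribute zero velocity
    // and are replaced by the projected position.
    class PredictiveSmoother {
        private final double alpha;
        private double position = 0.0;  // smoothed heading, degrees
        private double velocity = 0.0;  // smoothed change per sample

        PredictiveSmoother(double alpha) { this.alpha = alpha; }

        double update(double measurement) {
            double previous = position;
            if (Double.isNaN(measurement)) {
                position = previous + velocity;      // project from smoothed velocity
                velocity = (1 - alpha) * velocity;   // NaN contributes zero velocity
            } else {
                position = alpha * measurement + (1 - alpha) * previous;
                velocity = alpha * (position - previous) + (1 - alpha) * velocity;
            }
            return position; // may leave [-45, 45], signalling which side the blob left
        }
    }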
Given a set of (x, y, z) points in 3D space, I want to be able to estimate the z for a new (x, y) pair.
For example: I am given a height map of a geographical feature, for example some hills in the countryside. That means that for some latitudes and longitudes, I know the elevation of the ground at that point. I would like to estimate the elevation of a person standing at (latitude, longitude) that is most likely not in the sample set.
How can I do that in Java?
I have already researched splines but am struggling to make any progress that way. I also tried graphhopper's ElevationInterpolator, but it gives clearly wrong results: when the provided (lat, long) is in the sample set the elevation is correct, but a position even slightly offset from a sample gives a wildly different value, and it gives the same elevation for all positions that aren't in the sample set.
In case you have elevations surrounding the point in question, the best way I can see is to find the closest enclosing triangle or quad and interpolate linearly within it. I don't think you can get much better than that.
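For the triangle case, the linear interpolation boils down to barycentric coordinates; a minimal sketch in Java (finding the enclosing triangle itself is omitted):

    // Linearly interpolates z at (x, y) inside the triangle (x1,y1,z1)..(x3,y3,z3).
    static double interpolateInTriangle(double x, double y,
                                        double x1, double y1, double z1,
                                        double x2, double y2, double z2,
                                        double x3, double y3, double z3) {
        double det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3);
        double w1  = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det;
        double w2  = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det;
        double w3  = 1.0 - w1 - w2;
        // All three weights lie in [0, 1] iff (x, y) is inside the triangle.
        return w1 * z1 + w2 * z2 + w3 * z3;
    }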
In case you only have elevations on the side of a point, all you can do is assume it is relatively flat or maybe try calculating some sort of gradient from the points you have but basically that’ll only be a wild guess.
Depending on how big an area you’re covering, there’s also a geoid that you might want to take into account.
As for your "intuitive" formula: it uses too much data around the point in question, so it will definitely produce wrong results. The fact of the matter is that the farther points have nothing to do with the elevation; all you need for the estimate are the few closest ones, and since you don't know anything about the surface anyway, it doesn't really matter what the other points are. Well, maybe unless you're into ML; then maybe you can get something out of them...
I found a MicrosphereInterpolator in Apache Commons.
I guess this is actually a really hard problem and has no standard solution. This solution seems to be based on William Dudziak's 2007 MS thesis.
The MicrosphereInterpolator works well for me at the moment.
Other solutions I have tried, for example the intuitive formula, don't give intuitive results.
The only downside is that the MicrosphereInterpolator is quite slow.
This is supposed to be for an Android app, so the language in question is obviously Java.
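For reference, usage looks roughly like this with commons-math3 (the coordinates and elevations below are made up; note that newer releases deprecate this class in favor of MicrosphereProjectionInterpolator):

    import org.apache.commons.math3.analysis.MultivariateFunction;
    import org.apache.commons.math3.analysis.interpolation.MicrosphereInterpolator;

    public class ElevationEstimate {
        public static void main(String[] args) {
            // Known samples: (lat, lon) -> elevation in metres (values invented).
            double[][] latLon = { {47.00, 8.00}, {47.00, 8.10}, {47.10, 8.00}, {47.10, 8.10} };
            double[] elevation = { 420.0, 435.0, 441.0, 430.0 };

            MultivariateFunction surface =
                    new MicrosphereInterpolator().interpolate(latLon, elevation);

            // Estimate the elevation at a point that is not in the sample set.
            System.out.println(surface.value(new double[] { 47.05, 8.05 }));
        }
    }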
I'm trying to record some audio and get the dominant frequency. This is for a very specific purpose, and the frequencies I need to be detected are pure sounds made by another device. I have the recording part done, so the only thing that I need to do is calculate the frequency from the buffer it generates.
I know I'm supposed to use something called FFT, so I put these into my project: http://introcs.cs.princeton.edu/java/97data/FFT.java, and http://introcs.cs.princeton.edu/java/97data/Complex.java.html
I know there are many questions about this, but none of them give an answer that I can understand. Others have broken links.
Anyone know how to do this, and explain in a relatively simple manner?
Generally a DFT (FFT included) implementation will take N time-domain samples (your recording) and produce N/2 complex values in the frequency domain. The angle of the complex value represents the phase and the absolute value of it represents the amplitude. Usually the values output will be ordered from lowest frequency to highest frequency.
Some implementations may output N complex values, but the extra values are redundant unless your input contains complex values. It should not in your case. This is why many implementations input real values and output N/2 complex values, as this is the most common use of FFT.
So you will want to calculate the absolute value of the output, since the amplitude is what you are interested in. The absolute value of a complex number is the square root of the sum of the squares of its real and imaginary components.
The exact frequency of each value will depend on the number of input samples and the interval between them. The frequency of the value at position i (where i goes from 0 to N/2 - 1) is i * (sampling frequency) / N.
This is assuming your N is even, rather than trying to explain the case of N being odd I'll recommend you keep N even for simplicity. For the case of FFT N will always be a power of two so N will always be even anyway.
If you're looking for a tone over a minimum time T then I'd also recommend processing the input in blocks of T/2 size.
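Putting that together with the FFT and Complex classes linked in the question, peak picking might look like this (a sketch; it assumes the buffer length is a power of two, as FFT.java requires):

    // Returns the dominant frequency in Hz of one block of samples.
    static double dominantFrequency(double[] samples, double sampleRate) {
        int n = samples.length;                 // must be a power of two
        Complex[] signal = new Complex[n];
        for (int i = 0; i < n; i++) signal[i] = new Complex(samples[i], 0);

        Complex[] spectrum = FFT.fft(signal);

        int peak = 1;                           // skip bin 0, the DC offset
        for (int i = 1; i < n / 2; i++)
            if (spectrum[i].abs() > spectrum[peak].abs()) peak = i;

        return peak * sampleRate / n;           // bin index -> frequency
    }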
Fourier transforms are a mathematical technique that lets you go back and forth between time and frequency domains for time-dependent signals.
FFT is a computer algorithm for calculating discrete transforms quickly and efficiently.
You'll take a sample of your time signal and apply FFT to it to get the amplitude versus frequency for the sample.
It's not an easy topic if you don't have the mathematical background. It assumes a good knowledge of trigonometry (sines and cosines), functions, and calculus. If you don't have that, it'll be difficult to read and understand any reference you can find.
If you don't have that background, do your best to treat a library FFT function as a black box and use what it gives back.
The server receives monitoring data for some process at a fixed rate (12 points per minute) from an external source (web services, etc.). A process may run for a minute or less, or for an hour, or a day, so at the end of a process I may have 5, or 720, or 17280 data points. This data is gathered for more than 40 parameters and stored in the database for future display via the web. Imagine more than 1000 processes running and the amount of data generated. Since I have to stick to an RDBMS (MySQL specifically), I want to process the data and reduce its volume by selecting only statistically significant points before storing it. The ultimate objective is to plot these data points on a graph where the Y-axis is time and the X-axis is some parameter (part of the data point).
I do not want to miss any significant fluctuation or the overall character of the data, but at the same time I cannot plot all of the data points when their number is huge (> 100).
Please note that I am aware of basic statistical terms like mean, standard deviation, etc.
If this is a constant process, you could plot the mean (should be a flat line) and any points that exceeded a certain threshold. Three standard deviations might be a good threshold to start with, then see whether it gives you the information you need.
If it's not a constant process, you need to figure out how it should be varying with time and do a similar thing: plot the points that substantially vary from your expectation at that point in time.
That should give you a pretty clean graph while still communicating the important information.
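A minimal sketch of that thresholding in Java (the method name is mine; k = 3 corresponds to the three-standard-deviation suggestion):

    import java.util.ArrayList;
    import java.util.List;

    // Returns the (time, value) pairs that deviate from the mean by more
    // than k standard deviations; everything else is summarized by the mean.
    static List<double[]> significantPoints(double[] t, double[] v, double k) {
        double mean = 0;
        for (double x : v) mean += x;
        mean /= v.length;

        double var = 0;
        for (double x : v) var += (x - mean) * (x - mean);
        double sd = Math.sqrt(var / v.length);

        List<double[]> kept = new ArrayList<>();
        for (int i = 0; i < v.length; i++)
            if (Math.abs(v[i] - mean) > k * sd)
                kept.add(new double[] { t[i], v[i] });
        return kept;
    }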
If you expect your process to be noisy, then smoothing through a spline can help you reduce noise and compress your data (since to draw a spline you only need a few points, where "few" is arbitrarily picked by you, depending on how much detail you want to get rid of).
However, if your process is not noisy, then outliers are very important, since they may represent errors or exceptional conditions. In this case, you are better off getting rid of the points that are close to the average (say less than 1 standard deviation), and keeping those that are far.
A little note: the term "statistically significant" describes a high enough level of certainty to reject the null hypothesis. I don't think it applies to your problem.
The question is a bit broad.
Here is what I have done:
I have a method for applying the fft. I'm not going to post it because whether it is correct or incorrect is not really the point here.
I run an image through the method and then try to display the output as two images of the same size, one for the real part and one for the imaginary part.
This seems to work fine except that the grayscale values that come out of my method are usually much larger than 255 and therefore I'm not sure what I'm seeing.
I then take the raw result (not whatever the pixel values I display are, since I assume they are modified somehow to fit between 0 and 255) and run it through the same method as before but with a sign change to achieve the ifft.
I then try to display this as well. Again, the raw values are much larger than 255 for the most part.
My question boils down to:
a.) Do I have to do some scaling on the fft to get it to fit between 0 and 255?
b.) Do I have to reverse this scaling when I do the ifft?
c.) Is there any translation I have to do on the fft before I apply the ifft?
Part c arises from the fact that I have read some things which talk about centering the corners of the fft but I'm not really certain what this means.
A lesser question, part d, is this: if I apply the 2d fft to the original image by first applying the 1d fft to all the rows and then again to all the columns, do I need to apply the ifft in the same order, or do I need to reverse it?
I think that's all for now. I have been doing a lot of looking for answers but can't seem to find much, so any help is appreciated.
EDIT: I added some images, maybe they will help. The first is the original image, the second is the result of my fft method (magnitude and imaginary component) and the third is the result of the ifft on the intermediate image.
EDIT2: Updated the images to ones from newer method.
People usually don't find it very useful to view the real and imaginary parts separately, but instead view the magnitude, and possibly the phase, but usually just the magnitude.
a) In general, yes, you will need to apply a scaling regardless of which components you're viewing. There are scaling relations between the total power of the image and its FFT, but not between the individual components. Also, you'll often want to do something like take the log of the data, or ignore the zero-frequency component, etc., so it's best just to do the scaling on your own.
b) The scaling in part a is only for visualization; don't scale the actual FFT. You should take the IFFT of the original FFT.
c) Depending on how your FFT routines work, you may need to divide by a factor of 2pi or by the number of points in the sample; the docs should clarify this. As a start, just check whether there's such a factor between what you start with and what you end with.
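To illustrate the log-and-scale idea in (a), a sketch in Java; the re/im array layout is an assumption, so adapt it to whatever your fft method actually returns:

    // Maps FFT magnitudes to 0..255 grayscale on a log scale, for display only.
    static int[][] toDisplayable(double[][] re, double[][] im) {
        int h = re.length, w = re[0].length;
        double[][] logMag = new double[h][w];
        double max = 0;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                double mag = Math.hypot(re[y][x], im[y][x]);
                logMag[y][x] = Math.log1p(mag);   // log(1 + |F|) avoids log(0)
                max = Math.max(max, logMag[y][x]);
            }
        if (max == 0) max = 1;                    // all-black input edge case
        int[][] gray = new int[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                gray[y][x] = (int) Math.round(255 * logMag[y][x] / max);
        return gray; // keep the raw re/im untouched for the later ifft
    }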
Answers to your four questions:
a. Do you have to scale the results of the FFT to view them? Yes. You need to take the magnitude and then scale it down to values between 0 and 255.
b. Do you have to reverse the scaling before the IFFT? The scaling in (a) is only for viewing the results of the FFT. You cannot IFFT the scaled numbers; use the original numbers.
c. Do you need any translation between the FFT and the IFFT? No.
d. Does the row vs. column order during the FFT matter? No. The FFT produces a deterministic set of real and imaginary numbers; you can IFFT in either order.
One of the key aspects you may be having trouble with is the difference between the math and the visualization. The IFFT works on float or double real and imaginary numbers, while the image expects integers between 0 and 255; you have to handle this conversion in code. You indicated that you thought the displayed values were "modified somehow"; it is safer to perform this conversion yourself.
Finally, ditto on the tom10 answer: you may have to scale the results of the IFFT, depending on the implementation of the FFT and IFFT.
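To make point d concrete, here is a sketch of a 2D FFT built from a 1D routine such as the Princeton FFT.java linked earlier on this page (both dimensions assumed to be powers of two); the inverse works the same way with FFT.ifft, in either order:

    // 2D FFT: 1D FFT of every row, then of every column.
    static Complex[][] fft2d(Complex[][] img) {
        int rows = img.length, cols = img[0].length;
        Complex[][] out = new Complex[rows][cols];
        for (int r = 0; r < rows; r++)
            out[r] = FFT.fft(img[r]);              // rows first
        for (int c = 0; c < cols; c++) {           // then columns
            Complex[] col = new Complex[rows];
            for (int r = 0; r < rows; r++) col[r] = out[r][c];
            col = FFT.fft(col);
            for (int r = 0; r < rows; r++) out[r][c] = col[r];
        }
        return out;
    }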
I have some statistical data. A few of the values are scattered far from the majority of the data set, as shown below. What I want to do is compute a mean of the data set that minimizes the effect of these highly scattered values.
My data set is as like this:
10.02, 11, 9.12, 7.89, 10.5, 11.3, 10.9, 12, 8.99, 89.23, 328.42.
[Figure: plot of the data set, with 89.23 and 328.42 standing far above the rest.]
I need a mean value that is not 46.3 but closer to the rest of the data distribution. In other words, I want to minimize the effect of 89.23 and 328.42 in the mean calculation.
Thanks in advance
You might find that you really don't want the mean. The problem here is that the distribution you've assumed for the data is different from the actual data: if you try to fit a normal distribution to it, you'll get bad results. You could instead try fitting a heavy-tailed distribution like the Cauchy. If you do want to use a normal distribution, then you need to filter out the non-normal samples. If you feel like you know what the standard deviation should be, you could remove everything more than, say, 3 standard deviations from the mean (the exact number would depend on the sample size). This process can be applied recursively to remove non-normal samples until you are happy with how large the remaining outliers are in terms of the standard deviation.
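A sketch of that recursive trimming in Java (the method name and signature are mine):

    import java.util.ArrayList;
    import java.util.List;

    // Repeatedly drops values more than k standard deviations from the
    // current mean, then re-estimates, until nothing more is removed.
    static double trimmedMean(List<Double> data, double k) {
        List<Double> kept = new ArrayList<>(data);
        while (true) {
            double mean = 0;
            for (double x : kept) mean += x;
            mean /= kept.size();

            double var = 0;
            for (double x : kept) var += (x - mean) * (x - mean);
            double sd = Math.sqrt(var / kept.size());

            List<Double> next = new ArrayList<>();
            for (double x : kept)
                if (Math.abs(x - mean) <= k * sd) next.add(x);

            if (next.size() == kept.size()) return mean; // converged
            kept = next;
        }
    }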
Unfortunately, the mean of a set of data is just that: the mean value. Are you sure the point is actually an outlier? Your data contains what appears to be a single outlier with regard to the clustering, but if you take a look at your plot, you will see that the data does seem to have a linear relationship, so is it truly an outlier?
If this reading is really causing you problems, you could remove it entirely. Other than that, the only thing I can suggest is to calculate some kind of weighted mean rather than the true mean (http://en.wikipedia.org/wiki/Weighted_mean). This way you can assign a lower weight to the point when calculating your mean (although how you choose a value for the weight is another matter). This is similar to weighted regression, where particular data points have less weight in the regression fit, possibly due to the unreliability of certain points (http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#Weighted_linear_least_squares).
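As one concrete (entirely made-up) weighting rule, you could let each point's weight fall off with its distance from the median, for example:

    import java.util.Arrays;

    // Weighted mean where weights shrink with distance from the median,
    // so far-out values like 89.23 and 328.42 contribute very little.
    static double robustWeightedMean(double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];

        double sum = 0, weightSum = 0;
        for (double x : data) {
            double w = 1.0 / (1.0 + Math.abs(x - median));
            sum += w * x;
            weightSum += w;
        }
        return sum / weightSum;
    }

On the data set above this comes out around 10.9 instead of 46.3.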
Hope this helps a little, or at least gives you some pointers to other avenues that you can try pursuing.