I'm currently using the JTransforms library to calculate the DFT of a one-second signal at a sampling rate (Fs) of 44100 Hz. The code is quite simple:
DoubleFFT_1D fft = new DoubleFFT_1D(SIZE);
fft.complexForward(signal); // signal: 44100 length array with audio bytes.
See this page for the documentation of JTransforms' DoubleFFT_1D class:
http://incanter.org/docs/parallelcolt/api/edu/emory/mathcs/jtransforms/fft/DoubleFFT_1D.html
The question is: what is SIZE supposed to be? I know it's probably the window size, but can't seem to get it to work with the most common values I've come across, such as 1024 and 2048.
At the moment I'm testing this function by generating a 1 kHz sinusoid. However, when I use the code above and compare the results with MATLAB's fft function, they seem to be of a whole different magnitude. E.g. MATLAB gives results such as 0.0004 - 0.0922i, whereas the above code produces results like -1.7785E-11 + 6.8533E-11i, with SIZE set to 2048. The contents of the signal array are equal, however.
Which value for SIZE would give results similar to MATLAB's built-in fft?
According to the documentation, SIZE looks like it should be the number of samples in signal. If it's truly a 1 s signal at 44.1 kHz, then you should use SIZE = 44100. Since you're using complex data, signal should be an array twice this size (real/imaginary in sequence).
If you don't use SIZE = 44100, your results will not match what Matlab gives you. This is because of the way Matlab (and probably JTransforms) scales the fft and ifft functions based on the length of the input - don't worry that the amplitudes don't match. By default, Matlab calculates the FFT using the full signal. You can provide a second argument to fft (in Matlab) to calculate the N-point FFT and it should match your JTransforms result.
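For reference, here is a minimal sketch of that setup with JTransforms (the 1 kHz test tone mirrors the one described in the question; treat it as illustrative rather than a drop-in for your code):

import edu.emory.mathcs.jtransforms.fft.DoubleFFT_1D;

public class FftSizeDemo {
    public static void main(String[] args) {
        int fs = 44100;     // sampling rate in Hz
        int size = fs;      // one full second of samples, so the result lines up with MATLAB's fft(x)

        // 1 kHz test tone, packed as interleaved [re, im, re, im, ...] for complexForward
        double[] signal = new double[2 * size];
        for (int n = 0; n < size; n++) {
            signal[2 * n] = Math.cos(2 * Math.PI * 1000.0 * n / fs);  // real part
            signal[2 * n + 1] = 0.0;                                  // imaginary part
        }

        DoubleFFT_1D fft = new DoubleFFT_1D(size);
        fft.complexForward(signal);  // in place: bin k is signal[2*k] (re) and signal[2*k+1] (im)
    }
}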
From your comments, it sounds like you're trying to create a spectrogram. For this, you will have to figure out your trade-off between spectral resolution, temporal resolution, and computation time. Here is my (Matlab) code for a 1-second spectrogram, calculated for each 512-sample chunk of a 1 s signal.
fs = 44100; % Hz
w = 1; % s
t = linspace(0, w, w*fs);
k = linspace(-fs/2, fs/2, w*fs);
% simulate the signal - time-dependent frequency
f = 10000*t; % Hz
x = cos(2*pi*f.*t);
m = 512; % SIZE
S = zeros(m, floor(w*fs/m));
for i = 0:(w*fs/m)-1
    s = x((i*m+1):((i+1)*m));
    S(:,i+1) = fftshift(fft(s));
end
For this image we have 512 samples along the frequency axis (y-axis), ranging from [-22050 Hz 22050 Hz]. There are 86 samples along the time axis (x-axis) covering about 1 second.
For this image we now have 4096 samples along the frequency axis (y-axis), ranging from [-22050 Hz 22050 Hz]. The time axis (x-axis) again covers about 1 second, but this time with only 10 chunks.
Whether it's more important to have fast time resolution (512-sample chunks) or high spectral resolution (4096-sample chunks) will depend on what kind of signal you're working with. You have to make a decision about what you want in terms of temporal/spectral resolution, and what you can achieve in reasonable computation time. If you use SIZE = 4096, for example, you will be able to calculate the spectrum ~10x/s (based on your sampling rate), but the FFT may not be fast enough to keep up. If you use SIZE = 512 you will have poorer spectral resolution, but the FFT will calculate much faster and you can calculate the spectrum ~86x/s. If the FFT is still not fast enough, you could then start skipping chunks (e.g. use SIZE = 512 but only calculate for every other chunk, giving ~43 spectra per 1 s signal). Hopefully this makes sense.
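If you end up doing the chunking in Java with JTransforms rather than Matlab, a rough sketch of the same loop might look like this (the method name and the magnitude step are my additions, not part of the code above):

import edu.emory.mathcs.jtransforms.fft.DoubleFFT_1D;

public class ChunkedSpectrum {
    // One FFT per chunk of the real-valued signal, e.g. chunkSize = 512 for a 44100-sample second.
    // Unlike the MATLAB code above, this leaves the bins in 0..fs order rather than fftshift-ed.
    public static double[][] spectrogram(double[] signal, int chunkSize) {
        int numChunks = signal.length / chunkSize;
        DoubleFFT_1D fft = new DoubleFFT_1D(chunkSize);
        double[][] magnitudes = new double[numChunks][chunkSize];

        for (int c = 0; c < numChunks; c++) {
            // pack the chunk as interleaved [re, im, re, im, ...]
            double[] buf = new double[2 * chunkSize];
            for (int n = 0; n < chunkSize; n++) {
                buf[2 * n] = signal[c * chunkSize + n];
            }
            fft.complexForward(buf);

            for (int k = 0; k < chunkSize; k++) {
                double re = buf[2 * k];
                double im = buf[2 * k + 1];
                magnitudes[c][k] = Math.sqrt(re * re + im * im);
            }
        }
        return magnitudes;
    }
}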
Related
I'm trying to add audio to a video, where I need a single short array representing the audio. I don't know how to get the length of this array.
I've found an estimate of 91 shorts per millisecond, but I don't know how to get an exact value instead of guessing and checking.
Here's the relevant code:
IMediaWriter writer = ToolFactory.makeWriter(file.getAbsolutePath());
writer.addVideoStream(0, 0, IRational.make(fps, 1), animation.getWidth(), animation.getHeight());
writer.addAudioStream(1, 0, 2, 44100);
...
int scale = 1000 / 11; // TODO
short[] audio = new short[animation.getLength() * scale];
animation.getLength() is the length of the video in milliseconds
What's the formula for calculating the scale variable?
The reason a list of shorts is needed is since this is an animation library that supports adding lots of sounds into the outputted video. Thus, I loop through all the requested sounds, turn the sounds into short lists, and then add their values to the correct spot in the main audio list. Not using a short list would make it so I can't stack several sounds on top of each other and make the timing more difficult.
The scale is what's known as the audio sampling rate, which is normally measured in hertz (Hz) and corresponds to "samples per second".
Assuming each element in your array is a single audio sample, you can estimate the array size by multiplying the audio sampling rate by the animation duration in seconds.
For example, if your audio sampling rate is 48,000 Hz:
int audioSampleRate = 48000;
double samplesPerMillisecond = (double) audioSampleRate / 1000;
int estimatedArrayLength = (int) (animation.getLength() * samplesPerMillisecond);
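Plugging in the 44100 Hz rate from the question's addAudioStream call (my assumption; if the writer expects interleaved stereo shorts, the channel count multiplies the length as well):

int audioSampleRate = 44100;                              // the rate passed to addAudioStream
int channels = 2;                                         // addAudioStream(1, 0, 2, 44100) appears to request 2 channels
double samplesPerMillisecond = audioSampleRate / 1000.0;  // 44.1 samples per millisecond per channel
int lengthMs = animation.getLength();                     // video length in milliseconds, as in the question
short[] audio = new short[(int) (lengthMs * samplesPerMillisecond) * channels];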
I have an array (size 128) of data that I am using FFT on. I am trying to find the frequency of the data through the FFT spectrum. The problem is that the formula freq = i * Fs / N doesn't seem to be working. My data is quite noisy and I don't know if it is because of my noisy data or because I am doing something else wrong. Below is my raw data:
And this is the spectrum that results from the transform:
I am getting two maximum peaks of equal magnitude at index 4 and 128 in the output array. The frequency of the data should be around 1.1333 Hz, but I am getting 5-6 or completely wrong values when I use the formula:
freq = i * Fs / N;
where i is the array index of the largest magnitude peak, Fs is the sampling rate in Hz, and N is the data size.
Using my data, you get freq = (4 * 11.9) / 128 = 0.37 Hz, which is very off from what is expected.
If my calculation is correct, are there any ways to improve my data? Or, are my calculations for frequency incorrect?
Let's first make sure you are looking at the actual magnitudes. An FFT returns complex values associated with each frequency bin. These complex values are typically represented by two numbers: one for the real part and another for the imaginary part. The magnitude of a frequency component can then be obtained by computing sqrt(real*real + imaginary*imaginary).
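For example, assuming the FFT output is stored as interleaved real/imaginary pairs (the layout JTransforms' complexForward uses; adjust the indexing for your library):

// Compute bin magnitudes from an FFT output stored as [re0, im0, re1, im1, ...].
static double[] magnitudes(double[] fftOutput) {
    double[] magnitude = new double[fftOutput.length / 2];
    for (int k = 0; k < magnitude.length; k++) {
        double re = fftOutput[2 * k];
        double im = fftOutput[2 * k + 1];
        magnitude[k] = Math.sqrt(re * re + im * im);
    }
    return magnitude;
}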
This should give you half as many values, and the corresponding spectrum (with the magnitudes expressed in decibels) looks like this:
As you can see, there is a strong peak near 0 Hz, which is consistent with your raw data having a large average value as well as an increasing trend from time 0 to ~4.2 s (both of which are larger than the oscillations' amplitude). If we were to remove these low-frequency contributions (with, for example, a high-pass filter with a cutoff frequency around 0.25 Hz), we would get the following fluctuation data:
with the corresponding spectrum:
As you can see, the oscillation frequency can be observed much more readily in bin 11, which gives you freq = (11 * 11.9) / 128 ≈ 1 Hz.
Note, however, that whether removing these frequency components below 0.25 Hz is an improvement to your data depends on whether those frequency components are of interest for your application (possibly not, since you seem to be interested in the relatively faster fluctuations).
You need to remove the DC bias (average of all samples) before the FFT to measure frequencies near 0 Hz (or near the FFT's 0th result bin). Applying a window function (von Hann or Hamming window) after removing the DC bias and before the FFT may also help.
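A sketch of both steps on a real-valued sample buffer, assuming the samples are already doubles:

// Subtract the DC bias (the mean), then apply a von Hann window in place, before the FFT.
static void removeDcAndWindow(double[] samples) {
    double mean = 0.0;
    for (double v : samples) {
        mean += v;
    }
    mean /= samples.length;

    for (int n = 0; n < samples.length; n++) {
        double hann = 0.5 * (1.0 - Math.cos(2.0 * Math.PI * n / (samples.length - 1)));
        samples[n] = (samples[n] - mean) * hann;
    }
}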
I know that there are a lot of resources online explaining how to deinterleave PCM data. In the course of my current project I have looked at most of them...but I have no background in audio processing and I have had a very hard time finding a detailed explanation of how exactly this common form of audio is stored.
I do understand that my audio will have two channels and thus the samples will be stored in the format [left][right][left][right]...
What I don't understand is what exactly this means. I have also read that each sample is stored in the format [left MSB][left LSB][right MSB][right LSB]. Does this mean that each 16-bit integer actually encodes two 8-bit frames, or is each 16-bit integer its own frame destined for either the left or right channel?
Thank you everyone. Any help is appreciated.
Edit: If you choose to give examples please refer to the following.
Method Context
Specifically, what I have to do is convert an interleaved short[] into two float[] arrays, each representing the left or right channel. I will be implementing this in Java.
public static float[][] deinterleaveAudioData(short[] interleavedData) {
    //initialize the channel arrays
    float[] left = new float[interleavedData.length / 2];
    float[] right = new float[interleavedData.length / 2];
    //iterate through the buffer
    for (int i = 0; i < interleavedData.length; i++) {
        //THIS IS WHERE I DON'T KNOW WHAT TO DO
    }
    //return the separated left and right channels
    return new float[][]{left, right};
}
My Current Implementation
I have tried playing the audio that results from this. It's very close, close enough that you could understand the words of a song, but is still clearly not the correct method.
public static float[][] deinterleaveAudioData(short[] interleavedData) {
    //initialize the channel arrays
    float[] left = new float[interleavedData.length / 2];
    float[] right = new float[interleavedData.length / 2];
    //iterate through the buffer
    for (int i = 0; i < left.length; i++) {
        left[i] = (float) interleavedData[2 * i];
        right[i] = (float) interleavedData[2 * i + 1];
    }
    //return the separated left and right channels
    return new float[][]{left, right};
}
Format
If anyone would like more information about the format of the audio the following is everything I have.
Format is PCM 2 channel interleaved big endian linear int16
Sample rate is 44100
Number of shorts per short[] buffer is 2048
Number of frames per short[] buffer is 1024
Frames per packet is 1
I do understand that my audio will have two channels and thus the samples will be stored in the format [left][right][left][right]... What I don't understand is what exactly this means.
Interleaved PCM data is stored one sample per channel, in channel order, before moving on to the next frame. A PCM frame is made up of one sample from each channel. If you have stereo audio with left and right channels, then one sample from each together makes a frame.
Frame 0: [left sample][right sample]
Frame 1: [left sample][right sample]
Frame 2: [left sample][right sample]
Frame 3: [left sample][right sample]
etc...
Each sample is a measurement and digital quantization of pressure at an instantaneous point in time. That is, if you have 8 bits per sample, you have 256 possible levels of precision that the pressure can be sampled at. Knowing that sound waves are... waves... with peaks and valleys, we are going to want to be able to measure distance from the center. So, we can define center at 127 or so and subtract and add from there (0 to 255, unsigned) or we can treat those 8 bits as signed (same values, just different interpretation of them) and go from -128 to 127.
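As a small illustration of the unsigned interpretation (the numbers are hypothetical, just to show the centring):

int unsignedSample = 200;              // one 8-bit sample, 0..255, silence centred at 128
int centred = unsignedSample - 128;    // shift the centre to 0: range -128..127, here 72
float normalized = centred / 128.0f;   // roughly -1.0..+1.0, here 0.5625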
Using 8 bits per sample with single-channel (mono) audio, we use one byte per sample, meaning one second of audio sampled at 44.1 kHz uses exactly 44,100 bytes of storage.
Now, let's assume 8 bits per sample, but in stereo at 44.1 kHz. Every other byte is going to be for the left channel, and every other one for the right.
LRLRLRLRLRLRLRLRLRLRLR...
Scale it up to 16 bits and you have two bytes per sample (each sample is set off with brackets [ and ]; spaces indicate frame boundaries):
[LL][RR] [LL][RR] [LL][RR] [LL][RR] [LL][RR] [LL][RR]...
I have also read that each sample is stored in the format [left MSB][left LSB][right MSB][right LSB].
Not necessarily. The audio can be stored in any endianness. Little endian is the most common, but that isn't a magic rule. I do think though that all channels go in order always, and front left would be channel 0 in most cases.
Does this mean that each 16-bit integer actually encodes two 8-bit frames, or is each 16-bit integer its own frame destined for either the left or right channel?
Each value (16-bit integer in this case) is destined for a single channel. Never would you have two multi-byte values smashed into each other.
I hope that's helpful. I can't run your code, but given your description, I suspect you have an endian problem and that your samples aren't actually big endian.
Let's start by getting some terminology out of the way
A channel is a monaural stream of samples. The term does not necessarily imply that the samples are contiguous in the data stream.
A frame is a set of co-incident samples. For stereo audio (e.g. L & R channels) a frame contains two samples.
A packet is 1 or more frames, and is typically the minimum number of frames that can be processed by a system at once. For PCM audio, a packet often contains 1 frame, but for compressed audio it will be larger.
Interleaving is a term typically used for stereo audio, in which the data stream consists of consecutive frames of audio. The stream therefore looks like L1R1L2R2L3R3......LnRn
Both big- and little-endian audio formats exist, and which is used depends on the use-case. However, it's generally only an issue when exchanging data between systems - you'll always use native byte order when processing or interfacing with operating system audio components.
You don't say whether you're using a little- or big-endian system, but I suspect it's probably the former, in which case you need to byte-reverse the samples.
Although not set in stone, when using floating point, samples are usually in the range -1.0 < x < +1.0, so you want to divide the samples by 1 << 15 (32768). When 16-bit linear types are used, they are typically signed.
Taking care of byte-swapping and format conversions:
int s = (int) interleavedData[2 * i];
short revS = (short) (((s & 0xff) << 8) | ((s >> 8) & 0xff)); // swap the two bytes
left[i] = ((float) revS) / 32768.0f;                          // divide by 1 << 15

int r = (int) interleavedData[2 * i + 1];
short revR = (short) (((r & 0xff) << 8) | ((r >> 8) & 0xff));
right[i] = ((float) revR) / 32768.0f;
Actually you are dealing with an almost typical WAVE file at Audio CD quality, that is to say:
2 channels
sampling rate of 44100 Hz
each amplitude sample quantized as a 16-bit signed integer
I said almost because big-endianness is usually used in AIFF files (Mac world), not in WAVE files (PC world). And I don't know without searching how to deal with endianness in Java, so I will leave this part to you.
How the samples are stored is quite simple:
each sample takes 16 bits (an integer from -32768 to +32767)
if channels are interleaved: (L,1),(R,1),(L,2),(R,2),...,(L,n),(R,n)
if channels are not: (L,1),(L,2),...,(L,n),(R,1),(R,2),...,(R,n)
Then, to feed an audio callback, it is usually required to provide 32-bit floating point values ranging from -1 to +1, and maybe this is where something is missing in your algorithm. Dividing your integers by 32768 (2^(16-1)) should make it sound as expected.
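Combined with the loop from your question, that scaling would look something like this (a sketch assuming the shorts are already in the byte order your platform expects; see the other answers for the byte swap):

for (int i = 0; i < left.length; i++) {
    left[i]  = interleavedData[2 * i]     / 32768.0f; // scale signed 16-bit to roughly -1.0..+1.0
    right[i] = interleavedData[2 * i + 1] / 32768.0f;
}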
I ran into a similar issue with de-interleaving the short[] frames that came in through Spotify Android SDK's onAudioDataDelivered().
The documentation for onAudioDataDelivered was poorly written a year ago; see the GitHub issue. They've since updated the docs with a better description and more accurate parameter names:
onAudioDataDelivered(short[] samples, int sampleCount, int sampleRate, int channels)
What can be confusing is that samples.length can be 4096. However, it contains only sampleCount valid samples. If you're receiving stereo audio and sampleCount = 2048, there are only 1024 frames (each frame has two samples) of audio in the samples array!
So you'll need to update your implementation to make sure you're working with sampleCount and not samples.length.
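In other words, size everything from the callback arguments rather than from samples.length, along these lines (reusing the deinterleave method from the question):

// Only the first sampleCount entries of 'samples' are valid audio.
short[] valid = java.util.Arrays.copyOf(samples, sampleCount);
float[][] channels = deinterleaveAudioData(valid); // e.g. 1024 frames when sampleCount is 2048 stereo samples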
I would like to be able to take a frequency (e.g. 1000 Hz, 250 Hz, 100 Hz) and play it out through the phone hardware.
I know that Android's AudioTrack will allow me to play 16-bit PCM if I can calculate an array of bytes or shorts. I would like to calculate only a single period so that later I can loop it without any issues, and so I can keep the calculation time down.
How could this be achieved?
Looping a single period isn't necessarily a good idea - the cycle may not fit nicely into an exact number of samples so you might get an undesirable discontinuity at the end of each cycle, or worse, the audible frequency may end up slightly off.
That said, the math isn't hard:
float sample_rate = 44100;
float samples_per_cycle = sample_rate / frequency;
int samples_to_produce = ....
short[] sample = new short[samples_to_produce];
for (int i = 0; i < samples_to_produce; ++i) {
    // Math.floor returns a double, so cast down to the 16-bit PCM range
    sample[i] = (short) Math.floor(32767.0 * Math.sin(2 * Math.PI * i / samples_per_cycle));
}
To see what I meant above about the frequency, take the standard tuning pitch of 440 Hz.
Strictly this needs 100.227 samples per cycle, but the code above would produce 100. So if you repeat your 100 samples over and over, you'll actually play the cycle 441 times per second, and your pitch will be off by 1 Hz.
To avoid the problem you'd really need to calculate several periods of the waveform, although I don't know how many are needed to fool the ear into hearing the right pitch.
Ideally it would be as many as are needed such that:
i / samples_per_cycle
is a whole number, so that the last sample (technically the one after the last sample) ends exactly on a cycle boundary. I think if your input frequencies are all whole numbers, then producing exactly one second's worth would work.
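For example, with a whole-number frequency, generating exactly one second gives you a buffer that ends on a cycle boundary and can be looped cleanly (a sketch; 440 Hz is just an example value):

int sampleRate = 44100;
int frequency = 440;                     // whole-number Hz, so one second contains exactly 440 cycles
short[] sample = new short[sampleRate];  // exactly one second of audio
for (int i = 0; i < sample.length; i++) {
    sample[i] = (short) Math.floor(32767.0 * Math.sin(2 * Math.PI * frequency * i / (double) sampleRate));
}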
I'm able to display a waveform, but I don't know how to implement zooming in on the waveform.
Any ideas?
Thanks piccolo
By zoom, I presume you mean horizontal zoom rather than vertical. The way audio editors do this is to scan the waveform, breaking it up into time windows where each pixel in X represents some number of samples. It can be a fractional number, but you can get away with disallowing fractional zoom ratios without annoying the user too much. Once you zoom out a bit, the max value is always a positive integer and the min value is always a negative integer.
For each pixel on the screen, you need to know the minimum sample value and the maximum sample value for that pixel. So you need a function that scans the waveform data in chunks and keeps track of the accumulated max and min for each chunk.
This is a slow process, so professional audio editors keep a pre-calculated table of min and max values at some fixed zoom ratio, say 512/1 or 1024/1. When you are drawing with a zoom ratio of more than 1024 samples/pixel, you use the pre-calculated table; if you are below that ratio, you get the data directly from the file. If you don't do this, you will find that your drawing code gets too slow when you zoom out.
It's worthwhile to write code that handles all of the channels of the file in a single pass when doing this scanning; slowness here will make your whole program feel sluggish. It's the disk I/O that matters here, not the CPU, so straightforward C++ code is fine for building the min/max tables, but you don't want to go through the file more than once, and you want to read it sequentially.
Once you have the min/max tables, keep them around. You want to go back to the disk as little as possible, and many of the reasons for wanting to repaint your window will not require you to rescan the min/max tables. The memory cost of holding on to them is not that high compared to the disk I/O cost of building them in the first place.
Then you draw the waveform by drawing a series of 1-pixel-wide vertical lines between the max value and the min value for the time represented by that pixel. This should be quite fast if you are drawing from pre-built min/max tables.
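A rough sketch of that scan (the names are mine, and a real editor would do this per channel and cache the result as described above):

// For each pixel column, record the min and max sample over its chunk of the waveform.
// samplesPerPixel is the zoom ratio, e.g. 1024 samples per pixel when zoomed well out.
static short[][] buildMinMaxTable(short[] samples, int samplesPerPixel) {
    int pixels = samples.length / samplesPerPixel;
    short[][] table = new short[pixels][2];
    for (int p = 0; p < pixels; p++) {
        short min = Short.MAX_VALUE;
        short max = Short.MIN_VALUE;
        for (int s = p * samplesPerPixel; s < (p + 1) * samplesPerPixel; s++) {
            if (samples[s] < min) min = samples[s];
            if (samples[s] > max) max = samples[s];
        }
        table[p][0] = min;
        table[p][1] = max;
    }
    return table;
}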
Answered by https://stackoverflow.com/users/234815/John%20Knoeller
Working on this right now. It's C# with a little LINQ, but it should be easy enough to read and understand. The idea here is to have an array of float values from -1 to 1 representing the amplitude of every sample in the WAV file. Then, knowing how many samples per second there are, we need a scaling factor: segments per second. At that point you are simply reducing the data points and smoothing them out. To zoom in really tight, use a segments-per-second value of around 1000; to zoom way out, maybe 5-10. Note that right now I'm just doing plain averaging, and this needs to be updated to be much more efficient and should probably use RMS (root-mean-square) averaging to make it perfect.
private List<float> BuildAverageSegments(float[] aryRawValues, int iSamplesPerSecond, int iSegmentsPerSecond)
{
    double nDurationInSeconds = aryRawValues.Length / (double) iSamplesPerSecond;
    int iNumSegments = (int) Math.Round(iSegmentsPerSecond * nDurationInSeconds);
    int iSamplesPerSegment = (int) Math.Round(aryRawValues.Length / (double) iNumSegments); // total number of samples divided by the total number of segments

    List<float> colAvgSegVals = new List<float>();
    for (int i = 0; i < iNumSegments - 1; i++)
    {
        int iStartIndex = i * iSamplesPerSegment;
        int iEndIndex = (i + 1) * iSamplesPerSegment;
        float fAverageSegVal = aryRawValues.Skip(iStartIndex).Take(iEndIndex - iStartIndex).Average();
        colAvgSegVals.Add(fAverageSegVal);
    }
    return colAvgSegVals;
}
Outside of this, you need to get your audio into WAV format. You should be able to find source code everywhere to read that data; then use something like this to convert the raw byte data to floats. Again, this is horribly rough and inefficient, but clear:
public float[] GetFloatData()
{
    //Scale Factor - SignificantBitsPerSample
    if (Data != null && Data.Length > 0)
    {
        float nMaxValue = (float) Math.Pow((double) 2, SignificantBitsPerSample);
        float[] aryFloats = new float[Data[0].Length];
        for (int i = 0; i < Data[0].Length; i++)
        {
            aryFloats[i] = Data[0][i] / nMaxValue;
        }
        return aryFloats;
    }
    else
    {
        return null;
    }
}