Hi I need to downsample a wav audio file's sample rate from 44.1kHz to 8kHz. I have to do all the work manually with a byte array...it's for academic purposes.
I am currently using 2 classes, Sink and Source, to pop and push arrays of bytes. Everything goes well until I reach the part where I need to downsample the data chunk using a linear interpolation.
Since I'm downsampling from 44100 to 8000 Hz, how do I interpolate a byte array containing something like 128 000 000 bytes? Right now I'm popping 5, 6 or 7 bytes depending on i%2 == 0, i%2 == 1 and i%80 == 0 and push the average of these 5, 6 or 7 bytes into the new file.
The result is indeed a smaller audio file than the original, but it cannot be played in Windows Media Player (it says there is an error while reading the file) and there is a lot of noise, although I can hear the right track behind the noise.
So, to sum things up, I need help concerning the linear interpolation part. Thanks in advance.
I think you shouldn't use the average of those samples, as that would be a mean (moving-average) filter, not exactly downsampling. Just use every 5th/6th/7th sample and write that to the new file.
That will probably have some aliasing artifacts but might overall be recognizable.
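For comparison, here is a minimal sketch of the linear interpolation the question asks about. It assumes the data chunk has already been decoded into mono 16-bit samples (a short[]); if the file is 16-bit, interpolating or averaging individual bytes instead of whole samples would by itself produce noise. The class and method names are made up for illustration, and no low-pass filter is applied first, so content above 4 kHz will still alias.

```java
final class LinearResampler {
    /**
     * Resamples mono PCM samples from srcRate to dstRate by linear interpolation.
     * No anti-aliasing filter is applied (assumption: acceptable for a first attempt).
     */
    static short[] resample(short[] in, int srcRate, int dstRate) {
        if (in.length == 0) {
            return new short[0];
        }
        int outLen = (int) ((long) in.length * dstRate / srcRate);
        short[] out = new short[outLen];
        double step = (double) srcRate / dstRate; // input samples per output sample
        for (int i = 0; i < outLen; i++) {
            double pos = i * step;                          // fractional position in the input
            int i0 = Math.min((int) pos, in.length - 1);    // sample to the left
            int i1 = Math.min(i0 + 1, in.length - 1);       // sample to the right
            double frac = pos - i0;
            out[i] = (short) Math.round(in[i0] * (1.0 - frac) + in[i1] * frac);
        }
        return out;
    }
}
```

Calling LinearResampler.resample(samples, 44100, 8000) steps through the input 5.5125 samples at a time, which replaces the 5/6/7 bookkeeping with a single fractional cursor.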
Another, more complex solution, but probably one with better results quality-wise, would be to first convert your samples into a frequency distribution using an FFT or DFT and then convert it back with the appropriate sample rate. It's been a while since I have done such a thing but it's definitely doable. You may need to fiddle around a bit to get it working properly, though.
Also when not taking a FT of the complete array but rather in segments you have the problem of the segment boundaries being 0. A few years ago when I played with those things I didn't come up with a viable solution to this (since it generates artifacts as well) but there probably is one if you read the right books :-)
As for WMP complaining about the file: You did modify the header you write accordingly, right?
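On the header point, here is a rough sketch of the fields that would need to change. It assumes the canonical 44-byte PCM WAV header with no extra chunks (real files may carry additional chunks, in which case the offsets differ); the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class WavHeaderPatch {
    /**
     * Updates the rate and size fields of a canonical 44-byte PCM WAV header in place.
     * Assumes the layout: RIFF size at 4, sample rate at 24, byte rate at 28,
     * block align at 32, data chunk size at 40 (all little-endian).
     */
    static void patch(byte[] header, int newSampleRate, int newDataBytes) {
        ByteBuffer b = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int blockAlign = b.getShort(32) & 0xFFFF;   // bytes per frame (all channels)
        b.putInt(4, 36 + newDataBytes);             // RIFF chunk size = file size - 8
        b.putInt(24, newSampleRate);                // samples per second
        b.putInt(28, newSampleRate * blockAlign);   // bytes per second
        b.putInt(40, newDataBytes);                 // size of the data chunk
    }
}
```

If the header still says 44100 Hz, or still reports the old data size, players may refuse the file or play it at the wrong speed.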
I have managed to play a sound file with a different speed using answers from here, but I need to be able to adjust the speed as it plays. There's two methods I've thought of using. The first is to split the audio file into short clips and play each one after the last ends. I haven't tried that yet, but it seems like it could easily end with the file playing over itself or having short gaps.
The other method is to take the original file as a stream and then make a stream using that that speeds it up or slows it down as needed. This seems like it would work well, but in order to construct an AudioInputStream, I either need an InputStream of known length, which is impossible to figure out ahead of time, or a TargetDataLine, which is an interface that has way more methods than I'd care to implement.
Is there a better way of doing this? Also, why does AudioInputStream need to know the length of the stream?
Alternately, is there an external library I could use?
If you are simply playing back an audio file (e.g., a .wav) and are okay with the pitch of the sound being shifted, a simple possibility is to read the data from an AudioInputStream, translate to PCM, interpolate through that data at the desired rate, translate back to bytes and ship out via a SourceDataLine.
To speed up or slow down in real time, loosely couple inputs to the variable holding the increment being used to progress through the incoming frames. To minimize discontinuities, you can smooth out the transitions from one pitch to another over a given number of frames.
This is done to achieve real-time frequency changes in the open source library AudioCue, on github. Smoothing there between frequency changes is set to occur over 1028 frames (approx 1/40th of a second). But quicker changes are certainly possible. The sound data in that library is taken from an internal float array of PCM values. But a good example of code needed to read the data as a line rather than a fixed array can be seen in the first code example in the Sound Trail, Using Files and Format Converters. You might be wanting to use an InputStream as the argument for the AudioInputStream. At the point in the example where it says "Here, do something useful.." you would convert to PCM and then cursor through the resulting PCM with the desired frequency rate, using linear interpolation, and then repackage and send out via a SourceDataLine.
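A minimal sketch of that cursor idea, assuming the PCM is already available as a float array of mono samples and the rate stays positive. This is not AudioCue's actual code; the class, fields and method names are invented for illustration.

```java
final class VariableRateReader {
    private final float[] pcm;          // mono PCM, assumed normalized to [-1, 1]
    private double cursor = 0.0;        // fractional read position in pcm
    private double rate = 1.0;          // current increment per output frame
    private double targetRate = 1.0;
    private double rateStep = 0.0;
    private int stepsLeft = 0;

    VariableRateReader(float[] pcm) {
        this.pcm = pcm;
    }

    /** Requests a new playback rate, reached gradually over smoothingFrames output frames. */
    synchronized void setRate(double newRate, int smoothingFrames) {
        targetRate = newRate;
        stepsLeft = Math.max(1, smoothingFrames);
        rateStep = (newRate - rate) / stepsLeft;
    }

    /** Fills out with interpolated frames; returns how many were written. */
    synchronized int read(float[] out) {
        int n = 0;
        while (n < out.length && cursor < pcm.length - 1) {
            int i = (int) cursor;
            float frac = (float) (cursor - i);
            out[n++] = pcm[i] * (1 - frac) + pcm[i + 1] * frac; // linear interpolation
            if (stepsLeft > 0) {                                 // glide toward the target rate
                rate += rateStep;
                if (--stepsLeft == 0) {
                    rate = targetRate;
                }
            }
            cursor += rate;
        }
        return n;
    }
}
```

A playback loop would call read, convert the floats back to bytes and write them to the SourceDataLine, while a GUI thread calls setRate whenever the speed control moves.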
If you wish to preserve pitch (time stretch or compress only) then this starts to require more heavy duty DSP. This thread at the StackExchange Digital Processing site has some info on that. I've had some success with making granules with a Hamming Window to aid cross-fading between them, but some of the other solutions were over my head (and I haven't been back to this problem in a long while). But it was possible to change the spacing of the granules in real time, if I remember correctly. Didn't sound as good as the Audacity tool's algorithm, though, but that's probably more on me than not. I'm pretty much self-taught and experimenting, not working in the field professionally.
(I believe Phil's answer will get you going nicely. I'm just posting this to add my two cents about resampling.)
Short answer: Create an AudioInputStream that either drops samples or adds zero samples. As length you can set AudioSystem.NOT_SPECIFIED.
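As a small illustration of the length argument, the following sketch wraps an arbitrary InputStream of raw PCM bytes. The 16-bit mono signed little-endian format is an assumption for the example, and the helper class is hypothetical.

```java
import java.io.InputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

final class UnknownLengthStream {
    /** Wraps raw PCM bytes of unknown total length in an AudioInputStream. */
    static AudioInputStream wrap(InputStream pcmBytes, float sampleRate) {
        // Assumed format: 16-bit, mono, signed, little-endian.
        AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
        return new AudioInputStream(pcmBytes, format, AudioSystem.NOT_SPECIFIED);
    }
}
```

The InputStream passed in can be your own subclass that drops or inserts samples as it is read.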
Long answer: If you add zero samples, you might want to interpolate, but not linearly. The reason you have to interpolate for upsampling is aliasing, which you might want to avoid. You do so, by applying a lowpass filter. The reason for this is simple. The Nyquist-Shannon theorem states that when a signal is sampled at X Hz, you can only unambiguously represent frequencies up to X/2 Hz. When you upsample, you increase the sample frequency, so in theory you can represent a larger frequency range. Indeed, when simply adding zeros you see some energy in those additional frequency ranges—which shouldn't be there, because you have no information about it. So you need to "cut them off" using a low pass filter. More about upsampling can be found on Wikipedia.
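To make the zero-insertion plus low-pass idea concrete, here is a rough sketch that upsamples by a factor of 2. The 31-tap Hamming-windowed sinc filter is just one simple choice made for this example (an odd tap count is assumed), not a recommendation, and all names are hypothetical.

```java
final class ZeroStuffUpsampler {
    /** Hamming-windowed sinc low-pass; cutoff is a fraction of the sample rate (0 to 0.5). */
    static double[] lowPassTaps(int numTaps, double cutoff) {
        double[] h = new double[numTaps];
        int mid = (numTaps - 1) / 2;
        double sum = 0;
        for (int k = 0; k < numTaps; k++) {
            int m = k - mid;
            double sinc = (m == 0)
                    ? 2 * cutoff
                    : Math.sin(2 * Math.PI * cutoff * m) / (Math.PI * m);
            double window = 0.54 - 0.46 * Math.cos(2 * Math.PI * k / (numTaps - 1));
            h[k] = sinc * window;
            sum += h[k];
        }
        for (int k = 0; k < numTaps; k++) {
            h[k] /= sum; // normalize to unity gain at DC
        }
        return h;
    }

    /** Inserts a zero after every sample, then low-pass filters at the old Nyquist frequency. */
    static double[] upsample2(double[] in) {
        double[] stuffed = new double[in.length * 2];
        for (int i = 0; i < in.length; i++) {
            stuffed[2 * i] = 2.0 * in[i]; // gain of 2 compensates for the inserted zeros
        }
        double[] h = lowPassTaps(31, 0.25); // 0.25 of the new rate = old Nyquist
        int mid = (h.length - 1) / 2;
        double[] out = new double[stuffed.length];
        for (int n = 0; n < out.length; n++) {
            double acc = 0;
            for (int k = 0; k < h.length; k++) {
                int idx = n + mid - k; // centered, so the output is not delayed
                if (idx >= 0 && idx < stuffed.length) {
                    acc += h[k] * stuffed[idx];
                }
            }
            out[n] = acc;
        }
        return out;
    }
}
```

Without the filter step the output would alternate between samples and zeros, which is exactly the extra high-frequency energy described above.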
Long story short, there is a proper way to do it. You seem to be OK with distortions, so doing it the right way may be unnecessary and a waste of time.
Shameless plug: If you nevertheless want to do it somewhat right, you might find the Resampler class of jipes useful. It's not a universal resampler, i.e., it only supports a limited number of factors, like 2, 4, ..., but it may prove useful for you.
import com.tagtraum.jipes.math.MultirateFilters.Resampler;
[...]
float[] original = ... ; // original signal as float
// Resampler(1, 2): downsample by a factor of 2
Resampler downsampler2 = new Resampler(1, 2);
float[] downsampled = downsampler2.map(original);
// Resampler(2, 1): upsample by a factor of 2
Resampler upsampler2 = new Resampler(2, 1);
float[] upsampled = upsampler2.map(original);
If you want time-scale modification (TSM), i.e., changing the tempo without changing the frequencies, you might want to use Rubberband for Java.
I am starting a project which would allow me to use Java to read sound samples, and depending on the properties of each sample (I'm thinking focusing on decibels at the moment for the sake of simplification, or finding some way to compute the overall 'volume' of a specific sample or set of samples), return a value from 0-255 where 0 would be silence and 255 would be the highest sound pressure (Compared to a reference point, I suppose? I have no idea how to word this). I want to then have these values returned as bytes and sent to an Arduino in order to control the intensity of LED's using PWM, and visually 'see' the music.
I am not any sort of audio file format expert, and have no particular understanding of how the data is stored in a music file. As such, I am having trouble finding out how to read a sample and find a way to represent its overall volume level as a byte. I have looked through the javax.sound.sampled package and it is all very confusing to me. Any insight as to how I could accomplish this would be greatly appreciated.
First, I suggest you read about pulse-code modulation (PCM), which is the format used to store data in a .wav file (the simplest to begin with).
Next, there is a post on how to get PCM data from a wav file in Java here.
Finally, to get the "volume" (which is actually more the energy), apply this energy equation.
Hope this helps.
As Bastyen (+1 from me) indicates, calculating decibels is actually NOT simple, but requires looking at a large number of samples. However, since sound samples run MUCH more frequently than visual frames in an animation, making an aggregate measure works out rather neatly.
A nice visual animation rate, for example, updates 60 times per second, and the most common sampling rate for sound is 44100 times per second. So, 735 samples (44100 / 60 = 735) might end up being a good choice for interfacing with a visualizer.
By the way, of all the official Java tutorials I've read (I am a big fan), I have found the ones that accompany the javax.sound.sampled to be the most difficult. http://docs.oracle.com/javase/tutorial/sound/TOC.html
But they are still worth reading. If I were in charge of a rewrite, there would be many more code examples. Some of the best code examples are several sections deep, e.g., the "Using Files and Format Converters" discussion.
If you don't wish to compute the RMS, a hack would be to store the local high and/or low value for the given number of samples. Relating these numbers to decibels would be dubious, but MAYBE could be useful after giving it a mapping of your choice to the visualizer. Part of the problem is that values for a single point on a given wave can range wildly. The local high might be more due to the phase of the constituent harmonics happening to line up than about the energy or volume.
Your PCM top and bottom values would probably NOT be 0 and 255, more likely -128 to 127 for signed 8-bit encoding. More common still is 16-bit encoding (-32768 to 32767). But you will get the hang of this if you follow Bastyen's links. To make your code independent of the bit-encoding, you would likely normalize the data (convert to floats between -1 and 1) before doing any other calculations.
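A rough sketch of how the pieces could fit together: take a block of normalized samples (for example 735 of them per animation frame), compute the RMS, and map it to 0..255 on a decibel scale. The -60 dB floor used in the usage note and the names below are arbitrary assumptions for illustration.

```java
final class LevelMapper {
    /** Root-mean-square of a block of samples normalized to [-1, 1]. */
    static double rms(float[] block) {
        if (block.length == 0) {
            return 0;
        }
        double sum = 0;
        for (float s : block) {
            sum += s * s;
        }
        return Math.sqrt(sum / block.length);
    }

    /**
     * Maps an RMS level to 0..255 on a decibel scale.
     * floorDb (negative, e.g. -60) is the level treated as silence; 0 dB is full scale.
     */
    static int toByteLevel(double rms, double floorDb) {
        if (rms <= 0) {
            return 0;
        }
        double db = 20 * Math.log10(rms);          // decibels relative to full scale
        double t = (db - floorDb) / -floorDb;      // 0 at the floor, 1 at full scale
        t = Math.max(0, Math.min(1, t));
        return (int) Math.round(t * 255);
    }
}
```

The value of toByteLevel(rms(block), -60) can then be cast to a byte and written to the serial port for the Arduino.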
I'm trying to write some basic sound-editing programs in Java, but I've been having a huge amount of trouble with my 16-bit WAVE file format.
When I asked Java how many samples it thought my sound file had, it gave me a number twice as big as I expected. When I told Java to generate a sine wave of 80000 byte samples, it played for 1 second instead of 2 (even though the sample rate was about 40000 per second).
After some more searching, I realized that the "frame size" of my file was 2, that a "sample" was actually 2 bytes instead of one, and that this was called a 16-bit audio file. As an experiment, I wrote my sound file to an array of bytes, set every other byte to 0, and played back the result. When I kept only the odd samples, the sound file played back with a tiny bit of static noise. When I kept only the even ones, that static noise played back on its own without the sound file. This makes me think that the even bytes contain the exact inverse of the static in the odd bytes, which contain the actual sound to be played. When played back together, the even bytes silence the static in the odd bytes, which increases the sound's fidelity.
This website has a pretty good explanation of the basics of 16-bit sound encodings. However, it's not quite good enough for me to go ahead and start editing the file byte by byte. How can I do byte-by-byte editing of a 16-bit (or larger) sound file while still preserving its higher fidelity? What's the formula for encoding sound with 16 bits per sample instead of just 8?
How can I do byte-by-byte editing of a 16-bit (or larger) sound file...?
That question does not make any sense. When you say "byte-by-byte editing", you really should be saying "sample-by-sample". In this case, every sample is 16 bits (or two bytes), and it does not make sense to split the samples apart. That would be like trying to edit only the top halves of each letter in a text editor.
A single channel of a digital audio stream is a sequence of numbers (a.k.a., samples). Each sample is a representation of the pressure exerted on a microphone diaphragm by the sound wave at some instant in time. In an eight bit sound file, there are only 256 possible values, whereas in a 16-bit sound file, there are 65536 possible values. A 16-bit file has much greater resolution.
This makes me think that the even bytes contain the exact inverse of the static in the odd bytes, which contain the actual sound to be played.
There's a kernel of truth to that. The definition of "noise" in signal processing is the difference between what you hear and what you wanted to hear. When you zeroed out all of the odd-numbered bytes, you were stomping on the low-order halves of each sample. By changing the samples, you were introducing something you didn't want to hear (i.e., noise). When you zeroed out the even-numbered bytes, you killed all of the high-order bits and therefore most of the signal. What remained in the low-order bytes was the exact inverse of the noise that you had introduced in your first experiment. (Your ears can't tell the difference between a given sound wave and the inverse of the same sound wave.)
There is no absolute mapping between sample values and pressure, but there are a couple of things you should know:
1) Are the samples signed or are they unsigned? Every sample has a value that must lie between some minimum and some maximum. If the (16-bit) samples are signed, then the minimum value is -32768 (0x8000), the maximum is 32767 (0x7FFF), and 0 is right in the middle. If the samples are unsigned, then the minimum is 0, and the maximum is 65535 (0xFFFF). Get it wrong, and you will know immediately because all you will hear is massive noise.
2) Are the samples linear? The sample values are always proportional to something. If they are directly proportional to the sound pressure level, that's called "linear encoding." But they may be proportional to the logarithm of the sound pressure, or to some other function of the sound pressure. Non-linear encodings are almost always 8-bit, and they usually are only encountered in specialized applications like telephony. If you are dealing with 16-bit or larger samples, then they are almost certainly linear.
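To illustrate sample-by-sample editing, here is a small sketch that assumes signed, linear, little-endian 16-bit samples (the usual layout in WAV files, but check your file's format). It joins each byte pair into one sample, edits the sample, and splits it back into bytes; the class and method names are made up for the example.

```java
final class Pcm16 {
    /** Joins pairs of bytes (low byte first) into signed 16-bit samples. */
    static short[] decodeLittleEndian(byte[] bytes) {
        short[] samples = new short[bytes.length / 2];
        for (int i = 0; i < samples.length; i++) {
            int lo = bytes[2 * i] & 0xFF;   // low-order byte, treated as unsigned
            int hi = bytes[2 * i + 1];      // high-order byte carries the sign
            samples[i] = (short) ((hi << 8) | lo);
        }
        return samples;
    }

    /** Splits signed 16-bit samples back into bytes, low byte first. */
    static byte[] encodeLittleEndian(short[] samples) {
        byte[] bytes = new byte[samples.length * 2];
        for (int i = 0; i < samples.length; i++) {
            bytes[2 * i] = (byte) samples[i];            // low-order byte
            bytes[2 * i + 1] = (byte) (samples[i] >> 8); // high-order byte
        }
        return bytes;
    }

    /** Example edit: change the volume, clamping to the valid 16-bit range. */
    static short[] scale(short[] samples, double gain) {
        short[] out = new short[samples.length];
        for (int i = 0; i < samples.length; i++) {
            long v = Math.round(samples[i] * gain);
            out[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, v));
        }
        return out;
    }
}
```

Any edit (gain, mixing, filtering) is then done on the short values, never on the individual bytes.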
I am working on a small example application for my fourth year project (dealing with Functional Reactive Programming). The idea is to create a simple program that can play a .wav file and then shows a 'bouncing' animation of the current volume of the playing song (like in audio recording software). I'm building this in Scala so have mainly been looking at Java libraries and existing solutions.
Currently, I have managed to play a .wav file easily but I can't seem to achieve the second goal. Basically, is there a way I can decode a .wav file so I can have some way of accessing the 'volume' at any given time? By volume I think I mean its amplitude but I may be wrong about this - Higher Physics was a while ago....
Clearly, I don't know much about this at all so it would be great if someone could point me in the right direction!
In digital audio processing you typically refer to the momentary peak amplitude of the signal (this is also called PPM -- peak programme metering). Depending on how accurate you want to be or if you wish to model some standardised metering or not, you could either
just use a sliding window of sample frames (find the maximum absolute value per window)
implement some sort of peak-hold mechanism that retains the last peak value for a given duration and then lets the value 'fall' by a given amount of decibels per second.
The other measuring mode is RMS which is calculated by integrating over a certain time window (add the squared sample values, divide by the window length, and take the square-root, thus root-mean-square RMS). This gives a better idea of the 'energy' of the signal, moving smoother than peak measurements, but not capturing the maximum values observed. This mode is sometimes called VU meter as well. You can approximate this with a sort of lagging (lowpass) filter, e.g. y[i] = y[i-1]*a + |x[i]|*(1-a), for some value 0 < a < 1
You typically display the values logarithmically, i.e. in decibels, as this corresponds better with our perception of signal strength and also for most signals produces a more regular coverage of your screen space.
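The measurements described above can be sketched in a few lines. The example is in Java since the question mentions looking at Java libraries (it is callable from Scala as-is); samples are assumed to be normalized to [-1, 1], and the names and the 1e-10 floor used to avoid log(0) are illustrative choices.

```java
final class Meter {
    /** Peak: the maximum absolute sample value in the window. */
    static double peak(float[] window) {
        double max = 0;
        for (float s : window) {
            max = Math.max(max, Math.abs(s));
        }
        return max;
    }

    /** RMS: square, average over the window, take the square root. */
    static double rms(float[] window) {
        if (window.length == 0) {
            return 0;
        }
        double sum = 0;
        for (float s : window) {
            sum += s * s;
        }
        return Math.sqrt(sum / window.length);
    }

    /** Converts a linear amplitude to decibels relative to full scale. */
    static double toDecibels(double amplitude) {
        return 20 * Math.log10(Math.max(amplitude, 1e-10)); // floor avoids log(0)
    }

    /** One step of the lagging filter: y[i] = a*y[i-1] + (1-a)*|x[i]|, with 0 < a < 1. */
    static double smooth(double previous, double x, double a) {
        return a * previous + (1 - a) * Math.abs(x);
    }
}
```

A meter view would call peak or rms once per window and display toDecibels of the result; values of a closer to 1 make the smoothed level respond more slowly.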
Three projects I'm involved with may help you:
ScalaAudioFile which you can use to read the sample frames from an AIFF or WAVE file
ScalaAudioWidgets which is a still young and incomplete project to provide some audio application widgets on top of scala-swing, including a PPM view -- just use a sliding window and set the window's current peak value (and optionally RMS) at a regular interval, and the view will take care of peak-hold and fall times
(ScalaCollider, a client for the SuperCollider sound synthesis system, which you might use to play back the sound file and measure the peak and RMS amplitudes in real time. The latter is probably overkill for your project and would involve a serious learning curve if you have never heard of SuperCollider. The advantage would be that you don't need to worry about synchronising your sound playback with the meter display)
In a wav file, the data at a given point in the stream IS the volume (shifted by half of the dynamic range). In other words, if you know the type of wav file (for example 8-bit, mono), each byte represents a single sample. If you know the sample rate (say 44100 Hz), then multiply the time in seconds by 44100 and that is the byte of the sample data you want to look at.
The value of the byte is the volume (distance from the middle: 0 and 255 are the peaks, 128 is zero). This is assuming that the encoding is not mu-law encoding. I found some good info on how to tell the difference, or better yet, convert between these formats here:
http://www.gnu.org/software/octave/doc/interpreter/Audio-Processing.html
You may want to average these samples though over a window of some fixed number of samples.
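For the 8-bit mono case described above, a sketch could look like this. It assumes unsigned linear PCM with silence at 128, and that data holds only the sample bytes (the header already skipped); the names are hypothetical.

```java
final class EightBitLevel {
    /** Index of the sample at a given time, for 8-bit mono data (one byte per sample). */
    static int indexAt(double seconds, int sampleRate) {
        return (int) Math.round(seconds * sampleRate);
    }

    /** Distance of an unsigned 8-bit sample from the midpoint (0 = silence, up to 128). */
    static int amplitude(byte sample) {
        return Math.abs((sample & 0xFF) - 128);
    }

    /** Average amplitude over a window of samples starting at index start. */
    static double averageAmplitude(byte[] data, int start, int windowSize) {
        int end = Math.min(data.length, start + windowSize);
        if (end <= start) {
            return 0;
        }
        long sum = 0;
        for (int i = start; i < end; i++) {
            sum += amplitude(data[i]);
        }
        return (double) sum / (end - start);
    }
}
```

For example, averageAmplitude(data, indexAt(1.5, 44100), 735) gives a level for roughly one animation frame starting 1.5 seconds in.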
What I want to do is convert a text string into a wav file using high frequencies (18500 Hz and above): this will be the encoder.
I also want to create an engine that decodes this text string from a wav recording. It will need error control, since I obviously won't be reading back the same file, but a recording of the sound.
Thanks
An important consideration will be whether or not you want to hide the string into an existing audio file (so it sounds like a normal file, but has an encoded message -- that is called steganography), or whether you will just be creating a file that sounds like gibberish, for the purpose of encoding data only. I'm assuming the latter since you didn't ask to hide a message in an existing file.
So I assume you are not looking for low-level details on writing WAV files (I am sure you can find documentation on how to read and write individual samples to a WAV file). Obviously, the simplest approach would be to simply take each byte of the source string, and store it as a sample in the WAV file (assuming an 8-bit recording. If it's a 16-bit recording, you can store two bytes per sample. If it's a stereo 16-bit recording, you can store four bytes per sample). Then you can just read the WAV file back in and read the samples back as bytes. That's the simple approach but as you say, you want to be able to make a (presumably analog) recording of the sound, and then read it back into a WAV file, and still be able to read the data.
With the approach above, if the analog recording is not exactly perfect (and how could it be), you would lose bytes of the message. This means you need to store the message in such a way that missing bytes, or bytes that have a slight error, are not going to be a problem. How you do this will depend highly upon exactly what sort of "damage" will be happening to the sound file. I would expect two major forms of damage:
"Vertical" damage: A sample (byte) would have a slightly higher or lower value than it originally had.
"Horizontal" damage: Samples may be averaged, stretched or squashed horizontally. From a byte perspective, this means some samples may be repeated, while others may be missing.
To combat this, you need some redundancy in the message. More redundancy means the message will take up more space (be longer), but will be more reliable.
I would recommend thinking about how touch-tone (DTMF) dialing on old (pre-mobile) telephones worked: each key generated a distinctive tone (in DTMF, actually a pair of tones) and sent it across the wire. The tones are long enough, and far enough apart pitch-wise that they can be distinguished even given the above forms of damage. So, choose two parameters: a) length and b) frequency-delta. For each byte of data, select a frequency, spacing the 256 byte values frequency-delta Hertz apart. Then, generate a sine wave for length milliseconds of that frequency. This encodes a lot more redundancy than the above one-byte-per-sample approach, since each byte takes up many samples, and if you lose some samples, it doesn't matter.
When you read them back in, read every length milliseconds of audio data and then estimate the frequency of the sine wave. Map this onto the byte value with the nearest frequency.
Obviously, longer values of length and further-apart frequency-delta will make the signal more reliable, but require the sound to be longer and higher-frequency, respectively. So you will have to play around with these values to see what works.
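A minimal sketch of this scheme, under several assumptions that are mine rather than part of the description above: a 48 kHz sample rate (so that all 256 tones starting at 18500 Hz stay below the Nyquist limit), length = 100 ms, frequency-delta = 20 Hz, and a decoder that already knows where each tone starts. The frequency is estimated by correlating each block against all 256 candidate tones and picking the strongest. A real recording would additionally need synchronization and tolerance to noise.

```java
final class ToneCodec {
    static final int SAMPLE_RATE = 48000;   // assumed; Nyquist limit is 24 kHz
    static final double BASE_HZ = 18500.0;  // frequency used for byte value 0
    static final double DELTA_HZ = 20.0;    // "frequency-delta" between byte values
    static final int TONE_SAMPLES = 4800;   // "length": 100 ms per byte

    /** Encodes each byte as a sine tone of TONE_SAMPLES samples. */
    static float[] encode(byte[] data) {
        float[] out = new float[data.length * TONE_SAMPLES];
        for (int b = 0; b < data.length; b++) {
            double freq = BASE_HZ + (data[b] & 0xFF) * DELTA_HZ;
            for (int n = 0; n < TONE_SAMPLES; n++) {
                out[b * TONE_SAMPLES + n] =
                        (float) (0.5 * Math.sin(2 * Math.PI * freq * n / SAMPLE_RATE));
            }
        }
        return out;
    }

    /** Energy of one block at a single frequency (a one-bin DFT). */
    static double power(float[] x, int offset, double freq) {
        double re = 0;
        double im = 0;
        for (int n = 0; n < TONE_SAMPLES; n++) {
            double phase = 2 * Math.PI * freq * n / SAMPLE_RATE;
            re += x[offset + n] * Math.cos(phase);
            im += x[offset + n] * Math.sin(phase);
        }
        return re * re + im * im;
    }

    /** Decodes block-aligned audio by choosing the strongest candidate tone per block. */
    static byte[] decode(float[] audio) {
        byte[] out = new byte[audio.length / TONE_SAMPLES];
        for (int b = 0; b < out.length; b++) {
            int best = 0;
            double bestPower = -1;
            for (int v = 0; v < 256; v++) {
                double p = power(audio, b * TONE_SAMPLES, BASE_HZ + v * DELTA_HZ);
                if (p > bestPower) {
                    bestPower = p;
                    best = v;
                }
            }
            out[b] = (byte) best;
        }
        return out;
    }
}
```

With these particular values every tone completes a whole number of cycles in 100 ms, which keeps the candidates easy to tell apart; shorter tones or a smaller frequency-delta would trade reliability for speed, as described above.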
Some last thoughts, since your title says "hidden" binary data:
If you really want the data to be "hidden", consider encrypting it before encoding it to audio.
If you want to take the steganography approach, you will have to read up on audio steganography (I imagine you can use the above techniques, but you will have to insert them as extremely low-volume signals on top of the existing sound).