I'm trying to write some basic sound-editing programs in Java, but I've been having a huge amount of trouble with my 16-bit WAVE file format.
When I asked Java how many samples it thought my sound file had, it gave me a number twice as big as I expected. When I told Java to generate a sine wave of 80000 byte-sized samples, it played for 1 second instead of 2 (even though the sample rate was about 40000 samples per second).
After some more searching, I realized that the "frame size" of my file was 2, that a "sample" was actually 2 bytes instead of one, and that this was called a 16-bit audio file. As an experiment, I wrote my sound file to an array of bytes, set every other byte to 0, and played back the result. When I kept only the odd bytes, the sound file played back with a tiny bit of static noise. When I kept only the even ones, that static noise played back on its own without the sound file. This makes me think that the even bytes contain the exact inverse of the static in the odd bytes, which contain the actual sound to be played. When played back together, the even bytes silence the static in the odd bytes, which increases the sound's fidelity.
This website has a pretty good explanation of the basics of 16-bit sound encodings. However, it's not quite good enough for me to go ahead and start editing the file byte by byte. How can I do byte-by-byte editing of a 16-bit (or larger) sound file while still preserving its higher fidelity? What's the formula for encoding sound with 16 bits per sample instead of just 8?
How can I do byte-by-byte editing of a 16-bit (or larger) sound file...?
That question does not make any sense. When you say "byte-by-byte editing", you really should be saying "sample-by-sample". In this case, every sample is 16 bits (or two bytes), and it does not make sense to split the samples apart. That would be like trying to edit only the top halves of each letter in a text editor.
A single channel of a digital audio stream is a sequence of numbers (a.k.a., samples). Each sample is a representation of the pressure exerted on a microphone diaphragm by the sound wave at some instant in time. In an eight bit sound file, there are only 256 possible values, whereas in a 16-bit sound file, there are 65536 possible values. A 16-bit file has much greater resolution.
This makes me think that the even bytes contain the exact inverse of the static in the odd bytes, which contain the actual sound to be played.
There's a kernel of truth to that. The definition of "noise" in signal processing is the difference between what you hear and what you wanted to hear. When you zeroed out all of the odd-numbered bytes, you were stomping on the low-order halves of each sample. By changing the samples, you were introducing something you didn't want to hear (i.e., noise). When you zeroed out the even-numbered bytes, you killed all of the high-order bits and therefore most of the signal. What remained in the low-order bytes was the exact inverse of the noise that you had introduced in your first experiment. (Your ears can't tell the difference between a given sound wave and the inverse of the same sound wave.)
There is no absolute mapping between sample values and pressure, but there are a couple of things you should know:
1) Are the samples signed or are they unsigned? Every sample has a value that must lie between some minimum and some maximum. If the (16-bit) samples are signed, then the minimum value is -32768 (0x8000), the maximum is 32767 (0x7FFF), and 0 is right in the middle. If the samples are unsigned, then the minimum is 0, and the maximum is 65535 (0xFFFF). Get it wrong, and you will know immediately because all you will hear is massive noise.
2) Are the samples linear? The sample values are always proportional to something. If they are directly proportional to the sound pressure level, that's called "linear encoding." But they may be proportional to the logarithm of the sound pressure or, to some other function of the sound pressure. Non-linear encodings are almost always 8-bit, and they usually are only encountered in specialized applications like telephony. If you are dealing with 16-bit or larger samples, then they are almost certainly linear.
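To make those two points concrete, here is a minimal sketch of sample-by-sample editing in Java, assuming signed 16-bit little-endian PCM (the byte order you will usually find in a WAV data chunk; for big-endian streams, swap the two byte indices). The method name and the volume change are only placeholders for whatever edit you actually want to make:

    /** A minimal sketch: halve the volume of signed 16-bit little-endian PCM, sample by sample. */
    static void halveVolume(byte[] data) {
        for (int i = 0; i + 1 < data.length; i += 2) {
            // Assemble one 16-bit sample: low-order byte first, then high-order byte.
            int lo = data[i] & 0xFF;      // mask to prevent sign extension
            int hi = data[i + 1];         // the high byte carries the sample's sign
            short sample = (short) ((hi << 8) | lo);

            // Edit the sample as a single number, never its two bytes in isolation.
            sample = (short) (sample / 2);

            // Split the edited sample back into two bytes.
            data[i]     = (byte) (sample & 0xFF);
            data[i + 1] = (byte) ((sample >> 8) & 0xFF);
        }
    }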
I am starting a project which would allow me to use Java to read sound samples and, depending on the properties of each sample (I'm thinking of focusing on decibels at the moment for the sake of simplification, or finding some way to compute the overall 'volume' of a specific sample or set of samples), return a value from 0-255, where 0 would be silence and 255 would be the highest sound pressure (compared to a reference point, I suppose? I have no idea how to word this). I want to then have these values returned as bytes and sent to an Arduino in order to control the intensity of LEDs using PWM, and visually 'see' the music.
I am not any sort of audio file format expert, and have no particular understanding of how the data is stored in a music file. As such, I am having trouble finding out how to read a sample and find a way to represent its overall volume level as a byte. I have looked through the javax.sound.sampled package and it is all very confusing to me. Any insight as to how I could accomplish this would be greatly appreciated.
First, I suggest you read about Pulse-code modulation, which is the format used to store data in a .wav file (the simplest format to begin with).
Next, there is a post on how to get PCM data from a WAV file in Java here.
Finally, to get the "volume" (which is really more like the energy), apply this energy equation.
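For reference, the energy measure usually used for this is the root mean square (RMS) of a window of samples. A small sketch of that calculation, assuming you have already decoded the PCM data into normalized doubles (the method name is mine, not from any library):

    /** RMS "volume" of one window of samples, each assumed to lie in [-1, 1]. */
    static double rms(double[] window) {
        double sumOfSquares = 0.0;
        for (double s : window) {
            sumOfSquares += s * s;
        }
        return Math.sqrt(sumOfSquares / window.length);
    }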
I hope this helps you.
As Bastyen (+1 from me) indicates, calculating decibels is actually NOT simple, but requires looking at a large number of samples. However, since sound samples run MUCH more frequently than visual frames in an animation, making an aggregate measure works out rather neatly.
A nice visual animation rate, for example, updates 60 times per second, and the most common sampling rate for sound is 44100 times per second. So, 735 samples (44100 / 60 = 735) might end up being a good choice for interfacing with a visualizer.
By the way, of all the official Java tutorials I've read (I am a big fan), I have found the ones that accompany the javax.sound.sampled package to be the most difficult. http://docs.oracle.com/javase/tutorial/sound/TOC.html
But they are still worth reading. If I were in charge of a rewrite, there would be many more code examples. Some of the best code examples are buried several sections deep, e.g., in the "Using Files and Format Converters" discussion.
If you don't wish to compute the RMS, a hack would be to store the local high and/or low value for the given number of samples. Relating these numbers to decibels would be dubious, but MAYBE they could be useful after you give them a mapping of your choice to the visualizer. Part of the problem is that values for a single point on a given wave can range wildly. The local high might be due more to the phase of the constituent harmonics happening to line up than to the energy or volume.
Your PCM top and bottom values would probably NOT be 0 and 256, more likely -128 to 127 for 8-bit encoding. More common still is 16-bit encoding (-32768 to 32767). But you will get the hang of this if you follow Bastyen's links. To make your code independent of the bit-encoding, you would likely normalize the data (convert to floats between -1 and 1) before doing any other calculations.
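Putting those pieces together, here is a rough sketch (the method name and the 60-updates-per-second chunking are my own choices, not a library API) of turning one chunk of signed 16-bit little-endian mono PCM into a 0-255 level for the Arduino; for stereo data you would first average or pick one channel:

    /** Map one chunk of signed 16-bit little-endian mono PCM to a 0-255 level. */
    static int chunkToLevel(byte[] chunk) {
        int sampleCount = chunk.length / 2;          // e.g. 735 samples at 44100 Hz / 60 updates per second
        double sumOfSquares = 0.0;
        for (int i = 0; i < sampleCount; i++) {
            int lo = chunk[2 * i] & 0xFF;
            int hi = chunk[2 * i + 1];               // high byte keeps its sign
            double normalized = ((short) ((hi << 8) | lo)) / 32768.0;  // now in [-1, 1)
            sumOfSquares += normalized * normalized;
        }
        double rms = Math.sqrt(sumOfSquares / sampleCount);  // 0.0 (silence) up to roughly 1.0
        return (int) Math.min(255, Math.round(rms * 255));
    }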
What is the low level actual format of sound data when read from a stream in Java? For example, use the following dataline with 44.1khz sample rate, 16 bit sample depth, 2 channels, signed data, bigEndian format.
AudioFormat format = new AudioFormat(44100, 16, 2, true, true);
TargetDataLine tdLine = AudioSystem.getTargetDataLine(format); // TargetDataLine is an interface, so it is obtained rather than constructed with new
I understand that it is sampling 44100 times a second and each sample is 16bits. What I don't understand is what the 16 bits, or each of the 16 bits, represent. Also, does each channel have its own 16bit sample?
I'll start with your last question first: yes, each channel has its own 16-bit sample for each of the 44100 samples taken every second.
As for your first question, you have to know about the hardware inside of a speaker. There is a diaphragm and an electromagnet. The diaphragm is the big round part you can see if you take the cover off. When the electromagnet is charged, it pulls or pushes a ferrous plate that is attached to the diaphragm, causing it to move. That movement becomes a sound.
The value of each sample is how much electricity is sent to the speaker. So when a sample is zero, the diaphragm is at rest. When it is positive it is pushed one way and when it is negative, the other way. The larger the sample, the more the diaphragm is moved.
If you graphed all of the samples in your data, you would have a graph of the movement of the speaker over time.
You should learn about digital audio basics (Wikipedia gives you a start and lots of links for further reading). After that, "44.1 kHz sample rate, 16-bit sample depth, 2 channels, signed data, big-endian format" should immediately tell you the low-level format.
In this case it means 44100 samples per second, with 16-bit signed integers representing each sample; finally, the endianness determines in which order the bytes of a 16-bit int are put into the stream (big endian = most significant byte first).
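As a sketch (the method is mine, not part of javax.sound.sampled), this is how the bytes in one buffer read from that line break down into samples, given the format above: 16-bit signed, big-endian, 2 channels, so each 4-byte frame holds a left sample followed by a right sample.

    /** Split a buffer of 16-bit signed big-endian stereo frames into left/right samples. */
    static void dumpFrames(byte[] buffer, int bytesRead) {
        for (int i = 0; i + 3 < bytesRead; i += 4) {
            short left  = (short) ((buffer[i]     << 8) | (buffer[i + 1] & 0xFF));
            short right = (short) ((buffer[i + 2] << 8) | (buffer[i + 3] & 0xFF));
            // Each value is a signed amplitude between -32768 and 32767 at one instant.
            System.out.printf("left=%d right=%d%n", left, right);
        }
    }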
What I want to do is convert a text string into a WAV file using high frequencies (18500 Hz and above): this will be the encoder.
And then create an engine to decode this text string from a WAV-formatted recording, with error control, since obviously I will not be reading back the same file but a recording of that sound.
Thanks
An important consideration will be whether or not you want to hide the string into an existing audio file (so it sounds like a normal file, but has an encoded message -- that is called steganography), or whether you will just be creating a file that sounds like gibberish, for the purpose of encoding data only. I'm assuming the latter since you didn't ask to hide a message in an existing file.
So I assume you are not looking for low-level details on writing WAV files (I am sure you can find documentation on how to read and write individual samples to a WAV file). Obviously, the simplest approach would be to simply take each byte of the source string and store it as a sample in the WAV file (assuming an 8-bit recording; if it's a 16-bit recording, you can store two bytes per sample, and if it's a stereo 16-bit recording, four bytes per sample). Then you can just read the WAV file back in and read the samples back as bytes. That's the simple approach, but as you say, you want to be able to make a (presumably analog) recording of the sound, and then read it back into a WAV file, and still be able to read the data.
With the approach above, if the analog recording is not exactly perfect (and how could it be), you would lose bytes of the message. This means you need to store the message in such a way that missing bytes, or bytes that have a slight error, are not going to be a problem. How you do this will depend highly upon exactly what sort of "damage" will be happening to the sound file. I would expect two major forms of damage:
"Vertical" damage: A sample (byte) would have a slightly higher or lower value than it originally had.
"Horizontal" damage: Samples may be averaged, stretched or squashed horizontally. From a byte perspective, this means some samples may be repeated, while others may be missing.
To combat this, you need some redundancy in the message. More redundancy means the message will take up more space (be longer), but will be more reliable.
I would recommend thinking about how old (pre-mobile) telephone dial tones worked: each key generated a unique tone and sent it across the wire. The tones are long enough, and far enough apart pitch-wise that they can be distinguished even given the above forms of damage. So, choose two parameters: a) length and b) frequency-delta. For each byte of data, select a frequency, spacing the 256 byte values frequency-delta Hertz apart. Then, generate a sine wave for length milliseconds of that frequency. This encodes a lot more redundancy than the above one-byte-per-sample approach, since each byte takes up many samples, and if you lose some samples, it doesn't matter.
When you read them back in, read every length milliseconds of audio data and then estimate the frequency of the sine wave. Map this onto the byte value with the nearest frequency.
Obviously, longer values of length and further-apart frequency-delta will make the signal more reliable, but require the sound to be longer and higher-frequency, respectively. So you will have to play around with these values to see what works.
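Here is a rough sketch of the encoding side described above; toneMillis and deltaHz correspond to the length and frequency-delta parameters, and every name and number is a placeholder to experiment with rather than a known-good value:

    /** Encode each byte of the message as one sine-wave tone lasting toneMillis milliseconds. */
    static short[] encode(byte[] message, double toneMillis, double baseHz,
                          double deltaHz, int sampleRate) {
        int samplesPerTone = (int) (sampleRate * toneMillis / 1000.0);
        short[] out = new short[message.length * samplesPerTone];
        int pos = 0;
        for (byte b : message) {
            // Byte value 0..255 maps onto the frequencies baseHz, baseHz + deltaHz, ...
            double freq = baseHz + (b & 0xFF) * deltaHz;
            for (int i = 0; i < samplesPerTone; i++) {
                double t = (double) i / sampleRate;
                out[pos++] = (short) (0.8 * Short.MAX_VALUE * Math.sin(2 * Math.PI * freq * t));
            }
        }
        return out;   // write these samples out as a 16-bit WAV to get the audio file
    }

For example, encode(message, 50, 18500, 10, 44100) would spread the 256 possible byte values between 18500 Hz and roughly 21050 Hz, which still fits under the 22050 Hz Nyquist limit of a 44100 Hz file; whether a real speaker and microphone can reproduce those frequencies cleanly is something you would have to test.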
Some last thoughts, since your title says "hidden" binary data:
If you really want the data to be "hidden", consider encrypting it before encoding it to audio.
If you want to take the steganography approach, you will have to read up on audio steganography (I imagine you can use the above techniques, but you will have to insert them as extremely low-volume signals on top of the existing sound).
I'm trying to invert a sound wave (phase shift 180 degrees), but I'm not exactly sure how I would go about doing this. Can any audio programmers point me in the right direction?
Inverting a sound wave should be generally easy if you have access to the byte array that makes up the sound. You simply need to take the negative of each value in the stream.
Audio streams come in many different flavors, so it's impossible to be specific. However, if it were a 16-bit PCM stream, which is full of 2-byte values, you'd loop over the data and, for each two bytes in the stream, cast them to a short, take the negative of it, and put it back into the byte stream.
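For example, a sketch assuming signed 16-bit little-endian PCM (swap the two byte indices for big-endian data):

    /** Invert (180-degree phase shift) signed 16-bit little-endian PCM in place. */
    static void invert(byte[] data) {
        for (int i = 0; i + 1 < data.length; i += 2) {
            short sample = (short) ((data[i + 1] << 8) | (data[i] & 0xFF));
            sample = (short) -sample;      // note: -(-32768) overflows back to -32768
            data[i]     = (byte) (sample & 0xFF);
            data[i + 1] = (byte) ((sample >> 8) & 0xFF);
        }
    }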
Hi, I need to downsample a WAV audio file's sample rate from 44.1 kHz to 8 kHz. I have to do all the work manually with a byte array... it's for academic purposes.
I am currently using 2 classes, Sink and Source, to pop and push arrays of bytes. Everything goes well until I reach the part where I need to downsample the data chunk using a linear interpolation.
Since I'm downsampling from 44100 to 8000 Hz, how do I interpolate a byte array containing something like 128 000 000 bytes? Right now I'm popping 5, 6 or 7 bytes depending on whether i%2 == 0, i%2 == 1 or i%80 == 0, and pushing the average of these 5, 6 or 7 bytes into the new file.
The result is indeed a smaller audio file than the original but it cannot be played on windows media player (says there is an error while reading the file) and there is a lot of noise although I can hear the right track behind the noise.
So, to sum things up, I need help concerning the linear interpolation part. Thanks in advance.
I think you shouldn't use the average of those samples, as that acts as a crude averaging (low-pass) filter rather than actual downsampling. Just use every 5th/6th/7th sample and write that to the new file.
That will probably have some aliasing artifacts but might overall be recognizable.
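A naive decimation sketch along those lines, assuming you have already paired the bytes up into 16-bit samples rather than working on individual bytes (the fractional step is what produces the 5/6/7 pattern you describe, since 44100 / 8000 = 5.5125):

    /** Keep roughly every 5.5th sample: crude 44100 Hz -> 8000 Hz decimation, no filtering. */
    static short[] decimate(short[] input, int inRate, int outRate) {
        int outLength = (int) ((long) input.length * outRate / inRate);
        short[] output = new short[outLength];
        double step = (double) inRate / outRate;       // 5.5125 for 44100 -> 8000
        for (int i = 0; i < outLength; i++) {
            output[i] = input[(int) (i * step)];       // nearest earlier sample
        }
        return output;
    }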
Another, more complex solution, but probably one with better results quality-wise, would be to first convert your samples into a frequency distribution using an FFT or DFT and then convert it back with the appropriate sample rate. It's been a while since I have done such a thing, but it's definitely doable. You may need to fiddle around a bit to get it working properly, though.
Also, when taking an FT not of the complete array but rather of segments, you have the problem of the segment boundaries being forced to 0. A few years ago, when I played with those things, I didn't come up with a viable solution to this (since it generates artifacts as well), but there probably is one if you read the right books :-)
As for WMP complaining about the file: You did modify the header you write accordingly, right?