Format of the bytes read through AudioSystem.read()

Format of the bytes read through AudioSystem.read() - java

I want to know the format of the bytes returned from TargetDataLine.read(). I get numbers but I don't know what they represent. Any idea? I've Googled to try and find out and read the API documentation but can't work it out...

It depends on the bit-depth and encoding of the audio stream you're trying to read from. Assuming each data stream represents a single channel of audio then there won't be interleaved channels to worry about (i.e. the stream is mono rather than stereo or surround sound). The audio format of the object whose class implements TargetDataLine will determine how the data should be interpreted when read.
If the stream represents audio in the most common audio format, which is Red Book Audio (16-bit samples at a sample rate of 44.1 kHz), then every two bytes will represent a sample; that sample will be a signed 16-bit integer of linearly-distributed amplitude values ranging from positive-displacement 0 dBFS (maximum positive integer value), through negative infinity decibels (integer value 0) to negative-displacement 0 dBFS (maximum negative integer value). Similarly, if the bit-depth is 24-bits then every 3 bytes will represent a single LPCM sample.
If the bit-depth is 32-bit or higher then the format will be signed floating-point PCM samples.

Related

How is a variable actually stored in memory for Java?

I'm learning about Text I/O and Binary I/O in java right now. I read that each value that you write to a file is initially stored in binary. For text I/O, the individual digits are converted to it's corresponding Unicode values and then encoded to the file-specific encoding such as ASCII. For binary I/O, the binary value is directly represented in the file. For example, 199 would be represented as 0xC7 which in binary is 11000111. Now I'm confused on one part. If a variable is initially stored as a binary format, does each digit represent a separate byte that is stored or is the entirety of the number stored as a single byte. For example, is 199 originally stored as 0xc7 which would be 11000111 in binary? Or would it be stored in 3 bytes with each byte representing the binary value for the digit. If it was stored in 3 separate bytes, does binary I/O convert that 3 byte number to a single byte? If it's stored in a single byte, how does text I/O translate that single byte into 3 separate byte values. I'm just confused on how to word this. Hope you can understand what I'm getting at. Thanks

The only thing which a computer is capable of dealing with are sets of 0/1 bits which are stored in memory or, if you wish on a storage device. Those bits can be streamed to monitors and converted to characters by graphical hardware. Sams story with keyboards, you type a key and a few bits of data will be send to the computer.
Bits are stored in memory and are accessible by memory addresses. The addresses are also sets of bits.
For practical reasons the bits are grouped into bytes, words, long words, ... A byte used to be the smallest addressable unit of bits and historically ended up as a group of 8 bits, which is currently used in most of the hardware. Modern memory can store data in multiple byte addressable chunks. Same for the disk, you store data there, using specific addressing mechanisms. But in any case those are just sets of bits.
What you are confused about is the interpretation of those bits. They can represent integer numbers, floating point numbers, characters, addresses, ... The way they are interpreted only depends on the program which uses them.
Characters do not exist in the computer. They are just an abstraction which is provided by programming languages. The programs interpret the bits stored on the computer. There are standards. For example the ASCII encoding maps English characters plus a few special characters into numbers from 0 to 127. Those fit into a single byte (leaving number 128 to 255 for special use). A print command will read those bytes one by one and send them to graphics to form letters on the screen as specified in the encoding standard. Different encoding scheme will display the same bytes differently.
If you write a program wit the "hello world" sting in it, the program will convert the symbols between quotes into a set of 11 ascii bytes. (In 'c' it will add yet another byte which is equal to '0' and ends the string this way). Unicode is yet another way to represent characters. Every unicode character is represented by multiple bytes of data. There are other schemes as well. One thing to pay attention to. If you write strings on the disk using certain encoding, you should read them with the same encoding, or your prints will give you garbage. But you can always read and copy then as binary data without interpretation.
So, any variable of any type is just an abstraction and always consists of bytes of data which your program knows how to interpret based on the data type and/or operations it wants to perform. Variables of type int, double, any java object, including String, are just sets of bytes of different sizes. Only the program (and java interpreter is a program) knows what to do with them, use them in calculations or display as characters.

What does interleaved stereo PCM linear Int16 big endian audio look like?

I know that there are a lot of resources online explaining how to deinterleave PCM data. In the course of my current project I have looked at most of them...but I have no background in audio processing and I have had a very hard time finding a detailed explanation of how exactly this common form of audio is stored.
I do understand that my audio will have two channels and thus the samples will be stored in the format [left][right][left][right]...
What I don't understand is what exactly this means. I have also read that each sample is stored in the format [left MSB][left LSB][right MSB][right LSB]. Does this mean the each 16 bit integer actually encodes two 8 bit frames, or is each 16 bit integer its own frame destined for either the left or right channel?
Thank you everyone. Any help is appreciated.
Edit: If you choose to give examples please refer to the following.
Method Context
Specifically what I have to do is convert an interleaved short[] to two float[]'s each representing the left or right channel. I will be implementing this in Java.
public static float[][] deinterleaveAudioData(short[] interleavedData) {
//initialize the channel arrays
float[] left = new float[interleavedData.length / 2];
float[] right = new float[interleavedData.length / 2];
//iterate through the buffer
for (int i = 0; i < interleavedData.length; i++) {
//THIS IS WHERE I DON'T KNOW WHAT TO DO
}
//return the separated left and right channels
return new float[][]{left, right};
}
My Current Implementation
I have tried playing the audio that results from this. It's very close, close enough that you could understand the words of a song, but is still clearly not the correct method.
public static float[][] deinterleaveAudioData(short[] interleavedData) {
//initialize the channel arrays
float[] left = new float[interleavedData.length / 2];
float[] right = new float[interleavedData.length / 2];
//iterate through the buffer
for (int i = 0; i < left.length; i++) {
left[i] = (float) interleavedData[2 * i];
right[i] = (float) interleavedData[2 * i + 1];
}
//return the separated left and right channels
return new float[][]{left, right};
}
Format
If anyone would like more information about the format of the audio the following is everything I have.
Format is PCM 2 channel interleaved big endian linear int16
Sample rate is 44100
Number of shorts per short[] buffer is 2048
Number of frames per short[] buffer is 1024
Frames per packet is 1

I do understand that my audio will have two channels and thus the samples will be stored in the format [left][right][left][right]... What I don't understand is what exactly this means.
Interleaved PCM data is stored one sample per channel, in channel order before going on to the next sample. A PCM frame is made up of a group of samples for each channel. If you have stereo audio with left and right channels, then one sample from each together make a frame.
Frame 0: [left sample][right sample]
Frame 1: [left sample][right sample]
Frame 2: [left sample][right sample]
Frame 3: [left sample][right sample]
etc...
Each sample is a measurement and digital quantization of pressure at an instantaneous point in time. That is, if you have 8 bits per sample, you have 256 possible levels of precision that the pressure can be sampled at. Knowing that sound waves are... waves... with peaks and valleys, we are going to want to be able to measure distance from the center. So, we can define center at 127 or so and subtract and add from there (0 to 255, unsigned) or we can treat those 8 bits as signed (same values, just different interpretation of them) and go from -128 to 127.
Using 8 bits per sample with single channel (mono) audio, we use one byte per sample meaning one second of audio sampled at 44.1kHz uses exactly 44,100 bytes of storage.
Now, let's assume 8 bits per sample, but in stereo at 44.1.kHz. Every other byte is going to be for the left, and every other is going to be for the R.
LRLRLRLRLRLRLRLRLRLRLR...
Scale it up to 16 bits, and you have two bytes per sample (samples set up with brackets [ and ], spaces indicate frame boundaries)
[LL][RR] [LL][RR] [LL][RR] [LL][RR] [LL][RR] [LL][RR]...
I have also read that each sample is stored in the format [left MSB][left LSB][right MSB][right LSB].
Not necessarily. The audio can be stored in any endianness. Little endian is the most common, but that isn't a magic rule. I do think though that all channels go in order always, and front left would be channel 0 in most cases.
Does this mean the each 16 bit integer actually encodes two 8 bit frames, or is each 16 bit integer its own frame destined for either the left or right channel?
Each value (16-bit integer in this case) is destined for a single channel. Never would you have two multi-byte values smashed into each other.
I hope that's helpful. I can't run your code but given your description, I suspect you have an endian problem and that your samples aren't actual big endian.

Let's start by getting some terminology out of the way
A channel is a monaural stream of samples. The term does not necessarily imply that the samples are contiguous in the data stream.
A frame is a set of co-incident samples. For stereo audio (e.g. L & R channels) a frame contains two samples.
A packet is 1 or more frames, and is typically the minimun number of frames that can be processed by a system at once. For PCM Audio, a packet often contains 1 frame, but for compressed audio it will be larger.
Interleaving is a term typically used for stereo audio, in which the data stream consists of consecutive frames of audio. The stream therefore looks like L1R1L2R2L3R3......LnRn
Both big and little endian audio formats exist, and depend on the use-case. However, it's generally ever an issue when exchanging data between systems - you'll always use native byte-order when processing or interfacing with operating system audio components.
You don't say whether you're using a little or big endian system, but I suspect it's probably the former. In which case you need to byte-reverse the samples.
Although not set in stone, when using floating point samples are usually in the range -1.0<x<+1.0, so you want to divide the samples by 1<<15. When 16-bit linear types are used, they are typically signed.
Taking care of byte-swapping and format conversions:
int s = (int) interleavedData[2 * i];
short revS = (short) (((s & 0xff) << 8) | ((s >> 8) & 0xff))
left[i] = ((float) revS) / 32767.0f;

Actually your are dealing with an almost typical WAVE file at Audio CD quality, that is to say :
2 channels
sampling rate of 44100 kHz
each amplitude sample quantized on a 16-bits signed integer
I said almost because big-endianness is usually used in AIFF files (Mac world), not in WAVE files (PC world). And I don't know without searching how to deal with endianness in Java, so I will leave this part to you.
About how the samples are stored is quite simple:
each sample takes 16-bits (integer from -32768 to +32767)
if channels are interleaved: (L,1),(R,1),(L,2),(R,2),...,(L,n),(R,n)
if channels are not: (L,1),(L,2),...,(L,n),(R,1),(R,2),...,(R,n)
Then to feed an audio callback, it is usually required to provide 32-bits floating point, ranging from -1 to +1. And maybe this is where something may be missing in your aglorithm. Dividing your integers by 32768 (2^(16-1)) should make it sound as expected.

I ran into a similar issue with de-interleaving the short[] frames that came in through Spotify Android SDK's onAudioDataDelivered().
The documentation for onAudioDelivered was poorly written a year ago. See Github issue. They've updated the docs with a better description and more accurate parameter names:
onAudioDataDelivered(short[] samples, int sampleCount, int sampleRate, int channels)
What can be confusing is that samples.length can be 4096. However, it contains only sampleCount valid samples. If you're receiving stereo audio, and sampleCount = 2048 there are only 1024 frames (each frame has two samples) of audio in samples array!
So you'll need to update your implementation to make sure you're working with sampleCount and not samples.length.

Reading amplitude at the beginning and end of a 16 bit WAV file in Java

I'm reading in 16 bit WAV files in Java (apparently always little-endian). I need to check the amplitude at the beginning and end of a WAV file. I'm hoping for silence at the start and end of clips but need to report on a scale if not. The files are always accessible locally. I've read about converting the file to a byte array and that converting each byte to a signed integer representation of the hex gives the amplitude but (if this is the case) I'm confused about how to apply this to audio that would need to be split across 2 bytes per sample. I've also read about bit-shifting but I'm unsure if it's relevant if I use a byte array.
To clarify, I'd rather not use unnecessary imports if possible but could and I don't have to use bytes to divide up the WAV, I only need a reliable way to present the amplitude at particular points in the array (start and end).

If I need 44000 bytes per second, how many shorts do I need per 1/10th of a second

I have a thread that takes screen shots using java.awt.Robot and then encodes them into a video using Xuggler in a loop.
The loop encodes the image, then makes the thread sleep for some time depending upon the frame rate.
All good so far. The problems arise when I try to encode audio.
Specifically, maintaining the sample rate and size
I am using TargetDataLine to read data into a byte[]. This data is already BigEndian formatted.
The magic is in providing proper amount of data at proper time.
My AudioFormat looks like this:
Sample Rate: 44000Hz
Sample Size In Bits: 16
Signed: true
BigEndian: true
Assuming 10fps and 44000Hz sample rate, I will need to provide
what should be the size of the byte[]?
how much data? (measured in short because that is what Xuggler wants)
and at what time do I call the encodeAudio() method? I mean after 10 passes of the loop or 5 passes, etc.
Misc:
Community member Alex I gave me this formula:
shortArray.length == ((timeStamp - lastTimeStamp) / 1e+9) * sampleRate * channels;
a rough calculation got me the answer of 4782 shorts for one second.
I know when you pass the audio to be encoded it must be for one full second
So I must capture 480 shorts per pass, and then encode it finally after the 10th pass.
Please tell me if this deduction is correct?

If you have 16 bits (two bytes) at a sample rate of 44,000 Hz that is 88,000 bytes per second. If you have stereo it is double that. In 1/10 second you need a 1/10th of that. i.e. 8,800 bytes per deci-second (1/10th of a second)
How do i retrieve short[] from ByteBuffer that has a byte[] as backing array
byte[] bytes = { };
ByteBuffer bb = ByteBuffer.wrap(bytes);
short[] shorts = new short[bb.remaining()/2];
bb.order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(shorts);
If the order is BigEndian you don't need to change it.

How is audio stored in Java

This is a quick question, how is audio stored when it's in a byte array, like when a image is stored in a byte array there is three(Red, green, blue) bytes per pixel. So how is a audio stored in a byte array?
Thanks,
Liam.

There are various possible encodings that are supported in Java. See:
http://docs.oracle.com/javase/1.5.0/docs/api/javax/sound/sampled/AudioFormat.html
http://docs.oracle.com/javase/1.5.0/docs/api/javax/sound/sampled/AudioFormat.Encoding.html
The most simple form is PCM coding, in which each sample is a linear number that represents the sound waveform (which could be 1 byte for 8-bit encoding).
You also have to consider the number of channels (1 for mono, 2 for stereo). So 16-bit PCM-encoded stereo sound will require 4 bytes per sample, for example.

It's a combination of signals (analog / digital) having a unique frequency for each and every tone. And as it's said in the previous answer, yes, Pulse Code Modulation (PCM) is supported in java.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.