I've been working on implementing a system for real-time audio capture and analysis within an existing music software project. The goal of this system is to begin capturing audio when the user presses the record button (or after a specified count-in period), determine the notes the user sings or plays, and notate these notes on a musical staff. The gist of my method is to use one thread to capture chunks of audio data and put them into a queue, and another thread to remove the data from the queue and perform the analysis.
This scheme works well, but I am having trouble quantifying the latency between the onset of audio capture and playback of the MIDI backing instruments. Audio capture begins before the MIDI instruments begin playing back, and the user is presumably going to be synchronizing his or her performance with the MIDI instruments. Therefore, I need to ignore audio data captured before the backing MIDI instruments begin playing and only analyze audio data collected after that point.
Playback of the backing tracks is handled by a body of code that has been in place for quite a while and maintained by someone else, so I would like to avoid refactoring the whole program if possible. Audio capture is controlled with a Timer object and a class that extends TimerTask, instances of which are created in a lumbering (~25k lines) class called Notate. Notate also keeps tabs on the objects that handle playback of the backing tracks, by the way. The Timer’s .scheduleAtFixedRate() method is used to control periods of audio capture, and the TimerTask notifies the capture thread to begin by calling .notify() on the queue (ArrayBlockingQueue).
My strategy for calculating the time gap between the initialization of these two processes has been to subtract the timestamp taken just before capture begins (in milliseconds) from the timestamp taken at the moment playback begins, which I'm defining as the moment the .start() method is called on the Java Sequencer object in charge of the MIDI backing tracks. I then use the result to determine the number of audio samples (n) that I expect to have been captured during this interval and ignore the first n * 2 bytes in the array of captured audio data (n * 2 because I am capturing 16-bit samples, whereas the data is stored as a byte array, i.e. 2 bytes per sample).
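To make that concrete (assuming, for example, a 44.1 kHz sample rate): a measured gap of 150 ms would give n = 0.150 × 44100 = 6615 samples, so the first 13230 bytes of the captured array would be skipped.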
However, this method is not giving me accurate results. The calculated offset is always less than I expect it to be, so a non-trivial (and unfortunately variable) amount of "empty" space remains in the audio data after analysis begins at the designated position. This causes the program to analyze audio data collected before the user had begun to play along with the backing MIDI instruments, effectively adding rests (the absence of musical notes) at the beginning of the user's musical passage and ruining the rhythm values calculated for all subsequent notes.
Below is the code for my audio capture thread, which also determines the latency and corresponding position offset for the array of captured audio data. Can anyone offer insight into why my method for determining latency is not working correctly?
public class CaptureThread extends Thread
{
public void run()
{
//number of bytes to capture before putting data in the queue.
//determined via the sample rate, tempo, and # of "beats" in 1 "measure"
int bytesToCapture = (int) ((SAMPLE_RATE * 2.) / (score.getTempo()
/ score.getMetre()[0] / 60.));
//temporary buffer - will be added to ByteArrayOutputStream upon filling.
byte tempBuffer[] = new byte[target.getBufferSize() / 5];
int limit = (int) (bytesToCapture / tempBuffer.length);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream(bytesToCapture);
int bytesRead;
try
{ //Loop until stopCapture is set.
while (!stopCapture)
{ //first, wait for notification from TimerTask
synchronized (thisCapture)
{
thisCapture.wait();
}
if (!processingStarted)
{ //the time at which audio capture begins
startTime = System.currentTimeMillis();
}
//start the TargetDataLine, from which audio data is read
target.start();
//collect 1 captureInterval's worth of data
for (int n = 0; n < limit; n++)
{
bytesRead = target.read(tempBuffer, 0, tempBuffer.length);
if (bytesRead > 0)
{ //Append data to output stream.
outputStream.write(tempBuffer, 0, bytesRead);
}
}
if (!processingStarted)
{
long difference = (midiSynth.getPlaybackStartTime()
+ score.getCountInTime() * 1000 - startTime);
positionOffset = (int) ((difference / 1000.)
* SAMPLE_RATE * 2.);
if (positionOffset % 2 != 0)
{ //1 sample = 2 bytes, so positionOffset must be even
positionOffset += 1;
}
}
if (outputStream.size() > 0)
{ //package data collected in the output stream into a byte array
byte[] capturedAudioData = outputStream.toByteArray();
//add captured data to the queue for processing
processingQueue.add(capturedAudioData);
synchronized (processingQueue)
{
try
{ //notify the analysis thread that data is in the queue
processingQueue.notify();
} catch (Exception e)
{
//handle the error
}
}
outputStream.reset(); //reset the output stream
}
}
} catch (Exception e)
{
//handle error
}
}
}
I am looking into using a Mixer object to synchronize the TargetDataLine which is accepting data from the microphone and the Line that handles playback from the MIDI instruments. Now to find the Line that handles playback... Any ideas?
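In the meantime, here is a minimal sketch of how the available mixers and their lines can be enumerated with the Java Sound API (nothing here is specific to my project); one complication is that the line used by the Java MIDI synthesizer may not be exposed by any Mixer at all:

import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Line;
import javax.sound.sampled.Mixer;

public class LineExplorer {
    public static void main(String[] args) {
        // List every mixer the system knows about, along with the
        // source (playback) and target (capture) lines it supports.
        for (Mixer.Info mixerInfo : AudioSystem.getMixerInfo()) {
            Mixer mixer = AudioSystem.getMixer(mixerInfo);
            System.out.println("Mixer: " + mixerInfo.getName());
            for (Line.Info lineInfo : mixer.getSourceLineInfo()) {
                System.out.println("  source line: " + lineInfo);
            }
            for (Line.Info lineInfo : mixer.getTargetLineInfo()) {
                System.out.println("  target line: " + lineInfo);
            }
        }
    }
}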
Google has a good open source app called AudioBufferSize that you are probably familiar with. I modified this app to test one-way latency, that is, the time between when a user presses a button and when the sound is played by the audio API. Here is the code I added to AudioBufferSize to achieve this. Could you use such an approach to provide the timing delta between the event and when the user perceives it?
final Button latencyButton = (Button) findViewById(R.id.latencyButton);
latencyButton.setOnClickListener(new OnClickListener() {
public void onClick(View v) {
mLatencyStartTime = getCurrentTime();
latencyButton.setEnabled(false);
// Do the latency calculation, play a 440 hz sound for 250 msec
AudioTrack sound = generateTone(440, 250);
sound.setNotificationMarkerPosition(count /2); // Listen for the end of the sample
sound.setPlaybackPositionUpdateListener(new OnPlaybackPositionUpdateListener() {
public void onPeriodicNotification(AudioTrack sound) { }
public void onMarkerReached(AudioTrack sound) {
// The sound has finished playing, so record the time
mLatencyStopTime = getCurrentTime();
diff = mLatencyStopTime - mLatencyStartTime;
// Update the latency result
TextView lat = (TextView)findViewById(R.id.latency);
lat.setText(diff + " ms");
latencyButton.setEnabled(true);
logUI("Latency test result= " + diff + " ms");
}
});
sound.play();
}
});
There is a reference to generateTone, which looks like this:
private AudioTrack generateTone(double freqHz, int durationMs) {
int count = (int)(44100.0 * 2.0 * (durationMs / 1000.0)) & ~1;
short[] samples = new short[count];
for(int i = 0; i < count; i += 2){
short sample = (short)(Math.sin(2 * Math.PI * i / (44100.0 / freqHz)) * 0x7FFF);
samples[i + 0] = sample;
samples[i + 1] = sample;
}
AudioTrack track = new AudioTrack(AudioManager.STREAM_MUSIC, 44100,
AudioFormat.CHANNEL_OUT_STEREO, AudioFormat.ENCODING_PCM_16BIT,
count * (Short.SIZE / 8), AudioTrack.MODE_STATIC);
track.write(samples, 0, count);
return track;
}
Just realized this question is several years old. Sorry; maybe someone will find it useful anyway.
Related
I am managing audio capture and playback using the Java Sound API (TargetDataLine and SourceDataLine). Now suppose that in a conference environment, one participant's audio queue grows larger than the jitter buffer size (due to processing or network delays), and I want to fast-forward that participant's buffered audio bytes to bring the queue back under the jitter buffer size.
How can I fast forward the audio byte array of that participant?
I can't do it during playback, as the player thread normally just dequeues one frame from every participant's queue and mixes them for playback. The only way I can see is to dequeue more than one frame for that participant and mix(?) them for fast-forwarding before mixing the result with the other participants' single dequeued frames for playback.
Thanks in advance for any kind of help or advice.
There are two ways to speed up the playback that I know of. In one case, the faster pace creates a rise in pitch. The coding for this is relatively easy. In the other case, pitch is kept constant, but it involves a technique of working with sound granules (granular synthesis), and is harder to explain.
For the situation where maintaining the same pitch is not a concern, the basic plan is as follows: instead of advancing by single frames, advance by a frame + a small increment. For example, let's say that advancing 1.1 frames over a course of 44000 frames is sufficient to catch you up. (That would also mean that the pitch increase would be about 1/10 of an octave.)
To advance a "fractional" frame, you first have to convert the bytes of the two bracketing frames to PCM. Then, use linear interpolation to get the intermediate value. Then convert that intermediate value back to bytes for the output line.
For example, if you are advancing from frame[0] to frame["1.1"] you will need to know the PCM for frame[1] and frame[2]. The intermediate value can be calculated using a weighted average:
value = PCM[1] * 9/10 + PCM[2] * 1/10
I think it might be good to make the amount by which you advance change gradually. Take a few dozen frames to ramp up the increment and allow time to ramp down again when returning to normal dequeuing. If you suddenly change the rate at which you are reading the audio data, it is possible to introduce a discontinuity that will be heard as a click.
I have used this basic plan for dynamic control of playback speed, but I haven't had the experience of employing it for the situation that you are describing. Regulating the variable speed could be tricky if you also are trying to enforce keeping the transitions smooth.
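If it helps, here is a rough sketch of that fractional-advance idea for 16-bit mono PCM (the method name and the fixed step are placeholders; a real implementation would also ramp the step up and down as described above):

// Sketch only: advance through 16-bit mono PCM by a fractional step per
// output sample, using linear interpolation between the bracketing frames.
// A step > 1.0 speeds playback up (and raises the pitch proportionally).
static short[] speedUp(short[] input, double step) {
    int outLength = (int) ((input.length - 1) / step);
    short[] output = new short[outLength];
    double position = 0.0;
    for (int i = 0; i < outLength; i++) {
        int index = (int) position;          // frame to the "left"
        double frac = position - index;      // how far toward the next frame
        double interpolated = input[index] * (1.0 - frac)
                            + input[index + 1] * frac;
        output[i] = (short) Math.round(interpolated);
        position += step;                    // e.g. 1.1 to catch up gradually
    }
    return output;
}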
The basic idea for using granules involves obtaining contiguous PCM (I'm not clear what the optimum number of frames would be for voice; 1 to 50 ms is commonly cited for this technique in synthesis) and giving it a volume envelope that allows you to mix sequential granules end to end (they must overlap).
I think the envelopes for the granules use a Hann or Hamming window, but I'm not clear on the details, such as the overlapping placement of the granules so that they mix/transition smoothly. I've only dabbled, and I'm going to assume the folks at Signal Processing will be the best bet for advice on how to code this.
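For the envelope itself, a minimal sketch of a Hann window applied to a single granule might look like this (granule length and overlap placement are left up to you):

// Sketch only: apply a Hann envelope to one granule of PCM so that
// sequential, overlapping granules can be cross-faded (overlap-added)
// without audible discontinuities.
static double[] applyHannWindow(double[] granule) {
    int n = granule.length;
    double[] windowed = new double[n];
    for (int i = 0; i < n; i++) {
        double hann = 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
        windowed[i] = granule[i] * hann;
    }
    return windowed;
}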
I found a fantastic Git repo (the sonic library, mainly intended for audio players) that does exactly what I wanted, with plenty of controls. I can feed it a whole .wav file or even chunks of audio byte arrays and, after processing, get a sped-up playback experience and more. For real-time processing I call it on every chunk of the audio byte array.
I also found another algorithm to detect whether an audio chunk/byte array contains voice or not; depending on its result, I can simply skip playing non-voice packets, which gives around a 1.5x speedup with less processing.
public class DTHVAD {
public static final int INITIAL_EMIN = 100;
public static final double INITIAL_DELTAJ = 1.0001;
private static boolean isFirstFrame;
private static double Emax;
private static double Emin;
private static int inactiveFrameCounter;
private static double Lamda; //
private static double DeltaJ;
static {
initDTH();
}
private static void initDTH() {
Emax = 0;
Emin = 0;
isFirstFrame = true;
Lamda = 0.950; // range is 0.950---0.999
DeltaJ = 1.0001;
}
public static boolean isAllSilence(short[] samples, int length) {
boolean r = true;
for (int l = 0; l < length; l += 80) {
// examine 80-sample sub-frames; clamp the last one so we never read past 'length'
if (!isSilence(samples, l, Math.min(l + 80, length))) {
r = false;
break;
}
}
return r;
}
public static boolean isSilence(short[] samples, int offset, int length) {
boolean isSilenceR = false;
long energy = energyRMSE(samples, offset, length);
// printf("en=%ld\n",energy);
if (isFirstFrame) {
Emax = energy;
Emin = INITIAL_EMIN;
isFirstFrame = false;
}
if (energy > Emax) {
Emax = energy;
}
if (energy < Emin) {
if ((int) energy == 0) {
Emin = INITIAL_EMIN;
} else {
Emin = energy;
}
DeltaJ = INITIAL_DELTAJ; // Resetting DeltaJ with initial value
} else {
DeltaJ = DeltaJ * 1.0001;
}
long threshold = (long) ((1 - Lamda) * Emax + Lamda * Emin);
// printf("e=%ld,Emin=%f, Emax=%f, thres=%ld\n",energy,Emin,Emax,threshold);
Lamda = (Emax - Emin) / Emax;
if (energy > threshold) {
isSilenceR = false; // voice marking
} else {
isSilenceR = true; // noise marking
}
Emin = Emin * DeltaJ;
return isSilenceR;
}
//NOTE: the third argument is treated as an exclusive end index, matching the callers above
private static long energyRMSE(short[] samples, int offset, int length) {
double cEnergy = 0;
float reverseOfN = (float) 1 / (length - offset); // e.g. 0.0125 for an 80-sample frame
long step = 0;
for (int i = offset; i < length; i++) {
step = samples[i] * samples[i];
cEnergy += (float) step * reverseOfN; // accumulate x*x/N
}
cEnergy = Math.sqrt(cEnergy);
return (long) cEnergy;
}
}
Here I convert my byte array to a short array and detect whether it is voice or non-voice with
frame.silence = DTHVAD.isSilence(encodeShortBuffer, 0, shortLen);
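For completeness, the byte-to-short conversion can be done along these lines (assuming 16-bit little-endian PCM; toShorts is just a name I made up here):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch only: convert a 16-bit little-endian PCM byte[] (as read from a
// TargetDataLine or AudioRecord) into the short[] expected by isSilence().
static short[] toShorts(byte[] pcmBytes) {
    short[] samples = new short[pcmBytes.length / 2];
    ByteBuffer.wrap(pcmBytes)
              .order(ByteOrder.LITTLE_ENDIAN)
              .asShortBuffer()
              .get(samples);
    return samples;
}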
I am writing some code that intends to take a WAV file and write it out to an AudioTrack in stream mode. This is a minimal viable test to get AudioTrack stream mode working.
But once I write some buffer of audio to the AudioTrack, and subsequently call play(), the method getPlaybackHeadPosition() continually returns 0.
EDIT: If I ignore my available-frames check and just continually write buffers to the AudioTrack, the write method returns 0 (after the first buffer write), indicating that it simply did not write any more audio. So it seems that the AudioTrack just doesn't want to start playing.
My code is properly priming the AudioTrack. The play method is not throwing any exceptions, so I am not sure what is going wrong.
When stepping through the code, everything on my end is exactly how I anticipate it, so I am thinking somehow I have the AudioTrack configured wrong.
I am running on an emulator, but I don't think that should be an issue.
The WavFile class I am using is a vetted class that I have up and running reliably in lots of Java projects, it is tested to work well.
Observe the following log write, which is a snippet from the larger chunk of code. This log statement is never hit...
if (headPosition > 0)
Log.e("headPosition is greater than zero!!");
..
public static void writeToAudioTrackStream(final WavFile wave)
{
Log.e("writeToAudioTrackStream");
Thread thread = new Thread()
{
public void run()
{
try {
final float[] data = wave.getData();
int format = -1;
if (wave.getChannel() == 1)
format = AudioFormat.CHANNEL_OUT_MONO;
else if (wave.getChannel() == 2)
format = AudioFormat.CHANNEL_OUT_STEREO;
else
throw new RuntimeException("writeToAudioTrackStatic() - unsupported number of channels value = "+wave.getChannel());
final int bufferSizeInFrames = 2048;
final int bytesPerSmp = wave.getBytesPerSmp();
final int bufferSizeInBytes = bufferSizeInFrames * bytesPerSmp * wave.getChannel();
AudioTrack audioTrack = new AudioTrack(AudioManager.STREAM_MUSIC, wave.getSmpRate(),
format,
AudioFormat.ENCODING_PCM_FLOAT,
bufferSizeInBytes,
AudioTrack.MODE_STREAM);
int index = 0;
float[] buffer = new float[bufferSizeInFrames * wave.getChannel()];
boolean started = false;
int framesWritten = 0;
while (index < data.length) {
// calculate the available space in the buffer
int headPosition = audioTrack.getPlaybackHeadPosition();
if (headPosition > 0)
Log.e("headPosition is greater than zero!!");
int framesInBuffer = framesWritten - headPosition;
int availableFrames = bufferSizeInFrames - framesInBuffer;
// once the buffer has no space, the prime is done, so start playing
if (availableFrames == 0) {
if (!started) {
audioTrack.play();
started = true;
}
continue;
}
int endOffset = availableFrames * wave.getChannel();
for (int i = 0; i < endOffset; i++)
buffer[i] = data[index + i];
int samplesWritten = audioTrack.write(buffer , 0 , endOffset , AudioTrack.WRITE_BLOCKING);
// could return error values
if (samplesWritten < 0)
throw new RuntimeException("AudioTrack write error.");
framesWritten += samplesWritten / wave.getChannel();
index = endOffset;
}
}
catch (Exception e) {
Log.e(e.toString());
}
}
};
thread.start();
}
Per the documentation,
For portability, an application should prime the data path to the maximum allowed by writing data until the write() method returns a short transfer count. This allows play() to start immediately, and reduces the chance of underrun.
With a strict reading, this might be seen to contradict the earlier statement:
...you can optionally prime the data path prior to calling play(), by writing up to bufferSizeInBytes...
(emphasis mine), but the intent is clear enough: You're supposed to get a short write first.
This is just to get play started. Once that takes place, you can, in fact, use
getPlaybackHeadPosition() to determine when more space is available. I've used that technique successfully in my own code, on many different devices/API levels.
As an aside: You should be prepared for getPlaybackHeadPosition() to change only in large increments (if I remember correctly, it's getMinBufferSize()/2). This is the max resolution available from the system; onMarkerReached() cannot be used to do any better.
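To illustrate the pattern, here is a minimal sketch (it assumes the track was created with ENCODING_PCM_FLOAT as in your code, and it uses non-blocking writes so the short transfer count is observable; primeAndPlay and chunkSize are placeholder names):

import android.media.AudioTrack;

// Sketch only: prime an AudioTrack in MODE_STREAM by writing until write()
// returns a short transfer count, then call play() and keep feeding data.
static void primeAndPlay(AudioTrack audioTrack, float[] data, int chunkSize) {
    int index = 0;
    boolean started = false;
    while (index < data.length) {
        int toWrite = Math.min(chunkSize, data.length - index);
        int written = audioTrack.write(data, index, toWrite,
                                       AudioTrack.WRITE_NON_BLOCKING);
        if (written < 0) {
            throw new RuntimeException("AudioTrack write error: " + written);
        }
        // A short (or zero) transfer means the track's internal buffer is
        // full, so the prime is complete and playback can start.
        if (!started && written < toWrite) {
            audioTrack.play();
            started = true;
        }
        index += written;
        // Once started, a real implementation would block (WRITE_BLOCKING)
        // or poll getPlaybackHeadPosition() instead of spinning here.
    }
    if (!started) {
        audioTrack.play(); // the data never filled the buffer
    }
}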
I've been working on a project where I need to manipulate each instrument in a MIDI file in Java.
So I decided to get each MIDI event from each track of the Sequence and send it to a Receiver. After that, the thread waits for the duration of one tick and then does the same with the next tick.
The problem is that the sound of the instruments gets very messed up, as does their order.
I tried executing each track alone too, but it's still messed up!
The code:
Sequence sequence = MidiSystem.getSequence(new File(source));
Synthesizer synth = MidiSystem.getSynthesizer();
//Gets a MidiMessage and send it to Synthesizer
Receiver rcv = synth.getReceiver();
//Contains all tracks and events from MIDI file
Track[] tracks = sequence.getTracks();
synth.open();
//If there are tracks
if(tracks != null)
{
//Verify the division type of the sequence (PPQ, SMPT)
if(sequence.getDivisionType() == Sequence.PPQ)
{
int ppq = sequence.getResolution();
//Do the math to get the time (in milliseconds) each tick takes
long tickTime = TicksToMiliseconds(BPM,ppq);
//Returns the number of ticks from the longest track
int longestTrackTicks = LongestTrackTicks(tracks);
//Each iteration sends a new message to 'receiver'
for(int tick = 0; tick < longestTrackTicks; tick++)
{
//Iteration of each track
for(int trackNumber = 0; trackNumber < tracks.length; trackNumber++)
{
//If the number of ticks from a track isn't already finished
//continue
if(tick < tracks[trackNumber].size())
{
MidiEvent ev = tracks[trackNumber].get(tick);
rcv.send(ev.getMessage(),-1);
}
}
Thread.sleep(tickTime);
}
}
}
synth.close();
As ntabee said, Track.get(n) returns the nth event in the track; to get events by time, you have to compare the events' times manually.
Furthermore, Thread.sleep() is not very precise and can wait for a longer time than desired.
These errors will add up.
To change MIDI messages in real time, tell the sequencer to play to your own Receiver, then do whatever you want to the events and pass them on to the 'real' Receiver.
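A bare-bones sketch of that wiring might look like the following (the file name is a placeholder, and the actual event filtering is left as a comment):

import javax.sound.midi.*;

// Sketch only: route a Sequencer through a custom Receiver so that each
// MidiMessage can be inspected or modified before it reaches the Synthesizer.
public class FilteringPlayback {
    public static void main(String[] args) throws Exception {
        Sequencer sequencer = MidiSystem.getSequencer(false); // not auto-connected
        Synthesizer synth = MidiSystem.getSynthesizer();
        sequencer.open();
        synth.open();

        final Receiver synthReceiver = synth.getReceiver();
        sequencer.getTransmitter().setReceiver(new Receiver() {
            public void send(MidiMessage message, long timeStamp) {
                // Inspect or modify 'message' here, then forward it.
                synthReceiver.send(message, timeStamp);
            }
            public void close() {
                synthReceiver.close();
            }
        });

        sequencer.setSequence(MidiSystem.getSequence(new java.io.File("song.mid")));
        sequencer.start(); // timing is handled by the sequencer, not Thread.sleep()
    }
}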
As it stands, your code looks like it just turns something on/off every tickTime milliseconds.
track.get(tick) just returns the tick-th event in the track, not the event(s) at the moment of tick.
If your goal is just playing a sound, Java provides a high level API for it, see e.g. http://www.jsresources.org/examples/midi_playback_and_recording.html
I decided to use the sequencer. I didn't know that the start() method runs in a new thread; while it's still running I can mute each instrument I want to, which is exactly what I wanted.
Thanks for the answers!
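For anyone finding this later, a minimal sketch of sequencer-based muting (I'm assuming Sequencer.setTrackMute() here; the file name is a placeholder):

import javax.sound.midi.MidiSystem;
import javax.sound.midi.Sequence;
import javax.sound.midi.Sequencer;

// Sketch only: mute individual tracks while a Sequencer plays in the background.
public class MutePlayback {
    public static void main(String[] args) throws Exception {
        Sequence sequence = MidiSystem.getSequence(new java.io.File("song.mid"));
        Sequencer sequencer = MidiSystem.getSequencer();
        sequencer.open();
        sequencer.setSequence(sequence);
        sequencer.start();                 // playback runs on its own thread

        sequencer.setTrackMute(0, true);   // silence track 0 while it plays
        if (!sequencer.getTrackMute(0)) {
            System.out.println("Muting not supported for this track.");
        }
    }
}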
I'm working on a voice recording app. In it, I have a Seekbar to change the input voice gain.
I couldn't find any way to adjust the input voice gain.
I am using the AudioRecord class to record voice.
recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
RECORDER_SAMPLERATE, RECORDER_CHANNELS,
RECORDER_AUDIO_ENCODING, bufferSize);
recorder.startRecording();
I've seen an app in the Google Play Store using this functionality.
As I understand it, you don't want any automatic adjustment, only manual control from the UI. There is no built-in functionality for this in Android; instead, you have to modify the recorded data manually.
Suppose you use read(short[] audioData, int offsetInShorts, int sizeInShorts) to read the stream. Then you should do something like this:
float gain = getGain(); // taken from the UI control, perhaps in range from 0.0 to 2.0
int numRead = read(audioData, 0, SIZE);
if (numRead > 0) {
for (int i = 0; i < numRead; ++i) {
audioData[i] = (short)Math.min((int)(audioData[i] * gain), (int)Short.MAX_VALUE);
}
}
Math.min is used to prevent overflow if gain is greater than 1.
Dynamic microphone sensitivity is not something the hardware or operating system can provide, since it requires analysis of the recorded sound. You should implement your own algorithm to analyze the recorded sound and adjust (amplify or attenuate) the level yourself.
You can start by analyzing the last few seconds and finding a multiplier that will "balance" the average amplitude. The multiplier must be inversely proportional to the average amplitude to balance it (a sketch of this idea follows below).
PS: If you still want to do it, the mic levels are accessible when you have root access, but I am still not sure (and don't think it's possible) that you can change the settings while recording. Hint: the "/system/etc/snd_soc_msm" file.
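As a rough sketch of the balancing multiplier described above (the target RMS level and the method name are arbitrary choices):

// Sketch only: derive a gain multiplier that is inversely proportional to the
// recent average (RMS) amplitude, so quiet input is boosted and loud input is
// attenuated toward a chosen target level.
static float computeBalancingGain(short[] recentSamples, float targetRms) {
    double sumOfSquares = 0;
    for (short s : recentSamples) {
        sumOfSquares += (double) s * s;
    }
    double rms = Math.sqrt(sumOfSquares / recentSamples.length);
    if (rms < 1.0) {
        return 1.0f; // effectively silence; leave the gain alone
    }
    return (float) (targetRms / rms);
}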
Solution by OP.
I have done it using
final int USHORT_MASK = (1 << 16) - 1;
final ByteBuffer buf = ByteBuffer.wrap(data).order(
ByteOrder.LITTLE_ENDIAN);
final ByteBuffer newBuf = ByteBuffer.allocate(
data.length).order(ByteOrder.LITTLE_ENDIAN);
int sample;
while (buf.hasRemaining()) {
sample = (int) buf.getShort() & USHORT_MASK;
sample *= db_value_global;
newBuf.putShort((short) (sample & USHORT_MASK));
}
data = newBuf.array();
os.write(data);
This is a working implementation based on ByteBuffer for 16-bit audio. It's important to clamp the amplified value on both sides, since short is signed. It's also important to set the native byte order on the ByteBuffer, since audioRecord.read() returns bytes in native endianness.
You may also want to perform audioRecord.read() and the following code in a loop, calling data.clear() after each iteration.
double gain = 2.0;
ByteBuffer data = ByteBuffer.allocateDirect(SAMPLES_PER_FRAME).order(ByteOrder.nativeOrder());
int audioInputLengthBytes = audioRecord.read(data, SAMPLES_PER_FRAME);
ShortBuffer shortBuffer = data.asShortBuffer();
for (int i = 0; i < audioInputLengthBytes / 2; i++) { // /2 because we need the length in shorts
short s = shortBuffer.get(i);
int increased = (int) (s * gain);
s = (short) Math.min(Math.max(increased, Short.MIN_VALUE), Short.MAX_VALUE);
shortBuffer.put(i, s);
}
I am having trouble understanding how I should pass PCM data from the mic to the FFT class I am using, written by Piotr Wendykier (the DoubleFFT_1D class in JTransforms).
I think I have to get back a real and an imaginary number, and then double the real number, to eventually obtain frequency = 8000 * i / 1024, where i is the index of the highest magnitude.
Can someone help me find the frequency of the note being played?
I have a recording class as follows:
import edu.emory.mathcs.jtransforms.fft.DoubleFFT_1D;
...other various imports...
class recorderThread {
...public variables...
public static void getFFtresult(){
AudioRecord recorder;
short[] audioData;
int bufferSize;
int samplerate = 8000;//or 8192?
bufferSize= AudioRecord.getMinBufferSize(samplerate,AudioFormat.CHANNEL_CONFIGURATION_MONO,
AudioFormat.ENCODING_PCM_16BIT)*2; //get the buffer size to use with this audio record
recorder = new AudioRecord (AudioSource.MIC,samplerate,AudioFormat.CHANNEL_CONFIGURATION_MONO,
AudioFormat.ENCODING_PCM_16BIT,bufferSize); //instantiate the AudioRecorder
recording=true; //variable to use start or stop recording
audioData = new short [bufferSize]; //short array that pcm data is put into.
int recordingLoops = 0;
while (recordingLoops < 4) { //loop while recording is needed
if (recorder.getState()==android.media.AudioRecord.STATE_INITIALIZED) // check to see if the recorder has initialized yet.
if (recorder.getRecordingState()==android.media.AudioRecord.RECORDSTATE_STOPPED)
recorder.startRecording(); //check to see if the Recorder has stopped or is not recording, and make it record.
else {
recorder.read(audioData,0,bufferSize); //read the PCM audio data into the audioData array
DoubleFFT_1D fft = new DoubleFFT_1D(1023); //instance of DoubleFFT_1D class
double[] audioDataDoubles = new double[1024];
for (int j=0; j <= 1023; j++) { // get audio data in double[] format
audioDataDoubles[j] = (double)audioData[j];
}
fft.complexForward(audioDataDoubles); //this is where it fails
for (int i = 0; i < 1023; i++) {
Log.v(TAG, "audiodata=" + audioDataDoubles[i] + " no= " + i);
}
recordingLoops++;
}//else recorder started
} //while recording
if (recorder.getState()==android.media.AudioRecord.RECORDSTATE_RECORDING) recorder.stop(); //stop the recorder before ending the thread
recorder.release(); //release the recorders resources
recorder=null; //set the recorder to be garbage collected
}//run
}//recorderThread
Thanks so much!
Ben
If you are looking for the pitch of a musical note, you will find that pitch is often different from the spectral frequency peak produced by an FFT, especially for lower notes.
To find the frequency peak from a complex FFT, you need to calculate the vector magnitude of both the real and imaginary results.
mag(i) = sqrt(real[i]*real[i] + imag[i]*imag[i]);
Since you're using real audio data as the input, you should use the realForward function:
fft.realForward(audioDataDoubles);
Then you can compute the energy at each frequency bin from the squared magnitude of the real and imaginary parts:
magn[i] = audioDataDoubles[2*i]*audioDataDoubles[2*i] + audioDataDoubles[2*i+1]*audioDataDoubles[2*i+1]
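Putting the two answers together, a rough sketch of the whole peak-frequency estimate might look like this (it ignores the specially packed bins 0 and N/2 for simplicity, and keep in mind the earlier caveat that the spectral peak is not always the perceived pitch):

import edu.emory.mathcs.jtransforms.fft.DoubleFFT_1D;

// Sketch only: estimate the dominant frequency of a block of 16-bit samples
// using JTransforms' real FFT. Assumes the block length N is even; bins
// 1..N/2-1 are stored by realForward() as interleaved re/im pairs.
static double dominantFrequency(short[] samples, int sampleRate) {
    int n = samples.length;
    double[] fftData = new double[n];
    for (int i = 0; i < n; i++) {
        fftData[i] = samples[i];
    }
    new DoubleFFT_1D(n).realForward(fftData);

    int peakBin = 1;
    double peakMagnitude = 0;
    for (int k = 1; k < n / 2; k++) {
        double re = fftData[2 * k];
        double im = fftData[2 * k + 1];
        double magnitude = re * re + im * im; // squared magnitude is enough for peak picking
        if (magnitude > peakMagnitude) {
            peakMagnitude = magnitude;
            peakBin = k;
        }
    }
    return (double) peakBin * sampleRate / n; // e.g. 8000 * i / 1024 for a 1024-sample block at 8 kHz
}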